First off, full code can be found at: https://github.com/qhdwight/qgame-rs/blob/main/assets/shaders/voxels.wgsl
In previous attempts to polygonize a scalar field, a.k.a. copy-pasting Paul Bourke's implementation of marching cubes, I did everything on the CPU. It turns out that iterating over multiple chunks of size 32x32x32 in serial is not very fast.
Good news! We have something called GPUs. They have far more cores, but each one is less powerful and can really only do math. Even better, Bevy supports compute shaders, including ones written in WGSL and GLSL. I decided to go with WGSL since I already knew GLSL and wanted to learn about other options. The downside was that there was almost no information about WGSL besides the official spec. The spec turned out to be pretty good, though, and I worked out most problems by looking there.
So what? Well, marching cubes can be parallelized pretty easily; it performs a computation for each cell of a big 3D grid, and each cell only really cares about itself. As with most things, though, there is a catch: when you compute a face, you have to atomically append it to the list of output vertices. The same goes for indices, normals, UVs, etc.
Luckily, this is exactly the kind of thing the atomic<T> feature of WGSL is for. We just need to track where the next vertex should be placed:
struct Atomics {
    // Running offsets into the output buffers, shared by every invocation
    vertices_head: atomic<u32>;
    indices_head: atomic<u32>;
};
...
[[group(0), binding(3)]]
var<storage, read_write> global_atomics: Atomics;
We can then use it as follows:
var tri_idx: u32 = 0u;
loop {
    // Reserve three slots in each output buffer for this triangle
    var start_vert_idx = atomicAdd(&global_atomics.vertices_head, 3u);
    var start_indices_idx = atomicAdd(&global_atomics.indices_head, 3u);
    let v0 = vertices[ uniform_tri_table.data[cube_idx][tri_idx + 0u] ];
    let v1 = vertices[ uniform_tri_table.data[cube_idx][tri_idx + 1u] ];
    let v2 = vertices[ uniform_tri_table.data[cube_idx][tri_idx + 2u] ];
    out_vertices.data[start_vert_idx + 0u] = v0;
    out_vertices.data[start_vert_idx + 1u] = v1;
    out_vertices.data[start_vert_idx + 2u] = v2;
    out_indices.data[start_indices_idx + 0u] = start_vert_idx + 0u;
    out_indices.data[start_indices_idx + 1u] = start_vert_idx + 1u;
    out_indices.data[start_indices_idx + 2u] = start_vert_idx + 2u;
    // Flat shading: one face normal shared by all three vertices
    let normal = cross(v0 - v1, v0 - v2);
    out_normals.data[start_vert_idx + 0u] = normal;
    out_normals.data[start_vert_idx + 1u] = normal;
    out_normals.data[start_vert_idx + 2u] = normal;
    out_uvs.data[start_vert_idx + 0u] = vec2<f32>(0.0, 0.0);
    out_uvs.data[start_vert_idx + 1u] = vec2<f32>(1.0, 0.0);
    out_uvs.data[start_vert_idx + 2u] = vec2<f32>(0.0, 1.0);
    tri_idx = tri_idx + 3u;
    // The triangle table for each cube configuration is terminated with -1
    if (uniform_tri_table.data[cube_idx][tri_idx] == -1) {
        break;
    }
}
In general, atomics are slow on a GPU, but notice that each reservation is a single increment, by three at a time since that is how many vertices each face has, rather than one atomic operation per individual vertex or index.
Notice also that there are several lines that look like they could be done in a loop. I decided to leave them unrolled because GPUs generally do very poorly with branching. Yes, the SPIR-V compiler can probably optimize it anyway, but I wanted to be safe and signal my intention. Some more egregious uses of this same idea can be found here:
var vertices = array<vec3<f32>, 12>(
    f32((uniform_edge_table.data[cube_idx] & (1u << 0u)) != 0u) * interp_vertex(positions[0u], positions[1u], densities[0u], densities[1u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 1u)) != 0u) * interp_vertex(positions[1u], positions[2u], densities[1u], densities[2u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 2u)) != 0u) * interp_vertex(positions[2u], positions[3u], densities[2u], densities[3u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 3u)) != 0u) * interp_vertex(positions[3u], positions[0u], densities[3u], densities[0u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 4u)) != 0u) * interp_vertex(positions[4u], positions[5u], densities[4u], densities[5u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 5u)) != 0u) * interp_vertex(positions[5u], positions[6u], densities[5u], densities[6u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 6u)) != 0u) * interp_vertex(positions[6u], positions[7u], densities[6u], densities[7u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 7u)) != 0u) * interp_vertex(positions[7u], positions[4u], densities[7u], densities[4u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 8u)) != 0u) * interp_vertex(positions[0u], positions[4u], densities[0u], densities[4u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 9u)) != 0u) * interp_vertex(positions[1u], positions[5u], densities[1u], densities[5u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 10u)) != 0u) * interp_vertex(positions[2u], positions[6u], densities[2u], densities[6u]),
    f32((uniform_edge_table.data[cube_idx] & (1u << 11u)) != 0u) * interp_vertex(positions[3u], positions[7u], densities[3u], densities[7u]),
);
This also uses the trick of multiplying by a value that evaluates to 1.0 or 0.0 instead of branching. There might be a cleaner way to do this with WGSL's select built-in, but it is the same idea.
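As a rough sketch, here is what one entry could look like with select instead, assuming the select(f, t, cond) overload from the spec, which returns t when the condition holds (I have not checked whether it actually generates better code):
let edge_active = (uniform_edge_table.data[cube_idx] & (1u << 0u)) != 0u;
// select(f, t, cond): picks the interpolated vertex when the edge is active, zero otherwise
let masked_vertex = select(
    vec3<f32>(0.0, 0.0, 0.0),
    interp_vertex(positions[0u], positions[1u], densities[0u], densities[1u]),
    edge_active
);
As for where those densities come from, each invocation reads from a storage buffer of voxels: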
struct Voxel {
    flags: u32;
    density: f32;
};
...
[[group(0), binding(2)]]
var<storage, read> in_voxels: VoxelBuffer;
...
let chunk_sz = 32;

fn get_flat_index(pos: vec3<i32>) -> u32 {
    return u32(pos.x + pos.y * chunk_sz + pos.z * chunk_sz * chunk_sz);
}

// Positions outside the chunk are treated as empty space
fn get_voxel_density(pos: vec3<i32>) -> f32 {
    var density: f32 = 0.0;
    if (pos.x >= 0 && pos.x < chunk_sz
        && pos.y >= 0 && pos.y < chunk_sz
        && pos.z >= 0 && pos.z < chunk_sz) {
        density = in_voxels.data[get_flat_index(pos)].density;
    }
    return density;
}
[[stage(compute), workgroup_size(8, 8, 8)]]
fn main([[builtin(global_invocation_id)]] invocation_id: vec3<u32>) {
    let pos = vec3<i32>(invocation_id);
    let voxel = in_voxels.data[get_flat_index(pos)];
    ...
}
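To tie this back to the edge interpolation from earlier: the densities array is filled by sampling the eight corners of the current cell with get_voxel_density. Here is a minimal sketch of that, assuming Bourke's corner ordering (check the full source for the exact layout the repo actually uses):
// Sketch: gather the eight corner densities of the cell at pos
// Corners 0-3 are on the near face, 4-7 on the far face, matching Bourke's tables
var densities: array<f32, 8> = array<f32, 8>(
    get_voxel_density(pos + vec3<i32>(0, 0, 0)),
    get_voxel_density(pos + vec3<i32>(1, 0, 0)),
    get_voxel_density(pos + vec3<i32>(1, 1, 0)),
    get_voxel_density(pos + vec3<i32>(0, 1, 0)),
    get_voxel_density(pos + vec3<i32>(0, 0, 1)),
    get_voxel_density(pos + vec3<i32>(1, 0, 1)),
    get_voxel_density(pos + vec3<i32>(1, 1, 1)),
    get_voxel_density(pos + vec3<i32>(0, 1, 1))
);
cube_idx is then just a bitmask of which of those eight samples sit above the surface threshold, and it is what indexes the edge and triangle tables used above.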
Now here is the Bevy part.