\$\begingroup\$

I've been working on a WebGPU-backed Rust renderer that batches 2D colored sprites. I haven't gotten to the "sprite" part yet, but I'm already running into performance issues. Although the code works, the renderer struggles with 8000 colored quads, while other renderers seem to handle tens of thousands of textured quads with ease. Heck, even the "similar questions" list on this site pointed me to a question describing an easy 50k sprites rendered!

The general process is as follows:

  • Define MAX_QUAD_COUNT as something like 8192
  • Preallocate a buffer large enough to hold MAX_QUAD_COUNT * 4 vertices
  • Preallocate a buffer large enough to hold MAX_QUAD_COUNT colors
  • Fill in a large index buffer ahead of time (quads always use the same index pattern)
  • For each draw_quad invocation, write 4 vertices to the CPU-side vertex data buffer and the quad color to the uniform color data
  • Finally, on flush, copy the color data to a buffer (refresh_colors), create a bind group for that, and copy the vertices to a vertex buffer, then submit and reset n_quads_drawn to 0
  • Repeat next frame

If it makes any difference, I'm running this on an M1 MacBook Air with macOS Monterey.

So, how can I go about speeding this up?

What I've Tried

During the process of trying to speed it up, I've tried:

  • Removing the matrix transforms and simply using static geometry
  • Storing colors in a storage buffer instead of per-vertex (this actually helped, but not by a lot)
  • Computing texture coordinates in the shaders instead of having them as part of the vertex information (didn't make much of a difference, so I kept it)

(now I was getting pretty frustrated)

  • Removing the process of writing the vertices to the vertex buffer (no change)
  • Removing the process of writing the colors to the storage buffer (no change)
  • Removing both (still no change)

So I now apparently know that simply submitting about 8000 quads (roughly 32,000 vertices) to the GPU is quite taxing, and that most of the time isn't spent in vertex transformations.

Here's the Flamegraph for my program, showing that most of the time is indeed spent in flush.

[Image: flamegraph of the program]

Also, it seems like most of the time in flush is spent in wgpu calls, which in turn spend it in macOS-native Metal calls. Are there any unnecessary graphics API calls that I can eliminate?

(Xcode's Instruments also confirms that most of the time is spent in flush.)

(Also, at about 8k sprites, the milliseconds-per-frame readings become very irregular, along the lines of 16, 17, 16, 33, 0, 33, and so on. Why is it so irregular?)
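When looking at irregular frame times like these, it can help to record them with sub-millisecond precision and summarize them, since `as_millis()` truncates to whole milliseconds and printing on every frame adds its own overhead. The following is a minimal sketch using only the standard library; `FrameTimer` and `summarize` are hypothetical names that do not appear in the code below.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of the renderer): collects frame
/// durations so they can be summarized instead of printed every frame.
struct FrameTimer {
    last: Instant,
    samples: Vec<Duration>,
}

impl FrameTimer {
    fn new() -> Self {
        Self { last: Instant::now(), samples: Vec::new() }
    }

    /// Call once per frame; records the time since the previous call.
    fn tick(&mut self) {
        let now = Instant::now();
        self.samples.push(now - self.last);
        self.last = now;
    }
}

/// Returns (min, mean, max) in fractional milliseconds, or None if
/// no samples were recorded.
fn summarize(samples: &[Duration]) -> Option<(f64, f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    let ms: Vec<f64> = samples.iter().map(|d| d.as_secs_f64() * 1000.0).collect();
    let min = ms.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = ms.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let mean = ms.iter().sum::<f64>() / ms.len() as f64;
    Some((min, mean, max))
}
```

Calling `tick()` at the top of `update` and printing `summarize(&timer.samples)` every few hundred frames would give a steadier picture than a per-frame `println!`.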

The Code

Additional info:

  • GraphicsSubsystem just holds some global state, such as the Surface, the Instance, and the Device. Who knows, one day I might add an EGUI rendering layer that also needs access to this state.
  • OrthographicCamera just wraps a glam::f32::Mat4::orthographic_lh call.
  • Textures are still a WIP, and I've just used a temporary pixel art knight texture because why not.
  • Window's pump_events calls sdl2's event poll_iter function. I'm not sure if this has any performance implications, though the flamegraph suggests it doesn't.

Source Code:

main.rs

extern crate infinity;
use infinity as inf;

pub struct Demo {
    last_time: std::time::Instant,
}

impl Default for Demo {
    fn default() -> Self {
        Self {
            last_time: std::time::Instant::now(),
        }
    }
}

impl inf::App for Demo {
    fn setup(&mut self, _ctx: &mut inf::Context) {
        println!("setting up");
    }

    fn update(&mut self, ctx: &mut inf::Context) {
        let new_time = std::time::Instant::now();
        println!("frame time: {} ms", (new_time - self.last_time).as_millis());
        self.last_time = new_time;

        for _ in 0..8000 {
            ctx.r2d.draw_quad(inf::render2d::DrawQuadDescriptor {
                pos: inf::IVec2::new(100, 200),
                size: inf::IVec2::new(100, 200),
                color: inf::IVec3::new(230, 51, 51),
                ..Default::default()
            });
        }
    }

    fn shutdown(&mut self, _ctx: &mut inf::Context) {
        println!("shutting down");
    }
}

fn main() {
    inf::run(Demo::default());
}

infinity/lib.rs

// A bunch of includes I've omitted here

pub trait App {
    fn setup(&mut self, ctx: &mut Context);
    fn update(&mut self, ctx: &mut Context);
    fn shutdown(&mut self, ctx: &mut Context);
}

pub struct Context {
    pub gfx: Rc<GraphicsSubsystem>,
    pub event_bus: EventSubsystem,
    pub input: InputSubsystem,
    pub r2d: Renderer2d,
}

pub fn run<T: App>(mut client: T) {
    LogSubsystem::init();

    let mut event_bus = EventSubsystem::init();
    let input = InputSubsystem::init(&mut event_bus);
    let mut window = WindowSubsystem::init(WindowConfig::default());
    let gfx = Rc::new(GraphicsSubsystem::init(&window));
    let r2d = Renderer2d::init(gfx.clone());

    let mut ctx = Context {
        gfx,
        r2d,
        event_bus,
        input,
    };

    client.setup(&mut ctx);

    loop {
        if !window.pump_events(&mut ctx.event_bus) {
            break;
        }

        client.update(&mut ctx);
        ctx.r2d.update();
    }

    client.shutdown(&mut ctx);
}

renderer2d.rs

use crate::graphics::{util, util::Uniform, GraphicsSubsystem};
use crate::render3d::{Camera, OrthographicCamera};
use crate::window::{LogicalSize, PhysicalSize};
use crate::{IVec2, IVec4, Mat4};
use std::rc::Rc;
use wgpu::{include_wgsl, util::DeviceExt};
// Needed for the `Pod` and `Zeroable` derives on `Vertex` below.
use bytemuck::{Pod, Zeroable};

mod quad;
pub use quad::DrawQuadDescriptor;

#[repr(C)]
#[derive(Copy, Clone, Debug, Pod, Zeroable, Default)]
pub struct Vertex {
    pub pos: [u32; 2],
}

impl Vertex {
    const ATTRIBUTES: &'static [wgpu::VertexAttribute] =
        &wgpu::vertex_attr_array![0 => Uint32x2];

    pub fn desc() -> wgpu::VertexBufferLayout<'static> {
        wgpu::VertexBufferLayout {
            array_stride: std::mem::size_of::<Vertex>() as wgpu::BufferAddress,
            step_mode: wgpu::VertexStepMode::Vertex,
            attributes: &Self::ATTRIBUTES,
        }
    }
}

#[derive(Debug)]
pub struct Uniforms {
    pub camera: Mat4,
    pub colors: [[u32; 4]; Renderer2d::MAX_QUADS],

    camera_buffer: wgpu::Buffer,
    colors_buffer: wgpu::Buffer,
    bind_group: wgpu::BindGroup,
    layout: wgpu::BindGroupLayout,
}

impl Uniforms {
    pub fn new(
        device: &wgpu::Device,
        camera: Mat4,
        colors: [[u32; 4]; Renderer2d::MAX_QUADS],
        diffuse_texture_view: &wgpu::TextureView,
        diffuse_sampler: &wgpu::Sampler,
    ) -> Self {
        let camera_buffer =
            device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
                label: Some("renderer2d.uniforms.camera_buffer"),
                contents: bytemuck::cast_slice(camera.as_ref()),
                usage: wgpu::BufferUsages::UNIFORM
                    | wgpu::BufferUsages::COPY_DST,
            });

        let colors_buffer =
            device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
                label: Some("renderer2d.uniforms.colors_buffer"),
                contents: bytemuck::cast_slice(&colors),
                usage: wgpu::BufferUsages::STORAGE
                    | wgpu::BufferUsages::COPY_DST,
            });

        let layout = device.create_bind_group_layout(
            &wgpu::BindGroupLayoutDescriptor {
                label: Some("renderer2d.uniforms.bind_group_layout"),
                entries: &[
                    wgpu::BindGroupLayoutEntry {
                        binding: 0,
                        visibility: wgpu::ShaderStages::VERTEX,
                        ty: wgpu::BindingType::Buffer {
                            ty: wgpu::BufferBindingType::Uniform,
                            has_dynamic_offset: false,
                            min_binding_size: None,
                        },
                        count: None,
                    },
                    wgpu::BindGroupLayoutEntry {
                        binding: 1,
                        visibility: wgpu::ShaderStages::VERTEX,
                        ty: wgpu::BindingType::Buffer {
                            ty: wgpu::BufferBindingType::Storage {
                                read_only: true,
                            },
                            has_dynamic_offset: false,
                            min_binding_size: None,
                        },
                        count: None,
                    },
                    wgpu::BindGroupLayoutEntry {
                        binding: 2,
                        visibility: wgpu::ShaderStages::FRAGMENT,
                        ty: wgpu::BindingType::Texture {
                            sample_type: wgpu::TextureSampleType::Float {
                                filterable: true,
                            },
                            view_dimension: wgpu::TextureViewDimension::D2,
                            multisampled: false,
                        },
                        count: None,
                    },
                    wgpu::BindGroupLayoutEntry {
                        binding: 3,
                        visibility: wgpu::ShaderStages::FRAGMENT,
                        ty: wgpu::BindingType::Sampler(
                            wgpu::SamplerBindingType::Filtering,
                        ),
                        count: None,
                    },
                ],
            },
        );

        let bind_group =
            device.create_bind_group(&wgpu::BindGroupDescriptor {
                label: Some("renderer2d.uniforms.bind_group"),
                layout: &layout,
                entries: &[
                    wgpu::BindGroupEntry {
                        binding: 0,
                        resource: camera_buffer.as_entire_binding(),
                    },
                    wgpu::BindGroupEntry {
                        binding: 1,
                        resource: colors_buffer.as_entire_binding(),
                    },
                    wgpu::BindGroupEntry {
                        binding: 2,
                        resource: wgpu::BindingResource::TextureView(
                            diffuse_texture_view,
                        ),
                    },
                    wgpu::BindGroupEntry {
                        binding: 3,
                        resource: wgpu::BindingResource::Sampler(
                            diffuse_sampler,
                        ),
                    },
                ],
            });

        Self {
            camera,
            colors,

            layout,
            bind_group,
            camera_buffer,
            colors_buffer,
        }
    }

    pub fn refresh_colors(&mut self, queue: &wgpu::Queue, len: usize) {
        queue.write_buffer(
            &self.colors_buffer,
            0 as wgpu::BufferAddress,
            bytemuck::cast_slice(&self.colors[0..len]),
        );
    }
}

impl Uniform for Uniforms {
    fn as_bind_group(&self) -> &wgpu::BindGroup {
        &self.bind_group
    }

    fn bind_group_layout(&self) -> &wgpu::BindGroupLayout {
        &self.layout
    }
}

pub struct Renderer2d {
    gfx: Rc<GraphicsSubsystem>,

    indices: wgpu::Buffer,
    vertices: wgpu::Buffer,
    uniforms: Uniforms,

    vertices_data: [Vertex; Self::MAX_VERTICES],
    n_quads_drawn: usize,

    pipeline: wgpu::RenderPipeline,
}

impl Renderer2d {
    const MAX_QUADS: usize = 8192;
    const MAX_INDICES: usize = Self::MAX_QUADS * 6;
    const MAX_VERTICES: usize = Self::MAX_QUADS * 4;

    pub fn init(gfx: Rc<GraphicsSubsystem>) -> Self {
        // This index format will work for all quads, so there's no
        // need to recreate it on the fly, and we can just forget about it
        // after populating the index buffer.
        let mut indices_data: [u16; Self::MAX_INDICES] =
            [0u16; Self::MAX_INDICES];
        for i in 0..Self::MAX_QUADS {
            indices_data[i * 6 + 0] = (i * 4 + 0) as u16;
            indices_data[i * 6 + 1] = (i * 4 + 1) as u16;
            indices_data[i * 6 + 2] = (i * 4 + 3) as u16;
            indices_data[i * 6 + 3] = (i * 4 + 1) as u16;
            indices_data[i * 6 + 4] = (i * 4 + 2) as u16;
            indices_data[i * 6 + 5] = (i * 4 + 3) as u16;
        }

        // Populate the index buffer using the indices. This buffer won't
        // need to change during the entire lifetime of the `Renderer2d`.
        let indices =
            gfx.device
                .create_buffer_init(&wgpu::util::BufferInitDescriptor {
                    label: Some("renderer2d.index_buffer"),
                    contents: bytemuck::cast_slice(&indices_data),
                    usage: wgpu::BufferUsages::INDEX,
                });

        let vertices_data = [Vertex::default(); Self::MAX_VERTICES];

        let vertices =
            gfx.device
                .create_buffer_init(&wgpu::util::BufferInitDescriptor {
                    label: Some("renderer2d.vertex_buffer"),
                    contents: bytemuck::cast_slice(&vertices_data),
                    usage: wgpu::BufferUsages::VERTEX
                        | wgpu::BufferUsages::COPY_DST,
                });

        let ortho = OrthographicCamera::new(LogicalSize {
            width: 800,
            height: 600,
        });
        let camera_data = ortho.view_proj();

        let diffuse_bytes = include_bytes!("../graphics/textures/knight.png");
        let diffuse_image = image::load_from_memory(diffuse_bytes).unwrap();
        let diffuse_rgba = diffuse_image.to_rgba8();

        use image::GenericImageView;
        let dimensions = diffuse_image.dimensions();

        let texture_size = wgpu::Extent3d {
            width: dimensions.0,
            height: dimensions.1,
            depth_or_array_layers: 1,
        };

        let diffuse_texture =
            gfx.device.create_texture(&wgpu::TextureDescriptor {
                size: texture_size,
                mip_level_count: 1,
                sample_count: 1,
                dimension: wgpu::TextureDimension::D2,
                format: wgpu::TextureFormat::Rgba8UnormSrgb,
                usage: wgpu::TextureUsages::COPY_DST
                    | wgpu::TextureUsages::TEXTURE_BINDING,
                label: Some("Knight Texture"),
            });

        gfx.queue.write_texture(
            wgpu::ImageCopyTexture {
                texture: &diffuse_texture,
                mip_level: 0,
                origin: wgpu::Origin3d::ZERO,
                aspect: wgpu::TextureAspect::All,
            },
            &diffuse_rgba,
            wgpu::ImageDataLayout {
                offset: 0,
                bytes_per_row: std::num::NonZeroU32::new(4 * dimensions.0),
                rows_per_image: std::num::NonZeroU32::new(dimensions.1),
            },
            texture_size,
        );

        let diffuse_texture_view = diffuse_texture
            .create_view(&wgpu::TextureViewDescriptor::default());
        let diffuse_sampler =
            gfx.device.create_sampler(&wgpu::SamplerDescriptor {
                address_mode_u: wgpu::AddressMode::ClampToEdge,
                address_mode_v: wgpu::AddressMode::ClampToEdge,
                address_mode_w: wgpu::AddressMode::ClampToEdge,
                mag_filter: wgpu::FilterMode::Nearest,
                min_filter: wgpu::FilterMode::Nearest,
                mipmap_filter: wgpu::FilterMode::Nearest,
                ..Default::default()
            });

        let uniforms = Uniforms::new(
            &gfx.device,
            camera_data.clone(),
            [[0, 0, 0, 255]; Self::MAX_QUADS],
            &diffuse_texture_view,
            &diffuse_sampler,
        );
        let shader_module = gfx
            .device
            .create_shader_module(include_wgsl!("shaders/triangle.wgsl"));

        let pipeline = util::make_pipeline(
            &gfx.device,
            &[&uniforms.bind_group_layout()],
            &[Vertex::desc()],
            &shader_module,
            &shader_module,
            gfx.surface_format,
        );

        Self {
            gfx,
            indices,
            vertices,
            pipeline,
            uniforms,
            vertices_data,
            n_quads_drawn: 0,
        }
    }

    pub fn resize(&mut self, new_size: PhysicalSize) {}
    pub fn resize_to_fit(&mut self) {}

    pub fn draw_quad(&mut self, desc: DrawQuadDescriptor) {
        const QUAD_BASE_POSITIONS: &[IVec2; 4] = &[
            IVec2::new(1, 0),
            IVec2::new(1, 1),
            IVec2::new(0, 1),
            IVec2::new(0, 0),
        ];

        if self.n_quads_drawn == Self::MAX_QUADS {
            self.flush();
        }

        self.uniforms.colors[self.n_quads_drawn] = [
            desc.color.x as u32,
            desc.color.y as u32,
            desc.color.z as u32,
            255,
        ];

        for i in 0..4 {
            self.vertices_data[i + self.n_quads_drawn * 4] = Vertex {
                pos: (QUAD_BASE_POSITIONS[i] * desc.size + desc.pos)
                    .as_uvec2()
                    .to_array(),
            };
        }

        self.n_quads_drawn += 1;
    }

    pub fn update(&mut self) {
        self.flush();
    }

    pub fn flush(&mut self) {
        self.uniforms
            .refresh_colors(&self.gfx.queue, self.n_quads_drawn);

        self.gfx.queue.write_buffer(
            &self.vertices,
            0 as wgpu::BufferAddress,
            bytemuck::cast_slice(
                &self.vertices_data[0..(self.n_quads_drawn * 4)],
            ),
        );

        let surface_texture = self.gfx.surface.get_current_texture().unwrap();
        let surface_view = surface_texture
            .texture
            .create_view(&wgpu::TextureViewDescriptor::default());

        let mut encoder = self.gfx.device.create_command_encoder(
            &wgpu::CommandEncoderDescriptor {
                label: Some("renderer2d.command_encoder"),
            },
        );

        {
            let mut rp = util::make_render_pass(
                &mut encoder,
                &surface_view,
                wgpu::Color {
                    r: 0.02,
                    g: 0.02,
                    b: 0.04,
                    a: 1.0,
                },
            );

            rp.set_pipeline(&self.pipeline);
            rp.set_vertex_buffer(
                0,
                self.vertices.slice(
                    ..self.n_quads_drawn as u64
                        * 4
                        * std::mem::size_of::<Vertex>() as u64,
                ),
            );
            rp.set_index_buffer(
                self.indices.slice(
                    ..self.n_quads_drawn as u64
                        * 6
                        * std::mem::size_of::<u16>() as u64,
                ),
                wgpu::IndexFormat::Uint16,
            );
            rp.set_bind_group(0, self.uniforms.as_bind_group(), &[]);
            rp.draw_indexed(0..(self.n_quads_drawn as u32 * 6), 0, 0..1);
        }

        self.gfx.queue.submit(std::iter::once(encoder.finish()));
        surface_texture.present();

        self.n_quads_drawn = 0;
    }
}

quad.wgsl

// VERTEX SHADER

struct VertexInput {
    @location(0) pos: vec2<u32>,
};

struct VertexOutput {
    @builtin(position) pos: vec4<f32>,
    @location(0) color: vec4<f32>,
    @location(1) tex_coords: vec2<f32>
};

@group(0) @binding(0)
var<uniform> view_proj: mat4x4<f32>;
@group(0) @binding(1)
var<storage, read> colors: array<vec4<u32>>;

@vertex
fn vs_main(@builtin(vertex_index) vertex_index: u32, in: VertexInput) -> VertexOutput {
    let i = vertex_index % 4u;
    var tex_u: f32 = step(f32(i), 1.5);
    var tex_v: f32 = step(f32(i), 2.5) * clamp(f32(i), 0.0, 1.0);
    
    var out: VertexOutput;
    out.pos = view_proj * vec4<f32>(f32(in.pos.x), f32(in.pos.y), 0.0, 1.0);
    out.color = vec4<f32>(colors[vertex_index / 4u]) / 255.0;
    out.tex_coords = vec2<f32>(tex_u, tex_v);
    
    return out;
}

// FRAGMENT SHADER

@group(0) @binding(2)
var texture: texture_2d<f32>;
@group(0) @binding(3)
var texture_sampler: sampler;

@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
    return textureSample(texture, texture_sampler, in.tex_coords) * in.color;
}
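As a sanity check on the branchless texture-coordinate arithmetic in `vs_main`, the same computation can be mirrored on the CPU. This is only a sketch: `step` below is a hand-written stand-in for the WGSL built-in (which returns 1.0 when `edge <= x` and 0.0 otherwise), and `tex_coords` is a hypothetical helper, not part of the renderer.

```rust
/// Stand-in for WGSL `step(edge, x)`: 1.0 when `edge <= x`, otherwise 0.0.
fn step(edge: f32, x: f32) -> f32 {
    if edge <= x { 1.0 } else { 0.0 }
}

/// Mirrors the texture-coordinate arithmetic in `vs_main` for one vertex.
fn tex_coords(vertex_index: u32) -> (f32, f32) {
    let i = (vertex_index % 4) as f32;
    let u = step(i, 1.5);
    let v = step(i, 2.5) * i.clamp(0.0, 1.0);
    (u, v)
}
```

For vertices 0 through 3 of each quad this yields (1, 0), (1, 1), (0, 1), and (0, 0), the same corner order as `QUAD_BASE_POSITIONS` in `draw_quad`.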

util.rs

pub fn make_pipeline(
    device: &wgpu::Device,
    bind_groups: &[&wgpu::BindGroupLayout],
    vertex_buffers: &[wgpu::VertexBufferLayout],
    vertex_shader: &wgpu::ShaderModule,
    fragment_shader: &wgpu::ShaderModule,
    output_format: wgpu::TextureFormat,
) -> wgpu::RenderPipeline {
    let pl_layout =
        device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
            label: Some("render_pipeline_layout"),
            bind_group_layouts: bind_groups,
            push_constant_ranges: &[],
        });

    device.create_render_pipeline(&wgpu::RenderPipelineDescriptor {
        label: Some("render_pipeline"),
        layout: Some(&pl_layout),
        vertex: wgpu::VertexState {
            module: vertex_shader,
            entry_point: "vs_main",
            buffers: vertex_buffers,
        },
        fragment: Some(wgpu::FragmentState {
            module: fragment_shader,
            entry_point: "fs_main",
            targets: &[Some(wgpu::ColorTargetState {
                format: output_format,
                blend: Some(wgpu::BlendState::ALPHA_BLENDING),
                write_mask: wgpu::ColorWrites::ALL,
            })],
        }),
        depth_stencil: None,

        primitive: wgpu::PrimitiveState {
            topology: wgpu::PrimitiveTopology::TriangleList,
            polygon_mode: wgpu::PolygonMode::Fill,
            ..Default::default()
        },

        multisample: wgpu::MultisampleState::default(),
        multiview: None,
    })
}

pub fn make_render_pass<'a>(
    encoder: &'a mut wgpu::CommandEncoder,
    target: &'a wgpu::TextureView,
    clear_color: wgpu::Color,
) -> wgpu::RenderPass<'a> {
    encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
        label: Some("render_pass"),
        color_attachments: &[Some(wgpu::RenderPassColorAttachment {
            view: target,
            ops: wgpu::Operations {
                load: wgpu::LoadOp::Clear(clear_color),
                store: true,
            },
            resolve_target: None,
        })],
        depth_stencil_attachment: None,
    })
}
\$\endgroup\$

1 Answer

\$\begingroup\$

The third argument to the draw_indexed() method is the range of instances you want the GPU to draw in that call. You are passing 0..1 to it, which tells the GPU to draw a single instance of the indexed geometry.

So, you are issuing 8000+ draw() calls per frame, which is very costly.

You want to draw as many objects as possible that share the same pipeline in a single draw() call. Using instances, you can easily handle millions of quads per frame.

This is the relevant page of the excellent Wgpu tutorial by Benjamin Hansen on how to achieve that: https://sotrh.github.io/learn-wgpu/beginner/tutorial7-instancing/#instancing
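As a rough illustration of what instancing changes on the CPU side, the sketch below describes each quad with one small per-instance record and reconstructs the corners from a shared unit quad, which is the work a vertex shader would do. `QuadInstance`, `UNIT_CORNERS`, `corner_position`, and `instance_bytes` are hypothetical names, and the field layout is an assumption rather than anything taken from the question or the tutorial. In wgpu, such a buffer would typically be declared with `wgpu::VertexStepMode::Instance` and drawn with something like `draw_indexed(0..6, 0, 0..n_quads)`.

```rust
/// Hypothetical per-instance record: one of these per quad, instead of
/// four vertices plus a separate color entry.
#[repr(C)]
#[derive(Copy, Clone, Debug, PartialEq)]
struct QuadInstance {
    pos: [f32; 2],
    size: [f32; 2],
    color: [u8; 4],
}

/// Corners of a unit quad, in the same order as QUAD_BASE_POSITIONS
/// in the question.
const UNIT_CORNERS: [[f32; 2]; 4] = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]];

/// What a vertex shader would compute for one corner of an instance:
/// scale the unit corner by the size, then offset by the position.
fn corner_position(inst: &QuadInstance, corner: usize) -> [f32; 2] {
    let c = UNIT_CORNERS[corner];
    [
        inst.pos[0] + c[0] * inst.size[0],
        inst.pos[1] + c[1] * inst.size[1],
    ]
}

/// Bytes uploaded per frame for `n` quads with this layout.
fn instance_bytes(n: usize) -> usize {
    n * std::mem::size_of::<QuadInstance>()
}
```

With this layout each quad takes 20 bytes per frame, compared with 48 bytes in the question's layout (four 8-byte vertices plus a 16-byte color entry).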

\$\endgroup\$
