WebGPU shader limits

The only limits are the ones you set yourself (or so they said).

Limits can be a fun thing to look out for when exploring unconventional ways to use something in a safe, consensual way.

A while ago I got it into my mind that it would be cool to dump data directly into the shader code string. Engrave the data in the code instead of going with the common approach of providing it through some external mechanism (textures, buffers, etc). This manipulation of the code string would happen before compilation, by making use of some sort of pre-processor that could update the shader code in unusual ways. A lispic meta shader code as a service, or something like that.

That road led me to over-engineer a demoscene engine that still grabs my attention to this day and hopefully will bear its first fruits at Inercia 2025 or 2026. But that will be the subject of a future post.

In the meantime there were some learnings that I found cool enough to be worth your attention. All shader code below is WGSL.

By the book limits

Querying the WebGPU device is the regular way to learn about some of the limits the shader language might have. It gives developers the device-reported maximum values for things like the maximum storage buffer binding size, the maximum invocations per workgroup, etc.
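That query looks something like this (a minimal sketch; the limit names come from the WebGPU spec, the function name is mine):

```javascript
// Sketch: read a few of the device-reported limits (browser-only API)
async function queryDeviceLimits() {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU not available");
  const device = await adapter.requestDevice();
  return {
    maxStorageBufferBindingSize: device.limits.maxStorageBufferBindingSize,
    maxComputeInvocationsPerWorkgroup:
      device.limits.maxComputeInvocationsPerWorkgroup,
    maxComputeWorkgroupStorageSize:
      device.limits.maxComputeWorkgroupStorageSize,
  };
}
```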

These queried limits are the ones that change between devices, and naturally they do not cover all the limits we are bound to. For instance, a lot of the limits come from the WebGPU and WGSL specs, and these are often the minimum limits that implementations must support.

For this blog post I am interested in the WGSL shader code limits, which the spec lays out in a nice table:

"WGSL Limits from the spec" (image from the WGSL 9 Sep 2025 spec limits section)

To explore these, let's start with simple shader code: the triangle.

You can’t spell Triskel without TRI.

The simple WGSL triangle can be done with this vertex code:

@vertex
fn vertexMain(
  @builtin(vertex_index) VertexIndex : u32
) -> @builtin(position) vec4f {

  // A static array with the coordinates of this triangle
  var pos = array<vec2f, 3>(
    vec2(0.5, 0.5), vec2(-0.5, -0.5), vec2(0.5, -0.5),
  );

  return vec4f(pos[VertexIndex], 0.0, 1.0);
}

How many vertices can we fit into that var pos = array<vec2f, 3> array?

The last row of the WGSL limits table tells us that the minimum value WebGPU has to guarantee for the "Maximum number of elements in value constructor expression of array type" is 2047.

In order to test this, let's introduce a bit of JS into this shader module string, repeating the last vertex to guarantee that the array is filled with n elements:

const preprocessWGSL = (n: number) => `
@vertex
fn vertexMain(
  @builtin(vertex_index) VertexIndex : u32
) -> @builtin(position) vec4f {

  var pos = array<vec2f, ${n}>(
    ${
  "vec2(0.0, 0.5), vec2(-0.5, -0.5), " +
  // Extend this vertex:
  "vec2(0.5, -0.5), ".repeat(n - 2).slice(0, -2)
});

  return vec4f(pos[VertexIndex], 0.0, 1.0);
}`;

Procedural generation of limits

Now I want to replicate this logic of declaring a super big thing in the WGSL code to check the other limits of that spec table.

The idea is to do this in three steps:

  1. Create a shader module with the offending code, and check for compilation errors. If it passes:
  2. Create a WebGPU pipeline with the offending code, and check if it doesn't throw. If it passes:
  3. Go to 1., increasing the number, until a value is found that breaks either 1. or 2.
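The steps above can be sketched roughly like this (the names `probeLimit` and `makeShader` are mine, not from any API, and error handling is simplified):

```javascript
// Sketch of the probe loop: compile, build a pipeline, then grow n.
// makeShader is a preprocessor like preprocessWGSL above.
async function probeLimit(device, makeShader, start) {
  for (let n = start; ; n++) {
    // 1. create a shader module and check for compilation errors
    const module = device.createShaderModule({ code: makeShader(n) });
    const info = await module.getCompilationInfo();
    if (info.messages.some((m) => m.type === "error")) return n - 1;

    // 2. create a pipeline and check the validation error scope
    device.pushErrorScope("validation");
    device.createComputePipeline({
      layout: "auto",
      compute: { module, entryPoint: "computeMain" },
    });
    if (await device.popErrorScope()) return n - 1;
    // 3. otherwise increase n and go again
  }
}
```

In practice you would binary-search instead of stepping by one, but the shape of the recipe is the same.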

All values provided are for my machine, an Intel macOS laptop with Intel and AMD GPUs. The reported adapter vendor is "Intel" and the architecture is "gen-9".

All code is run in a Web Worker with an OffscreenCanvas.

Let's go bottom-up through the WGSL table limits and do this!

Maximum number of elements in value constructor expression of array

The table says WGSL should support at least 2047 elements.

Using the code and recipe above, I get the following on my Intel Apple MacBook:

Maximum combined byte-size of all variables instantiated in the workgroup address space

The WGSL table says we should be able to have at least 16384 bytes.

This value is the same as the one communicated by the hardware device limits at maxComputeWorkgroupStorageSize.

To test this, I created a simple compute shader with another really big array. I am not sure whether a barrier is necessary to prevent some sort of memory layout optimization or dead code elimination here, so I added one nevertheless.

// output, this is just to enforce the usage of the workgroup data
@group(0) @binding(0) var<storage, read_write> output: array<f32>;

// the ${n} here is coming from the JS preprocessor
var<workgroup> sharedData: array<f32, ${n}>;

@compute @workgroup_size(64)
fn computeMain(@builtin(local_invocation_id) localId: vec3<u32>) {
  let idx = localId.x;

  // each invocation writes one element; with @workgroup_size(64) only the
  // first 64 elements are written, the rest stay zero-initialized
  // (the ${n} here is coming from the JS preprocessor)
  if (idx < ${n}) {
    sharedData[idx] = f32(idx);
  }
  workgroupBarrier();

  // compiler might be smart to just ignore the variable if not used, so
  // lets use it here just to avoid dead code elimination:
  if (idx == 0) {
    var sum = 0.0;
    // the ${n} here is coming from the JS preprocessor
    for (var i = 0u; i < ${n}; i++) {
      sum += sharedData[i];
    }
    output[0] = sum;
  }
}

We could try to increase the hardware limits by requesting a bit more from the device, but I am just looking at what happens at the limit for creative purposes.
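For reference, such a request would look something like this (a sketch, not part of the tests; it asks for the adapter's own reported maximum rather than an arbitrary number):

```javascript
// Sketch: request a higher workgroup storage limit, capped at whatever
// the adapter actually supports (browser-only API)
async function requestBiggerWorkgroupStorage() {
  const adapter = await navigator.gpu.requestAdapter();
  return adapter.requestDevice({
    requiredLimits: {
      maxComputeWorkgroupStorageSize:
        adapter.limits.maxComputeWorkgroupStorageSize,
    },
  });
}
```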

Let's move on to the next row of the WGSL spec limits table.

Maximum combined byte-size of all variables instantiated in the function address space

The spec says we should have at least 8192 bytes. The thing here is that function address space variables are unique to each invocation, not shared, so there is no need for the barrier as above.

Also, this limit is not communicated by the hardware device when querying for limits, so let's see.

@group(0) @binding(0) var<storage, read_write> output: array<f32>;

fn someFunctionWithLocalMemory() -> f32 {
  // Same thing as before, ${n} is the JS preprocessor input:
  var localMem: array<f32, ${n}>;
  
  // ${n} is the JS preprocessor
  for (var i = 0u; i < ${n}; i++) {
    localMem[i] = f32(i);
  }
  
  var sum = 0.0;
  // ${n} is the JS preprocessor
  for (var i = 0u; i < ${n}; i++) {
    sum += localMem[i];
  }
  
  return sum;
}

@compute @workgroup_size(1)
fn computeMain() {
  output[0] = someFunctionWithLocalMemory();
}

Maximum combined byte-size of all variables instantiated in the private address space

The private address space is cool for module-scope variables that are unique to each invocation but persist across function calls within it.

The limit in the spec table is again 8192 bytes. To test it, I will declare a big array as var<private> (which is the default when declaring variables in the module scope) and see how big it can be. Again, this is not a limit exposed through the WebGPU API hardware device limits.

@group(0) @binding(0) var<storage, read_write> output: array<f32>;

// Same as before, a big array:
var<private> privateData: array<f32, ${n}>;

@compute @workgroup_size(1)
fn computeMain() {

  // Initialize it, the ${n} is set by this test shader preprocessor
  for (var i = 0u; i < ${n}; i++) {
    privateData[i] = f32(i);
  }
  
  var sum = 0.0;
  // Use all the data to avoid it being optimized away, like above
  // the ${n} is set by this test shader preprocessor
  for (var i = 0u; i < ${n}; i++) {
    sum += privateData[i];
  }
  
  output[0] = sum;
}

And let's move on to the next one.

Maximum number of case selector values in a switch statement

This is an interesting one; it certainly had not crossed my mind to explore the number of case selector values. I am just going to do a quick test here and not go into the very specifics of it, though a few questions pop up in my mind, like "do empty case blocks count?" or "if I group cases together, do they add up to this limit?".

I'm going with just single values per case:

@group(0) @binding(0) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
  let testValue = i32(id.x) % ${numCases};
  var result: f32 = -1.0;
  
  switch testValue {
${cases.join("\n")}
    default: {
      result = 999.0;
    }
  }
  
  output[0] = result;
}
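The `cases` interpolated above come from the JS side; a minimal sketch of such a generator (the helper name is mine):

```javascript
// Sketch: one selector value per case, matching the switch above
const buildSwitchCases = (numCases) =>
  Array.from(
    { length: numCases },
    (_, i) => `    case ${i}: { result = ${i}.0; }`
  );
```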

Maximum number of parameters for a function

This is a cool way for a preprocessor to pass down extra data in the shader code string. The spec says that we get at least 255 parameters for each function, so let's create a preprocessor that uses structs and see how WebGPU behaves.

@group(0) @binding(0) var<storage, read_write> output: array<f32>;

struct MyStruct {
  value: f32,
  flag: bool
}

fn someFunction(${
    Array.from({ length: n }, (_, i) => `s${i}: MyStruct`).join(", ")
  }) -> f32 {
  var sum = 0.0;
  ${Array.from({ length: n }, (_, i) => `sum += s${i}.value;`).join("\n      ")}
  return sum;
}

@compute @workgroup_size(1)
fn main() {
  let s = MyStruct(1.0, true);
  output[0] = someFunction(${Array.from({ length: n }, () => "s").join(", ")});
}

This limit seems to be per function, and structs are treated as just another parameter. It could maybe be used creatively as a way to dump extra data into a shader code string when other limits are reached, though I imagine compilation time would become a factor at some point.

Maximum nesting depth of brace-enclosed statements in a function

To test this limit I want to dump a lot of nested ifs into the shader code string. The spec says we can have at least 127 nested brace-enclosed statements.

Here is the full code for it, including the #computePipeline from the WebGPU engine I am working on, where I moved the WebGPU spec to a fully declarative preprocessor (more about this in the next post).

export const limitsNestedIfShaderModule = (depth: number) => {
  let opening = "";
  let closing = "";
  let indent = "";

  // Create a bunch of nested ifs, each block writing to output
  // so none of them is empty
  for (let i = 0; i < depth; i++) {
    opening += `${indent}if (${i}.0 < ${depth}.0) {\n`;
    indent += "  ";
    opening += `${indent}output[0] = ${i}.0;\n`;
  }

  // Close every block, from the innermost out
  for (let i = depth - 1; i >= 0; i--) {
    closing += "  ".repeat(i) + "}\n";
  }

  const shaderCode = `
#computePipeline someComputation {
  layout=auto
  compute={
    entryPoint=computeMain
    module=code 
  }
}

#shaderModule code {
  code="
@group(0) @binding(0) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(1)
fn computeMain() {
${opening}${closing}
}

"}
  `;

  return shaderCode;
};


The results are in:

Now this is strange. Safari is doing the right thing: "brace-enclosed statements" includes the function block and beyond. Chrome, however, appears to be doing something entirely different, maybe counting closing braces as well? Not really sure what is happening here.

This got me wondering about the behaviour of nested composite types, which is coming right next.

Maximum nesting depth of a composite type

The spec says the minimum supported depth is 15. I am going to generate a big nested array like this:

array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<array<f32, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>, 2>
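That type string can be generated with a one-liner like this (a sketch; the helper name is mine):

```javascript
// Sketch: build a depth-deep nested array type string for WGSL
const nestedArrayType = (depth) =>
  "array<".repeat(depth) + "f32" + ", 2>".repeat(depth);
```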

These are the results:

This suggests that Chrome maybeeee is not explicitly checking nesting depth at all (just maybe) and is letting us hit whatever other limit might be in its parsing stack.

Maximum number of members in a structure type

The final limit that I think might be potentially interesting to test as a shader developer is the maximum number of members in a struct.

The spec says we can have capacity for at least 1023 members in a struct.

@group(0) @binding(0) var<storage, read_write> output: array<f32>;

// Preprocessor creates n members for this struct:
struct BigStruct {
${members}
}

// Create and use them here to avoid them from being optimized:
@compute @workgroup_size(1)
fn computeMain() {
  var s = BigStruct(${initValues});
  
  ${sumLoop}
  output[0] = result;
}
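The ${members}, ${initValues} and ${sumLoop} placeholders are filled by the JS side; a sketch of such a generator (the names are mine):

```javascript
// Sketch: generate n struct members, matching constructor values,
// and the statements that sum every member into `result`
const buildBigStructParts = (n) => ({
  members: Array.from({ length: n }, (_, i) => `  m${i}: f32,`).join("\n"),
  initValues: Array.from({ length: n }, (_, i) => `${i}.0`).join(", "),
  sumLoop: ["var result = 0.0;"]
    .concat(Array.from({ length: n }, (_, i) => `result += s.m${i};`))
    .join("\n  "),
});
```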

Chrome is generous here and goes well beyond the minimum provided.

A table of it all

These results fit well within my expectations: Safari, as always, is more strict and spec-oriented, while Chrome goes above the limits on some fronts while sometimes having a less polished failure mode.

| Limit (WGSL spec) | Spec min | Safari TP 227 | Chrome 140 |
| --- | --- | --- | --- |
| Array constructor elements | 2047 | Errs >2047 (2047) | Compiles; pipeline breaks at 2048; UI may stall |
| Workgroup storage bytes (var<workgroup>) | 16384 | Errs in pipeline | Pipeline error; suggests requiredLimits 32768 |
| Function local bytes (function stack) | 8192 | Errs in pipeline | No limit reached; compiles up to "array count" 65536 |
| Private addr space bytes (var<private>) | 8192 | Errs in pipeline | No limit reached; same as above |
| Case selectors in a switch | 1023 | Errs >1023 | Errs >16383 |
| Function parameters | 255 | Errs >255 | Errs >255 |
| Struct members | 1023 | Errs >1023 | Errs >16383 |
| Nested brace-enclosed statements | 127 | Errs >127 | Errs >63 (below spec min 127) |
| Nested composite types | 15 | Errs >15 | Errs >29, fails with a different limit |

Conclusion

I think there are at least two takeaways I can safely draw from poking at the WGSL spec limits:

  1. The spec minimums are real guardrails. Safari Technology Preview enforces them very strictly, which makes it great to validate what the portable floor actually is.

  2. Browsers may go beyond the minimums. Chrome lets you stretch several rows (struct members, switch cases, local/private bytes) way past the spec floor, but you’ll sometimes trip different internal limits first (like array-element counts) or see pipeline-time failures rather than neat compile errors.

If you’re considering “engraving” data into the shader source for creative or preprocessing reasons, my advice so far would be: treat the spec minimums as the portable floor, test in more than one browser, and expect pipeline-time failures rather than neat compile errors when you go past them.

All of these were run with a preprocessor demo player that I am working on for fun. If you are curious, I'll share a follow-up shortly, and ideally a WebGPU demo at Inercia 2025.