Using buffers correctly in OpenGL compute shaders

Aug 19 2020

I am rewriting an algorithm, which I first wrote using matrix/vector operations, as OpenGL kernels to try to maximize performance.

I have a basic knowledge of OpenGL, so I was able to get things working, but I have a lot of trouble choosing among the various options OpenGL offers, especially the buffer parameters, which I guess have a big impact in my case, where I read and write a lot of data.

I call the three kernels sequentially:

First:

/* Generated constants (for all three shaders): 
 *   #version 430
 *   const vec3 orig
 *   const float vx
 *   const ivec2 size
 *   const uint projections
 *   const uint subIterations
 */
layout(local_size_x = 1, local_size_y = 1) in;

layout(std430, binding = 0) buffer bufferA { //GL_SHADER_STORAGE_BUFFER, GL_DYNAMIC_READ
    uint bufferProjection[]; //Written and read (AtomicAdd) by this shader, read by the second kernel
};
layout(std430, binding = 1) readonly buffer bufferB { //GL_SHADER_STORAGE_BUFFER, GL_DYNAMIC_READ
    uint layer[]; //Written and read by the third kernel, read by this shader and by glGetNamedBufferSubData
};
layout(std140) uniform bufferMat { //GL_UNIFORM_BUFFER, GL_STATIC_DRAW
    mat4 proj_mat[projections*subIterations]; //Read only by this shader and the third
};
layout(location = 0) uniform int z;
layout(location = 1) uniform int subit;

void main() {
    vec4 layer_coords = vec4(orig,1.0) + vec4(gl_GlobalInvocationID.x, z, gl_GlobalInvocationID.y, 0.0)*vx;
    uint val = layer[gl_GlobalInvocationID.y*size.x + gl_GlobalInvocationID.x];
    for(int i = 0; i < projections; ++i) {
        vec4 proj_coords = proj_mat[subit+i*subIterations]*layer_coords;
        ivec2 tex_coords = ivec2(floor((proj_coords.xy*size)/(2.0*proj_coords.w)) + size/2);
        bool valid = all(greaterThanEqual(tex_coords, ivec2(0,0))) && all(lessThan(tex_coords, size));
        atomicAdd(bufferProjection[tex_coords.y*size.x+tex_coords.x+i*(size.x*size.y)], valid?val:0);
    }
}

Second:

layout(local_size_x = 1, local_size_y = 1) in;

layout(std430, binding = 0) buffer bufferA { //GL_SHADER_STORAGE_BUFFER, GL_DYNAMIC_READ
    float updateProjection[]; //Written by this shader, read by the third kernel
};
layout(std430, binding = 1) readonly buffer bufferB { //GL_SHADER_STORAGE_BUFFER, GL_DYNAMIC_READ
    uint bufferProjection[]; //Written by the first, read by this shader
};
layout(std430, binding = 2) readonly buffer bufferC { //GL_SHADER_STORAGE_BUFFER, GL_DYNAMIC_READ
    uint originalProjection[]; //Only modified by glBufferSubData, read by this shader
};

void main() {
    for(int i = 0; i < projections; ++i) {
        updateProjection[gl_GlobalInvocationID.x+i*(size.x*size.y)] = float(originalProjection[gl_GlobalInvocationID.x+i*(size.x*size.y)])/float(bufferProjection[gl_GlobalInvocationID.x+i*(size.x*size.y)]);
    }
}

Third:

layout(local_size_x = 1, local_size_y = 1) in;

layout(std430, binding = 0) readonly buffer bufferA { //GL_SHADER_STORAGE_BUFFER, GL_DYNAMIC_READ
    float updateProjection[]; //Written by the second kernel, read by this shader
};
layout(std430, binding = 1) buffer bufferB { //GL_SHADER_STORAGE_BUFFER, GL_DYNAMIC_READ
    uint layer[]; //Written and read by this shader, read by the first kernel and by glGetNamedBufferSubData
};
layout(std140) uniform bufferMat { //GL_UNIFORM_BUFFER, GL_STATIC_DRAW
    mat4 proj_mat[projections*subIterations]; //Read only by this shader and the first
};
layout(location = 0) uniform int z;
layout(location = 1) uniform int subit;
layout(location = 2) uniform float weight;

void main() {
    vec4 layer_coords = vec4(orig,1.0) + vec4(gl_GlobalInvocationID.x, z, gl_GlobalInvocationID.y, 0.0)*vx;
    float acc = 0;
    for(int i = 0; i < projections; ++i) {
        vec4 proj_coords = proj_mat[subit+i*subIterations]*layer_coords;
        ivec2 tex_coords = ivec2(floor((proj_coords.xy*size)/(2.0*proj_coords.w)) + size/2);
        bool valid = all(greaterThanEqual(tex_coords, ivec2(0,0))) && all(lessThan(tex_coords, size));
        acc += valid?updateProjection[tex_coords.y*size.x+tex_coords.x+i*(size.x*size.y)]:0;
    }
    float val = pow(float(layer[gl_GlobalInvocationID.y*size.x + gl_GlobalInvocationID.x])*(acc/projections), weight);
    layer[gl_GlobalInvocationID.y*size.x + gl_GlobalInvocationID.x] = uint(val);
}

What I came up with from reading the OpenGL documentation:

Some values that stay the same for the whole duration of the algorithm are generated as const before compiling the shader. This is especially useful for the for-loop bound.
bufferMat, which is very small compared to the other buffers, is put in a UBO, which should perform better than an SSBO. Can I improve performance even further by making it a compile-time constant? It is small, but still a few hundred mat4.
The other buffers, which are read and written several times, should be better off as SSBOs.
I have trouble understanding what the best value for the buffers' "usage" parameters could be. All the buffers are written and read several times, so I am not sure what to put here.
If I understand correctly, local_size is only useful when sharing data between invocations, so I should keep it at one.

I would gladly take any recommendations or pointers on where to look to optimize these kernels.

Answers

1 NicolBolas Aug 19 2020 at 21:51

Can I improve performance even further by making it a compile-time constant?

You will have to profile it. That said, "a few hundred mat4" is not "small".

I have trouble understanding what the best value for the buffers' "usage" parameters could be. All the buffers are written and read several times, so I am not sure what to put here.

First, the usage parameters describe your use of the buffer object, not OpenGL's use of the memory behind it. That is, they refer to functions like glBufferSubData, glMapBufferRange, and so on. READ means that the CPU will read from the buffer but will not write to it. DRAW means that the CPU will write to the buffer but will not read from it.

Second... you really shouldn't care. Usage hints are terrible, poorly specified, and have been misused so badly that many implementations ignore them outright. NVIDIA's GL implementation is probably the one that takes them most seriously.

Instead, use immutable storage buffers. With immutable storage, the "usage hints" are not hints; they are API requirements. If you do not specify GL_DYNAMIC_STORAGE_BIT, you cannot write to the buffer through glBufferSubData. And so on.
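As a rough sketch of what that looks like on the host side, the flags for an immutable buffer can be derived from how the CPU will touch it after creation. The `CpuAccess` struct and `storageFlags` helper are hypothetical names made up for this illustration; the two constants mirror the values of GL_MAP_READ_BIT and GL_DYNAMIC_STORAGE_BIT from the Khronos headers so the snippet compiles without an OpenGL context.

```cpp
#include <cstdint>

// Flag values as published in the Khronos glcorearb.h header. A real program
// would include its OpenGL loader header and use the GL_* names directly.
constexpr std::uint32_t kMapReadBit        = 0x0001; // GL_MAP_READ_BIT
constexpr std::uint32_t kDynamicStorageBit = 0x0100; // GL_DYNAMIC_STORAGE_BIT

// Hypothetical description of how the CPU touches one buffer after creation.
struct CpuAccess {
    bool writesWithBufferSubData; // glBufferSubData / glNamedBufferSubData
    bool mapsForReading;          // glMapBufferRange with GL_MAP_READ_BIT
};

// Build the `flags` argument for glBufferStorage / glNamedBufferStorage.
inline std::uint32_t storageFlags(const CpuAccess& access) {
    std::uint32_t flags = 0;
    if (access.writesWithBufferSubData) flags |= kDynamicStorageBit;
    if (access.mapsForReading)          flags |= kMapReadBit;
    return flags;
}

// Intended use (needs a live OpenGL 4.5 context, so it is left as a comment):
//   glNamedBufferStorage(buffer, sizeInBytes, initialData, storageFlags(access));
```

Applied to the question, originalProjection is updated with glBufferSubData and would therefore need GL_DYNAMIC_STORAGE_BIT, while buffers that only the shaders read and write could presumably be created with flags set to 0.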

If I understand correctly, local_size is only useful when sharing data between invocations, so I should keep it at one.

No. In fact, never use 1. If all the invocations are doing their own thing, with no execution barriers or the like, then you should pick a local size equivalent to the wavefront size of the hardware you are working on. Obviously that is implementation-dependent, but 32 is certainly a better default than 1.
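A practical consequence is that the dispatch size has to change along with the local size. The sketch below assumes the shaders switch to `layout(local_size_x = 32) in;`. The `groupCount` helper is a hypothetical name, not part of OpenGL; it rounds the number of work groups up so that every element is covered.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: how many work groups to pass to glDispatchCompute so
// that `items` invocations are covered when each group runs `localSize` of them.
inline std::uint32_t groupCount(std::uint32_t items, std::uint32_t localSize) {
    assert(localSize > 0);
    // Round up without risking overflow in `items + localSize - 1`.
    return items / localSize + (items % localSize != 0 ? 1u : 0u);
}

// Intended use for a sizeX by sizeY grid with layout(local_size_x = 32) in;
//   glDispatchCompute(groupCount(sizeX, 32), sizeY, 1);
```

Because the count is rounded up, the last group along x can contain invocations past the end of the data whenever the width is not a multiple of 32. The shader would then need a guard at the top of main, something like `if (gl_GlobalInvocationID.x >= uint(size.x)) return;`.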