Friday, May 8, 2015

Keep it simple

 unsigned char alphaValue = GetAlphaValue(pixelIndex);  
 if (alphaValue > 250)  
 {  
   PutPixel();  
 }  
 if (alphaValue < 5)  
 {  
   return;  
 }  


The above code is a trivial optimization for a 2D software renderer I'm currently writing for a course I'll be giving at my old university.
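To show where such early-outs could sit, here is a minimal sketch of a per-channel blend. The function name `blend_channel` and the integer blend formula are my own assumptions for illustration, not the renderer's actual code. Only the two thresholds (250 and 5) come from the snippet above. The sketch also returns right after the opaque copy, which the snippet does not show.

```c
#include <stdint.h>

/* Hypothetical helper: blends one 8-bit source channel over a destination
   channel. Returns 1 if the destination was written, 0 if it was skipped. */
static int blend_channel(uint8_t *dst, uint8_t src, uint8_t alpha)
{
    if (alpha > 250) {
        /* Nearly opaque: plain copy, no blending needed. */
        *dst = src;
        return 1;
    }
    if (alpha < 5) {
        /* Nearly transparent: leave the destination untouched. */
        return 0;
    }
    /* General case: integer alpha blend, dst = src * a + dst * (1 - a). */
    *dst = (uint8_t)((src * alpha + *dst * (255 - alpha)) / 255);
    return 1;
}
```

For sprites that are mostly fully opaque or fully transparent, almost every pixel takes one of the two cheap paths and the blend math is rarely reached.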

At one point I was thinking of multithreading the rasterization, but those 5-ish lines saved me from all that work.

Sometimes you just have to keep it simple.

----

I agree that this solution may be suboptimal for most users; however, it is perfectly applicable to the university project.

Tuesday, March 10, 2015

Don't use volatile variables for busy waiting

I'm currently working on a multithreaded renderer as a replacement for the current rendering API in the K15 Engine which was always kind of janky.

The multithreaded renderer works similarly to how Vulkan supports multithreaded applications.

Several threads can create and fill their own render command buffers. A separate render thread goes through all submitted render command buffers in serial and maps the commands to OpenGL / DirectX calls.
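The record-then-replay idea can be sketched roughly like this. All names here (`CommandBuffer`, `record_command`, `process_buffers`, the two command types and the fixed capacity) are hypothetical and are not the K15 Engine API. The sketch is single-threaded and only shows the data flow, without any synchronization.

```c
#include <stddef.h>

enum { MAX_COMMANDS = 64 };                /* hypothetical capacity */

typedef enum { CMD_CLEAR, CMD_DRAW } CommandType;

typedef struct {
    CommandType type;
    int         argument;
} RenderCommand;

typedef struct {
    RenderCommand commands[MAX_COMMANDS];
    size_t        count;
} CommandBuffer;

/* Called by the thread that owns the buffer. Returns 0 if the buffer is full. */
static int record_command(CommandBuffer *buffer, CommandType type, int argument)
{
    if (buffer->count >= MAX_COMMANDS) {
        return 0;
    }
    buffer->commands[buffer->count].type = type;
    buffer->commands[buffer->count].argument = argument;
    buffer->count++;
    return 1;
}

/* Called by the render thread: walks the submitted buffers in order,
   empties them, and returns the number of commands processed. */
static size_t process_buffers(CommandBuffer **buffers, size_t bufferCount)
{
    size_t processed = 0;
    for (size_t b = 0; b < bufferCount; ++b) {
        for (size_t c = 0; c < buffers[b]->count; ++c) {
            switch (buffers[b]->commands[c].type) {
            case CMD_CLEAR: /* would map to a clear call of the graphics API */ break;
            case CMD_DRAW:  /* would map to a draw call of the graphics API */  break;
            }
            ++processed;
        }
        buffers[b]->count = 0;
    }
    return processed;
}
```

Each thread only ever touches its own buffer while recording, so the recording itself needs no locking. The interesting part is the hand-over to the render thread.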

The logic for this multithreaded rendering approach can be visualized as follows:


Although the individual render command buffers are getting filled asynchronously, care has to be taken when the render command buffers are getting dispatched and the render thread is getting kicked off.

As both the dispatch buffer (which feeds the render thread with render command buffers in a previously defined order) and the render command buffers themselves are double buffered, we have to make sure that command buffers are not getting dispatched or filled with commands while the internal buffers are getting flipped.
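As a rough illustration of what flipping means here, the following sketch keeps two internal arrays and swaps their roles. The type `DoubleBuffer` and the functions `db_push` and `db_flip` are made up for this example. It contains no synchronization at all, which is exactly why a flip must not overlap with a push.

```c
#include <stddef.h>

enum { BUFFER_CAPACITY = 16 };             /* hypothetical capacity */

typedef struct {
    int    items[2][BUFFER_CAPACITY];      /* the two internal buffers */
    size_t counts[2];                      /* number of items in each buffer */
    int    writeIndex;                     /* buffer that producers write into */
} DoubleBuffer;

/* Appends an item to the current write buffer. Returns 0 if it is full. */
static int db_push(DoubleBuffer *db, int item)
{
    size_t *count = &db->counts[db->writeIndex];
    if (*count >= BUFFER_CAPACITY) {
        return 0;
    }
    db->items[db->writeIndex][*count] = item;
    (*count)++;
    return 1;
}

/* Swaps the roles of the two buffers and returns the index of the buffer
   that may now be read. Must not run at the same time as db_push. */
static int db_flip(DoubleBuffer *db)
{
    int readIndex = db->writeIndex;
    db->writeIndex = 1 - db->writeIndex;
    db->counts[db->writeIndex] = 0;        /* new write buffer starts empty */
    return readIndex;
}
```

If a push ran while `writeIndex` and the counts were being changed, an item could land in the buffer the consumer is about to read, or be dropped when the count is reset.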

My initial (and honestly naive) approach was to define a flag for the dispatcher and the individual command buffers, which was set whenever the internal buffers were being flipped.

If you wanted to add a render command to a command buffer that was currently being flipped, the renderer checked the flag and busy waited as long as the flag was set. As soon as the render thread was done with the buffer flipping, it would unset the flag, and the thread that was trying to add commands would stop busy waiting and start adding commands to the command buffer.

The implementation looked like this:

 void K15_AddRenderCommand(K15_RenderCommand* p_RenderCommand, K15_RenderCommandBuffer* p_RenderCommandBuffer)  
 {  
    //stall until the buffer is flipped  
    while((p_RenderCommandBuffer->flags & K15_RENDER_COMMAND_BUFFER_FLIP_FLAG) > 0);   
    //... add render command to command buffer  
 }  

The code itself worked as intended, and commands were only added to a render command buffer once the internal buffers had been flipped successfully.

For the Release configuration I also had to make the flags variable of the K15_RenderCommandBuffer volatile as the optimizer modified the code so that

 while((p_RenderCommandBuffer->flags & K15_RENDER_COMMAND_BUFFER_FLIP_FLAG) > 0);  

would read the value of the flags variable only once and then keep reusing that cached copy (keep in mind, the value was getting changed in another thread - the render thread).

The volatile keyword tells the optimizer not to optimize accesses to this particular variable and to flat out reread it from memory whenever its value is accessed.

Although the code worked as intended, the performance was nowhere near what I was expecting.

Thinking about what the volatile keyword really did for the above code made it perfectly clear to me *why* the code wasn't running as fast as I thought it would.

As a consequence of adding the volatile keyword to the flags variable, the value of the variable was always reread from memory rather than from the core's internal cache (as stated above). The roundtrip to memory was what made the busy waiting so slow. Depending on the hardware implementation, the render thread, which was writing to the flags variable, probably also had to drop the value from its internal cache and write it to main memory whenever another thread requested the value of the flags variable. That is another reason why my naive approach was not very performant.

After some refactoring, the code now uses a semaphore for synchronization, which in turn yields the performance I was expecting.
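The refactored code is not shown here, so the following is only a minimal sketch of how a semaphore could replace the flag, assuming POSIX semaphores (`semaphore.h`). The type `SyncedBuffer` and all function names are hypothetical. The semaphore starts at 1 and acts as a binary "buffer is free" token: whoever holds it may touch the buffer, and everyone else sleeps in `sem_wait` instead of spinning.

```c
#include <semaphore.h>

typedef struct {
    sem_t bufferFree;     /* 1: buffer may be used, 0: someone is using it */
    int   commandCount;   /* stand-in for the real command storage */
} SyncedBuffer;

/* Returns 0 on success, like sem_init. */
static int synced_init(SyncedBuffer *buffer)
{
    buffer->commandCount = 0;
    return sem_init(&buffer->bufferFree, 0, 1);
}

/* Render thread: take the token before flipping the internal buffers... */
static void synced_begin_flip(SyncedBuffer *buffer)
{
    sem_wait(&buffer->bufferFree);
}

/* ...and hand it back once the flip is done. */
static void synced_end_flip(SyncedBuffer *buffer)
{
    sem_post(&buffer->bufferFree);
}

/* Producer thread: blocks (without spinning) while a flip is in progress. */
static void synced_add_command(SyncedBuffer *buffer)
{
    sem_wait(&buffer->bufferFree);
    buffer->commandCount++;   /* ... add render command to command buffer */
    sem_post(&buffer->bufferFree);
}
```

While a thread is blocked in `sem_wait`, the operating system can schedule other work on that core, and the waiting thread no longer polls the shared variable in a tight loop.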