Tuesday, March 10, 2015

Don't use volatile variables for busy waiting

I'm currently working on a multithreaded renderer as a replacement for the K15 Engine's current rendering API, which was always kind of janky.

The multithreaded renderer works similarly to how Vulkan supports multithreaded applications.

Several threads can create and fill their own render command buffers. A separate render thread goes through all submitted render command buffers in serial and maps the commands to OpenGL / DirectX calls.

The logic for this multithreaded rendering approach can be visualized as follows:


Although the individual render command buffers are filled asynchronously, care has to be taken when the render command buffers are dispatched and the render thread is kicked off.

As both the dispatch buffer (which feeds the render thread with render command buffers in a previously defined order) and the render command buffers themselves are double buffered, we have to make sure that command buffers are not dispatched or filled with commands while the internal buffers are being flipped.

My initial (and honestly naive) approach was to define a flag for the dispatcher and for the individual command buffers which was set whenever the internal buffers were being flipped.

In case you wanted to add a render command to a command buffer that was currently being flipped, the renderer checked the flag and would busy wait as long as the flag was set. Whenever the render thread was done with the buffer flip, it would unset the flag, and the thread that was trying to add commands would stop busy waiting and start adding commands to the command buffer.

The implementation looked like this:

 void K15_AddRenderCommand(K15_RenderCommand* p_RenderCommand, K15_RenderCommandBuffer* p_RenderCommandBuffer)  
 {  
    //stall until the buffer is flipped  
    while((p_RenderCommandBuffer->flags & K15_RENDER_COMMAND_BUFFER_FLIP_FLAG) > 0);   
    //... add render command to command buffer  
 }  

The code itself worked as intended and commands were only added to a render command buffer after the internal buffers had been flipped successfully.

For the Release configuration I also had to make the flags variable of the K15_RenderCommandBuffer volatile, as the optimizer modified the code so that

 while((p_RenderCommandBuffer->flags & K15_RENDER_COMMAND_BUFFER_FLIP_FLAG) > 0);  

would only read the value of the flags variable once into a register and never re-read it inside the loop (keep in mind, the value was being changed by another thread - the render thread).
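
In other words, the optimized build behaved roughly as if the loop had been rewritten like this (an illustrative sketch, not the actual generated code):

 //Roughly what the optimizer turned the spin loop into: the flag is read once  
 //into a register and never re-read inside the loop.  
 unsigned int cachedFlags = p_RenderCommandBuffer->flags;  
 while((cachedFlags & K15_RENDER_COMMAND_BUFFER_FLIP_FLAG) > 0);  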

The volatile keyword tells the optimizer to leave accesses to this particular variable alone and to flat out re-read its value every time it is accessed.

Although the code worked as intended, the performance was nowhere near what I was expecting.

Thinking about what the volatile keyword really did to the above code made it perfectly clear to me *why* the code wasn't running as fast as I thought it would.

As a consequence of adding the volatile keyword to the flags variable, the value of the variable had to be re-read on every loop iteration instead of being kept in a register (as described above). That constant re-reading was a big part of what made the busy waiting so slow. Depending on the hardware implementation, the render thread, which was writing to the flags variable, probably also had to flush the cache line holding the flags variable and write it back whenever the other core requested its value, so the line kept bouncing between the two cores. And on top of that, the busy-waiting thread was burning a full core without doing any useful work. All of that made my naive approach not very performant.

After some refactoring the code now uses a semaphore for synchronization which in turn yields the performance I was expecting.
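
For illustration, here is a minimal sketch of the "block instead of spin" idea using standard C++11 primitives (a mutex and a condition variable); the engine itself uses a semaphore, and all the names below are made up for this example:

 #include <mutex>  
 #include <condition_variable>  
  
 //Illustrative sketch: threads that want to add commands block here until the  
 //render thread signals that the buffer flip is done - no busy waiting involved.  
 struct K15_FlipGuard  
 {  
    std::mutex mutex;  
    std::condition_variable flipDone;  
    bool flipping = false;  
 };  
  
 void K15_WaitForFlip(K15_FlipGuard* p_Guard)  
 {  
    std::unique_lock<std::mutex> lock(p_Guard->mutex);  
    //The waiting thread sleeps instead of burning cycles until it gets woken up.  
    p_Guard->flipDone.wait(lock, [p_Guard]{ return !p_Guard->flipping; });  
 }  
  
 void K15_SignalFlipDone(K15_FlipGuard* p_Guard)  
 {  
    {  
       std::lock_guard<std::mutex> lock(p_Guard->mutex);  
       p_Guard->flipping = false;  
    }  
    p_Guard->flipDone.notify_all();  
 }  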

Thursday, December 4, 2014

Readability Snippet #1 : Pass by reference

When passing parameters by reference in C++, you basically have two options:

Snippet #1: Pass the reference of the variable you want to pass to a function
Example:

 void foo(int& myIntParameter)  
 {  
   //... do something with myIntParameter  
 }  
 int main(int argc, char** argv)  
 {  
   int myIntVariable = 5;  
   foo(myIntVariable);  
 }  

Snippet #2: Pass the address of the variable you want to pass to a function
Example:

 void foo(int* myIntParameter)  
 {  
   if (myIntParameter)  
   {  
    //... do something with myIntParameter  
   }  
 }  
 int main(int argc, char** argv)  
 {  
   int myIntVariable = 5;  
   foo(&myIntVariable);  
 }  


Now, when reading code where variables get passed by reference like in snippet #1, it is not immediately visible at the call site whether the variable gets passed by value or by reference, as the call syntax is exactly the same for both. You have to look up the actual function declaration to find out whether the variable gets passed by reference or by value. I consider this tedious, as you always have to browse through your source code to find out whether a function call could *potentially* alter your variable's value.

In this post, I'd like to present a way to make call by reference function calls more readable.

Snippet #3: Creating macro identifier for call by reference via actual variable reference

When using call by reference like in snippet #1, you could define some kind of macro qualifier that you can add to the function call.

Example:
 #define ref  
 void foo(int& myIntParameter)  
 {  
 }  
 int main(int argc, char** argv)  
 {  
   int myIntVariable = 5;  
   foo(ref myIntVariable);  
 }  

Unfortunately this has the disadvantage of not doing any compile-time checks: the ref macro expands to nothing, so the compiler can't warn you if you forget it or sprinkle it on a parameter that is actually passed by value.

------

Another solution would be to not use pass-by-reference calls like in snippet #1 at all and to just pass the address of the variable you want to modify. This method, however, usually introduces some sanity checks like null pointer checks, so you basically trade compactness for readability.

Monday, June 2, 2014

Weekend prototype #1 C/C++ Enum Parsing

This weekend I prototyped a command line tool, written in Lua, that parses C/C++ files to get a list of all enums used in the parsed codebase.

Once the codebase has been parsed, the tool will create a new header file (or rewrite an existing one) and save all exported enums as string arrays.

Consider the following C/C++ Header:
 #ifndef _Transform_h_  
 #define _Transform_h_  
 class Transform {  
    enum TransformSource {  
      TS_NONE = 0,  
      TS_PARENT,  
      TS_SCENE,  
      TS_COUNT  
    };   
    Transform();  
    ~Transform();  
    //etc...  
 };  
 #endif //_Transform_h_  

If this file served as input to the enum parsing tool, it would create the following output:
 const char* TransformSourceStr[] = {  
    "TS_NONE",  
    "TS_PARENT",  
    "TS_SCENE",  
    "TS_COUNT"  
 };   
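
Since the generated array lines up with the enum values, converting an enum value to its name becomes a simple array lookup. A quick illustrative usage (assuming the generated header from above is included):

 #include <cstdio>  
  
 int main()  
 {  
    //The enum value doubles as the index into the generated string array.  
    printf("%s\n", TransformSourceStr[1]); //prints "TS_PARENT"  
    return 0;  
 }  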

More information about the tool and the source code can be found in the BitBucket repository.

Thursday, February 6, 2014

C++ Build Preprocessing Thoughts

Something that's been on my mind lately is how you could automate many otherwise tedious and repetitive programming tasks by using custom build preprocessing tools.

If you're familiar with Qt's meta object compiler (moc) then you should have a good understanding of what I'm trying to talk about in this post.

When I first thought of useful C/C++ build preprocessing steps I was adding reflection and serialization to my engine.

Serialization, for instance, was just filling out a virtual serialize() function, where each member variable of a class was written to a binary stream.

Basically it was like this:

 void Example::serialize(Stream& p_Stream)  
 {  
    p_Stream << m_Value1;  
    p_Stream << m_Value2;  
    p_Stream << m_Value3;  
    //custom class  
    m_Value4->serialize(p_Stream);  
    //...and so on  
 }  

As you can imagine, this got tedious and boring pretty quickly... and that was when I thought about a more elegant solution for this.

A few days later, build preprocessing came to my mind.

The basic idea is to use some kind of parser tool (clang anyone? Getting the hang of clang can be pretty tedious, but it will pay off eventually) to parse your C/C++ code and identify variables by a custom type qualifier. The information you gather by identifying your variables like this can be used to automatically generate various functions (like the serialize() function in this example).
In my codebase this would look like:

 #define serializable  
 class Example  
 {  
 public:  
    Example();  
    ~Example();  
 private:  
    serializable int m_Value1;  
    serializable float m_Value2;  
    serializable bool m_Value3;  
    serializable CustomClass m_Value4;  
 };  

The tool I built (in Lua - still learning how to use clang) scans my codebase for variables with the custom qualifier 'serializable'. It then saves these variables and automatically creates the serialize() function from the information it gathered while parsing the codebase.
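
Assuming the Example class also declares a serialize() member (or the generator injects that declaration), the generated output would conceptually look like the hand-written version from above:

 //Conceptual output of the code generator for the Example class above.  
 void Example::serialize(Stream& p_Stream)  
 {  
    p_Stream << m_Value1;  
    p_Stream << m_Value2;  
    p_Stream << m_Value3;  
    //custom classes are serialized through their own serialize() function  
    m_Value4.serialize(p_Stream);  
 }  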

This approach has saved me several hours of tedious code duplication so far :)

Some more examples I thought of that would make sense:

  • Code reflection (again using custom qualifiers)
  • Enum <--> String conversion 

Friday, October 18, 2013

Lua IDE Decoda is now open source and free

Decoda is an IDE for the scripting language Lua. It's finally free and open source. Be sure to download it now if you're working with Lua as it is by far the best Lua IDE out there.

http://unknownworlds.com/decoda/


Monday, October 14, 2013

Memory management 2/2

Wow, after a year I finally managed to write this post.

Well, as promised, we'll take a look at some memory management implementations (stack allocator, block allocator, pool allocator).

Motivation

Ok, so what's the motivation behind all this? Why would we even need something like memory management? What's wrong with new and delete (or malloc and free if you're a C programmer)?

Well basically there's nothing wrong with using new and delete. It all depends on your use-case, code base complexity and overall performance.

Many developers out there will tell you to use malloc/new as little as possible and that you're doomed if you don't listen to them. Well, that is probably true for AAA titles, but if you're doing a small indie game or are just getting started with a little game, then there's IMO nothing wrong with using malloc/free or new/delete.

If you're however working on a bigger game or even a console game, memory management will be unavoidable.

The reason for that is primarily performance and secondarily memory profiling/organisation (e.g. to make sure each subsystem stays within its memory budget).

Whenever you request memory by calling new/malloc, your operating system will most likely switch into the so-called 'Kernel Mode' to search for free memory to satisfy your request. (Some malloc/new implementations grab a bigger block of memory than you requested and serve subsequent requests from that block as long as it has enough free memory left - but even those implementations have to switch to 'Kernel Mode' to get that block in the first place, they just don't do it for every single malloc/new call.)

The switch from 'User Mode' to 'Kernel Mode' and back is considerably slow. If you have a high number of allocations per frame, you'll run into serious performance problems using raw new/malloc to allocate memory.

Solution

The solution for this problem is - once you know it - pretty simple.

I'll only tell you how I implemented memory management in my engine. Other engines / games / projects will probably use different implementations / strategies...So don't get too fixed on this ;)

The solution is - in short - to just allocate a big block of memory using malloc/new and work with this block.
So instead of calling malloc/new each time you need memory, you use memory from one big block that gets allocated via malloc/new at the beginning of your application (for example). This eliminates the cost of switching to 'Kernel Mode' and back, and you can use some neat memory allocation patterns to make the best use of the allocated memory block.
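
A minimal sketch of that idea (the 64 MiB budget is just an arbitrary example number):

 #include <cstdlib>  
  
 //Grab the application's whole memory budget once at startup and hand out  
 //slices of it through custom allocators instead of calling malloc again.  
 static const size_t APPLICATION_MEMORY_BUDGET = 64 * 1024 * 1024; //64 MiB  
 static unsigned char* g_ApplicationMemory = 0;  
  
 void initializeApplicationMemory()  
 {  
    g_ApplicationMemory = (unsigned char*)malloc(APPLICATION_MEMORY_BUDGET);  
 }  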

Another benefit of this approach is IMO that you can control exactly how much memory your application allocates. So if you're working on a console with very limited memory, you can just allocate all available memory at the beginning of the application and then use that to satisfy all your allocations. Once all memory has been handed out and an allocation can not be satisfied, you can print a message and return NULL. (Most consoles will probably just crash once you request more memory than the console has to offer, so printing a message is definitely a better way than just crashing.)

The big question now is: "What do we do with the memory block once we've allocated it?" Unfortunately you can't call malloc/new on that block to get slices of memory from it. What you'll have to do is implement your own memory allocation pattern to hand out memory from that block.

Memory allocation patterns are what the next chapter is about.

Memory allocation patterns


Stack Allocator

The easiest pattern to implement would be the stack allocator.

Basically the stack allocator just consists of a pointer to the current position in the memory block - it's like an offset from the start of the memory block.

For each allocation request, the stack allocator will return the current pointer. After satisfying the request the pointer will get increased by the size of the memory request.

Example:
At the beginning, the memory offset of the stack allocator will be 0, so the current pointer points to the start of the memory block (as the current pointer = start of memory block + offset).

Let's assume the user wants to allocate 10 bytes from the stack allocator. What happens is that the current pointer gets returned and the offset gets increased by the size of the requested memory (in this example, 10).

To clarify that example here's some code:
 byte* beginning_of_memory_block = getMemoryBlock();  
 byte* requested_memory = beginning_of_memory_block + offset;   //offset = 0
 offset += requested_size; //requested_size = 10 in this example  
 byte* current_pointer = beginning_of_memory_block + offset;  //offset = 10
 setCurrentPointer(current_pointer);  
 return requested_memory;  

This is simplified pseudo-code to illustrate the concept of the stack allocator.

Deallocation is even simpler. Just decrease the pointer by the size of the deallocated memory.

Note: Deallocation has to happen in the reverse order of allocation. That's the big drawback of using a stack allocator: you can't allocate and deallocate memory in arbitrary order!
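
Putting the pieces together, a minimal stack allocator class could look something like this (a simplified sketch without alignment handling; the names are made up for this example):

 class StackAllocator  
 {  
 public:  
    StackAllocator(unsigned char* p_MemoryBlock, unsigned int p_Size)  
      : m_Memory(p_MemoryBlock), m_Size(p_Size), m_Offset(0) {}  
  
    void* allocate(unsigned int p_Size)  
    {  
      if(m_Offset + p_Size > m_Size)  
      {  
        return 0; //out of memory  
      }  
      void* memory = m_Memory + m_Offset;  
      m_Offset += p_Size; //move the "top" of the stack forward  
      return memory;  
    }  
  
    void deallocate(unsigned int p_Size)  
    {  
      //deallocation has to happen in reverse allocation order!  
      m_Offset -= p_Size;  
    }  
  
 private:  
    unsigned char* m_Memory;  //start of the big memory block  
    unsigned int m_Size;      //total size of the block  
    unsigned int m_Offset;    //current "top" of the stack  
 };  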

Pool allocator

The pool allocator is also not that complicated to implement.

The theory behind a pool allocator is that it splits a memory block into memory chunks of a specific size (for example the size of a user-defined class). Each time you request memory, the pool allocator goes through its chunks and checks if there's a free chunk - if there is, it returns that chunk.

There are multiple ways to implement a pool allocator.
One of the simpler implementations would be to use an array.

If you were to write a pool allocator class, you could use C++ templates to create that array on the stack.
 template<unsigned int SizePerObject,unsigned int AmountObjects>  
 class PoolAllocator  
 {  
 //... Constructor, Destructor etc.  
 private:  
    unsigned char m_Memory[SizePerObject * AmountObjects];  
 };  

To keep track of used chunks, you could use another array that stores a boolean value per chunk to indicate whether the chunk is free or not:
 template<unsigned int SizePerObject,unsigned int AmountObjects>  
 class PoolAllocator  
 {  
 //... Constructor, Destructor etc.  
 private:  
    unsigned char m_Memory[SizePerObject * AmountObjects];  
    bool m_FreeChunks[AmountObjects];  
 };  

If the user wants to allocate memory from the PoolAllocator, you could implement the allocate function so that it iterates over the array until it finds a free chunk (m_FreeChunks[i] == true). You can then use the position of the free chunk to get the corresponding memory from the memory block itself.
 for(int i = 0;i < AmountObjects; ++i)  
 {  
    if(m_FreeChunks[i])  
    {  
      m_FreeChunks[i] = false;  
      return &m_Memory[SizePerObject * i];  
    }  
 }  
 printf("StackAllocator could not satisfy memory request.");  
 return NULL;  

Deallocation is - like with the stack allocator - simpler than allocation.
For deallocation you just need to set the specific chunk the memory came from to 'not used' (m_FreeChunks[i] = true in our example).
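
In code, that could look roughly like this (a sketch of a deallocate member function for the PoolAllocator above; the chunk index is computed from the returned pointer):

 void deallocate(void* p_Pointer)  
 {  
    //Figure out which chunk the pointer came from and mark it as free again.  
    unsigned int chunkIndex = ((unsigned char*)p_Pointer - m_Memory) / SizePerObject;  
    m_FreeChunks[chunkIndex] = true;  
 }  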

Note: Due to its layout, a pool allocator can be used to allocate and deallocate objects in arbitrary order. However, the pool allocator can not be used to allocate memory of arbitrary size! 


Block allocator


The block allocator is the best of both worlds: allocations/deallocations can be made in arbitrary order and in arbitrary size. But before you get too euphoric and forget about the two previous allocator patterns, note that the allocation algorithm is more complex than the simplistic allocation algorithms of the pool and stack allocator. Therefore, excessive allocation/deallocation using the block allocator will be considerably slower than using a pool or stack allocator.

The theory behind the block allocator is to split the underlying memory block into several smaller memory blocks. But unlike the pool allocator we don't do that at the beginning using a constant size, but rather during allocation with a variable size. 

The used blocks are kept in an internal array (or linked list), and on deallocation the block allocator tries to merge multiple deallocated blocks into a larger block, so that we end up with - again - one large memory block once all memory allocated from the block allocator has been deallocated.
 struct MemoryBlock  
 {  
    unsigned int StartPosition;  
    unsigned int EndPosition;
    bool Free;
 };  
 class BlockAllocator  
 {  
    //Constructor, destructor etc...  
 private:  
    unsigned char* m_Memory;  
    MemoryBlock* m_Blocks[AMOUNT_MEMORY_BLOCKS];  
 };  
This is a simplified declaration of a MemoryBlock struct and the BlockAllocator class.

The algorithm for allocating memory could look like this:
 for(int i = 0;i < AMOUNT_MEMORY_BLOCKS;++i)  
 {  
    if(m_Blocks[i]->Free)  
    {  
      unsigned int newEndPos = m_Blocks[i]->StartPosition + sizeof(AllocatedObject);  
      m_Blocks[i]->EndPosition = newEndPos;  
      m_Blocks[i]->Free = false;  
      createMemoryBlock(); //create a new memory block for the remaining free space  
      return m_Memory + m_Blocks[i]->StartPosition; //return memory  
    }  
 }  
 printf("Block allocator - can not satisfy memory request");  
 return NULL;  

Again, this is just pseudo code ;).
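
Deallocation would then mark the block as free again and try to merge it with free neighbours. A rough sketch (removeMemoryBlock() is a placeholder counterpart to createMemoryBlock(), and the index bookkeeping after a merge is omitted):

 void deallocate(MemoryBlock* p_Block)  
 {  
    p_Block->Free = true;  
    //Try to merge neighbouring free blocks back into one larger block.  
    for(int i = 0; i < AMOUNT_MEMORY_BLOCKS - 1; ++i)  
    {  
      MemoryBlock* current = m_Blocks[i];  
      MemoryBlock* next = m_Blocks[i + 1];  
      if(current && next && current->Free && next->Free &&  
        current->EndPosition == next->StartPosition)  
      {  
        current->EndPosition = next->EndPosition;  
        removeMemoryBlock(next); //remove the now merged block from the array  
      }  
    }  
 }  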

I hope the information provided in this and the last post is enough to give you a basic understanding of memory management in C++ and why it's important to know how to implement your own allocator.

If you still have questions or don't understand my examples, feel free to contact me.

In the next post I'll probably talk about the interaction between your game engine and an external game editor.

Tuesday, September 18, 2012

Memory management 1/2

Hey guys!

As I promised before via Twitter, this post is about the memory management / tracking systems we use in our K15 EngineV2 (I hate that name...gotta change it soon).

Memory leak detection and memory tracking

Memory leaks... Every serious C/C++ programmer has had at least a few of them in their career. They can be hard to track down and fix. Sure, there are tools that help you with that, such as Application Verifier from Microsoft or other debug helpers, but they are either ridiculously slow at run-time or hard to configure.

I've decided to implement a very, very easy to program but nevertheless efficient memory leak detection system which also can be used to track memory usage of the engine at the same time.

Let's start off by creating a new struct to hold all the useful information that we want to track.

 struct MemoryBlock  
 {  
      MemoryBlock *Next;      /*Address of the next memory block*/  
      MemoryBlock *Previous;  /*Address of the previous memory block*/  
      const char *File;       /*File in which the allocation occurred*/  
      unsigned int Line;      /*Line at which the new call happened*/  
      unsigned int Size;      /*Size of the allocated memory*/  
      bool IsArray;           /*Was the memory allocated by new[]?*/  
 };  

The first two members (namely Next and Previous) are to keep track of each MemoryBlock in a linked list (we'll come to that later).

The next members, File and Line, keep track of where exactly the new call for the leaked memory occurred. That gives you a good idea of where to start looking for errors in your code.

The Size member keeps track of how much memory has been allocated, and with IsArray you can check whether the memory was allocated via new[] or new.

The next step would be to overload global new, new[], delete and delete[]. This is easily done:

 void *operator new[](unsigned int iSize,const char* sFile,unsigned int iLineNumber);  
 void *operator new(unsigned int iSize,const char* sFile,unsigned int iLineNumber);  
 void operator delete[](void* pPointer);  
 void operator delete(void* pPointer);  

You may wonder where the parameters sFile and iLineNumber get filled in for the new[] and new operators (iSize gets filled in automatically). For this purpose I created a simple macro that looks like this:

 #define K15_NEW new(__FILE__,__LINE__)  

Whenever you use K15_NEW instead of the normal new (or new[]) operator, the overloaded new operators that we just declared get called (if you're unfamiliar with the __FILE__ and __LINE__ macros, just look here).
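
Usage then looks like this (Example is just a stand-in class for illustration):

 //K15_NEW expands to new(__FILE__,__LINE__), so file and line are recorded automatically.  
 Example* pExample = K15_NEW Example();  
 int* pNumbers = K15_NEW int[16];  
  
 delete pExample;    //routed through our overloaded delete  
 delete[] pNumbers;  //routed through our overloaded delete[]  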

What we do next is declare a function that does all the dirty work for us (allocating the memory, expanding the linked list of MemoryBlock structs, counting the call to new and adding the size of the new memory to an internal counter).

This would be the code of such a function:

 void *Allocate(unsigned int iSize,const char* sFile,unsigned int iLine,bool isArray)  
 {  
      //We need some more size for the MemoryBlock struct.  
      iSize += sizeof(MemoryBlock);  
      //Allocate memory.  
      char *pPointer = (char*)malloc(iSize);  
      //This function fills the MemoryBlock struct and adds it to the linked list.  
      ProtocolMemoryAllocation(pPointer,iSize,sFile,iLine,isArray);  
      //We need to shift the memory so that the memory we return is the memory the user wanted.  
      pPointer += sizeof(MemoryBlock);  
      return pPointer;  
 }  

 void ProtocolMemoryAllocation( void* pPointer,unsigned int iSize,const char* sFile,unsigned int iLineNumber,bool bArray )  
 {  
      MemoryBlock* pBlock = (MemoryBlock*)pPointer;  
      //Set all the flags  
      pBlock->IsArray = bArray;  
      pBlock->Size = iSize;  
      pBlock->File = sFile;  
      pBlock->Line = iLineNumber;  
      //Add block to linked list.  
      _AddMemoryBlock(pBlock);  
      //Increase size of allocated memory (to keep track of the memory usage)  
      ms_iAllocatedMemory += iSize;  
      //Increment the counter that keeps track of new/delete calls.  
      ++ms_iAmountAllocations;  
 }  

The same goes for deallocation:

 void Free( char *pPointer,bool bArray )  
 {  
      //Shift the pointer so that it points to the whole memory block that we allocated previously  
      //(including MemoryBlock).  
      pPointer -= sizeof(MemoryBlock);  
      //Delete the linkedlist entry and do decrease the new/delete counter.  
      ProtocolMemoryDeallocation(pPointer,bArray);  
      //finally free the memory.  
      free(pPointer);  
 }  

 void ProtocolMemoryDeallocation( void* pPointer,bool bArray )  
 {  
       MemoryBlock *pBlock = (MemoryBlock*)pPointer;  
      //Is the memory freed with delete[] if it was allocated with new[]?  
       assert(bArray == pBlock->IsArray);  
      //Decrease the amount of currently allocated memory.  
       ms_iAllocatedMemory -= pBlock->Size;  
      //Decrement the new/delete counter.  
       --ms_iAmountAllocations;  
      //Remove the block from the linked list of MemoryBlock structs.  
       _RemoveMemoryBlock(pBlock);  
 }  

The next thing we need to do to implement the memory tracker is to call the functions we just implemented from our overloaded new, new[], delete and delete[] operators. This is an easy task:

 void *operator new[](unsigned int iSize,const char* sFile,unsigned int sLineNumber)  
 {  
      return Allocate(iSize,sFile,sLineNumber,true);  
 }  
 void *operator new(unsigned int iSize,const char* sFile,unsigned int sLineNumber)  
 {  
      return Allocate(iSize,sFile,sLineNumber,false);  
 }  
 void operator delete[](void* pPointer)  
 {  
      Free((char*)pPointer,true);  
 }  
 void operator delete(void* pPointer)  
 {  
      Free((char*)pPointer,false);  
 }  

To add a little bit of leak detection, we could implement yet another function that iterates over the list of MemoryBlock structs (which should be empty if there's no leak in your application) and prints the desired information into a log file or a message box. Call this function at the end of your application and you're good to go!
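
Such a report function could look roughly like this (_GetFirstMemoryBlock() is a placeholder for however you expose the head of the linked list):

 void DumpMemoryLeaks()  
 {  
      MemoryBlock* pBlock = _GetFirstMemoryBlock();  
      while(pBlock)  
      {  
           //Every block still in the list at shutdown is a leak - print where it came from.  
           printf("Leak: %u bytes allocated at %s:%u (%s)\n",  
                pBlock->Size, pBlock->File, pBlock->Line,  
                pBlock->IsArray ? "new[]" : "new");  
           pBlock = pBlock->Next;  
      }  
 }  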

That's it... We've now implemented a fully working and yet efficient memory tracking and leak detection tool.

The next post will be completely about memory management. I'll introduce two different approaches to memory management (namely a memory pool and a memory heap).