Jan 11, 2012

Compile-Time Hash String Generation

Motivation
Recently, I worked on a code-base where everything was referenced by strings.  It was quite expensive as you can imagine, so to avoid the cost of string comparison, hash values(ints) were used.  This system had a giant hash string registry, which generates a hash value and stores both the hash and actual strings whenever the code sees a new string, well all during run-time.

I personally did not enjoy this system for a number of reasons:

  • it takes up too much memory
    • almost 90% of those strings were never really used as real char * string representation
    • they were simply just look-up keys
    • so no need to store actual string representations: just hash values were enough
    • then we can generate hash values offline
  • hash string registry was quite slow to be multithread-safe. New strings can be registered by any thread, so we had to have a locking mechanism, which turned out to be a bottle neck time to time
  • strings in the executable or data files are easy to read by anyone with a hex editor. Maybe it makes attackers' life a bit easier?

So I thought of doing something better...... (as follows)

We Need Two Different String Types
I think we should adopt two different ways of handling strings for those 90% and 10%.

1. Pure char* String (10%)
Any string that needs actual string representations should be in this format:
  • filenames that the game needs to load sometime (unless your filenames are hashed, too. If you wanna go nuts sure.. but I think it's too extreme)
  • any string that needs to be printed out on the screen.  You really don't want your users to read some hex number like 0xFFFFF, 0x888888 or 0x000000, right? (looking at the picture below.. I think it might be okay... lol)
    Image Source: icanhascheeszburger.com
Since they would be only 10% of the string for most games, probably you can just do strcmp() on them instead of generating and storing hash values for them.  But if you really insist, you can still do the similar thing to what our hash string manger used to do. (still better than having 10 times more strings stored on memory)

2. Hash-Valued String (90%)
So all the other strings, of which char* representations are never used, will be just a simple hash value, an int.  A good example is any strings
  • only used for comparison
  • only used as look-up key
#1(pure char* string) is very straight forward, so from now on I'll only focus on #2(Hash-Valued String) in this post.

Choosing a Good Hash Function: x65599
We all know what hash function is.  Just in case you don't, it's just a function that will "hopefully" generate unique integer values for different strings. (I said "hopefully" because there is chance to have a same vale for tow different strings: we call it hash collision.)  Once you have a unique hash value for each strings, you can simply compare two strings by those hash values instead of comparing every single characters in them What you get out of this? It's faster and simpler.

Then how do you generate a unique hash value for each string?  Well, there a lot of different hash functions. Some have less collisions while the others have more.  Some are faster while the others are slow. You can choose whatever based on your need.  For me, a hash function for game should:
  • have near-zero collision
  • versatile enough to be used as run-time function: you will possibly need to concatenate strings and then generate hash value to compare in your game 
  • be fast enough to be used in game
I did some research and found this amazing comparison chart by Chris Savoie, and I decided the best one for me is x65599. (Also I liked the name quite a lot.  Anything that starts with X = a badass )

I also looked into the implementation (shown below).  It's simple. You keep multiplying 65599. A-Ha! that's why it's called x65599 :D

// A hash function with multiplier 65599 (from Red Dragon book)
unsigned int generateHash(const char *string, size_t len)
{
  unsigned int hash = 0;
  for(size_t i = 0; i < len; ++i)
  {
     hash = 65599 * hash + string[i];
  }
  return hash ^ (hash >> 16);
}

Alright, now I picked the hash function.  Let's move onto what to do in tools-side and code-side.

Saving Data from Tools
If there is any tool that used to save useless char* values into the game data, it's time to change it to save out a hash value, instead.  It's simple.  Just call that function from your tool (or implement same function if the tools use different language from your game).

Simply done.  Moving onto in-game code.

Generating Compile-time Hash Value in Code
Alright let's say, I want to find a bone named "funny_bone".  I would usually write a code like this:

bones.find("funny_bone");

but now the tools export a hash value for this string, so I have to write something like this:

const char * boneToFind = "funny_bone";
bones.find( generateHash(boneToFind, strlen(boneToFind) );

Sure, it would work.  But wouldn't the string "funny_bone" into our executable?  If so, it will use the same or more amount of memory than our old string registry.  But this string is a const string, and compiler "might" know about it.  Can we make sure the compiler does all the necessary computation during compile time and somehow magically spits out assembly code equivalent to this?

// yup 0XF1C66FD7F is the real hash value for "funny_bone"
bones.find( 0xF1C6FD7F );    

I think we can if these two things can happen:

  • inlining the hash function: if compiler choose to not inline the hash function, there is no way it can calculate an int value during the compile-time. So this is a must.  inline keyword is kinda a guideline on most compilers, but I don't think it's a hard thing to achieve.
  • unrolling the hash loop: we are dealing with constant length string, so if we have a way to tell compiler to unroll this and do all the calculation during compile-time, it would work.  

Help Inlining generateHash(const char *, size_t)
First off, I want to make sure the implementation of generateHash(const char*, size_t) function is available for inlining when the compiler compiles the code.  Also I want to allow programmers to use a simple macro without calling strlen(const char *) specifically.  So I decided to add a define like this:

#define HASH_STRING(str) generateHash(str, strlen(str));

With this, my hash.h looks like this:

// compiile-time hash string test
// author: Pope Kim (www.popekim.com)

#include <string.h>
#define HASH_STRING(str) generateHash(str, strlen(str));

// A hash function with multiplier 65599 (from Red Dragon book).
// we put this implementation into header file so that compiler can
// always inline it.
inline unsigned int generateHash(const char *string, size_t len)
{
  unsigned int hash = 0;
  for(size_t i = 0; i < len; ++i)
  {
    hash = 65599 * hash + string[i];
  }
  return hash ^ (hash >> 16);
}

Test Code
Now I made a test code to see if different compilers and optimization options can flat down HASH_STRING(str) into a single integer value in compile-time.

This is my main.cpp:

// compiile-time hash string test
// author: Pope Kim (www.popekim.com)


#include <stdio.h>

#include "hash.h"


int main(int args, char** argv)
{
  unsigned int hashValue = HASH_STRING("funny_bone");
  printf("hash value is 0x%8x\n", hashValue);

  return 0;
}


Now, let's see if the compilers do a good job on this.

With Visual Studio 2010 SP1
I've created a quick win32 console application project and tested it with different optimization flag.  This is what I did.
  1. Change project configuration to Release
  2. To output assembly file, I went to Project Properties > C/C++ > Output Files > Assembler Output, and chose Assembly-Only Listing (/FA)
  3. Then to test different optimization flag, I went to Project Properties > C/C++ > Optimization and compiled with the following settings:
Disabled (/Od)
Two findings:
  • generateHash() function call is not inlined, as expected
  • interestingly, strlen() is optimized: look at push 10


_main PROC ; COMDAT

; File e:\temp\x65599\x65599\main.cpp

; Line 11

push ebp

mov ebp, esp

push ecx

; Line 12

push 10 ; 0000000aH

push OFFSET $SG-5

call ?generateHash@@YAIPBDI@Z ; generateHash

add esp, 8

mov DWORD PTR _hashValue$[ebp], eax

; Line 13

mov eax, DWORD PTR _hashValue$[ebp]

push eax

push OFFSET $SG-6

call DWORD PTR __imp__printf

add esp, 8
; Line 15
xor eax, eax
; Line 16
mov esp, ebp
pop ebp
ret 0
_main ENDP


Minimize Size(/O1)
same as disabled, but the function call is inlined at least.

_main PROC ; COMDAT
; File e:\temp\x65599\x65599\main.cpp
; Line 12
xor ecx, ecx
xor eax, eax
$LL5@main:
movsx edx, BYTE PTR ??_C@_0L@DDOFCBGB@funny_bone?$AA@[eax]
imul ecx, 65599 ; 0001003fH
add ecx, edx
inc eax
cmp eax, 10 ; 0000000aH
jb SHORT $LL5@main
mov eax, ecx
shr eax, 16 ; 00000010H
xor eax, ecx
; Line 13
push eax
push OFFSET ??_C@_0BF@DJEFNLLJ@hash?5value?5is?50x?$CF8x?6?$AA@
call DWORD PTR __imp__printf
pop ecx
pop ecx
; Line 15
xor eax, eax
; Line 16
ret 0
_main ENDP


Maximize Speed(/O2)
You see the first line? push -238617217 which is 0xF1C6FDF?  IT IS ALL FLATTENED DOWN TO AN INT!

; Line 13
push -238617217 ; f1c6fd7fH
push OFFSET ??_C@_0BF@DJEFNLLJ@hash?5value?5is?50x?$CF8x?6?$AA@
call DWORD PTR __imp__printf
add esp, 8
; Line 15
xor eax, eax
; Line 16
ret 0
_main ENDP



So I opened the resulting .exe file in a text editor to see if there's any string named "funny_bone".  And I couldn't find any! YAY!





Full Optimization(/Ox)
Walla! again!

_main PROC ; COMDAT
; File e:\temp\x65599\x65599\main.cpp
; Line 13
push -238617217 ; f1c6fd7fH
push OFFSET $SG-6
call DWORD PTR __imp__printf
add esp, 8
; Line 15
xor eax, eax
; Line 16
ret 0
_main ENDP


But when I searched "funny_bone" in the exe file...

WTF? Why are you there, funny_bone?

Seriously?  Full Optimization option doesn't strip unused strings?  To be sure I tested with another simple program, but the result was same.  Even this very simple string doesn't get stripped out with this compiler option.

void idiot()
{
  const char* idiot = "OMG";
}

I talked to this to Karl, and he found String Pooling option is off for /Ox by default while it is on for both /O1 and /O2. Strange eh?  As soon as I enabled C/C++ > Code Generation > Enable String Polling the string disappeared.  YAY!

With G++
How can I finish a test without trying G++?  I used G++ 4.5.3 to test this with this compiler flags:

g++ *.cpp -pedantic -Wall -S <optimization-flag>

Note that -S flag is to stop generating after assembler

-O0
-O0 flag means no optimization pretty much, so i didn't expect anything.  Still strlen() function seems to be converted to 10, at least.

LFE4:
.def ___main; .scl 2; .type 32; .endef
.section .rdata,"dr"
LC0:
.ascii "funny_bone\0"
LC1:
.ascii "hash value is 0x%8x\12\0"
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB5:
pushl %ebp
LCFI4:
movl %esp, %ebp
LCFI5:
andl $-16, %esp
LCFI6:
subl $32, %esp
LCFI7:
call ___main
movl $10, 4(%esp)
movl $LC0, (%esp)
call __Z12generateHashPKcj
movl %eax, 28(%esp)
movl 28(%esp), %eax
movl %eax, 4(%esp)
movl $LC1, (%esp)
call _printf
movl $0, %eax
leave
LCFI8:
ret

-O1
now generateHash() function is inlined too.  But still doing all the calculation.

.def ___main; .scl 2; .type 32; .endef
.section .rdata,"dr"
LC0:
.ascii "funny_bone\0"
LC1:
.ascii "hash value is 0x%8x\12\0"
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB5:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
andl $-16, %esp
LCFI2:
pushl %ebx
LCFI3:
subl $28, %esp
LCFI4:
call ___main
movl $LC0, %eax
movl $LC0+10, %ebx
movl $0, %edx
L2:
imull $65599, %edx, %edx
movsbl (%eax), %ecx
addl %ecx, %edx
addl $1, %eax
cmpl %ebx, %eax
jne L2
movl %edx, %eax
shrl $16, %eax
xorl %eax, %edx
movl %edx, 4(%esp)
movl $LC1, (%esp)
call _printf
movl $0, %eax
addl $28, %esp
popl %ebx
LCFI5:
movl %ebp, %esp
LCFI6:
popl %ebp
LCFI7:
ret

-O2
Same as -O1.  Of course, loop unroll optimization is only enabled with -O3.

.def ___main; .scl 2; .type 32; .endef
.section .rdata,"dr"
LC0:
.ascii "funny_bone\0"
LC1:
.ascii "hash value is 0x%8x\12\0"
.text
.p2align 4,,15
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB5:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
andl $-16, %esp
LCFI2:
subl $16, %esp
LCFI3:
call ___main
movl $LC0, %eax
xorl %edx, %edx
.p2align 4,,7
L2:
imull $65599, %edx, %edx
movsbl (%eax), %ecx
addl $1, %eax
addl %ecx, %edx
cmpl $LC0+10, %eax
jne L2
movl %edx, %eax
shrl $16, %eax
xorl %edx, %eax
movl %eax, 4(%esp)
movl $LC1, (%esp)
call _printf
xorl %eax, %eax
leave
LCFI4:
ret

Also I tried to search the string in the exe file, but it was not there.. YAY!

-O3
Finally... Seeing  movl $-238617217, 4(%esp)?  It's flattened down!

.def ___main; .scl 2; .type 32; .endef
.section .rdata,"dr"
LC0:
.ascii "hash value is 0x%8x\12\0"
.text
.p2align 4,,15
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB5:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
andl $-16, %esp
LCFI2:
subl $16, %esp
LCFI3:
call ___main
movl $-238617217, 4(%esp)
movl $LC0, (%esp)
call _printf
xorl %eax, %eax
leave
LCFI4:
ret

-Os
-Os stands for size.  And it didn't do good :(

LFE4:
.def ___main; .scl 2; .type 32; .endef
.section .rdata,"dr"
LC0:
.ascii "funny_bone\0"
LC1:
.ascii "hash value is 0x%8x\12\0"
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB5:
pushl %ebp
LCFI7:
movl %esp, %ebp
LCFI8:
andl $-16, %esp
LCFI9:
subl $16, %esp
LCFI10:
call ___main
movl $10, 4(%esp)
movl $LC0, (%esp)
call __Z12generateHashPKcj
movl $LC1, (%esp)
movl %eax, 4(%esp)
call _printf
xorl %eax, %eax
leave
LCFI11:
ret


Quick Summary
So let me quickly summarize which optimization flags you should use for VS 2010 SP1 and G++ to use this amazing(maybe a bit stupid if you are cynical :P) trick.

Visual Studio 2010 SP1
  • /O2
  • /Ox with Enable String Pooling on
G++ 4.5.3
  • -O3

Debugging
Alright.  It's good we removed the string representations from code.  The size of the executable is smaller, but it's horrible for debugging.  When I got a crash on a bone "named" 0xF1C6FD7F, how do I know which bone I have to look at in 3DS Max?  ARRGHH.. Okay, so apparently we really need char* string for at least debugging!

So should I revert everything I did so far? I don't want to. :)  This data is only useful for debugging, so I should devise a way of debugging it without affecting the disc build.  Actually, it is a very easy problem. I just need a string database, which has both pairs of <hash key, char*> for all the strings saved out from the data baking tools and used in our code.

Generating Debug String Database
In what format, should I store our string database? Using a light-weight SQL DB file is definitely an option.  What about SQL-lite?  I heard good things about it.  Or I can simply outputs a list of pairs of <int, char*> values into a text file. I would probably choose a plain text file, or compressed text file, to make it easier to load the file in C++ codebase. Whatever it is, the filename will be debug.string_db.

Saving Debug String DB from Tools
So now I just need to change my tools to open debug.string_db file and insert any new string entries into this while it's saving game-ready data.

Very simply done!

Saving Debug String DB from Codebase
Then what do we do with the strings in the code?  Luckily enough, I defined HASH_STRING("") macro.  I can just write a C# or python script searching through all the text files in my codebase to find any char* strings wrapped by that macro.  Regular expression? sure.  Then I would run same hash generation code to get the hash key, and write this into debug.string_db file.

Not very simple.  But easy and fast enough.... done!

Looking Up String Values
So now the question is how we can easily find what string it is while we are in Visual Studio. 

String Lookup Tool
Do you remember using DirectX Error Lookup tool that comes with DirectX SDK?  It looks like this:



Maybe I can write a tool like this.  This tool would simply open debug.string_db file and find any hash value I type in.  Yeah I can just copy and paste the hash value from Visual Studio Watch window into this.  Yup, it's a bit annoying but it's easy to make.

Visual Studio Plug-in?
Next thought I have is something I'm not sure if it's possible because I don't know too much about making a Visual Studio plug-in.

If a Visual Studio plug-in is allowed to read a text file(or a SQL DB) and modifies what shows up in Watch window, maybe I can do this.  It's just something I need to take some time and see if it's possible.. I haven't done it so far.  I think I'll be fine with the string lookup tool for now

In-Game Hash/String Registry
Or I can make a debug-only hash/string registry, which simply loads debug.string_db file.  Then anyone can find out the actual string representation easily in the code.  This will be only available on debug build, and the loading code will simply disappear in the disc build.

Maybe I'm Crazy But
While I was thinking about the last option for looking up string values, I found this is actually very close to how the localization database works.  You have a string id(key) and the value of the given language.  If you want to swap to different language you will just load different localization database file.  In the files, all the keys are same as before, just the string values are different.

So, if one decides to implement the in-game hash/string registry option for debugging, maybe he can use the same architecture for localization?  I don't know too much about localization database, so I might be just being stupid. I just had this crazy idea. heh =)


One BIG Problem, Though
I was excited for a while.  Then Noel, my buddie and one of the best programmers I've met in this industry, asked me if the loop unroll works if it the string length is longer.  So I did a quick test.

With Visual Studio 2010 SP1
I found Visual Studio 2010 SP1 works only up to 10 characters.  This is the result I got when I used 11 character-long string "funny_bone1".

_main PROC ; COMDAT

; File e:\temp\x65599\main.cpp
; Line 12
xor ecx, ecx
xor eax, eax
npad 12
$LL5@main:
movsx edx, BYTE PTR $SG-5[eax]
imul ecx, 65599 ; 0001003fH
inc eax
add ecx, edx
cmp eax, 11 ; 0000000bH
jb SHORT $LL5@main
mov eax, ecx
shr eax, 16 ; 00000010H
xor eax, ecx
; Line 13
push eax
push OFFSET $SG-6
call DWORD PTR __imp__printf
add esp, 8
; Line 15
xor eax, eax
; Line 16
ret 0
_main ENDP


With G++
G++ did a better job.  G++ works up to 17 characters. This is the result I got when i used 18 character-long string "funny_bone12345678"


.def _main; .scl 2; .type 32; .endef
_main:
LFB5:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
andl $-16, %esp
LCFI2:
subl $16, %esp
LCFI3:
call ___main
movl $LC0, %eax
xorl %edx, %edx
.p2align 4,,7
L2:
imull $65599, %edx, %edx
movsbl (%eax), %ecx
addl $1, %eax
addl %ecx, %edx
cmpl $LC0+18, %eax
jne L2
movl %edx, %eax
shrl $16, %eax
xorl %edx, %eax
movl %eax, 4(%esp)
movl $LC1, (%esp)
call _printf
xorl %eax, %eax
leave
LCFI4:
ret



What Does This Mean?
This means it will optimize strings with length equal to or less than 10 or 17 depending on your compilers. For other strings, it'll do all the calculation during run-time. Is it still worth it?  I think so. You will just need to follow this guideline:

  • if possible, keep look-up key strings short
  • don't call HASH_STRING() macro multiple times on a same string.  In other words, cache the hash value somewhere(e.g, as a member variable of an object)

Also there's a chance that future compilers do a better job at unrolling this.  Up to 64 characters would be nice. (Unfortunately, VS 2011 Preview doesn't do any better job....)

Or constexpr from C++11 can be used for this one day?  It's not even being supported by VS 2011 Preview... so whateva......

The best solution would be a compiler switch like this:

inline unsigned int generateHash(const char *string, size_t len)
{
  unsigned int hash = 0;


  #pragma unroll
  for(size_t i = 0; i < len; ++i)
  {
    hash = 65599 * hash + string[i];
  }
  return hash ^ (hash >> 16);
}


This would unroll the loop if len is constant.  I believe the IBM compiler supports something similar to this.  And we have this kinda switch in HLSL.  So why can't we have it in our C++ compilers?

Microsoft, can we please have this option, pretty please?

Walkaround (for now)
Since my original was posted, Mikkel Gjoel let me know Humus had a way to generate compile-time hash more than this limit.

I tested it quickly and it works great.  I tested up to 64 chars! There is an usability issue: you can't have a generic generateHash(const char*) function with specialized generateHash(const char &(string)[N]); functions together.  Compiler gets confused.  So now I have to make two different versions.

From the altdevblogaday link posted in the comment I found a nice way to get around the striked-through issue:


struct ConstCharWrapper
{
    inline ConstCharWrapper(const char* str) : m_str(str) {}
    const char* m_str;
};




inline unsigned int generateHash(ConstCharWrapper wrapper, size_t len)
{
    const char* string = wrapper.m_str;


    // same as before
}


Also I found VS 2010 is not smart enough to know these two are same, thus not flattening down the first even with my way: (g++ is smart enough to know it).

#1
const char * const funny = "funny_bone";
unsigned int hashValue = HASH_STRING(funny);


#2

unsigned int hashValue = HASH_STRING("funny_bone");


But using #1 confuses regex when generating the debug string DB anyways, so probably I can just ignore it.  But how would I force educate other programmers to use the first form if the string is const?  That's something I'm trying to figure out.

Anyways, if I figure out a nice way to use this new info in a less painful way, I'll write another post.  This post has gotten too long already.. ugggh...


9 comments:

  1. Good stuff. That sucks that the VS compiler can only unroll up to 10 loops.

    What we did back at one of my old companies was to have a command-line tool that would take a string "foo" as input and spit out something like this:

    0x48234522 /* hash:"foo" */

    It's pretty easy to hook up a command-line tool in VS to run with the current selection as input, so generating a hash id was just a few clicks and a copy&paste.

    This has all of the advantages of memory and speed (even in debug!), and can be put into an external database just as easily as your macro.

    Of course, this means that the hash function was really hard to change in a big, multi-solution codebase, but the speed tradeoffs were probably worth it.

    ReplyDelete
    Replies
    1. We actually ended up doing it for certain things with vs plugin as u said, but there were enough human mistakes to make me not liking it.

      Maybe the problem was that our preprocessor representation had both char * and int. People often copy n paste and change the string without regenerating the hash val. So it looked like hash collision for a while.

      Delete
  2. Hi Pope! Funny that you would post this article, I was preparing one myself, so I tweaked it to use the same hash function as you :-) I believe there is a way to force inlining using template metaprogramming. However I'm more of a C guy so I came up with a
    working preprocessor implementation based on a personal hash function. It works perfectly with gcc/g++ but I haven't tried it with Visual Studio yet.

    ReplyDelete
  3. Check this out too. (see the history to see humus' way to make it work for even longer chars.. I tested up to 64 chars: all works great both with VS2010 and g++)

    http://twitter.com//BlindRenderer/status/157519159180263424

    There are few usability issues especially a buffer like this: const char constBuffer[65] = "only3"; I'm trying to see if there's a walkaround :)

    Also VS 2010 is not smart enough to know these two are same, thus not flattening down the first one with the Humus way(g++ is smart enough :D):

    #1
    const char * const constBuffer = "rawr";
    unsigned int hashValue = HASH_STRING_CONST(constBuffer);

    #2
    unsigned int hashValue = HASH_STRING_CONST("rawr");

    ReplyDelete
  4. A great article, Pope. You may have already seen this, but if not, this will be of interest to you. http://altdevblogaday.com/2011/10/27/quasi-compile-time-string-hashing/

    ReplyDelete
  5. @all2one: I haven't seen that article, but the way it works seems to be close to the one I read from the boost library.

    Anyways, that altdevblogaday post gives me some trick I can play now, so i'm gonna experiment something more! Thank you!

    ReplyDelete
  6. This is a case where D's compile time function evaluation would be super useful.

    http://www.d-programming-language.org/function.html#interpretation

    That said, how much of an issue are collisions in real code? And if you're exporting the data to an external tool anyway, why not bring an external tool like gperf into the mix and generate a perfect hash table pre-compile? Pre-processing source with another tool is a bit dirty, though.

    ReplyDelete
    Replies
    1. you said it already: my intention was not to use any offline tool. But I guess I failed? :)

      Delete
  7. see here, macro solution:
    http://lolengine.net/blog/2011/12/20/cpp-constant-string-hash

    ReplyDelete