This blog post introduces our newly developed open-source command-line interface (CLI) binary analysis tool, memgram. To show its capabilities we walk through using it to help reverse engineer a file format from a popular video game. Hacking video games for retro consoles first got me into security; hopefully this walk-through will provide insight into some of the techniques involved in reverse engineering file formats, and into using our new tool.
memgram was built to help describe and visualise data structures and files when reverse engineering. Custom data structures can be quickly described in an easily readable TOML-compliant format called a grammar. memgram applies these grammars at offsets in files and displays formatted, prettified output of data based on the grammar.
memgram is heavily inspired by the GUI hex editors:
Here’s an example table of some test prettified data:
This section walks through reverse engineering one of the file formats from a game: building a grammar to describe it, and prettifying the data using memgram. Let's first explore exactly what the file format is, and why we want to reverse engineer it.
The files we're reverse engineering contain a list of entries giving the name, length and offset of files the game uses to store encoded text data. This text data is used in the in-game menus and conversations, for example:
If you want to modify the game by changing the text displayed in-game, or provide a translation to an unsupported language, the first step is understanding what files store the text, how it is formatted and how to extract it.
First we extract the in-game file system. This gives us a directory structure to explore. After some exploration we find folders which may relate to the text displayed on-screen. Notice that each of these folders contains two files: crowd.fs and index.fs. We'll be examining both, and how they relate to each other.
Displaying the contents of ./MessageTable/index.fs using hexdump we get the following output:
Immediately we see what appears to be filenames in ASCII format, the first three being:
We also see that it’s not simply a list of filenames, as before each filename there’s additional data. Looking at the data that appears before the first ASCII filename AdventurerMessageData.btb we get the following 16 bytes:
2c 00 00 00 00 00 00 00 7e 0c 00 00 3d ca 44 49
Let's see if we can find a relationship between the data these 16 bytes represent and other data stored in index.fs or crowd.fs. The 0x2C at the start can be interpreted in a number of ways depending on the file's endianness. Let's assume the value is either a single byte (0x2C) or the start of a little-endian field (i.e. 0x0000002C as a 32-bit value, or 0x000000000000002C as a 64-bit value).
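Before committing to an interpretation, we can test the little-endian hypothesis directly. Below is a small Python sketch (not part of memgram) that unpacks the 16 bytes above as four 32-bit little-endian values; the four-field split is an assumption at this point, since we don't yet know the real field boundaries:

```python
import struct

# The 16 bytes that precede "AdventurerMessageData.btb" in index.fs.
raw = bytes.fromhex("2c000000000000007e0c00003dca4449")

# Hypothesis: four 32-bit little-endian values (field boundaries not yet known).
fields = struct.unpack("<IIII", raw)
print([hex(v) for v in fields])  # ['0x2c', '0x0', '0xc7e', '0x4944ca3d']
```

Reading the bytes this way gives us candidate values to chase through the rest of the file.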
What do we get when we hexdump 64 bytes from offset 0x2C into index.fs? The command hexdump -C -s 0x2C -n 64 MessageTable/index.fs gives us:
Interestingly, we get another 16 bytes that precede a filename:
54 00 00 00 80 0c 00 00 8e 01 00 00 22 40 51 f0
Let’s try the same thing again, this time seeing what’s at offset 0x54 into index.fs:
Again, 16 bytes that precede a file name:
80 00 00 00 10 0e 00 00 e4 02 00 00 1b ff 9d 46
This pattern continues until the last entry in the file. Notice that in the last two images, after the first four bytes we can see other data: for example, in the first we have 0x54000000 followed by 0x800c0000, and in the second we have 0x80000000 followed by 0x100e0000.
We see a pattern: converting the first four bytes of an entry to little-endian format (e.g. 0x54000000 -> 0x00000054 and 0x80000000 -> 0x00000080) and using the result as an offset takes us to the next entry. Let's therefore assume that this is a little-endian field four bytes in length, and that it points to the next entry of a data structure in index.fs.
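The chain-following idea above can be sketched in a few lines of Python (a hypothetical helper, not memgram code; it also assumes the first entry starts at offset 0):

```python
import struct

def walk_entries(index_data: bytes, count: int) -> list:
    """Follow the assumed 'next entry offset' field through index.fs data."""
    offsets = [0]  # assumption: the first entry starts at offset 0
    for _ in range(count - 1):
        # The first four bytes of each entry, read as a little-endian u32,
        # should give the offset of the next entry.
        offsets.append(struct.unpack_from("<I", index_data, offsets[-1])[0])
    return offsets
```

If the hypothesis is right, walking the real file would yield the offsets we found by hand (0x00, 0x2C, 0x54, ...).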
Let’s create a new grammar file in order to describe the format so far:
```toml
[metadata]
name = 'IndexFS Entry'
variable_size_fields = [['','','','']]
multiply_fields = [['','']]

[[fields]]
name = 'Next Entry Offset'
size = 4
data_type = 'Offset'
display_format = 'hexle'
description = 'Offset to next entry in index.fs'
```
If we break down the syntax:
Next we run memgram with the -g, -b, -E and -d flags. These flags do the following:
As you can see we get back a nicely coloured and formatted view of the data in little-endian format.
Now that we have the first piece of data in our unknown structure figured out, it’s time to look at the remaining 12 bytes. Based on the first field we’re going to make some assumptions about the second field and test to see if they are true. We’re going to assume that the second field is also four bytes long and little-endian. Let’s add another field to our grammar and call it “Unknown”:
```toml
[[fields]]
name = 'Unknown'
size = 4
data_type = '???'
display_format = 'hexle'
description = 'N/A'
```
Okay, now let's grab the first three entries for this unknown field using memgram, using the -s flag to specify the offset to the start of each entry:
We get back three values:
00 00 00 00
00 00 0C 80
00 00 0E 10
Perhaps these are offsets too? However, they cannot be offsets into index.fs, because index.fs is only 0x1C0 bytes in total.
Let’s try instead to display data with hexdump at offsets in the other file crowd.fs using these values.
Interesting! We get the ASCII string “BTBF” at the start of each of these offsets. Notice that each of the filenames we found in index.fs earlier ended with .btb. Perhaps crowd.fs contains these files and the field we’re examining provides an offset to the start of each of the files? In order to answer this question we’re going to examine the data between the first “BTBF” (offset 00 00 00 00) and the second “BTBF” (offset 00 00 0C 80) in crowd.fs.
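A quick way to test the "offset into crowd.fs" hypothesis programmatically is to check for the "BTBF" magic at each candidate offset. This is a hypothetical helper, not memgram functionality:

```python
BTB_MAGIC = b"BTBF"

def looks_like_btb_start(crowd_data: bytes, offset: int) -> bool:
    """Return True if the four bytes at offset match the observed magic."""
    return crowd_data[offset:offset + 4] == BTB_MAGIC
```

Running this against crowd.fs at 0x0, 0xC80 and 0xE10 would confirm that each field value lands on the start of an embedded .btb file.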
Since we have the Japanese version of the game we’ll be looking for a text encoding that can represent its two phonetic alphabets hiragana (平仮名) and katakana (片仮名), and its logographic kanji (漢字).
Unlike English, whose 26-character alphabet fits comfortably in a single byte per character, Japanese needs to represent thousands of characters in its text encoding. Because of this, we'll be looking for a text encoding which uses at least two bytes to represent each character. It can often be the case that game companies use their own custom encoding to represent these characters instead of a standardised encoding like Unicode. This is primarily done to save space by only including data for characters which appear in the in-game script. If this were the case we'd also have to reverse engineer the text encoding!
Some of the common encoding schemes I’ve come across are Shift_JIS and Unicode. We’ll be looking for data which looks like it could be either of these and converting the data to both. If we get valid sentences and words in Japanese we’ll know that we’ve picked the correct encoding.
Examining the data starting at offset 0x2B4 in crowd.fs, we see a general pattern of XX 30 for a number of bytes; then at offset 0x318 we see a series of null bytes. The pattern starts again at 0x326 and continues until 0x3A8. We can make an educated guess that this could be a text encoding, and that anything of the form XX 30 could be hiragana, since its phonetic alphabet has fewer than 256 characters (which fit in the XX byte). Because thousands of kanji characters exist, values outside this range, such as 0xBA4E, could represent kanji characters.
We extract the data in crowd.fs from offset 0x2B4 to offset 0xC80 (the next "BTBF" pattern), then try to view the data using the two encoding schemes we picked.
First we try interpreting the extracted data as Shift-JIS:
Oh dear, that’s clearly not the correct encoding! So let’s try interpreting the extracted data as UTF-16-LE:
Nice! We have Japanese. For example, from line 8 to 10 we have the text which appears when you save the game:
That leaves the question, what is the data that appears before this text encoding starts? Perhaps it’s metadata describing where these messages start, and other information about them. That’s a discussion for another day.
Back to the original topic, we can now assume that this second entry in index.fs is an offset to the start of each of these .btb files. We update our grammar file so it now looks like:
```toml
[metadata]
name = 'IndexFS Entry'
variable_size_fields = [['','','','']]
multiply_fields = [['','']]

[[fields]]
name = 'Next Entry Offset'
size = 4
data_type = 'Offset'
display_format = 'hexle'
description = 'Offset to next entry in index.fs'

[[fields]]
name = 'BTB File Start'
size = 4
data_type = 'Offset'
display_format = 'hexle'
description = 'Start offset of .btb file in crowd.fs'
```
Prettifying data via memgram now looks like:
Time to move on to the third field. Again we add a temporary field entry in our grammar, give it a size of four, set display format to little endian and view the data for the next three entries.
We get back three values in our new Unknown third field:
C7E
18E
2E4
How do these values relate to each other? Unlike the previous two fields, they do not increase each entry. We could try seeing what’s at these offsets in crowd.fs like we did with the previous field. However, if they were direct offsets into crowd.fs they would all end up being inside the first .btb file, which does not make much sense.
If we look at our first Unknown field, we see it has a value of 0xC7E. Looking at the next entry's BTB File Start we get a value of 0xC80. Notice that there's a difference of only two between them. Perhaps this unknown field is the length of each file in crowd.fs minus two? To confirm this theory we can add the BTB File Start and Unknown values of the second entry (0xC80 + 0x18E = 0xE0E). 0xE0E is two less than 0xE10, the start of the third entry. If we continue to look at further entries we see this pattern repeat.
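The arithmetic above can be checked mechanically across the first three entries (values taken from the memgram output earlier):

```python
# (BTB File Start, unknown third field) for the first three entries.
entries = [(0x000, 0xC7E), (0xC80, 0x18E), (0xE10, 0x2E4)]

# Hypothesis: start + third_field + 2 == next entry's start.
for (start, length), (next_start, _) in zip(entries, entries[1:]):
    assert start + length + 2 == next_start
```

Both pairs satisfy the relationship, supporting the "length minus two" interpretation.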
We can now add the third field to our grammar:
```toml
[[fields]]
name = 'BTB File Length'
size = 4
data_type = 'Length'
display_format = 'hexle'
description = 'Length of .btb file in crowd.fs - 2'
```
Output via memgram now looks like the image below:
If we get formatted output for the fourth field in the first three entries in index.fs we get the following:
At the moment I can’t find a relationship between these values and entries in index.fs or crowd.fs. My guess would be that these are checksums or properties for each file. We’ll leave this for now and continue onto the ASCII filename.
Finally we are left with variable-length file names at the end of each entry. We can calculate the length of these entries with:
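One way to express this, assuming each entry begins with 16 bytes of fixed fields and the filename region runs to the start of the next entry (the helper name here is mine, not memgram's):

```python
FIXED_FIELDS_SIZE = 16  # four 4-byte fields precede each filename

def filename_region_length(entry_offset: int, next_entry_offset: int) -> int:
    """Bytes available for the filename (including its null padding)."""
    return next_entry_offset - entry_offset - FIXED_FIELDS_SIZE
```

For the first entry (offset 0x00, next entry at 0x2C) this gives 0x1C (28) bytes: enough for "AdventurerMessageData.btb" (25 characters) plus null padding.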
It is possible that the game performs this calculation in order to know when each ASCII filename ends. However, it's also common for functions dealing with text to read a string one character at a time until they hit a 0x00 null byte.
If we look again at our original image of the data in index.fs, we can see that after every filename there is one or more null bytes, followed by the next entry:
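The null-terminated reading strategy looks like this in a Python sketch (again a hypothetical helper, not game or memgram code):

```python
def read_null_terminated_ascii(data: bytes, offset: int) -> str:
    """Read bytes one at a time until a 0x00 terminator, as many games do."""
    end = offset
    while data[end] != 0:
        end += 1
    return data[offset:end].decode("ascii")
```

Applied at the filename position of any entry, this yields the name and skips the trailing null padding.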
We can tell memgram to do something similar by indicating that the field Filename is variable length, and giving it the option null_ascii in the metadata section for the key variable_size_fields.
```toml
name = 'IndexFS Entry'
variable_size_fields = [['','null_ascii','','Filename']]
multiply_fields = [['','']]
```
Now we can add the final entry:
```toml
[[fields]]
name = 'Filename'
size = 0
data_type = 'ASCII'
display_format = 'ascii'
description = 'Name of file in crowd.fs'
```
Our final grammar file now looks like the following:
```toml
[metadata]
name = 'IndexFS Entry'
variable_size_fields = [['','null_ascii','','Filename']]
multiply_fields = [['','']]

[[fields]]
name = 'Next Entry Offset'
size = 4
data_type = 'Offset'
display_format = 'hexle'
description = 'Offset to next entry in index.fs'

[[fields]]
name = 'BTB File Start'
size = 4
data_type = 'Offset'
display_format = 'hexle'
description = 'Start offset of .btb file in crowd.fs'

[[fields]]
name = 'BTB File Length'
size = 4
data_type = 'Length'
display_format = 'hexle'
description = 'Length of .btb file in crowd.fs - 2'

[[fields]]
name = 'Unknown'
size = 4
data_type = '???'
display_format = 'hexle'
description = 'N/A'

[[fields]]
name = 'Filename'
size = 0
data_type = 'ASCII'
display_format = 'ascii'
description = 'Name of file in crowd.fs'
```
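To tie everything together, here is a minimal Python sketch of what the finished grammar describes: parsing one index.fs entry. This is my own illustrative code, not how memgram is implemented:

```python
import struct
from typing import NamedTuple

class IndexEntry(NamedTuple):
    next_entry_offset: int
    btb_file_start: int
    btb_file_length: int  # actual length in crowd.fs is this value + 2
    unknown: int
    filename: str

def parse_entry(index_data: bytes, offset: int) -> IndexEntry:
    """Parse one index.fs entry at the given offset."""
    nxt, start, length, unknown = struct.unpack_from("<IIII", index_data, offset)
    # Variable-length, null-terminated ASCII filename after the fixed fields.
    name_start = offset + 16
    name_end = index_data.index(b"\x00", name_start)
    name = index_data[name_start:name_end].decode("ascii")
    return IndexEntry(nxt, start, length, unknown, name)
```

Chaining parse_entry through each next_entry_offset would enumerate every file stored in crowd.fs.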
memgram can now prettify any entry in these index.fs files. The final output from memgram looks like the below:
Hopefully this has given you an idea about some of memgram’s capabilities, and how it can aid reverse engineering tasks. memgram is only a month old and this is my first program written in Rust. It would be great to add more features, to name a few:
memgram is open-source, so please don’t hesitate to add some of these or your own features or improve existing code!
Complete information on the grammar specification and memgram's other capabilities can be found on the GitHub page.
For our latest research, and for links and comments on other research, follow our Lab on Twitter.