Here is my solution for the Dragon Research Group monthly challenge. I like this kind of challenge, especially because the long deadline lets you work on it in your spare time without any hurry. Everything starts from two files: a memory dump and a file taken from the disk. The goal is to find a secret hidden message.
The intuitive method
The content of the found_in_memory file is just a series of bytes:
b'9f064...bbdf'
A ‘b’ character followed by a series of bytes enclosed in single quotes: a format this specific must mean something. If you are familiar with scripting languages, you will recognize the usual way of writing a bytes literal in Python 3. For now it’s only a hunch, but I’ll keep it in mind while inspecting the other file, found_on_disk. This file is completely different: it’s a mix of obscure byte sequences and readable, meaningful strings. Here are some of them:
Blowfish
Crypto.Cipher
swapcase
MODE_CBC
plaintext
divmod
encrypt
decrypt
snake.py
This is just a part of the string list but I think it’s enough: my Python intuition was right and the PyCrypto package seems to be involved. The first thing I did was try to decrypt the sequence of bytes from the found_in_memory file using the Blowfish cipher in CBC mode. What about the key? The string list obtained from found_on_disk has a suspicious entry: “byJVnn48lJE9gp8sPtDTvTj1dw”. I decided to give it a try. The result was a surprise, because the decrypted string contains the hidden message:
“The work of Carl Friedrich Gauss who in 1809 used the normal density to predict the locations of astronomical objects brought awareness of the density to a wide scientific audience; in consequence it is also called the Gaussian density.”
There are some extra bytes at the beginning and end of the decrypted string. I think they are related to the Blowfish CBC scheme, but at this point I’m only guessing.
So, with the right intuition, you can find the hidden message in a few minutes without knowing what’s behind the content of the found_on_disk file. Anyway, I think we should always learn something from a challenge, so I’ll go on with a detailed approach to identifying the content of found_on_disk (and to explaining the unknown first/last bytes, of course).
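A minimal sketch of this first attempt, under a few assumptions: the dump is the textual form of a hex-encoded bytes literal, the first 8-byte block is the IV (something only the recovered source further down supports), and PyCrypto or a compatible package provides Crypto.Cipher.Blowfish. The function names split_dump and decrypt_dump are hypothetical, and the hex string in the demo is made up, not the real dump.

```python
import ast
import binascii

BLOCK_SIZE = 8  # Blowfish works on 8-byte blocks


def split_dump(text):
    """Turn the textual b'...' literal into raw bytes and split off the first block."""
    raw = binascii.unhexlify(ast.literal_eval(text.strip()))
    return raw[:BLOCK_SIZE], raw[BLOCK_SIZE:]


def decrypt_dump(text, key):
    """Decrypt the dump, assuming the first block is the IV (an assumption)."""
    # Imported lazily so split_dump works even without PyCrypto installed
    from Crypto.Cipher import Blowfish

    iv, ciphertext = split_dump(text)
    return Blowfish.new(key, Blowfish.MODE_CBC, iv).decrypt(ciphertext)


# Made-up hex string, only to show the split (not the challenge data)
iv, ciphertext = split_dump("b'000102030405060708090a0b0c0d0e0f'")
print(iv, ciphertext)
```

With the real file, the call would look like decrypt_dump(open('found_in_memory').read(), b'byJVnn48lJE9gp8sPtDTvTj1dw').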
It’s obvious that found_on_disk is related to Python, but the question is: what kind of file is this? It’s the compiled bytecode version of a Python source file, the kind that normally has a .pyc extension.
There are some cool tools around the net able to recreate the original source code from .pyc code in a few seconds, but I didn’t try any of them. I did almost everything by hand because it’s always good to understand what’s going on, just in case automatic tools are not able to help you. Besides, it’s much more fun! In the next part I’ll show you how to obtain the exact original source code written by the author of the challenge.
.pyc header
A .pyc file has a specific structure: a header followed by a sequence of objects. The header is a concatenation of 3 dwords representing:
– magic: identifies the Python version used to compile the source: “9E 0C 0D 0A” refers to version 3.3
– moddate: the modification timestamp of the source: “Mon Sep 2 10:06:38 2013”
– source_len: refers to the number of bytes of the source file that generated the .pyc: 0x236 bytes are used inside the source code
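The three header fields can be pulled apart with the struct module. This is only a sketch for the 12-byte header layout described above; the function name parse_pyc_header is hypothetical, and the sample header reuses the magic and source length from the text together with a placeholder timestamp that was not read from the real file.

```python
import struct


def parse_pyc_header(data):
    """Split the 12-byte header: magic, modification time, source length."""
    magic, mtime, source_len = struct.unpack("<4sII", data[:12])
    # The first two magic bytes hold a version-specific number,
    # the last two are always b"\r\n".
    (magic_number,) = struct.unpack("<H", magic[:2])
    return {"magic_number": magic_number, "mtime": mtime, "source_len": source_len}


# Sample header: magic and length from the text, placeholder timestamp
sample = bytes.fromhex("9E 0C 0D 0A") + struct.pack("<II", 1378116398, 0x236)
header = parse_pyc_header(sample)
print(header)
```

The magic number comes out as 3230 (0x0C9E), the value used by the 3.3 series.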
.pyc objects
The bytes after the header are part of the declared objects, the file is indeed a sequence of objects. Every object is identified by a specific type (defined by a single byte). Here is the list of the types used inside our found_on_disk file:
#define TYPE_CODE 'c'
#define TYPE_TUPLE '('
#define TYPE_INT 'i'
#define TYPE_UNICODE 'u'
#define TYPE_NONE 'N'
#define TYPE_STRING 's'
To fully understand what’s going on I suggest you take a look at the marshal.c file in the Python sources, because it contains almost all the information about every defined object.
The first object is of type ‘c’ (the byte 0x63 located at offset 0x0C): it’s a TYPE_CODE object. As you can see from marshal.c this type is a sort of container, and from this point I can access a lot of precious information about the defined function:
offset 0x0D: argcount 0 // Number of arguments
offset 0x11: kwonlyargcount 0 // Number of keyword-only arguments
offset 0x15: nlocals 0 // Number of local variables inside the function
offset 0x19: stacksize 4 // Stack size
offset 0x1D: flags 0x64 // Flags
offset 0x21: code TYPE_STRING // Bytecode sequence
offset 0x184: consts TYPE_TUPLE // Sequence of constants used
offset 0x1EE: names TYPE_TUPLE // Sequence of names used
offset 0x34A: varnames TYPE_TUPLE // Sequence of local variable names (it's empty)
offset 0x34F: freevars TYPE_TUPLE // Sequence of free variables (it's empty)
offset 0x354: cellvars TYPE_TUPLE // Names of local variables that are referenced by nested functions
offset 0x359: filename TYPE_UNICODE // The name of the source file
offset 0x366: name TYPE_UNICODE // Function name
offset 0x373: firstlineno 1 // The first line number of the source file .py
offset 0x377: lnotab TYPE_STRING // Array of unsigned bytes used to map bytecode offsets to source code line
The scheme reveals everything you need to know in order to recreate the original file. There’s no other information inside the found_on_disk file because this object covers the entire file: the source is plain module-level code that takes no arguments (argcount=0).
What’s the name of the source file? The “filename” field is a string object (its type byte is 0x75, ‘u’, i.e. TYPE_UNICODE), and with the help of marshal.c I can understand its format and get the name of the original file. The dword after the type byte is the length of the string, and the string comes right after the length:
75 08 00 00 00 73 6E 61 6B 65 2E 70 79
Length is 8 and the name of the file is “snake.py”.
Another quick piece of information is the function name: it’s the standard name “<module>” used to identify module-level code (check compile.c).
To fill the source file with something I have to inspect the bytecode sequence pointed to by the ‘code’ field. It’s defined as a TYPE_STRING, and I know it’s 0x15E bytes long. The first few bytes are:
64 00 00 64 01 00 6C 00 00 6D 01 00 5A 01 00 01...
It doesn’t make much sense in this form. To understand the sequence I’m going to use the disassembler for Python bytecode that ships with Python; you can read about it at http://docs.python.org/3/library/dis.html. Open a Python shell (ideally the same 3.3 version that produced the file, since the bytecode encoding changes between versions) and load the module with “import dis”. Once the module is loaded you can disassemble any bytecode sequence with the dis.dis function. I’ll start with the sequence above:
dis.dis(b'\x64\x00\x00\x64\x01\x00\x6c\x00\x00\x6d\x01\x00\x5a\x01\x00\x01')
The output of dis is:
0 LOAD_CONST 0
3 LOAD_CONST 1
6 IMPORT_NAME 0
9 IMPORT_FROM 1
12 STORE_NAME 1
15 POP_TOP
Nice, the code is perfectly disassembled, but I still have to give a meaning to these disassembled lines. Again, the dis page linked above helps:
LOAD_CONST(consti) // Pushes co_consts[consti] onto the stack
IMPORT_NAME(namei) // Imports the module co_names[namei]
IMPORT_FROM(namei) // Loads the attribute co_names[namei] from the module found in top-of-the-stack (TOS)
STORE_NAME(namei) // Implements name = TOS
POP_TOP() // Removes the TOS item.
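The encoding behind that listing is simple enough to decode by hand: in this bytecode format every opcode is one byte, and opcodes with a value of 90 or more are followed by a 2-byte little-endian argument. The sketch below is not the dis module, just a tiny decoder covering the five opcodes seen so far; the opcode table and the decode function are my own illustration and apply only to the old 3-byte instruction layout shown above.

```python
# Opcode numbers for the handful of instructions seen so far
OPNAMES = {
    0x01: "POP_TOP",
    0x5A: "STORE_NAME",
    0x64: "LOAD_CONST",
    0x6C: "IMPORT_NAME",
    0x6D: "IMPORT_FROM",
}
HAVE_ARGUMENT = 90  # opcodes >= 90 carry a 2-byte little-endian argument


def decode(code):
    """Return a list of (offset, opname, argument-or-None) tuples."""
    result = []
    i = 0
    while i < len(code):
        op = code[i]
        name = OPNAMES.get(op, "<%d>" % op)
        if op >= HAVE_ARGUMENT:
            arg = code[i + 1] | (code[i + 2] << 8)
            result.append((i, name, arg))
            i += 3
        else:
            result.append((i, name, None))
            i += 1
    return result


first_bytes = bytes.fromhex("64 00 00 64 01 00 6C 00 00 6D 01 00 5A 01 00 01")
for offset, name, arg in decode(first_bytes):
    print(offset, name, "" if arg is None else arg)
```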
It’s really simple to understand everything, but there’s something I haven’t told you: where do the constants and names come from? I haven’t written anything about them until now, so the time has come. Look at the TYPE_CODE definition above:
offset 0x184: consts TYPE_TUPLE // Sequence of constants used
offset 0x1EE: names TYPE_TUPLE // Sequence of names used
A tuple is defined as a sequence of objects; the number of elements inside the tuple is given by the dword following the TYPE_TUPLE byte.
The consts tuple contains 9 objects. It’s easy to read them; here is the const list (element index goes from 0 to 8):
0
('Blowfish',)
('Random',)
('pack',)
None
b'byJVnn48lJE9gp8sPtDTvTj1dw'
b''
1
'b'
And now the names list (34 elements):
'Crypto.Cipher', 'Blowfish', 'Crypto', 'Random', 'struct', 'pack', 'binascii', 'block_size', 'bs', 'r', 's', 'x', 'append', 'str', 'swapcase', 'k', 'u', 'key', 'new', 'read', 'iv', 'MODE_CBC', 'cipher', 'plaintext', 'divmod', 'len', 'plen', 'padding', 'encrypt', 'msg', 'print', 'hexlify', 'decrypt', 'dmsg'
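The tuple and string layouts described above can be read with a small recursive parser. This is a sketch covering only a subset of the type bytes listed earlier (‘N’, ‘i’, ‘s’, ‘u’ and ‘(’), not a replacement for marshal.c; the function name read_object is hypothetical and the sample tuple is built by hand for illustration.

```python
import struct


def read_object(buf, pos=0):
    """Parse one object at buf[pos]; return (value, position after it)."""
    kind = buf[pos:pos + 1]
    pos += 1
    if kind == b"N":  # TYPE_NONE: no payload
        return None, pos
    if kind == b"i":  # TYPE_INT: one signed dword
        (value,) = struct.unpack_from("<i", buf, pos)
        return value, pos + 4
    if kind in (b"s", b"u"):  # length dword, then the raw bytes
        (length,) = struct.unpack_from("<i", buf, pos)
        raw = buf[pos + 4:pos + 4 + length]
        value = raw.decode("utf-8") if kind == b"u" else raw
        return value, pos + 4 + length
    if kind == b"(":  # TYPE_TUPLE: element count, then the elements
        (count,) = struct.unpack_from("<i", buf, pos)
        pos += 4
        items = []
        for _ in range(count):
            item, pos = read_object(buf, pos)
            items.append(item)
        return tuple(items), pos
    raise ValueError("unsupported type byte %r" % kind)


# The filename bytes quoted earlier in the text
filename_bytes = bytes.fromhex("75 08 00 00 00 73 6E 61 6B 65 2E 70 79")
print(read_object(filename_bytes))

# A hand-built two-element tuple: (0, None)
sample_tuple = b"(" + struct.pack("<i", 2) + b"i" + struct.pack("<i", 0) + b"N"
print(read_object(sample_tuple))
```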
So, with this precious information I can finally give a meaning to the previously disassembled code:
64 00 00 LOAD_CONST 0 // Load const value 0 (index 0 in const list)
64 01 00 LOAD_CONST 1 // Load the ('Blowfish',) tuple (index 1 in const list)
6C 00 00 IMPORT_NAME 0 // Import module Crypto.Cipher (index 0 in name list)
6D 01 00 IMPORT_FROM 1 // Load the attribute Blowfish from the module Crypto.Cipher on TOS
5A 01 00 STORE_NAME 1 // 'Blowfish' = TOS
01 POP_TOP // Pop TOS
Believe it or not the disasm is the result of the compiled Python instruction:
from Crypto.Cipher import Blowfish
This is the instruction at line #1 of snake.py file. I know it’s at line #1 because “firstlineno” object value is 1 (look inside TYPE_CODE object above).
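You can check this mapping from the other direction by compiling the statement yourself and listing the opcodes. A quick sketch: compile() only builds the code object and does not execute the import, so PyCrypto is not needed. The exact offsets and the surrounding opcodes depend on the interpreter version, so only the import-related names are worth comparing.

```python
import dis

# Build a code object for the single statement; nothing is imported here
code = compile("from Crypto.Cipher import Blowfish", "snake.py", "exec")

opnames = [ins.opname for ins in dis.get_instructions(code)]
print(code.co_names)  # the names tuple of this small code object
print(opnames)        # opcode names, in order
```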
At this point I can load the entire bytecode sequence inside the dis function and I’ll have the compiled program in front of my eyes. Try using the command:
dis.dis(b'\x64\x00\x00\x64\x01\x00\x6C\x00\x00\x6D\x01\x00\x5A\x01\x00\x01\x64\x00\x00\x64\x02\x00\x6C\x02\x00\x6D\x03\x00\x5A\x03\x00\x01\x64\x00\x00\x64\x03\x00\x6C\x04\x00\x6D\x05\x00\x5A\x05\x00\x01\x64\x00\x00\x64\x04\x00\x6C\x06\x00\x5A\x06\x00\x65\x01\x00\x6A\x07\x00\x5A\x08\x00\x64\x05\x00\x5A\x09\x00\x67\x00\x00\x5A\x0A\x00\x78\x27\x00\x65\x09\x00\x44\x5D\x1F\x00\x5A\x0B\x00\x65\x0A\x00\x6A\x0C\x00\x65\x0D\x00\x65\x0B\x00\x83\x01\x00\x6A\x0E\x00\x83\x00\x00\x83\x01\x00\x01\x71\x58\x00\x57\x65\x09\x00\x5A\x0F\x00\x65\x0A\x00\x5A\x10\x00\x65\x09\x00\x5A\x11\x00\x65\x03\x00\x6A\x12\x00\x83\x00\x00\x6A\x13\x00\x65\x08\x00\x83\x01\x00\x5A\x14\x00\x65\x01\x00\x6A\x12\x00\x65\x11\x00\x65\x01\x00\x6A\x15\x00\x65\x14\x00\x83\x03\x00\x5A\x16\x00\x64\x06\x00\x5A\x17\x00\x65\x08\x00\x65\x18\x00\x65\x19\x00\x65\x17\x00\x83\x01\x00\x65\x08\x00\x83\x02\x00\x64\x07\x00\x19\x18\x5A\x1A\x00\x65\x1A\x00\x67\x01\x00\x65\x1A\x00\x14\x5A\x1B\x00\x65\x05\x00\x64\x08\x00\x65\x1A\x00\x14\x65\x1B\x00\x8C\x01\x00\x5A\x1B\x00\x65\x14\x00\x65\x16\x00\x6A\x1C\x00\x65\x17\x00\x65\x1B\x00\x17\x83\x01\x00\x17\x5A\x1D\x00\x65\x1E\x00\x65\x1D\x00\x83\x01\x00\x01\x65\x1E\x00\x65\x06\x00\x6A\x1F\x00\x65\x1D\x00\x83\x01\x00\x83\x01\x00\x01\x65\x16\x00\x6A\x20\x00\x65\x1D\x00\x83\x01\x00\x5A\x21\x00\x65\x1E\x00\x65\x21\x00\x65\x19\x00\x65\x14\x00\x83\x01\x00\x64\x04\x00\x85\x02\x00\x19\x83\x01\x00\x01\x64\x04\x00\x53')
The beginning of the result is:
0 LOAD_CONST 0 (0)
3 LOAD_CONST 1 (1)
6 IMPORT_NAME 0 (0)
9 IMPORT_FROM 1 (1)
12 STORE_NAME 1 (1)
15 POP_TOP
16 LOAD_CONST 0 (0)
19 LOAD_CONST 2 (2)
22 IMPORT_NAME 2 (2)
25 IMPORT_FROM 3 (3)
28 STORE_NAME 3 (3)
31 POP_TOP
32 LOAD_CONST 0 (0)
...
In addition to the instruction names there are some numbers on every line. The numbers on the right are the index of the constant or name used by the instruction; it’s possible to show the actual name of the element, and I’ll tell you how later. The number on the left is the offset of the current instruction inside the bytecode. The first LOAD_CONST starts at offset 0 and is 3 bytes long; the second LOAD_CONST starts at offset 3 and is 3 bytes long, and so on.
I told you how to translate bytecodes into a Python instruction but a question may arise: how many bytecodes do I need to translate a single original instruction of the source code? I mean, “from Crypto.Cipher import Blowfish” instruction starts at bytecode 0 and it ends at offset 0x0F, why? The answer is inside “lnotab” object:
73 2C 00 00 00 (10 01) (10 01) (10 01) (0C 02) (09 02) (06 02) (06 01) … (17 02) (0A 01) (13 01) (0F 02)
It’s a TYPE_STRING object of 0x2C bytes, organized as a sequence of byte pairs. An example is worth a thousand words: applying lnotab to the first decoded instructions I get:
1 0 LOAD_CONST 1st instruction starts here
... from Crypto.Cipher import Blowfish
15 POP_TOP instruction ends here
2 16 LOAD_CONST 2nd instruction starts here
... from Crypto import Random
31 POP_TOP instruction ends here
3 32 LOAD_CONST 3rd instruction starts
... from struct import pack
47 POP_TOP instruction ends
4 48 LOAD_CONST 4th instruction starts
... import binascii
57 STORE_NAME instruction ends
6 60 LOAD_NAME 5th instruction starts
... bs=Blowfish.block_size
Each pair is (bytecode offset increment, line number increment).
First pair of bytes: (10 01). The second instruction of snake.py is at line 1+0x01=2, and it starts from the bytecode at offset 0+0x10=16.
Next pair of bytes: (10 01). The third instruction of snake.py is at line 2+0x01=3, and it starts from the bytecode at offset 16+0x10=32.
The third pair, (10 01) again, puts the fourth instruction at line 4 and offset 48. Fourth pair: (0C 02). The fifth instruction of snake.py is at line 4+0x02=6, and it starts from offset 48+0x0C=60.
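The same arithmetic can be written as a short loop. This is a simplified sketch that just accumulates the pairs; the function name decode_lnotab is hypothetical, and only the first four pairs quoted above are used as input.

```python
def decode_lnotab(lnotab, firstlineno=1):
    """Return (bytecode offset, source line) pairs from an lnotab byte string."""
    offset, line = 0, firstlineno
    table = [(offset, line)]
    # Walk the string two bytes at a time: (offset increment, line increment)
    for i in range(0, len(lnotab) - 1, 2):
        offset += lnotab[i]
        line += lnotab[i + 1]
        table.append((offset, line))
    return table


first_pairs = bytes.fromhex("10 01 10 01 10 01 0C 02")
for offset, line in decode_lnotab(first_pairs):
    print("line %d starts at offset %d" % (line, offset))
```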
Understanding a single snippet is quite intuitive, and after identifying a few more instructions you’ll be able to rebuild the original code really fast. Two more examples of how to retrieve the right instructions, and then I’ll write down the complete source code.
line #17: cipher = Blowfish.new(key, Blowfish.MODE_CBC, iv)
17
162 LOAD_NAME 1 (Blowfish) //
165 LOAD_ATTR 18 (new) // Create "Blowfish.new"
168 LOAD_NAME 17 (key)
171 LOAD_NAME 1 (Blowfish)
174 LOAD_ATTR 21 (MODE_CBC) // Create "Blowfish.MODE_CBC"
177 LOAD_NAME 20 (iv)
180 CALL_FUNCTION 3 (3 positional, 0 keyword pair) // Blowfish.new(key, Blowfish.MODE_CBC, iv)
183 STORE_NAME 22 (cipher) // cipher = Blowfish.new(key, Blowfish.MODE_CBC, iv)
line #28: print(dmsg[len(iv):])
28
320 LOAD_NAME 30 (print)
323 LOAD_NAME 33 (dmsg)
326 LOAD_NAME 25 (len)
329 LOAD_NAME 20 (iv)
332 CALL_FUNCTION 1 (1 positional, 0 keyword pair) // "len(iv)"
335 LOAD_CONST 4 (None)
338 BUILD_SLICE 2 // "len(iv):"
341 BINARY_SUBSCR // "dmsg[len(iv):]"
342 CALL_FUNCTION 1 (1 positional, 0 keyword pair) // print(dmsg[len(iv):])
345 POP_TOP
Easy, right? One last thing: as you can see, these last two disassembled listings contain more information than the previous outputs. Names and constants are visible, so you don’t have to look them up manually in the name list/const list. To obtain something like that you can use a nice Python script by Ned Batchelder available here. The source is really old but it’s still valid. It was written for files compiled with Python 2.x, but with some minor adjustments you’ll be able to adapt it to 3.x too (hint: add support for the source file size field…).
snake.py
Finally, here is the original source code:
from Crypto.Cipher import Blowfish
from Crypto import Random
from struct import pack
import binascii
bs=Blowfish.block_size
r=b'byJVnn48lJE9gp8sPtDTvTj1dw'
s=[]
for x in r:
    s.append(str(x).swapcase())
k=r
u=s
key=r
iv=Random.new().read(bs)
cipher= Blowfish.new(key, Blowfish.MODE_CBC, iv)
plaintext=b''
plen= bs - divmod(len(plaintext),bs)[1]
padding= [plen]*plen
padding= pack('b'*plen, *padding)
msg= iv + cipher.encrypt(plaintext + padding)
print(msg)
print(binascii.hexlify(msg))
dmsg = cipher.decrypt(msg)
print(dmsg[len(iv):])
If you compile this one (“import py_compile” and then “py_compile.compile('snake.py')”) and compare the result with the found_on_disk file, you’ll get at most two differences: the timestamp (there’s nothing you can do to make it match the original) and the size in bytes of the source file (you can reach the right value, 0x236, by simply adding or removing spaces/tabs).
The program performs a Blowfish encryption/decryption (208-bit key) surrounded by some useless instructions. I can now confirm that the hidden message is the one about Gauss, and I can finally explain the unknown bytes in the decrypted message. As the source code shows, they are part of the encryption scheme: the random IV is prepended to the ciphertext, and the plaintext is padded up to a multiple of the block size before encryption.
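The padding step from snake.py can be tried on its own with the standard library. In this sketch the block size is hard-coded to 8 (the Blowfish block size) instead of being read from Blowfish.block_size, and the helper names pad and unpad are mine; unpad is not part of the original program.

```python
from struct import pack

BLOCK_SIZE = 8  # Blowfish block size in bytes


def pad(plaintext, bs=BLOCK_SIZE):
    """Append plen bytes, each with value plen, as snake.py does."""
    plen = bs - divmod(len(plaintext), bs)[1]
    padding = pack("b" * plen, *([plen] * plen))
    return plaintext + padding


def unpad(padded):
    """Drop the padding: the last byte says how many bytes to remove."""
    return padded[:-padded[-1]]


print(pad(b"hello"))     # 5 bytes of text, so 3 padding bytes are added
print(pad(b"12345678"))  # already a full block, so a whole block is added
print(unpad(pad(b"hello")))
```

Note that an input whose length is already a multiple of the block size still receives a full block of padding, so the last byte always tells how much to strip.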
That’s all! Thanks to Dragon Research Group for this nice challenge!