5

I have a program that takes a string (with some structure) as input. My aim is to collect information on what values each character in input is compared to as it gets parsed using concolic execution. What is the right way to go about it?

My subject is a simple recursive descent parser for expressions and it accepts inputs of the form 1+(2*3) etc.

What I have tried so far:

Wrote a program that constructs an initial state with string arguments, executed it through stepping one at a time, and at the end, collect constraints.

import sys
import angr

class Program:
    def __init__(self, exe):
        self.exe = exe
        self.project = angr.Project(exe, load_options={'auto_load_libs': False})

    def set_input(self, arg):
        self.arg1 = arg

        self.initial_state = self.project.factory.entry_state(
                args=[self.exe, self.arg1],
                # does not seem to startup the unicorn engine
                add_options=angr.options.unicorn,
                )

    def run(self):
        state = self.initial_state
        while True:
            succ = state.step()
            if len(succ.successors) > 1:
                raise Exception('more successors %d' % len(succ.successors))
            if not succ.successors: return state
            state, = succ.successors

def main(exe, arg):
    prog = Program(exe)
    prog.set_input(arg)
    res = prog.run()
    print("constraints: %d" % len(res.solver.constraints))
    for i in res.solver.constraints:
        print(i)
    print('done')

if __name__ == '__main__':
    assert len(sys.argv) >= 3
    main(sys.argv[1], sys.argv[2])

When I execute this, I get zero constraints.

Using constrained symbolic arguments

I got the idea that I should make the arguments symbolic, and add constraints to each characters in the input, using state.add_constraints, and look for the constraints produced at the end.

import sys
import angr
import claripy

class Program:
    ARG_PREFIX = 'sym_arg'
    def __init__(self, exe):
        self.exe = exe
        self.project = angr.Project(exe, load_options={'auto_load_libs': False})

    def set_input(self, arg):
        # generate arg1 from individual characters.
        self.arg1 = self.make_symbolic_char_args(arg)

        self.initial_state = self.project.factory.entry_state(
                args=[self.exe, self.arg1],
                add_options=angr.options.unicorn,
                # does not seem to affect the number of constraints created
                # remove_options=angr.options.simplification
                )
        self.constrain_input_chars(self.initial_state, self.arg1a, arg)
        self.string_terminate(self.initial_state, self.arg1a, arg)

    def string_terminate(self, state, symarg, inarg):
        state.add_constraints(symarg[len(inarg)] == 0)

    def constrain_input_chars(self, state, symarg, sarg):
        constraints = []
        for i,a in enumerate(sarg):
            state.add_constraints(symarg[i] == a)

    def run(self):
        state = self.initial_state
        while True:
            succ = state.step()
            if len(succ.successors) > 1:
                raise Exception('more successors %d' % len(succ.successors))
            if not succ.successors: return state
            state, = succ.successors

    def make_symbolic_char_args(self, instr, symbolic=True):
        if not symbolic: return instr
        input_len = len(instr)
        largs = range(0, input_len+1)
        arg1k = ['%s_%d' % (Program.ARG_PREFIX, i) for i in largs]
        self.arg1h = {k:claripy.BVS(k, 8) for k in arg1k}
        self.arg1a = [self.arg1h[k] for k in arg1k]
        return reduce(lambda x,y: x.concat(y), self.arg1a)


def main(exe, arg):
    prog = Program(exe)
    prog.set_input(arg)
    res = prog.run()
    print("constraints: %d" % len(res.solver.constraints))
    for i in res.solver.constraints:
        print(i)
    print('done')

if __name__ == '__main__':
    assert len(sys.argv) >= 3
    main(sys.argv[1], sys.argv[2])

While this results in a large number of constraints at the end, I am at a loss how to interpret these constraints to what I want -- comparisons to input characters. The constraints are fairly complex, but there are lots duplicated portions in each item. Any advice on translating these constraints that look like below to actual character comparisons would be greatly appreciated.

...
<Bool (((if (sym_arg_0_0_8 == 0) then 0x0 else (if (sym_arg_1_1_8 == 0) then 0x1 else (if (sym_arg_2_2_8 == 0) then 0x2 else (if (sym_arg_3_3_8 == 0) then 0x3 else 0x4)))) + 0x7fffffffffeffd4) - 0x7fffffffffeffd4) == strlen_5_64>
<Bool 0x1 <= strlen_5_64>
<Bool ((0#56 .. 0#7 .. __invert__((if ((sym_arg_0_0_8 == 40) && (sym_arg_0_0_8[7:7] == 0)) then 1 else 0))) & 0xff) != 0x0>
<Bool (((if (sym_arg_0_0_8 == 0) then 0x0 else (if (sym_arg_1_1_8 == 0) then 0x1 else (if (sym_arg_2_2_8 == 0) then 0x2 else (if (sym_arg_3_3_8 == 0) then 0x3 else 0x4)))) + 0x7fffffffffeffd4) - 0x7fffffffffeffd4) == strlen_6_64>
<Bool 0x1 <= strlen_6_64>
<Bool ((0#56 .. 0#7 .. (if ((sym_arg_0_0_8 == 49) && (sym_arg_0_0_8[7:7] == 0)) then 1 else 0)) & 0xff) != 0x0>
<Bool (((if (sym_arg_0_0_8 == 0) then 0x0 else (if (sym_arg_1_1_8 == 0) then 0x1 else (if (sym_arg_2_2_8 == 0) then 0x2 else (if (sym_arg_3_3_8 == 0) then 0x3 else 0x4)))) + 0x7fffffffffeffd4) - 0x7fffffffffeffd4) == strlen_7_64>
<Bool 0x2 <= strlen_7_64>
<Bool ((0#56 .. 0#7 .. (if ((sym_arg_1_1_8 == 49) && (sym_arg_1_1_8[7:7] == 0)) then 1 else 0)) & 0xff) == 0x0>
<Bool (((if (sym_arg_0_0_8 == 0) then 0x0 else (if (sym_arg_1_1_8 == 0) then 0x1 else (if (sym_arg_2_2_8 == 0) then 0x2 else (if (sym_arg_3_3_8 == 0) then 0x3 else 0x4)))) + 0x7fffffffffeffd4) - 0x7fffffffffeffd4) == strlen_8_64>
<Bool 0x2 <= strlen_8_64>
...
Problem: too slow

The above approach also takes too long to execute even for inputs a few characters long. I tried executing the above on simple recursive descent parser for expressions with input of a few characters: (1+4/1+4/(1+4)). It took more than 30 minutes to parse!, while the running the parser directly comes out in less than a second. I would be OK with that time if there was no other way out, but this being concolic, I cant shake the feeling that there must be a better way.

Using unconstrained symbolic arguments, but exploring my own

The idea is that I start with the input constraints such that each symbolic character equal to corresponding input. Then I provide my own path of execution such that only the paths that satisfy the input constraints are chosen for further execution. My idea was that since I do not start the original character constraints in the initial state, I can inspect the final state to determine other constraints that accumulated on each character.

import claripy
import sys
import angr

class Program:
    ARG_PREFIX = 'sym_arg'
    def __init__(self, exe):
        self.exe = exe
        self.project = angr.Project(exe, load_options={'auto_load_libs': False})

    def set_input(self, arg):
        # generate arg1 from individual characters.
        self.arg1 = self.make_symbolic_char_args(arg)
        self._constraints = []
        self.constrain_input_chars(self.arg1a, arg)
        self.string_terminate(self.arg1a, arg)
        self.constraints = claripy.And(*self._constraints)
        self.initial_state = self.project.factory.entry_state(
                args=[self.exe, self.arg1],
                add_options=angr.options.unicorn,
                )

    def string_terminate(self, symarg, inarg):
        self._constraints.append(symarg[len(inarg)] == 0)

    def constrain_input_chars(self, symarg, sarg):
        for i,a in enumerate(sarg):
            self._constraints.append(symarg[i] == a)

    def get_satisfying_state(self, state):
        succ = state.step()
        if not succ.successors: return None
        result = [i.satisfiable(extra_constraints=(self.constraints,)) for i in succ.successors]
        assert len(result) == 1
        return result[0]

    def run(self):
        state = self.initial_state
        while True:
            newstate = self.get_satisfying_state(state)
            if not newstate: return state

    def make_symbolic_char_args(self, instr, symbolic=True):
        if not symbolic: return instr
        input_len = len(instr)
        largs = range(0, input_len+1)
        arg1k = ['%s_%d' % (Program.ARG_PREFIX, i) for i in largs]
        self.arg1h = {k:claripy.BVS(k, 8) for k in arg1k}
        self.arg1a = [self.arg1h[k] for k in arg1k]
        return reduce(lambda x,y: x.concat(y), self.arg1a)

def main(exe, arg):
    prog = Program(exe)
    prog.set_input(arg)
    res = prog.run()
    print("constraints: %d" % len(res.solver.constraints))
    for i in res.solver.constraints:
        print(i)
    print('done')

if __name__ == '__main__':
    assert len(sys.argv) >= 3
    main(sys.argv[1], sys.argv[2])

This again did not work, as it takes extremely long time to parse even a few characters.

Other options I have considered
  • Writing a memory hook that monitors when the input string memory is read from. But this wont work for me because the characters may be copied before compared.

I have read the angr docs multiple times now and can't figure out what to do. I cant shake the feeling that it should be simpler as angr is a symbolic execution engine that supports concolic execution, and what I am doing is pretty much plain vanila concolic execution. What can I do next?

Any help would be greatly appreciated!.

1
  • I also need a similar function. Triton may be a better choice for this regard. If you have got any solution?
    – YaFeng Luo
    Commented Jan 4, 2020 at 0:17

0