6

To give an idea of what I'm doing: I am writing a Python parser that will parse DirectX .x text files.

The problem I have is with how the files are formatted. Although I'm writing it in Python, I'm looking for general algorithms for dealing with this sort of parsing.

.x files define data using templates. The format of a template is

template_name {
   [some_data]
}

The goal I have is to parse the file line-by-line and whenever I come across a template, I will deal with it accordingly.

My initial approach was to check if a line contains an opening or closing brace. If it's an open brace, then I will check what the template name is.

Now the catch here is that the open brace doesn't have to occur on the same line as the template name. It could just as well be

template_name
{
   [some_data]
}

So if I use my "open brace exists on the same line as the name" criterion, it won't work for any files that use the latter format.

A lot of languages also use curly braces (though I'm not sure when people would be parsing the scripts themselves), so I was wondering if anyone knows how to accurately get the template name. In other languages it could just as well be a function name, and there aren't any keywords to look for.

2
  • 1
    I don't know how complex this scripting language is, but pretty much every parsing algorithm that's powerful enough to be widely used and studied can handle this. You just have to tell it to skip over whitespace, either by adding a tokenizer in between that filters out noise like this (e.g. PLY) or making the parser skip over whitespace when appropriate. Even the simpler of those algorithms (such as recursive descent and Pratt parsers) are more complex than if '{' in line:, but OTOH they scale to more complicated grammars.
    – user7043
    Commented Jun 28, 2011 at 15:43
  • Googling for a format spec, I found this: paulbourke.net/dataformats/directx
    – PaulMcG
    Commented Jun 29, 2011 at 8:18
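
The whitespace-skipping idea from the first comment can be sketched with a small regex tokenizer. This is only an illustration for the simplified `template_name { ... }` layout shown in the question, not a parser for the real .x format: the names `TOKEN_RE`, `tokenize` and `block_names` are made up for this example, and it assumes there are no comments or quoted strings containing braces.

```python
import re

# A token is a single brace or semicolon, or a run of characters that is
# neither whitespace nor one of those punctuation marks. Whitespace,
# including newlines, is never emitted, so line breaks stop mattering.
TOKEN_RE = re.compile(r"[{};]|[^\s{};]+")


def tokenize(text):
    """Return a flat list of tokens with all whitespace dropped."""
    return TOKEN_RE.findall(text)


def block_names(text):
    """Yield (name, depth) for every '{'; name is the token just before it."""
    depth = 0
    previous = None
    for token in tokenize(text):
        if token == "{":
            yield previous, depth
            depth += 1
        elif token == "}":
            depth -= 1
        previous = token


same_line = "Mesh {\n  1;\n}"
next_line = "Mesh\n{\n  1;\n}"
print(list(block_names(same_line)))  # [('Mesh', 0)]
print(list(block_names(next_line)))  # [('Mesh', 0)]
```

Because the name is found by looking at the previous token rather than the current line, both brace placements give the same result. Working on the whole text (or a token stream) instead of line by line is the main design change; a lexer/parser library does the same thing with better error handling.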

3 Answers

6

If you are using Python then you should look at PyParsing. I have used it with great success for years now.

The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions. The pyparsing module provides a library of classes that client code uses to construct the grammar directly in Python code...

5

Please, don't write yet-another-bad-parser. Take a look at a lexer/parser (e.g. PLY) and do it properly. It will be less painful for you and much less painful for your users.

3
  • 4
    This doesn't answer the question. You presume your use case is Keikoku's use case. Who's to say this isn't a learning exercise? Commented Jun 28, 2011 at 16:08
  • Thanks, I saw a couple of projects that parse C++ and C scripts, which could be useful.
    – MxLDevs
    Commented Jun 28, 2011 at 16:23
  • 1
    @Corbin: Actually it does answer it in the general sense. Studying/using a proper lexer/parser will highlight that this problem has been well-solved, in multiple ways, for at least a few decades. Commented Jun 28, 2011 at 16:29
5

Here is how a pyparsing solution would look:

from pyparsing import *

LBRACE,RBRACE,LBRACK,RBRACK,SEMI = map(Suppress,'{}[];')

TEMPLATE = Keyword("template")
ident = Word(alphas, alphanums+"_")
uuid = Regex(r'<[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}>')
integer = Word(nums)
arrayDim = LBRACK + (integer|ident) + RBRACK

baseType = oneOf("WORD DWORD FLOAT DOUBLE CHAR UCHAR BYTE STRING CSTRING UNICODE")
ARRAY = Keyword("array")
typeRef = baseType | ident

memberDefn = Group(ARRAY + typeRef("type") + ident("name") + 
                       ZeroOrMore(arrayDim)("dims") + SEMI |
                   typeRef("type") + ident("name") + SEMI)
templateDefn = (TEMPLATE + ident("name") + LBRACE + 
    Optional(uuid)("uuid") +
    ZeroOrMore(memberDefn)("members") + 
    RBRACE)


sample = """
some stuff...

template Mesh {
<3D82AB44-62DA-11cf-AB39-0020AF71E433>
DWORD nVertices;
array Vector vertices[nVertices];
DWORD nFaces;
array MeshFace faces[nFaces][4];
}

more stuff...
"""

# searchString scans the text and returns every match of templateDefn
for tplt in templateDefn.searchString(sample):
    print(tplt.dump())
    for mbr in tplt.members:
        print(mbr.dump(indent='  '))
    print(tplt.name, tplt.uuid)

prints:

['template', 'Mesh', '<3D82AB44-62DA-11cf-AB39-0020AF71E433>', ['DWORD', 'nVertices'], ['array', 'Vector', 'vertices', 'nVertices'], ['DWORD', 'nFaces'], ['array', 'MeshFace', 'faces', 'nFaces', '4']]
- members: [['DWORD', 'nVertices'], ['array', 'Vector', 'vertices', 'nVertices'], ['DWORD', 'nFaces'], ['array', 'MeshFace', 'faces', 'nFaces', '4']]
- name: Mesh
- uuid: <3D82AB44-62DA-11cf-AB39-0020AF71E433>
  ['DWORD', 'nVertices']
  - name: nVertices
  - type: DWORD
  ['array', 'Vector', 'vertices', 'nVertices']
  - dims: ['nVertices']
  - name: vertices
  - type: Vector
  ['DWORD', 'nFaces']
  - name: nFaces
  - type: DWORD
  ['array', 'MeshFace', 'faces', 'nFaces', '4']
  - dims: ['nFaces', '4']
  - name: faces
  - type: MeshFace
Mesh <3D82AB44-62DA-11cf-AB39-0020AF71E433>
