Parsing scripts that use curly braces

Question

To give an idea of what I'm doing: I am writing a Python parser that will parse DirectX .x text files.

The problem I have deals with how the files are formatted. Although I'm writing it in python, I'm looking for general algorithms for dealing with this sort of parsing.

.x files define data using templates. The format of a template is

template_name {
   [some_data]
}

The goal I have is to parse the file line-by-line and whenever I come across a template, I will deal with it accordingly.

My initial approach was to check if a line contains an opening or closing brace. If it's an open brace, then I will check what the template name is.

Now the catch here is that the open brace doesn't have to occur on the same line as the template name. It could just as well be

template_name
{
   [some_data]
}

So if I were to use my "open brace exists" criterion, it wouldn't work for any files that use the latter format.

A lot of languages also use curly braces (though I'm not sure when people would be parsing the scripts themselves), so I was wondering if anyone knows how to reliably get the template name. In other languages it could just as well be a function name; either way, there is no keyword to look for.

I don't know how complex this scripting language is, but pretty much every parsing algorithm that's powerful enough to be widely used and studied can handle this. You just have to tell it to skip over whitespace, either by adding a tokenizer in between that filters out noise like this (e.g. PLY) or making the parser skip over whitespace when appropriate. Even the simpler of those algorithms (such as recursive descent and Pratt parsers) are more complex than if '{' in line:, but OTOH they scale to more complicated grammars. — user7043, Commented Jun 28, 2011 at 15:43
Googling for a format spec, I found this: paulbourke.net/dataformats/directx — PaulMcG, Commented Jun 29, 2011 at 8:18
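
The tokenizer idea from the first comment can be sketched with nothing but the standard library. This is only an illustration, not a full .x parser: the names `TOKEN_RE`, `tokenize` and `block_names` are made up for this example, and it assumes the name you want is simply the identifier that immediately precedes each opening brace. Because whitespace (including newlines) is discarded before any decision is made, both brace placements from the question give the same result.

```python
import re

# One alternative per token kind. Whitespace is matched so it can be skipped.
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<ident>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<brace>[{}])
  | (?P<other>[^\s{}A-Za-z_]+)
""", re.VERBOSE)

def tokenize(text):
    """Yield (kind, value) pairs for the whole text, dropping whitespace."""
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "ws":
            yield m.lastgroup, m.group()

def block_names(text):
    """Return the identifier that directly precedes each '{' (assumption:
    that identifier is the name we care about)."""
    names = []
    prev = None
    for kind, value in tokenize(text):
        if kind == "brace" and value == "{" and prev and prev[0] == "ident":
            names.append(prev[1])
        prev = (kind, value)
    return names

same_line = "template_name {\n   1;\n}"
next_line = "template_name\n{\n   1;\n}"
print(block_names(same_line))
print(block_names(next_line))
```

The key design choice is to stop thinking in lines at all: read the whole file (or stream it) into tokens first, then look at the token sequence.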

gnat · Accepted Answer · 2013-08-24 23:08:41Z

6

If you are using Python, then you should look at pyparsing. I have used it with great success for years now.

The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions. The pyparsing module provides a library of classes that client code uses to construct the grammar directly in Python code...

edited Aug 24, 2013 at 23:08

gnat


answered Jun 28, 2011 at 16:07

user7519


Peter Rowell · Accepted Answer · 2011-06-28 15:42:35Z

5

Please, don't write yet-another-bad-parser. Take a look at a lexer/parser (e.g. PLY) and do it properly. It will be less painful for you and much less painful for your users.
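
For readers who want to see what the lexer/parser split looks like before reaching for a tool, here is a minimal stdlib-only sketch. It is not PLY and not a complete .x grammar; it is a hand-written recursive descent parser, and the names `parse_blocks`, `TOKEN_RE` and `IDENT_RE` as well as the sample text are invented for illustration. It assumes a block's name is the identifier immediately before its opening brace and ignores everything else inside the braces.

```python
import re

IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Lexer: identifiers, braces, or any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[{}]|[^\s{}]")

def parse_blocks(text):
    """Return nested (name, children) tuples for every 'name { ... }' block."""
    tokens = TOKEN_RE.findall(text)  # whitespace never becomes a token
    pos = 0

    def parse_body(depth):
        nonlocal pos
        blocks = []
        last_ident = None
        while pos < len(tokens):
            tok = tokens[pos]
            pos += 1
            if tok == "{":
                # Recurse: the nested call consumes tokens up to the matching '}'.
                blocks.append((last_ident, parse_body(depth + 1)))
                last_ident = None
            elif tok == "}":
                if depth == 0:
                    raise ValueError("unexpected '}'")
                return blocks
            elif IDENT_RE.fullmatch(tok):
                last_ident = tok
            else:
                last_ident = None
        if depth > 0:
            raise ValueError("missing '}'")
        return blocks

    return parse_body(0)

# Made-up sample with both brace placements and nesting.
sample = """
Frame Root
{
    FrameTransformMatrix { 1.0; }
    Mesh Body {
        MeshNormals { 0; }
    }
}
"""
print(parse_blocks(sample))
```

A generated or library-based parser gives you the same structure plus error reporting and a real grammar, which is why it is usually the better choice once the format grows.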

answered Jun 28, 2011 at 15:42

Peter Rowell



This doesn't answer the question. You presume your use case is Keikoku's use case. Who's to say this isn't a learning exercise?
– Corbin March
Commented Jun 28, 2011 at 16:08
Thanks, I saw a couple of projects that parse C++ and C scripts, which could be useful.
– MxLDevs
Commented Jun 28, 2011 at 16:23

@Corbin: Actually it does answer it in the general sense. Studying/using a proper lexer/parser will highlight that this problem has been well-solved, in multiple ways, for at least a few decades.
– Peter Rowell
Commented Jun 28, 2011 at 16:29


PaulMcG · Accepted Answer · 2011-06-29 05:16:49Z

Here is how a pyparsing solution would look:

from pyparsing import *

LBRACE,RBRACE,LBRACK,RBRACK,SEMI = map(Suppress,'{}[];')

TEMPLATE = Keyword("template")
ident = Word(alphas, alphanums+"_")
uuid = Regex(r'<[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}>')
integer = Word(nums)
arrayDim = LBRACK + (integer|ident) + RBRACK

baseType = oneOf("WORD DWORD FLOAT DOUBLE CHAR UCHAR BYTE STRING CSTRING UNICODE")
ARRAY = Keyword("array")
typeRef = baseType | ident

memberDefn = Group(ARRAY + typeRef("type") + ident("name") + 
                       ZeroOrMore(arrayDim)("dims") + SEMI |
                   typeRef("type") + ident("name") + SEMI)
templateDefn = (TEMPLATE + ident("name") + LBRACE + 
    Optional(uuid)("uuid") +
    ZeroOrMore(memberDefn)("members") + 
    RBRACE)


sample = """
some stuff...

template Mesh {
<3D82AB44-62DA-11cf-AB39-0020AF71E433>
DWORD nVertices;
array Vector vertices[nVertices];
DWORD nFaces;
array MeshFace faces[nFaces][4];
}

more stuff...
"""

# searchString scans the text and returns every match, skipping unrelated content
for tplt in templateDefn.searchString(sample):
    print(tplt.dump())
    for mbr in tplt.members:
        print(mbr.dump(indent='  '))
    print(tplt.name, tplt.uuid)

prints:

['template', 'Mesh', '<3D82AB44-62DA-11cf-AB39-0020AF71E433>', ['DWORD', 'nVertices'], ['array', 'Vector', 'vertices', 'nVertices'], ['DWORD', 'nFaces'], ['array', 'MeshFace', 'faces', 'nFaces', '4']]
- members: [['DWORD', 'nVertices'], ['array', 'Vector', 'vertices', 'nVertices'], ['DWORD', 'nFaces'], ['array', 'MeshFace', 'faces', 'nFaces', '4']]
- name: Mesh
- uuid: <3D82AB44-62DA-11cf-AB39-0020AF71E433>
  ['DWORD', 'nVertices']
  - name: nVertices
  - type: DWORD
  ['array', 'Vector', 'vertices', 'nVertices']
  - dims: ['nVertices']
  - name: vertices
  - type: Vector
  ['DWORD', 'nFaces']
  - name: nFaces
  - type: DWORD
  ['array', 'MeshFace', 'faces', 'nFaces', '4']
  - dims: ['nFaces', '4']
  - name: faces
  - type: MeshFace
Mesh <3D82AB44-62DA-11cf-AB39-0020AF71E433>
