Implementing parser combinators in a functional manner

Question

I am just starting on my journey of learning clojure (I have just read to the end of chapter 3 in clojure for the brave and true). So for practice, I wrote some parser combinator functions. I am quite familiar with several imperative and object oriented languages, but this is my first time using a lisp language, and my first time programming in the functional paradigms.

How can this code be improved to be more idiomatic clojure code?

How can I condense the sequential parser function? I can write it in 6 lines of python, here it's close to 40 lines.

Also, I am not really sure the commonly accepted way of formatting code, like indentation and bracket placement- is this style commonly used?

Anyways, I welcome any and all feedback.

(ns main.core
  (:gen-class))

; the error message for the parser functions
(def error false)

; Parses a single charactor
(defn lit[c]
  (fn [stream]
    (if (= (first stream) (first (seq c)))
      (list (seq c) (rest stream))
      error
      )
    )
  )

; returns first successful parse
(defn alt[& ps]
  (fn [stream]
    (reduce #(or %1 %2) error (map #(% stream) ps))
    )
  )

; Tries to parse in series, error if any have an error
; This is the python code this function was based on

; It takes in a list of parser functions, tries to run all of them in a row,
; and if any of them fail, returns an error, otherwise it returns the stream,
; which changes each time a parser is run, as well as the results of all the successful parses in a list.

; def SequenceParser(*parsers):
;     def parser(stream):
;         res = []
;         for p in parsers:
;             r, stream = p(stream)
;             res.append(r)
;         return (res, stream)
;     return parser

(defn seqn[& ps]
  (fn [stream]
    (apply
      (defn inner
        ([stm p0]
         (p0 stm)
         )
        ([stm p0 p1]
         ((fn [r0]
            (if r0
              ((fn [r1]
                 (if r1
                   (list (list (nth r0 0) (nth r1 0)) (nth r1 1))
                   error
                   )) (p1 (nth r0 1)))
              error
              )) (p0 stm))
         )
        ([stm p0 p1 & p]
         (def r0 (p0 stm))
         (if r0
           ((fn [r1]
              (if r1
                ((fn [res]
                   (list (list (nth r0 0) (nth r1 0) (nth res 0)) (nth res 1))
                   ) (apply inner (nth r1 1) p))
                error
                ))(p1 (nth r0 1)))
           error
           )
         )
        )
      stream ps)
    )
  )

; Parser that parses as many times as it can, at least once though, or it is an error
(defn many [p0]
  (fn [stream]
    ((fn inner [stm]
       ((fn [res]
          (if res
            ((fn [r1]
               (if r1
                 (list (concat (nth res 0) (nth r1 0)) (nth r1 1))
                 res
                 )
               ) (inner (nth res 1)))
            res
            )) (p0 stm))
       ) stream
     )
    )
  )

; parser that applys a function to the result of another parser
(defn func [p0 f]
  (fn [stream]
    ((fn [res]
       (list (f (nth res 0)) (nth res 1))
       ) (p0 stream))
    )
  )

(defn num-to-int [n]
  (Integer/parseInt (apply str n))
  )

(defn get-val [[res remains]]
  res
  )


(defn -main
  "Main function, this computes the number 1904 from the string '34*56'"
  [& args]

  (def digit (alt (lit "1") (lit "2") (lit "3") (lit "4") (lit "5") (lit "6") (lit "7") (lit "8") (lit "9") (lit "0")))
  (def number (func (many digit) num-to-int))
  (def mul (func (seqn number (lit "*") number), (fn [[a b c]] (* a c))))
  (println (get-val(mul "34*56")))

  )

For reference parser contaminators are functions that generate functions that parse string data. The results of the parser combinators can be combined, hence the name to easily produce complex parsing functions.

http://theorangeduck.com/page/you-could-have-invented-parser-combinators This is a great read that explains them well if anyone is interested.

Carcigenicate · Accepted Answer · 2018-09-22 03:35:57Z

I'm going to admit right off the bat here that I've never written a parser using combinator functions before. I have quite a bit of experience with Clojure though, so I'll be primarily focusing on idiomatic Clojure, and proper functional practice.

I have a few main concerns with your code:

You're using def inside of defn! This is a very bad idea. def (and by extension, defn) create(s) globals that persist even after the function they're in has returned! Run your -main, then check what the value of digit is. You're in for a surprise. It's in scope, and has the last value that was assigned to it. From a functional perspective especially, this is bad. def carries out side effects and is leaking the internal state of your functions into the global scope. Use let instead.

Instead of writing:
```
(def a 1)
(def b 2)
(println (+ a b))
```
Write:
```
(let [a 1
      b 2]
  (println (+ a b)))
```
let creates a restricted scope, and just generally looks much cleaner.
Your brace placement is very reminiscent of imperative style curly brace placement. This is causing a lot of bloat and making your code more verbose than it needs to be.

I'm guessing you're doing this to make matching braces easier for you. If this is the case, don't make your life unnecessarily difficult! Lisps are quite difficult to write unassisted. If you're going to write Clojure, you should spend an afternoon getting familiar with Par-Infer. It infers parenthesis placement based on indentation, so you can write Clojure like it's Python. This is huge. Besides the odd edge case, I never need to manually add a closing brace, or worry about its placement.

I highly recommend IntelliJ with the Cursive plugin; both of which have free community versions. I've been using them for roughly 2 years now, and it's a phenomenal environment to write in.

Just to show the difference, this is what your many function automatically gets formatted as when I paste it into IntelliJ:
```
(defn many [p0]
  (fn [stream]
    ((fn inner [stm]
       ((fn [res]
          (if res
            ((fn [r1]
               (if r1
                 (list (concat (nth res 0) (nth r1 0)) (nth r1 1))
                 res))

             (inner (nth res 1)))
            res))
        (p0 stm)))
     stream)))
```
No more needing to carefully place trailing braces to make debugging brace placement issues easier. Just let Par-Infer handle it for you.
Your variable names are bad. Names like r1 and p0 give very little information, and are the kind of names I'd expect to see post-decompilation. For the sake of you and those who may read your code, more descriptive names are a must.
You have what appears to be documentation as comments above your functions. defn has a built-in ability to handle documentation:
```
(defn alt
  "Returns first successful parse"
  [& ps]
  (fn [stream]
    (reduce #(or %1 %2) error (map #(% stream) ps))))
```
You use this for -main, but nowhere else.

After taking the above suggestions into consideration (except for the variable names), your code would look more like:

(ns main.core
  (:gen-class))

(def error "The error message for the parser functions"
  false)

(defn lit
  "Parses a single charactor"
  [c]
  (fn [stream]
    (if (= (first stream) (first (seq c)))
      (list (seq c) (rest stream))
      error)))

(defn alt
  "Returns first successful parse"
  [& ps]
  (fn [stream]
    (reduce #(or %1 %2) error (map #(% stream) ps))))

(defn seqn
  "Tries to parse in series, error if any have an error"
  [& ps]
  (fn [stream]
    (apply
      (letfn [(inner ([stm p0] ; Letfn is the local equivilent of defn
                      (p0 stm))

                     ([stm p0 p1]
                      ((fn [r0]
                         (if r0
                           ((fn [r1]
                              (if r1
                                (list (list (nth r0 0) (nth r1 0)) (nth r1 1))
                                error)
                             (p1 (nth r0 1))))
                           error)
                        (p0 stm))))

                     ([stm p0 p1 & p]
                      (let [r0 (p0 stm)]
                        (if r0
                          ((fn [r1]
                             (if r1
                               ((fn [res]
                                  (list (list (nth r0 0) (nth r1 0) (nth res 0)) (nth res 1))
                                 (apply inner (nth r1 1) p)))
                               error)
                            (p1 (nth r0 1))))
                          error))))])

      stream ps)))

(defn many
  "Parser that parses as many times as it can, at least once though, or it is an error"
  [p0]
  (fn [stream]
    ((fn inner [stm]
       ((fn [res]
          (if res
            ((fn [r1]
               (if r1
                 (list (concat (nth res 0) (nth r1 0)) (nth r1 1))
                 res)

              (inner (nth res 1))))
            res)
         (p0 stm)))
      stream))))

(defn func
  "Parser that applies a function to the result of another parser."
  [p0 f]
  (fn [stream]
    ((fn [res]
       (list (f (nth res 0)) (nth res 1))
      (p0 stream)))))

(defn num-to-int [n]
  (Integer/parseInt (apply str n)))

(defn get-val [[res remains]]
  res)

(defn -main
  "Main function, this computes the number 1904 from the string '34*56'"
  [& args]

  (let [digit (alt (lit "1") (lit "2") (lit "3") (lit "4") (lit "5") (lit "6") (lit "7") (lit "8") (lit "9") (lit "0"))
        number (func (many digit) num-to-int)
        mul (func (seqn number (lit "*") number), (fn [[a b c]] (* a c)))]

    (println (get-val (mul "34*56")))))

These were the main things. Now I'll just comment on more nitpicky things:

nil is more commonly used to represent bad data, instead of false. Also, unless you think you may want to change the value of error later, I'd also just write nil directly instead of labeling it as error. It's generally understood that nil represents the lack of usable data, which appears to be your intent here.

list is most commonly used when writing macros to create code in the macro (because lisps are homoiconic). Consider just using a vector literal ([]) instead unless you absolutely need list behavior:

(list (seq c) (rest stream))

to

[(seq c) (rest stream)]

Some of your functions are huge! I would definitely consider breaking them down. For example, I don't see why inner is defined inside of seqn; it doesn't form a closure over any local variables. From the looks of it, it's quite complicated and makes up the majority of the functionality of seqn. I'd make it it's own function, then have seqn use it. That will leave you with two smaller, easier to test functions. Also, are you sure inner should be a giant function with multiple arities? It wouldn't be cleaner to have it with fewer arities, or even make it multiple smaller functions? Remember, when programming functionally, the function is considered the smallest unit of code. Functions should be as large as necessary, but as small as possible. If you look at my repositories, I have a ton of small, bite-sized functions that each have an extremely focused, narrow scope. This makes testing very easy, and prevents you from needing to go over a giant wall of code making up a function trying to decipher what it's doing in it's entirety.

For num-to-int, long is the native number type that Clojure uses. If you cast to int, it's just going to be converted to long somewhere down the road, and that could potentially lead to a performance hit. You might also want to make it clearer that "n" is actually a seq. Calling the parameter n then writing (apply str n) is very confusing, since I wouldn't expect n to be a sequence. I would change it to:

(defn num-to-int [n-seq]
  (Long/parseLong (apply str n-seq)))

get-val is basically just first. Unless you really like how the name helps readability, I'd just use first.

(alt (lit "1") (lit "2") (lit "3") (lit "4") (lit "5") (lit "6") (lit "7") (lit "8") (lit "9") (lit "0"))

has a ton of repetition. There's a lot of ways to fix this. I'd lean towards something like this though:

(->> "1234567890"
  (map str) ; Turn each char into a string
  (map lit) ; Turn each string into a literal f
  (apply alt)) ; Then give them to alt

Note that even though map is used twice, this only does a single iteration due to laziness! This could also be shortened using comp:

(->> "1234567890"
  (map (comp lit str)) ; Turn each char into a string, then give to lit
  (apply alt)) ; Then give them to alt

Although I find the former more readable.

I'm trusting that your constant return of anonymous functions is necessary for the algorithm. I've never seen so many functions being returned from other functions! If it's necessary, it's necessary, but while trying to debug a issue, I got a mess of an error like this:

ArityException Wrong number of args (0) passed to: parser-review-fixed/func/fn--1421/fn--1422 clojure.lang.AFn.throwArity (AFn.java:429)

The function inside of the function inside of func threw an arity error! Good luck debugging that!

The recursive call inside of inner inside of many could/should be "optimized". Change:

(inner (nth res 1))

to

(recur (nth res 1))

Use of recur prevents Stack Overflows by essentially making use of Tail-Call Optimization (although, from what I understand, the inner workings are far more complicated than that).

I personally love threading macros (-> and ->> mainly). I'd lean towards writing your alt function as:

(defn alt
  "Returns first successful parse"
  [& ps]
  (fn [stream]
    (->> ps
         (map #(% stream))
         (reduce #(or %1 %2) error)))) ; %1 can just be written as %

That's nice (imho), but it's not optimized. reduce actually allows for early exit using reduced! It would be more efficient (but more verbose) to write:

(defn alt
  "Returns first successful parse"
  [& ps]
  (fn [stream]
    (->> ps
         (map #(% stream))
         (reduce (fn [acc x]
                   (if x
                     (reduced x)
                     acc))
                 error))))

As soon as an accumulator for reduce becomes reduced, the reduction ends, and the reduced value is returned. This prevents the entire sequence from needing to be checked if, say, the first value is truthy.

But, the core already has a function that basically does this: some. Since you don't need to do any conversion prior to testing for thruthiness though, you can use identity to "forward" the value in place of a predicate. Then alt just becomes:

(defn alt
  "Returns first successful parse"
  [& ps]
  (fn [stream]
    (->> ps
         (map #(% stream))
         (some identity)))) ; some returns the first truthy value

I came back to this and realized it can be reduced further! You can get rid of the call to map and make the #(% stream) the predicate of some. This makes it trivial:

(defn alt
  "Returns first successful parse"
  [& ps]
  (fn [stream]
    (some #(% stream) ps)))

Regarding your recent edit clarifying what seqn is doing, these are the two options I'd choose between:

(defn sequence-parser1 [& parsers]
  (fn [stream]
    (reduce (fn [[res acc-stream] p]
              (let [[r stream'] (p acc-stream)]
                [(conj res r) stream']))
            [[] stream]
            parsers)))

(defn sequence-parser2 [& parsers]
  (fn [stream]
    (loop [[p & rest-parsers] parsers
           acc-stream stream
           res []]
      (if p
        (let [[r stream'] (p acc-stream)]
          (recur rest-parsers stream' (conj res r)))

        [res acc-stream]))))

Both are basically the same code; just a reduction over parsers. I tend to lean towards using loop when I have multiple accumulators though, as I find constant pairing/deconstruction in the reducing function to be messy.

If you didn't need to accumulate a stream while iterating the parsers, this could be done very succinctly using map, but alas, that wouldn't work here unfortunately.

WARNING!

I have to admit, I have basically no clue what most of this code is doing, and because of the unidiomatic way that it's written, it confused the hell out of my IDE. Because of the indentation and the constant application of temporary functions, this was extraordinarily difficult to make runnable after I fixed it up. If you try to run that first "touched up" code dump I posted, you'll get an error. There's a mismatched brace somewhere in the seqn I posted, but I can't for the life of me find the issue. Use what I posted here more as a general guide instead of copying and pasting it and trying to use it.

I have basically no clue what most of this code is doing, does this mean code written in Clojure is hard to read and maintain? — Billal Begueradj, Commented May 15, 2018 at 5:13
@BillalBEGUERADJ No, this code is just written in a very obtuse, un-idiomatic style (no offence to the OP), describing an area of CS I haven't studied. I was pointing out that further recommendations could be made, but that I was over my head in that regard. — Carcigenicate, Commented May 15, 2018 at 5:17
Hey, thanks for the feedback, it really helps. This was the first piece of clojure code I have written that wasn't directly copied from a book. I learned some really great stuff. Like the fact that (def ) is global, which explains a bunch of struggles I was having earlier. So thanks for all this. — user1762507, Commented May 15, 2018 at 15:35
@user1762507 No problem. Also, see may last edit a minute ago. I saw the Python code you posted, and included at the end of the review a "direct translation" of the Python code; or at least as direct as you can get. — Carcigenicate, Commented May 15, 2018 at 15:43
Thanks! I felt like my seqn function was overtly long, and possibly wrong, I just didn't know any other way to implement it. Now I have a lot to work on. Thanks for this. — user1762507, Commented May 15, 2018 at 15:55

Yizhe Sun · Accepted Answer · 2020-05-27 03:15:38Z

Probably late, but here's a re-write of seqn.

(defn sequence-parser
  "Returns a parser that runs given parsers in order, and returns result in a sequence."
  [& parsers]
  (fn [stream]
    (reduce (fn [[acc stream] parser]
              (let [[x rest-stream] (parser stream)]
                [(conj acc x) rest-stream]))
            [[] stream]
            parsers)))

Some points:

Use vectors instead of lists for general purpose sequences.
Give descriptive names to functions and variables. If you can't, make it anonymous or inline it.
Use reduce to loop over a sequence and accumulate a result.
You can deconstruct vectors in the argument list(as in [acc stream] and [x rest-stream]).
Use nil as indication of failure.

Stack Exchange Network

Implementing parser combinators in a functional manner

2 Answers 2

WARNING!

Not the answer you're looking for? Browse other questions tagged
beginner
functional-programming
reinventing-the-wheel
clojure
or ask your own question.

Hot Network Questions

Implementing parser combinators in a functional manner

2 Answers 2

WARNING!

Not the answer you're looking for? Browse other questions tagged beginnerfunctional-programmingreinventing-the-wheelclojure or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
beginner
functional-programming
reinventing-the-wheel
clojure
or ask your own question.