Regular expressions of sequences implementation namespace.
The implementation is very similar to Packrat or GLL parser combinators. The parsing functions need to be written in CPS to support backtracking inside :, :+ and :repeat. They also need to be trampolined because the (manually) CPS-converted code (for :, :+ and :repeat) has to use tail calls instead of loops and Clojure does not have TCO.
Because backtracking is used we need to memoize (parsing function, seq position, register stack) triples to avoid exponential behaviour. Discarding the memoization cache after traversing an input seq also requires trampolining. Because regular expressions don't use (nontail) recursion by definition, finding a memoization entry just means the parser already went 'here' and ultimately failed; much simpler than the graph-structured stacks of GLL. And the register stack is only there for and used by :repeat.
NOTE: For the memoization to work correctly, every node in the schema tree
must get its own validation/explanation/... function instance. So even every
(malli.impl.regex/cat)
call must return a new fn instance although it does not
close over anything.
https://epsil.github.io/gll/ is a nice explanation of GLL parser combinators and has links to papers etc. It also inspired Instaparse, which Engelberg had a presentation about at Clojure/West 2014.
Despite the CPS and memoization, this implementation looks more like normal
Clojure code than the 'Pike VM' in Seqexp. Hopefully JITs also see it that
way and compile decent machine code for it. It is also much easier to extend
for actual parsing (e.g. encode, decode [and parse?]) instead of just
recognition for validate
.
For a more detailed explanation of this namespace see also https://www.metosin.fi/blog/malli-regex-schemas/.
Regular expressions of sequences implementation namespace. The implementation is very similar to Packrat or GLL parser combinators. The parsing functions need to be written in CPS to support backtracking inside :*, :+ and :repeat. They also need to be trampolined because the (manually) CPS-converted code (for :*, :+ and :repeat) has to use tail calls instead of loops and Clojure does not have TCO. Because backtracking is used we need to memoize (parsing function, seq position, register stack) triples to avoid exponential behaviour. Discarding the memoization cache after traversing an input seq also requires trampolining. Because regular expressions don't use (nontail) recursion by definition, finding a memoization entry just means the parser already went 'here' and ultimately failed; much simpler than the graph-structured stacks of GLL. And the register stack is only there for and used by :repeat. NOTE: For the memoization to work correctly, every node in the schema tree must get its own validation/explanation/... function instance. So even every `(malli.impl.regex/cat)` call must return a new fn instance although it does not close over anything. https://epsil.github.io/gll/ is a nice explanation of GLL parser combinators and has links to papers etc. It also inspired Instaparse, which Engelberg had a presentation about at Clojure/West 2014. Despite the CPS and memoization, this implementation looks more like normal Clojure code than the 'Pike VM' in Seqexp. Hopefully JITs also see it that way and compile decent machine code for it. It is also much easier to extend for actual parsing (e.g. encode, decode [and parse?]) instead of just recognition for `validate`. For a more detailed explanation of this namespace see also https://www.metosin.fi/blog/malli-regex-schemas/.
cljdoc is a website building & hosting documentation for Clojure/Script libraries
× close