Mouse: from Parsing Expressions to a practical parser

Version 1.7

Parsing Expression Grammar (PEG), introduced by Ford in [1], is a new way to specify a recursive-descent parser with limited backtracking. Mouse is a tool to transcribe PEG into an executable parser written in Java.

Recursive-descent parsing with backtracking

Recursive-descent parsers have been around for a while. Already in 1961, Lucas [2], suggested the use of recursive procedures that reflect the syntax of the language being parsed. This close connection between the code and the grammar is the great advantage of recursive-descent parsers. It makes the parser easy to maintain and modify.

A recursive-descent parser is simple to construct from a classical context-free grammar if the grammar has the so-called LL(1) property; it means that the parser can always decide what to do by looking at the next input character. However, forcing the language into the LL(1) mold can make the grammar -- and the parser -- unreadable.

The LL(1) restriction can be circumvented by the use of backtracking. Backtracking means that the parser proceeds by trial and error: goes back and tries another alternative if it took a wrong decision. However, an exhaustive search of all alternatives may require an exponential time. A reasonable compromise is limited backtracking, also called "fast-back" in [3].

Limited backtracking was adopted in at least two of the early top-down designs: the Atlas Compiler Compiler of Brooker and Morris [4,5], and TMG (the TransMoGrifier) of McClure [6]. The syntax specification used in TMG was later formalized and analyzed by Birman and Ullman [7,8]. It appears in [9] as "Top-Down Parsing Language" (TDPL) and "Generalized TDPL" (GTDPL).

Parsing Expression Grammar is a development of this latter. A Web site devoted to PEG contains list of publications and a link to discussion forum.

PEG programming

Parsers defined by PEG do not require a separate "lexer" or "scanner". Together with lifting of the LL(1) restriction, this gives a very convenient tool when we need an ad-hoc parser for some application. However, PEG is not well understood as a language definition tool. The literature contains many examples of surprising behavior. It is thus desirable to construct PEG parsers for languages defined in a more traditional form, for example, the Extended Backus-Naur Form (EBNF).

To its external appearance, PEG is very like an EBNF grammar. In fact, few simple typographical changes convert EBNF grammar into PEG. It may happen that this PEG accepts exactly the language defined by EBNF. One can then say that the EBNF grammar "is its own PEG parser". However, in the general case the PEG so obtained recognizes only a subset of the language defined by EBNF; this is the effect of backtracking being limited.

A recent paper [10] presents some sufficient conditions for EBNF grammar to be its own PEG parser. Some hints on how to construct PEG parser for an EBNF grammar that does not satisfy these conditions are given in [11,12].

Packrat parsing

Even the limited backtracking may require a lot of time. In [13,14], PEG was introduced together with a technique called packrat parsing. Packrat parsing handles backtracking by extensive memoization: storing all results of parsing procedures. It guarantees linear parsing time at a large memory cost.

(The name "packrat" comes from pack rat - a small rodent (Neotoma cinerea) known for hoarding unnecessary items. "Memoization", introduced in [15], is the technique of reusing stored results of function calls instead of recomputing them.)

Wikipedia lists a number of generators producing packrat parsers from PEG.

Mouse - not a pack rat

The amount of backtracking does not matter in small interactive applications where the input is short and performance not critical. Moreover, the usual programming languages do not require much backtracking. The experiments reported in [16,17] demonstrated a moderate backtracking activity in PEG parsers for programming languages Java and C.

In view of these facts, it is useful to construct PEG parsers where the complexity of packrat technology is abandoned in favor of simple and transparent design. This is the idea of Mouse: a parser generator that transcribes PEG into a set of recursive procedures that closely follow the grammar. The name Mouse was chosen in contrast to Rats!, one of the first generators producing packrat parsers [18]. Optionally, Mouse can offer a small amount of memoization using the technique described in [16]. Both Mouse and the resulting parser are written in Java, which makes them operating-system independent.

An integral feature of Mouse is the mechanism for specifying semantics (also in Java).


This is an example of PEG:

Sum     = Space Sign Number (AddOp Number)* !_ {} ;
Number  = Digits? "." Digits Space {}
        / Digits Space {} ;
Sign    = ("-" Space)? ;
AddOp   = [-+] Space ;
Digits  = [0-9]+ ;
Space   = " "* ;

Mouse transcribes it into an executable parser written in Java. Each rule of PEG becomes a Java method named after the rule. For example, the first rule is transcribed into this method;

//  Sum = Space Sign Number (AddOp Number)* !_ {} ;
private boolean Sum()
    if (!Number()) return reject();
    while (Sum_0());
    if (!aheadNot()) return reject();
    return accept();

A pair of brackets {} at the end of a PEG rule indicates that you provide a semantic action for the rule. It is a method in a separate Java class and may look like this:

//  Sum = Space Sign Number AddOp Number ... AddOp Number
//          0     1    2      3     4         n-2   n-1
void Sum()
    int n = rhsSize();
    int s = (Double)rhs(2).get();
    if (!rhs(1).isEmpty()) s = -s;
    for (int i=4;i<n;i+=2)
      if (rhs(i-1).charAt(0)=='+')
        s += (Double)rhs(i).get();
        s -= (Double)rhs(i).get();
Special functions like "rhsSize()" and "rhs(i)" provide access to parts of the parsed text.


  1. ^ B. Ford. Parsing Expression Grammars: A Recognition-Based Syntactic Foundation. Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Venice 2004, pp. 111 - 122.
  2. ^ P. Lucas. The Structure of Formula-Translators.
    ALGOL Bulletin Supplement 16, September 1961, pp. 1 - 21.
  3. ^ F. R. A. Hopgood. Compiling Techniques, MacDonalds 1969.
  4. ^ P. A. Brooker, D. Morris. A General Translation Program for Phrase Structure Languages. The Computer Journal 9, 1 (1962), pp. 1 -10.
  5. ^ S. Rosen. A Compiler-Building System Developed by Brooker and Morris. Communications of ACM 7, 7 (1964) pp. 403 - 414.
  6. ^ R. M. McClure. TMG - a syntax directed compiler.
    Proceedings of the 20th ACM National Conference, Cleveland, Ohio 1965, pp. 262 - 274.
  7. ^ A. Birman. The TMG Recognition Schema. Ph.D. thesis, Princeton University, 1970.
  8. ^ A. Birman and J. D. Ullman. Parsing Algorithms with Backtrack.
    Information and Control 23 (1973), pp. 1 -34.
  9. ^ A. V. Aho and J. D. Ullman. The Theory of Parsing, Translation and Compiling,
    Vol. I, Parsing. Prentice Hall, 1972.
  10. ^ R. R. Redziejowski. From EBNF to PEG.
    Fundamenta Informaticae 128, 1-2 (2013), pp. 177 - 191.
  11. ^ F. Mascarenhas, S. Medeiros and R. Ierusalimschy. On the Relation between Context-Free Grammars and Parsing Expression Grammars.
    Science of Computer Programming 89 (2014) pp. 235 - 250.
  12. ^ R. R. Redziejowski. More about converting BNF to PEG.
    Fundamenta Informaticae 133, 2-3 (2014), pp. 257 - 270.
  13. ^ B. Ford. Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking.
    M.Sc. thesis, MIT 2002.
  14. ^ B. Ford. Packrat Parsing: Simple, Powerful, Lazy, Linear Time.
    Proceedings of the 2002 International Conference on Functional Programming, Pittsburgh 2002, pp. 36 - 47.
  15. ^ D. Michie. Memo Functions and Machine Learning. Nature 218 (1968), pp. 19 - 22.
  16. ^ R. R. Redziejowski. Parsing Expression Grammar as a Primitive Recursive-Descent Parser with Backtracking. Fundamenta Informaticae 79, 3-4 (2007), pp. 513 - 524.
  17. ^ R. R. Redziejowski. Some aspects of Parsing Expression Grammar.
    Fundamenta Informaticae 85, 1-4 (2008), pp. 441 - 454.
  18. ^ R. Grimm. Practical Packrat Parsing.
    Technical Report TR2004-854, Dept. of Computer Science, New York University (2004).



Download user's manual / tutorial (PDF file).
Download the Mouse package (gzipped TAR file).

License: Apache Version 2.

Sample grammars

Parsing Expression Grammar for Java 1.8

Based on Java Language Specification, Java SE8 Edition dated 2014-03-03, with corrections based on test results and javac parser code.

Download the grammar in Mouse format (text file).

Parsing Expression Grammar for Java 1.7

Based on Chapters 3 and 18 Based on of Java Language Specification, Java SE7 Edition dated 2012-07-27, with corrections based on other chapters and test results.

Download the grammar in Mouse format (text file).

Parsing Expression Grammar for Java 1.6

Based on Chapters 3 and 18 of Java Language Specification, Third Edition, with corrections based on other chapters and test results.

It is an updated version of the grammar used for experiments described in the paper Parsing Expression Grammar as a Primitive Recursive-Descent Parser with Backtracking.

Download the grammar in Mouse format (text file).

Parsing Expression Grammar for C

Based on ISO/IEC 9899.1999:TC2 standard without preprocessor directives,
meaning that it applies to preprocessed C source.

It is the grammar used for experiments described in the paper Some aspects of Parsing Expression Grammar. It has been tested against 100 preprocessed C files from different open source projects.

The typedef feature makes the syntax of C context-dependent, which cannot be expressed in PEG formalism. The problem is solved by semantic actions provided together with the grammar.

Download the grammar in Mouse format (text file).
Download semantic actions for the grammar (Java source).

Latest change 2015-07-20