Mouse: from Parsing Expressions to a practical parser

Version 2.3


Parsing Expression Grammar (PEG), introduced by Ford in [1], is a way to define the syntax of a programming language. It encodes a recursive-descent parser for that language. Mouse is a tool to transcribe PEG into an executable parser. Both Mouse and the resulting parser are written in Java, which makes them operating-system independent. An integral feature of Mouse is the mechanism for specifying semantics (also in Java). You find an example below.

Recursive-descent parsing with limited backtracking

Recursive-descent parsers consist of procedures that correspond to syntax rules. The procedures call each other recursively, each being responsible for recognizing input strings defined by its rule. The syntax definition and its parser can be then viewed as the same thing, presented in two different ways. This design is very transparent and easy to modify when the language evolves. The idea can be traced back to Lucas [2], who suggested it for grammars that later became known as the Extended Backus-Naur Form (EBNF).

The problem is that procedures corresponding to certain types of syntax rules must decide which procedure to call next. It is easily solved for a class of languages that have the so-called LL(1) property: the decision can be made by looking at the next input symbol. Some syntax definitions can be transformed to satisfy this property. However, forcing the language into the LL(1) mold can make the grammar -- and the parser -- unreadable.

One can always use brute force: trial and error. It means trying the alternatives one after another and backtracking after a failure. But, an exhaustive search may require exponential time, so a possible option is limited backtracking: never return after a partial success. This approach has been used in actual parsers [4,5] and is described in [3, 6, 7, 8]. It has been eventually used by Ford [1] to create Parsing Expression Grammar (PEG).

A Web site devoted to PEG contains list of publications and a link to discussion forum.

PEG programming

Parsers defined by PEG do not require a separate "lexer" or "scanner". Together with lifting of the LL(1) restriction, this gives a very convenient tool when you need an ad-hoc parser for some application.

In its external appearance, PEG is very like an EBNF grammar. But, the limited backtracking may result in some strings conforming to EBNF being rejected by PEG. It is well-known that PEG A = aaAaa/aa accepts only a string of "a"s whose length is a power of 2, while the same grammar viewed as EBNF accepts any string of even length. If you do not want to exploit such effects, you would most likely see your grammar as EBNF and expect the PEG parser to behave accordingly. Indeed, for some grammars the limited backtracking finds everything that would be found by full backtracking. We say that such grammar is "PEG-complete". A PEG parser for PEG-complete EBNF grammar accepts exactly the language defined by that EBNF.

Some conditions for PEG-completness have been identified in [9,10]. The Mouse package includes a tool, the PEG Explorer, that assists you in checking whether your grammar is PEG-complete. It is described here.

It is difficult for the Explorer to automatically distinguish between identifiers and keywords defined in the traditional PEG manner. To facilitate this task, version 2.3 of Mouse introduces the "colon" operator that makes these constructs Explorer-friendly. You find it in Section 18 of the Manual.

Mouse - not a pack rat

Even the limited backtracking may require a lot of time. In [11] PEG was introduced together with a technique called packrat parsing. Packrat parsing handles backtracking by extensive memoization: storing all results of parsing procedures. It guarantees linear parsing time at a large memory cost. (The name "packrat" comes from pack rat - a small rodent Neotoma cinerea known for hoarding unnecessary items.)

Wikipedia lists a number of generators producing packrat parsers from PEG.

The amount of backtracking does not matter in small interactive applications where the input is short and performance not critical. Moreover, the usual programming languages do not require much backtracking. The experiments reported in [12,13] demonstrated a moderate backtracking activity in PEG parsers for programming languages Java and C.

In view of these facts, it is useful to construct PEG parsers where the complexity of packrat technology is abandoned in favor of simple and transparent design. This is the idea of Mouse: a parser generator that transcribes PEG into a set of recursive procedures that closely follow the grammar. The name Mouse was chosen in contrast to Rats!, one of the first generators producing packrat parsers [14]. Optionally, Mouse can offer a small amount of memoization using the technique described in [13].

Left recursion

The recursive-descent process used by PEG parsers can not be applied to a grammar that contains left recursion, because rule such as A = Aa/b would result in an infinite descent. But, for many reasons this pattern is often used for defining the syntax of programming languages.

Converting left recursion to iteration is possible, but is tedious, error-prone, and obscures the spirit of the grammar. Several ways of extending PEG to handle left recursion have been suggested. One approach is a modification of packrat parser [15], and [16] shows how to exploit only part of packrat technology. Another approach [17] introduces a bound on the number of left-recursive evaluations that is gradually increased.

Starting with Version 2.0, Mouse supports left recursion using an experimental method of recursive ascent, which follows the idea from [18]. It boils down to constructing behind the scenes an alternative PEG to take care of left-recursive parts. It is explained here.

The support is limited to the Choice and Sequence operators, and there is a requirement that the first expression in a recursive Sequence must be non-nullable.

You find more details in Section 17 of the Manual. The added Examples 2R, 4R, and 8R are, respectively, copies of Examples 2, 4, and 8 with grammar rewritten as left-recursive. Example 11 contains samples of left-recursive grammars found in the literature. The package contains also a grammar for Java 18 with left-recursive definition of Primary.
 


References

  1. ^ B. Ford. Parsing Expression Grammars: A Recognition-Based Syntactic Foundation. Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Venice 2004, pp. 111 - 122.
  2. ^ P. Lucas. The Structure of Formula-Translators.
    ALGOL Bulletin Supplement 16, September 1961, pp. 1 - 21.
  3. ^ F. R. A. Hopgood. Compiling Techniques, MacDonalds 1969.
  4. ^ P. A. Brooker, D. Morris. A General Translation Program for Phrase Structure Languages. The Computer Journal 9, 1 (1962), pp. 1 - 10.
  5. ^ R. M. McClure. TMG - a syntax directed compiler.
    Proceedings of the 20th ACM National Conference, Cleveland, Ohio 1965, pp. 262 - 274.
  6. ^ A. Birman. The TMG Recognition Schema. Ph.D. thesis, Princeton University, 1970.
  7. ^ A. Birman and J. D. Ullman. Parsing Algorithms with Backtrack.
    Information and Control 23 (1973), pp. 1 - 34.
  8. ^ A. V. Aho and J. D. Ullman. The Theory of Parsing, Translation and Compiling,
    Vol. I, Parsing. Prentice Hall, 1972.
  9. ^ F. Mascarenhas, S. Medeiros and R. Ierusalimschy. On the Relation between Context-Free Grammars and Parsing Expression Grammars.
    Science of Computer Programming 89 (2014) pp. 235 - 250.
  10. ^ R. R. Redziejowski. From EBNF to PEG.
    Fundamenta Informaticae 128, 1-2 (2013), pp. 177 - 191.
  11. ^ B. Ford. Packrat Parsing: Simple, Powerful, Lazy, Linear Time.
    Proceedings of the 2002 International Conference on Functional Programming, Pittsburgh 2002, pp. 36 - 47.
  12. ^ R. R. Redziejowski. Parsing Expression Grammar as a Primitive Recursive-Descent Parser with Backtracking. Fundamenta Informaticae 79, 3-4 (2007), pp. 513 - 524.
  13. ^ R. R. Redziejowski. Some aspects of Parsing Expression Grammar.
    Fundamenta Informaticae 85, 1-4 (2008), pp. 441 - 454.
  14. ^ R. Grimm. Practical Packrat Parsing.
    Technical Report TR2004-854, Dept. of Computer Science, New York University (2004).
  15. ^ A. Warth, J. R. Douglass, T.D. Millstein. Packrat parsers can support left recursion. Proceedings of the 2008 ACM Symposium on Partial Evaluation and Semantics-based Program Manipulation,(2008), pp. 103 - 110.
  16. ^ L. Tratt. Direct left-recursive parsing expression grammars. Tech. report EIS-10-01, School of Engineering and Information Sciences, Middlesex University (2010).
  17. ^ S. Medeiros, F. Mascarenhas, R. Ierusalimschy. Left Recursion in Parsing Expression Grammars. Science of Computer Programming 96, P2 (2014), pp. 177 - 190.
  18. ^ O. Hill. Support for Left-Recursive PEGs (2010).

 

Example

This is an example of PEG:

Sum     = Space Sign Number (AddOp Number)* !_ {} ;
Number  = Digits? "." Digits Space {}
        / Digits Space {} ;
Sign    = ("-" Space)? ;
AddOp   = [-+] Space ;
Digits  = [0-9]+ ;
Space   = " "* ;

Mouse transcribes it into an executable parser written in Java. Each rule of PEG becomes a Java method named after the rule. For example, the first rule is transcribed into this method;

//=====================================================================
//  Sum = Space Sign Number (AddOp Number)* !_ {} ;
//=====================================================================
private boolean Sum()
  {
    begin("Sum");
    Space();
    Sign();
    if (!Number()) return reject();
    while (Sum_0());
    if (!aheadNot()) return reject();
    sem.Sum();
    return accept();
  }

A pair of brackets {} at the end of a PEG rule indicates that you provide a semantic action for the rule. It is a method in a separate Java class and may look like this:

//-------------------------------------------------------------------
//  Sum = Space Sign Number AddOp Number ... AddOp Number
//          0     1    2      3     4         n-2   n-1
//-------------------------------------------------------------------
void Sum()
  {
    int n = rhsSize();
    int s = (Double)rhs(2).get();
    if (!rhs(1).isEmpty()) s = -s;
    for (int i=4;i<n;i+=2)
    {
      if (rhs(i-1).charAt(0)=='+')
        s += (Double)rhs(i).get();
      else
        s -= (Double)rhs(i).get();
    }
    System.out.println(s);
  }
Special functions like "rhsSize()" and "rhs(i)" provide access to parts of the parsed text.
 

Downloads

Download user's manual / tutorial (PDF file).
Download the Mouse package (gzipped TAR file).

License: Apache Version 2.
 


Sample grammars

Parsing Expression Grammar for Java 18

Based on The Java Language Specification, Java SE18 Edition dated 2022-02-23, with corrections based on test results. Includes left-recursive definition of Primary and uses the new "colon" operator.

Download the grammar in Mouse format (text file).

Parsing Expression Grammar for C

Rather ancient, based on ISO/IEC 9899.1999:TC2 standard without preprocessor directives, meaning that it applies to preprocessed C source.

The typedef feature makes the syntax of C context-dependent, which cannot be expressed in PEG formalism. The problem is solved by semantic actions provided together with the grammar.

Download the grammar in Mouse format (text file).
Download semantic actions for the grammar (Java source).
 


Latest change 2022-08-05