Version 2.3
Parsing Expression Grammar (PEG), introduced by Ford in [1], is a way to define the syntax of a programming language. It encodes a recursive-descent parser for that language. Mouse is a tool to transcribe PEG into an executable parser. Both Mouse and the resulting parser are written in Java, which makes them operating-system independent. An integral feature of Mouse is the mechanism for specifying semantics (also in Java). An example is shown below.
Recursive-descent parsers consist of procedures that correspond to syntax rules. The procedures call each other recursively, each being responsible for recognizing input strings defined by its rule. The syntax definition and its parser can then be viewed as the same thing, presented in two different ways. This design is very transparent and easy to modify when the language evolves. The idea can be traced back to Lucas [2], who suggested it for grammars that later became known as the Extended Backus-Naur Form (EBNF).
The problem is that procedures corresponding to certain types of syntax rules must decide which procedure to call next. It is easily solved for a class of languages that have the so-called LL(1) property: the decision can be made by looking at the next input symbol. Some syntax definitions can be transformed to satisfy this property. However, forcing the language into the LL(1) mold can make the grammar -- and the parser -- unreadable.
One can always use brute force: trial and error. It means trying the alternatives one after another and backtracking after a failure. But an exhaustive search may require exponential time, so a possible option is limited backtracking: never return after a partial success. This approach has been used in actual parsers [4,5] and is described in [3, 6, 7, 8]. It was eventually used by Ford [1] to create Parsing Expression Grammar (PEG).
A Web site devoted to PEG contains a list of publications and a link to a discussion forum.
Parsers defined by PEG do not require a separate "lexer" or "scanner". Together with lifting of the LL(1) restriction, this gives a very convenient tool when you need an ad-hoc parser for some application.
In its external appearance, PEG is very much like an EBNF grammar. But the limited backtracking may result in some strings that conform to the EBNF being rejected by PEG. It is well known that the PEG
    A = aAa/aa
accepts only strings of "a"s whose length is a power of 2 (2, 4, 8, ...), while the same grammar viewed as EBNF accepts any non-empty string of "a"s of even length. If you do not want to exploit such effects, you would most likely see your grammar as EBNF and expect the PEG parser to behave accordingly. Indeed, for some grammars the limited backtracking finds everything that would be found by full backtracking. We say that such a grammar is "PEG-complete". A PEG parser for a PEG-complete EBNF grammar accepts exactly the language defined by that EBNF.
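The difference between the two readings can be checked with a small experiment. The sketch below is a hypothetical hand-written class, not part of Mouse, for the grammar A = aAa/aa, the usual form of this example. The method peg() commits to the first result of the inner A (limited backtracking), while ebnf() collects every possible end position (full backtracking).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not generated by Mouse and not part of its package.
class PegVersusEbnf {

    // PEG reading of A = "a" A "a" / "aa".
    // Returns the position after the match starting at pos, or -1 on failure.
    // Whatever the inner A returns is final: it is never asked for another match.
    static int peg(String s, int pos) {
        if (pos < s.length() && s.charAt(pos) == 'a') {       // first alternative
            int mid = peg(s, pos + 1);
            if (mid >= 0 && mid < s.length() && s.charAt(mid) == 'a') {
                return mid + 1;
            }
        }
        if (s.startsWith("aa", pos)) return pos + 2;          // second alternative
        return -1;
    }

    // EBNF reading (full backtracking): all positions where A starting at pos can end.
    static List<Integer> ebnf(String s, int pos) {
        List<Integer> ends = new ArrayList<>();
        if (pos < s.length() && s.charAt(pos) == 'a') {
            for (int mid : ebnf(s, pos + 1)) {
                if (mid < s.length() && s.charAt(mid) == 'a') ends.add(mid + 1);
            }
        }
        if (s.startsWith("aa", pos)) ends.add(pos + 2);
        return ends;
    }

    // The whole input must be consumed in both cases.
    static boolean pegAccepts(String s)  { return peg(s, 0) == s.length(); }
    static boolean ebnfAccepts(String s) { return ebnf(s, 0).contains(s.length()); }

    public static void main(String[] args) {
        String s = "";
        for (int n = 2; n <= 16; n += 2) {
            s += "aa";
            System.out.println(n + ": PEG " + pegAccepts(s) + ", EBNF " + ebnfAccepts(s));
        }
    }
}
```

For lengths 2 to 16, the PEG reading accepts only 2, 4, 8 and 16, while the full-backtracking reading accepts every even length.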
Some conditions for PEG-completeness have been identified in [9,10]. The Mouse package includes a tool, the PEG Explorer, that assists you in checking whether your grammar is PEG-complete. It is described here.
It is difficult for the Explorer to automatically distinguish between identifiers and keywords defined in the traditional PEG manner. To facilitate this task, version 2.3 of Mouse introduces the "colon" operator that makes these constructs Explorer-friendly. It is described in Section 18 of the Manual.
Even the limited backtracking may require a lot of time. In [11] PEG was introduced together with a technique called packrat parsing. Packrat parsing handles backtracking by extensive memoization: storing all results of parsing procedures. It guarantees linear parsing time at a large memory cost. (The name "packrat" comes from the pack rat, a small rodent (Neotoma cinerea) known for hoarding unnecessary items.)
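As a rough illustration of memoization, the sketch below stores the result of one parsing procedure per start position. It is a hypothetical hand-written class, not code from Mouse or from any packrat generator. It assumes a rule Number = Digits? "." Digits / Digits and memoizes only Digits, whereas a full packrat parser would keep such a table for every rule. The counter digitsScans is an illustrative addition that shows how many times the input was really scanned.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of memoizing one parsing procedure.
class PackratSketch {
    private final String input;
    // Memo table for Digits: start position -> end position (-1 means failure).
    private final Map<Integer, Integer> digitsMemo = new HashMap<>();
    int digitsScans = 0;   // number of non-memoized evaluations of Digits

    PackratSketch(String input) { this.input = input; }

    // Digits = [0-9]+
    int digits(int pos) {
        Integer cached = digitsMemo.get(pos);
        if (cached != null) return cached;        // result already known
        digitsScans++;
        int p = pos;
        while (p < input.length() && input.charAt(p) >= '0' && input.charAt(p) <= '9') p++;
        int result = (p > pos) ? p : -1;
        digitsMemo.put(pos, result);
        return result;
    }

    // Number = Digits? "." Digits / Digits
    int number(int pos) {
        int p = digits(pos);
        if (p < 0) p = pos;                       // Digits? may match nothing
        if (p < input.length() && input.charAt(p) == '.') {
            int q = digits(p + 1);
            if (q >= 0) return q;
        }
        return digits(pos);                       // second alternative: answered from the table
    }

    // Convenience helper: parse from position 0 and report the scan count.
    static int scansFor(String text) {
        PackratSketch p = new PackratSketch(text);
        p.number(0);
        return p.digitsScans;
    }

    public static void main(String[] args) {
        System.out.println("scans for \"123\": " + scansFor("123"));
    }
}
```

On the input "123" the first alternative fails at the missing ".", and the second alternative then reuses the stored result instead of scanning the digits again. The price is one table entry per rule and position, which is the memory cost mentioned above.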
Wikipedia lists a number of generators producing packrat parsers from PEG.
The amount of backtracking does not matter in small interactive applications where the input is short and performance is not critical. Moreover, the usual programming languages do not require much backtracking. The experiments reported in [12,13] demonstrated a moderate backtracking activity in PEG parsers for the programming languages Java and C.
In view of these facts, it is useful to construct PEG parsers where the complexity of packrat technology is abandoned in favor of simple and transparent design. This is the idea of Mouse: a parser generator that transcribes PEG into a set of recursive procedures that closely follow the grammar. The name Mouse was chosen in contrast to Rats!, one of the first generators producing packrat parsers [14]. Optionally, Mouse can offer a small amount of memoization using the technique described in [13].
The recursive-descent process used by PEG parsers cannot be applied to a grammar that contains left recursion, because a rule such as A = Aa/b would result in an infinite descent. But, for many reasons, this pattern is often used for defining the syntax of programming languages.
Converting left recursion to iteration is possible, but is tedious, error-prone, and obscures the spirit of the grammar. Several ways of extending PEG to handle left recursion have been suggested. One approach is a modification of the packrat parser [15], and [16] shows how to exploit only part of packrat technology. Another approach [17] introduces a bound on the number of left-recursive evaluations that is gradually increased.
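To make the manual conversion concrete, the sketch below handles the rule A = Aa/b after rewriting it by hand as A = b a*. It is a hypothetical illustration and does not show how Mouse treats left recursion. The parenthesized string it returns stands in for a syntax tree: the left-nested shape that the original rule expresses directly has to be rebuilt inside the loop.

```java
// Hypothetical sketch: left recursion A = A "a" / "b" replaced by iteration A = "b" "a"*.
class LeftRecursionAsIteration {

    // Returns a parenthesized picture of the left-nested structure,
    // or null if the input is not of the form b a a ... a.
    static String parseA(String s) {
        if (s.isEmpty() || s.charAt(0) != 'b') return null;
        String tree = "b";
        int pos = 1;
        while (pos < s.length() && s.charAt(pos) == 'a') {
            tree = "(" + tree + "a)";   // wrap what was recognized so far: A = A "a"
            pos++;
        }
        return pos == s.length() ? tree : null;
    }

    public static void main(String[] args) {
        System.out.println(parseA("baa"));
    }
}
```

A direct transcription of A = Aa/b into a procedure would call itself before consuming any input and never terminate. The loop avoids that, but the grammar no longer shows the nesting.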
Starting with Version 2.0, Mouse supports left recursion using an experimental method of recursive ascent, which follows the idea from [18]. It boils down to constructing behind the scenes an alternative PEG to take care of left-recursive parts. It is explained here.
The support is limited to the Choice and Sequence operators, and there is a requirement that the first expression in a recursive Sequence must be non-nullable.
More details are given in Section 17 of the Manual.
The added Examples 2R, 4R, and 8R are, respectively, copies of Examples 2, 4, and 8 with the grammar rewritten as left-recursive. Example 11 contains samples of left-recursive grammars found in the literature. The package also contains a grammar for Java 18 with a left-recursive definition of Primary.
This is an example of PEG:
    Sum    = Space Sign Number (AddOp Number)* !_ {} ;
    Number = Digits? "." Digits Space {}
           / Digits Space {} ;
    Sign   = ("-" Space)? ;
    AddOp  = [-+] Space ;
    Digits = [0-9]+ ;
    Space  = " "* ;
Mouse transcribes it into an executable parser written in Java. Each rule of PEG becomes a Java method named after the rule. For example, the first rule is transcribed into this method:
    //=====================================================================
    //  Sum = Space Sign Number (AddOp Number)* !_ {} ;
    //=====================================================================
    private boolean Sum()
      {
        begin("Sum");
        Space();
        Sign();
        if (!Number()) return reject();
        while (Sum_0());
        if (!aheadNot()) return reject();
        sem.Sum();
        return accept();
      }
A pair of brackets {} at the end of a PEG rule indicates that you provide a semantic action for the rule. It is a method in a separate Java class and may look like this:
    //-------------------------------------------------------------------
    //   Sum = Space Sign Number AddOp Number ... AddOp Number
    //           0     1    2      3     4         n-2   n-1
    //-------------------------------------------------------------------
    void Sum()
      {
        int n = rhsSize();
        double s = (Double)rhs(2).get();
        if (!rhs(1).isEmpty()) s = -s;
        for (int i=4;i<n;i+=2)
        {
          if (rhs(i-1).charAt(0)=='+')
            s += (Double)rhs(i).get();
          else
            s -= (Double)rhs(i).get();
        }
        System.out.println(s);
      }

Special functions like "rhsSize()" and "rhs(i)" provide access to parts of the parsed text.
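For readers who want to try the idea without installing Mouse, the following self-contained sketch parses and evaluates the same Sum grammar by hand. It is not the code that Mouse generates and uses none of the Mouse support methods. The class name SumSketch and all of its methods are hypothetical. Each method follows one rule of the grammar, and the semantic computation is folded into the parsing methods instead of being kept in a separate class.

```java
// Hypothetical hand-written recursive-descent parser for the Sum grammar.
class SumSketch {
    private final String in;
    private int pos = 0;

    private SumSketch(String in) { this.in = in; }

    // Sum = Space Sign Number (AddOp Number)* !_
    // Returns the value of the sum, or null if the text does not match.
    static Double evaluate(String text) {
        SumSketch p = new SumSketch(text);
        p.space();
        boolean negative = p.sign();
        Double first = p.number();
        if (first == null) return null;
        double s = negative ? -first : first;
        while (true) {
            int start = p.pos;                       // remembered for backtracking
            Character op = p.addOp();
            Double next = (op == null) ? null : p.number();
            if (next == null) { p.pos = start; break; }
            s = (op == '+') ? s + next : s - next;
        }
        if (p.pos != p.in.length()) return null;     // !_ : must be at end of input
        return s;
    }

    // Number = Digits? "." Digits Space / Digits Space
    private Double number() {
        int start = pos;
        digits();                                    // Digits? : may match nothing
        if (pos < in.length() && in.charAt(pos) == '.') {
            pos++;
            if (digits()) return finishNumber(start);
        }
        pos = start;                                 // backtrack to second alternative
        if (digits()) return finishNumber(start);
        return null;
    }

    private Double finishNumber(int start) {
        double value = Double.parseDouble(in.substring(start, pos));
        space();
        return value;
    }

    // Sign = ("-" Space)?
    private boolean sign() {
        if (pos < in.length() && in.charAt(pos) == '-') { pos++; space(); return true; }
        return false;
    }

    // AddOp = [-+] Space
    private Character addOp() {
        if (pos < in.length() && (in.charAt(pos) == '+' || in.charAt(pos) == '-')) {
            char op = in.charAt(pos++);
            space();
            return op;
        }
        return null;
    }

    // Digits = [0-9]+
    private boolean digits() {
        int start = pos;
        while (pos < in.length() && in.charAt(pos) >= '0' && in.charAt(pos) <= '9') pos++;
        return pos > start;
    }

    // Space = " "*
    private void space() {
        while (pos < in.length() && in.charAt(pos) == ' ') pos++;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("17 + 4.5 - 2"));
    }
}
```

The backtracking in number() is the limited kind described earlier: once an alternative has succeeded, the method never returns to try the other one.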
Download user's manual / tutorial (PDF file).
Download the Mouse package (gzipped TAR file).
License: Apache Version 2.
The grammar for Java 18 is based on The Java Language Specification, Java SE 18 Edition, dated 2022-02-23, with corrections based on test results. It includes a left-recursive definition of Primary and uses the new "colon" operator.
Download the grammar in Mouse format (text file).
The grammar for C is rather ancient: it is based on the ISO/IEC 9899:1999 (TC2) standard without preprocessor directives, meaning that it applies to preprocessed C source.
The typedef feature makes the syntax of C context-dependent, which cannot be expressed in PEG formalism. The problem is solved by semantic actions provided together with the grammar.
Download the grammar in Mouse format (text file).
Download semantic actions for the grammar (Java source).
Latest change 2022-08-05