PEG Explorer

Part of Mouse version 1.9


Parsing Expression Grammar (PEG) is useful, but is not well understood as a language definition tool. Literature contains many examples of surprising behavior. In its appearance, PEG is practically identical to the familiar Extended Backus-Naur Form (EBNF). However, because PEG uses limited backtracking, the language accepted by parsing expression e is often a subset of the language defined by e according to EBNF. This difference gives the impression of PEG being unpredictable. The purpose of PEG Explorer is to assist in detecting and understanding the difference.

The limited backtracking applies to expressions of the form A = e1/e2 and means that once e1 succeeded, there is no backtracking to try e2 in the case of a later failure. It is rather obvious that limited backtracking has no effect if A = e1/e2 satisfies this condition:

(C) No word defined by e1 is a prefix of any word defined by e2 Tail(A) where Tail(A) defines the set of all strings that can follow A in a string conforming to the grammar.

Clearly, if e1 succeeded, the input starts with a word defined by e1. A successful backtracking would mean that the same input conforms to e2 Tail(A), which is excluded by (C). Thus, backtracking will fail and skipping it makes no difference. The limited backtracking is "efficient".

A special case of (C) is that first letters of words defined by e1 are distinct from first letters of words defined by e2 Tail(A). This is known as the LL(1) condition, and is easily checked. Many expressions that do not satisfy LL(1) still satisfy (C), but there is no general way to check this. However, a large number of cases can be checked by inspection. The function of PEG Explorer is to facilitate such inspection. It checks the LL(1) condition for all expressions e1/e2 and lets you explore those that fail the test.

The expressions A = e?, A = e* and A = e+ are abbreviations for, respectively, A = e/ε, A = eA/ε and A = ee*, where ε is the empty word. The Explorer treats them as such.

Instead of working with single letters, PEG specifies "terminals" that may be strings of letters of sets of letters. Instead of comparing single letters, the Explorer checks if pairs of such terminals are "disjoint" or "conflicting".

In the following, the Explorer is introduced by applying it to two grammars supplied with Mouse as examples.

Explore Examples

Example 4

Using command prompt, go to the directory Mouse\examples\Example4 and enter the command:

java mouse.ExplorePEG -G myGrammar.txt
The command should open a window like this:

You see there all Rules of the grammar. They are listed in alphabetical order to facilitate finding them in a large grammar.

By clicking the "All" button you can see all subexpressions, with names indicating their positions within a Rule:

By selecting a Rule and clicking the "Find" button you see only the expressions using that Rule. You can use the "Rules", "All", and "Find" buttons to switch between these views.

Explorer checks the LL(1) condition for all expressions and lets you explore all cases that do not satisfy the condition. You can see thse expressions by clicking the "Non-LL1" button. The result is as follows:

It tells you that the choice in "Number" does not satisfy LL(1). Which is rather obvious since both alternatives may start with "Digits". To explore it, click on the line. Or select the line and click the "Show" button. The result is a new window that looks like this:

You see here again the LL(1) conflict that you are going to explore, and the reason for the conflict: both alternatives define strings that may start with [0-9]. Below you see both alternatives ready for exploration. The expressions above and below "========" are, respectively, the expressions e1 and e2 Tail(A) in the condition (C) that you want to check. You are going to transform these expressions, and in the process perhaps remove sub-expressions that are not interesting.

You may start by clicking on "Digits?". (You may instead select "Digits?" and click the "Expand" button.) The result is as follows:

You "expanded" the "Digits?" into its two alternatives: namely "Digits" and empty string, and created two alternatives of the original expression, one with each alternative of "Digits?". The top expression is now

(Digits "." Digits Space) / ("." Digits Space),
with two alternatives represented as two lines. It is equivalent to the original e1.

Clearly, the second line can not define a prefix of the bottom expression, so this alternative can be ignored in further exploration. By clicking the "Filter" button, you remove alternatives with first terminals disjoint with first terminals of all strings represented by expressions on the other side of "========". The result is as follows:

To see if any string defined by the top expression can be a prefix any string defined by the bottom expression, you need to see the strings represented by "Tail(Number)". These are the strings that can follow "Number" in an input conforming to the grammar. You find them by clicking on "Tail(Number)", which produces this result:

Here "Tail(Number)" has been expanded to "(AddOp Number)* !_".

You can immediately see that "Digits" can only be followed here by a plus or minus or end of text, possibly preceded by white space. As "Digits" in the top expression are always followed by a dot, none of these strings can be a prefix of a string defined by bottom expression.

To become better convinced, you may click the "Strip" button. It eliminates "Digits" from furhter consideration:

Clicking now the "Filter" button produces this:

showing that the top and bottom expressions to the right of "--" had mutually disjoint first terminals. You have thus verified that the expression being explored satisfies (C), and its limited backtracking is efficient. As it was the only one left after the LL(1) check, limited backtracking is efficient for all expressions in the grammar. The grammar defines thus the same language as its EBNF interpretation.

A note: The original meaning of "LL(1)" is that a top-down parser can choose its way by looking at the next input letter. Here we have the situation where the parser can choose its way by looking at input within the reach of two parsing procedures: the "Digits" and whatever follows. It was suggested that such grammar should be called "LL(2p)".

Example 5

Using command prompt, go to the directory Mouse\examples\Example5 and enter the command:

java mouse.ExplorePEG -G myGrammar.txt
The command should open this window:

A click on "Non-LL1" produces this:

A click on the first line opens this window:

It tells you that the expression "Digits" does not satisfy LL(1). Recall that "Digits" is defined as [0-9]+, which is a syntactic sugar for [0-9]([0-9][0-9]* / ε ).
The expression [0-9][0-9]* / ε does not satisfy LL(1) because both [0-9][0-9]* and ε Tail(Digits) define strings starting with [0-9].

But, the expression [0-9][0-9]* / ε does not either satisfy (C). One of the strings defined by [0-9][0-9]* is the terminal [0-9]. And strings defined by "Tail(Digits)" may have [0-9] as prefix.

It means that the parsing expression [0-9]+ will gobble up any [0-9] from the following "Tail(Digits)" and will never backtrack to try gobbling less. You may want to find out where does this happen. For this purpose you have to take a closer look at "Tail(Digits)".

A click on "Tail(Digits)" has this effect:

A click on "Filter" followed by a click on "Tail(Factor)" produces this:

Finally, a click on "(MultOp Factor)*" gives this result:

The expressions under "========" represent some of the possible strings belonging to "Tail(Digits)". You know that both "Space" and "MultOp" may both represent empty strings. (You may well click on them to check). You also know that "Factor" may start with "Digits", and thus with [0-9]. (Again, you may use some clicks to check that.)

Here is the culprit: "Digits" - at the end of a "Factor" - may be immediately followed by "Digits" starting another "Factor". The EBNF form of the grammar is ambigous: "23" may be interpreted as either the number "23" or product "2 times 3", with multiplication operator replaced by empty string (which is allowed by the grammar). PEG will choose one of these alternatives, namely number "23" and not backtrack to try the other one. As this happens to be exactly how a human reader interprets "23", you may accept this behavior. (A blank between the Factors removes the ambiguity: "2 3" is uniquely interpreted as "2 times 3".)

You can now close the window, which brings you back the the "Not LL1" window, and explore the other case shown there. It boils down the the same problem: the ambiguity when a multiplication operator between two Factors is the empty string.

More tools

The two examples above presented most of the functions used to explore an LL(1) conflict. Here comes the rest.

Select single terminal conflict

Sometimes an LL(1) violation may involve sevaral pairs of conflicting terminals, like in this case:

It comes from one of the sample Java grammars. You can see here that, for example, a string in "FloatLiteral" may start with [0-9] which the parser may confuse with "0x" that starts a string in "IntegerLiteral". You may decide to examine only the strings starting with exactly these terminals. To do this, you click on the line [0-9] <==> "0x". It changes the window as follows:

From this moment on, expressions not containing strings starting with selected terminals will be filtered out. For example, two steps in expanding "FloatLiteral" and "IntegerLiteral" will look like tis:

You see that the second step eliminated "HexFloat" that never starts with [0-9] as well as "OctalNumeral" and "DecimalNumeral" that never not start with "0x".
(Of course, this filtering stops as soon as you use the "Strip" function.)

Ignore part of expression

Sometimes you want to ignore a specific subexpression such as "Space". By selecting it with the mouse and clicking the "Delete" button you remove it from the expression.

You may also remove a line by selecting any item within it and clicking "Delete Line". By clicking "Choose Line" you remove all lines except the selected one.

Return to square one

By clicking the "Reset" button you return to the original presentation of the conflict.
(An "undo" function reversing a single operation may be implemented later on.)

Handle predicates

An automatic checking of the LL(1) condition in presence of predicates is difficult. Therefore the Explorer approximates the condition by assuming that all predicates succeed. This results in many false alarms, and you have to decide by inspection whether the alarm id false or not. The following is a typical example taken again from the Java sample grammar:

You decide to have a closer look at the first pair of terminals and after a couple of "Expand" operations you obtain this:

A final click on "Identifier" reveals this:

In the computation of LL(1), "!Keyword" was assumed to succeed, that is, to represent empty string; thus, the bottom expression was allowed to define strings starting with "Letter", which may be any of [a-z], for example, b. This also happens to be the first letter in "byte", so we have an LL(1) conflict.

This version of Explorer does not allow you to expand the contents of the predicate, but you can find in the grammar that "byte" followed by !LetterOrDigit is a special case of "Keyword". Thus, "!Keyword" fails on that string, so it can not be a prefix of anything defined by the bottom expression. The alarm was false and the choice satisfies (C).

Explore expressions

First

You can explore individual expressions starting with the window "Rules", "All", or "Find". In each of them you can select an expression with the mouse.

Suppose you have selected "Factor". By clicking on the "First" button, you open a new window that shows the expressions that "Factor" may call to consume the first portion of its input:

By clicking on any of them (or selecting and clicking "More") you see expressions that the selected expression may call first:

Here you clicked on "Factor.1", and you can see that it can may call "." or "Digits?" to consume the first portion of its input. Proceeding in this way, you can see all expressions that "Factor.1" may invoke directly or indirectly to consume the first portion of its input:

By pressing "All" you see all expressions that "Factor" may invoke directly or indirectly as first:

Tail

By clicking on the "Tail" button, you display the Tail of expression selected on the "Rules", "All", or "Find" window.

Selecting "Digits" and clicking "Tail" opens this window:

It shows what can follow after "Digits" when it is invoked in three different contexts. (As you can see, "Digits" is invoked in three different places, all from "Factor".) For more information about the context, you click on a line, or select a line and click "Explain". As an example, selecting the third line gives this result:

You can see here how "Digits" invoked from "Digits?" in "Factor" is eventually followed by ".", another "Digits", and "Space" before "Factor" eventually returns control to its caller and is followed by "Tail(Factor)".


Latest change 2017-03-01