PEG Explorer

Part of Mouse version 1.9.2


The limited backtracking of PEG applies to expressions of the form e1/e2. It means that once e1 succeeded, there is no attempt to try e2 in the case of a later failure. Intuitively, e1/e2 should define the union of languages of e1 and e2. But, limited backtracking may exclude the language of e2 or part of it, with unexpected and confusing results.

One can easily see that limited backtracking does not hide e2 if A = e1/e2 satisfies this condition:

(C) No string defined by e1 is a prefix of any string defined by e2 Tail(A) where Tail(A) is the set of all strings that can follow A in a string conforming to the grammar.

Clearly, if e1 succeeded, the input starts with a string defined by e1. A successful backtracking would mean that the same input conforms to e2 Tail(A), which is excluded by (C). Thus, backtracking must fail and skipping it makes no difference. We say that the limited backtracking is "efficient": it finds everything that would be found by full backtracking. We also say that the expression is "disjoint".

A special case of (C) is that first letters of strings defined by e1 are distinct from first letters of strings defined by e2 Tail(A). This is known as the LL(1) condition, and is easily checked. Many expressions that do not satisfy LL(1) still satisfy (C), but there is no general way to check this. However, a large number of cases can be checked by inspection. The Explorer checks the LL(1) condition for all expressions e1/e2 in a given grammar and lets you inspect those that fail the test.

The expressions A = e?, A = e* and A = e+ are abbreviations for, respectively, A = e/ε, A = eA/ε and A = ee*, where ε is the empty word. For these expressions (C) becomes:

(C') No string defined by e is a prefix of any string defined by Tail(A),

and the LL(1) condition is that the first letters of strings defined by e are distinct from first letters of strings defined by Tail(A).

Instead of working with single letters, PEG specifies "terminals" that may be strings of letters of sets of letters. Instead of comparing single letters, the Explorer checks if pairs of such terminals are "disjoint".

In the following, the Explorer is introduced by applying it to two grammars supplied with Mouse as examples.

Explore Examples

Example 4

Using command prompt, go to the directory Mouse\examples\Example4 and enter the command:

java mouse.ExplorePEG -G myGrammar.txt
The command should open a window like this:

You see there all Rules of the grammar. They are listed in alphabetical order to facilitate finding them in a large grammar.

By clicking the "All" button you can see all subexpressions, with names indicating their positions within a Rule:

By selecting a Rule and clicking the "Find" button you see only the expressions using that Rule. You can use the "Rules", "All", and "Find" buttons to switch between these views.

By clicking the "Non-LL1" button, you can see all choice expressions that do not satisfy LL(1). The result is here as follows:

It tells you that the choice in "Number" does not satisfy LL(1). Which is rather obvious since both alternatives may start with "Digits". To explore it, click on the line. Or select the line and click the "Show" button. The result is a new window that looks like this:

You see here again the LL(1) conflict that you are going to explore, and the reason for the conflict: both alternatives define strings that may start with [0-9]. The expressions above and below "========" are, respectively, e1 and e2 Tail(A) in the condition (C) that you want to check. To see if they satisfy (C), you will transform the expressions.

You may start by clicking on "Digits?". (You may instead select "Digits?" and click the "Expand" button.) The result is as follows:

You replaced "Digits?" by its defiition, that is, "Digits / ε". The top expression is now

(Digits "." Digits Space) / ("." Digits Space),
with two alternatives represented as two lines. It is equivalent to the original e1.

Clearly, the second line can not define a prefix of the bottom expression, so this alternative can be ignored in further exploration. By clicking the "Filter" button, you remove alternatives with first terminals disjoint with first terminals of all strings represented by expressions on the other side of "========". The result is as follows:

To see if any string defined by the top expression can be a prefix any string defined by the bottom expression, you need to see the strings represented by "Tail(Number)". These are the strings that can follow "Number" in an input conforming to the grammar. You find them by clicking on "Tail(Number)", which produces this result:

Here "Tail(Number)" has been expanded to "(AddOp Number)* !_".

You can immediately see that "Digits" can only be followed here by a plus or minus or end of text, possibly preceded by white space. As "Digits" in the top expression are always followed by a dot, none of these strings can be a prefix of a string defined by bottom expression.

To become better convinced, you may click the "Strip" button. It eliminates "Digits" from furhter consideration:

Clicking now the "Filter" button produces this:

showing that the top and bottom expressions to the right of "--" had mutually disjoint first terminals. You have thus verified that the expression being explored satisfies (C), and its limited backtracking is efficient. As it was the only one left after the LL(1) check, limited backtracking is efficient for all expressions in the grammar. The grammar defines thus the same language as its EBNF interpretation.

A note: The original meaning of "LL(1)" is that a top-down parser can choose its way by looking at the next input letter. Here we have the situation where the parser can choose its way by looking at input within the reach of two parsing procedures: the "Digits" and whatever follows. It was suggested that such grammar should be called "LL(2p)".

Example 5

Using command prompt, go to the directory Mouse\examples\Example5 and enter the command:

java mouse.ExplorePEG -G myGrammar.txt
The command should open this window:

A click on "Non-LL1" produces this:

A click on the first line opens this window:

It tells you that the expression "Digits", defined as [0-9]+, does not satisfy LL(1). The condition (C') is in this case that [0-9] must not be a prefix of any string in Tail(Digits), which is stated by the first line in the window. The second line shows that, unfortunately, Tail(Digits) contains a string starting with [0-9]. It means that "Digits" does not satisfy (C'). The parsing expression [0-9]+ will gobble up any [0-9] from the following Tail(Digits) and will never backtrack to try gobbling less. You may want to find out where does this happen. For this purpose you have to take a closer look at "Tail(Digits)".

A click on "Tail(Digits)" has this effect:

A click on "Filter" followed by a click on "Tail(Factor)" produces this:

Finally, a click on "(MultOp Factor)*" gives this result:

The expressions under "========" represent some of the possible strings belonging to "Tail(Digits)". You know that both "Space" and "MultOp" may both represent empty strings. (You may well click on them to check). You also know that "Factor" may start with "Digits", and thus with [0-9]. (Again, you may use some clicks to check that.)

Here is the culprit: the "Digits" in question must be those appearing at the end of a "Factor". Our inspection shows that they may be immediately followed by "Digits" starting another "Factor". The EBNF form of the grammar is thus ambiguous: with empty string as multiplication operator, "23" may be interpreted as either the number "23" or product "2 times 3". PEG will choose one of these alternatives, namely the number "23" and not backtrack to try the other one. As this happens to be exactly how a human reader interprets "23", you may accept this behavior.

You can now close the window, which brings you back the the "Not LL1" window, and explore the other case shown there. It boils down the the same problem: the ambiguity when a multiplication operator between two Factors is the empty string.

More tools

The two examples above presented most of the functions used to explore an LL(1) conflict. Here comes the rest.

Select single terminal conflict

Sometimes an LL(1) violation may involve sevaral pairs of conflicting terminals, like in this case:

It comes from one of the sample Java grammars. You can see here that, for example, a string in "FloatLiteral" may start with [0-9] which the parser may confuse with "0x" that starts a string in "IntegerLiteral". You may decide to examine only the strings starting with exactly these terminals. To do this, you click on the line [0-9] <==> "0x". It changes the window as follows:

From this moment on, expressions not containing strings starting with selected terminals will be filtered out. For example, two steps in expanding "FloatLiteral" and "IntegerLiteral" will look like tis:

You see that the second step eliminated "HexFloat" that never starts with [0-9] as well as "OctalNumeral" and "DecimalNumeral" that never not start with "0x".
(Of course, this filtering stops as soon as you use the "Strip" function.)

Ignore part of expression

Sometimes you want to ignore a specific subexpression such as "Space". By selecting it with the mouse and clicking the "Delete" button you remove it from the expression.

You may also remove a line by selecting any item within it and clicking "Delete Line". By clicking "Choose Line" you remove all lines except the selected one.

Return to square one

By clicking the "Reset" button you return to the original presentation of the conflict.
(An "undo" function reversing a single operation may be implemented later on.)

Handle predicates

An automatic checking of the LL(1) condition in presence of predicates is difficult. Therefore the Explorer approximates the condition by assuming that all predicates succeed. This results in many false alarms, and you have to decide by inspection whether the alarm id false or not. The following is a typical example taken again from the Java sample grammar:

You decide to have a closer look at the first pair of terminals and after a couple of "Expand" operations you obtain this:

A final click on "Identifier" reveals this:

In the computation of LL(1), "!Keyword" was assumed to succeed, that is, to represent empty string; thus, the bottom expression was allowed to define strings starting with "Letter", which may be any of [a-z], for example, b. This also happens to be the first letter in "byte", so we have an LL(1) conflict.

This version of Explorer does not allow you to expand the contents of the predicate, but you can find in the grammar that "byte" followed by !LetterOrDigit is a special case of "Keyword". Thus, "!Keyword" fails on that string, so it can not be a prefix of anything defined by the bottom expression. The alarm was false and the choice satisfies (C).

Explore expressions

First

You can explore individual expressions starting with the window "Rules", "All", or "Find". In each of them you can select an expression with the mouse.

Suppose you have selected "Factor". By clicking on the "First" button, you open a new window that shows the expressions that "Factor" may call to consume the first portion of its input:

By clicking on any of them (or selecting and clicking "More") you see expressions that the selected expression may call first:

Here you clicked on "Factor.1", and you can see that it can may call "." or "Digits?" to consume the first portion of its input. Proceeding in this way, you can see all expressions that "Factor.1" may invoke directly or indirectly to consume the first portion of its input:

By pressing "All" you see all expressions that "Factor" may invoke directly or indirectly as first:

Tail

By clicking on the "Tail" button, you display the Tail of expression selected on the "Rules", "All", or "Find" window.

Selecting "Digits" and clicking "Tail" opens this window:

It shows what can follow after "Digits" when it is invoked in three different contexts. (As you can see, "Digits" is invoked in three different places, all from "Factor".) For more information about the context, you click on a line, or select a line and click "Explain". As an example, selecting the third line gives this result:

You can see here how "Digits" invoked from "Digits?" in "Factor" is eventually followed by ".", another "Digits", and "Space" before "Factor" eventually returns control to its caller and is followed by "Tail(Factor)".


Latest change 2017-12-14