Parsing expression grammar

Parsing expression grammar

A parsing expression grammar, or PEG, is a type of analytic formal grammar that describes a formal language in terms of a set of rules for recognizing strings in the language. A parsing expression grammar essentially represents a recursive descent parser in a pure schematic form that expresses only syntax and is independent of the way an actual parser might be implemented or what it might be used for. Parsing expression grammars look similar to regular expressions or context-free grammars (CFG) in Backus-Naur form (BNF) notation, but have a different interpretation.

Unlike CFGs, PEGs are not ambiguous; if a string parses, it has exactly one valid parse tree. This suits PEGs well to parsing computer languages, but not natural languages.

Definition

Formally, a parsing expression grammar consists of:

* A finite set "N" of "nonterminal symbols".
* A finite set Σ of "terminal symbols" that is disjoint from "N".
* A finite set "P" of "parsing rules".

Each parsing rule in "P" has the form "A" → "e", where "A" is a nonterminal symbol and "e" is a "parsing expression". A parsing expression is a hierarchical expression similar to a regular expression, which is constructed in the following fashion:

# An "atomic parsing expression" consists of:
#* any terminal symbol,
#* any nonterminal symbol, or
#* the empty string ε.
# Given any existing parsing expressions "e", "e"1, and "e"2, a new parsing expression can be constructed using the following operators:
#* "Sequence": "e"1 "e"2
#* "Ordered choice": "e"1 / "e"2
#* "Zero-or-more": "e"*
#* "One-or-more": "e"+
#* "Optional": "e"?
#* "And-predicate": &"e"
#* "Not-predicate": !"e"

Unlike in a context-free grammar or other generative grammars, in a parsing expression grammar there must be "exactly one rule" in the grammar having a given nonterminal on its left-hand side. That is, rules act as "definitions" in a PEG, and each nonterminal must have one and only one definition.

Interpretation of parsing expressions

Each nonterminal in a parsing expression grammar essentially represents a parsing function in a recursive descent parser, and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results:

* "success", in which the function may optionally move forward or "consume" one or more characters of the input string supplied to it, or
* "failure", in which case no input is consumed.

A nonterminal may succeed without actually consuming any input, and this is considered an outcome distinct from failure.

An atomic parsing expression consisting of a single terminal succeeds if the first character of the input string matches that terminal, and in that case consumes the input character; otherwise the expression yields a failure result. An atomic parsing expression consisting of the empty string always trivially succeeds without consuming any input. An atomic parsing expression consisting of a nonterminal "A" represents a recursive call to the nonterminal-function "A".

The sequence operator "e"1 "e"2 first invokes "e"1, and if "e"1 succeeds, subsequently invokes "e"2 on the remainder of the input string left unconsumed by "e"1, and returns the result. If either "e"1 or "e"2 fails, then the sequence expression "e"1 "e"2 fails.

The choice operator "e"1 / "e"2 first invokes "e"1, and if "e"1 succeeds, returns its result immediately. Otherwise, if "e"1 fails, then the choice operator backtracks to the original input position at which it invoked "e"1, but then calls "e"2 instead, returning "e"2's result.

The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression "e", respectively. Unlike in context-free grammars and regular expressions, however, these operators "always" behave greedily, consuming as much input as possible and never backtracking. (Regular expressions start by matching greedily, but they backtrack and try shorter matches if they fail to match.) For example, the expression a* will always consume as many a's as are consecutively available in the input string, and the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match.

Finally, the and-predicate and not-predicate operators implement syntactic predicates. The expression &"e" invokes the sub-expression "e", and then succeeds if "e" succeeds and fails if "e" fails, but in either case "never consumes any input". Conversely, the expression !"e" succeeds if "e" fails and fails if "e" succeeds, again consuming no input in either case. Because these can use an arbitrarily complex sub-expression "e" to "look ahead" into the input string without actually consuming it, they provide a powerful syntactic lookahead and disambiguation facility.

Examples

This is a PEG that recognizes mathematical formulas that apply the basic four operations to non-negative integers.

:"Value" ← [0-9] + / '(' "Expr" ')':"Product" ← "Value" (('*' / '/') "Value")*:"Sum" ← "Product" (('+' / '-') "Product")*:"Expr" ← "Sum"

In the above example, the terminal symbols are characters of text, represented by characters in single quotes, such as '(' and ')'. The range [0-9] is also a shortcut for ten characters, indicating any one of the digits 0 through 9. (This range syntax is the same as the syntax used by regular expressions.) The nonterminal symbols are the ones that expand to other rules: "Value", "Product", "Sum", and "Expr".

The examples below drop quotation marks in order to be easier to read. Lowercase letters are terminal symbols, while capital letters in italics are nonterminals. Actual PEG parsers would require the lowercase letters to be in quotes.

The parsing expression (a/b)* matches and consumes an arbitrary-length sequence of a's and b's. The rule "S" ← a "S"? b describes the simple context-free "matching language" { a^n b^n : n ge 1 } . The following parsing expression grammar describes the classic non-context-free language { a^n b^n c^n : n ge 1 } :

:"S" ← &("A" !b) a+ "B" !c:"A" ← a "A"? b:"B" ← b "B"? c

The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a context-free grammar, this construct yields the classic dangling else ambiguity.):"S" ← if "C" then "S" else "S" / if "C" then "S"

The parsing expression foo &(bar) matches and consumes the text "foo" but only if it is followed by the text "bar". The parsing expression foo !(bar) matches the text "foo" but only if it is "not" followed by the text "bar". The expression !(a+ b) a matches a single "a" but only if it is not the first in an arbitrarily-long sequence of a's followed by a b.

The following recursive rule matches Pascal-style nested comment syntax, (* which can (* nest *) like this *). The comment symbols appear in double quotes to distinguish them from PEG operators.

:"Begin" ← "(*":"End" ← "*)":"C" ← "Begin" "N"* "End"
:"N" ← "C" / (!"Begin" !"End" "Z")
:"Z" ← "any single character"

Implementing parsers from parsing expression grammars

Any parsing expression grammar can be converted directly into a recursive descent parserFact|date=February 2007. Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case.

By memoizing the results of intermediate parsing steps and ensuring that each parsing function is only invoked at most once at a given input position, however, it is possible to convert any parsing expression grammar into a "packrat parser", which always runs in linear time at the cost of substantially greater storage space requirements.

A packrat parsercite web
url=http://pdos.csail.mit.edu/~baford/packrat/thesis
title=Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking
accessdate=2007-07-27
last=Ford
first=Bryan
year=2002
month=September
publisher=Massachusetts Institute of Technology
] is a form of parser similar to a recursive descent parser in construction, except that during the parsing process it memoizes the intermediate results of all invocations of the mutually recursive parsing functions. Because of this memoization, a packrat parser has the ability to parse many context-free grammars and "any" parsing expression grammar (including some that do not represent context-free languages) in linear time.

It is also possible to build LL parsers and LR parsers from parsing expression grammars, but the unlimited lookahead capability of the grammar formalism is lost in this case.

Advantages

PEGs make a good replacement for regular expressions, because they are strictly more powerful. For example, a regular expression inherently cannot find matched pairs of parentheses, because it is not recursive, but a PEG can.

Any PEG can be parsed in linear time by using a packrat parser, as described above.

Parsers for languages expressed as a CFG, such as LR parsers, require a separate tokenization step to be done first, which breaks up the input based on the location of spaces, punctuation, etc. The tokenization is necessary because of the way these parsers use "lookahead" to parse CFGs that meet certain requirements in linear time. PEGs do not require tokenization to be a separate step, and tokenization rules can be written in the same way as any other grammar rule.

Many CFGs contain inherent ambiguities, even when they're intended to describe unambiguous languages. The "dangling else" problem in C, C++, and Java is one example. These problems are often resolved by applying a rule outside of the grammar. In a PEG, these ambiguities never arise, because of prioritization.

Disadvantages

PEGs are new and not widely implemented. In contrast, regular expressions and CFGs have been around for decades, the code to parse them has been extensively optimized, and many programmers are familiar with how to use them.

PEGs cannot express "left-recursive" rules where a rule refers to itself without moving forward in the string. For example, in the arithmetic grammar above, it would be tempting to move some rules around so that the precedence order of products and sums could be expressed in one line:

:"Value" ← [0-9.] + / '(' "Expr" ')':"Product" ← "Expr" (('*' / '/') "Expr")*:"Sum" ← "Expr" (('+' / '-') "Expr")*:"Expr" ← "Product" / "Sum" / "Value"

The problem with this is that it says that to match an Expr, you need to first see if a Product matches there, and to match a Product, you need to first see if an Expr matches there. This is not possible.

However, left-recursive rules can always be rewritten to eliminate left-recursion. For example, a left-recursive rule can repeat a certain expression indefinitely, as in the CFG rule:

:"string-of-a" ← "string-of-a" 'a' | 'a'

This can be rewritten in a PEG using the plus operator:

:"string-of-a" ← 'a'+

PEGs are also associated with "packrat parsing", which uses memoization to eliminate redundant parsing steps. Packrat parsing requires storage proportional to the total input size, rather than the depth of the parse tree as with LR parsers. With some modification, traditional packrat parsing can be made to support left recursion.cite paper
author = Alessandro Warth, James R. Douglass, Todd Millstein
title = Packrat Parsers Can Support Left Recursion
date = January 2008
url = http://www.vpri.org/pdf/tr2007002_packrat.pdf
format = PDF
publisher = Viewpoints Research Institute
accessdate = 2008-10-02
]

See also

* Formal grammar
* Regular expressions
* Top-down parsing language
* Comparison of parser generators

References

External links

* [http://www.brynosaurus.com/pub/lang/peg-slides.pdf Parsing Expression Grammars: A Recognition-Based Syntactic Foundation] (PDF slides)
* [http://pdos.csail.mit.edu/~baford/packrat/ The Packrat Parsing and Parsing Expression Grammars Page]
* [http://pdos.csail.mit.edu/~baford/packrat/thesis/ Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking]
* [http://www.cs.nyu.edu/rgrimm/xtc/rats.html Rats!]
* [http://treetop.rubyforge.org/ Treetop] is a PEG parser generator for the Ruby Programming Language.
* The constructed language Lojban has a [http://www.digitalkingdom.org/~rlpowell/hobbies/lojban/grammar/ fairly large PEG grammar] allowing unambiguous parsing of Lojban text.
* The [http://www.inf.puc-rio.br/~roberto/lpeg/ LPeg] library provides support for PEGs in Lua.
* The Aurochs parser generator has an [http://aurochs.fr/cgi/demo.cgi on-line parsing demo] that displays the parse tree for any given grammar and input
* [http://code.google.com/p/pegtl/ PEGTL] is a PEG parser generator toolkit which uses C++ templates to generate parsers during the compilation (as opposed to runtime) phase. Requires a C++ compiler supporting the C++0x standard.
* [http://fossil.wanderinghorse.net/repos/parsepp/ parsepp] is similar to PEGTL, but works using current (pre-C++0x) C++ compilers.
* [http://www.cs.ucla.edu/~awarth/ometa/ OMeta] , a PEG parser with various enhancement, and which can process non-character input streams.
* [http://piumarta.com/software/peg/ peg/leg] , a PEG parser generator for C, works similarly to lex/yacc. That is, the client provides a PEG grammar as input and it generates C code implementing the grammar.
* [http://fossil.wanderinghorse.net/repos/pegc/ pegc] is a C library similar in design to [http://fossil.wanderinghorse.net/repos/parsepp/ parsepp] and [http://code.google.com/p/pegtl/ PEGTL] , for creating parsers via the composition of rule objects (C structs).


Wikimedia Foundation. 2010.

Игры ⚽ Нужно решить контрольную?

Look at other dictionaries:

  • Parsing — Ein Parser [ˈpɑːɹsɚ] (engl. to parse „analysieren“ bzw. von lateinisch pars „Teil“; im Deutschen gelegentlich auch Zerteiler) ist ein Computerprogramm, das in der Computertechnik für die Zerlegung und Umwandlung einer beliebigen Eingabe in ein… …   Deutsch Wikipedia

  • Formal grammar — In formal semantics, computer science and linguistics, a formal grammar (also called formation rules) is a precise description of a formal language ndash; that is, of a set of strings over some alphabet. In other words, a grammar describes which… …   Wikipedia

  • Top-down parsing language — (TDPL) is a type of analytic formal grammar developed by Alexander Birman in the early 1970s in order to study formally the behavior of a common class of practical top down parsers that support a limited form of backtracking. Birman originally… …   Wikipedia

  • Context-free grammar — In formal language theory, a context free grammar (CFG) is a formal grammar in which every production rule is of the form V → w where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty). The… …   Wikipedia

  • Parser Grammar Engine — The Parser Grammar Engine (originally Parrot Grammar Engine) or PGE is a compiler and runtime for a Perl 6 rules for the Parrot virtual machine. [cite web | url=http://search.cpan.org/ ltoetsch/parrot 0.2.2/compilers/pge/README | title=Parrot… …   Wikipedia

  • Parser Grammar Engine — Moteur d analyse de grammaire Pour les articles homonymes, voir PGE. Le Parser Grammar Engine (PGE ou en français, moteur d analyse de grammaire) est un compilateur et un moteur d exécution pour les regex Perl 6 pour la machine virtuelle… …   Wikipédia en Français

  • Parsing — In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a sequence of tokens to determine their grammatical structure with respect to a given (more or less) formal grammar.Parsing is also… …   Wikipedia

  • Regular expression — In computing, a regular expression provides a concise and flexible means for matching (specifying and recognizing) strings of text, such as particular characters, words, or patterns of characters. Abbreviations for regular expression include… …   Wikipedia

  • Definite clause grammar — A definite clause grammar (DCG) is a way of expressing grammar, either for natural or formal languages, in a logic programming language such as Prolog. DCGs are usually associated with Prolog, but similar languages such as Mercury also include… …   Wikipedia

  • Ambiguous grammar — In computer science, a grammar is said to be an ambiguous grammar if there is some string that it can generate in more than one way (i.e., the string has more than one parse tree or more than one leftmost derivation). A language is inherently… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”