Lexical analysis

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. Programs performing lexical analysis are called lexical analyzers or lexers. A lexer is often organized as separate scanner and tokenizer functions, though the boundaries may not be clearly defined.

Lexical grammar

The specification of a programming language includes a set of rules, often expressed syntactically, that specifies the set of possible character sequences which can form a token or lexeme. Whitespace characters are often ignored during lexical analysis.

Token

A token is a categorized block of text. The block of text corresponding to the token is known as a lexeme. A lexical analyzer processes lexemes to categorize them according to their function, giving them meaning. This assignment of meaning is known as tokenization. A token can consist of any characters; it just needs to be a useful part of the structured text.

Consider this expression in the C programming language:

    sum=3+2;

It is tokenized in the following table:

    Lexeme    Token type
    sum       identifier
    =         assignment operator
    3         integer literal
    +         addition operator
    2         integer literal
    ;         end of statement
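
A hand-written lexer for this expression could be sketched in C roughly as follows; the token names and the next_token function are hypothetical choices for this illustration, not part of any standard interface:

    #include <ctype.h>
    #include <stdio.h>

    /* Hypothetical token categories for the expression above. */
    enum token_type {
        IDENTIFIER, ASSIGN_OP, INTEGER_LITERAL, ADD_OP, END_OF_STATEMENT, END_OF_INPUT
    };

    struct token {
        enum token_type type;
        char lexeme[32];      /* text of the lexeme (no bounds checks in this sketch) */
    };

    /* Reads one token starting at *input and advances *input past its lexeme. */
    static struct token next_token(const char **input)
    {
        struct token t = { END_OF_INPUT, "" };
        const char *p = *input;
        int n = 0;

        while (isspace((unsigned char)*p))                 /* whitespace is ignored */
            p++;

        if (isalpha((unsigned char)*p) || *p == '_') {     /* identifier */
            t.type = IDENTIFIER;
            while (isalnum((unsigned char)*p) || *p == '_')
                t.lexeme[n++] = *p++;
        } else if (isdigit((unsigned char)*p)) {           /* integer literal */
            t.type = INTEGER_LITERAL;
            while (isdigit((unsigned char)*p))
                t.lexeme[n++] = *p++;
        } else if (*p == '=' || *p == '+' || *p == ';') {  /* single-character tokens */
            t.type = (*p == '=') ? ASSIGN_OP : (*p == '+') ? ADD_OP : END_OF_STATEMENT;
            t.lexeme[n++] = *p++;
        }
        t.lexeme[n] = '\0';
        *input = p;
        return t;
    }

    int main(void)
    {
        const char *src = "sum=3+2;";
        struct token t;
        while ((t = next_token(&src)).type != END_OF_INPUT)
            printf("%d\t%s\n", (int)t.type, t.lexeme);
        return 0;
    }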

Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer (either generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is called "tokenizing." If the lexer finds an invalid token, it will report an error.

Tokenizing is followed by parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.

Consider a text describing a calculation:

    46 - number_of(cows);

The lexemes here might be: "46", "-", "number_of", "(", "cows", ")" and ";". The lexical analyzer will denote the lexeme "46" as 'number', "-" as 'character' and "number_of" as a separate token. Even the lexeme ";" in some languages (such as C) has some special meaning.

Scanner

The first stage, the scanner, is usually based on a finite state machine. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an "integer" token may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch rule). In some languages the lexeme creation rules are more complicated and may involve backtracking over previously read characters.
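
As a rough sketch, the scanner for an "integer" token might be written in C as an explicit finite state machine; the state names are invented for this example, and the loop applies the maximal munch rule by consuming digits until the first character that cannot extend the token:

    #include <ctype.h>
    #include <stdio.h>

    /* Hypothetical scanner states for recognizing an integer token. */
    enum state { START, IN_INTEGER, DONE };

    /* Returns the length of the integer lexeme at the start of s, or 0 if none. */
    static int scan_integer(const char *s)
    {
        enum state st = START;
        int len = 0;

        while (st != DONE) {
            switch (st) {
            case START:
                if (isdigit((unsigned char)s[len])) {
                    st = IN_INTEGER;      /* first digit starts the token */
                    len++;
                } else {
                    st = DONE;            /* no integer token here */
                }
                break;
            case IN_INTEGER:
                if (isdigit((unsigned char)s[len]))
                    len++;                /* maximal munch: keep consuming digits */
                else
                    st = DONE;            /* first non-digit ends the lexeme */
                break;
            case DONE:
                break;
            }
        }
        return len;
    }

    int main(void)
    {
        printf("%d\n", scan_integer("12345+7"));  /* prints 5: the lexeme "12345" */
        return 0;
    }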

Tokenizer

"Tokenization" is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

Take, for example, the following string. Unlike humans, a computer cannot intuitively 'see' that there are 9 words. To a computer this is only a series of 43 characters.

    The quick brown fox jumps over the lazy dog

A process of tokenization could be used to split the sentence into word tokens. Although the following example is given as XML, there are many ways to represent tokenized input:

    <sentence>
      <word>The</word>
      <word>quick</word>
      <word>brown</word>
      <word>fox</word>
      <word>jumps</word>
      <word>over</word>
      <word>the</word>
      <word>lazy</word>
      <word>dog</word>
    </sentence>

A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a "value". The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)
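
A minimal C sketch of the evaluator stage might look like the following; the struct layout and the evaluate_integer helper are hypothetical, intended only to show a lexeme's type being paired with a computed value:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical token representation: a category plus an evaluated value. */
    enum token_type { TOKEN_INTEGER, TOKEN_IDENTIFIER };

    struct token {
        enum token_type type;
        long int_value;       /* filled in by the evaluator for TOKEN_INTEGER */
    };

    /* Evaluator for an integer lexeme: turns its characters into a numeric value. */
    static struct token evaluate_integer(const char *lexeme)
    {
        struct token t;
        t.type = TOKEN_INTEGER;
        t.int_value = strtol(lexeme, NULL, 10);   /* "42" becomes the number 42 */
        return t;
    }

    int main(void)
    {
        struct token t = evaluate_integer("42");
        printf("type=%d value=%ld\n", (int)t.type, t.int_value);
        return 0;
    }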

For example, in the source code of a computer program the string

    net_worth_future = (assets - liabilities);

might be converted (with whitespace suppressed) into the lexical token stream:

NAME "net_worth_future" EQUALS OPEN_PARENTHESIS NAME "assets" MINUS NAME "liabilities" CLOSE_PARENTHESIS SEMICOLON

Though it is possible and sometimes necessary to write a lexer by hand, lexers are often generated by automated tools. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct a state table for a finite state machine (which is plugged into template code for compilation and execution).
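
For illustration, a table-driven engine of the kind such a tool might construct can be sketched by hand in C; the transition table below is written manually for a machine that accepts decimal integers, only to show the shape of the template code that drives a generated state table (compare the switch-based scanner sketched earlier):

    #include <stdio.h>

    /* Hand-written stand-in for a generated state table.
       States: 0 = start, 1 = in integer, 2 = reject (dead state).
       Character classes: 0 = digit, 1 = anything else. */
    static const int transition[3][2] = {
        /* digit  other */
        {  1,     2 },   /* state 0: start      */
        {  1,     2 },   /* state 1: in integer */
        {  2,     2 },   /* state 2: reject     */
    };

    static int char_class(char c) { return (c >= '0' && c <= '9') ? 0 : 1; }

    /* Template driver: runs the table and returns the length of the longest
       prefix accepted in state 1 (an integer lexeme), or 0 if there is none. */
    static int run_table(const char *s)
    {
        int state = 0, len = 0, accepted = 0;
        while (s[len] != '\0' && state != 2) {
            state = transition[state][char_class(s[len])];
            len++;
            if (state == 1)
                accepted = len;   /* remember the longest accepting prefix */
        }
        return accepted;
    }

    int main(void)
    {
        printf("%d\n", run_table("4096+x"));  /* prints 4: the lexeme "4096" */
        return 0;
    }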

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, a NAME token might be any English alphabetical character or an underscore, followed by any number of instances of any ASCII alphanumeric character or an underscore. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".
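
Assuming a POSIX <regex.h> implementation is available, the NAME pattern can be tried directly in C; this sketch only reports the longest match at the start of a sample string:

    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        regex_t re;
        regmatch_t m;
        /* Anchored form of the NAME pattern from the text. */
        const char *pattern = "^[a-zA-Z_][a-zA-Z_0-9]*";
        const char *input = "net_worth_future = (assets - liabilities);";

        if (regcomp(&re, pattern, REG_EXTENDED) != 0)
            return 1;

        if (regexec(&re, input, 1, &m, 0) == 0)
            /* Prints the matched lexeme: "net_worth_future". */
            printf("matched: %.*s\n", (int)(m.rm_eo - m.rm_so), input + m.rm_so);

        regfree(&re);
        return 0;
    }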

Regular expressions and the finite state machines they generate are not powerful enough to handle recursive patterns, such as "n" opening parentheses, followed by a statement, followed by "n" closing parentheses. They are not capable of keeping count and verifying that "n" is the same on both sides, unless there is a finite set of permissible values for "n". It takes a full-fledged parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end.
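
The stack-based check is easy to sketch in C; because there is only one kind of bracket here, a counter can stand in for the explicit stack mentioned above (the function name is hypothetical):

    #include <stdbool.h>
    #include <stdio.h>

    /* Returns true if every '(' in s is matched by a later ')'.
       The depth counter plays the role of the stack described in the text. */
    static bool parens_balanced(const char *s)
    {
        int depth = 0;
        for (; *s != '\0'; s++) {
            if (*s == '(') {
                depth++;                 /* "push" */
            } else if (*s == ')') {
                if (depth == 0)
                    return false;        /* ')' with no matching '(' */
                depth--;                 /* "pop" */
            }
        }
        return depth == 0;               /* stack must be empty at the end */
    }

    int main(void)
    {
        printf("%d\n", parens_balanced("((a + b) * c)"));  /* 1: balanced */
        printf("%d\n", parens_balanced("(a + b))"));       /* 0: unbalanced */
        return 0;
    }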

The Lex programming tool and its compiler are designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. Such a generated lexer is not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection uses hand-written lexers.

Lexer generator

Lexical analysis can often be performed in a single pass if reading is done a character at a time. Single-pass lexers can be generated by tools such as the classic flex.

The lex/flex family of generators uses a table-driven approach which is much less efficient than the directly coded approach. With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements. Tools like re2c and Quex have been shown (see, e.g., the [http://citeseer.ist.psu.edu/bumbulis94rec.html article about re2c]) to produce engines that are between two and three times faster than flex-produced engines. It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools.
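
The directly coded style can be illustrated with a hand-written C fragment in which each state is a label and transitions are goto statements; this is only a sketch of the shape of such engines, not output from re2c or Quex:

    #include <ctype.h>
    #include <stdio.h>

    /* Directly coded recognizer for an identifier: [a-zA-Z_][a-zA-Z_0-9]*.
       Returns the length of the lexeme, or 0 if none is found. */
    static int scan_identifier(const char *s)
    {
        int len = 0;

    state_start:                         /* each state is a label */
        if (isalpha((unsigned char)s[len]) || s[len] == '_') {
            len++;
            goto state_in_identifier;    /* jump directly to the follow-up state */
        }
        return 0;                        /* no identifier at this position */

    state_in_identifier:
        if (isalnum((unsigned char)s[len]) || s[len] == '_') {
            len++;
            goto state_in_identifier;    /* stay in the same state */
        }
        return len;                      /* first non-identifier character ends the lexeme */
    }

    int main(void)
    {
        printf("%d\n", scan_identifier("number_of(cows)"));  /* prints 9: "number_of" */
        return 0;
    }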

The simple utility of using a scanner generator should not be discounted, especially in the developmental phase, when a language specification might change daily. The ability to express lexical constructs as regular expressions facilitates the description of a lexical analyzer. Some tools offer the specification of pre- and post-conditions which are hard to program by hand. In that case, using a scanner generator may save a lot of development time.

Lexical analyzer generators

* Flex - An alternative to the classic 'lex' (C/C++).
* JLex - A Lexical Analyzer Generator for Java.
* Quex - (or 'Queχ') A Mode Oriented Lexical Analyzer Generator for C++.

See also

* List of C Sharp lexer generators
* Parsing
