From 65a8cee37cb94b7824c6f82246f262f314cde717 Mon Sep 17 00:00:00 2001 From: Vern Paxson Date: Sun, 25 Feb 1990 01:28:22 +0000 Subject: [PATCH] Initial revision --- flex.1 | 1434 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1434 insertions(+) create mode 100644 flex.1 diff --git a/flex.1 b/flex.1 new file mode 100644 index 0000000..d52f126 --- /dev/null +++ b/flex.1 @@ -0,0 +1,1434 @@ +.TH FLEX 1 "24 February 1990" "Version 2.2" +.SH NAME +flex - fast lexical analyzer generator +.SH SYNOPSIS +.B flex +.B [-bcdfinpstvFILT -C[efmF] -Sskeleton] +.I [filename ...] +.SH DESCRIPTION +.I flex +is a tool for generating +.I scanners: +programs which recognized lexical patterns in text. +.I flex +reads +the given input files, or its standard input if no file names are given, +for a description of a scanner to generate. The description is in +the form of pairs +of regular expressions and C code, called +.I rules. flex +generates as output a C source file, +.B lex.yy.c, +which defines a routine +.B yylex(). +This file is compiled and linked with the +.B -ll +library to produce an executable. When the executable is run, +it analyzes its input for occurrences +of the regular expressions. Whenever it finds one, it executes +the corresponding C code. +.SH SOME SIMPLE EXAMPLES +.LP +First some simple examples to get the flavor of how one uses +.I flex. +The following +.I flex +input specifies a scanner which whenever it encounters a tab +will print eight blanks to its standard output: +.nf + + %% + \t printf( " " ); + +.fi +By default, any text not matched by a +.I flex +scanner +is copied to the output, so the net effect of this scanner is +to copy its input file to its output with each tab expanded +into eight blanks. +In this input, there is just one rule. "\t" is the +.I pattern +(it's a regular expression specifying a tab) and the "printf" is the +.I action. The "%%" marks the beginning of the rules. +.LP +Here's another simple example: +.nf + + int num_lines = 0, num_chars = 0; + + %% + \n ++num_lines; ++num_chars; + . ++num_chars; + + %% + main() + { + yylex(); + printf( "# of lines = %d, # of chars = %d\n", + num_lines, num_chars ); + } + +.fi +This scanner counts the number of characters and the number +of lines in its input (it produces no output other than the +final report on the counts). The first line +declares two globals, "num_lines" and "num_chars", which are accessible +both inside +.B yylex() +and in the +.B main() +routine declared after the second "%%". There are two rules, one +which matches a newline ("\\n") and increments both the line count and +the character count, and one which matches any character other than +a newline (indicated by the "." regular expression). +.LP +A somewhat more complicated example: +.nf + + /* scanner for a toy Pascal-like language */ + + %{ + /* need this for the call to atof() below */ + #include + %} + + DIGIT [0-9] + ID [a-z][a-z0-9]* + + %% + + {DIGIT}+ { + printf( "An integer: %s (%d)\\n", yytext, + atoi( yytext ) ); + } + + {DIGIT}+"."{DIGIT}* { + printf( "A float: %s (%d)\\n", yytext, + atof( yytext ) ); + } + + if|then|begin|end|procedure|function { + printf( "A keyword: %s\\n", yytext ); + } + + {ID} printf( "An identifier: %s\\n", yytext ); + + "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext ); + + "{"[^}\\n]*"}" /* eat up one-line comments */ + + [ \\t\\n]+ /* eat up whitespace */ + + . printf( "Unrecognized character: %s\\n", yytext ); + + %% + + main( argc, argv ) + int argc; + char **argv; + { + ++argv, --argc; + if ( argc > 0 ) + yyin = fopen( argv[0], "r" ); + else + yyin = stdin; + + yylex(); + } + +.fi +This is the beginnings of a simple scanner for a language like +Pascal. It identifies different types of +.I tokens +and reports on what it has seen. +.LP +The details of the example will be explained in the following +sections. +.SH FORMAT OF THE INPUT FILE +The +.I flex +input file consists of three sections, separated by +.B %%: +.nf + + definitions + %% + rules + %% + user code + +.fi +The +.I definitions +section contains declarations of simple +.I name +definitions to simplify the scanner specification and of +.I start conditions, +which are explained in a later section. +.LP +Name definitions have the form: +.nf + + name definition + +.fi +The "name" is a word beginning with a letter or a '_' +followed by zero or more letters, digits, '_', or '-'. +The definition is taken to begin at the first non-white-space +following the name and continue to the end of the line. +Definition can subsequently be referred to using "{name}", which +will expand to "(definition)". For example, +.nf + + DIGIT [0-9] + ID [a-z][a-z0-9]* + +.fi +defines "DIGIT" to be a regular expression which matches a +single digit, and +"ID" to be a regular expression which matches a letter +followed by zero-or-more letters or digits. +A subsequent reference to +.nf + + {DIGIT}+"."{DIGIT}* + +.fi +is identical to +.nf + + ([0-9])+"."([0-9])* + +.fi +and matches one-or-more digits followed by a '.' followed +by zero-or-more digits. +.LP +The +.I rules +section of the +.I flex +input contains a series of rules of the form: +.nf + + pattern action + +.fi +where the pattern must be unindented and the action must begin +on the same line. +.LP +See below for a further description of patterns and actions. +.LP +Finally, the user code section is simply copied to +.B lex.yy.c +verbatim. +It is used for companion routines which call or are called +by the scanner. The presence of this section is optional; +if it is missing, the second +.B %% +in the input file may be skipped, too. +.LP +In the definitions and rule sections, any +.I indented +text or text enclosed in +.B %{ +and +.B %} +is copied verbatim to the output (with the %{}'s removed). +The %{}'s must appear unindented on lines by themselves. +.LP +In the rules section, +any indented or %{} text appearing before the +first rule may be used to declare variables +which are local to the scanning routine, and, after the declarations, +code which is to be executed whenever the scanning routine is entered. +Other indented or %{} text in the rule section is still copied to the output, +but its meaning is not well-defined and it may well cause compile-time +errors (this feature is present for +.I POSIX +compliance; see below for other such features). +.LP +In the definitions section, an unindented comment (i.e., a line +beginning with "/*") is also copied verbatim to the output up +to the next "*/". Also, any line beginning with '#' is ignored. +.SH PATTERNS +The patterns in the input are written using an extended set of regular +expressions. These are: +.nf + + x match the character 'x' + . any character except newline + [xyz] an 'x', a 'y', or a 'z' + [abj-oZ] an 'a', a 'b', any letter + from 'j' through 'o', or a 'Z' + [^A-Z] any character EXCEPT an uppercase letter, + including a newline (unlike how many other + regular expression tools treat the '^'!). + This means that a pattern like [^"]* will + match an entire file (overflowing the input + buffer) unless there's another quote in + the input. + [^A-Z\\n] any character EXCEPT an uppercase letter or + a newline + r* zero or more r's, where r is any regular expression + r+ one or more r's + r? zero or one r's (that is, "an optional r") + r{2,5} anywhere from two to five r's + r{2,} two or more r's + r{4} exactly 4 r's + {name} the expansion of the "name" definition + (see above) + "[xyz]\\"foo" + the literal string: [xyz]"foo + \\x if x is an 'a', 'b', 'f', 'n', 'r', + 't', or 'v', then the ANSI-C + interpretation of \\x. Otherwise, + a literal 'x' (used to escape + operators such as '*') + \\123 the character with octal value 123 + \\x2a the character with hexadecimal value 2a + (r) match an r; parentheses are used + to override precedence (see below) + + + rs the regular expression r followed + by the regular expression s; called + "concatenation" + + + r|s either an r or an s + + + r/s an r but only if it is followed by + an s. The s is not part of the + matched text. This type of + pattern is known as "trailing context". + ^r an r, but only at the beginning of a line + r$ an r, but only at the end of a line + (r must not use trailing context) + + + r an r, but only in start condition s (see + below for discussion of start conditions) + r + same, but in any of start conditions s1, + s2, or s3 + + + <> an end-of-file + <> + an end-of-file when in start condition s1 or s2 + +.fi +The regular expressions listed above are grouped according to +precedence, from highest precedence at the top to lowest at the bottom. +Those grouped together have equal precedence. For example, +.nf + + foo|bar* + +.fi +is the same as +.nf + + (foo)|(ba(r*)) + +.fi +since the '*' operator has higher precedence than concatenation, +and concatenation higher than alternation ('|'). This pattern +therefore matches +.I either +the string "foo" +.I or +the string "ba" followed by zero-or-more r's. +To match "foo" or zero-or-more "bar"'s, use: +.nf + + foo|(bar)* + +.fi +and to match zero-or-more "foo"'s or "bar"'s: +.nf + + (foo|bar)* + +.fi +.SH HOW THE INPUT IS MATCHED +When the generated scanner is run, it analyzes its input looking +for strings which match any of its patterns. If it finds more than +one match, it takes the one matching the most text (for trailing +context rules, this includes the length of the trailing part, even +though it will then be returned to the input). If it finds two +or more matches of the same length, the +rule listed first in the +.I flex +input file is chosen. +.LP +Once the match is determined, the text corresponding to the match +(called the +.I token) +is made available in the global character pointer +.B yytext, +and its length in the global integer +.B yyleng. +The +.I action +corresponding to the matched pattern is then executed (a more +detailed description of actions follows), and then the remaining +input is scanned for another match. +.LP +If no match is found, then the +.I default rule +is executed: the next character in the input is matched and +copied to the standard output. Thus, the simplest legal +.I flex +input is: +.nf + + %% + +.fi +which generates a scanner that simply copies its input (one character +at a time) to its output. +.SH ACTIONS +Each pattern in a rule has a corresponding action, which can be any +arbitrary C statement. The pattern ends at the first non-escaped +whitespace character; the remainder of the line is its action. If the +action is empty, then when the pattern is matched the input token +is simply discarded. For example, here is the specification for a program +which deletes all occurrences of "zap me" from its input: +.nf + + %% + "zap me" + +.fi +Here is a program which compresses multiple blanks and tabs down to +a single blank, and throws away whitespace found at the end of a line: +.nf + + %% + [ \t]+ putchar( ' ' ); + [ \t]+$ /* ignore this token */ + +.fi +.LP +If the action contains a '{', then the action spans till the balancing +'}' is found, and the action may cross multiple lines. +.I flex +knows about C strings and comments and won't be fooled by braces found +within them, but also allows actions to begin with +.B %{ +and will consider the action to be all the text up to the next +.B %}. +.LP +An action consisting solely of a vertical bar ('|') means "same as +the action for the next rule. See below for an illustration. +.LP +Actions can include arbitrary C code, including +.B return +statements to return a value whatever routine called +.B yylex(). +Each time +.B yylex() +is called it continues processing tokens from where it last left +off until it either reaches +the end of the file or executes a return. +.LP +Actions are not allowed to modify yytext or yyleng. +.LP +There are a number of special directives which can be included within +an action: +.IP - +.B ECHO +copies yytext to the scanner's output. +.IP - +.B BEGIN +followed by the name of a start condition places the scanner in the +corresponding start condition (see below). +.IP - +.B REJECT +directs the scanner to proceed on to the "second best" rule which matched the +input (or a prefix of the input). The rule is chosen as described +above in "How the Input is Matched", and +.B yytext +and +.B yyleng +set up appropriately. +It may either be one which matched as much text +as the originally chosen rule but came later in the +.I flex +input file, or one which matched less text. +For example, the following will both count the +words in the input and call the routine special() whenever "frob" is seen: +.nf + + %{ + int word_count = 0; + %} + %% + + frob special(); REJECT; + [^ \t\n]+ ++word_count; + +.fi +Without the +.B REJECT, +any "frob"'s in the input would not be counted as words, since the +scanner normally executes only one action per token. +Multiple +.B REJECT's +are allowed, each one finding the next best choice to the currently +active rule. For example, when the following scanner scans the token +"abcd", it will write "abcdabcaba" to the output: +.nf + + %% + a | + ab | + abc | + abcd ECHO; REJECT; + .|\n /* eat up any unmatched character */ + +.fi +(The first three rules share the fourth's action since they use +the special '|' action.) +.B REJECT +is a particularly expensive feature in terms scanner performance; +if it is used in +.I any +of the scanner's actions it will slow down +.I all +of the scanner's matching. +.IP - +.B yymore() +tells the scanner that the next time it matches a rule, the corresponding +token should be +.I appended +onto the current value of +.B yytext +rather than replacing it. For example, given the input "mega-kludge" +the following will write "mega-mega-kludge" to the output: +.nf + + %% + mega- ECHO; yymore(); + kludge ECHO; + +.fi +First "mega-" is matched and echoed to the output. Then "kludge" +is matched, but the previous "mega-" is still hanging around at the +beginning of +.B yytext +so the ECHO for the "kludge" rule will actually write "mega-kludge". +The presence of +.B yymore() +in the scanner's action entails a minor performance penalty in the +scanner's matching speed. +.IP - +.B yyless(n) +returns all but the first +.I n +characters of the current token back to the input stream, where they +will be rescanned when the scanner looks for the next match. +.B yytext +and +.B yyleng +are adjusted appropriately (e.g., +.B yyleng +will now be equal to +.I n +). For example, on the input "foobar" the following will write out +"foobarbar": +.nf + + foobar ECHO; yyless(3); + [a-z]+ ECHO; + +.fi +An argument of 0 to +.B yyless +will cause the current input string to be scanned again. Unless you've +changed how the scanner will subsequently process its input (using +.B BEGIN, +for example), this will result in an endless loop. +.IP - +.B unput(c) +puts the character +.I c +back onto the input stream. It will be the next character scanned. +The following action will take the current token and cause it +to be rescanned enclosed in parentheses. +.nf + + { + int i; + unput( ')' ); + for ( i = yyleng - 1; i >= 0; --i ) + unput( yytext[i] ); + unput( '(' ); + } + +.fi +Note that since each +.B unput() +puts the given character back at the +.I beginning +of the input stream, pushing back strings must be done back-to-front. +.IP - +.B input() +reads the next character from the input stream. For example, +the following is one way to eat up C comments: +.nf + + %% + "/*" { + register int c; + + for ( ; ; ) + { + while ( (c = input()) != '*' && + c != EOF ) + ; /* eat up text of comment */ + + if ( c == '*' ) + { + while ( (c = input()) == '*' ) + ; + if ( c == '/' ) + break; /* found the end */ + } + + if ( c == EOF ) + { + error( "EOF in comment" ); + break; + } + } + } + +.fi +.IP - +.I yyterminate() +can be used in lieu of a return statement in an action. It terminates +the scanner and returns a 0 to the scanner's caller, indicating "all done". +By default, +.I yyterminate() +is also called when an end-of-file is encountered. It is a macro and +may be redefined. +.SH THE GENERATED SCANNER +The output of +.I flex +is the file +.B lex.yy.c, +which contains the scanning routine +.B yylex(), +a number of tables used by it for matching tokens, and a number +of auxilliary routines and macros. By default, +.B yylex() +is declared as follows: +.nf + + int yylex() + { + ... various definitions and the actions in here ... + } + +.fi +(If your environment supports function prototypes, then it will +be "int yylex( void )".) This definition may be changed by redefining +the "YY_DECL" macro. For example, you could use: +.nf + + #undef YY_DECL + #define YY_DECL float lexscan( a, b ) float a, b; + +.fi +to give it the the scanning routine the name +.I lexscan, +returning a float, and taking two floats as arguments. Note that +if you give arguments to the scanning routine using a +K&R-style/non-prototyped function declaration, you must terminate +the definition with a semi-colon (;). +.LP +Whenever +.B yylex() +is called, it scans tokens from the global input file +.I yyin +(default, stdin). It continues until it either reaches +an end-of-file (at which point it returns the value 0) or +one of its actions executes a +.I return +statement. +In the former case, the scanner may not be called again unless +.B void yyrestart( FILE *input_file ) +is called, to point +.I yyin +at the new input_file. In the latter case (i.e., when an action +executes a return), the scanner may then be called again and it +will resume scanning where it left off. +.LP +By default (and for purposes of efficiency), the scanner uses +block-reads rather than simple +.I getc() +calls to read characters from +.I yyin. +The nature of how it gets its input can be controlled by redefining the +.B YY_INPUT +macro. +YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its +action is to place up to +.I max_size +characters in the character array +.I buf +and return in the integer variable +.I result +either the +number of characters read or the constant YY_NULL (0 on Unix systems) +to indicate EOF. The default YY_INPUT reads from the +global file-pointer "yyin" (which is by default +.I stdin), +so if you +just want to change the input file, you needn't redefine +YY_INPUT - just point yyin at the input file. +.LP +A sample redefinition of YY_INPUT (in the definitions section of the input +file): +.nf + + %{ + #undef YY_INPUT + #define YY_INPUT(buf,result,max_size) \\ + result = ((buf[0] = getchar()) == EOF) ? YY_NULL : 1; + %} + +.fi +You also can add in things like keeping track of the +input line number this way; but don't expect your scanner to +go very fast. +.LP +When the scanner receives an end-of-file indication from YY_INPUT, +it then checks the +.B yywrap() +function. If it returns false (zero), then it is assumed that the +function has gone ahead and set up +.I yyin +to point to another input file, and scanning continues. If it returns +true (non-zero), then the scanner terminates, returning 0 to its +caller. +.LP +The default +.B yywrap() +always returns 1. Presently, to redefine it you must first +"#undef yywrap", as it is currently implemented as a macro. As noted +by the hedging in the previous sentence, it may be changed to +a true function in the near future. +.LP +The scanner writes its +.B ECHO +output to the +.I yyout +global (default, stdout), which may be redefined by the user simply +by assigning it to some other FILE*. +.SH START CONDITIONS +.I flex +provides a mechanism for conditionally activating rules. Any rule +whose pattern is prefixed with "" will only be active when +the scanner is in the start condition named "sc". For example, +.nf + + [^"]* { /* eat up the string body ... */ + ... + } + +.fi +will be active only when the scanner is in the "STRING" start +condition, and +.nf + + \\. { /* handle an escape ... */ + ... + } + +.fi +will be active only when the current start condition is +either "INITIAL", "STRING", or "QUOTE". +.LP +Start conditions +are declared in the definitions (first) section of the input +using unindented lines beginning with either +.B %s +or +.B %x +followed by a list of names. +The former declares +.I inclusive +start conditions, the latter +.I exclusive +start conditions. A start condition is activated using the +.B BEGIN +action. Until the next +.B BEGIN +action is executed, rules with the given start +condition will be active and +rules with other start conditions will be inactive. +If the start condition is +.I inclusive, +then rules with no start conditions at all will also be active. +If it is +.I exclusive, +then +.I only +rules qualified with the start condition will be active. +So a set of rules conditioned on the same exclusive start condition +describe a scanner which is independent of any of the other rules in the +.I flex +input. Because of this, +exclusive start conditions make it easy to specify "mini-scanners" +which scan portions of the input that are syntactically different +from the rest (e.g., comments). +.LP +.B BEGIN(0) +returns to the original state where only the rules with +no start conditions are active. This state can also be +referred to as the start-condition "INITIAL", so +.B BEGIN(INITIAL) +is equivalent to +.B BEGIN(0). +.LP +Here is a scanner which will recognize numbers only if they +are preceded earlier in the line by the string "expect-number": +.nf + + %s expect + + %% + expect-number BEGIN(expect); + + [0-9]+ printf( "found a number\n" ); + \n { + /* that's the end of the line, so + * we need another "expect-number" + * before we'll recognize any more + * numbers + */ + BEGIN(INITIAL); + } + +.fi +Here is a scanner which recognizes (and discards) C comments while +maintaining a count of the current input line. +.nf + + %x comment + %% + int line_num = 1; + + [^*\n]* + "*"+[^*/\n]* + \n ++line_num; + "*"+"/" BEGIN(INITIAL); + + "/*" BEGIN(comment); + +.fi +Note that start-conditions names are really integer values and +can be stored as such. Thus, the above could be extended in the +following fashion: +.nf + + %x comment + %% + int line_num = 1; + int comment_caller; + + [^*\n]* + "*"+[^*/\n]* + \n ++line_num; + "*"+"/" BEGIN(comment_caller); + + "/*" { + comment_caller = INTIIAL; + BEGIN(comment); + } + + ... + + "/*" { + comment_caller = foo; + BEGIN(comment); + } +.fi +One can then implement a "stack" of start conditions using an +array of integers. (It is likely that such stacks will become +a full-fledged +.I flex +feature in the future.) Note, though, that +start conditions do not have their own name-space; %s's and %x's +declare names effectively the same as #define's. +.SH END-OF-FILE RULES +The special rule "<>" indicates +actions which are to be taken when an end-of-file is +encountered and yywrap() returns non-zero (i.e., indicates +no further files to process). The action can either +point yyin at a new file to process, in which case the +action +.I must +finish with the special +.I YY_NEW_FILE +action +(this is a branch, so subsequent code in the action won't +be executed), or the action must finish with a +.I return +or +.I yyterminate() +statement. <> rules may not be used with other +patterns; they may only be qualified with a list of start +conditions. If an unqualified <> rule is given, it +applies only to the +.B INITIAL +start condition, and +.I not +to +.B %s +(or +.B %x) +start conditions. +.LP +These rules are useful for catching things like unclosed comments. +An example: +.nf + + %x quote + %% + ... + <> { + error( "unterminated quote" ); + yyterminate(); + } + <> { + if ( *++filelist ) + { + yyin = fopen( *filelist, "r" ); + YY_NEW_FILE; + } + else + yyterminate(); + } + +.fi +.SH MISCELLANEOUS MACROS +The macro +.bd +YY_USER_ACTION +can be redefined to provide an action +which is always executed prior to the matched rule's action. For example, +it could be #define'd to call a routine to convert yytext to lower-case. +.LP +In the generated scanner, the actions are all gathered in one large +switch statement and separated using +.B YY_BREAK, +which may be redefined. +This allows, for example, C++ users to +#define YY_BREAK to do nothing (while being very careful that every +rule ends with a "break" or a "return"!) to avoid suffering from +unreachable statement warnings where a rule's action ends with "return". +.SH INTERFACING WITH YACC +One of the main uses of +.I flex +is as a companion to the +.I yacc +parser-generator. +.I yacc +parsers expect to call the +.B yylex() +routine to find the next input token. The routine is supposed to +return the type of the next token as well as putting any associated +value in the global +.B yylval. +To use +.I flex +with +.I yacc, +one specifies the +.B -d +option to +.I yacc +to instruct it to generate the file +.B y.tab.h +containing definitions of all the +.B %tokens +appearing in the +.I yacc +input. This file is then included in the +.I flex +scanner. For example, if one of the tokens is "TOK_NUMBER", +part of the scanner might look like: +.nf + + %{ + #include "y.tab.h" + %} + + %% + + [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; + +.fi +.SH TRANSLATION TABLE +In the name of POSIX compliance, +.I flex +supports a +.I translation table +for mapping input characters together into specified sets. +The table is specified in the first section, and its format looks like: +.nf + + %t + 1 abcd + 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ + 52 0123456789 + %t + +.fi +This example specifies that the characters 'a', 'b', 'c', and 'd' +are to all be lumped into group #1, the upper-case letters are +to be in group #2, and digits in group #52, and +.I no other characters will appear in the patterns +(note that characters can also be specified in a +.B %t +table using escape sequences). +The group numbers are actually disregarded by +.I flex; +.B %t +serves, though, to lump characters together. Given the above +table, for example, the pattern "aAA*5" is equivalent to "dZQ*0". +They both say, "match any character in group #1, followed by +a character from group #2, followed by zero-or-more characters +from group #2, followed by a character from group #52." Thus +.B %t +provides a crude way for introducing equivalence classes into +the scanner specification. It is the author's belief that the +.B -i +option coupled with the equivalence classes which +.I flex +automatically generates take care of virtually all the instances +when one might consider using +.B %t. +But what the hell, it's there if you want it. +.so options.man +.SH PERFORMANCE CONSIDERATIONS +The main design goal of +.I flex +is that it generate high-performance scanners. It has been optimized +for dealing well with large sets of rules. Aside from the effects +outlined above of table compression on scanner speed, +there are a number of options/actions which degrade performance. These +are, in decreasing order of performance impact: +.nf + + REJECT + pattern sets that require backtracking + arbitrary trailing context + %T + '^' beginning-of-line operator + yymore() + start conditions + +.fi +.LP +Getting rid of backtracking is messy and often may be too much +work for a complicated scanner's rules. In principal, one begins +by using the +.B -b +flag to generate a +.I lex.backtrack +file. For example, on the input +.nf + + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + +.fi +the file looks like: +.nf + + State #6 is non-accepting - + associated rules: + 2 3 + out-transitions: [ o ] + jam-transitions: EOF [ \001-n p-\177 ] + + State #8 is non-accepting - + associated rules: + 3 + out-transitions: [ a ] + jam-transitions: EOF [ \001-` b-\177 ] + + State #9 is non-accepting - + associated rules: + 3 + out-transitions: [ r ] + jam-transitions: EOF [ \001-q s-\177 ] + + Compressed tables always backtrack. + +.fi +The first few lines tell us that there's a scanner state in +which it can make a transition on an 'o' but not on any other +character, and the currently scanned text does not match any rule. +If the scanner is in that state and then reads +something other than an 'o', it will have to backtrack to find +a rule which is matched. With +a bit of headscratching one can see that this must be the +state it's in when it has seen "fo". When this has happened, +if anything other than another 'o' is seen, the scanner will +have to back up to simply match the 'f' (by the default rule). +.LP +The comment regarding State #8 indicates there's a problem +when "foob" has been scanned. Indeed, on any character other +than a 'b', the scanner will have to back up to accept "foo". +Similarly, the comment for State #9 concerns when "fooba" has +been scanned. +.LP +The final comment reminds us that there's no point going to +all the trouble of removing backtracking from the rules unless +we're using +.B -f +or +.B -F, +since there's no performance gain doing so with compressed scanners. +.LP +The way to remove the backtracking is to add "error" rules: +.nf + + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + + fooba | + foob | + fo { + /* false alarm, not really a keyword */ + return TOK_ID; + } + +.fi +.LP +Unfortunately backtracking messages tend to cascade and +with a complicated input set it's not uncommon to get hundreds +of messages. If one can decipher them, though, it often +only takes a dozen or so rules to eliminate the backtracking. +(A possible future +.I flex +feature will be to automatically add rules to eliminate backtracking. +The problem is that while it's easy for +.I flex +to figure out what rules are needed, it's very hard for it to +know what the proper action is. Currently I'm thinking that it +will simply invoke a user-redefinable macro and that's it ...) +.LP +Another area where the user can increase a scanner's performance +(and one that's easier to implement) arises from the fact that +the longer the tokens matched, the faster the scanner will run. +This is because with long tokens the processing of most input +characters takes place in the (short) inner scanning loop, and +does not often have to go through the additional work of setting up +the scanning environment (e.g., +.B yytext) +for the action. Recall the scanner for C comments: +.nf + + %x comment + %% + int line_num = 1; + + [^*\n]* + "*"+[^*/\n]* + \n ++line_num; + "*"+"/" BEGIN(INITIAL); + + "/*" BEGIN(comment); + +.fi +This could be sped up by writing it as: +.nf + + %x comment + %% + int line_num = 1; + + [^*\n]* + [^*\n]*\n ++line_num; + "*"+[^*/\n]* + "*"+[^*/\n]*\n ++line_num; + "*"+"/" BEGIN(INITIAL); + + "/*" BEGIN(comment); + +.fi +Now instead of each newline requiring the processing of another +action, recognizing the newlines is "distributed" over the other rules +to keep the matched text as long as possible. Note that +.I adding +rules does +.I not +slow down the scanner! The speed of the scanner is independent +of the number of rules or (modulo the considerations given at the +beginning of this section) how complicated the rules are with +regard to operators such as '*' and '|'. +.SH INCOMPATIBILITIES WITH LEX AND POSIX +.I flex +is a rewrite of the Unix +.I lex +tool (the two implementations do not share any code, though), +which dates to the late 1970's. There are some incompatibilities +which are of concern to those who wish to write scanners acceptable +to either implementation. At present, the POSIX lex draft is +very close to the original lex implementation, so some of these +incompatibilities are also in conflict with the POSIX draft. But +the intent is that except as noted below, +.I flex +as it presently stands will +ultimately be POSIX comformant (i.e., that those areas of conflict with +the POSIX draft will be resolved in +.I flex's +favor). Please bare in +mind that all the comments are with regard to the POSIX +.I draft +standard and not the final document; they are included so +.I flex +users can be aware of the standardization issues and those areas where +.I flex +may in the near future be incompatibly changed with +its current definition. +.LP +.I flex +is fully compatible with +.I lex +with the following exceptions: +.IP - +When definitions are expanded, +.I flex +encloses them in parentheses. +With lex, the following +.nf + + NAME [A-Z][A-Z0-9]* + %% + foo{NAME}? printf( "Found it\\n" ); + %% + +.fi +will not match the string "foo" because when the macro +is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?" +and the precedence is such that the '?' is associated with +"[A-Z0-9]*". With +.I flex, +the rule will be expanded to +"foo([A-z][A-Z0-9]*)?" and so the string "foo" will match. +Note that because of this, the +.B ^, $, , +and +.B / +operators cannot be used in a definition. +.IP +Note that the POSIX draft interpretation here is the same as +.I flex's. +.IP - +The undocumented lex-scanner internal variable +.B yylineno +is not supported. (The variable is not part of the POSIX draft.) +.IP - +The +.B input() +routine is not redefinable, though may be called to read characters +following whatever has been matched by a rule. If +.B input() +encounters an end-of-file the normal +.B yywrap() +processing is done. A ``real'' end-of-file is returned as +.I EOF. +.IP +Input is instead controlled by redefining the +.B YY_INPUT +macro. +.IP +The +.I flex +restriction that +.B input() +cannot be redefined is in accordance with the POSIX draft, but +.B YY_INPUT +has not yet been accepted into the draft. +.IP - +.B output() +is not supported. +Output from the ECHO macro is done to the file-pointer +"yyout" (default +.I stdout). +.IP +The POSIX draft mentions that an +.B output() +routine exists but currently gives no details as to what it does. +.IP - +If you are providing your own yywrap() routine, you must include a +"#undef yywrap" in the definitions section (section 1). Note that +the "#undef" will have to be enclosed in %{}'s. +.IP +The POSIX draft +specifies that yywrap() is a function and this is unlikely to change; so +.I flex users are warned +that +.I yywrap() +is likely to be changed to a function in the near future. +.IP - +The precedence of the +.B {} +operator is different. lex interprets "abc{1,3}" as "match one, two, or +three occurrences of 'abc'", whereas +.I flex +interprets it as "match 'ab' +followed by one, two, or three occurrences of 'c'". The latter is +in agreement with the current POSIX draft. +.IP - +To refer to yytext outside of your scanner source file, use +"extern char *yytext;" rather than "extern char yytext[];". +This is contrary to the POSIX draft but a point on which I refuse +to budge, as the array representation entails a serious performance penalty. +.IP - +The name +.bd +FLEX_SCANNER +is #define'd so scanners may be written for use with either +.I flex +or +.I lex. +.SH DEFICIENCES / BUGS +.LP +Some trailing context +patterns cannot be properly matched and generate +warning messages ("Dangerous trailing context"). These are +patterns where the ending of the +first part of the rule matches the beginning of the second +part, such as "zx*/xy*", where the 'x*' matches the 'x' at +the beginning of the trailing context. (Note that the POSIX draft +states that the text matched by such patterns is undefined.) +If desperate, you can use +.B yyless() +to effect arbitrary trailing context. +.LP +.I variable +trailing context (where both the leading and trailing parts do not have +a fixed length) entails the same performance loss as +.I REJECT +(i.e., substantial). +.LP +For some trailing context rules, parts which are actually fixed-length are +not recognized as such, leading to the abovementioned performance loss. +In particular, parts using '|' or {n} are always considered variable-length. +.LP +Use of unput() or input() invalidates yytext and yyleng. +.LP +Use of unput() to push back more text than was matched can +result in the pushed-back text matching a beginning-of-line ('^') +rule even though it didn't come at the beginning of the line +(though this is rare!). +.LP +Nulls are not allowed in +.I flex +inputs or in the inputs to +scanners generated by +.I flex. +Their presence generates fatal errors. +.LP +.I flex +does not generate correct #line directives for code internal +to the scanner; thus, bugs in +.I +flex.skel +yield bogus line numbers. +.LP +Due to both buffering of input and read-ahead, you cannot intermix +calls to stdio routines, such as, for example, +.B getchar() +with +.I flex +rules and expect it to work. Call +.B input() +instead. +.LP +The total table entries listed by the +.B -v +flag excludes the number of table entries needed to determine +what rule has been matched. The number of entries is equal +to the number of DFA states if the scanner does not use REJECT, +and somewhat greater than the number of states if it does. +.LP +It would be useful if +.I flex +wrote to lex.yy.c a summary of the flags used in +its generation (such as which table compression options). +.LP +Some of the macros, such as +.B yywrap(), +may in the future become functions which live in the +.B -ll +library. This will doubtless break a lot of code, but may be +required for POSIX-compliance. +.LP +The +.I flex +internal algorithms need documentation. +.SH "SEE ALSO" +.LP +lex(1), yacc(1), sed(1), awk(1). +.LP +M. E. Lesk and E. Schmidt, +.I LEX - Lexical Analyzer Generator +.SH AUTHOR +Vern Paxson, with the help of many ideas and much inspiration from +Van Jacobson. Original version by Jef Poskanzer. Fast table +representation is a partial implementation of a design done by Van +Jacobson. The implementation was done by Kevin Gong and Vern Paxson. +.LP +Thanks to the many +.I flex +beta-testers and feedbackers, especially Casey +Leedom, benson@odi.com, +Frederic Brehm, Nick Christopher, Jason Coughlin, +Chris Faylor, Eric Goldman, Eric +Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht, +Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt, +Jef Poskanzer, Dave Tallman, Frank Whaley, Ken Yap, and others whose names +have slipped my marginal mail-archiving skills but whose contributions +are appreciated all the same. +.LP +Thanks to Keith Bostic, John Gilmore, Bob +Mulcahy, Rich Salz, and Richard Stallman for help with various distribution +headaches. +.LP +Thanks to Esmond Pitt for 8-bit character support, Benson Margulies and Fred +Burke for C++ support, and Ove Ewerlid for supporting NUL's (as well as for +impressive efforts regarding generating extremely high-performance +scanners, which with luck will be soon forthcoming). +.LP +This work was primarily done when I was a member of the Real Time System Group +at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there +for the support I received. +.LP +Send comments to: +.nf + + Vern Paxson + Computer Science Department + 4126 Upson Hall + Cornell University + Ithaca, NY 14853-7501 + + vern@cs.cornell.edu + decvax!cornell!vern + vern@LBL (bitnet) + +.fi -- 2.40.0