From: Ulya Trofimovich
Date: Wed, 17 Jul 2019 22:40:51 +0000 (+0100)
Subject: Updated documentation for new options and directives.
X-Git-Tag: 1.2~18
X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=1e09560e0e5baac5f28f880f9cf02bfc48ea4df2;p=re2c

Updated documentation for new options and directives.
---

diff --git a/bootstrap/doc/re2c.1 b/bootstrap/doc/re2c.1
index d7c2f1a1..4c4d4d00 100644
--- a/bootstrap/doc/re2c.1
+++ b/bootstrap/doc/re2c.1
@@ -45,175 +45,225 @@ control and customize the generated DFA.
 .B \fB\-? \-h \-\-help\fP
 Show help message.
 .TP
+.B \fB\-1 \-\-single\-pass\fP
+Deprecated. Does nothing (single pass is the default now).
+.TP
+.B \fB\-8 \-\-utf\-8\fP
+Generate a lexer that reads input in UTF\-8 encoding.
+re2c assumes that character range is 0 \-\- 0x10FFFF and character size is
+1 byte.
+.TP
 .B \fB\-b \-\-bit\-vectors\fP
 Optimize conditional jumps using bit masks. Implies \fB\-s\fP\&.
 .TP
 .B \fB\-c \-\-conditions \-\-start\-conditions\fP
-Enable support of Flex\-like "conditions": multiple interrelated lexers within one block.
-Option \fB\-\-start\-conditions\fP is a legacy alias; use \fB\-\-conditions\fP instead.
+Enable support of Flex\-like "conditions": multiple interrelated lexers
+within one block. Option \fB\-\-start\-conditions\fP is a legacy alias; use
+\fB\-\-conditions\fP instead.
 .TP
-.B \fB\-d \-\-debug\-output\fP
-Emit \fBYYDEBUG\fP in the generated code.
-\fBYYDEBUG\fP should be defined by the user in the form of a void function with two parameters:
-\fBstate\fP (lexer state or \-1) and \fBsymbol\fP (current input symbol of type \fBYYCTYPE\fP).
+.B \fB\-\-case\-insensitive\fP
+Treat single\-quoted and double\-quoted strings as case\-insensitive.
+.TP
+.B \fB\-\-case\-inverted\fP
+Invert the meaning of single\-quoted and double\-quoted strings:
+treat single\-quoted strings as case\-sensitive and double\-quoted strings
+as case\-insensitive.
 .TP
 .B \fB\-D \-\-emit\-dot\fP
-Instead of normal output generate lexer graph in DOT format.
-The output can be converted to PNG with the help of Graphviz (something like \fBdot \-Tpng \-odfa.png dfa.dot\fP).
-Note that large graphs may crash Graphviz.
+Instead of normal output generate lexer graph in .dot format.
+The output can be converted to an image with the help of Graphviz
+(e.g. something like \fBdot \-Tpng \-odfa.png dfa.dot\fP).
 .TP
-.B \fB\-e \-\-ecb\fP
-Generate a lexer that reads input in EBCDIC encoding.
-\fBre2c\fP assumes that character range is 0 \-\- 0xFF an character size is 1 byte.
+.B \fB\-d \-\-debug\-output\fP
+Emit \fBYYDEBUG\fP in the generated code.
+\fBYYDEBUG\fP should be defined by the user in the form of a void function
+with two parameters: \fBstate\fP (lexer state or \-1) and \fBsymbol\fP (current
+input symbol of type \fBYYCTYPE\fP).
 .TP
-.B \fB\-f \-\-storable\-state\fP
-Generate a lexer which can store its inner state.
-This is useful in push\-model lexers which are stopped by an outer program when there is not enough input,
-and then resumed when more input becomes available.
-In this mode users should additionally define
-\fBYYGETSTATE ()\fP and \fBYYSETSTATE (state)\fP macros
-and variables \fByych\fP, \fByyaccept\fP and the \fBstate\fP as part of the lexer state.
+.B \fB\-\-dfa\-minimization \fP
+The internal algorithm used by re2c to minimize the DFA: \fBmoore\fP (the
+default) is Moore algorithm, and \fBtable\fP is the "table filling" algorithm.
+Both algorithms should produce the same DFA up to states relabeling; table
+filling is simpler and much slower and serves as a reference implementation.
 .TP
-.B \fB\-F \-\-flex\-syntax\fP
-Partial support for Flex syntax:
-in this mode named definitions don\(aqt need the equal sign and the terminating semicolon,
-and when used they must be surrounded by curly braces.
-Names without curly braces are treated as double\-quoted strings.
+.B \fB\-\-dump\-adfa\fP
+Debug option: output DFA after tunneling (in .dot format).
 .TP
-.B \fB\-g \-\-computed\-gotos\fP
-Optimize conditional jumps using non\-standard "computed goto" extension (must be supported by C/C++ compiler).
-\fBre2c\fP generates jump tables only in complex cases with a lot of conditional branches.
-Complexity threshold can be configured with \fBcgoto:threshold\fP configuration.
-This option implies \fB\-b\fP\&.
+.B \fB\-\-dump\-cfg\fP
+Debug option: output control flow graph of tag variables (in .dot format).
 .TP
-.B \fB\-i \-\-no\-debug\-info\fP
-Do not output \fB#line\fP information.
-This is useful when the generated code is tracked by some version control system.
+.B \fB\-\-dump\-closure\-stats\fP
+Debug option: output statistics on the number of states in closure.
 .TP
-.B \fB\-o OUTPUT \-\-output=OUTPUT\fP
-Specify the \fBOUTPUT\fP file.
+.B \fB\-\-dump\-dfa\-det\fP
+Debug option: output DFA immediately after determinization (in .dot format).
 .TP
-.B \fB\-r \-\-reusable\fP
-Allows reuse of \fBre2c\fP rules with \fB/*!rules:re2c */\fP and \fB/*!use:re2c */\fP blocks.
-In this mode simple \fB/*!re2c */\fP blocks are not allowed
-and exactly one \fB/*!rules:re2c */\fP block must be present.
-The rules are saved and used by every \fB/*!use:re2c */\fP block that follows (which may add rules of their own).
-This option allows to reuse the same set of rules with different configurations.
+.B \fB\-\-dump\-dfa\-min\fP
+Debug option: output DFA after minimization (in .dot format).
 .TP
-.B \fB\-s \-\-nested\-ifs\fP
-Use nested \fBif\fP statements instead of \fBswitch\fP statements in conditional jumps.
-This usually results in more efficient code with non\-optimizing C/C++ compilers.
+.B \fB\-\-dump\-dfa\-tagopt\fP
+Debug option: output DFA after tag optimizations (in .dot format).
 .TP
-.B \fB\-t HEADER \-\-type\-header=HEADER\fP
-Generate a \fBHEADER\fP file that contains enum with condition names.
-Requires \fB\-c\fP option.
+.B \fB\-\-dump\-dfa\-raw\fP
+Debug option: output DFA under construction with expanded state\-sets
+(in .dot format).
 .TP
-.B \fB\-T \-\-tags\fP
-Enable submatch extraction with tags.
+.B \fB\-\-dump\-interf\fP
+Debug option: output interference table produced by liveness analysis of tag
+variables.
 .TP
-.B \fB\-P \-\-posix\-captures\fP
-Enable submatch extraction with POSIX\-style capturing groups.
+.B \fB\-\-dump\-nfa\fP
+Debug option: output NFA (in .dot format).
 .TP
-.B \fB\-u \-\-unicode\fP
-Generate a lexer that reads input in UTF\-32 encoding.
-\fBre2c\fP assumes that character range is 0 \-\- 0x10FFFF and character size is 4 bytes.
-Implies \fB\-s\fP\&.
+.B \fB\-e \-\-ecb\fP
+Generate a lexer that reads input in EBCDIC encoding.
+re2c assumes that character range is 0 \-\- 0xFF an character size is 1 byte.
 .TP
-.B \fB\-v \-\-version\fP
-Show version information.
+.B \fB\-\-eager\-skip\fP
+Make the generated lexer advance the input position "eagerly":
+immediately after reading input symbol.
+By default this happens after transition to the next state.
+Implied by \fB\-\-no\-lookahead\fP\&.
 .TP
-.B \fB\-V \-\-vernum\fP
-Show version information in \fBMMmmpp\fP format (major, minor, patch).
+.B \fB\-\-empty\-class \fP
+Define the way re2c treats empty character classes. With \fBmatch\-empty\fP
+(the default) empty class matches empty input (which is illogical, but
+backwards\-compatible). With\(ga\(gamatch\-none\(ga\(ga empty class always fails to match.
+With \fBerror\fP empty class raises a compilation error.
 .TP
-.B \fB\-w \-\-wide\-chars\fP
-Generate a lexer that reads input in UCS\-2 encoding.
-\fBre2c\fP assumes that character range is 0 \-\- 0xFFFF and character size is 2 bytes.
-Implies \fB\-s\fP\&.
+.B \fB\-\-encoding\-policy \fP
+Define the way re2c treats Unicode surrogates.
+With \fBfail\fP re2c aborts with an error when a surrogate is encountered.
+With \fBsubstitute\fP re2c silently replaces surrogates with the error code
+point 0xFFFD.
With \fBignore\fP (the default) re2c treats surrogates as
+normal code points. The Unicode standard says that standalone surrogates
+are invalid, but real\-world libraries and programs behave in different ways.
 .TP
-.B \fB\-x \-\-utf\-16\fP
-Generate a lexer that reads input in UTF\-16 encoding.
-\fBre2c\fP assumes that character range is 0 \-\- 0x10FFFF and character size is 2 bytes.
-Implies \fB\-s\fP\&.
+.B \fB\-f \-\-storable\-state\fP
+Generate a lexer which can store its inner state.
+This is useful in push\-model lexers which are stopped by an outer program
+when there is not enough input, and then resumed when more input becomes
+available. In this mode users should additionally define \fBYYGETSTATE()\fP
+and \fBYYSETSTATE(state)\fP macros and variables \fByych\fP, \fByyaccept\fP
+and \fBstate\fP as part of the lexer state.
 .TP
-.B \fB\-8 \-\-utf\-8\fP
-Generate a lexer that reads input in UTF\-8 encoding.
-\fBre2c\fP assumes that character range is 0 \-\- 0x10FFFF and character size is 1 byte.
+.B \fB\-F \-\-flex\-syntax\fP
+Partial support for Flex syntax: in this mode named definitions don\(aqt need
+the equal sign and the terminating semicolon, and when used they must be
+surrounded by curly braces. Names without curly braces are treated as
+double\-quoted strings.
 .TP
-.B \fB\-\-case\-insensitive\fP
-Treat single\-quoted and double\-quoted strings as case\-insensitive.
+.B \fB\-g \-\-computed\-gotos\fP
+Optimize conditional jumps using non\-standard "computed goto" extension
+(which must be supported by the C/C++ compiler). re2c generates jump tables
+only in complex cases with a lot of conditional branches. Complexity
+threshold can be configured with \fBcgoto:threshold\fP configuration. This
+option implies \fB\-b\fP\&.
+.TP
+.B \fB\-I PATH\fP
+Add \fBPATH\fP to the list of locations which are used when searching for
+include files. This option is useful in combination with
+\fB/*!include:re2c ... */\fP directive.
Re2c looks for \fBFILE\fP in the
+directory of including file and in the list of include paths specified by
+\fB\-I\fP option.
 .TP
-.B \fB\-\-case\-inverted\fP
-Invert the meaning of single\-quoted and double\-quoted strings:
-treat single\-quoted strings as case\-sensitive and double\-quoted strings as case\-insensitive.
+.B \fB\-i \-\-no\-debug\-info\fP
+Do not output \fB#line\fP information. This is useful when the generated code
+is tracked by some version control system or IDE.
+.TP
+.B \fB\-\-input \fP
+Specify re2c input API.
+Option \fBdefault\fP is the default API composed of pointer\-like primitives
+\fBYYCURSOR\fP, \fBYYMARKER\fP, \fBYYLIMIT\fP etc.
+Option \fBcustom\fP is the generic API composed of function\-like primitives
+\fBYYPEEK()\fP, \fBYYSKIP()\fP, \fBYYBACKUP()\fP, \fBYYRESTORE()\fP etc.
+.TP
+.B \fB\-\-input\-encoding \fP
+Specify the way re2c parses regular expressions.
+With \fBascii\fP (the default) re2c handles input as ASCII\-encoded: any
+sequence of code units is a sequence of standalone 1\-byte characters.
+With \fButf8\fP re2c handles input as UTF8\-encoded and recognizes multibyte
+characters.
+.TP
+.B \fB\-\-location\-format \fP
+Specify location format in messages.
+With \fBgnu\fP locations are printed as \(aqfilename:line:column: ...\(aq.
+With \fBmsvc\fP locations are printed as \(aqfilename(line,column) ...\(aq.
+Default is \fBgnu\fP\&.
 .TP
 .B \fB\-\-no\-generation\-date\fP
 Suppress date output in the generated file.
 .TP
 .B \fB\-\-no\-lookahead\fP
 Use TDFA(0) instead of TDFA(1).
-This option only has effect with \fB\-\-tags\fP or \fB\-\-posix\-captures\fP options.
+This option has effect only with \fB\-\-tags\fP or \fB\-\-posix\-captures\fP options.
 .TP
 .B \fB\-\-no\-optimize\-tags\fP
-Suppress optimization of tag variables (useful for debugging or benchmarking).
+Suppress optimization of tag variables (useful for debugging).
 .TP
 .B \fB\-\-no\-version\fP
 Suppress version output in the generated file.
 .TP
-.B \fB\-\-encoding\-policy POLICY\fP
-Define the way \fBre2c\fP treats Unicode surrogates.
-\fBPOLICY\fP can be one of the following: \fBfail\fP (abort with an error when a surrogate is encountered),
-\fBsubstitute\fP (silently replace surrogates with the error code point 0xFFFD),
-\fBignore\fP (default, treat surrogates as normal code points).
-The Unicode standard says that standalone surrogates are invalid,
-but real\-world libraries and programs behave in different ways.
+.B \fB\-o OUTPUT \-\-output=OUTPUT\fP
+Specify the \fBOUTPUT\fP file.
 .TP
-.B \fB\-\-input INPUT\fP
-Specify \fBre2c\fP input API. \fBINPUT\fP can be either \fBdefault\fP or \fBcustom\fP (enables the use of generic API).
+.B \fB\-P \-\-posix\-captures\fP
+Enable submatch extraction with POSIX\-style capturing groups.
+.TP
+.B \fB\-\-posix\-closure \fP
+Specify shortest\-path algorithm used for construction of epsilon\-closure
+with POSIX disambiguation semantics: \fBgor1\fP (the default) stands for
+Goldberg\-Radzik algorithm, and \fBgtop\fP stands for "global topological
+order" algorithm.
+.TP
+.B \fB\-r \-\-reusable\fP
+Allows reuse of re2c rules with \fB/*!rules:re2c */\fP and \fB/*!use:re2c */\fP
+blocks. Exactly one rules\-block must be present. The rules are saved and
+used by every use\-block that follows, which may add its own rules and
+configurations.
 .TP
 .B \fB\-S \-\-skeleton\fP
-Ignore user\-defined interface code and generate a self\-contained "skeleton" program.
-Additionally, generate input files with strings derived from the regular grammar
-and compressed match results that are used to verify "skeleton" behavior on all inputs.
-This option is useful for finding bugs in optimizations and code generation.
-.TP
-.B \fB\-\-empty\-class POLICY\fP
-Define the way \fBre2c\fP treats empty character classes.
-\fBPOLICY\fP can be one of the following: \fBmatch\-empty\fP (match empty input: illogical, but default behavior for backwards compatibility reasons),
-\fBmatch\-none\fP (fail to match on any input),
-\fBerror\fP (compilation error).
-.TP
-.B \fB\-\-dfa\-minimization ALGORITHM\fP
-The internal algorithm used by re2c to minimize the DFA.
-\fBALGORITHM\fP can be either \fBmoore\fP (Moore algorithm, the default) or \fBtable\fP (table filling algorithm).
-Both algorithms should produce the same DFA up to states relabeling;
-table filling is much slower and serves as a reference implementation.
+Ignore user\-defined interface code and generate a self\-contained "skeleton"
+program. Additionally, generate input files with strings derived from the
+regular grammar and compressed match results that are used to verify
+"skeleton" behavior on all inputs. This option is useful for finding bugs
+in optimizations and code generation.
 .TP
-.B \fB\-\-eager\-skip\fP
-Make the generated lexer advance the input position "eagerly":
-immediately after reading input symbol.
-By default this happens after transition to the next state.
-Implied by \fB\-\-no\-lookahead\fP\&.
+.B \fB\-s \-\-nested\-ifs\fP
+Use nested \fBif\fP statements instead of \fBswitch\fP statements in conditional
+jumps. This usually results in more efficient code with non\-optimizing C/C++
+compilers.
 .TP
-.B \fB\-\-dump\-nfa\fP
-Generate representation of NFA in DOT format and dump it on stderr.
+.B \fB\-T \-\-tags\fP
+Enable submatch extraction with tags.
 .TP
-.B \fB\-\-dump\-dfa\-raw\fP
-Generate representation of DFA in DOT format under construction and dump it on stderr.
+.B \fB\-t HEADER \-\-type\-header=HEADER\fP
+Generate a \fBHEADER\fP file that contains enum with condition names.
+Requires \fB\-c\fP option.
 .TP
-.B \fB\-\-dump\-dfa\-det\fP
-Generate representation of DFA in DOT format immediately after determinization and dump it on stderr.
+.B \fB\-u \-\-unicode\fP
+Generate a lexer that reads UTF32\-encoded input. Re2c assumes that character
+range is 0 \-\- 0x10FFFF and character size is 4 bytes. This option implies
+\fB\-s\fP\&.
 .TP
-.B \fB\-\-dump\-dfa\-tagopt\fP
-Generate representation of DFA in DOT format after tag optimizations and dump it on stderr.
+.B \fB\-V \-\-vernum\fP
+Show version information in \fBMMmmpp\fP format (major, minor, patch).
 .TP
-.B \fB\-\-dump\-dfa\-min\fP
-Generate representation of DFA in DOT format after minimization and dump it on stderr.
+.B \fB\-\-verbose\fP
+Output a short message in case of success.
 .TP
-.B \fB\-\-dump\-adfa\fP
-Generate representation of DFA in DOT format after tunneling and dump it on stderr.
+.B \fB\-v \-\-version\fP
+Show version information.
 .TP
-.B \fB\-1 \-\-single\-pass\fP
-Deprecated. Does nothing (single pass is the default now).
+.B \fB\-w \-\-wide\-chars\fP
+Generate a lexer that reads UCS2\-encoded input. Re2c assumes that character
+range is 0 \-\- 0xFFFF and character size is 2 bytes. This option implies
+\fB\-s\fP\&.
+.TP
+.B \fB\-x \-\-utf\-16\fP
+Generate a lexer that reads UTF16\-encoded input. Re2c assumes that character
+range is 0 \-\- 0x10FFFF and character size is 2 bytes. This option implies
+\fB\-s\fP\&.
 .UNINDENT
 .INDENT 0.0
 .TP
@@ -596,12 +646,6 @@ Same as \fB\-\-case\-inverted\fP command\-line option.
 .B \fBre2c:flags:d\fP or \fBre2c:flags:debug\-output\fP
 Same as \fB\-d \-\-debug\-output\fP command\-line option.
 .TP
-.B \fBre2c:flags:dfa\-minimization = \(aqmoore\(aq;\fP
-Same as \fB\-\-dfa\-minimization\fP command\-line option.
-.TP
-.B \fBre2c:flags:eager\-skip = 0;\fP
-Same as \fB\-\-eager\-skip\fP command\-line option.
-.TP
 .B \fBre2c:flags:e\fP or \fBre2c:flags:ecb\fP
 Same as \fB\-e \-\-ecb\fP command\-line option.
 .TP
@@ -620,18 +664,18 @@ Same as \fB\-i \-\-no\-debug\-info\fP command\-line option.
 .B \fBre2c:flags:input = \(aqdefault\(aq;\fP
 Same as \fB\-\-input\fP command\-line option.
 .TP
-.B \fBre2c:flags:lookahead = 1;\fP
-Same as inverted \fB\-\-no\-lookahead\fP command\-line option.
-.TP
-.B \fBre2c:flags:optimize\-tags = 1;\fP
-Same as inverted \fB\-\-no\-optimize\-tags\fP command\-line option.
-.TP
 .B \fBre2c:flags:P\fP or \fBre2c:flags:posix\-captures\fP
 Same as \fB\-P \-\-posix\-captures\fP command\-line option.
 .TP
 .B \fBre2c:flags:s\fP or \fBre2c:flags:nested\-ifs\fP
 Same as \fB\-s \-\-nested\-ifs\fP command\-line option.
 .TP
+.B \fBre2c:flags:o\fP or \fBre2c:flags:output\fP
+Same as \fB\-o \-\-output\fP command\-line option.
+.TP
+.B \fBre2c:flags:t\fP or \fBre2c:flags:type\-header\fP
+Same as \fB\-t \-\-type\-header\fP command\-line option.
+.TP
 .B \fBre2c:flags:T\fP or \fBre2c:flags:tags\fP
 Same as \fB\-T \-\-tags\fP command\-line option.
 .TP
@@ -861,6 +905,110 @@ represented as array of nodes \fB(v, p)\fP, where \fBv\fP is tag value and \fBp\
 .sp
 For further details see \fBhttp://re2c.org/examples/examples.html\fP page on the website or \fBre2c/examples/\fP subdirectory of \fBre2c\fP distribution.
+.SH INCLUDES
+.sp
+Re2c allows to include other files using directive \fB/*!include:re2c FILE */\fP,
+where \fBFILE\fP is the name of file to be included. Re2c looks for included
+files in the directory of the including file and in include locations, which
+can be specified with \fB\-I\fP option.
+Re2c include directive works in the same way as C/C++ \fB#include\fP: the contents
+of \fBFILE\fP are copy\-pasted verbatim in place of the directive. Include files
+may have further includes of their own.
+Re2c provides some predefined include files that can be found in the
+\fBinclude/\fP subdirectory of the project. These files contain definitions that
+can be useful to other projects (such as Unicode categories) and form something
+like a standard library for re2c.
+Here is an example of using include files:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+// definitions.re
+/*!re2c
+ alpha = [a\-zA\-Z];
+ digit = [0\-9];
+*/
+
+// main.re
+/*!include:re2c "definitions.re" */
+int lex(const char *YYCURSOR)
+{
+ const char *YYMARKER;
+ /*!re2c
+ alpha { return 1; }
+ digit { return 2; }
+ * { return 0; }
+ */
+}
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.SH HEADERS
+.sp
+Re2c allows to generate header file from the input \fB\&.re\fP file using option
+\fB\-t \-\-type\-header\fP (or the corresponding configurations) and directives
+\fB/*!header:re2c:on*/\fP and \fB/*!header:re2c:off*/\fP\&. The first directive
+marks the beginning of header file, and the second directive marks the end of
+it. This may be needed in cases when re2c is used to generate definitions of
+constants, variables and structs that must be visible from other translation
+units.
+Below is an example of generating header file that contains definitions of
+\fBYYMAXFILL\fP and lexer state with tag variables. Note that \fBYYMAXFILL\fP and
+tag variables depend on the grammar rules in the input \fB\&.re\fP file and cannot
+be hard\-coded.
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+/*!header:re2c:on*/
+/*!max:re2c*/
+struct State {
+ char buffer[4096 + YYMAXFILL], *cursor, *marker, *limit;
+ /*!stags:re2c format = "char *@@; "; */
+};
+/*!header:re2c:off*/
+
+#include "lex.h"
+#define YYCTYPE char
+#define YYCURSOR state\->cursor
+#define YYMARKER state\->marker
+#define YYLIMIT state\->limit
+#define YYFILL(n) return 2
+int lex(State *state)
+{
+ char *x, *y;
+ /*!re2c
+ re2c:tags:expression = state\->@@;
+ re2c:flags:t = lex.h;
+
+ "a"* @x "b"* @y "c"* { return 0; }
+ * { return 1; }
+ */
+}
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The generated header will look like this:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+#define YYMAXFILL 1
+
+struct State {
+ char buffer[4096 + YYMAXFILL], *cursor, *marker, *limit;
+ char *yyt1; char *yyt2;
+};
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
 .SH STORABLE STATE
 .sp
 With \fB\-f\fP \fB\-\-storable\-state\fP option re2c generates a lexer that can
@@ -1123,6 +1271,7 @@ Durimar,
 Eldar Zakirov,
 Emmanuel Mogenet,
 Hartmut Kaiser,
+Henri Salo (fgeek),
 jcfp,
 Jean\-Claude Wippler,
 Jeff Trull,
@@ -1143,11 +1292,12 @@ Rui Maciel,
 Ryan Mast,
 Samuel006,
 Sergei Trofimovich,
+Serghei Iakovlev,
 sirzooro,
 Tim Kelly,
 Ulya Trofimovich
 .SH VERSION INFORMATION
 .sp
-This manpage describes \fBre2c\fP version 1.1.1, package date 10 Jan 2019.
+This manpage describes \fBre2c\fP version 1.1.1, package date 15 Jul 2019.
 .\" Generated by docutils manpage writer.
 .
diff --git a/bootstrap/src/msg/help.cc b/bootstrap/src/msg/help.cc
index d1975b52..bc720f7f 100644
--- a/bootstrap/src/msg/help.cc
+++ b/bootstrap/src/msg/help.cc
@@ -1,207 +1,167 @@
 extern const char *help;
 const char *help =
-"\n"
-" -? -h --help\n"
 " Show help message.\n"
 "\n"
+" -1 --single-pass\n"
+" Deprecated. Does nothing (single pass is the default now).\n"
+"\n"
+" -8 --utf-8\n"
+" Generate a lexer that reads input in UTF-8 encoding.
re2c assumes that character range is 0 -- 0x10FFFF and character size is 1 byte.\n"
+"\n"
 " -b --bit-vectors\n"
 " Optimize conditional jumps using bit masks. Implies -s.\n"
 "\n"
 " -c --conditions --start-conditions\n"
-" Enable support of Flex-like \"conditions\": multiple interrelated\n"
-" lexers within one block. Option --start-conditions is a legacy\n"
-" alias; use --conditions instead.\n"
+" Enable support of Flex-like \"conditions\": multiple interrelated lexers within one block. Option --start-conditions is a legacy alias; use --conditions instead.\n"
 "\n"
-" -d --debug-output\n"
-" Emit YYDEBUG in the generated code. YYDEBUG should be defined\n"
-" by the user in the form of a void function with two parameters:\n"
-" state (lexer state or -1) and symbol (current input symbol of\n"
-" type YYCTYPE).\n"
+" --case-insensitive\n"
+" Treat single-quoted and double-quoted strings as case-insensitive.\n"
+"\n"
+" --case-inverted\n"
+" Invert the meaning of single-quoted and double-quoted strings: treat single-quoted strings as case-sensitive and double-quoted strings as case-insensitive.\n"
 "\n"
 " -D --emit-dot\n"
-" Instead of normal output generate lexer graph in DOT format.\n"
-" The output can be converted to PNG with the help of Graphviz\n"
-" (something like dot -Tpng -odfa.png dfa.dot). Note that large\n"
-" graphs may crash Graphviz.\n"
+" Instead of normal output generate lexer graph in .dot format. The output can be converted to an image with the help of Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).\n"
 "\n"
-" -e --ecb\n"
-" Generate a lexer that reads input in EBCDIC encoding. re2c\n"
-" assumes that character range is 0 -- 0xFF an character size is 1\n"
-" byte.\n"
+" -d --debug-output\n"
+" Emit YYDEBUG in the generated code.
YYDEBUG should be defined by the user in the form of a void function with two parameters: state (lexer state or -1) and symbol (current input symbol of type YYCTYPE).\n"
 "\n"
-" -f --storable-state\n"
-" Generate a lexer which can store its inner state. This is use‐\n"
-" ful in push-model lexers which are stopped by an outer program\n"
-" when there is not enough input, and then resumed when more input\n"
-" becomes available. In this mode users should additionally\n"
-" define YYGETSTATE () and YYSETSTATE (state) macros and variables\n"
-" yych, yyaccept and the state as part of the lexer state.\n"
+" --dfa-minimization \n"
+" The internal algorithm used by re2c to minimize the DFA: moore (the default) is Moore algorithm, and table is the \"table filling\" algorithm. Both algorithms should produce the same DFA up to states relabeling; table\n"
+" filling is simpler and much slower and serves as a reference implementation.\n"
 "\n"
-" -F --flex-syntax\n"
-" Partial support for Flex syntax: in this mode named definitions\n"
-" don't need the equal sign and the terminating semicolon, and\n"
-" when used they must be surrounded by curly braces. Names with‐\n"
-" out curly braces are treated as double-quoted strings.\n"
+" --dump-adfa\n"
+" Debug option: output DFA after tunneling (in .dot format).\n"
 "\n"
-" -g --computed-gotos\n"
-" Optimize conditional jumps using non-standard \"computed goto\"\n"
-" extension (must be supported by C/C++ compiler). re2c generates\n"
-" jump tables only in complex cases with a lot of conditional\n"
-" branches. Complexity threshold can be configured with\n"
-" cgoto:threshold configuration. This option implies -b.\n"
+" --dump-cfg\n"
+" Debug option: output control flow graph of tag variables (in .dot format).\n"
 "\n"
-" -i --no-debug-info\n"
-" Do not output #line information.
This is useful when the gener‐\n"
-" ated code is tracked by some version control system.\n"
+" --dump-closure-stats\n"
+" Debug option: output statistics on the number of states in closure.\n"
 "\n"
-" -o OUTPUT --output=OUTPUT\n"
-" Specify the OUTPUT file.\n"
+" --dump-dfa-det\n"
+" Debug option: output DFA immediately after determinization (in .dot format).\n"
 "\n"
-" -r --reusable\n"
-" Allows reuse of re2c rules with /*!rules:re2c */ and /*!use:re2c\n"
-" */ blocks. In this mode simple /*!re2c */ blocks are not\n"
-" allowed and exactly one /*!rules:re2c */ block must be present.\n"
-" The rules are saved and used by every /*!use:re2c */ block that\n"
-" follows (which may add rules of their own). This option allows\n"
-" to reuse the same set of rules with different configurations.\n"
+" --dump-dfa-min\n"
+" Debug option: output DFA after minimization (in .dot format).\n"
 "\n"
-" -s --nested-ifs\n"
-" Use nested if statements instead of switch statements in condi‐\n"
-" tional jumps. This usually results in more efficient code with\n"
-" non-optimizing C/C++ compilers.\n"
+" --dump-dfa-tagopt\n"
+" Debug option: output DFA after tag optimizations (in .dot format).\n"
 "\n"
-" -t HEADER --type-header=HEADER\n"
-" Generate a HEADER file that contains enum with condition names.\n"
-" Requires -c option.\n"
+" --dump-dfa-raw\n"
+" Debug option: output DFA under construction with expanded state-sets (in .dot format).\n"
 "\n"
-" -T --tags\n"
-" Enable submatch extraction with tags.\n"
+" --dump-interf\n"
+" Debug option: output interference table produced by liveness analysis of tag variables.\n"
 "\n"
-" -P --posix-captures\n"
-" Enable submatch extraction with POSIX-style capturing groups.\n"
+" --dump-nfa\n"
+" Debug option: output NFA (in .dot format).\n"
 "\n"
-" -u --unicode\n"
-" Generate a lexer that reads input in UTF-32 encoding. re2c\n"
-" assumes that character range is 0 -- 0x10FFFF and character size\n"
-" is 4 bytes.
Implies -s.\n"
+" -e --ecb\n"
+" Generate a lexer that reads input in EBCDIC encoding. re2c assumes that character range is 0 -- 0xFF an character size is 1 byte.\n"
 "\n"
-" -v --version\n"
-" Show version information.\n"
+" --eager-skip\n"
+" Make the generated lexer advance the input position \"eagerly\": immediately after reading input symbol. By default this happens after transition to the next state. Implied by --no-lookahead.\n"
 "\n"
-" -V --vernum\n"
-" Show version information in MMmmpp format (major, minor, patch).\n"
+" --empty-class \n"
+" Define the way re2c treats empty character classes. With match-empty (the default) empty class matches empty input (which is illogical, but backwards-compatible). With``match-none`` empty class always fails to match.\n"
+" With error empty class raises a compilation error.\n"
 "\n"
-" -w --wide-chars\n"
-" Generate a lexer that reads input in UCS-2 encoding. re2c\n"
-" assumes that character range is 0 -- 0xFFFF and character size\n"
-" is 2 bytes. Implies -s.\n"
+" --encoding-policy \n"
+" Define the way re2c treats Unicode surrogates. With fail re2c aborts with an error when a surrogate is encountered. With substitute re2c silently replaces surrogates with the error code point 0xFFFD. With ignore (the\n"
+" default) re2c treats surrogates as normal code points. The Unicode standard says that standalone surrogates are invalid, but real-world libraries and programs behave in different ways.\n"
 "\n"
-" -x --utf-16\n"
-" Generate a lexer that reads input in UTF-16 encoding. re2c\n"
-" assumes that character range is 0 -- 0x10FFFF and character size\n"
-" is 2 bytes. Implies -s.\n"
+" -f --storable-state\n"
+" Generate a lexer which can store its inner state. This is useful in push-model lexers which are stopped by an outer program when there is not enough input, and then resumed when more input becomes available.
In this\n"
+" mode users should additionally define YYGETSTATE() and YYSETSTATE(state) macros and variables yych, yyaccept and state as part of the lexer state.\n"
 "\n"
-" -8 --utf-8\n"
-" Generate a lexer that reads input in UTF-8 encoding. re2c\n"
-" assumes that character range is 0 -- 0x10FFFF and character size\n"
-" is 1 byte.\n"
+" -F --flex-syntax\n"
+" Partial support for Flex syntax: in this mode named definitions don't need the equal sign and the terminating semicolon, and when used they must be surrounded by curly braces. Names without curly braces are treated as\n"
+" double-quoted strings.\n"
 "\n"
-" --case-insensitive\n"
-" Treat single-quoted and double-quoted strings as case-insensi‐\n"
-" tive.\n"
+" -g --computed-gotos\n"
+" Optimize conditional jumps using non-standard \"computed goto\" extension (which must be supported by the C/C++ compiler). re2c generates jump tables only in complex cases with a lot of conditional branches. Complexity\n"
+" threshold can be configured with cgoto:threshold configuration. This option implies -b.\n"
 "\n"
-" --case-inverted\n"
-" Invert the meaning of single-quoted and double-quoted strings:\n"
-" treat single-quoted strings as case-sensitive and double-quoted\n"
-" strings as case-insensitive.\n"
+" -I PATH\n"
+" Add PATH to the list of locations which are used when searching for include files. This option is useful in combination with /*!include:re2c ... */ directive. Re2c looks for FILE in the directory of including file and\n"
+" in the list of include paths specified by -I option.\n"
+"\n"
+" -i --no-debug-info\n"
+" Do not output #line information. This is useful when the generated code is tracked by some version control system or IDE.\n"
+"\n"
+" --input \n"
+" Specify re2c input API. Option default is the default API composed of pointer-like primitives YYCURSOR, YYMARKER, YYLIMIT etc.
Option custom is the generic API composed of function-like primitives YYPEEK(), YYSKIP(),\n"
+" YYBACKUP(), YYRESTORE() etc.\n"
+"\n"
+" --input-encoding \n"
+" Specify the way re2c parses regular expressions. With ascii (the default) re2c handles input as ASCII-encoded: any sequence of code units is a sequence of standalone 1-byte characters. With utf8 re2c handles input as\n"
+" UTF8-encoded and recognizes multibyte characters.\n"
+"\n"
+" --location-format \n"
+" Specify location format in messages. With gnu locations are printed as 'filename:line:column: ...'. With msvc locations are printed as 'filename(line,column) ...'. Default is gnu.\n"
 "\n"
 " --no-generation-date\n"
 " Suppress date output in the generated file.\n"
 "\n"
 " --no-lookahead\n"
-" Use TDFA(0) instead of TDFA(1). This option only has effect\n"
-" with --tags or --posix-captures options.\n"
+" Use TDFA(0) instead of TDFA(1). This option has effect only with --tags or --posix-captures options.\n"
 "\n"
 " --no-optimize-tags\n"
-" Suppress optimization of tag variables (useful for debugging or\n"
-" benchmarking).\n"
+" Suppress optimization of tag variables (useful for debugging).\n"
 "\n"
 " --no-version\n"
 " Suppress version output in the generated file.\n"
 "\n"
-" --encoding-policy POLICY\n"
-" Define the way re2c treats Unicode surrogates. POLICY can be\n"
-" one of the following: fail (abort with an error when a surrogate\n"
-" is encountered), substitute (silently replace surrogates with\n"
-" the error code point 0xFFFD), ignore (default, treat surrogates\n"
-" as normal code points).
The Unicode standard says that stand‐\n" -" alone surrogates are invalid, but real-world libraries and pro‐\n" -" grams behave in different ways.\n" +" -o OUTPUT --output=OUTPUT\n" +" Specify the OUTPUT file.\n" +"\n" +" -P --posix-captures\n" +" Enable submatch extraction with POSIX-style capturing groups.\n" +"\n" +" --posix-closure \n" +" Specify shortest-path algorithm used for construction of epsilon-closure with POSIX disambiguation semantics: gor1 (the default) stands for Goldberg-Radzik algorithm, and gtop stands for \"global topological order\" algo‐\n" +" rithm.\n" "\n" -" --input INPUT\n" -" Specify re2c input API. INPUT can be either default or custom\n" -" (enables the use of generic API).\n" +" -r --reusable\n" +" Allows reuse of re2c rules with /*!rules:re2c */ and /*!use:re2c */ blocks. Exactly one rules-block must be present. The rules are saved and used by every use-block that follows, which may add its own rules and configu‐\n" +" rations.\n" "\n" " -S --skeleton\n" -" Ignore user-defined interface code and generate a self-contained\n" -" \"skeleton\" program. Additionally, generate input files with\n" -" strings derived from the regular grammar and compressed match\n" -" results that are used to verify \"skeleton\" behavior on all\n" -" inputs. This option is useful for finding bugs in optimizations\n" -" and code generation.\n" -"\n" -" --empty-class POLICY\n" -" Define the way re2c treats empty character classes. POLICY can\n" -" be one of the following: match-empty (match empty input: illogi‐\n" -" cal, but default behavior for backwards compatibility reasons),\n" -" match-none (fail to match on any input), error (compilation\n" -" error).\n" -"\n" -" --dfa-minimization ALGORITHM\n" -" The internal algorithm used by re2c to minimize the DFA. ALGO‐\n" -" RITHM can be either moore (Moore algorithm, the default) or ta‐\n" -" ble (table filling algorithm). 
Both algorithms should produce\n" -" the same DFA up to states relabeling; table filling is much\n" -" slower and serves as a reference implementation.\n" +" Ignore user-defined interface code and generate a self-contained \"skeleton\" program. Additionally, generate input files with strings derived from the regular grammar and compressed match results that are used to verify\n" +" \"skeleton\" behavior on all inputs. This option is useful for finding bugs in optimizations and code generation.\n" "\n" -" --eager-skip\n" -" Make the generated lexer advance the input position \"eagerly\":\n" -" immediately after reading input symbol. By default this happens\n" -" after transition to the next state. Implied by --no-lookahead.\n" +" -s --nested-ifs\n" +" Use nested if statements instead of switch statements in conditional jumps. This usually results in more efficient code with non-optimizing C/C++ compilers.\n" "\n" -" --dump-nfa\n" -" Generate representation of NFA in DOT format and dump it on\n" -" stderr.\n" +" -T --tags\n" +" Enable submatch extraction with tags.\n" "\n" -" --dump-dfa-raw\n" -" Generate representation of DFA in DOT format under construction\n" -" and dump it on stderr.\n" +" -t HEADER --type-header=HEADER\n" +" Generate a HEADER file that contains enum with condition names. Requires -c option.\n" "\n" -" --dump-dfa-det\n" -" Generate representation of DFA in DOT format immediately after\n" -" determinization and dump it on stderr.\n" +" -u --unicode\n" +" Generate a lexer that reads UTF32-encoded input. Re2c assumes that character range is 0 -- 0x10FFFF and character size is 4 bytes. 
This option implies -s.\n" "\n" -" --dump-dfa-tagopt\n" -" Generate representation of DFA in DOT format after tag optimiza‐\n" -" tions and dump it on stderr.\n" +" -V --vernum\n" +" Show version information in MMmmpp format (major, minor, patch).\n" "\n" -" --dump-dfa-min\n" -" Generate representation of DFA in DOT format after minimization\n" -" and dump it on stderr.\n" +" --verbose\n" +" Output a short message in case of success.\n" "\n" -" --dump-adfa\n" -" Generate representation of DFA in DOT format after tunneling and\n" -" dump it on stderr.\n" +" -v --version\n" +" Show version information.\n" "\n" -" -1 --single-pass\n" -" Deprecated. Does nothing (single pass is the default now).\n" +" -w --wide-chars\n" +" Generate a lexer that reads UCS2-encoded input. Re2c assumes that character range is 0 -- 0xFFFF and character size is 2 bytes. This option implies -s.\n" +"\n" +" -x --utf-16\n" +" Generate a lexer that reads UTF16-encoded input. Re2c assumes that character range is 0 -- 0x10FFFF and character size is 2 bytes. This option implies -s.\n" "\n" " -W Turn on all warnings.\n" "\n" " -Werror\n" -" Turn warnings into errors. Note that this option alone doesn't\n" -" turn on any warnings; it only affects those warnings that have\n" -" been turned on so far or will be turned on later.\n" +" Turn warnings into errors. Note that this option alone doesn't turn on any warnings; it only affects those warnings that have been turned on so far or will be turned on later.\n" "\n" " -W\n" " Turn on warning.\n" @@ -210,56 +170,34 @@ const char *help = " Turn off warning.\n" "\n" " -Werror-\n" -" Turn on warning and treat it as an error (this implies -W).\n" +" Turn on warning and treat it as an error (this implies -W).\n" "\n" " -Wno-error-\n" -" Don't treat this particular warning as an error. This doesn't\n" -" turn off the warning itself.\n" +" Don't treat this particular warning as an error. 
This doesn't turn off the warning itself.\n" "\n" " -Wcondition-order\n" -" Warn if the generated program makes implicit assumptions about\n" -" condition numbering. One should use either the -t, --type-header\n" -" option or the /*!types:re2c*/ directive to generate a mapping of\n" -" condition names to numbers and then use the autogenerated condi‐\n" -" tion names.\n" +" Warn if the generated program makes implicit assumptions about condition numbering. One should use either the -t, --type-header option or the /*!types:re2c*/ directive to generate a mapping of condition names to numbers\n" +" and then use the autogenerated condition names.\n" "\n" " -Wempty-character-class\n" -" Warn if a regular expression contains an empty character class.\n" -" Trying to match an empty character class makes no sense: it\n" -" should always fail. However, for backwards compatibility rea‐\n" -" sons re2c allows empty character classes and treats them as\n" -" empty strings. Use the --empty-class option to change the\n" -" default behavior.\n" +" Warn if a regular expression contains an empty character class. Trying to match an empty character class makes no sense: it should always fail. However, for backwards compatibility reasons re2c allows empty character\n" +" classes and treats them as empty strings. Use the --empty-class option to change the default behavior.\n" "\n" " -Wmatch-empty-string\n" -" Warn if a rule is nullable (matches an empty string). If the\n" -" lexer runs in a loop and the empty match is unintentional, the\n" -" lexer may unexpectedly hang in an infinite loop.\n" +" Warn if a rule is nullable (matches an empty string). If the lexer runs in a loop and the empty match is unintentional, the lexer may unexpectedly hang in an infinite loop.\n" "\n" " -Wswapped-range\n" -" Warn if the lower bound of a range is greater than its upper\n" -" bound. 
The default behavior is to silently swap the range\n" -" bounds.\n" +" Warn if the lower bound of a range is greater than its upper bound. The default behavior is to silently swap the range bounds.\n" "\n" " -Wundefined-control-flow\n" -" Warn if some input strings cause undefined control flow in the\n" -" lexer (the faulty patterns are reported). This is the most dan‐\n" -" gerous and most common mistake. It can be easily fixed by adding\n" -" the default rule * which has the lowest priority, matches any\n" -" code unit, and consumes exactly one code unit.\n" +" Warn if some input strings cause undefined control flow in the lexer (the faulty patterns are reported). This is the most dangerous and most common mistake. It can be easily fixed by adding the default rule * which has\n" +" the lowest priority, matches any code unit, and consumes exactly one code unit.\n" "\n" " -Wunreachable-rules\n" -" Warn about rules that are shadowed by other rules and will never\n" -" match.\n" +" Warn about rules that are shadowed by other rules and will never match.\n" "\n" " -Wuseless-escape\n" -" Warn if a symbol is escaped when it shouldn't be. By default,\n" -" re2c silently ignores such escapes, but this may as well indi‐\n" -" cate a typo or an error in the escape sequence.\n" +" Warn if a symbol is escaped when it shouldn't be. By default, re2c silently ignores such escapes, but this may as well indicate a typo or an error in the escape sequence.\n" "\n" " -Wnondeterministic-tags\n" -" Warn if a tag has n-th degree of nondeterminism, where n is\n" -" greater than 1.\n" -"\n" ; diff --git a/doc/manpage.rst.in b/doc/manpage.rst.in index c682807b..52031ae0 100644 --- a/doc/manpage.rst.in +++ b/doc/manpage.rst.in @@ -21,7 +21,6 @@ specifications inside of C/C++ comments and replaces them with a hard-coded DFA. The user must supply some interface code in order to control and customize the generated DFA. - OPTIONS ------- @@ -31,84 +30,78 @@ OPTIONS .. 
include:: @top_srcdir@/doc/manual/warnings/warnings_list.rst - INTERFACE CODE -------------- .. include:: @top_srcdir@/doc/manual/syntax/interface.rst_ - SYNTAX ------ A program can contain any number of ``re2c`` blocks. Each block consists of a sequence of ``RULES``, ``NAMED DEFINITIONS`` and ``INPLACE CONFIGURATIONS``. - - RULES ~~~~~ .. include:: @top_srcdir@/doc/manual/syntax/rules.rst_ - NAMED DEFINITIONS ~~~~~~~~~~~~~~~~~ .. include:: @top_srcdir@/doc/manual/syntax/named_definitions.rst_ - - INPLACE CONFIGURATIONS ~~~~~~~~~~~~~~~~~~~~~~ .. include:: @top_srcdir@/doc/manual/syntax/configurations.rst_ - REGULAR EXPRESSIONS ~~~~~~~~~~~~~~~~~~~ .. include:: @top_srcdir@/doc/manual/syntax/regular_expressions.rst_ - SUBMATCH EXTRACTION ------------------- .. include:: @top_srcdir@/doc/manual/features/submatch/submatch.rst_ +INCLUDES +-------- + +.. include:: @top_srcdir@/doc/manual/features/includes/includes.rst_ + +HEADERS +-------- + +.. include:: @top_srcdir@/doc/manual/features/headers/headers.rst_ STORABLE STATE -------------- .. include:: @top_srcdir@/doc/manual/features/state/state.rst_ - - CONDITIONS ---------- .. include:: @top_srcdir@/doc/manual/features/conditions/conditions.rst_ - ENCODINGS --------- .. include:: @top_srcdir@/doc/manual/features/encodings/encodings.rst_ - GENERIC API ----------- .. include:: @top_srcdir@/doc/manual/features/generic_api/generic_api.rst_ - SEE ALSO -------- You can find more information about ``re2c`` at: http://re2c.org. See also: flex(1), lex(1), quex (http://quex.sourceforge.net). - AUTHORS ------- @@ -118,7 +111,6 @@ Below is a (more or less) full list of contributors retrieved from the Git histo .. 
include:: @top_srcdir@/doc/manual/contributors.rst_ - VERSION INFORMATION ------------------- diff --git a/doc/manual/contributors.rst_ b/doc/manual/contributors.rst_ index ecde6c31..52df0f57 100644 --- a/doc/manual/contributors.rst_ +++ b/doc/manual/contributors.rst_ @@ -10,6 +10,7 @@ Durimar, Eldar Zakirov, Emmanuel Mogenet, Hartmut Kaiser, +Henri Salo (fgeek), jcfp, Jean-Claude Wippler, Jeff Trull, @@ -30,6 +31,7 @@ Rui Maciel, Ryan Mast, Samuel006, Sergei Trofimovich, +Serghei Iakovlev, sirzooro, Tim Kelly, Ulya Trofimovich diff --git a/doc/manual/features/headers/headers.rst_ b/doc/manual/features/headers/headers.rst_ new file mode 100644 index 00000000..5e8318eb --- /dev/null +++ b/doc/manual/features/headers/headers.rst_ @@ -0,0 +1,50 @@ +Re2c can generate a header file from the input ``.re`` file using the option +``-t --type-header`` (or the corresponding configurations) and the directives +``/*!header:re2c:on*/`` and ``/*!header:re2c:off*/``. The first directive +marks the beginning of the header file, and the second directive marks the end of +it. This may be needed when re2c is used to generate definitions of +constants, variables and structs that must be visible from other translation +units. +Below is an example of generating a header file that contains definitions of +``YYMAXFILL`` and the lexer state with tag variables. Note that ``YYMAXFILL`` and +tag variables depend on the grammar rules in the input ``.re`` file and cannot +be hard-coded. + +.. 
code-block:: cpp + + /*!header:re2c:on*/ + /*!max:re2c*/ + struct State { + char buffer[4096 + YYMAXFILL], *cursor, *marker, *limit; + /*!stags:re2c format = "char *@@; "; */ + }; + /*!header:re2c:off*/ + + #include "lex.h" + #define YYCTYPE char + #define YYCURSOR state->cursor + #define YYMARKER state->marker + #define YYLIMIT state->limit + #define YYFILL(n) return 2 + int lex(State *state) + { + char *x, *y; + /*!re2c + re2c:tags:expression = state->@@; + re2c:flags:t = lex.h; + + "a"* @x "b"* @y "c"* { return 0; } + * { return 1; } + */ + } + +The generated header will look like this: + +.. code-block:: cpp + + #define YYMAXFILL 1 + + struct State { + char buffer[4096 + YYMAXFILL], *cursor, *marker, *limit; + char *yyt1; char *yyt2; + }; diff --git a/doc/manual/features/includes/includes.rst_ b/doc/manual/features/includes/includes.rst_ new file mode 100644 index 00000000..53542db9 --- /dev/null +++ b/doc/manual/features/includes/includes.rst_ @@ -0,0 +1,33 @@ +Re2c can include other files using the directive ``/*!include:re2c FILE */``, +where ``FILE`` is the name of the file to be included. Re2c looks for included +files in the directory of the including file and in include locations, which +can be specified with the ``-I`` option. +The re2c include directive works in the same way as C/C++ ``#include``: the contents +of ``FILE`` are copy-pasted verbatim in place of the directive. Include files +may have further includes of their own. +Re2c provides some predefined include files that can be found in the +``include/`` subdirectory of the project. These files contain definitions that +can be useful to other projects (such as Unicode categories) and form something +like a standard library for re2c. +Here is an example of using include files: + +.. 
code-block:: cpp + + // definitions.re + /*!re2c + alpha = [a-zA-Z]; + digit = [0-9]; + */ + + // main.re + /*!include:re2c "definitions.re" */ + int lex(const char *YYCURSOR) + { + const char *YYMARKER; + /*!re2c + alpha { return 1; } + digit { return 2; } + * { return 0; } + */ + } + diff --git a/doc/manual/options/options_list.rst b/doc/manual/options/options_list.rst index a72d3feb..a5a397f3 100644 --- a/doc/manual/options/options_list.rst +++ b/doc/manual/options/options_list.rst @@ -1,172 +1,222 @@ ``-? -h --help`` Show help message. +``-1 --single-pass`` + Deprecated. Does nothing (single pass is the default now). + +``-8 --utf-8`` + Generate a lexer that reads input in UTF-8 encoding. + re2c assumes that character range is 0 -- 0x10FFFF and character size is + 1 byte. + ``-b --bit-vectors`` Optimize conditional jumps using bit masks. Implies ``-s``. ``-c --conditions --start-conditions`` - Enable support of Flex-like "conditions": multiple interrelated lexers within one block. - Option ``--start-conditions`` is a legacy alias; use ``--conditions`` instead. + Enable support of Flex-like "conditions": multiple interrelated lexers + within one block. Option ``--start-conditions`` is a legacy alias; use + ``--conditions`` instead. -``-d --debug-output`` - Emit ``YYDEBUG`` in the generated code. - ``YYDEBUG`` should be defined by the user in the form of a void function with two parameters: - ``state`` (lexer state or -1) and ``symbol`` (current input symbol of type ``YYCTYPE``). +``--case-insensitive`` + Treat single-quoted and double-quoted strings as case-insensitive. + +``--case-inverted`` + Invert the meaning of single-quoted and double-quoted strings: + treat single-quoted strings as case-sensitive and double-quoted strings + as case-insensitive. ``-D --emit-dot`` - Instead of normal output generate lexer graph in DOT format. - The output can be converted to PNG with the help of Graphviz (something like ``dot -Tpng -odfa.png dfa.dot``). 
- Note that large graphs may crash Graphviz. + Instead of normal output generate lexer graph in .dot format. + The output can be converted to an image with the help of Graphviz + (e.g. something like ``dot -Tpng -odfa.png dfa.dot``). -``-e --ecb`` - Generate a lexer that reads input in EBCDIC encoding. - ``re2c`` assumes that character range is 0 -- 0xFF an character size is 1 byte. +``-d --debug-output`` + Emit ``YYDEBUG`` in the generated code. + ``YYDEBUG`` should be defined by the user in the form of a void function + with two parameters: ``state`` (lexer state or -1) and ``symbol`` (current + input symbol of type ``YYCTYPE``). -``-f --storable-state`` - Generate a lexer which can store its inner state. - This is useful in push-model lexers which are stopped by an outer program when there is not enough input, - and then resumed when more input becomes available. - In this mode users should additionally define - ``YYGETSTATE ()`` and ``YYSETSTATE (state)`` macros - and variables ``yych``, ``yyaccept`` and the ``state`` as part of the lexer state. +``--dfa-minimization `` + The internal algorithm used by re2c to minimize the DFA: ``moore`` (the + default) is Moore algorithm, and ``table`` is the "table filling" algorithm. + Both algorithms should produce the same DFA up to states relabeling; table + filling is simpler and much slower and serves as a reference implementation. -``-F --flex-syntax`` - Partial support for Flex syntax: - in this mode named definitions don't need the equal sign and the terminating semicolon, - and when used they must be surrounded by curly braces. - Names without curly braces are treated as double-quoted strings. +``--dump-adfa`` + Debug option: output DFA after tunneling (in .dot format). -``-g --computed-gotos`` - Optimize conditional jumps using non-standard "computed goto" extension (must be supported by C/C++ compiler). - ``re2c`` generates jump tables only in complex cases with a lot of conditional branches. 
- Complexity threshold can be configured with ``cgoto:threshold`` configuration. - This option implies ``-b``. +``--dump-cfg`` + Debug option: output control flow graph of tag variables (in .dot format). -``-i --no-debug-info`` - Do not output ``#line`` information. - This is useful when the generated code is tracked by some version control system. +``--dump-closure-stats`` + Debug option: output statistics on the number of states in closure. -``-o OUTPUT --output=OUTPUT`` - Specify the ``OUTPUT`` file. +``--dump-dfa-det`` + Debug option: output DFA immediately after determinization (in .dot format). -``-r --reusable`` - Allows reuse of ``re2c`` rules with ``/*!rules:re2c */`` and ``/*!use:re2c */`` blocks. - In this mode simple ``/*!re2c */`` blocks are not allowed - and exactly one ``/*!rules:re2c */`` block must be present. - The rules are saved and used by every ``/*!use:re2c */`` block that follows (which may add rules of their own). - This option allows to reuse the same set of rules with different configurations. +``--dump-dfa-min`` + Debug option: output DFA after minimization (in .dot format). -``-s --nested-ifs`` - Use nested ``if`` statements instead of ``switch`` statements in conditional jumps. - This usually results in more efficient code with non-optimizing C/C++ compilers. +``--dump-dfa-tagopt`` + Debug option: output DFA after tag optimizations (in .dot format). -``-t HEADER --type-header=HEADER`` - Generate a ``HEADER`` file that contains enum with condition names. - Requires ``-c`` option. +``--dump-dfa-raw`` + Debug option: output DFA under construction with expanded state-sets + (in .dot format). -``-T --tags`` - Enable submatch extraction with tags. +``--dump-interf`` + Debug option: output interference table produced by liveness analysis of tag + variables. -``-P --posix-captures`` - Enable submatch extraction with POSIX-style capturing groups. +``--dump-nfa`` + Debug option: output NFA (in .dot format). 
-``-u --unicode`` - Generate a lexer that reads input in UTF-32 encoding. - ``re2c`` assumes that character range is 0 -- 0x10FFFF and character size is 4 bytes. - Implies ``-s``. +``-e --ecb`` + Generate a lexer that reads input in EBCDIC encoding. + re2c assumes that character range is 0 -- 0xFF and character size is 1 byte. -``-v --version`` - Show version information. +``--eager-skip`` + Make the generated lexer advance the input position "eagerly": + immediately after reading input symbol. + By default this happens after transition to the next state. + Implied by ``--no-lookahead``. -``-V --vernum`` - Show version information in ``MMmmpp`` format (major, minor, patch). +``--empty-class `` + Define the way re2c treats empty character classes. With ``match-empty`` + (the default) empty class matches empty input (which is illogical, but + backwards-compatible). With ``match-none`` empty class always fails to match. + With ``error`` empty class raises a compilation error. -``-w --wide-chars`` - Generate a lexer that reads input in UCS-2 encoding. - ``re2c`` assumes that character range is 0 -- 0xFFFF and character size is 2 bytes. - Implies ``-s``. +``--encoding-policy `` + Define the way re2c treats Unicode surrogates. + With ``fail`` re2c aborts with an error when a surrogate is encountered. + With ``substitute`` re2c silently replaces surrogates with the error code + point 0xFFFD. With ``ignore`` (the default) re2c treats surrogates as + normal code points. The Unicode standard says that standalone surrogates + are invalid, but real-world libraries and programs behave in different ways. -``-x --utf-16`` - Generate a lexer that reads input in UTF-16 encoding. - ``re2c`` assumes that character range is 0 -- 0x10FFFF and character size is 2 bytes. - Implies ``-s``. +``-f --storable-state`` + Generate a lexer which can store its inner state. 
+ This is useful in push-model lexers which are stopped by an outer program + when there is not enough input, and then resumed when more input becomes + available. In this mode users should additionally define ``YYGETSTATE()`` + and ``YYSETSTATE(state)`` macros and variables ``yych``, ``yyaccept`` + and ``state`` as part of the lexer state. -``-8 --utf-8`` - Generate a lexer that reads input in UTF-8 encoding. - ``re2c`` assumes that character range is 0 -- 0x10FFFF and character size is 1 byte. +``-F --flex-syntax`` + Partial support for Flex syntax: in this mode named definitions don't need + the equal sign and the terminating semicolon, and when used they must be + surrounded by curly braces. Names without curly braces are treated as + double-quoted strings. -``--case-insensitive`` - Treat single-quoted and double-quoted strings as case-insensitive. +``-g --computed-gotos`` + Optimize conditional jumps using non-standard "computed goto" extension + (which must be supported by the C/C++ compiler). re2c generates jump tables + only in complex cases with a lot of conditional branches. Complexity + threshold can be configured with ``cgoto:threshold`` configuration. This + option implies ``-b``. + +``-I PATH`` + Add ``PATH`` to the list of locations which are used when searching for + include files. This option is useful in combination with + ``/*!include:re2c ... */`` directive. Re2c looks for ``FILE`` in the + directory of including file and in the list of include paths specified by + ``-I`` option. -``--case-inverted`` - Invert the meaning of single-quoted and double-quoted strings: - treat single-quoted strings as case-sensitive and double-quoted strings as case-insensitive. +``-i --no-debug-info`` + Do not output ``#line`` information. This is useful when the generated code + is tracked by some version control system or IDE. + +``--input `` + Specify re2c input API. 
+ Option ``default`` is the default API composed of pointer-like primitives + ``YYCURSOR``, ``YYMARKER``, ``YYLIMIT`` etc. + Option ``custom`` is the generic API composed of function-like primitives + ``YYPEEK()``, ``YYSKIP()``, ``YYBACKUP()``, ``YYRESTORE()`` etc. + +``--input-encoding `` + Specify the way re2c parses regular expressions. + With ``ascii`` (the default) re2c handles input as ASCII-encoded: any + sequence of code units is a sequence of standalone 1-byte characters. + With ``utf8`` re2c handles input as UTF8-encoded and recognizes multibyte + characters. + +``--location-format `` + Specify location format in messages. + With ``gnu`` locations are printed as 'filename:line:column: ...'. + With ``msvc`` locations are printed as 'filename(line,column) ...'. + Default is ``gnu``. ``--no-generation-date`` Suppress date output in the generated file. ``--no-lookahead`` Use TDFA(0) instead of TDFA(1). - This option only has effect with ``--tags`` or ``--posix-captures`` options. + This option has effect only with ``--tags`` or ``--posix-captures`` options. ``--no-optimize-tags`` - Suppress optimization of tag variables (useful for debugging or benchmarking). + Suppress optimization of tag variables (useful for debugging). ``--no-version`` Suppress version output in the generated file. -``--encoding-policy POLICY`` - Define the way ``re2c`` treats Unicode surrogates. - ``POLICY`` can be one of the following: ``fail`` (abort with an error when a surrogate is encountered), - ``substitute`` (silently replace surrogates with the error code point 0xFFFD), - ``ignore`` (default, treat surrogates as normal code points). - The Unicode standard says that standalone surrogates are invalid, - but real-world libraries and programs behave in different ways. +``-o OUTPUT --output=OUTPUT`` + Specify the ``OUTPUT`` file. + +``-P --posix-captures`` + Enable submatch extraction with POSIX-style capturing groups. 
+ +``--posix-closure `` + Specify shortest-path algorithm used for construction of epsilon-closure + with POSIX disambiguation semantics: ``gor1`` (the default) stands for + Goldberg-Radzik algorithm, and ``gtop`` stands for "global topological + order" algorithm. -``--input INPUT`` - Specify ``re2c`` input API. ``INPUT`` can be either ``default`` or ``custom`` (enables the use of generic API). +``-r --reusable`` + Allows reuse of re2c rules with ``/*!rules:re2c */`` and ``/*!use:re2c */`` + blocks. Exactly one rules-block must be present. The rules are saved and + used by every use-block that follows, which may add its own rules and + configurations. ``-S --skeleton`` - Ignore user-defined interface code and generate a self-contained "skeleton" program. - Additionally, generate input files with strings derived from the regular grammar - and compressed match results that are used to verify "skeleton" behavior on all inputs. - This option is useful for finding bugs in optimizations and code generation. - -``--empty-class POLICY`` - Define the way ``re2c`` treats empty character classes. - ``POLICY`` can be one of the following: ``match-empty`` (match empty input: illogical, but default behavior for backwards compatibility reasons), - ``match-none`` (fail to match on any input), - ``error`` (compilation error). - -``--dfa-minimization ALGORITHM`` - The internal algorithm used by re2c to minimize the DFA. - ``ALGORITHM`` can be either ``moore`` (Moore algorithm, the default) or ``table`` (table filling algorithm). - Both algorithms should produce the same DFA up to states relabeling; - table filling is much slower and serves as a reference implementation. + Ignore user-defined interface code and generate a self-contained "skeleton" + program. Additionally, generate input files with strings derived from the + regular grammar and compressed match results that are used to verify + "skeleton" behavior on all inputs. 
This option is useful for finding bugs + in optimizations and code generation. -``--eager-skip`` - Make the generated lexer advance the input position "eagerly": - immediately after reading input symbol. - By default this happens after transition to the next state. - Implied by ``--no-lookahead``. +``-s --nested-ifs`` + Use nested ``if`` statements instead of ``switch`` statements in conditional + jumps. This usually results in more efficient code with non-optimizing C/C++ + compilers. -``--dump-nfa`` - Generate representation of NFA in DOT format and dump it on stderr. +``-T --tags`` + Enable submatch extraction with tags. -``--dump-dfa-raw`` - Generate representation of DFA in DOT format under construction and dump it on stderr. +``-t HEADER --type-header=HEADER`` + Generate a ``HEADER`` file that contains enum with condition names. + Requires ``-c`` option. -``--dump-dfa-det`` - Generate representation of DFA in DOT format immediately after determinization and dump it on stderr. +``-u --unicode`` + Generate a lexer that reads UTF32-encoded input. Re2c assumes that character + range is 0 -- 0x10FFFF and character size is 4 bytes. This option implies + ``-s``. -``--dump-dfa-tagopt`` - Generate representation of DFA in DOT format after tag optimizations and dump it on stderr. +``-V --vernum`` + Show version information in ``MMmmpp`` format (major, minor, patch). -``--dump-dfa-min`` - Generate representation of DFA in DOT format after minimization and dump it on stderr. +``--verbose`` + Output a short message in case of success. -``--dump-adfa`` - Generate representation of DFA in DOT format after tunneling and dump it on stderr. +``-v --version`` + Show version information. -``-1 --single-pass`` - Deprecated. Does nothing (single pass is the default now). +``-w --wide-chars`` + Generate a lexer that reads UCS2-encoded input. Re2c assumes that character + range is 0 -- 0xFFFF and character size is 2 bytes. This option implies + ``-s``. 
+ +``-x --utf-16`` + Generate a lexer that reads UTF16-encoded input. Re2c assumes that character + range is 0 -- 0x10FFFF and character size is 2 bytes. This option implies + ``-s``. diff --git a/doc/manual/syntax/configurations.rst_ b/doc/manual/syntax/configurations.rst_ index bacf4095..73f39667 100644 --- a/doc/manual/syntax/configurations.rst_ +++ b/doc/manual/syntax/configurations.rst_ @@ -160,12 +160,6 @@ ``re2c:flags:d`` or ``re2c:flags:debug-output`` Same as ``-d --debug-output`` command-line option. -``re2c:flags:dfa-minimization = 'moore';`` - Same as ``--dfa-minimization`` command-line option. - -``re2c:flags:eager-skip = 0;`` - Same as ``--eager-skip`` command-line option. - ``re2c:flags:e`` or ``re2c:flags:ecb`` Same as ``-e --ecb`` command-line option. @@ -184,18 +178,18 @@ ``re2c:flags:input = 'default';`` Same as ``--input`` command-line option. -``re2c:flags:lookahead = 1;`` - Same as inverted ``--no-lookahead`` command-line option. - -``re2c:flags:optimize-tags = 1;`` - Same as inverted ``--no-optimize-tags`` command-line option. - ``re2c:flags:P`` or ``re2c:flags:posix-captures`` Same as ``-P --posix-captures`` command-line option. ``re2c:flags:s`` or ``re2c:flags:nested-ifs`` Same as ``-s --nested-ifs`` command-line option. +``re2c:flags:o`` or ``re2c:flags:output`` + Same as ``-o --output`` command-line option. + +``re2c:flags:t`` or ``re2c:flags:type-header`` + Same as ``-t --type-header`` command-line option. + ``re2c:flags:T`` or ``re2c:flags:tags`` Same as ``-T --tags`` command-line option.