From 3f2cfa5cd7c97e2b43d41c97e7e96586804e3830 Mon Sep 17 00:00:00 2001
From: Ulya Fokanova
re2c - convert regular expressions to C/C++
-re2c [-bdefFghisuvVw1] [-o output] [-c [-t header]] file
+re2c [-bdefFghisuvVw18] [-o output] [-c [-t header]] file
re2c is a preprocessor that generates C-based recognizers from
@@ -105,7 +105,9 @@ YYDEBUG(int state, char current). The first parameter receives the state or
"dot -Tpng input.dot > output.png". Please note that scanners with many states
may crash dot.
SYNOPSIS
-DESCRIPTION
+re2c supports the following encodings: ASCII, EBCDIC (-e), UCS-2 (-w),
+UTF-32 (-u) and UTF-8 (-8). ASCII is default. You can either pass cmd
+flag or use inplace configuration.
+
+
+ASCII, EBCDIC, UCS-2 and UTF-32 are fixed-length encodings. This means, that
+all code units (symbols) are encoded in byte sequences of the same length:
+1 byte in ASCII and EBCDIC, 2 bytes in UCS-2, 4 bytes in UTF-32. This length
+must be equal to the size of YYCTYPE. re2c assumes that one input character
+maps to one encoding symbol.
+
+
+UTF-8 is a variable-length encoding: different Unicode symbols are encoded
+in byte sequences of 1, 2, 3 or 4 bytes. In UTF-8 mode, re2c assumes that
+size of YYCTYPE is 1 byte. It translates multibyte symbols to sequences of
+1-byte characters. E.g., 2-byte symbol "\xFF" in UTF-8 is equal to 2-symbol
+string "\xC3\xBF" in ASCII. Some bytes never occur in valid UTF-8 stream, e.g.
+byte 0xFF. If the generated scanner must check for invalid input symbols, the
+only way to do so is to use default rule *. Note, that full range rule
+[^] means quite different thing in UTF-8, excluding 0xFF byte.
+
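The byte counts in the UTF-8 paragraph can be checked with a small, self-contained C sketch. The helper name utf8_encode is hypothetical and is not part of re2c; it only illustrates how a code point up to 0x10FFFF maps to 1, 2, 3 or 4 bytes, and why byte 0xFF cannot occur in a valid stream (the largest possible leading byte is 0xF4).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of re2c): encode one Unicode code point
 * as UTF-8. Writes 1 to 4 bytes into out and returns the byte count.
 * Returns 0 for surrogates (0xD800..0xDFFF) and values above 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                          /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                         /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                       /* 3 bytes: 1110xxxx + 2 trailing */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                         /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                     /* 4 bytes: 11110xxx + 3 trailing */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                 /* beyond the Unicode range */
}
```

Calling utf8_encode(0xFF, buf) returns 2 and fills buf with 0xC3, 0xBF, which is the two-symbol string mentioned in the paragraph.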
Each scanner specification consists of a set of rules, named
@@ -337,12 +369,35 @@ matched. You can either start the code with an opening curly brace or the sequence ':='. When the code with a curly brace then re2c counts the brace depth and stops looking for code automatically. Otherwise curly braces are not allowed and re2c stops
-looking for code at the first line that does not begin with whitespace.
+looking for code at the first line that does not begin with whitespace. If two
+or more rules overlap, the first rule is preferred.
+There is one special rule: default rule *.
+
+
+The former "default" rule [^] differs from *:
+
+
+In fact, when variable-length encoding is used, * is the only possible way
+to match invalid input character.
+
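As a rough illustration of how [^] and * differ in UTF-8 mode, the following hand-written C sketch models how many bytes one scanner step would consume if both rules were present. It is not code generated by re2c: utf8_valid_len and step are hypothetical helpers, and the notion of validity used here (the well-formed byte ranges from the Unicode standard) is an assumption that may differ in detail from what re2c accepts.

```c
#include <stddef.h>

/* Returns nonzero if lo <= b <= hi. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Hypothetical helper: length (1..4) of the well-formed UTF-8 sequence
 * that starts at s, or 0 if the first n bytes do not start with one. */
size_t utf8_valid_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] <= 0x7F)
        return 1;
    if (in_range(s[0], 0xC2, 0xDF)) {
        len = 2;
    } else if (in_range(s[0], 0xE0, 0xEF)) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;      /* reject overlong forms */
        if (s[0] == 0xED) hi = 0x9F;      /* reject surrogates */
    } else if (in_range(s[0], 0xF0, 0xF4)) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;      /* reject overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;      /* stay within 0x10FFFF */
    } else {
        return 0;                         /* e.g. 0x80..0xC1, 0xF5..0xFF */
    }
    if (n < len || !in_range(s[1], lo, hi))
        return 0;
    for (i = 2; i < len; ++i)
        if (!in_range(s[i], 0x80, 0xBF))
            return 0;
    return len;
}

/* Model of one scanner step with both rules present: a valid symbol is
 * consumed whole (as [^] would), anything else falls through to the
 * lowest-priority rule * and exactly one byte is consumed. */
size_t step(const unsigned char *s, size_t n)
{
    size_t len;

    if (n == 0)
        return 0;                         /* nothing left to consume */
    len = utf8_valid_len(s, n);
    return len ? len : 1;
}
```

For the input bytes 0xC3 0xBF 0xFF 0x41, successive calls to step consume 2, 1 and 1 bytes: the stray 0xFF is the case that only the default rule could match.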
+If -c is active then each regular expression is preceeded by a list of comma separated condition names. Besides normal naming rules there are two special cases. A rule may contain the single condition name '*' and no contition
@@ -360,11 +415,15 @@ jumps) you can doso by using <! pseudo-rules.
Since characters greater \X00FF are not allowed in non unicode mode,
-the only portable "any" rules are (.|"\n") and [^].
+The only portable "any" rule is the default rule *.
The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence.
@@ -633,10 +691,8 @@ lessons to get you started with re2c. All examples in the lessons subdirectory can be compiled and actually work.
re2c does not provide a default action: the generated code assumes
-that the input will consist of a sequence of tokens. Typically this can be
-dealt with by adding a rule such as the one for unexpected characters in the
-example above.
+re2c provides default action: *. When the default rule matches,
+exactly one input character is consumed.
The user must arrange for a sentinel token to appear at the end of input (and provide a rule for matching it): re2c does not provide an <<EOF>> expression. If the source is from a null-byte terminated
@@ -649,10 +705,17 @@ else then e detection of end of data/file.
Difference only works for character sets.
+The generated DFA is not minimal.
+Features, that are naturally orthogonal (such as reusable rules, conditions,
+setup rules and default rules), cannot always be combined. E.g., one cannot set
+setup/default rule for condition in scanner with reusable rules.
+re2c does too much unnecessary work: e.g., if /*!use:re2c ... */ block has
+additional rules, these rules are parsed 4 times, while they should be parsed
+only once.
The re2c internal algorithms need documentation.
flex(1), lex(1). More information on re2c can be found here: +
flex(1), lex(1), quex(http://quex.sourceforge.net/). More information on re2c can be found here: http://re2c.org/
diff --git a/re2c/re2c.1.in b/re2c/re2c.1.in
index 8dc17ebe..45685e0d 100644
--- a/re2c/re2c.1.in
+++ b/re2c/re2c.1.in
@@ -11,7 +11,7 @@
 \*(re \- convert \*(rxs to C/C++

 .SH SYNOPSIS
-\*(re [\fB-bdDefFghisuvVw1\fP] [\fB-o output\fP] [\fB-c\fP [\fB-t header\fP]] \fBfile\fP
+\*(re [\fB-bdDefFghisuvVw18\fP] [\fB-o output\fP] [\fB-c\fP [\fB-t header\fP]] \fBfile\fP

 .SH DESCRIPTION
 \*(re is a preprocessor that generates C-based recognizers from regular
@@ -115,7 +115,9 @@ Emit Graphviz dot data. It can then be processed with e.g.
 may crash dot.
 .TP
 \fB-e\fP
-Cross-compile from an ASCII platform to an EBCDIC one.
+Generate a parser that supports EBCDIC. The generated code can deal with any
+character up to 0xFF. In this mode \*(re assumes that input character size is
+1 byte. This switch is incompatible with \fB-w\fP, \fB-u\fP and \fB-8\fP.
 .TP
 \fB-f\fP
 Generate a scanner with support for storable state.
@@ -163,10 +165,10 @@ Create a header file that contains types for the (f)lex-like condition support.
 This can only be activated when \fB-c\fP is in use.
 .TP
 \fB-u\fP
-Generate a parser that supports Unicode chars (UTF-32). This means the
-generated code can deal with any valid Unicode character up to 0x10FFFF. When
-UTF-8 or UTF-16 needs to be supported you need to convert the incoming stream
-to UTF-32 upon input yourself.
+Generate a parser that supports UTF-32. The generated code can deal with any
+valid Unicode character up to 0x10FFFF. In this mode \*(re assumes that input
+character size is 4 bytes. This switch is incompatible with \fB-e\fP, \fB-w\fP and \fB-8\fP.
+This implies \fB-s\fP.
 .TP
 \fB-v\fP
 Show version information.
@@ -175,13 +177,20 @@ Show version information.
 Show the version as a number XXYYZZ.
 .TP
 \fB-w\fP
-Create a parser that supports wide chars (UCS-2). This implies \fB-s\fP and
-cannot be used together with \fB-e\fP switch.
+Generate a parser that supports UCS-2. The generated code can deal with any
+valid Unicode character up to 0xFFFF. In this mode \*(re assumes that input
+character size is 2 bytes. This switch is incompatible with \fB-e\fP, \fB-u\fP and \fB-8\fP.
+This implies \fB-s\fP.
 .TP
 \fB-1\fP
 Force single pass generation, this cannot be combined with -f and disables
 YYMAXFILL generation prior to last \*(re block.
 .TP
+\fB-8\fP
+Generate a parser that supports UTF-8. The generated code can deal with any
+valid Unicode character up to 0x10FFFF. In this mode \*(re assumes that input
+character size is 1 byte. This switch is incompatible with \fB-e\fP, \fB-w\fP and \fB-u\fP.
+.TP
 \fB--no-generation-date\fP
 Suppress date output in the generated output so that it only shows the re2c
 version.
@@ -373,6 +382,26 @@ When \*(re generates the code for a rule whose state does not have a setup
 rule and a star'd setup rule is present, than that code will be used as setup
 code.

+.SH "ENCODINGS"
+\*(re supports the following encodings: ASCII, EBCDIC (\fB-e\fP), UCS-2 (\fB-w\fP),
+UTF-32 (\fB-u\fP) and UTF-8 (\fB-8\fP). ASCII is default. You can either pass cmd
+flag or use inplace configuration.
+.LP
+ASCII, EBCDIC, UCS-2 and UTF-32 are fixed-length encodings. This means, that
+all code units (symbols) are encoded in byte sequences of the same length:
+1 byte in ASCII and EBCDIC, 2 bytes in UCS-2, 4 bytes in UTF-32. This length
+must be equal to the size of YYCTYPE. \*(re assumes that one input character
+maps to one encoding symbol.
+.LP
+UTF-8 is a variable-length encoding: different Unicode symbols are encoded
+in byte sequences of 1, 2, 3 or 4 bytes. In UTF-8 mode, \*(re assumes that
+size of YYCTYPE is 1 byte. It translates multibyte symbols to sequences of
+1-byte characters. E.g., 2-byte symbol "\\xFF" in UTF-8 is equal to 2-symbol
+string "\\xC3\\xBF" in ASCII. Some bytes never occur in valid UTF-8 stream, e.g.
+byte 0xFF. If the generated scanner must check for invalid input symbols, the
+only way to do so is to use default rule \fB*\fP. Note, that full range rule
+\fB[^]\fP means quite different thing in UTF-8, excluding 0xFF byte.
+
 .SH "SCANNER SPECIFICATIONS"
 Each scanner specification consists of a set of \fIrules\fP, \fInamed
 definitions\fP and \fIconfigurations\fP.
@@ -382,7 +411,8 @@ is to be executed when the associated \fI\*(rx\fP is matched. You can either
 start the code with an opening curly brace or the sequence '\fB:=\fP'. When the
 code with a curly brace then \*(re counts the brace depth and stops looking for
 code automatically. Otherwise curly braces are not allowed and \*(re stops
-looking for code at the first line that does not begin with whitespace.
+looking for code at the first line that does not begin with whitespace. If two
+or more rules overlap, the first rule is preferred.
 .P
 .RS
 \fI\*(rx\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
@@ -390,6 +420,30 @@ looking for code at the first line that does not begin with whitespace.
 \fI\*(rx\fP \fC:=\fP \fIC/C++ code\fP
 .RE
 .P
+There is one special rule: default rule \fB*\fP.
+.P
+.RS
+\fI*\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
+.P
+\fI*\fP \fC:=\fP \fIC/C++ code\fP
+.RE
+.P
+The former "default" rule \fB[^]\fP differs from \fB*\fP:
+.P
+.RS
+- \fB*\fP can occur anywhere a normal rule can occur, but regardless to its place,
+\fB*\fP has the lowest priority.
+.P
+- \fB[^]\fP matches all valid symbols in current encoding, while \fB*\fP matches
+any input character, either valid or invalid.
+.P
+- \fB[^]\fP can consume multiple input characters, while \fB*\fP always consumes
+one input character.
+.RE
+.P
+In fact, when variable-length encoding is used, \fB*\fP is the only possible way
+to match invalid input character.
+.LP
 If \fB-c\fP is active then each \*(rx is preceeded by a list of
 comma separated condition names. Besides normal naming rules there are two
 special cases. A rule may contain the single condition name '*' and no contition
@@ -409,6 +463,10 @@ jumps) you can doso by using
 \fP \fI\*(rx\fP \fC:=\fP \fIC/C++ code\fP
 .P
+\fC<\fP\fIcondition-list\fP\fC>\fP \fI*\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
+.P
+\fC<\fP\fIcondition-list\fP\fC>\fP \fI*\fP \fC:=\fP \fIC/C++ code\fP
+.P
 \fC<\fP\fIcondition-list\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
 .P
 \fC<\fP\fIcondition-list\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC:=\fP \fIC/C++ code\fP
@@ -419,6 +477,10 @@ jumps) you can doso by using
 \fP \fI\*(rx\fP \fC:=\fP \fIC/C++ code\fP
 .P
+\fC<\fP\fI*\fP\fC>\fP \fI*\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
+.P
+\fC<\fP\fI*\fP\fC>\fP \fI*\fP \fC:=\fP \fIC/C++ code\fP
+.P
 \fC<\fP\fI*\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
 .P
 \fC<\fP\fI*\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC:=\fP \fIC/C++ code\fP
@@ -549,8 +611,7 @@ by either a lowercased \fBu\fP and its four hexadecimal digits or an uppercased
 \fBU\fP and its eight hexadecimal digits. However only in \fB-u\fP mode the
 generated code can deal with any valid Unicode character up to 0x10FFFF.
 .LP
-Since characters greater \fB\\X00FF\fP are not allowed in non unicode mode, the
-only portable "\fBany\fP" rules are \fB(.|"\\n")\fP and \fB[^]\fP.
+The only portable "\fBany\fP" rule is the default rule \fB*\fP.
 .LP
 The \*(rxs listed above are grouped according to precedence, from highest
 precedence at the top to lowest at the bottom.
@@ -774,11 +835,8 @@ can be compiled and actually work.

 .SH FEATURES
 .LP
-\*(re does not provide a default action:
-the generated code assumes that the input
-will consist of a sequence of tokens.
-Typically this can be dealt with by adding a rule such as the one for
-unexpected characters in the example above.
+\*(re provides default action: \fB*\fP. When the default rule matches,
+exactly one input character is consumed.
 .LP
 The user must arrange for a sentinel token to appear at the end of input
 (and provide a rule for matching it):
@@ -793,13 +851,27 @@ else then e detection of end of data/file.

 .SH BUGS
 .LP
-Difference only works for character sets.
+Difference only works for character sets, and not in UTF-8 mode.
+.LP
+The generated DFA is not minimal.
+.LP
+Features, that are naturally orthogonal (such as reusable rules, conditions,
+setup rules and default rules), cannot always be combined. E.g., one cannot set
+setup/default rule for condition in scanner with reusable rules.
+.LP
+\*(re does too much unnecessary work: e.g., if /*!use:re2c ... */ block has
+additional rules, these rules are parsed 4 times, while they should be parsed
+only once.
 .LP
 The \*(re internal algorithms need documentation.

 .SH "SEE ALSO"
 .LP
-flex(1), lex(1).
+flex(1), lex(1), quex(
+.PD 0
+.B http://quex.sourceforge.net
+.PD 1
+).
 .P
 More information on \*(re can be found here:
 .PD 0
--
2.40.0