<p>re2c - convert regular expressions to C/C++</p>
<a name="lbAC" id="lbAC"> </a>
<h2>SYNOPSIS</h2>
-<p><b>re2c</b> [<b>-bdefFghisuvVw1</b>] [<b>-o output</b>] [<b>-c</b> [<b>-t header</b>]] file</p>
+<p><b>re2c</b> [<b>-bdefFghisuvVw18</b>] [<b>-o output</b>] [<b>-c</b> [<b>-t header</b>]] file</p>
<a name="lbAD" id="lbAD"> </a>
<h2>DESCRIPTION</h2>
<p><b>re2c</b> is a preprocessor that generates C-based recognizers from
"dot -Tpng input.dot > output.png". Please note that scanners with many states
may crash dot.<br /><br /></dd>
<dt><b>-e</b></dt>
-<dd>Cross-compile from an ASCII platform to an EBCDIC one.<br /><br /></dd>
+<dd>Generate a parser that supports EBCDIC. The generated code can deal with any
+character up to 0xFF. In this mode re2c assumes that input character size is
+1 byte. This switch is incompatible with <b>-w</b>, <b>-u</b> and <b>-8</b>.<br /><br /></dd>
<dt><b>-f</b></dt>
<dd>Generate a scanner with support for storable state. For details see below
at <b>SCANNER WITH STORABLE STATES</b>.<br /><br /></dd>
<dd>Create a header file that contains types for the (f)lex-like condition support.
This can only be activated when <b>-c</b> is in use.<br /><br /></dd>
<dt><b>-u</b></dt>
-<dd>Generate a parser that supports Unicode chars (UTF-32). This means the
-generated code can deal with any valid Unicode character up to 0x10FFFF. When
-UTF-8 or UTF-16 needs to be supported you need to convert the incoming stream
-to UTF-32 upon input yourself.<br /><br /></dd>
+<dd>Generate a parser that supports UTF-32. The generated code can deal with any
+valid Unicode character up to 0x10FFFF. In this mode re2c assumes that input
+character size is 4 bytes. This switch is incompatible with <b>-e</b>, <b>-w</b> and <b>-8</b>.
+This implies <b>-s</b>.<br /><br /></dd>
<dt><b>-v</b></dt>
<dd>Show version information.<br /><br /></dd>
<dt><b>-V</b></dt>
<dd>Show the version as a number XXYYZZ.<br /><br /></dd>
<dt><b>-w</b></dt>
-<dd>Create a parser that supports wide chars (UCS-2). This implies <b>-s</b>
-and cannot be used together with <b>-e</b> switch.<br /><br /></dd>
+<dd>Generate a parser that supports UCS-2. The generated code can deal with any
+valid Unicode character up to 0xFFFF. In this mode re2c assumes that input
+character size is 2 bytes. This switch is incompatible with <b>-e</b>, <b>-u</b> and <b>-8</b>.
+This implies <b>-s</b>.<br /><br /></dd>
<dt><b>-1</b></dt>
<dd>Force single pass generation, this cannot be combined with -f and disables
YYMAXFILL generation prior to last re2c block.<br /><br /></dd>
+<dt><b>-8</b></dt>
+<dd>Generate a parser that supports UTF-8. The generated code can deal with any
+valid Unicode character up to 0x10FFFF. In this mode re2c assumes that input
+character size is 1 byte. This switch is incompatible with <b>-e</b>, <b>-w</b> and <b>-u</b>.<br /><br /></dd>
<dt><b>--no-generation-date</b></dt>
<dd>Suppress date output in the generated output so that it only shows the re2c
version.<br /><br /></dd>
setup rule and a star'd setup rule is present, than that code will be used
as setup code.
</p>
+<a name="lbAH2" id="lbAH2"> </a>
+<h2>ENCODINGS</h2>
+<p>
+<b>re2c</b> supports the following encodings: ASCII, EBCDIC (<b>-e</b>), UCS-2 (<b>-w</b>),
+UTF-32 (<b>-u</b>) and UTF-8 (<b>-8</b>). ASCII is the default. To select another encoding,
+either pass the command-line flag or use an inplace configuration.
+</p>
+<p>
+ASCII, EBCDIC, UCS-2 and UTF-32 are fixed-length encodings. This means that
+all code units (symbols) are encoded in byte sequences of the same length:
+1 byte in ASCII and EBCDIC, 2 bytes in UCS-2, 4 bytes in UTF-32. This length
+must be equal to the size of <b>YYCTYPE</b>. <b>re2c</b> assumes that one input character
+maps to one encoding symbol.
+</p>
+<p>
+UTF-8 is a variable-length encoding: different Unicode symbols are encoded
+in byte sequences of 1, 2, 3 or 4 bytes. In UTF-8 mode, <b>re2c</b> assumes that
+the size of <b>YYCTYPE</b> is 1 byte. It translates multibyte symbols to sequences of
+1-byte characters. E.g., the 2-byte symbol "\xFF" in UTF-8 is equal to the 2-symbol
+string "\xC3\xBF" in ASCII. Some bytes never occur in a valid UTF-8 stream, e.g.
+byte 0xFF. If the generated scanner must check for invalid input symbols, the
+only way to do so is to use the default rule <b>*</b>. Note that the full range rule
+<b>[^]</b> means something quite different in UTF-8: it matches only valid symbols,
+so it never matches the 0xFF byte.
+</p>
<a name="lbAI" id="lbAI"> </a>
<h2>SCANNER SPECIFICATIONS</h2>
<p>Each scanner specification consists of a set of <i>rules</i>, <i>named
start the code with an opening curly brace or the sequence '<b>:=</b>'. When
the code with a curly brace then <b>re2c</b> counts the brace depth and stops looking
for code automatically. Otherwise curly braces are not allowed and <b>re2c</b> stops
-looking for code at the first line that does not begin with whitespace.</p>
+looking for code at the first line that does not begin with whitespace. If two
+or more rules overlap, the first rule is preferred.</p>
<dl compact="compact">
<dd><i>regular-expression</i> { <i>C/C++ code</i> }</dd>
<dd><i>regular-expression</i> := <i>C/C++ code</i></dd>
</dl>
<p>
+There is one special rule: the default rule <b>*</b>.
+</p>
+<dl compact="compact">
+<dd><b>*</b> { <i>C/C++ code</i> }</dd>
+<dd><b>*</b> := <i>C/C++ code</i></dd>
+</dl>
+<p>
+The former "default" rule <b>[^]</b> differs from <b>*</b>:
+</p>
+<dl compact="compact">
+<dd>- <b>*</b> can occur anywhere a normal rule can occur, but regardless of its position,
+<b>*</b> has the lowest priority.</dd>
+<dd>- <b>[^]</b> matches all valid symbols in the current encoding, while <b>*</b> matches
+any input character, either valid or invalid.</dd>
+<dd>- <b>[^]</b> can consume multiple input characters, while <b>*</b> always consumes
+one input character.</dd>
+</dl>
+<p>
+In fact, when a variable-length encoding is used, <b>*</b> is the only way
+to match an invalid input character.
+</p>
+<p>
If <b>-c</b> is active then each regular expression is preceeded by a list of
comma separated condition names. Besides normal naming rules there are two
special cases. A rule may contain the single condition name '*' and no contition
<dl compact="compact">
<dd><<i>condition-list</i>> <i>regular-expression</i> { <i>C/C++ code</i> }</dd>
<dd><<i>condition-list</i>> <i>regular-expression</i> := <i>C/C++ code</i></dd>
+<dd><<i>condition-list</i>> <b>*</b> { <i>C/C++ code</i> }</dd>
+<dd><<i>condition-list</i>> <b>*</b> := <i>C/C++ code</i></dd>
<dd><<i>condition-list</i>> <i>regular-expression</i> => <i>condition</i> { <i>C/C++ code</i> }</dd>
<dd><<i>condition-list</i>> <i>regular-expression</i> => <i>condition</i> := <i>C/C++ code</i></dd>
<dd><<i>condition-list</i>> <i>regular-expression</i> :=> <i>condition</i></dd>
<dd><<i>*</i>> <i>regular-expression</i> { <i>C/C++ code</i> }</dd>
<dd><<i>*</i>> <i>regular-expression</i> := <i>C/C++ code</i></dd>
+<dd><<i>*</i>> <b>*</b> { <i>C/C++ code</i> }</dd>
+<dd><<i>*</i>> <b>*</b> := <i>C/C++ code</i></dd>
<dd><<i>*</i>> <i>regular-expression</i> => <i>condition</i> { <i>C/C++ code</i> }</dd>
<dd><<i>*</i>> <i>regular-expression</i> => <i>condition</i> := <i>C/C++ code</i></dd>
<dd><<i>*</i>> <i>regular-expression</i> :=> <i>condition</i></dd>
uppercased <b>U</b> and its eight hexadecimal digits. However only in \fB-u\fP
mode the generated code can deal with any valid Unicode character up to
0x10FFFF.</p>
-<p>Since characters greater <b>\X00FF</b> are not allowed in non unicode mode,
-the only portable "<b>any</b>" rules are <b>(.|"\n")</b> and <b>[^]</b>.</p>
+<p>The only portable "<b>any</b>" rule is the default rule <b>*</b>.</p>
<p>The regular expressions listed above are grouped according to precedence,
from highest precedence at the top to lowest at the bottom. Those grouped
together have equal precedence.</p>
can be compiled and actually work.</p>
<a name="lbAM" id="lbAM"> </a>
<h2>FEATURES</h2>
-<p><b>re2c</b> does not provide a default action: the generated code assumes
-that the input will consist of a sequence of tokens. Typically this can be
-dealt with by adding a rule such as the one for unexpected characters in the
-example above.</p>
+<p><b>re2c</b> provides a default action: the default rule <b>*</b>. When the default rule matches,
+exactly one input character is consumed.</p>
<p>The user must arrange for a sentinel token to appear at the end of input
(and provide a rule for matching it): <b>re2c</b> does not provide an
<<EOF>> expression. If the source is from a null-byte terminated
<a name="lbAN" id="lbAN"> </a>
<h2>BUGS</h2>
<p>Difference only works for character sets.</p>
+<p>The generated DFA is not minimal.</p>
+<p>Features that are naturally orthogonal (such as reusable rules, conditions,
+setup rules and default rules) cannot always be combined. E.g., one cannot set a
+setup or default rule for a condition in a scanner with reusable rules.</p>
+<p><b>re2c</b> does too much unnecessary work: e.g., if a /*!use:re2c ... */ block has
+additional rules, these rules are parsed 4 times, while they should be parsed
+only once.</p>
<p>The <b>re2c</b> internal algorithms need documentation.</p>
<a name="lbAO" id="lbAO"> </a>
<h2>SEE ALSO</h2>
-<p>flex(1), lex(1). More information on <b>re2c</b> can be found here:
+<p>flex(1), lex(1), quex(<b><a href="http://quex.sourceforge.net/">http://quex.sourceforge.net/</a></b>). More information on <b>re2c</b> can be found here:
<b><a href=
"http://re2c.org/">http://re2c.org/</a></b></p>
<a name="lbAP" id="lbAP"> </a>
\*(re \- convert \*(rxs to C/C++
.SH SYNOPSIS
-\*(re [\fB-bdDefFghisuvVw1\fP] [\fB-o output\fP] [\fB-c\fP [\fB-t header\fP]] \fBfile\fP
+\*(re [\fB-bdDefFghisuvVw18\fP] [\fB-o output\fP] [\fB-c\fP [\fB-t header\fP]] \fBfile\fP
.SH DESCRIPTION
\*(re is a preprocessor that generates C-based recognizers from regular
may crash dot.
.TP
\fB-e\fP
-Cross-compile from an ASCII platform to an EBCDIC one.
+Generate a parser that supports EBCDIC. The generated code can deal with any
+character up to 0xFF. In this mode \*(re assumes that input character size is
+1 byte. This switch is incompatible with \fB-w\fP, \fB-u\fP and \fB-8\fP.
.TP
\fB-f\fP
Generate a scanner with support for storable state.
This can only be activated when \fB-c\fP is in use.
.TP
\fB-u\fP
-Generate a parser that supports Unicode chars (UTF-32). This means the
-generated code can deal with any valid Unicode character up to 0x10FFFF. When
-UTF-8 or UTF-16 needs to be supported you need to convert the incoming stream
-to UTF-32 upon input yourself.
+Generate a parser that supports UTF-32. The generated code can deal with any
+valid Unicode character up to 0x10FFFF. In this mode \*(re assumes that input
+character size is 4 bytes. This switch is incompatible with \fB-e\fP, \fB-w\fP and \fB-8\fP.
+This implies \fB-s\fP.
.TP
\fB-v\fP
Show version information.
Show the version as a number XXYYZZ.
.TP
\fB-w\fP
-Create a parser that supports wide chars (UCS-2). This implies \fB-s\fP and
-cannot be used together with \fB-e\fP switch.
+Generate a parser that supports UCS-2. The generated code can deal with any
+valid Unicode character up to 0xFFFF. In this mode \*(re assumes that input
+character size is 2 bytes. This switch is incompatible with \fB-e\fP, \fB-u\fP and \fB-8\fP.
+This implies \fB-s\fP.
.TP
\fB-1\fP
Force single pass generation, this cannot be combined with -f and disables
YYMAXFILL generation prior to last \*(re block.
.TP
+\fB-8\fP
+Generate a parser that supports UTF-8. The generated code can deal with any
+valid Unicode character up to 0x10FFFF. In this mode \*(re assumes that input
+character size is 1 byte. This switch is incompatible with \fB-e\fP, \fB-w\fP and \fB-u\fP.
+.TP
\fB--no-generation-date\fP
Suppress date output in the generated output so that it only shows the re2c
version.
setup rule and a star'd setup rule is present, than that code will be used
as setup code.
+.SH "ENCODINGS"
+\*(re supports the following encodings: ASCII, EBCDIC (\fB-e\fP), UCS-2 (\fB-w\fP),
+UTF-32 (\fB-u\fP) and UTF-8 (\fB-8\fP). ASCII is the default. To select another encoding,
+either pass the command-line flag or use an inplace configuration.
+.LP
+ASCII, EBCDIC, UCS-2 and UTF-32 are fixed-length encodings. This means that
+all code units (symbols) are encoded in byte sequences of the same length:
+1 byte in ASCII and EBCDIC, 2 bytes in UCS-2, 4 bytes in UTF-32. This length
+must be equal to the size of YYCTYPE. \*(re assumes that one input character
+maps to one encoding symbol.
+.LP
+UTF-8 is a variable-length encoding: different Unicode symbols are encoded
+in byte sequences of 1, 2, 3 or 4 bytes. In UTF-8 mode, \*(re assumes that
+the size of YYCTYPE is 1 byte. It translates multibyte symbols to sequences of
+1-byte characters. E.g., the 2-byte symbol "\\xFF" in UTF-8 is equal to the 2-symbol
+string "\\xC3\\xBF" in ASCII. Some bytes never occur in a valid UTF-8 stream, e.g.
+byte 0xFF. If the generated scanner must check for invalid input symbols, the
+only way to do so is to use the default rule \fB*\fP. Note that the full range rule
+\fB[^]\fP means something quite different in UTF-8: it matches only valid symbols,
+so it never matches the 0xFF byte.
+
.SH "SCANNER SPECIFICATIONS"
Each scanner specification consists of a set of \fIrules\fP, \fInamed
definitions\fP and \fIconfigurations\fP.
start the code with an opening curly brace or the sequence '\fB:=\fP'. When
the code with a curly brace then \*(re counts the brace depth and stops looking
for code automatically. Otherwise curly braces are not allowed and \*(re stops
-looking for code at the first line that does not begin with whitespace.
+looking for code at the first line that does not begin with whitespace. If two
+or more rules overlap, the first rule is preferred.
.P
.RS
\fI\*(rx\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
\fI\*(rx\fP \fC:=\fP \fIC/C++ code\fP
.RE
.P
+There is one special rule: the default rule \fB*\fP.
+.P
+.RS
+\fI*\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
+.P
+\fI*\fP \fC:=\fP \fIC/C++ code\fP
+.RE
+.P
+The former "default" rule \fB[^]\fP differs from \fB*\fP:
+.P
+.RS
+- \fB*\fP can occur anywhere a normal rule can occur, but regardless of its position,
+\fB*\fP has the lowest priority.
+.P
+- \fB[^]\fP matches all valid symbols in the current encoding, while \fB*\fP matches
+any input character, either valid or invalid.
+.P
+- \fB[^]\fP can consume multiple input characters, while \fB*\fP always consumes
+one input character.
+.RE
+.P
+In fact, when a variable-length encoding is used, \fB*\fP is the only way
+to match an invalid input character.
+.LP
If \fB-c\fP is active then each \*(rx is preceeded by a list of
comma separated condition names. Besides normal naming rules there are two
special cases. A rule may contain the single condition name '*' and no contition
.P
\fC<\fP\fIcondition-list\fP\fC>\fP \fI\*(rx\fP \fC:=\fP \fIC/C++ code\fP
.P
+\fC<\fP\fIcondition-list\fP\fC>\fP \fI*\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
+.P
+\fC<\fP\fIcondition-list\fP\fC>\fP \fI*\fP \fC:=\fP \fIC/C++ code\fP
+.P
\fC<\fP\fIcondition-list\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
.P
\fC<\fP\fIcondition-list\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC:=\fP \fIC/C++ code\fP
.P
\fC<\fP\fI*\fP\fC>\fP \fI\*(rx\fP \fC:=\fP \fIC/C++ code\fP
.P
+\fC<\fP\fI*\fP\fC>\fP \fI*\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
+.P
+\fC<\fP\fI*\fP\fC>\fP \fI*\fP \fC:=\fP \fIC/C++ code\fP
+.P
\fC<\fP\fI*\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC{\fP \fIC/C++ code\fP \fC}\fP
.P
\fC<\fP\fI*\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC:=\fP \fIC/C++ code\fP
\fBU\fP and its eight hexadecimal digits. However only in \fB-u\fP mode the
generated code can deal with any valid Unicode character up to 0x10FFFF.
.LP
-Since characters greater \fB\\X00FF\fP are not allowed in non unicode mode, the
-only portable "\fBany\fP" rules are \fB(.|"\\n")\fP and \fB[^]\fP.
+The only portable "\fBany\fP" rule is the default rule \fB*\fP.
.LP
The \*(rxs listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
.SH FEATURES
.LP
-\*(re does not provide a default action:
-the generated code assumes that the input
-will consist of a sequence of tokens.
-Typically this can be dealt with by adding a rule such as the one for
-unexpected characters in the example above.
+\*(re provides a default action: the default rule \fB*\fP. When the default rule matches,
+exactly one input character is consumed.
.LP
The user must arrange for a sentinel token to appear at the end of input
(and provide a rule for matching it):
.SH BUGS
.LP
-Difference only works for character sets.
+Difference only works for character sets, and not in UTF-8 mode.
+.LP
+The generated DFA is not minimal.
+.LP
+Features that are naturally orthogonal (such as reusable rules, conditions,
+setup rules and default rules) cannot always be combined. E.g., one cannot set a
+setup or default rule for a condition in a scanner with reusable rules.
+.LP
+\*(re does too much unnecessary work: e.g., if a /*!use:re2c ... */ block has
+additional rules, these rules are parsed 4 times, while they should be parsed
+only once.
.LP
The \*(re internal algorithms need documentation.
.SH "SEE ALSO"
.LP
-flex(1), lex(1).
+flex(1), lex(1), quex(
+.PD 0
+.B http://quex.sourceforge.net
+.PD 1
+).
.P
More information on \*(re can be found here:
.PD 0