From 3f2cfa5cd7c97e2b43d41c97e7e96586804e3830 Mon Sep 17 00:00:00 2001
From: Ulya Fokanova
Date: Thu, 23 Jan 2014 13:18:40 +0300
Subject: [PATCH] Updated documentation.

---
 re2c/htdocs/manual.html.in |  95 ++++++++++++++++++++++++++------
 re2c/re2c.1.in             | 108 ++++++++++++++++++++++++++++++-------
 2 files changed, 169 insertions(+), 34 deletions(-)

diff --git a/re2c/htdocs/manual.html.in b/re2c/htdocs/manual.html.in
index db3d1c6e..0483db47 100755
--- a/re2c/htdocs/manual.html.in
+++ b/re2c/htdocs/manual.html.in
@@ -15,7 +15,7 @@ Updated: @PACKAGE_DATE@

re2c - convert regular expressions to C/C++

 

SYNOPSIS

-

re2c [-bdefFghisuvVw1] [-o output] [-c [-t header]] file

+

re2c [-bdefFghisuvVw18] [-o output] [-c [-t header]] file

 

DESCRIPTION

re2c is a preprocessor that generates C-based recognizers from @@ -105,7 +105,9 @@ YYDEBUG(int state, char current). The first parameter receives the state or "dot -Tpng input.dot > output.png". Please note that scanners with many states may crash dot.

-e
-
Cross-compile from an ASCII platform to an EBCDIC one.

+
Generate a parser that supports EBCDIC. The generated code can deal with any +character up to 0xFF. In this mode re2c assumes that input character size is +1 byte. This switch is incompatible with -w, -u and -8.

-f
Generate a scanner with support for storable state. For details see below at SCANNER WITH STORABLE STATES.

@@ -144,20 +146,26 @@ generate better code.

Create a header file that contains types for the (f)lex-like condition support. This can only be activated when -c is in use.

-u
-
Generate a parser that supports Unicode chars (UTF-32). This means the -generated code can deal with any valid Unicode character up to 0x10FFFF. When -UTF-8 or UTF-16 needs to be supported you need to convert the incoming stream -to UTF-32 upon input yourself.

+
Generate a parser that supports UTF-32. The generated code can deal with any +valid Unicode character up to 0x10FFFF. In this mode re2c assumes that input +character size is 4 bytes. This switch is incompatible with -e, -w and -8. +This implies -s.

-v
Show version information.

-V
Show the version as a number XXYYZZ.

-w
-
Create a parser that supports wide chars (UCS-2). This implies -s -and cannot be used together with -e switch.

+
Generate a parser that supports UCS-2. The generated code can deal with any +valid Unicode character up to 0xFFFF. In this mode re2c assumes that input +character size is 2 bytes. This switch is incompatible with -e, -u and -8. +This implies -s.

-1
Force single pass generation, this cannot be combined with -f and disables YYMAXFILL generation prior to last re2c block.

+
-8
+
Generate a parser that supports UTF-8. The generated code can deal with any +valid Unicode character up to 0x10FFFF. In this mode re2c assumes that input +character size is 1 byte. This switch is incompatible with -e, -w and -u.

--no-generation-date
Suppress date output in the generated output so that it only shows the re2c version.

@@ -327,6 +335,30 @@ When re2c generates the code for a rule whose state does not have a setup rule and a star'd setup rule is present, than that code will be used as setup code.

+  +

ENCODINGS

+

+re2c supports the following encodings: ASCII, EBCDIC (-e), UCS-2 (-w), +UTF-32 (-u) and UTF-8 (-8). ASCII is the default. You can either pass the command-line +flag or use an inplace configuration. +

+

+ASCII, EBCDIC, UCS-2 and UTF-32 are fixed-length encodings. This means that +all code units (symbols) are encoded in byte sequences of the same length: +1 byte in ASCII and EBCDIC, 2 bytes in UCS-2, 4 bytes in UTF-32. This length +must be equal to the size of YYCTYPE. re2c assumes that one input character +maps to one encoding symbol. +

+

+UTF-8 is a variable-length encoding: different Unicode symbols are encoded +in byte sequences of 1, 2, 3 or 4 bytes. In UTF-8 mode, re2c assumes that the +size of YYCTYPE is 1 byte. It translates multibyte symbols to sequences of +1-byte characters. E.g., the 2-byte symbol "\xFF" in UTF-8 is equal to the 2-symbol +string "\xC3\xBF" in ASCII. Some bytes never occur in a valid UTF-8 stream, e.g. +byte 0xFF. If the generated scanner must check for invalid input symbols, the +only way to do so is to use the default rule *. Note that the full range rule +[^] means something quite different in UTF-8: it excludes the 0xFF byte. +
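
The byte-level claim in the ENCODINGS paragraph (the symbol "\xFF" becomes the two bytes 0xC3 0xBF in UTF-8) can be checked with a minimal sketch in C. The function name utf8_encode is hypothetical and is not part of re2c or of this patch; it only illustrates the standard UTF-8 bit layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of re2c: encode one Unicode code point
 * as UTF-8. Returns the number of bytes written to out (1..4), or 0 if
 * cp is a surrogate or lies above 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx 10xxxxxx x3 */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

Calling utf8_encode(0xFF, buf) writes 0xC3 0xBF and returns 2. No branch can produce a lead byte above 0xF4, which is why a byte such as 0xFF never appears in valid UTF-8 output.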

 

SCANNER SPECIFICATIONS

Each scanner specification consists of a set of rules, named @@ -337,12 +369,35 @@ matched. You can either start the code with an opening curly brace or the sequence ':='. When the code with a curly brace then re2c counts the brace depth and stops looking for code automatically. Otherwise curly braces are not allowed and re2c stops -looking for code at the first line that does not begin with whitespace.

+looking for code at the first line that does not begin with whitespace. If two +or more rules overlap, the first rule is preferred.

regular-expression { C/C++ code }
regular-expression := C/C++ code

+There is one special rule: the default rule *. +

+
+
* { C/C++ code }
+
* := C/C++ code
+
+

+The former "default" rule [^] differs from *: +

+
+
- * can occur anywhere a normal rule can occur, but regardless of its position, +* has the lowest priority. +
- [^] matches all valid symbols in the current encoding, while * matches +any input character, either valid or invalid. +
- [^] can consume multiple input characters, while * always consumes +one input character. +
+

+In fact, when a variable-length encoding is used, * is the only way +to match an invalid input character. +
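
The difference between [^] and * in UTF-8 mode can be illustrated with a hand-written sketch in C. This is not code generated by re2c, and the names match_valid_symbol and match_default are hypothetical. It assumes strict well-formed UTF-8 as defined by the Unicode standard; the patch does not state whether [^] accepts exactly this set.

```c
#include <stddef.h>

static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Models what [^] is described as matching: one valid symbol, which may
 * span 1 to 4 input bytes. Returns the sequence length, or 0 if the
 * input at s does not start with a well-formed sequence. */
size_t match_valid_symbol(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;   /* bounds for the second byte */
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;
    if (in_range(s[0], 0xC2, 0xDF)) {
        len = 2;
    } else if (in_range(s[0], 0xE0, 0xEF)) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;      /* reject overlong forms */
        if (s[0] == 0xED) hi = 0x9F;      /* reject surrogates */
    } else if (in_range(s[0], 0xF0, 0xF4)) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;      /* reject overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;      /* stay at or below 0x10FFFF */
    } else {
        return 0;                         /* e.g. 0xFF: never a lead byte */
    }
    if (n < len) return 0;
    if (!in_range(s[1], lo, hi)) return 0;
    for (i = 2; i < len; ++i)
        if (!in_range(s[i], 0x80, 0xBF)) return 0;
    return len;
}

/* Models the default rule *: always consumes exactly one input
 * character (one byte here), whether it is valid or not. */
size_t match_default(const unsigned char *s, size_t n)
{
    (void)s;
    return n > 0 ? 1 : 0;
}
```

On the input bytes 0xC3 0xBF, match_valid_symbol consumes 2 bytes while match_default consumes 1. On a lone 0xFF byte, match_valid_symbol fails and only match_default makes progress, which mirrors why * is needed to catch invalid input.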

+

If -c is active then each regular expression is preceeded by a list of comma separated condition names. Besides normal naming rules there are two special cases. A rule may contain the single condition name '*' and no contition @@ -360,11 +415,15 @@ jumps) you can doso by using <! pseudo-rules.

<condition-list> regular-expression { C/C++ code }
<condition-list> regular-expression := C/C++ code
+
<condition-list> * { C/C++ code }
+
<condition-list> * := C/C++ code
<condition-list> regular-expression => condition { C/C++ code }
<condition-list> regular-expression => condition := C/C++ code
<condition-list> regular-expression :=> condition
<*> regular-expression { C/C++ code }
<*> regular-expression := C/C++ code
+
<*> * { C/C++ code }
+
<*> * := C/C++ code
<*> regular-expression => condition { C/C++ code }
<*> regular-expression => condition := C/C++ code
<*> regular-expression :=> condition
@@ -457,8 +516,7 @@ followed by either a lowercased u and its four hexadecimal digits or an uppercased U and its eight hexadecimal digits. However only in \fB-u\fP mode the generated code can deal with any valid Unicode character up to 0x10FFFF.

-

Since characters greater \X00FF are not allowed in non unicode mode, -the only portable "any" rules are (.|"\n") and [^].

+

The only portable "any" rule is the default rule *.

The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence.

@@ -633,10 +691,8 @@ lessons to get you started with re2c. All examples in the lessons subdirectory can be compiled and actually work.

 

FEATURES

-

re2c does not provide a default action: the generated code assumes -that the input will consist of a sequence of tokens. Typically this can be -dealt with by adding a rule such as the one for unexpected characters in the -example above.

+

re2c provides a default action: *. When the default rule matches, +exactly one input character is consumed.

The user must arrange for a sentinel token to appear at the end of input (and provide a rule for matching it): re2c does not provide an <<EOF>> expression. If the source is from a null-byte terminated @@ -649,10 +705,17 @@ else then e detection of end of data/file.

 

BUGS

Difference only works for character sets.

+

The generated DFA is not minimal.

+

Features that are naturally orthogonal (such as reusable rules, conditions, +setup rules and default rules) cannot always be combined. E.g., one cannot set +a setup/default rule for a condition in a scanner with reusable rules.

+

re2c does too much unnecessary work: e.g., if a /*!use:re2c ... */ block has +additional rules, these rules are parsed 4 times, while they should be parsed +only once.

The re2c internal algorithms need documentation.

 

SEE ALSO

-

flex(1), lex(1). More information on re2c can be found here: +

flex(1), lex(1), quex(http://quex.sourceforge.net/). More information on re2c can be found here: http://re2c.org/

  diff --git a/re2c/re2c.1.in b/re2c/re2c.1.in index 8dc17ebe..45685e0d 100644 --- a/re2c/re2c.1.in +++ b/re2c/re2c.1.in @@ -11,7 +11,7 @@ \*(re \- convert \*(rxs to C/C++ .SH SYNOPSIS -\*(re [\fB-bdDefFghisuvVw1\fP] [\fB-o output\fP] [\fB-c\fP [\fB-t header\fP]] \fBfile\fP +\*(re [\fB-bdDefFghisuvVw18\fP] [\fB-o output\fP] [\fB-c\fP [\fB-t header\fP]] \fBfile\fP .SH DESCRIPTION \*(re is a preprocessor that generates C-based recognizers from regular @@ -115,7 +115,9 @@ Emit Graphviz dot data. It can then be processed with e.g. may crash dot. .TP \fB-e\fP -Cross-compile from an ASCII platform to an EBCDIC one. +Generate a parser that supports EBCDIC. The generated code can deal with any +character up to 0xFF. In this mode \*(re assumes that input character size is +1 byte. This switch is incompatible with \fB-w\fP, \fB-u\fP and \fB-8\fP. .TP \fB-f\fP Generate a scanner with support for storable state. @@ -163,10 +165,10 @@ Create a header file that contains types for the (f)lex-like condition support. This can only be activated when \fB-c\fP is in use. .TP \fB-u\fP -Generate a parser that supports Unicode chars (UTF-32). This means the -generated code can deal with any valid Unicode character up to 0x10FFFF. When -UTF-8 or UTF-16 needs to be supported you need to convert the incoming stream -to UTF-32 upon input yourself. +Generate a parser that supports UTF-32. The generated code can deal with any +valid Unicode character up to 0x10FFFF. In this mode \*(re assumes that input +character size is 4 bytes. This switch is incompatible with \fB-e\fP, \fB-w\fP and \fB-8\fP. +This implies \fB-s\fP. .TP \fB-v\fP Show version information. @@ -175,13 +177,20 @@ Show version information. Show the version as a number XXYYZZ. .TP \fB-w\fP -Create a parser that supports wide chars (UCS-2). This implies \fB-s\fP and -cannot be used together with \fB-e\fP switch. +Generate a parser that supports UCS-2. The generated code can deal with any +valid Unicode character up to 0xFFFF. 
In this mode \*(re assumes that input +character size is 2 bytes. This switch is incompatible with \fB-e\fP, \fB-u\fP and \fB-8\fP. +This implies \fB-s\fP. .TP \fB-1\fP Force single pass generation, this cannot be combined with -f and disables YYMAXFILL generation prior to last \*(re block. .TP +\fB-8\fP +Generate a parser that supports UTF-8. The generated code can deal with any +valid Unicode character up to 0x10FFFF. In this mode \*(re assumes that input +character size is 1 byte. This switch is incompatible with \fB-e\fP, \fB-w\fP and \fB-u\fP. +.TP \fB--no-generation-date\fP Suppress date output in the generated output so that it only shows the re2c version. @@ -373,6 +382,26 @@ When \*(re generates the code for a rule whose state does not have a setup rule and a star'd setup rule is present, than that code will be used as setup code. +.SH "ENCODINGS" +\*(re supports the following encodings: ASCII, EBCDIC (\fB-e\fP), UCS-2 (\fB-w\fP), +UTF-32 (\fB-u\fP) and UTF-8 (\fB-8\fP). ASCII is the default. You can either pass the command-line +flag or use an inplace configuration. +.LP +ASCII, EBCDIC, UCS-2 and UTF-32 are fixed-length encodings. This means that +all code units (symbols) are encoded in byte sequences of the same length: +1 byte in ASCII and EBCDIC, 2 bytes in UCS-2, 4 bytes in UTF-32. This length +must be equal to the size of YYCTYPE. \*(re assumes that one input character +maps to one encoding symbol. +.LP +UTF-8 is a variable-length encoding: different Unicode symbols are encoded +in byte sequences of 1, 2, 3 or 4 bytes. In UTF-8 mode, \*(re assumes that the +size of YYCTYPE is 1 byte. It translates multibyte symbols to sequences of +1-byte characters. E.g., the 2-byte symbol "\\xFF" in UTF-8 is equal to the 2-symbol +string "\\xC3\\xBF" in ASCII. Some bytes never occur in a valid UTF-8 stream, e.g. +byte 0xFF. If the generated scanner must check for invalid input symbols, the +only way to do so is to use the default rule \fB*\fP. 
Note that the full range rule +\fB[^]\fP means something quite different in UTF-8: it excludes the 0xFF byte. + .SH "SCANNER SPECIFICATIONS" Each scanner specification consists of a set of \fIrules\fP, \fInamed definitions\fP and \fIconfigurations\fP. @@ -382,7 +411,8 @@ is to be executed when the associated \fI\*(rx\fP is matched. You can either start the code with an opening curly brace or the sequence '\fB:=\fP'. When the code with a curly brace then \*(re counts the brace depth and stops looking for code automatically. Otherwise curly braces are not allowed and \*(re stops -looking for code at the first line that does not begin with whitespace. +looking for code at the first line that does not begin with whitespace. If two +or more rules overlap, the first rule is preferred. .P .RS \fI\*(rx\fP \fC{\fP \fIC/C++ code\fP \fC}\fP @@ -390,6 +420,30 @@ looking for code at the first line that does not begin with whitespace. \fI\*(rx\fP \fC:=\fP \fIC/C++ code\fP .RE .P +There is one special rule: the default rule \fB*\fP. +.P +.RS +\fI*\fP \fC{\fP \fIC/C++ code\fP \fC}\fP +.P +\fI*\fP \fC:=\fP \fIC/C++ code\fP +.RE +.P +The former "default" rule \fB[^]\fP differs from \fB*\fP: +.P +.RS +- \fB*\fP can occur anywhere a normal rule can occur, but regardless of its position, +\fB*\fP has the lowest priority. +.P +- \fB[^]\fP matches all valid symbols in the current encoding, while \fB*\fP matches +any input character, either valid or invalid. +.P +- \fB[^]\fP can consume multiple input characters, while \fB*\fP always consumes +one input character. +.RE +.P +In fact, when a variable-length encoding is used, \fB*\fP is the only way +to match an invalid input character. +.LP If \fB-c\fP is active then each \*(rx is preceeded by a list of comma separated condition names. Besides normal naming rules there are two special cases. 
A rule may contain the single condition name '*' and no contition @@ -409,6 +463,10 @@ jumps) you can doso by using \fP \fI\*(rx\fP \fC:=\fP \fIC/C++ code\fP .P +\fC<\fP\fIcondition-list\fP\fC>\fP \fI*\fP \fC{\fP \fIC/C++ code\fP \fC}\fP +.P +\fC<\fP\fIcondition-list\fP\fC>\fP \fI*\fP \fC:=\fP \fIC/C++ code\fP +.P \fC<\fP\fIcondition-list\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC{\fP \fIC/C++ code\fP \fC}\fP .P \fC<\fP\fIcondition-list\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC:=\fP \fIC/C++ code\fP @@ -419,6 +477,10 @@ jumps) you can doso by using \fP \fI\*(rx\fP \fC:=\fP \fIC/C++ code\fP .P +\fC<\fP\fI*\fP\fC>\fP \fI*\fP \fC{\fP \fIC/C++ code\fP \fC}\fP +.P +\fC<\fP\fI*\fP\fC>\fP \fI*\fP \fC:=\fP \fIC/C++ code\fP +.P \fC<\fP\fI*\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC{\fP \fIC/C++ code\fP \fC}\fP .P \fC<\fP\fI*\fP\fC>\fP \fI\*(rx\fP \fC=>\fP \fP\fIcondition\fP \fC:=\fP \fIC/C++ code\fP @@ -549,8 +611,7 @@ by either a lowercased \fBu\fP and its four hexadecimal digits or an uppercased \fBU\fP and its eight hexadecimal digits. However only in \fB-u\fP mode the generated code can deal with any valid Unicode character up to 0x10FFFF. .LP -Since characters greater \fB\\X00FF\fP are not allowed in non unicode mode, the -only portable "\fBany\fP" rules are \fB(.|"\\n")\fP and \fB[^]\fP. +The only portable "\fBany\fP" rule is the default rule \fB*\fP. .LP The \*(rxs listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. @@ -774,11 +835,8 @@ can be compiled and actually work. .SH FEATURES .LP -\*(re does not provide a default action: -the generated code assumes that the input -will consist of a sequence of tokens. -Typically this can be dealt with by adding a rule such as the one for -unexpected characters in the example above. +\*(re provides a default action: \fB*\fP. When the default rule matches, +exactly one input character is consumed. 
.LP The user must arrange for a sentinel token to appear at the end of input (and provide a rule for matching it): @@ -793,13 +851,27 @@ else then e detection of end of data/file. .SH BUGS .LP -Difference only works for character sets. +Difference only works for character sets, and not in UTF-8 mode. +.LP +The generated DFA is not minimal. +.LP +Features that are naturally orthogonal (such as reusable rules, conditions, +setup rules and default rules) cannot always be combined. E.g., one cannot set +a setup/default rule for a condition in a scanner with reusable rules. +.LP +\*(re does too much unnecessary work: e.g., if a /*!use:re2c ... */ block has +additional rules, these rules are parsed 4 times, while they should be parsed +only once. .LP The \*(re internal algorithms need documentation. .SH "SEE ALSO" .LP -flex(1), lex(1), quex( +.PD 0 +.B http://quex.sourceforge.net +.PD 1 +). .P More information on \*(re can be found here: .PD 0 -- 2.40.0