This example is very simple, yet practical.
We assume that the input is small (fits in one continuous block of memory).
-We also assume that some characters never occur in well-formed input (but may occur in ill-formed input).
+We also assume that some characters never occur in well-formed input (but may occur in input that's ill-formed).
This is often the case in simple real-world tasks such as parsing program options,
-converting strings to numbers, determining binary file type based on magic in the first few bytes,
-efficiently switching on a string, and many others.
+converting strings to numbers, determining binary file types based on some magic in the first few bytes, or
+efficiently switching on a string.
Our example program simply loops over its command-line arguments
and tries to match each argument against one of four patterns:
binary, octal, decimal, and hexadecimal integer literals.
A couple of things should be noted:
* The default case (when none of the rules matched) is handled properly with the ``*`` rule (line 16).
- **Never forget to handle the default case, otherwise control flow in the lexer will be undefined for some input strings.**
+ **Never forget to handle the default case; otherwise, control flow in the lexer will be undefined for some input strings.**
Use the `[-Wundefined-control-flow] <../manual/warnings/undefined_control_flow/wundefined_control_flow.html>`_ re2c warning:
it will warn you about the unhandled default case and show the input patterns that are not covered by the rules.
* ``YYMARKER`` (line 5) is needed because rules overlap:
it backs up the input position of the longest successful match.
- Say, we have overlapping rules ``"a"`` and ``"abc"`` and input string ``"abd"``:
- by the time ``"a"`` matches there's still a chance to match ``"abc"``,
+ Imagine we have overlapping rules ``"a"`` and ``"abc"`` and input string ``"abd"``:
+ by the time ``"a"`` matches, there's still a chance to match ``"abc"``,
but when the lexer sees ``'d'``, it must roll back.
(You might wonder why ``YYMARKER`` is exposed at all: why not make it a local variable like ``yych``?
The reason is, all input pointers must be updated by ``YYFILL``
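
The backup-and-rollback behaviour described above can be sketched by hand.
This is not re2c output: ``match_a_or_abc`` and the local ``marker`` are hypothetical names,
and the function only covers the two overlapping rules ``"a"`` and ``"abc"`` on a NUL-terminated string.

```cpp
#include <cassert>
#include <cstddef>

// Hand-written sketch (not generated by re2c) of matching the overlapping
// rules "a" and "abc". Returns the length of the longest match at the start
// of `s`, or 0 if neither rule matches.
static std::size_t match_a_or_abc(const char *s)
{
    assert(s != 0);
    const char *cursor = s;

    if (*cursor != 'a') return 0;    // default case: nothing matched
    ++cursor;

    // Rule "a" has matched. Remember this position (the role YYMARKER plays),
    // because the longer rule "abc" may still fail further on.
    const char *marker = cursor;

    if (*cursor == 'b') {
        ++cursor;
        if (*cursor == 'c') {
            ++cursor;
            return static_cast<std::size_t>(cursor - s);    // "abc" matched
        }
    }

    // The longer rule failed (e.g. on 'd' in "abd"): roll back to the backup.
    cursor = marker;
    return static_cast<std::size_t>(cursor - s);            // "a" matched
}
```

On ``"abd"`` the function reads ``'a'``, ``'b'``, then ``'d'``, and falls back to the position saved after ``'a'``.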
This time we cannot use ``\0`` as a sentinel: input strings such as ``"aha\0ha"`` are perfectly valid,
but ill-formed strings such as ``"aha\0`` are also possible and shouldn't crash the lexer.
Any other character cannot be used for the same reason
-(including quotes: each type of strings can contain quotes of the opposite type).
+(including quotes: each type of string may contain quotes of the opposite type).
By default, re2c-generated lexers use the following approach to check for the end of the input buffer:
they assume that ``YYLIMIT`` is a pointer to the end of the input buffer, and they check by simply comparing ``YYCURSOR`` and ``YYLIMIT``.
-The obvious way is to check on each input character (before advancing to the next character), but that's very slow.
+The obvious way to accomplish this is to check each input character (before advancing to the next one), but that's very slow.
Instead, re2c inserts checks only at certain points in the generated program.
Each check ensures that there is enough input to proceed until the next check.
If the check fails, the lexer calls ``YYFILL(n)``, which may either supply at least ``n`` characters or stop:
``if ((YYLIMIT - YYCURSOR) < n) YYFILL(n);``
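
A minimal sketch of this check, assuming a tiny fixed-size buffer that is topped up from an in-memory string.
The ``Input`` struct and the ``init``, ``fill``, and ``need`` helpers are hypothetical names used for illustration only.
A real ``YYFILL`` usually has to preserve everything from the start of the current lexeme, which this simplified version does not do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical input state; `cursor` and `limit` play the roles of
// YYCURSOR and YYLIMIT.
struct Input {
    char buf[8];             // deliberately tiny, to force refills
    char *cursor;
    char *limit;
    const char *src;         // data that has not been copied into buf yet
    std::size_t src_left;
};

static void init(Input &in, const char *s)
{
    assert(s != 0);
    in.cursor = in.limit = in.buf;
    in.src = s;
    in.src_left = std::strlen(s);
}

// Stand-in for YYFILL(n): move the unread characters to the front of the
// buffer and top it up from the source. Returns false if fewer than n
// characters can be made available.
static bool fill(Input &in, std::size_t n)
{
    std::size_t unread = static_cast<std::size_t>(in.limit - in.cursor);
    std::memmove(in.buf, in.cursor, unread);
    std::size_t room = sizeof(in.buf) - unread;
    std::size_t take = in.src_left < room ? in.src_left : room;
    std::memcpy(in.buf + unread, in.src, take);
    in.src += take;
    in.src_left -= take;
    in.cursor = in.buf;
    in.limit = in.buf + unread + take;
    return unread + take >= n;
}

// The check itself: refill only when fewer than n characters are left.
static bool need(Input &in, std::size_t n)
{
    if (static_cast<std::size_t>(in.limit - in.cursor) < n) return fill(in, n);
    return true;
}
```

The point of the design is that ``need`` is called once per checkpoint with a suitable ``n``, not once per character.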
For those interested in the internal re2c algorithm used to determine checkpoints,
-here is a quote from the original paper
+here is a quotation from the original paper
:download:`"RE2C: a more versatile scanner generator" <../about/1994_bumbulis_cowan_re2c_a_more_versatile_scanner_generator.pdf>`
-by Peter Bumbulis, Donald D. Cowan, 1994, ACM Letters on Programming Languages and Systems (LOPLAS):
+*by Peter Bumbulis, Donald D. Cowan, 1994, ACM Letters on Programming Languages and Systems (LOPLAS)*:
*A set of key states can be determined by discovering the strongly-connected components (SCCs) of the
DFA. An SCC is a maximal subset of states such that there exists a path from any state in the subset to any
The end of input is a special case: as explained in the `Recognizing strings: the need for YYMAXFILL <example_02.html>`_ example,
the input must be padded with ``YYMAXFILL`` fake characters.
-In this case ``YYLIMIT`` must point at the end of the padding:
+In this case, ``YYLIMIT`` must point at the end of the padding:
.. code-block:: bash
* There is only one successful way out (line 60): the lexer must recognize a standalone
"end of input" lexeme (``NULL``) exactly at the beginning of the padding.
- ``YYFILL`` failure is an error: if the input was correct, the lexer should have already stopped.
+ A ``YYFILL`` failure is an error: if the input was correct, the lexer should have already stopped.
* ``YYFILL`` may fail for two reasons:
either there is no more input (line 23),
or the lexeme is too long: it occupies the whole buffer and nothing can be discarded (line 27).
- We treat both cases in the same way (as error), but a real-world program might handle them differently
+ We treat both cases in the same way (as an error), but a real-world program might handle them differently
(resize the buffer, cut the long lexeme in two, etc.).
* ``@@`` in the ``YYFILL`` definition (line 52) is a formal parameter: re2c substitutes it with the actual argument to ``YYFILL``.
* Each condition is a standalone lexer (DFA).
* Each condition has a unique identifier: ``/*!types:re2c*/`` tells re2c to generate
- an enumeration of all identifiers (names are prefixed with ``yyc`` by default).
+ an enumeration of all the identifiers (the names are prefixed with ``yyc`` by default).
The lexer uses ``YYGETCONDITION`` to get the identifier of the current condition
and ``YYSETCONDITION`` to set it.
* Conditions are connected: transitions are allowed between the final states of one condition
and the start state of another condition (but not between inner states of different conditions).
The generated code starts with dispatch.
- Actions can either jump to the initial dispatch or jump directly to any condition.
+ Actions can either jump to the initial dispatch or jump directly to a condition.
* The ``<*>`` rule is merged to all conditions (low priority).
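
The dispatch on the current condition can be pictured with a hand-written sketch.
It is not what re2c generates: the ``Cond`` enumeration, its members, and ``parse`` are hypothetical,
and a local variable stands in for ``YYGETCONDITION`` and ``YYSETCONDITION``.

```cpp
#include <cassert>

// Hypothetical condition identifiers (re2c prefixes the generated names
// with `yyc` by default).
enum Cond { yycinit, yycdec, yychex };

// Parses a decimal or 0x-prefixed lowercase hexadecimal literal.
// Returns the value, or -1 on ill-formed input.
static long parse(const char *s)
{
    assert(s != 0);
    Cond cond = yycinit;    // read: YYGETCONDITION, write: YYSETCONDITION
    long value = 0;

    for (;;) {
        switch (cond) {     // initial dispatch on the current condition
        case yycinit:
            if (s[0] == '0' && s[1] == 'x') { s += 2; cond = yychex; }
            else { cond = yycdec; }
            break;
        case yycdec:
            if (*s >= '0' && *s <= '9') value = value * 10 + (*s++ - '0');
            else return *s == '\0' ? value : -1;
            break;
        case yychex:
            if (*s >= '0' && *s <= '9') value = value * 16 + (*s++ - '0');
            else if (*s >= 'a' && *s <= 'f') value = value * 16 + (*s++ - 'a' + 10);
            else return *s == '\0' ? value : -1;
            break;
        }
    }
}
```

Each ``case`` behaves like a standalone lexer, and control moves between them only by changing ``cond``.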
This example is about encoding support in re2c.
It's a partial decoder from Grade-1 (uncontracted) Unicode English Braille to plain English.
-The input may be encoded in UTF-8, UTF-16, UTF-32 or UCS-2:
+The input may be encoded in UTF-8, UTF-16, UTF-32, or UCS-2:
all of these encodings are capable of representing Braille patterns (code points ``[0x2800 - 0x28ff]``).
-We use ``-r`` option to reuse the same block of re2c rules with different encodings.
+We use the ``-r`` option to reuse the same block of re2c rules with different encodings.
So. The hardest part is to get some input.
Here is a message out of the void:
.. include:: 06_braille.utf8.txt
It appears to be UTF-8 encoded :download:`[06_braille.utf8.txt] <06_braille.utf8.txt>`.
-Convert it into UTF-16, UTF-32 or UCS-2:
+Let's convert it into UTF-16, UTF-32, and UCS-2:
.. code-block:: bash
And the input is ready.
Grade-1 Braille is quite simple (compared to Grade-2 Braille).
-Patterns map directly to symbols (letters, digits and punctuators) except for a couple of special patterns:
-numeric mode indicator (⠼), letter mode indicator (⠰), capital letter (⠠)
-and some other, which we omit for simplicity (as well as a few ambiguous punctuation patterns).
-Grade-2 Braille allows contractions; they obey complex rules (like those of a natural language)
+Patterns map directly to symbols (letters, digits, and punctuators) except for a couple of special patterns:
+the numeric mode indicator (⠼), the letter mode indicator (⠰), the capital letter indicator (⠠)
+and some others, which we omit here for the sake of simplicity (as well as a few ambiguous punctuation patterns).
+Grade-2 Braille allows contractions; those obey some rather complex rules (like those of a natural language)
and are much harder to implement.
:download:`[06_braille.re] <06_braille.re.txt>`
Notes:
-* Reuse mode is enabled with ``-r`` option.
-* In reuse mode re2c expects a single ``/*!rules:re2c ... */`` block
+* The reuse mode is enabled with the ``-r`` option.
+* In the reuse mode, re2c expects a single ``/*!rules:re2c ... */`` block
followed by multiple ``/*!use:re2c ... */`` blocks.
- All blocks can have their own configurations, definitions and rules.
-* Encoding can be enabled either with command-line option or with configuration.
-* Each encoding needs an appropriate code unit type (``YYCTYPE``).
+ All blocks can have their own configurations, definitions, and rules.
+* Encoding can be enabled either with a command-line option or a configuration.
+* Each encoding needs the appropriate code unit type (``YYCTYPE``).
* We use conditions to switch between numeric and normal modes.
-Generate, compile and run:
+Generate, compile, and run:
.. code-block:: bash
-C++98 lexer
------------
+A C++98 lexer
+-------------
This is an example of a big, real-world re2c program: a C++98 lexer.
-It conforms to the C++98 standard (except for a couple of hacks to simulate the preprocessor).
+It conforms to the C++98 standard (except for a couple of hacks that simulate the preprocessor).
All nontrivial lexemes (integers, floating-point constants, strings, and character literals)
are parsed (not only recognized): numeric literals are converted to numbers, and strings are unescaped.
-Some additional checks described in standard (e.g., overflows in integer literals) are also done.
+Some additional checks described in the standard (e.g., overflows in integer literals) are also done.
In fact, C++ is an easy language to lex: unlike in many other languages, the C++98 lexer can proceed without feedback from the parser.
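
As an illustration of such a check, here is a minimal sketch of converting a decimal literal while detecting overflow.
It is not taken from ``07_cxx98.re``: ``parse_dec`` is a hypothetical helper, and it uses ``unsigned long`` as the target type.

```cpp
#include <cassert>
#include <climits>

// Converts a string of decimal digits to a number. Returns false if the
// string is empty, contains a non-digit, or does not fit into unsigned long.
static bool parse_dec(const char *s, unsigned long &out)
{
    assert(s != 0);
    if (*s == '\0') return false;

    unsigned long value = 0;
    for (; *s != '\0'; ++s) {
        if (*s < '0' || *s > '9') return false;
        unsigned long digit = static_cast<unsigned long>(*s - '0');
        // value * 10 + digit would exceed ULONG_MAX: report overflow
        // before it can happen.
        if (value > (ULONG_MAX - digit) / 10) return false;
        value = value * 10 + digit;
    }
    out = value;
    return true;
}
```

The overflow test is done before the multiplication, so the arithmetic itself never wraps around.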
:download:`[07_cxx98.re] <07_cxx98.re.txt>`
recognize numeric literals (which are further parsed by a bunch of auxiliary lexers),
and recognize the start of a string and character literals (which are further recognized and parsed by an auxiliary lexer).
Numeric literals are thus lexed twice: this approach may be deemed inefficient,
- but it takes much more effort to validate and parse them at once.
+ but it takes much more effort to validate and parse them in one go.
Besides, a real-world lexer would rather recognize ill-formed lexemes (e.g., overflown integer literals),
report them, and resume lexing.
-* We don't use re2c in cases when a hand-written parser looks simpler: when parsing octal and decimal literals
+* We don't use re2c in cases where a hand-written parser looks simpler: when parsing octal and decimal literals
(though a re2c-based parser would do exactly the same, without the slightest overhead).
However, hexadecimal literals still require some lexing, which looks better with re2c.
Again, it's only a matter of taste: a re2c-based implementation adds no overhead.
(they are guarded by ``in.tok`` on the left and ``in.lim`` on the right).
* The hardest part is (unsurprisingly) floating-point literals.
- They are just as hard to lex as to use. ``:)``
+ They are just as hard to lex as they are to use. ``:)``
Generate, compile, and run:
========
All examples are written in C++-98.
-`Do let me know <skvadik@gmail.com>`_ if you notice any obvious lies and errors.
-You can find more examples in subdirectory ``examples`` of the ``re2c`` distribution.
+`Do let me know <skvadik@gmail.com>`_ if you notice any obvious inaccuracies or errors.
+You can find more examples in the ``examples`` subdirectory in the ``re2c`` distribution tree.
.. toctree::
:maxdepth: 1
Parsing integers (multiple re2c blocks) <example_04>
Parsing integers (conditions) <example_05>
Braille patterns (encodings) <example_06>
- C++98 lexer <example_07>
+ A C++98 lexer <example_07>
Its key features are:
* Very fast lexers: the generated code is as good as a carefully tuned, hand-crafted C/C++ lexer.
- It's because re2c generates minimalistic, hard-coded state machines
+ This is because re2c generates minimalistic, hard-coded state machines
(as opposed to full-featured table-based lexers).
* Flexible API: one can `configure <manual/syntax/syntax.html#configurations>`_
Programmers can adjust their lexer to a particular input model,
avoid unnecessary overhead (drop useless runtime checks, do in-place lexing, etc.),
and make all sorts of hacks.
- `Examples <examples/examples.html>`_ cover many real-world cases and shed some light on the dark corners of the re2c API.
+ The `examples <examples/examples.html>`_ cover many real-world cases and shed some light on the dark corners of the re2c API.
* Efficient `Unicode support <manual/features/encodings/encodings.html>`_
(code points are compiled into executable finite-state machines).
----------------------
* `PHP <http://php.net/>`_ (general-purpose scripting language)
-* `ninja <https://ninja-build.org/>`_ (a small build system with a focus on speed)
+* `ninja <https://ninja-build.org/>`_ (small build system with a focus on speed)
* `yasm <http://yasm.tortall.net/>`_ (assembler)
* `SpamAssassin <https://spamassassin.apache.org/>`_ (anti-spam platform)
* `BRL-CAD <http://brlcad.org/>`_ (cross-platform solid modeling system)
* ... last but not least, `re2c <http://re2c.org>`_
-This list is by no means complete;
-these are only the best-known and open source projects.
+This list is by no means complete.
+These are only the best-known and open source projects.
are merged to all conditions (note that they have a lower priority than
other rules of that condition). And second, the empty condition list
allows to provide a code block that does not have a scanner part,
-meaning it does not allow any regular expression. The condition value
+meaning it does not allow any regular expressions. The condition value
referring to this special block is always the one with the enumeration
value 0. This way the code of this special rule can be used to
initialize a scanner. It is in no way necessary to have these rules: but
transition rules. Besides generating calls for the
``YYSETCONDITION`` define, no other special code is generated.
-There is another kind of special rules that allows to prepend code to any
+There is another kind of special rule that allows prepending code to any
code block of all rules of a certain set of conditions or to all code
-blocks to all rules. This can be helpful when some operation is common
+blocks of all rules. This can be helpful when some operation is common
among rules. For instance, this can be used to store the length of the
scanned string. These special setup rules start with an exclamation mark
followed by either a list of conditions ``<! condition, ... >`` or a star
``<!*>``. When ``re2c`` generates the code for a rule whose state does not have a
-setup rule and a starred setup rule is present, that code will be
+setup rule and a starred setup rule is present, the starred setup code will be
used as setup code.
* UTF-8 is a variable-length encoding. Its code space includes all
Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
- code point is represented with sequence of one, two, three, or four
- 1-byte code units. Size of ``YYCTYPE`` must be 1 byte.
+ code point is represented with a sequence of one, two, three, or four
+ 1-byte code units. The size of ``YYCTYPE`` must be 1 byte.
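
The number of code units per code point follows directly from the ranges above.
A minimal sketch, assuming a hypothetical helper ``utf8_units`` that returns 0 for values that are not valid code points:

```cpp
#include <cassert>

// Returns how many 1-byte code units UTF-8 needs for code point `cp`,
// or 0 if `cp` is a surrogate or lies outside the Unicode code space.
static int utf8_units(unsigned long cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;    // surrogates are not valid
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;                                      // beyond 0x10FFFF
}
```

For example, every Braille pattern in ``[0x2800 - 0x28ff]`` takes three code units in UTF-8.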
In Unicode, values from range 0xD800 to 0xDFFF (surrogates) are not
valid Unicode code points. Any encoded sequence of code units that
would map to Unicode code points in the range 0xD800-0xDFFF, is
ill-formed. The user can control how ``re2c`` treats such ill-formed
-sequences with the ``--encoding-policy <policy>`` flag.
+sequences with the ``--encoding-policy <policy>`` switch.
For some encodings, there are code units that never occur in a valid
encoded stream (e.g., 0xFF byte in UTF-8). If the generated scanner must
:hidden:
When the ``-f`` flag is specified, ``re2c`` generates a scanner that can
-store its current state, return to the caller, and later resume
+store its current state, return to its caller, and later resume
operations exactly where it left off.
-The default operation of ``re2c`` is a
-"pull" model, where the scanner asks for extra input whenever it needs it. However, this mode of operation assumes that the scanner is the "owner"
-the parsing loop, and that may not always be convenient.
+The default mode of operation in ``re2c`` is a
+"pull" model, where the scanner asks for extra input whenever it needs it. However, this mode of operation assumes that the scanner is the "owner" of the parsing loop, and that may not always be convenient.
Typically, if there is a preprocessor ahead of the scanner in the
-stream, or for that matter any other procedural source of data, the
-scanner cannot "ask" for more data unless both scanner and source
-live in a separate threads.
+stream, or for that matter, any other procedural source of data, the
+scanner cannot "ask" for more data unless both the scanner and the source
+live in separate threads.
-The ``-f`` flag is useful for just this situation: it lets users design
-scanners that work in a "push" model, i.e. where data is fed to the
+The ``-f`` flag is useful exactly for situations like that: it lets users design
+scanners that work in a "push" model, i.e., a model where data is fed to the
scanner chunk by chunk. When the scanner runs out of data to consume, it
-just stores its state, and return to the caller. When more input data is
+stores its state and returns to the caller. When more input data is
fed to the scanner, it resumes operations exactly where it left off.
Changes needed compared to the "pull" model:
-* User has to supply macros ``YYSETSTATE ()`` and ``YYGETSTATE (state)``.
+* The user has to supply macros named ``YYSETSTATE (state)`` and ``YYGETSTATE ()``.
-* The ``-f`` option inhibits declaration of ``yych`` and ``yyaccept``. So the
- user has to declare these. Also the user has to save and restore these.
- In the example ``examples/push_model/push.re`` these are declared as
- fields of the (C++) class of which the scanner is a method, so they do
- not need to be saved/restored explicitly. For C they could e.g. be made
- macros that select fields from a structure passed in as parameter.
+* The ``-f`` option inhibits declaration of ``yych`` and ``yyaccept``, so the
+ user has to declare them and save and restore them where required.
+ In the ``examples/push_model/push.re`` example, these are declared as
+ fields of a (C++) class of which the scanner is a method, so they do
+ not need to be saved/restored explicitly. For C, they could, e.g., be made
+ macros that select fields from a structure passed in as a parameter.
Alternatively, they could be declared as local variables, saved with
- ``YYFILL (n)`` when it decides to return and restored at entry to the
+ ``YYFILL (n)`` when it decides to return and restored upon entering the
function. Also, it could be more efficient to save the state from
``YYFILL (n)`` because ``YYSETSTATE (state)`` is called unconditionally.
- ``YYFILL (n)`` however does not get ``state`` as parameter, so we would have
+ ``YYFILL (n)`` however does not get ``state`` as a parameter, so we would have
to store state in a local variable by ``YYSETSTATE (state)``.
* Modify ``YYFILL (n)`` to return (from the function calling it) if more input is needed.
-* Modify caller to recognise if more input is needed and respond appropriately.
+* Modify the caller to recognize if more input is needed and respond appropriately.
* The generated code will contain a switch block that is used to
- restores the last state by jumping behind the corrspoding ``YYFILL (n)``
- call. This code is automatically generated in the epilog of the first ``/*!re2c */``
+ restore the last state by jumping behind the corresponding ``YYFILL (n)``
+ call. This code is automatically generated in the epilogue of the first ``/*!re2c */``
block. It is possible to trigger generation of the ``YYGETSTATE ()``
block earlier by placing a ``/*!getstate:re2c*/`` comment. This is especially useful when the scanner code should be
wrapped inside a loop.
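
The idea of storing the state and resuming later can be sketched without re2c at all.
The ``Scanner`` struct and the ``feed`` and ``finish`` functions below are hypothetical:
the ``state`` field stands in for the value handled by ``YYGETSTATE ()`` and ``YYSETSTATE (state)``,
and the "lexer" merely counts runs of digits that may be split across chunks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical push-model scanner.
struct Scanner {
    int state;      // 0: between numbers, 1: inside a run of digits
    int numbers;    // digit runs completed so far
};

// The caller pushes one chunk of input. When the chunk is exhausted, the
// function simply returns; `state` remembers where scanning left off.
static void feed(Scanner &sc, const char *chunk, std::size_t len)
{
    assert(chunk != 0);
    for (std::size_t i = 0; i < len; ++i) {
        bool digit = chunk[i] >= '0' && chunk[i] <= '9';
        switch (sc.state) {    // resume from the stored state
        case 0:
            if (digit) sc.state = 1;
            break;
        case 1:
            if (!digit) { ++sc.numbers; sc.state = 0; }
            break;
        }
    }
}

// The caller signals the end of input; a pending number is completed.
static void finish(Scanner &sc)
{
    if (sc.state == 1) { ++sc.numbers; sc.state = 0; }
}
```

Feeding ``"12 3"`` and then ``"4 5"`` yields three numbers (``12``, ``34``, ``5``): the run that starts with ``3`` continues in the next chunk.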
-Please see ``examples/push_model/push.re`` for "push" model scanner. The
-generated code can be tweaked using inplace configurations ``state:abort``
+Please see ``examples/push_model/push.re`` for an example of a "push" model scanner. The
+generated code can be tweaked with inplace configurations ``state:abort``
and ``state:nextlabel``.
-
``-d --debug-output``
Creates a parser that dumps information about
the current position and the state the parser is in.
- This is useful to debug parser issues and states. If you use this
+ This is useful for debugging parser issues and states. If you use this
switch, you need to define a ``YYDEBUG`` macro, which will be called like a
function with two parameters: ``void YYDEBUG (int state, char current)``.
The first parameter receives the state or ``-1`` and the second parameter
``-e --ecb``
Generate a parser that supports EBCDIC. The generated
- code can deal with any character up to 0xFF. In this mode ``re2c`` assumes
- that input character size is 1 byte. This switch is incompatible with
- ``-w``, ``-x``, ``-u`` and ``-8``.
+ code can deal with any character up to 0xFF. In this mode, ``re2c`` assumes
+ an input character size of 1 byte. This switch is incompatible with
+ ``-w``, ``-x``, ``-u``, and ``-8``.
``-f --storable-state``
Generate a scanner with support for storable state.
``-F --flex-syntax``
- Partial support for the flex syntax. When this flag
+ Partial support for flex syntax. When this flag
is active, named definitions must be surrounded by curly braces and
- can be defined without an equal sign and the terminating semi colon.
- Instead names are treated as direct double quoted strings.
+ can be defined without an equal sign and the terminating semicolon.
+ Instead, names are treated as direct double quoted strings.
``-g --computed-gotos``
Generate a scanner that utilizes GCC's
- computed goto feature. That is, ``re2c`` generates jump tables whenever a
- decision is of a certain complexity (e.g., a lot of if conditions are
+ computed-goto feature. That is, ``re2c`` generates jump tables whenever a
+ decision is of certain complexity (e.g., a lot of if conditions would be
otherwise necessary). This is only usable with compilers that support this feature.
- Note that this implies ``-b`` and that the complexity threshold can be configured using the inplace configuration ``cgoto:threshold``.
+ Note that this implies ``-b`` and that the complexity threshold can be configured
+ using the ``cgoto:threshold`` inplace configuration.
``-i --no-debug-info``
Do not output ``#line`` information. This is
- useful when you want use a CMS tool with the ``re2c`` output which you
- might want if you do not require your users to have ``re2c`` themselves
- when building from your source.
+ useful when you want to use a CMS tool with ``re2c``'s output. You might
+ want to do this if you do not want to impose re2c as a build requirement
+ for your source.
``-o OUTPUT --output=OUTPUT``
Specify the ``OUTPUT`` file.
In this mode, no ``/*!re2c */`` block and exactly one ``/*!rules:re2c */`` must be present.
The rules are saved and used by every ``/*!use:re2c */`` block that follows.
These blocks can contain inplace configurations, especially ``re2c:flags:e``,
- ``re2c:flags:w``, ``re2c:flags:x``, ``re2c:flags:u`` and ``re2c:flags:8``.
+ ``re2c:flags:w``, ``re2c:flags:x``, ``re2c:flags:u``, and ``re2c:flags:8``.
That way it is possible to create the same scanner multiple times for
- different character types, different input mechanisms or different output mechanisms.
+ different character types, different input mechanisms, or different output mechanisms.
The ``/*!use:re2c */`` blocks can also contain additional rules that will be appended
to the set of rules in ``/*!rules:re2c */``.
``-u --unicode``
Generate a parser that supports UTF-32. The generated
code can deal with any valid Unicode character up to 0x10FFFF. In this
- mode ``re2c`` assumes that input character size is 4 bytes. This switch is
- incompatible with ``-e``, ``-w``, ``-x`` and ``-8``. This implies ``-s``.
+ mode, ``re2c`` assumes an input character size of 4 bytes. This switch is
+ incompatible with ``-e``, ``-w``, ``-x``, and ``-8``. This implies ``-s``.
``-v --version``
Show version information.
``-V --vernum``
- Show the version as a number XXYYZZ.
+ Show the version as a number in the MMmmpp (Major, minor, patch) format.
``-w --wide-chars``
Generate a parser that supports UCS-2. The
generated code can deal with any valid Unicode character up to 0xFFFF.
- In this mode ``re2c`` assumes that input character size is 2 bytes. This
- switch is incompatible with ``-e``, ``-x``, ``-u`` and ``-8``. This implies
+ In this mode, ``re2c`` assumes an input character size of 2 bytes. This
+ switch is incompatible with ``-e``, ``-x``, ``-u``, and ``-8``. This implies
``-s``.
``-x --utf-16``
Generate a parser that supports UTF-16. The generated
code can deal with any valid Unicode character up to 0x10FFFF. In this
- mode ``re2c`` assumes that input character size is 2 bytes. This switch is
- incompatible with ``-e``, ``-w``, ``-u`` and ``-8``. This implies ``-s``.
+ mode, ``re2c`` assumes an input character size of 2 bytes. This switch is
+ incompatible with ``-e``, ``-w``, ``-u``, and ``-8``. This implies ``-s``.
``-8 --utf-8``
Generate a parser that supports UTF-8. The generated
code can deal with any valid Unicode character up to 0x10FFFF. In this
- mode ``re2c`` assumes that input character size is 1 byte. This switch is
- incompatible with ``-e``, ``-w``, ``-x`` and ``-u``.
+ mode, ``re2c`` assumes an input character size of 1 byte. This switch is
+ incompatible with ``-e``, ``-w``, ``-x``, and ``-u``.
``--case-insensitive``
- All strings are case insensitive, so all
- "-expressions are treated in the same way '-expressions are.
+ Makes all strings case insensitive. This makes
+ "-quoted expressions behave as '-quoted expressions.
``--case-inverted``
Invert the meaning of single and double quoted
- strings. With this switch single quotes are case sensitive and double
+ strings. With this switch, single quotes are case sensitive and double
quotes are case insensitive.
``--no-generation-date``
``--no-version``
Suppress version output in the generated file.
+``--no-generation-date``
+ Suppress date output in the generated file.
+
``--encoding-policy POLICY``
Specify how ``re2c`` must treat Unicode
surrogates. ``POLICY`` can be one of the following: ``fail`` (abort with
- error when surrogate encountered), ``substitute`` (silently substitute
- surrogate with error code point 0xFFFD), ``ignore`` (treat surrogates as
- normal code points). By default ``re2c`` ignores surrogates (for backward
- compatibility). Unicode standard says that standalone surrogates are
+ an error when a surrogate is encountered), ``substitute`` (silently replace
+ surrogates with the error code point 0xFFFD), ``ignore`` (treat surrogates as
+ normal code points). By default, ``re2c`` ignores surrogates (for backward
+ compatibility). The Unicode standard says that standalone surrogates are
invalid code points, but different libraries and programs treat them
differently.
``--input INPUT``
- Specify re2c input API. ``INPUT`` can be one of the
- following: ``default``, ``custom``.
+ Specify re2c's input API. ``INPUT`` can be either ``default`` or ``custom``.
``-S --skeleton``
Instead of embedding re2c-generated code into C/C++
for correctness and performance testing.
``--empty-class POLICY``
- What to do if user inputs empty character
+ What to do if the user uses an empty character
class. ``POLICY`` can be one of the following: ``match-empty`` (match empty
input: pretty illogical, but this is the default for backwards
- compatibility reason), ``match-none`` (fail to match on any input),
+ compatibility reasons), ``match-none`` (fail to match on any input),
``error`` (compilation error). Note that there are various ways to
- construct empty class, e.g: [], [^\\x00-\\xFF],
+ construct an empty class, e.g., [], [^\\x00-\\xFF],
[\\x00-\\xFF][\\x00-\\xFF].
+``--dfa-minimization <table | moore>``
+ The internal algorithm used by re2c to minimize the DFA (defaults to ``moore``).
+ Both the table filling algorithm and the Moore algorithm should produce the same DFA (up to state relabeling).
+ The table filling algorithm is much simpler and slower; it serves as a reference implementation.
+
``-1 --single-pass``
- Deprecated and does nothing (single pass is by default now).
+ Deprecated. Does nothing (single pass is the default now).
``re2c:yych:conversion = 0;``
When this setting is non zero, ``re2c`` automatically generates
- conversion code whenever yych gets read. In this case the type must be
+ conversion code whenever yych gets read. In this case, the type must be
defined using ``re2c:define:YYCTYPE``.
``re2c:yych:emit = 1;``
- The generation of *yych* can be suppressed by setting this to 0.
+ Set this to zero to suppress the generation of *yych*.
``re2c:yybm:hex = 0;``
If set to zero, a decimal table will be used. Otherwise, a hexadecimal table will be generated.
Character classes and string literals may contain octal or hexadecimal
character definitions and the following set of escape sequences:
``\a``, ``\b``, ``\f``, ``\n``, ``\r``, ``\t``, ``\v``, ``\\``. An octal character is defined by a backslash
-followed by its three octal digits (e.g. ``\377``).
-Hexadecimal characters from 0 to 0xFF are defined by backslash, a lower
-cased ``x`` and two hexadecimal digits (e.g. ``\x12``). Hexadecimal characters from 0x100 to 0xFFFF are defined by backslash, a lower cased
-``\u`` or an upper cased ``\X`` and four hexadecimal digits (e.g. ``\u1234``).
-Hexadecimal characters from 0x10000 to 0xFFFFffff are defined by backslash, an upper cased ``\U``
-and eight hexadecimal digits (e.g. ``\U12345678``).
+followed by its three octal digits (e.g., ``\377``).
+Hexadecimal characters from 0 to 0xFF are defined by a backslash, a lower
+case ``x`` and two hexadecimal digits (e.g., ``\x12``). Hexadecimal characters from 0x100 to 0xFFFF are defined by a backslash, a lower case
+``\u`` or an upper case ``\X``, and four hexadecimal digits (e.g., ``\u1234``).
+Hexadecimal characters from 0x10000 to 0xFFFFffff are defined by a backslash, an upper case ``\U``,
+and eight hexadecimal digits (e.g., ``\U12345678``).
The only portable "any" rule is the default rule, ``*``.
``YYDEBUG (state, current)``
This is only needed if the ``-d`` flag was
- specified. It allows to easily debug the generated parser by calling a
+ specified. It allows easy debugging of the generated parser by calling a
user defined function for every state. The function should have the
following signature: ``void YYDEBUG (int state, char current)``. The first
parameter receives the state or -1 and the second parameter receives the
provided. ``YYFILL (n)`` should adjust ``YYCURSOR``, ``YYLIMIT``, ``YYMARKER``,
and ``YYCTXMARKER`` as needed. Note that for typical programming languages
``n`` will be the length of the longest keyword plus one. The user can
- place a comment of the form ``/*!max:re2c*/`` to insert ``YYMAXFILL`` definition that is set to the maximum
+ place a comment of the form ``/*!max:re2c*/`` to insert a ``YYMAXFILL`` define set to the maximum
length value.
``YYGETCONDITION ()``
This define is used to get the condition prior to
entering the scanner code when using the ``-c`` switch. The value must be
- initialized with a value from the enumeration ``YYCONDTYPE`` type.
+ initialized with a value from the ``YYCONDTYPE`` enumeration type.
``YYGETSTATE ()``
The user only needs to define this macro if the ``-f``
``YYFILL (n)`` was called.
``YYLIMIT``
- Expression of type ``YYCTYPE *`` that marks the end of the buffer ``YYLIMIT[-1]``
+ An expression of type ``YYCTYPE *`` that marks the end of the buffer (``YYLIMIT[-1]``
is the last character in the buffer). The generated code repeatedly
compares ``YYCURSOR`` to ``YYLIMIT`` to determine when the buffer needs
(re)filling.
``YYMARKER``
- l-value of type ``YYCTYPE *``.
+ An l-value of type ``YYCTYPE *``.
The generated code saves backtracking information in ``YYMARKER``. Some
- easy scanners might not use this.
+ simple scanners might not use this.
``YYMAXFILL``
This will be automatically defined by ``/*!max:re2c*/`` blocks as explained above.
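
To make the roles of ``YYCURSOR``, ``YYLIMIT``, and ``YYMARKER`` concrete, here is a minimal hand-written sketch
(not actual re2c output) that imitates what a generated matcher for the overlapping rules ``"a"`` and ``"abc"`` might do.
It assumes that the whole input sits in one buffer, so ``YYFILL`` is never needed and every read is guarded
by an explicit comparison with ``YYLIMIT``. The function name ``match`` is hypothetical.

.. code-block:: c

    #include <stddef.h>

    typedef char YYCTYPE;

    /* Hand-written imitation of a matcher for the rules "a" and "abc".
       Returns the length of the longest match at the start of buf, or 0. */
    static size_t match(const YYCTYPE *buf, size_t len)
    {
        const YYCTYPE *YYCURSOR = buf;
        const YYCTYPE *YYLIMIT = buf + len; /* YYLIMIT[-1] is the last character */
        const YYCTYPE *YYMARKER = NULL;

        if (YYCURSOR == YYLIMIT || *YYCURSOR != 'a') {
            return 0;                       /* neither rule matches */
        }
        ++YYCURSOR;
        YYMARKER = YYCURSOR;                /* "a" matched: back up this position */

        if (YYCURSOR < YYLIMIT && *YYCURSOR == 'b') {
            ++YYCURSOR;
            if (YYCURSOR < YYLIMIT && *YYCURSOR == 'c') {
                ++YYCURSOR;                 /* "abc" matched */
                return (size_t)(YYCURSOR - buf);
            }
        }

        YYCURSOR = YYMARKER;                /* the longer rule failed: roll back */
        return (size_t)(YYCURSOR - buf);
    }

With this sketch, ``match("abd", 3)`` yields 1 (the cursor rolls back to the end of ``"a"``),
while ``match("abc", 3)`` yields 3.
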
-
``-Wcondition-order``
Warn if the generated program makes implicit
- assumptions about condition numbering. One should use either ``-t, --type-header`` option or
- ``/*!types:re2c*/`` directive to generate mapping of condition names to numbers and use
- autogenerated condition names.
+ assumptions about condition numbering. You should use either the ``-t, --type-header`` option or
+ the ``/*!types:re2c*/`` directive to generate a mapping of condition names to numbers and then use
+ the autogenerated condition names.
``-Wempty-character-class``
- Warn if regular expression contains empty
- character class. From the rational point of view trying to match empty
+ Warn if a regular expression contains an empty
+ character class. Rationally, trying to match an empty
character class makes no sense: it should always fail. However, for
- backwards compatibility reasons ``re2c`` allows empty character class and
- treats it as empty string. Use ``--empty-class`` option to change default
+ backwards compatibility reasons, ``re2c`` allows empty character classes and
+ treats them as empty strings. Use the ``--empty-class`` option to change the default
behavior.
``-Wmatch-empty-string``
- Warn if regular expression in a rule is
- nullable (matches empty string). If DFA runs in a loop and empty match
- is unintentional (input position in not advanced manually), lexer may
- get stuck in eternal loop.
+ Warn if a regular expression in a rule is
+ nullable (matches an empty string). If the DFA runs in a loop and an empty match
+ is unintentional (the input position is not advanced manually), the lexer may
+ get stuck in an infinite loop.
``-Wswapped-range``
- Warn if range lower bound is greater that upper
- bound. Default ``re2c`` behavior is to silently swap range bounds.
+ Warn if the lower bound of a range is greater than its upper
+ bound. The default behavior is to silently swap the range bounds.
``-Wundefined-control-flow``
Warn if some input strings cause undefined
- control flow in lexer (the faulty patterns are reported). This is the
- most dangerous and common mistake. It can be easily fixed by adding
- default rule ``*`` (this rule has the lowest priority, matches any code unit and consumes
+ control flow in the lexer (the faulty patterns are reported). This is the
+ most dangerous and most common mistake. It can be easily fixed by adding
+ the default rule ``*`` (this rule has the lowest priority, matches any code unit, and consumes
exactly one code unit).
``-Wunreachable-rules``
``-Wuseless-escape``
Warn if a symbol is escaped when it shouldn't be.
- By default re2c silently ignores escape, but this may as well indicate a
- typo or an error in escape sequence.
+ By default, re2c silently ignores such escapes, but this may as well indicate a
+ typo or error in the escape sequence.
+ ``-Wno-<warning>``
+ ``-Werror-<warning>``
+ ``-Wno-error-<warning>``
-- Added individual warnings:
+- Added specific warnings:
+ ``-Wundefined-control-flow``
+ ``-Wunreachable-rules``
+ ``-Wcondition-order``
- Fixed options:
+ ``--`` (interpret remaining arguments as non-options)
- Deprecated options:
- + ``-1 --single-pass`` (single pass is by default now)
+ + ``-1 --single-pass`` (single pass is the default now)
- Reduced size of the generated ``.dot`` files.
- Fixed bugs:
+ #27 re2c crashes reading files containing ``%{ %}`` (patch by Rui)
- Updated build system:
+ support out of source builds
+ support ```make distcheck```
- + added ```make bootstrap``` (rebuild re2c after building with precomplied ``.re`` files)
+ + added ```make bootstrap``` (rebuild re2c after building with precompiled ``.re`` files)
+ added ```make tests``` (run tests with ``-j``)
+ added ```make vtests``` (run tests with ``--valgrind -j``)
+ added ```make wtests``` (run tests with ``--wine -j 1``)
.. toctree::
:hidden:
-It is a truth universally acknowledged,
+It is a universally acknowledged truth
that *parsing* regular languages is not the same as just *recognizing* them.
Parsing is way more difficult: it introduces the notion of ambiguity,
-which immediately poses ambiguity decision problem
+which immediately poses an ambiguity decision problem
and gives rise to a bunch of disambiguation techniques.
-Even if we put aside ambiguity, still there is a problem of efficient extraction of parse results:
-non-determinism of the underlying automata is a clear advantage in this respect,
-but NFA are not as fast as DFA.
+Even if we put ambiguity aside, there is still the problem of extracting parsing results efficiently:
+non-determinism in the underlying automata is a clear advantage in this respect,
+but NFAs are not as fast as DFAs.
+
+.. note on pluralizing NFA and DFA: http://english.stackexchange.com/questions/377849/what-is-the-correct-way-to-pluralize-an-initialism-in-which-the-final-word-is-no/377864
The question is, given a regular expression recognizer, what is the best way of turning it into a parser?
-Well, the real question is, how do we add submatch extraction to re2c,
-while retaining the extreme speed of direct executable DFA?
+Well, the real question is, how do we add submatch extraction to re2c
+while retaining the extreme speed of a directly executable DFA?
As usual, there's no solution to the problem in general:
*No servant can serve two masters:
for either he will hate the one, and love the other;
or else he will hold to the one, and despise the other.*
-All existing techniques for parsing regular grammars (those I'm aware of)
-are a kind of compromise: they trade off recognition speed for parsing ability.
-In many cases this is quite acceptable:
+All existing techniques for parsing regular grammars (those that I'm aware of)
+are kind of a compromise: they trade off recognition speed for parsing ability.
+In many cases, this is quite acceptable:
interpreting engines are inherently slower than compiled regular expressions.
They are usually NFA-based and don't aim at extreme speed (they are *fast enough*).
Captures? Eh. Such engines don't even hesitate to allow backreferences,
-which throw them far beyond regular languages and imply exponential recognition complexity.
+which throw them far beyond regular languages and imply exponential recognition complexity.
Captures are simply not an issue.
Well, they have a clear practical goal: regular expressions should be *easy to use*.
-On the other side, there are compiling engines that are mad about generating fast code.
-They are willing to increase compilation time and complexity
+On the other hand, there are compiling engines that are crazy about generating fast code.
+Those are willing to increase compilation time and complexity
in order to gain even a little run-time speed.
Their definition of *practical* is somewhat less popular: regular expressions should be a *zero-cost abstraction*;
-in fact, compiled code should be better than hand-crafted (faster, smaller).
-One such engine is re2c, and in the battle of speed against generality re2c will surely hold to speed.
+in fact, the compiled code should be better than hand-crafted code (faster, smaller).
+One such engine is re2c, and in the battle of speed against generality, re2c will surely hold to speed.
-Then there is the third camp: engines that try to take best of both worlds,
-kind of JIT-compilers for regular expressions.
+Then there is a third camp: engines that try to get the best of both worlds,
+kind of like JIT-compilers for regular expressions.
Such engines usually incorporate a bunch of techniques ranging from fast substring search to backtracking,
-including NFA, DFA and mixed approaches like cached NFA, also known as lazy DFA.
+including NFAs, DFAs and mixed approaches such as cached NFAs, also known as lazy DFAs.
The choice of technique is based on the given regular expression:
it must be the fastest technique still capable of handling the given expression.
-In case of backreferences the engine will fallback to exponential backtracking;
-in case of captures it will probably use NFA.
+In the case of backreferences, the engine will fall back to exponential backtracking;
+in the case of captures, it will probably use an NFA.
And then there are experiments: various research efforts
-that yield interesting tools (usually tailored for a particular domain-specific problem).
-Some of them look very promising, and we'll definitely have an overview of them.
+that yield interesting tools (usually tailored to a particular domain-specific problem).
+Some of them look very promising, and we'll definitely take a look at them.
Yet it becomes clear that even DFA-based approaches are not suitable for re2c:
they incur too much overhead.
It should allow captures in unambiguous cases
that can be technically implemented with a simple DFA.
The rest of the article tries to formalize these requirements
-and define effective procedure to find out if a given regular expression conforms to them.
+and define an effective procedure to find out if a given regular expression conforms to them.
-The analyses is based on the work of many people:
+The analysis is based on the work of many people:
some ideas are taken from papers,
others are inspired by fellow regular expression engines like flex and quex.
The discussion is rather informal; a thoughtful reader might notice
Ambiguity
=========
-In short, *ambiguity* in grammar is the possibility of parsing the same sentence in two different ways.
+In short, *ambiguity* in a grammar is the possibility of parsing the same sentence in multiple different ways.
One should clearly see the difference between *recognition* and *parsing*.
To recognize a sentence means to tell if it belongs to the language.
Roughly speaking, vertical ambiguity is concerned with intersection of alternative grammar rules,
while horizontal ambiguity deals with overlap in rule concatenation.
Both kinds can be defined in terms of operations on finite-state automata.
-Ambiguity problem for regular languages is decidable and has ... complexity.
+The ambiguity problem for regular languages is decidable and has ... complexity.
Formal definition
-----------------
-Ambiguity in regular expressions is defined in terms of corresponding parse trees,
+Ambiguity in regular expressions is defined in terms of the corresponding parse trees,
so we first need to define regular expressions,
A *regular expression* over finite alphabet :math:`\Sigma` is:
G \in AMB \iff \exists w \in L(G) | T_{G}(w)
-G is ambigous <==> exists W | :math:`W^{3\beta}_{\delta_1 \rho_1 \sigma_2} \approx U^{3\beta}_{\delta_1 \rho_1}` aaaa
+G is ambiguous <==> exists W | :math:`W^{3\beta}_{\delta_1 \rho_1 \sigma_2} \approx U^{3\beta}_{\delta_1 \rho_1}` aaaa
.. math::
Release 0.15
============
-This release started out in spring 2015 as a relatively simple code cleanup.
+This release started out in the spring of 2015 as a relatively simple code cleanup.
I focused on the following problem: re2c used to repeat the whole generation process multiple times.
Some parts of the generated program depend on the overall input statistics;
they cannot be generated until the whole input has been processed.
-The usual way is to make stubs for all such things and fix them later.
+The usual strategy is to make stubs for all such things and fix them later.
Instead, re2c used to process the whole input, gather statistics,
-discard the generated output and regenerate it from scratch.
+discard the generated output, and regenerate it from scratch.
Moreover, each generation pass further duplicated certain calculations (for similar reasons).
-As a result code generation phase was repeated four times.
+As a result, the code generation phase was repeated four times.
-The problem here is not inefficiency: re2c is fast enough to allow 4x overhead.
-The real problem is the complexity of code: you have to think of multiple execution layers all the time.
+The problem here is not inefficiency: re2c is fast enough to allow the 4x overhead.
+The real problem is the complexity of the code: you have to think of multiple execution layers all the time.
Some parts of code are only executed on certain layers and affect each other.
Simple reasoning gets really hard.
-So the main purpose of this release was to simplify code and make it easier to fix bugs and add new features.
-However, very soon I realized that some changes in code generation are hard to verify by hand.
+So the main purpose of this release was to simplify the code and make it easier to fix bugs and add new features.
+However, very soon I realized that some of the changes in code generation are hard to verify by hand.
For example, even a minor rebalancing of ``if`` and ``switch`` statements
may change the generated code significantly.
-In search of an automatic verification tool I encountered the idea of generating `skeleton <../../manual/features/skeleton/skeleton.html>`_ programs.
+In search of an automatic verification tool, I encountered the idea of generating `skeleton <../../manual/features/skeleton/skeleton.html>`_ programs.
Meanwhile I just couldn't help adding `warnings <../../manual/warnings/warnings.html>`_,
-updating build system and fixing various bugs.
-A heart-warming consequence of code simplification is that re2c now uses re2c more extensively:
-not only the main program, but also command-line options, inplace configurations
+updating the build system, and fixing various bugs.
+A heart-warming consequence of the code simplification is that re2c now uses re2c more extensively:
+not only the main program, but also the command-line options, inplace configurations,
and all kinds of escaped strings are parsed with re2c.
-l also updated the website (feel free to suggest improvements).
+I also updated the website (feel free to suggest improvements).
-Here is the list of all changes:
+Here is the list of all the changes:
.. include:: ../changelog/0_15_list.rst
==============
This release fixes an error in the testing script:
-it used locale-sensitive ``sort`` utility,
+it used the locale-sensitive ``sort`` utility,
which resulted in a different order of files on different platforms.
Thanks to Sergei Trofimovich who reported the bug and suggested a quick fix. ``:)``
Release 0.15.2
==============
-This release fixes a bug in build system (`reported <https://bugs.gentoo.org/show_bug.cgi?id=566620>`_ on Gentoo bugtracker).
-Fix adds missing dependency (lexer depends on bison-generated header).
-It seems that the dependency has always been missing, build order just happened to be correct.
+This release fixes a bug in the build system (`reported <https://bugs.gentoo.org/show_bug.cgi?id=566620>`_ on the Gentoo bugtracker).
+The fix adds a missing dependency (the lexer depends on a bison-generated header).
+It seems that the dependency has always been missing, and the build order just happened to be correct.
Release 0.15.3
==============
-This release fixes multiple build-time and run-time failures on OS X, FreeBSD and Windows.
+This release fixes multiple build-time and run-time failures on OS X, FreeBSD, and Windows.
Most of the problems were reported and fixed by Oleksii Taran (on OS X)
and Sergei Trofimovich (on FreeBSD and Windows).
Thank you for your help!
This release adds a very important step in the process of code generation:
minimization of the underlying DFA (deterministic finite automaton).
Simply speaking, this means that re2c now generates less code
-(while the generated code behaves in exactly the same way).
+(while the generated code behaves exactly the same).
DFA minimization is a very well-known technique
and one might expect that any self-respecting lexer generator would definitely use it.
So how could it be that re2c didn't?
In fact, re2c did use a couple of self-invented tricks to compress the generated code
-(one interesting technique is constructing *tunnel* automaton).
+(one interesting technique is the construction of a *tunnel* automaton).
Some of these tricks were quite buggy (see `this bug report <https://bugs.gentoo.org/show_bug.cgi?id=518904>`_ for example).
-Now that re2c does canonical DFA minimization all this stuff is obsolete and has been dropped.
+Now that re2c does canonical DFA minimization, all this stuff is obsolete and has been dropped.
-A lot of attention has been paid to the correctness of DFA minimization.
-Usually re2c uses a very simple criterion to validate changes:
-the generated code for all tests in testsuite must remain the same.
-However, in case of DFA minimization the generated code changes dramatically.
+A lot of attention has been paid to the correctness of the DFA minimization.
+Usually, re2c uses a very simple criterion to validate changes:
+the generated code for all the tests in the test suite must remain the same.
+However, in the case of DFA minimization, the generated code changes dramatically.
It is impossible to verify the changes manually.
One possible verification tool is `skeleton <../../manual/features/skeleton/skeleton.html>`_.
Because skeleton is constructed prior to DFA minimization, it cannot be affected by any errors in its implementation.
-Another way to verify DFA minimization is to implement two different algorithms
-and compare the results. Minimization procedure has a very useful property:
-the miminal DFA is unique (with respect to state relabelling).
-We used Moore's and so-called *table filling* algorithms:
-Moore's algorithm is fast, while table filling is very simple to implement.
-There is an option ``--dfa-minimization <moore | table>`` that allows to choose
-a particular algorithm (defaults to ``moore``), but it's only useful for debugging
+Another way to verify DFA minimization is by implementing two different algorithms
+and comparing the results. The minimization procedure has a very useful property:
+the minimal DFA is unique (with respect to state relabeling).
+We used the Moore and the so-called *table filling* algorithms:
+The Moore algorithm is fast, while table filling is very simple to implement.
+The ``--dfa-minimization <moore | table>`` option allows choosing
+a particular algorithm (defaults to ``moore``), but it's only useful for debugging
DFA minimization.
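
The cross-checking idea can be illustrated with a small, self-contained C sketch.
It is not re2c's implementation: both algorithms are written naively for clarity,
and the five-state DFA for ``(a|b)*abb`` as well as all names (``delta``, ``minimize_table``,
``minimize_moore``, ``count_classes``) are assumptions made for the example.
Each function labels every state with the smallest state equivalent to it,
so the two results can be compared directly.

.. code-block:: c

    #include <stdbool.h>
    #include <string.h>

    enum { NSTATES = 5, NSYMS = 2 };

    /* Hypothetical example DFA for (a|b)*abb; states 0 and 2 are equivalent. */
    static const int delta[NSTATES][NSYMS] = {
        {1, 2}, {1, 3}, {1, 2}, {1, 4}, {1, 2}
    };
    static const bool accepting[NSTATES] = {false, false, false, false, true};

    /* Table filling: mark distinguishable pairs until nothing changes.
       On return, cls[p] is the smallest state equivalent to p. */
    static void minimize_table(int cls[NSTATES])
    {
        bool dist[NSTATES][NSTATES];
        bool changed = true;
        int p, q, a;

        for (p = 0; p < NSTATES; ++p)
            for (q = 0; q < NSTATES; ++q)
                dist[p][q] = accepting[p] != accepting[q];

        while (changed) {
            changed = false;
            for (p = 0; p < NSTATES; ++p)
                for (q = 0; q < NSTATES; ++q)
                    for (a = 0; a < NSYMS && !dist[p][q]; ++a)
                        if (dist[delta[p][a]][delta[q][a]])
                            dist[p][q] = changed = true;
        }

        for (p = 0; p < NSTATES; ++p)
            for (q = 0; q <= p; ++q)
                if (!dist[p][q]) { cls[p] = q; break; }
    }

    /* True if p and q share a class and all their successors do as well. */
    static bool same_signature(const int cls[NSTATES], int p, int q)
    {
        int a;
        if (cls[p] != cls[q]) return false;
        for (a = 0; a < NSYMS; ++a)
            if (cls[delta[p][a]] != cls[delta[q][a]]) return false;
        return true;
    }

    /* Moore-style partition refinement: start from accepting/non-accepting
       and split classes until the partition is stable. */
    static void minimize_moore(int cls[NSTATES])
    {
        int next[NSTATES];
        int p, q;

        for (p = 0; p < NSTATES; ++p)
            for (q = 0; q <= p; ++q)
                if (accepting[q] == accepting[p]) { cls[p] = q; break; }

        for (;;) {
            for (p = 0; p < NSTATES; ++p)
                for (q = 0; q <= p; ++q)
                    if (same_signature(cls, p, q)) { next[p] = q; break; }
            if (memcmp(cls, next, sizeof next) == 0) break;
            memcpy(cls, next, sizeof next);
        }
    }

    /* Number of states in the minimal DFA: one per class representative. */
    static int count_classes(const int cls[NSTATES])
    {
        int p, n = 0;
        for (p = 0; p < NSTATES; ++p)
            if (cls[p] == p) ++n;
        return n;
    }

For this DFA, both functions produce the same labeling, merging states 0 and 2 and leaving four classes.
A mismatch between the two labelings would point to a bug in one of the implementations.
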
A good side effect of messing with re2c internals is a significant speedup
of code generation (see `this issue <https://github.com/skvadrik/re2c/issues/128>`_ for example).
-Test suite now runs twice as fast (at least).
+The test suite now runs twice as fast (at least).
-See `changelog <../changelog/changelog.html>`_ for the list of all changes.
+See `changelog <../changelog/changelog.html>`_ for the list of changes.