From c826cd19c35d6759c044d62f87a7e8bdf652e2a9 Mon Sep 17 00:00:00 2001
From: Petr Skocik
Date: Sat, 1 Apr 2017 16:46:01 +0200
Subject: [PATCH] Grammar fixes in the docs

---
 src/about/about.rst | 2 +-
 src/examples/example_01.rst | 36 +--
 src/examples/example_02.rst | 46 +--
 src/examples/example_03.rst | 64 ++--
 src/examples/example_04.rst | 10 +-
 src/examples/example_05.rst | 18 +-
 src/examples/example_07.rst | 36 +--
 src/index.rst | 30 +-
 src/install/install.rst | 12 +-
 src/manual/features/conditions/conditions.rst | 24 +-
 src/manual/features/dot/dot.rst | 22 +-
 src/manual/features/encodings/encodings.rst | 44 +--
 .../features/generic_api/generic_api.rst | 14 +-
 src/manual/features/reuse/reuse.rst | 12 +-
 src/manual/features/skeleton/skeleton.rst | 10 +-
 src/manual/options/options_list.rst | 40 ++-
 src/manual/syntax/syntax.rst | 275 +++++++++---------
 .../warnings/condition_order/how_it_works.rst | 8 +-
 .../warnings/condition_order/real_world.rst | 8 +-
 .../condition_order/simple_example.rst | 22 +-
 .../wempty_character_class.rst | 18 +-
 .../match_empty_string/false_alarm.rst | 8 +-
 .../match_empty_string/real_world.rst | 16 +-
 .../match_empty_string/simple_example.rst | 12 +-
 .../warnings/swapped_range/wswapped_range.rst | 8 +-
 .../undefined_control_flow/default_vs_any.rst | 30 +-
 .../undefined_control_flow/how_it_works.rst | 16 +-
 .../undefined_control_flow/real_world.rst | 14 +-
 .../undefined_control_flow/simple_example.rst | 6 +-
 .../warnings/useless_escape/how_it_works.rst | 12 +-
 .../warnings/useless_escape/real_world.rst | 16 +-
 .../useless_escape/simple_example.rst | 6 +-
 src/manual/warnings/warnings_general.rst | 12 +-
 src/manual/warnings/warnings_list.rst | 4 +-
 34 files changed, 448 insertions(+), 463 deletions(-)

diff --git a/src/about/about.rst b/src/about/about.rst
index 60d4bcc6..4ab5267c 100644
--- a/src/about/about.rst
+++ b/src/about/about.rst
@@ -37,7 +37,7 @@ they can obtain for re2c.
If you do make use of re2c, or incorporate it into a l acknowledgement somewhere (documentation, research report, etc.) would be appreciated. -re2c is distributed with no warranty whatever. +re2c is distributed with no warranty whatsoever. The code is certain to contain errors. Neither the author nor any contributor takes responsibility for any consequences of its use. diff --git a/src/examples/example_01.rst b/src/examples/example_01.rst index d79c638a..049aa813 100644 --- a/src/examples/example_01.rst +++ b/src/examples/example_01.rst @@ -2,14 +2,14 @@ Recognizing integers: the sentinel method ----------------------------------------- This example is very simple, yet practical. -We assume that the input is small (fits in one continuous piece of memory). +We assume that the input is small (fits in one continuous block of memory). We also assume that some characters never occur in well-formed input (but may occur in ill-formed input). -This is often the case in simple real-world tasks like parsing program options, +This is often the case in simple real-world tasks such as parsing program options, converting strings to numbers, determining binary file type based on magic in the first few bytes, -efficiently switching on a string and many others. -Our example program simply loops over its commad-line arguments -and tries to match each argument against one of the four patterns: -binary, octal, decimal and hexadecimal integer literals. +efficiently switching on a string, and many others. +Our example program simply loops over its command-line arguments +and tries to match each argument against one of four patterns: +binary, octal, decimal, and hexadecimal integer literals. The numbers are not *parsed* (their numeric value is not retrieved), they are merely *recognized*. 
:download:`[01_recognizing_integers.re] <01_recognizing_integers.re.txt>` @@ -20,29 +20,29 @@ The numbers are not *parsed* (their numeric value is not retrieved), they are me A couple of things should be noted: -* Default case (when none of the rules matched) is handled properly with ``*`` rule (line 16). - **Never forget to handle default case, otherwise control flow in lexer will be undefined for some input strings.** - Use `[-Wundefined-control-flow] <../manual/warnings/undefined_control_flow/wundefined_control_flow.html>`_ re2c warning: - it will warn you about unhandled default case and show input patterns that are not covered by the rules. +* The default case (when none of the rules matched) is handled properly with the ``*`` rule (line 16). + **Never forget to handle the default case, otherwise control flow in the lexer will be undefined for some input strings.** + Use the `[-Wundefined-control-flow] <../manual/warnings/undefined_control_flow/wundefined_control_flow.html>`_ re2c warning: + it will warn you about the unhandled default case and show the input patterns that are not covered by the rules. * We use the *sentinel* method to stop at the end of input (``re2c:yyfill:enable = 0;`` at line 8). - Sentinel is a special character that can never occur in well-formed input. + A sentinel is a special character that can never occur in well-formed input. It is appended to the end of input and serves as a stop signal for the lexer. - In out case sentinel is ``NULL``: all arguments are ``NULL``-terminated and none of the rules matches ``NULL`` in the middle. - Lexer will inevitably stop when it sees ``NULL``. - Note that we make no assumptions about the input, it may contain any characters. + In our case, the sentinel is ``NULL``: all arguments are ``NULL``-terminated and none of the rules matches ``NULL`` in the middle. + The lexer will inevitably stop when it sees a ``NULL``. + Note that we make no assumptions about the input; it may contain any characters. 
**But do make sure that the sentinel character is not allowed in the middle of a rule.** * ``YYMARKER`` (line 5) is needed because rules overlap: - it backups input position of the longest successful match. + it backs up the input position of the longest successful match. Say, we have overlapping rules ``"a"`` and ``"abc"`` and input string ``"abd"``: by the time ``"a"`` matches there's still a chance to match ``"abc"``, - but when lexer sees ``'d'`` it must rollback. + but when the lexer sees ``'d'``, it must roll back. (You might wonder why ``YYMARKER`` is exposed at all: why not make it a local variable like ``yych``? The reason is, all input pointers must be updated by ``YYFILL`` - as explained in `Arbitrary large input and YYFILL `_ example.) + as explained in the `Arbitrary large input and YYFILL `_ example.) -Generate, compile and run: +Generate, compile, and run: .. code-block:: bash diff --git a/src/examples/example_02.rst b/src/examples/example_02.rst index d0677349..69a2caef 100644 --- a/src/examples/example_02.rst +++ b/src/examples/example_02.rst @@ -2,22 +2,22 @@ Recognizing strings: the need for YYMAXFILL ------------------------------------------- This example is about recognizing strings. -Strings (in generic sense) are different from other kinds of lexemes: they can contain *arbitrary* characters. -It makes them a way more difficult to lex: unlike `Recognizing integers: the sentinel method `_ example, -we cannot use sentinel character to stop at the end of input. +Strings (in the generic sense) are different from other kinds of lexemes: they can contain *arbitrary* characters. +That makes them way more difficult to lex: unlike in the `Recognizing integers: the sentinel method `_ example, +we cannot use a sentinel character to stop at the end of input. Suppose, for example, that our strings may be single or double-quoted and may contain any character in range ``[0 - 0xFF]`` except quotes of the appropriate type. 
-This time we cannot use ``NULL`` as a sentinel: input strings like ``"aha\0ha"`` are perfectly valid, -but ill-formed strings like ``"aha\0`` are also possible and shouldn't crash lexer. +This time we cannot use ``\0`` as a sentinel: input strings such as ``"aha\0ha"`` are perfectly valid, +but ill-formed strings such as ``"aha\0`` are also possible and shouldn't crash the lexer. Any other character cannot be used for the same reason (including quotes: each type of strings can contain quotes of the opposite type). -By default re2c-generated lexers use the following approach to check for the end of input: -they assume that ``YYLIMIT`` is a pointer to the end of input and check by simply comparing ``YYCURSOR`` and ``YYLIMIT``. -The obvious way is to check on each input character (before advancing to the next character), but it's very slow. +By default, re2c-generated lexers use the following approach to check for the end of the input buffer: +they assume that ``YYLIMIT`` is a pointer to the end of the input buffer, and they check by simply comparing ``YYCURSOR`` and ``YYLIMIT``. +The obvious way is to check on each input character (before advancing to the next character), but that's very slow. Instead, re2c inserts checks only at certain points in the generated program. Each check ensures that there is enough input to proceed until the next check. -If the check fails, lexer calls ``YYFILL(n)``, which can either supply at least ``n`` characters or stop: +If the check fails, the lexer calls ``YYFILL(n)``, which may either supply at least ``n`` characters or stop: ``if ((YYLIMIT - YYCURSOR) < n) YYFILL(n);`` @@ -34,9 +34,9 @@ by Peter Bumbulis, Donald D. Cowan, 1994, ACM Letters on Programming Languages a This approach reduces the number of checks significantly, but it has a downside. Since the lexer checks for multiple characters at once, the last few input characters may become unreachable. 
-Common hack is to pad input with a few fake characters that **do not form a valid lexeme or lexeme suffix**. -The length of padding depends on the maximal argument to ``YYFILL`` -(this value is called ``YYMAXFILL`` and can be generated using ``/*!max:re2c*/`` directive). +A common hack is to pad the input with a few fake characters that **do not form a valid lexeme or lexeme suffix**. +The length of the padding depends on the maximum argument to ``YYFILL`` +(this value is called ``YYMAXFILL`` and can be generated with the ``/*!max:re2c*/`` directive). :download:`[02_recognizing_strings.re] <02_recognizing_strings.re.txt>` @@ -48,25 +48,25 @@ Notes: * ``/*!max:re2c*/`` (line 4) tells re2c to generate ``#define YYMAXFILL n``. -* Input string is padded with ``YYMAXFILL`` characters ``'a'`` (line 15). - Sequence of ``'a'`` does not form a valid lexeme or lexeme suffix - (but padding with quotes would cause false match on ill-formed input like ``"aha``). +* The input string is padded with ``YYMAXFILL`` ``'a'`` characters (line 15). + The sequence of ``'a'`` does not form a valid lexeme or lexeme suffix + (but padding with quotes would cause false matches on ill-formed inputs such as ``"aha``). -* ``YYLIMIT`` points at the end of padding (line 26). +* ``YYLIMIT`` points at the end of the padding (line 26). -* ``YYFILL`` returns an error (line 29): if the input was correct, lexer should have stopped - at the beginning of padding. +* ``YYFILL`` returns an error (line 29): if the input was correct, the lexer should have stopped + at the beginning of the padding. -* If the rule matched (line 36), we ensure that lexer consumed *all* input characters - and stopped exactly at the beginning of padding. +* If the rule matched (line 36), we ensure that the lexer has consumed *all* input characters + and stopped exactly at the beginning of the padding. * We have to use ``re2c:define:YYFILL:naked = 1;`` (line 30) - in order to suppress passing parameter to ``YYFILL``. 
+ in order to suppress passing a parameter to ``YYFILL``. (It was an unfortunate idea to make ``YYFILL`` a call expression by default: - ``YYFILL`` has to stop the lexer eventually, that's why it has to be a macro and not a function. + ``YYFILL`` has to stop the lexer eventually. That's why it has to be a macro and not a function. One should either set ``re2c:define:YYFILL:naked = 1;`` or define ``YYFILL(n)`` as a macro.) -Generate, compile and run: +Generate, compile, and run: .. code-block:: bash diff --git a/src/examples/example_03.rst b/src/examples/example_03.rst index fa86d540..ea175ca6 100644 --- a/src/examples/example_03.rst +++ b/src/examples/example_03.rst @@ -1,26 +1,26 @@ Arbitrary large input and YYFILL -------------------------------- -In this example we suppose that input cannot be mapped in memory at once: +In this example we suppose that our input cannot be mapped in memory at once: either it's too large or its size cannot be determined in advance. -The usual thing to do in such case is to allocate a buffer and lex input in chunks that fit into buffer. -re2c allows us to refill buffer using ``YYFILL``: see `Recognizing strings: the need for YYMAXFILL `_ example -for details about program points and conditions that trigger ``YYFILL`` invocation. +The usual thing to do in such a case is to allocate a buffer and lex the input in chunks that fit into the buffer. +re2c allows us to refill the buffer using ``YYFILL``: see `Recognizing strings: the need for YYMAXFILL `_ example +for details about the program points and conditions that trigger a ``YYFILL`` invocation. Currently re2c provides no way to combine ``YYFILL`` with the sentinel method: we have to enable ``YYLIMIT``-based checks for the end of input and pad input with ``YYMAXFILL`` fake characters. This may be changed in later versions of re2c. 
-The idea of ``YYFILL`` is fairly simple: lexer is stuck upon the fact that +The idea of ``YYFILL`` is fairly simple: the lexer is stuck upon the fact that ``(YYLIMIT - YYCURSOR) < n`` and ``YYFILL`` must either invert this condition or stop lexing. -Disaster will happen if ``YYFILL`` fails to provide at least ``n`` characters, yet resumes lexing. -Technically ``YYFILL`` must somehow "extend" input for at least ``n`` characters: -after ``YYFILL`` all input pointers must point to exact same characters, -except ``YYLIMIT``: it must be advanced at least ``n`` positions. +A disaster will happen if ``YYFILL`` fails to provide at least ``n`` characters, yet resumes lexing. +Technically ``YYFILL`` must somehow "extend" the input by at least ``n`` characters: +after ``YYFILL``, all input pointers must point to the exact same characters, +except ``YYLIMIT``, which must be advanced at least ``n`` positions. Since we want to use a fixed amount of memory, we have to shift buffer contents: -discard characters that are already lexed, -move the remaining characters at the beginning of the buffer +discard characters that have already been lexed, +move the remaining characters to the beginning of the buffer, and fill the vacant space with new characters. -All the pointers must be decreased by the length of discarded input, +All the pointers must be decreased by the length of the discarded input, except ``YYLIMIT`` (it must point at the end of buffer): .. code-block:: bash @@ -37,9 +37,9 @@ except ``YYLIMIT`` (it must point at the end of buffer): buffer, YYMARKER YYCURSOR YYLIMIT lexeme -End of input is a special case: as explained in `Recognizing strings: the need for YYMAXFILL `_ example, +The end of input is a special case: as explained in the `Recognizing strings: the need for YYMAXFILL `_ example, the input must be padded with ``YYMAXFILL`` fake characters. 
-In this case ``YYLIMIT`` must point at the end of padding: +In this case ``YYLIMIT`` must point at the end of the padding: .. code-block:: bash @@ -57,17 +57,17 @@ In this case ``YYLIMIT`` must point at the end of padding: Which part of input can be discarded? The answer is, all input up to the leftmost meaningful pointer. -Intuitively it seems that it must be ``YYMARKER``: it backups input position of the latest match, +Intuitively it seems that it must be ``YYMARKER``: it backs up the input position of the latest match, so it's always less than or equal to ``YYCURSOR``. -However, ``YYMARKER`` is not always used and even when it is, its usage depends on the input: -not all control flow paths in lexer ever initialize it. -Thus for some inputs ``YYMARKER`` is meaningless +However, ``YYMARKER`` is not always used and even when it is, its use depends on the input: +not all control flow paths in the lexer ever initialize it. +Thus for some inputs, ``YYMARKER`` is meaningless and should be used with care. -In practice input rarely consists of one giant lexeme: it is usually a sequence of small lexemes. -In that case lexer runs in a loop and it is convenient to have a special "lexeme start" pointer. -It can be used as boundary in ``YYFILL``. +In practice, input rarely consists of one giant lexeme: it is usually a sequence of small lexemes. +In that case, the lexer runs in a loop and it is convenient to have a special "lexeme start" pointer. +It can be used as a boundary in ``YYFILL``. -Our example program reads ``stdin`` in chunks of 16 bytes (in real word buffer size is usually ~4Kb) +Our example program reads ``stdin`` in chunks of 16 bytes (in the real world, the buffer size is usually ~4KiB) and tries to lex numbers separated by newlines. :download:`[03_arbitrary_large_input.re] <03_arbitrary_large_input.re.txt>` @@ -78,25 +78,25 @@ and tries to lex numbers separated by newlines. Notes: -* ``YYMAXFILL`` bytes at the end of buffer are reserved for padding. 
- This memory is unused most of the time, but ``YYMAXFILL`` is usually negligably small compared to buffer size. +* ``YYMAXFILL`` bytes at the end of the buffer are reserved for padding. + This memory is unused most of the time, but ``YYMAXFILL`` is usually negligibly small compared to the buffer size. -* There is only one successsful way out (line 60): lexer must recognize a standalone - "end of input" lexeme (``NULL``) exactly at the beginning of padding. - ``YYFILL`` failure is an error: if the input was correct, lexer should have already stopped. +* There is only one successful way out (line 60): the lexer must recognize a standalone + "end of input" lexeme (``NULL``) exactly at the beginning of the padding. + ``YYFILL`` failure is an error: if the input was correct, the lexer should have already stopped. * ``YYFILL`` may fail for two reasons: either there is no more input (line 23), - or lexeme is too long: it occupies the whole buffer and nothing can be discarded (line 27). + or the lexeme is too long: it occupies the whole buffer and nothing can be discarded (line 27). We treat both cases in the same way (as error), but a real-world program might handle them differently - (resize buffer, cut long lexeme in two, etc.). + (resize the buffer, cut the long lexeme in two, etc.). -* ``@@`` in ``YYFILL`` definition (line 52) is a formal parameter: re2c substitutes it with the actual argument to ``YYFILL``. +* ``@@`` in the ``YYFILL`` definition (line 52) is a formal parameter: re2c substitutes it with the actual argument to ``YYFILL``. -* There is a special ``tok`` pointer: it points at the beginning of lexeme (line 47) +* There is a special ``tok`` pointer: it points at the beginning of a lexeme (line 47) and serves as a boundary in ``YYFILL``. -Generate, compile and run: +Generate, compile, and run: .. 
code-block:: bash diff --git a/src/examples/example_04.rst b/src/examples/example_04.rst index f26f59ba..a76bc270 100644 --- a/src/examples/example_04.rst +++ b/src/examples/example_04.rst @@ -1,8 +1,8 @@ Parsing integers (multiple re2c blocks) --------------------------------------- -This example is based on `Recognizing integers: the sentinel method `_ example, -only now integer literals are parsed rather than simply recognized. +This example is based on the `Recognizing integers: the sentinel method `_ example, +only now the integer literals are parsed rather than simply recognized. Parsing integers is simple: one can easily do it by hand. However, re2c-generated code *does* look like a simple handwritten parser: a couple of dereferences and conditional jumps. No overhead. ``:)`` @@ -16,11 +16,11 @@ a couple of dereferences and conditional jumps. No overhead. ``:)`` Notes: * Configurations and definitions (lines 20 - 26) are not scoped to a single re2c block — they are global. - Each block may override configurations, but this affects global scope. + Each block may override configurations, but this affects the global scope. * Blocks don't have to be in the same function: they can be in separate functions or elsewhere - as long as the exposed interface fits into lexical scope. + as long as the exposed interface fits into the lexical scope. -Generate, compile and run: +Generate, compile, and run: .. code-block:: bash diff --git a/src/examples/example_05.rst b/src/examples/example_05.rst index 3e6f87c6..dc141167 100644 --- a/src/examples/example_05.rst +++ b/src/examples/example_05.rst @@ -1,7 +1,7 @@ Parsing integers (conditions) ----------------------------- -This example does exactly the same as `Parsing integers (multiple re2c blocks) `_ example, +This example does exactly the same thing as the `Parsing integers (multiple re2c blocks) `_ example, but in a slightly different manner: it uses re2c conditions instead of blocks. 
Conditions allow to encode multiple interconnected lexers within a single re2c block. @@ -13,31 +13,31 @@ Conditions allow to encode multiple interconnected lexers within a single re2c b Notes: -* Conditions are enabled with ``-c`` option. +* Conditions are enabled with the ``-c`` option. -* Conditions are only syntactic sugar, they can be translated into multiple blocks. +* Conditions are only syntactic sugar; they can be translated into multiple blocks. * Each condition is a standalone lexer (DFA). * Each condition has a unique identifier: ``/*!types:re2c*/`` tells re2c to generate - enumeration of all identifiers (names are prefixed with ``yyc`` by default). - Lexer uses ``YYGETCONDITION`` to get the identifier of current condition + an enumeration of all identifiers (names are prefixed with ``yyc`` by default). + The lexer uses ``YYGETCONDITION`` to get the identifier of the current condition and ``YYSETCONDITION`` to set it. * Each condition has a unique label (prefixed with ``yyc_`` by default). -* Conditions are connected: transitions are allowed between final states of one condition - and start state of another condition (but not between inner states of different conditions). +* Conditions are connected: transitions are allowed between the final states of one condition + and the start state of another condition (but not between inner states of different conditions). The generated code starts with dispatch. Actions can either jump to the initial dispatch or jump directly to any condition. -* Rule ``<*>`` is merged to all conditions (low priority). +* The ``<*>`` rule is merged to all conditions (low priority). * Rules with multiple conditions are merged to each listed condition (normal priority). * ``:=>`` jumps directly to the next condition (bypassing the initial dispatch). -Generate, compile and run: +Generate, compile, and run: .. 
code-block:: bash diff --git a/src/examples/example_07.rst b/src/examples/example_07.rst index 3c5cdc46..fe811b12 100644 --- a/src/examples/example_07.rst +++ b/src/examples/example_07.rst @@ -1,12 +1,12 @@ C++98 lexer ----------- -This is an example of a big real-world re2c program: C++98 lexer. -It confirms to the C++98 standard (except for a couple of hacks to simulate preprocessor). -All nontrivial lexemes (integers, floating-point constants, strings and character literals) -are parsed (not only recognized): numeric literals are converted to numbers, strings are unescaped. -Some additional checks described in standard (e.g. overflows in integer literals) are also done. -In fact, C++ is an easy language to lex: unlike many other languages, lexer can proceed without feedback from parser. +This is an example of a big, real-world re2c program: a C++98 lexer. +It conforms to the C++98 standard (except for a couple of hacks to simulate the preprocessor). +All nontrivial lexemes (integers, floating-point constants, strings, and character literals) +are parsed (not only recognized): numeric literals are converted to numbers, and strings are unescaped. +Some additional checks described in the standard (e.g., overflows in integer literals) are also done. +In fact, C++ is an easy language to lex: unlike in many other languages, the C++98 lexer can proceed without feedback from the parser. :download:`[07_cxx98.re] <07_cxx98.re.txt>` @@ -16,30 +16,30 @@ In fact, C++ is an easy language to lex: unlike many other languages, lexer can Notes: -* The main lexer is used to lex all trivial lexemes (macros, whitespaces, boolean literals, keywords, operators and punctuators, identifiers), - recognize numeric literals (which are further parsed by a bunch of auxilary lexers), - and recognize the start of string and character literals (which are further recognized and parsed by an auxilary lexer).
+* The main lexer is used to lex all trivial lexemes (macros, whitespace, boolean literals, keywords, operators, punctuators, and identifiers), + recognize numeric literals (which are further parsed by a bunch of auxiliary lexers), + and recognize the start of string and character literals (which are further recognized and parsed by an auxiliary lexer). Numeric literals are thus lexed twice: this approach may be deemed inefficient, but it takes much more effort to validate and parse them at once. - Besides, a real-world lexer would rather recognize ill-formed lexemes (e.g. overflowed integer literals), - report them and resume lexing. + Besides, a real-world lexer would rather recognize ill-formed lexemes (e.g., overflowed integer literals), + report them, and resume lexing. -* We don't use re2c in cases when hand-written parser looks simpler: when parsing octal and decimal literals - (though re2c-based parser would do exactly the same, without the slightest overhead). +* We don't use re2c in cases when a hand-written parser looks simpler: when parsing octal and decimal literals + (though a re2c-based parser would do exactly the same, without the slightest overhead). However, hexadecimal literals still require some lexing, which looks better with re2c. - Again, it's only a matter of taste: re2c-based implementation adds no overhead. + Again, it's only a matter of taste: a re2c-based implementation adds no overhead. Look at the generated code to make sure. * The main lexer and string lexer both use ``re2c:yyfill:enable = 1;``, other lexers use ``re2c:yyfill:enable = 0;``. - This is very important: both main lexer and string lexer advance input position to new (yet unseen) input characters, - so they must check for the end of input and call ``YYFILL``.
In conrast, other lexers only parse lexemes that - have been already recognized by the main lexer: these lexemes are guaranteed to be within buffer bounds + This is very important: both the main lexer and string lexer advance input position to new (yet unseen) input characters, + so they must check for the end of input and call ``YYFILL``. In contrast, other lexers only parse lexemes that + have already been recognized by the main lexer: these lexemes are guaranteed to be within buffer bounds (they are guarded by ``in.tok`` on the left and ``in.lim`` on the right). * The hardest part is (unsurprisingly) floating-point literals. They are just as hard to lex as to use. ``:)`` -Generate, compile and run: +Generate, compile, and run: .. code-block:: bash diff --git a/src/index.rst b/src/index.rst index 336e2b7c..fb4b3ad3 100644 --- a/src/index.rst +++ b/src/index.rst @@ -15,17 +15,17 @@ re2c re2c is a lexer generator for C/C++. Its key features are: -* Very fast lexers: the generated code is as good as a carefully tuned hand-crafted C/C++ lexer. - It's because re2c generates minimalistic hard-coded state machine +* Very fast lexers: the generated code is as good as a carefully tuned, hand-crafted C/C++ lexer. + It's because re2c generates minimalistic, hard-coded state machines (as opposed to full-featured table-based lexers). * Flexible API: one can `configure `_ or even `completely override `_ the way re2c generates code. - Programmers can adjust lexer to a particular input model, - avoid unnecessary overhead (drop useless runtime checks, do inplace lexing, etc.) + Programmers can adjust their lexer to a particular input model, + avoid unnecessary overhead (drop useless runtime checks, do in-place lexing, etc.), and make all sorts of hacks. - `Examples `_ cover many real-world cases and shed some light on dark corners of re2c API. + `Examples `_ cover many real-world cases and shed some light on the dark corners of the re2c API. 
* Efficient `Unicode support `_ (code points are compiled into executable finite-state machines). @@ -42,15 +42,15 @@ Bugs & feedback --------------- Please send feedback to `re2c-devel `_ and -`re2c-general `_ mailing lists -(search `mail archieves `_ for old threads) +the `re2c-general `_ mailing lists +(search `mail archives `_ for old threads), or `report a bug `_. Note that re2c is hosted both on `github `_ and on `sourceforge `_ for redundancy. -Currently github serves as main repository, bugtracker and binary hosting. -Sourceforge is used as backup repository and to host mail -(so please don't send bugs or feedback to sourceforge). +Currently, github serves as the main repository, bugtracker, and hosting site for binaries. +Sourceforge is used as a backup repository and for mail +(so please don't send your bug reports or feedback to sourceforge). News & updates @@ -61,7 +61,7 @@ News & updates :class: feed :width: 2em -Subscribe to receive latest news and updates: |feed| +Subscribe to receive the latest news and updates: |feed| @@ -76,7 +76,7 @@ Projects that use re2c * ... last but not least, `re2c `_ This list is by no means complete; -these are only the most well-known and open source projects. +these are only the best-known and open source projects. @@ -85,14 +85,14 @@ Contribute Contributions come in various forms: -* Tests: a very easy and valuable contribution is to add your lexer to re2c test suite. +* Tests: a very easy and valuable way of contributing is by adding your lexer to the re2c test suite. Real-world tests are the best. Feel free to strip out all non-re2c code if you must keep it secret. - In return re2c will not break your code (re2c developers strive to never break existing tests). + In return, re2c will not break your code (re2c developers strive to never break existing tests). * Ideas: new features and new ways to use re2c. -* Development: bugfixes, features, ports to other languages.
+* Development: bugfixes, features, and ports to other languages. Everyone is welcome! diff --git a/src/install/install.rst b/src/install/install.rst index d5b5a345..5fa1081e 100644 --- a/src/install/install.rst +++ b/src/install/install.rst @@ -70,10 +70,10 @@ BSD Build ===== -You only need C++98 compier to build re2c from tarball. +You only need a C++98 compiler to build re2c from the tarball. If you have bison, re2c will use it (otherwise it will use precompiled files). -If you are building re2c from source (not from tarball), you will also need autotools: +If you are building re2c from source (not from a tarball), you will also need autotools: .. code-block:: bash @@ -90,11 +90,11 @@ This will install re2c (binary and manpage) to ``prefix`` (``/usr/local`` by def $ make $ make install -Bootstrap ---------- +Bootstrapping +------------- Some parts of re2c (lexers and parser of command-line options) are written in re2c. -These files are precompiled and packaged into re2c distribution (so that re2c can be built without re2c). +These files are precompiled and packaged into the re2c distribution (so that re2c can be built without re2c). However, one can fully bootstrap re2c: .. code-block:: bash @@ -143,7 +143,7 @@ Or run only the main test suite (and watch progress dumped to ``stdout``): $ make tests -Run test suite under `valgrind `_ (takes a long time): +Run the test suite under `valgrind `_ (takes a long time): .. code-block:: bash diff --git a/src/manual/features/conditions/conditions.rst b/src/manual/features/conditions/conditions.rst index bbaae4be..595353cf 100644 --- a/src/manual/features/conditions/conditions.rst +++ b/src/manual/features/conditions/conditions.rst @@ -4,17 +4,17 @@ Conditions .. toctree:: :hidden: -You can preceed regular expressions with a list of condition names when -using the ``-c`` switch. In this case ``re2c`` generates scanner blocks for -each conditon. 
Where each of the generated blocks has its own +You can precede regular expressions with a list of condition names when +using the ``-c`` switch. ``re2c`` will then generate a scanner block for +each condition, and each of the generated blocks will have its own precondition. The precondition is given by the interface define ``YYGETCONDITON()`` and must be of type ``YYCONDTYPE``. There are two special rule types. First, the rules of the condition ``<*>`` -are merged to all conditions (note that they have lower priority than -other rules of that condition). And second the empty condition list -allows to provide a code block that does not have a scanner part. -Meaning it does not allow any regular expression. The condition value +are merged to all conditions (note that they have a lower priority than +other rules of that condition). And second, the empty condition list +allows to provide a code block that does not have a scanner part, +meaning it does not allow any regular expression. The condition value referring to this special block is always the one with the enumeration value 0. This way the code of this special rule can be used to initialize a scanner. It is in no way necessary to have these rules: but @@ -22,16 +22,16 @@ sometimes it is helpful to have a dedicated uninitialized condition state. Non empty rules allow to specify the new condition, which makes them -transition rules. Besides generating calls for the define -``YYSETCONDTITION`` no other special code is generated. +transition rules. Besides generating calls for the +``YYSETCONDTITION`` define, no other special code is generated. -There is another kind of special rules that allow to prepend code to any +There is another kind of special rules that allows to prepend code to any code block of all rules of a certain set of conditions or to all code blocks to all rules. This can be helpful when some operation is common -among rules. For instance this can be used to store the length of the +among rules. 
For instance, this can be used to store the length of the scanned string. These special setup rules start with an exclamation mark followed by either a list of conditions ```` or a star ````. When ``re2c`` generates the code for a rule whose state does not have a -setup rule and a star'd setup rule is present, than that code will be +setup rule and a starred setup rule is present, that code will be used as setup code. diff --git a/src/manual/features/dot/dot.rst b/src/manual/features/dot/dot.rst index 7cbbb86d..ea8f838d 100644 --- a/src/manual/features/dot/dot.rst +++ b/src/manual/features/dot/dot.rst @@ -4,20 +4,20 @@ .. toctree:: :hidden: -With ``-D, --emit-dot`` option re2c does not generate C/C++ code. -Instead, it dumps the generated DFA in `DOT format `_. -One can convert this dump to an image of DFA using `graphviz `_ or another library. +With the ``-D, --emit-dot`` option, re2c does not generate C/C++ code. +Instead, it dumps the generated DFA in the `DOT format `_. +One can convert this dump to an image of the DFA using `graphviz `_ or another library. -Say we want a picture of DFA that accepts any UTF-8 code point, ``utf8_any.re``: +Say we want a picture of the DFA that accepts any UTF-8 code point, ``utf8_any.re``: .. literalinclude:: utf8_any.re.txt :language: cpp -Generate and render : +Generate and render: .. code-block:: none - $ re2c -D8 -o utf8_any.dot utf8_any.re + $ re2c -D -8 -o utf8_any.dot utf8_any.re $ dot -Tpng -o utf8_any.png utf8_any.dot Here is the picture: @@ -27,8 +27,8 @@ Here is the picture: Note that re2c performs additional transformations on the DFA: inserts ``YYFILL`` `checkpoints <../../../examples/example_02.html>`_, -binds actions, applies basic code deduplication. -During the transformations it splits certain states and adds lambda transitions. +binds actions, and applies basic code deduplication. +During the transformations, it splits certain states and adds lambda transitions. 
Lambda transitions correspond to the unlabeled edges on the picture. A real-world example (JSON lexer, all non-re2c code stripped out), ``php_json.re``: @@ -42,17 +42,17 @@ Generate .dot file: $ re2c -Dc -o php_json.dot php_json.re -Render with ```dot -Gratio=0.3 -Tpng -o php_json_dot.png php_json.dot```: +Render it with ```dot -Gratio=0.3 -Tpng -o php_json_dot.png php_json.dot```: .. image:: php_json_dot.png :width: 100% -Render with ```neato -Elen=4 -Tpng -o php_json_neato.png php_json.dot```: +Render it with ```neato -Elen=4 -Tpng -o php_json_neato.png php_json.dot```: .. image:: php_json_neato.png :width: 70% -The generated graph is sometimes very large and requires careful tuning of rendering paratemeters. +The generated graph is sometimes very large and requires careful tuning of the rendering parameters. diff --git a/src/manual/features/encodings/encodings.rst b/src/manual/features/encodings/encodings.rst index 66dc945d..8aa5311d 100644 --- a/src/manual/features/encodings/encodings.rst +++ b/src/manual/features/encodings/encodings.rst @@ -8,55 +8,55 @@ Encodings UCS-2 (``-w``), UTF-16 (``-x``), UTF-32 (``-u``) and UTF-8 (``-8``). See also inplace configuration ``re2c:flags``. -The following concepts should be clarified when talking about encoding. -*Code point* is an abstract number, which represents single encoding -symbol. *Code unit* is the smallest unit of memory, which is used in the +The following concepts should be clarified when talking about encodings. +A *code point* is an abstract number that represents a single symbol. +A *code unit* is the smallest unit of memory, which is used in the encoded text (it corresponds to one character in the input stream). One -or more code units can be needed to represent a single code point, -depending on the encoding. In *fixed-length* encoding, each code point -is represented with equal number of code units. 
In *variable-length* -encoding, different code points can be represented with different number +or more code units may be needed to represent a single code point, +depending on the encoding. In a *fixed-length* encoding, each code point +is represented with an equal number of code units. In *variable-length* +encodings, different code points can be represented with different number of code units. * ASCII is a fixed-length encoding. Its code space includes 0x100 - code points, from 0 to 0xFF. One code point is represented with exactly one - 1-byte code unit, which has the same value as the code point. Size of + code points, from 0 to 0xFF. A code point is represented with exactly one + 1-byte code unit, which has the same value as the code point. The size of ``YYCTYPE`` must be 1 byte. * EBCDIC is a fixed-length encoding. Its code space includes 0x100 - code points, from 0 to 0xFF. One code point is represented with exactly - one 1-byte code unit, which has the same value as the code point. Size + code points, from 0 to 0xFF. A code point is represented with exactly + one 1-byte code unit, which has the same value as the code point. The size of ``YYCTYPE`` must be 1 byte. * UCS-2 is a fixed-length encoding. Its code space includes 0x10000 code points, from 0 to 0xFFFF. One code point is represented with exactly one 2-byte code unit, which has the same value as the code - point. Size of ``YYCTYPE`` must be 2 bytes. + point. The size of ``YYCTYPE`` must be 2 bytes. * UTF-16 is a variable-length encoding. Its code space includes all Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One - code point is represented with one or two 2-byte code units. Size of + code point is represented with one or two 2-byte code units. The size of ``YYCTYPE`` must be 2 bytes. * UTF-32 is a fixed-length encoding. Its code space includes all Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One - code point is represented with exactly one 4-byte code unit. 
Size of + code point is represented with exactly one 4-byte code unit. The size of ``YYCTYPE`` must be 4 bytes. * UTF-8 is a variable-length encoding. Its code space includes all Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One - code point is represented with sequence of one, two, three or four + code point is represented with sequence of one, two, three, or four 1-byte code units. Size of ``YYCTYPE`` must be 1 byte. In Unicode, values from range 0xD800 to 0xDFFF (surrogates) are not -valid Unicode code points, any encoded sequence of code units, that +valid Unicode code points. Any encoded sequence of code units that would map to Unicode code points in the range 0xD800-0xDFFF, is ill-formed. The user can control how ``re2c`` treats such ill-formed -sequences with ``--encoding-policy `` flag. +sequences with the ``--encoding-policy `` flag. -For some encodings, there are code units, that never occur in valid -encoded stream (e.g. 0xFF byte in UTF-8). If the generated scanner must -check for invalid input, the only true way to do so is to use default -rule ``*``. Note, that full range rule ``[^]`` won't catch invalid code units when variable-length encoding is used -(``[^]`` means "all valid code points", while default rule ``*`` means "all possible code units"). +For some encodings, there are code units that never occur in a valid +encoded stream (e.g., 0xFF byte in UTF-8). If the generated scanner must +check for invalid input, the only correct way to do so is to use the default +rule (``*``). Note that the full range rule (``[^]``) won't catch invalid code units when a variable-length encoding is used +(``[^]`` means "any valid code point", whereas the default rule (``*``) means "any possible code unit"). 
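As a rough illustration of the code point versus code unit distinction described in the encodings hunk above, the following C sketch computes how many code units a single code point occupies in UTF-8 and in UTF-16. The helper names `utf8_code_units` and `utf16_code_units` are hypothetical and are not part of re2c; the ranges are the ones quoted in the text (valid code points run from 0 to 0xD7FF and from 0xE000 to 0x10FFFF, and surrogates are rejected).

```c
#include <stdint.h>

/* Hypothetical helper, not part of re2c: number of 1-byte code units
 * needed to encode code point `cp` in UTF-8. Returns 0 if `cp` is not
 * a valid Unicode code point (a surrogate, or above 0x10FFFF). */
static int utf8_code_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate range */
    if (cp > 0x10FFFF) return 0;                /* outside the code space */
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Hypothetical helper, not part of re2c: number of 2-byte code units
 * needed to encode code point `cp` in UTF-16, or 0 if `cp` is invalid. */
static int utf16_code_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate range */
    if (cp > 0x10FFFF) return 0;                /* outside the code space */
    return cp < 0x10000 ? 1 : 2;                /* 2 means a surrogate pair */
}
```

For example, U+20AC takes three code units in UTF-8 but only one in UTF-16, while U+1F600 takes four in UTF-8 and two in UTF-16. This is why a rule that matches "any valid code point" may consume several code units, whereas a rule that matches "any possible code unit" consumes exactly one.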
diff --git a/src/manual/features/generic_api/generic_api.rst b/src/manual/features/generic_api/generic_api.rst index 9c082c49..fbdfe582 100644 --- a/src/manual/features/generic_api/generic_api.rst +++ b/src/manual/features/generic_api/generic_api.rst @@ -4,21 +4,21 @@ Generic API .. toctree:: :hidden: -``re2c`` usually operates on input using pointer-like primitives -``YYCURSOR``, ``YYMARKER``, ``YYCTXMARKER`` and ``YYLIMIT``. +``re2c`` usually operates on input with pointer-like primitives +``YYCURSOR``, ``YYMARKER``, ``YYCTXMARKER``, and ``YYLIMIT``. -Generic input API (enabled with ``--input custom`` switch) allows to -customize input operations. In this mode, ``re2c`` will express all +The generic input API (enabled with the ``--input custom`` switch) allows +customizing input operations. In this mode, ``re2c`` will express all operations on input in terms of the following primitives: +---------------------+-----------------------------------------------------+ | ``YYPEEK ()`` | get current input character | +---------------------+-----------------------------------------------------+ - | ``YYSKIP ()`` | advance to the next character | + | ``YYSKIP ()`` | advance to next character | +---------------------+-----------------------------------------------------+ - | ``YYBACKUP ()`` | backup current input position | + | ``YYBACKUP ()`` | back up current input position | +---------------------+-----------------------------------------------------+ - | ``YYBACKUPCTX ()`` | backup current input position for trailing context | + | ``YYBACKUPCTX ()`` | back up current input position for trailing context | +---------------------+-----------------------------------------------------+ | ``YYRESTORE ()`` | restore current input position | +---------------------+-----------------------------------------------------+ diff --git a/src/manual/features/reuse/reuse.rst b/src/manual/features/reuse/reuse.rst index 4817311e..49c6254a 100644 --- a/src/manual/features/reuse/reuse.rst 
+++ b/src/manual/features/reuse/reuse.rst @@ -1,14 +1,14 @@ Reuse ----- -Reuse mode is controlled by ``-r --reusable`` option. -Allows reuse of scanner definitions with ``/*!use:re2c */`` after ``/*!rules:re2c */``. -In this mode no ``/*!re2c */`` block and exactly one ``/*!rules:re2c */`` must be present. -The rules are being saved and used by every ``/*!use:re2c */`` block that follows. +Reuse mode is controlled by the ``-r --reusable`` option. +This allows reuse of scanner definitions with ``/*!use:re2c */`` after ``/*!rules:re2c */``. +In this mode, no ``/*!re2c */`` block and exactly one ``/*!rules:re2c */`` must be present. +The rules are saved and used by every ``/*!use:re2c */`` block that follows. These blocks can contain inplace configurations, especially ``re2c:flags:e``, ``re2c:flags:w``, ``re2c:flags:x``, ``re2c:flags:u`` and ``re2c:flags:8``. -That way it is possible to create the same scanner multiple times for -different character types, different input mechanisms or different output mechanisms. +That way, it is possible to create the same scanner multiple times for +different character types, different input mechanisms, or different output mechanisms. The ``/*!use:re2c */`` blocks can also contain additional rules that will be appended to the set of rules in ``/*!rules:re2c */``. diff --git a/src/manual/features/skeleton/skeleton.rst b/src/manual/features/skeleton/skeleton.rst index 10ab2547..44e80dc8 100644 --- a/src/manual/features/skeleton/skeleton.rst +++ b/src/manual/features/skeleton/skeleton.rst @@ -4,18 +4,18 @@ Skeleton .. toctree:: :hidden: -With ``-S, --skeleton`` option re2c ignores all non-re2c code and generates a self-contained C program +With the ``-S, --skeleton`` option, re2c ignores all non-re2c code and generates a self-contained C program that can be further compiled and executed. The program consists of lexer code and input data. 
-For each constructed DFA (block or condition) re2c generates a standalone lexer and two files: -``.input`` file with strings derived from the DFA and ``.keys`` file with expected match results. +For each constructed DFA (block or condition), re2c generates a standalone lexer and two files: +an ``.input`` file with strings derived from the DFA and a ``.keys`` file with expected match results. The program runs each lexer on the corresponding ``.input`` file and compares results with the expectations. For encodings with 1-byte code units (such as ASCII, UTF-8 and EBCDIC) the generated data -covers all DFA transitions (in other words, skeleton program triggers each conditional jump in lexer). +covers all DFA transitions (in other words, the skeleton program triggers each conditional jump in the lexer). For encodings with multibyte code units the generated data covers up to 256 transitions -of each disjoint character range in DFA (see `Generating data`_ section for details). +of each disjoint character range in the DFA (see `Generating data`_ section for details). .. include:: example.rst diff --git a/src/manual/options/options_list.rst b/src/manual/options/options_list.rst index c835e2e9..a358a030 100644 --- a/src/manual/options/options_list.rst +++ b/src/manual/options/options_list.rst @@ -1,27 +1,27 @@ ``-? -h --help`` - Invoke a short help. + Show a short help screen: ``-b --bit-vectors`` - Implies ``-s``. Use bit vectors as well in the - attempt to coax better code out of the compiler. Most useful for - specifications with more than a few keywords (e.g. for most programming + Implies ``-s``. Use bit vectors as well to try to + coax better code out of the compiler. Most useful for + specifications with more than a few keywords (e.g., for most programming languages). ``-c --conditions`` - Used to support (f)lex-like condition support. + Used for (f)lex-like condition support. 
``-d --debug-output`` Creates a parser that dumps information about - the current position and in which state the parser is while parsing the - input. This is useful to debug parser issues and states. If you use this - switch you need to define a macro ``YYDEBUG`` that is called like a + the current position and the state the parser is in. + This is useful to debug parser issues and states. If you use this + switch, you need to define a ``YYDEBUG`` macro, which will be called like a function with two parameters: ``void YYDEBUG (int state, char current)``. The first parameter receives the state or ``-1`` and the second parameter receives the input at the current cursor. ``-D --emit-dot`` - Emit Graphviz dot data. It can then be processed - with e.g. ``dot -Tpng input.dot > output.png``. Please note that + Emit Graphviz dot data, which can then be processed + with e.g., ``dot -Tpng input.dot > output.png``. Please note that scanners with many states may crash dot. ``-e --ecb`` @@ -34,23 +34,21 @@ Generate a scanner with support for storable state. ``-F --flex-syntax`` - Partial support for flex syntax. When this flag - is active then named definitions must be surrounded by curly braces and + Partial support for the flex syntax. When this flag + is active, named definitions must be surrounded by curly braces and can be defined without an equal sign and the terminating semi colon. Instead names are treated as direct double quoted strings. ``-g --computed-gotos`` Generate a scanner that utilizes GCC's - computed goto feature. That is ``re2c`` generates jump tables whenever a - decision is of a certain complexity (e.g. a lot of if conditions are - otherwise necessary). This is only useable with GCC and produces output - that cannot be compiled with any other compiler. Note that this implies - ``-b`` and that the complexity threshold can be configured using the - inplace configuration ``cgoto:threshold``. + computed goto feature. 
That is, ``re2c`` generates jump tables whenever a + decision is of a certain complexity (e.g., a lot of if conditions are + otherwise necessary). This is only usable with compilers that support this feature. + Note that this implies ``-b`` and that the complexity threshold can be configured using the inplace configuration ``cgoto:threshold``. ``-i --no-debug-info`` Do not output ``#line`` information. This is - usefull when you want use a CMS tool with the ``re2c`` output which you + useful when you want use a CMS tool with the ``re2c`` output which you might want if you do not require your users to have ``re2c`` themselves when building from your source. @@ -59,8 +57,8 @@ ``-r --reusable`` Allows reuse of scanner definitions with ``/*!use:re2c */`` after ``/*!rules:re2c */``. - In this mode no ``/*!re2c */`` block and exactly one ``/*!rules:re2c */`` must be present. - The rules are being saved and used by every ``/*!use:re2c */`` block that follows. + In this mode, no ``/*!re2c */`` block and exactly one ``/*!rules:re2c */`` must be present. + The rules are saved and used by every ``/*!use:re2c */`` block that follows. These blocks can contain inplace configurations, especially ``re2c:flags:e``, ``re2c:flags:w``, ``re2c:flags:x``, ``re2c:flags:u`` and ``re2c:flags:8``. That way it is possible to create the same scanner multiple times for diff --git a/src/manual/syntax/syntax.rst b/src/manual/syntax/syntax.rst index ef1bd560..dec4ef5e 100644 --- a/src/manual/syntax/syntax.rst +++ b/src/manual/syntax/syntax.rst @@ -11,11 +11,11 @@ Code for ``re2c`` consists of a set of `rules`_, `definitions`_ and Rules ----- -Rules consist of a `regular expressions`_ along with a block of C/C++ code -that is to be executed when the associated regular expression is +Each rule consist of a `regular expression`_ accompanied with a block of C/C++ code +which is to be executed when the associated regular expression is matched. 
You can either start the code with an opening curly brace or -the sequence ``:=``. When the code with a curly brace then ``re2c`` counts the brace depth -and stops looking for code automatically. Otherwise curly braces are not +the sequence ``:=``. If you use an opening curly brace, ``re2c`` will count brace depth +and stop looking for code automatically. Otherwise, curly braces are not allowed and ``re2c`` stops looking for code at the first line that does not begin with whitespace. If two or more rules overlap, the first rule is preferred. @@ -24,31 +24,31 @@ is preferred. ``regular-expression := C/C++ code`` -There is one special rule: default rule ``*`` +There is one special rule: the default rule (``*``) ``* { C/C++ code }`` ``* := C/C++ code`` -Note that default rule ``*`` differs from ``[^]``: default rule has the lowest priority, -matches any code unit (either valid or invalid) and always consumes one character; -while ``[^]`` matches any valid code point (not code unit) and can consume multiple -code units. In fact, when variable-length encoding is used, ``*`` -is the only possible way to match invalid input character. - -If ``-c`` is active then each regular expression is preceeded by a list -of comma separated condition names. Besides normal naming rules there -are two special cases: ``<*>`` (such rules are merged to all conditions) -and ``<>`` (such the rule cannot have an associated regular expression, -its code is merged to all actions). Non empty rules may further more specify the new -condition. In that case ``re2c`` will generate the necessary code to +Note that the default rule (``*``) differs from ``[^]``: the default rule has the lowest priority, +matches any code unit (either valid or invalid) and always consumes exactly one character. +``[^]``, on the other hand, matches any valid code point (not the same as a code unit) and can consume multiple +code units. 
In fact, when a variable-length encoding is used, ``*`` +is the only possible way to match an invalid input character. + +If ``-c`` is active, then each regular expression is preceded by a list +of comma-separated condition names. Besides the normal naming rules, there +are two special cases: ``<*>`` (these rules are merged to all conditions) +and ``<>`` (these rules cannot have an associated regular expression; +their code is merged to all actions). Non-empty rules may furthermore specify the new +condition. In that case, ``re2c`` will generate the necessary code to change the condition automatically. Rules can use ``:=>`` as a shortcut to automatically generate code that not only sets the new condition state but also continues execution with the new state. A shortcut rule should not be used in a loop where there is code between the start of the loop and the ``re2c`` block unless ``re2c:cond:goto`` -is changed to ``continue``. If code is necessary before all rules (though not simple jumps) you -can doso by using ```` pseudo-rules. +is changed to ``continue``. If some code is needed before all rules (though not before simple jumps), you +can insert it with ```` pseudo-rules. `` regular-expression { C/C++ code }`` @@ -125,16 +125,17 @@ Configurations ``re2c:condprefix = yyc;`` Allows to specify the prefix used for - condition labels. That is this text is prepended to any condition label + condition labels. That is, the text to be prepended to condition labels in the generated output file. + ``re2c:condenumprefix = yyc;`` Allows to specify the prefix used for - condition values. That is this text is prepended to any condition enum - value in the generated output file. + condition values. That is, the text to be prepended to condition enum + values in the generated output file. ``re2c:cond:divider = "/* *********************************** */";`` - Allows to customize the devider for condition blocks. 
You can use ``@@`` + Allows to customize the divider for condition blocks. You can use ``@@`` to put the name of the condition or customize the placeholder using ``re2c:cond:divider@cond``. @@ -144,234 +145,220 @@ Configurations ``re2c:cond:goto = "goto @@;";`` Allows to customize the condition goto statements used with ``:=>`` style rules. You can use ``@@`` - to put the name of the condition or ustomize the placeholder using + to put the name of the condition or customize the placeholder using ``re2c:cond:goto@cond``. You can also change this to ``continue;``, which would allow you to continue with the next loop cycle including any code - between loop start and re2c block. + between your loop start and your re2c block. ``re2c:cond:goto@cond = @@;`` - Spcifies the placeholder that will be replaced with the condition label in ``re2c:cond:goto``. + Specifies the placeholder that will be replaced with the condition label in ``re2c:cond:goto``. ``re2c:indent:top = 0;`` - Specifies the minimum number of indendation to - use. Requires a numeric value greater than or equal zero. + Specifies the minimum amount of indentation to + use. Requires a numeric value greater than or equal to zero. ``re2c:indent:string = "\t";`` - Specifies the string to use for indendation. Requires a string that should - contain only whitespace unless you need this for external tools. The easiest - way to specify spaces is to enclude them in single or double quotes. - If you do not want any indendation at all you can simply set this to "". + Specifies the string to use for indentation. Requires a string that should + contain only whitespace unless you need something else for external tools. The easiest + way to specify spaces is to enclose them in single or double quotes. + If you do not want any indentation at all, you can simply set this to "". 
``re2c:yych:conversion = 0;`` - When this setting is non zero, then ``re2c`` automatically generates + When this setting is non zero, ``re2c`` automatically generates conversion code whenever yych gets read. In this case the type must be defined using ``re2c:define:YYCTYPE``. ``re2c:yych:emit = 1;`` - Generation of *yych* can be suppressed by setting this to 0. + The generation of *yych* can be suppressed by setting this to 0. ``re2c:yybm:hex = 0;`` - If set to zero then a decimal table is being used else a hexadecimal table will be generated. + If set to zero, a decimal table will be used. Otherwise, a hexadecimal table will be generated. ``re2c:yyfill:enable = 1;`` - Set this to zero to suppress generation of ``YYFILL (n)``. When using this be sure to verify that the generated - scanner does not read behind input. Allowing this behavior might - introduce several security issues to your programs. + Set this to zero to suppress the generation of ``YYFILL (n)``. When using this, be sure to verify that the generated + scanner does not read behind the end of your input. Allowing such behavior might + introduce several security issues to your program. ``re2c:yyfill:check = 1;`` - This can be set to 0 to suppress output of the - pre condition using ``YYCURSOR`` and ``YYLIMIT`` which becomes usefull when + This can be set to 0 to suppress the generation of + ``YYCURSOR`` and ``YYLIMIT`` based precondition checks. This option is useful when ``YYLIMIT + YYMAXFILL`` is always accessible. ``re2c:define:YYFILL = "YYFILL";`` - Substitution for ``YYFILL``. Note - that by default ``re2c`` generates argument in braces and semicolon after + Define a substitution for ``YYFILL``. Note that by default, + ``re2c`` generates an argument in parentheses and a semicolon after ``YYFILL``.
If you need to make ``YYFILL`` an arbitrary statement rather - than a call, set ``re2c:define:YYFILL:naked`` to non-zero and use - ``re2c:define:YYFILL@len`` to denote formal parameter inside of ``YYFILL`` + than a call, set ``re2c:define:YYFILL:naked`` to a non-zero value and use + ``re2c:define:YYFILL@len`` to set a placeholder for the formal parameter inside of your ``YYFILL`` body. ``re2c:define:YYFILL@len = "@@";`` - Any occurence of this text - inside of ``YYFILL`` will be replaced with the actual argument. + Any occurrence of this text + inside of a ``YYFILL`` call will be replaced with the actual argument. ``re2c:yyfill:parameter = 1;`` - Controls argument in braces after - ``YYFILL``. If zero, agrument is omitted. If non-zero, argument is - generated unless ``re2c:define:YYFILL:naked`` is set to non-zero. + Controls the argument in the parentheses that follow ``YYFILL``. If zero, the argument is omitted. + If non-zero, the argument is generated unless ``re2c:define:YYFILL:naked`` is set to non-zero. ``re2c:define:YYFILL:naked = 0;`` - Controls argument in braces and - semicolon after ``YYFILL``. If zero, both agrument and semicolon are - omitted. If non-zero, argument is generated unless - ``re2c:yyfill:parameter`` is set to zero and semicolon is generated + Controls the argument in the parentheses after ``YYFILL`` and + the following semicolon. If zero, both the argument and the semicolon are + omitted. If non-zero, the argument is generated unless + ``re2c:yyfill:parameter`` is set to zero; the semicolon is generated unconditionally. ``re2c:startlabel = 0;`` - If set to a non zero integer then the start - label of the next scanner blocks will be generated even if not used by - the scanner itself. Otherwise the normal ``yy0`` like start label is only - being generated if needed. 
If set to a text value then a label with that + If set to a non zero integer, then the start + label of the next scanner block will be generated even if it isn't used by + the scanner itself. Otherwise, the normal ``yy0``-like start label is only + generated if needed. If set to a text value, then a label with that text will be generated regardless of whether the normal start label is - being used or not. This setting is being reset to 0 after a start - label has been generated. + used or not. This setting is reset to 0 after a start label has been generated. ``re2c:labelprefix = "yy";`` Allows to change the prefix of numbered - labels. The default is ``yy`` and can be set any string that is a valid - label. + labels. The default is ``yy``. Can be set any string that is valid in + a label name. ``re2c:state:abort = 0;`` - When not zero and switch ``-f`` is active then + When not zero and the ``-f`` switch is active, then the ``YYGETSTATE`` block will contain a default case that aborts and a -1 - case is used for initialization. + case will be used for initialization. ``re2c:state:nextlabel = 0;`` Used when ``-f`` is active to control whether the ``YYGETSTATE`` block is followed by a ``yyNext:`` label line. - Instead of using ``yyNext`` you can usually also use configuration + Instead of using ``yyNext``, you can usually also use configuration ``startlabel`` to force a specific start label or default to ``yy0`` as - start label. Instead of using a dedicated label it is often better to + a start label. Instead of using a dedicated label, it is often better to separate the ``YYGETSTATE`` code from the actual scanner code by placing a ``/*!getstate:re2c*/`` comment. ``re2c:cgoto:threshold = 9;`` - When ``-g`` is active this value specifies - the complexity threshold that triggers generation of jump tables rather - than using nested if's and decision bitfields. 
The threshold is compared - against a calculated estimation of if-s needed where every used bitmap + When ``-g`` is active, this value specifies + the complexity threshold that triggers the generation of jump tables rather + than nested ifs and decision bitfields. The threshold is compared + against a calculated estimation of ifs needed where every used bitmap divides the threshold by 2. ``re2c:yych:conversion = 0;`` - When the input uses signed characters and - ``-s`` or ``-b`` switches are in effect re2c allows to automatically convert + When input uses signed characters and the + ``-s`` or ``-b`` switches are in effect, re2c allows automatic conversion to the unsigned character type that is then necessary for its internal - single character. When this setting is zero or an empty string the - conversion is disabled. Using a non zero number the conversion is taken - from ``YYCTYPE``. If that is given by an inplace configuration that value - is being used. Otherwise it will be ``(YYCTYPE)`` and changes to that - configuration are no longer possible. When this setting is a string the - braces must be specified. Now assuming your input is a ``char *`` - buffer and you are using above mentioned switches you can set + single character. When this setting is zero or an empty string, the + conversion is disabled. If a non zero number is used, the conversion is taken + from ``YYCTYPE``. If ``YYCTYPE`` is overridden by an inplace configuration setting, that setting + is used instead of a ``YYCTYPE`` cast. Otherwise, it will be ``(YYCTYPE)`` and changes to that + configuration are no longer possible. When this setting is a string, it must contain the casting + parentheses. Now assuming your input is a ``char *`` buffer and you are using the above mentioned switches, you can set ``YYCTYPE`` to ``unsigned char`` and this setting to either 1 or ``(unsigned char)``. ``re2c:define:YYCONDTYPE = "YYCONDTYPE";`` Enumeration used for condition support with ``-c`` mode.
``re2c:define:YYCTXMARKER = "YYCTXMARKER";`` - Allows to overwrite the - define ``YYCTXMARKER`` and thus avoiding it by setting the value to the - actual code needed. + Replaces the ``YYCTXMARKER`` placeholder with the specified identifier. ``re2c:define:YYCTYPE = "YYCTYPE";`` - Allows to overwrite the define - ``YYCTYPE`` and thus avoiding it by setting the value to the actual code - needed. + Replaces the ``YYCTYPE`` placeholder with the specified type. ``re2c:define:YYCURSOR = "YYCURSOR";`` - Allows to overwrite the define - ``YYCURSOR`` and thus avoiding it by setting the value to the actual code - needed. + Replaces the ``YYCURSOR`` placeholder with the specified identifier. ``re2c:define:YYDEBUG = "YYDEBUG";`` - Allows to overwrite the define - ``YYDEBUG`` and thus avoiding it by setting the value to the actual code - needed. + Replaces the ``YYDEBUG`` placeholder with the specified identifier. ``re2c:define:YYGETCONDITION = "YYGETCONDITION";`` Substitution for - ``YYGETCONDITION``. Note that by default ``re2c`` generates braces after + ``YYGETCONDITION``. Note that by default, ``re2c`` generates parentheses after ``YYGETCONDITION``. Set ``re2c:define:YYGETCONDITION:naked`` to non-zero to - omit braces. + omit the parentheses. ``re2c:define:YYGETCONDITION:naked = 0;`` - Controls braces after - ``YYGETCONDITION``. If zero, braces are omitted. If non-zero, braces are + Controls the parentheses after + ``YYGETCONDITION``. If zero, the parentheses are omitted. If non-zero, the parentheses are generated. ``re2c:define:YYSETCONDITION = "YYSETCONDITION";`` Substitution for - ``YYSETCONDITION``. Note that by default ``re2c`` generates argument in - braces and semicolon after ``YYSETCONDITION``. If you need to make + ``YYSETCONDITION``. Note that by default, ``re2c`` generates an argument in + parentheses followed by semicolon after ``YYSETCONDITION``. 
If you need to make ``YYSETCONDITION`` an arbitrary statement rather than a call, set ``re2c:define:YYSETCONDITION:naked`` to non-zero and use - ``re2c:define:YYSETCONDITION@cond`` to denote formal parameter inside of + ``re2c:define:YYSETCONDITION@cond`` to denote the formal parameter inside of the ``YYSETCONDITION`` body. ``re2c:define:YYSETCONDITION@cond = "@@";`` - Any occurence of this + Any occurrence of this text inside of ``YYSETCONDITION`` will be replaced with the actual argument. ``re2c:define:YYSETCONDITION:naked = 0;`` - Controls argument in braces - and semicolon after ``YYSETCONDITION``. If zero, both agrument and - semicolon are omitted. If non-zero, both argument and semicolon are + Controls the argument in parentheses + and the semicolon after ``YYSETCONDITION``. If zero, both the argument and + the semicolon are omitted. If non-zero, both the argument and the semicolon are generated. ``re2c:define:YYGETSTATE = "YYGETSTATE";`` Substitution for - ``YYGETSTATE``. Note that by default ``re2c`` generates braces after + ``YYGETSTATE``. Note that by default, ``re2c`` generates parentheses after ``YYGETSTATE``. Set ``re2c:define:YYGETSTATE:naked`` to non-zero to omit - braces. + the parentheses. ``re2c:define:YYGETSTATE:naked = 0;`` - Controls braces after - ``YYGETSTATE``. If zero, braces are omitted. If non-zero, braces are + Controls the parentheses that follow + ``YYGETSTATE``. If zero, the parentheses are omitted. If non-zero, they are generated. ``re2c:define:YYSETSTATE = "YYSETSTATE";`` Substitution for - ``YYSETSTATE``. Note that by default ``re2c`` generates argument in braces - and semicolon after ``YYSETSTATE``. If you need to make ``YYSETSTATE`` an + ``YYSETSTATE``. Note that by default, ``re2c`` generates an argument in parentheses + followed by a semicolon after ``YYSETSTATE``. 
If you need to make ``YYSETSTATE`` an arbitrary statement rather than a call, set ``re2c:define:YYSETSTATE:naked`` to non-zero and use ``re2c:define:YYSETSTATE@cond`` to denote formal parameter inside of - ``YYSETSTATE`` body. + your ``YYSETSTATE`` body. ``re2c:define:YYSETSTATE@state = "@@";`` - Any occurence of this text + Any occurrence of this text inside of ``YYSETSTATE`` will be replaced with the actual argument. ``re2c:define:YYSETSTATE:naked = 0;`` - Controls argument in braces and - semicolon after ``YYSETSTATE``. If zero, both agrument and semicolon are - omitted. If non-zero, both argument and semicolon are generated. + Controls the argument in parentheses and the + semicolon after ``YYSETSTATE``. If zero, both argument and the semicolon are + omitted. If non-zero, both the argument and the semicolon are generated. ``re2c:define:YYLIMIT = "YYLIMIT";`` - Allows to overwrite the define - ``YYLIMIT`` and thus avoiding it by setting the value to the actual code + Replaces the ``YYLIMIT`` placeholder with the specified identifier. needed. ``re2c:define:YYMARKER = "YYMARKER";`` - Allows to overwrite the define - ``YYMARKER`` and thus avoiding it by setting the value to the actual code - needed. + Replaces the ``YYMARKER`` placeholder with the specified identifier. ``re2c:label:yyFillLabel = "yyFillLabel";`` - Allows to overwrite the name of the label ``yyFillLabel``. + Overrides the name of the ``yyFillLabel`` label. ``re2c:label:yyNext = "yyNext";`` - Allows to overwrite the name of the label ``yyNext``. + Overrides the name of the ``yyNext`` label. ``re2c:variable:yyaccept = yyaccept;`` - Allows to overwrite the name of the variable ``yyaccept``. + Overrides the name of the ``yyaccept`` variable. ``re2c:variable:yybm = "yybm";`` - Allows to overwrite the name of the variable ``yybm``. + Overrides the name of the ``yybm`` variable. ``re2c:variable:yych = "yych";`` - Allows to overwrite the name of the variable ``yych``. 
+ Overrides the name of the ``yych`` variable. ``re2c:variable:yyctable = "yyctable";`` - When both ``-c`` and ``-g`` are active then ``re2c`` uses this variable to generate a static jump table + When both ``-c`` and ``-g`` are active, ``re2c`` will use this variable to generate a static jump table for ``YYGETCONDITION``. ``re2c:variable:yystable = "yystable";`` Deprecated. ``re2c:variable:yytarget = "yytarget";`` - Allows to overwrite the name of the variable ``yytarget``. + Overrides the name of the ``yytarget`` variable. Regular expressions ------------------- @@ -380,14 +367,14 @@ Regular expressions literal string ``"foo"``. ANSI-C escape sequences can be used. ``'foo'`` - literal string ``"foo"`` (characters [a-zA-Z] treated - case-insensitive). ANSI-C escape sequences can be used. + literal string ``"foo"`` (case insensitive for characters [a-zA-Z]). + ANSI-C escape sequences can be used. ``[xyz]`` - character class; in this case, regular expression matches either ``x``, ``y``, or ``z``. + character class; in this case, the regular expression matches ``x``, ``y``, or ``z``. ``[abj-oZ]`` - character class with a range in it; matches ``a``, ``b``, any letter from ``j`` through ``o`` or ``Z``. + character class with a range in it; matches ``a``, ``b``, any letter from ``j`` through ``o``, or ``Z``. ``[^class]`` inverted character class. @@ -397,10 +384,10 @@ Regular expressions which can be expressed as character classes. ``r*`` - zero or more occurences of ``r``. + zero or more occurrences of ``r``. ``r+`` - one or more occurences of ``r``. + one or more occurrences of ``r``. ``r?`` optional ``r``. @@ -412,13 +399,13 @@ Regular expressions ``r`` followed by ``s`` (concatenation). ``r | s`` - either ``r`` or ``s`` (alternative). + ``r`` or ``s`` (alternative). ``r`` / ``s`` ``r`` but only if it is followed by ``s``. Note that ``s`` is not part of the matched text. This type of regular expression is called - "trailing context". 
Trailing context can only be the end of a rule - and not part of a named definition. + "trailing context". Trailing context can only be at the end of a rule + and cannot be part of a named definition. ``r{n}`` matches ``r`` exactly ``n`` times. @@ -433,7 +420,7 @@ Regular expressions match any character except newline. ``name`` - matches named definition as specified by ``name`` only if ``-F`` is + matches a named definition as specified by ``name`` only if ``-F`` is off. If ``-F`` is active then this behaves like it was enclosed in double quotes and matches the string "name". @@ -447,7 +434,7 @@ cased ``x`` and two hexadecimal digits (e.g. ``\x12``). Hexadecimal characters f Hexadecimal characters from 0x10000 to 0xFFFFffff are defined by backslash, an upper cased ``\U`` and eight hexadecimal digits (e.g. ``\U12345678``). -The only portable "any" rule is the default rule ``*``. +The only portable "any" rule is the default rule, ``*``. Interface --------- @@ -455,10 +442,10 @@ Interface The user must supply interface code either in the form of C/C++ code (macros, functions, variables, etc.) or in the form of `configurations`_. Which symbols must be defined and which are optional -depends on a particular use case. +depends on the particular use case. ``YYCONDTYPE`` - In ``-c`` mode you can use ``-t`` to generate a file that + In ``-c`` mode, you can use ``-t`` to generate a file that contains the enumeration used as conditions. Each of the values refers to a condition of a rule set. @@ -471,8 +458,8 @@ depends on a particular use case. ``YYCTYPE`` Type used to hold an input symbol (code unit). Usually - ``char`` or ``unsigned char`` for ASCII, EBCDIC and UTF-8, *unsigned short* - for UTF-16 or UCS-2 and ``unsigned int`` for UTF-32. + ``char`` or ``unsigned char`` for ASCII, EBCDIC or UTF-8, or *unsigned short* + for UTF-16 or UCS-2, or ``unsigned int`` for UTF-32. ``YYCURSOR`` l-value of type ``YYCTYPE *`` that points to the current input symbol. 
The generated code advances @@ -491,7 +478,7 @@ depends on a particular use case. ``YYFILL (n)`` The generated code "calls"" ``YYFILL (n)`` when the buffer needs (re)filling: at least ``n`` additional characters should be - provided. ``YYFILL (n)`` should adjust ``YYCURSOR``, ``YYLIMIT``, ``YYMARKER`` + provided. ``YYFILL (n)`` should adjust ``YYCURSOR``, ``YYLIMIT``, ``YYMARKER``, and ``YYCTXMARKER`` as needed. Note that for typical programming languages ``n`` will be the length of the longest keyword plus one. The user can place a comment of the form ``/*!max:re2c*/`` to insert ``YYMAXFILL`` definition that is set to the maximum @@ -499,7 +486,7 @@ depends on a particular use case. ``YYGETCONDITION ()`` This define is used to get the condition prior to - entering the scanner code when using ``-c`` switch. The value must be + entering the scanner code when using the ``-c`` switch. The value must be initialized with a value from the enumeration ``YYCONDTYPE`` type. ``YYGETSTATE ()`` @@ -538,9 +525,9 @@ depends on a particular use case. ``YYSETSTATE`` is a signed integer that uniquely identifies the specific instance of ``YYFILL (n)`` that is about to be called. Should the user wish to save the state of the scanner and have ``YYFILL (n)`` return to - the caller, all he has to do is store that unique identifer in a - variable. Later, when the scannered is called again, it will call + the caller, all he has to do is store that unique identifier in a + variable. Later, when the scanner is called again, it will call ``YYGETSTATE ()`` and resume execution right where it left off. The generated code will contain both ``YYSETSTATE (s)`` and ``YYGETSTATE`` even - if ``YYFILL (n)`` is being disabled. + if ``YYFILL (n)`` is disabled. 
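Note on the ``YYSETSTATE`` / ``YYGETSTATE`` entries in the hunk above: the storable-state mechanism they describe can be pictured with a small hand-written sketch. This is not code generated by re2c. The ``Scanner`` struct, the ``scan`` function and the ``NEED_MORE`` / ``DONE`` values are hypothetical names chosen for illustration, and the only assumption carried over from the text is that a state of -1 means "not started yet" while any other value identifies the refill point to resume from.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical scanner context; none of these names come from re2c itself.
struct Scanner {
    int state = -1;       // -1 means "fresh start", other values name a refill point
    std::string buf;      // input received so far
    std::size_t cur = 0;  // plays the role of YYCURSOR
    int digits = 0;       // digits counted in the current lexeme
};

enum Status { NEED_MORE, DONE };

// Counts digits up to a ';' terminator and suspends when the input runs out.
Status scan(Scanner &s) {
    switch (s.state) {        // analogue of the YYGETSTATE() dispatch
    case -1: break;           // fresh start: fall through to initialization
    case 0:  goto fill0;      // resume right at the refill point
    }
    s.digits = 0;             // executed on a fresh start only
    for (;;) {
fill0:
        if (s.cur == s.buf.size()) {
            s.state = 0;      // analogue of YYSETSTATE(0) before YYFILL
            return NEED_MORE; // hand control back to the caller
        }
        char c = s.buf[s.cur++];
        if (c == ';') {
            s.state = -1;     // the next call starts a new lexeme
            return DONE;
        }
        if (c >= '0' && c <= '9') ++s.digits;
    }
}
```

A caller would append more input to ``buf`` whenever ``scan`` returns ``NEED_MORE`` and then call it again. The saved identifier lets the second call skip the initialization and continue where the first one stopped.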
diff --git a/src/manual/warnings/condition_order/how_it_works.rst b/src/manual/warnings/condition_order/how_it_works.rst index bd820c7e..fef1493e 100644 --- a/src/manual/warnings/condition_order/how_it_works.rst +++ b/src/manual/warnings/condition_order/how_it_works.rst @@ -3,10 +3,10 @@ How it works The warning is triggered if at the same time: * Conditions are enabled with ``-c``. -* Neither ``/*!types:re2c*/``, nor ``-t, --type-header`` is used. -* Initial dispatch on conditions in sensitive to condition order: +* Neither ``/*!types:re2c*/`` nor ``-t, --type-header`` is used. +* The initial condition dispatch is sensitive to condition order: it is either an ``if`` statement or a jump table. Note that the number of conditions must be greater than one, - otherwise dispatch shrinks to a simple unconditional jump. + otherwise the dispatch shrinks to a simple unconditional jump. diff --git a/src/manual/warnings/condition_order/real_world.rst b/src/manual/warnings/condition_order/real_world.rst index 17aad405..55271e51 100644 --- a/src/manual/warnings/condition_order/real_world.rst +++ b/src/manual/warnings/condition_order/real_world.rst @@ -1,11 +1,11 @@ Real-world examples ~~~~~~~~~~~~~~~~~~~ -The best real-world example is a story of how ``[-Wcondition-order]`` was added to re2c. +The best real-world example is the story of how ``[-Wcondition-order]`` was added to re2c. -One day I decided to change condition numbering scheme. -It was only natural: re2c assigns numbers to conditions in the order they appear in code. -This is not very convenient, because elsewhere in the code condition names (not numbers) are used as unique identifiers. +One day I decided to change the condition numbering scheme. +It was only natural: re2c assigns numbers to conditions in the order they appear in the code. +This is not very convenient, because elsewhere in the code, condition names (not numbers) are used as unique identifiers.
Names are sorted lexicographically, so the original condition order is not preserved. It takes extra care to remember the mapping of names to numbers. So why not just drop numbers and sort conditions by their names? diff --git a/src/manual/warnings/condition_order/simple_example.rst b/src/manual/warnings/condition_order/simple_example.rst index cb6e91b3..753136f8 100644 --- a/src/manual/warnings/condition_order/simple_example.rst +++ b/src/manual/warnings/condition_order/simple_example.rst @@ -3,10 +3,10 @@ A simple example The following lexer consists of two conditions: ``a`` and ``b``. It starts in condition ``a``, which expects a sequence of letters ``'a'`` followed by a comma. -Comma causes transition to condition ``b``, which expects a sequence of letters ``'b'`` followed by an exclamation. +The comma causes transition to condition ``b``, which expects a sequence of letters ``'b'`` followed by an exclamation mark. Anything else is an error. -Nothing special, except that instead of generating condition names with ``/*!types:re2c*/`` directive -or ``-t, --type-header`` option we hardcoded them manually: +Nothing special, except that instead of generating condition names with the ``/*!types:re2c*/`` directive +or the ``-t, --type-header`` option, we've hardcoded them manually: :download:`[wcondition_order.re] ` @@ -14,7 +14,7 @@ or ``-t, --type-header`` option we hardcoded them manually: :language: cpp :linenos: -Condition order is controlled by ``REVERSED_CONDITION_ORDER`` define. +Condition order is controlled by the ``REVERSED_CONDITION_ORDER`` define. Let's compile and run it: .. code-block:: none @@ -30,7 +30,7 @@ Let's compile and run it: aaaa,bbb! Everything works fine: we get ``aaaa,bbb!`` in both cases. -However, if we use ``-s`` re2c option, lexer becomes sensitive to condition order: +However, if we use the ``-s`` re2c option, the lexer becomes sensitive to condition order: .. 
code-block:: none @@ -47,11 +47,11 @@ However, if we use ``-s`` re2c option, lexer becomes sensitive to condition orde error And we also get a warning from re2c. -The same behaviour (re2c warning and error with ``-DREVERSED_CONDITION_ORDER``) remains if we use ``-g`` option +The same behavior (re2c warning and error with ``-DREVERSED_CONDITION_ORDER``) remains if we use the ``-g`` option (or any option that implies ``-s`` or ``-g``). Why is that? A look at the generated code explains everything. -Normally the inital dispatch on conditions is a ``switch`` statement: +Normally the initial dispatch on conditions is a ``switch`` statement: .. code-block:: cpp @@ -61,7 +61,7 @@ Normally the inital dispatch on conditions is a ``switch`` statement: } Dispatch uses explicit condition names and works no matter what numbers are assigned to them. -However, with ``-s`` option re2c generates an ``if`` statement instead of a ``switch``: +However, with the ``-s`` option, re2c generates an ``if`` statement instead of a ``switch``: .. code-block:: cpp @@ -71,7 +71,7 @@ However, with ``-s`` option re2c generates an ``if`` statement instead of a ``sw goto yyc_b; } -And with ``-g`` option it uses jump table (computed ``goto``): +And with the ``-g`` option, it uses a jump table (computed ``goto``): .. code-block:: cpp @@ -81,6 +81,6 @@ And with ``-g`` option it uses jump table (computed ``goto``): }; goto *yyctable[c]; Clearly, the last two cases are sensitive to condition order. -The fix is easy: as the warning suggests, use ``/*!types:re2c*/`` directive or ``-t, --type-header`` option. +The fix is easy: as the warning suggests, use the ``/*!types:re2c*/`` directive or the ``-t, --type-header`` option.
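Note on the ``simple_example.rst`` hunk above: the difference between name-based and number-based dispatch can be reproduced outside re2c with a short sketch. The functions ``dispatch_switch`` and ``dispatch_if`` are hypothetical stand-ins for the generated dispatch code, and the comparison ``c < 1`` is an assumption about what an ``if``-style dispatch over two conditions looks like. Only the ``goto yyc_b`` branch is visible in the excerpt.

```cpp
#include <cassert>

// Dispatch on condition names: A and B are whatever numbers the user's
// enumeration assigns to conditions a and b, so any numbering works.
template <int A, int B>
char dispatch_switch(int c) {
    switch (c) {
    case A: return 'a';
    case B: return 'b';
    }
    return '?';
}

// Dispatch on condition numbers: silently assumes that condition a is
// numbered 0 and condition b is numbered 1.
char dispatch_if(int c) {
    if (c < 1) {
        return 'a';
    } else {
        return 'b';
    }
}
```

With the numbering ``a = 0, b = 1`` both functions agree. With the reversed numbering ``a = 1, b = 0`` the ``switch`` version still selects the right condition, while ``dispatch_if(1)`` picks ``b`` although the lexer is in condition ``a``. This is the kind of mismatch the warning is meant to catch.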
diff --git a/src/manual/warnings/empty_character_class/wempty_character_class.rst b/src/manual/warnings/empty_character_class/wempty_character_class.rst index d78ea8ed..3d225c2f 100644 --- a/src/manual/warnings/empty_character_class/wempty_character_class.rst +++ b/src/manual/warnings/empty_character_class/wempty_character_class.rst @@ -4,22 +4,22 @@ .. toctree:: :hidden: -This warning is complementary to ``--empty-class`` option. -The option expects a single argument, one of the following: +This warning is complementary to the ``--empty-class`` option. +The option expects a single argument, which should be one of the following: -* ``match-empty`` (default): empty character class matches empty string - (that is, always matches and consumes no code units). +* ``match-empty`` (default): an empty character class matches an empty string + (that is, it always matches and consumes no code units). This attitude is strange; however, some real-world programs rely on it. -* ``match-none``: empty character class never matches. +* ``match-none``: an empty character class never matches. This is what logic suggests. -* ``error``: empty character class is an error. +* ``error``: an empty character class is an error. -The ``[-Wempty-character-class]`` warning is a reminder for those -who are not aware of ``--empty-class`` option. +The ``[-Wempty-character-class]`` warning is a reminder to those +who are not aware of the ``--empty-class`` option. -Note that empty character class can be constructed in many ways: +Note that an empty character class can be constructed in many ways: .. 
code-block:: cpp :linenos: diff --git a/src/manual/warnings/match_empty_string/false_alarm.rst b/src/manual/warnings/match_empty_string/false_alarm.rst index 7a2fa7a1..616c286f 100644 --- a/src/manual/warnings/match_empty_string/false_alarm.rst +++ b/src/manual/warnings/match_empty_string/false_alarm.rst @@ -1,12 +1,12 @@ -False alarm -~~~~~~~~~~~ +False alarms +~~~~~~~~~~~~ -In many cases matching empty string makes perfect sense: +In many cases, matching an empty string makes perfect sense: * It might be used as a non-consuming default rule. * It might be used to lex an optional lexeme: if lexeme rules didn't match, - lexer must jump to another block and resume lexing at the same input position. + the lexer must jump to another block and resume lexing at the same input position. Or any other useful examples you can invent. All these cases are perfectly sane. diff --git a/src/manual/warnings/match_empty_string/real_world.rst b/src/manual/warnings/match_empty_string/real_world.rst index 690b86e1..21916e01 100644 --- a/src/manual/warnings/match_empty_string/real_world.rst +++ b/src/manual/warnings/match_empty_string/real_world.rst @@ -7,7 +7,7 @@ That is, to accept zero or more repetitions instead of one or more. Typos in definitions .................... -Here is the skeleton of REXX lexer (the very lexer which motivated Peter to write re2c ``:)``). +Here is a skeleton of a REXX lexer (the very lexer that motivated Peter to write re2c ``:)``). .. code-block:: cpp :linenos: @@ -160,18 +160,18 @@ Here is the skeleton of REXX lexer (the very lexer which motivated Peter to writ The faulty rule is ``symbol``. It is defined as ``symchr*`` and clearly is nullable. -In this particular example (assuming ASCII encoding) empty match is shadowed by other rules: -together ``eof`` and ``any`` cover all possible code units. -So in this case there is no chance of hitting eternal loop. 
+In this particular example (assuming ASCII encoding), the empty match is shadowed by other rules: +together ``eof`` and ``any`` cover all possible code units, +so in this case, there is no chance of hitting an infinite loop. -However, by no means ``symbol`` should be nullable: it makes no sense. -Sure, it's just a typo and the author meant ``symchr+``. +However, by no means should ``symbol`` be nullable: it makes no sense. +Surely, it is just a typo and the author meant ``symchr+``. Skipping uninteresting stuff ............................ -One often needs to skip variable number of, say, spaces: +One often needs to skip a variable number of, say, spaces: .. code-block:: cpp @@ -188,7 +188,7 @@ This definition is ok when used inside of another (non-nullable) rule: "(" TABS_AND_SPACES ("int" | "integer") TABS_AND_SPACES ")" {} */ -However, as a standalone rule it may cause eternal loop on ill-formed input. +However, as a standalone rule, it may cause an infinite loop on ill-formed input. And it's very common to reuse one rule for multiple purposes. diff --git a/src/manual/warnings/match_empty_string/simple_example.rst b/src/manual/warnings/match_empty_string/simple_example.rst index 2c008653..440b90c4 100644 --- a/src/manual/warnings/match_empty_string/simple_example.rst +++ b/src/manual/warnings/match_empty_string/simple_example.rst @@ -1,8 +1,8 @@ A simple example ~~~~~~~~~~~~~~~~ -``[-Wmatch-empty-string]`` warns when a rule is nullable (matches empty string). -It was intended to prevent hitting eternal loop in cases like this: +``[-Wmatch-empty-string]`` warns when a rule is nullable (matches an empty string). +It was intended to prevent infinite looping in cases such as this: :download:`[wmatch_empty_string.re] ` @@ -36,10 +36,10 @@ Generate, compile and run: The program hangs forever if one of the arguments is ill-formed. 
Note that `[-Wundefined-control-flow] <../undefined_control_flow/wundefined_control_flow.html>`_ -has no complaints about this particular case: all input patterns are covered by rules. -Yet if we add default rule ``*``, lexer won't hang anymore: it will match default rule -instead of nullable rule. +has no complaints about this particular case: all input patterns are covered by the rules. +Yet if we add the default rule (``*``), the lexer won't hang anymore: it will match the default rule +instead of the nullable rule. -The fix is easy: make the rule non-nullable (say, ``[a-z]+``) and add default rule ``*``. +The fix is easy: make the rule non-nullable (say, ``[a-z]+``) and add the default rule (``*``). diff --git a/src/manual/warnings/swapped_range/wswapped_range.rst b/src/manual/warnings/swapped_range/wswapped_range.rst index 743b11a5..c5c15f5d 100644 --- a/src/manual/warnings/swapped_range/wswapped_range.rst +++ b/src/manual/warnings/swapped_range/wswapped_range.rst @@ -5,9 +5,9 @@ :hidden: This warning is very simple. -It warns you in cases when character class contains a range which lower bound is greater than upper bound. -For some strange reason re2c never considered it an error: -it simply swaps range bounds and goes on. +It warns you in cases when a character class contains a range whose lower bound is greater than its upper bound. +For some strange reason, re2c never considered this an error: +it simply swapped range bounds and went on. .. code-block:: cpp :linenos: @@ -47,7 +47,7 @@ Given this code, ```re2c -i -Wswapped-range``` generates the following: { return "is it what you want?"; } } -And reports a warning: +and it reports a warning: .. 
code-block:: none diff --git a/src/manual/warnings/undefined_control_flow/default_vs_any.rst b/src/manual/warnings/undefined_control_flow/default_vs_any.rst index 93ae29b2..9363ad42 100644 --- a/src/manual/warnings/undefined_control_flow/default_vs_any.rst +++ b/src/manual/warnings/undefined_control_flow/default_vs_any.rst @@ -1,8 +1,8 @@ Difference between ``*`` and ``[^]`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -When the the world was young and re2c didn't have default rule ``*`` (that is, before re2c-0.13.7), -everyone used ``[^]`` as default rule: +When the world was young and re2c didn't have the default ``*`` rule (that is, before re2c-0.13.7), +everyone used ``[^]`` as the default rule: .. code-block:: cpp @@ -15,18 +15,18 @@ everyone used ``[^]`` as default rule: If other rules didn't match, ``[^]`` will match and consume one character. But exactly what is a *character*? -First, an abstract number that is assigned some sacred meaning in current encoding — *code point*. -Second, a minimal piece of information (say, combination of bits) that can represent a unit of encoded text — *code unit*. +First, an abstract number that is assigned some sacred meaning within the current encoding — a *code point*. +Second, a minimal piece of information (say, combination of bits) that can represent a unit of encoded text — a *code unit*. Rules are defined in terms of code points. Input is measured in code units. -In fixed-width encodings (such as ASCII, EBCDIC, UCS-2, UTF-32, etc.) there is one-to-one correspondence between code points and code units. -In variable-width encodings (such as UTF-8, UTF-16, etc.) code points map to code unit sequences of differing length. +In fixed-width encodings (such as ASCII, EBCDIC, UCS-2, UTF-32, etc.), there is a one-to-one correspondence between code points and code units. +In variable-width encodings (such as UTF-8, UTF-16, etc.), code points map to code unit sequences of different lengths. -``[^]`` rule matches any code point. 
-In fixed-width encodings it covers all code units and consumes exactly one of them. -In variable-width encodings it consumes variable number of code units and may not match some of them. +The ``[^]`` rule matches any code point. +In fixed-width encodings, it covers all code units and consumes exactly one of them. +In variable-width encodings, it consumes variable number of code units and may not match some of them. The example above compiles without warnings with any fixed-width encoding (ASCII by default). -However, with UTF-8 encoding ```re2c -i8 -Wundefined-control-flow``` says: +However, with the UTF-8 encoding, ```re2c -i8 -Wundefined-control-flow``` says: .. code-block:: none @@ -42,9 +42,9 @@ However, with UTF-8 encoding ```re2c -i8 -Wundefined-control-flow``` says: ... and 7 more, use default rule '*' [-Wundefined-control-flow] It shows us the patterns that must never appear in valid UTF-8 encoded text. -If the input is not valid UTF-8, lexer behaviour is undefined (most likely it will end up with segfault). -One would expect that with UTF-16 (another variable-width encoding) re2c will also report a warning, but it doesn't. -This is because by default re2c treats Unicode surrogates as normal code points (for backwards compatibility reasons). +If the input is not valid UTF-8, the lexer behavior is undefined (most likely you will get a segfault). +One would expect that with UTF-16 (another variable-width encoding), re2c would also report a warning, but it doesn't. +This is because by default, re2c treats Unicode surrogates as normal code points (for backwards compatibility reasons). If we tell re2c to exclude surrogates, ```re2c -ix --encoding-policy fail -Wundefined-control-flow``` will warn: .. code-block:: none @@ -55,8 +55,8 @@ If we tell re2c to exclude surrogates, ```re2c -ix --encoding-policy fail -Wunde , use default rule '*' [-Wundefined-control-flow] As you see, it can get quite subtle. 
-One should always use the true default rule ``*`` (it matches any code unit regardless of encoding, +One should always use the true default rule (``*``; matches any code unit regardless of encoding; consumes a single code unit no matter what and always has the lowest priority). -Note that ``*`` is a builtin hack: it cannot be expressed through ordinary rules. +Note that ``*`` is a built-in hack: it cannot be expressed through ordinary rules. diff --git a/src/manual/warnings/undefined_control_flow/how_it_works.rst b/src/manual/warnings/undefined_control_flow/how_it_works.rst index e6eaac90..065752c5 100644 --- a/src/manual/warnings/undefined_control_flow/how_it_works.rst +++ b/src/manual/warnings/undefined_control_flow/how_it_works.rst @@ -2,15 +2,15 @@ How it works ~~~~~~~~~~~~ Every path in the generated DFA must contain at least one accepting state, -otherwise it causes undefined behaviour and should be reported. -re2c walks DFA in deep-first search and checks all paths. -Each branch of search aborts as soon as it meets accepting state. -Most of the real-world programs only forget to handle a few cases, -so almost all branches abort soon and search takes very little time even for a large DFA. -In pathological cases re2c avoids exponential time and space +otherwise the behavior is undefined, which should be reported. +re2c walks the DFA using a depth-first search and it checks all paths. +Each branch of the search finishes as soon as it reaches an accepting state. +Most real-world programs only forget to handle a few cases, +so almost all branches finish soon and the search takes very little time even for a large DFA. +In pathological cases, re2c avoids exponential time and space consumption by placing an upper bound on the number of faulty patterns. The shortest patterns are reported first. -Note that the analyses is done anyway. -The option ``-Wundefined-control-flow`` only controls if the warning is reported or not. +Note that the analysis is done anyway. 
+The ``-Wundefined-control-flow`` option only controls whether the warning is reported. diff --git a/src/manual/warnings/undefined_control_flow/real_world.rst b/src/manual/warnings/undefined_control_flow/real_world.rst index 0aee17ba..e8d1b211 100644 --- a/src/manual/warnings/undefined_control_flow/real_world.rst +++ b/src/manual/warnings/undefined_control_flow/real_world.rst @@ -4,15 +4,15 @@ Real-world examples Many real-world examples deal with preprocessed input, so they make strong assumptions about the input form or character set. These assumptions may or may not be valid under certain circumstances; -however, double-check won't hurt. -Even it you are absolutely sure that default case is impossible, do handle it. +however, double-checking won't hurt. +Even if you are absolutely sure that the default case is impossible, do handle it. **It adds no overhead.** -No additional checks and transitions. -It simply binds code to default label. +No additional checks and transitions are added. +It simply binds code to the default label. -I found ``[-Wundefined-control-flow]`` warnings in many real-world programs (including re2c own lexer). -Mostly these are minor issues like forgetting to handle newlines or zeroes in already preprocessed input, -but it's curious how they creeped into the code. +I found ``[-Wundefined-control-flow]`` warnings in many real-world programs (including re2c's own lexer). +Mostly these are minor issues like forgetting to handle newlines or zeros in already preprocessed input, +but it's curious how they crept into the code. I bet they were just forgotten and not omitted for a good reason.
``:)`` diff --git a/src/manual/warnings/undefined_control_flow/simple_example.rst b/src/manual/warnings/undefined_control_flow/simple_example.rst index bd5380f4..8ebc7e50 100644 --- a/src/manual/warnings/undefined_control_flow/simple_example.rst +++ b/src/manual/warnings/undefined_control_flow/simple_example.rst @@ -32,11 +32,11 @@ Say, we want to match ``'a'``: } Clearly this is not what we want: this code matches any letter, not only ``'a'``. -re2c grumbles something about undefined control flow and says that default ``*`` rule won't hurt: +re2c grumbles something about undefined control flow and says that the default ``*`` rule won't hurt: .. code-block:: none - re2c: warning: line 3: control flow is undefined for strings that match '[\x0-\x60\x62-\xFF]', use default rule '*' [-Wundefined-control-flow] + re2c: warning: line 3: control flow is undefined for strings that match '[\x0-\x60\x62-\xFF]', use the default '*' rule [-Wundefined-control-flow] Let's add it: @@ -71,5 +71,5 @@ Now that's better: { return 'a'; } } -Note that default rule brings no overhead: it simply binds code to default label. +Note that the default rule brings no overhead: it simply binds code to the default label. diff --git a/src/manual/warnings/useless_escape/how_it_works.rst b/src/manual/warnings/useless_escape/how_it_works.rst index 84cd66c9..c3856467 100644 --- a/src/manual/warnings/useless_escape/how_it_works.rst +++ b/src/manual/warnings/useless_escape/how_it_works.rst @@ -9,17 +9,17 @@ re2c recognizes escapes in the following lexemes: The following escapes are recognized: -* Closing quotes (``\"`` for double-quoted strings, ``\'`` for single-quoted strings and ``\]`` for character classes). +* Closing quotes (``\"`` for double-quoted strings, ``\'`` for single-quoted strings, and ``\]`` for character classes). * Dash ``\-`` in character classes. * Octal escapes: ``\ooo``, where ``o`` is in range ``[0 - 7]`` - (maximal octal escape is ``\377``, which equals ``0xFF``). 
-* Hexadecimal escapes: ``\xhh``, ``\Xhhhh``, ``\uhhhh`` and ``\Uhhhhhhhh``,
-  where ``h`` is in range ``[0 - 9]``, ``[a - f]`` or ``[A - F]``.
+  (the largest octal escape is ``\377``, which equals ``0xFF``).
+* Hexadecimal escapes: ``\xhh``, ``\Xhhhh``, ``\uhhhh``, and ``\Uhhhhhhhh``,
+  where ``h`` is in range ``[0 - 9]``, ``[a - f]``, or ``[A - F]``.
 * Miscellaneous escapes: ``\a``, ``\b``, ``\f``, ``\n``, ``\r``, ``\t``, ``\v``, ``\\``.
 
 Ill-formed octal and hexadecimal escapes are treated as errors.
-Escape followed by a newline is also an error: multiline strings and classes are not allowed
-(this is very inconvenient; hopefully it will be fixed in future).
+An escape followed by a newline is also an error: multiline strings and classes are not allowed
+(this is very inconvenient; hopefully it will be fixed in the future).
 Any other ill-formed escapes are ignored.
 If ``[-Wuseless-escape]`` is enabled, re2c warns about ignored escapes.
 
diff --git a/src/manual/warnings/useless_escape/real_world.rst b/src/manual/warnings/useless_escape/real_world.rst
index 7596af62..3a829e23 100644
--- a/src/manual/warnings/useless_escape/real_world.rst
+++ b/src/manual/warnings/useless_escape/real_world.rst
@@ -3,22 +3,22 @@ Real-world examples
 
 I found many useless escapes in real-world programs:
 
-* A very strange escape ``\*`` in a regular expression like ``"*\*"``:
+* A very strange escape ``\*`` in a regular expression such as ``"*\*"``:
   either someone wanted to write ``"*\\*"`` (with backslash in the middle),
   or I have no explanation at all (considering that the first ``*`` is not escaped).
-  As far as I know re2c always treated ``"*\*"`` as ``"**"``.
+  As far as I know, re2c has always treated ``"*\*"`` as ``"**"``.
 
-* ``\h`` in character classes (e.g. ``[ \h\t\v\f\r]``):
-  perhaps someone confused ``\h`` with horisontal tab
+* ``\h`` in character classes (e.g., ``[ \h\t\v\f\r]``):
+  perhaps someone confused ``\h`` with a horizontal tab
   (or even hostname ``:)``).
 
-* ``\[`` in charater classes; this one is very common.
+* ``\[`` in character classes; this one is very common.
 
-* ``\/`` in character classes (e.g. ``[^\/\000]``) and strings (e.g. ``"\/*"``).
+* ``\/`` in character classes (e.g. ``[^\/\000]``) and strings (e.g., ``"\/*"``).
   However, there is one interesting case: ``"/**** State @@ ***\/"``:
-  here unescaped slash would end multiline comment.
+  here, the unescaped slash would end the multiline comment.
   Perhaps ``[-Wuseless-escape]`` should be fixed to recognize such cases.
 
-* ``\.`` in character classes (e.g ``[\.]``).
+* ``\.`` in character classes (e.g., ``[\.]``).
 
diff --git a/src/manual/warnings/useless_escape/simple_example.rst b/src/manual/warnings/useless_escape/simple_example.rst
index 840bcdea..4959030d 100644
--- a/src/manual/warnings/useless_escape/simple_example.rst
+++ b/src/manual/warnings/useless_escape/simple_example.rst
@@ -30,9 +30,9 @@ Given this code, ```re2c -Wuseless-escape``` reports a bunch of warnings:
     re2c: warning: line 5: column 15: escape has no effect: '\'' [-Wuseless-escape]
     re2c: warning: line 5: column 17: escape has no effect: '\[' [-Wuseless-escape]
 
-It says that ``\A`` and ``\[`` escapes are meaningless in all rules,
-``\-`` makes sense only in character class
-and each type of closing quotes (``"``, ``'`` and ``]``) should only be escaped inside of same-quoted string.
+This is because the ``\A`` and ``\[`` escapes are meaningless in all rules,
+``\-`` makes sense only in a character class,
+and each type of closing quotes (``"``, ``'`` and ``]``) should only be escaped inside of a string delimited with the same quotes.
 Useless escapes are ignored: the escaped symbol is treated as not escaped (``\A`` becomes ``A``, etc.).
 The above example should be fixed as follows:
 
diff --git a/src/manual/warnings/warnings_general.rst b/src/manual/warnings/warnings_general.rst
index 50fdc471..c3563269 100644
--- a/src/manual/warnings/warnings_general.rst
+++ b/src/manual/warnings/warnings_general.rst
@@ -3,20 +3,20 @@
     Turn on all warnings.
 
 ``-Werror``
-    Turn warnings into errors. Note that this option along
-    doesn't turn on any warnings, it only affects those warnings that have
+    Turn warnings into errors. Note that this option alone
+    doesn't turn on any warnings; it only affects those warnings that have
     been turned on so far or will be turned on later.
 
 ``-W<warning>``
-    Turn on individual ``warning``.
+    Turn on a ``warning``.
 
 ``-Wno-<warning>``
-    Turn off individual ``warning``.
+    Turn off a ``warning``.
 
 ``-Werror-<warning>``
-    Turn on individual ``warning`` and treat it as error (this implies ``-W<warning>``).
+    Turn on a ``warning`` and treat it as an error (this implies ``-W<warning>``).
 
 ``-Wno-error-<warning>``
-    Don't treat this particular ``warning`` as error. This doesn't turn off
+    Don't treat this particular ``warning`` as an error. This doesn't turn off
     the warning itself.
 
diff --git a/src/manual/warnings/warnings_list.rst b/src/manual/warnings/warnings_list.rst
index b3a05ee8..48870904 100644
--- a/src/manual/warnings/warnings_list.rst
+++ b/src/manual/warnings/warnings_list.rst
@@ -11,7 +11,7 @@
     character class makes no sense: it should always fail. However, for
     backwards compatibility reasons ``re2c`` allows empty character class and
     treats it as empty string. Use ``--empty-class`` option to change default
-    behaviour.
+    behavior.
 
 ``-Wmatch-empty-string``
     Warn if regular expression in a rule is
@@ -21,7 +21,7 @@
 
 ``-Wswapped-range``
     Warn if range lower bound is greater that upper
-    bound. Default ``re2c`` behaviour is to silently swap range bounds.
+    bound. Default ``re2c`` behavior is to silently swap range bounds.
 
 ``-Wundefined-control-flow``
     Warn if some input strings cause undefined
-- 
2.40.0