``re2c:eof = -1;``
Specifies the sentinel symbol used with EOF rule ``$`` to check for the end
of input in the generated lexer. Default value is ``-1`` (EOF rule is not
- used). Other possible values include all valid code points. Only decimal
+ used). Other possible values include all valid code units. Only decimal
numbers are recognized.
+``re2c:sentinel = -1;``
+ Specifies the sentinel symbol used with the sentinel method of checking for
+ the end of input in the generated lexer (the case when bounds checking
+ is disabled with ``re2c:yyfill:enable = 0;`` and EOF rule ``$`` is not
+ used). This configuration does not affect code generation. It is used by
+ re2c to verify that the sentinel symbol is not allowed in the middle of the
+ rule, and thus prevent possible reads past the end of buffer and crashes in
+ the generated lexer. Default value is ``-1``: in this case re2c assumes that
+ the sentinel symbol is ``0`` (which is by far the most common case). Other
+ possible values include all valid code units. Only decimal numbers are
+ recognized.
+
``re2c:flags:8`` or ``re2c:flags:utf-8``
Same as ``-8 --utf-8`` command-line option.
allow ``NULL`` in the middle of lexemes. Sentinel method is very efficient,
because the lexer does not need to perform any additional checks for the end of
input --- it comes naturally as a part of processing the next character.
+It is very important that the sentinel symbol is not allowed in the middle of
+the rule --- otherwise on some inputs the lexer may read past the end of buffer
+and crash or cause memory corruption. Re2c verifies this automatically.
+Use ``re2c:sentinel`` configuration to specify which sentinel symbol is used.
Below is an example of using sentinel method. Configuration
``re2c:yyfill:enable = 0;`` suppresses generation of end-of-input checks and
--- /dev/null
+[-Wsentinel-in-midrule]
+-----------------------
+
+When using the sentinel method of checking for the end of input, it is easy to
+forget that the sentinel symbol must not be allowed in the middle of a rule.
+For example, the following code tries to match single-quoted strings. It allows
+any character except the single quote to occur in the string, including the
+terminating ``NULL``. As a result, the generated lexer works as expected on
+well-formed input like ``'aaa'\0``, but things go wrong on ill-formed input like
+``'aaa\0`` (where the closing single quote is missing). The lexer reaches the
+terminating ``NULL``, assumes it is a part of the single-quoted string, and
+continues reading bytes from memory past the end of the buffer. Eventually it
+either terminates with a memory access violation or, worse, accidentally hits a
+single quote and takes it for the end of the string.
+
+.. code-block:: cpp
+ :linenos:
+
+ #include <assert.h>
+
+ int lex(const char *YYCURSOR)
+ {
+ /*!re2c
+ re2c:define:YYCTYPE = char;
+ re2c:yyfill:enable = 0;
+ ['] [^']* ['] { return 0; }
+ * { return 1; }
+ */
+ }
+
+ int main()
+ {
+ assert(lex("'good'") == 0);
+ assert(lex("'bad") == 1);
+ return 0;
+ }
+
+For this code re2c reports a warning rather than an error: without the
+``re2c:sentinel`` configuration it cannot be certain that ``NULL`` is the
+sentinel symbol, although this is by far the most common case.
+
+.. code-block:: none
+
+ $ re2c -Wsentinel-in-midrule example.re -oexample.c
+ example.re:9:18: warning: sentinel symbol 0 occurs in the middle of the rule
+ (note: if a different sentinel symbol is used, specify it with 're2c:sentinel' configuration) [-Wsentinel-in-midrule]
+
+The note in the warning suggests specifying the sentinel symbol with the
+``re2c:sentinel`` configuration. The following version of the example does so.
+
+.. code-block:: cpp
+ :linenos:
+
+ #include <assert.h>
+
+ int lex(const char *YYCURSOR)
+ {
+ /*!re2c
+ re2c:define:YYCTYPE = char;
+ re2c:yyfill:enable = 0;
+ re2c:sentinel = 0;
+ ['] [^']* ['] { return 0; }
+ * { return 1; }
+ */
+ }
+
+ int main()
+ {
+ assert(lex("'good'") == 0);
+ assert(lex("'bad") == 1);
+ return 0;
+ }
+
+The warning has turned into an error, because re2c now knows for certain that
+the sentinel symbol can occur in the middle of the rule.
+
+.. code-block:: none
+
+ $ re2c -Wsentinel-in-midrule example.re -oexample.c
+ example.re:10:18: error: sentinel symbol 0 occurs in the middle of the rule [-Werror-sentinel-in-midrule]
+
+The code can be fixed by excluding ``NULL`` from the set of symbols allowed in
+the middle of the string: ``['] [^'\x00]* [']``. If it is necessary to allow
+all symbols, the sentinel method is not sufficient and a more powerful method
+of checking for the end of input should be used, such as the EOF rule ``$``.
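+
+To see why excluding the sentinel is enough, the matching logic of the fixed
+rule can be sketched by hand in plain C. This is only an illustration, not the
+code that re2c generates, and the function name ``lex_fixed`` is made up for
+this sketch. It assumes a ``NULL``-terminated input buffer.
+
+.. code-block:: c
+
+ /* Hand-written sketch of the rule ['] [^'\x00]* ['] (not re2c output). */
+ static int lex_fixed(const char *cursor)
+ {
+     /* The opening single quote is required. */
+     if (*cursor++ != '\'') return 1;
+
+     /* Body: any byte except the single quote and the sentinel (0). */
+     while (*cursor != '\'' && *cursor != '\0') ++cursor;
+
+     /* A closing quote completes the match. The sentinel stops the loop
+        before any byte past the end of the buffer is read, and the input
+        falls through to the default rule. */
+     return *cursor == '\'' ? 0 : 1;
+ }
+
+With this logic ``lex_fixed("'good'")`` returns 0 and ``lex_fixed("'bad")``
+returns 1, and in neither case is a byte after the terminating ``NULL`` read.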
+
``-Wnondeterministic-tags``
Warn if a tag has ``n``-th degree of nondeterminism, where ``n`` is greater than 1.
+``-Wsentinel-in-midrule``
+ Warn if the sentinel symbol occurs in the middle of a rule --- this may
+ cause reads past the end of buffer, crashes or memory corruption in the
+ generated lexer. This warning is only applicable if the sentinel method of
+ checking for the end of input is used.
+ It is turned into an error if the ``re2c:sentinel`` configuration is used.
+