From: Ulya Trofimovich
Date: Mon, 26 Aug 2019 08:53:32 +0000 (+0100)
Subject: Synced with master and added documentation for -Wsentinel-in-midrule warning.
X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=302ee11bf002a3f911750b195fd0271a29a805e4;p=re2c

Synced with master and added documentation for -Wsentinel-in-midrule warning.
---

diff --git a/src/manual/configurations/configurations.rst_ b/src/manual/configurations/configurations.rst_
index 54dea327..1299cf65 100644
--- a/src/manual/configurations/configurations.rst_
+++ b/src/manual/configurations/configurations.rst_
@@ -148,9 +148,21 @@
 ``re2c:eof = -1;``
     Specifies the sentinel symbol used with EOF rule ``$`` to check for the end
     of input in the generated lexer. Default value is ``-1`` (EOF rule is not
-    used). Other possible values include all valid code points. Only decimal
+    used). Other possible values include all valid code units. Only decimal
     numbers are recognized.
 
+``re2c:sentinel = -1;``
+    Specifies the sentinel symbol used with the sentinel method of checking for
+    the end of input in the generated lexer (the case when bounds checking
+    is disabled with ``re2c:yyfill:enable = 0;`` and EOF rule ``$`` is not
+    used). This configuration does not affect code generation. It is used by
+    re2c to verify that the sentinel symbol is not allowed in the middle of the
+    rule, and thus prevent possible reads past the end of buffer and crashes in
+    the generated lexer. Default value is ``-1``: in this case re2c assumes that
+    the sentinel symbol is ``0`` (which is by far the most common case). Other
+    possible values include all valid code units. Only decimal numbers are
+    recognized.
+
 ``re2c:flags:8`` or ``re2c:flags:utf-8``
     Same as ``-8 --utf-8`` command-line option.
 
diff --git a/src/manual/eof/01_sentinel.rst_ b/src/manual/eof/01_sentinel.rst_
index d55beab2..7d57cd61 100644
--- a/src/manual/eof/01_sentinel.rst_
+++ b/src/manual/eof/01_sentinel.rst_
@@ -10,6 +10,10 @@ such input is a null-terminated C-string, provided that the grammar does not
 allow ``NULL`` in the middle of lexemes. Sentinel method is very efficient,
 because the lexer does not need to perform any additional checks for the end of
 input --- it comes naturally as a part of processing the next character.
+It is very important that the sentinel symbol is not allowed in the middle of
+the rule --- otherwise on some inputs the lexer may read past the end of buffer
+and crash or cause memory corruption. Re2c verifies this automatically.
+Use ``re2c:sentinel`` configuration to specify which sentinel symbol is used.
 
 Below is an example of using sentinel method. Configuration
 ``re2c:yyfill:enable = 0;`` suppresses generation of end-of-input checks and
diff --git a/src/manual/manual.rst b/src/manual/manual.rst
index 62fb2670..2d62f4aa 100644
--- a/src/manual/manual.rst
+++ b/src/manual/manual.rst
@@ -116,6 +116,7 @@ Warnings
 .. include:: /manual/warnings/swapped_range/wswapped_range.rst
 .. include:: /manual/warnings/empty_character_class/wempty_character_class.rst
 .. include:: /manual/warnings/match_empty_string/wmatch_empty_string.rst
+.. include:: /manual/warnings/sentinel_in_midrule/wsentinel_in_midrule.rst
 
 More examples
 =============
diff --git a/src/manual/warnings/sentinel_in_midrule/wsentinel_in_midrule.rst b/src/manual/warnings/sentinel_in_midrule/wsentinel_in_midrule.rst
new file mode 100644
index 00000000..bc0c542b
--- /dev/null
+++ b/src/manual/warnings/sentinel_in_midrule/wsentinel_in_midrule.rst
@@ -0,0 +1,84 @@
+[-Wsentinel-in-midrule]
+-----------------------
+
+When using sentinel method of checking for the end of input, it is easy to
+forget that the sentinel symbol must not be allowed in the middle of the rule.
+For example, the following code tries to match single-quoted strings. It allows
+any character except the single quote to occur in the string, including
+terminating ``NULL``. As a result, the generated lexer works as expected on
+well-formed input like ``'aaa'\0``, but things go wrong on ill-formed input like
+``'aaa\0`` (where the closing single quote is missing). Lexer reaches the
+terminating ``NULL`` and assumes it is a part of the single-quoted string, so
+it continues reading bytes from memory. Eventually the lexer terminates due to
+memory access violation, or worse --- it accidentally hits a single quote and
+assumes this to be the end of the string.
+
+.. code-block:: cpp
+    :linenos:
+
+    #include <assert.h>
+
+    int lex(const char *YYCURSOR)
+    {
+        /*!re2c
+        re2c:define:YYCTYPE = char;
+        re2c:yyfill:enable = 0;
+        ['] [^']* ['] { return 0; }
+        *             { return 1; }
+        */
+    }
+
+    int main()
+    {
+        assert(lex("'good'") == 0);
+        assert(lex("'bad") == 1);
+        return 0;
+    }
+
+On this code re2c reports a warning. It cannot be certain that ``NULL`` is the sentinel
+symbol, but this is by far the most common case.
+
+.. code-block:: none
+
+    $ re2c -Wsentinel-in-midrule example.re -oexample.c
+    example.re:9:18: warning: sentinel symbol 0 occurs in the middle of the rule
+        (note: if a different sentinel symbol is used, specify it with 're2c:sentinel' configuration) [-Wsentinel-in-midrule]
+
+However, re2c suggests defining the sentinel symbol using ``re2c:sentinel``
+configuration. Let's do it.
+
+.. code-block:: cpp
+    :linenos:
+
+    #include <assert.h>
+
+    int lex(const char *YYCURSOR)
+    {
+        /*!re2c
+        re2c:define:YYCTYPE = char;
+        re2c:yyfill:enable = 0;
+        re2c:sentinel = 0;
+        ['] [^']* ['] { return 0; }
+        *             { return 1; }
+        */
+    }
+
+    int main()
+    {
+        assert(lex("'good'") == 0);
+        assert(lex("'bad") == 1);
+        return 0;
+    }
+
+The warning has turned into an error, as re2c is now certain that the code
+contains an error.
+
+.. code-block:: none
+
+    $ re2c -Wsentinel-in-midrule example.re -oexample.c
+    example.re:10:18: error: sentinel symbol 0 occurs in the middle of the rule [-Werror-sentinel-in-midrule]
+
+The code can be fixed by excluding ``NULL`` from the set of symbols allowed in
+the middle of the string: ``['] [^'\x00]* [']``. If it is necessary to allow
+all symbols, a more powerful EOF handling method should be used.
+
diff --git a/src/manual/warnings/warnings_list.rst_ b/src/manual/warnings/warnings_list.rst_
index ae381bfb..bbea5936 100644
--- a/src/manual/warnings/warnings_list.rst_
+++ b/src/manual/warnings/warnings_list.rst_
@@ -38,3 +38,10 @@
 ``-Wnondeterministic-tags``
     Warn if a tag has ``n``-th degree of nondeterminism, where ``n`` is greater
     than 1.
+``-Wsentinel-in-midrule``
+    Warn if the sentinel symbol occurs in the middle of a rule --- this may
+    cause reads past the end of buffer, crashes or memory corruption in the
+    generated lexer. This warning is only applicable if the sentinel method of
+    checking for the end of input is used.
+    It is set to an error if ``re2c:sentinel`` configuration is used.
+
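
Note on the patch: the failure mode described in the new ``-Wsentinel-in-midrule`` page, and the effect of the suggested fix ``['] [^'\x00]* [']``, can be approximated with two hand-written C functions. This is only a sketch of the matching logic under stated assumptions. It is not code generated by re2c, and the names ``lex_unsafe`` and ``lex_fixed`` are hypothetical, introduced here for illustration. To keep the demonstration free of undefined behaviour, the "memory past the terminator" is simulated by extra bytes inside the same array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the rule ['] [^']* ['] (NOT re2c output).
 * The string body accepts every byte except the quote, including 0,
 * so the scan does not stop at the terminating NUL. */
int lex_unsafe(const char *cursor)
{
    assert(cursor != NULL);
    if (*cursor++ != '\'') {
        return 1;                   /* default rule: no opening quote */
    }
    while (*cursor != '\'') {       /* the sentinel 0 is skipped like any byte */
        ++cursor;
    }
    return 0;
}

/* Hypothetical stand-in for the fixed rule ['] [^'\x00]* ['] (NOT re2c output).
 * The sentinel 0 is excluded from the string body, so the scan always
 * stops at or before the terminating NUL. */
int lex_fixed(const char *cursor)
{
    assert(cursor != NULL);
    if (*cursor++ != '\'') {
        return 1;                   /* default rule: no opening quote */
    }
    while (*cursor != '\'' && *cursor != '\0') {
        ++cursor;
    }
    return *cursor == '\'' ? 0 : 1; /* closing quote found, or input ended */
}
```

With a buffer such as ``"'bad\0junk'"``, ``lex_unsafe`` walks over the embedded NUL and reports a match when it reaches the quote after ``junk``, which mirrors the "accidentally hits a single quote" case from the documentation. ``lex_fixed`` stops at the NUL and reports a failure instead. ``lex_unsafe`` must only be called on buffers that contain a later quote, as in this simulated case.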