From: Ulya Trofimovich Date: Fri, 6 Nov 2015 14:19:21 +0000 (+0000) Subject: Added [-Wundefined-control-flow] description. X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=f267ef10010e5551ec0aab8c1e60a31a2d1b4e7e;p=re2c Added [-Wundefined-control-flow] description. --- diff --git a/src/css/default.css b/src/css/default.css index 3a3c01ce..2c85c05f 100644 --- a/src/css/default.css +++ b/src/css/default.css @@ -12,6 +12,7 @@ h1, h2, h3, h4, h5, h6 { } h1 { border-bottom: groove gray; + padding-top: 1em; } h1.title { font-size: 2.5em; diff --git a/src/examples/example_04.rst b/src/examples/example_04.rst index be0e0fec..eabad615 100644 --- a/src/examples/example_04.rst +++ b/src/examples/example_04.rst @@ -17,7 +17,7 @@ a couple of dereferences and conditional jumps. No overhead. ``:)`` Notes: -* Configurations and definitions (lines 20 - 26) are not scoped to a single re2c block --- they are global. +* Configurations and definitions (lines 20 - 26) are not scoped to a single re2c block — they are global. Each block may override configurations, but this affects global scope. * Blocks don't have to be in the same function: they can be in separate functions or elsewhere as long as the exposed interface fits into lexical scope. diff --git a/src/manual/warnings/home.rst b/src/manual/warnings/home.rst index 4307a92c..c4307239 100644 --- a/src/manual/warnings/home.rst +++ b/src/manual/warnings/home.rst @@ -1,4 +1,4 @@ .. |[manual_warnings_home]| replace:: [home] -.. _[manual_warnings_home]: ../../index.html -.. header:: |[manual_warnings_home]|_ `[Manual] <../manual.html>`_ `[Warnings] `_ -.. footer:: |[manual_warnings_home]|_ `[Manual] <../manual.html>`_ `[Warnings] `_ +.. _[manual_warnings_home]: ../../../index.html +.. header:: |[manual_warnings_home]|_ `[Manual] <../../manual.html>`_ `[Warnings] <../warnings.html>`_ +.. footer:: |[manual_warnings_home]|_ `[Manual] <../../manual.html>`_ `[Warnings] <../warnings.html>`_ diff --git a/src/manual/warnings/undefined_control_flow/default_vs_any.rst b/src/manual/warnings/undefined_control_flow/default_vs_any.rst new file mode 100644 index 00000000..413e144f --- /dev/null +++ b/src/manual/warnings/undefined_control_flow/default_vs_any.rst @@ -0,0 +1,62 @@ +Difference between ``*`` and ``[^]`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When the the world was young and re2c didn't have default rule ``*`` (that is, before re2c-0.13.7), +everyone used ``[^]`` as default rule: + +.. code-block:: cpp + + /*!re2c + // ... normal rules ... + [^] { return "any"; } + */ + +``[^]`` is just an ordinary rule: it matches any character and has normal priority (so it should be the last rule). +If other rules didn't match, ``[^]`` will match and consume one character. + +But exactly what is a *character*? +First, an abstract number that is assigned some sacred meaning in current encoding — *code point*. +Second, a minimal piece of information (say, combination of bits) that can represent a unit of encoded text — *code unit*. +Rules are defined in terms of code points. +Input is measured in code units. +In fixed-width encodings (such as ASCII, EBCDIC, UCS-2, UTF-32, etc.) there is one-to-one correspondence between code points and code units. +In variable-width encodings (such as UTF-8, UTF-16, etc.) code points map to code unit sequences of differing length. + +``[^]`` rule matches any code point. +In fixed-width encodings it covers all code units and consumes exactly one of them. +In variable-width encodings it consumes variable number of code units and may not match some of them. +The example above compiles without warnings with any fixed-width encoding (ASCII by default). +However, with UTF-8 encoding ```re2c -i8 -Wundefined-control-flow``` says: + +.. code-block:: + + re2c: warning: line 3: control flow is undefined for strings that match + '[\x80-\xC1\xF5-\xFF]' + '\xF0 [\x0-\x8F\xC0-\xFF]' + '[\xE1-\xEF] [\x0-\x7F\xC0-\xFF]' + '\xF4 [\x0-\x7F\x90-\xFF]' + '\xE0 [\x0-\x9F\xC0-\xFF]' + '[\xF1-\xF3] [\x0-\x7F\xC0-\xFF]' + '[\xC2-\xDF] [\x0-\x7F\xC0-\xFF]' + '\xE0 [\xA0-\xBF] [\x0-\x7F\xC0-\xFF]' + ... and 7 more, use default rule '*' [-Wundefined-control-flow] + +It shows us the patterns that must never appear in valid UTF-8 encoded text. +If the input is not valid UTF-8, lexer behaviour is undefined (most likely it will end up with segfault). +One would expect that with UTF-16 (another variable-width encoding) re2c will also report a warning, but it doesn't. +This is because by default re2c treats Unicode surrogates as normal code points (for backwards compatibility reasons). +If we tell re2c to exclude surrogates, ```re2c -ix --encoding-policy fail -Wundefined-control-flow``` will warn: + +.. code-block:: + + re2c: warning: line 3: control flow is undefined for strings that match + '[\xDC00-\xDFFF]' + '[\xD800-\xDBFF] [\x0-\xDBFF\xE000-\xFFFF]' + , use default rule '*' [-Wundefined-control-flow] + +As you see, it can get quite subtle. +One should always use the true default rule ``*`` (it matches any code unit regardless of encoding, +consumes a single code unit no matter what and always has the lowest priority). +Note that ``*`` is a builtin hack: it cannot be expressed through ordinary rules. + + diff --git a/src/manual/warnings/undefined_control_flow/how_it_works.rst b/src/manual/warnings/undefined_control_flow/how_it_works.rst new file mode 100644 index 00000000..e6eaac90 --- /dev/null +++ b/src/manual/warnings/undefined_control_flow/how_it_works.rst @@ -0,0 +1,16 @@ +How it works +~~~~~~~~~~~~ + +Every path in the generated DFA must contain at least one accepting state, +otherwise it causes undefined behaviour and should be reported. +re2c walks DFA in deep-first search and checks all paths. +Each branch of search aborts as soon as it meets accepting state. +Most of the real-world programs only forget to handle a few cases, +so almost all branches abort soon and search takes very little time even for a large DFA. +In pathological cases re2c avoids exponential time and space +consumption by placing an upper bound on the number of faulty patterns. +The shortest patterns are reported first. + +Note that the analyses is done anyway. +The option ``-Wundefined-control-flow`` only controls if the warning is reported or not. + diff --git a/src/manual/warnings/undefined_control_flow/real_world.rst b/src/manual/warnings/undefined_control_flow/real_world.rst new file mode 100644 index 00000000..5baa9872 --- /dev/null +++ b/src/manual/warnings/undefined_control_flow/real_world.rst @@ -0,0 +1,19 @@ +Real-world examples +~~~~~~~~~~~~~~~~~~~ + +Many real-world examples deal with preprocessed input, +so they make strong assumptions about the input form or character set. +These assumptions may or may not be valid under certain circumstances; +however, double-check won't hurt. +Even it you are absolutely sure that default case is impossible, do handle it. +**It adds no overhead.** +No additional checks and transitions. +It simply binds code to default label. + +I found ``[-Wundefined-control-flow]`` warnings in many real-world programs +(including some of the PHP lexers and re2c own lexer). +Mostly these are minor issues like forgetting to handle newlines or zeroes in already preprocessed input, +but it's curious how they creeped into the code. +I bet they were just forgotten and not omitted for a good reason. ``:)`` + + diff --git a/src/manual/warnings/undefined_control_flow/simple_example.rst b/src/manual/warnings/undefined_control_flow/simple_example.rst new file mode 100644 index 00000000..cbc822eb --- /dev/null +++ b/src/manual/warnings/undefined_control_flow/simple_example.rst @@ -0,0 +1,71 @@ +A simple example +~~~~~~~~~~~~~~~~ + +Say, we want to match ``'a'``: + +.. code-block:: cpp + + /*!re2c + "a" { return 'a'; } + */ + +```re2c -i -Wundefined-control-flow```: + +.. code-block:: cpp + + /* Generated by re2c 0.14.1.dev on Thu Nov 5 14:35:46 2015*/ + + { + YYCTYPE yych; + if (YYLIMIT <= YYCURSOR) YYFILL(1); + yych = *YYCURSOR; + switch (yych) { + case 'a': goto yy3; + default: goto yy2; + } + yy2: + yy3: + ++YYCURSOR; + { return 'a'; } + } + +Clearly this is not what we want: this code matches any letter, not only ``'a'``. +re2c grumbles something about undefined control flow and says that default ``*`` rule won't hurt: + +.. code-block:: + + re2c: warning: line 3: control flow is undefined for strings that match '[\x0-\x60\x62-\xFF]', use default rule '*' [-Wundefined-control-flow] + +Let's add it: + +.. code-block:: cpp + + /*!re2c + * { return '*'; } + "a" { return 'a'; } + */ + +Now that's better: + +.. code-block:: cpp + + /* Generated by re2c 0.14.1.dev on Thu Nov 5 14:35:08 2015*/ + + { + YYCTYPE yych; + if (YYLIMIT <= YYCURSOR) YYFILL(1); + yych = *YYCURSOR; + switch (yych) { + case 'a': goto yy4; + default: goto yy2; + } + yy2: + ++YYCURSOR; + { return '*'; } + yy4: + ++YYCURSOR; + { return 'a'; } + } + +Note that default rule brings no overhead: it simply binds code to default label. + diff --git a/src/manual/warnings/undefined_control_flow/wundefined_control_flow.rst b/src/manual/warnings/undefined_control_flow/wundefined_control_flow.rst new file mode 100644 index 00000000..f184c4b2 --- /dev/null +++ b/src/manual/warnings/undefined_control_flow/wundefined_control_flow.rst @@ -0,0 +1,11 @@ +[-Wundefined-control-flow] +-------------------------- + +.. include:: ../home.rst +.. include:: ../../../contents.rst + +.. include:: simple_example.rst +.. include:: how_it_works.rst +.. include:: default_vs_any.rst +.. include:: real_world.rst + diff --git a/src/manual/warnings/warnings.rst b/src/manual/warnings/warnings.rst index 02eb4347..832d5633 100644 --- a/src/manual/warnings/warnings.rst +++ b/src/manual/warnings/warnings.rst @@ -6,7 +6,7 @@ Warnings ★ -* `[-Wundefined-control-flow] `_ +* `[-Wundefined-control-flow] `_ * `[-Wunreachable-rules] `_ * `[-Wcondition-order] `_ * `[-Wuseless-escape] `_ diff --git a/src/manual/warnings/wundefined_control_flow.rst b/src/manual/warnings/wundefined_control_flow.rst deleted file mode 100644 index 751fe24b..00000000 --- a/src/manual/warnings/wundefined_control_flow.rst +++ /dev/null @@ -1,17 +0,0 @@ -[-Wundefined-control-flow] --------------------------- - -.. include:: home.rst - -.. code-block:: cpp - - /*!re2c - [1-9][0-9]* {} - */ - - -.. code-block:: cpp - - /*!re2c - [^] {} - */