}
h1 {
border-bottom: groove gray;
+ padding-top: 1em;
}
h1.title {
font-size: 2.5em;
Notes:
-* Configurations and definitions (lines 20 - 26) are not scoped to a single re2c block --- they are global.
+* Configurations and definitions (lines 20 - 26) are not scoped to a single re2c block — they are global.
Each block may override configurations, but this affects global scope.
* Blocks don't have to be in the same function: they can be in separate functions or elsewhere
as long as the exposed interface fits into lexical scope.
.. |[manual_warnings_home]| replace:: [home]
-.. _[manual_warnings_home]: ../../index.html
-.. header:: |[manual_warnings_home]|_ `[Manual] <../manual.html>`_ `[Warnings] <warnings.html>`_
-.. footer:: |[manual_warnings_home]|_ `[Manual] <../manual.html>`_ `[Warnings] <warnings.html>`_
+.. _[manual_warnings_home]: ../../../index.html
+.. header:: |[manual_warnings_home]|_ `[Manual] <../../manual.html>`_ `[Warnings] <../warnings.html>`_
+.. footer:: |[manual_warnings_home]|_ `[Manual] <../../manual.html>`_ `[Warnings] <../warnings.html>`_
--- /dev/null
+Difference between ``*`` and ``[^]``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When the the world was young and re2c didn't have default rule ``*`` (that is, before re2c-0.13.7),
+everyone used ``[^]`` as default rule:
+
+.. code-block:: cpp
+
+ /*!re2c
+ // ... normal rules ...
+ [^] { return "any"; }
+ */
+
+``[^]`` is just an ordinary rule: it matches any character and has normal priority (so it should be the last rule).
+If other rules didn't match, ``[^]`` will match and consume one character.
+
+But exactly what is a *character*?
+First, an abstract number that is assigned some sacred meaning in current encoding — *code point*.
+Second, a minimal piece of information (say, combination of bits) that can represent a unit of encoded text — *code unit*.
+Rules are defined in terms of code points.
+Input is measured in code units.
+In fixed-width encodings (such as ASCII, EBCDIC, UCS-2, UTF-32, etc.) there is one-to-one correspondence between code points and code units.
+In variable-width encodings (such as UTF-8, UTF-16, etc.) code points map to code unit sequences of differing length.
+
+``[^]`` rule matches any code point.
+In fixed-width encodings it covers all code units and consumes exactly one of them.
+In variable-width encodings it consumes variable number of code units and may not match some of them.
+The example above compiles without warnings with any fixed-width encoding (ASCII by default).
+However, with UTF-8 encoding ```re2c -i8 -Wundefined-control-flow``` says:
+
+.. code-block::
+
+ re2c: warning: line 3: control flow is undefined for strings that match
+ '[\x80-\xC1\xF5-\xFF]'
+ '\xF0 [\x0-\x8F\xC0-\xFF]'
+ '[\xE1-\xEF] [\x0-\x7F\xC0-\xFF]'
+ '\xF4 [\x0-\x7F\x90-\xFF]'
+ '\xE0 [\x0-\x9F\xC0-\xFF]'
+ '[\xF1-\xF3] [\x0-\x7F\xC0-\xFF]'
+ '[\xC2-\xDF] [\x0-\x7F\xC0-\xFF]'
+ '\xE0 [\xA0-\xBF] [\x0-\x7F\xC0-\xFF]'
+ ... and 7 more, use default rule '*' [-Wundefined-control-flow]
+
+It shows us the patterns that must never appear in valid UTF-8 encoded text.
+If the input is not valid UTF-8, lexer behaviour is undefined (most likely it will end up with segfault).
+One would expect that with UTF-16 (another variable-width encoding) re2c will also report a warning, but it doesn't.
+This is because by default re2c treats Unicode surrogates as normal code points (for backwards compatibility reasons).
+If we tell re2c to exclude surrogates, ```re2c -ix --encoding-policy fail -Wundefined-control-flow``` will warn:
+
+.. code-block::
+
+ re2c: warning: line 3: control flow is undefined for strings that match
+ '[\xDC00-\xDFFF]'
+ '[\xD800-\xDBFF] [\x0-\xDBFF\xE000-\xFFFF]'
+ , use default rule '*' [-Wundefined-control-flow]
+
+As you see, it can get quite subtle.
+One should always use the true default rule ``*`` (it matches any code unit regardless of encoding,
+consumes a single code unit no matter what and always has the lowest priority).
+Note that ``*`` is a builtin hack: it cannot be expressed through ordinary rules.
+
+
--- /dev/null
+How it works
+~~~~~~~~~~~~
+
+Every path in the generated DFA must contain at least one accepting state,
+otherwise it causes undefined behaviour and should be reported.
+re2c walks DFA in deep-first search and checks all paths.
+Each branch of search aborts as soon as it meets accepting state.
+Most of the real-world programs only forget to handle a few cases,
+so almost all branches abort soon and search takes very little time even for a large DFA.
+In pathological cases re2c avoids exponential time and space
+consumption by placing an upper bound on the number of faulty patterns.
+The shortest patterns are reported first.
+
+Note that the analyses is done anyway.
+The option ``-Wundefined-control-flow`` only controls if the warning is reported or not.
+
--- /dev/null
+Real-world examples
+~~~~~~~~~~~~~~~~~~~
+
+Many real-world examples deal with preprocessed input,
+so they make strong assumptions about the input form or character set.
+These assumptions may or may not be valid under certain circumstances;
+however, double-check won't hurt.
+Even it you are absolutely sure that default case is impossible, do handle it.
+**It adds no overhead.**
+No additional checks and transitions.
+It simply binds code to default label.
+
+I found ``[-Wundefined-control-flow]`` warnings in many real-world programs
+(including some of the PHP lexers and re2c own lexer).
+Mostly these are minor issues like forgetting to handle newlines or zeroes in already preprocessed input,
+but it's curious how they creeped into the code.
+I bet they were just forgotten and not omitted for a good reason. ``:)``
+
+
--- /dev/null
+A simple example
+~~~~~~~~~~~~~~~~
+
+Say, we want to match ``'a'``:
+
+.. code-block:: cpp
+
+ /*!re2c
+ "a" { return 'a'; }
+ */
+
+```re2c -i -Wundefined-control-flow```:
+
+.. code-block:: cpp
+
+ /* Generated by re2c 0.14.1.dev on Thu Nov 5 14:35:46 2015*/
+
+ {
+ YYCTYPE yych;
+ if (YYLIMIT <= YYCURSOR) YYFILL(1);
+ yych = *YYCURSOR;
+ switch (yych) {
+ case 'a': goto yy3;
+ default: goto yy2;
+ }
+ yy2:
+ yy3:
+ ++YYCURSOR;
+ { return 'a'; }
+ }
+
+Clearly this is not what we want: this code matches any letter, not only ``'a'``.
+re2c grumbles something about undefined control flow and says that default ``*`` rule won't hurt:
+
+.. code-block::
+
+ re2c: warning: line 3: control flow is undefined for strings that match '[\x0-\x60\x62-\xFF]', use default rule '*' [-Wundefined-control-flow]
+
+Let's add it:
+
+.. code-block:: cpp
+
+ /*!re2c
+ * { return '*'; }
+ "a" { return 'a'; }
+ */
+
+Now that's better:
+
+.. code-block:: cpp
+
+ /* Generated by re2c 0.14.1.dev on Thu Nov 5 14:35:08 2015*/
+
+ {
+ YYCTYPE yych;
+ if (YYLIMIT <= YYCURSOR) YYFILL(1);
+ yych = *YYCURSOR;
+ switch (yych) {
+ case 'a': goto yy4;
+ default: goto yy2;
+ }
+ yy2:
+ ++YYCURSOR;
+ { return '*'; }
+ yy4:
+ ++YYCURSOR;
+ { return 'a'; }
+ }
+
+Note that default rule brings no overhead: it simply binds code to default label.
+
--- /dev/null
+[-Wundefined-control-flow]
+--------------------------
+
+.. include:: ../home.rst
+.. include:: ../../../contents.rst
+
+.. include:: simple_example.rst
+.. include:: how_it_works.rst
+.. include:: default_vs_any.rst
+.. include:: real_world.rst
+
★
-* `[-Wundefined-control-flow] <wundefined_control_flow.html>`_
+* `[-Wundefined-control-flow] <undefined_control_flow/wundefined_control_flow.html>`_
* `[-Wunreachable-rules] <wunreachable_rules.html>`_
* `[-Wcondition-order] <wcondition_order.html>`_
* `[-Wuseless-escape] <wuseless_escape.html>`_
+++ /dev/null
-[-Wundefined-control-flow]
---------------------------
-
-.. include:: home.rst
-
-.. code-block:: cpp
-
- /*!re2c
- [1-9][0-9]* {}
- */
-
-
-.. code-block:: cpp
-
- /*!re2c
- [^] {}
- */