Small simplifications in GOR1 initialization phase.
Instead of using two stacks to weed out low-precedence initial
configurations with duplicate target state, let them remain on the
bottom of topsort stack, which effectively makes them ignored.
Ulya Trofimovich [Mon, 31 Dec 2018 12:08:42 +0000 (12:08 +0000)]
Added GTOP SSSP algorithm for computing epsilon-closure with POSIX disambiguation.
GTOP SSSP meand "Global Topsort Single Source Shortes Path".
It is well known that SSSP can be solved in linear time on DAGs (directed
acyclic graphs) by exploring graph nodes in topological order. In our case
TNFA is not a DAG (it may have cycles), but it is possible to compute fake
topologcal order by ignoring back edges.
The algorithm works by having a priority queue of nodes, where priorities
are indices of nodes in fake topological ordering. At each step, the node
with the minimal priority is popped from queue and explored. All nodes
reachable from it on admissible arcs are enqueued, unless they are already
on queue.
The resulting algorithm is of course not optimal: it can get stuck on
graphs with loops, because it will give priority to some of the loop nodes
compared to others for no good reason.
However the algorithm is simple and optimal for DAGs, therefore we keep it.
Ulya Trofimovich [Sat, 29 Dec 2018 00:06:28 +0000 (00:06 +0000)]
Use correct order when unreading files from lexer buffer.
In lexer buffer nested files come before outer files. In lexer file
stack, however, outer files go before nested files (nested are at the
top). We want to break from the unreading cycle early, therefore we
proceed in reverse order of file offsets in buffer and break as soon
as the end offset is less than cursor (current position).
Ulya Trofimovich [Tue, 13 Nov 2018 23:42:11 +0000 (23:42 +0000)]
Fixed a couple of lexer/parser errors in flex mode (-F option).
This fixes bug #229: re2c option -F (flex syntax) broken,
reported by Robert van Engelen.
A well-formed example that caused syntax error (flex-style raw literal
followed by one or more spaces and a curly brace):
/*!re2c
a {}
*/
The faulty behaviour goes back as far as re2c-0.13.6 (and supposedly
before that): in flex mode, raw literal may occur in various contexts
both as a regexp (string literal) and an identifier (named definition,
condiiton name). RE2C uses lookahead to infer the context and determine
the appropriate type of lexer token, but it missed some cases.
The fix has two sides. First, if reduces the number of contexts where
the general lexer may encounter raw literal (by using specialized lexers
for condition lists <x,y,...,z> and condition goto => and :=>). Second,
it fixes the lookahead regexps used for context inference.
Also added a bunch of tests (generated by a script).
Ulya Trofimovich [Mon, 29 Oct 2018 23:00:50 +0000 (23:00 +0000)]
Fixed out of bounds read in lexer.
The error was caused by assuming that a sequence of zeroes (used for
padding in YYFILL) cannot form a valid lexeme suffix. This is not the
case with strings, as they may contain arbitrary characters. The fix
is to manually loop over string characters in lexer, stopping at each
zero to check if it's the end of input.
Found by american fuzzy lop (thanks to Henri Salo).
src/dfa/dfa.h: simplify constructor to avoid g++-3.4 bug
On g++-3.4.6 re2c tests SIGSEGVed due to use of uninitialized data:
```
$ valgrind ... ./re2c -8 a.re -o foo.c
Conditional jump or move depends on uninitialised value(s)
at 0x432F23: re2c::tcpool_t::insert(re2c::tcmd_t const*) (tcmd.cc:202)
by 0x421FDA: re2c::freeze_tags(re2c::dfa_t&) (freeze.cc:45)
by 0x43A7FF: re2c::ast_to_dfa(re2c::spec_t const&, re2c::Output&) (compile.cc:88)
by 0x43B052: push_back (stl_iterator.h:614)
by 0x43B052: re2c::compile(re2c::Scanner&, re2c::Output&, re2c::Opt&) (???:0)
by 0x449D29: main (main.cc:31)
Uninitialised value was created by a heap allocation
at 0x403252F: operator new[](unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
by 0x42FC9E: re2c::find_state(re2c::determ_context_t&) (dfa.h:37)
by 0x429BD9: re2c::dfa_t::dfa_t(re2c::nfa_t const&, re2c::opt_t const*, std::string const&, re2c::Warn&) (determinization.cc:56)
by 0x43A76C: re2c::ast_to_dfa(re2c::spec_t const&, re2c::Output&) (compile.cc:69)
by 0x43B052: push_back (stl_iterator.h:614)
by 0x43B052: re2c::compile(re2c::Scanner&, re2c::Output&, re2c::Opt&) (???:0)
by 0x449D29: main (main.cc:31)
```
the problem here arose in default array constructor:
```c++
explicit dfa_state_t(size_t nchars)
: // ...
, tcmd(new tcmd_t*[nchars + 2]()) // +2 for final and fallback epsilon-transitions
// ...
```
g++-3.4.6 can't figure out zero-initialization rule (likely a gcc bug).
The change uses non-initializing new[] and memset() instead.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Fixed bug #215 "A memory read overrun issue in s_to_n32_unsafe.cc".
The error was in the code of the test itself: the special case of zero
wasn't handled correctrly by the function that prepares input data for
the test. As a result, zero-length input string was passed to the test,
which is unexpected: the tested function is an "unsafe" one (as the
name suggests) and is meant to be used on an already validated input.