Ulya Trofimovich [Sun, 27 Jan 2019 10:23:23 +0000 (10:23 +0000)]
libre2c_posix: extended regex_t structure to hold more submatch data.
Added fields:
- re_nsub: total number of submatch groups, required by POSIX standard
- pmatch: buffer for submatch results, usually supplied by the user to
regexec(), but we allow to do the allocation and storage in regex_t.
This is convenient for users that have hard time managing memory, e.g.
java bindings to libre2c_posix.
- regs: buffer for internal use by regexec(), strored in regex_t to
avoid repeated memory allocation on each call to regexec() with the
same regex.
Ulya Trofimovich [Sun, 27 Jan 2019 10:06:22 +0000 (10:06 +0000)]
Adjusted build system to correctly build DLLs for windows.
Adjustments:
- configure.ac: pass win32-dll option to LT_INIT
- Makefile_libre2c_posix.am: use -no-undefined in LDFLAGS
- use slibtool: https://github.com/midipix-project/slibtool for windows
builds The problem with libtool is that it doesn't allow to link libstdc++
and libgcc statically, which is necessary to build portable DLLs with
Mingw. Libtool adds -nostdlib option to LDFLAGS and links some predefined
objects that pull in dependency on dynamic libstdc++ and libgcc, even in
the presence of -static-libstdc++ -static-libgcc.
Ulya Trofimovich [Thu, 17 Jan 2019 22:52:04 +0000 (22:52 +0000)]
Fixed operator precedence with --flex-syntax option.
Operator precedence was broken because re2c tried to parse whole strings
of characters at once instead of parsing one character at a time (in
much the same way as it would parse properly quotes string literals in
the original re2c format). This caused ab* being parsed as (ab)*, which
is clearly wrong (should be a(b)*).
This fixes bug #242:
"Operator precedence with --flex-syntax is broken."
Ulya Trofimovich [Sat, 12 Jan 2019 20:32:15 +0000 (20:32 +0000)]
libre2c_posix: use C-array initializers instead of variadic functions for offset lists.
Variadic functions cause subtle toolchain-specific errors because integer
literals used to initialize offsets may have different type than the type
expected by variadic function (e.g. int vs long).
Small simplifications in GOR1 initialization phase.
Instead of using two stacks to weed out low-precedence initial
configurations with duplicate target state, let them remain on the
bottom of topsort stack, which effectively makes them ignored.
Ulya Trofimovich [Mon, 31 Dec 2018 12:08:42 +0000 (12:08 +0000)]
Added GTOP SSSP algorithm for computing epsilon-closure with POSIX disambiguation.
GTOP SSSP meand "Global Topsort Single Source Shortes Path".
It is well known that SSSP can be solved in linear time on DAGs (directed
acyclic graphs) by exploring graph nodes in topological order. In our case
TNFA is not a DAG (it may have cycles), but it is possible to compute fake
topologcal order by ignoring back edges.
The algorithm works by having a priority queue of nodes, where priorities
are indices of nodes in fake topological ordering. At each step, the node
with the minimal priority is popped from queue and explored. All nodes
reachable from it on admissible arcs are enqueued, unless they are already
on queue.
The resulting algorithm is of course not optimal: it can get stuck on
graphs with loops, because it will give priority to some of the loop nodes
compared to others for no good reason.
However the algorithm is simple and optimal for DAGs, therefore we keep it.
Ulya Trofimovich [Sat, 29 Dec 2018 00:06:28 +0000 (00:06 +0000)]
Use correct order when unreading files from lexer buffer.
In lexer buffer nested files come before outer files. In lexer file
stack, however, outer files go before nested files (nested are at the
top). We want to break from the unreading cycle early, therefore we
proceed in reverse order of file offsets in buffer and break as soon
as the end offset is less than cursor (current position).
Ulya Trofimovich [Tue, 13 Nov 2018 23:42:11 +0000 (23:42 +0000)]
Fixed a couple of lexer/parser errors in flex mode (-F option).
This fixes bug #229: re2c option -F (flex syntax) broken,
reported by Robert van Engelen.
A well-formed example that caused syntax error (flex-style raw literal
followed by one or more spaces and a curly brace):
/*!re2c
a {}
*/
The faulty behaviour goes back as far as re2c-0.13.6 (and supposedly
before that): in flex mode, raw literal may occur in various contexts
both as a regexp (string literal) and an identifier (named definition,
condiiton name). RE2C uses lookahead to infer the context and determine
the appropriate type of lexer token, but it missed some cases.
The fix has two sides. First, if reduces the number of contexts where
the general lexer may encounter raw literal (by using specialized lexers
for condition lists <x,y,...,z> and condition goto => and :=>). Second,
it fixes the lookahead regexps used for context inference.
Also added a bunch of tests (generated by a script).