Ulya Trofimovich [Thu, 14 Feb 2019 12:52:50 +0000 (12:52 +0000)]
Assign indices to core TNFA states (those that are final or have transitions on symbols).
During epsilon closure constructions, many operations only need to be
performed for core states. This allows to save some memory, e.g.
precedence matrix can be indexed by core states.
libre2c: speedup of POSIX NFA matcher by lazy computation and caching of precedence.
A lot of the match time was taken by eagerly computing precedence values
on each step for each pair of closure states. Most of these values are not
needed: RE may be unambigous; or the given input string may be unambigous,
even if there is ambiguity, it may take only a few comparisons to resolve.
All the rest is wasted effort.
We can avoid it by delaying precedence computation until necessary, and
then unwinding all the steps backwards, computing precedence for each step
and caching the computed values (so that the same pair of histories is not
compared twice). It is still the same incremental comparison as with
precedence matrices: we compare step by step, going from the fork frame to
the join frame. The result of comparison on each step is folded to a triple
of numbers and recorded in cache. It is important that we do record each
step, not just the first step, because the next pair of ambiguous histories
may unwind to the same pair of prefixes that was compared before.
For all this to work, it is necessary that we can store all history until
the very end, because at any step we might need to unwind an arbitrary
number of steps back. We also need to address individual subhistories
efficiently in order to use them as keys in the cache. All this is achieved
by storing history in the form of a trie and addressing individual
histories by indices in the trie. We also use trie to compute the resulting
tag values (instead of storing tags in registers at each step).
Differentiate between "POSIX syntax" and "POSIX semantics" options.
This is needed because we may want to use POSIX syntax with leftmost
greedy disambiguation. In that case, we still need parentheses-as-tags
attitude, but we don't want all the overhead and heuristics to spped
up POSIX closure.
libre2c_posix: added test for leftmost TNFA matcher.
Ulya Trofimovich [Sun, 27 Jan 2019 10:23:23 +0000 (10:23 +0000)]
libre2c_posix: extended regex_t structure to hold more submatch data.
Added fields:
- re_nsub: total number of submatch groups, required by POSIX standard
- pmatch: buffer for submatch results, usually supplied by the user to
regexec(), but we allow to do the allocation and storage in regex_t.
This is convenient for users that have hard time managing memory, e.g.
java bindings to libre2c_posix.
- regs: buffer for internal use by regexec(), strored in regex_t to
avoid repeated memory allocation on each call to regexec() with the
same regex.
Ulya Trofimovich [Sun, 27 Jan 2019 10:06:22 +0000 (10:06 +0000)]
Adjusted build system to correctly build DLLs for windows.
Adjustments:
- configure.ac: pass win32-dll option to LT_INIT
- Makefile_libre2c_posix.am: use -no-undefined in LDFLAGS
- use slibtool: https://github.com/midipix-project/slibtool for windows
builds The problem with libtool is that it doesn't allow to link libstdc++
and libgcc statically, which is necessary to build portable DLLs with
Mingw. Libtool adds -nostdlib option to LDFLAGS and links some predefined
objects that pull in dependency on dynamic libstdc++ and libgcc, even in
the presence of -static-libstdc++ -static-libgcc.
Ulya Trofimovich [Thu, 17 Jan 2019 22:52:04 +0000 (22:52 +0000)]
Fixed operator precedence with --flex-syntax option.
Operator precedence was broken because re2c tried to parse whole strings
of characters at once instead of parsing one character at a time (in
much the same way as it would parse properly quotes string literals in
the original re2c format). This caused ab* being parsed as (ab)*, which
is clearly wrong (should be a(b)*).
This fixes bug #242:
"Operator precedence with --flex-syntax is broken."
Ulya Trofimovich [Sat, 12 Jan 2019 20:32:15 +0000 (20:32 +0000)]
libre2c_posix: use C-array initializers instead of variadic functions for offset lists.
Variadic functions cause subtle toolchain-specific errors because integer
literals used to initialize offsets may have different type than the type
expected by variadic function (e.g. int vs long).
Small simplifications in GOR1 initialization phase.
Instead of using two stacks to weed out low-precedence initial
configurations with duplicate target state, let them remain on the
bottom of topsort stack, which effectively makes them ignored.
Ulya Trofimovich [Mon, 31 Dec 2018 12:08:42 +0000 (12:08 +0000)]
Added GTOP SSSP algorithm for computing epsilon-closure with POSIX disambiguation.
GTOP SSSP meand "Global Topsort Single Source Shortes Path".
It is well known that SSSP can be solved in linear time on DAGs (directed
acyclic graphs) by exploring graph nodes in topological order. In our case
TNFA is not a DAG (it may have cycles), but it is possible to compute fake
topologcal order by ignoring back edges.
The algorithm works by having a priority queue of nodes, where priorities
are indices of nodes in fake topological ordering. At each step, the node
with the minimal priority is popped from queue and explored. All nodes
reachable from it on admissible arcs are enqueued, unless they are already
on queue.
The resulting algorithm is of course not optimal: it can get stuck on
graphs with loops, because it will give priority to some of the loop nodes
compared to others for no good reason.
However the algorithm is simple and optimal for DAGs, therefore we keep it.