Handle cases when rename() fails because the destination file exists.
In C/C++ the behaviour of rename() is implementation-defined. POSIX and
Windows behave differently when the destination file exists: POSIX says
rename() should overwrite it, but Windows says rename() should fail.
Consistently convert all newlines in the generated file to Unix-style LF.
Some newlines originate in user-defined code (including semantic actions
and code fragments in configurations and directives), and some are
generated by re2c itself. To maintain consistency, we convert all newlines
to LF when writing the output to a file.
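A hedged sketch of the normalization (a standalone helper, not the actual
re2c code): both CR LF and bare CR are replaced with LF before the text
reaches the output file.

    #include <string>

    // Sketch: convert CR LF and bare CR to LF.
    static std::string to_unix_newlines(const std::string &s)
    {
        std::string out;
        out.reserve(s.size());
        for (size_t i = 0; i < s.size(); ++i) {
            if (s[i] == '\r') {
                out.push_back('\n');
                if (i + 1 < s.size() && s[i + 1] == '\n') ++i; // skip LF of CR LF
            } else {
                out.push_back(s[i]);
            }
        }
        return out;
    }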
Respect platforms with 32-bit 'long' and 64-bit pointers.
The previously used ~0U works incorrectly in such cases: ~0U equals
0xFFFFffff, while the maximum (numerical) pointer value is
0xFFFFffffFFFFffff. This caused incorrect comparison results when tracking
include files that have been fully processed and can be removed from the
include stack.
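To illustrate (the constant names are illustrative): on an LP64 platform
'unsigned' is 32 bits while pointers and size_t are 64 bits, so a
pointer-sized sentinel must be used instead of ~0U.

    #include <cstddef>
    #include <cstdint>

    static const size_t SENTINEL_BAD  = ~0U;      // 0xFFFFffff: too small on LP64
    static const size_t SENTINEL_GOOD = SIZE_MAX; // 0xFFFFffffFFFFffff on LP64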
Ulya Trofimovich [Wed, 27 Feb 2019 10:50:07 +0000 (10:50 +0000)]
libre2c: use a more efficient algorithm for computing POSIX precedence matrix.
The old algorithm was straightforward: it computed precedence for each
pair of core closure items. This has worst-case cubic complexity in the
size of the TNFA, because the closure can have O(N) states and tag history
length can also be O(N). One example is ((a?){1,1000})*.
The new algorithm avoids that by computing precedence for all pairs of
core items in one traversal of the shortest-path tree constructed by
closure. It avoids traversing dead-end branches (those that were cancelled
by the search because a better path was found) by marking used nodes of
the tree from each leaf down to the root. The new algorithm has worst-case
quadratic behaviour. It can be slightly slower in simple cases when the
number of core items is low and histories are short.
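A hedged sketch of the leaf-to-root marking step described above (names
are illustrative, and this is only the marking part, not the full
precedence computation): every node of the shortest-path tree keeps a link
to its parent, and only nodes lying on a path to some core item are marked
as live, so the precedence-computing traversal can skip dead-end branches.

    #include <vector>

    struct TreeNode {
        int  parent; // index of the predecessor in the tree, -1 for the root
        bool live;   // lies on a path from the root to some core item
    };

    // Walk from each core item (leaf) down to the root, stopping early as
    // soon as an already-marked node is reached (all nodes are assumed to
    // start with live == false).
    static void mark_live_paths(std::vector<TreeNode> &tree,
                                const std::vector<int> &core_leaves)
    {
        for (size_t i = 0; i < core_leaves.size(); ++i) {
            for (int n = core_leaves[i]; n != -1 && !tree[n].live; n = tree[n].parent) {
                tree[n].live = true;
            }
        }
    }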
Ulya Trofimovich [Wed, 20 Feb 2019 17:00:47 +0000 (17:00 +0000)]
libre2c: found pathological case for constant-memory POSIX algorithms.
The regexp is ((a?){1,200})*, and the input string is just "a".
Matching takes cubic time in the size of the counter. This is caused by
the quadratic-time computation of the precedence matrix on each step (the
number of TNFA states in the closure approaches the TNFA size), multiplied
by the length of the compared histories (which also approaches the TNFA
size). Trie-based algorithms are not affected, but they consume memory
proportional to the length of the input string, and so are also not
practical.
Ulya Trofimovich [Tue, 19 Feb 2019 21:51:43 +0000 (21:51 +0000)]
No need to fully unwind tag histories when calculating POSIX precedence.
Histories are stored in a trie. Each history is represented by an index
to its tail in the trie. In order to compare two histories, we can simply
unwind both indices, step by step, until they coincide or the root index
is reached.
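A hedged sketch of this representation (names are illustrative): each trie
node keeps its tag and the index of its predecessor, and a whole history
is referred to simply by the index of its tail node.

    #include <vector>
    #include <cstdint>

    // Histories share a common prefix tree (trie); node 0 is the root.
    struct HistNode {
        uint32_t pred; // index of the predecessor node
        int      tag;  // tag recorded at this step
    };

    // Append one step to a history; the returned index identifies the
    // extended history.
    static uint32_t append_step(std::vector<HistNode> &trie,
                                uint32_t tail, int tag)
    {
        HistNode n = { tail, tag };
        trie.push_back(n);
        return static_cast<uint32_t>(trie.size() - 1);
    }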
Ulya Trofimovich [Mon, 18 Feb 2019 16:42:07 +0000 (16:42 +0000)]
libre2c: compare histories in a trie without reconstructing them.
Previously we used to unwind each history, store it as a linear sequence
in a separate buffer, and then iterate over the buffers from the beginning
of the histories to find the fork point.
This reconstruction step can be avoided, because we can find the fork by
tracing both histories together from tail to root, element by element,
until their indices coincide (this is the fork point). This is sometimes
called the "two-fingers" algorithm.
Ulya Trofimovich [Mon, 18 Feb 2019 10:52:26 +0000 (10:52 +0000)]
Correctly set values on main diagonal of POSIX precedence matrix (configuration compared to itself).
This doesn't show any difference on the tests, probably because previously
these values were set to zero, which unpacks to symmetrical (though
incorrect) values and thus doesn't affect the comparison.
Ulya Trofimovich [Thu, 14 Feb 2019 12:52:50 +0000 (12:52 +0000)]
Assign indices to core TNFA states (those that are final or have transitions on symbols).
During epsilon-closure construction, many operations only need to be
performed for core states. This also saves some memory: e.g. the
precedence matrix can be indexed by core states.
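A hedged sketch of the indexing (names are illustrative): only core states
receive a dense index, and that index is used to address rows and columns
of the precedence matrix.

    #include <vector>

    struct State {
        bool core;   // final, or has transitions on symbols
        int  coreid; // dense index among core states, -1 for non-core states
    };

    // Assign consecutive indices to core states; the returned count is the
    // dimension of the precedence matrix.
    static size_t assign_core_indices(std::vector<State> &states)
    {
        int next = 0;
        for (size_t i = 0; i < states.size(); ++i) {
            states[i].coreid = states[i].core ? next++ : -1;
        }
        return static_cast<size_t>(next);
    }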
libre2c: speedup of POSIX NFA matcher by lazy computation and caching of precedence.
A lot of the match time was taken by eagerly computing precedence values
on each step for each pair of closure states. Most of these values are not
needed: the RE may be unambiguous, or the given input string may be
unambiguous; and even if there is ambiguity, it may take only a few
comparisons to resolve it. All the rest is wasted effort.
We can avoid it by delaying precedence computation until necessary, and
then unwinding all the steps backwards, computing precedence for each step
and caching the computed values (so that the same pair of histories is not
compared twice). It is still the same incremental comparison as with
precedence matrices: we compare step by step, going from the fork frame to
the join frame. The result of the comparison on each step is folded into a
triple of numbers and recorded in the cache. It is important that we
record each step, not just the first one, because the next pair of
ambiguous histories may unwind to the same pair of prefixes that was
compared before.
For all this to work, it is necessary to keep the whole history until the
very end, because at any step we might need to unwind an arbitrary number
of steps back. We also need to address individual subhistories efficiently
in order to use them as keys in the cache. All this is achieved by storing
history in the form of a trie and addressing individual histories by
indices in the trie. We also use the trie to compute the resulting tag
values (instead of storing tags in registers at each step).
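A hedged sketch of the cache (types and names are illustrative, and the
one-step comparison itself is left out): results are keyed by the pair of
subhistory indices in the trie, and the value is the folded triple of
numbers described above.

    #include <cstdint>
    #include <map>
    #include <utility>

    struct Prec { int32_t l, r, o; }; // illustrative names for the folded triple

    typedef std::map<std::pair<uint32_t, uint32_t>, Prec> PrecCache;

    // Return true and fill 'out' if this pair of subhistories has already
    // been compared; otherwise the caller computes the value and records it.
    static bool cached(const PrecCache &cache, uint32_t h1, uint32_t h2, Prec &out)
    {
        PrecCache::const_iterator it = cache.find(std::make_pair(h1, h2));
        if (it == cache.end()) return false;
        out = it->second;
        return true;
    }

    static void record(PrecCache &cache, uint32_t h1, uint32_t h2, const Prec &p)
    {
        cache[std::make_pair(h1, h2)] = p;
    }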