Ulya Trofimovich [Fri, 21 Jun 2019 13:51:26 +0000 (14:51 +0100)]
libre2c: don't add nested negative tags to TNFA, as it increases its size and makes matching slower.
Instead, add only one negative tag (top-level closing one) and record
the remaining tags in its metadata. Use this metadata to update nested
negative tags during TNFA simulation.
Ulya Trofimovich [Sun, 24 Mar 2019 09:31:56 +0000 (09:31 +0000)]
Fictive tags for right alt/cat must be added after adding all tags for left alt/cat.
Previous order was incorrect: fictive tags for both left and right
alternative or catenation were added before recursing into the right
alternative or catenation. This resulted in right fictive tags habving
higher precedence than left non-fictive tags.
Ulya Trofimovich [Sat, 16 Mar 2019 23:45:48 +0000 (23:45 +0000)]
libre2c: added POSIX tests that show why closure can't use DFS and needs worst-case quadratic shortest path algorithm.
In particular, test ((((a?)+)|(aa))+) and string "aaa" demonstrates why
we cannot replace SSSP (GOR1 or GTOP) with a liner-time DFS that sets
predecessor chains of TNFA states and sorts initial closure configurations
according to POSIX relation imposed by precedence matrix.
Consider what happens after consuming the first to 'a's (see the .dot
encoded TNFA below):
- initial states are 8, 14 and 9 (in the order imposed by D-matrix)
- find path
8->7->6->5->4->3->2->22->21->20->19->18->17->16->14->13->17
compare it to the previous path to 17 and discard the new looping path
- find path
8->7->6->5->4->3->2->22->21->20->19->18->17->16->14->13->12->3
compare it to the previous path to 3 and discard the new looping path
- find path 8->7->6->5->4->3->2->1->0
- DFS from 8 complete, start DFS from 14
- find path 14, compare it to the previous path to 14 and accept the new
one as it is better (but do not attempt to propagate the improvement
and rescan states: this is DFS, not GOR1, and the whole idea is to
avoid rescanning and have linear complexity)
- DFS from 14 complete, start DFS from 9
- find path 9
The end, and we have an incorrect path to 15: the long path through the
outer loop from 8, rather than the short path through the inner loop
from 14. This further leads to incorrect results.
Handle cases when rename() fails because destination file exists.
In C/C++ rename() behaviour is implementation-defined. POSIX and Windows
have different behaviour when destination file exists: POSIX says rename()
should overwrite it, but Windows says rename() should fail.
Consistently convert all newlines in the generated file to Unix-style LF.
Some newlines originate in user-defined code (including semantic actions
and code fragments in configurations and directives), some are generated
by re2c itself. In order to maintain consistency we convert all newlines
to LF when writing output to file.
Respect platforms with 32-bit 'long' and 64-bit pointers.
Previously used ~0U works incorrectly in such cases: ~0U equals
0xFFFFffff, and maximum (numerical) pointer value is 0xFFFFffffFFFFffff.
This caused incorrect comparison result when tracking include files that
have been fully processed and can be removed frm the include stack.
Ulya Trofimovich [Wed, 27 Feb 2019 10:50:07 +0000 (10:50 +0000)]
libre2c: use a more efficient algorithm for computing POSIX precedence matrix.
The old alrorithm was straightforward: it computed precedence for each
pair of core closure items. This has worst-case cubic complexity in the
size of TNFA, because closure can have O(N) states and tag history length
can be also O(N). One example is ((a?){1,1000})*.
The new algorithm avoids that by computing parecedence for all pairs of
core items in one traversal of the shortest path tree constructed by
closure. It avoids traversing deadend branches (those that were cancelled
by search because a better path was found) by marking used nodes of the
tree from each leaf down to the root. The new algorithm has worst-case
quadratic behaviour. It can be slightly slower for simple cases when the
number of core items is low and histories are short.
Ulya Trofimovich [Wed, 20 Feb 2019 17:00:47 +0000 (17:00 +0000)]
libre2c: found pathological case for constant-memory POSIX algorithms.
The regexp is ((a?){1,200})*, and the input string is just "a".
Takes quibic time in the size of counter. This is caused by quadratic-
time computation of precedence matrix on each step (the number of TNFA
states in the closure approaches TNFA size), multiplied by the length
of compared histores (which also approaches TNFA size). Trie-based
algorithms are not affected, but they consume memory proportional to
the length of the input string, and so are also not practical.
Ulya Trofimovich [Tue, 19 Feb 2019 21:51:43 +0000 (21:51 +0000)]
No need to fully unwind tag histories when calculating POSIX precedence.
Histories are stored in a trie. Each history is represented by an index
to its tail in the trie. In order to compare two histories, we can simply
unwind both indices, step by step, until they coincide or root index is
reached.