Ulya Trofimovich [Fri, 24 Feb 2017 16:59:33 +0000 (16:59 +0000)]
Keep fixed and variable tags together in one array.
One must take fixed tags into account during determinization
if disambiguation policy is based on tag values.
Also, NFA construction is much easier when tags have common enumeration:
it allows to perdorm tag creation, tag fixing and adding default tags
in three distinct passes (for now they are intermixed and the code is
very complex).
Ulya Trofimovich [Tue, 14 Feb 2017 14:51:47 +0000 (14:51 +0000)]
Stop preserving order of tag versions when mapping TDFA states.
We no longer need it; now that disambiguation policy is leftmost rather
than tag-centric, relative order of tag versions is irrelevant.
Tag versions have no impact on conflict resolution; any bijective
mapping is ok.
Ulya Trofimovich [Mon, 13 Feb 2017 21:59:06 +0000 (21:59 +0000)]
Changed disambiguation policy: leftmost rather than tag-centric.
Instead of tracking priority for each tag version and resolving conflicting
configurations encountered by the epsilon-closure based on version priorities
(trying to minimize or maximize each tag), we now simply pick the leftmost
epsilon-path through the NFA.
NFA construction has to accommodate this policy: for example, greedy iteration
should be compiled so that always choosing the left alternative correpongs to
re-iteration, while choosing the right alternative corresponds to leaving the
iteration. The necessary adjustments have been done by earlier commit 03c65121da3f8bcb412576568cfff6cbca039470.
Now we can drop priority comparison when mapping TDFA states and stick to
simple bijection. This may add cycles in reordering commands. (Has not been
done yet.)
The rationale of switching to leftmost policy is this: with tag maximization/
minimization policy tags could influence the regexp: for example, regexp
"a"* @p "a"*
has different meaning depending on whether tag 'p' is minimized (first iteration
is non-greedy) or maximized (first iteration is greedy). Moreover, adding
another tag with the opposite policy and higher priority would change the
meaning (suppose 'q' is minimized and earlier tags have higher priority):
"a"* @q @p "a"*
This all makes regexps somewhat fragile, since adding just one tag may change
the whole regexp (and in the presence of other tags, it may change parse
results).
But even worse, with tag-centric disambiguation simple (untagged) subexpressions
are inferior to tagged ones. This contradicts intuition and even POSIX standard,
where a*(a*) is the same as (a*)(a*) . We don't keep to POSIX (and we can't
because of last-longest instead of first-longest semantics of iteration as shown
by the (<|<a|<ab|<aba|abab|baba|b>|>)* example), but even this somewhat
tag-centric standard respects non-captured parts of the regexp. Note that
Laurikari in his TRE library has to do special preprocessing of the regexp in
order to assign correct minimization/maximisation direction to each tag
depending on the context of the tag. His original tag-centric policy (open tag
is minimized and has higher priority than closing tag which is maximized,
nested tags have lower priority) is insufficient to accommodate POSIX.
Ulya Trofimovich [Mon, 13 Feb 2017 15:01:12 +0000 (15:01 +0000)]
In greedy regexps first alternative must correspond to consuming path.
By convention first alternative has higher priority.
So, for example, the following must be true:
r* = rr* | <empty>
r{n,m} = r{n} (r{m - n} | r{m - n - 1} | ... | r{1} | <empty>)
r{n,} = r{n} (rr* | <empty>)
For now we don't care about priorities: this is a preparatory step
before transition to greedy leftmost semantics for tags.
Use 'normal form' of version matrix to test if states are mappable.
Normal form is calculated for every new state before searching for
a mappable counterpart. All versions are renumbered to occupy consequent
numbers starting from one; version order is preserved.
Ulya Trofimovich [Wed, 25 Jan 2017 17:50:45 +0000 (17:50 +0000)]
Fixed yet another case of disordered tags update and cursor increment.
/* note [tag hoisting, skip hoisting and tunneling]
*
* Tag hoisting is simple: if all transitions have the same commands,
* they can be hoisted out of conditional branches.
*
* Skip hoisting is only relevant with '--eager-skip' option.
* Normally this option is off and skip is lazy: it happens after
* transition to the next state, if this state is consuming.
* However, with '--eager-skip' skip happens before transition to the next
* state. Different transitions may disagree: some of them go to consuming
* states, others don't. If they agree, skip can be hoisted (just like tags).
*
* '--eager-skip' makes tag hoisting more complicated, because now we have
* to care about the type of automaton: lookahead TDFAs must skip after
* writing tags, while non-lookahead TDFAs must skip before writing tags.
* Therefore skip hoising cannot be done without tag hoisting in lookahead
* TDFAs, and vice versa with non-lookahead TDFAs.
* (Note that '--eager-skip' is implied by '--no-lookahead').
*
* Tunneling splits base states in two parts: head and body. Body has all
* the conditional branches (transitions on symbols), while head has just
* one unconditional jump to body.
*
* Normally tag hoisting should go before tunneling: hoisting may add new
* candidates to be merged by tunneling. However, with '--eager-skip' tag
* hoisting is interwined with skip hoisting, and the latter needs to know
* which states are consuming. This is not possible if tunneling is still
* to be done, because it may turn consuming states into non-consuming ones.
* Another option is to disallow splitting states with non-hoisted skip
* in the presence of '--eager-skip' (this way skip hoisting wouldn't need
* to know tunneling results), but it's much worse for tunneling.
*/
Ulya Trofimovich [Mon, 23 Jan 2017 18:09:05 +0000 (18:09 +0000)]
Fixed memory leak.
Found by valgrind:
1,177 bytes in 1 blocks are definitely lost in loss record 1 of 1
at 0x4C2AE40: operator new[](unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
by 0x42153F: re2c::cfg_t::compact(re2c::cfg_t const&, int*) (compact.cc:12)
by 0x422088: re2c::optimize_tags(re2c::dfa_t&) (optimize.cc:18)
by 0x42DE3A: re2c::compile(re2c::spec_t const&, re2c::Output&) (compile.cc:82)
by 0x446742: re2c::parse(re2c::Scanner&, re2c::Output&) (parser.ypp:484)
by 0x402DC0: main (main.cc:38)
Ulya Trofimovich [Mon, 23 Jan 2017 17:42:37 +0000 (17:42 +0000)]
Use the same tag version for all transitions from the given state.
This reduces the overall number of versions greatly, which is crucial
for the space and time efficiency of tag liveness analysis.
Of course, bottom and normal transitions still have different versions.
Differents tags on the same transition also have different versions.
It is not incorrect to give them the same version: all tags updated by
one and the same transition are either set to bottom, or to current
input position (which is the same for all tags on the given transition).
But this would create many states with a very special "shape" that do
not map to other states. In fact, this would cause exponential blowup
in the number of states (even in simple cases).
Ulya Trofimovich [Mon, 23 Jan 2017 16:38:04 +0000 (16:38 +0000)]
Compact tag versions before doing liveness analysis.
After TDFA construction tag versions occupy non-contiguous range
of natural numbers: there may be "holes" between versions. This wastes
space (and time) spent on liveness analysis and subsequent optimization
passes. This commit adds a simple pass that renumbers versions so that
they occupy contiguios range.
Ulya Trofimovich [Tue, 17 Jan 2017 12:36:43 +0000 (12:36 +0000)]
Added option '--eager-skip'.
This option allows to control program points in which cursor is
advanced to the next input character. Normally it happens after moving
to the next state (lazyly). With '--eager-skip' cursor is advanced
before moving to the next state (eagerly).
Eager mode presents some difficulties compared to the lazy mode:
transitions may disagree on whether advance cursor or not. Some of
them may go to accepting states and should not advance cursor, while
others go normal states and should advance cursor. In such cases
advancement must happen on a per-transition basis. If all transitions
agree on the matter, advancement can be hoisted out and performed on
a per-state basis, just as it happens in lazy mode.
Ulya Trofimovich [Fri, 30 Dec 2016 17:15:16 +0000 (17:15 +0000)]
Fold output expressions in a separate pass over the output.
re2c folds sequences of simple statements to complex tatements, e.g.:
++YYCURSOR;
yych = *YYCURSOR;
is folded to:
yych = *++YYCURSOR;
It is hard to fold expressions on the fly (at the time they are generated),
because one needs to forsee all the cases in advance. Some cases cannot be
anticipated (e.g. if substatements belong to different states, which is often
the case), so many statements remain unfolded.
This commit separates code generation and folding by delaying output of
skip, peek and backup primitives. Folding now is done in a separate pass
when all the substatements are already known.
Ulya Trofimovich [Thu, 29 Dec 2016 17:07:52 +0000 (17:07 +0000)]
Memorize options for each block and use them for delayed code generation.
Option scope is very poorly defined in re2c. Most options should have block
scope, but some have global scope (because they belong to global directives
like '/*!types:re2c ... */', '/*!tags:re2c ... */' or other global things
like generation date, etc.).
For now, options are applied immediately as they are parsed.
Code generation just reads the most recent state of options, so the output
depends not only on the location in source code when the option is defined,
but also on program points in which re2c generates and outputs code.
Generation and output may happen at different times, so they may use
different options. This is all very bad.
This commit is the first attempt to introduce scoped options: re2c now
makes a snapshot of options at the end of each block and uses this snapshot
for delayed code generation for this block (unless the given option must be
global).
This is not perfect (immediate code generation still uses 'seen-so-far'
options instead of block options), but at least it allows to delay code
generation for non-global things without loosing options.
Ulya Trofimovich [Tue, 27 Dec 2016 17:00:09 +0000 (17:00 +0000)]
Finally separated parsing, consistensy checking and after-parse transformation.
Updates in test results are caused by the changed order of condition
compilation, see commit b7bbbf471778b487d0717c85a5e586b595e495ec.
Changes do not affect code generation, only the error messages.
Ulya Trofimovich [Sun, 25 Dec 2016 18:13:43 +0000 (18:13 +0000)]
Emit conditions ordered by their number rather than lexicograhically.
Each condition has a unique name and unique number. Numbers are assigned
to conditions in the order they appear in source file. This order must
be preserved since some users rely on it (user code makes implicit
assumptions about condition order, see '-Wcondition-order').
However, once the order is preserved, the automata code can be emitted
in any order (user code can only address conditions by their start label,
and it can be anywhere in the output file).
This commit changes the order of automata: now they are emitted according
to their numbers, not names.
Test results have changed in cases when lexicographical order does not
coinside with number order. The changes touch some important files and
are not easy to verify. However, rearranging conditions in the generated
code back in lexicographical order makes the differences trivial (labels,
'YYDEBUG' parameter, 'YYFILL' labels, 'YYSETSTATE' parameter --- in other
words, everything that depends on state numbering). It is easy to see
that the structure of the code is not changed.
Ulya Trofimovich [Thu, 22 Dec 2016 17:36:15 +0000 (17:36 +0000)]
Pass context as parameter to 'yyparse' instead of a bunch of static variables.
As a side effect of this commit a long-standing bug with setup rules
has been fixed: re2c forgot to clean up the list of setup rules together
with other data, which resulted in false errors about duplicate setup code.
Ulya Trofimovich [Wed, 14 Dec 2016 14:21:28 +0000 (14:21 +0000)]
Don't delay reading next character until dispatch.
Instead of sometimes generating code like this:
++YYCURSOR;
switch ((yych = *YYCURSOR)) {
re2c now always generates
yych = *++YYCURSOR;
switch (yych) {
There's not much difference: it's hard to say which would be better
optimized by C/C++ compiler. Most likely they would be optimized
equally well; practical experiments with GCC -O2 resulted in bit-to-bit
identical binaries.
However, dropping the first form allows to cut off a lot of tricky code
in re2c itself (a whole layer of code spread across multiple files was
tracking whether next input character has been already read or not).
Changes to the test results are multiple, but trivial; they have been
verified with skeleton.
Respect injective mappings when optimizing save(X), copy(Y,X) to save(Y).
If mapping is injective, there may be multiple copy commands per one
save command: save(X), copy(Y,X), copy(Z,X). In this case save command
must be replicated for each copy command: save(Y), save(Z).
Added test: exponential blowup if bottom is treated as special version.
One tempting way to track default value of tags is to forbid mapping
bottom (default) to non-bottom versions. This way we don't need to
explicitely save bottom to variable; bottom value is encoded in state
(once a tag becomes bottom, it will never become non-bottom again).
However, this may exponentially blow up the number of states.
Untwined determinization and '-Wnondeterministic-tags' analysis.
Now '-Wnondeterministic-tags' analysis is done right after determinization,
while we still have all the kernels available. Instead of comparing
transition tags (which are available in closure items, but not in kernels)
we now compare tag versions: maximum number of distinct versions per tag is
the desired degree of nondeterminism. Kernel items that belong to another
rule do not count (all tags that don't belong to a particular rule should
have initial versions in all items belonging to this rule).
Added test with an example of pathological regexp: counted repetition
cannot be folded efficiently in DFA, and combined with nondeterministic
tags it results in arbitrary long chains of reordering commands.
Length of command choin grows linearly with the number of repetitions
(as DFA itself does).
This option controls how tag versions in DFA states are mapped to each
other during determinization. By default mapping is 'bijective', however
if it is set to 'injective' the new state must be less versatile than
the old one.
Drop hoisted tags in 'base' states duplicated by tunneling.
Tunneling optimization finds certain states and splits them into
two parts: First part consumes a character and unconditionally moves
to the second part. Second part does not consume characters, but
contains a switch on character value.
Second part is a 'base switch', which may be utilized by many other
states. So if the first part has hoisted tags, they should be dropped
in the second part (the aforementioned many other states may have hoisted
tags of there own or no hoisted tags at all).
Found by slyfox's fuzzer. :)
Yet unable to construct a test case that reveals the error: the original
test case found by fuzzer is not breaking anymore after recent commit 44045203c211531cc3e3a47f0b5eb2e5d993cc15 "Properly remove useless final states.".
Don't loose liveness of tags that occur on both sides of 'copy' commands.
When processing chain of 'copy' commands like 'x = y; y = z;'
one should first mark all left hand sides ('x' and 'y') as dead,
then mark all right hand sides as alive ('y' and 'z'). This way
'y' will end up alive, as it should be.
Before this commit the algorithm considered final state useless only
if its rule is unreachable in this state. However, this leaves out
cases when final state is shadowed by other final states of the same
rule (longer alternatives of the same rule always match). In such
cases the rule is reachable because some (maybe all) outgoing accepting
paths match the same rule.
Now the algorithm considers final state useless if all outgoing paths
are accepting (there's no default transitions and default rule is
unreachable in all successors).
Ulya Trofimovich [Wed, 30 Nov 2016 23:02:25 +0000 (23:02 +0000)]
Topologically sort 'copy' commands.
The order in which copy commands are executed is important:
'x = y; y = z;' is not the same as 'y = z; x = y;' (the latter
overwrites 'y' before its precious value is copied to 'x').
To avoid overwrites, commands should be topologically sorted.
This is always possible because there's no cyclic dependencies
by construction.
Ulya Trofimovich [Wed, 30 Nov 2016 11:21:27 +0000 (11:21 +0000)]
Allow nondeterministic tags.
This commit makes some important changes:
- Each DFA state under construction keeps a set of tag versions
(which version of each tag is actual for each NFA sub-state).
- Tag versions now have priorities which allow to disambuguate
between two competing configurations when building tagged epsilon-
closure. We try to maximaze value of each tag, thus versions
that correspond to greater tag value have greater priority.
- Comparison of DFA states (when trying to find an existing state
equal to the new one) has become more complex. We try to find
a similar enough (not necessarily identical) state: one that is
equal to the new state up to reordering of tag versions. Mapping
should be either bijective or injective (new state must be less
or equally versatile then the old one), default is bijective.
Mapping should not violate tag priorities: relative order of
versions must hold (absolute values of priorities don't have to
coincide).
- On each update tag gets new version: if tag is updated to bottom,
it gets negative version; otherwise it gets positive version.
- Different tags that updated by the same transition still get
different versions. This may seem to be a waste: we know exactly
that input position is equal for all tags on the same transition,
even more, for all tags on all outgoung transitions from the same
state, so we can alocate exactly one version for all of them.
However, allocating one version for different tags creates states
with a very specific tag configurations: these states don't map
to existing states and the automaton gets bloated (sometimes
exponentially). Furthermore, allocating independent versions is
good for optimizations.
- Second pass of optimizations: it only makes difference for one
test case (usually the first pass does a good job), but it's cheap
(all the necessary infrastructure is in place) and powerful.
- The resulting automaton is similar to Laurikari's TDFA, except
for one important difference: re2c's automaton makes use of
lookahead tags, while Laurikari's does not.
Ulya Trofimovich [Mon, 28 Nov 2016 15:32:46 +0000 (15:32 +0000)]
Keep fixed and variable tags separately.
This commit reverts commit de3f9e70a45c42fcb848a347ece3a727b8fb983e:
Keep fixed and variable tags together in one array.
Optimizations of variable tags get more complicated and fixed tags
should not get in the way.
This commit also drops check for tags in trailing context: there's
nothing special about them; no technical reason to forbid them.
Ulya Trofimovich [Sat, 26 Nov 2016 20:16:11 +0000 (20:16 +0000)]
Merged computation of fixed / variable tags into NFA construction.
NFA construction is already complex because of insertion of default tags,
now it gets even worse. However, keeping two separate passes requires
maintaining exactly the same order of traversal, which is fragile.