Ulya Trofimovich [Tue, 27 Jun 2017 13:05:26 +0000 (14:05 +0100)]
Skip (non-orbit) start tags during POSIX disambiguation.
Their position is fixed on some higher-priority tag (except the very
first tag, but in RE2C match is always anchored).
We cannot skip orbit start tag because the corresponding orbit end tag
is hoisted out of loop (by construction) and is, in fact, non-orbit.
Use Goldberg-Radzik shortest path algorithm for closure construction.
It has O(M*N) worst-case complexity, where M is the number of nodes (states)
and N is the number of arcs (transitions).
Papers: 1993, "A heuristic improvement of the Bellman-Ford algorithm"
by Goldberg, Radzik and 1996, Shortest paths algorithms: Theory and
experimental evaluation" by Cherkassky, Goldberg, Radzik.
Compare full tag histories, not just the last subhistories.
Comparing the last pair of subhistories is not enough: because of
the shortest-path algorithm mismatch may occur at some earlier point
(not at the last pair), in which case we still need to re-run the search.
Ulya Trofimovich [Sun, 21 May 2017 09:07:58 +0000 (10:07 +0100)]
Use two tags instead of three for captures under iteration.
Before we simplified POSIX disambiguation by inserting missing
captures, using three tags to represent captures under iteration
was the only way: we needed to guard where the first iteration starts,
where the last iteration starts and where the last iteration ends.
Therefore we used three tags: opening, closing and orbit.
Now that we insert missing pieces of capture hierarchy, we no longer
need a special opening tag for captures under iteration: there always
is a higher-priority tag that guards the start of the first iteration.
Massive test updates are mostly due to the fact that orbit tag is no
longer fixed (fixed tags go in the end): ~100 out of 147 tests.
Some other ~45 tests have two tag variables flipped. There are 4 tests
with non-trivial changes. All updated tests pass POSIX test
(/test/posix_captures/.run/__run.sh).
Ulya Trofimovich [Fri, 19 May 2017 21:11:24 +0000 (22:11 +0100)]
Forbid mixing leftmost greedy and POSIX disambiguation.
Mixing doesn't make much sense (there is no obvious use case).
It might be possible, but for now better forbid it before someone
starts to use it and it turns out to be impossible or very complex.
Ulya Trofimovich [Fri, 19 May 2017 07:47:36 +0000 (08:47 +0100)]
Simplified POSIX disambiguation by reconstructing capture hierarchy.
POSIX treats captured and non-captured subexpressions on equal terms.
However, non-captured subexpressions have no tags to guard them.
Previously we used default tags to infer ambiguous paths that correspond
to the missing captures: if one path had default tag and the other did
not, then desicion which path is better was made according to the leftmost
strategy. This algorithm works because in POSIX expressions without
captures have the property that leftmost path is the best path (for
example, 'a?' is greedy, it means 'a or epsilon').
However, this algorithm has one downside: because we may need leftmost
comparison, we have to impose leftmost order on NFA substates of each
DFA state, as well as maximize and orbit order for tags. This prevents
us from mapping perfectly mappable DFA states and we end up with a larger
DFA (which is sometimes folded back to smaller DFA, but not always).
Also, default-leftmost algorithm is more complex than inserting missing
hierarchy pieces: proving that it works is non-trivial.
Ulya Trofimovich [Tue, 16 May 2017 21:11:55 +0000 (22:11 +0100)]
Fixed history comparison in case both latest subhistories are bottoms.
In this case, instead of comparing histories by the leftmost criterion
(as it should be, if only one of them is bottom), we should assume that
from standpoint of this tag histories are equal and move on to other
tags.
Ulya Trofimovich [Tue, 16 May 2017 17:07:00 +0000 (18:07 +0100)]
Avoid exponential blowup in tagged epsilon-closure construction.
The previous algorithm waited until the full epsilon-path is built,
then compared it to already existing paths. The new algorithm compares
and merges partial epsilon-paths as soon as they arrive at the same
NFA state, so the overall number of paths is bounded by the number of
NFA states at all times.
Don't split tag history into individual sub-histories for tags.
This is necassary for correct comparison of orbit tag histories:
if orbit tag is nested in an outer capture, this outer capture is
under repetition and there is an epsilon-path through it, then
this epsilon-path may contain pieces of orbit history that belong
to different iterations of outer capture; these pieces will be
glued together and the boundary between them will be lost.
Example: ((""){0,3}){0,2}.
However, in a common history we can always find boundaries
(they are marked by tags that correspond to outer captures).
Command normalization: update pointer to last command, it might change.
If normalization of 'save' commands removes the last 'save' command,
then pointer to 'next' will point to the 'next' field of the removed
command and the chain of commands will break.
The previous attempt was to forbid 2-cycles; clearly it's not enough
and cycles of length greater than 2 also pose a problem.
We could allow cycles and deal with them by introducind a temporary
variable. However, this would create variables local to basic blocks,
which would complicate liveness analysis and dead code elimination.
Tag interference analysis: compare actual values, not formal RHS.
Commands in the same basic block may have equal right hand sides,
e.g. 'x = y; z = y;'. Clearly in this case 'x' and 'z' do not interfere:
both versions are equal to 'y'. This optimization allows to deduplicate
cases like '("a" @x @y @z "b")*', where multiple tags have the same
values.
However, it is insufficient to find all commands in the current block
with RHS equal to current RHS. An example when this algorithm fails:
'x = y; w = z; z = y;', here 'x' and 'z' are assigned to the same RHS
'y', but if we merge them into one variable, 'w' will get the wrong
value.
The fix is as follows: first, ignore subsequent commands and consider
only commands that precede current command: if subsequent command that
sets LHS to the same value precedes any use of it, liveness propagation
through basic block would mark this LHS as dead and not interfering
anyway; otherwise (if use precedes setting to the same value), then it
indeed interferes with current LHS. This fixes the above example
'x = y; w = z; z = y;': when analysing 'x = y;' command we find that
'z' is alive (as before), but ignore subsequent 'z = y;' command (due
to the fix).
Second, when analysing preceding commands, calculate actual value for
each LHS (based on pessimistic assumption that on entry of basic block
all used versions are different). Formal RHS of some preceding command
may coincide with current formal RHS, but their values might differ:
'x = y; y = w; z = y;', here 'x = y;' and 'z = y;' have identical formal
RHS, but the value may be different. On the other hand, formal RHS
might be different, while their actual values are equal, as in
'x = y; w = y; z = w;'.
Don't rely on the fact that topsort leaves all-zero in-degree:
it doesn't if cycles are present. One could manually zero the remaining
entries after topsort, but zero initialization is more reliable.
And it's quite efficient since we only need to initialize the entries
we use.
When mapping states, avoid double substitution in 'save' commands.
When mapping states, we need to create a list of 'copy' commands.
However, if there is a 'save' command with LHS equal to RHS of the 'copy'
command, then instead of the 'copy' command we just fix LHS of the 'save'
command, as explained in this note:
/* note [save(X), copy(Y,X) optimization]
*
* save(X) command followed by a copy(Y,X) command can be optimized to
* save(Y). This helps reduce the number commands and versions (new version
* X is gone), but what is more important, it allows to put copy commands
* in front of save commands. This order is necessary when it comes to
* fallback commands.
*
* Note that in case of injective mapping there may be more than one copy
* command matching the same save command: save(X), copy(Y,X), copy(Z,X).
* In this case save command must be replicated for each copy command:
* save(Y), save(Z).
*
* For each save(X) command there must be at least one copy(Y,X) command
* (exactly one case of bijective mapping). This is because X version in
* save(X) command must be a new version which cannot occur in the older
* DFA state. Thus all save commands are transformed (maybe replicated) by
* copy commands, and some copy commands are erased by save commands.
*
* This optimization is applied after checking priority violation, so it
* cannot affect the check.
*/
However, the previous implementation would sometimes erroneously re-fix
the already fixed 'save' commands. This patch rewrites it in a way that
avoids re-fixing.
'save' commands for tags with history may need fallback copies.
For simple tags without history 'save' command is self-sufficient:
it does not depend on anything. However, for tags with history 'save'
commands depend on history, which may be overwritten on the way from
accepting state to fallback transition. We must take care and backup
such overwritten histories when leaving the accepting state.
If history is not overwritten, we don't need a copy, but we still
have to propagate its liveness forward on all fallthrough paths
outgoing from accepting state.
First, fallback 'copy' commands should go before 'save' commands.
Second, codegen should not reorder 'copy' and 'save' commands.
These two bugs shadowed each other and everything seemed to work.
- removed 'YYCOPYTAG', use simple assignment insted
(it is hard to think of any other definition)
- renamed 'YYBACKUPTAG' to 'YYTAGP'
- added 'YYTAGN', 'YYTAGLISTP', 'YYTAGLISTN'
- removed 're2c:tags:default' configuration
Packing was possible because we were not interested in tag version
if closure item had this tag in lookahead set. However, if we are
to store the full history of each tag, we'll need this intermediate
version in spite of the fact that it is overwritten on the next
transition.
Liveness flows backwards through basic block, so it should be
propagated backwards starting with live-out set and applying
commands one by one in reverse order.
Merge 'save' and 'copy' command lists into one common list.
Previously we used separate lists for them; however the structure
of 'save' and 'copy' commands is almost identical, so its easy to
use one C struct for both types. Also, this simplifies analysis:
both types of commands are handled in a uniform way.
However, now we have to do list transformations like sorting and
deleleting duplicates in chunks.
This is a preliminary step to allow intermixing 'copy' and 'save'
commands.
Ulya Trofimovich [Thu, 16 Mar 2017 22:39:57 +0000 (22:39 +0000)]
Explicitly track order of closure items for leftmost disambiguation.
Before this commit leftmost order was maintained implicitly by construction
of closure: NFA is traversed in left-first order, so the earliest item added
to closure is also the leftmost. Special care was taken to preserve this
order and never rearrange closure elements.
This is convenient if leftmost disambiguation strategy is the only one.
However, other strategies require explicit bookkeeping of disambiguation
information (POSIX needs tag history for orbit tags, maximize/minimize
needs most recent tag versions). Implicit handling of leftmost strategy
does not fall in line with other strategies.
By tracking leftmost order explicitly we achieve several goals:
- uniform handling of disambiguation strategies
- DFA states can be treated as good old sets again, which is more
conventional and means less work for minimization
POSIX tests changed (but the '/test/posix_captures/.run/__run.sh' script
still passes all tests) because there was an error in the order items were
added to closure, which probably prevented minimization (if the new item
dominated the existing one, the old one was substituted with the new one,
while it should have been removed and the new one appended at the end).
Skeleton tests changed because the order of NFA states in DFA state does
not matter, so more states can be merged by determinization.
Split options into constant and mutable; restricted access to mutable ones.
Some options are immutable (target, output files; global switches like
conditions, reuse mode, storable states; support of flex syntax, etc.).
These options are passed as command-line arguments and never change.
It is safe to read them from any program point after parsing command-line
arguments.
Other options are configurable; they have block scope (may be specified
anywhere inside of the block and still affect the whole block).
Reading mutable options of yet unparsed block is not allowed because
they may affect the way RE2C parses current block (RE2C would be tempted
to base decisions on the latest option value, which may not be the final
one).
Configuration of these options either makes little sense (like changing
target on the fly) or is technically impossible to implement (the only
example is '-F, --flex-syntax' option which non-trivially affects the way
RE2C parses programs).
Check precondition inside of the function rather than at every call site.
Changes in tests are caused by the lack of newline after '/*!ignore:re2c*/'
directive. There's no particular reason why it should be emitted and RE2C
code is simpler without it.
AST is immutable and independent of options: encoding expansion and other
option-sensitive transformations are delayed until AST is converted to
intermediate representation. Thus AST can be shared among different blocks
and conditions and reused with any set of options without modifications.
Changes in tests are trivial: warnings from the rules block are no longer
emitted (we don't compile AST to DFA for rules block anymore, since it
will be discarded anyway); skeleton tests no longer emit '.input' and
'.keys' files for rules block.
Store locations in AST so that error reporting does not depend on parser.
This allows us to delay things that should be delayed (like expanding
regexps with respect to current encoding), but which may throw an error
and need location info.
Added test for '--posix-captures': implicit grouping is forbidden.
RE2C has named definitions, which allow to override operator precedence
without grouping parenthesis. However, in POSIX all parenthesis are
capturing: allowing some of them to be non-capturing would affect
disambiguation and might result in a different parse.
For now RE2C uses a crude criterion to determine if the given named
definition needs to be explicitely wrapped in parenthesis: if the topmost
constructor of the subexpression is either alternative, or catenation, or
another reference to another named definition, then wrapping is required.
This is an overapproximation, because it doesn't take the surrounding context
into account (e.g. catenation inside of an alternative doesn't need wrapping,
but catenation inside of an iteration does).
Tests are adopted from Glenn Fowler's testregex suite and its derivative
regex-posix-unittest by Christopher Kuklewicz.
Irrelevant tests have been dropped:
- tests for specific POSIX features not supported by RE2C, like
begin/end markers '$' and '^', backreferences, predefined character
classes (most of them come from 'basic' subdirectory)
- all 'leftassoc' tests: they are exact copies of 'rightassoc' tests,
differ only in match results and are supposed to fail
- all tests from 'categorize' that are supposed to fail
I also added a few tests of my own (in the 'other' subdirectory).
Tests are arranged as a bunch of '.dat' files (each file contains lines of
the form '<regexp> <string> <match>' or blank lines) and two scripts.
- '__run.sh' runs all tests (generates temporary '.re' file with the
given regexp, compiles it, runs against the given string and checks
that the result coincides with the given result)
- '__gen.sh' regenerates '.re' files for regular RE2C test suite (the
generated files do not contain any code, just the bare regexp)
All regexps are pre-patched to exclude '\x00' (terminating NULL) character
from subregexps like '.' or '[^a]: otherwise RE2C-generated programs would
greedily consume all characters and fail. Note that if only '.' was present
in the original tests, we might have used '\n' as terminator, but '[^a]'
includes newline. Also '()' subregexp was substituted with '("")'.
'--skeleton --posix-captures': use correct capture index when checking tags.
Instead of skipping fixed tags, skip orbit tags. For each orbit tag there is
a tag fixed on it, so if we check all fixed tags we'll surely check all orbit
tags. Anyway, with '--skeleton' all fixed tags correspond to orbit tags: other
fixed tags are forbidden with generic API (even zero-offset ones for now).
For some reason dead code elimination pass simply ignored 'copy' commands
and removed only dead 'save' commands. Strange as it seems, this went
unnoticed until now (when one of the POSIX tests compiled into C++ code
that triggered GCC warning about use of uninitialized variable).
With this option re2c interprets all parenthetised subexpressions
as POSIX capturing groups and returns submatch according to POSIX
disambiguation rules (in variables 'yynmatch' and 'yypmatch', local
to each action).
The algorithm for disambiguation was invented by Christopher Kuklewicz.
Now code generation can handle zero-offset fixed tags with generic API.
We'll need this for POSIX captures: in the prsence of orbit tag,
opening tag is needed only for disambiguation; after determinization
we can pretend that it is fixed on orbit tag with zero offset.
This will save some time and space on tag optimizations and maybe
reduce the number of tag variables.
Prepare default tags insertion for non-sequential tag enumeration.
This will be needed when we add POSIX captures: opening and closing
capture tags must have sequential numbers, but the tags may be located
far from each other in the regexp tree (other tags are possible on
the way from the opening tag to the closing tag).
Delay expansion of counted repetition until NFA construction time.
This is necessary to avoid creating default tags for tags that
are duplicated by expansion of counted repetition. Default tags
should only be created if zero repetitions are acceptable; this case
is still expanded early.
Keep the full history of tag occurences found by epsilon-closure.
Currently this makes no difference because tag duplication in NFA
is forbidden. Anyway, we are only interested in the last occurence
of each tag on the epsilon-path to the given NFA state.
This will make difference for POSIX disambiguation strategy: we'll
have to allow tag duplication in NFA (because of counted repetition
and iteration expansion).
Ulya Trofimovich [Sat, 25 Feb 2017 17:00:49 +0000 (17:00 +0000)]
Added yet another intermediate representation (after AST, before NFA).
Rationale: AST must be immutable, because it is shared among different
regexps, rules, conditions and blocks. NFA is mutable, but inconvenient
for some transformations (adding fixed and default tags, rewriting parts
of regexp, etc.). Before this commit all these transformations were done
in one pass (during convertion of AST to NFA) which made the code rather
complex. In order to add POSIX captures we'll need to add even more
complexity (tracking captures, creating tags for them, maintaining tag
order, etc.).
Ulya Trofimovich [Fri, 24 Feb 2017 16:59:33 +0000 (16:59 +0000)]
Keep fixed and variable tags together in one array.
One must take fixed tags into account during determinization
if disambiguation policy is based on tag values.
Also, NFA construction is much easier when tags have common enumeration:
it allows to perdorm tag creation, tag fixing and adding default tags
in three distinct passes (for now they are intermixed and the code is
very complex).
Ulya Trofimovich [Tue, 14 Feb 2017 14:51:47 +0000 (14:51 +0000)]
Stop preserving order of tag versions when mapping TDFA states.
We no longer need it; now that disambiguation policy is leftmost rather
than tag-centric, relative order of tag versions is irrelevant.
Tag versions have no impact on conflict resolution; any bijective
mapping is ok.
Ulya Trofimovich [Mon, 13 Feb 2017 21:59:06 +0000 (21:59 +0000)]
Changed disambiguation policy: leftmost rather than tag-centric.
Instead of tracking priority for each tag version and resolving conflicting
configurations encountered by the epsilon-closure based on version priorities
(trying to minimize or maximize each tag), we now simply pick the leftmost
epsilon-path through the NFA.
NFA construction has to accommodate this policy: for example, greedy iteration
should be compiled so that always choosing the left alternative correpongs to
re-iteration, while choosing the right alternative corresponds to leaving the
iteration. The necessary adjustments have been done by earlier commit 03c65121da3f8bcb412576568cfff6cbca039470.
Now we can drop priority comparison when mapping TDFA states and stick to
simple bijection. This may add cycles in reordering commands. (Has not been
done yet.)
The rationale of switching to leftmost policy is this: with tag maximization/
minimization policy tags could influence the regexp: for example, regexp
"a"* @p "a"*
has different meaning depending on whether tag 'p' is minimized (first iteration
is non-greedy) or maximized (first iteration is greedy). Moreover, adding
another tag with the opposite policy and higher priority would change the
meaning (suppose 'q' is minimized and earlier tags have higher priority):
"a"* @q @p "a"*
This all makes regexps somewhat fragile, since adding just one tag may change
the whole regexp (and in the presence of other tags, it may change parse
results).
But even worse, with tag-centric disambiguation simple (untagged) subexpressions
are inferior to tagged ones. This contradicts intuition and even POSIX standard,
where a*(a*) is the same as (a*)(a*) . We don't keep to POSIX (and we can't
because of last-longest instead of first-longest semantics of iteration as shown
by the (<|<a|<ab|<aba|abab|baba|b>|>)* example), but even this somewhat
tag-centric standard respects non-captured parts of the regexp. Note that
Laurikari in his TRE library has to do special preprocessing of the regexp in
order to assign correct minimization/maximisation direction to each tag
depending on the context of the tag. His original tag-centric policy (open tag
is minimized and has higher priority than closing tag which is maximized,
nested tags have lower priority) is insufficient to accommodate POSIX.
Ulya Trofimovich [Mon, 13 Feb 2017 15:01:12 +0000 (15:01 +0000)]
In greedy regexps first alternative must correspond to consuming path.
By convention first alternative has higher priority.
So, for example, the following must be true:
r* = rr* | <empty>
r{n,m} = r{n} (r{m - n} | r{m - n - 1} | ... | r{1} | <empty>)
r{n,} = r{n} (rr* | <empty>)
For now we don't care about priorities: this is a preparatory step
before transition to greedy leftmost semantics for tags.
Use 'normal form' of version matrix to test if states are mappable.
Normal form is calculated for every new state before searching for
a mappable counterpart. All versions are renumbered to occupy consequent
numbers starting from one; version order is preserved.