Ulya Trofimovich [Mon, 14 Mar 2016 22:23:03 +0000 (22:23 +0000)]
Skeleton: simplified path structure.
Just store pointers to skeleton nodes (instead of bookkeeping arcs,
contexts and rules in each path). All the necessary information can
easily be retrieved from the nodes when a path is being dumped to file.
Three tests in '--skeleton' mode have been broken by this commit.
These are not real breakages: these cases reveal incorrect
re2c-generated code. The change is due to the fact that the skeleton
no longer simulates contexts that go *after* the matched rule:

    ------o------o------> ... (fallback to rule)
      rule   context
Ulya Trofimovich [Mon, 22 Feb 2016 10:22:43 +0000 (10:22 +0000)]
Code cleanup: factored RuleInfo out of RuleOp.
Rule information (line, attached code block, rank, shadowing set, etc.)
is used throughout the program. Before this patch, rule information was
inlined in RuleOp. That was inconvenient because RuleOp belongs to the
early stages of compilation (AST, prior to NFA), but it had to live
throughout the whole program.
Ulya Trofimovich [Wed, 17 Feb 2016 16:12:59 +0000 (16:12 +0000)]
Simplified tracking of fixed-length trailing contexts.
Static (that is, fixed-length) trailing contexts don't need their
position recorded with YYCTXMARKER and restored on successful match.
They can be tracked simply by decreasing the input position by the
context length.
Ulya Trofimovich [Wed, 17 Feb 2016 14:46:43 +0000 (14:46 +0000)]
Simplified [-Wmatch-empty-rule] analysis.
Before this patch [-Wmatch-empty-rule] was based on:
- DFA structural analysis (skeleton phase)
- rule reachability analysis (skeleton phase)
Now it is based on:
- NFA structural analysis (NFA phase)
- rule reachability analysis (skeleton phase)
It is much easier to find nullable rules in an NFA than in a DFA.
The problem with the DFA is rules with trailing context, both
dynamic and especially static (which leaves no trace in DFA
states). re2c currently treats static context as dynamic, but
this will change soon.
On the other hand, NFA analysis may give some false positives because
of unreachable rules:
    [^] {}
    ""  {}
infinite rules:
    [^]* {}
or self-shadowing rules:
    [^]?
Reachability analysis in the skeleton helps to filter out unreachable
and infinite rules, but not self-shadowing ones.
Ulya Trofimovich [Sat, 16 Jan 2016 23:07:17 +0000 (23:07 +0000)]
Stabilized the list of shadowing rules reported by [-Wunreachable-rules].
Before this commit, the list of rules depended on the order of NFA states
in each DFA state under construction (which is simply a matter of the
ordering of heap pointers: the order can differ between runs).
Now all rules for each DFA state are collected and the final choice of
rule is delayed until the DFA is constructed, so the order of NFA states
no longer matters.
Ulya Trofimovich [Mon, 11 Jan 2016 15:01:05 +0000 (15:01 +0000)]
Moved YYFILL points calculation to the earlier stage of DFA construction.
No serious changes intended (mostly cleanup and comments).
The underlying algorithm for finding strongly connected components
(SCC) remains the same: it's a slightly modified Tarjan's algorithm.
We now mark non-YYFILL states by setting the YYFILL argument to zero,
which is only logical: why would anyone call YYFILL to provide zero
characters? In fact, re2c didn't generate the 'YYFILL(0)' call itself,
but some remnants of YYFILL did remain (which caused changes in tests).
Serialize '--skeleton' generated data in little-endian.
This commit fixes bug #132 "test failure on big endian archs with 0.15.3".
Tests failed because re2c with '--skeleton' option used host endianness
when serializing binary data to file. Expected test result was generated
on little-endian arch, while actual test was run on big-endian arch.
Only three tests failed (out of ~40 tests that are always run with
'--skeleton'), because in most cases data unit is 1 byte and endianness
doesn't matter.
The fix: re2c now converts binary data from host-endian to little-endian
before dumping it to file. Skeleton programs convert data back from
little-endian to host-endian when reading it from file (iff data unit
size is greater than 1 byte).
Ulya Trofimovich [Thu, 31 Dec 2015 21:17:32 +0000 (21:17 +0000)]
Removed obsolete code deduplication mechanism.
This mechanism was tricky and fragile; it cost us a most unfortunate
bug in the PHP lexer: https://bugs.gentoo.org/show_bug.cgi?id=518904
(and a couple of other bugs).
Now that re2c does DFA minimization, this mechanism is no longer needed.
Hooray!
The updated test changed because the skeleton is constructed prior to
DFA minimization.
Ulya Trofimovich [Thu, 31 Dec 2015 15:35:30 +0000 (15:35 +0000)]
Added DFA minimization and option '--dfa-minimization <table | moore>'.
Test results changed a lot; it is next to impossible to verify them
by hand. I therefore implemented two different minimization algorithms:
- "table filling" algorithm (simple and inefficient)
- Moore's algorithm (not so simple and efficient enough)
They produce identical minimized DFAs (up to state relabelling), thus
giving some confidence that the resulting DFA is correct.
I also checked the results with '--skeleton': re2c constructs the
skeleton prior to reordering and minimization, therefore the
skeleton-generated data is free of (potential) minimization errors.
Ulya Trofimovich [Wed, 30 Dec 2015 20:52:33 +0000 (20:52 +0000)]
Split DFA intermediate representation in two parts: DFA and ADFA.
ADFA stands for 'action DFA', that is, DFA with actions.
During DFA construction (aka NFA determinization) it is convenient
to represent DFA states as indexes into an array of states.
Later on, while binding actions, it is more convenient to store
states in a linked list.
Ulya Trofimovich [Sat, 19 Dec 2015 17:17:00 +0000 (17:17 +0000)]
Keep DFA states in a hash map (to speed up lookup of identical states).
This partially fixes bug #128: "very slow DFA construction (resulting
in a very large DFA)". DFA construction is no longer slow, but the
resulting DFA is still too large and needs to be minimized.
Ulya Trofimovich [Tue, 15 Dec 2015 12:44:47 +0000 (12:44 +0000)]
Base '+' (one or more repetitions) on '*' (zero or more repetitions).
Kleene star '*' (aka iteration, repetition, etc.) is a primitive
operation in regular expressions.
For some reason re2c used '+' as the primitive operation and expressed
'*' in terms of '+'. This is inconvenient, because all algorithms
described in the literature are based on '*'.
Because we now express 'a+' as 'a* a', we have to set the 'PRIVATE'
attribute on 'a': otherwise 'a' gets shared between the two occurrences,
which causes complex bugs.
Expressing 'a+' in the more intuitive way as 'a a*' rather than 'a* a'
causes the generated code to duplicate certain states. The generated
code is (supposedly) correct, but re2c fails to deduplicate these states.
We therefore prefer the 'a* a' expansion, which results in exactly the
same code as before.
Ulya Trofimovich [Mon, 14 Dec 2015 14:21:13 +0000 (14:21 +0000)]
Dropped the difference between left and right default rule (thanks to states reordering).
The bootstrap lexer changed a lot: this change is caused by commit a4c192f27ae8806e67a8ff311eeff53d74dacb71: "Reordered states in DFA.".
Changes in the parser by this commit triggered lexer regeneration.
re2c used a complex and slow algorithm to split a charset into
disjoint character ranges. This commit replaces the old algorithm
with a new one (much simpler and quicker).
re2c test suite now runs 2x faster due to speedup in Unicode tests.
Fixed '#include's (applied most of the 'include-what-you-use' suggestions).
The worst dependency which 'include-what-you-use' fails to see
(and rightly so) is 'src/parse/lex.re' -> 'src/parse/parser.h'.
This dependency is caused by '#include "y.tab.h"' in 'src/parse/lex.re'.
Another ubiquitous issue is 'src/util/c99_stdint.h' ('include-what-you-use'
suggests substituting it with '<stdint.h>').
And there are a couple of other dependencies that 'include-what-you-use'
fails to see.
Ulya Trofimovich [Mon, 30 Nov 2015 22:50:23 +0000 (22:50 +0000)]
Renamed tests that contained uppercase letters in file extension.
We use file extensions to encode re2c options.
Some (short) options are uppercase letters: e.g. '-D', '-F', '-S'.
There are also short options with the same lowercase letters: '-d', '-f', '-s'.
This can cause filename collisions on platforms with case-insensitive
filesystems (e.g. Windows and OS X).
See bug #125: "[OS X] git reports changes not staged for commit
in newly cloned repository".
Fix: use the long versions of the uppercase options.
Disallowed uppercase options in 'run_tests.sh'.
The problem with pattern ordering first emerged on FreeBSD-10.2
(I was able to reproduce it with 'CXXFLAGS=-fsanitize=address').
Some tests failed because patterns reported by '-Wundefined-control-flow'
were sorted in a different order than expected. This is because
pattern ordering was inconsistent: patterns were compared by length
only, which doesn't work for patterns of equal length. Now the first
ordering criterion is length, and the second criterion is
lexicographical order.
This commit reduces the amount of memory consumed by '-Wundefined-control-flow':
re2c no longer allocates vectors on the stack while depth-first-searching
the skeleton.
This commit also reduces the memory limit for '-Wundefined-control-flow'
(64Mb of edges -> 1Kb of edges). Real-world programs rarely need that much.
The limit was so high to accommodate a few artificial tests (with a lower
limit these tests cannot find the shortest patterns).
This commit also removes the upper bound on the number of faulty patterns
reported by '-Wundefined-control-flow'. This bound was needed by the
artificial tests mentioned above: they produce lots of patterns.
Now these tests are limited by 1Kb of edges anyway.
Note that the 1Kb limit is checked after each new pattern is added, so
at least one pattern will fit in (even if it takes more than 1Kb).
Ulya Trofimovich [Tue, 24 Nov 2015 17:51:25 +0000 (17:51 +0000)]
Skeleton data generation: suffix should be multipath as well as prefix.
The prefix of the current path under construction is a multipath, because
the prefix arcs have not been covered yet. The suffix can be a simple path
(that is, a multipath of width 1), because all alternative suffix arcs
have already been covered.
        prefix      suffix
      _________    _________
  ...          \  /
      ----------o
      _________/
But nothing prevents us from alternating the suffix arcs as well, as long
as the suffix remains a single multipath.
The resulting path's width is the maximum of the prefix and suffix widths
(hence the growth in size of those tests in which the suffix is wider
than the prefix), but it only makes a small difference. And the generated
paths are more "variable".
Ulya Trofimovich [Tue, 24 Nov 2015 16:36:14 +0000 (16:36 +0000)]
Skeleton data generation: cover all edges in 1-byte range (not only range bounds).
If code units occupy 1 byte, then the generated path cover includes
*all* edges of the original DFA. If the code unit size exceeds 1 byte,
then only some ~0x100 (or fewer) values from each range will be chosen
(including the range bounds).
Ulya Trofimovich [Sun, 29 Nov 2015 11:38:04 +0000 (11:38 +0000)]
Removed obsolete '__STDC_LIMIT_MACROS' and '__STDC_CONSTANT_MACROS' defines.
These defines were necessary to enable the numeric limit definitions
(such as 'UINT32_MAX') in our local version of 'stdint.h' (which is
used on platforms that lack the system header 'stdint.h').
Ulya Trofimovich [Sun, 29 Nov 2015 11:24:48 +0000 (11:24 +0000)]
Fixed [-Wconversion] warning.
Warning was introduced in commit b237daed2095c1e138761fb94a01d53ba2c80c95:
the compiler fails to recognize (or deliberately chooses not to recognize)
'std::numeric_limits<...>::max()' as a special constant.
Ulya Trofimovich [Sat, 28 Nov 2015 17:31:56 +0000 (17:31 +0000)]
Fixed crashes of 'ostream& operator<< (ostream& os, const char* s)' on NULL.
Crashes were observed on OS X (clang-7.0.0) and FreeBSD-10.2 (clang-3.4).
First reported in bug #122 "clang does not compile re2c 0.15.x".
What caused NULL to be passed to 'operator<<': re2c always generates the
content of the header file (regardless of the '-t --type-header' option),
but the content is dumped to a file (and the header filename initialized
to non-NULL) only if the option was enabled.
Fix: always initialize the header filename to a non-NULL string.