Bug description: sometimes re2c would generate code that backups
current input position (e.g. 'YYMARKER = YYCURSOR'), but wouldn't
generate code that restores backuped position (e.g. 'YYCURSOR =
YYMARKER').
Analyses: DFA may have overlapping rules (e.g. "a" and "aaa").
In such cases, if the shorter rule matched, lexer must attempt to
match the longer one. If the longer rule also mathed, then lexer
prefers it to the shorter rule. If the longer rule didn't match,
lexer must backtrack input position to the point when the shorter
rule matched. In order to be able to backtrack, re2c must generate
backup code (e.g. 'YYMARKER = YYCURSOR') and restore code (e.g.
'YYCURSOR = YYMARKER').
In some rare cases DFA has overlapping rules, but if the shorter rule
matched, then the longer rule will always match (perhaps on an
arbitrary long input string), e.g.:
/*!re2c
[^]+ "a" { 1st }
"b" { 2nd }
*/
In this cases there's no need to generate backup code for 2nd rule:
lexer will either encounter final "a" and the 1st rule will match
or YYFILL will not return; anyway, restore code will never be run.
re2c used to output backup code but not restore code in such cases.
This is the bug: backup code is useless without restore code and
should be omitted.
In future re2c should warn about such cases (when the shorter of
two overlapping rules is shadowed by the longer one).
The fix: postpone insertion of save actions (those with backup code)
untill it is known if restore code will be generated.
I also removed obsolete global variable 'bUsedYYMarker', which was
always set to 'true' (it should be per-DFA, not per-block configuration
anyway).
We have one 'yyaccept' initialization per re2c block. Each block
consists of one or more DFA (multiple DFA in '-c' mode in case of
multiple conditions). Each DFA may or may not use 'yyaccept'
(that is, save 'yyaccept' in some states and have a dispatch state
based on saved 'yyaccept' value).
Description of the bug: in '-c' mode, sometimes a DFA would have
states that save 'yyaccept', but no dispatch state that uses that
saved values. DFA didn't actually need 'yyaccept' (all the
assignments vanished if other conditions that need 'yyaccept' were
removed).
The essence of the bug: re2c decided whether to output 'yyaccept'
related stuff on a per-block basis: for multiple conditions in the
same block, the same decision was made (if any condition needed
'yyaccept', all of them would to output it).
The fix: 'yyaccept' initialization should be done on a per-block
basis, while assignments to 'yyaccept' should be done on a per-DFA
basis. Also, 'yyaccept' initialization must be delayed, while
assignments to 'yyaccept' must not.
Note: we may consider per-DFA 'yyaccept' initialization (have a
local 'yyaccept' variable per DFA). This wouldn't conflict with '-f'
switch (as it might seem) as long as we name all the variables
'yyaccept' and don't generate any 'yyaccept' initializations with '-f'.
As automake manual (chapter 27.6 "Flag Variables Ordering") states,
CXXFLAGS is a user variable and should be left for users to override
C++ compiler flags. Thus we should leave CXXFLAGS as is and modify
AM_CXXFLAGS insted.
Instead of unconditionally setting CXXFLAGS in Makefile.am,
check the presence of a flag in configure.ac. If the flag is
present (that is, an attempt to compile an empty C++ program with
this flag is successful), then it is added to CXXFLAGS.
Now one can add *any* compiler flag in configure.ac without
worrying about portability.
The only problem with 'make distcheck' was that we needed write
access to top source directory ('make' wanted to overwrite bootstrap
parser if it was built with bison and 'make check' wanted to create
temporary files in 'test/' directory).
This commit fixes it:
- 'make' doesn't try to overwrite bootstrap if it is identical
to the existing one (must always be true for 'make distcheck')
- testing script makes a temporary directory and keeps all
temporary files there. If some tests failed, temporary files
for them are left and test sources and reference results are
copied into temporary directory to make debug more convenient.
This commit makes use of 'make distcheck' in release script.
Fixed build system to support automake's 'subdir-objects' feature.
As I updated automake to version 1.15 it began to produce lots of
warnings about 'subdir-objects' not used when it should have been.
Turns out that 'subdir-objects' will be on by default in 2.0.
So I tried to turn on 'subdir-objects' and builds began to fail:
automake didn't expand '$(sourcedir)' and '$(builddir)' prefixes.
I erroneously prepended these prefixes in commit 38f526d04415adb7b5e6bca228fc26409833f5c3 "Updated build system.",
as commit message says:
...
Makefile.am:
- explicitly prefixed all file names with $(srcdir) or $(builddir)
...
But automake prepends these prefixes already where necessary, except
for custom rules with side effects: if a custom rule touches some
files that are not explicit targets or dependencies of this rule,
then automake won't find these files unless they are in build directory.
We have such side-effectful custom rules:
- parser rule produces multiple files and touches bootstrap files
- scanner rule touches bootstrap file
- doc rules touch bootstrap files
Multiple files is a common problem of make. Bootstrap introduces
circular dependency, since bootstrap files need to be updated after
they've been used. So it's hard to get rid of side effects in these
rules.
This commit enabels 'subdir-objects' feature and removes all prefixes
in variables and adds them in side-effectful custom rules (for files
from source directory, not for files from build directory). It also
makes use of '$@' and '$<' special variables in custom rules (which
makes side effects more explicit).
Still I don't yet fully understand how automake uses things like
'$(sourcedir)' and '$(builddir)' and their relation with 'subdir-objects'
(it's probably related with non-recursive makefiles).
Ulya Trofimovich [Thu, 28 May 2015 11:16:06 +0000 (12:16 +0100)]
Don't output newline instead if label in initial DFA state.
Rationale: the equivalence of initial label to
're2c::label_counter_t::FIRST' is NOT a proper criterion
and pretty-printing shouldn't rely on it. The real criterion
is something like "(first re2c block OR any use block in '-r'
mode) AND first condition in'-c' mode", but it's spurious and
introduces unnecessary complications.
Droping this newline allows us drop equivalence operator for
labels.
Used the following bash script to ensure that all the changes
in tests are caused by missing newline(s):
# missing: only newlines and line directives
if [[ $diff1 -ne $((diff1_line + diff1_newline)) ]]
then
echo "FAIL1: $f1"
exit 1
fi
# added: only line directives
if [[ $diff2 -ne $diff2_line ]]
then
echo "FAIL2: $f1"
exit 1
fi
# the number of missing line directives
# equals to the number of added line directives
if [[ $diff1_line -ne $diff2_line ]]
then
echo "FAIL4: $f1"
exit 1
fi
done
Ulya Trofimovich [Wed, 27 May 2015 10:50:55 +0000 (11:50 +0100)]
Moved start label configuration out of global scope.
There are two configurations:
1. re2c:startlabel = <integer>;
2. re2c:startlabel = <string>;
The scope of these configurations is the scope of current re2c block.
Ulya Trofimovich [Tue, 26 May 2015 20:34:56 +0000 (21:34 +0100)]
Output user-defines start label in the appropriate place.
Before this commit, given the following example:
/*!re2c
re2c:startlabel = "start";
[^]* {}
*/
re2c would generate the following code:
{
YYCTYPE yych;
goto yy0;
yy1:
start:
++YYCURSOR;
yy0:
if (YYLIMIT <= YYCURSOR) YYFILL(1);
yych = *YYCURSOR;
goto yy1;
{}
}
where "start:" falsely corresponds to "yy1:" rather to "yy0:".
(The important property of this example is that DFA has arrows
to initial state.) This commit fixes this behavior:
{
YYCTYPE yych;
start:
goto yy0;
yy1:
++YYCURSOR;
yy0:
if (YYLIMIT <= YYCURSOR) YYFILL(1);
yych = *YYCURSOR;
goto yy1;
{}
}
Ulya Trofimovich [Tue, 26 May 2015 13:00:01 +0000 (14:00 +0100)]
Don't hide the ugly fact that default state in '-f' mode is always state 0.
In '-f' mode, state dispatch generation can be triggered in two ways.
Both ways use 're2c::OutputFile::insert_state_goto', which generates
state dispatch only if it hasn't been already generated. The two ways
are:
1. Explicitly, using '/*!getstate:re2c*/'. In this case default state
must be state 0 because it's hardcoded in the invocation of
're2c::OutputFile::insert_state_goto' in 'src/parse/scanner_lex.re'.
2. Implicitly, in 're2c::DFA::emit'. In this case default state must
be state 0 because if 'prolog_label' is not 0, it means that it's
not the first time 're2c::DFA::emit' is called and state dispatch
has already been generated.
This commit makes it explicit that re2c always uses state 0.
Note:
Currently in '-f' mode re2c generates one global dispatch for
the whole file (the enumeration of yyFillLabel's is also global).
All re2c blocks share the same state dispatch, so in '-f' mode all
re2c blocks must reside in the same function and must be parts of
the same lexer (exception: in '-r' mode re2c generates one state
dispatch per use block).
This is clearly an ugly limitation: one is forced put disconnected
lexers in different files in '-f' mode.
Now re2c provides conditions as a way to express related blocks,
so if users used multiple blocks only for unrelated lexers, we
could safely limit the scope of state dispatch to a single block.
But conditions can conflict with other re2c features, they are
a bit broken and I'm pretty sure some users use multiple blocks
(e.g. I used to do it).
Thus we cannot just make '-f' generate state dispatch on per-block
basis (there're some other obstacles: it's not quite clear which
block '/*!getstate:re2c*/ directive is related to, etc.).
I leave the situation 'as is' until better times (when lexer-
parser loop is fixed and the whole code generation model is more
robust).
Ulya Trofimovich [Sun, 24 May 2015 20:38:55 +0000 (21:38 +0100)]
Clarify which label is relevant to initial state.
Initial state is different from all other states in that should
not advance YYCURSOR when it's entered first time (when DFA is
entered). But it should advance YYCURSOR like any normal state if
it's entered from any other part of DFA (other DFA states may well
lead to the inital state, e.g. '[^]*')
That's why inital state is split into two parts, each marked by a
separate label: normal state label points to the code that advances
YYCURSOR, and special label points right after that code (skips it).
It happens so that normal state label is equal to special initial
label plus one, but we shouldn't rely on that.
Ulya Trofimovich [Thu, 21 May 2015 13:38:13 +0000 (14:38 +0100)]
Compare DFA states rather than labels.
This is first step to postpone labelling states until all the
code is actually generated (in the form of some structure in
memory) and it's time to pretty-print it to file.
Labels shouldn't be mixed up with states (in particular, they
shouldn't be used as a state identifier).
Ulya Trofimovich [Wed, 20 May 2015 08:34:26 +0000 (09:34 +0100)]
Finally removed auxiliary code generation pass to 'null device'.
This pass was used to gather statistics:
- which labels are used (to avoid 'unused label' warnings from
C/C++ compiler)
- in '-f' mode, how many times YYFILL is called (to generate
dispatch to the arrpopriate YYFILL call and resume lexing from
there)
This commit deals with second case: it makes counting YYFILL calls
independent of auxiliary pass. Counting relies on variable
're2c::last_fill_index', which is updated in 're2c::need' function
(note that it is crucial that 're2c::need' is always called prior
to generation of state dispatch).
Ulya Trofimovich [Tue, 19 May 2015 17:19:24 +0000 (18:19 +0100)]
Another part of tracking label usage moved out from codegen.
This is part of effort to reduce code generation to 'null device'
(the same code is generated twice only to gather statistics on
label usage). In order to reduce this evil pass, we should be able
to track label usage before code generation.
Ulya Trofimovich [Sat, 16 May 2015 11:01:58 +0000 (12:01 +0100)]
Don't copy <*> regexps for each condition.
There're two major things to care about in this situation:
1. <*> rules must have the lowest priority: in order to guarantee
it, we re-iterate <*> regexps after all other rules have been
parsed and fix <*> regexps priority.
2. <*> regexps must be compiled to instrictions separately for each
condition: this is guaranteed by assigning them
're2c::RegExp::PRIVATE' attribute.
Note that 're2c::RuleOp::accept' member stores rule priority.
These priorities don't have to be consecutive, only the right order
must be maintained. Later on in 're2c::DFA::prepare' they result in
consecutive 'yyaccept' values.
Ulya Trofimovich [Thu, 14 May 2015 17:20:13 +0000 (18:20 +0100)]
Forbid copying of 're2c::Substr'.
Removed useless methods (most of them became useless after
're2c::Str' removal) and obsolete autoconf check for 'strndup'.
Removed 'Scanner::token' methods. They used
'Scanner::check_token_length', which was pretty useless:
1. checking for the lower should have always succeed, because
'Scanner::tok' is always set to buffer start in 'Scanner::fill'
and if the token was too long, it's start will be lost anyway.
2. checking for the upper bound could fail if re2c dev passed some
trash into it, but any normal function would do so and this is
no particular reason to have special runtime checks here.
Now substrings are constructed in lexer, where all the lengths
and bounds are easier to verify from lexing context (for re2c dev).
Ulya Trofimovich [Thu, 14 May 2015 12:58:29 +0000 (13:58 +0100)]
Pass unquoted strings to parsing functions.
In most cases re2c accepts single-quoted or double-quoted strings
in regexp specifications, but if flex-like syntax if enabled, then
re2c accepts unquoted strings.
Before this commit, functions that parse strings into regexps
expected quoted strings, and we had to add quotes in case of
flex-like syntax. That was very inconvenient and in fact
unnecessary, since the first thing parsing functions do is get rid
of quotes.
Ulya Trofimovich [Thu, 14 May 2015 11:25:24 +0000 (12:25 +0100)]
Simplified handling of named definitions in parser.
Don't bother with named definitions in lexer, just pass them as
strings to parser. Parser will recognize named definitions, insert
them into symbol table and handle conflicts.
Use simple 'std::map' instead 're2c::Symbol' class (that hides
symbol table in class static member).
Use 'std::string' instead of 're2c::Str'. Due to bison limitations
we have to pass pointers to strings allocated on the heap and
carefully destroy them. The whole thing is quite error prone, so
maybe I'll make a small slab allocator for parser later on.
Ulya Trofimovich [Tue, 12 May 2015 11:26:07 +0000 (12:26 +0100)]
Split 'src/codegen/code.cc' into parts.
First, DFA is built ('re2c::DFA::DFA'), then it must be prepared
for code generation: some states must be split, backtracking points
must be marked, etc. ('re2c::DFA::prepare'), then finally code
can be generated ('re2c::DFA::genCode').
I haven't yet fully decided whether second stage (preparing) is
closer to DFA construction in general (and thus should be moved to
'src/dfa') or to code generation (and should be moved to 'src/codegen').
Since it deals a lot with bitmaps, second variant will suffice for
now. Perhaps later on I'll split preparation into general and
codegen-related parts.
Ulya Trofimovich [Tue, 12 May 2015 10:32:11 +0000 (11:32 +0100)]
Renamed struct to avoid confusion.
're2c::BitMap' represents actual bitmaps used for code generation,
while 're2c::GoBitmap' (former 're2c::Bitmap') is a node type in
graph representation of the program right before code generation.
Names used in 're2c::Go' subsystem are really ugly and will be
renamed to something more sensible later on.
Ulya Trofimovich [Tue, 12 May 2015 09:57:27 +0000 (10:57 +0100)]
Moved lexing functions to proper file.
For some reason 're2c::Scanner' methods were defined in file
'src/codegen/code.cc'. Mpved them to 'src/parser/scanner.cc'
and renamed 'src/parser/scanner.re' to 'src/parser/scanner_lex.re'.
Ulya Trofimovich [Mon, 11 May 2015 16:46:26 +0000 (17:46 +0100)]
Reduced "src/codegen/print.h" dependency for some unrelated files.
This is the first part of effort to reduce the total number of
interdependencies between different files.
"src/codegen/print.h" contains some pretty-printing functions
mostly used for code generation. There two other cases where.
these functions can also be useful:
1. debug: it's not yet well developed, just some messy chunks
of code that are commented out
2. error messages: character pretty-printing is not actually
very useful, since error messages mostly contain printable
characters (pieces of user-supplied input)
From the above I conclude that non-codegen files don't really need
"src/codegen/print.h". If this will change, I'll consider moving
this functions to src/util.
Ulya Trofimovich [Mon, 11 May 2015 10:13:34 +0000 (11:13 +0100)]
Added mingw builds for windows.
To enable mingw build, confugure with "--host i686-w64-mingw32".
Simple 'make' will then build re2c.exe executable for windows.
Note that 'make bootstrap' will not work: it will try to run
re2c.exe in order to recompile scanner and this will surely fail.
Testing with wine can be done using 'make wtests'.
Ulya Trofimovich [Mon, 11 May 2015 09:41:01 +0000 (10:41 +0100)]
Don't rebuild docs by default.
If we configure with "--enable-docs", then docs get rebuild every
time make is executed, and we'll have to commit these insignificant
changes every time, which is bad.
Major changes:
- Moved all re2c source files to a separate directory. Top source
directory should only contain autotools source files and a few other
files like README.
- Enabled out-of-source builds (and wrote a simple script build.sh
that makes out-of-source build).
- Improved portable variant of <stdint.h> header (src/c99_stdint.h):
now it relies only on some few defines in configure-generated
src/config.h (instead of checking for MSVC version and relying
on MSVC-defined stuff). Implementation follows C99 standard closely.
- Removed all windows-related build stuff. It was no use keeping it:
it's been broken for a long time and I can't maintain it.
- Removed all RPM stuff: distro-maintainers use their own hacks anyway.
Makefile.am is definetely the wrong place to keep such things.
A separete script and .spec file is a better idea, but again, nobody
uses it.
- added make target 'bootstrap' to make mainteiners' job easier.
- Merged lessons and examples into one.
- Updated README and doc/index.html.
- Run tests in parallel by default.
Changes concerning particular files:
- configure.ac:
- removed autoconf version check: developers will need the right
version anyway (otherwize autoconf will reject to work);
users should use configure script provided by package
distribution
- removed GCC version (3 or above) check: it's pretty ancient
and I don't know which features are missing anyway
- removed some useless checks (which resulted in defines in
src/config,h used by no one)
- introduced some new checks (used by src/c99_stdint.h)
- followed the advices of autoupdate
Makefile.am:
- explicitly prefixed all file names with $(srcdir) or $(builddir)
- removed windows and RPM related rules and targets
Simplified 'generate_paths_cover' a bit: NULL-state case is not
really needed, it's artificial and should be embedded into the
only case that needs it.
Estimate the size of data to be generated rather than the number
of paths in DFA. Returns precise size or 'Skeleton::MAX_PATHS',
whatever is less. Estimate data size in both cases for all paths
or path cover.
Simplified traversal of outgoing arrows when extending prefixes.
Instead of adding outgoing arrow sets (which was erroneous anyway:
they should remain unmodified for another recursion entry), simply
wrap iterator and keep iterating until both ingoing and outgoing
arrows are counted.
More code cleanup: pre-initialize paths for final states and
default state, so that NULL can be used as a terminating condition
for path-generating function. Fixed memleaks.
Tried to simplify data generation a bit: construct skeleton states
so that there's no links to NULL state. Default state is
represented with a state that has no rule and zero outgoing
arrows. Final states are represented with a state that has a rule,
but zero outgoing labels.
Reduced the amount of generated data (no exponential growth now).
Generate enough paths to cover all DFA transition (well, not
exactly all: only upper and lower bounds of a range). Respect
cycles (loop once).
Use a separate DFA-like structure for DFA skeleton. This allows
to optimize skeleton states for fast traversal (group transitions
by destination state) and avoid messing with the original DFA.
Added 'count' method to estimate the the amount of data that will
be generated (it worked too slow for the origianl DFA with
ungrouped transitions).
Output input data to a separate file (otherwize we'll have to
keep all generated data in memory, cause output has a complex
structure and cannot be written to file until it's fully
generated)
This reduces memory usage significantly (so that there remain no
memory consumption problems with "--skeleton" switch). However
on some files re2c generated too much data, e.g. case-insensitive
strings:
Generate check for input position in final states: first check
that input position equals to the expected one; second advance
input position to the beginning of next DFA path (this may be
necessary when the generated lexer rollbacks: it should resume
with a new DFA path rather than some default characters remaining
after rollback).
Track exact number of characters to be consumed on each input
string. Note that it's not equal to string length, since many
strings end up with characters from default transitions: on such
strings lexer must rollback to the latest backtracking point (that
is, the latest accepting state).
Moved input-string-generation-from-DFA algorithm earlier, to the
point when DFA is barely constructed and contains only 'Match'
states and starting state (no saving, accepting or split states).
This simplifies string generation: if destination state is NULL,
it is either next to final state (if all spans lead to it), or
default state (if some spans lead to other states).