Ulya Trofimovich [Tue, 12 May 2015 09:57:27 +0000 (10:57 +0100)]
Moved lexing functions to proper file.
For some reason 're2c::Scanner' methods were defined in file
'src/codegen/code.cc'. Mpved them to 'src/parser/scanner.cc'
and renamed 'src/parser/scanner.re' to 'src/parser/scanner_lex.re'.
Ulya Trofimovich [Mon, 11 May 2015 16:46:26 +0000 (17:46 +0100)]
Reduced "src/codegen/print.h" dependency for some unrelated files.
This is the first part of effort to reduce the total number of
interdependencies between different files.
"src/codegen/print.h" contains some pretty-printing functions
mostly used for code generation. There two other cases where.
these functions can also be useful:
1. debug: it's not yet well developed, just some messy chunks
of code that are commented out
2. error messages: character pretty-printing is not actually
very useful, since error messages mostly contain printable
characters (pieces of user-supplied input)
From the above I conclude that non-codegen files don't really need
"src/codegen/print.h". If this will change, I'll consider moving
this functions to src/util.
Ulya Trofimovich [Mon, 11 May 2015 10:13:34 +0000 (11:13 +0100)]
Added mingw builds for windows.
To enable mingw build, confugure with "--host i686-w64-mingw32".
Simple 'make' will then build re2c.exe executable for windows.
Note that 'make bootstrap' will not work: it will try to run
re2c.exe in order to recompile scanner and this will surely fail.
Testing with wine can be done using 'make wtests'.
Ulya Trofimovich [Mon, 11 May 2015 09:41:01 +0000 (10:41 +0100)]
Don't rebuild docs by default.
If we configure with "--enable-docs", then docs get rebuild every
time make is executed, and we'll have to commit these insignificant
changes every time, which is bad.
Major changes:
- Moved all re2c source files to a separate directory. Top source
directory should only contain autotools source files and a few other
files like README.
- Enabled out-of-source builds (and wrote a simple script build.sh
that makes out-of-source build).
- Improved portable variant of <stdint.h> header (src/c99_stdint.h):
now it relies only on some few defines in configure-generated
src/config.h (instead of checking for MSVC version and relying
on MSVC-defined stuff). Implementation follows C99 standard closely.
- Removed all windows-related build stuff. It was no use keeping it:
it's been broken for a long time and I can't maintain it.
- Removed all RPM stuff: distro-maintainers use their own hacks anyway.
Makefile.am is definetely the wrong place to keep such things.
A separete script and .spec file is a better idea, but again, nobody
uses it.
- added make target 'bootstrap' to make mainteiners' job easier.
- Merged lessons and examples into one.
- Updated README and doc/index.html.
- Run tests in parallel by default.
Changes concerning particular files:
- configure.ac:
- removed autoconf version check: developers will need the right
version anyway (otherwize autoconf will reject to work);
users should use configure script provided by package
distribution
- removed GCC version (3 or above) check: it's pretty ancient
and I don't know which features are missing anyway
- removed some useless checks (which resulted in defines in
src/config,h used by no one)
- introduced some new checks (used by src/c99_stdint.h)
- followed the advices of autoupdate
Makefile.am:
- explicitly prefixed all file names with $(srcdir) or $(builddir)
- removed windows and RPM related rules and targets
Simplified 'generate_paths_cover' a bit: NULL-state case is not
really needed, it's artificial and should be embedded into the
only case that needs it.
Estimate the size of data to be generated rather than the number
of paths in DFA. Returns precise size or 'Skeleton::MAX_PATHS',
whatever is less. Estimate data size in both cases for all paths
or path cover.
Simplified traversal of outgoing arrows when extending prefixes.
Instead of adding outgoing arrow sets (which was erroneous anyway:
they should remain unmodified for another recursion entry), simply
wrap iterator and keep iterating until both ingoing and outgoing
arrows are counted.
More code cleanup: pre-initialize paths for final states and
default state, so that NULL can be used as a terminating condition
for path-generating function. Fixed memleaks.
Tried to simplify data generation a bit: construct skeleton states
so that there's no links to NULL state. Default state is
represented with a state that has no rule and zero outgoing
arrows. Final states are represented with a state that has a rule,
but zero outgoing labels.
Reduced the amount of generated data (no exponential growth now).
Generate enough paths to cover all DFA transition (well, not
exactly all: only upper and lower bounds of a range). Respect
cycles (loop once).
Use a separate DFA-like structure for DFA skeleton. This allows
to optimize skeleton states for fast traversal (group transitions
by destination state) and avoid messing with the original DFA.
Added 'count' method to estimate the the amount of data that will
be generated (it worked too slow for the origianl DFA with
ungrouped transitions).
Output input data to a separate file (otherwize we'll have to
keep all generated data in memory, cause output has a complex
structure and cannot be written to file until it's fully
generated)
This reduces memory usage significantly (so that there remain no
memory consumption problems with "--skeleton" switch). However
on some files re2c generated too much data, e.g. case-insensitive
strings:
Generate check for input position in final states: first check
that input position equals to the expected one; second advance
input position to the beginning of next DFA path (this may be
necessary when the generated lexer rollbacks: it should resume
with a new DFA path rather than some default characters remaining
after rollback).
Track exact number of characters to be consumed on each input
string. Note that it's not equal to string length, since many
strings end up with characters from default transitions: on such
strings lexer must rollback to the latest backtracking point (that
is, the latest accepting state).
Moved input-string-generation-from-DFA algorithm earlier, to the
point when DFA is barely constructed and contains only 'Match'
states and starting state (no saving, accepting or split states).
This simplifies string generation: if destination state is NULL,
it is either next to final state (if all spans lead to it), or
default state (if some spans lead to other states).
A naive attempt to generate input strings from DFA. The problem is,
if the number of spans equals 1, it's hard to determine whether
it's some kind of a 'transit' state or a normal state with just one
span. All states have spans, but do they use them? The situation is
further complicated with 'readCh' which makes it hard to trace how
actions influence input operations.
Ulya Trofimovich [Mon, 30 Mar 2015 13:41:29 +0000 (14:41 +0100)]
Continued adding "--skeleton" switch.
Generate prolog and epilog in the form of a for-loop. The body
of the loop is the hard-coded DFA. The code in DFA final states
is substituted with "continue" statements.
Ulya Trofimovich [Wed, 18 Mar 2015 15:01:41 +0000 (15:01 +0000)]
Make 'Go' hierarchy independent of relabelling.
This allows to move 'Go' initialization loop to 'DFA::prepare'
and thus avoid ugly check if it is already initialized (it can
happen in '-r' mode when the same DFA is used multiple times).
Now that we store 'State *' pointers instead of labels in
'CpgotoTable', relabelling won't affect the generated code.
Ulya Trofimovich [Wed, 18 Mar 2015 14:35:19 +0000 (14:35 +0000)]
- Track used labels in a separate traversal of 'Go' graph
(first part of effort to reduce codegen to null device)
- Properly destruct 'Go' graph (46 test failing with '--valgrind'
because of early exiting on errors)
Ulya Trofimovich [Tue, 17 Mar 2015 16:00:52 +0000 (16:00 +0000)]
Split control flow codegen in two phases:
- First, re2c builds a complex structure where it stores
all control flow codegen decisions: nested ifs or switches,
bitmaps or computed gotos, etc.
- Second, this structure is traversed and code is generated.
This differentiation is necessary to compute some statistics
(e.g. used labels) in advance, before code generation.
Ulya Trofimovich [Thu, 12 Mar 2015 21:49:55 +0000 (21:49 +0000)]
Simplified codegen decision between switches/ifs.
All tests pass.
The previous condition made more sense: it was clear that the
author intended to consider some frequent corner cases.
But the condition was very tangled and yet too heuristic,
so I substituted it with a meaningless, but simple one.
I'm planning to simplify it even more later on.
- draw a single arrow for all transitions between two given states
- label all arrows with corresponding character ranges in square
brackests (no "default" label, single characters also appear in
square brackets)
- .dot output became much smaller, thus pictures are drawn faster
and generally look better: e.g. it takes ~10x less time to draw
PHP lexer and the resulting graph is shaped better.
- Use 'open' function instead of checking return status
(one may forget to check return status, but if one forgets to
open file, the error will be obvious)
- Introduced separate file type for header.
Header is much simpler than output, it doesn't need delayed
code fragments and can be generated in destructor.