Moved the algorithm that generates input strings from the DFA
earlier, to the point where the DFA has just been constructed and
contains only 'Match' states and the starting state (no saving,
accepting or split states).
This simplifies string generation: if the destination state is NULL,
it is either next to the final state (if all spans lead to it), or
the default state (if some spans lead to other states).
A naive attempt to generate input strings from the DFA. The problem
is that if a state has exactly one span, it's hard to determine
whether it is some kind of 'transit' state or a normal state with
just one span. All states have spans, but do they use them? The
situation is further complicated by 'readCh', which makes it hard to
trace how actions influence input operations.
Ulya Trofimovich [Mon, 30 Mar 2015 13:41:29 +0000 (14:41 +0100)]
Continued adding "--skeleton" switch.
Generate prolog and epilog in the form of a for-loop. The body
of the loop is the hard-coded DFA. The code in DFA final states
is substituted with "continue" statements.
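A rough illustration of the shape described above, not the code re2c
actually emits: the function name, the input array and the tiny DFA
(which accepts only the string "ab") are all made up for this sketch.

```cpp
#include <cstddef>

// Hypothetical skeleton: a for-loop (prolog/epilog) wrapped around a
// hard-coded DFA. Returns how many inputs the DFA failed to match.
std::size_t run_skeleton(const char *const *inputs, std::size_t count)
{
    std::size_t failures = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const char *cursor = inputs[i];
        // DFA state 0: expect 'a'
        if (*cursor++ != 'a') goto fail;
        // DFA state 1: expect 'b'
        if (*cursor++ != 'b') goto fail;
        // Final state: user code is replaced with "continue"
        continue;
fail:
        ++failures;
    }
    return failures;
}
```

Called with the inputs "ab", "ax", "" and "a", this sketch reports
three failures, since only the first input reaches the final state.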
Ulya Trofimovich [Wed, 18 Mar 2015 15:01:41 +0000 (15:01 +0000)]
Make 'Go' hierarchy independent of relabelling.
This makes it possible to move the 'Go' initialization loop to
'DFA::prepare' and thus avoid the ugly check of whether it is already
initialized (this can happen in '-r' mode when the same DFA is used
multiple times).
Now that we store 'State *' pointers instead of labels in
'CpgotoTable', relabelling won't affect the generated code.
Ulya Trofimovich [Wed, 18 Mar 2015 14:35:19 +0000 (14:35 +0000)]
- Track used labels in a separate traversal of 'Go' graph
(first part of effort to reduce codegen to null device)
- Properly destruct 'Go' graph (46 tests were failing with
'--valgrind' because of early exits on errors)
Ulya Trofimovich [Tue, 17 Mar 2015 16:00:52 +0000 (16:00 +0000)]
Split control flow codegen in two phases:
- First, re2c builds a complex structure where it stores
all control flow codegen decisions: nested ifs or switches,
bitmaps or computed gotos, etc.
- Second, this structure is traversed and code is generated.
This differentiation is necessary to compute some statistics
(e.g. used labels) in advance, before code generation.
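A minimal sketch of the two-phase idea, assuming a much simpler
structure than the real 'Go' hierarchy. 'Branch', 'Decision',
'collect_used_labels' and 'emit' are hypothetical names used only for
illustration.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical record of one codegen decision: a chain of ifs.
struct Branch {
    char upper_bound; // branch taken if yych <= upper_bound
    unsigned target;  // label of the destination state
};

struct Decision {
    std::vector<Branch> branches;
    unsigned fallthrough; // label used when no branch matches
};

// Statistics pass: runs over the stored decisions, no code is emitted.
void collect_used_labels(const Decision &d, std::set<unsigned> &used)
{
    for (const Branch &b : d.branches) {
        used.insert(b.target);
    }
    used.insert(d.fallthrough);
}

// Second phase: traverse the same structure and generate code.
std::string emit(const Decision &d)
{
    std::string out;
    for (const Branch &b : d.branches) {
        out += "if (yych <= '";
        out += b.upper_bound;
        out += "') goto yy" + std::to_string(b.target) + ";\n";
    }
    out += "goto yy" + std::to_string(d.fallthrough) + ";\n";
    return out;
}
```

Because the decisions are stored before anything is printed, the set
of used labels is available by the time 'emit' runs.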
Ulya Trofimovich [Thu, 12 Mar 2015 21:49:55 +0000 (21:49 +0000)]
Simplified codegen decision between switches/ifs.
All tests pass.
The previous condition made more sense: it was clear that the
author intended to consider some frequent corner cases.
But the condition was very tangled and yet too heuristic,
so I substituted it with a meaningless, but simple one.
I'm planning to simplify it even more later on.
- draw a single arrow for all transitions between two given states
- label all arrows with corresponding character ranges in square
brackets (no "default" label, single characters also appear in
square brackets)
- .dot output became much smaller, thus pictures are drawn faster
and generally look better: e.g. it takes ~10x less time to draw
PHP lexer and the resulting graph is shaped better.
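The merging of arrows can be sketched roughly as follows. This is an
assumption-laden illustration, not re2c's dot backend: 'Transition'
and 'to_dot' are hypothetical, and escaping of special characters in
labels is deliberately left out.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical transition on the character range [lo, hi].
struct Transition {
    unsigned from;
    unsigned to;
    char lo;
    char hi;
};

// Emit one arrow per (from, to) pair, labelled with all ranges that
// lead from 'from' to 'to', wrapped in square brackets.
std::string to_dot(const std::vector<Transition> &transitions)
{
    std::map<std::pair<unsigned, unsigned>, std::string> labels;
    for (const Transition &t : transitions) {
        std::string &label = labels[std::make_pair(t.from, t.to)];
        label += t.lo;
        if (t.hi != t.lo) {
            label += '-';
            label += t.hi;
        }
    }
    std::string out;
    for (const auto &entry : labels) {
        out += std::to_string(entry.first.first) + " -> "
            + std::to_string(entry.first.second)
            + " [label=\"[" + entry.second + "]\"]\n";
    }
    return out;
}
```

Two transitions from state 1 to state 2 on 'a'-'z' and '0'-'9' thus
collapse into the single line `1 -> 2 [label="[a-z0-9]"]`.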
- Use 'open' function instead of checking return status
(one may forget to check return status, but if one forgets to
open file, the error will be obvious)
- Introduced separate file type for header.
Header is much simpler than output, it doesn't need delayed
code fragments and can be generated in destructor.
Ulya Trofimovich [Thu, 26 Feb 2015 10:48:41 +0000 (10:48 +0000)]
Removed unused enum members.
I was unsure whether delayed generation was also needed for
genCondGoto and genCondTable, so I kept those enum members as a
reminder. Now I know that all conditions are known by the time the
re2c block is parsed and code generation starts.
Ulya Trofimovich [Wed, 25 Feb 2015 23:13:55 +0000 (23:13 +0000)]
One pass.
The second pass was needed because some information (which influences
early parts of the generated code, e.g. the enum with condition names
or the YYMAXFILL definition) becomes available only at the end of the
first pass.
I isolated all of these things (I hope) and generate stubs for them,
which are filled later. I restructured output as follows: the whole
output consists of source and header, each of them is a list of
blocks (corresponding to re2c blocks in source file), each block
is a list of code fragments (which can be either regular strings
with code or stubs that will be filled later).
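The layout described above might be modelled like this. It is only a
sketch: 'Fragment', 'Block', 'OutputFile' and 'render' are invented
names, and a single YYMAXFILL stub stands in for all delayed
fragments.

```cpp
#include <string>
#include <vector>

// Hypothetical code fragment: either ready text or a stub that can
// only be filled once the whole input has been processed.
struct Fragment {
    enum Kind { CODE, STUB_YYMAXFILL } kind;
    std::string text; // used only when kind == CODE
};

typedef std::vector<Fragment> Block;   // one re2c block
typedef std::vector<Block> OutputFile; // source or header

// Fill the stubs and concatenate all fragments in a single pass.
std::string render(const OutputFile &file, unsigned max_fill)
{
    std::string out;
    for (const Block &block : file) {
        for (const Fragment &fragment : block) {
            if (fragment.kind == Fragment::CODE) {
                out += fragment.text;
            } else {
                out += "#define YYMAXFILL " + std::to_string(max_fill) + "\n";
            }
        }
    }
    return out;
}
```

The stub may sit at the very start of the output even though its
value is known only when 'render' is finally called.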
Ulya Trofimovich [Mon, 23 Feb 2015 13:30:42 +0000 (13:30 +0000)]
Added tests from PHP repository: https://github.com/php/php-src
Test results are almost identical to re2c-0.13.6
(there are a few changes; I believe they are due to
commit 255262b02928d3f38c00dd91952e3253c11c78f1 and
are completely harmless).
Ulya Trofimovich [Sun, 18 Jan 2015 14:12:16 +0000 (14:12 +0000)]
Replaced "YYHAS (n)" with "YYEOI (n)".
The actual meaning of this primitive is to check whether
fewer than n characters are left in the input stream,
e.g. "(YYLIMIT - YYCURSOR) < n" or any equivalent check.
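For a plain pointer-based input, the check could look roughly like
the sketch below. The 'Input' struct and 'yyeoi' function are
hypothetical; only the comparison itself comes from the message
above.

```cpp
#include <cstddef>

// Hypothetical input state; the fields play the roles of YYCURSOR
// and YYLIMIT.
struct Input {
    const char *cursor; // next character to read
    const char *limit;  // one past the last available character
};

// True if fewer than n characters are left in the input.
inline bool yyeoi(const Input &in, std::size_t n)
{
    return static_cast<std::size_t>(in.limit - in.cursor) < n;
}
```

With three characters left, 'yyeoi(in, 3)' is false and
'yyeoi(in, 4)' is true.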
Ulya Trofimovich [Sun, 18 Jan 2015 13:47:51 +0000 (13:47 +0000)]
Added tests for "--input custom".
This required modifying runtests.sh, as it couldn't handle
test names of the form "basename.--long-switch.re":
it inserted '-' in front of all switches.
Ulya Trofimovich [Tue, 13 Jan 2015 15:30:01 +0000 (15:30 +0000)]
A little cleanup of new input API:
- moved enum and pretty-printing functions to a class
- renamed files 'input.{h,cc}' to 'input_api.{h,cc}'
- for "--input istream": moved input position increment to 'stmt_restorectx'
- main.cc: removed useless include
Double-escape special characters for dot.
Example:
17 -> 18 [label="\n"]
results in an "unlabeled" arrow in the rendered graph, but
17 -> 18 [label="\\n"]
is ok.
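One simple way to get this effect is to double every backslash in
label text that already contains C-style escapes. 'escape_for_dot' is
a hypothetical helper, not the function used in re2c, and it does not
handle double quotes inside labels.

```cpp
#include <string>

// Double each backslash so that dot renders the escape sequence
// literally instead of interpreting it.
std::string escape_for_dot(const std::string &label)
{
    std::string out;
    for (char c : label) {
        if (c == '\\') {
            out += "\\\\"; // one backslash in, two backslashes out
        } else {
            out += c;
        }
    }
    return out;
}
```

Given the two-character label text `\n`, the helper returns the
three-character text `\\n`, which matches the second form in the
example above.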
Ulya Trofimovich [Fri, 22 Aug 2014 20:15:11 +0000 (23:15 +0300)]
Alternation of 'RegExp's should preserve 'ins_access' attribute.
When one builds 'AltOp' from two 'RegExp's, one sometimes has to
break these 'RegExp's into pieces in order to merge their common prefix.
In such cases, if one of the original 'RegExp's has 'ins_access'
set to 'PRIVATE', it is lost (defaults to 'SHARED') after alternation.
This commit fixes Gentoo bug https://bugs.gentoo.org/show_bug.cgi?id=518904.