Ulya Trofimovich [Thu, 20 Aug 2015 11:09:52 +0000 (12:09 +0100)]
Check if code point exceeds maximum. Correctly cast 'char' to 'uint32_t'.
First fix:
re2c used to check if code point exceeds maximal value for current
encoding when parsing it. When I moved parsing code points to lexer
I forgot the check.
Second fix:
I assumed that 'static_cast<uint32_t>' on 'signed char' zero-extends.
But I was wrong: it sign-extends. Need to cast to 'unsigned char'
instead.
Ulya Trofimovich [Wed, 19 Aug 2015 17:49:28 +0000 (18:49 +0100)]
Parse unquoted flex-like strings in lexer.
Such strings can only contain ASCII letters, digits and uderscore:
no escapes. So there's very little parsing to do: just map code units
directly to code points.
Ulya Trofimovich [Wed, 19 Aug 2015 17:31:04 +0000 (18:31 +0100)]
Parse character strings in lexer.
This commit is a continuation of e42f271e4b284cd6ed590dc06a5239ce13942fc5:
"Parse character classes in lexer." It makes lexer parse individual code
points instead of barely recognizing string of code points.
Since the contents of case sensitive strings, case insensitive strings
and character classes are lexically the same (except for escaped quotes
or closing bracket), we can parse them all in the same lexer block.
Note that re2c allows any 1-byte code points. Maybe we should report
non-ASCII or non-printables.
Ulya Trofimovich [Tue, 18 Aug 2015 14:54:22 +0000 (15:54 +0100)]
Parse character classes in lexer.
Before this commit, lexer would barely recognize class as a whole
lexeme and pass it to functions that would further parse is by manually
picking individual code points and control characters out of the lexeme.
Heh, re2c is made for such kind of stuff (and it does it much better).
So now lexer parses individual code points or control characters and
stores them as a sequence of code points. Then an outer function
splits this sequence into ranges and individual characters.
I tried to preserve existing behaviour (judging from test suite only
the text of some error messages has changed). Added some autogenerated
tests (and the generator script itself), but these tests are not exhaustive.
Ulya Trofimovich [Sun, 16 Aug 2015 19:57:29 +0000 (20:57 +0100)]
Lexer: no need to have "{0,}" special case.
Now "{0,}" is handled by general case for counted repetition. Had to
fix test because of swapped alternatives in construction of regexps
of the form "r*" and "r{n,".
Ulya Trofimovich [Sun, 16 Aug 2015 19:02:04 +0000 (20:02 +0100)]
Force definition of some conditionally defined parts of <stdint.h>.
Got some errors about undefined 'UINT32_MAX' and others while trying
to build re2c with MINGW. These variables are defined conditionally in
<stdint.h> depending on the following defines:
__STDC_LIMIT_MACROS
__STDC_CONSTANT_MACROS
Fix: defined these symbols in the beginning of "src/util/c99_stdint.h".
Also found out that "src/util/c99_stdint.h" didn't compile in the
absense of system <stdint.h> (which is exactly the case when it was
written for). THe problem was caused by incorrect use of 'typedef'
statement; fixed easily. What a shame I didn't compile it before ;D
Ulya Trofimovich [Sun, 16 Aug 2015 11:08:21 +0000 (12:08 +0100)]
Use custom function instead of 'atoi' to read 32-bit integers.
Using 'atoi' displeased me for several reasons:
- 'atoi' doesn't check for overflow
- by the time 'atoi' was called, re2c has already parsed input
string and knows that it is well-formed, no need to do it twice
- atoi returns 'int', custom function allows to refine types and
avoid casts between signed/unsigned values or [-Wsign-conversion]
warnings
Ulya Trofimovich [Sat, 15 Aug 2015 16:42:41 +0000 (17:42 +0100)]
Don't use special regexp type for counted repetitions.
Counted repetitions are regexps of the form:
r{n}
r{n,m}
r{n,}
They can be expressed in terms of concatenation, alternation and
iteration:
r{0} <empty regexp>
r{n} r{n-1} r
r{n,m} r{n} (r{0} | ... | r{m-n})
r{n,} r{n} r*
Reducing special regexp type to represent counted repetitions allows
us substitute complex code that compiles them to bytecode instructions
with simpler code that constructs counted repetitions using concatenation,
alternation and iteration.
Ulya Trofimovich [Fri, 14 Aug 2015 12:57:58 +0000 (13:57 +0100)]
Lexer: unified token length calculation.
Token length equals the difference between two pointers: YYCURSOR
value on the moment of successful match and YYCURSOR value when
entering DFA. This difference should be nonnegative and fit buffer
size.
Ulya Trofimovich [Wed, 12 Aug 2015 16:39:21 +0000 (17:39 +0100)]
Lexer: glued together two variables with almost identical functionality.
These two variables ('ScannerState' class member 'cur' and local variable
'cursor') both played the role of YYCURSOR: YYCURSOR was defined to
'cursor', but the code had to adjust 'cur' to 'cursor' on every return
and restore it when reentering lexing procedures.
There was only two obscure places in which they were not completely
syncronized. Both looked the same:
cur = ptr > tok ? ptr - 1 : cursor;
However, 'ptr' is effectively YYMARKER and is never changed manually
by lexing procedures; 'tok' should point to the beginning of current
lexeme; this check doesn't make sense to me. This is an attempt to
resume lexing one character before YYMARKER, but what for? YYMARKER
isn't even used in Scanner::scan (look at the the generated code),
so 'ptr > tok' is always false.
This check is present since the initial commit, so I cannot trace back
it's origin.
As it doesn't break any tests I hereby simply remove it.
Ulya Trofimovich [Wed, 12 Aug 2015 12:52:57 +0000 (13:52 +0100)]
Added tests for occasionally fixed bug in code generation.
The bug was triggered by the following conditions: '-b'
(or any flag that implies '-b') and encoding with code units
wider than 1 byte (that is, UTF-16, UTF-32 and UCS-2).
The bug itself: re2c omitted 'goto <some state>;' when it
should have been generated. The absense of goto caused incorrect
control flow. One cas verify it in two ways using the minimal test
added by this commit:
- manually compare the code generated before and after fix
and ensure that newer version generates correct code;
- run the test with --skeleton, then try to add/remove
the goto statement, compile and run: the code without
goto will fail, the code with goto will not;
git-bisected commit which fixed bug:
commit e42003d80529ab53413459c41a51114c7d437822
Author: Ulya Trofimovich <skvadrik@gmail.com>
Date: Wed Mar 11 12:46:58 2015 +0000
Pass limited span instead of checking range all the time.
Ulya Trofimovich [Tue, 11 Aug 2015 14:18:04 +0000 (15:18 +0100)]
Explicit cast of 'char' to 'unsigned char' in cmd options lexer.
Much the same as previous commit, only this time YYCTYPE has been
erroneously set to 'char' instead of 'unsigned char. It didn't
cause any warnings or errors becase regular expressions contain
only ASCII range characters. However, relying on this is unsafe:
hypotheticlly re2c can generate some strange comparisons that compile
but fail in runtime.
Explicit cast makes sure thet all comparisons are between unsigned
values.
Ulya Trofimovich [Tue, 11 Aug 2015 13:38:34 +0000 (14:38 +0100)]
Explicit cast of 'char' to 'unsigned char' in lexer.
Source code is in ASCII: pointers have type 'char *', but re2c
makes an implicit assumption that YYCTYPE is unsigned when it
generates comparisons.
At first glance making YYCTYPE 'char' instead of 'unsigned char'
seems to fix the problem (no more warnings). But it's only because
re2c-generated lexer happens to make no comparisons with constants
larget than 0x7F. Lexer rules may change a bit and then we'll have
lots of warnings about 'comparison between signed and unsigned
values' and 'case label exceeds maximal value for type', not
mentioning that it may lead to real errors.
Explicit cast makes sure thet all comparisons are between unsigned
values.
Ulya Trofimovich [Mon, 10 Aug 2015 16:11:01 +0000 (17:11 +0100)]
A better way to avoid GCC [-Wreturn-type] false positives.
Though all enum values are handled in switch, GCC still complains
that 'control reaches end of non-void function'. Adding default
clause helps and it's better than returning some spurious constant.
Ulya Trofimovich [Mon, 10 Aug 2015 12:16:56 +0000 (13:16 +0100)]
run_tests.sh: separate stdout and stderr, diff with -b on wine.
Found some errors while testing mingw-built re2c.exe on wine:
1. redirecting stderr to stdout resulted resulted in random
mix in tests that triggered re2c warnings
2. removing all '\r' symbols in the generated files didn't
work anymore since some tests that contain '\r' were added
Fixed:
1. redirect both to different files and cat manually
2. diff with -b flag on wine
Ulya Trofimovich [Mon, 10 Aug 2015 11:03:07 +0000 (12:03 +0100)]
Stop loosing arcs in DFA loops when building path cover with --skeleton.
The problem with the algorithm was that when one and the same skeleton
node is visited for the 3rd time, the algorithm would abandon this
branch and drop all the path prefixes that lead to it. It's ok when
bruteforcing all paths (the arcs in the dropped prefixes will occur in
other paths that lead to final states), but it's not ok for path cover:
some arcs may have occured only in these dropped prefixes).
The fix is easy: just check if the branch was cancelled and in case it
was try the same prefixes with another branch.
This small example demonstrates the problem:
/*!re2c
[^\x00]* [\x00] {}
*/
DFA skeleton looks like this:
state 0:
arcs to state 0: [0x01, 0xFF]
arcs to state 1: [0x00]
state 1: final
Before the fix, re2c would generate the following paths (provided
path cover construction is forced instead of generating all paths):
- 0x01,0x00
- 0x00
Clearly the [0xFF] arc looping from state 0 to state 0 is missing
(uncovered). After the fix:
Ulya Trofimovich [Mon, 10 Aug 2015 10:48:46 +0000 (11:48 +0100)]
Yet one more explicit cast of pointer difference to uint32_t.
Much the same as previous commit: array of length bounded with 32 bits,
first pointer points to some random element. second pointer points to
the beginning of array (unless there were some errors in calculatins
that would have likely crashed re2c anyway).
Ulya Trofimovich [Mon, 10 Aug 2015 10:41:51 +0000 (11:41 +0100)]
Yet another explicit cast of pointer difference to uint32_t.
This time we deal with pointers in 'Ins' array (the length of array
equals the calculated size of regular expression, which is currently
of type uint32_t: larger regular expressions will surely crash re2c).
So we can guarantee that, if regexp was correctly compiled to instructions,
than the first pointer points to some instuction in the array and second
pointer points to the beginning of array, so this cast is safe.
If compilation was incorrect than re2c will fail anyway.
Ulya Trofimovich [Mon, 10 Aug 2015 09:51:11 +0000 (10:51 +0100)]
Another explicit cast of pointer difference to uint32_t that seems to be safe.
In theory, 'top' points at the top of stack and it's value is always greater
than or equal to 'stk' that points to stack bottom. Total stack size is
equal to 'DFA::nStates', which is of type uint32_t (DFAs larger than
2^32 states are currently not supported an will crash re2c).
In practice, 'stk' is not changed and 'top' is only incremented
before the cast (it's decremented afterwards. Note that the function
is self-recursive).
Ulya Trofimovich [Mon, 10 Aug 2015 09:32:31 +0000 (10:32 +0100)]
Explicit cast of pointer difference to uint32_t: it seems to be safe from the code.
In theory, this function returns the number of 'Span's in the newly
constructed 'Span' array, so it must be a non-negative integer number
that fits into 32 bits and the cast is safe.
In practice, 'x0' variable stays unchanged in this function, while
'x' variable's value can only increase: whenever it can be (conditionally)
decremented, it is always unconditionally incremented. So 'x - x0'
must be non-negative.
Ulya Trofimovich [Mon, 10 Aug 2015 09:19:00 +0000 (10:19 +0100)]
Fixes some hidden NULL pointer dereferencing.
'specMap' parameter can sometimes be NULL (when not in '-c' mode).
In code it was dereferenced before the check for '-c', but due to
compiler optimizations this was never revealed.
I found the bug while trying to measure the size of 'specMap'
before the check (caught segfault). Fixed by moving the check prior
to dereferencing.
Warning status can be in fact represented with only 2 bits, but
since it's engaged in arithmetic operations it gets promoted and
special precautions are needed. Much easier to ure 32 bits:
warnings are nowhere near performance/memory bottleneck.
Use size_t to store the length of path in skeleton.
Though this length shouldn't actually excedd 32 bits (othewise
too much data will be generated, and that is checked), it's better
to use size_t and don't care about the order of checks.
A special truncated unsigned 32-bit type for overflow-sensitive calculations.
With --skeleton switch we need to generate lots of data: strings that
correspond to various paths in DFA and match given regular expression.
For small graphs we can afford to generate all paths, for large graphs
we can only generate path cover. Anyway we need to be able to estimate
the amount of data to be generated (measured in skeleton arcs). Since
it can easily exceed 32 bits (and 64 as well), calculations must stop
as soon as certain limit is reached.
Encodings: use 32-bit unsigned arithmetics instead of 8-bit and 16-bit.
8-bit and 16-bit unsigned integers used in arithmetic operations
are promoted to 32 bits before operation and then truncated back.
Theoretically this may change their value.
Fixed bug #115 "flex-style named definitions cause ambiguity in re2c grammar".
This commit removes 10 shift/reduce conflicts in bison grammar for re2c.
These conflicts are caused by allowing flex-style named definitions
name regular-expression
to contain newlines and to be mixed with rules. It's not just some
conflicts in LALR(1) grammar, it is genuine ambiguity as can be observed
from the following example:
/*!re2c
name "a"
"b" "c" {}
*/
which can be parsed in two ways:
definition -> name "a"
rule -> "b" "c" {}
and
definition -> name "a" "b"
rule -> "c" {}
, both ways being perfectly valid.
This commit resolves ambiguity by forbidding newlines in flex-style
named definitions (conforming to flex syntax). Newline in these
definitions is treated in a special way: lexer emits token 'FID_END',
which marks the end of flex-style named definition in parser.
Ulya Trofimovich [Tue, 16 Jun 2015 15:09:32 +0000 (16:09 +0100)]
More tests for "--empty-class" option.
This time I found all tests that are affected by this option
(that is, contain empty ranges) and added explicit option variants
for most of them (excluding 'test/empty_range.*' group, as we already
have 'test/bug61_positive.*' group that checks the same thing).
This option controls re2c actions when it encounters empty character
class (e.g. [], [^\0x00-\xFF] or [\0x00-\xFF]\[\0x00-\xFF]):
match-empty (default) - match on empty input
match-none - fail to match on any input
error - compilation error
This is a final fix for bug #61 "empty character class [] matches empty string".