]> granicus.if.org Git - re2c/log
re2c
8 years agoSimplified tracking of condition order.
Ulya Trofimovich [Wed, 26 Aug 2015 11:41:48 +0000 (12:41 +0100)]
Simplified tracking of condition order.

In theory re2c makes no guarantee about the order of conditions in
the generated lexer. Users should use 'YYGETCONDITION'/'YYSETCONDITION'
interface and generate enum with condition names using either '-t'
option or '/*!types:re2c*/' directive. Only members of the generated
enum should be used with 'YYGETCONDITION' and 'YYSETCONDITION'.
This way code is independent of internal re2c condition numbering.

However, it turns out that some users (and notably, PHP) rely on
internal condition numbering: they manually hardcode condition numbers
and make re2c generate condition dispatch in such a way that it
doesn't mention condition names at all (e.g. with in the form of nested
'if' statements with '-b' or in the form of 'goto' table with '-g').
The code compiles, but change of code generation mode may break
compilation. Even worse, change of internal re2c condition numbering
may lead to epic runtime failures.

So re2c cannot just change internal condition numbering: it has to
preserve the existing numbering scheme (rather illogical and not
alphabetically sorted).

This commit tries to simplify tracking of condition numbers.

8 years agoFixed bug #119: "-f with -b/-g generates incorrect dispatch on fill labels".
Ulya Trofimovich [Tue, 25 Aug 2015 10:33:29 +0000 (11:33 +0100)]
Fixed bug #119: "-f with -b/-g generates incorrect dispatch on fill labels".

Consider the following example 1.re:
    /*!re2c
        "" {}
    */
With -if, re2c would generate correct dispatch:
    $ re2c -if 1.re
    /* Generated by re2c 0.14.3 on Tue Aug 25 10:41:45 2015 */
            switch (YYGETSTATE()) {
            default: goto yy0;
            case 0: goto yyFillLabel0;
            }
    yy0:
            YYSETSTATE(0);
    yyFillLabel0:
            {}

With -bif all positive YYGETSTATE() values will lead to yyFillLabel0,
which is clearly an error: values that are greater than 0 should lead
to yy0:
    $ re2c -bif 1.re
    /* Generated by re2c 0.14.3 on Tue Aug 25 10:40:32 2015 */
            if (YYGETSTATE() < 0) {
                    goto yy0;
            } else {
                    goto yyFillLabel0;
            }
    yy0:
            YYSETSTATE(0);
    yyFillLabel0:
            {}

With -gif the error is different: all values greater than 0 now cause
undefined behaviour (access to memory not within array bounds):
    $ re2c -gif 1.re
    /* Generated by re2c 0.14.3 on Tue Aug 25 10:47:41 2015 */
    {
            static void *yystable[] = {
                    &&yyFillLabel0,
            };
            if (YYGETSTATE() < 0) {
                    goto yy0;
            }
            goto *yystable[YYGETSTATE()];
    yy0:
            YYSETSTATE(0);
    yyFillLabel0:
            {}
    }

The situation gets even more tricky with re2c:state:abort configuration.
Besides, YYGETSTATE() macro is called multiple times with -b and -g,
which is not so good. Additional checks with -g don't help gain performance
either.

Fix: always generate simple switch.

8 years agoMakefile.am: update bootstrap parser when necessary.
Ulya Trofimovich [Mon, 24 Aug 2015 13:38:16 +0000 (14:38 +0100)]
Makefile.am: update bootstrap parser when necessary.

The mechanism of updating bootstrap parser was broken by commit
b2bc24704ebd2a0e6050f755dbd93a5cd987a418 : "Support 'make distcheck'."

Before that commit bootstrap parser was updated every time when autogenerated
parser was updated. This approach contradicted with 'make distcheck' which
mounts top source directory read-only (including bootstrap files). That commit
tried to solve the problem by checking if the autogenerated parser has changed
and only then copying it to bootstrap parser.

I made two mistakes:
    - philisophic mistake: autogenerated parser was actually altered by
      'make distcheck' because of altered paths in '#line' info;
    - shell scripting mistake: the check if the autogenerated parser has
      changed was inverted, so bootstrap parser was never updated;

This commit makes use of two observations:
    - bootstrap parser depends on the source parser, not on the autogenerated
      parser;
    - bootstrap parser shouldn't at all contain line information: at the time
      it is copied to build directory the relative path from build directory
      to source directory may be completely different (as it is with 'make
      distcheck');

8 years agodistcheck.sh can only run from top source directory.
Ulya Trofimovich [Mon, 24 Aug 2015 12:23:55 +0000 (13:23 +0100)]
distcheck.sh can only run from top source directory.

8 years agoExplicit cast of signed nonnegative to unsigned (found with [-Wsign-conversion]).
Ulya Trofimovich [Mon, 24 Aug 2015 09:59:44 +0000 (10:59 +0100)]
Explicit cast of signed nonnegative to unsigned (found with [-Wsign-conversion]).

8 years agoParse configuration strings in the same way as character strings and classes.
Ulya Trofimovich [Sun, 23 Aug 2015 10:34:13 +0000 (11:34 +0100)]
Parse configuration strings in the same way as character strings and classes.

re2c allows configuration strings to contain escape sequencs as long
as the escaped code points fit into 1-byte range.

Finally both 'Scanner::unescape' functions can be removed along with
a whole bunch of [-Wsign-conversion] warnings. :)

8 years agoTests: simple tests for parsing inplace configurations.
Ulya Trofimovich [Sat, 22 Aug 2015 14:52:27 +0000 (15:52 +0100)]
Tests: simple tests for parsing inplace configurations.

Tests do not ensure that configurations are handled correctly or
have any impact on re2c behaviour. They only ensure that these
configuration names are recognized as valid.

8 years agoParse inplace configuration names in lexer.
Ulya Trofimovich [Sat, 22 Aug 2015 13:49:17 +0000 (14:49 +0100)]
Parse inplace configuration names in lexer.

Make re2c do all the dirty work instead of barely recognizing
configuration-like lexeme and manually matching the lexeme against
all allowed configuration names.

This commit adds many new tokens, which means that re2c-generated lexer
and bison-generated parser have both become larger. The difference
is not as great as one might expect: ~60 Kb -> ~80 Kb in both cases.

8 years agoZero-extend code units when casing them from 'signed char' to 'uint32_t'.
Ulya Trofimovich [Thu, 20 Aug 2015 12:24:42 +0000 (13:24 +0100)]
Zero-extend code units when casing them from 'signed char' to 'uint32_t'.

The problem starts with lexer: it should operate on unsigned chars
(as re2c wants). Then these casts won't be needed at all. Mixing
signed and unsigned chars is bad and lexer should be rewritten to
use unsigned chars.

8 years agoForbid newline in strings and character classes.
Ulya Trofimovich [Thu, 20 Aug 2015 12:05:07 +0000 (13:05 +0100)]
Forbid newline in strings and character classes.

Like C language, re2c allows all characters of current charset excapt for
newline and backslash to apper unescaped in string and class literals.

Added tests:
    - all possible characters in strings and classes (omitting
      newline, escaping backslash and quotes/closing bracket)
      with different encodings (re2c up to 0.14.x would segfault
      on this test with '-u' and '-w')
    - string with newline
    - string with unescaped backslash followed by newline (so that
      it won't stick to next character)

8 years agoTests: code points that exceed maximum for current encoding.
Ulya Trofimovich [Thu, 20 Aug 2015 11:35:47 +0000 (12:35 +0100)]
Tests: code points that exceed maximum for current encoding.

Test changes introduced by commit 0346d1666f58da5dbe35d156e0487d95354153ff:
"Check if code point exceeds maximum. Correctly cast 'char' to 'uint32_t'."

8 years agoCheck if code point exceeds maximum. Correctly cast 'char' to 'uint32_t'.
Ulya Trofimovich [Thu, 20 Aug 2015 11:09:52 +0000 (12:09 +0100)]
Check if code point exceeds maximum. Correctly cast 'char' to 'uint32_t'.

First fix:
re2c used to check if code point exceeds maximal value for current
encoding when parsing it. When I moved parsing code points to lexer
I forgot the check.

Second fix:
I assumed that 'static_cast<uint32_t>' on 'signed char' zero-extends.
But I was wrong: it sign-extends. Need to cast to 'unsigned char'
instead.

8 years agoParse unquoted flex-like strings in lexer.
Ulya Trofimovich [Wed, 19 Aug 2015 17:49:28 +0000 (18:49 +0100)]
Parse unquoted flex-like strings in lexer.

Such strings can only contain ASCII letters, digits and uderscore:
no escapes. So there's very little parsing to do: just map code units
directly to code points.

8 years agoParse character strings in lexer.
Ulya Trofimovich [Wed, 19 Aug 2015 17:31:04 +0000 (18:31 +0100)]
Parse character strings in lexer.

This commit is a continuation of e42f271e4b284cd6ed590dc06a5239ce13942fc5:
"Parse character classes in lexer." It makes lexer parse individual code
points instead of barely recognizing string of code points.

Since the contents of case sensitive strings, case insensitive strings
and character classes are lexically the same (except for escaped quotes
or closing bracket), we can parse them all in the same lexer block.

Note that re2c allows any 1-byte code points. Maybe we should report
non-ASCII or non-printables.

8 years agoRemoved unused function.
Ulya Trofimovich [Tue, 18 Aug 2015 21:11:58 +0000 (22:11 +0100)]
Removed unused function.

8 years agoAdded warning '-Wuseless-escape'.
Ulya Trofimovich [Tue, 18 Aug 2015 20:57:32 +0000 (21:57 +0100)]
Added warning '-Wuseless-escape'.

This is to warn about escaped printable ASCII characters that
shouldn't be escaped in the current context (e.g. quotes inside of
a character class).

8 years agoAdded warning '-Wswapped-range'.
Ulya Trofimovich [Tue, 18 Aug 2015 16:14:45 +0000 (17:14 +0100)]
Added warning '-Wswapped-range'.

When re2c parses character class and finds a range with lower bound
greater than upper bound, it silently swaps bounds. Warn about such
cases.

8 years agoParse character classes in lexer.
Ulya Trofimovich [Tue, 18 Aug 2015 14:54:22 +0000 (15:54 +0100)]
Parse character classes in lexer.

Before this commit, lexer would barely recognize class as a whole
lexeme and pass it to functions that would further parse is by manually
picking individual code points and control characters out of the lexeme.

Heh, re2c is made for such kind of stuff (and it does it much better).
So now lexer parses individual code points or control characters and
stores them as a sequence of code points. Then an outer function
splits this sequence into ranges and individual characters.

I tried to preserve existing behaviour (judging from test suite only
the text of some error messages has changed). Added some autogenerated
tests (and the generator script itself), but these tests are not exhaustive.

8 years agoLexer: no need to have "{0,}" special case.
Ulya Trofimovich [Sun, 16 Aug 2015 19:57:29 +0000 (20:57 +0100)]
Lexer: no need to have "{0,}" special case.

Now "{0,}" is handled by general case for counted repetition. Had to
fix test because of swapped alternatives in construction of regexps
of the form "r*" and "r{n,".

Added some tests for counted repetition.

8 years agoRestored explicit cast that was removed in 7af5d437e49878ea4e8de73a02d99ad4e5751933.
Ulya Trofimovich [Sun, 16 Aug 2015 19:14:50 +0000 (20:14 +0100)]
Restored explicit cast that was removed in 7af5d437e49878ea4e8de73a02d99ad4e5751933.

Found with [-Wsign-compare] while building with MINGW: pointer
difference is of different size on linux and windows.

8 years agoForce definition of some conditionally defined parts of <stdint.h>.
Ulya Trofimovich [Sun, 16 Aug 2015 19:02:04 +0000 (20:02 +0100)]
Force definition of some conditionally defined parts of <stdint.h>.

Got some errors about undefined 'UINT32_MAX' and others while trying
to build re2c with MINGW. These variables are defined conditionally in
<stdint.h> depending on the following defines:
    __STDC_LIMIT_MACROS
    __STDC_CONSTANT_MACROS
Fix: defined these symbols in the beginning of "src/util/c99_stdint.h".

Also found out that "src/util/c99_stdint.h" didn't compile in the
absense of system <stdint.h> (which is exactly the case when it was
written for). THe problem was caused by incorrect use of 'typedef'
statement; fixed easily. What a shame I didn't compile it before ;D

8 years agoFixed [-Wsign-conversion] warnings in 'testrange' sources.
Ulya Trofimovich [Sun, 16 Aug 2015 14:21:23 +0000 (15:21 +0100)]
Fixed [-Wsign-conversion] warnings in 'testrange' sources.

8 years agoUse custom function instead of 'atoi' to read 32-bit integers.
Ulya Trofimovich [Sun, 16 Aug 2015 11:08:21 +0000 (12:08 +0100)]
Use custom function instead of 'atoi' to read 32-bit integers.

Using 'atoi' displeased me for several reasons:
    - 'atoi' doesn't check for overflow
    - by the time 'atoi' was called, re2c has already parsed input
      string and knows that it is well-formed, no need to do it twice
    - atoi returns 'int', custom function allows to refine types and
      avoid casts between signed/unsigned values or [-Wsign-conversion]
      warnings

Added test for new conversion functions.

8 years agoDon't use special regexp type for counted repetitions.
Ulya Trofimovich [Sat, 15 Aug 2015 16:42:41 +0000 (17:42 +0100)]
Don't use special regexp type for counted repetitions.

Counted repetitions are regexps of the form:
    r{n}
    r{n,m}
    r{n,}

They can be expressed in terms of concatenation, alternation and
iteration:
    r{0}      <empty regexp>
    r{n}      r{n-1} r
    r{n,m}    r{n} (r{0} | ... | r{m-n})
    r{n,}     r{n} r*

Reducing special regexp type to represent counted repetitions allows
us substitute complex code that compiles them to bytecode instructions
with simpler code that constructs counted repetitions using concatenation,
alternation and iteration.

8 years agoRemoved unused struct field.
Ulya Trofimovich [Sat, 15 Aug 2015 15:22:28 +0000 (16:22 +0100)]
Removed unused struct field.

8 years agoStore lexer buffer size inside of lexer class.
Ulya Trofimovich [Fri, 14 Aug 2015 13:37:12 +0000 (14:37 +0100)]
Store lexer buffer size inside of lexer class.

8 years agoSome more explicit casts in lexer fill procedure (found with [-Wsign-conversion]).
Ulya Trofimovich [Fri, 14 Aug 2015 13:29:19 +0000 (14:29 +0100)]
Some more explicit casts in lexer fill procedure (found with [-Wsign-conversion]).

Lexer tries to maintain certain layout of pointers to buffer.
Buffer refilling procedure relies on that layout.

8 years agoLexer: unified token length calculation.
Ulya Trofimovich [Fri, 14 Aug 2015 12:57:58 +0000 (13:57 +0100)]
Lexer: unified token length calculation.

Token length equals the difference between two pointers: YYCURSOR
value on the moment of successful match and YYCURSOR value when
entering DFA. This difference should be nonnegative and fit buffer
size.

8 years agoUse different local variables for unrelated tasks.
Ulya Trofimovich [Fri, 14 Aug 2015 12:57:31 +0000 (13:57 +0100)]
Use different local variables for unrelated tasks.

8 years agoFixed '#27 re2c crashes reading files containing %{ %}' (patch by Rui)
Ulya Trofimovich [Wed, 12 Aug 2015 21:32:24 +0000 (22:32 +0100)]
Fixed '#27 re2c crashes reading files containing %{ %}' (patch by Rui)

merged commit b068f3ad2ea7b01aa1eb63cb655c3357f856e85b from master.

8 years agoLexer: reduced unnecessary macro.
Ulya Trofimovich [Wed, 12 Aug 2015 17:50:38 +0000 (18:50 +0100)]
Lexer: reduced unnecessary macro.

8 years agoLexer: glued together two variables with almost identical functionality.
Ulya Trofimovich [Wed, 12 Aug 2015 16:39:21 +0000 (17:39 +0100)]
Lexer: glued together two variables with almost identical functionality.

These two variables ('ScannerState' class member 'cur' and local variable
'cursor') both played the role of YYCURSOR: YYCURSOR was defined to
'cursor', but the code had to adjust 'cur' to 'cursor' on every return
and restore it when reentering lexing procedures.

There was only two obscure places in which they were not completely
syncronized. Both looked the same:

    cur = ptr > tok ? ptr - 1 : cursor;

However, 'ptr' is effectively YYMARKER and is never changed manually
by lexing procedures; 'tok' should point to the beginning of current
lexeme; this check doesn't make sense to me. This is an attempt to
resume lexing one character before YYMARKER, but what for? YYMARKER
isn't even used in Scanner::scan (look at the the generated code),
so 'ptr > tok' is always false.

This check is present since the initial commit, so I cannot trace back
it's origin.

As it doesn't break any tests I hereby simply remove it.

8 years agoAdded tests for occasionally fixed bug in code generation.
Ulya Trofimovich [Wed, 12 Aug 2015 12:52:57 +0000 (13:52 +0100)]
Added tests for occasionally fixed bug in code generation.

The bug was triggered by the following conditions: '-b'
(or any flag that implies '-b') and encoding with code units
wider than 1 byte (that is, UTF-16, UTF-32 and UCS-2).

The bug itself: re2c omitted 'goto <some state>;' when it
should have been generated. The absense of goto caused incorrect
control flow. One cas verify it in two ways using the minimal test
added by this commit:
    - manually compare the code generated before and after fix
      and ensure that newer version generates correct code;
    - run the test with --skeleton, then try to add/remove
      the goto statement, compile and run: the code without
      goto will fail, the code with goto will not;

git-bisected commit which fixed bug:
    commit e42003d80529ab53413459c41a51114c7d437822
    Author: Ulya Trofimovich <skvadrik@gmail.com>
    Date:   Wed Mar 11 12:46:58 2015 +0000

        Pass limited span instead of checking range all the time.

Also gound commit that introduced bug:
    commit d10fb601f3ba257bae0845945daf1dfa754d6146
    Author: helly <helly@642ea486-5414-0410-9d7f-a0204ed87703>
    Date:   Fri Dec 30 16:24:07 2005 +0000

        - Allow to use -w with -b
        # Still quite some work to do but it is possible

8 years agoMore accurate loop counting (inspired by [-Wsign-compare]).
Ulya Trofimovich [Tue, 11 Aug 2015 17:52:41 +0000 (18:52 +0100)]
More accurate loop counting (inspired by [-Wsign-compare]).

8 years agoLoop counter should be unsigned (found with [-Wsign-compare]).
Ulya Trofimovich [Tue, 11 Aug 2015 16:44:29 +0000 (17:44 +0100)]
Loop counter should be unsigned (found with [-Wsign-compare]).

8 years agoSimplified regexp construction for case insensitive strings.
Ulya Trofimovich [Tue, 11 Aug 2015 16:36:29 +0000 (17:36 +0100)]
Simplified regexp construction for case insensitive strings.

As for now, we only support ASCII. This should be fixed.

8 years agoExplicit cast of 'char' to 'unsigned char' in cmd options lexer.
Ulya Trofimovich [Tue, 11 Aug 2015 14:18:04 +0000 (15:18 +0100)]
Explicit cast of 'char' to 'unsigned char' in cmd options lexer.

Much the same as previous commit, only this time YYCTYPE has been
erroneously set to 'char' instead of 'unsigned char. It didn't
cause any warnings or errors becase regular expressions contain
only ASCII range characters. However, relying on this is unsafe:
hypotheticlly re2c can generate some strange comparisons that compile
but fail in runtime.

Explicit cast makes sure thet all comparisons are between unsigned
values.

8 years agoExplicit cast of 'char' to 'unsigned char' in lexer.
Ulya Trofimovich [Tue, 11 Aug 2015 13:38:34 +0000 (14:38 +0100)]
Explicit cast of 'char' to 'unsigned char' in lexer.

Source code is in ASCII: pointers have type 'char *', but re2c
makes an implicit assumption that YYCTYPE is unsigned when it
generates comparisons.

At first glance making YYCTYPE 'char' instead of 'unsigned char'
seems to fix the problem (no more warnings). But it's only because
re2c-generated lexer happens to make no comparisons with constants
larget than 0x7F. Lexer rules may change a bit and then we'll have
lots of warnings about 'comparison between signed and unsigned
values' and 'case label exceeds maximal value for type', not
mentioning that it may lead to real errors.

Explicit cast makes sure thet all comparisons are between unsigned
values.

8 years agoSimplified pretty-printing individual characters.
Ulya Trofimovich [Tue, 11 Aug 2015 10:12:32 +0000 (11:12 +0100)]
Simplified pretty-printing individual characters.

8 years agorun_tests.sh: a minor tweak to speed up tests (a little).
Ulya Trofimovich [Tue, 11 Aug 2015 10:06:09 +0000 (11:06 +0100)]
run_tests.sh: a minor tweak to speed up tests (a little).

8 years agoA better way to avoid GCC [-Wreturn-type] false positives.
Ulya Trofimovich [Mon, 10 Aug 2015 16:11:01 +0000 (17:11 +0100)]
A better way to avoid GCC [-Wreturn-type] false positives.

Though all enum values are handled in switch, GCC still complains
that 'control reaches end of non-void function'. Adding default
clause helps and it's better than returning some spurious constant.

8 years agoconfigure.ac: added warning to CXXFLAGS: -Wsign-conversion
Ulya Trofimovich [Mon, 10 Aug 2015 12:49:42 +0000 (13:49 +0100)]
configure.ac: added warning to CXXFLAGS: -Wsign-conversion

8 years agoMakefile.am: dropped bison flags: -y -d --no-lines.
Ulya Trofimovich [Mon, 10 Aug 2015 12:39:57 +0000 (13:39 +0100)]
Makefile.am: dropped bison flags: -y -d --no-lines.

-y stands for yacc compatibility, we don't need it since we no longer
accept yacc instead of bison.

-d stands for defines file, we already specify that in makefile rule
for parser.

--no-lines is generally useless and didn't have any effect for some
reason (#line directives are present).

8 years agorun_tests.sh: separate stdout and stderr, diff with -b on wine.
Ulya Trofimovich [Mon, 10 Aug 2015 12:16:56 +0000 (13:16 +0100)]
run_tests.sh: separate stdout and stderr, diff with -b on wine.

Found some errors while testing mingw-built re2c.exe on wine:
    1. redirecting stderr to stdout resulted resulted in random
       mix in tests that triggered re2c warnings
    2. removing all '\r' symbols in the generated files didn't
       work anymore since some tests that contain '\r' were added

Fixed:
    1. redirect both to different files and cat manually
    2. diff with -b flag on wine

8 years agoStop loosing arcs in DFA loops when building path cover with --skeleton.
Ulya Trofimovich [Mon, 10 Aug 2015 11:03:07 +0000 (12:03 +0100)]
Stop loosing arcs in DFA loops when building path cover with --skeleton.

The problem with the algorithm was that when one and the same skeleton
node is visited for the 3rd time, the algorithm would abandon this
branch and drop all the path prefixes that lead to it. It's ok when
bruteforcing all paths (the arcs in the dropped prefixes will occur in
other paths that lead to final states), but it's not ok for path cover:
some arcs may have occured only in these dropped prefixes).

The fix is easy: just check if the branch was cancelled and in case it
was try the same prefixes with another branch.

This small example demonstrates the problem:

    /*!re2c
        [^\x00]* [\x00] {}
    */

DFA skeleton looks like this:

    state 0:
        arcs to state 0: [0x01, 0xFF]
        arcs to state 1: [0x00]
    state 1: final

Before the fix, re2c would generate the following paths (provided
path cover construction is forced instead of generating all paths):

    - 0x01,0x00
    - 0x00

Clearly the [0xFF] arc looping from state 0 to state 0 is missing
(uncovered). After the fix:

    - 0x01,0x00
    - 0xFF,0x01,0x00
    - 0x01,0xFF,0x00
    - 0x00

Not a minimal path cover, but we don't need it. We just need a small
enough one.

8 years agoYet one more explicit cast of pointer difference to uint32_t.
Ulya Trofimovich [Mon, 10 Aug 2015 10:48:46 +0000 (11:48 +0100)]
Yet one more explicit cast of pointer difference to uint32_t.

Much the same as previous commit: array of length bounded with 32 bits,
first pointer points to some random element. second pointer points to
the beginning of array (unless there were some errors in calculatins
that would have likely crashed re2c anyway).

Fixes [-Wconversion] warning.

8 years agoYet another explicit cast of pointer difference to uint32_t.
Ulya Trofimovich [Mon, 10 Aug 2015 10:41:51 +0000 (11:41 +0100)]
Yet another explicit cast of pointer difference to uint32_t.

This time we deal with pointers in 'Ins' array (the length of array
equals the calculated size of regular expression, which is currently
of type uint32_t: larger regular expressions will surely crash re2c).

So we can guarantee that, if regexp was correctly compiled to instructions,
than the first pointer points to some instuction in the array and second
pointer points to the beginning of array, so this cast is safe.

If compilation was incorrect than re2c will fail anyway.

Fixes [-Wconversion] warning.

8 years agoAnother explicit cast of pointer difference to uint32_t that seems to be safe.
Ulya Trofimovich [Mon, 10 Aug 2015 09:51:11 +0000 (10:51 +0100)]
Another explicit cast of pointer difference to uint32_t that seems to be safe.

In theory, 'top' points at the top of stack and it's value is always greater
than or equal to 'stk' that points to stack bottom. Total stack size is
equal to 'DFA::nStates', which is of type uint32_t (DFAs larger than
2^32 states are currently not supported an will crash re2c).

In practice, 'stk' is not changed and 'top' is only incremented
before the cast (it's decremented afterwards. Note that the function
is self-recursive).

Fixes [-Wconversion] warning.

8 years agoExplicit cast of pointer difference to uint32_t: it seems to be safe from the code.
Ulya Trofimovich [Mon, 10 Aug 2015 09:32:31 +0000 (10:32 +0100)]
Explicit cast of pointer difference to uint32_t: it seems to be safe from the code.

In theory, this function returns the number of 'Span's in the newly
constructed 'Span' array, so it must be a non-negative integer number
that fits into 32 bits and the cast is safe.

In practice, 'x0' variable stays unchanged in this function, while
'x' variable's value can only increase: whenever it can be (conditionally)
decremented, it is always unconditionally incremented. So 'x - x0'
must be non-negative.

Fixes [-Wconversion] warning.

8 years agoPrint single character as char rather than convert it to one-symbol string.
Ulya Trofimovich [Mon, 10 Aug 2015 09:26:14 +0000 (10:26 +0100)]
Print single character as char rather than convert it to one-symbol string.

8 years agoFixes some hidden NULL pointer dereferencing.
Ulya Trofimovich [Mon, 10 Aug 2015 09:19:00 +0000 (10:19 +0100)]
Fixes some hidden NULL pointer dereferencing.

'specMap' parameter can sometimes be NULL (when not in '-c' mode).
In code it was dereferenced before the check for '-c', but due to
compiler optimizations this was never revealed.

I found the bug while trying to measure the size of 'specMap'
before the check (caught segfault). Fixed by moving the check prior
to dereferencing.

8 years agoExplicit cast of pointer difference to uint32_t: it's obviously safe from the code.
Ulya Trofimovich [Mon, 10 Aug 2015 09:12:20 +0000 (10:12 +0100)]
Explicit cast of pointer difference to uint32_t: it's obviously safe from the code.

Fixes some [-Wconversion] warnings.

8 years ago'DFA::kCount' type should be ptrdiff_t as it's involved in pointer arithmetics.
Ulya Trofimovich [Mon, 10 Aug 2015 09:07:23 +0000 (10:07 +0100)]
'DFA::kCount' type should be ptrdiff_t as it's involved in pointer arithmetics.

8 years agoAllow generic container to have size_t elements rather than uint32_t.
Ulya Trofimovich [Mon, 10 Aug 2015 08:58:10 +0000 (09:58 +0100)]
Allow generic container to have size_t elements rather than uint32_t.

Cast to uint32_t in those use cases when we are sure that container
only contains some few elements.

8 years agoUse 32 bits insted of 8 for warning status.
Ulya Trofimovich [Sun, 9 Aug 2015 19:31:00 +0000 (20:31 +0100)]
Use 32 bits insted of 8 for warning status.

Warning status can be in fact represented with only 2 bits, but
since it's engaged in arithmetic operations it gets promoted and
special precautions are needed. Much easier to ure 32 bits:
warnings are nowhere near performance/memory bottleneck.

8 years agoUse ptrdiff_t instead of uint32_t to represent offset in buffer.
Ulya Trofimovich [Sun, 9 Aug 2015 19:27:10 +0000 (20:27 +0100)]
Use ptrdiff_t instead of uint32_t to represent offset in buffer.

Found with [-Wconversion].

8 years agoRemoved useless piece of code (pretty-printing octal characters).
Ulya Trofimovich [Sun, 9 Aug 2015 19:11:20 +0000 (20:11 +0100)]
Removed useless piece of code (pretty-printing octal characters).

This piece of code could (and should) never be executed: EBCDIC and
non-pritables are handled prior to calling 'prtCh'.

8 years agoUse size_t to store the length of path in skeleton.
Ulya Trofimovich [Sun, 9 Aug 2015 19:01:52 +0000 (20:01 +0100)]
Use size_t to store the length of path in skeleton.

Though this length shouldn't actually excedd 32 bits (othewise
too much data will be generated, and that is checked), it's better
to use size_t and don't care about the order of checks.

8 years agoAdded comment.
Ulya Trofimovich [Sun, 9 Aug 2015 19:01:17 +0000 (20:01 +0100)]
Added comment.

8 years agoA special truncated unsigned 32-bit type for overflow-sensitive calculations.
Ulya Trofimovich [Sun, 9 Aug 2015 19:00:20 +0000 (20:00 +0100)]
A special truncated unsigned 32-bit type for overflow-sensitive calculations.

With --skeleton switch we need to generate lots of data: strings that
correspond to various paths in DFA and match given regular expression.
For small graphs we can afford to generate all paths, for large graphs
we can only generate path cover. Anyway we need to be able to estimate
the amount of data to be generated (measured in skeleton arcs). Since
it can easily exceed 32 bits (and 64 as well), calculations must stop
as soon as certain limit is reached.

8 years agoEncodings: use 32-bit unsigned arithmetics instead of 8-bit and 16-bit.
Ulya Trofimovich [Sun, 9 Aug 2015 18:07:10 +0000 (19:07 +0100)]
Encodings: use 32-bit unsigned arithmetics instead of 8-bit and 16-bit.

8-bit and 16-bit unsigned integers used in arithmetic operations
are promoted to 32 bits before operation and then truncated back.
Theoretically this may change their value.

This fixes a lot of [-Wconversion] warnings.

8 years agoconfigure.ac: added warning to CXXFLAGS: -Wconversion
Ulya Trofimovich [Sun, 9 Aug 2015 18:01:00 +0000 (19:01 +0100)]
configure.ac: added warning to CXXFLAGS: -Wconversion

8 years agoMakefile.am: dropped re2c flag '-i'.
Ulya Trofimovich [Thu, 6 Aug 2015 11:04:21 +0000 (12:04 +0100)]
Makefile.am: dropped re2c flag '-i'.

Line information is useful.

8 years agoForgot to update bootstrap lexer (changed by commit 1d4462d5d531dc707a442bc984283ea51...
Ulya Trofimovich [Thu, 6 Aug 2015 09:57:46 +0000 (10:57 +0100)]
Forgot to update bootstrap lexer (changed by commit 1d4462d5d531dc707a442bc984283ea51f77338c).

8 years agoNow -Werror-<warning> turns on <warning> (unlike -Werror in general).
Ulya Trofimovich [Wed, 5 Aug 2015 09:33:55 +0000 (10:33 +0100)]
Now -Werror-<warning> turns on <warning> (unlike -Werror in general).

At least GCC does so.

8 years agoBetter representation for rule actions; omit line info for autogenerated actions.
Ulya Trofimovich [Tue, 4 Aug 2015 12:38:57 +0000 (13:38 +0100)]
Better representation for rule actions; omit line info for autogenerated actions.

Since there's no such code in source file, there's no sense in
pointing into it. Updated test.

8 years agoFree memory allocated for range suffies at the same time as everything else.
Ulya Trofimovich [Tue, 4 Aug 2015 10:36:13 +0000 (11:36 +0100)]
Free memory allocated for range suffies at the same time as everything else.

Moved static member definition closer to class.

8 years agoAdded simple struct to store locations of parsed elements in source file.
Ulya Trofimovich [Tue, 4 Aug 2015 10:10:54 +0000 (11:10 +0100)]
Added simple struct to store locations of parsed elements in source file.

8 years agoNew condition belongs to the whole rule rather than to rule's code.
Ulya Trofimovich [Tue, 4 Aug 2015 09:29:47 +0000 (10:29 +0100)]
New condition belongs to the whole rule rather than to rule's code.

8 years agoRemoved unused 'strdup' function and autoconf check.
Ulya Trofimovich [Tue, 4 Aug 2015 08:53:52 +0000 (09:53 +0100)]
Removed unused 'strdup' function and autoconf check.

Function left unused by commit 00b14f309b5da3917d62f7d98a727290eaee6ea2.

8 years agoSome tests for windows-style newlines (CR LF).
Ulya Trofimovich [Thu, 30 Jul 2015 22:10:58 +0000 (23:10 +0100)]
Some tests for windows-style newlines (CR LF).

Copied two existing large tests and patched newlines: LF -> CR LF.

8 years agoFixed bug #115 "flex-style named definitions cause ambiguity in re2c grammar".
Ulya Trofimovich [Thu, 30 Jul 2015 14:11:03 +0000 (15:11 +0100)]
Fixed bug #115 "flex-style named definitions cause ambiguity in re2c grammar".

This commit removes 10 shift/reduce conflicts in bison grammar for re2c.
These conflicts are caused by allowing flex-style named definitions
    name regular-expression
to contain newlines and to be mixed with rules. It's not just some
conflicts in LALR(1) grammar, it is genuine ambiguity as can be observed
from the following example:
    /*!re2c
        name "a"
        "b" "c" {}
    */
which can be parsed in two ways:
    definition -> name "a"
    rule -> "b" "c" {}
and
    definition -> name "a" "b"
    rule -> "c" {}
, both ways being perfectly valid.

This commit resolves ambiguity by forbidding newlines in flex-style
named definitions (conforming to flex syntax). Newline in these
definitions is treated in a special way: lexer emits token 'FID_END',
which marks the end of flex-style named definition in parser.

8 years agoFixed segfault on options that expect an argument but are passed none.
Ulya Trofimovich [Wed, 29 Jul 2015 13:46:19 +0000 (14:46 +0100)]
Fixed segfault on options that expect an argument but are passed none.

Example of commands that triggered segfault:
    $ re2c -o
    $ re2c --type-header
    $ re2c --input

8 years agoForce custom input API with "--skeleton" option.
Ulya Trofimovich [Wed, 29 Jul 2015 11:54:50 +0000 (12:54 +0100)]
Force custom input API with "--skeleton" option.

Dropped this line while rewriting options parser.

8 years agoRemoved extra newline at the end of some error messages.
Ulya Trofimovich [Wed, 29 Jul 2015 11:37:51 +0000 (12:37 +0100)]
Removed extra newline at the end of some error messages.

8 years agoOutput error message when passed multiple arguments after "--" option.
Ulya Trofimovich [Wed, 29 Jul 2015 11:32:20 +0000 (12:32 +0100)]
Output error message when passed multiple arguments after "--" option.

All arguments after "--" are treated as input files, but re2c expects
exactly one input file.

8 years agoCommented parser conflict.
Ulya Trofimovich [Wed, 29 Jul 2015 11:19:40 +0000 (12:19 +0100)]
Commented parser conflict.

See https://github.com/skvadrik/re2c/issues/115

orts the commit.

8 years agoImproved warning messages with '-Werror'.
Ulya Trofimovich [Mon, 27 Jul 2015 20:29:03 +0000 (21:29 +0100)]
Improved warning messages with '-Werror'.

Report an error rather than a warning (at least GCC does so).

8 years agoOutput condition name with '-Wnaked-default' in '-c' mode.
Ulya Trofimovich [Mon, 27 Jul 2015 14:54:57 +0000 (15:54 +0100)]
Output condition name with '-Wnaked-default' in '-c' mode.

8 years agoUpdated tests.
Ulya Trofimovich [Mon, 27 Jul 2015 14:33:29 +0000 (15:33 +0100)]
Updated tests.

Line info changed in warning messages since commit
929c87fdf3a3de5f08206a258dfabbc0a068c561:
"Unified location output in warning messages.

8 years agoUnified location output in warning messages.
Ulya Trofimovich [Mon, 27 Jul 2015 14:12:49 +0000 (15:12 +0100)]
Unified location output in warning messages.

8 years agoRenames '-Wempty-rule' -> '-Wmatch-empty-string' and improved message.
Ulya Trofimovich [Mon, 27 Jul 2015 11:57:46 +0000 (12:57 +0100)]
Renames '-Wempty-rule' -> '-Wmatch-empty-string' and improved message.

8 years agoEnable warnings globally for tests.
Ulya Trofimovich [Mon, 27 Jul 2015 11:45:36 +0000 (12:45 +0100)]
Enable warnings globally for tests.

Fixed reference results for tests that trigger warnings.

8 years agoDon't lose stdout and stderr when testing with -o.
Ulya Trofimovich [Mon, 27 Jul 2015 10:45:30 +0000 (11:45 +0100)]
Don't lose stdout and stderr when testing with -o.

8 years agoDon't output generated code when failed with -Werror.
Ulya Trofimovich [Mon, 27 Jul 2015 10:41:23 +0000 (11:41 +0100)]
Don't output generated code when failed with -Werror.

8 years agoAdded warinigs: -W, -Werror.
Ulya Trofimovich [Mon, 27 Jul 2015 08:00:33 +0000 (09:00 +0100)]
Added warinigs: -W, -Werror.

    -W: enable all warnings
    -Werror: turn all enabled (hereafter) warnings into errors

Specific warnings added by this commit:
    -Wempty-character-class
    -Wempty-rule
    -Wnaked-default

For each warning there are -Wno-, -Werror-, -Wno-error- options.

8 years agoFixed typos in release script, updated release guidelines.
Ulya Trofimovich [Fri, 24 Jul 2015 19:28:20 +0000 (20:28 +0100)]
Fixed typos in release script, updated release guidelines.

8 years agoSplit release script into tw parts: distcheck and everything else.
Ulya Trofimovich [Tue, 21 Jul 2015 18:06:07 +0000 (19:06 +0100)]
Split release script into tw parts: distcheck and everything else.

8 years agoMakefile.am: remved nonexistent file 'doc/index.html'.
Ulya Trofimovich [Tue, 21 Jul 2015 17:17:54 +0000 (18:17 +0100)]
Makefile.am: remved nonexistent file 'doc/index.html'.

8 years agoSite front page 'doc/index.html' gone to branch 'gh-pages'.
Ulya Trofimovich [Tue, 21 Jul 2015 16:46:09 +0000 (17:46 +0100)]
Site front page 'doc/index.html' gone to branch 'gh-pages'.

Note that anther site page, manual.html, is autogenerated from
'doc/re2c.ad.in' which is also used to generate documentation.

8 years agoMakefile.am: added forgotten header.
Ulya Trofimovich [Tue, 21 Jul 2015 16:22:11 +0000 (17:22 +0100)]
Makefile.am: added forgotten header.

8 years agoFixed re2c behaviour with "--" command line option.
Ulya Trofimovich [Tue, 21 Jul 2015 12:47:31 +0000 (13:47 +0100)]
Fixed re2c behaviour with "--" command line option.

All arguments after "--" are treated as non-options. This can be
used to handle option-like filenames (those starting with "-" and "--").

8 years agoAllow to specify exacly one input file and at most one output and header file.
Ulya Trofimovich [Tue, 21 Jul 2015 11:15:01 +0000 (12:15 +0100)]
Allow to specify exacly one input file and at most one output and header file.

8 years agoFixed GXX warning '-Wsuggest-attribute=format'.
Ulya Trofimovich [Mon, 20 Jul 2015 21:14:12 +0000 (22:14 +0100)]
Fixed GXX warning '-Wsuggest-attribute=format'.

g++ warns if function with '__attribute__((format (printf, ...)))'
has wrong number of arguments or argument type is wrong.

8 years agoUse 'vfprintf' instead of 'fprintf' to print variadic arguments.
Ulya Trofimovich [Mon, 20 Jul 2015 18:45:54 +0000 (19:45 +0100)]
Use 'vfprintf' instead of 'fprintf' to print variadic arguments.

Added GCC warninigs -Wformat=2 (reveals this error) and
-Wsuggest-attribute=format.

8 years agoModified release script to try build with bmake as well as make.
Ulya Trofimovich [Mon, 20 Jul 2015 14:37:05 +0000 (15:37 +0100)]
Modified release script to try build with bmake as well as make.

8 years agoMakefile.am: drop non-POSIX make feature: '$<' in target rules.
Ulya Trofimovich [Mon, 20 Jul 2015 14:24:46 +0000 (15:24 +0100)]
Makefile.am: drop non-POSIX make feature: '$<' in target rules.

POSIX make allows '$<' only in inferencee rule.
Found while trying to build with bmake.

8 years agoMakefile.am: respect various 'make' alternatives: use '$(MAKE)'.
Ulya Trofimovich [Mon, 20 Jul 2015 13:42:32 +0000 (14:42 +0100)]
Makefile.am: respect various 'make' alternatives: use '$(MAKE)'.

8 years agoMakefile.am: use inference rule '.re.cc' to avoid rule duplication.
Ulya Trofimovich [Mon, 20 Jul 2015 13:39:54 +0000 (14:39 +0100)]
Makefile.am: use inference rule '.re.cc' to avoid rule duplication.

Inference rules is a standard POSIX make feature.

8 years agoNew re2c-based parser for command line options.
Ulya Trofimovich [Sun, 19 Jul 2015 14:22:47 +0000 (15:22 +0100)]
New re2c-based parser for command line options.