+++ /dev/null
-
-
-
-RE2C(1) RE2C(1)
-
-
-NAME
- re2c - convert regular expressions to C/C++
-
-
-SYNOPSIS
- re2c [-esb] name
-
-
-DESCRIPTION
- re2c is a preprocessor that generates C-based recognizers
- from regular expressions. The input to re2c consists of
- C/C++ source interleaved with comments of the form /*!re2c
- ... */ which contain scanner specifications. In the output
- these comments are replaced with code that, when executed,
- finds the next input token and then executes some
- user-supplied token-specific code.
-
- For example, given the following code
-
- #define NULL ((char*) 0)
- char *scan(char *p){
- char *q;
- #define YYCTYPE char
- #define YYCURSOR p
- #define YYLIMIT p
- #define YYMARKER q
- #define YYFILL(n)
- /*!re2c
- [0-9]+ {return YYCURSOR;}
- [\000-\377] {return NULL;}
- */
- }
-
- re2c will generate
-
- /* Generated by re2c on Sat Apr 16 11:40:58 1994 */
- #line 1 "simple.re"
- #define NULL ((char*) 0)
- char *scan(char *p){
- char *q;
- #define YYCTYPE char
- #define YYCURSOR p
- #define YYLIMIT p
- #define YYMARKER q
- #define YYFILL(n)
- {
- YYCTYPE yych;
- unsigned int yyaccept;
- goto yy0;
- yy1: ++YYCURSOR;
- yy0:
- if((YYLIMIT - YYCURSOR) < 2) YYFILL(2);
- yych = *YYCURSOR;
- if(yych <= '/') goto yy4;
- if(yych >= ':') goto yy4;
- yy2: yych = *++YYCURSOR;
- goto yy7;
- yy3:
- #line 10
- {return YYCURSOR;}
- yy4: yych = *++YYCURSOR;
- yy5:
- #line 11
- {return NULL;}
- yy6: ++YYCURSOR;
- if(YYLIMIT == YYCURSOR) YYFILL(1);
- yych = *YYCURSOR;
- yy7: if(yych <= '/') goto yy3;
- if(yych <= '9') goto yy6;
- goto yy3;
- }
- #line 12
-
- }
-
-
-OPTIONS
- re2c provides the following options:
-
- -e Cross-compile from an ASCII platform to an EBCDIC
- one.
-
- -s Generate nested ifs for some switches. Many compilers
- need this assist to generate better code.
-
- -b Implies -s. Use bit vectors as well in the attempt
- to coax better code out of the compiler. Most useful
- for specifications with more than a few keywords
- (e.g. for most programming languages).
-
-
-INTERFACE CODE
- Unlike other scanner generators, re2c does not generate
- complete scanners: the user must supply some interface
- code. In particular, the user must define the following
- macros:
-
- YYCTYPE Type used to hold an input symbol. Usually char or
- unsigned char.
-
- YYCURSOR
- l-expression of type *YYCTYPE that points to the
- current input symbol. The generated code advances
- YYCURSOR as symbols are matched. On entry, YYCURSOR
- is assumed to point to the first character of
- the current token. On exit, YYCURSOR will point to
- the first character of the following token.
-
- YYLIMIT Expression of type *YYCTYPE that marks the end of
- the buffer (YYLIMIT[-1] is the last character in the
- buffer). The generated code repeatedly compares
- YYCURSOR to YYLIMIT to determine when the buffer
- needs (re)filling.
-
- YYMARKER
- l-expression of type *YYCTYPE. The generated code
- saves backtracking information in YYMARKER.
-
- YYFILL(n)
- The generated code "calls" YYFILL when the buffer
- needs (re)filling: at least n additional characters
- should be provided. YYFILL should adjust
- YYCURSOR, YYLIMIT and YYMARKER as needed. Note
- that for typical programming languages n will be
- the length of the longest keyword plus one.
-
-
-SCANNER SPECIFICATIONS
- Each scanner specification consists of a set of rules and
- name definitions. Rules consist of a regular expression
- along with a block of C/C++ code that is to be executed
- when the associated regular expression is matched. Name
- definitions are of the form ``name = regular expression;''.
-
-
-SUMMARY OF RE2C REGULAR EXPRESSIONS
- "foo" the literal string foo. ANSI-C escape sequences
- can be used.
-
- [xyz] a "character class"; in this case, the regular
- expression matches either an 'x', a 'y', or a 'z'.
-
- [abj-oZ]
- a "character class" with a range in it; matches an
- 'a', a 'b', any letter from 'j' through 'o', or a
- 'Z'.
-
- r\s match any r which isn't an s. r and s must be regular
- expressions which can be expressed as character
- classes.
-
- r* zero or more r's, where r is any regular expression
-
- r+ one or more r's
-
- r? zero or one r's (that is, "an optional r")
-
- name the expansion of the "name" definition (see above)
-
- (r) an r; parentheses are used to override precedence
- (see below)
-
- rs an r followed by an s ("concatenation")
-
- r|s either an r or an s
-
- r/s an r but only if it is followed by an s. The s is
- not part of the matched text. This type of regular
- expression is called "trailing context".
-
- The regular expressions listed above are grouped according
- to precedence, from highest precedence at the top to lowest
- at the bottom. Those grouped together have equal
- precedence.
-
-
-A LARGER EXAMPLE
- #include <stdlib.h>
- #include <stdio.h>
- #include <fcntl.h>
- #include <unistd.h>
- #include <string.h>
-
- #define ADDEQ 257
- #define ANDAND 258
- #define ANDEQ 259
- #define ARRAY 260
- #define ASM 261
- #define AUTO 262
- #define BREAK 263
- #define CASE 264
- #define CHAR 265
- #define CONST 266
- #define CONTINUE 267
- #define DECR 268
- #define DEFAULT 269
- #define DEREF 270
- #define DIVEQ 271
- #define DO 272
- #define DOUBLE 273
- #define ELLIPSIS 274
- #define ELSE 275
- #define ENUM 276
- #define EQL 277
- #define EXTERN 278
- #define FCON 279
- #define FLOAT 280
- #define FOR 281
- #define FUNCTION 282
- #define GEQ 283
- #define GOTO 284
- #define ICON 285
- #define ID 286
- #define IF 287
- #define INCR 288
- #define INT 289
- #define LEQ 290
-
- #define LONG 291
- #define LSHIFT 292
- #define LSHIFTEQ 293
- #define MODEQ 294
- #define MULEQ 295
- #define NEQ 296
- #define OREQ 297
- #define OROR 298
- #define POINTER 299
- #define REGISTER 300
- #define RETURN 301
- #define RSHIFT 302
- #define RSHIFTEQ 303
- #define SCON 304
- #define SHORT 305
- #define SIGNED 306
- #define SIZEOF 307
- #define STATIC 308
- #define STRUCT 309
- #define SUBEQ 310
- #define SWITCH 311
- #define TYPEDEF 312
- #define UNION 313
- #define UNSIGNED 314
- #define VOID 315
- #define VOLATILE 316
- #define WHILE 317
- #define XOREQ 318
- #define EOI 319
-
- typedef unsigned int uint;
- typedef unsigned char uchar;
-
- #define BSIZE 8192
-
- #define YYCTYPE uchar
- #define YYCURSOR cursor
- #define YYLIMIT s->lim
- #define YYMARKER s->ptr
- #define YYFILL(n) {cursor = fill(s, cursor);}
-
- #define RET(i) {s->cur = cursor; return i;}
-
- typedef struct Scanner {
- int fd;
- uchar *bot, *tok, *ptr, *cur, *pos, *lim, *top, *eof;
- uint line;
- } Scanner;
-
- uchar *fill(Scanner *s, uchar *cursor){
- if(!s->eof){
- uint cnt = s->tok - s->bot;
- if(cnt){
- memmove(s->bot, s->tok, s->lim - s->tok); /* regions may overlap */
- s->tok = s->bot;
- s->ptr -= cnt;
- cursor -= cnt;
- s->pos -= cnt;
- s->lim -= cnt;
- }
- if((s->top - s->lim) < BSIZE){
- uchar *buf = (uchar*)
- malloc(((s->lim - s->bot) + BSIZE)*sizeof(uchar));
- memcpy(buf, s->tok, s->lim - s->tok);
- s->tok = buf;
- s->ptr = &buf[s->ptr - s->bot];
- cursor = &buf[cursor - s->bot];
- s->pos = &buf[s->pos - s->bot];
- s->lim = &buf[s->lim - s->bot];
- s->top = &s->lim[BSIZE];
- free(s->bot);
- s->bot = buf;
- }
- if((cnt = read(s->fd, (char*) s->lim, BSIZE)) != BSIZE){
- s->eof = &s->lim[cnt]; *(s->eof)++ = '\n';
- }
- s->lim += cnt;
- }
- return cursor;
- }
-
- int scan(Scanner *s){
- uchar *cursor = s->cur;
- std:
- s->tok = cursor;
- /*!re2c
- any = [\000-\377];
- O = [0-7];
- D = [0-9];
- L = [a-zA-Z_];
- H = [a-fA-F0-9];
- E = [Ee] [+-]? D+;
- FS = [fFlL];
- IS = [uUlL]*;
- ESC = [\\] ([abfnrtv?'"\\] | "x" H+ | O+);
- */
-
- /*!re2c
- "/*" { goto comment; }
-
- "auto" { RET(AUTO); }
- "break" { RET(BREAK); }
- "case" { RET(CASE); }
- "char" { RET(CHAR); }
- "const" { RET(CONST); }
- "continue" { RET(CONTINUE); }
- "default" { RET(DEFAULT); }
- "do" { RET(DO); }
- "double" { RET(DOUBLE); }
- "else" { RET(ELSE); }
- "enum" { RET(ENUM); }
- "extern" { RET(EXTERN); }
- "float" { RET(FLOAT); }
- "for" { RET(FOR); }
- "goto" { RET(GOTO); }
- "if" { RET(IF); }
- "int" { RET(INT); }
- "long" { RET(LONG); }
- "register" { RET(REGISTER); }
- "return" { RET(RETURN); }
- "short" { RET(SHORT); }
- "signed" { RET(SIGNED); }
- "sizeof" { RET(SIZEOF); }
- "static" { RET(STATIC); }
- "struct" { RET(STRUCT); }
- "switch" { RET(SWITCH); }
- "typedef" { RET(TYPEDEF); }
- "union" { RET(UNION); }
- "unsigned" { RET(UNSIGNED); }
- "void" { RET(VOID); }
- "volatile" { RET(VOLATILE); }
- "while" { RET(WHILE); }
-
- L (L|D)* { RET(ID); }
-
- ("0" [xX] H+ IS?) | ("0" D+ IS?) | (D+ IS?) |
- (['] (ESC|any\[\n\\'])* ['])
- { RET(ICON); }
-
- (D+ E FS?) | (D* "." D+ E? FS?) | (D+ "." D* E? FS?)
- { RET(FCON); }
-
- (["] (ESC|any\[\n\\"])* ["])
- { RET(SCON); }
-
- "..." { RET(ELLIPSIS); }
- ">>=" { RET(RSHIFTEQ); }
- "<<=" { RET(LSHIFTEQ); }
- "+=" { RET(ADDEQ); }
- "-=" { RET(SUBEQ); }
- "*=" { RET(MULEQ); }
- "/=" { RET(DIVEQ); }
- "%=" { RET(MODEQ); }
- "&=" { RET(ANDEQ); }
- "^=" { RET(XOREQ); }
- "|=" { RET(OREQ); }
- ">>" { RET(RSHIFT); }
- "<<" { RET(LSHIFT); }
- "++" { RET(INCR); }
- "--" { RET(DECR); }
- "->" { RET(DEREF); }
- "&&" { RET(ANDAND); }
- "||" { RET(OROR); }
- "<=" { RET(LEQ); }
- ">=" { RET(GEQ); }
- "==" { RET(EQL); }
- "!=" { RET(NEQ); }
- ";" { RET(';'); }
- "{" { RET('{'); }
- "}" { RET('}'); }
- "," { RET(','); }
- ":" { RET(':'); }
- "=" { RET('='); }
- "(" { RET('('); }
- ")" { RET(')'); }
- "[" { RET('['); }
- "]" { RET(']'); }
- "." { RET('.'); }
- "&" { RET('&'); }
- "!" { RET('!'); }
- "~" { RET('~'); }
- "-" { RET('-'); }
- "+" { RET('+'); }
- "*" { RET('*'); }
- "/" { RET('/'); }
- "%" { RET('%'); }
- "<" { RET('<'); }
- ">" { RET('>'); }
- "^" { RET('^'); }
- "|" { RET('|'); }
- "?" { RET('?'); }
-
-
- [ \t\v\f]+ { goto std; }
-
- "\n"
- {
- if(cursor == s->eof) RET(EOI);
- s->pos = cursor; s->line++;
- goto std;
- }
-
- any
- {
- printf("unexpected character: %c\n", *s->tok);
- goto std;
- }
- */
-
- comment:
- /*!re2c
- "*/" { goto std; }
- "\n"
- {
- if(cursor == s->eof) RET(EOI);
- s->tok = s->pos = cursor; s->line++;
- goto comment;
- }
- any { goto comment; }
- */
- }
-
- int main(void){
- Scanner in;
- int t;
- memset((char*) &in, 0, sizeof(in));
- in.fd = 0;
- while((t = scan(&in)) != EOI){
- /*
- printf("%d\t%.*s\n", t, in.cur - in.tok, in.tok);
- printf("%d\n", t);
- */
- }
- close(in.fd);
- }
-
-
-SEE ALSO
- flex(1), lex(1).
-
-
-FEATURES
- re2c does not provide a default action: the generated code
- assumes that the input will consist of a sequence of
- tokens. Typically this can be dealt with by adding a rule
- such as the one for unexpected characters in the example
- above.
-
- The user must arrange for a sentinel token to appear at
- the end of input (and provide a rule for matching it):
- re2c does not provide an <<EOF>> expression. If the
- source is from a null-byte terminated string, a rule
- matching a null character will suffice. If the source is
- from a file then the approach taken in the example can be
- used: pad the input with a newline (or some other character
- that can't appear within another token); upon recognizing
- such a character check to see if it is the sentinel
- and act accordingly.
-
- re2c does not provide start conditions: use a separate
- scanner specification for each start condition (as
- illustrated in the above example).
-
- re2c does not provide negated character classes such as
- [^x]: use difference instead (for example any\[x], with
- any defined as in the example above).
-
-BUGS
- Only fixed length trailing context can be handled.
-
- The maximum value appearing as a parameter n to YYFILL is
- not provided to the generated code (this value is needed
- for constructing the interface code). Note that this
- value is usually relatively small: for typical programming
- languages n will be the length of the longest keyword plus
- one.
-
- Difference only works for character sets.
-
- The re2c internal algorithms need documentation.
-
-
-AUTHOR
- Please send bug reports, fixes and feedback to:
-
- Peter Bumbulis
- Computer Systems Group
- University of Waterloo
- Waterloo, Ontario
- N2L 3G1
- Internet: peter@csg.uwaterloo.ca
-
-Version 0.5 8 April 1994 10
-
-