granicus.if.org Git - re2c/blob - doc/manual/submatch/submatch.rst_

   1 Submatch extraction in re2c is based on the lookahead-TDFA algorithm described
   2 in the
   3 `Tagged Deterministic Finite Automata with Lookahead <https://arxiv.org/abs/1907.08837>`_
   4 paper. The algorithm uses the notion of "tags" --- position markers that denote
   5 positions in the regular expression for which the lexer must determine the
   6 corresponding position in the input string.
   7 Re2c provides two options for submatch extraction: the first one allows to use
   8 raw tags, and the second one allows to use the more conventional parenthesized
   9 capturing groups.
  10
  11 The first option is ``-T --tags``. With this option one can use standalone tags
  12 of the form ``@stag`` and ``#mtag``, where ``stag`` and ``mtag`` are arbitrary
  13 used-defined names. Tags can be used anywhere inside of a regular expression.
  14 Tags of the form ``@stag`` are
  15 called s-tags: they denote a single submatch value (the last input position
  16 where this tag matched). Tags of the form ``#mtag`` are called m-tags: they
  17 denote multiple submatch values (the whole history of repetitions of this tag).
  18 All tags should be defined by the user as variables with the corresponding
  19 names. With standalone tags re2c uses leftmost greedy disambiguation: submatch
  20 positions correspond to the leftmost matching path through the regular
  21 expression.
  22
  23 The second option is ``-P --posix-captures``: it enables POSIX-compliant
  24 capturing groups. In this mode parentheses in regular expressions denote the
  25 beginning and the end of capturing groups; the whole regular expression is group
  26 number zero. The number of groups for the matching rule is stored in a variable
  27 ``yynmatch``, and submatch results are stored in ``yypmatch`` array. Both
  28 ``yynmatch`` and ``yypmatch`` should be defined by the user, and ``yypmatch``
  29 size must be at least ``[yynmatch * 2]``. Re2c provides a directive
  30 ``/*!maxnmatch:re2c*/`` that defines ``YYMAXNMATCH``: a constant  equal to the
  31 maximal value of ``yynmatch`` among all rules. Note that re2c implements
  32 POSIX-compliant disambiguation: each subexpression matches as long as possible,
  33 and subexpressions that start earlier in regular expression have priority over
  34 those starting later. Capturing groups are translated into s-tags under the
  35 hood.
  36
  37 The overhead on submatch extraction in the generated lexer grows with the
  38 number of tags --- if this number is moderate, the overhead is barely
  39 noticeable. In the lexer tags are implemented using a number of tag variables
  40 generated by re2c. There is no one-to-one correspondence between tag variables
  41 and tags: a single variable may be reused for different tags, and one tag may
  42 require multiple variables to hold all its ambiguous values. Eventually
  43 ambiguity is resolved, and only one final variable per tag survives. When a rule
  44 matches, all its tags are set to the values of the corresponding tag variables.
  45 The exact number of tag variables is unknown to the user; this number is
  46 determined by re2c. However, tag variables should be defined by the user as a
  47 part of the lexer state and updated by ``YYFILL``, therefore re2c provides
  48 directives ``/*!stags:re2c*/`` and ``/*!mtags:re2c*/`` that can be used to
  49 declare, initialize and manipulate tag variables. These directives have two
  50 optional configurations: ``format = "@@";`` (specifies the template where ``@@``
  51 is substituted with the name of each tag variable), and ``separator = "";``
  52 (specifies the piece of code used to join the generated pieces for different
  53 tag variables).
  54
  55 S-tags support the following operations:
  56
  57 * save input position to an s-tag: ``t = YYCURSOR`` with default API or a
  58   user-defined operation ``YYSTAGP(t)`` with generic API
  59 * save default value to an s-tag: ``t = NULL`` with default API or a
  60   user-defined operation ``YYSTAGN(t)`` with generic API
  61 * copy one s-tag to another: ``t1 = t2``
  62
  63 M-tags support the following operations:
  64
  65 * append input position to an m-tag: a user-defined operation ``YYMTAGP(t)``
  66   with both default and generic API
  67 * append default value to an m-tag: a user-defined operation ``YYMTAGN(t)``
  68   with both default and generic API
  69 * copy one m-tag to another: ``t1 = t2``
  70
  71 S-tags can be implemented as scalar values (pointers or offsets). M-tags need a
  72 more complex representation, as they need to store a sequence of tag values. The
  73 most naive and inefficient representation of an m-tag is a list (array, vector)
  74 of tag values; a more efficient representation is to store all m-tags in a
  75 prefix-tree represented as array of nodes ``(v, p)``, where ``v`` is tag value
  76 and ``p`` is a pointer to parent node.