\documentclass{article}
\usepackage[margin=2cm]{geometry}
+\usepackage{lipsum}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{caption}
\usepackage{listings}
\usepackage{multicol}\setlength{\columnsep}{1cm}
+%\usepackage[vlined]{algorithm2e}\setlength{\algomargin}{0em}\SetArgSty{textnormal}
+\usepackage[noline,noend]{algorithm2e}
+ \setlength{\algomargin}{0em}
+ \SetArgSty{textnormal}
+ \SetNoFillComment
+ \newcommand{\Xcmfont}[1]{\texttt{\footnotesize{#1}}}\SetCommentSty{Xcmfont}
+%\usepackage{algorithm}
+%\usepackage[noend]{algpseudocode}
\setlength{\parindent}{0pt}
\usepackage{enumitem}
\newtheorem{Xdef}{Definition}
\newtheorem{XThe}{Theorem}
\newtheorem{XLem}{Lemma}
+\newtheorem{Xobs}{Observation}
\title{Tagged Deterministic Finite Automata with Lookahead}
\author{Ulya Trofimovich}
\begin{document}
\maketitle
\begin{abstract}
\noindent
This paper extends the work of Laurikari [Lau00] [Lau01] and Kuklewicz [Kuk??] on tagged deterministic finite automata (TDFA)
-in connection with submatch extraction in regular expressions.
+in the context of submatch extraction in regular expressions.
The main goal of this work is the application of TDFA to lexer generators that optimize for speed of the generated code.
I suggest a number of practical improvements to the Laurikari algorithm;
-notably, the use of 1-symbol lookahead, which results in significant reduction of tag variables and operations.
-Experimental results confirm that lookahead-aware TDFA are considerably faster and usually smaller then baseline TDFA;
-and they are reasonably close in speed and size to canonical DFA used for recognition.
+notably, the use of one-symbol lookahead, which results in a significant reduction of tag variables and operations on them.
+Experimental results confirm that lookahead-aware TDFA are considerably faster and usually smaller than baseline TDFA;
+and they are reasonably close in speed and size to ordinary DFA used for recognition of regular languages.
The proposed algorithm can handle repeated submatch and therefore is applicable to full parsing.
-Furthermore, I consider two disambiguation policies: leftmost greedy and POSIX.
-I formalize the algorithm suggested by Kuklewicz
-and show that Kuklewicz TDFA have no overhead compared to Laurikari TDFA or automata that use leftmost greedy disambiguation.
-All discussed models and algorithms are implemented in the open source lexer generator RE2C.
+Furthermore, I examine the problem of disambiguation in the case of leftmost greedy and POSIX policies.
+I formalize the POSIX disambiguation algorithm suggested by Kuklewicz
+and show that the resulting TDFA are as efficient as Laurikari TDFA or TDFA that use leftmost greedy disambiguation.
+All discussed algorithms are implemented in the open source lexer generator RE2C.
\end{abstract}
%\vspace{1em}
\section*{Introduction}
-RE2C [Bum94] [web??] is a lexer generator for C: it compiles regular expressions to C code.
-Unlike regular expression libraries such as TRE [Lau01] or RE2 [Cox??], RE2C has no restriction on preprocessing time
-and concentrates fully on the quality of generated code.
-It takes pride in generating fast lexers: at least as fast as reasonably optimized lexers coded by hand.
-This is not an easy goal; hand-written code is specialized for a particular environment, while autogenerated code must fit anywhere.
-RE2C has a highly configurable interface and quite a few optimizations ranging from
-high-level program transformations to low-level tweaks of conditional jumps.
-In such setting it is undesirable to add extensions that affect performance.
+RE2C [Bum94] [web??] is a lexer generator for C: it compiles regular expressions into C code.
+Unlike regular expression libraries, lexer generators separate compilation and execution steps:
+they can spend a considerable amount of time on compilation in order to optimize the generated code.
+Consequently, lexer generators are usually aimed at generating efficient code rather than supporting multiple extensions;
+they use deterministic automata and avoid features that need more complex computational models.
+In particular, RE2C aims at generating lexers that are at least as fast as reasonably optimized hand-coded lexers.
+It compiles regular expressions into deterministic automata,
+applies a number of optimizations to reduce automata size
+and converts them directly into C code in the form of conditional jumps:
+this approach results in code that is more efficient and more readable than that of table-based lexers.
+In addition, RE2C has a flexible interface:
+instead of using a fixed program template,
+it lets the programmer define most of the interface code
+and adapt the lexer to a particular environment.
\\ \\
-One useful extension of regular expressions is submatch extraction and parsing.
+One useful extension of traditional regular expressions that cannot be implemented using ordinary DFA is submatch extraction and parsing.
Many authors studied this subject and developed algorithms suitable for their particular settings and problem domains.
Their approaches differ in various respects:
-the specific subtype of problem (full parsing, submatch extracton with or without history of repetitions);
-the underlying formalizm (backtracking,
+the specific subtype of problem (full parsing, submatch extraction with or without history of repetitions),
+the underlying formalism (backtracking,
nondeterministic automata, deterministic automata,
-multiple automata, lazy determinization);
-the number of passes over the input (streaming, multi-pass);
-space consumption with respect to input length (constant, proportional);
-treatment of ambiguity (unhandled, manual disambiguation, default disambiguation policy, all possible parse trees).
-%Their approaches differ in many respects:
-%the specific subtype of problem (full parsing, submatch extracton with or without history of repeated submatches);
-%the underlying formalizm (backtracking,
-%nondeterministic automaton [ThoBra] [Cox],
-%deterministic automaton [Ragel] [Lau00] [Gra15],
-%multiple deterministic automata [SohTho],
-%lazy determinization [Kuk] [Kar]);
-%the number of passes over the input (streaming, multi-pass);
-%space consumption (constant, proportional to the size of output [Gra] [ThoBra], proportional to length of input [Kea] [DubFee]);
-%treatment of ambiguity (forbidden, manual disambiguation [Ragel], default policy [FriCar] [Gra], multiple options [ThoBra]).
-Most of the algorithms are unsuitable for RE2C: either insufficienty generic (cannot handle ambiguity),
-or too heavyweight (incur too much overhead on regular expressions with only a few submatches or no submatches at all).
-Laurikari algorithm is special in this respect.
-It is based on a single deterministic automaton, runs in one pass and linear time,
+multiple automata, lazy determinization),
+the number of passes over the input (streaming, multi-pass),
+space consumption with respect to input length (constant, linear),
+handling of ambiguity (unhandled, manual disambiguation, default disambiguation policy, all possible parse trees), etc.
+Most of the algorithms are unsuitable for RE2C: they are either insufficiently generic (cannot handle ambiguity),
+or too heavyweight (incur overhead on regular expressions with only a few submatches or no submatches at all).
+The Laurikari algorithm stands out in this regard.
+It is based on a single deterministic automaton, runs in one pass and requires linear time,
and the consumed space does not depend on the input length.
Most importantly, the overhead of submatch extraction depends on the level of submatch detail:
-on submatch-free regular expressions Laurikari automaton shrinks to a simple DFA.
+on submatch-free regular expressions Laurikari automaton reduces to a simple DFA.
\\ \\
From the RE2C point of view this is close enough to hand-written code:
you only pay for what you need, like a reasonable programmer would do.
However, a closer look at Laurikari automata reveals that
they behave like a very strange programmer who is unable to think even one step ahead.
Take, for example, regular expression \texttt{a*b*}
-and suppose that we must find the boundary between \texttt{a} and \texttt{b} in the input string.
+and suppose that we must find the position between \texttt{a} and \texttt{b} in the input string.
The programmer would probably match all \texttt{a}, then save the input position, then match all \texttt{b}:
\begin{small}
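\begin{lstlisting}[language=C]
/* An illustrative hand-written sketch (not actual RE2C output):
 * match all 'a', save the input position, then match all 'b'. */
const char *match_a_star_b_star(const char *str, const char **boundary)
{
    const char *s = str;
    while (*s == 'a') ++s;  /* match all 'a' */
    *boundary = s;          /* save the input position */
    while (*s == 'b') ++s;  /* match all 'b' */
    return s;               /* end of the matched prefix */
}
\end{lstlisting}
\end{small}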
Another problem that needs attention is disambiguation.
The original paper [Lau01] claims to have POSIX semantics, but this claim was proven wrong [LTU].
-Since then Kuklewicz suggested a fix of Laurikari algorithm that does have POSIX semantics [Regex-TDFA], but he never formalized it.
+Since then Kuklewicz suggested a fix for Laurikari algorithm that does have POSIX semantics [Regex-TDFA], but he never formalized the resulting algorithm.
The informal description [regex-wiki] is somewhat misleading as it suggests that Kuklewicz automata
require additional run-time operations to keep track of submatch history and hence are less efficient than Laurikari automata.
That is not true, as we shall see: all the added complexity is related to determinization,
\\
Finally, theory is no good without practice.
-Even lookahead-aware automata contain a lot of redundant operations,
-which can be dramatically reduced by the most basic optimizations like liveness analysis and dead code elimination.
+Even lookahead-aware automata contain redundant operations
+which can be reduced by basic optimizations like liveness analysis and dead code elimination.
The overall number of submatch records can be minimized using a technique similar to register allocation.
I suggest another tweak of the Laurikari algorithm that makes optimizations particularly easy
and show that they are useful even in the presence of an optimizing C compiler.
In section \ref{section_tagged_extension} we extend it with tags
and define ambiguity with respect to submatch extraction.
In section \ref{section_tnfa} we convert regular expressions to nondeterministic automata
-and in section \ref{section_closure} study various algorithms for closure construction.
-Section \ref{section_disambiguation} is about disambiguation;
+and in section \ref{section_closure} study various algorithms for $\epsilon$-closure construction.
+Section \ref{section_disambiguation} tackles the disambiguation problem;
we discuss leftmost greedy and POSIX policies and the necessary properties that a disambiguation policy should have in order to allow efficient submatch extraction.
-Section \ref{section_determinization} is the main part of this paper: it presents determinization algorithm.
+Section \ref{section_determinization} is the main part of this paper: it describes the determinization algorithm.
Section \ref{section_implementation} highlights some practical implementation details and optimizations.
Section \ref{section_tests_and_benchmarks} concerns correctness testing and benchmarks.
Finally, section \ref{section_future_work} outlines directions for future work.
\section{Regular expressions}\label{section_regular_expressions}
-Regular expressions is a notation that originates in the work of Kleene
+Regular expressions are a \emph{notation} that originates in the work of Kleene
\emph{``Representation of Events in Nerve Nets and Finite Automata''} [Kle51] [Kle56].
He used this notation to describe \emph{regular events}:
each regular event is a set of \emph{definite events},
and the question arose whether this algebra of regular events could be presented
as a deductive system based on logical axioms and algebraic laws.
This question was thoroughly investigated by many authors (see [Koz91] for a historic overview)
and the formalism became known as \emph{the algebra of regular events} %$\mathcal{K} \Xeq (K, +, \cdot, *, 1, 0)$
-or, more generally, the \emph{Kleene algebra}.
+or, more generally, the \emph{Kleene algebra} $\mathcal{K} \Xeq (K, +, \cdot, *, 1, 0)$.
Several different axiomatizations of Kleene algebra were given;
in particular, Kozen gave a finitary axiomatization based on equations and equational implications and sound for all interpretations [Koz91].
-We will use the usual inductive definition:
+See also [Gra15] for extensions of Kleene algebra and generalization to the field of context-free languages.
\\
- \begin{Xdef}
- \emph{Regular expressions (RE)} over finite alphabet $\Sigma$
- is a notation that is inductively defined as follows:
+The following definition of regular expressions, with minor notational differences, is widely used in the literature
+(see e.g. [HopUll], page 28):
+
+ \begin{Xdef}\label{re}
+ \emph{Regular expression (RE)} over finite alphabet $\Sigma$ is one of the following:
\begin{enumerate}
\medskip
- \item[] $\emptyset$, $\epsilon$ and $\alpha \Xin \Sigma$ are \emph{atomic} RE
- \item[] if $e_1$, $e_2$ are RE, then $e_1 | e_2$ is RE (\emph{sum})
- \item[] if $e_1$, $e_2$ are RE, then $e_1 e_2$ is RE (\emph{product})
- \item[] if $e$ is RE, then $e^*$ is RE (\emph{iteration})
+ \item[] $\emptyset$, $\epsilon$ and $\alpha \Xin \Sigma$ (\emph{atomic} RE)
+ \item[] $(e_1 | e_2)$, where $e_1$, $e_2$ are RE over $\Sigma$ (\emph{sum})
+ \item[] $(e_1 e_2)$, where $e_1$, $e_2$ are RE over $\Sigma$ (\emph{product})
+ \item[] $(e^*)$, where $e$ is a RE over $\Sigma$ (\emph{iteration})
\medskip
\end{enumerate}
- Iteration has precedence over product and product over sum;
- parenthesis may be used to override it.
$\square$
\end{Xdef}
-For the most useful RE there are special shortcuts:
- \begin{align*}
- e^n &\quad\text{for}\quad \overbrace{e \dots e}^{n} \\[-0.5em]
- e^{n,m} &\quad\text{for}\quad e^n | e^{n+1} | \dots | e^{m-1} | e^m \\[-0.5em]
- e^{n,} &\quad\text{for}\quad e^n e^* \\[-0.5em]
- e^+ &\quad\text{for}\quad ee^* \\[-0.5em]
- e^? &\quad\text{for}\quad e | \epsilon
- \end{align*}
-
+The usual assumption is that iteration has precedence over product and product has precedence over sum,
+and redundant parentheses may be omitted.
+$\emptyset$ and $\epsilon$ are special symbols not included in the alphabet $\Sigma$ (they correspond to $0$ and $1$ in the Kleene algebra, respectively).
Since RE are only a notation, their exact meaning depends on the particular \emph{interpretation}.
In the \emph{standard} interpretation RE denote \emph{languages}: sets of strings over the alphabet of RE.
+\\
+
+Let $\epsilon$ denote the \emph{empty string} (not to be confused with RE $\epsilon$),
+and let $\Sigma^*$ denote the set of all (possibly empty) strings over $\Sigma$.
\begin{Xdef}
- \emph{Language} over $\Sigma$ is a subset of $\Sigma^*$,
- where $\Sigma^*$ denotes the set of all (possibly empty) strings over $\Sigma$.
+ \emph{Language} over $\Sigma$ is a subset of $\Sigma^*$.
$\square$
\end{Xdef}
except that we are interested in partial parse structure rather than full parse trees.
\begin{Xdef}
- Language $L$ over $\Sigma$ is \emph{regular} iff $\exists$ RE $e$ over $\Sigma$
+ Language $L$ over $\Sigma$ is \emph{regular} iff there exists RE $e$ over $\Sigma$
such that $L$ is denoted by $e$: $\XL \Xlb e \Xrb \Xeq L$.
$\square$
\end{Xdef}
-The set $\mathcal{R}_\Sigma$ of all regular languages over alphabet $\Sigma$
-together with constants $\emptyset$, $\{ \epsilon \}$ and operations $\cup$, $\cdot$ and ${}^*$
-forms a Kleene algebra $\mathcal{K} \Xeq (\mathcal{R}_\Sigma, \cup, \cdot, *, \emptyset, \{ \epsilon \})$.
+For the most useful RE there are special shortcuts:
+ \begin{align*}
+ e^n &\quad\text{for}\quad \overbrace{e \dots e}^{n} \\[-0.5em]
+ e^{n,m} &\quad\text{for}\quad e^n | e^{n+1} | \dots | e^{m-1} | e^m \\[-0.5em]
+ e^{n,} &\quad\text{for}\quad e^n e^* \\[-0.5em]
+ e^+ &\quad\text{for}\quad ee^* \\[-0.5em]
+ e^? &\quad\text{for}\quad e | \epsilon
+ \end{align*}
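For example, over $\Sigma \Xeq \{a, b\}$ the shorthand notation expands as follows:
 \begin{align*}
 \XL \Xlb a^{1,2} b^? \Xrb
 &= \XL \Xlb (a | aa)(b | \epsilon) \Xrb \\[-0.5em]
 &= \{ a, \; ab, \; aa, \; aab \}
 \end{align*}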
\section{Tagged extension}\label{section_tagged_extension}
Laurikari did not define tags explicitly; rather, he defined automata with tagged transitions.
We take a slightly different approach, inspired by [ThoBra10], [Gra15] and a number of other publications.
First, we define an extension of RE: tagged RE,
-and two interpretations: \emph{S-language} that ignores tags and \emph{T-language} that respects them.
+and two interpretations: \emph{S-language} that ignores tags and \emph{T-language} that preserves them.
T-language has the bare minimum of information necessary for submatch extraction;
in particular, it is less expressive than \emph{parse trees} or \emph{types} that are used for RE parsing.
Then we define \emph{ambiguity} and \emph{disambiguation policy} in terms of relations between the two interpretations.
-Finally, we show how T-language can be converted to the \emph{tag value functions} used by Laurikari
-and argue that the latter representation is insufficient as it cannot express ambiguity in some RE.
+Finally, we show how T-language can be converted to \emph{tag value functions} used by Laurikari
+and argue that the latter representation is insufficient as it cannot express ambiguity in certain RE.
\\
-Tagged RE differ from RE in the following respects.
-First, they have a new kind of atomic primitive: tags.
-Second, we use generalized repetition $e^{n,m}$ (possibly $m \Xeq \infty$) as one of the three base operations instead of iteration $e^*$.
-The reason for this is that
-desugaring repetition into concatenation and iteration requires duplication of $e$,
-and duplication may change the semantics of submatch extraction.
+In tagged RE we use generalized repetition $e^{n,m}$ (where $0 \!\leq\! n \!\leq\! m \!\leq\! \infty$)
+instead of iteration $e^*$ as one of the three base operations.
+The reason for this is the following:
+bounded repetition cannot be expressed in terms of union and product without duplication of RE,
+and duplication may change the semantics of submatch extraction.
For example, POSIX RE \texttt{(a(b?))\{2\}} contains two submatch groups (aside from the whole RE),
-but if we rewrite it as \texttt{(a(b?))(a(b?))}, the number of submatches will change to four.
-Third difference is that repetition starts from one: zero or more repetitions are expressed as alternative between one or more repetitions and the empty word.
-This is mostly a matter of convenience:
-the case of zero iterations is special because it contains no tags, even if the subexpression itself does.
+but if we rewrite it as \texttt{(a(b?))(a(b?))}, the number of submatch groups will change to four.
+Generalized repetition, on the other hand, makes it possible to express all kinds of iteration without duplication.
+\\
- \begin{Xdef}
- \emph{Tagged regular expressions (TRE)} over disjoint finite alphabets $\Sigma$ and $T$
- is a notation that is inductively defined as follows:
+ \begin{Xdef}\label{tre}
+ \emph{Tagged regular expression (TRE)} over disjoint finite alphabets $\Sigma$ and $T$ is one of the following:
\begin{enumerate}
\medskip
- \item[] $\emptyset$, $\epsilon$, $\alpha \Xin \Sigma$, $t \Xin T$ are \emph{atomic} TRE
- \item[] if $e_1$, $e_2$ are TRE, then $e_1 | e_2$ is TRE (\emph{sum})
- \item[] if $e_1$, $e_2$ are TRE, then $e_1 e_2$ is TRE (\emph{product})
-% \item[] if $e$ is TRE and $1 \!\leq\! n \!\leq\! \infty$, then $e^{1,n}$ is TRE (\emph{repetition})
- \item[] if $e$ is TRE and $0 \!<\! n \!\leq\! m \!\leq\! \infty$, then $e^{n,m}$ is TRE (\emph{repetition})
+ \item[] $\emptyset$, $\epsilon$, $\alpha \Xin \Sigma$ and $t \Xin T$ (\emph{atomic} TRE)
+ \item[] $(e_1 | e_2)$, where $e_1$, $e_2$ are TRE over $\Sigma$, $T$ (\emph{sum})
+ \item[] $(e_1 e_2)$, where $e_1$, $e_2$ are TRE over $\Sigma$, $T$ (\emph{product})
+ \item[] $(e^{n,m})$, where $e$ is a TRE over $\Sigma$, $T$ \\
+ \hphantom{\qquad} and $0 \!\leq\! n \!\leq\! m \!\leq\! \infty$ (\emph{repetition})
\medskip
\end{enumerate}
- Repetition has precedence over product and product over sum;
- parenthesis may be used to override it.
+ $\square$
+ \end{Xdef}
+
+ As usual, we assume that repetition has precedence over product and product has precedence over sum,
+ and redundant parentheses may be omitted.
Additionally, the following shorthand notation may be used:
\begin{align*}
% e^n &\quad\text{for}\quad \overbrace{e \dots e}^{n} \\[-0.5em]
% e^{n,m} &\quad\text{for}\quad e^{n-1} e^{1,m-n} \\[-0.5em]
- e^n &\quad\text{for}\quad e^{n,n} \\[-0.5em]
- e^{0,m} &\quad\text{for}\quad e^{1,m} | \epsilon \\[-0.5em]
- e^* &\quad\text{for}\quad e^{1,\infty} | \epsilon \\[-0.5em]
+ e^* &\quad\text{for}\quad e^{0,\infty} \\[-0.5em]
e^+ &\quad\text{for}\quad e^{1,\infty} \\[-0.5em]
- e^? &\quad\text{for}\quad e | \epsilon
+ e^? &\quad\text{for}\quad e^{0,1} \\[-0.5em]
+ e^n &\quad\text{for}\quad e^{n,n}
\end{align*}
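For example, POSIX RE \texttt{(a(b?))\{2\}} mentioned above may be represented by TRE $(1 \, a \, 3 \, b^{0,1} \, 4 \, 2)^{2}$,
in which tags $1$, $2$ correspond to the opening and closing parentheses of the outer group
and tags $3$, $4$ to those of the inner group (this particular tag numbering is only an illustration).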
- $\square$
- \end{Xdef}
\begin{Xdef}
TRE over $\Sigma$, $T$ is \emph{well-formed} iff
Negative submatches are implicitly encoded in the structure of TRE:
we can always deduce the \emph{absence} of a tag from its \emph{presence} on an alternative branch of TRE.
To see why this is important, consider POSIX RE \texttt{(a(b)?)*} matched against string \texttt{aba}.
-The outermost capturing group matches twice at offsets \texttt{(0,2)} and \texttt{(2,3)}.
-The innermost group matches only once at \texttt{(1,2)}; there is no match on the second outermost iteration.
-POSIX standard demands that we report the absence of match \texttt{(?,?)}.
-Even aside from POSIX, we might be interested in the whole history of submatch.
+The outermost capturing group matches twice at offsets 0, 2 and 2, 3 (opening and closing parentheses respectively).
+The innermost group matches only once at offsets 1, 2; there is no match corresponding to the second iteration of the outermost group.
+The POSIX standard demands that the value of the last iteration is reported: that is, the absence of match.
+Aside from POSIX, one might be interested in the whole history of submatch.
Therefore we will rewrite TRE in a form that makes negative submatches explicit
(by tracking tags on alternative branches and inserting negative tags at all join points).
Negative tags are marked with bar, and $\Xbar{T}$ denotes the set of all negative tags.
\XX(t) &= t \\
\XX(e_1 | e_2)
&= \XX(e_1) \chi(e_2) \mid \XX(e_2) \chi(e_1) \\
- \text{where }
- &\chi(e) = \Xbar{t_1} \dots \Xbar{t_n} \text{ such that} \\
- &t_1 \dots t_n \text{ are all tags in } e \\
\XX(e_1 e_2) &= \XX(e_1) \XX(e_2) \\
- \XX(e^{n,m}) &= \XX(e)^{n,m}
+ \XX(e^{n,m}) &= \begin{cases}
+ \XX(e)^{1,m} \mid \chi(e) &\text{if } n \Xeq 0 \\
+ \XX(e)^{n,m} &\text{if } n \!\geq\! 1
+ \end{cases} \\
+ \\
+ \text{where }
+ \chi(e) &= \Xbar{t_1} \dots \Xbar{t_n}, \text{ such that} \\
+ &t_1 \dots t_n \text{ are all tags in } e
\end{align*}
$\square$
\end{Xdef}
\begin{align*}
\XT \Xlb \beta &| (\alpha 1)^{0,2} \Xrb
=
- \XT \Xlb \beta \Xrb \cdot \{\Xbar{1}\} \cup
- \XT \Xlb (\alpha 1)^{1,2} | \epsilon \Xrb \cdot \{\epsilon\} = \\
+ \XT \Xlb \beta \Xbar{1} | \big( (\alpha 1)^{1,2} | \Xbar{1} \big) \Xrb = \\
&=
- \{\beta \Xbar{1}\} \cup
- \big(\XT \Xlb \alpha 1 \Xrb
+ \XT \Xlb \beta \Xbar{1} \Xrb
+ \cup \XT \Xlb (\alpha 1)^{1,2} \Xrb
+ \cup \XT \Xlb \Xbar{1} \Xrb = \\
+&=
+ \XT \Xlb \beta \Xrb
+ \cdot \XT \Xlb \Xbar{1} \Xrb \cup
+ \XT \Xlb \alpha 1 \Xrb
\cup \XT \Xlb \alpha 1 \Xrb \cdot \XT \Xlb \alpha 1 \Xrb
- \big) \cdot \{\epsilon\}
\cup \{\Xbar{1}\} = \\
&=
 \{\beta \Xbar{1}\} \cup
 \{\alpha 1\} \cup
 \{\alpha 1 \alpha 1\} \cup
 \{\Xbar{1}\} = \\
&=
 \{ \beta \Xbar{1}, \; \alpha 1, \; \alpha 1 \alpha 1, \; \Xbar{1} \}
\end{align*}
A more practical representation is the \emph{tag value function} used by Laurikari:
a separate list of offsets in the input string for each tag.
Tag value functions can be trivially reconstructed from T-strings.
-However, the two representations are inequivalent;
+However, the two representations are not equivalent;
in particular, tag value functions have a weaker notion of ambiguity and fail to capture ambiguity in some TRE, as shown below.
Therefore we use T-strings as a primary representation
and convert them to tag value functions after disambiguation.
\begin{Xdef}\label{tagvalfun}
\emph{Decomposition} of a T-string $x \Xeq \gamma_1 \dots \gamma_n$
- is a \emph{tag value function} $H: T \rightarrow (\YN_0 \cup \{ \varnothing \})^*$
+ is a \emph{tag value function} $H: T \rightarrow (\YN \cup \{ 0, \varnothing \})^*$
that maps each tag to a string of offsets in $S(x)$:
$H(t) \Xeq \varphi^t_1 \dots \varphi^t_n$, where:
    $$\varphi^t_i = \begin{cases}
    \varnothing &\text{if } \gamma_i \Xeq \Xbar{t} \\[-0.5em]
    |S(\gamma_1 \dots \gamma_i)| &\text{if } \gamma_i \Xeq t \\[-0.5em]
    \epsilon &\text{otherwise}
    \end{cases}$$
    $\square$
    \end{Xdef}
For example, TRE $(1 (3 \, 4)^{1,3} 2)^{2}$,
which may represent POSIX RE \texttt{(()\{1,3\})\{2\}},
denotes ambiguous T-strings $x \Xeq 1 3 4 3 4 2 1 3 4 2$ and $y \Xeq 1 3 4 2 1 3 4 3 4 2$.
-According to POSIX, first iteration is more important than the second one,
+According to POSIX, the first iteration has higher priority than the second one,
and repeated empty match, if optional, should be avoided, therefore $y \prec x$.
However, both $x$ and $y$ decompose to the same tag value function:
$$H(t) \Xeq \begin{cases}
0 \, 0 &\text{if } t \Xin \{1, 2\} \\[-0.5em]
0 \, 0 \, 0 &\text{if } t \Xin \{3, 4\}
\end{cases}$$
and in this perspective submatch extraction reduces to the problem of translation between regular languages.
The class of automata capable of performing such translation is known as \emph{finite state transducers (FST)} [??].
TNFA, as defined by Laurikari in [Lau01], is a nondeterministic FST
-that performs on-the-fly decomposition of output strings into tag value functions
+that decomposes output strings into tag value functions
and then applies disambiguation.
-Our definition is different in the following respects.
+Our definition is different in the following respects.
First, we apply disambiguation \emph{before} decomposition
(for the reasons discussed in the previous section).
Second, we do not consider disambiguation policy as an attribute of TNFA:
\item[] $F \subseteq Q$ is the set of \emph{final} states
\item[] $q_0 \in Q$ is the \emph{initial} state
-% \item[] $\Delta \Xeq \Delta^\Sigma \sqcup \Delta^\epsilon$ is the \emph{transition} relation, where:
-% \begin{itemize}
-% \item[] $\Delta^\Sigma \subseteq Q \times \Sigma \times \{\epsilon\} \times Q$
-% \item[] $\Delta^\epsilon \subseteq Q \times (P \cup \{\epsilon\}) \times (T \cup \Xbar{T} \cup \{\epsilon\}) \times Q$
-% \end{itemize}
-
- \item[] $\Delta \Xeq \Delta^\Sigma \sqcup \Delta^\epsilon \sqcup \Delta^T \sqcup \Delta^P$ is the \emph{transition} relation, which includes
- transitions on symbols, $\epsilon$-transitions, tagged $\epsilon$-transitions and prioritized $\epsilon$-transitions:
+ \item[] $\Delta \Xeq \Delta^\Sigma \sqcup \Delta^\epsilon$ is the \emph{transition} relation, where:
\begin{itemize}
\item[] $\Delta^\Sigma \subseteq Q \times \Sigma \times \{\epsilon\} \times Q$
- \item[] $\Delta^\epsilon \subseteq Q \times \{\epsilon\} \times \{\epsilon\} \times Q$
- \item[] $\Delta^T \subseteq Q \times \{\epsilon\} \times (T \cup \Xbar{T}) \times Q$
- \item[] $\Delta^P \subseteq Q \times P \times \{\epsilon\} \times Q$
+ \item[] $\Delta^\epsilon \subseteq Q \times (P \cup \{\epsilon\}) \times (T \cup \Xbar{T} \cup \{\epsilon\}) \times Q$
\end{itemize}
- and all $\epsilon$-transitions from the same state have different priority:
- $\forall (x, r, \epsilon, y), (\widetilde{x}, \widetilde{r}, \epsilon, \widetilde{y}) \Xin \Delta:
- x \Xeq \widetilde{x} \wedge y \Xeq \widetilde{y} \Rightarrow \wedge r \!\neq\! \widetilde{r}$.
- \end{itemize}
- $\square$
- \end{Xdef}
-
- \begin{Xdef}
- State $q$ in TNFA $(\Sigma, T, P, Q, F, q_0, \Delta)$
- is a \emph{core} state if it is final: $q \Xin F$,
- or has outgoing transitions on symbols: $\exists \alpha \Xin \Sigma, p: (q, \alpha, \epsilon, p) \Xin \Delta$.
- \end{Xdef}
-
- \begin{Xdef}
- A \emph{path} in TNFA $(\Sigma, T, P, Q, F, q_0, \Delta)$ is a set of transitions
- $\{(q_i, \alpha_i, a_i, \widetilde{q}_i)\}_{i=1}^n \subseteq \Delta$, where $n \!\geq\! 0$
- and $\widetilde{q}_i \Xeq q_{i+1} \; \forall i \Xeq \overline{1,n-1}$.
- $\square$
- \end{Xdef}
-
- \begin{Xdef}
- Path $\{(q_i, \alpha_i, a_i, \widetilde{q}_i)\}_{i=1}^n$ in TNFA $(\Sigma, T, P, Q, F, q_0, \Delta)$ is \emph{accepting}
- if either $n \Xeq 0 \wedge q_0 \Xin F$ or $n\!>\!0 \wedge q_1 \Xeq q_0 \wedge \widetilde{q}_n \Xin F$.
- $\square$
- \end{Xdef}
-
-% \begin{Xdef}
-% Every path $\pi \Xeq \{(q_i, \alpha_i, a_i, \widetilde{q}_i)\}_{i=1}^n$
-% induces a string $u \Xeq \alpha_1 a_1 \dots \alpha_n a_n$
-% over the mixed alphabet $\Sigma \cup T \cup \Xbar{T} \cup \{\Xbar{0}, \Xbar{1}\}$
-% called \emph{P-string} and denoted as $\pi \Xmap u$.
-% $\square$
-% \end{Xdef}
-%
-% From the given P-string $u \Xeq \gamma_1 \dots \gamma_n$ it is possible to filter out S-string, T-string and \emph{bitcode}:
-% \begin{align*}
-% \XS(x) &= \alpha_1 \dots \alpha_n
-% &&\alpha_i = \begin{cases}
-% \gamma_i &\text{if } \gamma_i \Xin \Sigma \\[-0.5em]
-% \epsilon &\text{otherwise}
-% \end{cases} \\
-% \XT(x) &= \tau_1 \dots \tau_n
-% &&\tau_i = \begin{cases}
-% \gamma_i &\text{if } \gamma_i \Xin \Sigma \cup T \cup \Xbar{T} \\[-0.5em]
-% \epsilon &\text{otherwise}
-% \end{cases} \\
-% \XB(x) &= \beta_1 \dots \beta_n
-% &&\beta_i = \begin{cases}
-% \gamma_i &\text{if } \gamma_i \Xin \{ \Xbar{0}, \Xbar{1} \} \\[-0.5em]
-% \epsilon &\text{otherwise}
-% \end{cases}
-% \end{align*}
-
- \begin{Xdef}
- Every path $\pi \Xeq \{(q_i, \alpha_i, a_i, \widetilde{q}_i)\}_{i=1}^n$
- in TNFA $(\Sigma, T, P, Q, F, q_0, \Delta)$
- \emph{induces} an S-string, a T-string and a string over $P$ called \emph{bitcode}:
- \begin{align*}
- \XS(\pi) &= \alpha_1 \dots \alpha_n \\
- \XT(\pi) &= \alpha_1 \gamma_1 \dots \alpha_n \gamma_n
- &&\gamma_i = \begin{cases}
- a_i &\text{if } a_i \Xin T \cup \Xbar{T} \\[-0.5em]
- \epsilon &\text{otherwise}
- \end{cases} \\[-0.5em]
- \XB(\pi) &= \beta_1 \dots \beta_n
- &&\beta_i = \begin{cases}
- a_i &\text{if } a_i \Xin P \\[-0.5em]
- \epsilon &\text{otherwise}
- \end{cases}
- \end{align*}
- $\square$
- \end{Xdef}
-
- \begin{Xdef}
- Paths
- $\pi_1 \Xeq \{(q_i, \alpha_i, a_i, \widetilde{q}_i)\}_{i=1}^n$ and
- $\pi_2 \Xeq \{(p_i, \beta_i, b_i, \widetilde{p}_i)\}_{i=1}^m$
- are \emph{ambiguous} if their start and end states coincide: $q_1 \Xeq p_1$, $\widetilde{q}_n \Xeq \widetilde{p}_m$
- and their induced T-strings $\XT(\pi_1)$ and $\XT(\pi_2)$ are ambiguous.
- $\square$
- \end{Xdef}
-
- \begin{Xdef}
- TNFA $\XN$ \emph{transduces} string $s$ to a T-string $x$, denoted $s \xrightarrow{\XN} x$
- if $s \Xeq S(x)$ and there is an accepting path $\pi$ in $\XN$, such that $\XT(\pi) \Xeq x$.
- $\square$
- \end{Xdef}
-
- \begin{Xdef}
- The \emph{input language} of TNFA $\XN$ is \\
- $\XI(\XN) \Xeq \{ s \mid \exists x: s \xrightarrow{\XN} x \}$
- $\square$
- \end{Xdef}
+% \item[] $\Delta \Xeq \Delta^\Sigma \sqcup \Delta^\epsilon \sqcup \Delta^T \sqcup \Delta^P$ is the \emph{transition} relation, which includes
+% transitions on symbols, $\epsilon$-transitions, tagged $\epsilon$-transitions and prioritized $\epsilon$-transitions:
+% \begin{itemize}
+% \item[] $\Delta^\Sigma \subseteq Q \times \Sigma \times \{\epsilon\} \times Q$
+% \item[] $\Delta^\epsilon \subseteq Q \times \{\epsilon\} \times \{\epsilon\} \times Q$
+% \item[] $\Delta^T \subseteq Q \times \{\epsilon\} \times (T \cup \Xbar{T}) \times Q$
+% \item[] $\Delta^P \subseteq Q \times P \times \{\epsilon\} \times Q$
+% \end{itemize}
- \begin{Xdef}
- The \emph{output language} of TNFA $\XN$ is \\
- $\XO(\XN) \Xeq \{ x \mid \exists s: s \xrightarrow{\XN} x \}$
+ and all $\epsilon$-transitions from the same state have different priorities:
+ $\forall (x, r, \epsilon, y), (\widetilde{x}, \widetilde{r}, \epsilon, \widetilde{y}) \Xin \Delta^\epsilon:
+ x \Xeq \widetilde{x} \wedge y \Xeq \widetilde{y} \Rightarrow r \!\neq\! \widetilde{r}$.
+ \end{itemize}
$\square$
\end{Xdef}
-
-\begin{XThe}\label{theorem_tnfa}
-For any TRE $e$ over $\Sigma$, $T$ there is a TNFA $\XN(e)$, such that
-the input language of $\XN$ is the S-language of $e$:
-$\XI(\XN) \Xeq \XS \Xlb e \Xrb$ and
-the output language of $\XN$ is the T-language of $e$:
-$\XO(\XN) \Xeq \XT \Xlb e \Xrb$.
-
-\smallskip
-
-Proof.
-First, we give an algorithm for FST construction (derived from Thompson NFA construction).
-Let $\XN(e) \Xeq (\Sigma, T, \{0, 1\}, Q, \{ y \}, x, \Delta)$, such that $(Q, x, y, \Delta) \Xeq \XF(\XX(e))$, where:
+TNFA construction is similar to Thompson NFA construction,
+except for priorities and generalized repetition.
+For the given TRE $e$ over $\Sigma$, $T$, the corresponding TNFA is $\XN(e) \Xeq (\Sigma, T, \{0, 1\}, Q, \{ y \}, x, \Delta)$,
+where $(Q, x, y, \Delta) \Xeq \XF(\XX(e))$ and $\XF$ is defined as follows:
\begin{align*}
- \XF(\emptyset) &= (\{ x, y \}, x, y, \emptyset) \tag{1a} \\
- \XF(\epsilon) &= (\{ x, y \}, x, y, \{ (x, \epsilon, \epsilon, y) \}) \tag{1b} \\
- \XF(\alpha) &= (\{ x, y \}, x, y, \{ (x, \alpha, \epsilon, y) \}) \tag{1c} \\
- \XF(t) &= (\{ x, y \}, x, y, \{ (x, \epsilon, t, y) \}) \tag{1d} \\
- \XF(e_1 | e_2) &= \XF(e_1) \cup \XF(e_2) \tag{1e} \label{tnfaalt} \\
- \XF(e_1 e_2) &= \XF(e_1) \cdot \XF(e_2) \tag{1f} \label{tnfacat} \\
- \XF(e^{n,\infty}) &= \XF(e)^{n,\infty} \tag{1g} \label{tnfaunbounditer} \\
- \XF(e^{n,m}) &= \XF(e)^{n, m} \tag{1h} \label{tnfabounditer}
+ \XF(\emptyset) &= (\{ x, y \}, x, y, \emptyset) \\
+ \XF(\epsilon) &= (\{ x, y \}, x, y, \{ (x, \epsilon, \epsilon, y) \}) \\
+ \XF(\alpha) &= (\{ x, y \}, x, y, \{ (x, \alpha, \epsilon, y) \}) \\
+ \XF(t) &= (\{ x, y \}, x, y, \{ (x, \epsilon, t, y) \}) \\
+ \XF(e_1 | e_2) &= \XF(e_1) \cup \XF(e_2) \\
+ \XF(e_1 e_2) &= \XF(e_1) \cdot \XF(e_2) \\
+ \XF(e^{n,\infty}) &= \XF(e)^{n,\infty} \\
+ \XF(e^{n,m}) &= \XF(e)^{n, m}
\end{align*}
%
+Automata union:
\begin{align*}
- F_1 \cup F_2 &= (Q, x, y, \Delta) \tag{1i} \label{tnfaaltconstr} \\
+ F_1 \cup F_2 &= (Q, x, y, \Delta) \\
\text{where }
& (Q_1, x_1, y_1, \Delta_1) = F_1 \\
& (Q_2, x_2, y_2, \Delta_2) = F_2 \\
 & Q = Q_1 \cup Q_2 \cup \{ x, y \} \\
 & \Delta = \Delta_1 \cup \Delta_2 \cup \{ (x, 0, \epsilon, x_1), (y_1, \epsilon, \epsilon, y), \\
 & \qquad (x, 1, \epsilon, x_2), (y_2, \epsilon, \epsilon, y) \}
\end{align*}
%
+\begin{center}
+\includegraphics[width=0.3\linewidth]{img/tnfa/union.png}
+\end{center}
+%
+Automata product:
\begin{align*}
- F_1 \cdot F_2 &= (Q, x_1, y_2, \Delta) \tag{1j} \label{tnfacatconstr} \\
+ F_1 \cdot F_2 &= (Q, x_1, y_2, \Delta) \\
\text{where }
& (Q_1, x_1, y_1, \Delta_1) = F_1 \\
& (Q_2, x_2, y_2, \Delta_2) = F_2 \\
 & Q = Q_1 \cup Q_2 \\
 & \Delta = \Delta_1 \cup \Delta_2 \cup \{ (y_1, \epsilon, \epsilon, x_2) \}
\end{align*}
%
+\begin{center}
+\includegraphics[width=0.25\linewidth]{img/tnfa/concat.png}
+\end{center}
+%
+Unbounded repetition of automata:
\begin{align*}
- F^{n,\infty} &= (Q, y_0, y_{n+1}, \Delta) \tag{1k} \label{tnfaunbounditerconstr} \\
+ F^{n,\infty} &= (Q, x_1, y_{n+1}, \Delta) \\
\text{where }
& \{(Q_i, x_i, y_i, \Delta_i)\}_{i=1}^n = \{F, \dots, F\} \\
- & Q = \bigcup\nolimits_{i=1}^n Q_i \cup \{ y_0, y_{n+1} \} \\
+ & Q = \bigcup\nolimits_{i=1}^n Q_i \cup \{ y_{n+1} \} \\
& \Delta = \bigcup\nolimits_{i=1}^n \Delta_i
- \cup \{(y_{i-1}, \epsilon, \epsilon, x_i)\}_{i=1}^n \\
+ \cup \{(y_i, \epsilon, \epsilon, x_{i+1})\}_{i=1}^{n\!-\!1} \\
& \hphantom{\hspace{2em}}
\cup \{ (y_n, 0, \epsilon, x_n), (y_n, 1, \epsilon, y_{n+1}) \}
\end{align*}
%
+\begin{center}
+\includegraphics[width=0.55\linewidth]{img/tnfa/repeat_unbound.png}
+\end{center}
+%
+Bounded repetition of automata:
\begin{align*}
- F^{n,m} &= (Q, y_0, y_{m+1}, \Delta) \tag{1l} \label{tnfabounditerconstr} \\
+ F^{n,m} &= (Q, x_1, y_m, \Delta) \\
\text{where }
& \{(Q_i, x_i, y_i, \Delta_i)\}_{i=1}^m = \{F, \dots, F\} \\
- & Q = \bigcup\nolimits_{i=1}^m Q_i \cup \{ y_0, y_{m+1} \} \\
+ & Q = \bigcup\nolimits_{i=1}^m Q_i \\
& \Delta = \bigcup\nolimits_{i=1}^m \Delta_i
- \cup \{(y_{i-1}, 0, \epsilon, x_i)\}_{i=1}^m \\
- & \hphantom{\hspace{6em}}
- \cup \{(y_i, 1, \epsilon, y_{m+1})\}_{i=n}^m
+ \cup \{(y_i, \epsilon, \epsilon, x_{i+1})\}_{i=1}^{n\!-\!1} \\
+ & \hphantom{\hspace{2em}}
+ \cup \{(y_i, 0, \epsilon, x_{i+1}), (y_i, 1, \epsilon, y_m)\}_{i=n}^{m\!-\!1}
\end{align*}
+%
+\begin{center}
+\includegraphics[width=0.9\linewidth]{img/tnfa/repeat_bound.png}
+\end{center}
-Second, we must prove language equality.
-Consider arbitrary TRE $e$: let $\widetilde{e} \Xeq \XF(\XX(e))$
-and let $\XO(\widetilde{e})$ denote $\XO(\XN(e))$.
-We will show by induction on the size of $\widetilde{e}$ that $\XO(\XF(\widetilde{e})) \Xeq \XL \Xlb \widetilde{e} \Xrb$.
-As a consequence, we will have $\XO(\XN) \Xeq \XT \Xlb e \Xrb$ and $\XI(\XN) \Xeq \XS \Xlb e \Xrb$,
-since $\XI(\XN) \Xeq \{S(h) \mid h \Xin \XO(\XN)\}$ and
-$\XS \Xlb e \Xrb \Xeq \{S(h) \mid h \Xin \XT \Xlb e \Xrb \}$.
-\\
-
-Induction basis for atomic expressions $\emptyset$, $\epsilon$, $\alpha$, $t$
-trivially follows from equations 1a - 1d and definition \ref{deftlang}.
-To make the induction step, consider compound TRE.
-First, note that the T-string induced by concatenation of two paths
-is a concatenation of T-strings induced by each path.
-%if $\pi_1 \Xeq \{(q_i, \alpha_i, a_i, \widetilde{q}_i)\}_{i=1}^n \Xmap x$
-%and $\pi_2 \Xeq \{(p_i, \beta_i, b_i, \widetilde{p}_i)\}_{i=1}^m \Xmap y$,
-%then $\pi_1
-% \cup \{(\widetilde{q}_n, \epsilon, \epsilon, p_1)\}
-% \cup \pi_2 \Xmap xy$
-Then by construction of TNFA we have:
+The above construction of TNFA has certain properties that will be used in subsequent sections.
+
+\begin{Xobs}\label{obs_tnfa_states}
+We can partition all TNFA states into three disjoint subsets:
+states that have outgoing transitions on symbols,
+states that have outgoing $\epsilon$-transitions,
+and states without outgoing transitions (including the final state).
+This statement can be proved by induction on the structure of TNFA:
+automata for atomic TRE $\emptyset$, $\epsilon$, $\alpha$, $t$ obviously satisfy it;
+compound automata $F_1 \cup F_2$, $F_1 \cdot F_2$, $F^{n,\infty}$ and $F^{n,m}$
+do not violate it:
+they only add outgoing $\epsilon$-transitions to those states that have no outgoing transitions,
+and their final state is either a new state without outgoing transitions,
+or the final state of one of the subautomata.
+\end{Xobs}
+
+\begin{Xobs}\label{obs_tnfa_repeat}
+For repetition automata $F^{n,\infty}$ and $F^{n,m}$
+the number of iterations uniquely determines the order of subautomata traversal:
+by construction, the subautomaton corresponding to the $(i \!+\! 1)$-th iteration is only reachable
+from the one corresponding to the $i$-th iteration
+(in the case of unbounded repetition it may be the same subautomaton).
+\end{Xobs}
- \begin{align*}
- \XO(F_1 \cup F_2) \Xlongeq{\ref{tnfaaltconstr}}&\; \XO(F_1) \cup \XO(F_2)) \\
- \XO(F_1 \cdot F_2) \Xlongeq{\ref{tnfacatconstr}}&\; \XO(F_1) \cdot \XO(F_2) \\
- \XO(F^{n,m}) \Xlongeq{\ref{tnfaunbounditerconstr},\ref{tnfabounditerconstr}}&\; \bigcup_{i=n}^m {\XO(F)}^i
- \end{align*}
+ \begin{Xdef}
+ A \emph{path} in TNFA $(\Sigma, T, P, Q, F, q_0, \Delta)$ is a sequence of transitions
+ $\{(q_i, \alpha_i, a_i, \widetilde{q}_i)\}_{i=1}^n \subseteq \Delta$, where $n \!\geq\! 0$
+ and $\widetilde{q}_i \Xeq q_{i+1} \; \forall i \Xeq \overline{1,n-1}$.
+ $\square$
+ \end{Xdef}
-Given that, induction step is trivial:
+ \begin{Xdef}
+ Path $\{(q_i, \alpha_i, a_i, \widetilde{q}_i)\}_{i=1}^n$ in TNFA $(\Sigma, T, P, Q, F, q_0, \Delta)$ is \emph{accepting}
+ if either $n \Xeq 0 \wedge q_0 \Xin F$ or $n\!>\!0 \wedge q_1 \Xeq q_0 \wedge \widetilde{q}_n \Xin F$.
+ $\square$
+ \end{Xdef}
+ \begin{Xdef}
+ Every path $\pi \Xeq \{(q_i, \alpha_i, a_i, \widetilde{q}_i)\}_{i=1}^n$
+ in TNFA $(\Sigma, T, P, Q, F, q_0, \Delta)$
+ \emph{induces} an S-string, a T-string and a string over $P$ called \emph{bitcode}:
\begin{align*}
- \XO(&\XF(e_1 | e_2)) \Xlongeq{\ref{tnfaalt}} \;
- \XO(\XF(e_1) \cup \XF(e_2) = \\
- =&\; \XO(\XF(e_1)) \cup \XO(\XF(e_2)) =
- \XL \Xlb e_1 \Xrb \cup \XL \Xlb e_2 \Xrb
- = \XL \Xlb e_1 | e_2 \Xrb \\
-%
- \XO(&\XF(e_1 e_2)) \Xlongeq{\ref{tnfacat}} \; \XO(\XF(e_1) \cdot \XF(e_2)) = \\
- =&\; \XO(\XF(e_1)) \cdot \XO(\XF(e_2)) =
- \XL \Xlb e_1 \Xrb \cdot \XL \Xlb e_2 \Xrb = \XL \Xlb e_1 e_2 \Xrb \\
-%
- \XO(&\XF(e^{n,m})) \Xlongeq{\ref{tnfabounditer},\ref{tnfaunbounditer}} \; \XO(\XF(e)^{n,m}) = \\[-0.5em]
- &=\; \bigcup\limits_{i=n}^m \XO(\XF(e))^i
- \Xlongeq{ind} \bigcup\limits_{i=n}^m \XL\Xlb e \Xrb^i
- = \XL \Xlb e^{n,m} \Xrb
+ \XS(\pi) &= \alpha_1 \dots \alpha_n \\
+ \XT(\pi) &= \alpha_1 \gamma_1 \dots \alpha_n \gamma_n
+ &&\gamma_i = \begin{cases}
+ a_i &\text{if } a_i \Xin T \cup \Xbar{T} \\[-0.5em]
+ \epsilon &\text{otherwise}
+ \end{cases} \\[-0.5em]
+ \XB(\pi) &= \beta_1 \dots \beta_n
+ &&\beta_i = \begin{cases}
+ a_i &\text{if } a_i \Xin P \\[-0.5em]
+ \epsilon &\text{otherwise}
+ \end{cases}
\end{align*}
$\square$
- \end{XThe}
+ \end{Xdef}
+
+ \begin{Xdef}
+ TNFA $\XN$ \emph{transduces} S-string $s$ to a T-string $x$, denoted $s \xrightarrow{\XN} x$
+ if $s \Xeq S(x)$ and there is an accepting path $\pi$ in $\XN$, such that $\XT(\pi) \Xeq x$.
+ $\square$
+ \end{Xdef}
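For illustration, the induced strings can be modeled in Python. The encoding below (transitions as tuples $(q, \alpha, a, p)$ with the empty string for $\epsilon$, tag names like \texttt{t1}, priority bits \texttt{0}/\texttt{1}) is an assumption of this sketch, not part of the definition.

```python
# Sketch of the induced-string definitions. A transition is a tuple
# (q, alpha, a, p): alpha is an input symbol or '' (epsilon), and a is
# a tag name, a priority bit '0'/'1', or '' (epsilon).

def s_string(path):
    """S(pi): concatenation of input symbols along the path."""
    return ''.join(alpha for (_, alpha, _, _) in path)

def t_string(path, tags):
    """T(pi): symbols interleaved with tags (priorities dropped)."""
    return ''.join(alpha + (a if a in tags else '')
                   for (_, alpha, a, _) in path)

def bitcode(path, priorities={'0', '1'}):
    """B(pi): concatenation of priority bits along the path."""
    return ''.join(a for (_, _, a, _) in path if a in priorities)

# Example path for a TRE like `1 a 2` (tags t1, t2 around symbol `a`):
path = [(0, '', 't1', 1), (1, 'a', '', 2), (2, '', 't2', 3)]
assert s_string(path) == 'a'
assert t_string(path, {'t1', 't2'}) == 't1at2'
assert bitcode(path) == ''
```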
+
+The set of all S-strings that are transduced to some T-string is the \emph{input language} of TNFA;
+likewise, the set of all transduced T-strings is the \emph{output language} of TNFA.
+It is easy to see that for every TRE $e$ the input language of TNFA $\XN(e)$ equals its S-language $\XS \Xlb e \Xrb$
+and the output language of $\XN(e)$ equals its T-language $\XT \Xlb e \Xrb$
+(proof is by induction on the structure of TRE and by construction of TNFA).
+\\
The simplest way to simulate TNFA is as follows.
Starting from the initial state, trace all possible paths that match the input string; record T-strings along each path.
The efficiency of this algorithm depends on the implementation of $closure$, which is discussed in the next section.
\\
-% \begin{minipage}{\linewidth}
- $transduce((\Sigma, T, P, Q, F, q_0, T, \Delta), \alpha_1 \dots \alpha_n)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] $X \Xset closure(\{ (q_0, \epsilon) \}, F, \Delta)$
- \smallskip
- \item[] for $i \Xeq \overline{1,n}$:
- \begin{itemize}
- \item[] $Y \Xset reach(X, \Delta, \alpha_i)$
- \item[] $X \Xset closure(Y, F, \Delta)$
- \end{itemize}
- \item[] $x \Xset min_\prec\{ x \mid (q, x) \Xin X \wedge q \Xin F \}$
- \item[] return $H(x)$
- \end{itemize}
-% \end{minipage}
-
- \bigskip
-
-% \begin{minipage}{\linewidth}
- $reach(X, \Delta, \alpha)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] return $\{ (p, x \alpha) \mid (q, x) \Xin X \wedge (q, \alpha, \epsilon, p) \Xin \Delta \}$
- \end{itemize}
-% \end{minipage}
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+  \Fn {$\underline{transduce((\Sigma, T, P, Q, F, q_0, \Delta), \alpha_1 \dots \alpha_n)} \smallskip$} {
+ $X \Xset closure(\{ (q_0, \epsilon) \}, F, \Delta)$ \;
+ \For {$i \Xeq \overline{1,n}$} {
+ $Y \Xset reach(X, \Delta, \alpha_i)$ \;
+ $X \Xset closure(Y, F, \Delta)$ \;
+ }
+ $x \Xset min_\prec\{ x \mid (q, x) \Xin X \wedge q \Xin F \}$ \;
+ \Return $H(x)$ \;
+ }
+ \end{algorithm}
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{reach(X, \Delta, \alpha)} \smallskip$} {
+ \Return $\{ (p, x \alpha) \mid (q, x) \Xin X \wedge (q, \alpha, \epsilon, p) \Xin \Delta \}$
+ }
+ \end{algorithm}
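The simulation loop above can be sketched in Python, with $closure$ passed as a parameter (its implementations are discussed in the next section). This is an illustrative model only: the transition encoding, the use of Python string order in place of $\prec$, and the $\epsilon$-free demo automaton (for which the identity closure suffices) are all assumptions of the sketch.

```python
# Sketch of transduce/reach. Configurations are pairs (state, T-string);
# transitions are tuples (q, alpha, tag, p) with '' for epsilon.

def reach(X, Delta, alpha):
    # follow transitions labeled by symbol alpha, extending each T-string
    return {(p, x + alpha) for (q, x) in X
                           for (qq, a, _, p) in Delta
                           if qq == q and a == alpha}

def transduce(Delta, q0, F, s, closure=lambda X, F, D: X):
    X = closure({(q0, '')}, F, Delta)
    for alpha in s:
        X = closure(reach(X, Delta, alpha), F, Delta)
    # pick the minimal accepted T-string (Python order stands in for prec)
    final = sorted(x for (q, x) in X if q in F)
    return final[0] if final else None

# TNFA for `a|b` without tags: 0 --a--> 1, 0 --b--> 1
Delta = {(0, 'a', '', 1), (0, 'b', '', 1)}
assert transduce(Delta, 0, {1}, 'a') == 'a'
assert transduce(Delta, 0, {1}, 'ab') is None
```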
\section{Tagged $\epsilon$-closure}\label{section_closure}
The most straightforward implementation of $closure$ (shown below)
-is to simply gather all possible non-looping $\epsilon$-paths that end in core state:
-
- \bigskip
-
-% \begin{minipage}{\linewidth}
- $closure(X, F, \Delta)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] empty $stack$, $result \Xset \emptyset$
- \item[] for $(q, x) \Xin X:$
- \begin{itemize}
- \item[] $push(stack, (q, x))$
- \end{itemize}
- \item[] while $stack$ is not empty
- \begin{itemize}
- \item[] $(q, x) \Xset pop(stack)$
- \item[] $result \Xset result \cup \{(q, x)\}$
- \item[] for all outgoing arcs $(q, \epsilon, \chi, p) \Xin \Delta$
- \begin{itemize}
- \item[] if $\not \exists (\widetilde{p}, \widetilde{x})$ on stack $: \widetilde{p} \Xeq p$
- \begin{itemize}
- \item[] $push(stack, (p, x \chi))$
- \end{itemize}
- \end{itemize}
- \end{itemize}
- \item[] return $\{ (q, x) \Xin result \mid core(q, F, \Delta) \}$
- \end{itemize}
-% \end{minipage}
-
- \bigskip
+is to simply gather all possible non-looping $\epsilon$-paths.
+Note that we only need paths that end in the final state
+or paths whose end state has outgoing transitions on symbols:
+all other paths will be dropped by $reach$ on the next simulation step.
+\\
- $core(q, F, \Delta)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] return $q \Xin F \vee \exists \alpha, p: (q, \alpha, \epsilon, p) \Xin \Delta$
- \end{itemize}
-
- \bigskip
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{closure(X, F, \Delta)} \smallskip$} {
+ empty $stack$, $result \Xset \emptyset$ \;
+ \For {$(q, x) \Xin X:$} {
+ $push(stack, (q, x))$ \;
+ }
+ \While {$stack$ is not empty} {
+ $(q, x) \Xset pop(stack)$ \;
+ $result \Xset result \cup \{(q, x)\}$ \;
+ \ForEach {outgoing arc $(q, \epsilon, \chi, p) \Xin \Delta$} {
+ \If {$\not \exists (\widetilde{p}, \widetilde{x})$ on stack $: \widetilde{p} \Xeq p$} {
+ $push(stack, (p, x \chi))$ \;
+ }
+ }
+ }
+ \Return $\{ (q, x) \Xin result \mid core(q, F, \Delta) \}$ \;
+ }
+ \end{algorithm}
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{core(q, F, \Delta)} \smallskip$} {
+ \Return $q \Xin F \vee \exists \alpha, p: (q, \alpha, \epsilon, p) \Xin \Delta$
+ }
+ \end{algorithm}
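A recursive Python rendering of the naive closure makes the non-looping condition explicit: the set of states on the current DFS path is tracked, so $\epsilon$-loops are never followed. The transition encoding (tuples with \texttt{''} for $\epsilon$) is an assumption of this sketch.

```python
# Naive closure: gather all non-looping epsilon-paths, then keep only
# configurations in core states (final, or with symbol transitions).

def core(q, F, Delta):
    return q in F or any(qq == q and a != '' for (qq, a, _, p) in Delta)

def closure_naive(X, F, Delta):
    result = set()
    def dfs(q, x, onpath):
        result.add((q, x))
        for (qq, a, chi, p) in Delta:
            if qq == q and a == '' and p not in onpath:  # non-looping only
                dfs(p, x + chi, onpath | {p})
    for (q, x) in X:
        dfs(q, x, {q})
    return {(q, x) for (q, x) in result if core(q, F, Delta)}

# 0 --eps/t1--> 1 --eps--> 2 --a--> 3 : only state 2 survives the filter
Delta = {(0, '', 't1', 1), (1, '', '', 2), (2, 'a', '', 3)}
assert closure_naive({(0, '')}, {3}, Delta) == {(2, 't1')}
```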
Since there might be multiple paths between two given states,
the number of different paths may grow exponentially in the number of TNFA states.
We call such a policy \emph{prefix-based};
later we will show that both POSIX and leftmost greedy policies have this property.
+ \begin{Xdef}
+ Paths
+ $\pi_1 \Xeq \{(q_i, \alpha_i, a_i, \widetilde{q}_i)\}_{i=1}^n$ and
+ $\pi_2 \Xeq \{(p_i, \beta_i, b_i, \widetilde{p}_i)\}_{i=1}^m$
+ are \emph{ambiguous} if their start and end states coincide: $q_1 \Xeq p_1$, $\widetilde{q}_n \Xeq \widetilde{p}_m$
+ and their induced T-strings $\XT(\pi_1)$ and $\XT(\pi_2)$ are ambiguous.
+ $\square$
+ \end{Xdef}
+
\begin{Xdef}
Disambiguation policy for TRE $e$ is \emph{prefix-based}
if it can be extended on the set of ambiguous prefixes of T-strings in $\XT \Xlb e \Xrb$,
Laurikari gives the following algorithm for closure construction (see Algorithm 3.4 in [Lau01]):
\\
-% \begin{minipage}{\linewidth}
- $closure \Xund laurikari(X, F, \Delta)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] empty $deque$, $result(q) \equiv \bot$
- \item[] $indeg \Xeq count \Xset indegree(X, \Delta)$
- \item[] for $(q, x) \Xin X$:
- \begin{itemize}
- \item[] $relax(q, x, result, deque, count, indeg)$
- \end{itemize}
- \item[] while $deque$ is not empty
- \begin{itemize}
- \item[] $q \Xset pop \Xund front (deque)$
- \item[] for all outgoing arcs $(q, \epsilon, \chi, p) \Xin \Delta$
- \begin{itemize}
- \item[] $x \Xset result(q) \chi$
- \item[] $relax(p, x, result, deque, count, indeg)$
- \end{itemize}
- \end{itemize}
- \item[] return $\{ (q, x) \mid x \Xeq result(q) \wedge core(q, F, \Delta) \}$
- \end{itemize}
-% \end{minipage}
-
- \bigskip
-
-% \begin{minipage}{\linewidth}
- $relax(q, x, result, deque, count, indeg)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] if $x \prec result(q)$
- \begin{itemize}
- \item[] $result(q) \Xset x$
- \item[] $count(p) \Xset count(p) - 1$
- \item[] if $count(p) \Xeq 0$
- \begin{itemize}
- \item[] $count(p) \Xset indeg(p)$
- \item[] $push \Xund front (deque, q)$
- \end{itemize}
- \item[] else
- \begin{itemize}
- \item[] $push \Xund back (deque, q)$
- \end{itemize}
- \end{itemize}
- \end{itemize}
-% \end{minipage}
-
- \bigskip
-
-% \begin{minipage}{\linewidth}
- $indegree(X, \Delta)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] empty $stack$, $indeg(q) \equiv 0$
- \item[] for $(q, x) \Xin X$
- \begin{itemize}
- \item[] $push(stack, q)$
- \end{itemize}
- \item[] while $stack$ is not empty
- \begin{itemize}
- \item[] $q \Xset pop(stack)$
- \item[] if $indeg(q) \Xeq 0$
- \begin{itemize}
- \item[] for all outgoing arcs $(q, \epsilon, \chi, p) \Xin \Delta$
- \begin{itemize}
- \item[] $push(stack, p)$
- \end{itemize}
- \end{itemize}
- \item[] $indeg(q) \Xset indeg(q) + 1$
- \end{itemize}
- \item[] return $indeg$
- \end{itemize}
-% \end{minipage}
-
- \bigskip
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{closure \Xund laurikari(X, F, \Delta)} \smallskip$} {
+ empty $deque$, $result(q) \equiv \bot$ \;
+ $indeg \Xset indegree(X, \Delta)$ \;
+ $count \Xset indeg$ \;
+ \For {$(q, x) \Xin X$} {
+ $relax(q, x, result, deque, count, indeg)$ \;
+ }
+ \While {$deque$ is not empty} {
+ $q \Xset pop \Xund front (deque)$ \;
+ \ForEach {outgoing arc $(q, \epsilon, \chi, p) \Xin \Delta$} {
+ $x \Xset result(q) \chi$ \;
+ $relax(p, x, result, deque, count, indeg)$ \;
+ }
+ }
+ \Return $\{ (q, x) \mid x \Xeq result(q) \wedge core(q, F, \Delta) \}$
+ }
+ \end{algorithm}
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{relax(q, x, result, deque, count, indeg)} \smallskip$} {
+ \If {$x \prec result(q)$} {
+ $result(q) \Xset x$ \;
+        $count(q) \Xset count(q) - 1$ \;
+        \If {$count(q) \Xeq 0$} {
+          $count(q) \Xset indeg(q)$ \;
+          $push \Xund front (deque, q)$ \;
+ } \Else {
+ $push \Xund back (deque, q)$ \;
+ }
+ }
+ }
+ \end{algorithm}
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{indegree(X, \Delta)} \smallskip$} {
+ empty $stack$, $indeg(q) \equiv 0$ \;
+ \For {$(q, x) \Xin X$} {
+ $push(stack, q)$ \;
+ }
+ \While {$stack$ is not empty} {
+ $q \Xset pop(stack)$ \;
+ \If {$indeg(q) \Xeq 0$} {
+ \ForEach {outgoing arc $(q, \epsilon, \chi, p) \Xin \Delta$} {
+ $push(stack, p)$ \;
+ }
+ }
+ $indeg(q) \Xset indeg(q) + 1$ \;
+ }
+ \Return $indeg$ \;
+ }
+ \end{algorithm}
We will refer to the above algorithm as LAU.
The key idea of LAU is to reorder scanned nodes so that ancestors are processed before their descendants.
(the only difference is in the $relax$ procedure):
\\
-% \begin{minipage}{\linewidth}
- $relax(q, x, result, deque, count, indeg)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] if $count(q) \Xeq 0$
- \begin{itemize}
- \item[] $count(q) \Xset indeg(q)$
- \end{itemize}
- \item[] $count(p) \Xset count(p) - 1$
-
- \item[] if $count(p) \Xeq 0$ and $p$ is on $deque$
- \begin{itemize}
- \item[] $remove (deque, p)$
- \item[] $push \Xund front (deque, p)$
- \end{itemize}
-
- \item[] if $x \prec result(q)$
- \begin{itemize}
- \item[] $result(q) \Xset x$
- \item[] if $q$ is not on $deque$
- \begin{itemize}
- \item[] if $count(q) \Xeq 0$
- \begin{itemize}
- \item[] $push \Xund front (deque, q)$
- \end{itemize}
- \item[] else
- \begin{itemize}
- \item[] $push \Xund back (deque, q)$
- \end{itemize}
- \end{itemize}
- \end{itemize}
- \end{itemize}
-% \end{minipage}
-
- \bigskip
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{relax(q, x, result, deque, count, indeg)} \smallskip$} {
+ \If {$count(q) \Xeq 0$} {
+ $count(q) \Xset indeg(q)$ \;
+ }
+        $count(q) \Xset count(q) - 1$ \;
+
+        \If {$count(q) \Xeq 0$ and $q$ is on $deque$} {
+          $remove (deque, q)$ \;
+          $push \Xund front (deque, q)$ \;
+ }
+
+ \If {$x \prec result(q)$} {
+ $result(q) \Xset x$ \;
+ \If {$q$ is not on $deque$} {
+ \If {$count(q) \Xeq 0$} {
+ $push \Xund front (deque, q)$ \;
+ } \Else {
+ $push \Xund back (deque, q)$ \;
+ }
+ }
+ }
+ }
+ \end{algorithm}
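The relaxation idea shared by LAU and LAU1 can be sketched in simplified form: keep the minimal T-string found so far for each state and rescan states whose value improved. This sketch deliberately replaces the deque-with-indegree scheduling by a plain FIFO queue (Bellman-Ford-Moore style) and uses Python string order in place of $\prec$; it terminates because an $\epsilon$-loop only extends a T-string, and an extension never precedes its own prefix.

```python
# Simplified relaxation-based closure (FIFO scheduling, not LAU's deque).
from collections import deque

def core(q, F, Delta):
    return q in F or any(qq == q and a != '' for (qq, a, _, p) in Delta)

def closure_relax(X, F, Delta):
    result, queue = {}, deque()   # result maps state -> minimal T-string
    def relax(q, x):
        if q not in result or x < result[q]:   # '<' stands in for prec
            result[q] = x
            queue.append(q)                    # reschedule improved state
    for (q, x) in X:
        relax(q, x)
    while queue:
        q = queue.popleft()
        for (qq, a, chi, p) in Delta:
            if qq == q and a == '':
                relax(p, result[q] + chi)
    return {(q, x) for (q, x) in result.items() if core(q, F, Delta)}

# two eps-paths reach state 3 with bitcodes '1' and '0'; '0' wins
Delta = {(0, '', '1', 1), (0, '', '0', 2),
         (1, '', '', 3), (2, '', '', 3), (3, 'a', '', 4)}
assert closure_relax({(0, '')}, {4}, Delta) == {(3, '0')}
```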
Still, for graphs with cycles the worst-case complexity of LAU and LAU1 is unclear;
usually algorithms that schedule nodes in LIFO order (e.g. Pape-Levit) have exponential complexity [ShiWit81].
However, there is another algorithm also based on the idea of topological ordering,
which has $O(nm)$ worst-case complexity and $O(n + m)$ complexity on acyclic graphs
(where $n$ is the number of nodes and $m$ is the number of edges).
-It is the GOR1 algorithm described in [GolRad93]:
+It is the GOR1 algorithm described in [GolRad93]
+(the version listed here is one of the possible variations of the algorithm):
\\
-% \begin{minipage}{\linewidth}
- $closure \Xund goldberg \Xund radzik(X, F, \Delta)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] empty stacks $topsort$, $newpass$
- \item[] $result(q) \equiv \bot$
- \item[] $status(q) \equiv \mathit{OFFSTACK}$
- \item[] for $(q, x) \Xin X$:
- \begin{itemize}
- \item[] $relax(q, x, result, topsort)$
- \end{itemize}
- \item[] while $topsort$ is not empty
- \begin{itemize}
-
- \smallskip
- \item[] while $topsort$ is not empty
- \begin{itemize}
- \item[] $q \Xset pop(topsort)$
-
- \item[] if $status(q) \Xeq \mathit{TOPSORT}$
- \begin{itemize}
- \item[] $push(newpass, n)$
- \end{itemize}
-
- \item[] else if $status(q) \Xeq \mathit{NEWPASS}$
- \begin{itemize}
- \item[] $status(q) \Xset \mathit{TOPSORT}$
- \item[] $push(topsort, q)$
- \item[] $scan(q, result, topsort)$
- \end{itemize}
- \end{itemize}
-
- \smallskip
- \item[] while $newpass$ is not empty
- \begin{itemize}
- \item[] $q \Xset pop(newpass)$
- \item[] $scan(q, result, topsort)$
- \item[] $status(q) \Xset \mathit{OFFSTACK}$
- \end{itemize}
- \end{itemize}
-
- \item[] return $\{ (q, x) \mid x \Xeq result(q) \wedge core(q, F, \Delta) \}$
- \end{itemize}
-% \end{minipage}
-
- \bigskip
-
-% \begin{minipage}{\linewidth}
- $scan(q, result, topsort)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] for all outgoing arcs $(q, \epsilon, \chi, p) \Xin \Delta$:
- \begin{itemize}
- \item[] $x \Xset result(q) \chi$
- \item[] $relax(p, x, result, topsort)$
- \end{itemize}
- \end{itemize}
-% \end{minipage}
-
- \bigskip
-
-% \begin{minipage}{\linewidth}
- $relax(q, x, result, topsort)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] if $x \prec result(q)$
- \begin{itemize}
- \item[] $result(q) \Xset x$
- \item[] if $status(q) \neq \mathit{TOPSORT}$
- \begin{itemize}
- \item[] $push(topsort, q)$
- \item[] $status(q) \Xset \mathit{NEWPASS}$
- \end{itemize}
- \end{itemize}
- \end{itemize}
-% \end{minipage}
-
- \bigskip
-
-% $boundary(X, F, \Delta)$
-% \hrule
-% \begin{itemize}[leftmargin=0in]
-% \smallskip
-% \item[] return $q \Xin F \vee \exists \alpha, p: (q, \alpha, \epsilon, p) \Xin \Delta$
-% \end{itemize}
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{closure \Xund goldberg \Xund radzik(X, F, \Delta)} \smallskip$} {
+ empty stacks $topsort$, $newpass$ \;
+ $result(q) \equiv \bot$ \;
+ $status(q) \equiv \mathit{OFFSTACK}$ \;
+ \For {$(q, x) \Xin X$} {
+ $relax(q, x, result, topsort)$ \;
+ }
+ \While {$topsort$ is not empty} {
+ \While {$topsort$ is not empty} {
+ $q \Xset pop(topsort)$ \;
+
+ \If {$status(q) \Xeq \mathit{TOPSORT}$} {
+            $push(newpass, q)$ \;
+ } \ElseIf {$status(q) \Xeq \mathit{NEWPASS}$} {
+ $status(q) \Xset \mathit{TOPSORT}$ \;
+ $push(topsort, q)$ \;
+ $scan(q, result, topsort)$ \;
+ }
+ }
+ \While {$newpass$ is not empty} {
+ $q \Xset pop(newpass)$ \;
+ $scan(q, result, topsort)$ \;
+ $status(q) \Xset \mathit{OFFSTACK}$ \;
+ }
+ }
+ \Return $\{ (q, x) \mid x \Xeq result(q) \wedge core(q, F, \Delta) \}$ \;
+ }
+ \end{algorithm}
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{scan(q, result, topsort)} \smallskip$} {
+ \ForEach {outgoing arc $(q, \epsilon, \chi, p) \Xin \Delta$} {
+ $x \Xset result(q) \chi$ \;
+ $relax(p, x, result, topsort)$ \;
+ }
+ }
+ \end{algorithm}
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{relax(q, x, result, topsort)} \smallskip$} {
+ \If {$x \prec result(q)$} {
+ $result(q) \Xset x$ \;
+ \If {$status(q) \neq \mathit{TOPSORT}$} {
+ $push(topsort, q)$ \;
+ $status(q) \Xset \mathit{NEWPASS}$ \;
+ }
+ }
+ }
+ \end{algorithm}
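A direct Python rendering of GOR1 as listed above may help to follow the two-stack structure; Python string order stands in for $\prec$, and the transition encoding is an assumption of the sketch.

```python
# GOR1-style closure: a topsort stack for the depth-first pass and a
# newpass stack holding nodes in topological order for scanning.

def core(q, F, Delta):
    return q in F or any(qq == q and a != '' for (qq, a, _, p) in Delta)

OFF, TOP, NEW = 'OFFSTACK', 'TOPSORT', 'NEWPASS'

def closure_gor1(X, F, Delta):
    result, status = {}, {}
    topsort, newpass = [], []

    def relax(q, x):
        if q not in result or x < result[q]:   # '<' stands in for prec
            result[q] = x
            if status.get(q, OFF) != TOP:
                topsort.append(q)
                status[q] = NEW

    def scan(q):
        for (qq, a, chi, p) in Delta:
            if qq == q and a == '':
                relax(p, result[q] + chi)

    for (q, x) in X:
        relax(q, x)
    while topsort:
        while topsort:                 # depth-first pass building topsort
            q = topsort.pop()
            if status.get(q, OFF) == TOP:
                newpass.append(q)
            elif status.get(q, OFF) == NEW:
                status[q] = TOP
                topsort.append(q)      # re-pushed; popped later as TOPSORT
                scan(q)
        while newpass:                 # scan nodes in topological order
            q = newpass.pop()
            scan(q)
            status[q] = OFF
    return {(q, x) for (q, x) in result.items() if core(q, F, Delta)}

Delta = {(0, '', '1', 1), (0, '', '0', 2),
         (1, '', '', 3), (2, '', '', 3), (3, 'a', '', 4)}
assert closure_gor1({(0, '')}, {4}, Delta) == {(3, '0')}
```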
In order to better understand all three algorithms and compare their behaviour on various classes of graphs,
I used the benchmark suite described in [CheGolRad96].
The most important results are as follows.
On Acyc-Neg family (acyclic graphs with mixed weights)
LAU is non-linear and significantly slower,
-while LAU1 and GOR1 are both linear and LAU1 scans each node exactly once:
-
-\begin{center}
-\includegraphics[width=\linewidth]{img/plot_acyc_neg.png}\\*
-\footnotesize{Behavior of LAU, LAU1 and GOR1 on Acyc-Neg family.}
-\end{center}
-
-\begin{center}
-\includegraphics[width=\linewidth]{img/plot_acyc_neg_logscale.png}\\*
-\footnotesize{Behavior of LAU, LAU1 and GOR1 on Acyc-Neg family (logarithmic scale on both axes).}
-\end{center}
-
+while LAU1 and GOR1 are both linear and LAU1 scans each node exactly once.
On Grid-NHard and Grid-PHard families (graphs with cycles designed to be hard for algorithms that exploit graph structure)
both LAU and LAU1 are very slow (though approximation suggests a polynomial, not an exponential fit),
-while GOR1 is fast:
+while GOR1 is fast.
+On other graph families all three algorithms behave quite well;
+it is surprising that LAU is fast on the Acyc-Pos family, while being so slow on Acyc-Neg.
+See also [NonPalXue00]: they study two modifications of GOR1, one of which is very close to LAU1,
+and conjecture (without a proof) that worst-case complexity is exponential.
-\begin{center}
-\includegraphics[width=\linewidth]{img/plot_grid_nhard.png}\\*
-\footnotesize{Behavior of LAU, LAU1 and GOR1 on Grid-NHard family.}
-\end{center}
+\end{multicols}
\begin{center}
-\includegraphics[width=\linewidth]{img/plot_grid_nhard_logscale.png}\\*
-\footnotesize{Behavior of LAU, LAU1 and GOR1 on Grid-NHard family (logarithmic scale on both axes).}
+\includegraphics[width=\linewidth]{img/plot_acyc_neg_both.png}\\*
+\footnotesize{Behavior of LAU, LAU1 and GOR1 on Acyc-Neg family (left: normal scale, right: logarithmic scale on both axes).}
+\includegraphics[width=\linewidth]{img/plot_grid_nhard_both.png}\\*
+\footnotesize{Behavior of LAU, LAU1 and GOR1 on Grid-NHard family (left: normal scale, right: logarithmic scale on both axes).}
\end{center}
-On other graph families all three algorithms behave quite well;
-it is strange that LAU is fast on Acyc-Pos family, while being so slow on Acyc-Neg family.
-See also [NonPalXue00]: they study two modifications of GOR1, one of which is very close to LAU1,
-and conjecture (without a proof) that worst-case complexity is exponential.
+\begin{multicols}{2}
+
%\end{multicols}
%\begin{minipage}{\linewidth}
Then $x \prec y$ iff $\prec_{lexicographic} (a, b)$:
\\
- $\prec_{lexicographic} (a_1 \dots a_n, b_1 \dots b_m)$
- \hrule
- \begin{enumerate}[leftmargin=0in]
- \smallskip
- \item[] for $i \Xeq \overline{1, min(n, m)}$:
- \begin{enumerate}
- \item[] if $a_i \!\neq\! b_i$ return $a_i \!<\! b_i$
- \end{enumerate}
- \item[] return $n \!<\! m$.
- \\
- \end{enumerate}
- \bigskip
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{\prec_{lexicographic} (a_1 \dots a_n, b_1 \dots b_m)} \smallskip$} {
+ \For {$i \Xeq \overline{1, min(n, m)}$} {
+ \lIf {$a_i \!\neq\! b_i$} { \Return $a_i \!<\! b_i$ }
+ }
+ \Return $n \!<\! m$ \;
+ }
+ \end{algorithm}
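This comparator translates to Python directly; note that on strings of priority bits the built-in string order coincides with it (which is why earlier sketches could use `<`).

```python
def prec_lexicographic(a, b):
    # True iff bitcode a strictly precedes b: compare bitwise from the
    # left; a proper prefix precedes all of its extensions.
    for x, y in zip(a, b):
        if x != y:
            return x < y
    return len(a) < len(b)

assert prec_lexicographic('0', '1')
assert prec_lexicographic('01', '010')       # proper prefix precedes
assert not prec_lexicographic('010', '010')  # the order is strict
```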
This definition has one caveat: the existence of a minimal element is not guaranteed for TRE that contain $\epsilon$-loops.
For example, TNFA for $\epsilon^+$ has infinitely many ambiguous paths with bitcodes
%The following lemma states an important property of bitcodes induced by paths gathered by $\epsilon$-closure:
\begin{XLem}\label{lemma_bitcodes}
-Let $\Pi$ be a set of TNFA paths that have common start state, induce the same S-string and end in a core state.
+Let $\Pi$ be a set of TNFA paths that start in the same state, induce the same S-string and end in a core state
+(e.g. the set of active paths on each step of TNFA simulation).
Then the set of bitcodes induced by paths in $\Pi$ is prefix-free
(compare with [Gra15], lemma 3.1).
-
-\smallskip
-
-Proof.
-Consider paths $\pi_1$ and $\pi_2$ in $\Pi$,
+\\[0.5em]
+\textbf{Proof.}
+Consider paths $\pi_1$ and $\pi_2$ in $\Pi$
and suppose that $\XB(\pi_1)$ is a prefix of $\XB(\pi_2)$.
Then $\pi_1$ must be a prefix of $\pi_2$: otherwise there is a state where $\pi_1$ and $\pi_2$ diverge,
and by TNFA construction all outgoing transitions from this state have different priorities,
Let $\pi_2 \Xeq \pi_1 \pi_3$.
Since $\XS(\pi_1) \Xeq \XS(\pi_2)$, and since $\XS(\rho\sigma) \Xeq \XS(\rho)\XS(\sigma)$ for arbitrary path $\rho\sigma$,
it must be that $\XS(\pi_3) \Xeq \epsilon$.
-The end state of $\pi_2$ is a core state: by TNFA construction it has no outgoing $\epsilon$-transitions.
-But it is also the start state of $\epsilon$-path $\pi_3$, therefore $\pi_3$ is an empty path and $\pi_1 \Xeq \pi_2$.
+The end state of $\pi_2$ is a core state: by observation \ref{obs_tnfa_states} it has no outgoing $\epsilon$-transitions.
+But the same state is also the start state of $\epsilon$-path $\pi_3$, therefore $\pi_3$ is an empty path and $\pi_1 \Xeq \pi_2$.
$\square$
\end{XLem}
-From Lemma \ref{lemma_bitcodes} it easily follows that leftmost greedy disambiguation is prefix-based.
+From lemma \ref{lemma_bitcodes} it easily follows that leftmost greedy disambiguation is prefix-based.
Consider ambiguous paths $\pi_1$, $\pi_2$ and arbitrary suffix $\pi_3$,
and let $\XB(\pi_1) \Xeq a$, $\XB(\pi_2) \Xeq b$, $\XB(\pi_3) \Xeq c$.
Note that $\XB(\rho\sigma) \Xeq \XB(\rho)\XB(\sigma)$ for arbitrary path $\rho\sigma$,
(compare with [Gra15], lemma 2.2).
\\
-From Lemma \ref{lemma_bitcodes} it also follows that leftmost greedy disambiguation is foldable:
+From lemma \ref{lemma_bitcodes} it also follows that leftmost greedy disambiguation is foldable:
prefix-free bitcodes can be compared incrementally on each step of simulation.
We define the ``ambiguity shape'' of a TDFA state as the lexicographic order on bitcodes of all paths represented by configurations
(compare with [Gra15], definition 7.14).
first, by ordinals calculated on the previous step, then by bitcode fragments added by the $\epsilon$-closure.
\\
-%\begin{minipage}{\linewidth}
- $\prec_{leftmost \Xund greedy} ((n, a), (m, b))$
- \hrule
- \begin{enumerate}[leftmargin=0in]
- \smallskip
- \item[] if $n \!\neq\! m$ return $n \!<\! m$
- \item[] return $\prec_{lexicographic} (a, b)$
- \\
- \end{enumerate}
-%\end{minipage}
-
- \bigskip
-
-%\begin{minipage}{\linewidth}
- $ordinals (\{(q_i, o_i, x_i)\}_{i=1}^n)$
- \hrule
- \begin{enumerate}[leftmargin=0in]
- \smallskip
- \item[] $\{(p_i, B_i)\} \Xset $ sort $\{(i, (o_i, x_i))\}$ by second component \\
- \hphantom{hspace{2em}} using $\prec_{leftmost \Xund greedy}$ as comparator
- \item[] $\widetilde{o}_{p_1 t} \Xset 0, \; j \Xset 0$
- \item[] for $i \Xeq \overline{2, n}$:
- \begin{enumerate}
- \item[] if $B_{i-1} \!\neq\! B_i: j \Xset j \!+\! 1$
- \item[] $\widetilde{o}_{p_i t} \Xset j$
- \end{enumerate}
- \item[] return $\{(q_i, \widetilde{o}_i, x_i)\}_{i=1}^n$
- \\
- \end{enumerate}
-%\end{minipage}
-
- \bigskip
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{\prec_{leftmost \Xund greedy} ((n, a), (m, b))} \smallskip$} {
+ \lIf {$n \!\neq\! m$} {\Return $n \!<\! m$}
+ \Return $\prec_{lexicographic} (a, b)$ \;
+ }
+ \end{algorithm}
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKw{Let}{let} \SetKw{Und}{undefined} \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{ordinals (\{(q_i, o_i, x_i)\}_{i=1}^n)} \smallskip$} {
+ $\{(p_i, B_i)\} \Xset $ sort $\{(i, (o_i, x_i))\}$ by second component using $\prec_{leftmost \Xund greedy}$ \;
+ \Let $o_{p_1}(t) \Xeq 0$, $ord \Xset 0$ \;
+ \For {$i \Xeq \overline{2, n}$} {
+ \lIf {$B_{i-1} \!\neq\! B_i$} {$ord \Xset ord \!+\! 1$}
+ \Let $o_{p_i}(t) \Xeq ord$ \;
+ }
+ \Return $\{(q_i, o_i, x_i)\}_{i=1}^n$ \;
+ }
+ \end{algorithm}
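The $ordinals$ routine can be sketched in Python: configurations are sorted by the pair (old ordinal, bitcode fragment), and equal pairs share a new ordinal. The list-of-tuples encoding is an assumption of this sketch, which also assumes at least one configuration.

```python
# Assign new ordinals: sort configurations (q, o, x) by (o, x) under the
# leftmost-greedy order (Python tuple/string comparison matches it for
# bitcode strings), then number the equivalence classes consecutively.

def ordinals(confs):
    order = sorted(range(len(confs)),
                   key=lambda i: (confs[i][1], confs[i][2]))
    new = [0] * len(confs)
    ord_, prev = 0, (confs[order[0]][1], confs[order[0]][2])
    for i in order:
        cur = (confs[i][1], confs[i][2])
        if cur != prev:          # new equivalence class of (o, x) pairs
            ord_ += 1
            prev = cur
        new[i] = ord_
    return [(q, new[i], x) for i, (q, o, x) in enumerate(confs)]

confs = [('q1', 0, '1'), ('q2', 0, '0'), ('q3', 1, '')]
assert ordinals(confs) == [('q1', 1, '1'), ('q2', 0, '0'), ('q3', 2, '')]
```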
In practice, explicit calculation of ordinals and comparison of bitcodes are not necessary:
if we treat TDFA states as ordered sets,
then the first path that arrives at any state would be the leftmost.
This approach is taken in e.g. [Karper].
Since tags are not engaged in disambiguation,
-we can use paired tags that represent capturing parenthesis, or just standalone tags --- this makes no difference with leftmost greedy policy.
+we can use paired tags that represent capturing parentheses, or just standalone tags --- this makes no difference with leftmost greedy policy.
\subsection*{POSIX}
POSIX policy is defined in [??]; [Fow] gives a comprehensible interpretation of it.
We will give a formal interpretation in terms of tags;
-it was first described by Laurikari in [Lau01], but the key idea must be absolutely attributed to Kuklewicz [??].
+it was first described by Laurikari in [Lau01], but the key idea should be attributed to Kuklewicz [??].
He never fully formalized his algorithm, and our version slightly deviates from the informal description,
so all errors should be attributed to the author of this paper.
Fuzz-testing RE2C against Regex-TDFA revealed no difference in submatch extraction
(see section \ref{section_tests_and_benchmarks} for details).
\\
-Consider an arbitrary RE without tags,
-and suppose that parenthesized subexpressions are marked for submatch extraction.
-We start by enumerating all subexpressions according to POSIX standard:
+POSIX disambiguation is defined in terms of \emph{subexpressions} and \emph{subpatterns}:
+a subexpression is a parenthesized sub-RE and a subpattern is a non-parenthesized sub-RE.
+Submatch extraction applies only to subexpressions, but disambiguation applies to both:
+subpatterns have ``equal rights'' with subexpressions.
+For simplicity we will now assume that all sub-RE are parenthesized;
+later in this section we will discuss the distinction in more detail.
+\\
- \begin{Xdef}
- For a given RE $e$, its \emph{indexed RE (IRE)} is $(\widetilde{e}, n \!-\! 1)$,
- where $(\widetilde{e}, n) \Xeq \XI(e, 1)$
- and $\XI$ is defined as follows:
+POSIX disambiguation is hierarchical:
+each subexpression has a certain \emph{priority} relative to other subexpressions,
+and the disambiguation algorithm must consider subexpressions in the order of their priorities.
+Therefore we will start by enumerating all subexpressions of the given RE according to POSIX standard:
+outer ones before inner ones and left ones before right ones.
+Enumeration is done by rewriting RE $e$ into an \emph{indexed RE} (IRE): a pair $(i, \widetilde{e})$,
+where $i$ is the index and $\widetilde{e}$ mirrors the structure of $e$,
+except that each sub-IRE is an indexed pair rather than a RE.
+For example, RE $a^* (b | \epsilon)$ corresponds to IRE
+$(1,
+ (2, (3, a)^*)
+ (4, (5, b) | (6, \epsilon))
+)$.
+Enumeration operator $\XI$ is defined below:
+it transforms a pair $(e, i)$ into a pair $(\widetilde{e}, j)$,
+where $e$ is a RE, $i$ is the start index, $\widetilde{e}$ is the resulting IRE and $j$ is the next free index.
\begin{align*}
\XI(\emptyset, i) &= ((i, \emptyset), i \!+\! 1) \\
\XI(\epsilon, i) &= ((i, \epsilon), i \!+\! 1) \\
\XI(e^{n,m}, i) &= ((i, \widetilde{e}^{n,m}), j) \\
\text{where } & (\widetilde{e}, j) \Xeq \XI(e, i \!+\! 1)
\end{align*}
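The enumeration operator can be sketched in Python as follows. This is an illustrative sketch under an assumed tuple representation of REs; the union and concatenation cases, not shown in the definition above, are extended in the obvious left-to-right way.

```python
# Hypothetical sketch of the enumeration operator I.
# REs are assumed to be tuples: ('nil',), ('eps',), ('sym', c),
# ('cat', e1, e2), ('alt', e1, e2) and ('rep', e1, n, m).

def enum(e, i):
    """Transform (RE e, start index i) into (IRE, next free index)."""
    kind = e[0]
    if kind in ('nil', 'eps', 'sym'):
        return (i, e), i + 1
    if kind == 'rep':
        sub, j = enum(e[1], i + 1)
        return (i, ('rep', sub, e[2], e[3])), j
    # 'cat' / 'alt': index this node, then children left to right
    left, j = enum(e[1], i + 1)
    right, k = enum(e[2], j)
    return (i, (kind, left, right)), k
```

For the RE $a^* (b | \epsilon)$ this sketch yields exactly the IRE from the example above, with indices $1$ through $6$ and next free index $7$.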
- $\square$
- \end{Xdef}
-Next, we rewrite IRE into TRE by rewriting each indexed subexpression $(i, e)$
+Now that the order on subexpressions is defined, we can
+rewrite IRE into TRE by rewriting each indexed subexpression $(i, e)$
into tagged subexpression $t_1 e \, t_2$, where $t_1 \Xeq 2i \!-\! 1$ is the \emph{start tag}
and $t_2 \Xeq 2i$ is the \emph{end tag}.
-Tags that correspond to iteration subexpressions are called \emph{orbit} tags.
-\\
-
-Strictly speaking, IRE is not necessary: it's just a consise way of enumerating tags.
-For example, RE $a^* (b | \epsilon)$ corresponds to IRE
-$(1,
- (2, (3, a)^*)
- (4, (5, b) | (6, \epsilon))
-)$
-and to TRE
+If $e$ is a repetition subexpression, then $t_1$ and $t_2$ are called \emph{orbit} tags.
+TRE corresponding to the above example is
$(1\,
3\, (5\, a \,6)^* \,4\,
7\, (9\, b \,10 | 11\, \epsilon \,12) 8\,
2)$.
\\
-POSIX disambiguation is hierarchical:
-it starts with the first subexpression and traverses subexpressions in order of their indices.
-For each subexpression it compares positions of start and end tags in the two given T-strings.
-Positions are called \emph{offsets}; each offset is either $\varnothing$ (if the occurence of tag is negative),
-or equal to the number of $\Sigma$-symbols before this occurence.
-The sequence of offsets is called \emph{history}.
+According to POSIX, each subexpression should start as early as possible and span as long as possible.
+In terms of tags this means that the position of the start tag is \emph{minimized},
+while the position of the end tag is \emph{maximized}.
+A subexpression may match several times; therefore one tag may occur multiple times in the T-string.
+Obviously, orbit tags may repeat;
+non-orbit tags may also repeat, provided that they are nested in a repetition subexpression.
+For example, TRE $(1 \, (3 \, (5 \, a \, 6 | 7 \, b \, 8) \, 4)^* \, 2)$ that corresponds to POSIX RE \texttt{(a|b)*}
+denotes T-string $1\, 3\, 5\, a\, 6\, \Xbar{7}\, \Xbar{8}\, 4\, 3\, 5\, a\, 6\, \Xbar{7}\, \Xbar{8}\, 4\, 2$
+(corresponding to S-string $aa$),
+in which orbit tags $3$ and $4$ occur twice, as well as non-orbit tags $5$, $6$, $7$ and $8$.
+Each occurrence of a tag has a corresponding \emph{offset}:
+either $\varnothing$ (for negative tags), or the number of preceding symbols in the S-string.
+The sequence of all offsets is called \emph{history}:
+for example, tag $3$ has history $0 \, 1$ and tag $7$ has history $\varnothing \, \varnothing$.
Each history consists of one or more \emph{subhistories}:
-they correspond to longest continuous substrings not interrupted by tags of subexpressions with lower indices.
-For non-iteration subexpressions subhistories contain exactly one offset;
-for iteration subexpressions they are either $\varnothing$, or may contain multiple non-$\varnothing$ offsets.
-If disambiguation is between T-string prefixes, then the last subhistory may be incomplete or empty.
-The following algorithm constructs a list of subhistories for the given T-string (possibly prefix) and tag.
-Its complexity is $O(n)$, since it processes each symbol at most once:
+longest subsequences of offsets not interrupted by tags of subexpressions with higher priority.
+In our example tag $3$ has one subhistory $0 \, 1$, while tag $7$ has two subhistories $\varnothing$ and $\varnothing$.
+Non-orbit subhistories contain exactly one offset (possibly $\varnothing$);
+orbit subhistories are either $\varnothing$, or may contain multiple non-$\varnothing$ offsets.
+Histories can be reconstructed from T-strings as follows:
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{history(a_1 \dots a_n, t)} \smallskip$} {
+ $i \Xset 1, \; j \Xset 1, \; pos \Xset 0$ \;
+ \While {$true$} {
+ \While {$i \leq n$ and $a_i \!\not\in\! \{t, \Xbar{t}\}$} {
+ \lIf {$a_i \Xin \Sigma$} {$pos \Xset pos \!+\! 1$}
+ $i \Xset i \!+\! 1$ \;
+ }
+ \While {$i \leq n$ and $a_i \!\not\in\! hightags(t)$} {
+ \lIf {$a_i \Xin \Sigma$} {$pos \Xset pos \!+\! 1$}
+ \lIf {$a_i \Xeq t$} {$A_j \Xset A_j pos$}
+ \lIf {$a_i \Xeq \Xbar{t}$} {$A_j \Xset A_j \varnothing$}
+ $i \Xset i \!+\! 1$ \;
+ }
+ \lIf {$i \!>\! n$} {break}
+ $j \Xset j \!+\! 1$ \;
+ }
+ \Return $A_1 \dots A_j$ \;
+ }
+ \end{algorithm}
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{hightags(t)} \smallskip$} {
+ \Return $\{ u, \Xbar{u} \mid u < 2 \lceil t / 2 \rceil \!-\! 1 \}$ \;
+ }
+ \end{algorithm}
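The two algorithms above can be sketched in Python as follows. The representation is an assumption made for illustration: tags are positive integers, negative tags are negated integers, alphabet symbols are one-character strings, and the offset $\varnothing$ is \texttt{None}. The sketch also drops the empty trailing subhistory that the pseudocode may produce when the T-string ends with a higher-priority tag.

```python
# Hypothetical sketch of history reconstruction from a T-string.

def hightags(t):
    """Tags (positive or negative) of higher-priority subexpressions."""
    bound = 2 * ((t + 1) // 2) - 1   # 2 * ceil(t/2) - 1
    return {u for v in range(1, bound) for u in (v, -v)}

def history(tstr, t):
    high = hightags(t)
    subs, cur, pos, i, n = [], [], 0, 0, len(tstr)
    while True:
        # skip to the next occurrence of tag t (positive or negative)
        while i < n and tstr[i] not in (t, -t):
            if isinstance(tstr[i], str):
                pos += 1
            i += 1
        # collect one subhistory, up to a higher-priority tag
        while i < n and tstr[i] not in high:
            if isinstance(tstr[i], str):
                pos += 1
            elif tstr[i] == t:
                cur.append(pos)
            elif tstr[i] == -t:
                cur.append(None)   # negative tag: offset is unknown
            i += 1
        subs.append(cur)
        cur = []
        if i >= n:
            break
    if len(subs) > 1 and subs[-1] == []:
        subs.pop()                 # artifact of the final break
    return subs
```

Running it on the T-string from the \texttt{(a|b)*} example reproduces the subhistories described above: tag $3$ has one subhistory $0\,1$ and tag $7$ has two subhistories $\varnothing$ and $\varnothing$.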
+
+%Disambiguation algorithm should compare individual subhistories, not histories as a whole.
+%The number of subhistories depends on the number of iterations of enclosing repetition subexpressions.
+Due to the hierarchical nature of POSIX disambiguation, if comparison reaches $i$-th subexpression,
+it means that all enclosing subexpressions have already been compared and their tags coincide.
+Consequently, the number of subhistories of tags $2i - 1$ and $2i$ in the compared T-strings must be equal.
\\
-%\begin{minipage}{\linewidth}
- $history(a_1 \dots a_n, t)$
- \hrule
- \begin{enumerate}[leftmargin=0in]
- \smallskip
- \item[] $U \Xset \{ u \mid u < 2 \lceil t / 2 \rceil \!-\! 1 \}$
- \item[] $i \Xset 1, \; j \Xset 1, \; p \Xset 0$
-
- \item[] while $true$
- \begin{enumerate}
-
- \item[] while $i \leq n$ and $a_i \!\not\in\! \{t, \Xbar{t}\}$
- \begin{enumerate}
- \item[] if $a_i \Xin \Sigma: p \Xset p \!+\! 1$
- \item[] $i \Xset i \!+\! 1$
- \end{enumerate}
-
- \item[] while $i \leq n$ and $a_i \!\not\in\! (U \cup \Xbar{U})$
- \begin{enumerate}
- \item[] if $a_i \Xin \Sigma: p \Xset p \!+\! 1$
- \item[] if $a_i \Xeq t: A_j \Xset A_j p$
- \item[] if $a_i \Xeq \Xbar{t}: A_j \Xset A_j \varnothing$
- \item[] $i \Xset i \!+\! 1$
- \end{enumerate}
-
- \item[] if $i \!>\! n$ break
- \item[] $j \Xset j \!+\! 1$
-
- \end{enumerate}
- \item[] return $A_1 \dots A_j$
- \end{enumerate}
-%\end{minipage}
+If disambiguation is defined on T-string prefixes, then the last subhistory may be incomplete.
+In particular, the last subhistory of the start tag may contain one more offset than the last subhistory of the end tag.
+In this case we assume that the missing offset is $\infty$, as it must be greater than any offset in the already matched S-string prefix.
+\\
- \bigskip
+Disambiguation algorithm for TRE with $N$ subexpressions is defined as comparison of T-strings $x$ and $y$:
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKw{Let}{let} \SetKw{Und}{undefined} \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{\prec_{POSIX}(x, y)} \smallskip$} {
+ \For {$t \Xeq \overline{1, N}$} {
+ $A_1 \dots A_n \Xset history(x, 2t \!-\! 1)$ \;
+ $C_1 \dots C_n \Xset history(x, 2t)$ \;
+ $B_1 \dots B_n \Xset history(y, 2t \!-\! 1)$ \;
+ $D_1 \dots D_n \Xset history(y, 2t)$ \;
+ \For {$i \Xeq \overline{1, n}$} {
+ \Let $a_1 \dots a_m \Xeq A_i$, $b_1 \dots b_k \Xeq B_i$ \;
+ \Let $c_1 \dots c_{\widetilde{m}} \Xeq C_i$, $d_1 \dots d_{\widetilde{k}} \Xeq D_i$ \;
+ \lIf {$\widetilde{m} \!<\! m$} {$c_m \Xset \infty$}
+ \lIf {$\widetilde{k} \!<\! k$} {$d_k \Xset \infty$}
+ \For {$j \Xeq \overline{1, min(m, k)}$} {
+ \lIf {$a_j \!\neq\! b_j$} {\Return $a_j \!<\! b_j$}
+ \lIf {$c_j \!\neq\! d_j$} {\Return $c_j \!>\! d_j$}
+ }
+ \lIf {$m \!\neq\! k$} {\Return $m \!<\! k$}
+ }
+ }
+ \Return $false$ \;
+ }
+ \end{algorithm}
-Because of the hierarchical structure of POSIX disambiguation, if comparison reaches $i$-th subexpression,
-it means that all subexpressions with lower indices have already been considered and their histories coincide.
-Consequently for start and end tags of $i$-th subexpression
-the number of subhistories in the compared T-strings must be equal.
-If comparison is between prefixes, last subhistory of start tag may contain one more offset than last suhistory of end tag:
-in this case we assume that the missing offset if $\infty$ (as it must be greater than any offset in this history).
-Given all that, POSIX disambiguation for TRE with $N$ subexpressions is defined as follows:
-\\
-%\begin{minipage}{\linewidth}
- $\prec_{POSIX}(x, y)$
- \hrule
- \begin{enumerate}[leftmargin=0in]
- \smallskip
- \item[] for $t \Xeq \overline{1, N}$:
- \begin{enumerate}
- \item[] $A_1 \dots A_n \Xset history(x, 2t \!-\! 1)$
- \item[] $C_1 \dots C_n \Xset history(x, 2t)$
- \item[] $B_1 \dots B_n \Xset history(y, 2t \!-\! 1)$
- \item[] $D_1 \dots D_n \Xset history(y, 2t)$
- \item[] for $i \Xeq \overline{1, n}$:
- \begin{enumerate}
- \item[] let $A_i \Xeq a_1 \dots a_m$, $C_i \Xeq c_1 \dots c_{\widetilde{m}}$
- \item[] let $B_i \Xeq b_1 \dots b_k$, $D_i \Xeq d_1 \dots d_{\widetilde{k}}$
- \item[] if $\widetilde{m} \!<\! m: c_m \Xset \infty$
- \item[] if $\widetilde{k} \!<\! k: d_k \Xset \infty$
- \item[] for $j \Xeq \overline{1, min(m, k)}$:
- \begin{enumerate}
- \item[] if $a_j \!\neq\! b_j$ return $a_j \!<\! b_j$
- \item[] if $c_j \!\neq\! d_j$ return $c_j \!>\! d_j$
- \end{enumerate}
- \item[] if $m \!\neq\! k$ return $m \!<\! k$
- \end{enumerate}
- \end{enumerate}
- \item[] return $false$
- \\
- \end{enumerate}
-%\end{minipage}
- \bigskip
It's not hard to show that $\prec_{POSIX}$ is prefix-based.
Consider the $t$-th iteration of the algorithm and let $s \Xeq 2t \!-\! 1$ be the start tag.
Let $d_1 \dots d_{k+m} \Xeq b_1 \dots b_k c_1 \dots c_m$.
None of $d_j$ can be $\varnothing$, because $n$-th subhistory contains multiple offsets.
Therefore $d_j$ are non-decreasing and $d_j \!\leq\! c_j$ for all $j \Xeq \overline{1, m}$.
-Then either $d_j \!<\! c_j$ at some index $j \!\leq\! m$, or $A'_n$ is shorter than $B'_n$; in both cases comparison result is unchanged.
+Then either $d_j \!<\! c_j$ at some index $j \!\leq\! m$, or $A'_n$ is shorter than $B'_n$; in both cases the comparison result is unchanged.
The same reasoning holds for the end tag.
\\
This algorithm makes many redundant checks:
for adjacent tags the position of the second tag is fixed on the position of the first tag.
In particular, comparison of the start tags $a_j$ and $b_j$ is almost always redundant.
-If $j \!>\! 1$, then $a_j$ and $b_j$ are fixed on $c_{j-1}$ and $d_{j-1}$, which have been compared on the previous iteration.
+Namely, if $j \!>\! 1$, then $a_j$ and $b_j$ are fixed on $c_{j-1}$ and $d_{j-1}$, which have been compared on the previous iteration.
If $j \Xeq 1$, then $a_j$ and $b_j$ are fixed on some higher-priority tag which has already been checked, unless $t \Xeq 1$.
The only case when this comparison makes any difference is when $j \Xeq 1$ and $t \Xeq 1$:
the very first position of the whole match.
The simplified algorithm looks like this:
\\
-%\begin{minipage}{\linewidth}
- $\prec_{POSIX}(x, y)$
- \hrule
- \begin{enumerate}[leftmargin=0in]
- \smallskip
- \item[] for $t \Xeq \overline{1, N}$:
- \begin{enumerate}
- \item[] $A_1 \dots A_n \Xset history(x, 2t)$
- \item[] $B_1 \dots B_n \Xset history(y, 2t)$
- \item[] for $i \Xeq \overline{1, n}$:
- \begin{enumerate}
- \item[] if $A_i \!\neq\! B_i: return \prec_{subhistory} (A_i, B_i)$
- \end{enumerate}
- \end{enumerate}
- \item[] return $false$
- \\
- \end{enumerate}
-%\end{minipage}
-
- \bigskip
-
-%\begin{minipage}{\linewidth}
- $\prec_{subhistory} (a_1 \dots a_n, b_1 \dots b_m)$
- \hrule
- \begin{enumerate}[leftmargin=0in]
- \smallskip
- \item[] for $i \Xeq \overline{1, min(n, m)}$:
- \begin{enumerate}
- \item[] if $a_i \!\neq\! b_i$ return $a_i \!>\! b_i$
- \end{enumerate}
- \item[] return $n \!<\! m$.
- \\
- \end{enumerate}
-%\end{minipage}
-
- \bigskip
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{\prec_{POSIX}(x, y)} \smallskip$} {
+ \For {$t \Xeq \overline{1, N}$} {
+ $A_1 \dots A_n \Xset history(x, 2t)$ \;
+ $B_1 \dots B_n \Xset history(y, 2t)$ \;
+ \For {$i \Xeq \overline{1, n}$} {
+ \lIf {$A_i \!\neq\! B_i$} {\Return $A_i \prec_{subhistory} B_i$}
+ }
+ }
+ \Return $false$ \;
+ }
+ \end{algorithm}
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{\prec_{subhistory} (a_1 \dots a_n, b_1 \dots b_m)} \smallskip$} {
+ \For {$i \Xeq \overline{1, min(n, m)}$} {
+ \lIf {$a_i \!\neq\! b_i$} {\Return $a_i \!>\! b_i$}
+ }
+ \Return $n \!<\! m$ \;
+ }
+ \end{algorithm}
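The simplified comparison can be sketched in Python as follows. The sketch reuses the $history$ representation assumed earlier (tags as integers, negative tags negated, $\varnothing$ as \texttt{None}); treating $\varnothing$ as smaller than any offset is an assumption consistent with $\varnothing$ meaning ``no match''.

```python
# Hypothetical sketch of the simplified POSIX comparison:
# only end tags 2t are inspected.

def hightags(t):
    bound = 2 * ((t + 1) // 2) - 1
    return {u for v in range(1, bound) for u in (v, -v)}

def history(tstr, t):
    high = hightags(t)
    subs, cur, pos, i, n = [], [], 0, 0, len(tstr)
    while True:
        while i < n and tstr[i] not in (t, -t):
            if isinstance(tstr[i], str):
                pos += 1
            i += 1
        while i < n and tstr[i] not in high:
            if isinstance(tstr[i], str):
                pos += 1
            elif tstr[i] == t:
                cur.append(pos)
            elif tstr[i] == -t:
                cur.append(None)
            i += 1
        subs.append(cur)
        cur = []
        if i >= n:
            break
    if len(subs) > 1 and subs[-1] == []:
        subs.pop()
    return subs

def prec_subhistory(A, B):
    """True iff subhistory A is preferred over B: greater offsets
    win, and None (no match) loses to any offset."""
    val = lambda o: -1 if o is None else o
    for a, b in zip(A, B):
        if a != b:
            return val(a) > val(b)
    return len(A) < len(B)

def prec_posix(x, y, nsubexp):
    for t in range(1, nsubexp + 1):
        for A, B in zip(history(x, 2 * t), history(y, 2 * t)):
            if A != B:
                return prec_subhistory(A, B)
    return False
```

For instance, for POSIX RE \texttt{(a)|(a)} with TRE $1 (3 \, a \, 4 | 5 \, a \, 6) 2$ and S-string \texttt{a}, the T-string that takes the first alternative is preferred over the one that takes the second.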
Next, we explore the structure of ambiguous paths that contain multiple subhistories
and show that (under certain conditions) such paths can be split into ambiguous subpaths,
one for each subhistory.
-\\
\begin{XLem}\label{lemma_path_decomposition}
-For a POSIX TRE $e$,
-if $a$, $b$ are ambiguous paths in TNFA $\XN(e)$, such that $\XT(a) \Xeq x$, $\XT(b) \Xeq y$,
-$t$ is a tag such that $history(x, u) \Xeq history(y, u)$ for all $u \!<\! t$
-and $history(x, t) \Xeq A_1 \dots A_n$, $history(y, t) \Xeq B_1 \dots B_n$,
-then $a$, $b$ can be decomposed into path segments $a_1 \dots a_n$, $b_1 \dots b_n$,
-such that for all $i \!\leq\! n$ paths $a_1 \dots a_i$, $b_1 \dots b_i$ are ambiguous
-and $history(\XT(a_1 \dots a_i), t) \Xeq A_1 \dots A_i$, $history(\XT(b_1 \dots b_i), t) \Xeq B_1 \dots B_i$.
-
-\smallskip
-
-Proof is by induction on $t$ and relies on the construction of TNFA given in Theorem \ref{theorem_tnfa}.
+Let $e$ be a POSIX TRE and suppose that the following conditions are satisfied:
+\begin{enumerate}
+ \item $a$, $b$ are ambiguous paths in TNFA $\XN(e)$ that induce T-strings $x \Xeq \XT(a)$, $y \Xeq \XT(b)$
+ \item $t$ is a tag such that $history(x, t) \Xeq A_1 \dots A_n$, $history(y, t) \Xeq B_1 \dots B_n$
+ \item for all $u \!<\! t$: $history(x, u) \Xeq history(y, u)$
+ (tags with higher priority agree)
+\end{enumerate}
+Then $a$ and $b$ can be decomposed into path segments $a_1 \dots a_n$, $b_1 \dots b_n$,
+such that for all $i \!\leq\! n$ subpaths $a_i$, $b_i$ have common start and end states and
+contain subhistories $A_i$, $B_i$ respectively:
+$history(\XT(a_1 \dots a_i), t)$ $\Xeq$ $A_1 \dots A_i$,
+$history(\XT(b_1 \dots b_i), t)$ $\Xeq$ $B_1 \dots B_i$.
+\\[0.5em]
+\textbf{Proof.}
+Proof is by induction on $t$ and relies on the construction of TNFA given in section \ref{section_tnfa}.
Induction basis is $t \Xeq 1$ and $t \Xeq 2$ (start and end tags of the topmost subexpression): let $n \Xeq 1$, $a_1 \Xeq a$, $b_1 \Xeq b$.
Induction step: suppose that the lemma is true for all $u \!<\! t$,
and that for $t$ the conditions of the lemma are satisfied.
Apply the induction hypothesis to the higher-priority tag $r$ of the immediately enclosing subexpression,
and let $c_1 \dots c_m$, $d_1 \dots d_m$ be the corresponding path decompositions.
Each subhistory of $t$ is covered by some subhistory of $r$ (by definition $history$ doesn't break at lower-priority tags),
therefore decompositions $a_1 \dots a_n$, $b_1 \dots b_n$ can be constructed as a refinement of $c_1 \dots c_m$, $d_1 \dots d_m$.
-If $r$ is a non-orbit tag, each subhistory of $r$ contains exactly one subhistory of $t$
+If $r$ is a non-orbit tag, each subhistory of $r$ covers exactly one subhistory of $t$
and the refinement is trivial: $n \Xeq m$, $a_i \Xeq c_i$, $b_i \Xeq d_i$.
Otherwise, $r$ is an orbit tag and single subhistory of $r$ may contain multiple subhistories of $t$.
Consider path segments $c_i$ and $d_i$:
since they have common start and end states, and since they cannot contain tagged transitions with higher-priority tags,
-both must be contained in the same subautomaton of the form $F^{k,l}$ (possibly $l \Xeq \infty$).
+both must be contained in the same subautomaton of the form $F^{k,l}$.
This subautomaton itself consists of one or more subautomata for $F$ each starting with an $r$-tagged transition;
let the start state of each subautomaton be a breaking point in the refinement of $c_i$ and $d_i$.
-By construction of TNFA the number of iterations through $F^{k,l}$ uniquely determins the order of subautomata traversal
-(automaton corresponding to $(j \!+\! 1)$-th iteration is only reachable from the one corresponding to $j$-th iteration).
+By observation \ref{obs_tnfa_repeat} the number of iterations through $F^{k,l}$ uniquely determines the order of subautomata traversal.
Since $history(x, r) \Xeq history(y, r)$, the number of iterations is equal and
therefore breaking points coincide.
$\square$
\end{XLem}
Lemma \ref{lemma_path_decomposition} has the following implication.
-Suppose that at $p$-th step of TNFA simulation we are comparing histories $A_1 \dots A_n$, $B_1 \dots B_n$.
-Let $j \!\leq\! n$ be the greatest index such that $A_1 \dots A_j$, $B_1 \dots B_j$ end before $p$-th position in the input string
-(it is the same index for both histories because higher-priority tags coincide).
+Suppose that during simulation we prune ambiguous paths immediately as they transition to the same state,
+and suppose that
+at $p$-th step of simulation we are comparing histories $A_1 \dots A_n$, $B_1 \dots B_n$ of some tag.
+Let $j \!\leq\! n$ be the greatest index such that all offsets in $A_1 \dots A_j$, $B_1 \dots B_j$ are less than $p$
+(it must be the same index for both histories because higher-priority tags coincide).
Then $A_i \Xeq B_i$ for all $i \!\leq\! j$:
-by Lemma \ref{lemma_path_decomposition} $A_1 \dots A_j$, $B_1 \dots B_j$
-correspond to ambiguous subpaths which must have been compared on some previous step of the algorithm.
+by lemma \ref{lemma_path_decomposition} $A_1 \dots A_j$, $B_1 \dots B_j$
+correspond to subpaths whose start and end states coincide;
+these subpaths are either equal, or ambiguous, in which case they must have been compared on some previous step of the algorithm.
This means that we only need to compare $A_{j+1} \dots A_n$ and $B_{j+1} \dots B_n$.
-Of them only $A_{j+1}$ and $B_{j+1}$ may start before $p$-th position in the input string:
-all other subhistories belong to current $\epsilon$-closure (though $A_n$, $B_n$ may not end in it).
-Therefore between steps of simulation we need to remeber only the last (possibly incomplete) subhistory of each tag.
+Of them only $A_{j+1}$ and $B_{j+1}$ may have offsets less than $p$:
+all other subhistories belong to current $\epsilon$-closure;
+the last pair of subhistories $A_n$, $B_n$ may be incomplete.
+Therefore we only need to remember $A_{j+1}$, $B_{j+1}$ from the previous simulation step,
+and we only need to pass $A_n$, $B_n$ to the next step.
+In other words, between simulation steps we need only the last subhistory for each tag.
\\
Now we can define ``ambiguity shape'' of TDFA state:
-we define it as a set of orders, one for each tag, on the last subhistories of this tag in this state.
-Of course, comparison only makes sense for subhistories that correspond to ambiguous paths and prefixes of ambiguous paths,
-and in general we do not know which prefixes will cause ambiguity on subsequent steps.
-Therefore some comparisons may be meaningless, incorrect and unjustified.
-However, they do not affect valid comparisons,
-and they do not cause disambiguation errors: their results are never used.
-At worst they can prevent state merging.
+we define it as a set of orders, one per tag, on the last subhistories of this tag in this state.
As with leftmost greedy policy, the number of different orders is finite
and therefore determinization terminates.
-\\
-
-Consider subhistories for which comparison is valid.
-Comparison algorithm differs for orbit and non-orbit tags.
-Orbit subhistories can be compared incrementally:
-if at some step we determine that $A \prec B$, then it will be true ever after, no matter what values are added to $A$ and $B$.
-(This can be proven by induction on the number of steps
-and using the fact that $\varnothing$ can be added only at the first step,
-by TNFA construction for the case of zero iterations).
-Therefore we can use the results of comparison on the previous step and compare only the added parts of subhistories.
-For non-orbit tags incremental comparison doesn't work:
-subhistories consist of a single value, but different paths may discover it at different steps,
-and the later value may turn to be either $\varnothing$ or not.
-However, a single value cannot be spread across multiple steps:
-it either belongs to current $\epsilon$-closure or to some earlier step.
-If both values belong to earlier steps, we can use results of the previous comparison;
-it they both belong to $\epsilon$-closure, they are equal;
-otherwise the one from the $\epsilon$-closure is better.
-\\
-
-Orders on configurations are represented with vectors of ordinal numbers (one per tag) assigned to each configuration.
-Ordinals are initialized to zero and updated on each step of simulation by comparing tag histories.
-Histories are compared using ordinals calculated on the previous step and T-string fragments added by $\epsilon$-closure.
-Ordinals are assigned in decreasing order, so that they can be compared in the same way as offsets: greater means better.
-\\
+In fact, comparison only makes sense for subhistories that correspond to ambiguous paths (or path prefixes),
+and only in case when higher-priority tags agree.
+We do not know in advance which prefixes will cause ambiguity on subsequent steps,
+therefore some comparisons may be meaningless:
+we impose total order on a set which is only partially ordered.
+However, meaningless comparisons do not affect valid comparisons, and they do not cause disambiguation errors: their results are never used.
+At worst they can prevent state merging.
+Kuklewicz suggests grouping orbit subhistories by their \emph{base offset}
+(position of start tag on the first iteration) prior to comparison.
+However, experiments with such grouping revealed no effect on state merging,
+and for simplicity we abandon the idea of partial ordering.
+
+\begin{Xdef}
+Subhistories of the given tag are \emph{comparable} if they correspond to prefixes of ambiguous paths
+and all higher-priority tags agree.
+\end{Xdef}
+
+
+\begin{XLem}\label{lemma_orbit_subhistories}
+Comparable orbit subhistories can be compared incrementally with $\prec_{subhistory}$.
+\\[0.5em]
+\textbf{Proof.}
+Consider subhistories $A$, $B$ at some step of simulation and let $A \prec_{subhistory} B$.
+We will show that the comparison result will not change on subsequent steps, when new offsets are added to $A$ and $B$.
+First, note that $\varnothing$ can be added only on the first step of comparison:
+negative orbit tags correspond to the case of zero iterations,
+and by TNFA construction for $F^{0,m}$ they are reachable by $\epsilon$-transitions from the initial state,
+but not from any other state of this subautomaton.
+Second, note that non-$\varnothing$ offsets increase with each step.
+Based on these two facts and the definition of $\prec_{subhistory}$, the proof is trivial by induction on the number of steps.
+$\square$
+\end{XLem}
-%\begin{minipage}{\linewidth}
- $ordinals (\{(q_i, o_i, x_i)\}_{i=1}^n)$
- \hrule
- \begin{enumerate}[leftmargin=0in]
- \smallskip
- \item[] for $t \Xeq \overline{1, N}$:
- \begin{enumerate}
- \item[] for $i \Xeq \overline{1, n}$:
- \begin{enumerate}
- \item[] $A_1 \dots A_m \Xset \epsilon \Xund history(x_i, t)$
- \item[] $B_i \Xset A_m$
- \item[] if $m \Xeq 1$ and ($t$ is an orbit tag or $B_i \Xeq \epsilon$)
- \begin{enumerate}
- \item[] $B_i \Xset o_{i t} B_i$
- \end{enumerate}
- \end{enumerate}
- \smallskip
- \item[] $\{(p_i, C_i)\} \Xset $ sort $\{(i, B_i)\}$ by second component \\
- \hphantom{\quad} using inverted $\prec_{subhistory}$ as comparator
- \item[] $\widetilde{o}_{p_1 t} \Xset 0, \; j \Xset 0$
- \item[] for $i \Xeq \overline{2, n}$:
- \begin{enumerate}
- \item[] if $C_{i-1} \!\neq\! C_i: j \Xset j \!+\! 1$
- \item[] $\widetilde{o}_{p_i t} \Xset j$
- \end{enumerate}
- \end{enumerate}
- \smallskip
- \item[] return $\{(q_i, \widetilde{o}_i, x_i)\}_{i=1}^n$
- \\
- \end{enumerate}
-%\end{minipage}
- \bigskip
+\begin{XLem}\label{lemma_nonorbit_end_subhistories}
+Comparable non-orbit subhistories can be compared incrementally with $\prec_{subhistory}$
+in case of end tags, but not in case of start tags.
+\\[0.5em]
+\textbf{Proof.}
+Non-orbit subhistories consist of a single offset (either $\varnothing$ or not),
+and ambiguous paths may discover it at different steps.
+Incremental comparison with $\prec_{subhistory}$ is correct in all cases except one:
+when $\varnothing$ is discovered at a later step than non-$\varnothing$.
+\\[0.5em]
+For start tags it is sufficient to show an example of such case.
+Consider TRE $1 (3 \, a \, 4 | 5 \, a \, 6) 2$ that corresponds to POSIX RE \texttt{(a)|(a)}
+and denotes ambiguous T-strings $x \Xeq 1 \, 3 \, a \, 4 \, \Xbar{5} \, \Xbar{6} \, 2$
+and $y \Xeq 1 \, 5 \, a \, 6 \, \Xbar{3} \, \Xbar{4} \, 2$.
+Subhistory of start tag $3$ in $y$ changes from $\epsilon$ on the first step (before consuming \texttt{a})
+to $\varnothing$ on the second step (after consuming \texttt{a}),
+while subhistory in $x$ remains $0$ on both steps.
+\\[0.5em]
+For end tags we will show that the faulty case is not possible:
+comparable subhistories must add $\varnothing$ at the same step as non-$\varnothing$.
+Consider non-orbit end tag $t$.
+Non-$\varnothing$ and $\varnothing$ must stem from different alternatives of a union subexpression $e_1 | e_2$,
+where $e_1$ contains $t$ and $e_2$ does not.
+Since subhistories of $t$ are comparable, $e_1$ cannot contain higher-priority tags:
+such tags would be negated in $e_2$ and comparison would stop before $t$.
+Consequently, $e_1$ itself must be the subexpression that ends with $t$.
+By construction of TNFA for $e_1 | e_2$
+all paths through it contain a single $t$-tagged transition at the very end (either positive or negative).
+Therefore both $\varnothing$ and non-$\varnothing$ must be discovered at the same step when ambiguous paths join.
+%subautomata for $e_1$ and $e_2$ contain a single $t$-tagged transition at the very end (positive and negative respectively).
+$\square$
+\end{XLem}
-The $history$ algorithm is modified so that it works on T-string fragments added by the $\epsilon$-closure.
-Non-$\varnothing$ offsets are set to $\infty$, since all tags in the $\epsilon$-closure have the same position
-which must be greater than any ordinal calculated on the previous step.
+This asymmetry between start and end tags is caused by inserting negative tags
+at the \emph{end} of alternative branches;
+if we inserted them at the \emph{beginning},
+then non-orbit tags would also have the property that $\varnothing$ belongs to the first step of comparison.
+Inserting negative tags at the end has another advantage: it effectively delays the associated operations,
+which should result in more efficient programs.
+Since our disambiguation algorithm ignores start tags,
+we can use the same comparison algorithm for all subhistories.
+Alternatively one can compare non-orbit tags using simple maximization/minimization strategy:
+if both last offsets of the given tag belong to the $\epsilon$-closure, they are equal;
+if only one of them belongs to the $\epsilon$-closure, it must be greater than the other one;
+otherwise the result of comparison on the previous step should be used.
\\
-%\begin{minipage}{\linewidth}
- $\epsilon \Xund history (a_1 \dots a_n, t)$
- \hrule
- \begin{enumerate}[leftmargin=0in]
- \smallskip
- \item[] $U \Xset \{ u \mid u < 2 \lceil t / 2 \rceil \!-\! 1 \}$
- \item[] $i \Xset 1, \; j \Xset 1$
-
- \item[] while $true$
- \begin{enumerate}
- \item[] while $i \leq n$ and $a_i \!\not\in\! (U \cup \Xbar{U})$
- \begin{enumerate}
- \item[] if $a_i \Xeq t: A_j \Xset A_j \infty$
- \item[] if $a_i \Xeq \Xbar{t}: A_j \Xset A_j \varnothing$
- \item[] $i \Xset i \!+\! 1$
- \end{enumerate}
-
- \item[] if $i \!>\! n$ break
- \item[] $j \Xset j \!+\! 1$
-
- \item[] while $i \leq n$ and $a_i \!\not\in\! \{t, \Xbar{t}\}$
- \begin{enumerate}
- \item[] $i \Xset i \!+\! 1$
- \end{enumerate}
-
- \end{enumerate}
- \item[] return $A_1 \dots A_j$
- \end{enumerate}
-%\end{minipage}
-
- \bigskip
-
-Finally, disambiguation algorithm is redefined in terms of ordinals and added T-string fragments:
+Orders are represented with vectors of ordinal numbers (one per tag) assigned to each configuration.
+Ordinals are initialized to zero and updated on each step of simulation by comparing last subhistories.
+Subhistories are compared using ordinals from the previous step and T-string fragments added by the $\epsilon$-closure.
+Ordinals are assigned in decreasing order, so that they can be compared like offsets:
+greater values have higher priority.
+The $history$ algorithm is modified to handle T-string fragments added by the $\epsilon$-closure:
+non-$\varnothing$ offsets are set to $\infty$, as all tags in the $\epsilon$-closure have the same offset
+which is greater than any ordinal calculated on the previous step.
+Disambiguation is defined as comparison of pairs $(ox, x)$ and $(oy, y)$,
+where $ox$, $oy$ are ordinals and $x$, $y$ are the added T-string fragments:
\\
-%\begin{minipage}{\linewidth}
- $\prec_{POSIX}((ox, x), (oy, y))$
- \hrule
- \begin{enumerate}[leftmargin=0in]
- \smallskip
- \item[] for $t \Xeq \overline{1, N}$:
- \begin{enumerate}
- \item[] $A_1 \dots A_n \Xset \epsilon \Xund history(x, 2t), \; a \Xset ox_{2t}$
- \item[] $B_1 \dots B_n \Xset \epsilon \Xund history(y, 2t), \; b \Xset oy_{2t}$
-
- \item[] if $2t$ is an orbit tag:
- \begin{enumerate}
- \item[] $A_1 \Xset a A_1$
- \item[] $B_1 \Xset b B_1$
- \end{enumerate}
- \item[] else
- \begin{enumerate}
- \item[] if $A_1 \Xeq \epsilon: A_1 \Xset a$
- \item[] if $B_1 \Xeq \epsilon: B_1 \Xset b$
- \end{enumerate}
-
- \item[] for $i \Xeq \overline{1, n}$:
- \begin{enumerate}
- \item[] if $A_i \!\neq\! B_i: return \prec_{subhistory} (A_i, B_i)$
- \end{enumerate}
-
- \end{enumerate}
- \item[] return $false$
- \\
- \end{enumerate}
-%\end{minipage}
-
- \bigskip
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{\prec_{POSIX}((ox, x), (oy, y))} \smallskip$} {
+ \For {$t \Xeq \overline{1, N}$} {
+ $A_1 \dots A_n \Xset \epsilon \Xund history(x, 2t), \; a \Xset ox(2t)$ \;
+ $B_1 \dots B_n \Xset \epsilon \Xund history(y, 2t), \; b \Xset oy(2t)$ \;
+ $A_1 \Xset a A_1$ \;
+ $B_1 \Xset b B_1$ \;
+
+% $A_1 \dots A_n \Xset \epsilon \Xund history(x, 2t), \; a \Xset ox_{2t}$ \;
+% $B_1 \dots B_n \Xset \epsilon \Xund history(y, 2t), \; b \Xset oy_{2t}$ \;
+% \If {$orbit(2t)$} {
+% $A_1 \Xset a A_1$ \;
+% $B_1 \Xset b B_1$ \;
+% } \Else {
+% \lIf {$A_1 \Xeq \epsilon$} {$A_1 \Xset a$}
+% \lIf {$B_1 \Xeq \epsilon$} {$B_1 \Xset b$}
+% }
+
+ \For {$i \Xeq \overline{1, n}$} {
+ \lIf {$A_i \!\neq\! B_i$} {\Return $A_i \prec_{subhistory} B_i$}
+ }
+ }
+ \Return $false$ \;
+ }
+ \end{algorithm}
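To make the comparison concrete, here is a minimal Python sketch of the per-tag part of $\prec_{POSIX}$. It is illustrative only: ordinals and offsets are plain integers, subhistories are lists, and plain list comparison stands in for the paper's $\prec_{subhistory}$ order.

```python
def posix_less_for_tag(ox, xs, oy, ys):
    """Compare two subhistory sequences for one tag, as in the per-tag
    loop of the POSIX comparison: prepend the ordinals from the previous
    step to the first subhistories, then the first differing subhistory
    decides.  Plain list comparison is a stand-in for the subhistory
    order used in the paper."""
    A = [[ox] + xs[0]] + xs[1:]
    B = [[oy] + ys[0]] + ys[1:]
    for a, b in zip(A, B):
        if a != b:
            return a < b  # stand-in for the subhistory comparison
    return False
```

The full $\prec_{POSIX}$ simply runs this comparison over all tags and returns on the first tag where the sequences differ.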
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKw{Let}{let} \SetKw{Und}{undefined} \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{ordinals (\{(q_i, o_i, x_i)\}_{i=1}^n)} \smallskip$} {
+ \For {$t \Xeq \overline{1, N}$} {
+ \For {$i \Xeq \overline{1, n}$} {
+ $A_1 \dots A_m \Xset \epsilon \Xund history(x_i, t)$ \;
+ $B_i \Xset A_m$ \;
+ \lIf {$m \Xeq 1$} {$B_i \Xset o_i(t) B_i$}
+% \If {$m \Xeq 1$ and ($orbit(t)$ or $B_i \Xeq \epsilon$)} {
+% $B_i \Xset o_{i t} B_i$ \;
+% }
+ }
+
+ \BlankLine
+ $\{(p_i, C_i)\} \Xset $ sort $\{(i, B_i)\}$ by second component using inverted $\prec_{subhistory}$ \;
+ \Let $o_{p_1}(t) \Xeq 0$, $ord \Xset 0$ \;
+ \For {$i \Xeq \overline{2, n}$} {
+ \lIf {$C_{i-1} \!\neq\! C_i$} {$ord \Xset ord \!+\! 1$}
+ \Let $o_{p_i}(t) \Xeq ord$ \;
+ }
+ }
+ \Return $\{(q_i, o_i, x_i)\}_{i=1}^n$ \;
+ }
+ \end{algorithm}
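The core of the $ordinals$ algorithm is a dense ranking: after sorting configurations by their last subhistories (the paper sorts with the inverted $\prec_{subhistory}$), equal subhistories receive equal ordinals and the ordinal grows only when the subhistory changes. A small Python sketch of that ranking step, with plain integer keys standing in for subhistories:

```python
def dense_rank(keys):
    """Assign ordinals as in the inner loop of the ordinals algorithm:
    sort the items, give the first one ordinal 0, and increment the
    ordinal only when the key changes, so that equal keys share an
    ordinal.  Plain integer comparison stands in for the inverted
    subhistory order used in the paper."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    ordinals = [0] * len(keys)
    rank = 0
    for k in range(1, len(order)):
        if keys[order[k]] != keys[order[k - 1]]:
            rank += 1
        ordinals[order[k]] = rank
    return ordinals
```

Because equal subhistories map to equal ordinals, configurations that are indistinguishable for disambiguation purposes stay indistinguishable on the next step, which is what makes states mappable.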
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{\epsilon \Xund history (a_1 \dots a_n, t)} \smallskip$} {
+ $i \Xset 1, \; j \Xset 1$ \;
+ \While {$true$} {
+ \While {$i \leq n$ and $a_i \!\not\in\! hightags(t)$} {
+ \lIf {$a_i \Xeq t$} {$A_j \Xset A_j \infty$}
+ \lIf {$a_i \Xeq \Xbar{t}$} {$A_j \Xset A_j \varnothing$}
+ $i \Xset i \!+\! 1$ \;
+ }
+ \lIf {$i \!>\! n$} {break}
+ $j \Xset j \!+\! 1$ \;
+ \While {$i \leq n$ and $a_i \!\not\in\! \{t, \Xbar{t}\}$} {
+ $i \Xset i \!+\! 1$ \;
+ }
+ }
+ \Return $A_1 \dots A_j$ \;
+ }
+ \end{algorithm}
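A direct Python transcription of $\epsilon \Xund history$ may help to see what it computes. In this sketch tags are integers: $t$ is encoded as $t$, $\Xbar{t}$ as $-t$, $\varnothing$ as None, and the delimiter set $hightags(t)$ is passed explicitly (these encodings are assumptions of the sketch, not the paper's representation).

```python
def eps_history(a, t, hightags):
    """Split the occurrences of tag t in a T-string into subhistories,
    using the symbols in hightags as delimiters.  Offsets are uniform
    inside the epsilon-closure, so t maps to infinity and the negated
    tag (encoded as -t) maps to None, standing for 'nomatch'."""
    INF = float("inf")
    subhist = [[]]
    i, n = 0, len(a)
    while True:
        while i < n and a[i] not in hightags:
            if a[i] == t:
                subhist[-1].append(INF)
            elif a[i] == -t:
                subhist[-1].append(None)
            i += 1
        if i >= n:
            break
        subhist.append([])  # a delimiter run starts a new subhistory
        while i < n and a[i] not in (t, -t):
            i += 1
    return subhist
```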
So far we have treated all subexpressions uniformly as if they were marked for submatch extraction.
In practice most of them are not: we can reduce the number of tags by dropping all tags in subexpressions without nested submatches.
\\
Laurikari used TDFA(0); we study both methods and argue that TDFA(1) is better.
-Determinization algorithm is designed in a way that can handle both types of automata in a uniform way.
+The determinization algorithm can handle both types of automata in a uniform way:
+it has a boolean parameter $\ell$ that enables the use of lookahead.
+The full algorithm is given in Figure \ref{fig_det}.
+\\
+
+\begin{figure*}\label{fig_det}
+\begin{multicols}{2}
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKw{Let}{let} \SetKw{Und}{undefined} \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{determinization(\XN \Xeq (\Sigma, T, Q, F, q_0, T, \Delta), \ell)} \smallskip$} {
+% \Indm
+
+ \tcc {initialization}
+ \Let $initord(t) \Xeq 0$ \;
+ \Let $initreg(t) \Xeq t$ \;
+ \Let $finreg(t) \Xeq t + |T|$ \;
+ \Let $maxreg \Xeq 2|T|$ \;
+ \Let $newreg \equiv $ \Und \;
+
+ \BlankLine
+ \tcc {initial closure and reg-init function}
+ \hangindent=1.5em\hangafter=1
+ $(Q_0, regops, maxreg, newreg) \Xset closure(\XN, \ell, $
+ $\{(q_0, initreg, initord, \epsilon)\}, maxreg, newreg)$ \;
+ \Let $\YQ \Xeq \{ Q_0 \}$, $\YF \Xeq \emptyset$ \;
+ \ForEach {$(r_1, r_2, h) \Xin regops$} {
+ \Let $\iota(r_1) \Xeq (r_2, h)$ \;
+ }
+
+ \BlankLine
+ \tcc {main loop}
+ \While {exists unmarked state $X \Xin \YQ$} {
+ mark $X$ \;
+
+ \BlankLine
+ \tcc {explore all outgoing transitions}
+ \Let $newreg \equiv $ \Und \;
+ \ForEach {symbol $\alpha \in \Sigma$} {
+ $Y \Xset reach'(\Delta, X, \alpha)$ \;
+ \hangindent=1.5em\hangafter=1
+ $(Z, regops, maxreg, newreg) \Xset closure(\XN, \ell, Y, maxreg, newreg)$ \;
+
+ \BlankLine
+ \tcc {try to find mappable state}
+ \If {exists $Z' \Xin \YQ$ for which $regops' \Xeq $ $map(Z', Z, T, regops) \!\neq\! $ \Und,} {
+ $(Z, regops) \Xset (Z', regops')$ \;
+ } \lElse {
+ add $Z$ to $\YQ$
+ }
+
+ \BlankLine
+ \tcc {transition and reg-update functions}
+ \Let $\delta(X, \alpha) \Xeq Z$ \;
+ \ForEach {$(r_1, r_2, h) \Xin regops$} {
+ \Let $\zeta(X, \alpha, r_1) \Xeq (r_2, h)$ \;
+ }
+ }
+
+ \BlankLine
+ \tcc {final state and reg-finalize function}
+% \If {$X$ contains configuration $(q, v, o, x)$ with final state $q \Xin F$} {
+ \If {exists $(q, v, o, x) \Xin X \mid q \Xin F$} {
+ add $X$ to $\YF$ \;
+ \ForEach {tag $t \Xin T$} {
+ \Let $\eta(X, finreg(t)) \Xeq (v(t), op(x, t))$ \;
+ }
+ }
+ }
+
+ \BlankLine
+ \Let $R \Xeq \{ 1, \dots, maxreg \}$ \;
+ \Return $(\Sigma, T, \YQ, \YF, Q_0, R, \delta, \zeta, \eta, \iota)$ \;
+ }
+ \end{algorithm}
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKw{Let}{let} \SetKw{Und}{undefined} \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{op(x, t)} \smallskip$} {
+% \lIf {$x \Xeq \epsilon$} {\Return $\epsilon$}
+% \Let $x \Xeq a y$
+% \Switch {$a$} {
+% \lCase {$\Xbar{t}$} {\Return $0 \cdot op(y, t)$}
+% \lCase {$t$} {\Return $1 \cdot op(y, t)$}
+% \lOther {\Return $op(y, t)$}
+% }
+ \Switch {$x$} {
+ \lCase {$\epsilon$} {\Return $\epsilon$}
+ \lCase {$\Xbar{t} y$} {\Return $0 \cdot op(y, t)$}
+ \lCase {$t y$} {\Return $1 \cdot op(y, t)$}
+ \lCase {$a y$} {\Return $op(y, t)$}
+ }
+ }
+ \end{algorithm}
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKw{Let}{let} \SetKw{Und}{undefined} \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{closure(\XN, lookahead, X, maxreg, newreg)} \smallskip$} {
+
+ \tcc {construct closure and update ordinals}
+ $Y \Xset \{(q, o, \epsilon) \mid (q, v, o, x) \Xin X \}$ \;
+ $Y \Xset closure' (Y, F, \Delta)$ \;
+ $Y \Xset ordinals (Y)$ \;
+ $Z \Xset \{(q, v, \widetilde{o}, x, y) \mid (q, v, o, x) \Xin X \wedge (q, \widetilde{o}, y) \Xin Y \}$ \;
+
+ \BlankLine
+ \tcc {if TDFA(0), apply lookahead operations}
+ \If {not $lookahead$} {
+ $Z \Xset \{(q, v, o, y, \epsilon) \mid (q, v, o, x, y) \Xin Z \}$ \;
+ }
+
+ \BlankLine
+        \tcc {find all distinct operation right-hand sides}
+ \Let $newops \Xeq \emptyset$ \;
+ \ForEach {configuration $(q, v, o, x, y) \Xin Z$} {
+ \ForEach {tag $t \Xin T$} {
+ $h \Xset op(x, t)$ \;
+ \lIf {$h \!\neq\! \epsilon$} {add $(t, v(t), h)$ to $newops$}
+ }
+ }
+
+ \BlankLine
+ \tcc {allocate registers for new operations}
+ \ForEach {$o \Xin newops$} {
+ \If {$newreg(o) \Xeq $ \Und} {
+ $maxreg \Xset maxreg + 1$ \;
+ \Let $newreg(o) \Xeq maxreg$ \;
+ }
+ }
+
+ \BlankLine
+ \tcc {update registers in closure}
+ \ForEach {configuration $(q, v, o, x, y) \Xin Z$} {
+ \ForEach {tag $t \Xin T$} {
+ $h \Xset op(x, t)$ \;
+ \lIf {$h \!\neq\! \epsilon$} {\Let $v(t) \Xeq newreg(t, v(t), h)$}
+ }
+ }
+
+ \BlankLine
+ $X \Xset \{(q, v, o, y) \mid (q, v, o, x, y) \Xin Z \}$ \;
+        $regops \Xset \{(newreg(o), r, h) \mid o \Xeq (t, r, h) \Xin newops\}$ \;
+ \Return $(X, regops, maxreg, newreg)$ \;
+ }
+ \end{algorithm}
+
+
+ \begin{algorithm}[H] \DontPrintSemicolon \SetKw{Let}{let} \SetKw{Und}{undefined} \SetKwProg{Fn}{}{}{} \SetAlgoInsideSkip{medskip}
+ \Fn {$\underline{map(X, Y, T, ops)} \smallskip$} {
+ \Let $xregs(t) \Xeq \{v(t) \mid (q, v, o, x) \Xin X \}$ \;
+ \Let $yregs(t) \Xeq \{v(t) \mid (q, v, o, x) \Xin Y \}$ \;
+
+ \BlankLine
+% \If {exists bijection $M$ between states $X$ and $Y$,
+% and for each tag $t \Xin T$ exists bijection $m(t)$ between $xregs(t)$ and $yregs(t)$, such that
+% corresponding configurations $(q, v, o, x)$, $(\widetilde{q}, \widetilde{v}, \widetilde{o}, \widetilde{x})$ in $M$
+% have equal states $q \Xeq \widetilde{q}$, equal ordinals $o \Xeq \widetilde{o}$,
+% equal lookahead operations $op(x, t) \Xeq op(\widetilde{x}, t)$ $\forall t \Xin T$
+% and their registers correspond $(v(t), \widetilde{v}(t)) \Xin m(t)$ $\forall t \Xin T$,
+% } {
+
+ \tcc {map one state to the other
+ so that the corresponding configurations have equal TNFA states, ordinals and lookahead operations,
+        and there is a bijection between registers}
+ \If {exists bijection $M: X \leftrightarrow Y$,
+        and $\forall t \Xin T$ exists bijection $m(t): xregs(t) \leftrightarrow yregs(t)$,
+ such that $\forall ((q, v, o, x), (\widetilde{q}, \widetilde{v}, \widetilde{o}, \widetilde{x})) \Xin M:$
+ $q \Xeq \widetilde{q}$ and $o \Xeq \widetilde{o}$
+ and $\forall t \Xin T$:
+ $op(x, t) \Xeq op(\widetilde{x}, t)$
+ and $(v(t), \widetilde{v}(t)) \Xin m(t)$,
+ } {
+
+% \If {exists bijection $M: X \leftrightarrow Y$,
+% and for each tag $t \Xin T$ exists bijection $m(t): xregs(t) \leftrightarrow yregs(t)$, such that
+% corresponding configurations $(q, v, o, x)$, $(\widetilde{q}, \widetilde{v}, \widetilde{o}, \widetilde{x})$ in $M$
+% have equal states $q \Xeq \widetilde{q}$, equal ordinals $o \Xeq \widetilde{o}$,
+% equal lookahead operations $op(x, t)$ $\Xeq$ $op(\widetilde{x}, t)$ $\forall t \Xin T$
+% and their registers correspond $(v(t), \widetilde{v}(t)) \Xin m(t)$ $\forall t \Xin T$,
+% } {
+
+ \Let $m \Xeq \bigcup_{t \in T} m(t)$ \;
+
+ \tcc {fix target register in existing operations}
+ $ops_1 \Xset \{ (a, c, h) \mid (a, b) \Xin m \wedge (b, c, h) \Xin ops \}$ \;
+
+ \tcc {add copy operations}
+ $ops_2 \Xset \{ (a, b, \epsilon) \mid (a, b) \Xin m \wedge a \!\neq\! b$
+        $\hphantom{\hspace{1.5em}} \wedge \nexists c, h: (b, c, h) \Xin ops \}$ \;
+
+ \Return $ops_1 \cup ops_2$ \;
+ } \lElse {
+ \Return \Und
+ }
+ }
+ \end{algorithm}
+
+\end{multicols}
+\begin{center}
+\caption{Determinization algorithm.}\label{fig_det}
+\footnotesize{
+Functions $reach'$ and $closure'$ are the same as
+$reach$ from section \ref{section_tnfa} and $closure \Xund goldberg \Xund radzik$ from section \ref{section_closure},
+except for trivial adjustments to carry around ordinals and pass them to the disambiguation procedure.
+}
+\end{center}
+\end{figure*}
+
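The auxiliary $op(x, t)$ function used by the determinization algorithm projects a T-string onto one tag, producing the sequence of operations on that tag's register: 1 for an occurrence of $t$ (store the current position) and 0 for $\Xbar{t}$ (store "nomatch"). A minimal Python sketch, encoding $\Xbar{t}$ as $-t$ (an assumption of the sketch):

```python
def op(x, t):
    """Project T-string x onto tag t: each occurrence of t becomes 1,
    each occurrence of its negated counterpart (encoded as -t) becomes 0;
    all other tags are skipped."""
    return [1 if a == t else 0 for a in x if a in (t, -t)]
```

Note that configurations with equal $op(x, t)$ for all tags are exactly those whose pending register operations coincide, which is why $op$ appears both in register allocation and in the mapping condition.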
States are sets of configurations $(q, v, o, x)$,
where $q$ is a core TNFA state, $v$ is a vector of registers that hold tag values, $o$ is the ordinal
and $x$ is the T-string of the $\epsilon$-path by which $q$ was reached.
in this case we simply substitute it with $r_2$ instead of copying.
Determinization algorithm can handle both POSIX and leftmost greedy policies,
but in the latter case it can be simplified to avoid explicit calculation of ordinals, as discussed in section \ref{section_disambiguation}.
-\\
-
-%Determinization algorithm is parameterized with a boolean flag that allows to switch between TDFA(0) and TDFA(1).
-
-% \begin{minipage}{\linewidth}
- $determinization(\XN \Xeq (\Sigma, T, Q, F, q_0, T, \Delta), \ell)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] $v(t) \Xeq t,
- \; f(t) \Xeq |T| \!+\! t,
- \; o(t) \Xeq 0$
- \item[] $maxreg \Xset 2|T|$, $newreg(o) \Xeq \bot$
- \item[] $(Q_0, \iota, maxreg, newreg) \\
- \hphantom{\hspace{2em}} \Xset closure(\XN, \ell, \{(q_0, v, o, \epsilon)\}, maxreg, newreg)$
- \item[] $\YQ \Xset \{ Q_0 \}, \; \YF \Xset \emptyset$
- \smallskip
- \item[] while $\exists$ unmarked $X \Xin \YQ$
- \begin{itemize}
- \item[] mark $X$
- \smallskip
- \item[] $newreg(o) \Xeq \bot$
- \item[] for all $\alpha \in \Sigma$:
- \begin{itemize}
- \item[] $Y \Xset reach'(\Delta, X, \alpha)$
- \item[] $(Z, regops, maxreg, newreg) \\
- \hphantom{\hspace{2em}} \Xset closure(\XN, \ell, Y, maxreg, newreg)$
- \item[] if $\exists Z' \Xin \YQ \mid \bot\!\neq\!ops' \Xset map(Z', Z, T, regops)$
- \begin{itemize}
- \item[] $(Z, regops) \Xset (Z', regops')$
- \end{itemize}
- \item[] $\YQ \Xset \YQ \cup \{ Z \}$
- \item[] $\delta \Xset \delta \cup \{(X, \alpha, Z)\}$
- \item[] $\zeta \Xset \zeta \cup \{(X, \alpha, r_1, r_2, a) \mid (r_1, r_2, a) \Xin regops \}$
- \end{itemize}
- \smallskip
- \item[] if $\exists (q, v, o, x) \Xin X \mid q \Xin F$:
- \begin{itemize}
- \item[] $\YF \Xset \YF \cup \{ X \}$
- \item[] $\eta \Xset \eta \cup \{(X, f_t, v_t, h_t(x)) \mid t \Xin T\}$
- \end{itemize}
- \end{itemize}
- \smallskip
- \item[] $R \Xset \{ 1, \dots, maxreg \}$
- \item[] return $(\Sigma, T, \YQ, \YF, Q_0, R, \delta, \zeta, \eta, \iota)$
- \\ \\
- \end{itemize}
-% \end{minipage}\\ \\
-
-% \begin{minipage}{\linewidth}
- $map(X, Y, T, ops)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] $V^x(t) \Xset \{v_t \mid (q, v, o, x) \Xin X \}$
- \item[] $V^y(t) \Xset \{v_t \mid (q, v, o, x) \Xin Y \}$
- \smallskip
- \item[] if $\exists$ bijection $M: X \leftrightarrow Y$, such that
- \begin{itemize}
- \item[] $\forall t \Xin T\ \exists$ bijection $m_t: V^x_t \leftrightarrow V^y_t$
- \item[] and $\forall ((q, v, o, x), (\widetilde{q}, \widetilde{v}, \widetilde{o}, \widetilde{x})) \Xin M: \\
- \hphantom{\hspace{2em}} q \Xeq \widetilde{q} \wedge o \Xeq \widetilde{o} \wedge \forall t \Xin T: \\
- \hphantom{\hspace{4em}} (v_t, \widetilde{v}_t) \Xin m_t \wedge h_t(x) \Xeq h_t(\widetilde{x})$
- \smallskip
- \item[] $m \Xset \bigcup_{t \in T} m_t$
- \item[] $ops_1 \Xset \{ (r_1, r_3, a) \mid (r_1, r_2) \Xin m \wedge (r_2, r_3, a) \Xin ops \}$
- \item[] $ops_2 \Xset \{ (r_1, r_2, \epsilon) \mid (r_1, r_2) \Xin m \wedge r_1 \!\neq\! r_2 \\
- \hphantom{hspace{2em}} \wedge \nexists r_3, a: (r_2, r_3, a) \Xin ops \}$
- \item[] return $ops_1 \cup ops_2$
- \end{itemize}
- \smallskip
- \item[] else return $\bot$
- \\ \\
- \end{itemize}
-% \end{minipage}\\ \\
-
-% \begin{minipage}{\linewidth}
- $closure(\XN \Xeq (\Sigma, T, Q, F, q_0, T, \Delta), \\
- \hphantom{\hspace{2em}} lookahead, X, maxreg, newreg)$
- \hrule
- \begin{itemize}[leftmargin=0in]
- \smallskip
- \item[] $Y \Xset \{(q, o, \epsilon) \mid (q, v, o, x) \Xin X \}$
- \item[] $Y \Xset closure' (Y, F, \Delta)$
- \item[] $Y \Xset ordinals (Y)$
- \item[] $Z \Xset \{(q, v, \widetilde{o}, x, y) \mid (q, v, o, x) \Xin X \wedge (q, \widetilde{o}, y) \Xin Y \}$
-
- \item[] if not $lookahead$:
- \begin{itemize}
- \item[] $Z \Xset \{(q, v, o, y, \epsilon) \mid (q, v, o, x, y) \Xin Z \}$
- \end{itemize}
-
- \smallskip
- \item[] $opsrhs \Xset \{ (t, v_t, h_t(x)) \mid \\
- \hphantom{\hspace{4em}} t \Xin T, (q, v, o, x, y) \Xin Z \wedge h_t(x)\!\neq\!\epsilon \}$
-
- \item[] for all $o \Xin opsrhs \mid newreg(o) \Xeq \bot:$
- \begin{itemize}
- \item[] $maxreg \Xset maxreg + 1$
- \item[] $newreg \Xset newreg \cup \{(o, maxreg)\}$
-% \item[] $newreg(\widetilde{o}) \Xeq
-% \begin{cases}
-% maxreg & \text{if } \widetilde{o} \Xeq o \\[-0.5em]
-% newreg(o) & \text{otherwise}
-% \end{cases}$
- \end{itemize}
-
- \item[] $X \Xset \{(q, \widetilde{v}, o, y) \mid (q, v, o, x, y) \Xin Z,\\
- \hphantom{\hspace{4em}} \widetilde{v}(t) \Xeq
- \begin{cases}
- newreg(t, v_t, h_t(x)) & \text{if } h_t(x)\!\neq\!\epsilon \\[-0.5em]
- v_t & \text{otherwise}
- \end{cases} \; \}$
-
- \item[] $newops \Xset \{(newreg(o), r, a) \mid o \Xeq (t, r, a) \Xin opsrhs\}$
-
- \item[] return $(X, newops, maxreg, newreg)$
- \\ \\
- \end{itemize}
-% \end{minipage}\\ \\
-
-
-Functions $reach'$ and $closure'$ are exactly as
-$reach$ and $closure \Xund goldberg \Xund radzik$ from section \ref{section_closure},
-except for the trivial adjustments to carry around ordinals and pass them into disambiguation procedure.
-We use $h_t(x)$ to denote $H(t)$, where $H$ is decomposition of T-string $x$ into tag value function (definition \ref{tagvalfun}).
-
\begin{XThe}
Determinization algorithm terminates.
-
-\smallskip
-
-Proof.
+\\[0.5em]
+\textbf{Proof.}
We will show that for arbitrary TNFA with $t$ tags and $n$ states the number of unmappable TDFA states is finite.
Each TDFA state with $m$ configurations (where $m \!\leq\! n$) is a combination of the following components:
a set of $m$ TNFA states,
needs not to check bijection between registers if the lookahead history is not empty:
in this case register values will be overwritten on the next step
(for non-simple tags registers would be augmented, not overwritten).
-Condition $(v_t, \widetilde{v}_t) \Xin m_t \wedge h_t(x) \Xeq h_t(\widetilde{x})$
-can be changed to $h_t(x) \Xeq h_t(\widetilde{x}) \wedge (h_t(x) \!\neq\! \epsilon \vee (v_t, \widetilde{v}_t) \Xin m_t)$,
-which results in better mapping.
+Condition $(v(t), \widetilde{v}(t)) \Xin m(t)$
+can be replaced with a weaker condition $op(x, t) \!\neq\! \epsilon \vee (v(t), \widetilde{v}(t)) \Xin m(t)$,
+which increases the probability of successful mapping.
This optimization applies only to TDFA(1), since lookahead history is always $\epsilon$ for TDFA(0),
-and it reduces the gap in the number of states between TDFA(0) and TDFA(1).
+so the optimization effectively reduces the gap in the number of states between TDFA(0) and TDFA(1).
\\
Second, operations on simple tags are reduced from normal form $r_1 \Xeq r_2 \cdot b_1 \dots b_n$
\section{Tests and benchmarks}\label{section_tests_and_benchmarks}
Correctness testing of RE2C was done in several different ways.
-First, RE2C has a test suite with over a thousand tests;
-most of them are hand-written snippets of code that used to trigger RE2C errors,
-or just examples of useful real-world programs.
+First, I used the main RE2C test suite. It includes hand-written snippets of code and examples of useful real-world programs:
+most of them check various optimizations, errors and special cases,
+or simply ensure that the basic examples are not broken.
\\
-Second, RE2C implementation of POSIX captures was verified on the canonical POSIX test suite compiled by Glenn Fowler [??].
-I used the augmented version provided by Kuklewicz [??] and excluded some tests from the ``basic'' subset
-that use start and end anchors \texttt{\^} and \texttt{\$}, which are not supported by RE2C.
+Second, I verified RE2C implementation of POSIX captures on the canonical POSIX test suite by Glenn Fowler [??].
+I used the augmented version provided by Kuklewicz [??] and excluded a few tests that check POSIX-specific extensions
+which are not supported by RE2C (e.g. start and end anchors \texttt{\^{}} and \texttt{\$}) ---
+the excluded tests do not contain any special cases of submatch extraction.
\\
-Third, I used RE2C self-validation mode [??].
-In this mode, instead of generating normal code, RE2C generates a special self-contained \emph{skeleton} program
-and two input files: one with input strings and one with compressed match results
-that are used to verify program behaviour on all inputs.
-Strings are generated so that they cover all TDFA transitions and many TDFA paths (including incorrect inputs that cause match failure).
+Third, I used RE2C self-validation option \texttt{--skeleton} [??].
+With this option RE2C ignores all user-defined interface code
+and embeds the generated lexer in a self-contained template program called \emph{skeleton}.
+Additionally, RE2C generates two input files: one with strings derived from the regular grammar
+and one with compressed match results that are used to verify program behaviour on all inputs.
+Strings are generated so that they cover all TDFA transitions and many TDFA paths (including inputs that cause match failure).
Generation of input data happens right after TDFA construction and before any optimizations,
-but the program itself is fully optimized.
+but the lexer itself is fully optimized (it is the same lexer that would be generated in normal mode).
Thus skeleton programs are capable of revealing any errors in optimization and code generation.
\\
-Fourth, I compared TDFA(0) against TDFA(1) on various RE and inputs:
-they result in very different programs, but must yield identical results.
+Fourth, I compared TDFA(0) vs. TDFA(1) on various RE and input strings:
+the two automata result in different programs, but they must yield identical results.
\\
-Last, and most important, I used fuzzer contributed by Sergei Trofimovich [??] and based on haskell QuickCheck library [??].
-I applied it in many different settings:
-ran TDFA(0) programs on skeleton inputs generated for TDFA(1) programs and vice versa;
-fuzz-tested RE2C against Kuklewicz library Regex-TDFA [??];
+Last, and most important, I used a \emph{fuzzer} contributed by Sergei Trofimovich [??] and based on the Haskell QuickCheck library [??].
+It generates random RE with the given \emph{constraints} and for each RE verifies that certain \emph{properties} are satisfied.
+By redefining the set of properties and constraints one can easily adapt the fuzzer to any particular setting.
+I used it in many different settings:
+executed TDFA(0) programs on skeleton inputs generated for TDFA(1) programs and vice versa;
+fuzz-tested RE2C against the Regex-TDFA library [??] written by Kuklewicz;
verified or disproved numerous assumptions and hypotheses;
generated minimal triggers for bugs and special cases
-and otherwise compensated the lack of imagination with the use of random generator.
+and otherwise compensated for the lack of imagination with the power of a random generator.
\\
-Benchmarks are aimed at comparison of TDFA(0) and TDFA(1).
-We have already seen on numerous examples in section \ref{section_determinization}
-that TDFA(1) has every reason to result in faster code;
+Benchmarks are aimed at comparison of TDFA(0) and TDFA(1);
+comparison of RE2C and other lexer generators is beyond the scope of this paper (see [Bum94]).
+As we have already seen on numerous examples in section \ref{section_determinization},
+TDFA(1) has every reason to result in faster code;
however, only a real-world program can show if there is any perceptible difference in practice.
-As an example of such real-world program I used two canonical use case for submatch extraction in RE: URI parser and HTTP parser.
-Both examples are used by many authors (see e.g. [ThoBra] and [SohTho]),
-as their syntax is simple enough to admit regular grammars [RFC-3986] [RFC7230],
-but at the same time they have non-trivial structure composed of multiple components of varying length and form.
-Each example comes in two flavors: RFC-compliant parser that performs full validation,
-and simplified parser that skips most of the validation and barely parses the input; both forms may be useful in practice.
+I used two canonical use cases for submatch extraction in RE: URI parser and HTTP parser.
+Both examples are used in the literature [ThoBra] [SohTho],
+as they are simple enough to admit a regular grammar,
+but at the same time both grammars have a non-trivial structure composed of multiple components of varying length and form [RFC-3986] [RFC7230].
+Each example has two implementations: RFC-compliant and simplified (both forms may be useful in practice).
The input to each parser is a 1G file of randomly generated URIs or HTTP messages; it is buffered in 4K chunks.
-Programs are written so that they spend most of the time on parsing
-and do only the bare minimum of work necesssary to convince compiler that parse results cannot be optimized out ---
-this way benchmarks measure the efficiency of parsing, not the accompanying code or operating system.
-Alternatively all parsers can be built in ``verification mode'' and will print out parse results.
-For each parser there is a corresponding recognizer based on a simple DFA:
-it sets a baseline for expectations of how fast and small the lexer can be and what is the real overhed on submatch extraction.
-\\
-
-All benchmarks were run on 64-bit Intel Core i3 machine with 350M RAM and 32K L1d, 32K L1i, 256K L2 and 3072K L3 caches;
-each result is the average of 4 subsequent runs after two ``warmup'' runs.
-\\
-
-Benchmarks are written in C-90.
-I used four different C compilers:
-gcc-7.1.10 (Gnu Compiler Collection [??]),
-clang-4.0.1 ([??]),
-tcc-0.9.26 (Tiny C Compiler [??])
-and pcc-1.1.0 (Portable C Compiler [??]).
-All compilers were run with optimization level \texttt{-O2} (though some of them probably ignore it).
-\\
-
+Programs are written so that they spend most of the time on parsing,
+%and do only the bare minimum of work necesssary to convince compiler that parse results cannot be optimized out ---
+%this way benchmarks measure the efficiency of parsing, not the accompanying code or the operating system.
+so the benchmarks measure the efficiency of parsing, not the accompanying code or the operating system.
+%Alternatively each parser can be built in ``verification mode'', in which it prints out parse results.
+For each of the four parsers there is a corresponding DFA-based recognizer:
+it sets a baseline for how fast and small the lexer can be, and shows the real overhead of submatch extraction.
+Benchmarks are written in C-90 and compiled with four different C compilers:
+GCC-7.1.10 [??],
+Clang-4.0.1 [??],
+TCC-0.9.26 [??]
+and PCC-1.1.0 [??]
+with optimization level \texttt{-O2} (though some compilers probably ignore it).
RE2C was run in three different settings:
default mode, with \texttt{-b} option (generate bit masks and nested \texttt{if}-s instead of plain \texttt{switch}-es),
-and with \texttt{--no-optimize-tags} option (it suppresses all optimizations
-of tag variables described in section \ref{section_implementation}, except compaction).
-\\
+and with \texttt{--no-optimize-tags} option (which suppresses the optimizations of tag variables described in section \ref{section_implementation}).
+All benchmarks were run on 64-bit Intel Core i3 machine with 350M RAM and 32K L1d, 32K L1i, 256K L2 and 3072K L3 caches;
+each result is the average of 4 subsequent runs after a proper ``warmup''.
+Benchmark results are summarized in tables \ref{table1}, \ref{table2}, \ref{table3} and \ref{table4}
+and visualized in the subsequent plots.
-\begin{table*}\label{table1}
+\end{multicols}
+
+%\begin{table*}\label{table1}
\begin{center}
+ \bigskip
\begin{tabular}{|c|ccccccccccc|}
\hline
& registers & states & code size (B) & \multicolumn{4}{c}{stripped binary size (B)} & \multicolumn{4}{c|}{run time (s)} \\
% TDFA(0) & 2054 & 625 & 835285 & 280632 & 272488 & 1132616 & 858232 & 14.14 & 13.24 & 105.87 & 59.71 \\
% TDFA(1) & 149 & 462 & 204119 & 63544 & 149608 & 238568 & 170104 & 6.64 & 5.90 & 68.50 & 29.39 \\
\hline
- \end{tabular}\\
- \caption{RFC-7230 compilant HTTP parser.}
- \smallskip
+ \end{tabular}\\*
+ \medskip
+    Table 1: RFC-7230 compliant HTTP parser.\\*
+ \medskip
\footnotesize{Total 39 tags: 34 simple and 5 with history.
Nondeterminism for TDFA(0): 23 tags with degree 2, 12 tags with degree 3 and 1 tag with degree 4.
Nondeterminism for TDFA(1): 18 tags with degree 2, 2 tags with degree 3.}
+ \bigskip
\end{center}
-\end{table*}
+%\end{table*}
-
-\begin{table*}\label{table2}
+%\begin{table*}\label{table2}
\begin{center}
+ \bigskip
\begin{tabular}{|c|ccccccccccc|}
\hline
& registers & states & code size (B) & \multicolumn{4}{c}{stripped binary size (B)} & \multicolumn{4}{c|}{run time (s)} \\
% TDFA(0) & 72 & 106 & 57956 & 22584 & 55400 & 73928 & 55416 & 8.61 & 6.77 & 73.05 & 34.68 \\
% TDFA(1) & 44 & 82 & 39674 & 18488 & 43112 & 49480 & 39032 & 6.01 & 5.38 & 63.87 & 27.44 \\
\hline
- \end{tabular}\\
- \caption{Simplified HTTP parser.}
- \smallskip
+ \end{tabular}\\*
+ \medskip
+ Table 2: Simplified HTTP parser.\\*
+ \medskip
\footnotesize{Total 15 tags: 12 simple and 3 with history.
Nondeterminism for TDFA(0): 8 tags with degree 2.
Nondeterminism for TDFA(1): 3 tags with degree 2.}
+ \bigskip
\end{center}
-\end{table*}
+%\end{table*}
-\begin{table*}\label{table3}
+%\begin{table*}\label{table3}
\begin{center}
+ \bigskip
\begin{tabular}{|c|ccccccccccc|}
\hline
& registers & states & code size (K) & \multicolumn{4}{c}{binary size (K, stripped)} & \multicolumn{4}{c|}{run time (s)} \\
% TDFA(0) & 611 & 280 & 435350 & 129072 & 153696 & 548272 & 473184 & 10.41 & 7.56 & 127.48 & 75.46 \\
% TDFA(1) & 64 & 256 & 133518 & 43056 & 88160 & 159248 & 125024 & 6.74 & 3.55 & 103.98 & 51.12 \\
\hline
- \end{tabular}\\
- \caption{RFC-3986 compilant URI parser.}
- \smallskip
+ \end{tabular}\\*
+ \medskip
+    Table 3: RFC-3986 compliant URI parser.\\*
+ \medskip
\footnotesize{Total 20 tags (all simple).
Nondeterminism for TDFA(0): 15 tags with degree 2 and 4 tags with degree 3.
Nondeterminism for TDFA(1): 10 tags with degree 2.}
+ \bigskip
\end{center}
-\end{table*}
+%\end{table*}
-\begin{table*}\label{table4}
+%\begin{table*}\label{table4}
\begin{center}
+ \bigskip
\begin{tabular}{|c|ccccccccccc|}
\hline
& registers & states & code size (K) & \multicolumn{4}{c}{binary size (K, stripped)} & \multicolumn{4}{c|}{run time (s)} \\
% TDFA(0) & 79 & 29 & 33745 & 18480 & 22624 & 43504 & 39008 & 7.46 & 3.94 & 105.22 & 61.72 \\
% TDFA(1) & 40 & 31 & 28013 & 14384 & 22624 & 36080 & 30816 & 6.29 & 3.33 & 102.00 & 48.22 \\
\hline
- \end{tabular}\\
- \caption{Simplified URI parser.}
- \smallskip
+ \end{tabular}\\*
+ \medskip
+ Table 4: Simplified URI parser.\\*
+ \medskip
\footnotesize{Total 14 tags (all simple).
Nondeterminism for TDFA(0): 8 tags with degree 2 and 5 tags with degree 3.
Nondeterminism for TDFA(1): 7 tags with degree 2.}
+ \bigskip
\end{center}
-\end{table*}
+%\end{table*}
+
+%\begin{table*}
+%\begin{center}
+\includegraphics[width=\linewidth]{img/bench/size_gcc_clang.png}
+\includegraphics[width=\linewidth]{img/bench/size_tcc_pcc.png}
+\includegraphics[width=\linewidth]{img/bench/time_gcc_clang.png}
+\includegraphics[width=\linewidth]{img/bench/time_tcc_pcc.png}
+%\end{center}
+%\end{table*}
-\begin{table*}
-\begin{center}
-\includegraphics[width=\linewidth]{img/bench/size_gcc_clang.png}\\
-\includegraphics[width=\linewidth]{img/bench/size_tcc_pcc.png}\\
-\end{center}
-\end{table*}
+\begin{multicols}{2}
-\begin{table*}
-\begin{center}
-\includegraphics[width=\linewidth]{img/bench/time_gcc_clang.png}\\
-\includegraphics[width=\linewidth]{img/bench/time_tcc_pcc.png}\\
-\end{center}
-\end{table*}
+Benchmark results show the following:
+
+\begin{itemize}
+ \setlength{\parskip}{0.5em}
+
+ \item Speed and size of the generated code vary between different compilers:
+ as expected, TCC and PCC generate slower and larger code than GCC and Clang (though PCC performs notably better);
+ but even GCC and Clang, which are both known for their optimizations, generate very different code:
+ GCC binaries are often 2x smaller, while the corresponding Clang-generated code runs up to 2x faster.
+
+ \item RE2C code-generation option \texttt{-b} has significant impact on the resulting code:
+ it results in up to 5x speedup for TCC, 2x speedup for PCC and about 2x reduction of binary size for Clang at the cost of about 1.5x slowdown;
+ of all compilers only GCC seems to be unaffected by this option.
+
+ \item Regardless of different compilers and options, TDFA(1) is consistently more efficient than TDFA(0):
+    the resulting code is about 1.5--2x faster and generally smaller,
+ especially on large programs and in the presence of tags with history.
+
+    \item TDFA(1) incurs only a modest overhead on submatch extraction compared to DFA-based recognition;
+    in particular, the gap between DFA and TDFA(1) is smaller than the gap between TDFA(1) and TDFA(0).
+
+ \item Nondeterminism levels are not so high in the example programs.
+
+ \item RE2C optimizations reduce binary size, even with optimizing C compilers.
+
+ \item RE2C optimizations have less effect on execution time: usually they reduce it, but not by much.
+ \\
+\end{itemize}
-Benchmark results are summarized in tables \ref{table1}, \ref{table2}, \ref{table3} and \ref{table4}
-and visualized on subsequent plots.
-They demonstrate that TDFA(1) can result in 1.5x - 2x speedup compared to TDFA(0), especially in the presence of tags with history;
-TDFA(1) incurs only modest overhead on submatch extraction compared to basic DFA-based recognition;
-nondeterminism levels are not so high in (at least some) real-world programs;
-RE2C optimizations reduce binary size, especially in complex cases with large automata and high submatch detalization,
-and even optimizing C compilers are not a substitution for them, as they lack the special knowledge of the program that RE2C has;
-RE2C optimizations have less effect on execution time (it is also reduced, but not by much).
\section{Future work}\label{section_future_work}
+The most interesting subject that requires further exploration and practical experiments
+is the comparison of TDFA (described in this paper) and DSST (described in [Gra15] and [SohTho])
+on practical problems of submatch extraction.
+Both models are aimed at generating fast parsers,
+and both depend heavily on the efficiency of the particular implementation.
+For instance, DSST is applied to full parsing, which suggests that it has some overhead compared to TDFA;
+however, optimizations of the resulting program may reduce the overhead, as shown in [Gra15].
+TDFA, contrary to DSST, allows copy operations on registers;
+but in practice they can be reduced to copying scalar values, as shown in section \ref{section_implementation}.
+The construction of DSST given in [Gra15] works only for leftmost greedy disambiguation;
+it might be interesting to construct DSST with POSIX disambiguation.
+\\
+
There is also quite different use of position markers described in literature:
Watson mentions so-called \emph{dotted} RE [Wat95]
that go back to DeRemers's construction of DFA [DeRem74],
\section*{Acknowledgements}
+Many thanks to the friend whose name starts with S!!! :)
+
\end{multicols}
\pagebreak