-
-# NUM LEFT RE2 GOR1 GTOP KUKL SLOW LAZY BACK
-
-1 1.00 1.38 3.98 3.74 2.71 5.21 3.78 3.48
-2 1.00 1.41 6.71 7.29 3.74 14.14 4.17 3.25
-3 1.00 1.59 10.69 11.52 4.14 27.27 4.17 3.19
-4 1.00 1.52 15.48 16.51 4.36 44.42 4.14 3.12
-5 1.00 0.96 3.47 3.62 5.89 3.77 5.19 5.45
-6 1.00 1.36 5.04 5.75 8.18 8.63 8.14 4.73
-7 1.00 1.51 7.33 8.35 9.21 14.99 12.84 4.56
-8 1.00 1.50 9.92 11.07 9.44 22.33 15.00 4.48
-9 1.00 0.30 1.87 1.94 2.26 1.84 3.82 6.44
-10 1.00 0.88 4.23 4.53 16.22 4.47 11.18 8.74
-11 1.00 0.78 3.37 3.80 14.78 3.29 7.39 11.64
-12 1.00 0.53 2.98 3.33 12.90 2.74 7.08 14.04
-13 1.00 0.58 3.11 3.61 11.26 2.94 7.17 12.73
-14 1.00 0.61 3.38 3.98 12.81 3.23 7.51 12.52
-15 1.00 0.71 3.24 3.93 5.82 3.43 7.69 19.04
-16 1.00 0.80 3.48 3.93 8.56 3.34 6.45 7.58
-17 1.00 0.79 3.20 3.60 9.12 3.15 6.64 6.72
-18 1.00 0.73 2.79 3.20 8.36 2.80 7.32 9.03
-19 1.00 0.48 3.46 2.77 10.52 3.44 6.26 10.92
-20 1.00 0.54 2.81 2.37 5.68 2.80 5.08 6.41
-21 1.00 0.60 2.70 2.68 5.71 2.66 5.53 8.81
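+# NUM LEFT RE2 GOR1 GTOP KUKL SLOW LAZY BACK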
+1 1.00 1.27 3.72 3.49 2.60 4.56 2.98 3.21
+2 1.00 1.33 6.34 6.80 3.44 11.92 2.78 3.07
+3 1.00 1.44 10.19 10.94 3.87 22.45 2.31 3.01
+4 1.00 1.44 15.15 16.01 4.17 36.72 2.28 2.97
+5 1.00 1.12 2.96 3.28 5.68 3.22 4.94 4.36
+6 1.00 1.38 4.47 5.36 7.86 7.33 10.65 4.11
+7 1.00 1.46 6.39 7.43 8.61 13.06 14.67 3.99
+8 1.00 1.44 8.66 9.80 8.82 19.78 18.85 3.92
+9 1.00 0.78 1.91 1.98 3.50 1.86 5.96 7.22
+10 1.00 0.76 3.53 3.95 13.80 3.85 11.95 7.53
+11 1.00 0.89 3.16 3.76 15.35 3.16 20.59 9.86
+12 1.00 0.86 2.92 3.49 17.82 2.57 21.64 13.05
+13 1.00 0.79 2.87 3.45 12.61 2.72 24.53 11.54
+14 1.00 0.81 3.07 3.62 14.90 2.94 16.27 9.92
+15 1.00 0.78 3.18 3.92 5.99 3.33 8.01 8.81
+16 1.00 0.89 2.99 3.41 7.17 2.89 11.49 6.29
+17 1.00 0.76 2.75 3.02 7.95 2.70 9.96 5.83
+18 1.00 0.93 2.42 3.00 8.00 2.44 19.99 7.73
+19 1.00 0.87 3.27 2.57 11.24 3.22 24.15 10.08
+20 1.00 0.71 2.50 2.19 6.02 2.46 14.57 5.55
+21 1.00 0.79 2.51 2.52 6.20 2.47 16.15 7.50
set style line 1 lc rgb '#000000' lw 1
set style line 2 lc rgb '#000000' lw 1 dt ' -'
-set style line 3 lc rgb '#000000' lw 1 dt (40.00, 10.00)
-set style line 4 lc rgb '#000000' lw 1 dt (20.00, 15.00)
-set style line 5 lc rgb '#000000' lw 1 dt (4.00, 20.00, 40, 20)
-set style line 6 lc rgb '#000000' lw 1 dt ' - '
-set style line 7 lc rgb '#000000' lw 1 dt (60.00, 15.00)
-set style line 8 lc rgb '#000000' lw 1 dt (4, 16)
+set style line 3 lc rgb '#000000' lw 1 dt (70.00, 15.00)
+set style line 4 lc rgb '#000000' lw 1 dt (40.00, 15.00)
+set style line 5 lc rgb '#000000' lw 1 dt (4, 20, 40, 20)
+set style line 6 lc rgb '#000000' lw 1 dt (20.00, 15.00)
+set style line 7 lc rgb '#000000' lw 1 dt (10.00, 30.00)
+set style line 8 lc rgb '#000000' lw 1 dt (40, 20, 5, 20, 5, 20, 5, 20)
set output 'plot_realworld.png'
-set terminal pngcairo dashed font "Courier,mono" size 800,600
+set terminal pngcairo dashed font "Courier,mono" size 750,550
set title "real-world RE"
set xtics (\
"HTTP 6204-198" 1, \
set tmargin 2
set lmargin 12
set rmargin 1
-set yrange [-1:30]
+set yrange [-1:25]
plot \
"data_realworld" using 1:2 ls 1 with lines title "leftmost greedy", \
"data_realworld" using 1:3 ls 2 with lines title "RE2", \
set output 'plot_artificial.png'
-set terminal pngcairo dashed font "Courier" size 1300,700
-set title "artificial highly ambiguous RE on long (64K) input strings"
+set terminal pngcairo dashed font "Courier" size 1150,650
+set title "artificial highly ambiguous RE on long (16K) input strings"
set xtics (\
'(a\{2\}|a\{3\}|a\{5\})*' 1, \
'(a\{7\}|a\{11\}|a\{13\})*' 2, \
set tmargin 2
set lmargin 15
set rmargin 1
-set yrange [-1:30]
+set yrange [-1:25]
plot \
"data_artificial" using 1:2 ls 1 with lines title "leftmost greedy", \
"data_artificial" using 1:3 ls 2 with lines title "RE2", \
set output 'plot_pathological.png'
-set terminal pngcairo dashed font "Courier" size 500,600
+set terminal pngcairo dashed font "Courier" size 400,550
set title "pathological RE"
set xtics (\
'((a?)\{0,125\})*' 1, \
set tmargin 2
set lmargin 12
set rmargin 1
-set yrange [-50:32<*]
+set yrange [-50:1000]
plot \
"data_pathological" using 1:2 ls 1 with lines title "leftmost greedy", \
"data_pathological" using 1:3 ls 2 with lines title "RE2", \
where $n$ is the length of input, $m$ is the size of the regular expression with counted repetition subexpressions ``unrolled'',
and $t$ is the number of capturing groups and subexpressions that contain them.
%
-Benchmarks show that in practice our algorithm is 2x-10x slower than leftmost greedy matching
+Benchmarks show that in practice our algorithm is about 5x slower than leftmost greedy matching
(which has no overhead on disambiguation).
%
We present a lazy variation that is much faster, but requires memory proportional to the size of input.
\BlankLine
$t_1 = tag(U, n_1), \; t_2 = tag(U, n_2)$ \;
- \BlankLine
- \lIf {$t_1 mod \, 2 \equiv 0$} { \Return $-1$ }
- \lIf {$t_2 mod \, 2 \equiv 0$} { \Return $1$ }
-
\BlankLine
\lIf {$t_1 < 0$} { \Return $1$ }
\lIf {$t_2 < 0$} { \Return $-1$ }
+ \BlankLine
+ \lIf {$t_1 \bmod 2 \equiv 0$} { \Return $-1$ }
+ \lIf {$t_2 \bmod 2 \equiv 0$} { \Return $1$ }
+
\BlankLine
\Return $0$
}
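+For readers who prefer plain code, the comparison above corresponds to the following C sketch
+(\texttt{compare\_tags()} is our placeholder name; $t_1$ and $t_2$ are the tag values obtained via $tag(U, n_1)$ and $tag(U, n_2)$):
+\begin{verbatim}
+/* Mirror of the pseudocode above: negative tags lose immediately,
+ * then even tags win; by the usual comparator convention -1 favors
+ * the first argument, 1 the second, 0 means no decision here. */
+int compare_tags(int t1, int t2)
+{
+    if (t1 < 0) return 1;
+    if (t2 < 0) return -1;
+    if (t1 % 2 == 0) return -1;
+    if (t2 % 2 == 0) return 1;
+    return 0;
+}
+\end{verbatim}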
\includegraphics[width=\linewidth]{img/bench/plot.png}
\vspace{-2em}
\caption{
-Benchmarks.
+Benchmarks.\\
+Real-world tests have labels of the form ``title $m$-$k$'', where $m$ is RE size and $k$ is the number of capturing groups.
%: real-world RE (upper left),
%pathological RE for naive precedence table algorithm (upper right),
%artificial highly ambiguous RE on very long inputs (lower).
\begin{itemize}[itemsep=0.5em]
\item Okui-Suzuki algorithm degrades with increased closure size.
- This is understandable, as the algorithm performs pairwise comparison of closure states.
+ This is understandable, as the algorithm performs pairwise comparison of closure states to compute precedence matrices (see the first code sketch after this list).
Naive $update \Xund ptables ()$ algorithm degrades extremely fast,
- and the advanced algorithm behaves much better (but it may incur slight overhead in simple cases).
-
- \item Cox and Kuklewicz algorithms degrade as the number of tags increases.
- This is not surprizing, as both algorithms have per-tag inner loops in their core.
- On large real-world RE Kuklewicz algorithm is much slower than all Okui-Suzuki variations,
- and Cox algorithm is so slow that it did not fit into the plot space.
-
- \item The bottleneck of Cox algorithm is copying of offset arrays.
- Using GOR1 instead of naive depth-first search, though asymptotically faster,
- increases the amount of copying because depth-dirst scan order allows to use a single buffer array that is updated and restored in-place.
- However, copying offset arrays is also required in other parts of the algorithm,
- and in general Cox algorithm is not suited for RE with many submatch groups.
-
- \item Lazy variation of Okui-Suzuki is much faster than the main variation on real-world tests and not very long inputs.
-
- \item GTOP is somewhat faster than GOR1 on real-world RE, but can be slower on artificial RE.
-
- \item RE2 performs close to our implementations (sometimes better, sometimes worse).
+ and the advanced algorithm behaves much better (though it may incur slight overhead in simple cases).
+
+ \item Kuklewicz algorithm degrades with increased closure size and increased number of tags.
+ This is not surprising, as the algorithm has a per-state and per-tag loop that computes the precedence matrix.
+ On real-world tests with many capturing groups Kuklewicz algorithm is much slower than Okui-Suzuki algorithm.
+
+ \item Cox algorithm degrades with increased number of tags.
+ The bottleneck of the algorithm is the copying of offset arrays
+ (each array contains a pair of offsets per tag; see the second code sketch after this list).
+ Using GOR1 instead of naive depth-first search, though asymptotically faster, increases the amount of copying,
+ because depth-first scan order allows the use of a single buffer array that is updated and restored in-place.
+ However, copying is required elsewhere in the algorithm,
+ and in general the algorithm is not suited for RE with many submatch groups.
+ On real-world tests Cox algorithm is so slow that it did not fit into the plot space.
+
+ \item Lazy variation of Okui-Suzuki degrades with increased cache size and path context size.
+ This may happen with long input strings and a high level of ambiguity in RE
+ (in such cases the lazy algorithm does all the work of the non-lazy algorithm,
+ but with additional overhead for cache lookups/insertions and for accumulating data from previous steps).
+ On real-world tests the lazy variation of Okui-Suzuki is fast.
+
+ \item GOR1 and GTOP performance is similar.
+
+ \item RE2 performance is close to that of our leftmost greedy implementation.
\\[-0.5em]
\end{itemize}
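+The quadratic behavior of the pairwise comparison mentioned in the first item can be summarized by the following schematic C fragment
+(illustrative only; \texttt{state\_t} and \texttt{compare()} are our placeholder names, not identifiers from the actual implementation):
+\begin{verbatim}
+#include <stddef.h>
+
+typedef struct state state_t;                    /* closure state (placeholder) */
+int compare(const state_t *x, const state_t *y); /* precedence test (placeholder) */
+
+/* The precedence matrix is filled by comparing every pair of closure
+ * states: O(n^2) calls to compare() for a closure of size n, which is
+ * why all Okui-Suzuki variations degrade as the closure grows. */
+void update_ptables(int **prec, const state_t **closure, size_t n)
+{
+    for (size_t i = 0; i < n; ++i)
+        for (size_t j = 0; j < n; ++j)
+            prec[i][j] = compare(closure[i], closure[j]);
+}
+\end{verbatim}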
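+Similarly, the copying bottleneck of Cox algorithm can be sketched as follows
+(again schematic, with placeholder names; each submatch array stores a pair of offsets per tag, so a single copy costs $O(t)$ for $t$ tags):
+\begin{verbatim}
+#include <string.h>
+
+/* Forking the submatch data of an NFA thread copies 2*ntags offsets;
+ * with many tags (capturing groups) these copies dominate run time. */
+void fork_offsets(long *dst, const long *src, size_t ntags)
+{
+    memcpy(dst, src, 2 * ntags * sizeof(long));
+}
+\end{verbatim}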
-One interesting test is RE of the form $(a^{k_1}|\hdots|a^{k_n})^{0,\infty}$,
-e.g. \texttt{(a\{2\}|a\{3\}|a\{5\})*}.
+One particularly interesting group of tests that illustrates the above points
+consists of RE of the form $(a^{k_1}|\hdots|a^{k_n})^{0,\infty}$
+(artificial tests 1-4)
+and their variations with more capturing groups
+(artificial tests 5-8).
+For example, consider \texttt{(a\{2\}|a\{3\}|a\{5\})*} and \texttt{(((a)\{2\})|((a)\{3\})|((a)\{5\}))*}.
Given an input string \texttt{a...a},
submatch on the last iteration varies with the length of the input:
it equals \texttt{aaaaa} for a $5n$-character string,
and \texttt{aaa} for strings of length $5n - 2$ and $5n + 1$ ($n \in \YN$).
Variation continues infinitely with a period of five characters.
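+This variation can be observed directly with any engine that implements correct POSIX disambiguation;
+a small illustrative C program using the standard \texttt{regexec()} interface
+(assuming a POSIX-correct libc; the reported length of the last iteration should vary with a period of five):
+\begin{verbatim}
+#include <regex.h>
+#include <stdio.h>
+#include <string.h>
+
+int main(void)
+{
+    regex_t re;
+    regmatch_t pm[2]; /* pm[0]: whole match, pm[1]: last iteration */
+    char buf[32];
+
+    if (regcomp(&re, "(a{2}|a{3}|a{5})*", REG_EXTENDED) != 0) return 1;
+
+    for (size_t len = 8; len <= 12; ++len) { /* one full period */
+        memset(buf, 'a', len);
+        buf[len] = '\0';
+        if (regexec(&re, buf, 2, pm, 0) == 0)
+            printf("length %2zu: last iteration is %ld chars\n",
+                   len, (long)(pm[1].rm_eo - pm[1].rm_so));
+    }
+    regfree(&re);
+    return 0;
+}
+\end{verbatim}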
%
-We can increase variation period and the range of possible submatch results by choosing different counter values.
+We can increase the variation period and the range of possible submatch results by choosing larger counter values.
%
-Large period and wide range correspond to a higher level of ambiguity and many parallel competing paths,
-which means increased closure size (hence the slowdown of Okui-Suzuki algorithm, especially the ``naive Okui-Suzuki'' variation).
-Adding more capturing groups increases the number of tags (hence the slowdown of Kuklewicz and Cox algorithms).
+This causes increased closure size ---
+hence the slowdown of Okui-Suzuki algorithm on tests 1 to 4 and 5 to 8 (especially pronounced for the ``naive Okui-Suzuki'' variation),
+and the more gentle slowdown of Kuklewicz algorithm on the same ranges.
+%
+Adding more capturing groups increases the number of tags ---
+hence the slowdown of Kuklewicz and Cox algorithms on test group 5-8 compared to group 1-4.
+%
+%Note that Cox algorithm performs very well on this test and slows down at the same pace as leftmost greedy.
\\
In closing, we would like to point out that correctness
\FloatBarrier
+
\section{Conclusions and future work}
+The main result of our work is a practical POSIX matching algorithm
+that can be used on real-world regular expressions,
+does not require complex preprocessing,
+and incurs a relatively modest disambiguation overhead compared to other algorithms.
+%
+We tried to present the algorithm in full, with a few useful variations,
+in order to make implementation easy for the reader.
+\\
+
+We see a certain tradeoff between speed and memory usage:
+the bounded-memory version of the algorithm performs a lot of redundant work,
+while the lazy version avoids redundant work at the expense of potentially unbounded memory usage.
+Neither approach seems ideal;
+perhaps a hybrid approach could be used in practice.
+\\
+
+It is still an open question to us
+whether it is possible to combine the elegance of the derivative-based approach to POSIX disambiguation
+with the practical efficiency of NFA-based methods.
+%
+The derivative-based approach constructs match results in such an order that the longest-leftmost result always comes first.
+%
+We experimented with recursive descent parsers that embrace the same ordering idea,
+but the resulting algorithm was rather complex and slow in practice.
+\\
+
+It would be interesting to apply our approach to automata with counters
+instead of unrolling bounded repetition.
+
\vfill\null
\clearpage