From: Andrei Zmievski Date: Tue, 11 Apr 2000 18:40:47 +0000 (+0000) Subject: Re-add. X-Git-Tag: php-4.0RC2~409 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=da1cca03ae8538c587e6bf9a98b340d45949352b;p=php Re-add. --- diff --git a/ext/pcre/pcrelib/AUTHORS b/ext/pcre/pcrelib/AUTHORS new file mode 100644 index 0000000000..bfe1b5d8a4 --- /dev/null +++ b/ext/pcre/pcrelib/AUTHORS @@ -0,0 +1,6 @@ +Written by: Philip Hazel + +University of Cambridge Computing Service, +Cambridge, England. Phone: +44 1223 334714. + +Copyright (c) 1997-2000 University of Cambridge diff --git a/ext/pcre/pcrelib/COPYING b/ext/pcre/pcrelib/COPYING new file mode 100644 index 0000000000..f305033c16 --- /dev/null +++ b/ext/pcre/pcrelib/COPYING @@ -0,0 +1,32 @@ +PCRE LICENCE +------------ + +PCRE is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. + +Written by: Philip Hazel + +University of Cambridge Computing Service, +Cambridge, England. Phone: +44 1223 334714. + +Copyright (c) 1997-2000 University of Cambridge + +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. + +End diff --git a/ext/pcre/pcrelib/ChangeLog b/ext/pcre/pcrelib/ChangeLog new file mode 100644 index 0000000000..c594ab50d8 --- /dev/null +++ b/ext/pcre/pcrelib/ChangeLog @@ -0,0 +1,576 @@ +ChangeLog for PCRE +------------------ + + +Version 3.1 09-Feb-00 +--------------------- + +The only change in this release is the fixing of some bugs in Makefile.in for +the "install" target: + +(1) It was failing to install pcreposix.h. + +(2) It was overwriting the pcre.3 man page with the pcreposix.3 man page. + + +Version 3.0 01-Feb-00 +--------------------- + +1. Add support for the /+ modifier to perltest (to output $` like it does in +pcretest). + +2. Add support for the /g modifier to perltest. + +3. Fix pcretest so that it behaves even more like Perl for /g when the pattern +matches null strings. + +4. Fix perltest so that it doesn't do unwanted things when fed an empty +pattern. Perl treats empty patterns specially - it reuses the most recent +pattern, which is not what we want. Replace // by /(?#)/ in order to avoid this +effect. + +5. The POSIX interface was broken in that it was just handing over the POSIX +captured string vector to pcre_exec(), but (since release 2.00) PCRE has +required a bigger vector, with some working space on the end. This means that +the POSIX wrapper now has to get and free some memory, and copy the results. + +6. Added some simple autoconf support, placing the test data and the +documentation in separate directories, re-organizing some of the +information files, and making it build pcre-config (a GNU standard). Also added +libtool support for building PCRE as a shared library, which is now the +default. + +7. Got rid of the leading zero in the definition of PCRE_MINOR because 08 and +09 are not valid octal constants. Single digits will be used for minor values +less than 10. + +8. Defined REG_EXTENDED and REG_NOSUB as zero in the POSIX header, so that +existing programs that set these in the POSIX interface can use PCRE without +modification. + +9. Added a new function, pcre_fullinfo() with an extensible interface. It can +return all that pcre_info() returns, plus additional data. The pcre_info() +function is retained for compatibility, but is considered to be obsolete. + +10. Added experimental recursion feature (?R) to handle one common case that +Perl 5.6 will be able to do with (?p{...}). + +11. Added support for POSIX character classes like [:alpha:], which Perl is +adopting. + + +Version 2.08 31-Aug-99 +---------------------- + +1. When startoffset was not zero and the pattern began with ".*", PCRE was not +trying to match at the startoffset position, but instead was moving forward to +the next newline as if a previous match had failed. + +2. pcretest was not making use of PCRE_NOTEMPTY when repeating for /g and /G, +and could get into a loop if a null string was matched other than at the start +of the subject. + +3. Added definitions of PCRE_MAJOR and PCRE_MINOR to pcre.h so the version can +be distinguished at compile time, and for completeness also added PCRE_DATE. + +5. Added Paul Sokolovsky's minor changes to make it easy to compile a Win32 DLL +in GnuWin32 environments. + + +Version 2.07 29-Jul-99 +---------------------- + +1. The documentation is now supplied in plain text form and HTML as well as in +the form of man page sources. + +2. C++ compilers don't like assigning (void *) values to other pointer types. +In particular this affects malloc(). Although there is no problem in Standard +C, I've put in casts to keep C++ compilers happy. + +3. Typo on pcretest.c; a cast of (unsigned char *) in the POSIX regexec() call +should be (const char *). + +4. If NOPOSIX is defined, pcretest.c compiles without POSIX support. This may +be useful for non-Unix systems who don't want to bother with the POSIX stuff. +However, I haven't made this a standard facility. The documentation doesn't +mention it, and the Makefile doesn't support it. + +5. The Makefile now contains an "install" target, with editable destinations at +the top of the file. The pcretest program is not installed. + +6. pgrep -V now gives the PCRE version number and date. + +7. Fixed bug: a zero repetition after a literal string (e.g. /abcde{0}/) was +causing the entire string to be ignored, instead of just the last character. + +8. If a pattern like /"([^\\"]+|\\.)*"/ is applied in the normal way to a +non-matching string, it can take a very, very long time, even for strings of +quite modest length, because of the nested recursion. PCRE now does better in +some of these cases. It does this by remembering the last required literal +character in the pattern, and pre-searching the subject to ensure it is present +before running the real match. In other words, it applies a heuristic to detect +some types of certain failure quickly, and in the above example, if presented +with a string that has no trailing " it gives "no match" very quickly. + +9. A new runtime option PCRE_NOTEMPTY causes null string matches to be ignored; +other alternatives are tried instead. + + +Version 2.06 09-Jun-99 +---------------------- + +1. Change pcretest's output for amount of store used to show just the code +space, because the remainder (the data block) varies in size between 32-bit and +64-bit systems. + +2. Added an extra argument to pcre_exec() to supply an offset in the subject to +start matching at. This allows lookbehinds to work when searching for multiple +occurrences in a string. + +3. Added additional options to pcretest for testing multiple occurrences: + + /+ outputs the rest of the string that follows a match + /g loops for multiple occurrences, using the new startoffset argument + /G loops for multiple occurrences by passing an incremented pointer + +4. PCRE wasn't doing the "first character" optimization for patterns starting +with \b or \B, though it was doing it for other lookbehind assertions. That is, +it wasn't noticing that a match for a pattern such as /\bxyz/ has to start with +the letter 'x'. On long subject strings, this gives a significant speed-up. + + +Version 2.05 21-Apr-99 +---------------------- + +1. Changed the type of magic_number from int to long int so that it works +properly on 16-bit systems. + +2. Fixed a bug which caused patterns starting with .* not to work correctly +when the subject string contained newline characters. PCRE was assuming +anchoring for such patterns in all cases, which is not correct because .* will +not pass a newline unless PCRE_DOTALL is set. It now assumes anchoring only if +DOTALL is set at top level; otherwise it knows that patterns starting with .* +must be retried after every newline in the subject. + + +Version 2.04 18-Feb-99 +---------------------- + +1. For parenthesized subpatterns with repeats whose minimum was zero, the +computation of the store needed to hold the pattern was incorrect (too large). +If such patterns were nested a few deep, this could multiply and become a real +problem. + +2. Added /M option to pcretest to show the memory requirement of a specific +pattern. Made -m a synonym of -s (which does this globally) for compatibility. + +3. Subpatterns of the form (regex){n,m} (i.e. limited maximum) were being +compiled in such a way that the backtracking after subsequent failure was +pessimal. Something like (a){0,3} was compiled as (a)?(a)?(a)? instead of +((a)((a)(a)?)?)? with disastrous performance if the maximum was of any size. + + +Version 2.03 02-Feb-99 +---------------------- + +1. Fixed typo and small mistake in man page. + +2. Added 4th condition (GPL supersedes if conflict) and created separate +LICENCE file containing the conditions. + +3. Updated pcretest so that patterns such as /abc\/def/ work like they do in +Perl, that is the internal \ allows the delimiter to be included in the +pattern. Locked out the use of \ as a delimiter. If \ immediately follows +the final delimiter, add \ to the end of the pattern (to test the error). + +4. Added the convenience functions for extracting substrings after a successful +match. Updated pcretest to make it able to test these functions. + + +Version 2.02 14-Jan-99 +---------------------- + +1. Initialized the working variables associated with each extraction so that +their saving and restoring doesn't refer to uninitialized store. + +2. Put dummy code into study.c in order to trick the optimizer of the IBM C +compiler for OS/2 into generating correct code. Apparently IBM isn't going to +fix the problem. + +3. Pcretest: the timing code wasn't using LOOPREPEAT for timing execution +calls, and wasn't printing the correct value for compiling calls. Increased the +default value of LOOPREPEAT, and the number of significant figures in the +times. + +4. Changed "/bin/rm" in the Makefile to "-rm" so it works on Windows NT. + +5. Renamed "deftables" as "dftables" to get it down to 8 characters, to avoid +a building problem on Windows NT with a FAT file system. + + +Version 2.01 21-Oct-98 +---------------------- + +1. Changed the API for pcre_compile() to allow for the provision of a pointer +to character tables built by pcre_maketables() in the current locale. If NULL +is passed, the default tables are used. + + +Version 2.00 24-Sep-98 +---------------------- + +1. Since the (>?) facility is in Perl 5.005, don't require PCRE_EXTRA to enable +it any more. + +2. Allow quantification of (?>) groups, and make it work correctly. + +3. The first character computation wasn't working for (?>) groups. + +4. Correct the implementation of \Z (it is permitted to match on the \n at the +end of the subject) and add 5.005's \z, which really does match only at the +very end of the subject. + +5. Remove the \X "cut" facility; Perl doesn't have it, and (?> is neater. + +6. Remove the ability to specify CASELESS, MULTILINE, DOTALL, and +DOLLAR_END_ONLY at runtime, to make it possible to implement the Perl 5.005 +localized options. All options to pcre_study() were also removed. + +7. Add other new features from 5.005: + + $(?<= positive lookbehind + $(?a*))*/ (a PCRE_EXTRA facility). + + +Version 1.00 18-Nov-97 +---------------------- + +1. Added compile-time macros to support systems such as SunOS4 which don't have +memmove() or strerror() but have other things that can be used instead. + +2. Arranged that "make clean" removes the executables. + + +Version 0.99 27-Oct-97 +---------------------- + +1. Fixed bug in code for optimizing classes with only one character. It was +initializing a 32-byte map regardless, which could cause it to run off the end +of the memory it had got. + +2. Added, conditional on PCRE_EXTRA, the proposed (?>REGEX) construction. + + +Version 0.98 22-Oct-97 +---------------------- + +1. Fixed bug in code for handling temporary memory usage when there are more +back references than supplied space in the ovector. This could cause segfaults. + + +Version 0.97 21-Oct-97 +---------------------- + +1. Added the \X "cut" facility, conditional on PCRE_EXTRA. + +2. Optimized negated single characters not to use a bit map. + +3. Brought error texts together as macro definitions; clarified some of them; +fixed one that was wrong - it said "range out of order" when it meant "invalid +escape sequence". + +4. Changed some char * arguments to const char *. + +5. Added PCRE_NOTBOL and PCRE_NOTEOL (from POSIX). + +6. Added the POSIX-style API wrapper in pcreposix.a and testing facilities in +pcretest. + + +Version 0.96 16-Oct-97 +---------------------- + +1. Added a simple "pgrep" utility to the distribution. + +2. Fixed an incompatibility with Perl: "{" is now treated as a normal character +unless it appears in one of the precise forms "{ddd}", "{ddd,}", or "{ddd,ddd}" +where "ddd" means "one or more decimal digits". + +3. Fixed serious bug. If a pattern had a back reference, but the call to +pcre_exec() didn't supply a large enough ovector to record the related +identifying subpattern, the match always failed. PCRE now remembers the number +of the largest back reference, and gets some temporary memory in which to save +the offsets during matching if necessary, in order to ensure that +backreferences always work. + +4. Increased the compatibility with Perl in a number of ways: + + (a) . no longer matches \n by default; an option PCRE_DOTALL is provided + to request this handling. The option can be set at compile or exec time. + + (b) $ matches before a terminating newline by default; an option + PCRE_DOLLAR_ENDONLY is provided to override this (but not in multiline + mode). The option can be set at compile or exec time. + + (c) The handling of \ followed by a digit other than 0 is now supposed to be + the same as Perl's. If the decimal number it represents is less than 10 + or there aren't that many previous left capturing parentheses, an octal + escape is read. Inside a character class, it's always an octal escape, + even if it is a single digit. + + (d) An escaped but undefined alphabetic character is taken as a literal, + unless PCRE_EXTRA is set. Currently this just reserves the remaining + escapes. + + (e) {0} is now permitted. (The previous item is removed from the compiled + pattern). + +5. Changed all the names of code files so that the basic parts are no longer +than 10 characters, and abolished the teeny "globals.c" file. + +6. Changed the handling of character classes; they are now done with a 32-byte +bit map always. + +7. Added the -d and /D options to pcretest to make it possible to look at the +internals of compilation without having to recompile pcre. + + +Version 0.95 23-Sep-97 +---------------------- + +1. Fixed bug in pre-pass concerning escaped "normal" characters such as \x5c or +\x20 at the start of a run of normal characters. These were being treated as +real characters, instead of the source characters being re-checked. + + +Version 0.94 18-Sep-97 +---------------------- + +1. The functions are now thread-safe, with the caveat that the global variables +containing pointers to malloc() and free() or alternative functions are the +same for all threads. + +2. Get pcre_study() to generate a bitmap of initial characters for non- +anchored patterns when this is possible, and use it if passed to pcre_exec(). + + +Version 0.93 15-Sep-97 +---------------------- + +1. /(b)|(:+)/ was computing an incorrect first character. + +2. Add pcre_study() to the API and the passing of pcre_extra to pcre_exec(), +but not actually doing anything yet. + +3. Treat "-" characters in classes that cannot be part of ranges as literals, +as Perl does (e.g. [-az] or [az-]). + +4. Set the anchored flag if a branch starts with .* or .*? because that tests +all possible positions. + +5. Split up into different modules to avoid including unneeded functions in a +compiled binary. However, compile and exec are still in one module. The "study" +function is split off. + +6. The character tables are now in a separate module whose source is generated +by an auxiliary program - but can then be edited by hand if required. There are +now no calls to isalnum(), isspace(), isdigit(), isxdigit(), tolower() or +toupper() in the code. + +7. Turn the malloc/free funtions variables into pcre_malloc and pcre_free and +make them global. Abolish the function for setting them, as the caller can now +set them directly. + + +Version 0.92 11-Sep-97 +---------------------- + +1. A repeat with a fixed maximum and a minimum of 1 for an ordinary character +(e.g. /a{1,3}/) was broken (I mis-optimized it). + +2. Caseless matching was not working in character classes if the characters in +the pattern were in upper case. + +3. Make ranges like [W-c] work in the same way as Perl for caseless matching. + +4. Make PCRE_ANCHORED public and accept as a compile option. + +5. Add an options word to pcre_exec() and accept PCRE_ANCHORED and +PCRE_CASELESS at run time. Add escapes \A and \I to pcretest to cause it to +pass them. + +6. Give an error if bad option bits passed at compile or run time. + +7. Add PCRE_MULTILINE at compile and exec time, and (?m) as well. Add \M to +pcretest to cause it to pass that flag. + +8. Add pcre_info(), to get the number of identifying subpatterns, the stored +options, and the first character, if set. + +9. Recognize C+ or C{n,m} where n >= 1 as providing a fixed starting character. + + +Version 0.91 10-Sep-97 +---------------------- + +1. PCRE was failing to diagnose unlimited repeats of subpatterns that could +match the empty string as in /(a*)*/. It was looping and ultimately crashing. + +2. PCRE was looping on encountering an indefinitely repeated back reference to +a subpattern that had matched an empty string, e.g. /(a|)\1*/. It now does what +Perl does - treats the match as successful. + +**** diff --git a/ext/pcre/pcrelib/INSTALL b/ext/pcre/pcrelib/INSTALL new file mode 100644 index 0000000000..d63a78fef9 --- /dev/null +++ b/ext/pcre/pcrelib/INSTALL @@ -0,0 +1,185 @@ +Basic Installation +================== + + These are generic installation instructions that apply to systems that +can run the `configure' shell script - Unix systems and any that imitate +it. They are not specific to PCRE. There are PCRE-specific instructions +for non-Unix systems in the file NON-UNIX. + + The `configure' shell script attempts to guess correct values for +various system-dependent variables used during compilation. It uses +those values to create a `Makefile' in each directory of the package. +It may also create one or more `.h' files containing system-dependent +definitions. Finally, it creates a shell script `config.status' that +you can run in the future to recreate the current configuration, a file +`config.cache' that saves the results of its tests to speed up +reconfiguring, and a file `config.log' containing compiler output +(useful mainly for debugging `configure'). + + If you need to do unusual things to compile the package, please try +to figure out how `configure' could check whether to do them, and mail +diffs or instructions to the address given in the `README' so they can +be considered for the next release. If at some point `config.cache' +contains results you don't want to keep, you may remove or edit it. + + The file `configure.in' is used to create `configure' by a program +called `autoconf'. You only need `configure.in' if you want to change +it or regenerate `configure' using a newer version of `autoconf'. + +The simplest way to compile this package is: + + 1. `cd' to the directory containing the package's source code and type + `./configure' to configure the package for your system. If you're + using `csh' on an old version of System V, you might need to type + `sh ./configure' instead to prevent `csh' from trying to execute + `configure' itself. + + Running `configure' takes awhile. While running, it prints some + messages telling which features it is checking for. + + 2. Type `make' to compile the package. + + 3. Optionally, type `make check' to run any self-tests that come with + the package. + + 4. Type `make install' to install the programs and any data files and + documentation. + + 5. You can remove the program binaries and object files from the + source code directory by typing `make clean'. To also remove the + files that `configure' created (so you can compile the package for + a different kind of computer), type `make distclean'. There is + also a `make maintainer-clean' target, but that is intended mainly + for the package's developers. If you use it, you may have to get + all sorts of other programs in order to regenerate files that came + with the distribution. + +Compilers and Options +===================== + + Some systems require unusual options for compilation or linking that +the `configure' script does not know about. You can give `configure' +initial values for variables by setting them in the environment. Using +a Bourne-compatible shell, you can do that on the command line like +this: + CC=c89 CFLAGS=-O2 LIBS=-lposix ./configure + +Or on systems that have the `env' program, you can do it like this: + env CPPFLAGS=-I/usr/local/include LDFLAGS=-s ./configure + +Compiling For Multiple Architectures +==================================== + + You can compile the package for more than one kind of computer at the +same time, by placing the object files for each architecture in their +own directory. To do this, you must use a version of `make' that +supports the `VPATH' variable, such as GNU `make'. `cd' to the +directory where you want the object files and executables to go and run +the `configure' script. `configure' automatically checks for the +source code in the directory that `configure' is in and in `..'. + + If you have to use a `make' that does not supports the `VPATH' +variable, you have to compile the package for one architecture at a time +in the source code directory. After you have installed the package for +one architecture, use `make distclean' before reconfiguring for another +architecture. + +Installation Names +================== + + By default, `make install' will install the package's files in +`/usr/local/bin', `/usr/local/man', etc. You can specify an +installation prefix other than `/usr/local' by giving `configure' the +option `--prefix=PATH'. + + You can specify separate installation prefixes for +architecture-specific files and architecture-independent files. If you +give `configure' the option `--exec-prefix=PATH', the package will use +PATH as the prefix for installing programs and libraries. +Documentation and other data files will still use the regular prefix. + + In addition, if you use an unusual directory layout you can give +options like `--bindir=PATH' to specify different values for particular +kinds of files. Run `configure --help' for a list of the directories +you can set and what kinds of files go in them. + + If the package supports it, you can cause programs to be installed +with an extra prefix or suffix on their names by giving `configure' the +option `--program-prefix=PREFIX' or `--program-suffix=SUFFIX'. + +Optional Features +================= + + Some packages pay attention to `--enable-FEATURE' options to +`configure', where FEATURE indicates an optional part of the package. +They may also pay attention to `--with-PACKAGE' options, where PACKAGE +is something like `gnu-as' or `x' (for the X Window System). The +`README' should mention any `--enable-' and `--with-' options that the +package recognizes. + + For packages that use the X Window System, `configure' can usually +find the X include and library files automatically, but if it doesn't, +you can use the `configure' options `--x-includes=DIR' and +`--x-libraries=DIR' to specify their locations. + +Specifying the System Type +========================== + + There may be some features `configure' can not figure out +automatically, but needs to determine by the type of host the package +will run on. Usually `configure' can figure that out, but if it prints +a message saying it can not guess the host type, give it the +`--host=TYPE' option. TYPE can either be a short name for the system +type, such as `sun4', or a canonical name with three fields: + CPU-COMPANY-SYSTEM + +See the file `config.sub' for the possible values of each field. If +`config.sub' isn't included in this package, then this package doesn't +need to know the host type. + + If you are building compiler tools for cross-compiling, you can also +use the `--target=TYPE' option to select the type of system they will +produce code for and the `--build=TYPE' option to select the type of +system on which you are compiling the package. + +Sharing Defaults +================ + + If you want to set default values for `configure' scripts to share, +you can create a site shell script called `config.site' that gives +default values for variables like `CC', `cache_file', and `prefix'. +`configure' looks for `PREFIX/share/config.site' if it exists, then +`PREFIX/etc/config.site' if it exists. Or, you can set the +`CONFIG_SITE' environment variable to the location of the site script. +A warning: not all `configure' scripts look for a site script. + +Operation Controls +================== + + `configure' recognizes the following options to control how it +operates. + +`--cache-file=FILE' + Use and save the results of the tests in FILE instead of + `./config.cache'. Set FILE to `/dev/null' to disable caching, for + debugging `configure'. + +`--help' + Print a summary of the options to `configure', and exit. + +`--quiet' +`--silent' +`-q' + Do not print messages saying which checks are being made. To + suppress all normal output, redirect it to `/dev/null' (any error + messages will still be shown). + +`--srcdir=DIR' + Look for the package's source code in directory DIR. Usually + `configure' can determine that directory automatically. + +`--version' + Print the version of Autoconf used to generate the `configure' + script, and exit. + +`configure' also accepts some other, not widely useful, options. diff --git a/ext/pcre/pcrelib/LICENCE b/ext/pcre/pcrelib/LICENCE new file mode 100644 index 0000000000..f305033c16 --- /dev/null +++ b/ext/pcre/pcrelib/LICENCE @@ -0,0 +1,32 @@ +PCRE LICENCE +------------ + +PCRE is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. + +Written by: Philip Hazel + +University of Cambridge Computing Service, +Cambridge, England. Phone: +44 1223 334714. + +Copyright (c) 1997-2000 University of Cambridge + +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. + +End diff --git a/ext/pcre/pcrelib/Makefile.in b/ext/pcre/pcrelib/Makefile.in new file mode 100644 index 0000000000..3d435c758b --- /dev/null +++ b/ext/pcre/pcrelib/Makefile.in @@ -0,0 +1,11 @@ + +DEPTH = ../../.. +topsrcdir = @topsrcdir@ +srcdir = @srcdir@ +VPATH = @srcdir@ + +LTLIBRARY_NAME = libpcre.la +LTLIBRARY_SOURCES = maketables.c get.c study.c pcre.c + +include $(topsrcdir)/build/ltlib.mk + diff --git a/ext/pcre/pcrelib/NEWS b/ext/pcre/pcrelib/NEWS new file mode 100644 index 0000000000..4c80bd6833 --- /dev/null +++ b/ext/pcre/pcrelib/NEWS @@ -0,0 +1,46 @@ +News about PCRE releases +------------------------ + +Release 3.0 01-Feb-00 +--------------------- + +1. A "configure" script is now used to configure PCRE for Unix systems. It +builds a Makefile, a config.h file, and the pcre-config script. + +2. PCRE is built as a shared library by default. + +3. There is support for POSIX classes such as [:alpha:]. + +5. There is an experimental recursion feature. + +---------------------------------------------------------------------------- + IMPORTANT FOR THOSE UPGRADING FROM VERSIONS BEFORE 2.00 + +Please note that there has been a change in the API such that a larger +ovector is required at matching time, to provide some additional workspace. +The new man page has details. This change was necessary in order to support +some of the new functionality in Perl 5.005. + + IMPORTANT FOR THOSE UPGRADING FROM VERSION 2.00 + +Another (I hope this is the last!) change has been made to the API for the +pcre_compile() function. An additional argument has been added to make it +possible to pass over a pointer to character tables built in the current +locale by pcre_maketables(). To use the default tables, this new arguement +should be passed as NULL. + + IMPORTANT FOR THOSE UPGRADING FROM VERSION 2.05 + +Yet another (and again I hope this really is the last) change has been made +to the API for the pcre_exec() function. An additional argument has been +added to make it possible to start the match other than at the start of the +subject string. This is important if there are lookbehinds. The new man +page has the details, but you just want to convert existing programs, all +you need to do is to stick in a new fifth argument to pcre_exec(), with a +value of zero. For example, change + + pcre_exec(pattern, extra, subject, length, options, ovec, ovecsize) +to + pcre_exec(pattern, extra, subject, length, 0, options, ovec, ovecsize) + +**** diff --git a/ext/pcre/pcrelib/NON-UNIX-USE b/ext/pcre/pcrelib/NON-UNIX-USE new file mode 100644 index 0000000000..09a743245b --- /dev/null +++ b/ext/pcre/pcrelib/NON-UNIX-USE @@ -0,0 +1,50 @@ +Compiling PCRE on non-Unix systems +---------------------------------- + +If you want to compile PCRE for a non-Unix system, note that it consists +entirely of code written in Standard C, and so should compile successfully +on any machine with a Standard C compiler and library, using normal compiling +commands to do the following: + +(1) Copy or rename the file config.in as config.h, and change the macros that +define HAVE_STRERROR and HAVE_MEMMOVE to define them as 1 rather than 0. +Unfortunately, because of the way Unix autoconf works, the default setting has +to be 0. + +(2) Copy or rename the file pcre.in as pcre.h, and change the macro definitions +for PCRE_MAJOR, PCRE_MINOR, and PCRE_DATE near its start to the values set in +configure.in. + +(3) Compile dftables.c as a stand-alone program, and then run it with +the standard output sent to chartables.c. This generates a set of standard +character tables. + +(4) Compile maketables.c, get.c, study.c and pcre.c and link them all +together into an object library in whichever form your system keeps such +libraries. This is the pcre library (chartables.c gets included by means of an +#include directive). + +(5) Similarly, compile pcreposix.c and link it as the pcreposix library. + +(6) Compile the test program pcretest.c. This needs the functions in the +pcre and pcreposix libraries when linking. + +(7) Run pcretest on the testinput files in the testdata directory, and check +that the output matches the corresponding testoutput files. You must use the +-i option when checking testinput2. + +If you have a system without "configure" but where you can use a Makefile, edit +Makefile.in to create Makefile, substituting suitable values for the variables +at the head of the file. + +Some help in building a Win32 DLL of PCRE in GnuWin32 environments was +contributed by Paul.Sokolovsky@technologist.com. These environments are +Mingw32 (http://www.xraylith.wisc.edu/~khan/software/gnu-win32/) and +CygWin (http://sourceware.cygnus.com/cygwin/). Paul comments: + + For CygWin, set CFLAGS=-mno-cygwin, and do 'make dll'. You'll get + pcre.dll (containing pcreposix also), libpcre.dll.a, and dynamically + linked pgrep and pcretest. If you have /bin/sh, run RunTest (three + main test go ok, locale not supported). + +**** diff --git a/ext/pcre/pcrelib/README b/ext/pcre/pcrelib/README new file mode 100644 index 0000000000..90aaf4d6a6 --- /dev/null +++ b/ext/pcre/pcrelib/README @@ -0,0 +1,239 @@ +README file for PCRE (Perl-compatible regular expression library) +----------------------------------------------------------------- + +The latest release of PCRE is always available from + + ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-xxx.tar.gz + +Please read the NEWS file if you are upgrading from a previous release. + + +Building PCRE on a Unix system +------------------------------ + +To build PCRE on a Unix system, run the "configure" command in the PCRE +distribution directory. This is a standard GNU "autoconf" configuration script, +for which generic instructions are supplied in INSTALL. On many systems just +running "./configure" is sufficient, but the usual methods of changing standard +defaults are available. For example + +CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local + +specifies that the C compiler should be run with the flags '-O2 -Wall' instead +of the default, and that "make install" should install PCRE under /opt/local +instead of the default /usr/local. The "configure" script builds thre files: + +. Makefile is built by copying Makefile.in and making substitutions. +. config.h is built by copying config.in and making substitutions. +. pcre-config is built by copying pcre-config.in and making substitutions. + +Once "configure" has run, you can run "make". It builds two libraries called +libpcre and libpcreposix, a test program called pcretest, and the pgrep +command. You can use "make install" to copy these, and the public header file +pcre.h, to appropriate live directories on your system, in the normal way. + +Running "make install" also installs the command pcre-config, which can be used +to recall information about the PCRE configuration and installation. For +example, + + pcre-config --version + +prints the version number, and + + pcre-config --libs + +outputs information about where the library is installed. This command can be +included in makefiles for programs that use PCRE, saving the programmer from +having to remember too many details. + + +Shared libraries on Unix systems +-------------------------------- + +The default distribution builds PCRE as two shared libraries. This support is +new and experimental and may not work on all systems. It relies on the +"libtool" scripts - these are distributed with PCRE. It should build a +"libtool" script and use this to compile and link shared libraries, which are +placed in a subdirectory called .libs. The programs pcretest and pgrep are +built to use these uninstalled libraries by means of wrapper scripts. When you +use "make install" to install shared libraries, pgrep and pcretest are +automatically re-built to use the newly installed libraries. However, only +pgrep is installed, as pcretest is really just a test program. + +To build PCRE using static libraries you must use --disable-shared when +configuring it. For example + +./configure --prefix=/usr/gnu --disable-shared + +Then run "make" in the usual way. + + +Building on non-Unix systems +---------------------------- + +For a non-Unix system, read the comments in the file NON-UNIX-USE. PCRE has +been compiled on Windows systems and on Macintoshes, but I don't know the +details because I don't use those systems. It should be straightforward to +build PCRE on any system that has a Standard C compiler, because it uses only +Standard C functions. + + +Testing PCRE +------------ + +To test PCRE on a Unix system, run the RunTest script in the pcre directory. +(This can also be run by "make runtest" or "make check".) For other systems, +see the instruction in NON-UNIX-USE. + +The script runs the pcretest test program (which is documented in +doc/pcretest.txt) on each of the testinput files (in the testdata directory) in +turn, and compares the output with the contents of the corresponding testoutput +file. A file called testtry is used to hold the output from pcretest. To run +pcretest on just one of the test files, give its number as an argument to +RunTest, for example: + + RunTest 3 + +The first and third test files can also be fed directly into the perltest +script to check that Perl gives the same results. The third file requires the +additional features of release 5.005, which is why it is kept separate from the +main test input, which needs only Perl 5.004. In the long run, when 5.005 is +widespread, these two test files may get amalgamated. + +The second set of tests check pcre_info(), pcre_study(), pcre_copy_substring(), +pcre_get_substring(), pcre_get_substring_list(), error detection and run-time +flags that are specific to PCRE, as well as the POSIX wrapper API. + +The fourth set of tests checks pcre_maketables(), the facility for building a +set of character tables for a specific locale and using them instead of the +default tables. The tests make use of the "fr" (French) locale. Before running +the test, the script checks for the presence of this locale by running the +"locale" command. If that command fails, or if it doesn't include "fr" in the +list of available locales, the fourth test cannot be run, and a comment is +output to say why. If running this test produces instances of the error + + ** Failed to set locale "fr" + +in the comparison output, it means that locale is not available on your system, +despite being listed by "locale". This does not mean that PCRE is broken. + +PCRE has its own native API, but a set of "wrapper" functions that are based on +the POSIX API are also supplied in the library libpcreposix.a. Note that this +just provides a POSIX calling interface to PCRE: the regular expressions +themselves still follow Perl syntax and semantics. The header file +for the POSIX-style functions is called pcreposix.h. The official POSIX name is +regex.h, but I didn't want to risk possible problems with existing files of +that name by distributing it that way. To use it with an existing program that +uses the POSIX API, it will have to be renamed or pointed at by a link. + + +Character tables +---------------- + +PCRE uses four tables for manipulating and identifying characters. The final +argument of the pcre_compile() function is a pointer to a block of memory +containing the concatenated tables. A call to pcre_maketables() can be used to +generate a set of tables in the current locale. If the final argument for +pcre_compile() is passed as NULL, a set of default tables that is built into +the binary is used. + +The source file called chartables.c contains the default set of tables. This is +not supplied in the distribution, but is built by the program dftables +(compiled from dftables.c), which uses the ANSI C character handling functions +such as isalnum(), isalpha(), isupper(), islower(), etc. to build the table +sources. This means that the default C locale which is set for your system will +control the contents of these default tables. You can change the default tables +by editing chartables.c and then re-building PCRE. If you do this, you should +probably also edit Makefile to ensure that the file doesn't ever get +re-generated. + +The first two 256-byte tables provide lower casing and case flipping functions, +respectively. The next table consists of three 32-byte bit maps which identify +digits, "word" characters, and white space, respectively. These are used when +building 32-byte bit maps that represent character classes. + +The final 256-byte table has bits indicating various character types, as +follows: + + 1 white space character + 2 letter + 4 decimal digit + 8 hexadecimal digit + 16 alphanumeric or '_' + 128 regular expression metacharacter or binary zero + +You should not alter the set of characters that contain the 128 bit, as that +will cause PCRE to malfunction. + + +Manifest +-------- + +The distribution should contain the following files: + +(A) The actual source files of the PCRE library functions and their + headers: + + dftables.c auxiliary program for building chartables.c + get.c ) + maketables.c ) + study.c ) source of + pcre.c ) the functions + pcreposix.c ) + pcre.in "source" for the header for the external API; pcre.h + is built from this by "configure" + pcreposix.h header for the external POSIX wrapper API + internal.h header for internal use + config.in template for config.h, which is built by configure + +(B) Auxiliary files: + + AUTHORS information about the author of PCRE + ChangeLog log of changes to the code + INSTALL generic installation instructions + LICENCE conditions for the use of PCRE + COPYING the same, using GNU's standard name + Makefile.in template for Unix Makefile, which is built by configure + NEWS important changes in this release + NON-UNIX-USE notes on building PCRE on non-Unix systems + README this file + RunTest a Unix shell script for running tests + config.guess ) files used by libtool, + config.sub ) used only when building a shared library + configure a configuring shell script (built by autoconf) + configure.in the autoconf input used to build configure + doc/Tech.Notes notes on the encoding + doc/pcre.3 man page source for the PCRE functions + doc/pcre.html HTML version + doc/pcre.txt plain text version + doc/pcreposix.3 man page source for the POSIX wrapper API + doc/pcreposix.html HTML version + doc/pcreposix.txt plain text version + doc/pcretest.txt documentation of test program + doc/perltest.txt documentation of Perl test program + doc/pgrep.1 man page source for the pgrep utility + doc/pgrep.html HTML version + doc/pgrep.txt plain text version + install-sh a shell script for installing files + ltconfig ) files used to build "libtool", + ltmain.sh ) used only when building a shared library + pcretest.c test program + perltest Perl test program + pgrep.c source of a grep utility that uses PCRE + pcre-config.in source of script which retains PCRE information + testdata/testinput1 test data, compatible with Perl 5.004 and 5.005 + testdata/testinput2 test data for error messages and non-Perl things + testdata/testinput3 test data, compatible with Perl 5.005 + testdata/testinput4 test data for locale-specific tests + testdata/testoutput1 test results corresponding to testinput1 + testdata/testoutput2 test results corresponding to testinput2 + testdata/testoutput3 test results corresponding to testinput3 + testdata/testoutput4 test results corresponding to testinput4 + +(C) Auxiliary files for Win32 DLL + + dll.mk + pcre.def + +Philip Hazel +February 2000 diff --git a/ext/pcre/pcrelib/RunTest b/ext/pcre/pcrelib/RunTest new file mode 100755 index 0000000000..85eeb6245e --- /dev/null +++ b/ext/pcre/pcrelib/RunTest @@ -0,0 +1,94 @@ +#! /bin/sh + +# Run PCRE tests + +cf=diff + +# Select which tests to run; if no selection, run all + +do1=no +do2=no +do3=no +do4=no + +while [ $# -gt 0 ] ; do + case $1 in + 1) do1=yes;; + 2) do2=yes;; + 3) do3=yes;; + 4) do4=yes;; + *) echo "Unknown test number $1"; exit 1;; + esac + shift +done + +if [ $do1 = no -a $do2 = no -a $do3 = no -a $do4 = no ] ; then + do1=yes + do2=yes + do3=yes + do4=yes +fi + +# Primary test, Perl-compatible + +if [ $do1 = yes ] ; then + echo "Testing main functionality (Perl compatible)" + ./pcretest testdata/testinput1 testtry + if [ $? = 0 ] ; then + $cf testtry testdata/testoutput1 + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi +fi + +# PCRE tests that are not Perl-compatible - API & error tests, mostly + +if [ $do2 = yes ] ; then + echo "Testing API and error handling (not Perl compatible)" + ./pcretest -i testdata/testinput2 testtry + if [ $? = 0 ] ; then + $cf testtry testdata/testoutput2 + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi +fi + +# Additional Perl-compatible tests for Perl 5.005's new features + +if [ $do3 = yes ] ; then + echo "Testing Perl 5.005 features (Perl 5.005 compatible)" + ./pcretest testdata/testinput3 testtry + if [ $? = 0 ] ; then + $cf testtry testdata/testoutput3 + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi +fi + +if [ $do1 = yes -a $do2 = yes -a $do3 = yes ] ; then + echo "The three main tests all ran OK" + echo " " +fi + +# Locale-specific tests, provided the "fr" locale is available + +if [ $do4 = yes ] ; then + locale -a | grep '^fr$' >/dev/null + if [ $? -eq 0 ] ; then + echo "Testing locale-specific features (using 'fr' locale)" + ./pcretest testdata/testinput4 testtry + if [ $? = 0 ] ; then + $cf testtry testdata/testoutput4 + if [ $? != 0 ] ; then exit 1; fi + echo "Locale test ran OK" + echo " " + else exit 1 + fi + else + echo "Cannot test locale-specific features - 'fr' locale not found," + echo "or the \"locale\" command is not available to check for it." + echo " " + fi +fi + +# End diff --git a/ext/pcre/pcrelib/chartables.c b/ext/pcre/pcrelib/chartables.c new file mode 100644 index 0000000000..9055da2d2d --- /dev/null +++ b/ext/pcre/pcrelib/chartables.c @@ -0,0 +1,183 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* This file is automatically written by the dftables auxiliary +program. If you edit it by hand, you might like to edit the Makefile to +prevent its ever being regenerated. + +This file is #included in the compilation of pcre.c to build the default +character tables which are used when no tables are passed to the compile +function. */ + +static unsigned char pcre_default_tables[] = { + +/* This table is a lower casing table. */ + + 0, 1, 2, 3, 4, 5, 6, 7, + 8, 9, 10, 11, 12, 13, 14, 15, + 16, 17, 18, 19, 20, 21, 22, 23, + 24, 25, 26, 27, 28, 29, 30, 31, + 32, 33, 34, 35, 36, 37, 38, 39, + 40, 41, 42, 43, 44, 45, 46, 47, + 48, 49, 50, 51, 52, 53, 54, 55, + 56, 57, 58, 59, 60, 61, 62, 63, + 64, 97, 98, 99,100,101,102,103, + 104,105,106,107,108,109,110,111, + 112,113,114,115,116,117,118,119, + 120,121,122, 91, 92, 93, 94, 95, + 96, 97, 98, 99,100,101,102,103, + 104,105,106,107,108,109,110,111, + 112,113,114,115,116,117,118,119, + 120,121,122,123,124,125,126,127, + 128,129,130,131,132,133,134,135, + 136,137,138,139,140,141,142,143, + 144,145,146,147,148,149,150,151, + 152,153,154,155,156,157,158,159, + 160,161,162,163,164,165,166,167, + 168,169,170,171,172,173,174,175, + 176,177,178,179,180,181,182,183, + 184,185,186,187,188,189,190,191, + 192,193,194,195,196,197,198,199, + 200,201,202,203,204,205,206,207, + 208,209,210,211,212,213,214,215, + 216,217,218,219,220,221,222,223, + 224,225,226,227,228,229,230,231, + 232,233,234,235,236,237,238,239, + 240,241,242,243,244,245,246,247, + 248,249,250,251,252,253,254,255, + +/* This table is a case flipping table. */ + + 0, 1, 2, 3, 4, 5, 6, 7, + 8, 9, 10, 11, 12, 13, 14, 15, + 16, 17, 18, 19, 20, 21, 22, 23, + 24, 25, 26, 27, 28, 29, 30, 31, + 32, 33, 34, 35, 36, 37, 38, 39, + 40, 41, 42, 43, 44, 45, 46, 47, + 48, 49, 50, 51, 52, 53, 54, 55, + 56, 57, 58, 59, 60, 61, 62, 63, + 64, 97, 98, 99,100,101,102,103, + 104,105,106,107,108,109,110,111, + 112,113,114,115,116,117,118,119, + 120,121,122, 91, 92, 93, 94, 95, + 96, 65, 66, 67, 68, 69, 70, 71, + 72, 73, 74, 75, 76, 77, 78, 79, + 80, 81, 82, 83, 84, 85, 86, 87, + 88, 89, 90,123,124,125,126,127, + 128,129,130,131,132,133,134,135, + 136,137,138,139,140,141,142,143, + 144,145,146,147,148,149,150,151, + 152,153,154,155,156,157,158,159, + 160,161,162,163,164,165,166,167, + 168,169,170,171,172,173,174,175, + 176,177,178,179,180,181,182,183, + 184,185,186,187,188,189,190,191, + 192,193,194,195,196,197,198,199, + 200,201,202,203,204,205,206,207, + 208,209,210,211,212,213,214,215, + 216,217,218,219,220,221,222,223, + 224,225,226,227,228,229,230,231, + 232,233,234,235,236,237,238,239, + 240,241,242,243,244,245,246,247, + 248,249,250,251,252,253,254,255, + +/* This table contains bit maps for various character classes. +Each map is 32 bytes long and the bits run from the least +significant end of each byte. The classes that have their own +maps are: space, xdigit, digit, upper, lower, word, graph +print, punct, and cntrl. Other classes are built from combinations. */ + + 0x00,0x3e,0x00,0x00,0x01,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0x00,0x00,0x00,0x00,0x00,0x00,0xff,0x03, + 0x7e,0x00,0x00,0x00,0x7e,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0x00,0x00,0x00,0x00,0x00,0x00,0xff,0x03, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0xfe,0xff,0xff,0x07,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0xfe,0xff,0xff,0x07, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0x00,0x00,0x00,0x00,0x00,0x00,0xff,0x03, + 0xfe,0xff,0xff,0x87,0xfe,0xff,0xff,0x07, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0x00,0x00,0x00,0x00,0xfe,0xff,0xff,0xff, + 0xff,0xff,0xff,0xff,0xff,0xff,0xff,0x7f, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0x00,0x00,0x00,0x00,0xff,0xff,0xff,0xff, + 0xff,0xff,0xff,0xff,0xff,0xff,0xff,0x7f, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0x00,0x00,0x00,0x00,0xfe,0xff,0x00,0xfc, + 0x01,0x00,0x00,0xf8,0x01,0x00,0x00,0x78, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0xff,0xff,0xff,0xff,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x80, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + +/* This table identifies various classes of character by individual bits: + 0x01 white space character + 0x02 letter + 0x04 decimal digit + 0x08 hexadecimal digit + 0x10 alphanumeric or '_' + 0x80 regular expression metacharacter or binary zero +*/ + + 0x80,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 0- 7 */ + 0x00,0x01,0x01,0x01,0x01,0x01,0x00,0x00, /* 8- 15 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 16- 23 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 24- 31 */ + 0x01,0x00,0x00,0x00,0x80,0x00,0x00,0x00, /* - ' */ + 0x80,0x80,0x80,0x80,0x00,0x00,0x80,0x00, /* ( - / */ + 0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c, /* 0 - 7 */ + 0x1c,0x1c,0x00,0x00,0x00,0x00,0x00,0x80, /* 8 - ? */ + 0x00,0x1a,0x1a,0x1a,0x1a,0x1a,0x1a,0x12, /* @ - G */ + 0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* H - O */ + 0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* P - W */ + 0x12,0x12,0x12,0x80,0x00,0x00,0x80,0x10, /* X - _ */ + 0x00,0x1a,0x1a,0x1a,0x1a,0x1a,0x1a,0x12, /* ` - g */ + 0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* h - o */ + 0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* p - w */ + 0x12,0x12,0x12,0x80,0x80,0x00,0x00,0x00, /* x -127 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 128-135 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 136-143 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 144-151 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 152-159 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 160-167 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 168-175 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 176-183 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 184-191 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 192-199 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 200-207 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 208-215 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 216-223 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 224-231 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 232-239 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 240-247 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};/* 248-255 */ + +/* End of chartables.c */ diff --git a/ext/pcre/pcrelib/dftables.c b/ext/pcre/pcrelib/dftables.c new file mode 100644 index 0000000000..d572dfd3e6 --- /dev/null +++ b/ext/pcre/pcrelib/dftables.c @@ -0,0 +1,148 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +PCRE is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. + +Written by: Philip Hazel + + Copyright (c) 1997-2000 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- + +See the file Tech.Notes for some information on the internals. +*/ + + +/* This is a support program to generate the file chartables.c, containing +character tables of various kinds. They are built according to the default C +locale and used as the default tables by PCRE. Now that pcre_maketables is +a function visible to the outside world, we make use of its code from here in +order to be consistent. */ + +#include +#include +#include + +#include "internal.h" + +#define DFTABLES /* maketables.c notices this */ +#include "maketables.c" + + +int main(void) +{ +int i; +unsigned const char *tables = pcre_maketables(); + +printf( + "/*************************************************\n" + "* Perl-Compatible Regular Expressions *\n" + "*************************************************/\n\n" + "/* This file is automatically written by the dftables auxiliary \n" + "program. If you edit it by hand, you might like to edit the Makefile to \n" + "prevent its ever being regenerated.\n\n" + "This file is #included in the compilation of pcre.c to build the default\n" + "character tables which are used when no tables are passed to the compile\n" + "function. */\n\n" + "static unsigned char pcre_default_tables[] = {\n\n" + "/* This table is a lower casing table. */\n\n"); + +printf(" "); +for (i = 0; i < 256; i++) + { + if ((i & 7) == 0 && i != 0) printf("\n "); + printf("%3d", *tables++); + if (i != 255) printf(","); + } +printf(",\n\n"); + +printf("/* This table is a case flipping table. */\n\n"); + +printf(" "); +for (i = 0; i < 256; i++) + { + if ((i & 7) == 0 && i != 0) printf("\n "); + printf("%3d", *tables++); + if (i != 255) printf(","); + } +printf(",\n\n"); + +printf( + "/* This table contains bit maps for various character classes.\n" + "Each map is 32 bytes long and the bits run from the least\n" + "significant end of each byte. The classes that have their own\n" + "maps are: space, xdigit, digit, upper, lower, word, graph\n" + "print, punct, and cntrl. Other classes are built from combinations. */\n\n"); + +printf(" "); +for (i = 0; i < cbit_length; i++) + { + if ((i & 7) == 0 && i != 0) + { + if ((i & 31) == 0) printf("\n"); + printf("\n "); + } + printf("0x%02x", *tables++); + if (i != cbit_length - 1) printf(","); + } +printf(",\n\n"); + +printf( + "/* This table identifies various classes of character by individual bits:\n" + " 0x%02x white space character\n" + " 0x%02x letter\n" + " 0x%02x decimal digit\n" + " 0x%02x hexadecimal digit\n" + " 0x%02x alphanumeric or '_'\n" + " 0x%02x regular expression metacharacter or binary zero\n*/\n\n", + ctype_space, ctype_letter, ctype_digit, ctype_xdigit, ctype_word, + ctype_meta); + +printf(" "); +for (i = 0; i < 256; i++) + { + if ((i & 7) == 0 && i != 0) + { + printf(" /* "); + if (isprint(i-8)) printf(" %c -", i-8); + else printf("%3d-", i-8); + if (isprint(i-1)) printf(" %c ", i-1); + else printf("%3d", i-1); + printf(" */\n "); + } + printf("0x%02x", *tables++); + if (i != 255) printf(","); + } + +printf("};/* "); +if (isprint(i-8)) printf(" %c -", i-8); + else printf("%3d-", i-8); +if (isprint(i-1)) printf(" %c ", i-1); + else printf("%3d", i-1); +printf(" */\n\n/* End of chartables.c */\n"); + +return 0; +} + +/* End of dftables.c */ diff --git a/ext/pcre/pcrelib/dll.mk b/ext/pcre/pcrelib/dll.mk new file mode 100644 index 0000000000..d8b728e57e --- /dev/null +++ b/ext/pcre/pcrelib/dll.mk @@ -0,0 +1,60 @@ +# dll.mk - auxilary Makefile to easy build dll's for mingw32 target +# ver. 0.6 of 1999-03-25 +# +# Homepage of this makefile - http://www.is.lg.ua/~paul/devel/ +# Homepage of original mingw32 project - +# http://www.fu.is.saga-u.ac.jp/~colin/gcc.html +# +# How to use: +# This makefile can: +# 1. Create automatical .def file from list of objects +# 2. Create .dll from objects and .def file, either automatical, or your +# hand-written (maybe) file, which must have same basename as dll +# WARNING! There MUST be object, which name match dll's name. Make sux. +# 3. Create import library from .def (as for .dll, only its name required, +# not dll itself) +# By convention implibs for dll have .dll.a suffix, e.g. libstuff.dll.a +# Why not just libstuff.a? 'Cos that's name for static lib, ok? +# Process divided into 3 phases because: +# 1. Pre-existent .def possible +# 2. Generating implib is enough time-consuming +# +# Variables: +# DLL_LDLIBS - libs for linking dll +# DLL_LDFLAGS - flags for linking dll +# +# By using $(DLL_SUFFIX) instead of 'dll', e.g. stuff.$(DLL_SUFFIX) +# you may help porting makefiles to other platforms +# +# Put this file in your make's include path (e.g. main include dir, for +# more information see include section in make doc). Put in the beginning +# of your own Makefile line "include dll.mk". Specify dependences, e.g.: +# +# Do all stuff in one step +# libstuff.dll.a: $(OBJECTS) stuff.def +# stuff.def: $(OBJECTS) +# +# Steps separated, pre-provided .def, link with user32 +# +# DLL_LDLIBS=-luser32 +# stuff.dll: $(OBJECTS) +# libstuff.dll.a: $(OBJECTS) + + +DLLWRAP=dllwrap +DLLTOOL=dlltool + +DLL_SUFFIX=dll + +.SUFFIXES: .o .$(DLL_SUFFIX) + +_%.def: %.o + $(DLLTOOL) --export-all --output-def $@ $^ + +%.$(DLL_SUFFIX): %.o + $(DLLWRAP) --dllname $(notdir $@) --driver-name $(CC) --def $*.def -o $@ $(filter %.o,$^) $(DLL_LDFLAGS) $(DLL_LDLIBS) + +lib%.$(DLL_SUFFIX).a:%.def + $(DLLTOOL) --dllname $(notdir $*.dll) --def $< --output-lib $@ + +# End diff --git a/ext/pcre/pcrelib/get.c b/ext/pcre/pcrelib/get.c new file mode 100644 index 0000000000..035668e301 --- /dev/null +++ b/ext/pcre/pcrelib/get.c @@ -0,0 +1,189 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +Written by: Philip Hazel + + Copyright (c) 1997-1999 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- +*/ + +/* This module contains some convenience functions for extracting substrings +from the subject string after a regex match has succeeded. The original idea +for these functions came from Scott Wimer . */ + + +/* Include the internals header, which itself includes Standard C headers plus +the external pcre header. */ + +#include "internal.h" + + + +/************************************************* +* Copy captured string to given buffer * +*************************************************/ + +/* This function copies a single captured substring into a given buffer. +Note that we use memcpy() rather than strncpy() in case there are binary zeros +in the string. + +Arguments: + subject the subject string that was matched + ovector pointer to the offsets table + stringcount the number of substrings that were captured + (i.e. the yield of the pcre_exec call, unless + that was zero, in which case it should be 1/3 + of the offset table size) + stringnumber the number of the required substring + buffer where to put the substring + size the size of the buffer + +Returns: if successful: + the length of the copied string, not including the zero + that is put on the end; can be zero + if not successful: + PCRE_ERROR_NOMEMORY (-6) buffer too small + PCRE_ERROR_NOSUBSTRING (-7) no such captured substring +*/ + +int +pcre_copy_substring(const char *subject, int *ovector, int stringcount, + int stringnumber, char *buffer, int size) +{ +int yield; +if (stringnumber < 0 || stringnumber >= stringcount) + return PCRE_ERROR_NOSUBSTRING; +stringnumber *= 2; +yield = ovector[stringnumber+1] - ovector[stringnumber]; +if (size < yield + 1) return PCRE_ERROR_NOMEMORY; +memcpy(buffer, subject + ovector[stringnumber], yield); +buffer[yield] = 0; +return yield; +} + + + +/************************************************* +* Copy all captured strings to new store * +*************************************************/ + +/* This function gets one chunk of store and builds a list of pointers and all +of the captured substrings in it. A NULL pointer is put on the end of the list. + +Arguments: + subject the subject string that was matched + ovector pointer to the offsets table + stringcount the number of substrings that were captured + (i.e. the yield of the pcre_exec call, unless + that was zero, in which case it should be 1/3 + of the offset table size) + listptr set to point to the list of pointers + +Returns: if successful: 0 + if not successful: + PCRE_ERROR_NOMEMORY (-6) failed to get store +*/ + +int +pcre_get_substring_list(const char *subject, int *ovector, int stringcount, + const char ***listptr) +{ +int i; +int size = sizeof(char *); +int double_count = stringcount * 2; +char **stringlist; +char *p; + +for (i = 0; i < double_count; i += 2) + size += sizeof(char *) + ovector[i+1] - ovector[i] + 1; + +stringlist = (char **)(pcre_malloc)(size); +if (stringlist == NULL) return PCRE_ERROR_NOMEMORY; + +*listptr = (const char **)stringlist; +p = (char *)(stringlist + stringcount + 1); + +for (i = 0; i < double_count; i += 2) + { + int len = ovector[i+1] - ovector[i]; + memcpy(p, subject + ovector[i], len); + *stringlist++ = p; + p += len; + *p++ = 0; + } + +*stringlist = NULL; +return 0; +} + + + +/************************************************* +* Copy captured string to new store * +*************************************************/ + +/* This function copies a single captured substring into a piece of new +store + +Arguments: + subject the subject string that was matched + ovector pointer to the offsets table + stringcount the number of substrings that were captured + (i.e. the yield of the pcre_exec call, unless + that was zero, in which case it should be 1/3 + of the offset table size) + stringnumber the number of the required substring + stringptr where to put a pointer to the substring + +Returns: if successful: + the length of the string, not including the zero that + is put on the end; can be zero + if not successful: + PCRE_ERROR_NOMEMORY (-6) failed to get store + PCRE_ERROR_NOSUBSTRING (-7) substring not present +*/ + +int +pcre_get_substring(const char *subject, int *ovector, int stringcount, + int stringnumber, const char **stringptr) +{ +int yield; +char *substring; +if (stringnumber < 0 || stringnumber >= stringcount) + return PCRE_ERROR_NOSUBSTRING; +stringnumber *= 2; +yield = ovector[stringnumber+1] - ovector[stringnumber]; +substring = (char *)(pcre_malloc)(yield + 1); +if (substring == NULL) return PCRE_ERROR_NOMEMORY; +memcpy(substring, subject + ovector[stringnumber], yield); +substring[yield] = 0; +*stringptr = substring; +return yield; +} + +/* End of get.c */ diff --git a/ext/pcre/pcrelib/internal.h b/ext/pcre/pcrelib/internal.h new file mode 100644 index 0000000000..d01ee3f16e --- /dev/null +++ b/ext/pcre/pcrelib/internal.h @@ -0,0 +1,360 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + + +/* This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +Written by: Philip Hazel + + Copyright (c) 1997-2000 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- +*/ + +/* This header contains definitions that are shared between the different +modules, but which are not relevant to the outside. */ + +/* Get the definitions provided by running "configure" */ + +#include "php_config.h" + +/* To cope with SunOS4 and other systems that lack memmove() but have bcopy(), +define a macro for memmove() if HAVE_MEMMOVE is false. */ + +#if ! HAVE_MEMMOVE +#undef memmove /* some systems may have a macro */ +#define memmove(a, b, c) bcopy(b, a, c) +#endif + +/* Standard C headers plus the external interface definition */ + +#include +#include +#include +#include +#include +#include +#include "pcre.h" + +/* In case there is no definition of offsetof() provided - though any proper +Standard C system should have one. */ + +#ifndef offsetof +#define offsetof(p_type,field) ((size_t)&(((p_type *)0)->field)) +#endif + +/* These are the public options that can change during matching. */ + +#define PCRE_IMS (PCRE_CASELESS|PCRE_MULTILINE|PCRE_DOTALL) + +/* Private options flags start at the most significant end of the four bytes, +but skip the top bit so we can use ints for convenience without getting tangled +with negative values. The public options defined in pcre.h start at the least +significant end. Make sure they don't overlap, though now that we have expanded +to four bytes there is plenty of space. */ + +#define PCRE_FIRSTSET 0x40000000 /* first_char is set */ +#define PCRE_REQCHSET 0x20000000 /* req_char is set */ +#define PCRE_STARTLINE 0x10000000 /* start after \n for multiline */ +#define PCRE_INGROUP 0x08000000 /* compiling inside a group */ +#define PCRE_ICHANGED 0x04000000 /* i option changes within regex */ + +/* Options for the "extra" block produced by pcre_study(). */ + +#define PCRE_STUDY_MAPPED 0x01 /* a map of starting chars exists */ + +/* Masks for identifying the public options which are permitted at compile +time, run time or study time, respectively. */ + +#define PUBLIC_OPTIONS \ + (PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \ + PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY) + +#define PUBLIC_EXEC_OPTIONS \ + (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY) + +#define PUBLIC_STUDY_OPTIONS 0 /* None defined */ + +/* Magic number to provide a small check against being handed junk. */ + +#define MAGIC_NUMBER 0x50435245UL /* 'PCRE' */ + +/* Miscellaneous definitions */ + +typedef int BOOL; + +#define FALSE 0 +#define TRUE 1 + +/* These are escaped items that aren't just an encoding of a particular data +value such as \n. They must have non-zero values, as check_escape() returns +their negation. Also, they must appear in the same order as in the opcode +definitions below, up to ESC_z. The final one must be ESC_REF as subsequent +values are used for \1, \2, \3, etc. There is a test in the code for an escape +greater than ESC_b and less than ESC_X to detect the types that may be +repeated. If any new escapes are put in-between that don't consume a character, +that code will have to change. */ + +enum { ESC_A = 1, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s, ESC_W, ESC_w, + ESC_Z, ESC_z, ESC_REF }; + +/* Opcode table: OP_BRA must be last, as all values >= it are used for brackets +that extract substrings. Starting from 1 (i.e. after OP_END), the values up to +OP_EOD must correspond in order to the list of escapes immediately above. */ + +enum { + OP_END, /* End of pattern */ + + /* Values corresponding to backslashed metacharacters */ + + OP_SOD, /* Start of data: \A */ + OP_NOT_WORD_BOUNDARY, /* \B */ + OP_WORD_BOUNDARY, /* \b */ + OP_NOT_DIGIT, /* \D */ + OP_DIGIT, /* \d */ + OP_NOT_WHITESPACE, /* \S */ + OP_WHITESPACE, /* \s */ + OP_NOT_WORDCHAR, /* \W */ + OP_WORDCHAR, /* \w */ + OP_EODN, /* End of data or \n at end of data: \Z. */ + OP_EOD, /* End of data: \z */ + + OP_OPT, /* Set runtime options */ + OP_CIRC, /* Start of line - varies with multiline switch */ + OP_DOLL, /* End of line - varies with multiline switch */ + OP_ANY, /* Match any character */ + OP_CHARS, /* Match string of characters */ + OP_NOT, /* Match anything but the following char */ + + OP_STAR, /* The maximizing and minimizing versions of */ + OP_MINSTAR, /* all these opcodes must come in pairs, with */ + OP_PLUS, /* the minimizing one second. */ + OP_MINPLUS, /* This first set applies to single characters */ + OP_QUERY, + OP_MINQUERY, + OP_UPTO, /* From 0 to n matches */ + OP_MINUPTO, + OP_EXACT, /* Exactly n matches */ + + OP_NOTSTAR, /* The maximizing and minimizing versions of */ + OP_NOTMINSTAR, /* all these opcodes must come in pairs, with */ + OP_NOTPLUS, /* the minimizing one second. */ + OP_NOTMINPLUS, /* This first set applies to "not" single characters */ + OP_NOTQUERY, + OP_NOTMINQUERY, + OP_NOTUPTO, /* From 0 to n matches */ + OP_NOTMINUPTO, + OP_NOTEXACT, /* Exactly n matches */ + + OP_TYPESTAR, /* The maximizing and minimizing versions of */ + OP_TYPEMINSTAR, /* all these opcodes must come in pairs, with */ + OP_TYPEPLUS, /* the minimizing one second. These codes must */ + OP_TYPEMINPLUS, /* be in exactly the same order as those above. */ + OP_TYPEQUERY, /* This set applies to character types such as \d */ + OP_TYPEMINQUERY, + OP_TYPEUPTO, /* From 0 to n matches */ + OP_TYPEMINUPTO, + OP_TYPEEXACT, /* Exactly n matches */ + + OP_CRSTAR, /* The maximizing and minimizing versions of */ + OP_CRMINSTAR, /* all these opcodes must come in pairs, with */ + OP_CRPLUS, /* the minimizing one second. These codes must */ + OP_CRMINPLUS, /* be in exactly the same order as those above. */ + OP_CRQUERY, /* These are for character classes and back refs */ + OP_CRMINQUERY, + OP_CRRANGE, /* These are different to the three seta above. */ + OP_CRMINRANGE, + + OP_CLASS, /* Match a character class */ + OP_REF, /* Match a back reference */ + OP_RECURSE, /* Match this pattern recursively */ + + OP_ALT, /* Start of alternation */ + OP_KET, /* End of group that doesn't have an unbounded repeat */ + OP_KETRMAX, /* These two must remain together and in this */ + OP_KETRMIN, /* order. They are for groups the repeat for ever. */ + + /* The assertions must come before ONCE and COND */ + + OP_ASSERT, /* Positive lookahead */ + OP_ASSERT_NOT, /* Negative lookahead */ + OP_ASSERTBACK, /* Positive lookbehind */ + OP_ASSERTBACK_NOT, /* Negative lookbehind */ + OP_REVERSE, /* Move pointer back - used in lookbehind assertions */ + + /* ONCE and COND must come after the assertions, with ONCE first, as there's + a test for >= ONCE for a subpattern that isn't an assertion. */ + + OP_ONCE, /* Once matched, don't back up into the subpattern */ + OP_COND, /* Conditional group */ + OP_CREF, /* Used to hold an extraction string number */ + + OP_BRAZERO, /* These two must remain together and in this */ + OP_BRAMINZERO, /* order. */ + + OP_BRA /* This and greater values are used for brackets that + extract substrings. */ +}; + +/* The highest extraction number. This is limited by the number of opcodes +left after OP_BRA, i.e. 255 - OP_BRA. We actually set it somewhat lower. */ + +#define EXTRACT_MAX 99 + +/* The texts of compile-time error messages are defined as macros here so that +they can be accessed by the POSIX wrapper and converted into error codes. Yes, +I could have used error codes in the first place, but didn't feel like changing +just to accommodate the POSIX wrapper. */ + +#define ERR1 "\\ at end of pattern" +#define ERR2 "\\c at end of pattern" +#define ERR3 "unrecognized character follows \\" +#define ERR4 "numbers out of order in {} quantifier" +#define ERR5 "number too big in {} quantifier" +#define ERR6 "missing terminating ] for character class" +#define ERR7 "invalid escape sequence in character class" +#define ERR8 "range out of order in character class" +#define ERR9 "nothing to repeat" +#define ERR10 "operand of unlimited repeat could match the empty string" +#define ERR11 "internal error: unexpected repeat" +#define ERR12 "unrecognized character after (?" +#define ERR13 "too many capturing parenthesized sub-patterns" +#define ERR14 "missing )" +#define ERR15 "back reference to non-existent subpattern" +#define ERR16 "erroffset passed as NULL" +#define ERR17 "unknown option bit(s) set" +#define ERR18 "missing ) after comment" +#define ERR19 "too many sets of parentheses" +#define ERR20 "regular expression too large" +#define ERR21 "failed to get memory" +#define ERR22 "unmatched parentheses" +#define ERR23 "internal error: code overflow" +#define ERR24 "unrecognized character after (?<" +#define ERR25 "lookbehind assertion is not fixed length" +#define ERR26 "malformed number after (?(" +#define ERR27 "conditional group contains more than two branches" +#define ERR28 "assertion expected after (?(" +#define ERR29 "(?p must be followed by )" +#define ERR30 "unknown POSIX class name" +#define ERR31 "POSIX collating elements are not supported" + +/* All character handling must be done as unsigned characters. Otherwise there +are problems with top-bit-set characters and functions such as isspace(). +However, we leave the interface to the outside world as char *, because that +should make things easier for callers. We define a short type for unsigned char +to save lots of typing. I tried "uchar", but it causes problems on Digital +Unix, where it is defined in sys/types, so use "uschar" instead. */ + +typedef unsigned char uschar; + +/* The real format of the start of the pcre block; the actual code vector +runs on as long as necessary after the end. */ + +typedef struct real_pcre { + unsigned long int magic_number; + size_t size; + const unsigned char *tables; + unsigned long int options; + uschar top_bracket; + uschar top_backref; + uschar first_char; + uschar req_char; + uschar code[1]; +} real_pcre; + +/* The real format of the extra block returned by pcre_study(). */ + +typedef struct real_pcre_extra { + uschar options; + uschar start_bits[32]; +} real_pcre_extra; + + +/* Structure for passing "static" information around between the functions +doing the compiling, so that they are thread-safe. */ + +typedef struct compile_data { + const uschar *lcc; /* Points to lower casing table */ + const uschar *fcc; /* Points to case-flipping table */ + const uschar *cbits; /* Points to character type table */ + const uschar *ctypes; /* Points to table of type maps */ +} compile_data; + +/* Structure for passing "static" information around between the functions +doing the matching, so that they are thread-safe. */ + +typedef struct match_data { + int errorcode; /* As it says */ + int *offset_vector; /* Offset vector */ + int offset_end; /* One past the end */ + int offset_max; /* The maximum usable for return data */ + const uschar *lcc; /* Points to lower casing table */ + const uschar *ctypes; /* Points to table of type maps */ + BOOL offset_overflow; /* Set if too many extractions */ + BOOL notbol; /* NOTBOL flag */ + BOOL noteol; /* NOTEOL flag */ + BOOL endonly; /* Dollar not before final \n */ + BOOL notempty; /* Empty string match not wanted */ + const uschar *start_pattern; /* For use when recursing */ + const uschar *start_subject; /* Start of the subject string */ + const uschar *end_subject; /* End of the subject string */ + const uschar *start_match; /* Start of this match attempt */ + const uschar *end_match_ptr; /* Subject position at end match */ + int end_offset_top; /* Highwater mark at end of match */ +} match_data; + +/* Bit definitions for entries in the pcre_ctypes table. */ + +#define ctype_space 0x01 +#define ctype_letter 0x02 +#define ctype_digit 0x04 +#define ctype_xdigit 0x08 +#define ctype_word 0x10 /* alphameric or '_' */ +#define ctype_meta 0x80 /* regexp meta char or zero (end pattern) */ + +/* Offsets for the bitmap tables in pcre_cbits. Each table contains a set +of bits for a class map. Some classes are built by combining these tables. */ + +#define cbit_space 0 /* [:space:] or \s */ +#define cbit_xdigit 32 /* [:xdigit:] */ +#define cbit_digit 64 /* [:digit:] or \d */ +#define cbit_upper 96 /* [:upper:] */ +#define cbit_lower 128 /* [:lower:] */ +#define cbit_word 160 /* [:word:] or \w */ +#define cbit_graph 192 /* [:graph:] */ +#define cbit_print 224 /* [:print:] */ +#define cbit_punct 256 /* [:punct:] */ +#define cbit_cntrl 288 /* [:cntrl:] */ +#define cbit_length 320 /* Length of the cbits table */ + +/* Offsets of the various tables from the base tables pointer, and +total length. */ + +#define lcc_offset 0 +#define fcc_offset 256 +#define cbits_offset 512 +#define ctypes_offset (cbits_offset + cbit_length) +#define tables_length (ctypes_offset + 256) + +/* End of internal.h */ diff --git a/ext/pcre/pcrelib/maketables.c b/ext/pcre/pcrelib/maketables.c new file mode 100644 index 0000000000..c0f06c0375 --- /dev/null +++ b/ext/pcre/pcrelib/maketables.c @@ -0,0 +1,132 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +PCRE is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. + +Written by: Philip Hazel + + Copyright (c) 1997-2000 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- + +See the file Tech.Notes for some information on the internals. +*/ + + +/* This file is compiled on its own as part of the PCRE library. However, +it is also included in the compilation of dftables.c, in which case the macro +DFTABLES is defined. */ + +#ifndef DFTABLES +#include "internal.h" +#endif + + + +/************************************************* +* Create PCRE character tables * +*************************************************/ + +/* This function builds a set of character tables for use by PCRE and returns +a pointer to them. They are build using the ctype functions, and consequently +their contents will depend upon the current locale setting. When compiled as +part of the library, the store is obtained via pcre_malloc(), but when compiled +inside dftables, use malloc(). + +Arguments: none +Returns: pointer to the contiguous block of data +*/ + +unsigned const char * +pcre_maketables(void) +{ +unsigned char *yield, *p; +int i; + +#ifndef DFTABLES +yield = (unsigned char*)(pcre_malloc)(tables_length); +#else +yield = (unsigned char*)malloc(tables_length); +#endif + +if (yield == NULL) return NULL; +p = yield; + +/* First comes the lower casing table */ + +for (i = 0; i < 256; i++) *p++ = tolower(i); + +/* Next the case-flipping table */ + +for (i = 0; i < 256; i++) *p++ = islower(i)? toupper(i) : tolower(i); + +/* Then the character class tables. Don't try to be clever and save effort +on exclusive ones - in some locales things may be different. */ + +memset(p, 0, cbit_length); +for (i = 0; i < 256; i++) + { + if (isdigit(i)) + { + p[cbit_digit + i/8] |= 1 << (i&7); + p[cbit_word + i/8] |= 1 << (i&7); + } + if (isupper(i)) + { + p[cbit_upper + i/8] |= 1 << (i&7); + p[cbit_word + i/8] |= 1 << (i&7); + } + if (islower(i)) + { + p[cbit_lower + i/8] |= 1 << (i&7); + p[cbit_word + i/8] |= 1 << (i&7); + } + if (i == '_') p[cbit_word + i/8] |= 1 << (i&7); + if (isspace(i)) p[cbit_space + i/8] |= 1 << (i&7); + if (isxdigit(i))p[cbit_xdigit + i/8] |= 1 << (i&7); + if (isgraph(i)) p[cbit_graph + i/8] |= 1 << (i&7); + if (isprint(i)) p[cbit_print + i/8] |= 1 << (i&7); + if (ispunct(i)) p[cbit_punct + i/8] |= 1 << (i&7); + if (iscntrl(i)) p[cbit_cntrl + i/8] |= 1 << (i&7); + } +p += cbit_length; + +/* Finally, the character type table */ + +for (i = 0; i < 256; i++) + { + int x = 0; + if (isspace(i)) x += ctype_space; + if (isalpha(i)) x += ctype_letter; + if (isdigit(i)) x += ctype_digit; + if (isxdigit(i)) x += ctype_xdigit; + if (isalnum(i) || i == '_') x += ctype_word; + if (strchr("*+?{^.$|()[", i) != 0) x += ctype_meta; + *p++ = x; + } + +return yield; +} + +/* End of maketables.c */ diff --git a/ext/pcre/pcrelib/pcre-config.in b/ext/pcre/pcrelib/pcre-config.in new file mode 100644 index 0000000000..8daded9fe1 --- /dev/null +++ b/ext/pcre/pcrelib/pcre-config.in @@ -0,0 +1,59 @@ +#!/bin/sh + +prefix=@prefix@ +exec_prefix=@exec_prefix@ +exec_prefix_set=no + +usage="\ +Usage: pcre-config [--prefix] [--exec-prefix] [--version] [--libs] [--libs-posix] [--cflags] [--cflags-posix]" + +if test $# -eq 0; then + echo "${usage}" 1>&2 + exit 1 +fi + +while test $# -gt 0; do + case "$1" in + -*=*) optarg=`echo "$1" | sed 's/[-_a-zA-Z0-9]*=//'` ;; + *) optarg= ;; + esac + + case $1 in + --prefix=*) + prefix=$optarg + if test $exec_prefix_set = no ; then + exec_prefix=$optarg + fi + ;; + --prefix) + echo $prefix + ;; + --exec-prefix=*) + exec_prefix=$optarg + exec_prefix_set=yes + ;; + --exec-prefix) + echo $exec_prefix + ;; + --version) + echo @PCRE_VERSION@ + ;; + --cflags | --cflags-posix) + if test @includedir@ != /usr/include ; then + includes=-I@includedir@ + fi + echo $includes + ;; + --libs-posix) + echo -L@libdir@ -lpcreposix -lpcre + ;; + --libs) + echo -L@libdir@ -lpcre + ;; + *) + echo "${usage}" 1>&2 + exit 1 + ;; + esac + shift +done diff --git a/ext/pcre/pcrelib/pcre.c b/ext/pcre/pcrelib/pcre.c new file mode 100644 index 0000000000..e45dee8d96 --- /dev/null +++ b/ext/pcre/pcrelib/pcre.c @@ -0,0 +1,4769 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +Written by: Philip Hazel + + Copyright (c) 1997-2000 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- +*/ + + +/* Define DEBUG to get debugging output on stdout. */ + +/* #define DEBUG */ + +/* Use a macro for debugging printing, 'cause that eliminates the use of #ifdef +inline, and there are *still* stupid compilers about that don't like indented +pre-processor statements. I suppose it's only been 10 years... */ + +#ifdef DEBUG +#define DPRINTF(p) printf p +#else +#define DPRINTF(p) /*nothing*/ +#endif + +/* Include the internals header, which itself includes Standard C headers plus +the external pcre header. */ + +#include "internal.h" + + +/* Allow compilation as C++ source code, should anybody want to do that. */ + +#ifdef __cplusplus +#define class pcre_class +#endif + + +/* Number of items on the nested bracket stacks at compile time. This should +not be set greater than 200. */ + +#define BRASTACK_SIZE 200 + + +/* Min and max values for the common repeats; for the maxima, 0 => infinity */ + +static const char rep_min[] = { 0, 0, 1, 1, 0, 0 }; +static const char rep_max[] = { 0, 0, 0, 0, 1, 1 }; + +/* Text forms of OP_ values and things, for debugging (not all used) */ + +#ifdef DEBUG +static const char *OP_names[] = { + "End", "\\A", "\\B", "\\b", "\\D", "\\d", + "\\S", "\\s", "\\W", "\\w", "\\Z", "\\z", + "Opt", "^", "$", "Any", "chars", "not", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", + "class", "Ref", "Recurse", + "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", + "AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cref", + "Brazero", "Braminzero", "Bra" +}; +#endif + +/* Table for handling escaped characters in the range '0'-'z'. Positive returns +are simple data values; negative values are for special things like \d and so +on. Zero means further processing is needed (for things like \x), or the escape +is invalid. */ + +static const short int escapes[] = { + 0, 0, 0, 0, 0, 0, 0, 0, /* 0 - 7 */ + 0, 0, ':', ';', '<', '=', '>', '?', /* 8 - ? */ + '@', -ESC_A, -ESC_B, 0, -ESC_D, 0, 0, 0, /* @ - G */ + 0, 0, 0, 0, 0, 0, 0, 0, /* H - O */ + 0, 0, 0, -ESC_S, 0, 0, 0, -ESC_W, /* P - W */ + 0, 0, -ESC_Z, '[', '\\', ']', '^', '_', /* X - _ */ + '`', 7, -ESC_b, 0, -ESC_d, 27, '\f', 0, /* ` - g */ + 0, 0, 0, 0, 0, 0, '\n', 0, /* h - o */ + 0, 0, '\r', -ESC_s, '\t', 0, 0, -ESC_w, /* p - w */ + 0, 0, -ESC_z /* x - z */ +}; + +/* Tables of names of POSIX character classes and their lengths. The list is +terminated by a zero length entry. The first three must be alpha, upper, lower, +as this is assumed for handling case independence. */ + +static const char *posix_names[] = { + "alpha", "lower", "upper", + "alnum", "ascii", "cntrl", "digit", "graph", + "print", "punct", "space", "word", "xdigit" }; + +static const uschar posix_name_lengths[] = { + 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 6, 0 }; + +/* Table of class bit maps for each POSIX class; up to three may be combined +to form the class. */ + +static const int posix_class_maps[] = { + cbit_lower, cbit_upper, -1, /* alpha */ + cbit_lower, -1, -1, /* lower */ + cbit_upper, -1, -1, /* upper */ + cbit_digit, cbit_lower, cbit_upper, /* alnum */ + cbit_print, cbit_cntrl, -1, /* ascii */ + cbit_cntrl, -1, -1, /* cntrl */ + cbit_digit, -1, -1, /* digit */ + cbit_graph, -1, -1, /* graph */ + cbit_print, -1, -1, /* print */ + cbit_punct, -1, -1, /* punct */ + cbit_space, -1, -1, /* space */ + cbit_word, -1, -1, /* word */ + cbit_xdigit,-1, -1 /* xdigit */ +}; + + +/* Definition to allow mutual recursion */ + +static BOOL + compile_regex(int, int, int *, uschar **, const uschar **, const char **, + BOOL, int, int *, int *, compile_data *); + + + +/************************************************* +* Global variables * +*************************************************/ + +/* PCRE is thread-clean and doesn't use any global variables in the normal +sense. However, it calls memory allocation and free functions via the two +indirections below, which are can be changed by the caller, but are shared +between all threads. */ + +void *(*pcre_malloc)(size_t) = malloc; +void (*pcre_free)(void *) = free; + + + + +/************************************************* +* Default character tables * +*************************************************/ + +/* A default set of character tables is included in the PCRE binary. Its source +is built by the maketables auxiliary program, which uses the default C ctypes +functions, and put in the file chartables.c. These tables are used by PCRE +whenever the caller of pcre_compile() does not provide an alternate set of +tables. */ + +#include "chartables.c" + + + +/************************************************* +* Return version string * +*************************************************/ + +#define STRING(a) # a +#define XSTRING(s) STRING(s) + +const char * +pcre_version(void) +{ +return XSTRING(PCRE_MAJOR) "." XSTRING(PCRE_MINOR) " " XSTRING(PCRE_DATE); +} + + + + +/************************************************* +* (Obsolete) Return info about compiled pattern * +*************************************************/ + +/* This is the original "info" function. It picks potentially useful data out +of the private structure, but its interface was too rigid. It remains for +backwards compatibility. The public options are passed back in an int - though +the re->options field has been expanded to a long int, all the public options +at the low end of it, and so even on 16-bit systems this will still be OK. +Therefore, I haven't changed the API for pcre_info(). + +Arguments: + external_re points to compiled code + optptr where to pass back the options + first_char where to pass back the first character, + or -1 if multiline and all branches start ^, + or -2 otherwise + +Returns: number of capturing subpatterns + or negative values on error +*/ + +int +pcre_info(const pcre *external_re, int *optptr, int *first_char) +{ +const real_pcre *re = (const real_pcre *)external_re; +if (re == NULL) return PCRE_ERROR_NULL; +if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; +if (optptr != NULL) *optptr = (int)(re->options & PUBLIC_OPTIONS); +if (first_char != NULL) + *first_char = ((re->options & PCRE_FIRSTSET) != 0)? re->first_char : + ((re->options & PCRE_STARTLINE) != 0)? -1 : -2; +return re->top_bracket; +} + + + +/************************************************* +* Return info about compiled pattern * +*************************************************/ + +/* This is a newer "info" function which has an extensible interface so +that additional items can be added compatibly. + +Arguments: + external_re points to compiled code + external_study points to study data, or NULL + what what information is required + where where to put the information + +Returns: 0 if data returned, negative on error +*/ + +int +pcre_fullinfo(const pcre *external_re, const pcre_extra *study_data, int what, + void *where) +{ +const real_pcre *re = (const real_pcre *)external_re; +const real_pcre_extra *study = (const real_pcre_extra *)study_data; + +if (re == NULL || where == NULL) return PCRE_ERROR_NULL; +if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; + +switch (what) + { + case PCRE_INFO_OPTIONS: + *((unsigned long int *)where) = re->options & PUBLIC_OPTIONS; + break; + + case PCRE_INFO_SIZE: + *((size_t *)where) = re->size; + break; + + case PCRE_INFO_CAPTURECOUNT: + *((int *)where) = re->top_bracket; + break; + + case PCRE_INFO_BACKREFMAX: + *((int *)where) = re->top_backref; + break; + + case PCRE_INFO_FIRSTCHAR: + *((int *)where) = + ((re->options & PCRE_FIRSTSET) != 0)? re->first_char : + ((re->options & PCRE_STARTLINE) != 0)? -1 : -2; + break; + + case PCRE_INFO_FIRSTTABLE: + *((const uschar **)where) = + (study != NULL && (study->options & PCRE_STUDY_MAPPED) != 0)? + study->start_bits : NULL; + break; + + case PCRE_INFO_LASTLITERAL: + *((int *)where) = + ((re->options & PCRE_REQCHSET) != 0)? re->req_char : -1; + break; + + default: return PCRE_ERROR_BADOPTION; + } + +return 0; +} + + + +#ifdef DEBUG +/************************************************* +* Debugging function to print chars * +*************************************************/ + +/* Print a sequence of chars in printable format, stopping at the end of the +subject if the requested. + +Arguments: + p points to characters + length number to print + is_subject TRUE if printing from within md->start_subject + md pointer to matching data block, if is_subject is TRUE + +Returns: nothing +*/ + +static void +pchars(const uschar *p, int length, BOOL is_subject, match_data *md) +{ +int c; +if (is_subject && length > md->end_subject - p) length = md->end_subject - p; +while (length-- > 0) + if (isprint(c = *(p++))) printf("%c", c); else printf("\\x%02x", c); +} +#endif + + + + +/************************************************* +* Handle escapes * +*************************************************/ + +/* This function is called when a \ has been encountered. It either returns a +positive value for a simple escape such as \n, or a negative value which +encodes one of the more complicated things such as \d. On entry, ptr is +pointing at the \. On exit, it is on the final character of the escape +sequence. + +Arguments: + ptrptr points to the pattern position pointer + errorptr points to the pointer to the error message + bracount number of previous extracting brackets + options the options bits + isclass TRUE if inside a character class + cd pointer to char tables block + +Returns: zero or positive => a data character + negative => a special escape sequence + on error, errorptr is set +*/ + +static int +check_escape(const uschar **ptrptr, const char **errorptr, int bracount, + int options, BOOL isclass, compile_data *cd) +{ +const uschar *ptr = *ptrptr; +int c, i; + +c = *(++ptr) & 255; /* Ensure > 0 on signed-char systems */ +if (c == 0) *errorptr = ERR1; + +/* Digits or letters may have special meaning; all others are literals. */ + +else if (c < '0' || c > 'z') {} + +/* Do an initial lookup in a table. A non-zero result is something that can be +returned immediately. Otherwise further processing may be required. */ + +else if ((i = escapes[c - '0']) != 0) c = i; + +/* Escapes that need further processing, or are illegal. */ + +else + { + const uschar *oldptr; + switch (c) + { + /* The handling of escape sequences consisting of a string of digits + starting with one that is not zero is not straightforward. By experiment, + the way Perl works seems to be as follows: + + Outside a character class, the digits are read as a decimal number. If the + number is less than 10, or if there are that many previous extracting + left brackets, then it is a back reference. Otherwise, up to three octal + digits are read to form an escaped byte. Thus \123 is likely to be octal + 123 (cf \0123, which is octal 012 followed by the literal 3). If the octal + value is greater than 377, the least significant 8 bits are taken. Inside a + character class, \ followed by a digit is always an octal number. */ + + case '1': case '2': case '3': case '4': case '5': + case '6': case '7': case '8': case '9': + + if (!isclass) + { + oldptr = ptr; + c -= '0'; + while ((cd->ctypes[ptr[1]] & ctype_digit) != 0) + c = c * 10 + *(++ptr) - '0'; + if (c < 10 || c <= bracount) + { + c = -(ESC_REF + c); + break; + } + ptr = oldptr; /* Put the pointer back and fall through */ + } + + /* Handle an octal number following \. If the first digit is 8 or 9, Perl + generates a binary zero byte and treats the digit as a following literal. + Thus we have to pull back the pointer by one. */ + + if ((c = *ptr) >= '8') + { + ptr--; + c = 0; + break; + } + + /* \0 always starts an octal number, but we may drop through to here with a + larger first octal digit */ + + case '0': + c -= '0'; + while(i++ < 2 && (cd->ctypes[ptr[1]] & ctype_digit) != 0 && + ptr[1] != '8' && ptr[1] != '9') + c = c * 8 + *(++ptr) - '0'; + break; + + /* Special escapes not starting with a digit are straightforward */ + + case 'x': + c = 0; + while (i++ < 2 && (cd->ctypes[ptr[1]] & ctype_xdigit) != 0) + { + ptr++; + c = c * 16 + cd->lcc[*ptr] - + (((cd->ctypes[*ptr] & ctype_digit) != 0)? '0' : 'W'); + } + break; + + case 'c': + c = *(++ptr); + if (c == 0) + { + *errorptr = ERR2; + return 0; + } + + /* A letter is upper-cased; then the 0x40 bit is flipped */ + + if (c >= 'a' && c <= 'z') c = cd->fcc[c]; + c ^= 0x40; + break; + + /* PCRE_EXTRA enables extensions to Perl in the matter of escapes. Any + other alphameric following \ is an error if PCRE_EXTRA was set; otherwise, + for Perl compatibility, it is a literal. This code looks a bit odd, but + there used to be some cases other than the default, and there may be again + in future, so I haven't "optimized" it. */ + + default: + if ((options & PCRE_EXTRA) != 0) switch(c) + { + default: + *errorptr = ERR3; + break; + } + break; + } + } + +*ptrptr = ptr; +return c; +} + + + +/************************************************* +* Check for counted repeat * +*************************************************/ + +/* This function is called when a '{' is encountered in a place where it might +start a quantifier. It looks ahead to see if it really is a quantifier or not. +It is only a quantifier if it is one of the forms {ddd} {ddd,} or {ddd,ddd} +where the ddds are digits. + +Arguments: + p pointer to the first char after '{' + cd pointer to char tables block + +Returns: TRUE or FALSE +*/ + +static BOOL +is_counted_repeat(const uschar *p, compile_data *cd) +{ +if ((cd->ctypes[*p++] & ctype_digit) == 0) return FALSE; +while ((cd->ctypes[*p] & ctype_digit) != 0) p++; +if (*p == '}') return TRUE; + +if (*p++ != ',') return FALSE; +if (*p == '}') return TRUE; + +if ((cd->ctypes[*p++] & ctype_digit) == 0) return FALSE; +while ((cd->ctypes[*p] & ctype_digit) != 0) p++; +return (*p == '}'); +} + + + +/************************************************* +* Read repeat counts * +*************************************************/ + +/* Read an item of the form {n,m} and return the values. This is called only +after is_counted_repeat() has confirmed that a repeat-count quantifier exists, +so the syntax is guaranteed to be correct, but we need to check the values. + +Arguments: + p pointer to first char after '{' + minp pointer to int for min + maxp pointer to int for max + returned as -1 if no max + errorptr points to pointer to error message + cd pointer to character tables clock + +Returns: pointer to '}' on success; + current ptr on error, with errorptr set +*/ + +static const uschar * +read_repeat_counts(const uschar *p, int *minp, int *maxp, + const char **errorptr, compile_data *cd) +{ +int min = 0; +int max = -1; + +while ((cd->ctypes[*p] & ctype_digit) != 0) min = min * 10 + *p++ - '0'; + +if (*p == '}') max = min; else + { + if (*(++p) != '}') + { + max = 0; + while((cd->ctypes[*p] & ctype_digit) != 0) max = max * 10 + *p++ - '0'; + if (max < min) + { + *errorptr = ERR4; + return p; + } + } + } + +/* Do paranoid checks, then fill in the required variables, and pass back the +pointer to the terminating '}'. */ + +if (min > 65535 || max > 65535) + *errorptr = ERR5; +else + { + *minp = min; + *maxp = max; + } +return p; +} + + + +/************************************************* +* Find the fixed length of a pattern * +*************************************************/ + +/* Scan a pattern and compute the fixed length of subject that will match it, +if the length is fixed. This is needed for dealing with backward assertions. + +Arguments: + code points to the start of the pattern (the bracket) + +Returns: the fixed length, or -1 if there is no fixed length +*/ + +static int +find_fixedlength(uschar *code) +{ +int length = -1; + +register int branchlength = 0; +register uschar *cc = code + 3; + +/* Scan along the opcodes for this branch. If we get to the end of the +branch, check the length against that of the other branches. */ + +for (;;) + { + int d; + register int op = *cc; + if (op >= OP_BRA) op = OP_BRA; + + switch (op) + { + case OP_BRA: + case OP_ONCE: + case OP_COND: + d = find_fixedlength(cc); + if (d < 0) return -1; + branchlength += d; + do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT); + cc += 3; + break; + + /* Reached end of a branch; if it's a ket it is the end of a nested + call. If it's ALT it is an alternation in a nested call. If it is + END it's the end of the outer call. All can be handled by the same code. */ + + case OP_ALT: + case OP_KET: + case OP_KETRMAX: + case OP_KETRMIN: + case OP_END: + if (length < 0) length = branchlength; + else if (length != branchlength) return -1; + if (*cc != OP_ALT) return length; + cc += 3; + branchlength = 0; + break; + + /* Skip over assertive subpatterns */ + + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT); + cc += 3; + break; + + /* Skip over things that don't match chars */ + + case OP_REVERSE: + cc++; + /* Fall through */ + + case OP_CREF: + case OP_OPT: + cc++; + /* Fall through */ + + case OP_SOD: + case OP_EOD: + case OP_EODN: + case OP_CIRC: + case OP_DOLL: + case OP_NOT_WORD_BOUNDARY: + case OP_WORD_BOUNDARY: + cc++; + break; + + /* Handle char strings */ + + case OP_CHARS: + branchlength += *(++cc); + cc += *cc + 1; + break; + + /* Handle exact repetitions */ + + case OP_EXACT: + case OP_TYPEEXACT: + branchlength += (cc[1] << 8) + cc[2]; + cc += 4; + break; + + /* Handle single-char matchers */ + + case OP_NOT_DIGIT: + case OP_DIGIT: + case OP_NOT_WHITESPACE: + case OP_WHITESPACE: + case OP_NOT_WORDCHAR: + case OP_WORDCHAR: + case OP_ANY: + branchlength++; + cc++; + break; + + + /* Check a class for variable quantification */ + + case OP_CLASS: + cc += (*cc == OP_REF)? 2 : 33; + + switch (*cc) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRQUERY: + case OP_CRMINQUERY: + return -1; + + case OP_CRRANGE: + case OP_CRMINRANGE: + if ((cc[1] << 8) + cc[2] != (cc[3] << 8) + cc[4]) return -1; + branchlength += (cc[1] << 8) + cc[2]; + cc += 5; + break; + + default: + branchlength++; + } + break; + + /* Anything else is variable length */ + + default: + return -1; + } + } +/* Control never gets here */ +} + + + + +/************************************************* +* Check for POSIX class syntax * +*************************************************/ + +/* This function is called when the sequence "[:" or "[." or "[=" is +encountered in a character class. It checks whether this is followed by an +optional ^ and then a sequence of letters, terminated by a matching ":]" or +".]" or "=]". + +Argument: + ptr pointer to the initial [ + endptr where to return the end pointer + cd pointer to compile data + +Returns: TRUE or FALSE +*/ + +static BOOL +check_posix_syntax(const uschar *ptr, const uschar **endptr, compile_data *cd) +{ +int terminator; /* Don't combine these lines; the Solaris cc */ +terminator = *(++ptr); /* compiler warns about "non-constant" initializer. */ +if (*(++ptr) == '^') ptr++; +while ((cd->ctypes[*ptr] & ctype_letter) != 0) ptr++; +if (*ptr == terminator && ptr[1] == ']') + { + *endptr = ptr; + return TRUE; + } +return FALSE; +} + + + + +/************************************************* +* Check POSIX class name * +*************************************************/ + +/* This function is called to check the name given in a POSIX-style class entry +such as [:alnum:]. + +Arguments: + ptr points to the first letter + len the length of the name + +Returns: a value representing the name, or -1 if unknown +*/ + +static int +check_posix_name(const uschar *ptr, int len) +{ +register int yield = 0; +while (posix_name_lengths[yield] != 0) + { + if (len == posix_name_lengths[yield] && + strncmp((const char *)ptr, posix_names[yield], len) == 0) return yield; + yield++; + } +return -1; +} + + + + +/************************************************* +* Compile one branch * +*************************************************/ + +/* Scan the pattern, compiling it into the code vector. + +Arguments: + options the option bits + brackets points to number of brackets used + code points to the pointer to the current code point + ptrptr points to the current pattern pointer + errorptr points to pointer to error message + optchanged set to the value of the last OP_OPT item compiled + reqchar set to the last literal character required, else -1 + countlits set to count of mandatory literal characters + cd contains pointers to tables + +Returns: TRUE on success + FALSE, with *errorptr set on error +*/ + +static BOOL +compile_branch(int options, int *brackets, uschar **codeptr, + const uschar **ptrptr, const char **errorptr, int *optchanged, + int *reqchar, int *countlits, compile_data *cd) +{ +int repeat_type, op_type; +int repeat_min, repeat_max; +int bravalue, length; +int greedy_default, greedy_non_default; +int prevreqchar; +int condcount = 0; +int subcountlits = 0; +register int c; +register uschar *code = *codeptr; +uschar *tempcode; +const uschar *ptr = *ptrptr; +const uschar *tempptr; +uschar *previous = NULL; +uschar class[32]; + +/* Set up the default and non-default settings for greediness */ + +greedy_default = ((options & PCRE_UNGREEDY) != 0); +greedy_non_default = greedy_default ^ 1; + +/* Initialize no required char, and count of literals */ + +*reqchar = prevreqchar = -1; +*countlits = 0; + +/* Switch on next character until the end of the branch */ + +for (;; ptr++) + { + BOOL negate_class; + int class_charcount; + int class_lastchar; + int newoptions; + int condref; + int subreqchar; + + c = *ptr; + if ((options & PCRE_EXTENDED) != 0) + { + if ((cd->ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + while ((c = *(++ptr)) != 0 && c != '\n'); + continue; + } + } + + switch(c) + { + /* The branch terminates at end of string, |, or ). */ + + case 0: + case '|': + case ')': + *codeptr = code; + *ptrptr = ptr; + return TRUE; + + /* Handle single-character metacharacters */ + + case '^': + previous = NULL; + *code++ = OP_CIRC; + break; + + case '$': + previous = NULL; + *code++ = OP_DOLL; + break; + + case '.': + previous = code; + *code++ = OP_ANY; + break; + + /* Character classes. These always build a 32-byte bitmap of the permitted + characters, except in the special case where there is only one character. + For negated classes, we build the map as usual, then invert it at the end. + */ + + case '[': + previous = code; + *code++ = OP_CLASS; + + /* If the first character is '^', set the negation flag and skip it. */ + + if ((c = *(++ptr)) == '^') + { + negate_class = TRUE; + c = *(++ptr); + } + else negate_class = FALSE; + + /* Keep a count of chars so that we can optimize the case of just a single + character. */ + + class_charcount = 0; + class_lastchar = -1; + + /* Initialize the 32-char bit map to all zeros. We have to build the + map in a temporary bit of store, in case the class contains only 1 + character, because in that case the compiled code doesn't use the + bit map. */ + + memset(class, 0, 32 * sizeof(uschar)); + + /* Process characters until ] is reached. By writing this as a "do" it + means that an initial ] is taken as a data character. */ + + do + { + if (c == 0) + { + *errorptr = ERR6; + goto FAILED; + } + + /* Handle POSIX class names. Perl allows a negation extension of the + form [:^name]. A square bracket that doesn't match the syntax is + treated as a literal. We also recognize the POSIX constructions + [.ch.] and [=ch=] ("collating elements") and fault them, as Perl + 5.6 does. */ + + if (c == '[' && + (ptr[1] == ':' || ptr[1] == '.' || ptr[1] == '=') && + check_posix_syntax(ptr, &tempptr, cd)) + { + BOOL local_negate = FALSE; + int posix_class, i; + register const uschar *cbits = cd->cbits; + + if (ptr[1] != ':') + { + *errorptr = ERR31; + goto FAILED; + } + + ptr += 2; + if (*ptr == '^') + { + local_negate = TRUE; + ptr++; + } + + posix_class = check_posix_name(ptr, tempptr - ptr); + if (posix_class < 0) + { + *errorptr = ERR30; + goto FAILED; + } + + /* If matching is caseless, upper and lower are converted to + alpha. This relies on the fact that the class table starts with + alpha, lower, upper as the first 3 entries. */ + + if ((options & PCRE_CASELESS) != 0 && posix_class <= 2) + posix_class = 0; + + /* Or into the map we are building up to 3 of the static class + tables, or their negations. */ + + posix_class *= 3; + for (i = 0; i < 3; i++) + { + int taboffset = posix_class_maps[posix_class + i]; + if (taboffset < 0) break; + if (local_negate) + for (c = 0; c < 32; c++) class[c] |= ~cbits[c+taboffset]; + else + for (c = 0; c < 32; c++) class[c] |= cbits[c+taboffset]; + } + + ptr = tempptr + 1; + class_charcount = 10; /* Set > 1; assumes more than 1 per class */ + continue; + } + + /* Backslash may introduce a single character, or it may introduce one + of the specials, which just set a flag. Escaped items are checked for + validity in the pre-compiling pass. The sequence \b is a special case. + Inside a class (and only there) it is treated as backspace. Elsewhere + it marks a word boundary. Other escapes have preset maps ready to + or into the one we are building. We assume they have more than one + character in them, so set class_count bigger than one. */ + + if (c == '\\') + { + c = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd); + if (-c == ESC_b) c = '\b'; + else if (c < 0) + { + register const uschar *cbits = cd->cbits; + class_charcount = 10; + switch (-c) + { + case ESC_d: + for (c = 0; c < 32; c++) class[c] |= cbits[c+cbit_digit]; + continue; + + case ESC_D: + for (c = 0; c < 32; c++) class[c] |= ~cbits[c+cbit_digit]; + continue; + + case ESC_w: + for (c = 0; c < 32; c++) class[c] |= cbits[c+cbit_word]; + continue; + + case ESC_W: + for (c = 0; c < 32; c++) class[c] |= ~cbits[c+cbit_word]; + continue; + + case ESC_s: + for (c = 0; c < 32; c++) class[c] |= cbits[c+cbit_space]; + continue; + + case ESC_S: + for (c = 0; c < 32; c++) class[c] |= ~cbits[c+cbit_space]; + continue; + + default: + *errorptr = ERR7; + goto FAILED; + } + } + /* Fall through if single character */ + } + + /* A single character may be followed by '-' to form a range. However, + Perl does not permit ']' to be the end of the range. A '-' character + here is treated as a literal. */ + + if (ptr[1] == '-' && ptr[2] != ']') + { + int d; + ptr += 2; + d = *ptr; + + if (d == 0) + { + *errorptr = ERR6; + goto FAILED; + } + + /* The second part of a range can be a single-character escape, but + not any of the other escapes. */ + + if (d == '\\') + { + d = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd); + if (d < 0) + { + if (d == -ESC_b) d = '\b'; else + { + *errorptr = ERR7; + goto FAILED; + } + } + } + + if (d < c) + { + *errorptr = ERR8; + goto FAILED; + } + + for (; c <= d; c++) + { + class[c/8] |= (1 << (c&7)); + if ((options & PCRE_CASELESS) != 0) + { + int uc = cd->fcc[c]; /* flip case */ + class[uc/8] |= (1 << (uc&7)); + } + class_charcount++; /* in case a one-char range */ + class_lastchar = c; + } + continue; /* Go get the next char in the class */ + } + + /* Handle a lone single character - we can get here for a normal + non-escape char, or after \ that introduces a single character. */ + + class [c/8] |= (1 << (c&7)); + if ((options & PCRE_CASELESS) != 0) + { + c = cd->fcc[c]; /* flip case */ + class[c/8] |= (1 << (c&7)); + } + class_charcount++; + class_lastchar = c; + } + + /* Loop until ']' reached; the check for end of string happens inside the + loop. This "while" is the end of the "do" above. */ + + while ((c = *(++ptr)) != ']'); + + /* If class_charcount is 1 and class_lastchar is not negative, we saw + precisely one character. This doesn't need the whole 32-byte bit map. + We turn it into a 1-character OP_CHAR if it's positive, or OP_NOT if + it's negative. */ + + if (class_charcount == 1 && class_lastchar >= 0) + { + if (negate_class) + { + code[-1] = OP_NOT; + } + else + { + code[-1] = OP_CHARS; + *code++ = 1; + } + *code++ = class_lastchar; + } + + /* Otherwise, negate the 32-byte map if necessary, and copy it into + the code vector. */ + + else + { + if (negate_class) + for (c = 0; c < 32; c++) code[c] = ~class[c]; + else + memcpy(code, class, 32); + code += 32; + } + break; + + /* Various kinds of repeat */ + + case '{': + if (!is_counted_repeat(ptr+1, cd)) goto NORMAL_CHAR; + ptr = read_repeat_counts(ptr+1, &repeat_min, &repeat_max, errorptr, cd); + if (*errorptr != NULL) goto FAILED; + goto REPEAT; + + case '*': + repeat_min = 0; + repeat_max = -1; + goto REPEAT; + + case '+': + repeat_min = 1; + repeat_max = -1; + goto REPEAT; + + case '?': + repeat_min = 0; + repeat_max = 1; + + REPEAT: + if (previous == NULL) + { + *errorptr = ERR9; + goto FAILED; + } + + /* If the next character is '?' this is a minimizing repeat, by default, + but if PCRE_UNGREEDY is set, it works the other way round. Advance to the + next character. */ + + if (ptr[1] == '?') + { repeat_type = greedy_non_default; ptr++; } + else repeat_type = greedy_default; + + /* If previous was a string of characters, chop off the last one and use it + as the subject of the repeat. If there was only one character, we can + abolish the previous item altogether. A repeat with a zero minimum wipes + out any reqchar setting, backing up to the previous value. We must also + adjust the countlits value. */ + + if (*previous == OP_CHARS) + { + int len = previous[1]; + + if (repeat_min == 0) *reqchar = prevreqchar; + *countlits += repeat_min - 1; + + if (len == 1) + { + c = previous[2]; + code = previous; + } + else + { + c = previous[len+1]; + previous[1]--; + code--; + } + op_type = 0; /* Use single-char op codes */ + goto OUTPUT_SINGLE_REPEAT; /* Code shared with single character types */ + } + + /* If previous was a single negated character ([^a] or similar), we use + one of the special opcodes, replacing it. The code is shared with single- + character repeats by adding a suitable offset into repeat_type. */ + + else if ((int)*previous == OP_NOT) + { + op_type = OP_NOTSTAR - OP_STAR; /* Use "not" opcodes */ + c = previous[1]; + code = previous; + goto OUTPUT_SINGLE_REPEAT; + } + + /* If previous was a character type match (\d or similar), abolish it and + create a suitable repeat item. The code is shared with single-character + repeats by adding a suitable offset into repeat_type. */ + + else if ((int)*previous < OP_EODN || *previous == OP_ANY) + { + op_type = OP_TYPESTAR - OP_STAR; /* Use type opcodes */ + c = *previous; + code = previous; + + OUTPUT_SINGLE_REPEAT: + + /* If the maximum is zero then the minimum must also be zero; Perl allows + this case, so we do too - by simply omitting the item altogether. */ + + if (repeat_max == 0) goto END_REPEAT; + + /* Combine the op_type with the repeat_type */ + + repeat_type += op_type; + + /* A minimum of zero is handled either as the special case * or ?, or as + an UPTO, with the maximum given. */ + + if (repeat_min == 0) + { + if (repeat_max == -1) *code++ = OP_STAR + repeat_type; + else if (repeat_max == 1) *code++ = OP_QUERY + repeat_type; + else + { + *code++ = OP_UPTO + repeat_type; + *code++ = repeat_max >> 8; + *code++ = (repeat_max & 255); + } + } + + /* The case {1,} is handled as the special case + */ + + else if (repeat_min == 1 && repeat_max == -1) + *code++ = OP_PLUS + repeat_type; + + /* The case {n,n} is just an EXACT, while the general case {n,m} is + handled as an EXACT followed by an UPTO. An EXACT of 1 is optimized. */ + + else + { + if (repeat_min != 1) + { + *code++ = OP_EXACT + op_type; /* NB EXACT doesn't have repeat_type */ + *code++ = repeat_min >> 8; + *code++ = (repeat_min & 255); + } + + /* If the mininum is 1 and the previous item was a character string, + we either have to put back the item that got cancelled if the string + length was 1, or add the character back onto the end of a longer + string. For a character type nothing need be done; it will just get + put back naturally. Note that the final character is always going to + get added below. */ + + else if (*previous == OP_CHARS) + { + if (code == previous) code += 2; else previous[1]++; + } + + /* For a single negated character we also have to put back the + item that got cancelled. */ + + else if (*previous == OP_NOT) code++; + + /* If the maximum is unlimited, insert an OP_STAR. */ + + if (repeat_max < 0) + { + *code++ = c; + *code++ = OP_STAR + repeat_type; + } + + /* Else insert an UPTO if the max is greater than the min. */ + + else if (repeat_max != repeat_min) + { + *code++ = c; + repeat_max -= repeat_min; + *code++ = OP_UPTO + repeat_type; + *code++ = repeat_max >> 8; + *code++ = (repeat_max & 255); + } + } + + /* The character or character type itself comes last in all cases. */ + + *code++ = c; + } + + /* If previous was a character class or a back reference, we put the repeat + stuff after it, but just skip the item if the repeat was {0,0}. */ + + else if (*previous == OP_CLASS || *previous == OP_REF) + { + if (repeat_max == 0) + { + code = previous; + goto END_REPEAT; + } + if (repeat_min == 0 && repeat_max == -1) + *code++ = OP_CRSTAR + repeat_type; + else if (repeat_min == 1 && repeat_max == -1) + *code++ = OP_CRPLUS + repeat_type; + else if (repeat_min == 0 && repeat_max == 1) + *code++ = OP_CRQUERY + repeat_type; + else + { + *code++ = OP_CRRANGE + repeat_type; + *code++ = repeat_min >> 8; + *code++ = repeat_min & 255; + if (repeat_max == -1) repeat_max = 0; /* 2-byte encoding for max */ + *code++ = repeat_max >> 8; + *code++ = repeat_max & 255; + } + } + + /* If previous was a bracket group, we may have to replicate it in certain + cases. */ + + else if ((int)*previous >= OP_BRA || (int)*previous == OP_ONCE || + (int)*previous == OP_COND) + { + register int i; + int ketoffset = 0; + int len = code - previous; + uschar *bralink = NULL; + + /* If the maximum repeat count is unlimited, find the end of the bracket + by scanning through from the start, and compute the offset back to it + from the current code pointer. There may be an OP_OPT setting following + the final KET, so we can't find the end just by going back from the code + pointer. */ + + if (repeat_max == -1) + { + register uschar *ket = previous; + do ket += (ket[1] << 8) + ket[2]; while (*ket != OP_KET); + ketoffset = code - ket; + } + + /* The case of a zero minimum is special because of the need to stick + OP_BRAZERO in front of it, and because the group appears once in the + data, whereas in other cases it appears the minimum number of times. For + this reason, it is simplest to treat this case separately, as otherwise + the code gets far too mess. There are several special subcases when the + minimum is zero. */ + + if (repeat_min == 0) + { + /* If we set up a required char from the bracket, we must back off + to the previous value and reset the countlits value too. */ + + if (subcountlits > 0) + { + *reqchar = prevreqchar; + *countlits -= subcountlits; + } + + /* If the maximum is also zero, we just omit the group from the output + altogether. */ + + if (repeat_max == 0) + { + code = previous; + goto END_REPEAT; + } + + /* If the maximum is 1 or unlimited, we just have to stick in the + BRAZERO and do no more at this point. */ + + if (repeat_max <= 1) + { + memmove(previous+1, previous, len); + code++; + *previous++ = OP_BRAZERO + repeat_type; + } + + /* If the maximum is greater than 1 and limited, we have to replicate + in a nested fashion, sticking OP_BRAZERO before each set of brackets. + The first one has to be handled carefully because it's the original + copy, which has to be moved up. The remainder can be handled by code + that is common with the non-zero minimum case below. We just have to + adjust the value or repeat_max, since one less copy is required. */ + + else + { + int offset; + memmove(previous+4, previous, len); + code += 4; + *previous++ = OP_BRAZERO + repeat_type; + *previous++ = OP_BRA; + + /* We chain together the bracket offset fields that have to be + filled in later when the ends of the brackets are reached. */ + + offset = (bralink == NULL)? 0 : previous - bralink; + bralink = previous; + *previous++ = offset >> 8; + *previous++ = offset & 255; + } + + repeat_max--; + } + + /* If the minimum is greater than zero, replicate the group as many + times as necessary, and adjust the maximum to the number of subsequent + copies that we need. */ + + else + { + for (i = 1; i < repeat_min; i++) + { + memcpy(code, previous, len); + code += len; + } + if (repeat_max > 0) repeat_max -= repeat_min; + } + + /* This code is common to both the zero and non-zero minimum cases. If + the maximum is limited, it replicates the group in a nested fashion, + remembering the bracket starts on a stack. In the case of a zero minimum, + the first one was set up above. In all cases the repeat_max now specifies + the number of additional copies needed. */ + + if (repeat_max >= 0) + { + for (i = repeat_max - 1; i >= 0; i--) + { + *code++ = OP_BRAZERO + repeat_type; + + /* All but the final copy start a new nesting, maintaining the + chain of brackets outstanding. */ + + if (i != 0) + { + int offset; + *code++ = OP_BRA; + offset = (bralink == NULL)? 0 : code - bralink; + bralink = code; + *code++ = offset >> 8; + *code++ = offset & 255; + } + + memcpy(code, previous, len); + code += len; + } + + /* Now chain through the pending brackets, and fill in their length + fields (which are holding the chain links pro tem). */ + + while (bralink != NULL) + { + int oldlinkoffset; + int offset = code - bralink + 1; + uschar *bra = code - offset; + oldlinkoffset = (bra[1] << 8) + bra[2]; + bralink = (oldlinkoffset == 0)? NULL : bralink - oldlinkoffset; + *code++ = OP_KET; + *code++ = bra[1] = offset >> 8; + *code++ = bra[2] = (offset & 255); + } + } + + /* If the maximum is unlimited, set a repeater in the final copy. We + can't just offset backwards from the current code point, because we + don't know if there's been an options resetting after the ket. The + correct offset was computed above. */ + + else code[-ketoffset] = OP_KETRMAX + repeat_type; + } + + /* Else there's some kind of shambles */ + + else + { + *errorptr = ERR11; + goto FAILED; + } + + /* In all case we no longer have a previous item. */ + + END_REPEAT: + previous = NULL; + break; + + + /* Start of nested bracket sub-expression, or comment or lookahead or + lookbehind or option setting or condition. First deal with special things + that can come after a bracket; all are introduced by ?, and the appearance + of any of them means that this is not a referencing group. They were + checked for validity in the first pass over the string, so we don't have to + check for syntax errors here. */ + + case '(': + newoptions = options; + condref = -1; + + if (*(++ptr) == '?') + { + int set, unset; + int *optset; + + switch (*(++ptr)) + { + case '#': /* Comment; skip to ket */ + ptr++; + while (*ptr != ')') ptr++; + continue; + + case ':': /* Non-extracting bracket */ + bravalue = OP_BRA; + ptr++; + break; + + case '(': + bravalue = OP_COND; /* Conditional group */ + if ((cd->ctypes[*(++ptr)] & ctype_digit) != 0) + { + condref = *ptr - '0'; + while (*(++ptr) != ')') condref = condref*10 + *ptr - '0'; + ptr++; + } + else ptr--; + break; + + case '=': /* Positive lookahead */ + bravalue = OP_ASSERT; + ptr++; + break; + + case '!': /* Negative lookahead */ + bravalue = OP_ASSERT_NOT; + ptr++; + break; + + case '<': /* Lookbehinds */ + switch (*(++ptr)) + { + case '=': /* Positive lookbehind */ + bravalue = OP_ASSERTBACK; + ptr++; + break; + + case '!': /* Negative lookbehind */ + bravalue = OP_ASSERTBACK_NOT; + ptr++; + break; + + default: /* Syntax error */ + *errorptr = ERR24; + goto FAILED; + } + break; + + case '>': /* One-time brackets */ + bravalue = OP_ONCE; + ptr++; + break; + + case 'R': /* Pattern recursion */ + *code++ = OP_RECURSE; + ptr++; + continue; + + default: /* Option setting */ + set = unset = 0; + optset = &set; + + while (*ptr != ')' && *ptr != ':') + { + switch (*ptr++) + { + case '-': optset = &unset; break; + + case 'i': *optset |= PCRE_CASELESS; break; + case 'm': *optset |= PCRE_MULTILINE; break; + case 's': *optset |= PCRE_DOTALL; break; + case 'x': *optset |= PCRE_EXTENDED; break; + case 'U': *optset |= PCRE_UNGREEDY; break; + case 'X': *optset |= PCRE_EXTRA; break; + + default: + *errorptr = ERR12; + goto FAILED; + } + } + + /* Set up the changed option bits, but don't change anything yet. */ + + newoptions = (options | set) & (~unset); + + /* If the options ended with ')' this is not the start of a nested + group with option changes, so the options change at this level. At top + level there is nothing else to be done (the options will in fact have + been set from the start of compiling as a result of the first pass) but + at an inner level we must compile code to change the ims options if + necessary, and pass the new setting back so that it can be put at the + start of any following branches, and when this group ends, a resetting + item can be compiled. */ + + if (*ptr == ')') + { + if ((options & PCRE_INGROUP) != 0 && + (options & PCRE_IMS) != (newoptions & PCRE_IMS)) + { + *code++ = OP_OPT; + *code++ = *optchanged = newoptions & PCRE_IMS; + } + options = newoptions; /* Change options at this level */ + previous = NULL; /* This item can't be repeated */ + continue; /* It is complete */ + } + + /* If the options ended with ':' we are heading into a nested group + with possible change of options. Such groups are non-capturing and are + not assertions of any kind. All we need to do is skip over the ':'; + the newoptions value is handled below. */ + + bravalue = OP_BRA; + ptr++; + } + } + + /* Else we have a referencing group; adjust the opcode. */ + + else + { + if (++(*brackets) > EXTRACT_MAX) + { + *errorptr = ERR13; + goto FAILED; + } + bravalue = OP_BRA + *brackets; + } + + /* Process nested bracketed re. Assertions may not be repeated, but other + kinds can be. We copy code into a non-register variable in order to be able + to pass its address because some compilers complain otherwise. Pass in a + new setting for the ims options if they have changed. */ + + previous = (bravalue >= OP_ONCE)? code : NULL; + *code = bravalue; + tempcode = code; + + if (!compile_regex( + options | PCRE_INGROUP, /* Set for all nested groups */ + ((options & PCRE_IMS) != (newoptions & PCRE_IMS))? + newoptions & PCRE_IMS : -1, /* Pass ims options if changed */ + brackets, /* Bracket level */ + &tempcode, /* Where to put code (updated) */ + &ptr, /* Input pointer (updated) */ + errorptr, /* Where to put an error message */ + (bravalue == OP_ASSERTBACK || + bravalue == OP_ASSERTBACK_NOT), /* TRUE if back assert */ + condref, /* Condition reference number */ + &subreqchar, /* For possible last char */ + &subcountlits, /* For literal count */ + cd)) /* Tables block */ + goto FAILED; + + /* At the end of compiling, code is still pointing to the start of the + group, while tempcode has been updated to point past the end of the group + and any option resetting that may follow it. The pattern pointer (ptr) + is on the bracket. */ + + /* If this is a conditional bracket, check that there are no more than + two branches in the group. */ + + if (bravalue == OP_COND) + { + uschar *tc = code; + condcount = 0; + + do { + condcount++; + tc += (tc[1] << 8) | tc[2]; + } + while (*tc != OP_KET); + + if (condcount > 2) + { + *errorptr = ERR27; + goto FAILED; + } + } + + /* Handle updating of the required character. If the subpattern didn't + set one, leave it as it was. Otherwise, update it for normal brackets of + all kinds, forward assertions, and conditions with two branches. Don't + update the literal count for forward assertions, however. If the bracket + is followed by a quantifier with zero repeat, we have to back off. Hence + the definition of prevreqchar and subcountlits outside the main loop so + that they can be accessed for the back off. */ + + if (subreqchar > 0 && + (bravalue >= OP_BRA || bravalue == OP_ONCE || bravalue == OP_ASSERT || + (bravalue == OP_COND && condcount == 2))) + { + prevreqchar = *reqchar; + *reqchar = subreqchar; + if (bravalue != OP_ASSERT) *countlits += subcountlits; + } + + /* Now update the main code pointer to the end of the group. */ + + code = tempcode; + + /* Error if hit end of pattern */ + + if (*ptr != ')') + { + *errorptr = ERR14; + goto FAILED; + } + break; + + /* Check \ for being a real metacharacter; if not, fall through and handle + it as a data character at the start of a string. Escape items are checked + for validity in the pre-compiling pass. */ + + case '\\': + tempptr = ptr; + c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd); + + /* Handle metacharacters introduced by \. For ones like \d, the ESC_ values + are arranged to be the negation of the corresponding OP_values. For the + back references, the values are ESC_REF plus the reference number. Only + back references and those types that consume a character may be repeated. + We can test for values between ESC_b and ESC_Z for the latter; this may + have to change if any new ones are ever created. */ + + if (c < 0) + { + if (-c >= ESC_REF) + { + previous = code; + *code++ = OP_REF; + *code++ = -c - ESC_REF; + } + else + { + previous = (-c > ESC_b && -c < ESC_Z)? code : NULL; + *code++ = -c; + } + continue; + } + + /* Data character: reset and fall through */ + + ptr = tempptr; + c = '\\'; + + /* Handle a run of data characters until a metacharacter is encountered. + The first character is guaranteed not to be whitespace or # when the + extended flag is set. */ + + NORMAL_CHAR: + default: + previous = code; + *code = OP_CHARS; + code += 2; + length = 0; + + do + { + if ((options & PCRE_EXTENDED) != 0) + { + if ((cd->ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + while ((c = *(++ptr)) != 0 && c != '\n'); + if (c == 0) break; + continue; + } + } + + /* Backslash may introduce a data char or a metacharacter. Escaped items + are checked for validity in the pre-compiling pass. Stop the string + before a metaitem. */ + + if (c == '\\') + { + tempptr = ptr; + c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd); + if (c < 0) { ptr = tempptr; break; } + } + + /* Ordinary character or single-char escape */ + + *code++ = c; + length++; + } + + /* This "while" is the end of the "do" above. */ + + while (length < 255 && (cd->ctypes[c = *(++ptr)] & ctype_meta) == 0); + + /* Update the last character and the count of literals */ + + prevreqchar = (length > 1)? code[-2] : *reqchar; + *reqchar = code[-1]; + *countlits += length; + + /* Compute the length and set it in the data vector, and advance to + the next state. */ + + previous[1] = length; + if (length < 255) ptr--; + break; + } + } /* end of big loop */ + +/* Control never reaches here by falling through, only by a goto for all the +error states. Pass back the position in the pattern so that it can be displayed +to the user for diagnosing the error. */ + +FAILED: +*ptrptr = ptr; +return FALSE; +} + + + + +/************************************************* +* Compile sequence of alternatives * +*************************************************/ + +/* On entry, ptr is pointing past the bracket character, but on return +it points to the closing bracket, or vertical bar, or end of string. +The code variable is pointing at the byte into which the BRA operator has been +stored. If the ims options are changed at the start (for a (?ims: group) or +during any branch, we need to insert an OP_OPT item at the start of every +following branch to ensure they get set correctly at run time, and also pass +the new options into every subsequent branch compile. + +Argument: + options the option bits + optchanged new ims options to set as if (?ims) were at the start, or -1 + for no change + brackets -> int containing the number of extracting brackets used + codeptr -> the address of the current code pointer + ptrptr -> the address of the current pattern pointer + errorptr -> pointer to error message + lookbehind TRUE if this is a lookbehind assertion + condref > 0 for OPT_CREF setting at start of conditional group + reqchar -> place to put the last required character, or a negative number + countlits -> place to put the shortest literal count of any branch + cd points to the data block with tables pointers + +Returns: TRUE on success +*/ + +static BOOL +compile_regex(int options, int optchanged, int *brackets, uschar **codeptr, + const uschar **ptrptr, const char **errorptr, BOOL lookbehind, int condref, + int *reqchar, int *countlits, compile_data *cd) +{ +const uschar *ptr = *ptrptr; +uschar *code = *codeptr; +uschar *last_branch = code; +uschar *start_bracket = code; +uschar *reverse_count = NULL; +int oldoptions = options & PCRE_IMS; +int branchreqchar, branchcountlits; + +*reqchar = -1; +*countlits = INT_MAX; +code += 3; + +/* At the start of a reference-based conditional group, insert the reference +number as an OP_CREF item. */ + +if (condref > 0) + { + *code++ = OP_CREF; + *code++ = condref; + } + +/* Loop for each alternative branch */ + +for (;;) + { + int length; + + /* Handle change of options */ + + if (optchanged >= 0) + { + *code++ = OP_OPT; + *code++ = optchanged; + options = (options & ~PCRE_IMS) | optchanged; + } + + /* Set up dummy OP_REVERSE if lookbehind assertion */ + + if (lookbehind) + { + *code++ = OP_REVERSE; + reverse_count = code; + *code++ = 0; + *code++ = 0; + } + + /* Now compile the branch */ + + if (!compile_branch(options, brackets, &code, &ptr, errorptr, &optchanged, + &branchreqchar, &branchcountlits, cd)) + { + *ptrptr = ptr; + return FALSE; + } + + /* Fill in the length of the last branch */ + + length = code - last_branch; + last_branch[1] = length >> 8; + last_branch[2] = length & 255; + + /* Save the last required character if all branches have the same; a current + value of -1 means unset, while -2 means "previous branch had no last required + char". */ + + if (*reqchar != -2) + { + if (branchreqchar >= 0) + { + if (*reqchar == -1) *reqchar = branchreqchar; + else if (*reqchar != branchreqchar) *reqchar = -2; + } + else *reqchar = -2; + } + + /* Keep the shortest literal count */ + + if (branchcountlits < *countlits) *countlits = branchcountlits; + DPRINTF(("literal count = %d min=%d\n", branchcountlits, *countlits)); + + /* If lookbehind, check that this branch matches a fixed-length string, + and put the length into the OP_REVERSE item. Temporarily mark the end of + the branch with OP_END. */ + + if (lookbehind) + { + *code = OP_END; + length = find_fixedlength(last_branch); + DPRINTF(("fixed length = %d\n", length)); + if (length < 0) + { + *errorptr = ERR25; + *ptrptr = ptr; + return FALSE; + } + reverse_count[0] = (length >> 8); + reverse_count[1] = length & 255; + } + + /* Reached end of expression, either ')' or end of pattern. Insert a + terminating ket and the length of the whole bracketed item, and return, + leaving the pointer at the terminating char. If any of the ims options + were changed inside the group, compile a resetting op-code following. */ + + if (*ptr != '|') + { + length = code - start_bracket; + *code++ = OP_KET; + *code++ = length >> 8; + *code++ = length & 255; + if (optchanged >= 0) + { + *code++ = OP_OPT; + *code++ = oldoptions; + } + *codeptr = code; + *ptrptr = ptr; + return TRUE; + } + + /* Another branch follows; insert an "or" node and advance the pointer. */ + + *code = OP_ALT; + last_branch = code; + code += 3; + ptr++; + } +/* Control never reaches here */ +} + + + + +/************************************************* +* Find first significant op code * +*************************************************/ + +/* This is called by several functions that scan a compiled expression looking +for a fixed first character, or an anchoring op code etc. It skips over things +that do not influence this. For one application, a change of caseless option is +important. + +Arguments: + code pointer to the start of the group + options pointer to external options + optbit the option bit whose changing is significant, or + zero if none are + optstop TRUE to return on option change, otherwise change the options + value and continue + +Returns: pointer to the first significant opcode +*/ + +static const uschar* +first_significant_code(const uschar *code, int *options, int optbit, + BOOL optstop) +{ +for (;;) + { + switch ((int)*code) + { + case OP_OPT: + if (optbit > 0 && ((int)code[1] & optbit) != (*options & optbit)) + { + if (optstop) return code; + *options = (int)code[1]; + } + code += 2; + break; + + case OP_CREF: + code += 2; + break; + + case OP_WORD_BOUNDARY: + case OP_NOT_WORD_BOUNDARY: + code++; + break; + + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + do code += (code[1] << 8) + code[2]; while (*code == OP_ALT); + code += 3; + break; + + default: + return code; + } + } +/* Control never reaches here */ +} + + + + +/************************************************* +* Check for anchored expression * +*************************************************/ + +/* Try to find out if this is an anchored regular expression. Consider each +alternative branch. If they all start with OP_SOD or OP_CIRC, or with a bracket +all of whose alternatives start with OP_SOD or OP_CIRC (recurse ad lib), then +it's anchored. However, if this is a multiline pattern, then only OP_SOD +counts, since OP_CIRC can match in the middle. + +A branch is also implicitly anchored if it starts with .* and DOTALL is set, +because that will try the rest of the pattern at all possible matching points, +so there is no point trying them again. + +Arguments: + code points to start of expression (the bracket) + options points to the options setting + +Returns: TRUE or FALSE +*/ + +static BOOL +is_anchored(register const uschar *code, int *options) +{ +do { + const uschar *scode = first_significant_code(code + 3, options, + PCRE_MULTILINE, FALSE); + register int op = *scode; + if (op >= OP_BRA || op == OP_ASSERT || op == OP_ONCE || op == OP_COND) + { if (!is_anchored(scode, options)) return FALSE; } + else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR) && + (*options & PCRE_DOTALL) != 0) + { if (scode[1] != OP_ANY) return FALSE; } + else if (op != OP_SOD && + ((*options & PCRE_MULTILINE) != 0 || op != OP_CIRC)) + return FALSE; + code += (code[1] << 8) + code[2]; + } +while (*code == OP_ALT); +return TRUE; +} + + + +/************************************************* +* Check for starting with ^ or .* * +*************************************************/ + +/* This is called to find out if every branch starts with ^ or .* so that +"first char" processing can be done to speed things up in multiline +matching and for non-DOTALL patterns that start with .* (which must start at +the beginning or after \n). + +Argument: points to start of expression (the bracket) +Returns: TRUE or FALSE +*/ + +static BOOL +is_startline(const uschar *code) +{ +do { + const uschar *scode = first_significant_code(code + 3, NULL, 0, FALSE); + register int op = *scode; + if (op >= OP_BRA || op == OP_ASSERT || op == OP_ONCE || op == OP_COND) + { if (!is_startline(scode)) return FALSE; } + else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR) + { if (scode[1] != OP_ANY) return FALSE; } + else if (op != OP_CIRC) return FALSE; + code += (code[1] << 8) + code[2]; + } +while (*code == OP_ALT); +return TRUE; +} + + + +/************************************************* +* Check for fixed first char * +*************************************************/ + +/* Try to find out if there is a fixed first character. This is called for +unanchored expressions, as it speeds up their processing quite considerably. +Consider each alternative branch. If they all start with the same char, or with +a bracket all of whose alternatives start with the same char (recurse ad lib), +then we return that char, otherwise -1. + +Arguments: + code points to start of expression (the bracket) + options pointer to the options (used to check casing changes) + +Returns: -1 or the fixed first char +*/ + +static int +find_firstchar(const uschar *code, int *options) +{ +register int c = -1; +do { + int d; + const uschar *scode = first_significant_code(code + 3, options, + PCRE_CASELESS, TRUE); + register int op = *scode; + + if (op >= OP_BRA) op = OP_BRA; + + switch(op) + { + default: + return -1; + + case OP_BRA: + case OP_ASSERT: + case OP_ONCE: + case OP_COND: + if ((d = find_firstchar(scode, options)) < 0) return -1; + if (c < 0) c = d; else if (c != d) return -1; + break; + + case OP_EXACT: /* Fall through */ + scode++; + + case OP_CHARS: /* Fall through */ + scode++; + + case OP_PLUS: + case OP_MINPLUS: + if (c < 0) c = scode[1]; else if (c != scode[1]) return -1; + break; + } + + code += (code[1] << 8) + code[2]; + } +while (*code == OP_ALT); +return c; +} + + + + + +/************************************************* +* Compile a Regular Expression * +*************************************************/ + +/* This function takes a string and returns a pointer to a block of store +holding a compiled version of the expression. + +Arguments: + pattern the regular expression + options various option bits + errorptr pointer to pointer to error text + erroroffset ptr offset in pattern where error was detected + tables pointer to character tables or NULL + +Returns: pointer to compiled data block, or NULL on error, + with errorptr and erroroffset set +*/ + +pcre * +pcre_compile(const char *pattern, int options, const char **errorptr, + int *erroroffset, const unsigned char *tables) +{ +real_pcre *re; +int length = 3; /* For initial BRA plus length */ +int runlength; +int c, reqchar, countlits; +int bracount = 0; +int top_backref = 0; +int branch_extra = 0; +int branch_newextra; +unsigned int brastackptr = 0; +size_t size; +uschar *code; +const uschar *ptr; +compile_data compile_block; +int brastack[BRASTACK_SIZE]; +uschar bralenstack[BRASTACK_SIZE]; + +#ifdef DEBUG +uschar *code_base, *code_end; +#endif + +/* We can't pass back an error message if errorptr is NULL; I guess the best we +can do is just return NULL. */ + +if (errorptr == NULL) return NULL; +*errorptr = NULL; + +/* However, we can give a message for this error */ + +if (erroroffset == NULL) + { + *errorptr = ERR16; + return NULL; + } +*erroroffset = 0; + +if ((options & ~PUBLIC_OPTIONS) != 0) + { + *errorptr = ERR17; + return NULL; + } + +/* Set up pointers to the individual character tables */ + +if (tables == NULL) tables = pcre_default_tables; +compile_block.lcc = tables + lcc_offset; +compile_block.fcc = tables + fcc_offset; +compile_block.cbits = tables + cbits_offset; +compile_block.ctypes = tables + ctypes_offset; + +/* Reflect pattern for debugging output */ + +DPRINTF(("------------------------------------------------------------------\n")); +DPRINTF(("%s\n", pattern)); + +/* The first thing to do is to make a pass over the pattern to compute the +amount of store required to hold the compiled code. This does not have to be +perfect as long as errors are overestimates. At the same time we can detect any +internal flag settings. Make an attempt to correct for any counted white space +if an "extended" flag setting appears late in the pattern. We can't be so +clever for #-comments. */ + +ptr = (const uschar *)(pattern - 1); +while ((c = *(++ptr)) != 0) + { + int min, max; + int class_charcount; + + if ((options & PCRE_EXTENDED) != 0) + { + if ((compile_block.ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + while ((c = *(++ptr)) != 0 && c != '\n'); + continue; + } + } + + switch(c) + { + /* A backslashed item may be an escaped "normal" character or a + character type. For a "normal" character, put the pointers and + character back so that tests for whitespace etc. in the input + are done correctly. */ + + case '\\': + { + const uschar *save_ptr = ptr; + c = check_escape(&ptr, errorptr, bracount, options, FALSE, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (c >= 0) + { + ptr = save_ptr; + c = '\\'; + goto NORMAL_CHAR; + } + } + length++; + + /* A back reference needs an additional char, plus either one or 5 + bytes for a repeat. We also need to keep the value of the highest + back reference. */ + + if (c <= -ESC_REF) + { + int refnum = -c - ESC_REF; + if (refnum > top_backref) top_backref = refnum; + length++; /* For single back reference */ + if (ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block)) + { + ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else length += 5; + if (ptr[1] == '?') ptr++; + } + } + continue; + + case '^': + case '.': + case '$': + case '*': /* These repeats won't be after brackets; */ + case '+': /* those are handled separately */ + case '?': + length++; + continue; + + /* This covers the cases of repeats after a single char, metachar, class, + or back reference. */ + + case '{': + if (!is_counted_repeat(ptr+1, &compile_block)) goto NORMAL_CHAR; + ptr = read_repeat_counts(ptr+1, &min, &max, errorptr, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else + { + length--; /* Uncount the original char or metachar */ + if (min == 1) length++; else if (min > 0) length += 4; + if (max > 0) length += 4; else length += 2; + } + if (ptr[1] == '?') ptr++; + continue; + + /* An alternation contains an offset to the next branch or ket. If any ims + options changed in the previous branch(es), and/or if we are in a + lookbehind assertion, extra space will be needed at the start of the + branch. This is handled by branch_extra. */ + + case '|': + length += 3 + branch_extra; + continue; + + /* A character class uses 33 characters. Don't worry about character types + that aren't allowed in classes - they'll get picked up during the compile. + A character class that contains only one character uses 2 or 3 bytes, + depending on whether it is negated or not. Notice this where we can. */ + + case '[': + class_charcount = 0; + if (*(++ptr) == '^') ptr++; + do + { + if (*ptr == '\\') + { + int ch = check_escape(&ptr, errorptr, bracount, options, TRUE, + &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (-ch == ESC_b) class_charcount++; else class_charcount = 10; + } + else class_charcount++; + ptr++; + } + while (*ptr != 0 && *ptr != ']'); + + /* Repeats for negated single chars are handled by the general code */ + + if (class_charcount == 1) length += 3; else + { + length += 33; + + /* A repeat needs either 1 or 5 bytes. */ + + if (*ptr != 0 && ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block)) + { + ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else length += 5; + if (ptr[1] == '?') ptr++; + } + } + continue; + + /* Brackets may be genuine groups or special things */ + + case '(': + branch_newextra = 0; + + /* Handle special forms of bracket, which all start (? */ + + if (ptr[1] == '?') + { + int set, unset; + int *optset; + + switch (c = ptr[2]) + { + /* Skip over comments entirely */ + case '#': + ptr += 3; + while (*ptr != 0 && *ptr != ')') ptr++; + if (*ptr == 0) + { + *errorptr = ERR18; + goto PCRE_ERROR_RETURN; + } + continue; + + /* Non-referencing groups and lookaheads just move the pointer on, and + then behave like a non-special bracket, except that they don't increment + the count of extracting brackets. Ditto for the "once only" bracket, + which is in Perl from version 5.005. */ + + case ':': + case '=': + case '!': + case '>': + ptr += 2; + break; + + /* A recursive call to the regex is an extension, to provide the + facility which can be obtained by $(?p{perl-code}) in Perl 5.6. */ + + case 'R': + if (ptr[3] != ')') + { + *errorptr = ERR29; + goto PCRE_ERROR_RETURN; + } + ptr += 3; + length += 1; + break; + + /* Lookbehinds are in Perl from version 5.005 */ + + case '<': + if (ptr[3] == '=' || ptr[3] == '!') + { + ptr += 3; + branch_newextra = 3; + length += 3; /* For the first branch */ + break; + } + *errorptr = ERR24; + goto PCRE_ERROR_RETURN; + + /* Conditionals are in Perl from version 5.005. The bracket must either + be followed by a number (for bracket reference) or by an assertion + group. */ + + case '(': + if ((compile_block.ctypes[ptr[3]] & ctype_digit) != 0) + { + ptr += 4; + length += 2; + while ((compile_block.ctypes[*ptr] & ctype_digit) != 0) ptr++; + if (*ptr != ')') + { + *errorptr = ERR26; + goto PCRE_ERROR_RETURN; + } + } + else /* An assertion must follow */ + { + ptr++; /* Can treat like ':' as far as spacing is concerned */ + + if (ptr[2] != '?' || strchr("=!<", ptr[3]) == NULL) + { + ptr += 2; /* To get right offset in message */ + *errorptr = ERR28; + goto PCRE_ERROR_RETURN; + } + } + break; + + /* Else loop checking valid options until ) is met. Anything else is an + error. If we are without any brackets, i.e. at top level, the settings + act as if specified in the options, so massage the options immediately. + This is for backward compatibility with Perl 5.004. */ + + default: + set = unset = 0; + optset = &set; + ptr += 2; + + for (;; ptr++) + { + c = *ptr; + switch (c) + { + case 'i': + *optset |= PCRE_CASELESS; + continue; + + case 'm': + *optset |= PCRE_MULTILINE; + continue; + + case 's': + *optset |= PCRE_DOTALL; + continue; + + case 'x': + *optset |= PCRE_EXTENDED; + continue; + + case 'X': + *optset |= PCRE_EXTRA; + continue; + + case 'U': + *optset |= PCRE_UNGREEDY; + continue; + + case '-': + optset = &unset; + continue; + + /* A termination by ')' indicates an options-setting-only item; + this is global at top level; otherwise nothing is done here and + it is handled during the compiling process on a per-bracket-group + basis. */ + + case ')': + if (brastackptr == 0) + { + options = (options | set) & (~unset); + set = unset = 0; /* To save length */ + } + /* Fall through */ + + /* A termination by ':' indicates the start of a nested group with + the given options set. This is again handled at compile time, but + we must allow for compiled space if any of the ims options are + set. We also have to allow for resetting space at the end of + the group, which is why 4 is added to the length and not just 2. + If there are several changes of options within the same group, this + will lead to an over-estimate on the length, but this shouldn't + matter very much. We also have to allow for resetting options at + the start of any alternations, which we do by setting + branch_newextra to 2. Finally, we record whether the case-dependent + flag ever changes within the regex. This is used by the "required + character" code. */ + + case ':': + if (((set|unset) & PCRE_IMS) != 0) + { + length += 4; + branch_newextra = 2; + if (((set|unset) & PCRE_CASELESS) != 0) options |= PCRE_ICHANGED; + } + goto END_OPTIONS; + + /* Unrecognized option character */ + + default: + *errorptr = ERR12; + goto PCRE_ERROR_RETURN; + } + } + + /* If we hit a closing bracket, that's it - this is a freestanding + option-setting. We need to ensure that branch_extra is updated if + necessary. The only values branch_newextra can have here are 0 or 2. + If the value is 2, then branch_extra must either be 2 or 5, depending + on whether this is a lookbehind group or not. */ + + END_OPTIONS: + if (c == ')') + { + if (branch_newextra == 2 && (branch_extra == 0 || branch_extra == 3)) + branch_extra += branch_newextra; + continue; + } + + /* If options were terminated by ':' control comes here. Fall through + to handle the group below. */ + } + } + + /* Extracting brackets must be counted so we can process escapes in a + Perlish way. */ + + else bracount++; + + /* Non-special forms of bracket. Save length for computing whole length + at end if there's a repeat that requires duplication of the group. Also + save the current value of branch_extra, and start the new group with + the new value. If non-zero, this will either be 2 for a (?imsx: group, or 3 + for a lookbehind assertion. */ + + if (brastackptr >= sizeof(brastack)/sizeof(int)) + { + *errorptr = ERR19; + goto PCRE_ERROR_RETURN; + } + + bralenstack[brastackptr] = branch_extra; + branch_extra = branch_newextra; + + brastack[brastackptr++] = length; + length += 3; + continue; + + /* Handle ket. Look for subsequent max/min; for certain sets of values we + have to replicate this bracket up to that many times. If brastackptr is + 0 this is an unmatched bracket which will generate an error, but take care + not to try to access brastack[-1] when computing the length and restoring + the branch_extra value. */ + + case ')': + length += 3; + { + int minval = 1; + int maxval = 1; + int duplength; + + if (brastackptr > 0) + { + duplength = length - brastack[--brastackptr]; + branch_extra = bralenstack[brastackptr]; + } + else duplength = 0; + + /* Leave ptr at the final char; for read_repeat_counts this happens + automatically; for the others we need an increment. */ + + if ((c = ptr[1]) == '{' && is_counted_repeat(ptr+2, &compile_block)) + { + ptr = read_repeat_counts(ptr+2, &minval, &maxval, errorptr, + &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + } + else if (c == '*') { minval = 0; maxval = -1; ptr++; } + else if (c == '+') { maxval = -1; ptr++; } + else if (c == '?') { minval = 0; ptr++; } + + /* If the minimum is zero, we have to allow for an OP_BRAZERO before the + group, and if the maximum is greater than zero, we have to replicate + maxval-1 times; each replication acquires an OP_BRAZERO plus a nesting + bracket set - hence the 7. */ + + if (minval == 0) + { + length++; + if (maxval > 0) length += (maxval - 1) * (duplength + 7); + } + + /* When the minimum is greater than zero, 1 we have to replicate up to + minval-1 times, with no additions required in the copies. Then, if + there is a limited maximum we have to replicate up to maxval-1 times + allowing for a BRAZERO item before each optional copy and nesting + brackets for all but one of the optional copies. */ + + else + { + length += (minval - 1) * duplength; + if (maxval > minval) /* Need this test as maxval=-1 means no limit */ + length += (maxval - minval) * (duplength + 7) - 6; + } + } + continue; + + /* Non-special character. For a run of such characters the length required + is the number of characters + 2, except that the maximum run length is 255. + We won't get a skipped space or a non-data escape or the start of a # + comment as the first character, so the length can't be zero. */ + + NORMAL_CHAR: + default: + length += 2; + runlength = 0; + do + { + if ((options & PCRE_EXTENDED) != 0) + { + if ((compile_block.ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + while ((c = *(++ptr)) != 0 && c != '\n'); + continue; + } + } + + /* Backslash may introduce a data char or a metacharacter; stop the + string before the latter. */ + + if (c == '\\') + { + const uschar *saveptr = ptr; + c = check_escape(&ptr, errorptr, bracount, options, FALSE, + &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (c < 0) { ptr = saveptr; break; } + } + + /* Ordinary character or single-char escape */ + + runlength++; + } + + /* This "while" is the end of the "do" above. */ + + while (runlength < 255 && + (compile_block.ctypes[c = *(++ptr)] & ctype_meta) == 0); + + ptr--; + length += runlength; + continue; + } + } + +length += 4; /* For final KET and END */ + +if (length > 65539) + { + *errorptr = ERR20; + return NULL; + } + +/* Compute the size of data block needed and get it, either from malloc or +externally provided function. We specify "code[0]" in the offsetof() expression +rather than just "code", because it has been reported that one broken compiler +fails on "code" because it is also an independent variable. It should make no +difference to the value of the offsetof(). */ + +size = length + offsetof(real_pcre, code[0]); +re = (real_pcre *)(pcre_malloc)(size); + +if (re == NULL) + { + *errorptr = ERR21; + return NULL; + } + +/* Put in the magic number, and save the size, options, and table pointer */ + +re->magic_number = MAGIC_NUMBER; +re->size = size; +re->options = options; +re->tables = tables; + +/* Set up a starting, non-extracting bracket, then compile the expression. On +error, *errorptr will be set non-NULL, so we don't need to look at the result +of the function here. */ + +ptr = (const uschar *)pattern; +code = re->code; +*code = OP_BRA; +bracount = 0; +(void)compile_regex(options, -1, &bracount, &code, &ptr, errorptr, FALSE, -1, + &reqchar, &countlits, &compile_block); +re->top_bracket = bracount; +re->top_backref = top_backref; + +/* If not reached end of pattern on success, there's an excess bracket. */ + +if (*errorptr == NULL && *ptr != 0) *errorptr = ERR22; + +/* Fill in the terminating state and check for disastrous overflow, but +if debugging, leave the test till after things are printed out. */ + +*code++ = OP_END; + +#ifndef DEBUG +if (code - re->code > length) *errorptr = ERR23; +#endif + +/* Give an error if there's back reference to a non-existent capturing +subpattern. */ + +if (top_backref > re->top_bracket) *errorptr = ERR15; + +/* Failed to compile */ + +if (*errorptr != NULL) + { + (pcre_free)(re); + PCRE_ERROR_RETURN: + *erroroffset = ptr - (const uschar *)pattern; + return NULL; + } + +/* If the anchored option was not passed, set flag if we can determine that the +pattern is anchored by virtue of ^ characters or \A or anything else (such as +starting with .* when DOTALL is set). + +Otherwise, see if we can determine what the first character has to be, because +that speeds up unanchored matches no end. If not, see if we can set the +PCRE_STARTLINE flag. This is helpful for multiline matches when all branches +start with ^. and also when all branches start with .* for non-DOTALL matches. +*/ + +if ((options & PCRE_ANCHORED) == 0) + { + int temp_options = options; + if (is_anchored(re->code, &temp_options)) + re->options |= PCRE_ANCHORED; + else + { + int ch = find_firstchar(re->code, &temp_options); + if (ch >= 0) + { + re->first_char = ch; + re->options |= PCRE_FIRSTSET; + } + else if (is_startline(re->code)) + re->options |= PCRE_STARTLINE; + } + } + +/* Save the last required character if there are at least two literal +characters on all paths, or if there is no first character setting. */ + +if (reqchar >= 0 && (countlits > 1 || (re->options & PCRE_FIRSTSET) == 0)) + { + re->req_char = reqchar; + re->options |= PCRE_REQCHSET; + } + +/* Print out the compiled data for debugging */ + +#ifdef DEBUG + +printf("Length = %d top_bracket = %d top_backref = %d\n", + length, re->top_bracket, re->top_backref); + +if (re->options != 0) + { + printf("%s%s%s%s%s%s%s%s%s\n", + ((re->options & PCRE_ANCHORED) != 0)? "anchored " : "", + ((re->options & PCRE_CASELESS) != 0)? "caseless " : "", + ((re->options & PCRE_ICHANGED) != 0)? "case state changed " : "", + ((re->options & PCRE_EXTENDED) != 0)? "extended " : "", + ((re->options & PCRE_MULTILINE) != 0)? "multiline " : "", + ((re->options & PCRE_DOTALL) != 0)? "dotall " : "", + ((re->options & PCRE_DOLLAR_ENDONLY) != 0)? "endonly " : "", + ((re->options & PCRE_EXTRA) != 0)? "extra " : "", + ((re->options & PCRE_UNGREEDY) != 0)? "ungreedy " : ""); + } + +if ((re->options & PCRE_FIRSTSET) != 0) + { + if (isprint(re->first_char)) printf("First char = %c\n", re->first_char); + else printf("First char = \\x%02x\n", re->first_char); + } + +if ((re->options & PCRE_REQCHSET) != 0) + { + if (isprint(re->req_char)) printf("Req char = %c\n", re->req_char); + else printf("Req char = \\x%02x\n", re->req_char); + } + +code_end = code; +code_base = code = re->code; + +while (code < code_end) + { + int charlength; + + printf("%3d ", code - code_base); + + if (*code >= OP_BRA) + { + printf("%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA); + code += 2; + } + + else switch(*code) + { + case OP_OPT: + printf(" %.2x %s", code[1], OP_names[*code]); + code++; + break; + + case OP_COND: + printf("%3d Cond", (code[1] << 8) + code[2]); + code += 2; + break; + + case OP_CREF: + printf(" %.2d %s", code[1], OP_names[*code]); + code++; + break; + + case OP_CHARS: + charlength = *(++code); + printf("%3d ", charlength); + while (charlength-- > 0) + if (isprint(c = *(++code))) printf("%c", c); else printf("\\x%02x", c); + break; + + case OP_KETRMAX: + case OP_KETRMIN: + case OP_ALT: + case OP_KET: + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + case OP_ONCE: + printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_REVERSE: + printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + if (*code >= OP_TYPESTAR) + printf(" %s", OP_names[code[1]]); + else if (isprint(c = code[1])) printf(" %c", c); + else printf(" \\x%02x", c); + printf("%s", OP_names[*code++]); + break; + + case OP_EXACT: + case OP_UPTO: + case OP_MINUPTO: + if (isprint(c = code[3])) printf(" %c{", c); + else printf(" \\x%02x{", c); + if (*code != OP_EXACT) printf("0,"); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_MINUPTO) printf("?"); + code += 3; + break; + + case OP_TYPEEXACT: + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + printf(" %s{", OP_names[code[3]]); + if (*code != OP_TYPEEXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_TYPEMINUPTO) printf("?"); + code += 3; + break; + + case OP_NOT: + if (isprint(c = *(++code))) printf(" [^%c]", c); + else printf(" [^\\x%02x]", c); + break; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + if (isprint(c = code[1])) printf(" [^%c]", c); + else printf(" [^\\x%02x]", c); + printf("%s", OP_names[*code++]); + break; + + case OP_NOTEXACT: + case OP_NOTUPTO: + case OP_NOTMINUPTO: + if (isprint(c = code[3])) printf(" [^%c]{", c); + else printf(" [^\\x%02x]{", c); + if (*code != OP_NOTEXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_NOTMINUPTO) printf("?"); + code += 3; + break; + + case OP_REF: + printf(" \\%d", *(++code)); + code ++; + goto CLASS_REF_REPEAT; + + case OP_CLASS: + { + int i, min, max; + code++; + printf(" ["); + + for (i = 0; i < 256; i++) + { + if ((code[i/8] & (1 << (i&7))) != 0) + { + int j; + for (j = i+1; j < 256; j++) + if ((code[j/8] & (1 << (j&7))) == 0) break; + if (i == '-' || i == ']') printf("\\"); + if (isprint(i)) printf("%c", i); else printf("\\x%02x", i); + if (--j > i) + { + printf("-"); + if (j == '-' || j == ']') printf("\\"); + if (isprint(j)) printf("%c", j); else printf("\\x%02x", j); + } + i = j; + } + } + printf("]"); + code += 32; + + CLASS_REF_REPEAT: + + switch(*code) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + printf("%s", OP_names[*code]); + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + min = (code[1] << 8) + code[2]; + max = (code[3] << 8) + code[4]; + if (max == 0) printf("{%d,}", min); + else printf("{%d,%d}", min, max); + if (*code == OP_CRMINRANGE) printf("?"); + code += 4; + break; + + default: + code--; + } + } + break; + + /* Anything else is just a one-node item */ + + default: + printf(" %s", OP_names[*code]); + break; + } + + code++; + printf("\n"); + } +printf("------------------------------------------------------------------\n"); + +/* This check is done here in the debugging case so that the code that +was compiled can be seen. */ + +if (code - re->code > length) + { + *errorptr = ERR23; + (pcre_free)(re); + *erroroffset = ptr - (uschar *)pattern; + return NULL; + } +#endif + +return (pcre *)re; +} + + + +/************************************************* +* Match a back-reference * +*************************************************/ + +/* If a back reference hasn't been set, the length that is passed is greater +than the number of characters left in the string, so the match fails. + +Arguments: + offset index into the offset vector + eptr points into the subject + length length to be matched + md points to match data block + ims the ims flags + +Returns: TRUE if matched +*/ + +static BOOL +match_ref(int offset, register const uschar *eptr, int length, match_data *md, + unsigned long int ims) +{ +const uschar *p = md->start_subject + md->offset_vector[offset]; + +#ifdef DEBUG +if (eptr >= md->end_subject) + printf("matching subject "); +else + { + printf("matching subject "); + pchars(eptr, length, TRUE, md); + } +printf(" against backref "); +pchars(p, length, FALSE, md); +printf("\n"); +#endif + +/* Always fail if not enough characters left */ + +if (length > md->end_subject - eptr) return FALSE; + +/* Separate the caselesss case for speed */ + +if ((ims & PCRE_CASELESS) != 0) + { + while (length-- > 0) + if (md->lcc[*p++] != md->lcc[*eptr++]) return FALSE; + } +else + { while (length-- > 0) if (*p++ != *eptr++) return FALSE; } + +return TRUE; +} + + + +/************************************************* +* Match from current position * +*************************************************/ + +/* On entry ecode points to the first opcode, and eptr to the first character +in the subject string, while eptrb holds the value of eptr at the start of the +last bracketed group - used for breaking infinite loops matching zero-length +strings. + +Arguments: + eptr pointer in subject + ecode position in code + offset_top current top pointer + md pointer to "static" info for the match + ims current /i, /m, and /s options + condassert TRUE if called to check a condition assertion + eptrb eptr at start of last bracket + +Returns: TRUE if matched +*/ + +static BOOL +match(register const uschar *eptr, register const uschar *ecode, + int offset_top, match_data *md, unsigned long int ims, BOOL condassert, + const uschar *eptrb) +{ +unsigned long int original_ims = ims; /* Save for resetting on ')' */ + +for (;;) + { + int op = (int)*ecode; + int min, max, ctype; + register int i; + register int c; + BOOL minimize = FALSE; + + /* Opening capturing bracket. If there is space in the offset vector, save + the current subject position in the working slot at the top of the vector. We + mustn't change the current values of the data slot, because they may be set + from a previous iteration of this group, and be referred to by a reference + inside the group. + + If the bracket fails to match, we need to restore this value and also the + values of the final offsets, in case they were set by a previous iteration of + the same bracket. + + If there isn't enough space in the offset vector, treat this as if it were a + non-capturing bracket. Don't worry about setting the flag for the error case + here; that is handled in the code for KET. */ + + if (op > OP_BRA) + { + int number = op - OP_BRA; + int offset = number << 1; + +#ifdef DEBUG + printf("start bracket %d subject=", number); + pchars(eptr, 16, TRUE, md); + printf("\n"); +#endif + + if (offset < md->offset_max) + { + int save_offset1 = md->offset_vector[offset]; + int save_offset2 = md->offset_vector[offset+1]; + int save_offset3 = md->offset_vector[md->offset_end - number]; + + DPRINTF(("saving %d %d %d\n", save_offset1, save_offset2, save_offset3)); + md->offset_vector[md->offset_end - number] = eptr - md->start_subject; + + do + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr)) return TRUE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + + DPRINTF(("bracket %d failed\n", number)); + + md->offset_vector[offset] = save_offset1; + md->offset_vector[offset+1] = save_offset2; + md->offset_vector[md->offset_end - number] = save_offset3; + return FALSE; + } + + /* Insufficient room for saving captured contents */ + + else op = OP_BRA; + } + + /* Other types of node can be handled by a switch */ + + switch(op) + { + case OP_BRA: /* Non-capturing bracket: optimized */ + DPRINTF(("start bracket 0\n")); + do + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr)) return TRUE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + DPRINTF(("bracket 0 failed\n")); + return FALSE; + + /* Conditional group: compilation checked that there are no more than + two branches. If the condition is false, skipping the first branch takes us + past the end if there is only one branch, but that's OK because that is + exactly what going to the ket would do. */ + + case OP_COND: + if (ecode[3] == OP_CREF) /* Condition is extraction test */ + { + int offset = ecode[4] << 1; /* Doubled reference number */ + return match(eptr, + ecode + ((offset < offset_top && md->offset_vector[offset] >= 0)? + 5 : 3 + (ecode[1] << 8) + ecode[2]), + offset_top, md, ims, FALSE, eptr); + } + + /* The condition is an assertion. Call match() to evaluate it - setting + the final argument TRUE causes it to stop at the end of an assertion. */ + + else + { + if (match(eptr, ecode+3, offset_top, md, ims, TRUE, NULL)) + { + ecode += 3 + (ecode[4] << 8) + ecode[5]; + while (*ecode == OP_ALT) ecode += (ecode[1] << 8) + ecode[2]; + } + else ecode += (ecode[1] << 8) + ecode[2]; + return match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr); + } + /* Control never reaches here */ + + /* Skip over conditional reference data if encountered (should not be) */ + + case OP_CREF: + ecode += 2; + break; + + /* End of the pattern. If PCRE_NOTEMPTY is set, fail if we have matched + an empty string - recursion will then try other alternatives, if any. */ + + case OP_END: + if (md->notempty && eptr == md->start_match) return FALSE; + md->end_match_ptr = eptr; /* Record where we ended */ + md->end_offset_top = offset_top; /* and how many extracts were taken */ + return TRUE; + + /* Change option settings */ + + case OP_OPT: + ims = ecode[1]; + ecode += 2; + DPRINTF(("ims set to %02lx\n", ims)); + break; + + /* Assertion brackets. Check the alternative branches in turn - the + matching won't pass the KET for an assertion. If any one branch matches, + the assertion is true. Lookbehind assertions have an OP_REVERSE item at the + start of each branch to move the current point backwards, so the code at + this level is identical to the lookahead case. */ + + case OP_ASSERT: + case OP_ASSERTBACK: + do + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, NULL)) break; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + if (*ecode == OP_KET) return FALSE; + + /* If checking an assertion for a condition, return TRUE. */ + + if (condassert) return TRUE; + + /* Continue from after the assertion, updating the offsets high water + mark, since extracts may have been taken during the assertion. */ + + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + ecode += 3; + offset_top = md->end_offset_top; + continue; + + /* Negative assertion: all branches must fail to match */ + + case OP_ASSERT_NOT: + case OP_ASSERTBACK_NOT: + do + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, NULL)) return FALSE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + + if (condassert) return TRUE; + ecode += 3; + continue; + + /* Move the subject pointer back. This occurs only at the start of + each branch of a lookbehind assertion. If we are too close to the start to + move back, this match function fails. */ + + case OP_REVERSE: + eptr -= (ecode[1] << 8) + ecode[2]; + if (eptr < md->start_subject) return FALSE; + ecode += 3; + break; + + /* Recursion matches the current regex, nested. If there are any capturing + brackets started but not finished, we have to save their starting points + and reinstate them after the recursion. However, we don't know how many + such there are (offset_top records the completed total) so we just have + to save all the potential data. There may be up to 99 such values, which + is a bit large to put on the stack, but using malloc for small numbers + seems expensive. As a compromise, the stack is used when there are fewer + than 16 values to store; otherwise malloc is used. A problem is what to do + if the malloc fails ... there is no way of returning to the top level with + an error. Save the top 15 values on the stack, and accept that the rest + may be wrong. */ + + case OP_RECURSE: + { + BOOL rc; + int *save; + int stacksave[15]; + + c = md->offset_max; + + if (c < 16) save = stacksave; else + { + save = (int *)(pcre_malloc)((c+1) * sizeof(int)); + if (save == NULL) + { + save = stacksave; + c = 15; + } + } + + for (i = 1; i <= c; i++) + save[i] = md->offset_vector[md->offset_end - i]; + rc = match(eptr, md->start_pattern, offset_top, md, ims, FALSE, eptrb); + for (i = 1; i <= c; i++) + md->offset_vector[md->offset_end - i] = save[i]; + if (save != stacksave) (pcre_free)(save); + if (!rc) return FALSE; + + /* In case the recursion has set more capturing values, save the final + number, then move along the subject till after the recursive match, + and advance one byte in the pattern code. */ + + offset_top = md->end_offset_top; + eptr = md->end_match_ptr; + ecode++; + } + break; + + /* "Once" brackets are like assertion brackets except that after a match, + the point in the subject string is not moved back. Thus there can never be + a move back into the brackets. Check the alternative branches in turn - the + matching won't pass the KET for this kind of subpattern. If any one branch + matches, we carry on as at the end of a normal bracket, leaving the subject + pointer. */ + + case OP_ONCE: + { + const uschar *prev = ecode; + + do + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr)) break; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + + /* If hit the end of the group (which could be repeated), fail */ + + if (*ecode != OP_ONCE && *ecode != OP_ALT) return FALSE; + + /* Continue as from after the assertion, updating the offsets high water + mark, since extracts may have been taken. */ + + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + + offset_top = md->end_offset_top; + eptr = md->end_match_ptr; + + /* For a non-repeating ket, just continue at this level. This also + happens for a repeating ket if no characters were matched in the group. + This is the forcible breaking of infinite loops as implemented in Perl + 5.005. If there is an options reset, it will get obeyed in the normal + course of events. */ + + if (*ecode == OP_KET || eptr == eptrb) + { + ecode += 3; + break; + } + + /* The repeating kets try the rest of the pattern or restart from the + preceding bracket, in the appropriate order. We need to reset any options + that changed within the bracket before re-running it, so check the next + opcode. */ + + if (ecode[3] == OP_OPT) + { + ims = (ims & ~PCRE_IMS) | ecode[4]; + DPRINTF(("ims set to %02lx at group repeat\n", ims)); + } + + if (*ecode == OP_KETRMIN) + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr) || + match(eptr, prev, offset_top, md, ims, FALSE, eptr)) return TRUE; + } + else /* OP_KETRMAX */ + { + if (match(eptr, prev, offset_top, md, ims, FALSE, eptr) || + match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr)) return TRUE; + } + } + return FALSE; + + /* An alternation is the end of a branch; scan along to find the end of the + bracketed group and go to there. */ + + case OP_ALT: + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + break; + + /* BRAZERO and BRAMINZERO occur just before a bracket group, indicating + that it may occur zero times. It may repeat infinitely, or not at all - + i.e. it could be ()* or ()? in the pattern. Brackets with fixed upper + repeat limits are compiled as a number of copies, with the optional ones + preceded by BRAZERO or BRAMINZERO. */ + + case OP_BRAZERO: + { + const uschar *next = ecode+1; + if (match(eptr, next, offset_top, md, ims, FALSE, eptr)) return TRUE; + do next += (next[1] << 8) + next[2]; while (*next == OP_ALT); + ecode = next + 3; + } + break; + + case OP_BRAMINZERO: + { + const uschar *next = ecode+1; + do next += (next[1] << 8) + next[2]; while (*next == OP_ALT); + if (match(eptr, next+3, offset_top, md, ims, FALSE, eptr)) return TRUE; + ecode++; + } + break; + + /* End of a group, repeated or non-repeating. If we are at the end of + an assertion "group", stop matching and return TRUE, but record the + current high water mark for use by positive assertions. Do this also + for the "once" (not-backup up) groups. */ + + case OP_KET: + case OP_KETRMIN: + case OP_KETRMAX: + { + const uschar *prev = ecode - (ecode[1] << 8) - ecode[2]; + + if (*prev == OP_ASSERT || *prev == OP_ASSERT_NOT || + *prev == OP_ASSERTBACK || *prev == OP_ASSERTBACK_NOT || + *prev == OP_ONCE) + { + md->end_match_ptr = eptr; /* For ONCE */ + md->end_offset_top = offset_top; + return TRUE; + } + + /* In all other cases except a conditional group we have to check the + group number back at the start and if necessary complete handling an + extraction by setting the offsets and bumping the high water mark. */ + + if (*prev != OP_COND) + { + int number = *prev - OP_BRA; + int offset = number << 1; + + DPRINTF(("end bracket %d\n", number)); + + if (number > 0) + { + if (offset >= md->offset_max) md->offset_overflow = TRUE; else + { + md->offset_vector[offset] = + md->offset_vector[md->offset_end - number]; + md->offset_vector[offset+1] = eptr - md->start_subject; + if (offset_top <= offset) offset_top = offset + 2; + } + } + } + + /* Reset the value of the ims flags, in case they got changed during + the group. */ + + ims = original_ims; + DPRINTF(("ims reset to %02lx\n", ims)); + + /* For a non-repeating ket, just continue at this level. This also + happens for a repeating ket if no characters were matched in the group. + This is the forcible breaking of infinite loops as implemented in Perl + 5.005. If there is an options reset, it will get obeyed in the normal + course of events. */ + + if (*ecode == OP_KET || eptr == eptrb) + { + ecode += 3; + break; + } + + /* The repeating kets try the rest of the pattern or restart from the + preceding bracket, in the appropriate order. */ + + if (*ecode == OP_KETRMIN) + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr) || + match(eptr, prev, offset_top, md, ims, FALSE, eptr)) return TRUE; + } + else /* OP_KETRMAX */ + { + if (match(eptr, prev, offset_top, md, ims, FALSE, eptr) || + match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr)) return TRUE; + } + } + return FALSE; + + /* Start of subject unless notbol, or after internal newline if multiline */ + + case OP_CIRC: + if (md->notbol && eptr == md->start_subject) return FALSE; + if ((ims & PCRE_MULTILINE) != 0) + { + if (eptr != md->start_subject && eptr[-1] != '\n') return FALSE; + ecode++; + break; + } + /* ... else fall through */ + + /* Start of subject assertion */ + + case OP_SOD: + if (eptr != md->start_subject) return FALSE; + ecode++; + break; + + /* Assert before internal newline if multiline, or before a terminating + newline unless endonly is set, else end of subject unless noteol is set. */ + + case OP_DOLL: + if ((ims & PCRE_MULTILINE) != 0) + { + if (eptr < md->end_subject) { if (*eptr != '\n') return FALSE; } + else { if (md->noteol) return FALSE; } + ecode++; + break; + } + else + { + if (md->noteol) return FALSE; + if (!md->endonly) + { + if (eptr < md->end_subject - 1 || + (eptr == md->end_subject - 1 && *eptr != '\n')) return FALSE; + + ecode++; + break; + } + } + /* ... else fall through */ + + /* End of subject assertion (\z) */ + + case OP_EOD: + if (eptr < md->end_subject) return FALSE; + ecode++; + break; + + /* End of subject or ending \n assertion (\Z) */ + + case OP_EODN: + if (eptr < md->end_subject - 1 || + (eptr == md->end_subject - 1 && *eptr != '\n')) return FALSE; + ecode++; + break; + + /* Word boundary assertions */ + + case OP_NOT_WORD_BOUNDARY: + case OP_WORD_BOUNDARY: + { + BOOL prev_is_word = (eptr != md->start_subject) && + ((md->ctypes[eptr[-1]] & ctype_word) != 0); + BOOL cur_is_word = (eptr < md->end_subject) && + ((md->ctypes[*eptr] & ctype_word) != 0); + if ((*ecode++ == OP_WORD_BOUNDARY)? + cur_is_word == prev_is_word : cur_is_word != prev_is_word) + return FALSE; + } + break; + + /* Match a single character type; inline for speed */ + + case OP_ANY: + if ((ims & PCRE_DOTALL) == 0 && eptr < md->end_subject && *eptr == '\n') + return FALSE; + if (eptr++ >= md->end_subject) return FALSE; + ecode++; + break; + + case OP_NOT_DIGIT: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_digit) != 0) + return FALSE; + ecode++; + break; + + case OP_DIGIT: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_digit) == 0) + return FALSE; + ecode++; + break; + + case OP_NOT_WHITESPACE: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_space) != 0) + return FALSE; + ecode++; + break; + + case OP_WHITESPACE: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_space) == 0) + return FALSE; + ecode++; + break; + + case OP_NOT_WORDCHAR: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_word) != 0) + return FALSE; + ecode++; + break; + + case OP_WORDCHAR: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_word) == 0) + return FALSE; + ecode++; + break; + + /* Match a back reference, possibly repeatedly. Look past the end of the + item to see if there is repeat information following. The code is similar + to that for character classes, but repeated for efficiency. Then obey + similar code to character type repeats - written out again for speed. + However, if the referenced string is the empty string, always treat + it as matched, any number of times (otherwise there could be infinite + loops). */ + + case OP_REF: + { + int length; + int offset = ecode[1] << 1; /* Doubled reference number */ + ecode += 2; /* Advance past the item */ + + /* If the reference is unset, set the length to be longer than the amount + of subject left; this ensures that every attempt at a match fails. We + can't just fail here, because of the possibility of quantifiers with zero + minima. */ + + length = (offset >= offset_top || md->offset_vector[offset] < 0)? + md->end_subject - eptr + 1 : + md->offset_vector[offset+1] - md->offset_vector[offset]; + + /* Set up for repetition, or handle the non-repeated case */ + + switch (*ecode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + c = *ecode++ - OP_CRSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + minimize = (*ecode == OP_CRMINRANGE); + min = (ecode[1] << 8) + ecode[2]; + max = (ecode[3] << 8) + ecode[4]; + if (max == 0) max = INT_MAX; + ecode += 5; + break; + + default: /* No repeat follows */ + if (!match_ref(offset, eptr, length, md, ims)) return FALSE; + eptr += length; + continue; /* With the main loop */ + } + + /* If the length of the reference is zero, just continue with the + main loop. */ + + if (length == 0) continue; + + /* First, ensure the minimum number of matches are present. We get back + the length of the reference string explicitly rather than passing the + address of eptr, so that eptr can be a register variable. */ + + for (i = 1; i <= min; i++) + { + if (!match_ref(offset, eptr, length, md, ims)) return FALSE; + eptr += length; + } + + /* If min = max, continue at the same level without recursion. + They are not both allowed to be zero. */ + + if (min == max) continue; + + /* If minimizing, keep trying and advancing the pointer */ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || !match_ref(offset, eptr, length, md, ims)) + return FALSE; + eptr += length; + } + /* Control never gets here */ + } + + /* If maximizing, find the longest string and work backwards */ + + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (!match_ref(offset, eptr, length, md, ims)) break; + eptr += length; + } + while (eptr >= pp) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + eptr -= length; + } + return FALSE; + } + } + /* Control never gets here */ + + + + /* Match a character class, possibly repeatedly. Look past the end of the + item to see if there is repeat information following. Then obey similar + code to character type repeats - written out again for speed. */ + + case OP_CLASS: + { + const uschar *data = ecode + 1; /* Save for matching */ + ecode += 33; /* Advance past the item */ + + switch (*ecode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + c = *ecode++ - OP_CRSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + minimize = (*ecode == OP_CRMINRANGE); + min = (ecode[1] << 8) + ecode[2]; + max = (ecode[3] << 8) + ecode[4]; + if (max == 0) max = INT_MAX; + ecode += 5; + break; + + default: /* No repeat follows */ + min = max = 1; + break; + } + + /* First, ensure the minimum number of matches are present. */ + + for (i = 1; i <= min; i++) + { + if (eptr >= md->end_subject) return FALSE; + c = *eptr++; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + return FALSE; + } + + /* If max == min we can continue with the main loop without the + need to recurse. */ + + if (min == max) continue; + + /* If minimizing, keep testing the rest of the expression and advancing + the pointer while it matches the class. */ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || eptr >= md->end_subject) return FALSE; + c = *eptr++; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + return FALSE; + } + /* Control never gets here */ + } + + /* If maximizing, find the longest possible run, then work backwards. */ + + else + { + const uschar *pp = eptr; + for (i = min; i < max; eptr++, i++) + { + if (eptr >= md->end_subject) break; + c = *eptr; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + break; + } + + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + } + /* Control never gets here */ + + /* Match a run of characters */ + + case OP_CHARS: + { + register int length = ecode[1]; + ecode += 2; + +#ifdef DEBUG /* Sigh. Some compilers never learn. */ + if (eptr >= md->end_subject) + printf("matching subject against pattern "); + else + { + printf("matching subject "); + pchars(eptr, length, TRUE, md); + printf(" against pattern "); + } + pchars(ecode, length, FALSE, md); + printf("\n"); +#endif + + if (length > md->end_subject - eptr) return FALSE; + if ((ims & PCRE_CASELESS) != 0) + { + while (length-- > 0) + if (md->lcc[*ecode++] != md->lcc[*eptr++]) + return FALSE; + } + else + { + while (length-- > 0) if (*ecode++ != *eptr++) return FALSE; + } + } + break; + + /* Match a single character repeatedly; different opcodes share code. */ + + case OP_EXACT: + min = max = (ecode[1] << 8) + ecode[2]; + ecode += 3; + goto REPEATCHAR; + + case OP_UPTO: + case OP_MINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_MINUPTO; + ecode += 3; + goto REPEATCHAR; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + c = *ecode++ - OP_STAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single-character matches. We can give + up quickly if there are fewer than the minimum number of characters left in + the subject. */ + + REPEATCHAR: + if (min > md->end_subject - eptr) return FALSE; + c = *ecode++; + + /* The code is duplicated for the caseless and caseful cases, for speed, + since matching characters is likely to be quite common. First, ensure the + minimum number of matches are present. If min = max, continue at the same + level without recursing. Otherwise, if minimizing, keep trying the rest of + the expression and advancing one matching character if failing, up to the + maximum. Alternatively, if maximizing, find the maximum number of + characters and work backwards. */ + + DPRINTF(("matching %c{%d,%d} against subject %.*s\n", c, min, max, + max, eptr)); + + if ((ims & PCRE_CASELESS) != 0) + { + c = md->lcc[c]; + for (i = 1; i <= min; i++) + if (c != md->lcc[*eptr++]) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || eptr >= md->end_subject || + c != md->lcc[*eptr++]) + return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c != md->lcc[*eptr]) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + /* Control never gets here */ + } + + /* Caseful comparisons */ + + else + { + for (i = 1; i <= min; i++) if (c != *eptr++) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || eptr >= md->end_subject || c != *eptr++) return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c != *eptr) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + } + /* Control never gets here */ + + /* Match a negated single character */ + + case OP_NOT: + if (eptr >= md->end_subject) return FALSE; + ecode++; + if ((ims & PCRE_CASELESS) != 0) + { + if (md->lcc[*ecode++] == md->lcc[*eptr++]) return FALSE; + } + else + { + if (*ecode++ == *eptr++) return FALSE; + } + break; + + /* Match a negated single character repeatedly. This is almost a repeat of + the code for a repeated single character, but I haven't found a nice way of + commoning these up that doesn't require a test of the positive/negative + option for each character match. Maybe that wouldn't add very much to the + time taken, but character matching *is* what this is all about... */ + + case OP_NOTEXACT: + min = max = (ecode[1] << 8) + ecode[2]; + ecode += 3; + goto REPEATNOTCHAR; + + case OP_NOTUPTO: + case OP_NOTMINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_NOTMINUPTO; + ecode += 3; + goto REPEATNOTCHAR; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + c = *ecode++ - OP_NOTSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single-character matches. We can give + up quickly if there are fewer than the minimum number of characters left in + the subject. */ + + REPEATNOTCHAR: + if (min > md->end_subject - eptr) return FALSE; + c = *ecode++; + + /* The code is duplicated for the caseless and caseful cases, for speed, + since matching characters is likely to be quite common. First, ensure the + minimum number of matches are present. If min = max, continue at the same + level without recursing. Otherwise, if minimizing, keep trying the rest of + the expression and advancing one matching character if failing, up to the + maximum. Alternatively, if maximizing, find the maximum number of + characters and work backwards. */ + + DPRINTF(("negative matching %c{%d,%d} against subject %.*s\n", c, min, max, + max, eptr)); + + if ((ims & PCRE_CASELESS) != 0) + { + c = md->lcc[c]; + for (i = 1; i <= min; i++) + if (c == md->lcc[*eptr++]) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || eptr >= md->end_subject || + c == md->lcc[*eptr++]) + return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c == md->lcc[*eptr]) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + /* Control never gets here */ + } + + /* Caseful comparisons */ + + else + { + for (i = 1; i <= min; i++) if (c == *eptr++) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || eptr >= md->end_subject || c == *eptr++) return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c == *eptr) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + } + /* Control never gets here */ + + /* Match a single character type repeatedly; several different opcodes + share code. This is very similar to the code for single characters, but we + repeat it in the interests of efficiency. */ + + case OP_TYPEEXACT: + min = max = (ecode[1] << 8) + ecode[2]; + minimize = TRUE; + ecode += 3; + goto REPEATTYPE; + + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_TYPEMINUPTO; + ecode += 3; + goto REPEATTYPE; + + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + c = *ecode++ - OP_TYPESTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single character type matches */ + + REPEATTYPE: + ctype = *ecode++; /* Code for the character type */ + + /* First, ensure the minimum number of matches are present. Use inline + code for maximizing the speed, and do the type test once at the start + (i.e. keep it out of the loop). Also test that there are at least the + minimum number of characters before we start. */ + + if (min > md->end_subject - eptr) return FALSE; + if (min > 0) switch(ctype) + { + case OP_ANY: + if ((ims & PCRE_DOTALL) == 0) + { for (i = 1; i <= min; i++) if (*eptr++ == '\n') return FALSE; } + else eptr += min; + break; + + case OP_NOT_DIGIT: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_digit) != 0) return FALSE; + break; + + case OP_DIGIT: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_digit) == 0) return FALSE; + break; + + case OP_NOT_WHITESPACE: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_space) != 0) return FALSE; + break; + + case OP_WHITESPACE: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_space) == 0) return FALSE; + break; + + case OP_NOT_WORDCHAR: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_word) != 0) + return FALSE; + break; + + case OP_WORDCHAR: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_word) == 0) + return FALSE; + break; + } + + /* If min = max, continue at the same level without recursing */ + + if (min == max) continue; + + /* If minimizing, we have to test the rest of the pattern before each + subsequent match. */ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) return TRUE; + if (i >= max || eptr >= md->end_subject) return FALSE; + + c = *eptr++; + switch(ctype) + { + case OP_ANY: + if ((ims & PCRE_DOTALL) == 0 && c == '\n') return FALSE; + break; + + case OP_NOT_DIGIT: + if ((md->ctypes[c] & ctype_digit) != 0) return FALSE; + break; + + case OP_DIGIT: + if ((md->ctypes[c] & ctype_digit) == 0) return FALSE; + break; + + case OP_NOT_WHITESPACE: + if ((md->ctypes[c] & ctype_space) != 0) return FALSE; + break; + + case OP_WHITESPACE: + if ((md->ctypes[c] & ctype_space) == 0) return FALSE; + break; + + case OP_NOT_WORDCHAR: + if ((md->ctypes[c] & ctype_word) != 0) return FALSE; + break; + + case OP_WORDCHAR: + if ((md->ctypes[c] & ctype_word) == 0) return FALSE; + break; + } + } + /* Control never gets here */ + } + + /* If maximizing it is worth using inline code for speed, doing the type + test once at the start (i.e. keep it out of the loop). */ + + else + { + const uschar *pp = eptr; + switch(ctype) + { + case OP_ANY: + if ((ims & PCRE_DOTALL) == 0) + { + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || *eptr == '\n') break; + eptr++; + } + } + else + { + c = max - min; + if (c > md->end_subject - eptr) c = md->end_subject - eptr; + eptr += c; + } + break; + + case OP_NOT_DIGIT: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_digit) != 0) + break; + eptr++; + } + break; + + case OP_DIGIT: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_digit) == 0) + break; + eptr++; + } + break; + + case OP_NOT_WHITESPACE: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_space) != 0) + break; + eptr++; + } + break; + + case OP_WHITESPACE: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_space) == 0) + break; + eptr++; + } + break; + + case OP_NOT_WORDCHAR: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_word) != 0) + break; + eptr++; + } + break; + + case OP_WORDCHAR: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_word) == 0) + break; + eptr++; + } + break; + } + + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + /* Control never gets here */ + + /* There's been some horrible disaster. */ + + default: + DPRINTF(("Unknown opcode %d\n", *ecode)); + md->errorcode = PCRE_ERROR_UNKNOWN_NODE; + return FALSE; + } + + /* Do not stick any code in here without much thought; it is assumed + that "continue" in the code above comes out to here to repeat the main + loop. */ + + } /* End of main loop */ +/* Control never reaches here */ +} + + + + +/************************************************* +* Execute a Regular Expression * +*************************************************/ + +/* This function applies a compiled re to a subject string and picks out +portions of the string if it matches. Two elements in the vector are set for +each substring: the offsets to the start and end of the substring. + +Arguments: + external_re points to the compiled expression + external_extra points to "hints" from pcre_study() or is NULL + subject points to the subject string + length length of subject string (may contain binary zeros) + start_offset where to start in the subject string + options option bits + offsets points to a vector of ints to be filled in with offsets + offsetcount the number of elements in the vector + +Returns: > 0 => success; value is the number of elements filled in + = 0 => success, but offsets is not big enough + -1 => failed to match + < -1 => some kind of unexpected problem +*/ + +int +pcre_exec(const pcre *external_re, const pcre_extra *external_extra, + const char *subject, int length, int start_offset, int options, int *offsets, + int offsetcount) +{ +int resetcount, ocount; +int first_char = -1; +int req_char = -1; +int req_char2 = -1; +unsigned long int ims = 0; +match_data match_block; +const uschar *start_bits = NULL; +const uschar *start_match = (const uschar *)subject + start_offset; +const uschar *end_subject; +const uschar *req_char_ptr = start_match - 1; +const real_pcre *re = (const real_pcre *)external_re; +const real_pcre_extra *extra = (const real_pcre_extra *)external_extra; +BOOL using_temporary_offsets = FALSE; +BOOL anchored = ((re->options | options) & PCRE_ANCHORED) != 0; +BOOL startline = (re->options & PCRE_STARTLINE) != 0; + +if ((options & ~PUBLIC_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION; + +if (re == NULL || subject == NULL || + (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL; +if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; + +match_block.start_pattern = re->code; +match_block.start_subject = (const uschar *)subject; +match_block.end_subject = match_block.start_subject + length; +end_subject = match_block.end_subject; + +match_block.endonly = (re->options & PCRE_DOLLAR_ENDONLY) != 0; + +match_block.notbol = (options & PCRE_NOTBOL) != 0; +match_block.noteol = (options & PCRE_NOTEOL) != 0; +match_block.notempty = (options & PCRE_NOTEMPTY) != 0; + +match_block.errorcode = PCRE_ERROR_NOMATCH; /* Default error */ + +match_block.lcc = re->tables + lcc_offset; +match_block.ctypes = re->tables + ctypes_offset; + +/* The ims options can vary during the matching as a result of the presence +of (?ims) items in the pattern. They are kept in a local variable so that +restoring at the exit of a group is easy. */ + +ims = re->options & (PCRE_CASELESS|PCRE_MULTILINE|PCRE_DOTALL); + +/* If the expression has got more back references than the offsets supplied can +hold, we get a temporary bit of working store to use during the matching. +Otherwise, we can use the vector supplied, rounding down its size to a multiple +of 3. */ + +ocount = offsetcount - (offsetcount % 3); + +if (re->top_backref > 0 && re->top_backref >= ocount/3) + { + ocount = re->top_backref * 3 + 3; + match_block.offset_vector = (int *)(pcre_malloc)(ocount * sizeof(int)); + if (match_block.offset_vector == NULL) return PCRE_ERROR_NOMEMORY; + using_temporary_offsets = TRUE; + DPRINTF(("Got memory to hold back references\n")); + } +else match_block.offset_vector = offsets; + +match_block.offset_end = ocount; +match_block.offset_max = (2*ocount)/3; +match_block.offset_overflow = FALSE; + +/* Compute the minimum number of offsets that we need to reset each time. Doing +this makes a huge difference to execution time when there aren't many brackets +in the pattern. */ + +resetcount = 2 + re->top_bracket * 2; +if (resetcount > offsetcount) resetcount = ocount; + +/* Reset the working variable associated with each extraction. These should +never be used unless previously set, but they get saved and restored, and so we +initialize them to avoid reading uninitialized locations. */ + +if (match_block.offset_vector != NULL) + { + register int *iptr = match_block.offset_vector + ocount; + register int *iend = iptr - resetcount/2 + 1; + while (--iptr >= iend) *iptr = -1; + } + +/* Set up the first character to match, if available. The first_char value is +never set for an anchored regular expression, but the anchoring may be forced +at run time, so we have to test for anchoring. The first char may be unset for +an unanchored pattern, of course. If there's no first char and the pattern was +studied, there may be a bitmap of possible first characters. */ + +if (!anchored) + { + if ((re->options & PCRE_FIRSTSET) != 0) + { + first_char = re->first_char; + if ((ims & PCRE_CASELESS) != 0) first_char = match_block.lcc[first_char]; + } + else + if (!startline && extra != NULL && + (extra->options & PCRE_STUDY_MAPPED) != 0) + start_bits = extra->start_bits; + } + +/* For anchored or unanchored matches, there may be a "last known required +character" set. If the PCRE_CASELESS is set, implying that the match starts +caselessly, or if there are any changes of this flag within the regex, set up +both cases of the character. Otherwise set the two values the same, which will +avoid duplicate testing (which takes significant time). This covers the vast +majority of cases. It will be suboptimal when the case flag changes in a regex +and the required character in fact is caseful. */ + +if ((re->options & PCRE_REQCHSET) != 0) + { + req_char = re->req_char; + req_char2 = ((re->options & (PCRE_CASELESS | PCRE_ICHANGED)) != 0)? + (re->tables + fcc_offset)[req_char] : req_char; + } + +/* Loop for handling unanchored repeated matching attempts; for anchored regexs +the loop runs just once. */ + +do + { + int rc; + register int *iptr = match_block.offset_vector; + register int *iend = iptr + resetcount; + + /* Reset the maximum number of extractions we might see. */ + + while (iptr < iend) *iptr++ = -1; + + /* Advance to a unique first char if possible */ + + if (first_char >= 0) + { + if ((ims & PCRE_CASELESS) != 0) + while (start_match < end_subject && + match_block.lcc[*start_match] != first_char) + start_match++; + else + while (start_match < end_subject && *start_match != first_char) + start_match++; + } + + /* Or to just after \n for a multiline match if possible */ + + else if (startline) + { + if (start_match > match_block.start_subject + start_offset) + { + while (start_match < end_subject && start_match[-1] != '\n') + start_match++; + } + } + + /* Or to a non-unique first char after study */ + + else if (start_bits != NULL) + { + while (start_match < end_subject) + { + register int c = *start_match; + if ((start_bits[c/8] & (1 << (c&7))) == 0) start_match++; else break; + } + } + +#ifdef DEBUG /* Sigh. Some compilers never learn. */ + printf(">>>> Match against: "); + pchars(start_match, end_subject - start_match, TRUE, &match_block); + printf("\n"); +#endif + + /* If req_char is set, we know that that character must appear in the subject + for the match to succeed. If the first character is set, req_char must be + later in the subject; otherwise the test starts at the match point. This + optimization can save a huge amount of backtracking in patterns with nested + unlimited repeats that aren't going to match. We don't know what the state of + case matching may be when this character is hit, so test for it in both its + cases if necessary. However, the different cased versions will not be set up + unless PCRE_CASELESS was given or the casing state changes within the regex. + Writing separate code makes it go faster, as does using an autoincrement and + backing off on a match. */ + + if (req_char >= 0) + { + register const uschar *p = start_match + ((first_char >= 0)? 1 : 0); + + /* We don't need to repeat the search if we haven't yet reached the + place we found it at last time. */ + + if (p > req_char_ptr) + { + /* Do a single test if no case difference is set up */ + + if (req_char == req_char2) + { + while (p < end_subject) + { + if (*p++ == req_char) { p--; break; } + } + } + + /* Otherwise test for either case */ + + else + { + while (p < end_subject) + { + register int pp = *p++; + if (pp == req_char || pp == req_char2) { p--; break; } + } + } + + /* If we can't find the required character, break the matching loop */ + + if (p >= end_subject) break; + + /* If we have found the required character, save the point where we + found it, so that we don't search again next time round the loop if + the start hasn't passed this character yet. */ + + req_char_ptr = p; + } + } + + /* When a match occurs, substrings will be set for all internal extractions; + we just need to set up the whole thing as substring 0 before returning. If + there were too many extractions, set the return code to zero. In the case + where we had to get some local store to hold offsets for backreferences, copy + those back references that we can. In this case there need not be overflow + if certain parts of the pattern were not used. */ + + match_block.start_match = start_match; + if (!match(start_match, re->code, 2, &match_block, ims, FALSE, start_match)) + continue; + + /* Copy the offset information from temporary store if necessary */ + + if (using_temporary_offsets) + { + if (offsetcount >= 4) + { + memcpy(offsets + 2, match_block.offset_vector + 2, + (offsetcount - 2) * sizeof(int)); + DPRINTF(("Copied offsets from temporary memory\n")); + } + if (match_block.end_offset_top > offsetcount) + match_block.offset_overflow = TRUE; + + DPRINTF(("Freeing temporary memory\n")); + (pcre_free)(match_block.offset_vector); + } + + rc = match_block.offset_overflow? 0 : match_block.end_offset_top/2; + + if (match_block.offset_end < 2) rc = 0; else + { + offsets[0] = start_match - match_block.start_subject; + offsets[1] = match_block.end_match_ptr - match_block.start_subject; + } + + DPRINTF((">>>> returning %d\n", rc)); + return rc; + } + +/* This "while" is the end of the "do" above */ + +while (!anchored && + match_block.errorcode == PCRE_ERROR_NOMATCH && + start_match++ < end_subject); + +if (using_temporary_offsets) + { + DPRINTF(("Freeing temporary memory\n")); + (pcre_free)(match_block.offset_vector); + } + +DPRINTF((">>>> returning %d\n", match_block.errorcode)); + +return match_block.errorcode; +} + +/* End of pcre.c */ diff --git a/ext/pcre/pcrelib/pcre.def b/ext/pcre/pcrelib/pcre.def new file mode 100644 index 0000000000..0e8cf3f442 --- /dev/null +++ b/ext/pcre/pcrelib/pcre.def @@ -0,0 +1,19 @@ +EXPORTS + +pcre_malloc DATA +pcre_free DATA + +pcre_compile +pcre_copy_substring +pcre_exec +pcre_get_substring +pcre_get_substring_list +pcre_info +pcre_maketables +pcre_study +pcre_version + +regcomp +regexec +regerror +regfree diff --git a/ext/pcre/pcrelib/pcre.h b/ext/pcre/pcrelib/pcre.h new file mode 100644 index 0000000000..85e2e74259 --- /dev/null +++ b/ext/pcre/pcrelib/pcre.h @@ -0,0 +1,107 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* Copyright (c) 1997-2000 University of Cambridge */ + +#ifndef _PCRE_H +#define _PCRE_H + +#define PCRE_MAJOR 3 +#define PCRE_MINOR 1 +#define PCRE_DATE 09-Feb-2000 + +#include "php_compat.h" + +/* Win32 uses DLL by default */ + +#ifdef _WIN32 +# ifdef STATIC +# define PCRE_DL_IMPORT +# else +# define PCRE_DL_IMPORT __declspec(dllimport) +# endif +#else +# define PCRE_DL_IMPORT +#endif + +/* Have to include stdlib.h in order to ensure that size_t is defined; +it is needed here for malloc. */ + +#include +#include + +/* Allow for C++ users */ + +#ifdef __cplusplus +extern "C" { +#endif + +/* Options */ + +#define PCRE_CASELESS 0x0001 +#define PCRE_MULTILINE 0x0002 +#define PCRE_DOTALL 0x0004 +#define PCRE_EXTENDED 0x0008 +#define PCRE_ANCHORED 0x0010 +#define PCRE_DOLLAR_ENDONLY 0x0020 +#define PCRE_EXTRA 0x0040 +#define PCRE_NOTBOL 0x0080 +#define PCRE_NOTEOL 0x0100 +#define PCRE_UNGREEDY 0x0200 +#define PCRE_NOTEMPTY 0x0400 + +/* Exec-time and get-time error codes */ + +#define PCRE_ERROR_NOMATCH (-1) +#define PCRE_ERROR_NULL (-2) +#define PCRE_ERROR_BADOPTION (-3) +#define PCRE_ERROR_BADMAGIC (-4) +#define PCRE_ERROR_UNKNOWN_NODE (-5) +#define PCRE_ERROR_NOMEMORY (-6) +#define PCRE_ERROR_NOSUBSTRING (-7) + +/* Request types for pcre_fullinfo() */ + +#define PCRE_INFO_OPTIONS 0 +#define PCRE_INFO_SIZE 1 +#define PCRE_INFO_CAPTURECOUNT 2 +#define PCRE_INFO_BACKREFMAX 3 +#define PCRE_INFO_FIRSTCHAR 4 +#define PCRE_INFO_FIRSTTABLE 5 +#define PCRE_INFO_LASTLITERAL 6 + +/* Types */ + +typedef void pcre; +typedef void pcre_extra; + +/* Store get and free functions. These can be set to alternative malloc/free +functions if required. Some magic is required for Win32 DLL; it is null on +other OS. */ + +PCRE_DL_IMPORT extern void *(*pcre_malloc)(size_t); +PCRE_DL_IMPORT extern void (*pcre_free)(void *); + +#undef PCRE_DL_IMPORT + +/* Functions */ + +extern pcre *pcre_compile(const char *, int, const char **, int *, + const unsigned char *); +extern int pcre_copy_substring(const char *, int *, int, int, char *, int); +extern int pcre_exec(const pcre *, const pcre_extra *, const char *, + int, int, int, int *, int); +extern int pcre_get_substring(const char *, int *, int, int, const char **); +extern int pcre_get_substring_list(const char *, int *, int, const char ***); +extern int pcre_info(const pcre *, int *, int *); +extern int pcre_fullinfo(const pcre *, const pcre_extra *, int, void *); +extern unsigned const char *pcre_maketables(void); +extern pcre_extra *pcre_study(const pcre *, int, const char **); +extern const char *pcre_version(void); + +#ifdef __cplusplus +} /* extern "C" */ +#endif + +#endif /* End of pcre.h */ diff --git a/ext/pcre/pcrelib/pcreposix.c b/ext/pcre/pcrelib/pcreposix.c new file mode 100644 index 0000000000..7c66ccec89 --- /dev/null +++ b/ext/pcre/pcrelib/pcreposix.c @@ -0,0 +1,276 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +This module is a wrapper that provides a POSIX API to the underlying PCRE +functions. + +Written by: Philip Hazel + + Copyright (c) 1997-2000 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- +*/ + +#include "internal.h" +#include "pcreposix.h" +#include "stdlib.h" + + + +/* Corresponding tables of PCRE error messages and POSIX error codes. */ + +static const char *estring[] = { + ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9, ERR10, + ERR11, ERR12, ERR13, ERR14, ERR15, ERR16, ERR17, ERR18, ERR19, ERR20, + ERR21, ERR22, ERR23, ERR24, ERR25, ERR26, ERR27, ERR29, ERR29, ERR30, + ERR31 }; + +static int eint[] = { + REG_EESCAPE, /* "\\ at end of pattern" */ + REG_EESCAPE, /* "\\c at end of pattern" */ + REG_EESCAPE, /* "unrecognized character follows \\" */ + REG_BADBR, /* "numbers out of order in {} quantifier" */ + REG_BADBR, /* "number too big in {} quantifier" */ + REG_EBRACK, /* "missing terminating ] for character class" */ + REG_ECTYPE, /* "invalid escape sequence in character class" */ + REG_ERANGE, /* "range out of order in character class" */ + REG_BADRPT, /* "nothing to repeat" */ + REG_BADRPT, /* "operand of unlimited repeat could match the empty string" */ + REG_ASSERT, /* "internal error: unexpected repeat" */ + REG_BADPAT, /* "unrecognized character after (?" */ + REG_ESIZE, /* "too many capturing parenthesized sub-patterns" */ + REG_EPAREN, /* "missing )" */ + REG_ESUBREG, /* "back reference to non-existent subpattern" */ + REG_INVARG, /* "erroffset passed as NULL" */ + REG_INVARG, /* "unknown option bit(s) set" */ + REG_EPAREN, /* "missing ) after comment" */ + REG_ESIZE, /* "too many sets of parentheses" */ + REG_ESIZE, /* "regular expression too large" */ + REG_ESPACE, /* "failed to get memory" */ + REG_EPAREN, /* "unmatched brackets" */ + REG_ASSERT, /* "internal error: code overflow" */ + REG_BADPAT, /* "unrecognized character after (?<" */ + REG_BADPAT, /* "lookbehind assertion is not fixed length" */ + REG_BADPAT, /* "malformed number after (?(" */ + REG_BADPAT, /* "conditional group containe more than two branches" */ + REG_BADPAT, /* "assertion expected after (?(" */ + REG_BADPAT, /* "(?p must be followed by )" */ + REG_ECTYPE, /* "unknown POSIX class name" */ + REG_BADPAT /* "POSIX collating elements are not supported" */ +}; + +/* Table of texts corresponding to POSIX error codes */ + +static const char *pstring[] = { + "", /* Dummy for value 0 */ + "internal error", /* REG_ASSERT */ + "invalid repeat counts in {}", /* BADBR */ + "pattern error", /* BADPAT */ + "? * + invalid", /* BADRPT */ + "unbalanced {}", /* EBRACE */ + "unbalanced []", /* EBRACK */ + "collation error - not relevant", /* ECOLLATE */ + "bad class", /* ECTYPE */ + "bad escape sequence", /* EESCAPE */ + "empty expression", /* EMPTY */ + "unbalanced ()", /* EPAREN */ + "bad range inside []", /* ERANGE */ + "expression too big", /* ESIZE */ + "failed to get memory", /* ESPACE */ + "bad back reference", /* ESUBREG */ + "bad argument", /* INVARG */ + "match failed" /* NOMATCH */ +}; + + + + +/************************************************* +* Translate PCRE text code to int * +*************************************************/ + +/* PCRE compile-time errors are given as strings defined as macros. We can just +look them up in a table to turn them into POSIX-style error codes. */ + +static int +pcre_posix_error_code(const char *s) +{ +size_t i; +for (i = 0; i < sizeof(estring)/sizeof(char *); i++) + if (strcmp(s, estring[i]) == 0) return eint[i]; +return REG_ASSERT; +} + + + +/************************************************* +* Translate error code to string * +*************************************************/ + +size_t +regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size) +{ +const char *message, *addmessage; +size_t length, addlength; + +message = (errcode >= (int)(sizeof(pstring)/sizeof(char *)))? + "unknown error code" : pstring[errcode]; +length = strlen(message) + 1; + +addmessage = " at offset "; +addlength = (preg != NULL && (int)preg->re_erroffset != -1)? + strlen(addmessage) + 6 : 0; + +if (errbuf_size > 0) + { + if (addlength > 0 && errbuf_size >= length + addlength) + sprintf(errbuf, "%s%s%-6d", message, addmessage, (int)preg->re_erroffset); + else + { + strncpy(errbuf, message, errbuf_size - 1); + errbuf[errbuf_size-1] = 0; + } + } + +return length + addlength; +} + + + + +/************************************************* +* Free store held by a regex * +*************************************************/ + +void +regfree(regex_t *preg) +{ +(pcre_free)(preg->re_pcre); +} + + + + +/************************************************* +* Compile a regular expression * +*************************************************/ + +/* +Arguments: + preg points to a structure for recording the compiled expression + pattern the pattern to compile + cflags compilation flags + +Returns: 0 on success + various non-zero codes on failure +*/ + +int +regcomp(regex_t *preg, const char *pattern, int cflags) +{ +const char *errorptr; +int erroffset; +int options = 0; + +if ((cflags & REG_ICASE) != 0) options |= PCRE_CASELESS; +if ((cflags & REG_NEWLINE) != 0) options |= PCRE_MULTILINE; + +preg->re_pcre = pcre_compile(pattern, options, &errorptr, &erroffset, NULL); +preg->re_erroffset = erroffset; + +if (preg->re_pcre == NULL) return pcre_posix_error_code(errorptr); + +preg->re_nsub = pcre_info(preg->re_pcre, NULL, NULL); +return 0; +} + + + + +/************************************************* +* Match a regular expression * +*************************************************/ + +/* Unfortunately, PCRE requires 3 ints of working space for each captured +substring, so we have to get and release working store instead of just using +the POSIX structures as was done in earlier releases when PCRE needed only 2 +ints. */ + +int +regexec(regex_t *preg, const char *string, size_t nmatch, + regmatch_t pmatch[], int eflags) +{ +int rc; +int options = 0; +int *ovector = NULL; + +if ((eflags & REG_NOTBOL) != 0) options |= PCRE_NOTBOL; +if ((eflags & REG_NOTEOL) != 0) options |= PCRE_NOTEOL; + +preg->re_erroffset = (size_t)(-1); /* Only has meaning after compile */ + +if (nmatch > 0) + { + ovector = (int *)malloc(sizeof(int) * nmatch * 3); + if (ovector == NULL) return REG_ESPACE; + } + +rc = pcre_exec(preg->re_pcre, NULL, string, (int)strlen(string), 0, options, + ovector, nmatch * 3); + +if (rc == 0) rc = nmatch; /* All captured slots were filled in */ + +if (rc >= 0) + { + size_t i; + for (i = 0; i < rc; i++) + { + pmatch[i].rm_so = ovector[i*2]; + pmatch[i].rm_eo = ovector[i*2+1]; + } + if (ovector != NULL) free(ovector); + for (; i < nmatch; i++) pmatch[i].rm_so = pmatch[i].rm_eo = -1; + return 0; + } + +else + { + if (ovector != NULL) free(ovector); + switch(rc) + { + case PCRE_ERROR_NOMATCH: return REG_NOMATCH; + case PCRE_ERROR_NULL: return REG_INVARG; + case PCRE_ERROR_BADOPTION: return REG_INVARG; + case PCRE_ERROR_BADMAGIC: return REG_INVARG; + case PCRE_ERROR_UNKNOWN_NODE: return REG_ASSERT; + case PCRE_ERROR_NOMEMORY: return REG_ESPACE; + default: return REG_ASSERT; + } + } +} + +/* End of pcreposix.c */ diff --git a/ext/pcre/pcrelib/pcreposix.h b/ext/pcre/pcrelib/pcreposix.h new file mode 100644 index 0000000000..7660acbd55 --- /dev/null +++ b/ext/pcre/pcrelib/pcreposix.h @@ -0,0 +1,88 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* Copyright (c) 1997-2000 University of Cambridge */ + +#ifndef _PCREPOSIX_H +#define _PCREPOSIX_H + +/* This is the header for the POSIX wrapper interface to the PCRE Perl- +Compatible Regular Expression library. It defines the things POSIX says should +be there. I hope. */ + +/* Have to include stdlib.h in order to ensure that size_t is defined. */ + +#include + +/* Allow for C++ users */ + +#ifdef __cplusplus +extern "C" { +#endif + +/* Options defined by POSIX. */ + +#define REG_ICASE 0x01 +#define REG_NEWLINE 0x02 +#define REG_NOTBOL 0x04 +#define REG_NOTEOL 0x08 + +/* These are not used by PCRE, but by defining them we make it easier +to slot PCRE into existing programs that make POSIX calls. */ + +#define REG_EXTENDED 0 +#define REG_NOSUB 0 + +/* Error values. Not all these are relevant or used by the wrapper. */ + +enum { + REG_ASSERT = 1, /* internal error ? */ + REG_BADBR, /* invalid repeat counts in {} */ + REG_BADPAT, /* pattern error */ + REG_BADRPT, /* ? * + invalid */ + REG_EBRACE, /* unbalanced {} */ + REG_EBRACK, /* unbalanced [] */ + REG_ECOLLATE, /* collation error - not relevant */ + REG_ECTYPE, /* bad class */ + REG_EESCAPE, /* bad escape sequence */ + REG_EMPTY, /* empty expression */ + REG_EPAREN, /* unbalanced () */ + REG_ERANGE, /* bad range inside [] */ + REG_ESIZE, /* expression too big */ + REG_ESPACE, /* failed to get memory */ + REG_ESUBREG, /* bad back reference */ + REG_INVARG, /* bad argument */ + REG_NOMATCH /* match failed */ +}; + + +/* The structure representing a compiled regular expression. */ + +typedef struct { + void *re_pcre; + size_t re_nsub; + size_t re_erroffset; +} regex_t; + +/* The structure in which a captured offset is returned. */ + +typedef int regoff_t; + +typedef struct { + regoff_t rm_so; + regoff_t rm_eo; +} regmatch_t; + +/* The functions */ + +extern int regcomp(regex_t *, const char *, int); +extern int regexec(regex_t *, const char *, size_t, regmatch_t *, int); +extern size_t regerror(int, const regex_t *, char *, size_t); +extern void regfree(regex_t *); + +#ifdef __cplusplus +} /* extern "C" */ +#endif + +#endif /* End of pcreposix.h */ diff --git a/ext/pcre/pcrelib/pcretest.c b/ext/pcre/pcrelib/pcretest.c new file mode 100644 index 0000000000..85569fbeb7 --- /dev/null +++ b/ext/pcre/pcrelib/pcretest.c @@ -0,0 +1,1071 @@ +/************************************************* +* PCRE testing program * +*************************************************/ + +#include +#include +#include +#include +#include +#include + +/* Use the internal info for displaying the results of pcre_study(). */ + +#include "internal.h" + +/* It is possible to compile this test program without including support for +testing the POSIX interface, though this is not available via the standard +Makefile. */ + +#if !defined NOPOSIX +#include "pcreposix.h" +#endif + +#ifndef CLOCKS_PER_SEC +#ifdef CLK_TCK +#define CLOCKS_PER_SEC CLK_TCK +#else +#define CLOCKS_PER_SEC 100 +#endif +#endif + +#define LOOPREPEAT 20000 + + +static FILE *outfile; +static int log_store = 0; +static size_t gotten_store; + + + +/* Debugging function to print the internal form of the regex. This is the same +code as contained in pcre.c under the DEBUG macro. */ + +static const char *OP_names[] = { + "End", "\\A", "\\B", "\\b", "\\D", "\\d", + "\\S", "\\s", "\\W", "\\w", "\\Z", "\\z", + "Opt", "^", "$", "Any", "chars", "not", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", + "class", "Ref", "Recurse", + "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", + "AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cref", + "Brazero", "Braminzero", "Bra" +}; + + +static void print_internals(pcre *re) +{ +unsigned char *code = ((real_pcre *)re)->code; + +fprintf(outfile, "------------------------------------------------------------------\n"); + +for(;;) + { + int c; + int charlength; + + fprintf(outfile, "%3d ", (int)(code - ((real_pcre *)re)->code)); + + if (*code >= OP_BRA) + { + fprintf(outfile, "%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA); + code += 2; + } + + else switch(*code) + { + case OP_END: + fprintf(outfile, " %s\n", OP_names[*code]); + fprintf(outfile, "------------------------------------------------------------------\n"); + return; + + case OP_OPT: + fprintf(outfile, " %.2x %s", code[1], OP_names[*code]); + code++; + break; + + case OP_COND: + fprintf(outfile, "%3d Cond", (code[1] << 8) + code[2]); + code += 2; + break; + + case OP_CREF: + fprintf(outfile, " %.2d %s", code[1], OP_names[*code]); + code++; + break; + + case OP_CHARS: + charlength = *(++code); + fprintf(outfile, "%3d ", charlength); + while (charlength-- > 0) + if (isprint(c = *(++code))) fprintf(outfile, "%c", c); + else fprintf(outfile, "\\x%02x", c); + break; + + case OP_KETRMAX: + case OP_KETRMIN: + case OP_ALT: + case OP_KET: + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + case OP_ONCE: + fprintf(outfile, "%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_REVERSE: + fprintf(outfile, "%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + if (*code >= OP_TYPESTAR) + fprintf(outfile, " %s", OP_names[code[1]]); + else if (isprint(c = code[1])) fprintf(outfile, " %c", c); + else fprintf(outfile, " \\x%02x", c); + fprintf(outfile, "%s", OP_names[*code++]); + break; + + case OP_EXACT: + case OP_UPTO: + case OP_MINUPTO: + if (isprint(c = code[3])) fprintf(outfile, " %c{", c); + else fprintf(outfile, " \\x%02x{", c); + if (*code != OP_EXACT) fprintf(outfile, ","); + fprintf(outfile, "%d}", (code[1] << 8) + code[2]); + if (*code == OP_MINUPTO) fprintf(outfile, "?"); + code += 3; + break; + + case OP_TYPEEXACT: + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + fprintf(outfile, " %s{", OP_names[code[3]]); + if (*code != OP_TYPEEXACT) fprintf(outfile, "0,"); + fprintf(outfile, "%d}", (code[1] << 8) + code[2]); + if (*code == OP_TYPEMINUPTO) fprintf(outfile, "?"); + code += 3; + break; + + case OP_NOT: + if (isprint(c = *(++code))) fprintf(outfile, " [^%c]", c); + else fprintf(outfile, " [^\\x%02x]", c); + break; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + if (isprint(c = code[1])) fprintf(outfile, " [^%c]", c); + else fprintf(outfile, " [^\\x%02x]", c); + fprintf(outfile, "%s", OP_names[*code++]); + break; + + case OP_NOTEXACT: + case OP_NOTUPTO: + case OP_NOTMINUPTO: + if (isprint(c = code[3])) fprintf(outfile, " [^%c]{", c); + else fprintf(outfile, " [^\\x%02x]{", c); + if (*code != OP_NOTEXACT) fprintf(outfile, ","); + fprintf(outfile, "%d}", (code[1] << 8) + code[2]); + if (*code == OP_NOTMINUPTO) fprintf(outfile, "?"); + code += 3; + break; + + case OP_REF: + fprintf(outfile, " \\%d", *(++code)); + code++; + goto CLASS_REF_REPEAT; + + case OP_CLASS: + { + int i, min, max; + code++; + fprintf(outfile, " ["); + + for (i = 0; i < 256; i++) + { + if ((code[i/8] & (1 << (i&7))) != 0) + { + int j; + for (j = i+1; j < 256; j++) + if ((code[j/8] & (1 << (j&7))) == 0) break; + if (i == '-' || i == ']') fprintf(outfile, "\\"); + if (isprint(i)) fprintf(outfile, "%c", i); else fprintf(outfile, "\\x%02x", i); + if (--j > i) + { + fprintf(outfile, "-"); + if (j == '-' || j == ']') fprintf(outfile, "\\"); + if (isprint(j)) fprintf(outfile, "%c", j); else fprintf(outfile, "\\x%02x", j); + } + i = j; + } + } + fprintf(outfile, "]"); + code += 32; + + CLASS_REF_REPEAT: + + switch(*code) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + fprintf(outfile, "%s", OP_names[*code]); + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + min = (code[1] << 8) + code[2]; + max = (code[3] << 8) + code[4]; + if (max == 0) fprintf(outfile, "{%d,}", min); + else fprintf(outfile, "{%d,%d}", min, max); + if (*code == OP_CRMINRANGE) fprintf(outfile, "?"); + code += 4; + break; + + default: + code--; + } + } + break; + + /* Anything else is just a one-node item */ + + default: + fprintf(outfile, " %s", OP_names[*code]); + break; + } + + code++; + fprintf(outfile, "\n"); + } +} + + + +/* Character string printing function. */ + +static void pchars(unsigned char *p, int length) +{ +int c; +while (length-- > 0) + if (isprint(c = *(p++))) fprintf(outfile, "%c", c); + else fprintf(outfile, "\\x%02x", c); +} + + + +/* Alternative malloc function, to test functionality and show the size of the +compiled re. */ + +static void *new_malloc(size_t size) +{ +gotten_store = size; +if (log_store) + fprintf(outfile, "Memory allocation (code space): %d\n", + (int)((int)size - offsetof(real_pcre, code[0]))); +return malloc(size); +} + + + + +/* Get one piece of information from the pcre_fullinfo() function */ + +static void new_info(pcre *re, pcre_extra *study, int option, void *ptr) +{ +int rc; +if ((rc = pcre_fullinfo(re, study, option, ptr)) < 0) + fprintf(outfile, "Error %d from pcre_fullinfo(%d)\n", rc, option); +} + + + + +/* Read lines from named file or stdin and write to named file or stdout; lines +consist of a regular expression, in delimiters and optionally followed by +options, followed by a set of test data, terminated by an empty line. */ + +int main(int argc, char **argv) +{ +FILE *infile = stdin; +int options = 0; +int study_options = 0; +int op = 1; +int timeit = 0; +int showinfo = 0; +int showstore = 0; +int posix = 0; +int debug = 0; +int done = 0; +unsigned char buffer[30000]; +unsigned char dbuffer[1024]; + +/* Static so that new_malloc can use it. */ + +outfile = stdout; + +/* Scan options */ + +while (argc > 1 && argv[op][0] == '-') + { + if (strcmp(argv[op], "-s") == 0 || strcmp(argv[op], "-m") == 0) + showstore = 1; + else if (strcmp(argv[op], "-t") == 0) timeit = 1; + else if (strcmp(argv[op], "-i") == 0) showinfo = 1; + else if (strcmp(argv[op], "-d") == 0) showinfo = debug = 1; + else if (strcmp(argv[op], "-p") == 0) posix = 1; + else + { + printf("*** Unknown option %s\n", argv[op]); + printf("Usage: pcretest [-d] [-i] [-p] [-s] [-t] [ []]\n"); + printf(" -d debug: show compiled code; implies -i\n" + " -i show information about compiled pattern\n" + " -p use POSIX interface\n" + " -s output store information\n" + " -t time compilation and execution\n"); + return 1; + } + op++; + argc--; + } + +/* Sort out the input and output files */ + +if (argc > 1) + { + infile = fopen(argv[op], "r"); + if (infile == NULL) + { + printf("** Failed to open %s\n", argv[op]); + return 1; + } + } + +if (argc > 2) + { + outfile = fopen(argv[op+1], "w"); + if (outfile == NULL) + { + printf("** Failed to open %s\n", argv[op+1]); + return 1; + } + } + +/* Set alternative malloc function */ + +pcre_malloc = new_malloc; + +/* Heading line, then prompt for first regex if stdin */ + +fprintf(outfile, "PCRE version %s\n\n", pcre_version()); + +/* Main loop */ + +while (!done) + { + pcre *re = NULL; + pcre_extra *extra = NULL; + +#if !defined NOPOSIX /* There are still compilers that require no indent */ + regex_t preg; + int do_posix = 0; +#endif + + const char *error; + unsigned char *p, *pp, *ppp; + unsigned const char *tables = NULL; + int do_study = 0; + int do_debug = debug; + int do_G = 0; + int do_g = 0; + int do_showinfo = showinfo; + int do_showrest = 0; + int erroroffset, len, delimiter; + + if (infile == stdin) printf(" re> "); + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) break; + if (infile != stdin) fprintf(outfile, "%s", (char *)buffer); + + p = buffer; + while (isspace(*p)) p++; + if (*p == 0) continue; + + /* Get the delimiter and seek the end of the pattern; if is isn't + complete, read more. */ + + delimiter = *p++; + + if (isalnum(delimiter) || delimiter == '\\') + { + fprintf(outfile, "** Delimiter must not be alphameric or \\\n"); + goto SKIP_DATA; + } + + pp = p; + + for(;;) + { + while (*pp != 0) + { + if (*pp == '\\' && pp[1] != 0) pp++; + else if (*pp == delimiter) break; + pp++; + } + if (*pp != 0) break; + + len = sizeof(buffer) - (pp - buffer); + if (len < 256) + { + fprintf(outfile, "** Expression too long - missing delimiter?\n"); + goto SKIP_DATA; + } + + if (infile == stdin) printf(" > "); + if (fgets((char *)pp, len, infile) == NULL) + { + fprintf(outfile, "** Unexpected EOF\n"); + done = 1; + goto CONTINUE; + } + if (infile != stdin) fprintf(outfile, "%s", (char *)pp); + } + + /* If the first character after the delimiter is backslash, make + the pattern end with backslash. This is purely to provide a way + of testing for the error message when a pattern ends with backslash. */ + + if (pp[1] == '\\') *pp++ = '\\'; + + /* Terminate the pattern at the delimiter */ + + *pp++ = 0; + + /* Look for options after final delimiter */ + + options = 0; + study_options = 0; + log_store = showstore; /* default from command line */ + + while (*pp != 0) + { + switch (*pp++) + { + case 'g': do_g = 1; break; + case 'i': options |= PCRE_CASELESS; break; + case 'm': options |= PCRE_MULTILINE; break; + case 's': options |= PCRE_DOTALL; break; + case 'x': options |= PCRE_EXTENDED; break; + + case '+': do_showrest = 1; break; + case 'A': options |= PCRE_ANCHORED; break; + case 'D': do_debug = do_showinfo = 1; break; + case 'E': options |= PCRE_DOLLAR_ENDONLY; break; + case 'G': do_G = 1; break; + case 'I': do_showinfo = 1; break; + case 'M': log_store = 1; break; + +#if !defined NOPOSIX + case 'P': do_posix = 1; break; +#endif + + case 'S': do_study = 1; break; + case 'U': options |= PCRE_UNGREEDY; break; + case 'X': options |= PCRE_EXTRA; break; + + case 'L': + ppp = pp; + while (*ppp != '\n' && *ppp != ' ') ppp++; + *ppp = 0; + if (setlocale(LC_CTYPE, (const char *)pp) == NULL) + { + fprintf(outfile, "** Failed to set locale \"%s\"\n", pp); + goto SKIP_DATA; + } + tables = pcre_maketables(); + pp = ppp; + break; + + case '\n': case ' ': break; + default: + fprintf(outfile, "** Unknown option '%c'\n", pp[-1]); + goto SKIP_DATA; + } + } + + /* Handle compiling via the POSIX interface, which doesn't support the + timing, showing, or debugging options, nor the ability to pass over + local character tables. */ + +#if !defined NOPOSIX + if (posix || do_posix) + { + int rc; + int cflags = 0; + if ((options & PCRE_CASELESS) != 0) cflags |= REG_ICASE; + if ((options & PCRE_MULTILINE) != 0) cflags |= REG_NEWLINE; + rc = regcomp(&preg, (char *)p, cflags); + + /* Compilation failed; go back for another re, skipping to blank line + if non-interactive. */ + + if (rc != 0) + { + (void)regerror(rc, &preg, (char *)buffer, sizeof(buffer)); + fprintf(outfile, "Failed: POSIX code %d: %s\n", rc, buffer); + goto SKIP_DATA; + } + } + + /* Handle compiling via the native interface */ + + else +#endif /* !defined NOPOSIX */ + + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < LOOPREPEAT; i++) + { + re = pcre_compile((char *)p, options, &error, &erroroffset, tables); + if (re != NULL) free(re); + } + time_taken = clock() - start_time; + fprintf(outfile, "Compile time %.3f milliseconds\n", + ((double)time_taken * 1000.0) / + ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + } + + re = pcre_compile((char *)p, options, &error, &erroroffset, tables); + + /* Compilation failed; go back for another re, skipping to blank line + if non-interactive. */ + + if (re == NULL) + { + fprintf(outfile, "Failed: %s at offset %d\n", error, erroroffset); + SKIP_DATA: + if (infile != stdin) + { + for (;;) + { + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) + { + done = 1; + goto CONTINUE; + } + len = (int)strlen((char *)buffer); + while (len > 0 && isspace(buffer[len-1])) len--; + if (len == 0) break; + } + fprintf(outfile, "\n"); + } + goto CONTINUE; + } + + /* Compilation succeeded; print data if required. There are now two + info-returning functions. The old one has a limited interface and + returns only limited data. Check that it agrees with the newer one. */ + + if (do_showinfo) + { + int old_first_char, old_options, old_count; + int count, backrefmax, first_char, need_char; + size_t size; + + if (do_debug) print_internals(re); + + new_info(re, NULL, PCRE_INFO_OPTIONS, &options); + new_info(re, NULL, PCRE_INFO_SIZE, &size); + new_info(re, NULL, PCRE_INFO_CAPTURECOUNT, &count); + new_info(re, NULL, PCRE_INFO_BACKREFMAX, &backrefmax); + new_info(re, NULL, PCRE_INFO_FIRSTCHAR, &first_char); + new_info(re, NULL, PCRE_INFO_LASTLITERAL, &need_char); + + old_count = pcre_info(re, &old_options, &old_first_char); + if (count < 0) fprintf(outfile, + "Error %d from pcre_info()\n", count); + else + { + if (old_count != count) fprintf(outfile, + "Count disagreement: pcre_fullinfo=%d pcre_info=%d\n", count, + old_count); + + if (old_first_char != first_char) fprintf(outfile, + "First char disagreement: pcre_fullinfo=%d pcre_info=%d\n", + first_char, old_first_char); + + if (old_options != options) fprintf(outfile, + "Options disagreement: pcre_fullinfo=%d pcre_info=%d\n", options, + old_options); + } + + if (size != gotten_store) fprintf(outfile, + "Size disagreement: pcre_fullinfo=%d call to malloc for %d\n", + size, gotten_store); + + fprintf(outfile, "Capturing subpattern count = %d\n", count); + if (backrefmax > 0) + fprintf(outfile, "Max back reference = %d\n", backrefmax); + if (options == 0) fprintf(outfile, "No options\n"); + else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s\n", + ((options & PCRE_ANCHORED) != 0)? " anchored" : "", + ((options & PCRE_CASELESS) != 0)? " caseless" : "", + ((options & PCRE_EXTENDED) != 0)? " extended" : "", + ((options & PCRE_MULTILINE) != 0)? " multiline" : "", + ((options & PCRE_DOTALL) != 0)? " dotall" : "", + ((options & PCRE_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "", + ((options & PCRE_EXTRA) != 0)? " extra" : "", + ((options & PCRE_UNGREEDY) != 0)? " ungreedy" : ""); + + if (((((real_pcre *)re)->options) & PCRE_ICHANGED) != 0) + fprintf(outfile, "Case state changes\n"); + + if (first_char == -1) + { + fprintf(outfile, "First char at start or follows \\n\n"); + } + else if (first_char < 0) + { + fprintf(outfile, "No first char\n"); + } + else + { + if (isprint(first_char)) + fprintf(outfile, "First char = \'%c\'\n", first_char); + else + fprintf(outfile, "First char = %d\n", first_char); + } + + if (need_char < 0) + { + fprintf(outfile, "No need char\n"); + } + else + { + if (isprint(need_char)) + fprintf(outfile, "Need char = \'%c\'\n", need_char); + else + fprintf(outfile, "Need char = %d\n", need_char); + } + } + + /* If /S was present, study the regexp to generate additional info to + help with the matching. */ + + if (do_study) + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < LOOPREPEAT; i++) + extra = pcre_study(re, study_options, &error); + time_taken = clock() - start_time; + if (extra != NULL) free(extra); + fprintf(outfile, " Study time %.3f milliseconds\n", + ((double)time_taken * 1000.0)/ + ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + } + + extra = pcre_study(re, study_options, &error); + if (error != NULL) + fprintf(outfile, "Failed to study: %s\n", error); + else if (extra == NULL) + fprintf(outfile, "Study returned NULL\n"); + + else if (do_showinfo) + { + uschar *start_bits = NULL; + new_info(re, extra, PCRE_INFO_FIRSTTABLE, &start_bits); + if (start_bits == NULL) + fprintf(outfile, "No starting character set\n"); + else + { + int i; + int c = 24; + fprintf(outfile, "Starting character set: "); + for (i = 0; i < 256; i++) + { + if ((start_bits[i/8] & (1<<(i%8))) != 0) + { + if (c > 75) + { + fprintf(outfile, "\n "); + c = 2; + } + if (isprint(i) && i != ' ') + { + fprintf(outfile, "%c ", i); + c += 2; + } + else + { + fprintf(outfile, "\\x%02x ", i); + c += 5; + } + } + } + fprintf(outfile, "\n"); + } + } + } + } + + /* Read data lines and test them */ + + for (;;) + { + unsigned char *q; + unsigned char *bptr = dbuffer; + int count, c; + int copystrings = 0; + int getstrings = 0; + int getlist = 0; + int gmatched = 0; + int start_offset = 0; + int g_notempty = 0; + int offsets[45]; + int size_offsets = sizeof(offsets)/sizeof(int); + + options = 0; + + if (infile == stdin) printf("data> "); + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) + { + done = 1; + goto CONTINUE; + } + if (infile != stdin) fprintf(outfile, "%s", (char *)buffer); + + len = (int)strlen((char *)buffer); + while (len > 0 && isspace(buffer[len-1])) len--; + buffer[len] = 0; + if (len == 0) break; + + p = buffer; + while (isspace(*p)) p++; + + q = dbuffer; + while ((c = *p++) != 0) + { + int i = 0; + int n = 0; + if (c == '\\') switch ((c = *p++)) + { + case 'a': c = 7; break; + case 'b': c = '\b'; break; + case 'e': c = 27; break; + case 'f': c = '\f'; break; + case 'n': c = '\n'; break; + case 'r': c = '\r'; break; + case 't': c = '\t'; break; + case 'v': c = '\v'; break; + + case '0': case '1': case '2': case '3': + case '4': case '5': case '6': case '7': + c -= '0'; + while (i++ < 2 && isdigit(*p) && *p != '8' && *p != '9') + c = c * 8 + *p++ - '0'; + break; + + case 'x': + c = 0; + while (i++ < 2 && isxdigit(*p)) + { + c = c * 16 + tolower(*p) - ((isdigit(*p))? '0' : 'W'); + p++; + } + break; + + case 0: /* Allows for an empty line */ + p--; + continue; + + case 'A': /* Option setting */ + options |= PCRE_ANCHORED; + continue; + + case 'B': + options |= PCRE_NOTBOL; + continue; + + case 'C': + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + copystrings |= 1 << n; + continue; + + case 'G': + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + getstrings |= 1 << n; + continue; + + case 'L': + getlist = 1; + continue; + + case 'N': + options |= PCRE_NOTEMPTY; + continue; + + case 'O': + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + if (n <= (int)(sizeof(offsets)/sizeof(int))) size_offsets = n; + continue; + + case 'Z': + options |= PCRE_NOTEOL; + continue; + } + *q++ = c; + } + *q = 0; + len = q - dbuffer; + + /* Handle matching via the POSIX interface, which does not + support timing. */ + +#if !defined NOPOSIX + if (posix || do_posix) + { + int rc; + int eflags = 0; + regmatch_t pmatch[sizeof(offsets)/sizeof(int)]; + if ((options & PCRE_NOTBOL) != 0) eflags |= REG_NOTBOL; + if ((options & PCRE_NOTEOL) != 0) eflags |= REG_NOTEOL; + + rc = regexec(&preg, (const char *)bptr, size_offsets, pmatch, eflags); + + if (rc != 0) + { + (void)regerror(rc, &preg, (char *)buffer, sizeof(buffer)); + fprintf(outfile, "No match: POSIX code %d: %s\n", rc, buffer); + } + else + { + size_t i; + for (i = 0; i < size_offsets; i++) + { + if (pmatch[i].rm_so >= 0) + { + fprintf(outfile, "%2d: ", (int)i); + pchars(dbuffer + pmatch[i].rm_so, + pmatch[i].rm_eo - pmatch[i].rm_so); + fprintf(outfile, "\n"); + if (i == 0 && do_showrest) + { + fprintf(outfile, " 0+ "); + pchars(dbuffer + pmatch[i].rm_eo, len - pmatch[i].rm_eo); + fprintf(outfile, "\n"); + } + } + } + } + } + + /* Handle matching via the native interface - repeats for /g and /G */ + + else +#endif /* !defined NOPOSIX */ + + for (;; gmatched++) /* Loop for /g or /G */ + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < LOOPREPEAT; i++) + count = pcre_exec(re, extra, (char *)bptr, len, + start_offset, options | g_notempty, offsets, size_offsets); + time_taken = clock() - start_time; + fprintf(outfile, "Execute time %.3f milliseconds\n", + ((double)time_taken * 1000.0)/ + ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + } + + count = pcre_exec(re, extra, (char *)bptr, len, + start_offset, options | g_notempty, offsets, size_offsets); + + if (count == 0) + { + fprintf(outfile, "Matched, but too many substrings\n"); + count = size_offsets/3; + } + + /* Matched */ + + if (count >= 0) + { + int i; + for (i = 0; i < count * 2; i += 2) + { + if (offsets[i] < 0) + fprintf(outfile, "%2d: \n", i/2); + else + { + fprintf(outfile, "%2d: ", i/2); + pchars(bptr + offsets[i], offsets[i+1] - offsets[i]); + fprintf(outfile, "\n"); + if (i == 0) + { + if (do_showrest) + { + fprintf(outfile, " 0+ "); + pchars(bptr + offsets[i+1], len - offsets[i+1]); + fprintf(outfile, "\n"); + } + } + } + } + + for (i = 0; i < 32; i++) + { + if ((copystrings & (1 << i)) != 0) + { + char copybuffer[16]; + int rc = pcre_copy_substring((char *)bptr, offsets, count, + i, copybuffer, sizeof(copybuffer)); + if (rc < 0) + fprintf(outfile, "copy substring %d failed %d\n", i, rc); + else + fprintf(outfile, "%2dC %s (%d)\n", i, copybuffer, rc); + } + } + + for (i = 0; i < 32; i++) + { + if ((getstrings & (1 << i)) != 0) + { + const char *substring; + int rc = pcre_get_substring((char *)bptr, offsets, count, + i, &substring); + if (rc < 0) + fprintf(outfile, "get substring %d failed %d\n", i, rc); + else + { + fprintf(outfile, "%2dG %s (%d)\n", i, substring, rc); + free((void *)substring); + } + } + } + + if (getlist) + { + const char **stringlist; + int rc = pcre_get_substring_list((char *)bptr, offsets, count, + &stringlist); + if (rc < 0) + fprintf(outfile, "get substring list failed %d\n", rc); + else + { + for (i = 0; i < count; i++) + fprintf(outfile, "%2dL %s\n", i, stringlist[i]); + if (stringlist[i] != NULL) + fprintf(outfile, "string list not terminated by NULL\n"); + free((void *)stringlist); + } + } + } + + /* Failed to match. If this is a /g or /G loop and we previously set + PCRE_NOTEMPTY after a null match, this is not necessarily the end. + We want to advance the start offset, and continue. Fudge the offset + values to achieve this. We won't be at the end of the string - that + was checked before setting PCRE_NOTEMPTY. */ + + else + { + if (g_notempty != 0) + { + offsets[0] = start_offset; + offsets[1] = start_offset + 1; + } + else + { + if (gmatched == 0) /* Error if no previous matches */ + { + if (count == -1) fprintf(outfile, "No match\n"); + else fprintf(outfile, "Error %d\n", count); + } + break; /* Out of the /g loop */ + } + } + + /* If not /g or /G we are done */ + + if (!do_g && !do_G) break; + + /* If we have matched an empty string, first check to see if we are at + the end of the subject. If so, the /g loop is over. Otherwise, mimic + what Perl's /g options does. This turns out to be rather cunning. First + we set PCRE_NOTEMPTY and try the match again at the same point. If this + fails (picked up above) we advance to the next character. */ + + g_notempty = 0; + if (offsets[0] == offsets[1]) + { + if (offsets[0] == len) break; + g_notempty = PCRE_NOTEMPTY; + } + + /* For /g, update the start offset, leaving the rest alone */ + + if (do_g) start_offset = offsets[1]; + + /* For /G, update the pointer and length */ + + else + { + bptr += offsets[1]; + len -= offsets[1]; + } + } /* End of loop for /g and /G */ + } /* End of loop for data lines */ + + CONTINUE: + +#if !defined NOPOSIX + if (posix || do_posix) regfree(&preg); +#endif + + if (re != NULL) free(re); + if (extra != NULL) free(extra); + if (tables != NULL) + { + free((void *)tables); + setlocale(LC_CTYPE, "C"); + } + } + +fprintf(outfile, "\n"); +return 0; +} + +/* End */ diff --git a/ext/pcre/pcrelib/perltest b/ext/pcre/pcrelib/perltest new file mode 100755 index 0000000000..1e96c792b9 --- /dev/null +++ b/ext/pcre/pcrelib/perltest @@ -0,0 +1,169 @@ +#! /usr/bin/perl + +# Program for testing regular expressions with perl to check that PCRE handles +# them the same. + + +# Function for turning a string into a string of printing chars + +sub pchars { +my($t) = ""; + +foreach $c (split(//, @_[0])) + { + if (ord $c >= 32 && ord $c < 127) { $t .= $c; } + else { $t .= sprintf("\\x%02x", ord $c); } + } +$t; +} + + + +# Read lines from named file or stdin and write to named file or stdout; lines +# consist of a regular expression, in delimiters and optionally followed by +# options, followed by a set of test data, terminated by an empty line. + +# Sort out the input and output files + +if (@ARGV > 0) + { + open(INFILE, "<$ARGV[0]") || die "Failed to open $ARGV[0]\n"; + $infile = "INFILE"; + } +else { $infile = "STDIN"; } + +if (@ARGV > 1) + { + open(OUTFILE, ">$ARGV[1]") || die "Failed to open $ARGV[1]\n"; + $outfile = "OUTFILE"; + } +else { $outfile = "STDOUT"; } + +printf($outfile "Perl $] Regular Expressions\n\n"); + +# Main loop + +NEXT_RE: +for (;;) + { + printf " re> " if $infile eq "STDIN"; + last if ! ($_ = <$infile>); + printf $outfile "$_" if $infile ne "STDIN"; + next if ($_ eq ""); + + $pattern = $_; + + while ($pattern !~ /^\s*(.).*\1/s) + { + printf " > " if $infile eq "STDIN"; + last if ! ($_ = <$infile>); + printf $outfile "$_" if $infile ne "STDIN"; + $pattern .= $_; + } + + chomp($pattern); + $pattern =~ s/\s+$//; + + # The private /+ modifier means "print $' afterwards". We use it + # only on the end of patterns to make it easy to chop off here. + + $showrest = ($pattern =~ s/\+(?=[a-z]*$)//); + + # Check that the pattern is valid + + eval "\$_ =~ ${pattern}"; + if ($@) + { + printf $outfile "Error: $@"; + next NEXT_RE; + } + + # If the /g modifier is present, we want to put a loop round the matching; + # otherwise just a single "if". + + $cmd = ($pattern =~ /g[a-z]*$/)? "while" : "if"; + + # If the pattern is actually the null string, Perl uses the most recently + # executed (and successfully compiled) regex is used instead. This is a + # nasty trap for the unwary! The PCRE test suite does contain null strings + # in places - if they are allowed through here all sorts of weird and + # unexpected effects happen. To avoid this, we replace such patterns with + # a non-null pattern that has the same effect. + + $pattern = "/(?#)/$2" if ($pattern =~ /^(.)\1(.*)$/); + + # Read data lines and test them + + for (;;) + { + printf "data> " if $infile eq "STDIN"; + last NEXT_RE if ! ($_ = <$infile>); + chomp; + printf $outfile "$_\n" if $infile ne "STDIN"; + + s/\s+$//; + s/^\s+//; + + last if ($_ eq ""); + + $x = eval "\"$_\""; # To get escapes processed + + # Empty array for holding results, then do the matching. + + @subs = (); + + eval "${cmd} (\$x =~ ${pattern}) {" . + "push \@subs,\$&;" . + "push \@subs,\$1;" . + "push \@subs,\$2;" . + "push \@subs,\$3;" . + "push \@subs,\$4;" . + "push \@subs,\$5;" . + "push \@subs,\$6;" . + "push \@subs,\$7;" . + "push \@subs,\$8;" . + "push \@subs,\$9;" . + "push \@subs,\$10;" . + "push \@subs,\$11;" . + "push \@subs,\$12;" . + "push \@subs,\$13;" . + "push \@subs,\$14;" . + "push \@subs,\$15;" . + "push \@subs,\$16;" . + "push \@subs,\$'; }"; + + if ($@) + { + printf $outfile "Error: $@\n"; + next NEXT_RE; + } + elsif (scalar(@subs) == 0) + { + printf $outfile "No match\n"; + } + else + { + while (scalar(@subs) != 0) + { + printf $outfile (" 0: %s\n", &pchars($subs[0])); + printf $outfile (" 0+ %s\n", &pchars($subs[17])) if $showrest; + $last_printed = 0; + for ($i = 1; $i <= 16; $i++) + { + if (defined $subs[$i]) + { + while ($last_printed++ < $i-1) + { printf $outfile ("%2d: \n", $last_printed); } + printf $outfile ("%2d: %s\n", $i, &pchars($subs[$i])); + $last_printed = $i; + } + } + splice(@subs, 0, 18); + } + } + } + } + +printf $outfile "\n"; + +# End diff --git a/ext/pcre/pcrelib/pgrep.c b/ext/pcre/pcrelib/pgrep.c new file mode 100644 index 0000000000..ad1b87eae9 --- /dev/null +++ b/ext/pcre/pcrelib/pgrep.c @@ -0,0 +1,225 @@ +/************************************************* +* PCRE grep program * +*************************************************/ + +#include +#include +#include +#include +#include "config.h" +#include "pcre.h" + +#define FALSE 0 +#define TRUE 1 + +typedef int BOOL; + + + +/************************************************* +* Global variables * +*************************************************/ + +static pcre *pattern; +static pcre_extra *hints; + +static BOOL count_only = FALSE; +static BOOL filenames_only = FALSE; +static BOOL invert = FALSE; +static BOOL number = FALSE; +static BOOL silent = FALSE; +static BOOL whole_lines = FALSE; + + + +#if ! HAVE_STRERROR +/************************************************* +* Provide strerror() for non-ANSI libraries * +*************************************************/ + +/* Some old-fashioned systems still around (e.g. SunOS4) don't have strerror() +in their libraries, but can provide the same facility by this simple +alternative function. */ + +extern int sys_nerr; +extern char *sys_errlist[]; + +char * +strerror(int n) +{ +if (n < 0 || n >= sys_nerr) return "unknown error number"; +return sys_errlist[n]; +} +#endif /* HAVE_STRERROR */ + + + +/************************************************* +* Grep an individual file * +*************************************************/ + +static int +pgrep(FILE *in, char *name) +{ +int rc = 1; +int linenumber = 0; +int count = 0; +int offsets[99]; +char buffer[BUFSIZ]; + +while (fgets(buffer, sizeof(buffer), in) != NULL) + { + BOOL match; + int length = (int)strlen(buffer); + if (length > 0 && buffer[length-1] == '\n') buffer[--length] = 0; + linenumber++; + + match = pcre_exec(pattern, hints, buffer, length, 0, 0, offsets, 99) >= 0; + if (match && whole_lines && offsets[1] != length) match = FALSE; + + if (match != invert) + { + if (count_only) count++; + + else if (filenames_only) + { + fprintf(stdout, "%s\n", (name == NULL)? "" : name); + return 0; + } + + else if (silent) return 0; + + else + { + if (name != NULL) fprintf(stdout, "%s:", name); + if (number) fprintf(stdout, "%d:", linenumber); + fprintf(stdout, "%s\n", buffer); + } + + rc = 0; + } + } + +if (count_only) + { + if (name != NULL) fprintf(stdout, "%s:", name); + fprintf(stdout, "%d\n", count); + } + +return rc; +} + + + + +/************************************************* +* Usage function * +*************************************************/ + +static int +usage(int rc) +{ +fprintf(stderr, "Usage: pgrep [-Vchilnsvx] pattern [file] ...\n"); +return rc; +} + + + + +/************************************************* +* Main program * +*************************************************/ + +int +main(int argc, char **argv) +{ +int i; +int rc = 1; +int options = 0; +int errptr; +const char *error; +BOOL filenames = TRUE; + +/* Process the options */ + +for (i = 1; i < argc; i++) + { + char *s; + if (argv[i][0] != '-') break; + s = argv[i] + 1; + while (*s != 0) + { + switch (*s++) + { + case 'c': count_only = TRUE; break; + case 'h': filenames = FALSE; break; + case 'i': options |= PCRE_CASELESS; break; + case 'l': filenames_only = TRUE; + case 'n': number = TRUE; break; + case 's': silent = TRUE; break; + case 'v': invert = TRUE; break; + case 'x': whole_lines = TRUE; options |= PCRE_ANCHORED; break; + + case 'V': + fprintf(stderr, "PCRE version %s\n", pcre_version()); + break; + + default: + fprintf(stderr, "pgrep: unknown option %c\n", s[-1]); + return usage(2); + } + } + } + +/* There must be at least a regexp argument */ + +if (i >= argc) return usage(0); + +/* Compile the regular expression. */ + +pattern = pcre_compile(argv[i++], options, &error, &errptr, NULL); +if (pattern == NULL) + { + fprintf(stderr, "pgrep: error in regex at offset %d: %s\n", errptr, error); + return 2; + } + +/* Study the regular expression, as we will be running it may times */ + +hints = pcre_study(pattern, 0, &error); +if (error != NULL) + { + fprintf(stderr, "pgrep: error while studing regex: %s\n", error); + return 2; + } + +/* If there are no further arguments, do the business on stdin and exit */ + +if (i >= argc) return pgrep(stdin, NULL); + +/* Otherwise, work through the remaining arguments as files. If there is only +one, don't give its name on the output. */ + +if (i == argc - 1) filenames = FALSE; +if (filenames_only) filenames = TRUE; + +for (; i < argc; i++) + { + FILE *in = fopen(argv[i], "r"); + if (in == NULL) + { + fprintf(stderr, "%s: failed to open: %s\n", argv[i], strerror(errno)); + rc = 2; + } + else + { + int frc = pgrep(in, filenames? argv[i] : NULL); + if (frc == 0 && rc == 1) rc = 0; + fclose(in); + } + } + +return rc; +} + +/* End */ diff --git a/ext/pcre/pcrelib/study.c b/ext/pcre/pcrelib/study.c new file mode 100644 index 0000000000..676db94665 --- /dev/null +++ b/ext/pcre/pcrelib/study.c @@ -0,0 +1,397 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +Written by: Philip Hazel + + Copyright (c) 1997-2000 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- +*/ + + +/* Include the internals header, which itself includes Standard C headers plus +the external pcre header. */ + +#include "internal.h" + + + +/************************************************* +* Set a bit and maybe its alternate case * +*************************************************/ + +/* Given a character, set its bit in the table, and also the bit for the other +version of a letter if we are caseless. + +Arguments: + start_bits points to the bit map + c is the character + caseless the caseless flag + cd the block with char table pointers + +Returns: nothing +*/ + +static void +set_bit(uschar *start_bits, int c, BOOL caseless, compile_data *cd) +{ +start_bits[c/8] |= (1 << (c&7)); +if (caseless && (cd->ctypes[c] & ctype_letter) != 0) + start_bits[cd->fcc[c]/8] |= (1 << (cd->fcc[c]&7)); +} + + + +/************************************************* +* Create bitmap of starting chars * +*************************************************/ + +/* This function scans a compiled unanchored expression and attempts to build a +bitmap of the set of initial characters. If it can't, it returns FALSE. As time +goes by, we may be able to get more clever at doing this. + +Arguments: + code points to an expression + start_bits points to a 32-byte table, initialized to 0 + caseless the current state of the caseless flag + cd the block with char table pointers + +Returns: TRUE if table built, FALSE otherwise +*/ + +static BOOL +set_start_bits(const uschar *code, uschar *start_bits, BOOL caseless, + compile_data *cd) +{ +register int c; + +/* This next statement and the later reference to dummy are here in order to +trick the optimizer of the IBM C compiler for OS/2 into generating correct +code. Apparently IBM isn't going to fix the problem, and we would rather not +disable optimization (in this module it actually makes a big difference, and +the pcre module can use all the optimization it can get). */ + +volatile int dummy; + +do + { + const uschar *tcode = code + 3; + BOOL try_next = TRUE; + + while (try_next) + { + try_next = FALSE; + + /* If a branch starts with a bracket or a positive lookahead assertion, + recurse to set bits from within them. That's all for this branch. */ + + if ((int)*tcode >= OP_BRA || *tcode == OP_ASSERT) + { + if (!set_start_bits(tcode, start_bits, caseless, cd)) + return FALSE; + } + + else switch(*tcode) + { + default: + return FALSE; + + /* Skip over lookbehind and negative lookahead assertions */ + + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + try_next = TRUE; + do tcode += (tcode[1] << 8) + tcode[2]; while (*tcode == OP_ALT); + tcode += 3; + break; + + /* Skip over an option setting, changing the caseless flag */ + + case OP_OPT: + caseless = (tcode[1] & PCRE_CASELESS) != 0; + tcode += 2; + try_next = TRUE; + break; + + /* BRAZERO does the bracket, but carries on. */ + + case OP_BRAZERO: + case OP_BRAMINZERO: + if (!set_start_bits(++tcode, start_bits, caseless, cd)) + return FALSE; + dummy = 1; + do tcode += (tcode[1] << 8) + tcode[2]; while (*tcode == OP_ALT); + tcode += 3; + try_next = TRUE; + break; + + /* Single-char * or ? sets the bit and tries the next item */ + + case OP_STAR: + case OP_MINSTAR: + case OP_QUERY: + case OP_MINQUERY: + set_bit(start_bits, tcode[1], caseless, cd); + tcode += 2; + try_next = TRUE; + break; + + /* Single-char upto sets the bit and tries the next */ + + case OP_UPTO: + case OP_MINUPTO: + set_bit(start_bits, tcode[3], caseless, cd); + tcode += 4; + try_next = TRUE; + break; + + /* At least one single char sets the bit and stops */ + + case OP_EXACT: /* Fall through */ + tcode++; + + case OP_CHARS: /* Fall through */ + tcode++; + + case OP_PLUS: + case OP_MINPLUS: + set_bit(start_bits, tcode[1], caseless, cd); + break; + + /* Single character type sets the bits and stops */ + + case OP_NOT_DIGIT: + for (c = 0; c < 32; c++) + start_bits[c] |= ~cd->cbits[c+cbit_digit]; + break; + + case OP_DIGIT: + for (c = 0; c < 32; c++) + start_bits[c] |= cd->cbits[c+cbit_digit]; + break; + + case OP_NOT_WHITESPACE: + for (c = 0; c < 32; c++) + start_bits[c] |= ~cd->cbits[c+cbit_space]; + break; + + case OP_WHITESPACE: + for (c = 0; c < 32; c++) + start_bits[c] |= cd->cbits[c+cbit_space]; + break; + + case OP_NOT_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= ~cd->cbits[c+cbit_word]; + break; + + case OP_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= cd->cbits[c+cbit_word]; + break; + + /* One or more character type fudges the pointer and restarts, knowing + it will hit a single character type and stop there. */ + + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + tcode++; + try_next = TRUE; + break; + + case OP_TYPEEXACT: + tcode += 3; + try_next = TRUE; + break; + + /* Zero or more repeats of character types set the bits and then + try again. */ + + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + tcode += 2; /* Fall through */ + + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + switch(tcode[1]) + { + case OP_NOT_DIGIT: + for (c = 0; c < 32; c++) + start_bits[c] |= ~cd->cbits[c+cbit_digit]; + break; + + case OP_DIGIT: + for (c = 0; c < 32; c++) + start_bits[c] |= cd->cbits[c+cbit_digit]; + break; + + case OP_NOT_WHITESPACE: + for (c = 0; c < 32; c++) + start_bits[c] |= ~cd->cbits[c+cbit_space]; + break; + + case OP_WHITESPACE: + for (c = 0; c < 32; c++) + start_bits[c] |= cd->cbits[c+cbit_space]; + break; + + case OP_NOT_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= ~cd->cbits[c+cbit_word]; + break; + + case OP_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= cd->cbits[c+cbit_word]; + break; + } + + tcode += 2; + try_next = TRUE; + break; + + /* Character class: set the bits and either carry on or not, + according to the repeat count. */ + + case OP_CLASS: + { + tcode++; + for (c = 0; c < 32; c++) start_bits[c] |= tcode[c]; + tcode += 32; + switch (*tcode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRQUERY: + case OP_CRMINQUERY: + tcode++; + try_next = TRUE; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + if (((tcode[1] << 8) + tcode[2]) == 0) + { + tcode += 5; + try_next = TRUE; + } + break; + } + } + break; /* End of class handling */ + + } /* End of switch */ + } /* End of try_next loop */ + + code += (code[1] << 8) + code[2]; /* Advance to next branch */ + } +while (*code == OP_ALT); +return TRUE; +} + + + +/************************************************* +* Study a compiled expression * +*************************************************/ + +/* This function is handed a compiled expression that it must study to produce +information that will speed up the matching. It returns a pcre_extra block +which then gets handed back to pcre_exec(). + +Arguments: + re points to the compiled expression + options contains option bits + errorptr points to where to place error messages; + set NULL unless error + +Returns: pointer to a pcre_extra block, + NULL on error or if no optimization possible +*/ + +pcre_extra * +pcre_study(const pcre *external_re, int options, const char **errorptr) +{ +uschar start_bits[32]; +real_pcre_extra *extra; +const real_pcre *re = (const real_pcre *)external_re; +compile_data compile_block; + +*errorptr = NULL; + +if (re == NULL || re->magic_number != MAGIC_NUMBER) + { + *errorptr = "argument is not a compiled regular expression"; + return NULL; + } + +if ((options & ~PUBLIC_STUDY_OPTIONS) != 0) + { + *errorptr = "unknown or incorrect option bit(s) set"; + return NULL; + } + +/* For an anchored pattern, or an unchored pattern that has a first char, or a +multiline pattern that matches only at "line starts", no further processing at +present. */ + +if ((re->options & (PCRE_ANCHORED|PCRE_FIRSTSET|PCRE_STARTLINE)) != 0) + return NULL; + +/* Set the character tables in the block which is passed around */ + +compile_block.lcc = re->tables + lcc_offset; +compile_block.fcc = re->tables + fcc_offset; +compile_block.cbits = re->tables + cbits_offset; +compile_block.ctypes = re->tables + ctypes_offset; + +/* See if we can find a fixed set of initial characters for the pattern. */ + +memset(start_bits, 0, 32 * sizeof(uschar)); +if (!set_start_bits(re->code, start_bits, (re->options & PCRE_CASELESS) != 0, + &compile_block)) return NULL; + +/* Get an "extra" block and put the information therein. */ + +extra = (real_pcre_extra *)(pcre_malloc)(sizeof(real_pcre_extra)); + +if (extra == NULL) + { + *errorptr = "failed to get memory"; + return NULL; + } + +extra->options = PCRE_STUDY_MAPPED; +memcpy(extra->start_bits, start_bits, sizeof(start_bits)); + +return (pcre_extra *)extra; +} + +/* End of study.c */