PCRE 3.9 merge

author Brian Pane <brianp@apache.org>

Wed, 20 Mar 2002 06:22:57 +0000 (06:22 +0000)

committer Brian Pane <brianp@apache.org>

Wed, 20 Mar 2002 06:22:57 +0000 (06:22 +0000)
diff --git a/srclib/pcre/ChangeLog b/srclib/pcre/ChangeLog
index 5bedd53bc637652737fc85ac2a75a1ace32a996a..a93f347f458a9a98e6a970a6005cefe4e057d253 100644
--- a/srclib/pcre/ChangeLog
+++ b/srclib/pcre/ChangeLog
@@ -1,6 +1,181 @@
  ChangeLog for PCRE
  ------------------
  
+Version 3.9 02-Jan-02
+---------------------
+
+1. A bit of extraneous text had somehow crept into the pcregrep documentation.
+
+2. If --disable-static was given, the building process failed when trying to
+build pcretest and pcregrep. (For some reason it was using libtool to compile
+them, which is not right, as they aren't part of the library.)
+
+
+Version 3.8 18-Dec-01
+---------------------
+
+1. The experimental UTF-8 code was completely screwed up. It was packing the
+bytes in the wrong order. How dumb can you get?
+
+
+Version 3.7 29-Oct-01
+---------------------
+
+1. In updating pcretest to check change 1 of version 3.6, I screwed up.
+This caused pcretest, when used on the test data, to segfault. Unfortunately,
+this didn't happen under Solaris 8, where I normally test things.
+
+2. The Makefile had to be changed to make it work on BSD systems, where 'make'
+doesn't seem to recognize that ./xxx and xxx are the same file. (This entry
+isn't in ChangeLog distributed with 3.7 because I forgot when I hastily made
+this fix an hour or so after the initial 3.7 release.)
+
+
+Version 3.6 23-Oct-01
+---------------------
+
+1. Crashed with /(sens|respons)e and \1ibility/ and "sense and sensibility" if
+offsets passed as NULL with zero offset count.
+
+2. The config.guess and config.sub files had not been updated when I moved to
+the latest autoconf.
+
+
+Version 3.5 15-Aug-01
+---------------------
+
+1. Added some missing #if !defined NOPOSIX conditionals in pcretest.c that
+had been forgotten.
+
+2. By using declared but undefined structures, we can avoid using "void"
+definitions in pcre.h while keeping the internal definitions of the structures
+private.
+
+3. The distribution is now built using autoconf 2.50 and libtool 1.4. From a
+user point of view, this means that both static and shared libraries are built
+by default, but this can be individually controlled. More of the work of
+handling these static/shared cases is now inside libtool instead of PCRE's make
+file.
+
+4. The pcretest utility is now installed along with pcregrep because it is
+useful for users (to test regexs) and by doing this, it automatically gets
+relinked by libtool. The documentation has been turned into a man page, so
+there are now .1, .txt, and .html versions in /doc.
+
+5. Upgrades to pcregrep:
+   (i)   Added long-form option names like gnu grep.
+   (ii)  Added --help to list all options with an explanatory phrase.
+   (iii) Added -r, --recursive to recurse into sub-directories.
+   (iv)  Added -f, --file to read patterns from a file.
+
+6. pcre_exec() was referring to its "code" argument before testing that
+argument for NULL (and giving an error if it was NULL).
+
+7. Upgraded Makefile.in to allow for compiling in a different directory from
+the source directory.
+
+8. Tiny buglet in pcretest: when pcre_fullinfo() was called to retrieve the
+options bits, the pointer it was passed was to an int instead of to an unsigned
+long int. This mattered only on 64-bit systems.
+
+9. Fixed typo (3.4/1) in pcre.h again. Sigh. I had changed pcre.h (which is
+generated) instead of pcre.in, which is its source. Also made the same change
+in several of the .c files.
+
+10. A new release of gcc defines printf() as a macro, which broke pcretest
+because it had an ifdef in the middle of a string argument for printf(). Fixed
+by using separate calls to printf().
+
+11. Added --enable-newline-is-cr and --enable-newline-is-lf to the configure
+script, to force use of CR or LF instead of \n in the source. On non-Unix
+systems, the value can be set in config.h.
+
+12. The limit of 200 on non-capturing parentheses is a _nesting_ limit, not an
+absolute limit. Changed the text of the error message to make this clear, and
+likewise updated the man page.
+
+13. The limit of 99 on the number of capturing subpatterns has been removed.
+The new limit is 65535, which I hope will not be a "real" limit.
+
+
+Version 3.4 22-Aug-00
+---------------------
+
+1. Fixed typo in pcre.h: unsigned const char * changed to const unsigned char *.
+
+2. Diagnose condition (?(0) as an error instead of crashing on matching.
+
+
+Version 3.3 01-Aug-00
+---------------------
+
+1. If an octal character was given, but the value was greater than \377, it
+was not getting masked to the least significant bits, as documented. This could
+lead to crashes in some systems.
+
+2. Perl 5.6 (if not earlier versions) accepts classes like [a-\d] and treats
+the hyphen as a literal. PCRE used to give an error; it now behaves like Perl.
+
+3. Added the functions pcre_free_substring() and pcre_free_substring_list().
+These just pass their arguments on to (pcre_free)(), but they are provided
+because some uses of PCRE bind it to non-C systems that can call its functions,
+but cannot call free() or pcre_free() directly.
+
+4. Add "make test" as a synonym for "make check". Corrected some comments in
+the Makefile.
+
+5. Add $(DESTDIR)/ in front of all the paths in the "install" target in the
+Makefile.
+
+6. Changed the name of pgrep to pcregrep, because Solaris has introduced a
+command called pgrep for grepping around the active processes.
+
+7. Added the beginnings of support for UTF-8 character strings.
+
+8. Arranged for the Makefile to pass over the settings of CC, CFLAGS, and
+RANLIB to ./ltconfig so that they are used by libtool. I think these are all
+the relevant ones. (AR is not passed because ./ltconfig does its own figuring
+out for the ar command.)
+
+
+Version 3.2 12-May-00
+---------------------
+
+This is purely a bug fixing release.
+
+1. If the pattern /((Z)+|A)*/ was matched against ZABCDEFG it matched Z instead
+of ZA. This was just one example of several cases that could provoke this bug,
+which was introduced by change 9 of version 2.00. The code for breaking
+infinite loops after an iteration that matches an empty string wasn't working
+correctly.
+
+2. The pcretest program was not imitating Perl correctly for the pattern /a*/g
+when matched against abbab (for example). After matching an empty string, it
+wasn't forcing anchoring when setting PCRE_NOTEMPTY for the next attempt; this
+caused it to match further down the string than it should.
+
+3. The code contained an inclusion of sys/types.h. It isn't clear why this
+was there because it doesn't seem to be needed, and it causes trouble on some
+systems, as it is not a Standard C header. It has been removed.
+
+4. Made 4 silly changes to the source to avoid stupid compiler warnings that
+were reported on the Macintosh. The changes were from
+
+  while ((c = *(++ptr)) != 0 && c != '\n');
+to
+  while ((c = *(++ptr)) != 0 && c != '\n') ;
+
+Totally extraordinary, but if that's what it takes...
+
+5. PCRE is being used in one environment where neither memmove() nor bcopy() is
+available. Added HAVE_BCOPY and an autoconf test for it; if neither
+HAVE_MEMMOVE nor HAVE_BCOPY is set, use a built-in emulation function which
+assumes the way PCRE uses memmove() (always moving upwards).
+
+6. PCRE is being used in one environment where strchr() is not available. There
+was only one use in pcre.c, and writing it out to avoid strchr() probably gives
+faster code anyway.
+
  
  Version 3.2 12-May-00
  ---------------------
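Note on entry 5 under Version 3.2 above: the ChangeLog says the built-in
replacement for memmove() "assumes the way PCRE uses memmove() (always moving
upwards)". The following is a minimal sketch of what such a restricted copy
could look like, not the code from the PCRE sources. The name sketch_move_up
is hypothetical, and the sketch assumes the destination is at or above the
source within the same buffer, so copying from the last byte down to the first
never overwrites data that has yet to be read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only. It copies n bytes from src to
   dest, working backwards from the last byte. That order is safe for
   overlapping regions only when dest >= src (an "upwards" move), which is the
   single case the ChangeLog entry says PCRE needs. */
static void *sketch_move_up(unsigned char *dest, const unsigned char *src,
                            size_t n)
{
    /* Both pointers are assumed to point into the same buffer. */
    assert(dest >= src);

    dest += n;
    src += n;
    while (n-- > 0)
        *(--dest) = *(--src);

    /* After the loop dest is back at the start of the destination region. */
    return dest;
}
```

A forward byte-by-byte loop would corrupt the data in this situation, because
the first bytes written would land on source bytes that have not been copied
yet.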
diff --git a/srclib/pcre/internal.h b/srclib/pcre/internal.h
index b4b750f6c653b50b3aa828fccae8f42fc9b0d1c3..0c8c1c9df6bdafd945508811a20851dc5703bc48 100644
--- a/srclib/pcre/internal.h
+++ b/srclib/pcre/internal.h
@@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals.
  
  Written by: Philip Hazel <ph10@cam.ac.uk>
  
-           Copyright (c) 1997-2000 University of Cambridge
+           Copyright (c) 1997-2001 University of Cambridge
  
  -----------------------------------------------------------------------------
  Permission is granted to anyone to use this software for any purpose on any
@@ -105,7 +105,7 @@ time, run time or study time, respectively. */
  
  #define PUBLIC_OPTIONS \
    (PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \
-   PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY)
+   PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8)
  
  #define PUBLIC_EXEC_OPTIONS \
    (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY)
@@ -123,12 +123,36 @@ typedef int BOOL;
  #define FALSE   0
  #define TRUE    1
  
+/* Escape items that are just an encoding of a particular data value. Note that
+ESC_N is defined as yet another macro, which is set in config.h to either \n
+(the default) or \r (which some people want). */
+
+#ifndef ESC_E
+#define ESC_E 27
+#endif
+
+#ifndef ESC_F
+#define ESC_F '\f'
+#endif
+
+#ifndef ESC_N
+#define ESC_N NEWLINE
+#endif
+
+#ifndef ESC_R
+#define ESC_R '\r'
+#endif
+
+#ifndef ESC_T
+#define ESC_T '\t'
+#endif
+
  /* These are escaped items that aren't just an encoding of a particular data
  value such as \n. They must have non-zero values, as check_escape() returns
  their negation. Also, they must appear in the same order as in the opcode
  definitions below, up to ESC_z. The final one must be ESC_REF as subsequent
  values are used for \1, \2, \3, etc. There is a test in the code for an escape
-greater than ESC_b and less than ESC_X to detect the types that may be
+greater than ESC_b and less than ESC_Z to detect the types that may be
  repeated. If any new escapes are put in-between that don't consume a character,
  that code will have to change. */
  
@@ -224,19 +248,26 @@ enum {
  
    OP_ONCE,           /* Once matched, don't back up into the subpattern */
    OP_COND,           /* Conditional group */
-  OP_CREF,           /* Used to hold an extraction string number */
+  OP_CREF,           /* Used to hold an extraction string number (cond ref) */
  
    OP_BRAZERO,        /* These two must remain together and in this */
    OP_BRAMINZERO,     /* order. */
  
+  OP_BRANUMBER,      /* Used for extracting brackets whose number is greater
+                        than can fit into an opcode. */
+
    OP_BRA             /* This and greater values are used for brackets that
-                        extract substrings. */
+                        extract substrings up to a basic limit. After that,
+                        use is made of OP_BRANUMBER. */
  };
  
-/* The highest extraction number. This is limited by the number of opcodes
-left after OP_BRA, i.e. 255 - OP_BRA. We actually set it somewhat lower. */
+/* The highest extraction number before we have to start using additional
+bytes. (Originally PCRE didn't have support for extraction counts higher than
+this number.) The value is limited by the number of opcodes left after OP_BRA,
+i.e. 255 - OP_BRA. We actually set it a bit lower to leave room for additional
+opcodes. */
  
-#define EXTRACT_MAX  99
+#define EXTRACT_BASIC_MAX  150
  
  /* The texts of compile-time error messages are defined as macros here so that
  they can be accessed by the POSIX wrapper and converted into error codes.  Yes,
@@ -255,13 +286,13 @@ just to accommodate the POSIX wrapper. */
  #define ERR10 "operand of unlimited repeat could match the empty string"
  #define ERR11 "internal error: unexpected repeat"
  #define ERR12 "unrecognized character after (?"
-#define ERR13 "too many capturing parenthesized sub-patterns"
+#define ERR13 "unused error"
  #define ERR14 "missing )"
  #define ERR15 "back reference to non-existent subpattern"
  #define ERR16 "erroffset passed as NULL"
  #define ERR17 "unknown option bit(s) set"
  #define ERR18 "missing ) after comment"
-#define ERR19 "too many sets of parentheses"
+#define ERR19 "parentheses nested too deeply"
  #define ERR20 "regular expression too large"
  #define ERR21 "failed to get memory"
  #define ERR22 "unmatched parentheses"
@@ -274,6 +305,10 @@ just to accommodate the POSIX wrapper. */
  #define ERR29 "(?p must be followed by )"
  #define ERR30 "unknown POSIX class name"
  #define ERR31 "POSIX collating elements are not supported"
+#define ERR32 "this version of PCRE is not compiled with PCRE_UTF8 support"
+#define ERR33 "characters with values > 255 are not yet supported in classes"
+#define ERR34 "character value in \\x{...} sequence is too large"
+#define ERR35 "invalid condition (?(0)"
  
  /* All character handling must be done as unsigned characters. Otherwise there
  are problems with top-bit-set characters and functions such as isspace().
@@ -292,8 +327,8 @@ typedef struct real_pcre {
    size_t size;
    const unsigned char *tables;
    unsigned long int options;
-  uschar top_bracket;
-  uschar top_backref;
+  unsigned short int top_bracket;
+  unsigned short int top_backref;
    uschar first_char;
    uschar req_char;
    uschar code[1];
@@ -330,6 +365,7 @@ typedef struct match_data {
    BOOL   offset_overflow;       /* Set if too many extractions */
    BOOL   notbol;                /* NOTBOL flag */
    BOOL   noteol;                /* NOTEOL flag */
+  BOOL   utf8;                  /* UTF8 flag */
    BOOL   endonly;               /* Dollar not before final \n */
    BOOL   notempty;              /* Empty string match not wanted */
    const uschar *start_pattern;  /* For use when recursing */
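Note on the OP_BRANUMBER and EXTRACT_BASIC_MAX changes above: the comments say
that bracket numbers up to a basic limit are folded into the opcode itself and
that larger numbers need additional bytes, which is also why top_bracket and
top_backref are widened to unsigned short. The sketch below illustrates that
idea under stated assumptions. The opcode values, the byte layout, and the
sketch_* names are all hypothetical and are not taken from the PCRE sources;
only EXTRACT_BASIC_MAX (150) and the 65535 limit come from this commit.

```c
#include <assert.h>
#include <stddef.h>

#define EXTRACT_BASIC_MAX   150  /* value from the internal.h hunk */
#define SKETCH_OP_BRANUMBER 100  /* hypothetical opcode value */
#define SKETCH_OP_BRA       101  /* hypothetical first bracket opcode */

/* Encodes capturing bracket `number` into buf and returns the bytes written.
   Small numbers fit in a single opcode byte. Larger ones use one reserved
   opcode followed by a marker byte and a 16-bit big-endian number. */
static size_t sketch_encode_bracket(unsigned int number, unsigned char *buf)
{
    assert(number >= 1 && number <= 65535);

    if (number <= EXTRACT_BASIC_MAX) {
        buf[0] = (unsigned char)(SKETCH_OP_BRA + number);
        return 1;
    }

    buf[0] = (unsigned char)(SKETCH_OP_BRA + EXTRACT_BASIC_MAX + 1);
    buf[1] = SKETCH_OP_BRANUMBER;
    buf[2] = (unsigned char)(number >> 8);
    buf[3] = (unsigned char)(number & 255);
    return 4;
}

/* Recovers the bracket number from an encoding produced above. */
static unsigned int sketch_decode_bracket(const unsigned char *buf)
{
    unsigned int number = (unsigned int)buf[0] - SKETCH_OP_BRA;

    if (number > EXTRACT_BASIC_MAX)
        number = ((unsigned int)buf[2] << 8) | buf[3];
    return number;
}
```

With these assumed values the largest single-byte opcode is 101 + 151 = 252,
which stays within one byte and leaves a few values spare, in the spirit of
the "a bit lower to leave room for additional opcodes" remark in the comment.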
diff --git a/srclib/pcre/pcreposix.c b/srclib/pcre/pcreposix.c
index e48d93e4c66a38a95fdab88763791114f120e4b7..9b2efe209877d8f953dfe14c5fa6295c20384e94 100644
--- a/srclib/pcre/pcreposix.c
+++ b/srclib/pcre/pcreposix.c
@@ -12,7 +12,7 @@ functions.
  
  Written by: Philip Hazel <ph10@cam.ac.uk>
  
-           Copyright (c) 1997-2000 University of Cambridge
+           Copyright (c) 1997-2001 University of Cambridge
  
  -----------------------------------------------------------------------------
  Permission is granted to anyone to use this software for any purpose on any
@@ -62,13 +62,13 @@ static int eint[] = {
    REG_BADRPT,  /* "operand of unlimited repeat could match the empty string" */
    REG_ASSERT,  /* "internal error: unexpected repeat" */
    REG_BADPAT,  /* "unrecognized character after (?" */
-  REG_ESIZE,   /* "too many capturing parenthesized sub-patterns" */
+  REG_ASSERT,  /* "unused error" */
    REG_EPAREN,  /* "missing )" */
    REG_ESUBREG, /* "back reference to non-existent subpattern" */
    REG_INVARG,  /* "erroffset passed as NULL" */
    REG_INVARG,  /* "unknown option bit(s) set" */
    REG_EPAREN,  /* "missing ) after comment" */
-  REG_ESIZE,   /* "too many sets of parentheses" */
+  REG_ESIZE,   /* "parentheses nested too deeply" */
    REG_ESIZE,   /* "regular expression too large" */
    REG_ESPACE,  /* "failed to get memory" */
    REG_EPAREN,  /* "unmatched brackets" */
@@ -80,7 +80,11 @@ static int eint[] = {
    REG_BADPAT,  /* "assertion expected after (?(" */
    REG_BADPAT,  /* "(?p must be followed by )" */
    REG_ECTYPE,  /* "unknown POSIX class name" */
-  REG_BADPAT   /* "POSIX collating elements are not supported" */
+  REG_BADPAT,  /* "POSIX collating elements are not supported" */
+  REG_INVARG,  /* "this version of PCRE is not compiled with PCRE_UTF8 support" */
+  REG_BADPAT,  /* "characters with values > 255 are not yet supported in classes" */
+  REG_BADPAT,  /* "character value in \x{...} sequence is too large" */
+  REG_BADPAT   /* "invalid condition (?(0)" */
  };
  
  /* Table of texts corresponding to POSIX error codes */
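Note on the eint[] change above: the table has to gain one entry for every new
ERRnn message in internal.h, in the same order, or the POSIX wrapper would
report the wrong code. The following sketch shows one plausible way to keep
such parallel tables honest, assuming the wrapper locates a message by text
comparison. The sketch_* names and the enum values are hypothetical; only the
four message texts and their POSIX categories are taken from this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the POSIX-style codes. The real values are
   defined in pcreposix.h, which is not part of this diff. */
enum sketch_reg_code {
    SKETCH_REG_ASSERT = 1,
    SKETCH_REG_BADPAT,
    SKETCH_REG_INVARG
};

/* The four messages added as ERR32..ERR35 in the internal.h hunk. */
static const char *const sketch_text[] = {
    "this version of PCRE is not compiled with PCRE_UTF8 support",
    "characters with values > 255 are not yet supported in classes",
    "character value in \\x{...} sequence is too large",
    "invalid condition (?(0)"
};

/* POSIX categories for the same four messages, in the same order. */
static const int sketch_code[] = {
    SKETCH_REG_INVARG,
    SKETCH_REG_BADPAT,
    SKETCH_REG_BADPAT,
    SKETCH_REG_BADPAT
};

/* Compile-time check: the array size is negative, and compilation fails,
   if the two tables ever differ in length. */
typedef char sketch_tables_in_sync[
    (sizeof(sketch_text) / sizeof(sketch_text[0]) ==
     sizeof(sketch_code) / sizeof(sketch_code[0])) ? 1 : -1];

/* Maps a message text to its POSIX-style code. Unknown texts fall back to
   the "internal error" category. */
static int sketch_posix_error_code(const char *text)
{
    size_t i;

    assert(text != NULL);
    for (i = 0; i < sizeof(sketch_text) / sizeof(sketch_text[0]); i++)
        if (strcmp(text, sketch_text[i]) == 0)
            return sketch_code[i];
    return SKETCH_REG_ASSERT;
}
```

The length check catches a forgotten entry but not a misordered one, so the
comments beside each eint[] entry in the real file remain the main safeguard.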