regexec. This was done to avoid exposing internal symbols globally;
all functions not meant to be part of the library API are static.
-(Actually the above is a lie in one respect: there is one more global
-symbol, pg_set_regex_collation in regcomp. It is not meant to be part of
-the API, but it has to be global because both regcomp and regexec call it.
-It'd be better to get rid of that, as well as the static variables it
-sets, in favor of keeping the needed locale state in the regex structs.
-We have not done this yet for lack of a design for how to add
-application-specific state to the structs.)
+(Actually the above is a lie in one respect: there are two more global
+symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp. These are
+not meant to be part of the API, but they have to be global because both
+regcomp and regexec call them. It'd be better to get rid of
+pg_set_regex_collation, as well as the static variables it sets, in favor of
+keeping the needed locale state in the regex structs. We have not done this
+yet for lack of a design for how to add application-specific state to the
+structs.)
What's where in src/backend/regex/:
an existing color has to be subdivided.
The last two of these are handled with the "struct colordesc" array and
-the "colorchain" links in NFA arc structs. The color map proper (that
-is, the per-character lookup array) is handled as a multi-level tree,
-with each tree level indexed by one byte of a character's value. The
-code arranges to not have more than one copy of bottom-level tree pages
-that are all-the-same-color.
-
-Unfortunately, this design does not seem terribly efficient for common
-cases such as a tree in which all Unicode letters are colored the same,
-because there aren't that many places where we get a whole page all the
-same color, except at the end of the map. (It also strikes me that given
-PG's current restrictions on the range of Unicode values, we could use a
-3-level rather than 4-level tree; but there's not provision for that in
-regguts.h at the moment.)
-
-A bigger problem is that it just doesn't seem very reasonable to have to
-consider each Unicode letter separately at regex parse time for a regex
-such as "\w"; more than likely, a huge percentage of those codes will
-never be seen at runtime. We need to fix things so that locale-based
-character classes are somehow processed "symbolically" without making a
-full expansion of their contents at parse time. This would mean that we'd
-have to be ready to call iswalpha() at runtime, but if that only happens
-for high-code-value characters, it shouldn't be a big performance hit.
+the "colorchain" links in NFA arc structs.
+
+Ideally, we'd do the first two operations using a simple linear array
+storing the current color assignment for each character code.
+Unfortunately, that's not terribly workable for large charsets such as
+Unicode. Our solution is to divide the color map into two parts. A simple
+linear array is used for character codes up to MAX_SIMPLE_CHR, which can be
+chosen large enough to include all popular characters (so that the
+significantly-slower code paths about to be described are seldom invoked).
+Characters above that need be considered at compile time only if they
+appear explicitly in the regex pattern. We store each such mentioned
+character or character range as an entry in the "colormaprange" array in
+the colormap. (Overlapping ranges are split into unique subranges, so that
+each range in the finished list needs only a single color that describes
+all its characters.) When mapping a character above MAX_SIMPLE_CHR to a
+color at runtime, we search this list of ranges explicitly.
+
+That's still not quite enough, though, because of locale-dependent
+character classes such as [[:alpha:]]. In Unicode locales these classes
+may have thousands of entries that are above MAX_SIMPLE_CHR, and we
+certainly don't want to be searching large colormaprange arrays at runtime.
+Nor do we even want to spend the time to initialize cvec structures that
+exhaustively describe all of those characters. Our solution is to compute
+exact per-character colors at regex compile time only up to MAX_SIMPLE_CHR.
+For characters above that, we apply the <ctype.h> or <wctype.h> lookup
+functions at runtime for each locale-dependent character class used in the
+regex pattern, constructing a bitmap that describes which classes the
+runtime character belongs to. The per-character-range data structure
+mentioned above actually holds, for each range, a separate color entry
+for each possible combination of character class properties. That is,
+the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
+whose rows correspond to high characters or character ranges that are
+explicitly mentioned in the regex pattern, and whose columns correspond
+to sets of the locale-dependent character classes that are used in the
+regex.
+
+As an example, given the pattern '\w\u1234[\U0001D100-\U0001D1FF]'
+(and supposing that MAX_SIMPLE_CHR is less than 0x1234), we will need
+a high color map with three rows. One row is for the single character
+U+1234 (represented as a single-element range), one is for the range
+U+1D100..U+1D1FF, and the other row represents all remaining high
+characters. The color map has two columns, one for characters that
+satisfy iswalnum() and one for those that don't.
+
+We build this color map in parallel with scanning the regex. Each time
+we detect a new explicit high character (or range) or a locale-dependent
+character class, we split existing entry(s) in the high color map so that
+characters we need to be able to distinguish will have distinct entries
+that can be given separate colors. Often, though, single entries in the
+high color map will represent very large sets of characters.
+
+If there are both explicit high characters/ranges and locale-dependent
+character classes, we may have entries in the high color map array that
+have non-WHITE colors but don't actually represent any real characters.
+(For example, in a row representing a singleton range, only one of the
+columns could possibly be a live entry; it's the one matching the actual
+locale properties for that single character.) We don't currently make
+any effort to reclaim such colors. In principle it could be done, but
+it's not clear that it's worth the trouble.
Detailed semantics of an NFA
initcm(struct vars * v,
struct colormap * cm)
{
- int i;
- int j;
- union tree *t;
- union tree *nextt;
struct colordesc *cd;
cm->magic = CMMAGIC;
cm->free = 0;
cd = cm->cd; /* cm->cd[WHITE] */
+ cd->nschrs = MAX_SIMPLE_CHR - CHR_MIN + 1;
+ cd->nuchrs = 1;
cd->sub = NOSUB;
cd->arcs = NULL;
cd->firstchr = CHR_MIN;
- cd->nchrs = CHR_MAX - CHR_MIN + 1;
cd->flags = 0;
- /* upper levels of tree */
- for (t = &cm->tree[0], j = NBYTS - 1; j > 0; t = nextt, j--)
+ cm->locolormap = (color *)
+ MALLOC((MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
+ if (cm->locolormap == NULL)
{
- nextt = t + 1;
- for (i = BYTTAB - 1; i >= 0; i--)
- t->tptr[i] = nextt;
+ CERR(REG_ESPACE);
+ cm->cmranges = NULL; /* prevent failure during freecm */
+ cm->hicolormap = NULL;
+ return;
}
- /* bottom level is solid white */
- t = &cm->tree[NBYTS - 1];
- for (i = BYTTAB - 1; i >= 0; i--)
- t->tcolor[i] = WHITE;
- cd->block = t;
+ /* this memset relies on WHITE being zero: */
+ memset(cm->locolormap, WHITE,
+ (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
+
+ memset(cm->classbits, 0, sizeof(cm->classbits));
+ cm->numcmranges = 0;
+ cm->cmranges = NULL;
+ cm->maxarrayrows = 4; /* arbitrary initial allocation */
+ cm->hiarrayrows = 1; /* but we have only one row/col initially */
+ cm->hiarraycols = 1;
+ cm->hicolormap = (color *) MALLOC(cm->maxarrayrows * sizeof(color));
+ if (cm->hicolormap == NULL)
+ {
+ CERR(REG_ESPACE);
+ return;
+ }
+ /* initialize the "all other characters" row to WHITE */
+ cm->hicolormap[0] = WHITE;
}
/*
static void
freecm(struct colormap * cm)
{
- size_t i;
- union tree *cb;
-
cm->magic = 0;
- if (NBYTS > 1)
- cmtreefree(cm, cm->tree, 0);
- for (i = 1; i <= cm->max; i++) /* skip WHITE */
- if (!UNUSEDCOLOR(&cm->cd[i]))
- {
- cb = cm->cd[i].block;
- if (cb != NULL)
- FREE(cb);
- }
if (cm->cd != cm->cdspace)
FREE(cm->cd);
+ if (cm->locolormap != NULL)
+ FREE(cm->locolormap);
+ if (cm->cmranges != NULL)
+ FREE(cm->cmranges);
+ if (cm->hicolormap != NULL)
+ FREE(cm->hicolormap);
}
/*
- * cmtreefree - free a non-terminal part of a colormap tree
+ * pg_reg_getcolor - slow case of GETCOLOR()
*/
-static void
-cmtreefree(struct colormap * cm,
- union tree * tree,
- int level) /* level number (top == 0) of this block */
+color
+pg_reg_getcolor(struct colormap * cm, chr c)
{
- int i;
- union tree *t;
- union tree *fillt = &cm->tree[level + 1];
- union tree *cb;
-
- assert(level < NBYTS - 1); /* this level has pointers */
- for (i = BYTTAB - 1; i >= 0; i--)
+ int rownum,
+ colnum,
+ low,
+ high;
+
+ /* Should not be used for chrs in the locolormap */
+ assert(c > MAX_SIMPLE_CHR);
+
+ /*
+ * Find which row it's in. The colormapranges are in order, so we can use
+ * binary search.
+ */
+ rownum = 0; /* if no match, use array row zero */
+ low = 0;
+ high = cm->numcmranges;
+ while (low < high)
{
- t = tree->tptr[i];
- assert(t != NULL);
- if (t != fillt)
+ int middle = low + (high - low) / 2;
+ const colormaprange *cmr = &cm->cmranges[middle];
+
+ if (c < cmr->cmin)
+ high = middle;
+ else if (c > cmr->cmax)
+ low = middle + 1;
+ else
{
- if (level < NBYTS - 2)
- { /* more pointer blocks below */
- cmtreefree(cm, t, level + 1);
- FREE(t);
- }
- else
- { /* color block below */
- cb = cm->cd[t->tcolor[0]].block;
- if (t != cb) /* not a solid block */
- FREE(t);
- }
+ rownum = cmr->rownum; /* found a match */
+ break;
}
}
-}
-/*
- * setcolor - set the color of a character in a colormap
- */
-static color /* previous color */
-setcolor(struct colormap * cm,
- chr c,
- color co)
-{
- uchr uc = c;
- int shift;
- int level;
- int b;
- int bottom;
- union tree *t;
- union tree *newt;
- union tree *fillt;
- union tree *lastt;
- union tree *cb;
- color prev;
-
- assert(cm->magic == CMMAGIC);
- if (CISERR() || co == COLORLESS)
- return COLORLESS;
-
- t = cm->tree;
- for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0;
- level++, shift -= BYTBITS)
- {
- b = (uc >> shift) & BYTMASK;
- lastt = t;
- t = lastt->tptr[b];
- assert(t != NULL);
- fillt = &cm->tree[level + 1];
- bottom = (shift <= BYTBITS) ? 1 : 0;
- cb = (bottom) ? cm->cd[t->tcolor[0]].block : fillt;
- if (t == fillt || t == cb)
- { /* must allocate a new block */
- newt = (union tree *) MALLOC((bottom) ?
- sizeof(struct colors) : sizeof(struct ptrs));
- if (newt == NULL)
- {
- CERR(REG_ESPACE);
- return COLORLESS;
- }
- if (bottom)
- memcpy(VS(newt->tcolor), VS(t->tcolor),
- BYTTAB * sizeof(color));
- else
- memcpy(VS(newt->tptr), VS(t->tptr),
- BYTTAB * sizeof(union tree *));
- t = newt;
- lastt->tptr[b] = t;
- }
+ /*
+ * Find which column it's in --- this is all locale-dependent.
+ */
+ if (cm->hiarraycols > 1)
+ {
+ colnum = cclass_column_index(cm, c);
+ return cm->hicolormap[rownum * cm->hiarraycols + colnum];
+ }
+ else
+ {
+ /* fast path if no relevant cclasses */
+ return cm->hicolormap[rownum];
}
-
- b = uc & BYTMASK;
- prev = t->tcolor[b];
- t->tcolor[b] = co;
- return prev;
}
/*
}
/*
- * newcolor - find a new color (must be subject of setcolor at once)
+ * newcolor - find a new color (must be assigned at once)
* Beware: may relocate the colordescs.
*/
static color /* COLORLESS for error */
cd = &cm->cd[cm->max];
}
- cd->nchrs = 0;
+ cd->nschrs = 0;
+ cd->nuchrs = 0;
cd->sub = NOSUB;
cd->arcs = NULL;
cd->firstchr = CHR_MIN; /* in case never set otherwise */
cd->flags = 0;
- cd->block = NULL;
return (color) (cd - cm->cd);
}
assert(cd->arcs == NULL);
assert(cd->sub == NOSUB);
- assert(cd->nchrs == 0);
+ assert(cd->nschrs == 0);
+ assert(cd->nuchrs == 0);
cd->flags = FREECOL;
- if (cd->block != NULL)
- {
- FREE(cd->block);
- cd->block = NULL; /* just paranoia */
- }
if ((size_t) co == cm->max)
{
pseudocolor(struct colormap * cm)
{
color co;
+ struct colordesc *cd;
co = newcolor(cm);
if (CISERR())
return COLORLESS;
- cm->cd[co].nchrs = 1;
- cm->cd[co].flags = PSEUDO;
+ cd = &cm->cd[co];
+ cd->nschrs = 0;
+ cd->nuchrs = 1; /* pretend it is in the upper map */
+ cd->sub = NOSUB;
+ cd->arcs = NULL;
+ cd->firstchr = CHR_MIN;
+ cd->flags = PSEUDO;
return co;
}
/*
* subcolor - allocate a new subcolor (if necessary) to this chr
+ *
+ * This works only for chrs that map into the low color map.
*/
static color
subcolor(struct colormap * cm, chr c)
color co; /* current color of c */
color sco; /* new subcolor */
- co = GETCOLOR(cm, c);
+ assert(c <= MAX_SIMPLE_CHR);
+
+ co = cm->locolormap[c - CHR_MIN];
sco = newsub(cm, co);
if (CISERR())
return COLORLESS;
if (co == sco) /* already in an open subcolor */
return co; /* rest is redundant */
- cm->cd[co].nchrs--;
- if (cm->cd[sco].nchrs == 0)
+ cm->cd[co].nschrs--;
+ if (cm->cd[sco].nschrs == 0)
cm->cd[sco].firstchr = c;
- cm->cd[sco].nchrs++;
- setcolor(cm, c, sco);
+ cm->cd[sco].nschrs++;
+ cm->locolormap[c - CHR_MIN] = sco;
+ return sco;
+}
+
+/*
+ * subcolorhi - allocate a new subcolor (if necessary) to this colormap entry
+ *
+ * This is the same processing as subcolor(), but for entries in the high
+ * colormap, which do not necessarily correspond to exactly one chr code.
+ */
+static color
+subcolorhi(struct colormap * cm, color *pco)
+{
+ color co; /* current color of entry */
+ color sco; /* new subcolor */
+
+ co = *pco;
+ sco = newsub(cm, co);
+ if (CISERR())
+ return COLORLESS;
+ assert(sco != COLORLESS);
+
+ if (co == sco) /* already in an open subcolor */
+ return co; /* rest is redundant */
+ cm->cd[co].nuchrs--;
+ cm->cd[sco].nuchrs++;
+ *pco = sco;
return sco;
}
sco = cm->cd[co].sub;
if (sco == NOSUB)
{ /* color has no open subcolor */
- if (cm->cd[co].nchrs == 1) /* optimization */
+ /* optimization: singly-referenced color need not be subcolored */
+ if ((cm->cd[co].nschrs + cm->cd[co].nuchrs) == 1)
return co;
sco = newcolor(cm); /* must create subcolor */
if (sco == COLORLESS)
}
/*
- * subrange - allocate new subcolors to this range of chrs, fill in arcs
+ * newhicolorrow - get a new row in the hicolormap, cloning it from oldrow
+ *
+ * Returns array index of new row. Note the array might move.
*/
-static void
-subrange(struct vars * v,
- chr from,
- chr to,
- struct state * lp,
- struct state * rp)
+static int
+newhicolorrow(struct colormap * cm,
+ int oldrow)
{
- uchr uf;
+ int newrow = cm->hiarrayrows;
+ color *newrowptr;
int i;
- assert(from <= to);
+ /* Assign a fresh array row index, enlarging storage if needed */
+ if (newrow >= cm->maxarrayrows)
+ {
+ color *newarray;
+
+ if (cm->maxarrayrows >= INT_MAX / (cm->hiarraycols * 2))
+ {
+ CERR(REG_ESPACE);
+ return 0;
+ }
+ newarray = (color *) REALLOC(cm->hicolormap,
+ cm->maxarrayrows * 2 *
+ cm->hiarraycols * sizeof(color));
+ if (newarray == NULL)
+ {
+ CERR(REG_ESPACE);
+ return 0;
+ }
+ cm->hicolormap = newarray;
+ cm->maxarrayrows *= 2;
+ }
+ cm->hiarrayrows++;
+
+ /* Copy old row data */
+ newrowptr = &cm->hicolormap[newrow * cm->hiarraycols];
+ memcpy(newrowptr,
+ &cm->hicolormap[oldrow * cm->hiarraycols],
+ cm->hiarraycols * sizeof(color));
+
+ /* Increase color reference counts to reflect new colormap entries */
+ for (i = 0; i < cm->hiarraycols; i++)
+ cm->cd[newrowptr[i]].nuchrs++;
- /* first, align "from" on a tree-block boundary */
- uf = (uchr) from;
- i = (int) (((uf + BYTTAB - 1) & (uchr) ~BYTMASK) - uf);
- for (; from <= to && i > 0; i--, from++)
- newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp);
- if (from > to) /* didn't reach a boundary */
+ return newrow;
+}
+
+/*
+ * newhicolorcols - create a new set of columns in the high colormap
+ *
+ * Essentially, extends the 2-D array to the right with a copy of itself.
+ */
+static void
+newhicolorcols(struct colormap * cm)
+{
+ color *newarray;
+ int r,
+ c;
+
+ if (cm->hiarraycols >= INT_MAX / (cm->maxarrayrows * 2))
+ {
+ CERR(REG_ESPACE);
return;
+ }
+ newarray = (color *) REALLOC(cm->hicolormap,
+ cm->maxarrayrows *
+ cm->hiarraycols * 2 * sizeof(color));
+ if (newarray == NULL)
+ {
+ CERR(REG_ESPACE);
+ return;
+ }
+ cm->hicolormap = newarray;
- /* deal with whole blocks */
- for (; to - from >= BYTTAB; from += BYTTAB)
- subblock(v, from, lp, rp);
+ /* Duplicate existing columns to the right, and increase ref counts */
+ /* Must work backwards in the array because we realloc'd in place */
+ for (r = cm->hiarrayrows - 1; r >= 0; r--)
+ {
+ color *oldrowptr = &newarray[r * cm->hiarraycols];
+ color *newrowptr = &newarray[r * cm->hiarraycols * 2];
+ color *newrowptr2 = newrowptr + cm->hiarraycols;
+
+ for (c = 0; c < cm->hiarraycols; c++)
+ {
+ color co = oldrowptr[c];
- /* clean up any remaining partial table */
- for (; from <= to; from++)
- newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp);
+ newrowptr[c] = newrowptr2[c] = co;
+ cm->cd[co].nuchrs++;
+ }
+ }
+
+ cm->hiarraycols *= 2;
}
/*
- * subblock - allocate new subcolors for one tree block of chrs, fill in arcs
+ * subcolorcvec - allocate new subcolors to cvec members, fill in arcs
*
- * Note: subcolors that are created during execution of this function
- * will not be given a useful value of firstchr; it'll be left as CHR_MIN.
- * For the current usage of firstchr in pg_regprefix, this does not matter
- * because such subcolors won't occur in the common prefix of a regex.
+ * For each chr "c" represented by the cvec, do the equivalent of
+ * newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp);
+ *
+ * Note that in typical cases, many of the subcolors are the same.
+ * While newarc() would discard duplicate arc requests, we can save
+ * some cycles by not calling it repetitively to begin with. This is
+ * mechanized with the "lastsubcolor" state variable.
*/
static void
-subblock(struct vars * v,
- chr start, /* first of BYTTAB chrs */
- struct state * lp,
- struct state * rp)
+subcolorcvec(struct vars * v,
+ struct cvec * cv,
+ struct state * lp,
+ struct state * rp)
{
- uchr uc = start;
struct colormap *cm = v->cm;
- int shift;
- int level;
+ color lastsubcolor = COLORLESS;
+ chr ch,
+ from,
+ to;
+ const chr *p;
int i;
- int b;
- union tree *t;
- union tree *cb;
- union tree *fillt;
- union tree *lastt;
- int previ;
- int ndone;
- color co;
- color sco;
- assert((uc % BYTTAB) == 0);
-
- /* find its color block, making new pointer blocks as needed */
- t = cm->tree;
- fillt = NULL;
- for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0;
- level++, shift -= BYTBITS)
- {
- b = (uc >> shift) & BYTMASK;
- lastt = t;
- t = lastt->tptr[b];
- assert(t != NULL);
- fillt = &cm->tree[level + 1];
- if (t == fillt && shift > BYTBITS)
- { /* need new ptr block */
- t = (union tree *) MALLOC(sizeof(struct ptrs));
- if (t == NULL)
+ /* ordinary characters */
+ for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--)
+ {
+ ch = *p;
+ subcoloronechr(v, ch, lp, rp, &lastsubcolor);
+ NOERR();
+ }
+
+ /* and the ranges */
+ for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--)
+ {
+ from = *p;
+ to = *(p + 1);
+ if (from <= MAX_SIMPLE_CHR)
+ {
+ /* deal with simple chars one at a time */
+ chr lim = (to <= MAX_SIMPLE_CHR) ? to : MAX_SIMPLE_CHR;
+
+ while (from <= lim)
{
- CERR(REG_ESPACE);
- return;
+ color sco = subcolor(cm, from);
+
+ NOERR();
+ if (sco != lastsubcolor)
+ {
+ newarc(v->nfa, PLAIN, sco, lp, rp);
+ NOERR();
+ lastsubcolor = sco;
+ }
+ from++;
}
- memcpy(VS(t->tptr), VS(fillt->tptr),
- BYTTAB * sizeof(union tree *));
- lastt->tptr[b] = t;
}
+ /* deal with any part of the range that's above MAX_SIMPLE_CHR */
+ if (from < to)
+ subcoloronerange(v, from, to, lp, rp, &lastsubcolor);
+ else if (from == to)
+ subcoloronechr(v, from, lp, rp, &lastsubcolor);
+ NOERR();
}
- /* special cases: fill block or solid block */
- co = t->tcolor[0];
- cb = cm->cd[co].block;
- if (t == fillt || t == cb)
+ /* and deal with cclass if any */
+ if (cv->cclasscode >= 0)
{
- /* either way, we want a subcolor solid block */
- sco = newsub(cm, co);
- t = cm->cd[sco].block;
- if (t == NULL)
- { /* must set it up */
- t = (union tree *) MALLOC(sizeof(struct colors));
- if (t == NULL)
+ int classbit;
+ color *pco;
+ int r,
+ c;
+
+ /* Enlarge array if we don't have a column bit assignment for cclass */
+ if (cm->classbits[cv->cclasscode] == 0)
+ {
+ cm->classbits[cv->cclasscode] = cm->hiarraycols;
+ newhicolorcols(cm);
+ NOERR();
+ }
+ /* Apply subcolorhi() and make arc for each entry in relevant cols */
+ classbit = cm->classbits[cv->cclasscode];
+ pco = cm->hicolormap;
+ for (r = 0; r < cm->hiarrayrows; r++)
+ {
+ for (c = 0; c < cm->hiarraycols; c++)
{
- CERR(REG_ESPACE);
- return;
+ if (c & classbit)
+ {
+ color sco = subcolorhi(cm, pco);
+
+ NOERR();
+ /* add the arc if needed */
+ if (sco != lastsubcolor)
+ {
+ newarc(v->nfa, PLAIN, sco, lp, rp);
+ NOERR();
+ lastsubcolor = sco;
+ }
+ }
+ pco++;
}
- for (i = 0; i < BYTTAB; i++)
- t->tcolor[i] = sco;
- cm->cd[sco].block = t;
}
- /* find loop must have run at least once */
- lastt->tptr[b] = t;
- newarc(v->nfa, PLAIN, sco, lp, rp);
- cm->cd[co].nchrs -= BYTTAB;
- cm->cd[sco].nchrs += BYTTAB;
+ }
+}
+
+/*
+ * subcoloronechr - do subcolorcvec's work for a singleton chr
+ *
+ * We could just let subcoloronerange do this, but it's a bit more efficient
+ * if we exploit the single-chr case. Also, callers find it useful for this
+ * to be able to handle both low and high chr codes.
+ */
+static void
+subcoloronechr(struct vars * v,
+ chr ch,
+ struct state * lp,
+ struct state * rp,
+ color *lastsubcolor)
+{
+ struct colormap *cm = v->cm;
+ colormaprange *newranges;
+ int numnewranges;
+ colormaprange *oldrange;
+ int oldrangen;
+ int newrow;
+
+ /* Easy case for low chr codes */
+ if (ch <= MAX_SIMPLE_CHR)
+ {
+ color sco = subcolor(cm, ch);
+
+ NOERR();
+ if (sco != *lastsubcolor)
+ {
+ newarc(v->nfa, PLAIN, sco, lp, rp);
+ *lastsubcolor = sco;
+ }
+ return;
+ }
+
+ /*
+ * Potentially, we could need two more colormapranges than we have now, if
+ * the given chr is in the middle of some existing range.
+ */
+ newranges = (colormaprange *)
+ MALLOC((cm->numcmranges + 2) * sizeof(colormaprange));
+ if (newranges == NULL)
+ {
+ CERR(REG_ESPACE);
return;
}
+ numnewranges = 0;
- /* general case, a mixed block to be altered */
- i = 0;
- while (i < BYTTAB)
+ /* Ranges before target are unchanged */
+ for (oldrange = cm->cmranges, oldrangen = 0;
+ oldrangen < cm->numcmranges;
+ oldrange++, oldrangen++)
+ {
+ if (oldrange->cmax >= ch)
+ break;
+ newranges[numnewranges++] = *oldrange;
+ }
+
+ /* Match target chr against current range */
+ if (oldrangen >= cm->numcmranges || oldrange->cmin > ch)
+ {
+ /* chr does not belong to any existing range, make a new one */
+ newranges[numnewranges].cmin = ch;
+ newranges[numnewranges].cmax = ch;
+ /* row state should be cloned from the "all others" row */
+ newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
+ numnewranges++;
+ }
+ else if (oldrange->cmin == oldrange->cmax)
+ {
+ /* we have an existing singleton range matching the chr */
+ newranges[numnewranges++] = *oldrange;
+ newrow = oldrange->rownum;
+ /* we've now fully processed this old range */
+ oldrange++, oldrangen++;
+ }
+ else
{
- co = t->tcolor[i];
- sco = newsub(cm, co);
- newarc(v->nfa, PLAIN, sco, lp, rp);
- previ = i;
- do
+ /* chr is a subset of this existing range, must split it */
+ if (ch > oldrange->cmin)
+ {
+ /* emit portion of old range before chr */
+ newranges[numnewranges].cmin = oldrange->cmin;
+ newranges[numnewranges].cmax = ch - 1;
+ newranges[numnewranges].rownum = oldrange->rownum;
+ numnewranges++;
+ }
+ /* emit chr as singleton range, initially cloning from range */
+ newranges[numnewranges].cmin = ch;
+ newranges[numnewranges].cmax = ch;
+ newranges[numnewranges].rownum = newrow =
+ newhicolorrow(cm, oldrange->rownum);
+ numnewranges++;
+ if (ch < oldrange->cmax)
{
- t->tcolor[i++] = sco;
- } while (i < BYTTAB && t->tcolor[i] == co);
- ndone = i - previ;
- cm->cd[co].nchrs -= ndone;
- cm->cd[sco].nchrs += ndone;
+ /* emit portion of old range after chr */
+ newranges[numnewranges].cmin = ch + 1;
+ newranges[numnewranges].cmax = oldrange->cmax;
+ /* must clone the row if we are making two new ranges from old */
+ newranges[numnewranges].rownum =
+ (ch > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) :
+ oldrange->rownum;
+ numnewranges++;
+ }
+ /* we've now fully processed this old range */
+ oldrange++, oldrangen++;
+ }
+
+ /* Update colors in newrow and create arcs as needed */
+ subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+
+ /* Ranges after target are unchanged */
+ for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++)
+ {
+ newranges[numnewranges++] = *oldrange;
+ }
+
+ /* Assert our original space estimate was adequate */
+ assert(numnewranges <= (cm->numcmranges + 2));
+
+ /* And finally, store back the updated list of ranges */
+ if (cm->cmranges != NULL)
+ FREE(cm->cmranges);
+ cm->cmranges = newranges;
+ cm->numcmranges = numnewranges;
+}
+
+/*
+ * subcoloronerange - do subcolorcvec's work for a high range
+ */
+static void
+subcoloronerange(struct vars * v,
+ chr from,
+ chr to,
+ struct state * lp,
+ struct state * rp,
+ color *lastsubcolor)
+{
+ struct colormap *cm = v->cm;
+ colormaprange *newranges;
+ int numnewranges;
+ colormaprange *oldrange;
+ int oldrangen;
+ int newrow;
+
+ /* Caller should take care of non-high-range cases */
+ assert(from > MAX_SIMPLE_CHR);
+ assert(from < to);
+
+ /*
+ * Potentially, if we have N non-adjacent ranges, we could need as many as
+ * 2N+1 result ranges (consider case where new range spans 'em all).
+ */
+ newranges = (colormaprange *)
+ MALLOC((cm->numcmranges * 2 + 1) * sizeof(colormaprange));
+ if (newranges == NULL)
+ {
+ CERR(REG_ESPACE);
+ return;
+ }
+ numnewranges = 0;
+
+ /* Ranges before target are unchanged */
+ for (oldrange = cm->cmranges, oldrangen = 0;
+ oldrangen < cm->numcmranges;
+ oldrange++, oldrangen++)
+ {
+ if (oldrange->cmax >= from)
+ break;
+ newranges[numnewranges++] = *oldrange;
+ }
+
+ /*
+ * Deal with ranges that (partially) overlap the target. As we process
+ * each such range, increase "from" to remove the dealt-with characters
+ * from the target range.
+ */
+ while (oldrangen < cm->numcmranges && oldrange->cmin <= to)
+ {
+ if (from < oldrange->cmin)
+ {
+ /* Handle portion of new range that corresponds to no old range */
+ newranges[numnewranges].cmin = from;
+ newranges[numnewranges].cmax = oldrange->cmin - 1;
+ /* row state should be cloned from the "all others" row */
+ newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
+ numnewranges++;
+ /* Update colors in newrow and create arcs as needed */
+ subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+ /* We've now fully processed the part of new range before old */
+ from = oldrange->cmin;
+ }
+
+ if (from <= oldrange->cmin && to >= oldrange->cmax)
+ {
+ /* old range is fully contained in new, process it in-place */
+ newranges[numnewranges++] = *oldrange;
+ newrow = oldrange->rownum;
+ from = oldrange->cmax + 1;
+ }
+ else
+ {
+ /* some part of old range does not overlap new range */
+ if (from > oldrange->cmin)
+ {
+ /* emit portion of old range before new range */
+ newranges[numnewranges].cmin = oldrange->cmin;
+ newranges[numnewranges].cmax = from - 1;
+ newranges[numnewranges].rownum = oldrange->rownum;
+ numnewranges++;
+ }
+ /* emit common subrange, initially cloning from old range */
+ newranges[numnewranges].cmin = from;
+ newranges[numnewranges].cmax =
+ (to < oldrange->cmax) ? to : oldrange->cmax;
+ newranges[numnewranges].rownum = newrow =
+ newhicolorrow(cm, oldrange->rownum);
+ numnewranges++;
+ if (to < oldrange->cmax)
+ {
+ /* emit portion of old range after new range */
+ newranges[numnewranges].cmin = to + 1;
+ newranges[numnewranges].cmax = oldrange->cmax;
+ /* must clone the row if we are making two new ranges from old */
+ newranges[numnewranges].rownum =
+ (from > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) :
+ oldrange->rownum;
+ numnewranges++;
+ }
+ from = oldrange->cmax + 1;
+ }
+ /* Update colors in newrow and create arcs as needed */
+ subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+ /* we've now fully processed this old range */
+ oldrange++, oldrangen++;
+ }
+
+ if (from <= to)
+ {
+ /* Handle portion of new range that corresponds to no old range */
+ newranges[numnewranges].cmin = from;
+ newranges[numnewranges].cmax = to;
+ /* row state should be cloned from the "all others" row */
+ newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
+ numnewranges++;
+ /* Update colors in newrow and create arcs as needed */
+ subcoloronerow(v, newrow, lp, rp, lastsubcolor);
+ }
+
+ /* Ranges after target are unchanged */
+ for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++)
+ {
+ newranges[numnewranges++] = *oldrange;
+ }
+
+ /* Assert our original space estimate was adequate */
+ assert(numnewranges <= (cm->numcmranges * 2 + 1));
+
+ /* And finally, store back the updated list of ranges */
+ if (cm->cmranges != NULL)
+ FREE(cm->cmranges);
+ cm->cmranges = newranges;
+ cm->numcmranges = numnewranges;
+}
+
+/*
+ * subcoloronerow - do subcolorcvec's work for one new row in the high colormap
+ */
+static void
+subcoloronerow(struct vars * v,
+ int rownum,
+ struct state * lp,
+ struct state * rp,
+ color *lastsubcolor)
+{
+ struct colormap *cm = v->cm;
+ color *pco;
+ int i;
+
+ /* Apply subcolorhi() and make arc for each entry in row */
+ pco = &cm->hicolormap[rownum * cm->hiarraycols];
+ for (i = 0; i < cm->hiarraycols; pco++, i++)
+ {
+ color sco = subcolorhi(cm, pco);
+
+ NOERR();
+ /* make the arc if needed */
+ if (sco != *lastsubcolor)
+ {
+ newarc(v->nfa, PLAIN, sco, lp, rp);
+ NOERR();
+ *lastsubcolor = sco;
+ }
}
}
{
/* is subcolor, let parent deal with it */
}
- else if (cd->nchrs == 0)
+ else if (cd->nschrs == 0 && cd->nuchrs == 0)
{
/* parent empty, its arcs change color to subcolor */
cd->sub = NOSUB;
scd = &cm->cd[sco];
- assert(scd->nchrs > 0);
+ assert(scd->nschrs > 0 || scd->nuchrs > 0);
assert(scd->sub == sco);
scd->sub = NOSUB;
while ((a = cd->arcs) != NULL)
/* parent's arcs must gain parallel subcolor arcs */
cd->sub = NOSUB;
scd = &cm->cd[sco];
- assert(scd->nchrs > 0);
+ assert(scd->nschrs > 0 || scd->nuchrs > 0);
assert(scd->sub == sco);
scd->sub = NOSUB;
for (a = cd->arcs; a != NULL; a = a->colorchain)
struct colordesc *end;
color co;
chr c;
- char *has;
fprintf(f, "max %ld\n", (long) cm->max);
- if (NBYTS > 1)
- fillcheck(cm, cm->tree, 0, f);
end = CDEND(cm);
for (cd = cm->cd + 1, co = 1; cd < end; cd++, co++) /* skip 0 */
+ {
if (!UNUSEDCOLOR(cd))
{
- assert(cd->nchrs > 0);
- has = (cd->block != NULL) ? "#" : "";
+ assert(cd->nschrs > 0 || cd->nuchrs > 0);
if (cd->flags & PSEUDO)
- fprintf(f, "#%2ld%s(ps): ", (long) co, has);
+ fprintf(f, "#%2ld(ps): ", (long) co);
else
- fprintf(f, "#%2ld%s(%2d): ", (long) co,
- has, cd->nchrs);
+ fprintf(f, "#%2ld(%2d): ", (long) co, cd->nschrs + cd->nuchrs);
/*
* Unfortunately, it's hard to do this next bit more efficiently.
- *
- * Spencer's original coding has the loop iterating from CHR_MIN
- * to CHR_MAX, but that's utterly unusable for 32-bit chr. For
- * debugging purposes it seems fine to print only chr codes up to
- * 1000 or so.
*/
- for (c = CHR_MIN; c < 1000; c++)
+ for (c = CHR_MIN; c <= MAX_SIMPLE_CHR; c++)
if (GETCOLOR(cm, c) == co)
dumpchr(c, f);
fprintf(f, "\n");
}
-}
-
-/*
- * fillcheck - check proper filling of a tree
- */
-static void
-fillcheck(struct colormap * cm,
- union tree * tree,
- int level, /* level number (top == 0) of this block */
- FILE *f)
-{
- int i;
- union tree *t;
- union tree *fillt = &cm->tree[level + 1];
-
- assert(level < NBYTS - 1); /* this level has pointers */
- for (i = BYTTAB - 1; i >= 0; i--)
+ }
+ /* dump the high colormap if it contains anything interesting */
+ if (cm->hiarrayrows > 1 || cm->hiarraycols > 1)
{
- t = tree->tptr[i];
- if (t == NULL)
- fprintf(f, "NULL found in filled tree!\n");
- else if (t == fillt)
+ int r,
+ c;
+ const color *rowptr;
+
+ fprintf(f, "other:\t");
+ for (c = 0; c < cm->hiarraycols; c++)
+ {
+ fprintf(f, "\t%ld", (long) cm->hicolormap[c]);
+ }
+ fprintf(f, "\n");
+ for (r = 0; r < cm->numcmranges; r++)
{
+ dumpchr(cm->cmranges[r].cmin, f);
+ fprintf(f, "..");
+ dumpchr(cm->cmranges[r].cmax, f);
+ fprintf(f, ":");
+ rowptr = &cm->hicolormap[cm->cmranges[r].rownum * cm->hiarraycols];
+ for (c = 0; c < cm->hiarraycols; c++)
+ {
+ fprintf(f, "\t%ld", (long) rowptr[c]);
+ }
+ fprintf(f, "\n");
}
- else if (level < NBYTS - 2) /* more pointer blocks below */
- fillcheck(cm, t, level + 1, f);
}
}
/*
* Notes:
- * Only (selected) functions in _this_ file should treat chr* as non-constant.
+ * Only (selected) functions in _this_ file should treat the chr arrays
+ * of a cvec as non-constant.
*/
/*
assert(cv != NULL);
cv->nchrs = 0;
cv->nranges = 0;
+ cv->cclasscode = -1;
return cv;
}
}
};
+/*
+ * The following arrays define the valid character class names.
+ */
+static const char *const classNames[NUM_CCLASSES + 1] = {
+ "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
+ "lower", "print", "punct", "space", "upper", "xdigit", NULL
+};
+
+enum classes
+{
+ CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH,
+ CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT
+};
/*
* We do not use the hard-wired Unicode classification tables that Tcl does.
int i,
index;
- /*
- * The following arrays define the valid character class names.
- */
-
- static const char *const classNames[] = {
- "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
- "lower", "print", "punct", "space", "upper", "xdigit", NULL
- };
-
- enum classes
- {
- CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH,
- CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT
- };
-
/*
* Map the name to the corresponding enumerated value.
*/
* pg_ctype_get_cache so that we can cache the results. Other classes
* have definitions that are hard-wired here, and for those we just
* construct a transient cvec on the fly.
+ *
+ * NB: keep this code in sync with cclass_column_index(), below.
*/
switch ((enum classes) index)
{
case CC_PRINT:
- cv = pg_ctype_get_cache(pg_wc_isprint);
+ cv = pg_ctype_get_cache(pg_wc_isprint, index);
break;
case CC_ALNUM:
- cv = pg_ctype_get_cache(pg_wc_isalnum);
+ cv = pg_ctype_get_cache(pg_wc_isalnum, index);
break;
case CC_ALPHA:
- cv = pg_ctype_get_cache(pg_wc_isalpha);
+ cv = pg_ctype_get_cache(pg_wc_isalpha, index);
break;
case CC_ASCII:
/* hard-wired meaning */
addrange(cv, 0x7f, 0x9f);
break;
case CC_DIGIT:
- cv = pg_ctype_get_cache(pg_wc_isdigit);
+ cv = pg_ctype_get_cache(pg_wc_isdigit, index);
break;
case CC_PUNCT:
- cv = pg_ctype_get_cache(pg_wc_ispunct);
+ cv = pg_ctype_get_cache(pg_wc_ispunct, index);
break;
case CC_XDIGIT:
}
break;
case CC_SPACE:
- cv = pg_ctype_get_cache(pg_wc_isspace);
+ cv = pg_ctype_get_cache(pg_wc_isspace, index);
break;
case CC_LOWER:
- cv = pg_ctype_get_cache(pg_wc_islower);
+ cv = pg_ctype_get_cache(pg_wc_islower, index);
break;
case CC_UPPER:
- cv = pg_ctype_get_cache(pg_wc_isupper);
+ cv = pg_ctype_get_cache(pg_wc_isupper, index);
break;
case CC_GRAPH:
- cv = pg_ctype_get_cache(pg_wc_isgraph);
+ cv = pg_ctype_get_cache(pg_wc_isgraph, index);
break;
}
return cv;
}
+/*
+ * cclass_column_index - get appropriate high colormap column index for chr
+ */
+static int
+cclass_column_index(struct colormap * cm, chr c)
+{
+ int colnum = 0;
+
+ /* Shouldn't go through all these pushups for simple chrs */
+ assert(c > MAX_SIMPLE_CHR);
+
+ /*
+ * Note: we should not see requests to consider cclasses that are not
+ * treated as locale-specific by cclass(), above.
+ */
+ if (cm->classbits[CC_PRINT] && pg_wc_isprint(c))
+ colnum |= cm->classbits[CC_PRINT];
+ if (cm->classbits[CC_ALNUM] && pg_wc_isalnum(c))
+ colnum |= cm->classbits[CC_ALNUM];
+ if (cm->classbits[CC_ALPHA] && pg_wc_isalpha(c))
+ colnum |= cm->classbits[CC_ALPHA];
+ assert(cm->classbits[CC_ASCII] == 0);
+ assert(cm->classbits[CC_BLANK] == 0);
+ assert(cm->classbits[CC_CNTRL] == 0);
+ if (cm->classbits[CC_DIGIT] && pg_wc_isdigit(c))
+ colnum |= cm->classbits[CC_DIGIT];
+ if (cm->classbits[CC_PUNCT] && pg_wc_ispunct(c))
+ colnum |= cm->classbits[CC_PUNCT];
+ assert(cm->classbits[CC_XDIGIT] == 0);
+ if (cm->classbits[CC_SPACE] && pg_wc_isspace(c))
+ colnum |= cm->classbits[CC_SPACE];
+ if (cm->classbits[CC_LOWER] && pg_wc_islower(c))
+ colnum |= cm->classbits[CC_LOWER];
+ if (cm->classbits[CC_UPPER] && pg_wc_isupper(c))
+ colnum |= cm->classbits[CC_UPPER];
+ if (cm->classbits[CC_GRAPH] && pg_wc_isgraph(c))
+ colnum |= cm->classbits[CC_GRAPH];
+
+ return colnum;
+}
+
/*
* allcases - supply cvec for all case counterparts of a chr (including itself)
*
* Note that the result must not be freed or modified by caller.
*/
static struct cvec *
-pg_ctype_get_cache(pg_wc_probefunc probefunc)
+pg_ctype_get_cache(pg_wc_probefunc probefunc, int cclasscode)
{
pg_ctype_cache *pcc;
pg_wchar max_chr;
pcc->cv.ranges = (chr *) malloc(pcc->cv.rangespace * sizeof(chr) * 2);
if (pcc->cv.chrs == NULL || pcc->cv.ranges == NULL)
goto out_of_memory;
+ pcc->cv.cclasscode = cclasscode;
/*
- * Decide how many character codes we ought to look through. For C locale
- * there's no need to go further than 127. Otherwise, if the encoding is
- * UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot
- * extend it as far as we'd like (say, 0xFFFF, the end of the Basic
- * Multilingual Plane) without creating significant performance issues due
- * to too many characters being fed through the colormap code. This will
- * need redesign to fix reasonably, but at least for the moment we have
- * all common European languages covered. Otherwise (not C, not UTF8) go
- * up to 255. These limits are interrelated with restrictions discussed
- * at the head of this file.
+ * Decide how many character codes we ought to look through. In general
+ * we don't go past MAX_SIMPLE_CHR; chr codes above that are handled at
+ * runtime using the "high colormap" mechanism. However, in C locale
+ * there's no need to go further than 127, and if we only have a 1-byte
+ * <ctype.h> API there's no need to go further than that can handle.
+ *
+ * If it's not MAX_SIMPLE_CHR that's constraining the search, mark the
+ * output cvec as not having any locale-dependent behavior, since there
+ * will be no need to do any run-time locale checks. (The #if's here
+ * would always be true for production values of MAX_SIMPLE_CHR, but it's
+ * useful to allow it to be small for testing purposes.)
*/
switch (pg_regex_strategy)
{
case PG_REGEX_LOCALE_C:
+#if MAX_SIMPLE_CHR >= 127
max_chr = (pg_wchar) 127;
+ pcc->cv.cclasscode = -1;
+#else
+ max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+#endif
break;
case PG_REGEX_LOCALE_WIDE:
case PG_REGEX_LOCALE_WIDE_L:
- max_chr = (pg_wchar) 0x7FF;
+ max_chr = (pg_wchar) MAX_SIMPLE_CHR;
break;
case PG_REGEX_LOCALE_1BYTE:
case PG_REGEX_LOCALE_1BYTE_L:
+#if MAX_SIMPLE_CHR >= UCHAR_MAX
max_chr = (pg_wchar) UCHAR_MAX;
+ pcc->cv.cclasscode = -1;
+#else
+ max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+#endif
break;
default:
max_chr = 0; /* can't get here, but keep compiler quiet */
static void brackpart(struct vars *, struct state *, struct state *);
static const chr *scanplain(struct vars *);
static void onechr(struct vars *, chr, struct state *, struct state *);
-static void dovec(struct vars *, struct cvec *, struct state *, struct state *);
static void wordchrs(struct vars *);
static void processlacon(struct vars *, struct state *, struct state *, int,
struct state *, struct state *);
/* === regc_color.c === */
static void initcm(struct vars *, struct colormap *);
static void freecm(struct colormap *);
-static void cmtreefree(struct colormap *, union tree *, int);
-static color setcolor(struct colormap *, chr, color);
static color maxcolor(struct colormap *);
static color newcolor(struct colormap *);
static void freecolor(struct colormap *, color);
static color pseudocolor(struct colormap *);
-static color subcolor(struct colormap *, chr c);
+static color subcolor(struct colormap *, chr);
+static color subcolorhi(struct colormap *, color *);
static color newsub(struct colormap *, color);
-static void subrange(struct vars *, chr, chr, struct state *, struct state *);
-static void subblock(struct vars *, chr, struct state *, struct state *);
+static int newhicolorrow(struct colormap *, int);
+static void newhicolorcols(struct colormap *);
+static void subcolorcvec(struct vars *, struct cvec *, struct state *, struct state *);
+static void subcoloronechr(struct vars *, chr, struct state *, struct state *, color *);
+static void subcoloronerange(struct vars *, chr, chr, struct state *, struct state *, color *);
+static void subcoloronerow(struct vars *, int, struct state *, struct state *, color *);
static void okcolors(struct nfa *, struct colormap *);
static void colorchain(struct colormap *, struct arc *);
static void uncolorchain(struct colormap *, struct arc *);
#ifdef REG_DEBUG
static void dumpcolors(struct colormap *, FILE *);
-static void fillcheck(struct colormap *, union tree *, int, FILE *);
static void dumpchr(chr, FILE *);
#endif
/* === regc_nfa.c === */
static int before(chr, chr);
static struct cvec *eclass(struct vars *, chr, int);
static struct cvec *cclass(struct vars *, const chr *, const chr *, int);
+static int cclass_column_index(struct colormap *, chr);
static struct cvec *allcases(struct vars *, chr);
static int cmp(const chr *, const chr *, size_t);
static int casecmp(const chr *, const chr *, size_t);
NOERR();
cv = eclass(v, startc, (v->cflags & REG_ICASE));
NOERR();
- dovec(v, cv, lp, rp);
+ subcolorcvec(v, cv, lp, rp);
return;
break;
case CCLASS:
NOERR();
cv = cclass(v, startp, endp, (v->cflags & REG_ICASE));
NOERR();
- dovec(v, cv, lp, rp);
+ subcolorcvec(v, cv, lp, rp);
return;
break;
default:
NOTE(REG_UUNPORT);
cv = range(v, startc, endc, (v->cflags & REG_ICASE));
NOERR();
- dovec(v, cv, lp, rp);
+ subcolorcvec(v, cv, lp, rp);
}
/*
{
if (!(v->cflags & REG_ICASE))
{
- newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp);
+ color lastsubcolor = COLORLESS;
+
+ subcoloronechr(v, c, lp, rp, &lastsubcolor);
return;
}
/* rats, need general case anyway... */
- dovec(v, allcases(v, c), lp, rp);
-}
-
-/*
- * dovec - fill in arcs for each element of a cvec
- */
-static void
-dovec(struct vars * v,
- struct cvec * cv,
- struct state * lp,
- struct state * rp)
-{
- chr ch,
- from,
- to;
- const chr *p;
- int i;
-
- /* ordinary characters */
- for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--)
- {
- ch = *p;
- newarc(v->nfa, PLAIN, subcolor(v->cm, ch), lp, rp);
- NOERR();
- }
-
- /* and the ranges */
- for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--)
- {
- from = *p;
- to = *(p + 1);
- if (from <= to)
- subrange(v, from, to, lp, rp);
- NOERR();
- }
+ subcolorcvec(v, allcases(v, c), lp, rp);
}
/*
#include "regex/regexport.h"
-static void scancolormap(struct colormap * cm, int co,
- union tree * t, int level, chr partial,
- pg_wchar **chars, int *chars_len);
-
/*
* Get total number of NFA states.
*
* Note: we return -1 if the color number is invalid, or if it is a special
* color (WHITE or a pseudocolor), or if the number of members is uncertain.
- * The latter case cannot arise right now but is specified to allow for future
- * improvements (see musings about run-time handling of higher character codes
- * in regex/README). Callers should not try to extract the members if -1 is
- * returned.
+ * Callers should not try to extract the members if -1 is returned.
*/
int
pg_reg_getnumcharacters(const regex_t *regex, int co)
if (cm->cd[co].flags & PSEUDO) /* also pseudocolors (BOS etc) */
return -1;
- return cm->cd[co].nchrs;
+ /*
+ * If the color appears anywhere in the high colormap, treat its number of
+ * members as uncertain. In principle we could determine all the specific
+ * chrs corresponding to each such entry, but it would be expensive
+ * (particularly if character class tests are required) and it doesn't
+ * seem worth it.
+ */
+ if (cm->cd[co].nuchrs != 0)
+ return -1;
+
+ /* OK, return the known number of member chrs */
+ return cm->cd[co].nschrs;
}
/*
pg_wchar *chars, int chars_len)
{
struct colormap *cm;
+ chr c;
assert(regex != NULL && regex->re_magic == REMAGIC);
cm = &((struct guts *) regex->re_guts)->cmap;
if (cm->cd[co].flags & PSEUDO)
return;
- /* Recursively search the colormap tree */
- scancolormap(cm, co, cm->tree, 0, 0, &chars, &chars_len);
-}
-
-/*
- * Recursively scan the colormap tree to find chrs belonging to color "co".
- * See regex/README for info about the tree structure.
- *
- * t: tree block to scan
- * level: level (from 0) of t
- * partial: partial chr code for chrs within t
- * chars, chars_len: output area
- */
-static void
-scancolormap(struct colormap * cm, int co,
- union tree * t, int level, chr partial,
- pg_wchar **chars, int *chars_len)
-{
- int i;
-
- if (level < NBYTS - 1)
- {
- /* non-leaf node */
- for (i = 0; i < BYTTAB; i++)
- {
- /*
- * We do not support search for chrs of color 0 (WHITE), so
- * all-white subtrees need not be searched. These can be
- * recognized because they are represented by the fill blocks in
- * the colormap struct. This typically allows us to avoid
- * scanning large regions of higher-numbered chrs.
- */
- if (t->tptr[i] == &cm->tree[level + 1])
- continue;
-
- /* Recursively scan next level down */
- scancolormap(cm, co,
- t->tptr[i], level + 1,
- (partial | (chr) i) << BYTBITS,
- chars, chars_len);
- }
- }
- else
+ /*
+ * We need only examine the low character map; there should not be any
+ * matching entries in the high map.
+ */
+ for (c = CHR_MIN; c <= MAX_SIMPLE_CHR; c++)
{
- /* leaf node */
- for (i = 0; i < BYTTAB; i++)
+ if (cm->locolormap[c - CHR_MIN] == co)
{
- if (t->tcolor[i] == co)
- {
- if (*chars_len > 0)
- {
- **chars = partial | (chr) i;
- (*chars)++;
- (*chars_len)--;
- }
- }
+ *chars++ = c;
+ if (--chars_len == 0)
+ break;
}
}
}
if (thiscolor == COLORLESS)
break;
/* The color must be a singleton */
- if (cm->cd[thiscolor].nchrs != 1)
+ if (cm->cd[thiscolor].nschrs != 1)
+ break;
+ /* Must not have any high-color-map entries */
+ if (cm->cd[thiscolor].nuchrs != 0)
break;
/*
*/
#define CHR_IS_IN_RANGE(c) ((c) <= CHR_MAX)
+/*
+ * MAX_SIMPLE_CHR is the cutoff between "simple" and "complicated" processing
+ * in the color map logic. It should usually be chosen high enough to ensure
+ * that all common characters are <= MAX_SIMPLE_CHR. However, very large
+ * values will be counterproductive since they cause more regex setup time.
+ * Also, small values can be helpful for testing the high-color-map logic
+ * with plain old ASCII input.
+ */
+#define MAX_SIMPLE_CHR 0x7FF /* suitable value for Unicode */
+
/* functions operating on chr */
#define iscalnum(x) pg_wc_isalnum(x)
#define iscalpha(x) pg_wc_isalpha(x)
extern int pg_regprefix(regex_t *, pg_wchar **, size_t *);
extern void pg_regfree(regex_t *);
extern size_t pg_regerror(int, const regex_t *, char *, size_t);
-extern void pg_set_regex_collation(Oid collation);
#endif /* _REGEX_H_ */
#define ISBSET(uv, sn) ((uv)[(sn)/UBITS] & ((unsigned)1 << ((sn)%UBITS)))
-
-/*
- * We dissect a chr into byts for colormap table indexing. Here we define
- * a byt, which will be the same as a byte on most machines... The exact
- * size of a byt is not critical, but about 8 bits is good, and extraction
- * of 8-bit chunks is sometimes especially fast.
- */
-#ifndef BYTBITS
-#define BYTBITS 8 /* bits in a byt */
-#endif
-#define BYTTAB (1<<BYTBITS) /* size of table with one entry per byt value */
-#define BYTMASK (BYTTAB-1) /* bit mask for byt */
-#define NBYTS ((CHRBITS+BYTBITS-1)/BYTBITS)
-/* the definition of GETCOLOR(), below, assumes NBYTS <= 4 */
-
-
-
/*
* As soon as possible, we map chrs into equivalence classes -- "colors" --
* which are of much more manageable number.
#define MAX_COLOR 32767 /* max color (must fit in 'color' datatype) */
#define COLORLESS (-1) /* impossible color */
#define WHITE 0 /* default color, parent of all others */
-
+/* Note: various places in the code know that WHITE is zero */
-/*
- * A colormap is a tree -- more precisely, a DAG -- indexed at each level
- * by a byt of the chr, to map the chr to a color efficiently. Because
- * lower sections of the tree can be shared, it can exploit the usual
- * sparseness of such a mapping table. The tree is always NBYTS levels
- * deep (in the past it was shallower during construction but was "filled"
- * to full depth at the end of that); areas that are unaltered as yet point
- * to "fill blocks" which are entirely WHITE in color.
- *
- * Leaf-level tree blocks are of type "struct colors", while upper-level
- * blocks are of type "struct ptrs". Pointers into the tree are generally
- * declared as "union tree *" to be agnostic about what level they point to.
- */
-
-/* the tree itself */
-struct colors
-{
- color ccolor[BYTTAB];
-};
-struct ptrs
-{
- union tree *pptr[BYTTAB];
-};
-union tree
-{
- struct colors colors;
- struct ptrs ptrs;
-};
-
-/* use these pseudo-field names when dereferencing a "union tree" pointer */
-#define tcolor colors.ccolor
-#define tptr ptrs.pptr
-
/*
* Per-color data structure for the compile-time color machinery
*
*/
struct colordesc
{
- uchr nchrs; /* number of chars of this color */
+ int nschrs; /* number of simple chars of this color */
+ int nuchrs; /* number of upper map entries of this color */
color sub; /* open subcolor, if any; or free-chain ptr */
#define NOSUB COLORLESS /* value of "sub" when no open subcolor */
struct arc *arcs; /* chain of all arcs of this color */
- chr firstchr; /* char first assigned to this color */
+ chr firstchr; /* simple char first assigned to this color */
int flags; /* bit values defined next */
#define FREECOL 01 /* currently free */
#define PSEUDO 02 /* pseudocolor, no real chars */
#define UNUSEDCOLOR(cd) ((cd)->flags & FREECOL)
- union tree *block; /* block of solid color, if any */
};
/*
* The color map itself
*
- * Much of the data in the colormap struct is only used at compile time.
- * However, the bulk of the space usage is in the "tree" structure, so it's
- * not clear that there's much point in converting the rest to a more compact
- * form when compilation is finished.
+ * This struct holds both data used only at compile time, and the chr to
+ * color mapping information, used at both compile and run time. The latter
+ * is the bulk of the space, so it's not really worth separating out the
+ * compile-only portion.
+ *
+ * Ideally, the mapping data would just be an array of colors indexed by
+ * chr codes; but for large character sets that's impractical. Fortunately,
+ * common characters have smaller codes, so we can use a simple array for chr
+ * codes up to MAX_SIMPLE_CHR, and do something more complex for codes above
+ * that, without much loss of performance. The "something more complex" is a
+ * 2-D array of color entries, where row indexes correspond to individual chrs
+ * or chr ranges that have been mentioned in the regex (with row zero
+ * representing all other chrs), and column indexes correspond to different
+ * sets of locale-dependent character classes such as "isalpha". The
+ * classbits[k] entry is zero if we do not care about the k'th character class
+ * in this regex, and otherwise it is the bit to be OR'd into the column index
+ * if the character in question is a member of that class. We find the color
+ * of a high-valued chr by identifying which colormaprange it is in to get
+ * the row index (use row zero if it's in none of them), identifying which of
+ * the interesting cclasses it's in to get the column index, and then indexing
+ * into the 2-D hicolormap array.
+ *
+ * The colormapranges are required to be nonempty, nonoverlapping, and to
+ * appear in increasing chr-value order.
*/
+
+#define NUM_CCLASSES 13 /* must match data in regc_locale.c */
+
+typedef struct colormaprange
+{
+ chr cmin; /* range represents cmin..cmax inclusive */
+ chr cmax;
+ int rownum; /* row index in hicolormap array (>= 1) */
+} colormaprange;
+
struct colormap
{
int magic;
color free; /* beginning of free chain (if non-0) */
struct colordesc *cd; /* pointer to array of colordescs */
#define CDEND(cm) (&(cm)->cd[(cm)->max + 1])
+
+ /* mapping data for chrs <= MAX_SIMPLE_CHR: */
+ color *locolormap; /* simple array indexed by chr code */
+
+ /* mapping data for chrs > MAX_SIMPLE_CHR: */
+ int classbits[NUM_CCLASSES]; /* see comment above */
+ int numcmranges; /* number of colormapranges */
+ colormaprange *cmranges; /* ranges of high chrs */
+ color *hicolormap; /* 2-D array of color entries */
+ int maxarrayrows; /* number of array rows allocated */
+ int hiarrayrows; /* number of array rows in use */
+ int hiarraycols; /* number of array columns (2^N) */
+
/* If we need up to NINLINECDS, we store them here to save a malloc */
-#define NINLINECDS ((size_t)10)
+#define NINLINECDS ((size_t) 10)
struct colordesc cdspace[NINLINECDS];
- union tree tree[NBYTS]; /* tree top, plus lower-level fill blocks */
};
-/* optimization magic to do fast chr->color mapping */
-#define B0(c) ((c) & BYTMASK)
-#define B1(c) (((c)>>BYTBITS) & BYTMASK)
-#define B2(c) (((c)>>(2*BYTBITS)) & BYTMASK)
-#define B3(c) (((c)>>(3*BYTBITS)) & BYTMASK)
-#if NBYTS == 1
-#define GETCOLOR(cm, c) ((cm)->tree->tcolor[B0(c)])
-#endif
-/* beware, for NBYTS>1, GETCOLOR() is unsafe -- 2nd arg used repeatedly */
-#if NBYTS == 2
-#define GETCOLOR(cm, c) ((cm)->tree->tptr[B1(c)]->tcolor[B0(c)])
-#endif
-#if NBYTS == 4
-#define GETCOLOR(cm, c) ((cm)->tree->tptr[B3(c)]->tptr[B2(c)]->tptr[B1(c)]->tcolor[B0(c)])
-#endif
+/* fetch color for chr; beware of multiple evaluation of c argument */
+#define GETCOLOR(cm, c) \
+ ((c) <= MAX_SIMPLE_CHR ? (cm)->locolormap[(c) - CHR_MIN] : pg_reg_getcolor(cm, c))
/*
* Representation of a set of characters. chrs[] represents individual
* code points, ranges[] represents ranges in the form min..max inclusive.
*
+ * If the cvec represents a locale-specific character class, eg [[:alpha:]],
+ * then the chrs[] and ranges[] arrays contain only members of that class
+ * up to MAX_SIMPLE_CHR (inclusive). cclasscode is set to regc_locale.c's
+ * code for the class, rather than being -1 as it is in an ordinary cvec.
+ *
* Note that in cvecs gotten from newcvec() and intended to be freed by
* freecvec(), both arrays of chrs are after the end of the struct, not
* separately malloc'd; so chrspace and rangespace are effectively immutable.
int nranges; /* number of ranges (chr pairs) */
int rangespace; /* number of ranges allocated in ranges[] */
chr *ranges; /* pointer to vector of chr pairs */
+ int cclasscode; /* value of "enum classes", or -1 */
};
int nlacons; /* size of lacons[]; note that only slots
* numbered 1 .. nlacons-1 are used */
};
+
+
+/* prototypes for functions that are exported from regcomp.c to regexec.c */
+extern void pg_set_regex_collation(Oid collation);
+extern color pg_reg_getcolor(struct colormap * cm, chr c);
--- /dev/null
+/*
+ * This test is for Linux/glibc systems and others that implement proper
+ * locale classification of Unicode characters with high code values.
+ * It must be run in a database with UTF8 encoding and a Unicode-aware locale.
+ */
+SET client_encoding TO UTF8;
+--
+-- Test the "high colormap" logic with single characters and ranges that
+-- exceed the MAX_SIMPLE_CHR cutoff, here assumed to be less than U+2000.
+--
+-- trivial cases:
+SELECT 'aⓐ' ~ U&'a\24D0' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⓐ' ~ U&'a\24D1' AS f;
+ f
+---
+ f
+(1 row)
+
+SELECT 'aⓕ' ~ 'a[ⓐ-ⓩ]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⒻ' ~ 'a[ⓐ-ⓩ]' AS f;
+ f
+---
+ f
+(1 row)
+
+-- cases requiring splitting of ranges:
+SELECT 'aⓕⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⓕⓐ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⓐⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS f;
+ f
+---
+ f
+(1 row)
+
+SELECT 'aⓕⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⓕⓐ' ~ 'a[ⓐ-ⓩ]ⓕ' AS f;
+ f
+---
+ f
+(1 row)
+
+SELECT 'aⓐⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⒶⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⓜⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⓜⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⓩⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
+ f
+---
+ f
+(1 row)
+
+SELECT 'aⓜ⓪' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
+ f
+---
+ f
+(1 row)
+
+SELECT 'a0' ~ 'a[a-ⓩ]' AS f;
+ f
+---
+ f
+(1 row)
+
+SELECT 'aq' ~ 'a[a-ⓩ]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⓜ' ~ 'a[a-ⓩ]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'a⓪' ~ 'a[a-ⓩ]' AS f;
+ f
+---
+ f
+(1 row)
+
+-- Locale-dependent character classes
+SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:graph:]]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:alpha:]]' AS f;
+ f
+---
+ f
+(1 row)
+
+-- Locale-dependent character classes with high ranges
+SELECT 'aⒶⓜ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⓜⒶ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS f;
+ f
+---
+ f
+(1 row)
+
+SELECT 'aⓜⒶ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'aⒶⓜ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS f;
+ f
+---
+ f
+(1 row)
+
--- /dev/null
+/*
+ * This test is for Linux/glibc systems and others that implement proper
+ * locale classification of Unicode characters with high code values.
+ * It must be run in a database with UTF8 encoding and a Unicode-aware locale.
+ */
+
+SET client_encoding TO UTF8;
+
+--
+-- Test the "high colormap" logic with single characters and ranges that
+-- exceed the MAX_SIMPLE_CHR cutoff, here assumed to be less than U+2000.
+--
+
+-- trivial cases:
+SELECT 'aⓐ' ~ U&'a\24D0' AS t;
+SELECT 'aⓐ' ~ U&'a\24D1' AS f;
+SELECT 'aⓕ' ~ 'a[ⓐ-ⓩ]' AS t;
+SELECT 'aⒻ' ~ 'a[ⓐ-ⓩ]' AS f;
+-- cases requiring splitting of ranges:
+SELECT 'aⓕⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
+SELECT 'aⓕⓐ' ~ 'aⓕ[ⓐ-ⓩ]' AS t;
+SELECT 'aⓐⓕ' ~ 'aⓕ[ⓐ-ⓩ]' AS f;
+SELECT 'aⓕⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
+SELECT 'aⓕⓐ' ~ 'a[ⓐ-ⓩ]ⓕ' AS f;
+SELECT 'aⓐⓕ' ~ 'a[ⓐ-ⓩ]ⓕ' AS t;
+SELECT 'aⒶⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+SELECT 'aⓜⓜ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+SELECT 'aⓜⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS t;
+SELECT 'aⓩⓩ' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
+SELECT 'aⓜ⓪' ~ 'a[Ⓐ-ⓜ][ⓜ-ⓩ]' AS f;
+SELECT 'a0' ~ 'a[a-ⓩ]' AS f;
+SELECT 'aq' ~ 'a[a-ⓩ]' AS t;
+SELECT 'aⓜ' ~ 'a[a-ⓩ]' AS t;
+SELECT 'a⓪' ~ 'a[a-ⓩ]' AS f;
+
+-- Locale-dependent character classes
+
+SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:graph:]]' AS t;
+SELECT 'aⒶⓜ⓪' ~ '[[:alpha:]][[:alpha:]][[:alpha:]][[:alpha:]]' AS f;
+
+-- Locale-dependent character classes with high ranges
+
+SELECT 'aⒶⓜ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS t;
+SELECT 'aⓜⒶ⓪' ~ '[a-z][[:alpha:]][ⓐ-ⓩ][[:graph:]]' AS f;
+SELECT 'aⓜⒶ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS t;
+SELECT 'aⒶⓜ⓪' ~ '[a-z][ⓐ-ⓩ][[:alpha:]][[:graph:]]' AS f;