author	Andrew Gierth <rhodiumtoad@postgresql.org>
	Tue, 28 Aug 2018 08:52:25 +0000 (09:52 +0100)
committer	Andrew Gierth <rhodiumtoad@postgresql.org>
	Tue, 28 Aug 2018 11:17:33 +0000 (12:17 +0100)
commit	c8ea87e4bd950572cba4575e9a62284cebf85ac5
tree	5ce97d06d1ec24f6441ed877063ca2a1d42723fa
parent	3e2ceb231ef0bbd04bb98aa3d3b58ebcac88c00a

Avoid quadratic slowdown in regexp match/split functions.

regexp_matches, regexp_split_to_table and regexp_split_to_array all
work by compiling a list of match positions as character offsets (NOT
byte positions) in the source string.

Formerly, they then used text_substr to extract the matched text; but
in a multi-byte encoding, that counts the characters in the string,
and the characters needed to reach the starting byte position, on
every call. Accordingly, the performance degraded as the product of
the input string length and the number of match positions, such that
splitting a string of a few hundred kbytes could take many minutes.
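
The cost pattern described above can be illustrated with a small model.
This is only a sketch, not PostgreSQL's code: the helper names
(utf8_char_len, substr_by_rescan, split_by_rescan) are hypothetical, a
literal separator stands in for the regexp, and the step counter only
tracks the walk from byte 0 to the end of each substring. It assumes the
input is valid UTF-8.

```python
def utf8_char_len(lead_byte):
    """Byte length of a UTF-8 sequence, judged from its lead byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    return 4


def substr_by_rescan(data, start, length):
    """Extract `length` characters at character offset `start`.

    The byte position is found by walking from byte 0 every time, which
    is the behaviour being modelled. Returns (bytes, characters walked).
    """
    pos = 0
    steps = 0
    for _ in range(start):          # walk to the starting character
        pos += utf8_char_len(data[pos])
        steps += 1
    begin = pos
    for _ in range(length):         # walk over the substring itself
        pos += utf8_char_len(data[pos])
        steps += 1
    return data[begin:pos], steps


def split_by_rescan(text, sep):
    """Split on a literal separator, extracting each field by rescan.

    Match positions are character offsets (str.find works in code
    points), while extraction works on the UTF-8 bytes.
    """
    if not sep:
        raise ValueError("separator must not be empty")
    data = text.encode("utf-8")
    pieces = []
    total_steps = 0
    start = 0
    while True:
        idx = text.find(sep, start)
        end = len(text) if idx < 0 else idx
        piece, steps = substr_by_rescan(data, start, end - start)
        pieces.append(piece.decode("utf-8"))
        total_steps += steps
        if idx < 0:
            break
        start = idx + len(sep)
    return pieces, total_steps
```

In this model, splitting n one-character fields walks n*n characters in
total: 100 fields cost 10,000 steps and 200 fields cost 40,000, so
doubling the input quadruples the work.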

Repair by keeping the wide-character copy of the input string
available (only in the case where encoding_max_length is not 1) after
performing the match operation, and extracting substrings from that
instead. This reduces the complexity to being linear in the number of
result bytes, discounting the actual regexp match itself (which is not
affected by this patch).
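
A rough sketch of the repaired approach, under the same caveats: the
function name split_with_wide_copy is hypothetical, a list of code
points stands in for the wide-character copy, and a naive literal scan
stands in for the regexp match. The point is only that character offsets
index the wide copy directly, so extraction touches each result
character once.

```python
def split_with_wide_copy(data, sep):
    """Split UTF-8 bytes on a literal separator using a wide copy.

    Returns (fields, characters copied during extraction).
    """
    if not sep:
        raise ValueError("separator must not be empty")

    # One-time conversion to fixed-width code points; this copy is kept
    # after matching instead of being thrown away.
    wide = [ord(ch) for ch in data.decode("utf-8")]
    sep_cp = [ord(ch) for ch in sep]
    n, m = len(wide), len(sep_cp)

    # Matching phase (stand-in for the regexp): record each field as a
    # (start, end) pair of character offsets.
    positions = []
    start = 0
    i = 0
    while i + m <= n:
        if wide[i:i + m] == sep_cp:
            positions.append((start, i))
            i += m
            start = i
        else:
            i += 1
    positions.append((start, n))

    # Extraction phase: offsets index the wide copy directly, so no
    # rescan from the start of the string is needed.
    pieces = []
    copied = 0
    for s, e in positions:
        pieces.append("".join(chr(cp) for cp in wide[s:e]))
        copied += e - s
    return pieces, copied
```

Here the extraction cost for n one-character fields is n characters
copied, so doubling the input doubles the extraction work; the cost of
the matching phase is separate and unchanged.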

In passing, remove the retail pfree() cleanup, which was obsoleted by
commit ff428cded (Feb 2008); that commit made cleanup of SRF multi-call
contexts automatic. Also increase the maximum number of matches (to
~134 million) and provide an error message when it is reached.

Backpatch all the way because this has been wrong forever.

Analysis and patch by me; review by Kaiting Chen.

Discussion: https://postgr.es/m/87pnyn55qh.fsf@news-spur.riddles.org.uk

See also: https://postgr.es/m/87lg996g4r.fsf@news-spur.riddles.org.uk