Anatol Belski [Sun, 30 Aug 2020 12:14:04 +0000 (14:14 +0200)]
hash: Add MurmurHash3 with streaming support
The implementation is based on the upstream PMurHash. The following
variants are implemented
- murmur3a, 32-bit hash
- murmur3c, 128-bit hash for x86
- murmur3f, 128-bit hash for x64
The custom seed support is not targeted by this implementation. It will
need a major change to the API, so then custom arguments can be passed
through `hash_init`. For now, the starting hash is always zero.
Fixes bug #68109, closes #6059
Signed-off-by: Anatol Belski <ab@php.net> Co-Developed-by: Michael Wallner <mike@php.net> Signed-off-by: Michael Wallner <mike@php.net>
Alex Dowad [Sun, 18 Oct 2020 12:41:40 +0000 (14:41 +0200)]
Add test suite for CP1252 encoding
Also remove a bogus test (bug62545.phpt) which wrongly assumed that all invalid
characters in CP1251 and CP1252 should map to Unicode 0xFFFD (REPLACEMENT
CHARACTER).
mbstring has an interface to specify what invalid characters should be
replaced with; it's called `mb_substitute_character`. If a user wants to see
the Unicode 'replacement character', they can specify that using
`mb_substitute_character`. But if they specify something else, we should
follow that.
Alex Dowad [Sun, 18 Oct 2020 12:56:32 +0000 (14:56 +0200)]
Fix mbstring support for CP1252 encoding
It's a bit surprising how much was broken here.
- Identify filter was utterly and completely wrong.
- Instead of handling invalid CP1252 bytes as specified by
`mb_substitute_character`, it would convert them to Unicode 0xFFFD
(generic replacement character).
- When converting ISO-8859-1 to CP1252, invalid ISO-8859-1 bytes would
be passed through silently.
- Unicode codepoints from 0x80-0x9F were converted to CP1252 bytes 0x80-0x9F,
which is wrong.
- Unicode codepoint 0xFFFD was converted to CP1252 0x9F, which is very wrong.
Also clean up some unneeded code, and make the conversion table consistent with
others by using zero as a 'invalid' marker, rather than 0xFFFD.
Alex Dowad [Thu, 29 Oct 2020 21:10:04 +0000 (23:10 +0200)]
Convert numeric string array keys to integers correctly in JITted code
While fixing bugs in mbstring, one of my new test cases failed with a strange
error message stating: 'Warning: Undefined array key 1...', when clearly the
array key had been set properly.
GDB'd that sucker and found that JIT'd PHP code was calling directly into
`zend_hash_add_new` (which was not converting the numeric string key to an
integer properly). But where was that code coming from? I examined the disasm,
looked up symbols to figure out where call instructions were going, then grepped
the codebase for those function names. It soon became clear that the disasm I
was looking at was compiled from `zend_jit_fetch_dim_w_helper`.
Nikita Popov [Wed, 21 Oct 2020 13:01:47 +0000 (15:01 +0200)]
Add --repeat testing mode
This testing mode executes the test multiple times in the same
process (but in different requests). It is primarily intended to
catch tracing JIT bugs, but also catches state leaks across
requests.
Nikita Popov [Fri, 30 Oct 2020 10:11:16 +0000 (11:11 +0100)]
Fixed bug #80290
Dropping the dtor arg args[3] rather than using STR_COPY: Since
PHP 8, we no longer support separation in call_user_function(),
so we also don't need to worry about things like arguments being
replaced with references.
divinity76 [Tue, 27 Oct 2020 15:19:58 +0000 (16:19 +0100)]
Use constant size string in hash bench.php
I don't like the previous behaviour where the bytes to hash change
every time the code changes, that may make it difficult to compare
hash() performance changes over time.
Use a fixed number instead, and allow passing an override for a
different length.
Fix bug #72413: Segfault with get_result and PS cursors
We cannot simply switch to use_result here, because the fetch_row
methods in get_result mode and in use_result/store_result mode
are different: In one case it accepts a statement, in the other
a return value zval. Thus, doing a switch to use_result results
in a segfault when trying to fetch a row.
Actually supporting get_result with cursors would require adding
cursor support in mysqlnd_result, not just mysqlnd_ps. That would
be a significant amount of effort and, given the age of the issue,
does not appear to be particularly likely to happen soon.
As such, we simply generate an error when using get_result()
with cursors, which is much better than causing a segfault.
Instead, parameter binding needs to be used.
Nikita Popov [Thu, 29 Oct 2020 13:07:08 +0000 (14:07 +0100)]
Handle errors during PDO row fetch
The EOF flag also gets set on error, so we always end up ignoring
errors here.
However, we should only check errors for unbuffered results. For
buffered results, this function is guaranteed not to error, and
querying the errno may return an unrelated error.
Fix #44618: Fetching may rely on uninitialized data
Unless `SQLGetData()` returns `SQL_SUCCESS` or `SQL_SUCCESS_WITH_INFO`,
the `StrLen_or_IndPtr` output argument is not guaranteed to be properly
set. Thus we handle retrieval failure other than `SQL_ERROR` by
yielding `false` for those column values and raising a warning.
Nikita Popov [Wed, 28 Oct 2020 16:12:35 +0000 (17:12 +0100)]
Fixed bug #65825
Set error_info when we fail to read a packet, instead of throwing
a warning. Additionally we also need to populate the right
error_info in rowp_read -- we'll later take the error from the
packet, not the connection.
No test case, as this is hard to reliably test. I'm using the
test case from:
https://github.com/php/php-src/pull/2131#issuecomment-538374838
Nikita Popov [Tue, 27 Oct 2020 10:23:49 +0000 (11:23 +0100)]
Don't throw for out of bounds offsets in strspn()
Make strspn($str1, $str2, $offset, $length) behaviorally
equivalent to strspn(substr($str1, $offset, $length), $str2)
by not throwing for out of bounds offset.
There have been two reports that this change cause issues,
including bug #80285.
Alex Dowad [Wed, 14 Oct 2020 18:25:19 +0000 (20:25 +0200)]
Improve error handling for UTF-16{,BE,LE}
Catch various errors such as the first part of a surrogate pair not being
followed by a proper second part, the first part of a surrogate pair appearing
at the end of a string, the second part of a surrogate pair appearing out
of place, and so on.
Alex Dowad [Tue, 13 Oct 2020 13:17:00 +0000 (15:17 +0200)]
UTF-16 text conversion handles truncated characters as illegal
This broke one old test (Zend/tests/multibyte_encoding_003.phpt), which used
a PHP script encoded as UTF-16. The problem was that to terminate the test
script, we need the text: "\n--EXPECT--". Out of that text, the terminating
newline (0x0A byte) becomes part of the resulting test script; but a bare
0x0A byte with no 0x00 is not valid UTF-16.
Since we now treat truncated UTF-16 characters as erroneous, an extra '?' is
appended to the output as an 'illegal character' marker.
Really, if we are running PHP scripts which are treated as encoded in UTF-16
or some other arbitrary text encoding (not ASCII), and the script is not
actually a valid string in that encoding, inserting '?' characters into the
code which the PHP interpreter runs is a bad thing to do. In such cases, the
script shouldn't be treated as UTF-16 (or whatever) at all.
I wonder if mbstring's encoding detection is being used in 'non-strict' mode?