Nikita Popov [Sat, 12 Aug 2017 11:00:39 +0000 (13:00 +0200)]
Fixed bug #74103 and bug #75054
Directly fail unserialization when trying to acquire an r/R
reference to an UNDEF HT slot. Previously this left an UNDEF and
later deleted the index/key from the HT.
What actually caused the issue here is a combination of two
factors: First, the key deletion was performed using the hash API,
rather than the symtable API, such that the element was not actually
removed if it used an integral string key. Second, a subsequent
deletion operation, while collecting trailing UNDEF ranges, would
mark the element as available for reuse (leaving a corrupted HT
state with nNumOfElements > nNumUsed).
Fix this by failing early and dropping the deletion code.
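The first factor (hash API vs. symtable API) can be illustrated with a
minimal Python sketch. This is not the php-src code: the helper names are
hypothetical, a dict stands in for the HT, and the integral-string rule is
only approximated, assuming a 64-bit integer range.

```python
import re

# Approximation of the "integral string" rule: a decimal integer with no
# leading zeros (and not "-0") is treated as an integer key.
_INTEGRAL = re.compile(r"-?[1-9][0-9]*|0")
LONG_MIN, LONG_MAX = -(2 ** 63), 2 ** 63 - 1  # assumed 64-bit range


def symtable_key(key):
    """Normalize integral string keys to ints (symtable-style lookup)."""
    if isinstance(key, str) and _INTEGRAL.fullmatch(key):
        value = int(key)
        if LONG_MIN <= value <= LONG_MAX:
            return value
    return key


def hash_delete(table, key):
    """Hash-style delete: uses the key exactly as given."""
    if key in table:
        del table[key]
        return True
    return False


def symtable_delete(table, key):
    """Symtable-style delete: normalizes the key first."""
    return hash_delete(table, symtable_key(key))


# The element was stored under the integer 5, so deleting the string "5"
# through the hash-style call silently leaves it in place.
table = {symtable_key("5"): "value"}
removed_by_hash = hash_delete(table, "5")
removed_by_symtable = symtable_delete(table, "5")
```

In this sketch the hash-style delete misses the element and the
symtable-style delete removes it, mirroring the mismatch described above.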
Frank Denis [Tue, 8 Aug 2017 15:51:08 +0000 (17:51 +0200)]
Merge branch 'PHP-7.2'
* PHP-7.2:
sodium ext: Use _ietf_ vs _IETF_ consistently
sodium ext: No need for #ifdef crypto_aead_chacha20poly1305_IETF_
Sodium ext: Isolate a return statement for consistency
sodium ext: The default password hashing function is not supposed to be Argon2i
sodium ext: long -> zend_long
sodium ext: Add missing "return" statements after zend_throw_exception()
Nikita Popov [Fri, 4 Aug 2017 16:38:36 +0000 (18:38 +0200)]
Store input and output filters in mbfl encodings
For functions like mb_chr() and mb_ord() just looking up the
input/output filter for the encoding dominates the runtime. This
commit stores the input/output filter for an encoding in the
mbfl encoding structure, so it can be looked up directly, rather
than scanning through filter function lists.
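The idea can be sketched in Python as follows. The structure and names
(Encoding, OUTPUT_FILTERS, find_output_filter) are hypothetical stand-ins
for illustration, not the mbfl data structures.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


def to_utf8(text: str) -> bytes:
    return text.encode("utf-8")


def to_latin1(text: str) -> bytes:
    return text.encode("latin-1")


# Before: a flat list of filters that has to be scanned on every call.
OUTPUT_FILTERS: List[Tuple[str, Callable[[str], bytes]]] = [
    ("UTF-8", to_utf8),
    ("ISO-8859-1", to_latin1),
]


def find_output_filter(name: str) -> Optional[Callable[[str], bytes]]:
    for filter_name, func in OUTPUT_FILTERS:
        if filter_name == name:
            return func
    return None


# After: the filter is stored on the encoding itself, so the lookup is a
# single attribute access.
@dataclass
class Encoding:
    name: str
    output_filter: Callable[[str], bytes]


LATIN1 = Encoding("ISO-8859-1", to_latin1)
```

Both paths resolve to the same filter; the second simply skips the scan,
which matters when the per-call work (as in mb_chr() and mb_ord()) is tiny.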
Nikita Popov [Thu, 3 Aug 2017 20:32:31 +0000 (22:32 +0200)]
Return false on invalid codepoint in mb_chr()
Instead of returning the encoding of the current substitution
character. This allows a robust check for the failure case. The
substitution character (especially the default of "?") is also
a valid output of mb_chr() for a valid input (for "?" that would be
0x3f), so it's a bad choice for an error value.
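A small Python sketch of why a distinct failure value is more robust. The
function chr_or_none is a hypothetical analogue, with None playing the
role of false; it is not the mb_chr() implementation.

```python
def chr_or_none(codepoint, encoding="utf-8"):
    """Encode a codepoint, returning None when it cannot be represented."""
    try:
        return chr(codepoint).encode(encoding)
    except (ValueError, OverflowError):
        # ValueError covers out-of-range codepoints and, via its subclass
        # UnicodeEncodeError, codepoints the target encoding cannot map.
        return None


valid_question_mark = chr_or_none(0x3F)      # a legitimate "?" result
invalid_surrogate = chr_or_none(0xD800)      # lone surrogate: failure
unmapped = chr_or_none(0x20AC, "ascii")      # euro sign is not in ASCII
```

If failures were reported as "?", the first result could not be told apart
from the other two.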
Nikita Popov [Thu, 3 Aug 2017 20:14:00 +0000 (22:14 +0200)]
Always use Unicode codepoints in mb_ord() and mb_chr()
Previously mb_chr() had two different encoding-dependent behaviors:
* For "Unicode-encodings" it took a Unicode codepoint and returned
its encoded representation.
* Otherwise it returned a big-endian binary encoding of the passed
integer.
Now the input is always interpreted as a Unicode codepoint. If
a big-endian binary encoding is what you want, you don't need
mbstring to implement that.
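For comparison, the dropped big-endian behavior is easy to reproduce with
plain integer arithmetic. The Python sketch below is only an illustration
(the function name is hypothetical, and using the minimal number of bytes
is an assumption about the desired output).

```python
def big_endian_bytes(value: int) -> bytes:
    """Return the big-endian byte string for a non-negative integer."""
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")


raw = big_endian_bytes(0x8140)       # the integer's bytes, 0x81 0x40
encoded = chr(0x8140).encode("utf-8")  # the codepoint U+8140 in UTF-8
```

The two results differ: the first is the raw integer, the second is the
encoded representation of a Unicode codepoint.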
Nikita Popov [Thu, 3 Aug 2017 19:53:21 +0000 (21:53 +0200)]
Revert/fix substitution character fallback
The introduced checks were not correct in two respects:
* It was checked whether the source encoding of the string matches
the internal encoding, while the actually relevant encoding is
the *target* encoding.
* Even if the correct encoding is used, the checks are still too
conservative. Just because something is not a "Unicode-encoding"
does not mean that it does not map any non-ASCII characters.
I've reverted the added checks and instead adjusted mbfl_convert
to first try to use the provided substitution character and if
that fails, perform the fallback to '?' at that point. This means
that any codepoint mapped in the target encoding should now be
correctly supported and anything else should fall back to '?'.
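The adjusted fallback order can be sketched in Python as below. This is a
simplified illustration, not mbfl_convert itself: convert is a
hypothetical function, and it encodes character by character, so it only
suits target codecs that emit no byte order mark (for example "ascii" or
"latin-1").

```python
def convert(text: str, target: str, substitute: str = "?") -> bytes:
    """Encode text, substituting characters the target cannot map."""
    out = bytearray()
    for char in text:
        try:
            out += char.encode(target)
        except UnicodeEncodeError:
            try:
                # First try the configured substitution character.
                out += substitute.encode(target)
            except UnicodeEncodeError:
                # Only if that is unmapped too, fall back to "?".
                out += b"?"
    return bytes(out)


# U+00BF (inverted question mark) exists in Latin-1 but not in ASCII.
latin1_result = convert("a\u20acb", "latin-1", "\u00bf")
ascii_result = convert("a\u20acb", "ascii", "\u00bf")
```

The decision is made against the target encoding at conversion time,
which is the point of the change described above.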
The introduced checks did not treat "non-Unicode" encodings correctly,
because they treated the passed integer as encoded in the internal
encoding in that case, while in actuality the substitute character
is always a Unicode codepoint.
Additionally checking the codepoint against the internal encoding
is not correct in any case, because the substitution character must
be mapped in the *target* encoding of the conversion, which does
not necessarily coincide with the internal encoding (the internal
encoding is the default *source* encoding, not *target* encoding).
This reverts the checks back to simple range checks, but in a way
that still resolves #69079: Characters outside the Basic
Multilingual Plane are now accepted and Surrogate Codepoints are
rejected. A distinction between UTF-8 and non-UTF-8 encodings is
not made for surrogate checks (as in the original patch), as
surrogates are always illegal on their own. Specifying a surrogate
as substitution character would only make sense if you could
specify a substitution string with more than one character --
however we do not support that.
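A minimal Python sketch of such a range check. The function name is
hypothetical, and treating 0 as the lower bound is an assumption; the
text above only specifies that non-BMP codepoints are accepted and
surrogates rejected.

```python
def is_valid_substitute_codepoint(codepoint: int) -> bool:
    """Accept any Unicode codepoint except surrogates."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        return False  # outside the Unicode range
    if 0xD800 <= codepoint <= 0xDFFF:
        return False  # surrogates are never valid on their own
    return True
```

The check is the same for every encoding, so no UTF-8 vs. non-UTF-8
distinction is needed.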