Alex Dowad [Tue, 8 Sep 2020 20:57:28 +0000 (22:57 +0200)]
SJIS-2004 encoding conversion: handle invalid (or truncated) 2nd byte for Kanji correctly
If the 2nd byte of a 2-byte character is invalid, then mb_substitute_character()
should be respected. Instead, what mbstring was doing was 'swallowing' the
first byte, then emitting the 2nd byte as if it was an ASCII character.
Likewise, if the 2nd byte is missing, instead of just keeping quiet, report an
illegal character as specified by mb_substitute_character().
Alex Dowad [Mon, 9 Nov 2020 19:37:04 +0000 (21:37 +0200)]
Fix broken binary search function in mbstring
This faulty binary search would never reject values at the very high
end of the range being searched, even if they were not actually in
the table.
Among other things, this meant that some Unicode codepoints which do
not correspond to any character in JIS X 0213 would be converted to
bogus Shift-JIS-2004 values rather than being rejected.
Alex Dowad [Mon, 9 Nov 2020 19:33:32 +0000 (21:33 +0200)]
Don't redundantly flush mbstring filters multiple times
Each flush function in a chain of mbstring conversion filters always
calls the next flush function in the chain. So it is not necessary to
explicitly flush the second filter in a chain. (Due to this bug, in many
cases, flush functions were actually being called three times.)
Alex Dowad [Mon, 9 Nov 2020 20:59:07 +0000 (22:59 +0200)]
Test EUC-JP and Shift-JIS more thoroughly
Previously, the unit tests for these text encodings covered all mappings
from legacy -> Unicode, and all _reversible_ mappings from Unicode -> legacy.
However, we should also test the few Unicode -> legacy mappings which
are not reversible.
Alex Dowad [Wed, 4 Nov 2020 18:10:14 +0000 (20:10 +0200)]
Remove mbstring identify filters
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.
One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.
Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same type, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.
Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
Alex Dowad [Thu, 1 Oct 2020 17:56:42 +0000 (19:56 +0200)]
Fix mbstring support for EUC-JP text encoding
- Don't allow control characters to appear in the middle of a multi-byte
character. (A strange feature, or perhaps misfeature, of mbstring which is
not present in other libraries such as iconv.)
- When checking whether string is valid, reject kuten codes which do not
map to any character, whether converting from EUC-JP to another encoding,
or converting another encoding which uses JIS X 0208/0212 charsets to
EUC-JP.
- Truncated multi-byte characters are treated as an error.
Alex Dowad [Mon, 19 Oct 2020 18:57:58 +0000 (20:57 +0200)]
Fix mbstring support for Shift-JIS
- Reject otherwise valid kuten codes which don't map to anything in JIS X 0208.
- Handle truncated multi-byte characters as an error.
- Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the
Unicode Consortium, and as iconv does.
- Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the
Unicode Consortium, and as iconv does.
(NOTE: This will affect PHP scripts which use an internal encoding of
Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example,
it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen
sign to the Unicode yen sign means the yen sign will not be usable for
C escapes in double-quoted strings. Japanese PHP programmers who want to
write their source code in Shift-JIS for some strange reason will have to
use the JIS X 0208 backlash or 'REVERSE SOLIDUS' character for their C
escapes.)
- Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus).
- Immediately handle error if first Shift-JIS byte is over 0xEF, rather than
waiting to see the next byte. (Previously, the value used was 0xFC, which is
the limit for the 2nd byte and not the 1st byte of a multi-byte character.)
- Don't allow 'control characters' to appear in the middle of a multi-byte
character.
The test case for bug 47399 is now obsolete. That test assumed that a number
of Shift-JIS byte sequences which don't map to any character were 'valid'
(because the byte values were within the legal ranges).
Alex Dowad [Wed, 4 Nov 2020 17:54:02 +0000 (19:54 +0200)]
Remove useless byte{2,4}{be,le} encodings from mbstring
There is no meaningful difference between these and UCS-{2,4}. They are
just a little bit more lax about passing errors silently. They also have
no known use.
Alias to UCS-{2,4} in case someone, somewhere is using them.
If Zip operations fails due to PHP error conditions before libzip even
has been called, there is no meaningful indication what failed; the
functions just return false, and the Zip status indicated that no error
did occur. Therefore we raise `E_WARNING` in these cases.
Nikita Popov [Thu, 5 Nov 2020 10:58:31 +0000 (11:58 +0100)]
Fix multiple trait fixup
If a trait method is inherited, preloading trait fixup might be
performed on it multiple times. Usually this is fine, because
the opcodes pointer will have already been updated, and will thus
not be found in the xlat table.
However, it can happen that the new opcodes pointer is the same
as one of the old opcodes pointers, if the pointer has been reused
by the allocator. In this case we will look up the wrong op array
and overwrite the trait method with an unrelated trait method.
We fix this by indexing the xlat table not by the opcodes pointer,
but by the refcount pointer. The refcount pointer is not changed
during optimization, and accurately represents which op arrays
should use the same opcodes.
Fixes bug #80307. The test case does not reproduce the bug, because
this depends on a lot of "luck" with the allocator. The test case
merely illustrates a case where orig_op_array would have been NULL
in the original code.
Fix #80266: parse_url silently drops port number 0
As of commit 81b2f3e[1], `parse_url()` accepts URLs with a zero port,
but does not report that port, what is wrong in hindsight.
Since the port number is stored as `unsigned short` there is no way to
distinguish between port zero and no port. For BC reasons, we thus
introduce `parse_url_ex2()` which accepts an output parameter that
allows that distinction, and use the new function to fix the behavior.
The introduction of `parse_url_ex2()` has been suggested by Nikita.
Nikita Popov [Wed, 4 Nov 2020 13:51:44 +0000 (14:51 +0100)]
Assert that references are not persisted
There should not be any need to persist references, and it's unlikely
that persisting a reference will behave correctly at runtime, because
we don't have a concept of an immutable reference.
Nikita Popov [Tue, 3 Nov 2020 15:45:13 +0000 (16:45 +0100)]
Don't disable early binding during preloading script
We should only disable early binding during the opcache_compile_file()
calls, not inside the preloading script or anything it includes.
The right condition to check for is whether we compile the file
without execution, as declaring classes is "execution".
Nikita Popov [Tue, 3 Nov 2020 13:15:05 +0000 (14:15 +0100)]
Fix variance checks on resolved union types
This is a bit annoying: When preloading is used, types might be
resolved during inheritance checks, so we need to deal with CE
types rather than just NAME types everywhere.
Nikita Popov [Tue, 3 Nov 2020 10:50:14 +0000 (11:50 +0100)]
Don't ignore internal classes during preloading
When preloading, it's fine to make use of internal class information,
as we do not support Windows. It is also necessary to allow proper
variance checks against internal classes.