Alex Dowad [Tue, 20 Oct 2020 05:47:20 +0000 (07:47 +0200)]
Fix mbstring support for SJIS-Mobile (DoCoMo, KDDI, and Softbank variants of Shift-JIS)
Lots of problems here.
- Don't pass 'control' characters through silently in the middle of a
multi-byte character.
- Treat it as an error if a multi-byte character is truncated.
- For ESC sequences used to encode emoji on earlier Softbank phones, if an
invalid ESC sequence is found, don't pass it through. Rather, handle it as
an error and respect `mb_substitute_character`.
- In ranges used by mobile vendors for emoji, if a certain byte sequence
doesn't map to any emoji, don't emit a mangled value (actually a raw
(ku*94)+ten value, which may not even be a valid Unicode codepoint at all).
- When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall
in the 2nd range of MicroSoft vendor extensions.
Some vendor-specific emoji have been mapped to standard Unicode codepoints
now, rather than 'private use area' codepoints. When the legacy code was
written, these codepoints may not have existed yet in the Unicode standard
which was current at that time.
Also do a major code cleanup -- remove dead code, rearrange what is left,
use some new macros and helper functions to make the code clearer...
Alex Dowad [Tue, 13 Oct 2020 05:58:53 +0000 (07:58 +0200)]
Combine MBFL_ENCTYPE_MWC2{BE,LE} constants
These constants indicate that a text encoding uses 2+ bytes for each character,
and is either big endian or little endian (respectively). But nothing in
mbstring cares about the difference between MBFL_ENCTYPE_MWC2BE and
MBFL_ENCTYPE_MWC2LE.
(Actually, nothing cares about whether these flags are set at all...
maybe we should just remove them?)
Alex Dowad [Sun, 20 Sep 2020 14:29:32 +0000 (16:29 +0200)]
Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants
These flags identify text encodings in mbstring which use a constant number of
bytes per character. While some parts of the code do use these flags, usually
to detect cases which can be optimized due to constant-width encoding, nothing
cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian).
Alex Dowad [Wed, 7 Oct 2020 20:30:34 +0000 (22:30 +0200)]
Don't pass invalid JIS X 0212, JIS X 0213, and Windows-CP932 characters through
Similarly to JIS X 0208, mbstring would pass kuten codes which are not mapped
in the JIS X 0212, JIS X 0213, or CP932 character sets through silently when
converting to another Japanese encoding.
Alex Dowad [Wed, 7 Oct 2020 20:12:27 +0000 (22:12 +0200)]
Don't pass invalid JIS X 0208 characters through
Many Japanese encodings, such as JIS7/8, Shift JIS, ISO-2022-JP, EUC-JP, and
so on encode characters from the JIS X 0208 character set. JIS X 0208 is based
on the concept of a 94x94 table, with numbered rows and columns. However,
more than a thousand of the cells in that table are empty; JIS X 0208 does not
actually use all 94x94=8,836 possible kuten codes.
mbstring had a dubious feature whereby, if a Japanese string contained one of
these 'unmapped' kuten codes, and it was being converted to another Japanese
encoding which was also based on JIS X 0208, the non-existent character would
be silently passed through, and the unmapped kuten code would be re-encoded
using the normal encoding method of the target text encoding.
Again, this _only_ happened if converting the text with the funky kuten code
to a Japanese encoding. If one tried converting it to Unicode, mbstring would
treat that as an error.
If somebody, somewhere, made their own private extension to JIS X 0208, and
used the regular Japanese encodings like Shift JIS and EUC-JP to encode this
private character set, then this feature might conceivably be useful. But how
likely is that? If someone is using Shift JIS, EUC-JP, ISO-2022-JP, etc. to
encode a funky version of JIS X 0208 with extra characters added, then that
should be treated as a separate text encoding.
The code which flags such characters with MBFL_WCSPLANE_JIS0208 is retained
solely for error reporting in `mbfl_filt_conv_illegal_output`.
Alex Dowad [Sun, 4 Oct 2020 20:29:34 +0000 (22:29 +0200)]
Enhance handling of CP932 text encoding
- Don't allow control characters to appear in the middle of a multi-byte
character. (This was a strange feature of mbstring; it doesn't make much
sense, and iconv doesn't allow it.)
- Treat truncated multi-byte characters as an error.
Nikita Popov [Wed, 25 Nov 2020 14:57:11 +0000 (15:57 +0100)]
Reindent ext/mysqli tests
Reindent ext/mysqli tests on PHP-7.4, so they match with the
indentation on PHP-8.0. Otherwise merging test changes across
branches is very unpleasant.
Nikita Popov [Wed, 25 Nov 2020 11:25:07 +0000 (12:25 +0100)]
Fix ref source management during unserialization
Only register the slot for adding ref sources later if we didn't
immediately register one. Also avoids leaking a ref source if
it is added early and the assignment fails.
Calvin Buckley [Tue, 24 Nov 2020 19:45:34 +0000 (15:45 -0400)]
sockets: Fix variable/macro name collision on AIX
The name "rem_size" is used by a macro in a system header on AIX,
specifically `sys/xmem.h`. Without changing the name, you get the
name mangled like so:
```
In file included from /usr/include/sys/uio.h:92:0,
from /QOpenSys/pkgs/lib/gcc/powerpc-ibm-aix6.1.0.0/6.3.0/include-fixed-7.1/sys/socket.h:83,
from /usr/include/sys/syslog.h:151,
from /usr/include/syslog.h:29,
from /home/calvin/rpmbuild/BUILD/php-8.0.0RC5/main/php_syslog.h:27,
from /home/calvin/rpmbuild/BUILD/php-8.0.0RC5/main/php.h:318,
from /home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:17:
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c: In function 'zif_socket_cmsg_space':
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:298:10: error: expected '=', ',', ';', 'asm' or '__attribute__' before '.' token
size_t rem_size = ZEND_LONG_MAX - entry->size;
^
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:298:10: error: expected expression before '.' token
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:299:18: error: 'u2' undeclared (first use in this function)
size_t n_max = rem_size / entry->var_el_size;
^
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:299:18: note: each undeclared identifier is reported only once for each function it appears in
```
...because of the declaration in `sys/xmem.h`:
```
```
This just renames the variable so that it won't trip on this
definition. Tested to fix the build on IBM i PASE.
Nikita Popov [Tue, 24 Nov 2020 14:52:41 +0000 (15:52 +0100)]
Fixed bug #80377
Make sure the $PHP_THREAD_SAFETY variable is always available
when configuring extensions. It was previously available for
phpized extensions, but for in-tree builds it was being set
too late.
Then, use $PHP_THREAD_SAFETY instead of $enable_zts to check for
ZTS in bundled extensions, which makes sure these checks also
work for phpize builds.
Nikita Popov [Tue, 24 Nov 2020 14:52:41 +0000 (15:52 +0100)]
Fixed bug #80377
Use $PHP_THREAD_SAFETY instead of $enable_zts to check for ZTS.
This variable is also available for phpize builds, while enable_zts
is only present for in-tree builds.
Nikita Popov [Tue, 24 Nov 2020 10:46:03 +0000 (11:46 +0100)]
Use pkg-config for libargon2
We already tried this in PHP 7.4, but ran into issues, because
alpine did not support pkg-config for libargon2 (or had a broken
pc file, not sure). The Alpine issue has been resolved in the
meantime, so let's give this another try.
Nikita Popov [Tue, 24 Nov 2020 10:31:53 +0000 (11:31 +0100)]
Fixed bug #80404
For a division like [1..1]/[2..2] produce [0..1] as a result, which
would be the integer envelope of the floating-point result.
The implementation is pretty ugly (we're now taking min/max across
eight values...) but I couldn't come up with a more elegant way
to handle this that doesn't make things a lot more complex (the
division sign handling is the annoying issue here).
Nikita Popov [Thu, 19 Nov 2020 11:46:52 +0000 (12:46 +0100)]
Use MIN/MAX when dumping RANGE[]
It's very common that one of the bounds is LONG_MIN or LONG_MAX.
Dump them as MIN/MAX instead of the int representation in that
case, as it makes the dump less noisy.
Fix #72964: White space not unfolded for CC/Bcc headers
`\r\n` does only terminate a header, if not followed by `\t` or ` `.
We have to cater to that when determining the end position of the
respective headers.