Nikita Popov [Fri, 27 Nov 2020 10:54:39 +0000 (11:54 +0100)]
Fix AVX detection
Our CPU detection code currently only checks whether hardware
support for AVX exists. However, we also need to check for operating
system support for XSAVE, as well as whether XCR0 has the SSE and
AVX bits set.
If this is not the case, unset the AVX and AVX2 bits in the cpuinfo
structure.
Hopefully this resolves our issues with CPU support detection.
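As a rough illustration of the check described above (not the code from this commit), the decision can be sketched as a pure function over the raw register values. The helper name `avx_usable` is hypothetical; the bit positions are the documented ones for CPUID leaf 1 ECX (bit 27 OSXSAVE, bit 28 AVX) and XCR0 (bit 1 SSE state, bit 2 AVX state). Reading the registers themselves (CPUID, XGETBV) is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, ECX feature bits. */
#define CPUID1_ECX_OSXSAVE (UINT32_C(1) << 27) /* OS has enabled XSAVE/XGETBV */
#define CPUID1_ECX_AVX     (UINT32_C(1) << 28) /* hardware implements AVX */

/* XCR0 state-component bits. */
#define XCR0_SSE (UINT64_C(1) << 1) /* OS saves XMM state */
#define XCR0_AVX (UINT64_C(1) << 2) /* OS saves upper YMM state */

/*
 * Hypothetical helper: decides whether AVX may be used, given the raw
 * CPUID.1:ECX value and the XCR0 value. The caller should pass xcr0 = 0
 * when OSXSAVE is clear, because XGETBV must not be executed in that case.
 */
static bool avx_usable(uint32_t cpuid1_ecx, uint64_t xcr0)
{
    const uint64_t needed = XCR0_SSE | XCR0_AVX;

    if (!(cpuid1_ecx & CPUID1_ECX_AVX)) {
        return false; /* no hardware support */
    }
    if (!(cpuid1_ecx & CPUID1_ECX_OSXSAVE)) {
        return false; /* OS has not enabled XSAVE */
    }
    /* Both the SSE and the AVX state bits must be enabled. */
    return (xcr0 & needed) == needed;
}
```

If this returns false, the AVX and AVX2 flags would be cleared in whatever structure holds the detected features.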
Nikita Popov [Fri, 27 Nov 2020 13:15:34 +0000 (14:15 +0100)]
Only use travis for cron jobs
Our primary CI has been Azure Pipelines for a while now already.
Travis was primarily retained as a) a fast feedback builder and
b) to test architectures not available elsewhere.
Due to Travis CI open source policy changes, Travis is no longer
useful as a fast feedback builder. As such, only use it for cron
job builds.
As the alternate path in this test covers all supported MySQL and MariaDB
versions and a significant portion of unsupported versions, let's keep it simple.
Nikita Popov [Fri, 27 Nov 2020 10:18:10 +0000 (11:18 +0100)]
Avoid direct calls to zend_cpu_supports()
While the use of zend_cpu_supports_*() is only strictly necessary
inside ifunc resolvers, where the cpu state has not been initialized
yet, we should prefer the compiler builtins in all cases.
Nikita Popov [Fri, 27 Nov 2020 09:02:00 +0000 (10:02 +0100)]
Fixed bug #80425
Rename the methods in MessageFormatAdapter to make sure they don't
clash with anything defined by icu itself, which may be a problem
if icu is linked statically.
Nikita Popov [Tue, 24 Nov 2020 11:23:03 +0000 (12:23 +0100)]
Fix bug #80402: Don't strip -lpthread
The current behavior was introduced 20 years ago in f9e375f493a1aeacbbcc8f2f00880d05b4ba7aaf as part of a larger change.
It's not clear to me why special treatment of -lpthread is necessary
here.
Alex Dowad [Sat, 14 Nov 2020 21:43:28 +0000 (23:43 +0200)]
Convert U+00AF (MACRON) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants
Except for vanilla Shift-JIS, where 0x7E is a halfwidth overline/macron.
As for Shift-JIS-2004, it has an added character (byte sequence 0x854A)
which was defined as a halfwidth macron in JIS X 0213:2000, so we use that.
Alex Dowad [Sat, 14 Nov 2020 21:07:17 +0000 (23:07 +0200)]
Convert U+FF5E (FULLWIDTH TILDE) to 0x8160 (WAVE DASH) in SJIS variants
By entering this character in the JIS X 0208 conversion table, we can
remove a bunch of explicit `if` clauses in different conversion filters.
It also means that U+FF5E can be converted into SJIS-mac now; I don't
know why this one SJIS variant rejected U+FF5E before, since 0x8160
means the same thing in SJIS-mac as the others.
Alex Dowad [Sat, 14 Nov 2020 19:15:11 +0000 (21:15 +0200)]
0x5C is not a Yen sign in CP932 (or CP51932)
When Microsoft created CP932 (their version of Shift-JIS), they explicitly
used bytes 0-0x7F to represent ASCII characters rather than JIS X 0201
characters.
So when converting Unicode to CP932, it is not correct to convert U+00A5
to CP932 0x5C. Fortunately, CP932 does have a multi-byte FULLWIDTH YEN SIGN
character which we can use instead.
CP51932 uses the same extended character set as CP932; while CP932 is
Microsoft's extended version of Shift-JIS, CP51932 is their extended version
of EUC-JP. So the same reasoning applies to CP51932.
Alex Dowad [Sat, 14 Nov 2020 18:47:31 +0000 (20:47 +0200)]
0x5C is not a backslash in Shift-JIS-2004
Shift-JIS-2004 is an extension of Shift-JIS, which uses 0x5C for the Yen
sign. Therefore, it is not correct to convert ASCII 0x5C (backslash) to
Shift-JIS-2004 0x5C (yen sign). JIS X 0208 does have a backslash, so we
can convert ASCII backslash to SJIS-2004 backslash instead.
From time immemorial, there has been confusion around the treatment
of 0x5C bytes on systems using legacy Japanese encodings. JIS X 0201
specified that 0x5C means a yen sign, and thus fonts on Japanese systems,
including early versions of Windows, displayed a 0x5C byte as a yen sign.
This meant that when ASCII text files were displayed on such systems,
what were meant to be backslashes would appear as yen signs. Japanese C
programmers could write character escapes using yen signs, and C compilers
built on the assumption that the input was ASCII would interpret these
escapes as desired. Likewise for shell scripts. Et cetera, et cetera...
Therefore, if the input to `mb_convert_encoding` is (for example) a C
program, and after converting to Shift-JIS-2004, the user wishes to feed
the output into a C compiler, *then* perhaps ASCII 0x5C should be mapped
to SJIS 0x5C. However, this scenario is ridiculous and will never happen.
A more realistic scenario might be: an article written in SJIS-2004 has
embedded Windows file paths (like 'C:\Program Files'), with yen signs used
as a path separator. If we convert SJIS-2004 0x5C to ASCII 0x5C, then the
path separators will be 'fixed' by the conversion.
For general written texts, it is much better to convert backslashes to...
backslashes. And yen signs, to yen signs.
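The intended mapping for the two troublesome code points, in this commit and in the CP932 one, might be sketched as follows. This is an illustration only, not mbstring's conversion code: the function and enum names are hypothetical, and the two-byte values 0x818F (fullwidth yen sign) and 0x815F (fullwidth backslash) are taken from the standard Shift-JIS layout rather than from the commits themselves.

```c
#include <stdint.h>

enum sjis_variant { SKETCH_CP932, SKETCH_SJIS_2004 };

/*
 * Returns the encoded form of U+005C or U+00A5 as a number: a value below
 * 0x100 is a single byte, a larger value is (lead byte << 8) | trail byte.
 * Returns 0 for any other code point; this sketch handles only these two.
 */
static uint16_t encode_yen_or_backslash(enum sjis_variant variant, uint32_t cp)
{
    if (variant == SKETCH_CP932) {
        /* Bytes 0x00-0x7F are ASCII in CP932, so 0x5C is a backslash. */
        if (cp == 0x005C) return 0x5C;
        if (cp == 0x00A5) return 0x818F; /* multi-byte fullwidth yen sign */
    } else {
        /* Single bytes follow JIS X 0201 here, so 0x5C is a yen sign. */
        if (cp == 0x00A5) return 0x5C;
        if (cp == 0x005C) return 0x815F; /* JIS X 0208 backslash */
    }
    return 0;
}
```

In both variants a backslash stays a backslash and a yen sign stays a yen sign; only the byte sequence that carries each one differs.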
Alex Dowad [Tue, 20 Oct 2020 05:47:20 +0000 (07:47 +0200)]
Fix mbstring support for SJIS-Mobile (DoCoMo, KDDI, and Softbank variants of Shift-JIS)
Lots of problems here.
- Don't pass 'control' characters through silently in the middle of a
multi-byte character.
- Treat it as an error if a multi-byte character is truncated.
- For ESC sequences used to encode emoji on earlier Softbank phones, if an
invalid ESC sequence is found, don't pass it through. Rather, handle it as
an error and respect `mb_substitute_character`.
- In ranges used by mobile vendors for emoji, if a certain byte sequence
doesn't map to any emoji, don't emit a mangled value (actually a raw
(ku*94)+ten value, which may not even be a valid Unicode codepoint at all).
- When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall
in the 2nd range of Microsoft vendor extensions.
Some vendor-specific emoji have been mapped to standard Unicode codepoints
now, rather than 'private use area' codepoints. When the legacy code was
written, these codepoints may not have existed yet in the Unicode standard
which was current at that time.
Also do a major code cleanup -- remove dead code, rearrange what is left,
use some new macros and helper functions to make the code clearer...
Alex Dowad [Tue, 13 Oct 2020 05:58:53 +0000 (07:58 +0200)]
Combine MBFL_ENCTYPE_MWC2{BE,LE} constants
These constants indicate that a text encoding uses 2+ bytes for each character,
and is either big endian or little endian (respectively). But nothing in
mbstring cares about the difference between MBFL_ENCTYPE_MWC2BE and
MBFL_ENCTYPE_MWC2LE.
(Actually, nothing cares about whether these flags are set at all...
maybe we should just remove them?)
Alex Dowad [Sun, 20 Sep 2020 14:29:32 +0000 (16:29 +0200)]
Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants
These flags identify text encodings in mbstring which use a constant number of
bytes per character. While some parts of the code do use these flags, usually
to detect cases which can be optimized due to constant-width encoding, nothing
cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian).
Alex Dowad [Wed, 7 Oct 2020 20:30:34 +0000 (22:30 +0200)]
Don't pass invalid JIS X 0212, JIS X 0213, and Windows-CP932 characters through
Similarly to JIS X 0208, mbstring would pass kuten codes which are not mapped
in the JIS X 0212, JIS X 0213, or CP932 character sets through silently when
converting to another Japanese encoding.
Alex Dowad [Wed, 7 Oct 2020 20:12:27 +0000 (22:12 +0200)]
Don't pass invalid JIS X 0208 characters through
Many Japanese encodings, such as JIS7/8, Shift JIS, ISO-2022-JP, EUC-JP, and
so on encode characters from the JIS X 0208 character set. JIS X 0208 is based
on the concept of a 94x94 table, with numbered rows and columns. However,
more than a thousand of the cells in that table are empty; JIS X 0208 does not
actually use all 94x94=8,836 possible kuten codes.
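For readers unfamiliar with kuten codes, here is a minimal sketch of how a (row, column) pair is turned into Shift JIS bytes using the usual arithmetic. The function name `kuten_to_sjis` is hypothetical and this is not mbstring's implementation. Note that the arithmetic alone cannot tell a mapped cell from an empty one; that requires a lookup in the character table.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Encode a JIS X 0208 kuten code (ku = row, ten = column, both 1..94)
 * as a two-byte Shift JIS sequence. Returns false if either index is
 * out of range. No check is made that the cell is actually assigned.
 */
static bool kuten_to_sjis(int ku, int ten, uint8_t out[2])
{
    if (ku < 1 || ku > 94 || ten < 1 || ten > 94) {
        return false;
    }

    /* Two rows share one lead byte; leads skip from 0x9F to 0xE0. */
    out[0] = (uint8_t)(((ku - 1) >> 1) + (ku <= 62 ? 0x81 : 0xC1));

    if (ku & 1) {
        /* Odd row: trail bytes 0x40..0x9E, skipping 0x7F. */
        out[1] = (uint8_t)(ten + 0x3F + (ten >= 64 ? 1 : 0));
    } else {
        /* Even row: trail bytes 0x9F..0xFC. */
        out[1] = (uint8_t)(ten + 0x9E);
    }
    return true;
}
```

For example, kuten 1-33 (WAVE DASH) comes out as 0x81 0x60, and kuten 1-1 as 0x81 0x40.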
mbstring had a dubious feature whereby, if a Japanese string contained one of
these 'unmapped' kuten codes, and it was being converted to another Japanese
encoding which was also based on JIS X 0208, the non-existent character would
be silently passed through, and the unmapped kuten code would be re-encoded
using the normal encoding method of the target text encoding.
Again, this _only_ happened if converting the text with the funky kuten code
to a Japanese encoding. If one tried converting it to Unicode, mbstring would
treat that as an error.
If somebody, somewhere, made their own private extension to JIS X 0208, and
used the regular Japanese encodings like Shift JIS and EUC-JP to encode this
private character set, then this feature might conceivably be useful. But how
likely is that? If someone is using Shift JIS, EUC-JP, ISO-2022-JP, etc. to
encode a funky version of JIS X 0208 with extra characters added, then that
should be treated as a separate text encoding.
The code which flags such characters with MBFL_WCSPLANE_JIS0208 is retained
solely for error reporting in `mbfl_filt_conv_illegal_output`.
Alex Dowad [Sun, 4 Oct 2020 20:29:34 +0000 (22:29 +0200)]
Enhance handling of CP932 text encoding
- Don't allow control characters to appear in the middle of a multi-byte
character. (This was a strange feature of mbstring; it doesn't make much
sense, and iconv doesn't allow it.)
- Treat truncated multi-byte characters as an error.
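The two rules above can be sketched as a small scanner. This is a simplified illustration, not the actual conversion filter: the function names are hypothetical, the lead and trail byte ranges are the generic Shift JIS ones, and non-lead bytes are accepted as single-byte characters without further validation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Generic Shift JIS lead byte ranges (an assumption of this sketch). */
static bool is_sjis_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Generic Shift JIS trail byte ranges; control characters are excluded. */
static bool is_sjis_trail(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

/* Count malformed two-byte sequences in a buffer. */
static size_t count_sjis_errors(const unsigned char *s, size_t len)
{
    size_t errors = 0;
    size_t i = 0;

    while (i < len) {
        if (!is_sjis_lead(s[i])) {
            i++; /* single-byte character */
            continue;
        }
        if (i + 1 == len) {
            errors++; /* truncated multi-byte character at end of input */
            break;
        }
        if (!is_sjis_trail(s[i + 1])) {
            errors++; /* e.g. a control character after a lead byte */
        }
        i += 2; /* this sketch consumes the bad trail byte with its lead */
    }
    return errors;
}
```

A real filter would report each error through the configured substitution mechanism rather than just counting.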
Nikita Popov [Wed, 25 Nov 2020 14:57:11 +0000 (15:57 +0100)]
Reindent ext/mysqli tests
Reindent ext/mysqli tests on PHP-7.4, so they match with the
indentation on PHP-8.0. Otherwise merging test changes across
branches is very unpleasant.
Nikita Popov [Wed, 25 Nov 2020 11:25:07 +0000 (12:25 +0100)]
Fix ref source management during unserialization
Only register the slot for adding ref sources later if we didn't
immediately register one. Also avoids leaking a ref source if
it is added early and the assignment fails.
Calvin Buckley [Tue, 24 Nov 2020 19:45:34 +0000 (15:45 -0400)]
sockets: Fix variable/macro name collision on AIX
The name "rem_size" is used by a macro in a system header on AIX,
specifically `sys/xmem.h`. Without changing the name, you get the
name mangled like so:
```
In file included from /usr/include/sys/uio.h:92:0,
from /QOpenSys/pkgs/lib/gcc/powerpc-ibm-aix6.1.0.0/6.3.0/include-fixed-7.1/sys/socket.h:83,
from /usr/include/sys/syslog.h:151,
from /usr/include/syslog.h:29,
from /home/calvin/rpmbuild/BUILD/php-8.0.0RC5/main/php_syslog.h:27,
from /home/calvin/rpmbuild/BUILD/php-8.0.0RC5/main/php.h:318,
from /home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:17:
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c: In function 'zif_socket_cmsg_space':
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:298:10: error: expected '=', ',', ';', 'asm' or '__attribute__' before '.' token
size_t rem_size = ZEND_LONG_MAX - entry->size;
^
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:298:10: error: expected expression before '.' token
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:299:18: error: 'u2' undeclared (first use in this function)
size_t n_max = rem_size / entry->var_el_size;
^
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:299:18: note: each undeclared identifier is reported only once for each function it appears in
```
...because of the declaration in `sys/xmem.h`:
```
```
This just renames the variable so that it won't trip on this
definition. Tested to fix the build on IBM i PASE.
Nikita Popov [Tue, 24 Nov 2020 14:52:41 +0000 (15:52 +0100)]
Fixed bug #80377
Make sure the $PHP_THREAD_SAFETY variable is always available
when configuring extensions. It was previously available for
phpized extensions, but for in-tree builds it was being set
too late.
Then, use $PHP_THREAD_SAFETY instead of $enable_zts to check for
ZTS in bundled extensions, which makes sure these checks also
work for phpize builds.
Nikita Popov [Tue, 24 Nov 2020 14:52:41 +0000 (15:52 +0100)]
Fixed bug #80377
Use $PHP_THREAD_SAFETY instead of $enable_zts to check for ZTS.
This variable is also available for phpize builds, while enable_zts
is only present for in-tree builds.