substr() sample case

author Andrei Zmievski <andrei@php.net>

Fri, 23 Sep 2005 21:24:31 +0000 (21:24 +0000)

committer Andrei Zmievski <andrei@php.net>

Fri, 23 Sep 2005 21:24:31 +0000 (21:24 +0000)
diff --git a/README.UNICODE-UPGRADES b/README.UNICODE-UPGRADES

index a66316339971ea660409eb16410cc9d5c568b4fd..8a637082c75dc4c85e5dc2b83d0252c67d10791e 100644
--- a/README.UNICODE-UPGRADES
+++ b/README.UNICODE-UPGRADES
@@ -262,6 +262,66 @@ Unicode strings:
  
  
  
+Upgrading Functions
+===================
+
+Let's take a look at a couple of functions that have been upgraded to
+support new string types.
+
+substr()
+--------
+
+This function returns part of a string based on offset and length
+parameters.
+
+       void *str;
+       int32_t str_len, cp_len;
+       zend_uchar str_type;
+
+       if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "tl|l", &str, &str_len, &str_type, &f, &l) == FAILURE) {
+               return;
+       }
+
+The first thing we notice is that the incoming string specifier is 't',
+which means that we can accept all 3 string types. The 'str' variable is
+declared as void*, because it can point to either UChar* or char*.
+The actual type of the incoming string is stored in the 'str_type' variable.
+
+       if (str_type == IS_UNICODE) {
+               cp_len = u_countChar32(str, str_len);
+       } else {
+               cp_len = str_len;
+       }
+
+If the string is a Unicode one, str_len counts UTF-16 code units, not
+characters (a surrogate pair takes two units), so we cannot rely on it.
+Instead, we call u_countChar32() to obtain the number of codepoints.
+
+The next several lines normalize start and length parameters to fit within the
+string. Nothing new here. Then we locate the appropriate segment.
+
+       if (str_type == IS_UNICODE) {
+               int32_t start = 0, end = 0;
+               U16_FWD_N((UChar*)str, end, str_len, f);
+               start = end;
+               U16_FWD_N((UChar*)str, end, str_len, l);
+               RETURN_UNICODEL((UChar*)str + start, end-start, 1);
+
+Since codepoint (character) #n is not necessarily at offset #n in Unicode
+strings, we start at the beginning and iterate forward until we have gone
+through the required number of codepoints to reach the start of the segment.
+Then we save the location in 'start' and continue iterating through the number
+of codepoints specified by the length. Once that's done, we can return the
+segment as a Unicode string.
+
+       } else {
+               RETURN_STRINGL((char*)str + f, l, 1);
+       }
+
+For native and binary types, we can return the segment directly.
+
+
+
  References
  ==========