To illustrate the use of the library, let's start with a small but complete program that opens a file containing UTF-8 encoded text, reads it line by line, checks each line for invalid UTF-8 byte sequences, and converts it to UTF-16 encoding and back to UTF-8:
-```
+```cpp
#include <fstream>
#include <iostream>
#include <string>
Here is a function that checks whether the content of a file is valid UTF-8 encoded text without reading the content into the memory:
-```
+```cpp
bool valid_utf8_file(const char* file_name)
{
ifstream ifs(file_name);
Note that other functions that take input iterator arguments can be used in a similar way. For instance, to read the content of a UTF-8 encoded text file and convert the text to UTF-16, just do something like:
-```
+```cpp
utf8::utf8to16(it, eos, back_inserter(u16string));
```
If we have some text that "probably" contains UTF-8 encoded text and we want to replace any invalid UTF-8 sequence with a replacement character, something like the following function may be used:
-```
+```cpp
void fix_utf8_string(std::string& str)
{
std::string temp;
Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.
-```
+```cpp
template <typename octet_iterator>
octet_iterator append(uint32_t cp, octet_iterator result);
```
Example of use:
-```
+```cpp
unsigned char u[5] = {0,0,0,0,0};
unsigned char* end = append(0x0448, u);
assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0);
Given the iterator to the beginning of the UTF-8 sequence, it returns the code point and moves the iterator to the next position.
-```
+```cpp
template <typename octet_iterator>
uint32_t next(octet_iterator& it, octet_iterator end);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
int cp = next(w, twochars + 6);
Given the iterator to the beginning of the UTF-8 sequence, it returns the code point for the following sequence without changing the value of the iterator.
-```
+```cpp
template <typename octet_iterator>
uint32_t peek_next(octet_iterator it, octet_iterator end);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
int cp = peek_next(w, twochars + 6);
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bits representation of the code point.
-```
+```cpp
template <typename octet_iterator>
uint32_t prior(octet_iterator& it, octet_iterator start);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
unsigned char* w = twochars + 3;
int cp = prior (w, twochars);
Given a reference to an iterator pointing to an octet in a UTF-8 seqence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bits representation of the code point.
-```
+```cpp
template <typename octet_iterator>
uint32_t previous(octet_iterator& it, octet_iterator pass_start);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
unsigned char* w = twochars + 3;
int cp = previous (w, twochars - 1);
Advances an iterator by the specified number of code points within an UTF-8 sequence.
-```
+```cpp
template <typename octet_iterator, typename distance_type>
void advance (octet_iterator& it, distance_type n, octet_iterator end);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
unsigned char* w = twochars;
advance (w, 2, twochars + 6);
Given the iterators to two UTF-8 encoded code points in a seqence, returns the number of code points between them.
-```
+```cpp
template <typename octet_iterator>
typename std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::distance(twochars, twochars + 5);
assert (dist == 2);
Converts a UTF-16 encoded string to UTF-8.
-```
+```cpp
template <typename u16bit_iterator, typename octet_iterator>
octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
```
Example of use:
-```
+```cpp
unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
vector<unsigned char> utf8result;
utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
Converts an UTF-8 encoded string to UTF-16
-```
+```cpp
template <typename u16bit_iterator, typename octet_iterator>
u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
```
Example of use:
-```
+```cpp
char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
vector <unsigned short> utf16result;
utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result));
Converts a UTF-32 encoded string to UTF-8.
-```
+```cpp
template <typename octet_iterator, typename u32bit_iterator>
octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
```
Converts a UTF-8 encoded string to UTF-32.
-```
+```cpp
template <typename octet_iterator, typename u32bit_iterator>
u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
vector<int> utf32result;
utf8to32(twochars, twochars + 5, back_inserter(utf32result));
Detects an invalid sequence within a UTF-8 string.
-```
+```cpp
template <typename octet_iterator>
octet_iterator find_invalid(octet_iterator start, octet_iterator end);
```
Example of use:
-```
+```cpp
char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
char* invalid = find_invalid(utf_invalid, utf_invalid + 6);
assert (invalid == utf_invalid + 5);
Checks whether a sequence of octets is a valid UTF-8 string.
-```
+```cpp
template <typename octet_iterator>
bool is_valid(octet_iterator start, octet_iterator end);
```
Example of use:
-```
+```cpp
char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
bool bvalid = is_valid(utf_invalid, utf_invalid + 6);
assert (bvalid == false);
Replaces all invalid UTF-8 sequences within a string with a replacement marker.
-```
+```cpp
template <typename octet_iterator, typename output_iterator>
output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement);
template <typename octet_iterator, typename output_iterator>
Example of use:
-```
+```cpp
char invalid_sequence[] = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
vector<char> replace_invalid_result;
replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), '?');
Checks whether an octet sequence starts with a UTF-8 byte order mark (BOM)
-```
+```cpp
template <typename octet_iterator>
bool starts_with_bom (octet_iterator it, octet_iterator end);
```
Example of use:
-```
+```cpp
unsigned char byte_order_mark[] = {0xef, 0xbb, 0xbf};
bool bbom = starts_with_bom(byte_order_mark, byte_order_mark + sizeof(byte_order_mark));
assert (bbom == true);
Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM)
-```
+```cpp
template <typename octet_iterator>
bool is_bom (octet_iterator it); // Deprecated
```
Example of use:
-```
+```cpp
unsigned char byte_order_mark[] = {0xef, 0xbb, 0xbf};
bool bbom = is_bom(byte_order_mark);
assert (bbom == true);
Base class for the exceptions thrown by UTF CPP library functions.
-```
+```cpp
class exception : public std::exception {};
```
Example of use:
-```
+```cpp
try {
code_that_uses_utf_cpp_library();
}
Thrown by UTF8 CPP functions such as `advance` and `next` if an UTF-8 sequence represents and invalid code point.
-```
+```cpp
class invalid_code_point : public exception {
public:
uint32_t code_point() const;
Thrown by UTF8 CPP functions such as `next` and `prior` if an invalid UTF-8 sequence is detected during decoding.
-```
+```cpp
class invalid_utf8 : public exception {
public:
uint8_t utf8_octet() const;
Thrown by UTF8 CPP function `utf16to8` if an invalid UTF-16 sequence is detected during decoding.
-```
+```cpp
class invalid_utf16 : public exception {
public:
uint16_t utf16_word() const;
Thrown by UTF8 CPP functions such as `next` if the end of the decoded UTF-8 sequence was reached before the code point was decoded.
-```
+```cpp
class not_enough_room : public exception {};
```
Adapts the underlying octet iterator to iterate over the sequence of code points, rather than raw octets.
-```
+```cpp
template <typename octet_iterator>
class iterator;
```
Example of use:
-```
+```cpp
char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
utf8::iterator<char*> it(threechars, threechars, threechars + 9);
utf8::iterator<char*> it2 = it;
Note that `utf8::iterator` adapter is a checked iterator. It operates on the range specified in the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators require both iterator object to be constructed against the same range - otherwise an exception is thrown. Typically, the range will be determined by sequence container functions `begin` and `end`, i.e.:
-```
+```cpp
std::string s = "example";
utf8::iterator i (s.begin(), s.begin(), s.end());
```
Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.
-```
+```cpp
template <typename octet_iterator>
octet_iterator append(uint32_t cp, octet_iterator result);
```
Example of use:
-```
+```cpp
unsigned char u[5] = {0,0,0,0,0};
unsigned char* end = unchecked::append(0x0448, u);
assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0);
Given the iterator to the beginning of a UTF-8 sequence, it returns the code point and moves the iterator to the next position.
-```
+```cpp
template <typename octet_iterator>
uint32_t next(octet_iterator& it);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
int cp = unchecked::next(w);
Given the iterator to the beginning of a UTF-8 sequence, it returns the code point.
-```
+```cpp
template <typename octet_iterator>
uint32_t peek_next(octet_iterator it);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
int cp = unchecked::peek_next(w);
Given a reference to an iterator pointing to an octet in a UTF-8 seqence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bits representation of the code point.
-```
+```cpp
template <typename octet_iterator>
uint32_t prior(octet_iterator& it);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars + 3;
int cp = unchecked::prior (w);
Given a reference to an iterator pointing to an octet in a UTF-8 seqence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bits representation of the code point.
-```
+```cpp
template <typename octet_iterator>
uint32_t previous(octet_iterator& it);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars + 3;
int cp = unchecked::previous (w);
Advances an iterator by the specified number of code points within an UTF-8 sequence.
-```
+```cpp
template <typename octet_iterator, typename distance_type>
void advance (octet_iterator& it, distance_type n);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
unchecked::advance (w, 2);
Given the iterators to two UTF-8 encoded code points in a seqence, returns the number of code points between them.
-```
+```cpp
template <typename octet_iterator>
typename std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::unchecked::distance(twochars, twochars + 5);
assert (dist == 2);
Converts a UTF-16 encoded string to UTF-8.
-```
+```cpp
template <typename u16bit_iterator, typename octet_iterator>
octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
```
Example of use:
-```
+```cpp
unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
vector<unsigned char> utf8result;
unchecked::utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
Converts an UTF-8 encoded string to UTF-16
-```
+```cpp
template <typename u16bit_iterator, typename octet_iterator>
u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
```
Example of use:
-```
+```cpp
char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
vector <unsigned short> utf16result;
unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result));
Converts a UTF-32 encoded string to UTF-8.
-```
+```cpp
template <typename octet_iterator, typename u32bit_iterator>
octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
```
Example of use:
-```
+```cpp
int utf32string[] = {0x448, 0x65e5, 0x10346, 0};
vector<unsigned char> utf8result;
utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
Converts a UTF-8 encoded string to UTF-32.
-```
+```cpp
template <typename octet_iterator, typename u32bit_iterator>
u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
```
Example of use:
-```
+```cpp
char* twochars = "\xe6\x97\xa5\xd1\x88";
vector<int> utf32result;
unchecked::utf8to32(twochars, twochars + 5, back_inserter(utf32result));
Adapts the underlying octet iterator to iterate over the sequence of code points, rather than raw octets.
-```
+```cpp
template <typename octet_iterator>
class iterator;
```
Example of use:
-```
+```cpp
char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
utf8::unchecked::iterator<char*> un_it(threechars);
utf8::unchecked::iterator<char*> un_it2 = un_it;
1. [The Unicode Consortium](http://www.unicode.org/).
2. [ICU Library](http://icu.sourceforge.net/).
3. [UTF-8 at Wikipedia](http://en.wikipedia.org/wiki/UTF-8)
-4. [UTF-8 and Unicode FAQ for Unix/Linux](http://www.cl.cam.ac.uk/~mgk25/unicode.html)
\ No newline at end of file
+4. [UTF-8 and Unicode FAQ for Unix/Linux](http://www.cl.cam.ac.uk/~mgk25/unicode.html)