From: Markus Scherer RuleBasedTransliterator
is a transliterator
- * that reads a set of rules in order to determine how to perform
- * translations. Rule sets are stored in resource bundles indexed by
- * name. Rules within a rule set are separated by semicolons (';').
- * To include a literal semicolon, prefix it with a backslash ('\').
- * Whitespace, as defined by Character.isWhitespace()
,
- * is ignored. If the first non-blank character on a line is '#',
- * the entire line is ignored as a comment.
Each set of rules consists of two groups, one forward, and one - * reverse. This is a convention that is not enforced; rules for one - * direction may be omitted, with the result that translations in - * that direction will not modify the source text. In addition, - * bidirectional forward-reverse rules may be specified for - * symmetrical transformations.
- * - *Rule syntax
- * - *Rule statements take one of the following forms:
- * - *$alefmadda=\u0622;
$alefmadda
", will be replaced by
- * the Unicode character U+0622. Variable names must begin
- * with a letter and consist only of letters, digits, and
- * underscores. Case is significant. Duplicate names cause
- * an exception to be thrown, that is, variables cannot be
- * redefined. The right hand side may contain well-formed
- * text of any length, including no text at all ("$empty=;
").
- * The right hand side may contain embedded UnicodeSet
- * patterns, for example, "$softvowel=[eiyEIY]
".ai>$alefmadda;
ai<$alefmadda;
ai<>$alefmadda;
Translation rules consist of a match pattern and an output
- * string. The match pattern consists of literal characters,
- * optionally preceded by context, and optionally followed by
- * context. Context characters, like literal pattern characters,
- * must be matched in the text being transliterated. However, unlike
- * literal pattern characters, they are not replaced by the output
- * text. For example, the pattern "abc{def}
"
- * indicates the characters "def
" must be
- * preceded by "abc
" for a successful match.
- * If there is a successful match, "def
" will
- * be replaced, but not "abc
". The final '}
'
- * is optional, so "abc{def
" is equivalent to
- * "abc{def}
". Another example is "{123}456
"
- * (or "123}456
") in which the literal
- * pattern "123
" must be followed by "456
".
- *
The output string of a forward or reverse rule consists of
- * characters to replace the literal pattern characters. If the
- * output string contains the character '|
', this is
- * taken to indicate the location of the cursor after
- * replacement. The cursor is the point in the text at which the
- * next replacement, if any, will be applied. The cursor is usually
- * placed within the replacement text; however, it can actually be
- * placed into the precending or following context by using the
- * special character '@
'. Examples:
- *- * - *- *
a {foo} z > | @ bar; # foo -> bar, move cursor - * before a
- * {foo} xyz > bar @@|; # foo -> bar, cursor between - * y and z
UnicodeSet
- * - *UnicodeSet
patterns may appear anywhere that
- * makes sense. They may appear in variable definitions.
- * Contrariwise, UnicodeSet
patterns may themselves
- * contain variable references, such as "$a=[a-z];$not_a=[^$a]
",
- * or "$range=a-z;$ll=[$range]
".
UnicodeSet
patterns may also be embedded directly
- * into rule strings. Thus, the following two rules are equivalent:
- *- * - *- *
$vowel=[aeiou]; $vowel>'*'; # One way to do this
- * [aeiou]>'*'; - * # - * Another way
See {@link UnicodeSet} for more documentation and examples.
- * - *Segments
- * - *Segments of the input string can be matched and copied to the - * output string. This makes certain sets of rules simpler and more - * general, and makes reordering possible. For example:
- * - *- *- * - *- *
([a-z]) > $1 $1; - * # - * double lowercase letters
- * ([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs
The segment of the input string to be copied is delimited by
- * "(
" and ")
". Up to
- * nine segments may be defined. Segments may not overlap. In the
- * output string, "$1
" through "$9
"
- * represent the input string segments, in left-to-right order of
- * definition.
Anchors
- * - *Patterns can be anchored to the beginning or the end of the text. This is done with the
- * special characters '^
' and '$
'. For example:
- *- * - *- *
^ a > 'BEG_A'; # match 'a' at start of text
- * a > 'A'; # match other instances - * of 'a'
- * z $ > 'END_Z'; # match 'z' at end of text
- * z > 'Z'; # match other instances - * of 'z'
It is also possible to match the beginning or the end of the text using a UnicodeSet
.
- * This is done by including a virtual anchor character '$
' at the end of the
- * set pattern. Although this is usually the match chafacter for the end anchor, the set will
- * match either the beginning or the end of the text, depending on its placement. For
- * example:
- *- * - *- *
$x = [a-z$]; # match 'a' through 'z' OR anchor
- * $x 1 > 2; # match '1' after a-z or at the start
- * 3 $x > 4; # match '3' before a-z or at the end
Example
- * - *The following example rules illustrate many of the features of - * the rule language.
- * - *Rule 1. | - *abc{def}>x|y |
- *
Rule 2. | - *xyz>r |
- *
Rule 3. | - *yz>q |
- *
Applying these rules to the string "adefabcdefz
"
- * yields the following results:
|adefabcdefz |
- * Initial state, no rules match. Advance - * cursor. | - *
a|defabcdefz |
- * Still no match. Rule 1 does not match - * because the preceding context is not present. | - *
ad|efabcdefz |
- * Still no match. Keep advancing until - * there is a match... | - *
ade|fabcdefz |
- * ... | - *
adef|abcdefz |
- * ... | - *
adefa|bcdefz |
- * ... | - *
adefab|cdefz |
- * ... | - *
adefabc|defz |
- * Rule 1 matches; replace "def "
- * with "xy " and back up the cursor
- * to before the 'y '. |
- *
adefabcx|yz |
- * Although "xyz " is
- * present, rule 2 does not match because the cursor is
- * before the 'y ', not before the 'x '.
- * Rule 3 does match. Replace "yz "
- * with "q ". |
- *
adefabcxq| |
- * The cursor is at the end; - * transliteration is complete. | - *
The order of rules is significant. If multiple rules may match - * at some point, the first matching rule is applied.
- * - *Forward and reverse rules may have an empty output string. - * Otherwise, an empty left or right hand side of any statement is a - * syntax error.
- * - *Single quotes are used to quote any character other than a
- * digit or letter. To specify a single quote itself, inside or
- * outside of quotes, use two single quotes in a row. For example,
- * the rule "'>'>o''clock
" changes the
- * string ">
" to the string "o'clock
".
- *
Notes
- * - *While a RuleBasedTransliterator is being built, it checks that - * the rules are added in proper order. For example, if the rule - * "a>x" is followed by the rule "ab>y", - * then the second rule will throw an exception. The reason is that - * the second rule can never be triggered, since the first rule - * always matches anything it matches. In other words, the first - * rule masks the second rule.
- * + * built from a set of rules as defined for + * Transliterator::createFromRules(). + * See the C++ class Transliterator documentation for the rule syntax. + * * @author Alan Liu * @internal Use transliterator factory methods instead since this class will be removed in that release. */ diff --git a/icu4c/source/i18n/unicode/translit.h b/icu4c/source/i18n/unicode/translit.h index ebb9575a9f5..6b4888145f1 100644 --- a/icu4c/source/i18n/unicode/translit.h +++ b/icu4c/source/i18n/unicode/translit.h @@ -15,10 +15,10 @@ #include "unicode/utypes.h" /** - * \file + * \file * \brief C++ API: Tranforms text from one format to another. */ - + #if !UCONFIG_NO_TRANSLITERATION #include "unicode/uobject.h" @@ -31,7 +31,6 @@ U_NAMESPACE_BEGIN class UnicodeFilter; class UnicodeSet; -class CompoundTransliterator; class TransliteratorParser; class NormalizationTransliterator; class TransliteratorIDParser; @@ -97,18 +96,20 @@ class TransliteratorIDParser; * contents of the buffer may show text being modified as each new * character arrives. * - *Consider the simple `RuleBasedTransliterator`: - * + *
Consider the simple rule-based Transliterator: + *
* th>{theta} * t>{tau} + ** * When the user types 't', nothing will happen, since the * transliterator is waiting to see if the next character is 'h'. To * remedy this, we introduce the notion of a cursor, marked by a '|' * in the output string: - * + *
* t>|{tau} * {tau}h>{theta} + ** * Now when the user types 't', tau appears, and if the next character * is 'h', the tau changes to a theta. This is accomplished by @@ -130,7 +131,7 @@ class TransliteratorIDParser; * which the transliterator last stopped, either because it reached * the end, or because it required more characters to disambiguate * between possible inputs. The
CURSOR
can also be
- * explicitly set by rules in a RuleBasedTransliterator
.
+ * explicitly set by rules in a rule-based Transliterator.
* Any characters before the CURSOR
index are frozen;
* future keyboard transliteration calls within this input sequence
* will not change them. New text is inserted at the
@@ -232,6 +233,255 @@ class TransliteratorIDParser;
* if the performance of these methods can be improved over the
* performance obtained by the default implementations in this class.
*
+ * Rule syntax + * + *
A set of rules determines how to perform translations. + * Rules within a rule set are separated by semicolons (';'). + * To include a literal semicolon, prefix it with a backslash ('\'). + * Unicode Pattern_White_Space is ignored. + * If the first non-blank character on a line is '#', + * the entire line is ignored as a comment. + * + *
Each set of rules consists of two groups, one forward, and one + * reverse. This is a convention that is not enforced; rules for one + * direction may be omitted, with the result that translations in + * that direction will not modify the source text. In addition, + * bidirectional forward-reverse rules may be specified for + * symmetrical transformations. + * + *
Note: Another description of the Transliterator rule syntax is available in + * section + * Transform Rules Syntax of UTS #35: Unicode LDML. + * The rules are shown there using arrow symbols ← and → and ↔. + * ICU supports both those and the equivalent ASCII symbols < and > and <>. + * + *
Rule statements take one of the following forms: + * + *
$alefmadda=\\u0622;
$alefmadda
", will be replaced by
+ * the Unicode character U+0622. Variable names must begin
+ * with a letter and consist only of letters, digits, and
+ * underscores. Case is significant. Duplicate names cause
+ * an exception to be thrown, that is, variables cannot be
+ * redefined. The right hand side may contain well-formed
+ * text of any length, including no text at all ("$empty=;
").
+ * The right hand side may contain embedded UnicodeSet
+ * patterns, for example, "$softvowel=[eiyEIY]
".ai>$alefmadda;
ai<$alefmadda;
ai<>$alefmadda;
Translation rules consist of a match pattern and an output
+ * string. The match pattern consists of literal characters,
+ * optionally preceded by context, and optionally followed by
+ * context. Context characters, like literal pattern characters,
+ * must be matched in the text being transliterated. However, unlike
+ * literal pattern characters, they are not replaced by the output
+ * text. For example, the pattern "abc{def}
"
+ * indicates the characters "def
" must be
+ * preceded by "abc
" for a successful match.
+ * If there is a successful match, "def
" will
+ * be replaced, but not "abc
". The final '}
'
+ * is optional, so "abc{def
" is equivalent to
+ * "abc{def}
". Another example is "{123}456
"
+ * (or "123}456
") in which the literal
+ * pattern "123
" must be followed by "456
".
+ *
+ *
The output string of a forward or reverse rule consists of
+ * characters to replace the literal pattern characters. If the
+ * output string contains the character '|
', this is
+ * taken to indicate the location of the cursor after
+ * replacement. The cursor is the point in the text at which the
+ * next replacement, if any, will be applied. The cursor is usually
+ * placed within the replacement text; however, it can actually be
+ * placed into the preceding or following context by using the
+ * special character '@'. Examples:
+ *
+ *
+ * a {foo} z > | @ bar; # foo -> bar, move cursor before a + * {foo} xyz > bar @@|; # foo -> bar, cursor between y and z + *+ * + *
UnicodeSet + * + *
UnicodeSet
patterns may appear anywhere that
+ * makes sense. They may appear in variable definitions.
+ * Contrariwise, UnicodeSet
patterns may themselves
+ * contain variable references, such as "$a=[a-z];$not_a=[^$a]
",
+ * or "$range=a-z;$ll=[$range]
".
+ *
+ *
UnicodeSet
patterns may also be embedded directly
+ * into rule strings. Thus, the following two rules are equivalent:
+ *
+ *
+ * $vowel=[aeiou]; $vowel>'*'; # One way to do this + * [aeiou]>'*'; # Another way + *+ * + *
See {@link UnicodeSet} for more documentation and examples. + * + *
Segments + * + *
Segments of the input string can be matched and copied to the + * output string. This makes certain sets of rules simpler and more + * general, and makes reordering possible. For example: + * + *
+ * ([a-z]) > $1 $1; # double lowercase letters + * ([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs + *+ * + *
The segment of the input string to be copied is delimited by
+ * "(
" and ")
". Up to
+ * nine segments may be defined. Segments may not overlap. In the
+ * output string, "$1
" through "$9
"
+ * represent the input string segments, in left-to-right order of
+ * definition.
+ *
+ *
Anchors + * + *
Patterns can be anchored to the beginning or the end of the text. This is done with the
+ * special characters '^
' and '$
'. For example:
+ *
+ *
+ * ^ a > 'BEG_A'; # match 'a' at start of text + * a > 'A'; # match other instances of 'a' + * z $ > 'END_Z'; # match 'z' at end of text + * z > 'Z'; # match other instances of 'z' + *+ * + *
It is also possible to match the beginning or the end of the text using a UnicodeSet
.
+ * This is done by including a virtual anchor character '$
' at the end of the
+ * set pattern. Although this is usually the match character for the end anchor, the set will
+ * match either the beginning or the end of the text, depending on its placement. For
+ * example:
+ *
+ *
+ * $x = [a-z$]; # match 'a' through 'z' OR anchor + * $x 1 > 2; # match '1' after a-z or at the start + * 3 $x > 4; # match '3' before a-z or at the end + *+ * + *
Example + * + *
The following example rules illustrate many of the features of + * the rule language. + * + *
Rule 1. | + *abc{def}>x|y |
+ *
Rule 2. | + *xyz>r |
+ *
Rule 3. | + *yz>q |
+ *
Applying these rules to the string "adefabcdefz
"
+ * yields the following results:
+ *
+ *
|adefabcdefz |
+ * Initial state, no rules match. Advance + * cursor. | + *
a|defabcdefz |
+ * Still no match. Rule 1 does not match + * because the preceding context is not present. | + *
ad|efabcdefz |
+ * Still no match. Keep advancing until + * there is a match... | + *
ade|fabcdefz |
+ * ... | + *
adef|abcdefz |
+ * ... | + *
adefa|bcdefz |
+ * ... | + *
adefab|cdefz |
+ * ... | + *
adefabc|defz |
+ * Rule 1 matches; replace "def "
+ * with "xy " and back up the cursor
+ * to before the 'y '. |
+ *
adefabcx|yz |
+ * Although "xyz " is
+ * present, rule 2 does not match because the cursor is
+ * before the 'y ', not before the 'x '.
+ * Rule 3 does match. Replace "yz "
+ * with "q ". |
+ *
adefabcxq| |
+ * The cursor is at the end; + * transliteration is complete. | + *
The order of rules is significant. If multiple rules may match + * at some point, the first matching rule is applied. + * + *
Forward and reverse rules may have an empty output string. + * Otherwise, an empty left or right hand side of any statement is a + * syntax error. + * + *
Single quotes are used to quote any character other than a
+ * digit or letter. To specify a single quote itself, inside or
+ * outside of quotes, use two single quotes in a row. For example,
+ * the rule "'>'>o''clock
" changes the
+ * string ">
" to the string "o'clock
".
+ *
+ *
Notes + * + *
While a Transliterator is being built from rules, it checks that
+ * the rules are added in proper order. For example, if the rule
+ * "a>x" is followed by the rule "ab>y",
+ * then the second rule will throw an exception. The reason is that
+ * the second rule can never be triggered, since the first rule
+ * always matches anything it matches. In other words, the first
+ * rule masks the second rule.
+ *
* @author Alan Liu
* @stable ICU 2.0
*/
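The cursor walkthrough in the rule syntax documentation above (rules `abc{def}>x|y`, `xyz>r`, `yz>q` applied to "adefabcdefz") can be reproduced with a small stand-alone model. This is a minimal sketch for illustration only, not ICU's implementation: `ToyRule`, `applyToyRules`, and `exampleRules` are hypothetical names, contexts and patterns are plain literal strings, and features such as `@`, UnicodeSet patterns, segments, anchors, and quoting are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model of one rule "before{key}after > output".
// Not an ICU type; all fields are plain literal strings.
struct ToyRule {
    std::string before;        // preceding context: matched, never replaced
    std::string key;           // literal pattern: matched and replaced
    std::string after;         // following context: matched, never replaced
    std::string output;        // replacement text with the '|' removed
    std::size_t cursorOffset;  // position of '|' inside output
};

// True if s occurs in text starting exactly at pos.
static bool startsAt(const std::string& text, std::size_t pos, const std::string& s) {
    return pos <= text.size() && text.compare(pos, s.size(), s) == 0;
}

// A rule matches when its key starts at the cursor and both contexts surround it.
static bool matches(const ToyRule& r, const std::string& text, std::size_t cursor) {
    return cursor >= r.before.size()
        && startsAt(text, cursor - r.before.size(), r.before)
        && startsAt(text, cursor, r.key)
        && startsAt(text, cursor + r.key.size(), r.after);
}

// Apply the first matching rule at the cursor; if none matches, advance by one.
// Note: a rule set that keeps re-matching its own output would loop forever here.
std::string applyToyRules(std::string text, const std::vector<ToyRule>& rules) {
    std::size_t cursor = 0;
    while (cursor < text.size()) {
        bool applied = false;
        for (const ToyRule& r : rules) {      // rule order is significant
            if (matches(r, text, cursor)) {
                text.replace(cursor, r.key.size(), r.output);
                cursor += r.cursorOffset;     // place the cursor where '|' was
                applied = true;
                break;
            }
        }
        if (!applied) ++cursor;
    }
    return text;
}

// The three rules from the documented example.
std::vector<ToyRule> exampleRules() {
    return {
        {"abc", "def", "", "xy", 1},  // abc{def} > x|y
        {"",    "xyz", "", "r",  1},  // xyz > r
        {"",    "yz",  "", "q",  1},  // yz > q
    };
}
```

Calling `applyToyRules("adefabcdefz", exampleRules())` returns "adefabcxq", the final state in the walkthrough. Rule 2 never fires because the cursor is left before the 'y' rather than before the 'x', so only rule 3 can match at that position.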
@@ -627,7 +877,7 @@ public:
/**
* Transliterate a substring of text, as specified by index, taking filters
* into account. This method is for subclasses that need to delegate to
- * another transliterator, such as CompoundTransliterator.
+ * another transliterator.
* @param text the text to be transliterated
* @param index the position indices
* @param incremental if TRUE, then assume more characters may be inserted
@@ -841,17 +1091,19 @@ public:
/**
* Returns a Transliterator
object constructed from
- * the given rule string. This will be a RuleBasedTransliterator,
+ * the given rule string. This will be a rule-based Transliterator,
* if the rule string contains only rules, or a
- * CompoundTransliterator, if it contains ID blocks, or a
- * NullTransliterator, if it contains ID blocks which parse as
+ * compound Transliterator, if it contains ID blocks, or a
+ * null Transliterator, if it contains ID blocks which parse as
* empty for the given direction.
+ *
* @param ID the id for the transliterator.
* @param rules rules, separated by ';'
* @param dir either FORWARD or REVERSE.
- * @param parseError Struct to recieve information on position
+ * @param parseError Struct to receive information on position
* of error if an error is encountered
* @param status Output param set to success/failure code.
+ * @return a newly created Transliterator
* @stable ICU 2.0
*/
static Transliterator* U_EXPORT2 createFromRules(const UnicodeString& ID,
diff --git a/icu4c/source/test/intltest/cpdtrtst.h b/icu4c/source/test/intltest/cpdtrtst.h
index e723619ad36..1733f1a6e42 100644
--- a/icu4c/source/test/intltest/cpdtrtst.h
+++ b/icu4c/source/test/intltest/cpdtrtst.h
@@ -20,6 +20,7 @@
#if !UCONFIG_NO_TRANSLITERATION
#include "unicode/translit.h"
+#include "cpdtrans.h"
#include "intltest.h"
/**
diff --git a/icu4j/main/classes/translit/src/com/ibm/icu/text/RuleBasedTransliterator.java b/icu4j/main/classes/translit/src/com/ibm/icu/text/RuleBasedTransliterator.java
index 97a51fdd2f2..be3beb6fdbd 100644
--- a/icu4j/main/classes/translit/src/com/ibm/icu/text/RuleBasedTransliterator.java
+++ b/icu4j/main/classes/translit/src/com/ibm/icu/text/RuleBasedTransliterator.java
@@ -13,259 +13,9 @@ import java.util.Map;
/**
* RuleBasedTransliterator
is a transliterator
- * that reads a set of rules in order to determine how to perform
- * translations. Rule sets are stored in resource bundles indexed by
- * name. Rules within a rule set are separated by semicolons (';').
- * To include a literal semicolon, prefix it with a backslash ('\').
- * Unicode Pattern_White_Space is ignored.
- * If the first non-blank character on a line is '#',
- * the entire line is ignored as a comment.
- *
- *
Each set of rules consists of two groups, one forward, and one - * reverse. This is a convention that is not enforced; rules for one - * direction may be omitted, with the result that translations in - * that direction will not modify the source text. In addition, - * bidirectional forward-reverse rules may be specified for - * symmetrical transformations. - * - *
Rule syntax - * - *
Rule statements take one of the following forms: - * - *
$alefmadda=\u0622;
$alefmadda
", will be replaced by
- * the Unicode character U+0622. Variable names must begin
- * with a letter and consist only of letters, digits, and
- * underscores. Case is significant. Duplicate names cause
- * an exception to be thrown, that is, variables cannot be
- * redefined. The right hand side may contain well-formed
- * text of any length, including no text at all ("$empty=;
").
- * The right hand side may contain embedded UnicodeSet
- * patterns, for example, "$softvowel=[eiyEIY]
".ai>$alefmadda;
ai<$alefmadda;
ai<>$alefmadda;
Translation rules consist of a match pattern and an output
- * string. The match pattern consists of literal characters,
- * optionally preceded by context, and optionally followed by
- * context. Context characters, like literal pattern characters,
- * must be matched in the text being transliterated. However, unlike
- * literal pattern characters, they are not replaced by the output
- * text. For example, the pattern "abc{def}
"
- * indicates the characters "def
" must be
- * preceded by "abc
" for a successful match.
- * If there is a successful match, "def
" will
- * be replaced, but not "abc
". The final '}
'
- * is optional, so "abc{def
" is equivalent to
- * "abc{def}
". Another example is "{123}456
"
- * (or "123}456
") in which the literal
- * pattern "123
" must be followed by "456
".
- *
- *
The output string of a forward or reverse rule consists of
- * characters to replace the literal pattern characters. If the
- * output string contains the character '|
', this is
- * taken to indicate the location of the cursor after
- * replacement. The cursor is the point in the text at which the
- * next replacement, if any, will be applied. The cursor is usually
- * placed within the replacement text; however, it can actually be
- * placed into the precending or following context by using the
- * special character '@
'. Examples:
- *
- *
- *- * - *
a {foo} z > | @ bar; # foo -> bar, move cursor - * before a
- *
- * {foo} xyz > bar @@|; # foo -> bar, cursor between - * y and z
UnicodeSet - * - *
UnicodeSet
patterns may appear anywhere that
- * makes sense. They may appear in variable definitions.
- * Contrariwise, UnicodeSet
patterns may themselves
- * contain variable references, such as "$a=[a-z];$not_a=[^$a]
",
- * or "$range=a-z;$ll=[$range]
".
- *
- *
UnicodeSet
patterns may also be embedded directly
- * into rule strings. Thus, the following two rules are equivalent:
- *
- *
- *- * - *
$vowel=[aeiou]; $vowel>'*'; # One way to do this
- *
- * [aeiou]>'*'; - * # - * Another way
See {@link UnicodeSet} for more documentation and examples. - * - *
Segments - * - *
Segments of the input string can be matched and copied to the - * output string. This makes certain sets of rules simpler and more - * general, and makes reordering possible. For example: - * - *
- *- * - *
([a-z]) > $1 $1; - * # - * double lowercase letters
- *
- * ([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs
The segment of the input string to be copied is delimited by
- * "(
" and ")
". Up to
- * nine segments may be defined. Segments may not overlap. In the
- * output string, "$1
" through "$9
"
- * represent the input string segments, in left-to-right order of
- * definition.
- *
- *
Anchors - * - *
Patterns can be anchored to the beginning or the end of the text. This is done with the
- * special characters '^
' and '$
'. For example:
- *
- *
- *- * - *
^ a > 'BEG_A'; # match 'a' at start of text
- *
- * a > 'A'; # match other instances - * of 'a'
- * z $ > 'END_Z'; # match 'z' at end of text
- * z > 'Z'; # match other instances - * of 'z'
It is also possible to match the beginning or the end of the text using a UnicodeSet
.
- * This is done by including a virtual anchor character '$
' at the end of the
- * set pattern. Although this is usually the match chafacter for the end anchor, the set will
- * match either the beginning or the end of the text, depending on its placement. For
- * example:
- *
- *
- *- * - *
$x = [a-z$]; # match 'a' through 'z' OR anchor
- *
- * $x 1 > 2; # match '1' after a-z or at the start
- * 3 $x > 4; # match '3' before a-z or at the end
Example - * - *
The following example rules illustrate many of the features of - * the rule language. - * - *
Rule 1. | - *abc{def}>x|y |
- *
Rule 2. | - *xyz>r |
- *
Rule 3. | - *yz>q |
- *
Applying these rules to the string "adefabcdefz
"
- * yields the following results:
- *
- *
|adefabcdefz |
- * Initial state, no rules match. Advance - * cursor. | - *
a|defabcdefz |
- * Still no match. Rule 1 does not match - * because the preceding context is not present. | - *
ad|efabcdefz |
- * Still no match. Keep advancing until - * there is a match... | - *
ade|fabcdefz |
- * ... | - *
adef|abcdefz |
- * ... | - *
adefa|bcdefz |
- * ... | - *
adefab|cdefz |
- * ... | - *
adefabc|defz |
- * Rule 1 matches; replace "def "
- * with "xy " and back up the cursor
- * to before the 'y '. |
- *
adefabcx|yz |
- * Although "xyz " is
- * present, rule 2 does not match because the cursor is
- * before the 'y ', not before the 'x '.
- * Rule 3 does match. Replace "yz "
- * with "q ". |
- *
adefabcxq| |
- * The cursor is at the end; - * transliteration is complete. | - *
The order of rules is significant. If multiple rules may match - * at some point, the first matching rule is applied. - * - *
Forward and reverse rules may have an empty output string. - * Otherwise, an empty left or right hand side of any statement is a - * syntax error. - * - *
Single quotes are used to quote any character other than a
- * digit or letter. To specify a single quote itself, inside or
- * outside of quotes, use two single quotes in a row. For example,
- * the rule "'>'>o''clock
" changes the
- * string ">
" to the string "o'clock
".
- *
- *
Notes - * - *
While a RuleBasedTransliterator is being built, it checks that
- * the rules are added in proper order. For example, if the rule
- * "a>x" is followed by the rule "ab>y",
- * then the second rule will throw an exception. The reason is that
- * the second rule can never be triggered, since the first rule
- * always matches anything it matches. In other words, the first
- * rule masks the second rule.
+ * built from a set of rules as defined for
+ * {@link Transliterator#createFromRules(String, String, int)}.
+ * See the class {@link Transliterator} documentation for the rule syntax.
*
* @author Alan Liu
* @internal
@@ -369,7 +119,7 @@ public class RuleBasedTransliterator extends Transliterator {
static class Data {
public Data() {
- variableNames = new HashMap
- * Consider the simple Rule syntax
+ *
+ * A set of rules determines how to perform translations.
+ * Rules within a rule set are separated by semicolons (';').
+ * To include a literal semicolon, prefix it with a backslash ('\').
+ * Unicode Pattern_White_Space is ignored.
+ * If the first non-blank character on a line is '#',
+ * the entire line is ignored as a comment.
+ *
+ * Each set of rules consists of two groups, one forward, and one
+ * reverse. This is a convention that is not enforced; rules for one
+ * direction may be omitted, with the result that translations in
+ * that direction will not modify the source text. In addition,
+ * bidirectional forward-reverse rules may be specified for
+ * symmetrical transformations.
+ *
+ * Note: Another description of the Transliterator rule syntax is available in
+ * section
+ * Transform Rules Syntax of UTS #35: Unicode LDML.
+ * The rules are shown there using arrow symbols ← and → and ↔.
+ * ICU supports both those and the equivalent ASCII symbols < and > and <>.
+ *
+ * Rule statements take one of the following forms:
+ *
+ * Translation rules consist of a match pattern and an output
+ * string. The match pattern consists of literal characters,
+ * optionally preceded by context, and optionally followed by
+ * context. Context characters, like literal pattern characters,
+ * must be matched in the text being transliterated. However, unlike
+ * literal pattern characters, they are not replaced by the output
+ * text. For example, the pattern " The output string of a forward or reverse rule consists of
+ * characters to replace the literal pattern characters. If the
+ * output string contains the character ' UnicodeSet
+ *
+ * See {@link UnicodeSet} for more documentation and examples.
+ *
+ * Segments
+ *
+ * Segments of the input string can be matched and copied to the
+ * output string. This makes certain sets of rules simpler and more
+ * general, and makes reordering possible. For example:
+ *
+ * The segment of the input string to be copied is delimited by
+ * " Anchors
+ *
+ * Patterns can be anchored to the beginning or the end of the text. This is done with the
+ * special characters ' It is also possible to match the beginning or the end of the text using a Example
+ *
+ * The following example rules illustrate many of the features of
+ * the rule language.
+ *
+ * Applying these rules to the string " The order of rules is significant. If multiple rules may match
+ * at some point, the first matching rule is applied.
+ *
+ * Forward and reverse rules may have an empty output string.
+ * Otherwise, an empty left or right hand side of any statement is a
+ * syntax error.
+ *
+ * Single quotes are used to quote any character other than a
+ * digit or letter. To specify a single quote itself, inside or
+ * outside of quotes, use two single quotes in a row. For example,
+ * the rule " Notes
+ *
+ * While a Transliterator is being built from rules, it checks that
+ * the rules are added in proper order. For example, if the rule
+ * "a>x" is followed by the rule "ab>y",
+ * then the second rule will throw an exception. The reason is that
+ * the second rule can never be triggered, since the first rule
+ * always matches anything it matches. In other words, the first
+ * rule masks the second rule.
+ *
* @author Alan Liu
* @stable ICU 2.0
*/
public abstract class Transliterator implements StringTransform {
/**
* Direction constant indicating the forward direction in a transliterator,
- * e.g., the forward rules of a RuleBasedTransliterator. An "A-B"
+ * e.g., the forward rules of a rule-based Transliterator. An "A-B"
* transliterator transliterates A to B when operating in the forward
* direction, and B to A when operating in the reverse direction.
* @stable ICU 2.0
@@ -237,7 +486,7 @@ public abstract class Transliterator implements StringTransform {
/**
* Direction constant indicating the reverse direction in a transliterator,
- * e.g., the reverse rules of a RuleBasedTransliterator. An "A-B"
+ * e.g., the reverse rules of a rule-based Transliterator. An "A-B"
* transliterator transliterates A to B when operating in the forward
* direction, and B to A when operating in the reverse direction.
* @stable ICU 2.0
@@ -1102,7 +1351,7 @@ public abstract class Transliterator implements StringTransform {
/**
* Transliterate a substring of text, as specified by index, taking filters
* into account. This method is for subclasses that need to delegate to
- * another transliterator, such as CompoundTransliterator.
+ * another transliterator.
* @param text the text to be transliterated
* @param index the position indices
* @param incremental if TRUE, then assume more characters may be inserted
@@ -1400,11 +1649,17 @@ public abstract class Transliterator implements StringTransform {
/**
* Returns a RuleBasedTransliterator
:
+ * Consider the simple rule-based Transliterator:
*
*
* th>{theta}
@@ -110,8 +110,8 @@ import com.ibm.icu.util.UResourceBundle;
* that the transliterator will look at. It is advanced as text becomes committed (but it is not the committed index;
* that's the cursor). The cursor index, described above, marks the point at which the
* transliterator last stopped, either because it reached the end, or because it required more characters to
- * disambiguate between possible inputs. The cursor can also be explicitly set by rules in a
- * RuleBasedTransliterator. Any characters before the cursor index are frozen; future keyboard
+ * disambiguate between possible inputs. The cursor can also be explicitly set by rules.
+ * Any characters before the cursor index are frozen; future keyboard
* transliteration calls within this input sequence will not change them. New text is inserted at the limit
* index, which marks the end of the substring that the transliterator looks at.
*
@@ -222,13 +222,262 @@ import com.ibm.icu.util.UResourceBundle;
* transliterate() method taking a String and StringBuffer if the performance of
* these methods can be improved over the performance obtained by the default implementations in this class.
*
+ *
+ *
+ *
+ * $alefmadda=\\u0622;
+ * "$alefmadda", will be replaced by
+ * the Unicode character U+0622. Variable names must begin
+ * with a letter and consist only of letters, digits, and
+ * underscores. Case is significant. Duplicate names cause
+ * an exception to be thrown, that is, variables cannot be
+ * redefined. The right hand side may contain well-formed
+ * text of any length, including no text at all ("$empty=;").
+ * The right hand side may contain embedded UnicodeSet
+ * patterns, for example, "$softvowel=[eiyEIY]".
+ * ai>$alefmadda;
+ * ai<$alefmadda;
+ *
+ *
+ * ai<>$alefmadda;
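+ * Translation rules consist of a match pattern and an output
+ * string. The match pattern consists of literal characters,
+ * optionally preceded by context, and optionally followed by
+ * context. Context characters, like literal pattern characters,
+ * must be matched in the text being transliterated. However, unlike
+ * literal pattern characters, they are not replaced by the output
+ * text. For example, the pattern "abc{def}"
+ * indicates the characters "def" must be
+ * preceded by "abc" for a successful match.
+ * If there is a successful match, "def" will
+ * be replaced, but not "abc". The final '}'
+ * is optional, so "abc{def" is equivalent to
+ * "abc{def}". Another example is "{123}456"
+ * (or "123}456") in which the literal
+ * pattern "123" must be followed by "456".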
+ *
+ * '|', this is
+ * taken to indicate the location of the cursor after
+ * replacement. The cursor is the point in the text at which the
+ * next replacement, if any, will be applied. The cursor is usually
+ * placed within the replacement text; however, it can actually be
+ * placed into the preceding or following context by using the
+ * special character '@'. Examples:
+ *
+ *
+ * a {foo} z > | @ bar; # foo -> bar, move cursor before a
+ * {foo} xyz > bar @@|; # foo -> bar, cursor between y and z
+ *
+ *
+ * UnicodeSet patterns may appear anywhere that
+ * makes sense. They may appear in variable definitions.
+ * Contrariwise, UnicodeSet patterns may themselves
+ * contain variable references, such as "$a=[a-z];$not_a=[^$a]",
+ * or "$range=a-z;$ll=[$range]".
+ *
+ * UnicodeSet patterns may also be embedded directly
+ * into rule strings. Thus, the following two rules are equivalent:
+ *
+ *
+ * $vowel=[aeiou]; $vowel>'*'; # One way to do this
+ * [aeiou]>'*'; # Another way
+ *
+ *
+ *
+ * ([a-z]) > $1 $1; # double lowercase letters
+ * ([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs
+ *
+ *
+ * "(" and ")". Up to
+ * nine segments may be defined. Segments may not overlap. In the
+ * output string, "$1" through "$9"
+ * represent the input string segments, in left-to-right order of
+ * definition.
+ *
+ * '^' and '$'. For example:
+ *
+ *
+ * ^ a > 'BEG_A'; # match 'a' at start of text
+ * a > 'A'; # match other instances of 'a'
+ * z $ > 'END_Z'; # match 'z' at end of text
+ * z > 'Z'; # match other instances of 'z'
+ *
+ *
+ * UnicodeSet.
+ * This is done by including a virtual anchor character '$' at the end of the
+ * set pattern. Although this is usually the match character for the end anchor, the set will
+ * match either the beginning or the end of the text, depending on its placement. For
+ * example:
+ *
+ *
+ * $x = [a-z$]; # match 'a' through 'z' OR anchor
+ * $x 1 > 2; # match '1' after a-z or at the start
+ * 3 $x > 4; # match '3' before a-z or at the end
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * Rule 1.
+ *
+ * abc{def}>x|y
+ *
+ * Rule 2.
+ *
+ * xyz>r
+ *
+ * Rule 3.
+ *
+ * yz>q
+ * "adefabcdefz"
+ * yields the following results:
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ * |adefabcdefz    Initial state, no rules match. Advance cursor.
+ * a|defabcdefz    Still no match. Rule 1 does not match because the
+ *                 preceding context is not present.
+ * ad|efabcdefz    Still no match. Keep advancing until there is a match...
+ * ade|fabcdefz    ...
+ * adef|abcdefz    ...
+ * adefa|bcdefz    ...
+ * adefab|cdefz    ...
+ * adefabc|defz    Rule 1 matches; replace "def" with "xy" and back up
+ *                 the cursor to before the 'y'.
+ * adefabcx|yz     Although "xyz" is present, rule 2 does not match because
+ *                 the cursor is before the 'y', not before the 'x'.
+ *                 Rule 3 does match. Replace "yz" with "q".
+ * adefabcxq|      The cursor is at the end; transliteration is complete.
+ * "'>'>o''clock" changes the
+ * string ">" to the string "o'clock".
+ *
+ * Transliterator object constructed from
- * the given rule string. This will be a RuleBasedTransliterator,
+ * the given rule string. This will be a rule-based Transliterator,
* if the rule string contains only rules, or a
- * CompoundTransliterator, if it contains ID blocks, or a
- * NullTransliterator, if it contains ID blocks which parse as
+ * compound Transliterator, if it contains ID blocks, or a
+ * null Transliterator, if it contains ID blocks which parse as
* empty for the given direction.
+ *
+ * @param ID the id for the transliterator.
+ * @param rules rules, separated by ';'
+ * @param dir either FORWARD or REVERSE.
+ * @return a newly created Transliterator
+ * @throws IllegalArgumentException if there is a problem with the ID or the rules
* @stable ICU 2.0
*/
public static final Transliterator createFromRules(String ID, String rules, int dir) {