From: Markus Scherer Date: Thu, 20 Sep 2018 04:51:49 +0000 (-0700) Subject: ICU-13832 Transliterator: move rule syntax docs from internal class to public (#150) X-Git-Tag: release-63-rc~38 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=a075ac9cf852918c192f92723da6f3ed644df669;p=icu ICU-13832 Transliterator: move rule syntax docs from internal class to public (#150) --- diff --git a/icu4c/source/i18n/rbt.h b/icu4c/source/i18n/rbt.h index b998c694c23..671149f66ef 100644 --- a/icu4c/source/i18n/rbt.h +++ b/icu4c/source/i18n/rbt.h @@ -29,262 +29,10 @@ class TransliterationRuleData; /** * RuleBasedTransliterator is a transliterator - * that reads a set of rules in order to determine how to perform - * translations. Rule sets are stored in resource bundles indexed by - * name. Rules within a rule set are separated by semicolons (';'). - * To include a literal semicolon, prefix it with a backslash ('\'). - * Whitespace, as defined by Character.isWhitespace(), - * is ignored. If the first non-blank character on a line is '#', - * the entire line is ignored as a comment.

- * - *

Each set of rules consists of two groups, one forward, and one - * reverse. This is a convention that is not enforced; rules for one - * direction may be omitted, with the result that translations in - * that direction will not modify the source text. In addition, - * bidirectional forward-reverse rules may be specified for - * symmetrical transformations.

- * - *

Rule syntax

- * - *

Rule statements take one of the following forms:

- * - *
- *
$alefmadda=\u0622;
- *
Variable definition. The name on the - * left is assigned the text on the right. In this example, - * after this statement, instances of the left hand name, - * "$alefmadda", will be replaced by - * the Unicode character U+0622. Variable names must begin - * with a letter and consist only of letters, digits, and - * underscores. Case is significant. Duplicate names cause - * an exception to be thrown, that is, variables cannot be - * redefined. The right hand side may contain well-formed - * text of any length, including no text at all ("$empty=;"). - * The right hand side may contain embedded UnicodeSet - * patterns, for example, "$softvowel=[eiyEIY]".
- *
 
- *
ai>$alefmadda;
- *
Forward translation rule. This rule - * states that the string on the left will be changed to the - * string on the right when performing forward - * transliteration.
- *
 
- *
ai<$alefmadda;
- *
Reverse translation rule. This rule - * states that the string on the right will be changed to - * the string on the left when performing reverse - * transliteration.
- *
- * - *
- *
ai<>$alefmadda;
- *
Bidirectional translation rule. This - * rule states that the string on the left will be changed - * to the string on the right when performing forward - * transliteration, and vice versa when performing reverse - * transliteration.
- *
- * - *

Translation rules consist of a match pattern and an output - * string. The match pattern consists of literal characters, - * optionally preceded by context, and optionally followed by - * context. Context characters, like literal pattern characters, - * must be matched in the text being transliterated. However, unlike - * literal pattern characters, they are not replaced by the output - * text. For example, the pattern "abc{def}" - * indicates the characters "def" must be - * preceded by "abc" for a successful match. - * If there is a successful match, "def" will - * be replaced, but not "abc". The final '}' - * is optional, so "abc{def" is equivalent to - * "abc{def}". Another example is "{123}456" - * (or "123}456") in which the literal - * pattern "123" must be followed by "456". - *

- * - *

The output string of a forward or reverse rule consists of - * characters to replace the literal pattern characters. If the - * output string contains the character '|', this is - * taken to indicate the location of the cursor after - * replacement. The cursor is the point in the text at which the - * next replacement, if any, will be applied. The cursor is usually - * placed within the replacement text; however, it can actually be - * placed into the preceding or following context by using the - * special character '@'. Examples:

- * - *
- *

a {foo} z > | @ bar; # foo -> bar, move cursor - * before a
- * {foo} xyz > bar @@|; # foo -> bar, cursor between - * y and z

- *
- * - *

UnicodeSet

- * - *

UnicodeSet patterns may appear anywhere that - * makes sense. They may appear in variable definitions. - * Contrariwise, UnicodeSet patterns may themselves - * contain variable references, such as "$a=[a-z];$not_a=[^$a]", - * or "$range=a-z;$ll=[$range]".

- * - *

UnicodeSet patterns may also be embedded directly - * into rule strings. Thus, the following two rules are equivalent:

- * - *
- *

$vowel=[aeiou]; $vowel>'*'; # One way to do this
- * [aeiou]>'*'; - *                # - * Another way

- *
- * - *

See {@link UnicodeSet} for more documentation and examples.

- * - *

Segments

- * - *

Segments of the input string can be matched and copied to the - * output string. This makes certain sets of rules simpler and more - * general, and makes reordering possible. For example:

- * - *
- *

([a-z]) > $1 $1; - *           # - * double lowercase letters
- * ([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs

- *
- * - *

The segment of the input string to be copied is delimited by - * "(" and ")". Up to - * nine segments may be defined. Segments may not overlap. In the - * output string, "$1" through "$9" - * represent the input string segments, in left-to-right order of - * definition.

- * - *

Anchors

- * - *

Patterns can be anchored to the beginning or the end of the text. This is done with the - * special characters '^' and '$'. For example:

- * - *
- *

^ a   > 'BEG_A';   # match 'a' at start of text
- *   a   > 'A';       # match other instances - * of 'a'
- *   z $ > 'END_Z';   # match 'z' at end of text
- *   z   > 'Z';       # match other instances - * of 'z'

- *
- * - *

It is also possible to match the beginning or the end of the text using a UnicodeSet. - * This is done by including a virtual anchor character '$' at the end of the - * set pattern. Although this is usually the match character for the end anchor, the set will - * match either the beginning or the end of the text, depending on its placement. For - * example:

- * - *
- *

$x = [a-z$];   # match 'a' through 'z' OR anchor
- * $x 1    > 2;   # match '1' after a-z or at the start
- *    3 $x > 4;   # match '3' before a-z or at the end

- *
- * - *

Example

- * - *

The following example rules illustrate many of the features of - * the rule language.

- * - * - * - * - * - * - * - * - * - * - * - * - * - * - *
Rule 1.  abc{def}>x|y
Rule 2.  xyz>r
Rule 3.  yz>q
- * - *

Applying these rules to the string "adefabcdefz" - * yields the following results:

- * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - *
|adefabcdefz    Initial state, no rules match. Advance - * cursor.
a|defabcdefz    Still no match. Rule 1 does not match - * because the preceding context is not present.
ad|efabcdefz    Still no match. Keep advancing until - * there is a match...
ade|fabcdefz    ...
adef|abcdefz    ...
adefa|bcdefz    ...
adefab|cdefz    ...
adefabc|defz    Rule 1 matches; replace "def" - * with "xy" and back up the cursor - * to before the 'y'.
adefabcx|yz     Although "xyz" is - * present, rule 2 does not match because the cursor is - * before the 'y', not before the 'x'. - * Rule 3 does match. Replace "yz" - * with "q".
adefabcxq|      The cursor is at the end; - * transliteration is complete.
- * - *

The order of rules is significant. If multiple rules may match - * at some point, the first matching rule is applied.

- * - *

Forward and reverse rules may have an empty output string. - * Otherwise, an empty left or right hand side of any statement is a - * syntax error.

- * - *

Single quotes are used to quote any character other than a - * digit or letter. To specify a single quote itself, inside or - * outside of quotes, use two single quotes in a row. For example, - * the rule "'>'>o''clock" changes the - * string ">" to the string "o'clock". - *

- * - *

Notes

- * - *

While a RuleBasedTransliterator is being built, it checks that - * the rules are added in proper order. For example, if the rule - * "a>x" is followed by the rule "ab>y", - * then the second rule will throw an exception. The reason is that - * the second rule can never be triggered, since the first rule - * always matches anything it matches. In other words, the first - * rule masks the second rule.

- * + * built from a set of rules as defined for + * Transliterator::createFromRules(). + * See the C++ class Transliterator documentation for the rule syntax. + * * @author Alan Liu * @internal Use transliterator factory methods instead since this class will be removed in that release. */ diff --git a/icu4c/source/i18n/unicode/translit.h b/icu4c/source/i18n/unicode/translit.h index ebb9575a9f5..6b4888145f1 100644 --- a/icu4c/source/i18n/unicode/translit.h +++ b/icu4c/source/i18n/unicode/translit.h @@ -15,10 +15,10 @@ #include "unicode/utypes.h" /** - * \file + * \file * \brief C++ API: Tranforms text from one format to another. */ - + #if !UCONFIG_NO_TRANSLITERATION #include "unicode/uobject.h" @@ -31,7 +31,6 @@ U_NAMESPACE_BEGIN class UnicodeFilter; class UnicodeSet; -class CompoundTransliterator; class TransliteratorParser; class NormalizationTransliterator; class TransliteratorIDParser; @@ -97,18 +96,20 @@ class TransliteratorIDParser; * contents of the buffer may show text being modified as each new * character arrives. * - *

Consider the simple `RuleBasedTransliterator`: - * + *

Consider the simple rule-based Transliterator: + *

  *     th>{theta}
  *     t>{tau}
+ * 
* * When the user types 't', nothing will happen, since the * transliterator is waiting to see if the next character is 'h'. To * remedy this, we introduce the notion of a cursor, marked by a '|' * in the output string: - * + *
  *     t>|{tau}
  *     {tau}h>{theta}
+ * 
* * Now when the user types 't', tau appears, and if the next character * is 'h', the tau changes to a theta. This is accomplished by @@ -130,7 +131,7 @@ class TransliteratorIDParser; * which the transliterator last stopped, either because it reached * the end, or because it required more characters to disambiguate * between possible inputs. The CURSOR can also be - * explicitly set by rules in a RuleBasedTransliterator. + * explicitly set by rules in a rule-based Transliterator. * Any characters before the CURSOR index are frozen; * future keyboard transliteration calls within this input sequence * will not change them. New text is inserted at the @@ -232,6 +233,255 @@ class TransliteratorIDParser; * if the performance of these methods can be improved over the * performance obtained by the default implementations in this class. * + *

Rule syntax + * + *

A set of rules determines how to perform translations. + * Rules within a rule set are separated by semicolons (';'). + * To include a literal semicolon, prefix it with a backslash ('\'). + * Unicode Pattern_White_Space is ignored. + * If the first non-blank character on a line is '#', + * the entire line is ignored as a comment. + * + *

Each set of rules consists of two groups, one forward, and one + * reverse. This is a convention that is not enforced; rules for one + * direction may be omitted, with the result that translations in + * that direction will not modify the source text. In addition, + * bidirectional forward-reverse rules may be specified for + * symmetrical transformations. + * + *

Note: Another description of the Transliterator rule syntax is available in + * section + * Transform Rules Syntax of UTS #35: Unicode LDML. + * The rules are shown there using arrow symbols ← and → and ↔. + * ICU supports both those and the equivalent ASCII symbols < and > and <>. + * + *

Rule statements take one of the following forms: + * + *

+ *
$alefmadda=\\u0622;
+ *
Variable definition. The name on the + * left is assigned the text on the right. In this example, + * after this statement, instances of the left hand name, + * "$alefmadda", will be replaced by + * the Unicode character U+0622. Variable names must begin + * with a letter and consist only of letters, digits, and + * underscores. Case is significant. Duplicate names cause + * an exception to be thrown, that is, variables cannot be + * redefined. The right hand side may contain well-formed + * text of any length, including no text at all ("$empty=;"). + * The right hand side may contain embedded UnicodeSet + * patterns, for example, "$softvowel=[eiyEIY]".
+ *
ai>$alefmadda;
+ *
Forward translation rule. This rule + * states that the string on the left will be changed to the + * string on the right when performing forward + * transliteration.
+ *
ai<$alefmadda;
+ *
Reverse translation rule. This rule + * states that the string on the right will be changed to + * the string on the left when performing reverse + * transliteration.
+ *
+ * + *
+ *
ai<>$alefmadda;
+ *
Bidirectional translation rule. This + * rule states that the string on the left will be changed + * to the string on the right when performing forward + * transliteration, and vice versa when performing reverse + * transliteration.
+ *
+ * + *

Translation rules consist of a match pattern and an output + * string. The match pattern consists of literal characters, + * optionally preceded by context, and optionally followed by + * context. Context characters, like literal pattern characters, + * must be matched in the text being transliterated. However, unlike + * literal pattern characters, they are not replaced by the output + * text. For example, the pattern "abc{def}" + * indicates the characters "def" must be + * preceded by "abc" for a successful match. + * If there is a successful match, "def" will + * be replaced, but not "abc". The final '}' + * is optional, so "abc{def" is equivalent to + * "abc{def}". Another example is "{123}456" + * (or "123}456") in which the literal + * pattern "123" must be followed by "456". + * + *

The output string of a forward or reverse rule consists of + * characters to replace the literal pattern characters. If the + * output string contains the character '|', this is + * taken to indicate the location of the cursor after + * replacement. The cursor is the point in the text at which the + * next replacement, if any, will be applied. The cursor is usually + * placed within the replacement text; however, it can actually be + * placed into the preceding or following context by using the + * special character '@'. Examples: + * + *

+ *     a {foo} z > | @ bar; # foo -> bar, move cursor before a
+ *     {foo} xyz > bar @@|; # foo -> bar, cursor between y and z
+ * 
+ * + *

UnicodeSet + * + *

UnicodeSet patterns may appear anywhere that + * makes sense. They may appear in variable definitions. + * Contrariwise, UnicodeSet patterns may themselves + * contain variable references, such as "$a=[a-z];$not_a=[^$a]", + * or "$range=a-z;$ll=[$range]". + * + *

UnicodeSet patterns may also be embedded directly + * into rule strings. Thus, the following two rules are equivalent: + * + *

+ *     $vowel=[aeiou]; $vowel>'*'; # One way to do this
+ *     [aeiou]>'*'; # Another way
+ * 
+ * + *

See {@link UnicodeSet} for more documentation and examples. + * + *

Segments + * + *

Segments of the input string can be matched and copied to the + * output string. This makes certain sets of rules simpler and more + * general, and makes reordering possible. For example: + * + *

+ *     ([a-z]) > $1 $1; # double lowercase letters
+ *     ([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs
+ * 
+ * + *

The segment of the input string to be copied is delimited by + * "(" and ")". Up to + * nine segments may be defined. Segments may not overlap. In the + * output string, "$1" through "$9" + * represent the input string segments, in left-to-right order of + * definition. + * + *

Anchors + * + *

Patterns can be anchored to the beginning or the end of the text. This is done with the + * special characters '^' and '$'. For example: + * + *

+ *   ^ a   > 'BEG_A';   # match 'a' at start of text
+ *     a   > 'A'; # match other instances of 'a'
+ *     z $ > 'END_Z';   # match 'z' at end of text
+ *     z   > 'Z';       # match other instances of 'z'
+ * 
+ * + *

It is also possible to match the beginning or the end of the text using a UnicodeSet. + * This is done by including a virtual anchor character '$' at the end of the + * set pattern. Although this is usually the match character for the end anchor, the set will + * match either the beginning or the end of the text, depending on its placement. For + * example: + * + *

+ *   $x = [a-z$];   # match 'a' through 'z' OR anchor
+ *   $x 1    > 2;   # match '1' after a-z or at the start
+ *      3 $x > 4;   # match '3' before a-z or at the end
+ * 
+ * + *

Example + * + *

The following example rules illustrate many of the features of + * the rule language. + * + * + * + * + * + * + * + * + * + * + * + * + * + * + *
Rule 1.  abc{def}>x|y
Rule 2.  xyz>r
Rule 3.  yz>q
+ * + *

Applying these rules to the string "adefabcdefz" + * yields the following results: + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + *
|adefabcdefz    Initial state, no rules match. Advance + * cursor.
a|defabcdefz    Still no match. Rule 1 does not match + * because the preceding context is not present.
ad|efabcdefz    Still no match. Keep advancing until + * there is a match...
ade|fabcdefz    ...
adef|abcdefz    ...
adefa|bcdefz    ...
adefab|cdefz    ...
adefabc|defz    Rule 1 matches; replace "def" + * with "xy" and back up the cursor + * to before the 'y'.
adefabcx|yz     Although "xyz" is + * present, rule 2 does not match because the cursor is + * before the 'y', not before the 'x'. + * Rule 3 does match. Replace "yz" + * with "q".
adefabcxq|      The cursor is at the end; + * transliteration is complete.
+ * + *
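The walk-through above can be replayed with a small stand-alone sketch. This is not ICU code and it does not parse rule strings: `Rule`, `exampleRules()` and `applyRules()` are hypothetical names invented for illustration, each rule is spelled out by hand as context, key, output and cursor offset, and only literal text is handled (no UnicodeSets, segments, anchors or '@').

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified model of one forward rule: before{key}after > output.
struct Rule {
    std::string before;  // preceding context: must match, is not replaced
    std::string key;     // literal pattern: matched at the cursor and replaced
    std::string after;   // following context: must match, is not replaced
    std::string output;  // replacement text with the '|' removed
    std::size_t cursor;  // offset of '|' within output (output.size() if absent)
};

// The three example rules: abc{def}>x|y, xyz>r, yz>q.
std::vector<Rule> exampleRules() {
    return {
        {"abc", "def", "", "xy", 1},
        {"",    "xyz", "", "r",  1},
        {"",    "yz",  "", "q",  1},
    };
}

// Walk the cursor through the text; at each position apply the first rule
// that matches, otherwise advance by one character.
// Note: a rule set whose output re-matches at the same cursor would loop
// forever in this sketch; the example rules do not do that.
std::string applyRules(std::string text, const std::vector<Rule>& rules) {
    std::size_t pos = 0;
    while (pos < text.size()) {
        bool matched = false;
        for (const Rule& r : rules) {
            assert(!r.key.empty());
            if (pos < r.before.size()) continue;  // not enough preceding text
            if (text.compare(pos - r.before.size(), r.before.size(), r.before) != 0) continue;
            if (text.compare(pos, r.key.size(), r.key) != 0) continue;
            if (text.compare(pos + r.key.size(), r.after.size(), r.after) != 0) continue;
            text.replace(pos, r.key.size(), r.output);  // context is left untouched
            pos += r.cursor;                            // place the cursor at '|'
            matched = true;
            break;                                      // first matching rule wins
        }
        if (!matched) ++pos;
    }
    return text;
}
```

Under these assumptions, `applyRules("adefabcdefz", exampleRules())` returns `"adefabcxq"`, the final state in the table. Rule 2 never fires there because the cursor is placed after the 'x', so only "yz" is visible at the cursor.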

The order of rules is significant. If multiple rules may match + * at some point, the first matching rule is applied. + * + *

Forward and reverse rules may have an empty output string. + * Otherwise, an empty left or right hand side of any statement is a + * syntax error. + * + *

Single quotes are used to quote any character other than a + * digit or letter. To specify a single quote itself, inside or + * outside of quotes, use two single quotes in a row. For example, + * the rule "'>'>o''clock" changes the + * string ">" to the string "o'clock". + * + *
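The quoting convention can be illustrated with a minimal sketch. `unquoteRuleText()` is a hypothetical helper written for this note, not an ICU function; it assumes the input is one side of a rule and handles only single quotes, ignoring whitespace stripping and every other special character.

```cpp
#include <cstddef>
#include <string>

// Resolve single-quote quoting in one side of a rule (simplified):
//   ''  -> a literal single quote, inside or outside a quoted run
//   '   -> opens or closes a quoted run and is dropped
std::string unquoteRuleText(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] != '\'') {
            out += s[i];  // ordinary character, copied as-is
        } else if (i + 1 < s.size() && s[i + 1] == '\'') {
            out += '\'';  // doubled quote stands for one literal quote
            ++i;          // skip the second quote of the pair
        }
        // a lone quote only delimits a quoted run, so nothing is emitted
    }
    return out;
}
```

Applied to the two sides of the rule `'>'>o''clock`, the sketch gives `>` for the left side `'>'` and `o'clock` for the right side `o''clock`.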

Notes + * + *

While a Transliterator is being built from rules, it checks that + * the rules are added in proper order. For example, if the rule + * "a>x" is followed by the rule "ab>y", + * then the second rule will throw an exception. The reason is that + * the second rule can never be triggered, since the first rule + * always matches anything it matches. In other words, the first + * rule masks the second rule. + * * @author Alan Liu * @stable ICU 2.0 */ @@ -627,7 +877,7 @@ public: /** * Transliterate a substring of text, as specified by index, taking filters * into account. This method is for subclasses that need to delegate to - * another transliterator, such as CompoundTransliterator. + * another transliterator. * @param text the text to be transliterated * @param index the position indices * @param incremental if TRUE, then assume more characters may be inserted @@ -841,17 +1091,19 @@ public: /** * Returns a Transliterator object constructed from - * the given rule string. This will be a RuleBasedTransliterator, + * the given rule string. This will be a rule-based Transliterator, * if the rule string contains only rules, or a - * CompoundTransliterator, if it contains ID blocks, or a - * NullTransliterator, if it contains ID blocks which parse as + * compound Transliterator, if it contains ID blocks, or a + * null Transliterator, if it contains ID blocks which parse as * empty for the given direction. + * * @param ID the id for the transliterator. * @param rules rules, separated by ';' * @param dir either FORWARD or REVERSE. - * @param parseError Struct to recieve information on position + * @param parseError Struct to receive information on position * of error if an error is encountered * @param status Output param set to success/failure code. 
+ * @return a newly created Transliterator * @stable ICU 2.0 */ static Transliterator* U_EXPORT2 createFromRules(const UnicodeString& ID, diff --git a/icu4c/source/test/intltest/cpdtrtst.h b/icu4c/source/test/intltest/cpdtrtst.h index e723619ad36..1733f1a6e42 100644 --- a/icu4c/source/test/intltest/cpdtrtst.h +++ b/icu4c/source/test/intltest/cpdtrtst.h @@ -20,6 +20,7 @@ #if !UCONFIG_NO_TRANSLITERATION #include "unicode/translit.h" +#include "cpdtrans.h" #include "intltest.h" /** diff --git a/icu4j/main/classes/translit/src/com/ibm/icu/text/RuleBasedTransliterator.java b/icu4j/main/classes/translit/src/com/ibm/icu/text/RuleBasedTransliterator.java index 97a51fdd2f2..be3beb6fdbd 100644 --- a/icu4j/main/classes/translit/src/com/ibm/icu/text/RuleBasedTransliterator.java +++ b/icu4j/main/classes/translit/src/com/ibm/icu/text/RuleBasedTransliterator.java @@ -13,259 +13,9 @@ import java.util.Map; /** * RuleBasedTransliterator is a transliterator - * that reads a set of rules in order to determine how to perform - * translations. Rule sets are stored in resource bundles indexed by - * name. Rules within a rule set are separated by semicolons (';'). - * To include a literal semicolon, prefix it with a backslash ('\'). - * Unicode Pattern_White_Space is ignored. - * If the first non-blank character on a line is '#', - * the entire line is ignored as a comment. - * - *

Each set of rules consists of two groups, one forward, and one - * reverse. This is a convention that is not enforced; rules for one - * direction may be omitted, with the result that translations in - * that direction will not modify the source text. In addition, - * bidirectional forward-reverse rules may be specified for - * symmetrical transformations. - * - *

Rule syntax - * - *

Rule statements take one of the following forms: - * - *

- *
$alefmadda=\u0622;
- *
Variable definition. The name on the - * left is assigned the text on the right. In this example, - * after this statement, instances of the left hand name, - * "$alefmadda", will be replaced by - * the Unicode character U+0622. Variable names must begin - * with a letter and consist only of letters, digits, and - * underscores. Case is significant. Duplicate names cause - * an exception to be thrown, that is, variables cannot be - * redefined. The right hand side may contain well-formed - * text of any length, including no text at all ("$empty=;"). - * The right hand side may contain embedded UnicodeSet - * patterns, for example, "$softvowel=[eiyEIY]".
- *
 
- *
ai>$alefmadda;
- *
Forward translation rule. This rule - * states that the string on the left will be changed to the - * string on the right when performing forward - * transliteration.
- *
 
- *
ai<$alefmadda;
- *
Reverse translation rule. This rule - * states that the string on the right will be changed to - * the string on the left when performing reverse - * transliteration.
- *
- * - *
- *
ai<>$alefmadda;
- *
Bidirectional translation rule. This - * rule states that the string on the left will be changed - * to the string on the right when performing forward - * transliteration, and vice versa when performing reverse - * transliteration.
- *
- * - *

Translation rules consist of a match pattern and an output - * string. The match pattern consists of literal characters, - * optionally preceded by context, and optionally followed by - * context. Context characters, like literal pattern characters, - * must be matched in the text being transliterated. However, unlike - * literal pattern characters, they are not replaced by the output - * text. For example, the pattern "abc{def}" - * indicates the characters "def" must be - * preceded by "abc" for a successful match. - * If there is a successful match, "def" will - * be replaced, but not "abc". The final '}' - * is optional, so "abc{def" is equivalent to - * "abc{def}". Another example is "{123}456" - * (or "123}456") in which the literal - * pattern "123" must be followed by "456". - * - *

The output string of a forward or reverse rule consists of - * characters to replace the literal pattern characters. If the - * output string contains the character '|', this is - * taken to indicate the location of the cursor after - * replacement. The cursor is the point in the text at which the - * next replacement, if any, will be applied. The cursor is usually - * placed within the replacement text; however, it can actually be - * placed into the preceding or following context by using the - * special character '@'. Examples: - * - *

- *

a {foo} z > | @ bar; # foo -> bar, move cursor - * before a
- * {foo} xyz > bar @@|; # foo -> bar, cursor between - * y and z
- *

- * - *

UnicodeSet - * - *

UnicodeSet patterns may appear anywhere that - * makes sense. They may appear in variable definitions. - * Contrariwise, UnicodeSet patterns may themselves - * contain variable references, such as "$a=[a-z];$not_a=[^$a]", - * or "$range=a-z;$ll=[$range]". - * - *

UnicodeSet patterns may also be embedded directly - * into rule strings. Thus, the following two rules are equivalent: - * - *

- *

$vowel=[aeiou]; $vowel>'*'; # One way to do this
- * [aeiou]>'*'; - *                # - * Another way
- *

- * - *

See {@link UnicodeSet} for more documentation and examples. - * - *

Segments - * - *

Segments of the input string can be matched and copied to the - * output string. This makes certain sets of rules simpler and more - * general, and makes reordering possible. For example: - * - *

- *

([a-z]) > $1 $1; - *           # - * double lowercase letters
- * ([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs
- *

- * - *

The segment of the input string to be copied is delimited by - * "(" and ")". Up to - * nine segments may be defined. Segments may not overlap. In the - * output string, "$1" through "$9" - * represent the input string segments, in left-to-right order of - * definition. - * - *

Anchors - * - *

Patterns can be anchored to the beginning or the end of the text. This is done with the - * special characters '^' and '$'. For example: - * - *

- *

^ a   > 'BEG_A';   # match 'a' at start of text
- *   a   > 'A';       # match other instances - * of 'a'
- *   z $ > 'END_Z';   # match 'z' at end of text
- *   z   > 'Z';       # match other instances - * of 'z'
- *

- * - *

It is also possible to match the beginning or the end of the text using a UnicodeSet. - * This is done by including a virtual anchor character '$' at the end of the - * set pattern. Although this is usually the match character for the end anchor, the set will - * match either the beginning or the end of the text, depending on its placement. For - * example: - * - *

- *

$x = [a-z$];   # match 'a' through 'z' OR anchor
- * $x 1    > 2;   # match '1' after a-z or at the start
- *    3 $x > 4;   # match '3' before a-z or at the end
- *

- * - *

Example - * - *

The following example rules illustrate many of the features of - * the rule language. - * - * - * - * - * - * - * - * - * - * - * - * - * - * - *
Rule 1.  abc{def}>x|y
Rule 2.  xyz>r
Rule 3.  yz>q
- * - *

Applying these rules to the string "adefabcdefz" - * yields the following results: - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - * - *
|adefabcdefz    Initial state, no rules match. Advance - * cursor.
a|defabcdefz    Still no match. Rule 1 does not match - * because the preceding context is not present.
ad|efabcdefz    Still no match. Keep advancing until - * there is a match...
ade|fabcdefz    ...
adef|abcdefz    ...
adefa|bcdefz    ...
adefab|cdefz    ...
adefabc|defz    Rule 1 matches; replace "def" - * with "xy" and back up the cursor - * to before the 'y'.
adefabcx|yz     Although "xyz" is - * present, rule 2 does not match because the cursor is - * before the 'y', not before the 'x'. - * Rule 3 does match. Replace "yz" - * with "q".
adefabcxq|      The cursor is at the end; - * transliteration is complete.
- * - *

The order of rules is significant. If multiple rules may match - * at some point, the first matching rule is applied. - * - *

Forward and reverse rules may have an empty output string. - * Otherwise, an empty left or right hand side of any statement is a - * syntax error. - * - *

Single quotes are used to quote any character other than a - * digit or letter. To specify a single quote itself, inside or - * outside of quotes, use two single quotes in a row. For example, - * the rule "'>'>o''clock" changes the - * string ">" to the string "o'clock". - * - *

Notes - * - *

While a RuleBasedTransliterator is being built, it checks that - * the rules are added in proper order. For example, if the rule - * "a>x" is followed by the rule "ab>y", - * then the second rule will throw an exception. The reason is that - * the second rule can never be triggered, since the first rule - * always matches anything it matches. In other words, the first - * rule masks the second rule. + * built from a set of rules as defined for + * {@link Transliterator#createFromRules(String, String, int)}. + * See the class {@link Transliterator} documentation for the rule syntax. * * @author Alan Liu * @internal @@ -369,7 +119,7 @@ public class RuleBasedTransliterator extends Transliterator { static class Data { public Data() { - variableNames = new HashMap(); + variableNames = new HashMap<>(); ruleSet = new TransliterationRuleSet(); } @@ -487,5 +237,3 @@ public class RuleBasedTransliterator extends Transliterator { return new RuleBasedTransliterator(getID(), data, filter); } } - - diff --git a/icu4j/main/classes/translit/src/com/ibm/icu/text/Transliterator.java b/icu4j/main/classes/translit/src/com/ibm/icu/text/Transliterator.java index b8f82558e27..01be8a96dff 100644 --- a/icu4j/main/classes/translit/src/com/ibm/icu/text/Transliterator.java +++ b/icu4j/main/classes/translit/src/com/ibm/icu/text/Transliterator.java @@ -83,7 +83,7 @@ import com.ibm.icu.util.UResourceBundle; * modified as each new character arrives. * *

- * Consider the simple RuleBasedTransliterator: + * Consider the simple rule-based Transliterator: * *

* th>{theta}
@@ -110,8 +110,8 @@ import com.ibm.icu.util.UResourceBundle; * that the transliterator will look at. It is advanced as text becomes committed (but it is not the committed index; * that's the cursor). The cursor index, described above, marks the point at which the * transliterator last stopped, either because it reached the end, or because it required more characters to - * disambiguate between possible inputs. The cursor can also be explicitly set by rules in a - * RuleBasedTransliterator. Any characters before the cursor index are frozen; future keyboard + * disambiguate between possible inputs. The cursor can also be explicitly set by rules. + * Any characters before the cursor index are frozen; future keyboard * transliteration calls within this input sequence will not change them. New text is inserted at the limit * index, which marks the end of the substring that the transliterator looks at. * @@ -222,13 +222,262 @@ import com.ibm.icu.util.UResourceBundle; * transliterate() method taking a String and StringBuffer if the performance of * these methods can be improved over the performance obtained by the default implementations in this class. * + *

Rule syntax + * + *

A set of rules determines how to perform translations. + * Rules within a rule set are separated by semicolons (';'). + * To include a literal semicolon, prefix it with a backslash ('\'). + * Unicode Pattern_White_Space is ignored. + * If the first non-blank character on a line is '#', + * the entire line is ignored as a comment. + * + *

Each set of rules consists of two groups, one forward, and one + * reverse. This is a convention that is not enforced; rules for one + * direction may be omitted, with the result that translations in + * that direction will not modify the source text. In addition, + * bidirectional forward-reverse rules may be specified for + * symmetrical transformations. + * + *

Note: Another description of the Transliterator rule syntax is available in + * section + * Transform Rules Syntax of UTS #35: Unicode LDML. + * The rules are shown there using arrow symbols ← and → and ↔. + * ICU supports both those and the equivalent ASCII symbols < and > and <>. + * + *

Rule statements take one of the following forms: + * + *

+ *
$alefmadda=\\u0622;
+ *
Variable definition. The name on the + * left is assigned the text on the right. In this example, + * after this statement, instances of the left hand name, + * "$alefmadda", will be replaced by + * the Unicode character U+0622. Variable names must begin + * with a letter and consist only of letters, digits, and + * underscores. Case is significant. Duplicate names cause + * an exception to be thrown, that is, variables cannot be + * redefined. The right hand side may contain well-formed + * text of any length, including no text at all ("$empty=;"). + * The right hand side may contain embedded UnicodeSet + * patterns, for example, "$softvowel=[eiyEIY]".
+ *
ai>$alefmadda;
+ *
Forward translation rule. This rule + * states that the string on the left will be changed to the + * string on the right when performing forward + * transliteration.
+ *
ai<$alefmadda;
+ *
Reverse translation rule. This rule + * states that the string on the right will be changed to + * the string on the left when performing reverse + * transliteration.
+ *
+ * + *
+ *
ai<>$alefmadda;
+ *
Bidirectional translation rule. This + * rule states that the string on the left will be changed + * to the string on the right when performing forward + * transliteration, and vice versa when performing reverse + * transliteration.
+ *
+ * + *

Translation rules consist of a match pattern and an output + * string. The match pattern consists of literal characters, + * optionally preceded by context, and optionally followed by + * context. Context characters, like literal pattern characters, + * must be matched in the text being transliterated. However, unlike + * literal pattern characters, they are not replaced by the output + * text. For example, the pattern "abc{def}" + * indicates the characters "def" must be + * preceded by "abc" for a successful match. + * If there is a successful match, "def" will + * be replaced, but not "abc". The final '}' + * is optional, so "abc{def" is equivalent to + * "abc{def}". Another example is "{123}456" + * (or "123}456") in which the literal + * pattern "123" must be followed by "456". + * + *

The output string of a forward or reverse rule consists of + * characters to replace the literal pattern characters. If the + * output string contains the character '|', this is + * taken to indicate the location of the cursor after + * replacement. The cursor is the point in the text at which the + * next replacement, if any, will be applied. The cursor is usually + * placed within the replacement text; however, it can actually be + * placed into the preceding or following context by using the + * special character '@'. Examples: + * + *

+ *     a {foo} z > | @ bar; # foo -> bar, move cursor before a
+ *     {foo} xyz > bar @@|; # foo -> bar, cursor between y and z
+ * 
+ * + *

UnicodeSet + * + *

UnicodeSet patterns may appear anywhere that + * makes sense. They may appear in variable definitions. + * Contrariwise, UnicodeSet patterns may themselves + * contain variable references, such as "$a=[a-z];$not_a=[^$a]", + * or "$range=a-z;$ll=[$range]". + * + *

UnicodeSet patterns may also be embedded directly + * into rule strings. Thus, the following two rules are equivalent: + * + *

+ *     $vowel=[aeiou]; $vowel>'*'; # One way to do this
+ *     [aeiou]>'*'; # Another way
+ * 
+ * + *

See {@link UnicodeSet} for more documentation and examples. + * + *

Segments + * + *

Segments of the input string can be matched and copied to the + * output string. This makes certain sets of rules simpler and more + * general, and makes reordering possible. For example: + * + *

+ *     ([a-z]) > $1 $1; # double lowercase letters
+ *     ([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs
+ * 
+ * + *

The segment of the input string to be copied is delimited by + * "(" and ")". Up to + * nine segments may be defined. Segments may not overlap. In the + * output string, "$1" through "$9" + * represent the input string segments, in left-to-right order of + * definition. + * + *
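For readers more familiar with regular expressions, the way segments carry matched text into the output can be sketched with `java.util.regex`. This is only an analogy: it uses the JDK regex engine, not the transliterator rule engine, and the class and method names (`SegmentAnalogy`, `doubleLowercase`, `swapUpperLower`) are hypothetical.

```java
// Analogy only: JDK regex capture groups standing in for rule segments.
class SegmentAnalogy {

    /** Mirrors "([a-z]) > $1 $1;": each lowercase ASCII letter is doubled. */
    static String doubleLowercase(String s) {
        return s.replaceAll("([a-z])", "$1$1");
    }

    /** Mirrors "([:Lu:]) ([:Ll:]) > $2 $1;": swap each uppercase-lowercase pair. */
    static String swapUpperLower(String s) {
        return s.replaceAll("(\\p{Lu})(\\p{Ll})", "$2$1");
    }

    public static void main(String[] args) {
        System.out.println(doubleLowercase("ab1"));  // prints aabb1
        System.out.println(swapUpperLower("AbCd"));  // prints bAdC
    }
}
```

In both engines the groups are numbered left to right in order of definition, which is why `$2 $1` reverses the pair.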

Anchors + * + *

Patterns can be anchored to the beginning or the end of the text. This is done with the + * special characters '^' and '$'. For example: + * + *

+ *   ^ a   > 'BEG_A';   # match 'a' at start of text
+ *     a   > 'A'; # match other instances of 'a'
+ *     z $ > 'END_Z';   # match 'z' at end of text
+ *     z   > 'Z';       # match other instances of 'z'
+ * 
+ * + *

It is also possible to match the beginning or the end of the text using a UnicodeSet. + * This is done by including a virtual anchor character '$' at the end of the + * set pattern. Although this is usually the match character for the end anchor, the set will + * match either the beginning or the end of the text, depending on its placement. For + * example: + * + *

+ *   $x = [a-z$];   # match 'a' through 'z' OR anchor
+ *   $x 1    > 2;   # match '1' after a-z or at the start
+ *      3 $x > 4;   # match '3' before a-z or at the end
+ * 
+ * + *

Example + * + *

The following example rules illustrate many of the features of + * the rule language. + * + * + * + * + * + * + * + * + * + * + * + * + * + * + *
Rule 1.    abc{def}>x|y
Rule 2.    xyz>r
Rule 3.    yz>q
+ * + *

Applying these rules to the string "adefabcdefz" + * yields the following results: + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + * + *
|adefabcdefz    Initial state, no rules match. Advance + * cursor.
a|defabcdefz    Still no match. Rule 1 does not match + * because the preceding context is not present.
ad|efabcdefz    Still no match. Keep advancing until + * there is a match...
ade|fabcdefz    ...
adef|abcdefz    ...
adefa|bcdefz    ...
adefab|cdefz    ...
adefabc|defz    Rule 1 matches; replace "def" + * with "xy" and back up the cursor + * to before the 'y'.
adefabcx|yz    Although "xyz" is + * present, rule 2 does not match because the cursor is + * before the 'y', not before the 'x'. + * Rule 3 does match. Replace "yz" + * with "q".
adefabcxq|    The cursor is at the end; + * transliteration is complete.
+ * + *
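The trace above can be reproduced with a small toy model. This is a sketch under simplifying assumptions, not ICU's implementation: rules are literal strings only (no UnicodeSet, segments, or anchors), and the names `ToyRuleEngine`, `Rule`, and `cursorOffset` are hypothetical. `cursorOffset` stands for the position of '|' inside the output string.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the cursor-driven matching described above; NOT ICU's implementation.
class ToyRuleEngine {

    /** Hypothetical literal-only rule: before{key}after > output, cursor placed cursorOffset chars into output. */
    static final class Rule {
        final String before, key, after, output;
        final int cursorOffset;

        Rule(String before, String key, String after, String output, int cursorOffset) {
            this.before = before;
            this.key = key;
            this.after = after;
            this.output = output;
            this.cursorOffset = cursorOffset;
        }
    }

    /** Applies the first matching rule at each cursor position, advancing by one char when none matches. */
    static String apply(List<Rule> rules, String input) {
        StringBuilder text = new StringBuilder(input);
        int cursor = 0;
        while (cursor < text.length()) {
            Rule hit = null;
            for (Rule r : rules) {          // rule order is significant: first match wins
                if (matchesAt(r, text.toString(), cursor)) {
                    hit = r;
                    break;
                }
            }
            if (hit == null) {
                cursor++;                   // no rule matches here; advance the cursor
            } else {
                // Replace only the key; the contexts are matched but left untouched.
                text.replace(cursor, cursor + hit.key.length(), hit.output);
                cursor += hit.cursorOffset; // '|' position inside the output
            }
        }
        return text.toString();
    }

    private static boolean matchesAt(Rule r, String text, int cursor) {
        return text.startsWith(r.key + r.after, cursor)
                && text.substring(0, cursor).endsWith(r.before);
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("abc", "def", "", "xy", 1), // Rule 1: abc{def}>x|y
                new Rule("", "xyz", "", "r", 1),     // Rule 2: xyz>r
                new Rule("", "yz", "", "q", 1));     // Rule 3: yz>q
        System.out.println(apply(rules, "adefabcdefz")); // prints adefabcxq
    }
}
```

The toy has no guard against a rule whose output re-matches at an unchanged cursor (for example, a cursor offset of 0), so such a rule set would loop forever here.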

The order of rules is significant. If multiple rules may match + * at some point, the first matching rule is applied. + * + *

Forward and reverse rules may have an empty output string. + * Otherwise, an empty left or right hand side of any statement is a + * syntax error. + * + *

Single quotes are used to quote any character other than a + * digit or letter. To specify a single quote itself, inside or + * outside of quotes, use two single quotes in a row. For example, + * the rule "'>'>o''clock" changes the + * string ">" to the string "o'clock". + * + *
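When rule strings are assembled programmatically, the quoting convention above suggests a small escaping helper. The following is a minimal sketch that relies only on the convention as described here; `RuleQuoting`, `quoteLiteral`, and `forwardRule` are hypothetical names, not part of the ICU API, and the output has not been checked against ICU's rule parser.

```java
// Hypothetical helper for building quoted literals in rule strings.
class RuleQuoting {

    /** Wraps a literal in single quotes, doubling any embedded single quote. */
    static String quoteLiteral(String literal) {
        if (literal.isEmpty()) {
            return ""; // '' would denote a literal single quote, so emit nothing
        }
        return "'" + literal.replace("'", "''") + "'";
    }

    /** Builds a forward rule "from > to;" with both sides quoted. */
    static String forwardRule(String from, String to) {
        return quoteLiteral(from) + ">" + quoteLiteral(to) + ";";
    }

    public static void main(String[] args) {
        System.out.println(forwardRule(">", "o'clock")); // prints '>'>'o''clock';
    }
}
```

Quoting the whole right-hand side is more conservative than the hand-written form `'>'>o''clock`, which quotes only the characters that need it.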

Notes + * + *

While a Transliterator is being built from rules, it checks that + * the rules are added in proper order. For example, if the rule + * "a>x" is followed by the rule "ab>y", + * then the second rule will throw an exception. The reason is that + * the second rule can never be triggered, since the first rule + * always matches anything it matches. In other words, the first + * rule masks the second rule. + * * @author Alan Liu * @stable ICU 2.0 */ public abstract class Transliterator implements StringTransform { /** * Direction constant indicating the forward direction in a transliterator, - * e.g., the forward rules of a RuleBasedTransliterator. An "A-B" + * e.g., the forward rules of a rule-based Transliterator. An "A-B" * transliterator transliterates A to B when operating in the forward * direction, and B to A when operating in the reverse direction. * @stable ICU 2.0 @@ -237,7 +486,7 @@ public abstract class Transliterator implements StringTransform { /** * Direction constant indicating the reverse direction in a transliterator, - * e.g., the reverse rules of a RuleBasedTransliterator. An "A-B" + * e.g., the reverse rules of a rule-based Transliterator. An "A-B" * transliterator transliterates A to B when operating in the forward * direction, and B to A when operating in the reverse direction. * @stable ICU 2.0 @@ -1102,7 +1351,7 @@ public abstract class Transliterator implements StringTransform { /** * Transliterate a substring of text, as specified by index, taking filters * into account. This method is for subclasses that need to delegate to - * another transliterator, such as CompoundTransliterator. + * another transliterator. * @param text the text to be transliterated * @param index the position indices * @param incremental if TRUE, then assume more characters may be inserted @@ -1400,11 +1649,17 @@ public abstract class Transliterator implements StringTransform { /** * Returns a Transliterator object constructed from - * the given rule string. 
This will be a RuleBasedTransliterator, + * the given rule string. This will be a rule-based Transliterator, * if the rule string contains only rules, or a - * CompoundTransliterator, if it contains ID blocks, or a - * NullTransliterator, if it contains ID blocks which parse as + * compound Transliterator, if it contains ID blocks, or a + * null Transliterator, if it contains ID blocks which parse as * empty for the given direction. + * + * @param ID the id for the transliterator. + * @param rules rules, separated by ';' + * @param dir either FORWARD or REVERSE. + * @return a newly created Transliterator + * @throws IllegalArgumentException if there is a problem with the ID or the rules * @stable ICU 2.0 */ public static final Transliterator createFromRules(String ID, String rules, int dir) {