From: David Beaumont Date: Sat, 24 Aug 2019 15:14:52 +0000 (+0000) Subject: ICU-20693 New LDML to ICU tooling. X-Git-Tag: release-65-rc~45 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=2528d0bec150d6fe80d1f58837517f8340eecde2;p=icu ICU-20693 New LDML to ICU tooling. See #721 --- diff --git a/tools/cldr/cldr-to-icu/.gitignore b/tools/cldr/cldr-to-icu/.gitignore new file mode 100644 index 00000000000..8d31dd98484 --- /dev/null +++ b/tools/cldr/cldr-to-icu/.gitignore @@ -0,0 +1,7 @@ +# Exclude the Maven local repository but keep the lib directory and the top-level readme. +/lib/** +!/lib/README.txt + +# Ignore the default Maven target directory. +/target + diff --git a/tools/cldr/cldr-to-icu/README.txt b/tools/cldr/cldr-to-icu/README.txt new file mode 100644 index 00000000000..647bf99b497 --- /dev/null +++ b/tools/cldr/cldr-to-icu/README.txt @@ -0,0 +1,55 @@ +********************************************************************* +*** © 2019 and later: Unicode, Inc. and others. *** +*** License & terms of use: http://www.unicode.org/copyright.html *** +********************************************************************* + +Basic instructions for running the LdmlConverter via Maven +========================================================== + +Note that these instructions do not currently support configuration of the converter for things +such as limiting the set of files produced. That is supported in code and could be easily added +to the binary, or encapsulated via an Ant task, but currently it is not directly supported. +See the IcuConverterConfig class for the API by which this can be supported. + + +Important directories +--------------------- + + = The root directory of the CLDR release. + + = The root directory of the ICU release (probably a parent directory of where + this README file is located). This is an optional property and defaults to + the parent directory of the release from which it is run. 
+ + = The temporary cache directory in which DTD files are downloaded (this is the + same directory as would be used when running tools from the CLDR project). + Note that the need to specify this directory is scheduled to be removed after + ICU release 65. + + = The output directory into which ICU data files should be written. + + +Generating all ICU data +----------------------- + +$ mvn exec:java \ + -DCLDR_DIR='' \ + -DCLDR_DTD_CACHE='' \ + -Dexec.args='' + + +Running unit tests +------------------ + +$ mvn test \ + -DCLDR_DIR='' \ + -DCLDR_DTD_CACHE='' + + +Importing and running from an IDE +--------------------------------- + +This project should be easy to import into an IDE which supports Maven development, such +as IntelliJ or Eclipse. It uses a local Maven repository directory for the unpublished +CLDR libraries (which are included in the project), but otherwise gets all dependencies +via Maven's public repositories. \ No newline at end of file diff --git a/tools/cldr/cldr-to-icu/lib/README.txt b/tools/cldr/cldr-to-icu/lib/README.txt new file mode 100644 index 00000000000..3e1db8efb04 --- /dev/null +++ b/tools/cldr/cldr-to-icu/lib/README.txt @@ -0,0 +1,61 @@ +********************************************************************* +*** © 2019 and later: Unicode, Inc. and others. *** +*** License & terms of use: http://www.unicode.org/copyright.html *** +********************************************************************* + +What is this directory and why is it empty? +------------------------------------------- + +This is the root of a local Maven repository which needs to be populated before the +code in this project can be executed. + +To do this, you need to have a local copy of the CLDR project configured on your +computer and be able to build the API jar file and copy an existing utility +jar file. In the examples below it is assumed that references this CLDR +release. 
+ + +Regenerating the CLDR API jar +----------------------------- + +To regenerate the CLDR API jar you need to build the "jar" target using the Ant +build.xml file in the "tools/java" directory of the CLDR project: + +$ cd /tools/java +$ ant clean jar + +This should result in the cldr.jar file being built into that directory, which can then +be installed as a Maven dependency as described above. + + +Updating local Maven repository +------------------------------- + +To update the local Maven repository (e.g. to install the CLDR jar) then from this +directory (lib/) you should run: + +$ mvn install:install-file \ + -DgroupId=org.unicode.cldr \ + -DartifactId=cldr-api \ + -Dversion=0.1-SNAPSHOT \ + -Dpackaging=jar \ + -DgeneratePom=true \ + -DlocalRepositoryPath=. \ + -Dfile=/tools/java/cldr.jar + +And also (for the utility jar): + +$ mvn install:install-file \ + -DgroupId=com.ibm.icu \ + -DartifactId=icu-utilities \ + -Dversion=0.1-SNAPSHOT \ + -Dpackaging=jar \ + -DgeneratePom=true \ + -DlocalRepositoryPath=. \ + -Dfile=/tools/java/libs/utilities.jar + +And if you have updated one of these libraries, run: + +$ mvn dependency:purge-local-repository -DsnapshotsOnly=true + +If you choose to update the version number, then remember to update the root pom.xml. diff --git a/tools/cldr/cldr-to-icu/pom.xml b/tools/cldr/cldr-to-icu/pom.xml new file mode 100644 index 00000000000..3c7884350a1 --- /dev/null +++ b/tools/cldr/cldr-to-icu/pom.xml @@ -0,0 +1,83 @@ + + + + 4.0.0 + + org.unicode.icu + cldr-to-icu + 1.0-SNAPSHOT + + + + org.apache.maven.plugins + maven-compiler-plugin + 3.5.1 + + 8 + 8 + + + + org.codehaus.mojo + exec-maven-plugin + + org.unicode.icu.tool.cldrtoicu.LdmlConverter + + + ICU_DIR + ${project.basedir}/../../.. 
+ + + + + + + + + + + local-maven-repo + file:///${project.basedir}/lib + + + + + + org.unicode.cldr + cldr-api + 0.1-SNAPSHOT + + + com.ibm.icu + icu-utilities + 0.1-SNAPSHOT + + + com.ibm.icu + icu4j + 64.2 + + + com.google.guava + guava + 27.1-jre + + + com.google.truth + truth + 1.0 + test + + + com.google.truth.extensions + truth-java8-extension + 1.0 + test + + + \ No newline at end of file diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuConverterConfig.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuConverterConfig.java new file mode 100644 index 00000000000..f85c2012750 --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuConverterConfig.java @@ -0,0 +1,381 @@ +// © 2019 and later: Unicode, Inc. and others. +// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; + +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.Paths; +import java.util.Arrays; +import java.util.Map; +import java.util.Optional; +import java.util.Set; + +import org.unicode.cldr.api.CldrDraftStatus; + +import com.google.common.collect.ImmutableMap; +import com.google.common.collect.ImmutableSet; +import org.unicode.icu.tool.cldrtoicu.LdmlConverter.OutputType; + +/** + * The converter config intended to generate the standard ICU data files. This used to be something + * that was configured by text files such as "icu-locale-deprecates.xml" and "icu-config. 
+ */ +public final class IcuConverterConfig implements LdmlConverterConfig { + + private static final Optional<Path> DEFAULT_CLDR_DIR = + Optional.ofNullable(System.getProperty("CLDR_DIR", null)) + .map(d -> Paths.get(d).toAbsolutePath()); + + private static final Optional<Path> DEFAULT_ICU_DIR = + Optional.ofNullable(System.getProperty("ICU_DIR", null)) + .map(d -> Paths.get(d).toAbsolutePath()); + + /** The builder with which to specify configuration for the {@link LdmlConverter}. */ + public static final class Builder { + private Path cldrDir = DEFAULT_CLDR_DIR.orElse(null); + private Path outputDir = + DEFAULT_ICU_DIR.map(d -> d.resolve("icu4c/source/data")).orElse(null); + private Path specialsDir = + DEFAULT_ICU_DIR.map(d -> d.resolve("icu4c/source/data/xml")).orElse(null); + private ImmutableSet<OutputType> outputTypes = OutputType.ALL; + private CldrDraftStatus minimalDraftStatus = CldrDraftStatus.CONTRIBUTED; + private boolean emitReport = false; + + /** + * Sets the CLDR base directory from which to load all CLDR data. This is optional if the + * {@code CLDR_DIR} system property is set, which will be used instead. + */ + public Builder setCldrDir(Path cldrDir) { + this.cldrDir = checkNotNull(cldrDir.toAbsolutePath()); + return this; + } + + /** + * Sets the output directory in which the ICU data directories and files will go. This is + * optional if the {@code ICU_DIR} system property is set, which will be used to generate + * the path instead (i.e. {@code "icu4c/source/data"} inside the ICU release directory). + */ + public Builder setOutputDir(Path outputDir) { + this.outputDir = checkNotNull(outputDir); + return this; + } + + /** + * Sets the "specials" directory containing additional ICU specific data to be processed. + * This is optional if the {@code ICU_DIR} system property is set, which will be used to + * generate the path instead (i.e. {@code "icu4c/source/data/xml"} inside the ICU release + * directory). 
+ */ + public Builder setSpecialsDir(Path specialsDir) { + this.specialsDir = checkNotNull(specialsDir); + return this; + } + + /** + * Sets the output types which will be converted. This is optional and defaults to {@link + * OutputType#ALL}. + */ + public Builder setOutputTypes(Iterable<OutputType> types) { + this.outputTypes = ImmutableSet.copyOf(types); + return this; + } + + /** + * Sets the minimum draft status for CLDR data to be converted (paths below this status are + * ignored during conversion). This is optional and defaults to {@link + * CldrDraftStatus#CONTRIBUTED}. + */ + public Builder setMinimalDraftStatus(CldrDraftStatus minimalDraftStatus) { + this.minimalDraftStatus = checkNotNull(minimalDraftStatus); + return this; + } + + public Builder setEmitReport(boolean emitReport) { + this.emitReport = emitReport; + return this; + } + + /** Returns a converter config from the current builder state. */ + public LdmlConverterConfig build() { + return new IcuConverterConfig(this); + } + } + + private final Path cldrDir; + private final Path outputDir; + private final Path specialsDir; + private final ImmutableSet<OutputType> outputTypes; + private final CldrDraftStatus minimalDraftStatus; + private final boolean emitReport; + + private IcuConverterConfig(Builder builder) { + this.cldrDir = checkNotNull(builder.cldrDir, + "must set a CLDR directory, or the CLDR_DIR system property"); + if (DEFAULT_CLDR_DIR.isPresent() && !this.cldrDir.equals(DEFAULT_CLDR_DIR.get())) { + System.err.format( + "Warning: Specified CLDR base directory does not appear to match the" + + " directory inferred by the 'CLDR_DIR' system property.\n" + + "Specified: %s\n" + + "Inferred: %s\n", + this.cldrDir, DEFAULT_CLDR_DIR.get()); + } + this.outputDir = checkNotNull(builder.outputDir); + checkArgument(!Files.isRegularFile(outputDir), + "specified output directory is not a directory: %s", outputDir); + this.specialsDir = checkNotNull(builder.specialsDir, + "must specify a 'specials' XML directory"); + 
checkArgument(Files.isDirectory(specialsDir), + "specified specials directory does not exist: %s", specialsDir); + this.outputTypes = builder.outputTypes; + checkArgument(!this.outputTypes.isEmpty(), + "must specify at least one output type to be generated (possible values are: %s)", + Arrays.asList(OutputType.values())); + this.minimalDraftStatus = builder.minimalDraftStatus; + this.emitReport = builder.emitReport; + } + + public static Builder builder() { + return new Builder(); + } + + @Override public Path getCldrDirectory() { + return cldrDir; + } + + @Override public Path getOutputDir() { + return outputDir; + } + + @Override public Set<OutputType> getOutputTypes() { + return outputTypes; + } + + @Override public CldrDraftStatus getMinimumDraftStatus() { + return minimalDraftStatus; + } + + @Override public Path getSpecialsDir() { + return specialsDir; + } + + @Override public boolean emitReport() { + return emitReport; + } + + // Currently hard-coded "hacks" which could be encoded via the builder if wanted. + + @Override public Map<String, String> getForcedAliases(IcuLocaleDir dir) { + switch (dir) { + case COLL: + return ImmutableMap.<String, String>builder() + // It is not at all clear why this is being done (we expect "sr_Latn_ME" normally). + // TODO: Find out and document this properly. + .put("sr_ME", "sr_Cyrl_ME") + + // This appears to be a hack to avoid needing to copy and maintain the same "zh" + // data for "yue". The files for "yue" in this directory should be empty otherwise. + // + // The maximized version of "yue_Hans" is "yue_Hans_CN" (vs "zh_Hans_CN"), and for + // "yue" it's "yue_Hant_HK" (vs "zh_Hant_HK"), so the aliases are effectively just + // rewriting the base language. + .put("yue_Hans", "zh_Hans") + .put("yue", "zh_Hant") + .build(); + case RBNF: + // It is not at all clear why this is being done. 
It's certainly not exactly the same + // as above, since (a) the alias is reversed (b) "zh_Hant" does exist, with different + // data than "yue", so this alias is not just rewriting the base language. + // TODO: Find out and document this properly. + return ImmutableMap.of("zh_Hant_HK", "yue"); + default: + return ImmutableMap.of(); + } + } + + // This set of locale files in each directory denotes the supported/available locales for that + // API. In most cases, it's the same set, but a few directories support only a subset of IDs. + @Override public ImmutableSet getTargetLocaleIds(IcuLocaleDir dir) { + switch (dir) { + case COLL: + return COLL_LOCALE_IDS; + case BRKITR: + return BRKITR_LOCALE_IDS; + case RBNF: + return RBNF_LOCALE_IDS; + default: + return ICU_LOCALE_IDS; + } + } + + // The primary set of locale IDs to be generated. Other, directory specific, sets should be + // subsets of this. Some of these ID are aliases, so XML files may not exist for all of them. + // + // This was further modified (in order to better match the set of generated ICU files) by: + // * Removing "es_003" (which just seems to be ignored in current code) + // * Adding: "en_NH", "sr_XK", "yue_CN", "yue_HK" (deprecated locale IDs in the manual config) + // * Adding: "no_NO_NY" (a not even structurally valid ID that exists for very legacy reasons) + private static final ImmutableSet ICU_LOCALE_IDS = ImmutableSet.of( + "root", + // A + "af", "af_NA", "af_ZA", "agq", "agq_CM", "ak", "ak_GH", "am", "am_ET", "ar", "ar_001", + "ar_AE", "ar_BH", "ar_DJ", "ar_DZ", "ar_EG", "ar_EH", "ar_ER", "ar_IL", "ar_IQ", + "ar_JO", "ar_KM", "ar_KW", "ar_LB", "ar_LY", "ar_MA", "ar_MR", "ar_OM", "ar_PS", + "ar_QA", "ar_SA", "ar_SD", "ar_SO", "ar_SS", "ar_SY", "ar_TD", "ar_TN", "ar_YE", "ars", + "as", "as_IN", "asa", "asa_TZ", "ast", "ast_ES", "az", "az_AZ", "az_Cyrl", "az_Cyrl_AZ", + "az_Latn", "az_Latn_AZ", + // B + "bas", "bas_CM", "be", "be_BY", "bem", "bem_ZM", "bez", "bez_TZ", "bg", "bg_BG", "bm", + 
"bm_ML", "bn", "bn_BD", "bn_IN", "bo", "bo_CN", "bo_IN", "br", "br_FR", "brx", "brx_IN", + "bs", "bs_Cyrl", "bs_Cyrl_BA", "bs_Latn", "bs_Latn_BA", "bs_BA", + // C + "ca", "ca_AD", "ca_ES", "ca_FR", "ca_IT", "ccp", "ccp_BD", "ccp_IN", "ce", "ce_RU", + "ceb", "ceb_PH", "cgg", "cgg_UG", "chr", "chr_US", "ckb", "ckb_IQ", "ckb_IR", "cs", + "cs_CZ", "cy", "cy_GB", + // D + "da", "da_DK", "da_GL", "dav", "dav_KE", "de", "de_AT", "de_BE", "de_CH", "de_DE", + "de_IT", "de_LI", "de_LU", "dje", "dje_NE", "dsb", "dsb_DE", "dua", "dua_CM", "dyo", + "dyo_SN", "dz", "dz_BT", + // E + "ebu", "ebu_KE", "ee", "ee_GH", "ee_TG", "el", "el_CY", "el_GR", "en", "en_001", + "en_150", "en_AE", "en_AG", "en_AI", "en_AS", "en_AT", "en_AU", "en_BB", "en_BE", + "en_BI", "en_BM", "en_BS", "en_BW", "en_BZ", "en_CA", "en_CC", "en_CH", "en_CK", + "en_CM", "en_CX", "en_CY", "en_DE", "en_DG", "en_DK", "en_DM", "en_ER", "en_FI", + "en_FJ", "en_FK", "en_FM", "en_GB", "en_GD", "en_GG", "en_GH", "en_GI", "en_GM", + "en_GU", "en_GY", "en_HK", "en_IE", "en_IL", "en_IM", "en_IN", "en_IO", "en_JE", + "en_JM", "en_KE", "en_KI", "en_KN", "en_KY", "en_LC", "en_LR", "en_LS", "en_MG", + "en_MH", "en_MO", "en_MP", "en_MS", "en_MT", "en_MU", "en_MW", "en_MY", "en_NA", + "en_NF", "en_NG", "en_NH", "en_NL", "en_NR", "en_NU", "en_NZ", "en_PG", "en_PH", + "en_PK", "en_PN", "en_PR", "en_PW", "en_RH", "en_RW", "en_SB", "en_SC", "en_SD", + "en_SE", "en_SG", "en_SH", "en_SI", "en_SL", "en_SS", "en_SX", "en_SZ", "en_TC", + "en_TK", "en_TO", "en_TT", "en_TV", "en_TZ", "en_UG", "en_UM", "en_US", "en_US_POSIX", + "en_VC", "en_VG", "en_VI", "en_VU", "en_WS", "en_ZA", "en_ZM", "en_ZW", "eo", + "eo_001", "es", "es_419", "es_AR", "es_BO", "es_BR", "es_BZ", "es_CL", "es_CO", + "es_CR", "es_CU", "es_DO", "es_EA", "es_EC", "es_ES", "es_GQ", "es_GT", "es_HN", + "es_IC", "es_MX", "es_NI", "es_PA", "es_PE", "es_PH", "es_PR", "es_PY", "es_SV", + "es_US", "es_UY", "es_VE", "et", "et_EE", "eu", "eu_ES", "ewo", "ewo_CM", + // F + "fa", 
"fa_AF", "fa_IR", "ff", "ff_CM", "ff_GN", "ff_Latn", "ff_Latn_BF", "ff_Latn_CM", + "ff_Latn_GH", "ff_Latn_GM", "ff_Latn_GN", "ff_Latn_GW", "ff_Latn_LR", "ff_Latn_MR", + "ff_Latn_NE", "ff_Latn_NG", "ff_Latn_SL", "ff_Latn_SN", "ff_MR", "ff_SN", "fi", + "fi_FI", "fil", "fil_PH", "fo", "fo_DK", "fo_FO", "fr", "fr_BE", "fr_BF", "fr_BI", + "fr_BJ", "fr_BL", "fr_CA", "fr_CD", "fr_CF", "fr_CG", "fr_CH", "fr_CI", "fr_CM", + "fr_DJ", "fr_DZ", "fr_FR", "fr_GA", "fr_GF", "fr_GN", "fr_GP", "fr_GQ", "fr_HT", + "fr_KM", "fr_LU", "fr_MA", "fr_MC", "fr_MF", "fr_MG", "fr_ML", "fr_MQ", "fr_MR", + "fr_MU", "fr_NC", "fr_NE", "fr_PF", "fr_PM", "fr_RE", "fr_RW", "fr_SC", "fr_SN", + "fr_SY", "fr_TD", "fr_TG", "fr_TN", "fr_VU", "fr_WF", "fr_YT", "fur", "fur_IT", + "fy", "fy_NL", + // G + "ga", "ga_IE", "gd", "gd_GB", "gl", "gl_ES", "gsw", "gsw_CH", "gsw_FR", "gsw_LI", + "gu", "gu_IN", "guz", "guz_KE", "gv", "gv_IM", + // H + "ha", "ha_GH", "ha_NE", "ha_NG", "haw", "haw_US", "he", "he_IL", "hi", "hi_IN", + "hr", "hr_BA", "hr_HR", "hsb", "hsb_DE", "hu", "hu_HU", "hy", "hy_AM", + // I + "ia", "ia_001", "id", "id_ID", "ig", "ig_NG", "ii", "ii_CN", "in", "in_ID", "is", + "is_IS", "it", "it_CH", "it_IT", "it_SM", "it_VA", "iw", "iw_IL", + // J + "ja", "ja_JP", "ja_JP_TRADITIONAL", "jgo", "jgo_CM", "jmc", "jmc_TZ", "jv", "jv_ID", + // K + "ka", "ka_GE", "kab", "kab_DZ", "kam", "kam_KE", "kde", "kde_TZ", "kea", "kea_CV", + "khq", "khq_ML", "ki", "ki_KE", "kk", "kk_KZ", "kkj", "kkj_CM", "kl", "kl_GL", "kln", + "kln_KE", "km", "km_KH", "kn", "kn_IN", "ko", "ko_KP", "ko_KR", "kok", "kok_IN", + "ks", "ks_IN", "ksb", "ksb_TZ", "ksf", "ksf_CM", "ksh", "ksh_DE", "ku", "ku_TR", + "kw", "kw_GB", "ky", "ky_KG", + // L + "lag", "lag_TZ", "lb", "lb_LU", "lg", "lg_UG", "lkt", "lkt_US", "ln", "ln_AO", + "ln_CD", "ln_CF", "ln_CG", "lo", "lo_LA", "lrc", "lrc_IQ", "lrc_IR", "lt", "lt_LT", + "lu", "lu_CD", "luo", "luo_KE", "luy", "luy_KE", "lv", "lv_LV", + // M + "mas", "mas_KE", "mas_TZ", "mer", "mer_KE", "mfe", 
"mfe_MU", "mg", "mg_MG", "mgh", + "mgh_MZ", "mgo", "mgo_CM", "mi", "mi_NZ", "mk", "mk_MK", "ml", "ml_IN", "mn", + "mn_MN", "mo", "mr", "mr_IN", "ms", "ms_BN", "ms_MY", "ms_SG", "mt", "mt_MT", "mua", + "mua_CM", "my", "my_MM", "mzn", "mzn_IR", + // N + "naq", "naq_NA", "nb", "nb_NO", "nb_SJ", "nd", "nd_ZW", "nds", "nds_DE", "nds_NL", + "ne", "ne_IN", "ne_NP", "nl", "nl_AW", "nl_BE", "nl_BQ", "nl_CW", "nl_NL", "nl_SR", + "nl_SX", "nmg", "nmg_CM", "nn", "nn_NO", "nnh", "nnh_CM", "no", "no_NO", "no_NO_NY", + "nus", "nus_SS", "nyn", "nyn_UG", + // O + "om", "om_ET", "om_KE", "or", "or_IN", "os", "os_GE", "os_RU", + // P + "pa", "pa_Arab", "pa_Arab_PK", "pa_Guru", "pa_Guru_IN", "pa_IN", "pa_PK", "pl", + "pl_PL", "ps", "ps_AF", "ps_PK", "pt", "pt_AO", "pt_BR", "pt_CH", "pt_CV", "pt_GQ", + "pt_GW", "pt_LU", "pt_MO", "pt_MZ", "pt_PT", "pt_ST", "pt_TL", + // Q + "qu", "qu_BO", "qu_EC", "qu_PE", + // R + "rm", "rm_CH", "rn", "rn_BI", "ro", "ro_MD", "ro_RO", "rof", "rof_TZ", "ru", + "ru_BY", "ru_KG", "ru_KZ", "ru_MD", "ru_RU", "ru_UA", "rw", "rw_RW", "rwk", "rwk_TZ", + // S + "sah", "sah_RU", "saq", "saq_KE", "sbp", "sbp_TZ", "sd", "sd_PK", "se", "se_FI", + "se_NO", "se_SE", "seh", "seh_MZ", "ses", "ses_ML", "sg", "sg_CF", "sh", "sh_BA", + "sh_CS", "sh_YU", "shi", "shi_Latn", "shi_Latn_MA", "shi_Tfng", "shi_Tfng_MA", + "shi_MA", "si", "si_LK", "sk", "sk_SK", "sl", "sl_SI", "smn", "smn_FI", "sn", + "sn_ZW", "so", "so_DJ", "so_ET", "so_KE", "so_SO", "sq", "sq_AL", "sq_MK", "sq_XK", + "sr", "sr_Cyrl", "sr_Cyrl_BA", "sr_Cyrl_ME", "sr_Cyrl_RS", "sr_Cyrl_CS", "sr_Cyrl_XK", + "sr_Cyrl_YU", "sr_Latn", "sr_Latn_BA", "sr_Latn_ME", "sr_Latn_RS", "sr_Latn_CS", + "sr_Latn_XK", "sr_Latn_YU", "sr_BA", "sr_ME", "sr_RS", "sr_CS", "sr_XK", "sr_YU", + "sv", "sv_AX", "sv_FI", "sv_SE", "sw", "sw_CD", "sw_KE", "sw_TZ", "sw_UG", + // T + "ta", "ta_IN", "ta_LK", "ta_MY", "ta_SG", "te", "te_IN", "teo", "teo_KE", "teo_UG", + "tg", "tg_TJ", "th", "th_TH", "th_TH_TRADITIONAL", "ti", "ti_ER", "ti_ET", 
"tk", + "tk_TM", "tl", "tl_PH", "to", "to_TO", "tr", "tr_CY", "tr_TR", "tt", "tt_RU", + "twq", "twq_NE", "tzm", "tzm_MA", + // U + "ug", "ug_CN", "uk", "uk_UA", "ur", "ur_IN", "ur_PK", "uz", "uz_AF", "uz_Arab", + "uz_Arab_AF", "uz_Cyrl", "uz_Cyrl_UZ", "uz_Latn", "uz_Latn_UZ", "uz_UZ", + // V + "vai", "vai_Latn", "vai_Latn_LR", "vai_LR", "vai_Vaii", "vai_Vaii_LR", "vi", + "vi_VN", "vun", "vun_TZ", + // W + "wae", "wae_CH", "wo", "wo_SN", + // X + "xh", "xh_ZA", "xog", "xog_UG", + // Y + "yav", "yav_CM", "yi", "yi_001", "yo", "yo_BJ", "yo_NG", "yue", "yue_CN", "yue_HK", + "yue_Hans", "yue_Hans_CN", "yue_Hant", "yue_Hant_HK", + // Z + "zgh", "zgh_MA", "zh", "zh_Hans", "zh_Hans_CN", "zh_Hans_HK", "zh_Hans_MO", + "zh_Hans_SG", "zh_Hant", "zh_Hant_HK", "zh_Hant_MO", "zh_Hant_TW", "zh_CN", + "zh_HK", "zh_MO", "zh_SG", "zh_TW", "zu", "zu_ZA"); + + private static final ImmutableSet COLL_LOCALE_IDS = ImmutableSet.of( + "root", + // A-B + "af", "am", "ars", "ar", "as", "az", "be", "bg", "bn", "bo", "bs_Cyrl", "bs", + // C-F + "ca", "ceb", "chr", "cs", "cy", "da", "de_AT", "de", "dsb", "dz", "ee", "el", "en", + "en_US_POSIX", "en_US", "eo", "es", "et", "fa_AF", "fa", "fil", "fi", "fo", "fr_CA", "fr", + // G-J + "ga", "gl", "gu", "ha", "haw", "he", "hi", "hr", "hsb", "hu", "hy", + "id_ID", "id", "ig", "in", "in_ID", "is", "it", "iw_IL", "iw", "ja", + // K-P + "ka", "kk", "kl", "km", "kn", "kok", "ko", "ku", "ky", "lb", "lkt", "ln", "lo", "lt", "lv", + "mk", "ml", "mn", "mo", "mr", "ms", "mt", "my", "nb", "ne", "nl", "nn", "no_NO", "no", + "om", "or", "pa_IN", "pa", "pa_Guru", "pl", "ps", "pt", + // R-T + "ro", "ru", "se", "sh_BA", "sh_CS", "sh", "sh_YU", "si", "sk", "sl", "smn", "sq", + "sr_BA", "sr_Cyrl_ME", "sr_Latn", "sr_ME", "sr_RS", "sr", "sv", "sw", + "ta", "te", "th", "tk", "to", "tr", + // U-Z + "ug", "uk", "ur", "uz", "vi", "wae", "wo", "xh", "yi", "yo", "yue_CN", "yue_Hans", + "yue", "zh_CN", "zh_Hant", "zh_HK", "zh_MO", "zh_SG", "zh_TW", "zh", "zu"); + + private 
static final ImmutableSet BRKITR_LOCALE_IDS = ImmutableSet.of( + "root", "de", "el", "en", "en_US_POSIX", "en_US", "es", "fr", "it", "ja", "pt", "ru", + "zh_Hant", "zh"); + + private static final ImmutableSet RBNF_LOCALE_IDS = ImmutableSet.of( + "root", "af", "ak", "am", "ars", "ar", "az", "be", "bg", "bs", "ca", "ccp", "chr", "cs", + "cy", "da", "de_CH", "de", "ee", "el", "en_001", "en_IN", "en", "eo", "es_419", "es_DO", + "es_GT", "es_HN", "es_MX", "es_NI", "es_PA", "es_PR", "es_SV", "es", "es_US", "et", + "fa_AF", "fa", "ff", "fil", "fi", "fo", "fr_BE", "fr_CH", "fr", "ga", "he", "hi", "hr", + "hu", "hy", "id", "in", "is", "it", "iw", "ja", "ka", "kl", "km", "ko", "ky", "lb", + "lo", "lrc", "lt", "lv", "mk", "ms", "mt", "my", "nb", "nl", "nn", "no", "pl", "pt_PT", + "pt", "qu", "ro", "ru", "se", "sh", "sk", "sl", "sq", "sr_Latn", "sr", "sv", + "sw", "ta", "th", "tr", "uk", "vi", "yue_Hans", "yue", "zh_Hant_HK", "zh_Hant", "zh_HK", + "zh_MO", "zh_TW", "zh"); +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuData.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuData.java new file mode 100644 index 00000000000..63959d790d2 --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuData.java @@ -0,0 +1,165 @@ +// © 2019 and later: Unicode, Inc. and others. 
+// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.Preconditions.checkArgument; + +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import java.util.NavigableSet; +import java.util.Set; +import java.util.TreeSet; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +import com.google.common.collect.ArrayListMultimap; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ListMultimap; + +/** + * Mutable ICU data, represented as a mapping from resource bundle paths to a sequence of values. + */ +public final class IcuData { + private static final RbPath RB_VERSION = RbPath.of("Version"); + private static final Pattern ARRAY_INDEX = Pattern.compile("(/[^\\[]++)(?:\\[(\\d++)\\])?$"); + + private final String name; + private final boolean hasFallback; + private final NavigableSet paths = new TreeSet<>(); + private final ListMultimap rbPathToValues = ArrayListMultimap.create(); + private ImmutableList commentLines = ImmutableList.of(); + + /** + * IcuData constructor. + * + * @param name The name of the IcuData object, used as the name of the root node in the output file + * @param hasFallback true if the output file has another ICU file as a fallback. + */ + public IcuData(String name, boolean hasFallback) { + this.hasFallback = hasFallback; + this.name = name; + } + + /** @return whether data should fallback on data in other ICU files. */ + public boolean hasFallback() { + return hasFallback; + } + + /** + * @return the name of this ICU data instance. Used in the output filename, and in comments. + */ + public String getName() { + return name; + } + + /** Sets additional comment lines for the top of the file. */ + public void setFileComment(String... 
commentLines) { + setFileComment(Arrays.asList(commentLines)); + } + + public void setFileComment(Iterable commentLines) { + this.commentLines = ImmutableList.copyOf(commentLines); + } + + public List getFileComment() { + return commentLines; + } + + /** Adds a singleton resource bundle value for a given path. */ + public void add(RbPath rbPath, String element) { + add(rbPath, RbValue.of(element)); + } + + /** Adds a single resource bundle value for a given path. */ + public void add(RbPath rbPath, RbValue rbValue) { + rbPathToValues.put(rbPath, rbValue); + paths.add(rbPath); + } + + /** Adds a sequence of resource bundle values for a given path. */ + public void add(RbPath rbPath, Iterable rbValues) { + rbValues.forEach(v -> rbPathToValues.put(rbPath, v)); + paths.add(rbPath); + } + + /** Replaces all resource bundle values for a given path with the specified singleton value. */ + public void replace(RbPath rbPath, String element) { + rbPathToValues.removeAll(rbPath); + rbPathToValues.put(rbPath, RbValue.of(element)); + paths.add(rbPath); + } + + /** Replaces all resource bundle values for a given path with the specified value. */ + public void replace(RbPath rbPath, RbValue rbValue) { + rbPathToValues.removeAll(rbPath); + add(rbPath, rbValue); + } + + public void setVersion(String versionString) { + add(RB_VERSION, versionString); + } + + public void addResults(ListMultimap resultsByRbPath) { + for (RbPath rbPath : resultsByRbPath.keySet()) { + for (PathValueTransformer.Result r : resultsByRbPath.get(rbPath)) { + if (r.isGrouped()) { + // Grouped results have all the values in a single value entry. + add(rbPath, RbValue.of(r.getValues())); + } else { + if (rbPath.getSegment(rbPath.length() - 1).endsWith(":alias")) { + r.getValues().forEach(v -> add(rbPath, RbValue.of(v))); + } else { + // Ungrouped results are one value per entry, but might be expanded into + // grouped results if they are a path referencing a grouped entry. 
+ r.getValues().forEach(v -> add(rbPath, replacePathValues(v))); + } + } + } + } + + /** + * Replaces an ungrouped CLDR value of the form "/foo/bar" or "/foo/bar[N]" which is assumed + * to be a reference to an existing value in a resource bundle. Note that the referenced bundle + * might be grouped (i.e. an array with more than one element). + */ + private RbValue replacePathValues(String value) { + Matcher m = ARRAY_INDEX.matcher(value); + if (!m.matches()) { + return RbValue.of(value); + } + // The only constraint is that the "path" value starts with a leading '/', but parsing into + // the RbPath ignores this. We must use "parse()" here, rather than RbPath.of(), since the + // captured value contains '/' characters to represent path delimiters. + RbPath replacePath = RbPath.parse(m.group(1)); + List<RbValue> replaceValues = get(replacePath); + checkArgument(replaceValues != null, "Path %s is missing from IcuData", replacePath); + // If no index is given (e.g. "/foo/bar") then treat it as index 0 (i.e. "/foo/bar[0]"). + int replaceIndex = m.group(2) != null ? Integer.parseInt(m.group(2)) : 0; + return replaceValues.get(replaceIndex); + } + + /** + * Returns the mutable list of values associated with the given path (or null if there are no + * associated values). + */ + public List<RbValue> get(RbPath rbPath) { + return paths.contains(rbPath) ? rbPathToValues.get(rbPath) : null; + } + + /** Returns an unmodifiable view of the set of paths in this instance. */ + public Set<RbPath> getPaths() { + return Collections.unmodifiableSet(paths); + } + + /** Returns whether the given path is present in this instance. */ + public boolean contains(RbPath rbPath) { + return paths.contains(rbPath); + } + + /** Returns whether there are any paths in this instance. 
*/ + public boolean isEmpty() { + return paths.isEmpty(); + } +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuDataDumper.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuDataDumper.java new file mode 100644 index 00000000000..13dbcd33dd4 --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuDataDumper.java @@ -0,0 +1,381 @@ +// © 2019 and later: Unicode, Inc. and others. +// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.CharMatcher.whitespace; +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkElementIndex; +import static com.google.common.base.Preconditions.checkNotNull; +import static com.google.common.base.Preconditions.checkState; +import static com.google.common.collect.ImmutableList.toImmutableList; + +import java.io.IOException; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.Paths; +import java.util.ArrayDeque; +import java.util.ArrayList; +import java.util.Deque; +import java.util.List; +import java.util.Optional; +import java.util.function.Function; +import java.util.function.Predicate; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import java.util.stream.Stream; + +import com.google.common.base.Joiner; +import com.google.common.collect.ArrayListMultimap; +import com.google.common.collect.HashMultiset; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableSetMultimap; +import com.google.common.collect.Iterables; +import com.google.common.collect.ListMultimap; +import com.google.common.collect.Lists; +import com.google.common.collect.Multiset; + +/** + * Helper tool to dump the resource bundle paths and values from an IcuData instance in a stable + * ordering, to allow easy comparison in cases 
where ICU ordering changes. This could easily be + * extended to be a more fully featured "diff" tool or a proper ICU data file parser. + * + *

This is a temporary debugging tool and should not be relied upon during any part of the data + * generation process. + */ +final class IcuDataDumper { + private static final Joiner LIST_JOINER = Joiner.on(','); + private static final RbPath VERSION = RbPath.of("Version"); + + public static void main(String... args) throws IOException { + Path fileOrDir; + Optional<Pattern> name = Optional.empty(); + switch (args.length) { + case 2: + name = Optional.of(Pattern.compile(args[1])); + case 1: + fileOrDir = Paths.get(args[0]); + break; + default: + throw new IllegalArgumentException("Usage: []"); + } + + if (Files.isDirectory(fileOrDir)) { + walkDirectory(fileOrDir, name); + } else { + checkArgument(!name.isPresent(), + "cannot specify a name pattern for a non-directory file: %s", fileOrDir); + IcuDataParser parser = new IcuDataParser(fileOrDir); + parser.parse(); + dump(parser.icuData); + } + } + + private static void walkDirectory(Path fileOrDir, Optional<Pattern> name) throws IOException { + Predicate<Path> matchesName = + f -> name.map(n -> n.matcher(f.getFileName().toString()).matches()).orElse(true); + List<IcuDataParser> icuParsers; + try (Stream<Path> files = Files.walk(fileOrDir)) { + icuParsers = files + .filter(Files::isRegularFile) + .filter(matchesName) + .map(IcuDataParser::new) + .collect(toImmutableList()); + } + ListMultimap<RbPath, RbValue> allPaths = ArrayListMultimap.create(); + for (IcuDataParser p : icuParsers) { + p.parse(); + for (RbPath k : p.icuData.keySet()) { + List<RbValue> values = p.icuData.get(k); + if (!allPaths.containsKey(k)) { + allPaths.putAll(k, values); + } else if (!VERSION.equals(k)) { + checkState(allPaths.get(k).equals(values), "inconsistent data for path: %s", k); + } + } + } + dump(allPaths); + } + + private static void dump(ListMultimap<RbPath, RbValue> allPaths) { + allPaths.keySet().stream() + .sorted() + .forEach(k -> System.out.println(k + " :: " + LIST_JOINER.join(allPaths.get(k)))); + } + + private static final class IcuDataParser { + // Path of file being parsed. 
+ private final Path path; + + // Comments in header (before data starts), without comment characters. + private final List<String> headerComment = new ArrayList<>(); + // ICU data name (the name of the root element). + private String name = null; + // ICU data values. + private final ListMultimap<RbPath, RbValue> icuData = ArrayListMultimap.create(); + + // Current line number (1-indexed). + private int lineNumber = 0; + // The type of the previous line that was processed. + private LineType lastType = LineType.COMMENT; + // True when inside /* .. */ comments in the header. + private boolean inBlockComment = false; + // True when in the final top-level group at the end of parsing. + private boolean inFinalGroup = false; + // True when a partial (line wrapped) value has been read. + private boolean isLineContinuation = false; + // Current path while parsing (NOT including the root element). + private Deque<String> pathStack = new ArrayDeque<>(); + // Current sequence of values for the path (as defined in the current path stack). + private List<String> currentValue = new ArrayList<>(); + // Current partially read value of a multi-line value. + private String wrappedValue = ""; + // Map of indices used to auto-generate names for anonymous path segments. + // TODO: Check if this is even needed and remove if not. + private Multiset<Integer> indices = HashMultiset.create(); + + IcuDataParser(Path path) { + this.path = checkNotNull(path); + } + + public boolean parse() throws IOException { + List<String> lines = Files.readAllLines(path); + // Best approximation to a magic number we have (BOM plus inline comment). This stops + // us trying to parse the transliteration files, which are a different type. + if (!lines.get(0).startsWith("\uFEFF//")) { + return false; + } + lines.stream().map(whitespace()::trimFrom).forEach(this::processLineWithCheck); + + // Sanity check for expected final state. 
Just checking the "lastType" should be enough + // to catch everything else (due to transition rules and how the code tidies up) but it + // seems prudent to sanity check everything just in case. + checkState(lastType == LineType.GROUP_END); + checkState(!inBlockComment); + checkState(name != null); + checkState(pathStack.isEmpty() && inFinalGroup); + checkState(wrappedValue.isEmpty() && currentValue.isEmpty()); + return true; + } + + void processLineWithCheck(String line) { + lineNumber++; + if (lineNumber == 1 && line.startsWith("\uFEFF")) { + line = line.substring(1); + } + try { + processLine(line); + } catch (RuntimeException e) { + throw new RuntimeException( + String.format("[%s:%s] %s (%s)", path, lineNumber, e.getMessage(), line), + e); + } + } + + void processLine(String line) { + line = maybeTrimEndOfLineComment(line); + if (line.isEmpty()) { + return; + } + LineMatch match = LineType.match(line, inBlockComment); + checkState(match.getType().isValidTransitionFrom(lastType), + "invalid state transition: %s --//-> %s", lastType, match.getType()); + boolean isEndOfWrappedValue = false; + switch (match.getType()) { + case COMMENT: + if (name != null) { + // Comments in data are ignored since they cannot be properly associated with + // paths or values in an IcuData instance (only legacy tooling emits these). 
+ break; + } + if (line.startsWith("/*")) { + inBlockComment = true; + } + headerComment.add(match.get(0)); + if (inBlockComment && line.contains("*/")) { + checkState(line.indexOf("*/") == line.length() - 2, + "unexpected end of comment block"); + inBlockComment = false; + } + break; + + case INLINE_VALUE: + icuData.put( + getPathFromStack().extendBy(getSegment(match.get(0))), + RbValue.of(unquote(match.get(1)))); + break; + + case GROUP_START: + checkState(currentValue.isEmpty()); + if (name == null) { + name = match.get(0); + checkState(name != null, "cannot have anonymous top-level group"); + } else { + pathStack.push(getSegment(match.get(0))); + } + wrappedValue = ""; + isLineContinuation = false; + break; + + case QUOTED_VALUE: + wrappedValue += unquote(match.get(0)); + isLineContinuation = !line.endsWith(","); + if (!isLineContinuation) { + currentValue.add(wrappedValue); + wrappedValue = ""; + } + break; + + case VALUE: + checkState(!isLineContinuation, "unexpected unquoted value"); + currentValue.add(match.get(0)); + break; + + case GROUP_END: + // Account for quoted values without trailing ',' just before group end. + if (isLineContinuation) { + currentValue.add(wrappedValue); + isLineContinuation = false; + } + // Emit the collection sequence of values for the current path as an RbValue. + if (!currentValue.isEmpty()) { + icuData.put(getPathFromStack(), RbValue.of(currentValue)); + currentValue.clear(); + } + // Annoyingly the name is outside the stack so the stack will empty before the last + // end group. 
+ if (!pathStack.isEmpty()) { + pathStack.pop(); + indices.setCount(pathStack.size(), 0); + } else { + checkState(!inFinalGroup, "unexpected group end"); + inFinalGroup = true; + } + break; + + case UNKNOWN: + throw new IllegalStateException("cannot parse line: " + match.get(0)); + } + lastType = match.getType(); + } + + private RbPath getPathFromStack() { + if (pathStack.isEmpty()) { + return RbPath.empty(); + } + List segments = new ArrayList<>(); + Iterables.addAll(segments, pathStack); + if (segments.get(0).matches("<[0-9]{4}>")) { + segments.remove(0); + } + return segments.isEmpty() ? RbPath.empty() : RbPath.of(Lists.reverse(segments)); + } + + private String getSegment(String segmentOrNull) { + if (segmentOrNull != null) { + return segmentOrNull; + } + int depth = pathStack.size(); + int index = indices.count(depth); + indices.add(depth, 1); + return String.format("<%04d>", index); + } + + private String maybeTrimEndOfLineComment(String line) { + // Once the name is set, we are past the header and into the data. + if (name != null) { + // Index to search for '//' from - must skip quoted values. + int startIdx = line.startsWith("\"") ? 
line.indexOf('"', 1) + 1 : 0; + int commentIdx = line.indexOf("//", startIdx); + if (commentIdx != -1) { + line = whitespace().trimTrailingFrom(line.substring(0, commentIdx)); + } + } + return line; + } + + private static String unquote(String s) { + if (s.startsWith("\"") && s.endsWith("\"")) { + return s.substring(1, s.length() - 1).replaceAll("\\\\([\"\\\\])", "$1"); + } + checkState(!s.contains("\""), "invalid unquoted value: %s", s); + return s; + } + + private static final class LineMatch { + private final LineType type; + private final Function args; + + LineMatch(LineType type, Function args) { + this.type = checkNotNull(type); + this.args = checkNotNull(args); + } + + String get(int n) { + return args.apply(n); + } + + LineType getType() { + return type; + } + } + + private enum LineType { + // Comment _start_ with any comment value captured. + COMMENT("(?://|/\\*)\\s*(.*)"), + // A combination of GROUP_START, VALUE and GROUP_END with whitespace. + INLINE_VALUE("(?:(.*\\S)\\s*)?\\{\\s*((?:\".*\")|(?:[^\"{}]*\\S))\\s*\\}"), + // Allows for empty segment names (anonymous arrays) which match 'null'. + GROUP_START("(?:(.*\\S)\\s*)?\\{"), + GROUP_END("\\}"), + QUOTED_VALUE("(\".*\"),?"), + VALUE("([^\"{}]+),?"), + UNKNOWN(".*"); + + // Table of allowed transitions expected during parsing. 
+ // key=current state, values=set of permitted previous states + private static ImmutableSetMultimap<LineType, LineType> TRANSITIONS = + ImmutableSetMultimap.<LineType, LineType>builder() + .putAll(COMMENT, COMMENT) + .putAll(INLINE_VALUE, COMMENT, INLINE_VALUE, GROUP_START, GROUP_END) + .putAll(GROUP_START, COMMENT, GROUP_START, GROUP_END, INLINE_VALUE) + .putAll(VALUE, GROUP_START, VALUE, QUOTED_VALUE) + .putAll(QUOTED_VALUE, GROUP_START, VALUE, QUOTED_VALUE) + .putAll(GROUP_END, GROUP_END, INLINE_VALUE, VALUE, QUOTED_VALUE) + .build(); + + private final Pattern pattern; + + LineType(String regex) { + this.pattern = Pattern.compile(regex); + } + + boolean isValidTransitionFrom(LineType lastType) { + return TRANSITIONS.get(this).contains(lastType); + } + + static LineMatch match(String line, boolean inBlockComment) { + // Block comments kinda suck and it'd be great if the ICU data only used '//' style + // comments (it would definitely simplify any parsers out there). Once the + // transition to the new transformation tools is complete, they can be changed to + // only emit '//' style comments. + if (inBlockComment) { + if (line.startsWith("*")) { + line = whitespace().trimLeadingFrom(line.substring(1)); + } + return new LineMatch(COMMENT, ImmutableList.of(line)::get); + } + for (LineType type : TRANSITIONS.keySet()) { + // Regex groups start at 1, but we want the getter function to be zero-indexed. 
+ Matcher m = type.pattern.matcher(line); + if (m.matches()) { + return new LineMatch(type, n -> { + checkElementIndex(n, m.groupCount()); + return m.group(n + 1); + }); + } + } + return new LineMatch(UNKNOWN, ImmutableList.of(line)::get); + } + } + } +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuFunctions.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuFunctions.java new file mode 100644 index 00000000000..a2f83139e05 --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuFunctions.java @@ -0,0 +1,209 @@ +// © 2019 and later: Unicode, Inc. and others. +// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static java.lang.Integer.parseInt; + +import java.time.LocalDate; +import java.time.LocalDateTime; +import java.time.ZoneOffset; +import java.util.function.Function; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +import com.google.common.base.Ascii; +import com.google.common.base.CharMatcher; +import com.google.common.collect.ImmutableMap; +import org.unicode.icu.tool.cldrtoicu.regex.NamedFunction; + +/** + * The named functions used by the {@code RegexTransformer} for {@code ldml2icu_supplemental.txt}. + */ +final class IcuFunctions { + /** + * Converts an ISO date string to a space-separated pair of integer values representing the top + * and bottom parts of a deconstructed millisecond epoch value (i.e. {@code + * " "}). + * + *

Note that the values are formatted as signed decimal values, so it's entirely + * possible that the low bits value will appear as a negative number (the high bits won't + * appear negative for many thousands of years). + * + *

    + *
  • args[0] = ISO date string (e.g. "2019-05-23") + *
  • args[1] = Date field type name (e.g. "from") + *
+ */ + static final NamedFunction DATE_FN = + NamedFunction.create("date", 2, args -> { + long millis = + DateFieldType.toEnum(args.get(1)).toEpochMillis(LocalDate.parse(args.get(0))); + // Strictly speaking the masking is redundant and could be removed. + int hiBits = (int) ((millis >>> 32) & 0xFFFFFFFFL); + int loBits = (int) (millis & 0xFFFFFFFFL); + return hiBits + " " + loBits; + }); + + // TODO(dbeaumont): Improve this documentation (e.g. why is this being done, give examples?). + /** + * Inserts '%' into numberingSystems descriptions. + * + *
    + *
  • args[0] = numbering system description (string) + *
+ */ + static final NamedFunction ALGORITHM_FN = + NamedFunction.create("algorithm", 1, args -> { + String value = args.get(0); + int percentPos = value.lastIndexOf('/') + 1; + return value.substring(0, percentPos) + '%' + value.substring(percentPos); + }); + + /** + * Converts a number into a special integer that represents the number in normalized scientific + * notation for ICU's RB parser. + * + *

Resultant integers are in the form "xxyyyyyy", where "xx" is the exponent offset by 50 + * and "yyyyyy" is the coefficient to 5 decimal places. Results may also have a leading '-' to + * denote negative values. + * + *

For example: + *

{@code
+     * 14660000000000 -> 1.466E13    -> 63146600
+     * 0.0001         -> 1E-4        -> 46100000
+     * -123.456       -> -1.23456E2  -> -52123456
+     * }
+ * + *

The additional exponent offset is applied directly to the calculated exponent and is used + * to do things like converting percentages into their decimal representation (i.e. by passing + * a value of "-2"). + * + *

    + *
  • args[0] = number to be converted (double) + *
  • args[1] = additional exponent offset (integer) + *
+ */ + static final NamedFunction EXP_FN = + NamedFunction.create("exp", 2, args -> { + double value = Double.parseDouble(args.get(0)); + if (value == 0) { + return "0"; + } + int exponent = 50; + if (args.size() == 2) { + exponent += Integer.parseInt(args.get(1)); + } + String sign = value >= 0 ? "" : "-"; + value = Math.abs(value); + while (value >= 10) { + value /= 10; + exponent++; + } + while (value < 1) { + value *= 10; + exponent--; + } + if (exponent < 0 || exponent > 99) { + throw new IllegalArgumentException("Exponent out of bounds: " + exponent); + } + return sign + exponent + Math.round(value * 100000); + }); + + // Allow for single digit values in any part and negative year values. + private static final Pattern YMD = Pattern.compile("(-?[0-9]+)-([0-9]{1,2})-([0-9]{1,2})"); + + /** + * Converts an ISO date string (i.e. "YYYY-MM-DD") into an ICU date string, which is + * the same but with spaces instead of hyphens. Since functions are expanded before the + * resulting value is split, this function will result in 3 separate values being created, + * unless the function call is enclosed in quotes. + * + *

Note that for some cases (e.g. "eras") the year part can be negative (e.g. "-2165-1-1") + * so this is not as simple as "split by hyphen". + * + *

    + *
  • args[0] = ISO date string (e.g. "2019-05-23" or "-2165-1-1") + *
+ */ + static final NamedFunction YMD_FN = + NamedFunction.create("ymd", 1, args -> { + Matcher m = YMD.matcher(args.get(0)); + checkArgument(m.matches(), "invalid year-month-day string: %s", args.get(0)); + // NOTE: Re-parsing is not optional since it removes leading zeros (needed for ICU). + return String.format("%s %s %s", + parseInt(m.group(1)), parseInt(m.group(2)), parseInt(m.group(3))); + }); + + // For transforming day-of-week identifiers. + private static final ImmutableMap<String, String> WEEKDAY_MAP_ID = + ImmutableMap.<String, String>builder() + .put("sun", "1") + .put("mon", "2") + .put("tues", "3") + .put("wed", "4") + .put("thu", "5") + .put("fri", "6") + .put("sat", "7") + .build(); + + /** + * Converts a day-of-week identifier into its ordinal value (e.g. "sun" --> 1, "mon" --> 2 ...). + */ + static final NamedFunction DAY_NUMBER_FN = + NamedFunction.create("day_number", 1, + args -> { + String id = WEEKDAY_MAP_ID.get(args.get(0)); + checkArgument(id != null, "unknown weekday: %s", args.get(0)); + return id; + }); + + // For transform IDs in <contextTransform> elements. + private static final ImmutableMap<String, String> TRANSFORM_ID_MAP = + ImmutableMap.of("no-change", "0", "titlecase-firstword", "1"); + + /** + * Converts the transform type in the {@code <contextTransform>} element into its ICU index + * (e.g. "titlecase-firstword" --> 1). + */ + static final NamedFunction CONTEXT_TRANSFORM_INDEX_FN = + NamedFunction.create("context_transform_index", 1, + args -> { + String id = TRANSFORM_ID_MAP.get(args.get(0)); + checkArgument(id != null, "unknown contextTransform: %s", args.get(0)); + return id; + }); + + // For DATE_FN only. + private enum DateFieldType { + from(LocalDate::atStartOfDay), + // Remember that atTime() takes nanoseconds, not micro or milli. 
+ to(d -> d.atTime(23, 59, 59, 999_000_000)); + + private final Function adjustFn; + + DateFieldType(Function adjustFn) { + this.adjustFn = adjustFn; + } + + long toEpochMillis(LocalDate date) { + return adjustFn.apply(date).toInstant(ZoneOffset.UTC).toEpochMilli(); + } + + static DateFieldType toEnum(String value) { + switch (Ascii.toLowerCase(CharMatcher.whitespace().trimFrom(value))) { + case "from": + case "start": + return from; + case "to": + case "end": + return to; + default: + throw new IllegalArgumentException(value + " is not a valid date field type"); + } + } + } + + private IcuFunctions() {} +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuTextWriter.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuTextWriter.java new file mode 100644 index 00000000000..c5f2fe89178 --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/IcuTextWriter.java @@ -0,0 +1,313 @@ +// © 2019 and later: Unicode, Inc. and others. +// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.Preconditions.checkNotNull; +import static java.util.stream.Collectors.joining; + +import java.io.IOException; +import java.io.PrintWriter; +import java.io.Writer; +import java.nio.file.Files; +import java.nio.file.Path; +import java.util.List; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * Writes an IcuData object to a text file. A lot of this class was copied directly from the + * original {@code IcuTextWriter} in the CLDR project and has a number of very idiosyncratic + * behaviours. The behaviour of this class is currently tuned to produce perfect parity with + * the original conversion tools, but once migration of the tools is complete, it should + * probably be revisited and tidied up. 
+ */ +// TODO: Link to a definitive specification for the ICU data files and remove the hacks! +final class IcuTextWriter { + private static final String INDENT = " "; + // List of characters to escape in UnicodeSets + // ('\' followed by any of '\', '[', ']', '{', '}', '-', '&', ':', '^', '='). + private static final Pattern UNICODESET_ESCAPE = + Pattern.compile("\\\\[\\\\\\[\\]\\{\\}\\-&:^=]"); + // Only escape \ and " from other strings. + private static final Pattern STRING_ESCAPE = Pattern.compile("(?!')\\\\\\\\(?!')"); + private static final Pattern QUOTE_ESCAPE = Pattern.compile("\\\\?\""); + + /** Write a file in ICU data format with the specified header. */ + static void writeToFile(IcuData icuData, Path outDir, List header) { + try { + Files.createDirectories(outDir); + try (Writer w = Files.newBufferedWriter(outDir.resolve(icuData.getName() + ".txt")); + PrintWriter out = new PrintWriter(w)) { + new IcuTextWriter(icuData).writeTo(out, header); + } + } catch (IOException e) { + throw new RuntimeException("cannot write ICU data file: " + icuData.getName(), e); + } + } + + private final IcuData icuData; + private int depth = 0; + private boolean valueWasInline = false; + + IcuTextWriter(IcuData icuData) { + this.icuData = checkNotNull(icuData); + } + + // TODO: Write a UTF-8 header (see https://unicode-org.atlassian.net/browse/ICU-10197). + private void writeTo(PrintWriter out, List header) throws IOException { + out.write('\uFEFF'); + writeHeaderAndComments(out, header, icuData.getFileComment()); + + // Write the ICU data to file. This takes the form: + // ---- + // { + // foo{ + // bar{baz} + // } + // } + // ---- + // So it's like every RbPath has an implicit prefix of the IcuData name. + String root = icuData.getName(); + if (!icuData.hasFallback()) { + root += ":table(nofallback)"; + } + // TODO: Replace with "open(root, out)" once happy with differences (it adds a blank line). 
+ out.print(root); + out.print("{"); + depth++; + + RbPath lastPath = RbPath.empty(); + for (RbPath path : icuData.getPaths()) { + // Close any blocks up to the common path length. Since paths are all distinct, the + // common length should always be shorter than either path. We add 1 since we must also + // account for the implicit root segment. + int commonDepth = RbPath.getCommonPrefixLength(lastPath, path) + 1; + // Before closing, the "cursor" is at the end of the last value written. + closeLastPath(lastPath, commonDepth, out); + // After opening the value will be ready for the next value to be written. + openNextPath(path, out); + valueWasInline = appendValues(icuData.getName(), path, icuData.get(path), out); + lastPath = path; + } + closeLastPath(lastPath, 0, out); + out.println(); + out.close(); + } + + // Before: Cursor is at the end of the previous line. + // After: Cursor is positioned immediately after the last closed '}' + private void closeLastPath(RbPath lastPath, int minDepth, PrintWriter out) { + if (valueWasInline) { + depth--; + out.print('}'); + valueWasInline = false; + } + while (depth > minDepth) { + close(out); + } + } + + // Before: Cursor is at the end of the previous line. + // After: Cursor is positioned immediately after the newly opened '{' + private void openNextPath(RbPath path, PrintWriter out) { + while (depth <= path.length()) { + // The -1 is to adjust for the implicit root element which means indentation (depth) + // no longer matches the index of the segment we are writing. + open(path.getSegment(depth - 1), out); + } + } + + private void open(String label, PrintWriter out) { + newLineAndIndent(out); + depth++; + // This handles the "magic" pseudo indexing paths that are added by RegexTransformer. + // These take the form of "" and are used to ensure that path order can be + // well defined even for anonymous lists of items. 
+ if (!label.startsWith("<") && !label.endsWith(">")) { + out.print(label); + } + out.print('{'); + } + + private void close(PrintWriter out) { + depth--; + newLineAndIndent(out); + out.print('}'); + } + + private void newLineAndIndent(PrintWriter out) { + out.println(); + for (int i = 0; i < depth; i++) { + out.print(INDENT); + } + } + + // Currently the "header" uses '//' line comments but the comments are in a block. + // TODO: Sort this out so there isn't a messy mix of comment styles in the data files. + private static void writeHeaderAndComments( + PrintWriter out, List header, List comments) { + header.forEach(out::println); + if (!comments.isEmpty()) { + // TODO: Don't use /* */ block quotes, just use inline // quotes. + out.println( + comments.stream().collect(joining("\n * ", "/**\n * ", "\n */"))); + } + } + + /** Inserts padding and values between braces. */ + private boolean appendValues( + String name, RbPath rbPath, List values, PrintWriter out) { + + RbValue onlyValue; + boolean wasSingular = false; + boolean quote = !rbPath.isIntPath(); + boolean isSequence = rbPath.endsWith(RB_SEQUENCE); + if (values.size() == 1 && !mustBeArray(true, name, rbPath)) { + onlyValue = values.get(0); + if (onlyValue.size() == 1 && !mustBeArray(false, name, rbPath)) { + // Value has a single element and is not being forced to be an array. + String onlyElement = onlyValue.getElement(0); + if (quote) { + onlyElement = quoteInside(onlyElement); + } + // The numbers below are simply tuned to match the line wrapping in the original + // CLDR code. The behaviour it produces is sometimes strange (wrapping a line just + // for a single character) and could definitely be improved. + // TODO: Simplify this and add hysteresis to ensure less "jarring" line wrapping. + int maxWidth = Math.max(68, 80 - Math.min(4, rbPath.length()) * INDENT.length()); + if (onlyElement.length() <= maxWidth) { + // Single element for path: don't add newlines. 
+ printValue(out, onlyElement, quote); + wasSingular = true; + } else { + // Element too long to fit in one line, so wrap. + int end; + for (int i = 0; i < onlyElement.length(); i = end) { + end = goodBreak(onlyElement, i + maxWidth); + String part = onlyElement.substring(i, end); + newLineAndIndent(out); + printValue(out, part, quote); + } + } + } else { + // Only one array for the rbPath, so don't add an extra set of braces. + printArray(onlyValue, quote, isSequence, out); + } + } else { + for (RbValue value : values) { + if (value.size() == 1) { + // Single-value array: print normally. + printArray(value, quote, isSequence, out); + } else { + // Enclose this array in braces to separate it from other values. + open("", out); + printArray(value, quote, isSequence, out); + close(out); + } + } + } + return wasSingular; + } + + private static final RbPath RB_SEQUENCE = RbPath.of("Sequence"); + private static final RbPath RB_RULES = RbPath.of("rules"); + private static final RbPath RB_LOCALE_SCRIPT = RbPath.of("LocaleScript"); + private static final RbPath RB_ERAS = RbPath.of("eras"); + private static final RbPath RB_NAMED = RbPath.of("named"); + private static final RbPath RB_CALENDAR_PREFERENCE_DATA = RbPath.of("calendarPreferenceData"); + private static final RbPath RB_METAZONE_INFO = RbPath.of("metazoneInfo"); + + /** + * Wrapper for a hack to determine if the given rb path should always present its values as an + * array. + */ + // TODO: Verify this is still needed, and either make it less hacky, or delete it. + private static boolean mustBeArray(boolean topValues, String name, RbPath rbPath) { + if (topValues) { + // matches "rules/setNN" (hence the mucking about with raw segments). 
+ return name.equals("pluralRanges") + && rbPath.startsWith(RB_RULES) + && rbPath.getSegment(1).startsWith("set"); + } + return rbPath.equals(RB_LOCALE_SCRIPT) + || (rbPath.contains(RB_ERAS) + && !rbPath.getSegment(rbPath.length() - 1).endsWith(":alias") + && !rbPath.endsWith(RB_NAMED)) + || rbPath.startsWith(RB_CALENDAR_PREFERENCE_DATA) + || rbPath.startsWith(RB_METAZONE_INFO); + } + + private void printArray(RbValue rbValue, boolean quote, boolean isSequence, PrintWriter out) { + for (int n = 0; n < rbValue.size(); n++) { + newLineAndIndent(out); + printValue(out, quoteInside(rbValue.getElement(n)), quote); + if (!isSequence) { + out.print(","); + } + } + } + + private static void printValue(PrintWriter out, String value, boolean quote) { + if (quote) { + out.append('"').append(value).append('"'); + } else { + out.append(value); + } + } + + // Can a string be broken here? If not, backup until we can. + // TODO: Either don't bother line wrapping or look at making this use a line-break iterator. + private static int goodBreak(String quoted, int end) { + if (end > quoted.length()) { + return quoted.length(); + } + // Don't break escaped Unicode characters. + // Need to handle both e.g. \u4E00 and \U00020000 + for (int i = end - 1; i > end - 10;) { + char current = quoted.charAt(i--); + if (!Character.toString(current).matches("[0-9A-Fa-f]")) { + if ((current == 'u' || current == 'U') && i > end - 10 + && quoted.charAt(i) == '\\') { + return i; + } + break; + } + } + while (end > 0) { + char ch = quoted.charAt(end - 1); + if (ch != '\\' && (ch < '\uD800' || ch > '\uDFFF')) { + break; + } + --end; + } + return end; + } + + // Fix characters inside strings. + private static String quoteInside(String item) { + // Unicode-escape all quotes. + item = QUOTE_ESCAPE.matcher(item).replaceAll("\\\\u0022"); + // Double up on backslashes, ignoring Unicode-escaped characters. + Pattern pattern = + item.startsWith("[") && item.endsWith("]") ? 
UNICODESET_ESCAPE : STRING_ESCAPE; + Matcher matcher = pattern.matcher(item); + + if (!matcher.find()) { + return item; + } + StringBuilder buffer = new StringBuilder(); + int start = 0; + do { + buffer.append(item, start, matcher.start()); + int punctuationChar = item.codePointAt(matcher.end() - 1); + buffer.append("\\"); + if (punctuationChar == '\\') { + buffer.append('\\'); + } + buffer.append(matcher.group()); + start = matcher.end(); + } while (matcher.find()); + buffer.append(item.substring(start)); + return buffer.toString(); + } +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/LdmlConverter.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/LdmlConverter.java new file mode 100644 index 00000000000..9d1aaa3738e --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/LdmlConverter.java @@ -0,0 +1,618 @@ +// © 2019 and later: Unicode, Inc. and others. +// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static org.unicode.icu.tool.cldrtoicu.LdmlConverterConfig.IcuLocaleDir.BRKITR; +import static org.unicode.icu.tool.cldrtoicu.LdmlConverterConfig.IcuLocaleDir.COLL; +import static org.unicode.icu.tool.cldrtoicu.LdmlConverterConfig.IcuLocaleDir.CURR; +import static org.unicode.icu.tool.cldrtoicu.LdmlConverterConfig.IcuLocaleDir.LANG; +import static org.unicode.icu.tool.cldrtoicu.LdmlConverterConfig.IcuLocaleDir.LOCALES; +import static org.unicode.icu.tool.cldrtoicu.LdmlConverterConfig.IcuLocaleDir.RBNF; +import static org.unicode.icu.tool.cldrtoicu.LdmlConverterConfig.IcuLocaleDir.REGION; +import static org.unicode.icu.tool.cldrtoicu.LdmlConverterConfig.IcuLocaleDir.UNIT; +import static org.unicode.icu.tool.cldrtoicu.LdmlConverterConfig.IcuLocaleDir.ZONE; +import static 
java.util.stream.Collectors.toList; +import static org.unicode.cldr.api.CldrDataType.BCP47; +import static org.unicode.cldr.api.CldrDataType.LDML; +import static org.unicode.cldr.api.CldrDataType.SUPPLEMENTAL; + +import java.io.IOException; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.Paths; +import java.util.Arrays; +import java.util.Collection; +import java.util.HashSet; +import java.util.LinkedHashMap; +import java.util.LinkedHashSet; +import java.util.List; +import java.util.Map; +import java.util.Optional; +import java.util.Set; +import java.util.TreeSet; +import java.util.function.Consumer; +import java.util.function.Function; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import org.unicode.cldr.api.CldrData; +import org.unicode.cldr.api.CldrDataSupplier; +import org.unicode.cldr.api.CldrDataType; + +import com.google.common.base.CharMatcher; +import com.google.common.collect.HashMultimap; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableMap; +import com.google.common.collect.ImmutableSet; +import com.google.common.collect.LinkedListMultimap; +import com.google.common.collect.ListMultimap; +import com.google.common.collect.SetMultimap; +import com.google.common.collect.Sets; +import com.google.common.io.CharStreams; +import org.unicode.icu.tool.cldrtoicu.LdmlConverterConfig.IcuLocaleDir; +import org.unicode.icu.tool.cldrtoicu.mapper.Bcp47Mapper; +import org.unicode.icu.tool.cldrtoicu.mapper.BreakIteratorMapper; +import org.unicode.icu.tool.cldrtoicu.mapper.CollationMapper; +import org.unicode.icu.tool.cldrtoicu.mapper.DayPeriodsMapper; +import org.unicode.icu.tool.cldrtoicu.mapper.LocaleMapper; +import org.unicode.icu.tool.cldrtoicu.mapper.PluralRangesMapper; +import org.unicode.icu.tool.cldrtoicu.mapper.PluralsMapper; +import org.unicode.icu.tool.cldrtoicu.mapper.RbnfMapper; +import 
org.unicode.icu.tool.cldrtoicu.mapper.SupplementalMapper; +import org.unicode.icu.tool.cldrtoicu.mapper.TransformsMapper; +import org.unicode.icu.tool.cldrtoicu.regex.RegexTransformer; + +/** + * The main converter tool for CLDR to ICU data. To run this tool, you need to supply a suitable + * {@link LdmlConverterConfig} instance. There is a simple {@code main()} method available in this + * class which can be invoked passing just the desired output directory and which relies on the + * presence of several system properties for the remainder of its parameters: + *
<ul> + *
<li>CLDR_DIR: The root of the CLDR release from which CLDR data is read. + *
<li>ICU_DIR: The root of the ICU release from which additional "specials" XML data is read. + *
<li>CLDR_DTD_CACHE: A temporary directory with the various DTDs cached (this is a legacy + * requirement from the underlying CLDR libraries and might go away one day). + *
</ul>
+ */ +public final class LdmlConverter { + // TODO: Do all supplemental data in one go and split similarly to locale data (using RbPath). + private static final PathMatcher GENDER_LIST_PATHS = + supplementalMatcher("gender"); + private static final PathMatcher LIKELY_SUBTAGS_PATHS = + supplementalMatcher("likelySubtags"); + private static final PathMatcher METAZONE_PATHS = + supplementalMatcher("metaZones", "primaryZones"); + private static final PathMatcher METADATA_PATHS = + supplementalMatcher("metadata"); + private static final PathMatcher SUPPLEMENTAL_DATA_PATHS = + supplementalMatcher( + "calendarData", + "calendarPreferenceData", + "codeMappings", + "codeMappingsCurrency", + "idValidity", + "languageData", + "languageMatching", + "measurementData", + "parentLocales", + "subdivisionContainment", + "territoryContainment", + "territoryInfo", + "timeData", + "unitPreferenceData", + "weekData", + "weekOfPreference"); + private static final PathMatcher CURRENCY_DATA_PATHS = + supplementalMatcher("currencyData"); + private static final PathMatcher NUMBERING_SYSTEMS_PATHS = + supplementalMatcher("numberingSystems"); + private static final PathMatcher WINDOWS_ZONES_PATHS = + supplementalMatcher("windowsZones"); + + // Special IDs which are not supported via CLDR, but for which synthetic data is injected. + // The "TRADITIONAL" variants are here because their calendar differs from the non-variant + // locale. However CLDR cannot represent this currently because calendar defaults are in + // supplemental data (rather than locale data) and are keyed only on territory. + private static final ImmutableSet PHANTOM_LOCALE_IDS = + ImmutableSet.of("ja_JP_TRADITIONAL", "th_TH_TRADITIONAL"); + + // Special alias mapping which exists in ICU even though "no_NO_NY" is simply not a + // structurally valid locale ID. This is injected manually when creating the alias map. 
+ // This does mean that nobody can ever parse the _keys_ of the alias map, but so far there + // has been no need for that. + // TODO: Get "ars" into CLDR and remove this hack. + private static final Map PHANTOM_ALIASES = + ImmutableMap.of("ars", "ar_SA", "no_NO_NY", "nn_NO"); + + private static PathMatcher supplementalMatcher(String... spec) { + checkArgument(spec.length > 0, "must supply at least one matcher spec"); + if (spec.length == 1) { + return PathMatcher.of("supplementalData/" + spec[0]); + } + return PathMatcher.anyOf( + Arrays.stream(spec) + .map(s -> PathMatcher.of("supplementalData/" + s)) + .toArray(PathMatcher[]::new)); + } + + private static RbPath RB_PARENT = RbPath.of("%%Parent"); + // The quotes below are only so we achieve parity with the manually written alias files. + // TODO: Remove unnecessary quotes once the migration to this code is complete. + private static RbPath RB_ALIAS = RbPath.of("\"%%ALIAS\""); + // Special path for adding to empty files which only exist to complete the parent chain. + // TODO: Confirm that this has no meaningful effect and unify "empty" file contents. + private static RbPath RB_EMPTY_ALIAS = RbPath.of("___"); + + /** Provisional entry point until better config support exists. */ + public static void main(String... args) { + convert(IcuConverterConfig.builder() + .setOutputDir(Paths.get(args[0])) + .setEmitReport(true) + .build()); + } + + /** + * Output types defining specific subsets of the ICU data which can be converted separately. + * This closely mimics the original "NewLdml2IcuConverter" behaviour but could be simplified to + * hide what are essentially implementation specific data splits. 
+ */ + public enum OutputType { + LOCALES(LDML, LdmlConverter::processLocales), + BRKITR(LDML, LdmlConverter::processBrkitr), + COLL(LDML, LdmlConverter::processCollation), + RBNF(LDML, LdmlConverter::processRbnf), + + DAY_PERIODS( + SUPPLEMENTAL, + c -> c.processDayPeriods("misc")), + GENDER_LIST( + SUPPLEMENTAL, + c -> c.processSupplemental("genderList", GENDER_LIST_PATHS, "misc", false)), + LIKELY_SUBTAGS( + SUPPLEMENTAL, + c -> c.processSupplemental("likelySubtags", LIKELY_SUBTAGS_PATHS, "misc", false)), + SUPPLEMENTAL_DATA( + SUPPLEMENTAL, + c -> c.processSupplemental("supplementalData", SUPPLEMENTAL_DATA_PATHS, "misc", true)), + CURRENCY_DATA( + SUPPLEMENTAL, + c -> c.processSupplemental("supplementalData", CURRENCY_DATA_PATHS, "curr", true)), + METADATA( + SUPPLEMENTAL, + c -> c.processSupplemental("metadata", METADATA_PATHS, "misc", false)), + META_ZONES( + SUPPLEMENTAL, + c -> c.processSupplemental("metaZones", METAZONE_PATHS, "misc", false)), + NUMBERING_SYSTEMS( + SUPPLEMENTAL, + c -> c.processSupplemental("numberingSystems", NUMBERING_SYSTEMS_PATHS, "misc", false)), + PLURALS( + SUPPLEMENTAL, + c -> c.processPlurals("misc")), + PLURAL_RANGES( + SUPPLEMENTAL, + c -> c.processPluralRanges("misc")), + WINDOWS_ZONES( + SUPPLEMENTAL, + c -> c.processSupplemental("windowsZones", WINDOWS_ZONES_PATHS, "misc", false)), + TRANSFORMS( + SUPPLEMENTAL, + c -> c.processTransforms("translit")), + KEY_TYPE_DATA( + BCP47, + c -> c.processKeyTypeData("misc")), + + // Batching by type. 
+ DTD_LDML(LDML, c -> c.processAll(LDML)), + DTD_SUPPLEMENTAL(SUPPLEMENTAL, c -> c.processAll(SUPPLEMENTAL)), + DTD_BCP47(BCP47, c -> c.processAll(BCP47)); + + public static final ImmutableSet ALL = + ImmutableSet.of(DTD_BCP47, DTD_SUPPLEMENTAL, DTD_LDML); + + private final CldrDataType type; + private final Consumer converterFn; + + OutputType(CldrDataType type, Consumer converterFn) { + this.type = checkNotNull(type); + this.converterFn = checkNotNull(converterFn); + } + + void convert(LdmlConverter converter) { + converterFn.accept(converter); + } + + CldrDataType getCldrType() { + return type; + } + } + + private static void convert(LdmlConverterConfig config) { + CldrDataSupplier src = CldrDataSupplier + .forCldrFilesIn(config.getCldrDirectory()) + .withDraftStatusAtLeast(config.getMinimumDraftStatus()); + new LdmlConverter(config, src).convertAll(config); + } + + // The configuration controlling conversion behaviour. + private final LdmlConverterConfig config; + // The supplier for all data to be converted. + private final CldrDataSupplier src; + // The set of available locale IDs. + // TODO: Make available IDs include specials files (or fail if specials are not available). + private final ImmutableSet availableIds; + // Supplemental data available to mappers if needed. + private final SupplementalData supplementalData; + // Transformer for locale data. + private final PathValueTransformer localeTransformer; + // Transformer for supplemental data. + private final PathValueTransformer supplementalTransformer; + // Header string to go into every ICU data file. + private final ImmutableList icuFileHeader; + + private LdmlConverter(LdmlConverterConfig config, CldrDataSupplier src) { + this.config = checkNotNull(config); + this.src = checkNotNull(src); + this.supplementalData = SupplementalData.create(src.getDataForType(SUPPLEMENTAL)); + // Sort the set of available locale IDs but add "root" at the front. 
This is the + // set of non-alias locale IDs to be processed. + Set localeIds = new LinkedHashSet<>(); + localeIds.add("root"); + localeIds.addAll( + Sets.intersection(src.getAvailableLocaleIds(), config.getTargetLocaleIds(LOCALES))); + localeIds.addAll(PHANTOM_LOCALE_IDS); + this.availableIds = ImmutableSet.copyOf(localeIds); + + // Load the remaining path value transformers. + this.supplementalTransformer = + RegexTransformer.fromConfigLines(readLinesFromResource("/ldml2icu_supplemental.txt"), + IcuFunctions.ALGORITHM_FN, + IcuFunctions.DATE_FN, + IcuFunctions.DAY_NUMBER_FN, + IcuFunctions.EXP_FN, + IcuFunctions.YMD_FN); + this.localeTransformer = + RegexTransformer.fromConfigLines(readLinesFromResource("/ldml2icu_locale.txt"), + IcuFunctions.CONTEXT_TRANSFORM_INDEX_FN); + this.icuFileHeader = ImmutableList.copyOf(readLinesFromResource("/ldml2icu_header.txt")); + } + + private void convertAll(LdmlConverterConfig config) { + ListMultimap groupByType = LinkedListMultimap.create(); + for (OutputType t : config.getOutputTypes()) { + groupByType.put(t.getCldrType(), t); + } + for (CldrDataType cldrType : groupByType.keySet()) { + for (OutputType t : groupByType.get(cldrType)) { + t.convert(this); + } + } + if (config.emitReport()) { + System.out.println("Supplemental Data Transformer=" + supplementalTransformer); + System.out.println("Locale Data Transformer=" + localeTransformer); + } + } + + private static List readLinesFromResource(String name) { + try (InputStream in = LdmlConverter.class.getResourceAsStream(name)) { + return CharStreams.readLines(new InputStreamReader(in)); + } catch (IOException e) { + throw new RuntimeException("cannot read resource: " + name, e); + } + } + + private PathValueTransformer getLocaleTransformer() { + return localeTransformer; + } + + private PathValueTransformer getSupplementalTransformer() { + return supplementalTransformer; + } + + private void processAll(CldrDataType cldrType) { + List targets = 
Arrays.stream(OutputType.values()) + .filter(t -> t.getCldrType().equals(cldrType)) + .filter(t -> !t.name().startsWith("DTD_")) + .collect(toList()); + for (OutputType t : targets) { + t.convert(this); + } + } + + private Optional loadSpecialsData(String localeId) { + String expected = localeId + ".xml"; + try (Stream files = Files.walk(config.getSpecialsDir())) { + Set xmlFiles = files + .filter(Files::isRegularFile) + .filter(f -> f.getFileName().toString().equals(expected)) + .collect(Collectors.toSet()); + return !xmlFiles.isEmpty() + ? Optional.of( + CldrDataSupplier.forCldrFiles(LDML, config.getMinimumDraftStatus(), xmlFiles)) + : Optional.empty(); + } catch (IOException e) { + throw new RuntimeException( + "error processing specials directory: " + config.getSpecialsDir(), e); + } + } + + private void processLocales() { + // TODO: Pre-load specials files to avoid repeatedly re-loading them. + processAndSplitLocaleFiles( + id -> LocaleMapper.process( + id, src, loadSpecialsData(id), getLocaleTransformer(), supplementalData), + CURR, LANG, LOCALES, REGION, UNIT, ZONE); + } + + private void processBrkitr() { + processAndSplitLocaleFiles( + id -> BreakIteratorMapper.process(id, src, loadSpecialsData(id)), BRKITR); + } + + private void processCollation() { + processAndSplitLocaleFiles( + id -> CollationMapper.process(id, src, loadSpecialsData(id)), COLL); + } + + private void processRbnf() { + processAndSplitLocaleFiles( + id -> RbnfMapper.process(id, src, loadSpecialsData(id)), RBNF); + } + + private void processAndSplitLocaleFiles( + Function icuFn, IcuLocaleDir... splitDirs) { + + SetMultimap writtenLocaleIds = HashMultimap.create(); + Path baseDir = config.getOutputDir(); + + for (String id : config.getTargetLocaleIds(LOCALES)) { + // Skip "target" IDs that are aliases (they are handled later). 
+ if (!availableIds.contains(id)) { + continue; + } + IcuData icuData = icuFn.apply(id); + + ListMultimap splitPaths = LinkedListMultimap.create(); + for (RbPath p : icuData.getPaths()) { + String rootName = getBaseSegmentName(p.getSegment(0)); + splitPaths.put(LOCALE_SPLIT_INFO.getOrDefault(rootName, LOCALES), p); + } + + // We always write base languages (even if empty). + boolean isBaseLanguage = !id.contains("_"); + // Run through all directories (not just the keySet() of the split path map) since we + // sometimes write empty files. + for (IcuLocaleDir dir : splitDirs) { + Set targetIds = config.getTargetLocaleIds(dir); + if (!targetIds.contains(id)) { + if (!splitPaths.get(dir).isEmpty()) { + System.out.format( + "target IDs for %s does not contain %s, but it has data: %s\n", + dir, id, splitPaths.get(dir)); + } + continue; + } + Path outDir = baseDir.resolve(dir.getOutputDir()); + IcuData splitData = new IcuData(icuData.getName(), icuData.hasFallback()); + // The split data can still be empty for this directory, but that's expected. + splitPaths.get(dir).forEach(p -> splitData.add(p, icuData.get(p))); + // Adding a parent locale makes the data non-empty and forces it to be written. + supplementalData.getExplicitParentLocaleOf(splitData.getName()) + .ifPresent(p -> splitData.add(RB_PARENT, p)); + if (!splitData.isEmpty() || isBaseLanguage || dir.includeEmpty()) { + splitData.setVersion(CldrDataSupplier.getCldrVersionString()); + write(splitData, outDir); + writtenLocaleIds.put(dir, id); + } + } + } + + for (IcuLocaleDir dir : splitDirs) { + Path outDir = baseDir.resolve(dir.getOutputDir()); + Set targetIds = config.getTargetLocaleIds(dir); + + Map aliasMap = getAliasMap(targetIds, dir); + aliasMap.forEach((s, t) -> { + // It's only important to record which alias files are written because of forced + // aliases, but since it's harmless otherwise, we just do it unconditionally. 
+ // Normal alias files don't affect the empty file calculation, but forced ones can. + writtenLocaleIds.put(dir, s); + writeAliasFile(s, t, outDir); + }); + + calculateEmptyFiles(writtenLocaleIds.get(dir), aliasMap.values()) + .forEach(id -> writeEmptyFile(id, outDir, aliasMap.values())); + } + } + + private Map<String, String> getAliasMap(Set<String> localeIds, IcuLocaleDir dir) { + // There are four reasons for treating a locale ID as an alias. + // 1: It contains deprecated subtags (e.g. "sr_YU", which should be "sr_Cyrl_RS"). + // 2: It has no CLDR data but is missing a script subtag. + // 3: It is one of the special "phantom" aliases which cannot be represented normally + // and must be manually mapped (e.g. legacy locale IDs which don't even parse). + // 4: It is a "super special" forced alias, which might replace existing aliases in + // some output directories. + Map<String, String> aliasMap = new LinkedHashMap<>(); + for (String id : localeIds) { + if (PHANTOM_ALIASES.keySet().contains(id)) { + checkArgument(!availableIds.contains(id), + "phantom aliases should never be otherwise supported: %s\n" + + "(maybe the phantom alias can now be removed?)", id); + aliasMap.put(id, PHANTOM_ALIASES.get(id)); + continue; + } + String canonicalId = supplementalData.replaceDeprecatedTags(id); + if (!canonicalId.equals(id)) { + // If the canonical form of an ID differs from the requested ID, then this is an + // alias, and just needs to point to the canonical ID. + aliasMap.put(id, canonicalId); + continue; + } + if (availableIds.contains(id)) { + // If it's canonical and supported, it's not an alias. + continue; + } + // If the requested locale is not supported, maximize it and alias to that. + String maximizedId = supplementalData.maximize(id) + .orElseThrow(() -> new IllegalArgumentException("unsupported locale ID: " + id)); + // We can't alias to ourselves and we shouldn't be here if the ID was already maximal.
+ checkArgument(!maximizedId.equals(id), "unsupported maximized locale ID: %s", id); + aliasMap.put(id, maximizedId); + } + // Important that we overwrite entries which might already exist here, since we might have + // already calculated a "natural" alias for something that we want to force (and we should + // replace the existing target, since that affects how we determine empty files later). + aliasMap.putAll(config.getForcedAliases(dir)); + return aliasMap; + } + + private static final CharMatcher PATH_MODIFIER = CharMatcher.anyOf(":%"); + + // Resource bundle path elements can have variants (e.g. "Currencies%narrow") or type + // annotations (e.g. "languages:intvector"). We strip these when considering the element name. + private static String getBaseSegmentName(String segment) { + int idx = PATH_MODIFIER.indexIn(segment); + return idx == -1 ? segment : segment.substring(0, idx); + } + + private void processDayPeriods(String dir) { + write(DayPeriodsMapper.process(src), dir); + } + + private void processPlurals(String dir) { + write(PluralsMapper.process(src), dir); + } + + private void processPluralRanges(String dir) { + write(PluralRangesMapper.process(src), dir); + } + + private void processKeyTypeData(String dir) { + Bcp47Mapper.process(src).forEach(d -> write(d, dir)); + } + + private void processTransforms(String dir) { + Path transformDir = createDirectory(config.getOutputDir().resolve(dir)); + write(TransformsMapper.process(src, transformDir), transformDir); + } + + private static final RbPath RB_CLDR_VERSION = RbPath.of("cldrVersion"); + + private void processSupplemental( + String label, PathMatcher paths, String dir, boolean addCldrVersion) { + IcuData icuData = + SupplementalMapper.process(src, getSupplementalTransformer(), label, paths); + // A hack for "supplementalData.txt" since the "cldrVersion" value doesn't come from the + // supplemental data XML files.
+ if (addCldrVersion) { + icuData.add(RB_CLDR_VERSION, CldrDataSupplier.getCldrVersionString()); + } + write(icuData, dir); + } + + private void writeAliasFile(String srcId, String destId, Path dir) { + IcuData icuData = new IcuData(srcId, true); + icuData.add(RB_ALIAS, destId); + write(icuData, dir); + } + + private void writeEmptyFile(String id, Path dir, Collection<String> aliasTargets) { + IcuData icuData = new IcuData(id, true); + // TODO: Document the reason for this (i.e. why does it matter what goes into empty files?) + if (aliasTargets.contains(id)) { + icuData.setFileComment("generated alias target"); + icuData.add(RB_EMPTY_ALIAS, ""); + } else { + // These empty files only exist because the target of an alias has a parent locale + // which is itself not in the set of written ICU files. An "indirect alias target". + icuData.setVersion(CldrDataSupplier.getCldrVersionString()); + } + write(icuData, dir); + } + + private void write(IcuData icuData, String dir) { + write(icuData, config.getOutputDir().resolve(dir)); + } + + private void write(IcuData icuData, Path dir) { + createDirectory(dir); + IcuTextWriter.writeToFile(icuData, dir, icuFileHeader); + } + + private Path createDirectory(Path dir) { + try { + Files.createDirectories(dir); + } catch (IOException e) { + throw new RuntimeException("cannot create directory: " + dir, e); + } + return dir; + } + + // The set of IDs to process is: + // * any file that was written + // * any alias target (not written) + // + // From which we generate the complete "closure" under the "getParent()" function. This set + // contains all files (written or not) which need to exist to complete the locale hierarchy. + // + // Then we remove all the written files to just leave the ones that need to be generated. + // This is a simple and robust approach that handles things like "gaps" in non-aliased + // locale IDs, where an intermediate parent is not present.
+ private ImmutableSet calculateEmptyFiles( + Set writtenIds, Collection aliasTargetIds) { + + Set seedIds = new HashSet<>(writtenIds); + seedIds.addAll(aliasTargetIds); + // Be nice and sort the output (makes easier debugging). + Set allIds = new TreeSet<>(); + for (String id : seedIds) { + while (!id.equals("root") && !allIds.contains(id)) { + allIds.add(id); + id = supplementalData.getParent(id); + } + } + return ImmutableSet.copyOf(Sets.difference(allIds, writtenIds)); + } + + private static final ImmutableMap LOCALE_SPLIT_INFO = + ImmutableMap.builder() + // BRKITR + .put("boundaries", BRKITR) + .put("dictionaries", BRKITR) + .put("exceptions", BRKITR) + // COLL + .put("collations", COLL) + .put("depends", COLL) + .put("UCARules", COLL) + // CURR + .put("Currencies", CURR) + .put("CurrencyPlurals", CURR) + .put("CurrencyUnitPatterns", CURR) + .put("currencySpacing", CURR) + // LANG + .put("Keys", LANG) + .put("Languages", LANG) + .put("Scripts", LANG) + .put("Types", LANG) + .put("Variants", LANG) + .put("characterLabelPattern", LANG) + .put("codePatterns", LANG) + .put("localeDisplayPattern", LANG) + // RBNF + .put("RBNFRules", RBNF) + // REGION + .put("Countries", REGION) + // UNIT + .put("durationUnits", UNIT) + .put("units", UNIT) + .put("unitsShort", UNIT) + .put("unitsNarrow", UNIT) + // ZONE + .put("zoneStrings", ZONE) + .build(); +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/LdmlConverterConfig.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/LdmlConverterConfig.java new file mode 100644 index 00000000000..97b1048a38f --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/LdmlConverterConfig.java @@ -0,0 +1,106 @@ +// © 2019 and later: Unicode, Inc. and others. 
+// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import java.nio.file.Path; +import java.util.Map; +import java.util.Set; + +import org.unicode.cldr.api.CldrDraftStatus; + +import com.google.common.base.Ascii; +import org.unicode.icu.tool.cldrtoicu.LdmlConverter.OutputType; + +/** API for configuring the LDML converter. */ +public interface LdmlConverterConfig { + /** Output directories for ICU locale data (this is not used for supplemental data). */ + enum IcuLocaleDir { + /** Data for the break-iterator library. */ + BRKITR(true), + /** Data for the collations library. */ + COLL(true), + /** Currency data. */ + CURR(false), + /** Language data. */ + LANG(false), + /** General locale data. */ + LOCALES(true), + /** Rule-based number formatter data. */ + RBNF(true), + /** Region data. */ + REGION(false), + /** Measurement and units data. */ + UNIT(false), + /** Timezone data. */ + ZONE(false); + + private final String dirName = Ascii.toLowerCase(name()); + private final boolean includeEmpty; + + IcuLocaleDir(boolean includeEmpty) { + this.includeEmpty = includeEmpty; + } + + /** Returns the relative output directory name. */ + String getOutputDir() { + return dirName; + } + + /** + * Whether the directory is expected to contain empty data files (used to advertise + * the supported set of locales for the "service" provided by the data in that + * directory). + */ + // TODO: Document why there's a difference between directories for empty directories. + boolean includeEmpty() { + return includeEmpty; + } + } + + /** + * Returns the set of output types to be converted. Use {@link OutputType#ALL} to convert + * everything. + */ + Set getOutputTypes(); + + /** Returns the root directory in which the CLDR release is located. */ + Path getCldrDirectory(); + + /** + * Returns an additional "specials" directory containing additional ICU specific XML + * files depending on the given output type. 
This is where the converter finds any XML + * files using the "icu:" namespace. + */ + Path getSpecialsDir(); + + /** + * Returns the root of the ICU output directory hierarchy into which ICU data files are + * written. + */ + Path getOutputDir(); + + /** Returns the minimal draft status for CLDR data to be converted. */ + CldrDraftStatus getMinimumDraftStatus(); + + /** + * Returns the set of locale IDs to be processed for the given directory. + * + *

<p>This set can contain IDs which have no ICU data associated with them if they are + * suitable aliases (e.g. they are deprecated versions of locale IDs for which data does + * exist). + */ + Set<String> getTargetLocaleIds(IcuLocaleDir dir); + + /** + * Returns a map of locale IDs which specifies aliases which are applied to the given + * directory in contradiction to the natural alias or parent ID which would otherwise + * be generated. This is a mechanism for restructuring the parent chain and linking + * locales together in non-standard and unexpected ways. + */ + Map<String, String> getForcedAliases(IcuLocaleDir dir); + + /** + * Whether to emit a summary report for debug purposes after conversion is complete. + */ + boolean emitReport(); +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/PathMatcher.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/PathMatcher.java new file mode 100644 index 00000000000..e6e8e4dba5c --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/PathMatcher.java @@ -0,0 +1,259 @@ +// © 2019 and later: Unicode, Inc. and others.
+// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static com.google.common.base.Preconditions.checkPositionIndex; +import static com.google.common.base.Preconditions.checkState; +import static com.google.common.collect.ImmutableMap.toImmutableMap; +import static org.unicode.cldr.api.AttributeKey.keyOf; + +import java.util.ArrayList; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.function.Predicate; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +import org.unicode.cldr.api.AttributeKey; +import org.unicode.cldr.api.CldrPath; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableMap; + +/** + * An immutable matcher for {@link CldrPath} instances. A path matcher specification looks like + * {@code "foo/*[@x="z"]/bar[@y=*]"}, where element names and attribute values can be wildcards. + * + *

<p>Note that the path fragment represented by the specification does not include either leading + * or trailing {@code '/'}. This is because matching can occur at any point in a {@code CldrPath}. + * The choice of where to match in the path is governed by the match method used (e.g. + * {@link PathMatcher#matchesSuffixOf(CldrPath)}). + */ +public abstract class PathMatcher { + /** Parses the path specification into a matcher. */ + public static PathMatcher of(String pathSpec) { + // Supported so far: "a", "a/b", "a/b[@x=*]" + return new BasicMatcher(parse(pathSpec)); + } + + /** + * Combines the given matchers into a single composite matcher which tests all the given + * matchers in order. + */ + public static PathMatcher anyOf(PathMatcher... matchers) { + checkArgument(matchers.length > 0, "must supply at least one matcher"); + if (matchers.length == 1) { + return checkNotNull(matchers[0]); + } + return new CompositeMatcher(ImmutableList.copyOf(matchers)); + } + + /** Attempts a full match against a given path. */ + public abstract boolean matches(CldrPath path); + + /** Attempts a suffix match against a given path. */ + public abstract boolean matchesSuffixOf(CldrPath path); + + /** Attempts a prefix match against a given path. */ + public abstract boolean matchesPrefixOf(CldrPath path); + + // A matcher that simply combines a sequence of other matchers in order.
+ private static final class CompositeMatcher extends PathMatcher { + private final ImmutableList<PathMatcher> matchers; + + private CompositeMatcher(ImmutableList<PathMatcher> matchers) { + checkArgument(matchers.size() > 1); + this.matchers = checkNotNull(matchers); + } + + @Override + public boolean matches(CldrPath path) { + for (PathMatcher m : matchers) { + if (m.matches(path)) { + return true; + } + } + return false; + } + + @Override + public boolean matchesSuffixOf(CldrPath path) { + for (PathMatcher m : matchers) { + if (m.matchesSuffixOf(path)) { + return true; + } + } + return false; + } + + @Override + public boolean matchesPrefixOf(CldrPath path) { + for (PathMatcher m : matchers) { + if (m.matchesPrefixOf(path)) { + return true; + } + } + return false; + } + } + + private static final class BasicMatcher extends PathMatcher { + private final ImmutableList<Predicate<CldrPath>> elementMatchers; + + private BasicMatcher(List<Predicate<CldrPath>> elementMatchers) { + this.elementMatchers = ImmutableList.copyOf(elementMatchers); + } + + @Override + public boolean matches(CldrPath path) { + return elementMatchers.size() == path.getLength() && matchRegion(path, 0); + } + + @Override + public boolean matchesSuffixOf(CldrPath path) { + int start = path.getLength() - elementMatchers.size(); + return start >= 0 && matchRegion(path, start); + } + + @Override + public boolean matchesPrefixOf(CldrPath path) { + return path.getLength() >= elementMatchers.size() && matchRegion(path, 0); + } + + private boolean matchRegion(CldrPath path, int offset) { + // offset is the path element corresponding to the "top most" element matcher, it + // must be in the range 0 ... (path.length() - elementMatchers.size()). + checkPositionIndex(offset, path.getLength() - elementMatchers.size()); + // First jump over the path parents until we find the last matcher.
+ int matchPathLength = offset + elementMatchers.size(); + while (path.getLength() > matchPathLength) { + path = path.getParent(); + } + return matchForward(path, elementMatchers.size() - 1); + } + + private boolean matchForward(CldrPath path, int matcherIndex) { + if (matcherIndex < 0) { + return true; + } + return matchForward(path.getParent(), matcherIndex - 1) + && elementMatchers.get(matcherIndex).test(path); + } + } + + // Make a new, non-interned, unique instance here which we can test by reference to + // determine if the argument is to be captured (needed as ImmutableMap prohibits null). + // DO NOT change this code to assign "*" as the value directly, it MUST be a new instance. + private static final String WILDCARD = new String("*"); + + private static final Pattern ELEMENT_START_REGEX = + Pattern.compile("(\\*|[-:\\w]+)(?:/|\\[|$)"); + private static final Pattern ATTRIBUTE_REGEX = + Pattern.compile("\\[@([-:\\w]+)=(?:\\*|\"([^\"]*)\")\\]"); + + // element := foo, foo[@bar="baz"], foo[@bar=*] + // pathspec := element{/element}* + private static List> parse(String pathSpec) { + List> specs = new ArrayList<>(); + int pos = 0; + do { + pos = parse(pathSpec, pos, specs); + } while (pos >= 0); + return specs; + } + + // Return next start index or -1. + private static int parse(String pathSpec, int pos, List> specs) { + Matcher m = ELEMENT_START_REGEX.matcher(pathSpec).region(pos, pathSpec.length()); + checkArgument(m.lookingAt(), "invalid path specification (index=%s): %s", pos, pathSpec); + String name = m.group(1); + Map attributes = ImmutableMap.of(); + pos = m.end(1); + if (pos < pathSpec.length() && pathSpec.charAt(pos) == '[') { + // We have attributes to add. + attributes = new LinkedHashMap<>(); + do { + m = ATTRIBUTE_REGEX.matcher(pathSpec).region(pos, pathSpec.length()); + checkArgument(m.lookingAt(), + "invalid path specification (index=%s): %s", pos, pathSpec); + // Null if we matched the '*' wildcard. 
+ String value = m.group(2); + attributes.put(m.group(1), value != null ? value : WILDCARD); + pos = m.end(); + } while (pos < pathSpec.length() && pathSpec.charAt(pos) == '['); + } + // Wildcard matching is less efficient because attribute keys cannot be made in advance, so + // since it's also very rare, we special case it. + Predicate matcher = name.equals(WILDCARD) + ? new WildcardElementMatcher(attributes)::match + : new ElementMatcher(name, attributes)::match; + specs.add(matcher); + if (pos == pathSpec.length()) { + return -1; + } + checkState(pathSpec.charAt(pos) == '/', + "invalid path specification (index=%s): %s", pos, pathSpec); + return pos + 1; + } + + // Matcher for path elements like "foo[@bar=*]" where the name is known in advance. + private static final class ElementMatcher { + private final String name; + private final ImmutableMap attributes; + + private ElementMatcher(String name, Map attributes) { + this.name = checkNotNull(name); + this.attributes = attributes.entrySet().stream() + .collect(toImmutableMap(e -> keyOf(name, e.getKey()), Entry::getValue)); + } + + boolean match(CldrPath path) { + if (!path.getName().equals(name)) { + return false; + } + for (Entry e : attributes.entrySet()) { + String actual = path.get(e.getKey()); + if (actual == null) { + return false; + } + String expected = e.getValue(); + // DO NOT change this to use expected.equals(WILDCARD). + if (expected != WILDCARD && !expected.equals(actual)) { + return false; + } + } + return true; + } + } + + // Matcher for path elements like "*[@bar=*]", where the name isn't known until match time. 
+ private static final class WildcardElementMatcher { + private final ImmutableMap attributes; + + private WildcardElementMatcher(Map attributes) { + this.attributes = ImmutableMap.copyOf(attributes); + } + + private boolean match(CldrPath path) { + // The wildcard matcher never fails due to the element name but must create new key + // instances every time matching occurs (because the key name is dynamic). Since this + // is rare, it's worth making into a separate case. + for (Entry attribute : attributes.entrySet()) { + String actual = path.get(keyOf(path.getName(), attribute.getKey())); + if (actual == null) { + return false; + } + String expected = attribute.getValue(); + // DO NOT change this to use expected.equals(WILDCARD). + if (expected != WILDCARD && !expected.equals(actual)) { + return false; + } + } + return true; + } + } +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/PathValueTransformer.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/PathValueTransformer.java new file mode 100644 index 00000000000..d5075fac181 --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/PathValueTransformer.java @@ -0,0 +1,130 @@ +// © 2019 and later: Unicode, Inc. and others. +// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.Preconditions.checkNotNull; + +import java.util.function.Function; + +import org.unicode.cldr.api.CldrPath; +import org.unicode.cldr.api.CldrValue; + +import com.google.common.collect.ImmutableList; + +/** + * API for transforming CLDR path/value pairs. Transformed results support grouping by their key + * and the ability to generate default "fallback" values to account for missing values in a group. + * + *

To transform some set of CLDR path/values: + *

    + *
  1. Transform all desired path/value pairs into a set of matched results, discarding duplicates + * (see {@link #transform(CldrValue)}). + *
  2. Group the results by key (e.g. into a {@code ListMultimap}). + *
  3. For each group, add any fallback values which don't yet exist for that key (see + * {@link #getFallbackResultsFor(RbPath, DynamicVars)} and {@link Result#isFallbackFor(Result)}). + *
  4. Sort elements within each group and flatten result values (see {@link Result#isGrouped()}). + *
+ * + *
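The four steps above can be sketched in isolation as follows. This is a minimal illustration, not the converter's actual code: `TransformSketch`, `SketchResult`, the plain `String` key (standing in for `RbPath`), the explicit `order` field used for sorting, and the rule that a fallback replaces a matched result with the same key and order are all assumptions made for this example. Step 1 (the transformation itself) is skipped and the results are simply given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types; the real Result and RbPath classes are not used here.
final class TransformSketch {
    // Simplified result: a String key instead of RbPath, and an explicit sort index.
    static final class SketchResult {
        final String key;
        final int order;
        final boolean grouped;
        final List<String> values;

        SketchResult(String key, int order, boolean grouped, String... values) {
            this.key = key;
            this.order = order;
            this.grouped = grouped;
            this.values = Arrays.asList(values);
        }

        // Assumption: a fallback stands in for a matched result with the same key and order.
        boolean isFallbackFor(SketchResult r) {
            return key.equals(r.key) && order == r.order;
        }
    }

    // Step 2: group results by key, preserving first-seen key order.
    static Map<String, List<SketchResult>> groupByKey(List<SketchResult> results) {
        Map<String, List<SketchResult>> groups = new LinkedHashMap<>();
        for (SketchResult r : results) {
            groups.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    // Step 3: add only those fallbacks which are not a fallback for any matched result.
    static void addFallbacks(List<SketchResult> group, List<SketchResult> fallbacks) {
        List<SketchResult> matched = new ArrayList<>(group);
        for (SketchResult f : fallbacks) {
            if (matched.stream().noneMatch(f::isFallbackFor)) {
                group.add(f);
            }
        }
    }

    // Step 4: sort, then flatten; grouped values stay together as one top-level value.
    static List<Object> flatten(List<SketchResult> group) {
        List<SketchResult> sorted = new ArrayList<>(group);
        sorted.sort(Comparator.comparingInt(r -> r.order));
        List<Object> out = new ArrayList<>();
        for (SketchResult r : sorted) {
            if (r.grouped) {
                out.add(r.values);
            } else {
                out.addAll(r.values);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<SketchResult> results = Arrays.asList(
            new SketchResult("K", 0, false, "a", "b"),
            new SketchResult("K", 1, true, "c", "d"),
            new SketchResult("K", 2, false, "e"));
        List<SketchResult> group = groupByKey(results).get("K");
        addFallbacks(group, Arrays.asList(
            new SketchResult("K", 1, false, "x"),   // shadowed by the matched result at order 1
            new SketchResult("K", 3, false, "f"))); // no matched result, so it is added
        System.out.println(flatten(group)); // prints [a, b, [c, d], e, f]
    }
}
```

In this sketch the grouped result keeps `["c", "d"]` together as a single top-level value, while the un-grouped results are spliced into the surrounding sequence.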

For each unique key, this should yield a correctly ordered sequence of values (according to the + * semantics of the chosen transformer implementation). + */ +public abstract class PathValueTransformer { + /** + * A result either obtained by transforming a path/value pair, or as a potential fallback for + * some known key (see {@link PathValueTransformer#transform(CldrValue)} and + * {@link PathValueTransformer#getFallbackResultsFor(RbPath, DynamicVars)}). + */ + public static abstract class Result implements Comparable { + private final RbPath key; + + protected Result(RbPath key) { + this.key = checkNotNull(key); + } + + /** + * Returns the key of this result, used to group results and determine fallback values + * according to the semantics of the chosen transformer. + */ + public RbPath getKey() { + return key; + } + + /** + * Returns whether the values in this result should be grouped or not. Un-grouped values + * should be considered as individual values in a sequence and might be joined with values + * from other results in the same group. Grouped values cannot be split and must appear + * as a single value. + * + *

For example for the ordered results: + *

+         * Result X = { key=K, values=[ "a", "b" ], grouped=false }
+         * Result Y = { key=K, values=[ "c", "d" ], grouped=false }
+         * Result Z = { key=K, values=[ "e" ], grouped=false }
+         * 
+ * the values for key {@code K} are conceptually {@code [ "a", "b", "c", "d", "e" ]}. + * + *

However if result {@code Y} has {@code grouped=true} then there are now 4 values + * {@code [ "a", "b", ["c", "d"], "e" ]}, and if {@code X} is also grouped, then it is + * {@code [ ["a", "b"], ["c", "d"], "e" ]}, producing only 3 top-level values. + */ + public abstract boolean isGrouped(); + + /** + * Returns the transformed values of this result, which may or may not be grouped + * according to {@link #isGrouped()}. + */ + public abstract ImmutableList getValues(); + + /** + * Returns whether this result is a fallback for some existing matched result. A fallback + * result should only be used when it is not a fallback for any existing result. + */ + public abstract boolean isFallbackFor(Result r); + + /** Debug only string representation. */ + @Override + public final String toString() { + return String.format( + "Result{ key='%s', grouped=%s, values=%s }", + getKey(), isGrouped(), getValues()); + } + } + + /** + * A "typedef" for the function to do late binding of dynamic variables. This is used for edge + * cases where a %N variable in the rules config is bound to a CLDR path (e.g. "//foo/bar") + * which cannot be resolved until the rule is evaluated. Unfortunately the need to support late + * binding of variables incurs significant additional complexity in the code, despite being + * used in exactly one situation so far (the '%D' variable to represent the default numbering + * scheme). + */ + // TODO: Figure out how to get rid of all of this mess. + public interface DynamicVars extends Function {} + + /** + * Transforms a CLDR value into a sequence of results (empty if the value was not matched by + * any rule). + * + * @param cldrValue the value to transform. + * @return the transformed result(s). + */ + public abstract ImmutableList transform(CldrValue cldrValue); + + /** + * Transforms a CLDR value into a sequence of results (empty if the value was not matched by + * any rule). 
The dynamic variable function provides any "late bound" CLDR path variables to be + * resolved from CLDR data during processing (e.g. "%D=//ldml/numbers/defaultNumberingSystem"). + * + * @param cldrValue the value to transform. + * @param varFn a function for resolving "late bound" variables. + * @return the transformed result(s). + */ + public abstract ImmutableList transform(CldrValue cldrValue, DynamicVars varFn); + + /** + * Returns a possibly empty sequence of fallback results for a given key. A fallback result for + * a key should be used only if it is not a fallback for any other result with that key; see + * also {@link Result#isFallbackFor(Result)}. + */ + public abstract ImmutableList getFallbackResultsFor(RbPath key, DynamicVars varFn); +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/RbPath.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/RbPath.java new file mode 100644 index 00000000000..3af37b5646c --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/RbPath.java @@ -0,0 +1,232 @@ +// © 2019 and later: Unicode, Inc. and others. +// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.CharMatcher.whitespace; +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkState; +import static com.google.common.collect.ImmutableList.toImmutableList; + +import java.util.Arrays; +import java.util.Comparator; +import java.util.Objects; +import java.util.function.Function; + +import com.google.common.base.CharMatcher; +import com.google.common.base.Splitter; +import com.google.common.collect.Comparators; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.Iterables; + +/** + * A resource bundle path, used to identify entries in ICU data. + * + *

Immutable and thread safe. + */ +public final class RbPath implements Comparable { + private static final Splitter PATH_SPLITTER = Splitter.on('/').trimResults(); + + // This defines ordering of paths in IcuData instances and thus the order in ICU data files. + // If there's ever a reason to have a different "natural" order for paths, this Comparator + // should be moved into the ICU file writer class(es). + private static final Comparator ORDERING = + Comparator.comparing( + p -> p.segments, + Comparators.lexicographical(Comparator.naturalOrder())); + + // Matches the definition of invariant characters in "uinvchar.cpp". We can make this all much + // faster if needed with a custom matcher (it's just a 128 way bit lookup via 2 longs). + private static final CharMatcher INVARIANT_CHARS = + CharMatcher.ascii().and(CharMatcher.anyOf("!#$@[\\]^`{|}~").negate()); + + // Note that we must also prohibit double-quote from appearing anywhere other than surrounding + // segment values. This is because some segment values can contain special ICU data characters + // (e.g. ':') but must be treated as literals. There is no proper "escaping" mechanism in ICU + // data for key values (since '\' is not an invariant, things like \\uxxxx are not possible). + // + // Ideally quoting would be done when the file is written, but that would require additional + // complexity in RbPath, since suffixes like ":intvector" must not be quoted and must somehow + // be distinguished from timezone "metazone" names which also contain ':'. + private static final CharMatcher QUOTED_SEGMENT_CHARS = + INVARIANT_CHARS + .and(CharMatcher.javaIsoControl().negate()) + .and(CharMatcher.isNot('"')); + private static final CharMatcher UNQUOTED_SEGMENT_CHARS = + QUOTED_SEGMENT_CHARS.and(whitespace().negate()); + + // Characters allowed in path segments which separate the "base name" from any suffix (e.g. + // the base name of "Foo:intvector" is "Foo"). 
+ private static final CharMatcher SEGMENT_SEPARATORS = CharMatcher.anyOf("%:"); + + private static final RbPath EMPTY = new RbPath(ImmutableList.of()); + + public static RbPath empty() { + return EMPTY; + } + + public static RbPath of(String... segments) { + return of(Arrays.asList(segments)); + } + + public static RbPath of(Iterable segments) { + return new RbPath(segments); + } + + public static RbPath parse(String path) { + checkArgument(!path.isEmpty(), "cannot parse an empty path string"); + // Allow leading '/', but don't allow empty segments anywhere else. + if (path.startsWith("/")) { + path = path.substring(1); + } + return new RbPath(PATH_SPLITTER.split(path)); + } + + static int getCommonPrefixLength(RbPath lhs, RbPath rhs) { + int maxLength = Math.min(lhs.length(), rhs.length()); + int n = 0; + while (n < maxLength && lhs.getSegment(n).equals(rhs.getSegment(n))) { + n++; + } + return n; + } + + private final ImmutableList segments; + private final int hashCode; + + private RbPath(Iterable segments) { + this.segments = ImmutableList.copyOf(segments); + this.hashCode = Objects.hash(this.segments); + for (String segment : this.segments) { + checkArgument(!segment.isEmpty(), + "empty path segments not permitted: %s", this.segments); + // Either the label is quoted (e.g. "foo") or it is bare (e.g. foo) but it can only + // contain double quotes at either end, or not at all. If the string is quoted, only + // validate the content, and not the quotes themselves. + String toValidate; + switch (segment.charAt(0)) { + case '<': + // Allow anything in hidden labels, since they will be removed later and never + // appear in the final ICU data. 
+ checkArgument(segment.endsWith(">"), + "mismatched quoting for hidden label: %s", segment); + continue; + + case '"': + checkArgument(segment.endsWith("\""), + "mismatched quoting for segment: %s", segment); + checkArgument( + QUOTED_SEGMENT_CHARS.matchesAllOf(segment.substring(1, segment.length() - 1)), + "invalid character in quoted resource bundle path segment: %s", segment); + break; + + default: + checkArgument( + UNQUOTED_SEGMENT_CHARS.matchesAllOf(segment), + "invalid character in unquoted resource bundle path segment: %s", segment); + break; + } + } + } + + public int length() { + return segments.size(); + } + + public String getSegment(int n) { + return segments.get(n); + } + + public RbPath getParent() { + checkState(length() > 0, "cannot get parent of the empty path"); + return length() > 1 ? new RbPath(segments.subList(0, length() - 1)) : EMPTY; + } + + public boolean isAnonymous() { + return length() > 0 && segments.get(length() - 1).charAt(0) == '<'; + } + + public RbPath extendBy(String... parts) { + return new RbPath(Iterables.concat(segments, Arrays.asList(parts))); + } + + public RbPath extendBy(RbPath suffix) { + return new RbPath(Iterables.concat(segments, suffix.segments)); + } + + public RbPath mapSegments(Function fn) { + return new RbPath(segments.stream().map(fn).collect(toImmutableList())); + } + + /** + * Returns whether the first element of this path is prefixed by the given "base name". + * + *

Resource bundle paths relating to semantically similar data are typically grouped by the + * same first path element. This is not as simple as just comparing the first element, as in + * {@code path.startsWith(prefix)} however, since path elements can have suffixes, such as + * {@code "Foo:alias"} or {@code "Foo%subtype"}. + * + * @param baseName the base name to test for. + * @return true if the "base name" of the first path element is the given prefix. + */ + public boolean hasPrefix(String baseName) { + checkArgument(!baseName.isEmpty() && SEGMENT_SEPARATORS.matchesNoneOf(baseName)); + if (length() == 0) { + return false; + } + String firstElement = getSegment(0); + // Slightly subtle (but safe) access to the separator character, since: + // (!a.equals(b) && a.startsWith(b)) ==> a.length() > b.length(). + return firstElement.equals(baseName) + || (firstElement.startsWith(baseName) + && SEGMENT_SEPARATORS.matches(firstElement.charAt(baseName.length()))); + } + + public boolean startsWith(RbPath prefix) { + return prefix.length() <= length() && matchesSublist(prefix, 0); + } + + public boolean endsWith(RbPath suffix) { + return suffix.length() <= length() && matchesSublist(suffix, length() - suffix.length()); + } + + public boolean contains(RbPath path) { + int maxOffset = length() - path.length(); + for (int i = 0; i <= maxOffset; i++) { + if (matchesSublist(path, i)) { + return true; + } + } + return false; + } + + // Assume length check has been done. 
+ private boolean matchesSublist(RbPath path, int offset) { + for (int i = 0; i < path.length(); i++) { + if (!path.getSegment(i).equals(getSegment(i + offset))) { + return false; + } + } + return true; + } + + boolean isIntPath() { + String lastElement = segments.get(segments.size() - 1); + return lastElement.endsWith(":int") || lastElement.endsWith(":intvector"); + } + + @Override public int compareTo(RbPath other) { + return ORDERING.compare(this, other); + } + + @Override public boolean equals(Object other) { + return (other instanceof RbPath) && segments.equals(((RbPath) other).segments); + } + + @Override public int hashCode() { + return hashCode; + } + + @Override public String toString() { + return String.join("/", segments); + } +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/RbValue.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/RbValue.java new file mode 100644 index 00000000000..84751d43e5f --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/RbValue.java @@ -0,0 +1,58 @@ +// © 2019 and later: Unicode, Inc. and others. +// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.Preconditions.checkArgument; + +import java.util.Arrays; +import java.util.Objects; +import java.util.function.Function; + +import com.google.common.collect.ImmutableList; + +/** + * A resource bundle value containing a sequence of elements. This is a very thin wrapper over an + * immutable list, with a few additional constraints (e.g. cannot be empty). + * + *

Immutable and thread safe. + */ +public final class RbValue { + private final ImmutableList elements; + + /** Returns a resource bundle value of the given elements. */ + public static RbValue of(String... elements) { + return of(Arrays.asList(elements)); + } + + /** Returns a resource bundle value of the given elements. */ + public static RbValue of(Iterable elements) { + return new RbValue(elements); + } + + private RbValue(Iterable elements) { + this.elements = ImmutableList.copyOf(elements); + checkArgument(!this.elements.isEmpty(), "Resource bundle values cannot be empty"); + } + + /** Returns the (non zero) number of elements in this value. */ + public int size() { + return elements.size(); + } + + /** Returns the Nth element of this value. */ + public String getElement(int n) { + return elements.get(n); + } + + @Override public int hashCode() { + return Objects.hashCode(elements); + } + + @Override public boolean equals(Object obj) { + return obj instanceof RbValue && elements.equals(((RbValue) obj).elements); + } + + @Override public String toString() { + return elements.toString(); + } +} diff --git a/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/SupplementalData.java b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/SupplementalData.java new file mode 100644 index 00000000000..954ebe0c287 --- /dev/null +++ b/tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/SupplementalData.java @@ -0,0 +1,593 @@ +// © 2019 and later: Unicode, Inc. and others. 
+// License & terms of use: http://www.unicode.org/copyright.html +package org.unicode.icu.tool.cldrtoicu; + +import static com.google.common.base.CharMatcher.whitespace; +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkNotNull; +import static com.google.common.base.Preconditions.checkState; +import static com.google.common.collect.ImmutableMap.toImmutableMap; +import static java.util.function.Function.identity; +import static org.unicode.cldr.api.AttributeKey.keyOf; +import static org.unicode.cldr.api.CldrData.PathOrder.ARBITRARY; + +import java.util.Arrays; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Objects; +import java.util.Optional; +import java.util.function.Function; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import java.util.stream.Stream; + +import org.unicode.cldr.api.AttributeKey; +import org.unicode.cldr.api.CldrData; + +import com.google.common.base.Ascii; +import com.google.common.base.Splitter; +import com.google.common.base.Strings; +import com.google.common.collect.HashBasedTable; +import com.google.common.collect.ImmutableMap; +import com.google.common.collect.ImmutableTable; +import com.google.common.collect.Table; + +/** + * Auxiliary APIs for processing locale IDs and other supplemental data needed by business logic + * in some mapper classes. + * + * When a {@link SupplementalData} instance is used in a mapper class, it is imperative that it is + * built using the same underlying CLDR data. The only reason mapper classes do not create their + * own instances directly is the relative cost of processing all the supplemental data each time. + */ +// TODO: This should be moved into the API and leverage some of the existing utility functions. 
+public final class SupplementalData { + private static final Pattern SCRIPT_SUBTAG = Pattern.compile("[A-Z][a-z]{3}"); + + private static final PathMatcher ALIAS = + PathMatcher.of("supplementalData/metadata/alias/*[@type=*]"); + + private static final PathMatcher PARENT_LOCALE = + PathMatcher.of("supplementalData/parentLocales/parentLocale[@parent=*]"); + private static final AttributeKey PARENT = keyOf("parentLocale", "parent"); + private static final AttributeKey LOCALES = keyOf("parentLocale", "locales"); + + private static final PathMatcher CALENDER_PREFERENCE = + PathMatcher.of("supplementalData/calendarPreferenceData/calendarPreference[@territories=*]"); + private static final AttributeKey CALENDER_TERRITORIES = + keyOf("calendarPreference", "territories"); + private static final AttributeKey CALENDER_ORDERING = + keyOf("calendarPreference", "ordering"); + + private static final PathMatcher LIKELY_SUBTAGS = + PathMatcher.of("supplementalData/likelySubtags/likelySubtag[@from=*]"); + private static final AttributeKey SUBTAG_FROM = keyOf("likelySubtag", "from"); + private static final AttributeKey SUBTAG_TO = keyOf("likelySubtag", "to"); + + private static final Splitter LIST_SPLITTER = + Splitter.on(whitespace()).omitEmptyStrings(); + + // Aliases come in three flavours. Note that the TERRITORY aliases map to a _list_ rather than + // a single value (it's structurally always a list, but only territory aliases have a need for + // more than one value). 
+ private enum Alias { + LANGUAGE, SCRIPT, TERRITORY; + + private static final ImmutableMap TYPE_MAP = + Arrays.stream(values()) + .collect(toImmutableMap(a -> Ascii.toLowerCase(a.name()) + "Alias", identity())); + + private final String elementName = Ascii.toLowerCase(name()) + "Alias"; + final AttributeKey typeKey = AttributeKey.keyOf(elementName, "type"); + final AttributeKey replacementKey = AttributeKey.keyOf(elementName, "replacement"); + + static Optional forElementName(String name) { + return Optional.ofNullable(TYPE_MAP.get(name)); + } + } + + /** + * Creates a supplemental data API instance from the given CLDR data. + * + * @param supplementalData the raw CLDR supplemental data instance. + * @return the supplemental data API. + */ + static SupplementalData create(CldrData supplementalData) { + Table aliasTable = HashBasedTable.create(); + Map parentLocaleMap = new HashMap<>(); + Map defaultCalendarMap = new HashMap<>(); + Map likelySubtagMap = new HashMap<>(); + + supplementalData.accept( + ARBITRARY, + v -> { + if (ALIAS.matches(v.getPath())) { + // Territory alias replacements can be a list of values (e.g. when countries + // break up). We use the first (geo-politically most significant) value. This + // doesn't happen for languages or scripts, but could in theory. 
+ Alias.forElementName(v.getPath().getName()).ifPresent( + alias -> aliasTable.put( + alias, + alias.typeKey.valueFrom(v), + alias.replacementKey.valueFrom(v))); + } else if (PARENT_LOCALE.matches(v.getPath())) { + String p = PARENT.valueFrom(v); + LOCALES.listOfValuesFrom(v).forEach(c -> parentLocaleMap.put(c, p)); + } else if (CALENDER_PREFERENCE.matches(v.getPath())) { + String c = CALENDER_ORDERING.listOfValuesFrom(v).get(0); + CALENDER_TERRITORIES.listOfValuesFrom(v).forEach(t -> defaultCalendarMap.put(t, c)); + } else if (LIKELY_SUBTAGS.matches(v.getPath())) { + likelySubtagMap.put(SUBTAG_FROM.valueFrom(v), SUBTAG_TO.valueFrom(v)); + } + }); + + // WARNING: The original mapper code determines the full set of deprecated territories and + // then removes the following hard-coded list without any explanation as to why. While this + // is presumably to "undeprecate" them for the purposes of the locale processing, there's + // no explanation of where this list comes from, and thus no way to maintain it. + // + // asList("062", "172", "200", "830", "AN", "CS", "QU") + // .forEach(t -> aliasTable.remove(Alias.TERRITORY, t)); + // TODO: Understand and document what on Earth this is all about or delete this comment. + + return new SupplementalData( + aliasTable, parentLocaleMap, defaultCalendarMap, likelySubtagMap); + } + + // A simple-as-possible, mutable, locale ID data "struct" to handle the IDs used during ICU + // data generation. Because this is mutable, it is thoroughly unsuitable for general use. + private static final class LocaleId { + // From: https://unicode.org/reports/tr35/#Identifiers + // Locale ID is: + // ((_