import java.text.CharacterIterator;
/**
- * <p>SearchIterator is an abstract base class that defines a protocol
- * for text searching. Subclasses provide concrete implementations of
- * various search algorithms. A concrete subclass, StringSearch, is
- * provided that implements language-sensitive pattern matching based
- * on the comparison rules defined in a RuleBasedCollator
- * object. Instances of SearchIterator maintain a current position and
- * scan over the target text, returning the indices where a match is
- * found and the length of each match. Generally, the sequence of forward
- * matches will be equivalent to the sequence of backward matches.One
- * case where this statement may not hold is when non-overlapping mode
- * is set on and there are continuous repetitive patterns in the text.
- * Consider the case searching for pattern "aba" in the text
- * "ababababa", setting overlapping mode off will produce forward matches
- * at offsets 0, 4. However when a backwards search is done, the
- * results will be at offsets 6 and 2.</p>
- *
- * <p>If matches searched for have boundary restrictions. BreakIterators
- * can be used to define the valid boundaries of such a match. Once a
- * BreakIterator is set, potential matches will be tested against the
- * BreakIterator to determine if the boundaries are valid and that all
- * characters in the potential match are equivalent to the pattern
- * searched for. For example, looking for the pattern "fox" in the text
- * "foxy fox" will produce match results at offset 0 and 5 with length 3
- * if no BreakIterators were set. However if a WordBreakIterator is set,
- * the only match that would be found will be at the offset 5. Since,
- * the SearchIterator guarantees that if a BreakIterator is set, all its
- * matches will match the given pattern exactly, a potential match that
- * passes the BreakIterator might still not produce a valid match. For
- * instance the pattern "e" will not be found in the string
- * "\u00e9" (latin small letter e with acute) if a
- * CharacterBreakIterator is used. Even though "e" is
- * a part of the character "\u00e9" and the potential match at
- * offset 0 length 1 passes the CharacterBreakIterator test, "\u00e9"
- * is not equivalent to "e", hence the SearchIterator rejects the potential
- * match. By default, the SearchIterator
- * does not impose any boundary restriction on the matches, it will
- * return all results that match the pattern. Illustrating with the
- * above example, "e" will
- * be found in the string "\u00e9" if no BreakIterator is
- * specified.</p>
- *
- * <p>SearchIterator also provides a means to handle overlapping
- * matches via the API setOverlapping(boolean). For example, if
- * overlapping mode is set, searching for the pattern "abab" in the
- * text "ababab" will match at positions 0 and 2, whereas if
- * overlapping is not set, SearchIterator will only match at position
- * 0. By default, overlapping mode is not set.</p>
- *
- * <p>The APIs in SearchIterator are similar to that of other text
- * iteration classes such as BreakIterator. Using this class, it is
- * easy to scan through text looking for all occurances of a
- * match.</p>
+ * <tt>SearchIterator</tt> is an abstract base class that provides
+ * methods to search for a pattern within a text string. Instances of
+ * <tt>SearchIterator</tt> maintain a current position and scan over the
+ * target text, returning the indices at which the pattern is matched and
+ * the length of each match.
* <p>
- * Example of use:<br>
- * <pre>
+ * <tt>SearchIterator</tt> defines a protocol for text searching.
+ * Subclasses provide concrete implementations of various search algorithms.
+ * For example, <tt>StringSearch</tt> implements language-sensitive pattern
+ * matching based on the comparison rules defined in a
+ * <tt>RuleBasedCollator</tt> object.
+ * <p>
+ * Other options for searching include using a BreakIterator to restrict
+ * the points at which matches are detected.
+ * <p>
+ * <tt>SearchIterator</tt> provides an API that is similar to that of
+ * other text iteration classes such as <tt>BreakIterator</tt>. Using
+ * this class, it is easy to scan through text looking for all occurrences of
+ * a given pattern. The following example uses a <tt>StringSearch</tt>
+ * object to find all instances of "fox" in the target string. Any other
+ * subclass of <tt>SearchIterator</tt> can be used in an identical
+ * manner.
+ * <pre><code>
* String target = "The quick brown fox jumped over the lazy fox";
* String pattern = "fox";
* SearchIterator iter = new StringSearch(pattern, target);
- * for (int pos = iter.first(); pos != SearchIterator.DONE;
- * pos = iter.next()) {
- * // println matches at offset 16 and 41 with length 3
- * System.out.println("Found match at " + pos + ", length is "
- * + iter.getMatchLength());
- * }
- * target = "ababababa";
- * pattern = "aba";
- * iter.setTarget(new StringCharacterIterator(pattern));
- * iter.setOverlapping(false);
- * System.out.println("Overlapping mode set to false");
- * System.out.println("Forward matches of pattern " + pattern + " in text "
- * + text + ": ");
- * for (int pos = iter.first(); pos != SearchIterator.DONE;
- * pos = iter.next()) {
- * // println matches at offset 0 and 4 with length 3
- * System.out.println("offset " + pos + ", length "
- * + iter.getMatchLength());
+ * for (int pos = iter.first(); pos != SearchIterator.DONE;
+ * pos = iter.next()) {
+ * System.out.println("Found match at " + pos +
+ * ", length is " + iter.getMatchLength());
* }
- * System.out.println("Backward matches of pattern " + pattern + " in text "
- * + text + ": ");
- * for (int pos = iter.last(); pos != SearchIterator.DONE;
- * pos = iter.previous()) {
- * // println matches at offset 6 and 2 with length 3
- * System.out.println("offset " + pos + ", length "
- * + iter.getMatchLength());
- * }
- * System.out.println("Overlapping mode set to true");
- * System.out.println("Index set to 2");
- * iter.setIndex(2);
- * iter.setOverlapping(true);
- * System.out.println("Forward matches of pattern " + pattern + " in text "
- * + text + ": ");
- * for (int pos = iter.first(); pos != SearchIterator.DONE;
- * pos = iter.next()) {
- * // println matches at offset 2, 4 and 6 with length 3
- * System.out.println("offset " + pos + ", length "
- * + iter.getMatchLength());
- * }
- * System.out.println("Index set to 2");
- * iter.setIndex(2);
- * System.out.println("Backward matches of pattern " + pattern + " in text "
- * + text + ": ");
- * for (int pos = iter.last(); pos != SearchIterator.DONE;
- * pos = iter.previous()) {
- * // println matches at offset 0 with length 3
- * System.out.println("offset " + pos + ", length "
- * + iter.getMatchLength());
- * }
- * </pre>
- * </p>
+ * </code></pre>
+ *
* @author Laura Werner, synwee
* @stable ICU 2.0
* @see BreakIterator
+ * @see RuleBasedCollator
*/
public abstract class SearchIterator
{
* @stable ICU 2.0
*/
public static final int DONE = -1;
-
+
// public methods -----------------------------------------------------
// public setters -----------------------------------------------------
search_.setMatchedLength(0);
search_.matchedIndex_ = DONE;
}
-
+
/**
- * <p>
* Determines whether overlapping matches are returned. See the class
* documentation for more information about overlapping matches.
- * </p>
* <p>
* The default setting of this property is false
- * </p>
+ *
* @param allowOverlap flag indicator if overlapping matches are allowed
* @see #isOverlapping
* @stable ICU 2.8
*/
- public void setOverlapping(boolean allowOverlap)
- {
+ public void setOverlapping(boolean allowOverlap) {
search_.isOverlap_ = allowOverlap;
}
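The overlapping property set here can be illustrated without ICU at all. The following is a minimal sketch, assuming plain `String.indexOf` matching rather than collation-based matching; the class `OverlapSketch` and the method `forwardMatches` are hypothetical names for illustration and are not part of ICU4J. It reproduces the class documentation's example of searching for "abab" in "ababab".

```java
import java.util.ArrayList;
import java.util.List;

class OverlapSketch {
    /** Collects match start offsets, scanning forward with String.indexOf. */
    static List<Integer> forwardMatches(String text, String pattern, boolean allowOverlap) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.isEmpty()) {
            return starts; // an empty pattern never matches in this sketch
        }
        int from = 0;
        while (true) {
            int pos = text.indexOf(pattern, from);
            if (pos < 0) {
                break;
            }
            starts.add(pos);
            // Overlapping: resume one position past the match start.
            // Non-overlapping: resume after the end of the match.
            from = allowOverlap ? pos + 1 : pos + pattern.length();
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(forwardMatches("ababab", "abab", true));   // [0, 2]
        System.out.println(forwardMatches("ababab", "abab", false));  // [0]
    }
}
```

The only difference between the two modes in this sketch is where the scan resumes after a match, which is why a second match at offset 2 disappears when overlap is disallowed.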
-
+
/**
- * Set the BreakIterator that is used to restrict the points at which
- * matches are detected.
- * Using <tt>null</tt> as the parameter is legal; it means that break
- * detection should not be attempted.
- * See class documentation for more information.
+ * Set the BreakIterator that will be used to restrict the points
+ * at which matches are detected.
+ *
* @param breakiter A BreakIterator that will be used to restrict the
- * points at which matches are detected.
- * @see #getBreakIterator
+ * points at which matches are detected. If a match is
+ * found, but the match's start or end index is not a
+ * boundary as determined by the {@link BreakIterator},
+ * the match will be rejected and another will be searched
+ * for. If this parameter is <tt>null</tt>, no break
+ * detection is attempted.
* @see BreakIterator
* @stable ICU 2.0
*/
- public void setBreakIterator(BreakIterator breakiter)
- {
+ public void setBreakIterator(BreakIterator breakiter) {
search_.setBreakIter(breakiter);
if (search_.breakIter() != null) {
// Create a clone of CharacterItearator, so it won't
/**
* Set the target text to be searched. Text iteration will then begin at
- * the start of the text string. This method is useful if you want to
+ * the start of the text string. This method is useful if you want to
* reuse an iterator to search within a different body of text.
+ *
* @param text new text iterator to look for match,
* @exception IllegalArgumentException thrown when text is null or has
* 0 length
}
}
- //TODO: We should add APIs below to match ICU4C APIs
+ //TODO: We may add APIs below to match ICU4C APIs
// setCanonicalMatch
- // setElementComparison
// public getters ----------------------------------------------------
-
+
/**
- * <p>
- * Returns the index of the most recent match in the target text.
- * This call returns a valid result only after a successful call to
- * {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.
- * Just after construction, or after a searching method returns
- * <tt>DONE</tt>, this method will return <tt>DONE</tt>.
- * </p>
- * <p>
- * Use <tt>getMatchLength</tt> to get the length of the matched text.
- * <tt>getMatchedText</tt> will return the subtext in the searched
- * target text from index getMatchStart() with length getMatchLength().
- * </p>
- * @return index to a substring within the text string that is being
- * searched.
- * @see #getMatchLength
- * @see #getMatchedText
- * @see #first
- * @see #next
- * @see #previous
- * @see #last
- * @see #DONE
- * @stable ICU 2.8
- */
- public int getMatchStart()
- {
+ * Returns the index to the match in the text string that was searched.
+ * This call returns a valid result only after a successful call to
+ * {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.
+ * Just after construction, or after a searching method returns
+ * {@link #DONE}, this method will return {@link #DONE}.
+ * <p>
+ * Use {@link #getMatchLength} to get the matched string length.
+ *
+ * @return index of a substring within the text string that is being
+ * searched.
+ * @see #first
+ * @see #next
+ * @see #previous
+ * @see #last
+ * @stable ICU 2.0
+ */
+ public int getMatchStart() {
return search_.matchedIndex_;
}
/**
- * Return the index in the target text at which the iterator is currently
- * positioned.
- * If the iteration has gone past the end of the target text, or past
- * the beginning for a backwards search, {@link #DONE} is returned.
- * @return index in the target text at which the iterator is currently
- * positioned.
+ * Return the current index in the text being searched.
+ * If the iteration has gone past the end of the text
+ * (or past the beginning for a backwards search), {@link #DONE}
+ * is returned.
+ *
+ * @return current index in the text being searched.
* @stable ICU 2.8
- * @see #first
- * @see #next
- * @see #previous
- * @see #last
- * @see #DONE
*/
public abstract int getIndex();
-
+
/**
- * <p>
- * Returns the length of the most recent match in the target text.
- * This call returns a valid result only after a successful
- * call to {@link #first}, {@link #next}, {@link #previous}, or
- * {@link #last}.
- * Just after construction, or after a searching method returns
- * <tt>DONE</tt>, this method will return 0. See getMatchStart() for
- * more details.
- * </p>
- * @return The length of the most recent match in the target text, or 0 if
- * there is no match.
- * @see #getMatchStart
- * @see #getMatchedText
+ * Returns the length of text in the string which matches the search
+ * pattern. This call returns a valid result only after a successful call
+ * to {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.
+ * Just after construction, or after a searching method returns
+ * {@link #DONE}, this method will return 0.
+ *
+ * @return The length of the match in the target text, or 0 if there
+ * is no match currently.
* @see #first
* @see #next
* @see #previous
* @see #last
- * @see #DONE
* @stable ICU 2.0
*/
- public int getMatchLength()
- {
+ public int getMatchLength() {
return search_.matchedLength();
}
-
+
/**
* Returns the BreakIterator that is used to restrict the indexes at which
* matches are detected. This will be the same object that was passed to
- * the constructor or to <code>setBreakIterator</code>.
- * If the BreakIterator has not been set, <tt>null</tt> will be returned.
- * See setBreakIterator for more information.
+ * the constructor or to {@link #setBreakIterator}.
+ * If the {@link BreakIterator} has not been set, <tt>null</tt> will be returned.
+ * See {@link #setBreakIterator} for more information.
+ *
* @return the BreakIterator set to restrict logic matches
* @see #setBreakIterator
* @see BreakIterator
* @stable ICU 2.0
*/
- public BreakIterator getBreakIterator()
- {
+ public BreakIterator getBreakIterator() {
return search_.breakIter();
}
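How a break iterator restricts matches can be sketched with the JDK's own `java.text.BreakIterator`. This is an assumption-laden illustration, not the ICU4J code path: it uses exact `String.indexOf` matching, and `BoundaryFilterSketch` and `findAll` are hypothetical names. A candidate match is kept only if both its start and its end fall on a boundary; otherwise it is rejected and the scan continues.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryFilterSketch {
    /** Returns start offsets of matches; a null breaker means no boundary check. */
    static List<Integer> findAll(String text, String pattern, BreakIterator breaker) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.isEmpty()) {
            return starts;
        }
        if (breaker != null) {
            breaker.setText(text);
        }
        for (int pos = text.indexOf(pattern); pos >= 0; pos = text.indexOf(pattern, pos + 1)) {
            int end = pos + pattern.length();
            // Accept the candidate only if it starts and ends on a boundary.
            if (breaker == null || (breaker.isBoundary(pos) && breaker.isBoundary(end))) {
                starts.add(pos);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "foxy fox";
        System.out.println(findAll(text, "fox", null));                                       // [0, 5]
        System.out.println(findAll(text, "fox", BreakIterator.getWordInstance(Locale.ROOT))); // [5]
    }
}
```

With the word iterator, the candidate at offset 0 is dropped because offset 3 (between "x" and "y") is not a word boundary.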
-
+
/**
- * Return the target text that is being searched.
- * @return target text being searched.
- * @see #setTarget
+ * Return the string text to be searched.
+ * @return text string to be searched.
* @stable ICU 2.0
*/
- public CharacterIterator getTarget()
- {
+ public CharacterIterator getTarget() {
return search_.text();
}
-
+
/**
* Returns the text that was matched by the most recent call to
- * {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.
- * If the iterator is not pointing at a valid match, for instance just
- * after construction or after <tt>DONE</tt> has been returned, an empty
- * String will be returned. See getMatchStart for more information
- * @see #getMatchStart
- * @see #getMatchLength
+ * {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.
+ * If the iterator is not pointing at a valid match (e.g. just after
+ * construction or after {@link #DONE} has been returned),
+ * returns an empty string.
+ *
+ * @return the substring in the target text of the most recent match,
+ * or null if there is no match currently.
* @see #first
* @see #next
* @see #previous
* @see #last
- * @see #DONE
- * @return the substring in the target text of the most recent match
* @stable ICU 2.0
*/
- public String getMatchedText()
- {
+ public String getMatchedText() {
if (search_.matchedLength() > 0) {
int limit = search_.matchedIndex_ + search_.matchedLength();
StringBuilder result = new StringBuilder(search_.matchedLength());
}
// miscellaneous public methods -----------------------------------------
-
+
/**
- * Search <b>forwards</b> in the target text for the next valid match,
- * starting the search from the current iterator position. The iterator is
- * adjusted so that its current index, as returned by {@link #getIndex},
- * is the starting position of the match if one was found. If a match is
- * found, the index of the match is returned, otherwise <tt>DONE</tt> is
- * returned. If overlapping mode is set, the beginning of the found match
- * can be before the end of the current match, if any.
- * @return The starting index of the next forward match after the current
- * iterator position, or
- * <tt>DONE</tt> if there are no more matches.
- * @see #getMatchStart
- * @see #getMatchLength
- * @see #getMatchedText
- * @see #following
- * @see #preceding
- * @see #previous
- * @see #first
- * @see #last
- * @see #DONE
+ * Returns the index of the next point at which the text matches the
+ * search pattern, starting from the current position.
+ * The iterator is adjusted so that its current index (as returned by
+ * {@link #getIndex}) is the match position if one was found.
+ * If a match is not found, {@link #DONE} will be returned and
+ * the iterator will be adjusted to a position after the end of the text
+ * string.
+ *
+ * @return The index of the next match after the current position,
+ * or {@link #DONE} if there are no more matches.
+ * @see #getIndex
* @stable ICU 2.0
*/
- public int next()
- {
+ public int next() {
int index = getIndex(); // offset = getOffset() in ICU4C
int matchindex = search_.matchedIndex_;
int matchlength = search_.matchedLength();
}
/**
- * Search <b>backwards</b> in the target text for the next valid match,
- * starting the search from the current iterator position. The iterator is
- * adjusted so that its current index, as returned by {@link #getIndex},
- * is the starting position of the match if one was found. If a match is
- * found, the index is returned, otherwise <tt>DONE</tt> is returned. If
- * overlapping mode is set, the end of the found match can be after the
- * beginning of the previous match, if any.
- * @return The starting index of the next backwards match after the current
- * iterator position, or
- * <tt>DONE</tt> if there are no more matches.
- * @see #getMatchStart
- * @see #getMatchLength
- * @see #getMatchedText
- * @see #following
- * @see #preceding
- * @see #next
- * @see #first
- * @see #last
- * @see #DONE
+ * Returns the index of the previous point at which the string text
+ * matches the search pattern, starting at the current position.
+ * The iterator is adjusted so that its current index (as returned by
+ * {@link #getIndex}) is the match position if one was found.
+ * If a match is not found, {@link #DONE} will be returned and
+ * the iterator will be adjusted to the index {@link #DONE}.
+ *
+ * @return The index of the previous match before the current position,
+ * or {@link #DONE} if there are no more matches.
+ * @see #getIndex
* @stable ICU 2.0
*/
- public int previous()
- {
+ public int previous() {
int index; // offset in ICU4C
if (search_.reset_) {
index = search_.endIndex(); // m_search_->textLength in ICU4C
/**
* Return true if the overlapping property has been set.
- * See setOverlapping(boolean) for more information.
+ * See {@link #setOverlapping(boolean)} for more information.
+ *
* @see #setOverlapping
* @return true if the overlapping property has been set, false otherwise
* @stable ICU 2.8
*/
- public boolean isOverlapping()
- {
+ public boolean isOverlapping() {
return search_.isOverlap_;
}
- //TODO: We should add APIs below to match ICU4C APIs
+ //TODO: We may add APIs below to match ICU4C APIs
// isCanonicalMatch
- // getElementComparison
/**
- * <p>
- * Resets the search iteration. All properties will be reset to their
- * default values.
- * </p>
- * <p>
- * If a forward iteration is initiated, the next search will begin at the
- * start of the target text. Otherwise, if a backwards iteration is initiated,
- * the next search will begin at the end of the target text.
- * </p>
- * @stable ICU 2.8
- */
- public void reset()
- {
+ * Resets the iteration.
+ * Search will begin at the start of the text string if a forward
+ * iteration is initiated before a backwards iteration. Otherwise if a
+ * backwards iteration is initiated before a forwards iteration, the
+ * search will begin at the end of the text string.
+ *
+ * @stable ICU 2.0
+ */
+ public void reset() {
setMatchNotFound();
setIndex(search_.beginIndex());
search_.isOverlap_ = false;
search_.isForwardSearching_ = true;
search_.reset_ = true;
}
-
+
/**
- * Return the index of the first <b>forward</b> match in the target text.
- * This method sets the iteration to begin at the start of the
- * target text and searches forward from there.
- * @return The index of the first forward match, or <code>DONE</code>
- * if there are no matches.
- * @see #getMatchStart
- * @see #getMatchLength
- * @see #getMatchedText
- * @see #following
- * @see #preceding
- * @see #next
- * @see #previous
- * @see #last
- * @see #DONE
+ * Returns the first index at which the string text matches the search
+ * pattern. The iterator is adjusted so that its current index (as
+ * returned by {@link #getIndex()}) is the match position if one
+ *
+ * was found.
+ * If a match is not found, {@link #DONE} will be returned and
+ * the iterator will be adjusted to the index {@link #DONE}.
+ * @return The character index of the first match, or
+ * {@link #DONE} if there are no matches.
+ *
+ * @see #getIndex
* @stable ICU 2.0
*/
- public final int first()
- {
+ public final int first() {
int startIdx = search_.beginIndex();
setIndex(startIdx);
return handleNext(startIdx);
}
/**
- * Return the index of the first <b>forward</b> match in target text that
- * is at or after argument <tt>position</tt>.
- * This method sets the iteration to begin at the specified
- * position in the the target text and searches forward from there.
- * @return The index of the first forward match, or <code>DONE</code>
- * if there are no matches.
- * @see #getMatchStart
- * @see #getMatchLength
- * @see #getMatchedText
- * @see #first
- * @see #preceding
- * @see #next
- * @see #previous
- * @see #last
- * @see #DONE
+ * Returns the first index greater than or equal to <tt>position</tt> at which the
+ * string text matches the search pattern. The iterator is adjusted so
+ * that its current index (as returned by {@link #getIndex()}) is the
+ * match position if one was found.
+ * If a match is not found, {@link #DONE} will be returned and the
+ * iterator will be adjusted to the index {@link #DONE}.
+ *
+ * @param position where search is to start from.
+ * @return The character index of the first match at or after
+ * <tt>position</tt>, or {@link #DONE} if there are no matches.
+ * @throws IndexOutOfBoundsException If position is outside the text
+ * range for searching.
+ * @see #getIndex
* @stable ICU 2.0
*/
- public final int following(int position)
- {
+ public final int following(int position) {
setIndex(position);
return handleNext(position);
}
-
+
/**
- * Return the index of the first <b>backward</b> match in target text.
- * This method sets the iteration to begin at the end of the
- * target text and searches backwards from there.
- * @return The starting index of the first backward match, or
- * <code>DONE</code> if there are no matches.
- * @see #getMatchStart
- * @see #getMatchLength
- * @see #getMatchedText
- * @see #first
- * @see #preceding
- * @see #next
- * @see #previous
- * @see #following
- * @see #DONE
+ * Returns the last index in the target text at which it matches the
+ * search pattern. The iterator is adjusted so that its current index
+ * (as returned by {@link #getIndex}) is the match position if one was
+ * found.
+ * If a match is not found, {@link #DONE} will be returned and
+ * the iterator will be adjusted to the index {@link #DONE}.
+ *
+ * @return The index of the last match, or {@link #DONE} if
+ * there are no matches.
+ * @see #getIndex
* @stable ICU 2.0
*/
- public final int last()
- {
+ public final int last() {
int endIdx = search_.endIndex();
setIndex(endIdx);
return handlePrevious(endIdx);
}
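Backward iteration does not always visit the same matches as forward iteration when overlap is disallowed. The sketch below, assuming plain `String.lastIndexOf` matching and using the hypothetical names `BackwardSketch` and `backwardMatches`, scans from the end of the text and reproduces the "aba" in "ababababa" case, where a forward non-overlapping scan finds offsets 0 and 4 but a backward one finds 6 and 2.

```java
import java.util.ArrayList;
import java.util.List;

class BackwardSketch {
    /** Collects match start offsets, scanning backward with String.lastIndexOf. */
    static List<Integer> backwardMatches(String text, String pattern, boolean allowOverlap) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.isEmpty()) {
            return starts; // an empty pattern never matches in this sketch
        }
        // The last possible start offset for a match that fits in the text.
        int from = text.length() - pattern.length();
        while (from >= 0) {
            int pos = text.lastIndexOf(pattern, from);
            if (pos < 0) {
                break;
            }
            starts.add(pos);
            // Overlapping: the next match may start anywhere before this one.
            // Non-overlapping: the next match must end at or before this start.
            from = allowOverlap ? pos - 1 : pos - pattern.length();
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(backwardMatches("ababababa", "aba", false)); // [6, 2]
        System.out.println(backwardMatches("ababababa", "aba", true));  // [6, 4, 2, 0]
    }
}
```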
-
+
/**
- * Return the index of the first <b>backwards</b> match in target
- * text that ends at or before argument <tt>position</tt>.
- * This method sets the iteration to begin at the argument
- * position index of the target text and searches backwards from there.
- * @return The starting index of the first backwards match, or
- * <code>DONE</code>
- * if there are no matches.
- * @see #getMatchStart
- * @see #getMatchLength
- * @see #getMatchedText
- * @see #first
- * @see #following
- * @see #next
- * @see #previous
- * @see #last
- * @see #DONE
+ * Returns the first index less than <tt>position</tt> at which the string
+ * text matches the search pattern. The iterator is adjusted so that its
+ * current index (as returned by {@link #getIndex}) is the match
+ * position if one was found. If a match is not found,
+ * {@link #DONE} will be returned and the iterator will be
+ * adjusted to the index {@link #DONE}.
+ * <p>
+ * When the overlapping option ({@link #isOverlapping}) is off, the last index of the
+ * result match is always less than <tt>position</tt>.
+ * When the overlapping option is on, the result match may span across
+ * <tt>position</tt>.
+ *
+ * @param position where search is to start from.
+ * @return The character index of the first match preceding
+ * <tt>position</tt>, or {@link #DONE} if there are
+ * no matches.
+ * @throws IndexOutOfBoundsException If position is outside the text
+ * range for searching.
+ * @see #getIndex
* @stable ICU 2.0
*/
- public final int preceding(int position)
- {
+ public final int preceding(int position) {
setIndex(position);
return handlePrevious(position);
}
// protected constructor ----------------------------------------------
-
+
/**
* Protected constructor for use by subclasses.
* Initializes the iterator with the argument target text for searching
* and sets the BreakIterator.
* See class documentation for more details on the use of the target text
- * and BreakIterator.
+ * and {@link BreakIterator}.
+ *
* @param target The target text to be searched.
* @param breaker A {@link BreakIterator} that is used to determine the
* boundaries of a logical match. This argument can be null.
/**
* Sets the length of the most recent match in the target text.
* Subclasses' handleNext() and handlePrevious() methods should call this
- * after they find a match in the target text.
+ * after they find a match in the target text.
+ *
* @param length new length to set
* @see #handleNext
* @see #handlePrevious
}
/**
+ * Abstract method which subclasses override to provide the mechanism
+ * for finding the next match in the target text. This allows different
+ * subclasses to provide different search algorithms.
* <p>
- * Abstract method that subclasses override to provide the mechanism
- * for finding the next <b>forwards</b> match in the target text. This
- * allows different subclasses to provide different search algorithms.
- * </p>
- * <p>
- * If a match is found, this function must call setMatchLength(int) to
- * set the length of the result match.
- * The iterator is adjusted so that its current index, as returned by
- * {@link #getIndex}, is the starting position of the match if one was
- * found. If a match is not found, <tt>DONE</tt> will be returned.
- * </p>
- * @param start index in the target text at which the forwards search
- * should begin.
- * @return the starting index of the next forwards match if found, DONE
- * otherwise
- * @see #setMatchLength(int)
- * @see #handlePrevious(int)
- * @see #DONE
+ * If a match is found, the implementation should return the index at
+ * which the match starts and should call
+ * {@link #setMatchLength} with the number of characters
+ * in the target text that make up the match. If no match is found, the
+ * method should return {@link #DONE}.
+ *
+ * @param start The index in the target text at which the search
+ * should start.
+ * @return index at which the match starts, or {@link #DONE} if no
+ * match is found
+ * @see #setMatchLength
* @stable ICU 2.0
*/
protected abstract int handleNext(int start);
-
+
/**
+ * Abstract method which subclasses override to provide the mechanism for
+ * finding the previous match in the target text. This allows different
+ * subclasses to provide different search algorithms.
* <p>
- * Abstract method which subclasses override to provide the mechanism
- * for finding the next <b>backwards</b> match in the target text.
- * This allows different
- * subclasses to provide different search algorithms.
- * </p>
- * <p>
- * If a match is found, this function must call setMatchLength(int) to
- * set the length of the result match.
- * The iterator is adjusted so that its current index, as returned by
- * {@link #getIndex}, is the starting position of the match if one was
- * found. If a match is not found, <tt>DONE</tt> will be returned.
- * </p>
- * @param startAt index in the target text at which the backwards search
- * should begin.
- * @return the starting index of the next backwards match if found,
- * DONE otherwise
- * @see #setMatchLength(int)
- * @see #handleNext(int)
- * @see #DONE
+ * If a match is found, the implementation should return the index at
+ * which the match starts and should call
+ * {@link #setMatchLength} with the number of characters
+ * in the target text that make up the match. If no match is found, the
+ * method should return {@link #DONE}.
+ *
+ * @param startAt The index in the target text at which the search
+ * should start.
+ * @return index at which the match starts, or {@link #DONE} if no
+ * match is found
+ * @see #setMatchLength
* @stable ICU 2.0
*/
protected abstract int handlePrevious(int startAt);
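The contract between the public iteration methods and the abstract handlers can be sketched as a small template-method design. Everything below is a simplified, hypothetical stand-in (`MiniSearchIterator`, `NaiveSearch`, `findAll`) and not the ICU4J implementation: it covers forward search only, uses exact `String.indexOf` matching, and always behaves as if overlapping mode were off.

```java
import java.util.ArrayList;
import java.util.List;

class SearchProtocolSketch {
    static final int DONE = -1;

    /** Hypothetical, simplified base class: forward iteration only. */
    abstract static class MiniSearchIterator {
        protected final String text;
        private int index = 0;          // current iteration position
        private int matchStart = DONE;  // start of the most recent match
        private int matchLength = 0;    // length of the most recent match

        protected MiniSearchIterator(String text) {
            this.text = text;
        }

        /** Subclasses call this from handleNext after finding a match. */
        protected void setMatchLength(int length) {
            matchLength = length;
        }

        int getMatchLength() {
            return matchLength;
        }

        final int first() {
            return search(0);
        }

        final int next() {
            // Resume after the previous match, or at the current index if none.
            int from = (matchStart == DONE) ? index : matchStart + Math.max(matchLength, 1);
            return search(from);
        }

        private int search(int from) {
            matchLength = 0;
            matchStart = handleNext(from);
            index = (matchStart == DONE) ? text.length() : matchStart;
            return matchStart;
        }

        /** Returns the start of the next match at or after start, or DONE. */
        protected abstract int handleNext(int start);
    }

    /** One possible subclass: exact substring matching. */
    static final class NaiveSearch extends MiniSearchIterator {
        private final String pattern;

        NaiveSearch(String pattern, String text) {
            super(text);
            this.pattern = pattern;
        }

        @Override
        protected int handleNext(int start) {
            int pos = pattern.isEmpty() ? -1 : text.indexOf(pattern, start);
            if (pos < 0) {
                return DONE;
            }
            setMatchLength(pattern.length());
            return pos;
        }
    }

    static List<Integer> findAll(String pattern, String text) {
        List<Integer> starts = new ArrayList<>();
        MiniSearchIterator iter = new NaiveSearch(pattern, text);
        for (int pos = iter.first(); pos != DONE; pos = iter.next()) {
            starts.add(pos);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Same target and pattern as the class documentation example.
        System.out.println(findAll("fox", "The quick brown fox jumped over the lazy fox")); // [16, 41]
    }
}
```

Keeping the match bookkeeping in the base class means a subclass only has to locate a match and report its length, which is the division of labor the handler Javadoc describes.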
*/
STANDARD_ELEMENT_COMPARISON,
/**
- * <p>Collation element comparison is modified to effectively provide behavior
- * between the specified strength and strength - 1.</p>
- *
- * <p>Collation elements in the pattern that have the base weight for the specified
+ * Collation element comparison is modified to effectively provide behavior
+ * between the specified strength and strength - 1.
+ * <p>
+ * Collation elements in the pattern that have the base weight for the specified
* strength are treated as "wildcards" that match an element with any other
* weight at that collation level in the searched text. For example, with a
* secondary-strength English collator, a plain 'e' in the pattern will match
* a plain e or an e with any diacritic in the searched text, but an e with
* diacritic in the pattern will only match an e with the same diacritic in
- * the searched text.<p>
+ * the searched text.
*
* @draft ICU 53
* @provisional This API might change or be removed in a future release.
PATTERN_BASE_WEIGHT_IS_WILDCARD,
/**
- * <p>Collation element comparison is modified to effectively provide behavior
- * between the specified strength and strength - 1.</p>
- *
- * <p>Collation elements in either the pattern or the searched text that have the
+ * Collation element comparison is modified to effectively provide behavior
+ * between the specified strength and strength - 1.
+ * <p>
+ * Collation elements in either the pattern or the searched text that have the
* base weight for the specified strength are treated as "wildcards" that match
* an element with any other weight at that collation level. For example, with
* a secondary-strength English collator, a plain 'e' in the pattern will match
* a plain e or an e with any diacritic in the searched text, but an e with
* diacritic in the pattern will only match an e with the same diacritic or a
- * plain e in the searched text.</p>
+ * plain e in the searched text.
*
* @draft ICU 53
* @provisional This API might change or be removed in a future release.
}
/**
- * <p>Sets the collation element comparison type.</p>
- *
- * <p>The default comparison type is {@link ElementComparisonType#STANDARD_ELEMENT_COMPARISON}.</p>
+ * Sets the collation element comparison type.
+ * <p>
+ * The default comparison type is {@link ElementComparisonType#STANDARD_ELEMENT_COMPARISON}.
*
* @see ElementComparisonType
* @see #getElementComparisonType()
}
/**
- * <p>Returns the collation element comparison type.</p>
+ * Returns the collation element comparison type.
*
* @see ElementComparisonType
* @see #setElementComparisonType(ElementComparisonType)
import com.ibm.icu.util.ULocale;
// Java porting note:
-// ICU4C implementation contains dead code in many places.
+//
+// ICU4C implementation contains dead code in many places.
// While porting ICU4C linear search implementation, these dead codes
// were not fully ported. The code block tagged by "// *** Boyer-Moore ***"
// are those dead code, still available in ICU4C.
-//TODO: ICU4C implementation does not seem to handle UCharacterIterator pointing
+// ICU4C implementation does not seem to handle UCharacterIterator pointing to
// a fragment of text properly. ICU4J uses CharacterIterator to navigate through
// the input text. We need to carefully review the code ported from ICU4C
// assuming the start index is 0.
-//TODO: ICU4C implementation initializes pattern.CE and pattern.PCE. It looks
+// ICU4C implementation initializes pattern.CE and pattern.PCE. It looks like
// CE is no longer used, except a few places checking CELength. It looks this
// is a left over from already disable Boyer-Moore search code. This Java implementation
// preserves the code, but we should clean them up later.
-//TODO: We need to update document to remove the term "Boyer-Moore search".
-
-/**
- * <p>
- * <code>StringSearch</code> is the concrete subclass of
- * <code>SearchIterator</code> that provides language-sensitive text searching
- * based on the comparison rules defined in a {@link RuleBasedCollator} object.
- * </p>
- * <p>
- * <code>StringSearch</code> uses a version of the fast Boyer-Moore search
- * algorithm that has been adapted to work with the large character set of
- * Unicode. Refer to
- * <a href="http://www.icu-project.org/docs/papers/efficient_text_searching_in_java.html">
- * "Efficient Text Searching in Java"</a>, published in the
- * <i>Java Report</i> on February, 1999, for further information on the
- * algorithm.
- * </p>
- * <p>
- * Users are also strongly encouraged to read the section on
- * <a href="http://www.icu-project.org/userguide/searchString.html">
- * String Search</a> and
- * <a href="http://www.icu-project.org/userguide/Collate_Intro.html">
- * Collation</a> in the user guide before attempting to use this class.
- * </p>
- * <p>
- * String searching becomes a little complicated when accents are encountered at
- * match boundaries. If a match is found and it has preceding or trailing
- * accents not part of the match, the result returned will include the
- * preceding accents up to the first base character, if the pattern searched
- * for starts an accent. Likewise,
- * if the pattern ends with an accent, all trailing accents up to the first
- * base character will be included in the result.
- * </p>
- * <p>
- * For example, if a match is found in target text "a\u0325\u0300" for
- * the pattern
- * "a\u0325", the result returned by StringSearch will be the index 0 and
- * length 3 <0, 3>. If a match is found in the target
- * "a\u0325\u0300"
- * for the pattern "\u0300", then the result will be index 1 and length 2
- * <1, 2>.
- * </p>
- * <p>
- * In the case where the decomposition mode is on for the RuleBasedCollator,
- * all matches that starts or ends with an accent will have its results include
- * preceding or following accents respectively. For example, if pattern "a" is
- * looked for in the target text "á\u0325", the result will be
- * index 0 and length 2 <0, 2>.
- * </p>
- * <p>
- * The StringSearch class provides two options to handle accent matching
- * described below:
- * </p>
+/**
+ *
+ * <tt>StringSearch</tt> is a {@link SearchIterator} that provides
+ * language-sensitive text searching based on the comparison rules defined
+ * in a {@link RuleBasedCollator} object.
+ * <tt>StringSearch</tt> handles language-specific peculiarities; for example,
+ * with the German collator, the characters ß and SS will match
+ * if case is chosen to be ignored.
+ * See the <a href="http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm">
+ * "ICU Collation Design Document"</a> for more information.
* <p>
- * Let S' be the sub-string of a text string S between the offsets start and
- * end <start, end>.
- * <br>
- * A pattern string P matches a text string S at the offsets <start,
- * length>
+ * There are two match options to choose from:<br>
+ * Let S' be the sub-string of a text string S between the offsets start and
+ * end [start, end].
* <br>
+ * A pattern string P matches a text string S at the offsets [start, end]
* if
* <pre>
- * option 1. P matches some canonical equivalent string of S'. Suppose the
- * RuleBasedCollator used for searching has a collation strength of
- * TERTIARY, all accents are non-ignorable. If the pattern
- * "a\u0300" is searched in the target text
- * "a\u0325\u0300",
- * a match will be found, since the target text is canonically
- * equivalent to "a\u0300\u0325"
- * option 2. P matches S' and if P starts or ends with a combining mark,
- * there exists no non-ignorable combining mark before or after S'
- * in S respectively. Following the example above, the pattern
- * "a\u0300" will not find a match in "a\u0325\u0300",
- * since
- * there exists a non-ignorable accent '\u0325' in the middle of
- * 'a' and '\u0300'. Even with a target text of
- * "a\u0300\u0325" a match will not be found because of the
- * non-ignorable trailing accent \u0325.
+ * option 1. Some canonical equivalent of P matches some canonical equivalent
+ * of S'
+ * option 2. P matches S' and if P starts or ends with a combining mark,
+ *              there exists no non-ignorable combining mark before or after S'
+ * in S respectively.
* </pre>
- * Option 2. will be the default mode for dealing with boundary accents unless
- * specified via the API setCanonical(boolean).
- * One restriction is to be noted for option 1. Currently there are no
- * composite characters that consists of a character with combining class > 0
- * before a character with combining class == 0. However, if such a character
- * exists in the future, the StringSearch may not work correctly with option 1
- * when such characters are encountered.
- * </p>
+ * Option 2 is the default.
* <p>
- * <tt>SearchIterator</tt> provides APIs to specify the starting position
- * within the text string to be searched, e.g. <tt>setIndex</tt>,
- * <tt>preceding</tt> and <tt>following</tt>. Since the starting position will
- * be set as it is specified, please take note that there are some dangerous
- * positions which the search may render incorrect results:
+ * This search has APIs similar to that of other text iteration mechanisms
+ * such as the break iterators in {@link BreakIterator}. Using these
+ * APIs, it is easy to scan through text looking for all occurrences of
+ * a given pattern. This search iterator allows changing of direction by
+ * calling {@link #reset} followed by {@link #next} or {@link #previous}.
+ * Though a direction change can occur without calling {@link #reset} first,
+ * this operation comes with some speed penalty.
+ * Matches found in the forward direction will generally be the same as those
+ * found in the backward direction, in reverse order.
+ * <p>
+ * {@link SearchIterator} provides APIs to specify the starting position
+ * within the text string to be searched, e.g. {@link SearchIterator#setIndex setIndex},
+ * {@link SearchIterator#preceding preceding} and {@link SearchIterator#following following}. Since the
+ * starting position will be set as it is specified, please take note that
+ * there are some danger points at which the search may return incorrect
+ * results:
* <ul>
- * <li> The midst of a substring that requires decomposition.
+ * <li> The midst of a substring that requires normalization.
* <li> If the following match is to be found, the position should not be the
- * second character which requires to be swapped with the preceding
- * character. Vice versa, if the preceding match is to be found,
- * position to search from should not be the first character which
+ * second character which requires to be swapped with the preceding
+ * character. Vice versa, if the preceding match is to be found,
+ * position to search from should not be the first character which
* requires to be swapped with the next character. E.g certain Thai and
* Lao characters require swapping.
- * <li> If a following pattern match is to be found, any position within a
- * contracting sequence except the first will fail. Vice versa if a
- * preceding pattern match is to be found, a invalid starting point
+ * <li> If a following pattern match is to be found, any position within a
+ * contracting sequence except the first will fail. Vice versa if a
+ *      preceding pattern match is to be found, an invalid starting point
* would be any character within a contracting sequence except the last.
* </ul>
- * </p>
* <p>
- * Though collator attributes will be taken into consideration while
- * performing matches, there are no APIs provided in StringSearch for setting
- * and getting the attributes. These attributes can be set by getting the
- * collator from <tt>getCollator</tt> and using the APIs in
- * <tt>com.ibm.icu.text.Collator</tt>. To update StringSearch to the new
- * collator attributes, <tt>reset()</tt> or
- * <tt>setCollator(RuleBasedCollator)</tt> has to be called.
- * </p>
+ * A {@link BreakIterator} can be used if only matches at logical breaks are desired.
+ * Using a {@link BreakIterator} will only give you results that exactly match the
+ * boundaries given by the {@link BreakIterator}. For instance the pattern "e" will
+ * not be found in the string "\u00e9" if a character break iterator is used.
* <p>
- * Consult the
- * <a href="http://www.icu-project.org/userguide/searchString.html">
- * String Search</a> user guide and the <code>SearchIterator</code>
- * documentation for more information and examples of use.
- * </p>
+ * Options are provided to handle overlapping matches.
+ * E.g. in English, overlapping matches produce the results 0 and 2
+ * for the pattern "abab" in the text "ababab", whereas mutually
+ * exclusive matches only produce the result 0.
+ * <p>
+ * Though collator attributes will be taken into consideration while
+ * performing matches, there are no APIs here for setting and getting the
+ * attributes. These attributes can be set by getting the collator
+ * from {@link #getCollator} and using the APIs in {@link RuleBasedCollator}.
+ * Finally, to update <tt>StringSearch</tt> to the new collator attributes,
+ * {@link #reset} has to be called.
+ * <p>
+ * Restriction: <br>
+ * Currently there are no composite characters that consist of a
+ * character with combining class > 0 before a character with combining
+ * class == 0. However, if such a character exists in the future,
+ * <tt>StringSearch</tt> does not guarantee the results for option 1.
+ * <p>
+ * Consult the {@link SearchIterator} documentation for information on
+ * and examples of how to use instances of this class to implement text
+ * searching.
* <p>
- * This class is not subclassable
+ * Note, <tt>StringSearch</tt> is not to be subclassed.
* </p>
* @see SearchIterator
* @see RuleBasedCollator
* @author Laura Werner, synwee
- * @stable ICU 2.0
+ * @since ICU 2.0
*/
// internal notes: all methods do not guarantee the correct status of the
// characteriterator. the caller has to maintain the original index position
public final class StringSearch extends SearchIterator {
/**
- * DONE is returned by previous() and next() after all valid matches have
- * been returned, and by first() and last() if there are no matches at all.
+ * DONE is returned by {@link #previous()} and {@link #next()} after all valid matches have
+ * been returned, and by {@link SearchIterator#first() first()} and
+ * {@link SearchIterator#last() last()} if there are no matches at all.
* @see #previous
* @see #next
* @stable ICU 2.0
/**
* Initializes the iterator to use the language-specific rules defined in
* the argument collator to search for argument pattern in the argument
- * target text. The argument breakiter is used to define logical matches.
+ * target text. The argument <code>breakiter</code> is used to define logical matches.
* See super class documentation for more details on the use of the target
- * text and BreakIterator.
+ * text and {@link BreakIterator}.
* @param pattern text to look for.
* @param target target text to search for pattern.
- * @param collator RuleBasedCollator that defines the language rules
+ * @param collator {@link RuleBasedCollator} that defines the language rules
* @param breakiter A {@link BreakIterator} that is used to determine the
* boundaries of a logical match. This argument can be null.
- * @exception IllegalArgumentException thrown when argument target is null,
+ * @throws IllegalArgumentException thrown when argument target is null,
* or of length 0
* @see BreakIterator
* @see RuleBasedCollator
- * @see SearchIterator
* @stable ICU 2.0
*/
public StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator,
/**
* Initializes the iterator to use the language-specific rules defined in
* the argument collator to search for argument pattern in the argument
- * target text. No BreakIterators are set to test for logical matches.
+ * target text. No {@link BreakIterator}s are set to test for logical matches.
* @param pattern text to look for.
* @param target target text to search for pattern.
- * @param collator RuleBasedCollator that defines the language rules
- * @exception IllegalArgumentException thrown when argument target is null,
+ * @param collator {@link RuleBasedCollator} that defines the language rules
+ * @throws IllegalArgumentException thrown when argument target is null,
* or of length 0
* @see RuleBasedCollator
- * @see SearchIterator
* @stable ICU 2.0
*/
public StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator) {
* Initializes the iterator to use the language-specific rules and
* break iterator rules defined in the argument locale to search for
* argument pattern in the argument target text.
- * See super class documentation for more details on the use of the target
- * text and BreakIterator.
* @param pattern text to look for.
* @param target target text to search for pattern.
* @param locale locale to use for language and break iterator rules
- * @exception IllegalArgumentException thrown when argument target is null,
+ * @throws IllegalArgumentException thrown when argument target is null,
* or of length 0. ClassCastException thrown if the collator for
* the specified locale is not a RuleBasedCollator.
- * @see BreakIterator
- * @see RuleBasedCollator
- * @see SearchIterator
* @stable ICU 2.0
*/
public StringSearch(String pattern, CharacterIterator target, Locale locale) {
* break iterator rules defined in the argument locale to search for
* argument pattern in the argument target text.
* See super class documentation for more details on the use of the target
- * text and BreakIterator.
+ * text and {@link BreakIterator}.
* @param pattern text to look for.
* @param target target text to search for pattern.
- * @param locale ulocale to use for language and break iterator rules
- * @exception IllegalArgumentException thrown when argument target is null,
+ * @param locale locale to use for language and break iterator rules
+ * @throws IllegalArgumentException thrown when argument target is null,
* or of length 0. ClassCastException thrown if the collator for
* the specified locale is not a RuleBasedCollator.
* @see BreakIterator
/**
* Initializes the iterator to use the language-specific rules and
* break iterator rules defined in the default locale to search for
- * argument pattern in the argument target text.
- * See super class documentation for more details on the use of the target
- * text and BreakIterator.
+ * argument pattern in the argument target text.
* @param pattern text to look for.
* @param target target text to search for pattern.
- * @exception IllegalArgumentException thrown when argument target is null,
+ * @throws IllegalArgumentException thrown when argument target is null,
* or of length 0. ClassCastException thrown if the collator for
* the default locale is not a RuleBasedCollator.
- * @see BreakIterator
- * @see RuleBasedCollator
- * @see SearchIterator
* @stable ICU 2.0
*/
public StringSearch(String pattern, String target) {
}
/**
+ * Gets the {@link RuleBasedCollator} used for the language rules.
* <p>
- * Gets the RuleBasedCollator used for the language rules.
- * </p>
- * <p>
- * Since StringSearch depends on the returned RuleBasedCollator, any
- * changes to the RuleBasedCollator result should follow with a call to
- * either StringSearch.reset() or
- * StringSearch.setCollator(RuleBasedCollator) to ensure the correct
- * search behaviour.
+ * Since <tt>StringSearch</tt> depends on the returned {@link RuleBasedCollator}, any
+     * changes to the returned {@link RuleBasedCollator} should be followed by a call to
+ * either {@link #reset()} or {@link #setCollator(RuleBasedCollator)} to ensure the correct
+ * search behavior.
* </p>
- * @return RuleBasedCollator used by this StringSearch
+ * @return {@link RuleBasedCollator} used by this <tt>StringSearch</tt>
* @see RuleBasedCollator
* @see #setCollator
* @stable ICU 2.0
}
/**
+ * Sets the {@link RuleBasedCollator} to be used for language-specific searching.
* <p>
- * Sets the RuleBasedCollator to be used for language-specific searching.
- * </p>
- * <p>
- * This method causes internal data such as Boyer-Moore shift tables
- * to be recalculated, but the iterator's position is unchanged.
- * </p>
- * @param collator to use for this StringSearch
- * @exception IllegalArgumentException thrown when collator is null
+ * The iterator's position will not be changed by this method.
+ * @param collator to use for this <tt>StringSearch</tt>
+ * @throws IllegalArgumentException thrown when collator is null
* @see #getCollator
* @stable ICU 2.0
*/
}
/**
- * Returns the pattern for which StringSearch is searching for.
+     * Returns the pattern that <tt>StringSearch</tt> is searching for.
* @return the pattern searched for
* @stable ICU 2.0
*/
}
/**
- * <p>
* Set the pattern to search for.
- * </p>
- * <p>
- * This method causes internal data such as Boyer-Moore shift tables
- * to be recalculated, but the iterator's position is unchanged.
- * </p>
+ * The iterator's position will not be changed by this method.
* @param pattern for searching
* @see #getPattern
* @exception IllegalArgumentException thrown if pattern is null or of
}
/**
- * <p>
* Set the canonical match mode. See class documentation for details.
* The default setting for this property is false.
- * </p>
* @param allowCanonical flag indicator if canonical matches are allowed
* @see #isCanonical
* @stable ICU 2.8
}
/**
- * Set the target text to be searched. Text iteration will hence begin at
- * the start of the text string. This method is useful if you want to
- * re-use an iterator to search within a different body of text.
- * @param text new text iterator to look for match,
- * @exception IllegalArgumentException thrown when text is null or has
- * 0 length
- * @see #getTarget
+ * {@inheritDoc}
* @stable ICU 2.8
*/
@Override
}
/**
- * Return the index in the target text where the iterator is currently
- * positioned at.
- * If the iteration has gone past the end of the target text or past
- * the beginning for a backwards search, {@link #DONE} is returned.
- * @return index in the target text where the iterator is currently
- * positioned at
+ * {@inheritDoc}
* @stable ICU 2.8
*/
@Override
}
/**
- * <p>
- * Sets the position in the target text which the next search will start
- * from to the argument. This method clears all previous states.
- * </p>
- * <p>
- * This method takes the argument position and sets the position in the
- * target text accordingly, without checking if position is pointing to a
- * valid starting point to begin searching.
- * </p>
- * <p>
- * Search positions that may render incorrect results are highlighted in
- * the class documentation.
- * </p>
- * @param position index to start next search from.
- * @exception IndexOutOfBoundsException thrown if argument position is out
- * of the target text range.
- * @see #getIndex
+ * {@inheritDoc}
* @stable ICU 2.8
*/
@Override
}
/**
- * <p>
- * Resets the search iteration. All properties will be reset to the
- * default value.
- * </p>
- * <p>
- * Search will begin at the start of the target text if a forward iteration
- * is initiated before a backwards iteration. Otherwise if a
- * backwards iteration is initiated before a forwards iteration, the search
- * will begin at the end of the target text.
- * </p>
- * <p>
- * Canonical match option will be reset to false, ie an exact match.
- * </p>
+ * {@inheritDoc}
* @stable ICU 2.8
*/
@Override
}
/**
- * <p>
- * Concrete method to provide the mechanism
- * for finding the next <b>forwards</b> match in the target text.
- * See super class documentation for its use.
- * </p>
- * @param position index in the target text at which the forwards search
- * should begin.
- * @return the starting index of the next forwards match if found, DONE
- * otherwise
- * @see #handlePrevious(int)
- * @see #DONE
+ * {@inheritDoc}
* @stable ICU 2.8
*/
@Override
}
/**
- * <p>
- * Concrete method to provide the mechanism
- * for finding the next <b>backwards</b> match in the target text.
- * See super class documentation for its use.
- * </p>
- * @param position index in the target text at which the backwards search
- * should begin.
- * @return the starting index of the next backwards match if found, DONE
- * otherwise
- * @see #handleNext(int)
- * @see #DONE
+ * {@inheritDoc}
* @stable ICU 2.8
*/
@Override