SCJP Study Guide:
API Contents
The java.util.regex Package
The Java 2 Platform, Standard Edition (J2SE), version 1.4, contains a package called java.util.regex, which enables the use of regular expressions. Regular expressions may contain metacharacters, which is what gives them their versatility.
The package includes classes for matching character sequences against patterns specified by regular expressions.
An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.
Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.
java.util.regex.Pattern Class
The java.util.regex.Pattern class provides a compiled representation of a regular expression.
A regular expression, specified as a string, must first be compiled into an instance of this class. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.
A typical invocation sequence is thus
Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();
A matches method is defined by this class as a convenience for when a regular expression is used just once. This method compiles an expression and matches an input sequence against it in a single invocation. The statement
boolean b = Pattern.matches("a*b", "aaaaab");
is equivalent to the three statements above, though for repeated matches it is less efficient since it does not allow the compiled pattern to be reused.
Instances of this class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class are not safe for such use.
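The reuse and thread-safety points above can be illustrated with a minimal sketch. The class name PatternReuseSketch and the helper isAStarB are hypothetical names chosen for this example; the pattern is compiled once and shared, while each call creates its own short-lived Matcher.

```java
import java.util.regex.Pattern;

public final class PatternReuseSketch {
    // Compiled once; Pattern instances are immutable and may be shared.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Each call creates a fresh Matcher, because matchers hold per-match state
    // and must not be shared between threads.
    static boolean isAStarB(CharSequence input) {
        return A_STAR_B.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = { "aaaaab", "b", "aab!", "" };
        for (String input : inputs) {
            System.out.println("\"" + input + "\" -> " + isAStarB(input));
        }
    }
}
```

Here "aaaaab" and "b" match, while "aab!" and the empty string do not, because matches() requires the whole input to match.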
The Pattern class defines an alternate compile method that accepts a set of flags affecting the way the pattern is matched. The flags parameter is a bit mask that may include any of the following public static fields:
- Pattern.CANON_EQ
- Pattern.CASE_INSENSITIVE
- Pattern.COMMENTS
- Pattern.DOTALL
- Pattern.MULTILINE
- Pattern.UNICODE_CASE
- Pattern.UNIX_LINES
You can get information about the instance of a Pattern class:
- public int flags() returns this pattern's match flags, that is, the flags that were specified when this pattern was compiled.
- public String pattern() returns the regular expression from which this pattern was compiled.
- public String toString() returns the string representation of this pattern. This is the regular expression from which this pattern was compiled.
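As a rough sketch of how the flag bit mask and the accessor methods fit together, the example below combines two flags with bitwise OR and then reads them back. The class name PatternFlagsSketch and the helper compileLoose are hypothetical names introduced for illustration.

```java
import java.util.regex.Pattern;

public final class PatternFlagsSketch {
    // Flags are combined with bitwise OR into a single int bit mask.
    static Pattern compileLoose(String regex) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    public static void main(String[] args) {
        Pattern p = compileLoose("hello.world");

        // CASE_INSENSITIVE ignores case; DOTALL lets '.' match the line terminator.
        System.out.println(p.matcher("HELLO\nWORLD").matches()); // true

        // pattern() and toString() both return the original expression.
        System.out.println(p.pattern());
        System.out.println(p.toString());

        // Individual flags are tested by masking the value returned by flags().
        System.out.println((p.flags() & Pattern.CASE_INSENSITIVE) != 0); // true
        System.out.println((p.flags() & Pattern.MULTILINE) != 0);        // false
    }
}
```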
Using the matches(String,CharSequence) Method
The Pattern class defines a convenient matches method
that allows you to quickly check whether an entire input string matches
a pattern. As with all public static methods, you should call matches
with its class name, such as Pattern.matches("\\d","1"); In this
example, the method returns true, because the digit "1" matches the regular
expression \d. The input "12" would not match, because \d describes exactly one digit.
public static boolean matches(String regex, CharSequence input) compiles the given regular expression and attempts to match the given input against it. An invocation of this convenience method of the form
Pattern.matches(regex, input);
behaves in exactly the same way as the expression
Pattern.compile(regex).matcher(input).matches()
If a pattern is to be used multiple times, compiling it once and reusing it will be more efficient than invoking this method each time.
If the expression's syntax is invalid, a PatternSyntaxException is thrown.
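The following sketch shows both behaviors described above: matches tests the whole input, and a malformed expression raises PatternSyntaxException. The class name MatchesSketch and the helper safeMatches are hypothetical; returning false for a bad expression is just one possible way to handle the exception.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public final class MatchesSketch {
    // Returns false instead of propagating the exception for a malformed regex.
    static boolean safeMatches(String regex, CharSequence input) {
        try {
            return Pattern.matches(regex, input);
        } catch (PatternSyntaxException e) {
            System.out.println("Bad regex: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(safeMatches("\\d", "1"));   // true: one digit
        System.out.println(safeMatches("\\d", "12"));  // false: the whole input must match
        System.out.println(safeMatches("\\d+", "12")); // true: one or more digits
        System.out.println(safeMatches("(\\d", "1"));  // false: unclosed group is a syntax error
    }
}
```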
Using the split() methods
- public String[] split(CharSequence input, int limit) splits the given input sequence around matches of this pattern.
The array returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence. The substrings in the array are in the order in which they occur in the input. If this pattern does not match any subsequence of the input then the resulting array has just one element, namely the input sequence in string form.
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is negative then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
The input "boo:and:foo", for example, yields the following results with these parameters:
Regex   Limit   Result
:       2       { "boo", "and:foo" }
:       5       { "boo", "and", "foo" }
:       -2      { "boo", "and", "foo" }
o       5       { "b", "", ":and:f", "", "" }
o       -2      { "b", "", ":and:f", "", "" }
o       0       { "b", "", ":and:f" }
- public String[] split(CharSequence input) splits the given input sequence around matches of this pattern. It returns the array of strings computed by splitting the input around matches of this pattern. This method works as if by invoking the two-argument split method with the given input sequence and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
The split method is a great tool for gathering the text that lies
on either side of the pattern that's been matched. As shown below in the SplitTest
code, the split method could extract the words "one two three
four five" from the string "one1two2three3four4five":
import java.util.regex.*;

public final class SplitTest {
    private static final String REGEX = "\\d";
    private static final String INPUT = "one1two2three3four4five";

    public static void main(String[] argv) {
        Pattern p = Pattern.compile(REGEX);
        // Each single digit acts as a delimiter between the words.
        String[] items = p.split(INPUT);
        for (int i = 0; i < items.length; i++) {
            System.out.println(items[i]);
        }
    }
}
OUTPUT:
one
two
three
four
five
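The effect of the limit argument on the "boo:and:foo" input described earlier can be checked with a small sketch. The class name SplitLimitSketch and its split helper are hypothetical wrappers around Pattern.split(CharSequence, int).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public final class SplitLimitSketch {
    // Thin wrapper so that each call reads like a row of the limit table.
    static String[] split(String regex, String input, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        String input = "boo:and:foo";
        int[] limits = { 2, 5, -2, 0 };
        for (int limit : limits) {
            // Positive limits cap the array length, negative limits keep
            // trailing empty strings, and zero discards them.
            System.out.println(": " + limit + " -> "
                    + Arrays.toString(split(":", input, limit)));
            System.out.println("o " + limit + " -> "
                    + Arrays.toString(split("o", input, limit)));
        }
    }
}
```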
Create a Matcher object
- public Matcher matcher(CharSequence input) creates a new matcher that will match the given input against the pattern on which the method is invoked.
java.util.regex.Matcher Class
A Matcher is an engine that performs match operations on a character sequence by interpreting a Pattern.
A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations:
- public boolean matches() attempts to match the entire region (by default, the entire input sequence) against the pattern. It returns true if, and only if, the entire region matches this matcher's pattern. If the match succeeds then more information can be obtained via the start, end, and group methods.
- public boolean lookingAt() attempts to match the input sequence, starting at the beginning of the region, against the pattern. It returns true if, and only if, a prefix of the input sequence matches this matcher's pattern. Like the matches method, this method always starts at the beginning of the region; unlike that method, it does not require that the entire region be matched.
- public boolean find() attempts to find the next subsequence of the input sequence that matches the pattern. It returns true if, and only if, a subsequence of the input sequence matches this matcher's pattern. This method starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.
- public boolean find(int start) resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index. It returns true if, and only if, a subsequence of the input sequence starting at the given index matches this matcher's pattern. If the match succeeds then more information can be obtained via the start, end, and group methods, and subsequent invocations of the find() method will start at the first character not matched by this match. If start is less than zero or greater than the length of the input sequence, an IndexOutOfBoundsException is thrown.
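The differences between these match operations are easiest to see side by side. The sketch below applies each of them to the same input; the class name MatchKindsSketch and the helper countMatches are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class MatchKindsSketch {
    // Counts matches by calling find() until it reports no further match.
    static int countMatches(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("foo");
        String input = "foo and foo";
        Matcher m = p.matcher(input);

        System.out.println(m.matches());   // false: the whole input is not just "foo"
        System.out.println(m.lookingAt()); // true: the input starts with "foo"
        System.out.println(m.find(1));     // true: resets, then searches from index 1
        System.out.println(m.start());     // 8: start of the second "foo"
        System.out.println(countMatches(p, input)); // 2
    }
}
```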
Each of these methods returns a boolean indicating success or failure. If the match succeeds then more information can be obtained via the start, end, and group methods. Index methods provide useful index values that show precisely where the match was found in the input string.
- public int start() returns the start index of the previous match (that is, the index of the first character matched). An IllegalStateException is thrown if no match has yet been attempted, or if the previous match operation failed.
- public int start(int group) returns the start index of the subsequence captured by the given group during the previous match operation. Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.start(0) is equivalent to m.start(). It returns the index of the first character captured by the group, or -1 if the match was successful but the group itself did not match anything. An IllegalStateException is thrown if no match has yet been attempted, or if the previous match operation failed. An IndexOutOfBoundsException is thrown if there is no capturing group in the pattern with the given index.
- public int end() returns the offset after the last character matched. An IllegalStateException is thrown if no match has yet been attempted, or if the previous match operation failed.
- public int end(int group) returns the offset after the last character of the subsequence captured by the given group during the previous match operation. Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.end(0) is equivalent to m.end(). It returns the offset after the last character captured by the group, or -1 if the match was successful but the group itself did not match anything. An IllegalStateException is thrown if no match has yet been attempted, or if the previous match operation failed. An IndexOutOfBoundsException is thrown if there is no capturing group in the pattern with the given index.
- public String group() returns the (possibly empty) input subsequence matched by the previous match, in string form. For a matcher m with input sequence s, the expressions m.group() and s.substring(m.start(), m.end()) are equivalent. Note that some patterns, for example a*, match the empty string. This method will return the empty string when the pattern successfully matches the empty string in the input. An IllegalStateException is thrown if no match has yet been attempted, or if the previous match operation failed.
- public String group(int group) returns the input subsequence captured by the given group during the previous match operation. For a matcher m, input sequence s, and group index g, the expressions m.group(g) and s.substring(m.start(g), m.end(g)) are equivalent. Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group(). If the match was successful but the group specified failed to match any part of the input sequence, then null is returned. Note that some groups, for example (a*), match the empty string. This method will return the empty string when such a group successfully matches the empty string in the input. An IllegalStateException is thrown if no match has yet been attempted, or if the previous match operation failed. An IndexOutOfBoundsException is thrown if there is no capturing group in the pattern with the given index.
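A short sketch may help tie the start, end, and group methods together. It assumes a hypothetical "key=value" input format; the class name GroupSketch, the PAIR pattern, and the describe helper are illustrative and not part of the API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class GroupSketch {
    // Group 1 captures the key, group 2 the (possibly empty) value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\w*)");

    // Reports each group's text with its start index and end offset.
    static String describe(CharSequence input) {
        Matcher m = PAIR.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group(1) + "@" + m.start(1) + "-" + m.end(1)
                + " " + m.group(2) + "@" + m.start(2) + "-" + m.end(2);
    }

    public static void main(String[] args) {
        System.out.println(describe("x: name=duke;")); // name@3-7 duke@8-12

        // A group that takes no part in a successful match yields null and -1.
        Matcher alt = Pattern.compile("(a)|(b)").matcher("b");
        System.out.println(alt.matches()); // true
        System.out.println(alt.group(1));  // null
        System.out.println(alt.start(1));  // -1
        System.out.println(alt.group(2));  // b
    }
}
```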
A matcher finds matches in a subset of its input called the region. By
default, the region contains all of the matcher's input. The region can be
modified via the region method and queried via the regionStart
and regionEnd methods. The way that the region boundaries interact
with some pattern constructs can be changed. See useAnchoringBounds
and useTransparentBounds
for more details.
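As a minimal sketch of how a region narrows what a matcher sees, the example below restricts matching to part of the input. The class name RegionSketch and the helper regionMatches are hypothetical; region, regionStart, and regionEnd are the Matcher methods named above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class RegionSketch {
    // Applies matches() to the region [start, end) instead of the whole input.
    static boolean regionMatches(Pattern p, CharSequence input, int start, int end) {
        Matcher m = p.matcher(input);
        m.region(start, end);
        return m.matches();
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        String input = "id=12345;";

        Matcher m = digits.matcher(input);
        System.out.println(m.regionStart() + ".." + m.regionEnd()); // 0..9 by default

        System.out.println(regionMatches(digits, input, 3, 8)); // true: region is "12345"
        System.out.println(regionMatches(digits, input, 0, 8)); // false: region is "id=12345"
    }
}
```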
This class also defines methods for replacing matched subsequences with new
strings whose contents can, if desired, be computed from the match result. The
appendReplacement and appendTail methods can be used in
tandem in order to collect the result into an existing string buffer, or the
more convenient replaceAll method can be used to create a string
in which every matching subsequence in the input sequence is replaced.
- public Matcher appendReplacement(StringBuffer sb, String replacement) implements a non-terminal append-and-replace step.
This method performs the following actions:
1. It reads characters from the input sequence, starting at the append position, and appends them to the given string buffer. It stops after reading the last character preceding the previous match, that is, the character at index start() - 1.
2. It appends the given replacement string to the string buffer.
3. It sets the append position of this matcher to the index of the last character matched, plus one, that is, to end().
The replacement string may contain references to subsequences captured during the previous match: each occurrence of $g will be replaced by the result of evaluating group(g). The first number after the $ is always treated as part of the group reference. Subsequent numbers are incorporated into g if they would form a legal group reference. Only the numerals '0' through '9' are considered as potential components of the group reference. If the second group matched the string "foo", for example, then passing the replacement string "$2bar" would cause "foobar" to be appended to the string buffer. A dollar sign ($) may be included as a literal in the replacement string by preceding it with a backslash (\$).
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
- public StringBuffer appendTail(StringBuffer sb) implements a terminal append-and-replace step. It returns the target string buffer. This method reads characters from the input sequence, starting at the append position, and appends them to the given string buffer. It is intended to be invoked after one or more invocations of the appendReplacement method in order to copy the remainder of the input sequence.
import java.util.regex.*;

public class ReplacementTest {
    private static final String REGEX = "cat";
    private static final String INPUT = "one cat two cats in the yard";
    private static final String REPLACE = "dog";

    public static void main(String[] argv) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT); // get a matcher object
        StringBuffer sb = new StringBuffer();
        // Copy the text before each match, then the replacement.
        while (m.find()) {
            m.appendReplacement(sb, REPLACE);
        }
        // Copy whatever follows the last match.
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}
Output:
one dog two dogs in the yard
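The append-and-replace loop becomes more useful when the replacement is computed from each match or refers to a captured group. The sketch below shows both cases; the class name AppendReplacementSketch, the helpers doubleNumbers and toDollarPrices, and the "N USD" input format are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class AppendReplacementSketch {
    // Replaces every integer with twice its value, computed per match.
    static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        m.appendTail(sb); // copy the text after the last match
        return sb.toString();
    }

    // Rewrites "5 USD" as "$5": \$ is a literal dollar sign, $1 is group 1.
    static String toDollarPrices(String input) {
        Matcher m = Pattern.compile("(\\d+) USD").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, "\\$$1");
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples, 10 pears"));       // 6 apples, 20 pears
        System.out.println(toDollarPrices("costs 5 USD and 12 USD"));  // costs $5 and $12
    }
}
```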
- public String replaceAll(String replacement) replaces every subsequence of the input sequence that matches the pattern with the given replacement string.
This method first resets this matcher. It then scans the input sequence looking for matches of the pattern. Characters that are not part of any match are appended directly to the result string; each match is replaced in the result by the replacement string. The replacement string may contain references to captured subsequences as in the appendReplacement method.
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
Given the regular expression a*b, the input "aabfooaabfooabfoob", and the replacement string "-", an invocation of this method on a matcher for that expression would yield the string "-foo-foo-foo-". Invoking this method changes this matcher's state. If the matcher is to be used in further matching operations then it should first be reset. It returns the string constructed by replacing each matching subsequence by the replacement string, substituting captured subsequences as needed.
- public String replaceFirst(String replacement) replaces the first subsequence of the input sequence that matches the pattern with the given replacement string. This method first resets this matcher. It then scans the input sequence looking for a match of the pattern. Characters that are not part of the match are appended directly to the result string; the match is replaced in the result by the replacement string. The replacement string may contain references to captured subsequences as in the appendReplacement method. Given the regular expression dog, the input "zzzdogzzzdogzzz", and the replacement string "cat", an invocation of this method on a matcher for that expression would yield the string "zzzcatzzzdogzzz". Invoking this method changes this matcher's state. If the matcher is to be used in further matching operations then it should first be reset. It returns the string constructed by replacing the first matching subsequence by the replacement string, substituting captured subsequences as needed. A NullPointerException is thrown if replacement is null.
import java.util.regex.*;

public class ReplacementTest {
    private static final String REGEX = "dog";
    private static final String INPUT = "The dog says meow. All dogs say meow.";
    private static final String REPLACE = "cat";

    public static void main(String[] argv) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT); // get a matcher object
        // replaceAll replaces every match, not just the first one.
        String result = m.replaceAll(REPLACE);
        System.out.println(result);
    }
}
Output:
The cat says meow. All cats say meow.
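The two examples quoted in the method descriptions above can be reproduced with the following sketch, which also shows one way to pass a replacement containing a dollar sign literally. The class name ReplaceSketch and the helpers all and first are hypothetical; Matcher.quoteReplacement is a standard static method that escapes backslashes and dollar signs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class ReplaceSketch {
    static String all(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    static String first(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        System.out.println(all("a*b", "aabfooaabfooabfoob", "-"));  // -foo-foo-foo-
        System.out.println(first("dog", "zzzdogzzzdogzzz", "cat")); // zzzcatzzzdogzzz

        // Unquoted, "$5" would be read as a reference to capturing group 5.
        // quoteReplacement makes the dollar sign literal.
        System.out.println(all("cost", "the cost", Matcher.quoteReplacement("$5"))); // the $5
    }
}
```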
The explicit state of a matcher includes the start and end indices of the most recent successful match. It also includes the start and end indices of the input subsequence captured by each capturing group in the pattern as well as a total count of such subsequences. As a convenience, methods are also provided for returning these captured subsequences in string form.
The explicit state of a matcher is initially undefined; attempting to query any
part of it before a successful match will cause an IllegalStateException
to be thrown. The explicit state of a matcher is recomputed by every match
operation.
The implicit state of a matcher includes the input character sequence as well as
the append position, which is initially zero and is updated by the appendReplacement
method.
A matcher may be reset explicitly by invoking its reset() method
or, if a new input sequence is desired, its reset(CharSequence)
method. Resetting a matcher discards its explicit state information and sets
the append position to zero.
Instances of this class are not safe for use by multiple concurrent threads.
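The state rules above can be observed directly with a small sketch: querying a matcher before any match throws IllegalStateException, and reset() makes find() start over. The class name ResetSketch and both helper methods are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class ResetSketch {
    // Returns true if start() throws when no match has been attempted yet.
    static boolean queryBeforeMatchThrows() {
        Matcher m = Pattern.compile("a").matcher("a");
        try {
            m.start();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Records the match start after successive find() and reset calls.
    static int[] startsAcrossResets() {
        Matcher m = Pattern.compile("\\d").matcher("1a2");
        int[] starts = new int[4];
        m.find();
        starts[0] = m.start(); // first digit
        m.find();
        starts[1] = m.start(); // second digit
        m.reset();             // discard state, keep the same input
        m.find();
        starts[2] = m.start(); // first digit again
        m.reset("x9");         // discard state and switch to a new input
        m.find();
        starts[3] = m.start(); // digit in the new input
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(queryBeforeMatchThrows());               // true
        System.out.println(Arrays.toString(startsAcrossResets()));  // [0, 2, 0, 1]
    }
}
```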