SCJP Study Guide:
API Contents
Regular Expression
The Java 2 Platform, Standard Edition (J2SE), version 1.4, introduced the java.util.regex package, which adds support for regular expressions. This support includes metacharacters, which give regular expressions their versatility.
A regular expression (regex for short) is a string that describes or matches a set of strings according to certain syntax rules. Many text editors and utilities use regular expressions to search and manipulate bodies of text based on patterns, and many programming languages support them for string manipulation. Regular expressions are essentially a small programming language of their own.
The Regex Methods in the 'String' Class
The java.lang.String class provides additional methods for supporting regular expressions. All of these methods take a regular expression as a method parameter. A PatternSyntaxException is thrown if the given regular expression's syntax is invalid. If the regular expression is null, a NullPointerException is thrown (by replaceFirst, for example).
- public boolean matches(String regex) tests whether or not this string matches the given regular expression. It returns true if, and only if, this string matches the given regular expression.
An invocation of this method of the form str.matches(regex) yields exactly the same result as the expression:
Pattern.matches(regex, str)
- public String replaceFirst(String regex, String replacement) replaces the first substring of this string that matches the given regular expression with the given replacement. It returns the resulting String.
An invocation of this method of the form str.replaceFirst(regex, repl) yields exactly the same result as the expression:
Pattern.compile(regex).matcher(str).replaceFirst(repl)
- public String replaceAll(String regex, String replacement) replaces each substring of this string that matches the given regular expression with the given replacement. It returns the resulting String.
An invocation of this method of the form str.replaceAll(regex, repl) yields exactly the same result as the expression:
Pattern.compile(regex).matcher(str).replaceAll(repl)
- public String[] split(String regex, int limit) splits this string around matches of the given regular expression. The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
- public String[] split(String regex) splits this string around matches of the given regular expression. It returns the array of strings computed by splitting this string around matches of the given regular expression. This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
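As a small illustration of matches, the sketch below shows that the method succeeds only when the entire string fits the pattern, and that str.matches(regex) agrees with Pattern.matches(regex, str). The class name MatchesDemo and the helper fullMatch are made up for this example.

```java
import java.util.regex.Pattern;

public class MatchesDemo {
    // Hypothetical helper: true only if the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("Java", "Java"));          // true
        // A partial match is not enough: the pattern must cover the whole string.
        System.out.println(fullMatch("Java", "Java rocks"));    // false
        System.out.println(fullMatch("Java.*", "Java rocks"));  // true
        // The equivalent call through the Pattern class.
        System.out.println(Pattern.matches("Java.*", "Java rocks")); // true
    }
}
```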
Replacing Substrings and Splitting Strings Sometimes it is useful to replace parts of a string or to split a string into pieces. For this purpose, class String provides methods replaceAll, replaceFirst and split.
public class StringRegexDemo {
    public static void main(String[] args) {
        // replaceFirst replaces only the first match on each call.
        String str = "Java Java Java";
        System.out.println("Run replaceFirst method");
        str = str.replaceFirst("Java", "SCJP");
        System.out.println(str);
        str = str.replaceFirst("Java", "SCJP");
        System.out.println(str);
        str = str.replaceFirst("Java", "SCJP");
        System.out.println(str);

        // replaceAll replaces every match in one call.
        String str1 = "Java Java Java";
        System.out.println("Run replaceAll method");
        str1 = str1.replaceAll("Java", "SCJP");
        System.out.println(str1);

        // split on every digit; trailing empty strings are discarded.
        String str2 = "1a2b3c4d5e6f7g8h9i0";
        String[] str3 = str2.split("\\d");
        System.out.println("Run split() : " + str3.length);
        for (String s : str3)
            System.out.println("[split]" + s + "[/split]");

        // split with limit 4: the pattern is applied at most 3 times.
        str2 = "1a2b3c4d5e6f7g8h9i0";
        str3 = str2.split("\\d", 4);
        System.out.println("Run split() : " + str3.length);
        for (String s : str3)
            System.out.println("[split]" + s + "[/split]");
    }
}
The output is:
Run replaceFirst method
SCJP Java Java
SCJP SCJP Java
SCJP SCJP SCJP
Run replaceAll method
SCJP SCJP SCJP
Run split() : 10
[split][/split]
[split]a[/split]
[split]b[/split]
[split]c[/split]
[split]d[/split]
[split]e[/split]
[split]f[/split]
[split]g[/split]
[split]h[/split]
[split]i[/split]
Run split() : 4
[split][/split]
[split]a[/split]
[split]b[/split]
[split]c4d5e6f7g8h9i0[/split]
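The effect of a zero, negative, or positive limit on trailing empty strings can be sketched as follows. The class name SplitLimitDemo, the helper splitWithLimit, and the sample input "a,b,," are chosen only for illustration.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Hypothetical helper that simply forwards to String.split(regex, limit).
    static String[] splitWithLimit(String input, String regex, int limit) {
        return input.split(regex, limit);
    }

    public static void main(String[] args) {
        String csv = "a,b,,";
        // limit 0: apply the pattern as often as possible, drop trailing empty strings.
        System.out.println(Arrays.toString(splitWithLimit(csv, ",", 0)));  // [a, b]
        // negative limit: apply as often as possible, keep trailing empty strings.
        System.out.println(Arrays.toString(splitWithLimit(csv, ",", -1))); // [a, b, , ]
        // limit 2: apply the pattern at most once; the rest stays in the last entry.
        System.out.println(Arrays.toString(splitWithLimit(csv, ",", 2)));  // [a, b,,]
    }
}
```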
Regular Expression Constructs
Regular expressions are a programming language for describing patterns in
strings. A regular expression is a pattern of characters that describes a
set of strings. You can use the java.util.regex
package to find, display, or modify some or all of the occurrences of a pattern
in an input sequence.
The simplest form of a regular expression is a literal string, such as "Java" or "programming". Regular expression matching also allows you to test whether a string fits into a specific syntactic form, such as an email address. More complicated patterns involve the use of metacharacters to describe all the different choices and variations that you want to build into a pattern. Metacharacters don't match themselves, but describe something else. At the syntax level, it's important to understand which characters are metacharacters (have a special meaning), and which are literal characters (stand for themselves).
At the semantic level, several basic concepts are important: character classes, quantifiers, boundaries, grouping, and alternation. These fundamental regex elements apply to all implementations and will cover most of your regex needs.
Metacharacters
Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. The characters that have special meaning are called metacharacters. A preceding backslash ("\") turns a metacharacter into a literal character. The set of metacharacters inside a character class, i.e. between [ and ], is different.
| Char | Meaning |
|---|---|
| \ | Turns metacharacters into literal characters, and literal characters into metacharacters. Because this is also the Java escape character in strings, it must be doubled. |
| [...] | Matches any one of the class of characters contained within the brackets |
| (...) | Groups regular expressions |
| {...} | Specifies how many times the preceding element may occur: {min,max} |
| ^ | Matches boundary at beginning. Class negation when immediately after [. |
| $ | Matches boundary at end. |
| . | Matches any single character except a line terminator (unless the DOTALL flag, (?s), is set). |
| ? | The preceding element must match zero or one time. |
| * | The preceding element must match zero or more times. |
| + | The preceding element must match one or more times. |
| \| | Alternation: either the preceding or the following element must match. |
Matching a single character
The most basic regular expression consists of a single literal character. A few characters have been reserved as metacharacters. To use any of them as a literal in a regular expression, escape it with a backslash: \. \+ \* \? \| \{ \( \[ \^ \$.
How do you specify a backslash itself as a literal in a regular expression? In the regular expression, write \\. Because the backslash is also the escape character in Java string literals, each backslash must be doubled again in Java source code, giving "\\\\".
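The following sketch shows escaping in practice, including the doubled backslashes that Java string literals require. The class EscapeDemo and the helper matchesLiteral are hypothetical names. Pattern.quote (available since Java 5) is not covered in the text above; it is used here only as one convenient way to treat a whole string as a literal.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Hypothetical helper: treats 'literal' as plain text, not as a regex.
    static boolean matchesLiteral(String input, String literal) {
        return input.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        // An unescaped dot is a metacharacter and matches any character.
        System.out.println("a+b".matches("a.b"));    // true
        // An escaped dot matches only a literal dot.
        System.out.println("a+b".matches("a\\.b"));  // false
        System.out.println("a.b".matches("a\\.b"));  // true
        // A literal backslash needs four backslashes in Java source.
        System.out.println("C:\\temp".matches("C:\\\\temp")); // true
        // Here + is a quantifier, so the regex does not match its own text.
        System.out.println("1+1=2".matches("1+1=2"));         // false
        System.out.println(matchesLiteral("1+1=2", "1+1=2")); // true
    }
}
```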
You can use special character sequences to put non-printable characters in your regular expression:
| Construct | Matches |
|---|---|
| \t | The tab character ('\u0009') |
| \n | The newline (line feed) character ('\u000A') |
| \r | The carriage-return character ('\u000D') |
| \f | The form-feed character ('\u000C') |
| \a | The alert (bell) character ('\u0007') |
| \e | The escape character ('\u001B') |
| \cx | The control character corresponding to x |
You can include any character in your regular expression if you know its octal or hexadecimal code in the character set that you are working with. Note that the leading zero is required in the octal forms.
| Construct | Matches |
|---|---|
| \0nn | The character with octal value 0nn (0<=n<=7) |
| \0mnn | The character with octal value 0mnn (0<=m<=3, 0<=n<=7) |
| \xhh | The character with hexadecimal value 0xhh |
| \uhhhh | The character with hexadecimal value 0xhhhh |
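As a quick check of these escapes, the sketch below matches the letter "A" (decimal 65, hexadecimal 41, octal 101) in three notations. The class CharEscapeDemo and the helper matchesA are illustrative names only.

```java
public class CharEscapeDemo {
    // Hypothetical helper: does the regex match the one-letter string "A"?
    static boolean matchesA(String regex) {
        return "A".matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matchesA("\\x41"));    // true (hexadecimal 41)
        System.out.println(matchesA("\\u0041"));  // true (four hex digits)
        System.out.println(matchesA("\\0101"));   // true (octal 101, leading zero required)
        // The regex escape for a tab matches a real tab character in the input.
        System.out.println("a\tb".matches("a\\tb")); // true
    }
}
```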
Defining Character classes (match one character)
Character classes provide a way to specify a set of characters. The class specification is enclosed in []. The set can also be expressed by what must not be in it, by beginning the set with a caret, "^". With a "character class", also called a "character set", you tell the regex engine to match only one out of several characters.
The order of the characters inside a character class does not matter; the results are identical. You can use a hyphen, "-", inside a character class to specify a range of characters. For example, [0-9a-fxXA-F] matches a single hexadecimal digit or the letter x, in either case.
Although a character class matches only one character, a quantifier following it can be used to match multiple characters.
| Construct | Matches |
|---|---|
| [abc] | a, b, or c (simple class) |
| [^abc] | Any character except a, b, or c (negation) |
| [a-zA-Z] | a through z or A through Z, inclusive (range) |
| [a-d[m-p]] | a through d, or m through p: [a-dm-p] (union) |
| [a-z&&[def]] | d, e, or f (intersection) |
| [a-z&&[^bc]] | a through z, except for b and c: [ad-z] (subtraction) |
| [a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z] (subtraction) |
- Simple Classes -- The most basic form of a character class is to simply place a set of characters side-by-side within square brackets. For example, the regular expression [bcr]at will match the words "bat", "cat", or "rat" because it defines a character class (accepting either "b", "c", or "r") as its first character. The overall match succeeds only when the first letter matches one of the characters defined by the character class.
- Negation -- To match all characters except those listed, insert the ^ metacharacter at the beginning of the character class. This technique is known as negation. For example, [^bcr]at matches "hat" but not "bat", "cat", or "rat": the match succeeds only if the first character of the input string is not one of the characters defined by the character class.
- Ranges -- Sometimes you'll want to define a character class that includes a range of values, such as the letters "a through h" or the numbers "1 through 5". To specify a range, simply insert the - metacharacter between the first and last character to be matched, such as [1-5] or [a-h]. You can also place different ranges beside each other within the class to further expand the match possibilities. For example, [a-zA-Z] will match any letter of the alphabet: a to z (lowercase) or A to Z (uppercase).
- Unions -- You can also use unions to create a single character class comprised of two or more separate character classes. To create a union, simply nest one class inside the other, such as [0-4[6-8]]. This particular union creates a single character class that matches the numbers 0, 1, 2, 3, 4, 6, 7, and 8.
- Intersections -- To create a single character class matching only the characters common to all of its nested classes, use the intersection operator &&, as in [0-9&&[345]]. This particular intersection creates a single character class matching only the numbers common to both character classes: 3, 4, and 5.
- Subtraction -- Finally, you can use subtraction to negate one or more nested character classes, such as [0-9&&[^345]]. This example creates a single character class that matches everything from 0 to 9, except the numbers 3, 4, and 5.
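The character class forms listed above can be tried out with a short program such as the following. It is a minimal sketch; the class name CharClassDemo and the helper matchesClass are invented for the example.

```java
public class CharClassDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesClass(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Simple class and negation
        System.out.println(matchesClass("[bcr]at", "bat"));   // true
        System.out.println(matchesClass("[bcr]at", "hat"));   // false
        System.out.println(matchesClass("[^bcr]at", "hat"));  // true
        // Range
        System.out.println(matchesClass("[a-h]", "e"));       // true
        // Union: 0-4 or 6-8
        System.out.println(matchesClass("[0-4[6-8]]", "7"));  // true
        System.out.println(matchesClass("[0-4[6-8]]", "5"));  // false
        // Intersection: only 3, 4, and 5
        System.out.println(matchesClass("[0-9&&[345]]", "4")); // true
        System.out.println(matchesClass("[0-9&&[345]]", "6")); // false
        // Subtraction: 0-9 except 3, 4, and 5
        System.out.println(matchesClass("[0-9&&[^345]]", "6")); // true
        System.out.println(matchesClass("[0-9&&[^345]]", "4")); // false
    }
}
```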
Predefined Character Classes
Since certain character classes are used often, a series of predefined (shorthand) character classes are available.
The Pattern API contains a number of useful predefined character
classes, which offer convenient shorthands for commonly-used regular
expressions. In the table below, each construct in the left-hand column is
shorthand for the character class in the right-hand column. For example,
\d means a range of digits (0-9), and \w means a word
character (any lowercase letter, any uppercase letter, the underscore
character, or any digit). Use the predefined classes whenever possible. They
make your code easier to read and eliminate errors introduced by malformed
character classes.
| Construct | Matches |
|---|---|
| . | Any character (may or may not match line terminators) |
| \d | A digit: [0-9] |
| \D | A non-digit: [^0-9] |
| \s | A whitespace character. Which characters this actually includes, depends on the regex flavor. In Java, it matches [ \t\n\x0B\f\r] |
| \S | A non-whitespace character: [^\s] |
| \w | A word character. Which characters it matches differs between regex flavors. In all flavors, it will include [A-Za-z]. In Java, it matches [a-zA-Z_0-9] |
| \W | A non-word character: [^\w] |
Predefined character classes can be used both inside and outside the square brackets. \s\d matches a whitespace character followed by a digit. [\s\d] matches a single character that is either whitespace or a digit.
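A brief sketch of the predefined classes, both on their own and inside square brackets, might look like this. The class name PredefinedClassDemo and the helper matchesAll are illustrative.

```java
public class PredefinedClassDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("\\d", "7"));          // true
        System.out.println(matchesAll("\\D", "x"));          // true
        // Two classes in sequence: whitespace followed by a digit.
        System.out.println(matchesAll("\\s\\d", " 5"));      // true
        // Inside brackets: a single character that is whitespace OR a digit.
        System.out.println(matchesAll("[\\s\\d]", "5"));     // true
        System.out.println(matchesAll("[\\s\\d]", " 5"));    // false (two characters)
        // A word character class includes letters, digits, and underscore, but not '-'.
        System.out.println(matchesAll("\\w+", "user_42"));   // true
        System.out.println(matchesAll("\\w+", "user-42"));   // false
    }
}
```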
Position and Boundary patterns (match zero characters)
You can make your pattern matches more precise by specifying where in the input a match must occur, using boundary matchers. For example, maybe you're interested in finding a particular word, but only if it appears at the beginning or end of a line. Or maybe you want to know if the match is taking place on a word boundary, or at the end of the previous match.
There are four different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between a word character and a non-word character following right after the word character.
- Between a non-word character and a word character following right after the non-word character.
| Construct | Matches |
|---|---|
| ^ | The beginning of a line. Very useful. |
| $ | The end of a line. Very useful. ^$ matches all empty lines. |
| \b | A word boundary |
| \B | A non-word boundary |
| \A | The beginning of the input |
| \G | The end of the previous match |
| \Z | The end of the input but for the final terminator, if any |
| \z | The end of the input |
\B is the negated version of \b. \B matches
at every position where \b does not. Effectively, \B matches at
any position between two word characters as well as at any position between two
non-word characters.
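The boundary matchers can be observed by counting matches, as in the sketch below. The class BoundaryDemo and the helper countMatches are hypothetical. One assumption worth stating: in Java, ^ and $ match at line boundaries only in MULTILINE mode, which this example switches on with the embedded flag (?m); by default they match at the beginning and end of the whole input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Hypothetical helper: counts how many times the regex is found in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "cat concat cat";
        System.out.println(countMatches("cat", text));         // 3
        // Whole word only: the "cat" inside "concat" is excluded.
        System.out.println(countMatches("\\bcat\\b", text));   // 2
        // Non-word boundary before "cat": only the one inside "concat".
        System.out.println(countMatches("\\Bcat", text));      // 1
        // Empty lines, with MULTILINE mode enabled via (?m).
        System.out.println(countMatches("(?m)^$", "a\n\nb"));  // 1
    }
}
```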
Quantifiers (repeating the previous element)
Quantifiers allow you to specify the number of occurrences to match
against. For convenience, the three sections of the API specification
describing greedy, reluctant, and possessive quantifiers are presented below. At
first glance it may appear that the quantifiers X?, X??
and X?+ do exactly the same thing, since they all promise to match
"X, once or not at all". There are subtle implementation
differences, which are explained near the end of this section.
| Greedy quantifiers - Expand as much as possible | |
|---|---|
| X? | X, once or not at all |
| X* | X, zero or more times |
| X+ | X, one or more times |
| X{n} | X, exactly n times |
| X{n,} | X, at least n times |
| X{n,m} | X, at least n but not more than m times |
| Reluctant quantifiers - Expand only if forced by later failure to match | |
| X?? | X, once or not at all |
| X*? | X, zero or more times |
| X+? | X, one or more times |
| X{n}? | X, exactly n times |
| X{n,}? | X, at least n times |
| X{n,m}? | X, at least n but not more than m times |
| Possessive quantifiers - Expand as much as possible and never back off | |
| X?+ | X, once or not at all |
| X*+ | X, zero or more times |
| X++ | X, one or more times |
| X{n}+ | X, exactly n times |
| X{n,}+ | X, at least n times |
| X{n,m}+ | X, at least n but not more than m times |
Differences Among Greedy, Reluctant, and Possessive Quantifiers
As mentioned earlier, there are subtle differences among greedy, reluctant, and possessive quantifiers.
Greedy quantifiers are considered "greedy" because they force the matcher to read in, or eat, the entire input string prior to attempting the first match. If the first match attempt (the entire input string) fails, the matcher backs off the input string by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from. Depending on the quantifier used in the expression, the last thing it will try matching against is 1 or 0 characters.
The reluctant quantifiers, however, take the opposite approach: they start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string.
Finally, the possessive quantifiers always eat the entire input string, trying once (and only once) for a match. Unlike the greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.
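The three behaviors can be compared on one input string. The sketch below assumes the sample input "xfooxxxxxxfoo"; the class QuantifierDemo and the helper firstMatch are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: returns the first match of the regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* eats everything, then backs off until "foo" fits at the end.
        System.out.println(firstMatch(".*foo", input));   // xfooxxxxxxfoo
        // Reluctant: .*? eats as little as possible, stopping at the first "foo".
        System.out.println(firstMatch(".*?foo", input));  // xfoo
        // Possessive: .*+ eats everything and never backs off, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input));  // null
    }
}
```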
By default, a quantifier attaches only to the single character immediately
before it, so the regular expression "abc+" means "a, followed by b, followed
by c one or more times". It does not mean "abc" one or more times. However,
quantifiers can also attach to character classes and capturing groups, such as
[abc]+ (a or b or c, one or more times) or (abc)+ (the group "abc", one
or more times).
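What a quantifier attaches to can be checked with a few matches calls. This is only a sketch; the class QuantifierScopeDemo and the helper fits are invented for the example.

```java
public class QuantifierScopeDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fits(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // The + applies to 'c' only.
        System.out.println(fits("abc+", "abccc"));     // true
        System.out.println(fits("abc+", "abcabc"));    // false
        // The + applies to the whole group.
        System.out.println(fits("(abc)+", "abcabc"));  // true
        // The + applies to the character class.
        System.out.println(fits("[abc]+", "cabbac"));  // true
        // A counted range: at least 2 but not more than 3 times.
        System.out.println(fits("a{2,3}", "aaa"));     // true
        System.out.println(fits("a{2,3}", "aaaa"));    // false
    }
}
```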
Grouping & Backreferences
Capturing groups are a way to treat multiple characters as a single unit.
They are created by placing the characters to be grouped inside a set of
parentheses. For example, the regular expression (dog) creates a
single group containing the letters "d", "o", and "g".
The portion of the input string that matches the capturing group is saved
in memory for later recall via backreferences; in other words, each pair of
round brackets creates a group that a backreference can refer to. The group
stores the part of the string matched by the part of the regular expression
inside the parentheses.
| Grouping - Parentheses both group and create a numbered element that can be used later. | |
|---|---|
| (X) | X. This capturing group is remembered so it can be referenced later. Numbered starting at 1. |
Numbering
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1 ((A)(B(C)))
2 (A)
3 (B(C))
4 (C)
To find out how many groups are present in the expression, call the groupCount
method on a matcher object. The groupCount method returns an int
showing the number of capturing groups present in the matcher's pattern. In
this example, groupCount would return the number 4,
showing that the pattern contains 4 capturing groups.
Group zero always represents the entire expression. This group is not included
in the total reported by groupCount.
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b" . All captured input is discarded at the beginning of each match.
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total. (Named capturing groups of the form (?<name>X), added in Java 7, are the exception.)
It's important to understand how groups are numbered because some Matcher methods accept an int specifying a particular group number as a parameter:
- public int start(int group): Returns the start index of the subsequence captured by the given group during the previous match operation.
- public int end(int group): Returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match operation.
- public String group(int group): Returns the input subsequence captured by the given group during the previous match operation.
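Group numbering and the Matcher methods listed above can be explored with the ((A)(B(C))) expression from this section. The sketch below uses the sample input "xxABCxx"; the class GroupDemo and the helper captureGroups are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical helper: groups 0..groupCount() of the first match, or null if none.
    static String[] captureGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher("xxABCxx");
        System.out.println(m.groupCount()); // 4 (group zero is not counted)
        if (m.find()) {
            // Prints: 0: ABC [2,5)  1: ABC [2,5)  2: A [2,3)  3: BC [3,5)  4: C [4,5)
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println(i + ": " + m.group(i)
                        + " [" + m.start(i) + "," + m.end(i) + ")");
            }
        }
        // Group two keeps "b" from the first pass of the quantified group.
        System.out.println(captureGroups("(a(b)?)+", "aba")[2]); // b
    }
}
```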
Backreference
The section of the input string matching the capturing group(s) is saved in
memory for later recall via a backreference. A backreference is specified in
the regular expression as a backslash (\) followed by a digit
indicating the number of the group to be recalled. For example, the expression
(\d\d) defines one capturing group matching two digits in a row,
which can be recalled later in the expression via the backreference \1.
For nested capturing groups, backreferencing works in exactly the same way: Specify a backslash followed by the number of the group to be recalled.
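A backreference in action, using the (\d\d) expression from the text followed by \1, plus one nested case. This is a minimal sketch; the class BackreferenceDemo and the helper fits are made up, and the nested pattern ((a)b)\2\1 is chosen only for illustration.

```java
public class BackreferenceDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fits(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Two digits followed by the SAME two digits again.
        System.out.println(fits("(\\d\\d)\\1", "1212"));  // true
        System.out.println(fits("(\\d\\d)\\1", "1234"));  // false
        // Nested groups: group 1 is "ab", group 2 is "a".
        System.out.println(fits("((a)b)\\2\\1", "abaab")); // true
        System.out.println(fits("((a)b)\\2\\1", "ababa")); // false
    }
}
```

Note that the backslash of the backreference is doubled in the Java string literal, just like any other regex escape.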
http://www.regular-expressions.info/tutorial.html
http://java.sun.com/docs/books/tutorial/extra/regex/index.html