Regular Expressions Tutorial

Regular Expressions

Regular Expressions were introduced 1956 by Stephen Cole Kleene to model nervous activity, but proved to be very effective to describe all sorts of character strings.

A RE describes the Form of character strings. If a character string can be described by a RE, then the RE is said to match the given string. REs are greedy, which means they try to match as many characters as possible in a given input.

The three most common forms (flavours) of REs are:

Documentation

A full documentation for REs can be found in:

Definition of a (extended) RE

Atoms

Atoms are the basic components of a RE

xthe character 'x' itself.
\Xif X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \x. Otherwise, a literal 'X' (used to escape operators such as '*').
\123the character with octal value 123.
\xe5the character with hexadecimal value e5.
.any character (byte) except newline.
[xyz]a character class: x OR y OR z.
[ako‑sP]a character class with a range in it; matches an 'a', a 'k', any letter from 'k' through 's', or a 'P'.
[^A‑Z]a negated character class: i.e., any character but those in the class. In our example, any character EXCEPT an uppercase letter.
[:class:]a character class expression: allowed only within another character class. The valid contents of class are: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit.

Pieces

Pieces are used to concatenate one or more REs, or to specify how often a precedent piece must be repeated.

(r)the RE r itself. (This is actually an atom.)
rsthe RE r followed by the RE s.
r|sthe RE r OR the RE s.
r*the RE r zero or more times.
r+the RE r one or more time.
r?the RE r zero or one time.
r{2,6}the RE r anywhere from two to six times.
r{2,}the RE r two or more times.
r{,6}the RE r up to six times.
r{4}the RE r exactly for times.

Examples

The RE (x|y|z) is equivalent to [xyz], both match the single character 'x' or 'y' or 'z'.

The RE (a|b) is equivalent to (b|a).

The RE Twe{2}dled(ee|um) matches both strings Tweedledum and Tweedledee.

Regular Examples

The real numbers such as 0.75 or 1.23e6 can be described by the RE:

[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?

The above RE can be rewritten using a character class:

[[:digit:]]+\.[[:digit:]]*([eE][+-]?[[:digit:]]+)?

There is a small problem with the previous RE: numbers like 3. are accepted, but not .3. We want to accept any number with at least one digit either before or after the decimal point. The following RE solves the problem by stating both options separately: the first part matches all numbers with a leading digit, the second part matches all numbers with a leading decimal point.

(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?

Basic REs

Basic REs are used by older but still widely used programs such as sed and lex. The difference with extended REs are:

The simple RE matching real numbers, rewritten as a basic RE is:

[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}