Regular Expressions Tutorial
Regular Expressions
- introduced 1956 from S.C. Kleene, to describe the states of a FSA (model of nervous activity)
- REs describe the Form of character strings
- A string is matched by a RE if the string is a element of the class described by the RE
- REs are greedy
- Forms of REs:
- basic REs (ed, sed, lex, ...)
- extended REs (egrep, awk, regex(3), ...)
- perl compatible REs (perl, libpcre, ...)
- Definition of a (extended) RE
- A RE is one or more non-empty branches, separated by '|'. It matches anything that matches one of the branches
- A branch is the concatenation of one or more pieces
- A piece is an atom, possibly followed by a single(!) '*', '+', '?', or a bound
- Documentation
- manpages regex(7), awk(1), lex(1)
- Regular Expressions in the The Single UNIX ® Specification, Version 2
- Jeffrey E. F. Friedl, Mastering Regular Expressions, O'Reilly
- Compilers — Principle, Techniques and Tools, by Aho Sethi and Ullman, Addison Wesley
Atoms
Atoms are the basic components of a RE
| x |
the character 'x' itself |
| \X |
if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \x. Otherwise, a literal 'X' (used to escape operators such as '*') |
| \123 |
the character with octal value 123 |
| \xe5 |
the character with hexadecimal value e5 |
| . |
any character (byte) except newline |
| [xyz] |
a "character class": x OR y OR z |
| [ako-sP] |
a "character class" with a range in it; matches an 'a', a 'k', any letter from 'k' through 's', or a 'P' |
| [^A-Z] |
a "negated character class": i.e., any character but those in the class. In our example, any character EXCEPT an uppercase letter |
| [:str:] |
a "character class expression": Allowed only within another character class. The valid contents of str are: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit |
Pieces
Pieces are used to concatenate one or more REs, or to specify how often a precedent piece must be repeated
| (r) |
the RE r itself |
| rs |
the RE r followed by the RE s |
| r|s |
the RE r OR the RE s |
| r* |
the RE r zero or more times |
| r+ |
the RE r one or more time |
| r? |
the RE r zero or one time |
| r{2,6} |
the RE r anywhere from two to six times |
| r{2,} |
the RE r two or more times |
| r{,6} |
the RE r up to six times |
| r{4} |
the RE r exactly for times |
Regular Examples
(x|y|z) is equivalent to the RE [xyz]
(a|b) is equivalent to (b|a)
The RE (Sch|Pr)eis{2} matches both the strings "Scheiss" and "Preiss".
Regular Examples
Real numbers (simple)
[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?
Real numbers (character class)
[[:digit:]]+\.[[:digit:]]*([eE][+-]?[[:digit:]]+)?
Problem: numbers like "3." are acepted, but not ".3".
Real numbers (catch all)
(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?
Basic REs
- '|', '+', and '?' are ordinary characters and there is no equivalent for their functionality
- The delimiters for bounds are '\{' and '\}', with '{' and '}' by themselves ordinary characters
- The parentheses for nested subexpressions are '\(' and '\)', with '(' and ')' by themselves ordinary characters
- '^' is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesized subexpression
- '$' is an ordinary character except at the end of the RE or(!) the end of a parenthesized subexpression
- '*' is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesized subexpression (after a possible leading '^')
- There is one new type of atom, a back reference: '\' followed by a nonzero decimal digit d matches the same sequence of characters matched by the dth parenthesized subexpression (numbering subexpressions by the positions of their opening parentheses, left to right), so that (e.g.) '\([bc]\)\1' matches 'bb' or 'cc' but not 'bc'
Real numbers as Extended RE
[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?
Real numbers as Basic RE
[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}
Real numbers as Basic RE, written in the shell
\[0-9\]\[0-9\]\*\\.\[0-9\]\*\\\(\[eE\]\[+-\]\\\{0,1\\\}\[0-9\]\[0-9\]\*\\\)\\\{0,1\\\}