Regular Expressions

Regular expressions are a compact, formal way to specify character patterns
in ASCII strings. For convenience, this page provides a brief summary of
regular expression syntax, following the conventions adopted in the regexp
package written by Henry Spencer that was included in Tcl releases prior
to tcl8.0. There is also a very readable chapter on regular expressions in
Brent B. Welch, *Practical Programming in Tcl and Tk*, 2nd ed., Prentice
Hall 1997. See also the standard compiler textbooks such as Alfred V. Aho,
Ravi Sethi and Jeffrey D. Ullman, *Compilers: Principles, Techniques, and
Tools*, Addison-Wesley 1986.

A regular expression (over the ASCII alphabet) is defined recursively as follows:

- A literal character or an *empty set* is a regular expression.
- Repetition operators:
  - `(r)*` is a regular expression denoting zero or more repetitions of `r`
  - `(r)+` is a regular expression denoting one or more repetitions of `r`
  - `(r)?` is a regular expression denoting zero or one repetition of `r`

- Concatenation: `(r)(s)` is a regular expression denoting `(r)` followed by `(s)`
- Alternation operator: `(r) | (s)` is a regular expression denoting `(r)` OR `(s)`
- Repetition, concatenation and alternation operators are all left-associative, with the following precedence rules, in order from highest to lowest:
  - grouping in parentheses `()`
  - unary repetition operators `*`, `+`, `?`
  - concatenation
  - alternation

- Special matching characters (can be escaped by preceding them with a backslash `\`):
  - `.` (a period) matches any character
  - `^(r)` matches the regular expression at the beginning of a string; `^` must be first
  - `(r)$` matches the regular expression at the end of a string; `$` must be last

- Matching from a set of characters:
  - `[x-y]` matches a character from a range over the ASCII-ordered character set between x and y, inclusive
  - `[xyz]` matches a character from the character set {x, y, z}, equivalent to `(x|y|z)`
  - `[^xyz]` matches a character **not** in the character set {x, y, z}
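
The precedence rules above can be checked with a short sketch. It uses Python's `re` module, which is a different engine from the Spencer package described here but shares the basic syntax shown above; the helper name `matches_whole` is purely illustrative.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Repetition binds tighter than concatenation: ab* means a(b*), not (ab)*.
print(matches_whole(r"ab*", "abbb"))     # True
print(matches_whole(r"ab*", "abab"))     # False
print(matches_whole(r"(ab)*", "abab"))   # True

# Concatenation binds tighter than alternation: ab|cd means (ab)|(cd).
print(matches_whole(r"ab|cd", "cd"))     # True
print(matches_whole(r"ab|cd", "abd"))    # False
print(matches_whole(r"a(b|c)d", "abd"))  # True

# Character sets and an escaped period.
print(matches_whole(r"[a-c]", "b"))      # True
print(matches_whole(r"[^a-c]", "b"))     # False
print(matches_whole(r"\.", "."))         # True
```

Parentheses are the only way to override the default grouping, which is why `(ab)*` and `ab*` accept different strings.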

| Pattern | Matches |
| --- | --- |
| `abc%x` | the string abc%x |
| `..` | any two-character string |
| `ab*` | the strings a, ab, abb, abbb, ... |
| `(ab)*` | the empty string, ab, abab, ababab, ... |
| `a+` | the strings a, aa, aaa, ... |
| `ab?` | the strings a or ab |
| `[Hh]ello` | Hello or hello |
| `hello\|Hello` | Hello or hello |
| `.*` | any string |
| `\.` | a period "." |
| `^(one\|2\|3)` | a string beginning with "one" or "2" or "3" |
| `[1-9][0-9]*` | any integer greater than zero, written without leading zeros |
| `[a-zA-Z0-9]+` | any string containing one or more letters or digits only |
| `[^a-d]` | any single character other than the letters a, b, c, d |
| `^[a-zA-Z]$` | a string of exactly one letter |
| `array\[N\]` | the string "array[N]" |
| `^[^\n]*\n` | everything from the beginning of a string up to and including the first newline |
| `[ \t\n]*` | whitespace (zero or more spaces, tabs, newlines) |
| `[^:]+://[^:/]+(:[0-9]+)?/.*` | a URL, e.g., http://www.ssfnet.org:80/home/index.html |
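
A few of these patterns can be tried out directly. The following is a minimal sketch in Python's `re` module, assuming its syntax agrees with the patterns used here; the names `URL_PATTERN`, `POSITIVE_INT` and `port_of` are illustrative and not part of any package mentioned on this page.

```python
import re

# The URL pattern from the examples; the parenthesized group is the optional port.
URL_PATTERN = re.compile(r"[^:]+://[^:/]+(:[0-9]+)?/.*")

def port_of(url: str):
    """Return the ':port' part of a URL, or None if it is absent or the URL does not match."""
    m = URL_PATTERN.fullmatch(url)
    return m.group(1) if m else None

print(port_of("http://www.ssfnet.org:80/home/index.html"))  # :80
print(port_of("http://www.ssfnet.org/home/index.html"))     # None

# Integers greater than zero, tested against the whole string.
POSITIVE_INT = re.compile(r"[1-9][0-9]*")
print(bool(POSITIVE_INT.fullmatch("42")))    # True
print(bool(POSITIVE_INT.fullmatch("007")))   # False

# Escaped brackets match literally.
print(bool(re.fullmatch(r"array\[N\]", "array[N]")))  # True

# [^a-d] matches one character at a time, so it picks out 'e' and 'f' here.
print(re.findall(r"[^a-d]", "abcdef"))  # ['e', 'f']
```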

A regular expression does not have to match the whole string. There can be unmatched characters before and after the match. To force a match of the entire string, the regular expression must begin with "^" and end with "$".
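
The difference between an unanchored and an anchored pattern can be seen in a small sketch. It uses Python's `re.search`, which, like the matching described here, looks for the pattern anywhere in the string; the variable names are illustrative.

```python
import re

text = "abc123def"

# Unanchored: the digits are found even though other characters surround them.
unanchored = re.search(r"[0-9]+", text)
print(unanchored.group(0))  # 123

# Anchored with ^ and $: the whole string must consist of digits, so no match.
anchored = re.search(r"^[0-9]+$", text)
print(anchored)  # None

# The anchored pattern succeeds only when the entire string fits.
whole = re.search(r"^[0-9]+$", "123")
print(whole.group(0))  # 123
```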

If a pattern can match several substrings of a string, the earliest match in the string is chosen. Then, if more than one match is possible starting from that point, the longest one is chosen.
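
As a rough illustration of the earliest-then-longest rule, the sketch below uses Python's `re` module. It deliberately sticks to repetition operators, where Python's greedy matching agrees with the rule stated above; alternation may be resolved differently by different engines, so it is left out.

```python
import re

text = "baaab aaaaa"

# Two runs of 'a' exist; the earlier one wins even though the later one is longer.
m = re.search(r"a+", text)
print(m.start(), m.group(0))  # 1 aaa

# From position 1, "a", "aa" and "aaa" would all match a+; the longest is taken.
print(len(m.group(0)))  # 3

# The same behavior with ab*: the first 'a' fixes the start, then b* takes every 'b'.
m2 = re.search(r"ab*", "xabbbyabbbbb")
print(m2.start(), m2.group(0))  # 1 abbb
```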

ato 28 March 1999