.\" $OpenBSD: re_format.7,v 1.14 2007/05/31 19:19:30 jmc Exp $
.\" Copyright (c) 1997, Phillip F Knaack. All rights reserved.
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\" This code is derived from software contributed to Berkeley by
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" @(#)re_format.7 8.3 (Berkeley) 3/20/94
.Dd $Mdocdate: May 31 2007 $
.Nd POSIX regular expressions
Regular expressions (REs),
basic regular expressions
and extended regular expressions
Both forms of regular expressions are supported
by the interfaces described in
Applications dealing with regular expressions
may use one or the other form
Consult the manual page for the specific application to find out which
POSIX leaves some aspects of RE syntax and semantics open;
marks decisions on these aspects that
may not be fully portable to other POSIX implementations.
This manual page first describes regular expressions in general,
specifically extended regular expressions,
and then discusses differences between them and basic regular expressions.
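.Pp
As a rough illustration of how an application selects one form or the other,
the sketch below compiles the same bound either as an extended or as a basic
RE through the POSIX
regcomp and regexec interfaces.
The helper name re_search is hypothetical and chosen only for this example;
the patterns and strings are illustrative inputs, not taken from this page.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if `pattern`, compiled with `cflags`, matches somewhere in
 * `text`, 0 if it does not, and -1 if the pattern fails to compile.
 * Pass REG_EXTENDED in `cflags` for an ERE, or 0 for a basic RE.
 */
int re_search(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is wanted, no match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Under these assumptions, re_search("a{2}", REG_EXTENDED, "aa") reports a
match, while the same pattern compiled with cflags of 0 treats the braces as
ordinary characters and a basic RE needs the backslashed delimiters instead.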
.Sh EXTENDED REGULAR EXPRESSIONS
An ERE is one** or more non-empty**
It matches anything that matches one of the branches.
A branch is one** or more
It matches a match for the first, followed by a match for the second, etc.
possibly followed by a single**
matches a sequence of 0 or more matches of the atom.
matches a sequence of 1 or more matches of the atom.
matches a sequence of 0 or 1 matches of the atom.
followed by an unsigned decimal integer,
possibly followed by another unsigned decimal integer,
The integers must lie between 0 and
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer
a sequence of exactly
An atom followed by a bound
containing one integer
or more matches of the atom.
An atom followed by a bound
containing two integers
matches a sequence of
(inclusive) matches of the atom.
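.Pp
The three kinds of bound described above can be tried out with a small
helper such as the following sketch.
The function name ere_match_len is hypothetical, and the patterns in the
usage note are illustrative inputs.

```c
#include <regex.h>

/*
 * Length of the leftmost match of the ERE `pattern` in `text`,
 * or -1 if there is no match or the pattern does not compile.
 */
int ere_match_len(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;              /* e.g. a bound whose first integer exceeds the second */
    if (regexec(&re, text, 1, m, 0) == 0)
        len = (int)(m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}
```

Against the string aaaa, a bound with one integer and no comma, such as
a{2}, should consume exactly two characters; a{2,} should consume all four;
and a{2,3} should consume three.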
An atom is a regular expression enclosed in
(matching a part of the regular expression),
(matching the null string)**,
.Em bracket expression
(matching any single character),
(matching the null string at the beginning of a line),
(matching the null string at the end of a line),
followed by one of the characters
(matching that character taken as an ordinary character),
followed by any other character**
(matching that character taken as an ordinary character,
had not been present**),
or a single character with no other significance (matching that character).
followed by a character other than a digit is an ordinary character,
not the beginning of a bound**.
It is illegal to end an RE with
A bracket expression is a list of characters enclosed in
It normally matches any single character from the list (but see below).
If the list begins with
it matches any single character
from the rest of the list
If two characters in the list are separated by
this is shorthand for the full
of characters between those two (inclusive) in the
collating sequence, e.g.\&
in ASCII matches any decimal digit.
It is illegal** for two ranges to share an endpoint, e.g.\&
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
in the list, make it the first character
(following a possible
make it the first or last character,
or the second endpoint of a range.
as the first endpoint of a range,
to make it a collating element (see below).
With the exception of these and some combinations using
(see next paragraphs),
all other special characters, including
lose their special significance within a bracket expression.
Within a bracket expression, a collating element
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
stands for the sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g. if the collating sequence includes a
matches the first five characters of
Within a bracket expression, a collating element enclosed in
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were
are the members of an equivalence class,
An equivalence class may not** be an endpoint of a range.
Within a bracket expression, the name of a
stands for the list of all characters belonging to that class.
Standard character class names are:
.Bd -literal -offset indent
These stand for the character classes defined in
A locale may provide others.
A character class may not be used as an endpoint of a range.
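.Pp
To see which characters a given class name covers in the current locale,
a program can wrap the name in a bracket expression and test characters one
at a time.
The following is a minimal sketch under that assumption; the helper name
count_in_class is hypothetical, and the results quoted in the usage note
assume the default C locale.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Count the characters of `text` that belong to the character class
 * `class_name` (for example "digit").  Returns -1 if the class name is
 * too long or the resulting bracket expression does not compile.
 */
int count_in_class(const char *class_name, const char *text)
{
    char pattern[64];
    char one[2] = { '\0', '\0' };
    regex_t re;
    int n, count = 0;

    /* Build a bracket expression such as "[[:digit:]]". */
    n = snprintf(pattern, sizeof(pattern), "[[:%s:]]", class_name);
    if (n < 0 || n >= (int)sizeof(pattern))
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* Test each character on its own as a one-character string. */
    for (; *text != '\0'; text++) {
        one[0] = *text;
        if (regexec(&re, one, 0, NULL, 0) == 0)
            count++;
    }
    regfree(&re);
    return count;
}
```

In the C locale, count_in_class("digit", "abc123def") should yield 3 and
count_in_class("alpha", "abc123def") should yield 6; other locales may
classify characters differently.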
There are two special cases** of bracket expressions:
the bracket expressions
match the null string at the beginning and end of a word, respectively.
A word is defined as a sequence of
characters starting and ending with a word character
which is neither preceded nor followed by
A word character is an
character (as defined by
This is an extension,
compatible with but not specified by POSIX,
and should be used with
caution in software intended to be portable to other systems.
In the event that an RE could match more than one substring of a given
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
matches the three middle characters of
.Sq (wee|week)(knights|nights)
matches all ten characters of
the parenthesized subexpression matches all three characters;
both the whole RE and the parenthesized subexpression match the null string.
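.Pp
The earliest-start, then longest-match rule for the whole RE can be observed
by asking regexec for the offsets of the overall match.
The sketch below assumes a POSIX-conforming regex implementation; the helper
name ere_match and the input strings in the usage note are illustrative
choices made for this example.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Match the ERE `pattern` against `text`.  Returns the length of the
 * overall match, or -1 if there is none or the pattern does not compile.
 * If `start` is not NULL, the start offset of the match is stored there.
 */
int ere_match(const char *pattern, const char *text, int *start)
{
    regex_t re;
    regmatch_t m[1];
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, m, 0) == 0) {
        if (start != NULL)
            *start = (int)m[0].rm_so;
        len = (int)(m[0].rm_eo - m[0].rm_so);
    }
    regfree(&re);
    return len;
}
```

With these inputs, bb* against abbbc should report a match of length 3
starting at offset 1, the alternation a|ab against ab should report length 2
rather than stopping at the first branch, and (wee|week)(knights|nights)
against weeknights should cover all ten characters.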
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
When it appears inside a bracket expression,
all case counterparts of it are added to the bracket expression,
so that, for example,
No particular limit is imposed on the length of REs**.
Programs intended to be portable should not employ REs longer
as an implementation can refuse to accept such REs and remain
The following is a list of extended regular expressions:
not listed below matches itself.
Any backslash-escaped character
Matches any single character that is not a newline
Matches any single character in
it must be the first character.
A range of characters may be specified by separating the end characters
specifies the lower case characters.
The following literal expressions can also be used in
to specify sets of characters:
.Bd -unfilled -offset indent
[:alnum:] [:cntrl:] [:lower:] [:space:]
[:alpha:] [:digit:] [:print:] [:upper:]
[:blank:] [:graph:] [:punct:] [:xdigit:]
appears as the first or last character of
then it matches itself.
All other characters in
is a collating element, are interpreted according to
.Pq not currently supported .
.It Bq ^ Ns Ar char-class
Matches any single character, other than newline, not in
is the first character of a regular expression, then it
anchors the regular expression to the beginning of a line.
Otherwise, it matches itself.
is the last character of a regular expression,
it anchors the regular expression to the end of a line.
Otherwise, it matches itself.
Anchors the single character regular expression or subexpression
immediately following it to the beginning of a word.
Anchors the single character regular expression or subexpression
immediately following it to the end of a word.
Defines a subexpression
Any set of characters enclosed in parentheses
matches whatever the set of characters without parentheses matches
(that is a long-winded way of saying the constructs
Matches the single character regular expression or subexpression
immediately preceding it zero or more times.
is the first character of a regular expression or subexpression,
then it matches itself.
operator sometimes yields unexpected results.
For example, the regular expression
matches the beginning of the string
(as opposed to the substring
since a null match is the only leftmost match.
Matches the single character regular expression
or subexpression immediately preceding it
Matches the single character regular expression
or subexpression immediately preceding it
.Pf { Ar n , m No }\ \&
.Pf { Ar n , No }\ \&
Matches the single character regular expression or subexpression
immediately preceding it at least
is omitted, then it matches at least
If the comma is also omitted, then it matches exactly
Used to separate patterns.
.Sh BASIC REGULAR EXPRESSIONS
Basic regular expressions differ in several respects:
.Bl -bullet -offset 3n
are ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are
by themselves ordinary characters.
The parentheses for nested subexpressions are
by themselves ordinary characters.
is an ordinary character except at the beginning of the
RE or** the beginning of a parenthesized subexpression.
is an ordinary character except at the end of the
RE or** the end of a parenthesized subexpression.
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading
Finally, there is one new type of atom, a
followed by a non-zero decimal digit
matches the same sequence of characters matched by the
parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
so that, for example,
The following is a list of basic regular expressions:
not listed below matches itself.
Any backslash-escaped character
Matches any single character that is not a newline
Matches any single character in
it must be the first character.
A range of characters may be specified by separating the end characters
specifies the lower case characters.
The following literal expressions can also be used in
to specify sets of characters:
.Bd -unfilled -offset indent
[:alnum:] [:cntrl:] [:lower:] [:space:]
[:alpha:] [:digit:] [:print:] [:upper:]
[:blank:] [:graph:] [:punct:] [:xdigit:]
appears as the first or last character of
then it matches itself.
All other characters in
is a collating element, are interpreted according to
.Pq not currently supported .
.It Bq ^ Ns Ar char-class
Matches any single character, other than newline, not in
is the first character of a regular expression, then it
anchors the regular expression to the beginning of a line.
Otherwise, it matches itself.
is the last character of a regular expression,
it anchors the regular expression to the end of a line.
Otherwise, it matches itself.
Anchors the single character regular expression or subexpression
immediately following it to the beginning of a word.
Anchors the single character regular expression or subexpression
immediately following it to the end of a word.
.It \e( Ns Ar re Ns \e)
Defines a subexpression
Subexpressions may be nested.
A subsequent backreference of the form
is a number in the range [1,9], expands to the text matched by the
For example, the regular expression
matches any string consisting of identical adjacent substrings.
Subexpressions are ordered relative to their left delimiter.
Matches the single character regular expression or subexpression
immediately preceding it zero or more times.
is the first character of a regular expression or subexpression,
then it matches itself.
operator sometimes yields unexpected results.
For example, the regular expression
matches the beginning of the string
(as opposed to the substring
since a null match is the only leftmost match.
.Pf \e{ Ar n , m No \e}\ \&
.Pf \e{ Ar n , No \e}\ \&
Matches the single character regular expression or subexpression
immediately preceding it at least
is omitted, then it matches at least
If the comma is also omitted, then it matches exactly
Base Definitions, Chapter 9 (Regular Expressions).
Having two kinds of REs is a botch.
The current POSIX spec says that
is an ordinary character in the absence of an unmatched
this was an unintentional result of a wording error,
and change is likely.
Back-references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
.Sq a\e(\e(b\e)*\e2\e)*d
POSIX's specification of case-independent matching is vague.
.Dq one case implies all cases
definition given above
is the current consensus among implementors as to the right interpretation.
The syntax for word boundaries is incredibly ugly.