Contents - EasyPatterns 2.8

Conventions: Actual EasyPatterns are highlighted with [...].

Specifying literal text/static text

The only character that is 'special' is the left square bracket or [. The simplest pattern is just literal text, with no left square brackets. Whenever we need EasyPattern keywords we just put them inside [...].

EasyPattern: hello there
Description: No [...] expression has been used; this is just literal text.
Matches: hello there

EasyPattern: hello there [longest 1 or more letters]
Description: The special part is [longest 1 or more letters].
Matches: hello there Fred, hello there Cornelia, etc.

EasyPattern: I am [1 or more digits] years old
Description: The special part is [1 or more digits].
Matches: I am 2 years old, I am 302 years old, etc.

EasyPattern: This is a left square bracket [ '[' ]
Description: This shows how to insert a left square bracket in literal text.
Matches: This is a left square bracket [
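For readers who already know regular expressions, the examples above can be approximated with Python's `re` module. This is only a rough sketch: the regex translations are our own assumption about equivalent behaviour (ASCII letters only), and the `EXAMPLES` table and `matches` helper are hypothetical names, not part of EasyPattern.

```python
import re

# Assumed regex approximations of the EasyPatterns shown above.
EXAMPLES = {
    "hello there": r"hello there",
    "hello there [longest 1 or more letters]": r"hello there [A-Za-z]+",
    "I am [1 or more digits] years old": r"I am \d+? years old",
    "This is a left square bracket [ '[' ]": r"This is a left square bracket \[",
}

def matches(easy_pattern: str, text: str) -> bool:
    """Return True if the assumed regex for easy_pattern matches all of text."""
    return re.fullmatch(EXAMPLES[easy_pattern], text) is not None

print(matches("hello there [longest 1 or more letters]", "hello there Fred"))
print(matches("I am [1 or more digits] years old", "I am 302 years old"))
```

Note that only the left square bracket needs special handling on the EasyPattern side, whereas the regex version has to escape it with a backslash.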

To use multiple keywords, you can either

Put them next to each other [...][...] e.g. [letter][digit] matches "a1", "b1", etc.
Put commas between them [..., ...] e.g. [letter, digit] instead of [letter][digit]
Put spaces between them [... ...] e.g. [letter digit] instead of [letter][digit]

You can also put literal text anywhere inside [...] using single quotes or double quotes.

['literal']

e.g. "['abc']" instead of "abc"

[... 'literal']

e.g. "[digit, 'abc']" instead of "[digit]abc"

['literal' ...]

e.g. "['abc', digit]" instead of "abc[digit]"

[... 'literal' ...]

e.g. "[digit, 'abc', digit]" instead of "[digit]abc[digit]"

Common character classes such as letters and digits

The most important keywords represent character classes or sets, that is, a set of related characters.

Any character, letters, digits, etc.

[character], [char], [chars], [characters]

All 256 chars (every character including NULL). EasyPattern's [character] or [char] will match any character including return. If you want any character except a return (or formfeed), use [paragraphChar]; that is, any character that could appear in a paragraph. Details below.

[letter], [letters]

Includes ?and ? common in certain European languages.

[digit], [digits]

Decimal digits 0-9

[number], [numbers], [numeric] A number with an optional leading sign, digits, optional decimal point and trailing digits
[Integer] A number with an optional leading sign, followed by digits

[Float]

A number with an optional leading sign, digits, optional decimal point and trailing digits, optionally followed by 'e', a sign, and 1 or more digits

[EBCDICletter] An EBCDIC letter
[EBCDICupper] An EBCDIC uppercase letter
[EBCDIClower] An EBCDIC lowercase letter
[EBCDICdigit] An EBCDIC digit, hex F0-F9

[punctuation]

Printing characters, excluding letters and digits, includes !?.,:; " ' ' / - () {} -
Note that ¿ and ¡ are considered punctuation.

[symbol], [symbols]

~@#$%^&*
EasyPattern distinguishes punctuation from symbols; the sets do not overlap. For broader combinations, see [printableChar] and [typewriterChar]. For narrower focus, see [sentencePunctuation], [anyQuote], [anyBracket] and [anyDash].

Special letters

[upper], [uppercase], [uppercaseLetter]

Uppercase letters. Note: In TextPipe you will also need to enable the Match Case option for this to make any difference.

[lower], [lowercase], [lowercaseLetter]

Lowercase letters. Note: In TextPipe you will also need to enable the Match Case option for this to make any difference.

Reserved punctuation

[leftBracket]

[

[rightBracket]

]

[leftParen], [leftParenthesis]

(

[rightParen], [rightParenthesis]

)

[leftAngle], [lessThan] <
[rightAngle], [greaterThan] >

[comma]

,

[singleQuote]

'

[doubleQuote], [quote]

"  (i.e. standard ASCII "straight" quotation mark)

[backwardSingleQuote] `
ASCII function()
asc(code, ...), ascii(code, ...) The ASCII() or ASC() function embeds arbitrary control characters by entering the control code in decimal or hex (precede the hex digits with '$', e.g. $ff). You can add one or more control characters by separating each with a space or comma, e.g. ASC( 65, 66 ) outputs 'AB' into the pattern.
EBCDIC function()
ebcdic( literal ) The EBCDIC() function embeds Mainframe EBCDIC characters translated from a string literal you provide e.g. EBCDIC( '0' ) outputs \xF0 (an EBCDIC '0')
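As a rough illustration of what these two functions produce, here is a Python sketch. The helper names `asc` and `ebcdic` are hypothetical stand-ins, and the choice of EBCDIC code page 037 is an assumption; the text above does not say which code page is used.

```python
def asc(*codes: int) -> str:
    """Build a string from character codes, like ASC( 65, 66 )."""
    return "".join(chr(code) for code in codes)

def ebcdic(literal: str) -> bytes:
    """Translate a literal to EBCDIC bytes (code page 037 is assumed)."""
    return literal.encode("cp037")

print(asc(65, 66))    # AB
print(asc(0x41))      # A  (hex, written $41 in EasyPattern)
print(ebcdic("0"))    # b'\xf0'
```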

Filename patterns

[Drive]

A drive letter followed by a colon (:) e.g d:\special\folder\filename.doc, feeding the letter into @Drive@

[Folder]

A path fragment between \ ... \, e.g. d:\special\folder\filename.doc, feeding into @Folder@

[Path]

A path with optional drive e.g. d:\special\folder\filename.doc, feeding into @Drive@ and @Path@

[UNCpath]

A UNC path consisting of server, share and path - these feed into @Server@, @Share@ and @Path@

[Filename]

A filename, starting from \ and not ending with \ e.g d:\special\folder\filename.doc
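To see which pieces of a path these keywords are aimed at, the same example path can be taken apart with Python's `pathlib`. This is a sketch for orientation only; it uses Python's own path rules, which may differ in detail from what [Drive], [Folder], [Path], [UNCpath] and [Filename] match. The UNC path is a made-up example.

```python
from pathlib import PureWindowsPath

p = PureWindowsPath(r"d:\special\folder\filename.doc")
drive = p.drive            # the part [Drive] is aimed at
folders = p.parts[1:-1]    # the fragments between backslashes
filename = p.name          # the part after the last backslash

# Hypothetical UNC path: server and share come first, then the path.
unc = PureWindowsPath(r"\\server\share\docs\report.txt")

print(drive, folders, filename)
print(unc.drive, unc.name)
```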

Combining character classes and creating your own character classes

There are many ways to create your own character sets to match exactly the characters you require.

You can combine existing character sets using "or":

[... or ...]

e.g. [letter or digit], ['a' or 'b']

It doesn't hurt to add parentheses even though they are not required.
 [letter (letter or digit) letter] -- same as [letter, letter or digit, letter]

Negation - Match anything except a given set

Instead of specifying all the characters that could occur in a match, it is often convenient to specify characters that could not occur.

[not ...], [non ...], [anyExcept ...]

e.g. [oneOrMore non letter]

Custom Sets

Keywords such as [letter] and [digit] are character sets defined internally to EasyPattern; the angle bracket notation lets you define your own character sets. In both cases, EasyPattern matches any single character in that set.

[<...>]

e.g. [<aeiou>], [<135>], [<!@#$%^&*>]
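Combined sets, negation and custom sets all correspond to what regex users call character classes. The sketch below shows assumed Python `re` equivalents (ASCII letters only); the translations are ours, not something EasyPattern emits.

```python
import re

LETTER_OR_DIGIT = re.compile(r"[A-Za-z0-9]")   # [letter or digit]
NON_LETTERS = re.compile(r"[^A-Za-z]+")        # [oneOrMore non letter]
VOWEL = re.compile(r"[aeiou]")                 # [<aeiou>]
ODD_DIGIT = re.compile(r"[135]")               # [<135>]

print(VOWEL.findall("education"))          # ['e', 'u', 'a', 'i', 'o']
print(NON_LETTERS.findall("ab, cd!! ef"))  # [', ', '!! ']
```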

Handling Alternatives - A or B

alternative patterns with "or"

[... or ...]

e.g. ['Player' or 'EasyPattern']

Remember: as noted in the section on expressions, commas are allowed between items to make patterns easier to read; they do not affect what the pattern means.

As noted in the previous section, parentheses are not required when [or] is used to combine character sets.

"or" as set vs. "or" as alternative

In many cases, you don't have to worry that there are two different uses for "or"; both generally make sense in context. However, there are 2 reasons for learning the differences:

How many repeats? Quantity, repetition and optional pieces

Notation: "..." is any appropriate keyword or expression, # is a number (one or more digits; the maximum varies with context).

repetition

examples

will match...

[optional ...], [zeroOrOne ...]

[digit, optional letter], [digit, zeroOrOne letter]

2, 2a

[0+ ...], [zeroOrMore ...]

[digit, zeroOrMore letters]

2, 2a, 2aa, 2aaa, 2aaaa...

[1+ ...], [oneOrMore ...]

[digit, oneOrMore letters]

2a, 2aa, 2aaa, 2aaaa...

[2+ ...], [many ...], [twoOrMore ...]

[digit, many letters], [digit, twoOrMore letters]

2aa, 2aaa, 2aaaa...

[#+ ...]

[digit, 5+ letters]

2aaaaa, 2aaaaaa...

specific quantity, quantity range (where # is a number)

will match...

[# ...]

[5 letters]

aaaaa, bbbbb

[# to # ...]
[# - # ...]
[# .. # ...]

[3 to 5 letters]
[3-5 letters]
[3..5 letters]

aaa, aaaa, aaaaa

Quantities can now be entered in Hex form by preceding them with '$' e.g $ff.
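The quantity notations above map fairly directly onto regex quantifiers. The following Python sketch translates a quantity specifier into the corresponding regex suffix; `parse_number`, `NAMED` and `quantifier` are hypothetical helpers written for this illustration, and the mapping is an assumption rather than EasyPattern's actual translation.

```python
import re

def parse_number(token: str) -> int:
    """Parse a decimal quantity, or a hex one written with a leading '$'."""
    token = token.strip()
    return int(token[1:], 16) if token.startswith("$") else int(token)

NAMED = {"optional": "?", "zeroOrOne": "?", "zeroOrMore": "*",
         "oneOrMore": "+", "many": "{2,}", "twoOrMore": "{2,}"}

def quantifier(spec: str) -> str:
    """Translate an EasyPattern-style quantity into a regex quantifier."""
    spec = spec.strip()
    if spec in NAMED:
        return NAMED[spec]
    if spec.endswith("+"):                      # e.g. 5+
        return "{%d,}" % parse_number(spec[:-1])
    m = re.fullmatch(r"(\$?\w+)\s*(?:to|-|\.\.)\s*(\$?\w+)", spec)
    if m:                                       # e.g. 3 to 5, 3-5, 3..5
        return "{%d,%d}" % (parse_number(m.group(1)), parse_number(m.group(2)))
    return "{%d}" % parse_number(spec)          # e.g. 5 or $ff

print(quantifier("3 to 5"))   # {3,5}
print(quantifier("5+"))       # {5,}
print(quantifier("$ff"))      # {255}
```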

Greediness, or shortest match vs longest match, atomic matching

When the repetition or count includes a range of values to match, EasyPattern has the choice of matching the "shortest" sequence of characters that fits the pattern, or the "longest" that fits the pattern. For example

[shortest zeroOrOne ...] 0 or 1 will try to match zero occurrences
[shortest zeroOrMore ...] 0+ will try to match zero occurrences
[shortest oneOrMore ...] 1+ will try to match one occurrence
[shortest twoOrMore ...] 2+ will try to match two occurrences

EasyPattern defaults to the SHORTEST match so the "shortest" keyword is optional.

[shortest ... ...]

match the lowest possible number of repetitions (default)

[longest ... ...]

match the highest possible number of repetitions
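In regex terms, "shortest" corresponds to a lazy quantifier and "longest" to a greedy one. A small Python sketch of the difference, using assumed regex equivalents of [digit, shortest oneOrMore letters] and [digit, longest oneOrMore letters]:

```python
import re

text = "2aaaa"

# Lazy quantifier: stop after the fewest repetitions that still match.
shortest = re.match(r"\d[A-Za-z]+?", text).group()

# Greedy quantifier: take as many repetitions as possible.
longest = re.match(r"\d[A-Za-z]+", text).group()

print(shortest)  # 2a
print(longest)   # 2aaaa
```

When more pattern follows the repetition, a shortest match still grows as far as needed for the rest of the pattern to succeed.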

Pattern matching can become very time consuming if the number of repeats is not known. Take for example

  Pattern: a[ 1+ digits ]b   Matching text: a22222222z

The pattern matcher first matches a, and 8 '2's, then it finds that 'z' does not match 'b'. So it backtracks, trying with 7 '2's, failing again, then with 6 '2's, all the way back to 1 '2', before finally giving up, and starting to test for 'a' again. If we know that backtracking into a repeated match will still result in failure, we can tell EasyPatterns to not bother, by using the atomic keyword.

  Pattern: a[ atomic(1+ digits) ]b  Matching text: a22222222z

This time, the pattern matcher first matches a, and 8 '2's, then it finds that 'z' does not match 'b'. Because the repeated match is atomic, it does not retry with fewer '2's; it goes straight back to testing for 'a' again.
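The effect of atomic matching can be imitated in Python's `re` module with a lookahead plus a backreference, a common idiom that works on Python versions without native atomic groups. This is a sketch of the idea under that assumption, not a description of how EasyPattern implements the atomic keyword.

```python
import re

# Ordinary repetition: the matcher may give back digits one at a time.
plain = re.compile(r"a\d+b")

# Emulated atomic(1+ digits): the lookahead captures the longest digit run
# once, and \1 consumes exactly that run, so no digits are ever given back.
atomic = re.compile(r"a(?=(\d+))\1b")

# Both agree on the example above; the atomic form just fails sooner.
print(plain.search("a22222222z"), atomic.search("a22222222z"))  # None None

# The semantics differ when the following pattern needs a digit back.
print(bool(re.search(r"a\d+\db", "a222b")))           # True
print(bool(re.search(r"a(?=(\d+))\1\db", "a222b")))   # False
```

The last two lines show why atomic should only be used when backtracking into the repeated part could never help.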

Literals, groups

All of the repetition & quantity keywords can be applied to literals and groups as well as to individual keywords, e.g.
 [oneOrMore 'ab'] -- matches "ab", "abab", "ababab" etc.
 [oneOrMore letter or digit] -- matches "aaa", "456", "a45bbb" etc.
 [oneOrMore not letter or digit] -- matches punctuation, symbols, whitespace etc.
 [oneOrMore ('alpha' or 'omega')] -- matches "alphaalpha", "alphaomega" etc.
 [oneOrMore (letter, digit)] -- matches "r2", "r2d2", "r2d2f7b2c4" etc.

Grouping text and capturing text for use in a replace string

[(...)]

A non-capturing group

[capture(...)],

[capture(...) as 'varname' ]

Assigns the contents of the group to a variable which can be referred to later both in the search pattern ([group#], e.g. [group6]; # can range from 1 to 26) and in the replacement string ($#, e.g. $6; # can be 1-9 or a-z, and $0 represents the entire matched string). If a variable name is specified, the text is also stored in the global variable @varname, in addition to the positional variables $1, $2 etc.

[group#]

Matches the same text that a previously captured group found.

[capture(letter), group1] -- matches "ee", "bb", "cc" etc
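Captured groups and [group#] behave much like capturing groups and backreferences in regex. The Python sketch below uses assumed equivalents; note that Python writes replacement references as \1, \2 where EasyPattern uses $1, $2.

```python
import re

# Assumed equivalent of [capture(letter), group1]: a letter repeated once.
DOUBLED = re.compile(r"([A-Za-z])\1")
print(DOUBLED.findall("bookkeeper"))   # ['o', 'k', 'e']

# Using captures in a replacement: swap the two numbers around the dash.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")
print(swapped)                         # 20-10
```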

[mustBeginWith(...) ...], [mustNotBeginWith(...) ...]

When a match is found, it must be/must not be preceded by what is in the brackets. The bracket contents are NOT included in the actual match. The bracket contents are limited to fixed length strings - so no '3+' etc are allowed. This must be the first part of your pattern.

[mustBeginWith( 'hello' or 'goodbye' ) 'fred']

[... mustEndWith(...)], [... mustNotEndWith(...)]

When a match is found, it must be/must not be followed by what is in the brackets. The bracket contents are NOT included in the actual match. The bracket contents are limited to fixed length strings - so no '3+' etc are allowed. This must be the last part of your pattern.

['fred' mustEndWith( 'erick' or 'dy' ) ]
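Regex users will recognise these as lookbehind and lookahead assertions. The sketch below gives assumed Python equivalents of the two examples. Python's lookbehind needs every alternative to have the same width, so 'hello' and 'goodbye' are written as two separate lookbehinds.

```python
import re

# [mustBeginWith( 'hello' or 'goodbye' ) 'fred']
BEGIN = re.compile(r"(?:(?<=hello)|(?<=goodbye))fred")

# ['fred' mustEndWith( 'erick' or 'dy' ) ]
END = re.compile(r"fred(?=erick|dy)")

print(BEGIN.search("goodbyefred").group())  # fred
print(BEGIN.search("hi fred"))              # None
print(END.search("frederick").group())      # fred
print(END.search("fred"))                   # None
```

In each case the surrounding text is checked but only "fred" is part of the match.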

Commenting your patterns for readability

EasyPattern allows comments to be included in multi-line patterns using the character ';' or '#' to mark the start of a comment, which extends until the end of the line e.g.

[ 3 space ;look for 3 spaces
  'hello'  #then the keyword we want
]

Patterns for Whitespace

[space], [spaces]

ASCII 32

[nonbreakingSpace]

ASCII 202

[whitespace] [space OR tab OR cr OR lf OR verticalTab OR nonbreakingSpace]

[tab]

ASCII 9, \t

[return], [cr]

ASCII 13, \r

[linefeed], [lf]

ASCII 10, \n

[verticalTab]

ASCII 11

[formfeed]

ASCII 12, \f

[null] ASCII 0
[CRLF] [return, linefeed]
[newline] [(return, linefeed) or return or linefeed]
[DOSNewline] [return, linefeed]
[UNIXNewline] [linefeed]
[MacNewline] [return]

Whitespace combinations

[horizontalWhitespace], [hSpace]

[space or nonbreakingSpace or tab]

[verticalWhitespace], [vSpace]

[return or linefeed or formfeed or verticalTab]

words, columns, lines & paragraphs

[wordDelimiter]

[space OR tab OR linefeed OR verticalTab OR formfeed OR return]

[wordChar]

[not wordDelimiter]

[word]

[1+ wordChar]

 

 

[columnDelimiter]

[tab OR linefeed OR formfeed OR return]

[columnChar]

[not columnDelimiter]

[column]

[1+ columnChar]  Note: Use [0+ columnChar] instead if the column could be blank

 

 

[lineDelimiter]

[linefeed OR verticalTab OR formfeed OR return]

[lineChar]

[not lineDelimiter]

[line]

[1+ lineChar]  Note: Use [0+ lineChar] instead if the line could be blank

 

 

[paragraphDelimiter]

[formfeed OR return]

[paragraphChar]

[not paragraphDelimiter]

[paragraph]

[1+ paragraphChar]
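The four levels (word, column, line, paragraph) differ only in which delimiters they exclude. The Python sketch below writes each definition as a negated character class, following the delimiter lists given above; the regex forms are our own assumed translation.

```python
import re

WORD = r"[^ \t\n\x0b\x0c\r]+"     # [1+ wordChar]
COLUMN = r"[^\t\n\x0c\r]+"        # [1+ columnChar]
LINE = r"[^\n\x0b\x0c\r]+"        # [1+ lineChar]
PARAGRAPH = r"[^\x0c\r]+"         # [1+ paragraphChar]

text = "one two\tthree\nfour\rfive"

print(re.findall(WORD, text))       # ['one', 'two', 'three', 'four', 'five']
print(re.findall(COLUMN, text))     # ['one two', 'three', 'four', 'five']
print(re.findall(LINE, text))       # ['one two\tthree', 'four', 'five']
print(re.findall(PARAGRAPH, text))  # ['one two\tthree\nfour', 'five']
```

Note how a linefeed ends a line but not a paragraph, which is why [paragraphChar] can span several lines.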

Positions

[textStart]

matches at start of entire text

[textEnd]

matches at end of the entire text or before newline at end

[lineStart]

matches the start of a line (*)

[lineEnd]

matches the end of a line (*)

[wordBoundary] or [wordBreak]

matches at a word boundary

[notWordBoundary]

matches when not at a word boundary

(*) [lineStart] and [lineEnd] will work fine if the file you're editing has Unix end of line characters, because the core EasyPattern engine assumes this. For DOS or Windows files,  you should use
  [ cr lf or textEnd ]
or
  [ mustEndWith(cr lf or textEnd) ]
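The second form, [ mustEndWith(cr lf or textEnd) ], corresponds to a lookahead in regex terms. Here is a Python sketch of the same idea, finding the last word on each line of a DOS-style text; the regex is an assumed equivalent and the sample text is made up.

```python
import re

dos_text = "alpha beta\r\ngamma delta"

# A word that must be followed by CR LF or by the end of the text.
# The line ending is checked but not included in the match.
LAST_WORD = re.compile(r"\w+(?=\r\n|\Z)")

print(LAST_WORD.findall(dos_text))  # ['beta', 'delta']
```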

More Keywords

Combinations

[controlChar]

characters 0-31, 127 (careful: includes most whitespace)

[gremlin]

characters 0-31. The definition for [gremlin] is more cautious than in some products.

[printableChar]

[letter or digit or punctuation or symbol] (anything that prints ink on paper)

[typewriterChar]

[printableChar or space or tab or return] (excludes linefeed, vertical tab & formfeed)

Punctuation subsets (these items are included in [punctuation])

[sentencePunctuation]

.,;:!?¿¡

[anyBracket], [anyBrackets]

left/right paren/bracket/brace (i.e. "bracket" in the broad sense of the term)

[anyQuote]

[doubleQuote OR singleQuote OR backwardSingleQuote]

[dash], [hyphen]

-  (the two keywords are interchangeable; we have adopted the common notion that these terms refer to the same character)

[period]

.

[caret]

^

[pound], [hash]

#

[slash]

/

[backslash]

\

[colon]

:

[percent]

%

[star], [asterisk]

*

[ampersand] &

[pipe]

|

Real-world patterns

[HTMLTag]

<[1+ not '>']>

[HTMLStartTag]

<[not '/', 0+ not '>']> (i.e. any tag except an end tag)

[HTMLEndTag]

</[1+ not '>']>

[QuotedString]

[quote, 1+ ((backslash, quote) or not quote), quote]

[SocialSecurityNumber]

[3 digits, dash, 2 digits, dash, 4 digits]
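The definitions above translate almost word for word into regex. The Python sketch below shows assumed equivalents for the HTML tag patterns and [SocialSecurityNumber]; the sample strings are made up for illustration.

```python
import re

HTML_TAG = r"<[^>]+>"            # <[1+ not '>']>
HTML_START_TAG = r"<[^/][^>]*>"  # <[not '/', 0+ not '>']>
HTML_END_TAG = r"</[^>]+>"       # </[1+ not '>']>
SSN = r"\d{3}-\d{2}-\d{4}"       # [3 digits, dash, 2 digits, dash, 4 digits]

html = "<p>Hello <b>world</b></p>"

print(re.findall(HTML_TAG, html))        # ['<p>', '<b>', '</b>', '</p>']
print(re.findall(HTML_START_TAG, html))  # ['<p>', '<b>']
print(re.findall(HTML_END_TAG, html))    # ['</b>', '</p>']
print(re.findall(SSN, "ID 123-45-6789 on file"))  # ['123-45-6789']
```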

[PhoneNumber]

Matches a US-style (xxx) xxx-xxxx number with a variety of punctuation marks. The matching text is captured into 3 successive $variables

[EmailAddress]

Matches email addresses. The name and domain parts are captured into 2 successive $variables

[IPAddress]

Matches numeric IP addresses. The matching text  is captured into 4 successive $variables

[CreditCard]

Matches credit card numbers with a variety of punctuation marks. The matching text is captured into 4 successive $variables

[Hyperlink]

Matches a ftp, http, https, telnet, gopher or nntp internet url. The matching text is captured into 3 successive $variables

[DuplicateWord]

Matches a repeated word. The matching text is captured into 2 successive $variables

[PageNumber]

Matches a page number of the following forms:
Page dd
Page No dd
Page No. dd
Page Num. dd
Pg Num dd
Page Number dd.

The matching text is captured into 3 successive $variables (Page, Number, #)

Data processing patterns (in TextPipe 6.8.2 and later)

[CSVfield]

A Comma-Separated-Value field. If fields are delimited by single or double quotes, embedded newlines are allowed, as are doubled-up quotes. The quotes are returned as part of the match.

[TABfield] A Tab-delimited field. To process multiple tab fields e.g.
  [ 3 or more ( TABfield Tab) TABfield ]

[PipeField]

A Pipe-delimited field. To process multiple pipe fields e.g.
  [ 3 or more ( PipeField '|' ) PipeField ]

Date and time patterns

[Date]

Matches a date format DD-MM-YY or DD-MMM-YY e.g. 01-Jan-02, 29-03-98

[AMPM]

The AM/PM part of a time

[Month]

A MonthName or a MonthNumber

[MonthNumber]

1-12, with an optional leading zero e.g. 03, 12, 4, 7

[MonthName], [MonthNameShort], [MonthNameLong]

January-December and Jan-Dec

[MonthNameLocal] Full month names and 3 letter abbreviations for the current locale

[Day]

1-31, with an optional leading zero e.g. 1, 13, 08, 28

[DayNumber]

01-31 (the leading zero is required) e.g. 01, 08, 13

[DayName], [DayNameShort], [DayNameLong]

Sunday-Saturday and Sun-Sat
[DayNameLocal] Weekday names and 3 letter abbreviations for the current locale

[DayOfYear]

1..366

[Year], [YearShort], [YearLong]

A 2 or 4 digit year (between 1800 and 2199)

[Hour]

A 12 or 24-hour hour, with optional leading zero

[Minute]

A 2 digit minute with leading zero

[Second]

A 2 digit second with leading zero

Using the real world patterns above, you can easily construct the following EasyPatterns:

HMS

[ Hour <:.-> Minute <:.-> Second ]

DMY

[ Day <-/ > Month <-/ > Year ]

MDY

[ Month <-/ > Day <-/ > Year ]

YMD

[ Year <-/ > Month <-/ > Day ]

Julian

[ Year DayOfYear ]

MY

[ Month <-/ > Year ]

MD

[ Month <-/ > Day ]

DM

[ Day <-/ > Month ]

HM

[ Hour <:. > Minute ]
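The same building-block approach works in regex. The Python sketch below assembles HMS and HM from smaller pieces. The regex definitions of the pieces are assumptions based on the descriptions above (for instance, hours are taken to run from 0 to 23); they are not EasyPattern's internal definitions.

```python
import re

HOUR = r"(?:[01]?\d|2[0-3])"   # 12 or 24-hour, optional leading zero
MINUTE = r"[0-5]\d"            # 2 digits with leading zero
SECOND = r"[0-5]\d"            # 2 digits with leading zero

# [ Hour <:.-> Minute <:.-> Second ]
HMS = re.compile(HOUR + r"[:.\-]" + MINUTE + r"[:.\-]" + SECOND)

# [ Hour <:. > Minute ]
HM = re.compile(HOUR + r"[:. ]" + MINUTE)

print(bool(HMS.fullmatch("23:59:08")))  # True
print(bool(HMS.fullmatch("23:5:08")))   # False
print(bool(HM.fullmatch("09 30")))      # True
```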

Advanced - Operator precedence, Order of operations

A complete pattern may include many individual keywords and many expressions. How do you know which keywords go together and where one expression stops and another begins? If in doubt, just enclose every expression in parentheses. However, EasyPattern has rules for combining keywords into expressions, so parentheses aren't always required. The traditional way of expressing these rules is to list the "precedence" of various operators or terms.

Items with high precedence don't need parentheses; they group together automatically. For example, let's build a pattern step-by-step using the "high precedence" operators:
 [letter or digit] -- "or" for character set keywords
 [letter or digit or '.'] -- and single-character literal
 [letter or digit or '.' or <!?>] -- and arbitrary set
 [not letter or digit or '.' or <!?>] -- reverse the meaning with not
 [1+ not letter or digit or '.' or <!?>] -- add a quantity specifier
 [1+ (not (letter or digit or '.' or <!?>))] -- if you like parentheses, though the meaning is the same

Adding lower precedence terms before, after or both doesn't change the grouping, though the expression is long enough that you may find a pair of commas, brackets, or parentheses helpful. As long as you understand how EasyPattern is doing the grouping, it doesn't matter whether you choose commas, brackets or parentheses. If the parentheses are added around something that is already a group, they don't change the meaning.
 [punctuation 1+ not letter or digit or '.' or <!?> symbol]
 [punctuation, 1+ not letter or digit or '.' or <!?>, symbol] -- same meaning but easier to read
 [punctuation][1+ not letter or digit or '.' or <!?>][symbol] -- same meaning
 [punctuation (1+ not letter or digit or '.' or <!?>) symbol] -- same meaning

Remember, commas and brackets don't change the meaning, only the look. If you put them in the middle of high precedence terms, you might confuse yourself:
 [punctuation 1+ not letter][or][digit or '.' or <!?> symbol] -- same meaning but HARDER to read
 [punctuation 1+ not letter, or, digit or '.' or <!?> symbol] -- same meaning but HARDER to read

Only parentheses change the meaning:
 [(punctuation 1+ not letter) or (digit or '.' or <!?> symbol)] -- different meaning

Note that [or] for character sets and [or] as alternative have opposite precedence. See Character Sets and Alternatives (above) for details & examples.
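To make the grouping concrete, here is a Python sketch with assumed regex equivalents. It compares the grouping described above, where "not" applies to the whole combined set, with what the pattern would mean if "not" applied to [letter] alone. The sample text is made up.

```python
import re

sample = "a-b, c!? d"

# [1+ not letter or digit or '.' or <!?>] as grouped by the precedence rules:
# one or more characters outside the combined set.
GROUPED = r"[^A-Za-z0-9.!?]+"

# For contrast only: "not" applied to [letter] alone.
NOT_LETTER_ONLY = r"[^A-Za-z]+"

print(re.findall(GROUPED, sample))          # ['-', ', ', ' ']
print(re.findall(NOT_LETTER_ONLY, sample))  # ['-', ', ', '!? ']
```

The two readings give different results for "!?", which is why it helps to know how the terms are grouped.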

EasyPattern vs. perl regex or grep

At its core, EasyPattern uses "regular expression" technology that is similar to the "regex" or "grep" tools that originated on UNIX. EasyPattern's primary benefit is that the patterns are much easier to read and write.

For those who have some experience with regex, here are a few specific differences:

We welcome suggestions for extensions to the EasyPattern language - please contact us.