The Black Art of Pattern Matching

There are two steps to creating a pattern. First, come up with a pattern that will match exactly what you want. Second, translate that abstract pattern into a specific syntax. The goal of the EasyPattern language is to make the second step as easy as possible, to let you create patterns that are easy to read and easy to write. The first step is more elusive. While some patterns are extraordinarily simple to create and understand, others can be quite challenging. Pattern matching is both an art and a skill. It takes thought. Experience counts. This document attempts to shed a little light on the art of pattern matching.

Any interesting bit of text can be described by multiple patterns. For example, each of the following patterns (and more) correctly describes "978-692-1256":
[1+ char]
[12 chars]
[1+ not whitespace]
[1+ paragraphChar]
[1+ digit or punctuation]
[12 digit or punctuation]
[1+ digit or '-']
[12 digit or '-']
[3 digits, punctuation, 3 digits, punctuation, 4 digits]
[3 digits, '-', 3 digits, '-', 4 digits]
978[1+ digit or punctuation]
978[punctuation, 3 digits, '-', 4 digits]
978-[3 digits, '-', 4 digits]
978[punctuation]692[punctuation]1256
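
For readers more used to regular expressions, the same point can be sketched in Python. The regexes below are only rough approximations of a few of the EasyPatterns above (the mapping is an assumption, not an official translation), and the names TARGET and PATTERNS are made up for this illustration.

```python
import re

TARGET = "978-692-1256"

# Approximate regex counterparts of some EasyPatterns from the list,
# ordered from least specific to most specific.
PATTERNS = [
    r".+",                 # [1+ char]
    r".{12}",              # [12 chars]
    r"\S+",                # [1+ not whitespace]
    r"[\d-]+",             # [1+ digit or '-']
    r"[\d-]{12}",          # [12 digit or '-']
    r"\d{3}-\d{3}-\d{4}",  # [3 digits, '-', 3 digits, '-', 4 digits]
    r"978-\d{3}-\d{4}",    # 978-[3 digits, '-', 4 digits]
]

# Every pattern describes the whole target text.
all_match = all(re.fullmatch(p, TARGET) for p in PATTERNS)
print(all_match)
```

All seven describe the same text. They differ only in what else they would accept.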

Search

How do you decide which one to use? There are (initially) two considerations:

1. What other text must the pattern match?
2. What text should the pattern not match, i.e. what pattern is required to distinguish the target text from the surrounding text?

The above patterns are loosely arranged from least specific to most specific: the first will match almost anything, the last will match only a few variations on the original. Experienced pattern matchers probably start somewhere in the middle and then move "up" or "down" as they find other cases to match or similar text that should not be matched.

For example, start with [1+ digit or '-']. That pattern would probably suffice to match telephone numbers in plain text. However, it would also match social security numbers (###-##-####) and even single digits. Moving slightly more specific, [12 digit or '-'] would solve both problems. But it's likely that we want to match not just this telephone number, but all telephone numbers. Temporarily ignoring the possibility of a leading left parenthesis, we still know that / and . are likely separators, yielding [12 digit or '-' or '/' or '.']. Sometimes it's easier to be general rather than specific, e.g. [12 digit or punctuation]. However, that could also match long dollar amounts or numeric IDs. For North American phone numbers, the sequence forms a clear pattern, leading to [3 digits, punctuation, 3 digits, punctuation, 4 digits].
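
The trade-off described above can be tried out with a small Python sketch. The regexes are approximations of the EasyPatterns, the sample strings are invented, and the SHAPED pattern narrows "punctuation" to just -, / and . as a simplifying assumption.

```python
import re

# Invented sample texts: one target and several near misses.
SAMPLES = {
    "phone": "978-692-1256",
    "dotted phone": "978.692.1256",
    "ssn": "123-45-6789",
    "single digit": "7",
    "numeric id": "123456789012",
}

LOOSE = r"[\d-]+"                      # [1+ digit or '-']
FIXED = r"[\d-]{12}"                   # [12 digit or '-']
SHAPED = r"\d{3}[-/.]\d{3}[-/.]\d{4}"  # [3 digits, punctuation, 3 digits, punctuation, 4 digits]

def matches(pattern):
    """Return the names of the samples that the pattern matches in full."""
    return {name for name, text in SAMPLES.items() if re.fullmatch(pattern, text)}

for label, pattern in [("LOOSE", LOOSE), ("FIXED", FIXED), ("SHAPED", SHAPED)]:
    print(label, sorted(matches(pattern)))
```

The loose pattern accepts the social security number, the single digit and the numeric ID. Fixing the length drops the first two but still accepts the ID. Only the shaped pattern accepts both phone formats and nothing else in this sample set.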

Remember consideration #1: what other text must be matched? In this case, perhaps worldwide telephone numbers. I'm not familiar with all the variations but here's an attempt: [6+ digit or space or '+' or '/' or '.' or '-']. Now balance with consideration #2: would this pattern find too much? Maybe. It would match 6 spaces. If the document might have this many spaces in a row, perhaps they could first be converted to a tab character. Or perhaps it's better to make the pattern more specific, e.g. to require at least 2 digits in a row: [2 digits, 4+ digit or space or '+' or '/' or '.' or '-']. What about a set of digits that isn't a telephone number? There's no easy answer here. Perhaps the telephone number is always labelled with "tel" or "telephone"; perhaps it always appears on its own line. Different documents may require different patterns, or even a multi-step process, such as that provided by TextPipe.
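
A rough Python sketch of that balancing act follows. The regexes approximate the two EasyPatterns above. The LABELLED pattern, which relies on a "tel" or "telephone" label and allows an optional leading +, is a hypothetical extension for illustration only, and the phone number is made up.

```python
import re

# [6+ digit or space or '+' or '/' or '.' or '-']
BROAD = re.compile(r"[\d +/.\-]{6,}")

# [2 digits, 4+ digit or space or '+' or '/' or '.' or '-']
TIGHTER = re.compile(r"\d{2}[\d +/.\-]{4,}")

# Hypothetical: only accept a number that follows a "tel"/"telephone" label.
LABELLED = re.compile(
    r"\btel(?:ephone)?\b[:.]?\s*(\+?\d{2}[\d +/.\-]{4,})", re.IGNORECASE
)

six_spaces = " " * 6
line = "tel +44 20 7946 0958"

print(bool(BROAD.fullmatch(six_spaces)))  # the broad pattern accepts bare spaces
print(TIGHTER.search(six_spaces))         # the tighter pattern does not
print(TIGHTER.search(line).group())       # starts at the first two digits
print(LABELLED.search(line).group(1))     # keeps the leading '+'
```

Note a side effect of requiring two leading digits: the match begins at "44", so the "+" in front of it is left out. Whether that matters depends on what the replacement step needs.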

Replace

Having matched the text, what do you want to do with it? Deleting is easy: just replace with nothing. Inserting text before or after (or both) is equally simple: use $0 in the replacement pattern, with the new text either before or after it. What if you want to make changes inside the match? That's the third consideration.
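
As a point of comparison, here is how deleting and wrapping a match might look in Python, where \g<0> plays the role that $0 plays in an EasyPattern replacement. The regex is an approximation of the phone pattern, and the <tel> markers are arbitrary text chosen for the example.

```python
import re

# Approximation of [3 digits, punctuation, 3 digits, punctuation, 4 digits]
PHONE = re.compile(r"\d{3}[-/.]\d{3}[-/.]\d{4}")

text = "Call 978-692-1256 today."

# Delete: replace the match with nothing.
deleted = PHONE.sub("", text)

# Insert before and after: \g<0> stands for the whole matched text.
wrapped = PHONE.sub(r"<tel>\g<0></tel>", text)

print(deleted)  # Call  today.
print(wrapped)  # Call <tel>978-692-1256</tel> today.
```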

With telephone numbers, one common task is to match multiple formats and convert to a single format. To do so, each part of the text to be kept must be grouped and labelled (with a number from 1 to 20). If you have read the complete EasyPattern docs and looked at the examples, this pattern is by now familiar:
replace "[(3 digits)1 punctuation (3 digits)2 punctuation (4 digits)3]"
with "$1.$2.$3"
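
The same conversion can be sketched in Python, where numbered groups are referenced as \1, \2 and \3 instead of $1, $2 and $3. This is an illustrative equivalent rather than EasyPattern syntax. The class [^\w\s] is used here as a stand-in for "punctuation", and the function name normalize is made up.

```python
import re

# Three captured groups of digits separated by a single punctuation character.
PHONE = re.compile(r"(\d{3})[^\w\s](\d{3})[^\w\s](\d{4})")

def normalize(text):
    """Rewrite every matched phone number as ###.###.####."""
    return PHONE.sub(r"\1.\2.\3", text)

print(normalize("978-692-1256 or 978/692/1256"))
# 978.692.1256 or 978.692.1256
```

Only the grouped digits survive into the output. The original separators are matched but not captured, so they are discarded and replaced by the dots in the replacement text.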