A Regex Introduction
What is regex?
A regular expression (or regex) is a simple, rather mindless
way of matching a series of symbols to a pattern you have in mind.
As we discuss elsewhere,
there are certain patterns that you cannot match with regex.
But mostly, if you want to find a pattern in text, regex is the way
to go, and Perl regex will get you there. As the Perl manual says:
"Perl is an interpreted language optimized for scanning
arbitrary text files, extracting information from those
text files, and printing reports based on that information"
Regex has been around for some time - those who have struggled
with computer theory (in basic computing courses at university) will
know it well. Actually, it's not that bad. The basic ideas are simple,
but powerful.
Basic ideas
The rules are simple:
- We want to know whether a text string matches a pattern. A simple
'yes' or 'no' will do nicely, thank you very much.
- Every pattern we want a match for, we will turn into a 'finite
state machine'.
- We will feed the string we want to check into the machine representing
our pattern, and the machine will miraculously spit out either yes or no.
And that's it. Well, not quite, for we still have to say how we're
going to specify our patterns (and if we're really keen, perhaps look at
the machines that are manufactured according to our specifications).
Matches
Using regex, we will at some stage want to match a particular string.
Let's say we have a body of text (however long) and want to find if
it contains the string "blahblah" at some point. If we find "blahblah",
we will return true, otherwise we will fail. Here's the regex:
/blahblah/
Not too bad, was it? All we do is place the text we wish to find
in between two (forward) slashes, and Perl does the rest. Note that
this regex will match each of the following:
blahblah
I'm so boredblahblah
and so blahblah am I!
As long as the string is somewhere in the text, we have a match!
We lied(!)
Okay, in the example above, we lied just a little. If we have a string
in Perl, we have to store it somewhere. Let's say we have the string
"ABCblahblahDEF" stored in a Perl string called $mystring - how do
we test for the presence of "blahblah" in the string? Actually, it's like
this:
$mystring =~ /blahblah/
We specify the string to be tested by writing its name, followed by
=~ and then the regex. Note that the above
test will return a value of true or false, so we might include the
test in some actual Perl code as follows:
if ( $mystring =~ /blahblah/ ) {
    print "Hooray it worked!\n";
}
A tiny aside: note that you can turn around the sense of the
regex, so that true becomes false and vice versa, simply by saying:
$mystring !~ /blahblah/
In other words, !~ negates the regex.
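Here is a minimal sketch of the two binding operators in a complete script. $mystring is the name used in the text; $found and $missing are names made up purely for illustration.

```perl
use strict;
use warnings;

my $mystring = "ABCblahblahDEF";

# =~ is true when the pattern is found anywhere in the string
my $found = ( $mystring =~ /blahblah/ ) ? "yes" : "no";

# !~ is true when the pattern is NOT found
my $missing = ( $mystring !~ /blahblah/ ) ? "yes" : "no";

print "found: $found, missing: $missing\n";
```

The two tests are always opposites of one another, so in practice you pick whichever reads more naturally in your if statement.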
Anchoring the regex
If we want to anchor the search, so that the text has to start with
"blahblah", then we can say:
/^blahblah/
and similarly, if we insist that the text ends with "blahblah", then
we put in a dollar sign at the end:
/blahblah$/
Okay, the makers of Perl could presumably have thought up more mnemonic
symbols, but, they work (and are now time-hallowed). Get to know them.
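To see the two anchors side by side, here is a small sketch. The sample strings and the @report array are invented for the example; only the ^ and $ anchors come from the text.

```perl
use strict;
use warnings;

my @samples = ( "blahblah and more", "more and blahblah", "so blahblah am I" );
my @report;

for my $text (@samples) {
    my $starts = ( $text =~ /^blahblah/ ) ? 1 : 0;    # ^ anchors at the start
    my $ends   = ( $text =~ /blahblah$/ ) ? 1 : 0;    # $ anchors at the end
    push @report, "$starts$ends";
    printf "%-20s starts=%d ends=%d\n", $text, $starts, $ends;
}
```

The third string contains "blahblah" but neither starts nor ends with it, so both anchored tests fail even though the plain /blahblah/ would succeed.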
Matching Anything
Let's say we have now become a bit more ambitious, and wish to match
any one of a set of characters. We might want to match one of several
words, for example the words "shot" and "shut". The regex is:
/sh[ou]t/
We put the several options in square brackets!
The above would match shot or shut, but not, say,
"shxt". Note that "shout" would NOT be matched - the square
brackets select between single options!
What if we want to match any single character? Try:
/sh.t/
The dot (period, if you wish) can be taken to match any character
whatsoever! {Note: except a 'newline', but that's for later}.
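A quick way to convince yourself is to run a handful of words through both kinds of pattern. This is only a sketch; the word list and array names are made up for the example.

```perl
use strict;
use warnings;

my @words = qw(shot shut shxt shout);

# Square brackets: exactly one character, and it must be o or u
my @class_hits = grep { /sh[ou]t/ } @words;

# Dot: exactly one character, but any character will do
my @dot_hits = grep { /sh.t/ } @words;

print "class: @class_hits\n";
print "dot:   @dot_hits\n";
```

Neither pattern accepts "shout", because both insist on a single character between "sh" and "t".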
Matching several characters
Let's set our sights even higher. Say we wished to 'match' several characters
in the middle, for example, "shaaaaght", "shxxt", or even perhaps
"sh$##$#$#@@t". How do we do this? Thus:
/sh.+t/
The + sign tells Perl to 'match one or more of the preceding
character'. As the preceding character was "anything", Perl looks for one
or more "anythings", followed by a t! Ask yourself, what would
the following match?
/sho+t/
Clearly, shot, shoot, and even shooooooooooot. The question you have
to ask yourself is "How do I match zero or more characters?" How
could we match, say, the text string "sht" as well as "shot", "shut", and
so on? (The string "sht" will clearly fail if
we try something like /sh.+t/ ). The answer is:
/sh.*t/
You have to be careful with this * thingy. You'll find that
thinking in terms of "matching absolutely no characters" is sometimes
a little tricky, and if you're not careful, you could end up bashing
your head quite hard against your keyboard.
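The difference between + and *, and the "matching nothing" trap, can be sketched as follows. The word list and variable names are invented for illustration.

```perl
use strict;
use warnings;

my @words = qw(sht shot shoot shxxt);

my @plus_hits = grep { /sh.+t/ } @words;    # needs at least one character in the middle
my @star_hits = grep { /sh.*t/ } @words;    # happy with nothing in the middle
my @o_hits    = grep { /sho+t/ } @words;    # one or more letter o, nothing else

# The trap: x* can match zero x's, so it "matches" a string with no x at all
my $trap = ( "abc" =~ /x*/ ) ? 1 : 0;

print "plus: @plus_hits\nstar: @star_hits\no+:   @o_hits\ntrap: $trap\n";
```

The last test is the head-bashing case: a pattern made only of starred items succeeds on every string, because matching nothing counts as a match.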
Escaping confusion
By now, you're probably saying to yourself "What if I want to match
one of those fancy characters you've been using - for example, ^ .
/ * + and so on?" A problem, but not insuperable. Let's say you want
to look for the text string "a + b". If you say:
/a + b/
Then you will get a match to an "a" followed by one or more blanks, followed
by yet another blank, and then a "b", but you certainly won't get what
you want! The solution is:
/a \+ b/
- we use a simple \ (backslash) to indicate that the subsequent character
(here the "+") is to be regarded as something to match, and not some fancy
control character. We say that we escape the "+" character. You can
do similar things with "\/" "\." "\[" and so on.
When in doubt, it's probably best to escape. It may not look
pretty, but remember that Perl uses an awful lot of characters as special
controls. We will soon encounter more!
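Here is a small sketch of the "a + b" problem and its cure. The test strings and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $text = "sum = a + b";

# Unescaped: " +" means "one or more blanks", so the literal plus sign is never matched
my $unescaped = ( $text =~ /a + b/ ) ? 1 : 0;

# Escaped: \+ means "a real plus sign"
my $escaped = ( $text =~ /a \+ b/ ) ? 1 : 0;

# What the unescaped version really matches: an a, blanks, and a b
my $blanks = ( "a   b" =~ /a + b/ ) ? 1 : 0;

print "unescaped=$unescaped escaped=$escaped blanks=$blanks\n";
```

One caution on "when in doubt, escape": this advice applies to punctuation. A backslash in front of a letter or digit usually switches on a special meaning rather than switching one off.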
Case iNseNSItivITy, and more..
Perl is by default CASE SENSITIVE. For example:
/sensitive/
will return a match for "I am sensitive, dammit" but will NOT
match "I am Sensitive, dammit". It is however easy to render a match
case insensitive, thus:
/sensitive/i
- all you need do is put the modifier i after the
second slash of the regex, and - Voila - case insensitivity!
There are other modifiers:
- m - multiple lines (Discussed below)
- s - Treat the whole string as one line, so that even
/./ will match a "newline" character.
- x - a rather complex modifier that we will (for now) avoid like
the plague!
{note: look up 'locales' for more information about /i modifier;
also have a note on $* = }
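The modifiers are easiest to understand by switching them on and off. The following is a sketch only; the two-line string and all variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = "I am Sensitive, dammit";
my $plain  = ( $text =~ /sensitive/ )  ? 1 : 0;    # capital S, so no match
my $nocase = ( $text =~ /sensitive/i ) ? 1 : 0;    # i ignores case

my $two_lines = "first\nsecond";
my $dot_plain = ( $two_lines =~ /first.second/ )  ? 1 : 0;    # dot refuses the newline
my $dot_s     = ( $two_lines =~ /first.second/s ) ? 1 : 0;    # s lets dot match it

my $caret_plain = ( $two_lines =~ /^second/ )  ? 1 : 0;    # ^ means start of string
my $caret_m     = ( $two_lines =~ /^second/m ) ? 1 : 0;    # m lets ^ match after a newline

print "$plain $nocase $dot_plain $dot_s $caret_plain $caret_m\n";
```

Modifiers can be combined, for example /first.second/si.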
More matching tricks
There are several more tricks that you will encounter in Perl (nobody
ever accused Perl of lacking options, did they?). Here are a few:
- ? - matches zero or one of the preceding character
- {n} - matches n copies of the preceding character!
- {n,m} - matches at least n but not more than m copies of the
preceding character
- {n,} - matches at least n copies of the preceding character.
I would generally avoid most of the above, except where absolutely
necessary. Keep it simple.
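For the record, here is how the counting tricks behave on a few made-up strings. The patterns are anchored with ^ and $ so that each one has to account for the whole string; the sample list and array names are assumptions for the example.

```perl
use strict;
use warnings;

my @samples = qw(ac abc abbc abbbc abbbbc);

my @optional   = grep { /^ab?c$/ }     @samples;    # zero or one b
my @exactly_2  = grep { /^ab{2}c$/ }   @samples;    # exactly two b's
my @two_to_3   = grep { /^ab{2,3}c$/ } @samples;    # two or three b's
my @at_least_3 = grep { /^ab{3,}c$/ }  @samples;    # three or more b's

print "?     : @optional\n{2}   : @exactly_2\n{2,3} : @two_to_3\n{3,}  : @at_least_3\n";
```

Without the anchors the counts are much less strict than they look: /b{2}/ alone is satisfied by any string that contains two b's in a row, including "abbbbc".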
Greedy matching
It's unfortunate that the "?" character is used to match 'one or none' of
the preceding character, for "?" also has a second, quite distinct use. Consider the
regex
/a.+b/
and then apply it to the string "a xxx b fjdlfkjdl b". Clearly, there
is a match, but is the match with "a xxx b" or with "a xxx b fjdlfkjdl b"?
Your initial answer might be "Who cares?", but there is a good reason for
our obsessive questioning. We will soon discover how to pull out
a matched string, and then things will get really interesting. First,
let's resolve our dilemma. The answer is:
Perl by default uses 'greedy' matching.
What this means is that /a.+b/ matches the whole darn string, not
just "a xxx b". Perl stuffs as much as it can into the match, unless
we specifically tell it to be "stingy"! How do we make Perl parsimonious?
Easy, we turn off greedy matching using a ? after the * or +, thus:
/a.+?b/
which will then match "a xxx b" when we feed in the above string.
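You can watch greedy and stingy matching at work with the sketch below. It wraps each pattern in parentheses to pull out the matched text, a trick explained in the next section; $greedy and $stingy are names invented for the example.

```perl
use strict;
use warnings;

my $text = "a xxx b fjdlfkjdl b";

# Greedy: .+ grabs as much as it can, so the match runs to the LAST b
my ($greedy) = $text =~ /(a.+b)/;

# Stingy: .+? grabs as little as it can, so the match stops at the FIRST b
my ($stingy) = $text =~ /(a.+?b)/;

print "greedy: '$greedy'\n";
print "stingy: '$stingy'\n";
```

Both patterns succeed on this string; only the extent of the match differs, which is why the distinction matters as soon as you start extracting text.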
You can even say things like:
.. which we'll leave as an exercise for you to work out! But let's
now keep our promise, and tell you how to..
Extract text from the match
It's easy to extract information from part of a match. Consider the
regex:
/alpha(.+)gamma/
The above clearly will match a string such as "xxalphazzzgamma",
as well as "alpha beta gamma delta". But what do the (parentheses)
achieve? The answer is simple - whatever matches the part of the pattern inside the parentheses is put
into the Perl variable $1. (If you have a second set of parentheses,
the contents of this set go into $2, and so on). So after we feed
"xxalphazzzgamma" into our regex, $1 becomes "zzz". Likewise, for
our second example, $1 becomes " beta ".
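In a script, it is wise to look at $1 only after checking that the match succeeded, since a failed match leaves the old value in place. A minimal sketch, with variable names invented for the example:

```perl
use strict;
use warnings;

my ( $first, $second, $word, $middle );

if ( "xxalphazzzgamma" =~ /alpha(.+)gamma/ ) {
    $first = $1;    # the text between alpha and gamma
}

if ( "alpha beta gamma delta" =~ /alpha(.+)gamma/ ) {
    $second = $1;    # note the blanks on either side
}

# Two sets of parentheses fill $1 and $2, counted from left to right
if ( "alpha beta gamma delta" =~ /(alpha)(.+)gamma/ ) {
    ( $word, $middle ) = ( $1, $2 );
}

print "first='$first' second='$second' word='$word' middle='$middle'\n";
```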
It's even possible to reuse (!) the value that goes into $1
inside the very same regex! To do so, we use a very special
convention: instead of saying "$1" within the regex, we say:
\1
which translates as "the value of $1 we've just found, thank you
very much". Note that we wrote a backslash followed by a one (not an ell).
Let's try an example. Suppose we have the HTML code:
"<html><head><title>Arbitrary Stuff</title></head><body>etc"
.. and we wish to pull out the title (the stuff in between the
<title> and </title> tags). We might say..
/<title>(.+)<\/title>/
Okay, straightforward, isn't it? We find the opening title tag,
and then the closing one, and grab the stuff in between into $1. (Incidentally,
note how we escaped the "/" character, so that Perl didn't become
confused and think "Aha! This is the end of the regex"). But what
if we want to get a bit more fancy, and identify the start of
any HTML tag, and then its closure? Consider:
"<b><i>this is bold italic</i></b>, so there"
We can find the opening <b> tag, and then its closure, by saying something like:
/<(.+?)>.+<\/\1>/
This looks rather intimidating, until we realise that we have simply
used \/ as above to escape the "/", and that \1 is a reference
to the value that we've previously grabbed into $1. We now have a way
of matching a tag and its closure, without specifying a specific tag
such as <title> !
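Both HTML examples can be tried out as below. The tag-matching pattern shown is just one plausible way of writing it, not necessarily the only or best one, and the variable names are invented for the example.

```perl
use strict;
use warnings;

my $page = "<html><head><title>Arbitrary Stuff</title></head><body>etc";
my $html = "<b><i>this is bold italic</i></b>, so there";

my ( $title, $tag, $inside );

# Fixed tag: grab whatever sits between <title> and </title>
if ( $page =~ /<title>(.+)<\/title>/ ) {
    $title = $1;
}

# Any tag: $1 is the tag name, and \1 insists on the same name in the closing tag
if ( $html =~ /<(.+?)>(.+)<\/\1>/ ) {
    ( $tag, $inside ) = ( $1, $2 );
}

print "title='$title' tag='$tag' inside='$inside'\n";
```

This is fine for tidy snippets like these, but real-world HTML (attributes, nesting of the same tag, line breaks) quickly defeats simple patterns of this kind.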
Matching fancy characters
There are many special characters and conventions in Perl.
A backslash, followed by an alphabetical character, is commonly used
to match special characters such as newlines, or whole classes of characters. We will present two tables, one a lot
more useful than the other. But before we begin, let's note that:
[0-9]
is the same as saying
[0123456789]
and
[a-z]
is the same as
[abcdefghijklmnopqrstuvwxyz]
We can also say "Give me anything OTHER THAN the following.." using
the convention
[^01234]
which translates as "match any character that is NOT
one of [01234]".
Useful Perl characters
Character | Meaning
\n | newline (line feed)
\w | a word character [a-zA-Z0-9_]
\W | NOT a word character, that is [^a-zA-Z0-9_]
\s | white space (new line, carriage return, space, tab, form feed)
\S | NOT white space
\d | a digit [0-9]
\D | NOT a digit, i.e. [^0-9]
See how we Capitalise a special character to reverse its meaning.
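A few of these in action, as a sketch. The sample strings and variable names are invented, and the parentheses are used only to show what was matched.

```perl
use strict;
use warnings;

my ($number)   = "Room 42, floor 3" =~ /(\d+)/;    # first run of digits
my ($word)     = "  building_B!"    =~ /(\w+)/;    # first run of word characters
my ($not_digs) = "Room 42"          =~ /(\D+)/;    # first run of non-digits

my $has_space = ( "a\tb" =~ /\s/ ) ? 1 : 0;        # a tab counts as white space

# Negated class: is there any character that is NOT one of 0,1,2,3,4?
my $five  = ( "5"    =~ /[^01234]/ ) ? 1 : 0;
my $small = ( "0123" =~ /[^01234]/ ) ? 1 : 0;

print "number=$number word=$word not_digs='$not_digs'\n";
print "has_space=$has_space five=$five small=$small\n";
```

Note that the underscore counts as a word character, which is why "building_B" comes out in one piece.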
Now here's a really rather frightening list of other characters
and conventions:
Obscure Perl special characters
\t | tab (HT, TAB)
\r | return (CR)
\f | form feed (FF)
\a | alarm (bell) (BEL)
\e | escape (think troff) (ESC)
\033 | octal char (think of a PDP-11)
\x1B | hex char
\c[ | control char
\l | lowercase next char (think vi)
\u | uppercase next char (think vi)
\L | lowercase till \E (think vi)
\U | uppercase till \E (think vi)
\E | end case modification (think vi)
\Q | quote (disable) pattern metacharacters till \E
The above table was swiped from the Perl monks.
Don't get too intimidated by this second table. The main characters you will use will
be \Q and \E (see below), and possibly \e. {"vi" is an obsolete UNIX editor, and nobody even
remembers what a PDP-11 was}!
Yet more matching
Say you wanted to match something that is at the start or end of a word,
or a string. Perl even has fancy conventions for these:
- \b Match a word boundary
- \B Match a non-(word boundary)
- \A Match only at beginning of string
- \Z Match only at end of string, or before newline at the end
- \z Match only at end of string
- \G Match only where previous m//g left off (works only with /g)
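Word boundaries and the end-of-string conventions are easier to grasp with a concrete case. This sketch uses Perl's built-in @- array, whose first element $-[0] holds the offset at which the last successful match started; the sample strings and other names are invented for the example.

```perl
use strict;
use warnings;

my $text = "This is silly";
my ( $loose_at, $bounded_at );

# Plain /is/ finds the "is" buried inside "This"
$loose_at = $-[0] if $text =~ /is/;

# \b on both sides insists on "is" as a word of its own
$bounded_at = $-[0] if $text =~ /\bis\b/;

# \Z tolerates a single newline at the very end, \z does not
my $line    = "line one\n";
my $big_z   = ( $line =~ /one\Z/ ) ? 1 : 0;
my $small_z = ( $line =~ /one\z/ ) ? 1 : 0;

print "loose at $loose_at, bounded at $bounded_at, Z=$big_z z=$small_z\n";
```

Offsets count from zero, so 2 is the third character of "This" and 5 is the start of the separate word "is".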
Convenient Perl conventions
Because Perl by default uses the / character to start and end regex,
any string that contains multiple slashes soon starts to look like
a forest:
http://www.anaesthetist.com/icu/index.htm becomes:
/http:\/\/www\.anaesthetist\.com\/icu\/index\.htm/
.. far from attractive. Perl allows us to substitute a different character
for the conventional / that delimits regex. For example if we wanted to use
the # character, we could say:
m#http://www\.anaesthetist\.com/icu/index\.htm#
Think of m as standing for match.
Note that we still have the irritating \. escape of the period character.
We can even get rid of this:
m#\Qhttp://www.anaesthetist.com/icu/index.htm\E#
We used the \Q..\E convention from our list above to quote absolutely
everything from after the \Q until the \E is encountered. (By the way,
this quoting automatically gets turned off when the delimiter character
is encountered).
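The three ways of writing the URL pattern can be compared as below. This is a sketch; the "lookalike" string and the variable names are invented. It also shows what probably is the most common use of \Q..\E, namely quoting the contents of a variable.

```perl
use strict;
use warnings;

my $url = "http://www.anaesthetist.com/icu/index.htm";

my $slashes = ( $url =~ /http:\/\/www\.anaesthetist\.com\/icu\/index\.htm/ ) ? 1 : 0;
my $hashes  = ( $url =~ m#http://www\.anaesthetist\.com/icu/index\.htm# )    ? 1 : 0;
my $quoted  = ( $url =~ m#\Qhttp://www.anaesthetist.com/icu/index.htm\E# )   ? 1 : 0;

# Why bother escaping the dots? An unescaped dot matches ANY character:
my $lookalike = "http://wwwXanaesthetist.com/icu/index.htm";
my $sloppy    = ( $lookalike =~ m#http://www.anaesthetist.com/icu/index.htm# ) ? 1 : 0;

# \Q..\E around a variable quotes whatever the variable contains
my $strict = ( $lookalike =~ m#\Q$url\E# ) ? 1 : 0;

print "$slashes $hashes $quoted sloppy=$sloppy strict=$strict\n";
```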
Perl Substitution
The format for substitution is simple:
s/Anne/Jim/
Which means that we want to substitute "Jim" for "Anne" wherever
Anne occurs in the given string.
We lied again (!)
Okay, how do we specify the string to assault? Here it is:
$mystring =~ s/Anne/Jim/;
In other words, we simply use our standard regex convention (=~),
but place an s between the =~ and the regex itself. Think s
for substitute.
Tricks and traps
Note that if you use the above substitution command, only one
substitution is made! You can substitute globally throughout the
string using:
$mystring =~ s/Anne/Jim/g;
where g stands for global. Can you guess what
$mystring =~ s/Anne/Jim/gi;
does? Yes, as for regex above, i makes things case InSENsitIvE!
Note that you have to be careful, for Perl won't worry whether a string
is, for example, within a word. If you try to substitute "was" for "is"
in the string "This is silly" using
s/is/was/g
you won't get "This was silly", you'll get "Thwas was silly".
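Here is a sketch of the substitution variants, including one way round the "Thwas" trap using the \b word boundary described earlier. The sample strings and variable names are invented for the example.

```perl
use strict;
use warnings;

my $once = "Anne met Anne";
$once =~ s/Anne/Jim/;            # no g: only the first Anne is replaced

my $all = "Anne met anne";
$all =~ s/Anne/Jim/gi;           # g: every occurrence, i: whatever the case

my $careless = "This is silly";
$careless =~ s/is/was/g;         # also hits the "is" inside "This"

my $careful = "This is silly";
$careful =~ s/\bis\b/was/g;      # \b restricts the match to the whole word "is"

print "$once\n$all\n$careless\n$careful\n";
```

Note that s/// changes the string in place; if you want to keep the original, copy it into another variable first.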
Global matching
In Perl it's even possible to use the /g switch for pattern searching, without
performing a substitution! At first viewing, this statement doesn't
seem to make sense. For who cares if there is one match, or several?
In fact, we should care, for it's possible to actually pull out ALL
of the matches into a list! Thus:
$_ = "alpha xbetay xgammay xdeltay so on";
($first, $second, $third) = /x(.+?)y/g;
will put beta, gamma and delta into $first, $second and $third
respectively! The above needs some explanation:
- The reason this works is that by default
a regex acts on the default pattern-searching variable, known
to its many friends as:
$_
We first set $_ to the string we wish to test.
- Perl lists are written in parentheses, so
($first, $second, $third) is a list of variables to be filled up with goodies.
- Perl understands that when we assign to a list with =,
it mustn't simply throw away the results of its pattern searching, but rather
put each result (remember that we said /g )
into the corresponding element of the list.
You can even use a global test within a while statement thus:
while ( /x(.+?)y/g ) {
    # do something with $1 here
}
.. but watch out - if you leave out the /g then the statement
will loop forever!
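Both forms of global matching can be tried out as follows. This sketch binds the regex to a named variable with =~ rather than relying on $_; @all, @seen and $text are names invented for the example.

```perl
use strict;
use warnings;

my $text = "alpha xbetay xgammay xdeltay so on";

# List context: /g hands back every captured piece in one go
my @all = $text =~ /x(.+?)y/g;

# Scalar context: each pass of the loop picks up where the last match ended
my @seen;
while ( $text =~ /x(.+?)y/g ) {
    push @seen, $1;
}

print "all:  @all\n";
print "seen: @seen\n";
```

The while form is handy when you need to do some work on each match as it is found; the list form is shorter when you just want the pieces.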
Perl has another operator called split.
This is most useful in splitting up a string into component parts, using
a specified delimiter, something along the lines of:
@info = split /;/ , $fred ; #use semicolon delimiter to split $fred
Note that the array @info is filled with the resulting components.
The second 'argument' of split is the name of the string to split, in this
case, $fred. If we were to say
$count = split /;/, $fred ;
then we would get back the number of components returned, but
the actual values would be thrown away! (It's possible to supply a third
parameter to split - the maximum number of elements you want returned).
The opposite of split is the join operator:
$fred = join ';', $alpha,$beta,$gamma;
You can also use an array instead of a comma-delimited list as above.
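A round trip through split and join might look like the sketch below. $fred and @info are the names from the text; the sample data and the other names are invented for the example.

```perl
use strict;
use warnings;

my $fred = "alpha;beta;gamma";

my @info  = split /;/, $fred;        # ("alpha", "beta", "gamma")
my $count = scalar @info;            # how many pieces we got

# A third argument limits the number of pieces; the last piece keeps the rest
my @limited = split /;/, $fred, 2;   # ("alpha", "beta;gamma")

# join glues the pieces back together with the delimiter of your choice
my $rejoined = join ';', @info;

print "count=$count limited=[@limited] rejoined=$rejoined\n";
```

Remember that the first argument of split is a regex, so a delimiter such as . or | needs escaping, whereas the first argument of join is an ordinary string.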
Another use for (parentheses)
Consider the following regex:
/(Adam|Anne|Andrew)/
What does it do? In many computer languages, a vertical pipe ( | ) means
OR, and Perl is no exception. The regex matches the names "Adam", "Anne"
OR "Andrew" - any one will do. There is however a cost - because we used
parentheses, $1 is created, and filled with the value Adam, Anne or whatever.
This is wasteful, so in the interests of efficiency, we have
the following alternative convention (which doesn't create
$1):
/(?:Adam|Anne|Andrew)/
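Apart from efficiency, the non-capturing form also changes the numbering of the variables, which the following sketch makes visible. The name "Anne Smith" and the variable names are invented for the example.

```perl
use strict;
use warnings;

my $person = "Anne Smith";
my ( $first_name, $surname, $only );

# Capturing group: the alternatives take $1, so the surname lands in $2
if ( $person =~ /(Adam|Anne|Andrew) (\w+)/ ) {
    ( $first_name, $surname ) = ( $1, $2 );
}

# Non-capturing group: (?: ) only groups, so the surname is now $1
if ( $person =~ /(?:Adam|Anne|Andrew) (\w+)/ ) {
    $only = $1;
}

print "first_name=$first_name surname=$surname only=$only\n";
```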
There is yet another convention (Perl is stuffed with them, isn't it?)
that allows us to pull text out of a string
without using parentheses! If we use the "variables" $`, $& and $'
just after some regex, then they will respectively contain (1) the text
before the match, (2) the matched text, and (3) the text after the match.
Avoid them - there's a time penalty if you use them at all. The terrible
thing is that if you use any of these variables anywhere in your
program, Perl will provide them for all regex! (This penalty applies to
older versions of Perl; from Perl 5.20 onwards it has largely gone away.)
(An aside: $+ holds the text matched by the highest-numbered set of
parentheses that actually took part in the last match).
