Learning Regular Expressions for Beginners: The Basics
Written by Mark Sanborn: Oct 21, 2008
If you consider yourself a programmer and you have not learned regular expressions yet, now is the time! At some point in your development you will need to manipulate strings in more complicated ways than simple string functions allow. Regular expressions are supported in almost all programming languages and are considered the de facto standard for string pattern matching. They also show up in search tools and many other utilities.
The problem with regular expressions is that nobody wants to learn them because they look very intimidating. Don’t worry though, we will try to make it as painless as possible.
When I first looked at regular expressions I have to admit they looked absolutely foreign. You can’t really look at example snippets and learn a whole lot, as you can with some languages.
For example a regular expression might look like this:
If you are new to regular expressions this will look like absolute gibberish or some encrypted code. With regular expressions you have to start with absolutely nothing and build up. Example snippets simply won’t do.
A very simple regular expression
Here is an extremely easy example:
/foobar/
This simple regular expression matches the word ‘foobar’. Easy enough. The problem, though, is that it would also match ‘asdfoobaradsf’. The regular expression would NOT match ‘Foobar’, as regular expressions are case sensitive by default. This is where the fun begins.
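As a rough sketch of that behavior, here is the same pattern tried in Python. The choice of Python and its `re` module is an assumption for illustration, since the article's slash notation comes from other languages; in Python the pattern is written as a plain string without the surrounding slashes.

```python
import re

# Python writes the pattern as a string, without the /.../ delimiters.
pattern = re.compile(r"foobar")

for text in ["foobar", "asdfoobaradsf", "Foobar"]:
    found = pattern.search(text) is not None  # search() scans the whole string
    print(f"{text!r}: {found}")
# 'foobar': True
# 'asdfoobaradsf': True
# 'Foobar': False

# A pattern modifier (flag) can switch off case sensitivity.
relaxed = re.compile(r"foobar", re.IGNORECASE)
print(relaxed.search("Foobar") is not None)  # True
```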
If you wanted to match only the “word” foobar you would use:
/\bfoobar\b/
Now we are really getting into regular expressions. As you can see here I have ‘\b’ at the beginning of the word and the end. As with programming languages there are certain characters that have special meaning, and by adding a ‘\’ before the ‘b’ we are giving the ‘b’ a special meaning. These special characters are how you are going to tell the computer what to match and how.
Don’t worry about the ‘\b’ too much right now; you don’t even need to memorize it. Just try to stay with me here.
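If you would like to see the word boundary at work anyway, the following is a small sketch in Python (an assumed language for illustration; the variable name `word` is made up). The pattern is written as a raw string so Python passes the backslashes through to the regular expression untouched.

```python
import re

# \b matches the position at the edge of a word, not a character.
word = re.compile(r"\bfoobar\b")

print(word.search("foobar") is not None)             # True
print(word.search("say foobar, twice") is not None)  # True: space before, comma after
print(word.search("asdfoobaradsf") is not None)      # False: letters on both sides
```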
The pieces of a regular expression
Those ‘\b’ things in the example are called “anchors”. Anchors do not match characters; they tell the program where in the string a match is allowed to begin or end, and ‘\b’ in particular marks the edge of a word. What is important now is knowing the different pieces that make up a regular expression. These include:
- Actual text you are matching
- Character classes
- Pattern modifiers
- Special characters
Working Through a Problem
For our first article in the series we are going to keep it simple and walk through coming up with a regular expression for a common problem.
Let’s say we are fetching the header of a website and receive the following:
HTTP/1.1 200 OK
Date: Tue, 21 Oct 2008 05:32:32 GMT
We would like to know what number follows the ‘HTTP/1.1’. In this case the number is 200. If the number were 404 or some other error code, we would know that something is wrong with the site.
To start building our regular expression we can simply start typing in what we see, between forward slashes (/):
/HTTP/
OK, with regular expressions you have to think sequentially and literally. We are first matching an H, then a T, then another T, and then a P. Avoid the trap of treating it like a word; think literal characters. As of right now, like our earlier example, this would also match ‘asdflkjHTTP’.
We have also run into another problem. The next character in our string is a forward slash. Since we are using forward slashes to mark the start and end of the regular expression, a slash inside the pattern must be escaped with a backslash.
/HTTP\/1.1/
OK, how about that? We have our slash escaped and the next portion of our expression. As you might have guessed, the dot (.) is a special character in regular expressions (it matches any single character) and must also be escaped if we want it to be literal.
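To see why the unescaped dot is a problem, here is a short Python sketch (Python is an assumption here, and the names `unescaped` and `escaped` are only illustrative). Python patterns have no slash delimiters, so escaping the slash is not required there, but it is harmless and kept to mirror the article's notation.

```python
import re

unescaped = re.compile(r"HTTP\/1.1")   # the dot matches ANY single character
escaped = re.compile(r"HTTP\/1\.1")    # the dot matches only a literal dot

print(unescaped.search("HTTP/1x1") is not None)       # True: 'x' satisfies the dot
print(escaped.search("HTTP/1x1") is not None)         # False
print(escaped.search("HTTP/1.1 200 OK") is not None)  # True
```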
/HTTP\/1\.1/
There we go. Now I just remembered that HTTP responses can come back as version 1.0 or version 1.1. What if we want to allow ANY digit? To do this we need to use a character class: a class of common characters.
/HTTP\/1\.\d/
There, now this will match HTTP/1.0 or HTTP/1.1 or HTTP/1.2, etc.
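A quick way to check the digit class is to run it against a few status lines. This is a sketch in Python, chosen here only for illustration; the sample strings and the name `version` are made up.

```python
import re

# \d stands in for any single digit after "HTTP/1."
version = re.compile(r"HTTP\/1\.\d")

for header in ["HTTP/1.0 200 OK", "HTTP/1.1 200 OK", "HTTP/1.x 200 OK"]:
    print(header, "->", version.search(header) is not None)
# HTTP/1.0 200 OK -> True
# HTTP/1.1 200 OK -> True
# HTTP/1.x 200 OK -> False
```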
The next character in our string that we want to match looks like a space. There is a character class for that: ‘\s’ matches any whitespace character.
/HTTP\/1\.\d\s/
OK, let’s check our work by saying what we are matching literally. We are looking for an H followed by a T then another T then a P followed by a forward slash (/) followed by a 1 followed by a dot (.) followed by any digit followed by a whitespace character. Phew, almost done.
/HTTP\/1\.\d\s\d+/
Now we have added another digit character class followed by a quantifier. ‘\d’ says we will match any digit. The ‘+’ says we will match whatever comes before the plus 1 or more times. You can think of it as matching digit, digit, digit… and so on until there are no more digits in sequence.
It is important to note that ‘+’ means 1 or more, not 0 or more. If there are no digits the entire match will fail. To match 0 or more you would use the star (*) quantifier.
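The difference between the two quantifiers can be sketched in Python (an assumed language for illustration; the variable names and the status line without a code are made up for the comparison):

```python
import re

one_or_more = re.compile(r"HTTP\/1\.\d\s\d+")   # needs at least one digit
zero_or_more = re.compile(r"HTTP\/1\.\d\s\d*")  # digits are optional

with_code = "HTTP/1.1 200 OK"
without_code = "HTTP/1.1 OK"

print(one_or_more.search(with_code).group())            # HTTP/1.1 200
print(one_or_more.search(without_code))                 # None: no digit, so the whole match fails
print(repr(zero_or_more.search(without_code).group()))  # 'HTTP/1.1 ' (zero digits is acceptable)
```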
/HTTP\/1\.\d\s(\d+)/
Adding the parentheses says that we want to group, or capture, that part of the match. This lets us get back only the number and not the entire match.
In this case the regular expression would return 200.
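Putting the pieces together, a minimal sketch in Python might look like the following. Python and the helper name `extract_status` are assumptions for illustration, not part of the original article; the header text is the one shown earlier.

```python
import re
from typing import Optional

# The parentheses capture only the digits that follow the version.
STATUS_LINE = re.compile(r"HTTP\/1\.\d\s(\d+)")

def extract_status(header: str) -> Optional[str]:
    """Return the status code as a string, or None if no status line is found."""
    match = STATUS_LINE.search(header)
    if match is None:
        return None
    return match.group(1)  # group(0) would be the entire match, "HTTP/1.1 200"

header = "HTTP/1.1 200 OK\nDate: Tue, 21 Oct 2008 05:32:32 GMT"
print(extract_status(header))  # 200
```

Returning `None` when nothing matches keeps the caller from mistaking a malformed header for a real status code.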
The character class \d (digit) is really just a shortcut for a range of characters.
\d is the same as [0-9], which is itself a shortcut for [0123456789].
A much more common range, however, would be one matching alphabetic characters only: [a-zA-Z]
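Here is a small sketch of both ranges in Python (assumed for illustration, with made-up variable names). One caveat worth knowing: in Python 3 text patterns, \d also accepts digits from other scripts unless the `re.ASCII` flag is used, so the explicit [0-9] range is the stricter of the two there.

```python
import re

digits = re.compile(r"[0-9]+")      # one or more characters from 0 through 9
letters = re.compile(r"[a-zA-Z]+")  # one or more ASCII letters, either case

text = "Oct 21, 2008"
print(digits.findall(text))   # ['21', '2008']
print(letters.findall(text))  # ['Oct']
```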
Where to go from here?
So I told you that examples are bad and then I used one. Well, what I ended up doing was giving you an example of each of the common elements of a regular expression:
- Literal text (HTTP/1.1)
- Escaped special characters (\/ and \.)
- Character classes (\d or \s)
- Quantifiers (+ and *)
- Grouping with ()
The example itself is worthless. You only need to understand the technique of thinking about regular expressions and the components that make them up. If you can think about regular expressions one character at a time and understand these basic components you will be a master of regular expressions in no time. The syntax (gibberish) will be crystal clear moments after grasping these concepts.