Using Regular Expressions to Match XML

Written by Mark Sanborn: Aug 25, 2008

After reading the chapter in my Perl book about regular expressions I decided to go ahead and solve one of the problems I usually have when getting cURL data.

I often have cURL return the HTTP headers along with the XML data. Although I can disable this and have cURL return only the XML, I want the header data so I can tell whether the website is reachable. This is especially useful for web apps that go down often, like Twitter.

Until recently I have been using PHP’s function ‘strstr’ to cut out only the XML portion of the cURL response.

$data = strstr($result, '<?');
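As a rough sketch of what that line does, here it is applied to a made-up response. The response text and the `$reachable` check are illustrative assumptions, not output from a real service.

```php
<?php
// Hypothetical cURL response with the headers included (sample data only).
$result = "HTTP/1.1 200 OK\r\n"
        . "Content-Type: application/xml\r\n"
        . "\r\n"
        . '<?xml version="1.0"?>' . "\n"
        . "<status><text>hello</text></status>";

// strstr() returns the string from the first occurrence of the needle onward,
// or false when the needle is not found.
$data = strstr($result, '<?');

// The status line is still available in the untouched response,
// so it can be used to judge whether the site answered normally.
$reachable = strpos($result, 'HTTP/1.1 200') === 0;

echo $data, "\n";
var_dump($reachable);
```

Because `strstr` returns false when there is no ‘<?’ in the response, it is worth checking the result before passing it to an XML parser.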

Although this does the job, regular expressions are far more portable. Perl, for example, has no direct equivalent of ‘strstr’ for cutting the string off where the ‘<?’ starts. Instead it leans on a far more powerful system: regular expressions, which it uses for nearly everything. Once you learn how to use them you will use them for all types of string manipulation.

Since regular expressions are available in a multitude of programming languages and in many programs as well, I thought it was worth learning one system that can be used in virtually every situation that requires string manipulation.

The code

So here is the snippet that I use for matching XML in a cURL response now.

// $xml holds the raw cURL response (headers plus XML)
preg_match("/(\<\?xml[\d\D]*\?\>)/", $xml, $result);
$data = $result[0];

I will break down each part of the code one by one to make it easier to understand.

The breakdown

The first part of the code is a call to the ‘preg_match’ function. This is PHP’s built-in function for Perl Compatible Regular Expression matching.

The next part of the code is the actual regular expression, ‘/(\<\?xml[\d\D]*\?\>)/’.

The regular expression starts and ends with a slash, ‘/’. These two delimiters mark the pattern that we want to match. Anything after the closing slash is a regular expression option modifier. Modifiers can make the match case-insensitive, ignore whitespace in the pattern, and other things. In this example we are not using any of these.
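To make the modifiers concrete, here is a small sketch using two of them, ‘i’ (case-insensitive) and ‘x’ (ignore whitespace in the pattern). The sample string and variable names are made up for illustration.

```php
<?php
// Sample input with an uppercase "XML" (illustrative only).
$text = "<?XML version='1.0'?>";

// No modifier: the lowercase literal does not match the uppercase input.
$plain = preg_match('/<\?xml/', $text);          // 0

// i: case-insensitive matching.
$caseless = preg_match('/<\?xml/i', $text);      // 1

// x: whitespace inside the pattern is ignored, so it can be spaced out.
$extended = preg_match('/ <\? xml /ix', $text);  // 1
```

Modifiers are simply appended after the closing delimiter, and several can be combined, as in the last line.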

The parentheses, ‘(’ and ‘)’, just like in math problems, let us group parts of a pattern and capture what each group matched. In Perl the first group is available as the variable ‘$1’ as a shortcut. Since I like shortcuts, I add the parentheses even around a single-match pattern.

Next we have ‘\<\?xml’. Just as in PHP, regular expressions have reserved/special characters, and we run into one right off the bat: the question mark, ‘?’, which normally makes the preceding item optional. The backslash in front of it tells the regular expression engine to treat ‘?’ as a literal character instead of using its special meaning. The ‘<’ has no special meaning on its own, so the backslash in front of it is harmless but not required.
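A short sketch of why the backslash before ‘?’ matters. The two sample strings are invented, and `preg_quote` is shown only as one way to have PHP add the escapes for you.

```php
<?php
$withDecl = '<?xml version="1.0"?>';   // sample string containing the marker
$noDecl   = 'plain xml text';          // sample string without it

// Escaped: the pattern needs a literal "<" and "?" before "xml".
$escaped = preg_match('/<\?xml/', $noDecl);     // 0

// Unescaped: "?" makes the "<" optional, so a bare "xml" also matches.
$unescaped = preg_match('/<?xml/', $noDecl);    // 1

// preg_quote() escapes special characters in a literal string.
$quoted  = preg_quote('<?xml', '/');
$viaQuote = preg_match('/' . $quoted . '/', $withDecl);
```

The unescaped version still "works" on input that contains the marker, which is what makes this kind of mistake easy to miss.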

Then we have ‘[\d\D]’. The brackets define a character class: a set of characters, any one of which may match at that position. You can have various things in here, like the ranges ‘[0-9]’ or ‘[a-z]’. The ‘\d’ is a shortcut for saying we want to match digit characters. The ‘\D’ is a shortcut for NON digit characters. Together they cover every possible character, newlines included. In other words we did all that because we wanted to say “Match anything”.
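The reason for ‘[\d\D]’ rather than a plain dot can be sketched like this: by default the dot does not match a newline, and XML usually spans several lines. The sample text is made up, and the ‘s’ modifier is shown as an alternative, not as something the snippet above uses.

```php
<?php
// Two tags separated by a newline (illustrative input).
$text = "<a>\n</a>";

// "." does not match a newline by default, so this fails across lines.
$dot = preg_match('/<a>.*<\/a>/', $text);          // 0

// [\d\D] matches any character, newline included.
$class = preg_match('/<a>[\d\D]*<\/a>/', $text);   // 1

// The s modifier makes "." match newlines as well.
$dotall = preg_match('/<a>.*<\/a>/s', $text);      // 1
```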

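The next part is the asterisk, ‘*’. This is what we call in regular expressions a quantifier (sometimes called a multiplier). A quantifier tells the engine how many times to match the item in front of it. The one we are using here means: match zero or more times.

Finally we have ‘\?\>’, which, as you probably guessed, means that the match has to end with the string ‘?>’. Because ‘*’ is greedy, the match runs to the last ‘?>’ in the input, not the first. Keep in mind that if the only ‘?>’ in the response is the one closing the XML declaration, the match stops right there; to capture the whole document you would drop the trailing ‘\?\>’ and let ‘[\d\D]*’ run to the end of the string.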
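Where the match ends is easy to check with a small experiment. This is only a sketch on a made-up response; the second pattern is a variant of the snippet, not the pattern used above.

```php
<?php
// Hypothetical response: headers, an XML declaration, then one root element.
$xml = "HTTP/1.1 200 OK\r\n\r\n"
     . '<?xml version="1.0"?>' . "<root><item>1</item></root>";

// Pattern from the snippet: the match must end with the closing marker of
// the declaration, so on this input it captures the declaration only.
preg_match('/(<\?xml[\d\D]*\?>)/', $xml, $withMarker);

// Variant without the trailing marker: [\d\D]* is greedy and runs to the
// end of the input, so the whole document is captured.
preg_match('/(<\?xml[\d\D]*)/', $xml, $toEnd);

echo $withMarker[0], "\n";
echo $toEnd[0], "\n";
```

Which of the two you want depends on whether anything follows the XML in the response. With cURL output the XML is normally the last thing in the string, so the variant is usually the safer choice.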

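$result[0] holds the text matched by the whole pattern, and $result[1] holds the text captured by the first set of parentheses. Here the two are identical, because the parentheses wrap the entire pattern. ‘preg_match’ stops after the first match, so if there were multiple XML snippets in a text file you would use ‘preg_match_all’ to collect all of them.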
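Here is a minimal sketch of how the result arrays are laid out, using a made-up string with two snippets. Note the assumption: the pattern uses the lazy quantifier ‘*?’ (not part of the original snippet) so that each match stops at the nearest closing marker.

```php
<?php
// Sample text containing two XML declarations (illustrative only).
$text = '<?xml version="1.0"?><a/> and <?xml version="1.1"?><b/>';

// preg_match() finds the first match only.
// $result[0] is the full match, $result[1] the first parenthesized group.
preg_match('/(<\?xml[\d\D]*?\?>)/', $text, $result);

// preg_match_all() collects every match.
// $all[0] lists the full matches in order, $all[1] the group-1 captures.
preg_match_all('/(<\?xml[\d\D]*?\?>)/', $text, $all);

print_r($result);
print_r($all[0]);
```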
