Attempting to understand handling regular expressions with PHP
Regular expressions, commonly known as RegEx, are often considered one of the most complex concepts in programming. However, this is not really true. If you have never worked with regular expressions, a pattern containing a sequence of special characters like /, $, ^, \, ?, *, etc., combined with alphanumeric characters, can look like a mess. But RegEx is a kind of language: once you have learnt its symbols and understood their meaning, you will find it one of the most useful tools at hand for solving complex text-search problems.
Just consider how you would make a search for files on your computer. You most likely use the ? and * characters to help find the files you're looking for. The ? character matches a single character in a file name, while the * matches zero or more characters. A pattern such as 'file?.txt' would find the following files:
file1.txt
filer.txt
files.txt
Using the * character instead of the ? character expands the number of files found. 'file*.txt' matches all of the following:
file1.txt
file2.txt
file12.txt
filer.txt
filedce.txt
While this method of searching for files can certainly be useful, it is also very limited. The limited ability of the ? and * wildcard characters gives you an idea of what regular expressions can do, but regular expressions are much more powerful and flexible.
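To connect the wildcard idea to regular expressions, here is a minimal sketch that expresses the two file-name patterns as Perl-compatible patterns. The file list is an assumption made up from the names used above, and preg_grep() is simply one convenient way to filter an array by a pattern.

```php
<?php
// Sample file names (assumed for illustration, taken from the lists above).
$files = ['file1.txt', 'file2.txt', 'file12.txt', 'filer.txt', 'files.txt', 'filedce.txt'];

// 'file?.txt' as a regex: "." matches exactly one character, "\." a literal dot.
$single = preg_grep('/^file.\.txt$/', $files);

// 'file*.txt' as a regex: ".*" matches zero or more characters.
$any = preg_grep('/^file.*\.txt$/', $files);

echo implode(', ', $single), "\n";
echo implode(', ', $any), "\n";
```

The ^ and $ anchors make the pattern cover the whole file name, which is how the wildcard search behaves; without them the pattern could match anywhere inside a longer name.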
Let Us Start on RegEx
A regular expression is a pattern of text that consists of ordinary characters (for example, letters a through z) and special characters, known as metacharacters. The pattern describes one or more strings to match when searching a body of text. The regular expression serves as a template for matching a character pattern to the string being searched.
The following table contains the list of some metacharacters and their behavior in the context of regular expressions:
|
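To make a few metacharacters concrete, here is a minimal sketch that tries several small patterns against one string. The selection of patterns and the sample string 'ct' are assumptions chosen for illustration; the descriptions reflect standard Perl-compatible behavior.

```php
<?php
// Each pattern is paired with a short note on the metacharacter it uses.
$patterns = [
    '/^cat/'    => '^ anchors the match at the start of the string',
    '/cat$/'    => '$ anchors the match at the end of the string',
    '/c.t/'     => '. matches any single character (except a newline)',
    '/ca*t/'    => '* matches the preceding item zero or more times',
    '/ca+t/'    => '+ matches the preceding item one or more times',
    '/ca?t/'    => '? matches the preceding item zero or one time',
    '/c[aou]t/' => '[...] matches any one character from the set',
];

$subject = 'ct';
$results = [];
foreach ($patterns as $pattern => $meaning) {
    // preg_match() returns 1 when the pattern is found, 0 when it is not.
    $results[$pattern] = preg_match($pattern, $subject) === 1;
    $verdict = $results[$pattern] ? 'matches' : 'does not match';
    echo "$pattern $verdict '$subject' ($meaning)\n";
}
```

With the subject 'ct', only the patterns that allow the "a" to be absent (* and ?) succeed, which shows how the three quantifiers differ.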
RegEx functions in PHP
PHP has functions to work on complex string manipulation using RegEx. The following are the RegEx functions provided in PHP.
|
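As a rough sketch of how some of the Perl-compatible functions are used, the snippet below applies preg_match(), preg_match_all(), preg_replace(), and preg_split() to one string. The sample text and variable names are assumptions for illustration only.

```php
<?php
// Sample text (assumed for illustration).
$text = 'Call 555-1234 or 555-9876 today';

// preg_match(): returns 1 if the pattern matches, 0 if not; $first gets the first match.
preg_match('/\d{3}-\d{4}/', $text, $first);

// preg_match_all(): returns the number of matches; $all[0] lists every full match.
$count = preg_match_all('/\d{3}-\d{4}/', $text, $all);

// preg_replace(): replaces every match with the replacement string.
$masked = preg_replace('/\d{4}/', 'XXXX', $text);

// preg_split(): splits the string wherever the pattern matches.
$words = preg_split('/\s+/', $text);

echo $first[0], "\n";
echo $count, "\n";
echo $masked, "\n";
echo count($words), "\n";
```

The patterns are written in single quotes so that PHP passes the backslashes through to the regex engine unchanged.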
Finding US Zip Code
Now let us see a simple example that matches a five-digit US zip code in a string, using the POSIX-style ereg() function. Note that ereg() was deprecated in PHP 5.3.0 and removed in PHP 7.0.0, so this script runs only on older PHP versions.
<?php
// POSIX extended syntax: no delimiters around the pattern
$zip_pattern = "[0-9]{5}";
$str = "Mission Viejo, CA 92692";
ereg($zip_pattern,$str,$regs);
echo $regs[0];
?>
This script would output the following:
92692
The above example can also be rewritten using Perl-compatible regular expression syntax with the preg_match() function.
<?php
// PCRE syntax: the pattern is wrapped in delimiters (here, forward slashes)
$zip_pattern = "/\d{5}/";
$str = "Mission Viejo, CA 92692";
preg_match($zip_pattern,$str,$regs);
echo $regs[0];
?>
Note how the pattern differs between the two examples: the Perl-compatible version is enclosed in / delimiters and uses the shorthand \d in place of [0-9]. preg_match() is generally considered a faster alternative to ereg().
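The scripts above read $regs[0] without checking whether anything matched. A slightly more defensive variant might look like the sketch below; find_zip() is a hypothetical helper name, and the \b word boundaries are an added assumption so that five digits inside a longer number are not reported as a zip code.

```php
<?php
// Hypothetical helper (not part of the original scripts):
// returns the first standalone 5-digit run in $text, or null if there is none.
function find_zip(string $text): ?string
{
    // \b keeps the match from starting or ending inside a longer run of digits.
    if (preg_match('/\b\d{5}\b/', $text, $matches) === 1) {
        return $matches[0];
    }
    return null;
}

echo find_zip("Mission Viejo, CA 92692"), "\n";
```

Checking the return value of preg_match() before using the matches array avoids an undefined-index notice when the string contains no zip code.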
RegEx for US Phone Numbers
Now let us try to create a RegEx pattern to match a US telephone number. US telephone numbers are 10-digit numbers usually written in three parts, like XXX XXX XXXX. The parts are normally separated or grouped with hyphens, parentheses, and blank spaces. The most common patterns are:
XXX XXX XXXX
(XXX) XXX XXXX
XXX-XXX-XXXX
(XXX) XXX-XXXX
In some cases, the US ISD (country) code is added at the front, like +1 XXX XXX XXXX.
Let us create a Perl-compatible RegEx pattern to match the above formats. First we need to match the single-digit ISD code (let us not restrict it to 1). Since it may or may not be present in a phone number, we write it as follows:
$Phone_Pattern = "/(\d)?/";
Here \d is equivalent to [0-9], and the '?' that follows indicates that the digit may appear once or not at all.
Now what would appear next in the sequence? The possibilities are a blank space or a hyphen. So we append the pattern "(\s|-)?" to the above RegEx. This pattern indicates that either a blank space or a hyphen may or may not appear. So our RegEx becomes:
$Phone_Pattern = "/(\d)?(\s|-)?/";
The next sequence would be either XXX or (XXX). To match it, we first match the optional opening parenthesis with the pattern "(\()?". Because parentheses are used to group sub-patterns in RegEx, they are metacharacters; to match one literally, we must precede it with the escape character "\". Hence we use "\(" in our RegEx pattern. Next we need to match the three digits and an optional closing parenthesis, which can be written as "(\d){3}(\))?". With these parts added, our RegEx becomes:
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?/";
After the first part XXX, there should be either a blank space or a hyphen, exactly once. So we add "(\s|-){1}" to the phone pattern:
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}/";
The rest of the RegEx is much simpler, as we only need to match either XXX-XXXX or XXX XXXX. This can be written as "(\d){3}(\s|-){1}(\d){4}". Adding this part to our RegEx gives:
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/";
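Before using the finished pattern on real data, it can help to try it against each of the formats listed earlier. The sketch below does that; contains_phone() is a hypothetical helper name and the sample numbers are made up for illustration.

```php
<?php
// The finished pattern, in single quotes so PHP leaves the backslashes alone.
$Phone_Pattern = '/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/';

// Hypothetical helper: true when the pattern is found anywhere in $text.
function contains_phone(string $pattern, string $text): bool
{
    return preg_match($pattern, $text) === 1;
}

// Made-up sample numbers covering the formats discussed above.
$samples = [
    '555 123 4567',
    '(555) 123 4567',
    '555-123-4567',
    '(555) 123-4567',
    '1 555 123 4567',
    '5551234567',   // no separators, so the pattern does not match
];

foreach ($samples as $sample) {
    $verdict = contains_phone($Phone_Pattern, $sample) ? 'match' : 'no match';
    echo $sample, ' => ', $verdict, "\n";
}
```

Because the opening and closing parentheses are optional independently of each other and the pattern is not anchored, it also accepts loosely formatted input such as "(555 123 4567". That is acceptable for pulling numbers out of a page, but stricter validation would need a tighter pattern.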
Yippee! We have created a RegEx to match US phone numbers. Now let us put it to work so that its usefulness becomes clearer: we will write a script that fetches the phone numbers from Google's "Contact Us" page. First we need to fetch the HTML content of that page.
$str = implode("",file("http://www.google.com/intl/en/contact/index.html"));
Then we search that content for phone numbers with our newly created RegEx. preg_match() returns only the first match, so to get every match we use preg_match_all().
preg_match_all($Phone_Pattern,$str,$phone);
Now putting all these pieces into a single script:
<?php
// Fetch the page as an array of lines and join them into one string
$str = implode("",file("http://www.google.com/intl/en/contact/index.html"));
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/";
// Collect every match; $phone[0] holds the full matched strings
preg_match_all($Phone_Pattern,$str,$phone);
for($i=0;$i<count($phone[0]);$i++)
{
echo $phone[0][$i]."<br>";
}
?>
This script will display the following output:
(650) 253-0000
(650) 253-0001
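To see what preg_match_all() actually returns without depending on a live web page, here is a minimal sketch that runs the same pattern over an inline string. The HTML fragment and its phone numbers are made up and merely stand in for the downloaded content.

```php
<?php
// Hypothetical page content standing in for the downloaded HTML.
$str = '<p>Phone: (555) 010-0000</p><p>Fax: (555) 010-0001</p>';
$Phone_Pattern = '/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/';

// The return value is the number of full matches (or false on error).
$count = preg_match_all($Phone_Pattern, $str, $phone);

// $phone[0] holds the full matches; $phone[1], $phone[2], ... hold what each
// parenthesized group captured in the corresponding match.
$numbers = [];
foreach ($phone[0] as $match) {
    // The optional leading separator can capture the space before the number,
    // so trim() removes it.
    $numbers[] = trim($match);
}

echo $count, "\n";
echo implode("\n", $numbers), "\n";
```

The trim() call matters because the pattern begins with optional parts, including "(\s|-)?", so a match can start on the blank space that precedes a number. In a browser that extra space is invisible, but it is part of the matched string.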
Wrap Up
Hope you had a good session with RegEx and now have some understanding of how to tackle text-pattern problems using it. To become a specialist in RegEx, you need to practice continuously, seek out complex problems, and try to solve them. Happy practicing with RegEx!