Inside Technique : Form Validation Made Easy : Regular Expressions

Regular expressions (or "regexes" as I prefer) offer a comprehensive way to define patterns of characters. The power of the regular expression is the ability to describe the order, number, types, and even absence of a series of characters or character groups within a string. JScript in Internet Explorer 4 is roughly equivalent to JavaScript 1.2, which implements Perl 4 regular expression syntax. Since regular expressions are such a complex topic, I will touch on the syntax just enough to get you excited to run out and get a book.

Regular expressions use special characters called "metacharacters" to describe textual patterns. The following table lists the metacharacters and their corresponding meanings [the table comes straight from the Microsoft Scripting Technologies site]. I recommend that you print or otherwise store this information somewhere that is quickly accessible if you aren't already familiar with the syntax.
Character Description
\ Marks the next character as special. /n/ matches the character "n". The sequence /\n/ matches a linefeed or newline character.
^ Matches the beginning of input or line.
$ Matches the end of input or line.
* Matches the preceding character zero or more times. /zo*/ matches either "z" or "zoo."
+ Matches the preceding character one or more times. /zo+/ matches "zoo" but not "z."
? Matches the preceding character zero or one time. /a?ve?/ matches the "ve" in "never."
. Matches any single character except a newline character.
(pattern) Matches pattern and remembers the match. The matched substring can be retrieved from the result Array object elements [1]...[n] or the RegExp object's $1...$9 properties. To match parentheses characters ( ), use "\(" or "\)".
x|y Matches either x or y. /z|food?/ matches the "z" in "zoo" or all of "food."
{n} n is a nonnegative integer. Matches exactly n times. /o{2}/ does not match the "o" in "Bob," but matches the first two o's in "foooood."
{n,} n is a nonnegative integer. Matches at least n times. /o{2,}/ does not match the "o" in "Bob" and matches all the o's in "foooood." /o{1,}/ is equivalent to /o+/.
{n,m} m and n are nonnegative integers. Matches at least n and at most m times. /o{1,3}/ matches the first three o's in "fooooood."
[xyz] A character set. Matches any one of the enclosed characters. /[abc]/ matches the "a" in "plain."
[^xyz] A negative character set. Matches any character not enclosed. /[^abc]/ matches the "p" in "plain."
\b Matches a word boundary, that is, the position between a word character and a nonword character such as a space. /ea*r\b/ matches the "er" in "never early."
\B Matches a nonword boundary. /ea*r\B/ matches the "ear" in "never early."
\d Matches a digit character. Equivalent to [0-9].
\D Matches a nondigit character. Equivalent to [^0-9].
\f Matches a form-feed character.
\n Matches a linefeed character.
\r Matches a carriage return character.
\s Matches any white space including space, tab, form-feed, and so on. Equivalent to [ \f\n\r\t\v]
\S Matches any nonwhite space character. Equivalent to [^ \f\n\r\t\v]
\t Matches a tab character.
\v Matches a vertical tab character.
\w Matches any word character including underscore. Equivalent to [A-Za-z0-9_].
\W Matches any nonword character. Equivalent to [^A-Za-z0-9_].
\num Matches num, where num is a positive integer. A reference back to remembered matches. \1 matches what is stored in RegExp.$1.
/n/ Matches n, where n is an octal, hexadecimal, or decimal escape value. Allows embedding of ASCII codes into regular expressions.

NOTE: The backslash ("\") has a special meaning when it precedes a metacharacter. It acts as an escape character, telling the regex parser that the following character should be interpreted literally (e.g. "\*" matches a literal asterisk character).
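To make a few rows of the table concrete, here is a minimal sketch that exercises grouping, back-references, word boundaries, and an escaped metacharacter. The sample strings and variable names are my own illustrations and do not come from the Microsoft table.

```javascript
// Capturing groups: exec() returns the whole match at [0]
// and each remembered group at [1]...[n].
var aMatch = /(\d+)-(\d+)/.exec("pages 12-34");
var sFirst = aMatch[1];     // "12"
var sSecond = aMatch[2];    // "34"

// Back-reference: \1 must repeat whatever group 1 matched.
var bDoubled = /(\w)\1/.test("food");      // true (the "oo")
var bNoDouble = /(\w)\1/.test("fold");     // false

// Word boundary: \b sits between a word character and a nonword character.
var bWhole = /\bear\b/.test("my ear hurts");    // true
var bInside = /\bear\b/.test("never early");    // false ("ear" runs on into "early")

// Escaped metacharacter: \* matches a literal asterisk.
var bStar = /\*/.test("2 * 3");    // true
```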

Now that you know all the metacharacters, let's look at a couple of examples to test your understanding. I'll start with some common uses and move on to more complex regexes.

	/^\d{5}$/

Maybe we should step through this one bit by bit. The caret (^) indicates the beginning of the input; that just means that the following character must be the first character in the string. A \d indicates a digit (any single digit 0 through 9). The braces following it require that exactly five of the preceding character [the digit] must occur together. It does not require the same number to occur five times since we only used the digit class and not a specific digit. The dollar sign ($) means the end of input. All together, this regex simply matches any string that is exactly five digits - no more, no fewer.

If we were to remove the beginning- and end-of-input markers (^ and $, respectively), the string would only have to contain five digits in a row. When attacking form validation with regular expressions, you will almost always need to use these metacharacters to make certain your user has entered correct data.
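The difference the anchors make is easy to see side by side. This is a small sketch with sample inputs of my own choosing; the variable names are illustrative only.

```javascript
var reAnchored = /^\d{5}$/;   // the whole input must be five digits
var reLoose = /\d{5}/;        // five digits anywhere in the input

var bExact = reAnchored.test("90210");              // true
var bTooShort = reAnchored.test("9021");            // false
var bTooLong = reAnchored.test("902101");           // false
var bPaddedStrict = reAnchored.test("abc90210xyz"); // false
var bPaddedLoose = reLoose.test("abc90210xyz");     // true
```

For form validation the last line is the one to worry about: the unanchored pattern happily accepts input with junk on either side of the digits.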

Let's take a look at a more advanced version of the first example:

	/^\d{5}(\-?\d{4})?$/

We already know the first little bit (\d{5}) so let's move directly to the last half. The use of parentheses serves to group all the metacharacters within them. Any repetition metacharacters following the parentheses ("?" in this case) operate on the group as a whole; therefore, we expect zero or one occurrence of (\-?\d{4}).

If we decompose this group, we can expect a literal hyphen (-) zero or one time followed by exactly four digits. [As you can glean now, this regex represents a US ZIP code.] With regular expressions, you don't need to fan through the string character by character looking for non-digits and an optional hyphen delimiter.

Here is a JavaScript example of how we test a string for a match:

	var bResult = /^\d{5}(\-?\d{4})?$/.test(string);

Regular expression literals are written between forward slashes. The test method of the RegExp object returns true if the string parameter matches the regular expression and false if it does not.
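As a rough sketch of how this plays out on real input, the snippet below wraps the ZIP code test in a function and runs it over a handful of strings. The function name isZipCode and the sample values are assumptions of mine for illustration.

```javascript
// isZipCode is a hypothetical helper name used only for this illustration.
function isZipCode(sValue) {
	return /^\d{5}(\-?\d{4})?$/.test(sValue);
}

var aSamples = ["12345", "12345-6789", "123456789", "1234", "12345-678", "12345 6789"];
var aResults = [];
for (var i = 0; i < aSamples.length; i++) {
	aResults[i] = isZipCode(aSamples[i]);
}
// aResults is now [true, true, true, false, false, false]

// exec() shows what the optional group remembered.
var aParts = /^\d{5}(\-?\d{4})?$/.exec("12345-6789");
var sPlusFour = aParts[1];    // "-6789" (the hyphen is inside the parentheses)
```

Note that the space-delimited form fails: the pattern allows only an optional hyphen between the two parts.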

If you prefer to construct your regexes at runtime, you may use the RegExp constructor function:

	var re = new RegExp("^\\d{5}(\\-?\\d{4})?$");

Notice the escaped backslashes. Since this constructor takes a string parameter, the string literal's own escape processing happens first; every backslash intended for the regex must be doubled so that a single backslash survives into the pattern.
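Here is a short sketch of what the doubling buys you, and of why you might want to build a regex at runtime in the first place. The variable names, including nDigits, are hypothetical.

```javascript
var reLiteral = /^\d{5}(\-?\d{4})?$/;
var reBuilt = new RegExp("^\\d{5}(\\-?\\d{4})?$");

// Both objects hold the same pattern text.
var bSameSource = (reLiteral.source === reBuilt.source);   // true

// Forgetting to double the backslash: the string "\d" collapses to "d",
// so the pattern becomes ^d{5}$ and looks for five letter d's.
var reWrong = new RegExp("^\d{5}$");
var bDigitsFail = reWrong.test("12345");    // false
var bLettersPass = reWrong.test("ddddd");   // true

// Runtime construction pays off when part of the pattern is a variable.
var nDigits = 4;
var reCount = new RegExp("^\\d{" + nDigits + "}$");
var bFour = reCount.test("6789");     // true
var bFive = reCount.test("67890");    // false
```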

There is a second parameter to the RegExp constructor that you should know as well. It defines how the expression should act when it is attempting to match, whether ignoring the case of the letters ("i") or matching all instances of the pattern within the input ("g") or both ("ig"). The second parameter has its counterpart in the literal syntax as well, where the flags follow the closing slash (for example, /am/i). Here is the constructor form:

	var sInput = "10:52 AM";
	var re = new RegExp("am", "i");
	var bResult = re.test(sInput);        // true

The above regular expression looks for the string "am" in the input without regard to case. The following replaces all instances of the string "flounder" with "fish":

	// case-sensitive match
	sInput = sInput.replace(/flounder/g, "fish");
	// case-insensitive match
	sInput = sInput.replace(/flounder/gi, "fish");
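To see what each flag contributes, here is a small sketch that runs the same replacement with no flags, with "g", and with "gi". The sample sentence and variable names are my own.

```javascript
var sFish = "Flounder, flounder, and more flounder";

// No flags: only the first case-sensitive match is replaced.
var sFirstOnly = sFish.replace(/flounder/, "fish");
// "Flounder, fish, and more flounder"

// "g": every case-sensitive match is replaced.
var sAllExact = sFish.replace(/flounder/g, "fish");
// "Flounder, fish, and more fish"

// "gi": every match is replaced, regardless of case.
var sAllAnyCase = sFish.replace(/flounder/gi, "fish");
// "fish, fish, and more fish"

// The literal flag and the constructor's second parameter behave the same way.
var bLiteral = /am/i.test("10:52 AM");               // true
var bBuilt = new RegExp("am", "i").test("10:52 AM"); // true
```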

Now you have seen a small portion of regular expressions. We can combine the power of regular expression testing with the technique of creating your own methods to make a validation method for form elements.