Tuesday, January 28, 2014

Average, Normal, Ordinary, Everyday...Regular Expressions

This week I began a new Software Engineering class in which I am learning Ruby on Rails. I started watching the introductory tutorials on the new language, and I immediately discovered a shortcoming in my education to date: I know next to nothing about regular expressions. I was unable to tell which of the characters I was seeing belonged to the regular expressions and which were syntactic requirements of Ruby. If anyone else is experiencing similar difficulties, I hope to help by providing a quick introduction here, as well as some resources for further study. Because my background is in Java, I will start there, then transition into Ruby formatting. So without further ado...

What is a Regular Expression?
A regular expression is a way of describing a pattern to match within a sequence of characters. For example, I could search a string such as "How much wood could a woodchuck chuck?" for the pattern "wood". I could do several things with it: find out whether the pattern occurs at all, count the number of occurrences within the string, or run the search with or without regard to capital and lowercase letters. It is even possible to "find and replace" matching patterns.
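Here is a minimal sketch of those four operations in Java, using the woodchuck sentence from above. The class name WoodDemo and the helper methods contains and count are made up for illustration; the calls underneath (Pattern.compile, Matcher.find, String.replaceAll) are the standard library ones.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
public class WoodDemo {
    static final String TEXT = "How much wood could a woodchuck chuck?";

    // True if the pattern occurs anywhere in the text.
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Counts non-overlapping occurrences; flags may be 0 or e.g. Pattern.CASE_INSENSITIVE.
    static int count(String regex, String text, int flags) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(contains("wood", TEXT));                         // true
        System.out.println(count("wood", TEXT, 0));                         // 2 ("wood" and "woodchuck")
        System.out.println(count("WOOD", TEXT, Pattern.CASE_INSENSITIVE));  // 2
        System.out.println(TEXT.replaceAll("wood", "steel"));               // How much steel could a steelchuck chuck?
    }
}
```

Note that the count is 2 rather than 1 because "wood" also appears inside "woodchuck"; a plain pattern does not care about word boundaries unless you ask it to.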

Why Regular Expressions?
It is easy to think of uses for "find and replace" functionality, from fixing a misspelled word to making intricate changes throughout a long document, such as updating contact information for members of Congress after an election. Even more simply, we might just want to search a small amount of input for a particular set of characters. If we were to write basic "beginner" Java code to search the above input string for "wood," it would involve nested loops and if-else statements and storing and backtracking and...well, suffice it to say that regular expressions not only reduce the amount of code, but are also simpler to understand, since they are a higher-level way of describing what we want and thus closer to the way we think.

Java: regex
In Java, the java.util.regex package contains the Pattern and Matcher classes needed to use this functionality. To use them, you first compile the pattern you want to match, then create a matcher for the test string, and finally run the test:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern = Pattern.compile("\\d{5}");
Matcher m = pattern.matcher("90210");
boolean bool = m.matches();
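The first line defines our pattern as exactly five digits (0-9); the backslash is doubled because it must be escaped inside a Java string literal. The second line creates a matcher for the test string (90210), and the third tests whether the entire test string matches the pattern. This is the equivalent of testing a field to see if it could potentially be a ZIP code.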
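One detail worth seeing in action: matches() only succeeds when the whole string fits the pattern, while find() succeeds if the pattern appears anywhere inside it. Here is a small sketch of the difference. The class name ZipCheck and its two helper methods are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
public class ZipCheck {
    // Compile once and reuse; a compiled Pattern can be shared safely.
    private static final Pattern FIVE_DIGITS = Pattern.compile("\\d{5}");

    // matches(): the ENTIRE input must be exactly five digits.
    static boolean looksLikeZip(String input) {
        return FIVE_DIGITS.matcher(input).matches();
    }

    // find(): five digits in a row may appear ANYWHERE in the input.
    static boolean containsFiveDigits(String input) {
        return FIVE_DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeZip("90210"));                          // true
        System.out.println(looksLikeZip("9021"));                           // false (too short)
        System.out.println(looksLikeZip("902101"));                         // false (too long)
        System.out.println(containsFiveDigits("902101"));                   // true
        System.out.println(containsFiveDigits("Beverly Hills, CA 90210"));  // true
    }
}
```

For validating a form field, matches() is usually the one you want; find() is better suited to pulling a value out of a longer piece of text.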

Ruby: Regexp
In Ruby, the regular expression looks very similar. Instead of being written as a string in quotation marks, it is written between two forward slashes, and the backslash no longer needs to be doubled:
/\d{5}/
Additionally, instead of compiling a Pattern and calling .matcher, you simply call .match on the regular expression and pass it the test string. I'm beginning to like the simplicity already!

I Want to Know More
As I mentioned, these are just the basics. Regular expressions have far more intricate uses. For more on either language, here are some resources:

Javadoc Tutorial on Regular Expressions
Java Regex Tutorial by vogella.com
Ruby-doc on Regexp

If you have any corrections, suggestions, or questions, please feel free to share your wisdom in the comments!