Regexp Tutorial at codehaus.org

The following text is taken from the official Groovy pages, specifically from Tutorial 4 - Regular expressions basics.

The following program can be useful for checking the examples in the tutorial:

generaciondecodigos@nereida:~/src/groovy/strings$ cat WisnieskiArg.groovy
def checkSpelling(spellingAttempt, spellingRegularExpression)
{
        if (spellingAttempt ==~ spellingRegularExpression)
        {
                println("Congratulations, you spelled it correctly.")
        } else {
                println("Sorry, try again.")
        }
}

theRegularExpression = /Wisniewski/
if (args.length > 0) {
  def attempt = args[0]
  checkSpelling(attempt, theRegularExpression)
}
else {
  println "Provide one argument"
}

An example of its use follows:
generaciondecodigos@nereida:~/src/groovy/strings$ groovy WisnieskiArg.groovy Wisniewski
Congratulations, you spelled it correctly.

Regular expressions are the Swiss Army knife of text processing. They provide the programmer the ability to match and extract patterns from strings. The simplest example of a regular expression is a string of letters and numbers. And the simplest expression involving a regular expression uses the ==~ operator. So for example to match Dan Quayle's spelling of potato:

                  "potatoe" ==~ /potatoe/

If you put that in the groovyConsole and run it, it will evaluate to true. There are a couple of things to notice. First is the ==~ operator, which is similar to the == operator, but matches the whole string against a pattern instead of computing exact equality. Second is that the regular expression is enclosed in /'s. To Groovy this slashy form is still an ordinary string, but it saves you from doubling backslashes and tells anyone reading your code that the text is meant as a regular expression.
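
As a rough check of the two points above, the following sketch can be pasted into the groovyConsole. The variable names (pattern, exact, longer) are made up for illustration and are not part of the original tutorial.

```groovy
// A slashy literal is an ordinary String; the ==~ operator does the matching.
def pattern = /potatoe/
assert pattern instanceof String

// ==~ yields a boolean and succeeds only if the entire string fits the pattern.
def exact  = "potatoe"  ==~ pattern
def longer = "potatoes" ==~ pattern   // the trailing 's' is not covered by the pattern

println exact    // true
println longer   // false
```

The second result is the part that tends to surprise people: ==~ does not look for the pattern somewhere inside the string, it requires the whole string to match.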

But let's say that we also wanted to match the correct spelling. We could add a ? after the e to say that the e is optional. The following will still evaluate to true.

                  "potatoe" ==~ /potatoe?/

And the correct spelling will also match:

"potato" ==~ /potatoe?/

But anything else will not match:

"motato" ==~ /potatoe?/

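The three potato checks can also be run in one go. This is only a sketch; potatoPattern and potatoResults are names invented here for illustration.

```groovy
// One pattern with an optional trailing 'e', tried against three words.
def potatoPattern = /potatoe?/

// Build a map from each word to the result of the whole-string match.
def potatoResults = ["potatoe", "potato", "motato"].collectEntries { word ->
    [(word): (word ==~ potatoPattern)]
}

println potatoResults   // [potatoe:true, potato:true, motato:false]
```
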
So this is how you define a simple boolean expression involving a regular expression. But let's get a little bit more tricky. Let's define a method that tests a regular expression. So for example, let's write some code to match Pete Wisniewski's last name:

def checkSpelling(spellingAttempt, spellingRegularExpression)
{
        if (spellingAttempt ==~ spellingRegularExpression)
        {
                println("Congratulations, you spelled it correctly.")
        } else {
                println("Sorry, try again.")
        }
}

theRegularExpression = /Wisniewski/
checkSpelling("Wisniewski", theRegularExpression)
checkSpelling("Wisnewski", theRegularExpression)

There are a couple of new things we have done here. First is that we have defined a function (actually a method, but I'll use the two words interchangeably). A function is a collection of code similar to a closure. Functions always have names, whereas closures can be anonymous. Once we define this function we can use it over and over later.

In this function the if statement tests to see if the parameter spellingAttempt matches the regular expression given to the function by using the ==~ operator.
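
If you would rather get the result back than have it printed, a variant along these lines may help. spellingMatches is a hypothetical helper introduced only for this sketch; it is not part of the original tutorial.

```groovy
// Hypothetical helper: the same test as checkSpelling, but it returns
// the boolean result of ==~ instead of printing a message.
boolean spellingMatches(String attempt, String regex) {
    attempt ==~ regex
}

def namePattern = /Wisniewski/
println spellingMatches("Wisniewski", namePattern)   // true
println spellingMatches("Wisnewski", namePattern)    // false
```

Returning the boolean makes the function easy to reuse in an if statement or in an assert.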

Now let's get a little bit more tricky. Let's say we also want to match the string if the name does not have the w in the middle. We might write:

  theRegularExpression = /Wisniew?ski/
  checkSpelling("Wisniewski", theRegularExpression)
  checkSpelling("Wisnieski", theRegularExpression)
  checkSpelling("Wisniewewski", theRegularExpression)

The single ? that was added to theRegularExpression says that the item directly before it (the character w) is optional. Try running this code with different spellings passed as spellingAttempt to prove to yourself that the only two spellings accepted are now Wisniewski and Wisnieski. (Note that you'll have to leave the definition of checkSpelling at the top of your groovyConsole.)
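
One way to try several spellings at once is to filter a list, as in this sketch. The names optionalW, attempts and acceptedWithOptionalW are illustrative assumptions, and the list of attempts is arbitrary.

```groovy
// The w before "ski" is optional: zero or one occurrence is accepted.
def optionalW = /Wisniew?ski/

def attempts = ["Wisniewski", "Wisnieski", "Wisniewewski", "Wisniewwski"]

// Keep only the attempts that match the pattern as a whole.
def acceptedWithOptionalW = attempts.findAll { it ==~ optionalW }

println acceptedWithOptionalW   // [Wisniewski, Wisnieski]
```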

The ? is one of the characters that have special meaning in the world of regular expressions. You should probably assume that any punctuation has special meaning.

Now let's also make it accept the spelling if ie in the middle is transposed. Consider the following:

theRegularExpression = /Wisn(ie|ei)w?ski/
checkSpelling("Wisniewski", theRegularExpression)
checkSpelling("Wisnieski", theRegularExpression)
checkSpelling("Wisniewewski", theRegularExpression)

Once again, play around with the spelling. There should be only four spellings that work: Wisniewski, Wisneiwski, Wisnieski and Wisneiski. The bar character | says that either the thing to the left or the thing to the right is acceptable, in this case ie or ei. The parentheses are simply there to mark the beginning and end of the interesting section.
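
The claim about the four accepted spellings can be checked with a small sketch like the one below. The candidate list and the variable names are made up for illustration.

```groovy
// Either "ie" or "ei" in the middle, followed by an optional w.
def transposed = /Wisn(ie|ei)w?ski/

def candidates = [
    "Wisniewski", "Wisneiwski", "Wisnieski", "Wisneiski",   // expected to match
    "Wisniewewski", "Wisnewski", "Wisniiwski"               // expected to fail
]

def acceptedSpellings = candidates.findAll { it ==~ transposed }

println acceptedSpellings   // [Wisniewski, Wisneiwski, Wisnieski, Wisneiski]
```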

One last interesting feature is the ability to specify a group of characters all of which are ok. This is done using square brackets [ ]. Try the following regular expressions with various misspellings of Pete's last name:

theRegularExpression = /Wis[abcd]niewski/ // requires exactly one of 'a', 'b', 'c' or 'd'
theRegularExpression = /Wis[abcd]?niewski/ // allows one of 'a', 'b', 'c' or 'd', but does not require it
theRegularExpression = /Wis[a-zA-Z]niewski/ // requires exactly one upper- or lower-case letter
theRegularExpression = /Wis[^abcd]niewski/ // requires exactly one character that is not 'a', 'b', 'c' or 'd'
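
The four character-class patterns can be exercised with a sketch such as the following. The deliberately misspelled names and all variable names are assumptions chosen for illustration.

```groovy
def requiresOne = /Wis[abcd]niewski/
def optionalOne = /Wis[abcd]?niewski/
def anyLetter   = /Wis[a-zA-Z]niewski/
def notAbcd     = /Wis[^abcd]niewski/

// Each value is expected to be true; negated entries show a rejected spelling.
def classResults = [
    "[abcd] accepts an extra a"          : ("Wisaniewski" ==~ requiresOne),
    "[abcd] rejects the plain name"      : !("Wisniewski" ==~ requiresOne),
    "[abcd]? accepts the plain name"     : ("Wisniewski" ==~ optionalOne),
    "[abcd]? accepts an extra b"         : ("Wisbniewski" ==~ optionalOne),
    "[a-zA-Z] accepts an extra X"        : ("WisXniewski" ==~ anyLetter),
    "[a-zA-Z] rejects an extra digit"    : !("Wis7niewski" ==~ anyLetter),
    "[^abcd] accepts an extra digit"     : ("Wis7niewski" ==~ notAbcd),
    "[^abcd] rejects an extra a"         : !("Wisaniewski" ==~ notAbcd)
]

classResults.each { label, ok -> println "${label}: ${ok}" }
```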

The pattern with [^abcd] warrants some explanation. If the first character in the square brackets is a ^ then it means anything but the characters specified in the brackets.

The operators

So now that you have a sense for how regular expressions work, here are the operators that you will find helpful, and what they do:
Regular Expression Operators
Operator        Meaning                                                 Example matches
a?              matches 0 or 1 occurrence of a                          'a' or empty string
a*              matches 0 or more occurrences of a                      empty string or 'a', 'aa', 'aaa', etc
a+              matches 1 or more occurrences of a                      'a', 'aa', 'aaa', etc
a|b             matches a or b                                          'a' or 'b'
.               matches any single character                            'a', 'q', 'l', '_', '+', etc
[woeirjsd]      matches any one of the named characters                 'w', 'o', 'e', 'i', 'r', 'j', 's', 'd'
[1-9]           matches any one of the characters in the range          '1', '2', '3', '4', '5', '6', '7', '8', '9'
[^13579]        matches any one character not named                     even digits, or any other character
(ie)            groups an expression (for use with other operators)     'ie'
^a              matches an a at the beginning of a line                 'a'
a$              matches an a at the end of a line                       'a'
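
The sketch below tries most of the operators in the table. The labels and the operatorChecks name are invented for illustration. Because ==~ always tests the whole string, the two anchors are shown with the GDK method String.find(regex) instead, which returns the first match or null; that method is not covered in the tutorial text.

```groovy
// Every value in this map is expected to be true.
def operatorChecks = [
    "a? matches the empty string" : ("" ==~ /a?/),
    "a* matches 'aaa'"            : ("aaa" ==~ /a*/),
    "a+ rejects the empty string" : !("" ==~ /a+/),
    "a|b matches 'b'"             : ("b" ==~ /a|b/),
    ". matches '_'"               : ("_" ==~ /./),
    "[1-9] rejects '0'"           : !("0" ==~ /[1-9]/),
    "[^13579] matches '8'"        : ("8" ==~ /[^13579]/),
    "(ie)+ matches 'ieie'"        : ("ieie" ==~ /(ie)+/),
    "^a is found in 'abc'"        : ("abc".find(/^a/) == "a"),
    "a\$ is not found in 'abc'"   : ("abc".find(/a$/) == null)
]

operatorChecks.each { label, ok -> println "${label}: ${ok}" }
```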

There are a couple of other things you should know. If you want one of the operators above to mean the actual character, for instance to match a question mark, you need to put a '\' in front of it. For example:

// evaluates to true, and will for any text that ends in a question mark and contains no other question mark
"How tall is Angelina Jolie?" ==~ /[^\?]+\?/

This is your first really ugly regular expression. (The frequent use of these in Perl is one of the reasons it is considered a "write only" language.) By the way, Google knows how tall she is. The only way to understand an expression like this is to pick it apart:
/                   [^\?]                           +                       \?                          /
begin expression    any character other than '?'    one or more of those    a literal question mark     end expression

So the use of the \ in front of the ? makes it refer to an actual question mark.
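
To see the difference between the escaped and the unescaped question mark, a sketch like this one can be tried. The sample sentences and the variable names are assumptions made for illustration.

```groovy
def singleQuestion = /[^\?]+\?/

// Every value in this map is expected to be true.
def escapeChecks = [
    "sentence ending in ? matches"    : ("How tall is Angelina Jolie?" ==~ singleQuestion),
    "an inner ? is rejected"          : !("How tall? Really?" ==~ singleQuestion),
    "no ? at the end is rejected"     : !("No question here" ==~ singleQuestion),
    "a lone ? is rejected"            : !("?" ==~ singleQuestion),   // + needs at least one other character
    "unescaped ? is a quantifier"     : !("potatoe?" ==~ /potatoe?/),
    "escaped ? is a literal"          : ("potatoe?" ==~ /potatoe\?/)
]

escapeChecks.each { label, ok -> println "${label}: ${ok}" }
```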

Casiano Rodríguez León
2010-04-30