January 23: A word-matching game using regular expressions in Ruby
If you’ve read any of my previous blog posts, you can probably guess that I love games and puzzles! And not just because I love playing them on long train rides — games are the perfect template for diving deep into software engineering. Even a simple game has rules, hierarchies, states and conditions —implementing it all forces one to think like a programmer and gain a real, true understanding of how to solve any problem. No surprise that Ada Lovelace, the world’s first computer scientist, was obsessed with games.
In this blog post, we’ll explore the use of regular expressions to create a version one of my favorite word matching games: Spelling Bee!
Background: The rules of Spelling Bee
Spelling Bee was developed by New York Times Puzzle Editor Will Shortz to lend variety to that paper’s already-revered Crossword section. Introduced in The New York Times Magazine in 2014, the game’s online version has since generated a massive and fanatical online following. I’m not made of money, so I can’t afford a New York Times Puzzle Section subscription… but what I do have is the power of Ruby!
As shown in the above animation, a player is presented with a hexagonal grid of letters, and guesses as many words as possible which contain those letters. A guess must include the letter in the center cell; letters in outer cells are optional. Extra points are awarded for longer words, and for pangrams (words which contain every one of the seven letters at least once). Only words four letters or longer are valid guesses.
The New York Times doesn’t say much on how it creates puzzles or validates guesses for Spelling Bee — a policy which has generated a non-insignificant number of complaints among the game’s devotees. Editor Sam Ezersky has, however, revealed that the list of possible guesses includes about 60,000 “common words” curated from the Oxford English Dictionary and excludes duplicate spellings from British English, technical/professional vocabulary and obscenities/slurs (among other categories). Because I don’t own the rights to this game or its dictionary, I used the Official Scrabble Players Dictionary word list, which is significantly more generous (by a factor of 4 or more) than the Spelling Bee word list.
What’s a regular expression? What’s it used for?
In our Spelling Bee game, we’ll determine if a player’s guess is correct using regular expressions or regex — pieces of semi-standardized syntax for defining a search pattern for a string of text. The concept was first defined by a University of Wisconsin-Madison mathematics professor in the 1950s and has since become a central and defining feature of most common languages and computer programs. Applications most familiar to us include search engines and find-and-replace features in text editors and web browsers.
Each part of a regex is either a literal character (with a literal meaning) or a metacharacter (with a predefined meaning). Take, for example, a very simple regex:
/N./
Ruby regular expressions always start and end with forward slashes. The N
is a literal, which will match any occurrence of a capital letter ‘N’, while the .
is a metacharacter which matches anything except a carriage return (new line). So, the above regex will match to any string that starts with a capital N:
"Norm".match?( /N./ ) => true
"Nancy".match?( /N./ ) => true
"Josh".match?( /N./ ) => false
"N1234%".match?( /N./ ) => true
Note that above I say regex is semi-standardized because there are different regex standards, though they all appear superficially very similar; the most common are the POSIX standard and the Perl standard. Ruby uses a standard called Onigmo, which is much more similar to Perl than POSIX. It provides us with a class called Regexp
, which in turn defines a number of methods for defining and working with regular expressions.
Please also note that regex is a black hole of pure torturous evil, and that studying it too closely for too long will inevitably result in a slowly-worsening downward spiral of insanity and homicidal rage. Ruby is designed to be elegant, succinct, precise, flexible and readable — regular expressions are the exact polar opposite of these. Take a deep breath (and maybe a shot or two), have a seat, and feast your eyes on this example:
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*)
Believe it or not, this is an actual, functional regex for validating an email address, and it’s 6,424 characters long!
Defining a puzzle class
Unfortunately for my will to live, Spelling Bee is a word-matching game — ideal for regex, whose sole purpose is word-matching. So hold onto your hat and examine the below Puzzle
class, which defines the rules of Spelling Bee as regular expressions:
class Puzzle
@@dictionary = File.open( "words.txt" ).read.split( "\r\n" )
attr_reader :match, :pangram
attr_accessor :letters, :guesses
def initialize( these_letters )
raise ArgumentError, "Puzzle must contain exactly 7 letters" unless these_letters.length == 7
@letters = these_letters.upcase
@guesses = []
@match = /^(?=.*#{ these_letters.upcase[ 0 ] })[#{ these_letters.upcase }]{4,}$/
@pangram = /^(?=.*#{ these_letters.upcase[ 0 ] })(?=.*#{ these_letters.upcase[ 1 ] })(?=.*#{ these_letters.upcase[ 2 ] })(?=.*#{ these_letters.upcase[ 3 ] })(?=.*#{ these_letters.upcase[ 4 ] })(?=.*#{ these_letters.upcase[ 5 ] })(?=.*#{ these_letters.upcase[ 6 ] })\w+$/
end
end
What in the Ruby is happening on lines 9 and 10?! It’s easiest to understand with an example — all that interpolation is making me feel faint! (However, it is good to know and keep in mind that interpolation can be done with #{ }
when defining a regex, just the same as in a text string.)
It’s interesting to note that this Puzzle
class is essentially just a wrapper for a string of letters
and an array of guesses
. For the purpose of this implementation of the Spelling Bee game, we’ll assume that the center cell is always the first letter (letters[ 0 ]
). So, let’s create the puzzle from the animation, RTINAVM, and take a peek at our instance variables:
puzzle = Puzzle.new( "rtinavm" )
=> #<Puzzle:0x00007fc3c5f27c28
@guesses=[],
@letters="RTINAVM",
@match=/^(?=.*R)[RTINAVM]{4,}$/,
@pangram=/^(?=.*R)(?=.*T)(?=.*I)(?=.*N)(?=.*A)(?=.*V)(?=.*M)[RTINAVM]{4,}$/>
Let’s break down our match
instance variable, character by character:
^
is a metacharacter that matches the beginning of a string.- The series of metacharacters
(?=.*R)
is called a lookahead. It defines a condition (a mini-regex) to evaluate before even touching the string to match; if the lookahead doesn’t match, the whole string isn’t a match. In this case, we’re looking ahead to see if the string matches the regex.*R
. - We already know
.
is a metacharacter matching to any literal except a line break.*
is a quantifier, a metacharacter that defines the number of times the character right before it occurs in a string to match — in this case, zero or more times (permitted, but not required). So here, we’re asking if the characterR
(the center cell) occurs in the string we’re matching, with any number of characters before it. If it doesn’t, the guess isn’t valid according to the rules of Spelling Bee. - A set of square brackets
[ ]
is a character class, which will match to any of the characters defined within — so[RTINAVM]
will match to a single occurrence of one of the letters in our puzzle. A set of curly brackets{ }
is a range quantifier.{1,2}
will match if a character occurs once or twice, while{,3}
will match if a character occurs up to three times. In Spelling Bee, a guess must have four or more of the letters in our puzzle, so[RTINAVM]{4,}
will match only strings that contain four or more occurrences of any of the letters in the character class[RTINAVM]
. - The final metacharacter,
$
, matches the end of a string.
Dare we even try to examine the pangram
instance variable? Don’t lose heart — it’s just a chained series of lookaheads, one for each letter in the puzzle, to make sure each occurs in the puzzle at least once. If they all match, we return the entire word using the same character class and range quantifier as before, [RTINAVM]{4,}
.
Regular expressions in action
We’ve now defined a Puzzle
class that gives us everything we need to play a game of Spelling Bee:
- A string of
letters
; - An array of
guesses
; - A regular expression
match
defining the rules for a correct guess; - A regular expression
pangram
defining the rules for a pangram (which gets us juicy bonus points); and… - A class variable,
dictionary
— a gargantuan array of words, which we get by using Ruby’sFile
class toread
andsplit
a text file calledwords.txt
containing all the roughly 280,000 words in the Scrabble word list, each on a separate line.
Now it’s just a matter of defining methods which use Ruby’s Regexp
class to respond appropriately to a player’s guess
es, like below:
def possible_words
@@dictionary.select{ | word | word.match?( self.match ) }
enddef pangrams
@@dictionary.select{ | word | word.match?( self.pangram ) }
enddef correct?( guess )
self.possible_words.include?( guess )
enddef bonus?( guess )
self.pangrams.include?( guess )
end
All of these methods are only a single line! Of particular importance are the first two, possible_words
and pangrams
, which call Regexp.match?
— a method that, predictably, returns true
if a string match
es a regex and false
if it doesn’t:
"JOSH".match?( puzzle.match )
=> false
"TRAIN".match?( puzzle.match )
=> true
Both these methods enumerate through each word in dictionary
, .select
ing words that .match?
and ignoring words that don’t:
puzzle.possible_words
=> [“AARTI”,
“AIRMAN”,
“AIRN”,
“AIRT”,
“AIRTRAM”,
“AMARANT”,
“AMARANTIN”,
“AMARNA”,
“AMIR”,
“AMRIT”,
...
puzzle.pangrams
=> ["INTRAVITAM", "VARMINT"]
Conclusion: You’ve gotta know when to hold ’em, know when to fold ’em, know when to walk away…
Regex is a feisty girl whose elusive, seductive charms can give any text-heavy Ruby application extra pizzazz and a professional shine. But she’s hard to please, and dancing with her can leave you with a broken heart!
When writing my Ruby Spelling Bee knockoff, I quickly got bored stealing puzzles from Reddit and The New York Times website, and found it difficult to generate random puzzles that were satisfying to play. Just selecting seven random letters, even common ones, well… that doesn’t work. So, with a Greek chorus admonishing me in the background, I thought to myself, why not use a regex to select seven-letter pangrams and generate puzzles from those?
What I realized rather quickly is that regex is not the right tool for that job. I wrestled with three different possibilities, but all relied on lookbacks (the opposite of a lookahead) that grew horrifyingly long-winded. A solution using .split.uniq
was much, much more direct and elegant.
The moral of this story is you should think long and hard about whether regex is called for before you start coding. There are a great many tools to help long-suffering programmers endure the hellish soul-sucking nightmare of regular expressions, including Rubular and Regex101. But Ruby also gives us enumerables, and a String
class with a great myriad of friendly tools like .include?
, .starts_with?
and .ends_with?
that don’t require a mental root canal. Regex is a diesel jackhammer, and in nine of ten cases all you need is a mallet.