Markov Chain Word Generator

The first screen the player sees in Goblin Caves is a black screen with a question and a prompt.

This “new player” screen was originally made as a “quick and dirty” sort of thing, with the intention of replacing it with something fancier - or at least more flashy. It’s really grown on me since I originally wrote it, and I don’t think I’m going to change it. What does still bother me though, is that if you just hit enter without putting down something a little message box pops up, asking you to “Please try again.”

That’s kinda lame, and directly above the code that does that error checking (was) a comment I wrote to myself: “It would be super cool if this was replaced with a call to a random name generator…”

Well I came across that comment while rewriting all the drawing and input handling code, and decided that was going to be my next project.

Random words are typically not just random characters slapped together - that would just give words like:

alxvik hhbatxr orikumo kxzzmk cafzrx jkiuvfm

Not very satisfying, right? Are those names? Places? Plants? Weapons? Who knows. To get more satisfying words, a better approach is to use a Markov Chain - what Wikipedia defines cryptically as “a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.”

That’s just a bunch of fancy words to say (in this context) that if you start with a sequence of letters, which next letter is most likely to occur?

If you know those probabilities, word generation becomes easy - start with a few letters, figure out what letter could come next, choose one of those letters and start the process over with that letter and the preceding letters. With any dataset input, the words that would be generated following these probabilities would (should) be very similar to the input. I could even take multiple datasets, generate some words, handpick words I like for another dataset, and then generate a whole bunch of words similar to the ones I like!

To start this project, I needed data structures for the following:

A list of strings, to hold the words read in from the dataset.
A list of characters, to hold the characters that follow each “key” in the hash table.
A hash table, with each key/value pair being a string/list of characters.

These data structures were a lot of fun to implement, and good practice to put together. I might write another post on what those look like and the thoughts that went into those.

I also needed “training” datasets to generate the hash table (lists of words) and a routine to read those words in. I started this project with a list of names that I pulled from the Social Security Administration’s website - amazingly you can download a set of CSV files that contain names, assigned sex, and number of children registered with that name each year since 1880. Which is all sorts of fun - there were 10,069 Marjories registered with the SSA in 1924, for example.

The hash table is the real workhorse - it provides the tables we roll on to determine which characters come next. To build the table, the program does the following:

Open’s the dataset file
Read and store words in a list of strings
Loop through the list and:
- Take the first N characters of a word (eg: for the word “Richard”, “Ri”)
- Look through the list of words for any two characters that match the key. Any matches and the following character is added to a list (‘c’ comes after “Ri” in “Richard”)
Add to the hash table the key, and the list of characters
Repeat with the next N characters in the word (“ic” from our example)
Continue until you reach the end of the word, and then move to the next word

Easy enough, right? Well it turns out there is still a couple little things to smooth out. Lets look at an example dataset, with the following names (the top ten male names of 1937):

Robert James John William Richard Charles Donald David Thomas George

The hash table generated (with a key size of 2) would look like this:

********************
HashTable
********************
Key: al | Values: ['d']
Key: am | Values: ['e','','','e']
Key: ar | Values: ['d','l','l','d']
Key: as | Values: ['']
Key: av | Values: ['i']
Key: be | Values: ['r']
Key: ch | Values: ['a','a','a','a']
Key: da | Values: ['v']
Key: do | Values: ['n']
Key: eo | Values: ['r']
Key: er | Values: ['t']
Key: es | Values: ['','','','']
Key: ge | Values: ['o','o']
Key: ha | Values: ['r','r','r','r']
Key: hn | Values: ['']
Key: ho | Values: ['m']
Key: ia | Values: ['m']
Key: ic | Values: ['h']
Key: id | Values: ['']
Key: il | Values: ['l']
Key: ja | Values: ['m']
Key: jo | Values: ['h']
Key: ld | Values: ['']
Key: le | Values: ['s']
Key: li | Values: ['a']
Key: ll | Values: ['i']
Key: ma | Values: ['s']
Key: me | Values: ['s']
Key: na | Values: ['l']
Key: ob | Values: ['e']
Key: oh | Values: ['n']
Key: om | Values: ['a']
Key: on | Values: ['a']
Key: or | Values: ['g']
Key: rd | Values: ['']
Key: rg | Values: ['e']
Key: ri | Values: ['c']
Key: rl | Values: ['e']
Key: ro | Values: ['b']
Key: rt | Values: ['']
Key: th | Values: ['o']
Key: vi | Values: ['d']
Key: wi | Values: ['l']

Very cool. The important thing to note is the duplicate letters and the “empty” lists. The empty lists actually contain the character ‘\0’ (NULL) - indicating the end of a word. If you look closely you’ll see that every two letter sequence at the end of the word has the NULL character as an option in it’s list of characters. The duplicates are what gives our probabilities - in this example ‘ar’ has a 2/4 chance of being followed by a ‘d’, and a 2/4 chance of being followed by an ‘l’. The keen observer will also notice that those come from the words “Charles” and “Richard”, and that the process to generate the table reads each word twice. The duplicates aren’t a problem in practice, since if each word was only read once the probabilities would still be 1/2 and 1/2 - 50%. Similarly, ‘am’ has an equal chance of being followed by an ‘e’ or NULL, ending the word.

If we use this table by grabbing a random node and preceding outwards from there until some max length we end up with words like:

Nald Rgeorge Avid Nald John Id Es Icharle Onald Ard Nald Bert Onald Harles Mas Rles Hn Lliames Es Icharle

Better than just choosing random characters - these at least look almost recognizable as names! However, it’s odd starting a name with some of those keys - and importantly some of those keys have no chance of generating a following character. To fix this, the hash table keeps a list of “starter” keys, keys that are at the beginning of each word in the dataset. In our example dataset:

ro ja jo wi ri ch do da th ge

Starting a word with those and moving outwards changes our results to:

Georgeo Chard Richard Donald David Thomas Robert Richarl Robert Georgeo David John John Charles William David James Charles Donald Charles

Well hot dang, we are in business! Those are clearly recognizable names, but definitely not very random. In fact, only a couple of those aren’t in our original dataset! We need a bigger dataset to see different names, so lets feed the machine a list of a 100 names:

Dank Me Leormas Joe Bruce Hard Billiamue Eddie Walvin Lloyd Josephill Ardonard Frandrenn Frene Bobby Cary Vey Haelvin Las

Alright, better. However, using the larger dataset we definitely lost some of the “flavor” of the input. The names don’t look very recognizable anymore - like maybe names in another language. Adjusting the key size from 2 to 3:

Samuel Kennis John Cliffordo Marvey Johnny Bruce Jim Clarence Andrew Herbert Edwin Johnny Jack Henry Jimmie Raymond Hen Roy Raymond Bernard Donard Martin Clarry Patrick Ray Paul Lawrence Gord Eugene George Vernest

There it is. Those are clearly names, and it preserves the taste of the original dataset. I definitely wouldn’t be surprised to see any of those names in the dataset. Now comes the fun part - what if we put in a list of something else?

Minerals? That output looks like:

Antimotoite Austibite Algodonite Allacrandite Akagai Aeschite Alite Abertonite Ardite Anorthinite Bayldonite Alactite Abergentigorite Benioside Billite Barsenic Borax Apatite Apatite Arstowite Ldersonite Aeschite Arcticite Axinite Baotite Azurite Antlerite Aerite Annite Acantlerite Barsenopyrite Anorthony

Scientific names of flowers gives us:

Syrine Lathyrus Alstrelizia Digitalis Spectabiosa Muscabiosa Neria Rosa Alchemilladio Syringa Lavandula Tanacea Belladonna Tanacetum Neria Chamaryllia Aquilegia Parthemum Aster Delphinops Dianthurium Gladonna Chrysantirrha

Excellent - this is what I wanted. Without checking, I would believe that any of those are actually flowers or minerals. Now lets fancy it up a bit - lets feed the generator a list of Tolkien names, and a list of Dutch words.

Neek Sloth Margrimbul Eradalmarie Marel Nijmegen Leod Hingeloopen Bardenbrambaran Veereneke Everdamiril Adrahamuet Joch Joostoffelandri Hulst Eleb Fran Eren Eothoron Maratan Alone Ommen Theda Belenducalmeleb Hanniel Mauhur Nime

I would absolutely believe those are the names of some Elves or something. I could mix up and match datasets, fiddling with results all day. Probably one of the cooler bits of code I’ve ever written. The project is up on Github. It still needs some tweaks - if it’s fed a dataset with extended characters (“wide” characters, or wchar characters in C parlance) the whole thing breaks. I could fix this pretty easily, but not quickly - I would have to rewrite the list of strings structure and dataset loading functions at a minimum. It will also crash if the dataset is too large, or if the key size causes too many keys to be generated. All minor things really, and stuff that could easily be worked around. I haven’t plugged all this into the Goblin Caves project - yet. Still trying to figure out the best way to do that. The generator works quick enough that I could have it generate new words “on the fly” - but really it might be better to have it spit out a thousand words and then just have the game choose from those at random, loaded in from a config file. For the new player screen, I could give the player a list of ten random names, and they could choose from those to fill in the entry field to edit or accept as is.

I suppose all of those things are next on the agenda - I made a cool thing, now I need to stick it in with all the other cool things and use the hell out of it. “Unique” goblins with names: Cirdan the Butcher, Radalf the Sly, Chief Ratitis… Weapons: The battle axe “Tulkasr”, “Anfaug” the cleaver… All sorts of possibilities to add some variety and color to the caves.