Learning More about Grist using Wordle

March 24th, 2022

I recently learned of an exciting new tool, Grist. What seems great about it, you may ask? Let me list a few things that caught my interest:

It permits building relational databases without code
It's got an intuitive, spreadsheet-like, familiar interface
It can be extended using Python!
Although it's a cloud-based tool, you can back up your important documents to a SQLite-based file, for peace of mind
It's free for basic use!

I wanted to learn it better. And what better way to learn something than to dive right in, and use it to answer burning questions?

Right now, like many other people, I've been enjoying playing Wordle, the smash hit word game recently purchased by the NY Times. I sometimes play on my phone, sometimes on my computer - so the built-in statistics didn't cut it for me. I "needed" a place to track my games - and I wanted to get some insights into how to play better. I decided to try using Grist, and it has turned out beautifully!

Here, I'll share some of what I learned, and how I used Grist to do it. I'm going to be writing this for those who are already somewhat familiar with the fundamentals of Grist. If you want to famliarize yourself with any of the concepts I mention here, Grist's documentation is great.

Most of the links in this post are read-only, FYI, but if you jump to the conclusion, there's a link you can use to play around with this data yourself.

Sorry if this is a bit long - my enthusiasm is a bit greater than my focus, perhaps. 🙂

Data Entry

First and foremost, I want to store information about each game I play. Obviously, I needed a table called "Games"! I added fields for:

"Date"
"Word"
"Shareable Result"
"Notes"
"Screenshot" (because, why not)

I realized pretty quickly that when I play on my phone, I needed a better data entry interface than the default "Table" Widget. Fortunately, the "Card List" widget works really well! I went ahead and added a field called "Submit" which is a toggle. Then, I added a filter to the card list to only show cards where "Submit" is not yet toggled. This works awfully well as a form, until Grist have a chance to implement a form widget.

Analysis: How am I doing?

The first thing I was interested in was replicating the basic statistics that Wordle offers you, but across all my devices:

Games Played
Win %
Current Streak
Max Streak
Guess Distribution

In order to calculate this, I needed a calculated column within the Games table, to show how many guesses my game had taken. Fortunately, that information is available in the "Shareable Result" that Wordle automatically generates. Just as a reminder for those who aren't familiar, an example Shareable Result might look like:

Wordle 276 4/6
⬜⬜🟩🟩⬜
⬜⬜🟨⬜⬜
⬜🟩🟩🟩🟨
🟩🟩🟩🟩🟩

A simple Python formula to extract the number of turns looks like this: return str($Shareable_Result[11]). I saved this formula to a column "Turns". I wrapped it in str() so that if I lose, and thus get an "X", it will be the same data type as a winning game (1-6).

Now, I created another new table. This one is just a holding table, with a single record, called "Stats".

To calculate my stats, the fundamental function type I need is Games.lookupRecords(), which returns a list of Records based on the criteria I set. So, for example, to get my total games, I just need len(Games.lookupRecords(Submit=True)).

To get my Win %, I can use the formula

1 - (
    len(Games.lookupRecords(Turns="X")) / len(Games.lookupRecords(Submit=True))
)

Note that to get a prettier presentation, I can select a Numeric Column Type, a % Number Format, and a max decimals. I like a little bit of detail, so I selected 1 decimal place.

To get my Current Streak, I needed something a bit more. I may have done this in a way that is more complicated than necessary (this will be a theme!), here is how I did it:

all_games_sorted = sorted(Games.lookupRecords(Submit=True),  key = lambda d: d.Date, reverse=True)
streak = 0
for game in all_games_sorted:
if game.Turns != 'X':
    streak += 1
else:
    break
return streak

(Recall that we use d.Date as this is the way Grist has implemented getting the value of a specific field from a Record object.)

We can calculate my Best Streak with only a minor tweak:

all_games_sorted = sorted(Games.lookupRecords(Submit=True),  key = lambda d: d.Date, reverse=True)
streak = 0
for game in all_games_sorted:
if game.Turns != 'X':
    streak += 1
else:
    streak = 0
return streak

Now, I was ready to build a statistics dashboard. I included 2 "widgets": a card, to display the single statistics record that rolls up info about all games, and a chart widget, to display a breakdown of Games by Turns. I know that Wordle uses a Bar chart, but I prefer a Donut chart, so I've gone with that. On PC, it looks pretty nice! On a phone screen, only 1 widget can be expanded at a time, so you have to toggle back and forth to see what Wordle would usually present all in 1 screen, but it's not that big a deal.

Analysis: What are the best guesses?

I wanted to take my Wordle game to the next level, and once I had a solid set of prior games to use as a data set, I realized I could use my data to get some interesting facts to guide my future guesses. (I wasn't really interested in loading in data beyond my own games, since I want to avoid spoilers at all costs.)

I have established a strategy to use my first 2 guesses to get as many different "green" and "yellow" clues as I can. (In other words, if I get a clue in my first guess, I ignore it in my second guess. I wait until my third guess to use my clues from my first 2 guesses.)

As such, I found myself interested in a simple question:

What are the 10 most common letters in Wordle?

To start, I created a table called "Letters", and added a column called "Letter". I added 26 records, "A" through "Z".

Back in my "Games" table, I created 5 formula columns, 1 to extract each letter of the correct word.

"Letter1": Letters.lookupOne(Letter=$Word[0]).Letter
"Letter2": Letters.lookupOne(Letter=$Word[1]).Letter
"Letter3": Letters.lookupOne(Letter=$Word[2]).Letter
"Letter4": Letters.lookupOne(Letter=$Word[3]).Letter
"Letter5": Letters.lookupOne(Letter=$Word[4]).Letter

Back in "Letters", I went ahead and added a field that pulls a list of all the words that contain that letter at least once ("Words Containing Sorted"):

sorted(
    set(
        Games.lookupRecords(Letter1=$Letter).Word + 
        Games.lookupRecords(Letter2=$Letter).Word + 
        Games.lookupRecords(Letter3=$Letter).Word + 
        Games.lookupRecords(Letter4=$Letter).Word + 
        Games.lookupRecords(Letter5=$Letter).Word
    )
)

...and, voila! We know how often a solution contains that letter at least once ("Word Count"): len($Words_Containing_Sorted)

I was not satisfied to just sort by Word Count and call it a day, though. What I wanted was to break the letters out into categories based on their Top 10 status while considering "ties" in "Word Count". To do that, I needed a couple of different rankings.

First of all, I created a "Word Count Rank". This returns the ranking from best to worst, preserving ties. Once again, I might have been able to do this more compactly, but it works:

letter_array = [{'letter': item.Letter, 'word_count': item.Word_Count} for item in Letters.lookupRecords()]
letter_array_sorted = sorted(letter_array, key = lambda d: d['word_count'], reverse=True)
letter_array_sorted[0]['rank'] = 1
i = 1
while i < len(letter_array_sorted):
    previous_word_count = letter_array_sorted[i-1]['word_count']
    current_word_count = letter_array_sorted[i]['word_count']
    previous_rank = letter_array_sorted[i-1]['rank']
    if previous_word_count > current_word_count:
        letter_array_sorted[i]['rank'] = i + 1
    else:
        letter_array_sorted[i]['rank'] = previous_rank
    i += 1
return [item['rank'] for item in letter_array_sorted if item['letter'] == $Letter][0]

I also created a "Word Count Rank Reversed", which was identical except for reverse=False in line 2.

Using these fields, I created an "Is Top 10" field:

if $Word_Count_Rank>10:
    return "Outside Top 10"
elif $Word_Count_Rank_Reversed<17:
    return "Marginally Inside Top 10"
else:
    return "Inside Top 10"

An illustration of how this works in practice can be found at the top of this post, and here. If a letter's word count is greater than that of at least 16 other letters, it is "Inside Top 10". If the letter's word count is lower than that of at least 10 letters, then it is definitively "Outside Top 10". But if the letter's word count is neither greater than 16 others, nor less than 10 others (due to ties), then it is "Marginally Inside Top 10". In practice, I find that U, C and H all migrate between these categories as I add more games over time. If you have been tracking longer, or have downloaded the entire 200+ set of prior solutions, I'm sure it'll look still different.

At first, I used these 10-12 letters to craft guesses, without regard to anything else. So, for example, I tried guess combinations of:

LOUSE,TRAIN
SHIRE,TALON
CLONE,STAIR

But I realized I had more opportunities to refine. After all, my guesses were not taking into account at all what position a letter was likely to be in - only whether it was present at all. What I needed was to answer the question:

Which guesses maximize my chances of both yellow and green clues?

To start, I returned to the Letters table, and added 5 calculated fields:

"Freq As Letter 1": len(Games.lookupRecords(Letter1=$Letter).Word)
"Freq As Letter 2": len(Games.lookupRecords(Letter2=$Letter).Word)
"Freq As Letter 3": len(Games.lookupRecords(Letter3=$Letter).Word)
"Freq As Letter 4": len(Games.lookupRecords(Letter4=$Letter).Word)
"Freq As Letter 5": len(Games.lookupRecords(Letter5=$Letter).Word)

I realized I'd also need to know the number of times a letter appears twice in a word, so I created a "Games with Double Occurences" field:

len(
    Games.lookupRecords(Letter1=$Letter).Word + 
    Games.lookupRecords(Letter2=$Letter).Word + 
    Games.lookupRecords(Letter3=$Letter).Word + 
    Games.lookupRecords(Letter4=$Letter).Word + 
    Games.lookupRecords(Letter5=$Letter).Word
) - len(
    set(
        Games.lookupRecords(Letter1=$Letter).Word + 
        Games.lookupRecords(Letter2=$Letter).Word + 
        Games.lookupRecords(Letter3=$Letter).Word + 
        Games.lookupRecords(Letter4=$Letter).Word + 
        Games.lookupRecords(Letter5=$Letter).Word
    )
)

Moving on from Letters, I started actually entering a few combinations of 2 guesses in a table called Guesses. I enter the guesses themselves into a text field with a comma separating (e.g. LOUSE,TRAIN as above).

For convenience, I also created a "Guess Type" which just sorts out the letters. ''.join(sorted(str($Guess))[1:]) returns, for the example above, AEILNORSTU.

I needed to calculate the number of times over the tracked games a guess would have resulted in green clues and yellow clues. To get the green clues ("Green Freq Sum"), it's pretty straightforward:

return sum(
    [
        Letters.lookupRecords(Letter=$Guess[0]).Freq_As_Letter_1[0],
        Letters.lookupRecords(Letter=$Guess[1]).Freq_As_Letter_2[0],
        Letters.lookupRecords(Letter=$Guess[2]).Freq_As_Letter_3[0],
        Letters.lookupRecords(Letter=$Guess[3]).Freq_As_Letter_4[0],
        Letters.lookupRecords(Letter=$Guess[4]).Freq_As_Letter_5[0],
        Letters.lookupRecords(Letter=$Guess[6]).Freq_As_Letter_1[0],
        Letters.lookupRecords(Letter=$Guess[7]).Freq_As_Letter_2[0],
        Letters.lookupRecords(Letter=$Guess[8]).Freq_As_Letter_3[0],
        Letters.lookupRecords(Letter=$Guess[9]).Freq_As_Letter_4[0],
        Letters.lookupRecords(Letter=$Guess[10]).Freq_As_Letter_5[0]
    ]
)

To get the yellow clues, it is a little bit more complex ("Yellow Freq Sum"):

base_count = sum(
    [
        Letters.lookupRecords(Letter=$Guess_Type[0]).Word_Count[0],
        Letters.lookupRecords(Letter=$Guess_Type[1]).Word_Count[0],
        Letters.lookupRecords(Letter=$Guess_Type[2]).Word_Count[0],
        Letters.lookupRecords(Letter=$Guess_Type[3]).Word_Count[0],
        Letters.lookupRecords(Letter=$Guess_Type[4]).Word_Count[0],
        Letters.lookupRecords(Letter=$Guess_Type[5]).Word_Count[0],
        Letters.lookupRecords(Letter=$Guess_Type[6]).Word_Count[0],
        Letters.lookupRecords(Letter=$Guess_Type[7]).Word_Count[0],
        Letters.lookupRecords(Letter=$Guess_Type[8]).Word_Count[0],
        Letters.lookupRecords(Letter=$Guess_Type[9]).Word_Count[0]
    ]
)
for letter in $Guess_Type:
    if $Guess_Type.count(letter) > 1:
    repetitions = $Guess_Type.count(letter) - 1
    i = 0
    while i < repetitions:
        base_count += Letters.lookupRecords(Letter=letter).Games_with_Double_Occurence[0]
        base_count -= Letters.lookupRecords(Letter=letter).Word_Count[0]
        i += 1
return base_count - $Green_Freq_Sum

(The $Green_Freq_Sum refers to the value of the "Green Freq Sum" field of the same record. Preceding a variable with "$" is is Grist's special syntax.)

Here are the results!

Starting off: just sorting by the "Green Freq Sum" is very valuable. It allows me to note that TRAIL,SHONE results in green clues much more often than, say TAILS,HERON even though they have the same letters. This is due to the fact that, for example, E is much more likely to be in the 5th position than elsewhere in a 5-letter word, and since Wordle avoids simple plurals like TAILS, in our data set S is 7 times more likely to be 1st position than 5th.

Speaking with less confidence (but since this is for fun!), combining the "Yellow Freq Sum" and the "Green Freq Sum" allows us to balance between maximizing yellow clues and maximizing green ones. For example, PROSE,SHALT results in a green clue at a higher rate than TRAIL,SHONE (110 vs 101), but naturally results in a yellow clue less frequently (105 vs 159), due to the repeated S.

How exactly to value green vs. yellow clues is not something I know how to do perfectly, but I decided to create a simple metric based on my best guess.

Based on gut more than science, I weighted green clues 5 times as heavily as yellow clues. This metric will give a decimal above 1 for really good guesses, about 1 for guesses that seem pretty "par" to me, and will be lower for poor guesses.

green_metric = $Green_Freq_Sum / len(Games.lookupRecords(Submit=True))
yellow_metric = $Yellow_Freq_Sum / len(Games.lookupRecords(Submit=True)) / 5
return green_metric + yellow_metric

Based on this metric, my best guess that I have yet tried is AROSE,CLINT. Ultimately, the sheer number of yellow clues this guess generates, even with weighting yellow clues less highly, is more valuable than the green clues PROSE,SHALT would have given me. If, in the future, I can find a guess that has an even higher green clue frequency than PROSE,SHALT, then perhaps that ranking will be upset.

Conclusion

If you liked this, and you want to use my data yourself, please feel free! You can grab the "fiddle" version of the overall document, which you freely copy to your own account and revise, here.

Grist is a fantastic playground for personal data. I haven't even dug in to their features around access restrictions in this post - that's because I haven't experimented with them yet in my actual usage - but they make it possible to take information and share it based on the role and interest of the person you're working. So hopefully I can share in a later post how I find those.

Grist also seems to me to be an interesting entry into the various tools that newcomers to Python can use to learn the standard library. Admittedly, the formula editor could use some tweaking, but the immediate feedback I receive when pressing "Enter" and the potential use cases / productivity hacks are exciting to play around with, and rivaled only by Jupyter Notebooks in my experience.

I did notice one "gotcha", though. I accidentally created a While loop that would never end, and Grist pretty much crashed. I had to refresh my page, avoid that particular field, and revert to a previous snapshot of my document. It would have been good, I think, to set a pretty restrictive timeout and throw an error somehow ("Infinite loop detected" or some such).

Note: I haven't enabled any way for you to leave a comment here on this article, sorry. Please consider joining in the discussion at the Grist Community.

Sincerely,
David Smedberg