Last month, I posted a blog posted on approaches to creating an automatic solver for Wordle.
After that posting, there was also a Five Thirty Eight riddler question about the same topic by Zach Wissner Gross. In Zach’s original posting he referenced code and code and examples done by Laurent Lessard. In Lessard’s work, including this blog post Lessard ran experiments trying several different techniques for creating an optimal approach, e.g.
- Picking the guess that created the most choices (most buckets)
- Picking the guess that minimized the size of the largest bucket
- Picking a guess that maximized entropy
Experimental results suggested that the first of these was slightly better than the other two alternatives.
A week later, Zach Wissner-Gross posted the best solution received a solution credited to Jenny Mitchell. Either from Jenny or from Zach, this solution posted also turned the problem into a lookup table with each element of the tree having a designated solution. The declared victor was essentially a variation of the “maximize splits” approach
I was playing a little further with these solutions and trying to understand the search space. As discussed in previous post, each word splits the candidate mystery set into up to 238 unique buckets (3^6 minus five invalid choices) and in practice up to 150 of these buckets populated using the word TRACE. As Jenny Michell’s solution and Laurent Lessard’s experiments show, this is indeed a very good solution approach. However, I am skeptical whether it is the absolute best approach because of two reasons:
- In the context of creating a lookup-table, it is possible that different approaches could create just slightly more optimal approaches for parts of the search subtree.
- The splits themselves are also influenced by the choices of the valid guess words. There are quite a few, but it is not fully populated.
Zach made one statement in arguing for a maximizing splits approach that I am not convinced is true:
It didn’t matter how many words were in each constellation, just the total number of constellations.
To think of why, consider the following thought exercise of two hypothetical ways in which a a group of 20 candidates could be split by a guess into 10 buckets.
- One method would be word just happened to split this into 10 buckets of two elements each. In this case, the expected number of guesses would be 2.5 – one for the initial split into 10 buckets and then for each of these buckets one could either guess correctly (1 more guess) or incorrectly (2 more guesses) thus- 1 + (50% * 1) + (50% * 2) = 2.5
- A second method would be a word that creates 9 buckets of one element each, and a remaining bucket of 11 elements. In this case the expected number of guesses would be one for the initial split into 10 buckets and then all the buckets with one element could be guessed in the next guess – and the remaining ones however long it takes to split out the bucket of 16 – thus – 1 + 45% * 1) + (55% * x).
There is no inherent reasons what these two calculations have to be the same. Some notion of entropy can still play a role, even if maximizing entropy is not the best solution. Hence, I believe Zach’s statement above is a good description of the approach, but not guaranteed to find the absolute best solution.
So I created a metric that was a little more entropy based instead of purely the number of buckets or size of the largest bucket. I considered this as the “weighted size” of the bucket. In the example above, 55% of the mystery words are in the bucket with 11 elements and 45% of the mystery words are in a bucket with 1 element. So the weighted size is (55% * 11 + 45% * 1) = 6.05. This is in comparison to the weighted size of (100% * 2) = 2.0. So there is more entropy in the buckets that are all split out, than one that is still more concentrated in a larger bucket.
Note: That in general, I would guess that a lower weighted size should help – however, interestingly enough a bucket of size 2 is likely among one of the least efficient choices – so higher entropy might not always be the best solution in some of these small cases.
When I print a list of the top 50 choices that minimize this weighted size, I can also print their number of buckets as well as the largest bucket sizes in the table below:
num word bucket max ones wsize eguess 1 roate 126 195 23 60.42 3.31 2 raise 132 168 28 61.00 3.28 3 raile 128 173 22 61.33 3.28 4 soare 127 183 22 62.30 3.24 5 arise 123 168 26 63.73 3.26 6 irate 124 194 14 63.78 3.28 7 orate 127 195 28 63.89 3.30 8 ariel 125 173 21 65.29 3.33 9 arose 121 183 23 66.02 3.30 10 raine 129 195 27 67.06 3.24 11 artel 128 196 27 67.50 3.27 12 taler 134 196 24 67.74 3.29 13 ratel 134 196 25 69.84 3.29 14 aesir 116 168 32 69.88 3.33 15 arles 108 205 14 69.89 3.34 16 realo 112 176 20 69.95 3.36 17 alter 128 196 25 69.99 3.31 18 saner 132 219 33 70.13 3.26 19 later 134 196 34 70.22 3.33 20 snare 132 219 32 71.10 3.18 21 oater 128 195 38 71.25 3.38 22 salet 148 221 34 71.27 3.22 23 taser 134 227 31 71.28 3.26 24 stare 133 227 23 71.29 3.19 25 tares 128 227 26 71.54 3.24 26 slate 147 221 29 71.57 3.22 27 alert 131 196 25 71.60 3.24 28 reais 114 168 24 71.61 3.35 29 lares 118 205 22 71.74 3.33 30 reast 147 227 29 71.77 3.15 31 strae 125 227 16 71.85 3.19 32 laser 123 205 27 72.12 3.33 33 saine 136 207 25 72.59 3.25 34 rales 117 205 21 72.80 3.34 35 urate 122 202 25 72.83 3.31 36 crate 148 246 30 72.90 3.17 37 serai 110 168 20 72.92 3.30 38 toile 123 204 20 73.04 3.23 39 seral 128 205 24 73.08 3.17 40 rates 118 227 24 73.33 3.30 41 carte 146 246 30 73.52 3.21 42 antre 134 226 31 73.94 3.25 43 slane 133 225 25 73.99 3.19 44 trace 150 246 32 74.02 3.16 45 coate 123 192 22 74.51 3.22 46 carle 144 249 37 74.68 3.23 47 carse 139 237 26 74.83 3.20 48 stoae 110 177 18 74.90 3.26 49 reals 116 205 21 74.94 3.27 50 terai 113 194 17 75.14 3.27
The word “TRACE” is 44th on this list, even though it has the highest value for the number of of buckets in column #3. Similarly, if you instead want to pick the words with lowest maximum bucket size, a word like RAISE is lowest in column #4 but numbers in this column are not always the lowest expected weighted bucket size.
I am not convinced that a increase the entropy situation uniformly is best choice, any more than I am convinced a maximize buckets or maximize spread is optimal for the entire search tree. Instead, I think it can be slightly influenced by subtleties of particular trees as well as counts.
Looking at weighted numbers of guesses, recursively
To look at potential approaches and expectations, I decided to look at this inductively. In the end, we are creating a search tree, where the tree is composed of nodes ranging from one or more elements. What might be the expected number of guesses for these nodes and how can the be accumulated recursively?
- One of the bottom building blocks are nodes with one element. These are also listed in column #5 of the table above under the heading “ones”. There are not very many of them near the root of the tree, only 32 of them at a first level guess of TRACE or ~1% of the 2315 cases. If you are fortunate enough to reach one of them, it takes one more guess to confirm. However, as you go to lower levels of the tree, I also expect the percentage of buckets with one element to increase.
- Another building block can be an node with two elements. A described above, one can flip a coin get the right one ~50% of the time, so expected guesses are 1.5.
- Next are nodes with three elements. An average expectation could be to guess this in two guesses – using one of two possible approaches. Either you try the choices in this bucket one at a time and (1/3 * 1) + (1/3 * 2) + (1/3 * 3) and have expectation of 2 guesses to solve – or you find something that splits the three choices into two and pick in the next guess. If you are lucky one of the two guesses might split the other two if incorrect.
- Nodes with four or five elements might be even more “efficient” than nodes with three – since there is a larger probability of having a word that can split these into exactly four or five choices. At least this is what I’ve observed.
- As the number of elements within a bucket increases, one gets into more situations where the problem instead gets solved recursively. Find the expected guesses and then break the problem up recursively.
So I created an example that created a search tree in this more dynamic fashion. When the node has only two elements, don’t find a pick that evenly splits it into two buckets to maximize the split (or to minimize the bucket size) – but instead try guessing one of the choices and if that wasn’t correct – pick the other. Hence, using a strategy with a 1.5 expected count rather than 2 expected count…
I did this recursively, and when a node had too many elements in it, then rather than an exhaustive search for the absolute best candidate – I used a rough heuristic (wsize in my case, but it could have been number of buckets or size of largest bucket) – and tried the best candidate. I think my rough heuristic can do a reasonable job – but I don’t claim to be absolutely optimal because there is always a chance that further down on the list might just be slightly better.
Looking at the table above, I notice that TRACE does pretty well within this overall list if used as a first choice, even if followed up later with a different heuristic. I am also struck that there is a word REAST at number 30 on the list that might be even slightly better as this first choice, even though it is not the best choice in terms of other metrics like maximal number of splits, smallest bucket size or weighted bucket size.
Instead, REAST looks like it might have almost as many buckets as TRACE (247 vs 250) but then also a slightly smaller maximum bucket size (227 vs 246).
I am not claiming that I have found the best solution for solving Wordle.
Instead, I am suggesting that other claims of ideal might not as strong. With a lookup table approach, I do think one could find a case that optimizes. However, rather than a single heuristic for creating all of that table, it may be the case that parts of the search tree use slightly different heuristics.
In the table above, the last column eguess is an estimated guess derived recursively and weighted by the number of elements in each node along the way. There is possibility my calculations are slightly off, but I’ll also observe numbers slightly lower than those presented in FiveThirtyEight.
I expect the largest form of discrepancy is for nodes with two elements. A naive approach of “maximize spread” solves this in two guesses, whereas an approach of try one and if it isn’t right try the other solves it on average in one and a half.
The second potential discrepancy isn’t as strong but comes from examining expectations for REAST vs. TRACE. These might suggest that it is possible to pick particular choices optimized for different parts of the tree (in this case the initial guess, but it could also be lower down). Undoubtedly maximizing spread and minimizing largest buckets are good things to do and lead to strategies that perform very well. It is with that step after that of claiming “ideal” that I am questioning.
Two general notes made in the last blog post also apply. First, this experiment was done in the context of a search tree for exactly the 2315 possible words in Wordle. I expect the general techniques to still apply but results to be different if picking a larger set such as ~40,000 words in Google corpus that are five letters long.
Second note is that the way systems solve Wordle are different from how people solve Wordle. My observation is people often try to find initial letters to work from and go from there. Whereas a system it is a combinatoric elimination problem. This was reinforced by Wordle for 14 February where chosing REATH followed by DOILY narrowed the space to a single word in a way I’m sure I would still have no clue…