The Algorithm That Could Take Us Inside Shakespeare’s Mind

The playwright has always been a contradiction. Despite his palpable presence, he’s fundamentally ungraspable. The historical evidence of his life is negligible: There’s a will that makes him only harder to understand — what kind of man leaves his wife his “second-best bed”? — and a handful of other, equally half-significant, records; we don’t even know the exact date of his birth. The only way to know Shakespeare is through his works, and his works are textual quagmires.

Shakespeare was a working playwright of his period, and, much like screenwriters of our own era, he brought in other writers to help him with his plays and helped out other writers with theirs. The Folio, published in 1623, contains most of the works by Shakespeare that we know of, but not all. During his lifetime, pirated editions of his plays appeared, without his permission or approval, as quartos: small hand-held books sold on the street like paperbacks.

The result is permanent confusion. In the case of “Hamlet,” there are three versions of the play: the First Quarto, published in 1603, the Second Quarto, published between 1604 and 1605, and the Folio of 1623. In the First Quarto, sometimes called the “bad quarto,” the famous “To be, or not to be” speech begins this way:

To be, or not to be, ay there’s the point,
To Die, to sleep, is that all? Aye all:
No, to sleep, to dream, aye marry there it goes.

Nobody wants to believe that Shakespeare wrote this crap. It is the Second Quarto and the much later Folio that provide the more familiar “To be or not to be, that is the question” speech. But even between the two more palatable versions, there are significant differences. Should the verse read “For who would bear the whips and scorns of time, / Th’oppressors wrong, the proud man’s contumely, / The pangs of despised love, the law’s delay” (Second Quarto), or “For who would bear the Whips and Scornes of time, / The Oppressors wrong, the poore mans Contumely, / The pangs of dispriz’d Love, the Lawes delay” (Folio)? There’s a big difference between despised love and disprized love, and between a proud man’s contumely and a poor man’s contumely. This is among the best-known passages in all secular literature, and nobody knows for certain how it should read, what actors should recite, what scholars should study. It’s embarrassing.

Every version of Shakespeare you’ve ever read is the result of centuries of debate, mostly arguments over style or historical context, developed through the grinding close study into which I was initiated. Computational modes of Shakespeare analysis are nearly as old as computing itself. The classic stylometric technique, begun in the late 1980s, was to tabulate the relative frequency of “function words” (words like “by” and “you” and “from”) and then to compare their numbers across manuscripts. The most sophisticated form of stylometric analysis so far has been WAN, or word adjacency networks, which register the frequency and proximity of function words in relation to one another. Both techniques have been controversial but broadly effective. The New Oxford Shakespeare editions attributed “Henry VI” to a collaboration with Christopher Marlowe on the basis of WAN analysis.
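
For readers who want to see what tabulating function words looks like in practice, here is a minimal sketch in Python. It is an illustration only, not the method any particular scholar used: the word list, the tokenizer and the distance measure (a simple sum of absolute differences) are all assumptions chosen for brevity, and real studies work with far longer word lists and whole plays rather than single lines.

```python
import re
from collections import Counter

# Illustrative list only; real stylometric studies use much longer lists.
FUNCTION_WORDS = ("by", "you", "from", "the", "and", "of", "to")


def function_word_profile(text, words=FUNCTION_WORDS):
    """Return each function word's share of all word tokens in the text."""
    # Lowercase, then keep runs of letters and apostrophes (straight or curly).
    tokens = re.findall(r"[a-z'’]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in words}


def profile_distance(a, b):
    """Sum of absolute differences between two profiles (an assumed metric)."""
    return sum(abs(a[w] - b[w]) for w in a)


if __name__ == "__main__":
    first_quarto = "To be, or not to be, ay there's the point"
    second_quarto = "To be, or not to be, that is the question"
    q1 = function_word_profile(first_quarto)
    q2 = function_word_profile(second_quarto)
    print(q1)
    print(q2)
    print(profile_distance(q1, q2))
```

A smaller distance suggests two texts use their small connective words in similar proportions, which is the intuition behind comparing a disputed passage with samples of known authorship.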

Cohere works on an entirely different level. It doesn’t require identifying function words or phrases. It just converts language into logarithmic probabilities. You create a Shakespeare algorithm. You put in each of the three different versions of “To be, or not to be” and out pop numbers: -3.6788540925266906 for the First Quarto, -3.179199017199017 for the Second Quarto, and -3.4799767386091127 for the Folio. The closer the number is to zero, the more likely the model thinks the sequence is. And Cohere’s answers make perfect sense — common sense, anyway. “Contumely” means insolence. Wouldn’t it be more likely to be a proud man acting insultingly?
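
To make the idea of a log-probability score concrete, here is a toy sketch in Python. It is emphatically not Cohere's model or API: it stands in a crude word-count model with add-one smoothing for a large neural network, and the function names and the tiny training text are hypothetical. It assumes the score is an average log probability per word, which the figures above resemble but which the text does not confirm. What it does show is the arithmetic: every probability is below 1, so every logarithm is negative, and a sequence the model finds more familiar lands closer to zero.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z'’]+", text.lower())


def train_unigram(corpus):
    """Count how often each word appears in the training text."""
    tokens = tokenize(corpus)
    return Counter(tokens), len(tokens)


def mean_log_prob(text, counts, total):
    """Average natural-log probability per word, with add-one smoothing."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    vocab = len(counts) + 1  # one extra slot so unseen words get some probability
    log_probs = [math.log((counts[t] + 1) / (total + vocab)) for t in tokens]
    return sum(log_probs) / len(log_probs)


if __name__ == "__main__":
    # Hypothetical training text; a real model would see vastly more.
    counts, total = train_unigram("to be or not to be that is the question")
    for line in ("to be, that is the question", "ay there's the point"):
        print(line, mean_log_prob(line, counts, total))
```

Averaging per word keeps longer passages from being penalized simply for their length, so scores for passages of different sizes can be compared.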