I wrote a script that finds pairs of words that share a common origin but have diverged significantly in modern meaning.
I came up with this project an hour and a half ago and threw something together between meetings using Python and ChatGPT. Here are the first results. Importantly, these results come not from ChatGPT but from the script working with dictionaries.
For example, grammar – glamour. The word glamour originates from the Scottish pronunciation of the word grammar (meaning “knowledge,” especially magical). The early association of grammar with secret knowledge transformed into “glamour” as “magical enchantment.”
It turns out that Jack is a diminutive form of John that evolved via Jankin.
It turns out that espresso and sprain share a common root—the Latin exprimere, meaning “to press out, extract.”
Debut and butt share a common root: Old French but, “goal.” Debut comes from French débuter, “to start a game,” literally “to make the first strike at the goal.” Butt, in the sense of “target” (e.g. the butt of a joke), also comes from but, “goal, target.”
Technical details: What does the script do?
1. First, it downloads a large dump of English Wiktionary data (via Kaikki) and FastText, a word-embedding model that represents the “meaning” of words as vectors.
2. Then it analyzes the etymology (origin) of words, finding their common “ancestors”—ancient words (etymons) from which the modern ones derive.
3. It then selects only those words that are full dictionary entries in Wiktionary and are commonly found in modern English (filtering out very rare or archaic words).
4. Then it measures the “distance” between meanings using word vectors (word embeddings) from FastText. By comparing these vectors, the script calculates how far the meanings of words with a common root have diverged. Low similarity in vectors indicates a significant difference in meaning.
5. Finally, it finds “distant relatives”: pairs of common words that share an ancestor but whose meanings today are as far apart as possible.
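To make the last steps concrete, here is a minimal sketch of the grouping, comparing, and ranking part. It is not the actual script: the function names, the `max_similarity` threshold, and the use of cosine similarity are my assumptions for illustration, and the three-dimensional vectors are made up so the example runs without downloading anything. A real run would take the etymology groups from the Kaikki data and the vectors from FastText.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def distant_relatives(etymon_to_words, vectors, max_similarity=0.3):
    """Return (similarity, word_a, word_b, etymon) tuples, least similar first.

    etymon_to_words: dict mapping an ancestor word to its modern descendants.
    vectors: dict mapping a word to its embedding (a list of floats).
    max_similarity: hypothetical cutoff; pairs above it are dropped.
    """
    pairs = []
    for etymon, words in etymon_to_words.items():
        # Skip words that have no vector, and compare every remaining pair once.
        known = sorted(w for w in set(words) if w in vectors)
        for a, b in combinations(known, 2):
            sim = cosine_similarity(vectors[a], vectors[b])
            if sim <= max_similarity:
                pairs.append((sim, a, b, etymon))
    return sorted(pairs)


# Toy input: the groupings come from the examples in this post,
# but the vectors are invented purely for illustration.
etymon_to_words = {
    "exprimere": ["espresso", "sprain"],
    "but": ["debut", "butt"],
    "grammar": ["grammar", "glamour"],
}
vectors = {
    "espresso": [1.0, 0.0, 0.0],
    "sprain": [0.0, 1.0, 0.0],
    "debut": [1.0, 1.0, 0.0],
    "butt": [1.0, 0.0, 0.0],
    "grammar": [1.0, 0.0, 1.0],
    "glamour": [0.0, 1.0, 1.0],
}

for sim, a, b, etymon in distant_relatives(etymon_to_words, vectors, max_similarity=0.6):
    print(f"{a} / {b} (from {etymon}): similarity {sim:.2f}")
```

Sorting by similarity puts the most diverged pairs at the top, and the cutoff is one simple place to start trimming noise.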
The script still generates quite a lot of “noise,” but I have a clear idea of how to clean it up.
Read more of such goodness by clicking here –> #RaufLikesEtymology

