A Database of Words

If you write code that deals with natural language, then at some point, you will need to use data from a dictionary. You have to make a choice at this point.

  • You can either choose one of the big names, e.g. Oxford, Merriam-Webster or Macmillan, and use their API to get the data
  • Or you can choose WordNet.

I have tried both and find WordNet to be the best tool for the job.

For those who don’t know, WordNet is a machine-readable database of words that can be accessed from most popular programming languages (C, C#, Java, Ruby, Python etc.). I have several reasons for preferring WordNet over the other options.

  • Many of the big company APIs require payment. WordNet is free.
  • Many of the big company APIs are online only. WordNet can be downloaded and used offline.
  • WordNet is many times more powerful than any other dictionary or thesaurus out there.

The last point requires some explanation.

WordNet is not like your everyday dictionary. While a traditional dictionary features a list of words and their definitions, WordNet focuses on the relationship between words (in addition to definitions). The focus on relationships makes WordNet a network instead of a list. You might have guessed this already from the name WordNet.

WordNet is a network of words!

In the WordNet network, the words are connected by linguistic relations. These linguistic relations (hypernym, hyponym, meronym, pertainym and other fancy-sounding stuff) are WordNet’s secret sauce. They give you powerful capabilities that are missing in ordinary dictionaries/thesauri.

We will not go deep into linguistics in this article because that is beside the point. But I do want to show you what you can achieve in your code using WordNet. So let’s look at the two most common use cases (which any dictionary or thesaurus should be able to handle) and some advanced use cases (which only WordNet can handle), with example code.

Common use cases

Word lookup

Let’s start with the simplest use case, i.e. word lookups. We can look up the meaning of any word in WordNet in three lines of code (examples are in Python).

### checking the definition of the word "hacker"
# import the NLTK wordnet interface
>>> from nltk.corpus import wordnet as wn
# look up the word
>>> hacker = wn.synset("hacker.n.03")
>>> print(hacker.definition())
a programmer for whom computing is its own reward;
may enjoy the challenge of breaking into other
computers but does no harm

Synonym and Antonym lookup

WordNet can function as a thesaurus too, making it easy to find synonyms and antonyms. To get the synonyms of the word beloved, for instance, I can type the following line in Python…

>>> wn.synset("beloved.n.01").lemmas()
[Lemma('beloved.n.01.beloved'), Lemma('beloved.n.01.dear'),
Lemma('beloved.n.01.dearest'), Lemma('beloved.n.01.honey'),
Lemma('beloved.n.01.love')]

… and get the synonyms dear, dearest, honey and love, as expected. Antonyms can be obtained just as simply.

Advanced use cases

Cross Part of Speech lookup

WordNet can do things that dictionaries/thesauri can’t. For example, WordNet knows about cross Part of Speech relations. This kind of relation connects a noun (e.g. president) with its derived verb (preside), derived adjective (presidential) and derived adverb (presidentially). The following snippet displays this functionality of WordNet (using a WordNet based Python package called word_forms).

### Generate all possible forms of the word "president"
>>> from word_forms.word_forms import get_word_forms
>>> get_word_forms("president")
{'n': {'president', 'Presidents', 'President', 'presidentship',        # nouns
       'presidencies', 'presidency', 'presidentships', 'presidents'},
 'r': {'presidentially'},                                              # adverb
 'a': {'presidential'},                                                # adjective
 'v': {'presiding', 'presides', 'preside', 'presided'}}                # verbs

Being able to generate these relations is particularly useful for Natural Language Processing and for English learners.

Classification lookup

In addition to being a dictionary and thesaurus, WordNet is also a taxonomical classification system. For instance, WordNet classifies dog as a domestic animal, a domestic animal as an animal, and an animal as an organism. All words in WordNet have been similarly classified, in a way that reminds me of taxonomical classifications in biology.

The following snippet shows what happens if we follow this chain of relationships till the very end.

### follow the hypernym relationship repeatedly till the end
# define a function that prints the next hypernym
# again and again till it reaches the end
>>> def get_parent_classes(synset):
...     while True:
...         try:
...             # take the last hypernym listed for the current synset
...             synset = synset.hypernyms()[-1]
...             print(synset)
...         except IndexError:
...             # no hypernyms left: we have reached the root
...             break
...
# find the hypernyms of the word "dog"
>>> dog = wn.synset("dog.n.01")
>>> get_parent_classes(dog)
Synset('domestic_animal.n.01') # dog is a domestic animal
Synset('animal.n.01')          # a domestic animal is an animal
Synset('organism.n.01')        # an animal is an organism
Synset('living_thing.n.01')    # an organism is a living thing
Synset('whole.n.02')           # a living thing is a whole
Synset('object.n.01')          # a whole is an object
Synset('physical_entity.n.01') # an object is a physical entity
Synset('entity.n.01')          # a physical entity is an entity

To visualize the classification model, it is helpful to look at the following picture, which shows a small part of WordNet.

Image courtesy of the original WordNet paper.

Semantic word similarity

The classification model of WordNet has been used for many useful applications. One such application computes the similarity between two words based on the distance between them in the WordNet network. The smaller the distance, the more similar the words. In this way, it is possible to quantitatively figure out that a cat and a dog are similar, a phone and a computer are similar, but a cat and a phone are not similar!

### Checking similarity between the words "dog", "cat", "phone" and "computer"
>>> dog = wn.synset('dog.n.01')
>>> cat = wn.synset('cat.n.01')
>>> computer = wn.synset('computer.n.01')
>>> phone = wn.synset('phone.n.01')
>>> wn.path_similarity(dog, cat)          # a higher score indicates more similar
>>> wn.path_similarity(phone, computer)
>>> wn.path_similarity(phone, cat)        # a lower score indicates less similar
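
To make the idea of distance concrete, here is a minimal pure-Python sketch that runs on a tiny hand-built hierarchy instead of the real WordNet. The TOY_HYPERNYMS table and both helper functions are hypothetical and exist only for illustration. The scoring formula, 1 / (shortest path length + 1), is the one NLTK's path_similarity is based on.

```python
# Toy is-a hierarchy: each word maps to its single parent (hypernym).
# This table is a hypothetical, hand-made stand-in for WordNet.
TOY_HYPERNYMS = {
    "dog": "animal",
    "cat": "animal",
    "animal": "entity",
    "phone": "device",
    "computer": "device",
    "device": "entity",
}


def ancestors_with_depth(word):
    """Map `word` and each of its ancestors to the number of steps up from `word`."""
    depths = {}
    steps = 0
    while word is not None:
        depths[word] = steps
        word = TOY_HYPERNYMS.get(word)  # None once we step past the root
        steps += 1
    return depths


def toy_path_similarity(word_a, word_b):
    """Return 1 / (shortest path length + 1) between two words in the toy tree.

    Assumes both words appear in TOY_HYPERNYMS, so they share the root "entity".
    """
    up_from_a = ancestors_with_depth(word_a)
    up_from_b = ancestors_with_depth(word_b)
    shared = set(up_from_a) & set(up_from_b)
    # The shortest path goes up from one word to a shared ancestor and down to the other.
    distance = min(up_from_a[node] + up_from_b[node] for node in shared)
    return 1 / (distance + 1)


print(toy_path_similarity("dog", "cat"))         # path length 2 -> 1 / 3
print(toy_path_similarity("phone", "computer"))  # path length 2 -> 1 / 3
print(toy_path_similarity("phone", "cat"))       # path length 4 -> 1 / 5
```

In this toy tree, dog and cat meet at animal after two steps, while phone and cat only meet at the root, so the second pair gets the lower score. The real WordNet hierarchy is much deeper and a synset can have several hypernyms, so actual scores will differ.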

WordNet has comprehensive coverage of the English language. Currently, it has 155,287 English words. The complete Oxford English Dictionary has nearly the same number of modern words (171,476). WordNet was last updated in 2011, so some contemporary English words like bromance or chillax seem to be missing from it, but this should not be a deal breaker for most of us.

If you want to know more about WordNet, the following references are very helpful.

This article is taken from Dibya Chakravorty of  Medium Corporation Broken Window Blog
