In this notebook I will explore the similarity between two words in a quantitative manner. In particular, I will take a specific blog post and show what information the algorithm can infer from it.
# Let's create a database
import sqlite3
conn = sqlite3.connect("word-relations.db")
db = conn.cursor()
Now let's explore how we can build a description of a word. Take the word "at", for instance. Given the sentence "I ate at pizzahut yesterday", we can describe the word "at" by its neighbours like so:
"at" : {
"left" : ["I", "ate"],
"right" : ["pizzahut", "yesterday"]
}
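The neighbour lists above can be produced with a small helper. This is a sketch of my own (the function name `neighbours` is not part of the notebook's code):

```python
def neighbours(sentence, target):
    # Split the sentence on spaces and locate the target word.
    words = sentence.split(" ")
    index = words.index(target)
    # Everything before the target is a left neighbour,
    # everything after it is a right neighbour.
    return {"left": words[:index], "right": words[index + 1:]}

neighbours("I ate at pizzahut yesterday", "at")
# {'left': ['I', 'ate'], 'right': ['pizzahut', 'yesterday']}
```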
Let us create a database table to store this information.
db.execute("CREATE TABLE IF NOT EXISTS wordrelation"
"(word text, right text, occurences INTEGER)")
Time to populate this table. For this task I chose a short story found on the blog https://432m.wordpress.com/ and stored it in the file story.txt.
storyfile = open("/volumes/bigdata/story.txt","r", encoding="utf-8")
story = storyfile.read()
After that, I broke the story into individual sentences.
sentences = story.split(".\n")
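On a small sample string (the names `sample` and `parts` are mine, and the text is made up, not taken from the story), the split behaves like this:

```python
# Splitting on ".\n" separates sentences that each end a line;
# note the trailing period survives on the final piece.
sample = "I ate at pizzahut yesterday.\nIt was great.\nThe end."
parts = sample.split(".\n")
# ['I ate at pizzahut yesterday', 'It was great', 'The end.']
```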
Using the output sentences I populated the database.
for sentence in sentences:
    words = sentence.split(" ")
    for index in range(0, len(words) - 1):
        # every word after the current one is a "right" neighbour
        right = words[index + 1:]
        for word in right:
            #find out whether the relation existed before
            db.execute("SELECT * FROM wordrelation WHERE word=? AND right=?",
                       (words[index], word))
            rows = db.fetchall()
            if rows:
                db.execute("UPDATE wordrelation SET occurences=? "
                           "WHERE word=? AND right=?",
                           (rows[0][2] + 1, words[index], word))
            else:
                db.execute("INSERT INTO wordrelation VALUES (?,?,?)",
                           (words[index], word, 1))
conn.commit()
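The select-then-update-or-insert pattern above can be collapsed into a single UPSERT statement. This is a sketch of my own, not what the notebook does: it assumes the table is created with a `UNIQUE(word, right)` constraint (which the notebook's table lacks) and requires SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
db = conn.cursor()
# Same columns as before, plus a UNIQUE constraint so that
# ON CONFLICT has something to trigger on.
db.execute("CREATE TABLE wordrelation"
           "(word TEXT, right TEXT, occurences INTEGER,"
           " UNIQUE(word, right))")

def record(word, right):
    # Insert the pair with a count of 1, or bump the existing count.
    db.execute("INSERT INTO wordrelation VALUES (?,?,1)"
               " ON CONFLICT(word, right)"
               " DO UPDATE SET occurences = occurences + 1",
               (word, right))

record("ate", "at")
record("ate", "at")
count = db.execute("SELECT occurences FROM wordrelation"
                   " WHERE word='ate' AND right='at'").fetchone()[0]
# count is now 2
```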
Now that we have populated the database, we can explore things like the similarity of words. (A disk I/O error occurred here because I put my hand on the cable; it seems to be faulty. Either way, we can still use whatever the database contains for the next part.)
We can use this database to determine the similarity between two words. One way of measuring the distance between distributions is the Bhattacharyya distance, defined as: $$D_{Bhattacharyya} = -\ln\left(\sum_x \sqrt{p(x)q(x)}\right)$$
where $p(x)$ and $q(x)$ are the probability distributions of a word occurring before or after the word in question. To find the similarity between two words I use the following metric: $$S(w_1,w_2) = D_{Bhattacharyya}(w_{1,left},w_{2,left}) + D_{Bhattacharyya}(w_{1,right},w_{2,right})$$
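A quick worked example of the distance on two toy distributions. This helper is my own sketch, not the notebook's implementation; unlike the notebook's code below, it returns infinity (rather than 0) when the distributions share no outcomes at all:

```python
import math

def bhattacharyya(p, q):
    # Sum sqrt(p(x) * q(x)) over the outcomes the two distributions share.
    bc = sum(math.sqrt(p[x] * q[x]) for x in p.keys() & q.keys())
    # No shared outcomes means zero overlap: infinite distance.
    return -math.log(bc) if bc > 0 else float("inf")

p = {"a": 0.5, "b": 0.5}
bhattacharyya(p, p)                     # 0: identical distributions
bhattacharyya(p, {"a": 0.5, "c": 0.5})  # ln(2), about 0.693: half the mass overlaps
```

Identical distributions give a distance of 0 (the sum is 1 and $-\ln 1 = 0$), and the distance grows as the overlap shrinks.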
import math #needed for sqrt and log
def similarity(word1, word2):
    return (bhattacharya(buildLeftDist(word1), buildLeftDist(word2))
            + bhattacharya(buildRightDist(word1), buildRightDist(word2)))
def bhattacharya(distribution1, distribution2):
    commonkeys = set(distribution1.keys()) & set(distribution2.keys())
    totalsum = 0
    #compare the probability distributions
    for i in commonkeys:
        totalsum += math.sqrt(distribution1[i] * distribution2[i])
    print("totalsum: %f" % totalsum)
    if totalsum > 0:
        return -math.log(totalsum)
    else:
        return 0
def buildLeftDist(word):
    return buildDist(word, "left")

def buildRightDist(word):
    return buildDist(word, "right")

def buildDist(word, direction):
    if direction == "left":
        db.execute("SELECT * FROM wordrelation WHERE right=? COLLATE NOCASE", (word,))
    elif direction == "right":
        db.execute("SELECT * FROM wordrelation WHERE word=? COLLATE NOCASE", (word,))
    rows = db.fetchall()
    total = 0
    distribution = {}
    for row in rows:
        total += row[2]
        if direction == "left":
            distribution[row[0]] = row[2]
        if direction == "right":
            distribution[row[1]] = row[2]
    for key in distribution:
        distribution[key] = distribution[key] / total
    return distribution
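The normalisation step at the end of `buildDist` can be checked in isolation. A sketch with made-up counts (the helper name `normalise` and the sample words are mine):

```python
def normalise(counts):
    # Divide each raw count by the total so the values form
    # a probability distribution that sums to 1.
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

dist = normalise({"pizzahut": 3, "home": 1})
# {'pizzahut': 0.75, 'home': 0.25}
```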
The following are some of the results. A lower output means a smaller distance, i.e. more similarity; an output of 0, however, means the two words share no neighbours at all. Notice how the algorithm has learned the concept of countries from the text.
similarity("China","Singapore")
similarity("India","china")
similarity("India","Indonesia")
Apparently the algorithm has learnt about the third world vs. the first world too...
similarity("Singapore","Expensive")
similarity("Indonesia","poverty")
The algorithm also seems to have associated the concept of money:
similarity("poverty","expensive")
The author clearly wrote about leaving China, not Singapore:
similarity("Singapore","Leaving")
similarity("China","Leaving")
Clearly plurality is a concept the algorithm picked up:
similarity("four","seasons.")
similarity("four","season.")
similarity("one","seasons.")
On the concept of seasons:
similarity("Spring","white")
similarity("winters","white")
similarity("autumn","yellow")
similarity("winter","yellow")
similarity("car","driver")
Oh, where did the Olympics happen, Beijing or Delhi?
similarity("beijing","olympic")
similarity("dehli","olympic") #clearly no correlation between the two
Which year did they take place, 2006 or 2008?
similarity("2008","olympic")
similarity("2006","olympic")
Concept of languages:
similarity("english","mandarin")
similarity("italian","mandarin")
Notice how they are the same distance apart...
similarity("italian","seasons.")
In conclusion, just by using simple statistical algorithms on one blog post, I have shown that computers can infer a great deal of information, ranging from the concepts of money, development, languages, and nations to grammatical concepts such as plurality.