Exploring NLP

In this notebook I will be exploring similarity between two words in a quantative manner. In particular I will take a specific blog post and show you information that can be inferred by the algorithm.

In [1]:
#lets create a database
import sqlite3
conn = sqlite3.connect("word-relations.db")
db = conn.cursor()

Now lets explore the way in which we can create a description of words. Take the word "at" for instance. Given the sentence "I ate at pizzahut yesterday" one can describe the word at by its neighbours like so

    "at" : {
        "left" : ["I", "ate"],
        "right" : ["pizzahut", "yesterday"]
    }

Let us create a database table to store this information

In [2]:
db.execute("CREATE TABLE IF NOT EXISTS wordrelation"
           "(word text, right text, occurences INTEGER)")
Out[2]:
<sqlite3.Cursor at 0x107a6c570>

Time to populate this table. For this task I chose a short story found on the blog https://432m.wordpress.com/. I stored the story in the file story.txt.

In [3]:
storyfile = open("/volumes/bigdata/story.txt","r", encoding="utf-8")
story = storyfile.read()

After that, I took the story and broke it into individual sentences.

In [4]:
sentences = story.split(".\n")

Using the output sentences I populated the database.

In [5]:
for sentence in sentences:
    words = sentence.split(" ")
    for index in range(0,len(words)-1):
        right = words[index:]
        for word in right:
            #find out whether the relation existed before
            db.execute("SELECT * FROM wordrelation WHERE word=? AND right=?",
                       (words[index],word))
            exists = False
            rows = db.fetchall()
            for row in rows:
                exists = True
                db.execute("UPDATE wordrelation SET occurences=?" 
                           "WHERE word=? AND right=?",
                           (row[2]+1,words[index],word))
            if exists == False:
                db.execute("INSERT INTO wordrelation VALUES (?,?,?)",
                           (words[index],word,1))
            conn.commit()    
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-5-2e7ae3cee221> in <module>()
      8                        (words[index],word))
      9             exists = False
---> 10             rows = db.fetchall()
     11             for row in rows:
     12                 exists = True

KeyboardInterrupt: 

Now that we have populated the database we can explore things like similarity of words. Disk I/O error occured because I put my hand on the cable. Seems to be a faulty cable. Either ways, we can still use whatever the database contains for the next part.

Similarity between words

We can use this database to determine the similiraty between two words. One way of measuring distance is using the bhattacharya distance. Bhattacharya distance is defined as: $$D_{bhattacharya} = -ln(\sum \sqrt{p(x)q(x)})$$

Where p(x) and q(x) are the probability distribution of a given word occring before or after the word in question. To find the similarity between two words I use the following metric: $$S(w1,w2) = D_{bhattacharya}(w1_{left},w2_{left}) + D_{bhattacharya}(w1_{right},w2_{right})$$

In [9]:
import math #needed for sqrt and log

def similarity(word1,word2):
    return bhattacharya(buildLeftDist(word1),buildLeftDist(word2))+bhattacharya(buildRightDist(word1),buildRightDist(word2))

def bhattacharya(distribution1,distribution2):
    commonkeys = set(distribution1.keys()) & set(distribution2.keys())
    totalsum = 0
    #compare the probability distributions
    for i in commonkeys :
        totalsum+= math.sqrt(distribution1[i]*distribution2[i])
    print("totalsum: %d",totalsum)
    if totalsum>0:
        return -math.log(totalsum)
    else:
        return 0

def buildLeftDist(word):
    return buildDist(word,"left")

def buildRightDist(word):
    return buildDist(word,"right")

def buildDist(word,direction):
    if direction == "left":
        db.execute("SELECT * FROM wordrelation WHERE right=? COLLATE NOCASE",(word,))
    elif direction == "right":
        db.execute("SELECT * FROM wordrelation WHERE word=? COLLATE NOCASE",(word,))
    rows = db.fetchall()
    total = 0
    distribution = {}
    for row in rows:
        total += row[2]
        if direction == "left":
            distribution[row[0]] = row[2]
        if direction == "right":    
            distribution[row[1]] = row[2]
    for key in distribution:
        distribution[key] = distribution[key]/total
    return distribution

The following are some of the results (a lower output means smaller distance, i.e. more similarity, however, 0 on output means no relation). Notice how the algorithm has learned the concept of countries from the text.

In [3]:
similarity("China","Singapore")
totalsum: %d 0.9981438794705978
totalsum: %d 0.9934573693937343
Out[3]:
0.00842197268468112
In [16]:
similarity("India","china")
totalsum: %d 0.9959851358047804
totalsum: %d 0
Out[16]:
0.004022945399683885
In [18]:
similarity("India","Indonesia")
totalsum: %d 0.9999955916629425
totalsum: %d 0
Out[18]:
4.408346774250682e-06

Apparently the algorithm has learnt about third world vs first world too...

In [14]:
similarity("Singapore","Expensive")
totalsum: %d 0.9993335125272609
totalsum: %d 0
Out[14]:
0.0006667096742499458
In [21]:
similarity("Indonesia","poverty")
totalsum: %d 0.9999826259761224
totalsum: %d 0
Out[21]:
1.7374174807682052e-05

The algorithm also seems to have associated the concept of money:

In [20]:
similarity("poverty","expensive")
totalsum: %d 0.9999999999999971
totalsum: %d 0
Out[20]:
2.8865798640254113e-15

The author clearly wrote about leaving China not singapore:

In [5]:
similarity("Singapore","Leaving")
totalsum: %d 0.5436698850672579
totalsum: %d 0.9901629009149022
Out[5]:
0.6192988482529488
In [6]:
similarity("China","Leaving")
totalsum: %d 0.5637782870516115
totalsum: %d 0.9972010217873676
Out[6]:
0.5758971155126812

Clearly plurality is a concept the algorithm picked up:

In [15]:
similarity("four","seasons.")
totalsum: %d 0.8994358772490896
totalsum: %d 0.9998638862396183
Out[15]:
0.10612363826225472
In [27]:
similarity("four","season.")
totalsum: %d 0.8705513489626149
totalsum: %d 0.9994811649274917
Out[27]:
0.1391475033325885
In [28]:
similarity("one","seasons.")
totalsum: %d 0.8053486115065345
totalsum: %d 0.9991943678245226
Out[28]:
0.2172859944054157

On the concept of seasons:

In [39]:
similarity("Spring","white")
totalsum: %d 0.7121753986423712
totalsum: %d 0.6970907120611235
Out[39]:
0.7002707817613969
In [38]:
similarity("winters","white")
totalsum: %d 0.9973738009293232
totalsum: %d 0.9992444429384159
Out[38]:
0.003385496219612564
In [40]:
similarity("autumn","yellow")
totalsum: %d 0.9633461171623349
totalsum: %d 0.998041082367253
Out[40]:
0.03930335504236147
In [41]:
similarity("winter","yellow")
totalsum: %d 0
totalsum: %d 0
Out[41]:
0
In [12]:
similarity("car","driver")
totalsum: %d 0.9999999999999971
totalsum: %d 0
Out[12]:
2.8865798640254113e-15

Oh, where did the olympics happen Beijing or Dehli?

In [30]:
similarity("beijing","olympic")
totalsum: %d 0.999442785111475
totalsum: %d 0.9999064052971917
Out[30]:
0.0006509692735005581
In [31]:
similarity("dehli","olympic") #clearly no correlation between the two
totalsum: %d 0
totalsum: %d 0
Out[31]:
0

Which year did they take place 2006 or 2008?

In [32]:
similarity("2008","olympic")
totalsum: %d 0.997513782704472
totalsum: %d 0.9991228595531229
Out[32]:
0.0033668384256419336
In [33]:
similarity("2006","olympic")
totalsum: %d 0
totalsum: %d 0
Out[33]:
0

Concept of languages:

In [42]:
similarity("english","mandarin")
totalsum: %d 0.9999999999999971
totalsum: %d 0
Out[42]:
2.8865798640254113e-15
In [43]:
similarity("italian","mandarin")
totalsum: %d 0.9999999999999971
totalsum: %d 0
Out[43]:
2.8865798640254113e-15

Notice how they are the same distance apart...

In [47]:
similarity("italian","seasons.")
totalsum: %d 0.7928850367461084
totalsum: %d 0
Out[47]:
0.23207704043424576

Conclusion

In conclusion just by using simple statistical algorithms on one blog post I have shown that computers can infer a great deal of infromation ranging from the concepts of money, development, languages, nations to grammatical concepts such as plurality.