In this notebook I will explore the similarity between two words in a quantitative manner. In particular, I will take a specific blog post and show what information the algorithm can infer from it.
# Let's create a database
import sqlite3
conn = sqlite3.connect("word-relations.db")
db = conn.cursor()
Now let's explore how we can build a description of a word. Take the word "at", for instance. Given the sentence "I ate at pizzahut yesterday", we can describe the word "at" by its neighbours like so:
"at" : {
"left" : ["I", "ate"],
"right" : ["pizzahut", "yesterday"]
}
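The neighbour lists above can be produced with a small helper. This is a sketch of my own (the function name `neighbours` is not part of the notebook's code):

```python
def neighbours(sentence, target):
    # Split the sentence on spaces and locate the target word.
    words = sentence.split(" ")
    index = words.index(target)
    # Everything before the target is a left neighbour,
    # everything after it is a right neighbour.
    return {"left": words[:index], "right": words[index + 1:]}

neighbours("I ate at pizzahut yesterday", "at")
# {'left': ['I', 'ate'], 'right': ['pizzahut', 'yesterday']}
```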
Let us create a database table to store this information.
db.execute("CREATE TABLE IF NOT EXISTS wordrelation"
"(word text, right text, occurences INTEGER)")
Time to populate this table. For this task I chose a short story found on the blog https://432m.wordpress.com/ and stored it in the file story.txt.
storyfile = open("/volumes/bigdata/story.txt","r", encoding="utf-8")
story = storyfile.read()
After that, I broke the story into individual sentences.
sentences = story.split(".\n")
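On a small sample string (the names `sample` and `parts` are mine, and the text is made up, not taken from the story), the split behaves like this:

```python
# Splitting on ".\n" separates sentences that each end a line;
# note the trailing period survives on the final piece.
sample = "I ate at pizzahut yesterday.\nIt was great.\nThe end."
parts = sample.split(".\n")
# ['I ate at pizzahut yesterday', 'It was great', 'The end.']
```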
Using the output sentences I populated the database.
for sentence in sentences:
    words = sentence.split(" ")
    for index in range(0, len(words) - 1):
        # every word after the current one is a "right" neighbour
        right = words[index + 1:]
        for word in right:
            #find out whether the relation existed before
            db.execute("SELECT * FROM wordrelation WHERE word=? AND right=?",
                       (words[index], word))
            rows = db.fetchall()
            if rows:
                db.execute("UPDATE wordrelation SET occurences=? "
                           "WHERE word=? AND right=?",
                           (rows[0][2] + 1, words[index], word))
            else:
                db.execute("INSERT INTO wordrelation VALUES (?,?,?)",
                           (words[index], word, 1))
conn.commit()
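The select-then-update-or-insert pattern above can be collapsed into a single UPSERT statement. This is a sketch of my own, not what the notebook does: it assumes the table is created with a `UNIQUE(word, right)` constraint (which the notebook's table lacks) and requires SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
db = conn.cursor()
# Same columns as before, plus a UNIQUE constraint so that
# ON CONFLICT has something to trigger on.
db.execute("CREATE TABLE wordrelation"
           "(word TEXT, right TEXT, occurences INTEGER,"
           " UNIQUE(word, right))")

def record(word, right):
    # Insert the pair with a count of 1, or bump the existing count.
    db.execute("INSERT INTO wordrelation VALUES (?,?,1)"
               " ON CONFLICT(word, right)"
               " DO UPDATE SET occurences = occurences + 1",
               (word, right))

record("ate", "at")
record("ate", "at")
count = db.execute("SELECT occurences FROM wordrelation"
                   " WHERE word='ate' AND right='at'").fetchone()[0]
# count is now 2
```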
Now that we have populated the database, we can explore things like the similarity of words. (A disk I/O error occurred here because I put my hand on the cable; it seems to be faulty. Either way, we can still use whatever the database contains for the next part.)
We can use this database to determine the similarity between two words. One way of measuring the distance between distributions is the Bhattacharyya distance, defined as: $$D_{Bhattacharyya} = -\ln\left(\sum_x \sqrt{p(x)q(x)}\right)$$
where $p(x)$ and $q(x)$ are the probability distributions of a word occurring before or after the word in question. To find the similarity between two words I use the following metric: $$S(w_1,w_2) = D_{Bhattacharyya}(w_{1,left},w_{2,left}) + D_{Bhattacharyya}(w_{1,right},w_{2,right})$$
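A quick worked example of the distance on two toy distributions. This helper is my own sketch, not the notebook's implementation; unlike the notebook's code below, it returns infinity (rather than 0) when the distributions share no outcomes at all:

```python
import math

def bhattacharyya(p, q):
    # Sum sqrt(p(x) * q(x)) over the outcomes the two distributions share.
    bc = sum(math.sqrt(p[x] * q[x]) for x in p.keys() & q.keys())
    # No shared outcomes means zero overlap: infinite distance.
    return -math.log(bc) if bc > 0 else float("inf")

p = {"a": 0.5, "b": 0.5}
bhattacharyya(p, p)                     # 0: identical distributions
bhattacharyya(p, {"a": 0.5, "c": 0.5})  # ln(2), about 0.693: half the mass overlaps
```

Identical distributions give a distance of 0 (the sum is 1 and $-\ln 1 = 0$), and the distance grows as the overlap shrinks.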
import math #needed for sqrt and log
def similarity(word1, word2):
    return (bhattacharya(buildLeftDist(word1), buildLeftDist(word2))
            + bhattacharya(buildRightDist(word1), buildRightDist(word2)))
def bhattacharya(distribution1, distribution2):
    commonkeys = set(distribution1.keys()) & set(distribution2.keys())
    totalsum = 0
    #compare the probability distributions
    for i in commonkeys:
        totalsum += math.sqrt(distribution1[i] * distribution2[i])
    print("totalsum: %f" % totalsum)
    if totalsum > 0:
        return -math.log(totalsum)
    else:
        return 0
def buildLeftDist(word):
    return buildDist(word, "left")

def buildRightDist(word):
    return buildDist(word, "right")

def buildDist(word, direction):
    if direction == "left":
        db.execute("SELECT * FROM wordrelation WHERE right=? COLLATE NOCASE", (word,))
    elif direction == "right":
        db.execute("SELECT * FROM wordrelation WHERE word=? COLLATE NOCASE", (word,))
    rows = db.fetchall()
    total = 0
    distribution = {}
    for row in rows:
        total += row[2]
        if direction == "left":
            distribution[row[0]] = row[2]
        if direction == "right":
            distribution[row[1]] = row[2]
    for key in distribution:
        distribution[key] = distribution[key] / total
    return distribution
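The normalisation step at the end of `buildDist` can be checked in isolation. A sketch with made-up counts (the helper name `normalise` and the sample words are mine):

```python
def normalise(counts):
    # Divide each raw count by the total so the values form
    # a probability distribution that sums to 1.
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

dist = normalise({"pizzahut": 3, "home": 1})
# {'pizzahut': 0.75, 'home': 0.25}
```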
The following are some of the results. A lower output means a smaller distance, i.e. more similarity; an output of 0, however, means the two words share no neighbours at all. Notice how the algorithm has learned the concept of countries from the text.
similarity("China","Singapore")
similarity("India","china")
similarity("India","Indonesia")
Apparently the algorithm has learnt about the third world vs. the first world too...
similarity("Singapore","Expensive")
similarity("Indonesia","poverty")
The algorithm also seems to have associated the concept of money:
similarity("poverty","expensive")
The author clearly wrote about leaving China, not Singapore:
similarity("Singapore","Leaving")
similarity("China","Leaving")
Clearly plurality is a concept the algorithm picked up:
similarity("four","seasons.")
similarity("four","season.")
similarity("one","seasons.")
On the concept of seasons:
similarity("Spring","white")
similarity("winters","white")
similarity("autumn","yellow")
similarity("winter","yellow")
similarity("car","driver")
Oh, where did the Olympics happen, Beijing or Delhi?
similarity("beijing","olympic")
similarity("dehli","olympic") #clearly no correlation between the two
Which year did they take place, 2006 or 2008?
similarity("2008","olympic")
similarity("2006","olympic")
Concept of languages:
similarity("english","mandarin")
similarity("italian","mandarin")
Notice how they are the same distance apart...
similarity("italian","seasons.")
In conclusion, just by using simple statistical algorithms on one blog post, I have shown that computers can infer a great deal of information, ranging from the concepts of money, development, languages, and nations to grammatical concepts such as plurality.