W-shingling is a very popular and straightforward technique for text mining. It can be used for classification, clustering, and other problems. It was developed by Andrei Zary Broder, a distinguished scientist at Google.
Example Use Case for W-Shingling
Suppose you have a document and a library of documents, and you want to determine which document in the library is most similar to the one you already have.
The original document (used as the train set in the code below) contains the following phrase:
the brown dog and the white horse are friends
And similar phrases in the other documents are as follows (numbered for reference):
1. the brown cow and the white horse are in the field eating the green grass during the day in the summer
2. the yellow and the white flowers close to the brown dog
3. brown dogs are my favorite ones
Although it appears that the first and the second sentences are the most similar to the original/test document, let’s see how this can be determined from an algorithmic perspective.
To start with, we have to define test and train sets.
import numpy as np
train_set = "the brown dog and the white horse are friends"
test_set = ("the brown cow and the white horse are in the field eating the green grass during the day in the summer",
"the yellow and the white flowers close to the brown dog",
"brown dogs are my favorite ones")
The w-shingling algorithm works by splitting both the train and test sets into sets of contiguous word sequences of a fixed length, called shingles, with one shingle starting at each word in the string. (The following code and resulting output illustrate how the shingles are created.)
First, we tokenize the train set.
shingleLength = 3  # number of words in each shingle
tokens = train_set.split()
# build the list of shingles for the train set
A = [tuple(tokens[i:i+shingleLength]) for i in range(len(tokens) - shingleLength + 1)]
print("A=", A)
a = set(A)
Output:
A= [('the', 'brown', 'dog'), ('brown', 'dog', 'and'), ('dog', 'and', 'the'), ('and', 'the', 'white'), ('the', 'white', 'horse'), ('white', 'horse', 'are'), ('horse', 'are', 'friends')]
Then, we tokenize the test set:
B = [None] * len(test_set)  # one list of shingles per phrase in the test set
for j in range(len(test_set)):
    tokens = test_set[j].split()
    # build the list of shingles for each string in the test set
    B[j] = [tuple(tokens[i:i+shingleLength]) for i in range(len(tokens) - shingleLength + 1)]

print("B[0]=", B[0])
print()
print("B[1]=", B[1])
print()
print("B[2]=", B[2])
Output:
B[0]= [('the', 'brown', 'cow'), ('brown', 'cow', 'and'), ('cow', 'and', 'the'), ('and', 'the', 'white'), ('the', 'white', 'horse'), ('white', 'horse', 'are'), ('horse', 'are', 'in'), ('are', 'in', 'the'), ('in', 'the', 'field'), ('the', 'field', 'eating'), ('field', 'eating', 'the'), ('eating', 'the', 'green'), ('the', 'green', 'grass'), ('green', 'grass', 'during'), ('grass', 'during', 'the'), ('during', 'the', 'day'), ('the', 'day', 'in'), ('day', 'in', 'the'), ('in', 'the', 'summer')]
B[1]= [('the', 'yellow', 'and'), ('yellow', 'and', 'the'), ('and', 'the', 'white'), ('the', 'white', 'flowers'), ('white', 'flowers', 'close'), ('flowers', 'close', 'to'), ('close', 'to', 'the'), ('to', 'the', 'brown'), ('the', 'brown', 'dog')]
B[2]= [('brown', 'dogs', 'are'), ('dogs', 'are', 'my'), ('are', 'my', 'favorite'), ('my', 'favorite', 'ones')]
At this point we have defined the shingles A for the train set and the list B containing the shingles of each test phrase.
Next, we define the distance between set A and each element of B. There are many ways to do this; w-shingling uses the resemblance, which is the Jaccard similarity of the two shingle sets:
$$r(A, B_i) = \frac{N(A \cap B_i)}{N(A \cup B_i)},$$
where N is the number of elements in the set.
import numpy as np

R = np.empty(len(test_set))
for j in range(len(test_set)):
    b = set(B[j])
    # resemblance: size of the intersection divided by size of the union
    R[j] = len(a & b) / len(a | b)
    print(a & b)
    print(len(a & b))
    print()

print(R)
print(R.argmax())
Output:
{('the', 'white', 'horse'), ('white', 'horse', 'are'), ('and', 'the', 'white')}
3
{('the', 'brown', 'dog'), ('and', 'the', 'white')}
2
set()
0
[0.13043478 0.14285714 0.        ]
1
We can see that the second element of B (index 1) has the highest score; according to w-shingling, that phrase is the most similar to the train set A.
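As a side note, the whole procedure can be wrapped in a couple of small helper functions. The sketch below is only a minimal illustration of the idea, not code from the original walkthrough; the names shingles and resemblance are ours, and it reuses the train_set and test_set defined above.

import numpy as np

def shingles(text, shingle_length=3):
    # split a string into the set of its word shingles of the given length
    tokens = text.split()
    return set(tuple(tokens[i:i + shingle_length])
               for i in range(len(tokens) - shingle_length + 1))

def resemblance(a, b):
    # Jaccard resemblance between two shingle sets
    return len(a & b) / len(a | b)

# score every phrase in the test set against the train phrase
train_shingles = shingles(train_set)
R = np.array([resemblance(train_shingles, shingles(phrase)) for phrase in test_set])
print(R)           # same scores as the step-by-step version above
print(R.argmax())  # 1 -> the second phrase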
In our next example we use the same train and test sets, but with different punctuation. This time each phrase (in both the train and test sets) starts with a capital letter and most end with a period; this seemingly minor change will have a big effect on the result.
import numpy as np

train_set = "The brown dog and the white horse are friends."
test_set = ("The brown cow and the white horse are in the field eating the green grass during the day in the summer.",
"The yellow and the white flowers close to the brown dog.",
"Brown dogs are my favorite ones")  # same phrases, now with capitalization and punctuation; this will affect the results!

shingleLength = 3
tokens = train_set.split()
# build the list of shingles for the new train set
A = [tuple(tokens[i:i+shingleLength]) for i in range(len(tokens) - shingleLength + 1)]

B = [None] * len(test_set)
for j in range(len(test_set)):
    tokens = test_set[j].split()
    B[j] = [tuple(tokens[i:i+shingleLength]) for i in range(len(tokens) - shingleLength + 1)]
And then,
import numpy as np

R = np.empty(len(test_set))
a = set(A)
for j in range(len(test_set)):
    b = set(B[j])
    # resemblance of the train set and the j-th test phrase
    R[j] = len(a & b) / len(a | b)
    print(a & b)
    print(len(a & b))
    print()

print(R)
print(R.argmax())
Output:
{('and', 'the', 'white'), ('the', 'white', 'horse'), ('white', 'horse', 'are')}
3
{('and', 'the', 'white')}
1
set()
0
[0.13043478 0.06666667 0.        ]
0
Now the first element of B has the highest score, so it is selected as the phrase most similar to A.
Important Use Note
These examples show how important it is to apply consistent rules for spelling, punctuation, capitalization, etc. when using the w-shingling algorithm. Any such difference makes otherwise identical shingles look distinct, which changes the results. For example, by default w-shingling determines that “dog” and “dog.” (one with a period and one without) are two different words.
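One common way to avoid this is to normalize the text before building shingles, for example by lowercasing it and stripping punctuation. The snippet below is a minimal sketch of such a preprocessing step, using Python's standard string.punctuation; it is just one option, and stemming, stop-word removal, or a proper tokenizer may be preferable depending on the task.

import string

def normalize(text):
    # lowercase the text and delete all punctuation characters before shingling
    return text.lower().translate(str.maketrans("", "", string.punctuation))

print(normalize("The brown dog and the white horse are friends."))
# the brown dog and the white horse are friends

Applied to both the train and test phrases, this normalization makes the second example produce exactly the same scores as the first one.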
You can read more about w-shingling here. Find out more about DataRobot’s text mining capabilities by navigating to our public documentation portal.