Daniel Rodriguez

word2vec in yhat: Word vector similarity

A few weeks ago Google released word2vec, some code that converts words to vectors. The company I am currently working at does something similar, and I was quite amazed by the performance and accuracy of Google's algorithm, so I created a simple Python wrapper that calls the C code for training and reads the trained vectors into numpy arrays. You can check it out on PyPI (word2vec).

Around the same time I found out about yhat via Twitter, and after reading their blog I had to try their product. What they do is simple but very useful: take a Python (scikit-learn) or R classifier and create a REST endpoint to make predictions on new data. The product is still in beta, but the guys were very responsive and helped me solve some of my issues.

The only restriction I hit is yhat's limit of 50 MB per classifier for free accounts, which in this particular case is not enough, so I had to reduce the vector size from the default (100) to 25 and keep only 70k vectors. The results in the app below are therefore a little limited, but they are very similar.
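The trimming itself is just slicing: word2vec writes the vocabulary sorted by frequency, so keeping the first 70k rows keeps the most common words. A minimal sketch with stand-in data (the names, shapes, and limit below are placeholders, not the real arrays):

```python
import numpy as np

# Stand-in data: the real vocabulary has many more entries
# and the limit would be 70000
vocab = np.array(['the', 'of', 'and', 'one'])
vecs = np.random.rand(4, 25)

limit = 2
vocab, vecs = vocab[:limit], vecs[:limit]
print(vocab.shape, vecs.shape)  # -> (2,) (2, 25)
```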

Training

Using my word2vec wrapper is as simple as downloading and unzipping the text8 file (link) and running:

from word2vec import word2vec
word2vec('text8', 'text8-25.vec', size=25)

This creates a file (text8-25.vec) with the vectors, which can be loaded into numpy. Again, with my word2vec wrapper it is really simple:

from word2vec import WordVectors
vectors = WordVectors('text8-25.vec')
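For reference, the .vec file is plain text in the standard word2vec output format: a header line with the vocabulary size and vector dimension, then one word plus its components per line. A minimal sketch of parsing it by hand (the wrapper does roughly this; the two-word file here is a made-up stand-in):

```python
import numpy as np
from io import StringIO

# Tiny stand-in for a .vec file: "vocab_size dim" header,
# then "word c1 c2 ..." per line
raw = StringIO(u"2 3\nking 1.0 2.0 3.0\nqueen 1.5 2.5 3.5\n")

n_words, dim = [int(x) for x in raw.readline().split()]
words = []
vecs = np.empty((n_words, dim))
for i in range(n_words):
    parts = raw.readline().split()
    words.append(parts[0])
    vecs[i] = [float(x) for x in parts[1:]]

print(words)       # -> ['king', 'queen']
print(vecs.shape)  # -> (2, 3)
```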

Yhat

Then I just need to create a yhat model and, in the predict method, calculate the distance between the vectors. That code is also included in my word2vec package using scipy's cosine distance (example); in this case I just used numpy's linalg.norm.
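The two distances behave differently: cosine ignores vector length and only compares direction, while the euclidean norm used here also takes magnitude into account. A quick comparison on two toy vectors:

```python
import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])

print(cosine(a, b))           # 1 - a.b / (|a| |b|) -> 0.5
print(np.linalg.norm(a - b))  # sqrt(2) ~= 1.414
```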

import numpy as np
from yhat import BaseModel

class Word2VecCLF(BaseModel):
    def transform(self, request):
        return request

    def predict(self, request):
        '''Return the n closest words to the requested word.'''
        target = request['word']
        n = request.get('n', 10)

        words = self.words
        vectors = self.vectors

        # Index of the target word in the vocabulary
        target_ix = np.where(words == target)[0][0]

        # Euclidean distance from the target vector to every vector
        distances = np.linalg.norm(vectors - vectors[target_ix], axis=1)
        distances[target_ix] = np.inf  # exclude the word itself

        closest = np.argsort(distances)[:n]
        return {'distances': distances[closest].tolist(),
                'words': words[closest].tolist()}
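To sanity-check the distance logic locally before deploying, the same computation can be run on a toy vocabulary (the words and vectors below are made up, not real trained vectors):

```python
import numpy as np

words = np.array(['king', 'queen', 'man', 'woman'])
vectors = np.array([[1.0, 1.0],
                    [0.9, 1.1],
                    [1.0, 0.0],
                    [0.9, 0.1]])

target_ix = np.where(words == 'king')[0][0]
dists = np.linalg.norm(vectors - vectors[target_ix], axis=1)
dists[target_ix] = np.inf  # exclude the target itself
order = np.argsort(dists)[:2]
print(words[order])  # -> ['queen' 'woman']
```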

Then I just need to deploy it to yhat:

from yhat import Yhat

# word2vec_clf is an instance of Word2VecCLF with the
# words and vectors attributes set
yh = Yhat("EMAIL", "TOKEN")
yh.deploy("word2vec", word2vec_clf)

If everything goes fine you have a REST endpoint you can call.
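Calling it is a plain JSON POST with the payload the predict method expects. A sketch (the URL and auth here are placeholders; the real ones come from the yhat dashboard, so the network call is left commented out):

```python
import json

payload = json.dumps({'word': 'king', 'n': 10})

# With requests (needs the real endpoint, so commented out):
# import requests
# r = requests.post('https://HOST/word2vec', data=payload,
#                   auth=('EMAIL', 'TOKEN'),
#                   headers={'Content-Type': 'application/json'})
# r.json() -> {'distances': [...], 'words': [...]}

print(json.loads(payload)['word'])  # -> king
```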

Example

I built a simple app using AngularJS. Just type any word and the number of close word vectors you want, and click the button. In the list it generates you can click on any word to get the neighbors for that word.


Conclusions

These are definitely some interesting new technologies and tools to keep an eye on. Thanks to Google for open sourcing the code and thanks to yhat for a good product. I had to do something similar a few weeks ago, and my solution was to use ZMQ to connect the REST endpoint with the actual classifier; yhat makes that possible in 5% of the time.

Found some interesting relations? Let me know in the comments below.