
Saturday, 17 August 2019

Playing with Word2Vec

In a lecture by Yann LeCun, he mentions that word2vec embeddings can be used to learn analogy properties. I want to see if any other natural language concepts are modeled by word2vec embeddings.

A word2vec embedding is a vector, on the order of 100 elements, that represents a word. The embedding is the hidden layer of a network that, for each pair of words, predicts the probability that they appear close to each other in a large corpus. An example of an analogy:

London is to England what Paris is to France.
When people have looked at word2vec embeddings, they have found that the following property holds approximately:
w2v(london) - w2v(england) ≈ w2v(paris) - w2v(france)
The difference is roughly the same for most capital-country pairs. That's quite interesting! Let's see if we can find some other interesting properties.

I use a dataset of 43,000 words from Kaggle. Many of the words are capitalized without being proper nouns. After converting everything to lower case and removing duplicates (keeping only the first occurrence in the list), about 38,000 words remain.
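
The experiments below all boil down to a few helpers along these lines (a rough sketch rather than the exact code; the file name, file format and helper names are placeholders):

import numpy as np

# Assumed format: one word per line followed by its vector components,
# separated by spaces (adjust the parsing to the actual file).
def load_vectors(path):
    w2v = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = parts[0].lower()
            if word in w2v:                        # keep only the first occurrence
                continue
            vec = np.array(parts[1:], dtype=np.float32)
            w2v[word] = vec / np.linalg.norm(vec)  # store unit-length vectors
    return w2v

def similarity(a, b):
    # cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(w2v, vec, n=10, exclude=()):
    # rank the whole vocabulary by cosine similarity to a query vector
    scored = ((similarity(vec, v), w) for w, v in w2v.items() if w not in exclude)
    return sorted(scored, reverse=True)[:n]

w2v = load_vectors("word-vectors.txt")   # placeholder file name
# The analogy property above predicts this similarity to be relatively high:
print(similarity(w2v["london"] - w2v["england"], w2v["paris"] - w2v["france"]))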

Relation
First, wouldn't it be great if this were true?
w2v(london) - w2v(england) = w2v(capital)
Looking at the closest elements of london - england:
paris: 0.345
tokyo: 0.321
brussels: 0.315
mayfair: 0.314
uptown: 0.303
nairobi: 0.302
jakarta: 0.302
budapest: 0.299
amsterdam: 0.294
OK, that didn't work. Taking england - london doesn't work either. The similarity to "capital" is only 0.15.
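
For reference, the lookup above is just a nearest-neighbor query on the difference vector, using the sketch helpers from before:

diff = w2v["london"] - w2v["england"]
for score, word in nearest(w2v, diff, exclude=("london", "england")):
    print(f"{word}: {score:.3f}")
print(similarity(diff, w2v["capital"]))   # only about 0.15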

Hypernym
A hypernym of a set of words is a word that describes the set as a whole. For example:
Cutlery is a hypernym of knife, fork, and spoon
Color is a hypernym of red, green, and blue
Can we shake out the hypernym with word vectors?
hypernym(words) = mean([w2v(word) for word in words])
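
In code this is just an average of the word vectors followed by a nearest-neighbor lookup (a sketch on top of the helpers from earlier; note that the input words themselves are not filtered out):

def hypernym(w2v, words, n=10):
    # average the (unit-length) word vectors and rank the vocabulary against the mean
    mean = np.mean([w2v[w] for w in words], axis=0)
    return nearest(w2v, mean, n=n)
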
hypernym(["red", "green", "blue"])
yellow: 0.459
purple: 0.440
orange: 0.436
pink: 0.433
brown: 0.418
purple: 0.418
white: 0.407
gray: 0.396
colored: 0.387
crimson: 0.380
maroon: 0.372
pink: 0.370
color: 0.355
Not too bad! 'Color' shows up among the top results, though only at the bottom of the list.

w2v.hypernym(["mercedes", "jeep", "ford"])  
1 mercedes: 0.615
2 ford: 0.535
3 jeep: 0.514
4 car: 0.511
5 jeep: 0.505
6 sedan: 0.490
7 jaguar: 0.470
8 buick: 0.468
9 vehicle: 0.458
10 mustang: 0.448
Great! 'Car' is the top answer, apart from the three input words themselves.

w2v.hypernym(["bed", "table", "chair"])                                                                                                                                           
1 bed: 0.499
2 chair: 0.474
3 table: 0.410
4 sofa: 0.384
5 couch: 0.378
6 divan: 0.359
7 beds: 0.349
8 chairs: 0.334
9 footstool: 0.333
10 daybed: 0.328
I wanted to see 'furniture' here, but no dice.

Reduce: 'country'
What if we subtract more than one vector from another? Can we peel back layers of meaning this way? Example:
The word 'country' has several meanings. It can be a music genre, it can be a synonym for 'nation', and it can refer to land outside of cities. If we remove the 'land' and 'nation' contexts, we should be left thinking of music.
For the implementation, I tried two approaches: simply subtracting away vectors, and projecting away components. Both worked about equally well. The word vector is normalized after each reduction, so the scale stays comparable throughout.
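
A sketch of the reduce helper, again built on the earlier ones; the project flag switches between the two approaches:

def reduce(w2v, word, contexts, n=10, project=False):
    vec = w2v[word]
    for c in contexts:
        if project:
            # project away the component along the (unit-length) context vector
            vec = vec - np.dot(vec, w2v[c]) * w2v[c]
        else:
            # or simply subtract the context vector
            vec = vec - w2v[c]
        vec = vec / np.linalg.norm(vec)   # renormalize after every reduction
    # skip the word itself and the already-reduced contexts in the ranking
    return nearest(w2v, vec, n=n, exclude=[word] + list(contexts))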

w2v.reduce("country", [])                                                                                                                                                         
2 nation: 0.724
3 world: 0.598
4 globe: 0.514
5 america: 0.486
6 countries: 0.482
7 national: 0.478
8 abroad: 0.472
9 republic: 0.453
10 europe: 0.453

Without any context to remove, the strongest associations to 'country' are in the sense of 'nation'.

w2v.reduce("country", ["nation"])                                                                                                                                                 
2 abroad: 0.445
3 countryside: 0.374
4 homeland: 0.363
5 border: 0.347
6 province: 0.347
7 villages: 0.346
8 rumanians: 0.336
9 europe: 0.335
10 thailand: 0.335

And with a heap of reductions we can indeed force it to think only of music!

w2v.reduce("country", ["land", "nation", "rural", "abroad", "province", "europe"])                                                                                                
2 genres: 0.211
3 carreer: 0.207
4 singer: 0.206
5 musician: 0.177
6 catatonia: 0.175
7 promoter: 0.173
8 roped: 0.169
9 duet: 0.166
10 crooning: 0.165
11 vocalist: 0.164

Reduce: 'house'

Can we do the same for 'house'? It is also a music genre, and it too has more than one meaning.

w2v.reduce("house", [])                                                                                                                                                           
2 senate: 0.702
3 bill: 0.541
4 appropriations: 0.520
5 congress: 0.511
6 commons: 0.487
7 congressional: 0.486
8 senators: 0.483
9 lawmakers: 0.473
10 conferees: 0.447

I make a rule to always reduce the top word.
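
So the context list grows by one top word per step, roughly like this (a sketch using the reduce helper above):

contexts = []
for _ in range(7):
    top_score, top_word = reduce(w2v, "house", contexts)[0]
    contexts.append(top_word)
print(contexts)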

w2v.reduce("house", ["senate"])                                                                                                                                                   
2 manor: 0.461
3 sanctuary: 0.414
4 lodge: 0.409
5 lounge: 0.400
6 gardens: 0.388
7 club: 0.388
8 palace: 0.381
9 gate: 0.377
10 factory: 0.375

... repeating this a number of times ...

w2v.reduce("house", ["senate", "manor", "factory", "lords", "society", "rowdy", "sanctuary"])                                                                                     
2 ways: 0.155
3 true: 0.133
4 diet: 0.132
5 networks: 0.122
6 inset: 0.120
7 r: 0.116
8 feat: 0.114
9 signature: 0.113
10 crestfallen: 0.110

Yes! At least we get 'feat' (short for featuring) and 'true' in there.

Alignment
Can the embeddings be used to sort things? Example:
Since the sun is bigger than a car, 'sun' should be closer to 'big' than 'car' is. 
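
The alignment helper just sorts the candidate words by their cosine similarity to the reference word (again a sketch on top of the earlier helpers):

def alignment(w2v, word, candidates):
    # sort the candidates by similarity to the reference word, most similar first
    scored = [(similarity(w2v[word], w2v[c]), c) for c in candidates]
    return sorted(scored, reverse=True)
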
This is admittedly quite a long shot for such a simple model, and it turns out to be very wrong:

w2v.alignment("big", ["sun", "planet", "car", "dog", "ant"])                                                                                   
1 car: 0.120
2 ant: 0.100
3 dog: 0.099
4 planet: 0.079
5 sun: 0.022

w2v.alignment("healthy", ["carrot", "burger", "wine", "juice", "cake"])                                                                                         
1 juice: 0.221
2 carrot: 0.161
3 cake: 0.084
4 burger: 0.054
5 wine: 0.025

Probably what we are seeing here is mostly how often people talk about these objects in the context of being "big" or "healthy".

Conclusion

Word embeddings can yield surprisingly relevant results with very simple models. However, it is important to remember that they were generated by a simple optimization that only looks at local context, without any deeper model of that context. This becomes apparent when we try to make them perform more abstract tasks.