I remember riding the bus from Ithaca down to the office of an AI startup at Union Square the summer BERT came out, trying to train a scikit-learn classifier on my laptop to classify Word2Vec embeddings on tasks like part of speech, "is this a color word?", etc., and finding that I just couldn't. With a tiny number of examples (<10) it might seem to work, but as soon as I added more examples it would break.
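Roughly the kind of probe I mean, as a sketch (the word lists and labels here are made up for illustration, not my actual data):

    # Probe: can a linear classifier read "is this a color word?" off Word2Vec vectors?
    # Word lists are illustrative; gensim downloads the pretrained Google News vectors.
    import gensim.downloader as api
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    vectors = api.load("word2vec-google-news-300")

    color_words = ["red", "green", "blue", "yellow", "purple", "orange"]
    other_words = ["dog", "run", "idea", "table", "quickly", "seven"]
    words = color_words + other_words

    X = [vectors[w] for w in words]
    y = [1] * len(color_words) + [0] * len(other_words)

    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, y, cv=3))  # looks fine with a handful of words,
                                             # falls apart as the list grows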
My conclusion was that "Word2Vec sucks". Probably a lot of people tried the same thing and either came to that conclusion or thought they had done something wrong. People don't usually publish negative results, so I've never read about anybody doing it. It takes bravery. Great work!
The diagrams on this page are a disgrace, to my mind:
https://nlp.stanford.edu/projects/glove/
What it comes down to is that they are projecting down from an N=50 space to an N=2 space. You have a lot of dimensions to play with, so if you have, say, 20 points, you can find some projection where those points land wherever you want, even if they were just a random point cloud.
It's really a lie, because if they tried to map 100 cities to their ZIP codes it wouldn't work at all; that's exactly what I found when I tried to build classifiers.
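To see how cheap those layouts are, here's a sketch: take 20 random 50-d points and solve for a 50x2 projection that puts them wherever you like. Least squares finds an essentially exact fit, because the system is underdetermined:

    # With more dimensions than points, a linear projection can place a random
    # point cloud at arbitrary 2-d positions.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 50))                # 20 random "embeddings" in 50-d
    targets = rng.uniform(-1, 1, size=(20, 2))   # any 2-d layout you want

    # Solve X @ W = targets for the 50x2 projection matrix W.
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)

    print(np.abs(X @ W - targets).max())  # ~1e-14: the "structure" is a free fit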
The main reason the vector arithmetic of Word2Vec worked is how it was trained: a shallow network trained directly on the embeddings, so that the entire knowledge of the model is contained within the embeddings themselves. This is not the case with any modern embedding model.
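For reference, the classic demo (a sketch using gensim's pretrained Google News vectors; the exact neighbors depend on which model you load):

    # king - man + woman ~= queen, straight off the original Word2Vec vectors.
    import gensim.downloader as api

    vectors = api.load("word2vec-google-news-300")
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # "queen" is typically the top hit (the input words are excluded from results).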
About the most you can do with current models is average embeddings together.
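For example, averaging sentence embeddings and comparing by cosine similarity still behaves reasonably (a sketch assuming sentence-transformers; the model name is just a common example):

    # Averaging modern sentence embeddings -- about the only "arithmetic" left.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(["the weather is sunny", "the forecast says rain"])

    avg = emb.mean(axis=0)
    avg /= np.linalg.norm(avg)             # renormalize the averaged vector

    query = model.encode("what will the weather be like tomorrow?")
    query /= np.linalg.norm(query)
    print(float(avg @ query))              # cosine similarity of query vs. the average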