I remember riding the bus from Ithaca down to the office of an AI startup at Union Square the summer BERT came out, trying to train a scikit-learn classifier on my laptop to classify Word2Vec embeddings on tasks like part of speech, "is this a color word?", etc., and finding that I just couldn't. With a tiny number of examples (<10) it might seem to work, but as soon as I added more examples it would break.
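Roughly the kind of probe I mean, as a sketch (the word lists and labels here are made up for illustration, not my actual data):

    # Probe: can a linear classifier read "is this a color word?" off Word2Vec vectors?
    # Word lists are illustrative; gensim downloads the pretrained Google News vectors.
    import gensim.downloader as api
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    vectors = api.load("word2vec-google-news-300")

    color_words = ["red", "green", "blue", "yellow", "purple", "orange"]
    other_words = ["dog", "run", "idea", "table", "quickly", "seven"]
    words = color_words + other_words

    X = [vectors[w] for w in words]
    y = [1] * len(color_words) + [0] * len(other_words)

    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, y, cv=3))  # looks fine with a handful of words,
                                             # falls apart as the list grows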
My conclusion was that "Word2Vec sucks". Probably a lot of people tried the same thing and either came to that conclusion or thought they had done something wrong. People don't usually publish negative results, so I've never read about anybody doing it. It takes bravery. Great work!
The diagrams on this page are a disgrace, to my mind:
https://nlp.stanford.edu/projects/glove/
What it comes down to is that they are projecting down from an N=50 space to an N=2 space. You have a lot of dimensions to play with, so if you have, say, 20 points, you can find some projection where those points land wherever you want, even if they were just a random point cloud.
It's really a lie, because if they tried to map 100 cities to their ZIP codes it wouldn't work at all; that's exactly what I found when I tried to build classifiers.
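To see how cheap those layouts are, here's a sketch: take 20 random 50-d points and solve for a 50x2 projection that puts them wherever you like. Least squares finds an essentially exact fit, because the system is underdetermined:

    # With more dimensions than points, a linear projection can place a random
    # point cloud at arbitrary 2-d positions.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 50))                # 20 random "embeddings" in 50-d
    targets = rng.uniform(-1, 1, size=(20, 2))   # any 2-d layout you want

    # Solve X @ W = targets for the 50x2 projection matrix W.
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)

    print(np.abs(X @ W - targets).max())  # ~1e-14: the "structure" is a free fit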
The main reason the vector arithmetic of Word2Vec worked is how it was trained: a shallow network trained directly on the embeddings, so that the entire knowledge of the model is contained within the embeddings themselves. This is not the case with any modern embedding model.
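For reference, the classic demo (a sketch using gensim's pretrained Google News vectors; the exact neighbors depend on which model you load):

    # king - man + woman ~= queen, straight off the original Word2Vec vectors.
    import gensim.downloader as api

    vectors = api.load("word2vec-google-news-300")
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # "queen" is typically the top hit (the input words are excluded from results).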
About the most you can do with current models is average embeddings together.
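For example, averaging sentence embeddings and comparing by cosine similarity still behaves reasonably (a sketch assuming sentence-transformers; the model name is just a common example):

    # Averaging modern sentence embeddings -- about the only "arithmetic" left.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(["the weather is sunny", "the forecast says rain"])

    avg = emb.mean(axis=0)
    avg /= np.linalg.norm(avg)             # renormalize the averaged vector

    query = model.encode("what will the weather be like tomorrow?")
    query /= np.linalg.norm(query)
    print(float(avg @ query))              # cosine similarity of query vs. the average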