University of Cambridge > Talks.cam > NLIP Seminar Series > Understanding Word Embeddings

Understanding Word Embeddings

Add to your list(s) Download to your calendar using vCal

If you have a question about this talk, please contact Tamara Polajnar.

Neural word embeddings, such as word2vec (Mikolov et al., 2013), have become increasingly popular in both academic and industrial NLP . These methods attempt to capture the semantic meanings of words by processing huge unlabeled corpora with methods inspired by neural networks and the recent onset of Deep Learning. The result is a vectorial representation of every word in a low-dimensional continuous space. These word vectors exhibit interesting arithmetic properties (e.g. king – man + woman = queen) (Mikolov et al., 2013), and seemingly outperform traditional vector-space models of meaning inspired by Harris’s Distributional Hypothesis (Baroni et al., 2014). Our work attempts to demystify word embeddings, and understand what makes them so much better than traditional methods at capturing semantic properties.

Our main result shows that state-of-the-art word embeddings are actually “more of the same”. In particular, we show that skip-grams with negative sampling, the latest algorithm in word2vec, is implicitly factorizing a word-context PMI matrix, which has been thoroughly used and studied in the NLP community for the past 20 years. We also identify that the root of word2vec’s perceived superiority can be attributed to a collection of hyperparameter settings. While these hyperparameters were thought to be unique to neural-network-inspired embedding methods, we show that they can, in fact, be ported to traditional distributional methods, significantly improving their performance. Among our qualitative results is a method for interpreting these seemingly-opaque word-vectors, and the answer to why king – man + woman = queen.

Based on joint work with Yoav Goldberg and Ido Dagan.

This talk is part of the NLIP Seminar Series series.

Tell a friend about this talk:

This talk is included in these lists:

Note that ex-directory lists are not shown.

 

© 2006-2024 Talks.cam, University of Cambridge. Contact Us | Help and Documentation | Privacy and Publicity