Word2Vec (0): From Principles to Implementation
This post is an outline of notes compiled in Notion, so it only gives a high-level summary of each topic.
Prerequisites
Language models in NLP
Huffman tree
Introduction
Two network architectures
Continuous Bag of Words (CBOW) & Softmax
At each step, CBOW feeds the $n$ words before and after the target word $w_t$ into the network to predict $w_t$; a minimal PyTorch sketch follows the notation list below.
- $V$: the vocabulary size
- $N$: the embedding dimension
- $W$: the input-side matrix, of size $V \times N$
- each row is an $N$-dimensional vector
- $\text{v}_{w_i}$ is the row representing the input word $w_i$
- $W'$: the output-side matrix, of size $N \times V$
- each column is an $N$-dimensional vector
- $\text{v}'_{w_j}$ is the $j$-th column of $W'$, representing $w_j$
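To make the two matrices concrete, here is a minimal PyTorch sketch of this structure (the names `CBOW`, `vocab_size`, `embed_dim`, `in_embed`, `out_proj` are illustrative, not taken from the original C code): $W$ is an embedding table with one row per word, and $W'$ is a linear projection back onto the vocabulary.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Minimal CBOW sketch: average the context embeddings, project to the vocabulary."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # W: V x N input-side matrix, one row per word (v_{w_i})
        self.in_embed = nn.Embedding(vocab_size, embed_dim)
        # W': N x V output-side matrix (nn.Linear stores its transpose internally)
        self.out_proj = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, context):                  # context: (batch, 2n) word indices
        h = self.in_embed(context).mean(dim=1)   # (batch, N), average of the context vectors
        return self.out_proj(h)                  # (batch, V) scores fed to the softmax
```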
CBOW Objective Function
$J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space p(w_t \: | \: w_{t-n} , \cdots , w_{t-1}, w_{t+1}, \cdots , w_{t+n})$
where
- $T$: the number of words in the training corpus
- $n$: the window size
- $w_t$: the CBOW target (center) word
- $w_i$: word $i$ in the vocabulary $V$
- $\text{v}_{w_i}$: the row vector of $W$ that represents $w_i$
- $\text{v}'_{w_i}$: the column vector of $W'$ that represents $w_i$
- $h$: the hidden-layer output, i.e. the average of the input context word vectors, $\cfrac{1}{C}(\text{v}_{w_1} + \text{v}_{w_2}+ \cdots + \text{v}_{w_C})^\top$
Because there are multiple context words, the input vector is constructed by averaging each word's distributed representation; the explicit form of $p$ is written out below.
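Written out, the conditional probability inside this objective is the standard softmax over the vocabulary, scoring $h$ against every output vector:

$p(w_t \: | \: w_{t-n} , \cdots , w_{t-1}, w_{t+1}, \cdots , w_{t+n}) = \dfrac{\text{exp}({h^\top \text{v}'_{w_t}})}{\sum_{w_i \in V} \text{exp}({h^\top \text{v}'_{w_i}})}$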
Skip-gram & Softmax
Skip-gram uses the center word to predict the surrounding words, instead of using the surrounding words to predict the center word.
Skip-gram Objective Function
$J_\theta = \frac{1}{T}\sum\limits_{t=1}^T \sum\limits_{-n \leq j \leq n,\, j \neq 0} \text{log} \space p(w_{t+j} \: | \: w_t)$
where
- $n$: the window size
- $w_{t+j}$: the $j$-th context word, i.e. the skip-gram target
- $w_t$: the center word, i.e. the skip-gram input
- skip-gram likewise has two independent embedding matrices, $W$ and $W'$
- $W$: $V \times N$, where $V$ is the vocabulary size and $N$ is the vector dimension
- output matrix $W'$: $N \times V$, whose columns encode words in their role as context
- $\text{v}'_{w_i}$ is the column vector of $W'$ representing word $w_i$
- $h$ is the hidden layer's output
In fact, skip-gram has no real hidden layer: since the input is a single word, $h$ is simply the word embedding $\text{v}_{w_t}$ of the word $w_t$ in $W$.
Therefore
$p(w_{t+j} \: | \: w_t ) = \dfrac{\text{exp}({\text{v}^\top_{w_t} \text{v}'_{w_{t+j}}})}{\sum_{w_i \in V} \text{exp}({\text{v}^\top_{w_t} \text{v}'_{w_i}})}$
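As a rough illustration, a minimal PyTorch sketch of skip-gram with the full softmax might look as follows (names are illustrative; see the implementation posts listed later for a complete version):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    """Minimal skip-gram sketch: the center word's embedding acts as h."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)           # W  (V x N)
        self.out_proj = nn.Linear(embed_dim, vocab_size, bias=False)  # W' (N x V)

    def forward(self, center):        # center: (batch,) word indices
        h = self.in_embed(center)     # (batch, N); no real hidden layer
        return self.out_proj(h)       # (batch, V) logits over the vocabulary

# Training against one context word is plain cross-entropy (softmax + log-likelihood):
# loss = F.cross_entropy(model(center), context_word)
```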
Two optimizations of the loss function
With the original softmax as the objective function, the denominator requires taking the inner product of $h$ with every word $w_i$ in the vocabulary, which is a computational bottleneck.
The word2vec paper therefore adopts two optimization schemes for the objective function: Hierarchical Softmax and Negative Sampling.
Hierarchical Softmax
For the derivation, see Word2Vec (2): The Math Behind Hierarchical Softmax.
Hierarchical softmax builds a full binary tree over the vocabulary so the model avoids computing a normalization over all words.
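As a reminder of the form derived there (in the notation of Mikolov et al., with $\text{v}'_{n}$ now attached to inner nodes of the tree rather than to words), the probability of a word becomes a product of sigmoids along its root-to-leaf path, so the per-word cost drops from $O(V)$ to $O(\log V)$:

$p(w \: | \: h) = \prod\limits_{j=1}^{L(w)-1} \sigma\Big( [\![\, n(w, j+1) = \text{ch}(n(w,j)) \,]\!] \cdot h^\top \text{v}'_{n(w,j)} \Big)$

where $L(w)$ is the length of the path to $w$, $n(w,j)$ is the $j$-th node on that path, $\text{ch}(\cdot)$ is a fixed child (say the left one), and $[\![ x ]\!]$ is $1$ if $x$ is true and $-1$ otherwise.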
Negative Sampling
For the derivation, see Word2Vec (3): The Math Behind Negative Sampling.
Negative sampling simplifies things even further, because it focuses on learning high-quality word embeddings rather than modeling the likelihood of the correct word in natural language.
- In the implementation, negative sampling achieves this by having each training sample update only a small fraction of the weights rather than all of them, as in the loss sketch below.
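A minimal sketch of the resulting per-sample loss in PyTorch, assuming the center-word vector comes from $W$ ($syn0$) and the positive/negative output vectors come from $W'$ ($syn1neg$); the noise-distribution sampling itself is omitted:

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(center_vec, pos_out_vec, neg_out_vecs):
    """center_vec:   (batch, N)    center-word vectors from W  (syn0)
       pos_out_vec:  (batch, N)    true context-word vectors from W' (syn1neg)
       neg_out_vecs: (batch, K, N) vectors of K sampled noise words from W' (syn1neg)"""
    # log sigma(h . v'_pos) for the observed (positive) pair
    pos_loss = F.logsigmoid((center_vec * pos_out_vec).sum(dim=1))            # (batch,)
    # sum_k log sigma(-h . v'_neg_k) for the K negative samples
    neg_scores = torch.bmm(neg_out_vecs, center_vec.unsqueeze(2)).squeeze(2)  # (batch, K)
    neg_loss = F.logsigmoid(-neg_scores).sum(dim=1)                           # (batch,)
    # maximizing the log-likelihood = minimizing its negative
    return -(pos_loss + neg_loss).mean()
```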
Implementing Word2Vec
- skip-gram + softmax: Word2Vec (4): PyTorch Implementation of Word2Vec with Softmax
- CBOW + softmax: Word2Vec (4): PyTorch Implementation of Word2Vec with Softmax
- CBOW + hierarchical softmax: Word2Vec (5): PyTorch Implementation of CBOW with Hierarchical Softmax
- CBOW + negative sampling
- skip-gram + hierarchical softmax
- skip-gram + negative sampling: Word2Vec (6): PyTorch Implementation of Skip-gram with Negative Sampling
Conclusion
Both skip-gram and CBOW actually train two embedding matrices, $W$ and $W'$.
- $W$: called $syn0$ in the C implementation.
- $W'$:
- called $syn1$ when using hierarchical softmax
- called $syn1neg$ when using negative sampling
By construction, $syn0$ and $syn1neg$ are two sides of the same coin: each of them can serve as the word embedding, and either can be used (see the snippet below).
In contrast, $syn1$, the $W'$ used with hierarchical softmax, is tied to the non-leaf nodes of the Huffman tree, so it carries no direct meaning for an individual word $w_i$.
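For example, with a PyTorch model shaped like the sketches above (the attribute names `in_embed` / `out_proj` are the illustrative ones used earlier, not a fixed API), either matrix can be exported as the final embedding table:

```python
import torch.nn as nn

def export_word_vectors(model: nn.Module):
    """Assumes `model` is an instance of the SkipGram / CBOW sketch above."""
    input_vectors = model.in_embed.weight.data    # W  (syn0), shape (V, N)
    output_vectors = model.out_proj.weight.data   # W' (syn1neg); nn.Linear keeps it as (V, N)
    # either matrix works as the word embedding; some implementations also average the two
    return input_vectors, output_vectors
```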
Reference
On word embeddings - Part 1 https://ruder.io/word-embeddings-1/
On word embeddings - Part 2: Approximating the Softmax https://ruder.io/word-embeddings-softmax/
Learning Word Embedding https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html#cross-entropy
Other
- Rong, X. (2014). word2vec Parameter Learning Explained, 1–21. Retrieved from http://arxiv.org/abs/1411.2738
- Language Models, Word2Vec, and Efficient Softmax Approximations https://rohanvarma.me/Word2Vec/
C source code
- Word2Vec源码解析 https://www.cnblogs.com/neopenx/p/4571996.html
Applications
- 小白看Word2Vec的正确打开姿势|全部理解和应用 https://zhuanlan.zhihu.com/p/120148300
- 推荐系统从零单排系列(六)—Word2Vec优化策略层次Softmax与负采样 https://zhuanlan.zhihu.com/p/66417229