Word2Vec (0): From Principles to Implementation

This post is an outline of notes organized in Notion, so it only gives a high-level summary.

Prerequisites

Introduction

Two Network Architectures

CBOW and Skip-gram

Continuous Bag of Words (CBOW) & Softmax

At each step, CBOW feeds the $n$ words on either side of the target word $w_t$ into the network.

(Figure: CBOW network architecture)

  • $V$: the vocabulary size
  • $N$: the embedding dimension
  • $W$: the input-side matrix, of size $V \times N$
    • each row is an $N$-dimensional vector
    • $\text{v}_{w_i}$ is the representation of the input word $w_i$
  • $W'$: the output-side matrix, of size $N \times V$
    • each column is an $N$-dimensional vector
    • $\text{v}'_{w_j}$ is the $j$-th column of the matrix $W'$, representing $w_j$
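
As a quick illustration of the notation above, here is a minimal NumPy sketch; the sizes and word indices are arbitrary example values, not taken from the original post:

```python
import numpy as np

V, N = 10000, 300              # example vocabulary size and embedding dimension
W = np.random.randn(V, N)      # input-side matrix: row i is v_{w_i}
W_out = np.random.randn(N, V)  # output-side matrix W': column j is v'_{w_j}

i, j = 42, 7                   # arbitrary word indices
v_wi = W[i]                    # input representation of word w_i, shape (N,)
v_out_wj = W_out[:, j]         # output representation of word w_j, shape (N,)
```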

CBOW Objective Function

$J_\theta = \frac{1}{V}\sum\limits_{t=1}^V\ \text{log} \space p(w_t \: | \: w_{t-n} , \cdots , w_{t-1}, w_{t+1}, \cdots , w_{t+n})$

where

  • $n$ is the window size
  • $w_t$ is the CBOW target (center) word
  • $w_i$ is word $i$ in the vocabulary $V$
  • $\text{v}_{w_i}$ is the row vector of $W$ that represents $w_i$
  • $\text{v}'_{w_i}$ is the column vector of $W'$ that represents $w_i$
  • $h$ is the output of the hidden layer: the average of the input context word vectors, $\cfrac{1}{C}(\text{v}_{w_1} + \text{v}_{w_2} + \cdots + \text{v}_{w_C})^\top$

Because there are multiple context words, we construct the input vector by averaging each word's distributed representation.
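
The target word is then predicted with a softmax over the whole vocabulary. The original outline does not spell this formula out, but it is simply the CBOW counterpart of the skip-gram softmax given below:

$p(w_t \: | \: w_{t-n}, \cdots, w_{t+n}) = \dfrac{\text{exp}({h^\top \text{v}'_{w_t}})}{\sum_{w_i \in V} \text{exp}({h^\top \text{v}'_{w_i}})}$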

Skip-gram & Softmax

Skip-gram uses the center word to predict the surrounding words, instead of using the surrounding words to predict the center word.

(Figure: Skip-gram network architecture)

Skip-gram Objective Function

$J_\theta = \frac{1}{V}\sum\limits_{t=1}^V \sum\limits_{-n \leq j \leq n,\ j \neq 0} \text{log} \space p(w_{t+j} \: | \: w_t)$

where

  • $n$ is the window size
  • $w_{t+j}$ is the $j$-th context word of the skip-gram target window
  • $w_t$ is the skip-gram input (center) word
  • skip-gram has two independent embedding matrices $W$ and $W'$
    • $W$: $V \times N$, where $V$ is the vocabulary size and $N$ is the vector dimension
    • output matrix $W'$: $N \times V$, encoding the meaning of words as context
  • $\text{v}'_{w_i}$ is the column vector of word $w_i$ in $W'$
  • $h$ is the output of the hidden layer

In fact, skip-gram has no real hidden layer: since the input is a single word, $h$ is simply the word embedding $\text{v}_{w_t}$ of the word $w_t$ in $W$.

Therefore

$p(w_{t+j} \: | \: w_t ) = \dfrac{\text{exp}({\text{v}^\top_{w_t} \text{v}'_{w_{t+j}}})}{\sum_{w_i \in V} \text{exp}({\text{v}^\top_{w_t} \text{v}'_{w_i}})}$
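
As a rough illustration (not the reference implementation), a full-softmax skip-gram probability can be computed as follows; note the sum over the entire vocabulary in the denominator, which is exactly the bottleneck discussed in the next section:

```python
import numpy as np

def skipgram_softmax_prob(center_id, context_id, W, W_out):
    """p(w_context | w_center) with the full softmax.

    W:      (V, N) input matrix; row t is v_{w_t}
    W_out:  (N, V) output matrix; column j is v'_{w_j}
    """
    h = W[center_id]                    # no hidden layer: h is just v_{w_t}
    scores = h @ W_out                  # v_{w_t}^T v'_{w_i} for every word i, shape (V,)
    scores -= scores.max()              # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the whole vocabulary
    return probs[context_id]
```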

Two Optimizations of the Loss Function

With the original softmax as the objective function, the denominator requires an inner product between the vector $h$ and every word $w_i$ in the vocabulary, which creates a computational bottleneck.

The word2vec paper therefore adopts two optimization schemes for the objective function: Hierarchical Softmax and Negative Sampling.

Hierarchical Softmax

For the derivation, see Word2Vec (2): The Math Behind Hierarchical Softmax.

Hierarchical softmax builds a full binary tree over the vocabulary so that the normalization no longer requires a computation over every word.
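
One common way to write the hierarchical-softmax probability (the standard formulation from the word2vec papers; the full derivation is in the linked post) is:

$p(w_O \: | \: w_I) = \prod\limits_{j=1}^{L(w_O)-1} \sigma\big( d_j \cdot {\text{v}'_{n(w_O, j)}}^\top h \big)$

where $n(w_O, j)$ is the $j$-th inner node on the path from the root to $w_O$, $L(w_O)$ is the length of that path, $d_j \in \{+1, -1\}$ encodes which child the path takes at that node, and $\sigma$ is the sigmoid function. Each prediction then costs $O(\log V)$ instead of $O(V)$.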

Negative Sampling

For the derivation, see Word2Vec (3): The Math Behind Negative Sampling.

Negative sampling simplifies things even further, because the goal is to learn high-quality word embeddings rather than to model the likelihood of the correct word in natural language.

  • In the implementation, negative sampling addresses this by having each training sample modify only a small fraction of the weights, rather than all of them (see the sketch below)
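
A minimal NumPy sketch of one negative-sampling update, assuming $W'$ is stored as an $N \times V$ matrix as above (the function and variable names are illustrative, not taken from the original C code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(h, W_out, target_id, negative_ids, lr=0.025):
    """One SGD step: only the target column and the k sampled negative
    columns of W_out are updated, instead of all V columns."""
    grad_h = np.zeros_like(h)
    for word_id, label in [(target_id, 1.0)] + [(j, 0.0) for j in negative_ids]:
        v_out = W_out[:, word_id]
        g = sigmoid(v_out @ h) - label     # prediction error for this (h, word) pair
        grad_h += g * v_out                # accumulate the gradient w.r.t. h
        W_out[:, word_id] -= lr * g * h    # update only this single column
    return grad_h                          # used by the caller to update the input vector
```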

Implementing Word2Vec

Conclusion

Skip-gram and CBOW both actually train two embedding matrices, $W$ and $W'$:

  • $W$: called $syn0$ in the C implementation.
  • $W'$:
    • called $syn1$ when hierarchical softmax is used
    • called $syn1neg$ when negative sampling is used

By their nature, $syn0$ and $syn1neg$ are two sides of the same coin: each can represent the word embeddings, and either can be used.

In contrast, $syn1$, which represents $W'$ under hierarchical softmax, connects to the non-leaf nodes of the Huffman tree, so by itself it has no direct meaning for a word $w_i$.
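
As a rough illustration of where these matrices surface in practice, here is a minimal gensim sketch; the attribute names (`wv.vectors`, `syn1neg`) follow gensim 4.x and may differ in other versions, and the corpus is a toy example:

```python
from gensim.models import Word2Vec

# Toy corpus; in practice you would stream a real corpus.
sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]

# Skip-gram (sg=1) with negative sampling (negative=5, hs=0).
model = Word2Vec(sentences, vector_size=50, window=2,
                 sg=1, negative=5, hs=0, min_count=1)

syn0 = model.wv.vectors    # the input matrix W, one row per vocabulary word
syn1neg = model.syn1neg    # the output matrix W' used by negative sampling
print(syn0.shape, syn1neg.shape)
```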
