Word2Vec (3): The Math Behind Negative Sampling
The following uses the Skip-gram model as the running example.
- $V$: the vocabulary size
- $N$: the embedding dimension
- $W$: the $V \times N$ input-side matrix
    - each row is an $N$-dimensional vector
    - $\text{v}_{w_i}$ is the representation of the input word $w_i$
- $W'$: the $N \times V$ output-side matrix
    - each column is an $N$-dimensional vector
    - $\text{v}'_{w_j}$ is the $j$-th column of $W'$, representing word $w_j$
Noise Contrastive Estimation (NCE)
NCE attempts to approximately maximize the log probability of the softmax output
- The Noise Contrastive Estimation objective differentiates the target word from noise samples using a logistic regression classifier
Starting from cross entropy
The true label $y_i$ is 1 only when $w_i$ is the output word:
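With a one-hot label vector over the vocabulary, the cross-entropy loss for a single training pair reduces to the negative log probability of the output word:
$\mathcal{L}_{\theta} = -\sum_{i=1}^V y_i \log p(w_i \vert w_I) = -\log p(w_O \vert w_I)$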
Also, by definition of the softmax,
$p(w_O \vert w_I) = \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$
- $\text{v}'_{w_i}$ is the vector of word $w_i$ in $W'$
- $w_O$ is the output word in $V$
- $w_I$ is the input word in $V$
Substituting the softmax into the cross-entropy loss:
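$\mathcal{L}_{\theta} = -\log \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})} = -{\text{v}'_{w_O}}^{\top} \text{v}_{w_I} + \log \sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})$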
Compute the gradient of the loss function w.r.t. the model's parameters $\theta$. Let $z_{IO} = {\text{v}'_{w_O}}^{\top}{\text{v}_{w_I}}$ and $z_{Ii} = {\text{v}'_{w_i}}^{\top}{\text{v}_{w_I}}$:
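Since $\mathcal{L}_{\theta} = -z_{IO} + \log \sum_{i=1}^V \exp(z_{Ii})$, differentiating gives:
$\nabla_{\theta}\mathcal{L}_{\theta} = -\nabla_{\theta} z_{IO} + \sum_{i=1}^V \frac{\exp(z_{Ii})}{\sum_{j=1}^V \exp(z_{Ij})} \nabla_{\theta} z_{Ii} = -\nabla_{\theta} z_{IO} + \sum_{i=1}^V p(w_i \vert w_I)\, \nabla_{\theta} z_{Ii}$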
We can see that the gradient $\nabla_{\theta}\mathcal{L}_{\theta}$ consists of two parts:
- a positive reinforcement for the target word $w_O$: $\nabla_{\theta}z_{IO}$
- a negative reinforcement for all other words $w_i$, weighted by their probability: $\sum_{i=1}^V p(w_i \vert w_I) \nabla_\theta z_{Ii}$
The second term is simply the expectation of the gradient $\nabla_{\theta}z_{Ii}$ taken over all words $w_i$ in $V$.
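In other words,
$\sum_{i=1}^V p(w_i \vert w_I)\, \nabla_{\theta} z_{Ii} = \mathbb{E}_{w_i \sim p(w_i \vert w_I)}\big[\nabla_{\theta} z_{Ii}\big]$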
The probability distribution $Q(\tilde{w})$ can be seen as the distribution of the noise samples.
The idea behind NCE sampling
Looking at the gradient of the loss function $\nabla_{\theta}\mathcal{L}_{\theta}$: given an input word $w_I$, the correct output word $w$ is known, and we can also draw $M$ samples $\tilde{w}_1, \tilde{w}_2, \dots, \tilde{w}_M \sim Q$ from the noise distribution $Q$ to approximate the second part of the cross-entropy gradient.
Now we have a correct output word $w$ and $M$ words $\tilde{w}$ sampled from $Q$. Suppose we have a binary classifier that tells us, for each word in hand, whether it is the correct one, $p(d=1 \vert w, w_I)$, or a negative one, $p(d=0 \vert \tilde{w}, w_I)$.
The loss function can then be rewritten as:
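With one positive pair and $M$ noise samples drawn from $Q$ (the usual NCE setup, which weights the noise term by $M$), the objective can be written as:
$\mathcal{L}_{\theta} = -\Big[\log p(d=1 \vert w, w_I) + M\, \mathbb{E}_{\tilde{w} \sim Q}\big[\log p(d=0 \vert \tilde{w}, w_I)\big]\Big]$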
According to the law of large numbers, $E_{p(x)}[f(x)] \approx \frac{1}{n} \sum^{n}_{i=1} f(x_i)$, we can simplify:
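Replacing the expectation with the average over the drawn samples, $M\, \mathbb{E}_{\tilde{w} \sim Q}\big[\log p(d=0 \vert \tilde{w}, w_I)\big] \approx M \cdot \frac{1}{M}\sum_{j=1}^{M} \log p(d=0 \vert \tilde{w}_j, w_I)$, gives:
$\mathcal{L}_{\theta} \approx -\Big[\log p(d=1 \vert w, w_I) + \sum_{j=1}^{M} \log p(d=0 \vert \tilde{w}_j, w_I)\Big]$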
$p(d \vert w, w_I)$ can be derived from the joint probability $p(d, w \vert w_I)$:
$p(d, w \vert w_I) =
\begin{cases}
\frac{1}{M+1} p(w \vert w_I) & \text{if } d=1 \\
\frac{M}{M+1} q(\tilde{w}) & \text{if } d=0
\end{cases}$
- $d$ is a binary label
- $M$ is the number of noise samples: we have one correct output word $w$ and sample $M$ words from the distribution $Q$
Because $p(d \vert w, w_I) = \frac{p(d, w, w_I)}{p(w, w_I)} = \frac{p(d, w \vert w_I)\, p(w_I)}{p(w \vert w_I)\, p(w_I)} = \frac{p(d, w \vert w_I)}{\sum_d p(d, w \vert w_I)}$,
we can obtain:
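$p(d=1 \vert w, w_I) = \frac{p(w \vert w_I)}{p(w \vert w_I) + M q(w)}$
$p(d=0 \vert \tilde{w}, w_I) = \frac{M q(\tilde{w})}{p(\tilde{w} \vert w_I) + M q(\tilde{w})}$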
- $q(\tilde{w})$ is the probability of sampling the word $\tilde{w}$ from the distribution $Q$
The final NCE loss function
Substituting $p(w_O \vert w_I) = \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ into $\mathcal{L}_{\theta}$:
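Writing $Z(w)$ for the softmax normalizer $\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})$, so that $p(w \vert w_I) = \exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) / Z(w)$, the loss becomes:
$\mathcal{L}_{\theta} \approx -\Big[\log \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) / Z(w)}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) / Z(w) + M q(w)} + \sum_{j=1}^{M} \log \frac{M q(\tilde{w}_j)}{\exp({\text{v}'_{\tilde{w}_j}}^{\top} \text{v}_{w_I}) / Z(w) + M q(\tilde{w}_j)}\Big]$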
Note that the normalizer $Z(w) = {\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ still requires summing over the entire vocabulary. In practice, it is usually assumed that $Z(w) \approx 1$, so $\mathcal{L}_\theta$ simplifies to:
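With every $Z(w)$ dropped:
$\mathcal{L}_{\theta} \approx -\Big[\log \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + M q(w)} + \sum_{j=1}^{M} \log \frac{M q(\tilde{w}_j)}{\exp({\text{v}'_{\tilde{w}_j}}^{\top} \text{v}_{w_I}) + M q(\tilde{w}_j)}\Big]$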
About the noise distribution $Q$
When designing the noise distribution $Q$, two properties are usually considered:
- it should intuitively be very similar to the real data distribution.
- it should be easy to sample from.
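In the original word2vec implementation, $Q$ is taken to be the unigram distribution raised to the 3/4 power, which balances the two requirements above. Below is a minimal Python sketch of building and sampling from such a noise distribution; the toy `word_counts` and all variable names are illustrative assumptions, not code from the references:

```python
import numpy as np

# Illustrative sketch: build the noise distribution Q used by word2vec in
# practice -- the unigram distribution raised to the 3/4 power -- and draw
# M noise samples from it. `word_counts` is a hypothetical toy corpus count.
word_counts = {"the": 100, "cat": 12, "sat": 8, "zebra": 1}

words = np.array(list(word_counts.keys()))
freqs = np.array(list(word_counts.values()), dtype=np.float64)

# Q(w) ∝ count(w)^0.75: flattening the unigram distribution gives rare words
# a better chance of being drawn as noise samples.
q = freqs ** 0.75
q /= q.sum()

rng = np.random.default_rng(0)
M = 5
noise_samples = rng.choice(words, size=M, p=q)  # w̃_1 ... w̃_M ~ Q
print(noise_samples)
```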
Negative Sampling (NEG)
Negative sampling can be seen as an approximation to NCE
- Noise Contrastive Estimation attempts to approximately maximize the log probability of the softmax output
- The objective of NEG is to learn high-quality word representations rather than achieving low perplexity on a test set, as is the goal in language modeling
Starting from NCE
In the NCE $p(d \vert w, w_I)$ equations, $p(d=1 \vert w, w_I)$ is the probability that the word $w$ is the output word, and $p(d=0 \vert w, w_I)$ is the probability that it is a sampled noise word.
NCE assumes that the denominator $Z(w) = {\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ in $p(w_O \vert w_I) = \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ equals 1, which simplifies the two probabilities to:
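$p(d=1 \vert w, w_I) = \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + M q(w)}, \qquad p(d=0 \vert \tilde{w}, w_I) = \frac{M q(\tilde{w})}{\exp({\text{v}'_{\tilde{w}}}^{\top} \text{v}_{w_I}) + M q(\tilde{w})}$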
NEG simplifies further
NEG further assumes $Mq(\tilde{w}) = 1$, so the expression becomes:
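$p(d=1 \vert w, w_I) = \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + 1} = \sigma({\text{v}'_{w}}^{\top} \text{v}_{w_I})$
and similarly $p(d=0 \vert \tilde{w}, w_I) = 1 - \sigma({\text{v}'_{\tilde{w}}}^{\top} \text{v}_{w_I}) = \sigma(-{\text{v}'_{\tilde{w}}}^{\top} \text{v}_{w_I})$.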
Finally, we obtain the loss function:
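$\mathcal{L}_{\theta} = -\Big[\log \sigma({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + \sum_{j=1}^{M} \log \sigma(-{\text{v}'_{\tilde{w}_j}}^{\top} \text{v}_{w_I})\Big]$
This matches the skip-gram negative-sampling objective derived in Goldberg & Levy (2014).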
The first term is the positive sample $p(d=1 \vert w, w_I)$; the second term covers the $M$ negative samples $p(d=0 \vert \tilde{w}_j, w_I)$.
For Skip-gram with negative sampling:
- $\text{v}_{w_I}$ is the vector of the input center word $w_I$, taken from $W$
- $\text{v}'_{w}$ is the vector of a context word $w$ on the output side, taken from $W'$
In practice, one unit of Skip-gram training data contains only a single pair of one center word and one context word, $(w_I, w_{C,j})$.
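As a rough illustration, below is a minimal PyTorch-style sketch of this loss for a single $(w_I, w_{C,j})$ pair plus $M$ noise words. The embedding tables `in_embed` and `out_embed` stand in for $W$ and $W'$; all names and sizes are assumptions for the sketch, not the implementation from the linked post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, N, M = 10000, 128, 5          # vocab size, embedding dim, noise samples
in_embed = nn.Embedding(V, N)    # W : input-side vectors  v_{w}
out_embed = nn.Embedding(V, N)   # W': output-side vectors v'_{w}

def neg_loss(center_id, context_id, noise_ids):
    """Skip-gram negative-sampling loss for one (center, context) pair."""
    v_wI = in_embed(center_id)       # (N,)    v_{w_I}
    v_ctx = out_embed(context_id)    # (N,)    v'_{w}
    v_noise = out_embed(noise_ids)   # (M, N)  v'_{w̃_j}

    pos = F.logsigmoid(torch.dot(v_ctx, v_wI))   # log σ(v'_w ⋅ v_{w_I})
    neg = F.logsigmoid(-(v_noise @ v_wI)).sum()  # Σ_j log σ(-v'_{w̃_j} ⋅ v_{w_I})
    return -(pos + neg)

loss = neg_loss(torch.tensor(3), torch.tensor(17), torch.randint(0, V, (M,)))
loss.backward()
```

In a real training loop the pair and noise indices would come in batches, but the structure of the loss is the same.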
See Word2Vec (6): Implementing Skipgram with Negative Sampling in PyTorch.
Conclusion
- The actual goal of NEG differs from Hierarchical Softmax: it does not directly learn the likelihood of the correct word, but instead directly learns the word embeddings, i.e., better distributed representations of words.
References
- Learning Word Embedding https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html#loss-functions
    - explains negative sampling from the Skip-gram perspective
- On word embeddings - Part 2: Approximating the Softmax https://ruder.io/word-embeddings-softmax/index.html#samplingbasedapproaches
    - explains negative sampling from the CBOW perspective
- Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method, (2), 1–5. Retrieved from http://arxiv.org/abs/1402.3722
- Word2Vec Tutorial Part 2 - Negative Sampling http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/