Word2Vec (3): The Math Behind Negative Sampling
The following uses the Skip-gram model as the running example.
- $V$: the vocabulary size
- $N$: the embedding dimension
- $W$: the $V \times N$ input-side matrix
    - each row is an $N$-dimensional vector
    - $\text{v}_{w_i}$ is the representation of the input word $w_i$
- $W'$: the $N \times V$ output-side matrix
    - each column is an $N$-dimensional vector
    - $\text{v}'_{w_j}$ is the $j$-th column of $W'$, representing word $w_j$
Noise Contrastive Estimation (NCE)
NCE attempts to approximately maximize the log probability of the softmax output
- The Noise Contrastive Estimation objective differentiates the target word from noise samples using a logistic regression classifier
Starting from cross entropy
The true label $y_i$ is 1 only when $w_i$ is the output word:
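With a one-hot label vector over the vocabulary, the cross-entropy loss for a single training pair reduces to the negative log probability of the output word:
$\mathcal{L}_{\theta} = -\sum_{i=1}^V y_i \log p(w_i \vert w_I) = -\log p(w_O \vert w_I)$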
Also, by definition of the softmax,
$p(w_O \vert w_I) = \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$
- $\text{v}'_{w_i}$ is the vector of word $w_i$ in $W'$
- $w_O$ is the output word in $V$
- $w_I$ is the input word in $V$
Substituting the softmax into the cross-entropy loss:
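$\mathcal{L}_{\theta} = -\log \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})} = -{\text{v}'_{w_O}}^{\top} \text{v}_{w_I} + \log \sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})$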
Compute the gradient of the loss function w.r.t. the model's parameters $\theta$. Let $z_{IO} = {\text{v}'_{w_O}}^{\top}{\text{v}_{w_I}}$ and $z_{Ii} = {\text{v}'_{w_i}}^{\top}{\text{v}_{w_I}}$:
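Since $\mathcal{L}_{\theta} = -z_{IO} + \log \sum_{i=1}^V \exp(z_{Ii})$, differentiating gives:
$\nabla_{\theta}\mathcal{L}_{\theta} = -\nabla_{\theta} z_{IO} + \sum_{i=1}^V \frac{\exp(z_{Ii})}{\sum_{j=1}^V \exp(z_{Ij})} \nabla_{\theta} z_{Ii} = -\nabla_{\theta} z_{IO} + \sum_{i=1}^V p(w_i \vert w_I)\, \nabla_{\theta} z_{Ii}$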
We can see that the gradient $\nabla_{\theta}\mathcal{L}_{\theta}$ consists of two parts:
- a positive reinforcement for the target word $w_O$: $\nabla_{\theta}z_{IO}$
- a negative reinforcement for all other words $w_i$, weighted by their probability: $\sum_{i=1}^V p(w_i \vert w_I) \nabla_\theta z_{Ii}$
The second term is simply the expectation of the gradient $\nabla_{\theta}z_{Ii}$ taken over all words $w_i$ in $V$.
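In other words,
$\sum_{i=1}^V p(w_i \vert w_I)\, \nabla_{\theta} z_{Ii} = \mathbb{E}_{w_i \sim p(w_i \vert w_I)}\big[\nabla_{\theta} z_{Ii}\big]$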
The probability distribution $Q(\tilde{w})$ can be seen as the distribution of the noise samples.
The idea behind NCE sampling
Looking at the gradient of the loss function $\nabla_{\theta}\mathcal{L}_{\theta}$: given an input word $w_I$, the correct output word $w$ is known, and we can also draw $M$ samples $\tilde{w}_1, \tilde{w}_2, \dots, \tilde{w}_M \sim Q$ from the noise distribution $Q$ to approximate the second part of the cross-entropy gradient.
Now we have a correct output word $w$ and $M$ words $\tilde{w}$ sampled from $Q$. Suppose we have a binary classifier that tells us, for each word in hand, whether it is the correct one, $p(d=1 \vert w, w_I)$, or a negative one, $p(d=0 \vert \tilde{w}, w_I)$.
The loss function can then be rewritten as:
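With one positive pair and $M$ noise samples drawn from $Q$ (the usual NCE setup, which weights the noise term by $M$), the objective can be written as:
$\mathcal{L}_{\theta} = -\Big[\log p(d=1 \vert w, w_I) + M\, \mathbb{E}_{\tilde{w} \sim Q}\big[\log p(d=0 \vert \tilde{w}, w_I)\big]\Big]$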
According to the law of large numbers, $E_{p(x)}[f(x)] \approx \frac{1}{n} \sum^{n}_{i=1} f(x_i)$, we can simplify:
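Replacing the expectation with the average over the drawn samples, $M\, \mathbb{E}_{\tilde{w} \sim Q}\big[\log p(d=0 \vert \tilde{w}, w_I)\big] \approx M \cdot \frac{1}{M}\sum_{j=1}^{M} \log p(d=0 \vert \tilde{w}_j, w_I)$, gives:
$\mathcal{L}_{\theta} \approx -\Big[\log p(d=1 \vert w, w_I) + \sum_{j=1}^{M} \log p(d=0 \vert \tilde{w}_j, w_I)\Big]$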
$p(d \vert w, w_I)$ can be derived from the joint probability $p(d, w \vert w_I)$:
$p(d, w \vert w_I) =
\begin{cases}
\frac{1}{M+1} p(w \vert w_I) & \text{if } d=1 \\
\frac{M}{M+1} q(\tilde{w}) & \text{if } d=0
\end{cases}$
- $d$ is a binary label
- $M$ is the number of noise samples: we have one correct output word $w$ and sample $M$ words from the distribution $Q$
Because $p(d \vert w, w_I) = \frac{p(d, w, w_I)}{p(w, w_I)} = \frac{p(d, w \vert w_I)\, p(w_I)}{p(w \vert w_I)\, p(w_I)} = \frac{p(d, w \vert w_I)}{\sum_d p(d, w \vert w_I)}$,
we can obtain:
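$p(d=1 \vert w, w_I) = \frac{p(w \vert w_I)}{p(w \vert w_I) + M q(w)}$
$p(d=0 \vert \tilde{w}, w_I) = \frac{M q(\tilde{w})}{p(\tilde{w} \vert w_I) + M q(\tilde{w})}$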
- $q(\tilde{w})$ is the probability of sampling the word $\tilde{w}$ from the distribution $Q$
The final NCE loss function
Substituting $p(w_O \vert w_I) = \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ into $\mathcal{L}_{\theta}$:
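Writing $Z(w)$ for the softmax normalizer $\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})$, so that $p(w \vert w_I) = \exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) / Z(w)$, the loss becomes:
$\mathcal{L}_{\theta} \approx -\Big[\log \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) / Z(w)}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) / Z(w) + M q(w)} + \sum_{j=1}^{M} \log \frac{M q(\tilde{w}_j)}{\exp({\text{v}'_{\tilde{w}_j}}^{\top} \text{v}_{w_I}) / Z(w) + M q(\tilde{w}_j)}\Big]$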
Note that the normalizer $Z(w) = {\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ still requires summing over the entire vocabulary. In practice, it is usually assumed that $Z(w) \approx 1$, so $\mathcal{L}_\theta$ simplifies to:
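With every $Z(w)$ dropped:
$\mathcal{L}_{\theta} \approx -\Big[\log \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + M q(w)} + \sum_{j=1}^{M} \log \frac{M q(\tilde{w}_j)}{\exp({\text{v}'_{\tilde{w}_j}}^{\top} \text{v}_{w_I}) + M q(\tilde{w}_j)}\Big]$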
About the noise distribution $Q$
When designing the noise distribution $Q$, two properties are usually considered:
- it should intuitively be very similar to the real data distribution.
- it should be easy to sample from.
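In the original word2vec implementation, $Q$ is taken to be the unigram distribution raised to the 3/4 power, which balances the two requirements above. Below is a minimal Python sketch of building and sampling from such a noise distribution; the toy `word_counts` and all variable names are illustrative assumptions, not code from the references:

```python
import numpy as np

# Illustrative sketch: build the noise distribution Q used by word2vec in
# practice -- the unigram distribution raised to the 3/4 power -- and draw
# M noise samples from it. `word_counts` is a hypothetical toy corpus count.
word_counts = {"the": 100, "cat": 12, "sat": 8, "zebra": 1}

words = np.array(list(word_counts.keys()))
freqs = np.array(list(word_counts.values()), dtype=np.float64)

# Q(w) ∝ count(w)^0.75: flattening the unigram distribution gives rare words
# a better chance of being drawn as noise samples.
q = freqs ** 0.75
q /= q.sum()

rng = np.random.default_rng(0)
M = 5
noise_samples = rng.choice(words, size=M, p=q)  # w̃_1 ... w̃_M ~ Q
print(noise_samples)
```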
Negative Sampling (NEG)
Negative sampling can be seen as an approximation to NCE
- Noise Contrastive Estimation attempts to approximately maximize the log probability of the softmax output
- The objective of NEG is to learn high-quality word representations rather than achieving low perplexity on a test set, as is the goal in language modeling
Starting from NCE
In the NCE $p(d \vert w, w_I)$ equations, $p(d=1 \vert w, w_I)$ is the probability that the word $w$ is the output word, and $p(d=0 \vert w, w_I)$ is the probability that it is a sampled noise word.
NCE assumes that the denominator $Z(w) = {\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ in $p(w_O \vert w_I) = \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ equals 1, which simplifies the two probabilities to:
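$p(d=1 \vert w, w_I) = \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + M q(w)}, \qquad p(d=0 \vert \tilde{w}, w_I) = \frac{M q(\tilde{w})}{\exp({\text{v}'_{\tilde{w}}}^{\top} \text{v}_{w_I}) + M q(\tilde{w})}$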
NEG simplifies further
NEG further assumes $Mq(\tilde{w}) = 1$, so the expression becomes:
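$p(d=1 \vert w, w_I) = \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + 1} = \sigma({\text{v}'_{w}}^{\top} \text{v}_{w_I})$
and similarly $p(d=0 \vert \tilde{w}, w_I) = 1 - \sigma({\text{v}'_{\tilde{w}}}^{\top} \text{v}_{w_I}) = \sigma(-{\text{v}'_{\tilde{w}}}^{\top} \text{v}_{w_I})$.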
Finally, we obtain the loss function:
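$\mathcal{L}_{\theta} = -\Big[\log \sigma({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + \sum_{j=1}^{M} \log \sigma(-{\text{v}'_{\tilde{w}_j}}^{\top} \text{v}_{w_I})\Big]$
This matches the skip-gram negative-sampling objective derived in Goldberg & Levy (2014).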
The first term is the positive sample $p(d=1 \vert w, w_I)$; the second term covers the $M$ negative samples $p(d=0 \vert \tilde{w}_j, w_I)$.
For Skip-gram with negative sampling:
- $\text{v}_{w_I}$ is the vector of the input center word $w_I$, taken from $W$
- $\text{v}'_{w}$ is the vector of a context word $w$ on the output side, taken from $W'$
In practice, one unit of Skip-gram training data contains only a single pair of one center word and one context word, $(w_I, w_{C,j})$.
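As a rough illustration, below is a minimal PyTorch-style sketch of this loss for a single $(w_I, w_{C,j})$ pair plus $M$ noise words. The embedding tables `in_embed` and `out_embed` stand in for $W$ and $W'$; all names and sizes are assumptions for the sketch, not the implementation from the linked post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, N, M = 10000, 128, 5          # vocab size, embedding dim, noise samples
in_embed = nn.Embedding(V, N)    # W : input-side vectors  v_{w}
out_embed = nn.Embedding(V, N)   # W': output-side vectors v'_{w}

def neg_loss(center_id, context_id, noise_ids):
    """Skip-gram negative-sampling loss for one (center, context) pair."""
    v_wI = in_embed(center_id)       # (N,)    v_{w_I}
    v_ctx = out_embed(context_id)    # (N,)    v'_{w}
    v_noise = out_embed(noise_ids)   # (M, N)  v'_{w̃_j}

    pos = F.logsigmoid(torch.dot(v_ctx, v_wI))   # log σ(v'_w ⋅ v_{w_I})
    neg = F.logsigmoid(-(v_noise @ v_wI)).sum()  # Σ_j log σ(-v'_{w̃_j} ⋅ v_{w_I})
    return -(pos + neg)

loss = neg_loss(torch.tensor(3), torch.tensor(17), torch.randint(0, V, (M,)))
loss.backward()
```

In a real training loop the pair and noise indices would come in batches, but the structure of the loss is the same.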
See Word2Vec (6): Implementing Skipgram with Negative Sampling in PyTorch.
Conclusion
- The actual goal of NEG differs from Hierarchical Softmax: it does not directly learn the likelihood of the correct word, but instead directly learns the word embeddings, i.e., better distributed representations of words.
References
- Learning Word Embedding https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html#loss-functions
    - explains negative sampling from the Skip-gram perspective
- On word embeddings - Part 2: Approximating the Softmax https://ruder.io/word-embeddings-softmax/index.html#samplingbasedapproaches
    - explains negative sampling from the CBOW perspective
- Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method, (2), 1–5. Retrieved from http://arxiv.org/abs/1402.3722
- Word2Vec Tutorial Part 2 - Negative Sampling http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/