Word2Vec (3): The Math Behind Negative Sampling

Skip-gram is used as the example throughout, with the following notation:

  • $V$: the vocabulary size
  • $N$: the embedding dimension
  • $W$: the input-side matrix of size $V \times N$
    • each row is an $N$-dimensional vector
    • $\text{v}_{w_i}$, the $i$-th row, is the representation of the input word $w_i$
  • $W'$: the output-side matrix of size $N \times V$
    • each column is an $N$-dimensional vector
    • $\text{v}'_{w_j}$, the $j$-th column of $W'$, is the representation of $w_j$
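A minimal PyTorch sketch of these two matrices (the sizes $V$, $N$, the name `W_out`, and the index below are illustrative assumptions; `W_out` stores the columns of $W'$ as rows):

```python
import torch
import torch.nn as nn

V, N = 10_000, 300           # hypothetical vocabulary size and embedding dimension

W = nn.Embedding(V, N)       # input-side matrix W: row i is v_{w_i}
W_out = nn.Embedding(V, N)   # output-side vectors: row j here is column j of W'

w_I = torch.tensor([42])                # hypothetical index of the input word
v_wI = W(w_I)                           # shape (1, N): v_{w_I}
scores = v_wI @ W_out.weight.T          # shape (1, V): v'_{w_j}^T v_{w_I} for every w_j
```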

Noise Contrastive Estimation (NCE)

NCE attempts to approximately maximize the log probability of the softmax output

  • Noise Contrastive Estimation differentiates the target word from noise samples using a logistic regression classifier

Starting from cross entropy

The true label $y_i$ is 1 only when $w_i$ is the output word $w_O$; the model's predicted probability of the output word is the softmax:

$p(w_O \vert w_I) = \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$

  • $\text{v}'_{w_i}$ is the vector of word $w_i$ in $W'$
  • $w_O$ is the output word in $V$
  • $w_I$ is the input word in $V$
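With one-hot labels, the cross-entropy loss reduces to the negative log-likelihood of the output word (written out here for completeness):

$\mathcal{L}_{\theta} = -\sum_{i=1}^{V} y_i \log p(w_i \vert w_I) = -\log p(w_O \vert w_I)$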

After substituting the softmax into this loss, we compute the gradient of the loss function with respect to the model's parameters $\theta$. Let $z_{IO} = {\text{v}'_{w_O}}^{\top}{\text{v}_{w_I}}$ and $z_{Ii} = {\text{v}'_{w_i}}^{\top}{\text{v}_{w_I}}$.
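Written in this notation, the loss and its gradient are (a standard log-softmax derivation):

$\mathcal{L}_{\theta} = -z_{IO} + \log \sum_{i=1}^{V} \exp(z_{Ii})$

$\nabla_{\theta}\mathcal{L}_{\theta} = -\nabla_{\theta} z_{IO} + \sum_{i=1}^{V} p(w_i \vert w_I)\, \nabla_{\theta} z_{Ii}$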

We can see that the gradient $\nabla_{\theta}\mathcal{L}_{\theta}$ consists of two parts:

  1. a positive reinforcement for the target word $w_O$: $\nabla_{\theta}z_{IO}$
  2. a negative reinforcement for every other word $w_i$, weighted by its probability: $\sum_{i=1}^V p(w_i \vert w_I) \nabla_\theta z_{Ii}$

The second term is just the expectation of the gradient $\nabla_{\theta}z_{Ii}$ over all words $w_i$ in $V$, i.e. $\mathbb{E}_{w_i \sim p(w_i \vert w_I)}\left[\nabla_{\theta}z_{Ii}\right]$.

A probability distribution $Q(\tilde{w})$ over noise samples can then be introduced to approximate this expectation.

How NCE sampling works

Starting from the gradient of the loss function $\nabla_{\theta}\mathcal{L}_{\theta}$ above: given an input word $w_I$, the correct output word is known to be $w$, and we can additionally draw $M$ samples $\tilde{w}_1, \tilde{w}_2, \dots, \tilde{w}_M \sim Q$ from the noise distribution $Q$ to approximate the second half of the cross-entropy gradient.

We now have one correct output word $w$ and $M$ words $\tilde{w}$ sampled from $Q$. Suppose we have a binary classifier that tells us, for each word in hand, whether it is the correct word, $p(d=1 \vert w, w_I)$, or a negative sample, $p(d=0 \vert \tilde{w}, w_I)$.

The loss function can then be rewritten as:
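In the standard NCE formulation, this is the negative log-likelihood of classifying the true pair as positive and the noise samples as negative:

$\mathcal{L}_{\theta} = -\left[\log p(d=1 \vert w, w_I) + M\, \mathbb{E}_{\tilde{w} \sim Q}\big[\log p(d=0 \vert \tilde{w}, w_I)\big]\right]$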

According to the law of large numbers, $E_{p(x)} [ f(x)] \approx \frac{1}{n} \sum^{n}_{i=1}f(x_i)$, we can replace the expectation with the $M$ drawn samples:
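$\mathcal{L}_{\theta} \approx -\left[\log p(d=1 \vert w, w_I) + \sum_{i=1}^{M} \log p(d=0 \vert \tilde{w}_i, w_I)\right]$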

$p(d \vert w, w_I)$ can be derived from the joint probability $p(d, w \vert w_I)$:

  1. $p(d, w | w_I) =
    \begin{cases}
    \frac{1}{M+1} p(w \vert w_I) & \text{if } d=1 \\
    \frac{M}{M+1} q(\tilde{w}) & \text{if } d=0
    \end{cases}$

    • $d$ is a binary label
    • $M$ is the number of noise samples: we have one correct output word $w$ and sample $M$ words from the distribution $Q$
  2. Since $p(d \vert w, w_I) = \frac{p(d, w, w_I)}{p(w, w_I)} = \frac{p(d, w \vert w_I)\, p(w_I)}{p(w \vert w_I)\, p(w_I)} = \frac{p(d, w \vert w_I)}{\sum_d p(d, w \vert w_I)}$

    we obtain the posteriors shown below.

  • $q(\tilde{w})$ denotes the probability of sampling the word $\tilde{w}$ from the distribution $Q$
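Plugging the joint probabilities into the ratio above gives the binary-classification posteriors:

$p(d=1 \vert w, w_I) = \frac{\frac{1}{M+1} p(w \vert w_I)}{\frac{1}{M+1} p(w \vert w_I) + \frac{M}{M+1} q(w)} = \frac{p(w \vert w_I)}{p(w \vert w_I) + M\, q(w)}$

$p(d=0 \vert \tilde{w}, w_I) = \frac{M\, q(\tilde{w})}{p(\tilde{w} \vert w_I) + M\, q(\tilde{w})}$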

The final NCE loss function

Substitute $p(w_O \vert w_I) = \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ into $\mathcal{L}_{\theta}$.

Note that the normalizer $Z(w_I) = {\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ still requires summing over the entire vocabulary. In practice it is usually assumed that $Z(w_I) \approx 1$, so $\mathcal{L}_\theta$ simplifies to:
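Under that assumption (i.e. $p(w \vert w_I) \approx \exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})$), the standard simplified NCE loss is:

$\mathcal{L}_{\theta} \approx -\left[\log \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + M\, q(w)} + \sum_{i=1}^{M} \log \frac{M\, q(\tilde{w}_i)}{\exp({\text{v}'_{\tilde{w}_i}}^{\top} \text{v}_{w_I}) + M\, q(\tilde{w}_i)}\right]$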

About the noise distribution $Q$

When designing the noise distribution $Q$, the usual considerations are:

  • it should intuitively be very similar to the real data distribution.
  • it should be easy to sample from.
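One common concrete choice, used in the original word2vec implementation, is the unigram distribution raised to the 3/4 power. A minimal NumPy sketch with hypothetical counts:

```python
import numpy as np

# Hypothetical word frequencies for a 5-word vocabulary
counts = np.array([120, 45, 45, 9, 3], dtype=np.float64)

# Unigram distribution raised to the 3/4 power, then renormalized
q = counts ** 0.75
q /= q.sum()

# Draw M = 5 negative samples (word indices) from the noise distribution Q
M = 5
rng = np.random.default_rng(0)
negatives = rng.choice(len(q), size=M, p=q)
```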

Negative Sampling (NEG)

Negative sampling can be seen as an approximation to NCE

  • Noise Contrastive Estimation attempts to approximately maximize the log probability of the softmax output
  • The objective of NEG is to learn high-quality word representations rather than achieving low perplexity on a test set, as is the goal in language modeling

Starting from NCE

In the NCE equation for $p(d \vert w, w_I)$: $p(d=1 \vert w, w_I)$ is the probability that word $w$ is the output word, and $p(d=0 \vert w, w_I)$ is the probability that it is a sampled noise word.

NCE assumes that the denominator $Z(w_I) = {\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ of $p(w_O \vert w_I) = \frac{\exp({\text{v}'_{w_O}}^{\top} \text{v}_{w_I})}{\sum_{i=1}^V \exp({\text{v}'_{w_i}}^{\top} \text{v}_{w_I})}$ equals 1, so the posteriors simplify to:
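$p(d=1 \vert w, w_I) = \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + M\, q(w)}, \qquad p(d=0 \vert \tilde{w}, w_I) = \frac{M\, q(\tilde{w})}{\exp({\text{v}'_{\tilde{w}}}^{\top} \text{v}_{w_I}) + M\, q(\tilde{w})}$

(These are the posteriors derived earlier, with $p(w \vert w_I)$ replaced by $\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})$.)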

NEG simplifies further

NEG additionally assumes $Mq(\tilde{w}) = 1$, so the expression becomes:
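$p(d=1 \vert w, w_I) = \frac{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I})}{\exp({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + 1} = \sigma({\text{v}'_{w}}^{\top} \text{v}_{w_I}), \qquad p(d=0 \vert \tilde{w}, w_I) = 1 - \sigma({\text{v}'_{\tilde{w}}}^{\top} \text{v}_{w_I}) = \sigma(-{\text{v}'_{\tilde{w}}}^{\top} \text{v}_{w_I})$

where $\sigma(\cdot)$ is the sigmoid function.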

Finally, we obtain the loss function:
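$\mathcal{L}_{\theta} = -\left[\log \sigma({\text{v}'_{w}}^{\top} \text{v}_{w_I}) + \sum_{i=1}^{M} \log \sigma(-{\text{v}'_{\tilde{w}_i}}^{\top} \text{v}_{w_I})\right]$

which is the familiar skip-gram negative-sampling objective from the original word2vec paper.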

The first term is the positive sample $p(d=1 \vert w, w_I)$; the latter term covers the $M$ negative samples $p(d=0 \vert \tilde{w}_i, w_I)$.

For skip-gram with negative sampling

  • $\text{v}_{w_I}$ is the vector of the input center word $w_I$, taken from $W$
  • $\text{v}'_{w}$ is the output-side vector of a context word $w$, taken from $W'$

In practice, one skip-gram training example contains a single pair of one center word and one context word, $(w_I, w_{C,j})$.
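A minimal PyTorch sketch of this per-pair loss (the function name `sgns_loss` and all shapes are illustrative assumptions, not the implementation from the follow-up post):

```python
import torch
import torch.nn.functional as F

def sgns_loss(v_center, v_context, v_negatives):
    """Negative-sampling loss for one (w_I, w_{C,j}) pair.

    v_center:    (N,)   v_{w_I}, a row of the input matrix W
    v_context:   (N,)   v'_w, the output-side vector of the context word
    v_negatives: (M, N) v'_{w~_i} for the M sampled noise words
    """
    pos = F.logsigmoid(torch.dot(v_context, v_center))   # log sigma(v'_w . v_{w_I})
    neg = F.logsigmoid(-(v_negatives @ v_center)).sum()  # sum_i log sigma(-v'_{w~_i} . v_{w_I})
    return -(pos + neg)

# Hypothetical shapes: N = 300 embedding dimensions, M = 5 negative samples
loss = sgns_loss(torch.randn(300), torch.randn(300), torch.randn(5, 300))
```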

See Word2Vec (6): Pytorch 實作 Skipgram with Negative Sampling for the full implementation.

Conclusion

  • NEG's actual goal differs from Hierarchical Softmax: it does not directly learn the likelihood of the correct word, but instead learns the word embeddings directly, i.e. better distributed representations of words

