Word2Vec (2): The Math Behind Hierarchical Softmax
Taking CBOW as the example, the notation is:
- $V$: the vocabulary size
- $N$: the embedding dimension
- $W$: the input-side matrix, which is $V \times N$
- each row is an $N$-dimensional vector
- $\text{v}_{w_i}$ is the representation of the input word $w_i$
- $W'$: the output-side matrix, which is $N \times V$
- $\text{v}'_j$ denotes the $j$-th column vector of $W'$
- In hierarchical softmax, each column of $W'$ represents the vector of a non-leaf node of the Huffman tree rather than a leaf node, so a column vector $\text{v}'_j$ does not correspond directly to a word $w_i$
Hierarchical softmax removes the bottleneck of the original softmax, whose normalization term in the denominator requires an inner product with every word in the vocabulary (the full softmax is written out below).
- $h = \frac{1}{C} \sum^{C}_{c=1}\text{v}_{w_c}$ is the average of the input context words' vector representations in $W$
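In this notation, the full softmax that hierarchical softmax replaces is the standard CBOW output layer:

$p(w_O \mid w_{I,1}, \dots, w_{I,C}) = \frac{\exp({\text{v}'_{w_O}}^{\top} h)}{\sum^{V}_{j=1} \exp({\text{v}'_{w_j}}^{\top} h)}$

Its denominator alone requires all $V$ inner products for every training example.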
Hierarchical softmax builds a full binary tree to avoid computing over the entire vocabulary. Illustration:
- Each leaf node represents one word $w_i$
- Matrix $W'$ is the collection of vectors $\text{v}'_n$ of all non-leaf nodes $n$; it has no one-to-one correspondence with the words $w_i$
Branching at each node of the binary tree is a binary classification problem:
$p(n, \text{left}) = \sigma({\text{v}'_n}^{\top} h)$
$p(n, \text{right}) = 1 - p(n, \text{left}) = \sigma(-{\text{v}'_n}^{\top} h)$
- $\text{v}'_{n}$ is the vector of node $n$
The generation probability $p(w_i)$ of each word $w_O = w_i$ is then the probability of branching from the root node down to its leaf node:
$p(w_i = w_O) = \prod^{L(w_O)-1}_{j=1} \sigma\big(\mathbb{I}_{\text{turn}}(n(w_O, j), n(w_O, j + 1)) \cdot {\text{v}'_{n(w_O, j)}}^{\top} h\big)$
- $w_O$ denotes the output word
- $L(w_O)$ is the depth of the path leading to the output word $w_O$
- $\mathbb{I}_{\text{turn}}$ is a special indicator function:
    - $1$ if $n(w_O, j+1)$ is the left child of $n(w_O, j)$
    - $-1$ if $n(w_O, j+1)$ is the right child of $n(w_O, j)$
- $n(w, j)$ means the $j$-th node on the path from the root to the word $w$
- $h = \frac{1}{C} \sum^{C}_{c=1}\text{v}_{w_c}$ is the average of all context word vectors
Simple examples
Ex 1: Following the path from the root to the leaf node of $w_2$, we can compute the probability $p(w_2)$:
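Concretely, assuming (as in the usual illustration) that the path to $w_2$ turns left at the root, left again, then right, the product of branch probabilities is:

$p(w_2 = w_O) = p(n(w_2,1), \text{left}) \cdot p(n(w_2,2), \text{left}) \cdot p(n(w_2,3), \text{right}) = \sigma({\text{v}'_{n(w_2,1)}}^{\top} h) \cdot \sigma({\text{v}'_{n(w_2,2)}}^{\top} h) \cdot \sigma(-{\text{v}'_{n(w_2,3)}}^{\top} h)$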
Ex 2:
- $\sum^{V}_{i=1} p(w_i = w_O) = 1$, since at every inner node the left and right probabilities satisfy $\sigma(x) + \sigma(-x) = 1$ (see the numerical check after this example)
The probability $p(\text{cat} \mid \text{context})$ is given by the path $\text{node}_1 \stackrel{\text{left}}{\to} \text{node} \stackrel{\text{right}}{\to} \text{node}_5 \stackrel{\text{right}}{\to} \text{cat}$
where $h(\text{context words})$ is the output of the hidden layer given the context words
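Below is a quick numerical check of the sum-to-one property, assuming a complete binary tree in heap layout over a toy vocabulary with randomly initialized inner-node vectors; it is an illustrative sketch, not the actual Huffman tree construction used by word2vec.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, N = 8, 5                          # toy vocabulary size and embedding dimension
inner = rng.normal(size=(V - 1, N))  # one vector v'_n per inner node (heap layout)
h = rng.normal(size=N)               # stand-in for the averaged context vector h

def leaf_prob(word_idx):
    """p(word) = product of branch probabilities from the root down to the word's leaf."""
    # heap indices: 0..V-2 are inner nodes, V-1..2V-2 are the V leaves
    node = word_idx + V - 1
    path = []
    while node != 0:
        parent = (node - 1) // 2
        path.append((parent, node == 2 * parent + 1))  # True if we came up from the left child
        node = parent
    p = 1.0
    for n, went_left in reversed(path):
        s = sigmoid(inner[n] @ h)         # p(n, left) = sigma(v'_n . h)
        p *= s if went_left else 1.0 - s  # p(n, right) = 1 - p(n, left)
    return p

print(sum(leaf_prob(i) for i in range(V)))  # ~1.0: the leaf probabilities form a valid distribution
```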
Why Does Hierarchical Softmax Reduce the Time Complexity?
With hierarchical softmax, computing $p(w \mid c)$ no longer requires taking the inner product of every word's output vector $\text{v}'_{w_i}$ with $h$; only the $L(w_i) - 1$ nodes on the path are needed. Since HS uses a full binary tree, its maximum depth is $\log_2|V|$.
So we only need to evaluate at most $\log_2|V|$ inner products per word (roughly 20 for a vocabulary of one million words) instead of $|V|$.
How Hierarchical Softmax Updates Its Parameters
Error Function of Hierarchical Softmax
The error function $E$ is the negative log likelihood.
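Taking the negative log of the path probability above, and writing $\text{v}'_j$ as shorthand for $\text{v}'_{n(w_O, j)}$ and $[\cdot]$ for the branching indicator:

$E = -\log p(w_O \mid w_I) = -\sum^{L(w_O)-1}_{j=1} \log \sigma\big([\cdot]\, {\text{v}'_j}^{\top} h\big)$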
- $L(w_i) - 1$ is the number of nodes on the path from the root node to the leaf node of $w_i$
- $[\cdot]$ denotes the branching decision (the $\pm 1$ indicator $\mathbb{I}_{\text{turn}}$ defined above)
- $h = \frac{1}{C} \sum^{C}_{c=1}\text{v}_{w_c}$ is the average of all context word vectors
We use gradient descent to update $\text{v}'_j$ in $W'$ and $h$ (that is, the context word vectors in $W$).
Calculate the Derivative of $E$ with Respect to ${\text{v}'_j}^{\top} h$
First, find the gradient of the total loss with respect to ${\text{v}'_j}^{\top} h$, using the following identities:
- $\sigma'(x) = \sigma(x)[1 - \sigma(x)]$
- $[\log\sigma(x)]' = 1 - \sigma(x)$ ⇒ $[\log(1 - \sigma(x))]' = -\sigma(x)$
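Applying these identities, and introducing $t_j = 1$ when $[\cdot] = 1$ (a left turn at step $j$) and $t_j = 0$ otherwise, the gradient works out to:

$\frac{\partial E}{\partial {\text{v}'_j}^{\top} h} = \sigma({\text{v}'_j}^{\top} h) - t_j$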
Calculate the Derivative of $E$ with Respect to $\text{v}'_j$
By the chain rule, we obtain the gradient of the total loss with respect to the Huffman tree node vector $\text{v}'_j$:
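Since ${\text{v}'_j}^{\top} h$ is linear in $\text{v}'_j$:

$\frac{\partial E}{\partial \text{v}'_j} = \frac{\partial E}{\partial {\text{v}'_j}^{\top} h} \cdot \frac{\partial {\text{v}'_j}^{\top} h}{\partial \text{v}'_j} = \big(\sigma({\text{v}'_j}^{\top} h) - t_j\big)\, h$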
Update Equation
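For each inner node on the path, $j = 1, \dots, L(w_O) - 1$, with learning rate $\eta$:

${\text{v}'_j}^{(\text{new})} = {\text{v}'_j}^{(\text{old})} - \eta\, \big(\sigma({\text{v}'_j}^{\top} h) - t_j\big)\, h$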
Calculate the Derivative of $E$ with Respect to $h$
Finally, compute the gradient of the total loss with respect to the hidden layer output $h$:
- $EH$: an $N$-dimensional vector, the sum of the vectors $\text{v}'_j$ of the inner nodes on the path, each weighted by its prediction error $\sigma({\text{v}'_j}^{\top} h) - t_j$
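Summing the contributions of all inner nodes on the path:

$\frac{\partial E}{\partial h} = \sum^{L(w_O)-1}_{j=1} \big(\sigma({\text{v}'_j}^{\top} h) - t_j\big)\, \text{v}'_j := EH$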
Update Equation
Because the hidden vector $h$ is the average of all the context words $w_{I,c}$, the gradient $EH$ is distributed back to each context word's input vector.
- $\text{v}_{w_{I,c}}$ is the input vector of the $c$-th word in the input context
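Each of the $C$ context word vectors receives $\frac{1}{C}$ of the gradient, so for $c = 1, \dots, C$:

$\text{v}_{w_{I,c}}^{(\text{new})} = \text{v}_{w_{I,c}}^{(\text{old})} - \frac{1}{C}\, \eta\, EH$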
Implementation
For a full CBOW + HS implementation, see Word2Vec (5): PyTorch Implementation of CBOW with Hierarchical Softmax.
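Below is a minimal sketch of the forward pass and loss, not the code from the linked post: it assumes a complete binary tree built by heap indexing rather than a true Huffman tree, and the class name `HierarchicalSoftmaxCBOW` is made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalSoftmaxCBOW(nn.Module):
    """CBOW with hierarchical softmax over a complete binary tree (heap layout).

    Inner nodes 0..V-2 own the output-side vectors v'_n; leaves V-1..2V-2 are the words.
    Illustrative sketch only, not the Huffman-coded tree of the original word2vec.
    """

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.V = vocab_size
        self.in_embed = nn.Embedding(vocab_size, embed_dim)        # rows of W
        self.node_embed = nn.Embedding(vocab_size - 1, embed_dim)  # columns of W' (inner nodes)

    def _path(self, word_idx: int):
        """Return (inner-node indices, turn signs) from the root down to the word's leaf."""
        node, nodes, signs = word_idx + self.V - 1, [], []
        while node != 0:
            parent = (node - 1) // 2
            nodes.append(parent)
            signs.append(1.0 if node == 2 * parent + 1 else -1.0)  # +1 = left child, -1 = right child
            node = parent
        return list(reversed(nodes)), list(reversed(signs))

    def forward(self, context: torch.Tensor, target: int) -> torch.Tensor:
        """context: LongTensor of C context word ids; target: output word id.
        Returns E = -log p(target | context)."""
        h = self.in_embed(context).mean(dim=0)          # h = average of context word vectors
        nodes, signs = self._path(target)
        v_nodes = self.node_embed(torch.tensor(nodes))  # (L-1, N) inner-node vectors on the path
        signs = torch.tensor(signs)
        # E = -sum_j log sigma(sign_j * v'_j . h)
        return -F.logsigmoid(signs * (v_nodes @ h)).sum()

# usage sketch
model = HierarchicalSoftmaxCBOW(vocab_size=10, embed_dim=8)
loss = model(torch.tensor([1, 3, 5, 7]), target=2)
loss.backward()   # gradients flow to the path's node vectors and the context word vectors
```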
Reference
- Language Models, Word2Vec, and Efficient Softmax Approximations https://rohanvarma.me/Word2Vec/
- Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method, (2), 1–5. Retrieved from http://arxiv.org/abs/1402.3722