Word2Vec (2): The Math Behind Hierarchical Softmax
Taking CBOW as the example, the notation is:
- $V$: the vocabulary size
- $N$: the embedding dimension
- $W$: the input-side matrix, which is $V \times N$
- each row is an $N$-dimensional vector
- $\text{v}_{w_i}$ is the representation of the input word $w_i$
- $W'$: the output-side matrix, which is $N \times V$
- $\text{v}'_j$ denotes the $j$-th column vector of $W'$
- In hierarchical softmax, each column of $W'$ represents the vector of a non-leaf node of the Huffman tree rather than a leaf node, so a column vector $\text{v}'_j$ does not correspond directly to a word $w_i$
Hierarchical softmax removes the bottleneck of the original softmax, whose normalization term in the denominator requires an inner product with every word in the vocabulary (the full softmax is written out below).
- $h = \frac{1}{C} \sum^{C}_{c=1}\text{v}_{w_c}$ is the average of the input context words' vector representations in $W$
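In this notation, the full softmax that hierarchical softmax replaces is the standard CBOW output layer:

$p(w_O \mid w_{I,1}, \dots, w_{I,C}) = \frac{\exp({\text{v}'_{w_O}}^{\top} h)}{\sum^{V}_{j=1} \exp({\text{v}'_{w_j}}^{\top} h)}$

Its denominator alone requires all $V$ inner products for every training example.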
Hierarchical softmax builds a full binary tree to avoid computing over the entire vocabulary. Illustration:
- Each leaf node represents one word $w_i$
- Matrix $W'$ is the collection of vectors $\text{v}'_n$ of all non-leaf nodes $n$; it has no one-to-one correspondence with the words $w_i$
Branching at each node of the binary tree is a binary classification problem:
$p(n, \text{left}) = \sigma({\text{v}'_n}^{\top} h)$
$p(n, \text{right}) = 1 - p(n, \text{left}) = \sigma(-{\text{v}'_n}^{\top} h)$
- $\text{v}'_{n}$ is the vector of node $n$
The generation probability $p(w_i)$ of each word $w_O = w_i$ is then the probability of branching from the root node down to its leaf node:
$p(w_i = w_O) = \prod^{L(w_O)-1}_{j=1} \sigma\big(\mathbb{I}_{\text{turn}}(n(w_O, j), n(w_O, j + 1)) \cdot {\text{v}'_{n(w_O, j)}}^{\top} h\big)$
- $w_O$ denotes the output word
- $L(w_O)$ is the depth of the path leading to the output word $w_O$
- $\mathbb{I}_{\text{turn}}$ is a special indicator function:
    - $1$ if $n(w_O, j+1)$ is the left child of $n(w_O, j)$
    - $-1$ if $n(w_O, j+1)$ is the right child of $n(w_O, j)$
- $n(w, j)$ means the $j$-th node on the path from the root to the word $w$
- $h = \frac{1}{C} \sum^{C}_{c=1}\text{v}_{w_c}$ is the average of all context word vectors
Simple examples
Ex 1: Following the path from the root to the leaf node of $w_2$, we can compute the probability $p(w_2)$:
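Concretely, assuming (as in the usual illustration) that the path to $w_2$ turns left at the root, left again, then right, the product of branch probabilities is:

$p(w_2 = w_O) = p(n(w_2,1), \text{left}) \cdot p(n(w_2,2), \text{left}) \cdot p(n(w_2,3), \text{right}) = \sigma({\text{v}'_{n(w_2,1)}}^{\top} h) \cdot \sigma({\text{v}'_{n(w_2,2)}}^{\top} h) \cdot \sigma(-{\text{v}'_{n(w_2,3)}}^{\top} h)$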
Ex 2:
- $\sum^{V}_{i=1} p(w_i = w_O) = 1$, since at every inner node the left and right probabilities satisfy $\sigma(x) + \sigma(-x) = 1$ (see the numerical check after this example)
The probability $p(\text{cat} \mid \text{context})$ is given by the path $\text{node}_1 \stackrel{\text{left}}{\to} \text{node} \stackrel{\text{right}}{\to} \text{node}_5 \stackrel{\text{right}}{\to} \text{cat}$
where $h(\text{context words})$ is the output of the hidden layer given the context words
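Below is a quick numerical check of the sum-to-one property, assuming a complete binary tree in heap layout over a toy vocabulary with randomly initialized inner-node vectors; it is an illustrative sketch, not the actual Huffman tree construction used by word2vec.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, N = 8, 5                          # toy vocabulary size and embedding dimension
inner = rng.normal(size=(V - 1, N))  # one vector v'_n per inner node (heap layout)
h = rng.normal(size=N)               # stand-in for the averaged context vector h

def leaf_prob(word_idx):
    """p(word) = product of branch probabilities from the root down to the word's leaf."""
    # heap indices: 0..V-2 are inner nodes, V-1..2V-2 are the V leaves
    node = word_idx + V - 1
    path = []
    while node != 0:
        parent = (node - 1) // 2
        path.append((parent, node == 2 * parent + 1))  # True if we came up from the left child
        node = parent
    p = 1.0
    for n, went_left in reversed(path):
        s = sigmoid(inner[n] @ h)         # p(n, left) = sigma(v'_n . h)
        p *= s if went_left else 1.0 - s  # p(n, right) = 1 - p(n, left)
    return p

print(sum(leaf_prob(i) for i in range(V)))  # ~1.0: the leaf probabilities form a valid distribution
```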
Why Does Hierarchical Softmax Reduce the Time Complexity?
With hierarchical softmax, computing $p(w \mid c)$ no longer requires taking the inner product of every word's output vector $\text{v}'_{w_i}$ with $h$; only the $L(w_i) - 1$ nodes on the path are needed. Since HS uses a full binary tree, its maximum depth is $\log_2|V|$.
So we only need to evaluate at most $\log_2|V|$ inner products per word (roughly 20 for a vocabulary of one million words) instead of $|V|$.
How Hierarchical Softmax Updates Its Parameters
Error Function of Hierarchical Softmax
The error function $E$ is the negative log likelihood.
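Taking the negative log of the path probability above, and writing $\text{v}'_j$ as shorthand for $\text{v}'_{n(w_O, j)}$ and $[\cdot]$ for the branching indicator:

$E = -\log p(w_O \mid w_I) = -\sum^{L(w_O)-1}_{j=1} \log \sigma\big([\cdot]\, {\text{v}'_j}^{\top} h\big)$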
- $L(w_i) - 1$ is the number of nodes on the path from the root node to the leaf node of $w_i$
- $[\cdot]$ denotes the branching decision (the $\pm 1$ indicator $\mathbb{I}_{\text{turn}}$ defined above)
- $h = \frac{1}{C} \sum^{C}_{c=1}\text{v}_{w_c}$ is the average of all context word vectors
We use gradient descent to update $\text{v}'_j$ in $W'$ and $h$ (that is, the context word vectors in $W$).
Calculate the Derivative of $E$ with Respect to ${\text{v}'_j}^{\top} h$
First, find the gradient of the total loss with respect to ${\text{v}'_j}^{\top} h$, using the following identities:
- $\sigma'(x) = \sigma(x)[1 - \sigma(x)]$
- $[\log\sigma(x)]' = 1 - \sigma(x)$ ⇒ $[\log(1 - \sigma(x))]' = -\sigma(x)$
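Applying these identities, and introducing $t_j = 1$ when $[\cdot] = 1$ (a left turn at step $j$) and $t_j = 0$ otherwise, the gradient works out to:

$\frac{\partial E}{\partial {\text{v}'_j}^{\top} h} = \sigma({\text{v}'_j}^{\top} h) - t_j$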
Calculate the Derivative of $E$ with Respect to $\text{v}'_j$
By the chain rule, we obtain the gradient of the total loss with respect to the Huffman tree node vector $\text{v}'_j$:
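Since ${\text{v}'_j}^{\top} h$ is linear in $\text{v}'_j$:

$\frac{\partial E}{\partial \text{v}'_j} = \frac{\partial E}{\partial {\text{v}'_j}^{\top} h} \cdot \frac{\partial {\text{v}'_j}^{\top} h}{\partial \text{v}'_j} = \big(\sigma({\text{v}'_j}^{\top} h) - t_j\big)\, h$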
Update Equation
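For each inner node on the path, $j = 1, \dots, L(w_O) - 1$, with learning rate $\eta$:

${\text{v}'_j}^{(\text{new})} = {\text{v}'_j}^{(\text{old})} - \eta\, \big(\sigma({\text{v}'_j}^{\top} h) - t_j\big)\, h$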
Calculate the Derivative of $E$ with Respect to $h$
Finally, compute the gradient of the total loss with respect to the hidden layer output $h$:
- $EH$: an $N$-dimensional vector, the sum of the vectors $\text{v}'_j$ of the inner nodes on the path, each weighted by its prediction error $\sigma({\text{v}'_j}^{\top} h) - t_j$
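Summing the contributions of all inner nodes on the path:

$\frac{\partial E}{\partial h} = \sum^{L(w_O)-1}_{j=1} \big(\sigma({\text{v}'_j}^{\top} h) - t_j\big)\, \text{v}'_j := EH$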
Update Equation
Because the hidden vector $h$ is the average of all the context words $w_{I,c}$, the gradient $EH$ is distributed back to each context word's input vector.
- $\text{v}_{w_{I,c}}$ is the input vector of the $c$-th word in the input context
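Each of the $C$ context word vectors receives $\frac{1}{C}$ of the gradient, so for $c = 1, \dots, C$:

$\text{v}_{w_{I,c}}^{(\text{new})} = \text{v}_{w_{I,c}}^{(\text{old})} - \frac{1}{C}\, \eta\, EH$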
Implementation
For a full CBOW + HS implementation, see Word2Vec (5): PyTorch Implementation of CBOW with Hierarchical Softmax.
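Below is a minimal sketch of the forward pass and loss, not the code from the linked post: it assumes a complete binary tree built by heap indexing rather than a true Huffman tree, and the class name `HierarchicalSoftmaxCBOW` is made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalSoftmaxCBOW(nn.Module):
    """CBOW with hierarchical softmax over a complete binary tree (heap layout).

    Inner nodes 0..V-2 own the output-side vectors v'_n; leaves V-1..2V-2 are the words.
    Illustrative sketch only, not the Huffman-coded tree of the original word2vec.
    """

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.V = vocab_size
        self.in_embed = nn.Embedding(vocab_size, embed_dim)        # rows of W
        self.node_embed = nn.Embedding(vocab_size - 1, embed_dim)  # columns of W' (inner nodes)

    def _path(self, word_idx: int):
        """Return (inner-node indices, turn signs) from the root down to the word's leaf."""
        node, nodes, signs = word_idx + self.V - 1, [], []
        while node != 0:
            parent = (node - 1) // 2
            nodes.append(parent)
            signs.append(1.0 if node == 2 * parent + 1 else -1.0)  # +1 = left child, -1 = right child
            node = parent
        return list(reversed(nodes)), list(reversed(signs))

    def forward(self, context: torch.Tensor, target: int) -> torch.Tensor:
        """context: LongTensor of C context word ids; target: output word id.
        Returns E = -log p(target | context)."""
        h = self.in_embed(context).mean(dim=0)          # h = average of context word vectors
        nodes, signs = self._path(target)
        v_nodes = self.node_embed(torch.tensor(nodes))  # (L-1, N) inner-node vectors on the path
        signs = torch.tensor(signs)
        # E = -sum_j log sigma(sign_j * v'_j . h)
        return -F.logsigmoid(signs * (v_nodes @ h)).sum()

# usage sketch
model = HierarchicalSoftmaxCBOW(vocab_size=10, embed_dim=8)
loss = model(torch.tensor([1, 3, 5, 7]), target=2)
loss.backward()   # gradients flow to the path's node vectors and the context word vectors
```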
Reference
- Language Models, Word2Vec, and Efficient Softmax Approximations https://rohanvarma.me/Word2Vec/
- Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method, (2), 1–5. Retrieved from http://arxiv.org/abs/1402.3722