Word2Vec (2): The Math Behind Hierarchical Softmax

Taking CBOW as the example, the notation is:

  • $V$: the vocabulary size
  • $N$: the embedding dimension
  • $W$: the input-side matrix, of size $V \times N$
    • each row is an $N$-dimensional vector
    • $\text{v}_{w_i}$ is the representation of the input word $w_i$
  • $W'$: the output-side matrix, of size $N \times V$
    • $\text{v}'_j$ denotes the $j$-th column vector of $W'$
    • in hierarchical softmax, each column of $W'$ is the vector of a non-leaf (inner) node of the Huffman tree rather than of a leaf node, so a column vector $\text{v}'_j$ does not correspond directly to a word $w_i$

Hierarchical softmax removes the cost of the original softmax, whose normalization term (the denominator) requires an inner product with every word in the vocabulary. The original softmax is:
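Concretely (a standard formulation; ${\text{v}'_{w_j}}$ here denotes the output vector of word $w_j$, which only exists in the full-softmax setting, since under hierarchical softmax $W'$ stores inner-node vectors instead):

$p(w_j \mid \text{context}) = \frac{\exp({\text{v}'_{w_j}}^{\top} h)}{\sum^{V}_{i=1} \exp({\text{v}'_{w_i}}^{\top} h)}$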

  • $h = \frac{1}{C} \sum^{C}_{c=1}\text{v}_{w_c}$ is the average of the input context words' vector representations in $W$

Hierarchical softmax builds a full binary tree over the vocabulary to avoid computation over all words:

  1. Each leaf node represents one word $w_i$
  2. Matrix $W'$ is the collection of the vectors $\text{v}'_n$ of all non-leaf nodes $n$; there is no one-to-one correspondence with the words $w_i$

At each node of the binary tree, choosing a branch is a binary classification problem:

$p(n, \text{left}) = \sigma({\text{v}'_n}^{\top} h)$

$p(n, \text{right}) = 1 - p(n, \text{left}) = \sigma(-{\text{v}'_n}^{\top} h)$

  • $\text{v}'_{n}$ denotes the vector of node $n$

Then the probability $p(w_i)$ of generating each word $w_O = w_i$ is the product of the branching probabilities along the path from the root node to its leaf node (a numerical sketch follows the notation list below):

$p(w_i = w_O) = \prod^{L(w_O)-1}_{j=1} \sigma\big(\mathbb{I}_{\text{turn}}(n(w_O, j), n(w_O, j + 1)) \cdot {\text{v}'_{n(w_O, j)}}^{\top} h\big)$

  • $w_O$ denotes the output word

  • $L(w_O)$ is the depth of the path leading to the output word $w_O$

  • $\mathbb{I}_{\text{turn}}$ is a special indicator function:

    1 if $n(w_O, j+1)$ is the left child of $n(w_O, j)$

    -1 if $n(w_O, j+1)$ is the right child of $n(w_O, j)$

  • $n(w, j)$ means the $j$-th node on the path from the root to the word $w$

  • $h = \frac{1}{C} \sum^{C}_{c=1}\text{v}_{w_c}$ is the average of all context word vectors
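To make the path product concrete, here is a minimal NumPy sketch (the embedding dimension, node vectors, and turn codes are made-up toy values, not taken from the post):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# toy setup: embedding dimension 3, a path of two inner nodes to the output word
h = np.array([0.1, -0.2, 0.4])                # hidden vector: average of context word vectors
path_vectors = np.array([[0.3, 0.1, -0.5],    # v'_{n(w_O, 1)}
                         [-0.2, 0.4, 0.1]])   # v'_{n(w_O, 2)}
turns = np.array([1.0, -1.0])                 # +1 = go left, -1 = go right at each node

# p(w_O) = prod_j sigmoid(turn_j * v'_{n(w_O, j)}^T h)
p_w_O = np.prod(sigmoid(turns * (path_vectors @ h)))
print(p_w_O)
```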

Simple Examples

Ex 1: Following the path from the root to the leaf node of $w_2$, we can compute the probability $p(w_2)$:
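Assuming, purely for illustration, a path that turns left at its first inner node, left at its second, and right at its third:

$p(w_2) = \sigma({\text{v}'_{n(w_2,1)}}^{\top} h) \cdot \sigma({\text{v}'_{n(w_2,2)}}^{\top} h) \cdot \sigma(-{\text{v}'_{n(w_2,3)}}^{\top} h)$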

Ex 2:

  • $\sum^{V}_{i=1} p(w_i = w_O) = 1$, because at every inner node the left and right probabilities sum to one ($\sigma(x) + \sigma(-x) = 1$), so the probability mass splits exactly between the two subtrees

The probability $p(\text{cat} \mid \text{context})$ is composed of the path $node1 \stackrel{\text{left}}{\to} node \stackrel{\text{right}}{\to} node5 \stackrel{\text{right}}{\to} cat$,

where $h(\text{context words})$ denotes the hidden-layer output computed from the context words.
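Following the sign convention above (a left turn contributes $\sigma(+\cdot)$, a right turn contributes $\sigma(-\cdot)$), this path gives

$p(\text{cat} \mid \text{context}) = \sigma({\text{v}'_{node1}}^{\top} h) \cdot \sigma(-{\text{v}'_{node}}^{\top} h) \cdot \sigma(-{\text{v}'_{node5}}^{\top} h)$

with $h = h(\text{context words})$.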

Why Does Hierarchical Softmax Reduce Time Complexity?

With hierarchical softmax, computing $p(w|c)$ no longer requires taking the inner product of every word's output vector with $h$; we only evaluate the $L(w_i) - 1$ nodes on the path. Moreover, since HS uses a full binary tree, its maximum depth is $\log_2 |V|$.

So we only need to evaluate at most $\log_2 |V|$ inner nodes per prediction, instead of $|V|$ output vectors (roughly 20 sigmoid evaluations versus a million inner products for a million-word vocabulary).

How Hierarchical Softmax Updates Its Parameters

Error Function of Hierarchical Softmax

The error function $E$ is the negative log likelihood.
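Writing $\text{v}'_j$ as shorthand for $\text{v}'_{n(w_O, j)}$ and $[\cdot]$ for the $\pm 1$ turn indicator, the loss for one training example takes the standard form:

$E = -\log p(w_O \mid \text{context}) = -\sum^{L(w_O)-1}_{j=1} \log \sigma\big([\cdot]\, {\text{v}'_j}^{\top} h\big)$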

  • $L(w_i) - 1$ is the number of nodes on the path from the root node to the leaf node of $w_i$
  • $[\cdot]$ denotes the branching decision (the $\pm 1$ turn indicator defined above)
  • $h = \frac{1}{C} \sum^{C}_{c=1}\text{v}_{w_c}$ is the average of all context word vectors

And we use gradient descent to update $\text{v}'_j$ in $W'$ and, through $h$, the input vectors in $W$.

Calculate the Derivative of $E$ with Regard to ${\text{v}'_j}^{\top} h$

First we take the gradient of the total loss with respect to ${\text{v}'_j}^{\top} h$, using the identities:

  • $\sigma'(x) = \sigma(x)[1 - \sigma(x)]$
  • $[\log\sigma(x)]' = 1 - \sigma(x)$ ⇒ $[\log(1 - \sigma(x))]' = -\sigma(x)$
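Combining these, and defining an auxiliary label $t_j = 1$ if the $j$-th turn is to the left ($[\cdot] = 1$) and $t_j = 0$ otherwise (a symbol introduced here for compactness), the result is the prediction error at the $j$-th node on the path:

$\frac{\partial E}{\partial\, {\text{v}'_j}^{\top} h} = \sigma({\text{v}'_j}^{\top} h) - t_j$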

Calculate the Derivative of $E$ with Regard to $\text{v}'_j$

By the chain rule, we can obtain the gradient of the total loss with respect to the Huffman tree node vector $\text{v}'_j$.
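Since $\frac{\partial\, {\text{v}'_j}^{\top} h}{\partial \text{v}'_j} = h$, the chain rule gives (with $t_j$ as defined above):

$\frac{\partial E}{\partial \text{v}'_j} = \big(\sigma({\text{v}'_j}^{\top} h) - t_j\big)\, h$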

Update Equation
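Writing $\eta$ for the learning rate (a symbol introduced here), the SGD step for each inner node on the path is:

$\text{v}'^{(\text{new})}_j = \text{v}'^{(\text{old})}_j - \eta\, \big(\sigma({\text{v}'_j}^{\top} h) - t_j\big)\, h, \qquad j = 1, \dots, L(w_O) - 1$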

Calculate the Derivative of $E$ with Regard to $h$

Finally, we take the gradient of the total loss with respect to the hidden layer output $h$.
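Summing the per-node contributions along the path (again with $t_j$ and $\text{v}'_j$ as defined above):

$\frac{\partial E}{\partial h} = \sum^{L(w_O)-1}_{j=1} \big(\sigma({\text{v}'_j}^{\top} h) - t_j\big)\, \text{v}'_j := EH$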

  • $EH$: an $N$-dimensional vector, the sum of the vectors of the inner nodes on the path to $w_O$, weighted by their prediction errors

Update Equation

Because the hidden vector $h$ is the average of all the context words $w_{I,c}$, the error $EH$ is distributed evenly over their input vectors (see the update below).

  • $\text{v}_{w_{I,c}}$ is the input vector of the $c$-th word in the input context
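Since $\frac{\partial h}{\partial \text{v}_{w_{I,c}}} = \frac{1}{C}$, each context word's input vector receives a $\frac{1}{C}$ share of $EH$ (again writing $\eta$ for the learning rate):

$\text{v}^{(\text{new})}_{w_{I,c}} = \text{v}^{(\text{old})}_{w_{I,c}} - \frac{1}{C}\, \eta\, EH, \qquad c = 1, \dots, C$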

Implementation

For a CBOW + HS implementation, see Word2Vec (5): PyTorch implementation of CBOW with Hierarchical Softmax.
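As a rough sketch of how the pieces fit together (this is not the code from that post; the Huffman tree is assumed to be built offline, and the node ids and turn codes below are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOWHierarchicalSoftmax(nn.Module):
    """Minimal CBOW + hierarchical softmax: W holds input vectors, W' holds inner-node vectors."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)        # W: rows are v_w
        self.node_embed = nn.Embedding(vocab_size - 1, embed_dim)  # W': a full binary tree has V - 1 inner nodes

    def forward(self, context_ids, path_nodes, path_turns):
        # context_ids: (C,) ids of the context words
        # path_nodes:  (L-1,) ids of the inner nodes on the path to the output word
        # path_turns:  (L-1,) +1 for a left turn, -1 for a right turn
        h = self.in_embed(context_ids).mean(dim=0)          # h = average of context word vectors
        scores = self.node_embed(path_nodes) @ h            # v'_n^T h for each node on the path
        log_p = F.logsigmoid(path_turns * scores).sum()     # log p(w_O | context)
        return -log_p                                       # E = -log p, minimized by SGD

# toy usage with made-up ids (illustrative only)
model = CBOWHierarchicalSoftmax(vocab_size=8, embed_dim=16)
loss = model(torch.tensor([1, 2, 4, 5]),           # context word ids
             torch.tensor([0, 3, 6]),              # inner nodes on the path to the output word
             torch.tensor([1.0, -1.0, -1.0]))      # turn codes along that path
loss.backward()  # gradients touch only the path's node vectors and the context rows of W
```

Per training example, only the $L(w_O) - 1$ node vectors on the path and the $C$ context rows of $W$ receive gradients, which is exactly where the speed-up over the full softmax comes from.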
