Word2Vec (4): PyTorch Implementation of Word2Vec with Softmax

The simplest versions of CBOW and skipgram implemented in PyTorch, with an objective function that minimizes the negative log likelihood with softmax.

CBOW

The idea of CBOW is to use the context words on both sides to predict the center word in the middle; how many context words there are depends on the window size.
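As an illustration of the data layout, here is a sketch of how (context, center) training pairs could be built from a list of token ids; build_cbow_pairs and the toy example are made up for this post, the real preprocessing lives in the notebook:

def build_cbow_pairs(token_ids, window_size=2):
    # For each eligible position, the window_size ids on each side form the
    # context and the id at that position is the center word.
    pairs = []
    for i in range(window_size, len(token_ids) - window_size):
        context = token_ids[i - window_size:i] + token_ids[i + 1:i + window_size + 1]
        pairs.append((context, token_ids[i]))
    return pairs

# build_cbow_pairs([0, 1, 2, 3, 4], window_size=2) -> [([0, 1, 3, 4], 2)]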

  • $V$: the vocabulary size
  • $N$ : the embedding dimension
  • $W$: the input side matrix which is $V \times N$
    • each row is an $N$-dimensional vector
    • $\text{v}_{w_i}$ is the representation of the input word $w_i$
  • $W'$: the output side matrix, which is $N \times V$
    • each column is an $N$-dimensional vector
    • $\text{v}'_{w_j}$ is the $j$-th column of the matrix $W'$, representing $w_j$

In the conditional probability $P(\textit{center} \mid \textit{context}; \theta)$, the center word variable ranges over a finite vocabulary, so this is a discrete probability and the task can be cast as a multiclass classification problem.

Let $w_O$ denote the center word and $w_I$ the input context words; then

  • $h$ denotes the hidden-layer output, whose value is the average of the input context word vectors, $\cfrac{1}{C}\left(\text{v}_{w_1} + \text{v}_{w_2} + \dots + \text{v}_{w_C}\right)^\top$

During training we maximize the log of the conditional probability $P(w_O \mid w_I; \theta)$.
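Written out with the notation above, this conditional probability is the softmax over the whole vocabulary (the same expression shows up in the loss below):

$P(w_O \mid w_I; \theta) = \cfrac{\exp(h^\top \text{v}'_{w_O})}{\sum_{w_i \in V} \exp(h^\top \text{v}'_{w_i})}$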

Pytorch CBOW + softmax

CBOW + softmax model definition

import torch.nn as nn
import torch.nn.functional as F


class CBOWSoftmax(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.syn0 = nn.Embedding(vocab_size, embedding_dim)  # input-side matrix W
        self.syn1 = nn.Linear(embedding_dim, vocab_size)     # output-side matrix W'

    def forward(self, context, center):
        # context: [b_size, window_size]
        # center: [b_size, 1]
        embds = self.syn0(context).mean(dim=1)  # [b_size, embedding_dim]
        out = self.syn1(embds)                  # [b_size, vocab_size]

        log_probs = F.log_softmax(out, dim=1)
        loss = F.nll_loss(log_probs, center.view(-1), reduction='mean')
        return loss

  • syn0 corresponds to the input-side embedding matrix $W$

  • syn1 corresponds to the output-side embedding matrix $W'$

  • the loss computed is

    $- \log \cfrac{\exp(h^\top \text{v}'_{w_{O}})}{\sum_{w_i \in V} \exp(h^\top \text{v}'_{w_i})}$

  • input: both context and center contain word indices

  • because the context consists of N words (N being determined by the window size), there are N word embeddings in play; the usual choice is to sum or average them
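A quick forward pass with dummy index tensors (the batch size, vocabulary size, and window size below are arbitrary) shows how the model is called; since log_softmax followed by nll_loss is exactly cross entropy, F.cross_entropy(out, center.view(-1)) would give the same loss:

import torch

vocab_size, embedding_dim = 100, 16
model = CBOWSoftmax(vocab_size, embedding_dim)

context = torch.randint(0, vocab_size, (4, 5))  # [b_size=4, window_size=5]
center = torch.randint(0, vocab_size, (4, 1))   # [b_size=4, 1]

loss = model(context, center)  # scalar tensor
loss.backward()                # gradients flow into both syn0 and syn1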

Training Stage

The training loop is omitted here; if you're interested, see the notebook on GitHub:

seed9D/hands-on-machine-learning
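For completeness, a minimal training loop might look like the sketch below; the Adam optimizer, learning rate, batch size, and the random dummy pairs are assumptions for illustration, not values taken from the notebook:

import torch
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Dummy (context, center) index pairs just to make the sketch self-contained;
# in practice these come from the corpus preprocessing step.
contexts = torch.randint(0, vocab_size, (1000, 5))
centers = torch.randint(0, vocab_size, (1000, 1))
loader = DataLoader(TensorDataset(contexts, centers), batch_size=32, shuffle=True)

optimizer = optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer and lr

for epoch in range(5):  # assumed number of epochs
    for context, center in loader:
        optimizer.zero_grad()
        loss = model(context, center)
        loss.backward()
        optimizer.step()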

Extract the Embedding

Create a class for measuring cosine similarity:

import numpy as np


class CosineSimilarity:
    def __init__(self, word_embedding, idx_to_word_dict, word_to_idx_dict):
        self.word_embedding = word_embedding  # rows are L2-normalized already
        self.idx_to_word_dict = idx_to_word_dict
        self.word_to_idx_dict = word_to_idx_dict

    def get_synonym(self, word, topK=10):
        idx = self.word_to_idx_dict[word]
        embed = self.word_embedding[idx]

        # dot products of unit vectors are cosine similarities
        cos_similarity = self.word_embedding @ embed

        topK_index = np.argsort(-cos_similarity)[:topK]
        pairs = []
        for i in topK_index:
            w = self.idx_to_word_dict[i]
            pairs.append((w, cos_similarity[i]))
        return pairs

Only syn0 is used as the embedding; remember to L2-normalize it:

syn0 = model.syn0.weight.data

w2v_embedding = syn0
w2v_embedding = w2v_embedding.numpy()
l2norm = np.linalg.norm(w2v_embedding, 2, axis=1, keepdims=True)
w2v_embedding = w2v_embedding / l2norm
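Putting the pieces together, usage might look like this; idx_to_word and word_to_idx stand for the vocabulary mappings built during preprocessing (names assumed here):

cos_sim = CosineSimilarity(w2v_embedding, idx_to_word, word_to_idx)
for w, score in cos_sim.get_synonym('jesus', topK=10):
    print(f'{w}\t{score:.4f}')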

The training corpus is the Bible, so here is a quick look at the words most similar to jesus and christ; I'll withhold judgment on the quality.

[Figure: top similar words to "jesus" and "christ" under the CBOW embedding]

Skipgram

The idea of skipgram is to use the center word to predict the context words on both sides.

  • $V$: the vocabulary size
  • $N$ : the embedding dimension
  • $W$: the input side matrix which is $V \times N$
    • each row is an $N$-dimensional vector
    • $\text{v}_{w_i}$ is the representation of the input word $w_i$
  • $W'$: the output side matrix, which is $N \times V$
    • each column is an $N$-dimensional vector
    • $\text{v}'_{w_j}$ is the $j$-th column of the matrix $W'$, representing $w_j$

Let $w_I$ denote the input center word and $w_{O,j}$ the $j$-th context word of the target; then the conditional probability is

$P(w_{O,j} \mid w_I; \theta) = \cfrac{\exp(h^\top \text{v}'_{w_{O,j}})}{\sum_{w_i \in V} \exp(h^\top \text{v}'_{w_i})}$

  • $h$ denotes the hidden-layer output, which in skipgram is simply $\text{v}_{w_I}$

Skipgram's objective function maximizes the log probability of all $C$ context words given the center word, i.e. it minimizes $E = -\sum_{j=1}^{C} \log P(w_{O,j} \mid w_I; \theta)$.

Pytorch skipgram + softmax

Model definition

class SkipgramSoftmax(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.syn0 = nn.Embedding(vocab_size, embedding_dim)  # |V| x |N|
        self.syn1 = nn.Linear(embedding_dim, vocab_size)     # |N| x |V|

    def forward(self, center, context):
        # center: [b_size, 1]
        # context: [b_size, 1]
        embds = self.syn0(center.view(-1))  # [b_size, embedding_dim]
        out = self.syn1(embds)              # [b_size, vocab_size]
        log_probs = F.log_softmax(out, dim=1)
        loss = F.nll_loss(log_probs, context.view(-1), reduction='mean')
        return loss

  • syn0 corresponds to the input-side embedding matrix $W$
  • syn1 corresponds to the output-side embedding matrix $W'$

In practice, skipgram only needs a (center word, context word) pair for each training example,

so the loss function is very simple to implement.
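For comparison with the CBOW preprocessing sketch earlier, skipgram pairs could be built like this, one (center, context word) example per context position (build_skipgram_pairs is again a made-up helper):

def build_skipgram_pairs(token_ids, window_size=2):
    # Every (center, single context word) combination becomes its own example.
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window_size)
        hi = min(len(token_ids), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs

# build_skipgram_pairs([0, 1, 2], window_size=1) -> [(0, 1), (1, 0), (1, 2), (2, 1)]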

Training Stage

The training loop is omitted here; if you're interested, see the notebook on GitHub:

seed9D/hands-on-machine-learning

Evaluation

Extract the embedding; this time, try $(W + W')/2$ as the embedding. (Note that syn1.weight has shape [vocab_size, embedding_dim], the same as syn0.weight, because nn.Linear stores its weight as [out_features, in_features], so the two matrices can be averaged directly.)

syn0 = model.syn0.weight.data
syn1 = model.syn1.weight.data

w2v_embedding = (syn0 + syn1) / 2
w2v_embedding = w2v_embedding.numpy()
l2norm = np.linalg.norm(w2v_embedding, 2, axis=1, keepdims=True)
w2v_embedding = w2v_embedding / l2norm

Looking again at the neighbours of jesus and christ, the results seem slightly better than CBOW's.

