Content Recommendation (2): Title Embedding with Keyword

2021-02-09 (updated 2021-02-10) · recommendation system · 11 minutes read (about 1682 words)

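Introduction

In the previous post, Content Recommendation (1): Keyword Recognition, we used entropy to identify product words and label words from the titles in the item pool.

In this post, we use those identified product words and label words to go back and embed the item titles in the pool.

Of course, you could also feed every title straight into Word2Vec, train it by brute force, and then average all the word vectors in a title to get the title vector.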
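As a point of reference, the plain-average baseline mentioned above can be sketched as follows. The helper name `mean_title_vector` and the toy 2-d vectors are made up for illustration; in practice the vectors would come from a trained Word2Vec model.

```python
from typing import Dict, List


def mean_title_vector(title: List[str], word_vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the in-vocabulary words of a title."""
    vecs = [word_vectors[w] for w in title if w in word_vectors]
    if not vecs:
        raise ValueError("no in-vocabulary word in title")
    # zip(*vecs) iterates over the dimensions of the vectors
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Toy 2-d vectors (made up, not trained).
toy_vectors = {"sun": [1.0, 0.0], "hat": [0.0, 1.0], "male": [1.0, 1.0]}
baseline = mean_title_vector(["sun", "hat", "male", "unknown_word"], toy_vectors)
```

Every in-vocabulary word contributes equally here, including words that say little about the product, which is the weakness the weighted version tries to address.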

Weight Keyword Embedding

Suppose we have a title and want to embed it according to how important each word is within that title. How can we do that?

'summer fisherman hat female outdoor sun hat sun hat japanese student basin hat watch travel fishing sun hat male'

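Starting from CBOW

The idea of CBOW is to use the context words on both sides to predict the center word in the middle:

$P(\text{center} \mid \text{context};\theta)$

In other words, given a set of context words $w_{I,C}$, does a higher probability that word $w_j$ is the center word $w_O$ mean that $w_j$ is more critical in context $C$?

$P(w_O = w_j |w_{I,C};\theta)$

If this conjecture holds, we can start from the objective function of CBOW under Hierarchical Softmax, the negative log likelihood: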

$\begin{aligned} & -\log p(w_O| w_I) = -\log \dfrac{\text{exp}({h^\top \text{v}'_O})}{\sum_{w_i \in V} \text{exp}({h^\top \text{v}'_{w_i}})} \\ & = - \sum^{L(w_O)-1}_{l=1} \log\sigma([ \cdot] h^\top \text{v}^{'}_l) \end{aligned} \tag{1}$

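CBOW with Hierarchical Softmax has two matrices, $W$ and $W'$:
- Each row vector of $W$ is the vector of a word $w_i$
- $W'$ holds the vectors of the non-leaf nodes of the Huffman tree
- See Word2Vec (5): PyTorch Implementation of CBOW with Hierarchical Softmax
$\text{v}'_j$ is the $j$-th column vector of the output-side matrix $W'$; it has no one-to-one correspondence with any word (in Eq. (1), $\text{v}'_l$ is the vector of the $l$-th non-leaf node on the path)
$L(w_i) - 1$ is the number of non-leaf nodes on the path from the root node to the leaf node of $w_i$ in the Huffman tree
$[ \cdot ]$ is the branching decision in the Huffman tree
- $[ \cdot ] = 1$ means turn left
- $[ \cdot ] = -1$ means turn right
$h = \frac {1}{C} \sum^{C}_{j=1}\text{v}_{w_{I,j}}$ is the average of the vectors of all context words $w_{I,j}$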
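To make the notation concrete, here is a minimal sketch of the Hierarchical Softmax log-probability on a toy Huffman tree of depth 2. All vectors, the tree layout, and the helper names are made up for illustration; a real model would have one path per vocabulary word.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def context_mean(context_vectors: List[List[float]]) -> List[float]:
    """h: the average of the context word vectors (rows of W)."""
    n = len(context_vectors)
    return [sum(col) / n for col in zip(*context_vectors)]


def hs_log_prob(h: Sequence[float], path_vectors, turns) -> float:
    """Sum of log sigma([.] h^T v'_l) over the non-leaf nodes on the path."""
    return sum(math.log(sigmoid(t * dot(h, v))) for v, t in zip(path_vectors, turns))


# Non-leaf node vectors of the toy tree (rows of W', made up).
root, left, right = [0.5, -1.0], [1.0, 2.0], [-0.3, 0.8]
h = context_mean([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

# Four leaves, each reached by two turns (+1 = left, -1 = right).
leaves = {
    "w1": ([root, left], [1, 1]),
    "w2": ([root, left], [1, -1]),
    "w3": ([root, right], [-1, 1]),
    "w4": ([root, right], [-1, -1]),
}
log_probs = {w: hs_log_prob(h, path, turns) for w, (path, turns) in leaves.items()}
total_prob = sum(math.exp(lp) for lp in log_probs.values())
```

Because $\sigma(x) + \sigma(-x) = 1$ at every branch, the leaf probabilities sum to 1, so no normalization over the whole vocabulary is needed.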

The score function $\log p(w_O| w_I)$ (without the negative sign) is essentially a score for the output word $w_O$.

First, let's rewrite the score function; we will need it shortly.

$\begin{aligned} \log p(w_O| w_I) &= \sum^{L(w_O)-1}_{l=1}\log\sigma([ \cdot] h^\top \text{v}'_l) \\ &= \sum^{L(w_O)-1}_{l=1} \log\left(\cfrac{1}{1+ \exp(- [ \cdot] h^{\top} \text{v}'_l)}\right) \\ &= \sum^{L(w_O)-1}_{l=1} \left[\log(1) -\log\left(1+ \exp(- [ \cdot] h^{\top} \text{v}'_l)\right) \right]\\ &= \sum^{L(w_O)-1}_{l=1}-\log\left(1 + \exp(- [ \cdot] h^{\top} \text{v}'_l)\right) \end{aligned} \tag{2}$
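The last line of Eq. (2) is also the numerically convenient form. The sketch below, in plain Python, evaluates $-\log(1 + \exp(-x))$ without overflow; `log_sigmoid` is a name chosen here for illustration. NumPy's `np.logaddexp(0, -x)` computes the same $\log(1 + \exp(-x))$ term.

```python
import math


def log_sigmoid(x: float) -> float:
    """-log(1 + exp(-x)), evaluated so that exp never overflows."""
    # x >= 0: -log1p(exp(-x));  x < 0: x - log1p(exp(x))
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


# Agrees with the naive form log(1 / (1 + exp(-x))) on moderate inputs.
checks = [(x, log_sigmoid(x), math.log(1.0 / (1.0 + math.exp(-x)))) for x in (-3.0, 0.0, 2.5)]

# The naive form would overflow in math.exp(1000.0); this one stays finite.
extreme = log_sigmoid(-1000.0)
```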

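With the score function in Eq. (2), given the set of words $w_{T}$ of a title, we score each word $w_j \in w_{T}$ by computing $\log p(w_O=w_j|w_I = w_{T, \lnot j})$, i.e. treating $w_j$ as the center word and the remaining title words as its context. This gives the importance of each word $w_j$ in the title.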

$\text{weight}_j = \log p(w_O=w_j|w_I = w_{T, \lnot j}) \tag{3}$
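The leave-one-out context $w_{T, \lnot j}$ in Eq. (3) can be enumerated with a short helper. This is only a sketch of the pairing step; `center_context_pairs` is a name made up here, and the scoring itself is not included.

```python
from typing import List, Tuple


def center_context_pairs(title: List[str]) -> List[Tuple[str, List[str]]]:
    """Pair each word w_j with the rest of the title, w_{T, not j}."""
    return [(w, title[:i] + title[i + 1:]) for i, w in enumerate(title)]


pairs = center_context_pairs(["leather", "coat", "winter"])
# pairs[1] is ("coat", ["leather", "winter"])
```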

The title embedding we want is then the weighted sum of the words in the title:

$\text{v}_{\text{title}} = \sum_{w_j \in w_T}\text{weight}_j \times \text{v}_{w_{j}} \tag{4}$

$w_T$: the set of words of a title
$\text{v}_{w_{j}}$: the row vector of matrix $W$ corresponding to word $w_j$

Gensim Implementation

Train CBOW + HS

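# Written against the gensim 3.x API (`size`, `wv.vocab`, `model.trainables`).
from gensim.models import Word2Vec

# `corpus` is a list of tokenized titles, e.g. [['sun', 'hat', ...], ...]
w2v_model = Word2Vec(
        min_count=3,        # ignore words that appear fewer than 3 times
        window=5,           # context window size
        size=100,           # embedding dimension
        alpha=0.005,        # initial learning rate
        min_alpha=0.0007,   # final learning rate
        hs=1,               # use hierarchical softmax
        sg=0,               # CBOW (sg=1 would be skip-gram)
        workers=4,
        batch_words=100,
        cbow_mean=1         # average (rather than sum) the context vectors
    )
w2v_model.build_vocab(corpus)  # also builds the Huffman tree, since hs=1
w2v_model.train(
        corpus,
        total_examples=w2v_model.corpus_count,
        epochs=50,
        report_delay=1)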

Each title in corpus should already have its bigram and trigram product and label words merged, and ideally have useless words removed. In the sentence below, some words have already been merged:

'hyuna_style ins cute wild sunscreen female hand sleeves arm_guard ice_silk sleeves driving anti_ultraviolet ice gloves tide'
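The post does not show how the merging was done. One possible approach, assuming the product and label phrases found in the previous post are available as tuples of tokens, is a greedy longest-match pass. `merge_phrases` and `known_phrases` are hypothetical names used only for this sketch.

```python
from typing import List, Set, Tuple


def merge_phrases(tokens: List[str], phrases: Set[Tuple[str, ...]], max_len: int = 3) -> List[str]:
    """Greedily join known phrases (longest first) into single '_'-joined tokens."""
    merged, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):           # try trigram, then bigram
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:                                      # no known phrase starts at position i
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase set; in practice it would come from the keyword recognition step.
known_phrases = {("arm", "guard"), ("ice", "silk"), ("anti", "ultraviolet")}
tokens = "hand sleeves arm guard ice silk sleeves".split()
merged = merge_phrases(tokens, known_phrases)
```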

Scoring Words in Title

After training the model, score the importance of each word within a title:

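import numpy as np

def cal_log_probs(model, target_w, context_embd: np.ndarray) -> float:
    # Huffman path code is 0/1; map 0 -> +1 and 1 -> -1
    turns = (-1.0) ** target_w.code
    path_embd = model.trainables.syn1[target_w.point]  # non-leaf node vectors on the path
    # log(sigmoid(x)) = -log(1 + exp(-x)) = -logaddexp(0, -x), see Eq. (2)
    log_probs = -np.logaddexp(0, -turns * np.dot(context_embd, path_embd.T))
    return float(np.sum(log_probs))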

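This implements Eq. (2); it is essentially score_cbow_pair inside gensim's Word2Vec
The path code of a word in the Huffman tree is a 0/1 code, which must be converted to 1 or -1 before use
model.trainables.syn1 is $W'$, which stores the vectors of the Huffman tree's non-leaf nodes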
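The 0/1 to ±1 conversion can be checked in isolation. The path code below is made up; the second expression is an equivalent way to write the same mapping.

```python
code = [0, 1, 1, 0]                          # made-up Huffman path code
turns = [(-1.0) ** c for c in code]          # 0 -> +1.0, 1 -> -1.0
turns_alt = [1.0 - 2.0 * c for c in code]    # same mapping written linearly
```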

from typing import Dict, List

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def _cal_keyword_score(model, sentence: List[str]) -> Dict[str, float]:
    # keep only in-vocabulary words
    word_vocabs = [model.wv.vocab[w] for w in sentence if w in model.wv.vocab]

    word_importance = {}
    for pos_center, center_w in enumerate(word_vocabs):
        # every other word in the title acts as context
        context_w_indices = [w.index for pos_w, w in enumerate(word_vocabs) if pos_center != pos_w]
        context_embed = np.mean(model.wv.vectors[context_w_indices], axis=0)
        log_probs = cal_log_probs(model, center_w, context_embed)

        center_w_term = model.wv.index2word[center_w.index]
        # a word that occurs several times in the title accumulates its scores
        word_importance[center_w_term] = word_importance.get(center_w_term, 0) + log_probs
    return word_importance

def cal_keyword_score(model, sentence: List[str]) -> pd.Series:
    word_importance = _cal_keyword_score(model, sentence)
    ds = pd.Series(word_importance).sort_values(ascending=False)

    # rescale the log-probabilities to the range [0.1, 1]
    scaler = MinMaxScaler(feature_range=(0.1, 1))
    array = scaler.fit_transform(ds.to_numpy().reshape(-1, 1))
    return pd.Series(array.reshape(-1), index=ds.index)

model.wv.vectors stores $W$, i.e. the vector of each word after training
MinMaxScaler: scaling to the range 0.1 to 1 is for easier inspection
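Since log-probabilities are never positive, the rescaling also turns them into positive weights where a larger value means a more important word. The arithmetic behind the scaling can be sketched in plain Python as follows; `min_max_scale` is a name chosen here, the raw scores are made up, and mapping a constant input to the lower bound is a choice made for this sketch.

```python
from typing import Dict


def min_max_scale(scores: Dict[str, float], lo: float = 0.1, hi: float = 1.0) -> Dict[str, float]:
    """Linearly map the smallest score to `lo` and the largest to `hi`."""
    s_min, s_max = min(scores.values()), max(scores.values())
    if s_max == s_min:
        # All scores equal: map everything to the lower bound (a choice made here).
        return {w: lo for w in scores}
    return {w: lo + (s - s_min) / (s_max - s_min) * (hi - lo) for w, s in scores.items()}


raw = {"coat": -10.0, "fur": -20.0, "money": -30.0}   # made-up log-probabilities
weights = min_max_scale(raw)
```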

Usage:

In:

sent = corpus_with_bigram_trigram[7676]
ds = cal_keyword_score(w2v_model, sent)
print(sent)
print(ds)

Out:

['haining', 'leather', 'male', 'stand_collar', 'middle_aged', 'coat', 'fur', 'one', 'winter', 'cashmere', 'thick', 'money', 'father_loaded']
coat             1.000000
leather          0.874738
fur              0.861750
middle_aged      0.812752
winter           0.773609
male             0.734654
stand_collar     0.699505
thick            0.676800
cashmere         0.642869
one              0.546631
haining          0.457806
father_loaded    0.393533
money            0.100000
dtype: float64

Weighted Sum of Word Vectors

Take the vectors of all words in a title from w2v_model and compute their weighted sum:

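def weighted_sum_w2v(w2v_model, ds: pd.Series) -> np.ndarray:
    ds_ = ds / ds.sum()                           # normalize the weights so they sum to 1
    w2v = w2v_model.wv[list(ds_.index)]           # (n_words, dim) word vectors
    weights = np.expand_dims(ds_.to_numpy(), 1)   # (n_words, 1), broadcast over dim

    return np.sum(w2v * weights, axis=0)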

The resulting title vector is un-normalized; L2-normalize it before use.
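A minimal sketch of the L2 normalization step, in plain Python. `l2_normalize` is a name chosen here; returning a zero vector unchanged is an assumption made for this sketch.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)   # zero vector: nothing to scale (a choice made here)
    return [x / norm for x in vec]


unit = l2_normalize([3.0, 4.0])
unit_norm = math.sqrt(sum(x * x for x in unit))
```

Once every title vector has unit length, the dot product of two title vectors equals their cosine similarity.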

See the example notebook built with the bible corpus: seed9D/hands-on-machine-learning

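Applications of Title Embedding

The most intuitive application of title embedding is content I2I (item-to-item): when a user clicks item A, we can use item A's title vector to recall the top-K items with the most similar titles and recommend them.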
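The recall step can be sketched as a brute-force cosine search. The item ids and 2-d vectors below are made up, `top_k_similar` is a hypothetical helper, and all vectors are assumed to be non-zero; a production system would more likely use an approximate nearest neighbor index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)   # assumes both vectors are non-zero


def top_k_similar(query_id: str, title_vectors: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k items whose title vectors are most similar to the query item."""
    query = title_vectors[query_id]
    scored = [(item_id, cosine(query, vec))
              for item_id, vec in title_vectors.items() if item_id != query_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up title vectors for four items.
toy_titles = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [-1.0, 0.0]}
recalled = top_k_similar("A", toy_titles, k=2)
```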

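With title embedding with keyword weighting, we take a weighted sum according to the importance of the product words and label words in the title. This expresses the meaning of the title more accurately than simply averaging the word vectors, where the vectors of useless words get mixed in as well.

That said, title embedding does not perform as well in recommendation as embeddings trained on user interaction data. But since every item always has a title, it is well suited as one of the recall strategies for item cold start. In the recommendation application I am responsible for, title embedding is likewise used to link new items to old items that have interaction data, so that the new items get a chance at exposure.

Another business application of title embedding combined with label words and product words is card-style themed recommendation, similar to pages on Taobao where each page covers one shopping theme: pick a theme (e.g. travel) and recommend based on an item you have interacted with.

The algorithm-side implementation is not hard; let's leave it for next time.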

Reference

【不可思议的Word2Vec】 3.提取关键词 https://spaces.ac.cn/archives/4316
- computed from the skip-gram perspective
my posts
- Word2Vec (5): PyTorch Implementation of CBOW with Hierarchical Softmax
- Word2Vec (2): The Math Behind Hierarchical Softmax
gensim
- https://www.kdnuggets.com/2018/04/robust-word2vec-models-gensim.html
- Gensim Word2Vec Tutorial https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial

https://seed9d.github.io/title-embedding-with-keywords/

Author

seed9D

Posted on

2021-02-09

Updated on

2021-02-10
