Factorization Machine: the Swiss Army Knife of Recommender Systems

The Versatile Factorization Machine

In recommendation, all kinds of DNN models dominate these days, yet there is a not-so-deep model that is small and elegant: it can do both retrieval (recall) and ranking, and computationally it is nowhere near as slow and expensive as a DNN.

It is the Factorization Machine (FM), the forerunner of the many DNNs now applied to recommender systems. A model this versatile is worth having in your toolbox.

We all know the essence of a DNN is high-order feature crossing, and you can already see the first signs of this in FM:

The first two terms of FM are just LR: $w_0 + \sum^n_{i=1}w_ix_i$, and the last term is the second-order feature-cross term. The form looks a lot like Wide & Deep; FM is essentially its predecessor, with the linear term handling memorization and the cross term handling generalization.
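For readers who cannot see the rendered formula, Eq. (1) referred to throughout this post is the standard second-order FM (written here for reference):

$$\hat{y}(x) = w_0 + \sum^n_{i=1} w_i x_i + \sum^n_{i=1} \sum^n_{j=i+1} \langle v_i, v_j \rangle\, x_i x_j$$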

FM can do both retrieval and ranking, and it can even coarse-rank (pre-score) the tens of thousands of entities coming out of multi-channel retrieval. How does it manage all that?

Let's keep reading.


Label-Encoding Features

First of all, in recommendation most features can be discretized / label-encoded, i.e. bucketized into labels. After discretization a feature $x$ is either 0 or 1, so Eq. (1) can be simplified into Eq. (2).
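A hedged reconstruction of that simplified form (the post's Eq. (2)): with binary features only the set of active feature indices $\mathcal{X}$ contributes, and the $x_i$ factors drop out:

$$\hat{y} = w_0 + \sum_{i \in \mathcal{X}} w_i + \sum_{i, j \in \mathcal{X},\ i < j} \langle v_i, v_j \rangle$$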

A label-encoding example:

The gender feature has three values: Men, Women, Unknown

  • After label encoding the values are 0 to 2
```
'Men': 0,
'Women': 1,
'Unknown': 2
```

  • In fact, label-encoding gender is just the other side of the coin of a 3-dimensional one-hot encoding
```
0: [1, 0, 0],
1: [0, 1, 0],
2: [0, 0, 1]
```

Label encoding is essentially a compressed one-hot encoding: it skips storing the zeros, since zero entries contribute nothing to a matrix multiplication. At computation time you only need to pull the row of the weight matrix at the index given by the label.
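A tiny numpy sketch of that equivalence (illustrative only; the names are not from the post):

```python
import numpy as np

W = np.random.randn(3, 4)           # weight rows for a 3-value field, 4-dim output
one_hot = np.array([0., 1., 0.])    # 'Women' as one-hot
label = 1                           # 'Women' as a label

assert np.allclose(one_hot @ W, W[label])   # matmul with a one-hot == row lookup
```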

When people say the feature dimensionality in a recommender system is in the millions or tens of millions, they mean the dimensionality after one-hot expansion. The actual computation never needs that many dimensions, because the features are extremely sparse.

FM Vector Retrieval

Math Background

First split the hidden vectors $v$ in the second-order term of Eq. (2) into item hidden vectors $v^{item}$ and user hidden vectors $v^{user}$.

  • $V^{user}_u$ is the element-wise sum of the hidden vectors of all user features of user-id $u$

```python
'''
@ one_user_hidden_vectors shape: [number of user features, hidden vector dimension]
'''
sum_of_one_user_hidden_vector = np.sum(one_user_hidden_vectors, axis=0)  # shape: [hidden vector dimension]
```

  • $V^{item}_i$ is the element-wise sum of the hidden vectors of all item features of item-id $i$

  • $n^{user} + n^{item} = n$

In plain words, Eq. (3) says the second-order term is the sum of three parts.
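A hedged reconstruction of that decomposition from the definitions above, with $\mathcal{U}$ and $\mathcal{I}$ the active user and item feature indices: the crosses within the user side, the crosses within the item side, and a user-item part that collapses into a single inner product of the summed hidden vectors.

$$\sum_{a < b} \langle v_a, v_b \rangle x_a x_b = \underbrace{\sum_{a < b,\ a,b \in \mathcal{U}} \langle v_a, v_b \rangle}_{\text{user-internal cross}} + \underbrace{\sum_{a < b,\ a,b \in \mathcal{I}} \langle v_a, v_b \rangle}_{\text{item-internal cross}} + \underbrace{\langle V^{user}_u, V^{item}_i \rangle}_{\text{user-item cross}}$$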

Decompose the first-order term of Eq. (2) the same way:

  • $w^{user}$ is the weight of a user feature
  • $w^{item}$ is the weight of an item feature
  • $W^{user}$ is the sum of the weights of all user features
  • $W^{item}$ is the sum of the weights of all item features

Rewriting Eq. (2) with Eq. (3) and Eq. (4):

  • The bias is the same for everyone, and all we want is a relative score, so it can be dropped
  • $V^{user}_{u}$ is the element-wise sum of the user features' hidden vectors of user-id $u$
  • $V^{item}_i$ is the element-wise sum of the item features' hidden vectors of item-id $i$

When doing vector retrieval, a recommender system usually scores candidates with a similarity between two vectors (e.g. cosine similarity). Eq. (5) can be recast as the inner product of two vectors, i.e.

  • $E^{user}_u$ is the embedding of user-id $u$
  • $E^{item}_i$ is the embedding of item-id $i$

The derivation from Eq. (5) to Eq. (6) is quite tricky, but the figure below makes it obvious.

Figure 1 user embedding and item embedding

  • The user embedding vector $E^{user}_u$ has dimension $dim(V^{user}_u) + 2$: two extra dimensions are concatenated to the left of $V^{user}_{u}$
  • The item embedding vector $E^{item}_{i}$ has dimension $dim(V^{item}_i) + 2$: two extra dimensions are concatenated to the left of $V^{item}_i$

Compute the inner product of the two vectors and, magically, Eq. (7) comes out equal to Eq. (5)!
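A hedged sketch of that construction, matching Figure 1 and the embedding-composition comments in the FMEmbedding code further down:

$$E^{user}_u = \big[\ 1,\ \ W^{user}_u + \text{cross}^{user}_u,\ \ V^{user}_u\ \big], \qquad E^{item}_i = \big[\ W^{item}_i + \text{cross}^{item}_i,\ \ 1,\ \ V^{item}_i\ \big]$$

so that

$$\langle E^{user}_u, E^{item}_i \rangle = W^{item}_i + \text{cross}^{item}_i + W^{user}_u + \text{cross}^{user}_u + \langle V^{user}_u, V^{item}_i \rangle$$

where $\text{cross}^{user}_u$ and $\text{cross}^{item}_i$ are the user-internal and item-internal cross terms from Eq. (3); the right-hand side is exactly Eq. (5) with the global bias dropped.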

To sum up: FM's output score $y$ can be expressed as the inner product of two embedding vectors $E^{user}_u$ and $E^{item}_i$, a property that is extremely handy for retrieval.

Usage Guide

After training FM offline, store the user embeddings $E^{user}$ and the item embeddings $E^{item}$ in Faiss or any other vector similarity search engine.

Online, when a recommendation is triggered for user-id $u$:

  • I2I
    1. Fetch the item-ids $I$ that user-id $u$ has interacted with
    2. For each item-id $i \in I$, fetch its item embedding $E^{item}_i$
    3. Search Faiss with $E^{item}_i$ for similar item-ids, completing I2I retrieval
  • U2I
    1. Fetch the user embedding $E^{user}_u$ of user-id $u$
    2. Search Faiss with $E^{user}_u$ for similar item-ids, completing U2I retrieval

The I2I path above can also be done offline: precompute the top-K I2I list for every item-id $i$, store it in Redis, and online simply look up the item-ids for item-id $i$.

Training Tips

When FM is used for retrieval, the goal is to recall as many items the user is interested in as possible, which differs somewhat from ranking, whose goal is a precise ordering.

How interesting a retrieved item is comes from the inner-product score, so we can borrow the negative-sampling ideas Tomas Mikolov used in word2vec (a sketch follows this list):

  • Downsample the positive samples, reducing the share of popular items among the positives

    • In word2vec, $\alpha$ is set to 0.001

  • Draw negative samples; negatives usually far outnumber the positives

    • In word2vec, $\beta = \cfrac{3}{4}$
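A minimal numpy sketch of both tricks, assuming we only have per-item interaction counts (all names here are illustrative, not from the post):

```python
import numpy as np

item_counts = np.array([5000, 1200, 300, 40, 8], dtype=np.float64)  # interactions per item
freq = item_counts / item_counts.sum()

# word2vec-style downsampling of popular positives: keep probability shrinks with frequency
alpha = 1e-3
keep_prob = np.minimum(1.0, np.sqrt(alpha / freq))

# word2vec-style negative sampling: draw negatives with probability proportional to freq^(3/4)
beta = 3 / 4
neg_prob = freq ** beta
neg_prob /= neg_prob.sum()
negatives = np.random.choice(len(item_counts), size=10, p=neg_prob)
```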

With the positive and negative samples settled, let's talk about the loss function.

Since the goal of retrieval is to recall as many interesting items as possible, we can use a hinge loss to widen the margin between positive and negative samples,

or use the Bayesian Personalized Ranking loss (BPR) to increase the probability that positives rank ahead of negatives.

Either way we learn the model parameters $\theta$ by minimizing the NLL:
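The rendered loss is not reproduced in this copy; a hedged version of the two options, writing $\hat y_{u,i^+}$ and $\hat y_{u,i^-}$ for the FM scores of a positive and a negative item of user $u$:

$$\mathcal{L}_{hinge} = \sum_{(u, i^+, i^-)} \max\big(0,\ m - (\hat y_{u,i^+} - \hat y_{u,i^-})\big), \qquad \mathcal{L}_{BPR} = -\sum_{(u, i^+, i^-)} \ln \sigma\big(\hat y_{u,i^+} - \hat y_{u,i^-}\big)$$

Minimizing $\mathcal{L}_{BPR}$ over $\theta$ is the negative log-likelihood minimization referred to above.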

FM Ranking

When FM is used for ranking it can serve both coarse ranking (pre-ranking) and fine ranking. The computation is identical; the coarse ranker should just use far fewer features than the fine ranker, because it has to score and truncate the tens of thousands of entities produced by multi-channel retrieval.

Usage

Online, when a recommendation is triggered for user-id $u$:

  1. Fetch the user embedding $E^{user}_u$ of user-id $u$ and the item embeddings $E^{item}_{I}$ of the items $I$ to be ranked
  2. Take the inner product of $E^{user}_u$ with every $E^{item}_I$ to get the scores $\text{S}_I^u$; by Eq. (7) this equals the FM score
  3. Sort $S^u_{I}$ and truncate to the top K

Some Tips

At ranking time, if no features depend on online context, just fetch $E^{user}_u$ and $E^{item}_i$ from the DB and compute the score with Eq. (6).

If there are context-dependent features, the score has to be computed according to Eq. (5).

For example, if the item side has a feature like "which retrieval channel did this candidate come from", that is context known only online and cannot be precomputed offline, so all the parameters $\theta$ of the FM model in Eq. (1) have to be stored and shipped online, and the score computed following Eq. (5).

Also, when computing the cross-product sum of hidden vectors, don't forget there is a cheaper way to evaluate it.
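That cheaper form is Rendle's well-known $O(kn)$ identity (the same expression that shows up in the model notes below):

$$\sum^n_{i=1} \sum^n_{j=i+1} \langle v_i, v_j \rangle\, x_i x_j = \frac{1}{2} \sum^k_{f=1} \left[ \left( \sum^n_{i=1} v_{i,f}\, x_i \right)^2 - \sum^n_{i=1} v_{i,f}^2\, x_i^2 \right]$$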

Implementing FM in PyTorch

This only illustrates the flow: the retrieval task simply uses a logits loss as the criterion, and no sampling is applied.

The dataset is MovieLens ml-1m.

MovieLens

Bucketize and Label Features

As mentioned above, features in a recommender system have to be bucketized and label-encoded. The MovieLens features are already bucketized, so we only need to label-encode them.

P.S. sklearn has ready-made sklearn.preprocessing.LabelEncoder and sklearn.preprocessing.KBinsDiscretizer for this.
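A hedged sketch of the sklearn route (aliased to avoid clashing with the post's own LabelEncoder class below; the toy frame is not the ml-1m schema):

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, LabelEncoder as SkLabelEncoder

df = pd.DataFrame({'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
                   'age': [12, 25, 33, 47, 61, 70]})

# categorical field -> integer labels
df['gender_label'] = SkLabelEncoder().fit_transform(df['gender'])

# continuous field -> quantile buckets; each bucket id is then the label
kbins = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
df['age_label'] = kbins.fit_transform(df[['age']]).ravel().astype(int)
```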

The hand-rolled LabelEncoder used in this post (imports added so it runs standalone):

```python
from collections import OrderedDict
from typing import List

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError


class LabelEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_encode: List[str]):
        self.columns_to_encode = columns_to_encode
        self.unseen = -1

    def fit(self, df: pd.DataFrame, y=None):
        df_ = df[self.columns_to_encode].copy()

        df_[self.columns_to_encode] = df[self.columns_to_encode].astype('str')

        unique_column_vals = {col: df_[col].unique()
                              for col in self.columns_to_encode}

        self.encoding_dict_ = OrderedDict()

        for col in self.columns_to_encode:
            unique_value = unique_column_vals[col]
            self.encoding_dict_[col] = {val: idx for idx, val in enumerate(unique_value)}
            # reserve one extra label for values unseen at fit time
            self.encoding_dict_[col][self.unseen] = len(self.encoding_dict_[col])

        return self

    def transform(self, df: pd.DataFrame):
        try:
            self.encoding_dict_
        except AttributeError:
            raise NotFittedError(
                "This LabelEncoder instance is not fitted yet. "
                "Call 'fit' with appropriate arguments before using this LabelEncoder."
            )
        df_ = df.copy()
        filtered_columns = [col for col in self.columns_to_encode if col in df_.columns]
        df_[filtered_columns] = df_[filtered_columns].astype('str')

        for col, encoding_map in self.encoding_dict_.items():
            original_value = [f for f in encoding_map.keys() if f != self.unseen]
            if col in filtered_columns:
                # map values unseen at fit time to the reserved unseen label
                df_[col] = np.where(df_[col].isin(
                    original_value), df_[col], self.unseen)
                df_[col] = df_[col].apply(lambda x: encoding_map[x])
        return df_
```
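A hedged sketch of how this encoder might be fitted on the joined ml-1m frame (column names other than gender/userId/movieId are assumptions about the schema):

```python
columns = ['userId', 'movieId', 'gender', 'age', 'occupation', 'genres']
id_encoder = LabelEncoder(columns_to_encode=columns)
id_encoder.fit(df)                      # df: the joined ml-1m dataframe (assumed)
labeled_df = id_encoder.transform(df)
```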

Before labeling:

After labeling:

P.S. After label encoding, the label of userId is different from the original id; it's easy to mix the two up.

Print the mapping for gender after encoding:

```python
In:
id_encoder.encoding_dict_['gender']
Out:
{-1: 2, 'F': 0, 'M': 1}
```

FM Model

Implementing FM ourselves:

```python
from typing import Dict, List, Tuple

import numpy as np
import torch


class FactorizationMachine(torch.nn.Module):
    def __init__(self, reduce_sum=True):
        super().__init__()
        self.reduce_sum = reduce_sum

    def forward(self, x):
        square_of_sum = torch.sum(x, dim=1) ** 2
        sum_of_square = torch.sum(x ** 2, dim=1)
        ix = square_of_sum - sum_of_square
        if self.reduce_sum:
            ix = torch.sum(ix, dim=1, keepdim=True)  # [b_size, 1]
        return 0.5 * ix


class FactorizationMachineModel(torch.nn.Module):
    def __init__(self, fields_index: Dict[str, int], fields_dims: Dict[str, int], embed_dim):
        super(FactorizationMachineModel, self).__init__()
        self.fields_index = fields_index
        self.all_fields = [k for k, v in sorted(self.fields_index.items(), key=lambda kv: kv[1])]
        self.embed_dim = embed_dim

        self.fields_dims_dict = fields_dims
        self.fields_dims = np.array([self.fields_dims_dict[f] for f in self.all_fields])

        self.fields_offset = torch.tensor((0, *np.cumsum(self.fields_dims)[:-1]), dtype=torch.long).requires_grad_(False).unsqueeze(0)
        '''
        fields_offset shape: [1, num_fields]
        fields_dims: np.array([10, 20, 5, 7, 9])
        (0, *np.cumsum(self.fields_dims)[:-1]) = (0, 10, 30, 35, 42)
        '''
        self.fields_range = self._gen_fields_range(self.fields_dims)
        self.embedding = torch.nn.Embedding(self.fields_dims.sum(), embed_dim)
        torch.nn.init.xavier_uniform_(self.embedding.weight.data)

        self.fc = torch.nn.Embedding(self.fields_dims.sum(), 1)
        self.bias = torch.nn.Parameter(torch.zeros((1, )))

        self.fm = FactorizationMachine(reduce_sum=True)

    def _forward_embedding(self, X: torch.Tensor):
        return self.embedding(X)  # [b_size, field_num, embed_dim]

    def _forward_linear(self, X: torch.Tensor):
        return torch.sum(self.fc(X), dim=1) + self.bias  # [b_size, output_dim]

    def get_field_vector(self, field_name: str, label: int) -> torch.Tensor:
        assert field_name in self.fields_index
        assert label < self.fields_dims_dict[field_name]
        field_index = self.fields_index[field_name]
        field_range = self.fields_range[field_index]

        field_vector = self.embedding.weight.data[field_range[0]: field_range[1]]
        return field_vector[label]

    def _gen_fields_range(self, fields_dims: np.ndarray) -> List[Tuple[int, int]]:
        fields_offset = [0, *np.cumsum(fields_dims)]
        return [(s, e) for s, e in zip(fields_offset, fields_offset[1:])]

    def forward(self, X: torch.Tensor):
        # X [b_size, num_fields]
        if X.get_device() != self.fields_offset.get_device():
            self.fields_offset = self.fields_offset.to(X.device)

        X = X + self.fields_offset
        X = self._forward_linear(X) + self.fm(self._forward_embedding(X))  # [b_size, 1]
        return X.squeeze(1)  # [b_size]

    def predict_probs(self, X: torch.Tensor):
        with torch.no_grad():
            out = self.forward(X)
            return torch.sigmoid(out)
```

  • self.embedding holds the hidden vectors, i.e. $v$ in Eq. (1)
  • self.fc and self.bias hold the weights $w$ and the bias $w_0$ of Eq. (1), respectively
  • The instance self.fm computes the cross-product sum of hidden vectors: $\frac{1}{2} \sum_{f=1}^k{\left( \left(\sum_{i=1}^n{v_{i,f}x_i}\right)^2 - \sum_{i=1}^nv_{i,f}^2 x_i^2 \right)}$
  • self.fields_offset stores the starting row of each feature field in self.embedding
    • e.g. the gender feature has 3 values Men, Women, Unknown and forms one field, so gender occupies three rows of self.embedding; self.fields_offset stores the row index in self.embedding of gender's first value, Men (a construction sketch follows this list)
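A hedged example of wiring the model up from the label-encoded fields (the field names and dimensions below are assumptions, not values from the post; embed_dim=50 is chosen to match the 52-dimensional embeddings printed later):

```python
# hypothetical field layout for ml-1m after label encoding
fields_index = {'userId': 0, 'movieId': 1, 'gender': 2, 'age': 3}
fields_dims = {'userId': 6041, 'movieId': 3953, 'gender': 3, 'age': 8}   # rows per field (incl. unseen)

model = FactorizationMachineModel(fields_index, fields_dims, embed_dim=50)

# one batch of label-encoded samples: [b_size, num_fields], columns ordered by fields_index
X = torch.tensor([[0, 10, 1, 3],
                  [5, 42, 0, 6]], dtype=torch.long)
logits = model(X)                # [b_size]
probs = model.predict_probs(X)   # sigmoid(logits)
```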

Training

See seed9D/hands-on-machine-learning for the full training code.
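As a stand-in for that notebook, a minimal sketch of the training loop with a plain logits loss (as stated earlier, no sampling); train_loader and the 0/1 label are assumptions:

```python
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for X_batch, y_batch in train_loader:      # assumed DataLoader of (label-encoded fields, 0/1 label)
        optimizer.zero_grad()
        logits = model(X_batch)                # [b_size]
        loss = criterion(logits, y_batch.float())
        loss.backward()
        optimizer.step()
```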

Composing $E^{user}$ and $E^{item}$

After training, take the weights $W$ and hidden vectors $V$ out of the FM model and compose the user embedding and item embedding as shown in Figure 1.

```python
from collections import namedtuple
from typing import List, Tuple

import numpy as np
import pandas as pd

ID2Vector = namedtuple('id2vector', 'id vector')


class FMEmbedding:
    def __init__(self, id_encoder, user_fields, item_fields, fm_model):
        self.id_encoder = id_encoder
        self.user_fields = user_fields
        self.item_fields = item_fields
        self.fm_model = fm_model

        self.fields_dims_dict = fm_model.fields_dims_dict
        self.fields_index = fm_model.fields_index
        self.fields_dims = fm_model.fields_dims
        self.fields_range = self._gen_fields_range(self.fields_dims)

        self.fields_vectors = self.fm_model.embedding.weight.data.numpy()  # hidden vectors
        self.fields_weights = self.fm_model.fc.weight.data.numpy()
        self.bias = self.fm_model.bias.data.numpy()

    def get_user_embedding(self, primitive_features: pd.DataFrame, userId_column='userId') -> List[ID2Vector]:
        assert set(primitive_features.columns) == set(self.user_fields)
        user_id_features = self.id_encoder.transform(primitive_features).values
        user_fields_index = [self.fields_index[col] for col in self.user_fields]

        fields_offset = np.array([0, *np.cumsum(self.fields_dims)[:-1]])[user_fields_index].reshape((1, -1))
        user_IDs_features = user_id_features + fields_offset

        '''field vector'''
        user_IDs_vectors = np.take(self.fields_vectors, user_IDs_features, axis=0)  # [b_size, user_field_size, vector_dims]
        user_fields_cross = self._cross_vector(user_IDs_vectors)  # [b_size]

        user_IDs_vectors = user_IDs_vectors.sum(axis=1)  # [b_size, vector_dims]

        '''field weight'''
        user_fields_weights = np.take(self.fields_weights, user_IDs_features, axis=0).squeeze(2)  # [b_size, user_field_size]
        user_fields_weights = user_fields_weights.sum(axis=1)  # [b_size]

        '''
        user embedding composition: [1 :: (user_fields_cross + user_fields_weights) :: user_IDs_vectors]
        '''
        ones = np.ones((len(user_fields_weights), 1))  # [b_size, 1]

        embedding = np.hstack([ones, np.expand_dims(user_fields_cross + user_fields_weights, 1), user_IDs_vectors])

        id2vector = [ID2Vector(userId, vector) for userId, vector in zip(primitive_features[userId_column].tolist(), embedding)]
        return id2vector

    def get_item_embedding(self, primitive_features: pd.DataFrame, itemId_col='itemId') -> List[ID2Vector]:
        assert set(primitive_features.columns) == set(self.item_fields)
        '''
        primitive_features may contain unseen itemId
        '''
        item_id_features = self.id_encoder.transform(primitive_features).values
        item_fields_index = [self.fields_index[col] for col in self.item_fields]

        fields_offset = np.array([0, *np.cumsum(self.fields_dims)[:-1]])[item_fields_index].reshape((1, -1))

        item_IDs_features = item_id_features + fields_offset

        '''field vector'''
        item_IDs_vectors = np.take(self.fields_vectors, item_IDs_features, axis=0)  # [b_size, item_fields_size, field_vector_dims]
        item_fields_cross = self._cross_vector(item_IDs_vectors)  # [b_size]
        item_IDs_vectors = item_IDs_vectors.sum(axis=1)  # [b_size, field_vector_dims]

        '''field weight'''
        item_fields_weights = np.take(self.fields_weights, item_IDs_features, axis=0).squeeze(2)  # [b_size, item_field_size]
        item_fields_weights = item_fields_weights.sum(axis=1)  # [b_size]

        # '''bias'''
        # bias = np.full((len(item_fields_weights), 1), self.bias.data)

        '''
        item embedding composition: [(item_fields_weights + item_fields_cross) :: 1 :: item field vector]
        '''
        ones = np.ones((len(item_fields_weights), 1))  # [b_size, 1]
        sum_ = np.expand_dims(item_fields_weights + item_fields_cross, 1)

        embedding = np.hstack([sum_, ones, item_IDs_vectors])
        id2vector = [ID2Vector(itemId, vector) for itemId, vector in zip(primitive_features[itemId_col].tolist(), embedding)]
        return id2vector

    def _cross_vector(self, vectors: np.ndarray) -> np.ndarray:
        '''
        @vectors: [b_size, num_fields, vector_dims]
        '''
        square_of_sum = np.sum(vectors, axis=1) ** 2
        sum_of_square = np.sum(vectors ** 2, axis=1)
        reduce_sum = np.sum(square_of_sum - sum_of_square, axis=1)  # [b_size]
        return 0.5 * reduce_sum

    def _get_field_vector(self, field_name: str, label: int) -> np.ndarray:
        assert field_name in self.fields_index
        assert label < self.fields_dims_dict[field_name]

        field_index = self.fields_index[field_name]
        field_range = self.fields_range[field_index]

        field_vector = self.fields_vectors[field_range[0]: field_range[1]]

        return field_vector[label]

    def _gen_fields_range(self, fields_dims: np.ndarray) -> List[Tuple[int, int]]:
        fields_offset = [0, *np.cumsum(fields_dims)]
        return [(s, e) for s, e in zip(fields_offset, fields_offset[1:])]
```

  • get_user_embedding composes the user embeddings
  • get_item_embedding composes the item embeddings

Print one item embedding to take a look:

```python
In:
item_id_2_vector = fm_embedding.get_item_embedding(item_df, 'movieId')
print(item_id_2_vector[0])
Out:
id2vector(
    id=1,
    vector=array([-1.6513232 ,  1.        , -0.15136771, -0.19989893, -0.17711505,
       -0.27605495, -0.17580783, -0.20358004,  0.19178246,  0.28852561,
        0.19627182, -0.21493186,  0.22243072,  0.24396233, -0.17809424,
        0.19624405,  0.21878579,  0.20799968,  0.12430456,  0.18400647,
       -0.18915175,  0.17806746,  0.21123466, -0.15553948, -0.20221886,
       -0.23133394, -0.19063245,  0.25346702, -0.19234048, -0.20784694,
       -0.12557872, -0.25455537, -0.15617739, -0.22168253, -0.14932276,
        0.25438485,  0.19298089, -0.23894864, -0.26424453, -0.18523659,
        0.20755866, -0.17146595, -0.1505574 ,  0.26266149,  0.12615746,
       -0.16710922, -0.19842891,  0.20556726,  0.16274993,  0.14940131,
        0.16785385,  0.24925581]))
```

Print one user embedding to take a look:

```python
In:
user_id_2_vector = fm_embedding.get_user_embedding(user_df, 'userId')
print(user_id_2_vector[0])
Out:
id2vector(
    id=1,
    vector=array([ 1.        ,  1.57251287,  0.31802261, -0.05311833, -0.51956671,
       -0.41219246, -0.24985576, -0.65933889,  0.59215838,  0.41595197,
        0.11497089, -0.69019455,  0.24329846,  0.45048714, -0.05286196,
        0.2054476 ,  0.30616254,  0.30125472,  0.40432882,  0.26946735,
       -0.28497586,  0.23863903,  0.37228948, -0.47429329, -0.21513283,
       -0.37172595, -0.29311907,  0.32599026, -0.37352464, -0.65939361,
       -0.13792202, -0.29631072, -0.70424175, -0.03108542, -0.47497728,
        0.65669644, -0.0536177 , -0.46591866,  0.04575365, -0.47661048,
       -0.23746635, -0.48891699, -0.39626658,  0.34011322,  0.42570654,
       -0.53802925, -0.32868454,  0.31107759,  0.65648586,  0.36503255,
        0.22789076,  0.56083   ]))
```

Loading into Faiss

Store every user-id embedding $E^{user}$ and item-id embedding $E^{item}$ in Faiss and build an index for each.

Item embedding:

```python
import faiss

'''
Offline
push item embeddings into faiss
'''
all_item_id, item_embedding = zip(*item_id_2_vector)
all_item_id = np.array(all_item_id)
item_embedding = np.array(item_embedding)

movie_embedding_faiss_index = faiss.IndexFlatIP(item_embedding.shape[1])
movie_id_faiss_index = faiss.IndexIDMap(movie_embedding_faiss_index)
movie_id_faiss_index.add_with_ids(item_embedding.astype('float32'), np.array(all_item_id))
```

User embedding:

```python
'''
Offline
push user embeddings into faiss
'''
all_user_id, user_embedding = zip(*user_id_2_vector)
all_user_id = np.array(all_user_id)
user_embedding = np.array(user_embedding)
user_embedding_faiss_index = faiss.IndexFlatIP(user_embedding.shape[1])
user_id_faiss_index = faiss.IndexIDMap(user_embedding_faiss_index)
user_id_faiss_index.add_with_ids(user_embedding.astype('float32'), np.array(all_user_id))
```

IndexFlatIP is a brute-force inner-product index. Faiss offers faster index types, but they trade away the exactness of the $E^{item}$ $E^{user}$ inner product in Eq. (6).

Applications

I2I

Computing Similarity Online

In the online version of I2I, all item embeddings $E^{item}$ are loaded into Faiss ahead of time and similarity scores are computed online in real time.

First fetch the vectors of the trigger item-ids:

```python
'''
online
fetch the movie ids the user has interacted with and their corresponding vectors
'''
item_vector_provider = VectorProvider(item_id_2_vector)
user_seen_movie_id = [10, 20, 50, 70]
vectors = np.array([item_vector_provider[id_] for id_ in user_seen_movie_id])
```
  • item_vector_provider is a lookup helper returning the vector of an item-id $i$ (a minimal sketch of it follows this list)
  • user_seen_movie_id are the trigger item-ids, taken from items the user has interacted with
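VectorProvider itself is not shown in the post; a minimal sketch consistent with how it is used here and in the ranking section (index lookup by id, plus an ID2Vector mapping attribute):

```python
class VectorProvider:
    '''Thin wrapper mapping an id to its embedding vector.'''

    def __init__(self, id_2_vector_list):
        # id_2_vector_list: List[ID2Vector], as returned by FMEmbedding
        self.ID2Vector = {pair.id: pair.vector for pair in id_2_vector_list}

    def __getitem__(self, id_):
        return self.ID2Vector[id_]
```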

Send them to Faiss to query I2I:

```python
each_match_num = 20
sim_score, movie_ids = movie_id_faiss_index.search(vectors.astype('float32'), each_match_num)
```

Print all the retrieved item-ids:

```python
In:
print(movie_ids.flatten())
Out:
array([3120, 2688, 3113, 3596, 1850, 2283, 2175, 2379,   64, 2372,  437,
       3835, 2149, 2950, 3623,  315, 2589, 3616, 1686, 1752, 1322, 3667,
       1490, 1324, 1335, 1853, 3574, 3313, 2534, 1989, 1170, 1595, 2298,
        867, 2555,  244, 2818,  220, 2816,  473,  557, 2999, 3245, 1842,
       2839, 3881, 1850, 1039, 1316,  701, 2909, 3679,   83, 1177, 2175,
       3828, 3849, 1294,  814, 1225, 3667,  244, 1490, 2382, 2365, 2898,
       1556, 3666, 2298, 1978,  387, 1853, 3407, 1980, 3041, 3220, 3042,
       1720,  148,  884])
```

Computing Similarity Offline

In the offline version of I2I, the I2I lists of all item-ids are computed offline and stored in a DB; online they are simply fetched.

Compute all the I2I results, retrieving 30 items per trigger item:

```python
each_match_num = 30
sim_score, match_movie_ids = movie_id_faiss_index.search(item_embedding.astype('float32'), each_match_num)
```

Structure the results so they are easy to persist:

```python
key_2_I = []
for trigger_id, match_ids, match_scores in zip(all_item_id, match_movie_ids, sim_score):
    k2i = {}
    k2i['trigger_key'] = trigger_id
    # drop the trigger item itself from its own match list
    pairs = [{'key': id_, 'score': s} for id_, s in zip(match_ids, match_scores) if id_ != trigger_id]
    k2i['pairs'] = pairs
    key_2_I.append(k2i)
```

Print the structure for one trigger key:

```python
In:
print(key_2_I[0])
Out:
{
 'trigger_key': 1,
 'pairs': [{'key': 2562, 'score': 8.487403},
           {'key': 3359, 'score': 7.370528},
           {'key': 853, 'score': 7.3140664},
           {'key': 1910, 'score': 7.243814},
           {'key': 3778, 'score': 7.064466},
           {'key': 2819, 'score': 6.9514284},
           {'key': 475, 'score': 6.9262433},
           {'key': 551, 'score': 6.886018},
           {'key': 1262, 'score': 6.7728105},
           {'key': 1206, 'score': 6.7252},
           {'key': 1268, 'score': 6.680213},
           {'key': 265, 'score': 6.674436},
           {'key': 722, 'score': 6.644379},
           {'key': 3281, 'score': 6.6103525},
           {'key': 2299, 'score': 6.536831},
           {'key': 1983, 'score': 6.5208454},
           {'key': 1188, 'score': 6.499143},
           {'key': 1730, 'score': 6.414803},
           {'key': 240, 'score': 6.3780065},
           {'key': 2757, 'score': 6.29891},
           {'key': 2731, 'score': 6.23072},
           {'key': 3534, 'score': 6.217514},
           {'key': 746, 'score': 6.1960998},
           {'key': 326, 'score': 6.194806},
           {'key': 3296, 'score': 6.143187},
           {'key': 1058, 'score': 6.070843},
           {'key': 2503, 'score': 6.015757},
           {'key': 1187, 'score': 5.9964414},
           {'key': 1218, 'score': 5.899559},
           {'key': 2782, 'score': 5.8903465}]
}
```

Finally, store the offline I2I results in a DB (e.g. Redis) for the online I2I path to use.
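A hedged sketch of that persistence step with redis-py (the key naming and TTL are assumptions):

```python
import json

import redis

r = redis.Redis(host='localhost', port=6379)
for k2i in key_2_I:
    # one key per trigger item; value is the serialized top-K list
    payload = json.dumps(k2i['pairs'], default=lambda o: o.item())  # numpy scalars -> native Python
    r.set(f"i2i:{k2i['trigger_key']}", payload, ex=86400)
```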

U2I

U2I uses the user embedding $E^{user}_u$ to search Faiss for similar item embeddings $E^{item}_i$.

First fetch the embedding of user-id $u$:

```python
user_vector_provider = VectorProvider(user_id_2_vector)
query_user_id = 500
u_embedding = user_vector_provider[query_user_id]
```

Throw u_embedding into Faiss to find similar items:

```python
'''
execute u2i
'''
topk = 20
sim_scores, match_movie_ids = movie_id_faiss_index.search(np.expand_dims(u_embedding.astype('float32'), 0), topk)
''' now we have scores and matched item ids '''
```

Print the item-ids:

```python
In:
print(match_movie_ids.flatten())
Out:
array([ 557, 3245, 2999, 2839, 1842,  755, 3881, 1316, 2503, 2909, 3338,
       1664,  701,  406, 3828,  682, 2833, 2811, 3517,   53])
```

Ranking

Since no context-type features are involved, we can simply fetch the user embedding $E^{user}_u$ and compute scores as inner products with multiple item embeddings $E^{item}$.

Fetch the item-ids to be ranked:

```python
'''
suppose we have the following match items from different methods
'''
uid = 1000
match_from_A = np.random.choice(np.array(list(item_vector_provider.ID2Vector.keys())), size=30)
match_from_B = np.random.choice(np.array(list(item_vector_provider.ID2Vector.keys())), size=30)
match_from_C = np.random.choice(np.array(list(item_vector_provider.ID2Vector.keys())), size=30)
all_match = np.hstack([match_from_A, match_from_B, match_from_C])
```

Fetch the user embedding and the embeddings of all item-ids to be ranked:

```python
'''
online vector providers
'''
item_vector_provider = VectorProvider(item_id_2_vector)
user_vector_provider = VectorProvider(user_id_2_vector)
u_vector = user_vector_provider[uid]
i_vectors = np.array([item_vector_provider[movie_id] for movie_id in all_match])
```

Compute scores with the inner product:

```python
sigmoid = lambda x: 1 / (1 + np.exp(-x))   # assumed helper; not shown in the post
scores = np.dot(i_vectors, u_vector)
predicted_prob = sigmoid(scores)
```

In practice only scores is needed; the sigmoid is redundant because both give the same ordering, unless you need the probabilities for something else.

```python
In:
print((all_match[np.argsort(predicted_prob)[::-1]] == all_match[np.argsort(scores)[::-1]]).all())
Out:
True
```

Look at the sorted scores:

```python
In:
sorted_index = np.argsort(scores)[::-1]
for idx in sorted_index:
    print("id:{}, score: {}".format(all_match[idx], scores[idx]))
Out:
id:669, score: 3.925238274386397
id:942, score: 3.5017253691160786
id:3030, score: 3.2159240692623685
id:962, score: 2.5550285960726438
id:3578, score: 2.2348229702588718
id:1226, score: 2.1864282823175265
id:3057, score: 2.0258533377047634
id:128, score: 1.9924200765247109
id:3559, score: 1.86036771697748
id:3147, score: 1.8530369719991735
id:1301, score: 1.721844059286982
id:2843, score: 1.6223211152117063
id:2686, score: 1.4088124542846256
id:106, score: 1.3856411763378178
id:2117, score: 1.2064026627971622
id:1625, score: 1.0726163382326743
id:3746, score: 0.9731542416184148
id:3504, score: 0.9079088224687981
id:373, score: 0.8813322676009746
id:3101, score: 0.8557414634608994
id:2236, score: 0.8039020559275596
id:3405, score: 0.7829994401182859
id:1210, score: 0.7523971549813273
id:3255, score: 0.7294679680393682
```

Last but not Least

For ease of understanding, this post implements the idea in Python + PyTorch. In industry, the offline side would typically process the data and train FM on Spark, while the online recommendation service would be written in Java or something else.

On Spark you have to implement the FM model yourself; my own implementation is used in production at work, so I can't paste it here. Spark 3.0 apparently ships one, but version upgrades are always a headache.
