```python
class HuffmanTree:
    def __init__(self, fre_dict):
        self.root = None
        freq_dict = sorted(fre_dict.items(), key=lambda x: x[1], reverse=True)
        self.vocab_size = len(freq_dict)
        self.node_dict = {}
        self._build_tree(freq_dict)

    def _build_tree(self, freq_dict):
        '''
        freq_dict is in descending order of frequency.
        node_list has two parts: [leaf nodes :: internal nodes];
        the leaf nodes are sorted by frequency in descending order.
        '''
        # create leaf nodes
        node_list = [HuffmanNode(is_leaf=True, value=w, fre=fre) for w, fre in freq_dict]
        # create internal (non-leaf) nodes
        node_list += [HuffmanNode(is_leaf=False, fre=1e10) for _ in range(self.vocab_size)]

        parentNode = [0] * (self.vocab_size * 2)  # only 2 * vocab_size - 2 entries are used
        binary = [0] * (self.vocab_size * 2)      # records turning left (0) or right (1)

        # pos1 points at the leaf node currently being considered (left part of node_list);
        # pos2 points at the internal node currently being considered (right part).
        pos1 = self.vocab_size - 1
        pos2 = self.vocab_size

        # Each iteration picks the two lowest-frequency nodes from node_list:
        # the first pick goes to min1i, the second to min2i, so min2i's
        # frequency is always >= min1i's.
        min1i = 0
        min2i = 0

        # main loop: build the Huffman tree bottom-up
        for a in range(self.vocab_size - 1):
            # first pick -> min1i
            if pos1 >= 0:
                if node_list[pos1].fre < node_list[pos2].fre:
                    min1i = pos1
                    pos1 -= 1
                else:
                    min1i = pos2
                    pos2 += 1
            else:
                min1i = pos2
                pos2 += 1
            # second pick -> min2i
            if pos1 >= 0:
                if node_list[pos1].fre < node_list[pos2].fre:
                    min2i = pos1
                    pos1 -= 1
                else:
                    min2i = pos2
                    pos2 += 1
            else:
                min2i = pos2
                pos2 += 1

            # fill in the new internal node
            node_list[self.vocab_size + a].fre = node_list[min1i].fre + node_list[min2i].fre
            node_list[self.vocab_size + a].left = node_list[min1i]
            node_list[self.vocab_size + a].right = node_list[min2i]

            # the parent is always an internal node; min1i becomes the left
            # child and min2i the right child
            parentNode[min1i] = self.vocab_size + a  # max index = 2 * vocab_size - 2
            parentNode[min2i] = self.vocab_size + a
            binary[min2i] = 1

        # generate the Huffman code of each leaf node
        for a in range(self.vocab_size):
            b = a
            code = []
            point = []
            # backtrace the path from the current node up to the root (bottom-up);
            # the root's index in node_list is 2 * vocab_size - 2
            while b != self.vocab_size * 2 - 2:
                code.append(binary[b])
                b = parentNode[b]
                # point records the internal-node indices on the path from the
                # leaf up to the root (same length as code)
                point.append(b)

            # the Huffman code should read top-down, so reverse it
            node_list[a].code_len = len(code)
            node_list[a].code = list(reversed(code))

            # 1. Record the path from the root down to the leaf (top-down).
            # 2. Shift every index by self.vocab_size so that internal nodes
            #    are indexed from zero.
            # 3. A full binary tree has exactly vocab_size - 1 internal nodes;
            #    the root's index in node_list is 2 * vocab_size - 2, which
            #    shifts to vocab_size - 2.
            node_list[a].node_path = list(reversed([p - self.vocab_size for p in point]))
            self.node_dict[node_list[a].value] = node_list[a]
        self.root = node_list[2 * self.vocab_size - 2]
```
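As an independent sanity check on the array-based build, the code lengths it produces can be compared against a standard min-heap Huffman construction. The sketch below (a hypothetical helper, not part of the original code) computes only the code lengths:

```python
import heapq
from itertools import count as _count

def huffman_code_lengths(freqs):
    """Compute Huffman code lengths with a standard min-heap build
    (assumes at least two symbols). Each heap entry carries its partial
    tree as a {symbol: depth} dict."""
    tie = _count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # merging two subtrees pushes every symbol in them one level deeper
        merged = {w: d + 1 for w, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

lengths = huffman_code_lengths({'a': 5, 'b': 2, 'c': 1, 'd': 1})
```

For these toy frequencies the most frequent symbol gets the shortest code, and the lengths satisfy Kraft equality, which is what the array-based build should reproduce.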
The tree-building process follows the C code of word2vec author Tomas Mikolov. The idea is as follows:

- Allocate one array: the left half holds the leaf nodes, the right half the non-leaf (internal) nodes.
- The leaf nodes are sorted by frequency in descending order.
- The tree is built bottom-up.
- Non-leaf nodes are filled in from the middle of the array toward the right.
- Each iteration takes the two lowest-frequency nodes among the remaining leaf nodes and the already-filled non-leaf nodes, and attaches them as children of the current non-leaf node.
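The steps above can be sketched on raw counts alone; `build_counts` below is a hypothetical helper that mirrors the bookkeeping of the C code's `count` array (no node objects, just frequencies):

```python
def build_counts(freqs_desc):
    """freqs_desc: leaf frequencies in descending order.
    Returns the internal-node counts, filled from the middle of the
    array rightward, exactly as the two-pointer merge does."""
    V = len(freqs_desc)
    count = freqs_desc + [10 ** 10] * V  # [leaves :: internal sentinels]
    pos1, pos2 = V - 1, V
    internal = []
    for a in range(V - 1):
        picks = []
        for _ in range(2):  # take the two smallest remaining nodes
            if pos1 >= 0 and count[pos1] < count[pos2]:
                picks.append(pos1)
                pos1 -= 1
            else:
                picks.append(pos2)
                pos2 += 1
        count[V + a] = count[picks[0]] + count[picks[1]]
        internal.append(count[V + a])
    return internal
```

For example, `build_counts([5, 2, 1, 1])` merges 1+1, then 2+2, then 4+5, so the last internal count always equals the total frequency.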
```python
loss = -F.logsigmoid(
    (turns.unsqueeze(2) * paths_emb * neu1.unsqueeze(1)).sum(2)
).sum(1).mean()
return loss
```
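Here `turns` is (batch, path_len) with entries ±1 (0 at padded steps), `paths_emb` is (batch, path_len, hidden), and `neu1` is (batch, hidden), so per example the expression reduces to $-\sum_j \log\sigma(\text{turn}_j \cdot \langle w_j, h\rangle)$; a padded step contributes the constant $\log 2$ and no gradient. A plain-Python sketch of what the batched expression computes for one example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_loss_one_example(turns, path_vecs, h):
    """Hierarchical-softmax loss for a single example:
    -sum_j log sigmoid(turn_j * <w_j, h>), where w_j is the vector of the
    j-th internal node on the Huffman path and turn_j is +1/-1."""
    loss = 0.0
    for turn, w in zip(turns, path_vecs):
        score = sum(wi * hi for wi, hi in zip(w, h))
        loss += -math.log(sigmoid(turn * score))
    return loss
```

With a zero score the per-node loss is exactly $\log 2$, which also shows what each padded position (turn 0) adds to the batched loss.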
```python
def _get_turns_and_paths(self, target):
    turns = []  # turn right (1) or turn left (-1) in the Huffman tree
    paths = []
    max_len = 0
    for n in target:
        n = n.item()
        node = self.huffman_tree.node_dict[n]
        # in node.code, left is 0 and right is 1
        code = target.new_tensor(node.code).int()
        # map code bits to turns: 1 -> +1, 0 -> -1
        turn = torch.where(code == 1, code, -torch.ones_like(code))
        turns.append(turn)
        paths.append(target.new_tensor(node.node_path))
        if node.code_len > max_len:
            max_len = node.code_len
    # right-pad turns with 0 and paths with the padding index vocab_size
    turns = [F.pad(t, pad=(0, max_len - len(t)), mode='constant', value=0)
             for t in turns]
    paths = [F.pad(p, pad=(0, max_len - p.shape[0]), mode='constant',
                   value=self.huffman_tree.vocab_size)
             for p in paths]
    return torch.stack(turns).int(), torch.stack(paths).long()
```
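The code-to-turn mapping and the padding logic can be illustrated without torch; `turns_and_paths` below is a hypothetical, list-based stand-in for `_get_turns_and_paths` (here `-1` plays the role of the `vocab_size` padding index used for paths):

```python
def turns_and_paths(codes, node_paths, pad_path=-1):
    """Map Huffman codes (0 = left, 1 = right) to turns (-1 / +1), then
    right-pad turns with 0 and node paths with a padding index so every
    example in the batch has the same length."""
    max_len = max(len(c) for c in codes)
    turns, paths = [], []
    for code, path in zip(codes, node_paths):
        turn = [1 if b == 1 else -1 for b in code]
        turns.append(turn + [0] * (max_len - len(turn)))
        paths.append(list(path) + [pad_path] * (max_len - len(path)))
    return turns, paths
```

A code `[0, 1, 1]` becomes turns `[-1, 1, 1]`, and a shorter code like `[1]` is padded out to the batch's maximum path length.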
- `syn1` is $W'$; its vectors correspond to the non-leaf nodes of the Huffman tree.
- In the implementation, only the row vectors of $W'$ are meaningful.
- `neu1`, i.e. $\text{h}$, is the output of the hidden layer.
- `target` is the center word $w_O$.
In the function `_get_turns_and_paths`:

- In the implementation, -1 means turn left and 1 means turn right. In fact, the two labels only need to be opposites of each other, because for a binary classification with a sigmoid, $\sigma(-x) = 1 - \sigma(x)$: flipping the sign merely swaps which child plays the role of the positive class.
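The sign convention being arbitrary follows directly from the sigmoid identity; a quick check:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Flipping the sign of the logit swaps the two class probabilities, so the
# choice of which turn is labeled +1 is pure convention.
for x in (-3.0, -0.5, 0.0, 2.0):
    assert abs(sigmoid(x) + sigmoid(-x) - 1.0) < 1e-12
```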