透視 XGBoost(2) 圖解 Classification

2021-02-162021-02-18Machine Learning22 minutes read (About 3260 words)0 visits

TL;DR

XGBoost 與 GBDT 相比，同樣都是擬合上一步 $F_{m-1}(x)$ 預測值的 residual 。不同的是 XGBoost 在擬合 residual 時更為聰明有效，他有自己衡量分裂增益 gain 的方式去擬合 residual 建立 XGB tree $f_m(x)$

$Gain = Left_{\text{similarity}} + Right_{\text{similarity}} - Root_{\text{similarity}} \tag{1}$

分裂增益 gain ，的背後原理可以透過 XGBoost 通用 objective function 並填入 classification/regression/rank/etc.. 各自的 loss function 後推導出的 $\text{first derivative of loss function} = g_i \quad \text{and} \quad \text{second derivative of loss function} = h_i$
XGBoost 的通用 objective function 由兩部份組成
- first part is loss function，根據不同任務有不同的 loss function
- second part is regularization term 防止 overfitting，限制 leaf node 輸出值大小和 leaf nodes 的總數

從 Gradient Boosting 說起

GBDT Framework

XGBoost 跟 gradient boosting algorithm 框架一樣，皆是依序建立多棵樹 $f_1(x), f_2(x), ….,f_M(x)$ 組成模型

$F_m(x) = F_0(x) + \nu\sum^m_{i=1} f_i(x) = F_{m-1} + \nu f_{m-1}(x)$

其中第 $m$ 步的 tree $f_m(x)$ 是擬合模型 $F_{m-1}(x)$ 預測值 $\text{predicted }$ 與真實值 $\text{observed}$ 的 $\text{residual}$
$\nu$ 為 learning rate

算法差別主要體現在

objective function 的設計
Step 2 (B) (C) 建樹，GBDT 是建一顆傳統 regression tree $f_m(x)$ 去擬合 $\text{residual}$; XGBoost 有自己衡量分裂 gain 的方式去擬合 residual 建立 XGB tree $f_m(x)$

可以說 XGBoost 用一種比較精準的方式去擬合 residual 建立子樹 $f_m(x)$

以下章節主要分兩大塊

XGBoost for Classification： 藉由 classification 介紹 XGBoost 如何 train 一棵 XGB tree，以圖文方式說明 XGB tree 如何擬合目標到剪枝。此章節不涉及公式證明，只有少量運算，適合快速理解 XGB 訓練流程
XGBoost Classification Math Background：此章節深入討論在前一章節中用到的公式原理，並給予證明，適合深入理解 XGBoost 為何 work

篇幅關係 XGBoost 的優化手段放在透視 XGBoost(4) 神奇 optimization 在哪裡？

XGBoost for Classification

前置 background

如同在 GBDT - classification 裡一樣一步步透視 GBDT Classifier

模型輸出 $F_m(X)$ 為 $\log(odds)$，但衡量 residual 時與 probability 有關，兩者互相轉換的公式

probability $p$ → $\log(odds)$

$\log(odds) = \cfrac{p}{(1-p)}$

$\log(odds)$ → $p$

$p = \cfrac{\exp^{\log(odds)}}{1 + \exp^{\log(odds)}}$

Data

之後用到的例子，數據如下

四筆 data samples，目標是用 Drug Dosage 預測藥物是否有效用 (Drug Effectiveness)

如何建一顆 XGB tree $f_m(x)?$

Classification problem 上，XGBoost 會在所有的特徵值中選取最佳分裂點，最終建成一顆 binary tree $f_m(x)$

建樹的過程涉及

Fitting Target 擬合目標
分裂好壞的恆量，如 CART 用 gini gain 衡量分類樹
Pruning 剪枝
Output Value 決定每個 leaf node 的唯一輸出

$f_m(x)$ 擬合的目標是什麼？

$f_m(x)$ 擬合的目標是 $\text{residual}$ ，利用 data sample x 的所有特徵建一顆特殊的 $\text{regression tree}$ 去擬合 $\text{residual}$

$\text{residual = observed - predicted}$

classification 的 residual 計算與 probability 有關，observed 即 Drug Effectiveness False → 0 True → 1 ; predicted 為模型輸出的 probability $\sigma(F_m(x))$

以 $m=1$ 舉例

XGBoost%20for%20Classification%2025279fd7fd284e38afd31bf8a6e8463e/Untitled%202.png

XGBoost 預設的初始值 $F_0(x)$ probability，預設為 0.5，$\log(odds)$ = 0
ep_0_pre 表 $F_0(x)$ 輸出的 $\log(odds)$
ep_0_prob 表 $F_0(x)$ 輸出的 probability
ep_1_residual 表 $m=1$ 時 observed y 與 ep_0_prob 的 residual

ep_1_residual 即 $m=1$ 時，XGB tree $f_1(x)$ 要擬合的目標

建 tree 時如何衡量分裂點好壞？

建分支時依序在特徵 $\text{Drug Dosage}$ 的 data sample value $\text{[3, 8, 12, 17]}$ 中尋找最優分裂點切分 residual $\text{[-0.5, 0.5, 0.5, -0.5]}$

決定分裂點的優劣取決於 gain

$Gain = Left_{\text{similarity}} + Right_{\text{similarity}} - Root_{\text{similarity}} \tag{1}$

分裂 node ，會產生 left child node $Left$ 與 right child node $Right$，分別計算三者的 similarity score

$\text{Similarity Score} = \cfrac{(\sum \text{residual}_i)^2}{ \sum[\text{previous probability}_i \times (1- \text{previous probability}_i)]+ \lambda} \tag{2}$

$\lambda$ is a regularization parameter，跟之後的 pruning 有關，這邊先假設 0
$\text{previous probability}_i$ 為上一步模型輸出的 probability $\sigma(F_{m-1}(x))$

注意！！排序的是 $\text{Drug Dosage}$，分裂點依序在 $\text{Drug Dosage}$ 中找，但被分開到左右子樹的是 $\text{residual}$

計算 Gain

以 $\text{Dosage < 15}$ 做為分裂點計算 gain，令 $\lambda =0$

root similarity $\cfrac{(-0.5 + 0.5 + 0.5 -0.5)^2}{\text{whatever you are}} =0$
left node similarity:$\cfrac{(-0.5 + 0.5 + 0.5)^2}{(0.5 \times (1-0.5)) +(0.5 \times (1-0.5)) + (0.5 \times(1-0.5)) + 0} = 0.33$
right node similarity: $\cfrac{0.5^2}{0.5 \times (1-0.5) + 0} = 1$
$\text{Gain } = 0.33 + 1 -0 = 1.33$

上面只是示範如何計算 gain ，依序將 $\text{[3, 8, 12, 17]}$ 可能分裂點計算過 gain 後選取最大的 gain 即最優分裂點

當然這只是第一個分支，往下還有第二個分支第三個分支 … etc

假設最終我們找到的樹結構為：

建完一棵樹後，接下來 XGBoost 有多個手段對樹結構 pruning

如何 Pruning - Cover

Cover is related to the minimum number of Residuals in a leaf

XGBoost 其中一個 pruning 方法叫 cover，在 classification 中 cover 的運算為式(2) similarity score 分母中的前項

$\text{Cover} = \sum[\text{previous probability}_i \times (1- \text{previous probability}_i)]$

然後會有個 $\text{Cover}_{threshold}$ 的值決定是否 pruning 此 leaf node

例如，把 $\text{Cover}_{threshold}$設為 1 時，所有小於 threshold 的 leaf node 都會被 pruning：

XGBoost 規定一顆 XGB tree $f(x)$ 不能只有 root，所以會保留一個分岔
$\text{Cover}_{threshold}$ 在 XGBoost 參數中叫 min_child_weight

如何 Pruning - $\tau$

此法對 new tree $f_m(x)$ 的 pruning 建立在 gain 上，透過設定一個 gain threshold $\tau$ 決定是否剪枝

若 $\tau = 2$，則分裂點 $＂Dosage < 5＂$ 的 $gain = 2.66 > 2 = \tau$，不會被剪枝
- $\textit{Dosage < 15}$ 的 $gain = 1.33 < 3 = \tau$ ，理應要被 pruning，但因爲子節點沒被剪枝，父節點 $\textit{Dosage < 15}$ 也保留
若 $\tau =3$，則分裂點 $＂Dosage < 5＂$ 的 $gain = 2.66 < 3 = \tau$ ，會被剪枝
- $\textit{Dosage < 15}$ 的 $gain = 1.33 < 3 = \tau$，理應要被 pruning，但因為 tree 需要一個根節點，所以保留

P.S. gain threshold $\tau$ 在 XGBoost 叫 min_split_loss

如何決定 leaf node 的 output value?

classification 中決定每個 leaf node 的 output value 公式跟式(2) similarity score 很像，差別在分子部份的 $\text{sum of residuals}$ 是否取平方

$\text{Output Value} = \cfrac{(\sum \text{residual}_i)}{ \sum[\text{previous probability}_i \times (1- \text{previous probability}_i)]+ \lambda} \tag{3}$

$\lambda$ is a regularization parameter，這邊先假設 0

計算每個 leaf node output value 後

如何整合多棵樹 $f_1(x), f_2(x),…,f_m(x)$ 的輸出

$F_M(x)$ 的計算與 GBDT 一樣

$F_M(x) = F_0(x) + \nu\sum^M_{i=1} f_i(x) = F_{M-1} + \nu f_m(x)$

得注意的是，$F_m(x)$ 輸出的是 $\log(odds)$ ，得通過 $p = \sigma(\log(odds))$ ，才會是 predicted probability。

假設只建立一顆 XGB tree $f_1(x)$， $\text{learning rate = 0.3}$

則每個 data samples 新的 prediction 為

$F_1(x) = F_0(x) + 0.3* f_1(x)$

ep_0_prob: $\sigma(F_0(x))$ 算出的 probability of Drug Effectiveness
ep_0_pre: $F_0(x)$ 輸出的 predicted log odds
ep_1_leaf_node_output: XGB tree $f_1(x_i)$ 對每個 data sample $x_i$ 的輸出值。公式請見式(3)
ep_1_pre: $F_1(x)$ 輸出的 predicted log odds, $F_1(x) = F_0(x) + \nu f_1(x)$
ep_1_prob $\sigma(F_1(x))$ 算出的 probability of Drug Effectiveness

再次提醒

$F_m(x)$ 輸出 $log(odds)$
$F_m(x)$ 輸出的 probability 為 $\sigma(F_m(x))$
$\text{residual }$ 的計算與 probability 有關

XGBoost Classification Math Background

Regression 的 Similarity Score 怎麼來的？

式 (2) similarity score 怎麼得出的

XGBoost 的通用 Objective Function

$\begin{aligned} \mathcal{\tilde{L}}^{(m)} &= \sum^{T_m}_{j=1}[(\sum_{i \in R_{m,j}}g_i) \gamma_{j,m} + \cfrac{1}{2}(\sum_{i \in R_{j,m} }h_i + \lambda)\gamma_{j,m}^2] + \tau T_m \end{aligned}\tag{4}$

$g_i$ 為 first derivative of loss function related to data sample $x_i$ : $g_i = \cfrac{d}{d \ F_{m-1}} \ l(y_i, \hat{y_i})$
$h_i$ 為 second derivative of loss function related to data sample $x_i$: $h_i = \cfrac{d^2}{d \ F_{m-1}^2} l(y_i, \hat{y_i})$
$T_m$ is the number of leaf in tree $f_m$
$\tau$ 表對 $T_m$ 的 factor
$\gamma_{j,m}$ 表 $m$ 步 XGB tree $f_m$ 第 $j$ 個 leaf node 的輸出值
$R_{m, j}$表 m 步 XGB tree $f_m$ 下第 $j$ 個 leaf node 得 data sample 集合

optimal output $\gamma_{j,m}^*$ of leaf node $j$

$\gamma_{j,m}^*=- \cfrac{\sum_{i\in R_{j,m}} g_i}{\sum_{i \in R_{j,m}}h_i + \lambda} \tag{5}$

詳細推導參閱透視 XGBoost(3) 蘋果樹下的 objective function

Classification 的 Similarity Score

式 (2) classification similarity score 如下

$\text{Similarity Score} = \cfrac{(\sum \text{residual}_i)^2}{ \sum[\text{previous probability}_i \times (1- \text{previous probability}_i)]+ \lambda} \tag{2}$

XGBoost 分裂的目的是要使式(4) objective function 越小越好，怎麼評判分裂前後收益？式(1) 評價了分裂後 left/right leaf node 可以得到的＂score＂與分裂前 root node 的＂score＂差值計算分裂前後增益 gain

$Gain = Left_{\text{similarity}} + Right_{\text{similarity}} - Root_{\text{similarity}} \tag{1}$

所以 similarity score 肯定得跟 objective function 有關，才能正確的恆量 gain，但 similarity score 是＂score “ 越大越好， objective function 是要 minimize 的，得越小越好，兩者如何扯上關係？

觀察一下 objective function

因為我們要衡量單個 node 得 gain，式 (4)中，跟某個 node 直接相關的 part 為 summation of all leaf node 裡面的 equation

$[(\sum_{i \in R_{m,j}}g_i) \gamma_{j,m} + \cfrac{1}{2}(\sum_{i \in R_{j,m} }h_i + \lambda)\gamma_{j,m}^2] \tag{6}$

$R_{m, j}$表 m 步 XGB tree $f_m$ 下第 $j$ 個 leaf node 得 data sample 集合

而式(5) 給出了單個節點 optimal output value $\gamma_{j,m}^*$

$\gamma_{j,m}^*=- \cfrac{\sum_{i\in R_{j,m}} g_i}{\sum_{i \in R_{j,m}}h_i + \lambda} \tag{5}$

將式(5) 代入式(6) 可以得出式(6)的極小值

$\begin{aligned} (\sum_{i \in R_{m,j}}g_i) \gamma_{j,m} + \cfrac{1}{2}(\sum_{i \in R_{j,m} }h_i + \lambda)\gamma_{j,m}^2 &= -（\sum_{i \in R_{j,m}}g_i)\cfrac{\sum_{i\in R_{j,m}} g_i}{\sum_{i \in R_{j,m}}h_i + \lambda} + \cfrac{1}{2}(\sum_{i \in R_{j,m} }h_i + \lambda)(\cfrac{\sum_{i\in R_{j,m}} g_i}{\sum_{i \in R_{j,m}}h_i + \lambda})^2 \\ & = -\cfrac{(\sum g_i)^2}{\sum h_i + \lambda} + \cfrac{1}{2} \cfrac{(\sum g_i)^2}{\sum h_i + \lambda} \\ & = -\cfrac{1}{2}\cfrac{(\sum g_i)^2}{\sum h_i + \lambda} \end{aligned} \tag{7}$

因爲要的是＂score＂，所以讓 loss 對 x 軸做對稱，拿掉係數項 $\frac{1}{2}$

$\text{Similarity Score} = \cfrac{(\sum g_i)^2}{\sum h_i + \lambda} \tag{8}$

$g_i$ 為 first derivative of loss function related to data sample $x_i$ : $g_i = \cfrac{d}{d \ F_{m-1}} \ l(y_i, \hat{y_i})$
$h_i$ 為 second derivative of loss function related to data sample $x_i$: $h_i = \cfrac{d^2}{d \ F_{m-1}^2} l(y_i, \hat{y_i})$
拿掉係數項沒影響，原因是計算 gain (式 (1)) 時只需要相對 (relative) 的值

式 (8) 中的 $g_i, h_i$ 分別表 loss function $l(y,\hat{y})$ 的 first derivative and second derivative

classification 常用的 loss function 為 cross entropy/logistic loss

$\begin{aligned} l(y_i,\hat{y_i}) &= -[y_i\log(p_i) + (1-y_i)\log(1-p_i)] \\ &=-y_i \log(odds) + \log(1 + \exp(\log(odds)) \\ &= -y_i F_{m-1}(x_i) + \log(1 + \exp(F_{m-1}(x_i))) \end{aligned}\tag{9}$

$F_{m-1}(x_i)$ 輸出的是 log(odds)
$p_i$ 為 $F_{m-1}(x_i)$ 輸出的 probability , $p_i= \sigma(F_{m-1}(x_i)) = \cfrac{\exp(F_{m-1}(x_i))}{1 + \exp(F_{m-1}(x_i))}$

將式 (9) 代入 $g_i, h_i$，得到 cross entropy/logistic loss 下的 $g_i, h_i$

$\begin{aligned} g_i &= \cfrac{d}{d \ F_{m-1}} \ l(y_i, \hat{y_i}) \\ &= -y_iF_{m-1}(x_i) + \log(1 + \exp(F_{m-1}(x_i)) \\ &= -y_i + \cfrac{\exp(F_{m-1}(x_i))}{1 + \exp(F_{m-1}(x_i))} \\ & = -(y_i - p_i) = \text{negatvie residual} \\ h_i &= \cfrac{d^2}{d^2 \ F_{m-1}} l(y_i, \hat{y_i}) \\ &= \cfrac{d^2}{d F_{m-1}^2} [-y_i F_{m-1}(x_i) + \log(1 + \exp(F_{m-1}(x_i)))] \\ &= \cfrac{d}{d F_{m-1}} [-y_i + \cfrac{\exp(F_{m-1}(x_i))}{1 + \exp(F_{m-1}(x_i))}] \\ &= \cfrac{\exp(F_{m-1}(x_i))}{(1 + \exp(F_{m-1}(x_i)))} \times \cfrac{1}{(1 + \exp(F_{m-1}(x_i)))} \\ &= p_i \times (1-p_i) \end{aligned} \tag{10}$

代回式(8)，得證

$\begin{aligned} \text{Similarity Score} &= \cfrac{(\sum_{i \in R_{j,m}} g_i)^2}{\sum_{i \in R_{j,m}} h_i + \lambda} \\ & = \cfrac{(\sum-(y_i-p_i))^2}{(\sum p_i (1-p_i)) + \lambda} \\ &= \cfrac{(\sum \text{residual}_i)^2}{(\sum \text{previous probability}_i \times (\text{1 - preveious probability}_i) ) + \lambda} \end{aligned}$

Regression Leaf output 公式怎麼來的？

式(3) each leaf node 的 output 公式怎麼得出的？

$\text{Output Value} = \cfrac{(\sum \text{residual}_i)}{ \sum[\text{previous probability}_i \times (1- \text{previous probability}_i)]+ \lambda} \tag{3}$

從通用 objective function 的 optimal output value $\gamma_{j,m}^*$ 可求出

$\gamma_{j,m}^*=- \cfrac{\sum_{i\in R_{j,m}} g_i}{\sum_{i \in R_{j,m}}h_i + \lambda} \tag{5}$

將式 (10)的 $g_i$ $h_i$ 代入式 (5)，得證

$\gamma_{j,m}^*=\cfrac{\sum_{i\in R_{j,m}} (y_i - p_i)}{\sum_{i \in R_{j,m}}(p_i \times (1- p_i)) + \lambda} = \cfrac{(\sum \text{residual}_i)}{ \sum[\text{previous probability}_i \times (1- \text{previous probability}_i)]+ \lambda}$

$p_i$ 為 $F_{m-1}(x_i)$ 輸出的 probability , $p_i= \sigma(F_{m-1}(x_i)) = \cfrac{\exp(F_{m-1}(x_i))}{1 + \exp(F_{m-1}(x_i))}$

關於 Cover

Cover is related to the minimum number of Residuals in a leaf

在＂如何 Pruning - Cover＂這章節提到 classification 中 cover 的運算為式(2) similarity score 分母中的前項

$\text{Cover} = \sum[\text{previous probability}_i \times (1- \text{previous probability}_i)]$

實際上就是式(8) similarity score 中分母部份的 summation of $h_i$

$\text{Cover} = \sum_{i \in R_{j,m}} h_i$

Reference

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13-17-Augu, 785–794. https://doi.org/10.1145/2939672.2939785
What makes “XGBoost” so Extreme? https://medium.com/analytics-vidhya/what-makes-xgboost-so-extreme-e1544a4433bb
- XGBoost-from-scratch-python
Boosting algorithm: XGBoost https://towardsdatascience.com/boosting-algorithm-xgboost-4d9ec0207d
https://www.hrwhisper.me/machine-learning-xgboost/

My post

透視 XGBoost(2) 圖解 Classification

https://seed9d.github.io/XGBoost-for-classification/

Author

seed9D

Posted on

2021-02-16

Updated on

2021-02-18

透視 XGBoost(2) 圖解 Classification

TL;DR

從 Gradient Boosting 說起

XGBoost for Classification

前置 background

Data

如何建一顆 XGB tree $f_m(x)?$

$f_m(x)$ 擬合的目標是什麼？

建 tree 時如何衡量分裂點好壞？

計算 Gain

如何 Pruning - Cover

如何 Pruning - $\tau$

如何決定 leaf node 的 output value?

如何整合多棵樹 $f_1(x), f_2(x),…,f_m(x)$ 的輸出

XGBoost Classification Math Background

Regression 的 Similarity Score 怎麼來的？

XGBoost 的通用 Objective Function

Classification 的 Similarity Score

Regression Leaf output 公式怎麼來的？

關於 Cover

Reference

Author

Posted on

Updated on

Licensed under

Comments

Catalogue

Recents