Introduction

CSDN | Improving the Accuracy of an SVM Classifier

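My thesis is about classifying legal documents with machine learning, but the dataset is small and I have hit a bottleneck. So I put together some ways to improve classifier accuracy. Note that this post targets traditional machine learning models such as SVM, RF, and NB, not deep learning. The main approaches are: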

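Feature engineering: choose better features
Hyperparameter tuning: finding the best hyperparameter combination can improve classifier accuracy
Data cleaning and preprocessing: data cleaning is a crucial step in machine learning, and its quality directly affects model accuracy
Kernel functions: some models, such as SVM, can use different kernels (linear, polynomial, Gaussian, and so on), and the right choice can improve accuracy
Ensemble learning: methods such as Bagging and Boosting can improve accuracy
More training data: adding training data can improve accuracy, especially for complex problems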

Hyperparameter tuning

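Have you ever been annoyed by how many hyperparameters a model has? Getting a good solution out of an algorithm requires tuning them. Hyperparameters are the settings that control how a model is trained; they are fixed before training rather than learned from the data, and the learning rate is one example. You cannot know in advance which value between 0 and 1 works best, so the only way is trial and error. And if a model has several hyperparameters, doesn't that mean thousands of combinations to try one by one?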
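
To make the combinatorial growth concrete, here is a minimal sketch that enumerates a small grid. The hyperparameter names and values are illustrative assumptions, not settings taken from this post.

```python
from itertools import product

# Hypothetical search space: names and values are for illustration only.
search_space = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "poly", "rbf"],
    "gamma": [0.001, 0.01, 0.1, 1],
}

# Build every combination that takes one value per hyperparameter.
combinations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

n_folds = 5  # assumed number of cross-validation folds

print(len(combinations))            # 48 candidate settings (4 * 3 * 4)
print(len(combinations) * n_folds)  # 240 model fits with 5-fold CV
```

Even three hyperparameters with a handful of values each already need hundreds of model fits, which is why the tools below automate the search.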

GridSearchCV

Medium | [Python] Machine Learning: Cross-Validation and Hyperparameter Tuning

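GridSearchCV can search for the best hyperparameter combination, which can improve classifier accuracy. An SVM, for example, has hyperparameters such as C, kernel, and gamma. The benefit is that you do not have to write the for loops yourself: GridSearchCV evaluates every combination in the grid with cross-validation and reports the best one.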

For how to use GridSearchCV, refer to the following code, which tunes a gradient boosting regressor:

## In many applications, we don't abuse the test set like that;
## we use cross-validation to replace multiple evaluations on the test set.

## To do CV, there is no need to write multiple loops by hand.
## Assumes learning_rates, min_samples_leafs, randomState, X_train and y_train
## were defined earlier (they are not shown in this excerpt).
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

parameters_to_search = {'learning_rate': learning_rates,
                        'min_samples_leaf': min_samples_leafs}

gb_model = GradientBoostingRegressor(n_estimators=300,
                                     subsample=0.7,
                                     n_iter_no_change=10,
                                     random_state=randomState)

## Try every combination in the grid with 5-fold cross-validation
gb_model_CV = GridSearchCV(gb_model, parameters_to_search, cv=5)
gb_model_CV.fit(X_train.fillna(-1), y_train)


## GridSearchCV runs the models and saves the results for us
gb_model_CV.cv_results_

## the mean R2 over the 5 validation folds (not the true test set)
gb_model_CV.cv_results_["mean_test_score"]

## the best one is learning_rate=0.18999999999999997, min_samples_leaf=5
gb_model_CV.best_estimator_
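
The snippet above tunes a gradient boosting regressor, while the text mentions the SVM hyperparameters C, kernel, and gamma. As a minimal sketch of the same idea for an SVM classifier, the example below runs GridSearchCV over an SVC. The iris toy dataset, the grid values, and the scaling step are all assumptions made for illustration, not the setup used in the thesis.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data stands in for the real features; swap in your own X and y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled
# using only its own training portion.
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; the "svc__" prefix routes each value to the SVC step.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Held-out accuracy:", search.score(X_test, y_test))
```

The held-out test split is only touched once, after the search has finished, so the reported test accuracy is not influenced by the tuning.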

Optuna

iThelp | [Day 21] A Handy Tool for Tuning Model Hyperparameters - Optuna

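You may have heard of scikit-learn's GridSearchCV, which brute-forces its way to the best parameters, or RandomizedSearchCV, which trains on parameters sampled at random from ranges you specify. Their shared drawback is that they are time-consuming and resource-hungry. Here we introduce Optuna, a handy tool for automatic hyperparameter search that integrates with many common machine learning algorithms.
Optuna is an automatic hyperparameter optimization framework designed for machine learning, with the following advantages: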

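Supports most ML and DL frameworks, including Scikit-learn, PyTorch, TensorFlow, XGBoost, and LightGBM
Provides explainability (XAI) for the search results
Stores the trial history and best parameters, which enables parallel optimization
Can detect and stop (prune) trials that do not meet a condition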

Based on the link above, here is a summary of how to use Optuna:

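trial: a single run of the objective; the trial object suggests the hyperparameter values to try. The goal, here maximizing accuracy, is set with direction='maximize' when the study is created
objective: defines which model to train and evaluate, and returns the accuracy so Optuna can check whether the current trial is the best (highest) so far
study: runs 50 trials to find the best hyperparameters, which are then available from study.best_params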

For the relevant parameters, see the official XGBoost documentation.

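# 1. Import the required packages (install with: pip install optuna xgboost)
# X_train, X_test, y_train, y_test and label_encoder are assumed to be
# prepared beforehand; they are not shown in this post.
import optuna
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split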

def objective(trial):
    # 2. Choose which parameters to tune: trial.suggest_*("name", range)
    params = {
        # 'eta' is an alias of 'learning_rate' in XGBoost, so only
        # 'learning_rate' is tuned here to avoid setting it twice.
        'alpha': trial.suggest_float('alpha', 1e-8, 1.0, log=True),
        'lambda': trial.suggest_float('lambda', 1e-8, 1.0, log=True),
        'grow_policy': trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"]),
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10)
    }
    
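    # 3. Initialize and train the XGBoost model
    # (use_label_encoder is deprecated in newer XGBoost versions;
    # drop it if your version warns about it)
    xgb_model = XGBClassifier(
        **params,
        use_label_encoder=False, 
        eval_metric='mlogloss'
    )
    xgb_model.fit(X_train, y_train)
    
    # 4. Predict and use the accuracy as the objective value
    y_pred_xgboost = xgb_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred_xgboost)
    
    return accuracy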

# 5. Tune the hyperparameters with Optuna
study = optuna.create_study(direction='maximize')  # we want to maximize accuracy
study.optimize(objective, n_trials=50)  # run 50 trials

# 6. Get the best hyperparameters
best_params = study.best_params
print("Best hyperparameters: ", best_params)

# 7. Train the final model with the best hyperparameters
xgb_model = XGBClassifier(
    **best_params,
    use_label_encoder=False, 
    eval_metric='mlogloss'
)
xgb_model.fit(X_train, y_train)

# 8. Predict with the final model
y_pred_xgboost = xgb_model.predict(X_test)

# 9. Evaluate the final model
print("Accuracy (XGBoost):", accuracy_score(y_test, y_pred_xgboost))
print(classification_report(y_test, y_pred_xgboost, target_names=label_encoder.classes_))

We can also use optuna.visualization to visualize the results of the search, for example:

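plot_optimization_history (visualizes the optimization process)
plot_intermediate_values (visualizes the learning curves of the trials)
plot_parallel_coordinate (visualizes relationships between parameters in high dimensions)
plot_contour (visualizes relationships between pairs of parameters)
plot_slice (visualizes individual parameters)
plot_param_importances (shows how important each parameter is to the model)
plot_edf (visualizes the empirical distribution function)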

Here we use plot_param_importances and plot_optimization_history as examples:

from optuna.visualization import plot_param_importances, plot_optimization_history

# 10. Optuna visualization: parameter importances
fig = plot_param_importances(study)
plotly_config = {"staticPlot": True}  # render a static, non-interactive figure
fig.show(config=plotly_config)

# 11. Optuna visualization: optimization history
fig = plot_optimization_history(study)
fig.show(config=plotly_config)