Traditional Machine Learning Tutorial

Chapter 4, Section 2: Supervised Learning with Decision Trees, Random Forests, and KNN

3. Decision Trees and Random Forests

3.1 Decision tree principles and implementation

A decision tree recursively partitions the feature space so that the resulting child nodes are as "pure" (dominated by a single class) as possible.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=200, n_features=5, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y)
tree = DecisionTreeClassifier(
    criterion='gini',       # or 'entropy'
    max_depth=3,
    min_samples_split=10,
    random_state=100
)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print((y_pred == y_test).mean())  # test-set accuracy

# Visualization
plt.figure(figsize=(12, 8))
plot_tree(tree, feature_names=['X1', 'X2', 'X3', 'X4', 'X5'], class_names=['0', '1'], filled=True)
plt.show()
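The `criterion='gini'` setting above controls how node purity is measured. As a minimal illustration of the formula behind it (the function name `gini_impurity` is made up here, not part of sklearn):

```python
import numpy as np

def gini_impurity(labels):
    # Gini impurity = 1 - sum(p_k^2) over the class proportions p_k;
    # 0 means a perfectly pure node, 0.5 a 50/50 binary split
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # pure node -> 0.0
print(gini_impurity([0, 0, 1, 1]))  # 50/50 split -> 0.5
```

At each split, the tree greedily picks the feature threshold that most reduces this impurity in the children.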

3.2 Random Forest

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y)
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    min_samples_split=5,
    oob_score=True,    # evaluate on out-of-bag samples
    random_state=100,
    n_jobs=-1          # use all CPU cores
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)      # predict with rf, not the earlier single tree
print((y_pred == y_test).mean())
print(rf.score(X_test, y_test))  # same accuracy, via score()
print('Feature Importances:', rf.feature_importances_)
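Because `oob_score=True` was set, the fitted forest also carries a built-in validation estimate. A brief sketch of reading it, with data generated the same way as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_classes=2, random_state=0)
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=100, n_jobs=-1)
rf.fit(X, y)

# Each tree trains on a bootstrap sample; the roughly 37% of rows it never saw
# (its "out-of-bag" rows) act as a free validation set, no train/test split needed.
print('OOB accuracy:', rf.oob_score_)
```

The OOB estimate is handy when the dataset is too small to spare a separate validation split.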

3.3 Hyperparameter tuning tips

Model              | Key hyperparameters                            | Tuning advice
DecisionTree       | max_depth, min_samples_split, min_samples_leaf | Start with max_depth in the 3~10 range
RandomForest       | n_estimators, max_depth, max_features          | Larger n_estimators is better (until gains plateau); max_features='sqrt' (classification) or 'log2' (regression)
SVM                | C, gamma, kernel                               | Pick the kernel first (RBF by default), then grid-search C and gamma
LogisticRegression | C, penalty                                     | Use LogisticRegressionCV to select C automatically

Automatic hyperparameter tuning with GridSearchCV (Random Forest example)

# First, load the data and set up the random forest estimator
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y)
rf = RandomForestClassifier(
    random_state=100,
    n_jobs=8  # use 8 CPU cores
)

# Load the automatic tuning module
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, 9, None],
    'max_features': ['sqrt', 'log2']
}
grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=8,   # use 8 CPU cores
    verbose=0   # progress logging: 0 = silent, 1 = brief, 2 = detailed
)
grid.fit(X_train, y_train)
grid.fit(X_train,y_train)

# Extract the best model (already refit on the full training set with the best parameters)
best_model = grid.best_estimator_
print('Best params:', grid.best_params_)
print('Best CV score:', grid.best_score_)

# Use the best model to predict on the test set
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)  # predicted probabilities

print('Accuracy:', (y_test == y_pred).mean())

# Save the model to disk for later reuse
import joblib
joblib.dump(best_model, 'best_random_forest_classifier.pkl')

# Load the model and predict
loaded_model = joblib.load('best_random_forest_classifier.pkl')
y_pred_loaded = loaded_model.predict(X_test)
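One can sanity-check that the joblib round trip preserved the fitted model by comparing the loaded model's predictions with the in-memory one. A small self-contained sketch (the filename `rf_roundtrip_check.pkl` is illustrative):

```python
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=100).fit(X, y)

joblib.dump(model, 'rf_roundtrip_check.pkl')
loaded = joblib.load('rf_roundtrip_check.pkl')

# Identical hyperparameters and fitted state should give identical predictions
assert np.array_equal(model.predict(X), loaded.predict(X))
print('round trip OK')
```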

4. K-Nearest Neighbors (KNN)

The core of KNN is simple: find the k nearest neighbors of a query point, then take a majority vote of their labels (classification) or their average (regression).
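That core idea fits in a few lines of NumPy. This toy sketch (the function name `knn_predict` is made up for illustration, not sklearn's API) classifies one query point by majority vote among its k nearest training points:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two well-separated clusters: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # -> 1
```

Note there is no training step at all: KNN simply memorizes the data, which is why prediction cost grows with the training-set size.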

4.1 Basic usage

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# KNN is distance-based, so features must be standardized first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(
    n_neighbors=5,      # number of neighbors
    weights='uniform',  # or 'distance'
    metric='minkowski', # distance metric
    p=2                 # p=2: Euclidean distance; p=1: Manhattan distance
)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
print('KNN Accuracy:', knn.score(X_test_scaled, y_test))
print((y_pred == y_test).mean())  # same accuracy, computed directly

4.2 Distance metrics and weighting

Parameter | Options                                                      | Notes
metric    | 'euclidean', 'manhattan', 'chebyshev', 'minkowski'           | Controls how distance is computed
weights   | 'uniform' (equal weight), 'distance' (weight = 1 / distance) | With 'distance', closer samples have more influence
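The metrics in the table are easy to compare by hand; for the pair of points (0, 0) and (3, 4):

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of per-coordinate differences
chebyshev = np.max(np.abs(a - b))          # largest single-coordinate difference

print(euclidean, manhattan, chebyshev)  # 5.0 7.0 4.0
```

Euclidean and Manhattan are the Minkowski metric with p=2 and p=1 respectively; Chebyshev is its p-to-infinity limit.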

KNN's fatal weaknesses: