1. Task
Systematically apply eight mainstream classification algorithms to the mean-imputed dataset and compare their accuracy, recall, and other metrics for mineral type prediction.
2. Core Work
1. Model practice: implement each of the eight algorithms in turn.
2. Horizontal comparison: evaluate every model on the same test set with accuracy, recall, and related metrics for a fair comparison.
3. Result archiving: save all evaluation results to a structured JSON file for later analysis and reporting.
3. Data Preparation
Use the dataset preprocessed with mean imputation.
```python
import pandas as pd

# Load the cleaned training data
train_data = pd.read_excel(r'.//填充后的数据//训练数据集[平均值].xlsx')
train_data_x = train_data.drop(['矿物类型'], axis=1)
train_data_y = train_data['矿物类型']

# Load the cleaned test data
test_data = pd.read_excel(r'.//填充后的数据//测试数据集[平均值].xlsx')
test_data_x = test_data.drop(['矿物类型'], axis=1)
test_data_y = test_data['矿物类型']

result_data = {}
```
4. Model Introduction and Implementation
4.1 Logistic Regression
Core concept: a classic linear classification model that maps the linear regression output to a probability via the sigmoid function.
Key parameters: C, penalty, solver.
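As a quick illustration of the core idea (not part of the pipeline itself), the sigmoid maps any linear score z to a probability in (0, 1); a minimal sketch:

```python
import math

def sigmoid(z):
    # Map a linear score z to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# A score of 0 sits exactly on the decision boundary
print(sigmoid(0.0))   # 0.5
print(sigmoid(2.0))   # ~0.88, confidently positive
print(sigmoid(-2.0))  # ~0.12, confidently negative
```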
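The implementations below extract per-class recall by tokenizing classification_report's formatted string with split(), which is fragile (token positions shift if labels change). A less brittle alternative, sketched here on toy labels 0–3 standing in for the real predictions, is the report's output_dict=True option:

```python
from sklearn import metrics

y_true = [0, 0, 1, 1, 2, 2, 3, 3]   # toy labels standing in for test_data_y
y_pred = [0, 0, 1, 0, 2, 2, 3, 1]   # toy predictions standing in for test_predicted

# output_dict=True returns a nested dict instead of a formatted string
report = metrics.classification_report(y_true, y_pred, output_dict=True, zero_division=0)

LR_result = {f'recall_{c}': report[str(c)]['recall'] for c in range(4)}
LR_result['acc'] = report['accuracy']
print(LR_result)
```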
```python
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Note: scikit-learn >=1.2 expects penalty=None instead of penalty='none'
lr = LogisticRegression(C=0.001, max_iter=100, penalty='none', solver='lbfgs')
lr.fit(train_data_x, train_data_y)

train_predicted = lr.predict(train_data_x)  # predictions on the training set
print('LR train:\n', metrics.classification_report(train_data_y, train_predicted))
test_predicted = lr.predict(test_data_x)  # predictions on the test set
print('LR test:\n', metrics.classification_report(test_data_y, test_predicted))

a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
LR_result = {}
LR_result['recall_0'] = float(b[6])   # recall for class 0
LR_result['recall_1'] = float(b[11])  # recall for class 1
LR_result['recall_2'] = float(b[16])  # recall for class 2
LR_result['recall_3'] = float(b[21])  # recall for class 3
LR_result['acc'] = float(b[25])       # overall accuracy
result_data['LR'] = LR_result
print('LR training finished')
```
4.2 Random Forest
Core concept: an ensemble learning algorithm that builds many decision trees and aggregates their votes to improve prediction accuracy and stability.
Key parameters: n_estimators (number of trees), max_depth (maximum tree depth), max_features (number of features considered when searching for the best split).
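The voting step can be traced with a minimal sketch (hypothetical per-tree outputs, not the actual model):

```python
from collections import Counter

# Hypothetical class predictions from five trees for a single sample
tree_votes = [2, 2, 0, 2, 1]

# The forest's prediction is the majority vote
prediction, count = Counter(tree_votes).most_common(1)[0]
print(prediction)  # 2 (three of five trees agree)
```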
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

rfc = RandomForestClassifier(bootstrap=False, max_depth=20, max_features='log2',
                             min_samples_leaf=1, min_samples_split=2,
                             n_estimators=50, random_state=487)
rfc.fit(train_data_x, train_data_y)

train_predicted = rfc.predict(train_data_x)  # predictions on the training set
print('RFC train:\n', metrics.classification_report(train_data_y, train_predicted))
test_predicted = rfc.predict(test_data_x)  # predictions on the test set
print('RFC test:\n', metrics.classification_report(test_data_y, test_predicted))

a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
RFC_result = {}
RFC_result['recall_0'] = float(b[6])   # recall for class 0
RFC_result['recall_1'] = float(b[11])  # recall for class 1
RFC_result['recall_2'] = float(b[16])  # recall for class 2
RFC_result['recall_3'] = float(b[21])  # recall for class 3
RFC_result['acc'] = float(b[25])       # overall accuracy
result_data['RFC'] = RFC_result
print('RFC training finished')
```
4.3 SVM (Support Vector Machine)
Core concept: find the optimal hyperplane that maximizes the margin between samples of different classes.
Key parameters: C (penalty coefficient).
```python
from sklearn.svm import SVC
from sklearn import metrics

svm = SVC(C=1, coef0=0.1, degree=4, gamma=1, kernel='poly',
          probability=True, random_state=100)
svm.fit(train_data_x, train_data_y)

train_predicted = svm.predict(train_data_x)  # predictions on the training set
print('SVM train:\n', metrics.classification_report(train_data_y, train_predicted))
test_predicted = svm.predict(test_data_x)  # predictions on the test set
print('SVM test:\n', metrics.classification_report(test_data_y, test_predicted))

a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
SVM_result = {}
SVM_result['recall_0'] = float(b[6])   # recall for class 0
SVM_result['recall_1'] = float(b[11])  # recall for class 1
SVM_result['recall_2'] = float(b[16])  # recall for class 2
SVM_result['recall_3'] = float(b[21])  # recall for class 3
SVM_result['acc'] = float(b[25])       # overall accuracy
result_data['SVM'] = SVM_result
print('SVM training finished')
```
4.4 AdaBoost
Core concept: an adaptive boosting algorithm that trains multiple weak classifiers (such as shallow decision trees) sequentially, adjusting sample weights after each round, and finally combines them into a strong classifier.
Key parameters: n_estimators (number of trees), learning_rate.
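The sample-weight update at the heart of AdaBoost can be traced by hand. A minimal sketch of the classical binary formulation with made-up weights (scikit-learn's SAMME variant used below generalizes this to multi-class):

```python
import math

# Classical (binary) AdaBoost update, shown only to illustrate the idea
weights = [0.2, 0.2, 0.2, 0.2, 0.2]       # uniform initial sample weights
correct = [True, True, False, True, True]  # the weak learner misses sample 2

err = sum(w for w, c in zip(weights, correct) if not c)  # weighted error = 0.2
alpha = 0.5 * math.log((1 - err) / err)                  # learner's voting weight

# Up-weight misclassified samples, down-weight the rest, then renormalize
new_w = [w * math.exp(alpha if not c else -alpha) for w, c in zip(weights, correct)]
total = sum(new_w)
new_w = [w / total for w in new_w]
print(round(alpha, 3), [round(w, 3) for w in new_w])
```

The misclassified sample's weight grows from 0.2 to 0.5, so the next weak learner concentrates on it.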
```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Note: scikit-learn >=1.2 renames base_estimator to estimator
abf = AdaBoostClassifier(algorithm='SAMME',
                         base_estimator=DecisionTreeClassifier(max_depth=2),
                         n_estimators=200, learning_rate=1.0, random_state=0)
abf.fit(train_data_x, train_data_y)

train_predicted = abf.predict(train_data_x)  # predictions on the training set
print('ABF train:\n', metrics.classification_report(train_data_y, train_predicted))
test_predicted = abf.predict(test_data_x)  # predictions on the test set
print('ABF test:\n', metrics.classification_report(test_data_y, test_predicted))

a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
ABF_result = {}
ABF_result['recall_0'] = float(b[6])   # recall for class 0
ABF_result['recall_1'] = float(b[11])  # recall for class 1
ABF_result['recall_2'] = float(b[16])  # recall for class 2
ABF_result['recall_3'] = float(b[21])  # recall for class 3
ABF_result['acc'] = float(b[25])       # overall accuracy
result_data['ABF'] = ABF_result
print('ABF training finished')
```
4.5 XGBoost
Core concept: an efficient gradient-boosted decision tree library that optimizes the loss function using second-order derivatives and includes built-in regularization to prevent overfitting.
Key parameters: n_estimators, max_depth, learning_rate, subsample (sample subsampling ratio).
```python
from xgboost import XGBClassifier
from sklearn import metrics

xgbc = XGBClassifier(learning_rate=0.05,        # learning rate
                     n_estimators=200,          # number of trees
                     num_class=5,               # number of classes (labels here are 0-3)
                     max_depth=7,               # maximum tree depth
                     min_child_weight=1,        # minimum sum of instance weights in a leaf
                     gamma=0,                   # minimum loss reduction required to split a node
                     subsample=0.6,             # row subsampling ratio per tree
                     colsample_bytree=0.8,      # column subsampling ratio per tree
                     objective='multi:softmax', # multi-class loss
                     seed=0)                    # random seed
xgbc.fit(train_data_x, train_data_y)

train_predicted = xgbc.predict(train_data_x)  # predictions on the training set
print('XGB train:\n', metrics.classification_report(train_data_y, train_predicted))
test_predicted = xgbc.predict(test_data_x)  # predictions on the test set
print('XGB test:\n', metrics.classification_report(test_data_y, test_predicted))

a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
XGBF_result = {}
XGBF_result['recall_0'] = float(b[6])   # recall for class 0
XGBF_result['recall_1'] = float(b[11])  # recall for class 1
XGBF_result['recall_2'] = float(b[16])  # recall for class 2
XGBF_result['recall_3'] = float(b[21])  # recall for class 3
XGBF_result['acc'] = float(b[25])       # overall accuracy
result_data['XGBF'] = XGBF_result
print('XGB training finished')
```
4.6 Gaussian Naive Bayes
Core concept: based on Bayes' theorem, it assumes each feature follows a Gaussian distribution and computes the posterior probability of a sample belonging to each class. The model is simple and trains quickly.
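The per-feature Gaussian likelihood the model relies on can be written out by hand; a minimal sketch with made-up class statistics (the means, standard deviations, and priors here are illustrative, not estimated from the real data):

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Likelihood of value x under a normal distribution N(mu, sigma^2)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical statistics for one feature:
# class 0: mean 1.0, std 0.5; class 1: mean 3.0, std 0.5; equal priors
x = 1.2
score_0 = 0.5 * gaussian_pdf(x, 1.0, 0.5)  # prior * likelihood for class 0
score_1 = 0.5 * gaussian_pdf(x, 3.0, 0.5)  # prior * likelihood for class 1
print('predict class', 0 if score_0 > score_1 else 1)
```

With several features, the per-feature likelihoods are simply multiplied together (the "naive" independence assumption).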
```python
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

gnb = GaussianNB()
gnb.fit(train_data_x, train_data_y)

train_predicted = gnb.predict(train_data_x)  # predictions on the training set
print('GNB train:\n', metrics.classification_report(train_data_y, train_predicted))
test_predicted = gnb.predict(test_data_x)  # predictions on the test set
print('GNB test:\n', metrics.classification_report(test_data_y, test_predicted))

a = metrics.classification_report(test_data_y, test_predicted, digits=6)
b = a.split()
GNB_result = {}
GNB_result['recall_0'] = float(b[6])   # recall for class 0
GNB_result['recall_1'] = float(b[11])  # recall for class 1
GNB_result['recall_2'] = float(b[16])  # recall for class 2
GNB_result['recall_3'] = float(b[21])  # recall for class 3
GNB_result['acc'] = float(b[25])       # overall accuracy
result_data['GNB'] = GNB_result
print('GNB training finished')
```
Extension
None of the six models above were tuned; their hyperparameters were set directly. To push accuracy and the other metrics higher, grid search can be used to find the optimal parameters and retrain with them.
1) What grid search is
Grid search is a classic hyperparameter optimization method that systematically searches for the best parameter combination of a machine learning model.
2) Purpose
Automatic tuning: replaces manual trial and error in finding the best hyperparameter combination.
Performance maximization: finds the parameters that give the best cross-validation score.
3) How it works
Taking logistic regression parameter optimization as an example:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = [
    # For the l2 or none penalty
    {'penalty': ['l2', 'none'],
     'C': [0.001, 0.01, 0.1, 1, 10, 100],
     'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
     'max_iter': [500, 1000],
     'multi_class': ['auto', 'ovr', 'multinomial']},
    # For the l1 penalty
    {'penalty': ['l1'],
     'C': [0.001, 0.01, 0.1, 1, 10, 100],
     'solver': ['liblinear', 'saga'],  # only these solvers support l1
     'max_iter': [500, 1000],
     'multi_class': ['auto', 'ovr']},
    # For the elasticnet penalty
    {'penalty': ['elasticnet'],
     'C': [0.001, 0.01, 0.1, 1, 10, 100],
     'solver': ['saga'],               # only saga supports elasticnet
     'l1_ratio': [0.1, 0.5, 0.9],      # mixing ratio of the l1 term
     'max_iter': [500, 1000],
     'multi_class': ['auto', 'ovr', 'multinomial']},
]

logreg = LogisticRegression()
grid_search = GridSearchCV(logreg, param_grid, cv=5)  # create the GridSearchCV object
grid_search.fit(train_data_x, train_data_y)           # run the grid search on the training set
print(grid_search.best_params_)                       # print the best parameters
```
Note: when composing parameter combinations, make sure the parameters are mutually compatible (not every solver supports every penalty).
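After fitting, GridSearchCV exposes the best model directly, already refit on the full training data. A self-contained sketch on synthetic data (the tiny grid here is illustrative, not the full grid above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data standing in for the real training set
X, y = make_classification(n_samples=200, n_features=8, n_classes=4,
                           n_informative=5, random_state=0)

param_grid = {'C': [0.1, 1, 10]}  # deliberately tiny grid for speed
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)           # best C found by cross-validation
best_model = grid.best_estimator_  # refit on all of X, y; ready for predictions
print(best_model.score(X, y))
```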
4.7 Neural Network
Core concept: build a multilayer perceptron with PyTorch that learns complex non-linear relationships through several fully connected layers and non-linear activation functions.
Workflow:
1. Define the network structure
```python
import torch
import torch.nn as nn

# Define the network structure
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(13, 32)  # 13 input features
        self.fc2 = nn.Linear(32, 64)
        self.fc3 = nn.Linear(64, 4)   # 4 output classes

    def forward(self, x):  # overrides the parent method
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```
2. Prepare the data
```python
# Convert the tabular data to tensors
X_train = torch.tensor(train_data_x.values, dtype=torch.float32)
Y_train = torch.tensor(train_data_y.values)
X_test = torch.tensor(test_data_x.values, dtype=torch.float32)
Y_test = torch.tensor(test_data_y.values)
```
3. Define the loss function and optimizer
```python
model = Net()
criterion = nn.CrossEntropyLoss()  # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```
4. Define an evaluation function
```python
def evaluate_model(model, X_data, Y_data, train_or_test):
    size = len(X_data)
    with torch.no_grad():  # disable gradient tracking during evaluation
        predictions = model(X_data)
        correct = (predictions.argmax(1) == Y_data).type(torch.float).sum().item()
        correct /= size
        loss = criterion(predictions, Y_data).item()
        print(f"{train_or_test}: \t Accuracy: {(100 * correct)}%")
    return correct
```
5. Training loop, evaluation, and result saving
```python
epochs = 15000
accs = []
for epoch in range(epochs):  # train the network
    optimizer.zero_grad()    # reset the gradients
    outputs = model(X_train)
    loss = criterion(outputs, Y_train)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}')
        train_acc = evaluate_model(model, X_train, Y_train, 'train')
        test_acc = evaluate_model(model, X_test, Y_test, 'test')
        accs.append(test_acc * 100)

net_result = {}
net_result['acc'] = max(accs)
result_data['net'] = net_result
```
4.8 Convolutional Neural Network
Core concept: treat each sample's features as a one-dimensional sequence, use 1-D convolution kernels to extract local feature patterns, then classify through global pooling and a fully connected layer.
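The local-pattern extraction of a 1-D convolution can be demonstrated by hand; a minimal sketch with kernel size 3 and padding 1, the same shape conventions the layers below use:

```python
def conv1d(seq, kernel, padding=1):
    # Slide the kernel across a zero-padded sequence (stride 1)
    padded = [0.0] * padding + list(seq) + [0.0] * padding
    k = len(kernel)
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(padded) - k + 1)]

# A length-3 kernel with padding=1 keeps the output length equal to the input
print(conv1d([1, 2, 3], [1, 1, 1]))  # [3.0, 6.0, 5.0]
```

Each output element summarizes a 3-wide local window, which is exactly what nn.Conv1d with kernel_size=3, padding=1 does (plus learned weights and multiple channels).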
Workflow:
1. Define the 1-D CNN structure
```python
class ConvNet(nn.Module):
    def __init__(self, num_features, hidden_size, num_classes):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        self.conv3 = nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        # Conv1d expects input of shape (batch_size, channels, length),
        # so add a channel dimension
        x = x.unsqueeze(1)
        x = self.conv1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.conv3(x)
        x = self.relu(x)
        x = x.mean(dim=2)  # global average pooling over the length dimension
        x = self.fc(x)
        return x
```
2. Prepare the data and initialize the model
```python
X_train = torch.tensor(train_data_x.values, dtype=torch.float32)
Y_train = torch.tensor(train_data_y.values)
X_test = torch.tensor(test_data_x.values, dtype=torch.float32)
Y_test = torch.tensor(test_data_y.values)

hidden_size = 10
num_classes = 4
model = ConvNet(13, hidden_size, num_classes)
```
3. Define the loss function and optimizer
```python
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
```
4. Training loop
```python
num_epochs = 15000
accs = []
for epoch in range(num_epochs):
    outputs = model(X_train)           # forward pass
    loss = criterion(outputs, Y_train)
    optimizer.zero_grad()              # backward pass and optimization
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 100 == 0:         # report every 100 epochs
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')
        # Evaluate the model
        with torch.no_grad():
            predictions = model(X_train)
            predicted_classes = predictions.argmax(dim=1)
            accuracy = (predicted_classes == Y_train).float().mean()
            print(f'Train Accuracy: {accuracy.item() * 100:.2f}%')
            predictions = model(X_test)
            predicted_classes = predictions.argmax(dim=1)
            accuracy = (predicted_classes == Y_test).float().mean()
            print(f'Test Accuracy: {accuracy.item() * 100:.2f}%')
            accs.append(accuracy * 100)
```
5. Save the results
```python
cnn_result = {}
cnn_result['acc'] = max(accs).item()
result_data['cnn'] = cnn_result
```
5. Saving the Files
After all models have finished training, save the results:
```python
import json

result = {}
result['平均值填充'] = result_data
with open(r'.//填充后的数据//平均值填充训练结果.json', 'w', encoding='utf-8') as file:
    json.dump(result, file, ensure_ascii=False, indent=4)
```
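To sanity-check the archive, the file can be read back and the models ranked by accuracy. A sketch using an in-memory stand-in for the real file (the accuracy values are made up; the keys mirror those built above):

```python
import json

# Stand-in for the saved file's contents; real values come from the training runs
result = {'平均值填充': {'LR': {'acc': 0.81}, 'RFC': {'acc': 0.92}, 'SVM': {'acc': 0.88}}}
text = json.dumps(result, ensure_ascii=False, indent=4)

# Load it back and rank the models by test accuracy
loaded = json.loads(text)
ranking = sorted(loaded['平均值填充'].items(), key=lambda kv: kv[1]['acc'], reverse=True)
for name, res in ranking:
    print(f"{name}: acc={res['acc']}")
```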