机器学习（一）——递归特征消除法实现SVM（matlab）

技术分享 2年前 (2024-06-23) 0 999+

机器学习方法对多维特征数据进行分类：本文用到非常经典的机器学习方法，使用递归特征消除进行特征选择，使用支持向量机构建分类模型，使用留一交叉验证的方法来评判模型的性能。
构建模型：支持向量机（Support Vector Machine，SVM）；
特征选择：递归特征消除(Recursive Feature Elimination，RFE)；
交叉验证：留一交叉验证（Leave one out cross validation，LOOCV）。
下面本问将逐一开始介绍这些方法。

支持向量机
适用场景：
1、只能用于2分类任务
2、目的是寻找一个超平面来对样本进行分割
3、注意：构建超平面不一定用到所有样本，只用到距离超平面最近的那些样本
模型调参：
1、当样本之间线性可分时，选用线性核函数（linear kernel）构建分类模型。
2、当模型线性不可分时候，需要适用非线性核函数（rbf，Gaussian，Polynomial）将数据分布映射到更高维的空间中来构建超平面，进而来构建分类模型。
3、选择哪种核函数，一般通过改变核函数来比较模型的分类性能来确定。matlab中自带的核函数只有四种，如果需要使用其他核函数请自行下载相关软件包。

递归特征消除
作用：
1、降低特征维度
2、选择最优的特征组合，使模型达到最好的分类效果（类似于贪心算法）。
步骤：
1、对于一个具有n维特征的数据集合，首先用n维特征构建SVM分类器模型，通过交叉验证的方法计算模型的分类准确性。
2、从当前特征集合中依次移除每一个特征，然后用剩下特征构建分类模型，使用交叉验证的方法计算特征移除后的分类准确率。
3、若某特征移除后模型的分类准确率反而上升，则该特征对分类模型没有贡献度，则将该特征移除，剩下特征作为保留特征。
4、使用剩下特征重复步骤2，直到所有的特征子集为空后便可以得到n个模型，选择分类准确率最高的特征集合作为最优的特征集合。

留一交叉验证
适用场景：
小样本构建分类模型，当样本量很小时，不足以区分单独的训练集和测试集时，通常使用这种方法。该方法的基本思想就是，当有n个样本的情况下，依次保留其中1个样本，用剩下n-1个样本构建分类模型，用保留的样本进行测试。这样就可以得到n个模型，计算这n个模型分类结果的平均值就可以得到在该数据分布情况下，使用某种分类方法构建分类模型的性能。

matlab实现的代码如下：

labels = res(:, 1); features = res(:, 2:end); features=zscore(features);%特征进行归一化 % 加载数据集并准备标签和特征数组 [num_samples, num_features] = size(features);  selected_indices = 1:num_features; % 初始选定所有特征的索引 selected_features_history = cell(num_features, num_features); % 存储选定的特征历史记录   accuracy_history = zeros(num_features, num_features); % 存储准确率历史记录     feature_to_remove = -1;  % 开始逐步特征选择 for i = 1:num_features     best_accuracy = 0;     temp_indices = selected_indices; % 创建临时特征索引列表          % 对每个特征进行评估     for j = 1:length(temp_indices)         features_subset = temp_indices(temp_indices ~= j);%去除特征后输入分类器的特征         [num1,num2]=size(features_subset);         % 使用留一交叉验证评估SVM分类器性能         accuracy = 0;         for k = 1:num_samples             % 留一样本作为验证集，其余样本作为训练集             train_features = features(:, features_subset);             train_features(k, :) = []; % 删除验证样本的特征             train_labels = labels;             train_labels(k) = []; % 删除验证样本的标签             test_feature = features(k, features_subset);             test_label = labels(k);                          % 训练SVM模型             svm_model = fitcsvm(train_features, train_labels, 'KernelFunction', 'linear');                          % 在验证集上进行预测并计算准确率             predicted_label = predict(svm_model, test_feature);             if predicted_label == test_label                 accuracy = accuracy + 1;             end         end         accuracy = accuracy / num_samples; % 计算准确率         accuracy_history(i,j)=accuracy;   %将每次分类的准确率存到一个数组中         selected_features_history{i,j} = features_subset;%将每次分类用到的特征存到一个数组里         %temp_indices          % 如果当前特征组合的准确率更高，则更新最佳特征及其对应的准确率         if (accuracy_history(i,j) > best_accuracy)                         best_accuracy = accuracy_history(i,j);             feature_to_remove = temp_indices(j);         end              end             % 删除性能下降最快的特征     selected_indices = selected_indices(selected_indices ~= feature_to_remove);    % selected_features_history{i} = selected_indices; % 更新选定的特征历史记录 %     accuracy_history(i) = best_accuracy; % 更新准确率历史记录     disp(['Removed feature index: ', num2str(feature_to_remove)]); end   [max_value, max_index] = max(accuracy_history(:));  % max_value 将是数组中的最大值 % max_index 将是数组中最大值所在的位置（线性索引） [row, col] = ind2sub(size(accuracy_history), max_index);  % row 和 col 将是数组中最大值的行和列索引   % 输出最终选定的特征索引 % disp('最优的分类准确性为'); % disp(max_value);  disp('对应的选择的特征索引为：'); disp(selected_features_history{row,col});   %利用选出来的特征重新建模求准确率 features_new=features(:,selected_features_history{row,col}); %features_new=features; % lables是样本标签 predictedScores=zeros(56,2);  accuracy_new=0;         for k = 1:num_samples             % 留一样本作为验证集，其余样本作为训练集             train_features = features_new(:,:);             train_features(k, :) = []; % 删除验证样本的特征             train_labels = labels;             train_labels(k) = []; % 删除验证样本的标签             test_feature = features_new(k,:);             test_label = labels(k);                          % 训练SVM模型             svm_model = fitcsvm(train_features, train_labels, 'KernelFunction', 'linear');                          % 在验证集上进行预测并计算准确率            % predicted_label = predict(svm_model, test_feature);             [predicted_label,predictedScore] = predict(svm_model, test_feature);             predictedScores(k,:)=predictedScore;             if predicted_label == test_label                 accuracy_new = accuracy_new + 1;             end         end         accuracy_new = accuracy_new / num_samples; % 计算准确率   % 输出最终选定的特征索引 disp('最优的分类准确性为'); disp(accuracy_new);```

发表评论