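1. How does logistic regression prevent overfitting? Why does regularization prevent overfitting?
1. Increase the number of training samples; this applies to any model.
2. Use regularization (L1 or L2). The penalty on the weights discourages large coefficients, which constrains model complexity so the model fits noise in the training set less closely; L1 can also drive some coefficients to exactly zero.
3. Feature selection: review the chosen features and drop unimportant ones to reduce model complexity.
4. Stepwise regression.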
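To make the regularization point concrete, here is a minimal sketch in plain Python (not scikit-learn's implementation) that fits a logistic regression by gradient descent with an L2 penalty. The function `fit_logistic_l2`, the penalty strength `lam`, the learning rate, and the toy data are all assumptions chosen for illustration. The only change the penalty makes is the extra `lam * w` term in the weight update, which pulls the weights toward zero.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic_l2(X, y, lam, lr=0.1, epochs=500):
    """Batch gradient descent on mean log-loss + (lam / 2) * ||w||^2.
    The bias term is not penalized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # The L2 penalty contributes lam * w to the gradient of the weights.
        w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: only the first feature carries signal, the second is noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if xi[0] + 0.3 * rng.gauss(0, 1) > 0 else 0 for xi in X]

for lam in (0.0, 0.1, 1.0):
    w, _ = fit_logistic_l2(X, y, lam)
    norm = math.sqrt(sum(wj * wj for wj in w))
    print(f"lam={lam}: weights={[round(wj, 3) for wj in w]}, norm={norm:.3f}")
```

A larger `lam` should give a smaller weight norm, which is the sense in which regularization limits model complexity. In scikit-learn's `LogisticRegression` the corresponding knob is `C`, the inverse of the regularization strength, so a smaller `C` means a stronger penalty.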
2. Practice with logistic regression; any dataset may be used.
Using logistic regression to predict breast cancer diagnoses:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# (1) Load the breast-cancer-wisconsin data and split it into training and test sets.
data = pd.read_csv('./data/breast-cancer-wisconsin.csv')
# Data cleaning: mark '?' entries as missing, then drop rows with missing values
data = data.replace(to_replace='?',value = np.nan)
data = data.dropna()
# Split into features (columns 1-9) and label (column 10)
x = data.iloc[:,1:10]
y = data.iloc[:,10]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=5)
# (2) Train the model, compute the scores on the training and test sets, and count the correctly predicted test samples.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Standardize features: fit the scaler on the training set only, then apply it to the test set
std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)
lg = LogisticRegression()
lg.fit(x_train, y_train)
y_pre = lg.predict(x_test)
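print('Training set score:', lg.score(x_train, y_train))
print('Test set score:', lg.score(x_test, y_test))
print('Number of predictions:', x_test.shape[0])
# Count matches directly instead of multiplying accuracy by the sample count (avoids a float result)
print('Number of correct predictions:', int((y_pre == y_test).sum()))
print('Classification report (precision, recall, F1):\n', classification_report(y_test, y_pre))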
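
The classification report printed at the end of the script lists precision and recall per class. As a rough sketch of what those two numbers mean, the helper below computes them from raw counts. The function name `precision_recall` and the label lists are hypothetical and for illustration only; the encoding 2 = benign, 4 = malignant is an assumption about the CSV, which is not shown here.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (assumed encoding: 2 = benign, 4 = malignant).
y_true = [2, 2, 4, 4, 4, 2, 4, 2]
y_pred = [2, 4, 4, 4, 2, 2, 4, 2]

precision, recall = precision_recall(y_true, y_pred, positive=4)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For a screening task like this one, recall on the malignant class is usually the number to watch, since a false negative means a missed diagnosis.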