How sklearn computes the weights when class_weight is set to 'balanced'

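In classification, when the number of samples differs greatly between classes, the results are easily skewed. Either the classes should contain roughly the same amount of data, or a correction must be applied.

One correction available in sklearn is weighting, which involves class_weight and sample_weight. When the class_weight parameter is not set, every class defaults to a weight of 1.

In the Python docstrings:

# The class_weight parameter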
class_weight : {dict, 'balanced'}, optional
        Set the parameter C of class i to class_weight[i]*C for
        SVC. If not given, all classes are supposed to have
        weight one. The "balanced" mode uses the values of y to automatically
        adjust weights inversely proportional to class frequencies as
        ``n_samples / (n_classes * np.bincount(y))``
# When a dict is used, its form is: Weights associated with classes in the form ``{class_label: weight}``. For example, {0: 1, 1: 1} gives class 0 a weight of 1 and class 1 a weight of 1.

# The sample_weight parameter
sample_weight : array-like, shape (n_samples,)
            Per-sample weights. Rescale C per sample. Higher weights
            force the classifier to put more emphasis on these points.
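To make the relationship between the two parameters concrete, here is a minimal sketch that expands per-class 'balanced' weights into one weight per sample with compute_sample_weight from sklearn.utils.class_weight. The label array is only illustrative; an array like this could be passed as sample_weight to an estimator's fit method.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Illustrative labels: 8 samples of class 0, 6 of class 1, 2 of class 2
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2])

# Each sample receives the 'balanced' weight of the class it belongs to
sample_weight = compute_sample_weight(class_weight="balanced", y=y)

print(sample_weight.shape)       # one weight per sample
print(np.unique(sample_weight))  # only three distinct values, one per class
```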

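1. The computation is implemented in compute_class_weight, whose source code can be reached through: from sklearn.utils.class_weight import compute_class_weight

2. Besides passing the weights as a dict, you can also set class_weight = 'balanced'. For example, with an SVM classifier: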

from sklearn.svm import SVC

# X_train and y_train are assumed to be the training features and labels
clf = SVC(kernel='linear', class_weight='balanced', decision_function_shape='ovr')
clf.fit(X_train, y_train)
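The snippet above needs training data to run. A self-contained sketch follows, using synthetic, deliberately imbalanced data (the class sizes, the random generator seed, and the feature construction are all made up for illustration). After fitting, SVC exposes the multipliers applied to C in its class_weight_ attribute, which can be compared with the formula from the docstring.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical imbalanced data set: 80 / 15 / 5 samples per class
counts = [80, 15, 5]
y_train = np.repeat([0, 1, 2], counts)
# Two features, shifted by the class label so the classes are separable
X_train = rng.normal(size=(y_train.size, 2)) + 3.0 * y_train[:, None]

clf = SVC(kernel="linear", class_weight="balanced", decision_function_shape="ovr")
clf.fit(X_train, y_train)

# Per-class multipliers of C chosen by the 'balanced' mode
print(clf.class_weight_)

# The docstring formula: n_samples / (n_classes * np.bincount(y))
expected = y_train.size / (len(counts) * np.bincount(y_train))
print(expected)
```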

3. So how are the 'balanced' weights computed? Consider an example:

import numpy as np

y = [0,0,0,0,0,0,0,0,1,1,1,1,1,1,2,2]  # labels, 16 samples in total

a = np.bincount(y)  # array([8, 6, 2], dtype=int64): number of samples in each class
aa = 1/a  # reciprocals: array([0.125     , 0.16666667, 0.5       ])
print(aa)

from sklearn.utils.class_weight import compute_class_weight
class_weight = 'balanced'
classes = np.array([0, 1, 2])  # the distinct class labels
# classes and y must be passed as keyword arguments in recent sklearn versions
weight = compute_class_weight(class_weight, classes=classes, y=y)
print(weight) # [0.66666667 0.88888889 2.66666667]

print(0.66666667*8)  # 5.33333336
print(0.88888889*6)  # 5.33333334
print(2.66666667*2)  # 5.33333334
# These three values are nearly identical
# The 'balanced' weights even things out: each class's weight offsets its sample count

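As the output shows, multiplying each computed weight by the sample count of its class gives nearly the same number for all three classes. My own view is that using the plain reciprocal of the sample count as the penalty weight would also be reasonable, since each reciprocal times its sample count is 1. The 'balanced' weights simply carry an extra constant factor, n_samples / n_classes (here 16/3 ≈ 5.33).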
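One way to see what that constant factor does is the sketch below, which compares the plain reciprocal weights with weights computed from the docstring formula ``n_samples / (n_classes * np.bincount(y))``. It uses only NumPy and the same 16 labels. With the constant included, the weights summed over all samples equal n_samples, so the average per-sample weight is 1. Since SVC multiplies C by the class weight, this suggests the overall regularization scale stays comparable to the unweighted case, which plain reciprocals would not guarantee.

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2])

counts = np.bincount(y)                   # samples per class: [8 6 2]
inverse = 1 / counts                      # plain reciprocal weights
balanced = y.size / (counts.size * counts)  # docstring formula

# The two weightings differ only by the constant n_samples / n_classes
ratio = balanced / inverse
print(ratio)

# Total weight over all samples, and the resulting mean per-sample weight
total_balanced = np.sum(balanced * counts)
total_inverse = np.sum(inverse * counts)
print(total_balanced / y.size)  # mean weight with 'balanced'
print(total_inverse / y.size)   # mean weight with plain reciprocals
```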

4. Now for the key part. Recall the formula given in the docstring above for class_weight='balanced':

# weight_ = n_samples / (n_classes * np.bincount(y))
# Here
# n_samples is 16
# n_classes is 3
# np.bincount(y) is simply the number of samples in each class

So:

print(16/(3*8))  # prints 0.6666666666666666
print(16/(3*6))  # prints 0.8888888888888888
print(16/(3*2))  # prints 2.6666666666666665

These match the weights computed earlier. This is exactly how the weights are computed when class_weight is set to 'balanced'.
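The formula can also be wrapped in a small helper. The sketch below is a hypothetical function (balanced_class_weights is not part of sklearn) that uses np.unique instead of np.bincount, so it also works for labels that are not non-negative integers, such as strings. It returns a dict in the ``{class_label: weight}`` form described earlier.

```python
import numpy as np

def balanced_class_weights(y):
    """Hypothetical helper: map each label to n_samples / (n_classes * count)."""
    # np.unique returns the sorted distinct labels and how often each occurs
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Illustrative string labels with the same 8 / 6 / 2 split as above
labels = ["cat"] * 8 + ["dog"] * 6 + ["bird"] * 2
weights_by_label = balanced_class_weights(labels)
print(weights_by_label)
```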

5. Finally, a note on what happens when a dict is passed:

import numpy as np

y = [0,0,0,0,0,0,0,0,1,1,1,1,1,1,2,2]  # labels, 16 samples in total

from sklearn.utils.class_weight import compute_class_weight
class_weight = {0:1,1:3,2:5}   # {class_label_1:weight_1, class_label_2:weight_2, class_label_3:weight_3}
classes = np.array([0, 1, 2])  # the distinct class labels
# classes and y must be passed as keyword arguments in recent sklearn versions
weight = compute_class_weight(class_weight, classes=classes, y=y)
print(weight)   # prints [1. 3. 5.], i.e. the values set in the dict

References:

https://blog.csdn.net/go_og/article/details/81281387

https://www.zhihu.com/question/265420166/answer/293896934

Original article: https://www.cnblogs.com/qi-yuan-008/p/11992156.html