Logistic Regression
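1. How do you apply LR to feature data that contains time series?
This question applies to other models too, not just LR.
Many basic models have variants that turn them into time-series models. Personally, though, I find most of these variants unreliable.
I think it is better to start from the business problem, explore and analyze the data, arrive at reasonable hypotheses, and then extract features. These features can carry time information without being sequential themselves, for example statistics or combinations of other features over the previous N days.
See also: Logistic regression for time series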
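The "statistics over the previous N days" idea can be sketched roughly as follows. This is only an illustration: the function name `trailing_features`, the chosen statistics, and the `daily_clicks` values are all made up, and a real feature set would come from the business analysis described above.

```python
from statistics import mean

def trailing_features(values, n):
    """For each day t >= n, summarize the previous n days (day t itself is excluded)."""
    rows = []
    for t in range(n, len(values)):
        window = values[t - n:t]  # days t-n .. t-1
        rows.append({
            "mean": mean(window),
            "max": max(window),
            "min": min(window),
            "last_minus_first": window[-1] - window[0],  # crude trend signal
        })
    return rows

# Hypothetical daily values of one raw feature
daily_clicks = [3, 5, 4, 8, 6, 9, 7]
features = trailing_features(daily_clicks, n=3)
for row in features:
    print(row)
```

Each resulting row is an ordinary feature vector, so it can be fed to LR or any other non-sequential model.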
Q: I would like to use a binary logistic regression model in the context of streaming data (multidimensional time series) in order to predict the value of the dependent variable of the data (i.e. row) that just arrived, given the past observations. As far as I know, logistic regression is traditionally used for postmortem analysis, where each dependent variable has already been set (either by inspection, or by the nature of the study).
A: There are two methods to consider:
1) Use only the last N input samples. If your input signal has dimension D, you then have N*D input values per ground-truth label. You can train any classifier you like on them, including logistic regression. Each output is treated as independent of all other outputs.
2) Use the last N input samples and the last N outputs you have generated. The problem then resembles Viterbi decoding: you could generate a non-binary score from the input samples and combine the scores of multiple samples using a Viterbi decoder. This is better than method 1 if you know something about the temporal relation between the outputs.
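A minimal sketch of method 1, assuming toy data and a hand-written gradient-descent logistic regression so that the example needs no third-party library. The helper names (`make_windows`, `fit_logistic`, `predict_proba`), the learning rate, and the sample values are illustrative choices, not part of the quoted answer.

```python
import math

def make_windows(series, labels, n):
    """Flatten the last n samples (each of dimension D) into one row of n*D values.

    The row for time t uses samples t-n+1 .. t and is paired with labels[t].
    """
    X, y = [], []
    for t in range(n - 1, len(series)):
        row = [v for sample in series[t - n + 1:t + 1] for v in sample]
        X.append(row)
        y.append(labels[t])
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, row):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log loss (no regularization)."""
    w, b, m = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / m for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

# Toy stream with D = 2; the label is 1 when the first component is high.
series = [[0.1, 1.0], [0.2, 0.9], [0.9, 0.1], [0.8, 0.2],
          [0.1, 0.8], [0.9, 0.3], [0.2, 0.7], [0.8, 0.1]]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

X, y = make_windows(series, labels, n=2)  # each row has N*D = 4 values
w, b = fit_logistic(X, y)
preds = [int(predict_proba(w, b, row) >= 0.5) for row in X]
print(preds, y)
```

The predicted probabilities from `predict_proba` are the kind of non-binary score that method 2 would pass on to a Viterbi decoder instead of thresholding each one independently.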
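2. How do you handle imbalanced data?
For example, with a positive-to-negative ratio of 1:100 where the positive class is the one of interest, LR performs very poorly.
There are generally two approaches:
1) Adjust the weights, e.g. weight positive examples by 10. PS: in my own experiments the results were still unsatisfactory.
2) Sampling. I have not tried this yet.
Reference: http://www.alidata.org/archives/205 (sampling for datasets with extremely imbalanced positive and negative examples)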
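Both approaches can be sketched in a few lines. This is a rough illustration under assumptions: `balanced_class_weights` uses the n_samples / (n_classes * class_count) heuristic (the same formula scikit-learn applies for `class_weight="balanced"`) rather than the fixed factor of 10 mentioned above, and `downsample_negatives` with its `ratio` parameter is a hypothetical helper, not the method from the linked article.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def downsample_negatives(rows, labels, ratio=1.0, seed=0):
    """Keep every positive and a random subset of the negatives.

    ratio is the number of negatives kept per positive (illustrative knob).
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    keep_neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    keep = sorted(pos + keep_neg)  # preserve the original row order
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy dataset with a 1:100 positive-to-negative ratio
labels = [1] * 10 + [0] * 1000
rows = [[float(i)] for i in range(len(labels))]

weights = balanced_class_weights(labels)
sub_rows, sub_labels = downsample_negatives(rows, labels, ratio=5.0)
print(weights)
print(Counter(sub_labels))
```

The weights would multiply each example's term in the log loss during training, while the downsampled set is simply used in place of the full one.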