PP: Time series clustering via community detection in Networks

Possible future improvements:
1. The algorithm for constructing the network from the distance matrix.
2. Evolution over a sliding time window.
3. Post-processing or visual analysis of the generated graphs.

Thoughts:

1. What is the ground truth in load profiles?

Clustering has no ground truth, so how should the parameters and options in step 2, step 3 and step 4 be tuned? In this paper the time series are labeled, so the authors use the Rand index (RI) to guide their selection of parameters, for example k and epsilon.

Assumption: similar time series tend to connect to each other and form communities.

Background and related works

Shape-based distance measures; feature-based distance measures; structure-based distance measures; time series clustering; community detection in networks.

Methodology
1. data normalization
2. time series distance calculation
3. network construction
4. community detection
Which steps influence the clustering results:

The distance calculation algorithm; the network construction method; the community detection method.
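
These notes do not say which normalization step 1 uses. A common choice for time series is z-normalization (zero mean, unit standard deviation), so the sketch below assumes that; the function name `z_normalize` is hypothetical and not taken from the paper.

```python
import math

def z_normalize(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        # A constant series has no shape information; map it to all zeros.
        return [0.0] * n
    return [(x - mean) / std for x in series]

normalized = z_normalize([2.0, 4.0, 6.0])
constant = z_normalize([3.0, 3.0, 3.0])
```

After this step, series that differ only in offset or scale become identical, which matters for the shape-based distances in step 2.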

2. distance matrix

Calculate the distance for each pair of time series in the data set and construct a distance matrix D, where d_ij is the distance between series X_i and X_j. The choice of distance measure strongly influences the network construction and the clustering result.
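
A minimal sketch of the distance matrix step, assuming equal-length series stored as Python lists and using Euclidean distance as the default measure. The helper names `euclidean` and `distance_matrix` are illustrative only; any other pairwise measure can be passed in through `dist`.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(series_list, dist=euclidean):
    """Return the symmetric n x n matrix D with D[i][j] = dist(X_i, X_j)."""
    n = len(series_list)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only the upper triangle is computed; the matrix is mirrored.
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = dist(series_list[i], series_list[j])
    return D

toy = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
D = distance_matrix(toy)
```

The mirroring assumes the measure is symmetric, which holds for the Euclidean distance used here.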

3. network construction

Two common methods: k-NN and epsilon-NN. To be explored further.
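
The two construction methods can be sketched as follows, given a distance matrix D as a list of lists. Two details are assumptions of this sketch rather than statements from the paper: the k-NN graph is made undirected by adding an edge whenever either endpoint selects the other, and the epsilon-NN rule uses `<=` for the threshold.

```python
def knn_edges(D, k):
    """Undirected k-NN graph: i-j is an edge if j is among the k nearest
    neighbours of i, or i is among the k nearest neighbours of j."""
    n = len(D)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: D[i][j])
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def epsilon_edges(D, eps):
    """Undirected epsilon-NN graph: i-j is an edge if D[i][j] <= eps."""
    n = len(D)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if D[i][j] <= eps}

D = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.5],
    [4.0, 3.0, 0.0, 1.5],
    [5.0, 4.5, 1.5, 0.0],
]
knn_graph = knn_edges(D, k=1)         # {(0, 1), (2, 3)}
eps_graph = epsilon_edges(D, eps=3.0) # {(0, 1), (1, 2), (2, 3)}
```

With k-NN every vertex gets at least k neighbours, while epsilon-NN can leave a vertex isolated if nothing lies within eps of it.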

Experiments

45 time series data sets.

Purpose: check the performance of each combination of step 2, step 3 and step 4 on each data set.

Evaluation index: Rand index (RI).
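
For reference, the Rand index is the fraction of pairs of items on which two labelings agree, where agreeing means the pair is together in both or apart in both. A small sketch, assuming labels are given as two equal-length lists with at least two items:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of item pairs treated the same way by both labelings."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
        total += 1
    return agree / total

# Only the pair (2, 3) is treated differently, so 5 of the 6 pairs agree.
ri = rand_index([0, 0, 1, 1], [1, 1, 0, 2])
```

The index only compares groupings, so the actual label values do not need to match between the two lists.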

Vary the parameters: the k of k-NN from 1 to n-1; the epsilon of epsilon-NN from min(D) to max(D) in 100 steps.
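
The two parameter sweeps above can be generated like this. The sketch reads "100 steps" as 100 evenly spaced values including both endpoints, and takes min(D) and max(D) over off-diagonal entries only; both are assumptions, since the notes do not pin them down.

```python
def k_grid(n):
    """All k values for the k-NN construction: 1, 2, ..., n-1."""
    return list(range(1, n))

def epsilon_grid(D, steps=100):
    """`steps` evenly spaced epsilon values from min(D) to max(D),
    ignoring the zero diagonal (assumption of this sketch)."""
    n = len(D)
    values = [D[i][j] for i in range(n) for j in range(i + 1, n)]
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * t / (steps - 1) for t in range(steps)]

D = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 3.0],
    [5.0, 3.0, 0.0],
]
ks = k_grid(len(D))
eps_values = epsilon_grid(D)
```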

Step 2: Manhattan, Euclidean, infinity norm, DTW, short time series, DISSIM, complexity-invariant, wavelet transform, Pearson correlation, integrated periodogram.
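
Of the measures listed for step 2, DTW (dynamic time warping) is the least obvious to compute. Below is a sketch of the textbook dynamic-programming version, assuming an absolute-difference local cost and no warping-window constraint; the paper's exact variant may differ.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

pulse = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 1, 2, 1, 0, 0, 0]   # same pulse, one step earlier

warped = dtw_distance(pulse, shifted)                        # 0.0
pointwise = sum(abs(x - y) for x, y in zip(pulse, shifted))  # 4
```

The shifted pulse is at distance 0 under DTW but at distance 4 under a point-by-point (Manhattan) comparison, which illustrates why a warping measure is more tolerant of time shifts.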

Step 3: vary the parameters k and epsilon.

Step 4: fast greedy; multilevel; walktrap; infomap; label propagation.
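
Two of the community detection algorithms listed above, fast greedy and multilevel, work by maximizing modularity Q, so it helps to see how Q is computed for a given partition. The sketch below assumes an undirected, unweighted graph given as an edge list; it only scores a partition and is not an implementation of any of the listed algorithms.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition.

    edges: iterable of (u, v) pairs of an undirected, unweighted graph.
    community: dict mapping each vertex to its community id.
    """
    edges = list(edges)
    m = len(edges)
    internal = {}      # edges with both endpoints inside community c
    total_degree = {}  # sum of vertex degrees inside community c
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            total_degree[c] = total_degree.get(c, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in total_degree.items())

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}

q_split = modularity(edges, two_groups)  # 2 * (3/7 - 0.25), about 0.357
q_merged = modularity(edges, one_group)  # 0.0
```

Splitting the graph into its two triangles scores higher than keeping everything in one community, which is the kind of partition these algorithms search for.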

Results

1. The effect of k and epsilon on the clustering results (RI).

The k-NN construction method allows only discrete values of k, while the ε-NN method accepts continuous values. When k and ε are small, vertices tend to make only a few connections.

?? What is the meaning of A, B, C, D in Figure 5?

2. the statistical test of the effect of different distance methods. Friedman test and Nemenyi test.

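Comparing multiple algorithms on multiple data sets:
- If the samples satisfy the assumptions of repeated-measures ANOVA (such as normality and equal variance), prefer ANOVA.
- If the samples do not satisfy the ANOVA assumptions, use the Friedman test with the Nemenyi test as post-hoc.
- If the sample sizes differ, or Friedman-Nemenyi cannot be used for some specific reason, try Kruskal-Wallis with Dunn's test. Note that this approach is meant for independent measurements, so it has to be considered case by case.

Result: the DTW measure gives the best results for both network construction methods.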
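
To make the Friedman and Nemenyi tests concrete, here is a sketch that computes the average ranks, the Friedman chi-square statistic, and the Nemenyi critical difference. The score table is invented for illustration and is not from the paper. The critical value `q_alpha` has to be looked up in a table; 2.343 is the value commonly tabulated for three algorithms at alpha = 0.05 and should be checked before use. Turning the statistic into a p-value needs a chi-square distribution with k-1 degrees of freedom, which is left out here.

```python
import math

def average_ranks(scores):
    """scores[d][a] = score of algorithm a on data set d (higher is better).
    Returns the mean rank per algorithm; rank 1 is best, ties share a rank."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda a: -row[a])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1..end+1
            for a in order[pos:end + 1]:
                totals[a] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]

def friedman_statistic(scores):
    """Friedman chi-square statistic computed from the average ranks."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)

def nemenyi_critical_difference(k, n_datasets, q_alpha):
    """Two algorithms differ significantly if their average ranks
    differ by at least this amount."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# Hypothetical RI scores: 4 data sets (rows) x 3 algorithms (columns).
scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.95, 0.70, 0.75],
    [0.90, 0.85, 0.80],
]
ranks = average_ranks(scores)      # [1.0, 2.25, 2.75]
chi2 = friedman_statistic(scores)  # 6.5
cd = nemenyi_critical_difference(k=3, n_datasets=4, q_alpha=2.343)
```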

3. the statistical test of the effect of community detection algorithms. Friedman test and Nemenyi test.

4. comparison to rival methods.

i. some classic clustering algorithms: k-medoids, complete-linkage, single-linkage, average-linkage, median-linkage, centroid-linkage and diana;

ii. three up-to-date ones: Zhang’s method [41], Maharaj’s method [24] and PDC [5]

5. detect time series clusters with time-shifts

Assumption: clustering algorithms should be capable of detecting groups of time series that have similar variations in time.

CBF data set: 30 series in three groups, all clustered correctly.

6. detect shape patterns

1000 time series of length 128, in four groups.

Goal: detect the shape patterns (UD, DD, DU, UU).

Discussion

1. the same idea can be extended to multivariate time series clustering.

2. evaluate the simulation results using different indexes.

3. As future works, we plan to propose automatic strategies for choosing the best number of neighbors (k and ε) and speeding up the network construction method, instead of using the naive method.

4. We also plan to apply the idea to solve other kinds of problems in time series analysis, such as time series prediction. ??

Supplementary knowledge:

1. box plot

It shows the maximum, minimum, median, and the lower and upper quartiles of a data set.

Below is a concrete example of a box plot:
```
                            +-----+-+       
  *           o     |-------|   + | |---|
                            +-----+-+    
                                         
+---+---+---+---+---+---+---+---+---+---+   score
0   1   2   3   4   5   6   7   8   9  10
```
This data set shows:
- minimum = 5
- lower quartile (Q1) = 7
- median (Med, i.e. Q2) = 8.5
- upper quartile (Q3) = 9
- maximum = 10
- mean = 8
- interquartile range (IQR) = $Q3 - Q1 = 2$
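
The quantities a box plot displays can be computed with Python's standard `statistics` module, as sketched below. The data set is a made-up toy list, not the one behind the plot above. Quartile conventions also differ between tools; this sketch uses the `inclusive` method of `statistics.quantiles`, so other software may report slightly different Q1 and Q3 for the same data.

```python
import statistics

def five_number_summary(data):
    """Minimum, quartiles, maximum and IQR of a list of numbers."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": min(data),
        "q1": q1,
        "median": med,
        "q3": q3,
        "max": max(data),
        "iqr": q3 - q1,
    }

summary = five_number_summary([1, 2, 3, 4, 5])
# min=1, q1=2.0, median=3.0, q3=4.0, max=5, iqr=2.0
```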
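2. Change of mindset: the experiment section is also important, not optional, and should be read carefully.

3. Statistical tests

How are machine learning algorithms commonly compared?

All models are wrong, but some are useful. (George Box, statistician)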

4. univariate and multivariate time series.

Univariate time series: Only one variable is varying over time. For example, data collected from a sensor measuring the temperature of a room every second. Therefore, each second, you will only have a one-dimensional value, which is the temperature.

Multivariate time series: Multiple variables are varying over time. For example, a tri-axial accelerometer. There are three accelerations, one for each axis (x, y, z), and they vary simultaneously over time.

Considering the data you showed in the question, you are dealing with a multivariate time series, where value_1, value_2 and value_3 are three variables changing simultaneously over time.
Original post: https://www.cnblogs.com/dulun/p/12170759.html