Pearson correlation coefficient
Spearman rank correlation
http://wiki.mbalib.com/wiki/斯皮尔曼等级相关
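The figures in the table show that the higher a worker's exam score, the higher that worker's output, so the two are very consistently related. Yet the correlation coefficient r = 0.676 is not especially high, because the relationship between them is not linear. If exam scores and output are each converted to ranks (columns 3 and 4 of the table above), the rank correlation coefficient between them is 1.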
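A minimal sketch of the effect described above. The data are made up (they are not the workers table from the source): y increases strictly with x but not linearly, so Pearson's r should stay below 1 while the rank correlation equals 1.

```r
# Hypothetical data: y grows monotonically but non-linearly with x
x <- 1:10
y <- exp(x)

# Pearson measures linear association, so it is below 1 here
pearson <- cor(x, y, method = "pearson")

# Spearman is Pearson applied to the ranks, which are identical for x and y
spearman <- cor(x, y, method = "spearman")
spearman_by_hand <- cor(rank(x), rank(y))

print(c(pearson = pearson, spearman = spearman, by_hand = spearman_by_hand))
```

Ranking discards the shape of the relationship and keeps only the ordering, which is why any strictly increasing relationship gives a rank correlation of 1.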
Kendall tau rank correlation coefficient
http://wiki.mbalib.com/wiki/肯德尔等级相关系数
http://baike.baidu.com/item/kendall秩相关系数
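As a rough illustration of what Kendall's tau measures, the sketch below counts concordant and discordant pairs by hand and compares the result with cor(..., method = "kendall"). The vectors are hypothetical and contain no ties, so the simple (tau-a) formula applies.

```r
# Hypothetical paired observations (no ties)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Compare every pair of observations (i, j) with i < j
pairs <- combn(length(x), 2)
signs <- sign(x[pairs[1, ]] - x[pairs[2, ]]) *
         sign(y[pairs[1, ]] - y[pairs[2, ]])

concordant <- sum(signs > 0)   # both variables move in the same direction
discordant <- sum(signs < 0)   # the variables move in opposite directions

# Kendall's tau without ties: (C - D) / number of pairs
tau_by_hand <- (concordant - discordant) / ncol(pairs)

print(tau_by_hand)
print(cor(x, y, method = "kendall"))
```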
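# Strategies for handling missing values
## 0 - Remove records that contain missing values
## 1 - Fill in missing values using the correlations between variables
## 2 - Fill in missing values using the similarity between cases
library(DMwR)
# algae: the algae data set
# Rows that contain at least one missing value
algae[!complete.cases(algae),]
# Row numbers of the records with many missing values:
# those where more than 20% of the attributes are missing
manyNAs(algae,0.2)

## 0 - Remove records that contain missing values
# Copy the algae data into x
x <- algae
# Perform the removal
# na.omit(x) returns a cleaned copy; it does not remove anything from x itself
y <- na.omit(x)
y
x[!complete.cases(x),]
# Returns 0 rows
y[!complete.cases(y),]
# Rows 62 and 199 are known to contain many missing values
z <- algae[-c(62,199),]
# named integer(0)
manyNAs(z)

## 1 - Fill in missing values using the correlations between variables
# Look for relationships between attributes
# Correlations among attributes 4 to 18
cor(algae[,4:18],use="complete.obs")
# Display the correlation matrix symbolically, which is easier to read
symnum(cor(algae[,4:18],use="complete.obs"))

# Variance: the degree of deviation, a measure of how spread out the values of X are
# Mean absolute deviation: E{|X-E(X)|}
# Variance: E{[X-E(X)][X-E(X)]} = D(X) = Var(X)
#                               = E(X*X)-E(X)*E(X)
# Correlation coefficient
## Covariance
## Cov(X,Y) = E[(X-E(X))(Y-E(Y))]
##          = E[XY]-2E[X]E[Y]+E[X]E[Y]
##          = E[XY]-E[X]E[Y]
##
# Intuitively, covariance is the expected joint deviation of two variables from their means.
# If the two variables tend to move together, that is, when one is above its expected value
# the other is also above its expected value, their covariance is positive. If they tend to
# move in opposite directions, that is, when one is above its expected value the other is
# below its expected value, their covariance is negative.
# If X and Y are statistically independent, their covariance is 0, because two independent
# random variables satisfy E[XY]=E[X]E[Y].
# The converse does not hold: if the covariance of X and Y is 0, they are not necessarily independent.
# The unit of Cov(X,Y) is the unit of X multiplied by the unit of Y. The correlation, which is
# derived from the covariance, is a dimensionless measure of linear dependence.
# Two random variables whose covariance is 0 are called uncorrelated.
#
# r(X,Y) = Cov(X,Y)/(standard deviation of X * standard deviation of Y)
# Multiple correlation coefficient: multiple correlation is the relationship between a dependent
# variable and several independent variables. For example, the seasonal demand for a product is
# related in this way to its price level, workers' income level and so on.
# Canonical correlation coefficient: first run a principal component analysis on each original
# group of variables to obtain composite indicators (new linear combinations), then study the
# relationship between the original groups through the linear correlation of those indicators.

# Expectation E(X)
# For random variables X and Y: E(X+Y)=E(X)+E(Y)
# For independent random variables X and Y: E(XY)=E(X)E(Y)
# Variance D(X)
# D(X) = E{[X-E(X)]^2} = E(X^2)-[E(X)]^2
# D(X+Y) = E{[(X+Y)-E(X+Y)]^2}
#        = D(X)+D(Y)+2{E(XY)-E(X)E(Y)}
#        = D(X)+D(Y)+2Cov(X,Y)
# A random variable that takes only finitely many values, or infinitely many values that can be
# listed one by one in some order, is called a discrete random variable. (A random variable whose
# range is one or more finite or infinite intervals is a continuous random variable.)
# The sum of the products of every possible value xi of a discrete random variable and its
# probability pi is called the mathematical expectation of that variable (assuming the series
# converges absolutely), written E(X). It generalises the simple arithmetic mean and resembles
# a weighted average.

# r(X,Y) = Cov(X,Y)/(standard deviation of X * standard deviation of Y)
#        = (E[XY]-E[X]E[Y])/(standard deviation of X * standard deviation of Y)
# For samples X and Y of size n:
# r(X,Y) = [sum(XY)/n - sum(X)/n*sum(Y)/n]
#          / {sum(X^2)/n - [sum(X)/n]^2}^0.5 / {sum(Y^2)/n - [sum(Y)/n]^2}^0.5
#        = [sum(XY)*n - sum(X)sum(Y)]
#          / {sum(X^2)*n - [sum(X)]^2}^0.5 / {sum(Y^2)*n - [sum(Y)]^2}^0.5
# http://www.oschina.net/code/snippet_66235_19127
Correlation <- function(x,y) {
  len <- length(x)
  if( len != length(y))
    stop("length not equal!")
  # Element-wise squares (equivalent to x^2 and y^2)
  x2 <- unlist(lapply(x,function(a) return(a^2)))
  y2 <- unlist(lapply(y,function(a) return(a^2)))
  xy <- x*y
  a <- sum(xy)*len - sum(x)*sum(y)
  b <- sqrt(sum(x2)*len - sum(x)^2)*sqrt(sum(y2)*len - sum(y)^2)
  if( b == 0)
    stop("data is incorrect!")
  return(a/b)
}
x1 = c(1,2,3)
y1 = c(4,5,6)
x = c(12.5,15.3,23.2,26.4,33.5,34.4,39.4,45.2,55.4,60.9)
y = c(21.2,23.9,32.9,34.1,42.5,43.2,49.0,52.8,59.4,63.5)
# 1
Correlation(x1,y1)
# 0.9941984
Correlation(x,y)
# 0.9941984
cor(x,y)

# > symnum(cor(algae[,4:18],use="complete.obs"))
# mP mO Cl NO NH o P Ch a1 a2 a3 a4 a5 a6 a7
# mxPH 1
# mnO2 1
# Cl 1
# NO3 1
# NH4 , 1
# oPO4 . . 1
# PO4 . . * 1
# Chla . 1
# a1 . . . 1
# a2 . . 1
# a3 1
# a4 . . . 1
# a5 1
# a6 . . . 1
# a7 1
# attr(,"legend")
# [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
# The PO4-oPO4 correlation lies between 0.9 and 0.95. The two variables are highly correlated,
# so each can be inferred from the other and used to fill in the other's missing values.
data(algae)
x <- algae[-manyNAs(algae),]
x
lm( PO4~oPO4,data = algae)
# > lm( PO4~oPO4,data = algae)
#
# Call:
# lm(formula = PO4 ~ oPO4, data = algae)
#
# Coefficients:
# (Intercept)         oPO4
#      42.897        1.293
# PO4 = 42.897+1.293*oPO4

## 2 - Fill in missing values using the similarity between cases: compute the distance between cases
# Compute the distances, sort them, and take the cases with the smallest distance
## k-nearest neighbors algorithm (KNN)
## A Programmer's Guide to Data Mining (写给程序员的数据挖掘指南)
## Manhattan distance, Euclidean distance, Mahalanobis distance
## The Mahalanobis distance was proposed by the Indian statistician Mahalanobis and expresses a
## covariance-based distance between data points. It is an effective way to compute the similarity
## of two unknown sample sets. Unlike the Euclidean distance, it takes the relationships between
## features into account (for example, a piece of information about height also carries information
## about weight, because the two are related) and it is scale-invariant, that is, independent of
## the measurement scale.

## Predicting algae frequencies: obtaining prediction models
# Predict the frequency of 7 kinds of algae in 140 water samples
# Multiple linear regression model: a linear function relating the target variable to a set of explanatory variables
# Tuning
# All attributes (variables) versus a subset: reduce the influence of irrelevant factors (weights: 0, <0, >0)
# Backward elimination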
Random variable, expectation, variance, independent random variables, covariance, correlation coefficient
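The identities that link these quantities can be checked numerically. The sketch below uses a small hypothetical sample and treats it as a whole population (dividing by n rather than n - 1); the helper names E, D and Cov are chosen here only to mirror the notation of these notes.

```r
# Hypothetical sample, treated as a whole population (divide by n, not n - 1)
x <- c(2, 4, 4, 5, 7, 9)
y <- c(1, 3, 2, 5, 6, 8)

E   <- function(v) mean(v)                         # expectation
D   <- function(v) E((v - E(v))^2)                 # variance
Cov <- function(a, b) E((a - E(a)) * (b - E(b)))   # covariance

# D(X) = E(X^2) - E(X)^2
print(all.equal(D(x), E(x^2) - E(x)^2))
# Cov(X,Y) = E(XY) - E(X)E(Y)
print(all.equal(Cov(x, y), E(x * y) - E(x) * E(y)))
# D(X+Y) = D(X) + D(Y) + 2Cov(X,Y)
print(all.equal(D(x + y), D(x) + D(y) + 2 * Cov(x, y)))

# r(X,Y) = Cov(X,Y) / (sd(X) * sd(Y)); the 1/n factors cancel, so it matches cor()
r <- Cov(x, y) / sqrt(D(x) * D(y))
print(all.equal(r, cor(x, y)))
```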
Frequency: k/N
As N -> infinity, the frequency approaches the probability
Frequency, probability: 1/N
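The convergence of the frequency k/N toward the probability can be illustrated with a simulation. This is a sketch under assumed values: the probability p = 0.5, the seed and the sample sizes are arbitrary choices, not taken from the notes.

```r
set.seed(42)  # arbitrary seed so the run is reproducible

p <- 0.5                                  # assumed true probability of "heads"
N <- 100000                               # number of simulated trials
flips <- rbinom(N, size = 1, prob = p)    # 1 = heads, 0 = tails

# Relative frequency k/n after the first n trials
n_values <- c(10, 100, 1000, 10000, N)
freq <- sapply(n_values, function(n) sum(flips[1:n]) / n)

print(data.frame(N = n_values, frequency = freq))
```

For small N the frequency can be far from p; as N grows it tends to settle close to p.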
> tmp1 <- c(1,2)
> tmp2 <- c(10,20)
> tmp3 <- sum(tmp1*tmp2)
> tmp3
[1] 50
> lapply(tmp1,function(a) return(a^2))
[[1]]
[1] 1

[[2]]
[1] 4

> unlist(lapply(tmp1,function(a) return(a^2)))
[1] 1 4
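The lapply/unlist combination in the session above squares each element and flattens the resulting list. Because arithmetic in R is vectorised, the same values can be obtained more directly; the short comparison below uses the same tmp1 vector.

```r
tmp1 <- c(1, 2)

squares_lapply <- unlist(lapply(tmp1, function(a) a^2))  # list, then flattened
squares_sapply <- sapply(tmp1, function(a) a^2)          # simplified to a vector directly
squares_vector <- tmp1^2                                 # vectorised arithmetic

print(squares_lapply)
print(squares_sapply)
print(squares_vector)
```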
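library('DMwR')
head(algae)
# Get to know the data set: summary() reports the 1st and 3rd quartiles (1st Qu., 3rd Qu.) and the median
summary(algae)
# Draw a histogram
hist(algae$mxPH)
# Convert the frequencies to probability densities
hist(algae$mxPH,prob=T)
# Judge by eye which distribution curve the data resemble
# Overlay a kernel density estimate (this is a smoothed density curve, not a fitted normal curve)
# na.rm=T drops the NA values
lines(density(algae$mxPH,na.rm=T))
# Fix the y-axis range to 0-1
hist(algae$mxPH,prob=T,ylim=0:1)
lines(density(algae$mxPH,na.rm=T))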
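To see what converting frequencies to probability densities means, the sketch below computes a histogram without plotting it and checks that the bar areas add up to 1. The vector ph is made-up normal data standing in for algae$mxPH, so that the example runs without the DMwR package.

```r
set.seed(1)                              # arbitrary seed for reproducibility
ph <- rnorm(200, mean = 8, sd = 0.6)     # made-up values standing in for algae$mxPH

# plot = FALSE returns the bin counts and densities without drawing anything
h <- hist(ph, plot = FALSE)

bin_width <- diff(h$breaks)
area <- sum(h$density * bin_width)       # total area under the density histogram

print(sum(h$counts))   # frequencies add up to the number of observations
print(area)            # densities times bin widths add up to 1
```

Since a density histogram has total area 1, it can be compared directly with a density curve such as the one drawn by lines(density(...)).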
myString<-"Hello,World!"
print (myString)
v <- 2+5i
print(class(v))
list1 <- list(c(2,5,3),21.3,sin)
print(list1)
M = matrix(c(1,2,3,31,32,33), nrow=2,ncol=3,byrow=TRUE )
print(M)
a <- array(c('a','b','c'),dim=c(3,3,2))
print(a)
a <- array(c('a','b'),dim=c(3,3,2))
print(a)
vector1 <- c(5)
vector2 <- c(11,12,13,14)
result <- array(c(vector1,vector2),dim=c(3,3,2))
print(result)
result <- array(c(vector1,vector2),dim=c(3,3,3))
print(result)
vector1 <- c(5)
vector2 <- c(11,12,13,14)
c.names <- c("COLUMN1","C2","C3")
r.names <- c("ROW1","R2","R3")
m.names <- c("Matrix1","M2")
# dimnames are given in the order rows, columns, matrices
result <- array(c(vector1,vector2),dim=c(3,3,2),dimnames = list(r.names,c.names,m.names))
print(result)
# Create a vector as input.
data <- c("East","West","East","North","North","East","West","West","West","East","North")
print(data)
print(is.factor(data))
# Apply the factor function.
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
# Create the vectors for data frame.
height <- c(132,151,162,139,166,147,122)
weight <- c(48,49,66,53,67,52,40)
gender <- c("male","male","female","female","male","female","male")
# Create the data frame.
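# stringsAsFactors = TRUE is required in R >= 4.0, where character columns are no longer converted to factors by default.
input_data <- data.frame(height,weight,gender,stringsAsFactors = TRUE)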
print(input_data)
# Test if the gender column is a factor.
print(is.factor(input_data$gender))
# Print the gender column to see the levels.
print(input_data$gender)
data <- c("East","West","East","North","North","East","West","West","West","East","North")
# Create the factors
factor_data <- factor(data)
print(factor_data)
# Apply the factor function with required order of the level.
new_order_data <- factor(factor_data,levels = c("East","West","North"))
print(new_order_data)
v <- gl(3,5)
print(v)
v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
print(v)
v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston","SZ"))
print(v)
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27")),
stringsAsFactors=FALSE
)
# Print the data frame.
print(emp.data)
# Get the structure of the data frame.
str(emp.data)
#Extract Specific columns
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
#Add the "dept" column.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
print(emp.data)
#library()
library(MASS)
print(ships)
#
#install.packages("reshape2")
require(reshape2)
molten.ships <- melt(ships, id = c("type","year"))
print(molten.ships)
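# Note: ships has no column named "variable" (it only appears in the molten output),
# so melt() stops here with an "id variables not found in data" error.
molten.ships <- melt(ships, id = c("type","year","variable"))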
print(molten.ships)
Checking data types
> rt <- read.table("hist.funs.txt",head=FALSE);
> class(rt)
[1] "data.frame"
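The c() function
Creates a vector from multiple elements. Applied to a data frame selection such as rt[1], it returns a list instead.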
> a<- c(rt[1])
> class(a)
[1] "list"
> apple<-c('red','green')
> class(apple)
[1] "character"
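The session above shows c(rt[1]) returning a list rather than a vector. The sketch below reproduces this with a small hypothetical data frame (the file hist.funs.txt is not available here, so rt is built inline) and shows two ways to get an atomic vector from a column.

```r
# Hypothetical one-column data frame standing in for the file read above
rt <- data.frame(V1 = c(3.5, 4.1, 2.8))

a <- c(rt[1])                           # rt[1] is a one-column data frame; c() turns it into a list
b <- rt[[1]]                            # [[ ]] extracts the column itself as an atomic vector
d <- unlist(rt[1], use.names = FALSE)   # flattening the list also gives a vector

print(class(a))
print(class(b))
print(class(d))
```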