[Spark Machine Learning Crash Course] Basics 04: Data Types (Python Edition)

# -*-coding=utf-8 -*-  
from pyspark import SparkConf, SparkContext
sc = SparkContext('local')

import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create the sparse vector (1.0, 0.0, 2.0, 0.0) in two equivalent ways:
# from an {index: value} dict, or from parallel lists of indices and values.
sv1 = Vectors.sparse(4, {0: 1.0, 2: 2.0})
sv2 = Vectors.sparse(4, [0, 2], [1.0, 2.0])
# Use a single-column SciPy csc_matrix as a sparse vector; this one is (10.0, 0.0, 30.0).
# It gets its own name so that it does not overwrite sv2 above.
sv3 = sps.csc_matrix((np.array([10.0, 30.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))
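
To make the sparse format concrete, here is a minimal pure-Python sketch of how a (size, indices, values) description expands into a dense vector. It is only an illustration, not MLlib code: the helper name sparse_to_dense is hypothetical and is not part of pyspark.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse description into a dense list.

    Hypothetical helper for illustration only; not part of pyspark.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size              # every unlisted position is an implicit zero
    for i, v in zip(indices, values):
        dense[i] = float(v)           # only the listed positions carry a value
    return dense


# Same size, indices and values as Vectors.sparse(4, [0, 2], [1.0, 2.0])
print(sparse_to_dense(4, [0, 2], [1.0, 2.0]))  # [1.0, 0.0, 2.0, 0.0]
```

Only two of the four entries are stored, which is why the sparse form pays off when most entries are zero.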


LabeledPoint

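In supervised learning algorithms such as classification and regression, a LabeledPoint represents a labeled data point. It holds a feature vector and a label (stored as a floating-point number), and it lives in the mllib.regression package.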

# -*-coding=utf-8 -*-  
from pyspark import SparkConf, SparkContext
sc = SparkContext('local')

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))


Matrix

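The base class for matrices is Matrix, and MLlib provides two implementations: dense matrices and sparse matrices. The recommended way to create a matrix is through the factory methods in Matrices.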

# -*-coding=utf-8 -*-  
from pyspark import SparkConf, SparkContext
sc = SparkContext('local')

from pyspark.mllib.linalg import Matrix, Matrices

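# Create the dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).
# Values are listed in column-major order: first column (1, 3, 5), then second column (2, 4, 6).
dm2 = Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6])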

# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
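
The argument order of the two factory methods is easy to misread, so here is a minimal pure-Python sketch of how the inputs map to rows. It assumes column-major storage for the dense case and the compressed sparse column (CSC) layout of column pointers, row indices and values for the sparse case. Both helper names are hypothetical and are not part of pyspark.

```python
def dense_from_column_major(num_rows, num_cols, values):
    """Rebuild the rows of a matrix from a flat column-major value list.

    Hypothetical helper for illustration only; not part of pyspark.
    """
    if len(values) != num_rows * num_cols:
        raise ValueError("expected num_rows * num_cols values")
    # Entry (r, c) sits at flat position c * num_rows + r.
    return [[float(values[c * num_rows + r]) for c in range(num_cols)]
            for r in range(num_rows)]


def dense_from_csc(num_rows, num_cols, col_ptrs, row_indices, values):
    """Rebuild the rows of a matrix from CSC arrays (hypothetical helper)."""
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for c in range(num_cols):
        # Entries col_ptrs[c] .. col_ptrs[c + 1] - 1 belong to column c.
        for k in range(col_ptrs[c], col_ptrs[c + 1]):
            rows[row_indices[k]][c] = float(values[k])
    return rows


print(dense_from_column_major(3, 2, [1, 3, 5, 2, 4, 6]))
# [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(dense_from_csc(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8]))
# [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
```

In the sparse call, the column pointers [0, 1, 3] say that column 0 owns the first stored entry and column 1 owns the next two, so the values 6 and 8 land in rows 2 and 1 of the second column.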