Mac上pycharm集成pyspark

Mac上pycharm集成pyspark
前提：

　　1.已经安装好spark。我的是spark2.2.0。

　　2.已经有python环境，我这边使用的是python3.6。

一、安装py4j

使用pip，运行如下命令：

　　
```
pip install py4j
```
使用conda，运行如下命令：
```
conda install py4j
```
二、使用pycharm创建一个project。

创建过程中选择python的环境。进入之后点击Run--》Edit Configurations--》Environment variables.

添加PYTHONPATH和SPARK_HOME，其中PYTHONPATH为spark安装路径中的python目录，SPARK_HOME为spark安装目录。

然后点ok，到第一个页面点Apply，ok。

三、点Preferences --》Project Structure--》Add Content Root

添加spark安装路径中python目录下的lib里面的py4j-0.10.4-src.zip和pyspark.zip。然后Apply，ok。

四、编写pyspark wordcount测试一下。我这边使用的是pyspark streaming程序。

代码如下：

WordCount.py
```
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working thread and batch interval of 1 second

sc = SparkContext("local[2]", "NetWordCount")

ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to hostname:port, like localhost:9999

lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words

words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
```
先到终端运行如下命令：
```
$ nc -lk 9999
```
接着可以在pycharm中右键运行一下。然后在上面这个命令行中输入单词以空格分割：

我输入如下：
```
a b a d d d d
```
然后摁回车。可以看到pycharm中输出如下结果：
```
Time: 2017-12-17 22:06:19
-------------------------------------------
('b', 1)
('d', 4)
('a', 2)
```
至此，完成。
相关阅读:
HDOJ 1556 线段树
 POJ 3977 折半枚举
 2017ACM省赛选拔赛题解
 关于四舍五入和截断
 POJ 3422 最小费用最大流
 Codeforces Round #407 (Div. 2) D. Weird journey 思维+欧拉
 POJ 3155 最大密度子图
 无向图最小割 stoer_wagner算法
 最大权闭合子图
 L2-001. 紧急救援 Dijkstra
原文地址：https://www.cnblogs.com/zeppelin/p/8053795.html