How to implement a connection pool in Spark Streaming

The Spark Streaming documentation contains the following snippet:

def sendPartition(iter):
    # ConnectionPool is a static, lazily initialized pool of connections
    connection = ConnectionPool.getConnection()
    for record in iter:
        connection.send(record)
    # return to the pool for future reuse
    ConnectionPool.returnConnection(connection)

dstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))

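But how does a worker get hold of a ConnectionPool? The simple idea is to have a static variable point to a ConnectionPool. There is one subtlety, though: how do we make sure this ConnectionPool lives on the worker rather than on the driver?

Taking Python as an example:

Implement a pool in connection_pool.py: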

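#!/usr/bin/python
# connection_pool.py
import psycopg2
import settings

from DBUtils.PooledDB import PooledDB

# Module-level pool: created once per Python process, when the module is
# first imported there. The second positional argument of PooledDB is
# mincached, the number of idle connections opened up front.
pool = PooledDB(psycopg2, settings.connection_pool_size,
                host=settings.db_host,
                database=settings.database,
                user=settings.db_user,
                password=settings.db_password)


def getConnection():
    # Calling close() on the returned connection hands it back to the pool.
    return pool.connection()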

Assume the main streaming code lives in main.py, and submit it to Spark:

spark-submit --py-files connection_pool.py main.py

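This ships connection_pool.py to the workers, so when sendPartition from main.py runs on a worker node it can import the module there and call connection_pool.getConnection().

The key here is to understand which code runs on the driver and which runs on the workers.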
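
For illustration, the worker-side part of main.py might look roughly like the sketch below. It assumes only the getConnection() function from connection_pool.py; the names send_partition and attach_sink, and the `events` table with its `payload` column, are hypothetical. The import is placed inside the function so that it is executed in the worker's Python process when the partition is handled.

```python
# main.py (sketch): the part that runs on the workers
def send_partition(records):
    # Imported here so the module is resolved in the worker process,
    # where --py-files has made connection_pool.py available.
    import connection_pool

    conn = connection_pool.getConnection()
    try:
        cur = conn.cursor()
        for record in records:
            # `events` and `payload` are hypothetical table/column names.
            cur.execute("INSERT INTO events (payload) VALUES (%s)", (record,))
        conn.commit()
        cur.close()
    finally:
        conn.close()  # for a PooledDB connection this returns it to the pool


def attach_sink(dstream):
    # Runs on the driver: it only registers what each partition should do.
    dstream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))
```

Only attach_sink runs on the driver; send_partition is serialized and called once per partition on the workers, so each worker process ends up with its own pool.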

Original article: https://www.cnblogs.com/englefly/p/4579863.html