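Multi-threaded downloads in PHP
In the previous post I mentioned that I needed to package files stored on Qiniu, so the first step is to download those files to local disk. Downloading a single file is fairly easy to implement.
Common functions for opening a URL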
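For the single-file case, a minimal sketch could look like the following. It assumes `allow_url_fopen` is enabled so that `fopen()` can open `http://` URLs; `download_file()` is a hypothetical helper name of mine, not a PHP built-in. `file_get_contents()` or a single cURL handle would do the job as well, this version just avoids holding the whole file in memory.

```php
<?php
/**
 * Copy one file (remote URL or local path) to disk in chunks.
 * download_file() is a hypothetical helper, not part of PHP itself.
 */
function download_file(string $url, string $dest): bool
{
    // fopen() understands http(s):// URLs when allow_url_fopen is on.
    $in = @fopen($url, 'rb');
    if ($in === false) {
        return false;
    }

    $out = fopen($dest, 'wb');
    if ($out === false) {
        fclose($in);
        return false;
    }

    // Stream the body to disk instead of building one big string.
    $copied = stream_copy_to_stream($in, $out);

    fclose($in);
    fclose($out);

    return $copied !== false;
}
```

Usage would be something like `download_file('http://www.example.com/', '/tmp/example.html');`. It works, but each call blocks until the file is finished, which is exactly the serial behaviour the rest of this post tries to get away from.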
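Once you have written enough code, you stop caring only about the result and start caring about performance and elegance too. This time I want to download several files at the same time instead of one after another. That mainly relies on the cURL functions. I went to the official manual and found what I needed, but the cURL functions are barely explained there, and I ran into one pitfall after another.
The example from the official manual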
// create a pair of cURL resources
$ch1 = curl_init();
$ch2 = curl_init();
// set the URLs and the corresponding options
curl_setopt($ch1, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "http://www.php.net/");
curl_setopt($ch2, CURLOPT_HEADER, 0);
// create the cURL multi handle
$mh = curl_multi_init();
// add the two handles
curl_multi_add_handle($mh, $ch1);
curl_multi_add_handle($mh, $ch2);
$running = null;
// execute the multi handle
do {
    usleep(10000);
    curl_multi_exec($mh, $running);
} while ($running > 0);
// remove and close all handles
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
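The code above is fully commented, but what do we do with the responses once the requests return? Reading further in the manual, I found curl_multi_getcontent.
Handling the responses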
$aURLs = array("http://www.php.net","http://www.w3cschools.com"); // array of URLs
$mh = curl_multi_init(); // init the curl Multi
$aCurlHandles = array(); // create an array for the individual curl handles
foreach ($aURLs as $id=>$url) { //add the handles for each url
    $ch = curl_init(); // init curl, and then setup your options
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // returns the result - very important
    curl_setopt($ch, CURLOPT_HEADER, 0); // no headers in the output
    $aCurlHandles[$url] = $ch;
    curl_multi_add_handle($mh,$ch);
}
$active = null;
//execute the handles
do {
    $mrc = curl_multi_exec($mh, $active);
}
while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}
/* This is the relevant bit */
// iterate through the handles and get your content
foreach ($aCurlHandles as $url=>$ch) {
    $html = curl_multi_getcontent($ch); // get the content
    // do what you want with the HTML
    curl_multi_remove_handle($mh, $ch); // remove the handle (assuming you are done with it);
}
/* End of the relevant bit */
curl_multi_close($mh); // close the curl multi handler
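The first time I saw code like this I was completely lost, especially with those two loops. There is very little material on the subject; luckily I found one article, so let me roughly walk through how the loops work.
First, what those constants mean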
CURLMcode
- CURLM_CALL_MULTI_PERFORM (-1) This is not really an error. It means you should call curl_multi_perform again without doing select() or similar in between. Before version 7.20.0 this could be returned by curl_multi_perform, but in later versions this return code is never used.
- CURLM_OK (0) Things are fine.
Let's look at the first loop
$active = null;
//execute the handles
do {
    $mrc = curl_multi_exec($mh, $active);
}
while ($mrc == CURLM_CALL_MULTI_PERFORM);
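curl_multi_exec drives the transfers attached to the multi handle. $mh is the handle created earlier by curl_multi_init. $active and $mrc are both integers: $mrc is the CURLM status code that curl_multi_exec returns, while $active is passed by reference and is set to the number of transfers that are still running. In other words, if you use the handle to process 5 URLs, $active is 5 while all of them are still in progress, and it then drops by 1 each time a URL finishes until it reaches 0. Because libcurl 7.20.0 and later never return CURLM_CALL_MULTI_PERFORM, the body of this first loop normally runs only once.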
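To watch that countdown happen, here is a small sketch that records every distinct value of $active it observes. `watch_active()` is a hypothetical helper name, and the exact sequence you get depends on timing and on the URLs you pass, so treat the result as illustrative only.

```php
<?php
/**
 * Run all URLs on one multi handle and record how $active changes.
 * watch_active() is a hypothetical helper for illustration.
 */
function watch_active(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    $seen = [];
    $active = 0;
    do {
        // $active is filled in by reference on every call.
        $mrc = curl_multi_exec($mh, $active);
        if (!$seen || end($seen) !== $active) {
            $seen[] = $active;
        }
        // Wait for activity; if select fails (-1), sleep briefly instead of spinning.
        if ($active > 0 && curl_multi_select($mh, 0.1) === -1) {
            usleep(1000);
        }
    } while ($active > 0 && $mrc === CURLM_OK);

    foreach ($handles as $ch) {
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    return $seen;
}
```

The last recorded value is 0 once every transfer has finished, and no value exceeds the number of URLs passed in.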
Now on to the second loop
while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}
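This loop says:
(while) as long as there are active connections and $mrc is OK
    (if) if curl_multi_select did not return -1, i.e. there may be data to handle
        (do/while) process the data, for as long as the system tells us to keep calling
This loop is responsible for checking whether all of the sockets have finished.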
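Putting the two loops together, here is how I would sketch the whole thing as one function. Since current libcurl never returns CURLM_CALL_MULTI_PERFORM, the sketch collapses both loops into a single one. `fetch_all()` is a hypothetical name of mine, and sleeping briefly when `curl_multi_select()` returns -1 is my own precaution against a busy loop, not something the manual example does.

```php
<?php
/**
 * Fetch several URLs concurrently and return their bodies, keyed like $urls.
 * A failed transfer yields null. fetch_all() is a hypothetical helper.
 */
function fetch_all(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // needed for curl_multi_getcontent
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    $active = 0;
    do {
        $status = curl_multi_exec($mh, $active);
        // Block until something happens; on -1 sleep a little instead of spinning.
        if ($active > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000);
        }
    } while ($active > 0 && $status === CURLM_OK);

    // Collect the handles whose transfer ended with an error.
    $failed = [];
    while ($info = curl_multi_info_read($mh)) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = in_array($ch, $failed, true) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    return $results;
}
```

Called as `fetch_all(['php' => 'http://www.php.net', 'example' => 'http://www.example.com/'])`, it returns an array with the same keys. Note that every URL is started at once, so this is only reasonable for a handful of requests.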
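Once I understood the code, I implemented a first version. Thinking it over, though, something seemed off: when there are many requests and I fire them all at once, can the server cope? And once it is running, will it eat up all the resources? That is a lot of concurrency, and it is not reasonable. We need to implement our own thread pool to keep the tasks under control.
We set up a thread pool of size n (strictly speaking no threads are involved; it is a fixed-size window of cURL handles running concurrently). We first add n URLs to the pool with curl_multi_add_handle. Each time a task finishes, we remove its handle and add a new URL, until all URLs have been processed in turn. Someone else has already built this, so I will just paste the important part of the code.
Handling multiple curl requests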
/**
 * Performs multiple curl requests
 *
 * @throws RollingCurlException
 * @param array $requests URLs to process
 * @param int $window_size capacity of the thread pool
 * @return bool
 */
function rolling_curl(array $requests, $window_size = 5) {
    // make sure the rolling window isn't greater than the # of urls
    if (count($requests) < $window_size)
        $window_size = count($requests);
    if ($window_size < 2) {
        throw new RollingCurlException("Window size must be greater than 1");
    }
    $master = curl_multi_init();
    // start the first $window_size requests
    for ($i = 0; $i < $window_size; $i++) {
        $ch = curl_init();
        $options = []; // request options (left empty in this excerpt)
        curl_setopt_array($ch, $options);
        curl_multi_add_handle($master, $ch);
        // Add to our request Maps ($this: the excerpt comes from a class method)
        // Note: (string) $ch only works while handles are resources (before PHP 8)
        $key = (string) $ch;
        $this->requestMap[$key] = $i;
    }
    do {
        while (($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
        if ($execrun != CURLM_OK) {
            break;
        }
        // find the requests that have just completed
        while ($done = curl_multi_info_read($master)) {
            // start the next queued request to take the place of the finished one
            if ($i < count($requests) && isset($requests[$i])) {
                $ch = curl_init();
                $options = []; // request options (left empty in this excerpt)
                curl_setopt_array($ch, $options);
                curl_multi_add_handle($master, $ch);
                $i++;
            }
            // remove the handle that has completed
            curl_multi_remove_handle($master, $done['handle']);
        }
        // Blocks until there is activity on any of the curl_multi connections,
        // which keeps the CPU from spinning.
        if ($running) {
            curl_multi_select($master);
        }
    } while ($running);
    curl_multi_close($master);
    return true;
}
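The excerpt above leaves the options empty and relies on `$this`, so it will not run on its own. As a rough, self-contained sketch of the same rolling-window idea, the function below downloads a list of `[url, destination]` pairs with at most `$window` transfers in flight. `download_rolling()` and its parameters are hypothetical names of mine. It buffers each response in memory via CURLOPT_RETURNTRANSFER to keep the sketch short; for large files, writing straight to disk with CURLOPT_FILE would be the better choice.

```php
<?php
/**
 * Download [url, destination] pairs with at most $window concurrent transfers.
 * Returns the URLs that failed. download_rolling() is a hypothetical helper.
 */
function download_rolling(array $jobs, int $window = 5): array
{
    $jobs = array_values($jobs);
    $mh = curl_multi_init();
    $open = [];    // transfers in flight: ['ch' => handle, 'url' => ..., 'dest' => ...]
    $failed = [];
    $next = 0;     // index of the next job that has not been started yet

    // Start the next queued job, if any is left.
    $startNext = function () use ($jobs, $mh, &$open, &$next) {
        if ($next >= count($jobs)) {
            return;
        }
        [$url, $dest] = $jobs[$next++];
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $open[] = ['ch' => $ch, 'url' => $url, 'dest' => $dest];
    };

    // Fill the window.
    for ($i = 0; $i < $window; $i++) {
        $startNext();
    }

    do {
        $status = curl_multi_exec($mh, $running);

        // Handle every transfer that has just finished.
        while ($done = curl_multi_info_read($mh)) {
            foreach ($open as $k => $t) {
                if ($t['ch'] !== $done['handle']) {
                    continue;
                }
                if ($done['result'] === CURLE_OK) {
                    file_put_contents($t['dest'], curl_multi_getcontent($t['ch']));
                } else {
                    $failed[] = $t['url'];
                }
                curl_multi_remove_handle($mh, $t['ch']);
                unset($open[$k]);
                break;
            }
            $startNext(); // a slot is free again
        }

        // Wait for activity; on -1 sleep briefly so the CPU does not spin.
        if ($open && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000);
        }
    } while ($open && $status === CURLM_OK);

    curl_multi_close($mh);
    return $failed;
}
```

A call such as `download_rolling([['http://www.example.com/', '/tmp/a.html'], ['http://www.php.net/', '/tmp/b.html']], 2)` would return an empty array when both downloads succeed.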
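At last it all made sense. The code above basically does what I need, but it reads like spaghetti code, so I kept looking for a ready-made wheel from the community. I had been hearing about Guzzle for a while, so I took a look at its documentation. The docs are very clear, so I got straight to it and knocked out a demo.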
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
set_time_limit(0);
$client = new Client();
$urls = [
'http://qxt-2017.cdn.xwg.cc/o_1bg5c4qca1j7vblh57qhl5aqu7.jpg',
'http://qxt-2017.cdn.xwg.cc/o_1bg5c5p9tp08c9v18lb2u1ufvc.pptx',
'http://qxt-2017.cdn.xwg.cc/2017-04-11_1491896251_lowb9SS8cBDjIOJ2jnIzZBDphY6s.mp4',
'http://qxt-2017.cdn.xwg.cc/o_1bdba57vm1dps1g34igq1853mi87.docx',
'http://qxt-2017.cdn.xwg.cc/2017-04-11_1491896395_lierbh4ZzMU8_2fSOUEUnXvgHQRo.mp4',
'http://qxt-2017.cdn.xwg.cc/o_1befiphi67tn1bnrgf6fc1mtm7.ppt',
'http://qxt-2017.cdn.xwg.cc/o_1befilahibnevnq1lp4170k1q9s7.xls',
'http://qxt-2017.cdn.xwg.cc/FsGmw6A4WZvOgt-nPhFKW2pFSH1t'
];
$titles = [
"264-141112102942604.jpg",
'希望谷样板.pptx',
'晓日.avi',
'原理与发明测试.docx',
'[dmzj][itazura_na_kiss][rv10][1280_720][12].rmvb',
'web+of+science分析功能.ppt',
'工-程-量-清-单-对-比-表.xls',
'IMG_20170103_191314.jpg'
];
$requests = function () use ($urls) {
    foreach ($urls as $key => $url) {
        yield new Request('GET', $url); // [Generator syntax](http://php.net/manual/en/language.generators.syntax.php)
    }
};
$pool = new Pool($client, $requests(), [
    'concurrency' => 5,
    'fulfilled' => function ($response, $index) use ($titles) {
        file_put_contents($titles[$index], $response->getBody()->getContents()); // write the file to disk
    },
    'rejected' => function ($reason, $index) {
        print_r($reason); // why the request failed
        echo $index; // index of the failed request
    },
]);
// start sending the requests
$promise = $pool->promise();
$promise->wait();
It is clear at a glance: everything is built on promises, which is quite friendly to JS developers, and the structure is easy to follow.