• Linux下利用Shell使PHP并发采集淘宝产品


    上次项目中用到<<PHP采集淘宝商品>>

    此方法有一个缺点,就是执行效率问题.一个商品采集平均需要0.8秒.那10000个商品采集完需要2个半小时.

    首先想到的解决办法是并发.

    但是PHP不支持并发(这里是指通过PHP命令执行PHP文件,如果通过apache或nginx等做服务器是可以并发的,是并发访问,不能在程序中实现并发).

    通过Shell把对php命令推到后台执行,以达到并发的效果.

    整体思路:

      1.在Shell中连接数据库,取出需要更新的产品

      2.Shell中对数据进行循环,把商品id,price,url传递给PHP执行,将执行过程推到后台

      3.每循环20条使程序暂停5秒,达到控制并发数的目的

      4.php得到id,price,url参数后,通过URL进行采集,并返回现价

      5.将现价和原数据库中的价格进行比较,如果有变化或下架则更新.

      6.将执行结果写入日志文件.

    Shell

    #!/bin/sh
    #updateprice.sh
    j=0
    currcyline=0;
    #查询数据库
    for i in `/usr/local/bin/mysql -uroot -pshop123 shop -e"SELECT id,price,url FROM s_goods WHERE url!='' AND keyid like 'taobao%' AND is_off_sale=0 ORDER BY id DESC  "` 
    do
            if [ $j -gt 2 ]; then
            #前3个循环分别为id,price,url这相当于表头,不需要进行操作,所以从第3开始.
                    line=$(($j%3))
                    case $line in
                    0)
                            currcyline=$(($currcyline+1))
    
                            s=$(($currcyline%20))
    
                            if [ $s -eq 0 ]; then
                                    sleep 5  #每循环20次休息5秒,以此来控制避免产生过多的后台进行,使服务器压力过大或死机.
                            fi
                            id=$i;;
                    1)
                            price=$i;;
    
                    2)
                            url=$i;
                            #echo id:${id}  price:${price}  url:${url}
    
                            {
                                                                    #调用php命令执行PHP文件.
                                    re=`/usr/local/bin/php /var/www/9384shop/cron/goodsupdate.php ${id} ${price} "${url}"` 
                            }&
                            #此&为推到后台执行, 关键
                    esac
    
    
            fi
            j=$(($j+1))
    done
    wait
    #等待后台进行执行完成

    PHP:

    <?php
    /*
     ============================================================================
     Name        : goodsupdate.php
     Author      : 风飘无痕
     Version     :
     Copyright   : Your copyright notice
     Description : Collect taobao goods
     ============================================================================
     */
    
    //$argv为获取命令中的参数
    $s=microtime(1);
    $id=$argv[1];
    $oldprice=$argv[2];
    $price=getPrice($argv[3]);
    $t=microtime(1)-$s;
    $r=array();
    $r[]=date('Y-m-d H:i:s');
    $r[]=$id;
    $r[]=ceil($t*1000)/1000;
    if($price=='soldout'){
            $r[]="OutStock";
            $con=mysql_connect('localhost','shop','shop123');
            mysql_select_db("shop", $con);
            mysql_query("UPDATE s_goods SET is_off_sale=1 WHERE id=".$id);
            mysql_close($con);
    }
    elseif($price===false) $r[]= 'FALSE';
    elseif($price==$oldprice) $r[]='EQUAL';
    else{
            $r[]="UPDATE";
            $r[]=$oldprice;
            $r[]=$price;
            $con=mysql_connect('localhost','shop','shop123');
            mysql_select_db("shop", $con);
            mysql_query("UPDATE s_goods SET price=".$price." WHERE id=".$id);
            mysql_close($con);
    }
    //以日志的形式保存执行过程
    $h=fopen('/home/staff/www/9384shop/log/goodsUpdate.log','a+');
    
    fputcsv($h,$r);
    fclose($h);
    
    function getPrice($url,$time=1){
            $des_url='';
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0');
            curl_setopt($ch, CURLOPT_REFERER,'http://www.tmall.com/');//采集淘宝商品必须设置此项
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);//设置输出方式, 0为自动输出返回的内容, 1为返回输出的内容,但不自动输出.
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30); //timeout on connect
            curl_setopt($ch, CURLOPT_TIMEOUT, 30); //timeout on response
            curl_setopt($ch, CURLOPT_HEADER, 1);//是否输出头信息,0为不输出,非零则输出
            curl_setopt($ch, CURLOPT_MAXREDIRS, 10 );
            curl_setopt($ch, CURLOPT_URL, $url);
            $content = curl_exec($ch);
            curl_close($ch);
            if(preg_match('/noitem.htm/',$content)){
                    return 'soldout';
            }elseif(preg_match("/'reservePrice's*:s*'([d.]+?)',/",$content,$price)){
                    $price = (float)$price[1];
            }elseif(preg_match('/price:([d.]+?),/',$content,$price)){
                    $price = (float)$price[1];
            }
            if(!$price){
                    preg_match('/id=(d+)+/',$url,$temp);
                    $url2="http://mdskip.taobao.com/core/initItemDetail.htm?itemId=".$temp[1];
                    $ch = curl_init();
                    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
                    curl_setopt( $ch, CURLOPT_URL, $url2 );
                    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
                    curl_setopt( $ch, CURLOPT_ENCODING, "" );
                    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
                    curl_setopt( $ch, CURLOPT_REFERER, 'http://www.tmall.com' );
                    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
                    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );   
                    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, 10 );
                    curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );
                    curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
                    $price_content = curl_exec( $ch );
                    $response = curl_getinfo( $ch );
                    curl_close ( $ch );
                    $price_content=json_decode(iconv('gbk','utf-8',preg_replace('/(d{10,}):/','"${1}":',$price_content)),true);
                    $priceinfo=$price_content['defaultModel']['itemPriceResultDO']['priceInfo'];
                    $price=array();
                    if(is_array($priceinfo)){
                    foreach ($priceinfo as $v){
                            $price[]=$v['price'];
                            if(is_array($v['promotionList'])){
                                    foreach ($v['promotionList'] as $v2){
                                            $price[]=$v2['extraPromPrice']?$v2['extraPromPrice']:$v2['price'];
                                    }
                            }
                            if(is_array($v['suggestivePromotionList'])){
                                    foreach ($v['suggestivePromotionList'] as $v2){
                                            $price[]=$v2['extraPromPrice']?$v2['extraPromPrice']:$v2['price'];
                                    }
                            }
                    }
                    }
                    $price=count($price)>0?min($price):false;
    
            }
            if($price) return $price;
            elseif($time<3) return getPrice($url,$time+1);//如果没有取到价格递归执行,最多执行3次.
            else return false;
    }

    执行结果:

    tail -10 goodsUpdate.log
    "2014-03-21 13:45:34",13357,0.273,EQUAL
    "2014-03-21 13:45:35",13380,5.883,EQUAL
    "2014-03-21 13:45:35",13343,0.914,EQUAL
    "2014-03-21 13:45:35",13344,0.923,EQUAL
    "2014-03-21 13:45:35",13347,0.927,UPDATE,599.00,181.00
    "2014-03-21 13:45:35",13339,0.908,EQUAL
    "2014-03-21 13:45:35",13342,0.93,EQUAL
    "2014-03-21 13:45:35",13348,0.933,EQUAL
    "2014-03-21 13:45:35",13349,0.946,UPDATE,1547.00,1877.00
    "2014-03-21 13:45:35",13338,0.947,EQUAL

    此方法比只用PHP更新大大节约了时间,更新2万个商品大约是半小时.但是线程数非常不好控制

  • 相关阅读:
    软件性能测试
    我为何转来博客园
    【5】查询练习:DISTINCT、Between...and...、in、order by、count
    第5章:pandas入门【3】汇总和计算描述
    【4】建点表,填点数
    【3】数据库三大设计范式
    【2】约束
    【1】基本操作
    第5章:pandas入门【2】基本功能
    第5章:pandas入门【1】Series与DataFrame
  • 原文地址:https://www.cnblogs.com/lywy510/p/3613522.html
Copyright © 2020-2023  润新知