• [PHP]用php的curl_post写一个爬取百度搜索结果的api


    0x01 起因

    20180716671625_zOfqVp.jpg

    因为手速慢,漏洞刷不过别人,一个个手补确实慢,所以想自己写一个api,一键抓取百度搜索结果,然后就可以打批量了 ovo(真是太妙了!)

    a95a2e2dd42a2834fd6ca36c4cb5c9ea14cebf97.jpg

    0x02 动工

          1.抓包做准备

    要做一个抓取的,当然是先抓包啦~

    拿出我的bp,和zoomeye篇一样先输入一个关键字,方便在bp里面找我输入的关键字

    然后回车~

    发现我输入的关键字在这个GET的请求包里

    用repeat模块后:

    发现返回了我要的搜索结果


    timg (1).jpg

           2.使用php的curl来模拟访问

    PHP支持的由Daniel Stenberg创建的libcurl库允许你与各种的服务器使用各种类型的协议进行连接和通讯。

    libcurl目前支持http、https、ftp、gopher、telnet、dict、file和ldap协议。libcurl同时也支持HTTPS认证、HTTP POST、HTTP PUT、 FTP 上传(这个也能通过PHP的FTP扩展完成)、HTTP 基于表单的上传代理cookies和用户名+密码的认证

    PHP中使用cURL实现GetPost请求的方法

    这些函数在PHP 4.0.2中被引入。

    就是说,在php4.0.2中就已经引入了curl,而且还可以做post和get,真是太有用了有木有

    QQ截图20200509141743.png拿出我刚刚记录好的请求包~

    Connection: close

    Upgrade-Insecure-Requests: 1

    User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36

    Sec-Fetch-Dest: document

    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9

    Sec-Fetch-Site: none

    Sec-Fetch-Mode: navigate

    Sec-Fetch-User: ?1

    Accept-Language: zh-CN,zh;q=0.9

    Cookie: PSTM=1588249253; BAIDUID=C34E0834A4B2DA6CBA0B25FA3A67FC8D:FG=1; BIDUPSID=735A45B6473102ED12E4236A4401AE21; BD_UPN=12314353; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDUSS=2FaaXBqcDhaY0p-U1JvMXJ3dnVVdnJlSklhelZkSEQ1aGF1a1lWYjh3WnNVOUplSVFBQUFBJCQAAAAAAAAAAAEAAADfyRotztLX3MrHzvvO-7n-uf4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGzGql5sxqpeZ; H_PS_PSSID=1461_31325_21098_31254_31342_31271_31464_30824_31164_22158; delPer=0; BD_CK_SAM=1; PSINO=5; sug=3; sugstore=0; ORIGIN=2; bdime=0; H_PS_645EC=fe7bDpIqw0Ye%2FC9V9rTqXv5ARp5x3G1lJcPTrEHREGKf1YbuRCoB6oR0frw


    然后用php语言来描述他:

    <?php
    function curl_post($url){
    	//$url='https://www.baidu.com/s?wd=%22Office%20Anywhere%22&tn=93348797_hao_pg&ie=utf-8&ch=1&pn='.$pn;
         $ch = curl_init();
    	 curl_setopt($ch, CURLOPT_URL, $url);
    	 curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    	 curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    	 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    	 curl_setopt($ch, CURLOPT_HEADER, TRUE);
    	$headers = array();
        $headers[] = 'Connection: close';
    	$headers[] = 'Upgrade-Insecure-Requests: 1';
    	$headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36';
    	$headers[] = 'Sec-Fetch-Dest: document';
    	$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9';
    	$headers[] = 'Sec-Fetch-Site: none';
    	$headers[] = 'Sec-Fetch-Mode: navigate';
    	$headers[] = 'Sec-Fetch-User: ?1';
    	$headers[] = 'Accept-Language: zh-CN,zh;q=0.9';
        $headers[] = 'Cookie: PSTM=1588249253; BAIDUID=C34E0834A4B2DA6CBA0B25FA3A67FC8D:FG=1; BIDUPSID=735A45B6473102ED12E4236A4401AE21; BD_UPN=12314353; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDUSS=2FaaXBqcDhaY0p-U1JvMXJ3dnVVdnJlSklhelZkSEQ1aGF1a1lWYjh3WnNVOUplSVFBQUFBJCQAAAAAAAAAAAEAAADfyRotztLX3MrHzvvO-7n-uf4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGzGql5sxqpeZ; H_PS_PSSID=1461_31325_21098_31254_31342_31271_31464_30824_31164_22158; delPer=0; BD_CK_SAM=1; PSINO=5; sug=3; sugstore=0; ORIGIN=2; bdime=0; H_PS_645EC=fe7bDpIqw0Ye%2FC9V9rTqXv5ARp5x3G1lJcPTrEHREGKf1YbuRCoB6oR0frw';
         curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
         $output= curl_exec($ch);
         curl_close($ch);
         return $output;
    }
    ?>

    这样子就可以完成一次php中curl对baidu的请求了

    然后用正则的方法取出其中的链接(http://www.baidu.com/link?url=xxxxx)

    $a=curl_post($url);
    $pattern = '/<a target="_blank" href="(.*?)"(.*?)" class="(.*?)/i';
    preg_match_all($pattern, $a, $match);

    如果有不会的可以看我的第一篇(zoomeye篇)

    )

    保留head

      最后就会把返回值$a打印在屏幕上了

    但是百度的返回值是www.baidu.com/link?url=xxxxxxxx

    所以要获取真实链接:

    function get_real($url){
    $info = parse_url($url);
    $fp = fsockopen($info['host'], 80,$errno, $errstr, 30);
    fputs($fp,"GET {$info['path']}?{$info['query']} HTTP/1.1
    ");
    fputs($fp, "Host: {$info['host']}
    ");
    fputs($fp, "Connection: close
    
    ");
    $rewrite = '';
    while(!feof($fp)) {
        $line = fgets($fp);
        if($line != "
    " ) {
            if(strpos($line,'Location:') !== false) {
                $rewrite = str_replace(array("
    ","
    ","Location: "),'',$line);
            }
        }else {
            break;
        }
    }
    return $rewrite; 
    }


    timg.jpg

    最后输出即可

    0x03所以完整代码如下:

    <?php
    function get_real($url){
    $info = parse_url($url);
    $fp = fsockopen($info['host'], 80,$errno, $errstr, 30);
    fputs($fp,"GET {$info['path']}?{$info['query']} HTTP/1.1
    ");
    fputs($fp, "Host: {$info['host']}
    ");
    fputs($fp, "Connection: close
    
    ");
    $rewrite = '';
    while(!feof($fp)) {
        $line = fgets($fp);
        if($line != "
    " ) {
            if(strpos($line,'Location:') !== false) {
                $rewrite = str_replace(array("
    ","
    ","Location: "),'',$line);
            }
        }else {
            break;
        }
    }
    return $rewrite; 
    }
    
    
    
    
    function getSubstr($str, $leftStr, $rightStr)
    {
    $left = strpos($str, $leftStr);
    //echo '左边:'.$left;
    $right = strpos($str, $rightStr,$left);
    //echo '<br>右边:'.$right;
    if($left < 0 or $right < $left) return '';
    return substr($str, $left + strlen($leftStr), $right-$left-strlen($leftStr));
    }
    
    
    
    
    
    
    function curl_post($url){
    	//$url='https://www.baidu.com/s?wd=%22Office%20Anywhere%22&tn=93348797_hao_pg&ie=utf-8&ch=1&pn='.$pn;
         $ch = curl_init();
    	 curl_setopt($ch, CURLOPT_URL, $url);
    	 curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    	 curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    	 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    	 curl_setopt($ch, CURLOPT_HEADER, TRUE);
    	$headers = array();
        $headers[] = 'Connection: close';
    	$headers[] = 'Upgrade-Insecure-Requests: 1';
    	$headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36';
    	$headers[] = 'Sec-Fetch-Dest: document';
    	$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9';
    	$headers[] = 'Sec-Fetch-Site: none';
    	$headers[] = 'Sec-Fetch-Mode: navigate';
    	$headers[] = 'Sec-Fetch-User: ?1';
    	$headers[] = 'Accept-Language: zh-CN,zh;q=0.9';
        $headers[] = 'Cookie: PSTM=1588249253; BAIDUID=C34E0834A4B2DA6CBA0B25FA3A67FC8D:FG=1; BIDUPSID=735A45B6473102ED12E4236A4401AE21; BD_UPN=12314353; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDUSS=2FaaXBqcDhaY0p-U1JvMXJ3dnVVdnJlSklhelZkSEQ1aGF1a1lWYjh3WnNVOUplSVFBQUFBJCQAAAAAAAAAAAEAAADfyRotztLX3MrHzvvO-7n-uf4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGzGql5sxqpeZ; H_PS_PSSID=1461_31325_21098_31254_31342_31271_31464_30824_31164_22158; delPer=0; BD_CK_SAM=1; PSINO=5; sug=3; sugstore=0; ORIGIN=2; bdime=0; H_PS_645EC=fe7bDpIqw0Ye%2FC9V9rTqXv5ARp5x3G1lJcPTrEHREGKf1YbuRCoB6oR0frw';
         curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
         $output= curl_exec($ch);
         curl_close($ch);
         return $output;
    }
    
    
    
    
    
    
    
    
    
    function get_url($page){
    
    $start='"<a target="_blank" href="';
    $end='" class="';
    $a='';
    
    /*这里是关键词*/
    $wddd="'Office Anywhere'";
    /*这里是关键词*/
    
    
    
    
    $url="https://www.baidu.com/s?wd=".urlencode($wddd)."&tn=93348797_hao_pg&ie=utf-8&ch=1&pn=".$page;
    $a=curl_post($url);
    $pattern = '/<a target="_blank" href="(.*?)"(.*?)" class="(.*?)/i';
    preg_match_all($pattern, $a, $match);
    $first=count($match);
    $aaa=array();
    $o=1;
    	for($j=0;$j<=count($match[0]);$j++){
    
    		
    		if(strpos($match[0][$j],'http://www.baidu.com/link?url=') !== false){ 
    
    	$match1[$o]=get_real('h'.getSubstr($match[0][$j],$start,$end));
    	$o++;
    
    }
    	}
    
    	return $match1;
    }
    
    
    
    for($i=0;$i<=500;$i+=10){
    	$aaa[$o]=get_url($i);
    	$o++;
    }
    $bbb=array();
    $ccount=count($aaa[""]);
    
    for($i=1;$i<=$ccount;$i++){
    	$bbb[$i]=$aaa[""][$i];
    
    }
    $aaaaaaaa=count($aaa[$i]);
    for($i=1;$i<=count($aaa);$i++){
    	for($j=1;$j<=$aaaaaaaa;$j++){
    		if($aaa[$i][$j]!=''){echo $aaa[$i][$j].'</p>';}
    	
    	}
    }
    for($i=1;$i<=count($bbb);$i++){
    	
    		if($bbb[$i]!=''){echo $bbb[$i].'</p>';}
    	
    	
    }
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    ?>
    

    关键词要自己设定哦。我在源码中已经标明设定关键词的地方辽

    a95a2e2dd42a2834fd6ca36c4cb5c9ea14cebf97.jpg
  • 相关阅读:
    python进阶(1)--多进程与多线程概念
    mysql数据库(5)--表的相关操作
    mysql数据库(4)--表删除操作
    备注
    mysql数据库(3)--mysql的执行顺序
    mysql数据库(2)--窗口函数之序号函数
    Spring MVC 返回中文乱码
    郁闷,郁闷啊
    Struts2 下载文件
    js文件加载
  • 原文地址:https://www.cnblogs.com/shadowwolf/p/12943796.html
Copyright © 2020-2023  润新知