• Logging in to a website with PHP and scraping data


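    Sometimes you need to log in to a website and then scrape useful information from it; doing this by hand is tedious. Some people get the login request working quickly but can never reach other pages afterwards, because they do not send the cookie back and the server treats the second request as a new session. The code below shows how to handle this.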

    <?php  // test.php
    function getWebContent($host, $page = "/", $paramstr = "", $cookies = '', $method = "POST", $port = 80) {
        $fp = fsockopen($host, $port);
        if (!$fp) {
            return false;
        }
        $method = strtoupper($method) == "POST" ? "POST" : "GET";
        if ($method == "GET" && $paramstr) {
            $page .= "?" . $paramstr;
        }
        // Every header line ends with CRLF; an empty line terminates the header block.
        $out = "$method $page HTTP/1.1\r\n";
        $out .= "Accept: */*\r\n";
        $out .= "Host: $host:$port\r\n";
        if ($method == "POST") {
            $out .= "Content-Length: " . strlen($paramstr) . "\r\n";
            $out .= "Content-Type: application/x-www-form-urlencoded\r\n";
        }
        if ($cookies) {
            $out .= "Cookie: " . $cookies . "\r\n";
        }
        // Ask the server to close the connection so that the feof() loop below terminates.
        $out .= "Connection: close\r\n\r\n";
        if ($method == 'POST' && $paramstr) {
            $out .= $paramstr;
        }
        fwrite($fp, $out);
        $cookie = "";
        $content = "";
        while (!feof($fp)) {
            $str = fgets($fp);
            // Keep only the name=value part of each Set-Cookie header (everything before the first ';').
            if (preg_match('/^Set-Cookie:\s*([^;\r\n]*)/i', $str, $matches)) {
                if ($cookie) {
                    $cookie .= "; " . $matches[1];
                } else {
                    $cookie = $matches[1];
                }
            }
            $content .= $str;
        }
        fclose($fp);
        // 'content' is the raw response (status line, headers and body).
        return array('content' => $content, 'cookie' => $cookie);
    }

    $params = "name=admin&pwd=admin";
    $rs = getWebContent("127.0.0.1", "/test/login.php", $params, "", "POST", 8080);
    echo $rs['content'];
    // Passing in the cookie from the previous response is the key step; without it the server sees two separate sessions.
    $rs = getWebContent("127.0.0.1", "/test/index.php", "", $rs['cookie'], "POST", 8080);
    echo $rs['content'];
    ?>

    <?php //login.php
        $name = $_REQUEST['name'];
        $pwd = $_REQUEST['pwd'];
        if($name == "admin" && $pwd == "admin"){
            setcookie("cname",$name);
            echo "success";
        }else{
            echo "failed";   
        }
    ?>

    <?php //index.php
    if(isset($_COOKIE['cname']) && $_COOKIE['cname']){
        echo "<ul><li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li></ul>";
    }else{
        echo "please login first!";
    }
    ?>

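    Save the three files above separately. Put login.php and index.php in a test directory under the web server's document root (the example expects the server on port 8080). test.php can live in any directory; run php test.php from the command line to see the result.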
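    test.php picks cookies out of the response with a regular expression while it reads the socket line by line. As a rough illustration of that step on its own, the sketch below parses an already-downloaded raw response and builds the value for a Cookie request header. The function name extractCookies and the sample response are hypothetical and not part of the original article; the sketch assumes only the name=value part of each Set-Cookie header needs to be sent back.

```php
<?php
// Hypothetical helper (not from the original article): turn a raw HTTP
// response into a value suitable for a "Cookie:" request header.
function extractCookies(string $rawResponse): string
{
    // Headers end at the first blank line; anything after it is the body.
    $parts = preg_split("/\r?\n\r?\n/", $rawResponse, 2);
    $headerBlock = $parts[0];

    $pairs = [];
    foreach (preg_split("/\r?\n/", $headerBlock) as $line) {
        // Keep only "name=value"; attributes such as Path or Expires come
        // after the first ';' and are not sent back to the server.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $pairs[$m[1]] = trim($m[2]); // a later cookie with the same name wins
        }
    }

    $out = [];
    foreach ($pairs as $name => $value) {
        $out[] = $name . '=' . $value;
    }
    return implode('; ', $out);
}

// Made-up sample response; the last line sits in the body and is ignored.
$raw = "HTTP/1.1 200 OK\r\n"
     . "Set-Cookie: cname=admin; path=/\r\n"
     . "Set-Cookie: lang=en\r\n"
     . "Content-Type: text/html\r\n"
     . "\r\n"
     . "Set-Cookie: fake=1\r\n";

echo extractCookies($raw), "\n"; // prints: cname=admin; lang=en
```

    Restricting the search to the header block avoids picking up text in the page body that happens to look like a Set-Cookie line.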

    A simpler approach is to use curl. The following code can replace test.php:
    <?php
    $post_data = array(
        "name" => "admin",
        "pwd" => "admin",
    );
    $cookie_jar = tempnam('./', 'cookie'); // create a temporary cookie file
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://localhost:8080/test/login.php");
    // Return the response as a string instead of printing it directly
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Send the request as a POST
    curl_setopt($ch, CURLOPT_POST, 1);
    // Attach the POST fields
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
    // Save the cookies returned by the server to the $cookie_jar file
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_jar);
    echo curl_exec($ch);
    curl_close($ch);
    // The jar is written when the handle is freed; on PHP 8+ curl_close() no longer does that, so drop the handle explicitly
    unset($ch);

    $ch2 = curl_init();
    curl_setopt($ch2, CURLOPT_URL, "http://localhost:8080/test/index.php");
    curl_setopt($ch2, CURLOPT_HEADER, false);
    curl_setopt($ch2, CURLOPT_RETURNTRANSFER, 1);
    // Read the cookies saved by the login request and send them with this request
    curl_setopt($ch2, CURLOPT_COOKIEFILE, $cookie_jar);
    echo curl_exec($ch2);
    unlink($cookie_jar);
    curl_close($ch2);
    ?>
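    The curl version above stores cookies in a temporary file and reads them back with a second handle. One possible variation, sketched here under stated assumptions, reuses a single handle so that cookies stay in memory between the two requests. The helper names buildPostBody and loginAndFetch are hypothetical and not from the original article. The sketch relies on libcurl enabling its cookie engine when CURLOPT_COOKIEFILE is set to an empty string, and on a string passed to CURLOPT_POSTFIELDS being sent as application/x-www-form-urlencoded (an array is sent as multipart/form-data).

```php
<?php
// Hypothetical helper: encode form fields as application/x-www-form-urlencoded.
function buildPostBody(array $fields): string
{
    return http_build_query($fields, '', '&');
}

// Hypothetical helper: log in, then fetch a second page on the same handle.
// Returns the body of the second page, or false if either request fails.
function loginAndFetch(string $loginUrl, string $targetUrl, array $fields)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        // An empty string turns on the cookie engine without reading a file,
        // so cookies are kept in memory for the lifetime of this handle.
        CURLOPT_COOKIEFILE     => '',
    ]);

    // Step 1: POST the credentials as a URL-encoded string.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, buildPostBody($fields));
    if (curl_exec($ch) === false) {
        return false;
    }

    // Step 2: switch back to GET and request the protected page;
    // the cookies received in step 1 are sent automatically.
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $targetUrl);
    return curl_exec($ch); // the handle is released when $ch goes out of scope
}

echo buildPostBody(['name' => 'admin', 'pwd' => 'admin']), "\n"; // prints: name=admin&pwd=admin
```

    With the test pages from this article, a call could look like loginAndFetch("http://localhost:8080/test/login.php", "http://localhost:8080/test/index.php", array("name" => "admin", "pwd" => "admin")). Nothing is written to disk, so there is no cookie file to clean up afterwards.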

  • Original article: https://www.cnblogs.com/grimm/p/5048993.html