• Data collection class


    A crawler, also known as a spider, is a program that fetches resources from other websites. In C#/.NET, a page can be fetched as follows:

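    // Requires: using System.IO; using System.Net; using System.Text;
    protected string GetPageHtml(string url)
    {
        string pageinfo;
        try
        {
            WebRequest myreq = WebRequest.Create(url);
            // Dispose the response and reader so the connection is released.
            using (WebResponse myrep = myreq.GetResponse())
            using (StreamReader reader = new StreamReader(myrep.GetResponseStream(), Encoding.GetEncoding("gb2312")))
            {
                // The body is decoded as GB2312; use another encoding if the page is not GB2312.
                pageinfo = reader.ReadToEnd();
            }
        }
        catch
        {
            // Any failure (invalid URL, network error, HTTP error status) yields an empty string.
            pageinfo = "";
        }
        return pageinfo;
    }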


    With the method above, a program can retrieve the HTML source of the page at a given URL.
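
The listing above hard-codes gb2312, which garbles pages served in any other charset. A minimal sketch of one alternative follows, assuming the charset name is taken from the HTTP response (for example from `HttpWebResponse.CharacterSet`). The `CharsetHelper` class and `ResolveEncoding` method are hypothetical names introduced for illustration; they are not part of the original post.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper (not from the original post): maps a charset name,
    // such as the value of HttpWebResponse.CharacterSet, to an Encoding.
    // Returns the fallback when the name is missing or not supported.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return fallback;
        }
        try
        {
            // Some servers quote the value, e.g. charset="utf-8".
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return fallback;
        }
    }
}
```

To use it, the response would be cast to `HttpWebResponse` and the reader constructed with `CharsetHelper.ResolveEncoding(response.CharacterSet, Encoding.GetEncoding("gb2312"))` instead of a fixed encoding. This sketch does not inspect a charset declared inside the HTML itself (for example in a `<meta>` tag).
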
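    However, some websites block crawlers. In that case the request needs to imitate a browser by sending browser-like headers, as in the following code: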

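    // Requires: using System.IO; using System.Net; using System.Text;
    protected string GetPageHtml(string url)
    {
        string pageinfo;
        try
        {
            HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(url);
            // Send the Accept and User-Agent headers that a browser would send.
            myReq.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
            myReq.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)";
            // Dispose the response, stream and reader so the connection is released.
            using (HttpWebResponse myRep = (HttpWebResponse)myReq.GetResponse())
            using (Stream myStream = myRep.GetResponseStream())
            using (StreamReader sr = new StreamReader(myStream, Encoding.Default))
            {
                // Encoding.Default is the platform default encoding; it must match the page's charset.
                pageinfo = sr.ReadToEnd();
            }
        }
        catch
        {
            // Any failure (invalid URL, network error, HTTP error status) yields an empty string.
            pageinfo = "";
        }
        return pageinfo;
    }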
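
In .NET 6 and later, `WebRequest` and `HttpWebRequest` are marked obsolete in favor of `HttpClient`. The following is a rough sketch of the same browser-imitating request written with `HttpClient`; it is a swapped-in technique, not the original author's code. The `PageFetcher` class, its method names, and the shortened `Accept` value are assumptions made for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    // One shared HttpClient is reused for all requests.
    private static readonly HttpClient Client = CreateClient();

    // Hypothetical factory: builds a client that sends browser-like headers.
    public static HttpClient CreateClient()
    {
        var client = new HttpClient();
        // Same User-Agent string as in the listing above; substitute a current one as needed.
        client.DefaultRequestHeaders.TryAddWithoutValidation(
            "User-Agent",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)");
        // Assumed, shortened Accept value; the listing above sends a longer list.
        client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html, */*");
        return client;
    }

    // Returns the page body, or an empty string on an HTTP error or timeout.
    public static async Task<string> GetPageHtmlAsync(string url)
    {
        try
        {
            return await Client.GetStringAsync(url);
        }
        catch (HttpRequestException)
        {
            return "";
        }
        catch (TaskCanceledException)
        {
            return "";
        }
    }
}
```

A caller would write `string html = await PageFetcher.GetPageHtmlAsync(url);`. Unlike the listing above, this version catches only request failures and timeouts, so other errors, such as a malformed URL, surface as exceptions.
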
  • Original article: https://www.cnblogs.com/yujinchao88/p/3855051.html