• Practical Guide | Python from Beginner to Writing POCs: Web Crawlers


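    "Python from Beginner to Writing POCs" is a complete series of original tutorials by the i春秋 forum author 「Exp1ore」. If you want to learn Python systematically, don't miss it!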

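    Python from Beginner to Writing POCs: Web Crawlers

    When it comes to web crawlers, a great many of them are written in Python.

    Commonly used modules include re, BeautifulSoup, pyspider, and pyquery. You will also need an HTTP library such as requests, urllib, or urllib2, or hackhttp, which was developed by the company 四叶草.

    PS: BeautifulSoup, requests, and pyspider are third-party libraries, so they must be installed first.

    The BeautifulSoup module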

    <html>
    <head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body>
    </html>

    Creating an object with BeautifulSoup

    >>> from bs4 import BeautifulSoup
    >>> html = """
    ... <html>
    ... <head>
    ... <title>The Dormouse's story</title>
    ... </head>
    ... <body>
    ... <p class="title"><b>The Dormouse's story</b></p>
    ...
    ... <p class="story">Once upon a time there were three little sisters; and their names were
    ... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    ... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    ... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    ... and they lived at the bottom of a well.</p>
    ... <p class="story">...</p>
    ... </body>
    ... </html>
    ... """
    >>>
    >>> soup = BeautifulSoup(html)
    C:\Python27\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
    The code that caused this warning is on line 1 of the file <stdin>. To get rid of this warning, change code that looks like this:
     BeautifulSoup(YOUR_MARKUP})
    to this:
     BeautifulSoup(YOUR_MARKUP, "html.parser")
      markup_type=markup_type))

    Ways to navigate the parsed structure

    >>> soup.title
    <title>The Dormouse's story</title>
    >>> soup.title.name
    u'title'
    >>> soup.p
    <p class="title"><b>The Dormouse's story</b></p>
    >>> soup.p['class']
    [u'title']
    >>> soup.head
    <head>
    <title>The Dormouse's story</title>
    </head>
    >>> soup.p.attrs
    {u'class': [u'title']}

    For a crawler, suppose you want to collect every link. Inspecting the HTML shows that the links all live in <a> tags, so a simple loop does the job.

    >>> for link in soup.find_all('a'):
    ...     print(link.get('href'))
    ...
    http://example.com/elsie
    http://example.com/lacie
    http://example.com/tillie
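    When BeautifulSoup is not installed, the same "loop over every <a> tag" idea can be sketched with Python 3's standard-library html.parser instead. This swaps in a different parser from the one used above; LinkCollector and extract_links are hypothetical names chosen for illustration, and SAMPLE_HTML is a shortened copy of the sample document.

```python
from html.parser import HTMLParser

# Shortened copy of the sample document used above.
SAMPLE_HTML = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Feed the markup to the parser and return the collected hrefs in order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


for link in extract_links(SAMPLE_HTML):
    print(link)
```

    On real pages BeautifulSoup is usually more forgiving of broken markup; this sketch only covers the simple case of well-formed <a href="..."> tags.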

    What if I want to scrape all of the text instead?

    Then use the following call:

    >>> print soup.get_text()
    The Dormouse's story
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
    Elsie,
    Lacie and
    Tillie;
    and they lived at the bottom of a well.
    ...

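    Next, let's write a simple crawler that calls 站长帮手 (i.links.cn) and turns it into a subdomain lookup tool.

    First, capture the request and analyze it. Burp is used here.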

    POST /subdomain/ HTTP/1.1
    Host: i.links.cn
    Content-Length: 34
    Cache-Control: max-age=0
    Origin: http://i.links.cn
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
    Content-Type: application/x-www-form-urlencoded
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
    Referer: http://i.links.cn/subdomain/
    Accept-Language: zh-CN,zh;q=0.8
    Cookie: ASPSESSIONIDCCRSRCQS=NFFNBODCNABACIGOEODDFKLG; __guid=12224748.1912086146849820700.1503481265395.9385; UM_distinctid=15e0e7780082dd-0f197d4291ddaa-5d4e211f-1fa400-15e0e7780091e6; linkhelper=sameipb3=1&sameipb4=1&sameipb2=1; serverurl=; ASPSESSIONIDQARRSARR=DNCFMEADGBBFOICPGKMFCNPK; safedog-flow-item=; monitor_count=2; umid=umid=f449b116e07d1d4f3d2dc5352b7fede9&querytime=2017%2D8%2D24+14%3A09%3A09; CNZZDATA30012337=cnzz_eid%3D226371595-1503478989-%26ntime%3D1503554751
    Connection: close

    domain=ichunqiu.com&b2=1&b3=1&b4=1

    This shows that it is a POST request, and the POST data it submits is

    domain=ichunqiu.com&b2=1&b3=1&b4=1
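    To see what that body contains, it can be taken apart and rebuilt with Python 3's urllib.parse. This is only a sketch: the field names and values come from the captured request, while the variable names are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

# Form body exactly as it appears in the captured request.
captured_body = "domain=ichunqiu.com&b2=1&b3=1&b4=1"

# Split the body into its fields; parse_qs returns each value as a list.
fields = parse_qs(captured_body)
print(fields)

# Rebuild the body from a dict. urlencode also percent-escapes unsafe characters,
# which matters once the domain comes from user input.
rebuilt_body = urlencode({"domain": "ichunqiu.com", "b2": 1, "b3": 1, "b4": 1})
print(rebuilt_body)

# The byte length should agree with the Content-Length header of the capture (34).
body_length = len(rebuilt_body.encode("ascii"))
print(body_length)
```

    Only the domain field changes from one query to the next; b2, b3, and b4 are sent as 1 in the captured request.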

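    So we use the requests module (Python 2):

    # coding=utf-8
    import requests

    url = 'http://i.links.cn/subdomain/'
    payload = 'domain=ichunqiu.com&b2=1&b3=1&b4=1'
    r = requests.post(url=url, data=payload)
    # r.text is the body decoded to unicode using the encoding requests guessed
    print r.text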

    The result is an error:

    Traceback (most recent call last):
      File "demo.py", line 8, in <module>
        print r.text
    UnicodeEncodeError: 'gbk' codec can't encode character u'\xcf' in position 386: illegal multibyte sequence

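    So we need to change the encoding. The u'\xcf' in the error suggests that requests decoded the body as ISO-8859-1, so encoding r.text back to ISO-8859-1 recovers the raw bytes, which the GBK console can print:

    import requests

    url = 'http://i.links.cn/subdomain/'
    payload = ("domain=ichunqiu.com&b2=1&b3=1&b4=1")
    r = requests.post(url=url, params=payload)
    # undo the ISO-8859-1 decoding to get the original byte string back
    con = r.text.encode('ISO-8859-1')
    print con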

    After that the page prints correctly, and we can bring in re or BeautifulSoup.
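    Why re-encoding works can be shown without touching the network. The sketch below is Python 3 and assumes the site serves GBK-encoded bytes (the original error message only hints at this); original_text is a made-up fragment, not the real response.

```python
# Hypothetical page fragment; the real response body is not reproduced here.
original_text = "子域名查询"

# What the server would send if the page is GBK-encoded (an assumption).
raw_bytes = original_text.encode("gbk")

# What the decoded text looks like when those bytes are read as ISO-8859-1.
mis_decoded = raw_bytes.decode("iso-8859-1")

# Encoding back to ISO-8859-1 returns the original bytes unchanged, because
# ISO-8859-1 maps every byte value 0x00-0xFF to exactly one code point.
recovered_bytes = mis_decoded.encode("iso-8859-1")
assert recovered_bytes == raw_bytes

# Decoding those bytes with the real charset gives readable text again.
print(recovered_bytes.decode("gbk"))
```

    With requests, another option is to assign the correct charset to r.encoding before reading r.text, which avoids the round trip.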

    Here we use re, which is short and clear. Looking at the page source, the values we want sit inside markup like this:

    value="http://ichunqiu.com"/><input

    import re
    a = re.compile('value="(.+?)"><input')
    result = a.findall(con)
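    How the pattern behaves can be checked offline. In this Python 3 sketch the markup and host names in sample_page are hypothetical and only shaped the way the pattern expects; the real page may differ.

```python
import re

# Hypothetical markup; each value attribute is followed directly by "><input".
sample_page = (
    '<input type="checkbox" value="http://bbs.ichunqiu.com"><input type="hidden">'
    '<input type="checkbox" value="http://www.ichunqiu.com"><input type="hidden">'
)

# (.+?) is lazy: it stops at the first place where "><input follows.
pattern = re.compile(r'value="(.+?)"><input')

matches = pattern.findall(sample_page)
print(matches)

# The pattern needs "> right before <input, so a self-closing "/> does not match.
self_closing = pattern.findall('value="http://ichunqiu.com"/><input')
print(self_closing)
```

    If the live markup closes the tag with "/>" as in the snippet above, the pattern would need an optional slash, for example 'value="(.+?)"/?><input'. Which form the site actually returns has to be checked against the real response.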

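    Then join the matches into a single string, one result per line:

    # named output rather than list, to avoid shadowing the built-in
    output = '\n'.join(result)
    print output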

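    Let's keep improving the code. Editing the source for every lookup is a bit of a hassle, isn't it?

    So we use the sys library to read the domain from the command-line arguments, and we build the payload with the format function: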

    payload = ("domain={domain}&b2=1&b3=1&b4=1".format(domain=domain))

    Then define a get function that takes domain as its parameter.

    # coding=utf-8
    import requests
    import re
    import sys

    def get(domain):
        url = 'http://i.links.cn/subdomain/'
        payload = ("domain={domain}&b2=1&b3=1&b4=1".format(domain=domain))
        r = requests.post(url=url, params=payload)
        # undo the ISO-8859-1 decoding to get the original byte string back
        con = r.text.encode('ISO-8859-1')
        a = re.compile('value="(.+?)"><input')
        result = a.findall(con)
        output = '\n'.join(result)
        print output

    if __name__ == '__main__':
        # everything after the script name is treated as the domain
        command = sys.argv[1:]
        f = "".join(command)
        get(f)

    That's it. Let's try it out.
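    For readers on Python 3, the same flow can be sketched with the standard library only. This is an illustration under assumptions, not the author's script: the function names are hypothetical, the "gbk" charset is a guess based on the error seen earlier, and the fields are sent in the request body (as in the Burp capture) rather than through params. Whether the site still responds in this shape is not verified here.

```python
import re
import urllib.request
from urllib.parse import urlencode

LOOKUP_URL = "http://i.links.cn/subdomain/"  # endpoint from the captured request
RESULT_PATTERN = re.compile(r'value="(.+?)"><input')  # pattern used in the article


def build_payload(domain):
    """Return the form body as bytes, with the same fields as the capture."""
    return urlencode({"domain": domain, "b2": 1, "b3": 1, "b4": 1}).encode("ascii")


def extract_subdomains(page):
    """Apply the article's regex to the decoded page text."""
    return RESULT_PATTERN.findall(page)


def lookup(domain, charset="gbk"):
    """POST the form and return the matches; charset="gbk" is an assumption."""
    request = urllib.request.Request(LOOKUP_URL, data=build_payload(domain))
    with urllib.request.urlopen(request, timeout=10) as response:
        page = response.read().decode(charset, errors="replace")
    return extract_subdomains(page)


def main(argv):
    """Expect exactly one argument after the script name: the domain to query."""
    if len(argv) != 2:
        print("usage: python subdomain.py <domain>")
        return 1
    print("\n".join(lookup(argv[1])))
    return 0
```

    To run it as a script, add "import sys" and call sys.exit(main(sys.argv)) at the bottom, then invoke it with the domain as the only argument. Keeping build_payload and extract_subdomains separate from lookup makes them easy to check without network access.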

    That's all for today. Did everything make sense? If you enjoyed this article, remember to leave a like at the end of the post.

  • Original article: https://www.cnblogs.com/ichunqiu/p/13272641.html