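Practical Guide | Python from Beginner to Writing POCs: Web Crawlers

"Python from Beginner to Writing POCs" is a complete series of original tutorials by Exp1ore, a writer on the i春秋 forum. If you want to learn Python systematically, don't miss it!

Python from Beginner to Writing POCs: Web Crawlers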

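When it comes to crawlers, a great many of them are written in Python.

Commonly used modules include re, BeautifulSoup, pyspider, and pyquery. You will also need an HTTP library such as requests, urllib, or urllib2 (Python 2 only), or hackhttp, which was developed by the company 四叶草.

PS: BeautifulSoup, requests, and pyspider are third-party libraries, so they must be installed before use.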

The BeautifulSoup module

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

Creating an object with BeautifulSoup

>>> from bs4 import BeautifulSoup
>>> html = """
... <html>
... <head>
... <title>The Dormouse's story</title>
... </head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... <p class="story">...</p>
... </body>
... </html>
... """
>>>
>>> soup = BeautifulSoup(html)
C:\Python27\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <stdin>. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))

As the warning itself says, passing the parser explicitly, as in BeautifulSoup(html, "html.parser"), makes it go away.

Ways to navigate the structured data

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
u'title'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
[u'title']
>>> soup.head
<head>
<title>The Dormouse's story</title>
</head>
>>> soup.p.attrs
{u'class': [u'title']}

For a crawler, say we want to collect every link on the page. Looking at the HTML, all the links live in <a> tags, so a simple loop over them does the job.

>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
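The same idea can be sketched without third-party packages. The example below is only an illustration, assuming Python 3 and the standard library: the names LinkCollector and extract_links are hypothetical, and the sample markup is a shortened variant of the one above with one relative link added to show why a crawler usually resolves hrefs against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names and passes attrs as (name, value) pairs.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all resolved link targets found in the given HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical sample: one absolute link, one relative link, one <a> without href.
sample = (
    '<p class="story">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>, '
    '<a href="/lacie" id="link2">Lacie</a> and '
    '<a id="link3">no href here</a>'
    "</p>"
)
print(extract_links(sample, "http://example.com/"))
```

Anchors without an href are skipped rather than reported as None, which is one difference from the link.get('href') loop above.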

What if I want to scrape all the text instead?

Then use the following command:

>>> print soup.get_text()
The Dormouse's story


The Dormouse's story

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Next, let's write a simple crawler: a subdomain lookup tool built on the 站长帮手 service (i.links.cn).

First, capture a request and analyze it. Burp is used here.

POST /subdomain/ HTTP/1.1
Host: i.links.cn
Content-Length: 34
Cache-Control: max-age=0
Origin: http://i.links.cn
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://i.links.cn/subdomain/
Accept-Language: zh-CN,zh;q=0.8
Cookie: ASPSESSIONIDCCRSRCQS=NFFNBODCNABACIGOEODDFKLG; __guid=12224748.1912086146849820700.1503481265395.9385; UM_distinctid=15e0e7780082dd-0f197d4291ddaa-5d4e211f-1fa400-15e0e7780091e6; linkhelper=sameipb3=1&sameipb4=1&sameipb2=1; serverurl=; ASPSESSIONIDQARRSARR=DNCFMEADGBBFOICPGKMFCNPK; safedog-flow-item=; monitor_count=2; umid=umid=f449b116e07d1d4f3d2dc5352b7fede9&querytime=2017%2D8%2D24+14%3A09%3A09; CNZZDATA30012337=cnzz_eid%3D226371595-1503478989-%26ntime%3D1503554751
Connection: close

domain=ichunqiu.com&b2=1&b3=1&b4=1

This shows that it is a POST request, and that the submitted POST data is

domain=ichunqiu.com&b2=1&b3=1&b4=1
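To make the structure of that body explicit, here is a small sketch, assuming Python 3 and only the standard library. It splits the captured string into its form fields and rebuilds it for another domain. The replacement domain example.com is purely illustrative.

```python
from urllib.parse import parse_qsl, urlencode

# The body exactly as it appears in the captured request.
captured_body = "domain=ichunqiu.com&b2=1&b3=1&b4=1"

# Its length matches the Content-Length: 34 header in the capture.
print(len(captured_body))

# Split the form-encoded body into name/value pairs.
fields = dict(parse_qsl(captured_body))
print(fields)

# Swap in a different domain (illustrative value) and re-encode.
fields["domain"] = "example.com"
rebuilt_body = urlencode(fields)
print(rebuilt_body)
```

Only the domain field varies between queries. The b2, b3, and b4 fields are copied from the capture as-is, since the article does not say what they control.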

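So we use the requests module (the code in this article is Python 2):

# coding: utf-8
import requests

url = 'http://i.links.cn/subdomain/'
payload = 'domain=ichunqiu.com&b2=1&b3=1&b4=1'
# Declare the body as form data, as the captured request does
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
r = requests.post(url=url, data=payload, headers=headers)
print r.text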

This raised an error:

Traceback (most recent call last):
  File "demo.py", line 8, in <module>
    print r.text
UnicodeEncodeError: 'gbk' codec can't encode character u'\xcf' in position 386: illegal multibyte sequence

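So we need to fix the encoding. When a text response declares no charset, requests decodes r.text as ISO-8859-1, which is presumably what happened here. Encoding the text back with ISO-8859-1 recovers the raw bytes the server sent (the same bytes are also available as r.content):

import requests

url = 'http://i.links.cn/subdomain/'
payload = "domain=ichunqiu.com&b2=1&b3=1&b4=1"
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
# data= puts the fields in the request body, as in the captured request
# (params= would append them to the URL as a query string instead)
r = requests.post(url=url, data=payload, headers=headers)
con = r.text.encode('ISO-8859-1')
print con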
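Why the ISO-8859-1 round trip is lossless can be shown offline. The sketch below is written for Python 3 and assumes, for illustration only, that the page is GBK-encoded. The article does not state the page's actual encoding, and the sample text is made up.

```python
# Assumption for illustration: the server sends GBK bytes and declares no charset.
original_text = "子域名查询"
raw_bytes = original_text.encode("gbk")

# What a client holds after decoding those bytes as ISO-8859-1 by default.
mojibake = raw_bytes.decode("iso-8859-1")

# ISO-8859-1 maps every byte value to exactly one code point,
# so encoding the garbled text back reproduces the original bytes.
recovered_bytes = mojibake.encode("iso-8859-1")
print(recovered_bytes == raw_bytes)

# With the raw bytes back in hand, decoding with the real charset works.
print(recovered_bytes.decode("gbk") == original_text)
```

The round trip only undoes a wrong ISO-8859-1 decode. It does not detect the real charset, which still has to be known or guessed.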

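After that the page prints correctly, and we can bring in re or BeautifulSoup.

Here we use re, which is simple and direct. Looking at the page source, the values we want sit inside markup like this:

value="http://ichunqiu.com"/><input

import re

# \s*/? tolerates an optional "/" before ">", as in the sample line above
a = re.compile(r'value="(.+?)"\s*/?><input')
result = a.findall(con)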

findall returns a list, so join it with newlines to print one result per line:

output = '\n'.join(result)
print output
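The extraction step can be tried offline against a sample string. In this sketch (Python 3) the markup in sample_page is hypothetical, because the article shows only one fragment of the real page. The pattern accepts an optional "/" before ">" so that it matches that fragment as printed.

```python
import re

# Hypothetical markup; only the first value="..." fragment comes from the article.
sample_page = (
    '<input type="checkbox" value="http://ichunqiu.com"/><input '
    'type="checkbox" value="http://www.example.com"><input type="submit">'
)

# Lazy capture of everything between value=" and the closing quote
# that is immediately followed by an optional "/" and "><input".
pattern = re.compile(r'value="(.+?)"\s*/?><input')
subdomains = pattern.findall(sample_page)

# findall returns a list; join it to print one entry per line.
print("\n".join(subdomains))
```

A pattern tied to neighbouring tags like this stops matching as soon as the page layout changes, so an HTML parser such as BeautifulSoup is the sturdier choice for anything long-lived.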

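Let's keep improving the code. Editing the source for every query is a bit of a hassle, isn't it?

Here we use the sys library to read the domain from the command-line arguments (sys.argv), adjust the code a little, and fill the domain into the payload with the format function.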

payload = ("domain={domain}&b2=1&b3=1&b4=1".format(domain=domain))

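Then define a get function that takes domain as its parameter.

# coding: utf-8
import re
import sys

import requests


def get(domain):
    url = 'http://i.links.cn/subdomain/'
    payload = "domain={domain}&b2=1&b3=1&b4=1".format(domain=domain)
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    # data= puts the fields in the request body, as in the captured request
    r = requests.post(url=url, data=payload, headers=headers)
    # Undo the default ISO-8859-1 decoding to get the raw page bytes
    con = r.text.encode('ISO-8859-1')
    # \s*/? tolerates an optional "/" before ">"
    a = re.compile(r'value="(.+?)"\s*/?><input')
    result = a.findall(con)
    print '\n'.join(result)


if __name__ == '__main__':
    command = sys.argv[1:]
    f = "".join(command)
    get(f)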

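That's it. Let's try it out, for example with python demo.py ichunqiu.com, assuming the script is saved as demo.py as in the traceback above.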
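For readers on Python 3, the tool might be sketched roughly as follows using only the standard library. This is an untested port, not the author's code: the function names build_payload, parse_subdomains, lookup, and main are hypothetical, the pattern's tolerance for "/>" is an assumption, and whether the service still answers this way is not something the article confirms.

```python
import argparse
import re
import urllib.request
from urllib.parse import urlencode

LOOKUP_URL = "http://i.links.cn/subdomain/"  # endpoint from the captured request
VALUE_PATTERN = re.compile(rb'value="(.+?)"\s*/?><input')


def build_payload(domain):
    """Return the form body as bytes, with the same fields as the capture."""
    fields = {"domain": domain, "b2": 1, "b3": 1, "b4": 1}
    return urlencode(fields).encode("ascii")


def parse_subdomains(page_bytes):
    """Extract the matched values from the raw response bytes."""
    matches = VALUE_PATTERN.findall(page_bytes)
    return [match.decode("ascii", "replace") for match in matches]


def lookup(domain, timeout=10):
    """POST the form and return the values found in the response."""
    request = urllib.request.Request(
        LOOKUP_URL,
        data=build_payload(domain),  # passing data makes this a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_subdomains(response.read())


def main(argv=None):
    """Command-line entry point: one positional argument, the domain."""
    parser = argparse.ArgumentParser(description="Subdomain lookup sketch")
    parser.add_argument("domain", help="domain to query, e.g. ichunqiu.com")
    args = parser.parse_args(argv)
    print("\n".join(lookup(args.domain)))


# Offline checks of the two pure helpers; no network access is needed here.
print(build_payload("ichunqiu.com"))
print(parse_subdomains(b'value="http://ichunqiu.com"/><input'))
```

Working on the raw bytes from response.read() sidesteps the ISO-8859-1 round trip altogether. To use the sketch as a script, call main() under the usual if __name__ == "__main__": guard.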

That's all for today. Did it all make sense? If you enjoyed this article, remember to give it a like!

Original article: https://www.cnblogs.com/ichunqiu/p/13272641.html