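The "Python from Beginner to Writing POCs" series is a complete set of original tutorials by "Exp1ore", an author on the i春秋 forum. If you want to learn Python systematically, don't miss it!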
Python from Beginner to Writing POCs: Crawlers
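When it comes to crawlers, a great many of them are written in Python.
Useful modules include re, BeautifulSoup, pyspider, and pyquery. You will also need requests, urllib, and urllib2, and there is hackhttp as well, developed by the company 四叶草.
PS: BeautifulSoup, requests, and pyspider are third-party libraries, so they have to be installed first.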
The BeautifulSoup module
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
Create an object with BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> html = """
... <html>
... <head>
... <title>The Dormouse's story</title>
... </head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... <p class="story">...</p>
... </body>
... </html>
... """
>>>
>>> soup = BeautifulSoup(html)
C:\Python27\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <stdin>. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))
Ways to navigate the structured data:
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
u'title'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
[u'title']
>>> soup.head
<head>
<title>The Dormouse's story</title>
</head>
>>> soup.p.attrs
{u'class': [u'title']}
For a crawler, suppose you want to collect every link. Inspecting the HTML shows that they all sit in <a> tags, so a simple loop solves the problem.
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
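If BeautifulSoup is not installed, the same "loop over the <a> tags" idea can be sketched with the standard library alone. This is only an illustration under assumptions: it uses Python 3 syntax (the session above is Python 2), and the LinkCollector class and the sample string are names made up for this sketch, not part of the original tutorial.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# A shortened copy of the sample markup used above
sample = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    '</p>'
)

collector = LinkCollector()
collector.feed(sample)
for link in collector.links:
    print(link)
```

BeautifulSoup is still the more convenient choice for real pages; this version only shows what the loop is doing underneath.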
And what if I want to scrape all of the text instead?
That calls for the following command:
>>> print soup.get_text()
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
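Next, let's write a simple crawler that calls 站长帮手 (i.links.cn) to build a subdomain lookup tool.
First, capture the request and analyze it. Burp is used here.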
POST /subdomain/ HTTP/1.1
Host: i.links.cn
Content-Length: 34
Cache-Control: max-age=0
Origin: http://i.links.cn
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://i.links.cn/subdomain/
Accept-Language: zh-CN,zh;q=0.8
Cookie: ASPSESSIONIDCCRSRCQS=NFFNBODCNABACIGOEODDFKLG; __guid=12224748.1912086146849820700.1503481265395.9385; UM_distinctid=15e0e7780082dd-0f197d4291ddaa-5d4e211f-1fa400-15e0e7780091e6; linkhelper=sameipb3=1&sameipb4=1&sameipb2=1; serverurl=; ASPSESSIONIDQARRSARR=DNCFMEADGBBFOICPGKMFCNPK; safedog-flow-item=; monitor_count=2; umid=umid=f449b116e07d1d4f3d2dc5352b7fede9&querytime=2017%2D8%2D24+14%3A09%3A09; CNZZDATA30012337=cnzz_eid%3D226371595-1503478989-%26ntime%3D1503554751
Connection: close

domain=ichunqiu.com&b2=1&b3=1&b4=1
This tells us it is a POST request, and the POST data it submits is:
domain=ichunqiu.com&b2=1&b3=1&b4=1
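To make the structure of that POST body explicit, here is a small sketch that splits it into fields and rebuilds it for another domain. It uses only the standard library with Python 3 syntax; example.com and the variable names are placeholders chosen for illustration, and nothing is sent over the network.

```python
from urllib.parse import parse_qsl, urlencode

# The body captured by Burp, exactly as shown above
raw_body = "domain=ichunqiu.com&b2=1&b3=1&b4=1"

# Its length matches the Content-Length header in the capture
print(len(raw_body))

# Split the body into ordered (field, value) pairs
fields = parse_qsl(raw_body)
print(fields)

# Rebuild the same kind of body from a dict, swapping in a placeholder domain
form = {"domain": "example.com", "b2": "1", "b3": "1", "b4": "1"}
new_body = urlencode(form)
print(new_body)
```

requests can also take a dict as the data argument and form-encode it itself, which avoids building the string by hand.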
So we use the requests module:
# coding=utf-8

import requests

url = 'http://i.links.cn/subdomain/'
payload = 'domain=ichunqiu.com&b2=1&b3=1&b4=1'
r = requests.post(url=url, data=payload)
print r.text
Running it raises an error:
Traceback (most recent call last):
  File "demo.py", line 8, in <module>
    print r.text
UnicodeEncodeError: 'gbk' codec can't encode character u'\xcf' in position 386: illegal multibyte sequence
So we need to change the encoding:
import requests

url = 'http://i.links.cn/subdomain/'
payload = "domain=ichunqiu.com&b2=1&b3=1&b4=1"
r = requests.post(url=url, params=payload)
# encode the decoded text as ISO-8859-1 to get the raw response bytes back
con = r.text.encode('ISO-8859-1')
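The encode('ISO-8859-1') step can look odd, so here is a minimal sketch of why it helps. It rests on an assumption the original does not state: that the page is served as GBK bytes without a declared charset, so the text was decoded as ISO-8859-1 by default. The sample string is made up, and the syntax is Python 3.

```python
# Assumption: the server sends GBK-encoded bytes with no charset declared
original = "子域名查询"
raw_bytes = original.encode("gbk")

# Decoding those bytes as ISO-8859-1 never fails, but the result is mojibake
mojibake = raw_bytes.decode("iso-8859-1")

# ISO-8859-1 maps every byte to exactly one code point,
# so encoding with it reverses the wrong decode byte for byte
recovered = mojibake.encode("iso-8859-1")
print(recovered == raw_bytes)

# With the raw bytes back, decoding with the real encoding gives the text
print(recovered.decode("gbk") == original)
```

In the Python 2 code above, the recovered bytes are printed directly, and a GBK console can display them as they are.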
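With that change the content prints without errors, and we can move on to parsing it with re or BeautifulSoup.
re is used here because it is simple and direct. Viewing the page source shows that the results appear in markup like this: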
value="http://ichunqiu.com"/><input
import re

a = re.compile('value="(.+?)"><input')
result = a.findall(con)
findall returns a list, so join it into one string with one result per line:
output = '\n'.join(result)
print output
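To see the pattern and the join working together without hitting the site, here is a self-contained sketch. The markup in page is hypothetical, shaped only so that it fits the pattern above; the real page may look different, and the example.com hosts are placeholders, not query results. The syntax is Python 3.

```python
import re

# Hypothetical markup shaped to fit the pattern; the real page may differ
page = (
    '<input type="checkbox" value="http://www.example.com"><input '
    'type="checkbox" value="http://bbs.example.com"><input type="submit">'
)

# The non-greedy group stops at the first closing quote followed by "><input"
pattern = re.compile(r'value="(.+?)"><input')
result = pattern.findall(page)

# One match per line
print("\n".join(result))
```

On Python 3 a str pattern cannot be applied to bytes, so a bytes value such as con would have to be decoded first, or the pattern written as a bytes literal.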
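Let's keep improving the code. Editing the source every time we want to query a different domain is a bit of a hassle, isn't it?
So we use the sys library to read the domain from the command-line arguments, adjust the code accordingly, and build the payload with the format function: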
payload = ("domain={domain}&b2=1&b3=1&b4=1".format(domain=domain))
Then define a get function that takes domain as its argument.
# coding=utf-8
import re
import sys

import requests


def get(domain):
    url = 'http://i.links.cn/subdomain/'
    payload = "domain={domain}&b2=1&b3=1&b4=1".format(domain=domain)
    r = requests.post(url=url, params=payload)
    # encode the decoded text as ISO-8859-1 to get the raw response bytes back
    con = r.text.encode('ISO-8859-1')
    a = re.compile('value="(.+?)"><input')
    result = a.findall(con)
    # one result per line
    output = '\n'.join(result)
    print output


if __name__ == '__main__':
    # everything after the script name is treated as the domain
    command = sys.argv[1:]
    f = "".join(command)
    get(f)
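The script above joins every command-line argument into one string, so a missing or extra argument goes unnoticed. As a possible refinement, the sketch below checks the argument count before building the payload. build_payload and main are hypothetical helper names introduced here, demo.py is the script name from the traceback, and the code uses Python 3 syntax and makes no network request.

```python
def build_payload(domain):
    """Format the form body for one domain (hypothetical helper)."""
    return "domain={domain}&b2=1&b3=1&b4=1".format(domain=domain)


def main(argv):
    # argv[0] is the script name; expect exactly one domain after it
    if len(argv) != 2:
        print("usage: python demo.py <domain>")
        return 1
    print(build_payload(argv[1]))
    return 0


# In a real script this would be called as main(sys.argv);
# fixed lists are used here so the sketch runs without arguments
ok = main(["demo.py", "ichunqiu.com"])
missing = main(["demo.py"])
```

Returning an exit code instead of printing only makes the tool easier to chain with other commands.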
That's it. Let's try it out.
That is everything for today. Did it all make sense? If you enjoyed this article, remember to give it a like at the end!