Python Web Crawler Development Tutorial

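Python is extremely popular right now and is widely used in web crawling, artificial intelligence, big data, and other fields. In this article I introduce the basics of writing crawlers in Python and show how to use the most common libraries. I hope you find it helpful.
The idea behind a crawler is simple. It breaks down into roughly these steps:
- Send a network request
- Receive the web page
- Parse the page to extract data
For sending requests, the common choices are the standard library's urllib and the third-party requests library. For parsing pages, a common choice is BeautifulSoup. The author of requests also wrote another handy library, requests-html, which combines requesting and parsing in one package and is very convenient for small crawlers. There are also dedicated crawler frameworks, the best known of which is scrapy. This article gives a brief introduction to each of these libraries; a separate article will cover scrapy in detail.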
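The three steps above can be sketched with nothing but the standard library. This is a minimal illustration, not the approach used in the rest of this article: the TitleLinkParser class, the fetch and parse helpers, and the SAMPLE_HTML string are all made up for the example, and the parser runs on the inline string so that no network access is needed.

```python
from html.parser import HTMLParser
from urllib import request


class TitleLinkParser(HTMLParser):
    """Collect (text, href) pairs for <a> tags whose class includes "title"."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> being read, None when outside one
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'title' in (attrs.get('class') or '').split():
            self._href = attrs.get('href') or ''
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def fetch(url, timeout=10):
    """Steps 1 and 2: send the request and return the page as text."""
    with request.urlopen(url, timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')


def parse(html):
    """Step 3: extract the data we care about from the HTML."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical HTML standing in for a downloaded page.
SAMPLE_HTML = '''
<ul class="note-list">
  <li><div class="content"><a class="title" href="/p/abc">First post</a></div></li>
  <li><div class="content"><a class="title" href="/p/def">Second post</a></div></li>
</ul>
'''

links = parse(SAMPLE_HTML)
print(links)
```

To run it against a live page, pass the result of fetch(url) to parse instead of SAMPLE_HTML.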

The standard library: urllib

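Let's start with the standard library's urllib. Its advantage is that it ships with Python, so no third-party package needs to be installed. Its drawback is that it is a fairly low-level library and is cumbersome to use. Below is a simple example of sending a request with urllib; a quick skim is enough. Notice that even a simple request through a proxy requires creating several objects (an opener, a Request, a ProxyHandler), which is tedious.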
```
import urllib.request as request
import requests  # third-party; only used by the requests example further down

proxies = {
    'https': 'https://127.0.0.1:1080',
    'http': 'http://127.0.0.1:1080'
}

headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}


print('--------------Using urllib--------------')
url = 'http://www.google.com'
opener = request.build_opener(request.ProxyHandler(proxies))
request.install_opener(opener)
req = request.Request(url, headers=headers)
response = request.urlopen(req)
print(response.read().decode())
```
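Because urllib is low-level, it can help to look at a Request object before sending it. The sketch below is only an illustration and sends nothing over the network; www.example.com is a placeholder URL. It shows that urllib picks the HTTP method based on whether a body is attached, that it normalizes header names, and that an opener identifies itself as Python-urllib unless a user-agent header is supplied, which some sites may reject.

```python
from urllib import request

headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}

# Building a Request object does not contact the server.
get_req = request.Request('http://www.example.com/path', headers=headers)
post_req = request.Request('http://www.example.com/path', data=b'name=yitian')

print(get_req.get_method())    # GET, because no body is attached
print(post_req.get_method())   # POST, because a body is attached
print(get_req.host)

# urllib stores header names capitalized, so look them up as 'User-agent'.
print(get_req.get_header('User-agent'))

# Without a custom header, an opener falls back to its default user agent.
default_agent = dict(request.build_opener().addheaders)['User-agent']
print(default_agent)
```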
requests

requests is one of Kenneth Reitz's best-known works. Its main strengths are that it is extremely simple and pleasant to use. First, install requests.
```
pip install requests
```
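Here is a simple example that does the same thing as the urllib code above, reusing the same headers and proxies dictionaries, with much less code that is also easier to read.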
```
print('--------------Using requests--------------')
response = requests.get('https://www.google.com', headers=headers, proxies=proxies)
response.encoding = 'utf8'
print(response.text)
```
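requests also makes it easy to send form data, for example to simulate a user login. The returned Response object carries a lot of useful information, including the status code, headers, raw, and cookies.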
```
from pprint import pprint

import requests

data = {
    'name': 'yitian',
    'age': 22,
    'friends': ['zhang3', 'li4']
}
response = requests.post('http://httpbin.org/post', data=data)
pprint(response.__dict__)
print(response.text)
```
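When data is a dictionary, requests sends it as a URL-encoded form body, and a list value becomes one key=value pair per element. The sketch below reproduces that encoding with the standard library's urllib.parse so that it can be run offline; it illustrates the wire format and is not how requests is implemented internally.

```python
from urllib.parse import parse_qs, urlencode

data = {
    'name': 'yitian',
    'age': 22,
    'friends': ['zhang3', 'li4']
}

# doseq=True expands a list value into repeated key=value pairs.
body = urlencode(data, doseq=True)
print(body)

# Decoding the body shows what a server-side form parser would see:
# every value comes back as a list of strings.
decoded = parse_qs(body)
print(decoded)
```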
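I won't go into more detail on requests, because it has Chinese documentation. That translation is a few minor versions behind the official documentation, but this does no real harm, so feel free to rely on it.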
```
http://cn.python-requests.org/zh_CN/latest/
```
BeautifulSoup

With requests we can easily fetch HTML, but to find the data we need inside that HTML we also need an HTML/XML parsing library. BeautifulSoup is one such commonly used library. First, install it:
```
pip install beautifulsoup4
```
This time I'll use my Jianshu home page as the example and scrape the list of my articles. First, fetch the page content with requests.
```
from pprint import pprint
import bs4
import requests

headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}

url = 'https://www.jianshu.com/u/7753478e1554'
response = requests.get(url, headers=headers)
```
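Next comes the BeautifulSoup code. To use BeautifulSoup, first build an HTML tree, then search the tree for nodes. BeautifulSoup offers two main ways to find nodes: the find and find_all methods, and the select method, which takes a CSS selector. Once you have a node, use contents to get its child nodes; if a child is text, you get the text value. Note that this attribute returns a list, which is why the code indexes it with [0].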
```
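# The 'lxml' parser is a separate package (pip install lxml);
# features='html.parser' works without installing anything extra.
html = bs4.BeautifulSoup(response.text, features='lxml')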
note_list = html.find_all('ul', class_='note-list', limit=1)[0]
for a in note_list.select('li>div.content>a.title'):
    title = a.contents[0]
    link = f'https://www.jianshu.com{a["href"]}'
    print(f'《{title}》,{link}')
```
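The loop above builds each link by concatenating the site root with the href, which works as long as every href is a root-relative path. A slightly more defensive option is urllib.parse.urljoin from the standard library. The hrefs below are made-up examples, not values taken from the real page.

```python
from urllib.parse import urljoin

# URL of the page the links were found on.
page_url = 'https://www.jianshu.com/u/7753478e1554'

# Hypothetical href values of the kinds a page might contain.
hrefs = [
    '/p/abc123',                  # root-relative path
    'https://example.com/post',   # already absolute
    '//cdn.example.com/a.js',     # protocol-relative
]

# urljoin resolves each href against the page URL.
resolved = [urljoin(page_url, href) for href in hrefs]
for link in resolved:
    print(link)
```

Plain concatenation would produce a malformed URL for the second and third hrefs, while urljoin leaves an absolute URL untouched and fills in only the missing parts.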
BeautifulSoup also has Chinese documentation. It likewise lags a couple of minor versions behind, which makes little difference.
```
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
```
requests-html

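This library is a sibling of requests and is also the work of Kenneth Reitz. It combines requesting and parsing pages. With requests alone you can only fetch a page, and you need a parsing library such as BeautifulSoup to parse it. With requests-html, one library does both.
First, install it.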
```
pip install requests-html
```
Now let's see how to rewrite the example above with requests-html.
```
from requests_html import HTMLSession
from pprint import pprint

url = 'https://www.jianshu.com/u/7753478e1554'
headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}
session = HTMLSession()
r = session.get(url, headers=headers)
note_list = r.html.find('ul.note-list', first=True)
for a in note_list.find('li>div.content>a.title'):
    title = a.text
    link = f'https://www.jianshu.com{a.attrs["href"]}'
    print(f'《{title}》,{link}')
```
Besides CSS selectors, requests-html can also search with XPath.
```
for a in r.html.xpath('//ul[@class="note-list"]/li/div[@class="content"]/a[@class="title"]'):
    title = a.text
    link = f'https://www.jianshu.com{a.attrs["href"]}'
    print(f'《{title}》,{link}')
```
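To experiment with XPath expressions like the one above without fetching a page, the standard library's xml.etree.ElementTree supports a limited subset of XPath. This is a sketch under assumptions: SAMPLE is invented, well-formed markup shaped like the article list, ElementTree is an XML parser and will reject much real-world HTML, and its paths must start with .// instead of //.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for the article list.
SAMPLE = '''<html><body>
<ul class="note-list">
  <li><div class="content"><a class="title" href="/p/abc">First post</a></div></li>
  <li><div class="content"><a class="other" href="/p/skip">Not a title</a></div></li>
  <li><div class="content"><a class="title" href="/p/def">Second post</a></div></li>
</ul>
</body></html>'''

root = ET.fromstring(SAMPLE)

# Same shape as the XPath used with requests-html, but relative to root.
path = './/ul[@class="note-list"]/li/div[@class="content"]/a[@class="title"]'

# [@class="..."] compares the whole attribute value, so only exact matches pass.
results = [(a.text, a.get('href')) for a in root.findall(path)]
for title, href in results:
    print(f'{title}: https://www.jianshu.com{href}')
```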
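Another very useful feature of requests-html is browser rendering. Some pages load their content asynchronously: the data is fetched by JavaScript running in the browser, so a crawler that requests the page directly gets only an empty shell. Such pages need browser rendering, which requests-html makes very simple: just call one extra function, render. The two render parameters used below, scrolldown and sleep, set how many times to scroll down the page and how long to pause. The first time render runs, requests-html downloads a Chromium browser and then uses it to render the page.
The personal article page on Jianshu is one example of asynchronous loading: by default it shows only the most recent few articles. By using browser rendering to simulate scrolling down the page, we can get the full article list.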
```
session = HTMLSession()
r = session.get(url, headers=headers)
# render tells requests-html to render the page with the Chromium browser
r.html.render(scrolldown=50, sleep=0.2)
for a in r.html.xpath('//ul[@class="note-list"]/li/div[@class="content"]/a[@class="title"]'):
    title = a.text
    link = f'https://www.jianshu.com{a.attrs["href"]}'
    print(f'《{title}》,{link}')
```
Similarly, the personal page on Toutiao is loaded asynchronously, so render has to be called there too.
```
from requests_html import HTMLSession

headers = {
    'user-agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}
session = HTMLSession()
r = session.get('https://www.toutiao.com/c/user/6662330738/#mid=1620400303194116', headers=headers)
r.html.render()

for i in r.html.find('div.rbox-inner a'):
    title = i.text
    link = f'https://www.toutiao.com{i.attrs["href"]}'
    print(f'《{title}》 {link}')
```
Finally, here are the official requests-html site and its Chinese documentation.
```
https://html.python-requests.org/
https://cncert.github.io/requests-html-doc-cn/#/?id=requests-html
```
scrapy

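The libraries introduced above each have their own role, and combining them is enough to write a crawler. When it comes to a dedicated crawler framework, though, scrapy deserves a mention. As a well-known crawler framework, scrapy turns the crawler model into a modular framework, and with it we can quickly build powerful crawlers.
scrapy involves many concepts, however, and covering them properly would take a separate article, so here is only a brief demonstration. First install scrapy; on Windows you also need to install pypiwin32.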
```
pip install scrapy
pip install pypiwin32
```
Then create a scrapy project and add a new spider.
```
scrapy startproject myproject
cd myproject
scrapy genspider my jianshu.com
```
Open the settings file settings.py and set the user agent; otherwise you will get a 403 error.
```
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
```
Then modify the spider.
```
# -*- coding: utf-8 -*-
import scrapy


class JianshuSpider(scrapy.Spider):
    # Must match the name passed to `scrapy genspider` and `scrapy crawl`.
    name = 'my'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/u/7753478e1554']

    def parse(self, response):
        for article in response.css('div.content'):
            href = article.xpath('a[@class="title"]/@href').get()
            if href is None:
                continue  # skip blocks that have no title link
            yield {
                'title': article.css('a.title::text').get(),
                # urljoin resolves the relative href against the page URL
                'link': response.urljoin(href),
            }
```
Finally, run the spider.
```
scrapy crawl my
```
That's all for this article. I hope it helps.
Original article: https://www.cnblogs.com/wjw-zm/p/11789980.html