Learning Python Web Scraping

Development environment

Interpreter version: Python 3.6
32-bit: python-3.6.2.exe
64-bit: python-3.6.5.exe
Development tools: PyCharm, Jupyter Notebook
Browser: latest version of Google Chrome

Installation steps

Python 3.6
- https://www.jianshu.com/p/6bd0e6c5cee0
PyCharm
- https://www.jianshu.com/p/608ff1efe662
Jupyter
- https://www.jianshu.com/p/f3c3dd636b8a

Installing the libraries

requests -> install with: pip install requests
beautifulsoup4 -> install with: pip install beautifulsoup4
html5lib -> install with: pip install html5lib
lxml -> install with: pip install lxml
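
A quick way to confirm that the four packages installed correctly is to check that each one can be imported. The sketch below is a minimal example and not part of the original setup; the helper name `check_packages` is hypothetical. Note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util

# Import names for the packages installed above.
# beautifulsoup4 is imported as "bs4"; the others use their pip names.
REQUIRED = ["requests", "bs4", "html5lib", "lxml"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(REQUIRED).items():
        print(name, "OK" if found else "missing")
```

Any package reported as missing can be installed with the matching pip command listed above.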

The requests library

Import: import requests
HTTP request methods: post(), get(), put(), delete(), head(), options()
r = requests.get("https://api.github.com/events")
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
r = requests.put('http://httpbin.org/put', data={'key': 'value'})
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

Custom request headers

url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers)
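
To see which headers would actually be sent, one option is to build the request without sending it. The sketch below assumes the same placeholder URL and header value as the example above and uses `requests.Request(...).prepare()`, so it needs no network access.

```python
import requests

# Placeholder endpoint and header value, copied from the example above.
url = "https://api.github.com/some/endpoint"
headers = {"user-agent": "my-app/0.0.1"}

# Build the request without sending it.
prepared = requests.Request("GET", url, headers=headers).prepare()

# Header lookups on the prepared request are case-insensitive.
print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Many sites reject requests that carry the default requests user agent, which is why scraping scripts usually set a browser-like User-Agent header.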

Response status code

r = requests.get('http://httpbin.org/get')
r.status_code

Expected output for a successful request: 200
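
A scraper usually has to handle codes other than 200 as well. The following sketch uses only the standard library to turn a status code into a readable description and to test for success; the helper names `describe_status` and `is_success` are hypothetical and not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes that are not registered in the standard library enum.
        return f"{code} Unknown"


def is_success(code):
    """Return True for any 2xx code, not only 200."""
    return 200 <= code < 300


print(describe_status(200))
print(describe_status(404))
print(is_success(204))
```

With requests itself, calling `r.raise_for_status()` raises an exception for 4xx and 5xx responses, which can replace a manual comparison in many scripts.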

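BeautifulSoup4

Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.

Import: from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), 'html.parser')
soup = BeautifulSoup("<html>data</html>", 'html.parser')

Passing a parser name explicitly avoids the warning Beautiful Soup emits when it has to guess one.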

Object types

Tag: a Tag object corresponds to a tag in the original XML or HTML document.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
Output: <class 'bs4.element.Tag'>

String

A NavigableString holds the text inside a tag.
tag.string
Output: 'Extremely bold'
type(tag.string)
Output: <class 'bs4.element.NavigableString'>
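
The two object types can be tried out together in a few lines. This sketch assumes the markup `<b class="boldest">Extremely bold</b>` and the built-in `html.parser`, so it runs without html5lib or lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

print(tag.name)       # the tag's name
print(tag["class"])   # class is a multi-valued attribute, so a list is returned
print(tag.string)     # the NavigableString inside the tag

# Convert to a plain Python str before storing or writing the text.
text = str(tag.string)
print(type(text))
```

Converting with `str()` keeps only the text; a NavigableString otherwise carries a reference to the whole parse tree.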

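find_all, find

find_all(): searches for everything that matches the given criteria and returns a list.
find(): searches for a single match, the first one, and returns a Tag object (or None if nothing matches).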
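
The difference between the two functions is easiest to see on a small document. The markup below is hypothetical and used only for illustration; the calls themselves are standard Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = """
<ul>
  <li class="item">first</li>
  <li class="item">second</li>
  <li class="other">third</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")                          # every <li>, as a list
first_item = soup.find("li")                             # only the first <li>
by_class = soup.find_all("li", attrs={"class": "item"})  # filter by attribute
missing = soup.find("table")                             # None when nothing matches

print(len(all_items), first_item.string, len(by_class), missing)
```

Because find() can return None, it is safer to check the result before accessing attributes on it, whereas find_all() simply returns an empty list.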

A sample script follows:

# -*- coding: utf-8 -*-
# Author: WX
# Date: 2018-10-30

import os
from urllib.parse import urljoin
from urllib.request import urlretrieve

import requests
from bs4 import BeautifulSoup


def get_two_page(url):
    # 1. Send the request
    # 2. Check the status code
    # 3. Get the content
    # 4. Parse the content with bs4
    # 5. Fields to extract: 1. name 2. date of birth 3. height 4. measurements 5. details ... 6. photos
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/68.0.3440.106 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    if response.status_code != 200:
        print("Request failed with status code", response.status_code)
        return

    # response.encoding only affects response.text; BeautifulSoup detects
    # the encoding of the raw bytes in response.content by itself.
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.content, 'html5lib')
    txt = ''

    # Personal information, read from the table cells
    for table in soup.find_all('table'):
        for tr in table.find_all('tr'):
            td = tr.find('td')
            if td is None:
                continue
            # This navigation depends on the layout of the target page
            name = td.next_element.next_element.string
            txt += "Name:" + str(name) + "\n"

    # Details
    info_div = soup.find('div', attrs={'class': 'infocontent'})
    if info_div is not None:
        for child in info_div:
            txt += "Details:" + str(child.string) + "\n"

    # Photos
    entry_div = soup.find('div', attrs={'class': 'post_entry'})
    if entry_div is not None:
        for li in entry_div.find_all('li'):
            img = li.find('img')
            if img is None or not img.get('src'):
                continue
            # Resolve relative paths against the page URL
            img_path = urljoin(url, img['src'])
            txt += "Photo:" + img_path + "\n"
            get_info(img_path)

    # Write the collected text; the file is closed automatically
    with open("xiaohuar_detail_page_data.txt", "w", encoding='utf-8') as file:
        file.write(txt)
    print("Scraping finished")


def get_info(img_path):
    download_dir = 'download2Pic'
    # Create the download directory if it does not exist
    os.makedirs(download_dir, exist_ok=True)
    # Use the last path segment as the file name
    file_name = img_path.split('/')[-1]
    try:
        print(file_name + ".jpg" + " downloading....................")
        urlretrieve(img_path, os.path.join(download_dir, file_name + '.jpg'))
    except OSError as exc:
        print("Download failed:", exc)


if __name__ == '__main__':
    URL = "http://www.xiaohuar.com/p-1-1994.html"
    get_two_page(URL)
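
The file-name step in get_info splits the URL on '/' and always appends ".jpg", so a URL that already ends in an extension or carries a query string produces an awkward name. One possible variation, sketched here with the standard library only, parses the URL first; the helper name `filename_from_url` and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(img_url, default="image.jpg"):
    """Return the last path segment of a URL, or a default if it is empty."""
    path = urlsplit(img_url).path      # drops the query string and fragment
    name = posixpath.basename(path)    # URL paths always use forward slashes
    return name or default


print(filename_from_url("http://example.com/d/file/abc.jpg?size=large"))
print(filename_from_url("http://example.com/"))
```

The result could then be joined with the download directory and passed to urlretrieve in place of the manually built name.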

2018/10/30 20:22:40


Original article: https://www.cnblogs.com/wxvirus/p/12896781.html