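Static Web Page Crawling Exercises

I. Connecting to and downloading the page

Exercise 1: Send a GET request and retrieve the content of a given page.

Use the Requests library to send a GET request to “http://www.tipdm.com/tipdm/gsjj/”, passing a disguised User-Agent header such as “Mozilla/5.0 (Windows NT 6.1; Win64; x64) Chrome/65.0.3325.181”. Inspect the status code and response headers returned by the server to confirm that the connection succeeded, then view the returned page content in a form that displays correctly.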

import chardet
import requests
from selenium import webdriver

url = 'http://www.tipdm.com/tipdm/gsjj/'
# Disguised User-Agent so the request looks like it comes from a desktop browser
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
rqg = requests.get(url, headers=headers, timeout=2)
print(type(rqg))
print("Status code:", rqg.status_code)
#print("Encoding:", rqg.encoding)
print("Response headers:", rqg.headers)
# Detect the encoding from the raw bytes so that rqg.text decodes correctly
rqg.encoding = chardet.detect(rqg.content)['encoding']
#print("Encoding after the change:", rqg.encoding)
# Save the decoded page as UTF-8 so the later exercises can reuse it
with open('D:/test.html', 'wb') as fp:
    fp.write(rqg.text.encode('utf-8'))
# Render the saved file in Chrome; a local file needs the file:// scheme
driver = webdriver.Chrome()
driver.get("file:///D:/test.html")
data = driver.page_source
print(data)
driver.quit()
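
The reason for overriding rqg.encoding can be seen without any network access. Requests falls back to ISO-8859-1 when a text response declares no charset, and decoding UTF-8 bytes with that codec raises no error, it just produces garbage. The following is a minimal sketch using only the standard library; the sample string is made up for illustration and is not taken from the target page.

```python
# Bytes as a server might send them for a UTF-8 encoded page.
raw = "泰迪智能科技".encode("utf-8")

# Decoding with the wrong codec does not fail; it silently yields mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the correct codec restores the original text.
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```

This is why the script checks the detected encoding before reading rqg.text: the status code alone says nothing about whether the body will display correctly.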

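II. Matching the target information

Exercise 2: Find the target node and extract its text.

Use the Beautiful Soup library to parse the page content obtained in Exercise 1, find the node whose CSS class name is “contentCom”, and extract the text of the first child node of that node that contains text.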

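from bs4 import BeautifulSoup

# Build the BeautifulSoup object from the file saved in Exercise 1.
# "lxml" needs the lxml package; "html.parser" is the built-in alternative.
with open("D:/test.html", encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, "lxml")
target = soup.find_all(class_="contentCom")   # match nodes by CSS class name
# Index 1 is used on the assumption that index 0 is a whitespace-only string
info = list(target[0].strings)[1]
print(info)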
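
The index 1 in the code above makes more sense once the difference between .strings and .stripped_strings is visible. Below is a small sketch on a made-up HTML fragment (not the real page), parsed with the built-in "html.parser" so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics a node with class "contentCom".
html = '<div class="contentCom">\n<p>First paragraph</p>\n<p>Second paragraph</p>\n</div>'
soup = BeautifulSoup(html, "html.parser")
node = soup.find(class_="contentCom")

# .strings yields every text node, including the newlines between tags.
all_strings = [str(s) for s in node.strings]
# .stripped_strings skips whitespace-only nodes and trims the rest.
clean_strings = [str(s) for s in node.stripped_strings]

print(all_strings)    # ['\n', 'First paragraph', '\n', 'Second paragraph', '\n']
print(clean_strings)  # ['First paragraph', 'Second paragraph']
```

On markup shaped like this, next(node.stripped_strings) gives the first real text without relying on a fixed index; whether the real page behaves the same way depends on its actual markup.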

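III. Writing to the database

Exercise 3: Create a new table in the database and import the data.

Use the PyMySQL library to store the page content extracted in Exercise 2: create a new table in the MySQL test database, insert the extracted text into it, then query the table to confirm that the data was stored.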

import pymysql
# Create the connection using keyword arguments
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='root', db='test', charset='utf8', connect_timeout=1000)
# Create a cursor
cursor = conn.cursor()
# Create the table
sql = "create table if not exists new_taidi (id int(10) primary key auto_increment,name varchar(20) not null,text varchar(100) not null)"
cursor.execute(sql)          # run the CREATE TABLE statement
cursor.execute("show tables")    # list the tables to check that it exists
# Insert the data; info is the text extracted in Exercise 2
title = "泰迪"
sql = "insert into new_taidi (name,text) values (%s,%s)"
cursor.execute(sql, (title, info))          # run the INSERT statement
conn.commit()           # commit the transaction
# Query the data
cursor.execute("select * from new_taidi")
# Fetch every result row with fetchall
data = cursor.fetchall()
print("Query result:", data)
cursor.close()
conn.close()
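
The DB-API fetch methods differ in how many rows they return, which matters when checking whether an insert succeeded. The sketch below uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server. The table name demo is made up, and sqlite3 uses ? placeholders where PyMySQL uses %s.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL "test" database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table demo (id integer primary key autoincrement, name text not null)")
cur.executemany("insert into demo (name) values (?)", [("a",), ("b",), ("c",)])
conn.commit()

cur.execute("select name from demo order by id")
# fetchmany() without an argument returns at most cur.arraysize rows (1 by default).
first_batch = cur.fetchmany()
# fetchall() returns every row that has not been fetched yet.
remaining = cur.fetchall()

print(first_batch)  # [('a',)]
print(remaining)    # [('b',), ('c',)]
conn.close()
```

PyMySQL cursors follow the same DB-API convention, so fetchmany() with no size argument should likewise return a single row.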

Extension: inserting and querying multiple rows

import pymysql
# Create the connection using keyword arguments
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='root', db='test', charset='utf8', connect_timeout=1000)
# Create a cursor
cursor = conn.cursor()
# Create the table
sql = "create table if not exists new_taidi (id int(10) primary key auto_increment,name varchar(20) not null,text varchar(100) not null)"
cursor.execute(sql)          # run the CREATE TABLE statement
cursor.execute("show tables")    # list the tables to check that it exists
# Insert the data; for a multi-row insert, title and info must be lists of equal length
title = ["迟迟钟鼓初长夜","上穷碧落下黄泉","文能ac自动机","武能后缀自动机","于是我仰面"]
info = ["耿耿星河欲曙天","两处茫茫皆不见","fail树上dfs序建可持久化线段树","next指针dag图上跑SG函数","雨水顺着下巴往下滴答滴答"]
sql = "insert into new_taidi (name,text) values (%s,%s)"
for i, j in zip(title, info):
    cursor.execute(sql, (i, j))          # run the INSERT statement for one row
conn.commit()           # commit once, after all rows are inserted
# Query the data; append a WHERE clause here to filter the rows
cursor.execute("select * from new_taidi")
# Fetch every result row with fetchall
data = cursor.fetchall()
for row in data:
    print(row)
cursor.close()
conn.close()
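
Before a batch insert it can help to pair the two lists explicitly and check them against the column widths of new_taidi (varchar(20) and varchar(100)). The helper below, build_rows, is a hypothetical illustration and not part of the original post. It assumes that the character count from len() is a reasonable proxy for the MySQL column limit.

```python
def build_rows(names, texts, name_max=20, text_max=100):
    """Pair names with texts and check them against the column widths."""
    if len(names) != len(texts):
        raise ValueError("names and texts must have the same length")
    rows = []
    for name, text in zip(names, texts):
        if len(name) > name_max or len(text) > text_max:
            raise ValueError(f"value too long for column: {name!r}")
        rows.append((name, text))
    return rows


rows = build_rows(["迟迟钟鼓初长夜", "上穷碧落下黄泉"],
                  ["耿耿星河欲曙天", "两处茫茫皆不见"])
print(rows)

# With an open PyMySQL cursor, the whole batch could then be sent in one call:
# cursor.executemany("insert into new_taidi (name,text) values (%s,%s)", rows)
# conn.commit()
```

Checking the lengths up front matters because zip() silently drops the extra items when one list is longer than the other.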

Original post: https://www.cnblogs.com/thx2199/p/16130091.html