Understanding How Web Crawlers Work

Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/2881

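1. Briefly explain how a crawler works

The internet is like a large spider web with data stored at its nodes, and a crawler is like a spider that moves along the web to collect the data it needs. In concrete terms, a crawler is a program that sends requests to a website, retrieves the resources, then parses them and extracts the useful data.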
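The request, parse, extract cycle described above can be sketched with the standard library alone. This is only an illustration: `LinkParser`, `crawl`, and `FAKE_SITE` are hypothetical names, and the "site" is a small in-memory dictionary standing in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch, limit=10):
    """Breadth-first crawl: fetch a page, extract its links, queue the new ones."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)              # 1. send the request and get the resource
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()          # 2. parse the resource
        parser.feed(html)
        for link in parser.links:      # 3. extract what we need (here: links)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site held in memory instead of on a server
FAKE_SITE = {
    "/index": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/index">home</a>',
    "/b": "<p>no links here</p>",
}

print(crawl("/index", FAKE_SITE.get))
```

Passing the fetch function in as a parameter keeps the loop independent of how pages are downloaded, so a real HTTP getter could be swapped in for `FAKE_SITE.get`.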

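2. Understand the crawler development process

1). Briefly explain how a browser works;

The user enters a URL and the browser sends a request to the server. After receiving the returned data, the browser parses its content and displays it to the user.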
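To make the request and response steps concrete, here is a rough sketch of the text a browser might send for a URL and how a raw reply splits into status, headers, and body. The helper names `build_get_request` and `split_response` are hypothetical, the reply is a canned string rather than a real server answer, and real browsers send many more headers.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return the host to contact and the raw HTTP/1.1 GET request text."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return parts.netloc, request


def split_response(raw):
    """Split a raw HTTP response into status code, headers dict, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    status = int(status_line.split(" ")[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


host, request = build_get_request("http://news.gzcc.cn/html/2019/jxky_0329/11094.html")
print(host)
print(request)

# A canned response standing in for what a server might send back
canned = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<h1>Hello</h1>"
print(split_response(canned))
```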

2). Use the requests library to fetch website data;

requests.get(url) fetches the HTML source of the campus news home page
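A slightly more defensive way to wrap that call might look like the sketch below. `fetch_html` is a hypothetical helper, the 10-second timeout is an arbitrary choice, and UTF-8 is an assumption based on the encoding set by hand in the code later in this post.

```python
def fetch_html(url, get=None, timeout=10):
    """Download a page and return its HTML as text."""
    if get is None:
        # Third-party library; imported here so a stand-in `get` works without it
        import requests
        get = requests.get
    res = get(url, timeout=timeout)
    res.raise_for_status()    # raise on 4xx/5xx instead of parsing an error page
    res.encoding = "utf-8"    # assumption: the site serves UTF-8
    return res.text
```

With network access, fetch_html("http://news.gzcc.cn/html/2019/jxky_0329/11094.html") should return the page source as a string. The optional `get` parameter exists only so the function can be tried with a fake getter.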

3). Understand web pages

Write a simple HTML file containing several tags, classes, and ids

A simple web page:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>菜鸟教程(runoob.com)</title>
</head>
<body>

<table width="500" border="0">
<tr>
<td colspan="2" style="background-color:#FFA500;">
<h1>Main page title</h1>
</td>
</tr>

<tr>
<td style="background-color:#FFD700;width:100px;vertical-align:top;">
<b>Menu</b><br>
HTML<br>
CSS<br>
JavaScript
</td>
<td style="background-color:#eeeeee;height:200px;width:400px;vertical-align:top;">
Content goes here</td>
</tr>

<tr>
<td colspan="2" style="background-color:#FFA500;text-align:center;">
Copyright © runoob.com</td>
</tr>
</table>

</body>
</html>

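4). Use Beautiful Soup to parse web pages;

BeautifulSoup(html_sample, 'html.parser') parses the HTML file above into a DOM tree

select (CSS selectors) locates the data

Find the HTML elements with a given tag name

Find the HTML elements with a given class name

Find the HTML elements with a given id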
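The three kinds of lookup listed above can be tried offline on a small inline page. This is a minimal sketch: the `html_sample` content and its class and id names are invented for illustration and are not the page used in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the class and id names are made up for illustration
html_sample = """
<html><body>
<h1 id="title">Hello</h1>
<a href="#" class="link">This is link1</a>
<a href="# link2" class="link">This is link2</a>
</body></html>
"""

soup = BeautifulSoup(html_sample, "html.parser")

print(soup.select("h1")[0].text)                 # by tag name
print([a.text for a in soup.select(".link")])    # by class name (leading dot)
print(soup.select("#title")[0].text)             # by id (leading hash)
```

select always returns a list, even for an id lookup, which is why the examples index into the result with [0].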

import requests
from bs4 import BeautifulSoup

url = "http://news.gzcc.cn/html/2019/jxky_0329/11094.html"
res = requests.get(url)
print(res.encoding)
# Override the detected encoding so the Chinese text decodes correctly
res.encoding = 'utf-8'

soup = BeautifulSoup(res.text, 'html.parser')

# Get the h4 tag elements
h = soup.select('h4')
print(h)

# Get the elements whose class is show-title
class_tag = soup.select(".show-title")
print(class_tag)

# Get the element whose id is hits
click_count = soup.select('#hits')
print(click_count)

[Screenshot: output of the code above]

3. Extract the title, publication time, and publishing department of a campus news article

url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html'

The URL chosen here is http://news.gzcc.cn/html/2019/jxky_0329/11094.html

import re
import datetime
import requests
from bs4 import BeautifulSoup

url = "http://news.gzcc.cn/html/2019/jxky_0329/11094.html"
res = requests.get(url)
print(res.encoding)
res.encoding = 'utf-8'

soup = BeautifulSoup(res.text, 'html.parser')

# The info line separates its fields with pairs of non-breaking spaces
info = soup.select('.show-info')[0].text
fields = info.split("\xa0\xa0")
# The keys stay in Chinese because they are matched against the labels on the page:
# 作者 = author, 发布时间 = publication time, 来源 = source, 点击 = clicks
news = {"作者": '', "发布时间": '', "来源": '', "点击": 0}
for field in fields:
    if "发布时间:" in field:
        # The publication time is labelled with an ASCII colon
        news["发布时间"] = field.split("发布时间:")[1]
    elif '：' in field:
        # The other fields use a full-width colon
        key, value = field.split("：", 1)
        news[key.strip()] = value.strip()

# Get the click count, which is served by a separate counter API
url2 = "http://oa.gzcc.cn/api.php?op=count&id=11094&modelid=80"
res2 = requests.get(url2)
# print(res2.text)
news['点击'] = int(re.findall(r"\$\('#hits'\)\.html\('(\d+)'\)", res2.text)[0])

# Convert the time string to a datetime object
news["发布时间"] = datetime.datetime.strptime(news["发布时间"].strip(), "%Y-%m-%d %H:%M:%S")
print("Type of publication time:", type(news["发布时间"]))

# Get the title
title = soup.select(".show-title")[0].text

# Get the article body
content = soup.select(".show-content")[0].text

print("Title:", title)
print("Author:", news["作者"])
print("Publication time:", news["发布时间"])
print("Clicks:", news["点击"])
print("Publishing department:", news["来源"])
print("Content:", content)

[Screenshot: output of the program]
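The parsing steps in this section (splitting the info line, reading the click count out of the counter script, converting the time) can also be tried without network access. The sketch below makes assumptions: `parse_info` and `parse_hits` are hypothetical helpers, the two sample strings only approximate the shape of the real page and API response, and their values are invented.

```python
import re
from datetime import datetime

# Assumed shape of the text inside .show-info: fields separated by two
# non-breaking spaces. The values are invented for illustration.
sample_info = (
    "发布时间:2019-03-29 10:17:36\xa0\xa0"
    "作者：Example Author\xa0\xa0"
    "来源：Example Office\xa0\xa0"
    "点击：次"
)
# Assumed shape of the counter API response (a small jQuery snippet)
sample_count = "$('#todaydowns').html('3');$('#hits').html('128');"


def parse_info(text):
    """Turn the info line into a dict keyed by the labels used on the page."""
    result = {}
    for field in text.split("\xa0\xa0"):
        # Normalise the full-width colon so one split handles both styles
        key, sep, value = field.replace("：", ":", 1).partition(":")
        if sep:
            result[key.strip()] = value.strip()
    return result


def parse_hits(script):
    """Pull the number out of $('#hits').html('N'); return None if absent."""
    match = re.search(r"\$\('#hits'\)\.html\('(\d+)'\)", script)
    return int(match.group(1)) if match else None


info = parse_info(sample_info)
published = datetime.strptime(info["发布时间"], "%Y-%m-%d %H:%M:%S")
print(published, type(published))
print(info["作者"], info["来源"], parse_hits(sample_count))
```

Returning None from `parse_hits` when the pattern is missing avoids the IndexError that indexing an empty findall result would raise.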

Original article: https://www.cnblogs.com/grate/p/10594889.html