Understanding How Web Crawlers Work

Assignment requirements: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/2851

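I. A brief explanation of how a crawler works

1. Start by choosing a small set of carefully selected seed URLs;

2. Put these URLs into the to-crawl URL queue;

3. Take a URL from the to-crawl queue, resolve its host name through DNS to obtain the host's IP address, download the page at that URL, and store it in the downloaded-page repository. Then add the URL to the crawled-URL queue.

4. Parse the pages downloaded for the URLs in the crawled queue, extract the other URLs they contain, and put those URLs into the to-crawl queue, which starts the next cycle.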
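The four steps above can be sketched as a small breadth-first loop. This is only an illustration under stated assumptions: `FAKE_WEB`, `fake_fetch_links`, and `crawl` are hypothetical names, and the "web" is an in-memory dictionary so that the example runs without a network. A real crawler would download each page over HTTP and parse the links out of the HTML.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would download the page and parse its links instead.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fake_fetch_links(url):
    """Stand-in for "download and parse": look the page up in FAKE_WEB."""
    return FAKE_WEB.get(url, [])


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl; returns the URLs in the order they were fetched."""
    to_crawl = deque(dict.fromkeys(seed_urls))  # step 2: to-crawl queue (deduplicated)
    seen = set(to_crawl)                        # every URL ever queued
    crawled = []                                # step 3: crawled-URL list

    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        links = fetch_links(url)                # step 3: fetch the page
        crawled.append(url)
        for link in links:                      # step 4: queue newly found URLs
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled


if __name__ == "__main__":
    for page in crawl(["http://example.com/"], fake_fetch_links):
        print(page)
```

The `seen` set is what keeps the loop from fetching the same page twice when pages link back to each other, and `max_pages` is a simple guard so that the crawl always terminates.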

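II. Understanding the crawler development process

1). Briefly explain how a browser works;

The user enters a URL; the browser sends a request over the network, waits for the server to respond with data, and then renders the page.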
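To make the "enter a URL, send a request" step more concrete, here is a minimal sketch of what a client prepares before anything goes over the network: it splits the URL into host, port, and path, then writes the text of an HTTP GET request. `build_get_request` is a hypothetical helper, not part of any library, and it leaves out most of what a real browser does (DNS lookup, TLS, cookies, caching, rendering).

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return (host, port, request_text) for a simple HTTP GET of `url`."""
    parts = urlsplit(url)
    host = parts.hostname
    # Default ports: 443 for https, 80 otherwise.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # Simplification: the Host header omits the port, which is only
    # correct when the default port is used.
    request_text = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request_text


if __name__ == "__main__":
    host, port, text = build_get_request("http://news.gzcc.cn/html/xiaoyuanxinwen")
    print(host, port)
    print(text)
```

The server's reply to such a request is a status line, headers, and the HTML body; that body is what the browser renders and what a crawler parses.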

2). Use the requests library to fetch website data;

requests.get(url) fetches the HTML of the campus news front page:
```
import requests

url = 'http://news.gzcc.cn/html/xiaoyuanxinwen'
res = requests.get(url)
```
3). Get to know web pages

Write a simple HTML file that contains several tags, classes, and an id:
```
<body>
<h1 id="title">Hello</h1>
<a href="#" class="link">This is link1</a>
<a href="#link2" class="link" qao="123">This is link2</a>
</body>
```
4). Use Beautiful Soup to parse the page;

BeautifulSoup(html_sample, 'html.parser') parses the HTML file above into a DOM tree, where html_sample is a string holding that HTML.
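As a minimal sketch of that parsing step, the block below stores the sample page in a string and builds the `soup` object that the later `select` calls rely on. It assumes the `beautifulsoup4` package is installed; `html_sample` is simply a copy of the sample HTML from step 3.

```python
from bs4 import BeautifulSoup

# The sample page from step 3, stored as a string.
html_sample = """
<body>
<h1 id="title">Hello</h1>
<a href="#" class="link">This is link1</a>
<a href="#link2" class="link" qao="123">This is link2</a>
</body>
"""

# Parse the string into a tree using Python's built-in HTML parser.
soup = BeautifulSoup(html_sample, 'html.parser')

print(soup.h1.text)             # text of the first <h1> element
print(len(soup.find_all('a')))  # number of <a> elements in the tree
```

'html.parser' is the parser that ships with Python, so no extra parser library is needed for this example.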

Use select with a CSS selector to locate data.

Find HTML elements with a specific tag:
```
a = soup.select('h1')[0].text
print(a)
```
Find HTML elements with a specific class name:
```
for link in soup.select('.link'):
    print(link.text)
```
Find the HTML element with a specific id:
```
c = soup.select('#title')[0].text
print(c)
```

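III. Extract the title, publish time, and publishing unit of a campus news article

Example article: http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html (the code below fetches a different article from the same site).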
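```
import requests
from bs4 import BeautifulSoup

url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0322/11042.html'
res = requests.get(url)
res.encoding = 'utf-8'  # decode the response body as UTF-8
soup1 = BeautifulSoup(res.text, 'html.parser')
title = soup1.select('.show-title')[0].text  # article title
info = soup1.select('.show-info')[0].text    # info line under the title
print(title, info)
```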
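The code above prints the whole `.show-info` text as one string. To pull the publish time and publishing unit out as separate values, one option is a pair of regular expressions, sketched below. This rests on an assumption that the source does not confirm: that the info text carries labels such as "发布时间：" (publish time) and "来源：" (source). `sample_info` and `parse_show_info` are hypothetical, and the names in the sample string are made-up placeholders.

```python
import re
from datetime import datetime

# Hypothetical example of the text inside .show-info; the real page may differ.
sample_info = "发布时间：2019-03-22 09:43:00  作者：张三  审核：李四  来源：校团委  点击：次"


def parse_show_info(info):
    """Extract (publish_time, publishing_unit) from a .show-info string.

    Assumes "发布时间：YYYY-MM-DD HH:MM:SS" and "来源：<unit>" labels.
    A field that cannot be found is returned as None.
    """
    time_match = re.search(
        r"发布时间[:：]\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", info
    )
    source_match = re.search(r"来源[:：]\s*(\S+)", info)
    published = (
        datetime.strptime(time_match.group(1), "%Y-%m-%d %H:%M:%S")
        if time_match
        else None
    )
    source = source_match.group(1) if source_match else None
    return published, source


published, source = parse_show_info(sample_info)
print(published, source)
```

Converting the time to a `datetime` object makes it easy to sort or compare articles later. If the real page uses different labels, only the two patterns need to change.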
Original article: https://www.cnblogs.com/zl1216/p/10595112.html