Web Scraping Practice

Web page

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Simple DOM Demo</title>
</head>

<body>
    <h1>This is the document body</h1>
    <P ID = "p1Node">This is paragraph 1.</P>
    <P ID = "p2Node">段落2</P>
    <a href="http://www.gzcc.cn/">广州商学院</a>
    <li>
        <a href="http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0328/9113.html">
            <div class="news-list-thumb"><img src="http://oa.gzcc.cn/uploadfile/2018/0328/20180328085249565.jpg"/></div>
            <div class="news-list-text">
                <div class="news-list-title" style="">我校校长杨文轩教授讲授新学期“思政第一课”</div>
                <div class="news-list-description">3月27日下午，我校校长杨文轩教授在第四教学楼310室为学生讲授了新学期“思政第一课”。</div>
                <div class="news-list-info"><span><i class="fa fa-clock-o"></i>2018-03-28</span><span><i class="fa fa-building-o"></i>马克思主义学院</span></div>
            </div>
        </a>
    </li>

</body>
</html>

Exercise

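import requests
from bs4 import BeautifulSoup

# Local copy of the page above
url = 'http://localhost:63342/bd/0328.html?_ijt=h9b41m2eup4kmk1cet0l4ai05j'
res = requests.get(url)
res.encoding = 'utf-8'  # the page declares UTF-8, so decode the Chinese text as UTF-8
print(type(res))  # <class 'requests.models.Response'>

soup = BeautifulSoup(res.text, 'html.parser')
print(type(soup))  # <class 'bs4.BeautifulSoup'>

# href of the first <a> in the document
print(soup.a.attrs['href'])

# All text inside the first <li>
print(soup.li.text)

# select() takes CSS selectors: a leading dot matches a class name
print(soup.select('.news-list-title'))
# href of the <a> inside the <li>, i.e. the news link
print(soup.li.a.attrs['href'])
# An element carrying both classes is matched by chaining them without a space
print(soup.select('.fa.fa-clock-o'))
print(soup.select('.fa.fa-building-o'))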
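
The two icon lookups return only the empty `<i>` tags, because the date and the department name are text of the enclosing `<span>`. A minimal sketch of collecting all fields of the news item into one dict might look like the following. It inlines a trimmed copy of the `<li>` so it runs without the local server; `parse_news_item` and the dict keys are hypothetical names chosen for illustration, not part of the original exercise.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the news item from the page above, inlined so the
# example does not depend on the local server (an assumption for this sketch).
HTML = """
<li>
    <a href="http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0328/9113.html">
        <div class="news-list-text">
            <div class="news-list-title" style="">我校校长杨文轩教授讲授新学期“思政第一课”</div>
            <div class="news-list-description">3月27日下午，我校校长杨文轩教授在第四教学楼310室为学生讲授了新学期“思政第一课”。</div>
            <div class="news-list-info"><span><i class="fa fa-clock-o"></i>2018-03-28</span><span><i class="fa fa-building-o"></i>马克思主义学院</span></div>
        </div>
    </a>
</li>
"""


def parse_news_item(li):
    """Collect the fields of one <li> news entry into a dict (hypothetical helper)."""
    # The <i> icons are empty; the visible text belongs to the parent <span>.
    date_icon = li.select_one('.fa-clock-o')
    source_icon = li.select_one('.fa-building-o')
    return {
        'url': li.a['href'],
        'title': li.select_one('.news-list-title').get_text(strip=True),
        'description': li.select_one('.news-list-description').get_text(strip=True),
        'date': date_icon.parent.get_text(strip=True),
        'source': source_icon.parent.get_text(strip=True),
    }


soup = BeautifulSoup(HTML, 'html.parser')
item = parse_news_item(soup.li)
for key, value in item.items():
    print(key, ':', value)
```

On a page listing several news entries, the same helper could presumably be applied to each element returned by `soup.select('li')`.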

Original post: https://www.cnblogs.com/0056a/p/8678569.html