Getting Started with Python Web Scraping



    Fetching the content of an entire page

    import requests
    
    url = 'http://www.wise.xmu.edu.cn/people/faculty'
    
    # Send a GET request and keep the raw response body
    r = requests.get(url)
    html = r.content
    
    print(html)
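    
    One detail worth knowing about the requests API: r.content is the raw bytes of the response body, while r.text is the decoded string, and checking the status code before parsing avoids working on an error page. A minimal sketch of the same request with those checks added:
    
    import requests
    
    url = 'http://www.wise.xmu.edu.cn/people/faculty'
    
    r = requests.get(url)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    r.raise_for_status()
    
    print(r.status_code)  # e.g. 200 on success
    print(r.encoding)     # encoding guessed from the response headers
    text = r.text         # str, decoded using r.encoding
    data = r.content      # bytes, the raw response body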
    

    Adding bs4

    bs4 (Beautiful Soup 4) is an excellent library for parsing web pages. Start with the most commonly used methods on a BeautifulSoup object: the general pattern is to locate an element by its HTML tag and the tag's attributes, then pull the data out with a specific method (a self-contained demo of these methods follows the example below).
    Here we extract each faculty member's link and name.

    import requests
    from bs4 import BeautifulSoup
    
    url = 'http://www.wise.xmu.edu.cn/people/faculty'
    
    r = requests.get(url)
    html = r.content
    
    soup = BeautifulSoup(html, 'html.parser')
    
    # Locate the faculty container, then every link inside it that opens in a new tab
    div_people_list = soup.find('div', attrs={'class': 'people_list'})
    href_list = div_people_list.find_all('a', attrs={'target': '_blank'})
    
    for people in href_list:
        people_url = people['href']
        people_name = people.get_text().strip()
    
        print(people_url, people_name, sep='\t')
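    
    To see those common BeautifulSoup methods on their own: find returns the first matching tag, find_all returns a list of all matches, subscripting a tag reads one of its attributes, and get_text extracts the enclosed text. Here is a self-contained sketch on a made-up HTML snippet (the markup is invented for illustration, not taken from the target site):
    
    from bs4 import BeautifulSoup
    
    # Hypothetical markup used only to demonstrate the methods
    html = '<div class="people_list"><a href="/a" target="_blank"> Alice </a></div>'
    soup = BeautifulSoup(html, 'html.parser')
    
    div = soup.find('div', attrs={'class': 'people_list'})  # first matching tag
    links = div.find_all('a', attrs={'target': '_blank'})   # all matching tags
    
    print(links[0]['href'])             # attribute access -> /a
    print(links[0].get_text().strip())  # text content -> Alice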
    

    Saving the data

    import csv
    
    import requests
    from bs4 import BeautifulSoup
    
    
    def getHTML(url):
        # Fetch the page and return the raw response body
        r = requests.get(url)
        return r.content
    
    
    def parseHTML(html):
        soup = BeautifulSoup(html, 'html.parser')
    
        # Drill down: body -> the 'middle' div -> the 'list-ct' div
        body = soup.body
        company_middle = body.find('div', attrs={'class': 'middle'})
        company_list_ct = company_middle.find('div', attrs={'class': 'list-ct'})
    
        # Collect [name, url] pairs from every company list on the page
        company_list = []
        for company_ul in company_list_ct.find_all('ul', attrs={'class': 'company-list'}):
            for company_li in company_ul.find_all('li'):
                company_name = company_li.get_text()
                company_url = company_li.a['href']
                company_list.append([company_name, company_url])
        return company_list
    
    
    def writeCSV(file_name, data_list):
        # newline='' prevents blank lines between rows on Windows
        with open(file_name, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            for data in data_list:
                writer.writerow(data)
    
    
    if __name__ == '__main__':
        url = 'http://www.cninfo.com.cn/cninfo-new/information/companylist'
    
        html = getHTML(url)
        data_list = parseHTML(html)
        writeCSV('test.csv', data_list)
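    
    As a quick sanity check (a hypothetical follow-up step, not part of the original script), the standard csv module can read test.csv back:
    
    import csv
    
    # Print each stored row to verify the file was written correctly
    with open('test.csv', newline='', encoding='utf-8') as f:
        for row in csv.reader(f):
            print(row)  # e.g. ['Some Company', 'http://...']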
    
Original article: https://www.cnblogs.com/keer2345/p/6389947.html