• Python Web Scraping (1): Scraping Static Web Pages


    Getting the response content:

    import requests
    r=requests.get('http://www.santostang.com/')
    print(r.encoding)
    print(r.status_code)
    print(r.text)
    

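    This prints the response encoding, the status code (200 means success, 4xx a client error, 5xx a server error), and the response body as text.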
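
The status-code classes listed above can be sketched as a small helper. This uses only the standard library, and `describe_status` is a hypothetical name introduced for illustration; in a real script the value would come from `r.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label for the class of an HTTP status code."""
    if 200 <= code < 300:
        kind = 'success'
    elif 400 <= code < 500:
        kind = 'client error'
    elif 500 <= code < 600:
        kind = 'server error'
    else:
        kind = 'other'
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. 'Not Found'.
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = 'Unknown'
    return f'{code} {phrase} ({kind})'

print(describe_status(200))
print(describe_status(404))
print(describe_status(503))
```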

    Customizing requests

    Passing URL parameters

    key_dict = {'key1':'value1','key2':'value2'}
    r=requests.get('http://httpbin.org/get',params=key_dict)
    print(r.url)
    print(r.text)
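
To see what kind of URL `r.url` holds without sending a request, the query string can be built by hand. This is a rough sketch using the standard library's `urlencode`, not the code path of requests itself; `build_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_url(base, params):
    """Append params to base as a percent-encoded query string."""
    return f'{base}?{urlencode(params)}'

key_dict = {'key1': 'value1', 'key2': 'value2'}
url = build_url('http://httpbin.org/get', key_dict)
print(url)
# Parse the query string back into a dict of lists to check the round trip.
print(parse_qs(urlsplit(url).query))
```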
    

    Customizing request headers

    headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0','Host':'www.santostang.com'}
    r=requests.get('http://www.santostang.com',headers=headers)
    print(r.status_code)

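    Sending a POST request

    A POST request sends form data in the request body, so values such as passwords do not appear in the URL. A dictionary passed as data is automatically form-encoded when it is sent.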

    key_dict = {'key1':'value1','key2':'value2'}
    r=requests.post('http://httpbin.org/post',data=key_dict)
    print(r.url)
    print(r.text)
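
The form encoding described above can be illustrated without a network call. This sketch assumes the standard `application/x-www-form-urlencoded` format and uses the standard library; `form_body` and the login values are made up for the example.

```python
from urllib.parse import urlencode

def form_body(fields):
    """Encode a dict as an application/x-www-form-urlencoded body (bytes)."""
    return urlencode(fields).encode('ascii')

login = {'user': 'alice', 'password': 'p@ss word'}  # made-up values
body = form_body(login)
print(body)

# The URL itself stays free of the form fields; they travel in the body.
url = 'http://httpbin.org/post'
print('password' in url)
```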
    

    Timing out and raising an exception

    r=requests.get('http://www.santostang.com/',timeout=0.11)
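
With a timeout this short the call is likely to raise, so a script usually wraps it. The sketch below shows one possible shape for that wrapper; `fetch_or_none` and `slow_fetch` are hypothetical names, and `slow_fetch` merely simulates a slow server so the example runs offline.

```python
def fetch_or_none(fetch, url, timeout, timeout_errors=(TimeoutError,)):
    """Call fetch(url, timeout=timeout); return None if it times out."""
    try:
        return fetch(url, timeout=timeout)
    except timeout_errors as exc:
        print(f'{url} timed out after {timeout}s: {exc!r}')
        return None

def slow_fetch(url, timeout):
    # Stand-in for a real HTTP call: always behaves as if the server was too slow.
    raise TimeoutError('no response in time')

print(fetch_or_none(slow_fetch, 'http://www.santostang.com/', 0.11))
```

With requests, the same helper could be called as `fetch_or_none(requests.get, url, 0.11, (requests.exceptions.Timeout,))`.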
    

      

    Fetching the Douban Top 250 movie data

    import requests
    from bs4 import BeautifulSoup
    
    def get_movies():
        # Browser-like headers, sent with every request below.
        headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0','Host':'movie.douban.com'}
        movie_list=[]
        for i in range(10):
            # Each page lists 25 movies; 'start' is the offset of the first one.
            link='https://movie.douban.com/top250'
            key_dict = {'start':i*25,'filter':''}
            r=requests.get(link,params=key_dict,headers=headers)
            print(r.status_code)
            print(r.url)
            
            soup=BeautifulSoup(r.text,'lxml')
            div_list=soup.find_all('div', class_='hd')
            for each in div_list:
                # The first <span> inside the link holds the movie title.
                movie=each.a.span.text.strip()+'\n'
                movie_list.append(movie)
        return movie_list
    
    def storFile(data,fileName,method='a'):
        # Append text to the file; UTF-8 keeps non-ASCII titles intact.
        with open(fileName,method,newline='',encoding='utf-8') as f:
            f.write(data)
    
    movie_list=get_movies()
    for movie in movie_list:
        storFile(movie, 'movie top250.txt','a')
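
The paging logic in the loop above (ten pages, 25 movies each, offset passed as `start`) can be checked offline by listing the URLs it would request. This is a small sketch using the standard library; `page_urls`, `BASE`, and `PAGE_SIZE` are names introduced for illustration.

```python
from urllib.parse import urlencode

BASE = 'https://movie.douban.com/top250'
PAGE_SIZE = 25  # movies per page, as implied by start=i*25

def page_urls(pages=10):
    """Return the URL of each results page, in order."""
    return [f"{BASE}?{urlencode({'start': i * PAGE_SIZE, 'filter': ''})}"
            for i in range(pages)]

urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```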
    

      

  • Original article: https://www.cnblogs.com/bai2018/p/10957787.html