• Crawler Review


    Crawler types: general-purpose crawlers, focused crawlers, and incremental crawlers.

    When capturing packets with Fiddler, keep in mind that it works by installing its own root certificate. When the project requests an HTTPS page, SSL verification of that certificate can fail and the request may be rejected.

    The workaround is to disable certificate verification when sending the requests call, or to temporarily turn off the Fiddler proxy. This pitfall comes up again at the end of the post.
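
    As a minimal sketch (the URL below is only a placeholder, not from the original post), there are two easy workarounds: skip certificate verification and keep going through Fiddler, or have requests ignore the system proxy that Fiddler registers so its certificate never gets involved:

    import requests
    
    url='https://httpbin.org/ip'      # placeholder URL, for illustration only
    ua={"User-Agent":"Mozilla/5.0"}
    
    # Option 1: keep the Fiddler proxy but skip SSL certificate verification
    page_text=requests.get(url,headers=ua,verify=False).text
    
    # Option 2: ignore the environment/system proxy settings, bypassing Fiddler entirely
    session=requests.Session()
    session.trust_env=False
    page_text=session.get(url,headers=ua).text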

    Parsing data out of HTML tags with BeautifulSoup:

    import requests
    from bs4 import BeautifulSoup
    url='https://www.yangguiweihuo.com/16/16089/'
    ua={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
    
    page_text=requests.get(url=url,headers=ua).text         # fetch the HTML of the chapter list page
    soup=BeautifulSoup(page_text,'lxml')
    a_list=soup.select('.listmain > dl > dd > a')           # parse out the <a> tags of all chapters
    
    with open("秦吏.txt",'w',encoding='utf-8') as f:
    
        for a in a_list:
            title=a.string                                  # text of the <a> tag, used as the chapter title
            detail_url='https://www.yangguiweihuo.com'+a['href']       # build the chapter detail URL
            detail_page=requests.get(url=detail_url,headers=ua).text  
    
            dsp=BeautifulSoup(detail_page,'lxml')           # chapter detail page
            content=dsp.find('div',id='content').text       # chapter body text
            f.write(title+'\n'+content)                     # persist the data to disk
            print(title+": download finished")
    print('The end')

     On using XPath

    div[@class="song"]         selects the div elements whose class is "song"

    div[@class="song"]/li/a/@href   extracts the URL from the href attribute of the nested a tags

    div[@class="song"]/li/a/text()   extracts the text of the nested a tags

    div[contains(@class,'ng')]    selects the div elements whose class attribute contains "ng"

    div[starts-with(@class,'ta')]    selects the div elements whose class attribute starts with "ta"
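
    In an lxml call these expressions are normally written with a leading // (search the whole document). The sketch below runs them against a small, made-up HTML snippet, so the tag structure and URLs are assumptions used purely for illustration:

    from lxml import etree
    
    # made-up HTML, only to show what each expression returns
    html='''
    <div class="song">
        <ul>
            <li><a href="http://example.com/1">first</a></li>
            <li><a href="http://example.com/2">second</a></li>
        </ul>
    </div>
    <div class="tang">old poems</div>
    '''
    tree=etree.HTML(html)
    
    print(tree.xpath('//div[@class="song"]//li/a/@href'))        # ['http://example.com/1', 'http://example.com/2']
    print(tree.xpath('//div[@class="song"]//li/a/text()'))       # ['first', 'second']
    print(tree.xpath('//div[contains(@class,"ng")]/@class'))     # ['song', 'tang']
    print(tree.xpath('//div[starts-with(@class,"ta")]/@class'))  # ['tang']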

    A small XPath example:

    import requests
    from lxml import etree
    url="https://gz.58.com/ershoufang/?PGTID=0d100000-0000-335c-5dda-1cebcdf9ae5f&ClickID=2"
    user_agent={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
    
    page_text=requests.get(url,headers=user_agent).text
    tree=etree.HTML(page_text)              # the whole page, parsed into an element tree
    
    li_list=tree.xpath('//ul[@class="house-list-wrap"]/li')     # each listing is returned as an element of the list
    
    fp=open("58.csv",'w',encoding='utf-8')
    for li in li_list:
        title=li.xpath("./div[@class='list-info']/h2/a/text()")[0]
        price=li.xpath("./div[@class='price']//text()")
        sum_price=''.join(price)
        fp.write("home:"+title+" price:"+sum_price+'\n')
    
    fp.close()
    print("Data download finished!")

     Handling garbled text (mojibake) on a website:

    import requests,os
    from lxml import etree
    
    url='http://pic.netbian.com/4kmeinv/'
    user_agent={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
    
    page_text=requests.get(url,headers=user_agent).text
    tree=etree.HTML(page_text)
    
    li_list=tree.xpath('//div[@class="slist"]/ul/li')
    
    def getpic(title,photo):
        if not os.path.exists('./photo'):       # create the folder if it does not exist yet
            os.mkdir('./photo')
        fp = open('photo/'+title, 'wb')
        fp.write(photo)
        fp.close()
        return "Current image downloaded"
    
    for li in li_list:
        title=li.xpath('./a/b/text()')[0]+".jpg"
        title=title.encode('iso-8859-1').decode('gbk')  # undo the wrong decoding, then decode with the real codec (gbk)
        print(title)
        p_url=li.xpath('./a/img/@src')[0]
        picture_url='http://pic.netbian.com'+p_url
        photo=requests.get(url=picture_url,headers=user_agent).content
        ret=getpic(title,photo)
        print(ret)
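
    An alternative that avoids the manual encode/decode round-trip is to let requests detect the page encoding before reading .text; this is just a sketch of that option, not what the code above does:

    import requests
    
    response=requests.get('http://pic.netbian.com/4kmeinv/',
                          headers={"User-Agent":"Mozilla/5.0"})
    response.encoding=response.apparent_encoding    # use the detected encoding (likely gbk for this site)
    page_text=response.text                         # now decodes correctly, no manual re-encoding needed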

    Batch-downloading resume templates: another way to handle the encoding problem

    import requests,random,os,time
    from lxml import etree
    
    url='http://sc.chinaz.com/jianli/free.html'
    user_agent={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
    
    response=requests.get(url,headers=user_agent)         # resume list page
    response.encoding='utf-8'                             # fix the garbled text before reading .text
    page_text=response.text
    
    tree=etree.HTML(page_text)
    
    div_list=tree.xpath('//div[@id="container"]/div')
    
    if not os.path.exists('./jl'):
        os.mkdir('./jl')
    for div in div_list:
    
        title=div.xpath('./a/img/@alt')[0]            # resume name
        link=div.xpath('./a/@href')[0]                # resume detail-page URL
    
        detail_page=requests.get(url=link,headers=user_agent).text  # resume detail page
        dpage=etree.HTML(detail_page)
    
        down_list=dpage.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
        down_url=random.choice(down_list)             # pick a download mirror at random
        word=requests.get(url=down_url,headers=user_agent).content
        print("Starting download >> "+title)
        with open('./jl/'+title+'.zip','wb') as fp:   # write the archive and close the file handle
            fp.write(word)
        time.sleep(1)

     Testing a proxy IP: if the page content gets written to disk, the proxy is usable

    import requests
    
    url='https://www.baidu.com/s?wd=ip'
    user_agent={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
    proxy={"https":'112.85.170.79:9999'}
    
    
    page=requests.get(url,headers=user_agent,proxies=proxy).text
    
    with open('./ip.html','w',encoding='utf-8')as f:
        f.write(page)
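
    Writing the page to disk is a rather coarse check. A slightly more robust sketch (the proxy addresses below are placeholders) sets a timeout and treats any exception or non-200 status as an unusable proxy:

    import requests
    
    test_url='https://www.baidu.com/s?wd=ip'
    user_agent={"User-Agent":"Mozilla/5.0"}
    candidates=['112.85.170.79:9999','60.13.42.232:9999']    # placeholder proxy addresses
    
    for p in candidates:
        try:
            resp=requests.get(test_url,headers=user_agent,
                              proxies={"https":p},timeout=5)
            usable=(resp.status_code==200)
        except requests.exceptions.RequestException:
            usable=False
        print(p,"usable" if usable else "unusable")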

     One last pitfall: when a custom User-Agent is set in headers, the server may redirect the request to the HTTPS address, which then fails SSL verification. To keep the redirect from breaking the request, simply turn verification off: page_text=requests.get(url=url,headers=user_agent,verify=False).text
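
    Note that verify=False makes urllib3 emit an InsecureRequestWarning on every request; if the noise bothers you, it can be silenced with urllib3's warning helper (a sketch):

    import requests
    import urllib3
    
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)   # silence the warning caused by verify=False
    
    page_text=requests.get('https://www.baidu.com/s?wd=ip',
                           headers={"User-Agent":"Mozilla/5.0"},
                           verify=False).text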

  • Original post: https://www.cnblogs.com/wen-kang/p/10929473.html