• Python web-scraping tips


    GET requests

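    import requests

    kv = {
        # The cookie value is cut off in the source; only the closing quote is added here.
        'Cookie': 'ccpassport=ec081bd592c086d4057c3442b43b7998; wzwsconfirm=52277a99b139398795c925c264b5cf54; wzwstemplate=OQ==; wzwschallenge=-1;',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'}
    # Intended to set the retry count. Mounting an HTTPAdapter(max_retries=5) on a
    # Session is the documented way; this module-level assignment may have no effect.
    requests.adapters.DEFAULT_RETRIES = 5
    # Intended to disable keep-alive. This sets an attribute on a new Session that is
    # then discarded, so it does not affect the requests.get() call below.
    requests.session().keep_alive = False
    r = requests.get(url, headers=kv, timeout=60)
    r.raise_for_status()                  # raise HTTPError on a 4xx/5xx response
    r.encoding = r.apparent_encoding      # decode r.text with the detected encoding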

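    Set the retry count: requests.adapters.DEFAULT_RETRIES (the documented way is to mount an HTTPAdapter(max_retries=...) on a Session)
    Set the keep-alive state: requests.session().keep_alive = False (this sets the attribute on a throwaway Session, so it does not affect a plain requests.get call)
    Add query parameters: params={}
    Check the HTTP status: r.raise_for_status() raises an HTTPError for 4xx/5xx responses; the status code itself is r.status_code
    Set the encoding used to decode the response: r.encoding
    Get the response body as text: r.text; if the response is JSON, r.json() parses it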
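
    A minimal sketch of one way to get retries and non-persistent connections, assuming that is the goal of the two settings above. The names make_session and fetch_text, the backoff value, and the list of retried status codes are illustrative assumptions, not part of the original post.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries=5, backoff=0.5):
    """Build a Session that retries failed connections and some 5xx responses."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,                 # wait longer between later attempts
        status_forcelist=(500, 502, 503, 504),  # assumed set of retryable statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    # Ask the server to close the connection after each response.
    session.headers["Connection"] = "close"
    return session


def fetch_text(session, url, params=None, timeout=60):
    """GET url and return the decoded body, raising on a 4xx/5xx response."""
    r = session.get(url, params=params, timeout=timeout)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r.text
```

    Headers such as the kv dict can be added with session.headers.update(kv), after which every request made through the session carries them.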

    POST requests

    r = requests.post(url, data={}, headers=kv, timeout=60)

    This works much like a GET request. Note that for a POST request the body is passed in the parameter named data.
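
    As a small illustration of what the data parameter does, the sketch below prepares two POST requests without sending them and prints how the body is encoded. The URL and the payload are placeholders; the json parameter is shown for comparison and is not mentioned in the original post.

```python
import requests

# Hypothetical endpoint, used only to build the requests; nothing is sent.
URL = "https://example.com/api"

# data= with a dict is sent as a URL-encoded form body.
form = requests.Request("POST", URL, data={"q": "python"}).prepare()

# json= serializes the dict and sets a JSON content type.
as_json = requests.Request("POST", URL, json={"q": "python"}).prepare()

print(form.headers["Content-Type"], form.body)
print(as_json.headers["Content-Type"], as_json.body)
```

    Which of the two a site expects depends on the site; the browser's network panel shows the content type of the real request.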

    For more detail on both, see the requests documentation: Quickstart, Advanced Usage, and Developer Interface.

    Taking a website screenshot with Selenium

    # Note: PhantomJS support was removed in Selenium 4, so this snippet needs
    # Selenium 3 and a PhantomJS binary on PATH.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    fire_fox_user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0"
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = fire_fox_user_agent
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    browser.set_page_load_timeout(180)    # seconds
    browser.get(url)
    browser.maximize_window()
    path = 'my.png'                       # save_screenshot writes PNG data
    browser.save_screenshot(path)
    browser.quit()                        # end the session and the PhantomJS process

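    For more usage, see: Python+Selenium WebDriver API:浏览器及元素的常用函数及变量整理总结 (a summary of common browser and element functions and variables) and the Selenium API documentation.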

    Saving a dict to a file as JSON

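    import json

    # `data` is the dict to save (renamed from `dict`, which shadows the built-in).
    # Mode 'a+' appends: a second dump to the same file leaves two JSON documents
    # back to back, which json.load() cannot read; use 'w' to overwrite instead.
    with open('my.json', 'a+', encoding='utf-8') as fp:
        json.dump(data, fp, ensure_ascii=False)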

    In addition, json.dumps() converts a dict to a JSON string.
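
    If the intent of append mode is to collect many records in one file, one common approach is to write one JSON document per line. The sketch below assumes that layout; append_record and read_records are hypothetical helper names, not from the original post.

```python
import json


def append_record(path, record):
    """Append one dict to path as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fp:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Read back every dict written by append_record."""
    with open(path, "r", encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]
```

    Each line stays a complete JSON document, so a crawler that stops halfway still leaves a readable file.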

    Loading a JSON file

    import json

    with open('my.json', 'r', encoding='utf-8') as fp:
        data = json.load(fp)    # named `data` to avoid shadowing the built-in dict

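    In addition, json.loads() converts a JSON string to a dict. The JSON text itself must use double quotes, as in '{"xx":"c","f":"v"}'; a string with single quotes inside, such as "{'xx':'c','f':'v'}", raises an error.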
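
    The quoting rule can be checked directly. This short example uses made-up strings and shows that what matters is the quotes inside the JSON text, not the Python quotes around it.

```python
import json

valid = '{"xx": "c", "f": "v"}'      # JSON text with double-quoted keys and values
invalid = "{'xx': 'c', 'f': 'v'}"    # a Python dict repr, not JSON

print(json.loads(valid)["xx"])

try:
    json.loads(invalid)
except json.JSONDecodeError as exc:
    print("not valid JSON:", exc.msg)

# The outer Python quotes do not matter as long as the JSON text uses double quotes.
escaped = "{\"xx\": \"c\"}"
print(json.loads(escaped) == {"xx": "c"})
```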

    Downloading a file

    import requests

    # r.content holds the whole response body in memory as bytes.
    with open("my.pdf", 'wb') as f:
        f.write(requests.get(url, timeout=60).content)
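
    For large files it may be preferable not to hold the whole body in memory. A minimal sketch using the streaming mode of requests follows; the function name download and the chunk size are illustrative assumptions.

```python
import requests


def download(url, path, chunk_size=8192, timeout=60):
    """Stream url to path in chunks and return the number of bytes written."""
    written = 0
    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()                  # do not save an error page
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written
```

    Checking the status before writing avoids saving a 404 page under the name my.pdf.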

    Logging

    import logging

    logging.basicConfig(filename='crawlLog.log', level=logging.INFO,
                        format="%(asctime)s - %(levelname)s - %(message)s")

    logging.info("info")
    logging.error("error ")

    For more detail, see: Python之日志处理(logging模块) (Logging in Python: the logging module).
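
    In a crawler the most useful log entries are usually the failures. The sketch below, which assumes a named logger and a caller-supplied fetch function (both hypothetical), records the full traceback when a page cannot be fetched, using the same message format as the configuration above.

```python
import logging

logger = logging.getLogger("crawler")    # hypothetical logger name
logger.setLevel(logging.INFO)


def add_handler(stream):
    """Attach a stream handler that uses the timestamp/level/message format."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return handler


def crawl(url, fetch):
    """Call fetch(url); log success, or the full traceback on failure."""
    try:
        body = fetch(url)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("failed to fetch %s", url)
        return None
    logger.info("fetched %s (%d bytes)", url, len(body))
    return body
```

    Passing arguments separately instead of formatting the string first lets logging skip the formatting when the level is disabled.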

    Scheduling

    import time

    import schedule

    def job():
        print("I'm working...")

    schedule.every(10).minutes.do(job)
    schedule.every().hour.do(job)
    schedule.every().day.at("10:30").do(job)
    schedule.every().monday.do(job)
    schedule.every().wednesday.at("13:15").do(job)

    while True:
        schedule.run_pending()
        time.sleep(1)    # avoid a busy loop that pins a CPU core

    For the schedule API, see: https://schedule.readthedocs.io/en/stable/api.html

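    See also, on encryption used against scrapers: 简谈-Python爬虫破解JS加密的Cookie (a brief look at cracking JS-encrypted cookies in a Python scraper)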

  • Original source: https://www.cnblogs.com/Hyacinth-Yuan/p/10179282.html