• Web scraping (Ajax): Douban dynamic pages


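    Tool: Python 3

    Explanation: Ajax is a technique for building fast, dynamic web pages; it lets a page update part of its content without reloading the whole page.

    Goal: scrape a Douban page whose content is loaded via Ajax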

    import urllib.parse
    import urllib.request

    # url comes from the captured GET request (packet capture), not from the address shown in the browser
    url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=80"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
    }
    # form data taken from the WebForms tab in Fiddler
    formdata = {
        "page_limit": "20",
        "page_start": "80",
        "sort": "recommend",
        "tag": "热门",
        "type": "movie",
    }
    data = urllib.parse.urlencode(formdata)
    data = bytes(data, "utf8")
    # note: passing data= makes urllib send this request as a POST
    request = urllib.request.Request(url, data=data, headers=headers)
    response = urllib.request.urlopen(request).read()
    # response = response.decode("utf-8")
    with open("douban.json", "w") as f:
        f.write(str(response))

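    After running the code above and pasting the contents of douban.json into json.cn to parse it, json.cn reports an error.

    This means the file is not valid JSON and could not be parsed: response is a bytes object, so str(response) writes its Python repr (b'...') rather than the JSON text. Try decoding the return value first: response = response.decode("utf-8")

    This produces a file in valid JSON format.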
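The difference between the two writes can be reproduced offline. The sketch below is only an illustration: `raw` is a made-up stand-in for the bytes returned by `urlopen(...).read()`, not real Douban data, and `is_valid_json` is a hypothetical helper that plays the role of the json.cn check.

```python
import json

# Made-up stand-in for the bytes that urlopen(...).read() returns
raw = b'{"title": "example"}'

as_repr = str(raw)             # Python repr of the bytes object, starts with b'
as_text = raw.decode("utf-8")  # the actual JSON text


def is_valid_json(text):
    """Hypothetical helper: True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(as_repr))  # the repr is not JSON
print(is_valid_json(as_text))  # the decoded text is
```

Writing `as_text` to the file is what the decoded version of the script does; `str(response)` on an already decoded string leaves it unchanged.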

    Notice that url already contains all of the data in formdata, so try removing formdata:
    import urllib.request
    
    url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=80"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
               }
    # formdata ={
    #     "page_limit": "20",
    #     "page_start": "80",
    #     "sort": "recommend",
    #     "tag"    : "热门",
    #     "type": "movie"
    # }
    # data = urllib.parse.urlencode(formdata)
    # data = bytes(data, "utf8")
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request).read()
    response = response.decode("utf-8")
    with open("douban.json","w") as f:
        f.write(str(response))

    The result is the same as before!
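A plausible reason the result does not change is that the query string in url already carries the same five fields as formdata. The sketch below checks this with `urllib.parse` alone and makes no network request. It also shows that a `Request` built with `data=` is sent as a POST, while one without it is a GET. `build_url` is a hypothetical helper, not part of the original post, and the assumption that other `page_start` values return further pages has not been verified here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit
from urllib.request import Request

url = ("https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8"
       "&sort=recommend&page_limit=20&page_start=80")
formdata = {
    "page_limit": "20",
    "page_start": "80",
    "sort": "recommend",
    "tag": "热门",
    "type": "movie",
}

# parse_qsl percent-decodes the query string into (key, value) pairs
query_fields = dict(parse_qsl(urlsplit(url).query))
print(query_fields == formdata)

# urllib picks the HTTP method from whether a body is attached
method_with_body = Request(url, data=urlencode(formdata).encode("utf-8")).get_method()
method_without_body = Request(url).get_method()
print(method_with_body, method_without_body)


def build_url(page_start, page_limit=20):
    """Hypothetical helper: rebuild the endpoint URL for a given offset."""
    params = {
        "type": "movie",
        "tag": "热门",
        "sort": "recommend",
        "page_limit": page_limit,
        "page_start": page_start,
    }
    return "https://movie.douban.com/j/search_subjects?" + urlencode(params)


print(build_url(80) == url)
```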

  • Original article: https://www.cnblogs.com/gaoquanquan/p/9102307.html