• Jandan.net Crawler: Reverse-Engineering the JS to Resolve img Paths


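    Images are loaded through a JS onload event; the real path is carried in the img-hash span next to a blank placeholder image: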

    <p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" /><span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>
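    As a quick illustration of where the data lives, the sketch below pulls the img-hash text out of the markup above. It uses the standard library's html.parser rather than the lxml XPath used in the crawler, purely so the example is dependency-free; the class name ImgHashCollector is hypothetical.

```python
from html.parser import HTMLParser

# The markup shown above, copied verbatim.
SNIPPET = (
    '<p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" />'
    '<span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>'
)


class ImgHashCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <span class="img-hash">."""

    def __init__(self):
        super().__init__()
        self.hashes = []
        self._in_hash_span = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "img-hash") in attrs:
            self._in_hash_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_hash_span = False

    def handle_data(self, data):
        if self._in_hash_span:
            self.hashes.append(data.strip())


collector = ImgHashCollector()
collector.feed(SNIPPET)
collector.close()
print(collector.hashes)
```

    The src attribute only points at blank.gif, so the hash span is the one piece of the markup worth collecting.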

    In the browser's Sources panel, locate the corresponding JS function jandan_load_img.

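    Stepping through the JS with the debugger shows that the hash Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc= is passed to the function jdugRtgCtw78dflFjGXBvN6TBHAoKvZ7xu, which runs base64_decode on it to produce the img path (here //wx1.sinaimg.cn/mw600/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg).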
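    The decoding step can be reproduced outside the browser. The following is a minimal sketch using Python's standard base64 module on the sample hash above; the helper name decode_img_hash is hypothetical.

```python
import base64

# Sample value copied from the <span class="img-hash"> element shown earlier.
IMG_HASH = "Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc="


def decode_img_hash(img_hash: str) -> str:
    """Base64-decode an img-hash value into a protocol-relative image path."""
    return base64.b64decode(img_hash).decode("utf-8")


decoded_path = decode_img_hash(IMG_HASH)
print(decoded_path)  # //wx1.sinaimg.cn/mw600/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg
```

    The result has no scheme (it starts with //), so a scheme such as http: has to be prepended before the path can be requested.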

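    A regular expression then replaces the size segment of the img path (mw600 here, matched by mw\d+) with large, which requests the large version of the image.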
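    The substitution can be sketched on the decoded sample path as follows. The helper name to_large is hypothetical, and anchoring the pattern on the surrounding slashes is an assumption of this sketch, made so that only the directory segment can match and never part of the file name.

```python
import re

# Decoded sample path from the img-hash shown earlier.
SAMPLE_PATH = "//wx1.sinaimg.cn/mw600/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg"


def to_large(img_path: str) -> str:
    """Replace the first /mw<digits>/ size segment with /large/."""
    return re.sub(r"/mw\d+/", "/large/", img_path, count=1)


large_url = "http:" + to_large(SAMPLE_PATH)
print(large_url)  # http://wx1.sinaimg.cn/large/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg
```

    A path without an mw size segment is returned unchanged, so the helper is safe to apply to every decoded path.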

    The crawler code is as follows:

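    import base64
    import os
    import re
    from concurrent.futures import ThreadPoolExecutor
    from random import choice

    import requests
    from lxml import etree

    # Local module that defines USER_AGENTS, a list of user-agent strings.
    from user_agent_list import USER_AGENTS

    headers = {'user-agent': choice(USER_AGENTS)}
    SAVE_DIR = 'jiandan'


    def fetch_url(url):
        '''
        :param url: page URL
        :return: HTML text, or None if the request fails
        '''
        try:
            r = requests.get(url, headers=headers, timeout=10)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException as e:
            print(e)
            return None


    def downloadone(url):
        html = fetch_url(url)
        if not html:
            # Skip pages that could not be fetched.
            return
        data = etree.HTML(html)
        img_hash_list = data.xpath('//*[@class="img-hash"]/text()')
        for img_hash in img_hash_list:
            # The hash is the base64-encoded, protocol-relative image path.
            img_path = 'http:' + base64.b64decode(img_hash).decode()
            # Swap the size segment (e.g. mw600) for the large version.
            img_path = re.sub(r'mw\d+', 'large', img_path)
            img_name = img_path.rsplit('/', 1)[1]
            try:
                r = requests.get(img_path, headers=headers, timeout=10)
                r.raise_for_status()
            except requests.RequestException as e:
                print(e)
                continue
            with open(os.path.join(SAVE_DIR, img_name), 'wb') as f:
                f.write(r.content)


    def main():
        # Make sure the output directory exists before any worker writes to it.
        os.makedirs(SAVE_DIR, exist_ok=True)
        url_list = ['http://jandan.net/ooxx/page-{}'.format(page)
                    for page in range(1, 44)]
        with ThreadPoolExecutor(4) as executor:
            # Consume the iterator so exceptions raised in workers surface.
            list(executor.map(downloadone, url_list))


    if __name__ == '__main__':
        main()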
    

      

  • Original article: https://www.cnblogs.com/frank-shen/p/10269363.html