• Getting the JS window object from Python


    Python dependencies

    pip install PyExecJS
    pip install lxml
    pip install beautifulsoup4
    pip install requests

    Node.js dependencies

    Global install commands

    npm install jsdom -g
    or
    yarn global add jsdom

    Once jsdom is installed, the following code runs correctly in Node

    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;

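    With jsdom installed globally, the code above works when run directly in Node. When the same code is run from Python through execjs, however, a global install alone is not enough.
    The globally installed module is not found and the following error is raised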

    execjs._exceptions.ProgramError: Error: Cannot find module 'jsdom'

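    There are two ways to fix this
    1. Install jsdom with npm in the directory from which the Python script is run.
    2. Use the cwd argument to point at the directory that contains the module. For a global install, running npm root -g in cmd prints the global module path, for example: C:\Users\w001\AppData\Roaming\npm\node_modules
    The code can then be written as follows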

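    import execjs
    # 'your_script.js' is a placeholder for the JS file to run
    with open(r'your_script.js', 'r', encoding='utf-8') as f:
        js = f.read()
    # cwd points at the global node_modules so that require("jsdom") resolves
    ct = execjs.compile(js, cwd=r'C:\Users\w001\AppData\Roaming\npm\node_modules')
    print(ct.call('Rohr_Opt.reload', '1'))
    # eval must be called on the compiled context, not on the source string
    print(ct.eval("window.pageData"))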
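    Instead of hard-coding the path, it can be read from npm at run time. The following is a minimal sketch, assuming npm is on the PATH; global_node_modules is a hypothetical helper name, and the run parameter exists only so that the command can be replaced in tests.

```python
import shutil
import subprocess


def global_node_modules(run=subprocess.check_output):
    """Return the path printed by `npm root -g` (hypothetical helper)."""
    # shutil.which also resolves npm.cmd on Windows; fall back to the bare name.
    npm = shutil.which("npm") or "npm"
    output = run([npm, "root", "-g"], universal_newlines=True)
    return output.strip()


# Usage with execjs (assumes Node.js, npm, and PyExecJS are installed):
# import execjs
# ctx = execjs.compile(js_source, cwd=global_node_modules())
```

    The returned string can be passed directly as the cwd argument of execjs.compile.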

    A Python crawler example

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # @Author: Irving Shi
    
    import execjs
    import json
    import requests
    from bs4 import BeautifulSoup
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
    }
    
    
    def get_company(key):
        res = requests.get("https://aiqicha.baidu.com/s?q=" + key, headers=headers)
        soup = BeautifulSoup(res.text, features="lxml")
        tag = soup.find_all("script")[2].decode_contents()
        tag = """const jsdom = require("jsdom");
        const { JSDOM } = jsdom;
        const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
        window = dom.window;
        document = window.document;
        XMLHttpRequest = window.XMLHttpRequest; """ + tag
        js = execjs.compile(tag, cwd=r'C:\Users\Administrator\AppData\Roaming\npm\node_modules')
    
        res = js.eval("window.pageData").get("result").get("resultList")[0]
        return res
    
    
    res = get_company("91360000158304717T")
    # for i in res.items():
    #     print(i)
    
    pid = res.get("pid")
    r = requests.get("https://aiqicha.baidu.com/detail/basicAllDataAjax?pid=" + pid, headers=headers)
    data = json.loads(r.text).get("data").get("basicData")
    for i in data.items():
        print(i)
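    The example above picks the script tag by position (index 2), which breaks if the page adds or reorders scripts. The following is a minimal sketch of selecting the script by content instead, using only the standard library rather than BeautifulSoup. ScriptCollector and find_script are hypothetical names, and it is an assumption that the target script contains the text window.pageData.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as raw text, possibly in several chunks.
        if self._in_script:
            self.scripts[-1] += data


def find_script(html, marker="window.pageData"):
    """Return the body of the first script containing marker, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for body in collector.scripts:
        if marker in body:
            return body
    return None
```

    The result of find_script(res.text) could then take the place of soup.find_all("script")[2].decode_contents() before the jsdom prelude is prepended.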

    Running JS through Python's execjs can raise this error:

    UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 41: illegal multibyte sequence

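    The cause is an encoding mismatch: the output of the Node child process is decoded with the system default codec (gbk here) rather than utf-8. The direct fix used here is to edit subprocess.py and change the default value of the encoding parameter in the __init__ constructor of the Popen class to utf-8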

    Before

        _child_created = False  # Set here since __del__ checks it
    
        def __init__(self, args, bufsize=-1, executable=None,
                     stdin=None, stdout=None, stderr=None,
                     preexec_fn=None, close_fds=_PLATFORM_DEFAULT_CLOSE_FDS,
                     shell=False, cwd=None, env=None, universal_newlines=False,
                     startupinfo=None, creationflags=0,
                     restore_signals=True, start_new_session=False,
                     pass_fds=(), *, encoding=None, errors=None):
            """Create new Popen instance."""
            _cleanup()
            # Held while anything is calling waitpid before returncode has been
            # updated to prevent clobbering returncode if wait() or poll() are
            # called from multiple threads at once.  After acquiring the lock,
            # code must re-check self.returncode to see if another thread just
            # finished a waitpid() call.
            self._waitpid_lock = threading.Lock()

    After

        _child_created = False  # Set here since __del__ checks it
    
        def __init__(self, args, bufsize=-1, executable=None,
                     stdin=None, stdout=None, stderr=None,
                     preexec_fn=None, close_fds=_PLATFORM_DEFAULT_CLOSE_FDS,
                     shell=False, cwd=None, env=None, universal_newlines=False,
                     startupinfo=None, creationflags=0,
                     restore_signals=True, start_new_session=False,
                     pass_fds=(), *, encoding="utf-8", errors=None):
            """Create new Popen instance."""
            _cleanup()
            # Held while anything is calling waitpid before returncode has been
            # updated to prevent clobbering returncode if wait() or poll() are
            # called from multiple threads at once.  After acquiring the lock,
            # code must re-check self.returncode to see if another thread just
            # finished a waitpid() call.
            self._waitpid_lock = threading.Lock()

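    Because this modifies library source code, it is best done in a virtual environment (venv). Note that a virtual environment normally reuses the base interpreter's standard library, so check which subprocess.py is actually being edited.

    pip install virtualenv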
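    An alternative that leaves subprocess.py untouched is to patch Popen from the script itself. The following is a minimal sketch, assuming execjs starts Node through subprocess.Popen in text mode without passing an encoding, which is what the error above suggests. Utf8Popen is a hypothetical helper name, not part of execjs or the standard library.

```python
import subprocess
import sys


class Utf8Popen(subprocess.Popen):
    """Popen variant that decodes text-mode pipes as UTF-8 by default."""

    def __init__(self, *args, **kwargs):
        # Only touch calls that ask for text mode by keyword and do not pick
        # an encoding themselves; binary-mode callers are left unchanged.
        wants_text = kwargs.get("universal_newlines") or kwargs.get("text")
        if wants_text and kwargs.get("encoding") is None:
            kwargs["encoding"] = "utf-8"
        super().__init__(*args, **kwargs)


# Install the patch before importing execjs, so that any reference execjs
# takes to subprocess.Popen points at the UTF-8 variant.
subprocess.Popen = Utf8Popen
# import execjs  # assumption: imported only after the patch above

# Quick self-check: a child process that writes UTF-8 bytes is decoded as UTF-8.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys; sys.stdout.buffer.write('\\u4e2d\\u6587'.encode('utf-8'))"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
)
output, _ = child.communicate()
```

    The patch only lasts for the current process, so other projects that use the same interpreter are not affected.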
  • Original article: https://www.cnblogs.com/shizhengwen/p/14092614.html