• Getting specific HTML source; rich-text editor; crawler-generated DOM


    Getting specific HTML source with Python BeautifulSoup - 吴悟无 - cnblogs https://www.cnblogs.com/vickey-wu/p/6843411.html

    Using the PyQuery library - CSDN blog https://blog.csdn.net/qw_xingzhe/article/details/75175256

    Python crawlers: introduction to and use of the PyQuery library - Jianshu https://www.jianshu.com/p/c07f7cd1b548

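    pyquery is essentially a Python implementation of jQuery and can be used to parse HTML pages, among other things. Its syntax is almost identical to jQuery's, so it feels familiar to anyone who has used jQuery and is easy to pick up.

    In the author's own words:

    “The API is as much as possible the similar to jquery.”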

    import time

    from pyquery import PyQuery as pq
    from selenium import webdriver

    browser = webdriver.Chrome()
    url = 'https://so.gushiwen.org/shiwenv_ee16df5673bc.aspx'
    browser.get(url)

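    # First pass: click every '展开阅读全文 ∨' ("expand to read the full text") link from JavaScript.
    js = "a_=document.getElementsByTagName('a');le=a_.length;for(i=0;i<le;i++){if(a_[i].text=='展开阅读全文 ∨'){try{a_[i].click()}catch(err){console.log(err)}}}"
    try:
        browser.execute_script(js)
    except Exception as e:
        print(e)

    # Count the links that are still collapsed.
    # find_elements_by_link_text is the Selenium 3 API; Selenium 4 replaced it with
    # browser.find_elements(By.LINK_TEXT, ...).
    ck_l_ori_len = len(browser.find_elements_by_link_text('展开阅读全文 ∨'))
    ck_l_ori_ok = 0
    try:
        # Second pass: scroll down step by step (at most 100 steps) and click
        # whatever is still collapsed, until every link has been clicked.
        for isc in range(100):
            if ck_l_ori_ok == ck_l_ori_len:
                break
            time.sleep(1)
            # Scroll to y = 100 * isc. An alternative is to jump straight to the bottom
            # with 'window.scrollTo(0,document.body.scrollHeight)'.
            js = 'window.scrollTo(0,100*{})'.format(isc)
            browser.execute_script(js)
            ck_l = browser.find_elements_by_link_text('展开阅读全文 ∨')
            for i in ck_l:
                try:
                    i.click()
                    ck_l_ori_ok += 1
                except Exception as e:
                    print(e)
    except Exception as e:
        print('window.scrollTo-->', e)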

    # Parse the fully expanded page. Note that .html() returns None when a selector
    # matches nothing, in which case the .replace() calls below raise AttributeError.
    doc = pq(browser.page_source)
    # pyquery may emit an xmlns attribute in the serialized HTML; strip it.
    r_k, r_v = 'xmlns="http://www.w3.org/1999/xhtml"', ''
    article_ = doc('.left>:nth-child(2).sons>.cont>.contson').html().replace(r_k, r_v)
    title_d = {'h1': doc('.left>:nth-child(2).sons>.cont>:nth-child(2)').html().replace(r_k, r_v)}
    author_d = {'h3': doc('.left>:nth-child(2).sons>.cont>:nth-child(3)').text()}
    translation_ = doc('.left>:nth-child(4)>.contyishang>:nth-child(2)').html().replace(r_k, r_v)
    explanation_ = doc('.left>:nth-child(4)>.contyishang>:nth-child(3)').html().replace(r_k, r_v)
    refer_ = doc('.left>:nth-child(4)>.cankao').html().replace(r_k, r_v)

    # Pull the author image URL out of the serialized <img> tag.
    author_img_url = doc('.left>.sonspic>.cont>.divimg>:nth-child(1)').html().split('src="')[-1].split('"')[0]

    # Assemble the HTML to store: title as <h1>, author as <h3>, then the author image.
    k = 'h1'
    v = title_d[k]
    db_html = '<{}>{}</{}>'.format(k, v, k)
    k = 'h3'
    v = author_d[k]
    db_html = '{}<{}>{}</{}>'.format(db_html, k, v, k)
    db_html = '{}{}'.format(db_html, '<br><img src="{}" ><br>'.format(author_img_url))
    # Append the remaining sections, separated by blank lines.
    l = [db_html, article_, explanation_, translation_, refer_]
    db_html = '<br><br>'.join(l)

    # Strip opening <a ...> tags so the stored HTML carries no links.
    # Closing </a> tags are not touched here.
    rp_s_l = ['<a href=', '<a title=']
    for rp_s in rp_s_l:
        while rp_s in db_html:
            p1 = db_html.index(rp_s)         # start of the opening tag
            p2 = db_html.index('>', p1)      # the '>' that ends this tag
            db_html = db_html[:p1] + db_html[p2 + 1:]
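
The loop above removes opening `<a ...>` tags by index arithmetic on the string. As a possible alternative, a regular expression can remove both the opening and the closing anchor tags in one pass while keeping the link text. This is only a sketch: `strip_links` and the sample string are hypothetical names introduced here, and the pattern assumes attribute values never contain a literal `>`.

```python
import re

# Matches an opening <a ...> tag or a closing </a> tag, case-insensitively.
# \b keeps tags such as <abbr> from matching; [^>]* assumes no '>' inside attributes.
A_TAG = re.compile(r'</?a\b[^>]*>', re.IGNORECASE)


def strip_links(html: str) -> str:
    """Remove <a ...> and </a> tags but keep the text between them."""
    return A_TAG.sub('', html)


sample = '<p>See <a href="/x" title="t">the note</a> and <abbr>etc</abbr>.</p>'
cleaned = strip_links(sample)
print(cleaned)
```

For markup that is not under your control, parsing with pyquery and unwrapping the `a` elements is likely more robust than any regular expression; the regex form is shown only because it mirrors the string-based approach used here.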




  • Original post: https://www.cnblogs.com/rsapaper/p/8939770.html