• Crawling CVPR 2018 Papers with Python


    Summary: crawl the following for each CVPR 2018 paper: title, abstract, keywords, and paper link

    1. Creating the database table (MySQL)

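    Note: the length of abstract varies from paper to paper, so the column type should be text; a fixed-length type is a pitfall here because long abstracts will not fit.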
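The table definition itself is not shown in the post, so the following is only a sketch of what it might look like. The table name `lunwens` and the four column names come from the INSERT statement in the script; the `id` column, the VARCHAR lengths, and the character set are assumptions, and `create_table` is a hypothetical helper.

```python
# Hypothetical schema: only the table name and the four column names
# (title, abstract, link, keywords) come from the script's INSERT statement.
# The id column, the VARCHAR lengths and the charset are assumptions.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS lunwens (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(512) NOT NULL,
    abstract TEXT,
    link VARCHAR(512),
    keywords VARCHAR(1024)
) DEFAULT CHARSET = utf8mb4
"""


def create_table(cursor):
    """Run the DDL on any DB-API cursor, for example one from pymysql."""
    cursor.execute(CREATE_TABLE_SQL)
```

With a live connection this could be called as `create_table(db.cursor())` before any rows are inserted.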

    2. Crawling with Python

    import requests
    from bs4 import BeautifulSoup
    import pymysql
    
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}  # Request headers with a browser user-agent
    url = 'http://openaccess.thecvf.com/CVPR2018.py'
    print(url)
    r = requests.get(url, headers=headers)
    content = r.content.decode('utf-8')
    
    soup = BeautifulSoup(content, 'html.parser')
    dts = soup.find_all('dt', class_='ptitle')
    print(dts)
    hts = 'http://openaccess.thecvf.com/'
    # Crawl the data: visit every paper page listed on the index page
    alllist = []
    for i in range(len(dts)):
        print('Processing paper #' + str(i))
        title = dts[i].a.text.strip()
        href = hts + dts[i].a['href']
        r = requests.get(href, headers=headers)
        content = r.content.decode('utf-8')
        soup = BeautifulSoup(content, 'html.parser')
        # print(title,href)
        divabstract = soup.find(name='div', attrs={"id": "abstract"})
        abstract = divabstract.text.strip()
        # print('Paper #'+str(i)+':',abstract)
        alllink = soup.select('a')
        # The fifth <a> on the page is taken to be the PDF link; [6:] strips the leading '../../'
        link = hts + alllink[4]['href'][6:]
        # Use the words of the title, joined by commas, as the keywords
        keywords = ','.join(str(title).split(' '))
        value = (title, abstract, link, keywords)
        alllist.append(value)
    print(alllist)
    tuplist = tuple(alllist)
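    # Save the data
    # Keyword arguments are used because recent PyMySQL releases reject positional ones
    db = pymysql.connect(host="localhost", user="root", password="123456", database="lunwen", charset='utf8')
    cursor = db.cursor()
    sql_cvpr = "INSERT INTO lunwens(title, abstract, link, keywords) values (%s,%s,%s,%s)"
    try:
        cursor.executemany(sql_cvpr, tuplist)
        db.commit()
    except Exception as e:
        # Roll back the whole batch if any insert fails
        print('Insert failed, rolling back:', e)
        db.rollback()
    finally:
        db.close()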
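The script builds the PDF link by slicing six characters off the relative href and derives the keywords by splitting the title on spaces. A minimal sketch of both steps as small standalone helpers is shown below. `resolve_link` and `title_to_keywords` are hypothetical names, the example URLs are made up for illustration, and `urljoin` is swapped in for the manual slicing, so this is an alternative rather than the author's method. Note that `title.split()` collapses repeated whitespace, which the original `split(' ')` does not.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve a relative href such as '../../x.pdf' against the page URL."""
    return urljoin(page_url, href)


def title_to_keywords(title):
    """Join the whitespace-separated words of a title with commas."""
    return ','.join(title.split())


# Hypothetical paper page and relative PDF href, for illustration only
page_url = 'http://openaccess.thecvf.com/content_cvpr_2018/html/Example_CVPR_2018_paper.html'
pdf_href = '../../content_cvpr_2018/papers/Example_CVPR_2018_paper.pdf'

print(resolve_link(page_url, pdf_href))
# http://openaccess.thecvf.com/content_cvpr_2018/papers/Example_CVPR_2018_paper.pdf
print(title_to_keywords('  Deep   Learning for Vision '))
# Deep,Learning,for,Vision
```

Resolving against the page URL avoids depending on the exact number of `../` segments in the href.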
  • Original article: https://www.cnblogs.com/MoooJL/p/12782860.html