• A Crawler Project Based on the Scrapy Framework (Part 1)


    Scrapy is pronounced ['skræpi:].

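    I. References

    1. Official Chinese documentation: https://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

    2. A simple, easy-to-use crawler framework (simplified-scrapy)

    3. Installation and basic usage of the Scrapy crawler framework: https://www.jianshu.com/p/6bc5a4641629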

    II. How to Use simplified-scrapy

    1. Install the simplified-scrapy package

    pip install simplified-scrapy

    2. Write and run a Python file

    from simplified_scrapy.core.spider import Spider


    class ScrapydSpider(Spider):
        name = 'scrapyd-spider'  # Name of the spider
        start_urls = ['http://www.scrapyd.cn/']  # Entry URLs to start from
        # models = ['auto_main','auto_obj']  # Configure extraction models

        def urlFilter(self, url):
            # Crawl filter: only collect tutorial pages
            return url.find('/jiaocheng/') > 0

        # from simplified_scrapy.core.mongo_objstore import MongoObjStore
        # obj_store = MongoObjStore(name,{'host':'127.0.0.1','port':27017})

        # from simplified_scrapy.core.mongo_urlstore import MongoUrlStore
        # url_store = MongoUrlStore(name,{"multiQueue":True})

        # from simplified_scrapy.core.mongo_htmlstore import MongoHtmlStore
        # html_store = MongoHtmlStore(name)

        # Custom data extraction method
        def extract(self, url, html, models, modelNames):
            try:
                html = self.removeScripts(html)  # Strip script content (optional)
                lstA = self.listA(html, url["url"])  # Extract the links on the page
                data = []
                ele = self.getElementByTag("h1", html)  # Title
                if ele:
                    title = ele.text
                    ele = self.getElementByClass("cont", html, "</h1>")  # Body text
                    if ele:
                        content = ele.innerHtml
                        # Author and publication time
                        ele = self.getElementsByTag("span", html, 'class="title-2"', 'class="cont"')
                        author = None
                        time = None
                        if ele and len(ele) > 1:
                            time = ele[0].text
                            author = ele[1].text
                        data.append({"Url": url["url"], "Title": title, "Content": content, "Author": author, "Time": time})

                # Hand the links and data back to the framework for processing
                return [{"Urls": lstA, "Data": data}]
            except Exception as e:
                print(e)


    from simplified_scrapy.simplified_main import SimplifiedMain  # Main entry point
    SimplifiedMain.startThread(ScrapydSpider())  # Start the spider
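
The `urlFilter` method above relies on `str.find`, which returns the index of the first match, or -1 when there is none. The following is a minimal standalone sketch in plain Python (it does not need simplified-scrapy) of which links would pass the `> 0` test. The function name `url_filter` and the sample URLs are made up for illustration.

```python
def url_filter(url):
    """Same test as ScrapydSpider.urlFilter: keep URLs that contain '/jiaocheng/'."""
    # str.find returns -1 when the substring is absent.
    # Note that "> 0" also rejects a match at index 0.
    return url.find('/jiaocheng/') > 0


# Hypothetical sample URLs, not taken from the real site
sample_urls = [
    'http://www.scrapyd.cn/jiaocheng/140.html',
    'http://www.scrapyd.cn/doc/',
    '/jiaocheng/relative.html',
]

kept = [u for u in sample_urls if url_filter(u)]
print(kept)
```

Only the first URL is kept. The relative path is rejected because the match is at index 0, so if relative links should also count, `>= 0` or `'/jiaocheng/' in url` would be the safer test.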

    3. By default, the extracted data is saved in JSON format in a folder named data, located in the same directory as the script.
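
To inspect the results after a run, the saved files can be read back with the standard `json` module. The sketch below is an assumption-heavy illustration: the article only says the output is JSON under `data`, so the file names and the exact layout (one JSON document per file, or one JSON object per line) are guesses, and `load_records` is a hypothetical helper, not part of simplified-scrapy.

```python
import json
import os


def load_records(data_dir="data"):
    """Collect JSON records from every file under data_dir (file layout is assumed)."""
    records = []
    if not os.path.isdir(data_dir):
        return records  # Nothing crawled yet, or a different output location
    for root, _dirs, files in os.walk(data_dir):
        for file_name in sorted(files):
            path = os.path.join(root, file_name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            try:
                # Case 1 (assumed): the whole file is a single JSON document
                parsed = json.loads(text)
                records.extend(parsed if isinstance(parsed, list) else [parsed])
            except ValueError:
                # Case 2 (assumed): one JSON object per line; skip unparsable lines
                for line in text.splitlines():
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        records.append(json.loads(line))
                    except ValueError:
                        pass
    return records


for record in load_records("data"):
    if isinstance(record, dict):
        # "Title" and "Url" are the keys used in the extract method above
        print(record.get("Title"), record.get("Url"))
```

If the framework's actual output format differs, only the parsing branch inside `load_records` should need adjusting.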

  • Original article: https://www.cnblogs.com/StarZhai/p/12120848.html