A Crawler Project Based on the Scrapy Framework (Part 1)

Scrapy is pronounced ['skræpi:].

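I. References

1. Official documentation (Chinese translation): https://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

2. A simple, easy-to-use crawler framework (simplified-scrapy)

3. Installing the Scrapy crawler framework and basic usage: https://www.jianshu.com/p/6bc5a4641629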

II. Using simplified-scrapy

1. Install the simplified-scrapy package

pip install simplified-scrapy
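
Before writing a spider, it can help to confirm that the package is actually importable. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical, and simplified_scrapy is the import name that the spider code in this post uses.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    # Prints False if pip installed into a different Python environment.
    print("simplified_scrapy installed:", is_installed("simplified_scrapy"))
```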

2. Write and run a Python file

from simplified_scrapy.core.spider import Spider

class ScrapydSpider(Spider):
    name = 'scrapyd-spider'  # name of the spider
    start_urls = ['http://www.scrapyd.cn/']  # entry URLs
    # models = ['auto_main','auto_obj']  # extraction models to use

    def urlFilter(self, url):
        # Crawl filter: only collect pages under the tutorial section
        return url.find('/jiaocheng/') > 0

    # from simplified_scrapy.core.mongo_objstore import MongoObjStore
    # obj_store = MongoObjStore(name,{'host':'127.0.0.1','port':27017})

    # from simplified_scrapy.core.mongo_urlstore import MongoUrlStore
    # url_store = MongoUrlStore(name,{"multiQueue":True})

    # from simplified_scrapy.core.mongo_htmlstore import MongoHtmlStore
    # html_store = MongoHtmlStore(name)

    # Custom data extraction method
    def extract(self, url, html, models, modelNames):
        try:
            html = self.removeScripts(html)  # strip script content (optional)
            lstA = self.listA(html, url["url"])  # extract the links on the page
            data = []
            ele = self.getElementByTag("h1", html)  # title
            if ele:
                title = ele.text
                ele = self.getElementByClass("cont", html, "</h1>")  # body
                if ele:
                    content = ele.innerHtml
                    # author and time
                    ele = self.getElementsByTag("span", html, 'class="title-2"', 'class="cont"')
                    author = None
                    time = None
                    if ele and len(ele) > 1:
                        time = ele[0].text
                        author = ele[1].text
                    data.append({"Url": url["url"], "Title": title, "Content": content, "Author": author, "Time": time})

            # Return the links and data to the framework, which handles them
            return [{"Urls": lstA, "Data": data}]
        except Exception as e:
            print(e)

from simplified_scrapy.simplified_main import SimplifiedMain  # main entry point
SimplifiedMain.startThread(ScrapydSpider())  # start the spider
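
The urlFilter method relies on how str.find behaves, which is easy to misread. The sketch below reproduces the same test as a plain function so it can be tried without the framework. The function name url_filter and the sample URLs are hypothetical and only illustrate the logic.

```python
def url_filter(url):
    # Same test as urlFilter in the spider: str.find returns -1 when the
    # substring is missing and its index otherwise, so "> 0" keeps only
    # URLs that contain the substring somewhere after the first character.
    return url.find('/jiaocheng/') > 0

sample_urls = [
    'http://www.scrapyd.cn/jiaocheng/140.html',  # hypothetical tutorial page
    'http://www.scrapyd.cn/doc/',                # hypothetical non-tutorial page
    '/jiaocheng/relative.html',                  # match at index 0 is rejected
]

kept = [u for u in sample_urls if url_filter(u)]
print(kept)
```

For absolute URLs the result is the same as writing '/jiaocheng/' in url. The two differ only when the substring sits at the very start of the string, as in the third sample.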

3. By default, the extracted data is saved in JSON format in a folder named data in the same directory as the script.
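
To inspect the saved results, something like the following sketch could be used. The post does not describe the file names or the exact layout inside the data folder, so this is written under an assumption: each file holds either a single JSON document or one JSON object per line, and anything else is skipped. The helper name load_records is hypothetical; the "Title" and "Url" keys are the ones the spider's extract method writes.

```python
import json
from pathlib import Path

def load_records(data_dir="data"):
    """Collect JSON records from every file under data_dir.

    Assumption: each file is either one JSON document or one JSON
    object per line; content that fits neither form is ignored.
    """
    records = []
    for path in sorted(Path(data_dir).glob("**/*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        try:
            # First try the whole file as a single JSON document.
            parsed = json.loads(text)
            records.extend(parsed if isinstance(parsed, list) else [parsed])
        except json.JSONDecodeError:
            # Fall back to one JSON value per line.
            for line in text.splitlines():
                line = line.strip()
                if not line:
                    continue
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError:
                    pass  # not JSON; ignore this line
    return records

if __name__ == "__main__":
    for record in load_records():
        if isinstance(record, dict):
            print(record.get("Title"), record.get("Url"))
```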

Original post: https://www.cnblogs.com/StarZhai/p/12120848.html