• Python, Week 2 (Day 11): My Python Journey, Master Python Data Mining in One Month! (19) - Scrapy + MongoDB


    MongoDB 3.2 and later use the WiredTiger storage engine by default.

    To switch the storage engine when starting mongod:

      mongod --storageEngine mmapv1 --dbpath d:\data\db

    This fixes the problem of MongoVue not being able to view documents.
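
    If you want to confirm which engine a running mongod is actually using, you can query serverStatus through pymongo. A minimal sketch, assuming a local mongod on the default port 27017:

    import pymongo

    # Ask the server which storage engine it started with.
    client = pymongo.MongoClient("localhost", 27017)
    status = client.admin.command("serverStatus")
    print(status["storageEngine"]["name"])  # e.g. "mmapv1" or "wiredTiger"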

    Project workflow (steps):

    Preparation: install scrapy, pymongo, and MongoDB.

     1. Generate the project skeleton: scrapy startproject stack

     2. Define the item (items.py)

    from scrapy import Item, Field


    class StackItem(Item):
        title = Field()
        url = Field()

     3. Create the spider

    from scrapy import Spider
    from scrapy.selector import Selector
    from stack.items import StackItem


    class StackSpider(Spider):
        name = "stack"
        allowed_domains = ["stackoverflow.com"]
        start_urls = [
            "http://stackoverflow.com/questions?pagesize=50&sort=newest",
        ]

        def parse(self, response):
            questions = response.xpath('//div[@class="summary"]/h3')

            for question in questions:
                item = StackItem()
                item['title'] = question.xpath(
                    'a[@class="question-hyperlink"]/text()').extract()[0]
                item['url'] = question.xpath(
                    'a[@class="question-hyperlink"]/@href').extract()[0]
                yield item

     4. Use XPath selectors to extract the data.
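
    Before wiring the XPath expressions into the spider, it can help to try them out on their own. A small sketch using scrapy's Selector on a hypothetical HTML fragment that only mimics the question-list markup:

    from scrapy.selector import Selector

    # Hypothetical fragment resembling one question entry on the listing page.
    html = ('<div class="summary"><h3>'
            '<a class="question-hyperlink" href="/questions/1">Example question</a>'
            '</h3></div>')

    sel = Selector(text=html)
    for q in sel.xpath('//div[@class="summary"]/h3'):
        print(q.xpath('a[@class="question-hyperlink"]/text()').extract()[0])
        print(q.xpath('a[@class="question-hyperlink"]/@href').extract()[0])

    The same expressions can also be tested interactively with scrapy shell <url>.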

     5. Store the data in MongoDB.

      5.1 settings.py

    ITEM_PIPELINES = {
        'stack.pipelines.MongoDBPipeline': 300,
    }

    MONGODB_SERVER = "localhost"
    MONGODB_PORT = 27017
    MONGODB_DB = "stackoverflow"
    MONGODB_COLLECTION = "questions"

      5.2 pipelines.py

    import pymongo

    from scrapy.conf import settings        # legacy API; newer Scrapy exposes settings via crawler.settings
    from scrapy.exceptions import DropItem
    from scrapy import log                  # legacy logging API (pre-Scrapy 1.0 style)


    class MongoDBPipeline(object):
        def __init__(self):
            # Open one MongoDB connection when the pipeline is created
            connection = pymongo.MongoClient(
                settings['MONGODB_SERVER'],
                settings['MONGODB_PORT']
            )
            db = connection[settings['MONGODB_DB']]
            self.collection = db[settings['MONGODB_COLLECTION']]

        def process_item(self, item, spider):
            valid = True
            for data in item:
                # Drop items that have an empty field instead of storing them
                if not item[data]:
                    valid = False
                    raise DropItem("Missing {0}!".format(data))
            if valid:
                # insert() is the older pymongo call; pymongo 3+ prefers insert_one()
                self.collection.insert(dict(item))
                log.msg("Question added to MongoDB database!",
                        level=log.DEBUG, spider=spider)

            return item
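
    Once a crawl has run, you can check what the pipeline wrote by querying the collection directly. A minimal sketch, assuming the settings above (localhost:27017, database "stackoverflow", collection "questions") and a reasonably recent pymongo:

    import pymongo

    client = pymongo.MongoClient("localhost", 27017)
    collection = client["stackoverflow"]["questions"]

    print(collection.count_documents({}))   # number of stored questions (pymongo 3.7+)
    for doc in collection.find().limit(5):
        print(doc["title"], doc["url"])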

     6. Run the spider (main.py)

    from scrapy import cmdline

    cmdline.execute('scrapy crawl stack'.split())
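
    As an alternative to cmdline.execute, the crawl can also be started in-process with CrawlerProcess. A sketch, assuming the script sits in the project root so get_project_settings() can find scrapy.cfg and pick up the MongoDB pipeline:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Run the "stack" spider inside this process.
    process = CrawlerProcess(get_project_settings())
    process.crawl("stack")
    process.start()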

    Result screenshot (image omitted).
  • Original post: https://www.cnblogs.com/yugengde/p/7282699.html