• Python Week 2 (Day 11) - My Python Journey: Master Python Data Mining in One Month! (19) - Scrapy + Mongo


    MongoDB 3.2 and later use the WiredTiger storage engine by default.

    To change the storage engine at startup:

      mongod --storageEngine mmapv1 --dbpath d:\data\db

    This fixes the problem of MongoVUE being unable to view documents (MongoVUE predates WiredTiger and cannot read its data files).
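
    To confirm which engine is actually active, you can ask the mongo shell (output abbreviated):

      > db.serverStatus().storageEngine
      { "name" : "mmapv1", ... }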

    Project workflow (steps):

    Preparation: install scrapy, pymongo, and MongoDB (a minimal setup sketch follows).
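
    A minimal setup sketch, assuming pip is available and the d:\data\db directory from above exists:

      pip install scrapy pymongo
      mongod --storageEngine mmapv1 --dbpath d:\data\db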

     1. Generate the project directory: scrapy startproject stack
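
    This creates a project skeleton like the following:

      stack/
          scrapy.cfg
          stack/
              __init__.py
              items.py
              pipelines.py
              settings.py
              spiders/
                  __init__.py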

     2. items.py

    from scrapy import Item, Field


    class StackItem(Item):
        # One scraped question: its title and its link
        title = Field()
        url = Field()

     3. Create the spider

    from scrapy import Spider

    from stack.items import StackItem


    class StackSpider(Spider):
        name = "stack"
        allowed_domains = ["stackoverflow.com"]
        start_urls = [
            "http://stackoverflow.com/questions?pagesize=50&sort=newest",
        ]

        def parse(self, response):
            # Each question summary is a <div class="summary"> containing an <h3>
            questions = response.xpath('//div[@class="summary"]/h3')

            for question in questions:
                item = StackItem()
                item['title'] = question.xpath(
                    'a[@class="question-hyperlink"]/text()').extract()[0]
                item['url'] = question.xpath(
                    'a[@class="question-hyperlink"]/@href').extract()[0]
                yield item

     4. Learn to use XPath selectors to extract the data (see the Scrapy shell example below)
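
    You can test the XPath expressions interactively in the Scrapy shell before putting them in the spider; the exact results depend on the live page:

      scrapy shell "http://stackoverflow.com/questions?pagesize=50&sort=newest"
      >>> response.xpath('//div[@class="summary"]/h3/a[@class="question-hyperlink"]/text()').extract()[:3]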

     5. Store the data in MongoDB

      5.1 settings.py

    ITEM_PIPELINES = {
        # 300 is the pipeline's order value; lower numbers run first (range 0-1000)
        'stack.pipelines.MongoDBPipeline': 300,
    }

    MONGODB_SERVER = "localhost"
    MONGODB_PORT = 27017
    MONGODB_DB = "stackoverflow"
    MONGODB_COLLECTION = "questions"

      5.2 pipelines.py

    import pymongo

    from scrapy.conf import settings
    from scrapy.exceptions import DropItem
    from scrapy import log


    class MongoDBPipeline(object):

        def __init__(self):
            connection = pymongo.MongoClient(
                settings['MONGODB_SERVER'],
                settings['MONGODB_PORT']
            )
            db = connection[settings['MONGODB_DB']]
            self.collection = db[settings['MONGODB_COLLECTION']]

        def process_item(self, item, spider):
            # Iterating an Item yields its field names, so look up each value
            for data in item:
                if not item[data]:
                    raise DropItem("Missing {0}!".format(data))
            self.collection.insert(dict(item))
            log.msg("Question added to MongoDB database!",
                    level=log.DEBUG, spider=spider)

            return item
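
    Note that scrapy.conf and scrapy.log only exist in old Scrapy releases. On a current Scrapy (1.1+) with pymongo 3+, roughly the same pipeline can be sketched as below, reading the same settings names defined above:

    import logging

    import pymongo
    from scrapy.exceptions import DropItem

    logger = logging.getLogger(__name__)


    class MongoDBPipeline(object):

        def __init__(self, server, port, db_name, collection_name):
            client = pymongo.MongoClient(server, port)
            self.collection = client[db_name][collection_name]

        @classmethod
        def from_crawler(cls, crawler):
            # Pull the connection details from settings.py
            s = crawler.settings
            return cls(s.get('MONGODB_SERVER'), s.getint('MONGODB_PORT'),
                       s.get('MONGODB_DB'), s.get('MONGODB_COLLECTION'))

        def process_item(self, item, spider):
            for field in item:
                if not item[field]:
                    raise DropItem("Missing {0}!".format(field))
            self.collection.insert_one(dict(item))
            logger.debug("Question added to MongoDB database!")
            return item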

     6. Launch the spider with main.py

    from scrapy import cmdline

    # Equivalent to running "scrapy crawl stack" from the project root
    cmdline.execute('scrapy crawl stack'.split())
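
    After the crawl finishes, the stored documents can be checked in the mongo shell (database and collection names come from settings.py; your document contents will differ):

      > use stackoverflow
      > db.questions.findOne()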

    Result screenshot (image not preserved)

  • Original article: https://www.cnblogs.com/yugengde/p/7282699.html