This tutorial uses MySQL, which is fairly friendly to beginners.
There are of course other databases:
- Redis, MongoDB (non-relational / NoSQL databases)
- InfluxDB (a time-series database), usually used in monitoring stacks; the single-node edition is free and worth a look.
Enough preamble, on to the main topic.
1. First, create the Scrapy project
scrapy startproject dangdang
2. Generate a spider; the available templates include basic and crawl, and basic is used here
scrapy genspider -t basic dd dangdang.com
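If you are not sure which templates exist, scrapy genspider -l lists them; on a typical install the output looks roughly like this:

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed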
3. Get familiar with the project layout
items.py defines the item container; its fields are filled in dd.py and then handed to pipelines.py for processing
settings.py configures project-wide options, such as the user-agent, pipelines, and so on
middlewares.py holds the middlewares
The usual editing order is items -> spider (dd.py) -> pipelines -> settings (a matter of personal habit, really)
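For reference, after the two commands above the project directory looks roughly like this (minor details vary between Scrapy versions):

dangdang/
    scrapy.cfg
    dangdang/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dd.py    # generated by the genspider command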
items.py
import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()    # book title
    link = scrapy.Field()     # link to the product page
    comment = scrapy.Field()  # comment count shown in the search results
dd.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from dangdang.items import DangdangItem


class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    # start_urls = ['http://dangdang.com/']
    ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}

    def start_requests(self):
        # Fetch the first page of search results for the keyword "python"
        return [Request('http://search.dangdang.com/?key=python&act=input&show=big&page_index=1#J_tab',
                        headers=self.ua, callback=self.parse)]

    def parse(self, response):
        item = DangdangItem()
        # The three fields are parallel lists; the i-th entry of each belongs to the same book
        item['title'] = response.xpath("//a[@class='pic']/@title").extract()
        item['link'] = response.xpath("//a[@class='pic']/@href").extract()
        item['comment'] = response.xpath("//a[@dd_name='单品评论']/text()").extract()
        yield item
        # Queue up pages 2-32; Scrapy's duplicate filter drops the repeats yielded again on later pages
        for i in range(2, 33):
            url = 'http://search.dangdang.com/?key=python&act=input&show=big&page_index=' + str(i) + '#J_tab'
            yield Request(url, callback=self.parse, headers=self.ua)
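Before running the whole project, it can be worth confirming the XPath expressions interactively with scrapy shell; the selectors below are the same ones used in parse, and if Dangdang has changed its page markup since this was written they may simply come back empty:

scrapy shell "http://search.dangdang.com/?key=python&act=input&show=big&page_index=1#J_tab"
>>> response.xpath("//a[@class='pic']/@title").extract()[:3]
>>> len(response.xpath("//a[@dd_name='单品评论']/text()").extract())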
pipelines.py
import pymysql


class DangdangPipeline(object):
    def process_item(self, item, spider):
        # One connection per scraped page; keyword arguments keep this compatible
        # with newer pymysql versions as well
        con = pymysql.connect(host='127.0.0.1', user='root', password='123',
                              database='dangdang', charset='utf8')
        cursor = con.cursor()
        # title/link/comment are parallel lists, so write one row per index
        for i in range(len(item['title'])):
            title = item['title'][i]
            link = item['link'][i]
            comment = item['comment'][i]
            sql = """INSERT INTO books(title, link, comment) VALUES (%s, %s, %s)"""
            cursor.execute(sql, (title, link, comment))
        con.commit()
        con.close()
        return item
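The pipeline assumes the dangdang database and a books table already exist. If you have not created them yet, a minimal sketch with pymysql could look like the following; the column types and lengths are just a guess, so adjust them to your data:

import pymysql

# Connect without selecting a database, then create the database and the books table.
# Connection parameters mirror the ones used in DangdangPipeline.
con = pymysql.connect(host='127.0.0.1', user='root', password='123', charset='utf8')
cursor = con.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS dangdang DEFAULT CHARACTER SET utf8")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS dangdang.books (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        link VARCHAR(255),
        comment VARCHAR(64)
    ) DEFAULT CHARSET = utf8
""")
con.close()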
settings.py
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,  # uncomment this block
}
USER_AGENT = 'xxxxxxxxxx'  # set your own User-Agent here
ROBOTSTXT_OBEY = False  # change from True to False
4. Run it
Run the dd spider on its own (append --nolog if you want to suppress the log output and keep the console clean):
scrapy crawl dd
The results as seen in the database:
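If you would rather check from code than from a GUI client, a quick query with pymysql (same connection parameters as the pipeline) is enough to spot-check the inserted rows:

import pymysql

con = pymysql.connect(host='127.0.0.1', user='root', password='123',
                      database='dangdang', charset='utf8')
cursor = con.cursor()
cursor.execute("SELECT COUNT(*) FROM books")
print("rows inserted:", cursor.fetchone()[0])
cursor.execute("SELECT title, comment FROM books LIMIT 5")
for title, comment in cursor.fetchall():
    print(title, comment)
con.close()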