Exporting Nested JSON with Scrapy
How can Scrapy export JSON with a structure like the following?
[
{
"pingPai": ["ALPINA"],
"carTypes": [{
"carType": ["ALPINA"],
"carNames": {
"carName": ["ALPINA B4",
"ALPINA B3",
"ALPINA D5",
"ALPINA B7",
"ALPINA XD3"]
}
}],
"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M05/AB/2E/100x100_f40_autohomecar__wKgHHls8hiKADrqGAABK67H4HUI503.png"
},
{
"pingPai": ["ABT"],
"carTypes": [{
"carType": ["ABT"],
"carNames": {
"carName": ["ABT A3",
"ABT A5",
"ABT TT"]
}
}],
"picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M07/B0/47/100x100_f40_autohomecar__wKgHPls9vLOAHILAAAAWGGhA_W0282.png"
}
]
Core idea
Exporting nested JSON from Scrapy comes down to manipulating lists.
Background knowledge
- To concatenate several Python lists, just use "+":
list1 = [1,2,3]
list2 = [5,6,7]
print(list1+list2)
# Output:
[1, 2, 3, 5, 6, 7]
- To merge several Python dicts, use "update()" (other ways exist; only update is shown here, and it is not actually used in this article):
dic1 = {'a':'1','b':'2'}
dic2 = {'c':'3','d':'4'}
dic1.update(dic2)
print(dic1)
# Output:
{'a': '1', 'b': '2', 'c': '3', 'd': '4'}
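- The result of an XPath match is actually a list
For example, when XPath matches a single row the result is "[X]", where X is the matched value
When XPath matches several rows the result is "[X, Y, Z, ...]"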
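To make the last point concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (an assumption, chosen so the snippet runs without Scrapy installed), and the markup is a made-up fragment. Its findall() likewise returns a list whether one node matches, several match, or none do.

```python
import xml.etree.ElementTree as ET

# Made-up fragment, loosely shaped like one brand block with its models.
SNIPPET = """
<dl>
  <dt><a>ABT</a></dt>
  <dd><ul><li><a>ABT A3</a></li><li><a>ABT A5</a></li><li><a>ABT TT</a></li></ul></dd>
</dl>
"""

root = ET.fromstring(SNIPPET.strip())

# A single match still comes back as a one-element list.
brand = [a.text for a in root.findall("./dt/a")]
# Several matches come back as a longer list.
models = [a.text for a in root.findall("./dd/ul/li/a")]
# No match gives an empty list rather than None.
missing = [a.text for a in root.findall("./dd/span")]

print(brand)
print(models)
print(missing)
```

Because every result is already a list, it can be stored as a JSON array as-is, which is why fields such as "pingPai" appear as one-element arrays in the target structure.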
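Solution
Looking at the nested JSON above, the first-level nodes are the three fields "pingPai", "carTypes" and "picUrl". Because a Scrapy item declares only its top-level fields in items.py, these three are all we need to define.
Open items.py and add the following code: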
import scrapy

class CarModelItem(scrapy.Item):
    pingPai = scrapy.Field()  # brand
    carTypes = scrapy.Field()  # vehicle lines (a list of dicts)
    picUrl = scrapy.Field()  # brand logo URL
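Note
You can also skip items.py and serialize with json.dumps directly in pipelines.py. (Serialization converts an object into a sequence of bytes; deserialization restores the object from that sequence. json.dumps converts a Python data structure into a JSON string, and json.loads converts a JSON string back into a Python data structure.) The principle is the same, only with an explicit serialization step, so this article does not cover it.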
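As a quick illustration of the json.dumps / json.loads pair mentioned in the note, here is a minimal round trip on a record shaped like the target JSON. The picture URL is a placeholder, not a real logo address.

```python
import json

# A trimmed-down record in the same shape as the target JSON.
record = {
    "pingPai": ["ABT"],
    "carTypes": [
        {"carType": ["ABT"], "carNames": {"carName": ["ABT A3", "ABT A5", "ABT TT"]}}
    ],
    "picUrl": "https://example.com/abt.png",  # placeholder URL
}

# Serialize: Python dicts and lists map directly onto JSON objects and arrays.
# ensure_ascii=False keeps non-ASCII text (e.g. Chinese brand names) readable.
text = json.dumps(record, ensure_ascii=False, indent=2)

# Deserialize: the nested structure comes back unchanged.
restored = json.loads(text)

print(text)
print(restored == record)
```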
To produce nested JSON, we only need to nest more data inside the carTypes list, that is, make it a list whose elements are dicts ("{}").
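For example, suppose we want to crawl the page at "https://www.autohome.com.cn/grade/carhtml/A.html".
In the page source, the image and the first-level node are inside "dl/dt".
The second-level and third-level nodes are inside "dl/dd".
This is the tricky part: the structure needed here is generally a list that contains further lists (with dicts as values).
Here is the code: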
from urllib.parse import urljoin

import scrapy

from ..items import CarModelItem  # assumes the default Scrapy project layout


class GetcarmodelSpider(scrapy.Spider):
    name = "GetCarModel"
    allowed_domains = ["www.autohome.com.cn"]
    chars = [
        "A",
        # Uncomment the letters below to crawl the remaining pages:
        # "B", "C", "D", "F", "G", "H", "J", "K", "L", "M", "N",
        # "O", "P", "Q", "R", "S", "T", "W", "X", "Y", "Z",
    ]
    start_urls = [
        "https://www.autohome.com.cn/grade/carhtml/%s.html" % i2 for i2 in chars
    ]

    def parse(self, response):
        dtArray = response.xpath("//dl[@id]")
        for dt in dtArray:
            pingPai = dt.xpath("./dt/div/a/text()").extract()
            pingPaiPicArr = dt.xpath("./dt/a/img/@src").extract()
            pingPaiPic = ""
            # There is actually only one image per brand
            for cti in pingPaiPicArr:
                # carTypeImg = "http://" + cti[2:]
                pingPaiPic = urljoin(response.url, cti)
            carTypesTemp = dt.xpath("./dd/div[@class='h3-tit']")
            carTypes = []
            for pp in carTypesTemp:
                print(">>>>>>>>>>>>>>>>>>>>>>", pp.xpath("./a/text()").extract())
                carTypes += [
                    {"carType": pp.xpath("./a/text()").extract(), "carNames": {}}
                ]
            # Collect the model names
            carNameArray = dt.xpath("./dd/ul[@class='rank-list-ul']")
            carNames = []
            for cn in carNameArray:
                # Build a list of dicts so that the X-th entry is simply carNames[X]
                carNames += [{"carName": cn.xpath("./li/h4/a/text()").extract()}]
                print(".......", [{"carName": cn.xpath("./li/h4/a/text()").extract()}])
            # Attach each model list to the vehicle line at the same position
            for i in range(len(carTypes)):
                try:
                    carTypes[i]["carNames"] = carNames[i]
                except IndexError as e:
                    print(e)
            print("pingPai:", pingPai)
            print("pingPaiPic:", pingPaiPic)
            print("carTypes:", carTypes)
            print("carNames:", carNames)
            carModel = CarModelItem()
            carModel["pingPai"] = pingPai
            carModel["carTypes"] = carTypes
            carModel["picUrl"] = pingPaiPic
            yield carModel
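The index-based loop in the spider pairs each vehicle line with the model list at the same position. As a hedged alternative, the same pairing can be written as a small pure function with itertools.zip_longest, which is easy to check outside Scrapy. The function name nest_car_types is hypothetical, and it assumes, as the spider does, that the two lists appear in matching order on the page.

```python
from itertools import zip_longest


def nest_car_types(type_lists, name_lists):
    """Pair each carType list with the carName list at the same position.

    Hypothetical helper: both arguments are lists of lists of strings, in
    page order. A type without a matching name list keeps an empty dict,
    as in the spider; surplus name lists are ignored.
    """
    nested = []
    for car_type, names in zip_longest(type_lists, name_lists):
        if car_type is None:  # more name lists than types
            break
        car_names = {} if names is None else {"carName": names}
        nested.append({"carType": car_type, "carNames": car_names})
    return nested


types = [["一汽-大众奥迪"], ["Audi Sport"], ["奥迪(进口)"]]
names = [["奥迪A3", "奥迪A4L"], ["奥迪RS 3", "奥迪R8"]]
result = nest_car_types(types, names)
print(result)
```

In this example the third vehicle line has no model list, so its "carNames" stays an empty dict instead of raising an error.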
Note: this article does not explain how to export the JSON itself; plenty of guides are available online.
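For completeness, one common route is an item pipeline that collects the items and writes them with json.dump. What follows is only a sketch under stated assumptions: the class name JsonArrayPipeline and the default file name are hypothetical, while open_spider, process_item and close_spider are the hook names Scrapy calls on a pipeline. The class would also need to be registered under ITEM_PIPELINES in settings.py.

```python
import json


class JsonArrayPipeline:
    """Collect every item and write them out as one JSON array."""

    def __init__(self, path="car_models.json"):  # hypothetical file name
        self.path = path
        self.items = []

    def open_spider(self, spider):
        # Start each crawl with an empty buffer.
        self.items = []

    def process_item(self, item, spider):
        # dict() works for scrapy.Item instances and for plain dicts alike.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        with open(self.path, "w", encoding="utf-8") as f:
            # ensure_ascii=False writes Chinese text as-is instead of \uXXXX.
            json.dump(self.items, f, ensure_ascii=False, indent=2)
```

Scrapy's built-in feed export is usually the shorter path: running "scrapy crawl GetCarModel -o cars.json" writes the items as a JSON array (cars.json is an arbitrary file name), and setting FEED_EXPORT_ENCODING = "utf-8" in settings.py keeps Chinese text unescaped.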
Running the code produces an exported JSON file like this:
[{
"pingPai": ["奥迪"],
"carTypes": [{
"carType": ["一汽-大众奥迪"],
"carNames": {
"carName": ["奥迪Q2L新能源",
"奥迪A3",
"奥迪A4L",
"奥迪A6L",
"奥迪Q2L",
"奥迪Q3",
"奥迪Q5L",
"奥迪A6L新能源",
"奥迪Q4",
"奥迪A4",
"奥迪A6",
"奥迪Q5"]
}
},
{
"carType": ["Audi Sport"],
"carNames": {
"carName": ["奥迪RS 3",
"奥迪RS 4",
"奥迪RS 5",
"奥迪RS 6",
"奥迪RS 7",
"奥迪R8",
"奥迪TT RS",
"奥迪RS Q3",
"奥迪RSQ e-tron"]
}
},
{
"carType": ["奥迪(进口)"],
"carNames": {
"carName": ["奥迪e-tron",
"奥迪A3(进口)",
"奥迪S3",
"奥迪A4(进口)",
"奥迪A5",
"奥迪S4",
"奥迪S5",
"奥迪A6(进口)",
"奥迪S6",
"奥迪A7",
"奥迪S7",
"奥迪A8",
"奥迪Q7",
"奥迪Q7新能源",
"奥迪TT",
"奥迪TTS",
"奥迪A0",
"奥迪A1",
"奥迪S1",
"e-tron Concept",
"奥迪AI:ME",
"奥迪A6新能源(进口)",
"奥迪A7新能源",
"奥迪Aicon",
"奥迪e-tron GT",
"Prologue",
"奥迪A8新能源",
"奥迪A9",
"奥迪S8",
"allroad",
"奥迪Q2",
"奥迪SQ2",
"奥迪Q3(进口)",
"奥迪Q4(进口)",
"奥迪Q4新能源(进口)",
"奥迪TT offroad",
"h-tron quattro",
"奥迪Elaine",
"奥迪Q5(进口)",
"奥迪Q5新能源(进口)",
"奥迪SQ5",
"奥迪Q8",
"奥迪SQ7",
"奥迪Q9",
"e-tron Vision Gran Turismo",
"quattro",
"奥迪PB18",
"奥迪R18",
"奥迪Urban",
"奥迪A2",
"奥迪80",
"奥迪A3新能源(进口)",
"奥迪Coupe",
"奥迪100",
"Crosslane Coupe",
"奥迪Cross",
"Nanuk"]
}
}],
"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"
},
{
"pingPai": ["阿斯顿·马丁"],
"carTypes": [{
"carType": ["阿斯顿·马丁"],
"carNames": {
"carName": ["Rapide",
"V8 Vantage",
"Vanquish",
"阿斯顿·马丁DB11",
"阿斯顿·马丁DBS",
"Cygnet",
"Rapide E",
"阿斯顿·马丁DBX",
"V12 Vantage",
"阿斯顿·马丁DB9",
"AM-RB 003",
"Heritage EV",
"Virage",
"Vulcan",
"阿斯顿·马丁CC100",
"阿斯顿·马丁DB10",
"阿斯顿·马丁DB5",
"阿斯顿·马丁DP-100",
"战神",
"拉共达Taraf",
"Ulster",
"V12 Zagato",
"阿斯顿·马丁DB6",
"阿斯顿·马丁One-77"]
}
}],
"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M06/AE/B5/100x100_f40_autohomecar__wKgHEVs9u6GAPWN8AAAYsmBsCWs847.png"
},
{
"pingPai": ["AC Schnitzer"],
"carTypes": [{
"carType": ["AC Schnitzer"],
"carNames": {
"carName": ["AC Schnitzer 3系",
"AC Schnitzer M4",
"AC Schnitzer 7系",
"AC Schnitzer X6",
"AC Schnitzer X5"]
}
}],
"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M01/B0/62/100x100_f40_autohomecar__ChcCQFs9vBKAO3YSAAAW0WOWvRc555.png"
},
{
"pingPai": ["安凯客车"],
"carTypes": [{
"carType": ["安凯客车"],
"carNames": {
"carName": ["宝斯通"]
}
}],
"picUrl": "https://car2.autoimg.cn/cardfs/series/g29/M00/AB/C8/100x100_f40_autohomecar__ChcCSFs8riCAYVA2AAApQLgf8a0969.png"
},
{
"pingPai": ["阿尔法·罗密欧"],
"carTypes": [{
"carType": ["阿尔法·罗密欧"],
"carNames": {
"carName": ["Giulia",
"Stelvio",
"MiTo",
"Giulietta",
"Tonale",
"ALFA 4C",
"Disco Volante",
"Gloria",
"ALFA 147",
"ALFA 156",
"ALFA 159",
"ALFA 166",
"ALFA 2uettottanta",
"ALFA 8C",
"ALFA GT",
"ALFA S.Z.",
"ALFA TZ3"]
}
}],
"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M05/B0/29/100x100_f40_autohomecar__ChcCP1s9u5qAemANAABON_GMdvI451.png"
},
{
"pingPai": ["ALPINA"],
"carTypes": [{
"carType": ["ALPINA"],
"carNames": {
"carName": ["ALPINA B4",
"ALPINA B3",
"ALPINA D5",
"ALPINA B7",
"ALPINA XD3"]
}
}],
"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M05/AB/2E/100x100_f40_autohomecar__wKgHHls8hiKADrqGAABK67H4HUI503.png"
},
{
"pingPai": ["ABT"],
"carTypes": [{
"carType": ["ABT"],
"carNames": {
"carName": ["ABT A3",
"ABT A5",
"ABT TT"]
}
}],
"picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M07/B0/47/100x100_f40_autohomecar__wKgHPls9vLOAHILAAAAWGGhA_W0282.png"
},
{
"pingPai": ["AEV ROBOTICS"],
"carTypes": [{
"carType": ["AEV ROBOTICS"],
"carNames": {
"carName": ["Modular Vehicle System"]
}
}],
"picUrl": "https://car2.autoimg.cn/cardfs/series/g3/M02/58/D3/autohomecar__ChcCRVw0TJaAM8BmAAAS-7AD7DQ372.png"
},
{
"pingPai": ["Agile Automotive"],
"carTypes": [{
"carType": ["Agile Automotive"],
"carNames": {
"carName": ["Agile Automotive SC122",
"Agile Automotive SCX"]
}
}],
"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M09/AF/8C/100x100_f40_autohomecar__wKgHHVs9r62AIbiYAAAvAsqdpoA594.png"
},
{
"pingPai": ["Apollo"],
"carTypes": [{
"carType": ["Apollo"],
"carNames": {
"carName": ["Apollo N",
"Arrow",
"Intensa Emozione"]
}
}],
"picUrl": "https://car3.autoimg.cn/cardfs/series/g28/M06/B0/C6/100x100_f40_autohomecar__ChcCR1s90RGASBRgAACz67wh_68723.png"
},
{
"pingPai": ["Arash"],
"carTypes": [{
"carType": ["Arash"],
"carNames": {
"carName": ["AF8 Cassini",
"Arash AF10"]
}
}],
"picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M05/AA/D4/100x100_f40_autohomecar__wKgHHFs8n1CAVhcNAAAV3xEAiDM531.png"
},
{
"pingPai": ["ARCFOX"],
"carTypes": [{
"carType": ["北汽新能源"],
"carNames": {
"carName": ["ARCFOX-1",
"ARCFOX ECF Concept",
"ARCFOX-7",
"ARCFOX-GT"]
}
}],
"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M02/AB/F7/100x100_f40_autohomecar__ChcCQFs8nA6AP-h5AABsvxhHw3E709.png"
},
{
"pingPai": ["Aria"],
"carTypes": [{
"carType": ["Aria"],
"carNames": {
"carName": ["Aria FXE"]
}
}],
"picUrl": "https://car3.autoimg.cn/cardfs/series/g28/M0B/B0/0D/100x100_f40_autohomecar__wKgHI1s9r2iAJwIXAAAIBShzq60456.png"
},
{
"pingPai": ["ATS"],
"carTypes": [{
"carType": ["ATS"],
"carNames": {
"carName": ["ATS GT"]
}
}],
"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M08/D7/D3/autohomecar__ChsEe1wYwKmAY2p9AAA1NP0jCHk594.png"
},
{
"pingPai": ["Aurus"],
"carTypes": [{
"carType": ["Aurus"],
"carNames": {
"carName": ["Senat"]
}
}],
"picUrl": "https://car2.autoimg.cn/cardfs/series/g27/M07/F3/E1/autohomecar__ChcCQFuN6WiAcztKAAAsLfBmU9g074.png"
},
{
"pingPai": ["艾康尼克"],
"carTypes": [{
"carType": ["艾康尼克ICONIQ Motors"],
"carNames": {
"carName": ["MUSE",
"艾康尼克七系"]
}
}],
"picUrl": "https://car2.autoimg.cn/cardfs/series/g29/M0A/A9/EC/100x100_f40_autohomecar__wKgHG1s8iP6ASbjTAAAOIwskkzo314.png"
},
{
"pingPai": ["爱驰"],
"carTypes": [{
"carType": ["爱驰汽车"],
"carNames": {
"carName": ["爱驰U5",
"爱驰U7",
"RG Nathalie"]
}
}],
"picUrl": "https://car3.autoimg.cn/cardfs/series/g29/M09/A9/9B/100x100_f40_autohomecar__wKgHG1s8fwqAOp3IAAALEeTkn6c536.png"
}]