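Requirements:
Project description: the client's requirements are summarized as follows.
Case 1
Raw data:
xx市白云区新市新街新巷16号
Output (split directly):
xx市,白云区,新市新街新巷16号
Case 2
Raw data:
大沙地沙边街
Output:
xx市,黄埔区,大沙地沙边街
Additional requirement: write the raw data to the first column.
The data provided is as follows: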
xx市白云区新市新街新巷16号
xx市花都区狮岭镇岭南工业园合和东路10号
xx市增城区派潭镇大埔村牛角塘一巷
xx市白云区同德街道同嘉路诚德大厦
荔湾区南岸铁路边7号顺景楼
xx市天河区车陂街道车陂高地大街
大沙地沙边街
寺右一马路96号201房
xx市海珠区龙凤街道革新路80号
xx市增城区新塘镇沙埔镇港口村
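Case 1 needs no lookup at all, because the district is already in the string. A minimal sketch of that split, assuming every district name ends in 区 and the city (when present) ends in 市; the name `split_address`, the pattern, and the `default_city` parameter are illustrative and not part of the original script:

```python
import re

# Assumed pattern: an optional city ending in 市, then a district ending in 区,
# then the rest of the address.
FULL_ADDRESS = re.compile(r"^(?P<city>[^市]+市)?(?P<district>[^区]+区)(?P<rest>.+)$")


def split_address(raw, default_city="xx市"):
    """Return (city, district, rest), or None when no district is present."""
    m = FULL_ADDRESS.match(raw)
    if m is None:
        return None
    city = m.group("city") or default_city
    return (city, m.group("district"), m.group("rest"))


print(split_address("xx市白云区新市新街新巷16号"))
print(split_address("荔湾区南岸铁路边7号顺景楼"))
print(split_address("大沙地沙边街"))  # no district, so this one needs a lookup
```

Entries for which this returns None are the Case 2 addresses that have to be resolved some other way.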
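Look the addresses up through the API:
http://api.map.baidu.com/place/v2/search?q=%s&region=xx市&output=json&ak=vCx0pfB4y3UNeno7INcCi5wCSv4Gqaij
Some entries contain only a street name, so the district has to be looked up from the street. The city is not a concern, since all of the addresses are in Guangdong.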
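Building the query string from a dictionary avoids hand-typed separators and percent-encodes the Chinese text. The following is only a sketch using the standard library: `build_search_url` is a hypothetical helper, `YOUR_AK` is a placeholder for a real key, and the parameter names are the ones that appear in the URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://api.map.baidu.com/place/v2/search"


def build_search_url(query, region="xx市", ak="YOUR_AK"):
    """Return the search URL with every parameter percent-encoded as UTF-8."""
    params = {"q": query, "region": region, "output": "json", "ak": ak}
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("大沙地沙边街")
print(url)
# Decoding the query string again gives back the original street name.
print(parse_qs(urlsplit(url).query)["q"])
```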
Start with a function that tries the lookup:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# by i3ekr

import requests

API_URL = "http://api.map.baidu.com/place/v2/search?q=%s&region=xx市&output=json&ak=vCx0pfB4y3UNeno7INcCi5wCSv4Gqaij"

success_list = []


def shell(values):
    # Query the API and decode the JSON body.
    json_data = requests.get(API_URL % values).json()
    print(json_data)
    results = json_data.get('results', [])
    for n in range(len(results)):
        try:
            c2 = results[n]['area']  # district, e.g. 白云区
        except KeyError:
            continue  # this entry has no district, try the next one
        c1 = 'xx市'
        c3 = values
        # Strip the city and district so only the street part remains.
        if c1 in c3:
            c3 = c3.replace(c1, "")
        if c2 in c3:
            c3 = c3.replace(c2, "")
        success_list.append(c1 + "," + c2 + "," + c3)
        print(c2)
        break
    else:
        # No entry carried a district.
        print("error")
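At first I assumed the JSON layout was fixed and read the district from a hard-coded index:
c2 = json_data['results'][1]['area']
It turned out that area is not always present in that entry. So I take the length of results instead and loop over the entries with a try, keeping the first area that is found and breaking out of the loop.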
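The "first entry that has a district" rule can be tried without any network access. In this sketch `first_area` is a hypothetical helper, and `sample` is a made-up response that only mimics the `results` and `area` fields used above; it is not real API output.

```python
def first_area(json_data):
    """Return the first non-empty 'area' value in json_data['results'], or None."""
    for result in json_data.get("results", []):
        area = result.get("area")
        if area:
            return area
    return None


# Made-up response for illustration only: the first entry has no 'area'.
sample = {
    "status": 0,
    "results": [
        {"name": "大沙地"},
        {"name": "沙边街", "area": "黄埔区"},
    ],
}

print(first_area(sample))
print(first_area({"status": 0, "results": []}))
```

Iterating over `json_data["results"]` itself, rather than over the length of the top-level dictionary, is what keeps the loop bounded by the number of entries actually returned.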
The last step is to wrap this function up and call it from the rest of the script.
The final code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# by i3ekr
# api_1 = vCx0pfB4y3UNeno7INcCi5wCSv4Gqaij
# api_2 = i1tGx6jjU3qFkeylf3S7ejBAoiQ6o91B
import time

import requests

API_URL = "http://api.map.baidu.com/place/v2/search?q=%s&region=广州市&output=json&ak=vCx0pfB4y3UNeno7INcCi5wCSv4Gqaij"

fail_list = []
success_list = []


def guolv(values):
    """Look up the district for values; return True if one was found."""
    try:
        results = requests.get(API_URL % values).json()['results']
    except Exception:
        return False
    for n in range(len(results)):
        c2 = results[n].get('area')
        if not c2:
            continue  # this entry has no district, try the next one
        c1 = '广州市'
        c3 = values
        # Strip the city and district so only the street part remains.
        if c1 in c3:
            c3 = c3.replace(c1, "")
        if c2 in c3:
            c3 = c3.replace(c2, "")
        success_list.append(c1 + "," + c2 + "," + c3)
        return True
    return False


def address(values):
    """Look up the full string and record it as failed if nothing matches."""
    if not guolv(values):
        fail_list.append(values)


def shell(values):
    if "广州市" in values and "区" in values:
        # City and district are already present: just insert the separators.
        data = values.replace('广州市', '广州市,', 1)
        success_list.append(data.replace('区', '区,', 1))
    elif "街" in values:
        # Only a street is given: query by the street name first,
        # then fall back to the full string.
        jiedao_left = values.split('街')[0] + "街"
        if not guolv(jiedao_left):
            address(values)
    else:
        address(values)


if __name__ == "__main__":
    with open("data.txt", "r", encoding="utf-8") as f:
        lines = f.readlines()
    now_time = time.time()
    for i in lines:
        data = i.strip()
        if not data:
            continue
        print("[+] Testing: %s" % data)
        shell(data)

    print("success %s" % len(success_list))
    print("fail %s" % len(fail_list))
    print("----------")
    print("Total time: %s" % (time.time() - now_time))

    with open('success.txt', 'a+', encoding="utf-8") as f:
        for i in success_list:
            f.write(i + "\n")
        f.write("[-] Failed entries below ----------------------------------------------------\n")
        for x in fail_list:
            f.write("[-]" + x + "\n")
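The additional requirement asks for the raw data in the first column, which the output step above does not write. One possible way to add it, sketched with the standard `csv` module: `write_rows` is a hypothetical helper, and it assumes each result is kept as a (raw, normalized) pair, which would mean storing the original string alongside each entry in `success_list`.

```python
import csv
import io


def write_rows(pairs, out):
    """Write (raw, normalized) pairs as rows of raw, city, district, street."""
    writer = csv.writer(out, lineterminator="\n")
    for raw, normalized in pairs:
        # Split on the first two commas only, so the street part stays whole.
        writer.writerow([raw] + normalized.split(",", 2))


buf = io.StringIO()
write_rows(
    [
        ("大沙地沙边街", "xx市,黄埔区,大沙地沙边街"),
        ("xx市白云区新市新街新巷16号", "xx市,白云区,新市新街新巷16号"),
    ],
    buf,
)
print(buf.getvalue(), end="")
```

Passing a file opened with `newline=""` and `encoding="utf-8"` instead of the `StringIO` buffer would write the same rows to disk.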