1. Task
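This task builds on the previous epidemic map assignment: data is scraped from a website in real time so that the database, and in turn the page, stays up to date. I originally planned to write the crawler in Java, but I did not know how. According to the material I found online, a Java version needs an entity class, code that fetches part of the HTML page, and a database connection. Specifically, it comes down to these three steps: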
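(1) Set up the URL, URLConnection, and BufferedReader
(2) Define a regular expression and use it to parse the data stream that was read
(3) Store the data that matches into a list and into the database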
Reference blog post: https://blog.csdn.net/wjf_1997/article/details/78245702
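For comparison, the three steps above can be sketched in Python as well. This is only a rough illustration, not the code from the referenced post: the `ROW_PATTERN` regular expression, the `data-city` markup, and the sample HTML string with its numbers are all made up for the example, and the database part of step (3) is left out.

```python
import re
from urllib.request import urlopen

# Hypothetical pattern: assumes each record appears in the page as
# <li data-city="NAME">COUNT</li>. Adjust it to the real markup.
ROW_PATTERN = re.compile(r'<li data-city="([^"]+)">(\d+)</li>')


def fetch_html(url, timeout=10):
    """Step (1): open the URL and read the response body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_rows(html):
    """Steps (2) and (3): apply the regex and collect the matches in a list."""
    return [(city, int(count)) for city, count in ROW_PATTERN.findall(html)]


# Made-up sample input so the parsing step can be tried without a network call.
sample = '<ul><li data-city="Wuhan">120</li><li data-city="Xiaogan">35</li></ul>'
print(parse_rows(sample))  # [('Wuhan', 120), ('Xiaogan', 35)]
```

Regex-based parsing like this breaks as soon as the page markup changes, which is one more reason to prefer a source that already serves structured data.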
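However, the Java approach is too inefficient and also requires writing a DAO layer, so I went with Python instead, which is simple and easy to work with.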
Without further ado, here is the code:
import json

import pymysql
import requests


def connect():
    # Connect to the database. PyMySQL 1.0+ accepts keyword arguments only.
    return pymysql.connect(host="localhost", user="root", password="0000",
                           database="grabdata_test", charset="utf8")


def create():
    # Drop and recreate the info table
    db = connect()
    cursor = db.cursor()
    cursor.execute("DROP TABLE IF EXISTS info")
    sql = """CREATE TABLE info (
        Id INT PRIMARY KEY AUTO_INCREMENT,
        Date VARCHAR(255),
        Province VARCHAR(255),
        City VARCHAR(255),
        Confirmed_num VARCHAR(255),
        Yisi_num VARCHAR(255),
        Cured_num VARCHAR(255),
        Dead_num VARCHAR(255),
        Code VARCHAR(255))"""
    cursor.execute(sql)
    db.close()


def insert(value):
    # Insert one row; roll back if the statement fails
    db = connect()
    cursor = db.cursor()
    sql = "INSERT INTO info(Date,Province,City,Confirmed_num,Yisi_num,Cured_num,Dead_num,Code) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)"
    try:
        cursor.execute(sql, value)
        db.commit()
        print("Insert succeeded")
    except pymysql.MySQLError:
        db.rollback()
        print("Insert failed")
    db.close()


create()  # create the table

url = 'https://raw.githubusercontent.com/BlankerL/DXY-2019-nCoV-Data/master/json/DXYArea.json'
response = requests.get(url)
# json.load(file) reads from a file object; json.loads(string) parses a string in memory
jsonData = json.loads(response.text)

# Walk the results and insert one row per city
for province in jsonData['results']:
    if province['countryName'] != '中国':  # keep only entries for China
        continue
    provinceShortName = province['provinceName']
    if provinceShortName == '待明确地区':  # skip the "region to be determined" placeholder
        continue
    for city in province.get('cities') or []:
        date = '2020-3-10'
        insert((date, provinceShortName, city['cityName'],
                city['confirmedCount'], city['suspectedCount'],
                city['curedCount'], city['deadCount'], city['locationId']))
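The script above opens a new database connection for every city and hardcodes the date. One possible variation, sketched here under some assumptions, is to flatten the JSON into a list of tuples first and then write them in one batch. The helper names `flatten_results` and `insert_all` are made up for this example, `conn` is assumed to be an already open pymysql connection, the sample record and its numbers are invented, and using today's date instead of the fixed string is my own substitution.

```python
from datetime import date


def flatten_results(results, day):
    """Turn the 'results' list of the JSON into one tuple per city."""
    rows = []
    for province in results:
        if province.get("countryName") != "中国":  # keep only entries for China
            continue
        name = province.get("provinceName")
        if name == "待明确地区":  # skip the "region to be determined" placeholder
            continue
        for city in province.get("cities") or []:
            rows.append((day, name, city["cityName"], city["confirmedCount"],
                         city["suspectedCount"], city["curedCount"],
                         city["deadCount"], city["locationId"]))
    return rows


def insert_all(conn, rows):
    """Write all rows over one connection (assumed: an open pymysql connection)."""
    sql = ("INSERT INTO info(Date,Province,City,Confirmed_num,Yisi_num,"
           "Cured_num,Dead_num,Code) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)")
    with conn.cursor() as cursor:
        cursor.executemany(sql, rows)
    conn.commit()


# Made-up sample record with the same field names as the JSON used above.
sample = [
    {"countryName": "中国", "provinceName": "湖北省", "cities": [
        {"cityName": "武汉", "confirmedCount": 10, "suspectedCount": 0,
         "curedCount": 3, "deadCount": 1, "locationId": 420100}]},
    {"countryName": "日本", "provinceName": "日本", "cities": []},
]
rows = flatten_results(sample, date.today().isoformat())
print(len(rows))  # 1
```

Keeping the flattening step free of database calls also makes it easy to check against a small sample before touching the real table.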
Run result:
The data has been imported into the database.
The epidemic map page when running: