Scraping Tonghuashun (同花顺) Stock Data with Java (Source Code Included)
Environment Setup
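First, the environment the project needs:
IDE: IntelliJ IDEA. Language: Java. Dependencies: JDK 1.8, Maven, Chrome, ChromeDriver.
The approach is to drive a real browser, so both the Chrome browser and the ChromeDriver binary must be installed on your machine. Installing Chrome is not covered here; any standard download will do.
The important part is ChromeDriver: you must install a driver that matches your installed Chrome version, otherwise it will not work.
To check your Chrome version, enter chrome://version in Chrome's address bar.
Then download the driver for that version. The versions should match as closely as possible. For example, my Chrome is 79.0.3945.117 (Official Build) (64-bit), and I downloaded ChromeDriver 79.0.3945.36 (same 79.0.3945 prefix).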
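Download locations for the various ChromeDriver versions:
Taobao mirror: https://npm.taobao.org/mirrors/chromedriver
Google (requires access to sites outside mainland China, so not recommended there): https://sites.google.com/a/chromium.org/chromedriver/downloads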
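The next step is optional. The project starts without it; you only need to change one setting in the code instead.
To set it up, copy the downloaded ChromeDriver binary into /usr/local/bin/:
cp chromedriver /usr/local/bin/
Then check that the install worked:
chromedriver --version
If you skip this step, just remember to change the ChromeDriver path configured in the code to wherever your own ChromeDriver lives. Mine, for example, is: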
System.setProperty(
    "webdriver.chrome.driver",
    "/Users/admin/Documents/selenium/chrome/79.0.3945.36/chromedriver"
);
Remember to change the ChromeDriver path in the code.
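Since a wrong driver path is the most common reason the project fails to start, it can help to check the path before handing it to Selenium. The following is only a minimal sketch using the JDK standard library: the class name ChromeDriverLocator, the method firstExecutable, and the CHROMEDRIVER_PATH environment variable are all hypothetical names chosen for illustration, not part of the project or of Selenium. Only the webdriver.chrome.driver system property comes from the code above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of the original project.
class ChromeDriverLocator {

    /** Returns the first candidate that is an existing, executable regular file. */
    static Optional<Path> firstExecutable(String... candidates) {
        for (String candidate : candidates) {
            if (candidate == null || candidate.isEmpty()) {
                continue; // e.g. the environment variable is not set
            }
            Path path = Paths.get(candidate);
            if (Files.isRegularFile(path) && Files.isExecutable(path)) {
                return Optional.of(path);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> driver = firstExecutable(
                System.getenv("CHROMEDRIVER_PATH"),   // assumed variable name
                "/usr/local/bin/chromedriver");       // location used in the optional step above
        if (driver.isPresent()) {
            // Same system property the article sets by hand.
            System.setProperty("webdriver.chrome.driver", driver.get().toString());
            System.out.println("Using ChromeDriver at " + driver.get());
        } else {
            System.out.println("ChromeDriver not found; fix the path before starting the crawler.");
        }
    }
}
```

Failing early like this gives a clearer message than the error Selenium raises when the binary is missing.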
Verifying the Approach
First, carry out the three steps from the design:
package com.ths.controller;

import com.ths.service.ThsGnCrawlService;
import com.ths.service.ThsGnDetailCrawlService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

import java.util.HashMap;
import java.util.List;

@Controller
public class CrawlController {

    @Autowired
    private ThsGnCrawlService thsGnCrawlService;

    @Autowired
    private ThsGnDetailCrawlService thsGnDetailCrawlService;

    @RequestMapping("/test")
    @ResponseBody
    public void test() {
        // Step 1: collect the URLs of all concept sectors
        List<HashMap<String, String>> list = thsGnCrawlService.ThsGnCrawlListUrl();
        // Step 2: put them into the blocking queue
        thsGnDetailCrawlService.putAllArrayBlockingQueue(list);
        // Step 3: scrape each URL with worker threads (the argument is the thread count)
        thsGnDetailCrawlService.ConsumeCrawlerGnDetailData(1);
    }
}
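The three steps in the controller (collect, enqueue, consume) follow a producer/consumer pattern. To make the data flow easier to see without Spring or Selenium, here is a standalone sketch using only java.util.concurrent. It is not the project's implementation: the class CrawlPipelineSketch, the method consume, the example.invalid URLs, and the choice to stop workers with a timed poll are all assumptions made for illustration, and the actual page scraping is replaced by a placeholder.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the crawl pipeline; no real scraping happens here.
class CrawlPipelineSketch {

    /** Drains the queue with the given number of workers and returns the sector names seen. */
    static List<String> consume(BlockingQueue<Map<String, String>> queue, int workers) {
        List<String> processed = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    Map<String, String> item;
                    // A timed poll lets each worker exit once the queue stays empty.
                    while ((item = queue.poll(200, TimeUnit.MILLISECONDS)) != null) {
                        // Placeholder for "open item.get("url") and scrape the stock table".
                        processed.add(item.get("gnName"));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }

    public static void main(String[] args) {
        BlockingQueue<Map<String, String>> queue = new ArrayBlockingQueue<>(1000);
        for (int i = 0; i < 5; i++) {
            Map<String, String> sector = new HashMap<>();
            sector.put("url", "http://example.invalid/gn/" + i); // made-up URL
            sector.put("gnName", "sector-" + i);
            queue.add(sector);
        }
        System.out.println("Processed sectors: " + consume(queue, 2));
    }
}
```

A fixed thread pool with a timed poll is just one way to let workers finish when the queue is drained; workers that block forever on take() would instead need an explicit shutdown signal.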
Let's first look at thsGnCrawlService.ThsGnCrawlListUrl(): how does it collect the URLs of all concept sectors?
package com.ths.service.impl;

import com.ths.parse.service.ThsParseHtmlService;
import com.ths.service.ThsGnCrawlService;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.HashMap;
import java.util.List;
import java.util.concurrent.TimeUnit;

@Service
public class ThsGnCrawlServiceImpl implements ThsGnCrawlService {

    private final static Logger LOGGER = LoggerFactory.getLogger(ThsGnCrawlServiceImpl.class);

    /**
     * URL of the Tonghuashun page listing all concept sectors
     */
    private final static String GN_URL = "http://q.10jqka.com.cn/gn/";

    @Autowired
    private ThsParseHtmlService thsParseHtmlService;

    @Override
    public List<HashMap<String, String>> ThsGnCrawlListUrl() {
        // Change this to the ChromeDriver path on your own machine
        System.setProperty("webdriver.chrome.driver", "/Users/admin/Documents/selenium/chrome/79.0.3945.36/chromedriver");
        ChromeOptions options = new ChromeOptions();
        // Options controlling whether the browser window is shown
        // Headless (no window) mode
        // options.addArguments("headless");
        // Disable the sandbox; this one flag cost me a whole day
        // options.addArguments("no-sandbox");
        WebDriver webDriver = new ChromeDriver(options);
        try {
            // Tune to your connection speed: use a longer wait on a slow network
            webDriver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
            webDriver.get(GN_URL);
            Thread.sleep(1000L);
            String gnWindow = webDriver.getWindowHandle();
            // Get the HTML of the Tonghuashun concept-sector page
            String thsGnHtml = webDriver.getPageSource();
            LOGGER.info("HTML fetched from Tonghuashun url [{}]:\n{}", GN_URL, thsGnHtml);
            return thsParseHtmlService.parseGnHtmlReturnGnUrlList(thsGnHtml);
        } catch (Exception e) {
            LOGGER.error("Exception while fetching the HTML of the Tonghuashun concept-sector page:", e);
        } finally {
            webDriver.close();
            webDriver.quit();
        }
        // Reached only when an exception occurred
        return null;
    }
}
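This uses the ChromeDriver described above, so change the path to match your own setup (that's the fourth time I've said it!).
As the code shows, String thsGnHtml = webDriver.getPageSource(); fetches the page's HTML; parsing that HTML then yields the URL of each concept sector.
I parse the HTML with Jsoup, which is easy to pick up and has a simple API. The code that parses the HTML and extracts each sector's URL is: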
package com.ths.parse.service.impl;

import com.ths.parse.service.ThsParseHtmlService;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

@Service
public class ThsParseHtmlServiceImpl implements ThsParseHtmlService {

    /**
     * Parses the HTML of the Tonghuashun concept-sector page: http://q.10jqka.com.cn/gn/
     * Returns the URL and name of every concept sector
     */
    @Override
    public List<HashMap<String, String>> parseGnHtmlReturnGnUrlList(String html) {
        if (StringUtil.isBlank(html)) {
            return null;
        }
        List<HashMap<String, String>> list = new ArrayList<>();
        Document document = Jsoup.parse(html);
        // Every sector link sits inside an element with class "cate_items"
        Elements cateItemsFromClass = document.getElementsByClass("cate_items");
        for (Element element : cateItemsFromClass) {
            Elements as = element.getElementsByTag("a");
            for (Element a : as) {
                String gnUrl = a.attr("href");
                String name = a.text();
                HashMap<String, String> map = new HashMap<>();
                map.put("url", gnUrl);
                map.put("gnName", name);
                list.add(map);
            }
        }
        return list;
    }
}
As you can see, any data present in the HTML can be extracted once you locate its tag.
The results are then put into a blocking queue:
/**
 * Blocking queue holding the sector URLs waiting to be scraped
 */
private ArrayBlockingQueue<HashMap<String, String>> arrayBlockingQueue = new ArrayBlockingQueue<>(1000);

@Override
public void putAllArrayBlockingQueue(List<HashMap<String, String>> list) {
    if (!CollectionUtils.isEmpty(list)) {
        arrayBlockingQueue.addAll(list);
    }
}
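One detail worth knowing about the enqueue step: ArrayBlockingQueue is bounded, and addAll throws IllegalStateException once the capacity is exceeded (1000 in the snippet above), leaving the earlier elements already in the queue. The sketch below shows that behavior on a tiny queue, plus one possible alternative that offers each element with a timeout. The class BoundedEnqueueSketch and the method offerAll are hypothetical names for illustration, not part of the project.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical helper illustrating bounded-queue behavior.
class BoundedEnqueueSketch {

    /**
     * Offers items in order, waiting up to timeoutMillis for space on each one.
     * Stops at the first item that does not fit and returns how many were enqueued.
     */
    static <T> int offerAll(BlockingQueue<T> queue, List<T> items, long timeoutMillis) {
        int accepted = 0;
        try {
            for (T item : items) {
                if (!queue.offer(item, timeoutMillis, TimeUnit.MILLISECONDS)) {
                    break; // queue stayed full for the whole timeout
                }
                accepted++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return accepted;
    }

    public static void main(String[] args) {
        BlockingQueue<String> small = new ArrayBlockingQueue<>(2);
        List<String> urls = Arrays.asList("url-1", "url-2", "url-3"); // made-up values

        try {
            small.addAll(urls); // third element exceeds the capacity of 2
        } catch (IllegalStateException e) {
            System.out.println("addAll failed: " + e.getMessage());
        }

        small.clear();
        int accepted = offerAll(small, urls, 50);
        System.out.println("Enqueued " + accepted + " of " + urls.size());
    }
}
```

With consumers already running, a timed offer gives them a chance to free up space; whether to wait, drop, or fail when the queue is full is a design decision for the crawler.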
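Next, start several threads that take URLs from the blocking queue and scrape each concept sector's stock data. If a sector's page is paginated, keep clicking "next page" and scraping until the last page is reached. The code: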
package com.ths.service.impl;

import com.ths.dao.StockThsGnInfoDao;
import com.ths.domain.StockThsGnInfo;
import com.ths.service.ThsGnDetailCrawlService;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.util.CollectionUtils;
import org.springframework.util.StringUtils;

import javax.annotation.PostConstruct;
import java.math.BigDecimal;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

@Service
public class ThsGnDetailCrawlServiceImpl implements ThsGnDetailCrawlService {

    private final static Logger LOGGER = LoggerFactory.getLogger(ThsGnDetailCrawlServiceImpl.class);

    /**
     * Blocking queue holding the sector URLs waiting to be scraped
     */
    private ArrayBlockingQueue<HashMap<String, String>> arrayBlockingQueue = new ArrayBlockingQueue<>(1000);

    @Autowired
    private StockThsGnInfoDao stockThsGnInfoDao;

    @Override
    public void putAllArrayBlockingQueue(List<HashMap<String, String>> list) {
        if (!CollectionUtils.isEmpty(list)) {
            arrayBlockingQueue.addAll(list);
        }
    }

    @Override
    public void ConsumeCrawlerGnDetailData(int threadNumber) {
        for (int i = 0; i < threadNumber; ++i) {
            LOGGER.info("Starting consumer thread number [{}]", i);
            new Thread(new crawlerGnDataThread()).start();
        }
        LOGGER.info("Started [{}] consumer thread(s) in total", threadNumber);
    }

    class crawlerGnDataThread implements Runnable {

        @Override
        public void run() {
            try {
                while (true) {
                    Map<String, String