    Crawling 同花顺 (10jqka) Stock Data with Java (Source Code Included)

    Environment Setup

    First, the environment the project needs:

    IDE: IntelliJ IDEA; Language: Java; Dependencies: JDK 1.8, Maven, Chrome, ChromeDriver

    The approach here simulates real browser operations, so both the Chrome browser and the ChromeDriver binary need to be installed. Installing Chrome itself needs no explanation; just download it.

    The key step is installing ChromeDriver: the driver version must match the Chrome version you have installed.

    To check your Chrome version, type chrome://version into the Chrome address bar.

    (Screenshot: my Chrome version)

    Then download the driver that corresponds to your version; the versions should line up as closely as possible. For example, my Chrome is 79.0.3945.117 (Official Build) (64-bit), so I downloaded ChromeDriver 79.0.3945.36.

    ChromeDriver downloads for each version:

    Taobao mirror: https://npm.taobao.org/mirrors/chromedriver

    Google (requires access to sites blocked in mainland China, not recommended): https://sites.google.com/a/chromium.org/chromedriver/downloads

    The next step is optional; the project will still start without it, as long as you change one setting in the code.

    To configure it, copy the downloaded ChromeDriver file into /usr/local/bin/:

    cp chromedriver /usr/local/bin/

    Then check that it is installed correctly:

    chromedriver --version

    If you skip that step, just remember to change the ChromeDriver path configured in the code so that it points at your own ChromeDriver. Mine, for example, is:

    System.setProperty(
      "webdriver.chrome.driver",
      "/Users/admin/Documents/selenium/chrome/79.0.3945.36/chromedriver"
    );

    Remember to update the ChromeDriver path in the code.
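    Since the hardcoded path is easy to forget, here is a small optional sketch (not from the original post) that resolves the driver location from a JVM system property or an environment variable instead; the property and variable names below are made up for illustration:

    // Optional sketch: resolve the ChromeDriver path at runtime instead of hardcoding it.
    // "chromedriver.path" and "CHROMEDRIVER_PATH" are example names, not an established convention.
    String driverPath = System.getProperty("chromedriver.path",
            System.getenv().getOrDefault("CHROMEDRIVER_PATH", "/usr/local/bin/chromedriver"));
    System.setProperty("webdriver.chrome.driver", driverPath);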

    Verifying the Approach

    First, complete the three steps from the design plan:

    package com.ths.controller;
    
    import com.ths.service.ThsGnCrawlService;
    import com.ths.service.ThsGnDetailCrawlService;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Controller;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.ResponseBody;
    
    import java.util.HashMap;
    import java.util.List;
    
    @Controller
    public class CrawlController {
    
        @Autowired
        private ThsGnCrawlService thsGnCrawlService;
    
        @Autowired
        private ThsGnDetailCrawlService thsGnDetailCrawlService;
    
        @RequestMapping("/test")
        @ResponseBody
        public void test() {
            // Crawl the URLs of all concept sectors
            List<HashMap<String, String>> list = thsGnCrawlService.ThsGnCrawlListUrl();
            // Put them into the blocking queue
            thsGnDetailCrawlService.putAllArrayBlockingQueue(list);
            // Crawl each URL with multiple threads
            thsGnDetailCrawlService.ConsumeCrawlerGnDetailData(1);
        }
    
    }

    Let's first look at the thsGnCrawlService.ThsGnCrawlListUrl() method: how does it fetch the URLs of all concept sectors?

    package com.ths.service.impl;
    
    import com.ths.parse.service.ThsParseHtmlService;
    import com.ths.service.ThsGnCrawlService;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;
    
    import java.util.HashMap;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    
    @Service
    public class ThsGnCrawlServiceImpl implements ThsGnCrawlService {
        private final static Logger LOGGER = LoggerFactory.getLogger(ThsGnCrawlServiceImpl.class);
    
        /**
         * URL of the 同花顺 page that lists every concept sector
         */
        private final static String GN_URL = "http://q.10jqka.com.cn/gn/";
    
        @Autowired
        private ThsParseHtmlService thsParseHtmlService;
    
        @Override
        public List<HashMap<String, String>> ThsGnCrawlListUrl() {
            System.setProperty("webdriver.chrome.driver", "/Users/admin/Documents/selenium/chrome/79.0.3945.36/chromedriver");
            ChromeOptions options = new ChromeOptions();
            // Options controlling whether the browser window is shown
            // Headless mode
    //        options.addArguments("headless");
            // Disable the sandbox (this flag cost me a whole day of debugging)
    //        options.addArguments("no-sandbox");
            WebDriver webDriver = new ChromeDriver(options);
            try {
                // Implicit wait; adjust it to your network speed
                webDriver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
                webDriver.get(GN_URL);
                Thread.sleep(1000L);
                String gnWindow = webDriver.getWindowHandle();
                // Fetch the HTML of the 同花顺 concept page
                String thsGnHtml = webDriver.getPageSource();
                LOGGER.info("获取同花顺url:[{}]的html为:/n{}", GN_URL, thsGnHtml);
                return thsParseHtmlService.parseGnHtmlReturnGnUrlList(thsGnHtml);
            } catch (Exception e) {
                LOGGER.error("获取同花顺概念页面的HTML,出现异常:", e);
            } finally {
                webDriver.close();
                webDriver.quit();
            }
            return null;
        }
    }

    This is where the ChromeDriver discussed above comes in; change the path to match your own setup (mentioning it for the fourth time!).

    As the code shows, String thsGnHtml = webDriver.getPageSource(); returns the page's HTML, and parsing that HTML yields the URL of every concept sector.

    I use Jsoup to parse the HTML. It is easy to pick up and its API is simple. The code that parses the HTML and extracts each sector's URL is below:

    package com.ths.parse.service.impl;
    
    import com.ths.parse.service.ThsParseHtmlService;
    import org.jsoup.Jsoup;
    import org.jsoup.helper.StringUtil;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.springframework.stereotype.Service;
    
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    
    @Service
    public class ThsParseHtmlServiceImpl implements ThsParseHtmlService {
    
        /**
         * Parse the HTML of the 同花顺 concept sector page: http://q.10jqka.com.cn/gn/
         * and return the URLs of all concept sectors.
         */
        public List<HashMap<String, String>> parseGnHtmlReturnGnUrlList(String html) {
            if (StringUtil.isBlank(html)) {
                return null;
            }
            List<HashMap<String, String>> list = new ArrayList<>();
            Document document = Jsoup.parse(html);
            Elements cateItemsFromClass = document.getElementsByClass("cate_items");
            for (Element element : cateItemsFromClass) {
                Elements as = element.getElementsByTag("a");
                for (Element a : as) {
                    String gnUrl = a.attr("href");
                    String name = a.text();
                    HashMap<String, String> map = new HashMap<>();
                    map.put("url", gnUrl);
                    map.put("gnName", name);
                    list.add(map);
                }
            }
            return list;
        }
    }

    As you can see, any data that appears in the HTML can be pulled out once you locate the right tag.
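    For reference, here is a minimal, self-contained Jsoup sketch (not part of the project code; the sample HTML and href are made up) that does the same extraction with a CSS selector instead of getElementsByClass:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupSelectorDemo {
        public static void main(String[] args) {
            // Made-up fragment mimicking the structure of the concept list page.
            String html = "<div class=\"cate_items\">"
                    + "<a href=\"http://q.10jqka.com.cn/gn/detail/code/xxxxxx/\">SampleConcept</a>"
                    + "</div>";
            Document document = Jsoup.parse(html);
            // ".cate_items a" selects every <a> inside an element with class cate_items.
            for (Element a : document.select(".cate_items a")) {
                System.out.println(a.text() + " -> " + a.attr("href"));
            }
        }
    }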

    The URLs are then put into a blocking queue:

    /**
     * Blocking queue
     */
        private ArrayBlockingQueue<HashMap<String, String>> arrayBlockingQueue = new ArrayBlockingQueue<>(1000);
    
        @Override
        public void putAllArrayBlockingQueue(List<HashMap<String, String>> list) {
            if (!CollectionUtils.isEmpty(list)) {
                arrayBlockingQueue.addAll(list);
            }
        }
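    One caveat worth noting (my own note, not from the original post): addAll on a bounded ArrayBlockingQueue throws IllegalStateException if the elements do not all fit. With a capacity of 1000 that is unlikely here, but a put-based variant of the method above would block instead of failing:

    // Sketch: block until space is free rather than throwing when the queue is full.
    public void putAllArrayBlockingQueue(List<HashMap<String, String>> list) {
        if (CollectionUtils.isEmpty(list)) {
            return;
        }
        for (HashMap<String, String> item : list) {
            try {
                arrayBlockingQueue.put(item); // waits while the queue is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }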

    Next, several threads are started. Each one takes a URL from the blocking queue and crawls that concept sector's stock data. If the page is paginated, the thread keeps clicking the next-page link and reading the new data until it reaches the last page. The code follows:

    package com.ths.service.impl;
    
    import com.ths.dao.StockThsGnInfoDao;
    import com.ths.domain.StockThsGnInfo;
    import com.ths.service.ThsGnDetailCrawlService;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;
    import org.springframework.util.CollectionUtils;
    import org.springframework.util.StringUtils;
    
    import javax.annotation.PostConstruct;
    import java.math.BigDecimal;
    import java.text.SimpleDateFormat;
    import java.util.*;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.TimeUnit;
    
    @Service
    public class ThsGnDetailCrawlServiceImpl implements ThsGnDetailCrawlService {
        private final static Logger LOGGER = LoggerFactory.getLogger(ThsGnDetailCrawlServiceImpl.class);
    
        /**
         * Blocking queue
         */
        private ArrayBlockingQueue<HashMap<String, String>> arrayBlockingQueue = new ArrayBlockingQueue<>(1000);
    
        @Autowired
        private StockThsGnInfoDao stockThsGnInfoDao;
    
        @Override
        public void putAllArrayBlockingQueue(List<HashMap<String, String>> list) {
            if (!CollectionUtils.isEmpty(list)) {
                arrayBlockingQueue.addAll(list);
            }
        }
    
        @Override
        public void ConsumeCrawlerGnDetailData(int threadNumber) {
            for (int i = 0; i < threadNumber; ++i) {
                LOGGER.info("开启线程第[{}]个消费", i);
                new Thread(new crawlerGnDataThread()).start();
            }
            LOGGER.info("一共开启线程[{}]个消费", threadNumber);
        }
    
        class crawlerGnDataThread implements Runnable {
    
            @Override
            public void run() {
                try {
                    while (true) {
                        Map<String, String> map = arrayBlockingQueue.take();
                        String url = map.get("url");
                        String gnName = map.get("gnName");
                        String crawlerDateStr = new SimpleDateFormat("yyyy-MM-dd HH:00:00").format(new Date());
                        // Location of the ChromeDriver binary
                        System.setProperty("webdriver.chrome.driver", "/Users/admin/Documents/selenium/chrome/79.0.3945.36/chromedriver");
                        ChromeOptions options = new ChromeOptions();
                        // Headless mode
                        //        options.addArguments("headless");
                        // Disable the sandbox (this flag cost me a whole day of debugging)
                        //        options.addArguments("no-sandbox");
                        WebDriver webDriver = new ChromeDriver(options);
                        try {
                            webDriver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
                            webDriver.get(url);
                            Thread.sleep(1000L);
                            String oneGnHtml = webDriver.getPageSource();
                            LOGGER.info("当前概念:[{}],html数据为[{}]", gnName, oneGnHtml);
                            LOGGER.info(oneGnHtml);
                            // Parse and store the data
                            parseHtmlAndInsertData(oneGnHtml, gnName, crawlerDateStr);
                            clicktoOneGnNextPage(webDriver, oneGnHtml, gnName, crawlerDateStr);
                        } catch (Exception e) {
                            LOGGER.error("用chromerDriver抓取数据,出现异常,url为[{}],异常为[{}]", url, e);
                        } finally {
                            webDriver.close();
                            webDriver.quit();
                        }
                    }
                } catch (Exception e) {
                    LOGGER.error("阻塞队列出现循环出现异常:", e);
                }
            }
        }
    
        public void parseHtmlAndInsertData(String html, String gnName, String crawlerDateStr) {
            Document document = Jsoup.parse(html);
    //        Element boardElement = document.getElementsByClass("board-hq").get(0);
    //        String gnCode = boardElement.getElementsByTag("h3").get(0).getElementsByTag("span").get(0).text();
    
            Element table = document.getElementsByClass("m-pager-table").get(0);
            Element tBody = table.getElementsByTag("tbody").get(0);
            Elements trs = tBody.getElementsByTag("tr");
            for (Element tr : trs) {
                try {
                    Elements tds = tr.getElementsByTag("td");
                    String stockCode = tds.get(1).text();
                    String stockName = tds.get(2).text();
                    BigDecimal stockPrice = parseValueToBigDecimal(tds.get(3).text());
                    BigDecimal stockChange = parseValueToBigDecimal(tds.get(4).text());
                    BigDecimal stockChangePrice = parseValueToBigDecimal(tds.get(5).text());
                    BigDecimal stockChangeSpeed = parseValueToBigDecimal(tds.get(6).text());
                    BigDecimal stockHandoverScale = parseValueToBigDecimal(tds.get(7).text());
                    BigDecimal stockLiangBi = parseValueToBigDecimal(tds.get(8).text());
                    BigDecimal stockAmplitude = parseValueToBigDecimal(tds.get(9).text());
                    BigDecimal stockDealAmount = parseValueToBigDecimal(tds.get(10).text());
                    BigDecimal stockFlowStockNumber = parseValueToBigDecimal(tds.get(11).text());
                    BigDecimal stockFlowMakertValue = parseValueToBigDecimal(tds.get(12).text());
                    BigDecimal stockMarketTtm = parseValueToBigDecimal(tds.get(13).text());
                    // Store the row
                    StockThsGnInfo stockThsGnInfo = new StockThsGnInfo();
                    stockThsGnInfo.setGnName(gnName);
                    stockThsGnInfo.setGnCode(null);
                    stockThsGnInfo.setStockCode(stockCode);
                    stockThsGnInfo.setStockName(stockName);
                    stockThsGnInfo.setStockPrice(stockPrice);
                    stockThsGnInfo.setStockChange(stockChange);
                    stockThsGnInfo.setStockChangePrice(stockChangePrice);
                    stockThsGnInfo.setStockChangeSpeed(stockChangeSpeed);
                    stockThsGnInfo.setStockHandoverScale(stockHandoverScale);
                    stockThsGnInfo.setStockLiangBi(stockLiangBi);
                    stockThsGnInfo.setStockAmplitude(stockAmplitude);
                    stockThsGnInfo.setStockDealAmount(stockDealAmount);
                    stockThsGnInfo.setStockFlowStockNumber(stockFlowStockNumber);
                    stockThsGnInfo.setStockFlowMakertValue(stockFlowMakertValue);
                    stockThsGnInfo.setStockMarketTtm(stockMarketTtm);
                    stockThsGnInfo.setCrawlerTime(crawlerDateStr);
                    stockThsGnInfo.setCrawlerVersion("同花顺概念板块#" + crawlerDateStr);
                    stockThsGnInfo.setCreateTime(new Date());
                    stockThsGnInfo.setUpdateTime(new Date());
                    stockThsGnInfoDao.insert(stockThsGnInfo);
                } catch (Exception e) {
                    LOGGER.error("插入同花顺概念板块数据出现异常:", e);
                }
    
            }
        }
    
        public BigDecimal parseValueToBigDecimal(String value) {
            if (StringUtils.isEmpty(value)) {
                return BigDecimal.ZERO;
            } else if ("--".equals(value)) {
                return BigDecimal.ZERO;
            } else if (value.endsWith("亿")) {
                return new BigDecimal(value.substring(0, value.length() - 1)).multiply(BigDecimal.ONE);
            }
            return new BigDecimal(value);
        }
    
        public boolean clicktoOneGnNextPage(WebDriver webDriver, String oneGnHtml, String key, String crawlerDateStr) throws InterruptedException {
            // Does the page contain a "下一页" (next page) link?
            String pageNumber = includeNextPage(oneGnHtml);
            if (!StringUtils.isEmpty(pageNumber)) {
                WebElement nextPageElement = webDriver.findElement(By.linkText("下一页"));
                webDriver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
                nextPageElement.click();
                Thread.sleep(700);
                String nextPageHtml = webDriver.getPageSource();
                LOGGER.info("下一页:");
                LOGGER.info(nextPageHtml);
                // Parse and store the data
                parseHtmlAndInsertData(nextPageHtml, key, crawlerDateStr);
                clicktoOneGnNextPage(webDriver, nextPageHtml, key, crawlerDateStr);
            }
            return true;
        }
    
        public String includeNextPage(String html) {
            Document document = Jsoup.parse(html);
            List<Element> list = document.getElementsByTag("a");
            for (Element element : list) {
                String a = element.text();
                if ("下一页".equals(a)) {
                    String pageNumber = element.attr("page");
                    return pageNumber;
                }
            }
            return null;
        }
    }
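    One remark on the pagination: clicktoOneGnNextPage calls itself once per page, so a sector with many pages grows the call stack by one frame per page. As a sketch under the same assumptions (not the author's code), the parseHtmlAndInsertData and clicktoOneGnNextPage calls in run() could be replaced by a single iterative method:

    // Sketch: iterative pagination; crawls the current page and every following page.
    public void crawlAllPages(WebDriver webDriver, String firstPageHtml,
                              String gnName, String crawlerDateStr) throws InterruptedException {
        String html = firstPageHtml;
        parseHtmlAndInsertData(html, gnName, crawlerDateStr);
        // Keep clicking "下一页" while the page still advertises a next page.
        while (!StringUtils.isEmpty(includeNextPage(html))) {
            webDriver.findElement(By.linkText("下一页")).click();
            Thread.sleep(700);
            html = webDriver.getPageSource();
            parseHtmlAndInsertData(html, gnName, crawlerDateStr);
        }
    }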

    Finally, each concept sector page is parsed and the rows are written to the database.
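    The StockThsGnInfo entity and StockThsGnInfoDao themselves are not shown in the post. Purely as a hypothetical sketch, with field names inferred from the setters called in parseHtmlAndInsertData and comments that guess at the column meanings, the entity is essentially one row of the sector table:

    import java.math.BigDecimal;
    import java.util.Date;

    // Hypothetical sketch; field names mirror the setters used above and may
    // differ from the author's actual class.
    public class StockThsGnInfo {
        private String gnName;                    // concept sector name
        private String gnCode;                    // concept sector code (not captured above)
        private String stockCode;
        private String stockName;
        private BigDecimal stockPrice;
        private BigDecimal stockChange;           // likely the change percentage
        private BigDecimal stockChangePrice;      // likely the change amount
        private BigDecimal stockChangeSpeed;
        private BigDecimal stockHandoverScale;    // likely the turnover rate
        private BigDecimal stockLiangBi;          // volume ratio (量比)
        private BigDecimal stockAmplitude;
        private BigDecimal stockDealAmount;
        private BigDecimal stockFlowStockNumber;
        private BigDecimal stockFlowMakertValue;
        private BigDecimal stockMarketTtm;
        private String crawlerTime;
        private String crawlerVersion;
        private Date createTime;
        private Date updateTime;
        // getters and setters omitted
    }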

    Data Display

    Verification

    The crawled data landed in the database, so the verification passed! Source code: Java爬取同花顺股票数据(源码地址).

    Original post: https://www.cnblogs.com/damowang/p/12511071.html