• crawler4j source study (2): a crawler for collecting rental-listing data from Ziroom


    crawler4j is an open-source web crawler implemented in Java. It provides a simple, easy-to-use API with which a multi-threaded crawler can be built in a few minutes. The example below combines crawler4j with jsoup for parsing the downloaded pages and javacsv for storing the collected data, and crawls the rental listings on the Ziroom site (http://sh.ziroom.com/z/nl/).
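
    Before diving into the crawler itself, here is a minimal, self-contained sketch of how the jsoup + javacsv combination is used in this example. The HTML fragment, field values, and output file name are invented stand-ins for illustration, not actual Ziroom markup or data; the CSS selectors mirror the ones used in visit() further below.

    import java.io.FileWriter;
    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import com.csvreader.CsvWriter;

    public class JsoupCsvDemo {
        public static void main(String[] args) throws IOException {
            // Invented stand-in for one listing block on the Ziroom list page
            final String html = "<li class=\"clearfix\">"
                    + "<div class=\"img\"><img src=\"http://img.example.com/room.jpg\"/></div>"
                    + "<div class=\"txt\"><h3><a>Sample estate</a></h3><h4><a>2 bedrooms</a></h4>"
                    + "<div class=\"detail\">20 sqm, south-facing</div><p>500 m from the metro</p></div>"
                    + "<p class=\"price\">2560</p></li>";

            // jsoup: parse the fragment and pick out the fields with CSS selectors
            final Document doc = Jsoup.parse(html);
            final Element li = doc.select("li.clearfix").first();
            final String img = li.select(".img img").first().attr("src");
            final String price = li.select("p.price").first().text();
            final String address = li.select("div.txt h3 a").first().text();

            // javacsv: append one record per listing to a CSV file
            final CsvWriter csv = new CsvWriter(new FileWriter("demo.csv", true), ',');
            csv.write(img);
            csv.write(price);
            csv.write(address);
            csv.endRecord();
            csv.close();
        }
    }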

    The whole process takes just three steps:

    Step 1: develop the core Ziroom crawler code:

    /**
     * @date 2016-08-20 18:13:24
     * @version
     * @since JDK 1.8
     */
    public class ZiroomCrawler extends WebCrawler {
    
        /** URL filter: skip stylesheets, scripts, images and other media files */
        private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g|ico"
                + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
        /** Path of the CSV file holding the crawled listing data */
        private final static String DATA_PATH = "data/crawl/ziroom.csv";
        /** Path of the CSV file holding the crawled links */
        private final static String LINK_PATH = "data/crawl/link.csv";
    
        private final static String URL_PREFIX = "http://sh.ziroom.com/z/nl/";
    
        private final File fLinks;
        private final File fDatas;
    
        private CsvWriter csvLinks;
        private CsvWriter csvDatas;
    
        /** Per-crawler statistics collected while crawling */
        ZiroomCrawlStat myCrawlStat;
    
        public ZiroomCrawler() throws IOException {
            myCrawlStat = new ZiroomCrawlStat();
            fLinks = new File(DATA_PATH); // listing-data file (ziroom.csv)
            fDatas = new File(LINK_PATH); // link file (link.csv)
            if (fLinks.isFile()) {
                fLinks.delete();
            }
            if (fDatas.isFile()) {
                fDatas.delete();
            }
            csvDatas = new CsvWriter(new FileWriter(fDatas, true), ',');
            csvDatas.write("请求路径");
            csvDatas.endRecord();
            csvDatas.close();
            csvLinks = new CsvWriter(new FileWriter(fLinks, true), ',');
            csvLinks.write("图片");
            csvLinks.write("价格");
            csvLinks.write("地址");
            csvLinks.write("说明");
            csvLinks.endRecord();
            csvLinks.close();
        }
    
        public void dumpMyData() {
            final int id = getMyId();
            // You can configure the log to output to file
            logger.info("Crawler {} > Processed Pages: {}", id, myCrawlStat.getTotalProcessedPages());
            logger.info("Crawler {} > Total Links Found: {}", id, myCrawlStat.getTotalLinks());
            logger.info("Crawler {} > Total Text Size: {}", id, myCrawlStat.getTotalTextSize());
        }
    
        @Override
        public Object getMyLocalData() {
            return myCrawlStat;
        }
    
        @Override
        public void onBeforeExit() {
            dumpMyData();
        }
    
        /*
         * This method decides which URLs get crawled. In this example only pages
         * under "http://sh.ziroom.com/z/nl/" are visited; .css, .js, multimedia and
         * other binary files are excluded. (A standalone check of this filter is
         * sketched right after the class.)
         *
         * @see edu.uci.ics.crawler4j.crawler.WebCrawler#shouldVisit(edu.uci.ics.
         * crawler4j.crawler.Page, edu.uci.ics.crawler4j.url.WebURL)
         */
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            final String href = url.getURL().toLowerCase();
    
            if (FILTERS.matcher(href).matches() || !href.startsWith(URL_PREFIX)) {
                return false;
            }
            return true;
        }
    
        /*
         * This method is called after a URL has been downloaded. From the Page
         * object you can easily get the URL, text, outgoing links, HTML and a
         * unique document id.
         *
         * @see
         * edu.uci.ics.crawler4j.crawler.WebCrawler#visit(edu.uci.ics.crawler4j.
         * crawler.Page)
         */
        @Override
        public void visit(Page page) {
            final String url = page.getWebURL().getURL();
            logger.info("爬取路径:" + url);
            myCrawlStat.incProcessedPages();
            if (page.getParseData() instanceof HtmlParseData) {
                final HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                final Set<WebURL> links = htmlParseData.getOutgoingUrls();
                try {
                    linkToCsv(links);
                } catch (final IOException e) {
                    logger.error("Failed to write outgoing links to CSV", e);
                }
                myCrawlStat.incTotalLinks(links.size());
                try {
                    myCrawlStat.incTotalTextSize(htmlParseData.getText().getBytes("UTF-8").length);
                } catch (final UnsupportedEncodingException e) {
                    logger.error("UTF-8 encoding not supported", e);
                }
                final String html = htmlParseData.getHtml();
    
                final Document doc = Jsoup.parse(html);
    
                final Elements contents = doc.select("li[class=clearfix]");
    
                for (final Element c : contents) {
                    // Image
                    final String img = c.select(".img img").first().attr("src");
                    logger.debug("图片:" + img);

                    // Address (estate name, layout, detail)
                    final Element txt = c.select("div[class=txt]").first();
                    final String arr1 = txt.select("h3 a").first().text();
                    final String arr2 = txt.select("h4 a").first().text();
                    final String arr3 = txt.select("div[class=detail]").first().text();

                    final String arr = arr1 + "," + arr2 + "," + arr3;
                    logger.debug("地址:" + arr);
                    // Description
                    final String rank = txt.select("p").first().text();
                    logger.debug("说明:" + rank);

                    // Price
                    final String price = c.select("p[class=price]").first().text();
    
                    try {
                        csvLinks = new CsvWriter(new FileWriter(fLinks, true), ',');
                        csvLinks.write(img);
                        csvLinks.write(price);
                        csvLinks.write(arr);
                        csvLinks.write(rank);
                        csvLinks.endRecord();
                        csvLinks.flush();
                        csvLinks.close();
                    } catch (final IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    
        private void linkToCsv(Set<WebURL> links) throws IOException {
            csvDatas = new CsvWriter(new FileWriter(fDatas, true), ',');
            for (final WebURL webURL : links) {
                csvDatas.write(webURL.getURL());
                csvDatas.endRecord(); // one URL per row
            }
            csvDatas.flush();
            csvDatas.close();
        }
    }
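
    As noted in the shouldVisit comment above, only pages under the seed path are crawled and static resources are excluded. The following standalone sketch applies the same regular expression and prefix rule to a few sample URLs so the filter can be checked in isolation; the sample URLs are assumptions made up for illustration.

    import java.util.regex.Pattern;

    public class ShouldVisitCheck {

        // Same filter pattern and URL prefix as ZiroomCrawler above
        private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g|ico"
                + "|png|tiff?|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))$");
        private static final String URL_PREFIX = "http://sh.ziroom.com/z/nl/";

        static boolean shouldVisit(String url) {
            final String href = url.toLowerCase();
            return !FILTERS.matcher(href).matches() && href.startsWith(URL_PREFIX);
        }

        public static void main(String[] args) {
            // List page under the seed path: crawled
            System.out.println(shouldVisit("http://sh.ziroom.com/z/nl/z2.html"));   // true
            // Static resource: rejected by the extension pattern
            System.out.println(shouldVisit("http://sh.ziroom.com/z/nl/style.css")); // false
            // Page outside the seed path: rejected by the prefix check
            System.out.println(shouldVisit("http://www.ziroom.com/about/"));        // false
        }
    }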

    Step 2: develop the Ziroom crawl controller code:

    /**
     * @date 2016-08-20 18:15:01
     * @version
     * @since JDK 1.8
     */
    public class ZiroomController {
    
        public static void main(String[] args) {
            
            final String crawlStorageFolder = "data/crawl/root";
            final int numberOfCrawlers = 3;
    
            final CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder(crawlStorageFolder);
            config.setPolitenessDelay(1000);                 // wait 1 s between requests to the same host
            config.setIncludeBinaryContentInCrawling(false); // skip images, archives and other binary content
            config.setMaxPagesToFetch(50);                   // stop after fetching 50 pages
            
            final PageFetcher pageFetcher = new PageFetcher(config);
            final RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            final RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller;
            try {
                controller = new CrawlController(config, pageFetcher, robotstxtServer);
                
                controller.addSeed("http://sh.ziroom.com/z/nl/");
               
                controller.start(ZiroomCrawler.class, numberOfCrawlers);
    
                final List<Object> crawlersLocalData = controller.getCrawlersLocalData();
                long totalLinks = 0;
                long totalTextSize = 0;
                int totalProcessedPages = 0;
                for (final Object localData : crawlersLocalData) {
                    final ZiroomCrawlStat stat = (ZiroomCrawlStat) localData;
                    totalLinks += stat.getTotalLinks();
                    totalTextSize += stat.getTotalTextSize();
                    totalProcessedPages += stat.getTotalProcessedPages();
                }
    
                System.out.println("Aggregated Statistics:");
                System.out.println("\tProcessed Pages: " + totalProcessedPages);
                System.out.println("\tTotal Links found: " + totalLinks);
                System.out.println("\tTotal Text Size: " + totalTextSize);
            } catch (final Exception e) {
                e.printStackTrace();
            }
        }
    }

    Step 3: develop the Ziroom crawl-statistics code:

    /**
     * @date 2016-08-20 18:14:13
     * @version
     * @since JDK 1.8
     */
    public class ZiroomCrawlStat {
        private long totalLinks;
        private int totalProcessedPages;
        private long totalTextSize;
    
        public long getTotalLinks() {
            return totalLinks;
        }
    
        public int getTotalProcessedPages() {
            return totalProcessedPages;
        }
    
        public long getTotalTextSize() {
            return totalTextSize;
        }
    
        public void incProcessedPages() {
            this.totalProcessedPages++;
        }
    
        public void incTotalLinks(int count) {
            this.totalLinks += count;
        }
    
        public void incTotalTextSize(int count) {
            this.totalTextSize += count;
        }
    
        public void setTotalLinks(long totalLinks) {
            this.totalLinks = totalLinks;
        }
    
        public void setTotalProcessedPages(int totalProcessedPages) {
            this.totalProcessedPages = totalProcessedPages;
        }
    
        public void setTotalTextSize(long totalTextSize) {
            this.totalTextSize = totalTextSize;
        }
    }

    Ziroom crawl output: the listing fields (image, price, address, description) are appended to data/crawl/ziroom.csv, and the outgoing links of every visited page are appended to data/crawl/link.csv.
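
    To display the collected data, the CSV file can be read back with javacsv's CsvReader. This is a minimal sketch, assuming the file layout produced by ZiroomCrawler above (header row 图片, 价格, 地址, 说明) and the platform default encoding used by FileWriter when the file was written:

    import java.io.IOException;
    import java.nio.charset.Charset;

    import com.csvreader.CsvReader;

    public class ZiroomCsvDump {
        public static void main(String[] args) throws IOException {
            // Read back the listing data written by ZiroomCrawler; use the platform
            // default charset to match what FileWriter used when writing the file
            final CsvReader reader = new CsvReader("data/crawl/ziroom.csv", ',', Charset.defaultCharset());
            reader.readHeaders(); // first row: 图片, 价格, 地址, 说明
            while (reader.readRecord()) {
                // print the price and address columns of each listing
                System.out.println(reader.get(1) + "\t" + reader.get(2));
            }
            reader.close();
        }
    }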

  • Original article: https://www.cnblogs.com/liinux/p/5791038.html