• httpclient multithreaded crawler example (repost)


    https://zhuanlan.zhihu.com/p/82856691

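    While doing some security testing recently, I stumbled on a vulnerability on a certain site: its resource endpoint performs no validation at all, so the per-user daily limit on resource fetches is effectively gone. My first thought was to crawl more data, but the volume was too large for a single thread, so I turned to a multithreaded crawler. After some trial and error it worked. The script is rough, because I never actually intended to crawl everything. The plan: roughly 100,000 records, 10 threads, 10,000 records per thread, and 100 records per request (the endpoint turned out to be a GET, so URL length limits apply).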
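
    The partitioning described above can be sketched on its own, without any HTTP code. This is only an illustration under assumptions: IDs are taken to be consecutive integers starting at 0, and `BatchPlan` and `batchesFor` are hypothetical names that do not appear in the original script.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper for illustration; not part of the original script.
    class BatchPlan {

        /** Returns the ID batches one worker would request, each holding consecutive IDs. */
        static List<int[]> batchesFor(int worker, int idsPerWorker, int batchSize) {
            List<int[]> batches = new ArrayList<>();
            int start = worker * idsPerWorker;   // first ID owned by this worker
            int end = start + idsPerWorker;      // exclusive upper bound
            for (int from = start; from < end; from += batchSize) {
                int size = Math.min(batchSize, end - from);
                int[] ids = new int[size];
                for (int k = 0; k < size; k++) {
                    ids[k] = from + k;
                }
                batches.add(ids);
            }
            return batches;
        }

        public static void main(String[] args) {
            // Worker 3 of 10, with 10,000 IDs per worker and 100 IDs per request
            List<int[]> batches = batchesFor(3, 10000, 100);
            int[] last = batches.get(batches.size() - 1);
            System.out.println(batches.size());         // prints 100
            System.out.println(batches.get(0)[0]);      // prints 30000
            System.out.println(last[last.length - 1]);  // prints 39999
        }
    }
    ```

    With these numbers each thread issues 100 requests, so the whole run is 1,000 requests.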

    Sharing the code here for reference.

    package practise;
    
    import java.util.Date;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.http.client.methods.HttpGet;
    import net.sf.json.JSONObject;
    import source.ApiLibrary;
    
    public class LoginDz extends ApiLibrary {
    
        public static void main(String[] args) {
            LoginDz loginDz = new LoginDz();
            loginDz.excuteTreads();
            testOver();
        }
    
        public JSONObject getTi(int[] code, String name) {
            JSONObject response = null;
            String url = "***********";
            JSONObject args = new JSONObject();
            // args.put("ID_List", getTiId(884969));
            args.put("ID_List", getTiId(code));
            HttpGet httpGet = getHttpGet(url, args);
            response = getHttpResponseEntityByJson(httpGet);
            // output(response.toString());
            String text = response.toString();
            // Log only responses that actually contain data
            if (!text.equals("{\"success_response\":[]}"))
                logLog(name, response.toString());
            output(response);
            return response;
        }
    
    
        public String getTiId(int... id) {
            StringBuilder result = new StringBuilder();
            int length = id.length;
            for (int i = 0; i < length; i++) {
                String abc = "filter[where][origDocID][inq]=" + id[i] + "&";
                result.append(abc);
            }
            return result.toString();
        }
    
        /**
         * Run the multithreaded task
         */
        public void excuteTreads() {
            int threads = 10;
            ExecutorService executorService = Executors.newFixedThreadPool(threads);
            CountDownLatch countDownLatch = new CountDownLatch(threads);
            Date start = new Date();
            for (int i = 0; i < threads; i++) {
                executorService.execute(new More(countDownLatch, i));
            }
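            try {
                countDownLatch.await();
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can still see the interruption
                Thread.currentThread().interrupt();
                e.printStackTrace();
            } finally {
                // Always release the pool, even if the wait was interrupted
                executorService.shutdown();
            }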
            Date end = new Date();
            outputTimeDiffer(start, end);
        }
    
        /**
         * Worker task: each instance crawls one slice of 10,000 IDs
         */
        class More implements Runnable {
            public CountDownLatch countDownLatch;
            public int num;
    
            public More(CountDownLatch countDownLatch, int num) {
                this.countDownLatch = countDownLatch;
                this.num = num;
            }
    
            @Override
            public void run() {
                int bound = num * 10000;
    
                try {
                    for (int i = bound; i < bound + 10000; i += 100) {
                        // Fill one batch of 100 consecutive IDs, then send a single request
                        int[] ids = new int[100];
                        for (int k = 0; k < 100; k++) {
                            ids[k] = i + k;
                        }
                        getTi(ids, bound + "");
                    }
                } finally {
                    countDownLatch.countDown();
                }
            }
    
        }
    
    }
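
    Since the listing above depends on the author's `ApiLibrary` base class, here is a self-contained sketch of the same fixed thread pool plus `CountDownLatch` pattern that can be run on its own. The HTTP call is replaced by a caller-supplied `fetch` callback, which is an assumption made for illustration; `LatchCrawlSketch` and `crawl` are hypothetical names, not part of the original code.

    ```java
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.function.Consumer;

    // Hypothetical stand-in for the crawler; `fetch` replaces the real HTTP request.
    class LatchCrawlSketch {

        /** Runs one task per worker on a fixed pool and returns how many batches were fetched. */
        static int crawl(int workers, int idsPerWorker, int batchSize, Consumer<int[]> fetch) {
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            CountDownLatch done = new CountDownLatch(workers);
            AtomicInteger batches = new AtomicInteger();
            try {
                for (int w = 0; w < workers; w++) {
                    final int start = w * idsPerWorker;
                    final int end = start + idsPerWorker;
                    pool.execute(() -> {
                        try {
                            for (int from = start; from < end; from += batchSize) {
                                int size = Math.min(batchSize, end - from);
                                int[] ids = new int[size];
                                for (int k = 0; k < size; k++) {
                                    ids[k] = from + k;
                                }
                                fetch.accept(ids);          // one request per batch
                                batches.incrementAndGet();
                            }
                        } finally {
                            done.countDown();               // count down even if fetch throws
                        }
                    });
                }
                done.await();                               // wait for every worker to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();         // preserve the interrupt status
            } finally {
                pool.shutdown();
            }
            return batches.get();
        }

        public static void main(String[] args) {
            AtomicInteger ids = new AtomicInteger();
            int batches = crawl(10, 10000, 100, batch -> ids.addAndGet(batch.length));
            System.out.println(batches + " batches, " + ids.get() + " IDs");
        }
    }
    ```

    Counting down in a `finally` block, as the original `More.run()` also does, keeps `await()` from hanging when a worker fails partway through its slice.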
  • Original URL: https://www.cnblogs.com/huanghongbo/p/14993914.html