Building a Well-Behaved Movie Torrent Digger


    Every time I get back to the dorm and want to watch a movie, I realize it has been ages since I last went to a BT site to hunt for torrents. Browsing the site every day for movies that match my taste is time-consuming and tedious, so over the weekend I wrote a configurable crawler that downloads torrent files automatically according to each person's preferences and avoids duplicate movies that show up under more than one category.

    Later on I plan to add a download queue: start BT downloads when nobody in the dorm is using the network, and pause as soon as someone connects to the Wi-Fi.

    Project page: https://github.com/hwding/btDigger , many hands make light work, contributions are welcome...

    (The crawler depends on the torrent site 'BT天堂')

    This is what it looks like when running:

    Let's walk through how the crawler was built.

    First, I want it to offer the following flexibility; see config.json:

    {
        "regions-banned":[
            "中国大陆"
        ],
        "depth":"2",
        "definition":"1080p",
        "categories":[
            "动作",
            "战争",
            "科幻",
            "悬疑",
            "犯罪",
            "恐怖",
            "惊悚",
            "冒险"
        ]
    }

    With this configuration the crawler bans films from mainland China, digs 2 pages deep into each category, prefers 1080p torrents, and only crawls the listed categories: 动作 (action), 战争 (war), 科幻 (sci-fi) and so on.

    When the program starts it first reads and parses the configuration file. The parser is a singleton so the configuration can be loaded conveniently from other classes.

    In the main method:  ConfigLoader.getInstance(); 

    The ConfigLoader class:

     
    import org.json.JSONArray;
    import org.json.JSONObject;
    import java.io.*;
    import java.util.ArrayList;

    class ConfigLoader {
        private static ConfigLoader configLoader = null;
        private static final String FILE_NAME = "config.json";
        private static ArrayList<String> regions_banned = new ArrayList<>();
        private static ArrayList<String> categories    = new ArrayList<>();
        private static int depth;
        private static String definition;

        private ConfigLoader() {
            // Read the raw JSON text and copy each field into the class members
            String jsonString = extractJSON();
            JSONObject jsonObject = new JSONObject(jsonString);
            JSONArray regions_banned_array = jsonObject.getJSONArray("regions-banned");
            JSONArray categories_array = jsonObject.getJSONArray("categories");
            for (Object each : regions_banned_array)
                regions_banned.add((String) each);
            for (Object each : categories_array)
                categories.add((String) each);
            depth = Integer.parseInt(jsonObject.getString("depth"));
            definition = jsonObject.getString("definition");
        }

        // Lazy singleton so every class shares the same parsed configuration
        static synchronized ConfigLoader getInstance() {
            if (configLoader == null)
                configLoader = new ConfigLoader();
            return configLoader;
        }

        ArrayList<String> getRegions_banned() {
            return regions_banned;
        }

        ArrayList<String> getCategories() {
            return categories;
        }

        int getDepth() {
            return depth;
        }

        String getDefinition() {
            return definition;
        }

        // Read config.json into a single string
        private String extractJSON() {
            File file = new File(FILE_NAME);
            String jsonString = "";
            String temp;
            try (BufferedReader bufferedReader = new BufferedReader(
                                                 new InputStreamReader(
                                                 new FileInputStream(file), "UTF-8"))) {
                while ((temp = bufferedReader.readLine()) != null)
                    jsonString += temp;
            } catch (FileNotFoundException e) {
                System.out.println("\n[x] Configuration file not found");
                System.exit(0);
            } catch (IOException e) {
                System.out.println("\n[x] An error occurred when trying to read the configuration file");
                System.exit(0);
            }
            return jsonString;
        }
    }
     

    It simply assigns each property from the file to the corresponding class member.

    The main method then starts the page parser:  PageParser.getInstance(); 

    Once initialized, the page parser automatically starts parsing according to the configuration.
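
    The main method itself isn't reproduced in this post; a minimal sketch of how the two singletons might be wired together (the Main class name is an assumption, and it assumes Main lives in the same package as the two loaders) could look like this:

    // Hypothetical entry point: the post only shows the two getInstance() calls,
    // the surrounding class is an assumption.
    public class Main {
        public static void main(String[] args) {
            // Parse config.json once; every other class reuses the same instance
            ConfigLoader.getInstance();
            // The page parser reads the shared configuration and starts crawling once constructed
            PageParser.getInstance();
        }
    }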

    It first checks whether the 'BT天堂' site can be reached.
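
    The connectivity check itself isn't shown in the post; a rough sketch using Jsoup (the HOST constant, the method name and the error handling are all assumptions) might be:

    import org.jsoup.Jsoup;
    import java.io.IOException;

    // Hypothetical reachability check; HOST is assumed to point at the site's home page
    private void checkConnection() {
        try {
            // Treat a successful GET of the home page as "site reachable"
            Jsoup.connect(HOST).timeout(5000).get();
            System.out.println("[o] Site is reachable");
        } catch (IOException e) {
            System.out.println("[x] Cannot reach the torrent site");
            System.exit(0);
        }
    }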

    Then it scans the home page for the second-level links of the movie categories we care about; in the page source they look like this:

     
     <div class="Btitle"><a href="/category.php?/%e5%8a%a8%e4%bd%9c/" title="动作电影">动作</a></div>
     <div class="Btitle"><a href="/category.php?/%e6%88%98%e4%ba%89/" title="战争电影">战争</a></div>
     <div class="Btitle"><a href="/category.php?/%e5%89%a7%e6%83%85/" title="剧情电影">剧情</a></div>
     <div class="Btitle"><a href="/category.php?/%e7%88%b1%e6%83%85/" title="爱情电影">爱情</a></div>
     <div class="Btitle"><a href="/category.php?/%e7%a7%91%e5%b9%bb/" title="科幻电影">科幻</a></div>
     <div class="Btitle"><a href="/category.php?/%e6%82%ac%e7%96%91/" title="悬疑电影">悬疑</a></div>
     <div class="Btitle"><a href="/category.php?/%e5%ae%b6%e5%ba%ad/" title="家庭电影">家庭</a></div>
     <div class="Btitle"><a href="/category.php?/%e7%8a%af%e7%bd%aa/" title="犯罪电影">犯罪</a></div>
     <div class="Btitle"><a href="/category.php?/%e6%81%90%e6%80%96/" title="恐怖电影">恐怖</a></div>
     <div class="Btitle"><a href="/category.php?/%e5%8a%a8%e7%94%bb/" title="动画电影">动画</a></div>
     <div class="Btitle"><a href="/category.php?/%e5%96%9c%e5%89%a7" title="喜剧电影">喜剧</a></div>
     <div class="Btitle"><a href="/category.php?/%e6%83%8a%e6%82%9a" title="惊悚电影">惊悚</a></div>
     <div class="Btitle"><a href="/category.php?/%e5%86%92%e9%99%a9" title="冒险电影">冒险</a></div>
     

    So we only need to collect the div tags whose class attribute is Btitle and compare each tag's text against every entry in our target category list.

    Then we store the value of each matching href attribute so the crawler can visit the individual categories.

     
    Elements bTitles = document.select("div[class=Btitle]");
    for (Element each : bTitles) {
        Element thisCategory = each.select("a").first();
        ArrayList<String> categories = configLoader.getCategories();
        if (!categories.contains(thisCategory.text()))
            continue;
        targetCategoriesSubURLs.add(thisCategory.attr("href"));
        System.out.println("[o] \tCategory [" + thisCategory.text() + "] spotted");
    }
     

    After the category links have been collected, the crawler starts collecting the movies inside each category:  parseCategoryPage(); 

    This step filters out unwanted movies (with this configuration, movies from mainland China) and drops a movie that appears under more than one category.

    The parseCategoryPage() method looks like this:

    void parseCategoryPage() {
        System.out.println("[o] You are banning films from " + configLoader.getRegions_banned().toString());
        System.out.println("[o] You want to dig into each category with the depth of: " + configLoader.getDepth());
        if (configLoader.getDepth() < 1) {
            System.out.println("[x] Depth can not be smaller than 1");
            System.exit(0);
        } else if (configLoader.getDepth() > 5) {
            System.out.println("[i] Depth may be too large");
        }
        System.out.print("[o] Collecting films into each category...");
        int counterDuplicated = 0;
        int counterBanned = 0;
        boolean isBanned;
        boolean isDuplicated;
        for (String each : targetCategoriesSubURLs) {
            // Visit page 1 .. depth of every selected category
            for (int i = 1; i < configLoader.getDepth() + 1; i++) {
                try {
                    URL url = new URL(HOST + each + i);
                    Document document = Jsoup.parse(url, 5000);
                    Elements filmTitles = document.select("div[class=title]");
                    for (Element eachFilmTitle : filmTitles) {
                        if (!"".equals(eachFilmTitle.select("font").text())) {
                            isBanned = false;
                            isDuplicated = false;
                            // Ban films whose description mentions a banned region
                            for (String eachBannedLocation : configLoader.getRegions_banned()) {
                                if (eachFilmTitle.select("p[class=des]").text().contains(eachBannedLocation)) {
                                    counterBanned++;
                                    isBanned = true;
                                }
                            }
                            // Drop films already collected under another category
                            for (String eachValidFileTitle : validFilmTitles) {
                                if (eachFilmTitle.select("font").text().contains(eachValidFileTitle)) {
                                    counterDuplicated++;
                                    isDuplicated = true;
                                }
                            }
                            if (!isBanned && !isDuplicated) {
                                validFilmTitles.add(eachFilmTitle.select("font").text());
                                validFilmSubURLs.add(eachFilmTitle.select("a").first().attr("href"));
                            }
                        }
                    }
                } catch (MalformedURLException e) {
                    System.out.println(" [x] Internal error: MalformedURL");
                } catch (IOException e) {
                    System.out.println(" [x] An error occurred when trying to read the page");
                }
            }
        }
        System.out.println("OK");
        System.out.println("[o] " + counterBanned + " films banned");
        System.out.println("[o] " + counterDuplicated + " films dropped due to duplication");
        parseFilmPage();
    }

    Note that the crawler first collects the page navigation bar at the bottom of the category page, and then uses a loop to visit pages according to the configured depth.

    With the depth set to 2 here, it visits the first and the second page.

    In the page source, the navigation bar at the bottom of a category page looks like this:

     
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/1/'>首页</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/-1/'>上一页</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/1/'>1</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/2/'>2</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/3/'>3</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/4/'>4</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/5/'>5</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/6/'>6</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/7/'>7</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/8/'>8</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/9/'>9</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/10/'>10</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/11/'>11</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/1/'>下一页</a></li>
    <li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/102/'>末页</a></li>
     

    The parsing works the same way as above.
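
    That navigation-bar parsing isn't reproduced in the post; a hedged sketch of how the last page number could be read from those <li> links (the selector and variable names are assumptions about the surrounding code) might look like this:

    // Hypothetical: read the page count from the bottom navigation bar so that the
    // crawl never requests pages that do not exist, even if the configured depth is larger.
    int lastPage = 1;
    Elements pageLinks = document.select("li > a");
    for (Element link : pageLinks) {
        String text = link.text().trim();
        if (text.matches("\\d+"))                       // keep only the numeric entries: "1", "2", ...
            lastPage = Math.max(lastPage, Integer.parseInt(text));
    }
    int pagesToVisit = Math.min(configLoader.getDepth(), lastPage);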

    Once the page links of all qualifying movies have been collected, we visit each movie's detail page and pick the best torrent to download.

    This is done by calling the parseFilmPage() method.

    Each movie detail page contains a list of torrents in different definitions (720p, 1080p, BluRay 720p, BluRay 1080p and so on). To avoid sacrificing quality while also not filling up the hard drive, the configuration file sets the preferred definition to 1080p.

    In the page source the torrent list looks like this; it is parsed the same way as above...

     
    <div class="tinfo">
    <a href="/download.php?n=%E6%B4%9B%E5%9F%8E%E5%B1%A0%E6%89%8Bbt%E7%A7%8D%E5%AD%90%E4%B8%8B%E8%BD%BD.720p%E9%AB%98%E6%B8%85.torrent&temp=yes&id=27808&uhash=b57db4fed7d35c8d0924033f" title="【720p高清】洛城屠手 /L.A. Slasher .2015.1.02GBBT种子下载" target="_blank"><p class="torrent"><img border="0" src="/style/torrent.gif" style="vertical-align:middle" alt="">【720p高清】洛城屠手<i>/L.A. Slasher</i>.2015.<em>1.02GB</em>.torrent</p></a>
    <ul class="btTree treeview"><li><span class="file"><font color="#999">本torrent文件由BT天堂(www.BTtiantang.com)提供!</font></span></li><li><span class="video">L.A.Slasher.2015.720p.BluRay.H264.AAC-RARBG.mp4<small>1.02GB</small></span></li><li><span class="file">L.A.Slasher.2015.720p.BluRay.H264.AAC-RARBG.nfo<small>3.97KB</small></span></li><li><span class="video">RARBG.mp4<small>992.93KB</small></span></li></ul>
    </div>
    <div class="tinfo">
    <a href="/download.php?n=%E6%B4%9B%E5%9F%8E%E5%B1%A0%E6%89%8Bbt%E7%A7%8D%E5%AD%90%E4%B8%8B%E8%BD%BD.1080p%E9%AB%98%E6%B8%85.torrent&temp=yes&id=27808&uhash=04645321cb7afdbdee192d1d" title="【1080p高清】洛城屠手 /L.A. Slasher .2015.1.61GBBT种子下载" target="_blank"><p class="torrent"><img border="0" src="/style/torrent.gif" style="vertical-align:middle" alt="">【1080p高清】洛城屠手<i>/L.A. Slasher</i>.2015.<em>1.61GB</em>.torrent</p></a>
    <ul class="btTree treeview"><li><span class="file"><font color="#999">本torrent文件由BT天堂(www.BTtiantang.com)提供!</font></span></li><li><span class="video">L.A.Slasher.2015.1080p.BluRay.H264.AAC-RARBG.mp4<small>1.61GB</small></span></li><li><span class="file">L.A.Slasher.2015.1080p.BluRay.H264.AAC-RARBG.nfo<small>3.97KB</small></span></li><li><span class="video">RARBG.mp4<small>992.93KB</small></span></li></ul>
    </div>
     

    This page only offers 720p and 1080p torrents. When several definitions are available, the crawler breaks out of the loop over the torrent list as soon as it finds the preferred definition (1080p here); otherwise it downloads the highest-definition one, which ends up being the last entry in the list.

     
    for (Element eachBtFileLink : btFileLinks) {
        Element info = eachBtFileLink.select("span[class=video]").first();
        if (info.text().contains(configLoader.getDefinition())) {
            // Preferred definition found, stop looking
            targetBtFileLinkSuffix = eachBtFileLink.select("a").first().attr("href");
            break;
        }
        // Otherwise keep overwriting the fallback with the link seen last
        targetBtFileLinkSuffix = eachBtFileLink.select("a").first().attr("href");
    }
     
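
    The fallback above simply keeps the last link it has seen, which works because, judging from the sample markup, the site lists torrents from lower to higher definition. If that ordering could not be relied on, the highest resolution could be picked explicitly instead; a rough alternative sketch (not the actual implementation, and it additionally needs java.util.regex.Pattern and Matcher) might be:

    // Hypothetical alternative: take the preferred definition if present,
    // otherwise explicitly keep the entry with the largest "NNNp"/"NNNNp" resolution.
    String bestSuffix = null;
    int bestResolution = -1;
    for (Element eachBtFileLink : btFileLinks) {
        String text = eachBtFileLink.select("span[class=video]").first().text();
        String href = eachBtFileLink.select("a").first().attr("href");
        if (text.contains(configLoader.getDefinition())) {
            bestSuffix = href;                              // preferred definition wins outright
            break;
        }
        Matcher m = Pattern.compile("(\\d{3,4})p").matcher(text);
        if (m.find() && Integer.parseInt(m.group(1)) > bestResolution) {
            bestResolution = Integer.parseInt(m.group(1)); // keep the highest resolution seen so far
            bestSuffix = href;
        }
    }
    targetBtFileLinkSuffix = bestSuffix;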

    Every time a target torrent is found, the crawler immediately visits its download page.

    By intercepting the POST request and inspecting the page source, we find that every torrent download page carries two values, arcid and uhash, and both must be included in the POST body.

    First we collect the values of these two from the page:

     
    String arcid = null;
    String uhash = null;
    boolean hasArcid = false;
    boolean hasUhash = false;
    while ((temp = bufferedReader.readLine()) != null) {
        // The download page embeds the two values as JavaScript variables: var _arcid = "...", var _uhash = "..."
        if (temp.contains("var _arcid")) {
            arcid = temp.substring(temp.indexOf("\"") + 1, temp.lastIndexOf("\""));
            hasArcid = true;
        }
        if (temp.contains("var _uhash")) {
            uhash = temp.substring(temp.indexOf("\"") + 1, temp.lastIndexOf("\""));
            hasUhash = true;
        }
    }
     

    Once both values have been captured, we can submit the POST request and download the torrent file:

     
    if (hasArcid && hasUhash) {
        URL requestUrl = new URL(REQUEST_URL);
        HttpURLConnection httpUrlConnection = (HttpURLConnection) requestUrl.openConnection();
        httpUrlConnection.setInstanceFollowRedirects(false);
        httpUrlConnection.setDoOutput(true);
        httpUrlConnection.setRequestMethod("POST");
        // Form body the site expects: the action plus the arcid and uhash collected above
        String OUTPUT_DATA =
                "action=download" +
                "&id="            +
                arcid             +
                "&uhash="         +
                uhash;
        OutputStreamWriter outputStreamWriter = new OutputStreamWriter(
                httpUrlConnection.getOutputStream());
        outputStreamWriter.write(OUTPUT_DATA);
        outputStreamWriter.flush();
        outputStreamWriter.close();
        System.out.print("\n");
        System.out.print("[o] Downloading torrent files...(" + i + "/" + validFilmSubURLs.size() + ")");
        // Stream the response body into a .torrent file named after the uhash
        File file = new File(uhash + ".torrent");
        InputStream inputStream = httpUrlConnection.getInputStream();
        FileOutputStream fileOutputStream = new FileOutputStream(file);
        byte[] buffer = new byte[1024];
        int length;
        while ((length = inputStream.read(buffer)) != -1) {
            fileOutputStream.write(buffer, 0, length);
            fileOutputStream.flush();
        }
        fileOutputStream.close();
        httpUrlConnection.disconnect();
    }
     

    At this point the torrent-collecting module has taken its initial shape.

    Tags: Java, HTML, GitHub