• Baidu Search Result Crawler


    1. Purpose

    Use a crawler script to crawl Baidu search results for a given keyword and collect the resulting link addresses and domain information.

    Can be combined with GHDB (Google Hacking Database) syntax,

    e.g.  inurl:php?id=

    2. Knowledge points

    2.1 Use the threading & Queue modules for multithreaded processing with a configurable thread count

    2.2 Use the BeautifulSoup & re modules to match and extract href links

    2.3 Use the requests module to issue web requests and obtain the real address behind each result link (r.url); see the short sketch after this list

    2.4 Baidu returns at most 76 result pages, so pn goes up to 750

    2.5 Results are written to text files, with domains deduplicated
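
    A minimal sketch of the r.url behavior mentioned in 2.3 (the wrapped link below is a hypothetical placeholder, not one taken from a real result page): requests follows Baidu's redirect automatically, so r.url ends up holding the real landing address.

    #coding=utf-8
    import requests

    headers = {'User-Agent': 'Mozilla/5.0'}
    # hypothetical redirect-style link as copied from a Baidu result page
    wrapped = 'https://www.baidu.com/link?url=xxxx'
    r = requests.get(url=wrapped, headers=headers, timeout=3)
    print r.url   # the real destination after the redirect chain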

    3. Crawler script

    #coding=utf-8
    
    import requests
    import re
    import Queue
    import threading
    from bs4 import BeautifulSoup as bs
    import os,sys,time
    
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    
    
    class BaiduSpider(threading.Thread):
        def __init__(self,queue):
            threading.Thread.__init__(self)
            self._queue = queue
        def run(self):
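            # each worker keeps pulling search-result-page URLs until the shared queue is drained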
            while not self._queue.empty():
                url = self._queue.get_nowait()
                try:
                    #print url
                    self.spider(url)
                except Exception,e:
                    print e
                    pass
    
        def spider(self,url):
        # without self, calling self.spider(url) raises: takes exactly 1 argument (2 given)
            r = requests.get(url=url,headers=headers)
            soup = bs(r.content,'lxml')
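            # keep only anchors that carry a data-click attribute but no class: these correspond to the organic result links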
            urls = soup.find_all(name='a',attrs={'data-click':re.compile('.'),'class':None})
            for url in urls:
                #print url['href']
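                # url['href'] is a Baidu redirect link; requests follows it, so new_r.url below is the real landing address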
                new_r = requests.get(url=url['href'],headers=headers,timeout=3)
                if new_r.status_code == 200 :
                    url_para = new_r.url
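                    # keep only scheme://host as the domain part of the URL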
                    url_index_tmp = url_para.split('/')
                    url_index = url_index_tmp[0]+'//'+url_index_tmp[2]
                    print url_para+'\n'+url_index
                    with open('url_para.txt','a+') as f1:
                        f1.write(url_para+'\n')
                    with open('url_index.txt','a+') as f2:
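                        # naive dedup: only append the domain if it is not already present in url_index.txt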
                        with open('url_index.txt', 'r') as f3:
                            if url_index not in f3.read():
                                f2.write(url_index+'\n')
                else:
                    print 'no access',url['href']
    
    def main(keyword):
        queue = Queue.Queue()
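        # Python 2: re-encode the console-encoded keyword as UTF-8 so it can be put into the search URL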
        de_keyword = keyword.decode(sys.stdin.encoding).encode('utf-8')
        print keyword
        # baidu max pages 76 , so pn=750 max
        for i in range(0,760,10):
            #queue.put('https://www.baidu.com/s?ie=utf-8&wd=%s&pn=%d'%(keyword,i))
            queue.put('https://www.baidu.com/s?ie=utf-8&wd=%s&pn=%d'%(de_keyword,i))
        threads = []
        thread_count = 4
        for i in range(thread_count):
            threads.append(BaiduSpider(queue))
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    
    if __name__ == '__main__':
        if len(sys.argv) != 2:
            print 'Usage: %s keyword' % sys.argv[0]
            sys.exit(-1)
        else:
            main(sys.argv[1])    
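
    Usage note (the filename baidu_spider.py below is only an assumed name for the script): run it under Python 2 with the keyword or dork as the single argument, e.g. python baidu_spider.py "inurl:php?id=". Result URLs are appended to url_para.txt and the deduplicated domains to url_index.txt in the working directory.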

    Result screenshot

    4. Points to optimize

    4.1 Support for multiple search engines

    4.2 Handling of multiple parameters

    4.3 Combination with payloads

    5. References

    5.1 ADO, ichunqiu: Python Security Tool Development and Application

    5.2 https://github.com/sharpdeep/CrawlerBaidu/blob/master/CrawlerBaidu.py
