• Python Notes


Over the past few days I've been looking into how a crawler can efficiently fetch and download audio from an audio site. To be honest, at first I didn't even know coroutines were a thing~~~ I also used to think that any method for crawling a site was as good as another, as long as it worked. Ever since I discovered coroutines, though, I can't put them down~~~~~

OK, enough rambling~ on to the main topic:

First, let's talk about what multithreading is. Many online tutorials give this example (it involves a class and a queue):

     

    #Example.py
    '''
    Standard Producer/Consumer Threading Pattern
    '''

    import time
    import threading
    import Queue

    class Consumer(threading.Thread):
        def __init__(self, queue):
            threading.Thread.__init__(self)
            self._queue = queue

        def run(self):
            while True:
                # queue.get() blocks the current thread until
                # an item is retrieved.
                msg = self._queue.get()
                # Checks if the current message is
                # the "Poison Pill"
                if isinstance(msg, str) and msg == 'quit':
                    # if so, exits the loop
                    break
                # "Processes" (or in our case, prints) the queue item
                print "I'm a thread, and I received %s!!" % msg
            # Always be friendly!
            print 'Bye byes!'


    def Producer():
        # Queue is used to share items between
        # the threads.
        queue = Queue.Queue()

        # Create an instance of the worker
        worker = Consumer(queue)
        # start calls the internal run() method to
        # kick off the thread
        worker.start()

        # variable to keep track of when we started
        start_time = time.time()
        # While under 5 seconds..
        while time.time() - start_time < 5:
            # "Produce" a piece of work and stick it in
            # the queue for the Consumer to process
            queue.put('something at %s' % time.time())
            # Sleep a bit just to avoid an absurd number of messages
            time.sleep(1)

        # This is the "poison pill" method of killing a thread.
        queue.put('quit')
        # wait for the thread to close down
        worker.join()


    if __name__ == '__main__':
        Producer()

This example reminds me of my junior-year course project, except that one implemented the producer-consumer pattern in Java~~~ damn Java

First, you need a worker class; second, you need a queue to pass objects between threads. And yes, if you need two-way communication, you have to add yet another queue.
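
Before moving on, here is a minimal Python 2 sketch of that two-way setup, one queue carrying tasks in and a second carrying results back; the variable names and the "doubling" work are mine, not from any tutorial:

    # Two queues: one carries tasks to the worker, the other carries results back.
    import threading
    import Queue

    def worker(task_queue, result_queue):
        while True:
            task = task_queue.get()
            if task == 'quit':          # poison pill, same trick as above
                break
            result_queue.put(task * 2)  # "process" the task and send the answer back

    task_queue, result_queue = Queue.Queue(), Queue.Queue()
    t = threading.Thread(target=worker, args=(task_queue, result_queue))
    t.start()

    for n in range(5):
        task_queue.put(n)
    task_queue.put('quit')
    t.join()

    while not result_queue.empty():
        print result_queue.get()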

Next, we need a pool of worker threads, to speed up web retrieval with multiple threads~~~

    #Example2.py
    '''
    A more realistic thread pool example
    '''

    import time
    import threading
    import Queue
    import urllib2

    class Consumer(threading.Thread):
        def __init__(self, queue):
            threading.Thread.__init__(self)
            self._queue = queue

        def run(self):
            while True:
                content = self._queue.get()
                if isinstance(content, str) and content == 'quit':
                    break
                response = urllib2.urlopen(content)
            print 'Bye byes!'


    def Producer():
        urls = [
            'http://www.python.org', 'http://www.yahoo.com',
            'http://www.scala.org', 'http://www.google.com'
            # etc..
        ]
        queue = Queue.Queue()
        worker_threads = build_worker_pool(queue, 4)
        start_time = time.time()

        # Add the urls to process
        for url in urls:
            queue.put(url)
        # Add the poison pill
        for worker in worker_threads:
            queue.put('quit')
        for worker in worker_threads:
            worker.join()

        print 'Done! Time taken: {}'.format(time.time() - start_time)

    def build_worker_pool(queue, size):
        workers = []
        for _ in range(size):
            worker = Consumer(queue)
            worker.start()
            workers.append(worker)
        return workers

    if __name__ == '__main__':
        Producer()

Dreadful. I can't keep writing this, it's too complicated.

Let's switch to another approach~~~ map

    urls = ['http://www.yahoo.com', 'http://www.reddit.com']
    results = map(urllib2.urlopen, urls)

Yes, you read that right: that's all it takes. map walks through urls, calls urllib2.urlopen on each one, and results holds the returned values as a list.

The built-in map itself still runs the requests one after another, but the same map interface is exactly what lets a thread pool run them in parallel, as the next example shows.

With too many threads, the time wasted switching between them can even exceed the actual working time, so for a real workload the most sensible way to find a suitable thread pool size is simply to experiment (see the small timing sketch after the example below).

    import urllib2
    from multiprocessing.dummy import Pool as ThreadPool

    urls = [
        'http://www.python.org',
        'http://www.python.org/about/',
        'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
        'http://www.python.org/doc/',
        'http://www.python.org/download/',
        'http://www.python.org/getit/',
        'http://www.python.org/community/',
        'https://wiki.python.org/moin/',
        'http://planet.python.org/',
        'https://wiki.python.org/moin/LocalUserGroups',
        'http://www.python.org/psf/',
        'http://docs.python.org/devguide/',
        'http://www.python.org/community/awards/'
        # etc..
        ]

    # Make the Pool of workers
    pool = ThreadPool(4)
    # Open the urls in their own threads
    # and return the results
    results = pool.map(urllib2.urlopen, urls)
    # Close the pool and wait for the work to finish
    pool.close()
    pool.join()

The crucial part is a single line: pool.map easily replaces the 40-odd lines of the complicated example earlier.
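
Following up on the point about finding a suitable pool size by trial: here is a minimal timing sketch (Python 2, to match the example above); the URL list and candidate sizes are just placeholders I made up:

    # pool_size_test.py - rough timing of different thread pool sizes
    import time
    import urllib2
    from multiprocessing.dummy import Pool as ThreadPool

    urls = ['http://www.python.org'] * 16  # placeholder workload

    for size in [1, 2, 4, 8]:
        pool = ThreadPool(size)
        start = time.time()
        pool.map(urllib2.urlopen, urls)
        pool.close()
        pool.join()
        print 'pool size %d: %.2f seconds' % (size, time.time() - start)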

After all that, I still haven't gotten to the thing I actually wanted to talk about~~~ speechless

OK, here we go~~~

I spent half a day studying the Ximalaya audiobook site and managed to write a small single-process crawler that downloads every audio resource of a chosen category~~~~ Sounds cool, but crawling was really slow, so I added coroutines and the speed jumped up quite a bit; essentially they cut down the time spent waiting on requests. I added two coroutine blocks:

    # First coroutine
    def get_url():
        try:
            start_urls = ['http://www.ximalaya.com/dq/book-%E6%82%AC%E7%96%91/{}/'.format(pn) for pn in range(1,85)]
            print(start_urls)
            urls_list = []
            for start_url in start_urls:
                urls_list.append(start_url)
                # response = fetch_url_text(start_url)
                # soup  = BeautifulSoup(response,'lxml')
                # print(soup)
                # break
                # for item in soup.find_all('div',class_='albumfaceOutter'):
                #     print(item)
                #     href = item.a['href']
                #     title = item.img['alt']
                #     img_url = item.img['src']
                #     # print(title)
                #     content = {
                #         'href':href,
                #         'title':title,
                #         'img_url':img_url
                #     }
                #     get_mp3(href,title)
            # Coroutine block 1: fetch all listing pages concurrently
            jobs = [gevent.spawn(fetch_url_text, url) for url in urls_list]
            gevent.joinall(jobs)
            for response in [job.value for job in jobs]:
                soup = BeautifulSoup(response, 'lxml')
                for item in soup.find_all('div', class_='albumfaceOutter'):
                    print(item)
                    href = item.a['href']
                    title = item.img['alt']
                    img_url = item.img['src']
                    # print(title)
                    content = {
                        'href': href,
                        'title': title,
                        'img_url': img_url
                    }
                    get_mp3(href, title)
        except Exception as e:
            print(e)
        return ''

The second coroutine:

    def get_mp3(url, title):
        response = fetch_url_text(url)
        num_list = etree.HTML(response).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
        print(num_list)

        mkdir(title)
        os.chdir('F:\\xmly\\' + title)
        ii = 1
        list_ids = []
        for id in num_list:
            list_ids.append(id)
            # print(id)
            # # json_url = 'http://www.ximalaya.com/tracks/{}.json'.format(id)
            # html = fetch_json(id)
            # # print(html)
            # mp3_url = html.get('play_path')
            # # print(mp3_url)
            # # download(mp3_url)
            # content = requests.get(mp3_url, headers=headers).content
            # name = title + '_%i.m4a'%ii
            # with open(name, 'wb') as file:
            #     file.write(content)
            # print("{} download is ok".format(mp3_url))
            # ii+=1
        # Coroutine block 2: fetch all track metadata concurrently
        jobs = [gevent.spawn(fetch_json, id) for id in list_ids]
        gevent.joinall(jobs)
        for html in [job.value for job in jobs]:
            mp3_url = html.get('play_path')
            content = requests.get(mp3_url, headers=headers).content
            name = title + '_%i.m4a' % ii
            with open(name, 'wb') as file:
                file.write(content)
            print("{} download is ok".format(mp3_url))
            ii += 1
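
The helpers fetch_url_text, fetch_json and mkdir are used above but never shown in the post. Here is a rough sketch of what they might look like, assuming requests and a simple headers dict (the tracks/{}.json URL comes from the commented-out code above; everything else is my guess). With gevent's monkey.patch_all() applied, as in the intro example further down, these blocking calls become cooperative:

    # Hypothetical sketch of the helpers referenced above - not the author's original code.
    import os
    import requests

    headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; the post does not show the real headers

    def fetch_url_text(url):
        # Fetch a page and return its HTML text.
        return requests.get(url, headers=headers).text

    def fetch_json(sound_id):
        # Track metadata endpoint seen in the commented-out code above.
        json_url = 'http://www.ximalaya.com/tracks/{}.json'.format(sound_id)
        return requests.get(json_url, headers=headers).json()

    def mkdir(path):
        # Create the download directory if it does not exist yet.
        if not os.path.isdir(path):
            os.makedirs(path)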

The code is pretty redundant~~~ take it as a reference only; after all, I just learned this and I'm applying it as I learn~~~~

Honestly, I still feel it's not fast enough.

I could add the multiprocessing module on top of this and the speed might go up again. And what if I made it distributed? Wouldn't that crawl the whole site's content in a matter of minutes? Haha.
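
For what it's worth, a hypothetical outline of the multiprocessing idea: split the 84 listing pages across a few processes, assuming the crawling logic from get_url() were refactored into a function that takes a list of page numbers (the original takes no arguments, so this is only a sketch):

    # Hypothetical outline only: split the listing pages across processes.
    from multiprocessing import Pool

    def crawl_pages(page_numbers):
        # Would do what get_url() does, but only for the given pages.
        start_urls = ['http://www.ximalaya.com/dq/book-%E6%82%AC%E7%96%91/{}/'.format(pn)
                      for pn in page_numbers]
        for url in start_urls:
            pass  # fetch the page, parse the albums, call get_mp3() - as in get_url() above

    if __name__ == '__main__':
        pages = list(range(1, 85))
        chunks = [pages[i::4] for i in range(4)]  # 4 roughly equal chunks, one per process
        pool = Pool(4)
        pool.map(crawl_pages, chunks)
        pool.close()
        pool.join()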

OK, enough daydreaming. I only have one laptop, and the lab has just two servers, both busy training models, so I won't add to their load~~~~~

I still haven't said what a coroutine actually is~~~~~

A coroutine can be thought of as a green thread, or micro-thread. Its effect is that while executing function_a() you can interrupt it at any point to run function_b(), then interrupt that and resume function_a(), switching back and forth freely. The whole thing looks like multithreading, yet only one thread is actually executing; gevent's monkey-patching API is what makes the I/O asynchronous~~~~ pretty magical, isn't it?

    # -*- coding: utf-8 -*-
    # version: 2.7
    import gevent
    from gevent import monkey; monkey.patch_all()
    import urllib2

    def get_body(i):
        print "start", i
        urllib2.urlopen("http://cn.bing.com")
        print "end", i

    tasks = [gevent.spawn(get_body, i) for i in range(3)]
    gevent.joinall(tasks)

This simple example should be easy to understand~~~~~

My takeaway: any program with a for loop where every iteration spends a lot of time waiting on requests can be rewritten with coroutines~~~~~
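
To make that concrete, a minimal before/after sketch, assuming requests is installed and using example.com as a stand-in URL (fetch, urls are names I made up):

    # Before/after sketch: turning a serial request loop into gevent greenlets.
    import gevent
    from gevent import monkey; monkey.patch_all()  # patch sockets first so requests cooperates with gevent
    import requests

    urls = ['http://example.com/page/%d' % i for i in range(10)]  # placeholder URLs

    def fetch(url):
        return requests.get(url).status_code

    # Before: one blocking request per loop iteration
    # results = [fetch(url) for url in urls]

    # After: spawn a greenlet per request, then wait for all of them
    jobs = [gevent.spawn(fetch, url) for url in urls]
    gevent.joinall(jobs)
    results = [job.value for job in jobs]
    print(results)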

My understanding may be off; corrections are welcome~~~~

To wrap up, a word on concurrency versus parallelism~~~~~

Concurrency is like having two queues and one coffee machine: the queues take turns using the machine.

Parallelism is like having two queues and two coffee machines: each queue uses its own, both at the same time.~~~~~

Pretty intuitive, isn't it~~~~~

Bye bye~~~~~~~~~~

  • Original post: https://www.cnblogs.com/BigStupid/p/7953707.html