For the past few days I've been looking into how a crawler can efficiently access and download from an audio site. Honestly, at first I didn't even know coroutines were a thing~~~ I also used to think it didn't matter how you crawl a site, as long as it works. Ever since I discovered coroutines, though, I can't put them down~~~~~
OK, enough rambling~ on to the point:
First, let's talk about what multithreading is. Online tutorials usually give this example (involving a class and a queue):
#Example.py
'''
Standard Producer/Consumer Threading Pattern
'''

import time
import threading
import Queue

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            # queue.get() blocks the current thread until
            # an item is retrieved.
            msg = self._queue.get()
            # Checks if the current message is
            # the "Poison Pill"
            if isinstance(msg, str) and msg == 'quit':
                # if so, exits the loop
                break
            # "Processes" (or in our case, prints) the queue item
            print "I'm a thread, and I received %s!!" % msg
        # Always be friendly!
        print 'Bye byes!'


def Producer():
    # Queue is used to share items between
    # the threads.
    queue = Queue.Queue()

    # Create an instance of the worker
    worker = Consumer(queue)
    # start calls the internal run() method to
    # kick off the thread
    worker.start()

    # variable to keep track of when we started
    start_time = time.time()
    # While under 5 seconds..
    while time.time() - start_time < 5:
        # "Produce" a piece of work and stick it in
        # the queue for the Consumer to process
        queue.put('something at %s' % time.time())
        # Sleep a bit just to avoid an absurd number of messages
        time.sleep(1)

    # This is the "poison pill" method of killing a thread.
    queue.put('quit')
    # wait for the thread to close down
    worker.join()


if __name__ == '__main__':
    Producer()
This example reminds me of my junior-year course project, except that one implemented the producer-consumer pattern in Java~~~ damn Java
First, you need a boilerplate worker class (the Consumer above); second, you need a queue to pass objects between the threads. And yes, if you want two-way communication, you have to add yet another queue.
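For instance, a minimal sketch of that two-way setup, with one queue carrying tasks in and a second queue carrying results back (my own illustration, not part of the original tutorial), might look like this:

import threading
import Queue

def worker(task_queue, result_queue):
    # Pull tasks from one queue, push results onto the other.
    while True:
        task = task_queue.get()
        if task == 'quit':            # the same "poison pill" trick
            break
        result_queue.put(task * 2)    # "process" the task

task_queue, result_queue = Queue.Queue(), Queue.Queue()
t = threading.Thread(target=worker, args=(task_queue, result_queue))
t.start()

for i in range(3):
    task_queue.put(i)
task_queue.put('quit')
t.join()

while not result_queue.empty():
    print result_queue.get()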
Next, we need a pool of worker threads, so that web retrieval can be sped up with multiple threads~~~
#Example2.py
'''
A more realistic thread pool example
'''

import time
import threading
import Queue
import urllib2

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            content = self._queue.get()
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib2.urlopen(content)
        print 'Bye byes!'


def Producer():
    urls = [
        'http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com'
        # etc..
    ]
    queue = Queue.Queue()
    worker_threads = build_worker_pool(queue, 4)
    start_time = time.time()

    # Add the urls to process
    for url in urls:
        queue.put(url)
    # Add the poison pill
    for worker in worker_threads:
        queue.put('quit')
    for worker in worker_threads:
        worker.join()

    print 'Done! Time taken: {}'.format(time.time() - start_time)

def build_worker_pool(queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(queue)
        worker.start()
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()
Scary. I can't keep going, this is way too complicated.
Let's try another approach~~~ map
urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)
That's right, you read that correctly: that's all it takes. map applies urllib2.urlopen to each element of urls in turn and collects the return values, so results ends up as a list of responses.
On its own, the built-in map still runs the requests one after another; but the exact same map interface is what a thread pool exposes, and that is where the parallelism comes from, as the example below shows.
If there are too many threads, the time spent switching between them can even exceed the time spent doing real work, so for an actual workload the best way to find a suitable thread-pool size is simply to experiment.
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
]

# Make the Pool of workers
pool = ThreadPool(4)
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
# close the pool and wait for the work to finish
pool.close()
pool.join()
The key part is a single line: pool.map replaces the 40-odd lines of the complicated example above.
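And since the pool size is just an argument, the "experiment to find a good size" advice from above is easy to act on. Here is a rough sketch; the timings depend entirely on your network, so treat it as an experiment rather than a benchmark:

import time
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = ['http://www.python.org', 'http://www.yahoo.com', 'http://www.google.com'] * 3

for size in [1, 2, 4, 8]:
    pool = ThreadPool(size)
    start = time.time()
    pool.map(urllib2.urlopen, urls)     # same one-liner, different pool sizes
    pool.close()
    pool.join()
    print 'pool size %d: %.2f s' % (size, time.time() - start)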
After all that, I still haven't gotten to what I actually wanted to talk about~~~ speechless
OK, here I go~~~
I spent half a day studying the Ximalaya audiobook site and managed to write a small single-process crawler that downloads every audio file of a chosen category~~~~ Sounds cool, but honestly the crawling was really slow, so I added coroutines and the speed jumped quite a bit. Essentially they cut down the time spent waiting on requests. I added coroutines in two places.
# First coroutine block
def get_url():
    try:
        start_urls = ['http://www.ximalaya.com/dq/book-%E6%82%AC%E7%96%91/{}/'.format(pn) for pn in range(1, 85)]
        print(start_urls)
        urls_list = []
        for start_url in start_urls:
            urls_list.append(start_url)
        # The old serial version fetched and parsed each start_url one at a
        # time; the gevent block below fetches them all concurrently instead.
        jobs = [gevent.spawn(fetch_url_text, url) for url in urls_list]
        gevent.joinall(jobs)
        for response in [job.value for job in jobs]:
            soup = BeautifulSoup(response, 'lxml')
            for item in soup.find_all('div', class_='albumfaceOutter'):
                print(item)
                href = item.a['href']
                title = item.img['alt']
                img_url = item.img['src']
                content = {
                    'href': href,
                    'title': title,
                    'img_url': img_url
                }
                get_mp3(href, title)
    except Exception as e:
        print(e)
        return ''
The second coroutine:
def get_mp3(url, title):
    response = fetch_url_text(url)
    num_list = etree.HTML(response).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
    print(num_list)

    mkdir(title)
    os.chdir(os.path.join(r'F:\xmly', title))
    ii = 1
    list_ids = []
    for id in num_list:
        list_ids.append(id)
    # The old serial version requested http://www.ximalaya.com/tracks/{id}.json
    # for each id in turn; the gevent block below fetches them all concurrently.
    jobs = [gevent.spawn(fetch_json, id) for id in list_ids]
    gevent.joinall(jobs)
    for html in [job.value for job in jobs]:
        mp3_url = html.get('play_path')
        content = requests.get(mp3_url, headers=headers).content
        name = title + '_%i.m4a' % ii
        with open(name, 'wb') as file:
            file.write(content)
        print("{} download is ok".format(mp3_url))
        ii += 1
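The two functions above rely on a few helpers (fetch_url_text, fetch_json, mkdir) and a headers dict that I defined elsewhere. Roughly, they look something like the sketch below; the tracks/{id}.json URL and the play_path field come from my old serial version, and the actual page structure of ximalaya.com may well have changed since:

import os
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a normal browser

def fetch_url_text(url):
    # Fetch a page and return its HTML text.
    return requests.get(url, headers=headers).text

def fetch_json(sound_id):
    # Every sound id has a JSON description that contains the play_path.
    json_url = 'http://www.ximalaya.com/tracks/{}.json'.format(sound_id)
    return requests.get(json_url, headers=headers).json()

def mkdir(title):
    # Create a per-album folder under F:\xmly if it does not exist yet.
    path = os.path.join(r'F:\xmly', title)
    if not os.path.exists(path):
        os.makedirs(path)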
The code is pretty redundant~~~ take it with a grain of salt; after all, I only just learned this and am applying it as I learn~~~~
Honestly, I still feel it isn't fast enough.
We could throw the multiprocessing module on top of this, and the speed would probably go up again. And what if we went distributed? The whole site could be crawled in minutes, haha.
OK, enough daydreaming. I only have one laptop, and the lab has just two servers, both of which are busy training models, so I won't add to their burden~~~~~
Oh right, I still haven't said what a coroutine actually is~~~~~
A coroutine can be thought of as a green thread, or a micro-thread. The idea is that while function_a() is running, it can be suspended at any point to run function_b(), and then function_a() resumes from where it left off; the switching back and forth is free-form. The whole thing looks like multithreading, yet only a single thread is actually executing. In gevent it is the monkey-patching API that turns ordinary blocking calls into this asynchronous behaviour~~~~ magical, isn't it?
#! -*- coding:utf-8 -*-
# version: 2.7
import gevent
from gevent import monkey; monkey.patch_all()
import urllib2

def get_body(i):
    print "start", i
    urllib2.urlopen("http://cn.bing.com")
    print "end", i

tasks = [gevent.spawn(get_body, i) for i in range(3)]
gevent.joinall(tasks)
This simple example should be easy to follow~~~~~
My takeaway: any program with a for loop where every iteration spends a lot of time waiting on a request is a good candidate for coroutines~~~~~
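In other words, the transformation I keep applying boils down to this pattern (a generic sketch; do_request stands in for whatever blocking call sits in the loop body):

import gevent
from gevent import monkey; monkey.patch_all()
import urllib2

def do_request(url):
    # Placeholder for the slow, blocking call inside the original for loop.
    return urllib2.urlopen(url).read()

urls = ['http://cn.bing.com'] * 3

# Before: for url in urls: do_request(url)  -- each request blocks the next.
# After: spawn one greenlet per request, then wait for all of them once.
jobs = [gevent.spawn(do_request, url) for url in urls]
gevent.joinall(jobs)
results = [job.value for job in jobs]
print len(results)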
My understanding may well be off, so corrections are welcome~~~~
To wrap up, a word on concurrency versus parallelism~~~~~
Concurrency is like two queues sharing a single coffee machine, taking turns to use it.
Parallelism is like two queues with two coffee machines, one machine per queue, both running at the same time.~~~~~
Pretty intuitive, right?~~~~~
Bye bye~~~~~~~~~~