Scraping NetEase Images with Python

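A Python scraper that downloads the full set of images from a NetEase photo news gallery (the large version of each image is downloaded by default).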

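First, open any image in the gallery you want to download and copy its URL. Assign that URL to pic_url and run the script: it creates a folder named after the gallery title in the current directory and saves every image of the gallery into it.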

Windows separates directory paths with the backslash "\", but in Python source code the backslash is the escape character. Here are a few ways to handle Windows paths:

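1. Use forward slashes in the path: "c:/test.txt". With no backslash there is nothing to be misread as an escape sequence. (This is the approach used in this program.)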

2. Escape the backslash: "c:\\test.txt". Because the backslash is the escape character, two in a row ("\\") stand for a single literal backslash.

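3. Use a Python raw string: r"c:\test.txt". Prefixing a string literal with the letter r makes it a raw string, in which backslashes are not treated as escapes. (This program does not use this approach, because its path comes from a function's return value rather than from a literal.)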

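4. Call filepath = os.path.normcase(filepath). Here filepath is of type unicode. On Windows, normcase lowercases the path and converts forward slashes to backslashes; on other systems it returns the path unchanged. (The call appears in this program, but commented out.)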
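The four techniques above can be compared side by side. The following is a minimal sketch, not part of the original program: the variable names are made up for illustration, and it uses ntpath (the Windows flavour of os.path) so that the normcase behaviour can be observed on any operating system. It is written to run under both Python 2 and Python 3.

```python
import ntpath  # Windows implementation of os.path, importable on any platform

# 1. Forward slashes: nothing to escape.
forward = "c:/test.txt"

# 2. Escaped backslash: "\\" in the source is one backslash in the string.
escaped = "c:\\test.txt"

# 3. Raw string: backslashes are kept literally.
raw = r"c:\test.txt"

# What goes wrong without any of these: "\t" is read as a TAB character.
broken = "c:\test.txt"

# 4. normcase with Windows rules: lowercases and turns "/" into "\".
normalized = ntpath.normcase("C:/Test.txt")

print(escaped == raw)      # both spell the same path
print("\t" in broken)      # the unescaped version contains a TAB
print(normalized == raw)   # normcase produced the backslash form
```

All three comparisons print True. On a real Windows machine os.path is ntpath, so os.path.normcase would give the same result there.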

Here is the first version (Python 2):

# coding: utf-8
# Python 2 script: the print statements, reload(sys) and
# urllib.urlretrieve used below do not exist in this form in Python 3.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import re
import requests
import os
import urllib


def down_pic(url):
    pic_html = requests.get(url)
    if pic_html.status_code == 200:
        pic_html_text = pic_html.text
        #print pic_html.encoding
        #print pic_html_text

        # URLs of all images in the gallery; "oimg" is the large version
        pic_gallery_patt = r'"oimg": "(.+?\.jpg)"'
        # Gallery title, used as the folder name
        title_patt = r'<title>(.+?)</title>'
        # Image id, used as the file name
        img_name_patt = r'"id": "(.+?)",'

        img_text = re.findall(pic_gallery_patt, pic_html_text)
        title_name = re.findall(title_patt, pic_html_text)
        file_name = re.findall(img_name_patt, pic_html_text)

        # Build the folder path; the backslash must be escaped as '\\'
        curr_path = os.getcwd()
        curr_path = curr_path.replace('\\', '/')
        file_dir = curr_path + '/'
        if os.path.exists(file_dir):
            file_dir += title_name[0]
        file_dir += '/'
        # curr_path is str, title_name[0] is unicode
        #print type(file_dir)
        # A unicode path can be passed to mkdir() directly; Python converts
        # it to whatever the file system expects
        if not os.path.exists(file_dir):
            os.mkdir(file_dir)

        print 'Download started......'

        for dic in zip(img_text, file_name):
            # Text fetched by requests is unicode; encode() can convert it
            # to a UTF-8 encoded str if needed
            img_utf8_url = dic[0]
            # Build the path the image is saved to
            file_name_str = dic[1]
            file_type = ".jpg"
            # Concatenating unicode and str yields a unicode object
            filepath = file_dir + file_name_str + file_type
            print img_utf8_url, filepath
            # filepath is unicode; normcase would normalize it for the system
            #filepath = os.path.normcase(filepath)
            urllib.urlretrieve(img_utf8_url, filepath)

        print 'Download finished......'


if __name__ == '__main__':
    pic_url = r'http://news.163.com/photoview/00AP0001/37116.html?from=tj_day#p=96A9I01H00AP0001'
    down_pic(pic_url)
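
To see what the three regular expressions in down_pic extract, here is a small sketch that runs them against a made-up page fragment. The fragment, the example.com URLs and the ids are assumptions invented to fit the patterns; they are not copied from an actual NetEase page, whose markup may differ. The patterns are the ones from the script, with the dot before jpg escaped so that it matches only a literal dot. The sketch runs under both Python 2 and Python 3.

```python
import re

# Hypothetical page fragment, invented for illustration only.
sample_html = (
    '<title>Sample gallery</title>\n'
    '{"id": "AAA111", "oimg": "http://example.com/a.jpg"},\n'
    '{"id": "BBB222", "oimg": "http://example.com/b.jpg"},\n'
)

pic_gallery_patt = r'"oimg": "(.+?\.jpg)"'
title_patt = r'<title>(.+?)</title>'
img_name_patt = r'"id": "(.+?)",'

urls = re.findall(pic_gallery_patt, sample_html)
title = re.findall(title_patt, sample_html)[0]
ids = re.findall(img_name_patt, sample_html)

# Pair each URL with its target path, as the download loop does with zip().
targets = [(url, title + "/" + name + ".jpg") for url, name in zip(urls, ids)]
for url, path in targets:
    print(url + " -> " + path)
```

Note that zip() stops at the shorter list, so if a page yields more ids than image URLs (or the reverse), the extra entries are silently dropped and the pairing may be off; comparing len(urls) with len(ids) before downloading is a cheap sanity check.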

Original article: https://www.cnblogs.com/lkprof/p/3260013.html