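After fetching all the pictures for a single year, I couldn't resist taking it one step further: fetching every National Geographic Photo of the Day from all the years it has run. A quick search online showed that the Photo of the Day started in 2001, so the code can be extended as follows: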
Code
urltemplate = 'http://photography.nationalgeographic.com/ngs_pod_ext/searchPOD.jsp?month=%d&day=%d&year=%d&page='
urlList = [urltemplate % (month, day, year) for month in range(1, 13) for day in range(1, 32) for year in range(2001, 2010)]
At this point a problem shows up: some requests return HTTP Status 404, so the response has to be checked and pages that do not exist have to be skipped.
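Some of those 404 responses are probably self-inflicted: pairing every month with days 1 through 31 also produces dates that do not exist, such as February 30. A minimal sketch of building the list from real calendar dates only is shown here. It is written for Python 3 rather than the Python 2 used in this post, and the names `URL_TEMPLATE` and `build_url_list` are made up for the illustration.

```python
from datetime import date, timedelta

# Same query string as urltemplate in the post.
URL_TEMPLATE = ('http://photography.nationalgeographic.com/ngs_pod_ext/'
                'searchPOD.jsp?month=%d&day=%d&year=%d&page=')


def build_url_list(first_year=2001, last_year=2009):
    """Return one search URL per real calendar day in the given years."""
    urls = []
    current = date(first_year, 1, 1)
    end = date(last_year, 12, 31)
    while current <= end:
        urls.append(URL_TEMPLATE % (current.month, current.day, current.year))
        current += timedelta(days=1)
    return urls


if __name__ == '__main__':
    url_list = build_url_list()
    print(len(url_list), 'URLs, first:', url_list[0])
```

For 2001 through 2009 this gives 3287 URLs instead of the 3348 produced by the nested ranges, and they come out in chronological order. Real dates can still come back as 404, so the request itself needs handling either way.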
Code
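import urllib2
# this fragment runs inside the loop over urlList
req = urllib2.Request(url)
try:
    response = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    # HTTPError is the exception that carries a status code such as 404
    print url
    print "\n has error: ", e.code
    print "\n"
    continue
except urllib2.URLError, e:
    # no HTTP response at all, so there is only a reason, not a code
    print url
    print "\n could not be reached: ", e.reason
    print "\n"
    continue
The except clauses print the offending URL together with its error code, so the handling can be checked afterwards.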
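The same skip-on-error idea can be sketched in Python 3, where `urllib2` has been split into `urllib.request` and `urllib.error`. The function name `fetch_pages` and its `opener` parameter are assumptions made for this example; the parameter only exists so the loop can be tried with a stand-in instead of the real network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_pages(urls, opener=urlopen):
    """Yield (url, body) for each page that loads; report and skip the rest."""
    for url in urls:
        try:
            response = opener(url)
        except HTTPError as e:
            # The server replied, but with an error status such as 404.
            print(url, 'has error:', e.code)
            continue
        except URLError as e:
            # No HTTP reply at all (DNS failure, refused connection, ...).
            print(url, 'could not be reached:', e.reason)
            continue
        body = response.read()
        response.close()
        yield url, body
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first; otherwise the more general clause would swallow it and the status code would never be printed.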
The full code is as follows:
Code
#!/usr/bin/env python
#coding=utf-8
from urllib2 import urlopen, URLError, HTTPError
import re
import urllib

urltemplate = 'http://photography.nationalgeographic.com/ngs_pod_ext/searchPOD.jsp?month=%d&day=%d&year=%d&page='
urlList = [urltemplate % (month, day, year) for month in range(1, 13) for day in range(1, 32) for year in range(2001, 2010)]

# regex that captures the alt text and the src of a picture-of-the-day image
imgre = re.compile(r'<img alt="(?P<alt>[^"]*)" src="(?P<src>/staticfiles/NGS/Shared/StaticFiles/Photography/Images/POD/.+?-ga\.jpg)">', re.I|re.S)
# regex that cuts the page into "<img ...> ... </a>" fragments
p = re.compile('<img.+?>.+?</a>', re.I|re.S)

for url in urlList:
    # get page html; skip pages that cannot be fetched
    try:
        response = urlopen(url)
    except HTTPError, e:
        # only HTTPError carries a status code such as 404
        print url
        print "\n has error: ", e.code
        print "\n"
        continue
    except URLError, e:
        # no HTTP response at all, so there is only a reason, not a code
        print url
        print "\n could not be reached: ", e.reason
        print "\n"
        continue
    txt = response.read()
    response.close()
    for n in p.findall(txt):
        m1 = imgre.search(n)
        if m1 is not None:
            tmp = m1.group('src')
            imgurl = "http://photography.nationalgeographic.com" + tmp
            n1 = tmp.split("/")
            urllib.urlretrieve(imgurl, "D:\\National Geographic\\" + n1[-1])
Save the code above as a .py file and run it to fetch the National Geographic Photo of the Day images; it just takes a while. Remember to change the save directory on the last line. Goals for further improvement: 1. Build a good-looking UI version with PyQt4. 2.
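The extraction step of the script can also be tried on its own, without touching the network. The sketch below is in Python 3 and simplifies the original: it runs the image pattern over the whole page instead of first cutting it into `<img ...> ... </a>` fragments. `BASE_URL`, `IMG_RE`, `find_images` and the sample HTML are all assumptions made for the illustration; the sample is shaped to fit the pattern and is not copied from the real site.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://photography.nationalgeographic.com/'

# Pattern taken from imgre in the script, with the dot before "jpg" escaped.
IMG_RE = re.compile(
    r'<img alt="(?P<alt>[^"]*)" '
    r'src="(?P<src>/staticfiles/NGS/Shared/StaticFiles/Photography/Images/POD/.+?-ga\.jpg)">',
    re.I | re.S)


def find_images(html):
    """Return (alt text, absolute URL, file name) for every matching <img> tag."""
    found = []
    for match in IMG_RE.finditer(html):
        src = match.group('src')
        found.append((match.group('alt'), urljoin(BASE_URL, src), src.split('/')[-1]))
    return found


if __name__ == '__main__':
    # Hypothetical fragment, written only to exercise the pattern.
    sample = ('<a href="#"><img alt="Sample photo" '
              'src="/staticfiles/NGS/Shared/StaticFiles/Photography/Images/POD/s/sample-ga.jpg"></a>')
    for alt, url, name in find_images(sample):
        print(alt, url, name)
```

Using the named groups `alt` and `src` instead of positional ones keeps the code readable if the pattern is later reordered, and `urljoin` avoids the doubled slash that plain string concatenation produces when the base ends with `/` and the path starts with one.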