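I have been looking into web crawlers recently,
mainly in two versions: C# and Python.
First, the use case: our crawler acts as a game client. The game server is web-based, and clicking through it by hand every day gets tedious, so the plan is to write a web client.
The packets are captured with HttpWatch and then analyzed.
The Python part is feasibility-study code; nothing is encapsulated yet.
# Server request part: feasibility study, not yet encapsulated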
###########################################################
#
#
# iQSRobots
# Applies to: Python3 + T4
#
#
__author__ = "Eagle Zhao(eaglezzb@gmail.com)"
__version__ = "$Revision: 1.0 $"
__date__ = "$Date: 2011/11/15 21:57:19 $"
__copyright__ = "Copyright (c) 2011 Eagle"
__license__ = "iQS"
###########################################################
import urllib.parse
import httplib2

http = httplib2.Http()
url = 'http://ts2.travian.tw/dorf1.php'
# Login form fields, copied from the captured request
body = {'name': '小铃铛', 'password': '1838888', 's1': '登陆',
        'w': '1280:800', 'login': '1321368625'}
headers = {'Content-type': 'application/x-www-form-urlencoded'}
response, content = http.request(url, 'POST', headers=headers,
                                 body=urllib.parse.urlencode(body))
#print(urllib.parse.urlencode(body))
print(response)
# Send the cookie from the login response back on the next request
headers = {'Cookie': response['set-cookie']}
response, content = http.request(url, 'GET', headers=headers)
#print(content.decode('utf-8'))
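The request code above is not wrapped up yet. As a rough sketch of one way it could be encapsulated, the block below swaps httplib2 for the standard library (`urllib.request` plus `http.cookiejar`), so that the cookie is stored and resent automatically instead of being copied by hand. The helper names `build_login_request` and `make_session` are hypothetical, the form fields are simply taken from the capture above, and none of this has been tried against the real server.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = 'http://ts2.travian.tw/dorf1.php'


def build_login_request(name, password, login_token, url=LOGIN_URL):
    """Return a POST request carrying the login form (nothing is sent)."""
    # Field names are copied from the captured request; login_token is
    # a hypothetical parameter name for the captured 'login' value.
    form = {
        'name': name,
        'password': password,
        's1': '登陆',
        'w': '1280:800',
        'login': login_token,
    }
    # urlencode percent-encodes non-ASCII text, so the result is ASCII.
    data = urllib.parse.urlencode(form).encode('ascii')
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    return urllib.request.Request(url, data=data, headers=headers,
                                  method='POST')


def make_session():
    """Return an opener that keeps cookies between requests, and its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

With these two helpers, logging in and fetching the page would look roughly like `opener, jar = make_session()`, then `opener.open(build_login_request(...))`, then `opener.open(LOGIN_URL)`.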
# Parsing the HTML: HTMLParser did not work very well, so in the end I decided to use regular expressions
import re
import urllib.parse

# Read the page source from a local file
with open('re.xml', encoding='utf-8') as file:
    page = file.read()
building_farm = []
building_links = []
# Cut out the <map> block that lists the resource fields
m = re.search('<map name="rx" id="rx">.+?</map', page, re.S)
m_b = m.group()
buildings = re.findall('<area href="build.php.+?>', m_b)
# Parse each building
for building in buildings:
    # Get building link
    m = re.search('href="build.php.+?"', building)
    #print(building)
    link = m.group()[6:-1]
    #print(link)
    # Get building title
    m = re.search('title=".+?"', building)
    # e.g. <b>伐木場</b>||等級
    title = m.group()[7:-1]
    #print("title=",title)
    # Get level
    partsLevel = title.split()
    parttitle = title.split(';')
    #print("parts=",partsLevel)
    if len(partsLevel) == 1:
        level = 0
    else:
        title = partsLevel[0]
        level = int(partsLevel[1])
    #print("resource field type",parttitle[2][:-3])
    #print("resource field level",level)
    #print()
    # Add buildings info into list, eliminate duplicates
    #if not link in building_links:
    building_links.append(link)
    #link = urllib.parse.urljoin(host, link) # Convert to absolute link
    #test code start-===========
    link = urllib.parse.urljoin("http://ts2.travian.tw/", link) # Convert to absolute link
    #test code end-=============
    # The last number in the link is the field id
    id_pattern = re.compile(r'\d+')
    m = id_pattern.findall(link)
    idNum = m[-1]
    building_farm.append([idNum, parttitle[2][:-3], level, link])
    # Assumes the ids run 1, 2, 3, ... in page order
    print(building_farm[int(idNum) - 1])
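The parsing steps above could be collected into a single function. The sketch below is a variation, not the code above: it uses capture groups and `html.unescape` instead of fixed string slices, and it returns an empty list when the map is missing. The function name `parse_fields` and the `SAMPLE` markup are made up for illustration; the real page may be laid out differently, so treat the title format here as an assumption.

```python
import html
import re
import urllib.parse

# Made-up sample markup, only for illustration.
SAMPLE = (
    '<map name="rx" id="rx">\n'
    '<area href="build.php?id=1" '
    'title="&lt;b&gt;伐木場&lt;/b&gt;||等級 3" />\n'
    '<area href="build.php?id=2" '
    'title="&lt;b&gt;農場&lt;/b&gt;||等級 0" />\n'
    '</map>'
)


def parse_fields(page, base='http://ts2.travian.tw/'):
    """Return [field_id, name, level, absolute_link] for each field."""
    area_map = re.search(r'<map name="rx" id="rx">.+?</map', page, re.S)
    if area_map is None:
        return []
    fields = []
    for area in re.findall(r'<area href="build\.php.+?>', area_map.group()):
        link = re.search(r'href="(build\.php.+?)"', area).group(1)
        title = html.unescape(re.search(r'title="(.+?)"', area).group(1))
        # The name sits inside <b>...</b>; fall back to the whole title.
        name_match = re.search(r'<b>(.+?)</b>', title)
        name = name_match.group(1) if name_match else title
        # A trailing number, if present, is taken as the level.
        level_match = re.search(r'(\d+)\s*$', title)
        level = int(level_match.group(1)) if level_match else 0
        # The last number in the link is the field id.
        field_id = re.findall(r'\d+', link)[-1]
        fields.append(
            [field_id, name, level, urllib.parse.urljoin(base, link)])
    return fields


for field in parse_fields(SAMPLE):
    print(field)
```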
The code is a real mess.
Then I came to the coding style part of the Python 3 tutorial.
The most important points:
- Use 4-space indentation, and no tabs.
  4 spaces are a good compromise between small indentation (allows greater nesting depth) and large indentation (easier to read). Tabs introduce confusion, and are best left out.
- Wrap lines so that they don’t exceed 79 characters.
  This helps users with small displays and makes it possible to have several code files side-by-side on larger displays.
- Use blank lines to separate functions and classes, and larger blocks of code inside functions.
- When possible, put comments on a line of their own.
- Use docstrings.
- Use spaces around operators and after commas, but not directly inside bracketing constructs: a = f(1, 2) + g(3, 4).
- Name your classes and functions consistently; the convention is to use CamelCase for classes and lower_case_with_underscores for functions and methods. Always use self as the name for the first method argument (see A First Look at Classes for more on classes and methods).
- Don’t use fancy encodings if your code is meant to be used in international environments. Python’s default, UTF-8, or even plain ASCII work best in any case.
- Likewise, don’t use non-ASCII characters in identifiers if there is only the slightest chance people speaking a different language will read or maintain the code.
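To make these points concrete, here is a small sketch that follows them: 4-space indentation, blank lines between definitions, docstrings, a comment on its own line, spaces around operators and after commas, CamelCase for the class and lower_case_with_underscores for functions. The names `ResourceField` and `total_level` are made up for this example and are not part of the crawler above.

```python
class ResourceField:
    """A single resource field in a village."""

    def __init__(self, field_id, name, level=0):
        self.field_id = field_id
        self.name = name
        self.level = level

    def upgrade_label(self):
        """Return a short description of the next upgrade."""
        # The comment sits on a line of its own, above the code.
        return '{} -> level {}'.format(self.name, self.level + 1)


def total_level(fields):
    """Return the sum of the levels of all given fields."""
    return sum(field.level for field in fields)


fields = [ResourceField(1, 'Woodcutter', 3), ResourceField(2, 'Cropland')]
print(total_level(fields))
print(fields[0].upgrade_label())
```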