mechanize (1)


    While reading material recently on web crawlers and simulated logins, I came across this package.

    mechanize ['mekə.naɪz] literally means "to make mechanical/automatic", and the package lives up to its name: it automates browser-style web interaction.

    mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:

    • any URL can be opened, not just http:

    • mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().

    • Easy HTML form filling.

    • Convenient link parsing and following.

    • Browser history (.back() and .reload() methods).

    • The Referer HTTP header is added properly (optional).

    • Automatic observance of robots.txt.

    • Automatic handling of HTTP-Equiv and Refresh.

    In other words, mechanize.Browser and mechanize.UserAgentBase implement the urllib2.OpenerDirector interface, so any URL can be opened, not just HTTP ones.

    In addition, they offer a simpler way to configure user-agent features, without having to create a new OpenerDirector every time.

    On top of that you get form filling, link parsing and following, browser history and reload, Refresh handling, robots.txt observance, and so on.
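    Among the features above is automatic observance of robots.txt: mechanize performs this check for you on every request. As a minimal offline sketch of what the check itself involves, here is the equivalent using Python 3's stdlib urllib.robotparser (the rules and URLs below are invented for illustration):

```python
import urllib.robotparser

# Parse a made-up robots.txt that blocks everything under /private/.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch(useragent, url) is the check mechanize runs before each open().
print(rp.can_fetch("*", "http://example.com/index.html"))  # True
print(rp.can_fetch("*", "http://example.com/private/x"))   # False
```

    mechanize fetches and caches the real robots.txt itself; set_handle_robots(False), shown later, turns this check off.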

    import re
    import mechanize

    # (1) Instantiate a browser object.
    br = mechanize.Browser()
    # (2) Open a URL.
    br.open("http://www.example.com/")
    # (3) Follow the second link whose element text matches the regular
    # expression.
    response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
    assert br.viewing_html()
    # (4) The page title.
    print(br.title())
    # (5) Print the page's URL.
    print(response1.geturl())
    # (6) The response headers.
    print(response1.info())
    # (7) The response body.
    print(response1.read())
    # (8) Select the form named "order" in the body.  The Browser passes
    # through unknown attributes (including methods) to the selected HTMLForm.
    br.select_form(name="order")
    # (9) Assign a value to the "cheeses" control (the method here is
    # __setitem__).
    br["cheeses"] = ["mozzarella", "caerphilly"]
    # (10) Submit the current form.  Browser calls .close() on the current
    # response on navigation, so this closes response1.
    response2 = br.submit()
    # Print the currently selected form (don't call .submit() on this,
    # use br.submit()).
    print(br.form)
    # (11) Go back to the cheese shop (same data as response1).  The history
    # mechanism returns cached response objects, so we can still use the
    # response even though it was .close()d.
    response3 = br.back()
    response3.get_data()  # like .seek(0) followed by .read()
    # (12) Reload the page (fetches from the server).
    response4 = br.reload()
    # (13) List all the forms on the page.
    for form in br.forms():
        print(form)
    # .links() optionally accepts the keyword args of .follow_/.find_link().
    for link in br.links(url_regex="python.org"):
        print(link)
        br.follow_link(link)  # takes EITHER a Link instance OR keyword args
        br.back()

    This is an example given in the documentation; the basic explanation is provided in the code comments.
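    The comments about the Browser passing unknown attributes through to the selected HTMLForm, and about br["cheeses"] = ... going through __setitem__, can be illustrated with a toy sketch. The FakeBrowser and FakeForm classes below are invented for illustration only and are not mechanize's real classes:

```python
class FakeForm:
    """A stand-in for mechanize's HTMLForm: a named bag of controls."""
    def __init__(self, name):
        self.name = name
        self.controls = {}

    def __setitem__(self, key, value):
        self.controls[key] = value

    def __getitem__(self, key):
        return self.controls[key]


class FakeBrowser:
    """A stand-in for mechanize.Browser showing the delegation pattern."""
    def __init__(self):
        self.form = None

    def select_form(self, name):
        self.form = FakeForm(name)

    def __setitem__(self, key, value):
        # br["cheeses"] = [...] is forwarded to the selected form.
        self.form[key] = value

    def __getattr__(self, attr):
        # Unknown attributes (including methods) fall through to the form.
        return getattr(self.form, attr)


br = FakeBrowser()
br.select_form(name="order")
br["cheeses"] = ["mozzarella", "caerphilly"]
print(br.form["cheeses"])  # ['mozzarella', 'caerphilly']
print(br.name)             # 'order' -- resolved on the form via __getattr__
```

    This is why, in the documentation example, you can index and call form-related things directly on the Browser once a form is selected.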

    You may control the browser’s policy by using the methods of mechanize.Browser’s base class, mechanize.UserAgent. For example:

    Through the mechanize.UserAgent base class we can control the browser's policy. The code below, also an example from the documentation, shows how:

    import logging
    import sys

    import mechanize

    br = mechanize.Browser()
    # Explicitly configure proxies (Browser will attempt to set good defaults).
    # Note the userinfo ("joe:password@") and port number (":3128") are optional.
    br.set_proxies({"http": "joe:password@myproxy.example.com:3128",
                    "ftp": "proxy.example.com",
                    })
    # Add HTTP Basic/Digest auth username and password for HTTP proxy access.
    # (This is equivalent to using the "joe:password@..." form above.)
    br.add_proxy_password("joe", "password")
    # Add HTTP Basic/Digest auth username and password for website access.
    br.add_password("http://example.com/protected/", "joe", "password")
    # Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
    br.set_handle_equiv(False)
    # Ignore robots.txt.  Do not do this without thought and consideration.
    br.set_handle_robots(False)
    # Don't add a Referer (sic) header.
    br.set_handle_referer(False)
    # Don't handle Refresh redirections.
    br.set_handle_refresh(False)
    # Don't handle cookies.
    br.set_cookiejar()
    # Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
    # default: no need to do this unless you have some reason to use a
    # particular cookiejar).
    br.set_cookiejar(cj)
    # Log information about HTTP redirects and Refreshes.
    br.set_debug_redirects(True)
    # Log HTTP response bodies (i.e. the HTML, most of the time).
    br.set_debug_responses(True)
    # Print HTTP headers.
    br.set_debug_http(True)

    # To make sure you're seeing all debug output:
    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.INFO)

    # Sometimes it's useful to process bad headers or bad HTML:
    response = br.response()   # this is a copy of the response
    headers = response.info()  # currently, this is a mimetools.Message
    headers["Content-type"] = "text/html; charset=utf-8"
    response.set_data(response.get_data().replace("<!---", "<!--"))
    br.set_response(response)
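    The set_cookiejar() calls above accept any cookielib-style jar. As a minimal offline sketch of what such a jar does, here is Python 3's stdlib http.cookiejar with a faked response object (FakeResponse and the cookie values are invented for illustration; no network access happens):

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Just enough of an HTTP response for CookieJar.extract_cookies()."""
    def __init__(self, set_cookie):
        self._msg = email.message.Message()
        self._msg["Set-Cookie"] = set_cookie

    def info(self):
        return self._msg


jar = http.cookiejar.CookieJar()

# Pretend a response to this request carried a Set-Cookie header.
request = urllib.request.Request("http://example.com/")
response = FakeResponse("session=abc123; Path=/")
jar.extract_cookies(response, request)   # store the cookie in the jar
print([cookie.name for cookie in jar])   # ['session']

# On a later request to the same site, the jar adds the Cookie header back.
request2 = urllib.request.Request("http://example.com/page")
jar.add_cookie_header(request2)
print(request2.get_header("Cookie"))     # session=abc123
```

    This store-and-replay round trip is what mechanize does automatically on every navigation, which is why cookie handling is on by default.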

    In addition, there are several other modules for interacting with web pages that are similar to mechanize. As the documentation puts it, there are several wrappers around mechanize designed for functional testing of web applications.

    In the end, they are all wrappers around urllib2, so just pick whichever module you find most convenient!
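    Since these modules ultimately wrap urllib2 (urllib.request in Python 3), the configuration mechanize hides corresponds roughly to building an opener by hand. A sketch under that assumption, stdlib only; nothing is actually fetched here:

```python
import http.cookiejar
import urllib.request

# Roughly what mechanize sets up behind the scenes: an opener with
# cookie handling and a browser-like User-Agent header.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [("User-agent", "Mozilla/5.0 (example)")]

# Every open() call through this opener now sends the header and
# round-trips cookies -- e.g. opener.open("http://example.com/").
print([name for name, value in opener.addheaders])  # ['User-agent']
```

    mechanize layers form filling, link following, history, and robots.txt handling on top of exactly this kind of opener.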

  • Original post: https://www.cnblogs.com/CBDoctor/p/3855738.html