• Python Web Scraping Notes (1)


    Based on Python 2.7

    GET vs. POST:

    url = "http://zzk.cnblogs.com"
    urllib.urlopen(url)                      # GET request
    
    data = urllib.urlencode({"k": "b"})
    urllib.urlopen(url, data)                # passing the encoded data as the second argument sends a POST request
    

    If the form's method shown in the developer tools is POST, the request must be sent with POST.
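A quick way to see the GET/POST distinction without touching the network is to inspect the request method. This sketch uses the Python 3 equivalents (the Python 2 `urllib` module was split into `urllib.request` and `urllib.parse` in Python 3); the URL is the one from the notes above:

```python
# Python 3 equivalents of the urllib calls above: urlencode moved to
# urllib.parse, urlopen/Request to urllib.request.
from urllib.parse import urlencode
from urllib.request import Request

url = "http://zzk.cnblogs.com"
data = urlencode({"k": "b"}).encode()   # POST bodies must be bytes in Python 3

get_req = Request(url)                  # no data attached -> GET
post_req = Request(url, data=data)      # data attached -> POST

print(get_req.get_method())             # GET
print(post_req.get_method())            # POST
```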

    urllib:

    import urllib
    import re
    
    response = urllib.urlopen("https://www.baidu.com")        # open the given page; returns a response object
    response_code = response.getcode()                        # get the HTTP status code
    response_body = response.read()                           # get the page content
    # save the content at a URL directly to a local file
    save = urllib.urlretrieve("https://www.baidu.com", filename="/home/guido/python/baidu.html")
    images = re.findall(r"src='(.*?\.jpg)'", response_body)   # extract data with a regular expression (note the escaped dot)
    urllib.urlretrieve(images[0], filename="/home/guido/python/baidu_image.jpg")
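The findall step above can be exercised offline. A minimal sketch with an invented HTML fragment (the dot in `.jpg` is escaped so it matches a literal dot):

```python
import re

# invented page fragment, just to exercise the pattern
html = "<img src='/img/logo.jpg'><p>text</p><img src='/img/cat.jpg'>"

images = re.findall(r"src='(.*?\.jpg)'", html)
print(images)   # ['/img/logo.jpg', '/img/cat.jpg']
```

The lazy `.*?` keeps each match inside a single `src` attribute here, but when pages mix extensions a stricter character class such as `[^']*` is safer.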
    

    Building a query string:

    import urllib
    params = urllib.urlencode({"t": "b", "w": "ios"})
    url = "http://zzk.cnblogs.com/s?" + params
    print(url)
    
    Output:
    http://zzk.cnblogs.com/s?t=b&w=ios
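In Python 3 the same helper lives in `urllib.parse`, and `parse_qs` inverts it. A small sketch with the same parameters:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = urlencode({"t": "b", "w": "ios"})
url = "http://zzk.cnblogs.com/s?" + params
print(url)    # http://zzk.cnblogs.com/s?t=b&w=ios

# parse_qs recovers the parameters from the query string
query = parse_qs(urlsplit(url).query)
print(query)  # {'t': ['b'], 'w': ['ios']}
```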
    

      

    urllib2:

    import urllib2
    
    url = "http://www.phpno.com"
    # forge the browser request headers
    send_headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "ASPSESSIONIDCCTRDBQT=OJNFDDEANPLCEFLECFILODNN; Hm_lvt_39dcd5bd05965dcfa70b1d2457c6dcae=1484820976,1484821014,1484821053; Hm_lpvt_39dcd5bd05965dcfa70b1d2457c6dcae=1484821053",
        "Host": "www.nm3dp.com",
        "Referer": "https://www.baidu.com/link?url=Q_AEn1rb05AX6miw616Tx5bIWILq5K_FpUQl_eyJ7TS&wd=&eqid=cb712bbf00052caf00000003588091e9",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
    }
    req = urllib2.Request(url, headers=send_headers)   # merge the forged headers into the request sent to the server
    r = urllib2.urlopen(req)
    print(r.read())
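In Python 3 the same header forging goes through `urllib.request.Request`, and the attached headers can be checked on the request object before anything is sent, so no network is needed. Note that `Request` stores header names via `str.capitalize()` (e.g. `"User-agent"`). Headers trimmed here for brevity:

```python
from urllib.request import Request

url = "http://www.phpno.com"
send_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36",
    "Referer": "https://www.baidu.com/",
}
req = Request(url, headers=send_headers)    # merge the forged headers into the request

print(req.get_method())                     # GET
print(req.get_header("User-agent"))         # Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36
```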

    Beautiful Soup

    import urllib
    from bs4 import BeautifulSoup
    
    response = urllib.urlopen("http://www.3jy.com/")
    html = response.read()

    Create a BeautifulSoup object:

    soup = BeautifulSoup(html)

    Pretty-print the contents of the soup object:

    print(soup.prettify())

    Find tags:

    soup.title
    soup.head
    soup.b
    soup.a
    

    Get attributes:

    soup.p.attrs
    

    Get the text:

    soup.p.string

    CSS selectors:

    soup.select('title')                            find by tag name
    
    soup.select('.sister')                          find by class name
    
    soup.select('#link1')                           find by id
    
    soup.select('p #link1')                         combined query (one quoted selector string)
    
    soup.select('head > title')                     direct-child query
    
    soup.select('a[class="sister"]')                find by attribute (note the inner double quotes)
    
    soup.p['class']                                 get the value of an attribute on a tag
    

    Indexing into the list returned by select() gives objects that support select() themselves, so queries can be chained:

    aa = soup.select('body')
    bb = aa[0]
    cc = bb.select('a[class="sister"]')
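All of the selector calls above can be tried against an inline document. This sketch assumes the `bs4` package is installed and uses an invented fragment modeled on the classic Beautiful Soup example:

```python
from bs4 import BeautifulSoup

# invented sample document for the selectors above
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story" id="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.select('title')[0].string)      # find by tag name -> The Dormouse's story
print(len(soup.select('.sister')))         # find by class name -> 2
print(soup.select('#link1')[0].string)     # find by id -> Elsie
print(soup.p['class'])                     # attribute value -> ['story']

# chaining: index into select()'s result list, then select again
body = soup.select('body')[0]
links = body.select('a[class="sister"]')
print([a.string for a in links])           # ['Elsie', 'Lacie']
```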
    
  • Original post: https://www.cnblogs.com/Guido-admirers/p/6307739.html