• Python 爬虫知识点


    一、URL分析

      通过对“Python机器学习”结果抓包分析,有两个无规律的参数:_ksTS和callback。通过构建如下URL可以获得目标关键词的检索结果,如下所示:

    https://s.taobao.com/search?data-key=s&data-value=44&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=0

    https://s.taobao.com/search?data-key=s&data-value=88&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=44

    https://s.taobao.com/search?data-key=s&data-value=132&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=88

    https://s.taobao.com/search?data-key=s&data-value=176&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=132


    https://s.taobao.com/search?data-key=s&data-value=220&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=176


    https://s.taobao.com/search?data-key=s&data-value=264&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=220


    https://s.taobao.com/search?data-key=s&data-value=308&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=264


    https://s.taobao.com/search?data-key=s&data-value=352&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=Python机器学习&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=308

    二、关键字分析

    1、q查询关键词

    2、data-value显示记录数

    3、s上一页记录数

    4、s与data-value的差值即当页显示数量

    三、Python抓取数据

    #__author__ = 'Joker'
    # -*- coding:utf-8 -*-

    import re
    import urllib.request
    keyWord1 = "Python机器学习"
    keyWord2 = urllib.request.quote(keyWord1)
    headers = ("User-Agent","MMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.1708.400 QQBrowser/9.5.9635.400")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)
    for j in range(1,25):
    try:
    curPage = 44
    prePage = 0
    url = "https://s.taobao.com/search?data-key=s&data-value=" + str(
    curPage) + "&ajax=true&_ksTS=1482325509866_2527&callback=jsonp2528&q=" + keyWord2 + "&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20161221&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=.44&fs=1&s=" + str(
    prePage)
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    patTitle = '"title":"(.*?)","raw_title"'
    titles = re.compile(patTitle).findall(data)
    patRawTitle = '"raw_title":"(.*?)"'
    rawTitles = re.compile(patRawTitle).findall(data)
    patImage = '"pic_url":"//(.*?)","'
    rawImages = re.compile(patImage).findall(data)
    patPrice = '"view_price":"(.*?)","'
    rawPrices = re.compile(patPrice).findall(data)
    patNick = '"nick"(.*?)","'
    rawNicks = re.compile(patNick).findall(data)
    for i in range(0,len(titles)):
    print("-------------------")
    print("第" + str(j+1) + "页,第" + str(i+1) + "本" )
    #print(titles[i])
    print(rawTitles[i])
    print(rawImages[i])
    print(rawPrices[i])
    print(rawNicks[i])
    print("-------------------")
    prePage = 44 * j
    curPage = 44 + prePage
    except urllib.error.URLError as e:
    if hasattr(e,"code"):
    print(e.code)
    if hasattr(e,"reason"):
    print(e.reason)
    except Exception as e:
    print(e)
  • 相关阅读:
    NYOJ The Triangle
    max()和数组里面的max
    sizeof和strlen的区别和联系总结
    继BAT之后 第四大巨头是谁
    专注做好一件事
    编程技术面试的五大要点
    IBM面试记
    创业者,你为什么这么着急?
    硅谷创业教父Paul Graham:如何获得创业idea
    17家中国初创公司的失败史
  • 原文地址:https://www.cnblogs.com/defineconst/p/6209274.html
Copyright © 2020-2023  润新知