• Notes on using Charles to capture and scrape product data from a Walmart online store in the JD Daojia (京东到家) mini program (macOS)


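    1. Install and open Charles, then configure it.

    1.1 Toggle the Proxy -> macOS Proxy option to turn off the default macOS proxy setting.

    1.2 Open Proxy -> Proxy Settings; a dialog appears. Set the HTTP proxy port to 6666 (the default is usually 8888; you can choose your own).

    1.3 Open Proxy -> SSL Proxying Settings and add the domains to capture. You can add * to match all hosts.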
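    Before moving on to the phone, it can help to confirm that Charles is actually listening on the chosen port. The following is a minimal sketch, not part of the original write-up: the helper name port_is_open is made up for illustration, and it assumes Charles runs on the same Mac with the port 6666 from step 1.2.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 6666 is the port chosen in step 1.2; use 8888 if the default was kept.
    print("Charles proxy reachable:", port_is_open("127.0.0.1", 6666))
```

    If this prints False, the phone will not be able to use the proxy either, so the port setting is worth rechecking first.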

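    2. Configure the phone (iOS is used as the example).

    2.1 Tap the info (i) icon next to the connected Wi-Fi network; tap the last item, HTTP Proxy -> Configure Proxy; choose 'Manual' and enter the computer's IP address and the Charles port set earlier: 6666.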
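    Step 2.1 needs the computer's IP address on the local network. One way to look it up from Python is sketched below; this is an illustrative helper (local_ip is a made-up name), and it assumes the Mac and the phone are on the same Wi-Fi network. The address 8.8.8.8 is only used to let the operating system choose an outgoing interface.

```python
import socket


def local_ip():
    """Best-effort guess of this machine's LAN IPv4 address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only makes the OS
        # pick the interface it would use to reach that address.
        sock.connect(("8.8.8.8", 80))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, offline): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


print(local_ip())
```

    The printed address is the value to type into the phone's proxy 'Server' field, together with port 6666.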

     

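    3. Configure HTTPS capture.

    3.1 Because the requests to capture are HTTPS, a certificate must also be installed. Choose Help -> SSL Proxying -> Install Charles Root Certificate.

    3.2 Double-click the Charles certificate that was added on the computer and select 'Always Trust'.

    3.3 Install the certificate on the phone. Choose Help -> SSL Proxying -> Install Charles Root Certificate on a Mobile Device or Remote Browser. As prompted, visit chls.pro/ssl on the phone.

    3.4 Follow the prompts to install the certificate on the phone.

    3.5 Under General -> About -> Certificate Trust Settings, enable full trust for the certificate. (Charles decrypts HTTPS by presenting certificates signed by its own root certificate, so that root certificate has to be installed and trusted on the phone as well as on the computer.)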

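    4.1 Click the round (record) button to start capturing the phone's traffic.

    In this example, a Walmart store was selected and its store page was opened for data capture.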

     

    4.2 Analysis of the captured requests shows how the URL for fetching the product categories is put together:

    url1 = 'https://daojia.jd.com/client?lat=22.56705&lng=113.95371&city_id=1607&deviceToken=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&deviceId=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&channel=wx_xcx&platform=5.0.0&platCode=H5&appVersion=5.0.0&xcxVersion=3.6.2&appName=paidaojia&deviceModel=appmodel&functionId=station%2FgetStationDetail&isForbiddenDialog=false&isNeedDealError=false&isNeedDealLogin=false&body=%7B%22storeId%22%3A%2211653731%22%2C%22skuId%22%3A%22%22%2C%22orgCode%22%3A%2281372%22%2C%22activityId%22%3A%22%22%2C%22promotionType%22%3A%22%22%2C%22lgt%22%3A113.95371%2C%22lat%22%3A22.56705%7D&afsImg=&business='

    The body parameter, once URL-decoded, is:

    body = {"storeId":"11653731","skuId":"","orgCode":"81372","activityId":"","promotionType":"","lgt":113.95371,"lat":22.56705}
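    The decoding step can be reproduced with the standard library. The sketch below uses a trimmed copy of url1 that keeps only the functionId and body parameters (the trimming is for readability only), and the helper name decode_body is made up for illustration.

```python
import json
from urllib.parse import parse_qs, urlsplit


def decode_body(url):
    """Return the JSON object carried in the URL's `body` query parameter."""
    query = parse_qs(urlsplit(url).query)  # parse_qs also percent-decodes
    return json.loads(query["body"][0])


# url1 trimmed to the two parameters that matter here.
sample = (
    "https://daojia.jd.com/client?functionId=station%2FgetStationDetail"
    "&body=%7B%22storeId%22%3A%2211653731%22%2C%22skuId%22%3A%22%22%2C"
    "%22orgCode%22%3A%2281372%22%2C%22activityId%22%3A%22%22%2C"
    "%22promotionType%22%3A%22%22%2C%22lgt%22%3A113.95371%2C"
    "%22lat%22%3A22.56705%7D"
)
print(decode_body(sample))
```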

    The values in body do not affect fetching the categories, so sending a GET request to url1 is enough to get the data.

    import requests

    # url1 from step 4.2 (functionId=station/getStationDetail), which returns the store's categories
    url = 'https://daojia.jd.com/client?lat=22.56705&lng=113.95371&city_id=1607&deviceToken=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&deviceId=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&channel=wx_xcx&platform=5.0.0&platCode=H5&appVersion=5.0.0&xcxVersion=3.6.2&appName=paidaojia&deviceModel=appmodel&functionId=station%2FgetStationDetail&isForbiddenDialog=false&isNeedDealError=false&isNeedDealLogin=false&body=%7B%22storeId%22%3A%2211653731%22%2C%22skuId%22%3A%22%22%2C%22orgCode%22%3A%2281372%22%2C%22activityId%22%3A%22%22%2C%22promotionType%22%3A%22%22%2C%22lgt%22%3A113.95371%2C%22lat%22%3A22.56705%7D&afsImg=&business='
    ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    headers = {'User-Agent': ua}
    res = requests.get(url, headers=headers)
    print(res.text)  # the data returned by the server

    [Image: sample of the returned data]

    4.3 Further analysis shows how the URL for fetching the products under each category is put together:

    url2 = 'https://daojia.jd.com/client?lat=22.51424&lng=113.93068&city_id=1607&deviceToken=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&deviceId=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&channel=wx_xcx&platform=5.0.0&platCode=H5&appVersion=5.0.0&xcxVersion=3.6.2&appName=paidaojia&deviceModel=appmodel&functionId=storeIndexSearch%2FsearchByCategory&isForbiddenDialog=false&isNeedDealError=false&isNeedDealLogin=false&body=%7B%22storeId%22%3A%2211653731%22%2C%22orgCode%22%3A%2281372%22%2C%22skuId%22%3A%22%22%2C%22catIds%22%3A%5B%7B%22catId%22%3A%224644376%22%2C%22type%22%3A2%7D%5D%7D&afsImg=&business=undefined'

    The body parameter, once URL-decoded, is:

    body = {"storeId":"11653731","orgCode":"81372","skuId":"","catIds":[{"catId":"4644376","type":2}]}

    The catId values can be extracted from the data returned by url1; passing in a different catId returns the products in the corresponding category.
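    The article does not show the layout of the url1 response, so the following sketch makes no assumption about where the category IDs sit: it walks the whole parsed JSON and collects every value stored under a key named catId. The helper name collect_values and the field names in the sample (result, cateList, childCategoryList) are hypothetical stand-ins, not the real response format.

```python
def collect_values(node, key="catId"):
    """Recursively collect every value stored under `key` in nested JSON data."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, (str, int)):
                found.append(str(v))
            found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Hypothetical stand-in for json.loads(res.text); the real layout may differ.
sample = {"result": {"cateList": [
    {"catId": "4644375", "childCategoryList": [{"catId": "4644376"}]},
]}}
print(collect_values(sample))
```

    Because the walk collects IDs at every nesting level, the result may mix first-level and second-level categories; filtering by level would need the real field names from the captured response.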

    import json
    import requests
    from urllib.parse import quote
    
    def get_product(cateid2):  # cateid2: the ID of a second-level category
        body = {
            "storeId": "11653731",
            "orgCode": "81372",
            "skuId": "",
            "catIds": [{"catId": cateid2, "type": 2}]}
        body = json.dumps(body)  # serialize the dict to a JSON string
        body = quote(body)       # percent-encode it for use in the query string
        base_url = 'https://daojia.jd.com/client?lat=22.51424&lng=113.93068&city_id=1607&deviceToken=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&deviceId=b2e951ed-e72e-4a9a-b9ca-cd69348c3337&channel=wx_xcx&platform=5.0.0&platCode=H5&appVersion=5.0.0&xcxVersion=3.6.2&appName=paidaojia&deviceModel=appmodel&functionId=storeIndexSearch%2FsearchByCategory&isForbiddenDialog=false&isNeedDealError=false&isNeedDealLogin=false&body={}&afsImg=&business=undefined'.format(body)
        print(base_url)  # the URL assembled for this catId
        ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
        headers = {'User-Agent': ua}
        res = requests.get(base_url, headers=headers)
        print(res.text)
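    As a side note on the encoding step: json.dumps inserts spaces after commas and colons by default, which quote then turns into %20. Passing compact separators reproduces the body string exactly as it appears in the captured url2. This is a sketch rather than the author's code; the helper name encode_body is made up, and whether the server treats the two forms differently is not something the article states.

```python
import json
from urllib.parse import quote


def encode_body(cat_id, store_id="11653731", org_code="81372"):
    """Build the percent-encoded `body` value for a searchByCategory request."""
    body = {
        "storeId": store_id,
        "orgCode": org_code,
        "skuId": "",
        "catIds": [{"catId": cat_id, "type": 2}],
    }
    # Compact separators avoid the spaces that json.dumps adds by default.
    compact = json.dumps(body, separators=(",", ":"))
    return quote(compact, safe="")


print(encode_body("4644376"))
```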

    [Image: sample of the returned data]

    4.4 Organize the data and write it out as a table (CSV file):

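    import csv

    # catename1 and catename2 (first- and second-level category names) and
    # searchResultVOList (the product list from the 4.3 response) are expected
    # to be defined earlier; see the complete code.
    filename = '{}.csv'.format(catename1)
    # newline='' is what the csv module expects; utf-8 keeps Chinese product names intact
    with open(filename, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Product name', 'Price (CNY)', 'Monthly sales', 'Image', 'Second-level category', 'First-level category'])

        for product in searchResultVOList:
            print(product)
            name = product['skuName']
            img = product['imgUrl']
            price = product['realTimePrice']
            sale = product['monthSales']
            writer.writerow([name, price, sale, img, catename2, catename1])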
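    Because the file is opened in append mode, the header row is written again each time the snippet runs for the same first-level category. One possible way around that is sketched below: the row-writing logic lives in a function with a header switch and writes to any file-like object. The function name write_products, the column names and the sample product are hypothetical; only the four product keys (skuName, realTimePrice, monthSales, imgUrl) come from the snippet above.

```python
import csv
import io

FIELDS = ["skuName", "realTimePrice", "monthSales", "imgUrl"]


def write_products(out, products, catename2, catename1, header=True):
    """Write one CSV row per product to the file-like object `out`."""
    writer = csv.writer(out)
    if header:
        writer.writerow(FIELDS + ["catename2", "catename1"])
    for product in products:
        # .get() leaves the cell empty instead of raising KeyError
        # when a product lacks one of the fields.
        row = [product.get(field, "") for field in FIELDS]
        writer.writerow(row + [catename2, catename1])


# Hypothetical product record for demonstration only.
products = [{
    "skuName": "Sample item",
    "realTimePrice": "9.90",
    "monthSales": "12",
    "imgUrl": "https://example.com/a.jpg",
}]
buf = io.StringIO()
write_products(buf, products, "Snacks", "Food")
print(buf.getvalue())
```

    With a real file, header=True would be passed only for the first second-level category written to each CSV.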

    [Image: sample of the output data]

    4.5 The complete code is available at: https://github.com/HongDanni/jd_daojia

  • Original article: https://www.cnblogs.com/hongdanni/p/11662869.html