• Packet capture with Appium


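    Environment setup

    Installing an Android emulator

    Download the Nox emulator (夜神模拟器) from its official site and install it.

    Installing the capture tools

    Installing appium

    https://github.com/appium/appium-desktop/releases/tag/v1.11.0

    Installing mitmproxy

    Download the installer and click Next through the wizard.
    https://github.com/mitmproxy/mitmproxy/releases/

    After installation, add the install directory to the PATH environment variable.

    Alternatively, run pip install mitmproxy.

    Installing the certificate

    Run mitmdump in cmd. It starts and listens on port 8080: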

    C:\Users\IIce>mitmdump
    Proxy server listening at http://*:8080
    

    Open the emulator and configure the proxy

    First check the PC's IP address:

    Ethernet adapter Ethernet:
    
       Connection-specific DNS Suffix  . : North-Class.com
       Link-local IPv6 Address . . . . . : fe80::68d7:38a8:2729:4d97%6
       IPv4 Address. . . . . . . . . . . : 192.168.100.243
       Subnet Mask . . . . . . . . . . . : 255.255.255.0
       Default Gateway . . . . . . . . . : 192.168.100.250
    

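    Once the proxy is set (host: the PC's IPv4 address, port: 8080, the port mitmdump listens on), open the browser and visit baidu.com to check.
    A certificate warning appears at this point; choose to continue.

    Visit mitm.it
    and install the certificate for the matching platform.

    After that, websites load without certificate warnings.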

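    Installing docker

    https://docs.docker.com/toolbox/toolbox_install_windows/
    Download docker-toolbox and install it.

    Double-click the Docker Quickstart Terminal icon to start a terminal.
    It downloads a boot2docker.iso file. If the download is slow, copy the link, download the file yourself,
    and copy it into the target directory once it finishes.

    If you see "Unable to start the VM: C:\Program Files\Oracle\VirtualBox\VBoxManage.exe startvm default --type headless failed:", uninstall Oracle VM VirtualBox and install the latest version.
    https://www.virtualbox.org/wiki/Downloads

    When startup completes, run docker run hello-world to verify the installation.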

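    Fiddler settings

    Connecting the phone
    Check the PC's IP address:

    ...
    
    Ethernet adapter Ethernet:
    
       Connection-specific DNS Suffix  . : North-Class.com
       Link-local IPv6 Address . . . . . : fe80::f44c:fb33:30bf:5c57%18
       IPv4 Address. . . . . . . . . . . : 192.168.100.248
       Subnet Mask . . . . . . . . . . . : 255.255.255.0
       Default Gateway . . . . . . . . . : 192.168.100.250
    ...
    

    Set the proxy on the phone; the server hostname is the PC's IPv4 address.

    After that, open <host IP>:<port> in the phone's browser.

    Apps cannot go online once the capture tool is running

    http://www.imooc.com/article/251500

    What to do when Fiddler cannot capture traffic

    https://testerhome.com/topics/11462?from=singlemessage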

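    Scraping recipes from Douguo Meishi (豆果美食)

    Download and install Douguo Meishi in the emulator.
    Set the proxy so that traffic can be captured.
    Open Fiddler and Douguo Meishi.

    Tap the recipe categories.

    Tap a tag to open its details.

    Capture analysis: http://api.douguo.net/recipe/flatcatalogs returns the recipe categories.

    http://api.douguo.net/recipe/v2/search/0/20 returns the details.
    The "best overall", "most favorited" and "most cooked" tabs all use this one URL; only the submitted parameters differ.

    # 0: best overall   2: most favorited   3: most cooked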
    "order": "0",
    
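
    As a small illustration of "same URL, different parameters", the sketch below builds the form data for each sort order. The helper name build_search_payload and the constants are hypothetical; the other fields are copied from the search payload used later in this post.

```python
# Sort orders observed in the captured requests (sent as strings).
ORDER_BEST = "0"            # best overall
ORDER_MOST_FAVORITED = "2"  # most favorited
ORDER_MOST_COOKED = "3"     # most cooked


def build_search_payload(keyword, order=ORDER_BEST):
    """Return the form data for one search request (hypothetical helper)."""
    if order not in (ORDER_BEST, ORDER_MOST_FAVORITED, ORDER_MOST_COOKED):
        raise ValueError("unknown order: %r" % (order,))
    return {
        "client": "4",
        "keyword": keyword,
        "order": order,
        "_vs": "400",
    }


# One payload per sort order for the same keyword.
payloads = [build_search_payload("tofu", o) for o in ("0", "2", "3")]
```

    Each payload can then be posted to the search URL with the shared headers.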

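    Implementation

    Request headers

    First pull out the headers shared by every request; the commented-out ones do not need to be sent.

    import requests


    def handle_reques(url, data):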
        header = {
            "client": "4",
            "version": "6934.2",
            "device": "OPPO R11",
            "sdk": "22,5.1.1",
            "imei": "866174010942858",
            "channel": "baidu",
            # "mac": " 6A:07:15:F0:34:85",
            "resolution": "1280*720",
            "dpi": "1.5",
            # "android-id": "6a0715f034851883",
            # "pseudo - id": "5f0348518836a071",
            "brand": "OPPO",
            "scale": "1.5",
            "timezone": "28800",
            "language": "zh",
            "cns": "3",
            "carrier": "CHINA+MOBILE",
            # "imsi": "460071060715240",
            "user-agent": "Mozilla/5.0 (Linux; Android 5.1.1; OPPO R11 Build/NMF26X) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/39.0.0.0 Mobile Safari/537.36",
            "reach": "1",
            "newbie": "1",
            "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "Keep-Alive",
            # "Cookie": "duid=59159842",
            "Host": "api.douguo.net",
            # "Content-Length": "74",
        }
    
        response = requests.post(url=url, headers=header, data=data)
        return response
    

    Recipe categories

    
    import json
    from multiprocessing import Queue
    
    queue_list = Queue()
    
    def handle_index():
        url = "http://api.douguo.net/recipe/flatcatalogs"
        data = {
            "client": "4",
            # "_session": "1552715432169866174010942858",
            # "v": "1503650468",
            # "_vs": "0",   both 0 and 2305 work
            "_vs": "2305",
    
        }
    
        response = handle_reques(url=url, data=data)
        # print(response.text)
        response_to_dict = json.loads(response.text)
    
        for item in response_to_dict['result']['cs']:
            for item_1 in item['cs']:
                for item_2 in item_1['cs']:
                    data_2 = {
                        "client": "4",
                        # "_session": "1552715831226866174010942858",
                        "keyword": item_2['name'],
                        # 0: best overall   2: most favorited   3: most cooked
                        "order": "0",
                        "_vs": "400",
                    }
                    queue_list.put(data_2)
    
    

    Details

    def handle_caipu_list(data):
        print("Processing:", data['keyword'])
        caipu_list_url = 'http://api.douguo.net/recipe/v2/search/0/20'
        caipu_list_response = handle_reques(url=caipu_list_url, data=data)
        response_to_dict = json.loads(caipu_list_response.text)
        handle_caipu_detail(data, response_to_dict)
    
        count=0
        while response_to_dict['result']['end'] == 0:
            count+=1
            caipu_list_url = 'http://api.douguo.net/recipe/v2/search/{}/20'.format(count*20)
            caipu_list_response = handle_reques(url=caipu_list_url, data=data)
            response_to_dict = json.loads(caipu_list_response.text)
            handle_caipu_detail(data, response_to_dict)
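
    The paging logic above (offset 0, 20, 40, ... until result.end is non-zero) can also be written as a generator, which avoids repeating the request code. This is only a sketch: fetch_page is a hypothetical caller-supplied function that performs the POST to .../recipe/v2/search/<offset>/<page_size> and returns the decoded JSON.

```python
def iter_search_pages(fetch_page, page_size=20):
    """Yield each page's parsed JSON until the API reports the end.

    fetch_page(offset, page_size) is a hypothetical callable that performs
    the HTTP request and returns the decoded JSON dict.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield page
        # The original loop keeps going only while result.end == 0.
        if page["result"]["end"] != 0:
            break
        offset += page_size


# Stand-in for the real HTTP call: three pages, the last one marked as the end.
def fake_fetch(offset, page_size):
    return {"result": {"end": 1 if offset >= 40 else 0, "list": [offset]}}


offsets = [p["result"]["list"][0] for p in iter_search_pages(fake_fetch)]
print(offsets)  # [0, 20, 40]
```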
    

    Cooking steps

    def handle_caipu_detail(data, response_to_dict):
    
        for item in response_to_dict['result']['list']:
            caipu_info = {}
            caipu_info['shicai'] = data['keyword']
    
            if item['type'] == 13:
                caipu_info['author'] = item['r']['an']
                caipu_info['shicai_id'] = item['r']['id']  # used later to request the detailed steps
                caipu_info['describe'] = item['r']['cookstory']
                caipu_info['caipu_name'] = item['r']['n']
                caipu_info['zuoliao_list'] = item['r']['major']
    
                detail_url = 'http://api.douguo.net/recipe/detail/' + str(caipu_info['shicai_id'])
                detail_data = {
                    "client": "4",
                    # "_session": "1552715831226866174010942858",
                    "author_id": "0",
                    "_vs": "2803",
                    "_ext": '{"query":{"kw":' + data["keyword"] + ',"src":"2803","type":"13","id":' + str(
                        caipu_info["shicai_id"]) + '}}',
                }
    
                detail_response = handle_reques(url=detail_url, data=detail_data)
                # print(detail_response.text)
                detail_response_to_dict = json.loads(detail_response.text)
    
                caipu_info['tips'] = detail_response_to_dict['result']['recipe']['tips']
                caipu_info['cook_step'] = detail_response_to_dict['result']['recipe']['cookstep']
    
                print('Saving:', caipu_info['caipu_name'])
                mongo_info.insert_item(caipu_info)
    
            else:
                continue
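
    The _ext value above is assembled by string concatenation, which leaves the keyword unquoted, so the result is not valid JSON. Whether the server cares is not something the capture tells us. If valid JSON is wanted, a sketch with json.dumps could look like this (build_ext is a hypothetical helper, and the quoting of kw is an assumption):

```python
import json


def build_ext(keyword, recipe_id):
    """Serialize the _ext field as valid JSON (hypothetical helper)."""
    ext = {"query": {"kw": keyword, "src": "2803", "type": "13", "id": recipe_id}}
    # ensure_ascii=False keeps Chinese keywords readable;
    # separators=(",", ":") drops the spaces json.dumps adds by default.
    return json.dumps(ext, ensure_ascii=False, separators=(",", ":"))


print(build_ext("tofu", 12345))
```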
    

    Saving to MongoDB

    import pymongo
    
    from pymongo.collection import Collection
    
    
    class Connect_Mongo:
        def __init__(self):
            self.client = pymongo.MongoClient()
            self.db_data = self.client['dou_guo_mei_shi']
    
        def insert_item(self, item):
            db_collection = Collection(self.db_data, 'mei_shi')
            db_collection.insert_one(item)
    
    
    mongo_info = Connect_Mongo()
    

    Multithreaded test

    from concurrent.futures import ThreadPoolExecutor

    if __name__ == '__main__':
        handle_index()
        # print(queue_list.qsize())
        # handle_caipu_list(queue_list.get())
        pool = ThreadPoolExecutor()
    
        while queue_list.qsize() > 0:
            pool.submit(handle_caipu_list, queue_list.get())
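
    Since all workers here are threads in one process, a plain queue.Queue is enough, and it sidesteps multiprocessing.Queue.qsize(), which is only approximate and is not implemented on some platforms. The sketch below shows one way to drain a queue into a thread pool and wait for the results; drain and the sample handler are hypothetical names.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def drain(tasks, handler, max_workers=4):
    """Submit every queued task to a thread pool and wait for all of them."""
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            try:
                task = tasks.get_nowait()
            except queue.Empty:
                break
            futures.append(pool.submit(handler, task))
    # Leaving the with-block waits for the workers; result() re-raises errors.
    return [f.result() for f in futures]


tasks = queue.Queue()
for name in ("egg", "tofu", "beef"):
    tasks.put({"keyword": name})

results = drain(tasks, lambda t: t["keyword"].upper())
print(results)  # ['EGG', 'TOFU', 'BEEF']
```

    In the scraper, the handler would be handle_caipu_list and the tasks would be the dicts produced by handle_index.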
    

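    Installing android-sdk

    http://tools.android-studio.org/index.php/sdk
    Download and install it.

    Configuring environment variables

    Variables
    ANDROID_HOME (new)  G:\Program Files (x86)\Android\android-sdk
    Path (append)       %ANDROID_HOME%\tools
    Path (append)       %ANDROID_HOME%\platform-tools

    Run SDK Manager.exe

    Tick the latest Android version; it is compatible with older versions.

    After installation, open cmd and type adb; the adb version is printed: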

    Android Debug Bridge version 1.0.40
    Version 28.0.2-5303910
    Installed as G:\Program Files (x86)\Android\android-sdk\platform-tools\adb.exe
    
    global options:
     -a         listen on all network interfaces, not just localhost
     -d         use USB device (error if multiple devices connected)
     -e         use TCP/IP device (error if multiple TCP/IP devices available)
     -s SERIAL  use device with given serial (overrides $ANDROID_SERIAL)
     -t ID      use device with given transport id
     -H         name of adb server host [default=localhost]
     -P         port of adb server [default=5037]
     -L SOCKET  listen on given socket for adb server [default=tcp:localhost:5037]
    
    

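    Upgrading the Nox emulator's adb

    Copy the three adb files from android-sdk\platform-tools into the emulator's install directory.

    Make a copy of adb.exe and use it to overwrite the original nox_adb.exe.
    Enable developer options in the emulator.
    Restart the emulator and open cmd:

    C:\Users\lenovo>adb devices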
    List of devices attached
    127.0.0.1:52001 device
    

    The emulator is now connected.
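
    When this check has to be scripted, the adb devices output can be parsed into a list of serials. A minimal sketch (parse_adb_devices is a hypothetical helper; it assumes the plain two-column output shown above):

```python
def parse_adb_devices(output):
    """Return the serials of devices whose state is 'device'."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Device lines have exactly two columns: serial and state.
        # The header and any daemon start-up messages do not match.
        if len(parts) == 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


sample = """List of devices attached
127.0.0.1:52001 device
"""
print(parse_adb_devices(sample))  # ['127.0.0.1:52001']
```

    The real output can be obtained with subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout.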

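    uiautomatorviewer

    File location: D:\Program Files (x86)\Android\android-sdk\tools\uiautomatorviewer.bat

    Double-click to run it. Minimize the black console window, but do not close it.

    Click the screenshot button; you can then inspect each element's attributes with the mouse.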

    appium

    Startup parameters
    http://www.testclass.net/appium

    {
      "platformName": "Android",
      "deviceName": "127.0.0.1:52001",
      "platformVersion": "5.1.1",
      "appPackage": "com.tal.kaoyan",
      "appActivity": "com.tal.kaoyan.ui.activity.SplashActivity"
    }
    

    Getting appPackage and appActivity
    Use aapt.exe dump badging:

    D:\Program Files (x86)\Android\android-sdk\build-tools\28.0.3>aapt.exe dump badging F:\BrowserDownload\kaoyanbang_3.3.7beta.243.apk
    
    package: name='com.tal.kaoyan' versionCode='92' versionName='3.3.7beta' compileSdkVersion='28' compileSdkVersionCodename='9'
    sdkVersion:'16'
    ...
    
    launchable-activity: name='com.tal.kaoyan.ui.activity.SplashActivity'  label='' icon=''
    
    ...
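
    The two values Appium needs can be pulled out of this output with a couple of regular expressions. A sketch, assuming the line formats shown above (parse_badging is a hypothetical helper):

```python
import re


def parse_badging(output):
    """Extract appPackage and appActivity from `aapt dump badging` output."""
    package = re.search(r"^package: name='([^']+)'", output, re.MULTILINE)
    activity = re.search(r"^launchable-activity: name='([^']+)'", output, re.MULTILINE)
    return {
        "appPackage": package.group(1) if package else None,
        "appActivity": activity.group(1) if activity else None,
    }


sample = (
    "package: name='com.tal.kaoyan' versionCode='92' versionName='3.3.7beta'\n"
    "sdkVersion:'16'\n"
    "launchable-activity: name='com.tal.kaoyan.ui.activity.SplashActivity'  label='' icon=''\n"
)
print(parse_badging(sample))
```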
    
    

    Testing with Kaoyanbang (考研帮)

    pip install Appium-Python-Client
    
    {
      "platformName": "Android",
      "deviceName": "127.0.0.1:52001",
      "platformVersion": "5.1.1",
      "appPackage": "com.tal.kaoyan",
      "appActivity": "com.tal.kaoyan.ui.activity.SplashActivity",
      "noReset": true
    }
    
    import time
    
    from appium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    
    cap = {
        "platformName": "Android",
        "deviceName": "127.0.0.1:52001",
        "platformVersion": "5.1.1",
        "appPackage": "com.tal.kaoyan",
        "appActivity": "com.tal.kaoyan.ui.activity.SplashActivity",
        "noReset": True
    }
    
    name = ""
    pwd = ""
    
    driver = webdriver.Remote("http://localhost:4723/wd/hub", cap)
    
    
    def get_size():
        x = driver.get_window_size()['width']
        y = driver.get_window_size()['height']
        return (x, y)
    
    
    try:
        # Tap "skip" if the skip button appears
        if WebDriverWait(driver, 3).until(
                lambda x: x.find_element_by_xpath("//android.widget.TextView[@resource-id='com.tal.kaoyan:id/tv_skip']")):
            driver.find_element_by_xpath("//android.widget.TextView[@resource-id='com.tal.kaoyan:id/tv_skip']").click()
    except:
        pass
    
    try:
        if WebDriverWait(driver, 3).until(lambda x: x.find_element_by_xpath(
                "//android.widget.EditText[@resource-id='com.tal.kaoyan:id/login_email_edittext']")):
            driver.find_element_by_xpath(
                "//android.widget.EditText[@resource-id='com.tal.kaoyan:id/login_email_edittext']").send_keys(name)
            driver.find_element_by_xpath(
                "//android.widget.EditText[@resource-id='com.tal.kaoyan:id/login_password_edittext']").send_keys(pwd)
            driver.find_element_by_xpath(
                "//android.widget.Button[@resource-id='com.tal.kaoyan:id/login_login_btn']").click()
    except:
        pass
    
    try:
        # Privacy agreement
        if WebDriverWait(driver, 3).until(
                lambda x: x.find_element_by_xpath("//android.widget.TextView[@resource-id='com.tal.kaoyan:id/tv_title']")):
            driver.find_element_by_xpath("//android.widget.TextView[@resource-id='com.tal.kaoyan:id/tv_agree']").click()
            driver.find_element_by_xpath(
                "//android.support.v7.widget.RecyclerView[@resource-id='com.tal.kaoyan:id/date_fix']/android.widget.RelativeLayout[3]").click()
    except:
        pass
    
    # Tap the 研讯 (exam news) entry
    if WebDriverWait(driver, 3).until(lambda x: x.find_element_by_xpath(
            "//android.support.v7.widget.RecyclerView[@resource-id='com.tal.kaoyan:id/date_fix']/android.widget.RelativeLayout[3]/android.widget.LinearLayout[1]/android.widget.ImageView[1]")):
        driver.find_element_by_xpath(
            "//android.support.v7.widget.RecyclerView[@resource-id='com.tal.kaoyan:id/date_fix']/android.widget.RelativeLayout[3]/android.widget.LinearLayout[1]/android.widget.ImageView[1]").click()
    
        l = get_size()
    
        x1 = int(l[0] * 0.5)
        y1 = int(l[1] * 0.75)
        y2 = int(l[1] * 0.25)
    
        # Swipe up repeatedly
        while True:
            driver.swipe(x1, y1, x1, y2)
            time.sleep(0.5)
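
    The swipe coordinates are simple ratios of the screen size: the finger starts at 75% of the height and ends at 25%, along the vertical centre line. A sketch of that arithmetic as a reusable function (swipe_points is a hypothetical name):

```python
def swipe_points(width, height, start_ratio=0.75, end_ratio=0.25):
    """Return (x1, y1, x2, y2) for a vertical swipe along the centre line."""
    x = int(width * 0.5)
    return (x, int(height * start_ratio), x, int(height * end_ratio))


print(swipe_points(720, 1280))  # (360, 960, 360, 320)
```

    With a live session this would be used as driver.swipe(*swipe_points(width, height)).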
    

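    The overall workflow is much the same as with selenium.

    Scraping Douyin followers

    Start with a share link:
    https://www.douyin.com/share/user/96578108671

    Open it in a browser and inspect the page: the digits are obfuscated with a custom font.
    Font file: https://s3.bytecdn.cn/ies/resource/falcon/douyin_falcon/static/font/iconfont_9eb9a50.woff

    Online font viewer: http://fontstore.baidu.com/static/editor/index.html
    Upload the downloaded font file to that site to see which glyph corresponds to which digit.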
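
    Once the glyph-to-digit table is known, decoding is a lookup over the numeric entities in the page source. The sketch below assumes the page encodes each digit as an HTML entity such as &#xe603;. The code points in GLYPH_TO_DIGIT are placeholders for illustration only; the real ones have to be read from the font file.

```python
import re

# Placeholder mapping (assumption): replace with the code points found in the font.
GLYPH_TO_DIGIT = {
    0xE603: "0",
    0xE602: "1",
    0xE605: "2",
}


def decode_digits(text, table=GLYPH_TO_DIGIT):
    """Replace mapped hexadecimal entities with digits; leave others untouched."""
    def repl(match):
        code_point = int(match.group(1), 16)
        return table.get(code_point, match.group(0))

    return re.sub(r"&#x([0-9a-fA-F]+);", repl, text)


print(decode_digits("<i> &#xe602; </i><i> &#xe603; </i>"))
```

    Entities that are not in the table pass through unchanged, so ordinary markup in the page is not disturbed.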

    Scraping the share page

    import re
    import requests
    import time
    from lxml import etree
    
    from douyin.handle_mongo import get_task
    
    
    def handle_decode(input_data, share_web_url, task):
        search_douyin_str = re.compile('抖音ID:')
        # Each 'name' list holds the obfuscated glyph strings that stand for one digit.
        regex_list = [
            {'name': ['  ', '  ', '  '], 'value': 0},
            {'name': ['  ', '  ', '  '], 'value': 1},
            {'name': ['  ', '  ', '  '], 'value': 2},
            {'name': ['  ', '  ', '  '], 'value': 3},
            {'name': ['  ', '  ', '  '], 'value': 4},
            {'name': ['  ', '  ', '  '], 'value': 5},
            {'name': ['  ', '  ', '  '], 'value': 6},
            {'name': ['  ', '  ', '  '], 'value': 7},
            {'name': ['  ', '  ', '  '], 'value': 8},
            {'name': ['  ', '  ', '  '], 'value': 9},
        ]
    
        for i1 in regex_list:
            for i2 in i1['name']:
                input_data = re.sub(i2, str(i1['value']), input_data)
        share_web_html = etree.HTML(input_data)
        douyin_info = {}
        douyin_info['nick_name'] = share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()")[0]
        if 'douyin_id' in task:
            douyin_info['douyin_id'] = task['douyin_id']
        else:
            douyin_id = ''.join(
                share_web_html.xpath("//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
            if douyin_id == '':
                try:
                    douyin_info['douyin_id'] = re.sub(search_douyin_str, '', share_web_html.xpath(
                        "//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/text()")[0]).strip()
                except:
                    douyin_info['douyin_id'] = '无数据'
            else:
                douyin_info['douyin_id'] = douyin_id
    
        try:
            douyin_info['job'] = share_web_html.xpath(
                "//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()")[
                0].strip()
        except:
            pass
        douyin_info['describe'] = share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()")[0].replace('\n', ',')
        douyin_info['location'] = share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()")
        douyin_info['xingzuo'] = share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()")
        douyin_info['follow_count'] = share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()")[
            0].strip()
        fans_value = ''.join(share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
        unit = share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()")
        if unit[-1].strip() == 'w':
            douyin_info['fans'] = str((int(fans_value) / 10)) + 'w'
        like = ''.join(share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
        unit = share_web_html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']/span[@class='num']/text()")
        if unit[-1].strip() == 'w':
            douyin_info['like'] = str(int(like) / 10) + 'w'
        douyin_info['from_url'] = share_web_url
    
        print(douyin_info)
    
    
    
    def handle_douyin_web_share(task):
        share_web_url = 'https://www.douyin.com/share/user/' + task
        print(share_web_url)
        share_web_header = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
        }
        share_web_response = requests.get(url=share_web_url, headers=share_web_header)
        handle_decode(share_web_response.text, share_web_url, task)
    
    if __name__ == '__main__':
        # task = get_task("share_id")
        handle_douyin_web_share("88445518961")
    
    https://www.douyin.com/share/user/88445518961
    {'nick_name': 'Dear-迪丽热巴', 'douyin_id': '274110380', 'job': '演员', 'describe': '先定一个能达到的小目标,比方说来句签名', 'location': [], 'xingzuo': [], 'follow_count': '0', 'fans': '5046.8w', 'like': '13527.7w', 'from_url': 'https://www.douyin.com/share/user/88445518961'}
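
    The counts come back as strings such as '5046.8w', where w stands for 万 (10,000). If numeric values are needed for sorting or storage, they can be converted as in the sketch below (parse_count is a hypothetical helper; Decimal avoids floating-point rounding).

```python
from decimal import Decimal


def parse_count(text):
    """Convert strings such as '5046.8w' or '321' to an integer."""
    text = text.strip()
    if text.endswith("w"):
        # 'w' marks units of ten thousand.
        return int(Decimal(text[:-1]) * 10000)
    return int(Decimal(text))


print(parse_count("5046.8w"))   # 50468000
print(parse_count("13527.7w"))  # 135277000
```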
    

    Scraping followers

    Prerequisites: the app is logged in and is the latest version.

    Scraping one user's followers

    import sys
    import time
    from selenium.webdriver.support.ui import WebDriverWait
    from appium import webdriver
    
    desired_caps = {}
    desired_caps['platformName'] = 'Android'
    desired_caps['deviceName'] = '127.0.0.1:62001'
    desired_caps['platformVersion'] = '5.1.1'
    desired_caps['appPackage'] = 'com.ss.android.ugc.aweme'
    desired_caps['appActivity'] = 'com.ss.android.ugc.aweme.splash.SplashActivity'
    desired_caps['noReset'] = True
    desired_caps['unicodeKeyboard'] = True
    desired_caps['resetKeyboard'] = True
    
    driver = webdriver.Remote('http://localhost:4723/wd/hub', desired_caps)
    
    
    def get_size(driver):
        x = driver.get_window_size()['width']
        y = driver.get_window_size()['height']
        return (x, y)
    
    
    def handle_douyin(driver):
        try:
            # Tap the search icon
            while WebDriverWait(driver, 10).until(lambda x: x.find_element_by_xpath(
                    "//android.widget.LinearLayout[@resource-id='com.ss.android.ugc.aweme:id/aps']")):
                driver.find_element_by_xpath(
                    "//android.widget.LinearLayout[@resource-id='com.ss.android.ugc.aweme:id/aps']").click()
                break
        except:
            print("Search button not found")
    
        # Locate the search box
        if WebDriverWait(driver, 3).until(lambda x: x.find_element_by_xpath(
                "//android.widget.EditText[@resource-id='com.ss.android.ugc.aweme:id/afo']")):
            # Type the douyin_id to search for
            driver.find_element_by_xpath(
                "//android.widget.EditText[@resource-id='com.ss.android.ugc.aweme:id/afo']").send_keys('706942127')
            while driver.find_element_by_xpath(
                    "//android.widget.EditText[@resource-id='com.ss.android.ugc.aweme:id/afo']").text != '706942127':
                driver.find_element_by_xpath(
                    "//android.widget.EditText[@resource-id='com.ss.android.ugc.aweme:id/afo']").send_keys('706942127')
                time.sleep(0.1)
        # Tap the search button
        driver.find_element_by_xpath("//android.widget.TextView[@resource-id='com.ss.android.ugc.aweme:id/afr']").click()
    
        # Tap the 用户 (Users) tab
        if WebDriverWait(driver, 3).until(lambda x: x.find_element_by_xpath("//android.widget.TextView[@text='用户']")):
            driver.find_element_by_xpath("//android.widget.TextView[@text='用户']").click()
    
        # Tap the avatar
        if WebDriverWait(driver, 3).until(lambda x: x.find_element_by_xpath(
                "/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout/android.widget.FrameLayout[2]/android.widget.RelativeLayout/android.widget.FrameLayout/android.widget.RelativeLayout/android.support.v4.view.ViewPager/android.widget.LinearLayout/android.widget.FrameLayout/android.view.View/android.support.v7.widget.RecyclerView/android.widget.RelativeLayout[1]/android.widget.RelativeLayout[1]/android.widget.ImageView[2]")):
            driver.find_element_by_xpath(
                "/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout/android.widget.FrameLayout[2]/android.widget.RelativeLayout/android.widget.FrameLayout/android.widget.RelativeLayout/android.support.v4.view.ViewPager/android.widget.LinearLayout/android.widget.FrameLayout/android.view.View/android.support.v7.widget.RecyclerView/android.widget.RelativeLayout[1]/android.widget.RelativeLayout[1]/android.widget.ImageView[2]").click()
        # Tap the followers button
        if WebDriverWait(driver, 3).until(lambda x: x.find_element_by_id("com.ss.android.ugc.aweme:id/aj1")):
            driver.find_element_by_id("com.ss.android.ugc.aweme:id/aj1").click()
    
        l = get_size(driver)
        x1 = int(l[0] * 0.5)
        y1 = int(l[1] * 0.75)
        y2 = int(l[1] * 0.25)
        while True:
            if '没有更多了' in driver.page_source:
                break
            driver.swipe(x1, y1, x1, y2)
            time.sleep(0.5)
    
    
    if __name__ == '__main__':
        handle_douyin(driver)
    

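    Appium launches Douyin, taps the search icon, types into the search box, taps the search button, taps the Users tab, taps the avatar, taps the followers button, and then keeps swiping until no more followers are loaded.

    Saving followers to the database

    Use mitmdump to store the data in the database:
    mitmdump -s xxx.py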

    import json
    
    from douyin.handle_mongo import save_task
    
    
    def response(flow):
        if 'aweme/v1/user/follower/list/' in flow.request.url:
            for user in json.loads(flow.response.text)['followers']:
                douyin_info = {}
                douyin_info['share_id'] = user['uid']
                douyin_info['douyin_id'] = user['short_id']
                douyin_info['nickname'] = user['nickname']
                save_task(douyin_info)
    

    With this script running, each follower's information is added to the database as the follower list is scrolled.
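
    The parsing part of the hook can be kept separate from mitmproxy so that it can be checked without a live capture. A sketch, assuming the response body has the shape used above (a followers list whose items carry uid, short_id and nickname); extract_followers is a hypothetical helper.

```python
import json


def extract_followers(body):
    """Return one small dict per follower found in the response body."""
    payload = json.loads(body)
    return [
        {
            "share_id": user.get("uid"),
            "douyin_id": user.get("short_id"),
            "nickname": user.get("nickname"),
        }
        # A missing or null 'followers' key yields an empty list, not an error.
        for user in payload.get("followers") or []
    ]


sample = json.dumps({"followers": [{"uid": "1", "short_id": "2", "nickname": "n"}]})
print(extract_followers(sample))
print(extract_followers("{}"))  # []
```

    Inside response(flow), the call would be extract_followers(flow.response.text), followed by save_task for each item.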

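    Scraping with multiple devices

    Appium needs two settings:
    on the appium client, set udid
    on the appium server, set bootstrapPort

    Several emulators or several physical devices must be running.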

    import multiprocessing
    import sys
    import time
    from selenium.webdriver.support.ui import WebDriverWait
    from appium import webdriver
    
    # desired_caps = {}
    # desired_caps['platformName'] = 'Android'
    # desired_caps['deviceName'] = '127.0.0.1:62001'
    # desired_caps['platformVersion'] = '5.1.1'
    # desired_caps['appPackage'] = 'com.ss.android.ugc.aweme'
    # desired_caps['appActivity'] = 'com.ss.android.ugc.aweme.splash.SplashActivity'
    # desired_caps['noReset'] = True
    # desired_caps['unicodeKeyboard'] = True
    # desired_caps['resetKeyboard'] = True
    #
    # driver = webdriver.Remote('http://localhost:4723/wd/hub', desired_caps)
    
    
    def get_size(driver):
        x = driver.get_window_size()['width']
        y = driver.get_window_size()['height']
        return (x, y)
    
    
    def handle_douyin(driver):
        while True:
            # Locate the search box
            while WebDriverWait(driver, 10).until(lambda x: x.find_element_by_xpath(
                    "//android.widget.EditText[@resource-id='com.ss.android.ugc.aweme:id/afo']")):
                # Type the douyin_id to search for
                driver.find_element_by_xpath(
                    "//android.widget.EditText[@resource-id='com.ss.android.ugc.aweme:id/afo']").send_keys('706942127')
                while driver.find_element_by_xpath(
                        "//android.widget.EditText[@resource-id='com.ss.android.ugc.aweme:id/afo']").text != '706942127':
                    driver.find_element_by_xpath(
                        "//android.widget.EditText[@resource-id='com.ss.android.ugc.aweme:id/afo']").send_keys('706942127')
                    time.sleep(0.1)
                    break
                break
            # Tap the search button
            driver.find_element_by_xpath("//android.widget.TextView[@resource-id='com.ss.android.ugc.aweme:id/afr']").click()
    
            # Tap the 用户 (Users) tab
            if WebDriverWait(driver, 10).until(lambda x: x.find_element_by_xpath("//android.widget.TextView[@text='用户']")):
                driver.find_element_by_xpath("//android.widget.TextView[@text='用户']").click()
    
            # Tap the avatar
            if WebDriverWait(driver, 10).until(lambda x: x.find_element_by_xpath(
                    "/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout/android.widget.FrameLayout[2]/android.widget.RelativeLayout/android.widget.FrameLayout/android.widget.RelativeLayout/android.support.v4.view.ViewPager/android.widget.LinearLayout/android.widget.FrameLayout/android.view.View/android.support.v7.widget.RecyclerView/android.widget.RelativeLayout[1]/android.widget.RelativeLayout[1]/android.widget.ImageView[2]")):
                driver.find_element_by_xpath(
                    "/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout/android.widget.FrameLayout[2]/android.widget.RelativeLayout/android.widget.FrameLayout/android.widget.RelativeLayout/android.support.v4.view.ViewPager/android.widget.LinearLayout/android.widget.FrameLayout/android.view.View/android.support.v7.widget.RecyclerView/android.widget.RelativeLayout[1]/android.widget.RelativeLayout[1]/android.widget.ImageView[2]").click()
            # Tap the followers button
            if WebDriverWait(driver, 10).until(lambda x: x.find_element_by_id("com.ss.android.ugc.aweme:id/aj1")):
                driver.find_element_by_id("com.ss.android.ugc.aweme:id/aj1").click()
    
            l = get_size(driver)
            x1 = int(l[0] * 0.5)
            y1 = int(l[1] * 0.75)
            y2 = int(l[1] * 0.25)
            while True:
                if '没有更多了' in driver.page_source:
                    break
                elif '还没有粉丝' in driver.page_source:
                    break
                else:
                    driver.swipe(x1, y1, x1, y2)
                    time.sleep(0.5)
    
            driver.find_element_by_id("com.ss.android.ugc.aweme:id/n7").click()
            driver.find_element_by_id("com.ss.android.ugc.aweme:id/n7").click()
            driver.find_element_by_xpath(
                "//android.widget.EditText[@resource-id='com.ss.android.ugc.aweme:id/afo']").clear()
    
    
    def handle_appium(device, port):
        caps = {}
        caps["platformName"] = "Android"
        caps["deviceName"] = device
        caps["platformVersion"] = "5.1.1"
        caps["appPackage"] = "com.ss.android.ugc.aweme"
        caps["appActivity"] = "com.ss.android.ugc.aweme.splash.SplashActivity"
        caps["noReset"] = True
        caps["unicodeKeyboard"] = True
        caps["resetKeyboard"] = True
        caps["udid"] = device
    
        driver = webdriver.Remote('http://localhost:'+str(port)+'/wd/hub', caps)
    
        try:
            # Tap the search icon
            while WebDriverWait(driver, 10).until(lambda x: x.find_element_by_xpath(
                    "//android.widget.LinearLayout[@resource-id='com.ss.android.ugc.aweme:id/aps']")):
                driver.find_element_by_xpath(
                    "//android.widget.LinearLayout[@resource-id='com.ss.android.ugc.aweme:id/aps']").click()
                break
        except:
            print("Search button not found")
    
        handle_douyin(driver)
    
    if __name__ == '__main__':
        m_list = []
    
        devices_list = ['127.0.0.1:62001', '127.0.0.1:62025']
        for device in range(len(devices_list)):
            port = 4723+2*device
            m_list.append(multiprocessing.Process(target=handle_appium, args=(devices_list[device], port)))
    
        for m in m_list:
            m.start()
    
        for m in m_list:
            m.join()
    

    The entries in devices_list can be found with adb devices:

    C:\Users\IIce>adb devices
    List of devices attached
    127.0.0.1:62001 device
    127.0.0.1:62025 device
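
    The script above gives each device its own Appium port (4723, 4725, ...). The sketch below makes that plan explicit and also reserves the port in between as the bootstrap port. Using port + 1 for the bootstrap port is an assumption made here for illustration, as is the helper name assign_ports.

```python
def assign_ports(devices, base_port=4723, step=2):
    """Pair each device udid with its own Appium port and bootstrap port."""
    plan = []
    for index, udid in enumerate(devices):
        port = base_port + step * index
        # Assumption: the bootstrap port is the next port after the server port.
        plan.append({"udid": udid, "port": port, "bootstrap_port": port + 1})
    return plan


for entry in assign_ports(["127.0.0.1:62001", "127.0.0.1:62025"]):
    print(entry)
```

    One Appium server is then started per entry with the matching server port and bootstrap port, and the client connects to http://localhost:<port>/wd/hub.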
    

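    Scraping Douyin videos

    Share a user profile from the Douyin app and copy the link to get the profile page address, for example:
    https://www.iesdouyin.com/share/user/58862693224

    Analysing the video API

    Use Chrome's network capture to get the request for the video list API.

    URL parameters

    https://www.iesdouyin.com/web/api/v2/aweme/post/?
    user_id=58862693224 #   the id from the share link
    count=21            #   number of videos
    max_cursor=0        #   paging parameter: 0 on the first request, then taken from the previous response
    aid=1128            #   fixed value
    _signature=laPLvBAVyX-c77Gpje7Ys5Wjy6   #   signature, computed by the signing algorithm
    dytk=66cb5d220e0e48ed9195a7f62ac32764   #   purpose unknown; can be extracted directly from the page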
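
    Putting the parameters together is a plain query-string build. A sketch with the standard library (build_post_url is a hypothetical helper; the signature and dytk values must come from the signing step and the page, and are only passed through here):

```python
from urllib.parse import urlencode

API = "https://www.iesdouyin.com/web/api/v2/aweme/post/"


def build_post_url(user_id, signature, dytk, max_cursor=0, count=21):
    """Return the video list URL for one page of results."""
    params = {
        "user_id": user_id,
        "count": count,
        "max_cursor": max_cursor,  # 0 on the first request
        "aid": 1128,               # fixed value
        "_signature": signature,
        "dytk": dytk,
    }
    return API + "?" + urlencode(params)


print(build_post_url("58862693224", "SIGNATURE", "DYTK"))
```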
    

    Obtaining the signing algorithm

    Open the browser console and search for _signature.


    Locate _bytedAcrawler

    Locate douyin_falcon:node_modules/byted-acrawler/dist/runtime

    Locate __M.define

    How the signing algorithm runs

    ① Define the __M object and its define and require functions
    ② Execute the __M.define("douyin_falcon:node_modules/byted-acrawler/dist/runtime......" code
    ③ Execute _bytedAcrawler = require("douyin_falcon:node_modules/byted-acrawler/dist/runtime")

    ④ Compute the signature: _signature = _bytedAcrawler.sign(user_id)

    We can write our own html file and open it to obtain _signature.
    Taobao chromedriver mirror

    Source code

    About the watermark

    Video URLs come in two forms:

    1. https://aweme.snssdk.com/aweme/v1/play/?video_id=v0300f6d0000bj81rdqrh6f3j18kvnpg&line=0&ratio=540p&media_type=4&vr_type=0&improve_bitrate=0
    2. https://aweme.snssdk.com/aweme/v1/playwm/?video_id=v0300f6d0000bj81rdqrh6f3j18kvnpg&line=0&ratio=540p&media_type=4&vr_type=0&improve_bitrate=0

    Differences:

    1. The first requests play, the second requests playwm.
    2. The first does not open in a browser; the second does.
    3. Both can be fetched with requests.
    4. The first one has no watermark!

    Testing with Postman shows that keeping only the video_id parameter is enough.
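
    Based on these observations, a watermarked URL can be rewritten into the watermark-free form by switching playwm to play and dropping every parameter except video_id. A sketch with the standard library (to_no_watermark is a hypothetical helper):

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def to_no_watermark(url):
    """Rewrite a playwm URL to the play endpoint, keeping only video_id."""
    parts = urlsplit(url)
    video_id = parse_qs(parts.query)["video_id"][0]
    path = parts.path.replace("/playwm/", "/play/")
    query = urlencode({"video_id": video_id})
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


wm = ("https://aweme.snssdk.com/aweme/v1/playwm/?video_id=v0300f6d0000bj81rdqrh6f3j18kvnpg"
      "&line=0&ratio=540p&media_type=4&vr_type=0&improve_bitrate=0")
print(to_no_watermark(wm))
# https://aweme.snssdk.com/aweme/v1/play/?video_id=v0300f6d0000bj81rdqrh6f3j18kvnpg
```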

    Response fields


    has_more: tells whether another page needs to be requested
    max_cursor: must be sent with the next request; 0 on the first request
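
    These two fields are enough to drive a cursor-based paging loop. In the sketch below, fetch_page is a hypothetical function that requests one page for a given max_cursor and returns the decoded JSON; the key aweme_list for the video items is an assumption, since the response body is not shown in this post.

```python
def iter_videos(fetch_page):
    """Yield every item, following max_cursor until has_more is falsy."""
    cursor = 0  # the first request uses max_cursor=0
    while True:
        page = fetch_page(cursor)
        # 'aweme_list' is an assumed key name for the items on a page.
        for item in page.get("aweme_list", []):
            yield item
        if not page.get("has_more"):
            break
        cursor = page["max_cursor"]


# Stand-in pages keyed by the cursor that requests them.
pages = {
    0: {"aweme_list": ["a", "b"], "has_more": True, "max_cursor": 1570000000000},
    1570000000000: {"aweme_list": ["c"], "has_more": False, "max_cursor": 0},
}
print(list(iter_videos(pages.__getitem__)))  # ['a', 'b', 'c']
```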

    References

    Providing a Douyin signing service with NodeJS

    Watermark-free parsing

  • Original article: https://www.cnblogs.com/gaoyongjian/p/10829791.html