• 22. Scraping Tianyancha data with a cookie-based simulated login


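    Log in with an account to obtain cookies, then use them to simulate a logged-in session (this requires a Tianyancha account). A member account can view 5,000 companies, while an ordinary account can view only 100. Some anti-scraping precautions are also needed so that the account does not get banned.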
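    The idea is to take an account that has the necessary permissions, grab its cookies, and use them to request the pages. This still feels less than ideal to me: in my earlier scraping work I always steered clear of logins and CAPTCHAs, because in the end it comes down to whether the site lets you have the data and how you should go about getting it.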
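    For the cookie approach itself, here is a minimal sketch of turning a `Cookie` header copied from a logged-in browser into a cookie dict and attaching it to a `requests.Session`, so every request reuses the same login state. The helper names `parse_cookie_header` and `build_session` are made up for this illustration.

```python
def parse_cookie_header(cookie_header):
    """Turn a raw 'a=1; b=2' Cookie header into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookie_header.split(';'):
        # Split on the first '=' only, since cookie values may contain '='
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


def build_session(cookie_header, user_agent):
    """Create a requests.Session that sends the given cookies on every request."""
    import requests  # imported here so the parser above needs no third-party package

    session = requests.Session()
    session.headers.update({
        'User-Agent': user_agent,
        'Referer': 'https://www.tianyancha.com/',
    })
    session.cookies.update(parse_cookie_header(cookie_header))
    return session
```

    A session built this way would be used as `session.get(url)` in place of `requests.get(url, headers=headers)`.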
    This is only a simple test. Real scraping runs into all sorts of problems, so treat this as one possible approach, for reference only.
    Without anti-scraping precautions, the crawler gets detected, as shown in the screenshot:
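    A script has no screenshot to look at, so it needs its own check. Below is a rough sketch that flags a response as suspicious before it is parsed. The function name `looks_blocked` and the path markers (`login`, `captcha`, `verify`) are assumptions for illustration, not something the site documents; inspect a real blocked response and adjust them.

```python
from urllib.parse import urlparse

EXPECTED_HOST = 'www.tianyancha.com'
# Hypothetical markers of a login or verification page; adjust after
# looking at what a blocked response really contains.
SUSPICIOUS_PATH_PARTS = ('login', 'captcha', 'verify')


def looks_blocked(status_code, final_url):
    """Heuristically decide whether a response is a block page rather than data."""
    if status_code != 200:
        return True
    parsed = urlparse(final_url)
    # A redirect to a different host suggests a verification page
    if parsed.hostname != EXPECTED_HOST:
        return True
    path = parsed.path.lower()
    return any(part in path for part in SUSPICIOUS_PATH_PARTS)
```

    With `requests`, it could be called as `looks_blocked(response.status_code, response.url)`, since `response.url` holds the final URL after redirects.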
    
    
    # coding: utf-8

    import requests
    from lxml import etree

    # Request headers copied from a logged-in browser session
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        # 'br' is left out: requests can only decode Brotli if an extra package is installed
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        # Paste the full Cookie header of your own logged-in session here
        # (it carries the login state, e.g. auth_token); never publish a real one
        'Cookie': 'PASTE_THE_COOKIE_HEADER_FROM_YOUR_LOGGED_IN_BROWSER',
        'Host': 'www.tianyancha.com',
        'Referer': 'https://www.tianyancha.com/',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }
    # Search keywords
    keywords = ['佛山']
    # Result pages 1 to 5 (assuming page numbering starts at 1)
    for page in range(1, 6):
        for keyword in keywords:
            url = 'https://www.tianyancha.com/search/p{}'.format(page)
            # Send a GET request for one page of search results;
            # requests URL-encodes the keyword passed via params
            response = requests.get(url, params={'key': keyword}, headers=headers)
            html = etree.HTML(response.text)
            # Links to the company detail pages listed on this results page
            link_urls = html.xpath("//div[@class='content']/div[@class='header']/a/@href")
            for link_url in link_urls:
                response = requests.get(link_url, headers=headers)
                html2 = etree.HTML(response.text)
                # Company name: lxml's xpath() returns a list, so take the first match if there is one
                names = html2.xpath("//h1[@class='name']/text()")
                company = names[0].strip() if names else None
                print(company)
            print('*' * 100)
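    The loop above sends its requests back to back. As one possible form of the precaution mentioned at the start, the sketch below caps how many pages one account fetches and waits a random interval before each request. The class name `RequestBudget` and the 2 to 5 second delay range are assumptions for illustration; only the 100 and 5,000 page limits come from the text.

```python
import random
import time


class RequestBudget:
    """Caps how many pages one account fetches and spaces the requests out."""

    def __init__(self, max_requests, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
        self.max_requests = max_requests
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.used = 0
        self._sleep = sleep  # injectable so the class can be tested without waiting

    def acquire(self):
        """Return False once the quota is used up; otherwise wait, count, and return True."""
        if self.used >= self.max_requests:
            return False
        # A random pause makes the request pattern less regular
        self._sleep(random.uniform(self.min_delay, self.max_delay))
        self.used += 1
        return True


# 100 pages for an ordinary account, per the limit quoted in the text
budget = RequestBudget(max_requests=100)
```

    In the loop above, `if not budget.acquire(): break` could be placed right before each `requests.get(link_url, headers=headers)` call.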
  • Original article: https://www.cnblogs.com/lvjing/p/9909088.html