• A Beginner's Crawl of a Single Weibo User's Comments


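    I. Overview

    Crawl every post published by the Weibo user "深圳移动" (Shenzhen Mobile), together with the comments on each post.

    II. Tools

    Language: Python 2.7
    Library: requests (import requests)
    Weibo accounts: several, purchased online
    IP proxy: a rented proxy server with a dynamic IP
    User-Agent strings: a collection gathered from the web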

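    III. Overall approach

    1. Find the mobile Weibo page for "深圳移动":

    https://m.weibo.cn/u/1922826034


    2. The mobile site shows no pagination; posts keep loading as you scroll down (1671 pages in total). The underlying JSON data is still served page by page:
    https://m.weibo.cn/api/container/getIndex?type=uid&value=1922826034&containerid=1076031922826034&page=2

    Changing the value after page= returns the JSON data for each page of the mobile timeline.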
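
    A minimal sketch of how those page URLs could be generated. The names TIMELINE_API and page_url are illustrative and do not appear in the original scripts. In the URL above the containerid is 107603 followed by the uid; assuming the same pattern holds for other accounts is a guess.

```python
# Sketch only: TIMELINE_API and page_url are hypothetical helper names.
TIMELINE_API = "https://m.weibo.cn/api/container/getIndex"


def page_url(page, uid="1922826034"):
    """Return the timeline JSON URL for one page of a user's posts."""
    # Assumption: containerid is the prefix 107603 followed by the uid,
    # as observed in the single URL quoted in this post.
    return "{api}?type=uid&value={uid}&containerid=107603{uid}&page={page}".format(
        api=TIMELINE_API, uid=uid, page=page
    )


# Build the first three page URLs.
urls = [page_url(p) for p in range(1, 4)]
for u in urls:
    print(u)
```

    Keeping the URL construction in one function means the crawl loop only has to iterate over page numbers.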


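    3. Read the idstr field, which is the post ID, from each post in the JSON above.
    A single post's mobile page is available at an address such as https://m.weibo.cn/status/4177994643916324
    Format: https://m.weibo.cn/status/{id}


    4. The comments on a post are available as JSON from an address such as
    https://m.weibo.cn/api/comments/show?id=4131150395559419&page=1
    where id is the post ID and page is the comment page number.
    Format: https://m.weibo.cn/api/comments/show?id={id}&page={page_num}

    The ok field at the top of the response is 1 when the post has comments and 0 when it has none.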
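
    The comment URL format and the ok check can be sketched as follows. comments_url and has_comments are illustrative names, and the two sample bodies are hand-written stand-ins rather than real Weibo responses.

```python
import json

# Sketch only: helper names and sample payloads are hypothetical.
COMMENTS_API = "https://m.weibo.cn/api/comments/show"


def comments_url(post_id, page=1):
    """Return the comment JSON URL for one page of one post."""
    return "{api}?id={post_id}&page={page}".format(
        api=COMMENTS_API, post_id=post_id, page=page
    )


def has_comments(payload):
    """True when the decoded JSON reports ok == 1."""
    return payload.get("ok") == 1


# Hand-written sample bodies that only mimic the ok flag described above.
with_comments = json.loads('{"ok": 1, "data": [{"text": "hello"}]}')
without_comments = json.loads('{"ok": 0}')

print(comments_url("4131150395559419"))
print(has_comments(with_comments))
print(has_comments(without_comments))
```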

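    IV. Implementation

    1. Set up the User-Agent, cookies, and headers.

    Collect a large number of User-Agent strings from the web, buy several Weibo accounts on Taobao, and extract each account's cookie.
    random.choice() returns one randomly chosen value from a list on each call. Rotating values this way avoids sending many requests with the same cookie or the same User-Agent in a short period, which gets that cookie or User-Agent banned.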
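
    A small sketch of that rotation, assuming placeholder pools in place of the real User-Agent and cookie lists. build_headers, AGENTS, and COOKIES are illustrative names.

```python
import random

# Placeholder pools; the real lists live in separate modules in this post.
AGENTS = ["example-agent-1", "example-agent-2", "example-agent-3"]
COOKIES = ["cookie-for-account-a", "cookie-for-account-b"]


def build_headers(agents=AGENTS, cookies=COOKIES):
    """Return request headers with a freshly chosen User-Agent and cookie."""
    return {
        "User-agent": random.choice(agents),
        "Cookie": random.choice(cookies),
        "Host": "m.weibo.cn",
    }


# Call build_headers() once per request so consecutive requests can differ.
headers = build_headers()
print(headers["User-agent"])
```

    Building the headers inside a function, rather than once at import time, is what makes the choice change from request to request.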

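    2. Fetch the JSON for each timeline page and extract the idstr field to get the ID of every post.
    time.sleep(random.randint(1,4)) pauses for a random number of seconds rather than a fixed interval.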
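
    The extraction and the random pause can be sketched without touching the network. The sample payload below is hand-written: it only assumes the top-level cards list and the mblog/idstr nesting used by the scripts in this post, and the card without mblog is invented to show the guard.

```python
import random
import time


def extract_post_ids(payload):
    """Collect idstr from every card that carries a post."""
    ids = []
    for card in payload.get("cards") or []:
        mblog = card.get("mblog")
        if mblog and mblog.get("idstr"):
            ids.append(mblog["idstr"])
    return ids


def polite_sleep(low=1, high=4, sleep=time.sleep):
    """Sleep for a random whole number of seconds in [low, high]."""
    delay = random.randint(low, high)
    sleep(delay)
    return delay


# Hand-written sample; the second card has no mblog and is skipped.
sample = {
    "cards": [
        {"mblog": {"idstr": "4177994643916324"}},
        {"card_type": 11},
        {"mblog": {"idstr": "4131150395559419"}},
    ]
}
print(extract_post_ids(sample))
```

    Passing sleep as a parameter lets the pause be replaced with a no-op when testing.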

    3. In the same way, fetch the comment data from the comment JSON pages.
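
    A sketch of the comment paging loop, with the HTTP call replaced by an injected function so it runs offline. collect_comments, fake_fetch, and FAKE_PAGES are illustrative names, and the fake pages only mimic the fields the scripts in this post read (text, like_counts, user.id, user.screen_name).

```python
def collect_comments(post_id, fetch_page, max_pages=1000):
    """Read comment pages until ok is no longer 1.

    fetch_page(post_id, page) must return the decoded JSON as a dict.
    """
    records = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(post_id, page)
        if payload.get("ok") != 1:
            break  # ran past the last comment page
        for item in payload.get("data") or []:
            user = item.get("user") or {}
            records.append({
                "user_id": user.get("id"),
                "user_name": user.get("screen_name"),
                "likes": item.get("like_counts"),
                "text": item.get("text"),
            })
    return records


# Hand-written two-page stand-in for the real endpoint.
FAKE_PAGES = {
    1: {"ok": 1, "data": [{"text": "first", "like_counts": 2,
                           "user": {"id": 1, "screen_name": "a"}}]},
    2: {"ok": 1, "data": [{"text": "second", "like_counts": 0,
                           "user": {"id": 2, "screen_name": "b"}}]},
}


def fake_fetch(post_id, page):
    return FAKE_PAGES.get(page, {"ok": 0})


comments = collect_comments("4131150395559419", fake_fetch)
print(len(comments))
```

    The max_pages cap is a safety net that the original scripts do not have; it keeps the loop finite if the endpoint never returns ok=0.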

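    V. Lessons learned

    1. After the crawler has run for a while, it starts failing with "No JSON object could be decoded". Debugging showed that the page source could not be fetched: the response was a 404. The cause is that the User-Agent had been used too many times and was banned, which mostly happens when all requests come from a single IP address. I ran the crawler on a server with a dynamic IP, so I did not need to configure proxy IPs in the crawler itself; setting a proxy IP works the same way as choosing a User-Agent with random.choice(). Even with a dynamic IP, a User-Agent can still get banned.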
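
    Two small sketches for this point, both under stated assumptions. decode_json treats any non-200 status or undecodable body as "no data" instead of letting the decoding error surface, and pick_proxy chooses a proxy the same way random.choice() chooses a User-Agent. The proxy addresses are placeholders, and both function names are illustrative.

```python
import json
import random

# Placeholder addresses; replace with real proxy endpoints.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


def pick_proxy(pool=PROXIES):
    """Return a proxies mapping in the shape requests accepts."""
    address = random.choice(pool)
    return {"http": address, "https": address}


def decode_json(status_code, body):
    """Return the decoded JSON, or None for an error page or a bad body."""
    if status_code != 200:
        return None  # e.g. the 404 returned once the User-Agent is banned
    try:
        return json.loads(body)
    except ValueError:  # the JSON decoding error is a ValueError
        return None


print(decode_json(404, "<html>not found</html>"))
print(decode_json(200, '{"ok": 1}'))
```

    With requests, the mapping from pick_proxy() would be passed as the proxies argument of requests.get, and decode_json would receive resp.status_code and resp.text.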
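    For how a site's anti-crawler setup can block specific User-Agents from crawling it, see:

    Source: "Nginx反爬虫攻略:禁止某些User Agent抓取网站" (Nginx anti-crawler guide: blocking certain User-Agents from crawling a site)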

    2. When crawling a large amount of data, the crawler needs code that refreshes the Weibo account cookies automatically.
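
    One possible building block for that, sketched under assumptions: a pool that hands out cookies at random, retires a cookie after repeated failures, and accepts fresh ones. How a fresh cookie is obtained (logging in again) is not covered here, and CookiePool is an illustrative name, not code from the original scripts.

```python
import random


class CookiePool(object):
    """Rotate cookies and drop the ones that keep failing."""

    def __init__(self, cookies, max_failures=3):
        self._failures = dict((cookie, 0) for cookie in cookies)
        self._max_failures = max_failures

    def __len__(self):
        return len(self._failures)

    def pick(self):
        """Return a random live cookie."""
        if not self._failures:
            raise RuntimeError("cookie pool is empty; add fresh cookies")
        return random.choice(list(self._failures))

    def report_success(self, cookie):
        """Reset the failure count after a good response."""
        if cookie in self._failures:
            self._failures[cookie] = 0

    def report_failure(self, cookie):
        """Count a failure and retire the cookie at the limit."""
        if cookie not in self._failures:
            return
        self._failures[cookie] += 1
        if self._failures[cookie] >= self._max_failures:
            del self._failures[cookie]

    def add(self, cookie):
        """Register a freshly obtained cookie."""
        self._failures[cookie] = 0


pool = CookiePool(["cookie-a", "cookie-b"], max_failures=2)
pool.report_failure("cookie-a")
pool.report_failure("cookie-a")  # second failure retires cookie-a
print(len(pool))
```

    The crawl loop would call pick() before each request and report_success() or report_failure() after it.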

    VI. References

    An article that contributed a great deal to this crawl: "Python Weibo Crawler (3): Fetching Weibo Comment Data" (in Chinese)
    http://blog.csdn.net/FlySky1991/article/details/76924443

    VII. Code that only I can make sense of

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Script 1 (Python 2.7): collect the ID of every post on the timeline.
    import random
    import sys
    import time

    import requests

    import crawler.user_agents as ua
    from crawler import cookies as ck

    # Python 2 workaround so str() on non-ASCII text does not raise.
    reload(sys)
    sys.setdefaultencoding('utf8')


    def writeintxt(rows, filename):
        """Append one "page,post_id" line per row."""
        output = open(filename, 'a')
        for row in rows:
            output.write(str(row[0]) + ',' + str(row[1]) + '\n')
        output.close()


    def make_headers():
        """Build headers with a cookie and User-Agent chosen anew for each request."""
        return {
            'User-agent': random.choice(ua.agents),
            'Host': 'm.weibo.cn',
            'Accept': 'application/json, text/plain, */*',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch, br',
            'Referer': 'https://m.weibo.cn/u/1922826034',
            'Cookie': random.choice(ck.cookies),
            'Connection': 'keep-alive',
        }


    id_list = []
    base_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=1922826034&containerid=1076031922826034&page='
    for page in range(1, 1672):  # pages 1 to 1671
        try:
            url = base_url + str(page)
            resp = requests.get(url, headers=make_headers(), timeout=5)
            jsondata = resp.json()

            for card in jsondata.get('cards') or []:
                mblog = card.get('mblog')
                if mblog:  # skip cards that carry no post
                    id_list.append([page, mblog.get('idstr')])
        except Exception:
            # Report the failed page and move on to the next one.
            print(page)
            print('*' * 100)
        # Random pause, also after a failure.
        time.sleep(random.randint(1, 4))
    print("ok")


    writeintxt(id_list, 'weibo_id')

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Script 2 (Python 2.7): fetch every comment page for each collected post ID.
    import random
    import sys
    import time

    import requests

    import crawler.user_agents as ua
    from crawler import cookies as ck

    # Python 2 workaround so str() on non-ASCII text does not raise.
    reload(sys)
    sys.setdefaultencoding('utf8')

    DATA_DIR = u'D:/MattDoc/实习/1124爬取深圳移动新浪微博/网页/'


    def readfromtxt(filename):
        f = open(DATA_DIR + filename, "r")
        text = f.read()
        f.close()
        return text


    def writeintxt(comments_by_id, filename):
        """Append one line per post: post_id####record####record####..."""
        output = open(DATA_DIR + filename, 'a+')
        for post_id, records in comments_by_id.items():
            comment_str = ""
            for record in records:
                comment_str = comment_str + str(record) + "####"
            output.write(post_id + "####" + comment_str + '\n')
        output.close()


    def make_headers():
        """Build headers with a cookie and User-Agent chosen anew for each request."""
        return {
            'User-agent': random.choice(ua.agents),
            'Host': 'm.weibo.cn',
            'Accept': 'application/json, text/plain, */*',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch, br',
            'Referer': 'https://m.weibo.cn/u/1922826034',
            'Cookie': random.choice(ck.cookies),
            'Connection': 'keep-alive',
        }


    base_url = 'https://m.weibo.cn/api/comments/show?id='
    weibo_id_list = readfromtxt('weibo_id1.txt').split('\n')
    result_dict = {}
    for line in weibo_id_list:
        # Accept both "post_id" and "page,post_id" lines; skip blank lines.
        weibo_id = line.strip().split(',')[-1]
        if not weibo_id:
            continue
        try:
            record_list = []
            page = 1
            while True:
                url = base_url + weibo_id + '&page=' + str(page)
                resp = requests.get(url, headers=make_headers(), timeout=100)
                jsondata = resp.json()
                if jsondata.get('ok') != 1:
                    break  # ok=0: no more comments for this post
                page = page + 1
                for d in jsondata.get('data'):
                    # '$$' separates fields, so strip it from free text.
                    comment = d.get('text').replace('$$', '')
                    like_count = d.get('like_counts')
                    user_id = d.get("user").get('id')
                    user_name = d.get("user").get('screen_name').replace('$$', '')
                    one_record = str(user_id) + '$$' + str(like_count) + '$$' + str(user_name) + '$$' + str(comment)
                    record_list.append(one_record)

            result_dict[weibo_id] = record_list
        except Exception:
            # Report the failed post ID and move on to the next one.
            print(weibo_id)
            print('*' * 100)
        time.sleep(random.randint(2, 3))
    print("ok")

    writeintxt(result_dict, 'comment1.txt')

    # encoding=utf-8
    """ User-Agents """
    agents = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    ]

    # encoding=utf-8
    
    """ cookies """
    
    cookies = [
    
    "SINAGLOBAL=6061592354656.324.1489207743838; un=18240343109; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; cross_origin_proto=SSL; WBStorage=82ca67f06fa80da0|undefined; UOR=,,login.sina.com.cn; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; SSOLoginState=1511882345; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog0OIIQzSqq7xsdv-_GhEe8XWdkHikzsFJyqtvqej6OkaM.; SUB=_2A253GQ45DeThGeRP71IQ9y7NyDyIHXVUb3jxrDV8PUNbmtAKLWrSkW9NTjfYoWTfrO0PkXSICRzowbfjExbQidve; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WFaVAdSwLmvOo1VRiSlRa3q5JpX5KzhUgL.FozpSh5pS05pe052dJLoIfMLxKBLBonL122LxKnLB.qL1-z_i--fiKyFi-2Xi--fi-2fiKyFTCH8SFHF1C-4eFH81FHWSE-RebH8SE-4BC-RSFH8SFHFBbHWeEH8SEHWeF-RegUDMJ7t; SUHB=04W-u1HCo6armH; ALF=1543418344; wvr=6",
    "SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882443; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog0-14gBQox9IhSK8vZVaZYWsLxUaOWNkudAR9iT6NFJkg.; SUB=_2A253GQ6bDeRhGeNH6FsZ8CjLzj2IHXVUb2dTrDV8PUNbmtAKLWTjkW9NSqHIBUvGapKd6-MQhJTejk3w_ivUUNXZ; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5gYdHWIHRmedh9Nyrij6XN5JpX5K2hUgL.Fo-4e0.RehqNSK22dJLoI0.LxK-L122LB.qLxK-LB.BLBKqLxKMLB.2LBKzLxKnL12-L122LxK.LBK2L12qLxKqLBKqL1KHiqc-t; SUHB=0auwlDzUYulNGs; ALF=1543418442; un=13728408992; wvr=6",
    "SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; wb_cusLike_5939806751=N; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882512; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog089iFKjxeT1Oc6cbJkkqgWrnQAuMVukRrJy3898cKIb8.; SUB=_2A253GQ9ADeRhGeNH6FsZ8ynJzz6IHXVUb2eIrDV8PUNbmtAKLVWhkW9NSqG4DzNeLkyPCmJIKq6bXfKXpSRCPLqO; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W50J-rDh2D6-QEqNOZ2NddF5JpX5K2hUgL.Fo-4e0.Re0MfShz2dJLoIEeLxK-LB--L1KeLxK-L1hqLBoMLxKnL1K5LBo8IC281xEfIg5tt; SUHB=0gHiPrbPWNJvao; ALF=1543418511; un=15614187608; wvr=6",
    "SINAGLOBAL=6061592354656.324.1489207743838; TC-V5-G0=52dad2141fc02c292fc30606953e43ef; wb_cusLike_2140170130=N; _s_tentry=login.sina.com.cn; Apache=5393750164131.485.1511882292296; ULV=1511882292314:55:14:7:5393750164131.485.1511882292296:1511789163477; TC-Page-G0=1e758cd0025b6b0d876f76c087f85f2c; TC-Ugrow-G0=e66b2e50a7e7f417f6cc12eec600f517; login_sid_t=7cbd20d7f5c121ef83f50e3b28a77ed7; WBStorage=82ca67f06fa80da0|undefined; WBtopGlobal_register_version=573631b425a602e8; crossidccode=CODE-tc-1EjHEO-2SNIe8-y00Hd0Yq79mGw3l1975ae; wb_cusLike_5939806751=N; wb_cusLike_5939837542=N; cross_origin_proto=SSL; UOR=,,login.sina.com.cn; SSOLoginState=1511882567; SCF=AvFiX3-W7ubLmZwXrMhoZgCv_3ZXikK7fhjlPKRLjog02c5hBW41ia6vpj1cAqbFzE2KCcsXvDxToS_KOeUnwRc.; SUB=_2A253GQ8XDeRhGeNH6FsZ9CjKyjuIHXVUb2ffrDV8PUNbmtAKLU7wkW9NSqGOexL53l1CujvuLpAFNeOEsl05T_5E; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WWuISqBnuGqpyxGiWdJ4bOv5JpX5K2hUgL.Fo-4e0.RShqceKM2dJLoI0YLxK-L1K5L1K2LxK.L1KnLBoeLxK-L1K5L1K2LxKqL1-2L1KqLxK.L1KMLBo-LxKMLB.zLB.qLxK-L1hML1-Bt; SUHB=0LcSwyK5XYMzbr; ALF=1543418566; un=13242833134; wvr=6"
    
    
    ]
  • Original post: https://www.cnblogs.com/Denise-hzf/p/7927852.html