Tencent Social Recruitment Positions (Multithreading + Thread Pool)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
version_1
Note: this content is for study and reference only; it will be removed immediately upon any infringement claim.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
1、Development Notes
Site URL: https://careers.tencent.com/search.html?index=1
Response format: JSON
Data endpoint: https://careers.tencent.com/tencentcareer/api/post/Query?&countryId=1&pageIndex=1&pageSize=10&language=zh-cn&area=cn
Endpoint analysis: countryId and area must be present (a minimal probe follows)
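A minimal probe of the endpoint makes this concrete (a sketch: the Data/Posts fields match the parsing code in section 2, while Count is an assumption about where this API reports the total record count):

import requests

# Probe the data endpoint with the default parameters (sketch; "Count" is an
# assumed field name for the total record count, "Data"/"Posts" match section 2)
url = ("https://careers.tencent.com/tencentcareer/api/post/Query"
       "?&countryId=1&pageIndex=1&pageSize=10&language=zh-cn&area=cn")
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
resp = requests.get(url, headers=headers).json()
print(resp["Data"].get("Count"))                    # total records; 5573 per (2) below
print(resp["Data"]["Posts"][0]["RecruitPostName"])  # name of the first position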
Goal: (1) for the China-region positions on the recruiting site, collect the position name, position category, number of openings, work location, publish time, and the click-through link to each position's detail page
Analysis: (1) request-parameter analysis (shown as form data in the browser dev tools; the request is a GET, so these ride on the query string)
0>method: GET
1>key parameters: pageIndex, pageSize, timestamp
2>removing pageIndex breaks the request, so it is a required parameter
3>removing pageSize breaks the request, so it is a required parameter; it defaults to 10, and rewriting the number returns the corresponding amount of data; keep the default when building the URL pattern for the API (see the sketch after this list)
4>removing timestamp still works, so it is not required; it is generated by JS and is a timestamp; its purpose is unknown for now, as is whether it belongs to an anti-scraping strategy
5>countryId and area determine the scope of the crawl
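Based on points 2> and 3>, a sketch of the paginated URL pattern; the template and page-range arithmetic mirror start_url and get_url in the source code of section 2:

# Build one API URL per page, keeping pageSize at its default of 10
count = 5573       # total China-region records, see (2) below
page_size = 10
start_url = ("https://careers.tencent.com/tencentcareer/api/post/Query"
             "?&countryId=1&pageIndex={}&pageSize=10&language=zh-cn&area=cn")
urls = [start_url.format(i) for i in range(1, count // page_size + 2)]
print(len(urls))   # 558 pages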
(2)There are 5573 China-region position records in total
(3)Headers analysis
1>User-Agent: required parameter
2>referer:
https://careers.tencent.com/search.html?query=co_1&sc=1
https://careers.tencent.com/search.html?query=co_1&index=2&sc=1
https://careers.tencent.com/search.html?query=co_1&index=3&sc=1
Required parameter; index in the referer equals pageIndex, so the request headers must be built per page (see the sketch after this list).
Pair each referer with its url one by one and store the pairs in url_referer_queue.
3>cookie: the site can be browsed without logging in, so a cookie may be attached but is not required
4>initial idea: the queue stores and retrieves items in order, so each url corresponds to exactly one referer; to be verified
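A sketch of pairing each url with its referer as described in 2> and 4>; the queue name and dict keys follow the source code in section 2:

from queue import Queue

start_referer = "https://careers.tencent.com/search.html?query=co_1&sc=1"
second_referer = "https://careers.tencent.com/search.html?query=co_1&index={}&sc=1"
start_url = ("https://careers.tencent.com/tencentcareer/api/post/Query"
             "?&countryId=1&pageIndex={}&pageSize=10&language=zh-cn&area=cn")

url_referer_queue = Queue()
for page_index in range(1, 4):  # first three pages, as an illustration
    # page 1 has no index parameter in its referer; later pages carry index=pageIndex
    referer = start_referer if page_index == 1 else second_referer.format(page_index)
    url_referer_queue.put({"url": start_url.format(page_index), "referer": referer})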
Techniques: (1) use queues to pass data between stages
(2) use a thread pool to manage the threads, or possibly to fetch the JSON data (to be tested) >>>>>>> (not needed here, since queues are used; a thread-pool variant is sketched after the source code in section 2)
(3) store the results as JSON, or possibly save them as a CSV file (to be tested; see the CSV sketch after this list)
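For (3), which the notes leave untested, a sketch of saving the parsed items as CSV instead of JSON; the field names come from the parsing code in section 2, and the filename is hypothetical:

import csv

FIELDS = ["RecruitPostName", "PostId", "PostURL", "CategoryName",
          "LocationName", "LastUpdateTime", "Responsibility"]

def save_as_csv(items, path="tencent_social_position.csv"):  # hypothetical filename
    # Write the parsed position dicts to one CSV file with a header row
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(items)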
2、Source Code
import requests
from queue import Queue
import json
from threading import Thread
from multiprocessing.dummy import Pool


class Txsz():
    def __init__(self):
        # Total number of records
        self.count = 5573
        # Initial URL template
        self.start_url = "https://careers.tencent.com/tencentcareer/api/post/Query?&countryId=1&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
        # Referer templates: the first page has no index parameter, later pages do
        self.start_referer = "https://careers.tencent.com/search.html?query=co_1&sc=1"
        self.second_referer = "https://careers.tencent.com/search.html?query=co_1&index={}&sc=1"
        """
        Create the queues:
        self.url_referer_queue holds the url/referer pairs
        self.json_queue holds all the fetched JSON
        self.content_queue holds the parsed content
        """
        self.url_referer_queue = Queue()
        self.json_queue = Queue()
        self.content_queue = Queue()

    def get_url(self):
        """
        Build the URLs to crawl.
        pageIndex varies while pageSize stays at 10; the page range is derived
        from count=5573. Each url is paired with its referer and put on
        url_referer_queue.
        """
        pageSize = 10
        for pageIndex in range(1, self.count // pageSize + 2):
            item = dict()
            url = self.start_url.format(pageIndex)
            if pageIndex == 1:
                referer = self.start_referer
            else:
                referer = self.second_referer.format(pageIndex)
            item["url"] = url
            item["referer"] = referer
            self.url_referer_queue.put(item)

    def get_json(self):
        """
        Fetch the returned JSON; in principle each url yields one JSON document.
        """
        while True:
            # get() blocks until an item is available; the daemon flag plus the
            # queue joins in run() let the main thread exit once all work is done
            item = self.url_referer_queue.get()
            url = item["url"]
            referer = item["referer"]
            # Build the request headers
            headers = {
                "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
                "cookie": "_ga=GA1.2.275188285.1582376294; pgv_pvi=4879193088; _gcl_au=1.1.1802818219.1582376295; loading=agree",
                "referer": referer,
            }
            print("spider_url:", url)
            # Fetch the response body
            response = requests.get(url, headers=headers).content.decode()
            # Convert the str to a dict
            response = json.loads(response)
            # Put the fetched JSON on json_queue
            self.json_queue.put(response)
            # Mark this queue item as done
            self.url_referer_queue.task_done()

    def get_content(self):
        """
        Parse one JSON document from json_queue and extract:
        position name:    RecruitPostName
        position ID:      PostId
        detail page URL:  PostURL
        category:         CategoryName
        work location:    LocationName
        publish time:     LastUpdateTime
        responsibilities: Responsibility
        """
        while True:
            # Get one JSON document
            json_ = self.json_queue.get()
            Posts = json_["Data"]["Posts"]
            for item in Posts:
                item_dict = {
                    "RecruitPostName": item["RecruitPostName"],
                    "PostId": item["PostId"],
                    "PostURL": item["PostURL"],
                    "CategoryName": item["CategoryName"],
                    "LocationName": item["LocationName"],
                    "LastUpdateTime": item["LastUpdateTime"],
                    "Responsibility": item["Responsibility"],
                }
                self.content_queue.put(item_dict)
            self.json_queue.task_done()

    def save_content(self):
        while True:
            item = self.content_queue.get()
            with open("tencent_social_position.json", "a", encoding="utf-8") as f:
                f.write(json.dumps(item, ensure_ascii=False, indent=2))
                f.write(",")
            self.content_queue.task_done()

    def run(self):
        """
        Main logic.
        """
        # Create the thread list
        thread_list = list()
        # Thread for get_url
        url_thread = Thread(target=self.get_url)
        thread_list.append(url_thread)
        # Thread for get_json
        json_thread = Thread(target=self.get_json)
        thread_list.append(json_thread)
        # Thread for get_content
        content_thread = Thread(target=self.get_content)
        thread_list.append(content_thread)
        # Thread for save_content
        save_content_thread = Thread(target=self.save_content)
        thread_list.append(save_content_thread)

        # # Create a thread pool
        # pool = Pool(10)
        # def process_thread(thread_):
        #     thread_.daemon = True
        #     thread_.start()
        # pool.map(process_thread, thread_list)

        for t in thread_list:
            t.daemon = True  # daemon threads die with the main thread
            t.start()
        # Exit the main thread only once every queue has been drained
        self.url_referer_queue.join()
        self.json_queue.join()
        self.content_queue.join()


if __name__ == "__main__":
    obj = Txsz()
    obj.run()
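The commented-out pool code above only moves the thread start-up into a pool, so it adds little. A closer fit for "use a thread pool to fetch the JSON data" from the techniques list is a sketch along these lines, using the standard-library concurrent.futures and assuming the Txsz class above is already defined; fetch_json is a hypothetical stand-in for the request logic inside get_json:

from concurrent.futures import ThreadPoolExecutor

def fetch_json(item):
    # Hypothetical stand-in for get_json: one url/referer pair in, one JSON dict out
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
        "referer": item["referer"],
    }
    return requests.get(item["url"], headers=headers).json()

obj = Txsz()
obj.get_url()
# Drain the url/referer pairs into a list, then fetch them with 10 pooled threads
items = [obj.url_referer_queue.get() for _ in range(obj.url_referer_queue.qsize())]
with ThreadPoolExecutor(max_workers=10) as pool:
    for json_ in pool.map(fetch_json, items):
        obj.json_queue.put(json_)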