

    51job Scraper

      The project uses Python 3.7 to scrape 前程无忧 (51job) postings for a set of job keywords and save them to MongoDB. From the scraped data you can get a picture of the current state of the Internet industry: how many postings each role has, and how salaries are distributed for each role (see the analysis sketch at the end).

      GitHub: https://github.com/HowName/51job

    • Statistics chart: Java still leads the pack [image omitted]
    • The scraper in action [image omitted]
    • Data in MongoDB [image omitted]
    • Libraries used (the third-party ones are best installed with pip, as shown below; re and time are part of the standard library)
    • BeautifulSoup4, pymongo, requests, re, time
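
      The three third-party packages can be installed in one go; these are their standard PyPI names, while re and time need no install:

          pip install beautifulsoup4 pymongo requests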
      

        

    • Main project code
    • import re
      import time
      from bs4 import BeautifulSoup
      # pack.DbUtil / pack.RequestUtil are the project's own helpers (sketched after the script)
      from pack.DbUtil import DbUtil
      from pack.RequestUtil import RequestUtil
      
      db = DbUtil()
      
      # Job keywords to search for
      keywords = ['php', 'java', 'python', 'node.js', 'go', 'hadoop', 'AI', '算法工程师', 'ios', 'android', '区块链', '大数据']
      
      for keyword in keywords:
      
          cur_page = 1
          url = ('https://search.51job.com/list/030200,000000,0000,00,9,99,@keyword,2,@cur_page.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
                 .replace('@keyword', str(keyword)).replace('@cur_page', str(cur_page)))
          req = RequestUtil()
          html_str = req.get(url)
      
          # Parse the total page count from the first results page
          soup = BeautifulSoup(html_str, 'html.parser')  # the lxml parser is a faster alternative
          the_total_page = soup.select('.p_in .td')[0].string.strip()
          the_total_page = int(re.sub(r"\D", "", the_total_page))  # keep only the digits
      
          print('keyword:', keyword, 'total page: ', the_total_page)
          print('start...')
      
          while cur_page <= the_total_page:
              """
              循环获取每一页
              """
      
              url = ('https://search.51job.com/list/030200,000000,0000,00,9,99,@keyword,2,@cur_page.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
                     .replace('@keyword', str(keyword)).replace('@cur_page', str(cur_page)))
              req = RequestUtil()
              html_str = req.get(url)
      
              if html_str:
                  soup = BeautifulSoup(html_str, 'html.parser')
      
                  # print(soup.prettify())  # pretty-print the page for debugging
      
                  the_all = soup.select('.dw_table .el')
                  del the_all[0]  # the first .el row is the table header, not a job
      
                  # Read each job listing on the page
                  dict_data = []
                  for item in the_all:
                      job_name = item.find(name='a').string.strip()
                      company_name = item.select('.t2')[0].find('a').string.strip()
                      area = item.select('.t3')[0].string.strip()
                      pay = item.select('.t4')[0].string  # may be None when no salary is given
                      update_time = item.select('.t5')[0].string.strip()
      
                      dict_data.append(
                          {'job_name': job_name, 'company_name': company_name, 'area': area, 'pay': pay,
                           'update_time': update_time, 'keyword': keyword}
                      )
      
                  # Insert this page's listings into MongoDB
                  db.insert(dict_data)
      
                  print('keyword:', keyword, 'success page:', cur_page, 'insert count:', len(dict_data))
                  time.sleep(0.5)
      
              else:
                  print('keyword:', keyword, 'fail page:', cur_page)
      
              # Move on to the next page
              cur_page += 1
      
          else:
              # while-else: runs once the page loop finishes normally (no break)
              print('keyword:', keyword, 'fetch end...')
      
      else:
          # for-else: runs after every keyword has been processed
          print('Mission complete!!!')
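
    • Helper classes (a sketch; not included in the post)

      The script imports DbUtil and RequestUtil from a pack package whose source the post doesn't show. The sketch below is a guess at minimal versions that match how they are called: DbUtil() wraps a pymongo collection behind an insert method, and RequestUtil wraps requests behind a get that returns the page HTML, or None on failure. The host, database and collection names are placeholders, not taken from the repo.

      import pymongo
      import requests


      class DbUtil:
          """Thin wrapper around a MongoDB collection (db/collection names are assumptions)."""

          def __init__(self, host='localhost', port=27017):
              client = pymongo.MongoClient(host, port)
              self.collection = client['job51']['jobs']

          def insert(self, dict_data):
              # insert_many rejects an empty list, so guard against it
              if dict_data:
                  self.collection.insert_many(dict_data)


      class RequestUtil:
          """Thin wrapper around requests.get."""

          def get(self, url):
              headers = {'User-Agent': 'Mozilla/5.0'}
              try:
                  resp = requests.get(url, headers=headers, timeout=10)
                  resp.encoding = 'gbk'  # 51job pages were served gbk-encoded at the time
                  return resp.text if resp.status_code == 200 else None
              except requests.RequestException:
                  return None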
      

        

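    • Analysis sketch

      The per-keyword counts and the salary distribution mentioned in the intro can be pulled straight out of MongoDB. Below is a minimal sketch, reusing the assumed db/collection names from the helper sketch above and assuming pay strings in the usual 51job form such as '1-1.5万/月'; neither detail comes from the repo.

      import re

      import pymongo

      jobs = pymongo.MongoClient('localhost', 27017)['job51']['jobs']

      # Postings per keyword, most popular first: the numbers behind the statistics chart
      for row in jobs.aggregate([
          {'$group': {'_id': '$keyword', 'count': {'$sum': 1}}},
          {'$sort': {'count': -1}},
      ]):
          print(row['_id'], row['count'])


      def avg_monthly_pay(pay):
          """Turn a pay string such as '1-1.5万/月' into an average monthly CNY figure."""
          if not pay or '/月' not in pay:
              return None  # skip missing salaries and daily/yearly rates
          nums = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', pay)]
          if not nums:
              return None
          avg = sum(nums) / len(nums)
          # crude unit handling; mixed ranges like '8千-1.2万/月' come out wrong
          if '万' in pay:
              return avg * 10000
          if '千' in pay:
              return avg * 1000
          return avg
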
  • Original post: https://www.cnblogs.com/longer756567406/p/10833745.html