Python Crawler Beginner Tutorial: Scraping Shijiazhuang Lianjia Rental Data


     

    1. Preface

    This post scrapes rental listings from Lianjia; the collected data will serve as material for some data analysis in later posts.
    The URL we need to crawl is: https://sjz.lianjia.com/zufang/

    2. Analyzing the URLs

    First, let's determine which data we need.


    [Screenshot: a listing page with the fields we need highlighted in yellow]

    As you can see, the yellow boxes mark the data we need.

    Next, let's work out the pagination pattern; a short sketch after the list shows how these URLs can be generated.

    https://sjz.lianjia.com/zufang/pg1/
    https://sjz.lianjia.com/zufang/pg2/
    https://sjz.lianjia.com/zufang/pg3/
    https://sjz.lianjia.com/zufang/pg4/
    https://sjz.lianjia.com/zufang/pg5/
    ... 
    https://sjz.lianjia.com/zufang/pg80/
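
    Since only the page number changes, the whole list of page URLs can be generated in one line. A minimal sketch, assuming the site currently shows 80 pages (the highest number in the pattern above):

    base = "https://sjz.lianjia.com/zufang/pg{}/"
    urls = [base.format(page) for page in range(1, 81)]
    print(urls[0], urls[-1])  # first and last page URLs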

    3. Parsing the Pages

    With the paginated addresses known, the full list of links can be assembled quickly. We use the lxml module to parse the page source and extract the data we want.

    This time the code uses a new module, fake_useragent, which returns a random UA (User-Agent). The module is simple to use, and plenty of tutorials for it can be found with a quick search.

    In this post it is mainly used to pick a random UA:

    self._ua = UserAgent()
    self._headers = {"User-Agent": self._ua.random}  # pick a random UA
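
    fake_useragent keeps a pool of real browser UA strings; the only part relied on here is the UserAgent().random property. A minimal self-contained check, assuming the package is installed (pip install fake-useragent):

    from fake_useragent import UserAgent

    ua = UserAgent()
    print(ua.random)  # a randomly chosen real browser User-Agent string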

    Since the page URLs can be pieced together quickly, coroutines are used for the crawling, and the pandas module is used to write the CSV file.

    from fake_useragent import UserAgent
    from lxml import etree
    import asyncio
    import aiohttp
    import pandas as pd
    
    class LianjiaSpider(object):
    
        def __init__(self):
            self._ua = UserAgent()
            self._headers = {"User-Agent": self._ua.random}
            self._data = list()
    
    
        async def get(self,url):
            async with aiohttp.ClientSession() as session:
                try:
                    async with session.get(url,headers=self._headers,timeout=3) as resp:
                        if resp.status==200:
                            result = await resp.text()
                            return result
                except Exception as e:
                    print(e.args)
    
        async def parse_html(self):
            for page in range(1,77):
                url = "https://sjz.lianjia.com/zufang/pg{}/".format(page)
                print("正在爬取{}".format(url))
                html = await self.get(url)   # 获取网页内容
                html = etree.HTML(html)  # 解析网页
                self.parse_page(html)   # 匹配我们想要的数据
    
                print("正在存储数据....")
                ######################### 数据写入
                data = pd.DataFrame(self._data)
                data.to_csv("链家网租房数据.csv", encoding='utf_8_sig')   # 写入文件
                ######################### 数据写入
    
    
    
        def run(self):
            loop = asyncio.get_event_loop()
            tasks = [asyncio.ensure_future(self.parse_html())]
            loop.run_until_complete(asyncio.wait(tasks))
    
    
    if __name__ == '__main__':
        l = LianjiaSpider()
        l.run()
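
    The run() method uses the classic get_event_loop boilerplate. On Python 3.7+, an equivalent and simpler entry point is asyncio.run; a small sketch of the same start-up code:

    if __name__ == '__main__':
        l = LianjiaSpider()
        asyncio.run(l.parse_html())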

    The code above is still missing the function that parses each page; let's fill it in next.

        def parse_page(self, html):
            info_panel = html.xpath("//div[@class='info-panel']")
            for info in info_panel:
                region = self.remove_space(info.xpath(".//span[@class='region']/text()"))
                zone = self.remove_space(info.xpath(".//span[@class='zone']/span/text()"))
                meters = self.remove_space(info.xpath(".//span[@class='meters']/text()"))
                where = self.remove_space(info.xpath(".//div[@class='where']/span[4]/text()"))
    
                con = info.xpath(".//div[@class='con']/text()")
                floor = con[0]  # floor
                type = con[1]   # unit layout
    
                agent = info.xpath(".//div[@class='con']/a/text()")[0]
    
                has = info.xpath(".//div[@class='left agency']//text()")
    
                price = info.xpath(".//div[@class='price']/span/text()")[0]
                price_pre =  info.xpath(".//div[@class='price-pre']/text()")[0]
                look_num = info.xpath(".//div[@class='square']//span[@class='num']/text()")[0]
    
                one_data = {
                    "region":region,
                    "zone":zone,
                    "meters":meters,
                    "where":where,
                    "louceng":floor,
                    "type":type,
                    "xiaoshou":agent,
                    "has":has,
                    "price":price,
                    "price_pre":price_pre,
                    "num":look_num
                }
                self._data.append(one_data)  # append the record
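
    parse_page calls a remove_space helper that the post never shows. A minimal sketch, assuming it simply joins the text nodes returned by xpath() and strips whitespace and newlines (the original implementation may differ):

        def remove_space(self, item):
            # hypothetical helper, not shown in the original post:
            # join the matched text nodes and drop newlines / extra spaces
            return "".join(item).replace("\n", "").strip() if item else ""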

    Before long, most of the data has been scraped.
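
    Since the data is meant as material for later analysis posts, it is worth checking that the CSV written by to_csv above loads back cleanly. A quick sketch using the same file name and encoding:

    import pandas as pd

    df = pd.read_csv("链家网租房数据.csv", encoding='utf_8_sig')
    print(df.shape)   # number of scraped listings and columns
    print(df.head())  # first few records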


     
     