• The chunksize pitfall when reading database data with pandas.read_sql_query()


    A recent task required reading 15 million rows from a database. With that much data, reading it all in one go was not an option, since it would exhaust memory. pandas.read_sql_query() has a chunksize parameter that is supposed to return the data in batches of chunksize rows, so I tried it with pandas. The code is as follows:

    import pandas as pd
    from sqlalchemy import create_engine
    import psycopg2  # PostgreSQL driver; SQLAlchemy loads it via the URL below
    import json
    
    class DB_connection(object):
        def __init__(self):
            # Connection settings (USER, PASSWORD, HOST, PORT, DATABASE)
            # are kept out of the code in a JSON config file.
            with open('config_db.json', 'r') as load_f:
                db_config = json.load(load_f)
            # Build the SQLAlchemy URL for PostgreSQL over psycopg2.
            self.db_engine = create_engine(
                'postgresql+psycopg2://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}'
                .format(**db_config)
            )
            self.db_conn = self.db_engine.connect()
            self.database = db_config['DATABASE']
    
        def read_from_table(self):
            # With chunksize set, read_sql_query returns an iterator of
            # DataFrames, each holding up to 2000 rows.
            data_gen = pd.read_sql_query(
                'SELECT case_id, text FROM first_case',
                self.db_conn, chunksize=2000
            )
            return data_gen
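
    For reference, here is how that generator was meant to be consumed; process() below is a hypothetical stand-in for the real per-batch work, not something from the original post:

    db = DB_connection()
    for chunk in db.read_from_table():
        process(chunk)  # each chunk is a DataFrame of up to 2000 rows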

    With chunksize set, pandas.read_sql_query() does return an iterator. But the program just hung and never produced a chunk. Reading the pandas source made the reason clear: this is not a true batched read. The SQL is executed in full first (and with psycopg2's default client-side cursor, the driver pulls the entire result set into memory at that point); chunksize only controls how the already-fetched rows are then handed back through an iterator.

        def read_query(self, sql, index_col=None, coerce_float=True,
                       parse_dates=None, params=None, chunksize=None):
            """Read SQL query into a DataFrame.
    
            Parameters
            ----------
            sql : string
                SQL query to be executed.
            index_col : string, optional, default: None
                Column name to use as index for the returned DataFrame object.
            coerce_float : boolean, default True
                Attempt to convert values of non-string, non-numeric objects (like
                decimal.Decimal) to floating point, useful for SQL result sets.
            params : list, tuple or dict, optional, default: None
                List of parameters to pass to execute method.  The syntax used
                to pass parameters is database driver dependent. Check your
                database driver documentation for which of the five syntax styles,
                described in PEP 249's paramstyle, is supported.
                Eg. for psycopg2, uses %(name)s so use params={'name' : 'value'}
            parse_dates : list or dict, default: None
                - List of column names to parse as dates.
                - Dict of ``{column_name: format string}`` where format string is
                  strftime compatible in case of parsing string times, or is one of
                  (D, s, ns, ms, us) in case of parsing integer timestamps.
                - Dict of ``{column_name: arg dict}``, where the arg dict
                  corresponds to the keyword arguments of
                  :func:`pandas.to_datetime` Especially useful with databases
                  without native Datetime support, such as SQLite.
            chunksize : int, default None
                If specified, return an iterator where `chunksize` is the number
                of rows to include in each chunk.
    
            Returns
            -------
            DataFrame
    
            See also
            --------
            read_sql_table : Read SQL database table into a DataFrame
            read_sql
    
            """
            args = _convert_params(sql, params)
    
            result = self.execute(*args)
            columns = result.keys()
    
            if chunksize is not None:
                return self._query_iterator(result, chunksize, columns,
                                            index_col=index_col,
                                            coerce_float=coerce_float,
                                            parse_dates=parse_dates)
            else:
                data = result.fetchall()
                frame = _wrap_result(data, columns, index_col=index_col,
                                     coerce_float=coerce_float,
                                     parse_dates=parse_dates)
                return frame

    As the source shows, it first runs the SQL with execute() and only then looks at chunksize: without it, fetchall() returns every row at once; with it, an iterator is returned that yields chunksize rows per step. But by the time that check happens, the execute() call has already materialized the full result set on the client, so this is not a true batched read: with a large table, memory still blows up and the process grinds to a halt.

    The best way to read in batches is to batch directly in the SQL statement, for example with LIMIT and OFFSET:

    SELECT case_id, text FROM first_case ORDER BY case_id LIMIT 1000 OFFSET 0

    LIMIT a OFFSET b skips the first b rows and returns the next a rows, so keeping a fixed and advancing b by a after every batch reads through all the data batch by batch; a sketch of such a loop follows. The ORDER BY on a unique column matters: without it, the database may return rows in a different order on each query, and batches could overlap or skip rows.
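
    A minimal sketch of that loop, reusing the DB_connection class from above. The function name read_in_batches, the batch size, and the empty-frame stopping test are my own choices for illustration, not from the original post:

    import pandas as pd
    from sqlalchemy import text
    
    def read_in_batches(conn, batch_size=2000):
        # Page through first_case with LIMIT/OFFSET, ordered by the
        # unique case_id so the paging is stable across queries.
        offset = 0
        while True:
            frame = pd.read_sql_query(
                text('SELECT case_id, text FROM first_case '
                     'ORDER BY case_id LIMIT :limit OFFSET :offset'),
                conn, params={'limit': batch_size, 'offset': offset}
            )
            if frame.empty:  # past the last row: stop
                break
            yield frame
            offset += batch_size
    
    db = DB_connection()
    for chunk in read_in_batches(db.db_conn):
        ...  # process each batch here; only one batch is in memory at a time

    One caveat: OFFSET itself gets slower as it grows, because the database still has to scan past all the skipped rows. For very large tables, keyset pagination (WHERE case_id > last_seen_id ORDER BY case_id LIMIT a) avoids that cost.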

• Original article: https://www.cnblogs.com/zongfa/p/11332274.html