• Reading and writing files with Python pandas: reading files with read_csv()


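    Reading files with read_csv()

    1. pandas functions for reading files

    • read_csv: loads delimited data from a file, URL, or file-like object. The default delimiter is a comma.
    • read_table: loads delimited data from a file, URL, or file-like object. The default delimiter is a tab ("\t").
    • read_fwf: reads data in fixed-width column format (that is, with no delimiters).
    • read_clipboard: reads data from the clipboard; think of it as the clipboard version of read_table. It is handy when turning a web page into a table.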
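
    The difference between the default delimiters is easier to see on a tiny example. The sketch below is not from the original post: it uses made-up in-memory data (via io.StringIO) instead of files, and leaves out read_clipboard because that needs a real clipboard.

```python
import io
import pandas as pd

# Made-up sample data, held in memory so no files are needed.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"
tsv_text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"
fwf_text = "a  b  c\n1  2  3\n4  5  6\n"

df_csv = pd.read_csv(io.StringIO(csv_text))    # comma is the default separator
df_tsv = pd.read_table(io.StringIO(tsv_text))  # tab is the default separator
df_fwf = pd.read_fwf(io.StringIO(fwf_text))    # column boundaries inferred from the text

print(df_csv)
print(df_tsv)
print(df_fwf)
```

    All three calls should produce the same three-column table, since only the layout of the input text differs.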

    2. A simple example of reading files

    Code:

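    import pandas as pd
    
    # read_csv and read_table give the same result once sep=',' is passed
    df = pd.read_csv('D:/project/python_instruct/test_data1.csv')
    print('read_csv on a CSV file:', df, sep='\n')
    df = pd.read_table('D:/project/python_instruct/test_data1.csv', sep=',')
    print('read_table on a CSV file:', df, sep='\n')
    
    # File without a header row: let pandas number the columns, or name them
    df = pd.read_csv('D:/project/python_instruct/test_data2.csv', header=None)
    print('read_csv on a CSV file with no header row:', df, sep='\n')
    df = pd.read_csv('D:/project/python_instruct/test_data2.csv', names=['a', 'b', 'c', 'd', 'message'])
    print('read_csv with custom column names:', df, sep='\n')
    
    # Use the 'message' column as the index
    names = ['a', 'b', 'c', 'd', 'message']
    df = pd.read_csv('D:/project/python_instruct/test_data2.csv', names=names, index_col='message')
    print('read_csv with an index column:', df, sep='\n')
    
    # Several index columns form a hierarchical index
    parsed = pd.read_csv('D:/project/python_instruct/test_data3.csv', index_col=['key1', 'key2'])
    print('read_csv building a hierarchical index from several columns:')
    print(parsed)
    
    # Fields separated by a variable amount of whitespace: use a regex separator
    print(list(open('D:/project/python_instruct/test_data1.txt')))
    result = pd.read_table('D:/project/python_instruct/test_data1.txt', sep=r'\s+')
    print('read_table with a regular-expression separator:')
    print(result)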

    Output:

    read_csv on a CSV file:
       a   b   c   d message
    0  1   2   3   4   hello
    1  5   6   7   8   world
    2  9  10  11  12     foo
    read_table on a CSV file:
       a   b   c   d message
    0  1   2   3   4   hello
    1  5   6   7   8   world
    2  9  10  11  12     foo
    read_csv on a CSV file with no header row:
       0   1   2   3      4
    0  1   2   3   4  hello
    1  5   6   7   8  world
    2  9  10  11  12    foo
    read_csv with custom column names:
       a   b   c   d message
    0  1   2   3   4   hello
    1  5   6   7   8   world
    2  9  10  11  12     foo
    read_csv with an index column:
             a   b   c   d
    message
    hello    1   2   3   4
    world    5   6   7   8
    foo      9  10  11  12
    read_csv building a hierarchical index from several columns:
               value1  value2
    key1 key2
    one  a          1       2
         b          3       4
         c          5       6
         d          7       8
    two  a          9      10
         b         11      12
         c         13      14
         d         15      16
    [' A B C\n', 'aaa -0.26 -0.1 -0.4\n', 'bbb -0.92 -0.4 -0.7\n', 'ccc -0.34 -0.5 -0.8\n', 'ddd -0.78 -0.3 -0.2']
    read_table with a regular-expression separator:
            A    B    C
    aaa -0.26 -0.1 -0.4
    bbb -0.92 -0.4 -0.7
    ccc -0.34 -0.5 -0.8
    ddd -0.78 -0.3 -0.2
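
    The options used in this section (header, names, index_col, and a regular-expression sep) can also be tried without the test_data files. The following is a minimal sketch under that assumption: the strings below are made-up stand-ins for the files, read through io.StringIO.

```python
import io
import pandas as pd

# Made-up data standing in for a CSV file without a header row.
no_header = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None: pandas numbers the columns 0..4.
df_default = pd.read_csv(io.StringIO(no_header), header=None)

# names + index_col: supply column names and use one of them as the index.
names = ['a', 'b', 'c', 'd', 'message']
df_indexed = pd.read_csv(io.StringIO(no_header), names=names, index_col='message')

# A list passed to index_col builds a hierarchical (MultiIndex) index.
keyed = "key1,key2,value1,value2\none,a,1,2\none,b,3,4\ntwo,a,9,10\n"
df_multi = pd.read_csv(io.StringIO(keyed), index_col=['key1', 'key2'])

# A regular expression as separator: one or more whitespace characters.
# The header has one field fewer than the data rows, so the first
# column is used as the index.
spaced = "    A     B     C\naaa -0.26  -0.1  -0.4\nbbb -0.92  -0.4  -0.7\n"
df_regex = pd.read_csv(io.StringIO(spaced), sep=r'\s+')

print(df_default)
print(df_indexed.loc['world', 'b'])
print(df_multi.loc[('two', 'a'), 'value2'])
print(df_regex.loc['bbb', 'C'])
```

    Note the raw string r'\s+': without the backslash, sep='s+' would split on the letter "s" rather than on whitespace.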

    3. Reading a large dataset in chunks

    Start with the following code:

    result = pd.read_csv('D:/project/python_instruct/weibo_network.txt')
    print('Original file:', result)

    Output:

    Traceback (most recent call last):
    
      File "<ipython-input-5-6eb71b2a5e94>", line 1, in <module>
        runfile('D:/project/python_instruct/Test.py', wdir='D:/project/python_instruct')
    
      File "D:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
        execfile(filename, namespace)
    
      File "D:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
        exec(compile(f.read(), filename, 'exec'), namespace)
    
      File "D:/project/python_instruct/Test.py", line 75, in <module>
        result = pd.read_csv('D:/project/python_instruct/weibo_network.txt')
    
      File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f
        return _read(filepath_or_buffer, kwds)
    
      File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 325, in _read
        return parser.read()
    
      File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
        ret = self._engine.read(nrows)
    
      File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
        data = self._reader.read(nrows)
    
      File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
    
      File "pandas\parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9003)
    
      File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
    
      File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
    
      File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
    
    CParserError: Error tokenizing data. C error: out of memory

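    The dataset is too large to fit in memory. To get a look at it, we can read only a few rows, for example the first 10:

    result = pd.read_csv('D:/project/python_instruct/weibo_network.txt', nrows=10)
    print('Reading only a few rows:')
    print(result)

    Output: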

                                     
    0  0	296	3	1	10	1	12	1	13	1	14	1	16	...
    1  1	271	8	1	17	1	22	1	31	0	34	1	6742...
    2  2	158	0	0	5	1	10	1	11	1	13	1	16	0...
    3  3	413	0	1	5	1	194	1	354	1	3462	1	8...
    4  4	142	1	0	5	1	7	1	11	1	14	1	18	1...
    5  5	272	2	1	3	1	4	1	12	1	13	1	14	1...
    6  6	59	9	1	13	1	46991	0	66930	0	85672...
    7  7	131	4	1	11	1	20	1	24	1	26	0	30	...
    8  8	326	0	0	1	1	12	1	13	1	17	1	19	1...
    9  9	12	0	0	6	1	10	1	13	1	18	0	466527...
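
    Reading a few rows only gives a preview. To process the whole file without loading it at once, read_csv accepts a chunksize argument and then returns an iterator of DataFrames. The sketch below is an illustration, not the original author's code: the data is a made-up 1,000-row CSV held in memory, and the column names key and value are hypothetical.

```python
import io
import pandas as pd

# Made-up stand-in for a file too large to load at once: 1,000 rows.
lines = ["key,value"] + [f"{'abc'[i % 3]},{i}" for i in range(1000)]
big_csv = io.StringIO("\n".join(lines))

# With chunksize, read_csv yields DataFrames of up to 100 rows each,
# so only one chunk needs to be in memory at a time.
totals = {}
row_count = 0
for chunk in pd.read_csv(big_csv, chunksize=100):
    row_count += len(chunk)
    # Sum 'value' per key within the chunk, then fold into the running totals.
    for key, subtotal in chunk.groupby('key')['value'].sum().items():
        totals[key] = totals.get(key, 0) + int(subtotal)

print(row_count)
print(totals)
```

    For a real file, the in-memory buffer would be replaced by the file path, and the chunk size would be chosen to fit the available memory. The preview above suggests weibo_network.txt is tab-separated, in which case sep='\t' would probably be needed as well.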
  • Original article: https://www.cnblogs.com/heitaoq/p/7994842.html