• Python Pandas operations 02: filtering data, selecting data, using loc and iloc, appending a row, reading specific rows


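    1. df.loc[[index],[column]]: select data by label

    loc takes two selectors, each a single label, a list of labels, or a slice, separated by ",". The first selects rows and the second selects columns.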
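
The three selector forms can be tried on a small in-memory frame. This is a minimal sketch: the sample values are made up, and only the column names are borrowed from this post.

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the ones used in this post.
df = pd.DataFrame({
    "customername": ["anna", "ben", "cara", "dan"],
    "reviews": ["good", "noisy", "clean", "far"],
    "aspectflag": [1.0, 0.0, 1.0, 0.0],
})

single = df.loc[1, "reviews"]                       # one row label, one column label -> scalar
some = df.loc[[0, 2], ["customername", "reviews"]]  # lists of labels -> DataFrame
ranged = df.loc[1:2, "customername":"reviews"]      # label slices include both endpoints

print(single)
print(some)
print(ranged)
```

Unlike ordinary Python slices, a label slice in loc includes its end label, so 1:2 returns the rows labeled 1 and 2.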

    (1) Get the data of one column

    df.loc[:,'reviews']      Note: the first argument ":" selects all rows; the second is the column name, here the reviews column.

    import pandas as pd
    df=pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_05.csv',header=None,nrows=5)
    # set custom column names after reading
    columns_name=['mysql_id','hotelname','customername','reviews','aspectflag','review_fenci','review_pos','review_fenci_pos']
    df.columns=columns_name
    print(df.head(3)) # print the first 3 rows
    print(df.loc[:,'reviews'].head(3))

    (2) Select specific rows and columns

    df.loc[[0,2],['customername','reviews','review_fenci']]   Parameters: the list [0,2] selects the rows labeled 0 and 2; the list ['customername','reviews','review_fenci'] selects the three columns with those names.

    import pandas as pd
    df=pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_05.csv',header=None,nrows=5)
    # set custom column names after reading
    columns_name=['mysql_id','hotelname','customername','reviews','aspectflag','review_fenci','review_pos','review_fenci_pos']
    df.columns=columns_name
    print(df.head(3)) # print the first 3 rows
    print(df.loc[[0,2],['customername','reviews','review_fenci']])

    2. df.iloc[[index],[column]]: select data by position

    (1) Selecting a single column returns it as a Series

    (2) Selecting two or more columns returns them as a DataFrame
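
The two return types described above can be checked directly. This sketch uses made-up sample data with a non-default index (10, 20, 30), so that positions and labels differ and it is clear that iloc counts positions.

```python
import pandas as pd

# Hypothetical sample data with a non-default index, so labels and positions differ.
df = pd.DataFrame(
    {
        "hotelname": ["A", "B", "C"],
        "customername": ["anna", "ben", "cara"],
        "reviews": ["good", "noisy", "clean"],
    },
    index=[10, 20, 30],
)

one_col = df.iloc[:, 1]            # a single integer position -> Series
two_cols = df.iloc[:, [1, 2]]      # a list of positions -> DataFrame
picked = df.iloc[[0, 2], [1, 2]]   # rows at positions 0 and 2, columns at positions 1 and 2

print(type(one_col).__name__)
print(type(two_cols).__name__)
print(picked)
```

With this index, df.loc[[0, 2]] would raise a KeyError because no rows are labeled 0 or 2, while df.iloc[[0, 2]] returns the first and third rows.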

    import pandas as pd
    df=pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_05.csv',header=None,nrows=5)
    # set custom column names after reading
    columns_name=['mysql_id','hotelname','customername','reviews','aspectflag','review_fenci','review_pos','review_fenci_pos']
    df.columns=columns_name
    print(df.head(3)) # print the first 3 rows
    print(df.iloc[[0,2],[1,2]]) # rows at positions 0 and 2, columns at positions 1 and 2

    3. df[['column1','column2']]: select multiple columns by name

    import pandas as pd
    df=pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_05.csv',header=None,nrows=5)
    # set custom column names after reading
    columns_name=['mysql_id','hotelname','customername','reviews','aspectflag','review_fenci','review_pos','review_fenci_pos']
    df.columns=columns_name
    print(df.head(3)) # print the first 3 rows
    print(df[['customername','reviews']])
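
The double brackets matter: the outer pair is the indexing operator and the inner pair is a Python list of column names. A minimal sketch of the difference, using made-up sample data:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "customername": ["anna", "ben"],
    "reviews": ["good", "noisy"],
    "aspectflag": [1.0, 0.0],
})

as_series = df["reviews"]                 # a single name -> Series
as_frame = df[["reviews"]]                # a list with one name -> one-column DataFrame
subset = df[["reviews", "customername"]]  # columns come back in the order listed

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(subset)
```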

    4. Filter data by a combination of conditions on several columns

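    import pandas as pd
    df=pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_05.csv',header=None,nrows=5)
    # set custom column names after reading (8 names, as in the other examples that read this file)
    columns_name=['mysql_id','hotelname','customername','reviews','aspectflag','review_fenci','review_pos','review_fenci_pos']
    df.columns=columns_name
    print(df.head(5)) # print the first 5 rows
    # wrap each condition in parentheses; & combines them element-wise
    print(df[(df['mysql_id']==201)&(df['aspectflag']==0.0)&(df['review_pos']==3)])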
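
How the combined condition works can be seen more easily on a small in-memory frame. This is a sketch with made-up values; only the column names come from the example above.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "mysql_id": [201, 201, 202, 203],
    "aspectflag": [0.0, 1.0, 0.0, 0.0],
    "review_pos": [3, 3, 3, 1],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (df["mysql_id"] == 201) & (df["aspectflag"] == 0.0) & (df["review_pos"] == 3)
both = df[mask]

# | is element-wise "or", and ~ negates a mask.
either = df[(df["mysql_id"] == 202) | (df["review_pos"] == 1)]
not_201 = df[~(df["mysql_id"] == 201)]

print(both)
print(either)
print(not_201)
```

Python's `and` / `or` keywords do not work here, because they try to reduce a whole Series to a single True or False.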

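    5. Filter rows by a condition on one column (for example, value greater than n; the code below uses == 1.0) and fill missing values in another column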

    import pandas as pd
    df=pd.read_csv('../hotel_csv_split/reviews_split_fenci_pos_1_15256.csv',header=None,nrows=5)
    # set custom column names after reading
    columns_name=['mysql_id','hotelname','customername','reviews','aspectflag','review_fenci','review_pos','review_fenci_pos']
    df.columns=columns_name
    print(df.head(3)) # print the first 3 rows
    df1 = df[df['aspectflag']==1.0].copy()  # keep rows where aspectflag == 1.0; copy() makes df1 independent of df
    df1['review_pos']=df1['review_pos'].fillna('n/adj')  # fill missing review_pos values
    print(df1.head(3))

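    Note:

    df1 = df[df['aspectflag']==1.0].copy()

    The copy() call here avoids a chained-assignment warning. Chained assignment is the combination of chained indexing and assignment.

    A typical example:

    data[data.bidder == 'parakeet2004']['bidderrate'] = 100

    Here data[data.bidder == 'parakeet2004'] filters the rows whose bidder column equals parakeet2004, and ['bidderrate'] then selects the bidderrate column of that filtered result. The assignment may therefore land on a temporary copy rather than on data itself.

    Code written this way triggers a warning:

    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

    Solution: split the statement into two steps and call copy() on the first (filtering) step to create an independent copy. To modify the original DataFrame instead, assign through a single .loc call, as the warning suggests.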
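
Both ways of avoiding the chained assignment can be sketched on a small frame. The bidder and bidderrate column names come from the example above; the sample values are made up.

```python
import pandas as pd

# Hypothetical sample data for illustration.
data = pd.DataFrame({
    "bidder": ["parakeet2004", "other", "parakeet2004"],
    "bidderrate": [1, 2, 3],
})

# Option 1: a single .loc call selects rows and column together,
# so the assignment writes into `data` itself.
data.loc[data["bidder"] == "parakeet2004", "bidderrate"] = 100

# Option 2: take an explicit copy of the filtered rows and modify the copy;
# `data` is left unchanged by this step.
subset = data[data["bidder"] == "other"].copy()
subset["bidderrate"] = 0

print(data)
print(subset)
```

Which option fits depends on intent: use .loc when the original frame should change, and copy() when the filtered rows are meant to be a separate working table.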

    6. Append a row to a DataFrame

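    import pandas as pd
    # create an empty dict
    pos_dict = {}
    # add new key/value pairs to the dict (pos and count are defined elsewhere)
    pos_dict['pos'] = pos
    pos_dict['count'] = count
    # print(pos_dict)
    # append a new row to the DataFrame; DataFrame.append was removed in pandas 2.0, so use pd.concat
    df = pd.concat([df, pd.DataFrame([pos_dict])], ignore_index=True)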
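
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions a row is usually added with pd.concat. A minimal self-contained sketch follows; the starting frame and the values "v" and 7 are made-up stand-ins for pos and count.

```python
import pandas as pd

# Hypothetical starting table.
df = pd.DataFrame({"pos": ["n", "adj"], "count": [12, 5]})

# Hypothetical values standing in for `pos` and `count` in the snippet above.
pos_dict = {"pos": "v", "count": 7}

# Wrap the dict in a one-row DataFrame and concatenate;
# ignore_index=True renumbers the rows 0..n-1.
df = pd.concat([df, pd.DataFrame([pos_dict])], ignore_index=True)

print(df)
```

When many rows are added in a loop, collecting the dicts in a list and building the DataFrame once at the end avoids copying the frame on every iteration.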

    7. Select multiple columns from a DataFrame and insert a column at a given position

    import pandas as pd
    # read the first 10 rows of the CSV file (nrows=10) and save the result to another file
    df=pd.read_csv('../csvfiles/hotelreviews_fenci_pos.csv',header=None,nrows=10)
    columns_name=['mysql_id','hotelname','customername','reviewtime','checktime','reviews','scores','type','room','useful','likenumber','review_split','review_pos','review_split_pos']
    df.columns=columns_name
    # take the specified columns from the DataFrame
    df1=pd.DataFrame(df,columns=['mysql_id','hotelname','customername','reviews','review_split'])
    col_name = df1.columns.tolist()
    # insert a column named keywords right after the reviews column
    col_name.insert(col_name.index('reviews')+1,'keywords')
    df2=df1.reindex(columns=col_name)  # the new keywords column is filled with NaN
    df2.to_csv('../csvfiles/reviews_split_200_keywords.csv', header=False, index=False)
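
As an alternative to rebuilding the column list and calling reindex, pandas also provides DataFrame.insert, which places a new column at a given position in one call. A minimal sketch with made-up sample data (this is not the approach used in the code above, only a comparison):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df1 = pd.DataFrame({
    "mysql_id": [1, 2],
    "reviews": ["good", "noisy"],
    "review_split": ["good", "noisy"],
})

# Position right after the 'reviews' column.
position = df1.columns.get_loc("reviews") + 1

# insert() modifies df1 in place; NaN is used as an empty placeholder,
# which matches what reindex produces for a new column.
df1.insert(position, "keywords", float("nan"))

print(df1)
```

Note that insert changes the frame in place and returns None, whereas reindex returns a new DataFrame.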

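    8. Read specific rows

    pd.read_csv(path, skiprows=number of lines to skip, nrows=number of rows to read)
    For example, to read lines 10 through 20 of a file that has no header row:
    pd.read_csv(path, header=None, skiprows=9, nrows=11) skips the first 9 lines and reads the next 11 (nrows=10 would stop at line 19)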
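
The row arithmetic for skiprows and nrows is easy to get off by one, so here is a minimal sketch that checks it against an in-memory CSV. The 30-line sample text is made up for illustration. With header=None, skipping 9 lines and reading 11 returns lines 10 through 20.

```python
import io
import pandas as pd

# Hypothetical 30-line CSV without a header row: "1,review 1" ... "30,review 30".
csv_text = "\n".join(f"{i},review {i}" for i in range(1, 31))

# header=None treats the first line read as data rather than as column names.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["mysql_id", "reviews"],
    skiprows=9,   # skip lines 1-9
    nrows=11,     # read lines 10-20
)

print(df["mysql_id"].tolist())
```

Without header=None (and without names), the first line left after skipping would be consumed as the header row, and the data rows would start one line later.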

    import pandas as pd

    def dev_csv():
        # skip the first 10256 lines, then read the next 2683 lines
        df = pd.read_csv('../aspect_ner_csv_files/sentence_15000.csv', header=None,nrows=2683,skiprows=10256)
        columns_name = ['mysql_id', 'reviews']
        df.columns = columns_name
        review_csv_count_path = '../aspect_ner_csv_files/sentence_dev.csv'
        df.to_csv(review_csv_count_path, header=False,
                      index=False)  # header=False: do not write the column names to the CSV

    References: https://blog.csdn.net/destiny_python/article/details/78675036

    https://blog.csdn.net/weixin_42575020/article/details/98846427

  • Original article: https://www.cnblogs.com/luckyplj/p/13274662.html