• 用python进行数据分析--引言


    前言

    这是用学习《用python进行数据分析》的连载。这篇博客记录的是学习第二章引言部分的内容

    内容

    一、分析usa.org的数据

    (1)载入数据

    import json
    if __name__ == "__main__":
    # load data
    path = "../../datasets/bitly_usagov/example.txt"
    with open(path) as data:
    records = [json.loads(line) for line in data]
     

    (2)计算数据中不同时区的数量

    1、复杂的方式

    # define the count function
    def get_counts(sequence):
        counts = {}
        for x in sequence:
            if x in counts:
                counts[x] += 1
            else:
                counts[x] = 1
        return counts

    2.简单的方式

    
    
    from collections import defaultdict

    #
    to simple the count function def simple_get_counts(sequence): # initialize the all value to zero counts = defaultdict(int) for x in sequence: counts[x] += 1 return counts

    简单之所以简单是因为它引用了defaultdicta函数,初始化了字典中的元素,将其值全部初始化为0,就省去了判断其值是否在字典中出现的过程

    (3)获取出现次数最高的10个时区名和它们出现的次数

    1.复杂方式

    # get the top 10 timezone which value is biggest
    def top_counts(count_dict, n=10):
        value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
        # this sort method is asc
        value_key_pairs.sort()
        return value_key_pairs[-n:]
    # get top counts by get_count function
    counts = simple_get_counts(time_zones)
    top_counts = top_counts(counts)

    2.简单方式

    from collections import Counter
    
    # simple the top_counts method
    def simple_top_counts(timezone, n=10):
        counter_counts = Counter(timezone)
        return counter_counts.most_common(n)
    simple_top_counts(time_zones)

    简单方式之所以简单,是因为他利用了collections中自带的counter函数,它能够自动的计算不同值出现的次数,并且不需要先计算counts

    (4)利用pylab和dataframe画出不同timezone的出现次数,以柱状图的形式。

    from pandas import DataFrame, Series
    import pandas as pd
    import numpy as np
    import pylab as pyl
    
    # use the dataframe to show the counts of timezone
    def show_timezone_data(records):
        frame = DataFrame(records)
        clean_tz = frame['tz'].fillna("Missing")
        clean_tz[clean_tz == ''] = 'Unknown'
        tz_counts = clean_tz.value_counts()
        tz_counts[:10].plot("barh", rot=0)
        pyl.show()

    以下是结果图:

     

    我刚开始的时候,能够执行代码,但是一直显示不出这个图。后来我查阅了资料发现是因为,我用的是pycharm,它是一个编辑器,不处于pylab模式,所以你需要先引入pylab模块,然后执行 pylab.show()指令才能够使这个图出现

    (5)利用pylab和dataframe画出不同的timezone的window的分布情况

    # use the dataframe to show the character of timezone and windows
    def show_timezone_winows(records):
        frame = DataFrame(records)
        # results = Series([x.split()[0] for x in frame.a.dropna()])
        cframe = frame[frame.a.notnull()]
        operatine_system = np.where(cframe['a'].str.contains("Windows"), "Windows", "Not Windows")
        by_tz_os = cframe.groupby(["tz", operatine_system])
        agg_counts = by_tz_os.size().unstack().fillna(0)
        indexer = agg_counts.sum(1).argsort()
        count_subset = agg_counts.take(indexer)[-10:]
        count_subset.plot(kind='barh', stacked=True)
        pyl.show()
        # let the sum to be 1
        normed_subset = count_subset.div(count_subset.sum(1), axis=0)
        normed_subset.plot(kind='barh', stacked=True)
        pyl.show()

    没有归一化前的结果图:

    归一化后的结果图:

    二、分析movielens的数据

    (1)读取数据

    users:记录了用户信息, ratings:记录了用户的评分信息, movies:记录了电影信息

    import pandas as pd
    
    # read the three tables
    unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
    users = pd.read_table("../../datasets/movielens/users.dat", sep='::', header=None, names=unames)
    
    rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table("../../datasets/movielens/ratings.dat", sep='::', header=None, names=rnames)
    
    mnames = ['movie_id', 'title', 'genres']
    movies = pd.read_table('../../datasets/movielens/movies.dat', sep="::", header=None, names=mnames)

    (2)合并数据

    数据的合并用到了pandas中merge函数,它能够直接根据数据对象中列名直接推断出哪些列是外键

      # merge data
        return pd.merge(pd.merge(users, ratings), movies)

    (3) 查询同一电影不同性别的评分

    def mean_score(data):
        '''
        print the mean rating of the same title between genders
        :param data:the data source
        :return:
        '''
        mean_ratings = data.pivot_table(values='rating', index=['title'], columns='gender', aggfunc='mean')
        rating_by_title = data.groupby('title').size()
        active_titles = rating_by_title.index[rating_by_title >= 250]
        mean_ratings = mean_ratings.ix[active_titles]
        print(mean_ratings)

    这里要特别注意的是,现在的dataframe中的pivot_table中已经没有了rows和cols这两项,并且在定义聚集值是需要指定。

    #以前的代码
       mean_ratings =  data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
    #现在的代码
        mean_ratings = data.pivot_table(values='rating', index=['title'], columns='gender', aggfunc='mean')
       

    (4)计算评分分歧

    def diff_score(mean_ratings, data):
        '''
        calculate the diff between different group like male and female
        :param data:
        :return:
        '''
        mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
        # sorted the table by diff ,asc
        sorted_by_diff = mean_ratings.sort_index(by='diff')
        print("sorted by diff between male and female")
        print(sorted_by_diff[:10])
        # calculate the most diff movies by ignoring gender
        rating_std_by_title = data.groupby('title')['rating'].std()
        rating_std_by_title.sort_values(ascending=False)
        print("sorted by diff in rating")
        print(rating_std_by_title[:10])

    这里要注意的是,现在的series中已经没有order这个函数了,取而代之的是sort_values和sort_index,不过作用是一样的。从以下的代码可以看到区别。

    #以前的代码
    rating_std_by_title.order(ascending=False)
    #现在的代码
    rating_std_by_title.sort_values(ascending=False)

    三、分析婴儿姓名

    (1)读取数据

    def contact_data():
        '''
        contact all name in different year into one table
        :return:
        '''
        years = range(1880, 2011)
        pieces = []
        columns = ['name', 'sex', 'births']
        for year in years:
            path = "../../datasets/babynames/yob%d.txt" % year
            frame = pd.read_csv(path, names=columns)
            frame['year'] = year
            pieces.append(frame)
        data = pd.concat(pieces,ignore_index=True)
        return data

    (2)画出每年不同姓名的婴儿的出生数量

    def gather_function(data):
        total_births = data.pivot_table(values='births', index=['year'], columns='sex', aggfunc=sum)
        total_births.tail()
        total_births.plot(title='total births by sex and year')
        pyl.show()

    结果图:

    (3)获取命名次数前1000的名字

    def get_top1000(group):
        return group.sort_index(by='births', ascending=False)[:1000]
    
    def prop_function(data):
        names = data.groupby(["year", "sex"]).apply(add_prop)
        # print(np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1))
        grouped = names.groupby(['year', 'sex'])
        top1000 = grouped.apply(get_top1000)
        return top1000

    这里比较特别的是用了apply进行groupby函数的group方法的重写。

    (4)分析命名趋势

    def name_trend(data, top1000):
        boys = top1000[top1000.sex == 'M']
        girls = top1000[top1000.sex == 'F']
        total_births = top1000.pivot_table(values='births', index=['year'], columns='name', aggfunc=sum)
        subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
        subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")
        pyl.show()

    分析了john,harry,mary和marilyn这四个名字在不同年份的命名趋势。

     (4)名字多样性的增长

    有两种方式可以进行辅证

    1.计算排名前1000的名字在所有名字中的占比。

    def variety_growth(top1000):
        table = top1000.pivot_table(values='prop', index=['year'], columns='sex', aggfunc=sum)
        table.plot(title="sum of top1000.prop by year and sex", yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
        pyl.show()
        

    可以得到下图:

    2.计算不同性别,不同年份,在top1000中需要多少个名字才能达到占比0.5

     def get_quantile_count(group, q=0.5):
        group = group.sort_index(by='prop', ascending=False)
        return group.prop.cumsum().searchsorted(q)+1
    
    def variety_growth(top1000):
        '''
        show the plot of sum of top1000.prop by year and sex
        and the plot of number of the popular names in top 50%
        :param top1000:
        :return:
        '''
        table = top1000.pivot_table(values='prop', index=['year'], columns='sex', aggfunc=sum)
        table.plot(title="sum of top1000.prop by year and sex", yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
        pyl.show()
        diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
        diversity = diversity.unstack('sex')
        print(diversity.head())

    结果如下:

    (4)最后一个字母的变革

    查看不同年份,不同性别最后一个字母的占比。

    def last_name(data):
        '''
        the ratio of the character in last name by year and sex
        :param data: 
        :return: 
        '''
        get_last_letter = lambda x: x[-1]
        last_letters = data.name.map(get_last_letter)
        data['last_letters'] = last_letters
        table = data.pivot_table(values='births', index=['last_letters'], columns=['sex', 'year'], aggfunc=sum)
        subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
        letter_prop = subtable/subtable.sum().astype(float)
        fig, axes = plt.subplots(2, 1, figsize=(10, 8))
        letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
        letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
        pyl.show()
        letter_prop = table/table.sum().astype(float)
        dny_ts = letter_prop.ix[['d', 'n', 'y'], 'M'].T
        dny_ts.plot()
        pyl.show()

    它的结果图如下:

    在写这个代码的时候要注意,要加上一行 data['last_letters'] = last_letters,用来将last_letters这个列并入到data中,否则按照书上的代码执行会报错。

    (5)变成女孩名字的男孩名字(以及相反情况)

    def name_change_sex(top1000):
        allnames = top1000.name.unique()
        mask = np.array(['lesl' in i.lower() for i in allnames])
        lesley_like = allnames[mask]
        # filter by lesley_like
        filtered = top1000[top1000.name.isin(lesley_like)]
        table = filtered.pivot_table(values='births', index=['year'], columns='sex', aggfunc=sum)
        table = table.div(table.sum(1), axis=0)
        table.plot(style={'M': 'k-', 'F': 'k--'})
        pyl.show()

    结果图:

  • 相关阅读:
    Linux内核链表——看这一篇文章就够了
    2的幂和按位与&——效率
    fgets注意事项
    GDB TUI
    GDB TUI
    linux shell命令行选项与参数用法详解
    What are the differences between Perl, Python, AWK and sed
    What is the difference between sed and awk
    /proc/sysrq-trigger
    C++ Sqlite3的基本使用
  • 原文地址:https://www.cnblogs.com/whatyouknow123/p/9118174.html
Copyright © 2020-2023  润新知