From https://github.com/guipsamora/pandas_exercises
Ex2 - Getting and Knowing your Data
This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.
Step 1. Import the necessary libraries
import pandas as pd
import numpy as np
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called chipo.
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep='\t')  # the file is tab-separated (.tsv)
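A minimal offline sketch of the same read, assuming a tab-separated layout like the one the file extension suggests. The three-row `sample_tsv` string is hypothetical illustration data, not the real file; it only shows that `sep="\t"` splits the fields into five columns and that an empty field is read as NaN.

```python
import io

import pandas as pd

# Hypothetical three-row sample laid out like a tab-separated file
sample_tsv = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tChips and Fresh Tomato Salsa\t\t$2.39\n"
    "1\t1\tIzze\t[Clementine]\t$3.39\n"
    "2\t2\tChicken Bowl\t[Tomatillo-Red Chili Salsa (Hot)]\t$16.98\n"
)

# sep="\t" tells read_csv to split fields on tab characters
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # column names taken from the header row
```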
Step 4. See the first 10 entries
# Solution 1
chipo[:10]
  | order_id | quantity | item_name                             | choice_description                                | item_price
0 | 1        | 1        | Chips and Fresh Tomato Salsa          | NaN                                               | $2.39
1 | 1        | 1        | Izze                                  | [Clementine]                                      | $3.39
2 | 1        | 1        | Nantucket Nectar                      | [Apple]                                           | $3.39
3 | 1        | 1        | Chips and Tomatillo-Green Chili Salsa | NaN                                               | $2.39
4 | 2        | 2        | Chicken Bowl                          | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98
5 | 3        | 1        | Chicken Bowl                          | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98
6 | 3        | 1        | Side of Chips                         | NaN                                               | $1.69
7 | 4        | 1        | Steak Burrito                         | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75
8 | 4        | 1        | Steak Soft Tacos                      | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25
9 | 5        | 1        | Steak Burrito                         | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25
# Solution 2
chipo.head(10)
  | order_id | quantity | item_name                             | choice_description                                | item_price
0 | 1        | 1        | Chips and Fresh Tomato Salsa          | NaN                                               | $2.39
1 | 1        | 1        | Izze                                  | [Clementine]                                      | $3.39
2 | 1        | 1        | Nantucket Nectar                      | [Apple]                                           | $3.39
3 | 1        | 1        | Chips and Tomatillo-Green Chili Salsa | NaN                                               | $2.39
4 | 2        | 2        | Chicken Bowl                          | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98
5 | 3        | 1        | Chicken Bowl                          | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98
6 | 3        | 1        | Side of Chips                         | NaN                                               | $1.69
7 | 4        | 1        | Steak Burrito                         | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75
8 | 4        | 1        | Steak Soft Tacos                      | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25
9 | 5        | 1        | Steak Burrito                         | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25
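Both solutions return the same rows. A small sketch on a hypothetical 15-row frame (the name `toy` and its values are made up for illustration) checks that plain slicing, `head`, and positional `iloc` agree:

```python
import pandas as pd

# Hypothetical frame with more than 10 rows
toy = pd.DataFrame({"order_id": range(1, 16), "quantity": [1] * 15})

first_by_slice = toy[:10]      # row slicing with [] is positional
first_by_head = toy.head(10)   # explicit "first n rows"
first_by_iloc = toy.iloc[:10]  # positional indexer

print(first_by_slice.equals(first_by_head))
print(first_by_head.equals(first_by_iloc))
```

`head(10)` is arguably the most readable of the three, since it states the intent directly.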
Step 5. What is the number of observations in the dataset?
type(chipo)
pandas.core.frame.DataFrame
# Solution 1
len(chipo.index)
4622
# Solution 2
chipo.shape[0]
4622
# Solution 3
chipo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
order_id 4622 non-null int64
quantity 4622 non-null int64
item_name 4622 non-null object
choice_description 3376 non-null object
item_price 4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB
Step 6. What is the number of columns in the dataset?
# Solution 1
len(chipo.columns)
5
# Solution 2
chipo.shape[1]
5
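Row and column counts can also be read in one go by unpacking `shape`. This is a sketch on a hypothetical three-row frame named `toy`, not the Chipotle data:

```python
import pandas as pd

# Hypothetical frame: 3 observations, 3 columns
toy = pd.DataFrame({
    "order_id": [1, 1, 2],
    "quantity": [1, 2, 1],
    "item_name": ["Izze", "Chicken Bowl", "Side of Chips"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# The other solutions give the same numbers
print(len(toy.index) == n_rows, len(toy.columns) == n_cols)
```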
Step 7. Print the name of all the columns.
list(chipo.columns)
['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']
Step 8. How is the dataset indexed?
chipo.index
RangeIndex(start=0, stop=4622, step=1)
Step 9. Which was the most-ordered item?
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'],ascending=False)
c['quantity'].head(1)
item_name
Chicken Bowl 761
Name: quantity, dtype: int64
Step 10. For the most-ordered item, how many items were ordered?
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'],ascending=False)
c['quantity'].head(1)
item_name
Chicken Bowl 761
Name: quantity, dtype: int64
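Steps 9 and 10 can also be answered by summing only the `quantity` column, which avoids aggregating the text columns. The sketch below uses a hypothetical four-row frame (`toy`); `idxmax` returns the item name and `max` returns its total.

```python
import pandas as pd

# Hypothetical order lines
toy = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Izze", "Chicken Bowl", "Side of Chips"],
    "quantity": [2, 1, 1, 1],
})

# Total quantity per item, computed on the quantity column only
totals = toy.groupby("item_name")["quantity"].sum()

top_item = totals.idxmax()   # label of the largest total (Step 9)
top_quantity = totals.max()  # the largest total itself (Step 10)
print(top_item, top_quantity)
```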
Step 11. What was the most ordered item in the choice_description column?
c = chipo.groupby('choice_description')
c = c.sum()
c = c.sort_values(['quantity'],ascending=False)
c.head(1)
choice_description | order_id | quantity
[Diet Coke]        | 123455   | 159
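Note that `groupby` drops rows whose key is NaN by default, so items without a choice description do not compete here. A sketch with hypothetical data (`toy`) makes this visible: the NaN row has the largest quantity but is excluded from the result.

```python
import numpy as np
import pandas as pd

# Hypothetical order lines; one row has no choice description
toy = pd.DataFrame({
    "choice_description": ["[Diet Coke]", np.nan, "[Diet Coke]", "[Sprite]"],
    "quantity": [1, 5, 2, 1],
})

# NaN keys are excluded from the groups by default (dropna=True)
by_choice = toy.groupby("choice_description")["quantity"].sum()

# nlargest(1) keeps the single largest total together with its label
top_choice = by_choice.nlargest(1)
print(top_choice)
```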
Step 12. How many items were ordered in total?
chipo['quantity'].sum()
4972
Step 13. Turn the item price into a float
Step 13.a. Check the item price type
chipo['item_price'].dtypes
dtype('O')
Step 13.b. Create a lambda function and change the type of item price
# Strip the dollar sign, then convert the remaining text to float
chipo['item_price'] = chipo['item_price'].apply(lambda x: x.replace('$', '')).astype(np.float64)
# dollarizer = lambda x:float(x[1:-1])
# chipo.item_price = chipo.item_price.apply(dollarizer)
Step 13.c. Check the item price type
chipo['item_price'].dtypes
dtype('float64')
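As an alternative to a lambda, the same conversion can be sketched with the vectorized `.str` accessor. The `prices` Series below is hypothetical; the trailing space on one value is an assumption added to show why `str.strip` can be useful.

```python
import pandas as pd

# Hypothetical price strings; one has trailing whitespace
prices = pd.Series(["$2.39", "$16.98 ", "$1.69"])

as_float = (
    prices.str.replace("$", "", regex=False)  # literal "$", not a regex anchor
          .str.strip()                        # drop surrounding whitespace
          .astype(float)
)
print(as_float.dtype)
print(as_float.tolist())
```

Passing `regex=False` matters because `$` has a special meaning in regular expressions.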
Step 14. How much was the revenue for the period in the dataset?
(chipo['quantity']*chipo['item_price']).sum()
39237.02
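The calculation multiplies the two columns row by row and then sums the products. The sketch below repeats it on three hypothetical rows (`toy`). Like the exercise, it assumes `item_price` is a per-unit price; if the column already holds the line total, the multiplication would overstate revenue.

```python
import pandas as pd

# Hypothetical order lines
toy = pd.DataFrame({
    "quantity": [1, 2, 1],
    "item_price": [2.39, 16.98, 1.69],
})

# Element-wise product gives revenue per line; sum gives the total
line_revenue = toy["quantity"] * toy["item_price"]
total_revenue = line_revenue.sum()
print(round(total_revenue, 2))
```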
Step 15. How many orders were made in the period?
# Solution 1
g = chipo.groupby(['order_id'])
g.ngroups
1834
# Solution 2
orders = chipo.order_id.value_counts().count()
orders
1834
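A third option is `nunique`, which counts distinct values directly. The sketch below uses a hypothetical `order_id` column and checks that all three approaches agree:

```python
import pandas as pd

# Hypothetical order lines: three distinct orders across six rows
toy = pd.DataFrame({"order_id": [1, 1, 2, 3, 3, 3]})

n_unique = toy["order_id"].nunique()               # distinct order ids
n_groups = toy.groupby("order_id").ngroups         # Solution 1
n_counts = toy["order_id"].value_counts().count()  # Solution 2
print(n_unique, n_groups, n_counts)
```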
Step 16. What is the average revenue amount per order?
# Solution 1
chipo['revenue'] = chipo['quantity']*chipo['item_price']
# Sum only the revenue column so the text columns are not aggregated
order_grouped = chipo.groupby(by=['order_id'])['revenue'].sum()
order_grouped.mean()
21.394231188658654
# Solution 2
chipo.groupby(by=['order_id'])['revenue'].sum().mean()
21.394231188658654
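The average is taken over orders, not over rows: revenue is first summed within each `order_id`, and the per-order totals are then averaged. A worked sketch on hypothetical data (`toy`) with two orders:

```python
import pandas as pd

# Hypothetical order lines: order 1 has two lines, order 2 has one
toy = pd.DataFrame({
    "order_id": [1, 1, 2],
    "quantity": [1, 2, 1],
    "item_price": [2.0, 1.0, 6.0],
})
toy["revenue"] = toy["quantity"] * toy["item_price"]

# Per-order totals: order 1 -> 2.0 + 2.0 = 4.0, order 2 -> 6.0
per_order = toy.groupby("order_id")["revenue"].sum()

# Mean of the per-order totals: (4.0 + 6.0) / 2 = 5.0
average_per_order = per_order.mean()
print(average_per_order)
```

Averaging the `revenue` column directly would instead give the mean per order line, which is a different quantity.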
Step 17. How many different items are sold?
chipo.item_name.value_counts().count()
50