17: Pandas (the Basics) and Titanic Analysis
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
1. Load your data into a Pandas DataFrame
The most common options:
read_csv
read_excel
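As a minimal sketch of how these loaders behave, the snippet below reads a small in-memory CSV with `pd.read_csv`. The three-row sample and its contents are made up for illustration and are not the Titanic file. `pd.read_excel` is called with a path in the same way, but it needs an Excel engine installed (for example `xlrd` for `.xls` files or `openpyxl` for `.xlsx`).

```python
import io
import pandas as pd

# Hypothetical sample standing in for a CSV file on disk.
csv_text = "pclass,survived,age\n1,1,29.0\n3,0,\n2,1,40.5\n"

# read_csv accepts a path or any file-like object; empty fields become NaN.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```

The missing age in the second row is loaded as `NaN`, which is why the `age` column comes back as a float column.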
data = pd.read_excel('../data/titanic3 (3).xls')
data.shape
data.head()
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
data.describe()
| | pclass | survived | age | sibsp | parch | fare | body |
|---|---|---|---|---|---|---|---|
| count | 1309.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 | 121.000000 |
| mean | 2.294882 | 0.381971 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
| std | 0.837836 | 0.486055 | 14.413500 | 1.041658 | 0.865560 | 51.758668 | 97.696922 |
| min | 1.000000 | 0.000000 | 0.166700 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 | 72.000000 |
| 50% | 3.000000 | 0.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 155.000000 |
| 75% | 3.000000 | 1.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 | 256.000000 |
| max | 3.000000 | 1.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 | 328.000000 |
2. Clean up your dataset with drop(), dropna() and fillna()
data = data.drop(['name', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'], axis=1)
data = data.dropna(axis=0)
data.shape
(1046, 4)
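The section title mentions `fillna()`, but the cells above only use `drop()` and `dropna()`. As a hedged sketch of the difference, the toy frame below (hypothetical values, not the Titanic data) compares dropping incomplete rows with filling the gap, here with the column median as one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing age.
toy = pd.DataFrame({"age": [22.0, np.nan, 30.0]})

# dropna removes the incomplete row entirely...
dropped = toy.dropna(axis=0)

# ...while fillna keeps the row and substitutes a value (the median here).
filled = toy.fillna({"age": toy["age"].median()})

print(len(dropped), len(filled))
```

Dropping is the simpler option and is what this notebook does. Filling keeps more rows, but it adds values that were never observed.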
data['age'].hist()
<Axes: >
3. groupby() and value_counts()
data.groupby(['sex']).mean()
| sex | pclass | survived | age |
|---|---|---|---|
| female | 2.048969 | 0.752577 | 28.687071 |
| male | 2.300912 | 0.205167 | 30.585233 |
data.groupby(['sex', 'pclass']).mean()
| sex | pclass | survived | age |
|---|---|---|---|
| female | 1 | 0.962406 | 37.037594 |
| female | 2 | 0.893204 | 27.499191 |
| female | 3 | 0.473684 | 22.185307 |
| male | 1 | 0.350993 | 41.029250 |
| male | 2 | 0.145570 | 30.815401 |
| male | 3 | 0.169054 | 25.962273 |
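Because `survived` is coded as 0 or 1, its group mean can be read directly as a survival rate. The sketch below shows this on a hypothetical four-row frame rather than the Titanic data.

```python
import pandas as pd

# Hypothetical toy data: one of two women survived, neither man did.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "survived": [1, 0, 0, 0],
})

# The mean of a 0/1 column within each group is the share of ones.
rates = toy.groupby("sex")["survived"].mean()

print(rates)
```

Selecting the column before calling `mean()` limits the aggregation to the column of interest, which is useful when the frame still holds non-numeric columns.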
data['pclass'].value_counts()
3 501
1 284
2 261
Name: pclass, dtype: int64
data[data['age'] < 18]['pclass'].value_counts()
3 106
2 33
1 15
Name: pclass, dtype: int64
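The last cell combines a boolean mask with `value_counts()`. A minimal sketch on hypothetical toy data splits that into steps, and also shows the `normalize=True` option, which returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic passengers.
toy = pd.DataFrame({"age": [10, 16, 35, 50], "pclass": [3, 3, 1, 2]})

# The comparison yields a boolean Series, one True/False per row.
is_minor = toy["age"] < 18

# Indexing with the mask keeps only the rows where it is True.
minors = toy[is_minor]

# normalize=True returns each class's share of the filtered rows.
shares = minors["pclass"].value_counts(normalize=True)

print(shares)
```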
4. Exercise
Create age categories with the pandas map() function
Create gender categories with cat.codes
Solution
def category_ages(age):
    """Map a numeric age to a labelled age bracket."""
    if age <= 20:
        return '<20 years'
    elif 20 < age <= 30:
        return '20-30 years'
    elif 30 < age <= 40:
        return '30-40 years'
    else:
        return '40+ years'
data['age'] = data['age'].map(category_ages)
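The exercise asks for `map()`, which calls a Python function once per row. As an alternative sketch (a different technique from the solution above), `pd.cut` bins a numeric column in one vectorised call. The bin edges mirror the thresholds used in the solution; the sample ages and the label strings are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages covering each bracket and the edge values.
ages = pd.Series([5, 20, 25, 40, 41])

# Intervals are right-closed by default: (0, 20], (20, 30], (30, 40], (40, inf].
brackets = pd.cut(
    ages,
    bins=[0, 20, 30, 40, np.inf],
    labels=["<=20", "20-30", "30-40", ">40"],
)

print(brackets.tolist())
```

The result is a categorical column, so the brackets keep their natural order when sorted or grouped.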
data['sex'].astype('category').cat.codes
0 0
1 1
2 0
3 1
4 0
..
1301 1
1304 0
1306 1
1307 1
1308 1
Length: 1046, dtype: int8
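To see which integer stands for which label, the categories can be inspected next to the codes. The sketch below uses a hypothetical three-value Series. For string data, pandas sorts the categories, so `'female'` is assigned code 0 and `'male'` code 1, matching the output above.

```python
import pandas as pd

# Hypothetical toy column with the same two labels as data['sex'].
sex = pd.Series(["female", "male", "female"])
as_cat = sex.astype("category")

# The position of each label in .cat.categories is its integer code.
categories = list(as_cat.cat.categories)
codes = as_cat.cat.codes.tolist()

print(categories)
print(codes)
```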