
17: Pandas (the Basics) and Titanic Analysis

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

1. Load your data into a Pandas DataFrame

Data source: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

The most common loading functions:

  • read_csv

  • read_excel
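As a minimal sketch of the first option, read_csv accepts a path, a URL, or any file-like object. The inline CSV text below is a made-up stand-in for a real file (its values are illustrative only), so the snippet runs without the Excel file. read_excel is called the same way, but it additionally needs an Excel engine installed (for example xlrd for .xls files).

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a file on disk (illustrative values)
csv_text = """pclass,survived,sex,age
1,1,female,29.0
1,1,male,0.9167
1,0,female,
"""

# read_csv parses the header row and infers a dtype for each column;
# the empty age field in the last row becomes NaN
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```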

data = pd.read_excel('../data/titanic3 (3).xls')
data.shape
data.head()
   pclass  survived  name                                             sex         age  sibsp  parch  ticket      fare  cabin    embarked  boat   body  home.dest
0       1         1  Allen, Miss. Elisabeth Walton                    female  29.0000      0      0   24160  211.3375  B5       S            2    NaN  St Louis, MO
1       1         1  Allison, Master. Hudson Trevor                   male     0.9167      1      2  113781  151.5500  C22 C26  S           11    NaN  Montreal, PQ / Chesterville, ON
2       1         0  Allison, Miss. Helen Loraine                     female   2.0000      1      2  113781  151.5500  C22 C26  S          NaN    NaN  Montreal, PQ / Chesterville, ON
3       1         0  Allison, Mr. Hudson Joshua Creighton             male    30.0000      1      2  113781  151.5500  C22 C26  S          NaN  135.0  Montreal, PQ / Chesterville, ON
4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0000      1      2  113781  151.5500  C22 C26  S          NaN    NaN  Montreal, PQ / Chesterville, ON
data.describe()
            pclass     survived          age        sibsp        parch         fare        body
count  1309.000000  1309.000000  1046.000000  1309.000000  1309.000000  1308.000000  121.000000
mean      2.294882     0.381971    29.881135     0.498854     0.385027    33.295479  160.809917
std       0.837836     0.486055    14.413500     1.041658     0.865560    51.758668   97.696922
min       1.000000     0.000000     0.166700     0.000000     0.000000     0.000000    1.000000
25%       2.000000     0.000000    21.000000     0.000000     0.000000     7.895800   72.000000
50%       3.000000     0.000000    28.000000     0.000000     0.000000    14.454200  155.000000
75%       3.000000     1.000000    39.000000     1.000000     0.000000    31.275000  256.000000
max       3.000000     1.000000    80.000000     8.000000     9.000000   512.329200  328.000000

2. Clean up your dataset with drop(), dropna() and fillna()

data = data.drop(['name', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'], axis=1)
data = data.dropna(axis=0)
data.shape
(1046, 4)
data['age'].hist()
[Figure: histogram of the age column]
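The section title mentions fillna(), which the cells above do not use. As a hedged sketch of that alternative, missing ages can be replaced with a summary value instead of dropping the rows. The small frame below is made-up data, and filling with the median is just one possible choice, not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; two of the four ages are missing
toy = pd.DataFrame({
    'pclass': [1, 2, 3, 3],
    'age': [29.0, np.nan, 2.0, np.nan],
})

# Option A: drop every row that contains at least one NaN
dropped = toy.dropna(axis=0)

# Option B: keep all rows and fill missing ages with the median of the known ages
filled = toy.fillna({'age': toy['age'].median()})

print(dropped.shape)
print(filled)
```

Dropping keeps only fully observed rows, while filling preserves the sample size at the cost of injecting an estimated value.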

3. groupby() and value_counts()

data.groupby(['sex']).mean()
          pclass  survived        age
sex
female  2.048969  0.752577  28.687071
male    2.300912  0.205167  30.585233
data.groupby(['sex', 'pclass']).mean()
               survived        age
sex    pclass
female 1       0.962406  37.037594
       2       0.893204  27.499191
       3       0.473684  22.185307
male   1       0.350993  41.029250
       2       0.145570  30.815401
       3       0.169054  25.962273
data['pclass'].value_counts()
3    501
1    284
2    261
Name: pclass, dtype: int64
data[data['age'] < 18]['pclass'].value_counts()
3    106
2     33
1     15
Name: pclass, dtype: int64
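To make the mechanics of these two calls concrete, here is a small sketch on made-up rows (not the Titanic data). It selects the column to aggregate before calling mean(), which sidesteps problems with non-numeric columns (in pandas 2.x, mean() no longer silently drops them), and it passes normalize=True to value_counts() to get proportions instead of raw counts.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male'],
    'pclass': [1, 3, 1, 3, 3],
    'survived': [1, 0, 0, 0, 1],
})

# Split rows by sex, then average the 0/1 survived flag within each group;
# the mean of a 0/1 column is the survival rate
rate_by_sex = toy.groupby('sex')['survived'].mean()

# Share of rows in each class rather than the raw count
class_share = toy['pclass'].value_counts(normalize=True)

print(rate_by_sex)
print(class_share)
```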

4. Exercise

  • Create age categories with the pandas map() function

  • Encode the sex column as numeric category codes with cat.codes

Solution

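def category_ages(age):
    """Map a numeric age to a coarse age-group label."""
    if age <= 20:
        return '<=20 years'
    elif age <= 30:
        return '20-30 years'
    elif age <= 40:
        return '30-40 years'
    else:
        # Ages above 40 (NaN would also land here, but dropna() removed those rows)
        return '>40 years'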
data['age'] = data['age'].map(category_ages)
data['sex'].astype('category').cat.codes
0       0
1       1
2       0
3       1
4       0
       ..
1301    1
1304    0
1306    1
1307    1
1308    1
Length: 1046, dtype: int8
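As an alternative sketch for the first exercise, pd.cut bins a numeric column in one vectorized call and returns a categorical, so cat.codes works on the result directly. The bin edges mirror the thresholds used in the solution; the sample ages and the labels are assumptions for illustration. The last lines show one way to recover which code stands for which category.

```python
import pandas as pd

# Made-up ages for illustration
ages = pd.Series([5.0, 20.0, 25.0, 40.0, 63.0])

# Intervals are right-closed by default: (0, 20], (20, 30], (30, 40], (40, inf)
# Note: an age of exactly 0 would fall outside the first bin and become NaN
bins = [0, 20, 30, 40, float('inf')]
labels = ['<=20', '20-30', '30-40', '>40']
groups = pd.cut(ages, bins=bins, labels=labels)

# Codes follow the order of the labels: 0 for '<=20', 1 for '20-30', ...
codes = groups.cat.codes

# For an unordered string column, categories are sorted alphabetically,
# so 'female' gets code 0 and 'male' gets code 1
sex = pd.Series(['female', 'male', 'female']).astype('category')
code_to_label = dict(enumerate(sex.cat.categories))

print(groups.tolist())
print(codes.tolist())
print(code_to_label)
```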