How to score 0.8134 in Titanic Kaggle Challenge

Posted on Wed 10 August 2016 in Kaggle


The Titanic challenge on Kaggle is a competition in which the goal is to predict whether a given passenger survived or died, based on a set of variables describing them such as their age, their sex, or their passenger class on the boat.

I have been playing with the Titanic dataset for a while, and I have recently achieved an accuracy score of 0.8134 on the public leaderboard. As I'm writing this post, I am ranked among the top 7% of all Kagglers.


This post is the opportunity to share my solution with you.

To make this tutorial more "academic" so that everyone can benefit, I will first start with an exploratory data analysis (EDA), then follow with feature engineering, and finally present the predictive model I set up.

Throughout this Jupyter notebook, I will be using Python at each level of the pipeline.

The main libraries involved in this tutorial are:

  • Pandas for data manipulation
  • Matplotlib and Seaborn for data visualization
  • NumPy for multidimensional array computing
  • scikit-learn (sklearn) for machine learning and predictive modeling

Installation procedure

A very easy way to install these packages is to download and install the Anaconda distribution, which bundles them all. This distribution is available on all platforms (Windows, Linux and Mac OS X).

Nota Bene

This is my first attempt as a blogger and as a machine learning practitioner as well.

If you have a question about the code or the hypotheses I made, do not hesitate to post a comment in the comment section below.

Hope you've got everything set on your computer. Let's get started.

I - Exploratory data analysis

In this section, we'll be doing four things.

  • Data extraction: we'll load the dataset and have a first look at it.
  • Cleaning: we'll fill in missing values.
  • Plotting: we'll create some interesting charts that'll (hopefully) spot correlations and hidden insights in the data.
  • Assumptions: we'll formulate hypotheses from the charts.

We tweak the style of this notebook a little bit to have centered plots.

In [1]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")
Out[1]:

We import the useful libraries.

In [2]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
import pandas as pd
pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np

pd.options.display.max_rows = 100

Two datasets are available: a training set and a test set. We'll be using the training set to build our predictive model and the test set to score it and generate an output file to submit to the Kaggle evaluation system.

We'll see how this procedure is done at the end of this post.

Now let's start by loading the training set.

In [3]:
data = pd.read_csv('../../data/train.csv')
In [4]:
data.shape
Out[4]:
(891, 12)

Pandas allows you to have a sneak peek at your data.

In [5]:
data.head()
Out[5]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The Survived column is the target variable. If Survived = 1, the passenger survived; otherwise they died.
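Before moving on, it's worth checking how balanced the target is. A quick sketch (in the training set, 342 passengers survived and 549 died):

data['Survived'].value_counts()  # 0 (died): 549, 1 (survived): 342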

The other variables that describe the passengers are:

  • PassengerId: an ID given to each traveler on the boat
  • Pclass: the passenger class. It has three possible values: 1, 2, 3
  • The Name
  • The Sex
  • The Age
  • SibSp: number of siblings and spouses traveling with the passenger
  • Parch: number of parents and children traveling with the passenger
  • The Ticket number
  • The ticket Fare
  • The Cabin number
  • Embarked: the port of embarkation. It has three possible values: S, C, Q

Pandas allows you to statistically describe numerical features using the describe method.

In [6]:
data.describe()
Out[6]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 NaN 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 NaN 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 NaN 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

The count row shows that 177 values are missing from the Age column (891 - 714 = 177).

One solution is to replace the null values with the median age which is more robust to outliers than the mean.
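As a quick illustration of this robustness, on a toy series with one extreme value:

s = pd.Series([20, 25, 30, 80])
s.mean()    # 38.75, pulled up by the outlier
s.median()  # 27.5, barely affected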

In [7]:
data['Age'].fillna(data['Age'].median(), inplace=True)

Let's check that again.

In [8]:
data.describe()
Out[8]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.361582 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 13.019697 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 22.000000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 35.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Perfect.

Let's now make some charts.

Let's visualize survival based on the gender.

In [9]:
survived_sex = data[data['Survived']==1]['Sex'].value_counts()
dead_sex = data[data['Survived']==0]['Sex'].value_counts()
df = pd.DataFrame([survived_sex,dead_sex])
df.index = ['Survived','Dead']
df.plot(kind='bar',stacked=True, figsize=(15,8))
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x117744e50>

The Sex variable seems to be a decisive feature. Women are more likely to survive.
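We can quantify this with a one-liner; in the training set roughly 74% of women survived versus about 19% of men:

data.groupby('Sex')['Survived'].mean()  # female ~0.74, male ~0.19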

Let's now correlate the survival with the age variable.

In [10]:
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Age'], data[data['Survived']==0]['Age']], stacked=True, color = ['g','r'],
         bins = 30,label = ['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()
Out[10]:
<matplotlib.legend.Legend at 0x11768ca90>

If you follow the chart bin by bin, you will notice that passengers younger than 10 are more likely to survive than those aged between 12 and 50. The oldest passengers also seem to have been rescued.

These first two charts confirm an old code of conduct that sailors and captains follow in threatening situations: "Women and children first!"

In [11]:
from IPython.display import Image
Image("../../data/women-children.jpg",height=1000,width=1000)
Out[11]:

Right?

Let's now focus on the ticket Fare of each passenger and correlate it with survival.

In [12]:
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Fare'],data[data['Survived']==0]['Fare']], stacked=True, color = ['g','r'],
         bins = 30,label = ['Survived','Dead'])
plt.xlabel('Fare')
plt.ylabel('Number of passengers')
plt.legend()
Out[12]:
<matplotlib.legend.Legend at 0x117d26a90>

Passengers with cheaper ticket fares are more likely to die. Put differently, passengers with more expensive tickets, and therefore a higher social status, seem to have been rescued first.

Ok this is nice. Let's now combine the age, the fare and the survival on a single chart.

In [13]:
plt.figure(figsize=(15,8))
ax = plt.subplot()
ax.scatter(data[data['Survived']==1]['Age'],data[data['Survived']==1]['Fare'],c='green',s=40)
ax.scatter(data[data['Survived']==0]['Age'],data[data['Survived']==0]['Fare'],c='red',s=40)
ax.set_xlabel('Age')
ax.set_ylabel('Fare')
ax.legend(('survived','dead'),scatterpoints=1,loc='upper right',fontsize=15,)
Out[13]:
<matplotlib.legend.Legend at 0x118c19750>

A distinct cluster of dead passengers (the red one) appears on the chart. Those people are adults (age between 15 and 50) of lower class (lowest ticket fares).

In fact, the ticket fare correlates with the class, as we see in the chart below.

In [14]:
ax = plt.subplot()
ax.set_ylabel('Average fare')
data.groupby('Pclass').mean()['Fare'].plot(kind='bar',figsize=(15,8), ax = ax)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1187d66d0>

Let's now see how the embarkation site affects the survival.

In [15]:
survived_embark = data[data['Survived']==1]['Embarked'].value_counts()
dead_embark = data[data['Survived']==0]['Embarked'].value_counts()
df = pd.DataFrame([survived_embark,dead_embark])
df.index = ['Survived','Dead']
df.plot(kind='bar', stacked=True, figsize=(15,8))
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x118fbf050>

There seems to be no distinct correlation here.

II - Feature engineering

In the previous part, we flirted with the data and spotted some interesting correlations. However, we couldn't manage to analyze more complicated features like the names or the tickets because these required further processing.

In this part, we'll focus on transforming these specific features so that they can be easily fed to machine learning algorithms.

We'll also create, or "engineer" some other features that'll be useful in building the model.

We will break our code into separate functions for more clarity.

But first, let's define a print function that reports whether or not a feature has been processed.

In [16]:
def status(feature):

    print 'Processing',feature,': ok'

Loading the data

One trick when starting a machine learning problem is to combine the training set and the test set together. This is especially useful when the test set contains feature values that don't appear in the training set (for example, a title that occurs only in the test set). If we encoded the two sets separately, the resulting feature columns wouldn't match, and testing our model on the test set would fail.

Besides, combining the two sets will save us some repeated work to do later on when testing.
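To see why, here is a minimal sketch with toy frames: a title that only occurs in one of the two sets (as Dona does, appearing only in the real test set) produces mismatched dummy columns when the sets are encoded separately.

train_toy = pd.DataFrame({'Title': ['Mr', 'Miss']})
test_toy = pd.DataFrame({'Title': ['Mr', 'Dona']})
pd.get_dummies(train_toy['Title']).columns.tolist()  # ['Miss', 'Mr']
pd.get_dummies(test_toy['Title']).columns.tolist()   # ['Dona', 'Mr']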

The procedure is quite simple.

  • We start by loading the train set and the test set.
  • We create an empty dataframe called combined.
  • Then we append test to train and assign the result to combined.
In [17]:
def get_combined_data():
    # reading train data
    train = pd.read_csv('../../data/train.csv')
    
    # reading test data
    test = pd.read_csv('../../data/test.csv')

    # extracting and then removing the targets from the training data 
    targets = train.Survived
    train.drop('Survived', axis=1, inplace=True)
    

    # merging train data and test data for future feature engineering
    combined = train.append(test)
    combined.reset_index(inplace=True)
    combined.drop('index', inplace=True, axis=1)
    
    return combined
In [18]:
combined = get_combined_data()

Let's have a look at the shape:

In [19]:
combined.shape
Out[19]:
(1309, 11)

The train and test sets are combined.

You may notice that the total number of rows (1309) is exactly the sum of the number of rows in the train set (891) and the test set (418).

In [20]:
combined.head()
Out[20]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Extracting the passenger titles

When looking at the passenger names, one could wonder how to process them to extract easily interpretable information.

If you look closely at these first examples:

  • Braund, Mr. Owen Harris
  • Heikkinen, Miss. Laina
  • Oliva y Ocana, Dona. Fermina
  • Peter, Master. Michael J

You will notice that each name has a title in it! This can be a simple Miss. or Mrs., but it can sometimes be something more sophisticated like Master, Sir or Dona. In that case, we can introduce additional information about the social status by simply parsing the name and extracting the title.

Let's see how we'll do that in the function below.

In [21]:
def get_titles():

    global combined
    
    # we extract the title from each name
    combined['Title'] = combined['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
    
    # a map of more aggregated titles
    Title_Dictionary = {
                        "Capt":       "Officer",
                        "Col":        "Officer",
                        "Major":      "Officer",
                        "Jonkheer":   "Royalty",
                        "Don":        "Royalty",
                        "Sir" :       "Royalty",
                        "Dr":         "Officer",
                        "Rev":        "Officer",
                        "the Countess":"Royalty",
                        "Dona":       "Royalty",
                        "Mme":        "Mrs",
                        "Mlle":       "Miss",
                        "Ms":         "Mrs",
                        "Mr" :        "Mr",
                        "Mrs" :       "Mrs",
                        "Miss" :      "Miss",
                        "Master" :    "Master",
                        "Lady" :      "Royalty"

                        }
    
    # we map each title
    combined['Title'] = combined.Title.map(Title_Dictionary)

This function parses the names and extracts the titles. Then it maps the titles to categories of titles. We selected:

  • Officer
  • Royalty
  • Mr
  • Mrs
  • Miss
  • Master
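As a quick sanity check, the parsing rule used above (split on the comma, then on the dot) behaves like this on a raw name:

name = 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
name.split(',')[1].split('.')[0].strip()  # 'Mrs'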

Let's run it!

In [22]:
get_titles()
In [23]:
combined.head()
Out[23]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C Mrs
2 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S Miss
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S Mrs
4 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S Mr

Perfect. Now we have an additional column called Title that contains the information.

Processing the ages

We have seen in the first part that the Age variable was missing 177 values in the training set. This is a large number (~20% of its rows). Simply replacing them with the mean or the median age might not be the best solution, since the age may differ by groups and categories of passengers.

To understand why, let's group our dataset by sex, Title and passenger class and for each subset compute the median age.

To avoid data leakage from the test set, we perform these operations separately on the train set and the test set.

In [24]:
grouped_train = combined.head(891).groupby(['Sex','Pclass','Title'])
grouped_median_train = grouped_train.median()

grouped_test = combined.iloc[891:].groupby(['Sex','Pclass','Title'])
grouped_median_test = grouped_test.median()
In [25]:
grouped_median_train
Out[25]:
PassengerId Age SibSp Parch Fare
Sex Pclass Title
female 1 Miss 369.0 30.0 0.0 0.0 88.25000
Mrs 499.0 40.0 1.0 0.0 79.20000
Officer 797.0 49.0 0.0 0.0 25.92920
Royalty 658.5 40.5 0.5 0.0 63.05000
2 Miss 437.5 24.0 0.0 0.0 13.00000
Mrs 439.5 31.5 1.0 0.0 26.00000
3 Miss 372.0 18.0 0.0 0.0 8.75625
Mrs 405.5 31.0 1.0 1.0 15.97500
male 1 Master 446.0 4.0 1.0 2.0 120.00000
Mr 463.0 40.0 0.0 0.0 42.40000
Officer 648.0 51.0 0.0 0.0 35.50000
Royalty 600.0 40.0 0.0 0.0 27.72080
2 Master 408.0 1.0 1.0 1.0 26.00000
Mr 440.0 31.0 0.0 0.0 13.00000
Officer 358.5 46.5 0.0 0.0 13.00000
3 Master 270.5 4.0 3.5 1.0 28.51250
Mr 472.0 26.0 0.0 0.0 7.89580
In [26]:
grouped_median_test
Out[26]:
PassengerId Age SibSp Parch Fare
Sex Pclass Title
female 1 Miss 1074.0 32.0 0.0 0.0 158.20835
Mrs 1076.0 48.0 1.0 0.0 63.35830
Royalty 1306.0 39.0 0.0 0.0 108.90000
2 Miss 1121.0 19.5 1.0 1.0 24.50000
Mrs 1123.5 29.0 0.0 0.0 26.00000
3 Miss 1090.5 22.0 0.0 0.0 7.87920
Mrs 1048.0 28.0 1.0 0.5 14.28125
male 1 Master 1022.0 9.5 1.0 2.0 198.43750
Mr 1102.0 42.0 0.0 0.0 50.24790
Officer 1094.0 53.0 1.0 0.0 81.85830
2 Master 1033.5 5.0 0.5 1.5 27.75000
Mr 1156.0 28.0 0.0 0.0 13.00000
Officer 1048.5 35.5 0.5 0.5 19.50000
3 Master 1173.0 7.0 1.0 1.0 15.24580
Mr 1101.0 25.0 0.0 0.0 7.85420

Look at the median age column and see how this value can be different based on the Sex, Pclass and Title put together.

For example:

  • If the passenger is female, from Pclass 1, and royalty, the median age (in the test set) is 39.
  • If the passenger is male, from Pclass 3, with a Mr title, the median age (in the train set) is 26.
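These values can also be read off programmatically; the grouped frame is indexed by (Sex, Pclass, Title), so a single group's median age is one .loc lookup away:

grouped_median_train.loc['male', 3, 'Mr']['Age']  # 26.0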

Let's create a function that fills in the missing age in combined based on these different attributes.

In [27]:
def process_age():
    
    global combined
    
    # a function that fills in the missing Age of a row with the median age
    # of its (Sex, Pclass, Title) group, leaving the value untouched when
    # no matching group exists
    def fillAges(row, grouped_median):
        try:
            return grouped_median.loc[row['Sex'], row['Pclass'], row['Title']]['Age']
        except KeyError:
            return row['Age']
    
    # we assign through .loc so that combined itself is updated
    # (writing to the copy returned by head() / iloc[] would be a
    # chained assignment and could silently do nothing)
    combined.loc[:890, 'Age'] = combined.head(891).apply(
        lambda r: fillAges(r, grouped_median_train) if np.isnan(r['Age']) else r['Age'], axis=1)
    
    combined.loc[891:, 'Age'] = combined.iloc[891:].apply(
        lambda r: fillAges(r, grouped_median_test) if np.isnan(r['Age']) else r['Age'], axis=1)
    
    status('age')
In [28]:
process_age()
Processing age : ok
In [29]:
combined.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Title          1309 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 122.8+ KB

Perfect. The missing ages have been replaced.

However, we notice a missing value in Fare, two missing values in Embarked and a lot of missing values in Cabin. We'll come back to these variables later.
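A compact way to keep an eye on these remaining holes, if you prefer counts over the info() listing:

combined.isnull().sum()  # Fare: 1, Embarked: 2, Cabin: 1014, everything else: 0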

Let's now process the names.

In [30]:
def process_names():
    
    global combined
    # we clean the Name variable
    combined.drop('Name',axis=1,inplace=True)
    
    # encoding in dummy variable
    titles_dummies = pd.get_dummies(combined['Title'],prefix='Title')
    combined = pd.concat([combined,titles_dummies],axis=1)
    
    # removing the title variable
    combined.drop('Title',axis=1,inplace=True)
    
    status('names')

This function drops the Name column since we won't be using it anymore, now that we have extracted a Title column from it.

Then we encode the title values using a dummy encoding.

You can learn about dummy coding and how to easily do it in Pandas here.
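In a nutshell, dummy coding turns one categorical column into one binary column per category. A minimal sketch (exact dtypes may vary across pandas versions):

pd.get_dummies(pd.Series(['Mr', 'Mrs', 'Mr']), prefix='Title')
# produces two columns: Title_Mr = [1, 0, 1] and Title_Mrs = [0, 1, 0]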

In [31]:
process_names()
Processing names : ok
In [32]:
combined.head()
Out[32]:
PassengerId Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty
0 1 3 male 22.0 1 0 A/5 21171 7.2500 NaN S 0.0 0.0 1.0 0.0 0.0 0.0
1 2 1 female 38.0 1 0 PC 17599 71.2833 C85 C 0.0 0.0 0.0 1.0 0.0 0.0
2 3 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0.0 1.0 0.0 0.0 0.0 0.0
3 4 1 female 35.0 1 0 113803 53.1000 C123 S 0.0 0.0 0.0 1.0 0.0 0.0
4 5 3 male 35.0 0 0 373450 8.0500 NaN S 0.0 0.0 1.0 0.0 0.0 0.0

As you can see :

  • there is no longer a name feature.
  • new variables (Title_X) appeared. These features are binary.
    • For example, If Title_Mr = 1, the corresponding Title is Mr.

Processing Fare

In [33]:
def process_fares():
    
    global combined
    # there's one missing fare value - replacing it with the mean,
    # computed separately on the train part and the test part
    # (assigning through .loc, again to avoid the chained-assignment trap)
    combined.loc[:890, 'Fare'] = combined.loc[:890, 'Fare'].fillna(combined.loc[:890, 'Fare'].mean())
    combined.loc[891:, 'Fare'] = combined.loc[891:, 'Fare'].fillna(combined.loc[891:, 'Fare'].mean())
    
    status('fare')

This function simply replaces the single missing Fare value with the mean fare, computed separately on the train and test parts to avoid leakage.

In [34]:
process_fares()
Processing fare : ok

Processing Embarked

In [35]:
def process_embarked():
    
    global combined
    # two missing embarked values - filling them with the most frequent one (S)
    combined.Embarked.fillna('S', inplace=True)
    
    # dummy encoding 
    embarked_dummies = pd.get_dummies(combined['Embarked'],prefix='Embarked')
    combined = pd.concat([combined,embarked_dummies],axis=1)
    combined.drop('Embarked',axis=1,inplace=True)
    
    status('embarked')

This function replaces the two missing values of Embarked with the most frequent Embarked value, S.

In [36]:
process_embarked()
Processing embarked : ok

Processing Cabin

In [37]:
def process_cabin():
    
    global combined
    
    # replacing missing cabins with U (for Unknown)
    combined.Cabin.fillna('U', inplace=True)
    
    # mapping each Cabin value with the cabin letter
    combined['Cabin'] = combined['Cabin'].map(lambda c : c[0])
    
    # dummy encoding ...
    cabin_dummies = pd.get_dummies(combined['Cabin'], prefix='Cabin')
    
    combined = pd.concat([combined,cabin_dummies], axis=1)
    
    combined.drop('Cabin', axis=1, inplace=True)
    
    status('cabin')

This function replaces NaN values with U (for Unknown). It then maps each Cabin value to its first letter, and encodes the cabin values using dummy encoding again.

In [38]:
process_cabin()
Processing cabin : ok
In [39]:
combined.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 26 columns):
PassengerId      1309 non-null int64
Pclass           1309 non-null int64
Sex              1309 non-null object
Age              1309 non-null float64
SibSp            1309 non-null int64
Parch            1309 non-null int64
Ticket           1309 non-null object
Fare             1309 non-null float64
Title_Master     1309 non-null float64
Title_Miss       1309 non-null float64
Title_Mr         1309 non-null float64
Title_Mrs        1309 non-null float64
Title_Officer    1309 non-null float64
Title_Royalty    1309 non-null float64
Embarked_C       1309 non-null float64
Embarked_Q       1309 non-null float64
Embarked_S       1309 non-null float64
Cabin_A          1309 non-null float64
Cabin_B          1309 non-null float64
Cabin_C          1309 non-null float64
Cabin_D          1309 non-null float64
Cabin_E          1309 non-null float64
Cabin_F          1309 non-null float64
Cabin_G          1309 non-null float64
Cabin_T          1309 non-null float64
Cabin_U          1309 non-null float64
dtypes: float64(20), int64(4), object(2)
memory usage: 266.0+ KB

Ok no missing values now.

In [40]:
combined.head()
Out[40]:
PassengerId Pclass Sex Age SibSp Parch Ticket Fare Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty Embarked_C Embarked_Q Embarked_S Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 1 3 male 22.0 1 0 A/5 21171 7.2500 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1 2 1 female 38.0 1 0 PC 17599 71.2833 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 3 female 26.0 0 0 STON/O2. 3101282 7.9250 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 4 1 female 35.0 1 0 113803 53.1000 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 3 male 35.0 0 0 373450 8.0500 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

Processing Sex

In [41]:
def process_sex():
    
    global combined
    # mapping string values to numerical ones
    combined['Sex'] = combined['Sex'].map({'male':1,'female':0})
    
    status('sex')

This function maps the string values male and female to 1 and 0 respectively.

In [42]:
process_sex()
Processing sex : ok

Processing Pclass

In [43]:
def process_pclass():
    
    global combined
    # encoding into 3 categories:
    pclass_dummies = pd.get_dummies(combined['Pclass'], prefix="Pclass")
    
    # adding dummy variables
    combined = pd.concat([combined,pclass_dummies],axis=1)
    
    # removing "Pclass"
    
    combined.drop('Pclass',axis=1,inplace=True)
    
    status('pclass')

This function encodes the values of Pclass (1,2,3) using a dummy encoding.

In [44]:
process_pclass()
Processing pclass : ok

Processing Ticket

In [45]:
def process_ticket():
    
    global combined
    
    # a function that extracts each prefix of the ticket, returns 'XXX' if no prefix (i.e. the ticket is made only of digits)
    def cleanTicket(ticket):
        ticket = ticket.replace('.','')
        ticket = ticket.replace('/','')
        ticket = ticket.split()
        ticket = [t.strip() for t in ticket]
        ticket = [t for t in ticket if not t.isdigit()]
        if len(ticket) > 0:
            return ticket[0]
        else: 
            return 'XXX'
    

    # Extracting dummy variables from tickets:

    combined['Ticket'] = combined['Ticket'].map(cleanTicket)
    tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies], axis=1)
    combined.drop('Ticket', inplace=True, axis=1)

    status('ticket')
  • This function preprocesses the tickets, first by extracting the ticket prefix. When it fails to extract a prefix, it returns XXX (see the quick check below).
  • Then it encodes the prefixes using dummy encoding.
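For instance, assuming you lift cleanTicket to the top level for a quick test, the rule behaves like this:

cleanTicket('STON/O2. 3101282')  # 'STONO2': dots and slashes removed, digits filtered out
cleanTicket('113803')            # 'XXX': a purely numeric ticket has no prefix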
In [46]:
process_ticket()
Processing ticket : ok

Processing Family

This part includes creating new variables based on the size of the family (the size being, by the way, another variable we create).

This creation of new variables is done under a realistic assumption: large families are grouped together, hence they are more likely to be rescued than people traveling alone.

In [47]:
def process_family():
    
    global combined
    # introducing a new feature : the size of families (including the passenger)
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    
    # introducing other features based on the family size
    combined['Singleton'] = combined['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    combined['SmallFamily'] = combined['FamilySize'].map(lambda s: 1 if 2<=s<=4 else 0)
    combined['LargeFamily'] = combined['FamilySize'].map(lambda s: 1 if 5<=s else 0)
    
    status('family')

This function introduces 4 new features:

  • FamilySize: the total number of relatives including the passenger (him/her)self.
  • Singleton: a boolean variable that describes families of size = 1
  • SmallFamily: a boolean variable that describes families of 2 <= size <= 4
  • LargeFamily: a boolean variable that describes families of 5 <= size
In [48]:
process_family()
Processing family : ok
In [49]:
combined.shape
Out[49]:
(1309, 68)

We end up with a total of 68 features.

In [50]:
combined.head()
Out[50]:
PassengerId Sex Age SibSp Parch Fare Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty Embarked_C Embarked_Q Embarked_S Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U Pclass_1 Pclass_2 Pclass_3 Ticket_A Ticket_A4 Ticket_A5 Ticket_AQ3 Ticket_AQ4 Ticket_AS Ticket_C Ticket_CA Ticket_CASOTON Ticket_FC Ticket_FCC Ticket_Fa Ticket_LINE Ticket_LP Ticket_PC Ticket_PP Ticket_PPP Ticket_SC Ticket_SCA3 Ticket_SCA4 Ticket_SCAH Ticket_SCOW Ticket_SCPARIS Ticket_SCParis Ticket_SOC Ticket_SOP Ticket_SOPP Ticket_SOTONO2 Ticket_SOTONOQ Ticket_SP Ticket_STONO Ticket_STONO2 Ticket_STONOQ Ticket_SWPP Ticket_WC Ticket_WEP Ticket_XXX FamilySize Singleton SmallFamily LargeFamily
0 1 1 22.0 1 0 7.2500 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 0 1 0
1 2 0 38.0 1 0 71.2833 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 0 1 0
2 3 0 26.0 0 0 7.9250 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1 1 0 0
3 4 0 35.0 1 0 53.1000 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2 0 1 0
4 5 1 35.0 0 0 8.0500 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 1 0 0

Let's now remove the PassengerId variable. It conveys no useful information for the prediction task.

In [51]:
combined.drop('PassengerId', inplace=True, axis=1)
In [52]:
combined.head()
Out[52]:
Sex Age SibSp Parch Fare Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty Embarked_C Embarked_Q Embarked_S Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U Pclass_1 Pclass_2 Pclass_3 Ticket_A Ticket_A4 Ticket_A5 Ticket_AQ3 Ticket_AQ4 Ticket_AS Ticket_C Ticket_CA Ticket_CASOTON Ticket_FC Ticket_FCC Ticket_Fa Ticket_LINE Ticket_LP Ticket_PC Ticket_PP Ticket_PPP Ticket_SC Ticket_SCA3 Ticket_SCA4 Ticket_SCAH Ticket_SCOW Ticket_SCPARIS Ticket_SCParis Ticket_SOC Ticket_SOP Ticket_SOPP Ticket_SOTONO2 Ticket_SOTONOQ Ticket_SP Ticket_STONO Ticket_STONO2 Ticket_STONOQ Ticket_SWPP Ticket_WC Ticket_WEP Ticket_XXX FamilySize Singleton SmallFamily LargeFamily
0 1 22.0 1 0 7.2500 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 0 1 0
1 0 38.0 1 0 71.2833 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 0 1 0
2 0 26.0 0 0 7.9250 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1 1 0 0
3 0 35.0 1 0 53.1000 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2 0 1 0
4 1 35.0 0 0 8.0500 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 1 0 0

III - Modeling

In this part, we use our knowledge of the passengers, based on the features we created, to build a statistical model. You can think of this model as a black box that crunches the information of any new passenger and decides whether or not they survive.

There is a wide variety of models to use, from logistic regression to decision trees and more sophisticated ones such as random forests and gradient boosted trees.

We'll be using Random Forests. Random Forests have proven highly effective in Kaggle competitions.

For more details about why ensemble methods perform well, many good posts have been written on the subject.

Back to our problem, we now have to:

  1. Break the combined dataset into a train set and a test set.
  2. Use the train set to build a predictive model.
  3. Evaluate the model using the train set.
  4. Test the model using the test set and generate an output file for the submission.

Keep in mind that we'll have to iterate on steps 2 and 3 until an acceptable evaluation score is achieved.

Let's start by importing the useful libraries.

In [53]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble.gradient_boosting import GradientBoostingClassifier
from sklearn.cross_validation import cross_val_score

To evaluate our model we'll be using a 5-fold cross validation with the Accuracy metric.

To do that, we'll define a small scoring function.

In [54]:
def compute_score(clf, X, y, scoring='accuracy'):
    xval = cross_val_score(clf, X, y, cv = 5, scoring=scoring)
    return np.mean(xval)

Recovering the train set and the test set from the combined dataset is an easy task.

In [55]:
def recover_train_test_target():
    global combined
    
    train0 = pd.read_csv('../../data/train.csv')
    
    targets = train0.Survived
    train = combined.head(891)
    test = combined.iloc[891:]
    
    return train, test, targets
In [56]:
train, test, targets = recover_train_test_target()

Feature selection

We've come up with 68 features so far. This number is quite large.

When feature engineering is done, we usually tend to decrease the dimensionality by selecting the "right" number of features that capture the essentials.

In fact, feature selection comes with many benefits:

  • It decreases redundancy among the data
  • It speeds up the training process
  • It reduces overfitting

Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features.

In [57]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
clf = clf.fit(train, targets)

Let's have a look at the importance of each feature.

In [58]:
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)
In [59]:
features.plot(kind='barh', figsize=(20, 20))
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b5e18d0>

As you may notice, a great deal of importance is attached to Title_Mr, Age, Fare, and Sex.

Let's now transform our train set and test set into more compact datasets.

In [60]:
model = SelectFromModel(clf, prefit=True)
train_reduced = model.transform(train)
train_reduced.shape
Out[60]:
(891, 12)
In [61]:
test_reduced = model.transform(test)
test_reduced.shape
Out[61]:
(418, 12)

Yay! Now we're down to 12 features.
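If you're curious which columns made the cut, SelectFromModel exposes the selection as a boolean mask via scikit-learn's selector API; a quick sketch:

train.columns[model.get_support()].tolist()  # the 12 retained feature names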

Hyperparameters tuning

As mentioned in the beginning of the Modeling part, we will be using a Random Forest model.

Random Forests are quite handy. They do however come with some parameters to tweak in order to get an optimal model for the prediction task.

To learn more about Random Forests, there are plenty of good references online.

In [62]:
# turn run_gs to True if you want to run the gridsearch again.
run_gs = False

if run_gs:
    parameter_grid = {
                 'max_depth' : [4, 6, 8],
                 'n_estimators': [50, 10],
                 'max_features': ['sqrt', 'auto', 'log2'],
                 'min_samples_split': [1, 3, 10],
                 'min_samples_leaf': [1, 3, 10],
                 'bootstrap': [True, False],
                 }
    forest = RandomForestClassifier()
    cross_validation = StratifiedKFold(targets, n_folds=5)

    grid_search = GridSearchCV(forest,
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=cross_validation)

    grid_search.fit(train, targets)
    model = grid_search
    parameters = grid_search.best_params_

    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
else: 
    parameters = {'bootstrap': False, 'min_samples_leaf': 3, 'n_estimators': 50, 
                  'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6}
    
    model = RandomForestClassifier(**parameters)
    model.fit(train, targets)
In [63]:
compute_score(model, train, targets, scoring='accuracy')
Out[63]:
0.82608565824298075

Now that the model is built by scanning several combinations of the hyperparameters, we can generate an output file to submit to Kaggle.

This solution allowed me to get an accuracy score of 0.8134 on the public leaderboard.

In [64]:
output = model.predict(test).astype(int)
df_output = pd.DataFrame()
aux = pd.read_csv('../../data/test.csv')
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('../../data/output.csv',index=False)

IV - Conclusion

In this article, we explored an interesting dataset brought to us by Kaggle.

We went through the basic building blocks of a data science pipeline:

  • Data exploration and visualization: an initial step to formulate hypotheses
  • Data cleaning
  • Feature engineering
  • Feature selection
  • Hyperparameters tuning
  • Submission

This post can be downloaded as a notebook from my github repo if you ever want to test and play with it.

Lots of articles have been written about this challenge, so obviously there is room for improvement.

Here is what I suggest for next steps:

  • Dig more into the data and eventually build new features.
  • Try different models: logistic regression, gradient boosted trees, XGBoost, ...
  • Try ensemble learning techniques (stacking, blending)
  • Maybe try other families of ML models (neural networks?)

I would be more than happy if you could find a way to improve my solution. I would then update the article and definitely give you credit for it. So feel free to post a comment.

As a word of gratitude, I would like to thank Kdnuggets for sharing this post !