Titanic and machine learning: anything in common?

Titanic survival prediction is a fun machine learning challenge, and it is perfect for beginners, so I highly recommend that you play with it!
The data set can be downloaded on Kaggle, here.
The goal is: based on some variables (like age, family members on board, ticket…), predict whether a Titanic passenger survived or not using machine learning.

What is provided:
– Training set: includes the survival information (1 for survived, 0 for did not survive).
– Test set: does not include the survival information, which is what needs to be predicted.
– Data description: describes all features in both the training and test sets.

Data preparation

Loading the data from the training and test sets:

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

A row of the training set looks like this:

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S

Many of us know about the Titanic tragedy, and it is common knowledge that women, children, the elderly and the rich were the first to board the lifeboats. Is that true or not? We will check by visualizing the data.
Also, all of the following operations are run on both the train and test data sets loaded above.
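The preparation snippets below are written against a single frame called data; a minimal sketch of running the same step on both frames:

# `data` in the following snippets refers to either frame;
# each preparation step is simply run on both of them
for data in (train, test):
    pass  # apply the current preparation step to `data` here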

Compared to men, women had a much better chance of survival.

Similarly, passengers in first and second class had a better chance of survival than those in third class.

I don’t really know how to interpret the Embarked feature. Does the embarkation port explain survival? Correlation and causation are quite different things. I will keep the feature for now, but I will test the model with and without it.

From the Age distribution, we can see that children did indeed have a better chance of survival.
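As a rough sketch, plots like the ones described above can be drawn with pandas and matplotlib (this assumes the train frame loaded earlier, before any column is dropped):

import matplotlib.pyplot as plt

# Survival rate by sex and by passenger class
train.groupby("Sex")["Survived"].mean().plot(kind="bar", title="Survival rate by sex")
plt.show()
train.groupby("Pclass")["Survived"].mean().plot(kind="bar", title="Survival rate by class")
plt.show()

# Age distribution of survivors vs. non-survivors
train.loc[train["Survived"] == 1, "Age"].plot(kind="hist", alpha=0.5, label="Survived")
train.loc[train["Survived"] == 0, "Age"].plot(kind="hist", alpha=0.5, label="Did not survive")
plt.xlabel("Age")
plt.legend()
plt.show()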

Instead of keeping Parch and SibSp, I create a new feature, Family, that counts the number of family members on board.

data["Family"] = data["Parch"] + data["SibSp"]
data.drop(["SibSp", "Parch"], axis=1, inplace=True)

As for Ticket and Cabin, I decide to drop them.
Could the cabin explain the survival rate? We can imagine that the closer the cabin is to a lifeboat, the better the chance of surviving, but with 687 empty Cabin values I doubt there is a good way to fill in the missing data.
As for Ticket, Pclass already carries a more general version of that information, and the raw ticket strings are difficult to work with.
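To see how sparse Cabin really is, a quick sketch that counts missing values per column on the training frame:

# Cabin has 687 missing entries out of the 891 training rows
print(train.isnull().sum())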

data.drop(["Ticket", "Cabin"], axis=1, inplace=True)

As for Fare, passengers with a higher Fare seem to have had a better chance of survival. Also, something worth knowing, and which explains why the Fare distribution is skewed to the right: the Fare recorded for passengers of the same family is the sum of all of their ticket fares. That will be very helpful later.
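A small sketch to check that skewness on the prepared frame (a positive value confirms the long right tail):

# Skewness of the Fare distribution
print(data["Fare"].skew())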

Missing values

To fill missing values, I choose the mean for Age and Fare, and the most frequent value for Embarked.

# Replace missing Age and Fare values with the mean of each column
# Replace missing Embarked values with the most frequent value
data["Age"] = data["Age"].fillna(data["Age"].mean())
data["Fare"] = data["Fare"].fillna(data["Fare"].mean())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].value_counts().index[0])

Adding new features

Clearly I would not keep Name as a feature, but the name carries one piece of information worth considering: the title (Mr, Mrs, Col, Don…). Not only does it give an idea of the sex, but also of the marital status and the rank in society at that time.

# Create a new feature, Title, extracted from "LastName, Title. FirstNames"
data["Title"] = data.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
# Group equivalent or rare titles together
data["Title"].replace("Mlle", "Miss", inplace=True)
data["Title"].replace(["Ms", "Mme"], "Mrs", inplace=True)
data["Title"].replace("Don", "Master", inplace=True)
data["Title"].replace(["Capt", "Col", "Major"], "Military", inplace=True)
data["Title"].replace(["the Countess", "Sir", "Lady", "Dona", "Jonkheer"], "Elite", inplace=True)

Also, if we group passengers of the same family (meaning those who share the same last name and the same Fare), we can see that a passenger’s survival is related to his/her relatives’ survival (credit for the idea).
If none of the relatives survived, he/she has a higher chance of not surviving.
If at least one relative survived, he/she has a higher chance of surviving.
Below are examples with some families from the train data set.

Survived statistics per family (grouped by LastName and Fare):

Family (LastName, Fare)   count   max   mean
Palsson, 21.075               4   0.0   0.000
Panula, 39.6875               6   0.0   0.000
Rice, 29.125                  5   0.0   0.000
Richards, 18.75               3   1.0   1.000
Sage, 69.55                   7   0.0   0.000
Skoog, 27.9                   6   0.0   0.000
Taussig, 79.65                3   1.0   0.667
Thayer, 110.8833              3   1.0   0.667
Van Impe, 24.15               3   0.0   0.000
Vander Planke, 18.0           3   0.0   0.000
West, 27.75                   3   1.0   0.667
The function below builds this feature, a survival Rate per family, computed on the training set and added to both sets:

def add_survival_rate(train, test):
    train['LastName'] = train.Name.str.split(', ', expand=True)[0]
    test['LastName'] = test.Name.str.split(', ', expand=True)[0]
    # For each family (same LastName and Fare), aggregate the max and the count of Survived
    rate = train[["Survived", 'LastName', 'Fare']].groupby(['LastName', 'Fare']).agg(['max', 'count'])
    # Default of 0.5, for families present in test but not in train, and for passengers with no family in train
    train["Rate"] = 0.5
    test["Rate"] = 0.5
    for index, row in rate.iterrows():
        if row.iloc[1] != 1:  # only families with more than one member in train
            train.loc[(train.LastName == index[0]) & (train.Fare == index[1]), "Rate"] = row.iloc[0]
            test.loc[(test.LastName == index[0]) & (test.Fare == index[1]), "Rate"] = row.iloc[0]
    # Drop Name, LastName and Fare, as they are not needed anymore
    train.drop(["Name", "LastName", "Fare"], axis=1, inplace=True)
    test.drop(["Name", "LastName", "Fare"], axis=1, inplace=True)
    return train, test
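A usage sketch, assuming both frames still contain Name and Fare and that train includes the Survived column:

# Add the family survival Rate to both data sets
train, test = add_survival_rate(train, test)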

Transforming features

The goal is to transform the categorical features into numeric ones, either by mapping categories to integers (label encoding, fine for binary or ordered features) or by creating dummy (one-hot) variables. I tested both, and the resulting accuracy after modeling is similar.

I prefer to transform such variables into binary dummy features; it adds more columns, but that is still acceptable in this case.

# Convert to dummy variables.
data = pd.get_dummies(data=data, columns=["Sex", "Embarked","Title"])

If you would like to experiment with the other approach, here is an example for the Sex and Embarked features.

# Replace string categorical values with integers: Sex, Embarked
data["Sex"].replace("male", 1, inplace=True)
data["Sex"].replace("female", 2, inplace=True)
data["Embarked"].replace("S", 1, inplace=True)
data["Embarked"].replace("C", 2, inplace=True)
data["Embarked"].replace("Q", 3, inplace=True)

Scaling

Since the features are on different scales, this step is important. I scale both the training and test sets using the mean and standard deviation of the training set.

def scale(train, test):
    # Use the training set's mean and standard deviation for both sets
    means = train.mean()
    stdevs = train.std()
    train = train.subtract(means, axis=1).divide(stdevs, axis=1)
    test = test.subtract(means, axis=1).divide(stdevs, axis=1)
    return train, test
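A sketch of how this can be applied; the names X, y, X_pred, X_scaled and X_pred_scaled are mine, assuming the prepared train and test frames from the previous sections:

# Separate the target from the features, then scale both sets together
y = train["Survived"]
X = train.drop(["Survived", "PassengerId"], axis=1)
X_pred = test.drop("PassengerId", axis=1)  # PassengerId stays in `test` for the submission file
X_scaled, X_pred_scaled = scale(X, X_pred)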

Modeling

Let’s start the fun part, which is exploring machine learning algorithms.
Since the training data set is small (891 rows) and there is a limited number of features (6 with label encoding, 14 with the dummy transformation), a good classification algorithm to try is a support vector machine (SVM) with a Gaussian (RBF) kernel.
But first, let’s split the training set into a training set and a cross-validation set. The cross-validation error is a good indicator of how the model performs on data it has never “seen”, and comparing the training error with the cross-validation error helps to check for overfitting.

import numpy as np
from sklearn import model_selection, svm

X_train, X_test, y_train, y_test = model_selection.train_test_split(X_scaled, y, test_size=0.3)

# Grid search over C and gamma for an RBF-kernel SVM, with 5-fold cross-validation
range_C = 2. ** np.arange(-3, 3, 0.1)
range_g = 2. ** np.arange(-3, 3, 0.1)
param_grid = {'C': range_C, 'gamma': range_g}
grid_search = model_selection.GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5, scoring='f1')
grid_search.fit(X_train, y_train)

The scoring method is f1_score.
I test the model’s accuracy with and without Embarked, and the results are similar. This is a good reminder that a high correlation between a feature and the target does not imply causation, and that correlation alone is not a sufficient reason to keep or remove a feature.

F1 score on train set : 0.8032128514056226
F1 score on cv set : 0.8244274809160305
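For reference, a sketch of how such scores can be computed with scikit-learn (GridSearchCV refits the best model, so it can predict directly):

from sklearn import metrics

# F1 on the training split and on the held-out cross-validation split
print("F1 score on train set :", metrics.f1_score(y_train, grid_search.predict(X_train)))
print("F1 score on cv set :", metrics.f1_score(y_test, grid_search.predict(X_test)))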

Prediction

# Use the best estimator found by the grid search
classifier = grid_search.best_estimator_

y_pred = classifier.predict(X_pred_scaled)
y_pred = pd.DataFrame(y_pred, columns=["Survived"])
y_pred.insert(0, "PassengerId", test["PassengerId"])
y_pred.to_csv("pred.csv", index=False)

The accuracy on the test set on Kaggle is 0.81339 (Top 5%).

Thanks for reading this post, and see you soon!
