Predicting Survivors on the Titanic using SAP Predictive Analytics

While Jack Dawson was a fictional character played by Leonardo DiCaprio in the movie Titanic, there was an actual J. Dawson aboard the ship. But this ‘J’ stood for Joseph, not Jack. Over the years, data has been collected on many of the people who were aboard, and we know quite a bit about them, including whether or not they survived the sinking. Below are some of the details that have been gathered about many of those aboard the Titanic.

Name
Sex
Age
# of Siblings aboard
# of Parents/Children aboard
Ticket #
Passenger Fare
Passenger Class
Cabin
Port of Embarkation
Survived (Yes/No)

Some of this detail is just general information about the individuals, but other indicators relate to their social status aboard the Titanic. If you’ve seen the movie (most people have, since it is one of the highest-grossing films of all time), you know that many of those who didn’t make it were trapped inside the ship and were not able to escape on boats or rafts. Since we have this data available, let’s use ‘Automated Analytics’ within the SAP Predictive Analytics 2.5 tool to see if we can build a prediction model that uses the indicators above to assign each person a probability of survival.

We have data on 1,309 unique individuals, so let’s use 1,000 of them to train a predictive model and then apply that model to the remaining 309 to see how close we come to predicting their outcome. Since this is a binary outcome, where Survival is measured as either a ‘Yes’ or ‘No’ response, a logistic regression model is a natural choice. Since we will be using Automated Analytics, the tool will make that determination for us automatically.
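The 1,000/309 split can be sketched in a few lines of Python. This is only a minimal illustration, not what the tool does internally: it assumes the rows are shuffled at random with a fixed seed before splitting (the post does not say how the 309 test rows were chosen), and the function name split_rows is hypothetical.

```python
import random


def split_rows(rows, n_train, seed=42):
    """Shuffle a copy of rows and split it into (train, test).

    Assumption: a random split with a fixed seed so it is repeatable.
    """
    if not 0 <= n_train <= len(rows):
        raise ValueError("n_train must be between 0 and len(rows)")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in row ids for the 1,309 individuals in the data set.
row_ids = list(range(1309))
train, test = split_rows(row_ids, n_train=1000)
print(len(train), len(test))
```

Keeping the test rows completely separate from training is what makes the later comparison a fair check of the model.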

Let’s get started!

Step 1: Create a Classification/Regression model within the Automated Analytics Module in SAP Predictive Analytics 2.5

Step 2: Analyze the data to make sure all of the columns are appropriately valued

Step 3: Assign roles to the variables by selecting the target and the predictors. For this model, ‘Survived’ will be our target variable, since we want to see how well we can predict the survival outcome from everything else. (Please note that KxIndex is just an index variable generated by the tool to give each row of data a unique identifier, and it will be discarded from the final model. In addition, I removed boat, home destination, and ID, as they were also insignificant to the model outcome.)

Step 4: After selecting all the explanatory variables as well as the target variable, go ahead and generate the model. (Please note that the probability threshold was kept at its default of 50%, meaning a probability of less than 0.5 indicates ‘No Survival’ and a probability of 0.5 or greater indicates ‘Yes Survival’. This threshold can be adjusted in the advanced settings before model generation if a different cutoff is needed.)
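To make the threshold rule concrete, here is a small Python sketch of the same decision outside the tool. The function name classify and its threshold parameter are hypothetical and only mirror the 50% default described above; they are not part of SAP Predictive Analytics.

```python
def classify(probability, threshold=0.5):
    """Return 1 ('Yes Survival') if probability >= threshold, else 0."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return 1 if probability >= threshold else 0


print(classify(0.49))                 # 0, below the default cutoff
print(classify(0.50))                 # 1, exactly at the cutoff counts as survival
print(classify(0.50, threshold=0.6))  # 0, a stricter cutoff changes the label
```

Raising the threshold makes the model more conservative about predicting survival; lowering it does the opposite.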

Step 5: Evaluate Results

The model produced some general information for us to consider.

About 56% of the individuals in the training data did not survive, while 44% did.
Additionally, the model achieved a predictive power (KI) of about 80% on the training data, with a prediction confidence (KR) of 91%, which indicates how reliably it should reproduce these results on similar datasets.
Finally, the model kept 9 of the 14 predictor variables initially selected as contributing significantly to the Survived response and discarded the rest.

The most significant contributor was the sex of those aboard, with women more likely to survive than men. The next two contributors had to do with the fare they paid to get on board and the passenger class they were in. Clearly there is a direct relationship between how much you paid for a ticket and the class you were assigned to. These two indicators suggest that social class may indeed have been a factor in who was given priority for rescue during the Titanic’s fatal end.

Final Step: Apply Training Model on Test Results

We will now apply the model built from the training set to the 309 rows of data we stored in the test data set to see how well we can predict survival. For the output, we want to select the option to view the probability. For our purposes, each probability is then converted back to a predicted outcome of 0 or 1:

Less than 50% = 0
50% or Greater = 1

You can view the output inside the tool or export it to a data source. I like to export it to an Excel spreadsheet to quickly calculate my scores. The variable we are looking for is proba_rr_survived. We round it to a whole number and compare that result to the original Survived response. After some calculations in Excel for the rows I tested, I got 245 correct matches out of 309. That rate comes out to 79.3%, almost identical to our KI score above of 79.8%.
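The same spreadsheet calculation can be sketched in Python. This is a minimal illustration under stated assumptions: the accuracy function is hypothetical, and the five sample scores are made up purely to show the mechanics (they are not rows from the Titanic data). It compares each probability against the 0.5 cutoff rather than calling Python's round(), because round(0.5) returns 0 in Python while the rule above treats 50% as a 1.

```python
def accuracy(probabilities, actuals, threshold=0.5):
    """Share of rows where the thresholded probability matches the actual outcome."""
    if len(probabilities) != len(actuals):
        raise ValueError("inputs must have the same length")
    if not probabilities:
        raise ValueError("inputs must not be empty")
    correct = sum(
        1
        for p, actual in zip(probabilities, actuals)
        if (1 if p >= threshold else 0) == actual
    )
    return correct / len(probabilities)


# Made-up illustrative scores and outcomes, not real Titanic rows.
sample_probs = [0.91, 0.12, 0.50, 0.35, 0.77]
sample_actual = [1, 0, 1, 1, 0]
sample_rate = accuracy(sample_probs, sample_actual)
print(f"{sample_rate:.1%}")  # 60.0% (3 of 5 match)

# The rate reported in this post: 245 correct matches out of 309 test rows.
reported_rate = 245 / 309
print(f"{reported_rate:.1%}")  # 79.3%
```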

In conclusion, a model trained on the data gathered for roughly 1,300 of those aboard the Titanic correctly predicted the survival outcome for about 80% of the individuals we tested it on. Sex, Fare, and Class were the strongest individual predictors contributing to the model.

FYI, the actual Joseph Dawson also died on the Titanic.

To learn more about SAP Predictive Analytics, please register and join our 4-part webinar series, I Love Predictive.

Resources

Learn more about logistic regression algorithms

SAP Predictive Analytics Tutorials for Automated Analytics and Expert Analytics

Titanic Data Set