Table of Contents¶

Foreword

Snippets and results.
Source: 'Kaggle Python Tutorial on Machine Learning' from DataCamp fitted into Jupyter/IPython using the IRkernel.

Getting started with Python¶

Explore how to tackle Kaggle Titanic competition using Python and Machine Learning.

Get the Data with Pandas

When the Titanic sank, 1502 of the 2224 passengers and crew were killed. One of the main reasons for this high level of casualties was the lack of lifeboats on this self-proclaimed "unsinkable" ship.

Those that have seen the movie know that some individuals were more likely to survive the sinking (lucky Rose) than others (poor Jack). In this course, you will learn how to apply machine learning techniques to predict a passenger's chance of surviving using Python.

Let's start with loading in the training and testing set into your Python environment.

# import library pandas and numpy
import pandas as pd
import numpy as np

train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
print(train.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)
print(test.head())

   PassengerId  Pclass                                          Name     Sex  \
0          892       3                              Kelly, Mr. James    male   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   
2          894       2                     Myles, Mr. Thomas Francis    male   
3          895       3                              Wirz, Mr. Albert    male   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
0  34.5      0      0   330911   7.8292   NaN        Q  
1  47.0      1      0   363272   7.0000   NaN        S  
2  62.0      0      0   240276   9.6875   NaN        Q  
3  27.0      0      0   315154   8.6625   NaN        S  
4  22.0      1      1  3101298  12.2875   NaN        S

#Store the data
#train.to_csv('train.csv')
#test.to_csv('test.csv')

Understanding your data

Before starting with the actual analysis, it's important to understand the structure of your data. Both test and train are DataFrame objects, the way pandas represent datasets. You can easily explore a DataFrame using the .describe() method. .describe() summarizes the columns/features of the DataFrame, including the count of observations, mean, max and so on. Another useful trick is to look at the dimensions of the DataFrame. This is done by requesting the .shape attribute of your DataFrame object. (ex. your_data.shape)

The training and test set are already available in the workspace, as train and test. Apply .describe() method and print the .shape attribute of the training set. Which of the following statements is correct?

#pd
train.describe()

#pd
test.shape

(418, 11)

Rose vs Jack, or Female vs Male

How many people in your training set survived the disaster with the Titanic? To see this, you can use the value_counts() method in combination with standard bracket notation to select a single column of a DataFrame:

# absolute numbers
train['Survived'].value_counts()

# percentages
train['Survived'].value_counts(normalize = True)

If you run these commands in the console, you'll see that 549 individuals died (62%) and 342 survived (38%). A simple way to predict heuristically could be: "majority wins". This would mean that you will predict every unseen observation to not survive.

To dive in a little deeper we can perform similar counts and percentage calculations on subsets of the Survived column. For example, maybe gender could play a role as well? You can explore this using the .value_counts() method for a two-way comparison on the number of males and females that survived, with this syntax:

train['Survived'][train['Sex'] == 'male'].value_counts()
train['Survived'][train['Sex'] == 'female'].value_counts()

To get proportions, you can again pass in the argument normalize = True to the .value_counts() method.

WARNING: pay attention to the 1 value, it can be first or second!

# Passengers that survived vs passengers that passed away
print(train['Survived'].value_counts())

0    549
1    342
Name: Survived, dtype: int64

# As proportions
print(train['Survived'].value_counts(normalize = True))

0    0.616162
1    0.383838
Name: Survived, dtype: float64

# Males that survived vs males that passed away
print(train['Survived'][train['Sex'] == 'male'].value_counts())

0    468
1    109
Name: Survived, dtype: int64

print(train['Survived'][train['Sex'] == 'male'].value_counts(normalize = True))

0    0.811092
1    0.188908
Name: Survived, dtype: float64

# Females that survived vs Females that passed away
print(train['Survived'][train['Sex'] == 'female'].value_counts())

1    233
0     81
Name: Survived, dtype: int64

print(train['Survived'][train['Sex'] == 'female'].value_counts(normalize = True))

1    0.742038
0    0.257962
Name: Survived, dtype: float64

Does age play a role?

Another variable that could influence survival is age; since it's probable that children were saved first. You can test this by creating a new column with a categorical variable Child. Child will take the value 1 in cases where age is less than 18, and a value of 0 in cases where age is greater than or equal to 18.

To add this new variable you need to do two things (i) create a new column, and (ii) provide the values for each observation (i.e., row) based on the age of the passenger.

Adding a new column with Pandas in Python is easy and can be done via the following syntax:

your_data["new_var"] = 0

This code would create a new column in the train DataFrame titled new_var with 0 for each observation.

To set the values based on the age of the passenger, you make use of a boolean test inside the square bracket operator. With the []-operator you create a subset of rows and assign a value to a certain variable of that subset of observations. For example,

train["new_var"][train["Fare"] > 10] = 1

would give a value of 1 to the variable new_var for the subset of passengers whose fares greater than 10. Remember that new_var has a value of 0 for all other values (including missing values).

A new column called Child in the train data frame has been created for you that takes the value NaN for all observations.

# Create the column Child and assign to 'NaN'
train['Child'] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
train['Child'][train['Age'] >= 18] = 0
train['Child'][train['Age'] < 18] = 1
print(train['Child'].head(10))

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
5    NaN
6    0.0
7    1.0
8    0.0
9    1.0
Name: Child, dtype: float64

print(train.head(10))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
5                                   Moran, Mr. James    male   NaN      0   
6                            McCarthy, Mr. Timothy J    male  54.0      0   
7                     Palsson, Master. Gosta Leonard    male   2.0      3   
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0   
9                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1   

   Parch            Ticket     Fare Cabin Embarked  Child  
0      0         A/5 21171   7.2500   NaN        S    0.0  
1      0          PC 17599  71.2833   C85        C    0.0  
2      0  STON/O2. 3101282   7.9250   NaN        S    0.0  
3      0            113803  53.1000  C123        S    0.0  
4      0            373450   8.0500   NaN        S    0.0  
5      0            330877   8.4583   NaN        Q    NaN  
6      0             17463  51.8625   E46        S    0.0  
7      1            349909  21.0750   NaN        S    1.0  
8      2            347742  11.1333   NaN        S    0.0  
9      0            237736  30.0708   NaN        C    1.0

# Print normalized Survival Rates for passengers under 18
print(train['Survived'][train['Child'] == 1].value_counts(normalize = True))

1    0.539823
0    0.460177
Name: Survived, dtype: float64

# Print normalized Survival Rates for passengers 18 or older
print(train['Survived'][train['Child'] == 0].value_counts(normalize = True))

0    0.618968
1    0.381032
Name: Survived, dtype: float64

First Prediction

In one of the previous exercises you discovered that in your training set, females had over a 50% chance of surviving and males had less than a 50% chance of surviving. Hence, you could use this information for your first prediction: all females in the test set survive and all males in the test set die.

You use your test set for validating your predictions. You might have seen that contrary to the training set, the test set has no Survived column. You add such a column using your predicted values. Next, when uploading your results, Kaggle will use this variable (= your predictions) to score your performance.

# Create a copy of test: test_one
test_one = test

# Initialize a Survived column to 0
test_one['Survived'] = 0

# Set Survived to 1 if Sex equals "female" and print the `Survived` column from `test_one`
test_one['Survived'][test_one['Sex'] == 'female'] = 1
print(test_one.Survived)

0      0
1      1
2      0
3      0
4      1
5      0
6      1
7      0
8      1
9      0
10     0
11     0
12     1
13     0
14     1
15     1
16     0
17     0
18     1
19     1
20     0
21     0
22     1
23     0
24     1
25     0
26     1
27     0
28     0
29     0
      ..
388    0
389    0
390    0
391    1
392    0
393    0
394    0
395    1
396    0
397    1
398    0
399    0
400    1
401    0
402    1
403    0
404    0
405    0
406    0
407    0
408    1
409    1
410    1
411    1
412    1
413    0
414    1
415    0
416    0
417    0
Name: Survived, dtype: int64

print(test_one.head(5))

   PassengerId  Pclass                                          Name     Sex  \
0          892       3                              Kelly, Mr. James    male   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   
2          894       2                     Myles, Mr. Thomas Francis    male   
3          895       3                              Wirz, Mr. Albert    male   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  Survived  
0  34.5      0      0   330911   7.8292   NaN        Q         0  
1  47.0      1      0   363272   7.0000   NaN        S         1  
2  62.0      0      0   240276   9.6875   NaN        Q         0  
3  27.0      0      0   315154   8.6625   NaN        S         0  
4  22.0      1      1  3101298  12.2875   NaN        S         1

Predicting with decision trees¶

Intro to decision trees

In the previous chapter, you did all the slicing and dicing yourself to find subsets that have a higher chance of surviving. A decision tree automates this process for you and outputs a classification model or classifier.

Conceptually, the decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, you do the split and go down one level (or one node) and repeat. The final nodes at the bottom of the decision tree are known as terminal nodes, and the majority vote of the observations in that node determine how to predict for new observations that end up in that terminal node.

# Import the Numpy library
import numpy as np

# Import 'tree' from scikit-learn library
from sklearn  import tree

Cleaning and formatting your data

Before you can begin constructing your trees you need to get your hands dirty and clean the data so that you can use all the features available to you. In the first chapter, we saw that the Age variable had some missing value. Missingness is a whole subject with and in itself, but we will use a simple imputation technique where we substitute each missing value with the median of the all present values.

train["Age"] = train["Age"].fillna(train["Age"].median())

Another problem is that the Sex and Embarked variables are categorical but in a non-numeric format. Thus, we will need to assign each class a unique integer so that Python can handle the information. Embarked also has some missing values which you should impute with the most common class of embarkation, which is 'S'.

print(train.head(10))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name Sex   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris   0  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...   1  38.0      1      0   
2                             Heikkinen, Miss. Laina   1  26.0      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)   1  35.0      1      0   
4                           Allen, Mr. William Henry   0  35.0      0      0   
5                                   Moran, Mr. James   0   NaN      0      0   
6                            McCarthy, Mr. Timothy J   0  54.0      0      0   
7                     Palsson, Master. Gosta Leonard   0   2.0      3      1   
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   1  27.0      0      2   
9                Nasser, Mrs. Nicholas (Adele Achem)   1  14.0      1      0   

             Ticket     Fare Cabin  Embarked  Child  
0         A/5 21171   7.2500   NaN         0    0.0  
1          PC 17599  71.2833   C85         1    0.0  
2  STON/O2. 3101282   7.9250   NaN         0    0.0  
3            113803  53.1000  C123         0    0.0  
4            373450   8.0500   NaN         0    0.0  
5            330877   8.4583   NaN         2    NaN  
6             17463  51.8625   E46         0    0.0  
7            349909  21.0750   NaN         0    1.0  
8            347742  11.1333   NaN         0    0.0  
9            237736  30.0708   NaN         1    1.0

# Convert the male and female groups to integer form
train['Sex'][train['Sex'] == 'male'] = 0
train['Sex'][train['Sex'] == 'female'] = 1

# Impute the Embarked variable
train['Embarked'] = train['Embarked'].fillna('S')

# Convert the Embarked classes to integer form
train['Embarked'][train['Embarked'] == 'S'] = 0
train['Embarked'][train['Embarked'] == 'C'] = 1
train['Embarked'][train['Embarked'] == 'Q'] = 2

# Impute the Age variable
train['Age'] = train['Age'].fillna(train['Age'].median())

#Print the Sex and Embarked columns
print(train['Sex'].head(10))
print(train['Embarked'].head(10))

0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Sex, dtype: object
0    0
1    1
2    0
3    0
4    0
5    2
6    0
7    0
8    0
9    1
Name: Embarked, dtype: int64

Create your first decision tree

You will use the scikit-learn and numpy libraries to build your first decision tree. scikit-learn can be used to create tree objects from the DecisionTreeClassifier class. The methods that we will use take numpy arrays as inputs and therefore we will need to create those from the DataFrame that we already have. We will need the following to build a decision tree

target: A one-dimensional numpy array containing the target/response from the train data. (Survival in your case)
features: A multidimensional numpy array containing the features/predictors from the train data. (ex. Sex, Age)

Take a look at the sample code below to see what this would look like:

target = train["Survived"].values

features = train[["Sex", "Age"]].values

my_tree = tree.DecisionTreeClassifier()

my_tree = my_tree.fit(features, target)

One way to quickly see the result of your decision tree is to see the importance of the features that are included. This is done by requesting the .feature_importances_ attribute of your tree object. Another quick metric is the mean accuracy that you can compute using the .score() function with features_one and target as arguments.

Ok, time for you to build your first decision tree in Python! The train and testing data from chapter 1 are available in your workspace.

# Create the target and features numpy arrays: target, features_one
target = train['Survived'].values
features_one = train[['Pclass', 'Sex', 'Age', 'Fare']].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)

# Impute the Age variable
train['Age'] = train['Age'].fillna(train['Age'].median())

[ 0.12231561  0.31274009  0.22900122  0.33594308]

print(my_tree_one.score(features_one, target))

0.977553310887

Interpreting your decision tree

The feature_importances_ attribute make it simple to interpret the significance of the predictors you include. Based on your decision tree, what variable plays the most important role in determining whether or not a passenger survived?

Fare = 0.33594308.

Predict and submit to Kaggle

To send a submission to Kaggle you need to predict the survival rates for the observations in the test set. In the last exercise of the previous chapter, we created simple predictions based on a single subset. Luckily, with our decision tree, we can make use of some simple functions to "generate" our answer without having to manually perform subsetting.

First, you make use of the .predict() method. You provide it the model (my_tree_one), the values of features from the dataset for which predictions need to be made (test). To extract the features we will need to create a numpy array in the same way as we did when training the model. However, we need to take care of a small but important problem first. There is a missing value in the Fare feature that needs to be imputed.

Next, you need to make sure your output is in line with the submission requirements of Kaggle: a csv file with exactly 418 entries and two columns: PassengerId and Survived. Then use the code provided to make a new data frame using DataFrame(), and create a csv file using to_csv() method from Pandas.

# Convert the male and female groups to integer form
test['Sex'][test['Sex'] == 'male'] = 0
test['Sex'][test['Sex'] == 'female'] = 1

# Impute the Embarked variable
test['Embarked'] = test['Embarked'].fillna('S')

# Convert the Embarked classes to integer form
test['Embarked'][test['Embarked'] == 'S'] = 0
test['Embarked'][test['Embarked'] == 'C'] = 1
test['Embarked'][test['Embarked'] == 'Q'] = 2

# Impute the Age variable
test['Age'] = test['Age'].fillna(test['Age'].median())

# Impute the missing value with the median
test.Fare[152] = test.Fare.median()

# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[['Pclass', 'Sex', 'Age', 'Fare']].values

# Make your prediction using the test set and print them
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)

[0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0
 0 0 1 0 1 0 1 1 0 0 0 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 0
 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
 1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1
 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0
 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 0 1 1 1 1 0 0 1 0 0 0]

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId =np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ['Survived'])
print(my_solution.head(10))

     Survived
892         0
893         0
894         1
895         1
896         1
897         0
898         0
899         0
900         1
901         0

# Check that your data frame has 418 entries
print(my_solution.shape)

(418, 1)

# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv('my_solution_one.csv', index_label = ['PassengerId'])

Overfitting and how to control it

When you created your first decision tree the default arguments for max_depth and min_samples_split were set to None. This means that no limit on the depth of your tree was set. That's a good thing right? Not so fast. We are likely overfitting. This means that while your model describes the training data extremely well, it doesn't generalize to new data, which is frankly the point of prediction. Just look at the Kaggle submission results for the simple model based on Gender and the complex decision tree. Which one does better?

Maybe we can improve the overfit model by making a less complex model? In DecisionTreeRegressor, the depth of our model is defined by two parameters: - the max_depth parameter determines when the splitting up of the decision tree stops. - the min_samples_split parameter monitors the amount of observations in a bucket. If a certain threshold is not reached (e.g minimum 10 passengers) no further splitting can be done.

By limiting the complexity of your decision tree you will increase its generality and thus its usefulness for prediction!

# Create a new array with the added features: features_two
features_two = train[['Pclass', 'Age', 'Sex', 'Fare', 'SibSp', 'Parch', 'Embarked']].values

#Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5 : my_tree_two
max_depth = 10
min_samples_split = 5

my_tree_two = tree.DecisionTreeClassifier(max_depth = 10, min_samples_split = 5, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)

#Print the score of the new decison tree
print(my_tree_two.score(features_two, target))

0.905723905724

# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[['Pclass', 'Age', 'Sex', 'Fare', 'SibSp', 'Parch', 'Embarked']].values

# Make your prediction using the test set and print them
my_prediction = my_tree_two.predict(test_features)
print(my_prediction)

[0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 1 1 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0
 0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0
 1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 1 0
 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0
 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 0 1 1 1 1 0 0 1 0 0 1]

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId =np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ['Survived'])
print(my_solution.head(10))

     Survived
892         0
893         0
894         0
895         0
896         1
897         0
898         0
899         0
900         1
901         0

# Check that your data frame has 418 entries
print(my_solution.shape)

(418, 1)

# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv('my_solution_two.csv', index_label = ['PassengerId'])

Feature-engineering for our Titanic dataset

Data Science is an art that benefits from a human element. Enter feature engineering: creatively engineering your own features by combining the different existing variables.

While feature engineering is a discipline in itself, too broad to be covered here in detail, you will have a look at a simple example by creating your own new predictive attribute: family_size.

A valid assumption is that larger families need more time to get together on a sinking ship, and hence have lower probability of surviving. Family size is determined by the variables SibSp and Parch, which indicate the number of family members a certain passenger is traveling with. So when doing feature engineering, you add a new variable family_size, which is the sum of SibSp and Parch plus one (the observation itself), to the test and train set.

# Create train_two with the newly defined feature
train_two = train.copy()
train_two['family_size'] = train_two['SibSp'] + train_two['Parch'] + 1

train_two.describe()

# Create a new feature set and add the new feature
features_three = train_two[['Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Parch', 'family_size']].values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three, target)

# Print the score of this decision tree
print(my_tree_three.score(features_three, target))

0.979797979798

Etc.

Improving your Predictions through R¶

A random forest analysis in Python

A detailed study of Random Forests would take this tutorial a bit too far. However, since it's an often used machine learning technique, gaining a general understanding in Python won't hurt.

In layman's terms, the Random Forest technique handles the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees using the training set. At the time of prediction, each tree is used to come up with a prediction and every outcome is counted as a vote. For example, if you have trained 3 trees with 2 saying a passenger in the test set will survive and 1 says he will not, the passenger will be classified as a survivor. This approach of overtraining trees, but having the majority's vote count as the actual classification decision, avoids overfitting.

Building a random forest in Python looks almost the same as building a decision tree; so we can jump right to it. There are two key differences, however. Firstly, a different class is used. And second, a new argument is necessary. Also, we need to import the necessary library from scikit-learn.

Use RandomForestClassifier() class instead of the DecisionTreeClassifier() class.
n_estimators needs to be set when using the RandomForestClassifier() class. This argument allows you to set the number of trees you wish to plant and average over.

The latest training and testing data are preloaded for you.

# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# We want the Pclass, Age, Sex, Fare,SibSp, Parch, and Embarked variables
features_forest = train[['Pclass', 'Age', 'Sex', 'Fare', 'SibSp', 'Parch', 'Embarked']].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split = 2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)

# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))

0.939393939394

# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[['Pclass', 'Age', 'Sex', 'Fare', 'SibSp', 'Parch', 'Embarked']].values

pred_forest = my_forest.predict(test_features)
print(len(pred_forest))

418

Interpreting and comparing

Remember how we looked at .feature_importances_ attribute for the decision trees? Well, you can request the same attribute from your random forest as well and interpret the relevance of the included variables. You might also want to compare the models in some quick and easy way. For this, we can use the .score() method. The .score() method takes the features data and the target vector and computes mean accuracy of your model. You can apply this method to both the forest and individual trees. Remember, this measure should be high but not extreme because that would be a sign of overfitting.

For this exercise, you have my_forest and my_tree_two available to you. The features and target arrays are also ready for use.

#Request and print the `.feature_importances_` attribute
print(my_tree_two.feature_importances_)
print(my_forest.feature_importances_)

[ 0.14130255  0.17906027  0.41616727  0.17938711  0.05039699  0.01923751
  0.0144483 ]
[ 0.10384741  0.20139027  0.31989322  0.24602858  0.05272693  0.04159232
  0.03452128]

#Compute and print the mean accuracy score for both models
print(my_tree_two.score(features_two, target))
print(my_forest.score(features_forest, target))

0.905723905724
0.939393939394

Conclude and submit

Based on your finding in the previous exercise determine which feature was of most importance, and for which model. After this final exercise, you will be able to submit your random forest model to Kaggle! Use my_forest, my_tree_two, and feature_importances_ to answer the question.

The most important feature was 'Sex', but it was more significant for 'my_tree_two'.

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	NaN	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	NaN	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	NaN	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare	Embarked	Child	family_size
count	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	714.000000	891.000000
mean	446.000000	0.383838	2.308642	29.361582	0.523008	0.381594	32.204208	0.361392	0.158263	1.904602
std	257.353842	0.486592	0.836071	13.019697	1.102743	0.806057	49.693429	0.635673	0.365244	1.613459
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
25%	223.500000	0.000000	2.000000	22.000000	0.000000	0.000000	7.910400	0.000000	NaN	1.000000
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200	0.000000	NaN	1.000000
75%	668.500000	1.000000	3.000000	35.000000	1.000000	0.000000	31.000000	1.000000	NaN	2.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200	2.000000	1.000000	11.000000