Titanic Data Family Size and Chance of Survival Logistic Regression
Predicting the Survival of Titanic Passengers
In this blog post, I will go through the whole process of creating a machine learning model on the famous Titanic dataset, which is used by many people all over the world. It provides information on the fate of passengers on the Titanic, summarized according to economic status (class), sex, age and survival.
I originally wrote this post on kaggle.com, as part of the "Titanic: Machine Learning from Disaster" competition. In this challenge, we are asked to predict whether a passenger on the Titanic survived or not.
RMS Titanic
The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after it collided with an iceberg during its maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history. The RMS Titanic was the largest ship afloat at the time it entered service and was the second of three Olympic-class ocean liners operated by the White Star Line. The Titanic was built by the Harland and Wolff shipyard in Belfast. Thomas Andrews, her architect, died in the disaster.
Importing the Libraries
# linear algebra
import numpy as np
# data processing
import pandas as pd
# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style
# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
Getting the Data
test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")
Data Exploration/Analysis
train_df.info()
The training set has 891 examples and 11 features + the target variable (survived). 2 of the features are floats, 5 are integers and 5 are objects. Below I have listed the features with a short description:
survival: Survival
PassengerId: Unique Id of a passenger.
pclass: Ticket class
sex: Sex
Age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation
train_df.describe()
Above we can see that 38% of the training set survived the Titanic. We can also see that the passenger ages range from 0.4 to 80. On top of that, we can already detect some features that contain missing values, like the 'Age' feature.
train_df.head(8)
From the table above, we can note a few things. First of all, we need to convert a lot of features into numeric ones later on, so that the machine learning algorithms can process them. Furthermore, we can see that the features have widely different ranges, which we will need to convert into roughly the same scale. We can also spot some more features that contain missing values (NaN = not a number), which we need to deal with.
Let's take a more detailed look at what data is actually missing:
total = train_df.isnull().sum().sort_values(ascending=False)
percent_1 = train_df.isnull().sum()/train_df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(5)
The Embarked feature has only 2 missing values, which can easily be filled. It will be much more tricky to deal with the 'Age' feature, which has 177 missing values. The 'Cabin' feature needs further investigation, but it looks like we might want to drop it from the dataset, since 77% of it is missing.
train_df.columns.values
Above you can see the 11 features + the target variable (survived). What features could contribute to a high survival rate?
To me it would make sense if everything except 'PassengerId', 'Ticket' and 'Name' were correlated with a high survival rate.
1. Age and Sex:
survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
women = train_df[train_df['Sex']=='female']
men = train_df[train_df['Sex']=='male']
ax = sns.distplot(women[women['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[0], kde = False)
ax = sns.distplot(women[women['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[0], kde = False)
ax.legend()
ax.set_title('Female')
ax = sns.distplot(men[men['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[1], kde = False)
ax = sns.distplot(men[men['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[1], kde = False)
ax.legend()
_ = ax.set_title('Male')
You can see that men have a high probability of survival when they are between 18 and 30 years old, which is also somewhat true for women, but not fully. For women the survival chances are higher between 14 and 40.
For men the probability of survival is very low between the age of 5 and 18, but that isn't true for women. Another thing to note is that infants also have a slightly higher probability of survival.
Since there seem to be certain ages which have increased odds of survival, and because I want every feature to be roughly on the same scale, I will create age groups later on.
3. Embarked, Pclass and Sex:
FacetGrid = sns.FacetGrid(train_df, row='Embarked', size=4.5, aspect=1.6)
FacetGrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette=None, order=None, hue_order=None )
FacetGrid.add_legend()
Embarked seems to be correlated with survival, depending on the gender.
Women on port Q and on port S have a higher chance of survival. The inverse is true if they embarked at port C. Men have a high survival probability if they are on port C, but a low probability if they are on port Q or S.
Pclass also seems to be correlated with survival. We will generate another plot of it below.
4. Pclass:
sns.barplot(x='Pclass', y='Survived', data=train_df)
Here we see clearly that Pclass is contributing to a person's chance of survival, especially if this person is in class 1. We will create another Pclass plot below.
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();
The plot above confirms our assumption about Pclass 1, but we can also spot a high probability that a person in Pclass 3 will not survive.
5. SibSp and Parch:
SibSp and Parch would make more sense as a combined feature that shows the total number of relatives a person has on the Titanic. I will create it below, and also a feature that shows if someone is not alone.
data = [train_df, test_df]
for dataset in data:
    dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
    dataset.loc[dataset['relatives'] > 0, 'not_alone'] = 0
    dataset.loc[dataset['relatives'] == 0, 'not_alone'] = 1
    dataset['not_alone'] = dataset['not_alone'].astype(int)
train_df['not_alone'].value_counts()
axes = sns.factorplot('relatives', 'Survived', data=train_df, aspect=2.5)
Here we can see that you had a high probability of survival with 1 to 3 relatives, but a lower one if you had less than 1 or more than 3 (except for some cases with 6 relatives).
Data Preprocessing
First, I will drop 'PassengerId' from the train set, because it does not contribute to a person's survival probability. I will not drop it from the test set, since it is required there for the submission.
train_df = train_df.drop(['PassengerId'], axis=1)
Missing Data:
Cabin:
As a reminder, we have to deal with Cabin (687), Embarked (2) and Age (177). First I thought we would have to delete the 'Cabin' variable, but then I found something interesting. A cabin number looks like 'C123' and the letter refers to the deck. Therefore we're going to extract these and create a new feature that contains a person's deck. Afterwards we will convert the feature into a numeric variable. The missing values will be converted to zero. In the picture below you can see the actual decks of the Titanic, ranging from A to G.
import re
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "East": 5, "F": 6, "Yard": vii, "U": viii}
information = [train_df, test_df]for dataset in data:
# we tin can now driblet the cabin characteristic
dataset['Cabin'] = dataset['Cabin'].fillna("U0")
dataset['Deck'] = dataset['Motel'].map(lambda x: re.compile("([a-zA-Z]+)").search(10).group())
dataset['Deck'] = dataset['Deck'].map(deck)
dataset['Deck'] = dataset['Deck'].fillna(0)
dataset['Deck'] = dataset['Deck'].astype(int)
train_df = train_df.drib(['Motel'], axis=1)
test_df = test_df.drop(['Motel'], axis=1)
Age:
Now we can tackle the issue with the Age feature's missing values. I will create an array that contains random numbers, which are computed based on the mean age value with regard to the standard deviation and is_null.
data = [train_df, test_df]
for dataset in data:
    mean = train_df["Age"].mean()
    std = test_df["Age"].std()
    is_null = dataset["Age"].isnull().sum()
    # compute random numbers between mean - std and mean + std
    rand_age = np.random.randint(mean - std, mean + std, size = is_null)
    # fill NaN values in the Age column with the random values generated
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    dataset["Age"] = age_slice
    dataset["Age"] = dataset["Age"].astype(int)
train_df["Age"].isnull().sum()
Embarked:
Since the Embarked feature has only 2 missing values, we will just fill these with the most common one.
train_df['Embarked'].describe()
common_value = 'S'
data = [train_df, test_df]
for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna(common_value)
Converting Features:
train_df.info()
Above you can see that 'Fare' is a float and we have to deal with four categorical features: Name, Sex, Ticket and Embarked. Let's investigate and transform one after another.
Fare:
Converting "Fare" from bladder to int64, using the "astype()" role pandas provides:
data = [train_df, test_df]
for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)
Name:
We will use the Name feature to extract the titles from the Name, so that we can build a new feature out of that.
data = [train_df, test_df]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in data:
    # extract titles
    dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
    # replace rare titles with a more common title or as Rare
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    # convert titles into numbers
    dataset['Title'] = dataset['Title'].map(titles)
    # filling NaN with 0, to be safe
    dataset['Title'] = dataset['Title'].fillna(0)
train_df = train_df.drop(['Name'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
Sex:
Convert the 'Sex' feature into numeric.
genders = {"male": 0, "female": 1}
data = [train_df, test_df]
for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)
Ticket:
train_df['Ticket'].describe()
Since the Ticket attribute has 681 unique tickets, it will be a bit tricky to convert them into useful categories. So we will drop it from the dataset.
train_df = train_df.drop(['Ticket'], axis=1)
test_df = test_df.drop(['Ticket'], axis=1)
Embarked:
Convert the 'Embarked' feature into numeric.
ports = {"S": 0, "C": 1, "Q": 2}
data = [train_df, test_df]
for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].map(ports)
Creating Categories:
We will now create categories within the following features:
Age:
Now we need to convert the 'Age' feature. First we will convert it from float into integer. Then we will create the new 'AgeGroup' variable, by categorizing every age into a group. Note that it is important to pay attention to how you form these groups, since you don't want, for example, 80% of your data to fall into group 1.
data = [train_df, test_df]
for dataset in data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[ dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 66), 'Age'] = 6
    dataset.loc[ dataset['Age'] > 66, 'Age'] = 6
# let's see how it's distributed
train_df['Age'].value_counts()
Fare:
For the 'Fare' feature, we need to do the same as with the 'Age' feature. But it isn't that easy, because if we cut the range of the fare values into a few equally large categories, 80% of the values would fall into the first one. Fortunately, we can use the pandas "qcut()" function to see how we can form the categories.
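As an aside (my addition, not shown in the original post), a quick way to inspect equal-sized fare bins is pandas' qcut; the specific edges used below (7.91, 14.454, 31, ...) presumably come from a call along these lines, where the choice of 6 quantiles and the variable name raw_train are assumptions:
import pandas as pd

# re-load the raw data so the original (float) fares are used for the quantile split
raw_train = pd.read_csv("train.csv")
fare_bins = pd.qcut(raw_train['Fare'], q=6)
print(fare_bins.value_counts().sort_index())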
train_df.head(10)
data = [train_df, test_df]
for dataset in data:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[(dataset['Fare'] > 31) & (dataset['Fare'] <= 99), 'Fare'] = 3
    dataset.loc[(dataset['Fare'] > 99) & (dataset['Fare'] <= 250), 'Fare'] = 4
    dataset.loc[ dataset['Fare'] > 250, 'Fare'] = 5
    dataset['Fare'] = dataset['Fare'].astype(int)
Creating new Features
I will add two new features to the dataset, which I compute out of other features.
1. Age times Class
data = [train_df, test_df]
for dataset in data:
    dataset['Age_Class'] = dataset['Age'] * dataset['Pclass']
2. Fare per Person
for dataset in data:
    dataset['Fare_Per_Person'] = dataset['Fare']/(dataset['relatives']+1)
    dataset['Fare_Per_Person'] = dataset['Fare_Per_Person'].astype(int)
# Let's take a last look at the training set, before we start training the models.
train_df.head(10)
Building Machine Learning Models
Now we will train several machine learning models and compare their results. Note that because the dataset does not provide labels for the test set, we need to use the predictions on the training set to compare the algorithms with each other. Later on, we will use cross validation.
X_train = train_df.drop("Survived", centrality=i)
Y_train = train_df["Survived"]
X_test = test_df.driblet("PassengerId", axis=one).copy()
Stochastic Gradient Descent (SGD):
sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
sgd.score(X_train, Y_train)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
Random Forest:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
Logistic Regression:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
K Nearest Neighbor:
# KNN
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
Gaussian Naive Bayes:
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
Perceptron:
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
Linear Support Vector Machine:
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
Which is the best Model ?
results = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
'Random Forest', 'Naive Bayes', 'Perceptron',
'Stochastic Gradient Descent',
'Decision Tree'],
'Score': [acc_linear_svc, acc_knn, acc_log,
acc_random_forest, acc_gaussian, acc_perceptron,
acc_sgd, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(9)
As we can see, the Random Forest classifier takes first place. But first, let us check how random forest performs when we use cross validation.
K-Fold Cross Validation:
K-Fold Cross Validation randomly splits the training data into K subsets called folds. Let's imagine we would split our data into 4 folds (K = 4). Our random forest model would be trained and evaluated 4 times, using a different fold for evaluation every time, while it would be trained on the remaining 3 folds.
With 4 folds (K = 4), every iteration is one training + evaluation step. In the first iteration, the model gets trained on the first, second and third subset and evaluated on the fourth. In the second iteration, the model gets trained on the second, third and fourth subset and evaluated on the first. K-Fold Cross Validation repeats this process until every fold has acted once as the evaluation fold.
The result of our K-Fold Cross Validation example would be an array that contains 4 different scores. We then need to compute the mean and the standard deviation of these scores.
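To make the 4-fold procedure described above concrete, here is a minimal sketch (my addition, not part of the original notebook) that performs the split manually with scikit-learn's KFold; the variable name fold_scores is purely illustrative:
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

kf = KFold(n_splits=4, shuffle=True, random_state=1)
fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    # train on 3 folds, evaluate on the held-out fold
    model = RandomForestClassifier(n_estimators=100, random_state=1)
    model.fit(X_train.iloc[train_idx], Y_train.iloc[train_idx])
    fold_scores.append(model.score(X_train.iloc[val_idx], Y_train.iloc[val_idx]))
print(fold_scores)  # 4 accuracy scores, one per fold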
The code below performs K-Fold Cross Validation on our random forest model, using 10 folds (K = 10). Therefore it outputs an array with 10 different scores.
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X_train, Y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())
This looks much more realistic than before. Our model has an average accuracy of 82% with a standard deviation of 4%. The standard deviation shows us how precise the estimates are.
This means in our case that the accuracy of our model can differ +/- 4%.
I think the accuracy is still really good, and since random forest is an easy to use model, we will try to increase its performance even further in the following section.
Random Forest
What is Random Forest ?
Random Forest is a supervised learning algorithm. As you can already see from its name, it creates a forest and makes it somehow random. The "forest" it builds is an ensemble of Decision Trees, most of the time trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models increases the overall result.
To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. With a few exceptions, a random forest classifier has all the hyperparameters of a decision tree classifier and also all the hyperparameters of a bagging classifier, to control the ensemble itself.
The random forest algorithm brings extra randomness into the model when it is growing the trees. Instead of searching for the best feature while splitting a node, it searches for the best feature among a random subset of features. This process creates a wide diversity, which generally results in a better model. Therefore, when you are growing a tree in a random forest, only a random subset of the features is considered for splitting a node. You can even make trees more random by additionally using random thresholds for each feature rather than searching for the best possible thresholds (like a normal decision tree does).
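As a small illustration (my addition, not from the original post), the feature subsampling described above is controlled in scikit-learn by the max_features parameter, and the "random thresholds" variant corresponds to the Extra-Trees ensemble:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# each split considers only a random subset of the features (here: sqrt of the feature count)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
# Extra-Trees additionally uses random split thresholds instead of searching for the best ones
et = ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state=1)

rf.fit(X_train, Y_train)
et.fit(X_train, Y_train)
print(rf.score(X_train, Y_train), et.score(X_train, Y_train))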
Below you can see how a random forest built from two trees could look:
Feature Importance
Another great quality of random forest is that it makes it very easy to measure the relative importance of each feature. Sklearn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1. We will access this below:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances.head(15)
importances.plot.bar()
Conclusion:
not_alone and Parch don't play a significant role in our random forest classifier's prediction process. Because of that, I will drop them from the dataset and train the classifier again. We could also remove more or fewer features, but this would need a more detailed investigation of the features' effect on our model. But I think it's fine to remove only not_alone and Parch.
train_df = train_df.drop("not_alone", axis=one)
test_df = test_df.drop("not_alone", centrality=one)train_df = train_df.drop("Parch", axis=1)
test_df = test_df.drop("Parch", axis=1)
Training the random forest again:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100, oob_score = True)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(round(acc_random_forest, 2), "%")
92.82%
Our random forest model predicts as well as it did before. A general rule is that the more features you have, the more likely your model will suffer from overfitting, and vice versa. But I think our data looks fine for now and doesn't have too many features.
There is also another way to evaluate a random forest classifier, which is probably much more accurate than the score we used before. What I am talking about is the out-of-bag samples to estimate the generalization accuracy. I will not go into details here about how it works. Just note that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
print("oob score:", round(random_forest.oob_score_, four)*100, "%")
oob score: 81.82 %
Now we can start tuning the hyperparameters of random forest.
Hyperparameter Tuning
Below you can see the code for the hyperparameter tuning of the parameters criterion, min_samples_leaf, min_samples_split and n_estimators.
I put this code into a markdown cell and not into a code cell, because it takes a long time to run. Directly underneath it, I put a screenshot of the gridsearch's output.
param_grid = { "criterion" : ["gini", "entropy"], "min_samples_leaf" : [1, v, 10, 25, 50, 70], "min_samples_split" : [2, iv, ten, 12, xvi, 18, 25, 35], "n_estimators": [100, 400, 700, 1000, 1500]} from sklearn.model_selection import GridSearchCV, cross_val_score rf = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1, n_jobs=-1) clf = GridSearchCV(calculator=rf, param_grid=param_grid, n_jobs=-ane) clf.fit(X_train, Y_train) clf.bestparams
Test new Parameters:
# Random Forest
random_forest = RandomForestClassifier(criterion = "gini",
min_samples_leaf = 1,
min_samples_split = 10,
n_estimators=100,
max_features='auto',
oob_score=True,
random_state=1,
n_jobs=-1)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
impress("oob score:", round(random_forest.oob_score_, iv)*100, "%")
oob score: 83.05 %
Now that we have a proper model, we can start evaluating its performance in a more accurate way. Previously we only used accuracy and the oob score, which is just another form of accuracy. The problem is just that it's more complicated to evaluate a classification model than a regression model. We will talk about this in the following section.
Further Evaluation
Confusion Matrix:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
confusion_matrix(Y_train, predictions)
The first row is about the not-survived predictions: 493 passengers were correctly classified as not survived (called true negatives) and 56 were wrongly classified as survived (false positives).
The second row is about the survived predictions: 93 passengers were wrongly classified as not survived (false negatives) and 249 were correctly classified as survived (true positives).
A confusion matrix gives you a lot of information about how well your model does, but there's a way to get even more, like computing the classifier's precision.
Precision and Recall:
from sklearn.metrics import precision_score, recall_score
print("Precision:", precision_score(Y_train, predictions))
print("Recall:", recall_score(Y_train, predictions))
Precision: 0.801948051948
Recall: 0.722222222222
Our model predicts a passenger's survival correctly 81% of the time (precision). The recall tells us that it predicted the survival of 73% of the people who actually survived.
F-Score
You can combine precision and recall into one score, which is called the F-score. The F-score is computed with the harmonic mean of precision and recall. Note that it assigns much more weight to low values. As a result of that, the classifier will only get a high F-score if both recall and precision are high.
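As a quick illustration (my addition), the harmonic mean behind the F-score can be computed by hand from the precision and recall values above; it should match sklearn's f1_score below:
from sklearn.metrics import precision_score, recall_score

precision = precision_score(Y_train, predictions)
recall = recall_score(Y_train, predictions)
# harmonic mean of precision and recall
f_score = 2 * (precision * recall) / (precision + recall)
print(f_score)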
from sklearn.metrics import f1_score
f1_score(Y_train, predictions)
0.7599999999999
There we have it, a 77% F-score. The score is not that high, because we have a recall of 73%. But unfortunately the F-score is not perfect, because it favors classifiers that have a similar precision and recall. This is a problem, because you sometimes want a high precision and sometimes a high recall. The thing is that an increasing precision sometimes results in a decreasing recall and vice versa (depending on the threshold). This is called the precision/recall tradeoff. We will discuss this in the following section.
Precision Recall Curve
For each person the Random Forest algorithm has to classify, it computes a probability based on a function, and it classifies the person as survived (when the score is bigger than the threshold) or as not survived (when the score is smaller than the threshold). That's why the threshold plays an important part.
We will plot the precision and recall against the threshold using matplotlib:
from sklearn.metrics import precision_recall_curve

# getting the probabilities of our predictions
y_scores = random_forest.predict_proba(X_train)
y_scores = y_scores[:,1]

precision, recall, threshold = precision_recall_curve(Y_train, y_scores)

def plot_precision_and_recall(precision, recall, threshold):
    plt.plot(threshold, precision[:-1], "r-", label="precision", linewidth=5)
    plt.plot(threshold, recall[:-1], "b", label="recall", linewidth=5)
    plt.xlabel("threshold", fontsize=19)
    plt.legend(loc="upper right", fontsize=19)
    plt.ylim([0, 1])

plt.figure(figsize=(14, 7))
plot_precision_and_recall(precision, recall, threshold)
plt.show()
Above you can clearly see that the recall is falling off rapidly at a precision of around 85%. Because of that, you may want to select the precision/recall tradeoff before that point, maybe at around 75%.
You are now able to choose a threshold that gives you the best precision/recall tradeoff for your current machine learning problem. If you want, for example, a precision of 80%, you can easily look at the plot and see that you would need a threshold of around 0.4. You could then apply exactly that threshold to the model's predicted probabilities and would get the desired precision.
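Here is a minimal sketch (my addition) of applying such a custom decision threshold to the predicted probabilities; the value 0.4 is just the example threshold read off the plot above:
threshold = 0.4  # example value read off the precision/recall plot

# classify as survived whenever the predicted probability exceeds the threshold
y_scores = random_forest.predict_proba(X_train)[:, 1]
custom_predictions = (y_scores >= threshold).astype(int)

from sklearn.metrics import precision_score, recall_score
print("Precision:", precision_score(Y_train, custom_predictions))
print("Recall:", recall_score(Y_train, custom_predictions))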
Another way is to plot the precision and recall against each other:
def plot_precision_vs_recall(precision, recall):
    plt.plot(recall, precision, "g--", linewidth=2.5)
    plt.ylabel("precision", fontsize=19)
    plt.xlabel("recall", fontsize=19)
    plt.axis([0, 1.5, 0, 1.5])

plt.figure(figsize=(14, 7))
plot_precision_vs_recall(precision, recall)
plt.show()
ROC AUC Curve
Another way to evaluate and compare your binary classifier is provided by the ROC AUC curve. This curve plots the true positive rate (also called recall) against the false positive rate (the ratio of incorrectly classified negative instances), instead of plotting precision versus recall.
from sklearn.metrics import roc_curve
# compute true positive rate and false positive rate
false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_train, y_scores)

# plotting them against each other
def plot_roc_curve(false_positive_rate, true_positive_rate, label=None):
    plt.plot(false_positive_rate, true_positive_rate, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'r', linewidth=4)
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (FPR)', fontsize=16)
    plt.ylabel('True Positive Rate (TPR)', fontsize=16)

plt.figure(figsize=(14, 7))
plot_roc_curve(false_positive_rate, true_positive_rate)
plt.show()
The red line in the middle represents a purely random classifier (e.g. a coin flip), and therefore your classifier should be as far away from it as possible. Our Random Forest model seems to do a good job.
Of course we also have a tradeoff here, because the classifier produces more false positives the higher the true positive rate is.
ROC AUC Score
The ROC AUC score is the corresponding score to the ROC AUC curve. It is simply computed by measuring the area under the curve, which is called AUC.
A classifier that is 100% correct would have a ROC AUC score of 1, and a completely random classifier would have a score of 0.5.
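To connect the score to the curve (a small addition of mine, not from the original post), the area can also be computed directly from the fpr/tpr arrays plotted above with sklearn's general-purpose auc helper; it should closely match roc_auc_score below:
from sklearn.metrics import auc

# area under the ROC curve, computed from the arrays returned by roc_curve
roc_area = auc(false_positive_rate, true_positive_rate)
print("AUC from the curve:", roc_area)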
from sklearn.metrics import roc_auc_score
r_a_score = roc_auc_score(Y_train, y_scores)
impress("ROC-AUC-Score:", r_a_score)
ROC_AUC_SCORE: 0.945067587
Nice! I think that score is good enough to submit the predictions for the test set to the Kaggle leaderboard.
Summary
We started with data exploration, where we got a feeling for the dataset, checked for missing data and learned which features are important. During this process we used seaborn and matplotlib to do the visualizations. During the data preprocessing part, we computed missing values, converted features into numeric ones, grouped values into categories and created a few new features. Afterwards we started training 8 different machine learning models, picked one of them (random forest) and applied cross validation on it. Then we discussed how random forest works, took a look at the importance it assigns to the different features and tuned its performance through optimizing its hyperparameter values. Lastly, we looked at its confusion matrix and computed the model's precision, recall and F-score.
Below you can see a before and after picture of the "train_df" dataframe:
Of course there is still room for improvement, like doing more extensive feature engineering by comparing and plotting the features against each other and identifying and removing noisy features. Another thing that could improve the overall result on the Kaggle leaderboard would be more extensive hyperparameter tuning on several machine learning models. You could also do some ensemble learning.
Source: https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8