
2 years, 9 months ago
I want to share my experience with a well-known machine learning competition on Kaggle. It is positioned as a competition for beginners, and I had no practical experience in this area at all. I knew a little theory, but had hardly ever dealt with real data and had not worked closely with Python. In the end, after spending a couple of evenings around New Year's Eve, I scored 0.80383 (the top quarter of the leaderboard).

Kaggle and Titanic: one more solution to the problem using Python

In short, this is an article for those who are just starting out, from someone who has already started.


We put on some music suitable for work and begin our research.

The Titanic competition has already been covered on Habr more than once. I would especially like to point out the last article from the list: it turns out that exploring data can be no less intriguing than a good detective novel, and that an excellent result (0.81340) can be obtained with something other than a Random Forest classifier.

I would also like to mention an article about another competition. It shows how a researcher's mind has to work, and that most of the time has to be spent on preliminary analysis and data processing.


To solve the problem I use the Python technology stack. This approach is not the only one: there are R, Matlab, Mathematica, Azure Machine Learning, Apache Weka, Java-ML, and I think the list could go on for a long time. Python has a number of advantages: there really are a lot of libraries, they are of excellent quality, and since most of them are wrappers over C code they are also quite fast. Besides, the resulting model can easily be put into production.

I have to admit that I am not a great fan of loosely typed scripting languages, but the wealth of libraries for Python makes it impossible to ignore.

We will run everything under Linux (Ubuntu 14.04). You will need: python 2.7, seaborn, matplotlib, sklearn, xgboost, pandas. Strictly speaking only pandas and sklearn are required; the rest are needed for illustration.

Under Linux, Python libraries can be installed in two ways: with the system package manager (deb) or with Python's own pip utility.

Installing from deb packages is simpler and faster, but the libraries there are often outdated (stability above all).

# Will be installed into /usr/lib/python2.7/dist-packages/
$ sudo apt-get install python-matplotlib

Installing packages through pip takes longer (they will be compiled), but you can expect to get fresh versions.

# Will be installed into /usr/local/lib/python2.7/dist-packages/
$ sudo pip install matplotlib

So which way is better? I use a compromise: NumPy and SciPy, which are massive and pull in a demanding set of build dependencies, I install from deb packages.

$ sudo apt-get install python 
$ sudo apt-get install python-pip
$ sudo apt-get install python-numpy
$ sudo apt-get install python-scipy
$ sudo apt-get install ipython

The other, lighter packages I install through pip.

$ sudo pip install pandas 
$ sudo pip install matplotlib==1.4.3
$ sudo pip install scikit-image
$ sudo pip install scikit-learn
$ sudo pip install seaborn
$ sudo pip install statsmodels
$ sudo pip install xgboost

If I forgot something, any missing packages are usually easy to identify and install in the same way.

Users of other platforms need to take similar steps to install the packages. But there is a much simpler option: precompiled distributions that bundle Python with almost all the necessary libraries already exist. I have not tried them, but at first glance they look promising.


Let's download the source data for the problem and see what we have been given.

$ wc -l train.csv test.csv 
  892 train.csv
  419 test.csv
 1311 total

Frankly, there is not much data: only 891 passengers in the training set and 418 in the test set (one line in each file is the header with the list of fields).

Let's open train.csv in any spreadsheet program (I use LibreOffice Calc) to take a visual look at the data.

$ libreoffice --calc train.csv

We see the following:
  • Age is not filled in for everyone
  • Tickets have a strange and inconsistent format
  • Names contain a title (Ms., Mr., Mrs., etc.)
  • Cabin numbers are recorded for only a few passengers (there is a chilling story as to why)
  • The existing cabin numbers apparently include the deck code (as it turned out, they do)
  • Also, according to the article, the cabin number encodes the side of the ship
  • Sort by name. It is visible that many people traveled as families, and the scale of the tragedy becomes visible: families were often separated, and only some members survived
  • Sort by ticket. It is visible that several people often traveled on a single ticket code, frequently with different surnames. A quick glance suggests that people with the same ticket number often shared the same fate.
  • For some passengers the port of embarkation is not recorded

Everything seems more or less clear, so let's move on directly to working with the data.

Data loading

To avoid cluttering the code that follows, here are all the imports used, in one place:
Script header
# coding=utf8

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
import re
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.cross_validation import StratifiedKFold
from sklearn.cross_validation import KFold
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

pd.set_option('display.width', 256)

Probably most of my motivation for writing this article comes from the delight of working with the pandas package. I knew this technology existed, but could not even imagine how pleasant it is to work with. Pandas is Excel on the command line, with convenient functionality for input/output and processing of tabular data.

We load both datasets.

train_data = pd.read_csv("data/train.csv")
test_data = pd.read_csv("data/test.csv")

We combine both datasets (training and test) into one combined dataset.

all_data = pd.concat([train_data, test_data])

Why do this, given that the test set has no field with the resulting survival flag? The full sample is useful for computing statistics on all the other fields (means, medians, quantiles, minima and maxima), as well as the relationships between those fields. That is, by computing statistics on the training set only, we effectively ignore part of the information that is very useful to us.

Data analysis

Data analysis in Python can be done in several ways, for example:

  • Prepare the data manually and plot it with matplotlib
  • Use the ready-made tools in seaborn
  • Use text output with grouping in pandas

We will try all three, but start with the simplest one, text output. Let's print survival statistics by class and sex.

print("===== survived by class and sex")
print(train_data.groupby(["Pclass", "Sex"])["Survived"].value_counts(normalize=True))

===== survived by class and sex
Pclass  Sex     Survived
1       female  1           0.968085
                0           0.031915
        male    0           0.631148
                1           0.368852
2       female  1           0.921053
                0           0.078947
        male    0           0.842593
                1           0.157407
3       female  0           0.500000
                1           0.500000
        male    0           0.864553
                1           0.135447
dtype: float64

We see that women really were put into the lifeboats first: a woman's chance of survival is 96.8%, 92.1% and 50% depending on ticket class. A man's chance of survival is much lower: 36.9%, 15.7% and 13.5% respectively.

With pandas we can quickly compute a summary of all numeric fields in both datasets, separately for men and for women.
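
Since Survived is a 0/1 flag, the same rates can also be read off as a group mean. A minimal sketch on a made-up six-row frame (the `toy` data is invented for illustration and is not taken from train.csv):

```python
import pandas as pd

# Toy data, invented for illustration; the real frame comes from train.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

# unstack() moves the inner index level (Sex) into columns for easier reading.
survival_table = survival_rate.unstack()
print(survival_table)
```

This gives one number per group instead of two complementary rows, which is easier to scan when there are many groups.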

describe_fields = ["Age", "Fare", "Pclass", "SibSp", "Parch"]

print("===== train: males")
print(train_data[train_data["Sex"] == "male"][describe_fields].describe())

print("===== test: males")
print(test_data[test_data["Sex"] == "male"][describe_fields].describe())

print("===== train: females")
print(train_data[train_data["Sex"] == "female"][describe_fields].describe())

print("===== test: females")
print(test_data[test_data["Sex"] == "female"][describe_fields].describe())

===== train: males
              Age        Fare      Pclass       SibSp       Parch
count  453.000000  577.000000  577.000000  577.000000  577.000000
mean    30.726645   25.523893    2.389948    0.429809    0.235702
std     14.678201   43.138263    0.813580    1.061811    0.612294
min      0.420000    0.000000    1.000000    0.000000    0.000000
25%     21.000000    7.895800    2.000000    0.000000    0.000000
50%     29.000000   10.500000    3.000000    0.000000    0.000000
75%     39.000000   26.550000    3.000000    0.000000    0.000000
max     80.000000  512.329200    3.000000    8.000000    5.000000
===== test: males
              Age        Fare      Pclass       SibSp       Parch
count  205.000000  265.000000  266.000000  266.000000  266.000000
mean    30.272732   27.527877    2.334586    0.379699    0.274436
std     13.389528   41.079423    0.808497    0.843735    0.883745
min      0.330000    0.000000    1.000000    0.000000    0.000000
25%     22.000000    7.854200    2.000000    0.000000    0.000000
50%     27.000000   13.000000    3.000000    0.000000    0.000000
75%     40.000000   26.550000    3.000000    1.000000    0.000000
max     67.000000  262.375000    3.000000    8.000000    9.000000
===== train: females
              Age        Fare      Pclass       SibSp       Parch
count  261.000000  314.000000  314.000000  314.000000  314.000000
mean    27.915709   44.479818    2.159236    0.694268    0.649682
std     14.110146   57.997698    0.857290    1.156520    1.022846
min      0.750000    6.750000    1.000000    0.000000    0.000000
25%     18.000000   12.071875    1.000000    0.000000    0.000000
50%     27.000000   23.000000    2.000000    0.000000    0.000000
75%     37.000000   55.000000    3.000000    1.000000    1.000000
max     63.000000  512.329200    3.000000    8.000000    6.000000
===== test: females
              Age        Fare      Pclass       SibSp       Parch
count  127.000000  152.000000  152.000000  152.000000  152.000000
mean    30.272362   49.747699    2.144737    0.565789    0.598684
std     15.428613   73.108716    0.887051    0.974313    1.105434
min      0.170000    6.950000    1.000000    0.000000    0.000000
25%     20.500000    8.626050    1.000000    0.000000    0.000000
50%     27.000000   21.512500    2.000000    0.000000    0.000000
75%     38.500000   55.441700    3.000000    1.000000    1.000000
max     76.000000  512.329200    3.000000    8.000000    9.000000

On means and percentiles everything is, on the whole, in line. But for men the maxima of age and ticket price differ between the datasets. For women the maximum age also differs between the two datasets.

Building a data digest

Let's build a small digest over the full sample; we will need it for the subsequent transformation of the datasets. In particular, we need the values that will be substituted for missing ones, as well as various lookup tables for converting text values into numeric ones. The point is that many classifiers can only work with numbers, so one way or another we have to convert categorical features into numeric ones, and regardless of the conversion method we will need lookup tables of these values.

class DataDigest:

    def __init__(self):
        self.ages = None
        self.fares = None
        self.titles = None
        self.cabins = None
        self.families = None
        self.tickets = None

def get_title(name):
    if pd.isnull(name):
        return "Null"

    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1).lower()
    else:
        return "None"

def get_family(row):
    last_name = row["Name"].split(",")[0]
    if last_name:
        family_size = 1 + row["Parch"] + row["SibSp"]
        if family_size > 3:
            return "{0}_{1}".format(last_name.lower(), family_size)
        else:
            return "nofamily"
    else:
        return "unknown"

data_digest = DataDigest()
data_digest.ages = all_data.groupby("Sex")["Age"].median()
data_digest.fares = all_data.groupby("Pclass")["Fare"].median()
data_digest.titles = pd.Index(test_data["Name"].apply(get_title).unique())
data_digest.families = pd.Index(test_data.apply(get_family, axis=1).unique())
data_digest.cabins = pd.Index(test_data["Cabin"].fillna("unknown").unique())
data_digest.tickets = pd.Index(test_data["Ticket"].fillna("unknown").unique())

A short explanation of the digest fields:
  • ages: lookup table of median age by sex;
  • fares: lookup table of median ticket price by ticket class;
  • titles: lookup table of titles;
  • families: lookup table of family identifiers (surname + number of family members);
  • cabins: lookup table of cabin identifiers;
  • tickets: lookup table of ticket identifiers.
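
As a rough illustration of how a median lookup such as ages can be applied, here is a minimal sketch. The `toy_train` and `toy_test` frames and their values are invented for the example, and the vectorized `fillna` is just one possible way to do the substitution:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test sets (values invented for illustration).
toy_train = pd.DataFrame({"Sex": ["male", "female", "male"],
                          "Age": [20.0, 30.0, np.nan]})
toy_test = pd.DataFrame({"Sex": ["male", "female", "female"],
                         "Age": [40.0, np.nan, 50.0]})

# Medians are computed on the combined sample, as in the digest.
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)
age_medians = toy_all.groupby("Sex")["Age"].median()

# Map each row's sex to its group median and use it only where Age is missing.
# The original Age column is left untouched; the result goes into a new column.
toy_train["AgeF"] = toy_train["Age"].fillna(toy_train["Sex"].map(age_medians))
```

Here the male median is taken over ages from both toy frames, so the missing male age in `toy_train` is filled using information from `toy_test` as well.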

We build the lookup tables for restoring missing data (medians) on the combined dataset. The lookup tables for converting categorical features, however, are built from the test data only. The idea was the following: suppose the training set contains the surname "Ivanov" and the test set does not. The classifier's knowledge that "Ivanov" survived (or did not) will not help at all in evaluating the test set, since that surname is not there anyway. So we add to the lookup table only the surnames that occur in the test set. An even more correct method would be to add only the intersection of the values (only those present in both sets). I tried that, but the validation result got worse by 3 percent.
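
The lookup idea can be sketched with a plain `pd.Index`. The surnames below are hypothetical, and `get_indexer` is used here only as a compact way to show the "position or -1" behavior:

```python
import pandas as pd

# Hypothetical surnames, invented for illustration.
test_surnames = pd.Series(["smith", "brown", "smith"])
train_surnames = pd.Series(["ivanov", "brown"])

# The lookup table is built from the test set only.
lookup = pd.Index(test_surnames.unique())

# get_indexer returns the position in the lookup table, or -1 when absent,
# so a surname seen only in the training set ("ivanov") collapses to -1.
codes = lookup.get_indexer(train_surnames)

# The stricter variant mentioned above: keep only values present in both sets.
shared = lookup.intersection(pd.Index(train_surnames.unique()))
```

With the test-only lookup, every training value that cannot matter for the test set ends up in the same -1 bucket instead of getting its own code.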

Extracting features

Now we need to extract features. As already mentioned, many classifiers can only work with numbers, so we need to:

  • Convert categories to a numeric representation
  • Extract implicit features, that is, those not given explicitly (title, deck)
  • Do something about missing values

There are two ways to convert a categorical feature into a numeric one. Let's consider the task using the passenger's sex as an example.

In the first option we simply replace the sex with a number; for example, female becomes 0 and male becomes 1 (a circle and a stick, very easy to remember). This option does not increase the number of features, but the values of the feature now carry a "greater than"/"less than" relation. When there are many values, this unexpected property is not always desirable and can cause problems for geometric classifiers.

The second option is to create two columns, "sex_male" and "sex_female". For a male we assign sex_male=1, sex_female=0; for a female the opposite: sex_male=0, sex_female=1. We now avoid the "greater than"/"less than" relation, but we have more features, and the more features there are, the more data we need to train the classifier. This problem is known as the "curse of dimensionality". The situation is especially difficult when a feature has very many values, for example ticket identifiers. In such cases one can, for instance, drop rarely occurring values by substituting a special tag for them, thereby reducing the final number of features after expansion.

A small spoiler: we are betting primarily on the Random Forest classifier. First, everyone does so; second, it does not require feature expansion, is robust to the scale of feature values, and is fast to compute. Despite this, we prepare the features in a general, universal form, since the main goal is to explore the principles of working with sklearn and its capabilities.

So some categorical features we replace with numbers, some we expand, and some we both replace and expand. We do not skimp on the number of features, since later we can always choose which of them will take part in the work.

In most tutorials and examples on the web the original datasets are modified very freely: original columns are replaced with new values, unneeded columns are removed, and so on. There is no need for this as long as there is enough RAM: it is always better to add new features to the dataset without changing the existing data in any way, since pandas will always let us select only the necessary ones afterwards.

We create a function for transforming the datasets.
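
Both options, plus the trick of collapsing rare values, can be sketched on a tiny frame. The data is invented, and `min_count` and the "rare" tag are arbitrary choices made for this example:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Ticket": ["A1", "A1", "B7"]})

# Option 1: replace each category with a number (one column, but implies an order).
toy["SexF"] = toy["Sex"].map({"female": 0, "male": 1})

# Option 2: expand into one indicator column per category (no order, more columns).
sex_dummies = pd.get_dummies(toy["Sex"], prefix="SexD")

# For a feature with many distinct values, replace values seen fewer than
# min_count times with a single tag before expanding.
min_count = 2
counts = toy["Ticket"].map(toy["Ticket"].value_counts())
toy["TicketGrouped"] = toy["Ticket"].where(counts >= min_count, "rare")
```

After grouping, expanding `TicketGrouped` would produce two columns here instead of one column per distinct ticket.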

def get_index(item, index):
    if pd.isnull(item):
        return -1

    try:
        return index.get_loc(item)
    except KeyError:
        return -1

def munge_data(data, digest):
    # Age: replace missing values with the median for the passenger's sex
    data["AgeF"] = data.apply(lambda r: digest.ages[r["Sex"]] if pd.isnull(r["Age"]) else r["Age"], axis=1)

    # Fare: replace missing values with the median for the ticket class
    data["FareF"] = data.apply(lambda r: digest.fares[r["Pclass"]] if pd.isnull(r["Fare"]) else r["Fare"], axis=1)

    # Gender: replacement
    genders = {"male": 1, "female": 0}
    data["SexF"] = data["Sex"].apply(lambda s: genders.get(s))

    # Gender: expansion
    gender_dummies = pd.get_dummies(data["Sex"], prefix="SexD", dummy_na=False)
    data = pd.concat([data, gender_dummies], axis=1)

    # Embarkment: replacement
    embarkments = {"U": 0, "S": 1, "C": 2, "Q": 3}
    data["EmbarkedF"] = data["Embarked"].fillna("U").apply(lambda e: embarkments.get(e))

    # Embarkment: expansion
    embarkment_dummies = pd.get_dummies(data["Embarked"], prefix="EmbarkedD", dummy_na=False)
    data = pd.concat([data, embarkment_dummies], axis=1)

    # Number of relatives on board
    data["RelativesF"] = data["Parch"] + data["SibSp"]

    # Traveling alone?
    data["SingleF"] = data["RelativesF"].apply(lambda r: 1 if r == 0 else 0)

    # Deck: replacement
    decks = {"U": 0, "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "T": 8}
    data["DeckF"] = data["Cabin"].fillna("U").apply(lambda c: decks.get(c[0], -1))

    # Deck: expansion
    deck_dummies = pd.get_dummies(data["Cabin"].fillna("U").apply(lambda c: c[0]), prefix="DeckD", dummy_na=False)
    data = pd.concat([data, deck_dummies], axis=1)

    # Titles: expansion
    title_dummies = pd.get_dummies(data["Name"].apply(lambda n: get_title(n)), prefix="TitleD", dummy_na=False)
    data = pd.concat([data, title_dummies], axis=1)

    # Replace text values with the index from the corresponding lookup table,
    # or -1 if the value is not in the table (these are not expanded)
    data["CabinF"] = data["Cabin"].fillna("unknown").apply(lambda c: get_index(c, digest.cabins))

    data["TitleF"] = data["Name"].apply(lambda n: get_index(get_title(n), digest.titles))

    data["TicketF"] = data["Ticket"].apply(lambda t: get_index(t, digest.tickets))

    data["FamilyF"] = data.apply(lambda r: get_index(get_family(r), digest.families), axis=1)

    # For statistics
    age_bins = [0, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90]
    data["AgeR"] = pd.cut(data["Age"].fillna(-1), bins=age_bins).astype(object)

    return data

A short explanation of the new features being added:
  • our own index of the cabin
  • our own index of the deck (cut from the cabin number)
  • our own index of the ticket
  • our own index of the title (cut from the name)
  • our own index of the family identifier (built from the surname and the family size)

In general, we add to the features everything that comes to mind. Some features duplicate each other (for example, the expansion and the replacement of sex), some obviously correlate with each other (ticket class and ticket price), and some are obviously meaningless (the port of embarkation hardly affects survival). We will deal with all this later, when we select features for training.

Let's transform both available datasets and again create the combined dataset.
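
To make the "cut from the name" and "cut from the cabin number" steps concrete, here is a small standalone sketch using only the standard library. The helper names `extract_title`, `extract_deck` and `extract_surname` are made up for this example, and the sample strings merely follow the format seen in the dataset:

```python
import re

def extract_title(name):
    """Return the lower-cased word that precedes the first period, e.g. 'mr'."""
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1).lower() if match else "none"

def extract_deck(cabin):
    """Return the first letter of the cabin number, or 'U' when it is missing."""
    return cabin[0] if cabin else "U"

def extract_surname(name):
    """Return the lower-cased part of the name before the first comma."""
    return name.split(",")[0].strip().lower()

sample_name = "Braund, Mr. Owen Harris"
sample_title = extract_title(sample_name)      # title cut from the name
sample_surname = extract_surname(sample_name)  # surname cut from the name
sample_deck = extract_deck("C85")              # deck cut from the cabin number
```

Each helper derives a new value from an existing text field without modifying the field itself, which matches the approach of only ever adding columns.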

train_data_munged = munge_data(train_data, data_digest)
test_data_munged = munge_data(test_data, data_digest)
all_data_munged = pd.concat([train_data_munged, test_data_munged])

Although we are aiming at Random Forest, we also want to try other classifiers. And with them there is the following problem: many classifiers are sensitive to the scale of the features. In other words, if one feature has values in [-10,5] and another has values in [0,10000], then the same percentage error on both features leads to a large difference in absolute value, and the classifier will treat the second feature as more important.

To avoid this, we bring all numeric features (and we no longer have any others) to a common scale: zero mean and unit variance. In sklearn this is very easy to do.

scaler = StandardScaler()
scaler.fit(all_data_munged[predictors])

train_data_scaled = scaler.transform(train_data_munged[predictors])
test_data_scaled = scaler.transform(test_data_munged[predictors])

First we compute the scaling coefficients (the full dataset came in handy again), and then we scale both datasets separately.

Feature selection

And now the moment has come when we can choose the features we will work with from here on.
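
What the fit and transform steps do can be reproduced by hand with NumPy. This is a sketch of the standardization formula on invented numbers, not a replacement for StandardScaler:

```python
import numpy as np

# Toy feature matrices (rows are passengers, columns are features); values invented.
toy_train = np.array([[1.0, 100.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 200.0], [6.0, 600.0]])

# "Fit": per-column mean and standard deviation over the combined data.
toy_all = np.vstack([toy_train, toy_test])
mean = toy_all.mean(axis=0)
std = toy_all.std(axis=0)   # population std (ddof=0), as StandardScaler uses

# "Transform": the same coefficients are applied to each set separately.
toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std

# Taken together, the scaled columns have zero mean and unit standard deviation.
combined_scaled = np.vstack([toy_train_scaled, toy_test_scaled])
```

Note that the result is centered and has unit spread but is not clipped to a fixed interval: the largest value in `toy_test_scaled` is above 1.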

predictors = ["Pclass",
              "TitleD_mr", "TitleD_mrs", "TitleD_miss", "TitleD_master", "TitleD_ms",
              "TitleD_col", "TitleD_rev", "TitleD_dr",
              "DeckD_U", "DeckD_A", "DeckD_B", "DeckD_C", "DeckD_D", "DeckD_E", "DeckD_F", "DeckD_G",
              "SexD_male", "SexD_female",
              "EmbarkedD_S", "EmbarkedD_C", "EmbarkedD_Q",
              "SibSp", "Parch"]

We simply comment out the unneeded ones and start training. Which ones are unneeded is up to you.

Analysis once again

Since we now have a column recording the range that the passenger's age falls into, let's evaluate survival by age (range).

print("===== survived by age")
print(train_data_munged.groupby(["AgeR"])["Survived"].value_counts(normalize=True))

print("===== survived by gender and age")
print(train_data_munged.groupby(["Sex", "AgeR"])["Survived"].value_counts(normalize=True))

print("===== survived by class and age")
print(train_data_munged.groupby(["Pclass", "AgeR"])["Survived"].value_counts(normalize=True))

===== survived by age
AgeR      Survived
(0, 5]    1           0.704545
          0           0.295455
(10, 15]  1           0.578947
          0           0.421053
(15, 20]  0           0.656250
          1           0.343750
(20, 25]  0           0.655738
          1           0.344262
(25, 30]  0           0.611111
          1           0.388889
(30, 40]  0           0.554839
          1           0.445161
(40, 50]  0           0.616279
          1           0.383721
(5, 10]   0           0.650000
          1           0.350000
(50, 60]  0           0.595238
          1           0.404762
(60, 70]  0           0.764706
          1           0.235294
(70, 80]  0           0.800000
          1           0.200000
dtype: float64
===== survived by gender and age
Sex     AgeR      Survived
female  (0, 5]    1           0.761905
                  0           0.238095
        (10, 15]  1           0.750000
                  0           0.250000
        (15, 20]  1           0.735294
                  0           0.264706
        (20, 25]  1           0.755556
                  0           0.244444
        (25, 30]  1           0.750000
                  0           0.250000
        (30, 40]  1           0.836364
                  0           0.163636
        (40, 50]  1           0.677419
                  0           0.322581
        (5, 10]   0           0.700000
                  1           0.300000
        (50, 60]  1           0.928571
                  0           0.071429
        (60, 70]  1           1.000000
male    (0, 5]    1           0.652174
                  0           0.347826
        (10, 15]  0           0.714286
                  1           0.285714
        (15, 20]  0           0.870968
                  1           0.129032
        (20, 25]  0           0.896104
                  1           0.103896
        (25, 30]  0           0.791667
                  1           0.208333
        (30, 40]  0           0.770000
                  1           0.230000
        (40, 50]  0           0.781818
                  1           0.218182
        (5, 10]   0           0.600000
                  1           0.400000
        (50, 60]  0           0.857143
                  1           0.142857
        (60, 70]  0           0.928571
                  1           0.071429
        (70, 80]  0           0.800000
                  1           0.200000
dtype: float64
===== survived by class and age
Pclass  AgeR      Survived
1       (0, 5]    1           0.666667
                  0           0.333333
        (10, 15]  1           1.000000
        (15, 20]  1           0.800000
                  0           0.200000
        (20, 25]  1           0.761905
                  0           0.238095
        (25, 30]  1           0.684211
                  0           0.315789
        (30, 40]  1           0.755102
                  0           0.244898
        (40, 50]  1           0.567568
                  0           0.432432
        (50, 60]  1           0.600000
                  0           0.400000
        (60, 70]  0           0.818182
                  1           0.181818
        (70, 80]  0           0.666667
                  1           0.333333
2       (0, 5]    1           1.000000
        (10, 15]  1           1.000000
        (15, 20]  0           0.562500
                  1           0.437500
        (20, 25]  0           0.600000
                  1           0.400000
        (25, 30]  0           0.580645
                  1           0.419355
        (30, 40]  0           0.558140
                  1           0.441860
        (40, 50]  1           0.526316
                  0           0.473684
        (5, 10]   1           1.000000
        (50, 60]  0           0.833333
                  1           0.166667
        (60, 70]  0           0.666667
                  1           0.333333
3       (0, 5]    1           0.571429
                  0           0.428571
        (10, 15]  0           0.571429
                  1           0.428571
        (15, 20]  0           0.784615
                  1           0.215385
        (20, 25]  0           0.802817
                  1           0.197183
        (25, 30]  0           0.724138
                  1           0.275862
        (30, 40]  0           0.793651
                  1           0.206349
        (40, 50]  0           0.933333
                  1           0.066667
        (5, 10]   0           0.812500
                  1           0.187500
        (50, 60]  0           1.000000
        (60, 70]  0           0.666667
                  1           0.333333
        (70, 80]  0           1.000000
dtype: float64

We see that the chances of survival are high for children under 5, and that in old age the chance of survival falls with age. But this does not apply to women: a woman's chance of survival is high at any age.

Let's try visualization with seaborn. It produces very pretty pictures, although I am more used to text.
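
For reference, here is how the age ranges behave on a few made-up values. This sketch keeps the categorical result of `pd.cut` rather than casting it to object, which is an assumption that differs from the transformation code:

```python
import pandas as pd

ages = pd.Series([2, 7, 30, -1])            # -1 stands in for a missing age
binned = pd.cut(ages, bins=[0, 5, 10, 40])  # ranges (0, 5], (5, 10], (10, 40]

# Ranges are open on the left and closed on the right, so -1 falls outside
# every range and becomes NaN, which is encoded as category code -1.
codes = binned.cat.codes.tolist()
missing = binned.isna().tolist()
```

Rows with a missing age therefore drop out of any grouping by range. Keeping the categorical type also preserves the numeric order of the ranges, whereas in the listing above "(5, 10]" is printed after "(40, 50]", presumably because the labels were sorted as text.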

sns.pairplot(train_data_munged, vars=["AgeF", "Pclass", "SexF"], hue="Survived", dropna=True)


Pretty, but the correlation in, for example, the class/sex pair is not very evident.

Let's evaluate the importance of our features with the SelectKBest algorithm.

selector = SelectKBest(f_classif, k=5)
selector.fit(train_data_munged[predictors], train_data_munged["Survived"])

scores = -np.log10(selector.pvalues_)

plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')


Here is an article describing exactly how it does this. Other strategies can also be specified in the SelectKBest parameters.

In principle, we already knew all this: sex is very important. Titles are important, but they correlate strongly with sex. Ticket class is important and, for some reason, so is deck "F".
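
The f_classif score is based on the one-way ANOVA F statistic, computed for each feature separately. A hand-rolled sketch of that statistic for a single feature (the function name `anova_f` and the numbers are invented for the example):

```python
import numpy as np

def anova_f(feature, labels):
    """One-way ANOVA F statistic of a single feature across the label classes."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    overall_mean = feature.mean()

    between = 0.0   # variation of the class means around the overall mean
    within = 0.0    # variation of the samples around their own class mean
    for c in classes:
        group = feature[labels == c]
        between += len(group) * (group.mean() - overall_mean) ** 2
        within += ((group - group.mean()) ** 2).sum()

    df_between = len(classes) - 1
    df_within = len(feature) - len(classes)
    return (between / df_between) / (within / df_within)

# A feature whose values differ clearly between the two classes.
f_separating = anova_f([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])

# A feature with the same distribution in both classes.
f_uninformative = anova_f([1, 2, 3, 1, 2, 3], [0, 0, 0, 1, 1, 1])
```

A large F value means the class means are far apart relative to the spread inside each class, which is why features such as sex rank high. The statistic looks at one feature at a time, so it says nothing about features that duplicate each other.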

Evaluating the classification

Before running any classification, we need to understand how we will evaluate it. With Kaggle competitions everything is very simple: we just read their rules. For Titanic, the score is the ratio of correct classifier predictions to the total number of passengers. In other words, this score is called accuracy.

But before sending the classification result for the test set to Kaggle for scoring, it would be good to first get at least an approximate idea of our classifier's quality. We can only do this using the training set, since only it contains labeled data. But the question remains: how exactly?
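
As a formula this is simply the number of correct predictions divided by the total number of predictions. A minimal standalone sketch (the function name and the labels are made up for illustration):

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

# Three of the four made-up predictions match the true labels.
example = accuracy([1, 0, 0, 1], [1, 0, 1, 1])
```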

You can often see something like this in examples:

classifier.fit(train_X, train_y)
predict_y = classifier.predict(train_X)
return metrics.accuracy_score(train_y, predict_y)

That is, we train the classifier on the training set and then test it on that same set. To some extent this does give an estimate of the classifier's quality, but in general the approach is wrong. The classifier has to describe not the data it was trained on but the model that generated this data. Otherwise the classifier adapts perfectly to the training set and shows excellent results when tested on it, yet fails badly when tested on any other dataset. This is called overfitting.

The correct approach is to split the available training set into several pieces. We can take some of them, train the classifier on them, and then check its work on the rest. This process can be repeated several times simply by shuffling the pieces. In sklearn this process is called cross-validation.

You may already be picturing the loops that split the data, train and evaluate, but the trick is that all you need to do to implement this in sklearn is define a strategy.
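
The effect is easy to reproduce on pure noise. The sketch below uses a hand-written nearest-neighbor rule (not one of the sklearn classifiers) on random features and random labels, so there is nothing real to learn:

```python
import numpy as np

rng = np.random.RandomState(0)

# Features and labels are independent noise, so no real pattern exists.
X = rng.rand(300, 5)
y = rng.randint(0, 2, size=300)
X_train, y_train = X[:200], y[:200]
X_held, y_held = X[200:], y[200:]

def predict_1nn(X_fit, y_fit, X_query):
    """Label each query point with the label of its nearest fitted point."""
    # Pairwise squared Euclidean distances, shape (n_query, n_fit).
    dists = ((X_query[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    return y_fit[dists.argmin(axis=1)]

# Every training point is its own nearest neighbor, so this score is perfect.
train_accuracy = (predict_1nn(X_train, y_train, X_train) == y_train).mean()

# On points the rule has never seen, it can only guess.
held_out_accuracy = (predict_1nn(X_train, y_train, X_held) == y_held).mean()
```

The score on the training data is perfect even though the rule has memorized noise, while the score on the held-out part stays near chance level. Only the second number says anything about how the classifier will behave on new data.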

cv = StratifiedKFold(train_data["Survived"], n_folds=3, shuffle=True, random_state=1)

Here we define a fairly complex process: the training data is split into three pieces, records are assigned to each piece at random (to cancel out any possible dependence on order), and in addition the strategy makes sure that the class ratio in each piece is approximately equal. We will thus perform three measurements on pieces 1+2 vs 3, 1+3 vs 2 and 2+3 vs 1, after which we can obtain the mean accuracy of the classifier (which characterizes the quality of its work) and the spread of the score (which characterizes its stability).


Now let's test how different classifiers perform.
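
The `StratifiedKFold(labels, n_folds=...)` signature belongs to the old sklearn.cross_validation module. In scikit-learn 0.18 and later the class lives in sklearn.model_selection, takes `n_splits`, and receives the labels through `split()`. A minimal sketch of the same strategy with that API, on invented labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a 2:1 class ratio (invented for illustration).
y = np.array([0] * 6 + [1] * 3)
X = np.zeros((len(y), 1))   # the features do not matter for the split itself

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

fold_summaries = []
for train_idx, test_idx in cv.split(X, y):
    # Record the training size and the class counts of each test fold.
    fold_summaries.append((len(train_idx), np.bincount(y[test_idx]).tolist()))
```

Each of the three test folds keeps the 2:1 class ratio of the full label vector. In the newer API the `cv` object can be passed to `cross_val_score` in the same way.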

alg_ngbh = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(alg_ngbh, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (k-neighbors): {}/{}".format(scores.mean(), scores.std()))

alg_sgd = SGDClassifier(random_state=1)
scores = cross_val_score(alg_sgd, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (sgd): {}/{}".format(scores.mean(), scores.std()))

alg_svm = SVC(C=1.0)
scores = cross_val_score(alg_svm, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (svm): {}/{}".format(scores.mean(), scores.std()))

alg_nbs = GaussianNB()
scores = cross_val_score(alg_nbs, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (naive bayes): {}/{}".format(scores.mean(), scores.std()))

def linear_scorer(estimator, x, y):
    scorer_predictions = estimator.predict(x)

    scorer_predictions[scorer_predictions > 0.5] = 1
    scorer_predictions[scorer_predictions <= 0.5] = 0

    return metrics.accuracy_score(y, scorer_predictions)

alg_lnr = LinearRegression()
scores = cross_val_score(alg_lnr, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1,
                         scoring=linear_scorer)
print("Accuracy (linear regression): {}/{}".format(scores.mean(), scores.std()))

The linear_scorer function is needed because LinearRegression is a regression and returns an arbitrary real number. We therefore split the scale at 0.5 and map every number to one of two classes, 0 or 1.

alg_log = LogisticRegression(random_state=1)
scores = cross_val_score(alg_log, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (logistic regression): {}/{}".format(scores.mean(), scores.std()))

alg_frst = RandomForestClassifier(random_state=1, n_estimators=500, min_samples_split=8, min_samples_leaf=2)
scores = cross_val_score(alg_frst, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (random forest): {}/{}".format(scores.mean(), scores.std()))

My results were approximately as follows:
Accuracy (k-neighbors): 0.698092031425/0.0111105442611
Accuracy (sgd): 0.708193041526/0.0178870678457
Accuracy (svm): 0.693602693603/0.018027360723
Accuracy (naive bayes): 0.791245791246/0.0244349506813
Accuracy (linear regression): 0.805836139169/0.00839878201296
Accuracy (logistic regression): 0.806958473625/0.0156323100754
Accuracy (random forest): 0.827160493827/0.0063488824349

Random Forest won, and its standard deviation is quite good too, so it appears to be stable.
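The comparison can also be done mechanically. The snippet below is only a convenience sketch: it takes the mean and standard deviation values printed above (rounded to four decimal places), ranks the classifiers by mean accuracy, and picks the one with the smallest spread.

```python
# (name, mean accuracy, standard deviation) from the output above, rounded.
results = [
    ("k-neighbors", 0.6981, 0.0111),
    ("sgd", 0.7082, 0.0179),
    ("svm", 0.6936, 0.0180),
    ("naive bayes", 0.7912, 0.0244),
    ("linear regression", 0.8058, 0.0084),
    ("logistic regression", 0.8070, 0.0156),
    ("random forest", 0.8272, 0.0063),
]

# Rank by mean accuracy, best first.
ranked = sorted(results, key=lambda r: r[1], reverse=True)
for name, mean, std in ranked:
    print("{:<20} mean={:.4f} std={:.4f}".format(name, mean, std))

best_by_mean = ranked[0][0]
most_stable = min(results, key=lambda r: r[2])[0]
print("best mean:", best_by_mean)
print("smallest std:", most_stable)
```

With only three folds the standard deviation is a rough indicator, so it is better read as a hint of stability than as proof.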

Even better

Everything seems fine and we could submit the result, but one unclear point remains: each classifier has parameters, so how do we know we picked the best option? Of course, we could sit and go through the parameters by hand for a long time, but what if we hand this job to the computer?

alg_frst_model = RandomForestClassifier(random_state=1)
alg_frst_params = [{
    "n_estimators": [350, 400, 450],
    "min_samples_split": [6, 8, 10],
    "min_samples_leaf": [1, 2, 4]
}]
alg_frst_grid = GridSearchCV(alg_frst_model, alg_frst_params, cv=cv, refit=True, verbose=1, n_jobs=-1)
alg_frst_grid.fit(train_data_scaled, train_data_munged["Survived"])
alg_frst_best = alg_frst_grid.best_estimator_
print("Accuracy (random forest auto): {} with params {}"
      .format(alg_frst_grid.best_score_, alg_frst_grid.best_params_))

It turns out even better!
Accuracy (random forest auto): 0.836139169473 with params { 'min_samples_split': 6, 'n_estimators': 350, 'min_samples_leaf': 2 }

The search can be made finer still, given time and desire, either by changing the parameter values or by using another search strategy such as RandomizedSearchCV.
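The difference between the two search strategies can be sketched without sklearn. In the example below, `toy_score` is a hypothetical stand-in for a cross-validated accuracy (it is not a real model, and its optimum is chosen arbitrarily), and the budget of 8 random candidates is likewise an assumption. Exhaustive search evaluates all 27 combinations, while randomized search evaluates only a fixed number of randomly chosen ones.

```python
import itertools
import random

param_grid = {
    "n_estimators": [350, 400, 450],
    "min_samples_split": [6, 8, 10],
    "min_samples_leaf": [1, 2, 4],
}


def toy_score(params):
    """Hypothetical stand-in for a cross-validated accuracy; NOT a real model."""
    return -(abs(params["n_estimators"] - 400) / 1000.0
             + abs(params["min_samples_split"] - 8) / 100.0
             + abs(params["min_samples_leaf"] - 2) / 100.0)


names = sorted(param_grid)
all_combos = [dict(zip(names, values))
              for values in itertools.product(*(param_grid[n] for n in names))]

# Exhaustive search: evaluate every combination (3 * 3 * 3 = 27 candidates).
best_grid = max(all_combos, key=toy_score)

# Randomized search: evaluate only a fixed budget of randomly chosen candidates.
rng = random.Random(1)
sampled = rng.sample(all_combos, 8)
best_random = max(sampled, key=toy_score)

print("candidates in full grid:", len(all_combos))
print("best from grid search:", best_grid)
print("best from 8 random tries:", best_random)
```

Randomized search can never beat the exhaustive result on the same grid, but it costs a fraction of the time, which matters once the grid grows beyond a few dozen combinations.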

Trying xgboost

Everyone praises xgboost, so let's try it too.

ald_xgb_model = xgb.XGBClassifier()
ald_xgb_params = [
    {"n_estimators": [230, 250, 270],
     "max_depth": [1, 2, 4],
     "learning_rate": [0.01, 0.02, 0.05]}
]
alg_xgb_grid = GridSearchCV(ald_xgb_model, ald_xgb_params, cv=cv, refit=True, verbose=1, n_jobs=1)
alg_xgb_grid.fit(train_data_scaled, train_data_munged["Survived"])
alg_xgb_best = alg_xgb_grid.best_estimator_
print("Accuracy (xgboost auto): {} with params {}"
      .format(alg_xgb_grid.best_score_, alg_xgb_grid.best_params_))

For some reason training hung when using all cores, so I limited it to a single thread (n_jobs=1), but even in single-threaded mode training and classification in xgboost are very fast.

The result is quite good too
Accuracy (xgboost auto): 0.835016835017 with params { 'n_estimators': 270, 'learning_rate': 0.02, 'max_depth': 2 }


The classifier is selected and the parameters are tuned; all that remains is to produce the result and submit it to Kaggle for checking.

alg_test = alg_frst_best

alg_test.fit(train_data_scaled, train_data_munged["Survived"])

predictions = alg_test.predict(test_data_scaled)

submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": predictions
})

submission.to_csv("titanic-submission.csv", index=False)

In general, a few aspects of such competitions struck me as interesting:

  • The fight at the top is over hundredths of a percent, so even a single correct or wrong classification on the verification set matters. The result is also affected by random changes to the parameters and the algorithm;
  • The best result in local cross-validation does not guarantee the best result on the verification set. Sometimes implementing a seemingly reasonable hypothesis improves the local cross-validation result and worsens the result on Kaggle;
  • The two points above lead to a third: the script and the submission file should be kept under a version control system, and whenever the score makes a breakthrough you should commit and record the obtained result in the commit message;
  • The competition is somewhat artificial. In the real world you have no verification set, only a labeled sample on which you can run cross-validation; the final quality of the algorithm is usually estimated indirectly, for example by the number of user complaints;
  • The maximum quality that machine learning can reach on this task is not really clear: the training set is not very large, and the process is far from obvious. Really, what is the process when passengers flee a sinking ship? It is quite likely that not everyone who got into a boat survived, just as it is quite likely that not everyone who failed to get into a boat died; part of the process must be very chaotic and must resist any modeling.

A chapter in which the author experienced both surprise and enlightenment

Looking through the top of the leaderboard, one cannot help noticing people who scored 1 (all answers correct), and some of them managed it on the first attempt.

One possibility comes to mind: someone registered an account and used it to find the correct answers by brute force (no more than 10 attempts a day are allowed). If I understand correctly, this is a variant of the weighing puzzles.

However, after a little more thought, one has to smile at the realization: we are talking about a task whose answers have long been known! Indeed, the sinking of the Titanic was a shock to contemporaries, and films, books, and documentaries have been devoted to it. Most likely a complete list of the Titanic's passengers with a description of their fates exists somewhere. But that no longer has anything to do with machine learning.

Still, a conclusion can and should be drawn from this, and I intend to apply it in future competitions: you do not have to limit yourself to the data issued by the organizer (unless the competition rules forbid it). For example, from a known time and place you can find out the weather conditions, the state of the securities markets, currency rates, or whether the day was a holiday. In other words, you can combine the organizers' data with any available public data sets that can help describe the characteristics of the model.
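As a rough illustration of such enrichment, the sketch below joins organizer rows with a public lookup table by date. Everything in it is hypothetical: the records, the holiday calendar, and the `enrich` helper are invented for the example and do not come from any real competition.

```python
# Hypothetical organizer rows: each has an id and a date.
records = [
    {"id": 1, "date": "2015-01-01"},
    {"id": 2, "date": "2015-01-02"},
    {"id": 3, "date": "2015-01-05"},
]

# Hypothetical public data keyed by date (for example, a holiday calendar).
public_by_date = {
    "2015-01-01": {"is_holiday": 1},
    "2015-01-02": {"is_holiday": 0},
}


def enrich(rows, lookup, key, defaults):
    """Left-join rows with lookup[row[key]], using defaults for missing keys."""
    enriched_rows = []
    for row in rows:
        merged = dict(row)  # copy, so the organizer data stays untouched
        merged.update(lookup.get(row[key], defaults))
        enriched_rows.append(merged)
    return enriched_rows


enriched = enrich(records, public_by_date, "date", {"is_holiday": 0})
for row in enriched:
    print(row)
```

The new column can then be fed to the classifier like any other feature; whether it helps is something only cross-validation can tell.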

This article is a translation of the original post at