Day 8 Kaggle's 30 Days of ML

フリーレン from 葬送のフリーレン.

Start the course step 1-2 of tutorial to machine learning.

Basically introduces how model is used for machine learning.

We use training data to train or fit the model based on features, then use the model to predict our target in the new situation.

Import panda library.

import pandas as pd

Read .csv file.

data = pd.read_csv(file_dir)

Print a summary,

data.describe()

The summary will show "count, mean, std, min, 25%, 50%, 75%, max".

the count, shows how many rows have non-missing values.

One of the best known supervised classification methods.

A tree is composed of nodes, and those nodes are chosen looking for the optimum split of the features.

Use gini and entropy to determine which node to select for better result.

Basic idea is to select better node and node again.

Gini:

The gini impurity measures the frequency at which any element of the dataset will be mislabelled when it is randomly labeled.

Min value of gini is 0, and it means that the node is pure and will not be splited again.

Range of gini value is 0~0.5.

The lower the gini value is, the better the node selection is.

Entropy:

Entropy is a measure of information that indicates the disorder of the features with the target.

Min value of entropy is 0, and it means that the node is pure.

Range of entropy value is 0~1.

The lower the entropy value is, the better the node selection is.

Due the range, entropy requires more training time. In addition, redudent features also increase training time.