Commit 1c01440c authored by Roberto Ugolotti's avatar Roberto Ugolotti

Machine Learning training

parent 92a038e8
%% Cell type:markdown id:af317b04-85a6-4c0a-8059-2a46a34b0405 tags:
# Demo - Read Data
This and the following notebooks provide an example of a machine learning workflow.
The task is the classification of the EuroSAT dataset. To do so, we are going to
use some simple classifiers from Scikit-learn.
## Dataset
[EuroSAT](https://github.com/phelber/eurosat) is a freely available dataset
that can be used to train machine learning models for the challenge of land use
and land cover classification. It is composed of 64x64 pixel images extracted from
Sentinel-2 and is available in RGB or multi-spectral form. It consists of 10 classes
with a total of 27,000 labeled and geo-referenced images.
In this example we are going to use a subset of the RGB images.
## Scikit-learn
[Scikit-learn](https://scikit-learn.org) is a free machine learning library for
Python. In recent years it has become a de-facto standard for classical machine
learning algorithms. It features various classification, regression and clustering algorithms, including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with NumPy and SciPy.
## Machine Learning Workflow
![Workflow](images/ml_workflow.png)
%% Cell type:code id:9e8178e8-ec96-46aa-b26f-2f1d1a7a2200 tags:
``` python
import matplotlib.pyplot as plt
from functions import read_images
```
%% Cell type:code id:973307ee-c4a9-4b2a-8150-b84593f7db0a tags:
``` python
images = read_images('/eos/jeodpp/data/base/MachineLearning/SatImNet/EuroSAT/rgb/images/', 200)
```
%% Cell type:code id:46672570-cce2-47ab-b852-a1ec755c370b tags:
``` python
for label_name, label_images in images.items():
    print('Class %s contains %d images' % (label_name, len(label_images)))
```
%% Cell type:code id:56926de2-70c0-4c34-8ecb-5bfee6769b58 tags:
``` python
_, axes = plt.subplots(4, 6, figsize=(16, 7))
for i, l in enumerate(('Forest', 'AnnualCrop', 'SeaLake', 'River')):
    axes[i, 0].set_ylabel(l)
    for j in range(6):
        axes[i, j].imshow(images[l][j])
```
%% Cell type:markdown id:13ab82e0-0d8b-4e88-ace2-b0aa3efc01b6 tags:
# Demo - First Iteration
This second notebook shows a first, tentative attempt to classify the EuroSAT dataset.
%% Cell type:code id:68fcb21c-4fd8-46f2-8c37-78c55ed6f90e tags:
``` python
from functions import read_images, to_flat
```
%% Cell type:code id:a559ea99-9d5f-4aec-8f23-5aecd7de3003 tags:
``` python
# Read images from disk
images = read_images('/eos/jeodpp/data/base/MachineLearning/SatImNet/EuroSAT/rgb/images/', 200)
```
%% Cell type:markdown id:efebe875-7043-4886-8c00-64c13dd312f9 tags:
## Convert to features
Convert each image into a single one-dimensional entry. The simplest way to do it is to put all pixel values into a single array.
![example](images/features_1.PNG)
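The `to_flat` helper used in the next cell comes from the accompanying `functions` module. Purely as an illustration of the idea (a sketch, not the actual helper), the flattening step could be implemented with plain NumPy roughly like this:
``` python
import numpy as np

def flatten_images(images):
    """Turn a dict {class_name: list of HxWxC arrays} into (X, y, label_names)."""
    label_names = sorted(images)                 # fix an order for the classes
    X, y = [], []
    for label_idx, name in enumerate(label_names):
        for img in images[name]:
            X.append(np.asarray(img).ravel())    # 64x64x3 pixels -> 12288 values
            y.append(label_idx)
    return np.array(X), np.array(y), label_names
```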
%% Cell type:code id:f30ce5e6-a616-46b6-9ff6-b35e78675da1 tags:
``` python
X_flat, y_flat, label_names = to_flat(images, return_label_names=True)
print('Shape of an image', images['Forest'][0].shape)
print('Shape of a single feature array', X_flat[0].shape)
print('Shape of input data', X_flat.shape)
print('Shape of labels (target) array', y_flat.shape)
```
%% Cell type:code id:cbf95809-e675-4d08-bdda-59fa99099a9b tags:
``` python
print('Example of an entry')
print('Input', X_flat[0])
print('Expected Output', y_flat[0])
```
%% Cell type:markdown id:16c4b5a1-6da2-4460-82ed-f09a8c133d27 tags:
## K-nearest neighbors Classifier
K-nearest neighbors (KNN) is one of the simplest algorithms for classification, suitable for datasets with a low number of features.
Basically, it has no learning phase, since it just memorizes the entire training set.
When a new entry is to be classified, it looks at the classes of the `k` closest points
in the training set and assigns the most represented class (possibly weighted) to the entry.
KNN is a non-parametric algorithm, since it does not make any assumptions about the underlying data.
KNN is a distance-based algorithm, since the class is assigned by looking at the distance (e.g. Euclidean) between the new object and the points in the training set.
With small changes, it can also be used for regression tasks.
![KNN](images/knn.png)
Image source: https://medium.com/analytics-vidhya/k-nearest-neighbor-the-maths-behind-it-how-it-works-and-an-example-f1de1208546c
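To make the voting rule concrete, here is a minimal sketch on made-up 2-D points (illustrative only; in the rest of the notebook we will use scikit-learn's `KNeighborsClassifier` instead):
``` python
import numpy as np

# Tiny made-up training set: four 2-D points belonging to class 0 or 1
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1])

def knn_predict(x, k=3):
    # Euclidean distance from x to every training point
    distances = np.linalg.norm(X_toy - x, axis=1)
    # Classes of the k closest points, then majority vote
    nearest_classes = y_toy[np.argsort(distances)[:k]]
    return np.bincount(nearest_classes).argmax()

print(knn_predict(np.array([0.1, 0.2])))  # close to the class-0 points -> 0
print(knn_predict(np.array([1.0, 0.9])))  # close to the class-1 points -> 1
```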
%% Cell type:markdown id:341cea39-9e76-4e41-aced-eb6433126903 tags:
## How Scikit-learn works
All scikit-learn classes share a uniform API. The commands to train, test and evaluate a classifier are generally the same, regardless of the algorithm, with small variations across different families (see [documentation](https://scikit-learn.org/stable/developers/develop.html)).
Let's take an example of a classifier, called `MyClassifier`. First, you have to create it; all its parameters must be passed at this time:
```
c = MyClassifier(param1=val1, param2=val2)
```
Then, you train it by passing your training set as numpy arrays:
```
c.fit(X_train, y_train)
```
and you can use it for classifying new instances, or summarizing its results:
```
predicted = c.predict(X_test)
accuracy = c.score(X_test, y_test)
```
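As a concrete, self-contained version of the same pattern (a small sketch using scikit-learn's bundled iris dataset rather than EuroSAT, just to keep the example short):
``` python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_iris, y_iris, test_size=0.25, random_state=0)

c = KNeighborsClassifier(n_neighbors=5)  # all parameters are passed at creation time
c.fit(Xtr, ytr)                          # train on the training set
predicted = c.predict(Xte)               # classify new instances
accuracy = c.score(Xte, yte)             # summarize the results
print(accuracy)
```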
%% Cell type:code id:4117e9c9-e90c-4370-9d57-64d81735c643 tags:
``` python
# Create a very simple classifier
# See https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline(
    [('classifier', KNeighborsClassifier(n_neighbors=5))]
)
```
%% Cell type:code id:8a0a3703-4175-47c6-8beb-fb4a3c464358 tags:
``` python
# Split into training and test set
# See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_flat, y_flat, test_size=0.25, random_state=0)
```
%% Cell type:code id:7ba60ac1-ee89-4ec4-a86d-70e45da17334 tags:
``` python
# Train and evaluate the classifier
pipeline.fit(X_train, y_train.ravel())
print('Accuracy on training set:', 100 * pipeline.score(X_train, y_train), '%')
print('Accuracy on test set:', 100 * pipeline.score(X_test, y_test), '%')
```
%% Cell type:code id:e2453b46-3c8e-4dfd-b27b-58f7ec81ffce tags:
``` python
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
_, ax = plt.subplots(figsize=(10, 8))
ConfusionMatrixDisplay.from_estimator(pipeline, X_test, y_test, display_labels=label_names, xticks_rotation=90, ax=ax)
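# As an additional summary (a sketch, assuming the class order in label_names matches
# the numeric labels produced by to_flat), print per-class precision/recall/F1 scores:
from sklearn.metrics import classification_report
print(classification_report(y_test.ravel(), pipeline.predict(X_test), target_names=label_names))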
```
%% Cell type:code id:e00ce19c-4a29-4bdd-ac66-b4befe4ead19 tags:
``` python
# Investigate the features: plot the first two feature values for the first four classes
plt.ion()
plt.figure(figsize=(12, 8))
for i in range(4):
    plt.scatter(
        X_flat[y_flat.ravel() == i, 0], X_flat[y_flat.ravel() == i, 1],
        alpha=0.6, s=70, label=label_names[i])
plt.legend()
```