Commit 87c9285e authored by Roberto Ugolotti

Simplify: do not create a conda environment but just install packages

parent d2e4165b
%% Cell type:markdown id:cb377aac-a569-4b78-bc49-d3c01a2023b2 tags:
# Detect Fake News on Health using a pre-trained PyTorch Network
%% Cell type:markdown id:a25fd7f5-700e-4981-966f-43a90efe2719 tags:
## HuggingFace libraries
In this example, we are going to use the libraries provided by HuggingFace. This set of libraries provides open-source code, mostly devoted to Natural Language Processing.
We are going to download both the network and the dataset using the libraries `transformers` and `datasets`.
The dataset is [PUBHEALTH](https://github.com/neemakot/Health-Fact-Checking/blob/master/data/DATASHEET.md), which explores fact-checking of difficult-to-verify claims, i.e. those which require expertise from outside the journalistic domain, in this case biomedical and public health expertise. Each claim is associated with a label:
```
0: "false"
1: "mixture"
2: "true"
3: "unproven"
```
The network is a 12-layer BERT model, which is small enough to be executed on a single GPU but can still provide good performance.
%% Cell type:code id: tags:
``` python
# Install HuggingFace libraries. These libraries will be installed in your home
!pip install transformers datasets
```
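%% Cell type:markdown id: tags:
As an optional sanity check (an addition, not part of the original notebook), we can confirm that the freshly installed libraries are importable and print their versions:
%% Cell type:code id: tags:
``` python
# Optional: confirm that the installed libraries can be imported
import transformers
import datasets as hf_datasets

print('transformers version:', transformers.__version__)
print('datasets version:', hf_datasets.__version__)
```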
%% Cell type:code id:a37a3696-20b1-4fc7-baf4-a06add1b982c tags:
``` python
import torch
import numpy as np
```
%% Cell type:code id:5f8da724-d7f4-4d91-be12-a5bdd1630ca4 tags:
``` python
# Download the dataset
from datasets import load_dataset
# See https://huggingface.co/datasets/viewer/?dataset=health_fact
datasets = load_dataset("health_fact")
label_names = ["False", "Mixture", "True", "Unproven"]
```
%% Cell type:code id:2dab668b-0742-4de8-b904-d3f018a9bcbe tags:
``` python
print('Names of the datasets', list(datasets.keys()))
```
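%% Cell type:markdown id: tags:
Before tokenizing, it can help to look at the size of each split and how the labels are distributed. The sketch below (an addition to the original notebook) uses the `datasets` object and `label_names` defined above; claims with label `-1` are unlabelled and will be filtered out later.
%% Cell type:code id: tags:
``` python
from collections import Counter

# Number of claims in each split
for split_name, split in datasets.items():
    print(split_name, len(split))

# Label distribution in the training split (-1 marks unlabelled claims)
for label_id, count in sorted(Counter(datasets['train']['label']).items()):
    name = label_names[label_id] if label_id >= 0 else 'unlabelled'
    print(name, count)
```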
%% Cell type:code id:a6c6ed54-8feb-4845-89e4-1791d30eb629 tags:
``` python
# Download the tokenizer, which is in charge of preparing the inputs in the expected shape for the model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
def tokenize_function(examples):
    return tokenizer(examples["claim"], padding='max_length', truncation=True)
# Convert phrases into tokens
tokenized_datasets = datasets.map(tokenize_function, batched=True)
# Remove unlabelled data (and reduce the size, for this example)
train_ds = tokenized_datasets['train'].filter(lambda x: x['label'] > -1).select(range(1000))
val_ds = tokenized_datasets['validation'].filter(lambda x: x['label'] > -1).select(range(1000))
test_ds = tokenized_datasets['test'].filter(lambda x: x['label'] > -1).select(range(1000))
```
%% Cell type:code id:73ecc7e4-c0d6-45e2-bc2f-7f4ab75a717e tags:
``` python
# See an input example
sel_id = 25
print('Raw sentence:', train_ds[sel_id]['claim'])
print('Label', label_names[train_ds[sel_id]['label']])
print('Network input', train_ds[sel_id]['input_ids'])
print('Attention mask (which part of the input is of interest to the network)', train_ds[sel_id]['attention_mask'])
```
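%% Cell type:markdown id: tags:
To see how the `input_ids` relate to the original text, we can map them back to sub-word tokens and decode them (a small sketch, added here, using the same sample as above):
%% Cell type:code id: tags:
``` python
# Map the token ids back to sub-word tokens and to a readable string
tokens = tokenizer.convert_ids_to_tokens(train_ds[sel_id]['input_ids'])
print('First tokens:', tokens[:20])
print('Decoded text:', tokenizer.decode(train_ds[sel_id]['input_ids'], skip_special_tokens=True))
```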
%% Cell type:code id:05391cba-3a8a-4279-ad39-1875e1e28171 tags:
``` python
from transformers import BertForSequenceClassification, AdamW
# Load BertForSequenceClassification, the pretrained BERT model with a single
# linear classification layer on top.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",          # Use the 12-layer BERT model, with an uncased vocab.
    num_labels=len(label_names),  # The number of output labels (4 in this dataset).
    output_attentions=False,      # Whether the model returns attentions weights.
    output_hidden_states=False,   # Whether the model returns all hidden-states.
)
# Create the optimizer, which is in charge of updating the model weights
# during training
optimizer = AdamW(model.parameters(), lr=2e-5)
```
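%% Cell type:markdown id: tags:
The sketch below (an addition to the original notebook) counts the trainable parameters, which gives a rough idea of why this 12-layer model still fits on a single GPU:
%% Cell type:code id: tags:
``` python
# Count the trainable parameters (roughly 110M for bert-base models)
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print('Trainable parameters: {:,}'.format(n_params))
```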
%% Cell type:code id:c065f5d0-e191-4cf0-92ec-e445d261a313 tags:
``` python
device = 'cuda'  # Use 'cpu' for debugging purposes
# Tell pytorch to run this model on the GPU.
if device == 'cuda':
    model.cuda()
```
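%% Cell type:markdown id: tags:
A device-agnostic alternative (a sketch, equivalent to the cell above) is to select the device based on what is actually available and move the model with `to`:
%% Cell type:code id: tags:
``` python
# Fall back to the CPU automatically when no GPU is visible
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
```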
%% Cell type:code id:49bc5172-3504-4c01-85b5-d11a6e203635 tags:
``` python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
# The DataLoader needs to know our batch size for training, so we specify it
# here. For fine-tuning BERT on a specific task, the authors recommend a batch
# size of 16 or 32.
batch_size = 32
# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
    train_ds,                         # The training samples.
    sampler=RandomSampler(train_ds),  # Select batches randomly
    batch_size=batch_size             # Trains with this batch size.
)
# For validation and test the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
    val_ds,                            # The validation samples.
    sampler=SequentialSampler(val_ds), # Pull out batches sequentially.
    batch_size=batch_size              # Evaluate with this batch size.
)
test_dataloader = DataLoader(
    test_ds,                            # The test samples.
    sampler=SequentialSampler(test_ds), # Pull out batches sequentially.
    batch_size=batch_size               # Evaluate with this batch size.
)
# Function to convert input in the expected shape for the network
def format_batch_to_input(batch, device):
    return torch.stack(batch['input_ids']).T.to(device), \
           torch.stack(batch['attention_mask']).T.to(device), \
           torch.atleast_2d(batch['label']).to(device)
```
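%% Cell type:markdown id: tags:
To check that `format_batch_to_input` produces tensors of the expected shape, we can format a single batch on the CPU (a sketch, added for illustration):
%% Cell type:code id: tags:
``` python
# Inspect the shapes of one formatted training batch
example_batch = next(iter(train_dataloader))
b_ids, b_mask, b_lbls = format_batch_to_input(example_batch, 'cpu')
print('input_ids:     ', b_ids.shape)   # (batch_size, max sequence length)
print('attention_mask:', b_mask.shape)  # (batch_size, max sequence length)
print('labels:        ', b_lbls.shape)
```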
%% Cell type:code id:344e8d69-0603-4a5e-9580-ff0e3723d42c tags:
``` python
# Function to calculate the accuracy of our predictions vs labels
def accuracy_fun(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
```
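%% Cell type:markdown id: tags:
A quick check of `accuracy_fun` on toy values (hypothetical numbers, just to illustrate the expected input shapes):
%% Cell type:code id: tags:
``` python
# Toy example: 3 samples, 4 classes; 2 of the 3 predictions are correct
toy_logits = np.array([[0.1, 0.9, 0.0, 0.0],   # predicted class 1
                       [0.8, 0.1, 0.1, 0.0],   # predicted class 0
                       [0.2, 0.2, 0.5, 0.1]])  # predicted class 2
toy_labels = np.array([1, 0, 3])
print(accuracy_fun(toy_logits, toy_labels))    # expected output: 0.666...
```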
%% Cell type:code id:95a5f1ff-2dd4-4083-9f0f-fe7bb85c15d5 tags:
``` python
from tqdm.notebook import tqdm
# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128
# Number of training epochs
epochs = 3
# We'll store a number of quantities such as training and validation loss,
# validation accuracy, and timings.
training_stats = []
# For each epoch...
for epoch_i in range(epochs):
    # ========================================
    #               Training
    # ========================================
    # Perform one full pass over the training set.
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')
    # Reset the total loss for this epoch.
    total_train_loss = 0
    # Put the model into training mode. Don't be misled--the call to
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()
    # For each batch of training data...
    for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
        # Read training batch from our dataloader.
        b_input_ids, b_input_mask, b_labels = format_batch_to_input(batch, device)
        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because
        # accumulating the gradients is "convenient while training RNNs".
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()
        # Perform a forward pass (evaluate the model on this training batch).
        loss, logits = model.forward(b_input_ids,
                                     token_type_ids=None,
                                     attention_mask=b_input_mask,
                                     labels=b_labels).values()
        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end.
        total_train_loss += loss
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update weights and take a step using the computed gradient.
        optimizer.step()
    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.
    print("")
    print("Running Validation...")
    # Put the model in evaluation mode: some layers behave differently
    # during evaluation.
    model.eval()
    # Tracking variables
    total_eval_accuracy = 0
    total_eval_loss = 0
    # Iterate over validation batches
    for batch in tqdm(validation_dataloader, total=len(validation_dataloader)):
        # Read validation batch from our dataloader.
        b_input_ids, b_input_mask, b_labels = format_batch_to_input(batch, device)
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            (loss, logits) = model.forward(b_input_ids,
                                           token_type_ids=None,
                                           attention_mask=b_input_mask,
                                           labels=b_labels).values()
        # Accumulate the validation loss.
        total_eval_loss += loss
        # Calculate the accuracy for this batch of test sentences, and
        # accumulate it over all batches.
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        total_eval_accuracy += accuracy_fun(logits, label_ids)
    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))
    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
        }
    )
print("")
print("Training complete!")
```
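%% Cell type:markdown id: tags:
The recorded `training_stats` can be summarised in a small per-epoch table. This is a sketch that assumes `pandas` is available in the environment; the losses are converted to plain floats because the loop above accumulates them as tensors.
%% Cell type:code id: tags:
``` python
import pandas as pd

# Turn the recorded statistics into a readable per-epoch table
df_stats = pd.DataFrame(training_stats).set_index('epoch')
for col in ['Training Loss', 'Valid. Loss']:
    df_stats[col] = df_stats[col].apply(float)
print(df_stats)
```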
%% Cell type:code id:9e52da0a-911c-46f9-a404-229e7d71dfdd tags:
``` python
# Evaluation on the external test set
from datasets import load_metric
metric = load_metric("accuracy")
model.eval()
for batch in tqdm(test_dataloader):
    # Unpack this test batch from our dataloader.
    b_input_ids, b_input_mask, b_labels = format_batch_to_input(batch, device)
    with torch.no_grad():
        (loss, logits) = model.forward(b_input_ids,
                                       token_type_ids=None,
                                       attention_mask=b_input_mask,
                                       labels=b_labels).values()
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch['label'])
metric.compute()
```
%% Cell type:code id:f1516e2a-e025-4b28-807b-c02f83584978 tags:
``` python
sel_id = 2
print(batch['claim'][sel_id])
print('Expected response', label_names[int(batch['label'][sel_id].numpy())])
print('Predicted response', label_names[int(predictions[sel_id].cpu().numpy())])
```
%% Cell type:markdown id:5197ddca-4bba-414d-947b-df513554aaa9 tags:
## We have learned
* How to download a dataset and a pre-trained network for PyTorch
* How to train a model in PyTorch
* The different paradigms of PyTorch and Keras
## Exercise
* Save the model with the best validation accuracy (a starting sketch for saving a model follows below)
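%% Cell type:markdown id: tags:
As a starting point for the exercise (a sketch, not the full solution): the fine-tuned model and tokenizer can be written to disk with `save_pretrained`, so inside the training loop you would only need to call it whenever the validation accuracy improves.
%% Cell type:code id: tags:
``` python
# Save the current model and tokenizer to a local folder (hypothetical path)
output_dir = './best_model'
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
# The model can later be reloaded with:
# BertForSequenceClassification.from_pretrained(output_dir)
```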