Commit 87c9285e authored by Roberto Ugolotti

Simplify: do not create a conda environment but just install packages

parent d2e4165b
%% Cell type:markdown id:cb377aac-a569-4b78-bc49-d3c01a2023b2 tags:
# Detect Fake News on Health using a pre-trained PyTorch Network
%% Cell type:markdown id:a25fd7f5-700e-4981-966f-43a90efe2719 tags:
## HuggingFace libraries
In this example, we are going to use the libraries provided by HuggingFace. This set of libraries provides open-source code, mostly devoted to Natural Language Processing.
We are going to download both the network and the dataset using the libraries `transformers` and `datasets`.
The dataset is PUBHEALTH, which explores fact-checking of difficult-to-verify claims, i.e., those which require expertise from outside the journalistic domain, in this case biomedical and public health expertise. Each claim is associated with a label:
* 0: "false"
* 1: "mixture"
* 2: "true"
* 3: "unproven"
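
As a quick illustration of the label encoding above, the integer ids can be mapped to readable names with a plain dictionary. This is only a sketch: `ID2LABEL`, `LABEL2ID` and `decode_label` are hypothetical names, not part of the dataset or of the HuggingFace libraries, and treating negative ids as unlabelled is an assumption that follows the `label > -1` filter used later in this notebook.

```python
# Label ids as listed above; the names ID2LABEL, LABEL2ID and decode_label
# are illustrative only and do not come from any library.
ID2LABEL = {0: "false", 1: "mixture", 2: "true", 3: "unproven"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode_label(label_id):
    """Return the label name, or None for ids outside the table (e.g. -1)."""
    return ID2LABEL.get(label_id)


print(decode_label(2))
print(decode_label(-1))
```

Returning `None` for unknown ids keeps the lookup from raising on unlabelled rows, which is why such rows are filtered out before training.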
The network is a 12-layer BERT model, which is small enough to run on a single GPU but can still provide good performance.
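
To give a rough idea of why this model fits on a single GPU, the sketch below counts the parameters of the BERT encoder by hand. The configuration values are the published ones for `bert-base-uncased` and are hard-coded here as an assumption (they are not read from the downloaded model). The count excludes the small classification layer added on top, and it does not cover the additional memory needed for gradients, optimizer state and activations during training.

```python
# Published bert-base-uncased configuration values (assumed, not read from the model).
VOCAB_SIZE = 30522
MAX_POSITIONS = 512
TYPE_VOCAB = 2
HIDDEN = 768
INTERMEDIATE = 3072
NUM_LAYERS = 12


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # one scale and one shift vector

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm
# Query, key, value and output projections, followed by a LayerNorm.
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm
# Two dense layers of the feed-forward block, followed by a LayerNorm.
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm
# Dense layer applied to the first token's hidden state.
pooler = linear(HIDDEN, HIDDEN)

total = embeddings + NUM_LAYERS * (attention + feed_forward) + pooler
print(f"{total:,} parameters")
print(f"{total * 4 / 1e6:.0f} MB of weights in float32")
```

Under these assumptions the weights alone come to roughly 110 million parameters, well under half a gigabyte in 32-bit floats.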
## CONDA Environment
%% Cell type:code id: tags:
We are also going to create an ad-hoc conda environment in which we will install all the libraries that we need. See the BDAP documentation on GitLab.
%% Cell type:raw id:c69eef8a-e19e-4c2a-8bf7-a40e3ef5c7fd tags:
# Create a CONDA environment
conda create -n myenv python=3.8 -y
conda activate myenv
pip install torch==1.9 transformers datasets ipykernel torchtext scikit-learn
python -m ipykernel install --user --name=my_environment
``` python
# Install HuggingFace libraries. These libraries will be installed in your home directory
!pip install transformers datasets
```
%% Cell type:code id:a37a3696-20b1-4fc7-baf4-a06add1b982c tags:
``` python
import torch
import numpy as np
```
%% Cell type:code id:5f8da724-d7f4-4d91-be12-a5bdd1630ca4 tags:
``` python
# Download the dataset
from datasets import load_dataset
# See
datasets = load_dataset("health_fact")
label_names = ["False", "Mixture", "True", "Unproven"]
```
%% Cell type:code id:2dab668b-0742-4de8-b904-d3f018a9bcbe tags:
``` python
print('Names of the datasets', list(datasets.keys()))
```
%% Cell type:code id:a6c6ed54-8feb-4845-89e4-1791d30eb629 tags:
``` python
# Download the tokenizer, which prepares the inputs in the shape expected by the model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples["claim"], padding='max_length', truncation=True)

# Convert phrases into tokens
tokenized_datasets = datasets.map(tokenize_function, batched=True)
# Remove unlabelled data (and reduce the size, for this example)
train_ds = tokenized_datasets['train'].filter(lambda x: x['label'] > -1).select(range(1000))
val_ds = tokenized_datasets['validation'].filter(lambda x: x['label'] > -1).select(range(1000))
test_ds = tokenized_datasets['test'].filter(lambda x: x['label'] > -1).select(range(1000))
```
%% Cell type:code id:73ecc7e4-c0d6-45e2-bc2f-7f4ab75a717e tags:
``` python
# See an input example
sel_id = 25
print('Raw sentence:', train_ds[sel_id]['claim'])
print('Label', label_names[train_ds[sel_id]['label']])
print('Network input', train_ds[sel_id]['input_ids'])
print('Attention mask (which part of the input is of interest to the network)', train_ds[sel_id]['attention_mask'])
```
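
The relation between the network input and the attention mask printed above can be sketched in plain Python. This is a simplified stand-in for what `padding='max_length'` and `truncation=True` do, not the tokenizer's actual implementation: `pad_and_mask` is a hypothetical helper, the token ids are made up, the padding id of 0 is an assumption, and a real BERT tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Cut or pad a list of token ids to max_length and build the matching mask."""
    ids = token_ids[:max_length]          # simplified truncation
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)         # number of padding positions to add
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Made-up token ids; real ids come from the tokenizer vocabulary.
input_ids, attention_mask = pad_and_mask([101, 7592, 2088, 102], max_length=8)
print(input_ids)
print(attention_mask)
```

Every position holding a padding id gets a 0 in the mask, which tells the network to ignore it.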
%% Cell type:code id:05391cba-3a8a-4279-ad39-1875e1e28171 tags:
``` python
from transformers import BertForSequenceClassification, AdamW
# Load BertForSequenceClassification, the pretrained BERT model with a single
# linear classification layer on top.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # Use the 12-layer BERT model, with an uncased vocab.
    num_labels=len(label_names),  # The number of output labels: 4 for this dataset.
    output_attentions=False,  # Whether the model returns attention weights.
    output_hidden_states=False,  # Whether the model returns all hidden states.
)
# Create the optimizer, which is in charge of updating the model weights
# during training
optimizer = AdamW(model.parameters(), lr=2e-5)
```
%% Cell type:code id:c065f5d0-e191-4cf0-92ec-e445d261a313 tags:
``` python
device = 'cuda'  # Use 'cpu' for debugging purposes
# Tell pytorch to run this model on the GPU.
if device == 'cuda':
    model.cuda()
```
%% Cell type:code id:49bc5172-3504-4c01-85b5-d11a6e203635 tags:
``` python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
# The DataLoader needs to know our batch size for training, so we specify it
# here. For fine-tuning BERT on a specific task, the authors recommend a batch
# size of 16 or 32.
batch_size = 32
# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
    train_ds,  # The training samples.
    sampler=RandomSampler(train_ds),  # Select batches randomly
    batch_size=batch_size  # Trains with this batch size.
)
# For validation and test the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
    val_ds,  # The validation samples.
    sampler=SequentialSampler(val_ds),  # Pull out batches sequentially.
    batch_size=batch_size  # Evaluate with this batch size.
)
test_dataloader = DataLoader(
    test_ds,  # The test samples.
    sampler=SequentialSampler(test_ds),  # Pull out batches sequentially.
    batch_size=batch_size  # Evaluate with this batch size.
)

# Function to convert input in the expected shape for the network
def format_batch_to_input(batch, device):
    # The default collate function yields one tensor per token position, so we
    # stack along dim=1 to obtain tensors of shape (batch_size, sequence_length).
    return torch.stack(batch['input_ids'], dim=1).to(device), \
           torch.stack(batch['attention_mask'], dim=1).to(device), \
           batch['label'].to(device)
```
%% Cell type:code id:344e8d69-0603-4a5e-9580-ff0e3723d42c tags:
``` python
# Function to calculate the accuracy of our predictions vs labels
def accuracy_fun(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
```
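
To make the behaviour of `accuracy_fun` concrete, here is a small worked example. The function body is repeated so that the snippet is self-contained; the logits and labels are made up for illustration and are not taken from the dataset.

```python
import numpy as np


def accuracy_fun(preds, labels):
    # Same logic as the function above: pick the highest-scoring class per row
    # and compare it with the expected label.
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)


# Made-up logits for three claims and four classes (False, Mixture, True, Unproven).
logits = np.array([
    [2.0, 0.1, 0.3, -1.0],  # highest score at index 0
    [0.2, 0.1, 1.5, 0.0],   # highest score at index 2
    [0.0, 0.9, 0.4, 0.3],   # highest score at index 1
])
labels = np.array([0, 2, 3])  # the third prediction does not match

print(accuracy_fun(logits, labels))
```

Two of the three predicted classes match the labels, so the function returns two thirds.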
%% Cell type:code id:95a5f1ff-2dd4-4083-9f0f-fe7bb85c15d5 tags:
``` python
from tqdm.notebook import tqdm
# This training code is based on the `` script here:
# Number of training epochs
epochs = 3
# We'll store a number of quantities such as training and validation loss,
# validation accuracy, and timings.
training_stats = []
# For each epoch...
for epoch_i in range(epochs):
    # ========================================
    # Training
    # ========================================
    # Perform one full pass over the training set.
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    # Reset the total loss for this epoch.
    total_train_loss = 0
    # Put the model into training mode. Don't be misled--the call to
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source:
    model.train()
    # For each batch of training data...
    for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
        # Read training batch from our dataloader.
        b_input_ids, b_input_mask, b_labels = format_batch_to_input(batch, device)
        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because
        # accumulating the gradients is "convenient while training RNNs".
        # (source:
        model.zero_grad()
        # Perform a forward pass (evaluate the model on this training batch).
        # return_dict=False makes the model return a (loss, logits) tuple.
        loss, logits = model.forward(b_input_ids,
                                     token_type_ids=None,
                                     attention_mask=b_input_mask,
                                     labels=b_labels,
                                     return_dict=False)
        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `.item()` extracts a plain
        # Python float, so the computation graph is not kept alive.
        total_train_loss += loss.item()
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update weights and take a step using the computed gradient.
        optimizer.step()
    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)
    print(" Average training loss: {0:.2f}".format(avg_train_loss))
    # ========================================
    # Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.
    print("Running Validation...")
    # Put the model in evaluation mode: some layers behave differently
    # during evaluation.
    model.eval()
    # Tracking variables
    total_eval_accuracy = 0
    total_eval_loss = 0
    # Iterate over the validation batches
    for batch in tqdm(validation_dataloader, total=len(validation_dataloader)):
        # Read validation batch from our dataloader.
        b_input_ids, b_input_mask, b_labels = format_batch_to_input(batch, device)
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            (loss, logits) = model.forward(b_input_ids,
                                           token_type_ids=None,
                                           attention_mask=b_input_mask,
                                           labels=b_labels,
                                           return_dict=False)
        # Accumulate the validation loss.
        total_eval_loss += loss.item()
        # Calculate the accuracy for this batch of validation sentences, and
        # accumulate it over all batches.
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        total_eval_accuracy += accuracy_fun(logits, label_ids)
    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print(" Accuracy: {0:.2f}".format(avg_val_accuracy))
    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    print(" Validation Loss: {0:.2f}".format(avg_val_loss))
    # Record all statistics from this epoch.
    training_stats.append({
        'epoch': epoch_i + 1,
        'Training Loss': avg_train_loss,
        'Valid. Loss': avg_val_loss,
        'Valid. Accur.': avg_val_accuracy,
    })
print("Training complete!")
```
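
One detail of the validation loop is worth a small illustration: it averages the per-batch accuracies, so every batch counts equally. With 1000 samples and a batch size of 32, the last batch holds only 8 samples, and the result can differ slightly from the accuracy computed over all samples. The sketch below uses made-up numbers of correct predictions; the helper names are hypothetical.

```python
def batch_sizes(n_samples, batch_size):
    """Sizes of consecutive batches when the last, smaller batch is kept."""
    full, rest = divmod(n_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def mean_of_batch_accuracies(correct_per_batch, sizes):
    """Average of per-batch accuracies, as computed in the validation loop."""
    return sum(c / s for c, s in zip(correct_per_batch, sizes)) / len(sizes)


def overall_accuracy(correct_per_batch, sizes):
    """Accuracy over all samples, where every sample counts equally."""
    return sum(correct_per_batch) / sum(sizes)


sizes = batch_sizes(1000, 32)  # 31 batches of 32 samples and one batch of 8
# Made-up results: half of every full batch is correct, the last batch is all correct.
correct = [16] * 31 + [8]

print(len(sizes), sizes[-1])
print(mean_of_batch_accuracies(correct, sizes))
print(overall_accuracy(correct, sizes))
```

The gap is small here, but it grows when the last batch is tiny or behaves very differently from the others. The test cell avoids the issue because the accuracy metric accumulates individual predictions.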
%% Cell type:code id:9e52da0a-911c-46f9-a404-229e7d71dfdd tags:
``` python
# Evaluation on the external test set
from datasets import load_metric
metric = load_metric("accuracy")
for batch in tqdm(test_dataloader):
    # Unpack this test batch from our dataloader.
    b_input_ids, b_input_mask, b_labels = format_batch_to_input(batch, device)
    with torch.no_grad():
        (loss, logits) = model.forward(b_input_ids,
                                       token_type_ids=None,
                                       attention_mask=b_input_mask,
                                       labels=b_labels,
                                       return_dict=False)
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch['label'])
# Compute the accuracy over all the accumulated batches
metric.compute()
```
%% Cell type:code id:f1516e2a-e025-4b28-807b-c02f83584978 tags:
``` python
sel_id = 2
print('Expected response', label_names[int(batch['label'][sel_id].numpy())])
print('Predicted response', label_names[int(predictions[sel_id].cpu().numpy())])
```
%% Cell type:markdown id:5197ddca-4bba-414d-947b-df513554aaa9 tags:
## We have learned
* How to download a dataset and a pre-trained network for PyTorch
* How to train a model in PyTorch
* How the PyTorch paradigm differs from the Keras one
## Exercise
* Save the model with the best validation accuracy
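
One possible way to approach this exercise is to remember the best validation accuracy seen so far and to trigger a save only when it improves. The sketch below is a hint rather than a reference solution: `BestModelTracker` and its `save_fn` argument are hypothetical names, and the accuracies in the usage example are made up.

```python
class BestModelTracker:
    """Call save_fn whenever the validation accuracy beats the best seen so far."""

    def __init__(self, save_fn):
        self.save_fn = save_fn
        self.best_accuracy = float("-inf")
        self.best_epoch = None

    def update(self, epoch, accuracy):
        """Record the accuracy of one epoch; return True if a save was triggered."""
        if accuracy > self.best_accuracy:
            self.best_accuracy = accuracy
            self.best_epoch = epoch
            self.save_fn()
            return True
        return False


# Usage with made-up accuracies; here "saving" just appends to a list.
saved = []
tracker = BestModelTracker(save_fn=lambda: saved.append("checkpoint"))
for epoch, acc in enumerate([0.55, 0.61, 0.58], start=1):
    tracker.update(epoch, acc)

print(tracker.best_epoch, tracker.best_accuracy, len(saved))
```

In the notebook, `update` could be called at the end of each epoch with `avg_val_accuracy`, and `save_fn` could wrap a call such as `model.save_pretrained('best_model')`, where the directory name is only an example.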