Commit 3b5cd96d authored by Roberto Ugolotti

Add an example in which we create an environment

parent 87c9285e
%% Cell type:markdown id:cb377aac-a569-4b78-bc49-d3c01a2023b2 tags:
# Detect Fake News on Health using PyTorch 1.9
%% Cell type:markdown id:a25fd7f5-700e-4981-966f-43a90efe2719 tags:
## CONDA Environment
This example is the same as the one shown in torch_on_text, but here we use PyTorch 1.9. To do so, we create an ad-hoc conda environment in which we install all the libraries that we need. See the BDAP [documentation](https://jeodpp.jrc.ec.europa.eu/apps/gitlab/for-everyone/documentation/-/wikis/Jeodpp_services/JEO-lab/GPU-Machine-Learning-with-Conda) on GitLab.
## How To Run it
Copy the commands in the cell below and run them in a terminal. Once you have done that, you should have a conda environment called myenv located under /home/YOUR_USERNAME/conda/envs and a kernel called my_environment. Select this kernel from the top right panel and run the notebook.
%% Cell type:raw id:c69eef8a-e19e-4c2a-8bf7-a40e3ef5c7fd tags:
# Create a CONDA environment
conda create -n myenv python=3.8 -y
conda activate myenv
pip install torch==1.9 transformers datasets ipykernel torchtext scikit-learn
python -m ipykernel install --user --name=my_environment
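%% Cell type:markdown tags:
As a quick sanity check (not part of the environment setup itself), the cell below verifies that the selected kernel is really using the environment created above: it prints the PyTorch version (expected to start with 1.9) and whether a GPU is visible.
%% Cell type:code tags:
``` python
# Sanity check: confirm that the kernel picked up the conda environment.
import torch
print('PyTorch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
```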
%% Cell type:code id:a37a3696-20b1-4fc7-baf4-a06add1b982c tags:
``` python
import torch
import numpy as np
```
%% Cell type:code id:5f8da724-d7f4-4d91-be12-a5bdd1630ca4 tags:
``` python
# Download the dataset
from datasets import load_dataset
# See https://huggingface.co/datasets/viewer/?dataset=health_fact
datasets = load_dataset("health_fact")
label_names = ["False", "Mixture", "True", "Unproven"]
```
%% Cell type:code id:2dab668b-0742-4de8-b904-d3f018a9bcbe tags:
``` python
print('Names of the datasets', list(datasets.keys()))
```
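%% Cell type:markdown tags:
To get a feeling for the data, the optional cell below counts how many examples of each class appear in the training split. It assumes the health_fact convention that a label of -1 marks unlabelled examples, which are filtered out later.
%% Cell type:code tags:
``` python
# Optional: inspect the label distribution of the training split.
labels = np.array(datasets['train']['label'])
values, counts = np.unique(labels, return_counts=True)
for value, count in zip(values, counts):
    name = label_names[value] if value >= 0 else 'unlabelled'
    print(f'{name}: {count}')
```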
%% Cell type:code id:a6c6ed54-8feb-4845-89e4-1791d30eb629 tags:
``` python
# Download the tokenizer, which is in charge of preparing the inputs in the shape expected by the model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
def tokenize_function(examples):
    return tokenizer(examples["claim"], padding='max_length', truncation=True)
# Convert phrases into tokens
tokenized_datasets = datasets.map(tokenize_function, batched=True)
# Remove unlabelled data (and reduce the size, for this example)
train_ds = tokenized_datasets['train'].filter(lambda x: x['label'] > -1).select(range(1000))
val_ds = tokenized_datasets['validation'].filter(lambda x: x['label'] > -1).select(range(1000))
test_ds = tokenized_datasets['test'].filter(lambda x: x['label'] > -1).select(range(1000))
```
%% Cell type:code id:73ecc7e4-c0d6-45e2-bc2f-7f4ab75a717e tags:
``` python
# See an input example
sel_id = 25
print('Raw sentence:', train_ds[sel_id]['claim'])
print('Label', label_names[train_ds[sel_id]['label']])
print('Network input', train_ds[sel_id]['input_ids'])
print('Attention mask (which part of the input the network should attend to)', train_ds[sel_id]['attention_mask'])
```
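%% Cell type:markdown tags:
The `input_ids` printed above are numeric token ids. As an optional check, the cell below maps them back to the BERT word pieces they represent (dropping the `[PAD]` tokens added by `padding='max_length'`), so the correspondence between the raw claim and the network input is easier to see.
%% Cell type:code tags:
``` python
# Optional: map the numeric input_ids back to the BERT word pieces they encode.
tokens = tokenizer.convert_ids_to_tokens(train_ds[sel_id]['input_ids'])
# Drop the [PAD] tokens added by padding='max_length' to keep the output readable.
print([t for t in tokens if t != '[PAD]'])
```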
%% Cell type:code id:05391cba-3a8a-4279-ad39-1875e1e28171 tags:
``` python
from transformers import BertForSequenceClassification, AdamW
# Load BertForSequenceClassification, the pretrained BERT model with a single
# linear classification layer on top.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",          # Use the 12-layer BERT model, with an uncased vocab.
    num_labels=len(label_names),  # The number of output labels (4 for this dataset).
    output_attentions=False,      # Whether the model returns attention weights.
    output_hidden_states=False,   # Whether the model returns all hidden states.
)
# Create the optimizer, which is in charge of updating the model weights
# during training
optimizer = AdamW(model.parameters(), lr=2e-5)
```
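%% Cell type:markdown tags:
As a rough indication of the model size, the optional cell below counts the parameters that will be updated during fine-tuning (about 110 million for bert-base).
%% Cell type:code tags:
``` python
# Optional: count the trainable parameters of the model.
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Trainable parameters: {n_trainable:,}')
```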
%% Cell type:code id:c065f5d0-e191-4cf0-92ec-e445d261a313 tags:
``` python
device = 'cuda' # Use 'cpu' for debugging purposes
# Tell pytorch to run this model on the GPU.
if device == 'cuda':
    model.cuda()
```
%% Cell type:code id:49bc5172-3504-4c01-85b5-d11a6e203635 tags:
``` python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
# The DataLoader needs to know our batch size for training, so we specify it
# here. For fine-tuning BERT on a specific task, the authors recommend a batch
# size of 16 or 32.
batch_size = 32
# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
    train_ds,                         # The training samples.
    sampler=RandomSampler(train_ds),  # Select batches randomly.
    batch_size=batch_size             # Train with this batch size.
)
# For validation and test the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
    val_ds,                             # The validation samples.
    sampler=SequentialSampler(val_ds),  # Pull out batches sequentially.
    batch_size=batch_size               # Evaluate with this batch size.
)
test_dataloader = DataLoader(
    test_ds,                             # The test samples.
    sampler=SequentialSampler(test_ds),  # Pull out batches sequentially.
    batch_size=batch_size                # Evaluate with this batch size.
)
# Function to convert a batch into the shape expected by the network
def format_batch_to_input(batch, device):
    return torch.stack(batch['input_ids']).T.to(device), \
           torch.stack(batch['attention_mask']).T.to(device), \
           torch.atleast_2d(batch['label']).to(device)
```
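%% Cell type:markdown tags:
`format_batch_to_input` stacks and transposes the per-position tensors produced by the default collate function, so each output tensor has the batch dimension first. The optional cell below pulls a single batch on the CPU just to show the resulting shapes.
%% Cell type:code tags:
``` python
# Optional: check the tensor shapes produced by format_batch_to_input on one batch.
sample_batch = next(iter(train_dataloader))
b_ids, b_mask, b_labels = format_batch_to_input(sample_batch, 'cpu')
print('input_ids:     ', b_ids.shape)   # (batch_size, max_length)
print('attention_mask:', b_mask.shape)  # (batch_size, max_length)
print('labels:        ', b_labels.shape)
```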
%% Cell type:code id:344e8d69-0603-4a5e-9580-ff0e3723d42c tags:
``` python
# Function to calculate the accuracy of our predictions vs labels
def accuracy_fun(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
```
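%% Cell type:markdown tags:
A tiny illustration of `accuracy_fun` with made-up logits: the predicted class is the argmax of each row, and two of the three predictions below match the labels.
%% Cell type:code tags:
``` python
# Tiny illustration with made-up logits: predictions are [0, 1, 3], labels are [0, 1, 2].
dummy_logits = np.array([[2.0, 0.1, 0.0, 0.0],
                         [0.0, 1.5, 0.2, 0.0],
                         [0.0, 0.0, 0.3, 2.2]])
dummy_labels = np.array([0, 1, 2])
print('Accuracy:', accuracy_fun(dummy_logits, dummy_labels))  # 2 out of 3 correct
```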
%% Cell type:code id:95a5f1ff-2dd4-4083-9f0f-fe7bb85c15d5 tags:
``` python
from tqdm.notebook import tqdm
# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128
# Number of training epochs
epochs = 3
# We'll store a number of quantities such as training and validation loss,
# validation accuracy, and timings.
training_stats = []
# For each epoch...
for epoch_i in range(epochs):
    # ========================================
    #               Training
    # ========================================
    # Perform one full pass over the training set.
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')
    # Reset the total loss for this epoch.
    total_train_loss = 0
    # Put the model into training mode. Don't be misled--the call to
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()
    # For each batch of training data...
    for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
        # Read a training batch from our dataloader.
        b_input_ids, b_input_mask, b_labels = format_batch_to_input(batch, device)
        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because
        # accumulating the gradients is "convenient while training RNNs".
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()
        # Perform a forward pass (evaluate the model on this training batch).
        loss, logits = model(b_input_ids,
                             token_type_ids=None,
                             attention_mask=b_input_mask,
                             labels=b_labels).values()
        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end.
        total_train_loss += loss.item()
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)