In a perfect world, we would have data labeled with a group type and could apply similarity techniques from there. We could then use supervised fine-tuning, via methods like NLI, to teach the model the domain. Unfortunately, that isn’t the case here. To fine-tune on unlabeled data, we will use an unsupervised fine-tuning technique known as Masked Language Modeling (MLM).
MLM boils down to masking (blanking out) roughly 15% of the tokens across our sample texts and asking the model to predict the masked words. This forces the model to learn contextual word representations in the lifting domain.
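To make the objective concrete, here is a tiny illustration of the fill-in-the-blank task using the off-the-shelf checkpoint we start from. The sentence is made up for illustration; this snippet is just a sketch, not part of the pipeline below:

from transformers import pipeline

# Toy MLM illustration: the model must recover the blanked-out token from context alone
fill = pipeline('fill-mask', model='FacebookAI/roberta-base')
for pred in fill('I hit a new squat <mask> of 315 pounds this week.')[:3]:
    print(pred['token_str'], round(pred['score'], 3))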
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
import torch
import torch.nn.functional as F
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from transformers import RobertaTokenizerFast, RobertaModel, RobertaForMaskedLM, get_linear_schedule_with_warmup
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
# Load the data
program_features = pd.read_csv('drive/MyDrive/liftingML/data/program_features.csv')
program_features.drop(columns=['Unnamed: 0'], inplace=True)
huge_data = pd.read_csv('drive/MyDrive/liftingML/data/cleaned_600k.csv')
huge_data.drop(columns=['Unnamed: 0'], inplace=True)
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
tokenizer = RobertaTokenizerFast.from_pretrained('FacebookAI/roberta-base')
roberta_model = RobertaForMaskedLM.from_pretrained('FacebookAI/roberta-base')
descriptions = program_features['description'].tolist()
# Tokenize the descriptions to be fed into the model
inputs = tokenizer(descriptions, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
# Create the labels we compare the masked tokens against
inputs['labels'] = inputs.input_ids.detach().clone()

# Randomly select 15% of tokens to be masked.
# We do not want to overwrite the start (<s>), end (</s>), or padding (<pad>) tokens, so we cut them out.
# Note: RoBERTa's special-token ids differ from BERT's (101/102/103/0), so we read them off the tokenizer instead of hard-coding them.
rand = torch.rand(inputs.input_ids.shape)
mask_arr = (rand < 0.15) * (inputs.input_ids != tokenizer.cls_token_id) * \
           (inputs.input_ids != tokenizer.sep_token_id) * (inputs.input_ids != tokenizer.pad_token_id)
# Get the randomly chosen tokens for each text and apply the masking token
selection = []
for i in range(inputs.input_ids.shape[0]):
    selection.append(
        torch.flatten(mask_arr[i].nonzero()).tolist()
    )
for i in range(inputs.input_ids.shape[0]):
    inputs.input_ids[i, selection[i]] = tokenizer.mask_token_id

# Custom Dataset to use in the DataLoader during training
class LiftingDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
        # The encodings are already tensors, so slice them directly
        return {key: val[idx] for key, val in self.encodings.items()}
    def __len__(self):
        return self.encodings.input_ids.shape[0]

dataset = LiftingDataset(inputs)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
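As a side note, the transformers library also ships a collator that performs this masking on the fly instead of once up front. The sketch below is an alternative, not what this notebook uses; it assumes the same tokenizer and descriptions list defined above:

from transformers import DataCollatorForLanguageModeling

# Dynamic-masking alternative (not used here): the collator picks fresh masked positions
# every batch and builds the labels itself, so the manual masking and label copying above
# would be dropped entirely.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
unmasked = tokenizer(descriptions, max_length=512, truncation=True)
examples = [{'input_ids': ids, 'attention_mask': am}
            for ids, am in zip(unmasked['input_ids'], unmasked['attention_mask'])]
dynamic_loader = torch.utils.data.DataLoader(examples, batch_size=16, shuffle=True, collate_fn=collator)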
roberta_model.to(device)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (lm_head): RobertaLMHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (decoder): Linear(in_features=768, out_features=50265, bias=True)
  )
)

# Optimizer used is Adam with weight decay (AdamW), which reduces the chance of overfitting
# The optimizer drives training by updating each parameter from its gradient; AdamW also adapts the learning rate per parameter
roberta_model.train()
optim = AdamW(roberta_model.parameters(), lr=5e-5)

epochs = 3
for epoch in range(epochs):
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # Initialize calculated gradients
        optim.zero_grad()
        # Get all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # Forward pass
        outputs = roberta_model(input_ids, attention_mask=attention_mask,
                                labels=labels)
        # Extract loss
        loss = outputs.loss
        # Backpropagate to compute gradients for every parameter that needs an update
        loss.backward()
        # Update parameters
        optim.step()
        # Progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
Epoch 0: 100%|██████████| 165/165 [05:46<00:00, 2.10s/it, loss=0.553]
Epoch 1: 100%|██████████| 165/165 [05:53<00:00, 2.14s/it, loss=0.0874]
Epoch 2: 100%|██████████| 165/165 [05:52<00:00, 2.14s/it, loss=0.0645]
roberta_model.save_pretrained('drive/MyDrive/liftingML/roberta_finetuned')
tokenizer.save_pretrained('drive/MyDrive/liftingML/roberta_finetuned')
('./roberta_finetuned/tokenizer_config.json',
 './roberta_finetuned/special_tokens_map.json',
 './roberta_finetuned/vocab.json',
 './roberta_finetuned/merges.txt',
 './roberta_finetuned/added_tokens.json',
 './roberta_finetuned/tokenizer.json')

Converting Fine-tuned Model to a Sentence Transformer

RobertaForMaskedLM only produces per-token predictions, so to get a single vector per program description we load the fine-tuned weights into the base RobertaModel and mean-pool its token embeddings.
class CustomSentenceEmbedder:
    def __init__(self, model_name="./roberta_finetuned", device=None):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.model_name = model_name
        self.tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
        self.model = RobertaModel.from_pretrained(model_name).to(self.device)
        self.model.eval()

    def encode(self, sentences, batch_size=32, normalize=True):
        # Allows for single or multiple sentence inputs
        if isinstance(sentences, str):
            sentences = [sentences]
        all_embeddings = []
        # Divides input into batches to prevent OOM errors
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start+batch_size]
            encoded_input = self.tokenizer(batch, padding=True, truncation=True,
                                           return_tensors='pt').to(self.device)
            with torch.no_grad():
                outputs = self.model(**encoded_input)
            token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden_dim)
            attention_mask = encoded_input['attention_mask']
            # Mean pooling: average token embeddings, ignoring padding positions
            input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size())
            sum_embeddings = (token_embeddings * input_mask_expanded).sum(1)  # (B, H)
            sum_mask = attention_mask.sum(1).unsqueeze(-1)  # (B, 1)
            embeddings = sum_embeddings / sum_mask
            if normalize:
                embeddings = F.normalize(embeddings, p=2, dim=1)
            all_embeddings.append(embeddings.cpu())
        return torch.cat(all_embeddings, dim=0)

    def save(self, save_directory):
        """
        Save the model and tokenizer to the specified directory.
        """
        os.makedirs(save_directory, exist_ok=True)
        self.model.save_pretrained(save_directory)
        self.tokenizer.save_pretrained(save_directory)

    @classmethod
    def load(cls, load_directory, device=None):
        """
        Load the model and tokenizer from the specified directory.
        """
        return cls(model_name=load_directory, device=device)
embedder = CustomSentenceEmbedder(device=device, model_name='./roberta_finetuned')
Some weights of RobertaModel were not initialized from the model checkpoint at ./roberta_finetuned and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This warning is expected: the MLM checkpoint has no pooler head, and the embedder never uses the pooler (it mean-pools the token embeddings instead), so the randomly initialized pooler weights are harmless.
texts = program_features['text'].to_list()
BATCH_SIZE = 64
embeddings_test = embedder.encode(texts, batch_size=BATCH_SIZE).numpy()
# Add embeddings back to features dataframe
embd_cols = [f'embd_{i}' for i in range(embeddings_test.shape[1])]
embd_df = pd.DataFrame(embeddings_test, columns=embd_cols, index=program_features.index)
program_features_test = program_features.copy().drop(columns=[col for col in program_features.columns if col.startswith('embd_')])
program_features_test = pd.concat([program_features_test, embd_df], axis=1)
md_cols = [
'reps_count', 'reps_time', 'is_rep_based',
'sets', 'reps_per_week', 'program_length', 'time_per_workout',
'intensity', 'level_beginner', 'level_novice', 'level_intermediate',
'level_advanced', 'goal_olympic_weightlifting',
'goal_muscle_&_sculpting', 'goal_bodyweight_fitness',
'goal_powerbuilding', 'goal_bodybuilding', 'goal_powerlifting',
'goal_athletics', 'equipment_at home', 'equipment_dumbbell only',
'equipment_full gym', 'equipment_garage gym'
]
final_features_test = program_features_test[md_cols + embd_cols].copy()  # .copy() avoids SettingWithCopyWarning when adding 'cluster' later

scaler = StandardScaler()
# Scale for KMeans and PCA
final_features_test_scaled = pd.DataFrame(
scaler.fit_transform(final_features_test),
columns=final_features_test.columns,
index=final_features_test.index
)

Model Training
# Set up KMeans model to cluster the workout programs together
clustering_data_test = final_features_test_scaled.copy()
kmeans = KMeans(n_clusters=25, random_state=4)
clustering_data_test['cluster'] = kmeans.fit_predict(clustering_data_test)

# Add cluster back to the unscaled dataset, as the recommender function needs unscaled data
final_features_test.loc[:, 'cluster'] = clustering_data_test['cluster']
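As an aside, the choice of n_clusters=25 is taken as given here. If you wanted to sanity-check it, one quick probe is to sweep a few values of k and compare silhouette scores; the sketch below is illustrative and not part of the original pipeline:

from sklearn.metrics import silhouette_score

# Higher silhouette = tighter, better-separated clusters; the k values here are illustrative
for k in (10, 15, 20, 25, 30):
    labels = KMeans(n_clusters=k, random_state=4).fit_predict(final_features_test_scaled)
    print(k, round(silhouette_score(final_features_test_scaled, labels), 3))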
similarities_test = cosine_similarity(clustering_data_test)

Redefining the program recommender function for use in this notebook.
def find_top_n(similarity_matrix, n_programs, program, metadata, info, cluster=None, features=None):
    """
    Gets the top n workout programs.

    Args:
        similarity_matrix (np.ndarray): Matrix of similarity scores between programs.
        n_programs (int): Number of top similar programs to return.
        program (int): Index of the program to compare against.
        metadata (list): List of metadata column names to include in the result.
        info (list): List of info column names to include in the result.
        cluster (int, optional): If given, restrict candidates to this KMeans cluster.
        features (pd.DataFrame, optional): Feature frame holding the 'cluster' column.

    Returns:
        list[pd.DataFrame]: List of DataFrames, each containing the metadata and info for a top similar program.
    """
    scores = similarity_matrix[program]
    # 'is not None' so that cluster 0 is not silently ignored
    if cluster is not None:
        mask = (features['cluster'] == cluster).values
        scores = scores * mask
    idxs = np.argsort(scores)[::-1]
    # Gets the top n indices that aren't itself
    top_n = idxs[idxs != program][:n_programs]
    top_titles = program_features['title'][top_n]
    # For each of the top n workout programs, pull out only specific columns and add each DF to a list
    progs = [huge_data[huge_data['title'] == i][metadata+info] for i in top_titles]
    return progs
def program_recommender(program, features, similarity_matrix, model=None, n_programs=5, within_cluster=False):
    """
    Takes in a user's inputted program vector or existing program index
    and computes the top n similar workout programs.

    Args:
        program (int or list): If int, the index of an existing program to use as the query.
                               If list, a vector of numeric features followed by a string description
                               representing a custom user program.
        features (pd.DataFrame): Feature matrix of all programs (used for custom queries).
        similarity_matrix (np.ndarray): Precomputed similarity scores between existing programs.
        model (CustomSentenceEmbedder): Model used to encode text descriptions.
        n_programs (int): Number of similar programs to return (default: 5).
        within_cluster (bool): If True, only recommend programs from the query's cluster.

    Returns:
        list[pd.DataFrame]: List of DataFrames, each containing metadata and info for a recommended program.
    """
    metadata = ['title', 'description', 'level', 'goal', 'equipment', 'program_length', 'time_per_workout', 'number_of_exercises']
    info = ['week', 'day', 'exercise_name', 'sets', 'reps', 'intensity']
    if isinstance(program, int):
        return find_top_n(similarity_matrix,
                          n_programs,
                          program,
                          metadata,
                          info,
                          features['cluster'].iloc[program] if within_cluster else None,
                          features if within_cluster else None
                          )
    elif isinstance(program, list):
        # Encodes the user's description for the workout
        query_embd = model.encode(program[-1])
        query_numeric = np.array(program[:-1], dtype=np.float32)
        # Concatenate the numeric features and the embedding
        query_full = np.concatenate([query_numeric, query_embd.flatten()])
        # Standardize the query
        scaler = StandardScaler()
        features_scaled = scaler.fit_transform(features.drop(columns=['cluster']))
        query_full_scaled = scaler.transform([query_full])[0].reshape(1, -1)  # Reshaping turns the query into a 2D array
        cluster = int(kmeans.predict(query_full_scaled)[0])  # [0] avoids the NumPy array-to-scalar deprecation
        # Compute cosine similarity between the query and all existing (already scaled) features
        similarities_to_query = cosine_similarity(
            features_scaled,
            query_full_scaled
        ).flatten()
        features_scaled = pd.concat([pd.DataFrame(features_scaled), features['cluster']], axis=1)
        return find_top_n(
            similarities_to_query.reshape(1, -1),
            n_programs,
            0,
            metadata,
            info,
            cluster if within_cluster else None,
            features_scaled if within_cluster else None
        )
    else:
        raise ValueError('Inputted program must be an int index or a list of features plus a description.')

# Custom query: 23 numeric features (matching the md_cols order) followed by a free-text description
query = [10, 0, 1, 100, 400, 8, 60, 10, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
         'Insane Arnold program, not for the weak. You need max intensity in every exercise. Push to failure ALWAYS.']

new = program_recommender(query, final_features_test, similarity_matrix=similarities_test, model=embedder, within_cluster=True)
for i in range(len(new)):
    title = new[i]['title'].iloc[0]
    description = new[i]['description'].iloc[0]
    print("="*40)
    print(f"Title: {title}\n")
    print(f"Description: {description}")
    print("="*40 + "\n")

========================================
Title: building the titan
Description: Get jacked - pure hypertrophy. You need to aim to up as many reps or weight every session. This is about pushing yourself to the limit.
========================================
========================================
Title: rodrigo's - 6 day arnold split
Description: This intermediate and advanced program is designed for the bodybuilder and powerlifter who wants to increase his strength and size while pushing himself to the max. Not for the faint of heart as volume is high, but highly rewarding gains. Based on the 6 day Arnold split and adjusted to my favourite exercises.
========================================
========================================
Title: tighty by whitey
Description: Designed to be difficult for beginner lifters in order to push the mind through physical struggle/failure.
========================================
========================================
Title: godsplit
Description: The purpose of this program is to put in serious work in the gym while gaining strength and size. Put in the work. Reap the Rewards.
========================================
========================================
Title: house's strength gaining program
Description: To push you to the limits and not hold back one bit
========================================
The overall quality of the outputs has drastically improved! The fine-tuned model does a much better job of finding workouts similar to the user’s query.
With the pretrained model, the “fitness” terminology wasn’t captured or understood. The fine-tuned model, on the other hand, actually picks up on the unique lingo. For example, it learns that Arnold is not just a person’s name (Arnold Schwarzenegger) but a training split named after him. It also picks up that we are looking for an intense workout regimen without a whole lot of fluff, meant for people who are serious about lifting.
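One rough way to make this comparison concrete (a sketch with illustrative phrase choices, not part of the original analysis) is to embed a few domain phrases with both the off-the-shelf and the fine-tuned encoder and compare how they relate:

# Hedged probe: compare how the base and fine-tuned encoders relate lifting lingo.
# The phrases are illustrative; no particular numbers are claimed here.
base_embedder = CustomSentenceEmbedder(model_name='FacebookAI/roberta-base', device=device)
phrases = ['6 day arnold split', 'high-volume bodybuilding program', 'a biography of Arnold Schwarzenegger']
for name, emb in [('base', base_embedder), ('fine-tuned', embedder)]:
    sims = cosine_similarity(emb.encode(phrases).numpy())
    # If fine-tuning worked, the 'arnold split' phrase should sit closer to the bodybuilding
    # program than to the biography under the fine-tuned encoder.
    print(name, '| split vs program:', round(float(sims[0, 1]), 3), '| split vs biography:', round(float(sims[0, 2]), 3))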