This post is a summary of the site below.
To run an LLM experiment, I think it is much simpler and more effective to train and run inference with Hugging Face's libraries than to copy model code and run it by hand.
So in this post, we'll take a quick tour of the Hugging Face Transformers library, and look at LLM fine-tuning in a later post.
Before we begin, make sure we have all the necessary libraries installed:
!pip install transformers datasets evaluate accelerate
We'll also need to install our preferred machine learning framework:
!pip install torch
The pipeline() is the easiest and fastest way to use a pretrained model for inference. We can use the pipeline() out-of-the-box for many tasks across different modalities.
Example
Task | Description | Modality | Pipeline identifier
Text generation | generate text given a prompt | NLP | pipeline(task="text-generation")
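As a quick illustration of the table above, a text-generation pipeline can be created in one line. This is just a minimal sketch of my own; the model choice ("gpt2") and the prompt are only assumptions for the example, and any generative model on the Hub would work:
>>> from transformers import pipeline

>>> generator = pipeline(task="text-generation", model="gpt2")  # "gpt2" is an illustrative choice
>>> generator("Hugging Face Transformers makes it easy to", max_new_tokens=20)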
Start by creating an instance of pipeline() and specifying a task you want to use it for. In this guide, you'll use the pipeline() for sentiment analysis as an example:
>>> from transformers import pipeline
>>> classifier = pipeline("sentiment-analysis")
The pipeline() downloads and caches a default pretrained model and tokenizer for sentiment analysis. Now we can use the classifier on our target text:
>>> classifier("We are very happy to show you the 🤗 Transformers library.")
[{'label': 'POSITIVE', 'score': 0.9998}]
If we have more than one input, pass your inputs as a list to pipeline() to return a list of dictionaries:
>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
>>> for result in results:
...     print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
The pipeline() can also iterate over an entire dataset for any task we like. For this example, let's choose automatic speech recognition as our task:
>>> import torch
>>> from transformers import pipeline
>>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
Load an audio dataset (see the 🤗 Datasets Quick Start for more details) we’d like to iterate over. For example, load the MInDS-14 dataset:
>>> from datasets import load_dataset, Audio
>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
We need to make sure the sampling rate of the dataset matches the sampling rate facebook/wav2vec2-base-960h was trained on:
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))
The audio files are automatically loaded and resampled when calling the "audio" column. Extract the raw waveform arrays from the first 4 samples and pass them as a list to the pipeline:
>>> result = speech_recognizer(dataset[:4]["audio"])
>>> print([d["text"] for d in result])
['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT']
For larger datasets where the inputs are big (like in speech or vision), we'll want to pass a generator instead of a list, so we don't load all the inputs into memory at once. Take a look at the pipeline API reference for more information.
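As a rough sketch of that idea, we can wrap the dataset in a generator and stream it through the same speech_recognizer from above (the helper name audio_generator is mine, not from the original post):
>>> def audio_generator():
...     for sample in dataset:          # yields one audio dict at a time instead of materializing a list
...         yield sample["audio"]

>>> for prediction in speech_recognizer(audio_generator()):
...     print(prediction["text"])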
The pipeline() can accommodate any model from the Hub, making it easy to adapt the pipeline() for other use-cases. For example, if you'd like a model capable of handling French text, use the tags on the Hub to filter for an appropriate model. The top filtered result returns a multilingual BERT model finetuned for sentiment analysis you can use for French text:
>>> model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
Specify the model and tokenizer in the pipeline(), and now you can apply the classifier to French text:
>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
>>> classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")
[{'label': '5 stars', 'score': 0.7273}]
If you can't find a model for your use-case, you'll need to finetune a pretrained model on your data. Take a look at the finetuning tutorial to learn how.
Under the hood, the AutoModelForSequenceClassification and AutoTokenizer classes work together to power the pipeline() you used above. An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate AutoClass for your task and its associated preprocessing class.
Let's return to the example from the previous section and see how you can use the AutoClass to replicate the results of the pipeline().
A tokenizer is responsible for preprocessing text into an array of numbers as inputs to a model. There are multiple rules that govern the tokenization process, including how to split a word and at what level words should be split (learn more about tokenization in the tokenizer summary). The most important thing to remember is you need to instantiate a tokenizer with the same model name to ensure you're using the same tokenization rules a model was pretrained with.
Load a tokenizer with AutoTokenizer:
>>> from transformers import AutoTokenizer
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
Pass your text to the tokenizer:
>>> encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
>>> print(encoding)
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The tokenizer returns a dictionary containing the input_ids (the numerical representations of your tokens), the token_type_ids (segment indicators used for sentence-pair tasks), and the attention_mask (which indicates which tokens should be attended to).
A tokenizer can also accept a list of inputs, and pad and truncate the text to return a batch with uniform length:
>>> pt_batch = tokenizer(
...     ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
...     padding=True,
...     truncation=True,
...     max_length=512,
...     return_tensors="pt",
... )
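To see what padding and truncation actually produced, we can inspect the returned tensors. This is just a quick sanity check of my own on the batch above:
>>> print(pt_batch["input_ids"].shape)       # both sentences are padded to the same length
>>> print(pt_batch["attention_mask"])        # 0s mark the padded positions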
For more details about tokenization, see the preprocess guide.
🤗 Transformers provides a simple and unified way to load pretrained instances. This means you can load an AutoModel like you would load an AutoTokenizer. The only difference is selecting the correct AutoModel for the task. For text (or sequence) classification, you should load AutoModelForSequenceClassification:
>>> from transformers import AutoModelForSequenceClassification
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
Now pass your preprocessed batch of inputs directly to the model. You just have to unpack the dictionary by adding **:
>>> pt_outputs = pt_model(**pt_batch)
The model outputs the final activations in the logits attribute. Apply the softmax function to the logits to retrieve the probabilities:
>>> from torch import nn
>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
>>> print(pt_predictions)
tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
[0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)
All 🤗 Transformers models (PyTorch or TensorFlow) output the tensors before the final activation function (like softmax) because the final activation function is often fused with the loss. Model outputs are special dataclasses, so their attributes are autocompleted in an IDE. The model outputs also behave like a tuple or a dictionary (you can index with an integer, a slice, or a string); in that case, attributes that are None are ignored.
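Because the outputs are dataclasses, attribute access and integer indexing point at the same tensor, and the model's config carries an id2label mapping we can use to turn the argmax into a readable label. A minimal sketch, reusing pt_outputs and pt_predictions from above:
>>> import torch

>>> pt_outputs.logits is pt_outputs[0]                      # attribute access and integer indexing give the same tensor
>>> predicted_ids = torch.argmax(pt_predictions, dim=-1)    # index of the highest-probability class per input
>>> [pt_model.config.id2label[i.item()] for i in predicted_ids]  # maps indices to star-rating labels like '5 stars'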
Once your model is fine-tuned, you can save it with its tokenizer using PreTrainedModel.save_pretrained():
>>> pt_save_directory = "./pt_save_pretrained"
>>> tokenizer.save_pretrained(pt_save_directory)
>>> pt_model.save_pretrained(pt_save_directory)
When you are ready to use the model again, reload it with PreTrainedModel.from_pretrained():
>>> pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")
One particularly cool 🤗 Transformers feature is the ability to save a model and reload it as either a PyTorch or TensorFlow model. The from_pt or from_tf parameter converts the model from one framework to the other. For example, if a TensorFlow checkpoint was saved to a tf_save_directory, you can reload it as a PyTorch model with from_tf=True:
>>> from transformers import AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)
You can modify the model's configuration class to change how a model is built. The configuration specifies a model's attributes, such as the number of hidden layers or attention heads. You start from scratch when you initialize a model from a custom configuration class. The model attributes are randomly initialized, and you'll need to train the model before you can use it to get meaningful results.
Start by importing AutoConfig, and then load the pretrained model you want to modify. Within AutoConfig.from_pretrained(), you can specify the attribute you want to change, such as the number of attention heads:
>>> from transformers import AutoConfig
>>> my_config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased", n_heads=12)
Create a model from your custom configuration with AutoModel.from_config():
>>> from transformers import AutoModel
>>> my_model = AutoModel.from_config(my_config)
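As a quick sanity check (assuming DistilBERT's attribute names), you can confirm the override took effect and remind yourself that the weights are untrained:
>>> my_config.n_heads                # the attention-head count we set above
>>> my_model.num_parameters()        # weights are randomly initialized; the model still needs training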
Take a look at the Create a custom architecture guide for more information about building custom configurations.