Hugging Face is an AI research lab and hub that has built a community of scholars, researchers, and enthusiasts. In a short span of time, Hugging Face has garnered a substantial presence in the AI space. Tech giants including Google, Amazon, and Nvidia have backed the startup with significant investments, bringing its valuation to $4.5 billion.
In this guide, we’ll introduce transformers, LLMs, and how the Hugging Face library plays an important role in fostering an open-source AI community. We’ll also walk through the essential features of Hugging Face, including pipelines, datasets, models, and more, with hands-on Python examples.
Transformers in NLP
In 2017, researchers at Google published the influential paper “Attention Is All You Need”, which introduced transformers: deep learning models used in NLP. This breakthrough fueled the development of large language models like those behind ChatGPT.
Large language models, or LLMs, are AI systems that use transformers to understand and create human-like text. However, training these models is expensive, often requiring millions of dollars, which restricts their development to a handful of large companies.
Hugging Face, founded in 2016, aims to make NLP models accessible to everyone. Despite being a commercial company, it offers a range of open-source resources that help people and organizations affordably build and use transformer models. Machine learning is about teaching computers to perform tasks by recognizing patterns, while deep learning, a subset of machine learning, builds layered networks that learn from data on their own. Transformers are a deep learning architecture that uses input data effectively and flexibly; because their computations parallelize well, they can be trained comparatively quickly, making them a popular choice for building large language models.
How Hugging Face Facilitates NLP and LLM Projects
Hugging Face has made working with LLMs simpler by offering:
- A range of pre-trained models to choose from.
- Tools and examples to fine-tune these models to your specific needs.
- Easy deployment options for various environments.
A great resource available through Hugging Face is the Open LLM Leaderboard. It systematically tracks, ranks, and evaluates a wide range of open-source large language models (LLMs) and chatbots, providing a clear picture of progress in the open-source domain.
The leaderboard measures models using four benchmarks:
- AI2 Reasoning Challenge (25-shot) — a set of grade-school science questions.
- HellaSwag (10-shot) — a commonsense inference test that is easy for humans but remains a significant challenge for cutting-edge models.
- MMLU (5-shot) — a multifaceted evaluation touching upon a text model’s proficiency across 57 diverse domains, encompassing basic math, law, and computer science, among others.
- TruthfulQA (0-shot) — a tool to ascertain the tendency of a model to echo frequently encountered online misinformation.
The benchmarks, which are described using terms such as “25-shot”, “10-shot”, “5-shot”, and “0-shot”, indicate the number of prompt examples that a model is given during the evaluation process to gauge its performance and reasoning abilities in various domains. In “few-shot” paradigms, models are provided with a small number of examples to help guide their responses, whereas in a “0-shot” setting, models receive no examples and must rely solely on their pre-existing knowledge to respond appropriately.
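For illustration, the hypothetical prompts below contrast a 2-shot setup, where worked examples precede the actual question, with a 0-shot setup, where the model sees only the question; the questions themselves are made up for this sketch:

# Hypothetical few-shot prompt: two worked examples guide the model's answer format
few_shot_prompt = """Q: What gas do plants absorb from the air?
A: Carbon dioxide

Q: What force pulls objects toward the ground?
A: Gravity

Q: At what temperature does water boil at sea level, in Celsius?
A:"""

# Hypothetical zero-shot prompt: no examples, just the question
zero_shot_prompt = "Q: At what temperature does water boil at sea level, in Celsius?\nA:"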
Components of Hugging Face
Pipelines
Pipelines are part of Hugging Face’s transformers library, a feature that makes it easy to use the pre-trained models available in the Hugging Face repository. They provide an intuitive API for an array of tasks, including sentiment analysis, question answering, masked language modeling, named entity recognition, and summarization.
Pipelines integrate three central Hugging Face components:
- Tokenizer: Prepares your text for the model by converting it into a format the model can understand.
- Model: This is the heart of the pipeline where the actual predictions are made based on the preprocessed input.
- Post-processor: Transforms the model’s raw predictions into a human-readable form.
These pipelines not only reduce extensive coding but also offer a user-friendly interface to accomplish various NLP tasks.
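To make these three stages concrete, here is a minimal sketch of roughly what a text-classification pipeline does under the hood. It assumes the widely used distilbert-base-uncased-finetuned-sst-2-english checkpoint; the actual default model a pipeline picks may differ between library versions:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

# 1. Tokenizer: convert raw text into model-ready tensors
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("I am thrilled to learn about AI.", return_tensors="pt")

# 2. Model: produce raw predictions (logits) from the tokenized input
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Post-processing: turn the logits into a human-readable label and score
probs = torch.softmax(logits, dim=-1)
label_id = int(probs.argmax(dim=-1))
print(model.config.id2label[label_id], float(probs[0, label_id]))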
Transformer Applications using the Hugging Face library
A highlight of the Hugging Face ecosystem is the Transformers library, which simplifies NLP tasks by connecting a model with the necessary pre- and post-processing stages, streamlining the analysis process. To install and import the library, use the following commands:
pip install -q transformers
from transformers import pipeline
Having done that, you can execute NLP tasks, starting with sentiment analysis, which categorizes text into positive or negative sentiments. The library’s powerful pipeline() function serves as a single entry point, encompassing task-specific pipelines for text as well as audio, vision, and multimodal domains.
Practical Applications
Text Classification
Text classification becomes a breeze with Hugging Face’s pipeline() function. Here’s how you can initiate a text classification pipeline:
classifier = pipeline("text-classification")
For a hands-on experience, feed a string or list of strings into your pipeline to obtain predictions, which can be neatly visualized using Python’s Pandas library. Below is a Python snippet demonstrating this:
sentences = [
    "I am thrilled to introduce you to the wonderful world of AI.",
    "Hopefully, it won't disappoint you.",
]

# Get classification results for each sentence in the list
results = classifier(sentences)

# Loop through each result and print the label and score
for i, result in enumerate(results):
    print(f"Result {i + 1}:")
    print(f" Label: {result['label']}")
    print(f" Score: {round(result['score'], 3)}\n")
Output
Result 1:
 Label: POSITIVE
 Score: 1.0

Result 2:
 Label: POSITIVE
 Score: 0.996
Named Entity Recognition (NER)
NER is pivotal in extracting real-world objects termed ‘named entities’ from the text. Utilize the NER pipeline to identify these entities effectively:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
text = "Elon Musk is the CEO of SpaceX."
outputs = ner_tagger(text)
print(outputs)
Output (representative; exact scores depend on the default NER model)
[{'entity_group': 'PER', 'score': 0.99, 'word': 'Elon Musk', 'start': 0, 'end': 9}, {'entity_group': 'ORG', 'score': 0.99, 'word': 'SpaceX', 'start': 24, 'end': 30}]
Question Answering
Question answering involves extracting precise answers to specific questions from a given context. Initialize a question-answering pipeline and input your question and context to get the desired answer:
reader = pipeline("question-answering")
text = "Hugging Face is a company creating tools for NLP. It is based in New York and was founded in 2016."
question = "Where is Hugging Face based?"
outputs = reader(question=question, context=text)
print(outputs)
Output
{'score': 0.998, 'start': 65, 'end': 73, 'answer': 'New York'}
Hugging Face’s pipeline function offers an array of pre-built pipelines for different tasks, aside from text classification, NER, and question answering. Below are details on a subset of available tasks:
Table: Hugging Face Pipeline Tasks
| Task | Description | Pipeline Identifier |
| --- | --- | --- |
| Text Generation | Generate text based on a given prompt | pipeline(task="text-generation") |
| Summarization | Summarize a lengthy text or document | pipeline(task="summarization") |
| Image Classification | Label an input image | pipeline(task="image-classification") |
| Audio Classification | Categorize audio data | pipeline(task="audio-classification") |
| Visual Question Answering | Answer a query using both an image and a question | pipeline(task="vqa") |
For detailed descriptions and more tasks, refer to the pipeline documentation on Hugging Face’s website.
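As a quick illustration, here is how two of the tasks from the table above can be instantiated with the pipeline() function imported earlier; the default checkpoints are downloaded on first use, and the generated text will vary from run to run:

# Text generation: continue a prompt with the task's default model
generator = pipeline(task="text-generation")
print(generator("Hugging Face makes it easy to", max_length=30)[0]["generated_text"])

# Summarization: condense a longer passage into a short summary
summarizer = pipeline(task="summarization")
long_text = (
    "Hugging Face provides open-source libraries, hosted datasets, and thousands of "
    "pre-trained models that let developers build NLP applications without training "
    "models from scratch, lowering the barrier to entry for smaller teams."
)
print(summarizer(long_text, max_length=30, min_length=10)[0]["summary_text"])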
Why Hugging Face is shifting its focus to Rust
The Hugging Face (HF) ecosystem started utilizing Rust in libraries such as safetensors and tokenizers.
Hugging Face has very recently also released a new machine-learning framework called Candle. Unlike traditional frameworks that use Python, Candle is built with Rust. The goal behind using Rust is to enhance performance and simplify the user experience while supporting GPU operations.
The key objective of Candle is to facilitate serverless inference, making the deployment of lightweight binaries possible and removing Python from the production workloads, which can sometimes slow down processes due to its overheads. This framework comes as a solution to overcome the issues encountered with full machine learning frameworks like PyTorch that are large and slow when creating instances on a cluster.
Let’s explore why Rust is increasingly favored over Python for this kind of work.
- Speed and Performance – Rust is known for its incredible speed, outperforming Python, which is traditionally used in machine learning frameworks. Python’s performance can sometimes be slowed down due to its Global Interpreter Lock (GIL), but Rust does not face this issue, promising faster execution of tasks and, subsequently, improved performance in projects where it is implemented.
- Safety – Rust provides memory safety guarantees without a garbage collector, an aspect that is essential in ensuring the safety of concurrent systems. This plays a crucial role in areas like safetensors where safety in handling data structures is a priority.
Safetensors
Safetensors benefits from Rust’s speed and safety features. The format involves storing and loading tensors, complex mathematical entities, and using Rust ensures that these operations are not just fast but also secure, avoiding common bugs and security issues that could arise from memory mishandling.
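For context, here is a small sketch of how the safetensors format is typically used from Python to save and reload tensors; it assumes torch and safetensors are installed, and the file name is just an example:

import torch
from safetensors.torch import save_file, load_file

# Save a dictionary of named tensors to a single .safetensors file
tensors = {"embedding": torch.zeros((2, 4)), "attention": torch.ones((3, 3))}
save_file(tensors, "model.safetensors")

# Load them back; the Rust backend performs fast, memory-safe reads
loaded = load_file("model.safetensors")
print(loaded["embedding"].shape)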
Tokenizer
Tokenizers handle the breaking down of sentences or phrases into smaller units, such as words or terms. Rust aids in this process by speeding up the execution time, ensuring that the tokenization process is not just accurate but also swift, enhancing the efficiency of natural language processing tasks.
At the core of Hugging Face’s tokenizer is the concept of subword tokenization, striking a delicate balance between word and character-level tokenization to optimize information retention and vocabulary size. It functions through the creation of subtokens, such as “##ing” and “##ed”, retaining semantic richness while avoiding a bloated vocabulary.
Subword tokenization involves a training phase to identify the most efficacious balance between character and word-level tokenization. It goes beyond mere prefix and suffix rules, requiring a comprehensive analysis of language patterns in extensive text corpora to design an efficient subword tokenizer. The generated tokenizer is adept at handling novel words by breaking them down into known subwords, maintaining a high level of semantic understanding.
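As a quick illustration, the sketch below feeds a word to a standard BERT tokenizer to show how it is broken into known subwords; the exact splits depend on the tokenizer’s learned vocabulary:

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize("tokenization"))
# Typically prints something like: ['token', '##ization']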
Tokenization Components
The tokenizers library divides the tokenization process into several steps, each addressing a distinct facet of tokenization. Let’s delve into these components:
- Normalizer: Applies initial transformations to the input string, such as lowercase conversion, Unicode normalization, and stripping whitespace.
- PreTokenizer: Responsible for fragmenting the input string into pre-segments, determining the splits based on predefined rules, such as space delineations.
- Model: Oversees the discovery and creation of subtokens, adapting to the specifics of your input data and offering training capabilities.
- Post-Processor: Adds the special tokens required by many transformer-based models, such as the [CLS] and [SEP] tokens expected by BERT.
To get started with Hugging Face tokenizers, install the library and import it into your Python environment:
pip install tokenizers
The library can tokenize large amounts of text in very little time, thereby saving precious computational resources for more intensive tasks like model training.
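To see how the components described above fit together, here is a minimal sketch that assembles a WordPiece tokenizer from a normalizer, pre-tokenizer, model, and post-processor. The WordPiece model here is untrained, so the sketch is purely structural; a real tokenizer would be trained on a text corpus first:

from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Model: discovers and creates subtokens (untrained WordPiece for illustration)
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalizer: Unicode normalization, lowercasing, and accent stripping
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

# PreTokenizer: split the input on whitespace before the model runs
tokenizer.pre_tokenizer = Whitespace()

# Post-Processor: add the [CLS]/[SEP] tokens expected by BERT-style models
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)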
The tokenizers library is written in Rust, a language whose syntax resembles C++ while introducing novel concepts in programming language design. Coupled with Python bindings, it lets you enjoy the performance of a lower-level language while working in a Python environment.
Datasets
Datasets are the bedrock of AI projects. Hugging Face offers a wide variety of datasets, suitable for a range of NLP tasks, and more. To utilize them efficiently, understanding the process of loading and analyzing them is essential. Below is a well-commented Python script demonstrating how to explore datasets available on Hugging Face:
from datasets import load_dataset

# Load a dataset (load_dataset returns a DatasetDict keyed by split)
dataset = load_dataset('squad')

# Display the first training entry
print(dataset['train'][0])
This script uses the load_dataset function to load the SQuAD dataset, which is a popular choice for question-answering tasks.
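Building on the snippet above, you can also inspect the dataset’s structure before using it; a brief sketch, assuming the SQuAD dataset loaded earlier:

# Inspect the splits and column schema of the dataset loaded above
print(dataset)                        # DatasetDict showing the available splits and their sizes
print(dataset["train"].column_names)  # e.g. ['id', 'title', 'context', 'question', 'answers']
print(dataset["train"].num_rows)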
Leveraging Pre-trained Models and Bringing It All Together
Pre-trained models form the backbone of many deep learning projects, enabling researchers and developers to jumpstart their initiatives without starting from scratch. Hugging Face facilitates the exploration of a diverse range of pre-trained models, as shown in the code below:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Load the pre-trained model and tokenizer
model = AutoModelForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# Display the model's architecture
print(model)
With the model and tokenizer loaded, we can now proceed to create a function that takes a piece of text and a question as inputs and returns the answer extracted from the text. We will utilize the tokenizer to process the input text and question into a format that is compatible with the model, and then we will feed this processed input into the model to get the answer:
import torch

def get_answer(text, question):
    # Tokenize the input text and question
    inputs = tokenizer(question, text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model(**inputs)

    # Get the most likely start and end positions of the answer span
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1

    # Convert the token IDs in the answer span back into a readable string
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end])
    )
    return answer
In the code snippet, we import necessary modules from the transformers package, then load a pre-trained model and its corresponding tokenizer using the from_pretrained method. We choose a BERT model fine-tuned on the SQuAD dataset.
Let’s see an example use case of this function where we have a paragraph of text and we want to extract a specific answer to a question from it:
text = """ The Eiffel Tower, located in Paris, France, is one of the most iconic landmarks in the world. It was designed by Gustave Eiffel and completed in 1889. The tower stands at a height of 324 meters and was the tallest man-made structure in the world at the time of its completion. """ question = "Who designed the Eiffel Tower?" # Get the answer to the question answer = get_answer(text, question) print(f"The answer to the question is: {answer}") # Output: The answer to the question is: Gustave Eiffel
In this script, we build a get_answer function that takes a text and a question, tokenizes them appropriately, and leverages the pre-trained BERT model to extract the answer from the text. It demonstrates a practical application of Hugging Face’s transformers library for building a simple yet powerful question-answering system. To grasp the concepts well, it is recommended to experiment hands-on in a Google Colab notebook.
Conclusion
Through its extensive range of open-source tools, pre-trained models, and user-friendly pipelines, Hugging Face enables both seasoned professionals and newcomers to delve into the expansive world of AI with a sense of ease and understanding. Moreover, the initiative to integrate Rust, owing to its speed and safety features, underscores Hugging Face’s commitment to fostering innovation while ensuring efficiency and security in AI applications. The transformative work of Hugging Face not only democratizes access to high-level AI tools but also nurtures a collaborative environment for learning and development in the AI space, facilitating a future where AI is accessible to all.