Introduction to MLOps With SageMaker: Running your First LLM
Introduction
As the field of machine learning advances, it has become increasingly important for organizations to develop robust practices for managing their workflows. That’s where MLOps comes in - a set of best practices and tools for managing the entire lifecycle of machine learning models, from development to deployment and beyond.
In this blog post, we’ll delve into how MLOps practices can be leveraged to deploy an LLM in AWS SageMaker, using the popular Hugging Face Transformers library. We’ll cover everything from setting up an end-to-end pipeline for deploying a Large Language Model on SageMaker, to monitoring its performance.
By the end of this post, you’ll have a better understanding of the key components of an MLOps workflow, and how they can be used to streamline the deployment of complex machine learning models in production environments. Whether you’re an experienced machine learning practitioner or just starting out, this post will provide valuable insights into the cutting-edge tools and techniques driving the field forward.
Getting Started
Prerequisites - Runtimes
Install the required binaries on your machine; at a minimum, the later steps rely on Terraform and Docker.
Prerequisites - AWS Resources
Make sure you have an AWS account configured:
- Clone the repo.
- Run `terraform init` to check that the providers load as expected.
- Run `terraform plan` to review the resources that will be created.
- Apply the plan to create the SageMaker domain, user profile, and JupyterServer instance.
Creating model.tar.gz for the Amazon SageMaker real-time endpoint
There are two ways you can deploy transformers to Amazon SageMaker: you can either deploy a model from the Hugging Face Hub directly, or deploy a model stored on S3. Since we are not using the default Transformers method, we need to go with the second option and deploy our endpoint with the model stored on S3. To do that, we need to build a model.tar.gz archive laid out the way the Hugging Face Inference Toolkit expects: a code/ directory containing inference.py (and, if needed, a requirements.txt).
Using the SageMaker Hugging Face Inference Toolkit, we can reference Dolly in SageMaker by creating a function like the one below in the file `inference.py`. By doing this we will be overriding the `model_fn` function:
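A minimal sketch of what that override could look like, assuming the model is pulled from the Hugging Face Hub when the endpoint container starts (the exact arguments in the repo's `inference.py` may differ):

```python
# code/inference.py -- sketch of a model_fn override for the
# SageMaker Hugging Face Inference Toolkit.
import torch
from transformers import pipeline


def model_fn(model_dir):
    # Instead of loading weights from model_dir, pull Dolly V2 12B
    # from the Hugging Face Hub at container start-up.
    return pipeline(
        "text-generation",
        model="databricks/dolly-v2-12b",
        torch_dtype=torch.bfloat16,   # dtype/device settings are assumptions
        trust_remote_code=True,
        device_map="auto",
    )
```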
Finally, upload the model archive to S3:
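For example, packaging and uploading the archive from Python with the SageMaker SDK might look like the sketch below; the bucket and prefix are placeholders:

```python
import tarfile

from sagemaker.s3 import S3Uploader

# Package the code/ directory (containing inference.py) into model.tar.gz.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("code", arcname="code")

# Upload the archive to S3; replace the bucket/prefix with your own.
model_uri = S3Uploader.upload("model.tar.gz", "s3://<your-bucket>/dolly-v2-12b")
print(model_uri)
```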
Provisioning JupyterServer
Once the infrastructure is up and running and the model reference has been uploaded to S3, you can access the JupyterServer by clicking the “Open Studio” button in the SageMaker console:
Deploying LLM - Dolly V2 12B
From the JupyterServer, you can import the git repo and reference the notebook `notebooks/deploy-to-sm-endpoint.ipynb`.
After executing all the previous cells in the notebook, you can proceed to deploy the model:
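The deployment cell looks roughly like the following sketch; the S3 path, container versions, and instance type are assumptions, so check the notebook for the exact values:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

huggingface_model = HuggingFaceModel(
    model_data="s3://<your-bucket>/dolly-v2-12b/model.tar.gz",  # archive uploaded earlier
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# A 12B-parameter model needs a large GPU instance; the type below is an assumption.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)
```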
NOTE
This step might take a couple of minutes to complete.
You can verify the model was deployed successfully by checking the SageMaker endpoint status in the AWS Console.
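Alternatively, you can check the status programmatically with boto3; the endpoint name below is hypothetical:

```python
import boto3

# Query the endpoint status; it should read "InService" once deployment finishes.
sm_client = boto3.client("sagemaker")
status = sm_client.describe_endpoint(EndpointName="dolly-v2-12b")["EndpointStatus"]
print(status)
```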
Consuming the SageMaker Endpoint
We can use Streamlit to quickly create an application to test the model inference:
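A minimal sketch of such a playground, assuming the endpoint name and JSON payload shown below (adjust both to match your deployment):

```python
# app.py -- sketch of a Streamlit playground for the SageMaker endpoint.
import json

import boto3
import streamlit as st

ENDPOINT_NAME = "dolly-v2-12b"  # hypothetical endpoint name
runtime = boto3.client("sagemaker-runtime")

st.title("Dolly V2 Playground")
prompt = st.text_area("Prompt", "Explain MLOps in one paragraph.")

if st.button("Generate"):
    # Send the prompt to the real-time endpoint and display the response.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),
    )
    result = json.loads(response["Body"].read())
    st.write(result)
```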
And finally, execute the following command to deploy it with Docker:
You will be able to access the playground at: http://localhost:8501
Bonus: More sophisticated workflows
You can build more sophisticated workflows by templating the prompts using langchain. The following are just a few examples of what you can do by combining langchain and SageMaker:
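As a sketch of one such workflow, the snippet below wires a prompt template to the endpoint through langchain's SagemakerEndpoint wrapper. The endpoint name, region, and payload format are assumptions, and the import paths follow the langchain releases available at the time of writing:

```python
import json

from langchain import LLMChain, PromptTemplate
from langchain.llms import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler


class ContentHandler(LLMContentHandler):
    # Serialize prompts to the JSON shape the endpoint expects,
    # and read the generated text back out of the response.
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        return json.dumps({"inputs": prompt, **model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        return json.loads(output.read().decode("utf-8"))[0]["generated_text"]


llm = SagemakerEndpoint(
    endpoint_name="dolly-v2-12b",  # hypothetical endpoint name
    region_name="us-east-1",       # assumption
    content_handler=ContentHandler(),
)

template = PromptTemplate(
    input_variables=["topic"],
    template="Write a short, friendly summary about {topic}.",
)

chain = LLMChain(llm=llm, prompt=template)
print(chain.run(topic="MLOps on SageMaker"))
```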
Cleaning up resources
To clean up the resources created by this project, you can run the following command: