Mariano Gonzalez

Coder and Computer Enthusiast

26 Apr 2023

Introduction to MLOps With SageMaker: Running your First LLM


As the field of machine learning advances, it has become increasingly important for organizations to develop robust practices for managing their workflows. That’s where MLOps comes in - a set of best practices and tools for managing the entire lifecycle of machine learning models, from development to deployment and beyond.

In this blog post, we’ll delve into how MLOps practices can be leveraged to deploy an LLM in AWS SageMaker, using the popular Hugging Face Transformers library. We’ll cover everything from setting up an end-to-end deployment pipeline for a Large Language Model on SageMaker to monitoring its performance.

By the end of this post, you’ll have a better understanding of the key components of an MLOps workflow, and how they can be used to streamline the deployment of complex machine learning models in production environments. Whether you’re an experienced machine learning practitioner or just starting out, this post will provide valuable insights into the cutting-edge tools and techniques driving the field forward.

Getting Started

Prerequisites - Runtimes

Install the following binaries on your machine:

brew install awscli
brew install go-task
brew install terraform

Prerequisites - AWS Resources

Make sure you have an AWS account configured:

cat ~/.aws/config
aws_access_key_id = [REDACTED]
aws_secret_access_key = [REDACTED]
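
As a quick sanity check that those credentials actually resolve, you can print the caller identity with boto3 (a minimal sketch, assuming the default profile):

import boto3

# Print the account and identity the default credentials resolve to.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
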
  1. Clone the repo:
    git clone
  2. Run terraform init to check that the provider loads as expected:
    task tf_init
  3. Run terraform plan to review the changes before applying them:
    task tf_plan
  4. Create the SageMaker domain, user profile, and JupyterServer instance:
    task tf_apply

Creating model.tar.gz for the Amazon SageMaker real-time endpoint

  1. There are two ways you can deploy transformers to Amazon SageMaker: you can either deploy a model from the Hugging Face Hub directly, or deploy a model stored on S3. Since we are not using the default Transformers method, we need to go with the second option and deploy our endpoint with the model stored on S3. To do that, we need to create a folder structure like the following:

    |- model/code/
      |- inference.py
      |- requirements.txt

    Here requirements.txt lists the extra dependencies the endpoint needs (loading in 8-bit requires bitsandbytes and accelerate). Using the SageMaker Hugging Face Inference Toolkit, we can reference Dolly in SageMaker by defining a function like the one below in inference.py. By doing this we override the default model_fn function:

    import torch
    from transformers import pipeline

    def model_fn(model_dir):
        # Load Dolly with 8-bit weights so the 12B model fits in GPU memory.
        instruct_pipeline = pipeline(
            model="databricks/dolly-v2-12b",
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            device_map="auto",
            model_kwargs={"load_in_8bit": True},
        )
        return instruct_pipeline
  2. Finally, package the model folder and upload it to S3 (a rough Python equivalent of these two tasks is sketched after this list):

    task tar_model
    task upload_model
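
Under the hood, those two tasks essentially build a tarball of the model folder and push it to S3. A rough Python equivalent (the bucket name is a placeholder; the repo's Taskfile defines the real commands):

import tarfile

import boto3

# Build model.tar.gz with code/ at the archive root, the layout the
# SageMaker Hugging Face container expects.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model/code", arcname="code")

# Upload the tarball to S3; replace the bucket name with your own.
boto3.client("s3").upload_file("model.tar.gz", "<your-bucket>", "model.tar.gz")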

Provisioning JupyterServer

Once the infrastructure is up and running and the model reference has been uploaded to S3, you can access the JupyterServer by clicking the “Open Studio” button in the SageMaker console.

Deploying LLM - Dolly V2 12B

From the JupyterServer, you can clone the git repo and open the notebook notebooks/deploy-to-sm-endpoint.ipynb.

After executing all the previous cells in the notebook, you can proceed to deploy the model.
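
For reference, the deploy cell boils down to something like the sketch below. The S3 path, container versions, and instance type here are illustrative; the notebook has the exact values:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# Point the Hugging Face container at the model.tar.gz uploaded earlier.
huggingface_model = HuggingFaceModel(
    model_data="s3://<your-bucket>/model.tar.gz",
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# Deploy a real-time endpoint; a GPU instance is required for a 12B model.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
    endpoint_name="dolly-v2-12b",
)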


It’s important to mention that this step might take a couple of minutes to complete.

You can verify that the model was deployed successfully by checking the SageMaker endpoint status in the AWS Console.
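
You can also check the status programmatically (assuming the endpoint name used throughout this post):

import boto3

# The endpoint is ready once its status reaches "InService".
sm_client = boto3.client("sagemaker")
status = sm_client.describe_endpoint(EndpointName="dolly-v2-12b")["EndpointStatus"]
print(status)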

Consuming SageMaker Endpoint

We can use Streamlit to build a quick application for testing the model inference:

import json

import boto3
import streamlit as st

def generate_text(input_prompt: str) -> str:
    payload = {
        "inputs": input_prompt,
        "min_length": min_length,
        "max_length": max_length,
        "temperature": temperature,
        "repetition_penalty": rep_penalty,
        "do_sample": temperature > 0,
    }
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read().decode())
    return result[0]["generated_text"]

session = boto3.Session()
sagemaker_runtime = session.client("sagemaker-runtime", region_name=session.region_name)
endpoint_name = "dolly-v2-12b"

st.sidebar.title("Dolly-V2 Parameters")
stop_word = st.sidebar.text_input("Stop word")
min_length, max_length = st.sidebar.slider("Min/Max length", 0, 500, (0, 100))
temperature = st.sidebar.slider("Temperature", min_value=0.0, max_value=1.0, value=0.6)
rep_penalty = st.sidebar.slider("Repetition Penalty", min_value=0.9, max_value=1.2, value=1.0)

st.header("Dolly-v2-12B Playground")
prompt = st.text_area("Enter your prompt here:")

if st.button("Run"):
    generated_text = generate_text(prompt)
    if stop_word and stop_word in generated_text:
        # Trim the output at the last occurrence of the stop word.
        generated_text = generated_text[:generated_text.rfind(stop_word)]
    st.write(generated_text)

And finally, execute the following command to run the playground with Docker:

task run_playground

You will be able to access the playground at http://localhost:8501.

Bonus: More sophisticated workflows

You can build more sophisticated workflows by templating the prompts with LangChain. The following are just a few examples of what you can do by combining LangChain and SageMaker:

!pip install langchain
!aws configure set aws_access_key_id [REDACTED]
!aws configure set aws_secret_access_key [REDACTED]
!aws configure set default.region us-east-1
import json

from langchain import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs) -> bytes:
        # Merge the prompt with the generation parameters into one payload.
        input_str = json.dumps({"inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generated_text"]

content_handler = ContentHandler()

se = SagemakerEndpoint(
    endpoint_name="dolly-v2-12b",
    region_name="us-east-1",
    model_kwargs={"temperature": 0.6, "max_length": 100},
    content_handler=content_handler,
)

se("Tell me a joke")
from langchain import PromptTemplate, LLMChain

prompt_template = "Why is {vegetable} good for you?"
llm_chain = LLMChain(llm=se, prompt=PromptTemplate.from_template(prompt_template))
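
Running the chain then fills in the template variable; the vegetable here is just an illustrative value:

llm_chain.run(vegetable="broccoli")
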
import time

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

def generate_serially():
    prompt = PromptTemplate(
        input_variables=["product"],
        template="What is a good name for a company that makes {product}?",
    )
    chain = LLMChain(llm=se, prompt=prompt)
    for _ in range(5):
        resp = chain.run(product="underwear")
        print(resp)

s = time.perf_counter()
generate_serially()
elapsed = time.perf_counter() - s
print('\033[1m' + f"Serial executed in {elapsed:0.2f} seconds." + '\033[0m')
prompt_template = "Tell me a {adjective} joke"
llm_chain = LLMChain(

llm_chain(inputs={"adjective": "corny"})

Cleaning up resources

To clean up the resources created by this project, you can run the following command:

task tf_destroy