This README contains instructions to run a demo for vLLM, an open-source library for fast LLM inference and serving, which improves throughput by up to 24x compared to HuggingFace Transformers.
Install the latest SkyPilot and check that your cloud credentials are set up:
pip install git+https://github.com/skypilot-org/skypilot.git
sky check
See serve.yaml for the vLLM SkyPilot YAML used for serving. Start the serving with:
sky launch -c vllm-serve -s serve.yaml
Check the output of the command; a shareable Gradio link will appear, similar to the following:
(task, pid=7431) Running on public URL: https://a8531352b74d74c7d2.gradio.live
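If you close the terminal, you can reattach to the task's output at any time with SkyPilot's log streaming:
# Stream the logs of the running serving task on the cluster.
sky logs vllm-serve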
Optional: to serve a different model on other GPUs, set the MODEL_NAME environment variable and the --gpus flag:
sky launch -c vllm-serve -s serve.yaml --gpus A100:1 --env MODEL_NAME=decapoda-research/llama-13b-hf
Before you get started, you need access to the Llama-2 model weights on Hugging Face. Please check the prerequisites section in the Llama-2 example for more details.
Launch the OpenAI-API-compatible serving cluster:
sky launch -c vllm-llama2 serve-openai-api.yaml
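If serve-openai-api.yaml reads your Hugging Face token from an environment variable (an assumption; check the YAML's envs section for the actual variable name), you can pass it at launch time:
# Pass a Hugging Face access token to the task. HF_TOKEN is a hypothetical
# variable name here; use whatever serve-openai-api.yaml actually expects.
sky launch -c vllm-llama2 serve-openai-api.yaml --env HF_TOKEN=<your-huggingface-token>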
Optional: only GCP currently offers the specified L4 GPUs. To use other clouds, pass the --gpus flag to request other GPUs. For example, to use V100 GPUs:
sky launch -c vllm-llama2 serve-openai-api.yaml --gpus V100:1
Once the cluster is up, get its IP address and check that the model is being served:
IP=$(sky status --ip vllm-llama2)
curl http://$IP:8000/v1/models
Query the completions endpoint:
curl http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
You should get a response similar to the following:
{
"id":"cmpl-50a231f7f06a4115a1e4bd38c589cd8f",
"object":"text_completion","created":1692427390,
"model":"meta-llama/Llama-2-7b-chat-hf",
"choices":[{
"index":0,
"text":"city in Northern California that is known",
"logprobs":null,"finish_reason":"length"
}],
"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}
}
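To extract just the generated text, you can pipe the response through jq (assuming jq is installed on your machine):
# Print only the completion text of the first choice.
curl -s http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}' \
  | jq -r '.choices[0].text'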
You can also query the model through the chat completions endpoint:
curl http://$IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
You should get a response similar to the following:
{
"id": "cmpl-879a58992d704caf80771b4651ff8cb6",
"object": "chat.completion",
"created": 1692650569,
"model": "meta-llama/Llama-2-7b-chat-hf",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! I'm just an AI assistant, here to help you"
},
"finish_reason": "length"
}],
"usage": {
"prompt_tokens": 31,
"total_tokens": 47,
"completion_tokens": 16
}
}
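Similarly, the assistant's reply can be pulled out of the chat response with jq (again assuming jq is available locally):
# Print only the assistant message content of the first choice.
curl -s http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "Who are you?"}]}' \
  | jq -r '.choices[0].message.content'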
To scale up the model serving for more traffic, we introduced SkyServe to enable easy deployment of multiple replicas of the model. Add a service section to the above serve-openai-api.yaml file to make it a SkyServe service YAML:
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
# Specifying the path to the endpoint to check the readiness of the service.
readiness_probe: /v1/models
# How many replicas to manage.
replicas: 2
The entire Service YAML can be found here: service.yaml.
Start the service:
sky serve up -n vllm-llama2 service.yaml
Use sky serve status to check the status of the serving:
sky serve status vllm-llama2
You should get output similar to the following:
Services
NAME UPTIME STATUS REPLICAS ENDPOINT
vllm-llama2 7m 43s READY 2/2 3.84.15.251:30001
Service Replicas
SERVICE_NAME ID IP LAUNCHED RESOURCES STATUS REGION
vllm-llama2 1 34.66.255.4 11 mins ago 1x GCP({'L4': 1}) READY us-central1
vllm-llama2 2 35.221.37.64 15 mins ago 1x GCP({'L4': 1}) READY us-east4
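To inspect a specific replica, you can stream its logs using the replica ID shown in the table above:
# Stream the logs of replica 1 of the service.
sky serve logs vllm-llama2 1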
Get a single endpoint that load-balances requests across the replicas:
ENDPOINT=$(sky serve status --endpoint vllm-llama2)
Once the service is READY, you can use the endpoint to interact with the model:
curl -L $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
Notice that it is the same as the previous curl command, except for the -L flag, which allows curl to follow redirects from the load balancer. You should get a response similar to the following:
{
"id": "cmpl-879a58992d704caf80771b4651ff8cb6",
"object": "chat.completion",
"created": 1692650569,
"model": "meta-llama/Llama-2-7b-chat-hf",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! I'm just an AI assistant, here to help you"
},
"finish_reason": "length"
}],
"usage": {
"prompt_tokens": 31,
"total_tokens": 47,
"completion_tokens": 16
}
}
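As a quick sanity check that requests are being spread across replicas, you can send a few lightweight requests through the endpoint in a loop (a minimal sketch; /v1/models is a cheap route served by every replica):
# Send several requests through the load-balanced endpoint.
for i in $(seq 1 4); do
  curl -s -L $ENDPOINT/v1/models | jq -r '.data[].id'
done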
To serve other models, such as Mixtral 8x7b, please refer to the Mixtral 8x7b example for more details.
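When you are done, tear down the service (and any standalone clusters launched earlier) to stop incurring charges:
# Tear down the SkyServe service and all of its replicas.
sky serve down vllm-llama2
# Tear down the standalone clusters from the earlier steps.
sky down vllm-serve vllm-llama2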