This README contains instructions to run a demo for vLLM, an open-source library for fast LLM inference and serving, which improves throughput by up to 24x compared to HuggingFace Transformers.
Install the latest SkyPilot and check that your cloud credentials are set up:
pip install git+https://github.com/skypilot-org/skypilot.git
sky check
See serve.yaml for the vLLM SkyPilot YAML used for serving. Start the serving with:
sky launch -c vllm-serve -s serve.yaml
Check the output of the command; a shareable Gradio link will appear, similar to the following:
(task, pid=7431) Running on public URL: https://a8531352b74d74c7d2.gradio.live
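If you close the terminal, you can reattach to the task's output at any time with SkyPilot's log streaming:
# Stream the logs of the running serving task on the cluster.
sky logs vllm-serve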
Optional: to serve a different model on other GPUs, set the MODEL_NAME environment variable and the --gpus flag:
sky launch -c vllm-serve -s serve.yaml --gpus A100:1 --env MODEL_NAME=decapoda-research/llama-13b-hf
Before you get started, you need access to the Llama-2 model weights on Hugging Face. Please check the prerequisites section in the Llama-2 example for more details.
Launch the OpenAI-API-compatible serving cluster:
sky launch -c vllm-llama2 serve-openai-api.yaml
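If serve-openai-api.yaml reads your Hugging Face token from an environment variable (an assumption; check the YAML's envs section for the actual variable name), you can pass it at launch time:
# Pass a Hugging Face access token to the task. HF_TOKEN is a hypothetical
# variable name here; use whatever serve-openai-api.yaml actually expects.
sky launch -c vllm-llama2 serve-openai-api.yaml --env HF_TOKEN=<your-huggingface-token>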
Optional: only GCP currently offers the specified L4 GPUs. To use other clouds, pass the --gpus flag to request other GPUs. For example, to use V100 GPUs:
sky launch -c vllm-llama2 serve-openai-api.yaml --gpus V100:1
Once the cluster is up, get its IP address and check that the model is being served:
IP=$(sky status --ip vllm-llama2)
curl http://$IP:8000/v1/models
Query the completions endpoint:
curl http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
You should get a response similar to the following:
{
"id":"cmpl-50a231f7f06a4115a1e4bd38c589cd8f",
"object":"text_completion","created":1692427390,
"model":"meta-llama/Llama-2-7b-chat-hf",
"choices":[{
"index":0,
"text":"city in Northern California that is known",
"logprobs":null,"finish_reason":"length"
}],
"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}
}
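To extract just the generated text, you can pipe the response through jq (assuming jq is installed on your machine):
# Print only the completion text of the first choice.
curl -s http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}' \
  | jq -r '.choices[0].text'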
You can also query the model through the chat completions endpoint:
curl http://$IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
You should get a response similar to the following:
{
"id": "cmpl-879a58992d704caf80771b4651ff8cb6",
"object": "chat.completion",
"created": 1692650569,
"model": "meta-llama/Llama-2-7b-chat-hf",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! I'm just an AI assistant, here to help you"
},
"finish_reason": "length"
}],
"usage": {
"prompt_tokens": 31,
"total_tokens": 47,
"completion_tokens": 16
}
}
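Similarly, the assistant's reply can be pulled out of the chat response with jq (again assuming jq is available locally):
# Print only the assistant message content of the first choice.
curl -s http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "Who are you?"}]}' \
  | jq -r '.choices[0].message.content'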
To scale up the model serving for more traffic, we introduced SkyServe to enable easy deployment of multiple replicas of the model. Add a service section to the above serve-openai-api.yaml file to make it a SkyServe service YAML:
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
# Specifying the path to the endpoint to check the readiness of the service.
readiness_probe: /v1/models
# How many replicas to manage.
replicas: 2
The entire Service YAML can be found here: service.yaml.
Start the service:
sky serve up -n vllm-llama2 service.yaml
Use sky serve status to check the status of the serving:
sky serve status vllm-llama2
You should get output similar to the following:
Services
NAME UPTIME STATUS REPLICAS ENDPOINT
vllm-llama2 7m 43s READY 2/2 3.84.15.251:30001
Service Replicas
SERVICE_NAME ID IP LAUNCHED RESOURCES STATUS REGION
vllm-llama2 1 34.66.255.4 11 mins ago 1x GCP({'L4': 1}) READY us-central1
vllm-llama2 2 35.221.37.64 15 mins ago 1x GCP({'L4': 1}) READY us-east4
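To inspect a specific replica, you can stream its logs using the replica ID shown in the table above:
# Stream the logs of replica 1 of the service.
sky serve logs vllm-llama2 1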
Get a single endpoint that load-balances requests across the replicas:
ENDPOINT=$(sky serve status --endpoint vllm-llama2)
Once the service is READY, you can use the endpoint to interact with the model:
curl -L $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
Notice that it is the same as the previous curl command, except for the -L flag, which allows curl to follow redirects from the load balancer. You should get a response similar to the following:
{
"id": "cmpl-879a58992d704caf80771b4651ff8cb6",
"object": "chat.completion",
"created": 1692650569,
"model": "meta-llama/Llama-2-7b-chat-hf",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! I'm just an AI assistant, here to help you"
},
"finish_reason": "length"
}],
"usage": {
"prompt_tokens": 31,
"total_tokens": 47,
"completion_tokens": 16
}
}
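As a quick sanity check that requests are being spread across replicas, you can send a few lightweight requests through the endpoint in a loop (a minimal sketch; /v1/models is a cheap route served by every replica):
# Send several requests through the load-balanced endpoint.
for i in $(seq 1 4); do
  curl -s -L $ENDPOINT/v1/models | jq -r '.data[].id'
done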
To serve other models, such as Mixtral 8x7b, please refer to the Mixtral 8x7b example for more details.
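When you are done, tear down the service (and any standalone clusters launched earlier) to stop incurring charges:
# Tear down the SkyServe service and all of its replicas.
sky serve down vllm-llama2
# Tear down the standalone clusters from the earlier steps.
sky down vllm-serve vllm-llama2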