This README contains instructions on how to use SkyPilot to finetune Falcon-7B and Falcon-40B, open-source LLMs that rival many current closed-source models, including ChatGPT.
Install the latest SkyPilot and check your setup of the cloud credentials:

```bash
pip install git+https://github.com/skypilot-org/skypilot.git
sky check
```
See the Falcon SkyPilot YAML for training. Serving is currently a work in progress, and a YAML for it will be provided soon! We are also working on adding an evaluation step to compare your finetuned model against the base model.
Finetuning Falcon-7B and Falcon-40B requires GPUs with 80GB memory, but Falcon-7B-sharded requires only 40GB. Thus, we use `ybelkada/falcon-7b-sharded-bf16`, `tiiuae/falcon-7b`, and `tiiuae/falcon-40b`. Try `sky show-gpus --all` for supported GPUs.
We can start finetuning the Falcon model on Open Assistant's Guanaco dataset with a single command. It will automatically find the cheapest available VM on any cloud.
To finetune using different data, simply replace the path `timdettmers/openassistant-guanaco` with any other Hugging Face dataset.
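For illustration only, such a swap might look like the following hypothetical fragment of a SkyPilot task YAML. The `run` command shape and the `--dataset_name` flag are assumptions about how `train.py` consumes the dataset, not copied from this repo, and the dataset path is a placeholder:

```yaml
# Hypothetical fragment -- the real falcon.yaml may wire the dataset differently.
run: |
  python train.py \
    --dataset_name your-org/your-dataset  # replaces timdettmers/openassistant-guanaco
```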
Steps for training on your cloud(s):

1. In falcon.yaml, set the following variables in `envs`:
   - `OUTPUT_BUCKET_NAME`: a unique name. SkyPilot will create this bucket for you to store the model weights.
   - `WANDB_API_KEY`: your own key.
   - `MODEL_NAME`: your desired base model.

2. Training the Falcon model using spot instances:
```bash
sky spot launch -n falcon falcon.yaml
```
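For reference, a minimal sketch of what the `envs` section in falcon.yaml might contain. Only the variable names come from this README; the values are placeholders you would replace with your own:

```yaml
# Sketch of the envs section in falcon.yaml -- values are placeholders.
envs:
  OUTPUT_BUCKET_NAME: my-unique-falcon-weights-bucket  # must be globally unique
  WANDB_API_KEY: <your-wandb-key>
  MODEL_NAME: ybelkada/falcon-7b-sharded-bf16  # or tiiuae/falcon-7b / tiiuae/falcon-40b
```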
Currently, such `A100-80GB:1` spot instances are only available on AWS and GCP.
[Optional] To use on-demand `A100-80GB:1` instances, which are currently available on Lambda Cloud, Azure, and GCP:

```bash
sky launch -c falcon -s falcon.yaml --no-use-spot
```
For reference, below is a loss graph you may expect to see, along with the approximate time and cost of finetuning each of the models over 500 epochs (assuming a spot A100 rate of $1.10/hour and an A100-80GB rate of $1.61/hour):

- `ybelkada/falcon-7b-sharded-bf16`: 2.5 to 3 hours using 1 A100 spot GPU; total cost ≈ $3.30.
- `tiiuae/falcon-7b`: 2.5 to 3 hours using 1 A100 spot GPU; total cost ≈ $3.30.
- `tiiuae/falcon-40b`: 10 hours using 1 A100-80GB spot GPU; total cost ≈ $16.10.
Q: I see some bucket permission errors (`sky.exceptions.StorageBucketGetError`) when running the above:

```
...
sky.exceptions.StorageBucketGetError: Failed to connect to an existing bucket 'YOUR_OWN_BUCKET_NAME'.
Please check if:
1. the bucket name is taken and/or
2. the bucket permissions are not setup correctly. To debug, consider using gsutil ls gs://YOUR_OWN_BUCKET_NAME.
```
A: You need to replace the bucket name with your own globally unique name, and rerun the commands. New private buckets will be automatically created under your cloud account.