Chunxiang Xu avadesian

avadesian synced commits to teardown-return-value at avadesian/skypilot from mirror

  • 114df08993 Merge branch 'master' of github.com:concretevitamin/sky-experiments into teardown-return-value
  • fae047f6ae Make sky status -r updates the owner of the cluster (#1809) * update the owner list if the owner identity is less than the current one * Update the comments * rephrase
  • 678b09ac01 Update envs instead of replacing (#1806) * Update envs instead of replacing * update docstr * format * update docstr
  • c7d6d1d501 Allow fall back to account ID for AWS (#1808) * Make identity a list * Use a list of identities * Fix comments * Fix key name * Fix user identity * Address comments * format * format * Update sky/backends/backend_utils.py Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
  • 67bd3cc402 [Spot] Spot jobs only failover through the regions in ssh_proxy_command (#1792) * make the spot jobs only failover through the specified regions in ssh_proxy_command * format * ssh_proxy_command to None * address comments * Any include None * format * Fix skypilot_config.set_nested and add tests * Update sky/backends/backend_utils.py Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * format * Fix spot job status when there is only one region and unavailable * longer time for recovering * longer timeout * Fix region check * set back to precheck * add region info * longer timeout * Address comments * format --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
  • Compare 60 commits »

9 hours ago

avadesian synced commits to master at avadesian/skypilot from mirror

  • fae047f6ae Make sky status -r updates the owner of the cluster (#1809) * update the owner list if the owner identity is less than the current one * Update the comments * rephrase
  • 678b09ac01 Update envs instead of replacing (#1806) * Update envs instead of replacing * update docstr * format * update docstr
  • c7d6d1d501 Allow fall back to account ID for AWS (#1808) * Make identity a list * Use a list of identities * Fix comments * Fix key name * Fix user identity * Address comments * format * format * Update sky/backends/backend_utils.py Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
  • Compare 3 commits »

9 hours ago

avadesian synced commits to list-of-identities at avadesian/skypilot from mirror

  • 43ffa245d9 update the owner list if the owner identity is less than the current one

9 hours ago

avadesian synced commits to expanded-job-queue at avadesian/skypilot from mirror

9 hours ago

avadesian synced commits to list-of-identities at avadesian/skypilot from mirror

1 day ago

avadesian synced commits to env-override at avadesian/skypilot from mirror

1 day ago

avadesian synced commits to add-sleep-controller at avadesian/skypilot from mirror

1 day ago

avadesian synced commits to spot-logs at avadesian/skypilot from mirror

3 days ago

avadesian synced commits to master at avadesian/skypilot from mirror

  • 67bd3cc402 [Spot] Spot jobs only failover through the regions in ssh_proxy_command (#1792) * make the spot jobs only failover through the specified regions in ssh_proxy_command * format * ssh_proxy_command to None * address comments * Any include None * format * Fix skypilot_config.set_nested and add tests * Update sky/backends/backend_utils.py Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * format * Fix spot job status when there is only one region and unavailable * longer time for recovering * longer timeout * Fix region check * set back to precheck * add region info * longer timeout * Address comments * format --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
  • 0537fccf16 Fix a typo in file name (#1793) LICENCE -> LICENSE
  • 8737e87b3c Update Lambda Cloud docs. (#1804) * Lambda Cloud docs. * single _
  • Compare 3 commits »

3 days ago

avadesian created NPU type debugging task avade202303221983631

3 days ago

avadesian synced commits to new_provisioner at avadesian/skypilot from mirror

4 days ago

avadesian synced commits to master at avadesian/skypilot from mirror

4 days ago

avadesian created NPU type debugging task avade202303211481492

4 days ago

avadesian opened issue OpenI/aiforge#3860

Add an 8-card resource specification to the intelligent computing NPU cluster

4 days ago

avadesian synced commits to retry at avadesian/skypilot from mirror

4 days ago

avadesian synced commits to releases/0.2.5 at avadesian/skypilot from mirror

4 days ago

avadesian synced commits to new_provisioner at avadesian/skypilot from mirror

4 days ago

avadesian synced commits to master at avadesian/skypilot from mirror

  • de332164c0 Use 1.1.1.1 instead of 8.8.8.8 as IP for network check. (#1800)
  • e9440ccf15 [Example] Run LLaMA LLM chatbots on any cloud with one click (#1799) * [Example] Run LLaMA LLM chatbots on any cloud with one click * Add 2 repo links. * Fix a link.
  • 14cb4b56bb [Spot] Fix spot for new user (#1798) create dir if not exists
  • Compare 3 commits »

4 days ago

avadesian synced commits to releases/0.2.5 at avadesian/skypilot from mirror

  • cdfb00222d Version bump
  • b990a3a24f Make `sky gpunode` reuse existing cluster if possible (#1787) * Handle gpunode reuse * Improve error message + enforcing same resources is too hard
  • bb6429bab5 [Lambda Cloud] Multinode support (#1718) * Multinode working * Change lambda default to gpu_1x_a10 * Add internal ip to local machine * Don't make SSHCommandRunner stream logs * Add tag file refresh and update smoke tests * Update cluster name limit * Revert node timeout * Increase timeout again and nit * Nits * Make Lambda smoke tests use A10 * Use get and set instead of __getitem__ and __setitem__ * Handle no ip case * Update tests * Format * Update optimizer dryruns
  • 75775d35e6 [Spot] Add cancelling state for the spot job (#1785) * Add cancelling state for the spot job * Add cancelling check in TODO * format * fix comment * Make spot cancel more robust in test_smoke * format * format * revert color * Add comment
  • 464b5db86e [spot] Fix multiprocessing signal handling in spot controller (#1745) Previously, we send an interruption signal to the controller process and the controller process handles cleanup. However, we figure out the behavior differs from cloud to cloud (e.g., GCP ignore 'SIGINT'). A possible reason is https://unix.stackexchange.com/questions/356408/strange-problem-with-trap-and-sigint. But anyway, a clean solution is killing the controller process directly, and then cleanup the cluster state. Tested (run the relevant ones): - [ ] Any manual or new tests for this PR (please specify below) - [ ] All smoke tests: `pytest tests/test_smoke.py` - [x] Relevant individual smoke tests: `pytest tests/test_smoke.py --managed-spot` (both AWS and GCP as spot controllers) - [ ] Backward compatibility tests: `bash tests/backward_comaptibility_tests.sh`
  • Compare 12 commits »

5 days ago

avadesian synced commits to new_provisioner at avadesian/skypilot from mirror

5 days ago