JeffDing
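  • Shanghai, China
  • Among the first OpenI Senior Experience Officers, Huawei Cloud Expert (云享专家), Huawei Cloud HCDG core contributor, MindSpore senior developer, and Ascend outstanding developer. Mainly explores and studies Ascend, MindSpore, CANN, Huawei Cloud CodeArts (码道), model adaptation/deployment/quantization/fine-tuning/inference, Huawei Cloud, the Intern (书生) large models, AI Infra, and AI4S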
  • Joined on Apr 26, 2021
  • Organization

JeffDing synced commits to releases/v0.20.2rc at JeffDing/vllm-ascend from mirror

  • 3b56d7a11f [BugFix] DSV4 Initialize KV store for Decode Node after first real request (#9793) ### What this PR does / why we need it Base this change on the release branch version where external KV stores already initialize lazily from put(). The put-only path misses decode-only / pure consumer workers because they can run a real request without storing KV. This keeps the existing put-triggered initialization for producer/store paths, and adds a post-real-forward fallback: start_load_kv() marks the current connector step as a real forward only when forward_context is not None, and get_finished() initializes the backend store afterward if needed. The no-forward path also calls get_finished(), so it explicitly resets the current-step marker and will not initialize the store there. Lazy init only changes the initialization timing: once initialization is actually triggered, initialization failures still raise as fatal errors. This avoids initializing in connector construction, KV cache registration, or no-forward cleanup, while still covering workers that do not call put(). vLLM version: v0.20.1 vLLM main: vllm-project/vllm@c7aa186 --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
  • 3ead473ecd [Doc][Misc] Update v0.20.2rc README information (#10103) ### What this PR does / why we need it? Update README and docs version references for the v0.20.2rc release candidate. - Add v0.20.2rc1 release candidate news to English and Chinese README files. - Add releases/v0.20.2rc to the maintained branch tables. - Update docs header and Sphinx release metadata for v0.20.2rc. ### Does this PR introduce _any_ user-facing change? No. Documentation-only update. ### How was this patch tested? - `git diff --check` - `bash format.sh ci` partially passed; failed only because `shellcheck` is not installed in the local environment. Other hooks passed, including ruff, codespell, typos, clang-format, markdownlint, GitHub Actions workflow lint, PNG export lint, filename-space check, Python package `__init__.py` check, forbidden logger/import checks, boolean-op check, and suggestion check. --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
  • 6363175512 [Doc][Misc] Update release notes known issues (#10039) ### What this PR does / why we need it? Update v0.20.2rc1 release notes known issues and add the DeepSeek V4 KV Pool known issue reference. - git diff --check - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com> (cherry picked from commit 6230b36103676268fc8b28ec397eb0af41fe3c7c)
  • 3790531219 [Doc][Misc] Prepare v0.20.2rc1 release notes (#9602) ### What this PR does / why we need it? This PR prepares the v0.20.2rc1 release documentation set. It adds the new release notes entry for `v0.20.2rc1` and updates the main branch documentation references so the latest RC entry, FAQ link, and version matrix all point to the current release candidate. Related release tracking: - Release checklist: #9591 - Feedback issue: #9586 ### Does this PR introduce _any_ user-facing change? No. This is a documentation-only update for the v0.20.2rc1 release process. ### How was this patch tested? - Reviewed the staged markdown and version substitutions with `git diff` and `rg` - Ran `python -m py_compile docs/source/conf.py` - Full Sphinx build was not run locally because the current environment does not have `docutils` installed - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com> (cherry picked from commit 078fae3f10bceb8fe75060167f334110c20bdaf1)
  • Compare 4 commits »

1 day ago
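
The lazy initialization flow described in commit 3b56d7a11f can be sketched roughly as follows. This is a minimal illustration, not the vllm-ascend implementation: `put()`, `start_load_kv()`, `get_finished()`, and `forward_context` are named in the commit message, while the class `LazyStoreConnector`, the `store_factory` argument, and the dict-backed store are hypothetical stand-ins.

```python
class LazyStoreConnector:
    """Hypothetical connector that defers backend store creation."""

    def __init__(self, store_factory):
        # Nothing is initialized at construction time.
        self._store_factory = store_factory
        self._store = None
        self._real_forward_this_step = False

    def _ensure_store(self):
        # Initialization errors are not swallowed: they propagate as fatal.
        if self._store is None:
            self._store = self._store_factory()
        return self._store

    def put(self, key, value):
        # Producer/store path: the first put() triggers initialization.
        self._ensure_store()[key] = value

    def start_load_kv(self, forward_context):
        # Mark the step as a real forward only when a context exists.
        self._real_forward_this_step = forward_context is not None

    def get_finished(self):
        # Always reset the per-step marker, so a no-forward call
        # never initializes the store.
        was_real_forward = self._real_forward_this_step
        self._real_forward_this_step = False
        if was_real_forward:
            # Fallback for decode-only workers that never call put().
            self._ensure_store()
```

In this sketch a decode-only worker initializes the store on its first real forward step, while cleanup-only calls to `get_finished()` leave it untouched.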

JeffDing synced commits to main at JeffDing/vllm-ascend from mirror

  • e58dd4558f [CI] add slash command dispatch for /e2e, /rerun, /nightly (#10059) ### What this PR does / why we need it? This PR replaces the old comment-based trigger mechanisms with `peter-evans/slash-command-dispatch` to handle slash commands in PR comments: - `/e2e`: Runs specific E2E tests under `tests/e2e/pull_request/`, automatically routed to the appropriate NPU runner. - `/rerun`: Re-runs all failed workflow runs on the current PR commit. - `/nightly`: Triggers specific nightly test suites on A2 and A3 by test case name. It also: - Configures permissions: `/e2e` and `/rerun` allow the PR author or triage+ users, while `/nightly` requires triage+ users. - Removes obsolete files: `pr_e2e_comment.yaml` and `_parse_trigger.yaml`. - Inlines the parse-trigger logic into `schedule_nightly_test_a2/a3.yaml`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation update and CI workflow configuration changes. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
  • eb632e2033 [CI] Fix csrc cache build error (#10114) ### What this PR does / why we need it? This PR pins the `setuptools` dependency to `<72` under `tool.uv.extra-build-dependencies` for `pandas` to resolve a build incompatibility. This PR also makes the csrc cache work for PRs. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
  • 8991ca5f44 [Misc] Rehash AscendStore grouped keys for DSV4/compress layouts (#9789) ### What this PR does / why we need it? Rehash grouped AscendStore block hashes with a framed SHA-256 digest when the store key granularity is larger than the original hash granularity. This keeps the key suffix fixed length while preserving all child block hashes in the digest input, reducing KV Pool metadata and backend key pressure for DSV4/compress layouts. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
  • a4179063b0 [Feature] Simple yet General CPU KV Cache Offloading (#8743) ### What this PR does / why we need it? refer to: https://github.com/vllm-project/vllm/pull/37160 , SimpleCPUOffloadConnector is another design of vLLM's CPU KV cache offloading path. Instead of maintaining a parallel block management stack, it reuses vLLM's existing BlockPool and KVCacheCoordinator infrastructure directly. This gives us HMA support, prefix caching, and LRU eviction for free. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: HF-001 <1670186653@qq.com>
  • 3a593b6685 [CI] Fix csrc CI build error (#10110) ### What this PR does / why we need it? This PR adds `pandas = ["setuptools"]` to `[tool.uv.extra-build-dependencies]` in `pyproject.toml` to resolve build errors during the compilation of C++ extensions (`csrc`) in the CI environment when using `uv`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Verified via CI build pipeline. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
  • Compare 24 commits »

1 day ago
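
The grouped-key rehash described in commit 8991ca5f44 can be illustrated with a small sketch. The commit message only states that a framed SHA-256 digest is computed over all child block hashes, so the framing used here (a 4-byte big-endian count followed by length-prefixed children) and the function names `rehash_group` and `group_store_keys` are assumptions made for illustration, not the AscendStore key layout.

```python
import hashlib
import struct


def rehash_group(child_hashes: list[bytes]) -> str:
    """Return a fixed-length hex digest covering every child block hash."""
    digest = hashlib.sha256()
    # Frame the input: child count first, then each child with its length.
    digest.update(struct.pack(">I", len(child_hashes)))
    for child in child_hashes:
        digest.update(struct.pack(">I", len(child)))
        digest.update(child)
    return digest.hexdigest()


def group_store_keys(block_hashes: list[bytes], group_size: int) -> list[str]:
    """Rehash consecutive groups of block hashes into one key suffix each."""
    return [
        rehash_group(block_hashes[i:i + group_size])
        for i in range(0, len(block_hashes), group_size)
    ]
```

Length-prefixing each child keeps groups such as `[b"ab", b"c"]` and `[b"a", b"bc"]` from colliding, and the key suffix stays 64 hex characters regardless of how many child hashes a group holds.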

JeffDing synced commits to offline-cache at JeffDing/xtuner from mirror

  • ceabe74215 [Fix] adapt clusterx brainpp breaking change

1 day ago

JeffDing synced commits to main at JeffDing/xtuner from mirror

  • 0e40b0d7ba [Refactor] Refactor RL unittest part 1: add pr-fast unit test (#1865) * Add RL PR-fast test suite * Add RL PR-fast risk tests Add _prepare_train_data contract tests covering the text, VLM, invalid group, and fail-fast paths. Add SingleTurnAgentLoop batch judge/pause tests, and tighten batch judge so that it triggers only when the whole group is COMPLETED. Add PR-fast coverage for the RolloutController fake routed branch and CPUResourceManager register/validate_or_raise. * Remove duplicated RL tests migrated to PR-fast Delete the old unit-test entry points under the tests/rl root directory that have already been migrated to tests/rl/fast/pr_fast. This covers the same-named PR-fast tests, as well as the old rollout worker/utils tests already merged into test_rollout_logic.py. * delete useless rollout_output.jsonl * fix ci * fix claude comments * fix ci
  • b7af11ea0a [CI] Skip lmdeploy disaggregated ut until PR 4638 is merged (#1875) skip lmdeploy disaggregated ut
  • 54c9846676 [Fix] fix rollout worker recovery before colocated training when ep_size = 1 (#1869) * fix rollout worker recovery before colocated training * call terminate before shutdown * fix RL_TRAINER_RAY_GET_TIMEOUT default value * mv _request_server_terminate for to sglang and lmdeploy * fix claude comments * fix claude comments
  • 191c526e58 【CI】rl case (#1853) * qwen35 rl * more case * run cases * fix format * fix data pathh * set timeout * npu rl case
  • 986cce8507 Fix RL rollout port allocation collisions in CI (#1872) feat(rl): use fixed port in init_dist_port and remove useless api_port, api_host
  • Compare 8 commits »

1 day ago

JeffDing synced commits to gh-pages at JeffDing/xtuner from mirror

1 day ago

JeffDing synced commits to agentic_branch at JeffDing/xtuner from mirror

  • 2d78e01cd6 sandbox_creates_per_sec (#1874)
  • b1d75b18d3 Add eval-mode sandbox rollouts and trajectory logging (#1848) * Add eval-mode sandbox rollouts and trajectory logging - Add TB2 eval dataloader and eval AgentInSandboxLoop config - Disable token/logprob/routed-expert returns for eval inference - Preserve text-only eval responses and tokenized response length stats - Separate eval replay buffer from training replay buffer - Add regression coverage for text-only eval trajectory saves * Add eval trajectory grouping updates * Simplify eval rollout trace handling * Store eval trajectories as structured artifacts --------- Co-authored-by: liukuikun <641417025@qq.com>
  • Compare 2 commits »

1 day ago

JeffDing synced commits to releases/v0.20.2rc at JeffDing/vllm-ascend from mirror

  • 81d7aacbba [Doc][Model] Add DeepSeek-V4 Flash and Pro documentation (#9966) ## What this PR does / why we need it? Add tutorial documentation for DeepSeek-V4 Flash and DeepSeek-V4 Pro models running on Ascend A3 hardware. This includes: - New tutorial pages: `DeepSeek-V4-Flash.md` and `DeepSeek-V4-Pro.md` - Update to the models index to include both new pages - Update to the supported models matrix to add DeepSeek V4-Flash and V4-Pro entries ## Does this PR introduce _any_ user-facing change? No, this is a documentation-only update. ## How was this patch tested? Documentation-only change, CI verification is sufficient. - vLLM version: 0.20.2rc - vLLM main: N/A --------- Signed-off-by: GDzhu01 <809721801@qq.com>
  • 367b8e62da [BugFix][v0.20.2rc]Reduce sampling is reconstructed to eliminate all patch behaviors and support DFlash and MTP (#9946) ### What this PR does / why we need it? The Reduce_sampling optimization is reconstructed to eliminate all patch behaviors and support the DFlash and MTP. When sampling optimization is enabled, if speculative decoding is Eagle3 or DFlash, sampling can be optimized for both the main model and the MTP layer. If the speculative decoding method is MTP, some models can optimize sampling for both the main model and the MTP layer, while others can only optimize sampling for the main model. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: hzx55906 <513464215@qq.com>
  • c942e37d12 [Feature][v0.20.2rc] Mooncake kvpool usage optimization (#7820) (#9947) ### What this PR does / why we need it? Backport of #7820 to `releases/v0.20.2rc`. Mooncake kvpool usage optimization, including: 1. Expose the `preferred_segment` and `prefer_alloc_in_same_node` Mooncake parameters so users can configure them as needed. 2. Convert `setup` to keyword-argument calls to accommodate different versions of Mooncake. 3. Make `master_server_address` retrievable from both environment variables and configuration files, with environment variables taking precedence. Signed-off-by: LCAIZJ <leichao139636@163.com>
  • ff973b43f6 [Ascend950][BugFix] Fix MoonViT3dPretrainedModel.to overriding quantized ViT weight dtype (#9933) ### What this PR does / why we need it? On Ascend A5 devices, loading KimiK25 models with quantized ViT weights (mxfp8) fails because the initialization flow calls .to(dtype=...), which converts the quantized weights to the model's default dtype (e.g., bf16). This causes a RuntimeError: dtype mismatch during weight loading. This PR patches MoonViT3dPretrainedModel.to() to ignore the dtype argument on A5 devices, ensuring the quantized weight dtypes are preserved. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: gaozihao <gaozihao3@huawei.com>
  • 2de930e462 [Misc] Improve logging in quantization (#9880) ### What this PR does / why we need it? Back port from #9621. Improve logging coverage, error context, and log level correctness in quantization. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? NA. Signed-off-by: mayumeng <m30059191@china.huawei.com> Co-authored-by: mayumeng <m30059191@china.huawei.com>
  • Compare 10 commits »

2 days ago
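
Point 3 of commit c942e37d12 (environment variables take precedence over configuration files for `master_server_address`) can be sketched as below. Only the key name `master_server_address` and the precedence rule come from the commit message. The environment variable name `MASTER_SERVER_ADDRESS`, the function name, and the dict-shaped config are hypothetical.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Hypothetical environment variable name, chosen for illustration only.
ENV_VAR = "MASTER_SERVER_ADDRESS"


def resolve_master_server_address(
    config: Mapping[str, str],
    environ: Mapping[str, str] | None = None,
) -> str | None:
    """Return the address from the environment if set, else from config."""
    env = os.environ if environ is None else environ
    from_env = env.get(ENV_VAR)
    if from_env:
        # A non-empty environment value wins over the config file.
        return from_env
    return config.get("master_server_address")
```

Passing `environ` explicitly keeps the lookup easy to test without mutating the process environment.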

JeffDing synced commits to releases/v0.18.0 at JeffDing/vllm-ascend from mirror

  • 3ba3bd922e [Doc][Misc] Improve readability and fix typos in documentation (#9674) ### What this PR does / why we need it? 1. Added an introduction to the Quickstart section. 2. Remove the Tsinghua mirror source and update pip before using pip. 3. Update the log shown for offline inference. PS: Synchronize the main PR: 1. The following PR content: https://github.com/vllm-project/vllm-ascend/pull/9619 2. Add the second and third changes of the following PR: https://github.com/vllm-project/vllm-ascend/pull/9091 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation changes only. Signed-off-by: sunshine202600 <sunshine202600@163.com>
  • 9478de0f8c [Doc] optimize doc (#9607) ### What this PR does / why we need it? remove redundant logs, keep only the essential parts, and obfuscate the version numbers. ### Does this PR introduce _any_ user-facing change? yes --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
  • Compare 2 commits »

2 days ago

JeffDing synced commits to main at JeffDing/vllm-ascend from mirror

  • 72f39fbd06 [BugFix]Fix the question that cant find num_kv_head from some model config (#10008) ### What this PR does / why we need it? Fix the issue where num_kv_head cannot be found in some model configs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By CI - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
  • 0e4d77312c [Ascend950][BugFix] Fix MoonViT3dPretrainedModel.to overriding quantized ViT weight dtype (#9929) ### What this PR does / why we need it? On Ascend A5 devices, loading KimiK25 models with quantized ViT weights (mxfp8) fails because the initialization flow calls .to(dtype=...), which converts the quantized weights to the model's default dtype (e.g., bf16). This causes a RuntimeError: dtype mismatch during weight loading. This PR patches MoonViT3dPretrainedModel.to() to ignore the dtype argument on A5 devices, ensuring the quantized weight dtypes are preserved. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: gaozihao <gaozihao3@huawei.com>
  • 40dae21230 [CI][Test] Optimize selective test routing and CI test stability (#9994) ### What this PR does / why we need it? This PR fixes and hardens PR CI test selection and several CI test stability issues. - Normalize path separators before matching UT/E2E routing patterns. - Add `--run-all-modules` to support main2main. - Support precision down to test cases `test_a.py::test_b` - Stabilize related tests by: - waiting for NPU memory before the offline TP2 weight-load test; - reusing the existing Qwen3 reranker template path. BugFix - capturing elastic netloader logs from the module logger instead of the global vLLM logger. ``` bash > assert "Failed to load" in log_output or "does not contain" in log_output E AssertionError: assert ('Failed to load' in '' or 'does not contain' in '') tests/ut/model_loader/netloader/test_netloader_elastic.py:363: AssertionError ``` - Variable contamination caused by non-isolated environments ``` bash E pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig E Assertion failed, Flash Comm v1 requires enable_expert_parallel=True for MoE models. [type=assertion_error, input_value=ArgsKwargs((), {'model_co... 'shutdown_timeout': 0}), input_type=ArgsKwargs] E For further information visit https://errors.pydantic.dev/2.13/v/assertion_error ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: MrZ20 <2609716663@qq.com>
  • 028e538c84 [BugFix] fix dsv4 piecewise scenario (#10003) ### What this PR does / why we need it? In [PR#43746](https://github.com/vllm-project/vllm/pull/43746), when VLLM_USE_BREAKABLE_CUDAGRAPH is detected as True, compilation_config.mode becomes None even if it is set explicitly, so the server runs in eager mode. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
  • 16747f8b02 [Test][CI] final part for selected test (#10017) ### What this PR does / why we need it? This PR reorganizes the end-to-end (E2E) test suite by card count (`one_card`, `two_card`, `four_card`) and chip type (`_310p`), removing the previous `light` and `full` subdirectories. It also simplifies the CI workflows by consolidating them into `pr_test.yaml` and updates all associated documentation, readmes, and test configurations to reflect the new directory structure. ### Does this PR introduce _any_ user-facing change? No. This PR only affects internal test organization and CI workflows. ### How was this patch tested? Consolidated CI workflows (`pr_test.yaml`) were run to verify that the reorganized test suite executes correctly on the respective hardware runners. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5
  • Compare 56 commits »

2 days ago

JeffDing synced commits to main at JeffDing/xtuner from mirror

  • 3c6e61f119 refine github rl workflow (#1867) * rl workflow * fix * lint * update * revert
  • 8f0a15f544 fix deepep_max_tokens_per_rank (#1868)
  • 2e4e1c29cb [ci] update timeout config (#1866) * update * update
  • 8030fde07f fix rl ci (#1864) * fix ci * fix
  • 06bad83647 Refactor judger (#1856) * refactor judger * update * update * fix * update doc * refactor * fix * fix * merge * fix --------- Co-authored-by: duanyanhui <45005871+YanhuiDua@users.noreply.github.com>
  • Compare 17 commits »

4 days ago

JeffDing synced commits to gh-pages at JeffDing/xtuner from mirror

  • ff139935cc Deploying to gh-pages from @ InternLM/xtuner@2e4e1c29cb49d827e60dfcba2e9bf416efb81185 🚀
  • 86a9565cbb Deploying to gh-pages from @ InternLM/xtuner@6e49ba98f34d0b371ffcd57a9ed5dd2e304703d7 🚀
  • dae67a2977 Deploying to gh-pages from @ InternLM/xtuner@281b7f5f73422608f11958325fae0dd0dfecf309 🚀
  • Compare 3 commits »

4 days ago

JeffDing synced commits to ci/rl_case at JeffDing/xtuner from mirror

4 days ago

JeffDing synced commits to agentic_branch at JeffDing/xtuner from mirror

  • 4be02f24a3 add more info in sandbox (#1871)
  • 4d96fb59cc support localhost agent (#1842) * support localhost agent * inject session id and fix comment * update action init * fix detached running * fix sessionserver * fix tool call can not be jsonload * fix stream proxy crash on client disconnect * support session server timeout and sandbox retry * adapter return logprob and reveal sandbox detach --------- Co-authored-by: braisedpork1964 <497494458@qq.com>
  • Compare 2 commits »

4 days ago

JeffDing synced commits to releases/v0.20.2rc at JeffDing/vllm-ascend from mirror

  • d1f75be21c [BugFix][0.20.2] Chunk wq_b matmul for NPU 65536 dimension limit (#9848) ## Summary - `torch_npu.npu_quant_matmul` does not support weight dimensions >= 65536. When DSA context parallel is enabled, the `wq_b` linear layer's output dimension exceeds this limit, causing `aclnnQuantMatmulV5` to fail with error code 161002. - Split the `wq_b` weight into two chunks along the output dimension so each chunk stays below 65536. Two `npu_quant_matmul` calls are issued and their outputs concatenated along the last dimension. - Guard the chunking with `enable_dsa_cp()` since only DSA-CP models hit the 65536 limit on `wq_b`. ## Changes - **w8a8_dynamic.py**: Add chunked path in `process_weights_after_loading()` that splits weight/scale/offset into halves, flattens scales/offsets to 1D, and applies `maybe_trans_nz` for NZ format conversion. Add chunked forward in `apply()` with two matmul + `torch.cat`. - **dsa_cp.py**: Add chunked forward in the DSA CP attention path for `wq_b`, matching the same two-matmul-and-cat pattern. Signed-off-by: Zheng Shoujian <zheng.shoujian@outlook.com>
  • bcbdde040e [BugFix][v0.20.2rc] Backport MiniMax M2 tool call streaming to v0.20.2rc (#9784) ### What this PR does / why we need it? Backports #9742 to `releases/v0.20.2rc`. Fixes vLLM issue #39649: https://github.com/vllm-project/vllm/issues/39649 for the Ascend-patched vLLM 0.20.2 release branch. This PR backports the MiniMax M2 incremental tool-call streaming parser behavior used by upstream vLLM. The existing `minimax_m2_tool_parser` waits for a complete `<invoke>...</invoke>` block before emitting arguments, so long tool-call arguments are buffered instead of streamed. This PR adds a platform monkey patch that: - emits the tool-call name once `<invoke name=...>` is available - streams partial `<parameter>` content as JSON argument fragments - preserves `prev_tool_call_arr` and `streamed_args_for_tool` for finish handling - uses vLLM shared `find_tool_properties` helper so both Chat Completions tools and Responses `FunctionTool` schemas drive type conversion - handles the v0.20.2 special-token path where `<minimax:tool_call>` can arrive by token id without decoded text - keeps the token-id-started tool-call state across empty decoded chunks until actual `<invoke>` text arrives This follows the upstream vLLM fixes proposed in: - https://github.com/vllm-project/vllm/pull/40253 - https://github.com/vllm-project/vllm/pull/40298 ### Does this PR introduce _any_ user-facing change? Yes. MiniMax M2 auto tool-choice streaming now emits tool-call argument deltas incrementally instead of buffering them until the closing `</invoke>` tag. ### How was this patch tested? 
- `ruff check vllm_ascend/patch/platform/patch_minimax_m2_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - `python -m pytest -q tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - `python -m pytest -q tests/ut/patch/platform` - `python -m pre_commit run --hook-stage manual --files vllm_ascend/patch/platform/patch_minimax_m2_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 Signed-off-by: jack <QwertyJack@users.noreply.github.com>
  • 31860b0d7b [0.20.2][BugFix][P/D] Add compress ratio and block_ids cutting for mooncake hybrid connector (#9810) ### What this PR does / why we need it? This PR adds support for tracking block sizes and compression ratios per KV cache group in the `MooncakeHybridConnector`. It introduces a helper method `_compute_transfer_block_ids` to calculate the block IDs to transfer based on prompt length, compression ratio, and block size, and applies this during the request completion phase. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested by P/D accuracy test. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: linfeng-yuan <1102311262@qq.com>
  • e5487a770b [BugFix][v0.20.2rc] Fix DSA compressed idle dummy graph OOB (#9817) ### What this PR does / why we need it? Fixes #9816. This PR backports the DSA/compress idle dummy fix to `releases/v0.20.2rc`. Without this patch, a request can return 200 and then the engine crashes in idle `execute_dummy_batch()` when the ACLGraph dummy path is replayed, with 507011 / MTE out-of-range. The fix keeps the change scoped to the DSA/compress metadata path: - clear stale DSA slot mapping and block table values when dummy / graph-capture runs do not have fresh compressed scheduling metadata. ### Does this PR introduce _any_ user-facing change? No API or configuration change. ### How was this patch tested? Code checks: ```bash python -m py_compile vllm_ascend/worker/model_runner_v1.py python -m ruff check vllm_ascend/worker/model_runner_v1.py git diff --check ``` Failure was reproduced on A3, image `quay.io/ascend/vllm-ascend:nightly-releases-v0.20.2rc-a3`, model `/models/DeepSeek-V4-Flash-w8a8-mtp`, `dp=8/tp=1`, EP, MTP, async scheduling, ACLGraph `FULL_DECODE_ONLY`, `gpu_memory_utilization=0.93`, `max_model_len=450560`: ```text The request returns 200, then idle execute_dummy_batch() replays the graph dummy path and fails with 507011 / MTE out-of-range. ``` Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
  • 84c226eaf9 [BugFix][v0.20.2rc] Patch GLM tool-call final chunks (#9788) ### What this PR does / why we need it? Backports the GLM tool-call streaming final chunk patch to `releases/v0.20.2rc` for vllm-ascend issue #8327. This patch keeps the old GLM parser patch removed and only patches the serving-layer final chunk behavior that is still missing in the paired vLLM snapshot: - final remaining-argument chunks no longer repeat `id`, `type`, or `function.name` by default; - terminal argument chunks with `finish_reason="tool_calls"` are split into an argument chunk with `finish_reason=null`, followed by an empty finish chunk; - the GLM streaming wrapper is independent from the MiniMax usage-accounting wrapper, so MiniMax loading does not rely on the removed GLM parser patch. Related upstream vLLM issue/PR: vllm-project/vllm#44098 and vllm-project/vllm#44099. ### Does this PR introduce _any_ user-facing change? Yes. GLM tool-call streaming now emits final argument and finish chunks in the expected OpenAI-compatible order without repeating function metadata in final remaining-argument chunks. ### How was this patch tested? - `ruff check vllm_ascend/patch/platform/patch_glm_tool_call_streaming.py vllm_ascend/patch/platform/__init__.py tests/ut/patch/platform/test_patch_glm_tool_call_streaming.py` - `python -m py_compile vllm_ascend/patch/platform/patch_glm_tool_call_streaming.py tests/ut/patch/platform/test_patch_glm_tool_call_streaming.py` - `python -m pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_streaming.py` (`4 passed`) - Import probe confirms both wrappers are installed: - `has_minimax_original=True` - `has_glm_original=True` Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
  • Compare 8 commits »

5 days ago
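
The two-matmul-and-concatenate pattern from commit d1f75be21c can be shown without NPU hardware. The sketch below uses plain Python lists instead of `torch_npu.npu_quant_matmul`, ignores quantization scales and offsets, and takes the dimension limit as a parameter. The function names are hypothetical, and only the split-in-half-then-concatenate idea comes from the commit message.

```python
Matrix = list[list[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product of x (m x k) and w (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def chunked_matmul(x: Matrix, w: Matrix, limit: int = 65536) -> Matrix:
    """Split w along the output dimension when it reaches the limit."""
    out_dim = len(w[0])
    if out_dim < limit:
        return matmul(x, w)
    half = out_dim // 2
    # Two chunks along the output (column) dimension.
    w_lo = [row[:half] for row in w]
    w_hi = [row[half:] for row in w]
    out_lo = matmul(x, w_lo)
    out_hi = matmul(x, w_hi)
    # Concatenate along the last dimension, row by row.
    return [lo + hi for lo, hi in zip(out_lo, out_hi)]
```

Because each output column depends on only one column of the weight, the concatenated result equals the unsplit product.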

JeffDing synced commits to releases/v0.18.0 at JeffDing/vllm-ascend from mirror

  • 4a533861c9 [BugFix] [P/D] avoid MTP placeholders exceeding max model length (#9688) ### What this PR does / why we need it? This PR fixes an issue in recompute scheduler when the decode node enables MTP. When `is_mtp_kv_consumer` is enabled, the scheduler fills `request.spec_token_ids` with MTP placeholder tokens for newly added requests, so that decode nodes can match the full graph after pulling KV cache from prefill nodes. However, for the first incoming request on the decode node, if the original request length is already close to `max_model_len`, adding MTP placeholder tokens may make `request.num_tokens + num_spec_tokens` exceed the model length limit. This PR adds a boundary check before filling MTP placeholder tokens: ```python self.max_model_len >= (request.num_tokens + self.num_spec_tokens) ``` With this change, MTP placeholders are only added when the request can still fit within max_model_len, avoiding invalid scheduling behavior for the first request on the decode node. Does this PR introduce any user-facing change? No user-facing API change. This only fixes scheduler behavior for decode-node MTP scenarios and prevents MTP placeholder tokens from exceeding the model length limit. How was this patch tested? Verified the affected recompute scheduler logic. Tested/validated the decode-node MTP first-request scenario shown in the issue, where placeholder tokens should not be added if they would exceed max_model_len. Signed-off-by: liziyu179 <liziyu16@huawei.com>

5 days ago

JeffDing synced commits to main at JeffDing/vllm-ascend from mirror

  • ab32a0cfbf [Feature][Ops] Support A2/A3 and A5 compressor paths (#9350) ### What this PR does / why we need it? This PR merges the compressor updates needed for A2/A3 and A5 coexistence in the vllm-ascend packaging layout: - vllm-ascend main: A2/A3 `arch32` behavior, including non-contiguous `state_cache` support and fp32 RoPE sin/cos support on 910B/910C. - `vllm_ds_uncontigous_018_a5_0515`: A5 `arch35` behavior, including non-contiguous `state_cache` support. - `cann/ops-transformer` `experimental/attention/compressor`: A5 `arch35` fp32 `norm_weight` / `rope_sin` / `rope_cos` ABI. Current GitCode master was rechecked at `c5dbd0934b0f9565503dd094528861e9444cbaf0`. ### Main changes - Splits compressor host tiling and kernel files into `arch32` and `arch35` paths. - Maps GitCode `arch22` compressor sources into local `arch32`, because GitCode maps `arch22` to `ascend910b` / `ascend910_93` / `kirinx90`, while current vllm-ascend maps those SOCs to `arch32`. - Adds the A5 `arch35` compressor host/kernel path, including normal and full-load kernels. - Keeps `state_cache_stride_dim0` support in both paths: - `arch32` keeps vllm-ascend mainline non-contiguous `state_cache` support for 910B/910C. - `arch35` keeps the A5 non-contiguous `state_cache` support from `vllm_ds_uncontigous_018_a5_0515`. - Keeps upstream main's `arch32` `ropeDtype` handling so 910B/910C can use fp32 rope sin/cos while `norm_weight` still follows `x` dtype. - Aligns the A5 `arch35` compressor operator with ops-transformer fp32 handling for `norm_weight`, `rope_sin`, and `rope_cos`. ### Compatibility notes - This PR intentionally does not take GitCode `arch22`'s contiguous-only `state_cache` check into local `arch32`, because that would regress vllm-ascend mainline non-contiguous `state_cache` support. - `arch32` and `arch35` use separate tiling-key/template layouts. 
`compressor.cpp` dispatches them under `__CCE_AICORE__` so `arch32` `ROPE_DTYPE` and A5 `FULL_LOAD/cacheMode` do not leak into each other. - This PR only updates the compressor custom operator. A5 DSV4 Python-side wiring is intentionally left to a follow-up adaptation PR. ### Does this PR introduce _any_ user-facing change? No Python API change is intended. This refreshes and reorganizes the compressor custom-op implementation so A2/A3 and A5 can coexist with their required operator feature sets. ### How was this patch tested? - `git diff --check` - `PATH=/tmp/vllm-ascend-lint-venv/bin:$PATH bash format.sh ci` NPU build and runtime tests were not run locally. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
  • 58c59ca566 [Ascend950] [Feature] Support custom op for GDN on Ascend 950 (#9382) [Ascend950] [Feature] Support custom op for GDN on Ascend 950 ### What this PR does / why we need it? This PR enables custom operator support for **Gated Delta Network (GDN)** on Ascend 950 (A5), and fixes several compilation issues introduced by previous platform filtering logic. **Background & Motivation** - PR [#9271](https://github.com/vllm-project/vllm-ascend/pull/9271) excluded ATB-based kernels (e.g., `get_masked_input_and_mask`, `bgmv_expand`, `sgmv_expand`) from Ascend 950 via the compile macro `VLLM_ENABLE_ATB_AND_DIRECT_KERNELS`. However, the exclusion logic in `vllm_ascend/meta_registration.py` was incomplete: when `get_ascend_device_type() == AscendDeviceType.A5`, the code still attempted to register meta kernels for these ATB-only ops via `register_meta_if_necessary`, causing **compilation errors** on 950. - Additionally, the existing custom op build pipeline (`build_aclnn.sh`) did not correctly aggregate `binary_info_config.json` for A5-specific kernels, and some AscendC host tiling code (`recurrent_gated_delta_rule_tiling_arch35.cpp`) had C++ access-control issues when compiled under the unified `ophost_transformer_tiling_obj` target. **Changes Made** 1. **Fix 950 compilation guard in `meta_registration.py`** - Skip `register_meta_if_necessary` for ATB-only ops (`get_masked_input_and_mask`, `bgmv_expand`, `sgmv_expand`, etc.) on A5, aligning with the `VLLM_ENABLE_ATB_AND_DIRECT_KERNELS` macro. 2. **Add A5-compatible custom AscendC kernels** - `causal_conv1d` - `recurrent_gated_delta_rule` - `chunk_fwd_o` - `chunk_gated_delta_rule_fwd_h` - These kernels are registered under the `arch35` host tiling path for Ascend 950 and properly packaged into `vllm_ascend/_cann_ops_custom`. 3. **Fix custom Triton kernel bugs on Ascend 950** - `chunk_scaled_dot_kkt` - `solve_tril` ### Does this PR introduce _any_ user-facing change? No. 
All changes are backend compilation and operator registration fixes. No Python API or CLI behavior is changed. ### How was this PR tested? **Environment** **Ascend 950 (A5)** - Model: Qwen3-Next - GSM8K accuracy: - `W8A8_MXFP` (MXFP8): **96.44**, **96.36** - `BF16`: **96.13**, **96.44** **Ascend 910B** - Model: Qwen3-Next - GSM8K accuracy: - `BF16`: **95.83**, **96.21** --------- Signed-off-by: Bybbbb11 <171289168@qq.com> Signed-off-by: CXMT <liubaoyang3@huawei.com> ## Co-authors Co-authored-by: Bybbbb11 <171289168@qq.com> Co-authored-by: TallMessiWu <tallmessiwu@qq.com> Co-authored-by: SkychenLee <litianchen2@huawei.com> Co-authored-by: Feilin777 <feilin_kkt@163.com> - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: CXMT <liubaoyang3@huawei.com> Signed-off-by: Bybbbb11 <171289168@qq.com> Co-authored-by: CXMT <liubaoyang3@huawei.com>
  • c6b853acb1 [Worker][Doc] Add A5 server disaggregated PD endpoint configuration (#9690) ### What this PR does / why we need it? Add A5 server disaggregated prefill-decode (PD) endpoint configuration support. On A5 hardware, disaggregated PD requires per-device NPU endpoint JSON files to be loaded during worker initialization so that HCCL communication layers can correctly match the hardware topology. Changes: - `vllm_ascend/worker/worker.py`: In `init_device()`, when running on A5, read the per-device endpoint config directory from `kv_transfer_config.kv_connector_extra_config["ascend_local_comm_res_path"]` and set `ASCEND_LOCAL_COMM_RES` for HCCL in each worker process. ### Does this PR introduce _any_ user-facing change? Yes. Users running disaggregated PD on A5 servers need to configure the endpoint config directory through `kv_connector_extra_config` before launching vLLM: ```json { "kv_connector_extra_config": { "ascend_local_comm_res_path": "/etc/hixlep" } } ``` - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: willzhuyx <zhuyixiang2014@163.com> Signed-off-by: Zhu Yixiang <zhuyixiang2014@163.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: zzzzzmeng <810924837@qq.com>
  • 25d8b60844 [Feature]Add NZ layout support for C8 quantization(GQA) (#9721) ### What this PR does / why we need it? This PR adds NZ layout support for C8 quantization of KV cache. It enables converting the KV cache from the original ND layout to NZ layout during C8 quantization, improving memory access efficiency and computational performance for attention operations. **The test results are as follows (Qwen3-32B-w8a8c8/910B3*4):** **Accuracy:** `python3 aisbench_test.py --dataset "/workspace/aisbench_auto_tools_prefix/GSM8K.jsonl" --concurrency 64 --request_rate 1 --test_accuracy --output_len 32768 --test_type text` [image] **Performance:** `python3 aisbench_test.py --input_len 30720 --output_len 1024 --data_num 256 --concurrency 64 --request_rate 1 --repeat_rate 0.9 --dataset_type prefix_cache --seed 1024` [image] This change improves throughput by ~90% compared to the previous version. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: ztzx3156 <164653300@qq.com> Co-authored-by: pichangping <1337510399@qq.com>
  • 0b527eade3 [BugFix] Chunk wq_b matmul for NPU 65536 dimension limit (#9780) ## Summary - `torch_npu.npu_quant_matmul` does not support weight dimensions >= 65536. When DSA context parallel is enabled, the `wq_b` linear layer's output dimension exceeds this limit, causing `aclnnQuantMatmulV5` to fail with error code 161002. - Split the `wq_b` weight into two chunks along the output dimension so each chunk stays below 65536. Two `npu_quant_matmul` calls are issued and their outputs concatenated along the last dimension. - Guard the chunking with `enable_dsa_cp()` since only DSA-CP models hit the 65536 limit on `wq_b`. ## Changes - **w8a8_dynamic.py**: Add chunked path in `process_weights_after_loading()` that splits weight/scale/offset into halves, flattens scales/offsets to 1D, and applies `maybe_trans_nz` for NZ format conversion. Add chunked forward in `apply()` with two matmul + `torch.cat`. - **dsa_cp.py**: Add chunked forward in the DSA CP attention path for `wq_b`, matching the same two-matmul-and-cat pattern. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 --------- Signed-off-by: Zheng Shoujian <zheng.shoujian@outlook.com>
  • Compare 35 commits »
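
The `wq_b` chunking described in commit 0b527eade3 above (split the weight along the output dimension, run two matmuls, concatenate along the last dimension) can be sketched without NPU hardware. In this illustration a plain-Python `matmul` stands in for `torch_npu.npu_quant_matmul`; scales, offsets, and NZ conversion are omitted, and `chunked_matmul` with its `dim_limit` parameter is a hypothetical name, not the function in `w8a8_dynamic.py`.

```python
def matmul(x, w):
    """Plain-Python matrix product standing in for the quantized matmul."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def chunked_matmul(x, w, dim_limit=65536):
    """Compute x @ w, splitting w into two halves along the output dimension
    when that dimension reaches dim_limit.

    x: list of rows with in_dim entries; w: in_dim rows with out_dim entries.
    """
    out_dim = len(w[0])
    if out_dim < dim_limit:
        return matmul(x, w)  # small enough: single call

    half = out_dim // 2
    w_lo = [row[:half] for row in w]   # first half of the output columns
    w_hi = [row[half:] for row in w]   # second half of the output columns
    y_lo, y_hi = matmul(x, w_lo), matmul(x, w_hi)
    # Concatenate along the last (output) dimension, row by row.
    return [lo + hi for lo, hi in zip(y_lo, y_hi)]
```

Because each output column depends on a single weight column, the two-chunk result matches the unsplit product. Two halves are only sufficient while each half stays below the limit, which is the case the commit describes.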

5 days ago

JeffDing synced commits to releases/v0.20.2rc at JeffDing/vllm-ascend from mirror

  • 145e994a12 [BugFix][v0.20.2rc] Lazy initialize KV store on put (#9774) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>

6 days ago

JeffDing synced commits to main at JeffDing/vllm-ascend from mirror

  • e7840445b7 [BugFix][Parser] Backport MiniMax M2 tool call streaming (#9742) ### What this PR does / why we need it? Fixes vLLM issue #39649: https://github.com/vllm-project/vllm/issues/39649 for the Ascend-patched vLLM 0.20.2 runtime. This PR backports the MiniMax M2 incremental tool-call streaming parser behavior used by upstream vLLM. The existing `minimax_m2_tool_parser` waits for a complete `<invoke>...</invoke>` block before emitting arguments, so long tool-call arguments are buffered instead of streamed. This PR adds a platform monkey patch that: - emits the tool-call name once `<invoke name=...>` is available - streams partial `<parameter>` content as JSON argument fragments - preserves `prev_tool_call_arr` and `streamed_args_for_tool` for finish handling - uses vLLM shared `find_tool_properties` helper so both Chat Completions tools and Responses `FunctionTool` schemas drive type conversion - handles the v0.20.2 special-token path where `<minimax:tool_call>` can arrive by token id without decoded text - keeps the token-id-started tool-call state across empty decoded chunks until actual `<invoke>` text arrives This follows the upstream vLLM fixes proposed in: - https://github.com/vllm-project/vllm/pull/40253 - https://github.com/vllm-project/vllm/pull/40298 ### Does this PR introduce _any_ user-facing change? Yes. MiniMax M2 auto tool-choice streaming now emits tool-call argument deltas incrementally instead of buffering them until the closing `</invoke>` tag. ### How was this patch tested? 
- `ruff check vllm_ascend/patch/platform/patch_minimax_m2_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - `python -m pytest -q tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - `python -m pytest -q tests/ut/patch/platform` - `python -m pre_commit run --hook-stage manual --files vllm_ascend/patch/platform/patch_minimax_m2_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
  • 073da29121 [CI] Change nightly tests to releases/v0.20.2rc (#9778) ### What this PR does / why we need it? Change nightly tests to branch `releases/v0.20.2rc` - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 Signed-off-by: wjunLu <wjunlu217@gmail.com>
  • 6a7d0ce1b5 [BugFix] Lazy initialize KV store on put (#9771) ### What this PR does / why we need it Avoid initializing external KV stores during KV Pool backend construction only for the DSV4/compressed-model path, detected by compress_ratios. Mooncake uses lazy store init only when both DSV4/compress and fabric memory are enabled. Mooncake non-fabric-memory mode keeps the previous eager initialization behavior even for compressed models. Memcache follows the same rule on non-A2 devices, while the A2 path keeps the previous eager initialization and buffer registration behavior even for compressed models. When lazy init is enabled, the store is set up on the first put() call and guarded by a lock so later put() calls reuse the same store. Before the first put(), exists() treats keys as missing so the first store request can reach put(). In lazy-init paths, put failure logs include a hint that the failure is expected if this is the first DSV4/compress request, without tracking put attempts in connector state. ### Special notes for your reviewer None. ### Validation vLLM version: v0.20.1 vLLM main: vllm-project/vllm@c7aa186 - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
  • Compare 3 commits »
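
The lazy-initialization pattern described in commit 6a7d0ce1b5 above (store set up on the first `put()`, guarded by a lock, with `exists()` reporting keys as missing before then) can be sketched as follows. `LazyStore` and its `factory` argument are hypothetical names for illustration; the real connector wraps Mooncake or Memcache backends rather than a dictionary.

```python
import threading


class LazyStore:
    """Illustrative wrapper that defers backend creation to the first put()."""

    def __init__(self, factory):
        self._factory = factory        # callable that builds the backend store
        self._store = None
        self._lock = threading.Lock()

    def _ensure_store(self):
        # Double-checked locking: later put() calls reuse the same store.
        if self._store is None:
            with self._lock:
                if self._store is None:
                    self._store = self._factory()
        return self._store

    def exists(self, key) -> bool:
        # Before the first put(), treat every key as missing so that the
        # first store request can reach put().
        if self._store is None:
            return False
        return key in self._store

    def put(self, key, value) -> None:
        # Initialization failures propagate to the caller unchanged.
        self._ensure_store()[key] = value
```

Passing `dict` as the factory gives an in-memory stand-in: `exists()` returns `False` until the first `put()`, and the factory runs exactly once regardless of how many `put()` calls follow.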

6 days ago

JeffDing synced commits to main at JeffDing/xtuner from mirror

  • 281b7f5f73 Refactor async HF writer status files (#1849) refactor: group async HF writer status files
  • f52e35d04b [Feature] add rank0 async HF save logs (#1845) * feat: add rank0 async HF logs * fix: import time for async HF compose logs * refactor: simplify async HF writer log path * refactor: rely on log timestamps for async HF writer * refactor: use log_rank0 for async HF writer logs
  • Compare 2 commits »

6 days ago

JeffDing synced commits to gh-pages at JeffDing/xtuner from mirror

  • c0c2a1a2a0 Deploying to gh-pages from @ InternLM/xtuner@ebeab77a765b95ce581c5da17ab5ee5c40dd9aff 🚀

6 days ago

JeffDing synced commits to releases/v0.20.2rc at JeffDing/vllm-ascend from mirror

  • 3b56d7a11f [BugFix] DSV4 Initialize KV store for Decode Node after first real request (#9793) ### What this PR does / why we need it Base this change on the release branch version where external KV stores already initialize lazily from put(). The put-only path misses decode-only / pure consumer workers because they can run a real request without storing KV. This keeps the existing put-triggered initialization for producer/store paths, and adds a post-real-forward fallback: start_load_kv() marks the current connector step as a real forward only when forward_context is not None, and get_finished() initializes the backend store afterward if needed. The no-forward path also calls get_finished(), so it explicitly resets the current-step marker and will not initialize the store there. Lazy init only changes the initialization timing: once initialization is actually triggered, initialization failures still raise as fatal errors. This avoids initializing in connector construction, KV cache registration, or no-forward cleanup, while still covering workers that do not call put(). vLLM version: v0.20.1 vLLM main: vllm-project/vllm@c7aa186 --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
  • 3ead473ecd [Doc][Misc] Update v0.20.2rc README information (#10103) ### What this PR does / why we need it? Update README and docs version references for the v0.20.2rc release candidate. - Add v0.20.2rc1 release candidate news to English and Chinese README files. - Add releases/v0.20.2rc to the maintained branch tables. - Update docs header and Sphinx release metadata for v0.20.2rc. ### Does this PR introduce _any_ user-facing change? No. Documentation-only update. ### How was this patch tested? - `git diff --check` - `bash format.sh ci` partially passed; failed only because `shellcheck` is not installed in the local environment. Other hooks passed, including ruff, codespell, typos, clang-format, markdownlint, GitHub Actions workflow lint, PNG export lint, filename-space check, Python package `__init__.py` check, forbidden logger/import checks, boolean-op check, and suggestion check. --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
  • 6363175512 [Doc][Misc] Update release notes known issues (#10039) ### What this PR does / why we need it? Update v0.20.2rc1 release notes known issues and add the DeepSeek V4 KV Pool known issue reference. - git diff --check - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com> (cherry picked from commit 6230b36103676268fc8b28ec397eb0af41fe3c7c)
  • 3790531219 [Doc][Misc] Prepare v0.20.2rc1 release notes (#9602) ### What this PR does / why we need it? This PR prepares the v0.20.2rc1 release documentation set. It adds the new release notes entry for `v0.20.2rc1` and updates the main branch documentation references so the latest RC entry, FAQ link, and version matrix all point to the current release candidate. Related release tracking: - Release checklist: #9591 - Feedback issue: #9586 ### Does this PR introduce _any_ user-facing change? No. This is a documentation-only update for the v0.20.2rc1 release process. ### How was this patch tested? - Reviewed the staged markdown and version substitutions with `git diff` and `rg` - Ran `python -m py_compile docs/source/conf.py` - Full Sphinx build was not run locally because the current environment does not have `docutils` installed - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com> (cherry picked from commit 078fae3f10bceb8fe75060167f334110c20bdaf1)
  • Compare 4 commits »

1 day ago

JeffDing synced commits to main at JeffDing/vllm-ascend from mirror

  • e58dd4558f [CI] add slash command dispatch for /e2e, /rerun, /nightly (#10059) ### What this PR does / why we need it? This PR replaces the old comment-based trigger mechanisms with `peter-evans/slash-command-dispatch` to handle slash commands in PR comments: - `/e2e`: Runs specific E2E tests under `tests/e2e/pull_request/`, automatically routed to the appropriate NPU runner. - `/rerun`: Re-runs all failed workflow runs on the current PR commit. - `/nightly`: Triggers specific nightly test suites on A2 and A3 by test case name. It also: - Configures permissions: `/e2e` and `/rerun` allow the PR author or triage+ users, while `/nightly` requires triage+ users. - Removes obsolete files: `pr_e2e_comment.yaml` and `_parse_trigger.yaml`. - Inlines the parse-trigger logic into `schedule_nightly_test_a2/a3.yaml`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation update and CI workflow configuration changes. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
  • eb632e2033 [CI] Fix csrc cache build error (#10114) ### What this PR does / why we need it? This PR pins the `setuptools` dependency to `<72` under `tool.uv.extra-build-dependencies` for `pandas` to resolve a build incompatibility. This PR also makes the csrc cache works for PR ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
  • 8991ca5f44 [Misc] Rehash AscendStore grouped keys for DSV4/compress layouts (#9789) ### What this PR does / why we need it? Rehash grouped AscendStore block hashes with a framed SHA-256 digest when the store key granularity is larger than the original hash granularity. This keeps the key suffix fixed length while preserving all child block hashes in the digest input, reducing KV Pool metadata and backend key pressure for DSV4/compress layouts. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
  • a4179063b0 [Feature] Simple yet General CPU KV Cache Offloading (#8743) ### What this PR does / why we need it? refer to: https://github.com/vllm-project/vllm/pull/37160 , SimpleCPUOffloadConnector is another design of vLLM's CPU KV cache offloading path. Instead of maintaining a parallel block management stack, it reuses vLLM's existing BlockPool and KVCacheCoordinator infrastructure directly. This gives us HMA support, prefix caching, and LRU eviction for free. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: HF-001 <1670186653@qq.com>
  • 3a593b6685 [CI] Fix csrc CI build error (#10110) ### What this PR does / why we need it? This PR adds `pandas = ["setuptools"]` to `[tool.uv.extra-build-dependencies]` in `pyproject.toml` to resolve build errors during the compilation of C++ extensions (`csrc`) in the CI environment when using `uv`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Verified via CI build pipeline. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
  • Compare 24 commits »
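
The idea behind commit 8991ca5f44 above (collapsing a group of child block hashes into one fixed-length key suffix with a framed SHA-256 digest) can be sketched with the standard library. The framing used here, a 4-byte big-endian count followed by length-prefixed hashes, is an assumption chosen for illustration; the commit message does not specify the actual frame layout, and `rehash_group` is a hypothetical name.

```python
import hashlib


def rehash_group(block_hashes: list) -> bytes:
    """Return a fixed-length digest over a group of child block hashes.

    Each hash is length-prefixed ("framed") so that different groupings of
    the same bytes cannot collide by simple concatenation.
    """
    h = hashlib.sha256()
    h.update(len(block_hashes).to_bytes(4, "big"))   # number of children
    for bh in block_hashes:
        h.update(len(bh).to_bytes(4, "big"))         # frame: child length
        h.update(bh)                                 # frame: child bytes
    return h.digest()                                # always 32 bytes
```

Without framing, the groups `[b"ab", b"c"]` and `[b"a", b"bc"]` would feed identical bytes to the hash; with the length prefixes they produce different digests, while the output length stays fixed at 32 bytes regardless of group size.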

1 day ago

JeffDing synced commits to offline-cache at JeffDing/xtuner from mirror

  • ceabe74215 [Fix] adapt clusterx brainpp breaking change

1 day ago

JeffDing synced commits to main at JeffDing/xtuner from mirror

  • 0e40b0d7ba [Refactor] Refactor RL unittest part 1: add pr-fast unit test (#1865) * Add RL PR-fast test suite * Add RL PR-fast risk tests Add _prepare_train_data contract tests covering text, VLM, invalid group, and fail-fast paths. Add SingleTurnAgentLoop batch judge/pause tests, and tighten batch judge so that it triggers only when the whole group is COMPLETED. Add PR-fast coverage for the RolloutController fake routed branch and for CPUResourceManager register/validate_or_raise. * Remove duplicated RL tests migrated to PR-fast Delete the old unit-test entry points under the tests/rl root that have already been migrated into tests/rl/fast/pr_fast. This covers the PR-fast tests of the same name, as well as the old rollout worker/utils tests already merged into test_rollout_logic.py. * delete useless rollout_output.jsonl * fix ci * fix claude comments * fix ci
  • b7af11ea0a [CI] Skip lmdeploy disaggregated ut until PR 4638 is merged (#1875) skip lmdeploy disaggregated ut
  • 54c9846676 [Fix] fix rollout worker recovery before colocated training when ep_size = 1 (#1869) * fix rollout worker recovery before colocated training * call terminate before shutdown * fix RL_TRAINER_RAY_GET_TIMEOUT default value * mv _request_server_terminate for to sglang and lmdeploy * fix claude comments * fix claude comments
  • 191c526e58 [CI] rl case (#1853) * qwen35 rl * more case * run cases * fix format * fix data path * set timeout * npu rl case
  • 986cce8507 Fix RL rollout port allocation collisions in CI (#1872) feat(rl): use fixed port in init_dist_port and remove useless api_port, api_host
  • Compare 8 commits »

1 day ago

JeffDing synced commits to gh-pages at JeffDing/xtuner from mirror

1 day ago

JeffDing synced commits to agentic_branch at JeffDing/xtuner from mirror

  • 2d78e01cd6 sandbox_creates_per_sec (#1874)
  • b1d75b18d3 Add eval-mode sandbox rollouts and trajectory logging (#1848) * Add eval-mode sandbox rollouts and trajectory logging - Add TB2 eval dataloader and eval AgentInSandboxLoop config - Disable token/logprob/routed-expert returns for eval inference - Preserve text-only eval responses and tokenized response length stats - Separate eval replay buffer from training replay buffer - Add regression coverage for text-only eval trajectory saves * Add eval trajectory grouping updates * Simplify eval rollout trace handling * Store eval trajectories as structured artifacts --------- Co-authored-by: liukuikun <641417025@qq.com>
  • Compare 2 commits »

1 day ago

JeffDing synced commits to releases/v0.20.2rc at JeffDing/vllm-ascend from mirror

  • 81d7aacbba [Doc][Model] Add DeepSeek-V4 Flash and Pro documentation (#9966) ## What this PR does / why we need it? Add tutorial documentation for DeepSeek-V4 Flash and DeepSeek-V4 Pro models running on Ascend A3 hardware. This includes: - New tutorial pages: `DeepSeek-V4-Flash.md` and `DeepSeek-V4-Pro.md` - Update to the models index to include both new pages - Update to the supported models matrix to add DeepSeek V4-Flash and V4-Pro entries ## Does this PR introduce _any_ user-facing change? No, this is a documentation-only update. ## How was this patch tested? Documentation-only change, CI verification is sufficient. - vLLM version: 0.20.2rc - vLLM main: N/A --------- Signed-off-by: GDzhu01 <809721801@qq.com>
  • 367b8e62da [BugFix][v0.20.2rc]Reduce sampling is reconstructed to eliminate all patch behaviors and support DFlash and MTP (#9946) ### What this PR does / why we need it? The Reduce_sampling optimization is restructured to eliminate all patch behaviors and to support DFlash and MTP. When sampling optimization is enabled and the speculative decoding method is Eagle3 or DFlash, sampling can be optimized for both the main model and the MTP layer. If the speculative decoding method is MTP, some models can optimize sampling for both the main model and the MTP layer, while others can only optimize sampling for the main model. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: hzx55906 <513464215@qq.com>
  • c942e37d12 [Feature][v0.20.2rc] Mooncake kvpool usage optimization (#7820) (#9947) ### What this PR does / why we need it? Backport of #7820 to `releases/v0.20.2rc`. Mooncake kvpool usage optimization, including: 1. Expose the `preferred_segment` and `prefer_alloc_in_same_node` Mooncake parameters so users can configure them as needed. 2. Convert `setup` to keyword-argument calls to accommodate different versions of Mooncake. 3. Make `master_server_address` retrievable from both environment variables and configuration files, with environment variables taking precedence. Signed-off-by: LCAIZJ <leichao139636@163.com>
  • ff973b43f6 [Ascend950][BugFix] Fix MoonViT3dPretrainedModel.to overriding quantized ViT weight dtype (#9933) ### What this PR does / why we need it? On Ascend A5 devices, loading KimiK25 models with quantized ViT weights (mxfp8) fails because the initialization flow calls .to(dtype=...), which converts the quantized weights to the model's default dtype (e.g., bf16). This causes a RuntimeError: dtype mismatch during weight loading. This PR patches MoonViT3dPretrainedModel.to() to ignore the dtype argument on A5 devices, ensuring the quantized weight dtypes are preserved. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: gaozihao <gaozihao3@huawei.com>
  • 2de930e462 [Misc] Improve logging in quantization (#9880) ### What this PR does / why we need it? Back port from #9621. Improve logging coverage, error context, and log level correctness in quantization. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? NA. Signed-off-by: mayumeng <m30059191@china.huawei.com> Co-authored-by: mayumeng <m30059191@china.huawei.com>
  • Compare 10 commits »
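
The precedence rule from the Mooncake kvpool commit above (c942e37d12), where `master_server_address` may come from either an environment variable or a configuration file and the environment variable wins, can be sketched as below. The commit message does not name the environment variable, so `EXAMPLE_MASTER_SERVER_ADDRESS` is a deliberately hypothetical placeholder, as is the helper `resolve_master_server_address`.

```python
import os

# Hypothetical variable name used only for this sketch; the real name is
# not stated in the commit message.
ENV_VAR = "EXAMPLE_MASTER_SERVER_ADDRESS"


def resolve_master_server_address(config: dict, environ=None) -> str:
    """Return the master server address, preferring the environment.

    config:  parsed configuration file contents (a plain dict here).
    environ: mapping to read from; defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR) or config.get("master_server_address")
    if not value:
        raise ValueError(
            f"master_server_address is not set in {ENV_VAR} or the config file"
        )
    return value
```

Passing `environ` explicitly keeps the function easy to test without mutating the process environment.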

2 days ago

JeffDing synced commits to releases/v0.18.0 at JeffDing/vllm-ascend from mirror

  • 3ba3bd922e [Doc][Misc] Improve readability and fix typos in documentation (#9674) ### What this PR does / why we need it? 1. Add an introduction to the Quickstart section. 2. Remove the Tsinghua mirror source and update pip before using pip. 3. Update the log shown for offline inference. P.S. This synchronizes changes from the main branch: 1. The content of the following PR: https://github.com/vllm-project/vllm-ascend/pull/9619 2. The second and third changes of the following PR: https://github.com/vllm-project/vllm-ascend/pull/9091 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation changes only. Signed-off-by: sunshine202600 <sunshine202600@163.com>
  • 9478de0f8c [Doc] optimize doc (#9607) ### What this PR does / why we need it? remove redundant logs, keep only the essential parts, and obfuscate the version numbers. ### Does this PR introduce _any_ user-facing change? yes --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
  • Compare 2 commits »

2 days ago

JeffDing synced commits to main at JeffDing/vllm-ascend from mirror

  • 72f39fbd06 [BugFix]Fix the question that cant find num_kv_head from some model config (#10008) ### What this PR does / why we need it? Fix the issue where num_kv_head cannot be found in some model configs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By CI. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
  • 0e4d77312c [Ascend950][BugFix] Fix MoonViT3dPretrainedModel.to overriding quantized ViT weight dtype (#9929) ### What this PR does / why we need it? On Ascend A5 devices, loading KimiK25 models with quantized ViT weights (mxfp8) fails because the initialization flow calls .to(dtype=...), which converts the quantized weights to the model's default dtype (e.g., bf16). This causes a RuntimeError: dtype mismatch during weight loading. This PR patches MoonViT3dPretrainedModel.to() to ignore the dtype argument on A5 devices, ensuring the quantized weight dtypes are preserved. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 Signed-off-by: gaozihao <gaozihao3@huawei.com>
  • 40dae21230 [CI][Test] Optimize selective test routing and CI test stability (#9994) ### What this PR does / why we need it? This PR fixes and hardens PR CI test selection and several CI test stability issues. - Normalize path separators before matching UT/E2E routing patterns. - Add `--run-all-modules` to support main2main. - Support precision down to test cases `test_a.py::test_b` - Stabilize related tests by: - waiting for NPU memory before the offline TP2 weight-load test; - reusing the existing Qwen3 reranker template path. BugFix - capturing elastic netloader logs from the module logger instead of the global vLLM logger. ``` bash > assert "Failed to load" in log_output or "does not contain" in log_output E AssertionError: assert ('Failed to load' in '' or 'does not contain' in '') tests/ut/model_loader/netloader/test_netloader_elastic.py:363: AssertionError ``` - Variable contamination caused by non-isolated environments ``` bash E pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig E Assertion failed, Flash Comm v1 requires enable_expert_parallel=True for MoE models. [type=assertion_error, input_value=ArgsKwargs((), {'model_co... 'shutdown_timeout': 0}), input_type=ArgsKwargs] E For further information visit https://errors.pydantic.dev/2.13/v/assertion_error ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: MrZ20 <2609716663@qq.com>
  • 028e538c84 [BugFix] fix dsv4 piecewise scenario (#10003) ### What this PR does / why we need it? In [PR#43746](https://github.com/vllm-project/vllm/pull/43746), when VLLM_USE_BREAKABLE_CUDAGRAPH is detected as True, compilation_config.mode becomes None even if it is set explicitly, so the server runs in eager mode. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
  • 16747f8b02 [Test][CI] final part for selected test (#10017) ### What this PR does / why we need it? This PR reorganizes the end-to-end (E2E) test suite by card count (`one_card`, `two_card`, `four_card`) and chip type (`_310p`), removing the previous `light` and `full` subdirectories. It also simplifies the CI workflows by consolidating them into `pr_test.yaml` and updates all associated documentation, readmes, and test configurations to reflect the new directory structure. ### Does this PR introduce _any_ user-facing change? No. This PR only affects internal test organization and CI workflows. ### How was this patch tested? Consolidated CI workflows (`pr_test.yaml`) were run to verify that the reorganized test suite executes correctly on the respective hardware runners. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5
  • Compare 56 commits »

2 days ago
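The routing changes listed in #9994 (normalizing path separators before pattern matching, and addressing a single case as `test_a.py::test_b`) can be sketched as follows. This is only an illustration under assumptions: the `ROUTES` table, `select_tests`, and `split_test_id` are hypothetical and do not reproduce the project's actual CI scripts.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table: glob pattern over source paths -> test target.
ROUTES = {
    "vllm_ascend/quantization/*": "tests/ut/quantization",
    "vllm_ascend/worker/*": "tests/ut/worker",
}


def normalize(path: str) -> str:
    """Use forward slashes so patterns match regardless of the host OS."""
    return path.replace("\\", "/")


def select_tests(changed_files):
    """Return the test targets whose pattern matches any changed file."""
    selected = []
    for path in changed_files:
        norm = normalize(path)
        for pattern, target in ROUTES.items():
            if fnmatchcase(norm, pattern) and target not in selected:
                selected.append(target)
    return selected


def split_test_id(test_id: str):
    """Split 'test_a.py::test_b' into (file, case); case is None if absent."""
    file_part, sep, case = test_id.partition("::")
    return normalize(file_part), (case if sep else None)


print(select_tests(["vllm_ascend\\worker\\worker.py"]))
print(split_test_id("tests/ut/test_a.py::test_b"))
```

Without the `normalize` step, a path written with backslashes would match none of the forward-slash patterns and its tests would be skipped.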

JeffDing synced commits to main at JeffDing/xtuner from mirror

  • 3c6e61f119 refine github rl workflow (#1867) * rl workflow * fix * lint * update * revert
  • 8f0a15f544 fix deepep_max_tokens_per_rank (#1868)
  • 2e4e1c29cb [ci] update timeout config (#1866) * update * update
  • 8030fde07f fix rl ci (#1864) * fix ci * fix
  • 06bad83647 Refactor judger (#1856) * refactor judger * update * update * fix * update doc * refactor * fix * fix * merge * fix --------- Co-authored-by: duanyanhui <45005871+YanhuiDua@users.noreply.github.com>
  • Compare 17 commits »

4 days ago

JeffDing synced commits to gh-pages at JeffDing/xtuner from mirror

  • ff139935cc Deploying to gh-pages from @ InternLM/xtuner@2e4e1c29cb49d827e60dfcba2e9bf416efb81185 🚀
  • 86a9565cbb Deploying to gh-pages from @ InternLM/xtuner@6e49ba98f34d0b371ffcd57a9ed5dd2e304703d7 🚀
  • dae67a2977 Deploying to gh-pages from @ InternLM/xtuner@281b7f5f73422608f11958325fae0dd0dfecf309 🚀
  • Compare 3 commits »

4 days ago

JeffDing synced commits to ci/rl_case at JeffDing/xtuner from mirror

4 days ago

JeffDing synced commits to agentic_branch at JeffDing/xtuner from mirror

  • 4be02f24a3 add more info in sandbox (#1871)
  • 4d96fb59cc support localhost agent (#1842) * support localhost agent * inject session id and fix comment * update action init * fix detached running * fix sessionserver * fix tool call can not be jsonload * fix stream proxy crash on client disconnect * support session server timeout and sandbox retry * adapter return logprob and reveal sandbox detach --------- Co-authored-by: braisedpork1964 <497494458@qq.com>
  • Compare 2 commits »

4 days ago

JeffDing synced commits to releases/v0.20.2rc at JeffDing/vllm-ascend from mirror

  • d1f75be21c [BugFix][0.20.2] Chunk wq_b matmul for NPU 65536 dimension limit (#9848) ## Summary - `torch_npu.npu_quant_matmul` does not support weight dimensions >= 65536. When DSA context parallel is enabled, the `wq_b` linear layer's output dimension exceeds this limit, causing `aclnnQuantMatmulV5` to fail with error code 161002. - Split the `wq_b` weight into two chunks along the output dimension so each chunk stays below 65536. Two `npu_quant_matmul` calls are issued and their outputs concatenated along the last dimension. - Guard the chunking with `enable_dsa_cp()` since only DSA-CP models hit the 65536 limit on `wq_b`. ## Changes - **w8a8_dynamic.py**: Add chunked path in `process_weights_after_loading()` that splits weight/scale/offset into halves, flattens scales/offsets to 1D, and applies `maybe_trans_nz` for NZ format conversion. Add chunked forward in `apply()` with two matmul + `torch.cat`. - **dsa_cp.py**: Add chunked forward in the DSA CP attention path for `wq_b`, matching the same two-matmul-and-cat pattern. Signed-off-by: Zheng Shoujian <zheng.shoujian@outlook.com>
  • bcbdde040e [BugFix][v0.20.2rc] Backport MiniMax M2 tool call streaming to v0.20.2rc (#9784) ### What this PR does / why we need it? Backports #9742 to `releases/v0.20.2rc`. Fixes vLLM issue #39649: https://github.com/vllm-project/vllm/issues/39649 for the Ascend-patched vLLM 0.20.2 release branch. This PR backports the MiniMax M2 incremental tool-call streaming parser behavior used by upstream vLLM. The existing `minimax_m2_tool_parser` waits for a complete `<invoke>...</invoke>` block before emitting arguments, so long tool-call arguments are buffered instead of streamed. This PR adds a platform monkey patch that: - emits the tool-call name once `<invoke name=...>` is available - streams partial `<parameter>` content as JSON argument fragments - preserves `prev_tool_call_arr` and `streamed_args_for_tool` for finish handling - uses vLLM shared `find_tool_properties` helper so both Chat Completions tools and Responses `FunctionTool` schemas drive type conversion - handles the v0.20.2 special-token path where `<minimax:tool_call>` can arrive by token id without decoded text - keeps the token-id-started tool-call state across empty decoded chunks until actual `<invoke>` text arrives This follows the upstream vLLM fixes proposed in: - https://github.com/vllm-project/vllm/pull/40253 - https://github.com/vllm-project/vllm/pull/40298 ### Does this PR introduce _any_ user-facing change? Yes. MiniMax M2 auto tool-choice streaming now emits tool-call argument deltas incrementally instead of buffering them until the closing `</invoke>` tag. ### How was this patch tested? 
- `ruff check vllm_ascend/patch/platform/patch_minimax_m2_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - `python -m pytest -q tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - `python -m pytest -q tests/ut/patch/platform` - `python -m pre_commit run --hook-stage manual --files vllm_ascend/patch/platform/patch_minimax_m2_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 Signed-off-by: jack <QwertyJack@users.noreply.github.com>
  • 31860b0d7b [0.20.2][BugFix][P/D] Add compress ratio and block_ids cutting for mooncake hybrid connector (#9810) ### What this PR does / why we need it? This PR adds support for tracking block sizes and compression ratios per KV cache group in the `MooncakeHybridConnector`. It introduces a helper method `_compute_transfer_block_ids` to calculate the block IDs to transfer based on prompt length, compression ratio, and block size, and applies this during the request completion phase. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested by P/D accuracy test. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Co-authored-by: linfeng-yuan <1102311262@qq.com>
  • e5487a770b [BugFix][v0.20.2rc] Fix DSA compressed idle dummy graph OOB (#9817) ### What this PR does / why we need it? Fixes #9816. This PR backports the DSA/compress idle dummy fix to `releases/v0.20.2rc`. Without this patch, a request can return 200 and then the engine crashes in idle `execute_dummy_batch()` when the ACLGraph dummy path is replayed, with 507011 / MTE out-of-range. The fix keeps the change scoped to the DSA/compress metadata path: - clear stale DSA slot mapping and block table values when dummy / graph-capture runs do not have fresh compressed scheduling metadata. ### Does this PR introduce _any_ user-facing change? No API or configuration change. ### How was this patch tested? Code checks: ```bash python -m py_compile vllm_ascend/worker/model_runner_v1.py python -m ruff check vllm_ascend/worker/model_runner_v1.py git diff --check ``` Failure was reproduced on A3, image `quay.io/ascend/vllm-ascend:nightly-releases-v0.20.2rc-a3`, model `/models/DeepSeek-V4-Flash-w8a8-mtp`, `dp=8/tp=1`, EP, MTP, async scheduling, ACLGraph `FULL_DECODE_ONLY`, `gpu_memory_utilization=0.93`, `max_model_len=450560`: ```text The request returns 200, then idle execute_dummy_batch() replays the graph dummy path and fails with 507011 / MTE out-of-range. ``` Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
  • 84c226eaf9 [BugFix][v0.20.2rc] Patch GLM tool-call final chunks (#9788) ### What this PR does / why we need it? Backports the GLM tool-call streaming final chunk patch to `releases/v0.20.2rc` for vllm-ascend issue #8327. This patch keeps the old GLM parser patch removed and only patches the serving-layer final chunk behavior that is still missing in the paired vLLM snapshot: - final remaining-argument chunks no longer repeat `id`, `type`, or `function.name` by default; - terminal argument chunks with `finish_reason="tool_calls"` are split into an argument chunk with `finish_reason=null`, followed by an empty finish chunk; - the GLM streaming wrapper is independent from the MiniMax usage-accounting wrapper, so MiniMax loading does not rely on the removed GLM parser patch. Related upstream vLLM issue/PR: vllm-project/vllm#44098 and vllm-project/vllm#44099. ### Does this PR introduce _any_ user-facing change? Yes. GLM tool-call streaming now emits final argument and finish chunks in the expected OpenAI-compatible order without repeating function metadata in final remaining-argument chunks. ### How was this patch tested? - `ruff check vllm_ascend/patch/platform/patch_glm_tool_call_streaming.py vllm_ascend/patch/platform/__init__.py tests/ut/patch/platform/test_patch_glm_tool_call_streaming.py` - `python -m py_compile vllm_ascend/patch/platform/patch_glm_tool_call_streaming.py tests/ut/patch/platform/test_patch_glm_tool_call_streaming.py` - `python -m pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_streaming.py` (`4 passed`) - Import probe confirms both wrappers are installed: - `has_minimax_original=True` - `has_glm_original=True` Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
  • Compare 8 commits »

5 days ago
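The final-chunk behavior described in #9788 (a terminal argument chunk with `finish_reason="tool_calls"` becomes an argument chunk with `finish_reason=null` followed by an empty finish chunk) can be sketched on plain dictionaries. The chunk layout below is a simplified assumption modeled on OpenAI-style streaming responses, and `split_terminal_chunk` is a hypothetical helper, not the function added by the patch.

```python
import copy


def split_terminal_chunk(chunk):
    """Return [chunk] unchanged, or [argument chunk, finish chunk].

    A chunk is split only when it carries tool-call argument data AND
    already reports finish_reason == "tool_calls".
    """
    choice = chunk["choices"][0]
    delta = choice.get("delta", {})
    if choice.get("finish_reason") != "tool_calls" or not delta.get("tool_calls"):
        return [chunk]

    # First chunk: the remaining arguments, not yet finished.
    arg_chunk = copy.deepcopy(chunk)
    arg_chunk["choices"][0]["finish_reason"] = None

    # Second chunk: empty delta that only signals completion.
    finish_chunk = copy.deepcopy(chunk)
    finish_chunk["choices"][0]["delta"] = {}
    finish_chunk["choices"][0]["finish_reason"] = "tool_calls"
    return [arg_chunk, finish_chunk]


terminal = {
    "choices": [{
        "index": 0,
        "delta": {"tool_calls": [{"index": 0, "function": {"arguments": '"}'}}]},
        "finish_reason": "tool_calls",
    }]
}
for part in split_terminal_chunk(terminal):
    print(part["choices"][0]["finish_reason"], part["choices"][0]["delta"])
```

Copying the chunk before mutating it keeps the caller's original object intact, which matters if the same chunk is also logged or counted elsewhere.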

JeffDing synced commits to releases/v0.18.0 at JeffDing/vllm-ascend from mirror

  • 4a533861c9 [BugFix] [P/D] avoid MTP placeholders exceeding max model length (#9688) ### What this PR does / why we need it? This PR fixes an issue in recompute scheduler when the decode node enables MTP. When `is_mtp_kv_consumer` is enabled, the scheduler fills `request.spec_token_ids` with MTP placeholder tokens for newly added requests, so that decode nodes can match the full graph after pulling KV cache from prefill nodes. However, for the first incoming request on the decode node, if the original request length is already close to `max_model_len`, adding MTP placeholder tokens may make `request.num_tokens + num_spec_tokens` exceed the model length limit. This PR adds a boundary check before filling MTP placeholder tokens: ```python self.max_model_len >= (request.num_tokens + self.num_spec_tokens) ``` With this change, MTP placeholders are only added when the request can still fit within max_model_len, avoiding invalid scheduling behavior for the first request on the decode node. Does this PR introduce any user-facing change? No user-facing API change. This only fixes scheduler behavior for decode-node MTP scenarios and prevents MTP placeholder tokens from exceeding the model length limit. How was this patch tested? Verified the affected recompute scheduler logic. Tested/validated the decode-node MTP first-request scenario shown in the issue, where placeholder tokens should not be added if they would exceed max_model_len. Signed-off-by: liziyu179 <liziyu16@huawei.com>

5 days ago
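The boundary check quoted in #9688 can be exercised in isolation. The sketch below is a minimal stand-in, assuming a simplified `Request` with only the two fields the check touches; `PLACEHOLDER_TOKEN_ID` and `maybe_fill_mtp_placeholders` are hypothetical names, not the scheduler's real API.

```python
from dataclasses import dataclass, field

PLACEHOLDER_TOKEN_ID = 0  # hypothetical placeholder value


@dataclass
class Request:
    """Simplified stand-in for a scheduler request."""
    num_tokens: int
    spec_token_ids: list = field(default_factory=list)


def maybe_fill_mtp_placeholders(request, max_model_len, num_spec_tokens):
    """Add MTP placeholders only if the request still fits in max_model_len."""
    if max_model_len >= request.num_tokens + num_spec_tokens:
        request.spec_token_ids = [PLACEHOLDER_TOKEN_ID] * num_spec_tokens
        return True
    return False


near_limit = Request(num_tokens=4095)
fits = Request(num_tokens=4094)
print(maybe_fill_mtp_placeholders(near_limit, 4096, 2), near_limit.spec_token_ids)
print(maybe_fill_mtp_placeholders(fits, 4096, 2), fits.spec_token_ids)
```

With `max_model_len=4096` and two speculative tokens, a 4095-token request is left without placeholders, while a 4094-token request receives both.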

JeffDing synced commits to main at JeffDing/vllm-ascend from mirror

  • ab32a0cfbf [Feature][Ops] Support A2/A3 and A5 compressor paths (#9350) ### What this PR does / why we need it? This PR merges the compressor updates needed for A2/A3 and A5 coexistence in the vllm-ascend packaging layout: - vllm-ascend main: A2/A3 `arch32` behavior, including non-contiguous `state_cache` support and fp32 RoPE sin/cos support on 910B/910C. - `vllm_ds_uncontigous_018_a5_0515`: A5 `arch35` behavior, including non-contiguous `state_cache` support. - `cann/ops-transformer` `experimental/attention/compressor`: A5 `arch35` fp32 `norm_weight` / `rope_sin` / `rope_cos` ABI. Current GitCode master was rechecked at `c5dbd0934b0f9565503dd094528861e9444cbaf0`. ### Main changes - Splits compressor host tiling and kernel files into `arch32` and `arch35` paths. - Maps GitCode `arch22` compressor sources into local `arch32`, because GitCode maps `arch22` to `ascend910b` / `ascend910_93` / `kirinx90`, while current vllm-ascend maps those SOCs to `arch32`. - Adds the A5 `arch35` compressor host/kernel path, including normal and full-load kernels. - Keeps `state_cache_stride_dim0` support in both paths: - `arch32` keeps vllm-ascend mainline non-contiguous `state_cache` support for 910B/910C. - `arch35` keeps the A5 non-contiguous `state_cache` support from `vllm_ds_uncontigous_018_a5_0515`. - Keeps upstream main's `arch32` `ropeDtype` handling so 910B/910C can use fp32 rope sin/cos while `norm_weight` still follows `x` dtype. - Aligns the A5 `arch35` compressor operator with ops-transformer fp32 handling for `norm_weight`, `rope_sin`, and `rope_cos`. ### Compatibility notes - This PR intentionally does not take GitCode `arch22`'s contiguous-only `state_cache` check into local `arch32`, because that would regress vllm-ascend mainline non-contiguous `state_cache` support. - `arch32` and `arch35` use separate tiling-key/template layouts. 
`compressor.cpp` dispatches them under `__CCE_AICORE__` so `arch32` `ROPE_DTYPE` and A5 `FULL_LOAD/cacheMode` do not leak into each other. - This PR only updates the compressor custom operator. A5 DSV4 Python-side wiring is intentionally left to a follow-up adaptation PR. ### Does this PR introduce _any_ user-facing change? No Python API change is intended. This refreshes and reorganizes the compressor custom-op implementation so A2/A3 and A5 can coexist with their required operator feature sets. ### How was this patch tested? - `git diff --check` - `PATH=/tmp/vllm-ascend-lint-venv/bin:$PATH bash format.sh ci` NPU build and runtime tests were not run locally. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
  • 58c59ca566 [Ascend950] [Feature] Support custom op for GDN on Ascend 950 (#9382) [Ascend950] [Feature] Support custom op for GDN on Ascend 950 ### What this PR does / why we need it? This PR enables custom operator support for **Gated Delta Network (GDN)** on Ascend 950 (A5), and fixes several compilation issues introduced by previous platform filtering logic. **Background & Motivation** - PR [#9271](https://github.com/vllm-project/vllm-ascend/pull/9271) excluded ATB-based kernels (e.g., `get_masked_input_and_mask`, `bgmv_expand`, `sgmv_expand`) from Ascend 950 via the compile macro `VLLM_ENABLE_ATB_AND_DIRECT_KERNELS`. However, the exclusion logic in `vllm_ascend/meta_registration.py` was incomplete: when `get_ascend_device_type() == AscendDeviceType.A5`, the code still attempted to register meta kernels for these ATB-only ops via `register_meta_if_necessary`, causing **compilation errors** on 950. - Additionally, the existing custom op build pipeline (`build_aclnn.sh`) did not correctly aggregate `binary_info_config.json` for A5-specific kernels, and some AscendC host tiling code (`recurrent_gated_delta_rule_tiling_arch35.cpp`) had C++ access-control issues when compiled under the unified `ophost_transformer_tiling_obj` target. **Changes Made** 1. **Fix 950 compilation guard in `meta_registration.py`** - Skip `register_meta_if_necessary` for ATB-only ops (`get_masked_input_and_mask`, `bgmv_expand`, `sgmv_expand`, etc.) on A5, aligning with the `VLLM_ENABLE_ATB_AND_DIRECT_KERNELS` macro. 2. **Add A5-compatible custom AscendC kernels** - `causal_conv1d` - `recurrent_gated_delta_rule` - `chunk_fwd_o` - `chunk_gated_delta_rule_fwd_h` - These kernels are registered under the `arch35` host tiling path for Ascend 950 and properly packaged into `vllm_ascend/_cann_ops_custom`. 3. **Fix custom Triton kernel bugs on Ascend 950** - `chunk_scaled_dot_kkt` - `solve_tril` ### Does this PR introduce _any_ user-facing change? No. 
All changes are backend compilation and operator registration fixes. No Python API or CLI behavior is changed. ### How was this PR tested? **Environment** **Ascend 950 (A5)** - Model: Qwen3-Next - GSM8K accuracy: - `W8A8_MXFP` (MXFP8): **96.44**, **96.36** - `BF16`: **96.13**, **96.44** **Ascend 910B** - Model: Qwen3-Next - GSM8K accuracy: - `BF16`: **95.83**, **96.21** --------- Signed-off-by: Bybbbb11 <171289168@qq.com> Signed-off-by: CXMT <liubaoyang3@huawei.com> ## Co-authors Co-authored-by: Bybbbb11 <171289168@qq.com> Co-authored-by: TallMessiWu <tallmessiwu@qq.com> Co-authored-by: SkychenLee <litianchen2@huawei.com> Co-authored-by: Feilin777 <feilin_kkt@163.com> - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: CXMT <liubaoyang3@huawei.com> Signed-off-by: Bybbbb11 <171289168@qq.com> Co-authored-by: CXMT <liubaoyang3@huawei.com>
  • c6b853acb1 [Worker][Doc] Add A5 server disaggregated PD endpoint configuration (#9690) ### What this PR does / why we need it? Add A5 server disaggregated prefill-decode (PD) endpoint configuration support. On A5 hardware, disaggregated PD requires per-device NPU endpoint JSON files to be loaded during worker initialization so that HCCL communication layers can correctly match the hardware topology. Changes: - `vllm_ascend/worker/worker.py`: In `init_device()`, when running on A5, read the per-device endpoint config directory from `kv_transfer_config.kv_connector_extra_config["ascend_local_comm_res_path"]` and set `ASCEND_LOCAL_COMM_RES` for HCCL in each worker process. ### Does this PR introduce _any_ user-facing change? Yes. Users running disaggregated PD on A5 servers need to configure the endpoint config directory through `kv_connector_extra_config` before launching vLLM: ```json { "kv_connector_extra_config": { "ascend_local_comm_res_path": "/etc/hixlep" } } ``` - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: willzhuyx <zhuyixiang2014@163.com> Signed-off-by: Zhu Yixiang <zhuyixiang2014@163.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: zzzzzmeng <810924837@qq.com>
  • 25d8b60844 [Feature]Add NZ layout support for C8 quantization(GQA) (#9721) ### What this PR does / why we need it? This PR adds NZ layout support for C8 quantization of KV cache. It enables converting the KV cache from the original ND layout to NZ layout during C8 quantization, improving memory access efficiency and computational performance for attention operations. **The test results are as follows(Qwen3-32B-w8a8c8/910B3*4):** **Accuracy:** python3 aisbench_test.py --dataset "/workspace/aisbench_auto_tools_prefix/GSM8K.jsonl" --concurrency 64 --request_rate 1 --test_accuracy --output_len 32768 --test_type text <img width="1583" height="644" alt="image" src="https://github.com/user-attachments/assets/85eefce9-a86e-40a2-bb51-830cbd8ab87b" /> **Performance:** python3 aisbench_test.py --input_len 30720 --output_len 1024 --data_num 256 --concurrency 64 --request_rate 1 --repeat_rate 0.9 --dataset_type prefix_cache --seed 1024 <img width="1447" height="804" alt="image" src="https://github.com/user-attachments/assets/aa3c0179-0921-4a38-96f6-29e2ef5c508a" /> This change improves throughput by ~90% compared to the previous version ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5 --------- Signed-off-by: ztzx3156 <164653300@qq.com> Co-authored-by: pichangping <1337510399@qq.com>
  • 0b527eade3 [BugFix] Chunk wq_b matmul for NPU 65536 dimension limit (#9780) ## Summary - `torch_npu.npu_quant_matmul` does not support weight dimensions >= 65536. When DSA context parallel is enabled, the `wq_b` linear layer's output dimension exceeds this limit, causing `aclnnQuantMatmulV5` to fail with error code 161002. - Split the `wq_b` weight into two chunks along the output dimension so each chunk stays below 65536. Two `npu_quant_matmul` calls are issued and their outputs concatenated along the last dimension. - Guard the chunking with `enable_dsa_cp()` since only DSA-CP models hit the 65536 limit on `wq_b`. ## Changes - **w8a8_dynamic.py**: Add chunked path in `process_weights_after_loading()` that splits weight/scale/offset into halves, flattens scales/offsets to 1D, and applies `maybe_trans_nz` for NZ format conversion. Add chunked forward in `apply()` with two matmul + `torch.cat`. - **dsa_cp.py**: Add chunked forward in the DSA CP attention path for `wq_b`, matching the same two-matmul-and-cat pattern. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 --------- Signed-off-by: Zheng Shoujian <zheng.shoujian@outlook.com>
  • Compare 35 commits »

5 days ago
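The two-matmul-and-cat pattern from #9780 rests on the fact that splitting a weight matrix along its output dimension and concatenating the partial results reproduces the full product. The sketch below shows that identity with plain Python lists. It is not the quantized `torch_npu.npu_quant_matmul` path: scales, offsets, and NZ conversion are left out, and `matmul`, `split_columns`, and `chunked_matmul` are hypothetical helpers.

```python
LIMIT = 65536  # dimension limit quoted in the commit message


def matmul(x, w):
    """Plain matrix product for lists of rows: (m x k) @ (k x n)."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def split_columns(w, limit=LIMIT):
    """Split w into two halves along the output dimension if n >= limit."""
    n = len(w[0])
    if n < limit:
        return [w]
    half = (n + 1) // 2
    return [[row[:half] for row in w], [row[half:] for row in w]]


def chunked_matmul(x, w, limit=LIMIT):
    """One matmul per chunk, then concatenate along the last dimension."""
    outputs = [matmul(x, chunk) for chunk in split_columns(w, limit)]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
# A tiny limit forces the split so the identity can be checked by hand.
print(chunked_matmul(x, w, limit=4))
print(matmul(x, w))
```

Both calls print the same matrix. Two halves are enough only while the output dimension stays below twice the limit, which is the case the commit message describes.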

JeffDing synced commits to releases/v0.20.2rc at JeffDing/vllm-ascend from mirror

  • 145e994a12 [BugFix][v0.20.2rc] Lazy initialize KV store on put (#9774) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>

6 days ago

JeffDing synced commits to main at JeffDing/vllm-ascend from mirror

  • e7840445b7 [BugFix][Parser] Backport MiniMax M2 tool call streaming (#9742) ### What this PR does / why we need it? Fixes vLLM issue #39649: https://github.com/vllm-project/vllm/issues/39649 for the Ascend-patched vLLM 0.20.2 runtime. This PR backports the MiniMax M2 incremental tool-call streaming parser behavior used by upstream vLLM. The existing `minimax_m2_tool_parser` waits for a complete `<invoke>...</invoke>` block before emitting arguments, so long tool-call arguments are buffered instead of streamed. This PR adds a platform monkey patch that: - emits the tool-call name once `<invoke name=...>` is available - streams partial `<parameter>` content as JSON argument fragments - preserves `prev_tool_call_arr` and `streamed_args_for_tool` for finish handling - uses vLLM shared `find_tool_properties` helper so both Chat Completions tools and Responses `FunctionTool` schemas drive type conversion - handles the v0.20.2 special-token path where `<minimax:tool_call>` can arrive by token id without decoded text - keeps the token-id-started tool-call state across empty decoded chunks until actual `<invoke>` text arrives This follows the upstream vLLM fixes proposed in: - https://github.com/vllm-project/vllm/pull/40253 - https://github.com/vllm-project/vllm/pull/40298 ### Does this PR introduce _any_ user-facing change? Yes. MiniMax M2 auto tool-choice streaming now emits tool-call argument deltas incrementally instead of buffering them until the closing `</invoke>` tag. ### How was this patch tested? 
- `ruff check vllm_ascend/patch/platform/patch_minimax_m2_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - `python -m pytest -q tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - `python -m pytest -q tests/ut/patch/platform` - `python -m pre_commit run --hook-stage manual --files vllm_ascend/patch/platform/patch_minimax_m2_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_m2_tool_call_parser.py` - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
  • 073da29121 [CI] Change nightly tests to releases/v0.20.2rc (#9778) ### What this PR does / why we need it? Switch the nightly tests to the `releases/v0.20.2rc` branch. - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 Signed-off-by: wjunLu <wjunlu217@gmail.com>
  • 6a7d0ce1b5 [BugFix] Lazy initialize KV store on put (#9771) ### What this PR does / why we need it Avoid initializing external KV stores during KV Pool backend construction only for the DSV4/compressed-model path, detected by compress_ratios. Mooncake uses lazy store init only when both DSV4/compress and fabric memory are enabled. Mooncake non-fabric-memory mode keeps the previous eager initialization behavior even for compressed models. Memcache follows the same rule on non-A2 devices, while the A2 path keeps the previous eager initialization and buffer registration behavior even for compressed models. When lazy init is enabled, the store is set up on the first put() call and guarded by a lock so later put() calls reuse the same store. Before the first put(), exists() treats keys as missing so the first store request can reach put(). In lazy-init paths, put failure logs include a hint that the failure is expected if this is the first DSV4/compress request, without tracking put attempts in connector state. ### Special notes for your reviewer None. ### Validation vLLM version: v0.20.1 vLLM main: vllm-project/vllm@c7aa186 - vLLM version: v0.20.2 - vLLM main: https://github.com/vllm-project/vllm/commit/39910f2b25aacc09f5e7f166cdf0030b19f8b9e8 Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
  • Compare 3 commits »

6 days ago
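The lazy-initialization rule from #9771 (set up the store on the first `put()`, guard it with a lock, and have `exists()` report keys as missing until then) can be sketched with the standard library. `LazyStore` and `store_factory` are hypothetical names, and an in-memory dict stands in for the external KV store; the real connector has more state than this.

```python
import threading


class LazyStore:
    """Toy KV store wrapper that defers backend creation to the first put()."""

    def __init__(self, store_factory):
        self._factory = store_factory  # hypothetical callable building the backend
        self._store = None
        self._lock = threading.Lock()

    def _ensure_store(self):
        # Double-checked locking: only the first caller builds the store,
        # later put() calls reuse it. A failing factory raises to the caller.
        if self._store is None:
            with self._lock:
                if self._store is None:
                    self._store = self._factory()
        return self._store

    def exists(self, key):
        # Before the first put() every key is treated as missing,
        # so the first store request can still reach put().
        if self._store is None:
            return False
        return key in self._store

    def put(self, key, value):
        self._ensure_store()[key] = value


init_calls = []


def make_backend():
    init_calls.append("init")
    return {}


store = LazyStore(make_backend)
print(store.exists("block-0"), len(init_calls))
store.put("block-0", b"kv")
store.put("block-1", b"kv")
print(store.exists("block-0"), len(init_calls))
```

Construction and `exists()` never touch the backend, so a worker that never stores anything never pays the initialization cost.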

JeffDing synced commits to main at JeffDing/xtuner from mirror

  • 281b7f5f73 Refactor async HF writer status files (#1849) refactor: group async HF writer status files
  • f52e35d04b [Feature] add rank0 async HF save logs (#1845) * feat: add rank0 async HF logs * fix: import time for async HF compose logs * refactor: simplify async HF writer log path * refactor: rely on log timestamps for async HF writer * refactor: use log_rank0 for async HF writer logs
  • Compare 2 commits »

6 days ago

JeffDing synced commits to gh-pages at JeffDing/xtuner from mirror

  • c0c2a1a2a0 Deploying to gh-pages from @ InternLM/xtuner@ebeab77a765b95ce581c5da17ab5ee5c40dd9aff 🚀

6 days ago