# Customization and Extension
## Table of Contents
- [Custom Dataset](#custom-dataset)
- [Custom Models](#custom-models)
- [Custom Dialogue Templates](#custom-dialogue-templates)
## Custom Dataset
We support three methods for customizing datasets.
- [Recommended] **Using command line arguments**: the most convenient way to support custom datasets; it supports four dataset formats (using `SmartPreprocessor`) as well as `dataset_id` and `dataset_path`.
- **Adding datasets to `dataset_info.json`**: more flexible than the first method; it supports two additional preprocessors and lets you specify their parameters: `RenameColumnsPreprocessor` and `ConversationsPreprocessor` (the default is `SmartPreprocessor`). You can modify the built-in `dataset_info.json` in Swift directly, or pass in an external json file using `--dataset_info_path xxx.json` (for users who prefer pip install over git clone to extend datasets).
- **Registering datasets**: more flexible than the first two methods; it supports using functions to preprocess datasets. Methods 1 and 2 are implemented by leveraging method 3. You can modify the source code directly to extend it, or pass a custom registration path using `--custom_register_path xxx.py`; the script will parse the py file (for pip install users).
### 📌 [Recommended] Using Command Line Arguments
Supports directly passing in a custom `dataset_id` (compatible with MS and HF) and `dataset_path`, as well as passing in multiple custom datasets and their respective sample sizes at once; the script automatically preprocesses and concatenates the datasets. If a `dataset_id` is passed in, it defaults to the 'default' subset of that dataset_id and sets the split to 'train'. If the dataset_id has already been registered, it uses the subsets, split, and preprocessing function provided during registration. If a `dataset_path` is passed in, it can be a relative or absolute path, where a relative path is resolved against the current working directory.
```bash
--dataset {dataset_id} {dataset_path}

# Dataset mixing: the following command takes subset1 and subset2 from dataset_id and samples 20,000 records
--dataset {dataset_name}#20000 {dataset_id}:{subset1}/{subset2}#20000 {dataset_path}#10000
```
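For example, a minimal illustrative invocation mixing a built-in dataset with a local file might look like the following; `my_data.jsonl` is a hypothetical local file and the other arguments are just placeholders:

```bash
# Illustrative sketch: 5,000 samples from the built-in alpaca-zh dataset plus 1,000 samples from a local jsonl file
swift sft \
    --model_type qwen-7b-chat \
    --dataset alpaca-zh#5000 my_data.jsonl#1000 \
    --output_dir output
```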
The script supports file formats including `csv`, `json`, and `jsonl`. The files you pass in need to adhere to the dataset formats listed below (only a partial list is shown). All formats support `system`. Files in `json` and `jsonl` format support multi-turn conversations (`csv` does not).
**Format 1:**

Pre-Training

```csv
response
11111
aaaaa
AAAAA
```

```jsonl
{"response": "11111"}
{"response": "aaaaa"}
{"response": "AAAAA"}
```
Single-Round Dialogue

```csv
query,response
11111,22222
aaaaa,bbbbb
AAAAA,BBBBB
```

```jsonl
{"query": "11111", "response": "22222"}
{"query": "aaaaa", "response": "bbbbb"}
{"query": "AAAAA", "response": "BBBBB"}
```
Multi-Round Dialogue

```jsonl
{"query": "55555", "response": "66666"}
{"query": "eeeee", "response": "fffff", "history": []}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
```

```json
[{"query": "55555", "response": "66666"},
{"query": "eeeee", "response": "fffff", "history": []},
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}]
```
**Format 2:**

```jsonl
{"conversations": [{"from": "user", "value": "11111"}, {"from": "assistant", "value": "22222"}]}
{"conversations": [{"from": "user", "value": "aaaaa"}, {"from": "assistant", "value": "bbbbb"}, {"from": "user", "value": "ccccc"}, {"from": "assistant", "value": "ddddd"}]}
{"conversations": [{"from": "user", "value": "AAAAA"}, {"from": "assistant", "value": "BBBBB"}, {"from": "user", "value": "CCCCC"}, {"from": "assistant", "value": "DDDDD"}]}
```
**Format 3:**

```jsonl
{"messages": [{"role": "user", "content": "11111"}, {"role": "assistant", "content": "22222"}]}
{"messages": [{"role": "user", "content": "aaaaa"}, {"role": "assistant", "content": "bbbbb"}, {"role": "user", "content": "ccccc"}, {"role": "assistant", "content": "ddddd"}]}
{"messages": [{"role": "user", "content": "AAAAA"}, {"role": "assistant", "content": "BBBBB"}, {"role": "user", "content": "CCCCC"}, {"role": "assistant", "content": "DDDDD"}]}
```
**Format 4:**

```csv
instruction,input,output
11111,22222,33333
aaaaa,bbbbb,ccccc
AAAAA,BBBBB,CCCCC
```
**Reinforcement Learning (DPO/ORPO)**

```jsonl
{"query": "11111", "response": "22222", "rejected_response": "33333", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
{"query": "aaaaa", "response": "bbbbb", "rejected_response": "ccccc", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
{"query": "AAAAA", "response": "BBBBB", "rejected_response": "CCCCC", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
```
### Adding dataset_info.json
You can refer to the built-in dataset_info.json in Swift to extend datasets. You can add entries directly to the built-in dataset_info.json, or pass in the path to an external dataset_info.json, a JSON string, or a dictionary using `--custom_dataset_info 1.json`.
Adding dataset_id:
```json
# MS
# Usage: `--dataset <dataset_name>`
"<dataset_name>": {
    "dataset_id": "xxx/xxx"
}

# HF
# Usage: `--dataset HF::<dataset_name>` or directly use the `USE_HF` environment variable.
"<dataset_name>": {
    "hf_dataset_id": "xxx/xxx"
}
```
Adding dataset_path:
```json
# You can specify relative and absolute paths. Relative paths are relative to the directory where dataset_info.json is located.
# Usage: `--dataset <dataset_name>`
"<dataset_name>": {
    "dataset_path": "xxx"
}
```
Supported parameters include:
- `dataset_id`: The corresponding ModelScope dataset_id, default is `None`. The simplest setup requires specifying one of `dataset_id`, `hf_dataset_id`, or `dataset_path`.
- `subsets`: A list of subset names, default is `[]`, which means using the 'default' subset.
- `split`: Default is `['train']`, usually not necessary to set.
- `hf_dataset_id`: The corresponding HuggingFace dataset_id, default is `None`.
- `dataset_path`: Used to specify the local path of the dataset, e.g. 1.jsonl, default is `None`. It can be a relative or absolute path; a relative path is resolved against the directory where dataset_info.json is located. If dataset_path is set, the dataset_id, subsets, and hf_dataset_id parameters are ignored.
- `columns`: The default preprocessor is `SmartPreprocessor`. Specifying this parameter switches it to `RenameColumnsPreprocessor`: you rename the columns of the dataset so that it matches format 1 shown above.
- `conversations`: Specifying this parameter sets the preprocessor to `ConversationsPreprocessor` ('columns' takes priority over 'conversations').
- `remove_useless_columns`: Specifies whether to remove useless columns (columns other than 'query', 'response', 'rejected_response', 'system', 'history', 'images'); default is `True`, usually not necessary to set.
- `tags`: Used to annotate the dataset, default is `[]`, usually not necessary to set.
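For illustration, a minimal `dataset_info.json` entry that uses `columns` to rename fields might look like the following sketch; the dataset name, file path, and column names are hypothetical:

```json
"my-qa-dataset": {
    "dataset_path": "data/my_qa.csv",
    "columns": {
        "question": "query",
        "answer": "response"
    },
    "tags": ["example"]
}
```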
If the parameters in `dataset_info.json` are not sufficient for your needs, for example you want to add custom prompts, perform advanced dataset cleaning, or do complex dataset retrieval and preprocessing, you can register a dataset with functions for retrieval and preprocessing.
### Registering Datasets
The following is an example of registering a dataset. The complete py file can be viewed at custom.py, and the sh script can be viewed at custom. You can have the registered content parsed by specifying `--custom_register_path xxx.py`.
```python
from typing import Optional, Tuple

from datasets import Dataset as HfDataset
from modelscope import MsDataset

from swift.llm import get_dataset, register_dataset, get_dataset_from_repo
from swift.utils import get_logger

logger = get_logger()


class CustomDatasetName:
    stsb_en = 'stsb-en'


def _preprocess_stsb(dataset: HfDataset) -> HfDataset:
    prompt = """Task: Based on the given two sentences, provide a similarity score between 0.0 and 5.0.
Sentence 1: {text1}
Sentence 2: {text2}
Similarity score: """
    query = []
    response = []
    for d in dataset:
        query.append(prompt.format(text1=d['text1'], text2=d['text2']))
        response.append(f"{d['label']:.1f}")
    return HfDataset.from_dict({'query': query, 'response': response})


register_dataset(CustomDatasetName.stsb_en, 'huangjintao/stsb', None, _preprocess_stsb, get_dataset_from_repo)

if __name__ == '__main__':
    # test dataset
    train_dataset, val_dataset = get_dataset([CustomDatasetName.stsb_en],
                                             check_dataset_strategy='warning')
    print(f'train_dataset: {train_dataset}')
    print(f'val_dataset: {val_dataset}')
```
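Assuming the file above is saved as `custom.py`, a sketch of how it could be used from the command line (the model type here is just a placeholder):

```bash
swift sft \
    --model_type qwen-7b-chat \
    --custom_register_path custom.py \
    --dataset stsb-en
```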
The `register_dataset` function will register the dataset in `DATASET_MAPPING`. Its parameters are as follows:
- `dataset_name`: Required. The name of the dataset, which is also its unique id.
- `dataset_id_or_path`: Required. The `dataset_id` on the ModelScope Hub, or the local `dataset_dir`.
- `subsets`: List of subsets of the dataset, default is `[]`.
- `split`: Default is `['train']`.
- `preprocess_func`: Preprocessing function.
- `get_function`: Default value is `None`. The function to get the dataset. If `None` is passed, the decorator approach is used to register the dataset; if a function is passed, the normal approach is used.

  `get_function` should return `HfDataset` or `Tuple[HfDataset, Optional[HfDataset]]`. If only one dataset is returned, it becomes the train_dataset. If two datasets are returned, they are the train_dataset and val_dataset, respectively. The `get_dataset` function supports obtaining multiple datasets, for example: `get_dataset(['dataset1', 'dataset2'])`. We concatenate the training and validation parts of each subset and return the merged train_dataset and val_dataset.

  The `HfDataset` returned by the function needs to follow certain specifications. For pre-training, you only need to include the `response` field; see the `'tigerbot-law-zh'` dataset for details. For instruction tuning (single-round dialogue), the `query` and `response` fields need to be included, representing the user's query and the AI assistant's answer respectively; see the `'alpaca-zh'` dataset for details. For multi-round dialogue, an additional `history` field needs to be added, representing the dialogue history; see the `'damo-agent-mini-zh'` dataset for details. If each dataset sample has a different `system`, an additional `system` field needs to be added; you can also refer to the `'damo-agent-mini-zh'` dataset for details.
- `**kwargs`: Other parameters used to annotate the dataset. This parameter generally does not need to be set.
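As a sketch of the return contract described above, a hypothetical `get_function` that builds a small train/val pair in memory could look like this; the exact arguments Swift passes to `get_function` depend on the registration, so they are accepted generically here:

```python
from typing import Optional, Tuple

from datasets import Dataset as HfDataset


def get_my_dataset(*args, **kwargs) -> Tuple[HfDataset, Optional[HfDataset]]:
    # Hypothetical data; a real get_function would load and preprocess an actual dataset here.
    train = HfDataset.from_dict({
        'query': ['What is 1 + 1?', 'Name a prime number.'],
        'response': ['2', '7'],
    })
    val = HfDataset.from_dict({
        'query': ['What is 2 + 2?'],
        'response': ['4'],
    })
    # Returning a tuple yields (train_dataset, val_dataset); returning a single
    # HfDataset would be treated as the train_dataset only.
    return train, val
```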
## Custom Models
The following is an example of a custom model. The complete py file can be viewed at custom.py, and the sh script can be viewed at custom. You can have the registered content parsed by specifying `--custom_register_path xxx.py`.
```python
from typing import Any, Dict

from modelscope import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from torch import dtype as Dtype
from transformers.utils.versions import require_version

from swift.llm import LoRATM, TemplateType, get_model_tokenizer, register_model
from swift.utils import get_logger

logger = get_logger()


class CustomModelType:
    tigerbot_7b = 'tigerbot-7b'
    tigerbot_13b = 'tigerbot-13b'
    tigerbot_13b_chat = 'tigerbot-13b-chat'


class CustomTemplateType:
    tigerbot = 'tigerbot'


@register_model(CustomModelType.tigerbot_7b,
                'TigerResearch/tigerbot-7b-base-v3', LoRATM.llama2,
                TemplateType.default_generation)
@register_model(CustomModelType.tigerbot_13b,
                'TigerResearch/tigerbot-13b-base-v2', LoRATM.llama2,
                TemplateType.default_generation)
@register_model(CustomModelType.tigerbot_13b_chat,
                'TigerResearch/tigerbot-13b-chat-v4', LoRATM.llama2,
                CustomTemplateType.tigerbot)
def get_tigerbot_model_tokenizer(model_dir: str,
                                 torch_dtype: Dtype,
                                 model_kwargs: Dict[str, Any],
                                 load_model: bool = True,
                                 **kwargs):
    use_flash_attn = kwargs.pop('use_flash_attn', False)
    if use_flash_attn:
        require_version('transformers>=4.34')
        logger.info('Setting use_flash_attention_2: True')
        model_kwargs['use_flash_attention_2'] = True
    model_config = AutoConfig.from_pretrained(
        model_dir, trust_remote_code=True)
    model_config.pretraining_tp = 1
    model_config.torch_dtype = torch_dtype
    logger.info(f'model_config: {model_config}')
    tokenizer = AutoTokenizer.from_pretrained(
        model_dir, trust_remote_code=True)
    model = None
    if load_model:
        model = AutoModelForCausalLM.from_pretrained(
            model_dir,
            config=model_config,
            torch_dtype=torch_dtype,
            trust_remote_code=True,
            **model_kwargs)
    return model, tokenizer


if __name__ == '__main__':
    # test model base
    model, tokenizer = get_model_tokenizer(
        CustomModelType.tigerbot_7b, use_flash_attn=False)
    print(model.__class__.__name__)
    # test model chat
    model, tokenizer = get_model_tokenizer(
        CustomModelType.tigerbot_13b_chat, use_flash_attn=False)
    print(model.__class__.__name__)
```
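Assuming the file above is saved as `custom.py`, a hedged command-line sketch for fine-tuning one of the registered models (the dataset here is simply one of the built-in ones):

```bash
swift sft \
    --custom_register_path custom.py \
    --model_type tigerbot-13b-chat \
    --dataset blossom-math-zh
```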
`register_model` will register the model in `MODEL_MAPPING`. The meanings of its parameters are as follows:
- `model_type`: Required. The name of the model, which is also its unique id.
- `model_id_or_path`: Required. The `model_id` of the model on the ModelScope Hub, or the local model directory `model_dir`.
- `lora_target_modules`: Default is `None`. The default lora_target_modules to use when `--lora_target_modules DEFAULT` or `--lora_target_modules AUTO` is specified in the sh script, or when `--lora_target_modules` is not specified.
- `template`: Default is `TemplateType.default`. The default dialogue template to use when `--template_type AUTO` is specified in the sh script, or when `--template_type` is not specified.
- `get_function`: Default value is `None`. The function to get the model and tokenizer. If `None` is passed, the decorator approach is used to register the model; if a function is passed, the normal approach is used.
- `requires`: Default is `[]`. The dependencies required by the model that differ from other models. This parameter generally does not need to be set.
- `torch_dtype`: Default is `None`. The recommended torch_dtype for the model. This parameter generally does not need to be set.
- `revision`: Default is `None`. Used to specify the version of the model. It has no effect if `model_id_or_path` is a local model directory, and generally does not need to be set.
- `ignore_file_pattern`: Default is `None`. A regular expression pattern of file names to ignore when downloading; it is passed to `snapshot_download`. For example, `r'.+\.bin$'`, `r'.+\.safetensors$'`, etc. This parameter generally does not need to be set.
- `**kwargs`: Other parameters used to annotate model capabilities. This parameter generally does not need to be set.
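For contrast with the decorator approach shown earlier, here is a minimal sketch of the "normal approach", i.e. passing `get_function` directly. The model_type `my-tigerbot-7b` is made up for this sketch, and the keyword name follows the parameter list above:

```python
from typing import Any, Dict

from modelscope import AutoModelForCausalLM, AutoTokenizer

from swift.llm import LoRATM, TemplateType, register_model


def get_my_model_tokenizer(model_dir: str, torch_dtype, model_kwargs: Dict[str, Any],
                           load_model: bool = True, **kwargs):
    # Minimal loader: always build the tokenizer, build the model only when load_model is True.
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    model = None
    if load_model:
        model = AutoModelForCausalLM.from_pretrained(
            model_dir, torch_dtype=torch_dtype, trust_remote_code=True, **model_kwargs)
    return model, tokenizer


# 'my-tigerbot-7b' is a hypothetical model_type used only for illustration.
register_model('my-tigerbot-7b', 'TigerResearch/tigerbot-7b-base-v3',
               LoRATM.llama2, TemplateType.default_generation,
               get_function=get_my_model_tokenizer)
```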
## Custom Dialogue Templates
The following is an example of a custom dialogue template. The complete py file can be viewed at custom.py, and the sh script can be viewed at custom.
```python
from swift.llm import (Template, ModelType, dataset_map,
                       get_model_tokenizer, get_template, get_dataset,
                       print_example, register_template, DatasetName)
from swift.utils import get_logger

logger = get_logger()


class CustomTemplateType:
    tigerbot = 'tigerbot'


# Ref: https://github.com/TigerResearch/TigerBot/blob/main/infer.py
register_template(
    CustomTemplateType.tigerbot,
    Template(['{{SYSTEM}}'], ['\n\n### Instruction:\n{{QUERY}}\n\n### Response:\n'], [],
             [['eos_token_id']]))

if __name__ == '__main__':
    # test template
    train_dataset, _ = get_dataset(DatasetName.blossom_math_zh)
    _, tokenizer = get_model_tokenizer(ModelType.qwen_7b_chat, load_model=False)
    template = get_template(CustomTemplateType.tigerbot, tokenizer)
    train_dataset = dataset_map(train_dataset, template.encode)
    print_example(train_dataset[0], tokenizer)
```
`register_template` will register the dialogue template in `TEMPLATE_MAPPING`. The meanings of its parameters are as follows:
- `template_type`: Required. The name of the dialogue template, which is also its unique id.
- `template`: Required. A `Template` instance. Initializing a `Template` requires the following parameters: `prefix`, `prompt`, `chat_sep`, `suffix`, `default_system`.
The template initialization function builds the complete chat template from these components. Their meanings are as follows:
- `prefix`: The prefix part of the dialogue template, generally including the system part, prefix tokens, bos tokens, etc. We use `{{SYSTEM}}` as the placeholder for the system. If `{{SYSTEM}}` does not appear in the prefix, this Template does not support system, e.g. the `damo-agent-mini-zh` dataset.
- `prompt`: Represents one round of dialogue in the template. We use `{{QUERY}}` as the placeholder for the human query in each round; `{{ROUND0}}` is the placeholder for the index of the current round, counting from 0, and `{{ROUND1}}` counts from 1. The AI assistant's reply is concatenated after `prompt`, so no placeholder is designed for it. Loss is calculated only on the AI assistant's reply.
- `chat_sep`: If multi-round dialogue is needed, `chat_sep` is used as the separator between rounds, e.g. a newline. If set to `None`, this Template does not support multi-round dialogue.
- `suffix`: The suffix part of the dialogue template, generally the eos token. It is concatenated after the last round of dialogue.
- `default_system`: The default system.
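Putting these pieces together, here is a hedged sketch of a template that fills in all five parts, including `chat_sep` for multi-round dialogue and a `default_system`. The chat format itself is invented for illustration, and the argument order follows the parameter list above:

```python
from swift.llm import Template, register_template


class MyTemplateType:
    demo = 'my-demo'


register_template(
    MyTemplateType.demo,
    Template(
        ['{{SYSTEM}}\n\n'],                                      # prefix: contains {{SYSTEM}}, so system is supported
        ['### Round {{ROUND1}}\nUser: {{QUERY}}\nAssistant: '],  # prompt: one round; the reply is appended after it
        ['\n\n'],                                                # chat_sep: separator between rounds (None would disable multi-round)
        [['eos_token_id']],                                      # suffix: eos token after the last reply
        'You are a helpful assistant.'))                         # default_system
```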