DFauLo

Description

This repository is the official implementation of the tool DfauLo.

DfauLo is a dynamic data fault localization tool for deep neural networks (DNNs) that locates mislabeled and noisy data in deep learning datasets. Inspired by conventional mutation-based code fault localization, DfauLo generates multiple mutants of the original trained DNN model and maps the features extracted from them into a suspiciousness score indicating the probability that a given data item is a data fault. DfauLo is the first dynamic data fault localization technique: it prioritizes suspected data based on user feedback and generalizes to data faults unseen during training.

[The Full Paper Link]

Installation

pip install -r requirements.txt

Usage

We provide a default setting that runs DFauLo on the EMNIST dataset mentioned in the paper. You can run it directly with:

python dfaulo.py

The results of dfaulo-Round0 (without manual iteration) are stored in ./dataset/CaseStudyData/EMNIST/results/WaveMix/noManual_results_list.json
The higher a data item is ranked in this list, the more likely it is to be faulty.
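
To inspect the ranking, the results file can be loaded with Python's standard json module. The following is only a minimal sketch: it assumes the file holds a single JSON list ordered from most to least suspicious, and because the structure of each entry is not described here, entries are returned unchanged. The helper name top_suspicious is hypothetical and is not part of DFauLo.

```python
import json
from pathlib import Path


def top_suspicious(results_path, k=10):
    """Return the first k entries of a ranked results file.

    Assumption: the file contains one JSON list, ordered from most to
    least suspicious. Entries are returned as-is, whatever their type.
    """
    with Path(results_path).open(encoding="utf-8") as f:
        ranked = json.load(f)
    if not isinstance(ranked, list):
        raise ValueError("expected a JSON list at the top level")
    return ranked[:k]


if __name__ == "__main__":
    path = "./dataset/CaseStudyData/EMNIST/results/WaveMix/noManual_results_list.json"
    # Only print if the default run has already produced the file.
    if Path(path).exists():
        for rank, entry in enumerate(top_suspicious(path, k=10), start=1):
            print(rank, entry)
```

If the file turns out to have a different top-level structure, the helper raises a ValueError rather than guessing.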

Run DFauLo on any Custom Datasets

You can run DFauLo on any custom dataset by changing the settings in dfaulo.py.

Before that, prepare your classification dataset and a classes file in the dataset folder, following the layout below. The root directory is named after the dataset; beneath it, a directory names the split (for example, train); inside the split, each directory is named after a class and stores the images of that class:

MNIST # dataset name
|-- train # trainset
    |-- dog # class name
        |-- XX.png # corresponding images
        |-- XX.png
        |-- ...
    |-- cat
        |-- XX.png
        |-- XX.png
        |-- ...
    |-- ...
|-- classes.json # class name and corresponding index

The classes.json file maps each class name to the index used for that class when the model was trained. It needs to be written in the following form:

{
    "dog": 0,
    "cat": 1,
    "person": 2,
     ...
}
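
For a large number of classes, the mapping can be generated from the folder layout instead of being typed by hand. The sketch below is one possible approach, not part of DFauLo: the function names are hypothetical, and it assumes indices follow alphabetical folder order (the order torchvision's ImageFolder assigns). If your model was trained with a different mapping, as in the dog/cat example above, write classes.json manually so the indices match training.

```python
import json
from pathlib import Path


def build_class_index(dataset_root, image_set="train"):
    """Map each class folder under <dataset_root>/<image_set> to an index.

    Assumption: indices follow alphabetical folder order. This must match
    the label indices the model was trained with.
    """
    split_dir = Path(dataset_root) / image_set
    class_names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(class_names)}


def write_classes_json(dataset_root, image_set="train"):
    """Write the mapping to <dataset_root>/classes.json and return it."""
    mapping = build_class_index(dataset_root, image_set)
    out_path = Path(dataset_root) / "classes.json"
    out_path.write_text(json.dumps(mapping, indent=4), encoding="utf-8")
    return mapping
```

Calling write_classes_json("./dataset/MyDataset") would place classes.json next to the train directory, as in the layout shown earlier; the path here is only an example.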

Next, prepare a model trained on the above dataset, together with the model's transform and loss function, in the models folder. We recommend saving these model arguments in the following form:

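import torch
import torch.nn as nn
from torchvision import transforms

# Preprocessing applied to each image (MNIST example: single-channel tensor)
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
])

# Bundle the transform, loss function, and optimizer name into one file
torch.save(
    {
        'transform': transform,
        'loss_fn': nn.CrossEntropyLoss(),
        'optimizer': "SGD"
    },
    '../dataset/mnist_model_args.pth'
)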

Note that training the model with a transform that includes regularization can give better results with DfauLo.

In addition, you need to provide your model structure as model.py in the models folder (written in PyTorch). Check out our sample files if anything is unclear.

Once everything above is ready, you can run DfauLo as in the following example:

python dfaulo.py --dataset './dataset/CaseStudyData/EMNIST' --model './dataset/CaseStudyData/EMNIST/WaveMix.pth' --model_name 'WaveMix' --class_path './dataset/emnist_classes.json' --image_size '(32,32,3)' --model_args './dataset/emnist_model_args.pth' --image_set 'train' --hook_layer 'conv' --rm_ratio 0.05 --retrain_epoch 10 --retrain_bs 64

Parameter explanation:

--dataset : Path of the dataset on which to perform data fault localization.

--model : Path of the corresponding trained model.

--model_name : Name of your model.

--class_path : Path of the class file.

--image_size : Model input image size in (w,h,d) format.

--model_args : Path of the saved model arguments described above (transform, loss function, optimizer).

--image_set : The split of the dataset on which to run DfauLo, usually 'train' or 'test'.

--hook_layer : The name of the model representation layer from which to extract features (we recommend the second-to-last linear layer).

--rm_ratio : Proportion of data to be removed when performing a mutation.

--retrain_epoch : Number of epochs for model fine-tuning with DfauLo.

--retrain_bs : Batch size for model fine-tuning with DfauLo.
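
To make the expected value types concrete, the flags above could be parsed along the following lines. This is an illustrative sketch using Python's argparse, not the actual code in dfaulo.py: the helper names are hypothetical, no defaults are set because the real defaults are not listed here, and the tuple parsing for --image_size is an assumption based on the '(32,32,3)' string in the example command.

```python
import argparse
import ast


def parse_image_size(text):
    """Turn a string such as '(32,32,3)' into a tuple of three ints."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        raise argparse.ArgumentTypeError("expected a tuple like (32,32,3)")
    if not (isinstance(value, tuple) and len(value) == 3
            and all(isinstance(v, int) for v in value)):
        raise argparse.ArgumentTypeError("expected a tuple like (32,32,3)")
    return value


def build_parser():
    """Build a parser for the flags documented above (no defaults assumed)."""
    parser = argparse.ArgumentParser(description="Illustrative DfauLo-style CLI")
    parser.add_argument("--dataset", type=str)
    parser.add_argument("--model", type=str)
    parser.add_argument("--model_name", type=str)
    parser.add_argument("--class_path", type=str)
    parser.add_argument("--image_size", type=parse_image_size)
    parser.add_argument("--model_args", type=str)
    parser.add_argument("--image_set", type=str)
    parser.add_argument("--hook_layer", type=str)
    parser.add_argument("--rm_ratio", type=float)
    parser.add_argument("--retrain_epoch", type=int)
    parser.add_argument("--retrain_bs", type=int)
    return parser
```

With this sketch, build_parser().parse_args(["--image_size", "(32,32,3)", "--rm_ratio", "0.05"]) yields image_size as the tuple (32, 32, 3) and rm_ratio as the float 0.05.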