This repository is the official implementation of the tool DfauLo.
DfauLo is a dynamic data fault localization tool for deep neural networks (DNNs) that locates mislabeled and noisy data in deep learning datasets. Inspired by conventional mutation-based code fault localization, DfauLo generates multiple mutants of the original trained DNN model and maps the extracted features into a suspiciousness score indicating the probability that a given data item is a data fault. DfauLo is the first dynamic data fault localization technique: it prioritizes suspected data based on user feedback and generalizes to data faults unseen during training.
pip install -r requirements.txt
We provide a default setting that runs DfauLo on the EMNIST dataset mentioned in the paper. You can run it directly with:
python dfaulo.py
The results of dfaulo-Round0 (without manual iteration) are stored in ./dataset/CaseStudyData/EMNIST/results/WaveMix/noManual_results_list.json
The higher a data item is ranked in this list, the more likely it is to be faulty.
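To inspect the ranked output, the results file can be loaded and its first entries examined. The sketch below assumes the file is a JSON array already ordered from most to least suspicious. The structure of each entry is not documented here, so entries are passed through unchanged. `top_suspicious` is a hypothetical helper, not part of DfauLo.

```python
import json
from pathlib import Path


def top_suspicious(results_path, k=10):
    """Return the first k entries of a ranked results list.

    Assumption: the file holds a JSON array sorted from most to least
    suspicious. Entry contents are returned as-is.
    """
    with Path(results_path).open(encoding="utf-8") as f:
        ranked = json.load(f)
    if not isinstance(ranked, list):
        raise ValueError("expected a JSON array of ranked entries")
    return ranked[:k]
```

For example, `top_suspicious("./dataset/CaseStudyData/EMNIST/results/WaveMix/noManual_results_list.json", k=20)` would return the 20 highest-ranked entries for manual review, provided the assumption about the file layout holds.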
You can run DfauLo on any custom dataset by changing the settings in dfaulo.py.
Before that, prepare your classification dataset and classes file in the dataset folder in the following format. The root directory is named after the dataset, each subdirectory is a data split such as the training set, and each split directory contains one directory per class, named after the class and holding that class's images:
MNIST  # dataset name
|-- train  # trainset
    |-- dog  # class name
        |-- XX.png  # corresponding images
        |-- XX.png
        |-- ...
    |-- cat
        |-- XX.png
        |-- XX.png
        |-- ...
    |-- ...
|-- classes.json  # class name and corresponding index
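Before running the tool, it can help to confirm that a dataset follows this layout. The sketch below walks one split directory and counts the image files in each class directory. `count_images_per_class` and the `IMAGE_SUFFIXES` set are hypothetical and are not part of DfauLo; the list of accepted extensions is an assumption.

```python
from pathlib import Path

# Assumption: file extensions treated as images; adjust to your data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def count_images_per_class(dataset_root, image_set="train"):
    """Map each class directory name to the number of image files in it."""
    set_dir = Path(dataset_root) / image_set
    if not set_dir.is_dir():
        raise FileNotFoundError(f"missing set directory: {set_dir}")
    counts = {}
    for class_dir in sorted(p for p in set_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

A class that reports zero images, or a missing split directory, usually points to a layout mistake worth fixing before localization is attempted.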
The classes.json file maps each class name to the index that was used for it when training the model. It must be written in the following form:
{
"dog": 0,
"cat": 1,
"person": 2,
...
}
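If the model was trained with class indices assigned in alphabetical order of the directory names (the convention used by torchvision's `ImageFolder`), the mapping can be generated rather than typed by hand. The sketch below does this under that assumption. The helper names are hypothetical, and the result is only valid when it matches the mapping the model was actually trained with; the example above, where "dog" is 0 and "cat" is 1, shows a mapping that is not alphabetical and would have to be written manually.

```python
import json
from pathlib import Path


def build_class_index(train_dir):
    """Assign indices 0..n-1 to class directory names in sorted order.

    Assumption: the model was trained with this same alphabetical
    ordering; otherwise write classes.json by hand.
    """
    names = sorted(p.name for p in Path(train_dir).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(names)}


def write_classes_json(train_dir, out_path):
    """Write the class-to-index mapping as JSON and return it."""
    mapping = build_class_index(train_dir)
    with Path(out_path).open("w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=4)
    return mapping
```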
Next, prepare a model trained on the above dataset, and place the model's transform and loss function in the models folder. We recommend saving these settings in the following form:
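import torch
import torch.nn as nn
from torchvision import transforms

# Preprocessing applied to every input image (MNIST example)
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
])

# Bundle the transform, loss function, and optimizer name in one file
torch.save(
    {
        'transform': transform,
        'loss_fn': nn.CrossEntropyLoss(),
        'optimizer': "SGD"
    },
    '../dataset/mnist_model_args.pth'
)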
Note that training the model with a transform that includes regularization can give better results with DfauLo.
In addition, you need to provide your model structure as model.py in the models folder (written in PyTorch). Check our sample files if anything is unclear.
Once everything above is ready, you can run DfauLo as in the following example:
python dfaulo.py --dataset './dataset/CaseStudyData/EMNIST' --model './dataset/CaseStudyData/EMNIST/WaveMix.pth' --model_name 'WaveMix' --class_path './dataset/emnist_classes.json' --image_size '(32,32,3)' --model_args './dataset/emnist_model_args.pth' --image_set 'train' --hook_layer 'conv' --rm_ratio 0.05 --retrain_epoch 10 --retrain_bs 64
Parameter explanation:
--dataset: Path of the dataset requiring data fault localization.
--model: Path of the corresponding trained model.
--model_name: Name of your model.
--class_path: Path of the class file.
--image_size: Model input image size in (w,h,d) format.
--model_args: Path of the file holding the other model settings described above (transform, loss function, optimizer).
--image_set: The split of the dataset on which to run DfauLo, usually 'train' or 'test'.
--hook_layer: Name of the model's representation layer from which features are extracted (we recommend the second-to-last linear layer).
--rm_ratio: Proportion of data to be removed when performing a mutation.
--retrain_epoch: Number of epochs for model fine-tuning in DfauLo.
--retrain_bs: Batch size for model fine-tuning in DfauLo.
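The flags above could be handled by a parser along the following lines. This is an illustrative sketch built from the documented flag names only: it is not the parser in dfaulo.py, the real defaults are not reproduced (every flag defaults to `None` here), and `parse_image_size` and `build_parser` are hypothetical names. It mainly shows one way to turn a string such as '(32,32,3)' into a tuple safely.

```python
import argparse
import ast


def parse_image_size(text):
    """Parse a string such as '(32,32,3)' into a tuple of three ints."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        raise argparse.ArgumentTypeError(f"expected '(w,h,d)', got {text!r}")
    if (not isinstance(value, tuple) or len(value) != 3
            or not all(isinstance(v, int) for v in value)):
        raise argparse.ArgumentTypeError(f"expected '(w,h,d)', got {text!r}")
    return value


def build_parser():
    """Hypothetical parser mirroring the documented flags (defaults unknown)."""
    p = argparse.ArgumentParser(description="Illustrative DfauLo-style CLI")
    for flag in ("--dataset", "--model", "--model_name", "--class_path",
                 "--model_args", "--image_set", "--hook_layer"):
        p.add_argument(flag, type=str, default=None)
    p.add_argument("--image_size", type=parse_image_size, default=None)
    p.add_argument("--rm_ratio", type=float, default=None)
    p.add_argument("--retrain_epoch", type=int, default=None)
    p.add_argument("--retrain_bs", type=int, default=None)
    return p
```

Using `ast.literal_eval` rather than `eval` means the size string is only ever interpreted as a Python literal, so a malformed or malicious value is rejected instead of executed.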
Some dfaulo.py results on the MNIST benchmark are shown below:
[Figure: five example MNIST images with labels 4, 7, 6, 7, and 1]