What is AutoX?
AutoX is an efficient AutoML tool aimed mainly at data mining competitions on tabular data.
Its features include:
- SOTA: AutoX outperforms other solutions on many competition datasets (see Evaluation).
- Easy to use: The interface design is similar to sklearn.
- Generic & Universal: Supports tabular data for binary classification, multi-class classification, and regression problems.
- Auto: Fully automated pipeline without human intervention.
- Out of the box: Provides flexible modules that can be used on their own.
- Summary of magics: Collects and publishes magics from competitions.
Interpretable ML
AutoX covers the following interpretable machine learning methods:
Global interpretation
Local interpretation
Influential interpretation
Prototypes and Criticisms
Installation
1. git clone https://github.com/4paradigm/autox.git
2. cd autox
3. python setup.py install
Architecture
├── autox
│ ├── ensemble
│ ├── feature_engineer
│ ├── feature_selection
│ ├── file_io
│ ├── join_tables
│ ├── metrics
│ ├── models
│ ├── process_data
│ ├── util.py
│ ├── CONST.py
│ └── autox.py
├── run_oneclick.py
├── demo
├── test
├── setup.py
└── README.md
Quick Start
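from autox import AutoX
# Placeholder: set this to the directory that contains train.csv and test.csv.
data_dir = "./data"
path = data_dir
# 'loss' is the label column and 'id' the key column of the example dataset.
autox = AutoX(target='loss', train_name='train.csv', test_name='test.csv',
              id=['id'], path=path)
# Run the full pipeline and write the test-set predictions to disk.
sub = autox.get_submit()
sub.to_csv("submission.csv", index=False)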
- Semi-Automl: run_demo.ipynb
Evaluation
Data type
- cat: Categorical, Categorical variable without order.
- ord: Ordinal, Categorical variable with order.
- num: Numeric, Numeric variable.
- datetime: Time variable with Datetime format.
- timestamp: Time variable with Timestamp format.
Pipeline
1.1 Read data
1.2 Concatenate train and test
1.3 Identify the column types in the data
1.4 Preprocess the data
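Step 1.3 could look roughly like the sketch below. It is a minimal illustration with pandas, not AutoX's implementation: the helper name `infer_feature_types` is hypothetical, and it only separates `datetime`, `num`, and `cat` by dtype. Telling `ord` from `cat`, or recognising integer `timestamp` columns, would need extra rules or user hints.

```python
import pandas as pd

def infer_feature_types(df):
    """Map each column to 'datetime', 'num', or 'cat' based on its dtype (hypothetical helper)."""
    types = {}
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            types[col] = "datetime"
        elif pd.api.types.is_numeric_dtype(df[col]):
            types[col] = "num"
        else:
            # Everything else (strings, objects) is treated as categorical.
            types[col] = "cat"
    return types

df = pd.DataFrame({
    "id": [1, 2, 3],
    "city": ["a", "b", "a"],
    "price": [1.5, 2.0, 3.5],
    "ts": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
})
feature_types = infer_feature_types(df)
```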
Every feature engineering class provides the following functions:
1. automatically select the columns to which the current operation will be applied
2. review the selected columns
3. modify the selected columns
4. execute the operation and return features whose number and order of samples are consistent with the original table.
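The four steps above can be sketched as a small class. This is an assumption-laden illustration rather than an AutoX class: the name `CountFeatureSketch` and its method names are hypothetical, and a simple count encoding stands in for the operation.

```python
import pandas as pd

class CountFeatureSketch:
    """Hypothetical feature-engineering class following the four steps above."""

    def __init__(self):
        self.columns = []

    def fit(self, df, feature_types):
        # Step 1: auto-select the columns this operation applies to (here: categorical ones).
        self.columns = [c for c in df.columns if feature_types.get(c) == "cat"]
        return self

    def get_columns(self):
        # Step 2: let the user review the selection.
        return list(self.columns)

    def set_columns(self, columns):
        # Step 3: let the user modify the selection.
        self.columns = list(columns)

    def transform(self, df):
        # Step 4: return one output row per input row, in the same order.
        out = pd.DataFrame(index=df.index)
        for col in self.columns:
            out[f"COUNT_{col}"] = df[col].map(df[col].value_counts())
        return out

df = pd.DataFrame({"city": ["a", "b", "a"], "price": [1.0, 2.0, 3.0]})
fe = CountFeatureSketch().fit(df, {"city": "cat", "price": "num"})
features = fe.transform(df)
```

Because `transform` builds its output on `df.index`, the derived table can be joined back to the raw table without reordering.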
Combine the raw features and derived features, and return a wide table.
Split the wide table into train and test.
Filter the features according to their distributions in the train and test sets.
The filtered features are the inputs of the models.
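The README does not say which criterion is used to compare the train and test distributions. As one possible reading, the sketch below keeps a numeric feature only if its two-sample Kolmogorov-Smirnov statistic stays below a cutoff. The function names and the `threshold=0.5` default are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def filter_shifted_features(train, test, features, threshold=0.5):
    """Keep features whose train/test KS statistic is at most `threshold` (assumed cutoff)."""
    kept = []
    for col in features:
        stat = ks_statistic(train[col].dropna().to_numpy(), test[col].dropna().to_numpy())
        if stat <= threshold:
            kept.append(col)
    return kept

rng = np.random.default_rng(0)
train = pd.DataFrame({"stable": rng.normal(0, 1, 500), "shifted": rng.normal(0, 1, 500)})
test = pd.DataFrame({"stable": rng.normal(0, 1, 500), "shifted": rng.normal(5, 1, 500)})
kept = filter_shifted_features(train, test, ["stable", "shifted"])
```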
Each model class provides the following functions:
1. get the default parameters
2. model training
3. parameter tuning
4. get the feature importances
5. prediction
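To make the five functions concrete, here is a minimal sketch of such a wrapper. It is not one of AutoX's model classes: `RidgeModelSketch` and its method names are hypothetical, and a closed-form ridge regression stands in for the real models so that the example needs only NumPy.

```python
import numpy as np

class RidgeModelSketch:
    """Hypothetical model wrapper exposing the five functions listed above."""

    def __init__(self, params=None):
        self.params = params or self.get_default_params()
        self.coef_ = None

    @staticmethod
    def get_default_params():
        # 1. default parameters
        return {"alpha": 1.0}

    def fit(self, X, y):
        # 2. training: solve (X^T X + alpha * I) w = X^T y
        gram = X.T @ X + self.params["alpha"] * np.eye(X.shape[1])
        self.coef_ = np.linalg.solve(gram, X.T @ y)
        return self

    def tune(self, X_train, y_train, X_valid, y_valid, alphas=(0.01, 0.1, 1.0, 10.0)):
        # 3. parameter tuning: keep the alpha with the lowest validation MSE
        best_alpha, best_mse = None, np.inf
        for alpha in alphas:
            self.params = {"alpha": alpha}
            pred = self.fit(X_train, y_train).predict(X_valid)
            mse = float(np.mean((pred - y_valid) ** 2))
            if mse < best_mse:
                best_alpha, best_mse = alpha, mse
        self.params = {"alpha": best_alpha}
        return self.fit(X_train, y_train)

    def feature_importances(self, feature_names):
        # 4. feature importance: absolute coefficient size (sensible for scaled inputs only)
        return dict(zip(feature_names, np.abs(self.coef_)))

    def predict(self, X):
        # 5. prediction
        return X @ self.coef_

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # only the first feature matters
model = RidgeModelSketch().tune(X[:150], y[:150], X[150:], y[150:])
importances = model.feature_importances(["f0", "f1", "f2"])
```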
AutoX
Attributes
info_: Information about the data set.
- info_['id']: List, unique keys to identify the sample.
- info_['target']: String, label column.
- info_['shape_of_train']: Int, the number of samples in the train set.
- info_['shape_of_test']: Int, the number of samples in the test set.
- info_['feature_type']: Dict of Dict, data type of the features.
- info_['train_name']: String, the name of the main train table.
- info_['test_name']: String, the name of the main test table.
dfs_: dfs_ contains all DataFrames, including raw tables and derived tables.
- dfs_['train_test']: The combined data of train data and test data.
- dfs_['FE_feature_name']: Tables derived by feature engineering, such as FE_count and FE_groupby.
- dfs_['FE_all']: The merged table which contains raw tables and derived tables.
Methods
- concat_train_test: concatenate the train and test data.
- split_train_test: split the combined data back into train and test.
- get_submit: run the pipeline and return the submission.
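The idea behind concatenating and later splitting the two tables can be sketched with plain pandas. The functions below are standalone illustrations, not the AutoX methods themselves, and the helper column name `istrain` is an assumption.

```python
import pandas as pd

def concat_sketch(train, test):
    """Stack train and test; a helper column remembers where each row came from."""
    train = train.assign(istrain=True)
    test = test.assign(istrain=False)
    return pd.concat([train, test], axis=0, ignore_index=True)

def split_sketch(df):
    """Undo concat_sketch using the helper column."""
    train = df[df["istrain"]].drop(columns="istrain").reset_index(drop=True)
    test = df[~df["istrain"]].drop(columns="istrain").reset_index(drop=True)
    return train, test

train = pd.DataFrame({"id": [1, 2], "x": [0.1, 0.2], "loss": [1.0, 0.0]})
test = pd.DataFrame({"id": [3], "x": [0.3]})
train_test = concat_sketch(train, test)
train_part, test_part = split_sketch(train_test)
```

Working on the combined table lets a feature such as a count encoding see train and test rows together. The label column is simply NaN for the test rows.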
Details of operations in the pipeline:
Data IO
Data Pre-process
- extract year, month, day, hour, and weekday information from time columns
- delete invalid features (nunique equal to 1)
- delete invalid samples (label is NaN)
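The three preprocessing steps above might be written as follows. This is a sketch on a training table under stated assumptions: `preprocess_sketch` is a hypothetical helper, and the derived column names (`<col>_year` and so on) are chosen here for illustration.

```python
import numpy as np
import pandas as pd

def preprocess_sketch(df, time_cols, target):
    """Apply the three preprocessing steps listed above (hypothetical helper)."""
    df = df.copy()
    # Extract calendar parts from each time column.
    for col in time_cols:
        ts = pd.to_datetime(df[col])
        df[f"{col}_year"] = ts.dt.year
        df[f"{col}_month"] = ts.dt.month
        df[f"{col}_day"] = ts.dt.day
        df[f"{col}_hour"] = ts.dt.hour
        df[f"{col}_weekday"] = ts.dt.weekday
    # Drop features that have a single unique value.
    constant = [c for c in df.columns if c != target and df[c].nunique() == 1]
    df = df.drop(columns=constant)
    # Drop samples whose label is NaN.
    return df[df[target].notna()].reset_index(drop=True)

train = pd.DataFrame({
    "ts": ["2021-01-01 10:00:00", "2021-02-03 12:30:00", "2021-03-05 08:15:00"],
    "const": [7, 7, 7],
    "loss": [1.0, np.nan, 0.0],
})
clean = preprocess_sketch(train, time_cols=["ts"], target="loss")
```

In this toy input every timestamp falls in 2021, so the derived `ts_year` column is itself constant and is removed together with `const`.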
Feature Engineer
- target encoding feature
- shift feature
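The two feature types named above can be illustrated with pandas. This is a sketch of the general techniques, not AutoX's code: the function names, the out-of-fold scheme, and the column names in the example are assumptions.

```python
import numpy as np
import pandas as pd

def target_encode_oof(df, col, target, n_splits=2, seed=0):
    """Out-of-fold mean target encoding: each row is encoded with label means
    computed on the other folds, so its own label does not leak into the feature."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(df)) % n_splits
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for k in range(n_splits):
        in_fold = fold == k
        means = df.loc[~in_fold].groupby(col)[target].mean()
        encoded.loc[in_fold] = df.loc[in_fold, col].map(means).to_numpy()
    # Categories unseen in the other folds fall back to the global mean.
    return encoded.fillna(df[target].mean())

def shift_feature(df, key, time_col, value_col, periods=1):
    """Previous value of `value_col` within each `key` group, ordered by `time_col`."""
    ordered = df.sort_values(time_col)
    shifted = ordered.groupby(key)[value_col].shift(periods)
    return shifted.reindex(df.index)  # restore the original row order

df = pd.DataFrame({
    "user": ["a", "a", "b", "a", "b"],
    "t": [1, 2, 1, 3, 2],
    "amount": [10.0, 20.0, 5.0, 30.0, 15.0],
    "label": [1, 0, 1, 1, 0],
})
df["TE_user"] = target_encode_oof(df, "user", "label")
df["SHIFT_amount"] = shift_feature(df, "user", "t", "amount")
```

The first row of each user has no earlier record, so its shift feature is NaN.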
Model Fitting
AutoX supports the following models:
1. LightGBM
2. XGBoost
3. TabNet
Ensemble
AutoX supports two ensemble methods (Bagging is used by default):
1. Stacking
2. Bagging
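The README does not spell out how the two ensemble methods are implemented. The sketch below uses their textbook definitions as an assumption: bagging averages models trained on bootstrap resamples, and stacking fits a linear meta-model on base-model predictions. All function names are hypothetical, an ordinary least-squares fit stands in for the base learners, and the out-of-fold predictions are simulated.

```python
import numpy as np

def least_squares_fit_predict(X_train, y_train, X_test):
    # Stand-in base learner: ordinary least squares without an intercept.
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_test @ coef

def bagging_predict(fit_predict, X_train, y_train, X_test, n_models=5, seed=0):
    """Average the test predictions of models trained on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))
        preds.append(fit_predict(X_train[idx], y_train[idx], X_test))
    return np.mean(preds, axis=0)

def stacking_weights(oof_preds, y):
    """Fit a linear meta-model on out-of-fold base predictions (one column per base model)."""
    weights, *_ = np.linalg.lstsq(oof_preds, y, rcond=None)
    return weights

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 2))
y = X @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=120)
bagged = bagging_predict(least_squares_fit_predict, X[:100], y[:100], X[100:])

# Simulated out-of-fold predictions from an accurate and a noisy base model.
oof = np.column_stack([y + 0.1 * rng.normal(size=120), y + 0.5 * rng.normal(size=120)])
weights = stacking_weights(oof, y)
stacked = oof @ weights
```

In this setup the meta-model gives the less noisy base model the larger weight, which is the behaviour stacking is meant to provide.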
Summary of Magics
competition | magics
kaggle criteo |
zhidemai |
Debug