Tree: v4.3.1

English | 简体中文

What is AutoX?

AutoX is an efficient AutoML tool aimed mainly at data mining competitions on tabular data.
Its features include:

SOTA: AutoX outperforms other solutions on many competition datasets (see Evaluation).
Easy to use: The interfaces are designed to resemble sklearn.
Generic & Universal: Supports tabular data for binary classification, multi-class classification, and regression problems.
Auto: Fully automated pipeline with no human intervention.
Out of the box: Provides flexible modules that can also be used on their own.
Summary of magics: Organizes and publishes the magics found in competitions.

Interpretable ML

AutoX covers the following interpretable machine learning methods:

Global interpretation
Local interpretation
Influential interpretation
Prototypes and Criticisms

Table of Contents

What is AutoX?
Table of Contents
Installation
Architecture
Quick Start
Summary of Magics
Evaluation

Installation

1. git clone https://github.com/4paradigm/autox.git
2. cd autox
3. python setup.py install

Architecture

├── autox
│   ├── ensemble
│   ├── feature_engineer
│   ├── feature_selection
│   ├── file_io
│   ├── join_tables
│   ├── metrics
│   ├── models
│   ├── process_data
│   ├── util.py
│   ├── CONST.py
│   └── autox.py
├── run_oneclick.py
├── demo
├── test
├── setup.py
└── README.md

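Quick Start

Full AutoML

from autox import AutoX

# Placeholder: directory containing train.csv and test.csv
data_dir = 'path/to/data'
# target is the label column; id lists the key columns that identify a sample
autox = AutoX(target='loss', train_name='train.csv', test_name='test.csv',
              id=['id'], path=data_dir)
# Run the pipeline and save the predictions for the test set
sub = autox.get_submit()
sub.to_csv("submission.csv", index=False)

Semi-AutoML: see run_demo.ipynb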

Evaluation

index	data_type	data_name	metric	AutoX	AutoGluon	H2O
1	regression	zhidemai	mse	1.1231	1.9466	1.1927
2	regression	Tabular Playground Series - Aug 2021	rmse	7.87731	10.3944	7.8895
3	regression	House Prices	rmse	0.13043	0.13104	0.13161
4	binary classification	Titanic	accuracy	0.77751	0.78229	0.79186

For mse and rmse, lower is better; for accuracy, higher is better.

Data type

cat: Categorical, an unordered categorical variable.
ord: Ordinal, a categorical variable with a meaningful order.
num: Numeric, a numeric variable.
datetime: A time variable in datetime format.
timestamp: A time variable in timestamp format.
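
As a rough illustration of how columns could be mapped to these labels, the sketch below infers a label from the pandas dtype. The function name `infer_column_types` and the rules are assumptions made for this example, not AutoX's own logic. The `ord` and `timestamp` labels are left out because they cannot be told apart from `cat` and `num` by dtype alone.

```python
import pandas as pd

def infer_column_types(df):
    """Map each column to 'datetime', 'num' or 'cat' based on its dtype (heuristic sketch)."""
    types = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_datetime64_any_dtype(s):
            types[col] = 'datetime'
        elif pd.api.types.is_numeric_dtype(s):
            types[col] = 'num'
        else:
            # Strings and other object columns are treated as unordered categories.
            types[col] = 'cat'
    return types

df = pd.DataFrame({
    'price': [1.5, 2.0],
    'city': ['a', 'b'],
    'ts': pd.to_datetime(['2021-01-01', '2021-01-02']),
})
column_types = infer_column_types(df)
```

In practice an ordinal or timestamp column would have to be declared by hand, since that information is not present in the dtype.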

Pipeline

1. Initialize AutoX

1.1 Read the data
1.2 Concatenate train and test
1.3 Identify the type of each column
1.4 Preprocess the data
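
A minimal pandas sketch of steps 1.1 and 1.2, under the assumption that the two tables share their feature columns and only train carries the target. Small in-memory tables stand in for the files here; with real data they would typically come from `pd.read_csv`.

```python
import pandas as pd

# Stand-ins for train.csv and test.csv (illustrative values only).
train = pd.DataFrame({'id': [1, 2, 3], 'x': [0.1, 0.2, 0.3], 'loss': [1.0, 0.0, 2.0]})
test = pd.DataFrame({'id': [4, 5], 'x': [0.4, 0.5]})

# Stack the two tables row-wise; test rows get NaN in the target column.
train_test = pd.concat([train, test], axis=0, ignore_index=True)
```

Working on the stacked table means a feature such as a value count is computed once, over both train and test rows.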

2. Feature engineering

Every feature engineering class provides the following capabilities:
1. automatically select the columns to which the current operation applies
2. review the selected columns
3. modify the selected columns
4. execute the operation and return features with the same number and order of samples as the original table
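
The four capabilities above can be sketched as a small class. `CountFeature` and its method names are hypothetical and chosen for this illustration; they are not taken from the AutoX source.

```python
import pandas as pd

class CountFeature:
    """Hypothetical feature-engineering class following the four steps above."""

    def __init__(self):
        self.columns = []

    def fit(self, df, feature_type):
        # Step 1: auto-select the columns this operation applies to (here: categorical ones).
        self.columns = [c for c in df.columns if feature_type.get(c) == 'cat']
        return self

    def get_columns(self):
        # Step 2: let the user review the selection.
        return list(self.columns)

    def set_columns(self, columns):
        # Step 3: let the user override the selection.
        self.columns = list(columns)

    def transform(self, df):
        # Step 4: one output row per input row, with the same index and order.
        out = pd.DataFrame(index=df.index)
        for c in self.columns:
            out[f'COUNT_{c}'] = df[c].map(df[c].value_counts())
        return out

df = pd.DataFrame({'city': ['a', 'b', 'a', 'c'],
                   'shop': ['x', 'x', 'y', 'x'],
                   'price': [1.0, 2.0, 3.0, 4.0]})
fe = CountFeature().fit(df, {'city': 'cat', 'shop': 'cat', 'price': 'num'})
selected = fe.get_columns()   # both categorical columns were selected
fe.set_columns(['city'])      # keep only one of them
fe_count = fe.transform(df)
```

Separating selection from execution lets a user inspect and correct the automatic choice before any feature is computed.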

3. Feature combination

Combine the raw and derived features into a single wide table.

4. train_test_split

Split the wide table into train and test.
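
Steps 3 and 4 can be sketched with pandas as follows. The sketch assumes that the derived table keeps the index and row order of the raw table, and that the training rows come first in the combined table; the names `fe_count` and `fe_all` are used for illustration only.

```python
import pandas as pd

# Combined train + test table; the last two rows are test rows without a label.
train_test = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                           'city': ['a', 'b', 'a', 'c', 'a'],
                           'loss': [1.0, 0.0, 2.0, None, None]})
n_train = 3  # number of training rows, recorded before the tables were concatenated

# A derived table with the same index and row order as the raw table.
fe_count = pd.DataFrame({'COUNT_city': train_test['city'].map(train_test['city'].value_counts())})

# Step 3: put raw and derived columns side by side in one wide table.
fe_all = pd.concat([train_test, fe_count], axis=1)

# Step 4: the first n_train rows are train, the remaining rows are test.
train = fe_all.iloc[:n_train]
test = fe_all.iloc[n_train:].drop(columns=['loss'])
```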

5. Feature filter

Filter the features by comparing their distributions in train and test.
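
One simple way to compare distributions is to check how much of the test data falls inside the value range seen in train. The sketch below uses that range check as a stand-in; the function `filter_features`, the `min_coverage` threshold, and the criterion itself are assumptions for illustration and may differ from what AutoX does.

```python
import pandas as pd

def filter_features(train, test, features, min_coverage=0.9):
    """Keep numeric features whose test values mostly fall inside the train range."""
    kept = []
    for col in features:
        lo, hi = train[col].min(), train[col].max()
        # Share of test values inside [lo, hi], both ends included.
        coverage = test[col].between(lo, hi).mean()
        if coverage >= min_coverage:
            kept.append(col)
    return kept

train = pd.DataFrame({'a': [1, 2, 3, 4], 'day': [1, 2, 3, 4]})
test = pd.DataFrame({'a': [2, 3, 2, 4], 'day': [5, 6, 7, 8]})
kept = filter_features(train, test, ['a', 'day'])
```

Here `day` is dropped because every test value lies outside the range observed in train, so a model could not have learned anything useful about those values.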

6. Model training

The models take the filtered features as input.
Every model class provides the following capabilities:
1. get the default parameters
2. train the model
3. tune the parameters
4. get the feature importances
5. predict
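
To make the five capabilities concrete without depending on a gradient boosting library, the sketch below wraps a closed-form ridge regression written in NumPy. `RidgeModel` and its method names are hypothetical stand-ins, not AutoX classes, and the model has no intercept to keep the code short.

```python
import numpy as np

class RidgeModel:
    """Hypothetical stand-in model exposing the five capabilities listed above."""

    def __init__(self, params=None):
        self.params = params or self.get_default_params()
        self.coef_ = None

    @staticmethod
    def get_default_params():
        # 1. default parameters
        return {'alpha': 1.0}

    def fit(self, X, y):
        # 2. training: closed-form ridge solution (X^T X + alpha I)^-1 X^T y
        A = X.T @ X + self.params['alpha'] * np.eye(X.shape[1])
        self.coef_ = np.linalg.solve(A, X.T @ y)
        return self

    def tune(self, X_train, y_train, X_valid, y_valid, alphas=(0.01, 0.1, 1.0, 10.0)):
        # 3. tuning: pick the alpha with the lowest validation MSE, then refit with it
        best_alpha, best_mse = None, np.inf
        for alpha in alphas:
            self.params = {'alpha': alpha}
            mse = np.mean((self.fit(X_train, y_train).predict(X_valid) - y_valid) ** 2)
            if mse < best_mse:
                best_alpha, best_mse = alpha, mse
        self.params = {'alpha': best_alpha}
        return self.fit(X_train, y_train)

    def feature_importances(self):
        # 4. importance: absolute coefficient size (meaningful when features share a scale)
        return np.abs(self.coef_)

    def predict(self, X):
        # 5. prediction
        return X @ self.coef_

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # only the first feature matters
model = RidgeModel().tune(X[:150], y[:150], X[150:], y[150:])
pred = model.predict(X[150:])
importance = model.feature_importances()
```

Keeping the same five methods on every model class is what would let a pipeline swap one model for another without changing the surrounding code.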

7. Prediction

AutoX

Attributes

info_: Information about the data set.

info_['id']: List, the key columns that uniquely identify a sample.
info_['target']: String, the label column.
info_['shape_of_train']: Int, the number of samples in the train set.
info_['shape_of_test']: Int, the number of samples in the test set.
info_['feature_type']: Dict of Dict, the data type of each feature.
info_['train_name']: String, the name of the main train table.
info_['test_name']: String, the name of the main test table.
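
For orientation, a dictionary with the keys listed above might look like the sketch below. All values are illustrative, and the nesting of `feature_type` (table name, then column name, then type label) is an assumption based on the "Dict of Dict" description.

```python
import pandas as pd

# Illustrative tables standing in for train.csv and test.csv.
train = pd.DataFrame({'id': [1, 2, 3], 'city': ['a', 'b', 'a'], 'loss': [1.0, 0.0, 2.0]})
test = pd.DataFrame({'id': [4, 5], 'city': ['c', 'a']})

info_ = {
    'id': ['id'],
    'target': 'loss',
    'shape_of_train': len(train),
    'shape_of_test': len(test),
    # Assumed layout: table name -> column name -> type label.
    'feature_type': {
        'train.csv': {'id': 'num', 'city': 'cat', 'loss': 'num'},
        'test.csv': {'id': 'num', 'city': 'cat'},
    },
    'train_name': 'train.csv',
    'test_name': 'test.csv',
}
```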

dfs_: Contains all DataFrames, including raw tables and derived tables.

dfs_['train_test']: The train data and test data combined into one table.
dfs_['FE_feature_name']: Tables derived by feature engineering, such as FE_count and FE_groupby.
dfs_['FE_all']: The merged table containing the raw tables and the derived tables.

Methods

concat_train_test: concatenate the train and test data.
split_train_test: split the combined data back into train and test.
get_submit: get the submission.

Details of operations in the pipeline:

Data IO

Data Pre-process

- extract year, month, day, hour, and weekday information from time columns
- delete constant features (nunique equal to 1)
- delete samples whose label is NaN
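
The three operations above can be sketched with pandas as follows. The function `preprocess` and its arguments are hypothetical, and the sketch is applied to a training table only, since test rows have no label by construction.

```python
import pandas as pd

def preprocess(df, target, time_cols):
    df = df.copy()
    # Expand each time column into year, month, day, hour, and weekday.
    for col in time_cols:
        t = pd.to_datetime(df[col])
        for part in ('year', 'month', 'day', 'hour', 'weekday'):
            df[f'{col}_{part}'] = getattr(t.dt, part)
    # Drop features that take a single value and so carry no information.
    constant = [c for c in df.columns if c != target and df[c].nunique() == 1]
    df = df.drop(columns=constant)
    # Drop samples whose label is missing.
    return df.dropna(subset=[target]).reset_index(drop=True)

raw = pd.DataFrame({
    'time': ['2020-12-31 23:00:00', '2021-08-02 11:30:00', '2021-08-03 12:00:00'],
    'const': [7, 7, 7],
    'loss': [1.0, None, 2.0],
})
clean = preprocess(raw, target='loss', time_cols=['time'])
```

Note that a derived time part that turns out to be constant, such as the year when all rows fall in one year, would be removed by the second step as well.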

Feature Engineer

count feature

target encoding feature
shift feature
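
Target encoding and shift features can be sketched as below (a count feature is simply the number of rows sharing a value). The function names, the out-of-fold scheme, and the column names are assumptions for this illustration and may not match the AutoX implementation.

```python
import numpy as np
import pandas as pd

def target_encode_oof(df, col, target, n_splits=2, seed=0):
    """Out-of-fold mean target encoding: each row is encoded with target means
    computed on the other folds, so a row never sees its own label."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(df)) % n_splits
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for k in range(n_splits):
        in_fold = fold == k
        means = df.loc[~in_fold].groupby(col)[target].mean()
        encoded.loc[in_fold] = df.loc[in_fold, col].map(means).to_numpy()
    # Categories not seen in the other folds fall back to the global mean.
    return encoded.fillna(df[target].mean())

def shift_feature(df, group_col, time_col, value_col, periods=1):
    """Previous value of value_col within each group, ordered by time_col."""
    ordered = df.sort_values(time_col)
    shifted = ordered.groupby(group_col)[value_col].shift(periods)
    return shifted.reindex(df.index)  # restore the original row order

df = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u1', 'u2'],
    't': [1, 2, 1, 3, 2],
    'amount': [10.0, 20.0, 5.0, 30.0, 15.0],
    'y': [1.0, 0.0, 1.0, 1.0, 0.0],
})
df['TE_user'] = target_encode_oof(df, 'user', 'y')
df['SHIFT_amount'] = shift_feature(df, 'user', 't', 'amount')
```

The out-of-fold scheme is a common guard against label leakage: encoding a row with a mean that includes its own label would make the feature look more predictive than it is.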

Model Fitting

AutoX supports the following models:
1. LightGBM
2. XGBoost
3. TabNet

Ensemble

AutoX supports two ensemble methods (Bagging is used by default):
1. Stacking
2. Bagging
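
The difference between the two methods can be sketched in NumPy with a least-squares learner standing in for the real models. The synthetic data, the helper `fit_linear`, and the choice of feature subsets are all assumptions made for this example; AutoX's own ensemble code may be organized differently.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=300)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

def fit_linear(X, y):
    # Least-squares base learner; returns the coefficient vector.
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Bagging: train the same learner on bootstrap resamples and average the predictions.
bag_preds = []
for _ in range(10):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # rows sampled with replacement
    bag_preds.append(X_test @ fit_linear(X_train[idx], y_train[idx]))
bagging_pred = np.mean(bag_preds, axis=0)

# Stacking: out-of-fold predictions of several base learners become
# the input features of a second-level learner.
feature_sets = [[0, 1], [2, 3], [0, 2]]  # each base learner sees a different feature subset
n_folds = 5
folds = np.arange(len(X_train)) % n_folds
oof = np.zeros((len(X_train), len(feature_sets)))
test_meta = np.zeros((len(X_test), len(feature_sets)))
for j, cols in enumerate(feature_sets):
    for k in range(n_folds):
        tr, va = folds != k, folds == k
        coef = fit_linear(X_train[tr][:, cols], y_train[tr])
        oof[va, j] = X_train[va][:, cols] @ coef
    # Refit on all training rows to produce the test-time input of the meta learner.
    test_meta[:, j] = X_test[:, cols] @ fit_linear(X_train[:, cols], y_train)
meta_coef = fit_linear(oof, y_train)
stacking_pred = test_meta @ meta_coef
```

Bagging only averages, so it needs no extra training data; stacking learns how to weight the base models, which is why it relies on out-of-fold predictions to avoid fitting the meta learner on leaked labels.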

Summary of Magics

competition	magics
kaggle criteo
zhidemai

Debug

Log	Solution
