The Ministry of Science and Technology (MOST) Excellent Young Scholars program aims to nurture the next generation of outstanding young scholars in Taiwan and to enable promising young researchers to focus on emerging issues in the early stage of their research careers. The research project is titled "On the development of classification-aware feature extraction and adaptive data augmentation in imbalanced data analytics."
Imbalanced data analytics remains a major challenge, even though big data infrastructure has triggered tremendous advances in AI. A large data volume does not guarantee the expected performance, specifically because of data bias. When the sample size of the minority class(es) is tiny compared with that of the majority class(es), the data are imbalanced, which leads to ineffective or biased classifiers and predictors. In particular, when overall classification accuracy is the learning objective, models are easily trained into majority-oriented classifiers. However, the minority samples, i.e., the positive cases, are the real targets because they are usually rare events to be detected and controlled, such as fraudulent credit card transactions and production quality defects. Prior research has conventionally tackled the imbalanced data issue from two perspectives, data preprocessing and cost-sensitive learning algorithms, but has lacked a systematic scheme for resolving it.
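The majority-oriented behavior described above can be illustrated with a minimal sketch. The class counts (990 negatives, 10 positives) and the always-predict-majority classifier are hypothetical choices made only for illustration; they are not data or models from this project.

```python
# Hypothetical toy labels: 990 majority (0) and 10 minority (1) samples.
labels = [0] * 990 + [1] * 10

# A degenerate "majority-oriented" classifier: it always predicts class 0.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Share of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positive samples that were predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Majority-to-minority size ratio, one simple way to quantify imbalance.
imbalance_ratio = labels.count(0) / labels.count(1)
acc = accuracy(labels, predictions)
rec = recall(labels, predictions)

print(f"imbalance ratio={imbalance_ratio:.0f}, accuracy={acc:.2f}, minority recall={rec:.2f}")
```

In this toy setting the classifier reaches 99% overall accuracy while detecting none of the minority cases, which is why accuracy alone is a misleading objective under imbalance.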
In this project, imbalanced data analytics will be organized into four phases: 1) imbalance assessment; 2) adaptive data augmentation; 3) classification-aware feature extraction; and 4) imbalance-accommodated cost-effective learning. The imbalance of the minority class will be characterized before any pre-treatment is applied. Minority samples, although they belong to the same class, will be examined and labeled individually for further processing. Adaptive linear and nonlinear data augmentation techniques will be developed and applied to individual minority samples according to their imbalance characteristics. Once the classes are balanced, classification-aware features will be extracted in a supervised manner by training an encoder-decoder network and a classifier simultaneously. The new features are intended not only to reconstruct the original data but also to be effective for classification. Furthermore, to prevent the black-box feature extraction from losing explainability, the previously developed explainable AI (XAI) framework will be applied to maintain the linkage between the new features and the data domain. Cost-effective ensemble learning algorithms will be enhanced to accommodate the data imbalance. The four phases are configured in a cyclic loop in which each phase is closely linked with two others. The proposed framework is expected to become an integrated paradigm for handling imbalanced data and to provide guidelines that transfer across different problem domains.
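As a rough illustration of how phases 1 and 2 could interact, the sketch below scores each minority sample by the share of majority points among its nearest neighbors and then generates synthetic samples by linear interpolation, favoring the harder samples. This is an assumed, ADASYN-style stand-in and not the project's own technique; the names `difficulty` and `augment`, the parameter `k`, and the toy points are all hypothetical. It assumes at least two minority samples.

```python
import math
import random


def difficulty(sample, minority, majority, k=3):
    """Fraction of majority points among the k nearest neighbors of a minority sample."""
    others = [(p, 1) for p in minority if p is not sample] + [(p, 0) for p in majority]
    nearest = sorted(others, key=lambda item: math.dist(sample, item[0]))[:k]
    return sum(1 for _, label in nearest if label == 0) / k


def augment(minority, majority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points, weighted toward harder samples."""
    rng = random.Random(seed)
    scores = [difficulty(s, minority, majority, k) for s in minority]
    total = sum(scores)
    # Fall back to uniform weights when no minority sample borders the majority.
    if total > 0:
        weights = [s / total for s in scores]
    else:
        weights = [1 / len(minority)] * len(minority)
    synthetic = []
    for _ in range(n_new):
        base = rng.choices(minority, weights=weights, k=1)[0]
        # Interpolate toward one of the base sample's nearest minority neighbors.
        candidates = [p for p in minority if p is not base]
        neighbours = sorted(candidates, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(b + t * (p - b) for b, p in zip(base, partner)))
    return synthetic, scores


# Hypothetical 2-D toy data: four "safe" minority points and one near the majority.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.2), (3.0, 3.0)]
majority = [(3.1, 3.2), (2.9, 3.1), (3.3, 2.8), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

synthetic, scores = augment(minority, majority, n_new=5)
print("difficulty scores:", scores)
print("synthetic samples:", synthetic)
```

In this toy example only the minority point surrounded by majority samples receives a nonzero score, so all synthetic points are interpolated from it. A nonlinear variant would replace the interpolation step, for instance with a generative model, while keeping the same per-sample weighting idea.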