|Fr 25. November 2022
|In the process of machine learning, the data to be analyzed is often not only numerical but also categorical data. Therefore, encoders are developed to convert categorical data into the numerical world. However, different encoders may have other impacts on the performance of the machine learning process. To this end, this thesis is dedicated to understanding the best encoder selection using meta-learning approaches. Meta-learning, also known as learning how to learn, serves as the primary tool for this study. First, by using the concept of meta-learning, we find meta-features that represent the characteristics of these data sets. After that, an iterative machine learning process is performed to find the relationship between these meta-features and the best encoder selection.
In the experiment, we analyzed 50 datasets, those collected from OpenML. We collected their meta-features and performance with different encoders. After that, the decision tree and random forest are chosen as the meta-models to perform meta-learning and find the relationship between meta-features and the performance of the encoder or the best encoder. The output of these steps will be a ruleset that describes the relationship in an interpretable way and can also be generalized to new datasets.