Benchmarking Tabular Data Synthesis Pipelines for Mixed Data

From SDQ-Institutsseminar
Speaker Thomas Frank
Talk type Bachelor's thesis
Advisor Federico Matteucci
Date Fri, 14 October 2022
Talk language
Talk mode in person
Abstract In machine learning, simpler, interpretable models require significantly more training data than complex, opaque models to achieve reliable results. This is a problem when gathering data is challenging, expensive, or time-consuming. Data synthesis is a useful approach for mitigating this problem.

An essential aspect of tabular data is its heterogeneous structure: it often comes as "mixed data", i.e., it contains both categorical and numerical attributes. Most machine learning methods, however, require the data to be purely numerical. The usual way to deal with this is to apply a categorical encoding, which maps categorical values to numbers.
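As a minimal illustration of what a categorical encoding does, the following sketch one-hot encodes a small mixed table. One-hot encoding is only one of many possible encoders and is chosen here for illustration; the abstract does not say which encoders the thesis benchmarks. The function name `one_hot_encode` and the toy columns `age` and `color` are hypothetical.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 indicator column per category."""
    # Collect the sorted set of categories observed in each categorical column.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            if col in categories:
                # One indicator per category: 1.0 for the matching one, else 0.0.
                for cat in categories[col]:
                    new_row[f"{col}={cat}"] = 1.0 if value == cat else 0.0
            else:
                # Numerical attributes are passed through as floats.
                new_row[col] = float(value)
        encoded.append(new_row)
    return encoded


# Toy mixed data (hypothetical): one numerical and one categorical attribute.
rows = [
    {"age": 34, "color": "red"},
    {"age": 51, "color": "blue"},
    {"age": 27, "color": "red"},
]
encoded = one_hot_encode(rows, ["color"])
print(encoded[0])
```

After encoding, every row is purely numerical, so it can be passed to methods that do not accept categorical values.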

In this thesis, we evaluate a proposed tabular data synthesis pipeline that consists of a categorical encoding, followed by data synthesis and an optional relabeling of the synthetic data by a complex model. The synthetic data is then used to train a simple model, and the performance of the simple model is used to quantify the quality of the generated data. We survey the current state of research in categorical encoding and tabular data synthesis and perform an extensive benchmark on a motivated selection of encoders and generators.
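The pipeline steps described above can be sketched roughly as follows, assuming the data has already been encoded to numerical features. All components are toy stand-ins chosen for brevity and are not taken from the thesis: a class-conditional Gaussian sampler plays the generator, a k-nearest-neighbours vote plays the complex model, and a one-feature decision stump plays the simple model. The names `synthesize`, `knn_predict`, `fit_stump`, and `run_pipeline`, as well as the toy data, are hypothetical.

```python
import random
import statistics


def synthesize(X, y, n_per_class, rng):
    """Toy generator: per class, sample each feature from an independent Gaussian."""
    X_syn, y_syn = [], []
    for label in sorted(set(y)):
        cols = list(zip(*[row for row, l in zip(X, y) if l == label]))
        params = [(statistics.mean(c), statistics.pstdev(c)) for c in cols]
        for _ in range(n_per_class):
            X_syn.append([rng.gauss(mu, sd) for mu, sd in params])
            y_syn.append(label)
    return X_syn, y_syn


def knn_predict(X_train, y_train, x, k=3):
    """Stand-in 'complex' model: majority vote among the k nearest neighbours."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(X_train[i], x))
    nearest = sorted(range(len(X_train)), key=dist)[:k]
    votes = [y_train[i] for i in nearest]
    return max(set(votes), key=votes.count)


def fit_stump(X, y):
    """Stand-in 'simple' model: one threshold on one feature (labels 0/1)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for above in (0, 1):  # label predicted when the feature exceeds t
                preds = [above if row[j] > t else 1 - above for row in X]
                acc = sum(p == l for p, l in zip(preds, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, j, t, above)
    _, j, t, above = best
    return lambda row: above if row[j] > t else 1 - above


def run_pipeline(X_train, y_train, X_test, y_test, relabel, rng):
    """Synthesize, optionally relabel, train the simple model, score on real data."""
    X_syn, y_syn = synthesize(X_train, y_train, n_per_class=100, rng=rng)
    if relabel:
        # Optional step: the complex model, fitted on real data, relabels the synthetic rows.
        y_syn = [knn_predict(X_train, y_train, x) for x in X_syn]
    simple_model = fit_stump(X_syn, y_syn)
    correct = sum(simple_model(x) == l for x, l in zip(X_test, y_test))
    return correct / len(y_test)


rng = random.Random(0)


def make_real(n):
    """Hypothetical 'real' data: two numerical features, label depends on the first."""
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [1 if row[0] > 0.5 else 0 for row in X]
    return X, y


X_train, y_train = make_real(60)
X_test, y_test = make_real(60)
scores = {
    relabel: run_pipeline(X_train, y_train, X_test, y_test, relabel, rng)
    for relabel in (False, True)
}
for relabel, accuracy in scores.items():
    print(f"relabel={relabel}: simple-model accuracy on real test data = {accuracy:.2f}")
```

In this sketch, the accuracy of the simple model on held-out real data serves as the quality score for the synthetic data, and comparing the two entries of `scores` shows the effect of the optional relabeling step. A benchmark in the spirit of the abstract would repeat this for each combination of encoder and generator.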