Institutsseminar/2022-Oktober-14

From SDQ-Institutsseminar
Session (all sessions)
Date Friday, 14 October 2022
Time 10:30 – 11:00 (duration: 30 min)
Location Room 348 (Building 50.34)
Web conference https://kit-lecture.zoom.us/j/62996772275
Previous session Fri, 23 September 2022
Next session Fri, 14 October 2022

Talks

Speaker Thomas Frank
Title Benchmarking Tabular Data Synthesis Pipelines for Mixed Data
Talk type Bachelor's thesis
Advisor Federico Matteucci
Talk mode in person
Abstract In machine learning, simple, interpretable models require significantly more training data than complex, opaque models to achieve reliable results. This is a problem when gathering data is challenging, expensive, or time-consuming. Data synthesis is a useful approach for mitigating this problem.

An essential aspect of tabular data is its heterogeneous structure: it often comes as "mixed data", i.e., it contains both categorical and numerical attributes. Most machine learning methods require the data to be purely numerical. The usual way to deal with this is to apply a categorical encoding.
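
As an illustration of what a categorical encoding does, the following minimal sketch applies one-hot encoding to a small mixed table. One-hot encoding is only one common choice and is used here as an assumption; the abstract does not say which encoders the thesis evaluates. The function name `one_hot_encode` and the toy rows are hypothetical.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column by one 0/1 indicator column per category."""
    # Collect the sorted set of categories observed in each categorical column.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        out = []
        for col, value in enumerate(row):
            if col in categories:
                # One indicator per known category of this column.
                out.extend(1.0 if value == cat else 0.0 for cat in categories[col])
            else:
                # Numerical attributes are passed through unchanged.
                out.append(float(value))
        encoded.append(out)
    return encoded, categories


# Toy mixed data (hypothetical): one numerical and one categorical attribute.
rows = [
    [5.1, "red"],
    [4.9, "blue"],
    [6.3, "red"],
]
encoded, categories = one_hot_encode(rows, categorical_columns=[1])
```

After encoding, every row is purely numerical, at the cost of one extra column per category.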

In this thesis, we evaluate a proposed tabular data synthesis pipeline consisting of a categorical encoding, followed by data synthesis and an optional relabeling of the synthetic data by a complex model. This synthetic data is then used to train a simple model. The performance of the simple model is used to quantify the quality of the generated data. We surveyed the current state of research in categorical encoding and tabular data synthesis and performed an extensive benchmark on a motivated selection of encoders and generators.
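
The pipeline described above (encode, synthesize, optionally relabel, train a simple model, score it) can be sketched roughly as follows. This is a toy illustration under loud assumptions, not the setup used in the thesis: the generator is a naive Gaussian jitter of real rows, the "complex" model is a 1-nearest-neighbour classifier, the "simple" model is a nearest-centroid classifier, and all function names and data are hypothetical.

```python
import random


def one_hot(rows, col):
    """Encode the categorical column `col` as 0/1 indicators (assumed encoder)."""
    cats = sorted({r[col] for r in rows})
    return [
        [float(v) for i, v in enumerate(r) if i != col]
        + [1.0 if r[col] == c else 0.0 for c in cats]
        for r in rows
    ]


def synthesize(X, y, n, rng, noise=0.1):
    """Toy generator (assumption): jitter randomly chosen real rows."""
    X_syn, y_syn = [], []
    for _ in range(n):
        i = rng.randrange(len(X))
        X_syn.append([v + rng.gauss(0.0, noise) for v in X[i]])
        y_syn.append(y[i])  # the synthetic row inherits the source label
    return X_syn, y_syn


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def knn_predict(X_train, y_train, x):
    """Stand-in for the complex model: 1-nearest-neighbour on the real data."""
    best = min(range(len(X_train)), key=lambda i: sq_dist(X_train[i], x))
    return y_train[best]


def fit_centroids(X, y):
    """Stand-in for the simple model: one mean vector per class."""
    centroids = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def centroid_predict(centroids, x):
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))


def run_pipeline(train_rows, y_train, test_rows, y_test, cat_col, relabel, seed=0):
    rng = random.Random(seed)
    # Encode train and test together so both share the same indicator columns.
    X_all = one_hot(train_rows + test_rows, cat_col)
    X_train, X_test = X_all[: len(train_rows)], X_all[len(train_rows):]
    X_syn, y_syn = synthesize(X_train, y_train, n=50, rng=rng)
    if relabel:
        # Optional step: let the complex model overwrite the synthetic labels.
        y_syn = [knn_predict(X_train, y_train, x) for x in X_syn]
    model = fit_centroids(X_syn, y_syn)  # simple model sees synthetic data only
    hits = sum(centroid_predict(model, x) == t for x, t in zip(X_test, y_test))
    return hits / len(X_test)  # accuracy on real data as the quality score


train_rows = [[1.0, "a"], [1.2, "a"], [0.9, "a"], [5.0, "b"], [5.3, "b"], [4.8, "b"]]
y_train = [0, 0, 0, 1, 1, 1]
test_rows = [[1.1, "a"], [5.1, "b"]]
y_test = [0, 1]

acc_plain = run_pipeline(train_rows, y_train, test_rows, y_test, cat_col=1, relabel=False)
acc_relabeled = run_pipeline(train_rows, y_train, test_rows, y_test, cat_col=1, relabel=True)
```

In this sketch the simple model is trained only on synthetic rows and scored on held-out real rows, so its accuracy acts as a proxy for the quality of the generated data. Comparing `acc_plain` with `acc_relabeled` mirrors the optional relabeling step.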

