Institutsseminar/2023-02-16

From SDQ-Institutsseminar
Session (All sessions)
Date Thursday, 16 February 2023
Time 10:00 – 10:20 (duration: 20 min)
Location
Web conference https://kit-lecture.zoom.us/j/67744231815
Previous session Fri, 27 January 2023
Next session Fri, 3 March 2023


Talks

Speaker Christoph Batke
Title Improving Document Information Extraction with efficient Pre-Training
Talk type Proposal
Advisor Edouard Fouché
Talk mode online
Abstract SAP Document Information Extraction (DOX) is a service, based on the well-known Transformer architecture, that extracts logical entities from scanned documents. The entities comprise header information, such as the document date or sender name, and line items from tables in the document, with fields such as line item quantity. The model currently has to be trained on a huge number of labeled documents, which is impractical. This also hinders deploying the model at large scale, as it cannot easily adapt to new languages or document types. Recently, pretraining large language models with self-supervised learning techniques has shown good results as a preliminary step and allows reducing the number of labels required in follow-up steps. However, to generalize self-supervised learning to document understanding, we need to take into account different modalities: text, layout, and image information of documents. How to do this efficiently and effectively is not yet clear. The goal of this thesis is to develop a technique for self-supervised pretraining within SAP DOX. We will evaluate our method and design decisions on SAP data as well as public data sets. Besides the accuracy of the extracted entities, we will measure to what extent our method lets us lower label requirements.

Notes