Improving SAP Document Information Extraction via Pretraining and Fine-Tuning

From SDQ-Institutsseminar
Speaker Christoph Batke
Talk type Master's thesis
Advisor Edouard Fouché
Date Fri, 18 August 2023
Presentation mode in person
Abstract Techniques for extracting relevant information from documents have made significant progress in recent years, and document information extraction has become a key task in digital transformation. With deep neural networks, it has become possible to process documents without specifying hard-coded extraction rules or templates for each layout. However, such models typically have a very large number of parameters. As a result, they require many annotated samples and long training times. One solution is to create a base pretrained model using self-supervised objectives and then fine-tune it on a smaller, document-specific annotated dataset. However, implementing and controlling the pretraining and fine-tuning procedures in a multi-modal setting is challenging. In this thesis, we propose a systematic method that consists of pretraining the model on large amounts of unlabeled data and then fine-tuning it with a virtual adversarial training procedure. For the pretraining stage, we implement an unsupervised informative masking method, which improves upon standard Masked Language Modelling (MLM). Instead of masking tokens at random as in MLM, our method exploits Point-Wise Mutual Information (PMI) to calculate individual masking rates based on statistical properties of the data corpus, e.g., how often certain tokens appear together on a document page. We test our algorithm in a typical business context at SAP and report an overall improvement of 1.4% in F1-score for extracted document entities. Additionally, we show that the implemented methods improve the training speed, robustness, and data efficiency of the algorithm.
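The abstract does not say how PMI scores are turned into masking rates, so the following is only a minimal sketch of the general idea and not the implementation from the thesis. It assumes that co-occurrence is counted once per page, that a token's score is the mean positive PMI with the tokens it co-occurs with, and that the rate is a base rate scaled by this score and clamped to a range. The function names (`pmi_masking_rates`, `mask_tokens`) and the parameters (`base_rate`, `min_rate`, `max_rate`) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict
from itertools import combinations


def pmi_masking_rates(pages, base_rate=0.15, min_rate=0.05, max_rate=0.5):
    """Map each token to a masking probability derived from page-level PMI.

    pages: list of token lists, one list per document page.
    The scaling rule (base rate times relative score, then clamped) is an
    illustrative assumption, not the rule used in the thesis.
    """
    n_pages = len(pages)
    token_pages = Counter()  # number of pages that contain a token
    pair_pages = Counter()   # number of pages that contain both tokens of a pair
    for page in pages:
        vocab = sorted(set(page))
        token_pages.update(vocab)
        pair_pages.update(combinations(vocab, 2))

    # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), estimated from page counts.
    pmi_sum = defaultdict(float)
    pmi_n = defaultdict(int)
    for (x, y), count in pair_pages.items():
        p_xy = count / n_pages
        p_x = token_pages[x] / n_pages
        p_y = token_pages[y] / n_pages
        pmi = math.log(p_xy / (p_x * p_y))
        for tok in (x, y):
            pmi_sum[tok] += max(pmi, 0.0)  # keep only positive PMI
            pmi_n[tok] += 1

    # Score of a token: mean positive PMI over all pairs it takes part in.
    score = {t: (pmi_sum[t] / pmi_n[t] if pmi_n[t] else 0.0) for t in token_pages}
    mean_score = sum(score.values()) / len(score) if score else 0.0

    rates = {}
    for tok, s in score.items():
        rate = base_rate * (s / mean_score) if mean_score > 0 else base_rate
        rates[tok] = min(max_rate, max(min_rate, rate))
    return rates


def mask_tokens(tokens, rates, rng, mask_token="[MASK]", default_rate=0.15):
    """Replace each token by mask_token with its individual probability."""
    return [
        mask_token if rng.random() < rates.get(tok, default_rate) else tok
        for tok in tokens
    ]


if __name__ == "__main__":
    pages = [
        ["invoice", "number", "the"],
        ["invoice", "number", "the"],
        ["the", "total"],
        ["the", "date"],
    ]
    rates = pmi_masking_rates(pages)
    print(rates)
    print(mask_tokens(pages[0], rates, random.Random(0)))
```

In this toy corpus, "invoice" and "number" always appear together and therefore receive a higher masking rate than "the", which appears on every page and carries no co-occurrence information. Uniform MLM masking would treat all three tokens the same.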