|SAP Document Information Extraction (DOX) is a service to extract logical entities from scanned documents based on the well-known Transformer architecture. The entities comprise header information such as document date or sender name, and line items from tables on the document with fields such as line item quantity. The model currently needs to be trained on a huge number of labeled documents, which is impractical. Also, this hinders the deployment of the model at large scale, as it cannot easily adapt to new languages or document types. Recently, pretraining large language models with self-supervised learning techniques have shown good results as a preliminary step, and allow reducing the amount of labels required in follow-up steps. However, to generalize self-supervised learning to document understanding, we need to take into account different modalities: text, layout and image information of documents. How to do that efficiently and effectively is unclear yet. The goal of this thesis is to come up with a technique for self-supervised pretraining within SAP DOX. We will evaluate our method and design decisions against SAP data as well as public data sets. Besides the accuracy of the extracted entities, we will measure to what extent our method lets us lower label requirements.