Efficient Pruning of N-gram Corpora for Culturomics using Language Models

From SDQ-Institutsseminar
Speaker Caspar Friedrich Maximilian Nagy
Talk type Bachelor's thesis
Advisor Jens Willkomm
Date Fri, 23 October 2020
Abstract Big data technology pushes the frontiers of science. A particularly interesting application is culturomics, which uses big data techniques to accurately quantify and observe language and culture over time. A milestone in enabling this kind of analysis in a traditionally humanistic field was the Google Books project. The scanned books were transformed into a so-called N-gram corpus, which contains the frequencies of words and word combinations over time. Unfortunately, this corpus is enormous, requiring over 2 terabytes of storage, which makes handling, storing, and querying it difficult. In this bachelor's thesis, we introduce a novel technique to reduce the storage requirements of N-gram corpora. It uses natural language processing to estimate the counts of N-grams. Our approach prunes around 30% more effectively than state-of-the-art methods.