Efficient Pruning of N-gram Corpora for Culturomics using Language Models

Speaker Caspar Friedrich Maximilian Nagy
Type of talk Bachelor's thesis
Advisor Jens Willkomm
Date Fri, 23 October 2020
Presentation mode
Abstract Big data technology pushes the frontiers of science. A particularly interesting application of it is culturomics, which uses big data techniques to quantify and observe language and culture over time. A milestone in enabling this kind of analysis in a traditionally humanistic field was the Google Books project. The scanned books were transformed into a so-called N-gram corpus, which records the frequencies of words and word combinations over time. Unfortunately, this corpus is enormous, requiring over 2 terabytes of storage, which makes handling, storing, and querying it difficult. In this bachelor's thesis, we introduce a novel technique to reduce the storage requirements of N-gram corpora. It uses natural language processing to estimate the counts of N-grams. Our approach is able to prune around 30% more effectively than state-of-the-art methods.
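
The abstract does not spell out the pruning criterion, but a minimal sketch can illustrate the general idea: a model predicts an N-gram's count from simpler statistics, and entries the model already reconstructs well need not be stored. The sketch below is an assumption, not the thesis's actual method; it stands in for the thesis's language model with a deliberately simple independence-based bigram estimator, and the function names and the tolerance value are illustrative only.

```python
"""Sketch of count-based N-gram pruning, assuming (not confirmed by the
abstract) the criterion: drop an N-gram whose stored count a model can
reconstruct within a relative error tolerance."""

from collections import Counter


def estimate_bigram_count(bigram, unigram_counts, total_tokens):
    """Hypothetical estimator: predict the bigram count from unigram
    frequencies under an independence assumption, P(w1 w2) ~ P(w1) * P(w2)."""
    w1, w2 = bigram
    p1 = unigram_counts[w1] / total_tokens
    p2 = unigram_counts[w2] / total_tokens
    return p1 * p2 * total_tokens


def prune_corpus(bigram_counts, unigram_counts, total_tokens, tolerance=0.3):
    """Keep only bigrams whose stored count deviates from the model's
    estimate by more than `tolerance`; pruned entries can be re-estimated
    on demand instead of being stored."""
    kept = {}
    for bigram, count in bigram_counts.items():
        estimate = estimate_bigram_count(bigram, unigram_counts, total_tokens)
        if abs(count - estimate) / count > tolerance:
            kept[bigram] = count
    return kept


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ran".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pruned = prune_corpus(bigrams, unigrams, len(tokens))
    print(f"stored {len(pruned)} of {len(bigrams)} bigrams")
```

Under this reading, querying a pruned entry amounts to re-running the estimator, trading exactness of rarely needed counts for storage; a stronger language model than the independence baseline would let more entries be dropped at the same tolerance.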