Efficient Pruning of N-gram Corpora for Culturomics using Language Models

Speaker Caspar Friedrich Maximilian Nagy
Type of talk Bachelor's thesis
Advisor Jens Willkomm
Date Fri, 23 October 2020
Presentation mode
Abstract Big data technology pushes the frontiers of science. A particularly interesting application of it is culturomics, which uses big data techniques to quantify and observe language and culture over time. A milestone in enabling this kind of analysis in a traditionally humanistic field was the Google Books project. The scanned books were transformed into a so-called N-gram corpus, which records the frequencies of words and word combinations over time. Unfortunately, this corpus is enormous, requiring over 2 terabytes of storage, which makes handling, storing, and querying it difficult. In this bachelor's thesis, we introduce a novel technique to reduce the storage requirements of N-gram corpora. It uses natural language processing to estimate the counts of N-grams. Our approach is able to prune around 30% more effectively than state-of-the-art methods.
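
The abstract does not spell out the pruning criterion, but a minimal sketch can illustrate the general idea: a model predicts an N-gram's count from simpler statistics, and entries the model already reconstructs well need not be stored. The sketch below is an assumption, not the thesis's actual method; it stands in for the thesis's language model with a deliberately simple independence-based bigram estimator, and the function names and the tolerance value are illustrative only.

```python
"""Sketch of count-based N-gram pruning, assuming (not confirmed by the
abstract) the criterion: drop an N-gram whose stored count a model can
reconstruct within a relative error tolerance."""

from collections import Counter


def estimate_bigram_count(bigram, unigram_counts, total_tokens):
    """Hypothetical estimator: predict the bigram count from unigram
    frequencies under an independence assumption, P(w1 w2) ~ P(w1) * P(w2)."""
    w1, w2 = bigram
    p1 = unigram_counts[w1] / total_tokens
    p2 = unigram_counts[w2] / total_tokens
    return p1 * p2 * total_tokens


def prune_corpus(bigram_counts, unigram_counts, total_tokens, tolerance=0.3):
    """Keep only bigrams whose stored count deviates from the model's
    estimate by more than `tolerance`; pruned entries can be re-estimated
    on demand instead of being stored."""
    kept = {}
    for bigram, count in bigram_counts.items():
        estimate = estimate_bigram_count(bigram, unigram_counts, total_tokens)
        if abs(count - estimate) / count > tolerance:
            kept[bigram] = count
    return kept


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ran".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pruned = prune_corpus(bigrams, unigrams, len(tokens))
    print(f"stored {len(pruned)} of {len(bigrams)} bigrams")
```

Under this reading, querying a pruned entry amounts to re-running the estimator, trading exactness of rarely needed counts for storage; a stronger language model than the independence baseline would let more entries be dropped at the same tolerance.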