Efficient Pruning of N-gram Corpora for Culturomics using Language Models
Current version as of 20 October 2020, 17:45
Speaker | Caspar Friedrich Maximilian Nagy
---|---
Email | maxnagy@me.com
Talk type | Bachelor's thesis
Advisor | Jens Willkomm
Date | Fri, 23 October 2020
Presentation mode |
Abstract | Big data technology pushes the frontiers of science. A particularly interesting application of it is culturomics, which uses big data techniques to accurately quantify and observe language and culture over time. A milestone toward enabling this kind of analysis in a traditionally humanistic field was the effort around the Google Books project. The scanned books were transformed into a so-called N-gram corpus, which records the frequencies of words and their combinations over time. Unfortunately, this corpus is enormous, requiring over 2 terabytes of storage, which makes handling, storing, and querying it difficult. In this bachelor's thesis, we introduce a novel technique for reducing the storage requirements of N-gram corpora. It uses natural language processing to estimate the counts of N-grams. Our approach prunes around 30% more effectively than state-of-the-art methods.
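The core idea of the abstract — dropping N-gram entries whose counts a language model can already estimate, so only hard-to-predict entries must be stored — can be illustrated with a minimal sketch. This is not the thesis's actual algorithm: the unigram-independence model, the relative-error pruning criterion, and all function names below are assumptions made for illustration only.

```python
# Illustrative sketch (assumed, not the thesis's method): prune bigram
# count entries that a simple unigram language model predicts well
# enough, keeping only the entries the model gets badly wrong.
from collections import Counter


def train_unigram_model(tokens):
    """Return per-word relative frequencies and the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}, total


def estimate_bigram_count(model, total, bigram):
    """Estimate a bigram's count under an independence assumption:
    P(w1 w2) ~ P(w1) * P(w2), scaled by the corpus size."""
    w1, w2 = bigram
    return model.get(w1, 0.0) * model.get(w2, 0.0) * total


def prune(bigram_counts, model, total, tolerance=0.5):
    """Keep only bigrams whose true count deviates from the model's
    estimate by more than `tolerance` (relative error); the rest can
    be reconstructed from the model at query time."""
    kept = {}
    for bigram, true_count in bigram_counts.items():
        est = estimate_bigram_count(model, total, bigram)
        if abs(est - true_count) > tolerance * max(true_count, 1):
            kept[bigram] = true_count
    return kept


# Tiny demo corpus; real N-gram corpora are per-year and far larger.
tokens = "the cat sat on the mat the cat".split()
bigram_counts = dict(Counter(zip(tokens, tokens[1:])))
model, total = train_unigram_model(tokens)
kept = prune(bigram_counts, model, total)
```

At query time, a count that was pruned would be answered with the model's estimate instead of a stored value; the tolerance parameter trades storage savings against estimation error.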