Analyse von Zeitreihen-Kompressionsmethoden am Beispiel von Google N-Grams

Aus SDQ-Institutsseminar
Version vom 8. Oktober 2019, 14:58 Uhr von Martin Schäler (Diskussion | Beiträge)
(Unterschied) ← Nächstältere Version | Aktuelle Version (Unterschied) | Nächstjüngere Version → (Unterschied)
Vortragende(r) J. Bernhard
Vortragstyp Proposal
Betreuer(in) Martin Schäler
Termin Fr 25. Oktober 2019
Kurzfassung Temporal text corpora like the Google Ngram dataset usually incorporate a vast number of words and expressions, called ngrams, and their respective usage frequencies over the years. The large quantity of entries complicates working with the dataset, as transformations and queries are resource and time intensive. However, many use-cases do not require the whole corpus to have a sufficient dataset and achieve acceptable results. We propose various compression methods to reduce the absolute number of ngrams in the corpus. Additionally, we utilize time-series compression methods for quick estimations about the properties of ngram usage frequencies. As basis for our compression method design and experimental validation serve CHQL (Conceptual History Query Language) queries on the Google Ngram dataset. The goal is to find compression methods that reduce the complexity of queries on the corpus while still maintaining good results.