Analysis of Time-Series Compression Methods Using Google N-Gram as an Example

From SDQ-Institutsseminar
Speaker Jonas Bernhard
Talk type Bachelor's thesis
Advisor Martin Schäler
Date Fri, 28 February 2020
Abstract Temporal text corpora such as the Google Ngram Data Set usually contain a vast number of words and expressions, called ngrams, together with their usage frequencies over the years. The large number of entries complicates working with the data set, because transformations and queries are resource- and time-intensive. However, many use cases do not need the whole corpus: a subset is sufficient to achieve acceptable query results. We propose various compression methods to reduce the total number of ngrams in the corpus. Specifically, we propose compression methods that, given an input dictionary of target words, find a compression tailored to queries on a specific topic. Additionally, we use time-series compression methods for quick estimates of the properties of ngram usage frequencies. CHQL (Conceptual History Query Language) queries on the Google Ngram Data Set serve as the basis for our compression method design and experimental validation.
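
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the method from the thesis: the toy corpus, the function names `filter_by_dictionary` and `piecewise_means`, and the choice of fixed-length segment means as the time-series summary are all assumptions made for illustration. The first function keeps only ngrams that contain a word from a target dictionary; the second replaces each yearly frequency series with the mean of each fixed-length segment.

```python
from typing import Dict, List, Set

# Hypothetical toy corpus: ngram -> yearly usage counts (illustrative values only).
Corpus = Dict[str, List[int]]


def filter_by_dictionary(corpus: Corpus, targets: Set[str]) -> Corpus:
    """Keep only ngrams that contain at least one target word (assumed criterion)."""
    wanted = {word.lower() for word in targets}
    return {
        ngram: counts
        for ngram, counts in corpus.items()
        if wanted & set(ngram.lower().split())
    }


def piecewise_means(counts: List[int], segment: int) -> List[float]:
    """Summarise a series by the mean of each fixed-length segment.

    The last segment may be shorter if len(counts) is not a multiple of segment.
    """
    if segment <= 0:
        raise ValueError("segment must be positive")
    chunks = [counts[i:i + segment] for i in range(0, len(counts), segment)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


corpus: Corpus = {
    "freedom of speech": [10, 12, 14, 20],
    "steam engine": [50, 40, 30, 20],
    "freedom": [100, 110, 120, 130],
}

# Step 1: reduce the number of ngrams using a topic dictionary.
reduced = filter_by_dictionary(corpus, {"freedom"})

# Step 2: compress each remaining time series to a coarser summary.
summary = {ngram: piecewise_means(counts, 2) for ngram, counts in reduced.items()}
```

In this sketch the dictionary step shrinks the corpus from three ngrams to two, and the segment means halve the length of each remaining series. Such a summary only supports approximate answers about frequency trends; how much accuracy is lost depends on the segment length and on the query.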