Kurzfassung
|
Temporal text corpora like the Google Ngram dataset usually incorporate a vast number of words and expressions, called ngrams, and their respective usage frequencies over the years. The large quantity of entries complicates working with the dataset, as transformations and queries are resource and time intensive. However, many use-cases do not require the whole corpus to have a sufficient dataset and achieve acceptable results. We propose various compression methods to reduce the absolute number of ngrams in the corpus. Additionally, we utilize time-series compression methods for quick estimations about the properties of ngram usage frequencies. As basis for our compression method design and experimental validation serve CHQL (Conceptual History Query Language) queries on the Google Ngram dataset. The goal is to find compression methods that reduce the complexity of queries on the corpus while still maintaining good results.
|