Approximating an Ngram Corpus with Probabilistic Methods

From SDQ-Institutsseminar
Speaker Caspar Nagy
Talk type Proposal
Advisor Jens Willkomm
Date Fri, 19 June 2020
Talk language
Talk mode
Abstract In this work, we consider ngram corpora, i.e., sets of word sequences of different lengths together with their usage frequencies in natural language. For example, the 3-gram "bag of words" may occur 200 times. The usage frequencies of (1) the unigrams "bag", "of", and "words", (2) the bigrams "bag of" and "of words", and (3) the trigram "bag of words" clearly depend on one another. Language models partially exploit this dependence, for instance for grammar correction or speech recognition. From a database point of view, an ngram corpus therefore contains information that is either redundant or can be estimated well. This suggests that the corpus size can be reduced substantially while its information is still provided with high accuracy.
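
To make the dependence between unigram, bigram, and trigram frequencies concrete, the following minimal sketch estimates a trigram count from lower-order counts. It assumes a simple Markov-style approximation, c(w1 w2 w3) ≈ c(w1 w2) · c(w2 w3) / c(w2), which is one common choice and not necessarily the estimator used in this work. The helper names `count_ngrams` and `estimate_trigram` and the toy sentence are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count all contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def estimate_trigram(trigram, unigrams, bigrams):
    """Estimate c(w1 w2 w3) as c(w1 w2) * c(w2 w3) / c(w2).

    This is an assumed Markov-style approximation, used here only
    to illustrate how lower-order counts constrain higher-order ones.
    """
    w1, w2, w3 = trigram
    middle = unigrams[(w2,)]
    if middle == 0:
        return 0.0  # the middle word never occurs, so neither can the trigram
    return bigrams[(w1, w2)] * bigrams[(w2, w3)] / middle


# Toy corpus (illustrative only).
tokens = "a bag of words is a bag of tokens".split()
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)

estimate = estimate_trigram(("bag", "of", "words"), unigrams, bigrams)
actual = trigrams[("bag", "of", "words")]
```

On this toy sentence the estimate happens to match the actual count; on real corpora the two generally differ, and the size of that gap determines how much of the corpus can be dropped.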

In this work, we investigate how n-grams and (n+1)-grams relate to each other in both directions, that is, how the frequencies of one can be estimated from the other. Our objective is to store only part of the full ngram corpus and to estimate the rest.
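
The idea of storing only part of the corpus and estimating the rest can be sketched as follows. This is a hedged illustration under stated assumptions, not the method proposed in the talk: it assumes the Markov-style estimate c(g) ≈ c(prefix) · c(suffix) / c(overlap) for (n+1)-grams with n ≥ 2, and it keeps an (n+1)-gram only if the relative error of that estimate exceeds a threshold. The names `ngram_counts`, `estimate`, `prune`, `lookup`, and the parameter `tolerance` are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def estimate(gram, lower, lowest):
    """Estimate the count of an (n+1)-gram from n-gram and (n-1)-gram counts.

    Assumed approximation: c(prefix) * c(suffix) / c(overlap), where the
    overlap is the gram without its first and last word (requires len(gram) >= 3).
    """
    overlap = lowest[gram[1:-1]]
    if overlap == 0:
        return 0.0
    return lower[gram[:-1]] * lower[gram[1:]] / overlap


def prune(higher, lower, lowest, tolerance):
    """Keep only the entries whose estimate is off by more than `tolerance` (relative error)."""
    return {
        gram: count
        for gram, count in higher.items()
        if abs(estimate(gram, lower, lowest) - count) > tolerance * count
    }


def lookup(gram, stored, lower, lowest):
    """Return the stored exact count if available, otherwise the estimate."""
    if gram in stored:
        return stored[gram]
    return estimate(gram, lower, lowest)


# Toy corpus (illustrative only).
tokens = "the bag of words and the bag of tricks and a set of words".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

# Store only the trigrams that the lower-order counts predict poorly.
stored = prune(trigrams, bigrams, unigrams, tolerance=0.4)
```

In this toy example, only 2 of the 11 distinct trigrams need to be stored at a relative tolerance of 0.4; all others are answered by `lookup` from the bigram and unigram counts. One limitation of such a scheme is that it may return a nonzero estimate for a trigram that never occurs in the corpus.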