Standardized Real-World Change Detection Data

Speaker: Moritz Teichner
Talk type: Proposal
Advisor: Florian Kalinke
Date: Fri, 13 May 2022
Language:
Mode: in person
Abstract

The reliable detection of change points is a fundamental task when analysing data across many fields, e.g., in finance, bioinformatics, and medicine.

To define “change points”, we assume that the data we observe is generated by an underlying distribution that may change over time. A change point is a point in time at which this distribution changes, i.e., the distribution before the change point differs from the distribution after it. The principled way to compare distributions, and thus to find change points, is to employ statistical tests.
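To make this definition concrete, it can be phrased as a two-sample testing problem; the notation below is ours, not taken from the abstract. Given observations

\[ x_1, \dots, x_T \quad \text{with} \quad x_t \sim P_t, \]

an index \( \tau \) is a change point if the generating distribution differs before and after it,

\[ P_1 = \dots = P_\tau \;\neq\; P_{\tau+1} = \dots = P_T, \]

so deciding whether a candidate \( \tau \) is a change point amounts to the statistical test

\[ H_0 \colon P = Q \quad \text{versus} \quad H_1 \colon P \neq Q, \]

where \( P \) and \( Q \) denote the distributions generating the segments before and after \( \tau \).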

While change point detection is an unsupervised problem in practice, i.e., the data is unlabelled, the development and evaluation of data analysis algorithms require labelled data. Only a few labelled real-world data sets are publicly available, and many of them are either too small or have ambiguous labels. Further issues are that reusing data sets may lead to overfitting and that preprocessing (e.g., removing outliers) may distort results. To address these issues, van den Burg et al. published 37 data sets annotated by data scientists and ML researchers and used them to assess 14 change detection algorithms. Yet concerns remain because the data sets are labelled by hand: Can humans correctly identify changes according to the definition, and can they do so consistently?

The goal of this Bachelor's thesis is to label their data sets algorithmically, following the formal definition, and to identify and label larger, higher-dimensional data sets, thereby extending their work. To this end, we leverage a non-parametric hypothesis test that uses the Maximum Mean Discrepancy (MMD) as its test statistic, i.e., we identify changes in a principled way. We will analyse the labels obtained in this way, comparing them to the human annotations and measuring their consistency with the F1 score. To assess the influence of the algorithmic, definition-conforming annotations, we will use them to re-evaluate the algorithms of van den Burg et al. and compare the respective performances.
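The abstract does not spell out the test itself; the following is a minimal Python/NumPy sketch of an MMD-based two-sample test with a permutation-derived p-value. The Gaussian kernel, its bandwidth, and the permutation count are illustrative assumptions on our part, not the exact setup of the thesis.

import numpy as np

def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of x and y."""
    sq_dists = (np.sum(x**2, axis=1)[:, None]
                + np.sum(y**2, axis=1)[None, :]
                - 2.0 * x @ y.T)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

def mmd_permutation_test(x, y, bandwidth=1.0, n_permutations=500, seed=None):
    """p-value for H0: x and y are drawn from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = mmd2(x, y, bandwidth)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_permutations):
        idx = rng.permutation(len(pooled))
        exceed += mmd2(pooled[idx[:n]], pooled[idx[n:]], bandwidth) >= observed
    return (exceed + 1) / (n_permutations + 1)

# Toy usage: a mean shift at t = 100 in a two-dimensional series.
rng = np.random.default_rng(0)
before = rng.normal(0.0, 1.0, size=(100, 2))
after = rng.normal(1.0, 1.0, size=(100, 2))
print(mmd_permutation_test(before, after, bandwidth=1.0))  # small p-value: change detected

In practice, the bandwidth is often set with the median heuristic (the median pairwise distance within the pooled sample), and a candidate change point is tested by splitting the series at that point into the “before” and “after” samples.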