|Freitag, 13. Mai 2022
|11:30 – 12:10 Uhr (Dauer: 40 min)
|Raum 010 (Gebäude 50.34)
|Do 12. Mai 2022
|Fr 20. Mai 2022
Termin in Kalender importieren: iCal (Download)
|Developing a Framework for Mining Temporal Data from Twitter as Basis for Time-Series Correlation Analysis
|In the last decade, ample research has been produced regarding the value of user-generated data from microblogs as a basis for time series analysis in various fields.In this context, the objective of this thesis is to develop a domain-agnostic framework for mining microblog data (i.e., Twitter). Taking the subject related postings of a time series (e.g., inflation) as its input, the framework will generate temporal data sets that can serve as basis for time series analysis of the given target time series (e.g., inflation rate).
To accomplish this, we will analyze and summarize the prevalent research related to microblog data-based forecasting and analysis, with a focus on the data processing and mining approach. Based on the findings, one or several candidate frameworks are developed and evaluated by testing the correlation of their generated data sets against the target time series they are generated for.
While summative research on microblog data-based correlation analysis exists, it is mainly focused on summarizing the state of the field. This thesis adds to the body of research by applying summarized findings and generating experimental evidence regarding the generalizability of microblog data mining approaches and their effectiveness.
|Standardized Real-World Change Detection Data
|The reliable detection of change points is a fundamental task when analysing data across many fields, e.g., in finance, bioinformatics, and medicine.
To define “change points”, we assume that there is a distribution, which may change over time, generating the data we observe. A change point then is a change in this underlying distribution, i.e., the distribution coming before a change point is different from the distribution coming after. The principled way to compare distributions, and to find change points, is to employ statistical tests.
While change point detection is an unsupervised problem in practice, i.e., the data is unlabelled, the development and evaluation of data analysis algorithms requires labelled data. Only few labelled real world data sets are publicly available and many of them are either too small or have ambiguous labels. Further issues are that reusing data sets may lead to overfitting, and preprocessing (e.g., removing outliers) may manipulate results. To address these issues, van den Burg et al. publish 37 data sets annotated by data scientists and ML researchers and use them for an assessment of 14 change detection algorithms. Yet, there remain concerns due to the fact that these are labelled by hand: Can humans correctly identify changes according to the definition, and can they be consistent in doing so?
The goal of this Bachelor's thesis is to algorithmically label their data sets following the formal definition and to also identify and label larger and higher-dimensional data sets, thereby extending their work. To this end, we leverage a non-parametric hypothesis test which builds on Maximum Mean Discrepancy (MMD) as a test statistic, i.e., we identify changes in a principled way. We will analyse the labels so obtained and compare them to the human annotations, measuring their consistency with the F1 score. To assess the influence of the algorithmic and definition-conform annotations, we will use them to reevaluate the algorithms of van den Burg et al. and compare the respective performances.
- Neuen Vortrag erstellen