Review of data efficient dependency estimation

Speaker: Maximilian Georg
Talk type: Proposal
Advisor: Bela Böhnke
Date: Fri, 25 February 2022
Talk mode: online
Abstract: The amount and complexity of data collected in industry are increasing, and data analysis is rising in importance. Dependency estimation is a significant part of knowledge discovery and allows strategic decisions based on this information.

There are multiple examples that highlight the importance of dependency estimation: knowing that a correlation exists between the regular dose of a drug and the health of a patient helps to understand the impact of a newly manufactured drug. Knowing how the case material, brand, and condition of a watch influence its price on an online marketplace can help to buy watches at a good price. Materials science can also use dependency estimation to predict many properties of a material before it is synthesized in the lab, so fewer experiments are necessary.

Many dependency estimation algorithms perform poorly in a real-world setting because they do not consider multivariate dependencies. Multivariate dependencies are very common and occur, for instance, in the materials-science example, where the properties of the synthesized material depend on many variables. Also, dependency estimation algorithms are often not robust against errors in the data. But data is error-prone; take, for instance, data about the health of a patient in a clinical study, which is hard to measure accurately. Many dependency estimation algorithms require a large amount of data for a good estimation. But data can be expensive; experiments in materials science, for example, consume material, time, and energy. Since data collection is expensive, algorithms need to be data-efficient. There is, however, a trade-off between the amount of data and the quality of the estimation: with a lack of data comes uncertainty in the estimation. The algorithms do not always quantify this uncertainty. As a result, we do not know whether we can rely on the estimation or whether we need more data for an accurate estimation. Furthermore, many algorithms are too complex to be used by a non-expert. The parameters of an algorithm need to be intuitive to use, and the result should be interpretable. Only then can people outside of academia apply the algorithm without mistakes.
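
To make the first of these challenges, multivariate dependence, concrete, consider a minimal Python sketch (an illustration of the general phenomenon, not an example from the thesis; all variable names are our own): a target that depends jointly on two variables through an XOR relationship is invisible to pairwise correlation, the kind of bivariate measure many estimators rely on.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Two independent binary variables.
    x = rng.integers(0, 2, size=n)
    y = rng.integers(0, 2, size=n)

    # z depends jointly on x and y (XOR), but on neither variable alone.
    z = x ^ y

    # Every pairwise Pearson correlation is ~0, hiding the dependency...
    print(np.corrcoef(x, z)[0, 1])  # ~0.0
    print(np.corrcoef(y, z)[0, 1])  # ~0.0

    # ...yet z is fully determined by (x, y): fixing x flips the sign of
    # the y-z relationship, which a multivariate estimator would detect.
    print(np.corrcoef(y[x == 0], z[x == 0])[0, 1])  # +1.0
    print(np.corrcoef(y[x == 1], z[x == 1])[0, 1])  # -1.0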

In this bachelor's thesis we compare different state-of-the-art dependency estimation algorithms using a list of criteria addressing the above-mentioned challenges. We partly developed the criteria ourselves and partly took them from relevant publications. Many of the existing criteria were only formulated qualitatively; part of this thesis is to make these criteria quantitatively measurable where possible, and to devise a systematic approach of comparison for the rest.

From the 14 selected criteria, the focus will be on data efficiency and uncertainty estimation. These criteria are essential for lowering the cost of dependency estimation. The expected result of this bachelor's thesis is to identify an algorithm that fulfils all 14 criteria. In the comparison we include a qualitative analysis that checks general criteria which increase the usability for non-experts, such as interpretability and intuitiveness. We also analyse whether an algorithm is an anytime algorithm and whether it uses incremental computation to enable early stopping and increase data efficiency. Another criterion is guided sampling, which can also increase data efficiency. To apply the algorithms to different kinds of datasets, we further analyse whether they are multivariate, general-purpose, and non-parametric.
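
To illustrate what anytime behaviour, incremental computation, and uncertainty-driven early stopping mean in practice, the following Python sketch combines them in one loop (the function name, the choice of Pearson correlation, and the Fisher-z confidence interval are our illustrative assumptions, not an algorithm from the thesis): the estimate is updated in O(1) per incoming sample, a usable result is available at any time, and the quantified uncertainty decides when to stop sampling.

    import math
    import numpy as np

    def anytime_correlation(sample_stream, half_width=0.05, z=1.96, max_n=10_000):
        """Incrementally estimate a Pearson correlation from a stream of
        (x, y) pairs; stop early once the Fisher-z confidence interval is
        narrower than +/- half_width (uncertainty-aware early stopping)."""
        n = sx = sy = sxx = syy = sxy = 0.0
        for x, y in sample_stream:
            # Incremental update: O(1) running sums, no pass over old data.
            n += 1
            sx += x; sy += y
            sxx += x * x; syy += y * y; sxy += x * y
            if n < 10:                         # too few points for a stable estimate
                continue
            cov = n * sxy - sx * sy
            var = (n * sxx - sx * sx) * (n * syy - sy * sy)
            r = cov / math.sqrt(var)
            # Fisher z-transform: se(atanh(r)) is approximately 1 / sqrt(n - 3).
            se = 1.0 / math.sqrt(n - 3)
            lo = math.tanh(math.atanh(r) - z * se)
            hi = math.tanh(math.atanh(r) + z * se)
            if (hi - lo) / 2 < half_width or n >= max_n:
                return r, (lo, hi), int(n)     # anytime: a usable result at any stop

    # Usage: pairs arrive one at a time from an (expensive) data source.
    rng = np.random.default_rng(1)
    def stream():
        while True:
            x = rng.normal()
            yield x, 0.8 * x + 0.6 * rng.normal()

    r, ci, n_used = anytime_correlation(stream())
    print(f"r={r:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f}), samples used={n_used}")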

We also conduct a quantitative analysis of the dependency estimation algorithms that performed well in the qualitative analysis, by experimenting on well-established and representative datasets. In these experiments we evaluate further criteria: robustness, which is necessary for error-prone data; efficiency, which saves computation time; convergence, which guarantees an accurate estimation given enough data; and consistency, which ensures we can rely on an estimation.
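
As a sketch of how convergence and consistency could be made measurable in such experiments, assume synthetic data with a known ground-truth dependency and Pearson correlation as a stand-in estimator (both hypothetical choices for illustration, not the thesis setup): the estimation error should shrink as the sample size grows (convergence), and repeated runs at the same sample size should agree (consistency).

    import numpy as np

    rng = np.random.default_rng(2)
    true_r = 0.8

    def sample(n):
        """Synthetic bivariate data with known population correlation true_r."""
        x = rng.normal(size=n)
        y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)
        return x, y

    # Convergence: error against ground truth should shrink with n.
    # Consistency: repeated runs at the same n should agree (small std).
    for n in (50, 200, 1_000, 5_000):
        errs = []
        for _ in range(100):                   # repeat to measure the spread
            x, y = sample(n)
            errs.append(np.corrcoef(x, y)[0, 1] - true_r)
        errs = np.asarray(errs)
        print(f"n={n:5d}  mean error={errs.mean():+.4f}  std={errs.std():.4f}")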