Review of data efficient dependency estimation

Aus SDQ-Institutsseminar
Vortragende(r) Maximilian Georg
Vortragstyp Proposal
Betreuer(in) Bela Böhnke
Termin Fr 25. Februar 2022
Vortragssprache
Vortragsmodus online
Kurzfassung The amount and complexity of data collected in the industry is increasing, and data analysis rises in importance. Dependency estimation is a significant part of knowledge discovery and allows strategic decisions based on this information.

There are multiple examples that highlight the importance of dependency estimation, like knowing there exists a correlation between the regular dose of a drug and the health of a patient helps to understand the impact of a newly manufactured drug. Knowing how the case material, brand, and condition of a watch influences the price on an online marketplace can help to buy watches at a good price. Material sciences can also use dependency estimation to predict many properties of a material before it is synthesized in the lab, so fewer experiments are necessary.

Many dependency estimation algorithms require a large amount of data for a good estimation. But data can be expensive, as an example experiments in material sciences, consume material and take time and energy. As we have the challenge of expensive data collection, algorithms need to be data efficient. But there is a trade-off between the amount of data and the quality of the estimation. With a lack of data comes an uncertainty of the estimation. However, the algorithms do not always quantify this uncertainty. As a result, we do not know if we can rely on the estimation or if we need more data for an accurate estimation.

In this bachelor's thesis we compare different state-of-the-art dependency estimation algorithms using a list of criteria addressing these challenges and more. We partly developed the criteria our self as well as took them from relevant publications. The existing publications formulated many of the criteria only qualitative, part of this thesis is to make these criteria measurable quantitative, where possible, and come up with a systematic approach of comparison for the rest.

From 14 selected criteria, we focus on criteria concerning data efficiency and uncertainty estimation, because they are essential for lowering the cost of dependency estimation, but we will also check other criteria relevant for the application of algorithms. As a result, we will rank the algorithms in the different aspects given by the criteria, and thereby identify potential for improvement of the current algorithms.

We do this in two steps, first we check general criteria in a qualitative analysis. For this we check if the algorithm is capable of guided sampling, if it is an anytime algorithm and if it uses incremental computation to enable early stopping, which all leads to more data efficiency.

We also conduct a quantitative analysis on well-established and representative datasets for the dependency estimation algorithms, that performed well in the qualitative analysis. In these experiments we evaluate more criteria: The robustness, which is necessary for error-prone data, the efficiency which saves time in the computation, the convergence which guarantees we get an accurate estimation with enough data, and consistency which ensures we can rely on an estimation.