Subspace Search in Data Streams: Difference between revisions

From SDQ-Institutsseminar
(Page created: "{{Vortrag |vortragender=Florian Kalinke |email=utzzc@student.kit.edu |vortragstyp=Proposal |betreuer=Edouard Fouché |termin=Institutsseminar/2019-07-19 |kurzf…")
 
No edit summary
Line 5: Line 5:
|betreuer=Edouard Fouché
|termin=Institutsseminar/2019-07-19
|kurzfassung=Modern data mining often takes place on high-dimensional data streams that arrive at a very fast pace. High dimensionality and the speed of arrival provide two unique sets of challenges, while current mining algorithms often tackle only one of them.
|kurzfassung=Modern data mining often takes place on high-dimensional data streams that evolve at a very fast pace. On the one hand, the "curse of dimensionality" leads to a sparsely populated feature space, in which classical statistical methods perform poorly. Patterns, such as clusters or outliers, often hide in a few low-dimensional subspaces. On the other hand, data streams are non-stationary and virtually unbounded. Hence, algorithms operating on data streams must work incrementally and take concept drift into account.


With the high-dimensionality, the curse of dimensionality comes into effect. This leads to a sparsely populated feature space, for which classical statistical methods perform poorly. Patterns, such as clusters or outliers, often hide in low-dimensional subspaces of interest and cannot be discovered in the high-dimensional space.
While high dimensionality and the streaming setting each pose their own set of challenges, we observe that existing mining algorithms address them only separately. We therefore plan to propose a novel algorithm that keeps track of the subspaces of interest in high-dimensional data streams over time. We quantify the relevance of subspaces via a so-called "contrast" measure, which we can maintain incrementally and efficiently. Furthermore, we propose a set of heuristics to adapt the search for relevant subspaces as the data and the underlying distribution evolve.


Data streams are virtually unbounded, and the distribution of the data may change over time. Hence, algorithms operating on data streams have to work incrementally and have to take concept drift into account.  
We show that our approach is beneficial as a feature selection method and, as such, can extend a range of knowledge discovery tasks, e.g., outlier detection, in high-dimensional data streams.
 
In this thesis, we propose a streaming algorithm to track the subspaces in which patterns may occur over time. We quantify the relevance of subspaces using a so-called contrast measure, which captures the strength of a potential relationship between the attributes of a subspace. As the relevance of subspaces may change over time, the proposed algorithm uses a heuristic to search for the relevant subspaces as the data and the underlying distribution evolve.
}}
}}

Revision as of 4 June 2019, 13:57

Speaker Florian Kalinke
Talk type Proposal
Advisor Edouard Fouché
Date Fri, 19 July 2019
Talk language
Talk mode
Abstract Modern data mining often takes place on high-dimensional data streams that evolve at a very fast pace. On the one hand, the "curse of dimensionality" leads to a sparsely populated feature space, in which classical statistical methods perform poorly. Patterns, such as clusters or outliers, often hide in a few low-dimensional subspaces. On the other hand, data streams are non-stationary and virtually unbounded. Hence, algorithms operating on data streams must work incrementally and take concept drift into account.

While high dimensionality and the streaming setting each pose their own set of challenges, we observe that existing mining algorithms address them only separately. We therefore plan to propose a novel algorithm that keeps track of the subspaces of interest in high-dimensional data streams over time. We quantify the relevance of subspaces via a so-called "contrast" measure, which we can maintain incrementally and efficiently. Furthermore, we propose a set of heuristics to adapt the search for relevant subspaces as the data and the underlying distribution evolve.

We show that our approach is beneficial as a feature selection method and, as such, can extend a range of knowledge discovery tasks, e.g., outlier detection, in high-dimensional data streams.
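
The abstract does not spell out how the "contrast" measure is computed or maintained. As a rough illustration of what scoring a subspace incrementally over a stream can look like, the following sketch keeps running sums over a sliding window and reports the absolute Pearson correlation of a two-dimensional subspace. The correlation is only a stand-in for the contrast measure, and the class name `WindowedSubspaceScore`, its methods, and the `window_size` parameter are hypothetical names chosen for this example, not part of the proposed algorithm.

```python
from collections import deque
import math


class WindowedSubspaceScore:
    """Relevance score of a 2-D subspace over a sliding window.

    Assumption: absolute Pearson correlation is used as a simple
    stand-in for the "contrast" measure named in the abstract.
    """

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        # Running sums make each update O(1) instead of O(window_size).
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        """Add one observation and forget the oldest one if the window is full."""
        self.window.append((x, y))
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y
        if len(self.window) > self.window_size:
            ox, oy = self.window.popleft()
            self.sx -= ox
            self.sy -= oy
            self.sxx -= ox * ox
            self.syy -= oy * oy
            self.sxy -= ox * oy

    def score(self):
        """Return |corr(x, y)| over the current window, or 0.0 if undefined."""
        n = len(self.window)
        if n < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / n
        var_x = self.sxx - self.sx ** 2 / n
        var_y = self.syy - self.sy ** 2 / n
        if var_x <= 0.0 or var_y <= 0.0:
            return 0.0
        return abs(cov) / math.sqrt(var_x * var_y)
```

Because old observations leave the window, the score follows the current distribution: a subspace whose attributes stop being related loses its relevance once the window has filled with post-drift data. A search heuristic could keep one such object per candidate subspace and re-rank the candidates as the scores change. Running sums can accumulate floating-point error on long streams, so a practical implementation would likely need a numerically more robust update.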