Optimizing resource allocation in distributed computing systems by predicting job execution metadata from resource demands and platform characteristics is essential for improving system efficiency and reliability.
Among the distributed computing simulators developed in recent decades, this thesis focuses on DCSim.
DCSim simulates the nodes and links of the configured platform, generates the workloads according to configured parameter distributions, and performs the simulations.
The simulated job execution metadata is accurate, yet the simulations demand computational resources and time that increase superlinearly with the number of nodes simulated.
In this thesis, we explore the application of Recurrent Neural Networks and Transformer models for predicting job execution metadata within distributed computing environments.
We focus on data preparation, model training, and evaluation for handling numerical sequences of varying lengths.
By leveraging deep neural networks to interpret and forecast job execution metadata from simulated or historical data, this approach facilitates better predictions of distributed computing system dynamics and enhances the scalability of predictive systems.
We assess the models across four scenarios of increasing complexity, evaluating their ability to generalize to unseen jobs and platforms.
We examine the training duration and the amount of data necessary to achieve accurate predictions, and discuss the applicability of such models for overcoming the scalability challenges of DCSim.
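As an illustration of what handling numerical sequences of varying lengths can involve, the following minimal Python sketch right-pads each sequence in a batch to a common length and keeps a boolean mask so that padded positions can be ignored downstream. It is not the data preparation pipeline of this thesis; the function names `pad_and_mask` and `masked_mean`, the padding value, and the toy numbers are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def pad_and_mask(
    sequences: Sequence[Sequence[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[List[bool]]]:
    """Right-pad variable-length sequences to the longest length in the batch.

    Returns the padded batch and a mask that is True at real positions
    and False at padded positions. (Hypothetical helper, for illustration.)
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    padded: List[List[float]] = []
    mask: List[List[bool]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        mask.append([True] * len(seq) + [False] * n_pad)
    return padded, mask


def masked_mean(padded: List[List[float]], mask: List[List[bool]]) -> List[float]:
    """Average each row over its real (unmasked) entries only."""
    means: List[float] = []
    for row, row_mask in zip(padded, mask):
        valid = [value for value, keep in zip(row, row_mask) if keep]
        means.append(sum(valid) / len(valid) if valid else 0.0)
    return means


if __name__ == "__main__":
    # Toy batch: two sequences of per-job values with different lengths.
    batch = [[1.0, 2.0, 3.0], [4.0]]
    padded_batch, batch_mask = pad_and_mask(batch)
    print(padded_batch)
    print(batch_mask)
    print(masked_mean(padded_batch, batch_mask))
```

Recurrent and Transformer models commonly consume such a mask, or equivalent length information, so that padded positions do not influence the loss or the attention weights.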