A cluster-based data-centric model for network-aware task scheduling in distributed systems
Fiore, Ugo; Castiglione, Aniello
2014-01-01
Abstract
Big Data processing architectures are now widely recognized as one of the most significant innovations in computing in the last decade. Their enormous potential for collecting and processing huge volumes of data scattered throughout the Internet is opening the door to a new generation of fully distributed applications that, by leveraging the large amount of resources available on the network, will be able to cope with very complex problems and achieve unprecedented performance. However, the Internet is known to have severe scalability limitations in moving very large quantities of data. These limitations introduce the challenge of making efficient use of the computing and storage resources available on the network, so that data-intensive applications can be executed effectively in such a complex distributed environment. This implies resource scheduling decisions that drive the execution of tasks towards the data, taking network load and capacity into consideration to maximize data access performance and to reduce queueing and processing delays as much as possible. Accordingly, this work presents a data-centric meta-scheduling scheme for fully distributed Big Data processing architectures. The scheme is based on clustering techniques whose goal is to aggregate tasks around storage repositories, and it is driven by a new concept of "gravitational" attraction between the tasks and their data of interest. It benefits from heuristic criteria based on network awareness and advance resource reservation in order to suppress long delays in data transfer operations, resulting in an optimized use of data storage and runtime resources at the expense of a limited (polynomial) computational complexity. © 2013 Springer Science+Business Media New York.
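The abstract does not give the attraction formula or the clustering procedure, so the following Python fragment is only an illustrative sketch of the general idea of pulling tasks towards the repositories that hold their data. Everything in it is an assumption made for illustration: the names `Site`, `Task`, `attraction` and `schedule`, the use of data size as "mass", the use of per-gigabyte transfer time as "distance", and the greedy placement order. It is not the scheme described in the paper.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    free_slots: int  # runtime slots still available (hypothetical resource model)


@dataclass
class Task:
    name: str
    data_site: str   # repository that stores the task's input data
    data_gb: float   # size of that input data, in gigabytes


def attraction(task, site, bandwidth_gbps):
    """Hypothetical 'gravitational' pull of a site on a task.

    Data size plays the role of mass; the time needed to move one
    gigabyte from the data's repository to the site plays the role
    of distance. Running where the data lives means zero distance.
    """
    if site.name == task.data_site:
        return float("inf")
    bw = bandwidth_gbps.get((task.data_site, site.name), 0.0)
    if bw <= 0:
        return 0.0  # unreachable site: no pull at all
    seconds_per_gb = 8.0 / bw
    return task.data_gb / seconds_per_gb ** 2


def schedule(tasks, sites, bandwidth_gbps):
    """Greedy placement: heaviest data first, strongest pull wins.

    Runs in O(len(tasks) * len(sites)) time after sorting, i.e. polynomial.
    """
    placement = {}
    for task in sorted(tasks, key=lambda t: t.data_gb, reverse=True):
        candidates = [s for s in sites if s.free_slots > 0]
        if not candidates:
            break  # no capacity left; remaining tasks stay unplaced
        best = max(candidates, key=lambda s: attraction(task, s, bandwidth_gbps))
        best.free_slots -= 1  # reserve the slot for this task
        placement[task.name] = best.name
    return placement
```

With two single-slot sites `A` and `B`, a 10 Gbit/s link from `A` to `B`, and two tasks whose data both sit at `A`, this sketch places the task with the larger input at `A` and sends the smaller one to `B`, since moving the smaller data set costs less.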