Distributed scientific applications are generally time-critical. They consist of a set of tasks that require different types of resources for execution, e.g., available CPU for computation and available bandwidth for data communication. All tasks should be executed by a given deadline. We can choose appropriate systems to run the computational tasks and a suitable bandwidth to transport the data. However, this task-mapping approach may lead to different execution times and uneven use of resources. Accordingly, minimizing the execution time of these applications (which we refer to as scheduling length or makespan) and maximizing resource utilization to fully exploit available systems requires joint scheduling for both computational and optical network infrastructure.1,2
To meet these goals, the joint resources scheduling system (JRSS) employs three components: Resources Manager, Task Scheduler, and Execution Manager. Resources Manager identifies and manages information regarding computational and optical network infrastructure. Task Scheduler enables optimized task assignment to available hardware and bandwidth in order to reduce each task's completion time and increase the infrastructure's utilization ratio. Its core is a scheduling algorithm based on the given task and available resources. Execution Manager allows applications to run on geographically distributed sites according to the resulting schedule. It also enables the creation and release of optical network connections according to data exchange requests.
Figure 1 shows the software architecture of the JRSS. First, the Resources Manager acquires information about computational and optical network resources and maintains it in a database. Next, an application is submitted from an end host to the system via a user's portal. Task Scheduler receives the request and runs a specified algorithm to produce an optimized schedule. Then the Execution Manager instructs the computational resources to carry out tasks, and signals generalized multiprotocol label switching (GMPLS) to set up or release a network connection according to the timing of the schedule.
Figure 1. In the JRSS architecture, information about available hardware and bandwidth is used to manage user requests and schedule tasks efficiently.
In this article, we focus on the algorithm used to schedule tasks. Each distributed scientific application may be logically partitioned into multiple operations for execution on different systems. Moreover, communication may be required between operations (e.g., to exchange data, intermediate results, or other information). An application is described by a directed acyclic graph (DAG) in which each node represents an executable task, and each directed edge represents data communication between two adjacent nodes.
DAG scheduling comprises node and edge scheduling. The former assigns DAG nodes to computational resources, while the latter maps the edges of the DAG onto optical network connections. A start and finish time is set for every node and edge after the assignments are made. The objective of the process is to complete each task as quickly as possible and use resources as fully as possible. Scheduling schemes are essential for finding the best allocation of nodes and edges to meet this objective. Two are typically used: overlay scheduling and integrated scheduling.
In the overlay scheduling scheme, optical network and computational resources are managed in different layers that are completely separate. The nodes in the DAG are assigned only in the computational resources layer. A widely used approach is list scheduling heuristics.3 In this algorithm, the nodes are first sorted into a list according to a priority scheme and precedence constraints. Typically, the top priority is the node's bottom level, which is the length of the longest path leaving the node. Then, for each node in the list, we find the system that allows the earliest finish time and allocate it to the node.
In this scheme, no knowledge of network availability is provided to the computational resources layer. Systems invoke the optical network via a user-to-network interface (UNI)4 when data communication is needed. Then the UNI signals the GMPLS control plane to initiate a network connection. If sufficient bandwidth is available for the request, the connection is provisioned. If not, the communication task must wait until the requested network resources are released.
The advantage of this scheme is that optical network information is not disclosed to users. It is the most practical for near-term deployment because it is appropriate for the current telecommunications infrastructure. However, the simplicity of the overlay method comes at the expense of increased scheduling length and inefficient network use due to the complete separation of state and control information between the two layers.
In the integrated scheduling scheme, the optical network and computational resources are considered together. The nodes in the DAG are scheduled based not only on the available systems but also the available bandwidth and any conflicts in the optical network. In this approach, the nodes are ordered into a list like that used in overlay scheduling. Each node Vj on the list can be processed only after all its direct predecessors have finished and all the data has been transferred to it.
For example, suppose node Vi is one of the direct predecessors of Vjand Vi is assigned to computational resources Ri. We route the network connection from Ri to other available computational resources by an adaptive routing policy and calculate the earliest finish time for the edge Eij between Vi and Vj. Based on the result, we find the computational resource Rj that can achieve the earliest finish time of node Vj and schedule that node to the resource. The network connection between Ri and Rj that provides the earliest finish time for the edge Eij will be used for data communication. We have developed several integrated scheduling algorithms.5 Compared with the overlay method, integrated scheduling performs better and uses resources more efficiently because the state information and control information about the two types of resources are managed together.
To evaluate the performance of the two schemes, we have developed a simulation. Twenty DAGs are randomly generated under different communication-computation ratios (CCRs) and numbers of nodes. CCR is defined as the sum of the communication cost divided by the sum of the task execution cost. The average schedule length and resource utilization ratio are compared for the integrated and overlay schemes. From the experimental results given in Table 1, we can see that the integrated method can achieve shorter average schedule length, higher computational resource utilization, and lower optical network resource occupation than overlay scheduling. We also observe that the benefits of using the integrated method increase as the CCR and DAG size increase. We conclude that integrated scheduling is appropriate for data-intensive applications. The improvement may be more significant when there is higher communication contention caused by a lack of network resources.
We are setting up an optical network integrated computing environment by deploying a testbed at Shanghai Jiao Tong University (SJTU). The optical network, which interconnects several supercomputers and Windows clusters at SJTU, consists of three GMPLS-controlled automatically switched optical network nodes. Based on a joint resources scheduling algorithm that has been created, we will develop a scheduler and evaluate its performance in our testbed.
Table 1. Experimental results comparing integrated and overlay scheduling
This research is sponsored by the National Natural Science Foundation of China under contract 60672016 and by the China 863 program.
Wei Guo, Weisheng Hu
State Key Lab on Fiber-Optic Local Area Networks and Advanced Optical Communication Systems