Task graphs for benchmarking schedulers
Description
Workflow Task Graph Dataset
This dataset contains three sets of task graphs representing different types of task workflows:
- Elementary - contains trivial graph shapes, such as tasks with no dependencies or simple fork-join graphs. This set should test how the scheduler heuristics react to basic graph scenarios that frequently form parts of larger workflows.
- IRW - contains graphs inspired by real-world workflows, such as machine learning cross-validation or map-reduce.
- Pegasus - is derived from graphs created by the Pegasus Synthetic Workflow Generators (https://github.com/pegasus-isi/WorkflowGenerator).
All of the provided task graphs are generated and compatible with ESTEE (https://github.com/It4innovations/estee), which allows simulating their execution on a distributed system under various scheduling heuristics and environment conditions.
Data Format
Task graphs are stored in {elementary, irw, pegasus}.zip files that contain JSON representations of the respective task graphs with the following fields:
- `graph_name` - Task graph name
- `graph_id` - Unique task graph identifier
- `graph` - Task graph representation - list of tasks where each task is represented as a dictionary with the following keys:
- `d`: Actual task duration in seconds (float value)
- `e_d`: User estimated task duration in seconds (float value)
- `cpus`: Task CPU core requirements (integer value)
- `outputs`: List of task outputs (list of integers indicating sizes of task outputs in MiB)
- `inputs`: List of task inputs; each input is a pair `[task_id, output_index]`. The output index is zero-based.
For example, this task graph:
[{'d': 200, 'e_d': 180, 'cpus': 1, 'outputs': [100], 'inputs': []},
{'d': 50, 'e_d': 60, 'cpus': 2, 'outputs': [], 'inputs': [[0, 0]]}]
contains two tasks. The first requires no input and a single CPU core, has an estimated duration of 180 s and an actual duration of 200 s, and produces a single output of 100 MiB. The second takes task 0's 0-th output as its input, requires 2 CPU cores, has an estimated duration of 60 s and an actual duration of 50 s, and produces no output.
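To illustrate how the `inputs` pairs reference producer tasks, here is a small sketch (using the two-task example above) with a hypothetical helper that extracts the dependency edges and their transfer sizes:

```python
# The two-task example graph from the text, as Python dictionaries.
graph = [
    {"d": 200, "e_d": 180, "cpus": 1, "outputs": [100], "inputs": []},
    {"d": 50, "e_d": 60, "cpus": 2, "outputs": [], "inputs": [[0, 0]]},
]

def dependency_edges(graph):
    """Yield (producer_id, consumer_id, size_mib) edges.

    Each [task_id, output_index] input pair points at one output of the
    producer task; the transferred size is that output's size in MiB.
    """
    for consumer_id, task in enumerate(graph):
        for task_id, output_index in task["inputs"]:
            yield task_id, consumer_id, graph[task_id]["outputs"][output_index]

edges = list(dependency_edges(graph))
print(edges)  # [(0, 1, 100)]: task 1 consumes task 0's first (100 MiB) output
```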
Parsing the data
In Python, to load the elementary task graph set, run the following snippet:
import pandas as pd
graphs = pd.read_json("./elementary.zip")
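Each row of the resulting DataFrame then carries one task graph, with the `graph` column holding the list of task dictionaries described above. A hedged sketch of summarizing one such list (shown on the inline two-task example; with the dataset loaded you would pass `graphs.loc[i, "graph"]` instead):

```python
def summarize(tasks):
    """Return basic statistics for a task graph (a list of task dicts)."""
    return {
        "n_tasks": len(tasks),
        "total_duration_s": sum(t["d"] for t in tasks),
        "max_cpus": max(t["cpus"] for t in tasks),
        "total_output_mib": sum(sum(t["outputs"]) for t in tasks),
    }

# The two-task example graph from the text.
example = [
    {"d": 200, "e_d": 180, "cpus": 1, "outputs": [100], "inputs": []},
    {"d": 50, "e_d": 60, "cpus": 2, "outputs": [], "inputs": [[0, 0]]},
]
print(summarize(example))
# {'n_tasks': 2, 'total_duration_s': 250, 'max_cpus': 2, 'total_output_mib': 100}
```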
If you have Estee installed, you can use its `json_deserialize` function to parse the JSON-encoded graphs into an Estee `TaskGraph` data structure:
from estee.serialization.dask_json import json_deserialize
graph_json = graphs.loc[0, "graph"]
graph = json_deserialize(graph_json)