Published April 5, 2019 | Version 1.0
Dataset Open

Task graphs for benchmarking schedulers

  • IT4Innovations

Description

Workflow Task Graph Dataset

This dataset contains three sets of task graphs representing different types of task workflows:

  • Elementary - contains trivial graph shapes, such as tasks with no dependencies or simple fork-join graphs. This set tests how scheduler heuristics react to basic graph scenarios that frequently form parts of larger workflows.
  • IRW - is inspired by real-world workflows, such as machine learning cross-validation or map-reduce.
  • Pegasus - is derived from graphs created by the Pegasus Synthetic Workflow Generators (https://github.com/pegasus-isi/WorkflowGenerator).

All of the provided task graphs were generated to be compatible with ESTEE (https://github.com/It4innovations/estee), which can simulate their execution on a distributed system under various scheduling heuristics and environment conditions.

Data Format

Task graphs are stored in {elementary, irw, pegasus}.zip files that contain a JSON representation of the respective task graphs with the following fields:

  • `graph_name` - Task graph name
  • `graph_id` - Unique task graph identifier
  • `graph` - Task graph representation - a list of tasks, where each task is a dictionary with the following keys:
      • `d`: Actual task duration in seconds (float value)
      • `e_d`: User-estimated task duration in seconds (float value)
      • `cpus`: Task CPU core requirement (integer value)
      • `outputs`: List of task outputs (list of integers giving the sizes of the task's outputs in MiB)
      • `inputs`: List of task inputs, each given as a pair `[task_id, output_index]`. The output index is zero-based.

For example, this task graph:

[{"d": 200, "e_d": 180, "cpus": 1, "outputs": [100], "inputs": []},
 {"d": 50, "e_d": 60, "cpus": 2, "outputs": [], "inputs": [[0, 0]]}]

contains two tasks. The first task requires no inputs and a single CPU core, has an estimated duration of 180 s and an actual duration of 200 s, and produces a single output of 100 MiB. The second task takes the first task's 0-th output as its input, requires 2 CPU cores, has an estimated duration of 60 s and an actual duration of 50 s, and produces no outputs.
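
As a sketch of how these fields fit together, the following plain-Python snippet walks the example task list and resolves each task's inputs back to the producing task's output sizes (the `tasks` variable simply holds the list shown above; the printed summary is illustrative only):

tasks = [{"d": 200, "e_d": 180, "cpus": 1, "outputs": [100], "inputs": []},
         {"d": 50, "e_d": 60, "cpus": 2, "outputs": [], "inputs": [[0, 0]]}]

for task_id, task in enumerate(tasks):
    # resolve each [task_id, output_index] pair to the size (in MiB) of the referenced output
    input_sizes = [tasks[src]["outputs"][out] for src, out in task["inputs"]]
    print(f"task {task_id}: cpus={task['cpus']}, estimated={task['e_d']}s, "
          f"actual={task['d']}s, total input={sum(input_sizes)} MiB")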

 

Parsing the data

In Python, to load the elementary task graph set, run the following snippet:

import pandas as pd

# pandas transparently decompresses the .zip archive and parses the JSON inside
graphs = pd.read_json("./elementary.zip")
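
The resulting DataFrame should contain one row per task graph, with columns matching the fields described above; a quick sanity check (column names assumed from the description) might look like this:

# one row per task graph; columns are expected to include graph_name, graph_id and graph
print(graphs.columns.tolist())
first = graphs.iloc[0]
print(first["graph_name"], first["graph_id"])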

 

If you have Estee installed, you can use its `json_deserialize` function to parse the JSON-encoded graphs into the Estee `TaskGraph` data structure.

 

from estee.serialization.dask_json import json_deserialize

# take the serialized graph of the first row and turn it into an Estee TaskGraph
graph_json = graphs.loc[0, "graph"]

graph = json_deserialize(graph_json)
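
To convert the whole set at once, the same function can be applied row by row; a minimal sketch, assuming the `graphs` DataFrame loaded above:

# deserialize every graph in the set into Estee TaskGraph objects
task_graphs = [json_deserialize(g) for g in graphs["graph"]]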

 

Files (416.4 kB)

md5:a4f0b95312dfa19102450f6bd201a63c    71.7 kB
md5:0f3cc663a5f07ba0cb91353ddc3a2258    320.2 kB
md5:b06efd6563ed74b422425206e6f6e103    21.9 kB
md5:091a334de62f2cba87ebda0384ab5108    2.6 kB