Pre-training and fine-tuning dataset for Transformers, consisting of basic blocks and their execution times (average, minimum, and maximum), along with the execution context of these blocks, for several Arm Cortex processors: M7, M4, A53, and A72.
Description
We are making public the dataset used for training CAWET, a tool for estimating the Worst-Case Execution Time (WCET) of basic blocks using the Transformer XL model. CAWET leverages the Transformer architecture for accurate WCET predictions, and its training involves two main phases: self-supervised pre-training and fine-tuning.
CAWET is first pre-trained on a substantial corpus of basic blocks so that the Transformer can grasp the intricacies of the target assembly language. For this, we used CodeNet \cite{codenet}, a comprehensive collection of publicly submitted solutions to competitive programming challenges, comprising roughly 900,000 C programs. These programs were cross-compiled to the target architecture and then disassembled with the GNU binary utilities (objdump). The textual output of objdump, after a series of basic parsing operations (e.g., address extraction, separation into basic blocks), forms the basis of an extensive pre-training dataset. We used this dataset to build a vocabulary model with SentencePiece \cite{sentencepiece}. Once trained, the SentencePiece model can tokenize any binary program written in the target instruction set.
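As an illustration of how such a pre-training corpus can be produced and consumed, the sketch below disassembles a cross-compiled binary with objdump, strips the addresses and raw instruction encodings, and trains a SentencePiece vocabulary on the resulting text. It is a minimal sketch under stated assumptions: the toolchain name (arm-none-eabi-objdump), file names, vocabulary size, and parsing details are illustrative, not the exact pipeline used to build the released dataset.

```python
# Minimal sketch: toolchain, file names, and vocab_size below are illustrative assumptions.
import re
import subprocess

import sentencepiece as spm


def disassemble(elf_path: str) -> str:
    """Disassemble a cross-compiled ELF with GNU objdump (bare-metal Arm toolchain assumed)."""
    result = subprocess.run(
        ["arm-none-eabi-objdump", "-d", elf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def strip_addresses(objdump_text: str) -> list[str]:
    """Keep only mnemonic + operands from objdump instruction lines,
    e.g. ' 80001f4:  f04f 0300  mov.w  r3, #0' -> 'mov.w r3, #0'."""
    instructions = []
    for line in objdump_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and re.match(r"\s*[0-9a-f]+:", fields[0]):
            instructions.append(" ".join(" ".join(fields[2:]).split()))
    return instructions


# Build a plain-text corpus (one instruction per line) and train SentencePiece on it.
with open("pretraining_corpus.txt", "w") as corpus:
    corpus.write("\n".join(strip_addresses(disassemble("program.elf"))))

spm.SentencePieceTrainer.train(
    input="pretraining_corpus.txt",
    model_prefix="cortex_bb",   # produces cortex_bb.model / cortex_bb.vocab
    vocab_size=4000,            # illustrative; not the value used for the released model
    model_type="bpe",
)

# Tokenize an arbitrary basic block with the trained model.
sp = spm.SentencePieceProcessor(model_file="cortex_bb.model")
print(sp.encode("ldr r3, [r7, #4] adds r3, #1 str r3, [r7, #4]", out_type=str))
```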
The fine-tuning phase of CAWET adapts the model to basic blocks together with their contextual information. Here, we used a varied and openly accessible collection of programs: The Algorithms (available at https://github.com/TheAlgorithms/C), MiBench \cite{mibench}, and PolyBench \cite{polybench}.
The provided zip file encompasses the following directories:
Fine_Tuning: This includes four distinct files, one per target processor: Cortex_M4, Cortex_M7, Cortex_A53, and Cortex_A72. Each file contains the basic block under analysis (bbUA), the 10 basic blocks executed immediately before it, and timing information for the bbUA (mean, min, max, normalization, etc.); see the sketch after this list for one way these fields can be assembled into a model input.
Pre_Training: This comprises two extensive files, dataset_CortexA and dataset_CortexM, used to pre-train the Transformers on the Masked Language Modeling (MLM) task. It also includes the SentencePiece model and the code needed for accurate tokenization.
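To make the structure of a fine-tuning example concrete, the sketch below shows one plausible way to turn a record (the bbUA, its 10 predecessor blocks, and its timing statistics) into a tokenized input and a regression target. It is only a sketch: the field names (bbUA, prev_1..prev_10, mean, max), the [SEP] marker, the block ordering, and the normalization are hypothetical assumptions; consult the files themselves and the CAWET paper for the actual schema.

```python
# Hypothetical sketch: field names, ordering, and normalization are illustrative
# assumptions, not the schema of the released Fine_Tuning files.
import sentencepiece as spm

# SentencePiece model shipped in the Pre_Training directory (path assumed).
sp = spm.SentencePieceProcessor(model_file="cortex_bb.model")


def build_example(record: dict, context_len: int = 10):
    """Concatenate the previously executed basic blocks and the basic block
    under analysis (bbUA) into one token sequence, assumed oldest block first."""
    context = " [SEP] ".join(record[f"prev_{i}"] for i in range(context_len, 0, -1))
    text = f"{context} [SEP] {record['bbUA']}"
    tokens = sp.encode(text, out_type=int)
    # Illustrative regression target: the maximum observed execution time
    # (a WCET proxy) normalized by the mean of the same block.
    target = record["max"] / record["mean"]
    return tokens, target


example = {
    **{f"prev_{i}": "nop" for i in range(1, 11)},
    "bbUA": "ldr r3, [r7, #4] adds r3, #1 str r3, [r7, #4]",
    "mean": 12.0, "max": 19.0,
}
print(build_example(example))
```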
For additional information, please refer to the CAWET paper or contact us at ea_amalou@esi.dz
Citation:
@inproceedings{amalou2023cawet,
title={CAWET: Context-Aware Worst-Case Execution Time Estimation Using Transformers},
author={Amalou, Abderaouf N and Fromont, Elisa and Puaut, Isabelle},
booktitle={35th Euromicro Conference on Real-Time Systems (ECRTS 2023)},
year={2023},
organization={Schloss Dagstuhl-Leibniz-Zentrum f{\"u}r Informatik}
}
Bibliography:
codenet
@article{codenet2021,
title={CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks},
author={Puri, Ruchir and Kung, David S and Janssen, Geert and Zhang, Wei and Domeniconi, Giacomo and Zolotov, Vladimir and Dolby, Julian and Chen, Jie and Choudhury, Mihir and Decker, Lindsey and others},
journal={arXiv preprint arXiv:2105.12655},
year={2021}
}
sentencepiece
@article{sentencepiece2018,
title={Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing},
author={Kudo, Taku and Richardson, John},
journal={arXiv preprint arXiv:1808.06226},
year={2018}
}
polybench
@inproceedings{polybench2014,
title={Understanding polybench/c 3.2 kernels},
author={Yuki, Tomofumi},
booktitle={International workshop on polyhedral compilation techniques (IMPACT)},
pages={1--5},
year={2014}
}
mibench
@inproceedings{mibench,
title={MiBench: A free, commercially representative embedded benchmark suite},
author={Guthaus, Matthew R and Ringenberg, Jeffrey S and Ernst, Dan and Austin, Todd M and Mudge, Trevor and Brown, Richard B},
booktitle={4th IEEE international workshop on workload characterization},
year={2001}
}
Files (583.5 MB)

Name | Size
---|---
Dataset zip archive (md5:567befea24362cd42ca142dae984cffc) | 583.5 MB
Additional details
Related works
- Is part of
- Conference paper: 10.4230/LIPIcs.ECRTS.2023.7 (DOI)