Pre-training and fine-tuning dataset for Transformers, consisting of basic blocks and their execution times (average, minimum, and maximum), along with the execution context of these blocks, for several Arm Cortex processors: M7, M4, A53, and A72.
Description
We are making public the dataset used for training CAWET, a tool for estimating the Worst-Case Execution Time (WCET) of basic blocks using the Transformer XL model. CAWET leverages the Transformer architecture for accurate WCET predictions, and its training involves two main phases: self-supervised pre-training and fine-tuning.
CAWET is first pre-trained on a substantial corpus of basic blocks so that the Transformer can grasp the intricacies of the target assembly language. For this, we used CodeNet \cite{codenet}, a comprehensive collection of publicly submitted solutions to competitive programming challenges, comprising roughly 900,000 C programs. These programs were cross-compiled to the target architecture and then disassembled with the GNU binary utilities (objdump). The textual output of objdump, after a series of basic parsing operations (e.g., address extraction, separation into basic blocks), forms the basis of an extensive pre-training dataset. We used this dataset to build a vocabulary model with SentencePiece \cite{sentencepiece}. Once trained, the SentencePiece model can tokenize any binary program written in the target instruction set.
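As an illustration of how such a pre-training corpus can be produced and consumed, the sketch below disassembles a cross-compiled binary with objdump, strips the addresses and raw instruction encodings, and trains a SentencePiece vocabulary on the resulting text. It is a minimal sketch under stated assumptions: the toolchain name (arm-none-eabi-objdump), file names, vocabulary size, and parsing details are illustrative, not the exact pipeline used to build the released dataset.

```python
# Minimal sketch: toolchain, file names, and vocab_size below are illustrative assumptions.
import re
import subprocess

import sentencepiece as spm


def disassemble(elf_path: str) -> str:
    """Disassemble a cross-compiled ELF with GNU objdump (bare-metal Arm toolchain assumed)."""
    result = subprocess.run(
        ["arm-none-eabi-objdump", "-d", elf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def strip_addresses(objdump_text: str) -> list[str]:
    """Keep only mnemonic + operands from objdump instruction lines,
    e.g. ' 80001f4:  f04f 0300  mov.w  r3, #0' -> 'mov.w r3, #0'."""
    instructions = []
    for line in objdump_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and re.match(r"\s*[0-9a-f]+:", fields[0]):
            instructions.append(" ".join(" ".join(fields[2:]).split()))
    return instructions


# Build a plain-text corpus (one instruction per line) and train SentencePiece on it.
with open("pretraining_corpus.txt", "w") as corpus:
    corpus.write("\n".join(strip_addresses(disassemble("program.elf"))))

spm.SentencePieceTrainer.train(
    input="pretraining_corpus.txt",
    model_prefix="cortex_bb",   # produces cortex_bb.model / cortex_bb.vocab
    vocab_size=4000,            # illustrative; not the value used for the released model
    model_type="bpe",
)

# Tokenize an arbitrary basic block with the trained model.
sp = spm.SentencePieceProcessor(model_file="cortex_bb.model")
print(sp.encode("ldr r3, [r7, #4] adds r3, #1 str r3, [r7, #4]", out_type=str))
```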
The fine-tuning phase of CAWET adapts the model to basic blocks together with their contextual information. Here, we used a varied and openly accessible collection of programs: The Algorithms (available at https://github.com/TheAlgorithms/C), MiBench \cite{mibench}, and PolyBench \cite{polybench}.
The provided zip file encompasses the following directories:
Fine_Tuning: This includes four distinct files, one per target processor: Cortex_M4, Cortex_M7, Cortex_A53, and Cortex_A72. Each file contains the basic block under analysis (bbUA), the 10 basic blocks executed immediately before it, and timing information for the bbUA (mean, min, max, normalization, etc.); see the sketch after this list for one way these fields can be assembled into a model input.
Pre_Training: This comprises two extensive files, dataset_CortexA and dataset_CortexM, used to pre-train the Transformers on the Masked Language Modeling (MLM) task. It also includes the SentencePiece model and the code needed for accurate tokenization.
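To make the structure of a fine-tuning example concrete, the sketch below shows one plausible way to turn a record (the bbUA, its 10 predecessor blocks, and its timing statistics) into a tokenized input and a regression target. It is only a sketch: the field names (bbUA, prev_1..prev_10, mean, max), the [SEP] marker, the block ordering, and the normalization are hypothetical assumptions; consult the files themselves and the CAWET paper for the actual schema.

```python
# Hypothetical sketch: field names, ordering, and normalization are illustrative
# assumptions, not the schema of the released Fine_Tuning files.
import sentencepiece as spm

# SentencePiece model shipped in the Pre_Training directory (path assumed).
sp = spm.SentencePieceProcessor(model_file="cortex_bb.model")


def build_example(record: dict, context_len: int = 10):
    """Concatenate the previously executed basic blocks and the basic block
    under analysis (bbUA) into one token sequence, assumed oldest block first."""
    context = " [SEP] ".join(record[f"prev_{i}"] for i in range(context_len, 0, -1))
    text = f"{context} [SEP] {record['bbUA']}"
    tokens = sp.encode(text, out_type=int)
    # Illustrative regression target: the maximum observed execution time
    # (a WCET proxy) normalized by the mean of the same block.
    target = record["max"] / record["mean"]
    return tokens, target


example = {
    **{f"prev_{i}": "nop" for i in range(1, 11)},
    "bbUA": "ldr r3, [r7, #4] adds r3, #1 str r3, [r7, #4]",
    "mean": 12.0, "max": 19.0,
}
print(build_example(example))
```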
For additional information, please refer to the CAWET paper or contact us at ea_amalou@esi.dz
Citation:
@inproceedings{amalou2023cawet,
title={CAWET: Context-Aware Worst-Case Execution Time Estimation Using Transformers},
author={Amalou, Abderaouf N and Fromont, Elisa and Puaut, Isabelle},
booktitle={35th Euromicro Conference on Real-Time Systems (ECRTS 2023)},
year={2023},
organization={Schloss Dagstuhl-Leibniz-Zentrum f{\"u}r Informatik}
}
Bibliography:
codenet
@article{codenet2021,
title={CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks},
author={Puri, Ruchir and Kung, David S and Janssen, Geert and Zhang, Wei and Domeniconi, Giacomo and Zolotov, Vladimir and Dolby, Julian and Chen, Jie and Choudhury, Mihir and Decker, Lindsey and others},
journal={arXiv preprint arXiv:2105.12655},
year={2021}
}
sentencepiece
@article{sentencepiece2018,
title={Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing},
author={Kudo, Taku and Richardson, John},
journal={arXiv preprint arXiv:1808.06226},
year={2018}
}
polybench
@inproceedings{polybench2014,
title={Understanding polybench/c 3.2 kernels},
author={Yuki, Tomofumi},
booktitle={International workshop on polyhedral compilation techniques (IMPACT)},
pages={1--5},
year={2014}
}
mibench
@inproceedings{mibench,
title={MiBench: A free, commercially representative embedded benchmark suite},
author={Guthaus, Matthew R and Ringenberg, Jeffrey S and Ernst, Dan and Austin, Todd M and Mudge, Trevor and Brown, Richard B},
booktitle={4th IEEE international workshop on workload characterization},
year={2001}
}
Files (583.5 MB)

Name | Size
---|---
Dataset zip archive (md5:567befea24362cd42ca142dae984cffc) | 583.5 MB
Additional details
Related works
- Is part of
- Conference paper: 10.4230/LIPIcs.ECRTS.2023.7 (DOI)