Published November 14, 2024 | Version v1
Dataset Open

On-the-Fly Syntax Highlighting: Generalisation and Speed-ups - Replication Package

Authors/Creators

  • 1. University of Zurich

Contributors

Project leader:

Project member:

  • 1. University of Zurich

Description

On-the-Fly Syntax Highlighting: Generalisation and Speed-ups

On-the-fly syntax highlighting involves the rapid association of visual secondary notation with each character of a language derivation. This task has grown in importance due to the widespread use of online software development tools, which frequently display source code and rely heavily on efficient syntax highlighting mechanisms. In this context, resolvers must address three key demands: speed, accuracy, and development costs. Speed constraints are crucial for ensuring usability, providing responsive feedback for end users and minimizing system overhead. At the same time, precise syntax highlighting is essential for improving code comprehension. Achieving such accuracy, however, requires the ability to perform grammatical analysis, even in cases of varying correctness. Additionally, the development costs associated with supporting multiple programming languages pose a significant challenge. The technical challenges in balancing these three aspects explain why developers today experience significantly worse code syntax highlighting online compared to what they have locally. The current state of the art relies on leveraging programming languages' original lexers and parsers to generate syntax highlighting oracles, which are used to train base Recurrent Neural Network models. However, questions of generalisation remain. This paper addresses this gap by extending the validation dataset of the previous work to six mainstream programming languages, thus providing a more thorough evaluation. In response to limitations related to evaluation performance and training costs, this work introduces a novel Convolutional Neural Network (CNN) based model, specifically designed to mitigate these issues. Furthermore, this work addresses a previously unexplored area: the performance gains obtained when deploying such models on GPUs.
The evaluation demonstrates that the new CNN-based implementation is significantly faster than existing state-of-the-art methods, while still delivering the same near-perfect accuracy.
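
The idea of a syntax highlighting oracle derived from a language's own lexer can be illustrated with a minimal sketch. It assumes Python as the target language and uses only the standard library `tokenize` module; the class names (`keyword`, `identifier`, and so on) and the function `highlight_oracle` are hypothetical and are not taken from the replication package. The sketch is purely lexical, whereas the oracles described above also draw on the languages' parsers.

```python
import io
import keyword
import tokenize

# Hypothetical mapping from Python token types to coarse highlighting classes.
CLASS_BY_TYPE = {
    tokenize.NAME: "identifier",
    tokenize.NUMBER: "number",
    tokenize.STRING: "string",
    tokenize.COMMENT: "comment",
    tokenize.OP: "operator",
}


def highlight_oracle(source: str) -> list[tuple[str, str]]:
    """Pair every character of `source` with a highlighting class."""
    lines = source.splitlines(keepends=True)
    # offsets[r] is the index in `source` of the first character of row r + 1.
    offsets = [0]
    for line in lines:
        offsets.append(offsets[-1] + len(line))

    # Characters not covered by a mapped token (whitespace, newlines) stay "plain".
    labels = ["plain"] * len(source)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        cls = CLASS_BY_TYPE.get(tok.type)
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            cls = "keyword"
        if cls is None:
            continue  # NEWLINE, INDENT, DEDENT, ENDMARKER and similar
        start = offsets[tok.start[0] - 1] + tok.start[1]
        end = offsets[tok.end[0] - 1] + tok.end[1]
        for i in range(start, end):
            labels[i] = cls
    return list(zip(source, labels))
```

Calling `highlight_oracle("x = 1  # hi\n")` yields one `(character, class)` pair per input character. Pairs of this kind are the sort of per-character target a model could be trained to reproduce without running the lexer at display time.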

Files

dataset.zip

Files (14.1 GB)

MD5 checksum                            Size
md5:3255034ecc91adf982804d5c9cdd845a    11.7 GB
md5:ae7c256ed9429592bb4559024b4250e0    1.3 GB
md5:e44dc2431b71ce056f7bd4a828f6b2b8    1.1 GB
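
Given the size of the archives, it may be worth checking a download against the MD5 checksums listed above before unpacking it. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_md5` are illustrative, and the pairing of a local file with its checksum is left to the reader, since the listing above does not show which file name belongs to which checksum.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Stream the file so multi-gigabyte archives never sit in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's digest with a checksum such as 'md5:3255...'."""
    # Accept the 'md5:' prefix used in the listing (str.removeprefix needs Python 3.9+).
    expected = expected.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

For example, `verify_md5("dataset.zip", "md5:<checksum from the listing>")` returns `True` only when the local file matches the checksum supplied; the local path here is an assumption about where the archive was saved.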

Additional details

Funding

Swiss National Science Foundation
Melise - Machine Learning Assisted Software Development 204632

Dates

Accepted
2024-11-11