Conference paper Open Access
Yitagesu, Sofonias; Zhang, Xiaowang; Feng, Zhiyong; Li, Xiaohong; Xing, Zhenchang
Automatic Part-of-Speech Tagging for Security Vulnerability Descriptions

Affiliations: Tianjin University, China (Yitagesu, Zhang, Feng, Li); Australian National University, Australia (Xing)
Published: 2021-03-23
Venue: The 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR 2021)
Keywords: Fine-Tuning, Part-of-Speech tagging, Unsupervised word embedding, Security vulnerability descriptions
DOI: https://doi.org/10.5281/zenodo.4632063
URL: https://zenodo.org/record/4632063
License: https://creativecommons.org/licenses/by/4.0/legalcode
Language: English

Abstract: In this paper, we study the problem of part-of-speech (POS) tagging for security vulnerability descriptions (SVD). In contrast to newswire articles, an SVD often contains a high-level natural language description composed of mixed language studded with code, domain-specific jargon, vague language, and abbreviations. Moreover, training data dedicated to security vulnerability research is not widely available. Existing neural network-based POS tagging has often relied either on manually annotated training data or on applying natural language processing (NLP) techniques, and each has a significant drawback. The former is extremely time-consuming and requires labor-intensive feature engineering and expertise. The latter is inadequate to identify linguistically-informed words specific to the SVD domain. We propose an automatic approach to assign POS tags to tokens in SVD. Our approach uses a character-level representation to automatically extract orthographic features and unsupervised word embeddings to capture meaningful syntactic and semantic regularities from SVD. The character-level representations are then concatenated with the word embeddings as a combined feature, which is learned and used to predict the POS tags. To deal with the poor availability of annotated security vulnerability data, we implement a fine-tuning approach. We provide public access to a POS-annotated corpus of ∼8M tokens, which serves as a training dataset in this domain. Our evaluation results show a significant improvement in accuracy (17.72%-28.22%) of POS tagging in SVD over the current approaches.
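The feature combination described in the abstract (a character-level representation concatenated with a word embedding for each token) might be pictured with the toy sketch below. It is only an illustration under stated assumptions: the vector sizes, the random embedding tables, the max-pooling over characters, and the example tokens are all hypothetical stand-ins, not the authors' architecture, data, or trained parameters.

```python
import random

# Hypothetical sizes; none of these values come from the paper.
CHAR_DIM = 4
WORD_DIM = 6
rng = random.Random(0)

char_table = {}  # character -> embedding (random stand-in for learned weights)
word_table = {}  # lowercased word -> embedding (stand-in for pretrained vectors)


def random_vector(dim):
    """Return a random vector; a real model would learn or load these."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def char_representation(token):
    """Max-pool character embeddings over the token.

    This stands in for a learned character encoder that picks up
    orthographic cues such as capitalization, digits, and punctuation.
    """
    vectors = [char_table.setdefault(ch, random_vector(CHAR_DIM)) for ch in token]
    return [max(column) for column in zip(*vectors)]


def word_representation(token):
    """Look up (or create) the embedding for the lowercased token."""
    return word_table.setdefault(token.lower(), random_vector(WORD_DIM))


def combined_feature(token):
    """Concatenate the character-level and word-level vectors."""
    return char_representation(token) + word_representation(token)


# Made-up example tokens in the style of a vulnerability description.
tokens = ["Buffer", "overflow", "in", "libpng", "1.6.37"]
features = [combined_feature(t) for t in tokens]
```

Each entry of `features` has `CHAR_DIM + WORD_DIM` values. In a full tagger, a sequence model would consume these combined vectors and emit one POS tag per token; that part is omitted here because the abstract does not specify it.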
Metric | All versions | This version |
---|---|---|
Views | 320 | 320 |
Downloads | 259 | 259 |
Data volume | 356.8 MB | 356.8 MB |
Unique views | 294 | 294 |
Unique downloads | 225 | 225 |