Conference paper Open Access

Automatic Part-of-Speech Tagging for Security Vulnerability Descriptions

Yitagesu, Sofonias; Zhang, Xiaowang; Feng, Zhiyong; Li, Xiaohong; Xing, Zhenchang

Citation Style Language JSON Export

  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.4632063", 
  "language": "eng", 
  "title": "Automatic Part-of-Speech Tagging for Security Vulnerability Descriptions", 
  "issued": {
    "date-parts": [
  "abstract": "<p>Abstract&mdash;In this paper, we study the problem of part-of-speech (POS) tagging for security vulnerability descriptions (SVD). In<br>\ncontrast to newswire articles, SVD often contains a high-level natural language description of the text composed of mixed<br>\nlanguage studded with codes, domain-specific jargon, vague language, and abbreviations. Moreover, training data dedicated<br>\nto security vulnerability research is not widely available. Existing neural network-based POS tagging has often relied on manually<br>\nannotated training data or applying natural language processing (NLP) techniques, suffering from two significant drawbacks. The<br>\nformer is extremely time-consuming and requires labor-intensive feature engineering and expertise. The latter is inadequate to<br>\nidentify linguistically-informed words specific to the SVD domain. In this paper, we propose an automatic approach to assign POS<br>\ntags to tokens in SVD. Our approach uses the character-level representation to automatically extract orthographic features and<br>\nunsupervised word embeddings to capture meaningful syntactic and semantic regularities from SVD. The character level representations are then concatenated with the word embedding as a combined feature, which is then learned and used to predict<br>\nthe POS tagging. To deal with the issue of the poor availability of annotated security vulnerability data, we implement a finetuning approach. Our approach provides public access to a POS annotated corpus of &sim;8M tokens, which serves as a training dataset in this domain. Our evaluation results show a significant improvement in accuracy (17.72%-28.22%) of POS tagging in SVD over the current approaches.</p>", 
  "author": [
      "family": "Yitagesu, Sofonias"
      "family": "Zhang, Xiaowang"
      "family": "Feng, Zhiyong"
      "family": "Li, Xiaohong"
      "family": "Xing, Zhenchang"
  "id": "4632063", 
  "type": "paper-conference", 
  "event": "The 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR 2021)"
                   All versions    This version
Views              322             322
Downloads          261             261
Data volume        359.6 MB        359.6 MB
Unique views       296             296
Unique downloads   226             226

