00000nam##2200000uu#4500 4632063 doi 10.5281/zenodo.4632063 oai:zenodo.org:4632063 Zhang, Xiaowang Tianjin University, China Feng, Zhiyong Tianjin University, China Li, Xiaohong Tianjin University, China Xing, Zhenchang Australian National University, Australia Automatic Part-of-Speech Tagging for Security Vulnerability Descriptions Yitagesu, Sofonias Tianjin University, China info:eu-repo/semantics/openAccess Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode cc-by-4.0 spdx Fine-Tuning, Part-of-Speech tagging, Unsupervised word embedding, Security vulnerability descriptions Abstract: In this paper, we study the problem of part-of-speech (POS) tagging for security vulnerability descriptions (SVD). In contrast to newswire articles, an SVD is often a high-level natural language description written in mixed language studded with code, domain-specific jargon, vague language, and abbreviations. Moreover, training data dedicated to security vulnerability research is not widely available. Existing neural network-based POS tagging has often relied either on manually annotated training data or on applying natural language processing (NLP) techniques, and each option has a significant drawback. The former is extremely time-consuming and requires labor-intensive feature engineering and expertise. The latter is inadequate for identifying linguistically informed words specific to the SVD domain. In this paper, we propose an automatic approach to assign POS tags to tokens in SVD. Our approach uses character-level representations to automatically extract orthographic features and unsupervised word embeddings to capture meaningful syntactic and semantic regularities from SVD. The character-level representations are then concatenated with the word embeddings as a combined feature, which is then learned and used to predict the POS tags.
To deal with the poor availability of annotated security vulnerability data, we implement a fine-tuning approach. Our approach provides public access to a POS-annotated corpus of ~8M tokens, which serves as a training dataset in this domain. Our evaluation results show a significant improvement in accuracy (17.72%-28.22%) of POS tagging in SVD over the current approaches. eng Zenodo 2021-03-23 info:eu-repo/semantics/conferencePaper 20210324002730.0 1377678 md5:60927d17dae7e1a37f3331cb1081f68d https://zenodo.org/records/4632063/files/Automatic Part-of-Speech Tagging for SVD.pdf open 10.5281/zenodo.4632062 isVersionOf doi