Published May 1, 2018 | Version v1
Conference paper Open

Word Embeddings for the Software Engineering Domain

  • 1. Athens University of Economics and Business

Description

The software development process produces vast amounts of textual data expressed in natural language. Software engineering research has adapted outcomes from the natural language processing community to leverage this rich textual information; these include methods and readily available tools, often furnished with pre-trained models. State-of-the-art pre-trained models, however, capture general, common-sense knowledge and are of limited value when handling data specific to a specialized domain. There is currently a lack of domain-specific pre-trained models that would further enhance the processing of natural language artefacts related to software engineering. To this end, we release a word2vec model trained over 15 GB of textual data from Stack Overflow posts. We illustrate how the model disambiguates polysemous words by interpreting them within their software engineering context. In addition, we present examples of fine-grained semantics captured by the model, which suggest that these results transfer to diverse, targeted information retrieval tasks in software engineering and motivate further reuse of the model.
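As a rough illustration of what interpreting a polysemous word within its software engineering context can look like, the sketch below runs a cosine-similarity nearest-neighbour query over a handful of word vectors. Everything in it is an assumption made for illustration: the vectors in `TOY_VECTORS` are invented, three-dimensional, and not taken from the released model, and the helper names `cosine` and `most_similar` are hypothetical. In a domain-specific embedding space, a word such as "python" would be expected to sit closer to other programming languages than to animals, which is the behaviour the toy data imitates.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# They are NOT values from the released word2vec model.
TOY_VECTORS = {
    "python": [0.90, 0.10, 0.00],
    "java":   [0.80, 0.20, 0.10],
    "ruby":   [0.85, 0.15, 0.05],
    "snake":  [0.00, 0.10, 0.90],
    "lizard": [0.10, 0.00, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # skip the query word itself
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


neighbours = most_similar("python", TOY_VECTORS)
for neighbour, score in neighbours:
    print(f"{neighbour}: {score:.3f}")
```

With the toy data, the nearest neighbours of "python" are the other programming languages rather than the animals. The same kind of query against the actual model would depend on the file format in which it is distributed, which this record does not state.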

Notes

This is the pre-print draft of an accepted and published manuscript. The publication should always be cited in preference to this draft using the reference in the previous footnote. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder. Published version: Copyright © by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.

Files

MSR18-w2v.pdf

Files (544.0 kB)

Name: MSR18-w2v.pdf
Size: 544.0 kB
md5:996bdc1c240558b3944c98b241eae7f2

Additional details

Related works

Is previous version of
Conference paper: 10.1145/3196398.3196448 (DOI)

Funding

European Commission
CROSSMINER – Developer-Centric Knowledge Mining from Large Open-Source Software Repositories (grant 732223)