RepoSkillMiner: identifying software expertise from GitHub repositories using natural language processing

A GitHub profile is becoming an essential part of a developer's resume, enabling HR departments to extract someone's expertise through automated analysis of their contributions to open-source projects. At the same time, clear insights into the technologies used in a project can be very beneficial for resource allocation and project maintainability planning. In the literature, one can identify various approaches for identifying expertise in programming languages, based on the projects a developer contributed to. In this paper, we move one step further and introduce an approach (accompanied by a tool) to identify low-level expertise in particular software frameworks and technologies, relying solely on GitHub data, using the GitHub API and Natural Language Processing (NLP) via the Microsoft Language Understanding Intelligent Service (LUIS). In particular, we developed an NLP model in LUIS for named-entity recognition for three (3) .NET technologies and two (2) front-end frameworks. Our analysis is based on the contents of specific commits, in terms of the exact code chunks that the committer added or changed. We evaluate the precision, recall and F-measure for the derived technologies/frameworks by conducting a batch test in LUIS and report the results. The proposed approach is demonstrated through a fully functional web application named RepoSkillMiner. Tool Links: Video, Code Repo, Application, Validation Dataset


INTRODUCTION
Contemporary software development demands breadth and in-depth knowledge of a tremendously large set of technologies, tools and practices [1], [2]. The challenge of retaining a competitive development team becomes even more pronounced considering the rapid evolution of software technologies and the continuous emergence of novel programming languages, frameworks and libraries. Considering that developers' skills also evolve over time, project managers face the challenging task of tracking the kind and extent of knowledge that exists within the development team, in order to efficiently map people to tasks. Quite often, a particular technology has been introduced into a project following rising trends, only for the team to find out a few years later that few people (or even none) still hold the knowledge needed to maintain the corresponding parts of the codebase. At the same time, identifying developers' expertise and skills is a topic of high importance among recruiters and human resource departments of software companies. With the plethora of frameworks and technologies involved in software engineering (e.g., front-end technologies), a stated set of known skills is not sufficient for efficient hiring. Social platforms such as LinkedIn are gaining ground and constitute a major field for IT recruiters to find ideal candidates, but the information posted on such platforms may be inaccurate, out of date and subjective. To this end, identifying expertise from artifacts offers a promising alternative for organizations, not only for hiring purposes, but also for managing their existing resources, maintaining and developing their skillset and monitoring any potential skill shortage. Software repositories such as GitHub or BitBucket can provide objective insights to an experienced technical eye. However, parsing through repositories to track individual contributions and then analyzing commits to derive the particular technologies that a developer is familiar with is a tedious process calling for automation. At the same time, similar approaches are needed to automatically analyze the technologies used over time within a particular project and the diffusion of each technology within the code.

Related literature on identifying expertise from software development platforms focuses on the programming languages used in each project [3], [4], [8]. However, considering only the dominant language of a project as an identifier can often lead to inaccurate results for today's multi-language projects. Thus, information about the multiple frameworks and technologies used in the context of a language slips under the radar. In this paper we propose an approach (and tool) to advance the state of the art in mining skill-related information from coding platforms with a hybrid approach aided by NLP. In particular, we first obtain an insight based on the files contained in each commit; next, we apply NLP on the actual code chunks included in the commits to identify the use of specific frameworks and technologies. The outcome of this approach is a report containing all the programming languages, technologies and frameworks per contributor. To showcase the approach, we developed a web application which performs the analysis and visualizes the results. In contrast to many existing approaches, we use live data from GitHub rather than relying on pre-existing curated datasets (e.g., GHTorrent), increasing the applicability of the approach. The described tool constitutes our initial effort to develop a comprehensive dashboard that will provide project managers with an overview of the actual technologies employed in a project, as well as detailed reporting on the skills held by active developers, including the evolution of both over time. The envisioned platform will enable the early identification of gaps between people's skills and deployed technologies.

RELATED WORK
Relevant studies span three different areas, namely mining software repositories, extracting software developers' expertise, and Natural Language Processing. Several studies introduce approaches and methods for extracting expertise from software repositories. Gousios et al. [3] propose a way to quantify a developer's contribution by identifying specific contribution types (e.g., adding code of good/bad quality, commits of new source files or directories, commits of code that generates/closes bugs) and assigning different weights to them. Constantinou et al. [4] examined the commit activity in the curated GHTorrent dataset [5] and proposed a way to extract developers' expertise in programming languages by considering the quantity and continuity of contributions. The programming language in this case is the project's dominant language as identified by GitHub. In a different approach [6], the same authors proposed a way to identify contributors' expertise and roles by considering their contribution history across projects and technologies. The technologies are identified from the contents of the README file, assuming that this file typically contains a description of the technologies used. Likewise, Greene et al. [7] combined commit details and README files to extract similar information. A different path is followed by Montandon et al. [8], who focus on three JavaScript libraries to evaluate the performance of machine learning classifiers in predicting expertise and to propose a method for clustering feature data from GitHub. Using NLP techniques on source code is a topic related mostly to static code analysis [9]. To the best of our knowledge, there is no study on using NLP for expertise identification in source code.

BACKGROUND INFORMATION
The proposed approach relies on three main pillars: parsing (mining) of software repositories to seek code artifacts for analysis; NLP, which treats code through statistical analysis; and Language Understanding, to pull out of the code the relevant information on the involved technologies. Mining Software Repositories (MSR) has become an established field in empirical software engineering, focusing on extracting and analyzing data drawn from software repositories to reveal useful relations and information around software products, processes and people [10]. GitHub is by far the most popular social coding platform, and most similar research efforts rely on GitHub as a reference source of data [11]. To facilitate mining and increase performance, curated datasets such as GHTorrent [5] and Boa are maintained by research teams. The availability of a comprehensive API was one of the key reasons that render GitHub an appealing source for many software engineering research efforts [14]. However, several research studies point out the pitfalls of this process [15]. Natural language processing (NLP) leverages the power of machine learning and computational linguistics and is concerned with making computer systems learn the syntax and meaning of human language, and process and understand its intent to perform meaningful tasks [16]. Applying NLP techniques to source code may sound unnatural, but there is scientific evidence supporting the validity of the approach. Hindle et al. [17] suggest that programming languages, in theory, are complex, flexible and powerful, but the "natural" programs that real people actually write are mostly simple and rather repetitive; thus they have predictable statistical properties that can be captured in language models and leveraged for software engineering tasks. This finding inspired us to use NLP on source code for expertise identification instead of crafting rules, so that the proposed approach is scalable, benefits from the abundance of available data in software repositories to train the corresponding models, and copes better with new situations. Microsoft's Language Understanding Intelligent Service (LUIS) is based on work by Microsoft Research on interactive learning and rapid development of language understanding models [18]. According to its creators, it aims at enabling software developers to create cloud-based machine-learning language understanding models specific to their application domain, without ML expertise [19]. The model creator needs to provide a small set of utterances for each intent; the LUIS model is trained on these and, once published, is ready for use. Successful industry applications of LUIS include information chatbots, commerce chatbots and conversational IoT interfaces [20].

PROPOSED APPROACH
The proposed approach can be briefly outlined in the following steps:

Step 1. Data Collection. We used GitHub's REST API v3 [21] to retrieve up-to-date data from GitHub repositories. Given a GitHub organization or repository, we retrieve all commits as well as the author of each commit. For every commit we retrieve the files included and the actual code chunks for these files. We preferred API v3 over its successor v4 because, although the latter uses GraphQL, there is currently no way to retrieve file contents through it.
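Step 1 can be sketched as follows. The endpoint paths follow the public GitHub REST API v3; the helper functions and their names are our own illustration, not the tool's actual code, and `added_lines` shows one plausible way to keep only the code chunks a committer added from each file's unified-diff `patch` field.

```python
# Sketch of Step 1 (data collection), assuming the public GitHub REST API v3.
import json
import urllib.request

API = "https://api.github.com"

def fetch_json(url, token=None):
    """GET a GitHub API v3 resource and decode the JSON body."""
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github.v3+json"})
    if token:
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def list_commits(owner, repo, token=None):
    # Each element carries the commit SHA and the commit author.
    return fetch_json(f"{API}/repos/{owner}/{repo}/commits", token)

def commit_files(owner, repo, sha, token=None):
    # The single-commit endpoint includes a "files" array whose "patch"
    # field holds the unified diff for each changed file.
    return fetch_json(f"{API}/repos/{owner}/{repo}/commits/{sha}", token).get("files", [])

def added_lines(patch):
    """Keep only the code chunks the committer added ('+' lines of the diff)."""
    return [ln[1:] for ln in patch.splitlines()
            if ln.startswith("+") and not ln.startswith("+++")]
```

The added lines are the chunks later handed to the language classification and LUIS steps; removed and context lines are ignored.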
Step 2. Identification of commits' programming language. For each retrieved commit we check the files included in the commit, and in particular the file extensions, to identify the employed programming languages. To do so, we use a slightly modified (duplicates removed and a few more file extensions added) version of the classification provided by GitHub Linguist [22], the library GitHub uses to provide the language distribution information for repositories. Both the original and the modified classification files are available online1,2 .
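A minimal sketch of Step 2 is shown below. The few map entries are illustrative only; the tool uses a modified version of the full GitHub Linguist classification, and the function name is our own.

```python
# Illustrative extension-to-language map (the tool uses the full,
# slightly modified GitHub Linguist classification).
EXTENSION_TO_LANGUAGE = {
    ".cs": "C#",
    ".ts": "TypeScript",
    ".js": "JavaScript",
    ".py": "Python",
    ".html": "HTML",
    ".css": "CSS",
}

def commit_languages(filenames):
    """Return the set of languages touched by a commit, judged by file extension."""
    langs = set()
    for name in filenames:
        dot = name.rfind(".")
        if dot != -1:
            lang = EXTENSION_TO_LANGUAGE.get(name[dot:].lower())
            if lang:
                langs.add(lang)
    return langs
```

Files without a recognized extension simply contribute no language, which matches the step's best-effort nature.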
Step 3. Identification of commit technologies with LUIS. We built a model in LUIS to identify three (3) technologies in the .NET framework domain, namely: (a) Language-Integrated Queries (LINQ), first-class language constructs that allow writing queries against strongly typed collections of objects; (b) Asynchronous Programming, which allows code written as sequential statements that nevertheless execute based on external resource allocation and according to the order of tasks; and (c) Entity Framework, an object-database mapper. Moreover, LUIS is trained to identify two (2) front-end frameworks, namely: (d) Angular and (e) React. LUIS needs input utterances (i.e., inputs from the user that the model needs to interpret) to be provided for each target intent (technology/framework) in the training step. We note that an intent corresponds to a purpose or goal expressed in a user's utterance. To train LUIS to extract intents and entities, it is important to capture a variety of example utterances. Active learning, i.e., the process of continuing to train on new utterances, is essential to the machine-learned intelligence that LUIS provides. We created 98 example utterances for the 5 intents, using existing or slightly altered (mainly renamed variables) samples from the official Microsoft, Angular and React documentation. An example response from LUIS for the utterance "var filteredResult = studentList.Where(s => s.Age > 12)" is shown in Figure 1.
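Step 3 can be sketched as follows. The URL shape follows the LUIS v2 prediction endpoint; the region, app id and key are placeholders, the 0.5 confidence threshold is our own illustrative choice, and the helper names are not the tool's actual code.

```python
# Sketch of Step 3: sending a code chunk to a published LUIS model and
# reading back the top-scoring intent (assumed LUIS v2 endpoint shape).
import json
import urllib.parse
import urllib.request

def luis_url(region, app_id, key, utterance):
    """Build a LUIS v2 prediction URL; region/app_id/key are placeholders."""
    query = urllib.parse.urlencode({"subscription-key": key, "q": utterance})
    return (f"https://{region}.api.cognitive.microsoft.com"
            f"/luis/v2.0/apps/{app_id}?{query}")

def top_intent(response, threshold=0.5):
    """Return the detected technology, or None when the model is not confident."""
    scored = response.get("topScoringIntent", {})
    if scored.get("score", 0.0) >= threshold:
        return scored.get("intent")
    return None

def detect_technology(region, app_id, key, code_chunk):
    with urllib.request.urlopen(luis_url(region, app_id, key, code_chunk)) as resp:
        return top_intent(json.load(resp))
```

For a response like the one in Figure 1, `top_intent` would yield the intent name (e.g., Linq) whenever its score exceeds the chosen threshold.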

Figure 1: Example JSON response
Table 1 displays model statistics for the five (5) target intents (Linq, Entity Framework, Async Programming, Angular, React), whereas the scatterplot in Figure 2 shows the distribution of the test results for Linq. To perform the evaluation, we created a batch test in the LUIS testing platform with 88 code snippets, which are available online 3 . Additionally, we note that the model has been exported and is available online 4 .
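The per-intent metrics reported in Table 1 follow the standard definitions, which can be computed from the batch-test counts as below. The counts in the usage comment are made-up placeholders, not the paper's reported results.

```python
# Per-intent evaluation metrics from a batch test: precision, recall, F-measure.
def precision_recall_f1(tp, fp, fn):
    """tp/fp/fn: true positives, false positives, false negatives for one intent."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: 8 snippets correctly labeled Linq, 2 wrongly labeled
# Linq, 2 Linq snippets missed.
p, r, f = precision_recall_f1(8, 2, 2)
```

Metrics are computed once per intent and then inspected side by side, since a model can trade precision for recall differently for each technology.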

USAGE SCENARIO
A video demonstrating the functionality of RepoSkillMiner is available online 5 . The demonstration begins with the user entering the name of the GitHub organization to be analyzed in the search field shown in Figure 4. Next, the user selects from the dropdown list the repository to include in the scan, specifies that LUIS scanning should be applied, and clicks the scan button. A table is populated containing the contributors of the organization and the technologies they used, as shown in Figure 5. Clicking on the name of any contributor populates two graphs showing the distribution of technologies based on the number of commits for each technology, together with a detailed list of the technologies, as shown in Figures 6-7.

CONCLUSIONS AND FUTURE WORK
RepoSkillMiner is a scalable and easy-to-use web application that can determine the knowledge (in terms of low-level skills, e.g., the command of specific programming techniques or frameworks) held by individual developers in an open-source software project. Since the tool is still in a prototype phase, the application has some limitations, including browser support (Firefox is the only fully supported browser) and API limitations (the GitHub API sets a limit of 5,000 requests per hour). We plan to include more detailed visualizations and in-depth insights from the collected data, including the evolution of project technologies within an organization over time, the designation of technologies that are "at risk" because of a lack of resources, the cross-tabulation of technologies and people's skills, etc. Developers' experience in terms of commits or years can also be derived by analyzing a project's history. Once the tool is enhanced with the ability to detect more low-level technologies, we plan to evaluate its accuracy against the actual skills held by developers in a selected company, using a questionnaire-based study.

ASE '20, September 2020, Melbourne, Australia. S. Kourtzanidis et al.

ACKNOWLEDGMENT
Work reported in this paper has received funding from the European Union Horizon 2020 research and innovation program under grant agreement No. 780572 (project: SDK4ED).

Figure 2: Model statistics for entity Linq
Figure 3

Figure 4: Search for a repository in an organization

Table 1: Model Statistics