[Research Data] Mining Relevant Solutions for Programming Tasks from Search Engine Results
Description
[Abstract]
Software development is a knowledge-intensive activity. Official documentation may not be sufficient for all developer needs. Searching the Internet for information is a common practice, but finding truly useful information can be challenging because the best solutions are not always among the top-ranked pages. Developers therefore have to read and discard irrelevant pages, that is, pages that have no code examples or whose content has little focus on the desired solution. This work proposes an approach to mine relevant solutions for programming tasks from search engine results by removing irrelevant pages. The approach works as follows: a query related to the programming task is prepared and given as input to a search engine, and the returned pages pass through an automatic filter that selects the relevant ones. We evaluated the top-20 pages returned by the Google search engine for 10 different queries and observed that only 31% of the evaluated pages are relevant to developers. We then proposed and evaluated three different approaches to mine the relevant pages returned by the search engine. Using Google’s search engine as a baseline, our results show that it returns a sizable number of irrelevant pages for developers and that an effective approach to remove them can be found, suggesting that developers could benefit from a customized web search filter for development content.
[Contents of Research Data.rar file]
The Research Data.rar file contains a folder called Research Data, which in turn contains three folders: “01 – Source Code”, “02 - Data” and “03 – Preprocessing rules”. The folder “01 – Source Code” contains the Java source code of the implementations of the proposed approaches. The folder “02 - Data” contains the data from the evaluations carried out in the work, organized in the folders “01 - Evaluation results of pages returned by Google” and “02 - Results of approaches comparisons”. The folder “01 - Evaluation results of pages returned by Google” holds the evaluations of the first 20 pages returned by Google, following the criteria defined in the work, for each of the 10 queries considered in the evaluation. The folder “02 - Results of approaches comparisons” contains the results of the evaluation of the proposed approaches for the same 10 queries. In this evaluation, the number of pages given as input to the approaches was increased from 3 to 20, and a folder with the results was generated for each number of pages. In addition to the Precision, Recall and F-Measure results, which are in the file named Results Approaches.txt, other files were generated for analysis. For example, the Instances_without_outliers.txt file shows which pages were filtered out after applying the outlier page removal filter, while the Selected Pages Approach 4.txt file shows which pages were filtered after applying the filters of the GORCUO approach. The folder “03 - Preprocessing rules” contains a file called Rules.java, which holds the commented Java source code implementing the rules created in the pre-processing stage of the proposed approach.
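As a reading aid for the metrics reported in Results Approaches.txt, the following is a minimal sketch of how Precision, Recall and F-Measure can be computed for one filtered page set. It is not the code shipped in the archive: the class name `FilterMetrics`, the page identifiers, and the use of the balanced F-Measure (F1) are assumptions made for illustration only.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: hypothetical helper, not part of the archived source code.
class FilterMetrics {

    // Number of pages that were both selected by a filter and judged relevant.
    static int truePositives(Set<String> selected, Set<String> relevant) {
        Set<String> hits = new HashSet<>(selected);
        hits.retainAll(relevant);
        return hits.size();
    }

    // Fraction of selected pages that are relevant (0 when nothing is selected).
    static double precision(Set<String> selected, Set<String> relevant) {
        if (selected.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(selected, relevant) / selected.size();
    }

    // Fraction of relevant pages that were selected (0 when nothing is relevant).
    static double recall(Set<String> selected, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(selected, relevant) / relevant.size();
    }

    // Balanced F-Measure (harmonic mean); assumed here, the source only says "F-Measure".
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Hypothetical page IDs: pages judged relevant vs. pages kept by a filter.
        Set<String> relevant = Set.of("p1", "p3", "p4", "p7");
        Set<String> selected = Set.of("p1", "p3", "p5");

        double p = precision(selected, relevant);
        double r = recall(selected, relevant);
        System.out.printf("precision=%.3f recall=%.3f f=%.3f%n", p, r, fMeasure(p, r));
    }
}
```

Representing page sets as `Set<String>` keeps the overlap computation to a single `retainAll` call; the archived implementation may represent pages differently.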
Files
| Checksum | Size |
|---|---|
| md5:0a4c1e4b2a900c6ff15bb4e60787603d | 3.3 MB |