A Dense Retrieval System and Evaluation Dataset for Scientific Computational Notebooks
Description
The discovery and reuse of scientific code
are crucial to many research activities. Computational notebooks
have emerged as a particularly effective medium for sharing
and reusing scientific code. Nevertheless, effectively locating
relevant computational notebooks is a significant challenge. First,
computational notebooks encompass multi-modal data comprising
unstructured text, source code, and other media, posing
complexities in representing such data for retrieval purposes.
Second, the absence of evaluation datasets for the computational
notebook search task hampers fair performance assessments
within the research community. Prior studies have either treated
computational notebook search as a code-snippet search problem
or focused solely on content-based approaches for searching
computational notebooks. To address these difficulties,
we present DeCNR, a system that serves the information needs of
researchers searching for computational notebooks. Our approach
leverages a fused sparse-dense retrieval model to represent
computational notebooks effectively. Additionally, we construct
an evaluation dataset including actual scientific queries, computational
notebooks, and relevance judgments for fair and objective
performance assessment. Experimental results demonstrate that
the proposed method surpasses baseline approaches in terms of
F1@5 and NDCG@5. The proposed system has been implemented
as a web service shipped with REST APIs, allowing seamless
integration with other applications and web services.
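
The two metrics named above, F1@5 and NDCG@5, can be illustrated with a minimal Python sketch. It is not taken from DeCNR; the function names, the notebook identifiers, and the relevance grades are hypothetical. It assumes binary relevance for F1@k (any grade above zero counts as relevant) and linear gain with a log2 discount for NDCG@k, which is only one of several common NDCG variants.

```python
import math


def f1_at_k(ranked_ids, relevant_ids, k=5):
    """F1 over the top-k results, treating relevance as binary."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    if hits == 0:
        return 0.0
    precision = hits / k                 # precision@k divides by k
    recall = hits / len(relevant_ids)    # recall over all relevant items
    return 2 * precision * recall / (precision + recall)


def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids, graded_relevance, k=5):
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    gains = [graded_relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(graded_relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance judgments for one query (notebook id -> grade).
judgments = {"nb1": 2, "nb2": 1, "nb4": 2}
# Hypothetical ranking returned by a retrieval system for that query.
ranking = ["nb3", "nb1", "nb7", "nb2", "nb9"]

f1 = f1_at_k(ranking, set(judgments), k=5)
ndcg = ndcg_at_k(ranking, judgments, k=5)
print(f"F1@5 = {f1:.3f}, NDCG@5 = {ndcg:.3f}")
```

F1@5 rewards retrieving many of the relevant notebooks within the first five results, while NDCG@5 also rewards placing the most relevant ones nearest the top, so the two metrics complement each other.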
Files
| Name | Size |
|---|---|
| 2023.conference.escience.nali.camera.pdf (md5:9bc220a00f8ee3243bab75e15c307e27) | 4.3 MB |