Software Similarity Dataset
Creators
- 1. Telecommunications Engineers
- 2. Distinguished Researcher
Description
This dataset contains the post-processed data for software similarity learning. More information is available at: SoftwareSim_Github
post_process: All software embedded with an autoencoder so that each function has the same length (1024 bits). Each resulting file is the embedded graph representation of one piece of software.
final_data: All information extracted with Somef and Inspect4py, after cleaning. Each file represents one piece of software in the format Function_Name: [[Called Function], [Function Tokens]]
lean_simscore.csv: This file contains software pairs together with their similarity metrics, in the following format:
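As a rough illustration of the final_data layout, the sketch below turns one record into a list of call-graph edges. The record contents and the helper name `call_edges` are hypothetical, and the on-disk serialization of the files is not specified here, so the example starts from an in-memory dictionary.

```python
# Hypothetical record following the documented layout:
# Function_Name: [[Called Function], [Function Tokens]]
software = {
    "main": [["load", "run"], ["def", "main", "(", ")"]],
    "load": [[], ["def", "load", "(", "path", ")"]],
    "run": [["load"], ["def", "run", "(", ")"]],
}


def call_edges(record):
    """Return (caller, callee) pairs for calls to functions defined in the record."""
    return [
        (caller, callee)
        for caller, (called, _tokens) in record.items()
        for callee in called
        if callee in record  # skip calls to functions outside this software
    ]


edges = call_edges(software)
for caller, callee in edges:
    print(f"{caller} -> {callee}")
```

Calls to functions that are not defined in the same record are dropped here, which is one possible design choice; keeping them as external nodes would be equally valid.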
| Property | Example |
|---|---|
| Graph_1 | kakaobrain_helo_word |
| Graph_2 | mblondel_soft-dtw |
| miniLM | 0.4503 |
| Sbert | 0.7204 |
| TSDAE | 0.5714 |
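A minimal sketch of reading the similarity scores, assuming that lean_simscore.csv is comma-delimited and that its header row uses the property names listed above. The inline sample reuses the example values and stands in for the real file; `load_pairs` is a hypothetical helper name.

```python
import csv
import io

# Illustrative sample only: the header names are assumed to match the
# properties listed above, and the single row reuses the example values.
sample = """Graph_1,Graph_2,miniLM,Sbert,TSDAE
kakaobrain_helo_word,mblondel_soft-dtw,0.4503,0.7204,0.5714
"""

METRICS = ("miniLM", "Sbert", "TSDAE")


def load_pairs(handle):
    """Parse rows into (graph_1, graph_2, {metric: score}) tuples."""
    rows = []
    for row in csv.DictReader(handle):
        scores = {metric: float(row[metric]) for metric in METRICS}
        rows.append((row["Graph_1"], row["Graph_2"], scores))
    return rows


pairs = load_pairs(io.StringIO(sample))
for graph_1, graph_2, scores in pairs:
    print(graph_1, graph_2, scores)
```

To read the actual file under the same assumptions, pass `open("lean_simscore.csv", newline="")` to `load_pairs` instead of the in-memory sample.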