Software Similarity Dataset
Authors/Creators
- 1. Telecommunications Engineers
- 2. Distinguished Researcher
Description
This dataset contains the post-processed data for software similarity learning. More information is available at: SoftwareSim_Github
post_process: Every software project, embedded with an autoencoder so that each function has the same length (1024 bits). Each file is the embedded graph representation of one software project.
final_data: All information extracted with Somef and Inspect4py, after cleaning. Each file represents one software project in the format Function_Name: [[Called Function], [Function Tokens]]
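As a minimal sketch of how the Function_Name: [[Called Function], [Function Tokens]] layout could be consumed, the snippet below turns such a mapping into a call graph. It assumes the record has already been loaded into a Python dict; the on-disk serialization is not specified here. The function names and tokens in `example` are made up for illustration and are not taken from the dataset.

```python
from typing import Dict, List, Set, Tuple

# One entry per function: [[called functions], [function tokens]].
FunctionEntry = List[List[str]]


def build_call_graph(functions: Dict[str, FunctionEntry]) -> Dict[str, Set[str]]:
    """Map each function name to the set of functions it calls."""
    graph: Dict[str, Set[str]] = {}
    for name, (called, _tokens) in functions.items():
        graph[name] = set(called)
    return graph


def internal_edges(graph: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Keep only caller -> callee edges whose callee is defined in the same project."""
    return sorted(
        (caller, callee)
        for caller, callees in graph.items()
        for callee in callees
        if callee in graph
    )


# Hypothetical record, for illustration only (not real dataset content).
example = {
    "load": [["open_file", "parse"], ["def", "load", "(", "path", ")"]],
    "parse": [[], ["def", "parse", "(", "text", ")"]],
}

graph = build_call_graph(example)
edges = internal_edges(graph)
```

Here `edges` contains only calls between functions defined in the same record, which is one plausible way to obtain the graph structure that the embedded representation is derived from.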
lean_simscore.csv: This file contains software pairs together with their similarity metrics, in the following format:
| Property | Example |
|---|---|
| Graph_1 | kakaobrain_helo_word |
| Graph_2 | mblondel_soft-dtw |
| miniLM | 0.4503 |
| Sbert | 0.7204 |
| TSDAE | 0.5714 |
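The following sketch shows one way to read rows in this format with Python's standard `csv` module. It assumes the file is comma-delimited with a header row whose column names match the properties in the table above; that layout is an assumption and should be checked against the actual file. The sample row reuses the example values from the table.

```python
import csv
import io

# Assumed layout: comma-delimited, header row matching the property names.
SAMPLE = """Graph_1,Graph_2,miniLM,Sbert,TSDAE
kakaobrain_helo_word,mblondel_soft-dtw,0.4503,0.7204,0.5714
"""

METRICS = ("miniLM", "Sbert", "TSDAE")


def read_pairs(handle):
    """Parse software pairs and convert each similarity score to float."""
    rows = []
    for row in csv.DictReader(handle):
        record = {"Graph_1": row["Graph_1"], "Graph_2": row["Graph_2"]}
        for metric in METRICS:
            record[metric] = float(row[metric])
        rows.append(record)
    return rows


pairs = read_pairs(io.StringIO(SAMPLE))
```

To read the real file, pass an open file handle instead of the `io.StringIO` sample, for example `open("lean_simscore.csv", newline="")`.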
Files
| Name | Size | MD5 |
|---|---|---|
| data.zip | 329.7 MB | 0c69813feff84fa6fa7d961c30fc73b6 |
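After downloading, the archive can be checked against the MD5 checksum listed for data.zip. The sketch below uses only Python's standard `hashlib`; the helper names `md5_of_file` and `verify` are illustrative, and the path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum published for data.zip on this record.
EXPECTED_MD5 = "0c69813feff84fa6fa7d961c30fc73b6"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify("data.zip")` returns `True` only if the local copy matches the published checksum; a mismatch usually indicates an incomplete or corrupted download.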