Published September 23, 2022 | Version 1.9
Dataset Open

Software Similarity Dataset

  • 1. INGENIEROS EN TELECOMUNICACIONES
  • 2. Investigador distinguido

Description

This dataset contains the post-processed data for software similarity learning. More information is available at: SoftwareSim_Github


post_process: All software embedded with an autoencoder so that every function has the same length (1024 bits). Each resulting file is the embedded graph representation of one piece of software.

final_data: All information obtained with Somef and Inspect4py, after cleaning. Each file represents one piece of software in the format Function_Name: [[Called Function], [Function Tokens]]
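As a rough illustration of the final_data format, the sketch below turns one such mapping into a list of call-graph edges. The sample record, its function names, and the helper call_edges are hypothetical. How the files are serialized on disk is not stated here, so the record is assumed to be already loaded into a Python dict.

```python
from typing import Dict, List, Tuple

# Hypothetical record in the documented shape:
# Function_Name: [[Called Function], [Function Tokens]]
Record = Dict[str, List[List[str]]]

sample: Record = {
    "main": [["load", "train"], ["def", "main", "(", ")"]],
    "load": [[], ["def", "load", "(", "path", ")"]],
    "train": [["load"], ["def", "train", "(", "data", ")"]],
}


def call_edges(record: Record) -> List[Tuple[str, str]]:
    """Return one (caller, callee) pair for every call listed in the record."""
    edges = []
    for caller, (called, _tokens) in record.items():
        for callee in called:
            edges.append((caller, callee))
    return edges


edges = call_edges(sample)
print(edges)
```

The edge list can then be passed to any graph library. The token lists are ignored here and would feed a separate text-embedding step.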

lean_simscore.csv: This file contains software pairs and their similarity metrics. Its format is:

Property  Example
Graph_1   kakaobrain_helo_word
Graph_2   mblondel_soft-dtw
miniLM    0.4503
Sbert     0.7204
TSDAE     0.5714
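
A minimal sketch of reading such pairs with Python's standard csv module follows. It assumes the file has a header row named after the properties listed above and uses commas as the delimiter; neither is confirmed by this description. The inline sample is built from the example values, and read_pairs is a hypothetical helper.

```python
import csv
import io

# Inline sample built from the example values in the table. The header row
# and the comma delimiter are assumptions about the real file.
SAMPLE = (
    "Graph_1,Graph_2,miniLM,Sbert,TSDAE\n"
    "kakaobrain_helo_word,mblondel_soft-dtw,0.4503,0.7204,0.5714\n"
)
METRICS = ("miniLM", "Sbert", "TSDAE")


def read_pairs(handle):
    """Read software pairs and convert the three metric columns to floats."""
    rows = []
    for row in csv.DictReader(handle):
        for name in METRICS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


pairs = read_pairs(io.StringIO(SAMPLE))
first = pairs[0]
print(first["Graph_1"], first["Graph_2"], first["Sbert"])
```

For the real file, an open file handle such as open("lean_simscore.csv", newline="") could be passed in place of the in-memory sample.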

 

Files

data.zip

Files (402.0 MB)

Checksum                              Size
md5:0c69813feff84fa6fa7d961c30fc73b6  329.7 MB
md5:a3190a953604884078c97de2b41935ae  72.3 MB
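
The MD5 checksums listed above can be used to confirm that a download arrived intact. The sketch below uses only Python's hashlib; the helper names md5_of and matches are hypothetical, and which checksum belongs to which file is not stated here, so the caller must supply the pairing.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected_hex
```

To use it, call matches with the path of a downloaded file and the checksum listed for it; a False result suggests a corrupted or incomplete download.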