Software Similarity Dataset

Wang ZiYuan; Daniel Garijo

doi:10.5281/zenodo.7488906

Published September 23, 2022 | Version 1.9

Dataset Open

1. Telecommunications Engineers
2. Distinguished Researcher

This dataset contains the post-processed data for software similarity learning. More information is available at SoftwareSim_Github.

post_process: All software embedded with an autoencoder so that every function has the same length (1024 bits). Each resulting file is the embedded graph representation of one piece of software.

final_data: All information obtained by Somef and Inspect4py, after cleaning. Each file represents one piece of software in the following format: Function_Name: [[Called Function], [Function Tokens]]
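
The Function_Name: [[Called Function], [Function Tokens]] layout described above can be read as a call graph: each key is a node and each called function is an outgoing edge. The sketch below illustrates that reading only. The on-disk serialization of the final_data files is not stated here, so the record is built in memory, and the function names, tokens, and the `call_edges` helper are all hypothetical.

```python
# Hypothetical record in the documented shape:
#   Function_Name: [[Called Function], [Function Tokens]]
# Names and tokens are invented for illustration, not taken from the dataset.
software = {
    "main": [["load", "train"], ["def", "main", "(", ")"]],
    "load": [[], ["def", "load", "(", "path", ")"]],
    "train": [["load"], ["def", "train", "(", ")"]],
}


def call_edges(record):
    """Return (caller, callee) pairs for calls that resolve to a function in the record."""
    edges = []
    for caller, (callees, _tokens) in record.items():
        for callee in callees:
            # Skip calls to functions defined outside this record (e.g. library calls).
            if callee in record:
                edges.append((caller, callee))
    return edges


edges = call_edges(software)
print(edges)
```

Calls to functions that are not keys of the record are dropped here, which is one possible choice; keeping them as external nodes would be equally reasonable.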

lean_simscore.csv: This file contains software pairs and their similarity metrics, in the following format:

Property	Example
Graph_1	kakaobrain_helo_word
Graph_2	mblondel_soft-dtw
miniLM	0.4503
Sbert	0.7204
TSDAE	0.5714
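
As a minimal sketch of how rows with these properties might be consumed, the example below parses a one-row sample with Python's standard `csv` module. It assumes a comma-delimited file whose header uses the property names from the table above; the actual header and delimiter of the CSV are not confirmed here. `SAMPLE` and `read_pairs` are hypothetical names, and the sample row reuses the example values from the table.

```python
import csv
import io

# Assumed layout: a header row using the property names from the table above,
# comma-delimited. The single data row reuses the documented example values.
SAMPLE = (
    "Graph_1,Graph_2,miniLM,Sbert,TSDAE\n"
    "kakaobrain_helo_word,mblondel_soft-dtw,0.4503,0.7204,0.5714\n"
)
SCORE_COLUMNS = ("miniLM", "Sbert", "TSDAE")


def read_pairs(handle):
    """Read software pairs and convert the three similarity scores to floats."""
    rows = []
    for row in csv.DictReader(handle):
        for column in SCORE_COLUMNS:
            row[column] = float(row[column])
        rows.append(row)
    return rows


pairs = read_pairs(io.StringIO(SAMPLE))
for pair in pairs:
    print(pair["Graph_1"], pair["Graph_2"], pair["Sbert"])
```

To read the real file, an open file handle (for example `open(path, newline="")`) can be passed in place of the `io.StringIO` object.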

Files

Files (402.0 MB)

Name	MD5	Size
data.zip	0c69813feff84fa6fa7d961c30fc73b6	329.7 MB
large.csv	a3190a953604884078c97de2b41935ae	72.3 MB
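
The MD5 values listed for the files can be used to check that a download is complete. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `is_intact` are hypothetical, and the local path passed in depends on where the files were saved.

```python
import hashlib

# MD5 digests as listed for the two files of this record.
EXPECTED_MD5 = {
    "data.zip": "0c69813feff84fa6fa7d961c30fc73b6",
    "large.csv": "a3190a953604884078c97de2b41935ae",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, name):
    """Return True if the file at `path` matches the listed digest for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `is_intact("data.zip", "data.zip")` would return `True` only if the local copy matches the listed checksum.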

	All versions	This version
Views	1,157	147
Downloads	90	26
Data volume	21.2 GB	5.0 GB

Resource type: Dataset
Publisher: Zenodo
Languages: English

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: December 28, 2022
Modified: December 28, 2022
