Published July 18, 2025 | Version v1
Data paper Open

Data for GitHub repository https://github.com/GaioTransposon/metadata_mining

  • 1. University of Zurich

Description

The directory contains the following files. 

*container: the number of the Docker container in which the file is used (see README.md on GitHub)

**data_type: "provided" marks source files; "produced" marks files generated by running the scripts.

| filename | type | size | container* | data_type** | description |
|---|---|---|---|---|---|
| sample.info.gz | file | 325.70 MB | 1 | provided | input - concatenated metadata of 3.8M samples (downloaded from NCBI) |
| sample.info_test.gz | file | 10.64 KB | 1 | provided | input - subset of sample.info.gz (can be used for testing) |
| metadata.out | file | 2.04 GB | 1 | provided | input - sample metadata for coordinate extraction |
| samples_with_lat_lon_reversal.tsv | file | 2.37 KB | 1 | provided | input - curated list of samples with an obvious latitude-longitude reversal |
| ontologies_dict.pkl | file | 8.56 MB | 1 | produced | output - dictionary of ontologies (FOODON, ENVO, UBERON, PO) |
| sample.coordinates.reparsed.filtered | file | 60.86 MB | 1 | produced | output - extracted coordinates |
| training_data_pmids_based.csv | file | 1.07 GB | 2 | provided | input - selected samples: each must be a microbiome sample, and samples must be sufficiently different from one another (based on PMID) |
| gold_dict.pkl | file | 98.41 KB | 2 | produced | output - the built benchmark |
| openai_system_better_prompt.txt | file | 1.43 KB | 3 | provided | input - improved prompt in text format |
| openai_system_better_prompt_batch.txt | file | 1.69 KB | 3 | provided | input - improved prompt in JSON format for batch requests |
| openai_system_better_prompt_json.txt | file | 1.69 KB | 3 | provided | input - improved prompt in JSON format |
| openai_system_prompt.txt | file | 1.32 KB | 3 | provided | input - original prompt in text format |
| openai_system_prompt_json.txt | file | 1.58 KB | 3 | provided | input - original prompt in JSON format |
| gpt_file_label_map.tsv | file | 25.83 KB | 3 | provided | input - manually made file mapping the obtained files to their corresponding labels |
| batch_job_info.json | file | 5.25 KB | 3 | produced | output - stores the OpenAI batch request ID and related details |
| gpt_clean_output_*.txt | file | NA | 3 | produced | output - all OpenAI GPT output files in text format |
| gpt_clean_output_*.csv | file | NA | 3 | produced | output - all OpenAI GPT output files in CSV format |
| embeddings | directory | 2.3 GB | 3 | produced | output - directory containing OpenAI GPT-made embeddings |
| GH_collect_output_here.txt | file | 41.53 KB | 4 | provided | input - human curator's sample annotations, 1st round |
| GH_combined_output.txt | file | 41.30 KB | 4 | provided | input - human curator's sample annotations, 1st + 2nd rounds |
| keywordsbased_biomes_parsed.csv | file | 38.75 MB | 4 | provided | input - keyword-based sample annotations (biomes). This is what MicrobeAtlas was initially based on. |
| biome_subbiome_results.csv | file | 149.53 KB | 4 | produced | output - results of GPT vs benchmark |
| biome_subbiome_stats.csv | file | 14.73 MB | 4 | produced | output - statistics of GPT vs benchmark |
| geocoded_coordinates.csv | file | 7.76 MB | 4 | produced | output - coordinates obtained from OpenStreetMap for the geographic locations GPT returned in text form |
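
For a first look at the listed files before running any container, a minimal Python sketch along the following lines may help. It rests on assumptions that this record does not confirm: that sample.info_test.gz holds gzip-compressed text, and that the .pkl files are standard Python pickles (inferred from the extension only). The helper names peek_gzip_text and load_pickle are hypothetical and are not part of the repository.

```python
import gzip
import pickle
from itertools import islice
from pathlib import Path


def peek_gzip_text(path, n_lines=5, encoding="utf-8"):
    """Return the first n_lines of a gzip-compressed text file.

    The file is streamed, so nothing is unpacked to disk. Undecodable
    bytes are replaced rather than raising, because the encoding of the
    archive's files is an assumption here.
    """
    with gzip.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


def load_pickle(path):
    """Load a pickle file and return the stored object.

    Unpickling can execute arbitrary code, so only use this on files
    obtained from a source you trust.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

Assuming the archive was extracted into the working directory, `peek_gzip_text("sample.info_test.gz")` would return the first five lines of the test subset, and `type(load_pickle("gold_dict.pkl"))` would show what kind of object the benchmark file holds. The internal structure of both files is not described here, so inspect the results rather than relying on any particular layout.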

Files

MicrobeAtlasProject_Zenodo.zip (1.1 GB)
md5:93ee448955297da5d54d0e776bc84a73
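
The MD5 checksum listed for MicrobeAtlasProject_Zenodo.zip can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard hashlib module; the function names md5_of_file and matches_expected are hypothetical, and the 1 MiB chunk size is an arbitrary choice that avoids loading the whole 1.1 GB archive into memory.

```python
import hashlib

# Checksum as listed on this record for MicrobeAtlasProject_Zenodo.zip.
EXPECTED_MD5 = "93ee448955297da5d54d0e776bc84a73"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved under its original name in the working directory, `matches_expected("MicrobeAtlasProject_Zenodo.zip")` should return True for an intact download. A False result suggests a truncated or corrupted transfer, in which case downloading the file again is the simplest remedy.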