Published September 10, 2021 | Version v2
Dataset Open

A unified DTI prediction framework based on knowledge graph and recommendation system

Creators

  • 1. Zhejiang University

Description


# Code and data description

## Scripts

- `kge_nfm.py`: implementation of the KGE_NFM & NFM methods.

- `kge_rf.py`: implementation of the KGE_RF & RF methods.

- `deepdit.py`: implementation of the MPNN_CNN & DeepDTI methods.

- DTINet and DTiGEMS were tested using their source packages (see Prerequisites).


 

## `data/` directory

#### `yamanishi_08/` directory

- `data_folds/`: 10-fold training and test sets for the three scenarios

  - `warm_start_1_1/`

  - `warm_start_1_10/`

  - `drug_coldstart/`

  - `protein_coldstart/`

- `kg_data/`: supporting knowledge graph data

- `dt_all_08.csv`: whole DTI dataset

- `791drug_struc.csv`: drugbank id and smiles of drugs

- `989proseq.csv`: kegg id and sequences of proteins

- `morganfp.txt`: list of drug morgan fingerprints

- `pro_ctd.txt`: list of protein descriptors
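
The folds under `drug_coldstart/` and `protein_coldstart/` ship precomputed, so nothing needs to be regenerated. Purely as an illustration of what a drug cold-start split means, the sketch below partitions drugs (rather than individual drug-target pairs) into folds, so that no drug in a test fold also appears in the matching training fold. The function name, the `(drug_id, protein_id, label)` tuple layout, and the seed are assumptions made for this example and are not taken from the authors' scripts.

```python
import random

def drug_cold_start_folds(pairs, n_folds=10, seed=0):
    """Yield (train, test) lists in which test drugs never occur in train.

    `pairs` is a list of (drug_id, protein_id, label) tuples. This layout is
    hypothetical and may differ from the files in `data_folds/`.
    """
    drugs = sorted({drug for drug, _, _ in pairs})
    random.Random(seed).shuffle(drugs)
    for k in range(n_folds):
        held_out = set(drugs[k::n_folds])  # every n_folds-th drug goes to fold k
        train = [p for p in pairs if p[0] not in held_out]
        test = [p for p in pairs if p[0] in held_out]
        yield train, test
```

A protein cold-start split would follow the same pattern with the grouping done on the protein column instead.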

 

#### `BioKG/` directory

- `data_folds/`: 10-fold training and test sets for the three scenarios

  - `warm_start_1_10/`

  - `drug_coldstart/`

  - `protein_coldstart/`

- `kg.csv`: supporting knowledge graph data

- `dti.csv`: whole DTI dataset

- `comp_struc.csv`: drugbank id and smiles of drugs

- `pro_seq.csv`: sequences of proteins

- `fp_df.csv`: list of drug morgan fingerprints

- `prodes_df.csv`: list of protein descriptors
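
A quick way to get a feel for the supporting knowledge graph in `kg.csv` is to count how many triples each relation type contributes. The sketch below assumes comma-separated rows of the form (head, relation, tail) with no header row. That layout is an assumption, and the column order or delimiter in the archive may differ, so the helper should be adjusted after inspecting the first few lines of the file.

```python
import csv
import io
from collections import Counter

def relation_counts(handle, delimiter=","):
    """Count triples per relation, assuming (head, relation, tail) rows."""
    counts = Counter()
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        _head, relation, _tail = row
        counts[relation] += 1
    return counts

# Toy data with made-up identifiers, standing in for open("kg.csv", newline="").
sample = io.StringIO("D1,DRUG_TARGET,P1\nD1,DRUG_DISEASE,X1\nD2,DRUG_TARGET,P2\n")
print(relation_counts(sample).most_common())
```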

 

#### `hetionet/` directory

- `data_folds/`: 10-fold training and test sets for the three scenarios

  - `warm_start_1_10/`

  - `drug_coldstart/`

  - `protein_coldstart/`

- `kg.csv`: supporting knowledge graph data

- `dti.csv`: whole DTI dataset

- `map_drugs_df`: drugbank id and smiles of drugs

- `pro_seq.csv`: sequences of proteins

- `fp_df.csv`: list of drug morgan fingerprints

- `prodes_df.csv`: list of protein descriptors

 

#### `luo's_dataset/` directory

- `data_folds/`: 10-fold training and test sets for the three scenarios

  - `warm_start_1_1/`

  - `warm_start_1_10/`

  - `drug_coldstart/`

  - `protein_coldstart/`

- `mapping/`: related mappings and similarity matrices (https://github.com/luoyunan/DTINet)

  - `protein.txt`: list of protein names

  - `disease.txt`: list of disease names

  - `se.txt`: list of side effect names

  - `drug_dict_map`: a complete ID mapping between drug names and DrugBank ID

  - `protein_dict_map`: a complete ID mapping between protein names and UniProt ID

  - `Similarity_Matrix_Drugs.txt`: drug similarity scores based on chemical structures of drugs

  - `Similarity_Matrix_Proteins.txt`: protein similarity scores based on primary sequences of proteins

- `feature/`: related features used in methods

  - `drug_smiles.csv`: drugbank id and smiles

  - `seq.txt`: list of protein sequences

  - `morganfp.txt`: list of drug morgan fingerprints

  - `pro_ctd.txt`: list of protein descriptors

 

#### `eg_model/` directory

We provide a pre-trained KGE model as an example.

- `dismult_400_warm_1_10.pkl`
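
The file name suggests a DistMult model with 400-dimensional embeddings, although this description does not say so explicitly. For orientation, DistMult scores a triple as the sum over dimensions of the element-wise product of the head, relation, and tail embeddings. The sketch below shows that scoring rule in plain Python with made-up 3-dimensional vectors. It is not the ampligraph implementation, and the values are not read from the `.pkl` file.

```python
def distmult_score(head, relation, tail):
    """DistMult triple score: sum_i head[i] * relation[i] * tail[i]."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("embeddings must share one dimensionality")
    return sum(h * r * t for h, r, t in zip(head, relation, tail))

# Toy 3-dimensional embeddings (made-up values, not taken from the .pkl file).
drug = [0.5, -1.0, 2.0]
targets = [1.0, 0.0, 0.5]
protein = [2.0, 3.0, 1.0]
print(distmult_score(drug, targets, protein))
```

Because the product is symmetric in head and tail, swapping the two entities gives the same score, which is a known property of DistMult.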


 

# Prerequisites

#### Operating system: Linux

#### Programming language: Python

#### KGE_NFM & NFM dependencies

```
- python 3.6
- pandas '1.1.5'
- numpy '1.18.4'
- scikit-learn '0.24.1'
- tensorflow '1.15.0'
- ampligraph '1.3.2'
- deepctr '0.8.4'
```
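
Since these versions are fairly old, it can help to confirm that the active environment matches the pins before running the scripts. The following is a minimal sketch using only the standard library, with a setuptools fallback for Python 3.6 and 3.7. The helper names `installed_version` and `find_mismatches` are introduced here for illustration and are not part of the released code.

```python
import sys

# Pins copied from the dependency list above.
PINNED = {
    "pandas": "1.1.5",
    "numpy": "1.18.4",
    "scikit-learn": "0.24.1",
    "tensorflow": "1.15.0",
    "ampligraph": "1.3.2",
    "deepctr": "0.8.4",
}

def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        from importlib import metadata  # standard library from Python 3.8
    except ImportError:
        import pkg_resources  # setuptools fallback for Python 3.6/3.7
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches

if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: wanted {wanted}, found {found}")
```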

#### baseline dependencies

- RF & KGE_RF: included in the KGE_NFM & NFM dependencies

- MPNN_CNN & DeepDTI:

  - source: https://github.com/kexinhuang12345/DeepPurpose

  ```
  - deeppurpose '0.0.9'
  - torch '1.6.0+cu101'
  ```

- DTINet:

  - source: https://github.com/luoyunan/DTINet

  - note: in this work, we run DTINet in a Python environment, which requires a Linux system and Python 2. Importantly, this method requires the [Inductive Matrix Completion](http://bigdata.ices.utexas.edu/software/inductive-matrix-completion/) (IMC) library. More detailed installation instructions can be found in the DTINet source code.

- DTiGEMS:

  - source: https://github.com/MahaThafar/DTiGEMSplus

- TriModel:

  - source: http://drugtargets.insight-centre.org/



 

# Example (kge_nfm.py)

 

#### A brief presentation of the results:

- returns the average loss while training the KGE model

```

Average Loss: 0.475181: 2%|###3 | 1/50 [01:10<57:31, 70.44s/epoch]

```

- returns performance (MRR) on the DTI training set for early stopping (KGE model in `eg_model/`)

```
In [35]: roc = roc_auc(test_label,test_score)
    ...: pr = pr_auc(test_label,test_score)
    ...: print(roc)
    ...: print(pr)
0.8731770833333332
0.44079654835037246
```

 

- NFM training process (`patience=10`)

 

```
In [45]: roc_nfm,pr_nfm,pred_y = train_nfm(feature_columns,train_model_input,train_label,test_model_input,test_label,patience)
Train on 44851 samples
Epoch 1/2000
44851/44851 - 2s - loss: 0.5332 - precision: 0.0976
Epoch 2/2000
44851/44851 - 1s - loss: 0.4143 - precision: 0.0000e+00
Epoch 3/2000
44851/44851 - 1s - loss: 0.3456 - precision: 0.0000e+00
Epoch 4/2000
44851/44851 - 1s - loss: 0.3443 - precision: 0.0000e+00
Epoch 5/2000
44851/44851 - 1s - loss: 0.3470 - precision: 0.0000e+00
Epoch 6/2000
44851/44851 - 1s - loss: 0.3382 - precision: 0.0000e+00
......
Epoch 279/2000
44851/44851 - 1s - loss: 0.0758 - precision: 0.9248
Epoch 280/2000
44851/44851 - 1s - loss: 0.0753 - precision: 0.9327
Epoch 281/2000
44851/44851 - 1s - loss: 0.0796 - precision: 0.9155
Epoch 282/2000
44851/44851 - 1s - loss: 0.0764 - precision: 0.9276
Epoch 283/2000
44851/44851 - 1s - loss: 0.0739 - precision: 0.9127
```
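
The log ends at epoch 283 of a possible 2000, which is consistent with patience-based early stopping. The quantity that `train_nfm` monitors is not shown here, so the sketch below simply assumes a loss that should decrease. It illustrates the general rule (stop once `patience` consecutive epochs fail to improve on the best value seen so far) and is not the authors' code. In Keras the same idea is usually expressed with the `EarlyStopping` callback.

```python
def early_stop_epoch(losses, patience=10, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Stops once `patience` consecutive epochs fail to improve on the best loss
    by more than `min_delta`; returns len(losses) if that never happens.
    """
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(losses)
```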

 

- returns the results as roc_auc & pr_auc

```
0.9812476679104477
0.8803416284646345
```
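
The definitions of the `roc_auc` and `pr_auc` helpers are not reproduced in this description. As a reference point for what the two numbers measure, here is a dependency-free sketch: ROC AUC computed as the fraction of correctly ranked (positive, negative) pairs, and the area under the precision-recall curve approximated by average precision. Both functions are stand-ins written for this example, and the authors' `pr_auc` may integrate the curve differently. With scikit-learn (already a listed dependency), `roc_auc_score` and `average_precision_score` compute the same kind of quantities.

```python
def roc_auc_pairwise(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example.

    Assumes scores are distinct; tied scores are ranked in input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_score, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits
```

The pairwise ROC AUC is quadratic in the number of examples, which is fine for a sanity check on a small sample but too slow for a full test fold.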

Files

- `data&code.zip` (285.4 MB), md5:34f3d94865606ad606532b3f126188f2
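
Given the size of the archive, it may be worth checking the download against the listed checksum before unpacking. A minimal sketch using only the standard library follows. The helper names and the local path passed to them are placeholders chosen for this example.

```python
import hashlib

EXPECTED_MD5 = "34f3d94865606ad606532b3f126188f2"  # checksum listed for data&code.zip

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected=EXPECTED_MD5):
    """True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("data&code.zip")` should return `True` for an intact copy of the archive.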