callahantiff/PheKnowLator: v3.0.2

Tiffany J. Callahan; Jordan M. Wyrwa, DO; Bill Baumgartner; Luca Cappelletti

doi:10.5281/zenodo.5568827

Published October 14, 2021 | Version v3.0.2

Software Open

1. CU Anschutz Medical Campus
2. Università degli Studi di Milano

Release: v3.0.2

Website: https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0
Data Access: Google Cloud Storage -- PheKnowLator Bucket
Docker Container: DockerHub Dedicated Project Container
PyPI: pkt-kg 3.0.2

Updated Jupyter Notebooks:

Updated Scripts:

builds/data_preprocessing.py
pkt_kg/metadata.py
pkt_kg/utils/kg_utils.py
builds/data_to_download.txt
pkt_kg/utils/data_utils.py
tests/test_data_utils_downloading.py

Updates

Addresses issue #118 (PR: #119) by patching how labels and definitions are obtained from ontologies. The build now ensures that, whenever possible, these fields are in English. See the details below for how to fix nodes containing foreign characters in builds created prior to this release.
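The idea of preferring English labels "whenever possible" can be illustrated with a minimal sketch. This is not PheKnowLator's code: the function name `pick_label` and the representation of ontology literals as `(text, language_tag)` pairs are assumptions made purely for illustration.

```python
def pick_label(candidates, preferred="en"):
    """Choose one label from a list of (text, language_tag) pairs.

    Hypothetical helper: language_tag is None for untagged literals.
    """
    # 1) Prefer literals tagged with the preferred language ('en', 'en-US', ...).
    for text, lang in candidates:
        if lang and lang.lower().split("-")[0] == preferred:
            return text
    # 2) Otherwise fall back to an untagged literal.
    for text, lang in candidates:
        if lang is None:
            return text
    # 3) Otherwise return whatever is available, or None if there is nothing.
    return candidates[0][0] if candidates else None


labels = [("多细胞生物", "zh"), ("multicellular organism", "en")]
chosen = pick_label(labels)
print(chosen)
```

The fallback order (English, then untagged, then anything) is one reasonable choice; a real build may apply different rules.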

Solution for Builds Prior to v3.0.2

The bad_node_patch.json file contains a dictionary whose outer keys are entity_uri values. Each outer value is another dictionary whose inner keys are label and description/definition, and whose inner values are the updated strings without foreign characters. An example of this dictionary is shown below:
```
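import json

# load the patch file (the path is an assumption; point it at your local copy)
with open('bad_node_patch.json') as f:
    bad_node_patch = json.load(f)

key = '<http://purl.obolibrary.org/obo/UBERON_0000468>'

print(bad_node_patch[key])
# prints: {'label': 'multicellular organism', 'description/definition': 'Anatomical structure that is an individual member of a species and consists of more than one cell.'}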
```
The code to identify the nodes with erroneous foreign characters is shown below:
```
import re
import pandas as pd

# link to downloaded `NodeLabels.txt` file
input_file = 'NodeLabels.txt'

# load data as Pandas DataFrame
nodedf = pd.read_csv(input_file, sep='\t', header=0)

# identify bad nodes and filter DataFrame so it only contains these rows
nodedf['bad'] = nodedf['label'].apply(lambda x: re.search("[\u4e00-\u9FFF]", x) if not pd.isna(x) else None)
nodedf_bad_nodes = nodedf[~pd.isna(nodedf['bad'])].drop_duplicates()
```
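Once the affected nodes are known, the patch dictionary can be applied to the node rows. The sketch below is one possible way to do this with the standard library only. It assumes that `NodeLabels.txt` is tab-separated with columns named `entity_uri`, `label`, and `description/definition` (only `label` is confirmed by the code above), and it uses small in-memory stand-ins instead of the real files. The helper `apply_patch` and the `example.org` rows are hypothetical.

```python
import csv
import io
import json

# Toy stand-in for bad_node_patch.json (structure as described in the text).
patch_json = json.dumps({
    "<http://purl.obolibrary.org/obo/UBERON_0000468>": {
        "label": "multicellular organism",
        "description/definition": "Anatomical structure that is an individual "
                                  "member of a species and consists of more than one cell.",
    }
})

# Toy stand-in for NodeLabels.txt; column names are assumptions.
node_labels_tsv = (
    "entity_uri\tlabel\tdescription/definition\n"
    "<http://purl.obolibrary.org/obo/UBERON_0000468>\t多细胞生物\tplaceholder text\n"
    "<http://example.org/node_ok>\tok node\tplaceholder text\n"
)


def apply_patch(rows, patch):
    """Return patched copies of rows and the number of rows that were changed."""
    patched, count = [], 0
    for row in rows:
        fixes = patch.get(row["entity_uri"])
        if fixes:
            # Overwrite only the fields that exist in the row.
            row = {**row, **{k: v for k, v in fixes.items() if k in row}}
            count += 1
        patched.append(row)
    return patched, count


bad_node_patch = json.loads(patch_json)
rows = list(csv.DictReader(io.StringIO(node_labels_tsv), delimiter="\t"))
patched_rows, n_patched = apply_patch(rows, bad_node_patch)
print(n_patched, patched_rows[0]["label"])
```

With real data, the two in-memory strings would be replaced by reading `bad_node_patch.json` and `NodeLabels.txt` from disk, and the same lookup by entity_uri could be done on the Pandas DataFrame instead.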

Files

Name	Size	Checksum
callahantiff/PheKnowLator-v3.0.2.zip	64.1 MB	md5:46b9b639e9eb6662358e60ebdb33b57f

Additional details

Is supplement to: https://github.com/callahantiff/PheKnowLator/tree/v3.0.2 (URL)
