BojarLab/glycowork: v1.0.0

Daniel Bojar; Jon Lundstrøm; lthomes; Kathryn; rburkholz; viktoriakarlsson

doi:10.5281/zenodo.10255536

Published December 4, 2023 | Version v1.0.0

Software Open

Change Log

Added a Zenodo badge, to have a release-specific doi for glycowork

glycan_data

Updated sugarbase database; sugarbase is now pickled, so literal evaluations are no longer necessary
Harmonized glycan column names across generated dataframes; all use 'glycan' now, 'target' has been deprecated

loader

Updated motif_list to be compatible with new position encoding
Added Internal_LewisX and Internal_LewisA to motif_list (LewisX and LewisA were renamed to Terminal_LewisX and Terminal_LewisA, respectively)
Made df_species static again to speed up package import
Added find_nth_reverse helper function that finds the starting index of the nth occurrence of a substring from the end of the string
Added remove_unmatched_brackets helper function to strip unmatched opening or closing brackets from glycan strings
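
The two helper functions described above can be sketched in plain Python. This is a minimal illustration of the described behavior, not the code shipped in glycowork: the function bodies, the return value of -1 for a missing occurrence, and the assumption that "brackets" means the square brackets used for branches in IUPAC-condensed strings are all choices made for this example.

```python
def find_nth_reverse(string: str, substring: str, n: int) -> int:
    """Return the start index of the n-th occurrence of `substring`,
    counting from the end of `string`; -1 if there are fewer than n."""
    end = len(string)
    idx = -1
    for _ in range(n):
        idx = string.rfind(substring, 0, end)
        if idx == -1:
            return -1
        # The next match must start before this one (overlaps are allowed).
        end = idx + len(substring) - 1
    return idx


def remove_unmatched_brackets(glycan: str) -> str:
    """Drop every '[' or ']' that has no matching partner (assumption:
    only square brackets are treated as brackets here)."""
    keep = [True] * len(glycan)
    open_positions = []
    for i, ch in enumerate(glycan):
        if ch == "[":
            open_positions.append(i)
        elif ch == "]":
            if open_positions:
                open_positions.pop()   # matched pair, keep both
            else:
                keep[i] = False        # closing bracket without an opener
    for i in open_positions:
        keep[i] = False                # openers that were never closed
    return "".join(ch for ch, k in zip(glycan, keep) if k)


glycan = "Gal(b1-4)GlcNAc(b1-2)Man"
last_linkage_start = find_nth_reverse(glycan, "(", 1)
second_last_linkage_start = find_nth_reverse(glycan, "(", 2)
cleaned = remove_unmatched_brackets("Gal(b1-4)[Fuc(a1-3)GlcNAc")
```

Scanning with rfind keeps the reverse search linear in the number of requested occurrences, and the position stack lets balanced branches pass through unchanged.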

motif

Added more masses to mz_to_composition.csv / mass_dict: Acetonitrile, Formate, Cl-, HCO3-, and NH4+

processing

Extended canonicalize_iupac to cases like "NeuGcα3Galβ3(NeuAcα6)GalNAcol" and even more modification formulations, e.g., "6S-GlcNAc"
Added canonicalize_composition to convert compositions formatted either in the style of HexNAc2Hex1Fuc3Neu5Ac1 or N2H1F3A1 into dictionaries used by glycowork
Added GalNAc4S to permitted reducing end monosaccharides for O-linked glycans in enforce_class
MissForest now has a maximum number of iterations and will check for convergence each iteration (immediately finishing upon converging), yielding some speed-ups in most cases
The output of min_process_glycans no longer contains empty strings for glycans ending in a linkage
Updated choose_correct_isoform to be compatible with change in min_process_glycans
Added get_possible_linkages to retrieve linkages matching a wildcarded linkage
Added get_possible_monosaccharides to retrieve monosaccharides matching a monosaccharide type (HexNAc, etc.)
Added decorators, rescue_glycans and rescue_compositions, to canonicalize them in case a decorated function errors out
Added linearcode_to_iupac to support LinearCode as an input format for glycowork (called within canonicalize_iupac and the decorators); note that coverage may not yet be complete
Added iupac_extended_to_condensed to support IUPAC-extended as an input format for glycowork (called within canonicalize_iupac and the decorators); note that coverage may not yet be complete
Added glycoct_to_iupac to support GlycoCT as an input format for glycowork (called within canonicalize_iupac and the decorators); note that coverage may not yet be complete
Added wurcs_to_iupac to support WURCS as an input format for glycowork (called within canonicalize_iupac and the decorators); note that coverage may not yet be complete
Added oxford_to_iupac to support Oxford as an input format for glycowork (called within canonicalize_iupac and the decorators); note that coverage is currently limited
check_nomenclature (formerly in motif.tokenization) now handles outputting warning messages for trying to use non-string, non-graph nomenclatures or SMILES with glycowork functions
Expanded find_isomorphs to generate more isomorphic sequence variants, thereby increasing the chances that choose_correct_isoform will have access to the canonical sequence
Fixed a rare issue with canonicalize_iupac where sequences coming from structure_to_basic would sometimes be formatted incorrectly if they contained dHex
Fixed an issue in find_isomorphs in which double branches were not always correctly swapped
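
As a rough illustration of two ideas from this list, the sketch below parses both composition styles into a dictionary and wraps a function so that it retries with the parsed input after a failure. It is a standalone approximation under stated assumptions, not glycowork's canonicalize_composition or rescue_compositions: the names parse_composition, rescue_composition_sketch, and total_residues are hypothetical, and the shorthand mapping is only read off the paired example HexNAc2Hex1Fuc3Neu5Ac1 / N2H1F3A1, so the dictionary keys the library really emits may differ.

```python
import re
from functools import wraps

# Assumed shorthand mapping, inferred from the paired example in the changelog.
SHORTHAND = {"N": "HexNAc", "H": "Hex", "F": "Fuc", "A": "Neu5Ac"}

# Try long names before short ones so "HexNAc" is not read as "Hex" or "H".
NAMES = sorted(set(SHORTHAND) | set(SHORTHAND.values()), key=len, reverse=True)
TOKEN = re.compile("(" + "|".join(map(re.escape, NAMES)) + r")(\d+)")


def parse_composition(comp: str) -> dict:
    """Turn 'HexNAc2Hex1Fuc3Neu5Ac1' or 'N2H1F3A1' into a name -> count dict."""
    pos, out = 0, {}
    while pos < len(comp):
        match = TOKEN.match(comp, pos)
        if match is None:
            raise ValueError(f"cannot parse {comp!r} at position {pos}")
        name = SHORTHAND.get(match.group(1), match.group(1))
        out[name] = out.get(name, 0) + int(match.group(2))
        pos = match.end()
    return out


def rescue_composition_sketch(func):
    """Call `func`; if it raises on a string input, parse the string and retry."""
    @wraps(func)
    def wrapper(comp, *args, **kwargs):
        try:
            return func(comp, *args, **kwargs)
        except Exception:
            if not isinstance(comp, str):
                raise  # nothing to rescue, surface the original error
            return func(parse_composition(comp), *args, **kwargs)
    return wrapper


@rescue_composition_sketch
def total_residues(comp):
    # Expects a dict; a raw string has no .values() and raises AttributeError.
    return sum(comp.values())


long_form = parse_composition("HexNAc2Hex1Fuc3Neu5Ac1")
short_form = parse_composition("N2H1F3A1")
n_residues = total_residues("H3N2")
```

Retrying only after a failure means inputs that are already in the expected format pay no parsing cost.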

analysis

get_heatmap no longer tries to convert data to relative abundances if negative values are detected in the input
All functions in analysis that take dataframes as input now also accept the full filepath to a .csv file instead
Optimized some of the code for readability and speed (everything should be at least a bit faster now)

annotate

get_k_saccharides is now allowed to generate new dynamic motifs with tokens outside of lib (via expand_lib)
annotate_glycan and annotate_dataset now also support narrow wildcards
Fixed an issue in count_unique_subgraphs_of_size_k in which branched motifs were not always correctly formatted (i.e., opening/closing brackets)
get_k_saccharides now outputs dataframes with counts as default and can yield the old nested lists of motifs by setting the new keyword just_motifs to True
Fixed an edge case in which get_k_saccharides sometimes overcounted individual monosaccharides if their strings overlapped
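
One way overlapping strings can inflate monosaccharide counts is plain substring search, where "Gal" is also found inside "GalNAc". Whether this was the exact failure mode in get_k_saccharides is an assumption; the sketch below only contrasts substring counting with token-based counting, and split_monosaccharides is a hypothetical helper written for this example.

```python
import re


def split_monosaccharides(glycan: str) -> list:
    """Split an IUPAC-condensed string into monosaccharide tokens by
    blanking out linkages in parentheses and branch brackets."""
    no_linkages = re.sub(r"\([^)]*\)", " ", glycan)
    return no_linkages.replace("[", " ").replace("]", " ").split()


glycan = "Gal(b1-3)GalNAc"
naive_count = glycan.count("Gal")                         # also hits "GalNAc"
exact_count = split_monosaccharides(glycan).count("Gal")  # whole tokens only
```

Comparing whole tokens instead of substrings keeps each monosaccharide distinct from any longer name that contains it.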

graph

subgraph_isomorphism and compare_glycans now support using wildcards and position encoding at the same time. The 'extra' keyword argument is now deprecated, as the functions auto-detect whether anything has been specified in wildcards and/or termini_list
subgraph_isomorphism and compare_glycans now support automatically inferred narrow wildcards to allow for (i) matching linkages like a1-? to only specified linkages within that group (e.g., a1-3 but not b1-3 etc.) and (ii) matching monosaccharide types like HexNAc to only specified monosaccharides of that type (e.g., GlcNAc but not Glc, etc.)
The wildcard_list keyword argument in all graph & annotation functions is now deprecated as wildcards are inferred automatically via narrow wildcards and native full wildcards (?1-? and Monosaccharide)
subgraph_isomorphism now behaves as expected for testing motifs ending in linkages on glycans ending in linkages
subgraph_isomorphism can now return the matched subgraphs in the input glycan with the new return_matches keyword argument
glycan_to_nxGraph is now decorated with the rescue_glycans decorator, which auto-canonicalizes IUPAC strings if they are not in the format preferred by glycowork
Fixed mismatch of labels and string_labels in categorical_node_match_wildcard
Fixed an issue in subgraph_isomorphism in which, when using positional encoding, sometimes the mirror image of a motif was incorrectly captured if the termini aligned
termini_list within subgraph_isomorphism now only requires the specification of monosaccharide positions
Added expand_termini_list helper function to facilitate the expansion of monosaccharide-only termini_list into full termini_list behind the scenes
Added support for shorthand notation of position encoding, now either 'terminal' or 't' will work
Improved handling of complex branching in graph_to_string; should be fewer unexpected translations now
Fixed an issue in graph_to_string in which induced subgraphs could cause errors due to unexpected or weirdly sorted node indices
Fixed an edge case in which the reducing end could sometimes be calculated as 'internal' when termini='calc' in glycan_to_nxGraph
Deprecated a duplicate character_to_label and string_to_labels
Deprecated categorical_termini_match; the functionality is now handled within categorical_node_match_wildcard
Deprecated the wildcards keyword argument from compare_glycans as this will now be detected internally, if wildcards are provided via wildcard_list
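
The narrow-wildcard behavior described in this section can be approximated with two small matching functions. This is a sketch of the matching rules as stated in the changelog, not the logic inside subgraph_isomorphism: the function names are hypothetical, '?' is assumed to stand for exactly one character, and MONO_TYPES is a deliberately small illustrative subset of monosaccharide classes.

```python
import re

# Illustrative subset only; the mapping used by glycowork is larger.
MONO_TYPES = {
    "HexNAc": {"GlcNAc", "GalNAc", "ManNAc"},
    "Hex": {"Glc", "Gal", "Man"},
}


def linkage_matches(pattern: str, linkage: str) -> bool:
    """True if `linkage` (e.g. 'a1-3') is covered by `pattern`, where each
    '?' in the pattern stands for one unspecified character."""
    regex = "".join("." if ch == "?" else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, linkage) is not None


def possible_linkages(pattern: str, known: list) -> list:
    """Return the known linkages that a wildcarded linkage could stand for."""
    return sorted(l for l in known if linkage_matches(pattern, l))


def monosaccharide_matches(pattern: str, mono: str) -> bool:
    """Match a monosaccharide against a name, a type, or the full wildcard."""
    if pattern == "Monosaccharide":      # full wildcard named in the changelog
        return True
    if pattern in MONO_TYPES:            # narrow wildcard: members of the type
        return mono in MONO_TYPES[pattern]
    return pattern == mono               # plain names must match exactly


candidates = possible_linkages("a1-?", ["a1-2", "a1-3", "b1-3", "b1-4"])
```

Because the wildcard is read from the pattern itself, no separate list of wildcards has to be passed in, which mirrors the auto-detection described above.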

tokenization

Composition functions (e.g., composition_to_mass) are now decorated with rescue_compositions, which means that they can be used with compositions like "H3N2" (basically anything that canonicalize_composition can handle)
Deprecated character_to_label as it's now handled within string_to_labels
Moved check_nomenclature into motif.processing
Optimized some of the code for readability and speed (most things should be at least a bit faster now)

draw

Support motif highlighting in GlycoDraw: by providing the highlight_motif keyword argument, motifs can be highlighted (everything else will be set to low opacity). Works with IUPAC-condensed motifs and named motifs from known
Support wildcards in motif highlighting with the highlight_wildcard_list keyword argument, for instance highlighting all Gal(?1-?)GlcNAc subunits (for Gal(b1-?)GlcNAc you don't need highlight_wildcard_list, as narrow wildcards are handled automatically)
Support positional encoding in motif highlighting with the highlight_termini_list keyword argument, for instance highlighting all terminal, non-reducing end Gal(b1-?)GlcNAc subunits (yes, you can use both wildcards and positional encoding at the same time😊)
Support drawing of repeat structures (indicated by brackets and the number of repeats) via the new repeat keyword argument. Internal repeats can also be specified with the additional repeat_range keyword argument.
Optimized some of the code for readability and speed (most things should be at least a bit faster now)

network

biosynthesis

Optimized some of the code for readability and speed (everything should be up to 2x faster now)

evolution

Optimized some of the code for readability and speed (everything should be at least a bit faster now)

ml

Optimized some of the code for readability and speed (most things should be at least a bit faster now)

Files

BojarLab/glycowork-v1.0.0.zip (106.8 MB), md5:025d924fa6605c402868335e7d2bd37d

Additional details

Is supplement to: Software: https://github.com/BojarLab/glycowork/tree/v1.0.0 (URL)