Developer Reference

ClubCpG APIs

class clubcpg.ParseBam.BamFileReadParser(bamfile, quality_score, read1_5=None, read1_3=None, read2_5=None, read2_3=None, no_overlap=True)[source]

Used to simplify the opening and reading from BAM files. BAMs must be coordinate sorted and indexed.

Example
>>> from clubcpg.ParseBam import BamFileReadParser
>>> parser = BamFileReadParser("/path/to/data.BAM", quality_score=20, read1_5=3, read1_3=4, read2_5=7, read2_3=1)
>>> reads = parser.parse_reads("chr7", 10000, 101000)
>>> matrix = parser.create_matrix(reads)
__init__(bamfile, quality_score, read1_5=None, read1_3=None, read2_5=None, read2_3=None, no_overlap=True)[source]

Class used to read WGBSeq reads from a BAM file, extract methylation calls, and convert them into a data frame.

Parameters
  • bamfile – Path to bam file location

  • quality_score – Only include reads >= this fastq quality

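  • read1_5 – M-bias trimming: positions to ignore at the 5’ end of read 1

  • read1_3 – M-bias trimming: positions to ignore at the 3’ end of read 1

  • read2_5 – M-bias trimming: positions to ignore at the 5’ end of read 2

  • read2_3 – M-bias trimming: positions to ignore at the 3’ end of read 2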

  • no_overlap – bool. If overlap exists between two reads, ignore that region from read 2.
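To illustrate what the four M-bias parameters are for, here is a minimal sketch of end trimming. It assumes each parameter is a count of positions to ignore at the corresponding end of a read, and that per-base values are listed in 5’ to 3’ order. The helper name trim_mbias is hypothetical and is not part of ClubCpG.

```python
def trim_mbias(calls, ignore_5=None, ignore_3=None):
    """Drop positions from both ends of a read.

    calls: per-base values in 5' -> 3' order (assumed layout).
    ignore_5 / ignore_3: positions to ignore at each end; None means no trimming.
    """
    start = ignore_5 or 0
    end = len(calls) - (ignore_3 or 0)
    # Guard against trimming more positions than the read contains
    return calls[start:max(start, end)]


# An 8-base read with 3 positions ignored at the 5' end and 4 at the 3' end
trimmed = trim_mbias(list("ABCDEFGH"), ignore_5=3, ignore_3=4)
```

With None for both parameters the read is returned unchanged, which mirrors the None defaults in the signature.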

create_matrix(read_cpgs)[source]

Converts parsed reads into a pandas DataFrame.

Parameters

read_cpgs (iterable) – read CpGs generated by self.parse_reads

Returns

Matrix of methylated (1) and unmethylated (0) states

Return type

pd.DataFrame
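As a rough picture of such a matrix, the sketch below arranges reads as rows and CpG positions as columns. It uses plain Python lists instead of pandas so that it runs without dependencies. The input shape (one dict of position to state per read), the helper name build_matrix, and the use of None for CpGs a read does not cover are all assumptions for illustration, not the format ClubCpG uses internally.

```python
def build_matrix(read_cpgs):
    """Arrange per-read CpG calls into a rectangular reads x positions table.

    read_cpgs: list of dicts mapping CpG position -> 1 (methylated) or 0 (unmethylated).
    Returns (positions, rows), where rows[i][j] is the call of read i at
    positions[j], or None when read i does not cover that CpG.
    """
    positions = sorted({pos for read in read_cpgs for pos in read})
    rows = [[read.get(pos) for pos in positions] for read in read_cpgs]
    return positions, rows


# Hypothetical calls for three reads over three CpG sites
reads = [
    {10010: 1, 10025: 1, 10060: 0},
    {10010: 0, 10025: 0},
    {10025: 1, 10060: 1},
]
positions, rows = build_matrix(reads)
```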

fix_read_overlap(full_reads, read_cpgs)[source]

Takes pysam reads and the read_cpgs generated by parse_reads and removes any overlap between read1 and read2. Where possible, it also stitches read1 and read2 together to create a super read.

Parameters
  • full_reads – set of reads generated by self.parse_reads()

  • read_cpgs – read CpGs generated by self.parse_reads()

Returns

A list in the same format as read_cpgs input, but corrected for paired read overlap
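The overlap handling can be pictured with the following sketch. It assumes each mate's CpG calls are available as a dict of position to state, which is a simplification and not ClubCpG's read_cpgs format. The helper merge_mates is hypothetical. Where both mates cover the same CpG, the call from read 1 is kept and the read 2 call is dropped; the remaining calls are stitched into one super read.

```python
def merge_mates(read1_cpgs, read2_cpgs):
    """Combine the CpG calls of a read pair into a single super read."""
    merged = dict(read1_cpgs)
    for pos, state in read2_cpgs.items():
        # setdefault leaves read 1's call untouched at overlapping positions
        merged.setdefault(pos, state)
    # Return the calls ordered by genomic position
    return dict(sorted(merged.items()))


read1 = {100: 1, 120: 0, 140: 1}
read2 = {140: 0, 160: 1}  # position 140 overlaps with read 1
super_read = merge_mates(read1, read2)
```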

parse_reads(chromosome: str, start: int, stop: int)[source]
Parameters
  • chromosome – chromosome as “chr6”

  • start – start coordinate

  • stop – end coordinate

Returns

List of reads and their positional tags as assigned by Bismark

class clubcpg.ClusterReads.ClusterReads(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True)[source]

This class is used to take a dataframe or matrix of reads and cluster them

Example

>>> from clubcpg.ClusterReads import ClusterReads
>>> cluster = ClusterReads(bam_a="/path/to/file.bam", bam_b="/path/to/file.bam", bins_file="/path/to/file.csv", suffix="chr19")
>>> cluster.execute()
__init__(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True)[source]

Initialize self. See help(type(self)) for accurate signature.

execute(return_only=False)[source]

This method will start multiprocessing execution of this class.

Parameters

return_only (bool) – Whether to return the results as a variable (True) or write them to file (False)

Returns

list of lists if return_only is True, otherwise None

Return type

list or None

filter_data_frame(matrix: pandas.core.frame.DataFrame)[source]

Takes a dataframe of clusters and removes any groups with fewer than self.cluster_member_min members

Parameters

matrix – dataframe of clustered reads

Type

pd.DataFrame

Returns

input matrix with some clusters removed
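The filtering rule can be sketched without pandas as follows. The sketch assumes each read carries a cluster label and represents the data as (read, label) tuples; that layout and the helper filter_small_clusters are hypothetical. The default of 4 matches cluster_member_min in the class signature.

```python
from collections import Counter


def filter_small_clusters(labelled_reads, cluster_member_min=4):
    """Keep only reads whose cluster has at least cluster_member_min members."""
    sizes = Counter(label for _, label in labelled_reads)
    return [
        (read, label)
        for read, label in labelled_reads
        if sizes[label] >= cluster_member_min
    ]


# Cluster 0 has four members, cluster 1 has only two
labelled = [("r1", 0), ("r2", 0), ("r3", 0), ("r4", 0), ("r5", 1), ("r6", 1)]
kept = filter_small_clusters(labelled)
```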

generate_individual_matrix_data(filtered_matrix, chromosome, bin_loc)[source]

Takes the output of process_bins() and converts it into a list of lines of text data for output

Parameters
  • filtered_matrix (pd.DataFrame) – dataframe returned by ClusterReads.filter_data_frame()

  • chromosome (string) – chromosome as “Chr5”

  • bin_loc (string) – location representing the bin given as the end coordinate, e.g. 590000

Returns

comma separated lines extracted from the filtered matrix, containing chromosome and bin info

Return type

list

process_bins(bin)[source]

This is the main method and should be called using Pool.map. It takes one bin location and uses the other helper functions to get the reads, form the matrix, cluster it with DBSCAN, and output the cluster data as text lines ready for writing to a file.

Parameters

bin – string in this format: “chr19_55555”

Returns

a list of lines representing the cluster data from that bin
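A bin label such as “chr19_55555” can be split into its parts as sketched below. The sketch assumes the number is the bin's end coordinate and that the start is the end minus bin_size (100 by default in the class signature). The helper parse_bin is hypothetical and may differ from how ClubCpG derives the coordinates.

```python
def parse_bin(bin_label, bin_size=100):
    """Split a label like 'chr19_55555' into (chromosome, start, stop)."""
    # Split on the last underscore so names such as 'chrUn_KI270742v1' survive
    chromosome, end = bin_label.rsplit("_", 1)
    stop = int(end)
    start = stop - bin_size  # assumption: the label stores the bin's end
    return chromosome, start, stop


chromosome, start, stop = parse_bin("chr19_55555")
```

The resulting triple matches the argument order of BamFileReadParser.parse_reads(chromosome, start, stop).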

class clubcpg.ClusterReads.ClusterReadsWithImputation(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, models_A=None, models_B=None, chunksize=10000)[source]

This class performs the same clustering, but also enables imputation during clustering. It inherits from ClusterReads.

Example

>>> from clubcpg.ClusterReads import ClusterReadsWithImputation
>>> cluster = ClusterReadsWithImputation(...)
>>> cluster.execute()
__init__(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, models_A=None, models_B=None, chunksize=10000)[source]

Initialize self. See help(type(self)) for accurate signature.

execute(return_only=False)[source]

This method will start multiprocessing execution of this class.

Parameters

return_only (bool) – Whether to return the results as a variable (True) or write them to file (False)

Returns

list of lists if return_only is True, otherwise None

Return type

list or None

class clubcpg.ConnectToCpGNet.TrainWithPReLIM(cpg_density=None, save_path=None)[source]

Used to train models using CpGNet

__init__(cpg_density=None, save_path=None)[source]

Class to train a CpGNet model from input data

Parameters
  • cpg_density (int) – Number of CpGs

  • save_path – Location of folder to save the resulting model files. One per cpg density

save_net(model)[source]

Save the network to a file

Parameters

model (clubcpg_prelim.PReLIM) – The trained PReLIM model. Located at PReLIM.model

Returns

Path to the saved model

train_model(bins: iter)[source]

Train the CpGNet model on a list of provided bins

Parameters

bins – iterable containing CpG matrices of 1 (methylated), 0 (unmethylated), and -1 (unknown)

Returns

Path to the saved model file

class clubcpg.Imputation.Imputation(cpg_density: int, bam_file: str, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, processes=-1)[source]

The class providing convenient APIs to train models and impute from models using PReLIM

__init__(cpg_density: int, bam_file: str, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, processes=-1)[source]

Parameters
  • cpg_density (int) – Number of CpGs this class instance will be used for

  • bam_file (str) – Path to the BAM file

Keyword Arguments
  • mbias_read1_5 – M-bias value for the 5’ end of read 1 (default: None)

  • mbias_read1_3 – M-bias value for the 3’ end of read 1 (default: None)

  • mbias_read2_5 – M-bias value for the 5’ end of read 2 (default: None)

  • mbias_read2_3 – M-bias value for the 3’ end of read 2 (default: None)

  • processes (int) – Number of CPUs to use when parallelization can be utilized (default: -1, meaning all available)
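The meaning of processes=-1 can be sketched as follows. The helper resolve_processes is hypothetical; it only illustrates the common convention of mapping -1 to the number of available CPUs and is not taken from the ClubCpG source.

```python
import os


def resolve_processes(processes=-1):
    """Translate the processes argument into a concrete worker count."""
    if processes == -1:
        # os.cpu_count() can return None on some platforms, so fall back to 1
        return os.cpu_count() or 1
    if processes < 1:
        raise ValueError("processes must be -1 or a positive integer")
    return processes


workers = resolve_processes()
```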

extract_matrices(coverage_data_frame: pandas.core.frame.DataFrame, sample_limit: int = None, return_bins=False)[source]

Extract CpG matrices from bam file.

Parameters

coverage_data_frame (pd.DataFrame) – Output of clubcpg-coverage read in as a csv file

Keyword Arguments

return_bins (bool) – Return the bin location along with the matrix (default: False)

Returns

tuple – Returns a tuple of (bin, np.array) if return_bins=True, otherwise returns only the np.array

impute_from_model(models_folder: str, matrices: iter, postprocess=True)[source]

Generator to provide imputed matrices on-the-fly

Parameters
  • models_folder (str) – Path to directory containing trained CpGNet models

  • matrices (iter) – An iterable containing n x m matrices with n=cpgs and m=reads

Keyword Arguments

postprocess (bool) – Round imputed values to 1s and 0s (default: True)

static postprocess_predictions(predicted_matrix)[source]

Takes array with predicted values and rounds them to 0 or 1 if threshold is exceeded

Parameters

predicted_matrix – matrix generated by imputation

Returns

Predicted matrix with predictions as 1, 0, or NaN
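A thresholded rounding of this kind might look like the sketch below. The cutoffs of 0.2 and 0.8 and the helper name round_predictions are assumptions chosen for illustration; the thresholds ClubCpG actually applies are not stated here. Values between the two cutoffs are treated as uncertain and become NaN.

```python
import math


def round_predictions(matrix, lower=0.2, upper=0.8):
    """Round confident predictions to 0.0 or 1.0 and mark the rest as NaN."""

    def call(p):
        if p >= upper:
            return 1.0
        if p <= lower:
            return 0.0
        return math.nan  # not confident enough either way

    return [[call(p) for p in row] for row in matrix]


predicted = [[0.95, 0.10, 0.55], [0.80, 0.20, 0.79]]
rounded = round_predictions(predicted)
```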

train_model(output_folder: str, matrices: iter)[source]

Train a CpGNet model using TrainWithPReLIM

Parameters
  • output_folder (str) – Folder to save trained models

  • matrices (iter) – An iterable of CpG matrices, ideally obtained through Imputation.extract_matrices()

Returns

keras model – Returns the trained CpGNet model

PReLIM APIs

class clubcpg_prelim.PReLIM.CpGBin(matrix, binStartInc=None, binEndInc=None, cpgPositions=None, sequence='', encoding=None, missingToken=-1, chromosome=None, binSize=100, species='MM10', verbose=True, tag1=None, tag2=None)[source]

Constructor for a bin

__init__(matrix, binStartInc=None, binEndInc=None, cpgPositions=None, sequence='', encoding=None, missingToken=-1, chromosome=None, binSize=100, species='MM10', verbose=True, tag1=None, tag2=None)[source]
Parameters
  • matrix – numpy array, the bin’s CpG matrix.

  • binStartInc – integer, the starting, inclusive, chromosomal index of the bin.

  • binEndInc – integer, the ending, inclusive, chromosomal index of the bin.

  • cpgPositions – array of integers, the chromosomal positions of the CpGs in the bin.

  • sequence – string, nucleotide sequence (A,C,G,T)

  • encoding – array, a reduced representation of the bin’s CpG matrix

  • missingToken – integer, the token that represents missing data in the matrix.

  • chromosome – string, the chromosome this bin resides in.

  • binSize – integer, the number of base pairs this bin covers

  • species – string, the species this bin belongs to.

  • verbose – boolean, print warnings, set to “false” for no error checking and faster speed

  • tag1 – anything, for custom use.

  • tag2 – anything, for custom use.

class clubcpg_prelim.PReLIM.PReLIM(cpgDensity=2)[source]

PReLIM imputation class to handle training and predicting from models.

__init__(cpgDensity=2)[source]

Initialize self. See help(type(self)) for accurate signature.

fit(X_train, y_train, n_estimators=[10, 50, 100, 500, 1000], cores=-1, max_depths=[1, 5, 10, 20, 30], model_file=None, verbose=False)[source]

Inputs:

1. X_train, numpy array, contains feature vectors.

2. y_train, numpy array, contains labels for training data.

3. n_estimators, list, the number of estimators to try during a grid search.

4. max_depths, list, the maximum depths of trees to try during a grid search.

5. cores, the number of cores to use during training, helpful for grid search.

6. model_file, string, the name of the file to save the model to. If None, then create a file name that includes a timestamp. If you don’t want to save a file, set this to “no”.

5-fold validation is built into the grid search

Outputs: The trained model

Usage: model.fit(X_train, y_train)
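To give a sense of the cost of the default grid, the sketch below enumerates the parameter combinations. It assumes an exhaustive grid search in which every combination is fitted once per fold, and it does not count any final refit on the full training set.

```python
from itertools import product

# Default search values from the fit() signature
n_estimators = [10, 50, 100, 500, 1000]
max_depths = [1, 5, 10, 20, 30]
n_folds = 5  # 5-fold validation

# Every (n_estimators, max_depth) pair that would be evaluated
grid = list(product(n_estimators, max_depths))
total_fits = len(grid) * n_folds
```

Passing shorter lists for n_estimators and max_depths reduces the number of fits proportionally.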

impute(matrix)[source]

Inputs: 1. matrix, a 2d np array, dtype=float, representing a CpG matrix, 1=methylated, 0=unmethylated, -1=unknown

Outputs: 1. A 2d numpy array with predicted probabilities of methylation

impute_many(matrices)[source]

Imputes a bunch of matrices at the same time to help speed up imputation time.

Inputs:

1. matrices: array-like (i.e. list), where each element is a 2d np array, dtype=float, representing a CpG matrix, 1=methylated, 0=unmethylated, -1=unknown

Outputs:

  1. A List of 2d numpy arrays with predicted probabilities of methylation for unknown values.
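The entries that imputation targets are those marked as unknown. The sketch below locates them using plain lists rather than numpy arrays so that it runs without dependencies. The helper unknown_positions is hypothetical, and the default token of -1 follows the encoding described above.

```python
def unknown_positions(matrix, missing_token=-1):
    """Return (row, column) indices of the entries marked as unknown."""
    return [
        (r, c)
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value == missing_token
    ]


# 1 = methylated, 0 = unmethylated, -1 = unknown
matrix = [
    [1.0, -1.0, 0.0],
    [-1.0, 1.0, 1.0],
]
to_impute = unknown_positions(matrix)
```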

loadWeights(model_file)[source]

Inputs: 1. model_file, string, name of file with a saved model

Outputs: None

Effects: self.model is loaded with the provided weights

predict(X)[source]

Inputs: 1. X, numpy array, contains feature vectors

Outputs: 1. 1-d numpy array of predicted class labels

Usage: y_pred = CpGNet.predict(X)

predict_classes(X)[source]

Inputs: 1. X, numpy array, contains feature vectors

Outputs: 1. 1-d numpy array of prediction values

Usage: y_pred = CpGNet.predict_classes(X)

predict_proba(X)[source]

Inputs: 1. X, numpy array, contains feature vectors

Outputs: 1. 1-d numpy array of class predictions

Usage: y_pred = CpGNet.predict_proba(X)

train(bin_matrices, model_file='no', verbose=False)[source]

bin_matrices: list of cpg matrices

model_file, string, The name of the file to save the model to.

If None, then create a file name that includes a timestamp. If you don’t want to save a file, set this to “no”
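The three cases for model_file can be summarised in a small sketch. The helper resolve_model_file and the generated file name pattern are hypothetical; the actual name PReLIM generates may differ.

```python
from datetime import datetime


def resolve_model_file(model_file=None, now=None):
    """Decide where a trained model should be saved, if at all.

    "no"  -> None, meaning the model is not saved
    None  -> a generated name containing a timestamp
    other -> the given name, unchanged
    """
    if model_file == "no":
        return None
    if model_file is None:
        now = now or datetime.now()
        # Hypothetical naming pattern, for illustration only
        return "model_" + now.strftime("%Y%m%d_%H%M%S") + ".pkl"
    return model_file


fixed_time = datetime(2020, 1, 2, 3, 4, 5)
generated = resolve_model_file(None, now=fixed_time)
```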