Developer Reference

ClubCpG APIs
class clubcpg.ParseBam.BamFileReadParser(bamfile, quality_score, read1_5=None, read1_3=None, read2_5=None, read2_3=None, no_overlap=True)

Used to simplify opening and reading from BAM files. BAM files must be coordinate sorted and indexed.
Example:

>>> from clubcpg.ParseBam import BamFileReadParser
>>> parser = BamFileReadParser("/path/to/data.BAM", quality_score=20, read1_5=3, read1_3=4, read2_5=7, read2_3=1)
>>> reads = parser.parse_reads("chr7", 10000, 101000)
>>> matrix = parser.create_matrix(reads)
__init__(bamfile, quality_score, read1_5=None, read1_3=None, read2_5=None, read2_3=None, no_overlap=True)

Read whole-genome bisulfite sequencing (WGBS) reads from a BAM file, extract methylation calls, and convert them into a data frame.

Parameters:
    bamfile – path to the BAM file
    quality_score – only include reads with a fastq quality score >= this value
    read1_5 – M-bias: number of bases to ignore at the 5' end of read 1
    read1_3 – M-bias: number of bases to ignore at the 3' end of read 1
    read2_5 – M-bias: number of bases to ignore at the 5' end of read 2
    read2_3 – M-bias: number of bases to ignore at the 3' end of read 2
    no_overlap – bool; if read 1 and read 2 overlap, ignore the overlapping region of read 2
create_matrix(read_cpgs)

Convert parsed reads into a pandas DataFrame.

Parameters:
    read_cpgs (iterable) – read CpGs generated by self.parse_reads()

Returns:
    matrix of methylated (1) and unmethylated (0) states

Return type:
    pd.DataFrame
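The exact layout of the returned DataFrame is not spelled out above, so the following sketch is illustrative only; the row/column orientation, the coordinate labels, and the use of NaN for uncovered CpGs are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame create_matrix() returns:
# rows = reads, columns = CpG coordinates (made up here);
# 1 = methylated, 0 = unmethylated, NaN = CpG not covered by that read.
matrix = pd.DataFrame(
    [[1.0, 0.0, np.nan],
     [1.0, 1.0, 1.0],
     [np.nan, 0.0, 0.0]],
    columns=[10010, 10025, 10042],
)

# Per-CpG methylation fraction; pandas skips NaN values by default.
methylation_per_cpg = matrix.mean()
```

A frame of this shape is what downstream clustering operates on: reads with similar row patterns end up in the same cluster.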
fix_read_overlap(full_reads, read_cpgs)

Takes pysam reads and the read CpGs generated during parse_reads() and removes any overlap between read 1 and read 2. Where possible, it also stitches read 1 and read 2 together into a single "super read".

Parameters:
    full_reads – set of reads generated by self.parse_reads()
    read_cpgs – per-read CpG calls generated by self.parse_reads()

Returns:
    a list in the same format as the read_cpgs input, corrected for paired-read overlap
class clubcpg.ClusterReads.ClusterReads(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True)

Takes a dataframe or matrix of reads and clusters them.

Example:

>>> from clubcpg.ClusterReads import ClusterReads
>>> cluster = ClusterReads(bam_a="/path/to/file.bam", bam_b="/path/to/file.bam", bins_file="/path/to/file.csv", suffix="chr19")
>>> cluster.execute()
__init__(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True)

Initialize self. See help(type(self)) for the accurate signature.
execute(return_only=False)

Start multiprocessing execution of this class.

Parameters:
    return_only (bool) – whether to return the results as a variable (True) or write them to a file (False)

Returns:
    list of lists if return_only is True, otherwise None

Return type:
    list or None
filter_data_frame(matrix: pandas.core.frame.DataFrame)

Takes a dataframe of clusters and removes any cluster with fewer than self.cluster_member_min members.

Parameters:
    matrix (pd.DataFrame) – dataframe of clustered reads

Returns:
    the input matrix with those clusters removed
generate_individual_matrix_data(filtered_matrix, chromosome, bin_loc)

Takes the output of process_bins() and converts it into a list of lines of text data for output.

Parameters:
    filtered_matrix (pd.DataFrame) – dataframe returned by ClusterReads.filter_data_frame()
    chromosome (string) – chromosome, e.g. "Chr5"
    bin_loc (string) – location representing the bin, given as its end coordinate, e.g. 590000

Returns:
    comma-separated lines extracted from the filtered matrix, containing chromosome and bin info

Return type:
    list
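As a rough sketch of this flattening step, the helper below turns a toy clustered matrix into comma-separated lines. The column names, cluster-label column, and field order are all assumptions for illustration, not ClubCpG's actual output format:

```python
import pandas as pd

# Hypothetical clustered matrix: rows are reads, "class" holds the
# cluster label (these names are assumptions, not ClubCpG internals).
matrix = pd.DataFrame(
    {"cpg_0": [1, 1, 1, 0], "cpg_1": [0, 1, 1, 0], "class": [0, 0, 0, 1]}
)

def matrix_to_lines(matrix, chromosome, bin_loc):
    """Flatten each cluster into one comma-separated line of text."""
    lines = []
    for label, group in matrix.groupby("class"):
        # Consensus methylation pattern for the cluster (majority vote).
        pattern = group.drop(columns="class").mean().round().astype(int)
        fields = [chromosome, str(bin_loc), str(label), str(len(group))]
        lines.append(",".join(fields + pattern.astype(str).tolist()))
    return lines

lines = matrix_to_lines(matrix, "chr5", 590000)
# lines -> ["chr5,590000,0,3,1,1", "chr5,590000,1,1,0,0"]
```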
process_bins(bin)

This is the main method and should be called using Pool.map(). It takes one bin location and uses the other helper functions to get the reads, form the matrix, cluster it with DBSCAN, and output the cluster data as text lines ready for writing to a file.

Parameters:
    bin – string in the format "chr19_55555"

Returns:
    a list of lines representing the cluster data from that bin
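Splitting such a bin label back into its parts is straightforward; the helper below is ours for illustration, not part of the ClubCpG API:

```python
def parse_bin(bin_id: str):
    """Split a bin label like "chr19_55555" into (chromosome, end coordinate).

    rsplit on the last underscore keeps chromosome names containing
    underscores (e.g. scaffolds) intact.
    """
    chromosome, end = bin_id.rsplit("_", 1)
    return chromosome, int(end)

chromosome, end = parse_bin("chr19_55555")  # -> ("chr19", 55555)
```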
class clubcpg.ClusterReads.ClusterReadsWithImputation(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, models_A=None, models_B=None, chunksize=10000)

Performs the same clustering, but also adds the ability to perform imputation during clustering. Inherits from ClusterReads.

Example:

>>> from clubcpg.ClusterReads import ClusterReadsWithImputation
>>> cluster = ClusterReadsWithImputation(...)
>>> cluster.execute()
__init__(bam_a: str, bam_b=None, bin_size=100, bins_file=None, output_directory=None, num_processors=1, cluster_member_min=4, read_depth_req=10, remove_noise=True, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, suffix='', no_overlap=True, models_A=None, models_B=None, chunksize=10000)

Initialize self. See help(type(self)) for the accurate signature.

execute(return_only=False)

Start multiprocessing execution of this class.

Parameters:
    return_only (bool) – whether to return the results as a variable (True) or write them to a file (False)

Returns:
    list of lists if return_only is True, otherwise None

Return type:
    list or None
class clubcpg.ConnectToCpGNet.TrainWithPReLIM(cpg_density=None, save_path=None)

Used to train models using CpGNet.

__init__(cpg_density=None, save_path=None)

Train a CpGNet model from input data.

Parameters:
    cpg_density (int) – number of CpGs
    save_path – folder in which to save the resulting model files, one per CpG density
save_net(model)

Save the network to a file.

Parameters:
    model (clubcpg_prelim.PReLIM) – the trained PReLIM model, located at PReLIM.model

Returns:
    path to the saved model
class clubcpg.Imputation.Imputation(cpg_density: int, bam_file: str, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, processes=-1)

Provides convenient APIs to train models and impute from models using PReLIM.

__init__(cpg_density: int, bam_file: str, mbias_read1_5=None, mbias_read1_3=None, mbias_read2_5=None, mbias_read2_3=None, processes=-1)

Parameters:
    cpg_density (int) – number of CpGs this class instance will be used for
    bam_file (str) – path to the BAM file

Keyword Arguments:
    mbias_read1_5 – M-bias trimming for the 5' end of read 1 (default: None)
    mbias_read1_3 – M-bias trimming for the 3' end of read 1 (default: None)
    mbias_read2_5 – M-bias trimming for the 5' end of read 2 (default: None)
    mbias_read2_3 – M-bias trimming for the 3' end of read 2 (default: None)
    processes (int) – number of CPUs to use when parallelization can be utilized (default: -1, all available)
extract_matrices(coverage_data_frame: pandas.core.frame.DataFrame, sample_limit: int = None, return_bins=False)

Extract CpG matrices from the BAM file.

Parameters:
    coverage_data_frame (pd.DataFrame) – output of clubcpg-coverage, read in as a CSV file

Keyword Arguments:
    return_bins (bool) – return the bin location along with the matrix (default: False)

Returns:
    tuple – (bin, np.array) if return_bins is True, otherwise only np.array
impute_from_model(models_folder: str, matrices: iter, postprocess=True)

Generator that provides imputed matrices on the fly.

Parameters:
    models_folder (str) – path to the directory containing trained CpGNet models
    matrices (iter) – an iterable containing n x m matrices, with n = CpGs and m = reads

Keyword Arguments:
    postprocess (bool) – round imputed values to 1s and 0s (default: True)
static postprocess_predictions(predicted_matrix)

Takes an array of predicted values and rounds each to 0 or 1 if the threshold is exceeded.

Parameters:
    predicted_matrix – matrix generated by imputation

Returns:
    the predicted matrix with predictions as 1, 0, or NaN
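A hedged re-implementation sketch of this rounding step. The actual thresholds used by postprocess_predictions are not documented here, so the 0.8/0.2 cutoffs below are illustrative assumptions:

```python
import numpy as np

def postprocess_sketch(predicted, upper=0.8, lower=0.2):
    """Round probabilities to hard calls; leave ambiguous values as NaN.

    The upper/lower cutoffs are assumptions, not ClubCpG's actual values.
    """
    out = np.full(predicted.shape, np.nan)
    out[predicted >= upper] = 1.0  # confidently methylated
    out[predicted <= lower] = 0.0  # confidently unmethylated
    return out

calls = postprocess_sketch(np.array([[0.95, 0.5, 0.1]]))
# calls -> [[1.0, nan, 0.0]]
```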
train_model(output_folder: str, matrices: iter)

Train a CpGNet model using TrainWithPReLIM.

Parameters:
    output_folder (str) – folder in which to save the trained models
    matrices (iter) – an iterable of CpG matrices, ideally obtained through Imputation.extract_matrices()

Returns:
    keras model – the trained CpGNet model
PReLIM APIs

class clubcpg_prelim.PReLIM.CpGBin(matrix, binStartInc=None, binEndInc=None, cpgPositions=None, sequence='', encoding=None, missingToken=-1, chromosome=None, binSize=100, species='MM10', verbose=True, tag1=None, tag2=None)

Constructor for a bin.
__init__(matrix, binStartInc=None, binEndInc=None, cpgPositions=None, sequence='', encoding=None, missingToken=-1, chromosome=None, binSize=100, species='MM10', verbose=True, tag1=None, tag2=None)

Parameters:
    matrix – numpy array, the bin's CpG matrix
    binStartInc – integer, the starting, inclusive, chromosomal index of the bin
    binEndInc – integer, the ending, inclusive, chromosomal index of the bin
    cpgPositions – array of integers, the chromosomal positions of the CpGs in the bin
    sequence – string, nucleotide sequence (A, C, G, T)
    encoding – array, a reduced representation of the bin's CpG matrix
    missingToken – integer, the token that represents missing data in the matrix
    chromosome – string, the chromosome this bin resides in
    binSize – integer, the number of base pairs this bin covers
    species – string, the species this bin belongs to
    verbose – boolean, print warnings; set to False for no error checking and faster speed
    tag1 – anything, for custom use
    tag2 – anything, for custom use
class clubcpg_prelim.PReLIM.PReLIM(cpgDensity=2)

PReLIM imputation class handling training and predicting from models.
fit(X_train, y_train, n_estimators=[10, 50, 100, 500, 1000], cores=-1, max_depths=[1, 5, 10, 20, 30], model_file=None, verbose=False)

Inputs:
    1. X_train, numpy array, contains feature vectors.
    2. y_train, numpy array, contains labels for the training data.
    3. n_estimators, list, the numbers of estimators to try during the grid search.
    4. max_depths, list, the maximum tree depths to try during the grid search.
    5. cores, the number of cores to use during training; helpful for the grid search.
    6. model_file, string, the name of the file to save the model to. If None, a file name that includes a timestamp is created. If you don't want to save a file, set this to "no".

5-fold cross-validation is built into the grid search.

Outputs: the trained model

Usage: model.fit(X_train, y_train)
impute(matrix)

Inputs:
    1. matrix, a 2d numpy array, dtype=float, representing a CpG matrix; 1 = methylated, 0 = unmethylated, -1 = unknown

Outputs:
    1. A 2d numpy array with predicted probabilities of methylation
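The input encoding described above can be built with plain numpy. The matrix below is a toy example of that encoding; actually imputing it would additionally require a fitted PReLIM model:

```python
import numpy as np

# One row per read, one column per CpG:
# 1 = methylated, 0 = unmethylated, -1 = unknown (to be imputed).
matrix = np.array(
    [[1.0, -1.0],
     [0.0,  1.0],
     [-1.0, 1.0]],
    dtype=float,
)

n_missing = int((matrix == -1).sum())  # entries PReLIM would fill in -> 2
```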
impute_many(matrices)

Imputes many matrices at once to speed up imputation.

Inputs:
    1. matrices: array-like (i.e. a list), where each element is a 2d numpy array, dtype=float, representing a CpG matrix; 1 = methylated, 0 = unmethylated, -1 = unknown

Outputs:
    A list of 2d numpy arrays with predicted probabilities of methylation for the unknown values.
loadWeights(model_file)

Inputs:
    1. model_file, string, name of a file with a saved model

Outputs: None

Effects: self.model is loaded with the provided weights
predict(X)

Inputs:
    1. X, numpy array, contains feature vectors

Outputs:
    1. 1-d numpy array of predicted class labels

Usage: y_pred = CpGNet.predict(X)

predict_classes(X)

Inputs:
    1. X, numpy array, contains feature vectors

Outputs:
    1. 1-d numpy array of prediction values

Usage: y_pred = CpGNet.predict_classes(X)