bcbioRNASeq is an S4 class that extends RangedSummarizedExperiment, and
is designed to store a bcbio RNA-seq
analysis.
bcbioRNASeq(
  uploadDir,
  level = c("genes", "transcripts"),
  caller = c("salmon", "kallisto", "sailfish", "star", "hisat2"),
  samples = NULL,
  censorSamples = NULL,
  sampleMetadataFile = NULL,
  organism = NULL,
  genomeBuild = NULL,
  ensemblRelease = NULL,
  gffFile = NULL,
  transgeneNames = NULL,
  spikeNames = NULL,
  countsFromAbundance = "lengthScaledTPM",
  interestingGroups = "sampleName",
  fast = FALSE,
  ...
)
| Argument | Description |
|---|---|
| uploadDir | |
| level | |
| caller | |
| samples | |
| censorSamples | |
| sampleMetadataFile | |
| organism | |
| genomeBuild | |
| ensemblRelease | |
| gffFile | |
| transgeneNames | |
| spikeNames | |
| countsFromAbundance | |
| interestingGroups | |
| fast | |
| ... | Additional arguments. |
bcbioRNASeq.
Automatically imports RNA-seq counts, metadata, and the program versions used from a bcbio RNA-seq run. Simply point to the final upload directory generated by bcbio, and this generator function will take care of the rest.
Updated 2019-08-12.
When loading a bcbio RNA-seq run, the sample metadata will be imported
automatically from the project-summary.yaml file in the final upload
directory. If you notice any typos in your metadata after completing the run,
these can be corrected by editing the YAML file.
Alternatively, you can pass a sample metadata file to the
bcbioRNASeq() function call using the sampleMetadataFile argument. This
requires either a CSV file or an Excel spreadsheet.
The samples in the bcbio run must map to the description column. The values
provided in description must be unique. These values will be sanitized into
syntactically valid names (see make.names for more
information), and assigned as the column names of the bcbioRNASeq object.
The original values are stored as the sampleName column in colData, and
are used for all plotting functions. Do not attempt to set a sampleID
column, as this is used internally by the package.
Here is a minimal example of a properly formatted sample metadata file:
| description | genotype |
|---|---|
| sample1 | wildtype |
| sample2 | knockout |
| sample3 | wildtype |
| sample4 | knockout |
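As a minimal sketch of the requirements above, the following base R code writes the example table to a temporary CSV file and checks it before import. The temporary file and the variable names `metadataFile` and `meta` are illustrative assumptions; only the sampleMetadataFile argument comes from this documentation. The final call is left commented out because it needs a real bcbio upload directory, and the path shown there is a placeholder.

```r
## Write the example metadata table to a temporary CSV file (illustrative path).
metadataFile <- tempfile(fileext = ".csv")
writeLines(
    text = c(
        "description,genotype",
        "sample1,wildtype",
        "sample2,knockout",
        "sample3,wildtype",
        "sample4,knockout"
    ),
    con = metadataFile
)

## Read the file back and check the constraints described above:
## a `description` column with unique values, and no `sampleID` column.
meta <- read.csv(metadataFile, stringsAsFactors = FALSE)
stopifnot(
    "description" %in% colnames(meta),
    anyDuplicated(meta[["description"]]) == 0L,
    !"sampleID" %in% colnames(meta)
)

## Hypothetical call; requires a real bcbio final upload directory.
## object <- bcbioRNASeq(
##     uploadDir = "path/to/final",
##     sampleMetadataFile = metadataFile
## )
```

Running these checks before the import can catch a duplicated or missing description value early, rather than partway through loading the run.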
R is strict about values that are considered valid for use in
names() and dimnames() (i.e.
rownames() and colnames()).
Spaces, dashes, and most other non-alphanumeric characters are not valid. Use
either underscores or periods in place of dashes when working in R. Also note
that names should not begin with a number, and will be prefixed with an X
when sanitized. Consult the documentation for the
make.names() function for more information. We strongly
recommend adhering to these conventions when labeling samples, to help avoid
unexpected downstream behavior in R due to dimnames()
mismatches.
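To illustrate these rules, here is a small sketch using base R's make.names() on a few hypothetical sample labels. The labels are made up for this example, and the sanitization applied inside bcbioRNASeq may differ in detail from plain make.names(), so treat this only as a way to preview which labels are likely to be altered.

```r
## Hypothetical sample labels, some of which are not valid R names.
labels <- c("sample-1", "sample 2", "3rd_sample", "sample_4")

## Base R sanitization: invalid characters are replaced with periods, and
## names that start with a number are prefixed with "X".
sanitized <- make.names(labels, unique = TRUE)

## Tabulate which labels were changed by sanitization.
comparison <- data.frame(
    original = labels,
    sanitized = sanitized,
    changed = labels != sanitized,
    stringsAsFactors = FALSE
)
print(comparison)
```

Labels that already use underscores and begin with a letter, such as the last one here, pass through unchanged.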
bcbioRNASeq() provides support for automatic import of genome annotations,
which internally get processed into genomic ranges (GRanges) and are
slotted into the rowRanges() of the
object. Currently, we offer support for (1) Ensembl genome annotations
from AnnotationHub via ensembldb (recommended); or (2) direct
import from a GTF/GFF file using rtracklayer.
ensembldb requires the organism and ensemblRelease arguments to be
defined. When both of these are set, bcbioRNASeq will attempt to
download and use the pre-built Ensembl genome annotations from
AnnotationHub. This method is preferred over direct loading of a GTF/GFF
file because the AnnotationHub annotations contain additional rich
metadata not defined in a GFF file, specifically description and entrezID
values.
Alternatively, if you are working with a non-standard or poorly annotated genome that isn't available on AnnotationHub, we provide fallback support for loading the genome annotations directly from the GTF file used by the bcbio RNA-seq pipeline. This should be fully automatic for an R session active on the same server used to run bcbio.
Example bcbio GTF path: genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf.
If you are working from a remote environment that doesn't
have file system access to the bcbio genomes directory, we provide
additional fallback support for importing genome annotations from a GTF/GFF
file directly with the gffFile argument.
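The order of preference described above can be sketched as a small helper. Note that `annotationArgs()` is a hypothetical function written for this illustration and is not part of bcbioRNASeq; it only assembles the documented organism, ensemblRelease, and gffFile arguments, and the upload path in the commented call is a placeholder.

```r
## Hypothetical helper (not part of bcbioRNASeq): choose annotation arguments
## following the order of preference described above.
annotationArgs <- function(
    organism = NULL,
    ensemblRelease = NULL,
    gffFile = NULL
) {
    if (!is.null(organism) && !is.null(ensemblRelease)) {
        ## Preferred: Ensembl annotations from AnnotationHub via ensembldb.
        list(organism = organism, ensemblRelease = ensemblRelease)
    } else if (!is.null(gffFile)) {
        ## Fallback for remote sessions: a user-supplied GTF/GFF file.
        list(gffFile = gffFile)
    } else {
        ## No annotation arguments: rely on the GTF file used by the bcbio run.
        list()
    }
}

annoArgs <- annotationArgs(organism = "Mus musculus", ensemblRelease = 87L)

## Hypothetical call; requires a real bcbio final upload directory.
## object <- do.call(
##     what = bcbioRNASeq,
##     args = c(list(uploadDir = "path/to/final"), annoArgs)
## )
```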
Internally, genome annotations are imported via the basejump package, specifically with either of these functions:
Ensure that the organism and genome build used with bcbio match those given
in the function call. In particular, for the legacy Homo sapiens
GRCh37/hg19 genome build, ensure that genomeBuild = "GRCh37". Otherwise,
the genomic ranges set in rowRanges()
will mismatch. For current projects, we recommend using GRCh38/hg38
in place of GRCh37/hg19 if possible.
DESeq2 is run automatically when bcbioRNASeq() is called, unless fast = TRUE is set. Internally, this automatically slots normalized counts into
assays(), and generates variance-stabilized
counts.
When working on a local machine, it is possible to load bcbio run data over a
remote connection using sshfs. When loading a large number of samples, it
is preferable to call bcbioRNASeq() directly in R on the remote server, if
possible.
.S4methods(class = "bcbioRNASeq").
uploadDir <- system.file("extdata/bcbio", package = "bcbioRNASeq")

## Gene level.
object <- bcbioRNASeq(
    uploadDir = uploadDir,
    level = "genes",
    caller = "salmon",
    organism = "Mus musculus",
    ensemblRelease = 87L
)
print(object)
#> bcbioRNASeq 0.3.29
#> uploadDir: /tmp/RtmpibnER3/temp_libpath16c0448330e2/bcbioRNASeq/extdata/bcbio
#> dates(2): [bcbio] 2018-03-18; [R] 2019-10-30
#> level: genes
#> caller: salmon
#> organism: Mus musculus
#> interestingGroups: sampleName
#> class: RangedSummarizedExperiment
#> dim: 100 6
#> metadata(27): allSamples bcbioCommandsLog ... wd yaml
#> assays(7): counts aligned ... vst fpkm
#> rownames(100): ENSMUSG00000000001 ENSMUSG00000000003 ...
#>   ENSMUSG00000062661 ENSMUSG00000074340
#> rowData names(7): broadClass description ... geneName seqCoordSystem
#> colnames(6): control_rep1 control_rep2 ... fa_day7_rep2 fa_day7_rep3
#> colData names(26): averageInsertSize averageReadLength ... treatment
#>   x5x3Bias

## Transcript level.
object <- bcbioRNASeq(
    uploadDir = uploadDir,
    level = "transcripts",
    caller = "salmon",
    organism = "Mus musculus",
    ensemblRelease = 87L
)
print(object)
#> bcbioRNASeq 0.3.29
#> uploadDir: /tmp/RtmpibnER3/temp_libpath16c0448330e2/bcbioRNASeq/extdata/bcbio
#> dates(2): [bcbio] 2018-03-18; [R] 2019-10-30
#> level: transcripts
#> caller: salmon
#> organism: Mus musculus
#> interestingGroups: sampleName
#> class: RangedSummarizedExperiment
#> dim: 100 6
#> metadata(27): allSamples bcbioCommandsLog ... wd yaml
#> assays(3): counts avgTxLength tpm
#> rownames(100): ENSMUST00000000001 ENSMUST00000000003 ...
#>   ENSMUST00000000674 ENSMUST00000000687
#> rowData names(13): broadClass description ... transcriptName
#>   transcriptSupportLevel
#> colnames(6): control_rep1 control_rep2 ... fa_day7_rep2 fa_day7_rep3
#> colData names(26): averageInsertSize averageReadLength ... treatment
#>   x5x3Bias

## Fast mode.
object <- bcbioRNASeq(uploadDir = uploadDir, fast = TRUE)