Runs and evaluates results from plink --missing --freq. It calculates the
rates of missing genotype calls and the allele frequencies for all variants in
the individuals that passed the perIndividualQC. The SNP missingness rates
(stratified by minor allele frequency) are depicted as histograms.
check_snp_missingness(indir, name, qcdir = indir, lmissTh = 0.01, interactive = FALSE, path2plink = NULL, verbose = FALSE, showPlinkOutput = TRUE)
indir | [character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam. |
---|---|
name | [character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam. |
qcdir | [character] /path/to/directory where results will be written to.
If |
lmissTh | [double] Threshold for acceptable variant missing rate across samples. |
interactive | [logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding or a graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_lmiss) via ggplot2::ggsave(plot=p_lmiss, other_arguments) or pdf(outfile); print(p_lmiss); dev.off(). |
path2plink | [character] Absolute path to the PLINK executable (https://www.cog-genomics.org/plink/1.9/), i.e. plink should be accessible as path2plink -h. The full name of the executable should be specified: for Windows, this means path/plink.exe; for Unix platforms, path/plink. If not provided, it is assumed that the PATH set-up works and PLINK will be found by |
verbose | [logical] If TRUE, progress information is printed to standard out; in particular, the plink log will be displayed. |
showPlinkOutput | [logical] If TRUE, plink log and error messages are printed to standard out. |
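With interactive=FALSE, the returned plot object has to be written to a file explicitly. The following is a minimal sketch of the two routes mentioned for the interactive argument. The toy data frame and the plot built from it are stand-ins for the real p_lmiss object (an assumption for illustration only), and the block does nothing if ggplot2 is not installed.

```r
# Sketch only: 'toy' and the histogram built from it are hypothetical stand-ins
# for the p_lmiss object returned by check_snp_missingness().
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
outfile <- file.path(tempdir(), "p_lmiss.pdf")

if (has_ggplot2) {
  toy <- data.frame(F_MISS = c(0, 0.002, 0.004, 0.02, 0.15))
  p_lmiss <- ggplot2::ggplot(toy, ggplot2::aes(x = F_MISS)) +
    ggplot2::geom_histogram(bins = 10)

  # Route 1: ggsave, passing the plot object via the 'plot' argument.
  ggplot2::ggsave(filename = outfile, plot = p_lmiss, width = 6, height = 4)

  # Route 2: open a pdf device, print the plot, close the device.
  pdf(outfile)
  print(p_lmiss)
  dev.off()
}
```

Both routes write to the same file here, so the second call simply overwrites the first; in practice one of them is enough.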
Named list with i) fail_missingness, a [data.frame] containing, for all SNPs failing the lmissTh, the columns CHR (chromosome code), SNP (variant identifier), CLST (cluster identifier; only present with --within/--family), N_MISS (number of missing genotype calls, not counting obligatory missings), N_CLST (cluster size; does not include nonmales on chrY; only present with --within/--family), N_GENO (number of potentially valid calls) and F_MISS (missing call rate), and ii) p_lmiss, a ggplot2 object containing the SNP missingness histogram, which can be shown by print(p_lmiss).
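To illustrate how lmissTh relates to the fail_missingness table, here is a minimal base-R sketch on made-up data. The values are invented, and the strict comparison F_MISS > lmissTh is an assumption about how the threshold is applied, not a statement about the package internals.

```r
# Made-up per-variant table mimicking some columns of a PLINK .lmiss file.
lmiss <- data.frame(
  CHR = c(1, 1, 2, 2),
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  N_MISS = c(0, 5, 1, 30),
  N_GENO = c(200, 200, 200, 200),
  stringsAsFactors = FALSE
)

# Missing call rate: missing calls divided by potentially valid calls.
lmiss$F_MISS <- lmiss$N_MISS / lmiss$N_GENO

# Assumption: a variant fails if its missing rate is strictly above lmissTh.
lmissTh <- 0.01
fail_missingness <- lmiss[lmiss$F_MISS > lmissTh, ]
print(fail_missingness$SNP)
```

With the default threshold of 0.01, the toy variants rs2 (2.5% missing) and rs4 (15% missing) are flagged, while rs1 and rs3 pass.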
check_snp_missingness uses plink --remove name.fail.IDs --missing --freq to
calculate rates of missing genotype calls and frequency per SNP in the
individuals that passed the perIndividualQC. It does so without generating a
new dataset but simply removes the IDs when calculating the statistics.
For details on the output data.frame fail_missingness, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#lmiss.
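The PLINK call described in the details can be sketched as follows. This is an illustration of how such a command line might be assembled from the function arguments, not the package's actual implementation; the directory path is a placeholder, and PLINK is only invoked if both the executable and the input files are found.

```r
# Hypothetical inputs; replace with real locations.
indir <- "/path/to/data"
qcdir <- indir
name <- "data"
path2plink <- "plink"

prefix <- file.path(indir, name)                        # name.bed/.bim/.fam prefix
out <- file.path(qcdir, name)                           # prefix for .lmiss/.frq output
fail_ids <- file.path(qcdir, paste0(name, ".fail.IDs")) # IDs failing perIndividualQC

# --missing writes missingness reports, --freq writes allele frequencies.
args <- c("--bfile", prefix, "--missing", "--freq", "--out", out)

# Only exclude individuals if a fail.IDs file is actually present.
if (file.exists(fail_ids)) {
  args <- c(args, "--remove", fail_ids)
}
cat(path2plink, args, "\n")

# Run PLINK only when the executable and the input data are available.
if (nzchar(Sys.which(path2plink)) && file.exists(paste0(prefix, ".bed"))) {
  system2(path2plink, args)
}
```

Because --remove only filters individuals for the duration of the call, no new .bed/.bim/.fam dataset is written; only the report files under the --out prefix are produced.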
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
path2plink <- '/path/to/plink'

# the following code is not run on package build, as the path2plink on the
# user system is not known.
# NOT RUN {
fail_snp_missingness <- check_snp_missingness(qcdir=qcdir, indir=indir,
                                              name=name, interactive=FALSE,
                                              verbose=TRUE,
                                              path2plink=path2plink)
# }