Quality control function to identify and filter damaged cells from an input count matrix, where 'damage' is defined by the loss of cytoplasmic RNA.
Usage
detect_damage(
count_matrix,
ribosome_penalty = 0.01,
organism = "Hsap",
annotated_celltypes = FALSE,
target_damage = c(0.1, 0.8),
damage_distribution = "right_skewed",
distribution_steepness = "moderate",
beta_shape_parameters = NULL,
damage_levels = 5,
damage_proportion = 0.15,
seed = 7,
mito_quantile = 0.75,
kN = NULL,
generate_plot = TRUE,
display_plot = TRUE,
palette = c("grey", "#7023FD", "#E60006"),
filter_threshold = 0.7,
filter_counts = FALSE,
verbose = TRUE
)
Arguments
- count_matrix
Matrix or dgCMatrix containing the counts from single cell RNA sequencing data.
- ribosome_penalty
Numeric specifying the factor by which the probability of losing a transcript from a ribosomal gene is multiplied. Here, values closer to 0 represent a greater penalty.
Default is 0.01.
- organism
String specifying the organism of origin of the input data where there are two standard options,
"Hsap"
"Mmus"
If a user wishes to use a non-standard organism, they must input a list containing strings for the patterns to match mitochondrial and ribosomal genes of the organism. If available, nuclear-encoded genes that are likely retained in the nucleus, such as in nuclear speckles, must also be specified. An example for humans is below,
organism = list(mito_pattern = "^MT-", ribo_pattern = "^(RPS|RPL)", nuclear = c("NEAT1", "XIST", "MALAT1"))
Default is "Hsap"
- annotated_celltypes
Boolean specifying whether input matrix has cell type information stored.
Default is FALSE
- target_damage
Numeric vector specifying the lower and upper bounds of the level of damage that will be introduced.
Here, damage refers to the amount of cytoplasmic RNA lost by a cell where values closer to 1 indicate more loss and therefore more heavily damaged cells.
Default is c(0.1, 0.8)
- damage_distribution
String specifying whether the distribution of damage levels among the damaged cells should be shifted towards the upper or lower range of damage specified in 'target_damage' or follow a symmetric distribution between them. There are three valid options:
"right_skewed"
"left_skewed"
"symmetric"
Default is "right_skewed"
- distribution_steepness
String specifying how concentrated the spread of damaged cells is about the mean of the target distribution specified in 'target_damage'. Here, an increase in steepness manifests in a more apparent skewness. There are three valid options:
"shallow"
"moderate"
"steep"
Default is "moderate"
- beta_shape_parameters
Numeric vector that allows the shape parameters of the beta distribution to be defined explicitly. This offers greater flexibility than allowed by the 'damage_distribution' and 'distribution_steepness' parameters and will override the defaults they offer.
Default is 'NULL'
- damage_levels
Numeric specifying the number of distinct sets of artificially damaged cells simulated, each with a defined range of loss. Default options include,
3 : c(0.00001, 0.08), c(0.1, 0.4), c(0.5, 0.9)
5 : c(0.00001, 0.08), c(0.1, 0.3), c(0.3, 0.5), c(0.5, 0.7), c(0.7, 0.9)
7 : c(0.00001, 0.08), c(0.1, 0.3), c(0.3, 0.4), c(0.4, 0.5), c(0.5, 0.7), c(0.7, 0.9), c(0.9, 0.99999)
A user can also provide a list specifying sets with their own ranges of loss,
damage_levels = list( pANN_50 = c(0.1, 0.5), pANN_100 = c(0.5, 1) )
By introducing more sets of damage a user can improve the accuracy of loss estimations (scaled_pANN) as they are found through scaling the pANN within each set according to the lower and upper boundary of the set's damage level. However, introducing more sets increases the computational time for the function.
Default is 5.
- damage_proportion
Numeric describing what proportion of the input data should be altered to resemble damaged data.
Must range between 0 and 1.
Default is 0.15.
- seed
Numeric specifying the random seed to ensure reproducibility of the function's output. Setting a seed ensures that the random sampling and perturbation processes produce the same results when the function is run multiple times with the same input data and parameters.
Default is 7.
- mito_quantile
Numeric between 0 and 1 specifying the quantile of mitochondrial proportion below which cells are sampled for simulation. This protects against simulating damaged cell profiles from cells that are likely already damaged.
Default is 0.75.
- kN
Numeric describing how many nearest neighbours are considered for pANN calculations. kN cannot exceed the total cell number.
Default is NULL, in which case one third of the total cell number is used.
- generate_plot
Boolean specifying whether the QC plot should be outputted. QC plots will be generated by default as we recommend verifying the perturbed data retains characteristics of true single cell data.
Default is TRUE.
- display_plot
Boolean specifying whether the output QC plot should be displayed in the global environment. Naturally, this is only relevant when generate_plot is TRUE.
Default is TRUE.
- palette
Character vector specifying the three colours that will be used to create the continuous colour palette for colouring the 'damage_column'.
Default is a range from grey through purple to red, c("grey", "#7023FD", "#E60006").
- filter_threshold
Numeric specifying the proportion of RNA loss above which a cell should be considered damaged.
Default is 0.7.
- filter_counts
Boolean specifying whether a filtered count matrix, containing only cells that fall below the filter threshold, should be returned. If FALSE, a data frame containing cell barcodes and their associated label, either 'damaged' or 'cell', is returned instead.
Default is FALSE.
- verbose
Boolean specifying whether messages and function progress should be displayed in the console.
Default is TRUE.
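The target_damage, damage_distribution and beta_shape_parameters arguments together describe a beta distribution rescaled to a damage range. The following is a minimal sketch of that idea in base R, not the package's implementation: the helper name sample_damage_levels() and the shape values c(2, 5) are hypothetical and chosen only for illustration.

```r
# Hypothetical helper (not part of DamageDetective): draw one damage level
# per simulated cell from a beta distribution rescaled to target_damage.
# The shape parameters are illustrative assumptions, not package defaults.
sample_damage_levels <- function(n_cells,
                                 target_damage = c(0.1, 0.8),
                                 beta_shape_parameters = c(2, 5),
                                 seed = 7) {
  stopifnot(length(target_damage) == 2,
            target_damage[1] < target_damage[2])
  set.seed(seed)
  # Draw from Beta(shape1, shape2), which lives on [0, 1]
  raw <- rbeta(n_cells,
               shape1 = beta_shape_parameters[[1]],
               shape2 = beta_shape_parameters[[2]])
  # Linearly rescale the draws to [lower, upper]
  target_damage[1] + raw * (target_damage[2] - target_damage[1])
}

damage <- sample_damage_levels(n_cells = 150)
range(damage)  # all values fall within c(0.1, 0.8)
```

With these illustrative shapes most draws sit nearer the lower bound; swapping the two shape values moves the mass towards the upper bound, and equal values give a symmetric spread.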
Details
Using the simulation framework of simulate_counts(), detect_damage()
generates artificially damaged cell profiles by introducing defined levels
of RNA loss into the input data. True and artificial cells are then
merged and pre-processed to compute the following quality control metrics:
Log-normalized feature count
Log-normalized total counts
Mitochondrial proportion
Ribosomal proportion
Log-normalized MALAT1 gene expression
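As a rough illustration of the metrics listed above, the sketch below computes them for a small toy matrix in base R. The toy data, the log1p transform and the scaling factor of 1e4 are assumptions made for the example; the package's own pre-processing may differ.

```r
# Toy count matrix (hypothetical data): genes in rows, cells in columns
set.seed(7)
genes <- c("MT-CO1", "MT-ND1", "RPS3", "RPL5",
           "MALAT1", "ACTB", "GAPDH", "CD3E")
counts <- matrix(rpois(length(genes) * 6, lambda = 5),
                 nrow = length(genes),
                 dimnames = list(genes, paste0("cell_", 1:6)))

# Human gene patterns, as in the 'organism' argument
mito_idx <- grepl("^MT-", rownames(counts))
ribo_idx <- grepl("^(RPS|RPL)", rownames(counts))
total <- colSums(counts)

# One row per cell, one column per QC metric
qc <- data.frame(
  log_features = log1p(colSums(counts > 0)),   # genes detected
  log_counts   = log1p(total),                 # library size
  mito_prop    = colSums(counts[mito_idx, , drop = FALSE]) / total,
  ribo_prop    = colSums(counts[ribo_idx, , drop = FALSE]) / total,
  log_malat1   = log1p(counts["MALAT1", ] / total * 1e4)
)
qc
```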
Principal component analysis (PCA) is performed on these metrics,
and a Euclidean distance matrix is constructed from the PC embeddings.
For each true cell, the proportion of nearest neighbours that are
artificial cells (pANN) is calculated across all damage levels and the
damage level with the highest pANN is assigned to the true cell.
Finally, cells exceeding a specified damage threshold, filter_threshold,
are marked as damaged.
This filtering method is inspired by approaches developed for DoubletFinder (McGinnis et al., 2019) to detect doublets in single-cell data.
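The pANN step can be sketched in base R for a single damage level as follows. This is an illustrative approximation under stated assumptions rather than the package's code: the metric table is random toy data, kN is fixed at 10, and the linear rescaling of pANN to the level's boundaries is one plausible reading of how scaled_pANN is derived.

```r
set.seed(7)
# Hypothetical QC metric table: 40 true cells followed by 20 artificial cells
n_true <- 40
n_art  <- 20
metrics <- rbind(
  matrix(rnorm(n_true * 5), ncol = 5),
  matrix(rnorm(n_art * 5, mean = 1.5), ncol = 5)
)
is_artificial <- rep(c(FALSE, TRUE), times = c(n_true, n_art))

# PCA on the metrics, then Euclidean distances between PC embeddings
pcs <- prcomp(metrics, center = TRUE, scale. = TRUE)$x
d <- as.matrix(dist(pcs))
diag(d) <- Inf  # a cell is never its own neighbour

# For each true cell: proportion of its kN nearest neighbours that are artificial
kN <- 10
pANN <- vapply(which(!is_artificial), function(i) {
  nn <- order(d[i, ])[seq_len(kN)]
  mean(is_artificial[nn])
}, numeric(1))

# Assumed linear rescaling to the boundaries of this damage level
level <- c(0.5, 0.9)
scaled_pANN <- level[1] + pANN * (level[2] - level[1])

# Label cells against the threshold (0.7 in the usage signature)
label <- ifelse(scaled_pANN > 0.7, "damaged", "cell")
table(label)
```

In the full procedure this would be repeated for every damage level, keeping for each true cell the level with the highest pANN.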
References
McGinnis, C. S., Murrow, L. M., & Gartner, Z. J. (2019). DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Systems, 8(4), 329-337.e4. doi:10.1016/j.cels.2019.03.003
Examples
data("test_counts", package = "DamageDetective")
test <- detect_damage(
count_matrix = test_counts,
ribosome_penalty = 0.001,
damage_levels = 3,
damage_proportion = 0.1,
generate_plot = FALSE,
seed = 7
)
#> Simulating 1e-05 and 0.08 RNA loss...
#> Simulating 0.1 and 0.4 RNA loss...
#> Simulating 0.5 and 0.9 RNA loss...
#> Computing pANN...