Function to simulate damaged cells by perturbing the gene expression of existing cells.
Usage
simulate_counts(
  count_matrix,
  damage_proportion,
  annotated_celltypes = FALSE,
  target_damage = c(0.1, 0.8),
  damage_distribution = "right_skewed",
  distribution_steepness = "moderate",
  beta_shape_parameters = NULL,
  ribosome_penalty = 0.001,
  generate_plot = TRUE,
  palette = c("grey", "#7023FD", "#E60006"),
  plot_ribosomal_penalty = FALSE,
  display_plot = TRUE,
  seed = NULL,
  organism = "Hsap"
)
Arguments
- count_matrix
Matrix or dgCMatrix containing the counts from single cell RNA sequencing data.
- damage_proportion
Numeric describing what proportion of the input data should be altered to resemble damaged data.
Must range between 0 and 1.
- annotated_celltypes
Boolean specifying whether input matrix has cell type information stored.
Default is FALSE
- target_damage
Numeric vector specifying the lower and upper bounds of the level of damage that will be introduced.
Here, damage refers to the amount of cytoplasmic RNA lost by a cell where values closer to 1 indicate more loss and therefore more heavily damaged cells.
Default is c(0.1, 0.8)
- damage_distribution
String specifying whether the distribution of damage levels among the damaged cells should be shifted towards the upper or lower range of damage specified in 'target_damage' or follow a symmetric distribution between them. There are three valid options:
"right_skewed"
"left_skewed"
"symmetric"
Default is "right_skewed"
- distribution_steepness
String specifying how concentrated the damage levels of the damaged cells are about the mean of the target distribution specified in 'target_damage'. Here, greater steepness produces more apparent skewness. There are three valid options:
"shallow"
"moderate"
"steep"
Default is "moderate"
- beta_shape_parameters
Numeric vector that allows the shape parameters of the beta distribution to be defined explicitly. This offers greater flexibility than the 'damage_distribution' and 'distribution_steepness' parameters and overrides the defaults they offer.
Default is NULL.
- ribosome_penalty
Numeric specifying the factor by which the probability of losing a transcript from a ribosomal gene is multiplied. Here, values closer to 0 represent a greater penalty.
Default is 0.001.
- generate_plot
Boolean specifying whether the QC plot should be generated. QC plots are generated by default, as we recommend verifying that the perturbed data retains characteristics of true single cell data.
Default is TRUE.
- palette
Character vector containing three colours to create the continuous palette for damaged cells.
Default is c("grey", "#7023FD", "#E60006").
- plot_ribosomal_penalty
Boolean specifying whether the output QC plot should focus on only the ribosomal proportion or contain additional QC information. If TRUE, this can be useful for visualising the impact of the ribosomal penalty parameter.
Default is FALSE.
- display_plot
Boolean specifying whether the output QC plot should be displayed in the global environment. Naturally, this is only relevant when generate_plot is TRUE.
Default is TRUE.
- seed
Numeric specifying the random seed to ensure reproducibility of the function's output. Setting a seed ensures that the random sampling and perturbation processes produce the same results when the function is run multiple times with the same input data and parameters.
Default is NULL.
- organism
String specifying the organism of origin of the input data. There are two standard options:
"Hsap"
"Mmus"
To use a non-standard organism, the user must instead supply a list containing strings for the patterns that match the mitochondrial and ribosomal genes of the organism. If available, nuclear-encoded genes that are likely retained in the nucleus, such as those in nuclear speckles, must also be specified. An example for humans is below:
organism = list(mito_pattern = "^MT-", ribo_pattern = "^(RPS|RPL)", nuclear = c("NEAT1", "XIST", "MALAT1"))
Default is "Hsap"
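To make the custom 'organism' specification more concrete, the sketch below shows how patterns like those in the human example could be used to classify gene names in base R. The gene names and the classification logic are illustrative assumptions, not the package's internal code.

```r
# Hypothetical gene names, for illustration only
genes <- c("MT-CO1", "MT-ND4", "RPS27", "RPL13",
           "MALAT1", "NEAT1", "ACTB", "GAPDH")

# Custom organism specification using the human patterns
organism <- list(
  mito_pattern = "^MT-",
  ribo_pattern = "^(RPS|RPL)",
  nuclear = c("NEAT1", "XIST", "MALAT1")
)

# Regular expressions match gene name prefixes; nuclear genes are matched by name
is_mito <- grepl(organism$mito_pattern, genes)
is_ribo <- grepl(organism$ribo_pattern, genes)
is_nuclear <- genes %in% organism$nuclear

gene_class <- ifelse(is_mito, "mitochondrial",
              ifelse(is_ribo, "ribosomal",
              ifelse(is_nuclear, "nuclear", "other")))

data.frame(gene = genes, class = gene_class)
```

Checking the patterns against the row names of your own matrix in this way, before running the simulation, is a quick way to confirm that they capture the intended genes.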
Value
A list containing the altered count matrix, a data frame with summary statistics, and, if specified, a 'ggplot2' object of the quality control metrics of the alteration.
Details
'DamageDetective' models damage in single-cell RNA sequencing data as the loss of cytoplasmic RNA, where cells experiencing greater RNA loss are assumed to be more extensively damaged, while those with minimal loss are considered largely intact. The perturbation process introduces RNA loss into existing cells and is controlled by three key parameters: the target proportion of damage, which specifies the fraction of cells to be perturbed; the target level of damage, which defines the extent of RNA loss across cells; and the target distribution of damage, which determines how the different levels of RNA loss are distributed across cells.
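The interplay of the three parameters can be sketched in base R as follows. This is a minimal illustration, assuming that damage levels are drawn from a beta distribution and rescaled linearly to the 'target_damage' range; the shape values (2, 5) and the cell count are hypothetical and are not the values the package uses for any of its named options.

```r
set.seed(7)

n_cells <- 1000                 # hypothetical number of cells in the input
damage_proportion <- 0.2        # fraction of cells to perturb
target_damage <- c(0.1, 0.8)    # lower and upper bounds of RNA loss
shape <- c(2, 5)                # hypothetical beta shapes; Beta(2, 5) has a long right tail

# Target proportion: choose which cells will be perturbed
n_damaged <- round(n_cells * damage_proportion)
damaged_cells <- sample(n_cells, n_damaged)

# Target distribution: draw values in (0, 1) from the beta distribution
raw <- rbeta(n_damaged, shape1 = shape[1], shape2 = shape[2])

# Target level: rescale the draws to the requested damage range
damage_level <- target_damage[1] + raw * diff(target_damage)

summary(damage_level)
```

With these shapes most selected cells receive a level of loss near the lower bound and only a few approach the upper bound; swapping the two shape values would mirror the distribution.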
Based on these parameters, cells are randomly selected and assigned a target proportion of RNA loss. The total number of transcripts to be removed is determined, and perturbation is applied through weighted sampling without replacement from cytoplasmic gene counts. Here, the probability of transcript loss is determined by gene abundance, with highly expressed genes more likely to lose transcripts. Once the target RNA loss is reached, the cell's expression profile is updated, and the process repeats for all selected cells.
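The perturbation of a single cell might be sketched as below. This is a simplified base R illustration and not the package's implementation: the counts are made up, mitochondrial and nuclear-retained genes are assumed to be excluded from cytoplasmic loss, and the ribosomal penalty is applied as a multiplier on the sampling weight of each ribosomal transcript.

```r
set.seed(7)

# Hypothetical counts for one cell (names and values are illustrative only)
cell <- c("MT-CO1" = 30, "RPS27" = 200, "RPL13" = 150,
          "ACTB" = 300, "GAPDH" = 250, "MALAT1" = 70)

target_loss <- 0.5          # proportion of cytoplasmic RNA to remove
ribosome_penalty <- 0.001   # multiplier on the loss probability of ribosomal transcripts

# Assumption: mitochondrial and nuclear-retained transcripts are not lost
is_cyto <- !grepl("^MT-", names(cell)) &
  !(names(cell) %in% c("NEAT1", "XIST", "MALAT1"))
cyto <- cell[is_cyto]

# Total number of transcripts to remove from this cell
n_remove <- round(target_loss * sum(cyto))

# One entry per transcript, so abundant genes are sampled more often
transcripts <- rep(names(cyto), times = cyto)
weights <- ifelse(grepl("^(RPS|RPL)", transcripts), ribosome_penalty, 1)

# Weighted sampling without replacement: no gene can lose more than it has
lost <- sample(transcripts, size = n_remove, replace = FALSE, prob = weights)
lost_per_gene <- table(factor(lost, levels = names(cell)))

# Update the cell's expression profile
perturbed <- cell - as.numeric(lost_per_gene)
perturbed
```

Expanding counts into one entry per transcript is memory-hungry for real data, but it keeps the link between gene abundance and loss probability easy to see in a small example.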