Skip to contents

Tutorial Overview

The goal of DamageDetective is to simplify the process of making informed and reproducible damaged cell filtering decisions. This tutorial provides an outline of the steps needed to achieve this:

  1. Data preparation
  2. Parameter selection
  3. Damaged cell detection

Ensure the following packages are installed and made available in your R environment to follow along:

  • DamageDetective, ggplot2, Seurat, SeuratData, scRNAseq, SingleCellExperiment, and patchwork.

Data Preparation

This tutorial takes place in R, home to three popular scRNA-seq data storage types:

  1. The Seurat object
  2. The SingleCellExperiment (sce) object
  3. The matrix retrieved directly from alignment output (matrix.mtx, barcodes.tsv, and features.tsv).

DamageDetective operates with data in a compressed, column-oriented sparse matrix (dgCMatrix). This format efficiently handles the sparse nature of single-cell data by indexing only non-zero elements within each column. Each of the above data storage types can act as the starting point for damage detection through conversion to a sparse matrix.

Note: Due to vignette size constraints, this tutorial will only demonstrate live using data from the alignment output.

Conversion to Sparse Matrix

Using a Seurat Object

The count matrix can be extracted from the assay slot of a Seurat object. By default, this matrix is already in compressed column-oriented sparse form and can be used as input for DamageDetective.

We will use the SeuratData package to retrieve a publicly available Seurat object to demonstrate. This object stores the data from peripheral blood mononuclear cells (PBMCs) sequenced using the 10X Genomics platform.

# Retrieve the dataset of interest
SeuratData::InstallData("pbmc3k")
data("pbmc3k")

# Extract the count matrix
pbmc3k_counts <- pbmc3k[["RNA"]]$counts

Using Alignment Output

Alignment output comes in the form of three files containing the row names (genes), column names (cell identifiers), and input values (gene expression values) of the count matrix. These can be compiled using ReadMtx, a function offered by Seurat which simplifies the count matrix compilation into one function. This involves decompressing zipped files, mapping feature names to HGNC gene symbols, and converting the matrix to sparse format. ReadMtx output can be used directly for DamageDetective.

We will use publicly available alignment data from the 10X Genomics website for demonstration, specifically the ‘1k PBMCs from a healthy donor (v3)’ data available here following the link named “Feature / cell matrix (filtered)”.

# Set the file paths relative to location on your device
matrix_file <- "~/Projects/demonstrations/matrix.mtx.gz"     
barcodes_file <- "~/Projects/demonstrations/barcodes.tsv.gz" 
features_file <- "~/Projects/demonstrations/features.tsv.gz" 

# Construct the sparse matrix
alignment_counts <- Seurat::ReadMtx(
  mtx = matrix_file,
  cells = barcodes_file,
  features = features_file
)

Using an sce Object

The count matrix can be extracted from the assay slot of a Seurat object. By default, this matrix is already in compressed column-oriented sparse form and can be used as input for DamageDetective.

We will use the SeuratData package to retrieve a publicly available Seurat object to demonstrate. This object stores the data from peripheral blood mononuclear cells (PBMCs) sequenced using the 10X Genomics platform.

# Retrieve multisample dataset
pbmc_sce <- scRNAseq::fetchDataset("kotliarov-pbmc-2020", "2024-04-18")

# Extract sample of interest
metadata <- SummarizedExperiment::colData(pbmc_sce)
sample_sce <- subset(metadata, sample == "234_d0")
sample_sce <- rownames(sample_sce)

# Subset and convert to sparse format
pbmc_counts <- SummarizedExperiment::assay(pbmc_sce, "counts")
sample_counts  <- pbmc_counts[, sample_sce]
sample_counts <- as.matrix(sample_counts)
sample_counts <- as(sample_counts, "dgCMatrix")

Parameter Selection

The detect_damage function requires only the count matrix to run but accepts additional parameters. These parameters can be divided into two categories:

  1. Computational Parameters: Alter the computation and result in different outputs.
  2. Aesthetic Parameters: Adjust how the user receives the output but do not change the computation.

Computational Parameters

Dataset-Defined Parameters

  • organism: Specifies the organism of the data (default: "Hsap").
  • annotated_celltypes: Indicates whether cell types are known (default: FALSE).
  • ribosome_penalty: Accounts for the unexpectedly low probability of ribosomal RNA loss in real single-cell data.
Exploring the ribosome_penalty parameter

This parameter accounts for the unexpectedly low probability of ribosomal RNA loss observed in real single cell data that must be accounted for during the simulation of artificial cells.

The impact of changing ribosome_penalty can be explored in the plots below. The idea is to see whether the artificial cells describe the true cells, in other words, whether the coloured dots superimpose the grey dots.

  • Alignment output

You will see that as you increase the penalty, i.e. go from values closer to 1 to values closer to 0*^*, the coloured dots shift from an extreme position on the left-hand side of the plot to a more central position. At what point along this range true cells exist is dataset-dependent but generally lies closer to 0.

*^* ribosome_penalty is a multiplicative reduction factor meaning a value of 1 is the same as introducing no penalty while values increasingly closer to zero introduce increasingly greater reductions.

A scatter plot showing the effect of ribosomal penalty on alignment output.
A scatter plot showing the effect of ribosomal penalty on alignment output.


Plots like that shown above can be generated as follows where,

  • count_matrix requires your data in sparse matrix form

  • ribosome_penalty is a numeric between 0 and 1 specifying the ribosomal penalty

  • damage_proportion is a number between 0 and 1 specifying the amount of artificial cells to create relative to the input data (setting this to a lower value makes the computation faster)

  • target_damage is a vector specifying the lower and upper levels of damage between which the artificial cells are generated. Shifting this to higher values shifts the damaged cell profiles to more intense levels that can be seen by a shift in colour, more red, as well as a shift in position, higher up on the y axis.

  • plot_ribosomal_penalty allows the plot to focus on the ribosomal proportion rather than give a general overview of the QC distributions.

  • seed ensures the output can be reproduced.

penalty_plot <- DamageDetective::simulate_counts(
  count_matrix = alignment_counts,
  ribosome_penalty = 0.01,  
  damage_proportion = 0.05,  
  target_damage = c(0.5, 1), 
  plot_ribosomal_penalty = TRUE,
  seed = 7
)

Scatter plot showing the distribution of quality control  metrics in altered and unaltered counts.


Computing the Ideal ribosome_penalty

From the above plots you can see how selecting an unideal `ribosome_penalty` can generate damaged profiles that do not describe the data well and, as a result, generate estimations of damage that do not describe the data well.

Selecting a `ribosome_penalty` that simulates RNA loss in a way that is relevant to the input data can be done as shown above, through trial and error, or in an automated fashion using the `select_penalty` function,

  • penalty_range setting a specific range of ribosome_penalty estimates

  • max_penalty_trials set a maximum number of attempts allowed

  • penalty_step setting the difference between each estimate

  • return_output returns a table showing the mean distances for each estimate giving a better idea of how the estimates differ.

It also inherits parameters of the simulate_counts function above including damage_proportion.

selected_penalty <- DamageDetective::select_penalty(
  count_matrix = alignment_counts,
  max_penalty_trials = 3, # shortened for vignette, default is 10
  seed = 7, 
  verbose = TRUE
)
#> Testing penalty of 1e-05...
#> Testing penalty of 0.00501...
#> Testing penalty of 0.01001...
#> Maximum penalty trials reached (3). Stopping.
selected_penalty
#> [1] 1e-05

The output above reveals that the select_penalty function runs through trial and error. This involves generating artificial cells using different ribosome_penalty estimates and evaluating the similarity of the resulting artificial cells to true cells.

Here, ‘similarity’ refers to the shortest distance from each artificial cell to a true cell in principal component (PC) space. The mean shortest distances are taken for each level of damage simulated. The penalty that generates the smallest mean shortest distance across damage levels is selected as the ideal ribosome_penalty. In other words, a penalty that generates artificial cells that are the most similar to true cells over a wide range of damage levels is ideal.

Though not strictly necessary, the select_penalty function can be adjusted to give a user more control over the testing process. This includes,

  • penalty_range setting a specific range of ribosome_penalty estimates

  • max_penalty_trials set a maximum number of attempts allowed

  • penalty_step setting the difference between each estimate

  • return_output returns a table showing the mean distances for each estimate giving a better idea of how the estimates differ.

It also inherits parameters of the simulate_counts function above including damage_proportion.


Undefined Computational Parameters

The remainder of the computational parameters involve selections that cannot be determined directly from the dataset and are up to user discretion. To support this, it may be helpful for a user to understand how the detect_damage function works.

Ultimately, detect_damage outputs a score from 0 to 1 indicating the estimated damage level in each cell. This is found by comparing the cells to simulated cells where the extent of damage is known.

Damage is simulated by the loss of cytoplasmic RNA, a consequence that follows from a loss in plasma membrane integrity-a defining principle of damage used in many cell viability assays. DamageDetective assumes that the proportion of RNA lost is directly related to the extent of cellular damage.

In a complementary workflow to that of select_penalty, detect_damage estimates the damage level of each true cell by measuring its similarity to artificially damaged cells. Using the PC embeddings, a set of cells with the highest similarity to each true cell, the nearest neighbours (NN), is collected. The proportions of nearest neighbours that originate from damage level are found. The damage level of the artificial set to which the true cell shows the highest proportion of nearest neighbours is selected as the cell’s estimated damage level. This is scaled to score the cells in each set in a way that reflects the relative differences in proportion.

Now, looking at the parameters:

  • filter_threshold

A value between 0 and 1 to determine the level of damage, or proportion of RNA loss, above which cells will be excluded. By default, DamageDetective offers the threshold of 0.7. Values greater than 0.7 reflect more permissive filtering while those closer to 0 reflect more stringent filtering. We recommend the default for all cases but suggest that if adjustments are made, they are informed by inspecting the output detect_damage plots, generate_plot = TRUE.

  • damage_levels

This describes the number of distinct sets of artificial cells simulated, each with a defined level of damage. This parameter requires a balance between generating enough damage sets to give informative estimates of damage i.e. not 0 to 1, and not putting the computation under strain by creating redundant sets.

Of the default options, we find 5 sets strikes an ideal balance. Going above 5 becomes computationally challenging but could be done for small datasets. Going below 5, while not ideal, may be necessary for large datasets (over 10 000 cells).

While shifting from the default options is not recommended, it is possible should a user feel it necessary through specification in a list,

damage_levels <- list(
  pANN_50 = c(0.1, 0.5),
  pANN_100 = c(0.5, 1)
)
  • kN

This describes the size of the set of nearest neighbours that is found for each true cell. This value defaults to a fifth of the size of the input dataset. This parameter is not as essential as others since changes made here introduce change to all artificial cells in a relative way.

It should be noted that increasing the size of the set too greatly increases the computational requirements of the function and have not been found to dircetly improve output. For this reason, anything greater than a third of the size of the input data is discouraged.


Aesthetic Parameters

  • filter_counts: Determines the output format (filtered matrix or dataframe).
  • palette: Sets the color palette for damaged cells in plots.


Damaged Cell Detection

Once the data is in sparse matrix form and an appropriate ribosome_penalty is selected, the detect_damage function can be run.

# Run detection
detection_output <- DamageDetective::detect_damage(
  count_matrix = alignment_counts, 
  ribosome_penalty = selected_penalty,
  seed = 7
)
#> Simulating 1e-05 and 0.08 RNA loss...
#> Simulating 0.1 and 0.3 RNA loss...
#> Simulating 0.3 and 0.5 RNA loss...
#> Simulating 0.5 and 0.7 RNA loss...
#> Simulating 0.7 and 0.9 RNA loss...
#> Computing pANN...

Scatter plot showing the outcome damage detection where cells of the input data are coloured according to the estimated level of damage with grey being no damage,  purple being light damage, and red being heavy damage.


# View output
head(detection_output$output[, -1])
#>                    DamageDetective DamageDetective_filter
#> AAACCCAAGGAGAGTA-1     0.063257907                   cell
#> AAACGCTTCAGCCCAG-1     0.024193023                   cell
#> AAAGAACAGACGACTG-1     0.059537442                   cell
#> AAAGAACCAATGGCAG-1     0.181250000                   cell
#> AAAGAACGTCTGCAAT-1     0.007450930                   cell
#> AAAGGATAGTAGACAT-1     0.005590698                   cell
table(detection_output$output$DamageDetective_filter)
#> 
#>    cell damaged 
#>    1150      72

Session Information

#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] patchwork_1.3.0            ggplot2_3.5.1             
#> [3] Seurat_5.2.1               SeuratObject_5.0.2        
#> [5] sp_2.2-0                   DamageDetective_1.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3     jsonlite_2.0.0         magrittr_2.0.3        
#>   [4] spatstat.utils_3.1-3   farver_2.1.2           rmarkdown_2.29        
#>   [7] fs_1.6.5               ragg_1.3.3             vctrs_0.6.5           
#>  [10] ROCR_1.0-11            spatstat.explore_3.4-2 rstatix_0.7.2         
#>  [13] htmltools_0.5.8.1      broom_1.0.8            Formula_1.2-5         
#>  [16] sass_0.4.9             sctransform_0.4.1      parallelly_1.43.0     
#>  [19] KernSmooth_2.23-26     bslib_0.9.0            htmlwidgets_1.6.4     
#>  [22] desc_1.4.3             ica_1.0-3              plyr_1.8.9            
#>  [25] plotly_4.10.4          zoo_1.8-13             cachem_1.1.0          
#>  [28] igraph_2.1.4           mime_0.13              lifecycle_1.0.4       
#>  [31] pkgconfig_2.0.3        Matrix_1.7-2           R6_2.6.1              
#>  [34] fastmap_1.2.0          fitdistrplus_1.2-2     future_1.34.0         
#>  [37] shiny_1.10.0           digest_0.6.37          colorspace_2.1-1      
#>  [40] tensor_1.5             RSpectra_0.16-2        irlba_2.3.5.1         
#>  [43] textshaping_1.0.0      ggpubr_0.6.0           labeling_0.4.3        
#>  [46] progressr_0.15.1       spatstat.sparse_3.1-0  httr_1.4.7            
#>  [49] polyclip_1.10-7        abind_1.4-8            compiler_4.4.3        
#>  [52] withr_3.0.2            backports_1.5.0        carData_3.0-5         
#>  [55] fastDummies_1.7.5      ggsignif_0.6.4         MASS_7.3-64           
#>  [58] tools_4.4.3            lmtest_0.9-40          httpuv_1.6.15         
#>  [61] future.apply_1.11.3    goftest_1.2-3          glue_1.8.0            
#>  [64] nlme_3.1-167           promises_1.3.2         grid_4.4.3            
#>  [67] Rtsne_0.17             cluster_2.1.8          reshape2_1.4.4        
#>  [70] generics_0.1.3         gtable_0.3.6           spatstat.data_3.1-6   
#>  [73] tidyr_1.3.1            data.table_1.17.0      car_3.1-3             
#>  [76] spatstat.geom_3.3-6    RcppAnnoy_0.0.22       ggrepel_0.9.6         
#>  [79] RANN_2.6.2             pillar_1.10.1          stringr_1.5.1         
#>  [82] spam_2.11-1            RcppHNSW_0.6.0         later_1.4.1           
#>  [85] splines_4.4.3          dplyr_1.1.4            lattice_0.22-6        
#>  [88] survival_3.8-3         deldir_2.0-4           tidyselect_1.2.1      
#>  [91] miniUI_0.1.1.1         pbapply_1.7-2          knitr_1.50            
#>  [94] gridExtra_2.3          scattermore_1.2        xfun_0.52             
#>  [97] matrixStats_1.5.0      stringi_1.8.7          lazyeval_0.2.2        
#> [100] yaml_2.3.10            evaluate_1.0.3         codetools_0.2-20      
#> [103] tibble_3.2.1           cli_3.6.4              uwot_0.2.3            
#> [106] xtable_1.8-4           reticulate_1.42.0      systemfonts_1.2.1     
#> [109] munsell_0.5.1          jquerylib_0.1.4        Rcpp_1.0.14           
#> [112] globals_0.16.3         spatstat.random_3.3-3  png_0.1-8             
#> [115] spatstat.univar_3.1-2  parallel_4.4.3         pkgdown_2.1.1         
#> [118] dotCall64_1.2          listenv_0.9.1          viridisLite_0.4.2     
#> [121] scales_1.3.0           ggridges_0.5.6         purrr_1.0.4           
#> [124] rlang_1.1.5            cowplot_1.1.3

References

  • Amezquita R, et al. (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145.
  • Hao et al. (2023). Seurat V5. Nature Biotechnology.
  • Satija R, et al. (2025). SeuratData: Install and Manage Seurat Datasets. R package version 0.2.2.9002.
  • Pedersen T (2024). patchwork: The Composer of Plots. R package version 1.3.0.
  • Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.