Seurat pipeline for Trisomy data — Part 1: QC, Filtering and Initial Integration

Author

Carlo Zanetti

Code

rm(list = ls())

Seurat analysis pipeline for trisomy dataset

This script processes 10x Genomics scRNA-seq data derived from xenografted human microglia into mouse models. It covers the full pipeline from raw CellRanger outputs through to initial cluster identification and removal of non-microglial populations. Outputs from this script are read in by Part 2.

Workflow overview

Data ingestion: Loads CellRanger outputs and maps to metadata
QC and filtering:
- Verifies file integrity
- Separates human vs mouse cells based on gene prefixes
- Checkpoint save — first initial object
- Labels multiplets (scDblFinder) and outliers (MAD score)
Individual batch processing — QC and module scoring per pool
Normalisation and Integration:
- Merges sequencing pools
- Normalises using SCTransform v2
- Corrects batch effects using RPCA
Clustering and annotation:
- Generates UMAPs
- Scores cells against curated Mancuso signatures
Cluster markers
Identification and removal of non-microglial clusters

Load libraries

Code

library(dplyr)
library(Matrix)
library(Seurat)
library(SeuratObject)
library(tidyverse)
library(glue)
library(readxl)
library(future)
library(gridExtra)
library(SoupX)
library(scDblFinder)
library(SingleCellExperiment)
library(qs2)
library(presto)
library(ggplot2)
library(scales)
library(ggrepel)

1. Data ingestion

Load directories

Edit base_dir and run_num for each new run. Output directories are created automatically if they do not already exist.

Code

base_dir <- "/nemo/homefs/home/zanettc/home/shared/zanettc/emily_transcriptomics"
demux_path <- file.path(base_dir, "input/jan_raw/cellranger_output")

nums <- c("1", "2", "3", "4", "5", "6")

filtered_paths <- file.path(
    unlist(lapply(
      file.path(demux_path, glue("pool{nums}_out"), "outs", "per_sample_outs"),
      list.dirs, recursive = FALSE, full.names = TRUE
    )),
    "count", "sample_filtered_feature_bc_matrix"
  )

raw_paths <- file.path(
    unlist(lapply(
      file.path(demux_path, glue("pool{nums}_out"), "outs", "per_sample_outs"),
      list.dirs, recursive = FALSE, full.names = TRUE
    )),
    "count", "sample_raw_feature_bc_matrix"
  )

run_num <- "run1"

out_dir     <- file.path("output", run_num)
graphs_dir  <- file.path(out_dir, "graphs")
objects_dir <- file.path(out_dir, "objects")

dir.create(graphs_dir,  recursive = TRUE, showWarnings = FALSE)
dir.create(objects_dir, recursive = TRUE, showWarnings = FALSE)

Read in metadata

Code

meta <- read_excel(file.path(base_dir, "input/meta/meta_data_combined.xlsx"))

Size check

Verifies that the three expected CellRanger output files exist and reports their sizes for each sample.

Code

expected <- c("matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz")

check_files_by_path <- function(path) {
  sample_id   <- basename(dirname(dirname(path)))
  lane_folder <- basename(dirname(dirname(dirname(dirname(dirname(path))))))
  pool_number <- as.integer(gsub("\\D", "", lane_folder))

  files       <- file.path(path, expected)
  file_present <- file.exists(files)
  sizes       <- ifelse(file_present, file.info(files)$size, NA_real_)

  tibble(
    pool_id     = pool_number,
    sample_hash = sample_id,
    file        = expected,
    exists      = file_present,
    size        = sizes
  )
}

results <- map_dfr(filtered_paths, check_files_by_path)
results

Map sample hashes to mouse IDs

Each sample hash returned by CellRanger is matched against the experimental metadata to recover the biological mouse ID and pool assignment.

Code

path_map <- results %>%
  select(pool_id, sample_hash) %>%
  distinct() %>%
  mutate(
    filtered_path = filtered_paths,
    raw_path      = raw_paths
  )

named_paths_tbl <- path_map %>%
  left_join(meta, by = c("sample_hash", "pool_id"))

filtered_data_dirs <- named_paths_tbl %>%
  filter(!is.na(mouse_id)) %>%  # drops blank tag (B0255) used by Emma in cellranger
  select(mouse_id, filtered_path) %>%
  deframe()

raw_data_dirs <- named_paths_tbl %>%
  filter(!is.na(mouse_id)) %>%
  select(mouse_id, raw_path) %>%
  deframe()

2. QC and filtering

Removal of failed samples

These samples were removed due to poor CellRanger QC metrics or high reported cell death from FACS.

Code

bad_samples <- c("BDAE56.1B", "BDAG185.1G", "BDAE50.1D")

filter_samples <- function(data_dirs) {
  data_dirs[!names(data_dirs) %in% bad_samples]
}
filtered_data_dirs <- filter_samples(filtered_data_dirs)
raw_data_dirs      <- filter_samples(raw_data_dirs)

Read 10x data

Loading CellRanger output with Read10X requires all antibody capture panels to match exactly across samples. Because last year’s samples used a different hash antibody (Hash5), a direct merge crashes. Instead, each sample is loaded individually — extracting only the Gene Expression modality — before merging into a single Seurat object. Cell barcodes are prefixed with their mouse ID at merge to ensure uniqueness. A global gene filter retains only genes detected in at least 3 cells.

Code

make_seurat_object <- function(path, name) {
  data      <- Read10X(data.dir = path)
  rna_counts <- data$`Gene Expression`

  CreateSeuratObject(
    counts      = rna_counts,
    project     = name,
    min.features = 200
  )
}

seurat_list <- mapply(
  FUN      = make_seurat_object,
  path     = filtered_data_dirs,
  name     = names(filtered_data_dirs),
  SIMPLIFY = FALSE
)

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

10X data contains more than one type and is being returned as a list containing matrices of each type.

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

Code

seurat_obj <- merge(
  x           = seurat_list[[1]],
  y           = seurat_list[-1],
  add.cell.ids = names(filtered_data_dirs)
)

seurat_obj <- JoinLayers(seurat_obj)

seurat_obj@meta.data <- seurat_obj@meta.data %>%
  rownames_to_column("cell") %>%
  left_join(meta, by = c("orig.ident" = "mouse_id")) %>%
  column_to_rownames("cell")

seurat_obj@meta.data$group_id <- paste(
  seurat_obj@meta.data$app_genotype,
  seurat_obj@meta.data$microglia,
  sep = "_"
)

counts     <- GetAssayData(seurat_obj, assay = "RNA", layer = "counts")
keep_genes <- Matrix::rowSums(counts > 0) >= 3
seu        <- subset(seurat_obj, features = names(which(keep_genes)))

Filter for human cells and genes

Cells are assigned a species identity based on the fraction of UMIs mapping to human (GRCh38) vs mouse (GRCm39) genes. Cells with ≥90% human UMIs are retained. Mouse genes are then dropped from the feature set, and the GRCh38 prefix is stripped from remaining gene IDs for downstream compatibility.

Code

ids      <- rownames(seu)
is_human <- grepl("^GRCh38", ids)
is_mouse <- grepl("^GRCm39", ids)

human_ids <- ids[is_human]
mouse_ids <- ids[is_mouse]

raw_mat      <- GetAssayData(seu, layer = "counts")
human_counts <- Matrix::colSums(raw_mat[human_ids, , drop = FALSE])
mouse_counts <- Matrix::colSums(raw_mat[mouse_ids, , drop = FALSE])
total_counts <- Matrix::colSums(raw_mat)

seu$human_frac <- ifelse(total_counts > 0, human_counts / total_counts, 0)
seu$mouse_frac <- ifelse(total_counts > 0, mouse_counts / total_counts, 0)

human_cut <- 0.90
mouse_cut <- 0.90
seu$species_call <- dplyr::case_when(
  seu$human_frac >= human_cut ~ "human",
  seu$mouse_frac >= mouse_cut ~ "mouse",
  TRUE                        ~ "mixed_or_ambiguous"
)
print(table(seu$species_call))


             human mixed_or_ambiguous              mouse 
             88582               1371               1909

Code

filter_species <- function(seurat_obj, species = "human") {
  subset(seurat_obj, subset = species_call == species)
}
seu_h <- filter_species(seu, "human")

keep_features <- rownames(seu_h)[!grepl("^GRCm39-", rownames(seu_h))]
seu_h         <- subset(seu_h, features = keep_features)

rownames(seu_h) <- sub("^GRCh38-", "", rownames(seu_h))

dup <- any(duplicated(rownames(seu_h))); if (dup) warning("Duplicated gene IDs after renaming")

Initial save

Code

qs_save(seu_h, file.path(objects_dir, "seu_h_initial.qs2"))
seu_h <- qs_read(file.path(objects_dir, "seu_h_initial.qs2"))

Doublet detection using scDblFinder

Multiplets are identified using scDblFinder. The expected doublet rate is set to 0.4% per 1,000 recovered cells, consistent with the v4 10x chemistry specification. Detection is run per sequencing pool to account for pool-level variation.

Code

sce <- as.SingleCellExperiment(seu_h)

Warning: Layer 'data' is empty

Warning: Layer 'scale.data' is empty

Code

# dbr.per1k: for every 1,000 recovered cells, add 0.4% to the expected doublet rate (v4 chemistry)
sce <- scDblFinder(sce, samples = "pool_id", dbr.per1k = 0.004)


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |======================================================================| 100%

Code

seu_h$scDblFinder.class    <- sce$scDblFinder.class
seu_h$scDblFinder.score    <- sce$scDblFinder.score
seu_h$scDblFinder.weighted <- sce$scDblFinder.weighted

table(seu_h$pool_id, seu_h$scDblFinder.class)

   
    singlet doublet
  1   16895    1784
  2   16942    1787
  3   15441    1458
  4    9927    1132
  5   10087     672
  6   11670     787

Code

qs_save(seu_h, file.path(objects_dir, "seu_h_scdbl.qs2"))

Doublet score histogram

Code

p <- ggplot(seu_h@meta.data, aes(x = scDblFinder.score, fill = scDblFinder.class)) +
  geom_histogram(bins = 100, position = "identity", alpha = 0.6) +
  scale_x_log10() +
  theme_classic() +
  labs(
    title    = "Doublet Score Distribution",
    subtitle = "Look for a valley between Singlets (Grey) \nand Doublets (Red)",
    x        = "scDblFinder Score (Log Scale)",
    y        = "Count"
  ) +
  scale_fill_manual(values = c("singlet" = "grey", "doublet" = "red"))

p

Code

ggsave(file.path(graphs_dir, "QC_scdbl_hist.pdf"), plot = p, width = 12, height = 6, units = "in")

QC violin plot

Violin plots of UMI counts, feature counts, and mitochondrial percentage split by pool, coloured by scDblFinder classification.

Code

seu_h[["percent.mt"]] <- PercentageFeatureSet(seu_h, pattern = "^MT-")

p <- VlnPlot(
  seu_h,
  features  = c("nCount_RNA", "nFeature_RNA", "percent.mt"),
  group.by  = "pool_id",
  split.by  = "scDblFinder.class",
  pt.size   = 0
) +
  labs(fill = "Cell Class") +
  theme(legend.position = "right")

Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

The default behaviour of split.by has changed.
Separate violin plots are now plotted side-by-side.
To restore the old behaviour of a single split violin,
set split.plot = TRUE.
      
This message will be shown once per session.

Code

Code

ggsave(file.path(graphs_dir, "QC_scDbl_violin.pdf"), plot = p, width = 12, height = 6, units = "in")

Counts vs features

Code

plot1 <- FeatureScatter(seu_h, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(seu_h, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

ggsave(file.path(graphs_dir, "QC_nCount_vs_mt.pdf"),      plot = plot1, width = 12, height = 6, units = "in")
ggsave(file.path(graphs_dir, "QC_nCount_vs_feature.pdf"), plot = plot2, width = 12, height = 6, units = "in")

print(plot1)

Code

print(plot2)

Counts vs features coloured by doublet assignment

Code

plot <- FeatureScatter(
  seu_h,
  feature1 = "nCount_RNA",
  feature2 = "nFeature_RNA",
  group.by = "scDblFinder.class"
)
ggsave(file.path(graphs_dir, "QC_nCount_vs_feature_doublets.pdf"), plot = plot, width = 12, height = 6, units = "in")
print(plot)

Outlier detection (MAD scoring)

Outliers are flagged using the Median Absolute Deviation (MAD) rather than fixed thresholds, making the cutoff robust to pool-level differences in sequencing depth. Per-pool statistics are calculated and cutoffs at both 3 and 5 MADs above the median are retained for comparison.

Code

seu_h@meta.data <- seu_h@meta.data %>%
  select(-any_of(c("median_count", "mad_count", "upper_limit_5_mad", "upper_limit_3_mad")))

qc_stats <- seu_h@meta.data %>%
  group_by(pool_id) %>%
  summarise(
    median_count = median(nCount_RNA),
    mad_count    = mad(nCount_RNA)
  ) %>%
  mutate(
    upper_limit_5_mad = median_count + (5 * mad_count),
    upper_limit_3_mad = median_count + (3 * mad_count)
  )

meta_updated <- seu_h@meta.data %>%
  rownames_to_column("cell_id") %>%
  left_join(qc_stats, by = "pool_id") %>%
  mutate(
    mad5_outlier = ifelse(nCount_RNA >= upper_limit_5_mad, "Outlier", "Not Outlier"),
    mad3_outlier = ifelse(nCount_RNA >= upper_limit_3_mad, "Outlier", "Not Outlier")
  ) %>%
  column_to_rownames("cell_id")

seu_h@meta.data <- meta_updated

VlnPlot(seu_h, features = "nCount_RNA", group.by = "pool_id", pt.size = 0.1) +
  geom_hline(data = qc_stats, aes(yintercept = upper_limit_5_mad), color = "red", linetype = "dashed")

Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

Intersection of MAD outliers and doublet calls

Calculates the number and percentage of cells that would be removed at each MAD threshold, combined with doublet removal.

Code

dbl_col <- "scDblFinder.class"

global_stats <- seu_h@meta.data %>%
  summarise(
    total_cells      = n(),
    n_remove_3       = sum(mad3_outlier == "Outlier" | .data[[dbl_col]] == "doublet"),
    percent_remove_3 = round((n_remove_3 / total_cells) * 100, 2),
    n_remove_5       = sum(mad5_outlier == "Outlier" | .data[[dbl_col]] == "doublet"),
    percent_remove_5 = round((n_remove_5 / total_cells) * 100, 2)
  )

print(global_stats)

  total_cells n_remove_3 percent_remove_3 n_remove_5 percent_remove_5
1       88582       9411            10.62       8321             9.39

Outlier vs doublet co-occurrence

Code

table(Outlier_Status = seu_h$mad3_outlier, Doublet_Status = seu_h$scDblFinder.class)

              Doublet_Status
Outlier_Status singlet doublet
   Not Outlier   79171    4998
   Outlier        1791    2622

Cells with high feature counts within outlier groups

Checks how many cells exceed 6,000 features within each MAD category, as a sanity check against previously used fixed cutoffs.

Code

outlier_summary <- seu_h@meta.data %>%
  group_by(mad3_outlier, mad5_outlier) %>%
  summarise(
    Total_Cells              = n(),
    Cells_Over_6000_Features = sum(nFeature_RNA > 6000),
    Percentage_High_Feature  = round((Cells_Over_6000_Features / Total_Cells) * 100, 2),
    .groups = "drop"
  )

outlier_summary

MAD cutoff comparison

Prints the number of cells retained or lost at each threshold and overlays both cutoff lines on a violin plot.

Code

counts <- seu_h$nCount_RNA

median_val <- median(counts)
mad_val    <- mad(counts)

cutoff_3mad <- median_val + (3 * mad_val)
cutoff_5mad <- median_val + (5 * mad_val)

cat("Median UMI Count:", median_val, "\n")

Median UMI Count: 10839

Code

cat("MAD:", mad_val, "\n\n")

MAD: 3530.071

Code

cat("--- 3 MAD Cutoff ---\n")

--- 3 MAD Cutoff ---

Code

cat("Threshold:", round(cutoff_3mad, 2), "\n")

Threshold: 21429.21

Code

cat("Cells removed:", sum(counts > cutoff_3mad), "\n")

Cells removed: 4089

Code

cat("Cells kept:",   sum(counts <= cutoff_3mad), "\n\n")

Cells kept: 84493

Code

cat("--- 5 MAD Cutoff ---\n")

--- 5 MAD Cutoff ---

Code

cat("Threshold:", round(cutoff_5mad, 2), "\n")

Threshold: 28489.35

Code

cat("Cells removed:", sum(counts > cutoff_5mad), "\n")

Cells removed: 1424

Code

cat("Cells kept:",   sum(counts <= cutoff_5mad), "\n")

Cells kept: 87158

Code

p <- VlnPlot(seu_h, features = "nCount_RNA", pt.size = 0, group.by = "pool_id") +
  geom_hline(yintercept = cutoff_3mad, color = "red",  linetype = "dashed", size = 1) +
  geom_hline(yintercept = cutoff_5mad, color = "blue", linetype = "dashed", size = 1) +
  annotate("text", x = 5.1, y = 100000,
           label = paste0("3 MAD: ", round(cutoff_3mad)),
           vjust = 1.5, hjust = 0, color = "red", fontface = "bold") +
  annotate("text", x = 5.1, y = 100000,
           label = paste0("5 MAD: ", round(cutoff_5mad)),
           vjust = -0.5, hjust = 0, color = "blue", fontface = "bold") +
  ggtitle("nCount_RNA Distribution: 3 vs 5 MAD") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Pool number") +
  ylab("UMI counts")

Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Code

ggsave(
  filename = file.path(graphs_dir, glue("count_dist_3_vs_5_mad.pdf")),
  plot = p, width = 8, height = 5, dpi = 300, units = "in"
)

Count vs feature scatter with MAD cutoffs

Code

qc_scatter <- FeatureScatter(seu_h, feature1 = "nCount_RNA", feature2 = "nFeature_RNA",
                             group.by = "pool_id", shuffle = TRUE) +
  geom_vline(xintercept = cutoff_3mad, color = "red",  linetype = "dashed", size = 1) +
  annotate("text", x = cutoff_3mad, y = max(seu_h$nFeature_RNA), label = "3 MAD",
           color = "red",  angle = 90, vjust = -0.5, hjust = 1, fontface = "bold") +
  geom_vline(xintercept = cutoff_5mad, color = "blue", linetype = "dashed", size = 1) +
  annotate("text", x = cutoff_5mad, y = max(seu_h$nFeature_RNA), label = "5 MAD",
           color = "blue", angle = 90, vjust = -0.5, hjust = 1, fontface = "bold") +
  ggtitle("Count vs Feature (with MAD Cutoffs)") +
  xlab("Number of counts") +
  ylab("Number of genes")

ggsave(
  filename = file.path(graphs_dir, "QC_count_vs_feat_lines.pdf"),
  plot = qc_scatter, width = 7, height = 5, dpi = 300, units = "in"
)
print(qc_scatter)

Initial subsetting

Cells are filtered on three criteria: fewer than 500 detected genes, mitochondrial content above 15%, or flagged as a MAD5 outlier. Doublets are retained here and removed in Part 2 once cluster identity has been confirmed.

Code

ncol(seu_h)

[1] 88582

Code

seu_h_f <- subset(seu_h, subset = nFeature_RNA > 500 & percent.mt < 15 & mad5_outlier != "Outlier")
ncol(seu_h_f)

[1] 86504

Code

qs_save(seu_h_f, file.path(objects_dir, "seu_h_initial_subset.qs2"))

Naive UMAP on all data

A quick unintegrated UMAP gives an initial view of structure and reveals whether pool-level batch effects require correction.

Code

options(future.globals.maxSize = 65 * 1024^3)
seu_h_f <- SCTransform(seu_h_f, vst.flavor = "v2", verbose = FALSE)
seu_h_f <- RunPCA(seu_h_f)

PC_ 1 
Positive:  P2RY12, XACT, PLXDC2, OLR1, SPP1, IFNGR1, CX3CR1, FRMD4A, DOCK4, APBB1IP 
       CD69, HTRA1, IPCEF1, C3, APOC2, FCGR3A, LINC02712, ENSG00000249738, P2RY13, RGS1 
       SAMSN1, ADAM28, CCDC200, OLFML3, SRGN, SORL1, DLEU1, FYB1, SIPA1L2, ST6GAL1 
Negative:  RNASE1, SELENOP, CD36, LYVE1, F13A1, DAB2, COLEC12, CD163, SCN9A, PID1 
       MRC1, CD163L1, IGFBP4, TGFBI, LILRB5, IQGAP2, ITSN1, CCL8, VAV3, CD200R1 
       CD28, FGF13, WWP1, RND3, ANTXR2, NRP1, MS4A4A, CCDC170, CCL2, RBPJ 
PC_ 2 
Positive:  SPP1, HLA-DRA, CD9, APOC1, FTL, FTH1, EEF1A1, CD74, TPT1, HLA-DPA1 
       HLA-DRB1, RPL34, HLA-DPB1, NAP1L1, TMSB4X, RPS8, LGALS3, RPL13, RPL10, APOE 
       CCL4L2, PLA2G7, RPS24, RPS14, RPL5, RPL41, RPL13A, CCL3, CCL4, GPNMB 
Negative:  MKI67, TOP2A, CENPF, ASPM, TPX2, HMMR, KNL1, NUSAP1, ANLN, CDK1 
       PRC1, UBE2C, H1-5, DLGAP5, NDC80, DIAPH3, GTSE1, NCAPG, KIF14, CENPE 
       RRM2, CKAP2L, HMGB2, CCNA2, CDCA2, H2AC14, CEP55, SYNE2, KIF23, TYMS 
PC_ 3 
Positive:  FRMD4A, SFMBT2, DOCK4, ITPR2, AFF3, SLC9A9, ENSG00000249738, MEF2A, P2RY12, AOAH 
       SMYD3, LRMDA, KCNQ3, FCHSD2, PLXDC2, LINC02712, FOXN3, QKI, DIAPH2, SSBP2 
       LYVE1, ELMO1, SLC8A1, PRKCA, MAML2, ST6GAL1, SRGAP2, ZSWIM6, ENSG00000231918, MERTK 
Negative:  SPP1, HLA-DRA, CD9, APOC1, HLA-DPA1, FTL, HLA-DRB1, FTH1, TMSB4X, CD74 
       NAP1L1, HLA-DPB1, LGALS1, EEF1A1, TPT1, RPS8, LGALS3, DBI, CCL4, RPLP1 
       RPL13, RPS2, APOE, CCL3, PLA2G7, RPL41, RPL10, RPS6, GPNMB, RPS24 
PC_ 4 
Positive:  MT-CO1, MT-CO3, MT-ND4, MT-CO2, TMSB4X, MT-CYB, MT-ATP6, TPT1, MT-ND3, RPLP1 
       RPS8, RPL13, MT-ND5, RPS24, EEF1A1, MT-ND4L, RPL13A, MT-ND2, FTL, ENSG00000289901 
       RPS2, RPS4X, RPS23, RPL34, RPS14, AIF1, MT-ND1, RPL21, P2RY12, RPS6 
Negative:  CCL4, CCL4L2, CCL3, CCL3L3, FOS, IL1B, DUSP1, GADD45B, CCDC200, SPP1 
       CCL2, TEX14, IER2, NR4A2, CD83, KLF2, CD69, BTG2, SRGN, JUNB 
       JUN, CXCL8, IER3, ENSG00000262202, CD36, NR4A1, OLR1, ZFP36, EGR1, ATF3 
PC_ 5 
Positive:  CCDC200, CCL4L2, TEX14, ENSG00000262202, TPT1, AGR2, ENSG00000285646, NR4A1, PTH2R, LYVE1 
       CCL3L3, CST3, EEF1A1, PNRC1, RPL13, ENSG00000287979, IER3, FOSB, NR4A2, RPL13A 
       LINC00910, RPS8, RPS24, IER2, AIF1, IL1B, BTG2, RPLP1, H3-3B, RPL41 
Negative:  SPP1, CD9, IFIT3, APOC1, MX1, IFIT1, IFI44L, IFI6, IFIT2, GPNMB 
       LGALS3, LGALS1, IFI27, XACT, APOE, IFITM1, MX2, RTTN, GLDN, RSAD2 
       SCIN, MYO1E, PLA2G7, RGCC, ISG15, LINC01235, MSR1, LIPA, PLXDC2, IFITM3

Code

seu_h_f <- RunUMAP(seu_h_f, dims = 1:30)

Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session

17:42:47 UMAP embedding parameters a = 0.9922 b = 1.112

17:42:47 Read 86504 rows and found 30 numeric columns

17:42:47 Using Annoy for neighbor search, n_neighbors = 30

17:42:47 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

**************************************************|
17:42:53 Writing NN index file to temp file /tmp/slurm_45934587/RtmpfBZYPK/filedc5153069a22
17:42:53 Searching Annoy index using 1 thread, search_k = 3000
17:43:22 Annoy recall = 100%
17:43:24 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
17:43:29 Initializing from normalized Laplacian + noise (using RSpectra)
17:43:34 Commencing optimization for 200 epochs, with 3929318 positive edges
17:43:34 Using rng type: pcg
17:44:14 Optimization finished

Unintegrated UMAP split by pool

Code

p <- DimPlot(seu_h_f,
             reduction = "umap",
             split.by  = "pool_id",
             group.by  = "pool_id") +
  ggtitle("Unintegrated Data: Split by Pool")
p

Code

ggsave(file.path(graphs_dir, "QC_umaps_pools_split.pdf"), plot = p, width = 18, height = 6, units = "in")

3. Individual batch processing

Each pool is processed independently to assess quality and score cell type identity before integration. This allows per-pool QC inspection and ensures Mancuso signatures are applied at the batch level.

Load gene signatures

Gene sets for non-microglial cell types (provided by Emma) and Mancuso 2024 microglial states are loaded here for use in both batch-level and integrated scoring.

Code

sig_dir <- file.path(base_dir, "signatures")

gs_myeloid       <- readLines(file.path(sig_dir, "myeloid.txt"))
gs_proliferating <- readLines(file.path(sig_dir, "proliferating.txt"))
gs_macrophage    <- readLines(file.path(sig_dir, "macrophage.txt"))
gs_dendritic     <- readLines(file.path(sig_dir, "dendritic.txt"))

mancuso_dir <- "/nemo/lab/destrooperb/home/shared/zanettc/Mancuso2024_genesets"

gs_crm  <- readLines(file.path(mancuso_dir, "CRM_geneset.txt"))
gs_dam  <- readLines(file.path(mancuso_dir, "DAM_geneset.txt"))
gs_hla  <- readLines(file.path(mancuso_dir, "HLA_geneset.txt"))
gs_hm   <- readLines(file.path(mancuso_dir, "HM_geneset.txt"))
gs_irm  <- readLines(file.path(mancuso_dir, "IRM_geneset.txt"))
gs_rm   <- readLines(file.path(mancuso_dir, "RM_geneset.txt"))

Warning in readLines(file.path(mancuso_dir, "RM_geneset.txt")): incomplete
final line found on
'/nemo/lab/destrooperb/home/shared/zanettc/Mancuso2024_genesets/RM_geneset.txt'

Code

gs_tcrm <- readLines(file.path(mancuso_dir, "transCRM_geneset.txt"))

Run per-batch UMAP and clustering

Each batch is normalised with SCTransform, reduced with PCA, and embedded with UMAP. Clustering at resolution 0.2 generates initial cluster assignments for module scoring.

Code

batches <- SplitObject(seu_h_f, split.by = "pool_id")

options(future.globals.maxSize = 8912896000000)
batches <- lapply(batches, SCTransform)

Running SCTransform on assay: RNA

vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.

Calculating cell attributes from input UMI matrix: log_umi

Variance stabilizing transformation of count matrix of size 24322 by 18350

Model formula is y ~ log_umi

Get Negative Binomial regression parameters per gene

Using 2000 genes, 5000 cells

Found 196 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 24322 genes

Computing corrected count matrix for 24322 genes

Calculating gene attributes

Wall clock passed: Time difference of 1.362898 mins

Determine variable features

Centering data matrix

Place corrected count matrix in counts slot

Warning: Different cells and/or features from existing assay SCT

Set default assay to SCT

Running SCTransform on assay: RNA

vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.

Calculating cell attributes from input UMI matrix: log_umi

Variance stabilizing transformation of count matrix of size 24054 by 18346

Model formula is y ~ log_umi

Get Negative Binomial regression parameters per gene

Using 2000 genes, 5000 cells

Found 225 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 24054 genes

Computing corrected count matrix for 24054 genes

Calculating gene attributes

Wall clock passed: Time difference of 1.339851 mins

Determine variable features

Centering data matrix

Place corrected count matrix in counts slot

Warning: Different cells and/or features from existing assay SCT

Set default assay to SCT

Running SCTransform on assay: RNA

vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.

Calculating cell attributes from input UMI matrix: log_umi

Variance stabilizing transformation of count matrix of size 23159 by 16345

Model formula is y ~ log_umi

Get Negative Binomial regression parameters per gene

Using 2000 genes, 5000 cells

Found 199 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 23159 genes

Computing corrected count matrix for 23159 genes

Calculating gene attributes

Wall clock passed: Time difference of 1.183697 mins

Determine variable features

Centering data matrix

Place corrected count matrix in counts slot

Warning: Different cells and/or features from existing assay SCT

Set default assay to SCT

Running SCTransform on assay: RNA

vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.

Calculating cell attributes from input UMI matrix: log_umi

Variance stabilizing transformation of count matrix of size 22383 by 10620

Model formula is y ~ log_umi

Get Negative Binomial regression parameters per gene

Using 2000 genes, 5000 cells

Found 158 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 22383 genes

Computing corrected count matrix for 22383 genes

Calculating gene attributes

Wall clock passed: Time difference of 49.21991 secs

Determine variable features

Centering data matrix

Place corrected count matrix in counts slot

Warning: Different cells and/or features from existing assay SCT

Set default assay to SCT

Running SCTransform on assay: RNA

vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.

Calculating cell attributes from input UMI matrix: log_umi

Variance stabilizing transformation of count matrix of size 22633 by 10588

Model formula is y ~ log_umi

Get Negative Binomial regression parameters per gene

Using 2000 genes, 5000 cells

Found 154 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 22633 genes

Computing corrected count matrix for 22633 genes

Calculating gene attributes

Wall clock passed: Time difference of 51.41714 secs

Determine variable features

Centering data matrix

Place corrected count matrix in counts slot

Warning: Different cells and/or features from existing assay SCT

Set default assay to SCT

Running SCTransform on assay: RNA

vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.

Calculating cell attributes from input UMI matrix: log_umi

Variance stabilizing transformation of count matrix of size 23243 by 12255

Model formula is y ~ log_umi

Get Negative Binomial regression parameters per gene

Using 2000 genes, 5000 cells

Found 175 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 23243 genes

Computing corrected count matrix for 23243 genes

Calculating gene attributes

Wall clock passed: Time difference of 56.26383 secs

Determine variable features

Centering data matrix

Place corrected count matrix in counts slot

Warning: Different cells and/or features from existing assay SCT

Set default assay to SCT

Code

batches <- lapply(batches, \(o) {
  set.seed(42)
  o <- RunPCA(o, npcs = 30, assay = "SCT", verbose = FALSE)
  set.seed(42)
  o <- RunUMAP(o, dims = 1:25, n.neighbors = 30L, n.epochs = 500, min.dist = 0.05)
  o
})

Warning: Number of dimensions changing from 50 to 30

17:51:55 UMAP embedding parameters a = 1.75 b = 0.8421

17:51:55 Read 18350 rows and found 25 numeric columns

17:51:55 Using Annoy for neighbor search, n_neighbors = 30

17:51:55 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

**************************************************|
17:51:56 Writing NN index file to temp file /tmp/slurm_45934587/RtmpfBZYPK/filedc51518aba28f
17:51:56 Searching Annoy index using 1 thread, search_k = 3000
17:52:01 Annoy recall = 100%
17:52:03 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
17:52:06 Initializing from normalized Laplacian + noise (using RSpectra)
17:52:07 Commencing optimization for 500 epochs, with 798570 positive edges
17:52:07 Using rng type: pcg
17:52:28 Optimization finished

Warning: Number of dimensions changing from 50 to 30

17:52:33 UMAP embedding parameters a = 1.75 b = 0.8421
17:52:33 Read 18346 rows and found 25 numeric columns
17:52:33 Using Annoy for neighbor search, n_neighbors = 30
17:52:33 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
17:52:35 Writing NN index file to temp file /tmp/slurm_45934587/RtmpfBZYPK/filedc515249d6a3f
17:52:35 Searching Annoy index using 1 thread, search_k = 3000
17:52:39 Annoy recall = 100%
17:52:41 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
17:52:45 Initializing from normalized Laplacian + noise (using RSpectra)
17:52:45 Commencing optimization for 500 epochs, with 815792 positive edges
17:52:45 Using rng type: pcg
17:53:06 Optimization finished

Warning: Number of dimensions changing from 50 to 30

17:53:11 UMAP embedding parameters a = 1.75 b = 0.8421
17:53:11 Read 16345 rows and found 25 numeric columns
17:53:11 Using Annoy for neighbor search, n_neighbors = 30
17:53:11 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
17:53:12 Writing NN index file to temp file /tmp/slurm_45934587/RtmpfBZYPK/filedc51573057ced
17:53:12 Searching Annoy index using 1 thread, search_k = 3000
17:53:16 Annoy recall = 100%
17:53:18 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
17:53:22 Initializing from normalized Laplacian + noise (using RSpectra)
17:53:22 Commencing optimization for 500 epochs, with 718316 positive edges
17:53:22 Using rng type: pcg
17:53:41 Optimization finished

Warning: Number of dimensions changing from 50 to 30

17:53:45 UMAP embedding parameters a = 1.75 b = 0.8421
17:53:45 Read 10620 rows and found 25 numeric columns
17:53:45 Using Annoy for neighbor search, n_neighbors = 30
17:53:45 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
17:53:45 Writing NN index file to temp file /tmp/slurm_45934587/RtmpfBZYPK/filedc51584005a9
17:53:45 Searching Annoy index using 1 thread, search_k = 3000
17:53:48 Annoy recall = 100%
17:53:49 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
17:53:53 Initializing from normalized Laplacian + noise (using RSpectra)
17:53:53 Commencing optimization for 500 epochs, with 460334 positive edges
17:53:53 Using rng type: pcg
17:54:06 Optimization finished

Warning: Number of dimensions changing from 50 to 30

17:54:09 UMAP embedding parameters a = 1.75 b = 0.8421
17:54:09 Read 10588 rows and found 25 numeric columns
17:54:09 Using Annoy for neighbor search, n_neighbors = 30
17:54:09 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
17:54:10 Writing NN index file to temp file /tmp/slurm_45934587/RtmpfBZYPK/filedc51573561cc9
17:54:10 Searching Annoy index using 1 thread, search_k = 3000
17:54:13 Annoy recall = 100%
17:54:14 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
17:54:18 Initializing from normalized Laplacian + noise (using RSpectra)
17:54:18 Commencing optimization for 500 epochs, with 460154 positive edges
17:54:18 Using rng type: pcg
17:54:31 Optimization finished

Warning: Number of dimensions changing from 50 to 30

17:54:35 UMAP embedding parameters a = 1.75 b = 0.8421
17:54:35 Read 12255 rows and found 25 numeric columns
17:54:35 Using Annoy for neighbor search, n_neighbors = 30
17:54:35 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
17:54:36 Writing NN index file to temp file /tmp/slurm_45934587/RtmpfBZYPK/filedc5156a84bc96
17:54:36 Searching Annoy index using 1 thread, search_k = 3000
17:54:39 Annoy recall = 100%
17:54:40 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
17:54:44 Initializing from normalized Laplacian + noise (using RSpectra)
17:54:44 Commencing optimization for 500 epochs, with 529934 positive edges
17:54:44 Using rng type: pcg
17:54:59 Optimization finished

Code

invisible(lapply(names(batches), function(nm) {
  p  <- DimPlot(batches[[nm]], reduction = "umap", group.by = "orig.ident", shuffle = TRUE) + ggtitle(nm)
  p2 <- FeaturePlot(batches[[nm]], features = c("nCount_RNA", "nFeature_RNA", "percent.mt"), cols = c("#440154", "#FDE725"))
  p3 <- DimPlot(batches[[nm]], group.by = "scDblFinder.class") + ggtitle(glue("Doublet Locations Batch {nm}"))
  p4 <- DimPlot(batches[[nm]], reduction = "umap", split.by = "orig.ident", shuffle = TRUE, ncol = 3) + ggtitle(nm)

  print(p); print(p2); print(p3); print(p4)

  batch_graphs_dir <- file.path(graphs_dir, glue("batch_{nm}"))
  dir.create(batch_graphs_dir, recursive = TRUE, showWarnings = FALSE)

  ggsave(filename = file.path(batch_graphs_dir, glue("QC_batch{nm}_umap_samples_overlayed.pdf")), plot = p,  width = 7, height = 5, dpi = 300, units = "in")
  ggsave(filename = file.path(batch_graphs_dir, glue("QC_batch{nm}_umap_qc_x3.pdf")),             plot = p2, width = 7, height = 5, dpi = 300, units = "in")
  ggsave(filename = file.path(batch_graphs_dir, glue("QC_batch{nm}_doublets_umap.pdf")),           plot = p3, width = 7, height = 5, dpi = 300, units = "in")
  ggsave(filename = file.path(batch_graphs_dir, glue("QC_batch{nm}_umap_samples_split.pdf")),      plot = p4, width = 7, height = 5, dpi = 300, units = "in")
}))

Save / reload batch objects

Code

qs_save(batches, file = file.path(objects_dir, "umap_batches.qs2"))

Code

batches <- qs_read(file.path(objects_dir, "umap_batches.qs2"))

Clustering

Code

res <- 0.2
batches <- lapply(batches, \(o) {
  set.seed(42)
  o <- FindNeighbors(o, reduction = "pca", dims = 1:30)
  set.seed(42)
  o <- FindClusters(o, resolution = res)
})

Computing nearest neighbor graph

Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 18350
Number of edges: 608878

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9067
Number of communities: 8
Elapsed time: 3 seconds

Computing nearest neighbor graph
Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 18346
Number of edges: 614514

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8986
Number of communities: 9
Elapsed time: 2 seconds

Computing nearest neighbor graph
Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 16345
Number of edges: 521575

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8821
Number of communities: 8
Elapsed time: 2 seconds

Computing nearest neighbor graph
Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 10620
Number of edges: 360281

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8946
Number of communities: 8
Elapsed time: 1 seconds

Computing nearest neighbor graph
Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 10588
Number of edges: 344063

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8836
Number of communities: 7
Elapsed time: 1 seconds

Computing nearest neighbor graph
Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 12255
Number of edges: 406332

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8973
Number of communities: 8
Elapsed time: 1 seconds

Code

invisible(lapply(names(batches), function(nm) {
  p <- DimPlot(batches[[nm]], reduction = "umap", label = TRUE) +
    ggtitle(glue("Batch {nm} UMAP {res} resolution"))
  print(p)

  batch_graphs_dir <- file.path(graphs_dir, glue("batch_{nm}"))
  ggsave(
    filename = file.path(batch_graphs_dir, glue("clusters_batch_{nm}_{res}.pdf")),
    plot = p, width = 7, height = 5, dpi = 300, units = "in"
  )
}))

Optional save / reload for clustered batches

Code

for (i in seq_along(batches)) {
  qs_save(batches[[i]], file = paste0(objects_dir, "/batch", i, "_clustered.qs2"))
}

Code

files   <- list.files(path = objects_dir, pattern = "batch[0-9]+_clustered.qs2", full.names = TRUE)
batches <- lapply(files, qs_read)

Module scoring per batch

Gene sets are intersected against the batch feature space before scoring. AddModuleScore calculates average expression of each gene set relative to a randomly sampled control set, giving a background-corrected enrichment score per cell.

Code

myeloid_seu    <- intersect(gs_myeloid,       rownames(batches[[1]]))
prolif_seu     <- intersect(gs_proliferating, rownames(batches[[1]]))
macrophage_seu <- intersect(gs_macrophage,    rownames(batches[[1]]))
dendritic_seu  <- intersect(gs_dendritic,     rownames(batches[[1]]))

gene_sets <- list(
  dendritic     = dendritic_seu,
  proliferating = prolif_seu,
  macrophage    = macrophage_seu,
  myeloid       = myeloid_seu,
  CRM           = gs_crm,
  DAM           = gs_dam,
  HLA           = gs_hla,
  HM            = gs_hm,
  IRM           = gs_irm,
  RM            = gs_rm,
  tCRM          = gs_tcrm
)

batches <- lapply(seq_along(batches), function(i) {
  o  <- batches[[i]]
  nm <- as.character(i)
  message(glue("Processing Batch {nm}..."))

  batch_graphs_dir <- file.path(graphs_dir, glue("batch_{nm}"))
  if (!dir.exists(batch_graphs_dir)) dir.create(batch_graphs_dir, recursive = TRUE)

  o <- AddModuleScore(object = o, features = gene_sets, name = names(gene_sets), ctrl = 5, seed = 42)

  o$seurat_clusters <- factor(
    o$seurat_clusters,
    levels = sort(as.numeric(unique(as.character(o$seurat_clusters))))
  )
  Idents(o) <- "seurat_clusters"

  p1 <- VlnPlot(o, features = c("dendritic1", "proliferating2", "macrophage3", "myeloid4"),
                group.by = "seurat_clusters", pt.size = 0, ncol = 2) +
    ggtitle(glue("Batch {nm}: Cell Types"))
  p2 <- VlnPlot(o, features = c("CRM5", "DAM6", "HLA7", "HM8", "IRM9", "RM10", "tCRM11"),
                group.by = "seurat_clusters", pt.size = 0, ncol = 4) +
    ggtitle(glue("Batch {nm}: Mancuso Scores"))
  p3 <- VlnPlot(o, features = c("nFeature_RNA"),
                group.by = "seurat_clusters", pt.size = 0, ncol = 2) +
    ggtitle(glue("Batch {nm}: QC"))

  print(p1); print(p2); print(p3)

  ggsave(filename = file.path(batch_graphs_dir, glue("batch_{nm}_celltypes.pdf")), plot = p1, width = 7,  height = 5, dpi = 300)
  ggsave(filename = file.path(batch_graphs_dir, glue("batch_{nm}_mancuso.pdf")),   plot = p2, width = 10, height = 5, dpi = 300)
  ggsave(filename = file.path(batch_graphs_dir, glue("batch_{nm}_qc.pdf")),        plot = p3, width = 10, height = 5, dpi = 300)

  return(o)
})

Processing Batch 1...

Warning: The following features are not present in the object: CCL3L1, IL8,
MIR142, CCL4L1, LINC00936, H3F3B, RN7SL1, SNORD3B-2, AC006129.4, RP11-386I14.4,
MIR24-2, not searching for symbol synonyms

Warning: The following features are not present in the object: MT-RNR2,
RP11-536O18.2, FAM129A, AC004086.1, MT-RNR1, C9orf16, TMEM251, ATP5E, PPAP2A,
not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, GLTSCR2,
ATP5G2, RPS4Y1, not searching for symbol synonyms

Warning: The following features are not present in the object: C1orf63,
FAM105A, GPR56, C10orf54, TMEM173, FYB, RP11-480C22.1, RP11-343N15.5, KIAA1551,
RP1-249H1.4, RP11-439A17.7, not searching for symbol synonyms

Warning: The following features are not present in the object: DDX58, FAM26F,
C19orf66, WARS, not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, not
searching for symbol synonyms

Warning: The following features are not present in the object: LINC00936,
H3F3B, SNORD3B-1, CTA-29F11-1, SNORD3B-2, MT-RNR2, AC253572-2, AC103591-3,
AC020916-1, RP11-386I14-4, MT-RNR1, HIST2H2AA3, MTCO1P12, AP003481-1,
AC004687-1, SNORD3A, AC243960-1, HIST1H4E, RP1-18D14-7, AC245014-3, AL499604-1,
AL118516-1, AL021155-2, HIST1H4C, GGTA1P, RNU2-63P, not searching for symbol
synonyms

Processing Batch 2...

Warning: The following features are not present in the object: CCL3L1, IL8,
MIR142, CCL4L1, LINC00936, H3F3B, RN7SL1, SNORD3B-2, AC006129.4, RP11-386I14.4,
MIR24-2, not searching for symbol synonyms

Warning: The following features are not present in the object: MT-RNR2,
RP11-536O18.2, FAM129A, AC004086.1, MT-RNR1, C9orf16, TMEM251, ATP5E, PPAP2A,
not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, GLTSCR2,
ATP5G2, RPS4Y1, not searching for symbol synonyms

Warning: The following features are not present in the object: C1orf63,
FAM105A, GPR56, C10orf54, TMEM173, FYB, RP11-480C22.1, RP11-343N15.5, KIAA1551,
RP1-249H1.4, RP11-439A17.7, not searching for symbol synonyms

Warning: The following features are not present in the object: DDX58, FAM26F,
C19orf66, WARS, not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, not
searching for symbol synonyms

Warning: The following features are not present in the object: LINC00936,
H3F3B, SNORD3B-1, CTA-29F11-1, SNORD3B-2, MT-RNR2, AC253572-2, AC103591-3,
AC020916-1, RP11-386I14-4, MT-RNR1, HIST2H2AA3, MTCO1P12, AP003481-1,
AC004687-1, SNORD3A, AC243960-1, HIST1H4E, RP1-18D14-7, AC245014-3, AL499604-1,
AL118516-1, AL021155-2, HIST1H4C, GGTA1P, RNU2-63P, not searching for symbol
synonyms

Processing Batch 3...

Warning: The following features are not present in the object: CCL3L1, IL8,
MIR142, CCL4L1, LINC00936, H3F3B, RN7SL1, SNORD3B-2, AC006129.4, RP11-386I14.4,
MIR24-2, not searching for symbol synonyms

Warning: The following features are not present in the object: MT-RNR2,
RP11-536O18.2, FAM129A, AC004086.1, MT-RNR1, C9orf16, TMEM251, ATP5E, PPAP2A,
not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, GLTSCR2,
ATP5G2, RPS4Y1, not searching for symbol synonyms

Warning: The following features are not present in the object: C1orf63,
FAM105A, GPR56, C10orf54, TMEM173, FYB, RP11-480C22.1, RP11-343N15.5, KIAA1551,
RP1-249H1.4, RP11-439A17.7, not searching for symbol synonyms

Warning: The following features are not present in the object: DDX58, FAM26F,
C19orf66, WARS, not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, not
searching for symbol synonyms

Warning: The following features are not present in the object: LINC00936,
H3F3B, SNORD3B-1, CTA-29F11-1, SNORD3B-2, MT-RNR2, AC253572-2, AC103591-3,
AC020916-1, RP11-386I14-4, MT-RNR1, HIST2H2AA3, MTCO1P12, AP003481-1,
AC004687-1, SNORD3A, AC243960-1, HIST1H4E, RP1-18D14-7, AC245014-3, AL499604-1,
AL118516-1, AL021155-2, HIST1H4C, GGTA1P, RNU2-63P, not searching for symbol
synonyms

Processing Batch 4...

Warning: The following features are not present in the object: CCL3L1, IL8,
MIR142, CCL4L1, LINC00936, H3F3B, RN7SL1, SNORD3B-2, AC006129.4, RP11-386I14.4,
MIR24-2, not searching for symbol synonyms

Warning: The following features are not present in the object: MT-RNR2,
RP11-536O18.2, FAM129A, AC004086.1, MT-RNR1, C9orf16, TMEM251, ATP5E, PPAP2A,
not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, GLTSCR2,
ATP5G2, RPS4Y1, not searching for symbol synonyms

Warning: The following features are not present in the object: C1orf63,
FAM105A, GPR56, C10orf54, TMEM173, FYB, RP11-480C22.1, RP11-343N15.5, KIAA1551,
RP1-249H1.4, RP11-439A17.7, not searching for symbol synonyms

Warning: The following features are not present in the object: DDX58, FAM26F,
C19orf66, WARS, TNFAIP6, not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, not
searching for symbol synonyms

Warning: The following features are not present in the object: LINC00936,
H3F3B, SNORD3B-1, CTA-29F11-1, SNORD3B-2, MT-RNR2, AC253572-2, AC103591-3,
AC020916-1, RP11-386I14-4, MT-RNR1, HIST2H2AA3, MTCO1P12, AP003481-1,
AC004687-1, SNORD3A, AC243960-1, HIST1H4E, RP1-18D14-7, AC245014-3, AL499604-1,
AL118516-1, AL021155-2, HIST1H4C, GGTA1P, RNU2-63P, not searching for symbol
synonyms

Processing Batch 5...

Warning: The following features are not present in the object: CCL3L1, IL8,
MIR142, CCL4L1, LINC00936, H3F3B, RN7SL1, SNORD3B-2, AC006129.4, RP11-386I14.4,
MIR24-2, not searching for symbol synonyms

Warning: The following features are not present in the object: MT-RNR2,
RP11-536O18.2, FAM129A, AC004086.1, MT-RNR1, C9orf16, TMEM251, ATP5E, PPAP2A,
not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, GLTSCR2,
ATP5G2, RPS4Y1, not searching for symbol synonyms

Warning: The following features are not present in the object: C1orf63,
FAM105A, GPR56, C10orf54, TMEM173, FYB, RP11-480C22.1, RP11-343N15.5, KIAA1551,
RP1-249H1.4, RP11-439A17.7, not searching for symbol synonyms

Warning: The following features are not present in the object: DDX58, FAM26F,
C19orf66, WARS, TNFAIP6, CXCL11, not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, not
searching for symbol synonyms

Warning: The following features are not present in the object: LINC00936,
H3F3B, SNORD3B-1, CTA-29F11-1, SNORD3B-2, MT-RNR2, AC253572-2, AC103591-3,
AC020916-1, RP11-386I14-4, MT-RNR1, HIST2H2AA3, MTCO1P12, AP003481-1,
AC004687-1, SNORD3A, AC243960-1, HIST1H4E, RP1-18D14-7, AC245014-3, AL499604-1,
AL118516-1, AL021155-2, HIST1H4C, GGTA1P, RNU2-63P, not searching for symbol
synonyms

Processing Batch 6...

Warning: The following features are not present in the object: CCL3L1, IL8,
MIR142, CCL4L1, LINC00936, H3F3B, RN7SL1, SNORD3B-2, AC006129.4, RP11-386I14.4,
MIR24-2, not searching for symbol synonyms

Warning: The following features are not present in the object: MT-RNR2,
RP11-536O18.2, FAM129A, AC004086.1, MT-RNR1, C9orf16, TMEM251, ATP5E, PPAP2A,
not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, GLTSCR2,
ATP5G2, RPS4Y1, not searching for symbol synonyms

Warning: The following features are not present in the object: C1orf63,
FAM105A, GPR56, C10orf54, TMEM173, FYB, RP11-480C22.1, RP11-343N15.5, KIAA1551,
RP1-249H1.4, RP11-439A17.7, not searching for symbol synonyms

Warning: The following features are not present in the object: DDX58, FAM26F,
C19orf66, WARS, TNFAIP6, CXCL11, not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, not
searching for symbol synonyms

Warning: The following features are not present in the object: LINC00936,
H3F3B, SNORD3B-1, CTA-29F11-1, SNORD3B-2, MT-RNR2, AC253572-2, AC103591-3,
AC020916-1, RP11-386I14-4, MT-RNR1, HIST2H2AA3, MTCO1P12, AP003481-1,
AC004687-1, SNORD3A, AC243960-1, HIST1H4E, RP1-18D14-7, AC245014-3, AL499604-1,
AL118516-1, AL021155-2, HIST1H4C, GGTA1P, RNU2-63P, not searching for symbol
synonyms

4. Normalisation and Integration

Batches are merged back into a single object and re-normalised with SCTransform v2. PCA is then run on the merged, unintegrated object to provide the input embedding required by RPCA. Integration with IntegrateLayers (RPCA method) corrects pool-level batch effects while aiming to preserve biological variance.

Code

merged_obj <- merge(batches[[1]], y = batches[-1])
Idents(merged_obj) <- "pool_id"

options(future.globals.maxSize = 100 * 1024^3)
merged_obj <- SCTransform(merged_obj, vst.flavor = "v2", verbose = FALSE)
merged_obj <- RunPCA(merged_obj, verbose = FALSE, assay = "SCT")

set.seed(42)
int_method <- "rpca"
DefaultAssay(merged_obj) <- "SCT"
merged_obj <- IntegrateLayers(
  object               = merged_obj,
  method               = RPCAIntegration,
  normalization.method = "SCT",
  orig.reduction       = "pca",
  new.reduction        = glue("integrated.{int_method}"),
  verbose              = FALSE
)

Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations
('future_lapply-1') unexpectedly generated random numbers without declaring so.
There is a risk that those random numbers are not statistically sound and the
overall results might be invalid. To fix this, specify 'future.seed=TRUE'. This
ensures that proper, parallel-safe random numbers are produced via a parallel
RNG method. To disable this check, use 'future.seed = NULL', or set option
'future.rng.onMisuse' to "ignore". [future 'future_lapply-1'
(8be034ec696f25278ad3a140970bf912-59); on
8be034ec696f25278ad3a140970bf912@cn027<902421>]

5. Clustering and annotation

Neighbours are found on the integrated RPCA reduction and clusters are identified at resolution 0.2. A UMAP is computed on the same integrated reduction for visualisation.

Code

set.seed(42)
merged_obj <- FindNeighbors(merged_obj, reduction = glue("integrated.{int_method}"), dims = 1:30)

Computing nearest neighbor graph

Computing SNN

Code

set.seed(42)
cluster_resolution <- 0.2
merged_obj <- FindClusters(
  merged_obj,
  resolution   = cluster_resolution,
  cluster.name = glue("{int_method}_clusters"),
  graph.name   = "SCT_snn"
)

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 86504
Number of edges: 2726533

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9195
Number of communities: 12
Elapsed time: 52 seconds

Run UMAP on integrated reduction

Code

set.seed(42)
merged_obj <- RunUMAP(
  merged_obj,
  dims           = 1:30,
  reduction.name = glue("umap.{int_method}")
)

18:18:58 UMAP embedding parameters a = 0.9922 b = 1.112

18:18:58 Read 86504 rows and found 30 numeric columns

18:18:58 Using Annoy for neighbor search, n_neighbors = 30

18:18:58 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

**************************************************|
18:19:04 Writing NN index file to temp file /tmp/slurm_45934587/RtmpfBZYPK/filedc515768aef1f
18:19:05 Searching Annoy index using 1 thread, search_k = 3000
18:19:35 Annoy recall = 100%
18:19:37 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
18:19:43 Initializing from normalized Laplacian + noise (using RSpectra)
18:19:47 Commencing optimization for 200 epochs, with 3937638 positive edges
18:19:47 Using rng type: pcg
18:20:27 Optimization finished

Save integrated object

Code

qs_save(merged_obj, file = file.path(objects_dir, "prelabelled_integrated_rpca.qs2"))
merged_obj <- qs_read(file.path(objects_dir, "prelabelled_integrated_rpca.qs2"))

Diagnostic UMAP plots

A standard set of diagnostic plots checks cluster structure, sample and pool mixing, QC metric distributions, doublet and outlier locations, and condition separation.

Code

p1 <- DimPlot(merged_obj, reduction = glue("umap.{int_method}"),
              group.by = glue("{int_method}_clusters"), label = TRUE) +
  ggtitle(glue("Clusters {cluster_resolution} resolution"))

p2 <- DimPlot(merged_obj, reduction = glue("umap.{int_method}"),
              group.by = "orig.ident", shuffle = TRUE, alpha = 0.5) +
  ggtitle("Sample Distribution Overlaid")

p2.5 <- DimPlot(merged_obj, reduction = glue("umap.{int_method}"),
                split.by = "orig.ident", ncol = 5, alpha = 0.5) +
  ggtitle("Sample Distribution Split")

p3 <- DimPlot(merged_obj, reduction = glue("umap.{int_method}"),
              group.by = "pool_id", shuffle = TRUE, alpha = 0.5) +
  ggtitle("Pool Distribution")

p3.5 <- DimPlot(merged_obj, reduction = glue("umap.{int_method}"),
                split.by = "pool_id", shuffle = TRUE, alpha = 0.5) +
  ggtitle("Pool Distribution")

p4 <- FeaturePlot(merged_obj, reduction = glue("umap.{int_method}"),
                  features = c("nCount_RNA", "nFeature_RNA", "percent.mt"),
                  cols = c("#440154", "#FDE725"))

p5 <- DimPlot(merged_obj, reduction = glue("umap.{int_method}"),
              group.by = "scDblFinder.class") +
  ggtitle("Doublet Locations")

p5.5 <- DimPlot(merged_obj, reduction = glue("umap.{int_method}"),
                group.by = "mad3_outlier") +
  ggtitle("MAD3 Outlier Locations")

p6 <- DimPlot(merged_obj, reduction = glue("umap.{int_method}"),
              group.by = "group_id", shuffle = TRUE) +
  ggtitle("Condition Locations")

p6.5 <- DimPlot(merged_obj, reduction = glue("umap.{int_method}"),
                split.by = "group_id", ncol = 5) +
  ggtitle("Condition Locations")

print(p1); print(p2); print(p2.5)

Code

print(p3); print(p4); print(p5)

Code

print(p5.5); print(p6); print(p6.5); print(p3.5)

Code

ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_clusters_{cluster_resolution}.pdf")),       plot = p1,   width = 7,  height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_sample_dist_overlaid.pdf")),                plot = p2,   width = 7,  height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_sample_dist_split.pdf")),                   plot = p2.5, width = 7,  height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_batch_dist.pdf")),                          plot = p3,   width = 7,  height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_batch_dist_split.pdf")),                    plot = p3.5, width = 10, height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_qc.pdf")),                                  plot = p4,   width = 7,  height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_doublet_dist.pdf")),                        plot = p5,   width = 7,  height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_outlier_dist.pdf")),                        plot = p5.5, width = 7,  height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_condition_dist_overlaid.pdf")),              plot = p6,   width = 7,  height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_condition_dist_split.pdf")),                 plot = p6.5, width = 7,  height = 5, dpi = 300, units = "in")

Join layers and normalise

JoinLayers collapses the separate per-pool raw count matrices into a single layer, required for downstream functions. Log-normalisation corrects for differences in sequencing depth, and ScaleData z-scores expression so that highly expressed genes do not dominate dotplots and heatmaps.

Code

DefaultAssay(merged_obj) <- "RNA"
merged_obj <- JoinLayers(merged_obj, assay = "RNA")
merged_obj <- NormalizeData(merged_obj)

Normalizing layer: counts

Code

merged_obj <- ScaleData(merged_obj)

Centering and scaling data matrix

Mancuso sense check

Module scores are calculated for all Mancuso microglial states and non-microglial cell types on the integrated object. Gene sets are re-intersected here against the merged object feature space. Scores are plotted per cluster to guide annotation.

Code

myeloid_seu    <- intersect(gs_myeloid,       rownames(merged_obj))
prolif_seu     <- intersect(gs_proliferating, rownames(merged_obj))
macrophage_seu <- intersect(gs_macrophage,    rownames(merged_obj))
dendritic_seu  <- intersect(gs_dendritic,     rownames(merged_obj))

gene_sets <- list(
  dendritic     = dendritic_seu,
  proliferating = prolif_seu,
  macrophage    = macrophage_seu,
  myeloid       = myeloid_seu,
  CRM           = gs_crm,
  DAM           = gs_dam,
  HLA           = gs_hla,
  HM            = gs_hm,
  IRM           = gs_irm,
  RM            = gs_rm,
  tCRM          = gs_tcrm
)

merged_obj <- AddModuleScore(merged_obj, features = gene_sets, name = names(gene_sets))

Warning: The following features are not present in the object: CCL3L1, IL8,
MIR142, CCL4L1, LINC00936, H3F3B, RN7SL1, SNORD3B-2, AC006129.4, RP11-386I14.4,
MIR24-2, not searching for symbol synonyms

Warning: The following features are not present in the object: MT-RNR2,
RP11-536O18.2, FAM129A, AC004086.1, MT-RNR1, C9orf16, TMEM251, ATP5E, PPAP2A,
not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, GLTSCR2,
ATP5G2, not searching for symbol synonyms

Warning: The following features are not present in the object: C1orf63,
FAM105A, GPR56, C10orf54, TMEM173, FYB, RP11-480C22.1, RP11-343N15.5, KIAA1551,
RP1-249H1.4, RP11-439A17.7, not searching for symbol synonyms

Warning: The following features are not present in the object: DDX58, FAM26F,
C19orf66, WARS, not searching for symbol synonyms

Warning: The following features are not present in the object: GNB2L1, not
searching for symbol synonyms

Warning: The following features are not present in the object: LINC00936,
H3F3B, SNORD3B-1, CTA-29F11-1, SNORD3B-2, MT-RNR2, AC253572-2, AC103591-3,
AC020916-1, RP11-386I14-4, MT-RNR1, HIST2H2AA3, MTCO1P12, AP003481-1,
AC004687-1, SNORD3A, AC243960-1, HIST1H4E, RP1-18D14-7, AC245014-3, AL499604-1,
AL118516-1, AL021155-2, HIST1H4C, GGTA1P, RNU2-63P, not searching for symbol
synonyms

Code

merged_obj$seurat_clusters <- factor(
  merged_obj$rpca_clusters,
  levels = sort(as.numeric(unique(as.character(merged_obj$rpca_clusters))))
)
merged_obj$seurat_clusters <- Idents(merged_obj)

p  <- VlnPlot(merged_obj, features = c("dendritic1", "proliferating2", "macrophage3", "myeloid4"),
              group.by = "seurat_clusters", pt.size = 0, ncol = 2)
p2 <- VlnPlot(merged_obj, features = c("CRM5", "DAM6", "HLA7", "HM8", "IRM9", "RM10", "tCRM11"),
              group.by = "seurat_clusters", pt.size = 0, ncol = 4)
p3 <- VlnPlot(merged_obj, features = c("nFeature_RNA"),
              group.by = "seurat_clusters", pt.size = 0, ncol = 2)

ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_clusters_vs_nonmicroglia_{cluster_resolution}.pdf")), plot = p,  width = 7,  height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_clusters_vs_mancuso_{cluster_resolution}.pdf")),      plot = p2, width = 10, height = 5, dpi = 300, units = "in")
ggsave(filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_clusters_vs_features_{cluster_resolution}.pdf")),     plot = p3, width = 10, height = 5, dpi = 300, units = "in")

print(p); print(p2); print(p3)

Code

table(merged_obj$seurat_clusters)


    0     1     2     3     4     5     6     7     8     9    10    11 
23725 17263 10984 10686  6716  4772  4512  2402  2041  1508  1024   871

6. Cluster markers

Marker genes are identified per cluster using a Wilcoxon rank-sum test. Only positive markers are returned, filtered to a minimum 25% detection rate and log2FC ≥ 0.25. The top 10 markers per cluster by log2FC are printed.

Code

DefaultAssay(merged_obj) <- "RNA"
Idents(merged_obj) <- glue("{int_method}_clusters")
markers <- FindAllMarkers(
  merged_obj,
  test.use      = "wilcox",
  only.pos      = TRUE,
  min.pct       = 0.25,
  logfc.threshold = 0.25
)

Calculating cluster 0

Calculating cluster 1

Calculating cluster 2

Calculating cluster 3

Calculating cluster 4

Calculating cluster 5

Calculating cluster 6

Calculating cluster 7

Calculating cluster 8

Calculating cluster 9

Calculating cluster 10

Calculating cluster 11

Code

write.csv(markers, file.path(objects_dir, glue("wilcox_markers_{cluster_resolution}.csv")))

top10_markers <- markers %>%
  group_by(cluster) %>%
  slice_max(n = 10, order_by = avg_log2FC)

top10_markers

Mitochondrial % per cluster

Code

p <- VlnPlot(merged_obj, features = "percent.mt", pt.size = 0) +
  geom_hline(yintercept = 15, linetype = "dashed", color = "red")

ggsave(
  filename = file.path(graphs_dir, int_method, cluster_resolution, glue("integrated_{int_method}_clusters_vs_mt_{cluster_resolution}.pdf")),
  plot = p, width = 10, height = 5, dpi = 300, units = "in"
)

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_hline()`).

Code

print(p)

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_hline()`).

7. Removal of unwanted clusters

Clusters 9 and 6 were identified as non-microglial based on elevated mitochondrial content and non-microglial module scores. Their cell barcodes are saved here for removal in Part 2.

Code

unwanted_cells <- WhichCells(merged_obj, idents = c("9", "6"))
length(unwanted_cells)

[1] 6020

Code

qs_save(unwanted_cells, file.path(objects_dir, "unwanted_cells_9_6.qs2"))

Code

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Rocky Linux 8.7 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ggrepel_0.9.6               scales_1.4.0               
 [3] presto_1.0.0                data.table_1.18.2.1        
 [5] Rcpp_1.1.1                  qs2_0.1.6                  
 [7] scDblFinder_1.24.10         SingleCellExperiment_1.32.0
 [9] SummarizedExperiment_1.40.0 Biobase_2.70.0             
[11] GenomicRanges_1.62.1        Seqinfo_1.0.0              
[13] IRanges_2.44.0              S4Vectors_0.48.0           
[15] BiocGenerics_0.56.0         generics_0.1.4             
[17] MatrixGenerics_1.22.0       matrixStats_1.5.0          
[19] SoupX_1.6.2                 gridExtra_2.3              
[21] future_1.70.0               readxl_1.4.5               
[23] glue_1.8.0                  lubridate_1.9.4            
[25] forcats_1.0.1               stringr_1.6.0              
[27] purrr_1.2.1                 readr_2.1.6                
[29] tidyr_1.3.2                 tibble_3.3.1               
[31] ggplot2_4.0.2               tidyverse_2.0.0            
[33] Seurat_5.3.1.9001           SeuratObject_5.3.0.9003    
[35] sp_2.2-1                    Matrix_1.7-4               
[37] dplyr_1.2.0                

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.23          splines_4.5.1            
  [3] later_1.4.6               BiocIO_1.20.0            
  [5] bitops_1.0-9              R.oo_1.27.1              
  [7] cellranger_1.1.0          polyclip_1.10-7          
  [9] XML_3.99-0.20             fastDummies_1.7.5        
 [11] lifecycle_1.0.5           edgeR_4.8.1              
 [13] globals_0.19.1            lattice_0.22-7           
 [15] MASS_7.3-65               magrittr_2.0.4           
 [17] limma_3.66.0              plotly_4.12.0            
 [19] rmarkdown_2.30            yaml_2.3.12              
 [21] metapod_1.18.0            httpuv_1.6.16            
 [23] otel_0.2.0                glmGamPoi_1.22.0         
 [25] sctransform_0.4.3         spam_2.11-3              
 [27] spatstat.sparse_3.1-0     reticulate_1.45.0        
 [29] cowplot_1.2.0             pbapply_1.7-4            
 [31] RColorBrewer_1.1-3        abind_1.4-8              
 [33] Rtsne_0.17                R.utils_2.13.0           
 [35] RCurl_1.98-1.17           irlba_2.3.7              
 [37] listenv_0.10.1            spatstat.utils_3.2-1     
 [39] goftest_1.2-3             RSpectra_0.16-2          
 [41] dqrng_0.4.1               spatstat.random_3.4-4    
 [43] fitdistrplus_1.2-6        parallelly_1.46.1        
 [45] DelayedMatrixStats_1.32.0 codetools_0.2-20         
 [47] DelayedArray_0.36.0       scuttle_1.20.0           
 [49] tidyselect_1.2.1          UCSC.utils_1.6.1         
 [51] farver_2.1.2              viridis_0.6.5            
 [53] ScaledMatrix_1.18.0       spatstat.explore_3.7-0   
 [55] GenomicAlignments_1.46.0  jsonlite_2.0.0           
 [57] BiocNeighbors_2.4.0       progressr_0.19.0         
 [59] scater_1.38.0             ggridges_0.5.7           
 [61] survival_3.8-3            systemfonts_1.3.1        
 [63] tools_4.5.1               ragg_1.5.0               
 [65] ica_1.0-3                 SparseArray_1.10.8       
 [67] xfun_0.56                 GenomeInfoDb_1.46.2      
 [69] withr_3.0.2               fastmap_1.2.0            
 [71] bluster_1.20.0            digest_0.6.39            
 [73] rsvd_1.0.5                timechange_0.3.0         
 [75] R6_2.6.1                  mime_0.13                
 [77] textshaping_1.0.3         scattermore_1.2          
 [79] tensor_1.5.1              spatstat.data_3.1-9      
 [81] R.methodsS3_1.8.2         cigarillo_1.0.0          
 [83] rtracklayer_1.70.0        httr_1.4.8               
 [85] htmlwidgets_1.6.4         S4Arrays_1.10.1          
 [87] uwot_0.2.4                pkgconfig_2.0.3          
 [89] gtable_0.3.6              lmtest_0.9-40            
 [91] S7_0.2.1                  XVector_0.50.0           
 [93] htmltools_0.5.9           dotCall64_1.2            
 [95] png_0.1-8                 spatstat.univar_3.1-6    
 [97] scran_1.38.0              knitr_1.51               
 [99] tzdb_0.5.0                reshape2_1.4.5           
[101] rjson_0.2.23              nlme_3.1-168             
[103] curl_7.0.0                zoo_1.8-15               
[105] KernSmooth_2.23-26        vipor_0.4.7              
[107] parallel_4.5.1            miniUI_0.1.2             
[109] ggrastr_1.0.2             restfulr_0.0.16          
[111] pillar_1.11.1             grid_4.5.1               
[113] vctrs_0.7.1               RANN_2.6.2               
[115] promises_1.5.0            stringfish_0.17.0        
[117] BiocSingular_1.26.1       beachmat_2.26.0          
[119] xtable_1.8-4              cluster_2.1.8.1          
[121] beeswarm_0.4.0            evaluate_1.0.5           
[123] locfit_1.5-9.12           cli_3.6.5                
[125] compiler_4.5.1            Rsamtools_2.26.0         
[127] rlang_1.1.7               crayon_1.5.3             
[129] future.apply_1.20.2       labeling_0.4.3           
[131] ggbeeswarm_0.7.3          plyr_1.8.9               
[133] stringi_1.8.7             viridisLite_0.4.3        
[135] deldir_2.0-4              BiocParallel_1.44.0      
[137] Biostrings_2.78.0         lazyeval_0.2.2           
[139] spatstat.geom_3.7-0       RcppHNSW_0.6.0           
[141] hms_1.1.4                 patchwork_1.3.2          
[143] sparseMatrixStats_1.22.0  statmod_1.5.1            
[145] shiny_1.12.1              ROCR_1.0-12              
[147] igraph_2.2.2              RcppParallel_5.1.11-1    
[149] xgboost_3.2.0.1