run doubletdetection

Uses doubletdetection to predict if an observation (cell) is likely to be a doublet and remove from the dataset. Doublets arise when multiple cells are mistaken as a single cell in droplet-based technologies. This affects biological signal, for example during PCA doublets may form separate clusters not reflecting biological difference.

Parameters

n_iters: int Number of fit operations from which to collect p-values. Defualt value is 25.

pseudocount: float Pseudocount used in normalize_counts. If 1 is used, and standard_scaling=False, the classifier is much more memory efficient; however, this may result in fewer doublets detected.

boost_rate: float Proportion of cell population size to produce as synthetic doublets.

clustering_algorithm: str Clustering algorithm to use (Louvain, Leiden or phenograph).

voter_thresh: float Fraction of iterations a cell must be called a doublet.

standard_scaling: bool Set to True to enable standard scaling of normalized count matrix prior to PCA. Recommended when not using Phenograph.

n_top_var_genes: int Number of highest variance genes to use; other genes discarded. Will use all genes when zero.

Web view

run_doubletdetection_screenshot

Python equivalent

# Example taken from doubletdetection tutorial for PBMC3K available at: https://doubletdetection.readthedocs.io/en/latest/tutorial.html
!pip install doubletdetection # install doubletdetection
import doubletdetection
import scanpy as sc
import matplotlib.pyplot as plt

adata = sc.read_10x_h5(
    "pbmc_10k_v3_filtered_feature_bc_matrix.h5",
    backup_url="https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_10k_v3/pbmc_10k_v3_filtered_feature_bc_matrix.h5"
)
adata.var_names_make_unique()

sc.pp.filter_genes(adata, min_cells=1)

clf = doubletdetection.BoostClassifier(
    n_iters=10,
    clustering_algorithm="louvain",
    standard_scaling=True,
    pseudocount=0.1,
    n_jobs=-1,
)
doublets = clf.fit(adata.X).predict(p_thresh=1e-16, voter_thresh=0.5)
doublet_score = clf.doublet_score()

adata.obs["doublet"] = doublets
adata.obs["doublet_score"] = doublet_score

sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

sc.pl.umap(adata, color=["doublet", "doublet_score"])
sc.pl.violin(adata, "doublet_score")