Introduction to sc-REnF (Darmanis Data)
An Entropy Based Feature Selection- Application on Single cell RNA Sequence Data
Summary
A novel and Robust Entropy based Feature (gene) Selection method for large Single-cell RNA-seq data, using Renyi ´and Tsallis entropy. The SC-REnF is also demonstrated for identifying marker genes from different cell types. Our results shed some light on the single-cell clustering problem with the application of entropy based feature selection, and therefore, it will be an important tool to complement existing methods in the scRNA-seq analysis pipeline. Results are shown here in brief.
How to use sc-REnF
Data Loading and Preprocessing
Load the Libraries
library(SingleCellExperiment)
library('Linnorm')
We will give the method demonstration on Darmanis Dataset. For more details about the data, see A survey of human brain transcriptome diversity at the single cell level
Read the gene expression data using (SingleCellExperiment object), calculate CPM values and extract metadata.
rawdata <- readRDS("darmanis.rds")
data <- assay(rawdata)
For demonstration purposes, we apply a standard Linnorm normalization with minimum read count =5 in 10 percent cell. However any other normalization approach may be used. Gene should be in row, Cells should be in coloumn
preprocessedata= normalized_data(data)
dim(preprocessedata)
[1] 8994 466
preprocessedata[1:2,1:3]
Brain Brain Brain
A2M 0.000000 4.953487 4.908761
AAAS 1.526881 0.000000 0.000000
A total of 466 cells and 8994 genes are remaining in the dataset after cell, gene filtering, and Normalization.
Feature (Gene Selection)
Load the libraries
library(foreach)
library(doParallel)
Apply the feature (gene) selection using Renyi and Tsallis with preprocesse data and cell types. Default— Core Number (p=20), q-values (q=0.7,0.3) , Number of genes to be selected (nf=500)
RenyiFeadata=Renyifeature(data,cell,gene,p,q,nf)
TsallisFeadata=Tsallisfeature(data,cell,gene,p,q,nf)
The Reduced Darmanis data using Renyi entropy
dim(RenyiFeadata)
[1] 466 500
RenyiFeadata[1:2,1:3]
BRD7P3 MEX3A TMSB15A
astrocytes 2.749929 0 0
endothelial 0.000000 0 0
The Reduced Darmanis data using Tsallis entropy
dim(TsallisFeadata)
[1] 466 500
TsallisFeadata[1:2,1:3]
PCP4 DDX5 A2M
astrocytes 0 4.095699 0.000000
endothelial 0 5.816251 4.955909
Saving the results
write.csv(TsallisFeadata, file="Tsallisd.csv")
write.csv(RenyiFeadata, file="Renyid.csv.csv")
Clustering using selected feature
import libraries in python and importing the data
import numpy as np
import pandas as pd
import scanpy as sc
adata1=sc.read_csv('Renyid.csv', delimiter=',', first_column_names=None, dtype='float32')
Using PCA dimensionality reduction and Leiden clustering
sc.tl.pca(adata1, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata1,n_pcs=20,log=True)
#create neighborhood graph using 20 pcs
sc.pp.neighbors(adata1, n_neighbors=15, n_pcs=20)
##dim reduction using umap
sc.tl.umap(adata1)
#Leiden clustering
import leidenalg
sc.tl.leiden(adata1)
##visualizing clusters
sc.pl.umap(adata1, color=['leiden'])
save the clustering results
pd.DataFrame(adata1.obs).to_csv("darmanis_leiden.csv")
Marker selection form identified clusters
Compute a ranking for the highly differential genes in each cluster using Wilcoxon-RankSum test
sc.tl.rank_genes_groups(adata1, 'leiden', method='wilcoxon',key_added = "wilcoxon")
sc.pl.rank_genes_groups(adata1, n_genes=30, sharey=False,key="wilcoxon")
Top 10 DE genes for each cluster using Wilcox-Ranksum Test
#save top 10 DE genes for each cluster using wilcox-ranksum test
result = adata1.uns['wilcoxon']
groups = result['names'].dtype.names
p=pd.DataFrame(
{group + '_' + key[:1]: result[key][group]
for group in groups for key in ['names', 'pvals']}).head(10)
pd.DataFrame(p).to_csv("darmanis_marker.csv")
Visualizing top 5 DE genes for each cluster in a heatmap using wilcox results
sc.pl.rank_genes_groups_heatmap(adata1, n_genes=5, key="wilcoxon", groupby="leiden", show_gene_labels=True)
Visualizing top 5 DE genes for each cluster in a dotplot using t-test results. Here color of dot represents mean expression of the gene in those cell, dot size represents fraction of cells expressing a gene
sc.pl.rank_genes_groups_dotplot(adata1, n_genes=5, key="wilcoxon", groupby="leiden")
Visualizing top 5 DE genes for each cluster in a stacked violin plot using t-test results
sc.pl.rank_genes_groups_stacked_violin(adata1, n_genes=5, key="wilcoxon", groupby="leiden")
Visualizing top 5 DE genes for each cluster in a matrixplot using wilcox results. matrixplot represents mean expression of a gene in a cluster as a heatmap.
sc.pl.rank_genes_groups_matrixplot(adata1, n_genes=5, key="wilcoxon", groupby="leiden")
Showing expression of some marker genes (e.g VIP,DCX) across Leiden groups
sc.pl.violin(adata1, ['VIP'], groupby='leiden')
sc.pl.violin(adata1, ['DCX'], groupby='leiden')