Introduction to sc-REnF (CBMC Data)

An Entropy Based Feature Selection- Application on Single cell RNA Sequence Data

Summary

A novel and Robust Entropy based Feature (gene) Selection method for large Single-cell RNA-seq data, using Renyi ´and Tsallis entropy. The SC-REnF is also demonstrated for identifying marker genes from different cell types. Our results shed some light on the single-cell clustering problem with the application of entropy based feature selection, and therefore, it will be an important tool to complement existing methods in the scRNA-seq analysis pipeline. Results are shown here in brief.

How to use sc-REnF

Data Loading and Preprocessing

Load the Libraries

library(SingleCellExperiment)
library('Linnorm')

#normalized_data function is in DataProcessing.R function
annotation <- Biase_data[[2]] #already factor type class
colnames(data) <- as.matrix(annotation)
PreprocessedData = normalized_data(data)

We will give the method demonstration on CBMC data. For more details about the data, see Simultaneous epitope and transcriptome measurement in single cells

For demonstration purposes, we gave here the preprocessed CBMC data and cell types. One can download the other datasets from above link, and preprocess it using given code. For preprocessing, Genes should be in row, Cells should be in coloumn

data= as.matrix(read.csv("cbmcdata.csv",header=FALSE))
cell=as.matrix(read.csv("cbmcannot.csv",header = FALSE))
gene=as.matrix(read.csv("cbmc_gene.csv",header=FALSE))

dim(data) 
[1] 7895    2000

data[1:2,1:3]
                V1            V2           V3
1              10.00000000  4.6159939 -0.08952736
2              -0.04300511 -0.1279653 -0.08952736

A total of 7895 cells and 2000 genes are remaining in the dataset after cell, gene filtering, and Normalization.

Feature (Gene Selection)

Load the libraries

library(foreach)
library(doParallel)
library('scREnF')

Apply the feature (gene) selection using Renyi and Tsallis with preprocesse data and cell types. Default— Core Number (p=40), q-values (q=0.7,0.3) , Number of genes to be selected (nf=500). For Gene selection, Cells should be in row and genes should be in coloumn. Header should be null.

RenyiFeadata=Renyifeature(data,cell,gene,p,q,nf)
TsallisFeadata=Tsallisfeature(data,cell,gene,p,q,nf)

The Reduced CBMC data using Renyi entropy

dim(RenyiFeadata)
[1] 7895  500
RenyiFeadata[1:2,1:3]
            PTN            PRIM1          PLA2G12A      
Eryth "-0.102047271" " 1.738611217" " 2.505624603"
Eryth "-0.102047271" " 1.214342009" " 1.556973317"

The Reduced CBMC data using Tsallis entropy

dim(TsallisFeadata)
[1] 7895  500
TsallisFeadata[1:2,1:3]
            AC079354.1     CST3           NKG7          
Eryth "-0.020573752" "-0.334792659" "-0.396311788"
Eryth "-0.020573752" " 0.172185618" "-0.343284490"

Saving the results

write.csv(TsallisFeadata, file="Tsallisd.csv")
write.csv(RenyiFeadata, file="Renyid.csv.csv")

Clustering using selected feature

import libraries in python and importing the data

import numpy as np
import pandas as pd
import scanpy as sc
adata1=sc.read_csv('Renyid.csv', delimiter=',', first_column_names=None, dtype='float32')

Using PCA dimensionality reduction and Leiden clustering

sc.tl.pca(adata1, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata1,n_pcs=20,log=True)

#create neighborhood graph using 10 pcs 
sc.pp.neighbors(adata1, n_neighbors=15, n_pcs=10)
##dim reduction using umap
sc.tl.umap(adata1)
#Leiden clustering
import leidenalg
sc.tl.leiden(adata1)
##visualizing clusters
sc.pl.umap(adata1, color=['leiden'])

save the clustering results

pd.DataFrame(adata1.obs).to_csv("cbmc_leiden.csv")

Marker selection form identified clusters

Compute a ranking for the highly differential genes in each cluster using Wilcoxon-RankSum test

sc.tl.rank_genes_groups(adata1, 'leiden', method='wilcoxon',key_added = "wilcoxon")
sc.pl.rank_genes_groups(adata1, n_genes=30, sharey=False,key="wilcoxon")

Top 10 DE genes for each cluster using Wilcox-Ranksum Test

#save top 10 DE genes for each cluster using wilcox-ranksum test
result = adata1.uns['wilcoxon']
groups = result['names'].dtype.names
p=pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']}).head(10)
pd.DataFrame(p).to_csv("cbmc_marker.csv")

Visualizing top 3 DE genes for each cluster in a heatmap using wilcox results

sc.pl.rank_genes_groups_heatmap(adata1, n_genes=5, key="wilcoxon", groupby="leiden", show_gene_labels=True)

Visualizing top 3 DE genes for each cluster in a dotplot using t-test results. Here color of dot represents mean expression of the gene in those cell, dot size represents fraction of cells expressing a gene

sc.pl.rank_genes_groups_dotplot(adata1, n_genes=3, key="wilcoxon", groupby="leiden")

Visualizing top 3 DE genes for each cluster in a stacked violin plot using t-test results

sc.pl.rank_genes_groups_stacked_violin(adata1, n_genes=3, key="wilcoxon", groupby="leiden")

Visualizing top 3 DE genes for each cluster in a matrixplot using wilcox results. matrixplot represents mean expression of a gene in a cluster as a heatmap.

sc.pl.rank_genes_groups_matrixplot(adata1, n_genes=3, key="wilcoxon", groupby="leiden")

Showing expression of one marker genes (e.g GZMA) across Leiden groups

sc.pl.violin(adata1, ['GZMA'], groupby='leiden')