单细胞 芯片 转录组测序的数据挖掘文章一比一复现
自己造轮子
(生信技能树优秀学员“细心的网友”)
马拉松授课线上直播课程,最近的一期是5月8号(今晚七点半)开课哦:
生信入门&数据挖掘线上直播课5月班
文章复现:Integration of Single-Cell RNA Sequencing and Bulk RNA Sequencing Data to Establish and Validate a Prognostic Model for Patients With Lung Adenocarcinoma
全文数据分析,没有湿实验。在GEO数据库下载了一份scRNA的数据,两份基因芯片的数据,还下载了TCGA的数据。研究肺腺癌(LUAD),整合scRNA-seq和传统的RNA-seq数据来构建LUAD患者的预后模型,并采用两个外部验证队列来验证其风险分层能力。
文章整体思路如下:
单细胞降维分群注释,找出tumor和normal间关键细胞类型。单细胞功能富集,拟时序分析,细胞通讯分析TCGA数据进行差异分析,GO,KEGG功能富集TCGA数据进行WGCNA分析,找到hub gene后,取hub gene和TCGA DEGs的交集,进行后续分析单因素Cox回归鉴定潜在的预后DEGs,非负矩阵分解进行sample clustering(亚型),不同cluster的生存分析,免疫浸润分析单因素Cox回归筛选出的gene用LASSO模型进一步筛选,筛选完后进行多因素Cox回归,计算每个样本的risk score,根据risk score分成两类,进行生存分析,将构建的模型应用于两个GEO芯片数据,进行同样分析高危低危人群与临床信息相关性分析,分析risk score是否可作为预后指标,基因突变分析ssGSEA分析,免疫检查点与肿瘤突变负荷分析
本次复现只是为了复现图表,数据不一定具有生物学意义(大家可以理解为凑图或者灌水)
Step 1.单细胞数据处理
先下载好数据,一共4个样本就新建4个文件夹,名字命名成样本名,文件夹内带'GSM'开头的数据是下载的原始数据,重新复制一份,把文件名改成标准的read10x函数读入格式,一定注意名字不要写错。
一定要保证这3个文件同时存在,而且在同一个文件夹下面。
示例代码是:
rm(list=ls())options(stringsAsFactors = F)library(Seurat)sce1 <- CreateSeuratObject(Read10X('../10x-results/WT/'), "wt")
重点就是 Read10X 函数读取 文件夹路径,比如:../10x-results/WT/ ,保证文件夹下面有3个文件。
1.1 QC
library(Seurat)library(harmony)
library(dplyr)
library(stringr)
rm(list = ls())
# 多个数据读取与合并
rawdata_path <- './rawdata'
filename <- list.files(rawdata_path)
rawdata_path <- paste(rawdata_path,filename,sep = '/')
sceList <- lapply(rawdata_path, function(x){
obj <- CreateSeuratObject(counts = Read10X(x),
project = str_split(x,'/')[[1]][3])
})
names(sceList) <- filename
这里sceList是一个列表,列表里共4个元素,每个元素都是一个seurat对象,每个元素的名字就是上面的样本名。rawdata_path就是上面下载的.gz文件 所在文件夹的路径,Read10X函数只需要传入文件夹路径就可以自动读入数。
project参数传入的是一个字符串,一般用样本名即可,这里用str_split函数分割路径字符串得到样本名
sce <- merge(sceList[[1]],sceList[-1],add.cell.ids = names(sceList),project = 'luad')
sce@meta.data$group <- str_split(sce@meta.data$orig.ident,'_',simplify = T)[,2]
把4个seurat对象用merge函数合并后,看一下每个细胞的基本信息,这里我根据作者给出的信息,把疾病和正常分成两组,又增加了一列,ade1是疾病,nor是正常
# 查看线粒体(MT开头)、核糖体(RPS/RPL开头)、血红细胞所占比例
grep('^MT',x=rownames(sce@assays$RNA@data),value = T)
grep('^RP[SL]',x=rownames(sce@assays$RNA@data),value = T)
grep('^HB[^(P)]',x=rownames(sce@assays$RNA@data),value = T)
sce <- PercentageFeatureSet(sce,'^MT',col.name = 'percent_MT')
sce <- PercentageFeatureSet(sce,'^RP[SL]',col.name = 'percent_RP')
sce <- PercentageFeatureSet(sce,'^HB[^(P)]',col.name = 'percent_HB')
VlnPlot(sce,features = "nCount_RNA",pt.size = 0,y.max = 10000)
VlnPlot(sce,features = "nFeature_RNA",pt.size = 0,y.max = 2500)
VlnPlot(sce,features = "percent_MT",pt.size = 0)
VlnPlot(sce,features = "percent_RP",pt.size = 0)
VlnPlot(sce,features = "percent_HB",pt.size = 0,y.max = 0.1)
VlnPlot(sce,features = c("nCount_RNA","nFeature_RNA","percent_MT"),pt.size = 0,group.by = 'orig.ident')
对于每个细胞,我又添加了三列信息:线粒体基因占比,核糖体基因占比,红细胞基因占比。这几列用来过滤数据,这里重新给sce赋值是因为,PercentageFeatureSet函数的返回值还是一个seurat对象,和原来的sce相比, 就只是在meta data里多了一列而已,所以直接重新赋值,把原来的seurat对象sce给更新掉
过滤前先看下数据,32538个gene,13060个细胞
# 过滤细胞
sce <- subset(sce,subset = nCount_RNA>1000 & nFeature_RNA>300 & percent_MT<25)
# 过滤基因
sce <- sce[rowSums(sce@assays$RNA@counts>0)>3,]
# 过滤线粒体、核糖体、血红细胞进占比高的细胞
sce <- subset(sce,subset = percent_MT<25 & percent_RP<30 & percent_HB<0.1)
过滤完后22249个gene,7892个细胞
# 细胞周期评分
# S.Score较高的为S期,G2M.Score较高的为G2M期,都比较低的为G1期
s_feature <- cc.genes.updated.2019$s.genes
g2m_feature <- cc.genes.updated.2019$g2m.genes
sce <- CellCycleScoring(sce,
s.features = s_feature,
g2m.features = g2m_feature,
set.ident = T)
VlnPlot(sce,features = c('S.Score','G2M.Score'),group.by = 'orig.ident',pt.size = 0)
saveRDS(sce,'sce_qc.rds')
再看一下细胞周期,把细胞周期两个评分加到meta data里,更新sce
1.2 用harmony去除批次效应
library(harmony)library(dplyr)
library(Seurat)
library(clustree)
rm(list = ls())
sce <- readRDS('sce_qc.rds')
sce <- NormalizeData(sce,
normalization.method = 'LogNormalize',
scale.factor = 10000)
sce <- FindVariableFeatures(sce,
selection.method = "vst",
nfeatures = 2000)
# 默认用variableFeature做scale
sce <- ScaleData(sce)
sce <- RunPCA(sce,features = VariableFeatures(sce))
DimPlot(sce,reduction = 'pca',group.by = 'group')
从PCA降维的结果看,两个组的批次效应不大,均匀的混合在一起,但是这里还是用harmony试试
sce <- RunHarmony(sce,group.by.vars = 'orig.ident')
ElbowPlot(sce,reduction = 'harmony')
sce <- RunUMAP(sce,dims = 1:10,reduction = 'harmony')
sce <- RunTSNE(sce,dims = 1:10,reduction = 'harmony')
DimPlot(sce,reduction = 'umap',label = T,group.by = 'group')
DimPlot(sce,reduction = 'tsne',label = T,group.by = 'group')
选择orig.ident这一列,去除4个样本之间的批次效应。根据肘部图确定降维后的维度。
sce <- FindNeighbors(sce,reduction = 'harmony',dims = 1:10)
sce_res <- FindClusters(sce_res,resolution = 0.8)
# 设置不同的分辨率查看分组情况
sce_res <- sce
for (i in c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5,0.8,1)){
sce_res <- FindClusters(sce_res,resolution = i)
}
clustree(sce_res,prefix = 'RNA_snn_res.')
FindNeighbors函数计算细胞间的距离,FindClusters函数确定分群结果。上面的UMAP看着大概分了7坨,但实际上并不是7个分群,实际分群数量还是要看FindClusters的结果。FindClusters函数分辨率不同,分群数量会不同,具体分多少需要摸索。这里分辨率选了0.5,对应第六行蓝色,13个分群,如果选0.8的话,感觉分群数量会突然增加好多,不太合适。
sce <- FindClusters(sce,resolution = 0.5)
saveRDS(sce,file = 'step2_harmony.rds')
DimPlot(sce,reduction = 'umap',group.by = 'orig.ident')
DimPlot(sce,reduction = 'umap',group.by = 'seurat_clusters')
确定好分辨率后,分群赋值给sce,在meta data里会多出来两列,这两列是一样的,表示每个细胞分在哪个cluster里,最后保存一下更新后的sce对象,再来看下分群结果
1.3 细胞注释
library(SingleR)library(celldex)
library(Seurat)
library(dplyr)
library(stringr)
library(pheatmap)
library(ReactomeGSA)
library(monocle)
library(limma)
library(ggplot2)
library(msigdbr)
library(singleseqgset)
rm(list = ls())
gc()
sce <- readRDS('step2_harmony.rds')
# singleR注释
hpca.se <- HumanPrimaryCellAtlasData()
bpe.se <- BlueprintEncodeData()
anno <- SingleR(sce@assays$RNA@data,
ref = list(BP=bpe.se,HPCA=hpca.se),
labels = list(bpe.se$label.main,hpca.se$label.main),
clusters = sce@meta.data$seurat_clusters
)
用SingleR自动注释,HumanPrimaryCellAtlasData和BlueprintEncodeData是SingleR自带的两个数据库,用两个数据库注释效果好像比一个会好点。得到的anno也是一个对象,每个细胞群的注释信息就藏在里面,13个分群被注释成7种细胞,其中有6个分群都是上皮细胞,也许上皮细胞还能继续细分。想继续细分的话,对sce取子集,把标签为上皮细胞的细胞取出来,降维分群,丢到SingleR()函数里再注释
plotScoreHeatmap(anno,clusters = anno@rownames,show_colnames = T)
再看下注释结果的打分,不满意的话也可以手动注释
sce@meta.data$singleR_label <- unlist(lapply(sce@meta.data$seurat_clusters, function(x){anno$labels[x]}))
DimPlot(sce,reduction = 'umap',group.by = 'singleR_label',label = T)
DimPlot(sce,reduction = 'tsne',group.by = 'singleR_label',label = T)
anno里面有每个分群对应的注释信息,把这个信息添加到sce的meta data里,然后就可以在分群的图上添加上注释信息了
1.4 鉴定关键细胞类型
cell_list <- split(colnames(sce), sce$singleR_label)deg <- c()
for (i in names(cell_list)){
sce_temp <- sce[,colnames(sce) %in% cell_list[[i]]]
sce_temp <- CreateSeuratObject(counts = sce_temp@assays$RNA@counts,
meta.data = sce_temp@meta.data,
min.cells = 3,
min.features = 200)
sce_temp <- NormalizeData(sce_temp)
sce_temp <- FindVariableFeatures(sce_temp)
sce_temp <- ScaleData(sce_temp,vars.to.regress = c('nFeature_RNA','"percent_MT"'))
Idents(sce_temp) <- sce_temp$group
sce_temp_markers <- FindAllMarkers(object=sce_temp,
only.pos = T,
logfc.threshold = 2,
min.pct = 0.1)
sce_temp_markers <- sce_temp_markers[order(sce_temp_markers$cluster,sce_temp_markers$avg_log2FC),]
deg_gene <- sce_temp_markers[sce_temp_markers$p_val_adj<0.05,'gene']
deg <- c(deg,length(deg_gene))
}
df <- data.frame(cell_type = names(cell_list),deg_gene = deg)
把每种类型的细胞分别拿出来,用FindAllMarkers找出ade1和nor组之间的差异基因
1.5 单细胞通路富集
temp <- sceIdents(temp) <- temp@meta.data$singleR_label
reactome_obj <- analyse_sc_clusters(temp,create_reactome_visualization=T,use_interactors=F)
path_ways <- pathways(reactome_obj)
path_ways$diff <- rowMax(as.matrix(path_ways[,2:ncol(path_ways)])) - rowMin(as.matrix(path_ways[,2:ncol(path_ways)]))
path_ways <- path_ways[order(path_ways$diff,decreasing = T),]
plot_gsva_heatmap(reactome_obj,rownames(path_ways)[1:10],margins = c(10,20))
saveRDS(sce,file = 'step3_sce.rds')
用pathway()提取富集出来的结果,每行是一个通路,每列是一种细胞,计算每个通路在7种细胞中表达差异的diff,根据diff排序,画出前10个通路
1.6 拟时序分析
library(Seurat)library(monocle)
sce <- readRDS('step3_sce.rds')
# 1. 将seurat对象转化为monocle的CDS对象
sparse_data <- as(as.matrix(sce@assays$RNA@counts),'sparseMatrix')
mdata <- new('AnnotatedDataFrame',data=sce@meta.data) # 行为cell
fData <- data.frame(gene_short_name=row.names(sparse_data),row.names = row.names(sparse_data))
fd <- new('AnnotatedDataFrame',data=fData) # 行为gene 列为gene description
monocle_cds <- newCellDataSet(cellData = sparse_data,
phenoData = mdata,
featureData = fd,
lowerDetectionLimit = 0.5,
expressionFamily = negbinomial.size())
# 2. 计算size factor和离散度
monocle_cds <- estimateSizeFactors(monocle_cds)
monocle_cds <- estimateDispersions(monocle_cds)
# 3.过滤低质量细胞
monocle_cds <- detectGenes(monocle_cds,min_expr = 0.1)
# 4. 找高变基因、降维分群
# 用seurat找的高变基因
sce_var_gene <- VariableFeatures(sce)
monocle_cds <- setOrderingFilter(monocle_cds,sce_var_gene)
monocle_cds <- reduceDimension(monocle_cds,num_dim=10,norm_method = 'log',reduction_method = 'tSNE')
monocle_cds <- clusterCells(monocle_cds,num_clusters = 10)
plot_cell_clusters(monocle_cds,color_by = 'singleR_label')
diff_test_gene <- differentialGeneTest(monocle_cds[sce_var_gene,],fullModelFormulaStr = '~singleR_label')
diff_gene <- row.names(subset(diff_test_gene,qval<0.01))
monocle_cds <- setOrderingFilter(monocle_cds,diff_gene)
# 5. 用reversed graph embedding降维
monocle_cds <- reduceDimension(monocle_cds,reduction_method = 'DDRTree')
# 6. 计算细胞拟时间
monocle_cds <- orderCells(monocle_cds)
这里运行到第六步,orderCells()函数大概率会报错,详细的解决办法可以参考知乎上的贴子https://zhuanlan.zhihu.com/p/589134519 ,这里说一下大概解决步骤
下载一个包,'igraph’下载一个R文件,order_cells.R,链接在知乎贴子里,然后source调用把orderCells()中用DDRTree计算的那部分代码单独拿出来,写成一个函数,后面直接调用,记得给出返回值。或者不封装成函数的形式也行,那就不需要返回值了# 6. 计算细胞拟时间
source('order_cells.R')
library('igraph')
my_ordercell <- function(cds){
root_state = NULL
num_paths = NULL
reverse = NULL
root_cell <- select_root_cell(cds, root_state, reverse)
cds@auxOrderingData <- new.env(hash = TRUE)
if (cds@dim_reduce_type == "DDRTree") {
if (is.null(num_paths) == FALSE) {
message("Warning: num_paths only valid for method 'ICA' in reduceDimension()")
}
cc_ordering <- extract_ddrtree_ordering(cds, root_cell)
pData(cds)$Pseudotime <- cc_ordering[row.names(pData(cds)), ]$pseudo_time
K_old <- reducedDimK(cds)
old_dp <- cellPairwiseDistances(cds)
old_mst <- minSpanningTree(cds)
old_A <- reducedDimA(cds)
old_W <- reducedDimW(cds)
cds <- project2MST(cds, project_point_to_line_segment)
minSpanningTree(cds) <- cds@auxOrderingData[[cds@dim_reduce_type]]$pr_graph_cell_proj_tree
root_cell_idx <- which(V(old_mst)$name == root_cell, arr.ind = T)
cells_mapped_to_graph_root <- which(cds@auxOrderingData[["DDRTree"]]$pr_graph_cell_proj_closest_vertex == root_cell_idx)
if (length(cells_mapped_to_graph_root) == 0) {
cells_mapped_to_graph_root <- root_cell_idx
}
cells_mapped_to_graph_root <- V(minSpanningTree(cds))[cells_mapped_to_graph_root]$name
tip_leaves <- names(which(degree(minSpanningTree(cds)) == 1))
root_cell <- cells_mapped_to_graph_root[cells_mapped_to_graph_root %in% tip_leaves][1]
if (is.na(root_cell)) {
root_cell <- select_root_cell(cds, root_state, reverse)
}
cds@auxOrderingData[[cds@dim_reduce_type]]$root_cell <- root_cell
cc_ordering_new_pseudotime <- extract_ddrtree_ordering(cds, root_cell)
pData(cds)$Pseudotime <- cc_ordering_new_pseudotime[row.names(pData(cds)), ]$pseudo_time
if (is.null(root_state) == TRUE) {
closest_vertex <- cds@auxOrderingData[["DDRTree"]]$pr_graph_cell_proj_closest_vertex
pData(cds)$State <- cc_ordering[closest_vertex[, 1], ]$cell_state
}
}
cds
}
monocle_cds <- my_ordercell(monocle_cds)
# 可以画图了
plot_cell_trajectory(monocle_cds,color_by = 'Pseudotime')
plot_cell_trajectory(monocle_cds,color_by = 'State')
plot_cell_trajectory(monocle_cds,color_by = 'singleR_label')
plot_cell_trajectory(monocle_cds,color_by = 'singleR_label') facet_wrap(~singleR_label,nrow = 3)
saveRDS(monocle_cds,file = 'monocle.rds')
1.7 细胞通讯
library(CellChat)library(patchwork)
library(Seurat)
library(dplyr)
library(ggalluvial)
library(NMF)
options(stringsAsFactors = FALSE)
rm(list = ls())
gc()
sce <- readRDS('step3_sce.rds')
# Step1.创建cellchat对象
data.input <- sce@assays$RNA@data
meta.data <- sce@meta.data
cellchat <- createCellChat(object=data.input,
meta = meta.data,
group.by='singleR_label')
cellchat <- addMeta(cellchat,meta = meta.data)
# Step2.加载CellChatDB数据库
cellchatDB <- CellChatDB.human
cellchat@DB <- cellchatDB
# Step3.处理表达数据
cellchat <- subsetData(cellchat)
cellchat <- identifyOverExpressedGenes(cellchat)
cellchat <- identifyOverExpressedInteractions(cellchat)
# Step4.计算通讯概率,推断细胞通讯网络 这一步很耗时
cellchat <- computeCommunProb(cellchat,population.size = F)
cellchat <- filterCommunication(cellchat,min.cells = 10)
# Step5. 提取预测的细胞通讯网络为data frame
df.net <- subsetCommunication(cellchat)
df.pathway <- subsetCommunication(cellchat,slot.name = 'netP')
# Step6. 在信号通路水平推断细胞通讯
cellchat <- computeCommunProbPathway(cellchat)
# Step7. 计算加和的cell-cell通讯网络
cellchat <- aggregateNet(cellchat)
par(mfrow = c(1,2))
netVisual_circle(cellchat@net$count,vertex.weight = groupSize,
weight.scale = T,label.edge = F,
title.name = 'number of Interaction')
netVisual_circle(cellchat@net$weight,vertex.weight = groupSize,
weight.scale = T,label.edge = F,
title.name = 'Interaction Weight')
mat <- cellchat@net$weight
par(mfrow = c(3,3),xpd=T)
for (i in 1:nrow(mat)){
mat2 <- matrix(0,nrow = nrow(mat),ncol = ncol(mat),
dimnames = dimnames(mat))
mat2[i,] <- mat[i,]
netVisual_circle(mat2,vertex.weight = groupSize,
weight.scale = T,edge.weight.max = max(mat),
title.name = rownames(mat)[i])
}
# Step8. 使用层次图(Hierarchical plot),圆圈图(Circle plot)或和弦图(Chord diagram)可视化每个信号通路
cellchat@netP$pathways
pathway.show <- c('SEMA4')
vertex.receiver <- c(1,2,3,4)
netVisual_aggregate(cellchat,signaling = pathway.show,
vertex.receiver = vertex.receiver,
layout = 'hierarchy')
par(mfrow=c(1,2),xpd=T)
netVisual_aggregate(cellchat,signaling = pathway.show,
layout = 'circle')
netVisual_aggregate(cellchat,signaling = pathway.show,
layout = 'circle',label.edge=T)
pairLR.CXCL <- extractEnrichedLR(cellchat,signaling = pathway.show,
geneLR.return = F)
LR.show <- pairLR.CXCL[1,]
netVisual_individual(cellchat,signaling = pathway.show,
pairLR.use = 'SEMA4A_PLXNB2',
layout = 'circle')
netVisual_aggregate(cellchat,signaling = pathway.show,layout = 'chord',
title.name = "Chord diagram 1")
Step 2. TCGA差异分析
library(TCGAbiolinks)library(SummarizedExperiment)
library(AnnoProbe)
library(stringr)
library(limma)
library(tinyarray)
library(data.table)
library(ggplot2)
library(dplyr)
library(tibble)
library(pheatmap)
library(clusterProfiler)
library(org.Hs.eg.db)
rm(list = ls())
# 下载LUAD表达矩阵
query <- GDCquery(project = 'TCGA-LUAD',
data.category = 'Transcriptome Profiling',
data.type = 'Gene Expression Quantification',
workflow.type = 'STAR - Counts')
GDCdownload(query)
dat <- GDCprepare(query)
exp <- assay(dat)
save(exp,file='TCGA_LUAD_exp.Rdata')
# 下载LUAD临床信息
query <- GDCquery(project = 'TCGA-LUAD',
data.category = 'Clinical',
data.type = 'Clinical Supplement',
file.type = 'xml')
GDCdownload(query)
dat <- GDCprepare_clinic(query,clinical.info = 'patient')
k = apply(dat, 2, function(x){!all(is.na(x))});table(k)
clinical <- dat[,k]
save(clinical,file = 'TCGA_LUAD_Clinical.Rdata')
注意一下TCGA的样本命名含义:
以这个为例:TCGA-A6-6650-01A-11R-1774-07
A6:Tissue source site,组织来源编码
6650:Participant, 参与者编号
01:Sample, 编号01~09表示肿瘤,10~19表示正常对照
A:Vial, 在一系列患者组织中的顺序,绝大多数样本该位置编码都是A;
很少数的是B,表示福尔马林固定石蜡包埋组织,已被证明用于测序分析的效果不佳,所以不建议使用-01B的样本数据
11:Portion, 同属于一个患者组织的不同部分的顺序编号,同一组织会分割为100-120mg的部分,分别使用
R:RNA, 分析的分子类型,对应关系如下所示:
1774:Plate, 在一系列96孔板中的顺序,值大表示制板越晚
07:Center, 测序或鉴定中心编码
# count 转换成TPM
gene_length <- fread('gene_len.txt')
colnames(gene_length) <- c('id','length')
gene_length <- as.data.frame(gene_length)
gene_length <- gene_length[!duplicated(gene_length$id),]
exp <- as.data.frame(exp)
exp$id <- unlist(lapply(rownames(exp),function(x){str_split(x,'\\.',simplify = T)[,1]}))
merge <- left_join(x=exp,y=gene_length,by='id')
merge <- na.omit(merge)
merge <- merge[!duplicated(merge$id),]
rownames(merge) <- merge$id
merge <- merge[,-601]
# 计算TPM
kb <- merge$length / 1000
rpk <- merge[,-601] / kb
tpm <- t(t(rpk) / colSums(rpk) * 1E6)
exp <- tpm
采用limma做差异分析,先把counts转换成TPM。不转也可以用DEseq2做差异分析。gene_length是每个基因的长度,这个文件是把gtf文件中,基因的终止位置-起始位置得到的。
# 更改行名:ENSEMBL -> SYMBOL
a <- annoGene(str_split(rownames(exp),'\\.',simplify = T)[,1],'ENSEMBL')
a <- a[!duplicated(a$ENSEMBL),]
rownames(a) <- a$ENSEMBL
rownames(exp) <- a[str_split(rownames(exp),'\\.',simplify = T)[,1],'SYMBOL']
# 给样本分组 tumor/normal
group <- ifelse(sapply(str_split(colnames(exp),'-',simplify = T)[,4],function(x){as.integer(substr(x,1,2))})>=10,
'normal','tumor')
sample_group <- data.frame(sample=colnames(exp),group=group)
a是annoGene函数返回的一个ID转换的结果,是一个dataframe,多个ENSEMBL ID可能对应同一个SYMBOL ID,对symbol去重后,再映射成exp的行名。
按照TCGA名字中第四部分,把样本分成tumor和normal
# 过滤
# 1. 去除FFPE样本
sample_group$sample_type <- ifelse(sapply(str_split(colnames(exp),'-',simplify = T)[,4],function(x){substr(x,3,3)})=='A',
'tissue','ffpe')
exp <- exp[,sample_group$sample_type!='ffpe']
sample_group <- sample_group[sample_group$sample_type=='tissue',]
# 2. 保留在一半以上样本中有值的gene
exp <- exp[rowSums(exp>0)>as.integer(ncol(exp)/2),]
dim(exp)
save(exp,file = 'TCGA_LUAD_TPM_exp.Rdata')
FFPE样本无论做测序还是蛋白组,代谢组,效果都不怎么好
# 对表达矩阵取log2
for (i in 1:ncol(exp)){
exp[,i] <- log2(exp[,i] 0.0001)
}
# 差异筛选
design <- model.matrix(~factor(sample_group$group,levels = c('tumor','normal')))
fit <- lmFit(exp,design = design)
fit <- eBayes(fit)
deg <- topTable(fit, coef = 2,number = Inf)
deg$change <- ifelse((deg$logFC>log2(2))&(deg$adj.P.Val<0.05),'up',
ifelse((deg$logFC<log2(0.5))&(deg$adj.P.Val<0.05),'down','nochange'))
deg <- na.omit(deg)
table(deg$change)
做完差异筛选,有10个gene含空值,删去有空值的行即可。更新下exp并保存好
# deg比exp少了几个gene
exp <- exp[intersect(rownames(exp),deg$ID),]
save(exp,file = 'TCGA_exp.Rdata')
save(deg,file = 'TCGA_DEGS.Rdata')
# volcano plot
ggplot(deg,aes(x=deg$logFC,y=-log10(deg$adj.P.Val),color=change))
geom_point()
geom_hline(yintercept = -log10(0.05),linetype=2,linewidth=1)
geom_vline(xintercept = log2(0.5),linetype=2,linewidth=1)
geom_vline(xintercept = log2(2),linetype=2,linewidth=1)
scale_color_manual(values=c('blue','gray','red'))
# heatmap
deg_heatmap <- deg[order(deg$logFC,decreasing = T),]
deg_gene <- deg_heatmap[deg_heatmap$change!='nochange','ID']
heatmap_deg_gene <- as.data.frame(exp)[deg_gene,]
heatmap_deg_gene <- na.omit(heatmap_deg_gene)
anno_col <- data.frame(group=sample_group$group)
rownames(anno_col) <- sample_group$sample
pheatmap(heatmap_deg_gene,show_rownames = F,cluster_rows = F,scale = 'row',
show_colnames = F,annotation_col=anno_col ,
breaks = seq(-2.5,2.5,length.out=100))
dim(heatmap_deg_gene)
# pca plot
exp <- as.data.frame(exp)
pca <- prcomp(t(exp))
ggplot(data.frame(pca$x),aes(x=PC1,y=PC2,color=sample_group$group))
geom_point(size = 3) stat_ellipse(level = 0.95, show.legend = F)
labs(title = 'PCA Analysis')
theme(plot.title = element_text(hjust = 0.5))
guides(color = guide_legend(title = 'Group'))
# GO enrich
go_bp <- enrichGO(deg_gene,org.Hs.eg.db,'SYMBOL','BP')
go_mf <- enrichGO(deg_gene,org.Hs.eg.db,'SYMBOL','MF')
go_cc <- enrichGO(deg_gene,org.Hs.eg.db,'SYMBOL','CC')
dotplot(go_bp)
barplot(go_bp,showCategory = 4)
# KEGG enrich
entrez_id <- bitr(deg_gene,fromType = 'SYMBOL',toType = 'ENTREZID',OrgDb = org.Hs.eg.db)
kegg <- enrichKEGG(entrez_id$ENTREZID,organism = 'hsa',keyType = 'ncbi-geneid')
dotplot(kegg)
barplot(kegg,showCategory = 4)
Step 3. WGCNA分析
library(WGCNA)library(reshape2)
library(stringr)
library(dplyr)
library(patchwork)
library(ggplot2)
rm(list = ls())
gc()
load('TCGA_DEGS.Rdata')
load('TCGA_exp.Rdata')
load('TCGA_LUAD_Clinical.Rdata')
options(stringsAsFactors = F) # 开启多线程
enableWGCNAThreads()
# 1. 数据处理
# >>> 选取表达量前10000的gene,用全部gene的话,计算后面的邻接矩阵和TOM矩阵太慢了
temp <- as.data.frame(exp)
temp$avg_exp <- rowMeans(temp)
temp <- temp[order(temp$avg_exp,decreasing = T),]
top10000_gene <- rownames(temp[1:10000,])
exp <- as.data.frame(exp[top10000_gene,])
# >>> 查找并删除异常值
temp <- exp
colnames(temp) <- 1:ncol(exp)
plot(hclust(dist(t(temp)))) # 没有异常样本
# 2. 构建基因共表达网络
exp <- as.data.frame(t(exp))
# >>> 2.1 选择合适的power
power_vec <- c(seq(1, 10, by = 1), seq(12, 20, by = 2))
sft <- pickSoftThreshold(exp,powerVector = power_vec)
a <- ggplot(sft$fitIndices,aes(x=Power,y=-sign(slope)*SFT.R.sq))
geom_text(label=sft$fitIndices$Power,color='red')
geom_hline(yintercept = 0.9,color='red')
xlab("Soft Threshold (power)") ylab("Scale Free Topology Model Fit,signed R^2")
ggtitle("Scale independence") theme(plot.title = element_text(hjust = 0.5))
b <- ggplot(sft$fitIndices,aes(x=Power,y=mean.k.))
geom_text(label=sft$fitIndices$Power,color='red')
xlab("Soft Threshold (power)") ylab("Mean Connectivity")
ggtitle('Mean connectivity') theme(plot.title = element_text(hjust = 0.5))
a b
这里power=12,达到0.8即可
power <- sft$powerEstimate
# >>> 2.2 检验所选power是否符合无尺度网络
k <- softConnectivity(exp,power = power)
scaleFreePlot(k)
power=12确实也符合无尺度网络,R2较高
# >>> 2.3 计算gene在sample间的corr或者distance,构建临近矩阵。默认用dist函数计算(欧氏距离)
adj <- adjacency(exp,power = power)
# >>> 2.4 根据临近矩阵计算TOM(拓扑重叠矩阵)
TOM <- TOMsimilarity(adj)
# >>> 2.5 计算gene间相异性
dissTOM <- 1-TOM
# 3. 对gene进行聚类并可视化
geneTree <- hclust(as.dist(dissTOM),method = 'average')
sizeGrWindow(12,9)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04)
# >>> 3.2 将gene聚类树进行修剪后,把gene归到不同的模块里
dynamic_modules <- cutreeDynamic(dendro = geneTree,distM = dissTOM,minClusterSize = 50,
deepSplit = 2,pamRespectsDendro=F,minSplitHeight = NULL)
table(dynamic_modules)
module_color <- labels2colors(dynamic_modules)
plotDendroAndColors(geneTree,module_color,dendroLabels = F,hang=0.03,addGuide = T,
guideHang = 0.05,groupLabels='Dynamic color tree',
main='gene dendrogram and module colors')
# >>> 3.3 计算每个模块的特征基因向量(对每个模块的gene表达量降维后,只保留PC1,即MEs)
MEList <- moduleEigengenes(exp,colors = dynamic_modules)
MEs <- MEList$eigengenes
# >>> 3.4 计算特征基因的相异度,基于相异度对原有模块进行层次聚类
dissME <- 1-cor(MEs)
MEtree <- hclust(as.dist(dissME),method = 'average')
plotEigengeneNetworks(MEs,
"Eigengene adjacency heatmap",
marHeatmap = c(3,4,2,2),
plotDendrograms = FALSE,
xLabelsAngle = 90)
# >>> 3.5 如果模块太多,可以对模块进行合并.设置剪切高度,剪切高度以下的模块进行合并
cut_height <- 0.2
sizeGrWindow(8,6)
plot(MEtree,main = "Clustering of module eigengenes",
xlab = "", sub = "")
abline(h=cut_height,col='red')
把高度0.2以下的模块进行合并,再看看合并前后模块的变化
module_merge <- mergeCloseModules(exp,module_color,cutHeight = cut_height)
merged_color <- module_merge$colors
merged_MEs <- module_merge$newMEs
sizeGrWindow(8,6)
plotDendroAndColors(geneTree,cbind(module_color,merged_color),dendroLabels = F,hang=0.03,
addGuide = T,groupLabels = c("Dynamic Tree Cut", "Merged dynamic"))
# 4. 计算每个模块与临床信息的相关性、pvalue。找出与临床信息相关性高的模块
nGenes <- ncol(exp)
nSamples <- nrow(exp)
################################准备临床信息####################################
clinical <- clinical[,c('bcr_patient_barcode','gender','residual_tumor')]
traits <- data.frame(row.names = rownames(exp),ID=rownames(exp))
traits$ID <- substr(traits$ID,1,12)
index <- match(traits$ID,clinical$bcr_patient_barcode)
traits$gender <- as.numeric(as.factor(clinical[index,'gender']))
traits$residual_tumor <- as.numeric(as.factor(clinical[index,'residual_tumor']))
traits$type <- ifelse(sapply(str_split(rownames(traits),'-',simplify = T)[,4],
function(x){as.integer(substr(x,1,2))})>=10,'normal','tumor')
traits$type <- as.numeric(as.factor(traits$type))
traits <- traits[,-1]
################################准备临床信息####################################
clinical是前面下载的TCGA临床数据,这里随便取了两个比较关心的临床信息。clinical的每一行是一个病人的ID,而表达矩阵每一行是一个样本,一个病人身上可能有多个样本。在把临床信息和表达矩阵一一对应的时候,如果直接把exp的行名取前12位,就会产生重复值。在做的时候,先准备好样本和病人ID的对应关系,再把clinical里病人的信息给匹配过来
最终的临床信息是这样,要把字符串变成数值型变量
module_trait_cor <- cor(merged_MEs,traits,use = 'p')
module_trait_pvalue <- corPvalueStudent(module_trait_cor,nSamples)
trait_colors <- numbers2colors(traits,signed = T,centered = T)
a <- paste(signif(module_trait_cor, 2), "\n(", signif(module_trait_pvalue, 1), ")", sep = "")
sizeGrWindow(8,6)
labeledHeatmap(module_trait_cor,xLabels = colnames(traits),yLabels = colnames(merged_MEs),colorLabels = FALSE,
colors = blueWhiteRed(50),textMatrix = a,
setStdMargins = FALSE,cex.text = 0.65,zlim = c(-1,1),
main = paste("Module-trait relationships"))
模块与tumor/normal之间的相关性并不是很高。。。。。就拿粉色模块当做感兴趣的模块吧
# 5. 找出感兴趣的模块内的枢纽gene
# >>> 5.1 计算gene表达量与临床性状的相关性
gs <- apply(exp,2,function(x){cor(x,traits$type)})
# >>> 5.2 基于网络计算gene与临床性状的相关性
ns <- networkScreening(y=traits$type,datME = merged_MEs,datExpr = exp)
# >>> 5.3 计算每个gene与特征gene的相关性
kme <- signedKME(datExpr = exp,datME = merged_MEs)
# >>> 5.4 根据gs, ns, kme 找出hub gene
hubgene_filter <- (gs>0.2 & ns$q.Weighted<0.01 & abs(kme$kMEmagenta)>0.4)
hub_gene <- colnames(exp)[hubgene_filter]
plotNetworkHeatmap(exp,
plotGenes = hub_gene,
networkType = "unsigned",
useTOM = TRUE,
power=8,
main="unsigned correlations")
看下kme,其中粉色模块里,gene的相关性并不是很高,这里筛选了>0.4的,最后找到25个hub gene。如果阈值再设高一点,hub gene就全卡掉了。。。这跟没有用所有的3w多个gene去计算有关,也跟一些参数有关,我没有一个个去调整优化,只是展示一下如何做图和分析思路
hub_gene <- intersect(hub_gene,deg$ID)
save(hub_gene,file = 'TCGA_WGCNA_hub_gene.Rdata')
最后找出hub gene和前面deg的交集部分,作为hub gene,存档
Step 4. 非负矩阵分解算法
4.1 NMF算法对样本分类
library(NMF)library(survival)
library(stringr)
library(data.table)
library(survminer)
library(dplyr)
library(AnnoProbe)
library(CIBERSORT)
library(tidyr)
library(tibble)
library(RColorBrewer)
library(ggalluvial)
library(ggplot2)
options(stringsAsFactors = F)
rm(list = ls())
gc()
load('TCGA_LUAD_TPM_exp.Rdata')
load('TCGA_DEGS.Rdata')
load('TCGA_WGCNA_hub_gene.Rdata')
dim(exp)
# 对表达矩阵取log2
exp <- log2(exp 1)
ranks <- seq(2,10)
hubgene_exp <- exp[hub_gene,]
estimate <- nmf(hubgene_exp,2:10,nrun=50,method = 'brunet')
plot(estimate)
注意hubgene_exp里不能有负数,不然会报错,rank为2时,波动最大
rank <- 2
nmf.rank4 <- nmf(hubgene_exp,rank,nrun=50,method = 'brunet')
index <- extractFeatures(nmf.rank4,'max')
index <- unlist(index)
nmf.exp <- hubgene_exp[index,]
nmf.exp <- na.omit(nmf.exp)
group <- predict(nmf.rank4)
jco <- c("#2874C5","#EABF00","#C6524A","#868686")
consensusmap(nmf.rank4,labRow=NA,labCol=NA,
annCol = data.frame("cluster"=group[colnames(nmf.exp)]),
annColors = list(cluster=c("1"=jco[1],"2"=jco[2],"3"=jco[3],"4"=jco[4])))
cluster_df <- data.frame(row.names = colnames(exp),ID=substr(colnames(exp),1,12),cluster=group)
NMF算法把所有样本分成两类,顺便看下最后的clueter_df
4.2 生存分析
load('TCGA_LUAD_TPM_exp.Rdata')load('TCGA_LUAD_Clinical.Rdata')
# 1. 取log(TPM)
for (i in 1:ncol(exp)){
exp[,i] <- log2(exp[,i] 0.0001)
}
# 2. 以病人为中心,去重
exp <- as.data.frame(exp)
exp <- exp[,sort(colnames(exp))]
index <- !duplicated(substr(colnames(exp),1,12))
exp <- exp[,index]
# 3.读取从xena下载的生存数据
sur <- fread('TCGA-LUAD.survival.tsv')
sur <- as.data.frame(sur)
patient_id <- substr(colnames(exp),1,12)
clin <- data.frame(row.names = patient_id,ID=patient_id)
index <- match(clin$ID,sur[,'_PATIENT'])
clin <- cbind(clin,sur[index,c('OS','OS.time')])
clin <- na.omit(clin)
index2 <- match(clin$ID,clinical$bcr_patient_barcode)
clin <- cbind(clin,clinical[index2,'gender'])
colnames(clin) <- c(colnames(clin)[1:ncol(clin)-1],'gender')
clin$gender <- as.numeric(as.factor(clin$gender))
下载下来的生存数据长这样
之前从TCGA下载的临床数据长这样
合并后的clin是这样,所有变量类型都要转成数值型
# 合并NMF结果
cluster_type_group <- ifelse(sapply(str_split(rownames(cluster_df),'-',simplify = T)[,4],
function(x){as.integer(substr(x,1,2))}
)>=10,
'normal','tumor')
cluster_df <- cluster_df[which(cluster_type_group=='tumor'),]
clin <- left_join(clin,cluster_df,'ID',multiple='first')
再把NMF分类的结果添加进来,就是cluster那一列
# 7. 过滤生存数据,不要<30天的,再把天数转换成月
clin <- clin[clin$OS.time>30,]
clin$OS.time <- clin$OS.time / 30
# 8. exp的列名和临床信息的行名一一对应。NMF将样本分成若干个亚型,画出亚型的生存曲线
colnames(exp) <- substr(colnames(exp),1,12)
index <- match(clin$ID,colnames(exp))
exp <- exp[,index]
sfit <- survfit(Surv(OS.time,OS)~cluster,data=clin)
ggsurvplot(sfit,pval = T)
save(clin,file = 'clin.Rdata')
4.3 免疫浸润
my_pallet <- colorRampPalette(brewer.pal(8,'Set1'))lm22f = system.file("extdata", "LM22.txt", package = "CIBERSORT")
cluster1 <- exp[,rownames(cluster_df[cluster_df$cluster==1,])]
cluster2 <- exp[,rownames(cluster_df[cluster_df$cluster==2,])]
TME.result1 <- cibersort(lm22f,as.matrix(cluster1),perm = 100,QN=T) # 耗时很久
TME.result2 <- cibersort(lm22f,as.matrix(cluster2),perm = 100,QN=T)
原文中免疫浸润用MCPcounter做的,这里我用CIBERSORT做,运行完的结果就是各种免疫细胞在每个样本中的含量
result1 <- TME.result1[,-c(23:25)] %>% as.data.frame() %>%
rownames_to_column('Sample') %>% gather(key = cell_type,value = proportion,-Sample)
result2 <- TME.result2[,-c(23:25)] %>% as.data.frame() %>%
rownames_to_column('Sample') %>% gather(key = cell_type,value = proportion,-Sample)
result1$cluster <- rep('cluster1', times=nrow(result1))
result2$cluster <- rep('cluster2', times=nrow(result2))
dat <- rbind(result1,result2)
ggplot(dat,aes(axis1 = cluster, axis2 = cell_type))
geom_alluvium(aes(fill = as.factor(cell_type)),width = 2/5, discern = FALSE)
geom_stratum(width = 2/5, discern = FALSE)
geom_text(stat = "stratum", discern = FALSE,aes(label = after_stat(stratum)))
theme(legend.position = "none",#去除刻度线和背景颜色
panel.background = element_blank(),
axis.ticks = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(size =15,colour = "black"),#坐标轴名
axis.title = element_blank())
scale_x_discrete(position = "top")
把上面的透视表转换成长表,添加上cluster的信息,再把cluster1和cluster2对应的长表合并起来,就可以画图了
cell <- 'Plasma cells'
ggviolin(dat[dat$cell_type==cell,],x='cluster',y='proportion',fill = 'cluster',
add = 'boxplot',add.params = list(fill = 'white',width=0.1,linetype=1),title = cell)
stat_compare_means(method = 't.test',label = 'p.format',aes(label = 'p.format'),
label.x.npc = 'center',label.y = 0.5,size=5)
4.4 批量单因素Cox回归
load('TCGA_WGCNA_hub_gene.Rdata')load('TCGA_LUAD_TPM_exp.Rdata')
load('clin.Rdata')
hubgene_exp <- exp[hub_gene,]
hubgene_exp <- as.data.frame(t(hubgene_exp))
hubgene_exp$ID <- substr(rownames(hubgene_exp),1,12)
clin <- left_join(clin,hubgene_exp,by='ID')
把WGCNA中找到的hub gene挑出来,拿到他们在所有样本中的表达值。把表达值和生存信息合并到一起,最后的clin长这样。在合并之前要注意,删掉normal样本,我这里没有做。
res.cox <- coxph(Surv(OS.time,OS)~SAPCD2,data=clin)
sumres.cox <- summary(res.cox)
挑一个基因SAPCD2做单因素cox回归,最后需要的信息都藏在sumres.cox里
# 批量单因素cox回归
gene <- c()
p_value <- c()
HR <- c()
lower95 <- c()
upper95 <- c()
for (i in 6:ncol(clin)){
res <- coxph(Surv(OS.time,OS)~clin[,i],data=clin)
sum_res <- summary(res)
p <- sum_res$coefficients[,'Pr(>|z|)']
if (p<0.05){
gene <- c(gene,colnames(clin)[i])
p_value <- c(p_value,p)
HR <- c(HR,sum_res$conf.int[,'exp(coef)'])
lower95 <- c(lower95,sum_res$conf.int[,'lower .95'])
upper95 <- c(upper95,sum_res$conf.int[,'upper .95'])
}
}
cox_df <- data.frame(row.names = gene,pvalue=p_value,HR=HR,lower.95=lower95,upper.95=upper95)
lasso_input_gene <- rownames(cox_df[cox_df$pvalue<0.01,])
save(lasso_input_gene,file = 'Lasso_input_gene.Rdata')
批量单因素cox回归得到每个gene的pvalue,挑出pvalue<0.01的再进行后续筛选
Step 5. 预后模型构建和验证
load('Lasso_input_gene.Rdata')load('TCGA_LUAD_TPM_exp.Rdata')
load('clin.Rdata')
# 对表达矩阵取log2
exp <- log2(exp 0.00001)
exp <- as.data.frame(exp)
# Lasso回归进一步筛选gene
exp2 <- exp[lasso_input_gene,]
exp2 <- as.data.frame(t(exp2))
exp2$ID <- substr(rownames(exp2),1,12)
exp2 <- exp2[!duplicated(exp2$ID),]
clin <- clin[,1:3]
exp3 <- left_join(clin,exp2,by='ID')
rownames(exp3) <- exp3$ID
exp3 <- exp3[,-1]
exp3 <- na.omit(exp3)
exp3是生存数据和上一步批量单因素Cox回归筛选出来的基因的表达量合并起来的结果
x <- as.matrix(exp3[,3:(ncol(exp3)-1)])
y <- Surv(exp3$OS.time,exp3$OS)
fit1 <- glmnet(x,y,alpha = 1,family = 'cox')
plot(fit1)
cv_fit <- cv.glmnet(x,y,alpha = 1,nfolds = 10,family="cox")
plot(cv_fit)
best_lambda <- cv_fit$lambda.min
fit2 <- glmnet(x,y,lambda = best_lambda,alpha = 1,family = 'cox')
lasso_filter_gene <- names(fit2$beta[fit2$beta[,1]!=0,1])
由于之前筛选出来的gene效果不怎么好, 经过lasso过滤后,只剩2个gene了,就用这两个来做多因素cox回归
# 多因素cox回归构建预后模型
exp4 <- exp3 %>% dplyr::select(OS,OS.time,lasso_filter_gene)
multicox <- coxph(Surv(OS.time,OS)~.,data = exp4)
sum_multicox <- summary(multicox)
riskScore <- predict(multicox,type = 'risk',newdata = exp4)
riskScore <- as.data.frame(riskScore)
riskScore$ID <- rownames(riskScore)
riskScore$risk <- ifelse(riskScore$riskScore>median(riskScore$riskScore),'high','low')
做完多因素回归后,计算每个样本的riskscore,用riskScore把样本分成两类
# KM plot
km_data <- left_join(riskScore,clin,by='ID')
fit <- survfit(Surv(OS.time,OS)~risk,data=km_data)
ggsurvplot(fit,data = km_data,pval = T,risk.table = T,surv.median.line = 'hv',
legend.title = 'RiskScore',title = 'Overall survival',
ylab='Cummulative survival',xlab='Time(Days)')
# 绘制ROC曲线
roc_data <- data.frame(x = roc$FP[,1],
y = roc$TP[,1],
time = roc$times[1])
for (i in 2:length(roc$times)) {
temp <- data.frame(x = roc$FP[,i],
y = roc$TP[,i],
time = roc$times[i])
roc_data <- rbind(roc_data, temp)
}
ggplot(roc_data, aes(x = x, y = y, color = as.factor(time)))
geom_line()
scale_color_manual(values = c("gray", "blue", "black"))
geom_abline(slope = 1, intercept = 0, linetype = "dashed")
xlab("1 - Specificity")
ylab("Sensitivity")
ggtitle("ROC Curve")
theme_minimal()
# 绘制风险图
ggrisk(multicox)
# risk score与临床病理特征的相关性
load('TCGA_LUAD_Clinical.Rdata')
clinical_df <- clinical[,c('bcr_patient_barcode','gender','stage_event_pathologic_stage')]
clinical_df <- merge(km_data,clinical_df,by.x = 'ID',by.y = 'bcr_patient_barcode')
clinical_df <- clinical_df[!duplicated(clinical_df$ID),]
clinical_df$stage_event_pathologic_stage <- case_when(clinical_df$stage_event_pathologic_stage=='Stage IA'~'Stage I',
clinical_df$stage_event_pathologic_stage=='Stage IB'~'Stage I',
clinical_df$stage_event_pathologic_stage=='Stage IIA'~'Stage II',
clinical_df$stage_event_pathologic_stage=='Stage IIB'~'Stage II',
clinical_df$stage_event_pathologic_stage=='Stage IIIA'~'Stage III',
clinical_df$stage_event_pathologic_stage=='Stage IIIB'~'Stage III',
)
ggviolin(clinical_df,x='gender',y='riskScore',fill = 'gender',
add = 'boxplot',add.params = list(fill = 'white',width=0.1,linetype=1),)
stat_compare_means(method = 'wilcox.test',label = 'p.format',aes(label = 'p.format'),
label.x.npc = 'center',label.y = 20,size=5)
ggviolin(clinical_df,x='stage_event_pathologic_stage',y='riskScore',fill = 'stage_event_pathologic_stage',
add = 'boxplot',add.params = list(fill = 'white',width=0.1,linetype=1),)
stat_compare_means(method = 'wilcox.test',label = 'p.signif',aes(label = 'p.format'),
label.x = 1,label.y = 20,size=5)
Step 6. riskScore和临床信息的相关性
# 森林图# 批量单因素回归
df_forest <- clinical_df
df_forest$gender <- as.numeric(as.factor(df_forest$gender))
df_forest$stage_event_pathologic_stage <- as.numeric(as.factor(df_forest$stage_event_pathologic_stage))
gene <- c()
p_value <- c()
HR <- c()
lower95 <- c()
upper95 <- c()
means <- c()
for (i in c('riskScore','gender','stage_event_pathologic_stage')){
res <- coxph(Surv(OS.time,OS)~df_forest[,i],data=df_forest)
sum_res <- summary(res)
p <- sum_res$coefficients[,'Pr(>|z|)']
gene <- c(gene,i)
p_value <- c(p_value,p)
HR <- c(HR,sum_res$conf.int[,'exp(coef)'])
lower95 <- c(lower95,sum_res$conf.int[,'lower .95'])
upper95 <- c(upper95,sum_res$conf.int[,'upper .95'])
means <- c(means,res$means)
}
cox_df <- data.frame(row.names = gene,pvalue=p_value,HR=HR,lower.95=lower95,upper.95=upper95,means=means)
cox_df$pvalue <- round(cox_df$pvalue,3)
cox_df$HR <- round(cox_df$HR,3)
cox_df <- rownames_to_column(cox_df,var = 'Variable')
cox_df <- rbind(colnames(cox_df),cox_df)
forestplot(labeltext=as.matrix(cox_df)[,1:3],lower=as.numeric(cox_df[,4]),
upper=as.numeric(cox_df[,5]),mean=as.numeric(cox_df[,6]),
is.summary=c(T,F,F,F),
zero=0,lineheight=unit(0.5,'cm'),graphwidth=unit(35,'mm'),
colgap=unit(2,'mm'),lwd.zero=2,lwd.ci=3,
col=fpColors(box = '#7AC5CD',lines = 'black',zero = 'purple'),
lwd.xaxis=2,lty.ci='solid',ci.vertices.height=0.05,graph.pos=2,
xlim=c(0,2.5),box.size=0.5,
hrzl_lines=list('2'=gpar(lwd=3,columns=1:4,col='dark green'),
'5'=gpar(lwd=3,columns=1:4,col='dark green'))
)
# 多因素回归
res <- coxph(Surv(OS.time,OS)~riskScore gender stage_event_pathologic_stage,data=df_forest)
sum_res <- summary(res)
df <- data.frame(row.names = rownames(sum_res$conf.int),pvalue=sum_res$coefficients[,'Pr(>|z|)'],
HR=sum_res$conf.int[,'exp(coef)'],lower=sum_res$conf.int[,'lower .95'],
upper=sum_res$conf.int[,'upper .95'],means=res$means)
df$pvalue <- round(df$pvalue,3)
df$HR <- round(df$HR,3)
df <- rownames_to_column(df,var='Variable')
df <- rbind(colnames(df),df)
forestplot(labeltext=as.matrix(df)[,1:3],lower=as.numeric(df[,4]),
upper=as.numeric(df[,5]),mean=as.numeric(df[,6]),
is.summary=c(T,F,F,F),
zero=0,lineheight=unit(0.5,'cm'),graphwidth=unit(35,'mm'),
colgap=unit(2,'mm'),lwd.zero=2,lwd.ci=3,
col=fpColors(box = '#7AC5CD',lines = 'black',zero = 'purple'),
lwd.xaxis=2,lty.ci='solid',ci.vertices.height=0.05,graph.pos=2,
xlim=c(0,2.5),box.size=0.5,
hrzl_lines=list('2'=gpar(lwd=3,columns=1:4,col='dark green'),
'5'=gpar(lwd=3,columns=1:4,col='dark green'))
)
# GSEA
df_exp <- as.data.frame(t(exp))
df_exp <- rownames_to_column(df_exp,var = 'sample_ID')
df_exp$sample_ID <- substr(df_exp$sample_ID,1,12)
df_exp <- df_exp[!duplicated(df_exp$sample_ID),]
rownames(df_exp) <- df_exp$sample_ID
df_exp <- as.data.frame(t(df_exp[,-1]))
df_exp <- df_exp[,km_data$ID]
group <- as.factor(km_data$risk)
design <- model.matrix(~km_data$risk)
fit <- lmFit(df_exp,design)
fit <- eBayes(fit)
df_deg <- topTable(fit,coef = 2,number = Inf)
df_deg$change <- ifelse(df_deg$logFC>log2(2) & df_deg$adj.P.Val<0.05,'up',
ifelse(df_deg$logFC<log2(0.5) & df_deg$adj.P.Val<0.05,'down','nochange'))
df_deg <- df_deg[order(df_deg$logFC,decreasing = T),]
symbol_2_entrez <- bitr(rownames(df_deg),fromType = 'SYMBOL',toType = 'ENTREZID',OrgDb = org.Hs.eg.db)
symbol_2_entrez <- symbol_2_entrez[!duplicated(symbol_2_entrez$SYMBOL),]
rownames(symbol_2_entrez) <- symbol_2_entrez$SYMBOL
df_deg$entrezid <- symbol_2_entrez[rownames(df_deg),'ENTREZID']
df_deg <- df_deg[!duplicated(df_deg$entrezid),]
df_deg <- na.omit(df_deg)
a <- df_deg[(df_deg$change=='up'),'logFC']
names(a) <- df_deg[(df_deg$change=='up'),'entrezid']
res_a <- gseKEGG(a,'hsa')
b <- df_deg[(df_deg$change=='down'),'logFC']
names(b) <- df_deg[(df_deg$change=='down'),'entrezid']
res_b <- gseKEGG(b,'hsa')
gseaplot2(res_a,geneSetID = 1:5)
gseaplot2(res_b,geneSetID = 1:5)
先做差异分析,找到high risk vs low risk的差异基因,上调和下调的gene分别拿去做GSEA再画图
# waterfall plot
query <- GDCquery(project = 'TCGA-LUAD',
data.category = "Simple Nucleotide Variation",
data.type = "Masked Somatic Mutation",
access = 'open')
GDCdownload(query)
GDCprepare(query,save = T,save.filename = 'TCGA-LUAD-SNP.Rdata')
load('TCGA-LUAD-SNP.Rdata')
luad_snp <- read.maf(maf=data)
plotmafSummary(luad_snp,rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)
oncoplot(luad_snp,top = 15)
Step 7. 高低风险组间的免疫浸润
library(GSVA)library(GSEABase)
library(ggplot2)
library(tidyr)
rm(list = ls())
#1. 获取geneSets
# 下载28种免疫细胞的参考基因集 </TISIDB/data/download/CellReports.txt>
geneSet <- read.csv("cellReport.txt",header = F,sep = "\t",)
geneSet[1:5,1:5]
geneSet <- t(tibble::column_to_rownames(geneSet,var = 'V1'))
gene_list <- list()
for (i in colnames(geneSet)){
gene_list[[i]] <- geneSet[,i]
gene_list[[i]] <-gene_list[[i]][gene_list[[i]]!='']
}
直接读取的geneSet长这样,有空值,需要转变成gene_list,list长度为28,其中每个元素是某个细胞所有基因的向量,名字是对应的细胞
# 载入表达矩阵,把表达矩阵分成high risk 和low risk组
load('TCGA_LUAD_TPM_exp.Rdata')
load('risk.Rdata')
# 对表达矩阵取log2
exp <- log2(exp 0.00001)
exp <- as.data.frame(exp)
colnames(exp) <- substr(colnames(exp),1,12)
exp <- exp[,!duplicated(colnames(exp))]
high_risk_sample <- km_data[km_data$risk=='high','ID']
low_risk_sample <- km_data[km_data$risk=='low','ID']
high_risk_exp <- exp[,high_risk_sample]
low_risk_exp <- exp[,low_risk_sample]
# ssGSEA
high_risk_gsea <- gsva(high_risk_exp,gene_list,method='ssgsea',kcdf='Gaussian',abs.ranking=T)
low_risk_gsea <- gsva(low_risk_exp,gene_list,method='ssgsea',kcdf='Gaussian',abs.ranking=T)
要注意gsva函数的输入得是matrix,不能是data.frame,不然会报错
hr_gsea_norm <- high_risk_gsea
for (i in colnames(hr_gsea_norm)) {
diff <- max(hr_gsea_norm[,i] )-min(hr_gsea_norm[,i] )
hr_gsea_norm[,i] <- (hr_gsea_norm[,i] -min(hr_gsea_norm[,i]))/diff
}
hr_gsea_norm <- as.data.frame(t(hr_gsea_norm))
hr_gsea_norm$type <- rep('high',times=nrow(hr_gsea_norm))
lr_gsea_norm <- low_risk_gsea
for (i in colnames(lr_gsea_norm)) {
diff <- max(lr_gsea_norm[,i] )-min(lr_gsea_norm[,i])
lr_gsea_norm[,i] <- (lr_gsea_norm[,i] -min(lr_gsea_norm[,i]))/diff
}
lr_gsea_norm <- as.data.frame(t(lr_gsea_norm))
lr_gsea_norm$type <- rep('low',times=nrow(lr_gsea_norm))
gsea_norm <- rbind(hr_gsea_norm,lr_gsea_norm)
longer <- pivot_longer(gsea_norm,cols = !type,names_to = 'cell_type',values_to = 'Score')
ggplot(longer,aes(x=cell_type,y=Score,color=type)) geom_boxplot()
theme(axis.text.x = element_text(angle = 45,hjust = 1))
stat_compare_means(label = "p.signif",size=3,method = 'wilcox.test')
先把high risk gsea和low risk gsea的数据分别进行min max scale,新增一列标记上high和low,然后合并起来,行是样本,列是细胞。再把透视表转成长表,就可以画图了
# 免疫检查点&TBM
risk_exp <- as.data.frame(t(exp))
risk_exp <- risk_exp[km_data$ID,]
risk_exp$type <- km_data$risk
risk_exp <- risk_exp[,c('CD274','PDCD1','CTLA4','LAG3','TIGIT','type')]
risk_long <- pivot_longer(risk_exp,cols = !type,names_to = 'ICIs',values_to = 'Expression')
ggplot(risk_long,aes(x=ICIs,y=Expression,fill=type)) geom_boxplot()
stat_compare_means(label = "p.format",size=3,method = 'wilcox.test')
文末友情宣传
强烈建议你推荐给身边的博士后以及年轻生物学PI,多一点数据认知,让他们的科研上一个台阶:
生物信息学马拉松授课(买一得五) ,你的生物信息学入门课144线程640Gb内存服务器共享一年仍然是仅需800千呼万唤始出来的独享生物信息学云服务器
- 00013
- 0001
- 0000
- 0000
- 0001