利用R代码从UCSC XENA下载mRNA, lncRNA, miRNA表达数据并匹配临床信息
UCSC xena数据库可以下载TCGA、TARGET等数据,非常便捷,下载后的数据需要用R进行一些处理:一. 表达数据的ID转换,并识别对照组(Normal)和肿瘤组(Tumor);...
UCSC xena 数据库可以下载TCGA
、TARGET
等数据,非常便捷,下载后的数据需要用R进行一些处理:
一. 表达数据的ID转换,并识别对照组(Normal)和肿瘤组(Tumor);
二. 注释区分mRNA,lncRNA和miRNA;
三. 临床信息与生存信息整合,并与表达数据排序匹配。
本文展示GDC TCGA HNSC下载及处理的步骤。
首先,从UCSC xena (https://xenabrowser.net/datapages/)下载数据,找到GDCTCGA HNSC:
点击后可以看到,包括表达数据,临床信息和生存信息等数据:
然后运用R语言进行如下处理(为节省工作量,示例文件仅展示了TCGA HNSC
中的12
个样本的处理过程):
#### 载入R包
library(openxlsx)
library(tidyverse)
library(limma)
library(readr)
模块一. 表达数据的ID注释,并识别对照组和肿瘤组
1.1 读取表达量数据
#### 读取表达数据,并做粗处理。可以输入Count数据,也可以用FPKM数据
TCGA_rawdata <- read_tsv("./RawData/TCGA-HNSC.htseq_counts.tsv")
dim(TCGA_rawdata) #60488 13
## [1] 60488 13
1.2 UCSC上下载gencode.v22.annotation.gene.probeMap
probeMap <- read.table("./RawData/gencode.v22.annotation.gene.probeMap",sep = "\t" , header = T)
probeMap[1:4,1:4]
## id gene chrom chromStart
## 1 ENSG00000223972.5 DDX11L1 chr1 11869
## 2 ENSG00000227232.5 WASH7P chr1 14404
## 3 ENSG00000278267.1 MIR6859-3 chr1 17369
## 4 ENSG00000243485.3 RP11-34P13.3 chr1 29554
1.3 转换ID,但TCGA_gset gene有重复基因名
TCGA_gset <- TCGA_rawdata %>%
inner_join(probeMap, by = c("Ensembl_ID" = "id")) %>%
select(gene, starts_with("TCGA") )
TCGA_gset[1:4,1:4]
## # A tibble: 4 x 4
## gene `TCGA-IQ-A6SH-01A` `TCGA-CV-6962-11A` `TCGA-CV-6960-11A`
## <chr> <dbl> <dbl> <dbl>
## 1 TSPAN6 10.5 13.5 11.7
## 2 TNMD 0 2.32 0
## 3 DPM1 11.2 10.7 10.4
## 4 SCYL3 8.62 9.94 9.25
1.4 利用limma包对重复的基因取平均值合并
TCGA_gset = as.data.frame(avereps(TCGA_gset[,-1],ID = TCGA_gset$gene) )
1.5 行名重命名:取行名的1-15位
colnames(TCGA_gset) <- substring(colnames(TCGA_gset),1,15) %>% gsub("-",".",.)
write.csv(TCGA_gset,"./TCGA_output/TCGA_HNSC_Countdata_log2+1.csv")
TCGA_gset[1:4,1:4]
## TCGA.IQ.A6SH.01 TCGA.CV.6962.11 TCGA.CV.6960.11 TCGA.CV.A464.01
## TSPAN6 10.480790 13.527721 11.706064 9.861087
## TNMD 0.000000 2.321928 0.000000 1.584963
## DPM1 11.231821 10.665336 10.439831 11.187971
## SCYL3 8.622052 9.939579 9.250298 8.885696
1.6 根据表达矩阵病人的ID进行分组
TCGA_group_list <- ifelse(as.numeric(substring(colnames(TCGA_gset),14,15)) < 10,
"Tumor","Normal") %>%
factor(.,levels = c("Normal","Tumor"))
table(TCGA_group_list) # Normal 4 Tumor 8
## TCGA_group_list
## Normal Tumor
## 4 8
模块二. 识别mRNA,lncRNA和miRNA
不同类型的RNA是从ENSEMBL数据库获取的http://asia.ensembl.org/。
然后根据
http://vega.archive.ensembl.org/info/about/gene_and_transcript_types.html,针对`mRNA`,`miRNA`,`lncRNA`的`gene_biotype`的类型注释,最终汇总到了一个`EXCEL`文件。
2.1 读入EXCEL文件的gene注释信息,这里用到从biotype上下载好的EXCEL文件
# 注释mRNA,lncRNA和miRNA
mRNA_info <- read.xlsx("./RawData/Gene_info.xlsx",sheet = "mRNA_info")
lncRNA_info <- read.xlsx("./RawData/Gene_info.xlsx",sheet = "lncRNA_info")
miRNA_info <- read.xlsx("./RawData/Gene_info.xlsx",sheet = "miRNA_info")
2.2 根据基因的注释信息,提取对应的表达矩阵
mRNA_gset <- TCGA_gset[rownames(TCGA_gset) %in% mRNA_info$gene_name,]
dim(mRNA_gset) # 18192 12
## [1] 18192 12
write.csv(mRNA_gset,"./TCGA_output/TCGA_HNSC_mRNA.csv",quote = F,row.names = T)
lncRNA_gset <- TCGA_gset[rownames(TCGA_gset) %in% lncRNA_info$gene_name,]
dim(lncRNA_gset) # 14831 12
## [1] 14831 12
write.csv(lncRNA_gset,"./TCGA_output/TCGA_HNSC_lncRNA.csv",quote = F,row.names = T)
miRNA_gset <- TCGA_gset[rownames(TCGA_gset) %in% miRNA_info$gene_name,]
dim(miRNA_gset) # 1670 12
## [1] 1670 12
write.csv(miRNA_gset,"./TCGA_output/TCGA_HNSC_miRNA.csv",quote = F,row.names = T)
模块三. 临床信息与生存信息整合,并与表达数据匹配
3.1 临床数据
#### 3. 读取表型数据,并做粗处理,用来做后续分析,比如生存分析
Phenodata <- read_tsv("./RawData/TCGA-HNSC.GDC_phenotype.tsv")
## Parsed with column specification:
Phenodata[1:4,1:4]
## # A tibble: 4 x 4
## submitter_id.samp~ additional_pharmace~ additional_radiati~ additional_surgery_lo~
## <chr> <chr> <chr> <chr>
## 1 TCGA-DQ-5631-01A YES YES <NA>
## 2 TCGA-BA-7269-01A <NA> <NA> <NA>
## 3 TCGA-BA-A4IH-01A <NA> <NA> <NA>
## 4 TCGA-CV-6954-01A <NA> <NA> <NA>
Phenodata$submitter_id.samples <- substring(Phenodata$submitter_id.samples,1,15) %>%
gsub("-",".",.)
Phenodata[1:4,1:4]
## # A tibble: 4 x 4
## submitter_id.samp~ additional_pharmace~ additional_radiati~ additional_surgery_lo~
## <chr> <chr> <chr> <chr>
## 1 TCGA.DQ.5631.01 YES YES <NA>
## 2 TCGA.BA.7269.01 <NA> <NA> <NA>
## 3 TCGA.BA.A4IH.01 <NA> <NA> <NA>
## 4 TCGA.CV.6954.01 <NA> <NA> <NA>
3.2 随访数据
Sur_data <- read_tsv("./RawData/TCGA-HNSC.survival.tsv")
Sur_data$sample <- substring(Sur_data$sample,1,15) %>% gsub("-",".",.)
Sur_data[1:4,1:4]
## # A tibble: 4 x 4
## sample OS `_PATIENT` OS.time
## <chr> <dbl> <chr> <dbl>
## 1 TCGA.CN.5369.01 1 TCGA-CN-5369 1
## 2 TCGA.IQ.A61I.01 1 TCGA-IQ-A61I 2
## 3 TCGA.CV.7421.01 1 TCGA-CV-7421 2
## 4 TCGA.HD.A4C1.01 0 TCGA-HD-A4C1 11
3.3结合,提取部分列
Phen_surv <- Phenodata %>%
inner_join(Sur_data,by = c("submitter_id.samples" = "sample")) %>%
select(submitter_id.samples,age_at_index.demographic,gender.demographic,
tumor_grade.diagnoses,neoplasm_histologic_grade,tumor_stage.diagnoses,OS,OS.time)
head(Phen_surv)
## # A tibble: 6 x 8
## submitter_id.sa~ age_at_index.de~ gender.demograp~ tumor_grade.dia~
## <chr> <dbl> <chr> <chr>
## 1 TCGA.DQ.5631.01 52 male not reported
## 2 TCGA.BA.7269.01 61 male not reported
## 3 TCGA.BA.A4IH.01 57 male not reported
## 4 TCGA.CV.6954.01 59 male not reported
## 5 TCGA.CV.6954.11 59 male not reported
## 6 TCGA.CN.4740.01 79 female not reported
## # ... with 4 more variables: neoplasm_histologic_grade <chr>,
## # tumor_stage.diagnoses <chr>, OS <dbl>, OS.time <dbl>
3.4 表型数据和表达数据匹配并排序,并加上分组
Phen_surv = Phen_surv[match(colnames(TCGA_gset),Phen_surv$submitter_id.samples),]
identical(Phen_surv$submitter_id.samples,colnames(TCGA_gset))
## [1] TRUE
Phen_surv$group <- TCGA_group_list
Phen_surv = dplyr::select(Phen_surv,submitter_id.samples,group,everything())
write.csv(Phen_surv,"./TCGA_output/TCGA_HNSC_phenotype.csv")
head(Phen_surv)
## # A tibble: 6 x 9
## submitter_id.sa~ group age_at_index.de~ gender.demograp~ tumor_grade.dia~
## <chr> <fct> <dbl> <chr> <chr>
## 1 TCGA.IQ.A6SH.01 Tumor 55 male not reported
## 2 TCGA.CV.6962.11 Norm~ 65 male not reported
## 3 TCGA.CV.6960.11 Norm~ 49 male not reported
## 4 TCGA.CV.A464.01 Tumor 48 male not reported
## 5 TCGA.C9.A47Z.01 Tumor 72 female not reported
## 6 TCGA.CN.6010.01 Tumor 53 male not reported
## # ... with 4 more variables: neoplasm_histologic_grade <chr>,
## # tumor_stage.diagnoses <chr>, OS <dbl>, OS.time <dbl>
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] limma_3.44.3 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4
## [6] readr_1.3.1 tidyr_1.1.0 tibble_3.0.2 ggplot2_3.3.2 tidyverse_1.3.0
## [11] openxlsx_4.1.5
##
## loaded via a namespace (and not attached):
## [1] tinytex_0.24 tidyselect_1.1.0 xfun_0.15 haven_2.3.1
## [5] colorspace_1.4-1 vctrs_0.3.1 generics_0.1.0 htmltools_0.5.0
## [9] yaml_2.2.1 utf8_1.1.4 blob_1.2.1 rlang_0.4.6
## [13] pillar_1.4.6 withr_2.2.0 glue_1.4.1 DBI_1.1.0
## [17] dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0
## [21] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5
## [25] zip_2.1.1 evaluate_0.14 knitr_1.29 fansi_0.4.1
## [29] broom_0.7.0 Rcpp_1.0.5 backports_1.1.7 scales_1.1.1
## [33] jsonlite_1.7.0 fs_1.4.2 hms_0.5.3 digest_0.6.25
## [37] stringi_1.4.6 grid_4.0.2 cli_2.0.2 tools_4.0.2
## [41] magrittr_1.5 crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1
## [45] xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1
## [49] rmarkdown_2.3 httr_1.4.1 rstudioapi_0.11 R6_2.4.1
## [53] compiler_4.0.2
参考资料
生信补给站 《TCGA | 以项目方式管理代码数据 以及 数据读取存储》
王进 《TCGA数据下载,提取lncRNA mRNA》
《R数据科学》
数据和代码下载:https://gitee.com/ct5869/shengxin-baodian/tree/master/TCGA
往期精品(点击图片直达文字对应教程)
后台回复“生信宝典福利第一波”或点击阅读原文获取教程合集
(请备注姓名-学校/企业-职务等)
更多推荐
所有评论(0)