UCSC xena 数据库可以下载TCGATARGET等数据,非常便捷,下载后的数据需要用R进行一些处理:

一. 表达数据的ID转换,并识别对照组(Normal)和肿瘤组(Tumor); 

二. 注释区分mRNA,lncRNA和miRNA; 

三. 临床信息与生存信息整合,并与表达数据排序匹配。

本文展示GDC TCGA HNSC下载及处理的步骤。

首先,从UCSC xena (https://xenabrowser.net/datapages/)下载数据,找到GDCTCGA HNSC:

点击后可以看到,包括表达数据,临床信息和生存信息等数据:

然后运用R语言进行如下处理(为节省工作量,示例文件仅展示了TCGA HNSC中的12个样本的处理过程):

#### 载入R包
library(openxlsx)
library(tidyverse)

library(limma)
library(readr)

模块一. 表达数据的ID注释,并识别对照组和肿瘤组

1.1 读取表达量数据
#### 读取表达数据,并做粗处理。可以输入Count数据,也可以用FPKM数据

TCGA_rawdata <- read_tsv("./RawData/TCGA-HNSC.htseq_counts.tsv")

dim(TCGA_rawdata) #60488 13

## [1] 60488    13
1.2 UCSC上下载gencode.v22.annotation.gene.probeMap
probeMap <- read.table("./RawData/gencode.v22.annotation.gene.probeMap",sep = "\t" , header = T)
probeMap[1:4,1:4]

##                  id         gene chrom chromStart
## 1 ENSG00000223972.5      DDX11L1  chr1      11869
## 2 ENSG00000227232.5       WASH7P  chr1      14404
## 3 ENSG00000278267.1    MIR6859-3  chr1      17369
## 4 ENSG00000243485.3 RP11-34P13.3  chr1      29554
1.3 转换ID,但TCGA_gset gene有重复基因名
TCGA_gset <- TCGA_rawdata %>%
  inner_join(probeMap, by = c("Ensembl_ID" = "id")) %>%
  select(gene, starts_with("TCGA") )
TCGA_gset[1:4,1:4]


## # A tibble: 4 x 4
##   gene   `TCGA-IQ-A6SH-01A` `TCGA-CV-6962-11A` `TCGA-CV-6960-11A`
##   <chr>               <dbl>              <dbl>              <dbl>
## 1 TSPAN6              10.5               13.5               11.7 
## 2 TNMD                 0                  2.32               0   
## 3 DPM1                11.2               10.7               10.4 
## 4 SCYL3                8.62               9.94               9.25
1.4 利用limma包对重复的基因取平均值合并
TCGA_gset = as.data.frame(avereps(TCGA_gset[,-1],ID = TCGA_gset$gene) )
1.5 行名重命名:取行名的1-15位
colnames(TCGA_gset) <- substring(colnames(TCGA_gset),1,15) %>% gsub("-",".",.)
write.csv(TCGA_gset,"./TCGA_output/TCGA_HNSC_Countdata_log2+1.csv")
TCGA_gset[1:4,1:4]

##        TCGA.IQ.A6SH.01 TCGA.CV.6962.11 TCGA.CV.6960.11 TCGA.CV.A464.01
## TSPAN6       10.480790       13.527721       11.706064        9.861087
## TNMD          0.000000        2.321928        0.000000        1.584963
## DPM1         11.231821       10.665336       10.439831       11.187971
## SCYL3         8.622052        9.939579        9.250298        8.885696
1.6 根据表达矩阵病人的ID进行分组
TCGA_group_list <- ifelse(as.numeric(substring(colnames(TCGA_gset),14,15)) < 10,
                          "Tumor","Normal") %>% 
                        factor(.,levels = c("Normal","Tumor"))
table(TCGA_group_list) # Normal 4   Tumor 8        

## TCGA_group_list
## Normal  Tumor 
##      4      8

模块二. 识别mRNA,lncRNA和miRNA

不同类型的RNA是从ENSEMBL数据库获取的http://asia.ensembl.org/。

然后根据

http://vega.archive.ensembl.org/info/about/gene_and_transcript_types.html,针对`mRNA`,`miRNA`,`lncRNA`的`gene_biotype`的类型注释,最终汇总到了一个`EXCEL`文件。

2.1 读入EXCEL文件的gene注释信息,这里用到从biotype上下载好的EXCEL文件
# 注释mRNA,lncRNA和miRNA

mRNA_info <- read.xlsx("./RawData/Gene_info.xlsx",sheet = "mRNA_info")
lncRNA_info <- read.xlsx("./RawData/Gene_info.xlsx",sheet = "lncRNA_info")
miRNA_info <- read.xlsx("./RawData/Gene_info.xlsx",sheet = "miRNA_info")
2.2 根据基因的注释信息,提取对应的表达矩阵
mRNA_gset <- TCGA_gset[rownames(TCGA_gset) %in% mRNA_info$gene_name,]
dim(mRNA_gset) # 18192 12

## [1] 18192    12

write.csv(mRNA_gset,"./TCGA_output/TCGA_HNSC_mRNA.csv",quote = F,row.names = T)

lncRNA_gset <- TCGA_gset[rownames(TCGA_gset) %in% lncRNA_info$gene_name,]
dim(lncRNA_gset) # 14831 12

## [1] 14831    12

write.csv(lncRNA_gset,"./TCGA_output/TCGA_HNSC_lncRNA.csv",quote = F,row.names = T)

miRNA_gset <- TCGA_gset[rownames(TCGA_gset) %in% miRNA_info$gene_name,]
dim(miRNA_gset) # 1670 12

## [1] 1670   12

write.csv(miRNA_gset,"./TCGA_output/TCGA_HNSC_miRNA.csv",quote = F,row.names = T)

模块三. 临床信息与生存信息整合,并与表达数据匹配

3.1 临床数据
#### 3. 读取表型数据,并做粗处理,用来做后续分析,比如生存分析

Phenodata <- read_tsv("./RawData/TCGA-HNSC.GDC_phenotype.tsv")

## Parsed with column specification:

Phenodata[1:4,1:4]

## # A tibble: 4 x 4
##   submitter_id.samp~ additional_pharmace~ additional_radiati~ additional_surgery_lo~
##   <chr>              <chr>                <chr>               <chr>                 
## 1 TCGA-DQ-5631-01A   YES                  YES                 <NA>                  
## 2 TCGA-BA-7269-01A   <NA>                 <NA>                <NA>                  
## 3 TCGA-BA-A4IH-01A   <NA>                 <NA>                <NA>                  
## 4 TCGA-CV-6954-01A   <NA>                 <NA>                <NA>

Phenodata$submitter_id.samples <- substring(Phenodata$submitter_id.samples,1,15) %>% 
  gsub("-",".",.)
Phenodata[1:4,1:4]

## # A tibble: 4 x 4
##   submitter_id.samp~ additional_pharmace~ additional_radiati~ additional_surgery_lo~
##   <chr>              <chr>                <chr>               <chr>                 
## 1 TCGA.DQ.5631.01    YES                  YES                 <NA>                  
## 2 TCGA.BA.7269.01    <NA>                 <NA>                <NA>                  
## 3 TCGA.BA.A4IH.01    <NA>                 <NA>                <NA>                  
## 4 TCGA.CV.6954.01    <NA>                 <NA>                <NA>
3.2 随访数据
Sur_data <- read_tsv("./RawData/TCGA-HNSC.survival.tsv")

Sur_data$sample <- substring(Sur_data$sample,1,15) %>% gsub("-",".",.)
Sur_data[1:4,1:4]

## # A tibble: 4 x 4
##   sample             OS `_PATIENT`   OS.time
##   <chr>           <dbl> <chr>          <dbl>
## 1 TCGA.CN.5369.01     1 TCGA-CN-5369       1
## 2 TCGA.IQ.A61I.01     1 TCGA-IQ-A61I       2
## 3 TCGA.CV.7421.01     1 TCGA-CV-7421       2
## 4 TCGA.HD.A4C1.01     0 TCGA-HD-A4C1      11
3.3结合,提取部分列
Phen_surv <- Phenodata %>%
  inner_join(Sur_data,by = c("submitter_id.samples" = "sample")) %>%
  select(submitter_id.samples,age_at_index.demographic,gender.demographic,
         tumor_grade.diagnoses,neoplasm_histologic_grade,tumor_stage.diagnoses,OS,OS.time)
head(Phen_surv)

## # A tibble: 6 x 8
##   submitter_id.sa~ age_at_index.de~ gender.demograp~ tumor_grade.dia~
##   <chr>                       <dbl> <chr>            <chr>           
## 1 TCGA.DQ.5631.01                52 male             not reported    
## 2 TCGA.BA.7269.01                61 male             not reported    
## 3 TCGA.BA.A4IH.01                57 male             not reported    
## 4 TCGA.CV.6954.01                59 male             not reported    
## 5 TCGA.CV.6954.11                59 male             not reported    
## 6 TCGA.CN.4740.01                79 female           not reported    
## # ... with 4 more variables: neoplasm_histologic_grade <chr>,
## #   tumor_stage.diagnoses <chr>, OS <dbl>, OS.time <dbl>
3.4 表型数据和表达数据匹配并排序,并加上分组
Phen_surv = Phen_surv[match(colnames(TCGA_gset),Phen_surv$submitter_id.samples),]
identical(Phen_surv$submitter_id.samples,colnames(TCGA_gset))

## [1] TRUE

Phen_surv$group <- TCGA_group_list
Phen_surv = dplyr::select(Phen_surv,submitter_id.samples,group,everything())
write.csv(Phen_surv,"./TCGA_output/TCGA_HNSC_phenotype.csv")
head(Phen_surv)

## # A tibble: 6 x 9
##   submitter_id.sa~ group age_at_index.de~ gender.demograp~ tumor_grade.dia~
##   <chr>            <fct>            <dbl> <chr>            <chr>           
## 1 TCGA.IQ.A6SH.01  Tumor               55 male             not reported    
## 2 TCGA.CV.6962.11  Norm~               65 male             not reported    
## 3 TCGA.CV.6960.11  Norm~               49 male             not reported    
## 4 TCGA.CV.A464.01  Tumor               48 male             not reported    
## 5 TCGA.C9.A47Z.01  Tumor               72 female           not reported    
## 6 TCGA.CN.6010.01  Tumor               53 male             not reported    
## # ... with 4 more variables: neoplasm_histologic_grade <chr>,
## #   tumor_stage.diagnoses <chr>, OS <dbl>, OS.time <dbl>

sessionInfo()

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] limma_3.44.3    forcats_0.5.0   stringr_1.4.0   dplyr_1.0.0     purrr_0.3.4    
##  [6] readr_1.3.1     tidyr_1.1.0     tibble_3.0.2    ggplot2_3.3.2   tidyverse_1.3.0
## [11] openxlsx_4.1.5 
## 
## loaded via a namespace (and not attached):
##  [1] tinytex_0.24     tidyselect_1.1.0 xfun_0.15        haven_2.3.1     
##  [5] colorspace_1.4-1 vctrs_0.3.1      generics_0.1.0   htmltools_0.5.0 
##  [9] yaml_2.2.1       utf8_1.1.4       blob_1.2.1       rlang_0.4.6     
## [13] pillar_1.4.6     withr_2.2.0      glue_1.4.1       DBI_1.1.0       
## [17] dbplyr_1.4.4     modelr_0.1.8     readxl_1.3.1     lifecycle_0.2.0 
## [21] munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0 rvest_0.3.5     
## [25] zip_2.1.1        evaluate_0.14    knitr_1.29       fansi_0.4.1     
## [29] broom_0.7.0      Rcpp_1.0.5       backports_1.1.7  scales_1.1.1    
## [33] jsonlite_1.7.0   fs_1.4.2         hms_0.5.3        digest_0.6.25   
## [37] stringi_1.4.6    grid_4.0.2       cli_2.0.2        tools_4.0.2     
## [41] magrittr_1.5     crayon_1.3.4     pkgconfig_2.0.3  ellipsis_0.3.1  
## [45] xml2_1.3.2       reprex_0.3.0     lubridate_1.7.9  assertthat_0.2.1
## [49] rmarkdown_2.3    httr_1.4.1       rstudioapi_0.11  R6_2.4.1        
## [53] compiler_4.0.2

参考资料

  1. 生信补给站 《TCGA | 以项目方式管理代码数据 以及 数据读取存储》

  2. 王进 《TCGA数据下载,提取lncRNA mRNA》

  3. 《R数据科学》

  4. 典型医学设计实验GEO数据分析 (step-by-step) - Limma差异分析、火山图、功能富集

  5. 数据和代码下载:https://gitee.com/ct5869/shengxin-baodian/tree/master/TCGA

往期精品(点击图片直达文字对应教程)

机器学习

后台回复“生信宝典福利第一波”或点击阅读原文获取教程合集

 

(请备注姓名-学校/企业-职务等)

Logo

华为开发者空间,是为全球开发者打造的专属开发空间,汇聚了华为优质开发资源及工具,致力于让每一位开发者拥有一台云主机,基于华为根生态开发、创新。

更多推荐