一文搞定细菌基因组De Novo测序分析

本文转自基因的生物信息学分析，链接https://mp.weixin.qq.com/s/xWOlv5WVJ7LwTuRQDXmGzg以一个细菌的测序数据为例子，介绍细菌基因组测序分析流程...

刘永鑫Adam

13428人浏览 · 2021-04-30 07:00:00

刘永鑫Adam · 2021-04-30 07:00:00 发布

本文转自基因的生物信息学分析，链接 https://mp.weixin.qq.com/s/xWOlv5WVJ7LwTuRQDXmGzg

以一个细菌的测序数据为例子，介绍细菌基因组测序分析流程。本次实验中的细菌基因组大小为3.4M，通过illumina PE150bp测序，获得测序数据量为～800M，通过拼接得到该样本的草图基因组序列，并进行基因注释等分析。

内容

1 概述

2 基因组de novo测序的分析

2.1 分析流程图

2.2 测序数据格式

2.3 数据准备

2.4 数据质控

2.5 基因组组装

2.6 组装结果统计

2.7 组装示意图

2.8 基因组注释

3 在线分析网站

4 词汇表

概述

微生物广泛存在于自然界中，与人类的生产和生活息息相关。目前一般将微生物主要分为细菌、真菌、放线菌、螺旋体、立克次体、衣原体、支原体和病毒。

随着测序成本的大幅度降低以及测序效率的数量级提升，全基因组测序对微生物单菌基因组学研究起到了巨大的推动作用。通过全基因组序列，可构建该物种的基因组数据库；为后续研究该物种的生长、发育、进化、起源等重大问题搭建一个高效平台，并为后续的基因挖掘、功能验证提供DNA序列信息。

细菌基因组测序，可分为细菌基因组de novo测序和细菌基因组重测序两类。细菌基因组de novo测序，即从头测序，是指在没有任何现有的序列信息的情况下，对某个细菌物种进行测序，利用生物信息学分析手段对序列进行拼装，从而获得该细菌物种的基因组序列。细菌基因组重测序是对已有参考基因组序列（Reference Sequence）的物种的不同个体进行基因组测序，并以此为基础进行个体或群体水平的差异性分析。可关注大量的单核苷酸多态性位点（SNP）、插入缺失（InDel, Insertion/Deletion）、结构变异（Structure Variation，SV）等变异信息。

本篇文章将重点介绍细菌基因组de novo测序的分析流程。

基因组de novo测序的分析

分析流程图

测序数据格式

根据barcode序列区分样本，提取出的数据以标准的fastq格式保存。以双端测序（PE：paired-end）数据为例，每一个样本有read1.fastq和read2.fastq两个文件，分别代表5’ -> 3’和 3’->5’的测序结果。在生信入门：Fasta与Fastq格式文件详解里面，我们已经初步认识了fastq格式。

第4行是该序列的测序质量，每个字符对应为第2行每个碱基的测序质量值，用Phred值+33后，所对应的ASCII字符来表示。Phred值的计算方法为：Q =-10*log10(e) 其中e是碱基错误率。

质量得分与错误概率对照表:

Phred Quality Score	Probability of incorrect base call	Base call accuracy
10	1 in 10	90%
20	1 in 100	99%
30	1 in 1000	99.9%
40	1 in 10000	99.99%

如上表示，如果一个碱基的Q值为20，表示这个碱基可能测错的概率为1%。

数据准备

!ls

test_1.fq.gz  test_2.fq.gz

通过ls命令查看，发现当前目录下有两个fastq的压缩文件，它们是illumina测序获得的原始数据，也是通常要提交到NCBI SRA数据库的文件。关于如何提交请参考生信入门：如何将测序原始数据上传NCBI。

数据质控

使用fastqc软件对原始测序reads进行质控，生成网页统计报告，快速获得数据质量的好坏。

!fastqc -t 10 test_1.fq.gz test_2.fq.gz

Started analysis of test_1.fq.gz
Started analysis of test_2.fq.gz
Approx 5% complete for test_1.fq.gz
Approx 5% complete for test_2.fq.gz
Approx 10% complete for test_1.fq.gz
Approx 10% complete for test_2.fq.gz
Approx 15% complete for test_1.fq.gz
Approx 15% complete for test_2.fq.gz
Approx 20% complete for test_1.fq.gz
Approx 20% complete for test_2.fq.gz
Approx 25% complete for test_1.fq.gz
Approx 25% complete for test_2.fq.gz
Approx 30% complete for test_1.fq.gz
Approx 30% complete for test_2.fq.gz
Approx 35% complete for test_1.fq.gz
Approx 35% complete for test_2.fq.gz
Approx 40% complete for test_1.fq.gz
Approx 40% complete for test_2.fq.gz
Approx 45% complete for test_1.fq.gz
Approx 45% complete for test_2.fq.gz
Approx 50% complete for test_1.fq.gz
Approx 50% complete for test_2.fq.gz
Approx 55% complete for test_1.fq.gz
Approx 55% complete for test_2.fq.gz
Approx 60% complete for test_1.fq.gz
Approx 60% complete for test_2.fq.gz
Approx 65% complete for test_1.fq.gz
Approx 65% complete for test_2.fq.gz
Approx 70% complete for test_1.fq.gz
Approx 70% complete for test_2.fq.gz
Approx 75% complete for test_1.fq.gz
Approx 75% complete for test_2.fq.gz
Approx 80% complete for test_1.fq.gz
Approx 80% complete for test_2.fq.gz
Approx 85% complete for test_1.fq.gz
Approx 85% complete for test_2.fq.gz
Approx 90% complete for test_1.fq.gz
Approx 90% complete for test_2.fq.gz
Approx 95% complete for test_1.fq.gz
Approx 95% complete for test_2.fq.gz
Analysis complete for test_1.fq.gz
Analysis complete for test_2.fq.gz

查看运行fastqc获得的结果文件

!ls -t  test*html #-t sort by modification time, newest first

test_1_fastqc.html  test_2_fastqc.html

打开test_1_fastqc.html文件查看图形化的质量结果，如下：

这个数据的Q值曲线都在30以上，说明质量挺不错。横轴代表位置，纵轴代表quality。红色表示中位数，黄色是25%-75%区间，触须是10%-90%区间，蓝线是平均数。

使用fastp软件对原始测序数据进行过滤，去除低质量、adapter及N碱基等，得到cleandata。

!fastp -i test_1.fq.gz -I test_2.fq.gz -o clean_1.fq.gz -O clean_2.fq.gz

Read1 before filtering:
total reads: 2783612
total bases: 417541800
Q20 bases: 409754276(98.1349%)
Q30 bases: 398333598(95.3997%)
Read1 after filtering:
total reads: 2708145
total bases: 406146010
Q20 bases: 401516631(98.8602%)
Q30 bases: 391420330(96.3743%)
Read2 before filtering:
total reads: 2783612
total bases: 417541800
Q20 bases: 399039636(95.5688%)
Q30 bases: 376827191(90.249%)
Read2 aftering filtering:
total reads: 2708145
total bases: 406146010
Q20 bases: 392948823(96.7506%)
Q30 bases: 372542262(91.7262%)
Filtering result:
reads passed filter: 5416290
reads failed due to low quality: 149360
reads failed due to too many N: 1574
reads failed due to too short: 0
reads with adapter trimmed: 9616
bases trimmed due to adapters: 157916
JSON report: fastp.json
HTML report: fastp.html
fastp -i test_1.fq.gz -I test_2.fq.gz -o clean_1.fq.gz -O clean_2.fq.gz
fastp v0.12.2, time used: 66 seconds

查看发现运行fastp获得结果文件clean1.fq.gz和clean2.fq.gz

!ls -t

clean_2.fq.gz  fastp.html  test_1_fastqc.html  test_1_fastqc.zip  test_1.fq.gz
clean_1.fq.gz  fastp.json  test_2_fastqc.html  test_2_fastqc.zip  test_2.fq.gz

基因组组装

使用SPAdes (版本：3.11) 短序列组装软件对Clean Data进行组装，经多次调整参数后获得最优组装结果；然后reads将比对回组装获得的Contig上，再根据reads的paired-end和overlap关系，对组装结果进行局部组装和优化

!spades.py -h

SPAdes genome assembler v3.11.0
Usage: /usr/local/bin/spades.py [options] -o <output_dir>
Basic options:
-o    <output_dir>    directory to store all the resulting files (required)
--sc            this flag is required for MDA (single-cell) data
--meta            this flag is required for metagenomic sample data
--rna            this flag is required for RNA-Seq data
--plasmid        runs plasmidSPAdes pipeline for plasmid detection
--iontorrent        this flag is required for IonTorrent data
--test            runs SPAdes on toy dataset
-h/--help        prints this usage message
-v/--version        prints version
Input data:
--12    <filename>  file with interlaced forward and reverse paired-end reads
-1    <filename>  file with forward paired-end reads
-2    <filename>  file with reverse paired-end reads
-s    <filename>  file with unpaired reads
--pe<#>-12    <filename>  file with interlaced reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-1    <filename>  file with forward reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-2    <filename>  file with reverse reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-s    <filename>  file with unpaired reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-<or>    orientation of reads for paired-end library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--s<#>        <filename>  file with unpaired reads for single reads library number <#> (<#> = 1,2,..,9)
--mp<#>-12    <filename>  file with interlaced reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-1    <filename>  file with forward reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-2    <filename>  file with reverse reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-s    <filename>  file with unpaired reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-<or>    orientation of reads for mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--hqmp<#>-12    <filename>  file with interlaced reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-1    <filename>  file with forward reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-2    <filename>  file with reverse reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-s    <filename>  file with unpaired reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-<or>    orientation of reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--nxmate<#>-1    <filename>  file with forward reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--nxmate<#>-2    <filename>  file with reverse reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--sanger    <filename>  file with Sanger reads
--pacbio    <filename>  file with PacBio reads
--nanopore    <filename>  file with Nanopore reads
--tslr    <filename>  file with TSLR-contigs
--trusted-contigs    <filename>  file with trusted contigs
--untrusted-contigs    <filename>  file with untrusted contigs
Pipeline options:
--only-error-correction    runs only read error correction (without assembling)
--only-assembler    runs only assembling (without read error correction)
--careful        tries to reduce number of mismatches and short indels
--continue        continue run from the last available check-point
--restart-from    <cp>    restart run with updated options and from the specified check-point ('ec', 'as', 'k<int>', 'mc')
--disable-gzip-output    forces error correction not to compress the corrected reads
--disable-rr        disables repeat resolution stage of assembling
Advanced options:
--dataset    <filename>  file with dataset description in YAML format
-t/--threads    <int>       number of threads
                [default: 16]
-m/--memory    <int>       RAM limit for SPAdes in Gb (terminates if exceeded)
                [default: 250]
--tmp-dir    <dirname>   directory for temporary files
                [default: <output_dir>/tmp]
-k        <int,int,...>   comma-separated list of k-mer sizes (must be odd and
                less than 128) [default: 'auto']
--cov-cutoff    <float>     coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']
--phred-offset    <33 or 64>  PHRED quality offset in the input reads (33 or 64)
                [default: auto-detect]

运行spades程序进行组装，可以自定义-k 选项获得最佳组装结果

!spades.py -k 77,87,97,117 --careful  --only-assembler
-1 clean_1.fq.gz  -2 clean_2.fq.gz  -o assembly -t 20

组装结果统计

QUAST 用于基因组和宏基因组的拼接评估

#查看quast帮助文档
!quast.py -h

QUAST: QUality ASsessment Tool for Genome Assemblies
Version: 4.5
Usage: python /usr/local/bin/quast.py [options] <files_with_contigs>
Options:
-o  --output-dir  <dirname>   Directory to store all result files [default: quast_results/results_<datetime>]
-R                <filename>  Reference genome file
-G  --genes       <filename>  File with gene coordinates in the reference (GFF, BED, NCBI or TXT)
-O  --operons     <filename>  File with operon coordinates in the reference (GFF, BED, NCBI or TXT)
-m  --min-contig  <int>       Lower threshold for contig length [default: 500]
-t  --threads     <int>       Maximum number of threads [default: 25% of CPUs]
Advanced options:
-s  --scaffolds                       Assemblies are scaffolds, split them and add contigs to the comparison
-l  --labels "label, label, ..."      Names of assemblies to use in reports, comma-separated. If contain spaces, use quotes
-L                                    Take assembly names from their parent directory names
-f  --gene-finding                    Predict genes using GeneMarkS (prokaryotes, default) or GeneMark-ES (eukaryotes, use --eukaryote)
    --glimmer                         Use GlimmerHMM for gene prediction (instead of the default finder, see above)
    --mgm                             Use MetaGeneMark for gene prediction (instead of the default finder, see above)
    --gene-thresholds <int,int,...>   Comma-separated list of threshold lengths of genes to search with Gene Finding module
                                      [default: 0,300,1500,3000]
-e  --eukaryote                       Genome is eukaryotic
    --est-ref-size <int>              Estimated reference size (for computing NGx metrics without a reference)
    --gage                            Use GAGE (results are in gage_report.txt)
    --contig-thresholds <int,int,...> Comma-separated list of contig length thresholds [default: 0,1000,5000,10000,25000,50000]
-u  --use-all-alignments              Compute genome fraction, # genes, # operons in QUAST v1.* style.
                                      By default, QUAST filters Nucmer's alignments to keep only best ones
-i  --min-alignment <int>             Nucmer's parameter: the minimum alignment length [default: 0]
    --min-identity <float>            Nucmer's parameter: the minimum alignment identity (80.0, 100.0) [default: 95.0]
-a  --ambiguity-usage <none|one|all>  Use none, one, or all alignments of a contig when all of them
                                      are almost equally good (see --ambiguity-score) [default: one]
    --ambiguity-score <float>         Score S for defining equally good alignments of a single contig. All alignments are sorted
                                      by decreasing LEN * IDY% value. All alignments with LEN * IDY% < S * best(LEN * IDY%) are
                                      discarded. S should be between 0.8 and 1.0 [default: 0.99]
    --strict-NA                       Break contigs in any misassembly event when compute NAx and NGAx
                                      By default, QUAST breaks contigs only by extensive misassemblies (not local ones)
-x  --extensive-mis-size  <int>       Lower threshold for extensive misassembly size. All relocations with inconsistency
                                      less than extensive-mis-size are counted as local misassemblies [default: 1000]
    --scaffold-gap-max-size  <int>    Max allowed scaffold gap length difference. All relocations with inconsistency
                                      less than scaffold-gap-size are counted as scaffold gap misassemblies [default: 10000]
                                      Only scaffold assemblies are affected (use -s/--scaffolds)!
    --unaligned-part-size  <int>      Lower threshold for detecting partially unaligned contigs. Such contig should have
                                      at least one unaligned fragment >= the threshold [default: 500]
    --fragmented                      Reference genome may be fragmented into small pieces (e.g. scaffolded reference)
    --fragmented-max-indent  <int>    Mark translocation as fake if both alignments are located no further than N bases
                                      from the ends of the reference fragments [default: 85]
                                      Requires --fragmented option.
    --plots-format  <str>             Save plots in specified format [default: pdf]
                                      Supported formats: emf, eps, pdf, png, ps, raw, rgba, svg, svgz
    --memory-efficient                Run Nucmer using one thread, separately per each assembly and each chromosome
                                      This may significantly reduce memory consumption on large genomes
    --space-efficient                 Create only reports and plots files. .stdout, .stderr, .coords and other aux files will not be created
                                      This may significantly reduce space consumption on large genomes. Icarus viewers also will not be built
-1  --reads1  <filename>              File with forward reads (in FASTQ format, may be gzipped)
-2  --reads2  <filename>              File with reverse reads (in FASTQ format, may be gzipped)
    --sam  <filename>                 SAM alignment file
    --bam  <filename>                 BAM alignment file
                                      Reads (or SAM/BAM file) are used for structural variation detection and
                                      coverage histogram building in Icarus
    --sv-bedpe  <filename>            File with structural variations (in BEDPE format)
Speedup options:
    --no-check                        Do not check and correct input fasta files. Use at your own risk (see manual)
    --no-plots                        Do not draw plots
    --no-html                         Do not build html reports and Icarus viewers
    --no-icarus                       Do not build Icarus viewers
    --no-snps                         Do not report SNPs (may significantly reduce memory consumption on large genomes)
    --no-gc                           Do not compute GC% and GC-distribution
    --no-sv                           Do not run structural variation detection (make sense only if reads are specified)
    --no-gzip                         Do not compress large output files
    --fast                            A combination of all speedup options except --no-check
Other:
    --silent                          Do not print detailed information about each step to stdout (log file is not affected)
    --test                            Run QUAST on the data from the test_data folder, output to quast_test_output
    --test-sv                         Run QUAST with structural variants detection on the data from the test_data folder,
                                      output to quast_test_output
-h  --help                            Print full usage message
-v  --version                         Print version

#运行quast程序
!quast.py assembly/scaffolds.fasta

#查看拼接结果统计
!more quast_results/results_2019_02_23_12_05_14/report.tsv

Assembly    scaffolds
# contigs (>= 0 bp)    510
# contigs (>= 1000 bp)    18
# contigs (>= 5000 bp)    14
# contigs (>= 10000 bp)    12
# contigs (>= 25000 bp)    11
# contigs (>= 50000 bp)    9
Total length (>= 0 bp)    3421550
Total length (>= 1000 bp)    3256575
Total length (>= 5000 bp)    3245372
Total length (>= 10000 bp)    3233600
Total length (>= 25000 bp)    3216581
Total length (>= 50000 bp)    3122437
# contigs    35
Largest contig    883888
Total length    3266291
GC (%)    57.23
N50    385978
N75    363878
L50    3
L75    5
# N's per 100 kbp    0.00

组装示意图

基因组从头组装结果包含大量的Contigs序列以及错综复杂的连接方式，使用Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily)可以方便的展示基因组组装结果的 de Bruijn图，此外还可以将基因注释结果直观的展现在组装结果上。

!Bandage image assembly/assembly_graph.fastg  assembly.png

Contigs组装连接图

基因组组装的de Bruijn图，图中上方构成组装的基因组草图，图中带有颜色的线表示不同的contig，断点表示出现的contig组成的分支序列。图中下方表示未能组成scaffolds的contig。

基因组注释

Prokka是一个快速注释原核生物（细菌、古细菌、病毒等）基因组的软件工具。它产生GFF3、GBK和SQN文件，能够在Sequin中编辑并最终上传到Genbank/DDJB/ENA数据库。不能注释真核生物。

使用的注释工具：

Prodigal 编码序列（CDS）

RNAmmer 核糖体RNA基因（rRNA）

Aragorn 转运RNA基因（tRNA）

SignalP 信号肽

Infernal 非编码RNA

输出：

.fna 原始输入contigs的FASTA文件（核苷酸）

.faa 翻译的编码基因的FASTA文件（蛋白质）

.ffn 所有基因组特征的FASTA文件（核苷酸）

.fsa 用于提交的Contig序列（核苷酸）

.tbl 用于提交的特征表（Feature table）

.sqn 用于提交的Sequin可编辑文件

.gbk 包含序列和注释的Genbank文件

.gff 包含序列和注释的GFF v3文件

.log 日志文件

.txt 注释汇总统计

#查看prokka帮助文档
!prokka -h

Option h is ambiguous (help, hmms)
Name:
  Prokka 1.12 by Torsten Seemann <torsten.seemann@gmail.com>
Synopsis:
  rapid bacterial genome annotation
Usage:
  prokka [options] <contigs.fasta>
General:
  --help            This help
  --version         Print version and exit
  --docs            Show full manual/documentation
  --citation        Print citation for referencing Prokka
  --quiet           No screen output (default OFF)
  --debug           Debug mode: keep all temporary files (default OFF)
Setup:
  --listdb          List all configured databases
  --setupdb         Index all installed databases
  --cleandb         Remove all database indices
  --depends         List all software dependencies
Outputs:
  --outdir [X]      Output folder [auto] (default '')
  --force           Force overwriting existing output folder (default OFF)
  --prefix [X]      Filename output prefix [auto] (default '')
  --addgenes        Add 'gene' features for each 'CDS' feature (default OFF)
  --addmrna         Add 'mRNA' features for each 'CDS' feature (default OFF)
  --locustag [X]    Locus tag prefix [auto] (default '')
  --increment [N]   Locus tag counter increment (default '1')
  --gffver [N]      GFF version (default '3')
  --compliant       Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)
  --centre [X]      Sequencing centre ID. (default '')
  --accver [N]      Version to put in Genbank file (default '1')
Organism details:
  --genus [X]       Genus name (default 'Genus')
  --species [X]     Species name (default 'species')
  --strain [X]      Strain name (default 'strain')
  --plasmid [X]     Plasmid name or identifier (default '')
Annotations:
  --kingdom [X]     Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
  --gcode [N]       Genetic code / Translation table (set if --kingdom is set) (default '0')
  --gram [X]        Gram: -/neg +/pos (default '')
  --usegenus        Use genus-specific BLAST databases (needs --genus) (default OFF)
  --proteins [X]    FASTA or GBK file to use as 1st priority (default '')
  --hmms [X]        Trusted HMM to first annotate from (default '')
  --metagenome      Improve gene predictions for highly fragmented genomes (default OFF)
  --rawproduct      Do not clean up /product annotation (default OFF)
  --cdsrnaolap      Allow [tr]RNA to overlap CDS (default OFF)
Computation:
  --cpus [N]        Number of CPUs to use [0=all] (default '8')
  --fast            Fast mode - only use basic BLASTP databases (default OFF)
  --noanno          For CDS just set /product="unannotated protein" (default OFF)
  --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
  --evalue [n.n]    Similarity e-value cut-off (default '1e-06')
  --rfam            Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
  --norrna          Don't run rRNA search (default OFF)
  --notrna          Don't run tRNA search (default OFF)
  --rnammer         Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

#基因组注释命令prokka
prokka --outdir annotation/ --force --locustag test --prefix test assembly/scaffolds.fasta --force  

--cpus 20  --centre test --compliant

#列出当前目录下的文件及大小
!tree -h -L 2  -n

.
├── [4.0K]  annotation
│   ├── [  92]  errorsummary.val
│   ├── [ 316]  test.ecn
│   ├── [422K]  test.err
│   ├── [1.0M]  test.faa
│   ├── [2.9M]  test.ffn
│   ├── [3.3M]  test.fna
│   ├── [3.4M]  test.fsa
│   ├── [7.4M]  test.gbf
│   ├── [4.5M]  test.gff
│   ├── [ 45K]  test.log
│   ├── [ 13M]  test.sqn
│   ├── [857K]  test.tbl
│   ├── [230K]  test.tsv
│   ├── [  99]  test.txt
│   └── [3.8K]  test.val
├── [4.0K]  assembly
│   ├── [6.7M]  assembly_graph.fastg
│   ├── [3.3M]  assembly_graph_with_scaffolds.gfa
│   ├── [3.3M]  before_rr.fasta
│   ├── [3.3M]  contigs.fasta
│   ├── [ 43K]  contigs.paths
│   ├── [  68]  dataset.info
│   ├── [ 186]  input_dataset.yaml
│   ├── [4.0K]  K117
│   ├── [4.0K]  K77
│   ├── [4.0K]  K87
│   ├── [4.0K]  K97
│   ├── [4.0K]  misc
│   ├── [4.0K]  mismatch_corrector
│   ├── [1.3K]  params.txt
│   ├── [3.3M]  scaffolds.fasta
│   ├── [ 43K]  scaffolds.paths
│   ├── [144K]  spades.log
│   └── [4.0K]  tmp
├── [234M]  clean_1.fq.gz
├── [258M]  clean_2.fq.gz
├── [ 138]  do.sh
├── [470K]  fastp.html
├── [127K]  fastp.json
├── [4.0K]  quast_results
│   ├── [  27]  latest -> results_2019_02_23_12_07_03
│   ├── [4.0K]  results_2019_02_23_12_05_14
│   ├── [4.0K]  results_2019_02_23_12_06_26
│   └── [4.0K]  results_2019_02_23_12_07_03
├── [4.0K]  test_1_fastqc
│   ├── [123K]  fastqc_data.txt
│   ├── [9.9K]  fastqc.fo
│   ├── [311K]  fastqc_report.html
│   ├── [4.0K]  Icons
│   ├── [4.0K]  Images
│   └── [ 494]  summary.txt
├── [311K]  test_1_fastqc.html
├── [389K]  test_1_fastqc.zip
├── [208M]  test_1.fq.gz
├── [315K]  test_2_fastqc.html
├── [393K]  test_2_fastqc.zip
└── [231M]  test_2.fq.gz
17 directories, 41 files

在线分析网站

基因组注释 RAST
http://rast.nmpdr.org/
基因预测
http://prodigal.ornl.gov/
http://ccb.jhu.edu/software/glimmer/index.shtml
RNAmer
http://www.cbs.dtu.dk/services/RNAmmer/
tRNAscan
http://lowelab.ucsc.edu/tRNAscan-SE/
trf预测
http://tandem.bu.edu/trf/trf407b.linux.download.html
操纵子预测
http://www.microbesonline.org/operons/
http://operondb.cbcb.umd.edu/cgi-bin/operondb/operons.cgi
基因岛预测 IslandViewer
http://www.pathogenomics.sfu.ca/islandviewer/browse/
预测噬菌体
http://phaster.ca
预测信号肽
http://www.cbs.dtu.dk/services/SignalP/
毒力基因数据 VFDB
http://www.mgc.ac.cn/VFs/
耐药基因数据 CARD
https://card.mcmaster.ca/

词汇表

1 De novo测序
从头测序，无需任何参考序列，直接对一个物种进行测序，然后进行拼接、组装成该物种的基因组序列图谱。

2 建库
在基因组随机打断的片段上机前，为保证在测序时有足够的数据强度支持，所进行的基于PCR反应的片段扩增，区别于通常说的基于Fosmid/BAC等载体的建立文库。

3 PCR-free建库
对于异常GC含量（<35%，>65%）的样品，为避免常规建库使用PCR导致测序偏向性产生从而影响后期信息分析，而采用的基于非PCR反应的建库手段。对于样品量需求也大一些（>10 ug），因为建库过程没有PCR扩增过程，此外，只适用于小片段的构建。

4 Index测序
常见于混合样品测序。将不同的样品混合在一起进行测序，为区分各个样品的数据，在待测序列后加一段已知（一般8 bp）的序列作为标签（index标签）。

5 Paired-End测序
双末端测序，对插入片段两端进行测序，产生具有Paired-End关系的reads。

6 测序策略sequences strategy
PE100：即Paired-End (100,100)，采用双末端测序法对插入片段两端进行读长为100 bp的高通量测序。

7 读长
测序仪所能获取的实际长度，即reads的长度，例如：90bp、125bp、300bp。

8 插入片段长度（Insert Size）
待测基因组序列被随机打断的长度，用于下一步的文库构建，不同长度的插入片段在组装中有不同的作用，小片段（<500 bp）长度一般用于Contig级别的组装，大片段（>2 k）一般用于Scaffold级别的组装。

9 Raw data（reads）
即下机数据，原始的reads。

10 Clean data（reads）
即下机数据经过去接头污染、过滤低质量reads、去duplication（针对大片段数据）等之后，实际用于组装及分析的数据。

11 Duplication
两对或多对Paired-End的reads，每对reads中的read1 和read2分别对应完全一样，属于 duplication，数据处理时会去掉这样的duplication，只保留一对reads。

12 Contig序列
由来源于同一基因组，具有overlapping关系的reads拼接而成的片段。

13 Scaffold序列
又称super Contig，通过reads的Paired-End关系将Contig连接，并用N补充其中的内洞而形成。

14 N50/N90
将组装所得片段（Scaffold/Contig）按照从长到短排序并累加求和，累加值达到基因组总长度一半时的片段长度即是该组装结果的N50值，通常用来衡量组装情况；N90与之类似，即累加长度为基因组总长90%时，该片段的长度。

15 K-mer
K-mer 就是一个长度为K 的DNA 序列，K 为正整数。有时候，如K=15，也称为15-mer。K-mer 有多种用途，用于纠正测序错误，构建Contig，以及估计基因组大小，杂合率，和重复序列含量等。

16 非一致序列
与所测基因组不一致的序列，通常情况下为外源基因序列污染。

17 平均测序深度（Depth）
平均每个碱基被测到的次数。与基因组被覆盖的程度相关，值为cleandata的数据量大小/所测物种基因组的大小。理论上30X的数据量可以覆盖全部的基因组序列，实际由于基因组情况的不同，相同测序深度对基因组覆盖程度也会发生变化。

18 Ref_genome/gene
参考基因组/基因。

19 基因组覆盖度
组装获得的测序基因组与参考基因组共有片段占全部共有片段的比例。有三种评价方式，基于reads，K-mer分析和参考序列。

20 基因区覆盖度
组装获得的测序基因组与参考基因组共有基因长度占全部共有基因长度的比例。

21 单碱基错误率
测序或者组装导致的错误碱基数目占整个序列长度的比例。

22 GC skew
使用（G-C）/（G+C）的计算方法得出，因为在细菌复制起始位点附件GC含量会有一个较大波动，所以通过该方法可以用来推测细菌基因组的复制起点。

系列教程：微生物组入门 Biostar 微生物组宏基因组

专业技能：学术图表高分文章生信宝典不可或缺的人

一文读懂：宏基因组寄生虫益处进化树

必备技能：提问搜索 Endnote

文献阅读热心肠 SemanticScholar Geenmedical

扩增子分析：图表解读分析流程统计绘图

16S功能预测 PICRUSt FAPROTAX Bugbase Tax4Fun

在线工具：16S预测培养基生信绘图

科研经验：云笔记云协作公众号

编程模板: Shell R Perl

生物科普: 肠道细菌人体上的生命生命大跃进细胞暗战人体奥秘

写在后面

为鼓励读者交流、快速解决科研困难，我们建立了“宏基因组”专业讨论群，目前己有国内外5000+ 一线科研人员加入。参与讨论，获得专业解答，欢迎分享此文至朋友圈，并扫码加主编好友带你入群，务必备注“姓名-单位-研究方向-职称/年级”。PI请明示身份，另有海内外微生物相关PI群供大佬合作交流。技术问题寻求帮助，首先阅读《如何优雅的提问》学习解决问题思路，仍未解决群内讨论，问题不私聊，帮助同行。