position-correlation scoring feature(PCSF)



hellopbc · Published 2022-06-10 11:02:55


PCSF

Position-correlation scoring feature (PCSF)

Source 1

2010-11_Theory in Biosciences_Eukaryotic and prokaryotic promoter prediction using hybrid approach

https://link.springer.com/article/10.1007/s12064-010-0114-8

Original text:

The PWM can be constructed by counting the frequencies of oligonucleotides in conserved sites of training sequences. The probability $p_{xi}$ of an oligonucleotide $x$ at the ith site can be formulated as (Li and Lin 2006; Wasserman and Sandelin 2004; Kielbasa et al. 2005):
$p_{xi}=(n_{xi}+b_{xi})/(N_i+B_i)$
(2)

where $n_{xi}$ and $b_{xi}$ are real counts and pseudocounts of k-mer oligonucleotide $x$ at the ith site, respectively. $N_i$ and $B_i$ are total number of real counts and pseudocounts at the ith site, respectively. If there are relatively few real counts, many k-mer variations may not be presented because of the small sample of sequences. The goal of adding pseudocounts is to obtain an improved estimate of the probability $p_{xi}$ of k-mer oligonucleotide $x$ at the ith site. A relatively few pseudocounts should be added when there is a good sampling of sequences, and more pseudocounts should be added when the data is sparser. One simple formula that has worked well in some studies is to make $B_i$ equal to $\sqrt{N_i}$ and $b_{xi}$ equal to $p_0\sqrt{N_i}$ ( $p_0$ is the average background frequency) in Eq. 2 (Wasserman and Sandelin 2004; Kielbasa et al. 2005), respectively. As $N_i$ increase, the influence of pseudocounts decrease because $\sqrt{N_i}$ increase more slowly. Due to the existence of pseudocounts, the estimated probabilities are strictly positive (Kielbasa et al. 2005). Based on the probabilities $p_{xi}$ , the PCSF of an arbitrary sequence can be defined as (Li and Lin 2006):
$F=\sum_i \ln(p_{xi}/p_0)$
(3)

where $p_0$ is average background probability of k-mer. The score F shows the degree of sequence closed to matrix resource.
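The two equations above can be tried out on toy data. The following is a minimal sketch, not the paper's implementation: it assumes aligned training sequences of equal length, a uniform background $p_0 = 1/4^k$, and the pseudocount choice $B_i=\sqrt{N_i}$, $b_{xi}=p_0\sqrt{N_i}$. The function names `build_pwm` and `pcsf_score` and the training sequences are made up for illustration.

```python
import math
from collections import Counter

def build_pwm(seqs, k=1):
    """Estimate p_xi (Eq. 2) at every k-mer start position of aligned sequences."""
    n_seqs = len(seqs)          # N_i: each sequence contributes one k-mer per site
    p0 = 1.0 / 4 ** k           # uniform background frequency (assumption)
    B = math.sqrt(n_seqs)       # B_i = sqrt(N_i)
    b = p0 * B                  # b_xi = p0 * sqrt(N_i)
    n_sites = len(seqs[0]) - k + 1
    # counts[i][x] is n_xi; Counter returns 0 for k-mers never seen at site i
    counts = [Counter(s[i:i + k] for s in seqs) for i in range(n_sites)]

    def prob(i, x):
        return (counts[i][x] + b) / (n_seqs + B)

    return prob, n_sites, p0

def pcsf_score(seq, prob, n_sites, p0, k=1):
    """Score F (Eq. 3): sum over sites of ln(p_xi / p0)."""
    return sum(math.log(prob(i, seq[i:i + k]) / p0) for i in range(n_sites))

# Hypothetical aligned training sequences (toy data)
train = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
prob, n_sites, p0 = build_pwm(train, k=1)

score_consensus = pcsf_score("TATAAT", prob, n_sites, p0)
score_other = pcsf_score("GCGCGC", prob, n_sites, p0)
print(score_consensus, score_other)
```

A sequence that matches the training set at most sites gets a positive score, while a sequence made only of unseen nucleotides gets a negative but finite score, because the pseudocounts keep every $p_{xi}$ above zero.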

Source 2

2019-09_Mol Ther-Nucleic Acids_iProEP: A Computational Predictor for Predicting Promoter

https://www.sciencedirect.com/science/article/pii/S2162253119301611

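By aligning the promoter sequences of each species, we can construct a position-correlation scoring matrix (PCSM). Each row of the PCSM consists of the factors $p_{xi}$ , where $p_{xi}$ is the probability of k-mer $x$ at the $i$th position in the promoter samples. $p_{xi}$ can be calculated with the following formula: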
$p_{xi}=\frac{n_{xi}+b_{xi}}{N_i+B_i}$
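where $n_{xi}$ is the real count of $x$ at the $i$th position and $b_{xi}$ is the corresponding pseudocount. $N_i$ is the sum of the real counts of all k-mers at the $i$th position (that is, the number of positive samples), and $B_i$ is the corresponding sum of pseudocounts. If the sample size is not large enough, some k-mers will be absent as k increases. Pseudocounts therefore improve the estimate of the probability $p_{xi}$ of k-mer $x$ at the $i$th position. $B_i$ and $b_{xi}$ are given by:
$B_i=\sqrt{N_i},\quad b_{xi}=p_0\sqrt{N_i}$
where $p_0$ is the background frequency of a k-mer, equal to $1/4^k$ . As the number of samples $N_i$ increases, the influence of the pseudocounts weakens because $\sqrt{N_i}$ grows slowly.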
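As a quick worked check of this shrinking influence, using only the definitions above with $B_i=\sqrt{N_i}$ , the share of the denominator contributed by pseudocounts is
$\frac{B_i}{N_i+B_i}=\frac{\sqrt{N_i}}{N_i+\sqrt{N_i}}=\frac{1}{\sqrt{N_i}+1}$
For $N_i=100$ this share is $1/11\approx 0.091$ ; for $N_i=10000$ it is $1/101\approx 0.0099$ . A k-mer that is never observed at position $i$ ( $n_{xi}=0$ ) still receives $p_{xi}=p_0\sqrt{N_i}/(N_i+\sqrt{N_i})=p_0/(\sqrt{N_i}+1)>0$ , so $\ln(p_{xi}/p_0)=-\ln(\sqrt{N_i}+1)$ stays finite.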
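Following the extensive conservation analysis and ACC evaluation of Lin and Li, a number of conserved trimer sites were selected for the five species. Based on these sites and the PCSM, the PCSF features of the positive and negative samples of the five species can be expressed as:
$PCSF=[f_1, f_2, \ldots, f_i, \ldots, f_n]$
where $n$ is the number of selected conserved sites, and each element is defined as:
$f_i=\ln(p_{xi}/p_0)$
In this equation, $p_0$ is the background probability of each trimer ( $p_0=1/4^3$ ), and $p_{xi}$ can be obtained from the PCSM.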
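The feature vector can be illustrated with a short sketch. This is not the iProEP code: it assumes the pseudocount choice $B_i=\sqrt{N_i}$, $b_{xi}=p_0\sqrt{N_i}$, and the function name `pcsf_features`, the toy promoter sequences, and the site list are all hypothetical. In the paper the conserved sites come from a separate conservation analysis, which is not reproduced here.

```python
import math
from collections import Counter

K = 3
P0 = 1.0 / 4 ** K   # background probability of a trimer, 1/4^3

def pcsf_features(seq, positives, sites):
    """Return [f_1, ..., f_n] with f_i = ln(p_xi / p0) at the chosen sites."""
    N = len(positives)          # N_i: number of positive samples
    B = math.sqrt(N)            # B_i = sqrt(N_i) (assumed pseudocount rule)
    feats = []
    for i in sites:
        # n_xi: trimer counts at site i across the aligned positive samples
        counts = Counter(s[i:i + K] for s in positives)
        p = (counts[seq[i:i + K]] + P0 * B) / (N + B)
        feats.append(math.log(p / P0))
    return feats

# Hypothetical aligned positive samples and hypothetical conserved sites
positives = ["TTGACATATAAT", "TTGACTTATAAT", "TTGACATATGAT", "TTGATATATAAT"]
sites = [0, 7]

features = pcsf_features("TTGACATATAAT", positives, sites)
negative_like = pcsf_features("GGGGGGGGGGGG", positives, sites)
print(features, negative_like)
```

The same function is applied to positive and negative samples alike; only the PCSM is estimated from the positive (promoter) set, so sequences that resemble promoters at the conserved sites yield larger feature values.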