Position-correlation scoring feature (PCSF)
PCSF
Position-correlation scoring feature (PCSF)
Source 1
2020-11_Theory in Biosciences_Eukaryotic and prokaryotic promoter prediction using hybrid approach
https://link.springer.com/article/10.1007/s12064-010-0114-8
Original text:
The PWM can be constructed by counting the frequencies of oligonucleotides in conserved sites of training sequences. The probability $p_{xi}$ of an oligonucleotide $x$ at the $i$th site can be formulated as (Li and Lin 2006; Wasserman and Sandelin 2004; Kielbasa et al. 2005):
$$p_{xi}=\frac{n_{xi}+b_{xi}}{N_i+B_i} \tag{2}$$
where $n_{xi}$ and $b_{xi}$ are real counts and pseudocounts of k-mer oligonucleotide $x$ at the $i$th site, respectively. $N_i$ and $B_i$ are total number of real counts and pseudocounts at the $i$th site, respectively. If there are relatively few real counts, many k-mer variations may not be presented because of the small sample of sequences. The goal of adding pseudocounts is to obtain an improved estimate of the probability $p_{xi}$ of k-mer oligonucleotide $x$ at the $i$th site. A relatively few pseudocounts should be added when there is a good sampling of sequences, and more pseudocounts should be added when the data is sparser. One simple formula that has worked well in some studies is to make $B_i$ equal to $\sqrt{N_i}$ and $b_{xi}$ equal to $p_0\sqrt{N_i}$ ($p_0$ is the average background frequency) in Eq. 2 (Wasserman and Sandelin 2004; Kielbasa et al. 2005), respectively. As $N_i$ increase, the influence of pseudocounts decrease because $\sqrt{N_i}$ increase more slowly. Due to the existence of pseudocounts, the estimated probabilities are strictly positive (Kielbasa et al. 2005). Based on the probabilities $p_{xi}$, the PCSF of an arbitrary sequence can be defined as (Li and Lin 2006):
$$F=\sum_i \ln(p_{xi}/p_0) \tag{3}$$
where $p_0$ is average background probability of k-mer. The score $F$ shows the degree of sequence closed to matrix resource.
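The two equations quoted above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function names `build_pwm` and `pcsf_score` are hypothetical, the sequences are assumed to be aligned, of equal length, and made up of A/C/G/T only, and the background frequency is assumed to be uniform ($p_0 = 1/4^k$).

```python
import math
from collections import Counter
from itertools import product


def build_pwm(seqs, k):
    """Estimate p_{xi} (Eq. 2) for every k-mer x at every start position i.

    Assumptions (not from the source): aligned, equal-length A/C/G/T
    sequences and a uniform background p0 = 1 / 4**k.
    """
    p0 = 1.0 / 4 ** k
    kmers = ["".join(t) for t in product("ACGT", repeat=k)]
    n_total = len(seqs)              # N_i: one k-mer per sequence at site i
    b_total = math.sqrt(n_total)     # B_i = sqrt(N_i)
    pwm = []
    for i in range(len(seqs[0]) - k + 1):
        counts = Counter(s[i:i + k] for s in seqs)   # real counts n_{xi}
        # b_{xi} = p0 * sqrt(N_i); Counter returns 0 for unseen k-mers
        pwm.append({x: (counts[x] + p0 * b_total) / (n_total + b_total)
                    for x in kmers})
    return pwm, p0


def pcsf_score(seq, pwm, p0, k):
    """Score F (Eq. 3): sum over sites of ln(p_{xi} / p0)."""
    return sum(math.log(pwm[i][seq[i:i + k]] / p0) for i in range(len(pwm)))


if __name__ == "__main__":
    training = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]  # toy data
    pwm, p0 = build_pwm(training, k=1)
    print(pcsf_score("TATAAT", pwm, p0, k=1))
    print(pcsf_score("GCGCGC", pwm, p0, k=1))
```

With this choice of pseudocounts the probabilities at each site sum to 1 and are never zero, so the logarithm in Eq. 3 is always defined.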
Source 2
2019-09_Mol Ther-Nucleic Acids_iProEP:A Computational Predictor for Predicting Promoter
https://www.sciencedirect.com/science/article/pii/S2162253119301611
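By aligning the promoter sequences of each species, we can build a position-correlation scoring matrix (PCSM). Each row of the PCSM consists of the factors $p_{xi}$, where $p_{xi}$ is the probability of k-mer $x$ at the $i$th position of the promoter samples. $p_{xi}$ can be calculated with the following formula: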
$$p_{xi}=\frac{n_{xi}+b_{xi}}{N_i+B_i}$$
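where $n_{xi}$ is the real count of $x$ occurring at the $i$th position, and $b_{xi}$ is the corresponding pseudocount. $N_i$ denotes the sum of the real counts of all k-mers at the $i$th position (that is, the number of positive samples), and $B_i$ is the corresponding sum of pseudocounts. If the sample size is not large enough, some k-mers will be absent as k increases. Pseudocounts can therefore improve the estimate of the probability $p_{xi}$ of k-mer $x$ at the $i$th position. $B_i$ and $b_{xi}$ can be given by: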
$$\begin{aligned} B_i &= \sqrt{N_i},\\ b_{xi} &= p_o\sqrt{N_i} \end{aligned}$$
where $p_o$ is the background frequency of a k-mer, equal to $1/4^k$. As the number of samples $N_i$ increases, the influence of the pseudocounts weakens because $\sqrt{N_i}$ grows slowly.
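A small numeric check may make the last statement concrete. The sketch below assumes $B_i=\sqrt{N_i}$ and a uniform background $p_o = 1/4^k$, as described above; the helper names `pseudocount_share` and `unseen_kmer_prob` are hypothetical and chosen only for illustration.

```python
import math


def pseudocount_share(n_i):
    """Fraction of the denominator N_i + B_i that comes from pseudocounts,
    assuming B_i = sqrt(N_i). Algebraically this equals 1 / (sqrt(N_i) + 1)."""
    b_i = math.sqrt(n_i)
    return b_i / (n_i + b_i)


def unseen_kmer_prob(n_i, k):
    """p_{xi} for a k-mer with zero real counts at site i,
    assuming b_{xi} = p_o * sqrt(N_i) and p_o = 1 / 4**k."""
    p_o = 1.0 / 4 ** k
    b_i = math.sqrt(n_i)
    return (0 + p_o * b_i) / (n_i + b_i)


if __name__ == "__main__":
    for n_i in (10, 100, 10000):
        print(n_i, pseudocount_share(n_i), unseen_kmer_prob(n_i, k=3))
```

The share contributed by pseudocounts shrinks as the sample grows, while an unseen k-mer still receives a small positive probability.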
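Using the extensive conservation analyses and ACC evaluations of Lin and Li, a number of conserved trimer sites were selected for the five species. Based on these sites and the PCSM, the PCSF features of the positive and negative samples of the five species can be expressed as: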
$$PCSF=[f_1\ f_2\ \ldots\ f_i\ \ldots\ f_n]$$
where $n$ is the number of selected conserved sites, and each element is defined as:
$$f_i=\ln(p_{xi}/p_o)$$
In this equation, $p_o$ is the background probability of each trimer ($p_o=1/4^3$), and $p_{xi}$ can be obtained from the PCSM.
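The feature vector described above can be sketched as follows. This is a hedged illustration rather than the iProEP code: `build_pcsm`, `pcsf_vector`, the toy sequences, and the `sites` list are all hypothetical, and the conserved sites are simply assumed to be given as 0-based start positions.

```python
import math
from collections import Counter

K = 3                   # trimers, as in the text
P_O = 1.0 / 4 ** K      # background probability p_o = 1/4^3


def build_pcsm(positives):
    """Collect trimer counts n_{xi} per start position from aligned,
    equal-length positive samples (an assumption of this sketch)."""
    n_i = len(positives)            # N_i: number of positive samples
    b_i = math.sqrt(n_i)            # B_i = sqrt(N_i)
    counts = [Counter(s[i:i + K] for s in positives)
              for i in range(len(positives[0]) - K + 1)]
    return counts, n_i, b_i


def pcsf_vector(seq, counts, n_i, b_i, sites):
    """Return [f_1, ..., f_n] with f_i = ln(p_{xi} / p_o) at each chosen site."""
    feats = []
    for i in sites:
        x = seq[i:i + K]
        # Counter returns 0 for trimers never seen at this site
        p_xi = (counts[i][x] + P_O * b_i) / (n_i + b_i)
        feats.append(math.log(p_xi / P_O))
    return feats


if __name__ == "__main__":
    positives = ["TATAATGC", "TATAATGG", "TATAATCC", "TTTAATGC"]  # toy data
    counts, n_i, b_i = build_pcsm(positives)
    sites = [0, 3]      # hypothetical conserved sites
    print(pcsf_vector("TATAATGC", counts, n_i, b_i, sites))
    print(pcsf_vector("GGGGGGGG", counts, n_i, b_i, sites))
```

Both positive and negative samples would be scored against the same matrix built from the positive set, so the resulting vectors can be fed to a classifier.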