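Heuristic
Sequence (biology)
Subsequence
Mathematics
Computational complexity theory
Algorithm
Entropy (arrow of time)
Computer science
Combinatorics
Biology
Genetics
Mathematical optimization
Bounded function
Quantum mechanics
Physics
Mathematical analysis
Authors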
John C. Wootton,Scott Federhen
Source
Journal: Computers & Chemistry
[Elsevier]
Date: 1993-06-01
Volume (issue): pages: 17 (2): 149-163
Citations: 676
Identifier
DOI: 10.1016/0097-8485(93)85006-x
Abstract
Protein sequences contain surprisingly many local regions of low compositional complexity. These include different types of residue clusters, some of which contain homopolymers, short period repeats or aperiodic mosaics of a few residue types. Several different formal definitions of local complexity and probability are presented here and are compared for their utility in algorithms for localization of such regions in amino acid sequences and sequence databases. The definitions are: (1) those derived from enumeration a priori by a treatment analogous to statistical mechanics, (2) a log likelihood definition of complexity analogous to informational entropy, (3) multinomial probabilities of observed compositions, (4) an approximation resembling the χ² statistic and (5) a modification of the coefficient of divergence. These measures, together with a method based on similarity scores of self-aligned sequences at different offsets, are shown to be broadly similar for first-pass, approximate localization of low-complexity regions in protein sequences, but they give significantly different results when applied in optimal segmentation algorithms. These comparisons underpin the choice of robust optimization heuristics in an algorithm, SEG, designed to segment amino acid sequences fully automatically into subsequences of contrasting complexity. After the abundant low-complexity segments have been partitioned from the Swissprot database, the remaining high-complexity sequence set is adequately approximated by a first-order random model.
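As a rough illustration of definition (2), the entropy-like measure, the sketch below computes the Shannon entropy of the residue composition in a sliding window and reports windows that fall at or below a cutoff. It is only a first-pass scan of the kind the abstract calls "approximate localization", not the SEG algorithm itself, which adds optimal segmentation on top. The function names, the window width of 12 and the cutoff of 2.2 bits are assumptions chosen for this example, not values stated in the abstract.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy, in bits, of the residue composition of one window."""
    n = len(window)
    counts = Counter(window)
    # Sum of p * log2(1/p) over the residue types present in the window.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def low_complexity_windows(seq: str, width: int = 12, cutoff: float = 2.2):
    """Return (start, entropy) for every window with entropy <= cutoff.

    `width` and `cutoff` are illustrative defaults (assumptions).
    """
    hits = []
    for start in range(len(seq) - width + 1):
        h = window_entropy(seq[start:start + width])
        if h <= cutoff:
            hits.append((start, h))
    return hits


if __name__ == "__main__":
    # Hypothetical sequence containing a glutamine homopolymer run.
    example = "MKTAYIAKQRQQQQQQQQQQQQSLEVHGDW"
    for start, h in low_complexity_windows(example):
        print(f"window at {start}: {h:.3f} bits")
```

A homopolymer window scores 0 bits, while a window of 12 distinct residues scores log2(12), about 3.58 bits, so lower values indicate lower compositional complexity.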