DNA 序列的 GC 含量是指序列中'G'和'C'所占的百分比。
一条 DNA 序列很容易表示,但是如果有多条 DNA 序列放在一起,则每条序列必须被标记,通常的做法是保存为 FASTA 格式文件。在这种格式中,序列的名称占一行,名称的最前面是一个大于符号‘>’开头,序列名称后面可以跟一系列说明;序列信息从名称的下一行开始,直到遇到下一个以‘>’开头的序列名称为止。Fasta 格式文件可参考下面的示例数据。
给定:一个 Fasta 序列文件。
需得:GC 含量最高的序列名称及其 GC 含量(各占一行行输出)。
示例数据
-
>Rosalind_6404
-
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
-
TCCCACTAATAATTCTGAGG
-
>Rosalind_5959
-
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
-
ATATCCATTTGTCAGCAGACACGC
-
>Rosalind_0808
-
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
-
TGGGAACCTGCGGGCAGTAGGTGGAAT
示例结果
-
Rosalind_0808
-
60.919540
Python 实现
Computing_GC_Content.py
-
import sys
-
import pysam
-
-
def gc_content(item):
-
n, s = item
-
return (s.count(
'G') + s.count(
'C')) *
100 /
len(s)
-
-
def max_gc_content(infasta):
-
dna = {}
-
with pysam.FastxFile(infasta) as fh:
-
for r in fh:
-
dna[r.name] = r.sequence.upper()
-
return max(dna.items(), key=gc_content)
-
-
def test():
-
item = max_gc_content(
'rosalind_gc_test.txt')
-
return item[
0] ==
'Rosalind_0808' and round(gc_content(item),
6)==
60.919540
-
-
if __name__ ==
'__main__':
-
if not test():
-
print(
"cout_gc_content:Failed")
-
sys.exit(
1)
-
-
item = max_gc_content(
'rosalind_gc.txt')
-
print(item[
0])
-
print(gc_content(item))
本题要点:
用 pysam 读取 Fasta 文件,并将其放入字典中;详细用法见:基因组文件读写(pysam)
max 函数的使用,特别是为其构造一个 key 函数并传入,这是解本题的关键,GC 含量本身是很容易理解的。
Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
Sample Dataset
-
>Rosalind_6404
-
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
-
TCCCACTAATAATTCTGAGG
-
>Rosalind_5959
-
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
-
ATATCCATTTGTCAGCAGACACGC
-
>Rosalind_0808
-
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
-
TGGGAACCTGCGGGCAGTAGGTGGAAT
Sample Output
-
Rosalind_0808
-
60.919540
喜欢文章请点个“赞”吧!或者点击“在看”,点击“阅读原文”可以在知乎专栏上给我留言,博客地址:https://jianzuoyi.github.io
转载:https://blog.csdn.net/weshengxin/article/details/110914458