生物信息学算法之Python实现|Rosalind刷题笔记：005 GC含量计算_小言_互联网的博客

生物信息学算法之Python实现|Rosalind刷题笔记：005 GC含量计算

2021-01-19 15:46 826人阅读评论(0)

DNA 序列的 GC 含量是指序列中'G'和'C'所占的百分比。

一条 DNA 序列很容易表示，但是如果有多条 DNA 序列放在一起，则每条序列必须被标记，通常的做法是保存为 FASTA 格式文件。在这种格式中，序列的名称占一行，名称的最前面是一个大于符号‘>’开头，序列名称后面可以跟一系列说明；序列信息从名称的下一行开始，直到遇到下一个以‘>’开头的序列名称为止。Fasta 格式文件可参考下面的示例数据。

给定：一个 Fasta 序列文件。

需得：GC 含量最高的序列名称及其 GC 含量（各占一行行输出）。

示例数据


   
    
     
      
     
     
      
       >Rosalind_6404
      
     
    
     
      
     
     
      
       CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
      
     
    
     
      
     
     
      
       TCCCACTAATAATTCTGAGG
      
     
    
     
      
     
     
      
       >Rosalind_5959
      
     
    
     
      
     
     
      
       CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
      
     
    
     
      
     
     
      
       ATATCCATTTGTCAGCAGACACGC
      
     
    
     
      
     
     
      
       >Rosalind_0808
      
     
    
     
      
     
     
      
       CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
      
     
    
     
      
     
     
      
       TGGGAACCTGCGGGCAGTAGGTGGAAT

示例结果


   
    
     
      
     
     
      
       Rosalind_0808
      
     
    
     
      
     
     
      
       60.919540

Python 实现

Computing_GC_Content.py


   
    
     
      
     
     
      
       import sys
      
     
    
     
      
     
     
      
       import pysam
      
     
    
     
      
     
     
       
      
     
    
     
      
     
     
      
       def gc_content(item):
      
     
    
     
      
     
     
      
           n, s = item
      
     
    
     
      
     
     
          
       return (s.count(
       'G') + s.count(
       'C')) * 
       100 / 
       len(s)
      
     
    
     
      
     
     
       
      
     
    
     
      
     
     
      
       def max_gc_content(infasta):
      
     
    
     
      
     
     
      
           dna = {}
      
     
    
     
      
     
     
      
           with pysam.FastxFile(infasta) as fh:
      
     
    
     
      
     
     
              
       for r in fh:
      
     
    
     
      
     
     
      
                   dna[r.name] = r.sequence.upper()
      
     
    
     
      
     
     
          
       return max(dna.items(), key=gc_content)
      
     
    
     
      
     
     
       
      
     
    
     
      
     
     
      
       def test():
      
     
    
     
      
     
     
      
           item = max_gc_content(
       'rosalind_gc_test.txt')
      
     
    
     
      
     
     
          
       return item[
       0] == 
       'Rosalind_0808' and round(gc_content(item), 
       6)== 
       60.919540
      
     
    
     
      
     
     
       
      
     
    
     
      
     
     
      
       if __name__ == 
       '__main__':
      
     
    
     
      
     
     
          
       if not test():
      
     
    
     
      
     
     
              
       print(
       "cout_gc_content:Failed")
      
     
    
     
      
     
     
      
               sys.exit(
       1)
      
     
    
     
      
     
     
       
      
     
    
     
      
     
     
      
           item = max_gc_content(
       'rosalind_gc.txt')
      
     
    
     
      
     
     
          
       print(item[
       0])
      
     
    
     
      
     
     
          
       print(gc_content(item))

本题要点：

用 pysam 读取 Fasta 文件，并将其放入字典中；详细用法见：基因组文件读写（pysam）
max 函数的使用，特别是为其构造一个 key 函数并传入，这是解本题的关键，GC 含量本身是很容易理解的。

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset


   
    
     
      
     
     
      
       >Rosalind_6404
      
     
    
     
      
     
     
      
       CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
      
     
    
     
      
     
     
      
       TCCCACTAATAATTCTGAGG
      
     
    
     
      
     
     
      
       >Rosalind_5959
      
     
    
     
      
     
     
      
       CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
      
     
    
     
      
     
     
      
       ATATCCATTTGTCAGCAGACACGC
      
     
    
     
      
     
     
      
       >Rosalind_0808
      
     
    
     
      
     
     
      
       CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
      
     
    
     
      
     
     
      
       TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output


   
    
     
      
     
     
      
       Rosalind_0808
      
     
    
     
      
     
     
      
       60.919540

喜欢文章请点个“赞”吧！或者点击“在看”，点击“阅读原文”可以在知乎专栏上给我留言，博客地址：https://jianzuoyi.github.io

转载：https://blog.csdn.net/weshengxin/article/details/110914458

查看评论

小言_互联网的博客

小言_互联网的博客

个人资料

文章分类

文章存档

阅读排行

评论排行

推荐文章