
Parsing the KEGG Database with Biopython

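KEGG, often described as an encyclopedia of genes and genomes, is an integrated database made up of several sub-databases, including gene and pathway. To make its data easier to query, KEGG provides an official API.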

In Biopython, the Bio.KEGG module wraps the official KEGG API so that it can be used from Python. KEGG API calls correspond to Python code as follows:


   
    /list/hsa:10458+ece:Z5100          -> REST.kegg_list(["hsa:10458", "ece:Z5100"])
    /find/compound/300-310/mol_weight  -> REST.kegg_find("compound", "300-310", "mol_weight")
    /get/hsa:10458+ece:Z5100/aaseq     -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")
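
To make the mapping more concrete, here is a minimal sketch of how the Python-side arguments line up with the path segments of a KEGG REST URL. The helper name `kegg_url` and the `base` parameter are hypothetical and exist only for illustration. This is not how Bio.KEGG.REST is implemented, and the sketch sends no network request.

```python
def kegg_url(operation, *args, base="https://rest.kegg.jp"):
    """Build a KEGG REST URL from an operation name and its arguments.

    Hypothetical helper for illustration. A list or tuple argument is
    joined with "+", mirroring how several identifiers are combined
    in a single request path.
    """
    segments = [operation]
    for arg in args:
        if isinstance(arg, (list, tuple)):
            segments.append("+".join(arg))
        else:
            segments.append(str(arg))
    return base + "/" + "/".join(segments)


print(kegg_url("list", ["hsa:10458", "ece:Z5100"]))
print(kegg_url("find", "compound", "300-310", "mol_weight"))
print(kegg_url("get", ["hsa:10458", "ece:Z5100"], "aaseq"))
```

Each positional argument becomes one path segment, which is why a list of identifiers on the Python side shows up as a single `+`-joined segment in the URL.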

The REST module can download any type of data the API supports. Taking a pathway as an example:


   
    >>> from Bio.KEGG import REST
    >>> pathway = REST.kegg_get('hsa00010')

The object returned by the query can be converted to plain text with its read method, for example:


   
    >>> pathway = REST.kegg_get('hsa00010')
    >>> res = pathway.read().split("\n")
    >>> res[0]
    'ENTRY hsa00010 Pathway'
    >>> res[1]
    'NAME Glycolysis / Gluconeogenesis - Homo sapiens (human)'
    >>> res[2]
    'DESCRIPTION Glycolysis is the process of converting glucose into pyruvate and generating small amounts of ATP (energy) and NADH (reducing power). It is a central pathway that produces important precursor metabolites: six-carbon compounds of glucose-6P and fructose-6P and three-carbon compounds of glycerone-P, glyceraldehyde-3P, glycerate-3P, phosphoenolpyruvate, and pyruvate [MD:M00001]. Acetyl-CoA, another important precursor metabolite, is produced by oxidative decarboxylation of pyruvate [MD:M00307]. When the enzyme genes of this pathway are examined in completely sequenced genomes, the reaction steps of three-carbon compounds from glycerone-P to pyruvate form a conserved core module [MD:M00002], which is found in almost all organisms and which sometimes contains operon structures in bacterial genomes. Gluconeogenesis is a synthesis pathway of glucose from noncarbohydrate precursors. It is essentially a reversal of glycolysis with minor variations of alternative paths [MD:M00003].'
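
As a rough illustration of this kind of string parsing, the sketch below groups the lines of a record by keyword. It assumes the keyword (ENTRY, NAME, and so on) occupies the first 12 columns of a line and that continuation lines leave those columns blank, which is the same layout the gene-extraction code later in this post relies on. The function name `parse_flat_record` is hypothetical, and the sample text is hand-written and abbreviated rather than real API output.

```python
def parse_flat_record(text, key_width=12):
    """Group the lines of a KEGG flat-file record by their keyword.

    Assumption: the keyword occupies the first `key_width` columns and
    continuation lines leave those columns blank.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("///"):  # record terminator
            break
        key = line[:key_width].strip()
        if key:
            current = key
        if current is None:
            continue
        value = line[key_width:].strip()
        if value:
            fields.setdefault(current, []).append(value)
    return fields


# Hand-written, abbreviated sample in the assumed layout (not real output).
sample = (
    "ENTRY       hsa00010                    Pathway\n"
    "NAME        Glycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "CLASS       Metabolism; Carbohydrate metabolism\n"
    "GENE        3101  HK3; hexokinase 3\n"
    "            3098  HK1; hexokinase 1\n"
    "///\n"
)

record = parse_flat_record(sample)
print(record["NAME"][0])
print(record["GENE"])
```

Keeping every field as a list of lines means multi-line sections such as GENE stay intact without special handling. A single-line field like NAME is simply a one-element list.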

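The pathway's identifier, name, annotation, and other fields can then be extracted by parsing these strings. Biopython also provides dedicated parsers for KEGG data, but they are incomplete and currently cover only sub-databases such as compound, map, and enzyme. Taking the enzyme database as an example: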


   
    >>> from Bio.KEGG import REST
    >>> from Bio.KEGG import Enzyme
    >>> request = REST.kegg_get("ec:5.4.2.2")
    >>> open("ec_5.4.2.2.txt", "w").write(request.read())
    >>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
    >>> record = list(records)[0]
    >>> record
    <Bio.KEGG.Enzyme.Record object at 0x02EE7D18>
    >>> record.classname
    ['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
    >>> record.entry
    '5.4.2.2'

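With Biopython we can not only call the KEGG API from Python but, more importantly, use Python's own logic to implement complex filtering. For example, to find the genes related to DNA repair in human, the basic approach is: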

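1. Use the list API to get the identifiers of all human pathways;

2. Filter these pathways by the description returned with each entry, keeping those that contain the keyword "repair";

3. For each selected pathway, fetch its record with the get API and parse the text to extract the genes it contains.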

The complete code is as follows:


   
    >>> from Bio.KEGG import REST
    >>> human_pathways = REST.kegg_list("pathway", "hsa").read()
    >>> repair_pathways = []
    >>> for line in human_pathways.rstrip().split("\n"):
    ...     entry, description = line.split("\t")
    ...     if "repair" in description:
    ...         repair_pathways.append(entry)
    ...
    >>> repair_pathways
    ['path:hsa03410', 'path:hsa03420', 'path:hsa03430']
    >>> repair_genes = []
    >>> for pathway in repair_pathways:
    ...     pathway_file = REST.kegg_get(pathway).read()
    ...     current_p = None
    ...     for line in pathway_file.rstrip().split("\n"):
    ...         p = line[:12].strip()
    ...         if p != "":
    ...             current_p = p
    ...         if current_p == "GENE":
    ...             gene_identifiers, gene_description = line[12:].split("; ")
    ...             gene_id, gene_symbol = gene_identifiers.split()
    ...             if gene_symbol not in repair_genes:
    ...                 repair_genes.append(gene_symbol)
    ...
    >>> repair_genes
    ['OGG1', 'NTHL1', 'NEIL1', 'NEIL2', 'NEIL3', 'UNG', 'SMUG1', 'MUTYH', 'MPG', 'MBD4', 'TDG', 'APEX1', 'APEX2', 'POLB', 'POLL', 'HMGB1', 'XRCC1', 'PCNA', 'POLD1', 'POLD2', 'POLD3', 'POLD4', 'POLE', 'POLE2', 'POLE3', 'POLE4', 'LIG1', 'LIG3', 'PARP1', 'PARP2', 'PARP3', 'PARP4', 'FEN1', 'RBX1', 'CUL4B', 'CUL4A', 'DDB1', 'DDB2', 'XPC', 'RAD23B', 'RAD23A', 'CETN2', 'ERCC8', 'ERCC6', 'CDK7', 'MNAT1', 'CCNH', 'ERCC3', 'ERCC2', 'GTF2H5', 'GTF2H1', 'GTF2H2', 'GTF2H2C_2', 'GTF2H2C', 'GTF2H3', 'GTF2H4', 'ERCC5', 'BIVM-ERCC5', 'XPA', 'RPA1', 'RPA2', 'RPA3', 'RPA4', 'ERCC4', 'ERCC1', 'RFC1', 'RFC4', 'RFC2', 'RFC5', 'RFC3', 'SSBP1', 'PMS2', 'MLH1', 'MSH6', 'MSH2', 'MSH3', 'MLH3', 'EXO1']
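
The inner loop above unpacks `line[12:].split("; ")` into exactly two values, so it raises an error on any GENE line that lacks the separator or contains it more than once. A more defensive variant might look like the sketch below, which works on text that has already been downloaded. The function name `extract_gene_symbols` is hypothetical and the sample record is hand-written, so treat it as one possible approach rather than a definitive implementation.

```python
def extract_gene_symbols(pathway_text, key_width=12):
    """Return gene symbols from the GENE section, in first-seen order.

    Assumption: keywords occupy the first `key_width` columns and each
    usable GENE line looks like "<id>  <symbol>; <description>".
    """
    symbols = {}  # dict keys keep insertion order and make lookups cheap
    current = None
    for line in pathway_text.rstrip().split("\n"):
        key = line[:key_width].strip()
        if key:
            current = key
        if current != "GENE":
            continue
        # Split at the first "; " only, so descriptions may contain it too.
        identifiers, separator, _description = line[key_width:].partition("; ")
        parts = identifiers.split()
        if not separator or len(parts) != 2:
            continue  # skip lines that do not match the expected shape
        symbols.setdefault(parts[1], None)
    return list(symbols)


# Hand-written sample in the assumed layout (not real API output).
sample = (
    "NAME        Base excision repair - Homo sapiens (human)\n"
    "GENE        4968  OGG1; 8-oxoguanine DNA glycosylase\n"
    "            4913  NTHL1; nth like DNA glycosylase 1\n"
    "            4968  OGG1; 8-oxoguanine DNA glycosylase\n"
    "            0000  no separator on this line\n"
    "COMPOUND    C00001  H2O\n"
)

print(extract_gene_symbols(sample))
```

Collecting symbols as dictionary keys avoids the repeated list scan of `gene_symbol not in repair_genes` while still preserving the order in which genes are first seen.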

Biopython makes the KEGG API more efficient to use: combining the API's data retrieval with Python's logic lets us meet custom analysis needs.


Reposted from: https://blog.csdn.net/weixin_43569478/article/details/112386873