Welcome to "生信修炼手册"!
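The KEGG database, often described as an encyclopedia of genes and genomes, is a comprehensive resource made up of several sub-databases such as gene and pathway. To make KEGG data easier to query, KEGG provides an official API.
In Biopython, the Bio.KEGG module wraps the official KEGG API so that it can be used from Python. KEGG API calls map to Python code as follows: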
/list/hsa:10458+ece:Z5100 -> REST.kegg_list(["hsa:10458", "ece:Z5100"])
/find/compound/300-310/mol_weight -> REST.kegg_find("compound", "300-310", "mol_weight")
/get/hsa:10458+ece:Z5100/aaseq -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")
With the REST module you can download any type of data that the API supports. Taking a pathway as an example:
>>> from Bio.KEGG import REST
>>> pathway = REST.kegg_get('hsa00010')
The object returned by a query can be converted to plain text with its read method, for example:
>>> pathway = REST.kegg_get('hsa00010')
>>> res = pathway.read().split("\n")
>>> res[0]
'ENTRY hsa00010 Pathway'
>>> res[1]
'NAME Glycolysis / Gluconeogenesis - Homo sapiens (human)'
>>> res[2]
'DESCRIPTION Glycolysis is the process of converting glucose into pyruvate and generating small amounts of ATP (energy) and NADH (reducing power). It is a central pathway that produces important precursor metabolites: six-carbon compounds of glucose-6P and fructose-6P and three-carbon compounds of glycerone-P, glyceraldehyde-3P, glycerate-3P, phosphoenolpyruvate, and pyruvate [MD:M00001]. Acetyl-CoA, another important precursor metabolite, is produced by oxidative decarboxylation of pyruvate [MD:M00307]. When the enzyme genes of this pathway are examined in completely sequenced genomes, the reaction steps of three-carbon compounds from glycerone-P to pyruvate form a conserved core module [MD:M00002], which is found in almost all organisms and which sometimes contains operon structures in bacterial genomes. Gluconeogenesis is a synthesis pathway of glucose from noncarbohydrate precursors. It is essentially a reversal of glycolysis with minor variations of alternative paths [MD:M00003].'
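The text returned by read is a KEGG flat file, so it can be split into fields with a few lines of Python. The sketch below assumes the column layout that the gene-extraction code later in this article also relies on: the section keyword sits in the first 12 columns and continuation lines leave them blank. `split_sections` and the abbreviated `SAMPLE` text are illustrative, not part of Biopython; in practice the text would come from `REST.kegg_get('hsa00010').read()`.

```python
# Abbreviated stand-in for a real pathway record (assumption: the layout
# mirrors KEGG flat files, with a 12-column keyword field).
SAMPLE = (
    "ENTRY       hsa00010                    Pathway\n"
    "NAME        Glycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "DESCRIPTION Glycolysis is the process of converting glucose into pyruvate.\n"
    "GENE        3101  HK3; hexokinase 3\n"
    "            3098  HK1; hexokinase 1\n"
    "///\n"
)


def split_sections(flat_text, key_width=12):
    """Group the lines of one flat-file record by section keyword."""
    sections = {}
    current = None
    for line in flat_text.splitlines():
        if line.startswith("///"):  # end-of-record marker
            break
        key = line[:key_width].strip()
        if key:  # a non-blank keyword field starts a new section
            current = key
        if current is None:
            continue
        sections.setdefault(current, []).append(line[key_width:].strip())
    return sections


fields = split_sections(SAMPLE)
print(fields["NAME"][0])
print(fields["GENE"])
```

Each section maps to a list of lines, so single-line fields such as NAME are read with index 0, while multi-line sections such as GENE keep one entry per line.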
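This makes it possible to extract a pathway's ID, name, annotation, and other fields by parsing the string. Biopython also provides dedicated parsers for KEGG data, but they are incomplete and currently cover only some sub-databases such as compound, map, and enzyme. Taking the enzyme database as an example: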
>>> # Enzyme must be imported too; the parser lives in Bio.KEGG.Enzyme
>>> from Bio.KEGG import REST, Enzyme
>>> request = REST.kegg_get("ec:5.4.2.2")
>>> # save the record to a local file, then parse that file
>>> open("ec_5.4.2.2.txt", "w").write(request.read())
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record
<Bio.KEGG.Enzyme.Record object at 0x02EE7D18>
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'
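With Biopython we can not only call the KEGG API from Python but, more importantly, use Python's control flow to implement complex filtering logic, for example finding human genes involved in DNA repair. The basic approach is:
1. Use the list API to fetch the IDs and descriptions of all human pathways;
2. Keep the pathways whose description contains the keyword repair;
3. For each selected pathway, fetch it with the get API and parse the text to collect its genes.
The complete code is as follows: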
>>> from Bio.KEGG import REST
>>> human_pathways = REST.kegg_list("pathway", "hsa").read()
>>> repair_pathways = []
>>> for line in human_pathways.rstrip().split("\n"):
...     entry, description = line.split("\t")
...     if "repair" in description:
...         repair_pathways.append(entry)
...
>>> repair_pathways
['path:hsa03410', 'path:hsa03420', 'path:hsa03430']
>>> repair_genes = []
>>> for pathway in repair_pathways:
...     pathway_file = REST.kegg_get(pathway).read()
...     # track which section of the flat file the current line belongs to
...     current_p = None
...     for line in pathway_file.rstrip().split("\n"):
...         p = line[:12].strip()  # section keywords sit in the first 12 columns
...         if p != "":
...             current_p = p
...         if current_p == "GENE":
...             gene_identifiers, gene_description = line[12:].split("; ")
...             gene_id, gene_symbol = gene_identifiers.split()
...             if gene_symbol not in repair_genes:
...                 repair_genes.append(gene_symbol)
...
>>> repair_genes
-
[
'OGG1',
'NTHL1',
'NEIL1',
'NEIL2',
'NEIL3',
'UNG',
'SMUG1',
'MUTYH',
'MPG',
'MBD4',
'TDG',
'APEX1',
'APEX2',
'POLB',
'POLL',
'HMGB1',
'XRCC1',
'PCNA',
'POLD1',
'POLD2',
'POLD3',
'POLD4',
'POLE',
'POLE2',
'POLE3',
'POLE4',
'LIG1',
'LIG3',
'PARP1',
'PARP2',
'PARP3',
'PARP4',
'FEN1',
'RBX1',
'CUL4B',
'CUL4A',
'DDB1',
'DDB2',
'XPC',
'RAD23B',
'RAD23A',
'CETN2',
'ERCC8',
'ERCC6',
'CDK7',
'MNAT1',
'CCNH',
'ERCC3',
'ERCC2',
'GTF2H5',
'GTF2H1',
'GTF2H2',
'GTF2H2C_2',
'GTF2H2C',
'GTF2H3',
'GTF2H4',
'ERCC5',
'BIVM-ERCC5',
'XPA',
'RPA1',
'RPA2',
'RPA3',
'RPA4',
'ERCC4',
'ERCC1',
'RFC1',
'RFC4',
'RFC2',
'RFC5',
'RFC3',
'SSBP1',
'PMS2',
'MLH1',
'MSH6',
'MSH2',
'MSH3',
'MLH3',
'EXO1']
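The same two steps can also be written as small functions that work on text alone, which makes them easy to try without network access. This is a sketch under assumptions, not the code above: `filter_pathways` and `gene_symbols` are hypothetical helpers, the sample strings are shortened stand-ins for real KEGG responses (the last GENE line is synthetic), and gene lines that lack the `id symbol; description` shape are skipped instead of raising `ValueError`.

```python
def filter_pathways(list_text, keyword):
    """Return pathway IDs whose description contains keyword (case-insensitive)."""
    hits = []
    for line in list_text.strip().splitlines():
        entry, _, description = line.partition("\t")
        if keyword.lower() in description.lower():
            hits.append(entry)
    return hits


def gene_symbols(pathway_text, key_width=12):
    """Collect unique gene symbols from the GENE section of one flat-file record."""
    symbols = []
    current = None
    for line in pathway_text.splitlines():
        key = line[:key_width].strip()
        if key:  # a non-blank keyword field starts a new section
            current = key
        if current != "GENE":
            continue
        identifiers, sep, _ = line[key_width:].partition("; ")
        parts = identifiers.split()
        if not sep or len(parts) != 2:
            continue  # not in "id symbol; description" form, so skip it
        if parts[1] not in symbols:
            symbols.append(parts[1])
    return symbols


# Shortened sample data (assumption: same layout as real KEGG responses).
LIST_TEXT = (
    "path:hsa03410\tBase excision repair - Homo sapiens (human)\n"
    "path:hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"
)
PATHWAY_TEXT = (
    "ENTRY       hsa03410                    Pathway\n"
    "GENE        4968  OGG1; 8-oxoguanine DNA glycosylase\n"
    "            4913  NTHL1; nth like DNA glycosylase 1\n"
    "            0000  synthetic line without separator\n"
    "COMPOUND    C00001  H2O\n"
    "///\n"
)

print(filter_pathways(LIST_TEXT, "repair"))  # ['path:hsa03410']
print(gene_symbols(PATHWAY_TEXT))            # ['OGG1', 'NTHL1']
```

With live data, the two inputs would be `REST.kegg_list("pathway", "hsa").read()` and `REST.kegg_get(pathway).read()`. Using `partition` instead of `split` is a design choice here: it never raises on a line with an unexpected shape, at the cost of silently dropping such lines.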
With Biopython, the KEGG API can be used more efficiently: combining the API's data access with Python's control flow lets us meet custom analysis needs.
Reposted from: https://blog.csdn.net/weixin_43569478/article/details/112386873