BibtextParser API使用小结

BibtexParser

官方文档:https://bibtexparser.readthedocs.io/en/master/

简介

bibtex格式的文本的解析工具。

文本为标准的bibtex格式

bibtex = """@ARTICLE{Cesar2013, author = {Jean César}, title = {An amazing title}, year = {2013}, volume = {12}, pages = {12--23}, journal = {Nice Journal}, abstract = {This is an abstract. This line should be long enough to test multilines...}, comments = {A comment}, keywords = {keyword1, keyword2} } """
with open('bibtex.bib', 'w') as bibfile:
    bibfile.write(bibtex)
 
 
# 开始解析
import bibtexparser
 
with open('bibtex.bib') as bibtex_file:
    bib_database = bibtexparser.load(bibtex_file)
 
print(bib_database.entries)
 
 
# 输出:
#[{'journal': 'Nice Journal',
# 'comments': 'A comment',
# 'pages': '12--23',
# 'abstract': 'This is an abstract. This line should be long enough to test\nmultilines...',
# 'title': 'An amazing title',
# 'year': '2013',
# 'volume': '12',
# 'ID': 'Cesar2013',
# 'author': 'Jean César',
# 'keyword': 'keyword1, keyword2',
# 'ENTRYTYPE': 'article'}]

文本不是标准的bibytex格式

需要自己做一个文本预处理,将其转为标准格式。否则,存在部分文章的信息解析不全。

自定义解析格式

新建一个py文件

# customization.py
# 自定义函数,固定写法,参数document是一个字典,代表一篇文章的信息
def author(document):
    if 'author' in document:
        if document['author']:
            document['author'] = document['author'].lower().replace('\n', ' ').replace('\\', '').split(' and ')
        else:
            document['author'] = None
    else:
        document['author'] = None
    return document

将自定义的格式,应用到解析的过程:

import bibtexparser
from bibtexparser.bparser import BibTexParser
from customization import *
 
""" @article{ ISI:000602258800001, Author = {Waterworth, Samantha C. and Isemonger, Eric W. and Rees, Evan R. and Dorrington, Rosemary A. and Kwan, Jason C.}, Title = {
   {Conserved bacterial genomes from two geographically isolated peritidal stromatolite formations shed light on potential functional guilds}}, Journal = {
   {ENVIRONMENTAL MICROBIOLOGY REPORTS}}, DOI = {
   {10.1111/1758-2229.12916}}, Early Access Date = {
   {DEC 2020}}, ISSN = {
   {1758-2229}}, ResearcherID-Numbers = {
   {Kwan, Jason/F-9589-2010}}, ORCID-Numbers = {
   {Kwan, Jason/0000-0001-9933-1536}}, Unique-ID = {
   {ISI:000602258800001}}, } """
def customizations(record):
    record = author(record)
    return record
 
 
def parse_bib_str(bib_str: str):
    """ 传入bibtex格式的字符串,解析为以字典为元素的list :param bib_str: :return: list (item is dic) """
    # 字符串预处理
    bib_str = bib_str.replace('{
   {', '{').replace('}}', '}').replace('Early Access Date', 'Early-Access-Date').replace(
        'Early Access Year', 'Early-Access-Year')
     
    # api固定写法
    parser = BibTexParser()
    parser.customization = customizations
    bib_datebase = bibtexparser.loads(bib_str, parser=parser)
    return bib_datebase.entries
 
if __name__ == '__main__':
    with open('1.bib',encoding='utf-8') as bib_file:
        bib_str = bib_file.read()
    entries = parse_bib_str(bib_str)
    print(len(entries))
    print(entries[0].get('author'))
    for k,v in entries[0].items():
        print('key:',k)
        print('value:',v)
        print('#'*50)
""" ['Waterworth, Samantha C.', 'Isemonger, Eric W.', 'Rees, Evan R.', 'Dorrington, Rosemary A.', 'Kwan, Jason C.'] key: unique-id value: ISI:000602258800001 ################################################## key: orcid-numbers value: Kwan, Jason/0000-0001-9933-1536 ################################################## key: researcherid-numbers value: Kwan, Jason/F-9589-2010 ################################################## key: issn value: 1758-2229 ################################################## key: early-access-date value: DEC 2020 ################################################## key: doi value: 10.1111/1758-2229.12916 ################################################## key: journal value: ENVIRONMENTAL MICROBIOLOGY REPORTS ################################################## key: title value: Conserved bacterial genomes from two geographically isolated peritidal stromatolite formations shed light on potential functional guilds ################################################## key: author value: ['Waterworth, Samantha C.', 'Isemonger, Eric W.', 'Rees, Evan R.', 'Dorrington, Rosemary A.', 'Kwan, Jason C.'] ################################################## key: ENTRYTYPE value: article ################################################## key: ID value: ISI:000602258800001 ################################################## Process finished with exit code 0 """
全部评论

相关推荐

猿辅导 Java后端日常实习 800一天
点赞 评论 收藏
转发
点赞 评论 收藏
转发
点赞 收藏 评论
分享
牛客网
牛客企业服务