Parsing non-standard HL7 XML w/ Python


2019-12-30 09:58:39

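Background: I'm more familiar with parsing XML in Java using DOM.

What I'm trying to do: I'm trying to parse an HL7/XML Structured Product Label (SPL) from the NLM Daily Med site. An example URL I'm trying to parse is: Atenolol SPL

What I've tried so far: I've tried DOM, ElementTree, lxml, and minidom. The closest I've been able to get is with this code: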

#!/usr/bin/python3

import xml.sax
from xml.dom.minidom import parse
import xml.dom.minidom

# Assumption: path to a locally saved copy of the SPL XML file
# (the question uses this name without defining it).
saved_file_path = "atenolol_spl.xml"


# ------Using SAX Parser---------------
class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.CurrentData = ""
        self.type = ""
        self.title = ""
        self.text = ""
        self.description = ""
        self.displayName = ""

    # Call when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "code":
            print("*****Section*****")
            code = attributes["code"]
            # displayName = attributes["displayName"]
            print("Code:", code)
            # print("Display Name:", displayName)

    # Call when an element ends
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "displayName":
            print("Display Name:", self.displayName)
        elif self.CurrentData == "title":
            # Note: str.title() capitalizes the tag name itself, so this prints "Title"
            print("Title:", self.CurrentData.title())
        elif self.CurrentData == "text":
            print("Text:", self.text)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

    # Call when a character is read
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content


if __name__ == "__main__":
    # create an XMLReader
    parser = xml.sax.make_parser()
    # turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # override the default ContentHandler
    Handler = MovieHandler()
    parser.setContentHandler(Handler)

    parser.parse(saved_file_path)

The console output is:

 *****Section***** Code: 34391-3 Title: Title
*****Section***** Code: 57664-264
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-264-88
*****Section***** Code: 57664-264-13
*****Section***** Code: 57664-264-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 57664-265
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-265-88
*****Section***** Code: 57664-265-13
*****Section***** Code: 57664-265-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 57664-266
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-266-88
*****Section***** Code: 57664-266-13
*****Section***** Code: 57664-266-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 34066-1 Title: Title Title: Title
*****Section***** Code: 34089-3 Title: Title
*****Section***** Code: 34090-1 Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34067-9 Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34070-3 Title: Title
*****Section***** Code: 34071-1 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 42232-9 Title: Title
*****Section***** Code: 34072-9 Title: Title
*****Section***** Code: 34073-7 Title: Title
*****Section***** Code: 34083-6 Title: Title
*****Section***** Code: 34091-9 Title: Title
*****Section***** Code: 42228-7 Title: Title
*****Section***** Code: 34080-2 Title: Title
*****Section***** Code: 34081-0 Title: Title
*****Section***** Code: 34082-8 Title: Title Title: Title Title: Title
*****Section***** Code: 34084-4 Title: Title Title: Title Title: Title Text: 
*****Section***** Code: 34088-5 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Text: 
*****Section***** Code: 34068-7 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34069-5 Title: Title

Process finished with exit code 0
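
A note on the output above: the repeated `Title: Title` lines most likely come from `self.CurrentData.title()`, which calls `str.title()` on the tag name rather than printing captured text, and `characters()` never stores anything for `title` or `text`. The following is a minimal sketch of one way to capture element text with SAX by buffering character chunks. The `TitleHandler` class and the `SAMPLE` snippet are hypothetical illustrations, not part of the question and not real DailyMed data.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element, buffering character chunks."""

    def __init__(self):
        super().__init__()
        self.titles = []     # finished title strings, in document order
        self._buffer = None  # list of text chunks while inside <title>, else None

    def startElement(self, tag, attributes):
        if tag == "title":
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times for one element,
        # so chunks are appended instead of overwriting a single value.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "title" and self._buffer is not None:
            # Join the chunks and collapse runs of whitespace.
            self.titles.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


# Hypothetical, heavily simplified SPL-like snippet (not real DailyMed data).
SAMPLE = b"""<document xmlns="urn:hl7-org:v3">
  <section>
    <code code="34089-3" displayName="DESCRIPTION SECTION"/>
    <title>DESCRIPTION</title>
    <text><paragraph>Example paragraph.</paragraph></text>
  </section>
</document>"""

handler = TitleHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.titles)
```

The same buffering pattern could be extended to `paragraph` elements; it is kept to `title` here to stay short.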

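The issue / what's not working: I don't really need the sections that come before the one containing "Code: XXXXX-X". For each section, I'd like to get the values of the <title> and <paragraph> tags for that section and all of its subsections.

Although I've been able to follow tutorials for DOM, ElementTree, lxml, and minidom, the target XML is non-standard: a single tag carries multiple attributes (for example, the code element has both a code and a displayName attribute), and some nodes/elements use a shortcut (self-closing) end tag while others use a full traditional end tag. No wonder healthcare is so complicated!

So, how can I get the contents of the tags and iterate through the subsections to achieve the same thing?

Source: StackOverflow, /questions/59379440/parsing-non-standard-hl7-xml-w-python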
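
For what it's worth, multiple attributes on one tag and self-closing tags are both ordinary well-formed XML, so the standard parsers should accept them. What more often breaks tutorial code on documents like this is a default namespace. The sketch below assumes the SPL file declares `urn:hl7-org:v3` as its default namespace; the `SAMPLE` snippet and its `displayName` value are hypothetical, not copied from DailyMed.

```python
import xml.etree.ElementTree as ET

# Assumption: SPL documents declare the HL7 v3 default namespace.
NS = {"hl7": "urn:hl7-org:v3"}

# Hypothetical snippet: a self-closing <code/> tag with several attributes,
# next to a <title> with a traditional closing tag.
SAMPLE = """<section xmlns="urn:hl7-org:v3">
  <code code="34090-1" codeSystem="2.16.840.1.113883.6.1"
        displayName="CLINICAL PHARMACOLOGY SECTION"/>
  <title>CLINICAL PHARMACOLOGY</title>
</section>"""

section = ET.fromstring(SAMPLE)

# Without the namespace mapping, find("code") would return None here,
# because the element's real tag is "{urn:hl7-org:v3}code".
code = section.find("hl7:code", NS)
title = section.find("hl7:title", NS)

print(code.get("code"), code.get("displayName"))
print(title.text)
```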
All answers (1)
  • kun坤
    2019-12-30 10:00:35

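    I hope I've understood your question correctly. This code loads the XML with the requests module, then extracts each <title> and the <text> that follows it: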
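    import requests
    from bs4 import BeautifulSoup

    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/f36d4ed3-dcbb-4465-9fa6-1da811f555e6.xml'

    # Parse with the built-in HTML parser (note: it lowercases tag and attribute names)
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    # Sections that have both a direct <code code="..."> child and a direct <title> child
    for section in soup.select('section:has(> code[code]):has(> title)'):
        print('Code   = ', section.select_one('code')['code'])

        # All titles inside this section, including those of nested subsections
        for title in section.select('title'):
            print()
            print('Title  = ', title.text)
            print('*' * 80)
            # The <text> element that follows this title, if any
            txt = title.find_next_sibling('text')

            if not txt:
                continue

            for paragraph in txt.select('paragraph'):
                # Turn <br> tags into newlines before extracting the text
                for tag in paragraph.select('br'):
                    tag.replace_with("\n")
                print()
                # Strip each line and drop empty ones
                lines = '\n'.join(line.strip() for line in paragraph.get_text().splitlines() if line.strip())
                print(lines)

        print('-' * 120 + '\n')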
    

    Prints:

    Code   =  34066-1
    
    Title  =  BOXED WARNING
    ********************************************************************************
    
    Title  =  Cessation of Therapy with Atenolol
    ********************************************************************************
    
    Patients with coronary artery disease, who are being treated with atenolol, should be advised against abrupt discontinuation of therapy. Severe exacerbation of angina and the occurrence of myocardial infarction and ventricular arrhythmias have been reported in angina patients following the abrupt discontinuation of therapy with beta-blockers. The last two complications may occur with or without preceding exacerbation of the angina pectoris. As with other beta-blockers, when discontinuation of atenolol tablet, USP, is planned, the patients should be carefully observed and advised to limit physical activity to a minimum. If the angina worsens or acute coronary insufficiency develops, it is recommended that atenolol tablet, USP be promptly reinstituted, at least temporarily. Because coronary artery disease is common and may be unrecognized, it may be prudent not to discontinue atenolol tablet, USP, therapy abruptly even in patients treated only for hypertension. (See DOSAGE AND ADMINISTRATION.)
    ------------------------------------------------------------------------------------------------------------------------
    
    Code   =  34089-3
    
    Title  =  DESCRIPTION
    ********************************************************************************
    
    Atenolol, USP, a synthetic, beta1-selective (cardioselective) adrenoreceptor blocking agent, may be chemically described as benzeneacetamide, 4 -[2'-hydroxy- 3'-[(1- methylethyl) amino] propoxy]-. The molecular and structural formulas are:
    
    Atenolol (free base) has a molecular weight of 266.34. It is a relatively polar hydrophilic compound with a water solubility of 26.5 mg/mL at 37°C and a log partition coefficient (octanol/water) of 0.23. It is freely soluble in 1N HCl (300 mg/mL at 25°C) and less soluble in chloroform (3 mg/mL at 25°C).
    
    Atenolol is available as 25, 50 and 100 mg tablets for oral administration.
    
    Each tablet contains the labeled amount of atenolol, USP and the following inactive ingredients: povidone, microcrystalline cellulose, corn starch, sodium lauryl sulfate, croscarmellose sodium, colloidal silicon dioxide, sodium stearyl fumarate and magnesium stearate.
    ------------------------------------------------------------------------------------------------------------------------
    
    ...and so on.
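
The answer above prints a section's own title together with the titles of its subsections in one flat list. If the nesting matters, one alternative is to walk the tree recursively. The sketch below uses ElementTree and assumes that subsections are nested as `component/section` under their parent section and that the document uses the `urn:hl7-org:v3` default namespace. The `walk_sections` helper and the `SAMPLE` snippet are hypothetical; only direct `text/paragraph` children are collected, and `<br/>` handling is left out.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}  # assumption: HL7 v3 default namespace


def clean(element):
    """Return the element's full text with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def walk_sections(parent, depth=0):
    """Yield (depth, code, title, paragraphs) for each section below parent."""
    for section in parent.findall("hl7:component/hl7:section", NS):
        code_el = section.find("hl7:code", NS)
        title_el = section.find("hl7:title", NS)
        code = code_el.get("code") if code_el is not None else None
        title = clean(title_el) if title_el is not None else None
        paragraphs = [
            clean(p) for p in section.findall("hl7:text/hl7:paragraph", NS)
        ]
        yield depth, code, title, paragraphs
        # Recurse into subsections nested under this section.
        yield from walk_sections(section, depth + 1)


# Hypothetical, simplified structure (not real DailyMed data).
SAMPLE = """<structuredBody xmlns="urn:hl7-org:v3">
  <component>
    <section>
      <code code="34066-1"/>
      <title>BOXED WARNING</title>
      <component>
        <section>
          <title>Cessation of Therapy with Atenolol</title>
          <text><paragraph>Example paragraph text.</paragraph></text>
        </section>
      </component>
    </section>
  </component>
</structuredBody>"""

results = list(walk_sections(ET.fromstring(SAMPLE)))
for depth, code, title, paragraphs in results:
    print("  " * depth + f"[{code}] {title}")
    for paragraph in paragraphs:
        print("  " * (depth + 1) + paragraph)
```

For a real file, the starting element would be whichever node holds the top-level `component` elements; finding it is left to the reader because the exact path depends on the document.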
    