Skip to content Skip to sidebar Skip to footer

Extract Text From Xml Documents In Python

This is the sample xml document : Everyday ItalianGia

Solution 1:

Using the lxml library with an xpath query is possible:

xml="""<bookstore><bookcategory="COOKING"><titlelang="english">Everyday Italian</title><author>Giada De Laurentiis</author><year>2005</year><price>300.00</price></book><bookcategory="CHILDREN"><titlelang="english">Harry Potter</title><author>J K. Rowling </author><year>2005</year><price>625.00</price></book></bookstore>
"""
from lxml import etree
root = etree.fromstring(xml).getroot()
root.xpath('/bookstore/book/*/text()')
# ['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']

Although you don't get the category....

Solution 2:

If you want to call grep from inside python, see the discussion here, especially this post.

If you want to search through all the files in a directory you could try something like this using the glob module:

import glob    
import os    
import re    

p = re.compile('>.*<')    
os.chdir("./")    
for files in glob.glob("*.xml"):    
    file = open(files, "r")    
    line = file.read()    
    list =  map(lambda x:x.lstrip('>').rstrip('<'), p.findall(line))    
    printlistprint

This searches iterates through all the files in the directory, opens each file and exteacts text matching the regexp.

Output:

['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J
 K. Rowling ', '2005', '625.00']

EDIT: Updated code to extract only the text elements from the xml.

Solution 3:

You could simply strip out any tags:

>>>import re>>>txt = """<bookstore>...    <book category="COOKING">...        <title lang="english">Everyday Italian</title>...        <author>Giada De Laurentiis</author>...        <year>2005</year>...        <price>300.00</price>...    </book>......    <book category="CHILDREN">...        <title lang="english">Harry Potter</title>...        <author>J K. Rowling </author>...        <year>2005</year>...        <price>625.00</price>...    </book>...</bookstore>""">>>exp = re.compile(r'<.*?>')>>>text_only = exp.sub('',txt).strip()>>>text_only
'Everyday Italian\n        Giada De Laurentiis\n        2005\n        300.00\n
  \n\n    \n        Harry Potter\n        J K. Rowling \n        2005\n        6
25.00'

But if you just want to search files for some text in Linux, you can use grep:

burhan@sandbox:~$ grep"Harry Potter" file.xml
        <title lang="english">Harry Potter</title>

If you want to search in a file, use the grep command above, or open the file and search for it in Python:

>>>import re>>>exp = re.compile(r'<.*?>')>>>withopen('file.xml') as f:...    lines = ''.join(line for line in f.readlines())...    text_only = exp.sub('',lines).strip()...>>>if'Harry Potter'in text_only:...print'It exists'...else:...print'It does not'...
It exists

Post a Comment for "Extract Text From Xml Documents In Python"