How To Efficiently Detect An Xml Schema Without Having The Entire File In Python
Solution 1:
Several people have misinterpreted this question, and re-reading it, it's really not at all clear. In fact there are several questions.
How to detect an XML schema
Some people have interpreted this as saying you think there might be a schema within the file, or referenced from the file. I interpreted it as meaning that you wanted to infer a schema from the content of the instance.
What would be the fastest way to parse the structure of the main item node without previously knowing its structure?
Just put it through a parser, e.g. a SAX parser. A parser doesn't need to know the structure of an XML file in order to split it up into elements and attributes. But I don't think you actually want the fastest parse possible (in fact, I don't think performance is that high on your requirements list at all). I think you want to do something useful with the information (you haven't told us what): that is, you want to process the information, rather than just parsing the XML.
Is there a python utility that can do so 'on-the-fly' without having the complete xml loaded in memory?
Yes, according to this page which mentions 3 event-based XML parsers in the Python world: https://wiki.python.org/moin/PythonXml (I can't vouch for any of them)
what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
I'm not sure you know what the verb "to parse" actually means. Your phrase certainly suggests that you expect the file to contain a schema, which you want to extract. But I'm not at all sure you really mean that. And in any case, if it did contain a schema in the first 5Mb, you could find it just be reading the file sequentially, there would be no need to "save" the first part of the file first.
Solution 2:
Question: way to parse the structure of the main item node without previously knowing its structure
This class TopSequenceElement
parse a XML
File to find all Sequence Elements.
The default is, to break
at the FIRST closing </...>
of the topmost Element.
Therefore, it is independend of the file size or even by truncated files.
from lxml import etree
from collections import OrderedDict
classTopSequenceElement(etree.iterparse):
"""
Read XML File
results: .seq == OrderedDict of Sequence Element
.element == topmost closed </..> Element
.xpath == XPath to top_element
"""classElement:
"""
Classify a Element
"""
SEQUENCE = (1, 'SEQUENCE')
VALUE = (2, 'VALUE')
def__init__(self, elem, event):
iflen(elem):
self._type = self.SEQUENCE
else:
self._type = self.VALUE
self._state = [event]
self.count = 0
self.parent = None
self.element = None @propertydefstate(self):
return self._state
@state.setterdefstate(self, event):
self._state.append(event)
@propertydefis_seq(self):
return self._type == self.SEQUENCE
def__str__(self):
return"Type:{}, Count:{}, Parent:{:10} Events:{}"\
.format(self._type[1], self.count, str(self.parent), self.state)
def__init__(self, fh, break_early=True):
"""
Initialize 'iterparse' only to callback at 'start'|'end' Events
:param fh: File Handle of the XML File
:param break_early: If True, break at FIRST closing </..> of the topmost Element
If False, run until EOF
"""super().__init__(fh, events=('start', 'end'))
self.seq = OrderedDict()
self.xpath = []
self.element = None
self.parse(break_early)
defparse(self, break_early):
"""
Parse the XML Tree, doing
classify the Element, process only SEQUENCE Elements
record, count of end </...> Events,
parent from this Element
element Tree of this Element
:param break_early: If True, break at FIRST closing </..> of the topmost Element
:return: None
"""
parent = []
try:
for event, elem in self:
tag = elem.tag
_elem = self.Element(elem, event)
if _elem.is_seq:
if event == 'start':
parent.append(tag)
if tag in self.seq:
self.seq[tag].state = event
else:
self.seq[tag] = _elem
elif event == 'end':
parent.pop()
if parent:
self.seq[tag].parent = parent[-1]
self.seq[tag].count += 1
self.seq[tag].state = event
if self.seq[tag].count == 1:
self.seq[tag].element = elem
if break_early andlen(parent) == 1:
breakexcept etree.XMLSyntaxError:
passfinally:
"""
Find the topmost completed '<tag>...</tag>' Element
Build .seq.xpath
"""for key inlist(self.seq):
self.xpath.append(key)
if self.seq[key].count > 0:
self.element = self.seq[key].element
break
self.xpath = '/'.join(self.xpath)
def__str__(self):
"""
String Representation of the Result
:return: .xpath and list of .seq
"""return"Top Sequence Element:{}\n{}"\
.format( self.xpath,
'\n'.join(["{:10}:{}"
.format(key, elem) for key, elem in self.seq.items()
])
)
if __name__ == "__main__":
withopen('../test/uyalicihow.xml', 'rb') as xml_file:
tse = TopSequenceElement(xml_file)
print(tse)
Output:
Top Sequence Element:Items/Item Items :Type:SEQUENCE, Count:0, Parent:None Events:['start'] Item :Type:SEQUENCE, Count:1, Parent:Items Events:['start', 'end', 'start'] Main :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end'] Info :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end'] Genres :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end'] Products :Type:SEQUENCE, Count:1, Parent:Item Events:['start', 'end'] ... (omitted for brevity)
Step 2: Now, you know there is a
<Main>
Tag, you can do:
print(etree.tostring(tse.element.find('Main'), pretty_print=True).decode()) <Main><Platform>iTunes</Platform><PlatformID>353736518</PlatformID><Type>TVEpisode</Type><TVSeriesID>262603760</TVSeriesID></Main>
Step 3: Now, you know there is a
<Platform>
Tag, you can do:
print(etree.tostring(tse.element.find('Main/Platform'), pretty_print=True).decode()) <Platform>iTunes</Platform>
Tested with Python:3.5.3 - lxml.etree:3.7.1
Solution 3:
For very big files, reading is always a problem. I would suggest a simple algorithmic behavior for the reading of the file itself. The key point is always the xml tags
inside the files. I would suggest you read the xml
tags and sort them inside a heap
and then validate the content of the heap
accordingly.
Reading the file should also happen in chunks:
import xml.etree.ElementTree as etree
forevent, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
store_in_heap(event, element)
This will parse the XML file in chunks at a time and give it to you at every step of the way. start
will trigger when a tag is first encountered. At this point elem will be empty except for elem.attrib
that contains the properties of the tag. end
will trigger when the closing tag is encountered, and everything in-between has been read.
you can also benefit from the namespaces
that are in start-ns
and end-ns
. ElementTree
has provided this call to gather all the namespaces in the file.
Refer to this link for more information about namespaces
Solution 4:
My interpretation of your needs is that you want to be able to parse the partial file and build up the structure of the document as you go. I've taken some assumptions from the file you uploaded:
Fundamentally you want to be parsing collections of things which have similar properties - I'm inferring this from the way you presented your desired output as a table with rows containing the values.
You expect these collections of things to have the same number of values.
You need to be able to parse partial files.
You don't worry about the properties of elements, just their contents.
I'm using xml.sax
as this deals with arbitrarily large files and doesn't need to read the whole file into memory. Note that the strategy I'm following now doesn't actually scale that well as I'm storing all the elements in memory to build the dataframe, but you could just as well output the paths and contents.
In the sample file there is a problem with having one row per Item
since there are multiples of the Genre
tag and there are also multiple Product
tags. I've handled the repeated Genre
tags by appending them. This relies on the Genre tags appearing consecutively. It is not at all clear how the Product
relationships can be handled in a single table.
import xml.sax
from collections import defaultdict
classStructureParser(xml.sax.handler.ContentHandler):
def__init__(self):
self.text = ''
self.path = []
self.datalist = defaultdict(list)
self.previouspath = ''defstartElement(self, name, attrs):
self.path.append(name)
defendElement(self, name):
strippedtext = self.text.strip()
path = '/'.join(self.path)
if strippedtext != '':
if path == self.previouspath:
# This handles the "Genre" tags in the sample file
self.datalist[path][-1] += f',{strippedtext}'else:
self.datalist[path].append(strippedtext)
self.path.pop()
self.text = ''
self.previouspath = path
defcharacters(self, content):
self.text += content
You'd use this like this:
parser = StructureParser()
try:
xml.sax.parse('uyalicihow.xml', parser)
except xml.sax.SAXParseException:
print('File probably ended too soon')
This will read the example file just fine.
Once this has read and probably printed "File probably ended to soon", you have the parsed contents in parser.datalist
.
You obviously want to have just the parts which read successfully, so you can figure out the shortest list and build a DataFrame with just those paths:
import pandas as pd
smallest_items = min(len(e) for e in parser.datalist.values())
df = pd.DataFrame({key: value for key, value in parser.datalist.items() iflen(value) == smallest_items})
This gives something similar to your desired output:
Items/Item/Main/Platform Items/Item/Main/PlatformID Items/Item/Main/Type
0 iTunes 353736518 TVEpisode
1 iTunes 495275084 TVEpisode
The columns for the test file which are matched here are
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL'],
dtype='object')
Based on your comments, it appears as though it is more important to you to have all the elements represented, but perhaps just showing a preview, in which case you can perhaps use only the first elements from the data. Note that in this case the Product
s entries won't match the Item
entries.
df = pd.DataFrame({key: value[:smallest_items] for key, value in parser.datalist.items()})
Now we get all the paths:
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL',
'Items/Item/Products/Product/Offers/Offer/Price',
'Items/Item/Products/Product/Offers/Offer/Currency'],
dtype='object')
Solution 5:
There are a number of tools around that will generate a schema from a supplied instance document. I don't know how many of them will work on a 5Gb input file, and I don't know how many of them can be invoked from Python.
Many years ago I wrote a Java-based, fully streamable tool to generate a DTD from an instance document. It hasn't been touched in years but it should still run: https://sourceforge.net/projects/saxon/files/DTDGenerator/7.0/dtdgen7-0.zip/download?use_mirror=vorboss
There are other tools listed here: Any tools to generate an XSD schema from an XML instance document?
Post a Comment for "How To Efficiently Detect An Xml Schema Without Having The Entire File In Python"