How To Efficiently Detect An Xml Schema Without Having The Entire File In Python

March 31, 2024

I have a very large feed file that is sent as an XML document (5 GB). What would be the fastest way to parse the structure of the main item node without previously knowing its structure?

Solution 1:

Several people have misinterpreted this question and, on re-reading it, it really is not at all clear. In fact, it contains several questions.

How to detect an XML schema

Some people have interpreted this as saying you think there might be a schema within the file, or referenced from the file. I interpreted it as meaning that you wanted to infer a schema from the content of the instance.

What would be the fastest way to parse the structure of the main item node without previously knowing its structure?

Just put it through a parser, e.g. a SAX parser. A parser doesn't need to know the structure of an XML file in order to split it up into elements and attributes. But I don't think you actually want the fastest parse possible (in fact, I don't think performance is that high on your requirements list at all). I think you want to do something useful with the information (you haven't told us what): that is, you want to process the information, rather than just parsing the XML.
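As a minimal sketch of that point, assuming only the standard library's xml.sax: the handler below records every distinct element path it encounters without being told anything about the structure in advance. PathCollector and the sample document are made up for illustration; a real feed would be passed to xml.sax.parse() as a file name or file object instead.

```python
import xml.sax
from collections import Counter


class PathCollector(xml.sax.ContentHandler):
    """Count how often each element path occurs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []          # names of the currently open elements
        self.paths = Counter()   # path -> number of occurrences

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths['/'.join(self.stack)] += 1

    def endElement(self, name):
        self.stack.pop()


# Made-up sample standing in for the real feed.
sample = (b"<Items><Item><Main><Type>TVEpisode</Type></Main></Item>"
          b"<Item><Main/></Item></Items>")

collector = PathCollector()
xml.sax.parseString(sample, collector)
for path, count in collector.paths.items():
    print(count, path)
```

Only the stack of open element names and one counter per distinct path are kept, so memory use depends on the variety of the structure rather than on the size of the file.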

Is there a python utility that can do so 'on-the-fly' without having the complete xml loaded in memory?

Yes, according to this page which mentions 3 event-based XML parsers in the Python world: https://wiki.python.org/moin/PythonXml (I can't vouch for any of them)

what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?

I'm not sure you know what the verb "to parse" actually means. Your phrase certainly suggests that you expect the file to contain a schema, which you want to extract. But I'm not at all sure you really mean that. And in any case, if it did contain a schema in the first 5 MB, you could find it just by reading the file sequentially; there would be no need to "save" the first part of the file first.
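To illustrate reading sequentially and stopping early, here is a sketch that assumes the standard library's xml.etree.ElementTree.XMLPullParser. Data is fed in small chunks and reading stops as soon as the first complete element of interest has been seen, so a cut-off file is not a problem as long as that element closes before the data ends. The function name first_complete, the chunk size, and the sample bytes are all made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample standing in for the start of a huge feed; note that it is
# cut off in the middle of the second <Item>.
truncated = io.BytesIO(
    b"<Items><Item><Main><Platform>iTunes</Platform></Main></Item>"
    b"<Item><Main><Platf"
)


def first_complete(stream, tag, chunk_size=16):
    """Return the first fully closed <tag> element, or None if none was seen."""
    parser = ET.XMLPullParser(events=("end",))
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None   # ran out of data before <tag> was closed
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                return elem


item = first_complete(truncated, "Item")
print(ET.tostring(item).decode())
```

Because close() is never called on the parser, the missing end tags are never reported as an error; the function simply returns what it found.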

Solution 2:

Question: way to parse the structure of the main item node without previously knowing its structure

The class TopSequenceElement below parses an XML file to find all sequence elements (elements that contain child elements). By default it stops at the FIRST closing </...> of the topmost element, so it is independent of the file size and also works with truncated files.

from lxml import etree
from collections import OrderedDict

class TopSequenceElement(etree.iterparse):
    """
    Read XML File
    results: .seq == OrderedDict of Sequence Element
             .element == topmost closed </..> Element
             .xpath == XPath to top_element
    """

    class Element:
        """
        Classify a Element
        """
        SEQUENCE = (1, 'SEQUENCE')
        VALUE = (2, 'VALUE')

        def __init__(self, elem, event):
            if len(elem):
                self._type = self.SEQUENCE
            else:
                self._type = self.VALUE

            self._state = [event]
            self.count = 0
            self.parent = None
            self.element = None

        @property
        def state(self):
            return self._state

        @state.setter
        def state(self, event):
            self._state.append(event)

        @property
        def is_seq(self):
            return self._type == self.SEQUENCE

        def __str__(self):
            return "Type:{}, Count:{}, Parent:{:10} Events:{}"\
                .format(self._type[1], self.count, str(self.parent), self.state)

    def __init__(self, fh, break_early=True):
        """
        Initialize 'iterparse' only to callback at 'start'|'end' Events

        :param fh: File Handle of the XML File
        :param break_early: If True, break at FIRST closing </..> of the topmost Element
                            If False, run until EOF
        """
        super().__init__(fh, events=('start', 'end'))
        self.seq = OrderedDict()
        self.xpath = []
        self.element = None

        self.parse(break_early)

    def parse(self, break_early):
        """
        Parse the XML Tree, doing
          classify the Element, process only SEQUENCE Elements
          record, count of end </...> Events, 
                  parent from this Element
                  element Tree of this Element

        :param break_early: If True, break at FIRST closing </..> of the topmost Element
        :return: None
        """
        parent = []

        try:
            for event, elem in self:
                tag = elem.tag
                _elem = self.Element(elem, event)

                if _elem.is_seq:
                    if event == 'start':
                        parent.append(tag)

                        if tag in self.seq:
                            self.seq[tag].state = event
                        else:
                            self.seq[tag] = _elem

                    elif event == 'end':
                        parent.pop()
                        if parent:
                            self.seq[tag].parent = parent[-1]

                        self.seq[tag].count += 1
                        self.seq[tag].state = event

                        if self.seq[tag].count == 1:
                            self.seq[tag].element = elem

                        if break_early and len(parent) == 1:
                            break
        except etree.XMLSyntaxError:
            # A truncated file ends in a syntax error; keep what was parsed so far
            pass
        finally:
            # Find the topmost completed '<tag>...</tag>' Element
            # Build .seq.xpath
            for key in list(self.seq):
                self.xpath.append(key)
                if self.seq[key].count > 0:
                    self.element = self.seq[key].element
                    break

            self.xpath = '/'.join(self.xpath)

    def __str__(self):
        """
        String Representation of the Result
        :return: .xpath and list of .seq
        """
        return "Top Sequence Element:{}\n{}"\
            .format( self.xpath,
                     '\n'.join(["{:10}:{}"
                               .format(key, elem) for key, elem in self.seq.items()
                                ])
                     )

if __name__ == "__main__":
    with open('../test/uyalicihow.xml', 'rb') as xml_file:
        tse = TopSequenceElement(xml_file)
        print(tse)

Output:

Top Sequence Element:Items/Item
Items     :Type:SEQUENCE, Count:0, Parent:None       Events:['start']
Item      :Type:SEQUENCE, Count:1, Parent:Items      Events:['start', 'end', 'start']
Main      :Type:SEQUENCE, Count:2, Parent:Item       Events:['start', 'end', 'start', 'end']
Info      :Type:SEQUENCE, Count:2, Parent:Item       Events:['start', 'end', 'start', 'end']
Genres    :Type:SEQUENCE, Count:2, Parent:Item       Events:['start', 'end', 'start', 'end']
Products  :Type:SEQUENCE, Count:1, Parent:Item       Events:['start', 'end']
... (omitted for brevity)

Step 2: Now that you know there is a <Main> tag, you can do:

print(etree.tostring(tse.element.find('Main'), pretty_print=True).decode())

<Main><Platform>iTunes</Platform><PlatformID>353736518</PlatformID><Type>TVEpisode</Type><TVSeriesID>262603760</TVSeriesID></Main>

Step 3: Now that you know there is a <Platform> tag, you can do:

print(etree.tostring(tse.element.find('Main/Platform'), pretty_print=True).decode())

<Platform>iTunes</Platform>

Tested with Python:3.5.3 - lxml.etree:3.7.1

Solution 3:

For very large files, reading is always the problem. I would suggest a simple algorithm for reading the file itself. What matters is the XML tags inside the file: read the tags, store them in a heap, and then validate the content of the heap accordingly.

Reading the file should also happen in chunks:

import xml.etree.ElementTree as etree

# xml_file is a path or an open binary file; store_in_heap is your own function
for event, elem in etree.iterparse(xml_file, events=('start', 'end', 'start-ns', 'end-ns')):
    store_in_heap(event, elem)

This parses the XML file incrementally and hands you each event as it occurs. start is triggered when a tag is first encountered; at this point elem will be empty except for elem.attrib, which contains the attributes of the tag. end is triggered when the closing tag is encountered and everything in between has been read.

You can also benefit from the namespaces reported by start-ns and end-ns. ElementTree provides these events to gather all the namespaces in the file. Refer to this link for more information about namespaces.
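One detail worth knowing when handling these events: for start-ns the second item yielded by iterparse is not an element but a (prefix, uri) tuple. A small sketch with a made-up document (the example.com namespace URIs are placeholders):

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample with a default namespace and one prefixed namespace.
sample = io.BytesIO(
    b'<feed xmlns="http://example.com/feed" xmlns:p="http://example.com/price">'
    b'<p:price>1.99</p:price></feed>'
)

namespaces = {}
tags = []
for event, payload in ET.iterparse(sample, events=("start", "start-ns")):
    if event == "start-ns":
        prefix, uri = payload      # for start-ns the payload is a (prefix, uri) tuple
        namespaces[prefix] = uri
    else:
        tags.append(payload.tag)   # for start the payload is an Element

print(namespaces)
print(tags)
```

The default namespace is reported with an empty string as its prefix, and element tags come back in the expanded {uri}name form.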

Solution 4:

My interpretation of your needs is that you want to be able to parse the partial file and build up the structure of the document as you go. I've made some assumptions based on the file you uploaded:

Fundamentally you want to be parsing collections of things which have similar properties - I'm inferring this from the way you presented your desired output as a table with rows containing the values.
You expect these collections of things to have the same number of values.
You need to be able to parse partial files.
You aren't concerned with the attributes of elements, just their contents.

I'm using xml.sax as this deals with arbitrarily large files and doesn't need to read the whole file into memory. Note that the strategy I'm following now doesn't actually scale that well as I'm storing all the elements in memory to build the dataframe, but you could just as well output the paths and contents.

In the sample file there is a problem with having one row per Item since there are multiples of the Genre tag and there are also multiple Product tags. I've handled the repeated Genre tags by appending them. This relies on the Genre tags appearing consecutively. It is not at all clear how the Product relationships can be handled in a single table.

import xml.sax

from collections import defaultdict

class StructureParser(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.text = ''
        self.path = []
        self.datalist = defaultdict(list)
        self.previouspath = ''

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        strippedtext = self.text.strip()
        path = '/'.join(self.path)
        if strippedtext != '':
            if path == self.previouspath:
                # This handles the "Genre" tags in the sample file
                self.datalist[path][-1] += f',{strippedtext}'
            else:
                self.datalist[path].append(strippedtext)
        self.path.pop()
        self.text = ''
        self.previouspath = path

    def characters(self, content):
        self.text += content

You'd use this like this:

parser = StructureParser()

try:
    xml.sax.parse('uyalicihow.xml', parser)
except xml.sax.SAXParseException:
    print('File probably ended too soon')

This will read the example file just fine.

Once this has run, and probably printed "File probably ended too soon", you have the parsed contents in parser.datalist.

You obviously want to have just the parts which read successfully, so you can figure out the shortest list and build a DataFrame with just those paths:

import pandas as pd

smallest_items = min(len(e) for e in parser.datalist.values())
df = pd.DataFrame({key: value for key, value in parser.datalist.items() if len(value) == smallest_items})

This gives something similar to your desired output:

  Items/Item/Main/Platform Items/Item/Main/PlatformID Items/Item/Main/Type 
0                   iTunes                  353736518            TVEpisode   
1                   iTunes                  495275084            TVEpisode

The columns for the test file which are matched here are

>>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
       'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
       'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
       'Items/Item/Info/HighestResolution',
       'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
       'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
       'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
       'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
       'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
       'Items/Item/Products/Product/URL'],
      dtype='object')

Based on your comments, it appears as though it is more important to you to have all the elements represented, but perhaps just showing a preview, in which case you can perhaps use only the first elements from the data. Note that in this case the Products entries won't match the Item entries.

df = pd.DataFrame({key: value[:smallest_items] for key, value in parser.datalist.items()})

Now we get all the paths:

>>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
       'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
       'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
       'Items/Item/Info/HighestResolution',
       'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
       'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
       'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
       'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
       'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
       'Items/Item/Products/Product/URL',
       'Items/Item/Products/Product/Offers/Offer/Price',
       'Items/Item/Products/Product/Offers/Offer/Currency'],
      dtype='object')
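On the scaling caveat mentioned earlier in this answer: a sketch of the alternative of writing each path and its text out as it is read, rather than collecting everything for a DataFrame. It assumes the standard library's xml.sax and csv modules; PathWriter and the sample bytes are made up for illustration, and the StringIO object stands in for a real output file.

```python
import csv
import io
import xml.sax


class PathWriter(xml.sax.ContentHandler):
    """Write one (path, text) row per non-empty leaf element (hypothetical helper)."""

    def __init__(self, out):
        super().__init__()
        self.writer = csv.writer(out)
        self.path = []   # names of the currently open elements
        self.text = []   # text pieces seen since the last tag

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        text = ''.join(self.text).strip()
        if text:
            self.writer.writerow(['/'.join(self.path), text])
        self.path.pop()
        self.text = []


out = io.StringIO()   # stands in for an open output file
sample = (b"<Items><Item><Main><Platform>iTunes</Platform>"
          b"<Type>TVEpisode</Type></Main></Item></Items>")
xml.sax.parseString(sample, PathWriter(out))
print(out.getvalue())
```

For a truncated file, the parse call would still need the same try/except xml.sax.SAXParseException wrapper as above; every row written before the error remains in the output.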

Solution 5:

There are a number of tools around that will generate a schema from a supplied instance document. I don't know how many of them will work on a 5 GB input file, and I don't know how many of them can be invoked from Python.

Many years ago I wrote a Java-based, fully streamable tool to generate a DTD from an instance document. It hasn't been touched in years but it should still run: https://sourceforge.net/projects/saxon/files/DTDGenerator/7.0/dtdgen7-0.zip/download?use_mirror=vorboss

There are other tools listed here: Any tools to generate an XSD schema from an XML instance document?
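To give a rough idea of what such tools do in principle, here is a small sketch. It is not the DTDGenerator algorithm and falls far short of a real schema: it streams the document with the standard library's iterparse and records, for each element name, which child names occur and whether a child ever repeats within a single parent. It does not track optional children, attributes, or data types. infer_children and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def infer_children(source):
    """Map element name -> {child name: '1' or '+'} (hypothetical helper).

    '+' means the child was seen more than once inside a single parent.
    """
    model = defaultdict(dict)
    stack = []                       # one Counter of child tags per open element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if stack:
                stack[-1][elem.tag] += 1
            stack.append(Counter())
        else:
            for child, n in stack.pop().items():
                if n > 1 or model[elem.tag].get(child) == '+':
                    model[elem.tag][child] = '+'
                else:
                    model[elem.tag][child] = '1'
            elem.clear()             # drop the text and children of the finished element
    return dict(model)


# Made-up sample standing in for the real feed.
sample = io.BytesIO(
    b"<Items><Item><Genres><Genre>a</Genre><Genre>b</Genre></Genres></Item>"
    b"<Item><Genres><Genre>c</Genre></Genres></Item></Items>"
)
model = infer_children(sample)
for parent, children in model.items():
    print(parent, children)
```

Note that the root element still keeps a reference to each emptied child, so on a multi-gigabyte file a further step to detach finished children from the root would be needed.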
