Parsing huge, badly encoded XML files in Python


I have been working on code that parses external XML files. Some of these files are huge, up to gigabytes of data. Needless to say, they need to be parsed as a stream, because loading them into memory is far too inefficient and often leads to out-of-memory errors.

I have used the libraries miniDOM, ElementTree and cElementTree, and I am currently using lxml. Right now I have a working, fairly memory-efficient script using lxml.etree.iterparse. The problem is that some of the XML files I need to parse contain encoding errors (they advertise themselves as UTF-8 but contain differently encoded characters). With lxml.etree.parse this can be fixed by using the recover=True option of a custom parser, but iterparse does not accept a custom parser (see also a related question).

My current code looks like this:

    from lxml import etree

    events = ("start", "end")
    context = etree.iterparse(xmlfile, events=events)
    # The first event is the start of the root element, <items>
    event, root = next(context)
    for action, element in context:
        if action == 'end' and element.tag == 'item':
            # <parse>
            pass

The error occurs when iterparse encounters a bad character (in this case, a ^Y):

    lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
    Bytes: 0x19 0x73 0x20 0x65, line 949490, column 25

I don't even want to decode this data; I can just drop it. However, I don't know of any way to skip the element. I tried context.next and continue inside try/except statements.

Appreciate any help!

Update

Some additional information: this is the line where iterparse fails:

    <description><![CDATA[Musea de la Photography Force Mercator. Mate and Dane 80.000 Photo^YSN3 Mizosn Negativen Hat Sir Davis ...]]></description>

According to the error message, the error occurs at bytes 0x19 0x73 0x20 0x65. In a hex editor, 19 73 20 65 translates to the ASCII characters ".s e", where the "." stands for the unprintable byte 0x19. There should be an apostrophe (') in that place.
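As a quick check of that reading, here is a minimal sketch that decodes the four reported bytes and prints them in caret notation. The helper name `caret` is made up for this illustration and is not part of any library.

```python
# The four bytes reported in the error message.
bad = bytes([0x19, 0x73, 0x20, 0x65])

# Every byte is below 0x80, so they decode as ASCII without raising.
text = bad.decode("ascii")


def caret(ch):
    """Render control characters in caret notation (0x19 -> ^Y).

    Hypothetical helper for this illustration only.
    """
    code = ord(ch)
    return "^" + chr(code + 0x40) if code < 0x20 else ch


rendered = "".join(caret(ch) for ch in text)
print(rendered)  # ^Ys e
```

This matches the ^Y shown by the editor: the byte where the apostrophe should be is the control character 0x19, followed by "s e".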

I also found a related question, but it does not provide a solution.

If the problems are actual character encoding problems, rather than malformed XML, the easiest and probably most efficient solution is to deal with them at the point where the file is read, like this:

    import codecs
    from lxml import etree

    events = ("start", "end")
    # xmlfile must be a file object opened in binary mode here, not a path
    reader = codecs.EncodedFile(xmlfile, 'utf8', 'utf8', 'replace')
    context = etree.iterparse(reader, events=events)

This causes bytes that cannot be read as UTF-8 to be replaced with the Unicode replacement character (U+FFFD, usually displayed as '?'). There are a few other error-handling options; see the documentation for the codecs module for more.
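To see the effect of the codecs.EncodedFile wrapper in isolation, here is a small self-contained sketch. It makes a few assumptions that are not in the original post: it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in for lxml.etree.iterparse so that it runs without lxml installed, it reads from an in-memory io.BytesIO instead of a real file, and the sample document and the function name `iter_item_texts` are invented for the example.

```python
import codecs
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def iter_item_texts(binary_file):
    """Yield the text of each <item>, tolerating invalid UTF-8 in the input.

    Hypothetical helper: binary_file must be a file object opened in
    binary mode.
    """
    # Invalid bytes are decoded to U+FFFD and then re-encoded as UTF-8,
    # so the parser only ever sees well-formed UTF-8.
    reader = codecs.EncodedFile(binary_file, "utf8", "utf8", "replace")
    for action, element in ET.iterparse(reader, events=("end",)):
        if element.tag == "item":
            yield element.text
            element.clear()  # free the element's content once handled


# Made-up sample: 0xE9 is Latin-1 "e acute", which is not valid UTF-8 here.
raw = io.BytesIO(b"<items><item>caf\xe9</item><item>ok</item></items>")
texts = list(iter_item_texts(raw))
print(texts)
```

The first item comes back with U+FFFD in place of the bad byte, and parsing continues to the second item instead of stopping with a syntax error. Whether this also helps for a byte like 0x19, which is itself decodable, would need to be checked against the actual file.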
