I am working on code that parses external XML files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream, because loading them into memory is far too inefficient and often leads to out-of-memory errors.
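For readers unfamiliar with the streaming pattern mentioned above, here is a minimal sketch. It is an illustration only: it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in so that it runs without third-party packages (lxml.etree.iterparse is driven with the same kind of loop), and the tag name `record`, the helper `count_records` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag="record"):
    """Count <tag> elements while parsing incrementally (hypothetical helper)."""
    count = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Release the element's text and children once it has been handled.
            elem.clear()
    return count


# Hypothetical in-memory sample standing in for a large file on disk.
sample = io.BytesIO(b"<root><record>a</record><record>b</record></root>")
total = count_records(sample)
```

Clearing each element after it has been processed is what keeps memory use low on very large inputs.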
I have used the libraries miniDOM, ElementTree and cElementTree, and I currently use lxml. I have a working, fairly memory-efficient script using lxml.etree.iterparse. The problem is that some of the XML files I need to parse contain encoding errors (they advertise as UTF-8, but include differently encoded characters). When using lxml.etree.parse this can be fixed with the recover=True option of a custom parser, but iterparse does not accept a custom parser (see also a related question).

My current code looks like this: [...]

The error when iterparse encounters a bad character (in this case, a ^Y):

    lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
    Bytes: 0x19 0x73 0x20 0x65, line 949490, column 25

I do not even want to decode this data; I can simply drop it. However, I do not know how to skip the element. I tried context.next and continue inside try/except statements.

Any help is appreciated!

Update

Some additional information. This is the line where iterparse fails:

    <description><![CDATA:[Musea de la Photography Force Mercator. Mate and Dane 80.000 Photo^YSN3 Mizosn Negativen Hat Sir Davis ...]]></description>

According to etree, the error occurs at the bytes 0x19 0x73 0x20 0x65. According to hexedit, 19 73 20 65 translates to ASCII ".s e". The "." in this place should be an apostrophe.

I also found a related question, which does not provide a solution.

Answer

If the problems are actual character encoding problems, rather than malformed XML, the easiest and probably the most efficient solution is to deal with them at the point where the file is read. Like this:

    import codecs
    from lxml import etree

    events = ("start", "end")
    # Re-encode the stream as UTF-8, replacing bytes that cannot be decoded.
    reader = codecs.EncodedFile(xmlfile, 'utf8', 'utf8', 'replace')
    context = etree.iterparse(reader, events=events)

This causes bytes that cannot be read as UTF-8 to be replaced with a replacement character (U+FFFD, often shown as '?'). There are a few other error-handling options; see the documentation for the codecs module for more.
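To see what the 'replace' error handler does in isolation, here is a small self-contained sketch. It rests on a few assumptions: it reads from an in-memory byte string rather than a real file, it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in for lxml.etree.iterparse so that it runs without third-party packages, and the sample data, the tag name `item` and the helper `iter_item_texts` are hypothetical.

```python
import codecs
import io
import xml.etree.ElementTree as ET


def iter_item_texts(raw_stream, tag="item"):
    """Yield the text of each <tag> element, tolerating invalid UTF-8 bytes."""
    # Decode the underlying bytes as UTF-8 and re-encode them as UTF-8;
    # bytes that cannot be decoded are replaced with U+FFFD.
    reader = codecs.EncodedFile(raw_stream, "utf-8", "utf-8", errors="replace")
    for _event, elem in ET.iterparse(reader, events=("end",)):
        if elem.tag == tag:
            yield elem.text
            elem.clear()


# Hypothetical sample: 0xFF can never appear in valid UTF-8.
sample = b"<root><item>foto\xffs</item><item>ok</item></root>"
texts = list(iter_item_texts(io.BytesIO(sample)))
```

The wrapper only changes bytes that are not valid UTF-8. A control byte such as 0x19 decodes without error, so this sketch does not establish whether the approach helps for that particular byte.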