vb.net - Parse Body Text from PDF


I have recently been experimenting with parsing text from a PDF document using iTextSharp in a VB2010 app. There are no images or other fancy elements in the document, only text. I've read some articles and used a code snippet, and although it looks promising, I want to parse just the body of each page, without the header or footer, and I have found no guidance for that particular task.

I am currently using the snippet here, but it parses all the text on a page. There should be a way of getting just the body. Or at least I hope so.

A PDF usually does not contain information about the logical structure of the text it holds.

So there is nothing like a header, footer, body, or paragraph in a PDF. There are only operations such as "draw this glyph here" or "go to this position and draw that group of glyphs". I wrote "glyph" and not "character" because a PDF does not have to contain readable text.

Tagged PDFs are the one exception, but most PDFs in the wild are not tagged.

Given all of the above, you are most likely left with the following approach:

  • Extract all the text from each page
  • Analyze the text and find the parts that are identical at the beginning / end of each page
  • Remove those identical parts

This detection is heuristic, so it will not always give perfect results.
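The steps above can be sketched roughly as follows. This is only an illustration of the comparison logic, written in Python for brevity, and it assumes the text of each page has already been extracted (for example with iTextSharp) into one string per page. The names `strip_repeated_header_footer`, `_key`, and `_common_prefix_len` are hypothetical, and masking digits so that "Page 1" and "Page 2" count as the same line is an added assumption, not part of the approach described above.

```python
import re
from typing import List


def _key(line: str) -> str:
    # Assumption: mask digits so "Page 1" and "Page 2" compare as equal.
    return re.sub(r"\d+", "#", line.strip())


def _common_prefix_len(pages: List[List[str]]) -> int:
    """Count the leading lines that match (after masking) on every page."""
    limit = min(len(p) for p in pages)
    n = 0
    while n < limit and all(_key(p[n]) == _key(pages[0][n]) for p in pages):
        n += 1
    return n


def strip_repeated_header_footer(page_texts: List[str]) -> List[str]:
    """Drop lines that repeat at the top and bottom of every page."""
    # Split each page into non-empty, trimmed lines.
    pages = [
        [ln.strip() for ln in text.splitlines() if ln.strip()]
        for text in page_texts
    ]
    if len(pages) < 2:
        # With a single page there is nothing to compare against.
        return ["\n".join(p) for p in pages]

    # Shared lines at the top of every page are treated as the header.
    head = _common_prefix_len(pages)
    rest = [p[head:] for p in pages]

    # Reverse the remaining lines to find the shared footer the same way.
    tail = _common_prefix_len([p[::-1] for p in rest])
    return ["\n".join(p[: len(p) - tail]) for p in rest]
```

Because only lines shared by every page are removed, a document whose first page has a different header (a title page, for instance) would be left untouched by this sketch; comparing against a majority of pages instead of all of them is one possible refinement.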
