.NET Framework - libraries PDF read

Asked By kazik on 03-Jun-11 03:38 PM
Hi,

Could you recommend the best libraries for reading PDF files from
C#?
I have already reviewed iTextSharp and PDFSharp, but it is not easy
to find any documentation or samples for reading PDF files (there is
plenty about generating them).
Can you help me?

Regards
Daniel




Peter Duniho replied to kazik on 04-Jun-11 12:54 AM
First you need to define your problem.  There are lots of ways to "read"
a PDF file, depending on what you really want to do.

If you simply want to display PDF, then the best and easiest approach is
just to use Acrobat Reader.  You can use it in the WebBrowser control,
or just launch it standalone.

If you need to examine the structure of the PDF document, then that is
quite a bit harder.  For the purpose of converting PDF to text
documents, I have recently been using the "Text Extraction Toolkit" from
PDFLib (http://www.pdflib.com/products/tet/), and it works quite well.
It has a pretty complicated and user-unfriendly API, which I have wrapped
in a much more convenient and usable managed interface.

(It actually has a managed wrapper with the library, but that wrapper is
_very_ thin and has essentially the same API as the basic C
library: every single option is specified through a relatively arcane
string descriptor.  My own wrapper maps regular .NET methods,
properties, objects, etc. to the single-level, string-based API provided
by the library).
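
The kind of wrapper described above can be sketched roughly as follows (in Python for brevity; the same shape carries over to a C# class). The option names `mode` and `skiphidden` and the `key=value` syntax are placeholders invented for this sketch, not the library's actual option names. The point is only that typed fields are turned into the descriptor string in one place.

```python
from dataclasses import dataclass


@dataclass
class ExtractOptions:
    # Hypothetical options; a real wrapper would mirror the library's own.
    mode: str = "page"
    skip_hidden: bool = False

    def to_descriptor(self) -> str:
        """Render the typed options as a space-separated key=value string."""
        parts = [f"mode={self.mode}"]
        if self.skip_hidden:
            parts.append("skiphidden=true")
        return " ".join(parts)


print(ExtractOptions(mode="word", skip_hidden=True).to_descriptor())
```

Callers then work with named, typed fields, and the string syntax lives in a single method.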

In spite of that shortcoming, the library itself works reasonably well.
It does a very good job of extracting text from most documents I have
thrown at it, and it almost never crashes (at this point, I have extracted
text from a couple million documents using the library, and seen fewer
than a dozen crashes).  Of course, I'd rather it not crash at all, but I
have come to have relatively low expectations of third-party software.  :(

They have a free trial, so you can easily see if it suits your needs.

There are other libraries out there that do similar things.  A
Bing/Google search will easily show you them.

Pete
kazik replied to Peter Duniho on 04-Jun-11 03:53 AM
I am making a converter to another format, so I want to read everything
from the PDF file: text and graphics.
Peter Duniho replied to kazik on 04-Jun-11 05:49 AM
The PDFLib library can extract both text and graphics, but it will not give
you the document structure.

Again, as I said before, you need to be clear about your definition of
the problem.  If "the other format" is a text file, then just extracting
text would be sufficient.  You say you also want graphics, but that
does not explain to what level of fidelity your "other format" is
expected to represent the original PDF.

You almost certainly need the document structure if you are trying to
make a converter for a format in which you expect the converted document
to look exactly like the original PDF.  But otherwise, there is a whole
gamut of features you might or might not need, depending on what the
to be shown in that.

There is a product from Stellent, which is now part of Oracle's
But unfortunately it does not do nearly as good a job of handling
character encoding when it comes to extracting the text.

When I was researching this stuff last year, I found that Adobe claimed
to have a library that provided more elaborate access to PDF data
structures, beyond that which is found in the library they provide with
the full retail Acrobat product.  But in the Acrobat product, they did
an even worse job of dealing with character encoding (that is, they
did not bother to do anything with it; you got bytes back and if you
did not already know the encoding, too bad), and I have no reason to
believe that the other library their web site mentioned would be any better.

Note also that PDF is essentially PostScript; it is really a series of
instructions in that rendering API.  One of the reasons text extraction
is difficult is that the document only expresses where on the page the
text should go, and that may or may not relate well to the logical flow
of the text (I have even seen PDF documents where each character of the
text appears in the PDF in the exact opposite order from that in which
it appears visually).
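
To illustrate the ordering problem, here is a minimal sketch (Python, for brevity). It assumes an extraction library has already returned positioned text fragments; the `Fragment` type, its fields, and the line tolerance are hypothetical, not any particular library's API. It groups fragments into lines by baseline and sorts each line left to right, which recovers the visual order even when the fragments are stored reversed.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge, in page units
    y: float  # baseline; the PDF y axis grows upward


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then sort left to right."""
    lines = []  # each entry: [baseline_y, [fragments on that line]]
    # Visit fragments from the top of the page to the bottom (larger y first).
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return "\n".join(
        "".join(f.text for f in sorted(members, key=lambda f: f.x))
        for _, members in lines
    )


# Characters stored in the opposite order from how they appear on the page.
stored = [
    Fragment("d", 30, 688), Fragment("l", 20, 688), Fragment("o", 10, 688),
    Fragment("i", 20, 700), Fragment("H", 10, 700),
]
print(reading_order(stored))
```

This only handles a single column of horizontal text. Multiple columns, rotated text, and tables need considerably more logic.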

Another reason text extraction is difficult is that the character data
in the PDF may or may not be in some recognized character encoding;
often, the text is instead just an index for each character into a glyph
table stored within the PDF, which in turn may not necessarily have any
direct way to map a glyph back to the actual character it represents.
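
A small sketch of the glyph problem, again in Python for brevity. It assumes a font whose character codes are indices into an embedded glyph table, plus an optional code-to-Unicode map (in real PDFs this role is played by the font's ToUnicode CMap, when one is present). The codes and maps below are made up for illustration.

```python
REPLACEMENT = "\ufffd"  # U+FFFD marks characters that cannot be recovered


def decode_glyph_codes(codes, to_unicode):
    """Map glyph codes to text; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(code, REPLACEMENT) for code in codes)


# Hypothetical subset font: the codes are arbitrary glyph indices, not ASCII.
codes = [3, 7, 1, 1, 9]

# With a complete map, the text comes back intact.
full_map = {3: "H", 7: "e", 1: "l", 9: "o"}
print(decode_glyph_codes(codes, full_map))

# With an incomplete map, the page still renders correctly (the glyph
# shapes are in the font), but an extractor cannot tell what code 1 means.
partial_map = {3: "H", 7: "e", 9: "o"}
print(decode_glyph_codes(codes, partial_map))
```

When the map is missing entirely, the remaining options are guessing from glyph names or falling back to OCR on the rendered page.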

So if your "other format" is similar in nature, it might be simpler to
accomplish what you want than if you actually needed to interpret the
text.  If it is sufficient simply to _draw_ the text without caring what
the actual characters are, that might be simpler.

But regardless, the fact is, for what it would cost to write a decent
PDF-to-whatever converter, you can buy a lot of licenses for the retail
Acrobat product (so your users can edit PDFs instead of whatever the
do not have to pay anything; just use Acrobat Reader.

If you still believe you do have a critical business case that requires
you to implement the converter, I am not personally aware of any good
products that can handle the full document structure that might be
required (again, depending on your _actual_ needs, which you have not
really expressed very well yet).  But you surely know how to use a web
search engine.  :)

Pete