There are two main types of PDF documents: native PDFs and scanned PDFs. When converting documents to PDF, it is important to understand the difference between them. Each type holds different information, so their characteristics and features differ as well. Below is a look at the two PDF types and how they differ.
Native PDFs are generated from an electronic source, such as a Word document, a computer-generated report, or spreadsheet data. They have an internal structure that can be read and interpreted.
These "generated" PDF documents therefore already contain characters that have an electronic character designation. In most cases, the PDF creation software takes information from the structure of the source document, such as character information and word placement, and retains it in the created PDF. This is why you can run a word search on a text-based native PDF. Searching these PDFs relies on the electronic character designations to reliably locate the search words.
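Searching by character designation and placement can be illustrated with a minimal sketch. It assumes a simplified, hypothetical model in which a page is just a list of words with coordinates; the names `Word` and `find_word` are made up for illustration and do not correspond to any real PDF library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for one word stored in a native PDF page."""
    text: str   # the encoded characters
    x: float    # horizontal position on the page (illustrative units)
    y: float    # vertical position on the page (illustrative units)


def find_word(page: list[Word], query: str) -> list[tuple[float, float]]:
    """Return the position of every word that matches the query, ignoring case."""
    needle = query.lower()
    return [(w.x, w.y) for w in page if w.text.lower() == needle]


# A tiny made-up page: two lines of text with invented coordinates.
native_page = [
    Word("Quarterly", 72.0, 720.0),
    Word("report", 130.0, 720.0),
    Word("Report", 72.0, 700.0),
    Word("summary", 115.0, 700.0),
]

hits = find_word(native_page, "report")
print(hits)  # [(130.0, 720.0), (72.0, 700.0)]
```

Because the characters and their positions are stored explicitly, the search can report where each match sits on the page rather than merely whether it exists.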
Because not all documents that need to be transmitted exist in electronic form, physical paper documents still have to be converted. This is where the scanned PDF comes into play. It would be inefficient to retype documents manually into electronic form and then convert them into PDFs. The solution is to scan them with an electronic scanning device. Like the PDF creator, a scanner "digitally captures" the image of the physical document in electronic form.
A scanner doesn't reconstruct the characters of every word when it creates this scanned image; it takes a "snapshot" of the document. This snapshot is then turned into a PDF by software integrated with the scanner. The result is a scanned PDF document. However, even though the image may show a document that contains words, the computer recognizes those words only as "images" that it displays without any information structure behind them. If you try to text search the document, the PDF search engine won't yield any results.
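The contrast between the two types can be sketched as a simple heuristic: a page that carries an image but no character data is probably scanned. The `Page` model and the `looks_scanned` and `search` functions below are hypothetical illustrations, not the API of any PDF library, and real files may mix both kinds of page.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical page model: encoded text plus a count of embedded images."""
    text: str = ""
    image_count: int = 0


def looks_scanned(page: Page) -> bool:
    """Heuristic: the page has at least one image and no searchable characters."""
    return page.image_count > 0 and not page.text.strip()


def search(page: Page, query: str) -> bool:
    """Text search can only see encoded characters, never pixels."""
    return query.lower() in page.text.lower()


native = Page(text="Invoice total: 42.00", image_count=0)
scanned = Page(text="", image_count=1)  # the words exist only as pixels

print(looks_scanned(native), search(native, "invoice"))    # False True
print(looks_scanned(scanned), search(scanned, "invoice"))  # True False
```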
Converting a scanned PDF into an editable format requires OCR (Optical Character Recognition) software to analyze the "image" of each character and match it to an electronic character. Because the software infers each character from its shape alone, it is much more difficult to determine whether the character "recognized" by the OCR software is, indeed, the character on the scanned document.
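To give a feel for why recognition is a best guess rather than a certainty, here is a toy sketch of matching a character image against stored templates. It is a deliberately simplified illustration: the 3x5 bitmaps, the `TEMPLATES` table, and the `recognize` function are assumptions made for this example, and production OCR software uses far more sophisticated methods.

```python
# Each template is a 3x5 bitmap: "1" is ink, "0" is blank paper.
TEMPLATES = {
    "I": ("111", "010", "010", "010", "111"),
    "L": ("100", "100", "100", "100", "111"),
    "T": ("111", "010", "010", "010", "010"),
}


def similarity(glyph, template):
    """Fraction of pixels that agree between a scanned glyph and a template."""
    pairs = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph):
    """Return the best-matching character and its similarity score."""
    scores = {ch: similarity(glyph, tpl) for ch, tpl in TEMPLATES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


clean_t = ("111", "010", "010", "010", "010")
smudged_t = ("111", "010", "011", "010", "010")  # one stray pixel of ink

print(recognize(clean_t))             # ('T', 1.0)
char, score = recognize(smudged_t)
print(char, round(score, 2))          # T 0.93
```

Even a single stray pixel lowers the score, and a few more could tip the match toward a different letter, which is why OCR output generally needs checking.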
Note that the quality of OCR output is affected by factors such as poor image quality in the scanned document, a mixture of fonts, and italicized or underlined text, all of which can obscure the shape of individual characters. Although OCR is not perfect, the ability to search documents that were once unsearchable, without having to read 100 pages, is a great benefit.
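One way to keep imperfect OCR output searchable is to tolerate small recognition errors when matching words. The following is a minimal sketch using Python's standard `difflib` module; the `fuzzy_find` helper, the sample text, and the 0.75 cutoff are assumptions chosen for illustration, not a recommendation from any OCR product.

```python
import difflib


def fuzzy_find(query, ocr_text, cutoff=0.75):
    """Return words from the OCR text that closely resemble the query."""
    words = ocr_text.lower().split()
    return difflib.get_close_matches(query.lower(), words, n=5, cutoff=cutoff)


# Made-up OCR output in which the letter "i" was misread as the digit "1".
ocr_text = "The qu1ck brown fox"

print("quick" in ocr_text.lower())    # False: an exact search misses the word
print(fuzzy_find("quick", ocr_text))  # ['qu1ck']
```

A lower cutoff catches more misreadings but also returns more false matches, so the threshold is a trade-off to tune for the documents at hand.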