2015-03-20

Extract formatted text from PDF document for search and analysis (C# .NET sample)

Introduction


Text in a PDF document is drawn using individual text drawing and positioning commands, so its original formatting and logical structure are very often not preserved. A paragraph that you see on a page is not necessarily stored or drawn as a single unit. It can consist of many pieces, each with its own properties or transformation and, quite often, a different font. So while the paragraph looks solid, it is actually stored in chunks, and because its formatting depends on appearance alone, further processing is needed to recover the logical structure.
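To make the idea of "chunks" more concrete, here is a minimal sketch of how separately drawn pieces could be put back into lines. It is only an illustration under simple assumptions and not the algorithm used by Apitron PDF Kit: the TextChunk type, the ChunkLayout class and both threshold parameters are hypothetical names introduced for this example, and the sketch assumes horizontal, unrotated text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical model of one separately drawn piece of text (not an Apitron PDF Kit type).
// X and Y are the left edge and the baseline of the piece in page units.
public record TextChunk(string Text, double X, double Y, double Width);

public static class ChunkLayout
{
    // Groups chunks with nearly equal baselines into lines, orders every line
    // from left to right and inserts a space where a visible gap exists.
    public static string Rebuild(IEnumerable<TextChunk> chunks,
                                 double lineTolerance = 2.0,
                                 double gapThreshold = 1.5)
    {
        var lines = new List<List<TextChunk>>();

        // PDF page coordinates grow upwards, so a larger Y means higher on the page.
        foreach (TextChunk chunk in chunks.OrderByDescending(c => c.Y))
        {
            bool sameLine = lines.Count > 0 &&
                Math.Abs(lines[lines.Count - 1][0].Y - chunk.Y) <= lineTolerance;

            if (sameLine)
                lines[lines.Count - 1].Add(chunk);
            else
                lines.Add(new List<TextChunk> { chunk });
        }

        var result = new StringBuilder();
        foreach (List<TextChunk> line in lines)
        {
            bool first = true;
            double previousRight = 0;
            foreach (TextChunk chunk in line.OrderBy(c => c.X))
            {
                // A gap wider than the threshold is treated as a word break.
                if (!first && chunk.X - previousRight > gapThreshold)
                    result.Append(' ');

                result.Append(chunk.Text);
                previousRight = chunk.X + chunk.Width;
                first = false;
            }
            result.Append('\n');
        }
        return result.ToString();
    }
}
```

Note that the chunks may arrive in any order and that a single word can be split between two of them, which is why the sketch relies on positions only. Real documents also contain rotated text, columns and varying font sizes, so fixed thresholds like these would not be sufficient in practice.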

Apitron PDF Kit is a .NET component that provides a simple way to get formatted text from a PDF page, search this text, or analyze it in any way you need. The component's features are described on our website, and we have written a book showing it in action, which you may download using the following link.

Getting the formatted text from PDF page


The code sample below shows how to extract formatted text from a PDF page. The formatting is applied by the component's own algorithms, which add the necessary line breaks and spacing.

using (Stream stream = File.Open("Apitron PDF Kit in Action.pdf", FileMode.Open))
{
    FixedDocument document = new FixedDocument(stream);
    // extract formatted text
    string text = document.Pages[1].ExtractText(TextExtractionOptions.FormattedText);   
    // set console window size and print text
    Console.SetWindowSize(116, 60);
    Console.WriteLine(text);
}

Compare the program output with the original PDF file in the image below:

Pic. 1 Formatted text extraction from PDF document


This C# code sample is largely self-explanatory and demonstrates the text extraction feature, which you can use to get text from a PDF document. One thing should be noted, though: if the text in a PDF file was created using an embedded font subset with a custom encoding that has no Unicode mapping, it cannot be extracted.
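Once the page text is available as an ordinary string, searching it requires nothing beyond the standard .NET library. The sketch below is one possible way to do it and is not a part of Apitron PDF Kit: the TextSearch class and its FindAll method are hypothetical names introduced for this example. It reports the line and column of every case-insensitive match, relying on the line breaks added during formatted extraction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for searching extracted text (not an Apitron PDF Kit type).
public static class TextSearch
{
    // Returns 1-based (line, column) positions of every case-insensitive
    // occurrence of 'term' in 'text'. Matches do not span line breaks.
    public static List<(int Line, int Column)> FindAll(string text, string term)
    {
        var hits = new List<(int Line, int Column)>();
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(term))
            return hits;

        string[] lines = text.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            int start = 0;
            while ((start = lines[i].IndexOf(term, start, StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                hits.Add((i + 1, start + 1));
                // Continue after the current match to find non-overlapping occurrences.
                start += term.Length;
            }
        }
        return hits;
    }
}
```

With the text variable from the extraction sample above, a call such as TextSearch.FindAll(text, "Apitron") would return the positions of all occurrences on the page. Keeping the search line-based is a deliberate simplification, since a phrase that wraps onto the next line would not be found this way.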

The Apitron PDF Kit .NET component can be downloaded from our website. We’ll be happy to answer your questions and welcome any feedback.

A downloadable version of this article can be found using the following link [PDF].
