2015-04-29

How to extract text from a PDF page and create a PDF to HTML conversion tool

Introduction


In our first post about PDF text extraction we demonstrated how to extract raw and formatted text from a PDF document using the Apitron PDF Kit for .NET component. As part of the latest improvements, we've added new text extraction functionality and are eager to share the results with you.

Before this release, if you wanted to extract text from a PDF document programmatically, you had to pick one of the following options:
  • Process text blocks yourself by parsing PDF commands and combining the results into formatted or raw text. Frankly speaking, this is the hardest route you could choose, and it requires a thorough understanding of many text-related aspects of PDF.
  • Use the Apitron PDF Kit API by calling Page.ExtractText() with the RawText or FormattedText parameter.

While both techniques work fine and do what they're meant for, you may still need an easy and convenient way to analyze text attributes, e.g. color, original font name, size, etc., without diving too deep into PDF specifics. That's why we've implemented new text extraction modes in addition to Raw and Formatted:
  • TaggedText – produces XML output in which each PDF text block is wrapped in an XML element containing all available appearance information affecting this block. It also reports the coordinates of each block in page space, so you can analyze and format the text data in any way you like. Read more on this in a separate section below.
  • HtmlText – a further elaboration of TaggedText; this mode is simply an attempt to reduce the effort needed for the PDF to HTML conversion task that many developers encounter very often. It is described in detail in a separate section below.

Note: in evaluation mode only part of the page text can be extracted and the performance of the text extraction might be slightly affected.

PDF text extraction modes 


TaggedText


This mode generates tagged output in XML format. For each PDF page it produces a root page element containing PDF text blocks represented by the textblock element.

Supported attributes for page element are:
  • width – width of the page
  • height – height of the page

Supported attributes for textblock element are:
  • left – block left coordinate, in PDF coordinate system (relative to lower left corner)
  • bottom – block bottom coordinate, in PDF coordinate system
  • width – block width
  • height – block height
  • fontSize – font size in points
  • fontFamily – describes the font used within text block
  • letterSpacing – additional letter spacing
  • wordSpacing – additional word spacing
  • fontStretch – indicates a selection of normal, condensed, or expanded face from a font
  • fontWeight – defines boldness of the font
  • fontStyle – indicates italic font
  • nonStrokeColor – fill color, rgb
  • strokeColor – stroking color, rgb

All coordinates and dimensions are in page space and relative to the lower left corner of the PDF page. Text blocks are produced in the same order in which they appear on the PDF page, without any additional formatting applied. Using this markup, one can analyze the text layout and perform context-related tasks. Given some knowledge of the nature of the document, it becomes possible to create custom-tailored solutions serving the particular needs of the application developer.

Sample code and produced XML are below (page 7 from PDF32000_2008 spec):

/// <summary>
/// Extracts tagged text from PDF page at given index.
/// </summary>
/// <param name="pdfDoc">PDF document to extract text from.</param>
/// <param name="pageIndex">Page index.</param>
/// <returns>Extracted text or empty string if page wasn't found or empty.</returns>
public string ExtractTaggedText(FixedDocument pdfDoc, int pageIndex)
{
    if (pdfDoc != null && pageIndex >= 0 && pdfDoc.Pages.Count > pageIndex)
    {
        return pdfDoc.Pages[pageIndex].ExtractText(TextExtractionOptions.TaggedText);
    }

    return string.Empty;
}

XML:

<page width="595" height="842">
  <textblock left="464.94" bottom="791.91" fontSize="10.98" fontFamily="Helvetica-Bold" letterSpacing="0.01" wordSpacing="0" width="93.42" height="13.07">PDF 32000-1:2008</textblock>
  <textblock left="70.8" bottom="747.24" fontSize="13.98" fontFamily="Arial" fontWeight="700" fontStretch="Normal" letterSpacing="0.01" width="60.21" height="19.38">Contents</textblock>
  <textblock left="535.08" bottom="747.3" fontSize="9.96" fontFamily="Arial" fontStretch="Normal" letterSpacing="0.03" width="23.4" height="13.26">Page</textblock>
    …
</page>

The first text block is highlighted for demonstration purposes.
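
As an illustration of the kind of layout analysis this markup allows, here is a minimal sketch that relies only on LINQ to XML from the .NET base class library. It assumes the extracted string is a single well-formed page element like the one shown above. The class name TaggedTextAnalysis and the method FindLargeText are hypothetical names chosen for this example and are not part of the Apitron API.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for this example; not part of Apitron PDF Kit.
public static class TaggedTextAnalysis
{
    // Reads a numeric attribute using the invariant culture; returns 0 when it is absent.
    private static double Num(XElement element, string name)
    {
        XAttribute attribute = element.Attribute(name);
        return attribute == null
            ? 0.0
            : double.Parse(attribute.Value, CultureInfo.InvariantCulture);
    }

    // Returns the text of every block whose fontSize is at least minFontSize,
    // ordered from the top of the page to the bottom.
    public static List<string> FindLargeText(string taggedXml, double minFontSize)
    {
        // Assumption: the root element is <page> and it carries a height attribute.
        XElement page = XDocument.Parse(taggedXml).Root;
        double pageHeight = Num(page, "height");

        return page.Elements("textblock")
            .Where(block => Num(block, "fontSize") >= minFontSize)
            // PDF coordinates grow upwards, so convert "bottom" into
            // a distance from the top edge before sorting.
            .OrderBy(block => pageHeight - Num(block, "bottom") - Num(block, "height"))
            .Select(block => block.Value)
            .ToList();
    }
}
```

Calling TaggedTextAnalysis.FindLargeText(xml, 10.0) on the three blocks listed above would keep the page header and the "Contents" heading and skip the smaller "Page" label. Parsing with the invariant culture matters here, because the attribute values use a dot as the decimal separator regardless of the machine locale.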


HtmlText


This text extraction mode is based on the tagged text mode and was designed to provide a base implementation of the frequently needed PDF to HTML conversion task. If you need an additional level of specificity, you can implement it yourself using the XML output produced by the tagged text mode.

HtmlText mode produces a <div> block containing preformatted text wrapped inside <pre> elements. Each of these <pre> elements is styled with an inline style according to the text properties of the corresponding PDF text object. Each element is positioned absolutely within the parent block, which itself uses relative positioning.

Sample code and produced HTML are below (page 6 from PDF32000_2008 spec):

/// <summary>
/// Extracts html text from PDF page at given index.
/// </summary>
/// <param name="pdfDoc">PDF document to extract text from.</param>
/// <param name="pageIndex">Page index.</param>
/// <returns>Extracted text or empty string if page wasn't found or empty.</returns>
public string ExtractHtmlText(FixedDocument pdfDoc, int pageIndex)
{
    if (pdfDoc != null && pageIndex >= 0 && pdfDoc.Pages.Count > pageIndex)
    {
        return pdfDoc.Pages[pageIndex].ExtractText(TextExtractionOptions.HtmlText);
    }

    return string.Empty;
}

HTML:

<div style="width:595px;height:842px;position:relative;">
  <style scoped="scoped">pre{margin:0;padding:0;position:absolute;}</style>
  <pre style="left:464.94px;bottom:791.91px;font-size:10.98px;font-family:'Helvetica';width:93.42px;height:13.07px;">PDF 32000-1:2008</pre>
  <pre style="left:70.86px;bottom:747.24px;font-size:13.98px;font-family:'Helvetica';width:81.62px;height:19.38px;">Introduction</pre>
  <pre style="left:70.86px;bottom:717.42px;font-size:9.96px;font-family:'Helvetica';width:490.44px;height:13.26px;">ISO 32000 specifies a digital form for representing documents called the Portable Document Format or usually</pre>
  <pre style="left:70.86px;bottom:705.95px;font-size:9.96px;font-family:'Helvetica';width:490.37px;height:13.26px;">referred to as PDF. PDF was developed and specified by Adobe Systems Incorporated beginning in 1993 and</pre>
    …
</div>

All properties of a PDF text element that can be mapped to the HTML style attribute are used here. Also, as you can see, the produced HTML block uses a scoped style to set the common properties for all nested <pre> elements.
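
If the built-in HTML output is not specific enough, a similar block can be assembled from the TaggedText XML. The following is only a rough sketch of that idea, not the way HtmlText mode is implemented inside the library. The class TaggedTextToHtml and its methods are hypothetical, and the sketch assumes the attribute names listed in the TaggedText section. Unlike the sample above, it also maps fontWeight to the CSS font-weight property.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

// Hypothetical converter for this example; not part of Apitron PDF Kit.
public static class TaggedTextToHtml
{
    // Builds a <div> with one absolutely positioned <pre> per text block.
    public static string ToHtml(string taggedXml)
    {
        XElement page = XDocument.Parse(taggedXml).Root;

        XElement div = new XElement("div",
            new XAttribute("style", string.Format(
                "width:{0}px;height:{1}px;position:relative;",
                (string)page.Attribute("width"),
                (string)page.Attribute("height"))));

        foreach (XElement block in page.Elements("textblock"))
        {
            StringBuilder style = new StringBuilder("position:absolute;margin:0;");
            AppendStyle(style, "left", block.Attribute("left"), "px");
            AppendStyle(style, "bottom", block.Attribute("bottom"), "px");
            AppendStyle(style, "width", block.Attribute("width"), "px");
            AppendStyle(style, "height", block.Attribute("height"), "px");
            AppendStyle(style, "font-size", block.Attribute("fontSize"), "px");
            AppendStyle(style, "font-weight", block.Attribute("fontWeight"), "");

            XAttribute family = block.Attribute("fontFamily");
            if (family != null)
            {
                style.Append("font-family:'").Append(family.Value).Append("';");
            }

            // Adding the text as XElement content escapes &, < and > automatically.
            div.Add(new XElement("pre",
                new XAttribute("style", style.ToString()),
                block.Value));
        }

        // Disable formatting so no extra whitespace is added between elements.
        return div.ToString(SaveOptions.DisableFormatting);
    }

    // Appends "name:value+unit;" when the source attribute is present.
    private static void AppendStyle(StringBuilder style, string cssName,
                                    XAttribute attribute, string unit)
    {
        if (attribute != null)
        {
            style.Append(cssName).Append(':').Append(attribute.Value)
                 .Append(unit).Append(';');
        }
    }
}
```

The same pattern can be extended to the remaining attributes, for example letterSpacing or the color attributes, once their exact value format has been checked against real output.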

If we open the produced HTML in a browser, it will look as follows:

Pic. 1 PDF to HTML conversion results
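
To view the result in a browser, the extracted fragment has to be placed inside a complete HTML page first. One possible way to do that is sketched below using only base class library calls. HtmlPreview, WrapFragment and Save are hypothetical names introduced for this example, and the fragment is assumed to be the string returned by the ExtractHtmlText method shown earlier.

```csharp
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper for this example; not part of Apitron PDF Kit.
public static class HtmlPreview
{
    // Wraps an HTML fragment into a minimal standalone HTML document.
    public static string WrapFragment(string htmlFragment, string title)
    {
        return "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n<title>"
            + WebUtility.HtmlEncode(title)   // the title is plain text, so encode it
            + "</title>\n</head>\n<body>\n"
            + htmlFragment                   // the fragment is already HTML, insert as-is
            + "\n</body>\n</html>\n";
    }

    // Writes the wrapped document to disk as UTF-8 so it matches the declared charset.
    public static void Save(string htmlFragment, string title, string path)
    {
        File.WriteAllText(path, WrapFragment(htmlFragment, title), Encoding.UTF8);
    }
}
```

For example, HtmlPreview.Save(ExtractHtmlText(pdfDoc, 5), "Page preview", "page.html") would write a file that can be opened directly in a browser; the page index and file name here are placeholders.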

Conclusion


Using the new text extraction modes described in this post, you're now able to analyze document text together with its attributes and positions. You can also easily convert PDF to simple HTML for quick preview purposes.

Apitron PDF Kit for .NET can be used as a standard PDF component for creating applications that require PDF processing. Create apps for Windows Store, Google Play Store, and Apple Store using the same API. Develop server-side solutions, cloud solutions, directly managed websites, or web services. Being truly cross-platform and .NET / Mono / Xamarin compatible, our library raises the implementation of PDF processing logic to an unmatched level.

A downloadable version of this article is available at the following link [PDF].
