PDF tips & tricks: How to extract text from pdf page and create pdf to html conversion tool

Introduction

In our first post about PDF text extraction we demonstrated how to extract raw and formatted text from PDF document using Apitron PDF Kit for .NET component. As a part of the latest improvements, we've added new text extraction functionality and are eager to share the results with you.

If you were to extract text from PDF document programmatically before this release, you would have to pick one of the following choices:

Process text blocks yourself by parsing PDF commands and combining the results into a formatted or raw text data. Frankly speaking, it’s the hardest way you might choose and it would require the complete understanding of many text-related PDF aspects.

Use Apitron PDF Kit API by calling Page::ExtractText() with RawText or FormattedText parameter.

While all these techniques are working fine and do what they’re meant for, you may still have a need in some easy and convenient way to analyze the text attributing information e.g. color, original font name, size, etc. without diving too much in PDF specifics. That’s why we've implemented new text extraction modes in addition to Raw and Formatted:

TaggedText – produces XML output, where each PDF text block becomes wrapped by an xml element containing all available appearance information affecting this block. It also reports coordinates of each block in page space and therefore gives you unique ability to analyze and format text data in any way you like. Read more on this in separate section further.

HtmlText – further elaboration of TaggedText, actually this mode is no more than an attempt to lessen the efforts needed for PDF to HTML conversion task which many of developers encounter very often. Described in details in separate section.

Note: in evaluation mode only part of the page text can be extracted and the performance of the text extraction might be slightly affected.

PDF text extraction modes

TaggedText

This mode generates tagged output in XML format. For each PDF page it produces a root page element containing PDF text blocks represented by the textblock element.

Supported attributes for page element are:

width – width of the page
height – height of the page

Supported attributes for textblock element are:

left – block left coordinate, in PDF coordinate system (relative to lower left corner)
bottom – block bottom coordinate, in PDF coordinate system
width – block width
height – block height
fontSize – font size in points
fontFamily – describes the font used within text block
letterSpacing – additional letter spacing
wordSpacing – additional word spacing
fontStretch – indicates a selection of normal, condensed, or expanded face from a font
fontWeight – defines boldness of the font
fontStyle – indicates italic font
nonStrokeColor – fill color, rgb
strokeColor – stroking color, rgb

All coordinates and dimensions are in page space and relative to lower left corner of the PDF page. Text blocks are being produced in the same order they appear on PDF page and without any additional formatting applied. Using this markup one may perform analysis of text layout and perform any context related tasks. Having the information about document nature, it becomes possible to create custom-tailored solutions serving the particular needs of application developer.

Sample code and produced XML are below (page 7 from PDF32000_2008 spec):

/// <summary>

/// Extracts tagged text from PDF page at given index.

/// </summary>

/// <param name="pdfDoc">PDF document to extract text from.</param>

/// <param name="pageIndex">Page index.</param>

/// <returns>Extracted text or empty string if page wasn't found or empty.</returns>

public string ExtractTaggedText(FixedDocument pdfDoc, int pageIndex)

{

if (pdfDoc != null && pageIndex >= 0 && pdfDoc.Pages.Count > pageIndex)

{

return pdfDoc.Pages[pageIndex].ExtractText(TextExtractionOptions.TaggedText);

}

return string.Empty;

}

XML:

<textblock left="70.8" bottom="747.24" fontSize="13.98" fontFamily="Arial" fontWeight="700" fontStretch="Normal" letterSpacing="0.01" width="60.21" height="19.38">Contents</textblock>

…

</page>

The first text block is highlighted for demonstration purposes.

HtmlText

This text extraction mode is based on tagged text mode and was designed to provide base implementation of the often needed PDF to HTML conversion task. If you need an additional level of specificity you may implement it using xml output produced by tagged text mode.

HtmlText mode produces a <div> block containing preformatted text wrapped inside <pre> elements. Each of these <pre> elements becomes styled according to text properties of the corresponding PDF text object using the inline style. Location of the element within parent block is being set using relative positioning.

Sample code and produced HTML are below (page 6 from PDF32000_2008 spec):

/// <summary>

/// Extracts html text from PDF page at given index.

/// </summary>

/// <param name="pdfDoc">PDF document to extract text from.</param>

/// <param name="pageIndex">Page index.</param>

/// <returns>Extracted text or empty string if page wasn't found or empty.</returns>

public string ExtractHtmlText(FixedDocument pdfDoc, int pageIndex)

{

if (pdfDoc != null && pageIndex >= 0 && pdfDoc.Pages.Count > pageIndex)

{

return pdfDoc.Pages[pageIndex].ExtractText(TextExtractionOptions.HtmlText);

}

return string.Empty;

}

Html:

<pre style="left:70.86px;bottom:747.24px;font-size:13.98px;font-family:'Helvetica';width:81.62px;height:19.38px;">Introduction</pre>

<pre style="left:70.86px;bottom:717.42px;font-size:9.96px;font-family:'Helvetica';width:490.44px;height:13.26px;">ISO 32000 specifies a digital form for representing documents called the Portable Document Format or usually</pre>

<pre style="left:70.86px;bottom:705.95px;font-size:9.96px;font-family:'Helvetica';width:490.37px;height:13.26px;">referred to as PDF. PDF was developed and specified by Adobe Systems Incorporated beginning in 1993 and</pre>

…

</div>

All properties of PDF element which can be mapped to html style attribute are used here. Also, as you can see, the produced html block uses scoped style to set the common properties for all nested <pre> elements.

If we open the produced html in browser, it will look as follows:

Pic. 1 PDF to HTML conversion results

Conclusion

Using new text extraction modes described in this post, you’re now able to perform document text analysis involving its attributes and positions. You can also easily convert PDF to simple html for quick preview purposes.

Apitron PDF Kit for .NET can be used as standard PDF component for creation of applications requiring PDF processing. Create apps for Windows Store, Google Play Store, and Apple Store using same API. Develop server side solutions, cloud, directly managed websites or web services. Being true cross-platform and .NET / Mono / Xamarin compatible, our library raises the implementation of PDF processing logic to the unmatched level.

Downloadable version of this article can be found by the following link [PDF].

PDF tips & tricks

pdf links

2015-04-29

How to extract text from pdf page and create pdf to html conversion tool

Introduction

PDF text extraction modes

TaggedText

HtmlText

Conclusion

No comments:

Post a Comment