PDF tips & tricks: April 2015

2015-04-29

How to extract text from a PDF page and create a PDF to HTML conversion tool

Introduction

In our first post about PDF text extraction, we demonstrated how to extract raw and formatted text from a PDF document using the Apitron PDF Kit for .NET component. As part of the latest improvements, we've added new text extraction functionality and are eager to share the results with you.

Before this release, extracting text from a PDF document programmatically left you with two choices:

Process text blocks yourself by parsing PDF commands and combining the results into formatted or raw text. Frankly speaking, this is the hardest route, and it requires a thorough understanding of many text-related aspects of PDF.

Use the Apitron PDF Kit API by calling Page.ExtractText() with the RawText or FormattedText parameter.

Both techniques work well and do what they are meant for, but you may still need an easy and convenient way to analyze text attributes, e.g. color, original font name, size, etc., without diving too deep into PDF specifics. That’s why we've implemented new text extraction modes in addition to Raw and Formatted:

TaggedText – produces XML output in which each PDF text block is wrapped in an XML element carrying all available appearance information that affects the block. It also reports the coordinates of each block in page space, so you can analyze and format the text data in any way you like. See the dedicated section below.

HtmlText – a further elaboration of TaggedText. This mode is simply an attempt to reduce the effort needed for the PDF to HTML conversion task that many developers face. It is described in detail in a dedicated section below.

Note: in evaluation mode only part of the page text can be extracted, and text extraction performance might be slightly affected.

PDF text extraction modes

TaggedText

This mode generates tagged output in XML format. For each PDF page it produces a root page element containing PDF text blocks represented by the textblock element.

Supported attributes for page element are:

width – width of the page
height – height of the page

Supported attributes for textblock element are:

left – block left coordinate, in PDF coordinate system (relative to lower left corner)
bottom – block bottom coordinate, in PDF coordinate system
width – block width
height – block height
fontSize – font size in points
fontFamily – describes the font used within text block
letterSpacing – additional letter spacing
wordSpacing – additional word spacing
fontStretch – indicates a selection of normal, condensed, or expanded face from a font
fontWeight – defines boldness of the font
fontStyle – indicates italic font
nonStrokeColor – fill color, rgb
strokeColor – stroking color, rgb

All coordinates and dimensions are in page space and relative to the lower left corner of the PDF page. Text blocks are produced in the same order in which they appear on the PDF page, without any additional formatting applied. Using this markup, you can analyze the text layout and perform any context-related tasks. Knowing the nature of the document, you can create custom-tailored solutions that serve the particular needs of your application.
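Because the bottom coordinate is measured upwards from the lower left corner, a consumer that lays content out from the top (as most screen coordinate systems do) has to flip the vertical axis. The sketch below shows that arithmetic only; the class and method names are hypothetical and are not part of Apitron PDF Kit.

```csharp
using System;

public static class PageSpace
{
    // Converts a block's bottom edge (measured up from the lower left corner
    // of the page) into an offset measured down from the top edge of the page.
    // pageHeight corresponds to the height attribute of the page element;
    // bottom and blockHeight correspond to the textblock attributes.
    public static double ToTopOffset(double pageHeight, double bottom, double blockHeight)
    {
        if (pageHeight <= 0)
        {
            throw new ArgumentOutOfRangeException("pageHeight", "Page height must be positive.");
        }

        return pageHeight - bottom - blockHeight;
    }
}
```

For instance, a block with bottom 747.24 and height 19.38 on a page assumed to be 792 points high sits roughly 25.38 points below the top edge.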

Sample code and produced XML are below (page 7 from PDF32000_2008 spec):

/// <summary>
/// Extracts tagged text from PDF page at given index.
/// </summary>
/// <param name="pdfDoc">PDF document to extract text from.</param>
/// <param name="pageIndex">Page index.</param>
/// <returns>Extracted text or empty string if page wasn't found or empty.</returns>
public string ExtractTaggedText(FixedDocument pdfDoc, int pageIndex)
{
    if (pdfDoc != null && pageIndex >= 0 && pdfDoc.Pages.Count > pageIndex)
    {
        return pdfDoc.Pages[pageIndex].ExtractText(TextExtractionOptions.TaggedText);
    }

    return string.Empty;
}

XML:

<page width="…" height="…">

<textblock left="70.8" bottom="747.24" fontSize="13.98" fontFamily="Arial" fontWeight="700" fontStretch="Normal" letterSpacing="0.01" width="60.21" height="19.38">Contents</textblock>

…

</page>

The first text block is highlighted for demonstration purposes.
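As an illustration of the kind of layout analysis this markup allows, here is a minimal sketch that reads the tagged output with LINQ to XML and picks out bold blocks as heading candidates. It relies only on the element and attribute names listed above. The TextBlock and TaggedTextReader types are hypothetical helpers, and the sketch assumes that fontWeight is numeric when present (as in the sample) and falls back to 400 when it is absent, which is our assumption rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container for the attributes this sketch cares about.
public sealed class TextBlock
{
    public double Left { get; set; }
    public double Bottom { get; set; }
    public double FontSize { get; set; }
    public int FontWeight { get; set; }
    public string Text { get; set; }
}

public static class TaggedTextReader
{
    // Reads a numeric attribute using the invariant culture;
    // returns the fallback value when the attribute is missing.
    private static double ReadDouble(XElement element, string name, double fallback)
    {
        XAttribute attribute = element.Attribute(name);
        return attribute == null
            ? fallback
            : double.Parse(attribute.Value, CultureInfo.InvariantCulture);
    }

    // Parses the XML produced for a single page into a list of blocks.
    public static List<TextBlock> Parse(string taggedXml)
    {
        XElement page = XElement.Parse(taggedXml);
        return page.Descendants()
            // match on the local name so the code works with or without a namespace
            .Where(e => e.Name.LocalName == "textblock")
            .Select(e => new TextBlock
            {
                Left = ReadDouble(e, "left", 0),
                Bottom = ReadDouble(e, "bottom", 0),
                FontSize = ReadDouble(e, "fontSize", 0),
                FontWeight = (int)ReadDouble(e, "fontWeight", 400),
                Text = e.Value
            })
            .ToList();
    }

    // Returns bold blocks (fontWeight >= 700), topmost first, then left to right.
    public static List<TextBlock> FindBoldBlocks(IEnumerable<TextBlock> blocks)
    {
        return blocks
            .Where(b => b.FontWeight >= 700)
            .OrderByDescending(b => b.Bottom)
            .ThenBy(b => b.Left)
            .ToList();
    }
}
```

A caller could pass the string returned by ExtractTaggedText to TaggedTextReader.Parse and then filter or sort the blocks as needed. Sorting by descending bottom works because the coordinates grow upwards from the lower left corner of the page.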

HtmlText

This text extraction mode is based on the tagged text mode and was designed to provide a base implementation of the frequently needed PDF to HTML conversion task. If you need more control, you can implement it yourself using the XML output produced by the tagged text mode.

HtmlText mode produces a <div> block containing preformatted text wrapped in <pre> elements. Each <pre> element is styled with an inline style according to the text properties of the corresponding PDF text object. The location of the element within the parent block is set using relative positioning.

Sample code and produced HTML are below (page 6 from PDF32000_2008 spec):

/// <summary>
/// Extracts html text from PDF page at given index.
/// </summary>
/// <param name="pdfDoc">PDF document to extract text from.</param>
/// <param name="pageIndex">Page index.</param>
/// <returns>Extracted text or empty string if page wasn't found or empty.</returns>
public string ExtractHtmlText(FixedDocument pdfDoc, int pageIndex)
{
    if (pdfDoc != null && pageIndex >= 0 && pdfDoc.Pages.Count > pageIndex)
    {
        return pdfDoc.Pages[pageIndex].ExtractText(TextExtractionOptions.HtmlText);
    }

    return string.Empty;
}

HTML:

<div>

…

<pre style="left:70.86px;bottom:747.24px;font-size:13.98px;font-family:'Helvetica';width:81.62px;height:19.38px;">Introduction</pre>

<pre style="left:70.86px;bottom:717.42px;font-size:9.96px;font-family:'Helvetica';width:490.44px;height:13.26px;">ISO 32000 specifies a digital form for representing documents called the Portable Document Format or usually</pre>

<pre style="left:70.86px;bottom:705.95px;font-size:9.96px;font-family:'Helvetica';width:490.37px;height:13.26px;">referred to as PDF. PDF was developed and specified by Adobe Systems Incorporated beginning in 1993 and</pre>

…

</div>

All properties of a PDF element that can be mapped to the HTML style attribute are used here. In addition, the produced HTML block uses a scoped style to set the common properties for all nested <pre> elements.

If we open the produced HTML in a browser, it looks as follows:

Pic. 1 PDF to HTML conversion results
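Since the extraction result is a <div> fragment rather than a complete document, previewing it in a browser is easier if it is wrapped in a minimal page first. The sketch below shows one way to do that with plain string building. The HtmlPreview class is a hypothetical helper and not part of Apitron PDF Kit.

```csharp
using System.Net;
using System.Text;

public static class HtmlPreview
{
    // Wraps an HTML fragment (for example, the <div> returned by ExtractHtmlText)
    // into a minimal standalone HTML document. The title is HTML-encoded;
    // the fragment is inserted as-is because it is already markup.
    public static string WrapFragment(string htmlFragment, string title)
    {
        var builder = new StringBuilder();
        builder.AppendLine("<!DOCTYPE html>");
        builder.AppendLine("<html>");
        builder.AppendLine("<head>");
        builder.AppendLine("<meta charset=\"utf-8\">");
        builder.AppendLine("<title>" + WebUtility.HtmlEncode(title ?? string.Empty) + "</title>");
        builder.AppendLine("</head>");
        builder.AppendLine("<body>");
        builder.AppendLine(htmlFragment ?? string.Empty);
        builder.AppendLine("</body>");
        builder.AppendLine("</html>");
        return builder.ToString();
    }
}
```

The resulting string can be written to disk with File.WriteAllText and opened in any browser.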

Conclusion

Using the new text extraction modes described in this post, you can now analyze document text together with its attributes and positions. You can also easily convert PDF to simple HTML for quick preview purposes.

Apitron PDF Kit for .NET can be used as a standard PDF component for building applications that require PDF processing. Create apps for Windows Store, Google Play Store, and Apple Store using the same API. Develop server-side and cloud solutions, directly managed websites, or web services. Being truly cross-platform and .NET / Mono / Xamarin compatible, our library raises the implementation of PDF processing logic to an unmatched level.

A downloadable version of this article is available at the following link [PDF].

2015-04-24

How to add a watermark to a PDF document

What is a watermark

A watermark is usually a semitransparent drawing placed on top of the page content, and it can be created in various ways. Marking documents this way becomes necessary when you have to indicate the particular purpose a document is designed for or to give handling instructions. Examples are: “For internal reading only”, “Do not copy”, “Top Secret”, etc. It’s also useful for placing banners that indicate the name of the product the document was created by, or its evaluation state.

We’ll describe several watermarking approaches in this post and provide C# code samples which generate watermarks programmatically.

Image watermark

This type of watermark is simple and convenient. You create an image containing your message and draw it over the page content.

Pros:

Easy to create and use; a single image XObject can be shared by all pages
Provides a simple way to use any picture as a watermark

Cons:

May significantly increase the size of the resulting file if the image is large
To be transparent, the image has to include some kind of transparency mask, which can be a problem for readers that are not transparency-aware
Raster images don’t scale well, so this watermark may become pixelated when zoomed
Becomes a part of the page content

The C# code snippet below shows how to add an image watermark:

/// <summary>
/// Adds image watermark to PDF document.
/// </summary>
public void AddImageWatermark()
{
    // open existing document
    using (Stream file = File.OpenRead("Apitron PDF Kit in Action.pdf"))
    {
        FixedDocument doc = new FixedDocument(file);

        // register image XObject
        doc.ResourceManager.RegisterResource(new Image("watermark", "watermark.png", true));

        // add image watermark for each page
        foreach (Page page in doc.Pages)
        {
            page.Content.AppendImage("watermark", 0, 0,
                page.Boundary.MediaBox.Width, page.Boundary.MediaBox.Height);
        }

        // save watermarked file
        using (Stream stream = File.Create("image_watermark.pdf"))
        {
            doc.Save(stream);
        }
    }
}

The image below demonstrates the execution results:

Pic. 1 Image watermark sample
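The sample above passes the full media box width and height to AppendImage, which presumably stretches the image whenever its aspect ratio differs from that of the page. If that is not desired, the target rectangle can be computed so that the image keeps its proportions and stays centered. The following sketch covers only that arithmetic. FitRect and WatermarkLayout are hypothetical names, and obtaining the image dimensions is left to the caller.

```csharp
using System;

// Hypothetical value type describing where to draw the image on the page.
public struct FitRect
{
    public double X;
    public double Y;
    public double Width;
    public double Height;
}

public static class WatermarkLayout
{
    // Computes the largest rectangle with the image's aspect ratio that fits
    // inside the page, centered both horizontally and vertically.
    // All values are expected in the same units (e.g. points for the page).
    public static FitRect FitCentered(double imageWidth, double imageHeight,
                                      double pageWidth, double pageHeight)
    {
        if (imageWidth <= 0 || imageHeight <= 0 || pageWidth <= 0 || pageHeight <= 0)
        {
            throw new ArgumentOutOfRangeException("imageWidth", "All dimensions must be positive.");
        }

        // the smaller ratio is the one that keeps both sides within the page
        double scale = Math.Min(pageWidth / imageWidth, pageHeight / imageHeight);
        double width = imageWidth * scale;
        double height = imageHeight * scale;

        return new FitRect
        {
            X = (pageWidth - width) / 2,
            Y = (pageHeight - height) / 2,
            Width = width,
            Height = height
        };
    }
}
```

The resulting X, Y, Width and Height would then replace the 0, 0, MediaBox.Width, MediaBox.Height arguments in the AppendImage call.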

Form XObject watermark

This type of watermark assumes basic knowledge of the PDF drawing system. With this approach, it’s easy to create vector-based drawings suitable for watermarking.

Pros:

Compact; a single watermark form XObject can be shared by all pages
Scales well if it contains vector drawings only; requires no transparency mask

Cons:

Requires some knowledge of the PDF drawing system
Becomes a part of the page content

Let’s create a simple text-based watermark using the C# code below:

public void AddFormXObjectWatermark()
{
    // open existing document
    using (Stream file = File.OpenRead("Apitron PDF Kit in Action.pdf"))
    {
        FixedDocument pdfDocument = new FixedDocument(file);

        // define watermark transparency using graphics state
        GraphicsState watermarkGS = new GraphicsState("gs0") { CurrentNonStrokingAlpha = 0.2 };

        // register graphics state object
        pdfDocument.ResourceManager.RegisterResource(watermarkGS);

        // create watermark form XObject
        FixedContent watermark = new FixedContent("watermark", pdfDocument.Pages[0].Boundary.MediaBox);

        // register form XObject
        pdfDocument.ResourceManager.RegisterResource(watermark);

        // define text and transformation for it
        TextObject watermarkText = new TextObject(StandardFonts.Helvetica, 48);
        watermarkText.AppendText("Apitron PDF Kit for .NET");
        watermark.Content.ModifyCurrentTransformationMatrix(1, 1.25, -1.25, 1, 50, 50);

        // define current color and transparency
        watermark.Content.SetGraphicsState("gs0");
        watermark.Content.SetDeviceNonStrokingColor(RgbColors.Red.Components);

        // draw watermark text
        watermark.Content.AppendText(watermarkText);

        // add watermark to each page
        foreach (Page page in pdfDocument.Pages)
        {
            page.Content.AppendXObject("watermark");
        }

        // save watermarked file
        using (Stream stream = File.Create("formXObject_watermark.pdf"))
        {
            pdfDocument.Save(stream);
        }
    }
}

The result is shown below. You may notice that it looks sharper because of its vector nature:

Pic. 2 Watermark added using form XObject
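The six numbers passed to ModifyCurrentTransformationMatrix in the sample are easier to choose once you know how they relate to rotation and scale. Assuming the arguments follow the order of the PDF cm operator (a, b, c, d, e, f), a counterclockwise rotation by an angle θ combined with a uniform scale s and an offset (tx, ty) is written as (s·cos θ, s·sin θ, −s·sin θ, s·cos θ, tx, ty). The sketch below builds such a matrix and also recovers the angle and scale from existing values. WatermarkMatrix is a hypothetical helper and not part of Apitron PDF Kit.

```csharp
using System;

public static class WatermarkMatrix
{
    // Builds {a, b, c, d, e, f} for a counterclockwise rotation by angleDegrees,
    // a uniform scale, and a translation by (offsetX, offsetY).
    public static double[] Create(double angleDegrees, double scale, double offsetX, double offsetY)
    {
        double radians = angleDegrees * Math.PI / 180.0;
        double cos = Math.Cos(radians) * scale;
        double sin = Math.Sin(radians) * scale;
        return new[] { cos, sin, -sin, cos, offsetX, offsetY };
    }

    // Recovers the rotation angle in degrees from the a and b components.
    public static double RotationDegrees(double a, double b)
    {
        return Math.Atan2(b, a) * 180.0 / Math.PI;
    }

    // Recovers the uniform scale factor from the a and b components.
    public static double UniformScale(double a, double b)
    {
        return Math.Sqrt(a * a + b * b);
    }
}
```

Under that assumption, the values 1, 1.25, -1.25, 1 used in the sample correspond to a rotation of about 51.3 degrees with the text enlarged by a factor of about 1.6, and 50, 50 moves the origin of the text away from the page corner.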

Watermark annotation

A watermark annotation can be used to represent graphics that are to be printed at a fixed size and position on a page, regardless of the dimensions of the printed page.

Pros:

Compactness, designed specifically for watermarks
Can be easily managed using page annotations dictionary
Requires no transparency mask

Cons:

Requires some knowledge of PDF drawing system and annotations

// Adds watermark annotation to the document.
public static void AddWatermarkAnnotation()
{
    // open existing document
    using (Stream file = File.OpenRead("Apitron PDF Kit in Action.pdf"))
    {
        FixedDocument pdfDocument = new FixedDocument(file);

        // define watermark transparency using graphics state and register this object
        GraphicsState watermarkGS = new GraphicsState("gs0") { CurrentNonStrokingAlpha = 0.2 };
        pdfDocument.ResourceManager.RegisterResource(watermarkGS);

        // create watermark content
        FixedContent watermark = new FixedContent("watermark", pdfDocument.Pages[0].Boundary.MediaBox);

        // define text and transformation for it
        TextObject watermarkText = new TextObject(StandardFonts.Helvetica, 48);
        watermarkText.AppendText("Apitron PDF Kit for .NET");
        watermark.Content.ModifyCurrentTransformationMatrix(1, 1.25, -1.25, 1, 50, 50);

        // define current color and transparency
        watermark.Content.SetGraphicsState("gs0");
        watermark.Content.SetDeviceNonStrokingColor(RgbColors.Red.Components);

        // draw watermark text
        watermark.Content.AppendText(watermarkText);

        // create watermark annotation object for each page
        foreach (Page page in pdfDocument.Pages)
        {
            WatermarkAnnotation annotation = new WatermarkAnnotation(page.Boundary.MediaBox);
            annotation.Appearance.Normal = watermark;
            page.Annotations.Add(annotation);
        }

        // save watermarked file
        using (Stream stream = File.Create("watermark_annotation.pdf"))
        {
            pdfDocument.Save(stream);
        }
    }
}

The code creating the watermark annotation produces the same results as the code that adds the form XObject watermark.

Watermarks removal

It’s possible to remove watermarks from a PDF file; however, we don’t recommend doing so because it can cause legal problems. The techniques used involve content analysis as well as annotation checks. There is, however, no 100% reliable method of removing all watermarking information with a single algorithm, because watermarks might be hidden in PDF metadata or other less evident places.

For example, one may use a fully transparent image that appears only when the document is printed. Think of a watermark as a piece of information hidden inside the PDF file; it can be just about anything.

Conclusion

Adding watermarks is not a tricky task and, as you can see, it can be completed quite easily using the Apitron PDF Kit for .NET component. This component is available for many platforms and lets you create applications for Windows and Windows Store, Xamarin.iOS and Xamarin.Android, OS X, or any other system where .NET/Mono can run. ASP.NET and Azure environments are supported as well. You may visit its product page or browse documentation here.

A downloadable version of this article is available at the following link [PDF].
