PDF tips & tricks: February 2016

2016-02-22

Search text in PDF documents using regular expressions

Introduction

Searching for text in a PDF document is easy, and this feature became available to users of our Apitron PDF Rasterizer for .NET component many releases ago. Now we’ve updated the API, and you can search for text on a PDF page using standard .NET regular expression objects (Regex).

The text search API offered by Apitron PDF Rasterizer is decoupled from the rendering part and can be used independently. It’s represented by the SearchIndex class, which handles all search tasks and offers useful features such as building search indices for documents and saving and loading those indices for later use.

Using the search API, you can also highlight text on rendered pages, because each search result carries the position of the matched text on the PDF page.

See the code section for details.

The code

using System.Diagnostics;
using System.Drawing;
using System.IO;
using System.Text.RegularExpressions;
// Also import the Apitron PDF Rasterizer namespaces that declare Document, Page,
// RenderingSettings, Resolution, SearchIndex, SearchResultItem and SearchResultRegion.

class Program
{
    // global rendering settings
    static RenderingSettings renderingSettings = new RenderingSettings();

    // highlight brush for search results
    static Brush highlightBrush = new SolidBrush(Color.FromArgb(100, 255, 255, 0));

    static void Main(string[] args)
    {
        // the source file to search
        string inputFilePath = "../../data/Apitron_Pdf_Kit_in_Action.pdf";

        // open the PDF document for search and rendering;
        // we use two separate streams here
        using (Stream searchStream = new FileStream(inputFilePath, FileMode.Open, FileAccess.Read),
                      documentStream = new FileStream(inputFilePath, FileMode.Open, FileAccess.Read))
        {
            // create the search object from the PDF data stream
            using (SearchIndex searchIndex = new SearchIndex(searchStream))
            {
                // open the document to be used for rendering
                using (Document doc = new Document(documentStream))
                {
                    searchIndex.Search((handlerArgs =>
                    {
                        // if we have results
                        if (handlerArgs.ResultItems.Count != 0)
                        {
                            // create the resulting image filename
                            string outputFileName = string.Format("{0}_{1}.png",
                                Path.GetFileNameWithoutExtension(inputFilePath),
                                handlerArgs.PageIndex);

                            // render the page, highlight the results and save the image
                            Page page = doc.Pages[handlerArgs.PageIndex];

                            using (Image bitmap = page.Render(new Resolution(96, 96), renderingSettings))
                            {
                                foreach (SearchResultItem searchResultItem in handlerArgs.ResultItems)
                                {
                                    HighlightSearchResult(bitmap, searchResultItem, page);
                                }

                                bitmap.Save(outputFileName);
                            }

                            // start the system image viewer
                            Process.Start(outputFileName);
                        }
                    }),
                    // find everything that matches the [WORD][whitespace]Kit pattern
                    new Regex("\\w+\\s+Kit"));
                }
            }
        }
    }

    /// <summary>
    /// Highlights the search result.
    /// </summary>
    /// <param name="bitmap"> The bitmap. </param>
    /// <param name="searchResultItem"> The search result item. </param>
    /// <param name="page"> The page. </param>
    private static void HighlightSearchResult(Image bitmap, SearchResultItem searchResultItem, Page page)
    {
        using (Graphics gr = Graphics.FromImage(bitmap))
        {
            // map the found region from page coordinates to bitmap coordinates
            SearchResultRegion region = page.TransformRegion(searchResultItem.Region,
                bitmap.Width, bitmap.Height, renderingSettings);

            foreach (double[] rectangle in region.Blocks)
            {
                // each block is a flat list of x, y pairs describing a polygon
                PointF[] points = new PointF[rectangle.Length / 2];

                for (int i = 0; i < points.Length; i++)
                {
                    points[i] = new PointF((float)rectangle[i * 2], (float)rectangle[(i * 2) + 1]);
                }

                gr.FillPolygon(highlightBrush, points);
            }
        }
    }
}

The complete code sample can be found in our github repo. The results of the execution are shown below. Please note that in evaluation mode the search API searches for text on the first three pages only.

Pic. 1 Search text in PDF document - highlighted text
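To get a feel for what the \w+\s+Kit pattern from the sample selects, here is a minimal sketch that uses only the standard .NET Regex class on an in-memory string. The RegexPreview class and the sample text are made up for illustration and no Apitron API is involved; it assumes that matching inside SearchIndex follows the usual Regex semantics.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Apitron PDF Rasterizer.
public static class RegexPreview
{
    // Returns every substring that matches [WORD][whitespace]Kit.
    public static List<string> FindKitPhrases(string text)
    {
        var regex = new Regex(@"\w+\s+Kit");
        var results = new List<string>();
        foreach (Match match in regex.Matches(text))
        {
            results.Add(match.Value);
        }
        return results;
    }

    public static void Main()
    {
        // made-up sample text; note the double space in the second phrase
        string text = "Apitron PDF Kit in Action: the PDF  Kit API and a Starter Kit.";
        foreach (string phrase in FindKitPhrases(text))
        {
            Console.WriteLine("[" + phrase + "]");
        }
    }
}
```

Because \s+ accepts any run of whitespace, the phrase with two spaces is matched as well, while "Apitron" on its own is skipped because it is not immediately followed by "Kit".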

Summary

Apitron PDF Rasterizer for .NET is a comprehensive solution that you can use for PDF rendering and also for implementing text search in PDF documents. It’s a cross-platform library available for many .NET-based platforms (Xamarin, Mono and .NET, to name a few) and can be used to create mobile, desktop and web applications. Contact us if you have any questions regarding our products or services.

2016-02-13

PDFA validation - overcoming limitations of validation tools

Introduction

PDF/A is a good choice when it comes to archiving and saving documents for later use. The format is designed so that the document can be read years after creation, because all resources needed to process the document are embedded into the file. Sometimes PDF/A is set as a requirement for saving documents with digital signatures, e.g. contracts, official papers and so on.

There are plenty of tools on the market that claim to produce PDF/A documents, and the only way to verify such a claim is to check the output with a PDF/A validation tool.

The most popular and reliable tool, from our point of view, is Adobe Acrobat Professional, the paid professional counterpart of the well-known Adobe Reader. It allows you to validate a document against many conditions, including PDF/A compliance, using the built-in Preflight tool. As Adobe is the author of the PDF standard, it knows all the ins and outs of PDF/A as well.

There are other PDF/A validation tools produced by various software companies, but sometimes their results differ from those of Adobe Acrobat Professional because the PDF/A specification can be interpreted in more than one way.

We use Adobe as the gold standard, and the Apitron PDF Kit for .NET product produces files that are 100% verifiable by Adobe Acrobat Professional. If you use the same toolchain, you don’t have to worry. This post describes possible warnings produced by other tools and the custom settings needed to avoid them.

One of the possible warnings is “the file contains cross reference streams”. It relates to the internal storage format of the object-to-id mapping in a PDF document. PDF versions prior to 1.5 (released in 2003) used cross-reference tables instead of cross-reference stream objects. The advantages of using streams over tables are:

• A more compact representation of cross-reference information

• The ability to access compressed objects that are stored in object streams (see 7.5.7, "Object Streams" section of the specification) and to allow new cross-reference entry types to be added in the future

The current PDF version is 1.7 (updated 2011), so this is a pretty old feature, and PDF/A (released in 2005) doesn’t forbid the use of such objects. For those who need to get rid of the cross-reference stream warning, we introduced a new setting in the PDF export API. The code sample can be found in the next section.

The code

using System.Diagnostics;
using System.IO;
// Also import the Apitron PDF Kit namespaces that declare FixedDocument and PdfStandard.

class Program
{
    static void Main(string[] args)
    {
        using (Stream stream = File.Open(@"../../data/document.pdf", FileMode.Open, FileAccess.Read))
        {
            // create document object and specify the output format
            FixedDocument doc = new FixedDocument(stream, PdfStandard.PDFA);

            // save document
            using (Stream outputStream = File.Create(@"pdfa_document.pdf"))
            {
                // turn off cross reference stream usage
                doc.IsCompressedStructure = false;

                doc.Save(outputStream);
            }

            // open the saved file in the default PDF viewer
            Process.Start("pdfa_document.pdf");
        }
    }
}

You can see that setting the IsCompressedStructure property makes it possible to control cross-reference stream usage. The complete code sample can be found in our github repo.
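If you want a quick look at which cross-reference form a saved file uses before running it through a validator, a rough heuristic is to scan the raw bytes for the classic xref keyword and for a /Type /XRef stream dictionary. The sketch below uses only the .NET base class library; the XrefInspector class is hypothetical and not part of Apitron PDF Kit. It is a text scan rather than a PDF parser, so binary stream data can produce false positives, and hybrid files may legitimately contain both forms.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; a heuristic text scan, not a PDF parser.
public static class XrefInspector
{
    // A classic table starts with the bare keyword "xref".
    // The lookbehind skips the "startxref" keyword that every PDF file contains.
    public static bool HasXrefTable(string pdfText)
    {
        return Regex.IsMatch(pdfText, @"(?<![A-Za-z])xref\b");
    }

    // A cross-reference stream is a stream object whose dictionary has /Type /XRef.
    public static bool HasXrefStream(string pdfText)
    {
        return Regex.IsMatch(pdfText, @"/Type\s*/XRef\b");
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: XrefInspector <file.pdf>");
            return;
        }

        // ASCII decoding is enough for keyword scanning; other bytes become '?'
        string pdfText = Encoding.ASCII.GetString(File.ReadAllBytes(args[0]));

        Console.WriteLine("cross-reference table:  " + HasXrefTable(pdfText));
        Console.WriteLine("cross-reference stream: " + HasXrefStream(pdfText));
    }
}
```

Running it against a file saved with the setting turned off and against one saved with the default settings should make the difference visible, assuming the setting behaves as described above.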

The image below demonstrates PDFA document validation using Adobe Acrobat:

Pic. 1 PDFA validation

Summary

The Apitron PDF Kit for .NET is a powerful library for creation and manipulation of PDF and PDF/A documents. This product has many unique features, offers an easy-to-use API and is cross-platform, which means you can create apps for .NET (Windows, Windows Phone, Windows Store), iOS and Android (via Xamarin) and Mono, targeting modern mobile, desktop and web platforms at once. Contact us and we’ll be happy to answer your questions.

2016-02-06

How to add layers to PDF page using optional content

Introduction

When exporting a PDF document, you sometimes need several versions of the same content on one page, along with an option to show only one version at a time.

A multilanguage report or manual is a perfect example of such a document. Instead of producing a separate file for each language, you could create a single file containing all the necessary information. A user would be able to switch content versions with a single click by selecting the appropriate layer.

Another example of layered content structure is an engineering drawing or complex schema composed of different logically separated parts which could be made visible or invisible on demand.

All these things are made possible by a PDF feature called optional content; see section “8.11 Optional Content” of the PDF specification for the details. The Apitron PDF Kit for .NET component provides an API for creating and manipulating layers. Using this product you can easily create layered content in your PDF documents.

In general, creating multiple layers on a PDF page looks as follows:

1. Create several OptionalContentGroup objects and register them as document resources – these objects represent layer identifiers in PDF.

2. Create the OptionalContentConfiguration object and set its properties, which control the behavior and the visual layer structure shown in the reader’s UI. This object combines layers together, and you can use it to define initially visible layers, locked layers, layers that should work as radio buttons, etc. You can also define the visual tree structure: parent layer nodes and child nodes.

3. Create and initialize the OptionalContentProperties object required by the FixedDocument object. This object defines the default configuration used by the PDF reader to show layers and specifies the list of layers (OptionalContentGroup resource ids) actually referenced in the document’s content (because not all registered layer ids may be in use).

4. Use ClippedContent objects to define the layers and set their OptionalContentID property to one of the registered layer ids (OptionalContentGroup resource ids). Put these objects on the PDF page using the Page.Content.AppendContent(…) method.

The code demonstrating these steps can be found in the next section.

The code

using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Reflection;
// Also import the Apitron PDF Kit namespaces that declare FixedDocument, Page,
// ClippedContent, the OptionalContent* types, ContentElement, Section and Br.

class Program
{
    static void Main(string[] args)
    {
        using (Stream stream = File.Create("manual.pdf"))
        {
            // create our PDF document
            using (FixedDocument doc = new FixedDocument())
            {
                // turn on the layers panel when opened
                doc.PageMode = PageMode.UseOC;

                // register image resource
                doc.ResourceManager.RegisterResource(
                    new Apitron.PDF.Kit.FixedLayout.Resources.XObjects.Image(
                        "chair", "../../data/chair.jpg"));

                // FIRST STEP: create layer definitions,
                // they should be registered as document resources
                OptionalContentGroup group0 = new OptionalContentGroup("group0", "Page layers", IntentName.View);
                doc.ResourceManager.RegisterResource(group0);

                OptionalContentGroup group1 = new OptionalContentGroup("group1", "Chair image", IntentName.View);
                doc.ResourceManager.RegisterResource(group1);

                // the ids of the language layers match the property names
                // of the strings resource class used in AppendTextLayers
                OptionalContentGroup group2 = new OptionalContentGroup("English", "English", IntentName.View);
                doc.ResourceManager.RegisterResource(group2);

                OptionalContentGroup group3 = new OptionalContentGroup("Dansk", "Dansk", IntentName.View);
                doc.ResourceManager.RegisterResource(group3);

                OptionalContentGroup group4 = new OptionalContentGroup("Deutch", "Deutch", IntentName.View);
                doc.ResourceManager.RegisterResource(group4);

                OptionalContentGroup group5 = new OptionalContentGroup("Русский", "Русский", IntentName.View);
                doc.ResourceManager.RegisterResource(group5);

                OptionalContentGroup group6 = new OptionalContentGroup("Nederlands", "Nederlands", IntentName.View);
                doc.ResourceManager.RegisterResource(group6);

                OptionalContentGroup group7 = new OptionalContentGroup("Français", "Français", IntentName.View);
                doc.ResourceManager.RegisterResource(group7);

                OptionalContentGroup group8 = new OptionalContentGroup("Italiano", "Italiano", IntentName.View);
                doc.ResourceManager.RegisterResource(group8);

                // SECOND STEP:
                // create the configuration,
                // it allows you to combine the layers together in any order

                // Default configuration:
                OptionalContentConfiguration config = new OptionalContentConfiguration("configuration");

                // add groups to the lists which define the rules controlling
                // their visibility

                // ON groups
                config.OnGroups.Add(group0);
                config.OnGroups.Add(group1);
                config.OnGroups.Add(group2);

                // OFF groups
                config.OffGroups.Add(group3);
                config.OffGroups.Add(group4);
                config.OffGroups.Add(group5);
                config.OffGroups.Add(group6);
                config.OffGroups.Add(group7);
                config.OffGroups.Add(group8);

                // lock the image layer
                config.LockedGroups.Add(group1);

                // make the other layers work as radio buttons,
                // only one translation will be visible at a time
                config.RadioButtonGroups.Add(new[] { group2, group3, group4, group5,
                    group6, group7, group8 });

                // show only groups referenced by visible pages
                config.ListMode = ListMode.VisiblePages;

                // initialize the states for all content groups,
                // for the default configuration it should be on
                config.BaseState = OptionalContentGroupState.On;

                // set the name of the presentation tree
                config.Order.Name = "Default config";

                // create a root node + sub elements
                config.Order.Entries.Add(group0);
                config.Order.Entries.Add(new OptionalContentGroupTree(group1, group2,
                    group3, group4, group5, group6, group7, group8));

                // FINAL step:
                // assign the configuration properties to the document,
                // all configurations and groups should be specified
                doc.OCProperties = new OptionalContentProperties(config,
                    new OptionalContentConfiguration[] {},
                    new[] { group0, group1, group2, group3, group4, group5, group6, group7, group8 });

                // create the page and assign the top layer id to its content,
                // it will allow you to completely hide the page's
                // content using the configuration we have created
                Page page = new Page();
                page.Content.OptionalContentID = "group0";

                // create image layer
                ClippedContent imageBlock = new ClippedContent(0, 0, 245, 300);

                // set the layer id
                imageBlock.OptionalContentID = "group1";
                imageBlock.AppendImage("chair", 0, 0, 245, 300);

                // put the layer on the page
                page.Content.SaveGraphicsState();
                page.Content.Translate(0, 530);
                page.Content.AppendContent(imageBlock);
                page.Content.RestoreGraphicsState();

                // append text layers
                AppendTextLayers(page);

                // add the page to the document and save it
                doc.Pages.Add(page);
                doc.Save(stream);
            }
        }

        // the output stream is closed at this point, open the file in the default viewer
        Process.Start("manual.pdf");
    }

    static void AppendTextLayers(Page page)
    {
        page.Content.SaveGraphicsState();
        page.Content.Translate(250, 325);

        // evaluate each property of the resource class and add its text to the PDF page
        foreach (PropertyInfo info in typeof(strings).GetRuntimeProperties())
        {
            if (info.PropertyType == typeof(string))
            {
                ClippedContent textContent = new ClippedContent(0, 0, 300, 500);

                // assign layer id
                textContent.OptionalContentID = info.Name;
                textContent.Translate(0, 0);

                // preprocess parsed elements and set additional properties
                // for better visual appearance
                IEnumerable<ContentElement> elements =
                    ContentElement.FromMarkup((string)info.GetValue(null));

                foreach (Br lineBreak in elements.OfType<Br>())
                {
                    lineBreak.Height = 10;
                }

                foreach (Section subSection in elements.OfType<Section>())
                {
                    subSection.Font = new Apitron.PDF.Kit.Styles.Text.Font("HelveticaBold", 14);
                }

                // draw text
                textContent.AppendContentElement(new Section(elements), 300, 500);

                // put the text layer on the page
                page.Content.AppendContent(textContent);
            }
        }

        page.Content.RestoreGraphicsState();
    }
}

You can see that we used content elements from the Flow layout API to prepare the translated text blocks. More information about the Fixed and Flow layout APIs can be found by this link.

We followed the algorithm described in the Introduction section exactly:

1. Created layer identifier resources and registered them

2. Created default layers configuration in a form of tree view and configured layers to work as radio buttons, except the image layer which we marked as locked to demonstrate this feature

3. Let the document know about the created configuration and layers used

4. Marked all layers with corresponding registered layer ids

5. Saved the PDF document

The complete code sample can be downloaded from our github repo (link).

Resulting PDF document looks as follows:

Pic. 1 Multilanguage PDF document with layers

You can see the locked layer containing the chair image and the language layers available for viewing. These language layers work as a radio button group: when one is turned on, the others go off.
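The radio button behavior can be pictured with a small conceptual model. The sketch below is not Apitron PDF Kit code and does not describe how any particular PDF reader is implemented; the RadioLayerGroup class is hypothetical and only mirrors the rule that at most one layer of the group is on at a time.

```csharp
using System;
using System.Collections.Generic;

// Conceptual model only: this type is hypothetical and is not part of
// Apitron PDF Kit or of any PDF viewer.
public class RadioLayerGroup
{
    private readonly HashSet<string> members;

    // Name of the layer that is currently on, or null when all are off.
    public string VisibleLayer { get; private set; }

    public RadioLayerGroup(IEnumerable<string> layerNames, string initiallyOn)
    {
        members = new HashSet<string>(layerNames);
        TurnOn(initiallyOn);
    }

    // Turning a layer on implicitly turns every other member off.
    public void TurnOn(string layerName)
    {
        if (!members.Contains(layerName))
        {
            throw new ArgumentException("Unknown layer: " + layerName);
        }
        VisibleLayer = layerName;
    }

    // Turning off the visible layer leaves the whole group hidden.
    public void TurnOff(string layerName)
    {
        if (VisibleLayer == layerName)
        {
            VisibleLayer = null;
        }
    }

    public bool IsOn(string layerName)
    {
        return VisibleLayer == layerName;
    }
}

public static class RadioLayerDemo
{
    public static void Main()
    {
        // layer ids taken from the sample above; English starts as the visible one
        var languages = new RadioLayerGroup(
            new[] { "English", "Dansk", "Deutch", "Русский", "Nederlands", "Français", "Italiano" },
            "English");

        languages.TurnOn("Dansk");

        Console.WriteLine(languages.IsOn("English")); // prints "False"
        Console.WriteLine(languages.IsOn("Dansk"));   // prints "True"
    }
}
```

The locked chair image layer is deliberately left out of the model: it is not a member of the radio button group, so switching languages never affects it.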

Summary

The Apitron PDF Kit for .NET is a powerful tool for creation and manipulation of PDF and PDF/A documents. It’s cross-platform and can be used to create .NET, Mono and Xamarin applications for Windows, iOS, Android and other operating systems. You can read more about the library on the product page. Contact us if you have any questions and we’ll be glad to assist you.
