PDF tips & tricks: Search text in PDF documents using regular expressions

2016-02-22

Introduction

Searching for text in a PDF document is easy, and this feature has been available to users of our Apitron PDF Rasterizer for .NET component for many releases. We’ve now updated the API so that you can search for text on a PDF page using standard .NET regular expression objects (Regex).

The text search API offered by Apitron PDF Rasterizer is decoupled from the rendering part and can be used independently. It’s represented by the SearchIndex class, which handles all search tasks and offers useful features such as building search indices for documents and saving and loading those indices for later use.

Using the search API, you can also highlight text on rendered pages, because each search result carries the information needed to locate the text on the PDF page.

See the code section for details.
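
To make the position data a little more concrete: in the sample in the next section, each highlighted block arrives as a flat array of doubles that the code reads as consecutive x, y pairs. A minimal sketch of that conversion, independent of the library, might look like the following. The BlockGeometry class and the ToPolygon name are illustrative only and are not part of the Apitron API, and the x, y pair layout is an assumption based on how the sample reads the array.

```csharp
using System;
using System.Drawing;

static class BlockGeometry
{
    // Converts a flat [x0, y0, x1, y1, ...] coordinate array into polygon vertices.
    // Assumption: coordinates are stored as consecutive x, y pairs.
    public static PointF[] ToPolygon(double[] block)
    {
        if (block == null)
        {
            throw new ArgumentNullException(nameof(block));
        }

        if (block.Length % 2 != 0)
        {
            throw new ArgumentException("Expected an even number of coordinates.", nameof(block));
        }

        PointF[] points = new PointF[block.Length / 2];

        for (int i = 0; i < points.Length; i++)
        {
            points[i] = new PointF((float)block[i * 2], (float)block[(i * 2) + 1]);
        }

        return points;
    }
}
```

For example, BlockGeometry.ToPolygon(new double[] { 10, 20, 110, 20, 110, 35, 10, 35 }) yields four vertices that can be passed straight to Graphics.FillPolygon. Sizing the loop by the number of points, rather than by a fixed count, keeps the helper safe if a block ever carries more or fewer vertices.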

The code

// System namespaces used by the sample
using System;
using System.Diagnostics;
using System.Drawing;
using System.IO;
using System.Text.RegularExpressions;
// Apitron namespaces (assumed here; adjust to match your version of the library)
using Apitron.PDF.Rasterizer;
using Apitron.PDF.Rasterizer.Configuration;
using Apitron.PDF.Rasterizer.Search;

class Program
{
    // global rendering settings
    static RenderingSettings renderingSettings = new RenderingSettings();

    // highlight brush for search results (semi-transparent yellow)
    static Brush highlightBrush = new SolidBrush(Color.FromArgb(100, 255, 255, 0));

    static void Main(string[] args)
    {
        // the source file to search
        string inputFilePath = "../../data/Apitron_Pdf_Kit_in_Action.pdf";

        // open the PDF document for search and rendering;
        // we'll use 2 different streams here
        using (Stream searchStream = new FileStream(inputFilePath, FileMode.Open, FileAccess.Read),
                      documentStream = new FileStream(inputFilePath, FileMode.Open, FileAccess.Read))
        {
            // create search object from PDF data stream
            using (SearchIndex searchIndex = new SearchIndex(searchStream))
            {
                // open document to be used for rendering
                using (Document doc = new Document(documentStream))
                {
                    searchIndex.Search((handlerArgs =>
                    {
                        // if we have results
                        if (handlerArgs.ResultItems.Count != 0)
                        {
                            // create resulting image filename
                            string outputFileName = string.Format("{0}_{1}.png",
                                Path.GetFileNameWithoutExtension(inputFilePath),
                                handlerArgs.PageIndex);

                            // render the page, highlight the results and save the image
                            Page page = doc.Pages[handlerArgs.PageIndex];

                            using (Image bitmap = page.Render(new Resolution(96, 96), renderingSettings))
                            {
                                foreach (SearchResultItem searchResultItem in handlerArgs.ResultItems)
                                {
                                    HighlightSearchResult(bitmap, searchResultItem, page);
                                }

                                bitmap.Save(outputFileName);
                            }

                            // open the saved image in the system image viewer
                            Process.Start(outputFileName);
                        }
                    }),
                    // find everything that matches [WORD][whitespaces]Kit pattern
                    new Regex("\\w+\\s+Kit"));
                }
            }
        }
    }

    /// <summary>
    /// Highlights the search result.
    /// </summary>
    /// <param name="bitmap"> The bitmap. </param>
    /// <param name="searchResultItem"> The search result item. </param>
    /// <param name="page"> The page. </param>
    private static void HighlightSearchResult(Image bitmap,
        SearchResultItem searchResultItem,
        Page page)
    {
        using (Graphics gr = Graphics.FromImage(bitmap))
        {
            // map the result region from page coordinates to bitmap coordinates
            SearchResultRegion region = page.TransformRegion(searchResultItem.Region,
                bitmap.Width, bitmap.Height, renderingSettings);

            foreach (double[] block in region.Blocks)
            {
                // each block is a flat array of x, y pairs describing a polygon
                PointF[] points = new PointF[block.Length / 2];

                for (int i = 0; i < points.Length; i++)
                {
                    points[i] = new PointF((float)block[i * 2],
                        (float)block[(i * 2) + 1]);
                }

                gr.FillPolygon(highlightBrush, points);
            }
        }
    }
}
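
Since the pattern passed to Search is an ordinary .NET Regex, it can be tried out on a plain string before running it against a document. The sketch below uses only System.Text.RegularExpressions; the PatternDemo class and the FindAll helper are hypothetical names introduced for illustration, and the exact text the library feeds to the pattern for a real page may differ from a hand-written string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PatternDemo
{
    // Returns every substring of 'text' matched by 'pattern', in order of appearance.
    public static List<string> FindAll(string text, string pattern)
    {
        List<string> results = new List<string>();

        foreach (Match match in new Regex(pattern).Matches(text))
        {
            results.Add(match.Value);
        }

        return results;
    }
}
```

Calling PatternDemo.FindAll("Apitron PDF Kit and the PDF  Kit samples; Kit alone is skipped.", @"\w+\s+Kit") returns two matches, "PDF Kit" and "PDF  Kit" (the second with two spaces). "samples; Kit" is not matched because the semicolon is neither a word character nor whitespace, and a bare "Kit" with no preceding word is not matched either.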

The complete code sample can be found in our GitHub repo. The results of the execution are shown below. Note that in evaluation mode the search API searches for text on the first three pages only.

Pic. 1 Search text in PDF document - highlighted text

Summary

Apitron PDF Rasterizer for .NET is a comprehensive solution that you can use for PDF rendering and also for implementing text search in PDF documents. It’s a cross-platform library available for many .NET-based platforms (Xamarin, Mono and .NET, to name a few) and can be used to create mobile, desktop and web applications. Contact us if you have any questions regarding our products or services.
