While building apps with Apitron PDF Rasterizer for .NET, you have probably
thought about implementing your own PDF text search.
This feature is useful for custom viewers and rounds out your PDF rendering
toolkit: you get Apitron PDF Rasterizer for rendering, Apitron PDF Viewer (our free PDF viewer control) for
viewing, and integrated text search for fast and efficient search and
navigation.
We did it for you.
From now on, Apitron PDF Rasterizer has an integrated
text search engine that you can easily use in your apps.
The code
For demonstration purposes, we will review a sample that
opens a PDF file, searches for some text, renders the corresponding PDF page, and highlights
the results (the complete C# code can be found in the samples\SearchAndHighlightSpecifiedText folder of our download package).
Our PDF text search engine is built around a search index: before
searching, we analyze the document and build its “index”, the data used for the
actual search. You can store the index for later use, which avoids
rebuilding it the next time the document is opened.
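If you do keep a stored index around, you also need some way to tell whether it still matches the document. The sketch below shows one possible approach that compares file timestamps. It uses only System.IO; the IndexCache class, the ".index" file suffix, and both method names are assumptions made for illustration and are not part of the Apitron API. The actual calls that save and load the index are not shown in this post, so they are left out here.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the Apitron API.
public static class IndexCache
{
    // Assumed convention: the index is stored next to the PDF as "<file>.index".
    public static string GetIndexPath(string pdfPath)
    {
        return pdfPath + ".index";
    }

    // A stored index is treated as reusable only if it exists and
    // was written no earlier than the last change to the PDF.
    public static bool IsIndexFresh(string pdfPath, string indexPath)
    {
        if (!File.Exists(indexPath))
        {
            return false;
        }
        return File.GetLastWriteTimeUtc(indexPath) >= File.GetLastWriteTimeUtc(pdfPath);
    }
}

public static class IndexCacheDemo
{
    public static void Main()
    {
        // Work in a throwaway directory with placeholder files.
        string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        try
        {
            string pdf = Path.Combine(dir, "sample.pdf");
            File.WriteAllText(pdf, "placeholder");
            string index = IndexCache.GetIndexPath(pdf);

            // No index file yet, so it has to be built.
            Console.WriteLine(IndexCache.IsIndexFresh(pdf, index)); // False

            // Pretend the index was saved one second after the PDF was written.
            File.WriteAllText(index, "placeholder");
            File.SetLastWriteTimeUtc(index, File.GetLastWriteTimeUtc(pdf).AddSeconds(1));
            Console.WriteLine(IndexCache.IsIndexFresh(pdf, index)); // True
        }
        finally
        {
            Directory.Delete(dir, true);
        }
    }
}
```

When the check returns false, you would rebuild the index from the PDF stream as shown in the next section and store it again.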
Index creation
FileStream pdfDocumentStreamToSearch = new FileStream( Path.Combine( pathToDocuments, "2003_ar.pdf" ), FileMode.Open, FileAccess.Read );
SearchIndex searchIndex = new SearchIndex( pdfDocumentStreamToSearch );
As you can see, index creation takes just a few lines
of code. It’s also possible to save the index to an output stream for later use. Password-protected PDF files are also supported.
Text search
I used a PDF document from the Adobe website, http://www.adobe.com/aboutadobe/invrelations/pdfs/2003_ar.pdf,
and all images attached to this post show actual output from the code
sample.
// Note: "document" (used here) and "renderingSettings" (used in SearchHandler)
// are static fields of the sample class; their declarations are not shown in this excerpt.
static void Main(string[] args)
{
    string pathToDocument = @"..\Documents\2003_ar.pdf";

    // create index from PDF file
    using ( Stream pdfDocumentStreamToSearch = new FileStream( pathToDocument, FileMode.Open, FileAccess.Read ) )
    {
        SearchIndex searchIndex = new SearchIndex(pdfDocumentStreamToSearch);

        // create document used for rendering
        using ( Stream pdfDocumentStreamToRasterize = new FileStream( pathToDocument, FileMode.Open, FileAccess.Read ) )
        {
            document = new Document(pdfDocumentStreamToRasterize);

            // search text in PDF document and render pages containing results
            searchIndex.Search( SearchHandler, "software products derive" );
        }
    }
}
/// <summary>
/// Handle search results here. Draw pages with highlighted text.
/// </summary>
/// <param name="handlerArgs">The handler args.</param>
private static void SearchHandler(SearchHandlerArgs handlerArgs)
{
    if (handlerArgs.ResultItems.Count != 0)
    {
        string outputFileName = string.Format("{0}.png", handlerArgs.PageIndex);
        Page page = document.Pages[handlerArgs.PageIndex];

        // render the page at twice its size and highlight each result on the bitmap
        using (Image bm = page.Render((int)page.Width * 2, (int)page.Height * 2, renderingSettings))
        {
            foreach (SearchResultItem searchResultItem in handlerArgs.ResultItems)
            {
                HighlightSearchResult(bm, searchResultItem, page);
            }
            bm.Save( outputFileName );
        }

        // open the saved image in the default viewer
        Process.Start( outputFileName );
    }

    // Search cancellation condition: stop if more than 3 results were found;
    // otherwise the search runs until all pages have been searched
    if (handlerArgs.ResultItems.Count > 3)
    {
        handlerArgs.CancelSearch = true;
    }
}
What happens here? We take the previously created
index and call the SearchIndex.Search method, passing it the search handler.
The handler processes our results one by one and highlights found items using the HighlightSearchResult call; this method contains simple GDI+ code that draws a transparent rectangle around the found
text (if any). The handler also sets a condition for search cancellation, demonstrating the flexibility of the PDF search API.
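The body of HighlightSearchResult is not reproduced in this post. As a rough sketch of the geometry such a method has to deal with, the code below maps a rectangle given in PDF page coordinates to pixel coordinates on a bitmap rendered at a given scale (the sample renders at twice the page size). It assumes the bounds of a result are available as left, bottom, right, and top values in page units, with the origin at the bottom-left corner of the page as is conventional in PDF. How SearchResultItem actually exposes its bounds is not shown here, so PageRect, PixelRect, and ToPixelRect are hypothetical names and not part of the Apitron API.

```csharp
using System;

// Hypothetical type: a rectangle in PDF page units, origin at the bottom-left corner.
public sealed class PageRect
{
    public PageRect(double left, double bottom, double right, double top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }
    public double Left { get; private set; }
    public double Bottom { get; private set; }
    public double Right { get; private set; }
    public double Top { get; private set; }
}

// Hypothetical type: a rectangle in bitmap pixels, origin at the top-left corner.
public sealed class PixelRect
{
    public PixelRect(int x, int y, int width, int height)
    {
        X = x; Y = y; Width = width; Height = height;
    }
    public int X { get; private set; }
    public int Y { get; private set; }
    public int Width { get; private set; }
    public int Height { get; private set; }
}

public static class HighlightMath
{
    // Converts page coordinates to pixel coordinates for a bitmap rendered at "scale".
    public static PixelRect ToPixelRect(PageRect r, double pageHeight, double scale)
    {
        // PDF y grows upwards from the bottom edge; bitmap y grows downwards
        // from the top edge, so the top of the rectangle is flipped.
        double x = r.Left * scale;
        double y = (pageHeight - r.Top) * scale;
        double width = (r.Right - r.Left) * scale;
        double height = (r.Top - r.Bottom) * scale;

        // Round outwards so the highlight never clips the text.
        return new PixelRect(
            (int)Math.Floor(x), (int)Math.Floor(y),
            (int)Math.Ceiling(width), (int)Math.Ceiling(height));
    }
}

public static class HighlightDemo
{
    public static void Main()
    {
        // A 100 x 12 unit box on a 792-unit-high page, rendered at 2x.
        PageRect hit = new PageRect(72, 700, 172, 712);
        PixelRect px = HighlightMath.ToPixelRect(hit, 792, 2.0);
        Console.WriteLine("{0},{1} {2}x{3}", px.X, px.Y, px.Width, px.Height); // 144,160 200x24
    }
}
```

With GDI+, a rectangle like this could then be filled on the rendered image using Graphics.FromImage, a SolidBrush created from Color.FromArgb(alpha, Color.Yellow), and Graphics.FillRectangle.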
Resulting images
Resulting image (see yellow markers)
One of the results produced by searching for “Intelligent Documents”
Result produced for spiral text
How to get it
The described PDF search engine is included in the latest Apitron PDF
Rasterizer for .NET release; all related classes can be found in the
Apitron.PDF.Rasterizer.Search namespace. We always welcome feedback, so feel free
to ask questions and share ideas.