
2014-03-04

Search text in PDF documents (C# sample)


While building apps with Apitron PDF Rasterizer for .NET, you have probably thought about implementing your own PDF text search.
It is useful for custom viewers, and it makes your PDF rendering toolkit complete: Apitron PDF Rasterizer for rendering, Apitron PDF Viewer (our free PDF viewer control) for viewing, and integrated text search for fast navigation.
We did it for you.
Apitron PDF Rasterizer now has an integrated text search engine that you can easily use in your apps.


The code 


For demonstration purposes, we will review a sample that opens a PDF file, searches for some text, renders the corresponding PDF page, and highlights the results (the complete C# code can be found in the samples\SearchAndHighlightSpecifiedText folder of our download package).

Our PDF text search engine uses a concept called a search index: before searching, we analyze the document and build its "index", the data used for the actual search. You can store the index for later use, so it does not have to be rebuilt the next time the document is opened.

Index creation


FileStream pdfDocumentStreamToSearch = new FileStream( Path.Combine( pathToDocuments, "2003_ar.pdf" ), FileMode.Open, FileAccess.Read );


SearchIndex searchIndex = new SearchIndex( pdfDocumentStreamToSearch );


As you can see, index creation takes just a few lines of code. It's also possible to save the index to an output stream for later use. Password-protected PDF files are also supported.

Text search 


I used a PDF document from the Adobe website, http://www.adobe.com/aboutadobe/invrelations/pdfs/2003_ar.pdf, and all images attached to this post show actual output from the code sample.

// Shared by Main and SearchHandler; these declarations are assumed from their usage below.
private static Document document;
private static RenderingSettings renderingSettings = new RenderingSettings();

static void Main(string[] args)
{
   string pathToDocument = @"..\Documents\2003_ar.pdf";

   // create index from PDF file
   using ( Stream pdfDocumentStreamToSearch = new FileStream( pathToDocument, FileMode.Open, FileAccess.Read ) )
   {
      SearchIndex searchIndex = new SearchIndex(pdfDocumentStreamToSearch);

      // create document used for rendering
      using ( Stream pdfDocumentStreamToRasterize = new FileStream( pathToDocument, FileMode.Open, FileAccess.Read ) )
      {
         document = new Document(pdfDocumentStreamToRasterize);

         // search text in PDF document and render pages containing results
         searchIndex.Search( SearchHandler, "software products derive" );
      }
   }
}

/// <summary>
/// Handle search results here. Draw pages with highlighted text.
/// </summary>
/// <param name="handlerArgs">The handler args.</param>
private static void SearchHandler(SearchHandlerArgs handlerArgs)
{
   if (handlerArgs.ResultItems.Count != 0)
   {
      string outputFileName = string.Format("{0}.png", handlerArgs.PageIndex);
      
      Page page = document.Pages[handlerArgs.PageIndex];
      using (Image bm = page.Render((int)page.Width * 2, (int)page.Height * 2, renderingSettings))
      {
         foreach (SearchResultItem searchResultItem in handlerArgs.ResultItems)
         {
            HighlightSearchResult(bm, searchResultItem, page);
         }

         bm.Save( outputFileName );
      }

      Process.Start( outputFileName );
   }

   // Search cancellation condition: stop if more than 3 results were found;
   // otherwise the search runs until all pages are searched
   if (handlerArgs.ResultItems.Count > 3)
   {
      handlerArgs.CancelSearch = true;
   }

}


What happens here? We take the previously created index and call the SearchIndex.Search method, passing it our search handler.
The handler processes the results page by page and highlights found items using the HighlightSearchResult call; this method contains simple GDI+ code that draws a transparent rectangle around the found text (if any). The handler also sets a search cancellation condition, demonstrating the flexibility of the PDF search API.
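
HighlightSearchResult itself is not listed in this post, so here is a minimal sketch of the kind of GDI+ code it could contain. It is not the code from the sample package. The class name HighlightSketch and both methods are hypothetical, and because the members of SearchResultItem are not shown here, the sketch takes the bounding box as plain numbers. It also assumes the box is given in PDF page units with the origin at the bottom-left corner of an unrotated page, which is the usual PDF convention.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the Apitron API.
static class HighlightSketch
{
    // Maps a box given in PDF page units (origin at the bottom-left corner, y pointing up)
    // to bitmap pixels (origin at the top-left corner, y pointing down).
    // "scale" is the ratio between the rendered image size and the page size.
    public static RectangleF ToImageRectangle(
        float left, float bottom, float width, float height,
        float pageHeight, float scale)
    {
        float top = pageHeight - (bottom + height); // flip the vertical axis
        return new RectangleF(left * scale, top * scale, width * scale, height * scale);
    }

    // Fills the mapped box with semi-transparent yellow so the text underneath stays readable.
    public static void DrawHighlight(Image image, RectangleF box)
    {
        using (Graphics g = Graphics.FromImage(image))
        using (Brush brush = new SolidBrush(Color.FromArgb(96, Color.Yellow)))
        {
            g.FillRectangle(brush, box);
        }
    }
}
```

The handler above renders each page at twice its size, so the scale would be 2. With that scale, a box at left 72, bottom 700, 100 wide and 12 high on a page 792 units tall maps to the pixel rectangle at (144, 160) with a size of 200 x 24.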

Resulting images

Resulting image (see yellow markers)

One of the results produced by searching for “Intelligent Documents”




Result produced for spiral text

How to get it 


The described PDF search engine is included in the latest Apitron PDF Rasterizer for .NET release; all related classes can be found under the Apitron.PDF.Rasterizer.Search namespace. We always welcome feedback, so feel free to ask questions and share ideas.



6 comments:

  1. Hello, how to search text with Windows phone version?

  2. Hi, you may use the same code and API for it.

  3. Hi!!.. You already confirm me that we could save for later use in database (sql server) searchIndex variable. We still hasn't start to implement this, but to advance I will like to know if we should use Long Text for example.

    Also, do you think that saving index in DB will save us some resources (it's a web application)? In this way we wouldn't have to call SearchFunction everytime a user search in PDF.

    Thanks!!

    1. Hello Ricardo,
      yes, it's possible, and it can speed up text search significantly. BLOB/FILESTREAM seems to suit the index data better than long text.

  4. Hi, I have used Apitron dll for searching but I am getting result up to 3 pages only. Is there any restriction for searching or am I doing any thing wrong? Please guide me.

    1. Hello Anuj, yes it's the evaluation limitation.
