
2016-02-22

Search text in PDF documents using regular expressions

Introduction


Searching for text in a PDF document is easy, and the feature has been available in our Apitron PDF Rasterizer for .NET component for many releases. We’ve now updated the API so that you can search for text on a PDF page using standard .NET regular expression objects (Regex).
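To get a feel for what a pattern matches before pointing it at a document, it can help to try it on plain text first. The sketch below uses only System.Text.RegularExpressions and no Apitron types. The RegexPreview class, the FindMatches helper and the sample sentence are made up for illustration; the pattern is the one used in the sample later in this post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexPreview
{
    // Returns every substring of the input that matches the pattern.
    public static List<string> FindMatches(string text, string pattern)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(text, pattern))
        {
            results.Add(match.Value);
        }
        return results;
    }

    static void Main()
    {
        // Made-up sentence standing in for text found on a page.
        string text = "Apitron PDF Kit and the PDF  Kit samples; Kit alone is skipped.";

        // A verbatim string (@"...") avoids doubling the backslashes.
        foreach (string hit in FindMatches(text, @"\w+\s+Kit"))
        {
            Console.WriteLine(hit);
        }
    }
}
```

Note that this pattern has no word boundary after Kit, so it would also match the beginning of a longer word such as "Kitchen". Appending \b to the pattern tightens it.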

The text search API offered by Apitron PDF Rasterizer is decoupled from the rendering part and can be used independently. It’s represented by the SearchIndex class, which handles all search tasks and can also build search indices for documents and save or load them for later use.

With the search API you can also highlight text on rendered pages, because each search result carries the position of the matched text on the PDF page.
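Highlighting boils down to mapping coordinates from page space to pixel space. The sketch below shows one plausible form of such a mapping, assuming coordinates in PDF points (72 per inch) with the origin at the bottom-left corner, an unrotated page and no crop offset. RegionMapping and ToPixels are hypothetical names. This is not the library's implementation, which the sample code reaches through Page.TransformRegion.

```csharp
using System;

static class RegionMapping
{
    // Converts a flat array [x0, y0, x1, y1, ...] given in PDF points
    // (assumed origin: bottom-left) into pixel coordinates (origin: top-left)
    // for a bitmap rendered at the given DPI.
    public static double[] ToPixels(double[] pdfCoords, double pageHeightPoints, double dpi)
    {
        if (pdfCoords.Length % 2 != 0)
        {
            throw new ArgumentException("Expected x/y pairs.", nameof(pdfCoords));
        }

        double scale = dpi / 72.0; // pixels per PDF point
        var pixels = new double[pdfCoords.Length];
        for (int i = 0; i < pdfCoords.Length; i += 2)
        {
            pixels[i] = pdfCoords[i] * scale;
            // Flip the vertical axis: PDF y grows upwards, bitmap y grows downwards.
            pixels[i + 1] = (pageHeightPoints - pdfCoords[i + 1]) * scale;
        }
        return pixels;
    }

    static void Main()
    {
        // Hypothetical quad on a US Letter page (612 x 792 points).
        double[] quad = { 72, 720, 144, 720, 144, 708, 72, 708 };
        double[] px = ToPixels(quad, 792, 96);
        Console.WriteLine(string.Join(", ", px));
    }
}
```

A real mapping also has to account for page rotation and crop boxes, which is a good reason to rely on the library's own transform rather than reimplementing it.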

See the code section for details.

The code


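using System.Diagnostics;
using System.Drawing;
using System.IO;
using System.Text.RegularExpressions;
// the Apitron namespaces below are assumed; check them against your installed version
using Apitron.PDF.Rasterizer;
using Apitron.PDF.Rasterizer.Configuration;
using Apitron.PDF.Rasterizer.Search;

class Program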
{
    // global rendering settings
    static RenderingSettings renderingSettings = new RenderingSettings();
    // semi-transparent yellow brush used to highlight search results
    static Brush hightlightBrush = new SolidBrush(Color.FromArgb(100,255,255,0));

    static void Main(string[] args)
    {
        // the source file to search for text
        string inputFilePath = "../../data/Apitron_Pdf_Kit_in_Action.pdf";           

        // open pdf document for search and rendering
        // we'll use 2 different streams here
        using (Stream searchStream = new FileStream(inputFilePath, FileMode.Open,
            FileAccess.Read),
            documentStream = new FileStream(inputFilePath, FileMode.Open,
            FileAccess.Read))
        {               
            // create search object from PDF data stream
            using (SearchIndex searchIndex = new SearchIndex(searchStream))
            {
                // open document to be used for rendering
                using (Document doc = new Document(documentStream))
                {
                    searchIndex.Search((handlerArgs =>
                    {
                        // if we have results
                        if (handlerArgs.ResultItems.Count != 0)
                        {
                            // create resulting image filename
                            string outputFileName = string.Format("{0}_{1}.png",
                                Path.GetFileNameWithoutExtension(inputFilePath),
                                handlerArgs.PageIndex);

                            // render found result and start system image viewer
                            Page page = doc.Pages[handlerArgs.PageIndex];
                            using (Image bitmap = page.Render(new Resolution(96, 96),
                                renderingSettings))
                            {
                                foreach (SearchResultItem searchResultItem in
                                    handlerArgs.ResultItems)
                                {
                                    HighlightSearchResult(bitmap, searchResultItem,
                                    page);
                                }

                                bitmap.Save(outputFileName);
                            }

                            Process.Start(outputFileName);
                        }

                    }),
                    // find everything that matches [WORD][whitespaces]Kit pattern
                    new Regex("\\w+\\s+Kit"));                        
                }
            }
        }
    }
      
    /// <summary>
    /// Highlights the search result.
    /// </summary>
    /// <param name="bitmap"> The bitmap. </param>
    /// <param name="searchResultItem"> The search result item. </param>
    /// <param name="page"> The page. </param>
    private static void HighlightSearchResult(Image bitmap, 
        SearchResultItem searchResultItem,
        Page page)
    {
        using (Graphics gr = Graphics.FromImage(bitmap))
        {
            // transform the result region to match the rendered bitmap's dimensions
            SearchResultRegion region = page.TransformRegion(searchResultItem.Region,
                bitmap.Width, bitmap.Height, renderingSettings);

            // each block is a flat array of x/y pairs describing a polygon
            foreach (double[] block in region.Blocks)
            {
                PointF[] points = new PointF[block.Length / 2];
                for (int i = 0; i < points.Length; i++)
                {
                    points[i] = new PointF((float)block[i * 2],
                        (float)block[(i * 2) + 1]);
                }

                gr.FillPolygon(hightlightBrush, points);
            }
        }
    }
}   


The complete code sample can be found in our GitHub repo. The results of the execution are shown below. Please note that in evaluation mode the search API searches for text on the first three pages only.


Pic. 1 Search text in PDF document - highlighted text



Summary


Apitron PDF Rasterizer for .NET is a comprehensive solution that you can use for PDF rendering and for implementing text search in PDF documents. It’s a cross-platform library available for many .NET-based platforms (Xamarin, Mono and .NET, to name a few) and can be used to create mobile, desktop and web applications. Contact us if you have any questions regarding our products or services.
