2015-04-29

How to extract text from pdf page and create pdf to html conversion tool

Introduction


In our first post about PDF text extraction we demonstrated how to extract raw and formatted text from a PDF document using the Apitron PDF Kit for .NET component. As part of the latest improvements, we've added new text extraction functionality and are eager to share the results with you.

Before this release, if you wanted to extract text from a PDF document programmatically, you had to pick one of the following options:
  • Process text blocks yourself by parsing PDF commands and combining the results into formatted or raw text. Frankly, this is the hardest route, and it requires a thorough understanding of many text-related aspects of PDF.
  • Use the Apitron PDF Kit API by calling Page.ExtractText() with the RawText or FormattedText parameter.

While both techniques work fine and do what they’re meant for, you may still need an easy and convenient way to analyze text attributes, e.g. color, original font name, size, etc., without diving too deep into PDF specifics. That’s why we've implemented new text extraction modes in addition to Raw and Formatted:
  • TaggedText – produces XML output in which each PDF text block is wrapped in an XML element containing all available appearance information affecting that block. It also reports the coordinates of each block in page space, so you can analyze and format the text data in any way you like. See the TaggedText section below.
  • HtmlText – builds on TaggedText and is meant to reduce the effort needed for the PDF to HTML conversion task that many developers face. See the HtmlText section below.

Note: in evaluation mode only part of the page text can be extracted and the performance of the text extraction might be slightly affected.

PDF text extraction modes 


TaggedText


This mode generates tagged output in XML format. For each PDF page it produces a root page element containing PDF text blocks represented by the textblock element.

Supported attributes for page element are:
  • width – width of the page
  • height – height of the page

Supported attributes for textblock element are:
  • left – block left coordinate, in PDF coordinate system (relative to lower left corner)
  • bottom – block bottom coordinate, in PDF coordinate system
  • width – block width
  • height – block height
  • fontSize – font size in points
  • fontFamily – describes the font used within text block
  • letterSpacing – additional letter spacing
  • wordSpacing – additional word spacing
  • fontStretch – indicates a selection of normal, condensed, or expanded face from a font
  • fontWeight – defines boldness of the font
  • fontStyle – indicates italic font
  • nonStrokeColor – fill color, rgb
  • strokeColor – stroking color, rgb

All coordinates and dimensions are in page space and relative to the lower left corner of the PDF page. Text blocks are produced in the same order in which they appear on the PDF page, without any additional formatting applied. Using this markup, you can analyze the text layout and perform context-dependent tasks. If you know the nature of the document, you can build custom-tailored solutions that serve the particular needs of your application.

Sample code and produced XML are below (page 7 from PDF32000_2008 spec):

/// <summary>
/// Extracts tagged text from PDF page at given index.
/// </summary>
/// <param name="pdfDoc">PDF document to extract text from.</param>
/// <param name="pageIndex">Page index.</param>
/// <returns>Extracted text or empty string if page wasn't found or empty.</returns>
public string ExtractTaggedText(FixedDocument pdfDoc, int pageIndex)
{
    if (pdfDoc != null && pageIndex >= 0 && pdfDoc.Pages.Count > pageIndex)
    {
        return pdfDoc.Pages[pageIndex].ExtractText(TextExtractionOptions.TaggedText);
    }

    return string.Empty;
}

XML:

<page width="595" height="842">
  <textblock left="464.94" bottom="791.91" fontSize="10.98" fontFamily="Helvetica-Bold" letterSpacing="0.01" wordSpacing="0" width="93.42" height="13.07">PDF 32000-1:2008</textblock>
  <textblock left="70.8" bottom="747.24" fontSize="13.98" fontFamily="Arial" fontWeight="700" fontStretch="Normal" letterSpacing="0.01" width="60.21" height="19.38">Contents</textblock>
  <textblock left="535.08" bottom="747.3" fontSize="9.96" fontFamily="Arial" fontStretch="Normal" letterSpacing="0.03" width="23.4" height="13.26">Page</textblock>
    …
</page>

The first text block is highlighted for demonstration purposes.
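
Since the tagged output is plain XML, it can be consumed with the standard System.Xml.Linq classes. The sketch below is one possible way to do it and is not part of the Apitron API: TextBlock, TaggedPage and TaggedTextReader are hypothetical helper types introduced for illustration, and the code assumes the element and attribute names listed above. It also converts the PDF-style bottom coordinate into a distance from the top of the page, which is what top-down layouts usually expect.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Xml.Linq;

// Hypothetical container for one <textblock> element (not an Apitron type).
public sealed class TextBlock
{
    public double Left;
    public double Bottom;
    public double Width;
    public double Height;
    public double FontSize;
    public string FontFamily = "";
    public string Text = "";
}

// Hypothetical container for the <page> element and its text blocks.
public sealed class TaggedPage
{
    public double Width;
    public double Height;
    public List<TextBlock> Blocks = new List<TextBlock>();
}

public static class TaggedTextReader
{
    // Reads a numeric attribute using the invariant culture,
    // so "70.8" parses the same way on every machine. Missing attributes yield 0.
    private static double Number(XElement element, string name)
    {
        XAttribute attribute = element.Attribute(name);
        return attribute == null
            ? 0.0
            : double.Parse(attribute.Value, CultureInfo.InvariantCulture);
    }

    // Parses the XML returned by the TaggedText extraction mode.
    public static TaggedPage Parse(string xml)
    {
        XElement root = XElement.Parse(xml);
        TaggedPage page = new TaggedPage
        {
            Width = Number(root, "width"),
            Height = Number(root, "height")
        };

        foreach (XElement element in root.Elements("textblock"))
        {
            XAttribute font = element.Attribute("fontFamily");
            page.Blocks.Add(new TextBlock
            {
                Left = Number(element, "left"),
                Bottom = Number(element, "bottom"),
                Width = Number(element, "width"),
                Height = Number(element, "height"),
                FontSize = Number(element, "fontSize"),
                FontFamily = font == null ? "" : font.Value,
                Text = element.Value
            });
        }

        return page;
    }

    // Distance from the top edge of the page to the top edge of the block.
    public static double TopOffset(TaggedPage page, TextBlock block)
    {
        return page.Height - block.Bottom - block.Height;
    }
}
```

Passing the string returned by ExtractTaggedText to TaggedTextReader.Parse gives a list of blocks that can then be filtered or grouped, for example by font size to find headings.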


HtmlText


This text extraction mode is based on the tagged text mode and was designed to provide a base implementation of the often-needed PDF to HTML conversion task. If you need finer control, you can implement it yourself using the XML output produced by the tagged text mode.

HtmlText mode produces a <div> block containing preformatted text wrapped inside <pre> elements. Each <pre> element is styled with an inline style according to the text properties of the corresponding PDF text object. Each element is positioned absolutely inside the parent block, which itself uses relative positioning.

Sample code and produced HTML are below (page 6 from PDF32000_2008 spec):

/// <summary>
/// Extracts html text from PDF page at given index.
/// </summary>
/// <param name="pdfDoc">PDF document to extract text from.</param>
/// <param name="pageIndex">Page index.</param>
/// <returns>Extracted text or empty string if page wasn't found or empty.</returns>
public string ExtractHtmlText(FixedDocument pdfDoc, int pageIndex)
{
    if (pdfDoc != null && pageIndex >= 0 && pdfDoc.Pages.Count > pageIndex)
    {
        return pdfDoc.Pages[pageIndex].ExtractText(TextExtractionOptions.HtmlText);
    }

    return string.Empty;
}

Html:

<div style="width:595px;height:842px;position:relative;">
  <style scoped="scoped">pre{margin:0;padding:0;position:absolute;}</style>
  <pre style="left:464.94px;bottom:791.91px;font-size:10.98px;font-family:'Helvetica';width:93.42px;height:13.07px;">PDF 32000-1:2008</pre>
  <pre style="left:70.86px;bottom:747.24px;font-size:13.98px;font-family:'Helvetica';width:81.62px;height:19.38px;">Introduction</pre>
  <pre style="left:70.86px;bottom:717.42px;font-size:9.96px;font-family:'Helvetica';width:490.44px;height:13.26px;">ISO 32000 specifies a digital form for representing documents called the Portable Document Format or usually</pre>
  <pre style="left:70.86px;bottom:705.95px;font-size:9.96px;font-family:'Helvetica';width:490.37px;height:13.26px;">referred to as PDF. PDF was developed and specified by Adobe Systems Incorporated beginning in 1993 and</pre>
    …
</div>

All properties of a PDF element that can be mapped to an HTML style attribute are used here. Also, as you can see, the produced HTML block uses a scoped style to set the common properties for all nested <pre> elements.

If we open the produced html in browser, it will look as follows:

Pic. 1 PDF to HTML conversion results
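
The extracted fragment is a <div> rather than a complete HTML document, so for a quick preview you may want to wrap it in a minimal page that declares its encoding. A possible sketch is shown below. HtmlPageWriter and its methods are hypothetical helper names, not part of Apitron PDF Kit, and only standard .NET classes are used.

```csharp
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper (not an Apitron type) that turns an HTML fragment
// into a standalone page suitable for opening in a browser.
public static class HtmlPageWriter
{
    // Wraps the fragment in a minimal HTML5 document.
    // The title is HTML-encoded; the fragment is inserted as-is.
    public static string WrapFragment(string fragment, string title)
    {
        StringBuilder html = new StringBuilder();
        html.AppendLine("<!DOCTYPE html>");
        html.AppendLine("<html>");
        html.AppendLine("<head>");
        html.AppendLine("<meta charset=\"utf-8\">");
        html.AppendLine("<title>" + WebUtility.HtmlEncode(title) + "</title>");
        html.AppendLine("</head>");
        html.AppendLine("<body>");
        html.AppendLine(fragment);
        html.AppendLine("</body>");
        html.AppendLine("</html>");
        return html.ToString();
    }

    // Writes the wrapped page to disk as UTF-8.
    public static void Save(string fragment, string title, string path)
    {
        File.WriteAllText(path, WrapFragment(fragment, title), Encoding.UTF8);
    }
}
```

For example, HtmlPageWriter.Save(ExtractHtmlText(doc, 5), "Page 6", "page6.html") would store the sixth page as a file that can be opened directly; the file name and title here are arbitrary.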

Conclusion


Using the new text extraction modes described in this post, you can now analyze document text together with its attributes and positions. You can also easily convert PDF to simple HTML for quick preview purposes.

Apitron PDF Kit for .NET can be used as a standard PDF component for building applications that require PDF processing. Create apps for Windows Store, Google Play Store, and Apple Store using the same API. Develop server-side solutions, cloud applications, directly managed websites, or web services. Being truly cross-platform and .NET / Mono / Xamarin compatible, our library raises the implementation of PDF processing logic to an unmatched level.

Downloadable version of this article can be found by the following link [PDF].

2015-04-24

How to add watermark to pdf document

What is a watermark


A watermark is usually a semitransparent drawing added on top of the page content, and it can be created in various ways. This kind of marking becomes necessary when you have to indicate the particular purpose a document is designed for or give handling instructions. Examples are: “For internal reading only”, “Do not copy”, “Top Secret”, etc. It’s also useful for placing banners indicating the name of the product the document was created by, or its evaluation state.

We’ll describe several watermarking approaches in this post and provide C# code samples which generate watermarks programmatically. 

Image watermark


This type of watermark is simple and convenient. You create an image containing your message and draw it over the page content.

Pros:
  • Easy to create and use, single image XObject can be shared by all pages
  • Provides a simple way to use any picture as watermark

Cons:
  • May significantly increase the size of the resulting file if the image is large
  • For the image to be transparent, it has to include some kind of transparency mask, which can be a problem for readers that are not transparency-aware
  • Raster images don’t scale well, so this watermark may become pixelated when zoomed
  • Becomes a part of page content

See the C# code snippet below, which shows how to add an image watermark:

/// <summary>
/// Adds image watermark to PDF document.
/// </summary>
public void AddImageWatermark()
{
    // open existing document
    using (Stream file = File.OpenRead("Apitron PDF Kit in Action.pdf"))
    {
        FixedDocument doc = new FixedDocument(file);
        // register image XObject
        doc.ResourceManager.RegisterResource(new Image("watermark","watermark.png", true));

        // add image watermark for each page
        foreach (Page page in doc.Pages)
        {
            page.Content.AppendImage("watermark", 0, 0, page.Boundary.MediaBox.Width,
                page.Boundary.MediaBox.Height);
        }               

        // save watermarked file
        using (Stream stream = File.Create("image_watermark.pdf"))
        {
            doc.Save(stream);
        }
    }
}

The image below demonstrates the execution results:

Pic. 1 Image watermark sample (pdf)
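
The snippet above stretches the image over the whole media box, which distorts it when the image and page proportions differ. If that matters, the target rectangle can be computed first. The helper below is a sketch using plain arithmetic: WatermarkLayout and FitRect are hypothetical names, and the image dimensions are assumed to be known to the caller.

```csharp
using System;

// Hypothetical result type: a rectangle in page space (origin at lower left).
public struct FitRect
{
    public double X;
    public double Y;
    public double Width;
    public double Height;
}

public static class WatermarkLayout
{
    // Scales the image uniformly so that it fits inside the page,
    // then centers it. The aspect ratio of the image is preserved.
    public static FitRect FitCentered(
        double imageWidth, double imageHeight,
        double pageWidth, double pageHeight)
    {
        if (imageWidth <= 0 || imageHeight <= 0)
        {
            throw new ArgumentException("Image dimensions must be positive.");
        }

        // The smaller of the two ratios guarantees that both sides fit.
        double scale = Math.Min(pageWidth / imageWidth, pageHeight / imageHeight);
        double width = imageWidth * scale;
        double height = imageHeight * scale;

        return new FitRect
        {
            X = (pageWidth - width) / 2.0,
            Y = (pageHeight - height) / 2.0,
            Width = width,
            Height = height
        };
    }
}
```

The X, Y, Width and Height values would then take the place of the 0, 0, media box width and media box height arguments passed to AppendImage in the sample.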

Form XObject watermark


This type of watermark assumes basic knowledge of the PDF drawing system. Using this approach, it’s easy to create vector-based drawings suitable for watermarking.

Pros:
  • Compactness, single watermark form XObject can be shared by all pages
  • Scales well if it contains vector drawings only, requires no transparency mask

Cons:
  • Requires some knowledge of PDF drawing system
  • Becomes a part of page content

Let’s create a simple text-based watermark using the C# code below:

public void AddFormXObjectWatermark()
{
    // open existing document
    using (Stream file = File.OpenRead("Apitron PDF Kit in Action.pdf"))
    {
        FixedDocument pdfDocument = new FixedDocument(file);
        // define watermark transparency using graphics state
        GraphicsState watermarkGS = new GraphicsState("gs0"){CurrentNonStrokingAlpha=0.2};
        // register graphics state object
        pdfDocument.ResourceManager.RegisterResource(watermarkGS);

        // create watermark form XObject
        FixedContent watermark = new FixedContent("watermark", pdfDocument.Pages[0].Boundary.MediaBox);               
        // register form XObject
        pdfDocument.ResourceManager.RegisterResource(watermark);

        // define text and transformation for it
        TextObject watermarkText = new TextObject(StandardFonts.Helvetica,48);
        watermarkText.AppendText("Apitron PDF Kit for .NET");               
        watermark.Content.ModifyCurrentTransformationMatrix(1,1.25,-1.25,1,50,50);

        // define current color and transparency               
        watermark.Content.SetGraphicsState("gs0");
        watermark.Content.SetDeviceNonStrokingColor(RgbColors.Red.Components);

        // draw watermark text
        watermark.Content.AppendText(watermarkText);                               

        // add watermark to each page
        foreach (Page page in pdfDocument.Pages)
        {
            page.Content.AppendXObject("watermark");
        }

        // save watermarked file
        using (Stream stream = File.Create("formXObject_watermark.pdf"))
        {
            pdfDocument.Save(stream);
        }
    }
}

The result is shown below. You may notice that it looks sharper because of its vector nature:

Pic. 2 Watermark added using form XObject
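
The six numbers passed to ModifyCurrentTransformationMatrix in the sample above appear to follow the standard PDF matrix order [a b c d e f], in which a point (x, y) maps to (a·x + c·y + e, b·x + d·y + f). Under that assumption, the values (1, 1.25, -1.25, 1, 50, 50) describe a counter-clockwise rotation of roughly 51.3 degrees combined with a uniform scale of about 1.6 and a shift of 50 points along both axes. The sketch below shows how these figures can be derived and how a matrix can be built from an angle instead. TransformMath is a hypothetical helper, not an Apitron type.

```csharp
using System;

// Hypothetical helper for reasoning about PDF transformation matrices
// of the form [a b c d e f] that combine rotation and uniform scale.
public static class TransformMath
{
    // Counter-clockwise rotation angle in degrees encoded by a and b.
    public static double RotationDegrees(double a, double b)
    {
        return Math.Atan2(b, a) * 180.0 / Math.PI;
    }

    // Uniform scale factor encoded by a and b.
    public static double Scale(double a, double b)
    {
        return Math.Sqrt(a * a + b * b);
    }

    // Builds [a b c d e f] for a rotation by the given angle (degrees,
    // counter-clockwise), a uniform scale, and a translation (tx, ty).
    public static double[] Build(double degrees, double scale, double tx, double ty)
    {
        double radians = degrees * Math.PI / 180.0;
        double cos = scale * Math.Cos(radians);
        double sin = scale * Math.Sin(radians);
        return new[] { cos, sin, -sin, cos, tx, ty };
    }
}
```

For instance, TransformMath.Build(45, 1, 50, 50) yields the six values for a plain 45-degree diagonal watermark without enlarging the 48-point text.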


Watermark annotation


A watermark annotation can be used to represent graphics that are to be printed at a fixed size and position on a page, regardless of the dimensions of the printed page.

Pros:
  • Compactness, designed specifically for watermarks
  • Can be easily managed using page annotations dictionary
  • Requires no transparency mask

Cons:
  • Requires some knowledge of PDF drawing system and annotations

// Adds watermark annotation to the document.
public static void AddWatermarkAnnotation()
{
    // open existing document
    using (Stream file = File.OpenRead("Apitron PDF Kit in Action.pdf"))
    {
        FixedDocument pdfDocument = new FixedDocument(file);
        // define watermark transparency using graphics state and register this object
        GraphicsState watermarkGS = new GraphicsState("gs0"){CurrentNonStrokingAlpha=0.2};
        pdfDocument.ResourceManager.RegisterResource(watermarkGS);

        // create watermark content
        FixedContent watermark = new FixedContent("watermark", pdfDocument.Pages[0].Boundary.MediaBox);

        // define text and transformation for it
        TextObject watermarkText = new TextObject(StandardFonts.Helvetica, 48);
        watermarkText.AppendText("Apitron PDF Kit for .NET");
        watermark.Content.ModifyCurrentTransformationMatrix(1, 1.25, -1.25, 1, 50, 50);

        // define current color and transparency               
        watermark.Content.SetGraphicsState("gs0");
        watermark.Content.SetDeviceNonStrokingColor(RgbColors.Red.Components);

        // draw watermark text
        watermark.Content.AppendText(watermarkText);

        // create a watermark annotation object for each page
        foreach (Page page in pdfDocument.Pages)
        {
            WatermarkAnnotation annotation = new WatermarkAnnotation(page.Boundary.MediaBox);
            annotation.Appearance.Normal = watermark;
            page.Annotations.Add(annotation);
        }

        using (Stream stream = File.Create("watermark_annotation.pdf"))
        {
            pdfDocument.Save(stream);
        }
    }
}

The code that creates the watermark annotation produces the same result as the code that adds the form XObject watermark.

Watermarks removal


It’s possible to remove watermarks from a PDF file; however, we don’t recommend doing so because it can cause legal problems. The techniques involve content analysis as well as annotation checks. There is, however, no 100% reliable way to remove all watermarking information with a single algorithm, because watermarks might be hidden in PDF metadata or other less evident places.

For example, one may use a fully transparent image that appears only when the document is printed. Think of a watermark as a piece of information hidden inside the PDF file; it can be just about anything.

Conclusion


Adding watermarks is not a tricky task and, as you can see, it can be completed quite easily using the Apitron PDF Kit for .NET component. This component is available for many platforms and lets you create applications for Windows and Windows Store, Xamarin.iOS and Xamarin.Android, OS X, or any other system where .NET/Mono can run. ASP.NET and Azure environments are supported as well. You may visit its product page or browse the documentation.

Downloadable version of this article can be found by the following link [PDF].