G4-compressed-TIFF-to-PDF-conversion FAQ

Last change: 02 Aug 2006 (most content is from 2003 though), contact Holger Blasum, c42pdf ATT ffii DOTT org for comments, critique or updates

Why PDF ?

PDF readers are somewhat more common than TIFF readers.

What is a G4 compressed TIFF ? Example ?

A TIFF file that has been compressed according to the ITU G4 Fax compression standard, so typically it is a black-and-white scan (may consist of several pages). Here is an example (TIFF) , (PDF) .

How can I check the compression of any TIFF / PDF files ?
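
One way to check from the command line, as a sketch: tiffinfo (from the libtiff tools) reports a "Compression Scheme" line per page, and for PDFs it is often enough to search the file for the names of the stream filters. The function names below are made up for this example, and the PDF check assumes the filter names are stored uncompressed in the file, which is not guaranteed for newer PDFs.

```shell
#!/bin/bash
# List the stream filters named in a PDF, with a count for each.
# Assumption: the filter names appear uncompressed in the file; PDFs that
# pack their objects into compressed object streams may hide some of them.
pdf_filters() {
    grep -a -o -E '/(CCITTFaxDecode|FlateDecode|RunLengthDecode|LZWDecode|DCTDecode)' "$1" \
        | sort | uniq -c
}

# Print the compression line that tiffinfo (libtiff tools) reports per page.
tiff_compression() {
    tiffinfo "$1" | grep -i 'Compression Scheme'
}

# Usage:
#   tiff_compression sample.tif
#   pdf_filters sample.pdf
```

A G4 compressed TIFF should show up as CCITT Group 4 in the first check, and a PDF made from it without recompression should list /CCITTFaxDecode in the second.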

Which compression options are there for monocolor PDF and TIFF ?

(see also: Jason Summers, comp.infosystems.www.authoring.images, 1999/01/02)

In practice, for (a random) 300 dpi scan this boiled down to: 1,121 kb for no compression, 182 kb for packbits compression, 113 kb for G3 compression, 78 kb for G4 compression and 71 kb for Flate compression. But don't take this benchmark too seriously, this varies widely !
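
To put these figures side by side, here is a small sketch that expresses each one as a percentage of the uncompressed size. It reads the first figure as 1,121 kb (about what an uncompressed b/w A4 page at 300 dpi takes); percent_of is a helper name invented for this example.

```shell
#!/bin/bash
# Express a compressed size as a percentage of the uncompressed size.
percent_of() {
    # $1 = compressed size, $2 = uncompressed size (same unit)
    awk -v c="$1" -v u="$2" 'BEGIN { printf "%.1f\n", 100 * c / u }'
}

# Sizes in kb as quoted in the benchmark above.
uncompressed=1121
for entry in packbits:182 G3:113 G4:78 Flate:71; do
    name=${entry%%:*}
    size=${entry##*:}
    echo "$name: $(percent_of "$size" "$uncompressed")% of uncompressed size"
done
```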

File size of produced PDFs vs incoming G4 compressed TIFFs

Produced PDFs should usually not be more than 2% larger than incoming TIFFs. Depending on the structure of the TIFFs, output may be even smaller, and by changing compression (see below) you may typically gain another single-digit percentage. Deviations of more than 10% in either direction are atypical (though John has shown me some formally correct G4 compressed TIFF files that had huge chunks (15%) of repetitive data inside; this can be checked in a text editor or with tiffsplit).
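
The 10% rule of thumb can be checked mechanically after a conversion run. A minimal sketch follows; size_percent and check_pair are names invented here, and the TIFF is assumed to be non-empty.

```shell
#!/bin/bash
# Print the size of a PDF as a percentage of the size of its source TIFF
# (integer arithmetic, rounded down).
size_percent() {
    # $1 = TIFF file, $2 = PDF file
    local tif_bytes pdf_bytes
    tif_bytes=$(wc -c < "$1")
    pdf_bytes=$(wc -c < "$2")
    echo $(( 100 * pdf_bytes / tif_bytes ))
}

# Report a pair whose sizes differ by more than 10% in either direction.
check_pair() {
    local pct
    pct=$(size_percent "$1" "$2")
    if [ "$pct" -gt 110 ] || [ "$pct" -lt 90 ]; then
        echo "atypical: $2 is ${pct}% of $1"
    fi
}

# Usage: check_pair sample.tif sample.pdf
```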

What conversion programs are there ?

Well there is c42pdf from this site.

Most of the more general converters come as packages that install several small command line programs:

You also may want to have a look at:

... but I am on Windows !

So many programs ! Which one shall I choose ? How shall I use it ? (Christoph)

There are several tradeoffs between ease of installation of programs and ease of the actual program use you can strike.

I: One-step conversion:

Either use c42pdf, command:

                     c42pdf sample.tif

This will create a file sample.pdf. In case it gives you an error message, a more versatile tool is the convert command from 'ImageMagick':

                        convert sample.tif sample.pdf

You will be pleased to find out that convert also works in this intuitive way for many other formats.

Big limitation with ImageMagick (tested: version 5.3.9): the resulting PDFs use the packbits/RunLengthDecode compression, which for some b/w images is about a factor of 10 less efficient than CCITT4. During conversion, the image data is represented as a bitmap, so it is rather memory intensive and slow ('time convert sample.tif sample.pdf': 2.105s in contrast to: 'time c42pdf sample.tif': 0.063s).

Even if you are frustrated when ImageMagick just takes too long on your machine (and proceed to the next step) keep in mind that it is a very cool tool for raster image conversion in general (e.g. GIF to PNG), only that its black-white PDF image generation is not so good.

II.1: Ghostscript-based two-step conversion (eg c42pdf doesn't work, output size is an issue):

use tiff2ps from the libtiff-tools in connection with ps2pdf from Ghostscript (ps2pdf is just a shell script with reasonable params to gs):

                        tiff2ps -a sample.tif > sample.ps
                        ps2pdf sample.ps

My benchmark (BTW, on a 137 MHz [237 bogomips] cpu, with the sample.tif included in the source distrib of c42pdf) for this is 0.387s for first plus 0.630s for the second step.

I have so far not found a way of piping this (though I may be overlooking something obvious) so you would probably want to write a script like::

unix/Cygwin bash:

                        #!/bin/bash
                        # strip the .tif/.tiff ending to get the base name
                        basename=${1%.tif*}
                        tiff2ps -a "$1" > "${basename}.ps" && \
                                ps2pdf "${basename}.ps" && \
                                rm "${basename}.ps"

to be run as: script sample.tif

It is of course also possible to wrap said script in a condition such as (assuming c42pdf returns a non-zero exit status when it fails):

                        if ! c42pdf "$1"; then script "$1"; fi

so that c42pdf is used where it works fast and fine and something else where it doesn't. This setup would emulate the behavior of www.fastio.com's tiff2pdf.

DOS shell, not tested:

                        tiff2ps -a %1.tif > %1.ps
                        ps2pdf %1.ps %1.pdf
                        del %1.ps

to be run as: script sample (without tif ending)

Output paper format and resolution:

On my localized linux system, the default of ps2pdf is to write A4 format, but this may be different on another machine. You can easily add an argument to ps2pdf for the papersize, e.g.:

                        ps2pdf -sPAPERSIZE=a4 sample.ps

Other options for PAPERSIZE could be legal, b5, etc.

II.2: Alternative: c42pdf-based two-step conversion:

use tiffcp from libtiff-tools in connection with c42pdf, first use tiffcp (small and fast) to bring your TIFFs into adequate format:

                        tiffcp -c g4 -r 100000 sample.tif sampleg4.tif

Then run c42pdf on the new file:

                        c42pdf -o sample.pdf sampleg4.tif 

unix/Cygwin bash:

                  #!/bin/bash
                  basename=${1%.tif*}
                  tiffcp -c g4 -r 100000 $1 /tmp/temptif.tif
                  c42pdf -o ${basename}.pdf /tmp/temptif.tif

DOS shell: 'SCRIPT.BAT':

                  tiffcp -c g4 -r 100000 %1.tif %1g4.tif
                  c42pdf -o %1.pdf %1g4.tif
                  del %1g4.tif

to be run as: SCRIPT SAMPLE (without tif ending!). The "del" command will obviously delete a file named sampleg4.tif, so if you have a file sample.tif you should make sure that there is no file sampleg4.tif in the current directory.
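
On the unix/Cygwin side, the clash with an existing temporary file can be avoided by letting mktemp create a private directory. The following is only a sketch of that idea: tif_to_pdf is a name invented here, and it assumes tiffcp and c42pdf are on the PATH and signal failure through their exit status.

```shell
#!/bin/bash
# Two-step conversion (tiffcp, then c42pdf) using a private temp directory,
# so that no existing file can be overwritten or deleted by accident.
tif_to_pdf() {
    local src=$1
    local out=${src%.tif*}.pdf
    local tmpdir status
    tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/g4.XXXXXX") || return 1
    # Re-encode as single-strip G4, then convert the intermediate file.
    tiffcp -c g4 -r 100000 "$src" "$tmpdir/temp.tif" &&
        c42pdf -o "$out" "$tmpdir/temp.tif"
    status=$?
    # Clean up whether or not the conversion succeeded.
    rm -rf "$tmpdir"
    return $status
}

# Usage: tif_to_pdf sample.tif
```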

III. Three-step conversion (advanced users):

Another obvious way to go is to use tiffcp for the conversion of any TIFF document to a format c42pdf can definitely process, then do the fast throughput via c42pdf and then run ghostscript for FlateDecode and optimized page trees on it - so in effect c42pdf is used as intermediate link between the most powerful tool for reading TIFFs and the most powerful tool for producing PDFs. In comparison with the aforementioned two-step conversion via Postscript, the trick is that by jumping directly into PDF the poor Postscript compression can be avoided.

unix/Cygwin bash:

                  #!/bin/bash
                  basename=${1%.tif*}
                  tiffcp -c g4 -r 100000 $1 /tmp/temptif.tif
                  c42pdf -o /tmp/temptif.pdf /tmp/temptif.tif
                  gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
                        -sOutputFile=${basename}.pdf /tmp/temptif.pdf 

My benchmark for this is 0.751s on a 466 bogomips CPU for (III) as opposed to 1.121s for (II.2) on that CPU on the tiny sample.tif. On larger files the ratio even improves to about 50% savings in computing power and disk activity of III vs II.2.

Remark on choosing compression modes in Ghostscript (advanced users)

Ghostscript version 5.50 produces a /CCITTFaxDecode PDF, whereas Ghostscript version 6.50 produces /FlateDecode PDF.

The compression efficiency of /FlateDecode seems slightly (usually about 0-15%) better than /CCITTFaxDecode on random documents, so this is nothing to worry about.

If for whatever reason you want CCITTFaxDecode you can either deliberately use version 5.50 of Ghostscript, since explicit control by the -dAutoFilterMonoImages=false or -dMonoImageFilter=/CCITTFaxDecode seems not to be operational in version 6.50, or apply a trick posted to comp.lang.postscript.

Ghostscript also should be able to do it directly, but I haven't figured out how yet ;-) - pls tell me if you know how to do it because it is

Stuff that I had tried too (but not really recommended):

Postscript generation via tifftopnm and pnmtops:

                  tifftopnm sample.tif > sample.pbm
                  pnmtops sample.pbm > sample.ps

1.664s for the first, 0.742s for the second step.

Limitation: this sometimes results in black-white inversion.

convert any TIFF to something else (eg non-multistripped TIFF),

netpbm tools:

                        tifftopnm sample.tif > sample.pbm
                        pnmtotiff -g4 -rowsperstrip 100000 sample.pbm > sample.tif      

1.606s for the first, 0.791s for the second step.

This creates a TIFF image with all (well the first 100000 which is sufficient for paper sizes less than 10 meters ;-) ) rows in one strip.

Limitation: this results in black-white inversion. I have also not figured out how to do this on multipage TIFFs.

Various routes from single-page to multipage (Hartmut)

All of the above-mentioned conversions from TIFF to PDF work fine where input files are already multipage TIFFs. When you want to merge single page input files into one multipage document, conceptually you have three options when to do it: before conversion, during conversion and after conversion.

Before conversion should be a piece of cake with libtiff and can definitely be done with the -adjoin option of convert but if we are converting anyway, why do one useless conversion more?

During conversion: either use the c42pdf -l option for converting lists of files (see its documentation) if that is good enough for you or slightly adapt our 'tiff2ps'->ps2pdf two-step conversion process.

A reasonable way for doing this (due to the bulkiness of postscript it is probably better to delay merging into the second step) could be:

                        #!/bin/bash
                        # miff2pdf: merge several TIFFs into one PDF
                        if [ $# -le 2 ]
                        then
                                echo "Usage: miff2pdf outfile infiles..."
                                exit 1
                        fi
                        if [ -e tmpdir ]
                        then
                                echo "remove tmpdir manually"
                                exit 1
                        fi
                        outfile=$1
                        shift
                        mkdir tmpdir
                        for i in "$@"
                        do
                                # the Postscript file keeps the name of its input
                                tiff2ps -a "$i" > "tmpdir/$i"
                        done
                        cd tmpdir
                                gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
                                        -sOutputFile="../$outfile" "$@"
                        cd ..
                        rm -rf tmpdir

My benchmark is 2.7s for running this for sample.tif and a copy of it, sample2.tif. Please do not forget to add the -sPAPERSIZE to gs if you want to control that.

After conversion: not the best way to do it (one conversion more to do than during) but should be explained in case somebody just delivers you PDFs, so this is about concatenating PDFs ...

The (recommended) fast way is to use Ghostscript:

                gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
                        -sOutputFile=output.pdf sample.pdf sample.pdf

Benchmark: 1.224s for that command. Use -sPAPERSIZE where needed.

If you want to add bookmarks or document information, this is easy, see section below: bookmarks and other pdfmarks.

A more playful way is to use eg pjscript , a working script for concatenation would be:

                #concat.pjs: A simple concat pjscript. http://www.etymon.com/pj/
                #invoked: pjscript concat.pjs infile1 infile2 outfile
                println "Concatenating..."
                #Command line args are stored in the vars 'arg0', 'arg1', etc.
                =file arg0
                readpdf
                =file arg1
                appendpdf
                =file arg2
                writepdf
                println "Done."

time pjscript concat.pjs sample.pdf sample.pdf output.pdf: 4.329s

For multiple arguments let's write a little java program based on the same PJ library (LGPL):

                import com.etymon.pj.*;
                import com.etymon.pj.object.*;
                import java.io.*;
                public class PjConcat {
                        public static void main (String [] args) {
                                if (args.length < 2) System.out.println
                                  ("Usage: PjConcat outfile infiles ...");
                                else try {
                                  Pdf pdf = new Pdf (args[1]);
                                  int filesno = args.length - 1;
                                  for (int i=2; i<=filesno; i++) 
                                    pdf.appendPdfDocument(new Pdf(args[i]));
                                  pdf.writeToFile(args[0]);
                                } catch (Exception e) {
                                  e.printStackTrace();
                                }
                        }
                }       

time java PjConcat output.pdf sample.pdf sample.pdf: 6.175s

If you are interested in detailed PDF parsing, look at PJ's Pdf.appendPdfDocument as well as the pjscript source for a good point to start learning or, if you are more C-oriented, PandaLex (http://www.stillhq.com/).

Or use something ready-made like pdcat, (expiring) demo versions at pdcat (win), (linux), solaris

Other approach to joining PDFs.

Image size of PDFs

This will mainly concern engineering drawings.

Image size in PDF 1.1 and 1.2 is limited to 3240x3240 units, in PDF 1.3 it is limited to 14,400x14,400 units. PDF 1.4 (in appendix C) gives the same number of 14,400x14,400 as a limitation of Acrobat Reader. Accordingly, Acrobat Reader as of version 5.05 was unable to display a sample scan of 13,568x42,438 pixels, whereas that scan could be displayed by ghostview.
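
To estimate in advance whether a drawing will hit these limits, the pixel dimensions can be converted into PDF units. The sketch below assumes a unit is 1/72 inch (the PDF default) and that the page is sized from the scan resolution, which not every converter does; the function names are invented for this example.

```shell
#!/bin/bash
# Convert a pixel count at a given resolution (dpi) into PDF default
# user space units (1/72 inch), rounded down.
pixels_to_units() {
    # $1 = pixels, $2 = resolution in dpi
    echo $(( $1 * 72 / $2 ))
}

# Succeed if both page dimensions stay within the limit (default 14400
# units, the figure quoted above for PDF 1.3 and Acrobat Reader).
fits_limit() {
    # $1 = width in pixels, $2 = height in pixels, $3 = dpi, $4 = limit
    local limit=${4:-14400}
    [ "$(pixels_to_units "$1" "$3")" -le "$limit" ] &&
        [ "$(pixels_to_units "$2" "$3")" -le "$limit" ]
}

# Usage: fits_limit 2480 3508 300 || echo "page too large for PDF 1.3"
```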

If you plan to convert a whole repository of legacy TIFFs it would be wise to run tiffdump on some of them to see whether there are headers containing document metadata. Though this seldom occurs in practice you'd better at least check. You can also use tiffcp to check for unknown headers.

Free OCR ?

As an alternative to using Acrobat Capture, in some cases you might be interested in looking at GOCR or here .

Bookmarks and other pdfmarks (Hartmut) ?

By going through a Postscript stage you can utilize the pdfmark kludge for postscript (well documented by Thomas Merz, "pdfmark Primer").

The cleanest way to do this is to create another small file e.g. called "pdfmarks".

Into it we write something like:

                [ /Page 1 /View [/XYZ 0 842 1.0] /Title (1stpage) /OUT pdfmark
                [ /Page 2 /View [/XYZ 0 842 1.0] /Title (2ndpage) /OUT pdfmark
                [ /PageMode /UseOutlines /DOCVIEW pdfmark

/OUT is outline, a bookmark in pdfmark-speak. The last line makes sure bookmarks are displayed on startup. And you might want to add Document info as well:

                [ /Title (A guide to TIFF-PDF conversion)
                  /Author (Holger Blasum)
                  /Creator (lousy software)
                  /Keywords (who knows)
                  /ModDate (D:20011004101012)
                  /DOCINFO pdfmark

Now you just invoke ghostscript with the infile plus the new "pdfmarks" file:

                gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
                -sOutputFile=withmarks.pdf withoutmarks.pdf pdfmarks

This adds pdfmark annotations to an existing PDF document. You can also use multiple documents; it actually doesn't matter at which point ghostscript receives the additional pdfmark information. And of course "withoutmarks" can also be in Postscript format. Use -sPAPERSIZE where needed.

For much more detail see the above-mentioned pdfmark Primer (written at a time when gs had not yet supported pdfmark (an Acrobat Distiller "standard") and pdfmarks had to be embedded in PS but that is irrelevant here).
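
For documents with many pages, writing the "pdfmarks" file by hand gets tedious. Here is a sketch of a generator that emits one bookmark per page in the syntax shown above; make_pdfmarks and the "Page N" titles are placeholders of my own, and 842 is the top of an A4 page in points.

```shell
#!/bin/bash
# Print one bookmark per page in pdfmark syntax, followed by the line
# that makes the bookmarks visible on startup.
make_pdfmarks() {
    # $1 = number of pages, $2 = top of page in points (default 842, A4)
    local pages=$1
    local top=${2:-842}
    local n=1
    while [ "$n" -le "$pages" ]; do
        echo "[ /Page $n /View [/XYZ 0 $top 1.0] /Title (Page $n) /OUT pdfmark"
        n=$((n + 1))
    done
    echo "[ /PageMode /UseOutlines /DOCVIEW pdfmark"
}

# Usage: make_pdfmarks 12 > pdfmarks
```

The resulting file is then passed to gs after the input document, exactly like the hand-written one.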

Single files to single files: Any way to convert multiple input files (all files in a directory) into individual output files with the same name (new extension) (Greg) ?

Converting all files in a directory can be achieved at the operating system's shell level:

                Unix solution::

                        for A in *.tif; do c42pdf "$A"; done

                DOS/Win solution: ("MS-DOS command line prompt")::

                        for %f in (*.tif) do c42pdf %f

Another, more powerful approach (e.g. for converting all *.tif on an entire drive recursively, including subdirs, in a way that is also robust for "long" filenames) would be to download the Cygwin GNU tools (http://www.cygwin.com/) and run the following from the freshly installed Cygwin bash shell, started in the directory whose subdirs you want converted:

                find . -name '*.tif' -exec c42pdf '{}' ';'
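
If the run may be interrupted and restarted, or the endings vary (.tif, .TIF, .tiff), a slightly longer sketch helps. It uses the c42pdf -o option shown earlier; pdf_name and convert_tree are names invented here, -iname needs GNU or BSD find, and file names containing newlines are not handled.

```shell
#!/bin/bash
# Derive the output name: a trailing .tif/.tiff (any case) becomes .pdf,
# any other name just gets .pdf appended.
pdf_name() {
    case $1 in
        *.[Tt][Ii][Ff])     echo "${1%.*}.pdf" ;;
        *.[Tt][Ii][Ff][Ff]) echo "${1%.*}.pdf" ;;
        *)                  echo "$1.pdf" ;;
    esac
}

# Convert every TIFF below a directory, skipping files already converted.
convert_tree() {
    find "${1:-.}" -type f \( -iname '*.tif' -o -iname '*.tiff' \) -print |
    while IFS= read -r f; do
        out=$(pdf_name "$f")
        [ -e "$out" ] || c42pdf -o "$out" "$f"
    done
}

# Usage: convert_tree /path/to/scans
```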

I'd rather convert from PDF to TIFF.

This is an easy one, use Ghostscript:

                gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 -sPAPERSIZE=a4 \
                -sOutputFile=mypdf.tif mypdf.pdf

-r is the resolution, not using it would default to 204x196 (standard FAX resolution) for tiffs.

If you'd prefer multistrip -dMaxStripSize=8192 would be an option.

Or simply use ImageMagick: convert mypdf.pdf mypdf.tif. Or use pdfimages (Xpdf) to convert to JPEG, PBM or PPM.

I still have questions on PDF

Acknowledgments:

To Christoph Schulze, John A Kunze, James Y Hope, Hartmut Pilch, Greg Falvo, Jimmy Ngo, Dan Cogliano, Bill Gilchrist, Eric Smith for comments and questions.