Donovan

This is a small collection of software and information useful for producing etexts for Project Gutenberg, with an emphasis on tools and techniques for post-processing files that have gone, or will go, through the Distributed Proofreaders site.

pg2html-0.18.zip
(DOS/Windows executable, C source code and documentation) 2005-07-19

This utility converts an HTML-ready or PG-ASCII formatted text file to HTML 4.01 Transitional or XHTML 1.0 Transitional.

General usage: pg2html inputfile >outputfile

Platforms: Windows (98, 2000, XP) and Linux.
For other platforms, the C source should compile without issues.

Please see the included README and CHANGELOG files for the most up-to-date information!

unpaper-0.2 for DOS.
(DOS/Windows port of unpaper) 2006-07-04
Standalone DOS executable compiled with lcc-win32 and -DNOSINCOS.

See the unpaper site for complete details.

deskew for DOS/cygwin.
(cygwin port of deskew) 2004-11-15
This is a cygwin-based utility for deskewing a scanned page image, based on leptonlib.

Command Line Tricks.

To strip DOS newlines and remove end-of-line spaces all at once:
tr -d '\r' <original.txt | sed -e's/ \{1,\}$//g' >output.txt
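
A quick way to check that the pipeline did what was intended is to run it on a small throwaway sample and dump the resulting bytes. The following is only a sketch: the temporary directory and the sample text are made up for illustration, and it assumes a POSIX shell with mktemp and od available.

```shell
# Hypothetical sample: CRLF line endings, with trailing spaces on the first line.
tmpdir=$(mktemp -d)
printf 'first line  \r\nsecond line\r\n' > "$tmpdir/original.txt"

# Drop every carriage return, then strip one or more spaces at end of line.
tr -d '\r' < "$tmpdir/original.txt" | sed -e 's/ \{1,\}$//' > "$tmpdir/output.txt"

# Dump the bytes: no \r should remain, and no space should precede a \n.
od -c "$tmpdir/output.txt"
```

Note that the sed expression only removes spaces, so trailing tabs, if any, would be left in place.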

To add DOS newlines back to a file:
awk 'BEGIN { ORS="\r\n" } { print $0 }' inputfile > outputfile
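
One thing to watch for: if the input already contains DOS newlines, the command above appends a second carriage return to those lines. A possible variant, offered as a sketch rather than as part of the original tip, strips any existing carriage return before printing, so it is safe to run on mixed or already-converted files. The file names and sample text below are hypothetical.

```shell
# Hypothetical sample: one Unix line and one line that already ends in CRLF.
tmpdir=$(mktemp -d)
printf 'unix line\nalready dos\r\n' > "$tmpdir/inputfile"

# Remove a trailing \r if present, then let ORS write a single \r\n per line.
awk 'BEGIN { ORS="\r\n" } { sub(/\r$/, ""); print }' "$tmpdir/inputfile" > "$tmpdir/outputfile"

# Dump the bytes: every line should end in exactly one \r \n pair.
od -c "$tmpdir/outputfile"
```

Running the variant a second time on its own output should leave the file unchanged.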

Copyright © 2003-2007 D. Garcia: Updated April 24, 2007.