jpegdoc2pdf

Convert smartphone JPGs of typewritten English documents into searchable OCRed PDFs using parallel batch processing.

Prerequisites

Install the following tools:

  • Tesseract OCR (make sure the tesseract binary is on your PATH)
  • img2pdf - lossless image-to-PDF converter
  • ocrmypdf - adds an OCR text layer to PDFs

# macOS
brew install tesseract img2pdf ocrmypdf

# Linux (Debian/Ubuntu)
sudo apt-get install tesseract-ocr img2pdf ocrmypdf
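
To confirm the tools are reachable before starting a batch, a small PATH check can help. This is a minimal sketch and not part of convert.sh; the function name missing_tools is hypothetical.

```shell
# Print the name of every listed tool that is not found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      status=1
    fi
  done
  return "$status"
}
```

For example, `missing_tools tesseract img2pdf ocrmypdf` prints nothing when all three tools are installed.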

Usage

Basic Usage

./convert.sh ROOT_DIR [OUT_DIR] [-P N] [-r|--recursive]

Examples

Process subdirectories in ROOT with default settings:

./convert.sh ./ROOT

Specify custom output directory:

./convert.sh ./ROOT ./my_output

Use 4 parallel processes:

./convert.sh ./ROOT ./out_pdfs -P 4

Process nested subdirectories recursively:

./convert.sh ./ROOT ./out_pdfs -P 4 --recursive

Folder Structure

Organize your images with one subdirectory per PDF:

ROOT/
  CaseA/
    001.jpg
    002.jpg
  CaseB/
    page1.jpg
    page2.jpg

  • Each subdirectory under ROOT becomes a single PDF
  • With --recursive, nested subfolders are also processed and named like Parent__Child.pdf
  • Output PDFs are saved to out_pdfs/ (or your specified output directory)
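
The naming rule can be sketched as a path-to-filename mapping. This assumes the name is formed by joining every path component below ROOT with a double underscore; convert.sh may treat deeper nesting differently, and pdf_name_for is a hypothetical helper.

```shell
# Map a directory path (relative to ROOT) to an output PDF name.
# Assumed rule: each "/" in the relative path becomes "__".
pdf_name_for() {
  rel=${1%/}   # drop one trailing slash, if present
  printf '%s.pdf\n' "$(printf '%s' "$rel" | sed 's|/|__|g')"
}

pdf_name_for CaseA          # prints CaseA.pdf
pdf_name_for Parent/Child   # prints Parent__Child.pdf
```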

Options

  • ROOT_DIR (required): Root directory containing subdirectories of images
  • OUT_DIR (optional): Output directory (default: out_pdfs)
  • -P N (optional): Number of parallel processes (default: CPU core count)
  • --recursive or -r (optional): Process nested subdirectories recursively
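
To illustrate what -P N implies, the sketch below picks a default from the CPU core count and fans out one job per subdirectory with xargs -P. It is an assumption about the general approach, not the code in convert.sh; default_jobs and run_parallel are hypothetical names.

```shell
# Number of online CPU cores, falling back to 1 if it cannot be determined.
default_jobs() {
  n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
  case $n in
    '' | *[!0-9]*) n=1 ;;   # non-numeric answer: use a safe default
  esac
  echo "$n"
}

# Run COMMAND once per immediate subdirectory of ROOT, at most N at a time.
# Usage: run_parallel ROOT N COMMAND [ARGS...]
run_parallel() {
  root=$1
  njobs=$2
  shift 2
  find "$root" -mindepth 1 -maxdepth 1 -type d -print0 |
    xargs -0 -n 1 -P "$njobs" "$@"
}
```

A call such as `run_parallel ./ROOT "$(default_jobs)" ./process_one.sh` would start one worker per folder, where process_one.sh stands in for whatever per-folder step is used.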

Supported Image Formats

jpg, jpeg, png, tif, tiff (case-insensitive)
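
Case-insensitive matching on these extensions could look like the following sketch. is_supported_image is a hypothetical helper, not a function taken from convert.sh.

```shell
# Succeed if the file name ends in one of the supported extensions,
# ignoring letter case.
is_supported_image() {
  case $1 in
    *.*) ;;            # has an extension, keep checking
    *) return 1 ;;     # no extension at all
  esac
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case $ext in
    jpg | jpeg | png | tif | tiff) return 0 ;;
    *) return 1 ;;
  esac
}
```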

OCR Settings

  • Language: English (eng)
  • Tesseract OEM: 1 (LSTM neural net mode)
  • Page segmentation mode: 6 (uniform text block)
  • ocrmypdf optimization level: 1 (lossless)
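
As a rough illustration of how these settings map onto the tools listed under Prerequisites, the sketch below builds one PDF from a folder with img2pdf and then adds the text layer with ocrmypdf. The function name ocr_folder and the temporary-file handling are assumptions; convert.sh may invoke the tools differently.

```shell
# Build one searchable PDF from the images in a single folder.
# Usage: ocr_folder SRC_DIR OUT_PDF
ocr_folder() {
  src_dir=$1
  out_pdf=$2
  tmp_pdf=$(mktemp "${TMPDIR:-/tmp}/jpegdoc2pdf.XXXXXX") || return 1

  # Collect supported images in glob (lexical) order as the page list.
  set --
  for f in "$src_dir"/*; do
    case $f in
      *.[jJ][pP][gG] | *.[jJ][pP][eE][gG] | *.[pP][nN][gG] | *.[tT][iI][fF] | *.[tT][iI][fF][fF])
        set -- "$@" "$f" ;;
    esac
  done
  if [ "$#" -eq 0 ]; then
    echo "no supported images in $src_dir" >&2
    rm -f "$tmp_pdf"
    return 1
  fi

  # Step 1: wrap the images in a PDF without re-encoding them.
  # Step 2: add the OCR layer with the settings listed above.
  status=0
  if img2pdf "$@" -o "$tmp_pdf"; then
    ocrmypdf -l eng --tesseract-oem 1 --tesseract-pagesegmode 6 \
      --optimize 1 "$tmp_pdf" "$out_pdf" || status=$?
  else
    status=$?
  fi
  rm -f "$tmp_pdf"
  return "$status"
}
```

With the folder layout shown earlier, `ocr_folder ROOT/CaseA out_pdfs/CaseA.pdf` would produce a single PDF for CaseA, provided out_pdfs/ already exists.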