jpegdoc2pdf
Convert smartphone JPGs of typewritten English documents into searchable OCRed PDFs using parallel batch processing.
Prerequisites
Install the following tools:
- Tesseract OCR (ensure it's in PATH)
- img2pdf - lossless image to PDF converter
- ocrmypdf - adds OCR layer to PDFs
# macOS
brew install tesseract img2pdf ocrmypdf
# Linux (Debian/Ubuntu)
sudo apt-get install tesseract-ocr img2pdf ocrmypdf
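To confirm the three tools are reachable before running a batch, a small check like the following may help. This is a minimal sketch and not part of convert.sh; the function name check_tools is hypothetical.

```shell
# Hypothetical helper: report every named tool that is not found in PATH.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_tools tesseract img2pdf ocrmypdf && echo "all prerequisites found"
```

The function returns a non-zero status if any tool is missing, so it can guard the start of a script.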
Usage
Basic Usage
./convert.sh ROOT_DIR [OUT_DIR] [-P N] [--recursive]
Examples
Process subdirectories in ROOT with default settings:
./convert.sh ./ROOT
Specify custom output directory:
./convert.sh ./ROOT ./my_output
Use 4 parallel processes:
./convert.sh ./ROOT ./out_pdfs -P 4
Process nested subdirectories recursively:
./convert.sh ./ROOT ./out_pdfs -P 4 --recursive
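The -P option suggests one job per subdirectory with a cap on concurrent jobs. One common way to get that behaviour is find piped into xargs -P; the sketch below assumes this approach, and convert.sh itself may be implemented differently. The function name process_dirs is hypothetical.

```shell
# Hypothetical sketch: run a command once per immediate subdirectory of
# $1, with at most $2 commands running at the same time.
process_dirs() {
  root=$1
  jobs=$2
  shift 2
  # -print0 / -0 keep directory names with spaces intact.
  find "$root" -mindepth 1 -maxdepth 1 -type d -print0 |
    xargs -0 -n 1 -P "$jobs" "$@"
}
```

For example, process_dirs ./ROOT 4 echo would print each subdirectory path using up to four workers; replacing echo with a per-folder conversion command gives the parallel batch.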
Folder Structure
Organize your images with one subdirectory per PDF:
ROOT/
  CaseA/
    001.jpg
    002.jpg
  CaseB/
    page1.jpg
    page2.jpg
- Each subdirectory under ROOT becomes a single PDF
- Nested subfolders (with --recursive) are named like Parent__Child.pdf
- Output PDFs are saved to out_pdfs/ (or your specified output directory)
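The naming rule above can be illustrated with a short function that turns a subdirectory path into its PDF name. This is a sketch of the described convention only; the helper name pdf_name is hypothetical and the script's real logic may differ.

```shell
# Hypothetical helper: derive the output PDF name for a subdirectory.
# $1 is the root directory, $2 is a directory somewhere below it.
pdf_name() {
  root=${1%/}                 # drop a trailing slash, if any
  dir=${2%/}
  rel=${dir#"$root"/}         # path relative to the root
  # Join nested path components with a double underscore.
  printf '%s.pdf\n' "$(printf '%s' "$rel" | sed 's|/|__|g')"
}

pdf_name ./ROOT ./ROOT/CaseA          # prints CaseA.pdf
pdf_name ./ROOT ./ROOT/Parent/Child   # prints Parent__Child.pdf
```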
Options
- ROOT_DIR (required): Root directory containing subdirectories of images
- OUT_DIR (optional): Output directory (default: out_pdfs)
- -P N (optional): Number of parallel processes (default: CPU core count)
- --recursive or -r: Process nested subdirectories recursively
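As an illustration of how these options and defaults fit together, here is a minimal argument-parsing sketch. It is an assumption about one reasonable implementation, not the actual code in convert.sh; the function and variable names are hypothetical, and getconf _NPROCESSORS_ONLN is used here as one way to read the core count.

```shell
# Hypothetical parser for: ROOT_DIR [OUT_DIR] [-P N] [--recursive]
parse_args() {
  ROOT_DIR=""
  OUT_DIR="out_pdfs"          # documented default output directory
  JOBS=""
  RECURSIVE=0
  while [ $# -gt 0 ]; do
    case "$1" in
      -P)
        [ $# -ge 2 ] || { echo "-P needs a number" >&2; return 2; }
        JOBS=$2
        shift 2
        ;;
      -r|--recursive)
        RECURSIVE=1
        shift
        ;;
      *)
        # First positional argument is the root, the next is the output dir.
        if [ -z "$ROOT_DIR" ]; then ROOT_DIR=$1; else OUT_DIR=$1; fi
        shift
        ;;
    esac
  done
  [ -n "$ROOT_DIR" ] || { echo "ROOT_DIR is required" >&2; return 2; }
  # Default to the number of online CPU cores, falling back to 1.
  if [ -z "$JOBS" ]; then
    JOBS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  fi
}
```

After parse_args ./ROOT ./out_pdfs -P 4 --recursive, the variables hold ROOT_DIR=./ROOT, OUT_DIR=./out_pdfs, JOBS=4 and RECURSIVE=1.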
Supported Image Formats
jpg, jpeg, png, tif, tiff (case-insensitive)
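Case-insensitive matching means files such as 001.JPG or scan.Tiff are picked up too. A minimal sketch of such a check, assuming the extension is lowercased before comparison (the helper name is_supported_image is hypothetical):

```shell
# Hypothetical helper: succeed if the file name has a supported extension.
is_supported_image() {
  case "$1" in
    *.*) ;;                   # has an extension, keep checking
    *) return 1 ;;            # no extension at all
  esac
  # Take the text after the last dot and lowercase it.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    jpg|jpeg|png|tif|tiff) return 0 ;;
    *) return 1 ;;
  esac
}

is_supported_image "001.JPG" && echo "001.JPG is supported"
```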
OCR Settings
- Language: English (eng)
- Tesseract OEM: 1 (LSTM neural net mode)
- Page segmentation mode: 6 (uniform text block)
- Optimization level: 1
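These settings map onto ocrmypdf command-line options. The sketch below shows how one folder might be converted with them, assuming img2pdf first merges the images and ocrmypdf then adds the text layer. The function name folder_to_pdf is hypothetical, only lowercase .jpg files are globbed for brevity, and the exact invocation in convert.sh may differ.

```shell
# Hypothetical helper: merge a folder's JPGs into one PDF, then OCR it.
# $1 is the image folder, $2 is the output PDF path.
folder_to_pdf() {
  src_dir=$1
  out_pdf=$2
  work_dir=$(mktemp -d) || return 1
  # Step 1: lossless image-to-PDF merge (glob order sets page order).
  # Step 2: add the OCR layer with the settings listed above.
  img2pdf "$src_dir"/*.jpg -o "$work_dir/pages.pdf" &&
    ocrmypdf -l eng --tesseract-oem 1 --tesseract-pagesegmode 6 \
      --optimize 1 "$work_dir/pages.pdf" "$out_pdf"
  status=$?
  rm -rf "$work_dir"          # clean up the intermediate PDF
  return "$status"
}
```

A call such as folder_to_pdf ./ROOT/CaseA ./out_pdfs/CaseA.pdf would then produce one searchable PDF for that folder, provided the output directory already exists.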