# jpegdoc2pdf

Convert smartphone JPGs of typewritten English documents into searchable **OCRed PDFs** using parallel batch processing.

## Prerequisites

Install the following tools:

- **Tesseract OCR** (ensure it's in PATH)
- **img2pdf** - lossless image to PDF converter
- **ocrmypdf** - adds OCR layer to PDFs

```bash
# macOS
brew install tesseract img2pdf ocrmypdf

# Linux (Debian/Ubuntu)
apt-get install tesseract-ocr img2pdf ocrmypdf
```

## Usage

### Basic Usage

```bash
./convert.sh ROOT_DIR [OUT_DIR] [-P N] [--recursive]
```

### Examples

**Process subdirectories in ROOT with default settings:**

```bash
./convert.sh ./ROOT
```

**Specify custom output directory:**

```bash
./convert.sh ./ROOT ./my_output
```

**Use 4 parallel processes:**

```bash
./convert.sh ./ROOT ./out_pdfs -P 4
```

**Process nested subdirectories recursively:**

```bash
./convert.sh ./ROOT ./out_pdfs -P 4 --recursive
```

## Folder Structure

Organize your images with one subdirectory per PDF:

```
ROOT/
  CaseA/
    001.jpg
    002.jpg
  CaseB/
    page1.jpg
    page2.jpg
```

- Each subdirectory under `ROOT` becomes a single PDF
- Nested subfolders (with `--recursive`) are named like `Parent__Child.pdf`
- Output PDFs are saved to `out_pdfs/` (or your specified output directory)

## Options

- **ROOT_DIR** (required): Root directory containing subdirectories of images
- **OUT_DIR** (optional): Output directory (default: `out_pdfs`)
- **-P N** (optional): Number of parallel processes (default: CPU core count)
- **--recursive** or **-r**: Process nested subdirectories recursively

## Supported Image Formats

jpg, jpeg, png, tif, tiff (case-insensitive)

## OCR Settings

- Language: English (`eng`)
- Tesseract OEM: 1 (LSTM neural net mode)
- Page segmentation mode: 6 (uniform text block)
- Optimization level: 1
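
For orientation, the OCR settings and the `Parent__Child.pdf` naming rule described above can be sketched as two small shell helpers. This is an illustrative sketch, not the contents of `convert.sh`: the function names `pdf_name` and `ocr_command` are hypothetical, and the mapping of the settings onto `ocrmypdf` flags (`-l`, `--tesseract-oem`, `--tesseract-pagesegmode`, `--optimize`) is an assumption about how the script invokes the tool. The second helper only prints the command, so nothing is executed.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; the real convert.sh may be structured differently.

# Derive an output PDF name from a folder path under ROOT.
# Path separators in the relative path become "__".
pdf_name() {
  local root="${1%/}"
  local dir="${2%/}"
  local rel="${dir#"$root"/}"
  printf '%s.pdf\n' "$(printf '%s' "$rel" | sed 's|/|__|g')"
}

# Print (do not run) an ocrmypdf command matching the listed OCR settings.
# Assumption: the settings map one-to-one onto these ocrmypdf options.
ocr_command() {
  local in_pdf="$1"
  local out_pdf="$2"
  printf 'ocrmypdf -l eng --tesseract-oem 1 --tesseract-pagesegmode 6 --optimize 1 %s %s\n' \
    "$in_pdf" "$out_pdf"
}

pdf_name ./ROOT ./ROOT/Parent/Child   # prints Parent__Child.pdf
ocr_command raw.pdf out_pdfs/Parent__Child.pdf
```

Printing the command instead of executing it makes it easy to check the flags against your installed `ocrmypdf` version before running a large batch.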