initial commit

2025-11-01 18:04:28 -04:00
commit 4eb7ddfd99
5 changed files with 703 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,84 @@
# jpegdoc2pdf
Convert smartphone JPGs of typewritten English documents into searchable **OCRed PDFs** using parallel batch processing.
## Prerequisites
Install the following tools:
- **Tesseract OCR** (ensure it's in PATH)
- **img2pdf** - lossless image to PDF converter
- **ocrmypdf** - adds OCR layer to PDFs
```bash
# macOS
brew install tesseract img2pdf ocrmypdf
# Linux (Debian/Ubuntu)
sudo apt-get install tesseract-ocr img2pdf ocrmypdf
```
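To confirm that the tools are actually on `PATH` before a long batch run, a small check like the following can help. This is a minimal sketch; `check_tools` is a hypothetical helper and is not part of `convert.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report every named tool that is not on PATH.
# Prints the missing names to stderr and returns non-zero if any is absent.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_tools tesseract img2pdf ocrmypdf; then
  echo "all prerequisites found"
fi
```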
## Usage
### Basic Usage
```bash
./convert.sh ROOT_DIR [OUT_DIR] [-P N] [--recursive]
```
### Examples
**Process subdirectories in ROOT with default settings:**
```bash
./convert.sh ./ROOT
```
**Specify custom output directory:**
```bash
./convert.sh ./ROOT ./my_output
```
**Use 4 parallel processes:**
```bash
./convert.sh ./ROOT ./out_pdfs -P 4
```
**Process nested subdirectories recursively:**
```bash
./convert.sh ./ROOT ./out_pdfs -P 4 --recursive
```
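One common way to get per-folder parallelism like `-P 4` in a shell script is `xargs -P`. Whether `convert.sh` works this way is an assumption; the sketch below only illustrates the idea, and `process_dir` and `run_parallel` are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the real per-folder conversion step.
process_dir() {
  [ -n "$1" ] || return 0      # nothing to do if xargs passed no directory
  echo "converting $1"
}
export -f process_dir          # make the function visible to the bash children

# Run process_dir on each immediate subdirectory of ROOT, at most JOBS at a time.
run_parallel() {
  local root="$1" jobs="$2"
  find "$root" -mindepth 1 -maxdepth 1 -type d -print0 |
    xargs -0 -n 1 -P "$jobs" bash -c 'process_dir "$1"' _
}

# Example: run_parallel ./ROOT 4
```

NUL-delimited names (`-print0` with `-0`) keep folder names with spaces intact.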
## Folder Structure
Organize your images with one subdirectory per PDF:
```
ROOT/
CaseA/
001.jpg
002.jpg
CaseB/
page1.jpg
page2.jpg
```
- Each subdirectory under `ROOT` becomes a single PDF
- With `--recursive`, PDFs built from nested subfolders are named like `Parent__Child.pdf`
- Output PDFs are saved to `out_pdfs/` (or your specified output directory)
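The naming rule above can be sketched as a path-to-name mapping. `pdf_name` is a hypothetical helper, and applying the same `__` separator to deeper nesting is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the output PDF name from a folder path
# given relative to ROOT.
pdf_name() {
  local rel="${1%/}"                    # drop a trailing slash, if any
  printf '%s.pdf\n' "${rel//\//__}"     # replace every "/" with "__"
}

pdf_name "CaseA"          # prints CaseA.pdf
pdf_name "Parent/Child"   # prints Parent__Child.pdf
```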
## Options
- **ROOT_DIR** (required): Root directory containing subdirectories of images
- **OUT_DIR** (optional): Output directory (default: `out_pdfs`)
- **-P N** (optional): Number of parallel processes (default: CPU core count)
- **--recursive** or **-r**: Process nested subdirectories recursively
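As an illustration of how these options and their defaults fit together, here is a minimal argument-parsing sketch. It is not taken from `convert.sh`; `parse_args` and the variable names are assumptions, and `getconf _NPROCESSORS_ONLN` is just one portable way to read the CPU core count.

```bash
#!/usr/bin/env bash
# Hypothetical parser for: ROOT_DIR [OUT_DIR] [-P N] [--recursive|-r]
parse_args() {
  RECURSIVE=0
  JOBS="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"  # default: core count
  local positional=()
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -P)
        [ "$#" -ge 2 ] || { echo "-P needs a number" >&2; return 1; }
        JOBS="$2"; shift 2 ;;
      -r|--recursive) RECURSIVE=1; shift ;;
      *) positional+=("$1"); shift ;;
    esac
  done
  ROOT_DIR="${positional[0]:-}"
  OUT_DIR="${positional[1]:-out_pdfs}"                       # default output dir
  [ -n "$ROOT_DIR" ] || { echo "ROOT_DIR is required" >&2; return 1; }
}

parse_args ./ROOT ./out_pdfs -P 4 --recursive
echo "root=$ROOT_DIR out=$OUT_DIR jobs=$JOBS recursive=$RECURSIVE"
```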
## Supported Image Formats
`jpg`, `jpeg`, `png`, `tif`, `tiff` (extensions are matched case-insensitively)
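Case-insensitive matching of these extensions can be done with `find -iname`. The sketch below is an assumption about how a folder's pages might be collected; `list_images` is a hypothetical helper, and the plain lexicographic sort is also an assumption about page order.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the supported image files directly inside a folder,
# in lexicographic order.
list_images() {
  find "$1" -maxdepth 1 -type f \
    \( -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' \
       -o -iname '*.tif' -o -iname '*.tiff' \) | LC_ALL=C sort
}

# Example: list_images ROOT/CaseA
```

With a lexicographic sort, `page10.jpg` comes before `page2.jpg`, so zero-padded names such as `001.jpg` are the safer choice.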
## OCR Settings
- Language: English (`eng`)
- Tesseract OEM: 1 (LSTM neural net mode)
- Page segmentation mode: 6 (uniform text block)
- Optimization level: 1
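For a single folder, these settings roughly correspond to the manual pipeline below. This is a sketch under assumptions, not the contents of `convert.sh`: it assumes the optimization level refers to ocrmypdf's `--optimize` option, handles only `.jpg` files, and uses a hypothetical helper named `ocr_folder`.

```bash
#!/usr/bin/env bash
# Assumed mapping of the settings above onto ocrmypdf options.
OCR_OPTS=(-l eng --tesseract-oem 1 --tesseract-pagesegmode 6 --optimize 1)

# Hypothetical helper: merge a folder's JPGs into one PDF, then add the OCR layer.
ocr_folder() {
  local src_dir="$1" out_pdf="$2" tmp_dir status
  tmp_dir="$(mktemp -d)" || return 1
  img2pdf "$src_dir"/*.jpg -o "$tmp_dir/merged.pdf" &&
    ocrmypdf "${OCR_OPTS[@]}" "$tmp_dir/merged.pdf" "$out_pdf"
  status=$?
  rm -rf "$tmp_dir"          # clean up the intermediate PDF either way
  return "$status"
}

# Example: ocr_folder ROOT/CaseA out_pdfs/CaseA.pdf
```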