initial commit
README.md

# jpegdoc2pdf

Convert smartphone JPGs of typewritten English documents into searchable **OCRed PDFs** using parallel batch processing.

## Prerequisites

Install the following tools:

- **Tesseract OCR** (ensure it's in PATH)
- **img2pdf** - lossless image to PDF converter
- **ocrmypdf** - adds OCR layer to PDFs

```bash
# macOS
brew install tesseract img2pdf ocrmypdf

# Linux (Debian/Ubuntu); installing packages requires root
sudo apt-get install tesseract-ocr img2pdf ocrmypdf
```
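
To confirm that the three tools are actually reachable after installation, a small check along these lines may help. It is only a sketch: the helper name `missing_tools` is made up for this example and is not part of `convert.sh`.

```bash
# Print the name of each listed command that is not found on PATH.
# (Hypothetical helper, not taken from convert.sh.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools tesseract img2pdf ocrmypdf)
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on a single line.
  echo "Missing tools:" $missing
else
  echo "All prerequisites found."
fi
```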

## Usage

### Basic Usage

```bash
./convert.sh ROOT_DIR [OUT_DIR] [-P N] [--recursive]
```

### Examples

**Process subdirectories in ROOT with default settings:**
```bash
./convert.sh ./ROOT
```

**Specify custom output directory:**
```bash
./convert.sh ./ROOT ./my_output
```

**Use 4 parallel processes:**
```bash
./convert.sh ./ROOT ./out_pdfs -P 4
```

**Process nested subdirectories recursively:**
```bash
./convert.sh ./ROOT ./out_pdfs -P 4 --recursive
```
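
The `-P N` option resembles the `-P` flag of `xargs`, so one plausible way to get this kind of per-folder parallelism is sketched below. This is an assumption about the approach, not a description of how `convert.sh` is written; `for_each_subdir` is a hypothetical helper name.

```bash
# Hypothetical helper: run a command once per immediate subdirectory of a
# root directory, with at most MAX_JOBS commands running at the same time.
# The subdirectory path is appended as the last argument of the command.
for_each_subdir() {
  root=$1
  max_jobs=$2
  shift 2
  # -print0 / -0 keep directory names with spaces intact.
  find "$root" -mindepth 1 -maxdepth 1 -type d -print0 |
    xargs -0 -n 1 -P "$max_jobs" "$@"
}
```

For example, `for_each_subdir ./ROOT 4 echo "would convert:"` prints one line per subdirectory, running up to four `echo` processes at once.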

## Folder Structure

Organize your images with one subdirectory per PDF:

```
ROOT/
  CaseA/
    001.jpg
    002.jpg
  CaseB/
    page1.jpg
    page2.jpg
```

- Each subdirectory under `ROOT` becomes a single PDF
- Nested subfolders (with `--recursive`) are named like `Parent__Child.pdf`
- Output PDFs are saved to `out_pdfs/` (or your specified output directory)
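
The `Parent__Child.pdf` naming rule can be illustrated with a short function that flattens a folder's path relative to `ROOT`. This is a minimal sketch under the assumption that every path separator becomes `__`; the function name `pdf_name_for` is invented for this example.

```bash
# Hypothetical helper: derive the output PDF name for a folder under ROOT
# by replacing each "/" in its relative path with "__".
pdf_name_for() {
  root=${1%/}            # drop a trailing slash, if any
  dir=${2%/}
  rel=${dir#"$root"/}    # path of the folder relative to ROOT
  printf '%s.pdf\n' "$rel" | sed 's|/|__|g'
}

pdf_name_for ./ROOT ./ROOT/CaseA          # prints CaseA.pdf
pdf_name_for ./ROOT ./ROOT/Parent/Child   # prints Parent__Child.pdf
```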

## Options

- **ROOT_DIR** (required): Root directory containing subdirectories of images
- **OUT_DIR** (optional): Output directory (default: `out_pdfs`)
- **-P N** (optional): Number of parallel processes (default: CPU core count)
- **--recursive** or **-r**: Process nested subdirectories recursively
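
For readers curious how a command line like this is typically handled, the sketch below parses the same options with the same defaults. It is an illustration only: the function `parse_args` and the variable names are assumptions, and using `getconf _NPROCESSORS_ONLN` to obtain the core count is one common choice, not necessarily what `convert.sh` does.

```bash
# Hypothetical parser for: ROOT_DIR [OUT_DIR] [-P N] [--recursive|-r]
parse_args() {
  ROOT_DIR=""
  OUT_DIR="out_pdfs"   # default output directory
  JOBS=""
  RECURSIVE=0
  while [ $# -gt 0 ]; do
    case "$1" in
      -P)
        [ $# -ge 2 ] || { echo "-P needs a number" >&2; return 1; }
        JOBS=$2
        shift 2
        ;;
      -r|--recursive)
        RECURSIVE=1
        shift
        ;;
      *)
        # First positional argument is ROOT_DIR, the next is OUT_DIR.
        if [ -z "$ROOT_DIR" ]; then ROOT_DIR=$1; else OUT_DIR=$1; fi
        shift
        ;;
    esac
  done
  [ -n "$ROOT_DIR" ] || { echo "ROOT_DIR is required" >&2; return 1; }
  # Default to the number of online CPU cores, falling back to 1.
  [ -n "$JOBS" ] || JOBS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
}
```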

## Supported Image Formats

`jpg`, `jpeg`, `png`, `tif`, `tiff` (file extensions are matched case-insensitively)
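
Case-insensitive extension matching can be done with the `-iname` test of `find`, as in the sketch below. The helper `list_images` is hypothetical, and sorting the result by filename to fix the page order is an assumption; the README does not state how pages are ordered.

```bash
# Hypothetical helper: list the supported image files directly inside a
# folder, matching extensions regardless of case (e.g. .JPG, .Tiff).
list_images() {
  find "$1" -maxdepth 1 -type f \( \
      -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' \
      -o -iname '*.tif' -o -iname '*.tiff' \) | sort
}
```

Note that a plain `sort` is lexical, so `page10.jpg` would come before `page2.jpg`; zero-padded names such as `001.jpg` avoid that.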

## OCR Settings

- Language: English (`eng`)
- Tesseract OEM: 1 (LSTM neural net mode)
- Page segmentation mode: 6 (uniform text block)
- Optimization level: 1
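
Expressed as commands, these settings would correspond roughly to the following two-step pipeline for a single folder. This is a sketch of the equivalent `img2pdf` and `ocrmypdf` invocations, not the script's actual code; the helper `ocr_folder` and the variable `OCR_FLAGS` are invented here, and for brevity the glob only picks up lowercase `.jpg` files.

```bash
# ocrmypdf flags mirroring the settings listed above.
OCR_FLAGS="-l eng --tesseract-oem 1 --tesseract-pagesegmode 6 --optimize 1"

# Hypothetical helper: turn the .jpg files in one folder into an OCRed PDF.
ocr_folder() {
  src_dir=$1
  out_pdf=$2
  work_dir=$(mktemp -d) || return 1
  # Step 1: wrap the images in a PDF.
  # Step 2: add the searchable text layer.
  # $OCR_FLAGS is unquoted on purpose so it splits into separate arguments.
  img2pdf "$src_dir"/*.jpg -o "$work_dir/images.pdf" &&
    ocrmypdf $OCR_FLAGS "$work_dir/images.pdf" "$out_pdf"
  status=$?
  rm -rf "$work_dir"
  return "$status"
}
```

A call such as `ocr_folder ./ROOT/CaseA ./out_pdfs/CaseA.pdf` would then produce one searchable PDF for that folder, provided `out_pdfs/` already exists.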