Scanned PDF to EPUB Converter
This tool converts scanned PDF books into clean, readable EPUB ebooks using the Baidu PaddleOCR Layout Analysis API.
Features
- High-Quality Layout Analysis: Uses PaddleOCR to intelligently detect paragraphs, headers, images, and tables.
- Smart Chapter Splitting: Automatically detects chapter headings from OCR output. An interactive TOC review lets you confirm, remove, or adjust detected chapters before generating the EPUB.
- Cover Image: Automatically extracts the first page of the PDF as the EPUB cover.
- Metadata Support: Interactively prompts for book title and author based on OCR'd first-page text, or accepts them via CLI arguments.
- Image Embedding: Preserves images from the original PDF.
- Clean Output: Removes headers, footers, and page numbers for a seamless reading experience.
- Robustness:
  - Checkpointing: Saves progress after every chunk. If interrupted, simply re-run to resume.
  - Rate Limiting: Includes delays to respect API limits.
  - Retry Logic: Automatically retries failed API requests.
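The robustness features above can be illustrated with a minimal sketch. It is not the script's actual code: the function names, the JSON checkpoint format, and the default retry count and delay are all assumptions made for the example.

```python
import json
import time
from pathlib import Path


def call_with_retry(func, retries=3, delay=2.0):
    """Call func(); on failure wait `delay` seconds and try again (hypothetical helper)."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError(f"gave up after {retries} attempts") from last_error


def process_chunks(chunks, process, checkpoint_path, delay=0.0):
    """Process chunks in order, saving all results to disk after each one."""
    path = Path(checkpoint_path)
    # Resume: load results from an earlier, interrupted run if a checkpoint exists.
    done = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for index, chunk in enumerate(chunks):
        key = str(index)
        if key in done:
            continue  # already finished in an earlier run
        done[key] = call_with_retry(lambda: process(chunk), delay=delay)
        path.write_text(json.dumps(done), encoding="utf-8")  # checkpoint
        time.sleep(delay)  # pause between requests (rate limiting)
    return [done[str(i)] for i in range(len(chunks))]
```

Re-running `process_chunks` with the same checkpoint file skips every chunk that already has a saved result, which is the resume behavior described above. Results must be JSON-serializable in this sketch.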
Prerequisites
- Python 3.8+
- PaddleOCR API Token
Getting an API Token
- Log in to Baidu AIStudio (飞桨星河社区).
- Go to the "Applications" or "Online Models" section (Layout Parsing).
- Find the "PaddleOCR" or "Document Analysis" API.
- Copy your private API Token from your user profile or application settings dashboard.
Note: Ensure you have sufficient quota (pages/day) for your usage.
Installation & Usage (Recommended: uv)
This project uses uv for fast, reliable dependency management.
- Clone the repository:

      git clone https://github.com/yourusername/pdf2epub-paddle.git
      cd pdf2epub-paddle

- Install uv (if you haven't already):

      # On macOS/Linux
      curl -LsSf https://astral.sh/uv/install.sh | sh

- Set up your API token:

      cp .env.example .env

  Edit .env and add your token:

      PADDLE_API_TOKEN=your_api_token_here

- Run directly with uv (handles virtualenv & dependencies automatically):

      uv run pdf2epub_paddle.py /path/to/your/book.pdf

  The tool will display the OCR'd text from the first page and prompt you to enter the book title and author.

  You can also provide metadata directly via CLI arguments:

      uv run pdf2epub_paddle.py --title "Book Title" --author "Author Name" /path/to/your/book.pdf

  To specify a custom output path:

      uv run pdf2epub_paddle.py --output /path/to/output.epub /path/to/your/book.pdf

  To skip the interactive TOC review and use automatic chapter detection:

      uv run pdf2epub_paddle.py --auto-toc /path/to/your/book.pdf

  To produce a single-chapter EPUB with no chapter splitting:

      uv run pdf2epub_paddle.py --no-toc /path/to/your/book.pdf
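For orientation, the flags used in the commands above could be parsed roughly as follows. This is a sketch built only from the options this README mentions. The help strings and the choice to make --auto-toc and --no-toc mutually exclusive are assumptions, and the script's real parser may differ.

```python
import argparse


def build_parser():
    """Build a parser for the options named in this README (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pdf2epub_paddle.py")
    parser.add_argument("pdf", help="path to the scanned PDF")
    parser.add_argument("--title", help="book title")
    parser.add_argument("--author", help="book author")
    parser.add_argument("--output", help="custom path for the generated EPUB")
    # Assumption: the two TOC modes cannot be combined.
    toc = parser.add_mutually_exclusive_group()
    toc.add_argument("--auto-toc", action="store_true",
                     help="skip the interactive TOC review")
    toc.add_argument("--no-toc", action="store_true",
                     help="produce a single-chapter EPUB")
    return parser
```

Calling `build_parser().parse_args(["--auto-toc", "book.pdf"])` yields a namespace with `auto_toc` set to True and `title`, `author`, and `output` left as None.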
Alternative: Standard Pip
If you prefer standard pip:
- Create a virtual environment:

      python -m venv .venv
      source .venv/bin/activate

- Install dependencies:

      pip install .  # Installs from pyproject.toml

- Set up your API token:

      cp .env.example .env
      # Edit .env and add your token

- Run:

      python pdf2epub_paddle.py /path/to/your/book.pdf
Note: You can also set the token directly via an environment variable:

      export PADDLE_API_TOKEN='your_token'

The .env file is loaded automatically but will not override an existing environment variable.
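The precedence rule in the note (an existing environment variable wins over the .env file) can be sketched with the standard library alone. The loader below is a hypothetical stand-in for illustration. The script may well use a dedicated library for this.

```python
import os
from pathlib import Path


def load_env_file(path=".env", environ=os.environ):
    """Copy KEY=VALUE pairs from a .env file into environ, keeping existing values."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        # setdefault only writes when the key is absent, so a value exported
        # in the shell is never overridden by the file.
        environ.setdefault(key.strip(), value.strip().strip("'\""))
```

With `PADDLE_API_TOKEN` already exported in the shell, calling `load_env_file()` leaves that value untouched and only fills in variables that are missing.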
Configuration
- Chunk Size: Default is 5 pages per chunk to ensure stability. You can modify CHUNK_SIZE in the script if you have a stable connection and higher limits.
- Timeout: Default timeout is 180s per request.
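To show what the chunk size means in practice, here is a small sketch of splitting a page count into chunk ranges. Only CHUNK_SIZE and the two default values come from this README. REQUEST_TIMEOUT and page_chunks are hypothetical names, and the script's own chunking code may look different.

```python
CHUNK_SIZE = 5         # pages per request (default stated above)
REQUEST_TIMEOUT = 180  # seconds per request (default stated above; name is hypothetical)


def page_chunks(total_pages, chunk_size=CHUNK_SIZE):
    """Yield (start, end) page ranges, 0-based with end exclusive."""
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages)
```

For a 12-page PDF, `list(page_chunks(12))` gives `[(0, 5), (5, 10), (10, 12)]`, so three requests are sent. A larger chunk size means fewer requests but more pages at risk of reprocessing if one request fails.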