Project Overview
The CCSD PDF to HTML Converter is a custom PHP web application designed to transform PDF documents into clean, “vanilla” HTML. Developed to support large-scale organizational needs, it provides an automated way to extract text and images from PDFs while maintaining a professional, web-ready structure.
The primary goal was to create a tool that handles the heavy lifting of document conversion while ensuring that the resulting code is lightweight, accessible, and free of the bloat typically found in automated PDF exports.
The Challenge
Converting PDFs to HTML often results in “spaghetti code”: inline styles, absolute positioning, and fragmented text that is difficult to edit or repurpose for the web. For a large organization, standardized, clean output is critical for maintaining website consistency and accessibility.
Additionally, managing multiple users’ files in a shared environment without a complex login system presented a unique challenge: providing privacy and history tracking without the friction of account creation.
The Solution
I developed a secure, database-backed application using PHP 8 and Poppler-utils. The system uses custom XML-parsing logic to reconstruct the document flow.
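As a rough illustration of the XML-parsing step, the sketch below assumes the XML comes from running pdftohtml with its -xml option, which emits page, fontspec, and text elements carrying top, height, and font attributes. The function name extractLines and the sample document are hypothetical and are not taken from the project's source.

```php
<?php
declare(strict_types=1);

/**
 * Flatten pdftohtml-style XML into a list of positioned text lines.
 * Assumed input: <page number> elements containing <fontspec id size>
 * and <text top height font> elements.
 *
 * @return array<int, array{page:int, top:int, height:int, size:int, text:string}>
 */
function extractLines(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xml);
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        throw new InvalidArgumentException('Input is not well-formed XML.');
    }

    $fontSizes = []; // fontspec id => font size
    $lines = [];

    foreach ($doc->getElementsByTagName('page') as $page) {
        $pageNumber = (int) $page->getAttribute('number');

        // Font specs are collected across pages because later pages may reuse them.
        foreach ($page->getElementsByTagName('fontspec') as $spec) {
            $fontSizes[$spec->getAttribute('id')] = (int) $spec->getAttribute('size');
        }

        foreach ($page->getElementsByTagName('text') as $node) {
            // textContent drops nested <b>/<i> markup and keeps only the text.
            $text = trim($node->textContent);
            if ($text === '') {
                continue;
            }
            $lines[] = [
                'page'   => $pageNumber,
                'top'    => (int) $node->getAttribute('top'),
                'height' => (int) $node->getAttribute('height'),
                'size'   => $fontSizes[$node->getAttribute('font')] ?? 0,
                'text'   => $text,
            ];
        }
    }

    return $lines;
}

// Hypothetical sample input, shaped like pdftohtml -xml output.
$sampleXml = <<<'XML'
<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="24" family="Helvetica" color="#000000"/>
    <fontspec id="1" size="12" family="Helvetica" color="#000000"/>
    <text top="100" left="72" width="300" height="27" font="0"><b>Annual Report</b></text>
    <text top="150" left="72" width="400" height="14" font="1">First line of body text.</text>
  </page>
</pdf2xml>
XML;

$lines = extractLines($sampleXml);
```

Each resulting entry carries the vertical position and font size of one line, which is the information a later step would need to rebuild paragraphs and headings.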
Key features include:
- Intelligent Paragraph Detection: Instead of reproducing the PDF's absolute positioning, the app calculates vertical gaps between lines to group text into logical <p> tags.
- Header Identification: The logic compares font sizes to automatically assign <h1>, <h2>, and <h3> tags.
- Image Extraction: Images are pulled directly from the PDF, stored in a dedicated folder, and linked inline within the HTML.
- Clean Output: The final HTML is stripped of all internal CSS, inline styles, and non-breaking space artifacts, leaving only semantic markup.
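The paragraph, heading, and cleanup rules above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the input array shape, and the 0.6 gap threshold are assumptions, and it treats the most common font size as body text and the largest sizes above it as heading levels.

```php
<?php
declare(strict_types=1);

/** Replace non-breaking spaces, collapse whitespace, and escape for HTML. */
function cleanText(string $text): string
{
    $text = str_replace(["\u{00A0}", '&nbsp;'], ' ', $text);
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? '');
    return htmlspecialchars($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

/**
 * Build semantic HTML from positioned lines.
 * Each line: ['top' => int, 'height' => int, 'size' => int, 'text' => string].
 * $gapFactor is an assumed threshold: a gap larger than this fraction of
 * the line height starts a new paragraph.
 */
function linesToHtml(array $lines, float $gapFactor = 0.6): string
{
    if ($lines === []) {
        return '';
    }

    // Assumption: the most frequent font size is the body text size.
    $counts = array_count_values(array_column($lines, 'size'));
    arsort($counts);
    $bodySize = (int) array_key_first($counts);

    // Up to three larger sizes become h1..h3, largest first.
    $headingSizes = array_values(array_filter(
        array_keys($counts),
        fn (int $size): bool => $size > $bodySize
    ));
    rsort($headingSizes);
    $headingSizes = array_slice($headingSizes, 0, 3);

    $html = [];
    $buffer = [];       // lines of the paragraph being assembled
    $prevBottom = null; // bottom edge of the previous line

    foreach ($lines as $line) {
        $text = cleanText($line['text']);
        if ($text === '') {
            continue;
        }

        $level = array_search($line['size'], $headingSizes, true);
        $gap = $prevBottom === null ? 0 : $line['top'] - $prevBottom;
        $startsNewBlock = $level !== false || $gap > $line['height'] * $gapFactor;

        if ($startsNewBlock && $buffer !== []) {
            $html[] = '<p>' . implode(' ', $buffer) . '</p>';
            $buffer = [];
        }

        if ($level !== false) {
            $tag = 'h' . ($level + 1);
            $html[] = "<$tag>$text</$tag>";
        } else {
            $buffer[] = $text;
        }

        $prevBottom = $line['top'] + $line['height'];
    }

    if ($buffer !== []) {
        $html[] = '<p>' . implode(' ', $buffer) . '</p>';
    }

    return implode("\n", $html);
}

// Hypothetical sample: one heading line and two paragraphs.
$sampleLines = [
    ['top' => 100, 'height' => 27, 'size' => 24, 'text' => 'Annual Report'],
    ['top' => 150, 'height' => 14, 'size' => 12, 'text' => 'First line of'],
    ['top' => 166, 'height' => 14, 'size' => 12, 'text' => 'the paragraph.'],
    ['top' => 200, 'height' => 14, 'size' => 12, 'text' => "Second\u{00A0}paragraph."],
];

$result = linesToHtml($sampleLines);
```

Because the output is built only from heading and paragraph tags, no inline styles or positioning data can leak into it. A real document would likely need per-page handling and a tuned threshold.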
Security & Access Control
To handle the high volume of users without requiring accounts, I implemented a cookie-based identification system:
- A unique, anonymous 64-character token is generated for each browser.
- Users can only see and manage their own conversion history.
- All file paths are protected, ensuring no one can guess a record ID to access another person’s documents.
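One way to sketch this token scheme is shown below. It assumes the 64 characters are the hex encoding of 32 random bytes and that each record stores its owner's token. The function names, the owner_token field, and the in-memory array standing in for the database table are all hypothetical.

```php
<?php
declare(strict_types=1);

/** Generate a 64-character hex token from 32 cryptographically random bytes. */
function newVisitorToken(): string
{
    return bin2hex(random_bytes(32));
}

/** Accept only tokens of the expected shape before trusting a cookie value. */
function isValidToken(string $token): bool
{
    return preg_match('/\A[0-9a-f]{64}\z/', $token) === 1;
}

/**
 * Return a record only if it belongs to the given token.
 * $records stands in for a database table keyed by record ID.
 */
function findOwnedRecord(array $records, int $id, string $token): ?array
{
    $record = $records[$id] ?? null;
    if ($record === null || !hash_equals($record['owner_token'], $token)) {
        return null; // same answer for "missing" and "not yours"
    }
    return $record;
}

// In a web request the token could be stored in a cookie, for example
// (cookie name and lifetime are assumptions):
// setcookie('visitor_token', $token, [
//     'expires'  => time() + 60 * 60 * 24 * 365,
//     'path'     => '/',
//     'secure'   => true,
//     'httponly' => true,
//     'samesite' => 'Lax',
// ]);

$ownerToken = newVisitorToken();
$otherToken = newVisitorToken();
$records = [7 => ['owner_token' => $ownerToken, 'file' => 'report.pdf']];

$ownLookup   = findOwnedRecord($records, 7, $ownerToken); // the record
$otherLookup = findOwnedRecord($records, 7, $otherToken); // null
```

With a database, the same check would typically be a query that filters on both the record ID and the owner token, so that knowing an ID alone is never enough to fetch a row.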
Results
The converter has significantly streamlined the workflow for migrating print-heavy documents to the web. By automating the extraction and cleaning process, it reduces the time spent on manual “copy-pasting” and code cleanup from hours to seconds.
The application is now available as an open-source project on GitHub for others looking for a robust, server-side PDF conversion utility.
Responsibilities
- Project Management
- Web Design
- Backend Development
- Database Schema Design
Framework/Programming Language
- PHP 8.0+
- MySQL
- Bootstrap 5
- Poppler-utils (pdftohtml)
Tags
Case Studies custom Government
