
OCR Is Not Enough: Why Scanned PDFs Still Need Accessibility Remediation

By Richard Tamaro | March 22, 2026 | 11 min read

Somewhere in your organization, someone has said this: “We ran OCR on the scanned PDFs, so they should be good now.”

They're not.

OCR — optical character recognition — solves one problem. It turns an image of text into machine-readable characters. That's important. A scanned PDF without OCR is literally a picture. A screen reader encounters it and says nothing. There's no text to read, no structure to navigate, no content to interpret. It's a blank page to anyone who can't see it.

So OCR is a necessary first step. But it is only a first step. And the distance between “has OCR” and “is accessible” is much larger than most people realize.

What OCR Gives You

When OCR processes a scanned page, it identifies character shapes in the image and converts them to digital text. After OCR, the document has a searchable text layer — you can use Ctrl+F to find words, you can copy and paste passages, and a screen reader can detect that text exists on the page.

That sounds like it should be enough. It isn't, for three reasons:

OCR creates text. It doesn't create structure. After OCR, every word on the page exists as a flat stream of characters. But there's no indication of what's a heading, what's a paragraph, what's a list item, what's a table cell, and what's a footnote. A sighted person can look at the page and see the visual hierarchy — large bold text is a heading, indented items are a list, grid lines mean a table. A screen reader can't see any of that. Without tags, the entire page reads as a single, continuous block of text with no landmarks, no navigation points, and no semantic meaning.

OCR doesn't set reading order. A scanned two-column newsletter, after OCR, has text — but the reading order might alternate between columns, read a sidebar in the middle of a paragraph, or present a footer before the body text. OCR extracted the characters, but it didn't determine the logical sequence a human would follow. For a screen reader user, the content is scrambled.

OCR doesn't describe images. A scanned document that includes photographs, charts, graphs, maps, or diagrams gets OCR applied to the text portions — but the visual elements remain undescribed. A budget chart, a campus map, a photo in an annual report — none of these get alt text from OCR. To a non-visual user, they don't exist.
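The reading-order problem above is easy to reproduce. Here is a minimal Python sketch, assuming OCR output arrives as word boxes with page coordinates. The `Word` type, the sample words, and the "split at the page midline" rule are all illustrative assumptions, not the behavior of any particular OCR engine. A naive top-to-bottom sort reads straight across both columns, while grouping by column first gives the sequence a person would follow.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge, in points from the left of the page
    y: float  # top edge, in points from the top of the page

# Hypothetical OCR output for a two-column page that is 612 pt wide.
words = [
    Word("Budget", 50, 100), Word("Parking", 330, 100),
    Word("approved", 50, 120), Word("lot", 330, 120),
    Word("Tuesday.", 50, 140), Word("closed.", 330, 140),
]

def naive_order(words):
    """Sort by row, then left to right: this reads across both columns."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]

def column_order(words, page_width=612.0):
    """Assume exactly two columns split at the page midline (a simplification)."""
    mid = page_width / 2
    left = sorted((w for w in words if w.x < mid), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= mid), key=lambda w: (w.y, w.x))
    return [w.text for w in left + right]

naive = " ".join(naive_order(words))
fixed = " ".join(column_order(words))
print(naive)  # Budget Parking approved lot Tuesday. closed.
print(fixed)  # Budget approved Tuesday. Parking lot closed.
```

Real pages are harder than this: column counts vary, sidebars interrupt the flow, and headlines span columns, which is why a fixed midline rule is only a starting point.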

What Accessible Actually Means for a Scanned PDF

A scanned PDF that meets WCAG 2.1 AA and PDF/UA standards needs everything OCR provides plus everything OCR doesn't:

What's Needed	OCR Provides It?
Machine-readable text layer	Yes
Tagged PDF structure (headings, paragraphs, lists)	No
Correct reading order	No
Alt text for images, charts, and graphs	No
Table header and data cell relationships	No
Bookmark navigation	No
Document language declaration	No
Document title in metadata	No
Artifact tagging (headers, footers, page numbers)	No
PDF/UA compliance validation	No

OCR handles one row out of ten. The other nine are the actual accessibility work.

Why Scanned Documents Are the Hardest to Remediate

Not all PDFs are created equal. Born-digital documents — files created in Word, InDesign, or PowerPoint and exported to PDF — start with some structure. If the author used heading styles and alt text in the source file, that information may carry over into the PDF. The remediation job is to verify and correct what's there.

Scanned documents start with nothing. Every element of accessibility must be built from scratch:

The text must be OCR'd (and OCR accuracy varies — older scans, faded type, handwritten annotations, unusual fonts, and low-resolution images all degrade OCR quality)
The structure must be inferred from visual layout cues — font size, bold weight, indentation, spacing, grid lines
Multi-column layouts must be detected and the reading order determined
Tables must be identified, and their header/data cell relationships must be established
Images must be identified and described
Page artifacts (headers, footers, page numbers, watermarks) must be separated from real content
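As one illustration of inferring structure from layout cues, the sketch below guesses heading levels from font size alone. It assumes each OCR'd line comes with an estimated font size; the sample lines, the `infer_tags` helper, and the 15% threshold are hypothetical choices made for the example, and a real pipeline would weigh bold weight, spacing, and indentation as well.

```python
from statistics import median

# Hypothetical OCR lines with estimated font sizes in points.
lines = [
    {"text": "Facilities Report", "size": 18.0},
    {"text": "The roof was inspected in May.", "size": 11.0},
    {"text": "Findings", "size": 14.0},
    {"text": "No leaks were found.", "size": 11.0},
    {"text": "Repairs are not required.", "size": 11.0},
]

def infer_tags(lines, ratio=1.15):
    """Tag lines noticeably larger than the body size as headings.

    The median size stands in for body text. Each distinct larger size
    becomes one heading level, largest first (H1, H2, ...), capped at H6.
    """
    body = median(line["size"] for line in lines)
    heading_sizes = sorted(
        {line["size"] for line in lines if line["size"] > body * ratio},
        reverse=True,
    )
    tagged = []
    for line in lines:
        if line["size"] in heading_sizes:
            level = min(heading_sizes.index(line["size"]) + 1, 6)
            tagged.append((f"H{level}", line["text"]))
        else:
            tagged.append(("P", line["text"]))
    return tagged

tags = infer_tags(lines)
for tag, text in tags:
    print(tag, text)
```

On a clean scan this kind of rule works reasonably well. On a skewed or faded scan the size estimates drift, which is one reason scanned documents need review rather than a single automated pass.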

Why CASO's Background Matters Here

Most PDF accessibility vendors come from a web accessibility or software background. They understand tags, WCAG criteria, and screen reader behavior. That's important, but it's only half the story for scanned documents.

CASO comes from a document management and scanning background — 30+ years of large-scale digitization, image processing, OCR, and records conversion for government agencies, medical centers, and enterprises. That means the team understands both sides:

The Document Image Side

How scanning resolution affects OCR accuracy. How skewed pages, bleed-through, and photocopy artifacts make OCR report characters that are not really there. How mixed-quality source documents from different eras and different copiers produce inconsistent results. How to distinguish real content from scanner noise.

The Accessibility Structure Side

How to build a correct tag tree from visual layout cues. How to set reading order in complex multi-column pages. How to identify and tag tables when the grid lines are faint or inconsistent. How to generate meaningful alt text for images that were never described in the original paper document.

This dual expertise is CASO's strongest differentiator for scanned document remediation.

The Types of Scanned Documents You Probably Have

If your organization has been digitizing paper for any length of time, you have some combination of:

Scanned backfiles — years of paper records converted to PDF during a digitization project
Legacy meeting packets — board agendas, council minutes, committee reports compiled from photocopied pages
Old forms — paper forms scanned to PDF, where fields are images of blank lines
Records converted from paper — HR files, legal documents, policy manuals, engineering drawings
Image-only PDFs on public websites — documents posted years ago with zero text
Mixed-quality archives — document sets with clean digital exports mixed with scanned inserts

What CASO Comply Does with Scanned Documents

CASO Comply's remediation pipeline handles scanned documents end-to-end:

OCR with High Accuracy

Up to 99.5% accuracy on standard documents. Flags files where source quality prevents reliable OCR.
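To show what flagging unreliable OCR can look like in general, here is a small Python sketch. It is not CASO Comply's implementation. It assumes the OCR engine reports a confidence score between 0 and 1 for each recognized word, and the `flag_pages` helper and its thresholds are hypothetical values chosen for illustration.

```python
from statistics import mean

def flag_pages(page_confidences, min_mean=0.90, low=0.60, max_low_share=0.10):
    """Return (page number, reason) for pages whose OCR looks unreliable.

    page_confidences maps a page number to its per-word confidence scores.
    A page is flagged if no words were recognized, if the mean confidence
    is below min_mean, or if too many words score below `low`.
    """
    flagged = []
    for page_no, confs in sorted(page_confidences.items()):
        if not confs:
            flagged.append((page_no, "no text recognized"))
            continue
        low_share = sum(c < low for c in confs) / len(confs)
        if mean(confs) < min_mean or low_share > max_low_share:
            flagged.append((page_no, "low OCR confidence"))
    return flagged

# Made-up scores: a clean page, a faded page, and a page with no text at all.
pages = {
    1: [0.99, 0.98, 0.97, 0.99],
    2: [0.95, 0.41, 0.52, 0.88],
    3: [],
}
flagged = flag_pages(pages)
print(flagged)  # [(2, 'low OCR confidence'), (3, 'no text recognized')]
```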

AI-Powered Structure Detection

After OCR, the AI engine analyzes visual layout to identify headings, paragraphs, lists, tables, images, and page artifacts. Builds a complete tag tree.
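Page artifacts are a good example of what layout analysis has to separate out. The sketch below uses a simple, generic heuristic and is not a description of the engine above: a line that appears as the first or last line on most pages is probably a running header or footer. The `find_running_lines` and `split_artifacts` helpers, the 80% threshold, and the sample pages are assumptions made for the example.

```python
import re
from collections import Counter

def normalize(line):
    """Collapse digits so 'Page 1' and 'Page 2' count as the same line."""
    return re.sub(r"\d+", "#", line.strip())

def find_running_lines(pages, min_share=0.8):
    """Find normalized first/last lines that repeat on most pages."""
    counts = Counter()
    for lines in pages:
        if lines:
            # A set, so a one-line page is not counted twice.
            counts.update({normalize(lines[0]), normalize(lines[-1])})
    needed = min_share * len(pages)
    return {text for text, n in counts.items() if n >= needed}

def split_artifacts(pages, min_share=0.8):
    """Label each line 'Artifact' or 'Content'; only edge lines can be artifacts."""
    running = find_running_lines(pages, min_share)
    labeled = []
    for lines in pages:
        last = len(lines) - 1
        labeled.append([
            ("Artifact" if i in (0, last) and normalize(line) in running
             else "Content", line)
            for i, line in enumerate(lines)
        ])
    return labeled

# Made-up OCR text for a three-page meeting packet.
pages = [
    ["City Council Minutes", "Call to order at 7 pm.", "Page 1"],
    ["City Council Minutes", "Motion carried 5 to 2.", "Page 2"],
    ["City Council Minutes", "Meeting adjourned.", "Page 3"],
]
running = find_running_lines(pages)
labeled = split_artifacts(pages)
```

Marking these lines as artifacts keeps a screen reader from announcing the same header and page number on every page.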

Reading Order Correction

Determines the logical reading sequence for multi-column layouts, sidebars, and complex page compositions.

Alt Text Generation

Level 1: basic descriptions. Level 2: contextual, document-aware descriptions.

Table Remediation

Proper TH/TD tagging with scope attributes. Complex tables with merged cells get advanced tagging.
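To see why header relationships matter, consider this minimal Python sketch of a simple table. It assumes the first row holds column headers and the first column holds row headers, which is the easy case; merged cells need more than this. The `tag_table` and `announce` helpers and the budget figures are made up for illustration, while `Table`, `TR`, `TH`, `TD`, and the `Row`/`Column` scope values follow PDF's standard structure vocabulary.

```python
def tag_table(grid):
    """Tag a simple grid: first row = column headers, first column = row headers."""
    rows = []
    for r, row in enumerate(grid):
        cells = []
        for c, text in enumerate(row):
            if r == 0:
                cells.append({"tag": "TH", "scope": "Column", "text": text})
            elif c == 0:
                cells.append({"tag": "TH", "scope": "Row", "text": text})
            else:
                cells.append({"tag": "TD", "text": text})
        rows.append({"tag": "TR", "cells": cells})
    return {"tag": "Table", "rows": rows}

def announce(table, r, c):
    """What assistive technology can say for a data cell once headers are linked."""
    column_header = table["rows"][0]["cells"][c]["text"]
    row_header = table["rows"][r]["cells"][0]["text"]
    value = table["rows"][r]["cells"][c]["text"]
    return f"{row_header}, {column_header}: {value}"

# Made-up budget table as OCR might return it: rows of plain cell text.
grid = [
    ["Department", "2024", "2025"],
    ["Parks", "$1.2M", "$1.4M"],
    ["Library", "$0.9M", "$1.0M"],
]
table = tag_table(grid)
print(announce(table, 2, 1))  # Library, 2024: $0.9M
```

Without the header tags, the same cell is just "$0.9M" with nothing to say which department or year it belongs to.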

Metadata and Bookmarks

Document title, language, PDF/UA identifier. Bookmarks from heading structure.
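Deriving bookmarks from headings is mostly a nesting problem. The Python sketch below, a generic illustration rather than the product's code, turns a flat list of heading levels and titles into a nested outline using a stack. The `build_outline` helper and the sample agenda are hypothetical.

```python
def build_outline(headings):
    """Nest (level, title) pairs into a bookmark tree.

    Each heading becomes a child of the nearest earlier heading
    with a smaller level number.
    """
    root = {"title": None, "level": 0, "children": []}
    stack = [root]
    for level, title in headings:
        node = {"title": title, "level": level, "children": []}
        # Close every open heading at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

def show(nodes, depth=0):
    """Render the outline as indented lines."""
    out = []
    for node in nodes:
        out.append("  " * depth + node["title"])
        out.extend(show(node["children"], depth + 1))
    return out

# Made-up headings from a tagged meeting packet.
headings = [
    (1, "Agenda"),
    (2, "Call to Order"),
    (2, "Old Business"),
    (3, "Budget Amendment"),
    (1, "Minutes"),
]
outline = build_outline(headings)
print("\n".join(show(outline)))
```

The outline is only as good as the heading tags it comes from, so bookmark generation belongs after structure detection, not before.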

Validation

Every document validated against PDF/UA (ISO 14289). Compliance validation report included.

For Level 3 (Full Remediation): expert human review with guaranteed 100% compliance and a Certificate of Compliance.

The Bottom Line on OCR and Accessibility

OCR is the beginning, not the end. It turns a picture into text. But accessible means more than “contains text.” Accessible means a person using a screen reader, a braille display, or keyboard navigation can read, understand, navigate, and act on the document — with the same level of independence and comprehension as a sighted user.

For scanned documents, closing that gap requires OCR plus structure, order, semantics, descriptions, metadata, and validation. It's more work than born-digital PDFs. It's also more important — because scanned documents are often the oldest, most numerous, and most neglected files in an organization's library.

CASO was built for exactly that backlog. Start with the files that matter most, and we'll show you what accessible actually looks like.

Ready to tackle your scanned document backlog?

Submit a sample scanned PDF and see what fully accessible output looks like — or get a free compliance scan of your entire website.

