OCR Is Not Enough: Why Scanned PDFs Still Need Accessibility Remediation

By Richard Tamaro · 11 min read

Somewhere in your organization, someone has said this: “We ran OCR on the scanned PDFs, so they should be good now.”

They're not.

OCR — optical character recognition — solves one problem. It turns an image of text into machine-readable characters. That's important. A scanned PDF without OCR is literally a picture. A screen reader encounters it and says nothing. There's no text to read, no structure to navigate, no content to interpret. It's a blank page to anyone who can't see it.

So OCR is a necessary first step. But it is only a first step. And the distance between “has OCR” and “is accessible” is much larger than most people realize.


What OCR Gives You

When OCR processes a scanned page, it identifies character shapes in the image and converts them to digital text. After OCR, the document has a searchable text layer: you can use Ctrl+F to find words, you can copy and paste passages, and a screen reader has actual text to read aloud.

That sounds like it should be enough. It isn't, for three reasons:

OCR creates text. It doesn't create structure. After OCR, every word on the page exists as a flat stream of characters. But there's no indication of what's a heading, what's a paragraph, what's a list item, what's a table cell, and what's a footnote. A sighted person can look at the page and see the visual hierarchy — large bold text is a heading, indented items are a list, grid lines mean a table. A screen reader can't see any of that. Without tags, the entire page reads as a single, continuous block of text with no landmarks, no navigation points, and no semantic meaning.

OCR doesn't set reading order. A scanned two-column newsletter, after OCR, has text — but the reading order might alternate between columns, read a sidebar in the middle of a paragraph, or present a footer before the body text. OCR extracted the characters, but it didn't determine the logical sequence a human would follow. For a screen reader user, the content is scrambled.
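To make the scrambling concrete, here is a minimal sketch in Python. It assumes OCR output is available as words with page coordinates; the `Word` class, the coordinates, and the fixed mid-page column split are all hypothetical simplifications, not how any particular OCR engine or remediation tool works.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left edge, in points from the left side of the page
    y: float  # top edge, in points from the top of the page

def naive_order(words):
    """Sort top-to-bottom, then left-to-right. On a two-column page
    this jumps back and forth between the columns on every line."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]

def column_order(words, page_width, gutter_x=None):
    """Read the whole left column first, then the whole right column.
    Assumes exactly two columns split at gutter_x (default: mid-page)."""
    split = gutter_x if gutter_x is not None else page_width / 2
    left = sorted((w for w in words if w.x < split), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= split), key=lambda w: (w.y, w.x))
    return [w.text for w in left + right]

# Two short columns on a US Letter page (612 points wide).
words = [
    Word("Budget", 50, 100), Word("Parking", 320, 100),
    Word("approved", 50, 120), Word("closed", 320, 120),
]

print(naive_order(words))                   # ['Budget', 'Parking', 'approved', 'closed']
print(column_order(words, page_width=612))  # ['Budget', 'approved', 'Parking', 'closed']
```

Real pages need more than a fixed split (sidebars, pull quotes, and full-width headlines all break it), which is why reading order is a separate remediation step rather than a side effect of OCR.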

OCR doesn't describe images. A scanned document that includes photographs, charts, graphs, maps, or diagrams gets OCR applied to the text portions — but the visual elements remain undescribed. A budget chart, a campus map, a photo in an annual report — none of these get alt text from OCR. To a non-visual user, they don't exist.


What Accessible Actually Means for a Scanned PDF

A scanned PDF that meets WCAG 2.1 AA and PDF/UA standards needs everything OCR provides plus everything OCR doesn't:

| What's Needed | OCR Provides It? |
| --- | --- |
| Machine-readable text layer | Yes |
| Tagged PDF structure (headings, paragraphs, lists) | No |
| Correct reading order | No |
| Alt text for images, charts, and graphs | No |
| Table header and data cell relationships | No |
| Bookmark navigation | No |
| Document language declaration | No |
| Document title in metadata | No |
| Artifact tagging (headers, footers, page numbers) | No |
| PDF/UA compliance validation | No |

OCR handles one row out of ten. The other nine are the actual accessibility work.
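The same checklist can be written down as data, which is a handy way to track a batch of files. This is only an illustrative sketch: the requirement names are made-up labels for the ten rows above, and the code does not inspect a real PDF.

```python
# Hypothetical labels for the ten requirements listed in the table.
REQUIREMENTS = [
    "text_layer", "tag_structure", "reading_order", "alt_text",
    "table_headers", "bookmarks", "language", "title",
    "artifacts_tagged", "pdfua_validated",
]

def missing_requirements(present):
    """Return the requirements that are not yet satisfied, in table order."""
    return [r for r in REQUIREMENTS if r not in present]

# A scanned PDF that has only been through OCR.
ocr_only = {"text_layer"}
gaps = missing_requirements(ocr_only)
print(f"{len(gaps)} of {len(REQUIREMENTS)} requirements still open")
# 9 of 10 requirements still open
```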


Why Scanned Documents Are the Hardest to Remediate

Not all PDFs are created equal. Born-digital documents — files created in Word, InDesign, or PowerPoint and exported to PDF — start with some structure. If the author used heading styles and alt text in the source file, that information may carry over into the PDF. The remediation job is to verify and correct what's there.

Scanned documents start with nothing. Every element of accessibility must be built from scratch:

  • The text must be OCR'd (and OCR accuracy varies — older scans, faded type, handwritten annotations, unusual fonts, and low-resolution images all degrade OCR quality)
  • The structure must be inferred from visual layout cues — font size, bold weight, indentation, spacing, grid lines
  • Multi-column layouts must be detected and the reading order determined
  • Tables must be identified, and their header/data cell relationships must be established
  • Images must be identified and described
  • Page artifacts (headers, footers, page numbers, watermarks) must be separated from real content
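Inferring structure from layout cues can be sketched as a set of simple rules. The example below is a deliberately naive illustration, not CASO's method: the `Line` class, the thresholds, and the body font size are assumptions, and the returned strings are standard PDF structure types (H1, H2, LI, P).

```python
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    font_size: float  # in points
    bold: bool
    indent: float     # distance from the left margin, in points

BULLETS = ("•", "-", "*")

def guess_role(line, body_size=11.0):
    """Guess a PDF structure type from visual cues alone.
    Thresholds are illustrative; real documents need per-file tuning."""
    stripped = line.text.lstrip()
    if stripped.startswith(BULLETS) and line.indent > 0:
        return "LI"   # indented line that starts with a bullet glyph
    if line.font_size >= body_size * 1.5:
        return "H1"   # much larger than body text
    if line.bold and line.font_size > body_size:
        return "H2"   # bold and slightly larger than body text
    return "P"        # everything else is treated as a paragraph

page = [
    Line("Annual Report", 18, True, 0),
    Line("Revenue", 13, True, 0),
    Line("• Property tax", 11, False, 18),
    Line("The city collected more than projected.", 11, False, 0),
]

print([guess_role(line) for line in page])  # ['H1', 'H2', 'LI', 'P']
```

Rules like these fail quickly on real scans (a bold table caption is not a heading, and a faded heading may not register as bold), which is part of why scanned documents take more effort than born-digital ones.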

Why CASO's Background Matters Here

Most PDF accessibility vendors come from a web accessibility or software background. They understand tags, WCAG criteria, and screen reader behavior. That's important, but it's only half the story for scanned documents.

CASO comes from a document management and scanning background — 30+ years of large-scale digitization, image processing, OCR, and records conversion for government agencies, medical centers, and enterprises. That means the team understands both sides:

The Document Image Side

How scanning resolution affects OCR accuracy. How skewed pages, bleed-through, and photocopy artifacts create false positives (text or structure detected where there is none). How mixed-quality source documents from different eras and different copiers produce inconsistent results. How to distinguish real content from scanner noise.

The Accessibility Structure Side

How to build a correct tag tree from visual layout cues. How to set reading order in complex multi-column pages. How to identify and tag tables when the grid lines are faint or inconsistent. How to generate meaningful alt text for images that were never described in the original paper document.

This dual expertise is CASO's strongest differentiator for scanned document remediation.


The Types of Scanned Documents You Probably Have

If your organization has been digitizing paper for any length of time, you have some combination of:

  • Scanned backfiles — years of paper records converted to PDF during a digitization project
  • Legacy meeting packets — board agendas, council minutes, committee reports compiled from photocopied pages
  • Old forms — paper forms scanned to PDF, where fields are images of blank lines
  • Records converted from paper — HR files, legal documents, policy manuals, engineering drawings
  • Image-only PDFs on public websites — documents posted years ago with zero text
  • Mixed-quality archives — document sets with clean digital exports mixed with scanned inserts

What CASO Comply Does with Scanned Documents

CASO Comply's remediation pipeline handles scanned documents end-to-end:

1. OCR with High Accuracy

Up to 99.5% accuracy on standard documents. Flags files where source quality prevents reliable OCR.

2. AI-Powered Structure Detection

After OCR, the AI engine analyzes visual layout to identify headings, paragraphs, lists, tables, images, and page artifacts. Builds a complete tag tree.

3. Reading Order Correction

Determines the logical reading sequence for multi-column layouts, sidebars, and complex page compositions.

4. Alt Text Generation

Level 1: basic descriptions. Level 2: contextual, document-aware descriptions.

5. Table Remediation

Proper TH/TD tagging with scope attributes. Complex tables with merged cells get advanced tagging.

6. Metadata and Bookmarks

Document title, language, PDF/UA identifier. Bookmarks from heading structure.

7. Validation

Every document validated against PDF/UA (ISO 14289). Compliance validation report included.
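As one illustration of the steps above, deriving bookmarks from heading structure amounts to turning a flat list of headings into a nested outline. The sketch below is a generic stack-based approach under assumed inputs (level, title, page), not a description of CASO Comply's internals; `build_outline` and `print_outline` are hypothetical names.

```python
def build_outline(headings):
    """Nest a flat list of (level, title, page) tuples, given in
    reading order, into a bookmark tree. Level 1 is the top level."""
    root = {"title": None, "level": 0, "children": []}
    stack = [root]
    for level, title, page in headings:
        node = {"title": title, "level": level, "page": page, "children": []}
        # Climb back up until the top of the stack is a shallower heading.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

def print_outline(nodes, depth=0):
    """Print the bookmark tree with two spaces of indent per level."""
    for node in nodes:
        print("  " * depth + f"{node['title']} (p. {node['page']})")
        print_outline(node["children"], depth + 1)

headings = [
    (1, "Budget Overview", 1),
    (2, "Revenue", 2),
    (2, "Expenses", 4),
    (1, "Appendix", 9),
]

outline = build_outline(headings)
print_outline(outline)
```

Because the outline is built directly from the heading tags, it is only as good as the structure detection that produced them: a missed or mislabeled heading shows up as a missing or misplaced bookmark.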

For Level 3 (Full Remediation): expert human review with guaranteed 100% compliance and Certificate of Compliance.


The Bottom Line on OCR and Accessibility

OCR is the beginning, not the end. It turns a picture into text. But accessible means more than “contains text.” Accessible means a person using a screen reader, a braille display, or keyboard navigation can read, understand, navigate, and act on the document — with the same level of independence and comprehension as a sighted user.

For scanned documents, closing that gap requires OCR plus structure, order, semantics, descriptions, metadata, and validation. It's more work than born-digital PDFs. It's also more important — because scanned documents are often the oldest, most numerous, and most neglected files in an organization's library.

CASO was built for exactly that backlog. Start with the files that matter most, and we'll show you what accessible actually looks like.
