PDF to Markdown limitations: OCR, tables, images, and layouts

May 24, 2026

PDF to Markdown conversion is useful, but it is not magic. A PDF is a layout document: it describes where text, images, shapes, and other objects appear on a page. Markdown is a structured text format: it describes headings, paragraphs, lists, links, code blocks, and simple tables.

That mismatch explains most conversion limits. The converter is not only extracting text. It is also making decisions about reading order, headings, list structure, table boundaries, and what to ignore.

Last reviewed: June 1, 2026. Regular Mode is best for selectable text and runs locally in the browser. Advanced OCR is available for scanned, image-only, and complex PDFs, with results retained for 24 hours.

Why output quality varies

Two PDFs can look identical but contain very different internal data. One may have clean text, paragraphs, and a logical reading order. Another may contain individually positioned letters, scanned page images, invisible text layers, or layout objects that do not map to Markdown.

The most common reason for imperfect output is that the PDF does not store semantic structure. It may know that a word appears at a certain coordinate, but not that the word is part of a second-level heading, the first cell in a table, or the continuation of a paragraph from the previous page.

Scanned PDFs need OCR

If a PDF page is just an image, Regular Mode has no text layer to extract. You may see words on the page, but the browser cannot copy them as text because they are pixels, not characters.

Use Advanced OCR for:

  • Scanned contracts, books, letters, and forms.
  • Photos of printed pages saved as PDF.
  • Image-only exports from scanning apps.
  • PDFs where text selection highlights the whole page instead of individual words.

OCR can recognize text, but it can still make mistakes. Low-resolution scans, skewed pages, handwriting, stamps, shadows, and unusual fonts can reduce accuracy. Always check important names, numbers, dates, and legal or financial terms against the original.
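
One rough way to spot pages that need OCR is to look at how much text a normal extraction returned for each page. The sketch below assumes you already have one extracted string per page from some text extractor; the function name, the `page_texts` input, and the 20-character threshold are illustrative assumptions, not part of any converter's API.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is too short to trust.

    page_texts: one extracted string per page (assumed input).
    min_chars: arbitrary cutoff chosen for illustration; tune it per document.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters so padding does not look like text.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged
```

A genuinely blank page is flagged too, so treat the result as a list of pages to inspect rather than a verdict.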

Tables are often the hardest part

Markdown supports simple tables. PDFs often contain tables that are visually rich but structurally ambiguous. A clean Markdown table usually needs clear rows, clear columns, one header row, and no merged cells.

Expect manual cleanup when the PDF includes:

  • Merged cells.
  • Multi-level table headers.
  • Financial statements with indentation and subtotal rows.
  • Tables split across pages.
  • Footnotes inside cells.
  • Rotated column labels.
  • Cells that contain lists or multiple paragraphs.

Regular Mode can preserve simple table-like structures when possible. Advanced OCR can handle some tables better and inline recognized tables into Markdown, but complex tables may still need manual rebuilding. When a table is business-critical, compare it row by row with the original PDF.
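
To make the "clear rows, clear columns, one header row" requirement concrete, here is a minimal sketch that renders a rectangular grid as a Markdown pipe table and refuses anything ragged. The function name and the list-of-lists input are assumptions made for illustration.

```python
def to_markdown_table(rows):
    """Render rows (first row is the header) as a Markdown pipe table.

    Raises ValueError when the grid is not rectangular, which is often how
    merged or missing cells show up after extraction.
    """
    if len(rows) < 2:
        raise ValueError("need a header row and at least one data row")
    width = len(rows[0])
    for index, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(f"row {index} has {len(row)} cells, expected {width}")

    def fmt(row):
        # Escape pipes and flatten line breaks so each cell stays on one line.
        cells = [str(c).replace("|", "\\|").replace("\n", " ").strip() for c in row]
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in rows[1:])
    return "\n".join(lines)
```

Failing loudly on a ragged grid is deliberate: a table that cannot be expressed as a simple grid is better rebuilt by hand than silently padded.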

Images, charts, and diagrams are separate from Markdown text

Regular Mode focuses on text extraction and does not extract images, charts, or diagrams as real image files. If the PDF contains a chart, the surrounding caption may convert, but the chart itself will not become a usable image asset in Regular Mode.

Advanced OCR can include extracted image assets when the OCR provider returns them. When the result contains images, downloading a ZIP with Markdown plus separate image files is usually easier to manage than embedding base64 images directly in one Markdown file.
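
Base64 encoding turns every 3 bytes into 4 characters, so embedded images make the Markdown file about a third larger than the images themselves and harder to read or diff. A minimal sketch of the ZIP alternative follows; the file name `document.md`, the `images/` folder, and the function name are assumptions for illustration, not the layout any particular tool produces.

```python
import io
import zipfile


def bundle_markdown(markdown, images):
    """Return ZIP bytes holding document.md plus image files under images/.

    markdown: the converted text, expected to reference images as images/<name>.
    images: mapping of file name to raw image bytes (assumed input shape).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("document.md", markdown)
        for name, data in images.items():
            archive.writestr(f"images/{name}", data)
    return buffer.getvalue()
```

The Markdown then stays short and readable, for example `![Revenue chart](images/chart.png)`, and each image can be replaced or compressed on its own.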

Even when images are preserved, they need context. A good Markdown result should include nearby captions, figure references, and alt text if you plan to publish it.

Multi-column layouts can break reading order

PDF text order can differ from visual order. A two-column page might store its text sorted by horizontal position, in the order the objects were created, or as fragments whose order depends on the original design tool. The extracted text can then jump from the left column to the right column partway through a paragraph.

Watch for this in:

  • Academic papers.
  • Brochures.
  • Newsletters.
  • Product sheets.
  • Forms with labels and values.
  • Reports with sidebars or callout boxes.

If the content is important, review the converted Markdown section by section. Sometimes the fastest fix is to move blocks manually after conversion rather than trying to force the PDF into a perfect extraction.
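
For a plain two-column page, the reordering can be sketched as a sort: assign each text block to a column, then read each column top to bottom. The `Block` type, its fields, and the midline split are assumptions for illustration. The sketch also assumes `y` grows downward from the top of the page, whereas raw PDF coordinates usually start at the bottom-left corner.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x: float   # left edge of the block
    y: float   # top edge, measured downward from the top of the page (assumed)
    text: str


def _top_to_bottom(block):
    # Sort by vertical position first, then left to right for ties.
    return (block.y, block.x)


def two_column_order(blocks, page_width):
    """Order blocks left column first, then right column, each top to bottom."""
    midline = page_width / 2
    left = [b for b in blocks if b.x < midline]
    right = [b for b in blocks if b.x >= midline]
    return sorted(left, key=_top_to_bottom) + sorted(right, key=_top_to_bottom)
```

This simple split treats full-width headings, sidebars, and callout boxes as part of the left or right column, which is one reason manual block moves are sometimes faster.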

Headers, footers, and page numbers may appear in the text

PDFs often repeat page numbers, document titles, confidentiality labels, dates, or company names on every page. A converter may include those because they are real text objects in the PDF.

After conversion, scan for repeated lines. Remove them when they interrupt the main reading flow. Keep them only when they carry meaningful information, such as version labels, section names, or required legal notices.
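
Scanning for repeated lines can be partly automated when you have the converted text split by page. The sketch below is one possible approach under that assumption; the function names and the 60 percent threshold are illustrative choices.

```python
from collections import Counter


def repeated_lines(pages, min_share=0.6):
    """Return lines that appear on at least min_share of the pages (minimum 2)."""
    counts = Counter()
    for page in pages:
        # Use a set so a line counts once per page, however often it repeats there.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, min_share * len(pages))
    return {line for line, seen in counts.items() if seen >= threshold}


def strip_repeated(pages, min_share=0.6):
    """Return the pages with the repeated lines removed."""
    repeated = repeated_lines(pages, min_share)
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned
```

Review the output of `repeated_lines` before stripping anything, because a required legal notice repeats just like a running header. Page numbers change on every page, so exact matching will not catch them.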

Formulas and special symbols require review

Mathematical formulas, chemical notation, legal symbols, currency symbols, and technical marks can be difficult to preserve. The PDF may store them as special fonts, vector shapes, or positioned characters rather than normal Unicode text.

If the output will be used for engineering, academic, legal, or financial work, verify symbols manually. For formulas, you may need to rewrite them in LaTeX or another notation supported by your Markdown renderer.
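
As a made-up illustration (not taken from any particular PDF), an extraction might flatten the quadratic formula into something like `x = -b ± √b2 - 4ac 2a`, losing the fraction bar, the extent of the root, and the superscript. Rewritten by hand in LaTeX, assuming your Markdown renderer supports math, it becomes:

```latex
x = \frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}
```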

Links, footnotes, and citations need cleanup

PDF links may convert as visible text without a clean Markdown link target. Footnotes can appear in the middle of a paragraph or at the end of a page. Citations may lose spacing or punctuation when lines are joined.

Useful cleanup steps include:

  • Rebuild important links as normal Markdown links.
  • Move footnotes to the end of the relevant section.
  • Normalize citation spacing.
  • Remove broken line wraps in long URLs.
  • Keep reference lists as plain text when exact formatting is not required.
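
The first cleanup step can be partly scripted. The sketch below wraps bare URLs in angle brackets, which CommonMark treats as autolinks, and leaves URLs that already sit inside a Markdown link or autolink alone. The function name and the regular expression are assumptions for illustration, and the pattern is deliberately conservative: it will not rejoin a URL that was broken across lines.

```python
import re

# A bare http(s) URL that is not already preceded by "<", "(" or "[".
BARE_URL = re.compile(r"(?<![<(\[])\bhttps?://[^\s<>()\[\]]+")


def autolink_bare_urls(text):
    """Wrap bare URLs as <url> autolinks, keeping trailing punctuation outside."""

    def wrap(match):
        url = match.group(0)
        trailing = ""
        # Sentence punctuation at the end is almost never part of the URL.
        while url and url[-1] in ".,;:!?":
            trailing = url[-1] + trailing
            url = url[:-1]
        return f"<{url}>{trailing}"

    return BARE_URL.sub(wrap, text)
```

Links that need descriptive text, such as `[Pricing](https://example.com/pricing)`, still have to be written by hand.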

Password-protected and restricted PDFs

Some PDFs prevent copying, printing, or extraction. Others require a password before they can be opened. A converter can only work with files it is allowed to read. If a PDF is encrypted, restricted, or corrupted, conversion may fail or produce incomplete output.

Only process documents you have permission to handle. For sensitive files, prefer Regular Mode when the PDF is text-based because it runs locally in the browser and does not upload the file.

File size and page limits still apply

Conversion quality is not the only constraint. Large PDFs use more memory and take longer to parse. Batch jobs need queueing so that one oversized file does not destabilize the browser. Advanced OCR also has account, file, and credit rules because server-side recognition has a processing cost.

When a document is large, test a representative file or a smaller section first. If the output is messy on a sample, running the whole document will usually multiply the cleanup work.

How to decide whether the result is good enough

Use the intended purpose to decide how much review is necessary:

  • Personal notes: quick cleanup is usually enough.
  • Internal drafts: check headings, reading order, and important tables.
  • Public documentation: edit for structure, style, links, and accessibility.
  • Research or compliance work: compare the Markdown against the PDF carefully.
  • Legal, medical, or financial content: treat the Markdown as a draft and verify every critical detail.

The original PDF remains the source of truth whenever accuracy matters.

Practical triage workflow

When a conversion looks imperfect, do this before starting a full manual rewrite:

  1. Check whether the PDF has selectable text.
  2. If it is scanned, switch to Advanced OCR.
  3. If the text is selectable but out of order, inspect columns, sidebars, and page furniture.
  4. Remove repeated headers and footers.
  5. Rebuild only the tables that matter.
  6. Compare critical facts against the original.
  7. Decide whether the Markdown is good enough for the workflow or whether the PDF should remain the primary document.

This keeps cleanup proportional to the value of the document.

Summary

PDF to Markdown works best when the PDF contains clean selectable text and a simple reading order. OCR, tables, images, formulas, multi-column layouts, and repeated page elements add complexity because they require interpretation, not just extraction. Use Regular Mode for text-based PDFs, Advanced OCR for scanned and complex files, and manual review for any content where accuracy matters.
