feat: extract PDF metadata (title, author, dates) into markdown output by VANDRANKI · Pull Request #1787 · microsoft/markitdown

VANDRANKI · 2026-04-16T14:21:54Z

Summary

Closes #1664.

PDF files often carry useful metadata - title, author, subject, keywords, creator, producer, and creation/modification dates - but the current PdfConverter silently discards it. This PR surfaces that information as a Document Properties section at the top of the converted markdown.

What changed

Added a helper to convert PDF date strings to a readable format
Added a helper to format the pdfplumber metadata dict as a markdown section
Wired both into PdfConverter: metadata is prepended to the output when present and skipped entirely when absent or empty
Added a test module with 16 tests covering the unit helpers and end-to-end output via mocks
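
As a rough illustration of the date conversion step, here is a minimal sketch. The function name `format_pdf_date` and the output format are assumptions for illustration, not necessarily what this PR uses. It handles the common `D:YYYYMMDDHHmmSS` form of PDF dates, ignores any trailing timezone offset, and returns unrecognised strings unchanged.

```python
import re
from datetime import datetime
from typing import Optional

# Matches an optional "D:" prefix, a 4-digit year, and up to five optional
# 2-digit components (month, day, hour, minute, second).
_PDF_DATE_RE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
)


def format_pdf_date(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: convert a PDF date string to 'YYYY-MM-DD HH:MM:SS'.

    Missing components default to the start of the period; the timezone
    suffix (for example +05'30') is ignored in this sketch.
    """
    if not raw:
        return None
    text = raw.strip()
    if not text:
        return None
    match = _PDF_DATE_RE.match(text)
    if not match:
        return text  # not a PDF-style date; pass it through untouched
    year, month, day, hour, minute, second = match.groups()
    try:
        parsed = datetime(
            int(year),
            int(month or 1),
            int(day or 1),
            int(hour or 0),
            int(minute or 0),
            int(second or 0),
        )
    except ValueError:
        return text  # digits present but not a valid calendar date
    return parsed.strftime("%Y-%m-%d %H:%M:%S")
```

Returning the raw string on failure, rather than raising, keeps a malformed date from blocking the rest of the conversion.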

Example output

For a PDF with title and author set:
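
The exact layout depends on the formatter in this PR. As a hedged sketch, assuming a `## Document Properties` heading and one bold-labelled list item per field (both the layout and the name `format_metadata` are illustrative assumptions), the section could be produced like this. The dictionary keys are the standard PDF Info entries that pdfplumber exposes in its metadata dict.

```python
from typing import Mapping, Optional

# (PDF Info key, display label) pairs in the assumed output order.
_FIELDS = [
    ("Title", "Title"),
    ("Author", "Author"),
    ("Subject", "Subject"),
    ("Keywords", "Keywords"),
    ("Creator", "Creator"),
    ("Producer", "Producer"),
    ("CreationDate", "Created"),
    ("ModDate", "Modified"),
]


def format_metadata(metadata: Optional[Mapping[str, object]]) -> str:
    """Hypothetical formatter: render non-empty fields as a markdown section.

    Returns an empty string when there is nothing to show, so callers can
    prepend the result unconditionally.
    """
    if not metadata:
        return ""
    lines = []
    for key, label in _FIELDS:
        value = metadata.get(key)
        if value is None:
            continue
        text = str(value).strip()
        if text:  # skip blank values so no empty entries are emitted
            lines.append(f"- **{label}:** {text}")
    if not lines:
        return ""
    return "## Document Properties\n\n" + "\n".join(lines) + "\n\n"


print(format_metadata({"Title": "Quarterly Report", "Author": "Jane Doe"}))
```

With only a title and author set, this sketch prints the heading followed by two list items and nothing for the missing fields.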

Design decisions

Only fields that are present and non-empty are emitted — no blank lines for missing fields
Uses pdfplumber's metadata dict (already imported); no new dependencies
Falls back gracefully: if pdfplumber raises, metadata is skipped and pdfminer handles the text as before
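
The fallback behaviour can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `convert_with_metadata` and its three callable parameters are hypothetical names, injected so the example runs without a PDF file. In the real converter, the metadata reader would presumably wrap something like `pdfplumber.open(stream).metadata`.

```python
from typing import Callable, Mapping


def convert_with_metadata(
    read_metadata: Callable[[], Mapping[str, object]],
    extract_text: Callable[[], str],
    render_section: Callable[[Mapping[str, object]], str],
) -> str:
    """Hypothetical wiring: prepend a metadata section if one can be read.

    Any exception from the metadata reader is swallowed so that text
    extraction proceeds exactly as it would without this feature.
    """
    try:
        metadata = read_metadata() or {}
    except Exception:
        metadata = {}  # metadata is optional; never fail the conversion
    section = render_section(metadata) if metadata else ""
    return section + extract_text()


def _broken_reader() -> Mapping[str, object]:
    # Stands in for pdfplumber raising on a malformed file.
    raise ValueError("cannot parse PDF")


result = convert_with_metadata(
    _broken_reader,
    lambda: "Body text",
    lambda meta: "## Document Properties\n\n",
)
print(result)
```

Because the reader fails here, the output is just the body text with no properties section.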

Adds metadata extraction to PdfConverter. When pdfplumber finds fields like title, author, subject, keywords, creator, producer, or dates, they are emitted as a Document Properties section at the top of the output. Closes microsoft#1664

VANDRANKI wants to merge 1 commit into microsoft:main from VANDRANKI:feat/pdf-metadata-extraction
