feat: extract PDF metadata (title, author, dates) into markdown output #1787

Open
VANDRANKI wants to merge 1 commit into microsoft:main from VANDRANKI:feat/pdf-metadata-extraction

@VANDRANKI

Summary

Closes #1664.

PDF files often carry useful metadata - title, author, subject, keywords, creator, producer, and creation/modification dates - but the current PdfConverter silently discards it. This PR surfaces that information as a Document Properties section at the top of the converted markdown.

What changed

  • Added a helper that converts PDF date strings to a readable format
  • Added a helper that formats the pdfplumber metadata dict as a markdown section
  • Wired both into PdfConverter: metadata is prepended to the output when present and skipped entirely when absent or empty
  • Added a test module with 16 tests covering the unit helpers and end-to-end output via mocks
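
To illustrate the date-conversion step, here is a minimal sketch of how a PDF date string could be turned into a readable timestamp. It is not the code from this PR: the name `format_pdf_date` and the output format are assumptions, and the sketch deliberately ignores the trailing time zone offset. It relies only on the standard PDF date layout `D:YYYYMMDDHHmmSS` followed by an optional offset.

```python
import re
from datetime import datetime

# Hypothetical helper for illustration; the PR's actual function name and
# output format are not shown in this description.
# Matches "D:YYYYMMDDHHmmSS" where everything after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
)


def format_pdf_date(raw: str) -> str:
    """Return 'YYYY-MM-DD HH:MM:SS' for a PDF date string.

    The time zone offset (for example +05'30') is ignored in this sketch.
    Input that does not look like a PDF date is returned unchanged.
    """
    match = _PDF_DATE.match(raw.strip())
    if not match:
        return raw
    year, month, day, hour, minute, second = match.groups()
    try:
        parsed = datetime(
            int(year),
            int(month or 1),   # missing month/day default to 1
            int(day or 1),
            int(hour or 0),    # missing time parts default to 0
            int(minute or 0),
            int(second or 0),
        )
    except ValueError:
        # Out-of-range values such as month 13: keep the original text.
        return raw
    return parsed.strftime("%Y-%m-%d %H:%M:%S")
```

Returning the raw string on failure keeps malformed dates visible in the output instead of dropping them, which is one possible choice; the PR may handle this differently.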

Example output

For a PDF with title and author set:

Design decisions

  • Only fields that are present and non-empty are emitted, so missing fields leave no blank lines
  • Uses pdfplumber's metadata dict (pdfplumber is already imported); no new dependencies
  • Falls back gracefully: if pdfplumber raises, metadata is skipped and pdfminer handles the text as before
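
The decisions above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the function names, the heading level, and the bullet layout are all hypothetical. The dictionary keys are the standard PDF Info dictionary names (`Title`, `Author`, `CreationDate`, and so on). The metadata reader is passed in as a callable so the sketch runs without a PDF file.

```python
from typing import Any, Callable, Mapping

# Standard PDF Info dictionary keys paired with display labels.
# The labels and their order are assumptions made for this sketch.
_FIELDS = [
    ("Title", "Title"),
    ("Author", "Author"),
    ("Subject", "Subject"),
    ("Keywords", "Keywords"),
    ("Creator", "Creator"),
    ("Producer", "Producer"),
    ("CreationDate", "Created"),
    ("ModDate", "Modified"),
]


def format_metadata_section(metadata: Mapping[str, Any]) -> str:
    """Build a markdown section from non-empty fields; '' if none exist."""
    lines = []
    for key, label in _FIELDS:
        value = metadata.get(key)
        text = str(value).strip() if value is not None else ""
        if text:  # skip missing, None, and whitespace-only values
            lines.append(f"- **{label}:** {text}")
    if not lines:
        return ""
    return "## Document Properties\n\n" + "\n".join(lines) + "\n\n"


def safe_metadata_section(read_metadata: Callable[[], Mapping[str, Any]]) -> str:
    """Return the section, or '' if reading the metadata raises."""
    try:
        return format_metadata_section(read_metadata())
    except Exception:
        # Metadata is optional: on any failure, emit nothing and let the
        # normal text extraction path run unchanged.
        return ""
```

Because both functions return an empty string when there is nothing to show, the caller can prepend the result to the converted text unconditionally.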

Adds metadata extraction to PdfConverter. When pdfplumber finds fields
like title, author, subject, keywords, creator, producer, or dates,
they are emitted as a Document Properties section at the top of the output.

Closes microsoft#1664
Development

Successfully merging this pull request may close these issues.

PdfConverter does not extract PDF metadata (title, author, creation date)
