fix: write UTF-8 to stdout.buffer to avoid UnicodeEncodeError on non-UTF-8 systems by octo-patch · Pull Request #1790 · microsoft/markitdown

octo-patch · 2026-04-17T03:32:22Z

Problem

On Windows systems with a non-UTF-8 locale (e.g. GBK on Chinese Windows), running:

markitdown file.pdf > output.md

raises a UnicodeEncodeError:

UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 140: illegal multibyte sequence

The previous approach of encoding to sys.stdout.encoding with errors='replace' had two remaining issues:

sys.stdout.encoding can be None when stdout is a raw pipe, causing a TypeError instead of a graceful failure.
Characters are silently replaced with '?' (lossy output), which is undesirable when redirecting to a file.
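
For illustration, both issues can be reproduced with plain str.encode, independent of markitdown. This is a minimal sketch: the helper name try_encode is hypothetical and not part of the PR.

```python
BULLET = "\u2022"  # the character named in the traceback above


def try_encode(text, encoding, errors="strict"):
    """Return the encoded bytes, or the name of the exception that was raised."""
    try:
        return text.encode(encoding, errors=errors)
    except (UnicodeEncodeError, TypeError) as exc:
        return type(exc).__name__


# Strict GBK encoding fails, as in the reported error.
print(try_encode(BULLET, "gbk"))                    # UnicodeEncodeError
# errors='replace' avoids the crash but loses the character.
print(try_encode(BULLET, "gbk", errors="replace"))  # b'?'
# A None encoding (as sys.stdout.encoding can be) raises TypeError.
print(try_encode(BULLET, None, errors="replace"))   # TypeError
# UTF-8 represents the character losslessly.
print(try_encode(BULLET, "utf-8"))                  # b'\xe2\x80\xa2'
```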

Solution

Write UTF-8 encoded bytes directly to sys.stdout.buffer when available. This produces lossless UTF-8 output regardless of the system locale encoding, consistent with the behavior of the -o/--output flag (which already writes UTF-8 explicitly).

A safe fallback handles the rare case where stdout.buffer is not available (e.g. some embedded or wrapped stdout objects), using the locale encoding with errors='replace' and guarding against None encoding.
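
The approach described above might look roughly like the following. This is a sketch under assumptions, not the PR's actual diff: the function name write_markdown and its stream parameter are hypothetical, and the fallback to "utf-8" when the encoding is None is one possible way to guard that case.

```python
import sys


def write_markdown(markdown, stream=None):
    """Write markdown as UTF-8 bytes when a binary buffer exists (hypothetical helper)."""
    stream = sys.stdout if stream is None else stream
    buffer = getattr(stream, "buffer", None)
    if buffer is not None:
        # Preferred path: lossless UTF-8, independent of the locale encoding.
        buffer.write(markdown.encode("utf-8"))
        buffer.flush()
        return
    # Fallback: no binary buffer. Use the stream's encoding, guarding against None,
    # and replace characters that the encoding cannot represent.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    safe_text = markdown.encode(encoding, errors="replace").decode(encoding)
    stream.write(safe_text)
```

Calling write_markdown(result.markdown) would then write to sys.stdout, while passing an explicit stream makes the helper easy to exercise in tests.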

Testing

Verified the fix handles the sys.stdout.encoding is None case without raising TypeError
Verified lossless UTF-8 output when redirecting (> file.md) on systems with non-UTF-8 locale encoding
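
One way to reproduce this kind of check without a GBK-locale machine is to force the child interpreter's stdout encoding through the PYTHONIOENCODING environment variable. The sketch below is an assumption about how such a check could be written, not the verification actually performed for this PR; it does not invoke markitdown, and the helper name run_with_gbk_stdout is hypothetical.

```python
import os
import subprocess
import sys


def run_with_gbk_stdout(code):
    """Run a Python snippet in a child process whose stdout encoding is forced to GBK."""
    env = dict(os.environ, PYTHONIOENCODING="gbk")
    return subprocess.run([sys.executable, "-c", code], env=env, capture_output=True)


# print() goes through the text layer and therefore the GBK codec.
PRINT_CODE = r"print('\u2022')"
# Writing bytes to stdout.buffer bypasses the text layer entirely.
BUFFER_CODE = r"import sys; sys.stdout.buffer.write('\u2022\n'.encode('utf-8'))"

failing = run_with_gbk_stdout(PRINT_CODE)
passing = run_with_gbk_stdout(BUFFER_CODE)

print(failing.returncode != 0, b"UnicodeEncodeError" in failing.stderr)
print(passing.returncode, passing.stdout)
```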

…UTF-8 systems

On Windows systems with a non-UTF-8 locale (e.g. GBK on Chinese Windows), running `markitdown file.pdf > output.md` raises:

UnicodeEncodeError: 'gbk' codec can't encode character '\u2022'

Two problems existed in the previous approach of encoding to sys.stdout.encoding with errors='replace':

1. sys.stdout.encoding can be None when stdout is a raw pipe, causing a TypeError.
2. Characters are silently replaced with '?' (lossy output), which is undesirable when redirecting to a file.

Fix by writing UTF-8 encoded bytes directly to sys.stdout.buffer when available. This produces lossless UTF-8 output regardless of the system locale, matching the behaviour of the -o/--output flag. A safe fallback handles the rare case where stdout.buffer is absent.

Fixes microsoft#1788

VANDRANKI

Writing directly to sys.stdout.buffer is the right fix for locales like GBK that cannot represent all Unicode code points. The hasattr check handles environments without a binary buffer cleanly.

One minor thing: print() appends a trailing newline but sys.stdout.buffer.write() does not. If downstream tooling or scripts rely on the output ending with \n, you may want sys.stdout.buffer.write(result.markdown.encode("utf-8") + b"\n"). Low impact since markdown content usually ends with a newline already, but worth a quick check.

…rint() Per review feedback (microsoft#1790): sys.stdout.buffer.write does not append a trailing newline like print() does. Match print()'s behavior so downstream tools relying on a final newline keep working.

octo-patch · 2026-04-19T05:17:56Z

Thanks for catching that! You're right — print() appends a trailing newline that sys.stdout.buffer.write() does not. Pushed a small fix that appends b"\n" if the markdown does not already end with one, preserving the prior behavior for any downstream tooling that expects a final newline. (986e2a8)
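
The conditional newline described here could be sketched as follows. The function name encode_with_trailing_newline is hypothetical, and the code is an illustration of the idea rather than the contents of commit 986e2a8.

```python
def encode_with_trailing_newline(markdown):
    """Encode markdown as UTF-8 and ensure the result ends with exactly one added newline at most."""
    data = markdown.encode("utf-8")
    if not data.endswith(b"\n"):
        # Only append when missing, so content that already ends in a newline is unchanged.
        data += b"\n"
    return data


print(encode_with_trailing_newline("# Title"))    # b'# Title\n'
print(encode_with_trailing_newline("# Title\n"))  # b'# Title\n'
```

The returned bytes would then be passed to sys.stdout.buffer.write().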

VANDRANKI

The fix is exactly right - appending b"\n" only when the content does not already end with one avoids a double newline in the common case while preserving parity with print() for content that lacks a trailing newline. LGTM.

VANDRANKI reviewed Apr 18, 2026


fix: append trailing newline to stdout buffer write for parity with p…

986e2a8


VANDRANKI approved these changes Apr 20, 2026


octo-patch wants to merge 2 commits into microsoft:main from octo-patch:fix/issue-1788-unicode-stdout-encoding
