fix: write UTF-8 to stdout.buffer to avoid UnicodeEncodeError on non-UTF-8 systems #1790

Open
octo-patch wants to merge 2 commits into microsoft:main from octo-patch:fix/issue-1788-unicode-stdout-encoding
@octo-patch
Fixes #1788

Problem

On Windows systems with a non-UTF-8 locale (e.g. GBK on Chinese Windows), running:

markitdown file.pdf > output.md

raises a UnicodeEncodeError:

UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 140: illegal multibyte sequence

The previous approach of encoding to sys.stdout.encoding with errors='replace' had two remaining issues:

  1. sys.stdout.encoding can be None when stdout is a raw pipe, causing a TypeError instead of a graceful failure.
  2. Characters are silently replaced with '?' (lossy output), which is undesirable when redirecting to a file.
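The failure modes above can be reproduced without a GBK console. The following is a minimal sketch, not code from this PR; the sample string and variable names are chosen purely for illustration.

```python
# Illustrative only: reproduces the encoding behaviours described above.
text = "\u2022 item"  # bullet character, as in the reported traceback

# Strict encoding to GBK fails, which is the original UnicodeEncodeError.
try:
    text.encode("gbk")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" avoids the exception but substitutes "?" for the bullet.
lossy = text.encode("gbk", errors="replace")

# Passing None as the encoding (sys.stdout.encoding can be None) is a TypeError.
try:
    text.encode(None, errors="replace")
    none_failed = False
except TypeError:
    none_failed = True
```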

Solution

Write UTF-8 encoded bytes directly to sys.stdout.buffer when available. This produces lossless UTF-8 output regardless of the system locale encoding, consistent with the behavior of the -o/--output flag (which already writes UTF-8 explicitly).

A safe fallback handles the rare case where stdout.buffer is not available (e.g. some embedded or wrapped stdout objects), using the locale encoding with errors='replace' and guarding against None encoding.
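A rough sketch of the approach described above, for readers who do not have the diff open. The helper name `write_markdown` and its `stream` parameter are hypothetical and exist only so the logic can be exercised in isolation; the actual change in the PR may be structured differently.

```python
import sys


def write_markdown(markdown: str, stream=None) -> None:
    """Hypothetical helper: emit markdown as UTF-8 bytes when possible."""
    stream = sys.stdout if stream is None else stream
    if hasattr(stream, "buffer"):
        # A binary layer is available: bypass the locale codec entirely.
        stream.buffer.write(markdown.encode("utf-8"))
        stream.buffer.flush()
    else:
        # Fallback: lossy but never raises. Guard against encoding being None.
        encoding = getattr(stream, "encoding", None) or "utf-8"
        safe = markdown.encode(encoding, errors="replace").decode(encoding)
        stream.write(safe)
```

Wrapping an `io.BytesIO` in `io.TextIOWrapper(..., encoding="gbk")` simulates a GBK stdout and exercises the first branch; an `io.StringIO`, which has no `buffer` attribute, exercises the fallback.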

Testing

  • Verified the fix handles the sys.stdout.encoding is None case without raising TypeError
  • Verified lossless UTF-8 output when redirecting (> file.md) on systems with non-UTF-8 locale encoding

…UTF-8 systems

On Windows systems with a non-UTF-8 locale (e.g. GBK on Chinese
Windows), running `markitdown file.pdf > output.md` raises:

  UnicodeEncodeError: 'gbk' codec can't encode character '\u2022'

Two problems existed in the previous approach of encoding to
sys.stdout.encoding with errors='replace':
1. sys.stdout.encoding can be None when stdout is a raw pipe,
   causing a TypeError.
2. Characters are silently replaced with '?' (lossy output), which
   is undesirable when redirecting to a file.

Fix by writing UTF-8 encoded bytes directly to sys.stdout.buffer
when available. This produces lossless UTF-8 output regardless of
the system locale, matching the behaviour of the -o/--output flag.
A safe fallback handles the rare case where stdout.buffer is absent.

Fixes microsoft#1788
@VANDRANKI VANDRANKI left a comment

Writing directly to sys.stdout.buffer is the right fix for locales like GBK that cannot represent all Unicode code points. The hasattr check handles environments without a binary buffer cleanly.

One minor thing: print() appends a trailing newline but sys.stdout.buffer.write() does not. If downstream tooling or scripts rely on the output ending with \n, you may want sys.stdout.buffer.write(result.markdown.encode("utf-8") + b"\n"). Low impact since markdown content usually ends with a newline already, but worth a quick check.

…rint()

Per review feedback (microsoft#1790): sys.stdout.buffer.write does not append a
trailing newline like print() does. Match print()'s behavior so downstream
tools relying on a final newline keep working.
@octo-patch
Author

Thanks for catching that! You're right — print() appends a trailing newline that sys.stdout.buffer.write() does not. Pushed a small fix that appends b"\n" if the markdown does not already end with one, preserving the prior behavior for any downstream tooling that expects a final newline. (986e2a8)
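For reference, the newline handling amounts to something like the sketch below. The function name `encode_with_final_newline` is hypothetical and not taken from the commit; it only illustrates the "append if missing" rule.

```python
def encode_with_final_newline(markdown: str) -> bytes:
    """Hypothetical helper: UTF-8 encode and ensure exactly one final newline is present."""
    data = markdown.encode("utf-8")
    if not data.endswith(b"\n"):
        # Only append when missing, so content that already ends in a newline is unchanged.
        data += b"\n"
    return data
```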

@VANDRANKI VANDRANKI left a comment

The fix is exactly right: appending b"\n" only when the content does not already end with one avoids a double newline in the common case while preserving parity with print() for content that lacks a trailing newline. LGTM.