fix: write UTF-8 to stdout.buffer to avoid UnicodeEncodeError on non-UTF-8 systems #1790
octo-patch wants to merge 2 commits into microsoft:main
…UTF-8 systems

On Windows systems with a non-UTF-8 locale (e.g. GBK on Chinese Windows), running `markitdown file.pdf > output.md` raises:

    UnicodeEncodeError: 'gbk' codec can't encode character '\u2022'

Two problems existed in the previous approach of encoding to sys.stdout.encoding with errors='replace':

1. sys.stdout.encoding can be None when stdout is a raw pipe, causing a TypeError.
2. Characters are silently replaced with '?' (lossy output), which is undesirable when redirecting to a file.

Fix by writing UTF-8 encoded bytes directly to sys.stdout.buffer when available. This produces lossless UTF-8 output regardless of the system locale, matching the behaviour of the -o/--output flag. A safe fallback handles the rare case where stdout.buffer is absent.

Fixes microsoft#1788
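The failure modes named in this commit message can be reproduced with plain string encoding, without a GBK console. This is an illustrative sketch rather than code from the PR; the sample text and variable names are assumptions chosen for the demonstration.

```python
# Illustrative only: reproduces the failure modes, not code from the PR.
text = "\u2022 item"  # the bullet character from the reported error

# Strict encoding to GBK fails, as in the reported traceback.
try:
    text.encode("gbk")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Problem 1: a None encoding (raw pipe) is rejected with TypeError.
try:
    text.encode(None, errors="replace")
    none_failed = False
except TypeError:
    none_failed = True

# Problem 2: errors='replace' succeeds but substitutes '?' for the bullet.
lossy = text.encode("gbk", errors="replace")
print(lossy)  # b'? item'

# UTF-8 can represent every code point, so the round trip is lossless.
lossless = text.encode("utf-8").decode("utf-8")
print(lossless == text)
```

The lossy result is the reason the commit prefers UTF-8 bytes over the locale encoding when output is redirected to a file.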
VANDRANKI
left a comment
Writing directly to sys.stdout.buffer is the right fix for locales like GBK that cannot represent all Unicode code points. The hasattr check handles environments without a binary buffer cleanly.
One minor thing: print() appends a trailing newline but sys.stdout.buffer.write() does not. If downstream tooling or scripts rely on the output ending with \n, you may want sys.stdout.buffer.write(result.markdown.encode("utf-8") + b"\n"). Low impact since markdown content usually ends with a newline already, but worth a quick check.
…rint()

Per review feedback (microsoft#1790): sys.stdout.buffer.write does not append a trailing newline like print() does. Match print()'s behavior so downstream tools relying on a final newline keep working.
Thanks for catching that! You're right …
VANDRANKI
left a comment
The fix is exactly right - appending b"\n" only when the content does not already end with one avoids a double newline on the common case while preserving parity with print() for content that lacks a trailing newline. LGTM.
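The conditional newline discussed in this review thread can be sketched as a small helper. The function name `to_stdout_bytes` is hypothetical and not taken from the PR; it only illustrates the "append if missing" rule.

```python
def to_stdout_bytes(markdown: str) -> bytes:
    """Encode markdown as UTF-8 and make sure the result ends with a newline."""
    data = markdown.encode("utf-8")
    # Append only when missing, so content that already ends with a
    # newline does not pick up a second one.
    if not data.endswith(b"\n"):
        data += b"\n"
    return data


print(to_stdout_bytes("# Title"))    # b'# Title\n'
print(to_stdout_bytes("# Title\n"))  # b'# Title\n'
```

Note that this differs slightly from print(), which always appends a newline even when the text already ends with one.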
Fixes #1788
Problem

On Windows systems with a non-UTF-8 locale (e.g. GBK on Chinese Windows), running `markitdown file.pdf > output.md` raises a `UnicodeEncodeError`: `'gbk' codec can't encode character '\u2022'`.

The previous approach of encoding to `sys.stdout.encoding` with `errors='replace'` had two remaining issues:

1. `sys.stdout.encoding` can be `None` when stdout is a raw pipe, causing a `TypeError` instead of a graceful failure.
2. Characters are silently replaced with `'?'` (lossy output), which is undesirable when redirecting to a file.

Solution

Write UTF-8 encoded bytes directly to `sys.stdout.buffer` when available. This produces lossless UTF-8 output regardless of the system locale encoding, consistent with the behavior of the `-o`/`--output` flag (which already writes UTF-8 explicitly).

A safe fallback handles the rare case where `stdout.buffer` is not available (e.g. some embedded or wrapped stdout objects), using the locale encoding with `errors='replace'` and guarding against a `None` encoding.

Testing

- … `sys.stdout.encoding is None` case without raising `TypeError`
- … `> file.md`) on systems with non-UTF-8 locale encoding
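One way to exercise the approach described in the Solution section without a GBK Windows machine is to wrap an in-memory byte buffer in a GBK text stream. The sketch below is an assumption about the shape of the fix, not the PR's actual diff: `write_markdown` and its `stream` parameter are hypothetical, and the fallback here uses the stream's own encoding (defaulting to UTF-8 when it is `None`) rather than whatever the PR uses.

```python
import io
import sys


def write_markdown(markdown: str, stream=None) -> None:
    """Write markdown as UTF-8 bytes when the stream exposes a binary buffer."""
    stream = sys.stdout if stream is None else stream
    if not markdown.endswith("\n"):
        markdown += "\n"
    buffer = getattr(stream, "buffer", None)
    if buffer is not None:
        stream.flush()  # keep ordering with any earlier text-mode writes
        buffer.write(markdown.encode("utf-8"))
        buffer.flush()
    else:
        # Fallback: no binary buffer, so degrade to the stream's encoding,
        # guarding against encoding being None or missing.
        encoding = getattr(stream, "encoding", None) or "utf-8"
        safe = markdown.encode(encoding, errors="replace").decode(encoding)
        stream.write(safe)


# Simulate a GBK console: a text wrapper over an in-memory byte buffer.
raw = io.BytesIO()
gbk_stdout = io.TextIOWrapper(raw, encoding="gbk")
write_markdown("\u2022 item", gbk_stdout)
captured = raw.getvalue()
print(captured)  # b'\xe2\x80\xa2 item\n'

# io.StringIO has no .buffer, so it exercises the fallback path.
plain = io.StringIO()
write_markdown("\u2022 item", plain)
```

Because the bytes bypass the GBK text layer, the bullet character arrives intact as UTF-8 instead of raising or turning into `?`.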