fix: redetect plain-text charset after ASCII-only sniff#1789
fix: redetect plain-text charset after ASCII-only sniff#1789lawrence3699 wants to merge 1 commit intomicrosoft:mainfrom
Conversation
|
@lawrence3699 please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
Contributor License AgreementContribution License AgreementThis Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
|
There was a problem hiding this comment.
Pull request overview
Fixes a bug in MarkItDown’s plain-text conversion path where an ASCII-only initial charset sniff (first 4096 bytes) can lead to decoding the entire file as ASCII and raising UnicodeDecodeError when later UTF-8 bytes appear.
Changes:
- Update
PlainTextConverter.convert()to read the full byte stream once and, onUnicodeDecodeError, re-runcharset_normalizeron the full content as a fallback. - Add a regression test covering “UTF-8 after long ASCII prefix”
.txtfiles.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/markitdown/src/markitdown/converters/_plain_text_converter.py | Adds full-file charset re-detection fallback when decoding with an initially guessed charset fails. |
| packages/markitdown/tests/test_module_misc.py | Adds regression test to ensure late non-ASCII bytes don’t break conversion after an ASCII-only sniff. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| text_content = file_stream.read().decode(stream_info.charset) | ||
| try: | ||
| text_content = file_bytes.decode(stream_info.charset) | ||
| except UnicodeDecodeError: |
There was a problem hiding this comment.
The UnicodeDecodeError fallback will also trigger when stream_info.charset was explicitly provided (e.g., via CLI --charset or HTTP Content-Type), silently overriding that declared charset with a charset-normalizer guess. This changes behavior from “fail fast on wrong/invalid declared charset” to “auto-recover”, which may be surprising for callers relying on strict decoding. Consider limiting this fallback to the specific sniffing failure case (e.g., only when the charset is the guessed ascii, or by adding an explicit flag to enable recovery) so explicitly supplied charsets keep their current strict semantics.
| except UnicodeDecodeError: | |
| except UnicodeDecodeError: | |
| normalized_charset = stream_info.charset.strip().lower() | |
| if normalized_charset not in ("ascii", "us-ascii"): | |
| raise |
VANDRANKI
left a comment
There was a problem hiding this comment.
Good diagnosis and clean fix. Reading all the bytes upfront before the if/else means the charset detection and the fallback share the same full-file buffer, which is what chardet/charset-normalizer needs to catch late non-ASCII bytes reliably.
Test reproduces the exact failure mode (4100 ASCII bytes followed by a non-ASCII character), which is the right way to guard this regression. LGTM.
Summary
Fix plain-text conversion when the initial charset sniff only sees ASCII bytes but later bytes require UTF-8 decoding.
Problem
MarkItDown().convert()can guesscharset='ascii'for.txtfiles whose first 4096 bytes are ASCII-only.PlainTextConverter.convert()then decodes the entire file as ASCII and raisesUnicodeDecodeErroronce it reaches later non-ASCII bytes.This reproduces with a text file that starts with 4100 ASCII characters and ends with
café.Fix
When decoding with the guessed charset fails, rerun
charset_normalizeragainst the full file bytes and use that result as a fallback.Validation
PYTHONPATH=packages/markitdown/src uv run --with requests --with markdownify --with beautifulsoup4 --with defusedxml --with charset-normalizer --with magika --with pytest --python 3.12 python -m pytest packages/markitdown/tests/test_module_misc.py -k 'stream_info_operations or redetects_charset_after_ascii_prefix' -qPYTHONPATH=packages/markitdown/src uv run --with requests --with markdownify --with beautifulsoup4 --with defusedxml --with charset-normalizer --with magika --with pytest --python 3.12 python -m pytest 'packages/markitdown/tests/test_module_vectors.py::test_guess_stream_info[test_vector10]' -qFixes #1505.