Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with "Using Fetch": (makeTextFileLineIterator ignores UTF8 multi-byte boundaries) #10086

Closed
mattblack2809 opened this issue Oct 25, 2021 · 4 comments · Fixed by #34278
Closed

Comments

@mattblack2809
Copy link

MDN URL: https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch

What information was incorrect, unhelpful, or incomplete?

The reader is informed 'The chunks that are read from a response are not broken neatly at line boundaries and are Uint8Arrays'. What is unhelpful/incomplete is that the subsequent code will generate malformed output at line
chunk = chunk ? utf8Decoder.decode(chunk) : ''; where the chunk finishes mid-way through a multi-byte UTF8 character.

Specific section or headline?

Processing a text file line by line

What did you expect to see?

A warning at least that the code does not deal with multi-byte UTF8 characters. Up to 3 characters at the end of the chunk need to be retained and passed to the next invocation of the TextDecoder in order that a valid UTF8 sequence is passed to the TextDecoder.

Did you test this? If so, how?

I wrote and tested a 'structured reader' wrapper class that shortened the chunk data to align with the end of a valid UTF8 rune - and which then added the additional 0..3 bytes at the start of the next chunk, so that the data provided to TextDecode.decode() is always a valid sequence of complete UTF8 runes (assuming the source data is valid).

MDN Content page report details
@Rumyra Rumyra added Content:WebAPI Web API docs needs triage Triage needed by staff and/or partners. Automatically applied when an issue is opened. labels Nov 1, 2021
@himanshugarg
Copy link
Contributor

himanshugarg commented Nov 18, 2021

Thanks @mattblack2809 for the detailed report. I could be mistaken and I haven't tried either. But it appears that the if block following let result = re.exec(chunk); should take care of it by extracting prefixes up to line endings. And by returning the leftover suffix as is.

@mattblack2809
Copy link
Author

mattblack2809 commented Nov 18, 2021 via email

@himanshugarg
Copy link
Contributor

Ok, I was hoping that this would return true. Like you said, it doesn't:-

const decoder = new TextDecoder();
let chunk1 = decoder.decode(new Uint8Array([226, 130]));
chunk1 = chunk1 + decoder.decode(new Uint8Array([172]));

let chunk2 = decoder.decode(new Uint8Array([226, 130, 172]));
chunk2 === chunk1 // returns false

I find it odd though that a common use case such reading a fetch response line by line requires such a non-trivial handling. Btw, for readability you could edit your comment to enclose the code blocks inside markup.

@sideshowbarker sideshowbarker removed the needs triage Triage needed by staff and/or partners. Automatically applied when an issue is opened. label Jul 20, 2022
@wbamberg
Copy link
Collaborator

There's some discussion of this issue in #34278. I don't like this example much in the fetch guide, because it adds a lot of complexity that's not specific to fetch. But for this particular issue, can't we just replace this:

const reader = response.body.getReader();

with:

const reader = response.body.pipeThrough(new TextDecoderStream()).getReader();

...and then use the output without decoding it explicitly?

@wbamberg wbamberg linked a pull request Jun 21, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment