Issue with "Using Fetch": (makeTextFileLineIterator ignores UTF8 multi-byte boundaries) #10086

mattblack2809 · 2021-10-25T20:15:25Z

MDN URL: https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch

What information was incorrect, unhelpful, or incomplete?

The reader is informed 'The chunks that are read from a response are not broken neatly at line boundaries and are Uint8Arrays'. What is unhelpful/incomplete is that the subsequent code will generate malformed output at line
chunk = chunk ? utf8Decoder.decode(chunk) : ''; where the chunk finishes mid-way through a multi-byte UTF8 character.

Specific section or headline?

Processing a text file line by line

What did you expect to see?

A warning at least that the code does not deal with multi-byte UTF8 characters. Up to 3 characters at the end of the chunk need to be retained and passed to the next invocation of the TextDecoder in order that a valid UTF8 sequence is passed to the TextDecoder.

Did you test this? If so, how?

I wrote and tested a 'structured reader' wrapper class that shortened the chunk data to align with the end of a valid UTF8 rune - and which then added the additional 0..3 bytes at the start of the next chunk, so that the data provided to TextDecode.decode() is always a valid sequence of complete UTF8 runes (assuming the source data is valid).

MDN Content page report details

Folder: en-us/web/api/fetch_api/using_fetch
MDN URL: https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch
GitHub URL: https://github.com/mdn/content/blob/main/files/en-us/web/api/fetch_api/using_fetch/index.md
Last commit: 8bc9a25
Document last modified: 2021-10-13T06:21:42.000Z

The text was updated successfully, but these errors were encountered:

himanshugarg · 2021-11-18T02:05:12Z

Thanks @mattblack2809 for the detailed report. I could be mistaken and I haven't tried either. But it appears that the if block following let result = re.exec(chunk); should take care of it by extracting prefixes up to line endings. And by returning the leftover suffix as is.

mattblack2809 · 2021-11-18T17:04:02Z

Hi, Thanks for looking at this and responding. Your observation ‘it appears that the if block following let result = re.exec(chunk); should take care of it by extracting prefixes up to line endings' The problem I think lies in the 3 Ines below: ``` const reader = response.body.getReader(); let { value: chunk, done: readerDone } = await reader.read(); chunk = chunk ? utf8Decoder.decode(chunk) : ''; ``` There is no guarantee the reader.read() returns a chunk that is the whole file (or a chunk that ends at a newline) - but rather an arbitrary length chunk that could finish mid way through a must-byte UTF8 rune - meaning the 3rd line (which changes chunk from a Uint8Array to a string) will return a string with the ‘unparsed/invalid’ UTF8 character as the last entry. Similarly, the next chunk read will have a decode error at the start… I am not sure where/how to post it (being rather new to this) - the code below has a version of TextDecoder that does its own buffering which if used in place of TextDecoder in this example works… ```ts class Utf8TextDecoder { dec = new TextDecoder(); ar : Array<number> = []; utfEndBytes(ar: Uint8Array) : 0|1|2|3 { const len = ar.length; if(ar[len - 3] >= 240) // 1111 0xxx return 3; if(ar[len - 2] >= 224) // 1110 xxxx return 2; if(ar[len - 1] >= 192) // 110x xxxx return 1; return 0; } decode(p: Uint8Array) : string { // if we have retained bytes let start = 0; let str = ''; if(this.ar.length > 0) { const runeLength = this.ar[0] >= 240 ? 4 : this.ar[0] >= 224 ? 3 : 2; while(start < p.byteLength && this.ar.length < runeLength) { this.ar.push(p[start]); start++; } str = this.dec.decode(new Uint8Array(this.ar)); } const remaining = p.subarray(start); const endBytes = this.utfEndBytes(remaining); //console.log('remaining', remaining, 'endBytes', endBytes); this.ar = []; for(let i = 0; i < endBytes; i++) { this.ar.push(remaining[remaining.byteLength - endBytes + i]); } const validRemaining = remaining.subarray(0, remaining.byteLength - endBytes); str += this.dec.decode(validRemaining); return str; } } ``` Regards, Matt Black

…

On 18 Nov 2021, at 02:05, Himanshu ***@***.***> wrote: Thanks @mattblack2809 <https://github.com/mattblack2809> for the detailed report. I could be mistaken and I haven't tried either. But it appears that the if block following let result = re.exec(chunk); should take care of it by extracting prefixes up to line endings. And the leftover suffix as is. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#10086 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AFADUTDX4V7ECV7QGTCRRPLUMRNOFANCNFSM5GWD3ADQ>. Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

himanshugarg · 2021-11-19T03:23:55Z

Ok, I was hoping that this would return true. Like you said, it doesn't:-

const decoder = new TextDecoder();
let chunk1 = decoder.decode(new Uint8Array([226, 130]));
chunk1 = chunk1 + decoder.decode(new Uint8Array([172]));

let chunk2 = decoder.decode(new Uint8Array([226, 130, 172]));
chunk2 === chunk1 // returns false

I find it odd though that a common use case such reading a fetch response line by line requires such a non-trivial handling. Btw, for readability you could edit your comment to enclose the code blocks inside markup.

wbamberg · 2024-06-21T02:51:41Z

There's some discussion of this issue in #34278. I don't like this example much in the fetch guide, because it adds a lot of complexity that's not specific to fetch. But for this particular issue, can't we just replace this:

const reader = response.body.getReader();

with:

const reader = response.body.pipeThrough(new TextDecoderStream()).getReader();

...and then use the output without decoding it explicitly?

Rumyra added Content:WebAPI Web API docs needs triage Triage needed by staff and/or partners. Automatically applied when an issue is opened. labels Nov 1, 2021

sideshowbarker removed the needs triage Triage needed by staff and/or partners. Automatically applied when an issue is opened. label Jul 20, 2022

Josh-Cena added the area: Fetch/XMLHttpRequest label Jun 26, 2023

Josh-Cena mentioned this issue Jun 20, 2024

update fetch guide #34278

Merged

wbamberg linked a pull request Jun 21, 2024 that will close this issue

update fetch guide #34278

Merged

Josh-Cena closed this as completed in #34278 Jun 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with "Using Fetch": (makeTextFileLineIterator ignores UTF8 multi-byte boundaries) #10086

Issue with "Using Fetch": (makeTextFileLineIterator ignores UTF8 multi-byte boundaries) #10086

mattblack2809 commented Oct 25, 2021

himanshugarg commented Nov 18, 2021 •

edited

Loading

mattblack2809 commented Nov 18, 2021 via email •

edited by Josh-Cena

Loading

himanshugarg commented Nov 19, 2021

wbamberg commented Jun 21, 2024

Issue with "Using Fetch": (makeTextFileLineIterator ignores UTF8 multi-byte boundaries) #10086

Issue with "Using Fetch": (makeTextFileLineIterator ignores UTF8 multi-byte boundaries) #10086

Comments

mattblack2809 commented Oct 25, 2021

What information was incorrect, unhelpful, or incomplete?

Specific section or headline?

What did you expect to see?

Did you test this? If so, how?

himanshugarg commented Nov 18, 2021 • edited Loading

mattblack2809 commented Nov 18, 2021 via email • edited by Josh-Cena Loading

himanshugarg commented Nov 19, 2021

wbamberg commented Jun 21, 2024

himanshugarg commented Nov 18, 2021 •

edited

Loading

mattblack2809 commented Nov 18, 2021 via email •

edited by Josh-Cena

Loading