-
Notifications
You must be signed in to change notification settings - Fork 22.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue with "Using Fetch": (makeTextFileLineIterator ignores UTF8 multi-byte boundaries) #10086
Comments
Thanks @mattblack2809 for the detailed report. I could be mistaken and I haven't tried either. But it appears that the |
Hi,
Thanks for looking at this and responding.
Your observation ‘it appears that the if block following let result = re.exec(chunk); should take care of it by extracting prefixes up to line endings'
The problem I think lies in the 3 Ines below:
```
const reader = response.body.getReader();
let { value: chunk, done: readerDone } = await reader.read();
chunk = chunk ? utf8Decoder.decode(chunk) : '';
```
There is no guarantee the reader.read() returns a chunk that is the whole file (or a chunk that ends at a newline) - but rather an arbitrary length chunk that could finish mid way through a must-byte UTF8 rune - meaning the 3rd line (which changes chunk from a Uint8Array to a string) will return a string with the ‘unparsed/invalid’ UTF8 character as the last entry. Similarly, the next chunk read will have a decode error at the start…
I am not sure where/how to post it (being rather new to this) - the code below has a version of TextDecoder that does its own buffering which if used in place of TextDecoder in this example works…
```ts
class Utf8TextDecoder {
dec = new TextDecoder();
ar : Array<number> = [];
utfEndBytes(ar: Uint8Array) : 0|1|2|3 {
const len = ar.length;
if(ar[len - 3] >= 240) // 1111 0xxx
return 3;
if(ar[len - 2] >= 224) // 1110 xxxx
return 2;
if(ar[len - 1] >= 192) // 110x xxxx
return 1;
return 0;
}
decode(p: Uint8Array) : string {
// if we have retained bytes
let start = 0;
let str = '';
if(this.ar.length > 0) {
const runeLength = this.ar[0] >= 240 ? 4 : this.ar[0] >= 224 ? 3 : 2;
while(start < p.byteLength && this.ar.length < runeLength) {
this.ar.push(p[start]);
start++;
}
str = this.dec.decode(new Uint8Array(this.ar));
}
const remaining = p.subarray(start);
const endBytes = this.utfEndBytes(remaining);
//console.log('remaining', remaining, 'endBytes', endBytes);
this.ar = [];
for(let i = 0; i < endBytes; i++) {
this.ar.push(remaining[remaining.byteLength - endBytes + i]);
}
const validRemaining = remaining.subarray(0, remaining.byteLength - endBytes);
str += this.dec.decode(validRemaining);
return str;
}
}
```
Regards,
Matt Black
… On 18 Nov 2021, at 02:05, Himanshu ***@***.***> wrote:
Thanks @mattblack2809 <https://github.com/mattblack2809> for the detailed report. I could be mistaken and I haven't tried either. But it appears that the if block following let result = re.exec(chunk); should take care of it by extracting prefixes up to line endings. And the leftover suffix as is.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#10086 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AFADUTDX4V7ECV7QGTCRRPLUMRNOFANCNFSM5GWD3ADQ>.
Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
Ok, I was hoping that this would return const decoder = new TextDecoder();
let chunk1 = decoder.decode(new Uint8Array([226, 130]));
chunk1 = chunk1 + decoder.decode(new Uint8Array([172]));
let chunk2 = decoder.decode(new Uint8Array([226, 130, 172]));
chunk2 === chunk1 // returns false I find it odd though that a common use case such reading a fetch response line by line requires such a non-trivial handling. Btw, for readability you could edit your comment to enclose the code blocks inside markup. |
There's some discussion of this issue in #34278. I don't like this example much in the fetch guide, because it adds a lot of complexity that's not specific to fetch. But for this particular issue, can't we just replace this: const reader = response.body.getReader(); with: const reader = response.body.pipeThrough(new TextDecoderStream()).getReader(); ...and then use the output without decoding it explicitly? |
MDN URL: https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch
What information was incorrect, unhelpful, or incomplete?
The reader is informed 'The chunks that are read from a response are not broken neatly at line boundaries and are Uint8Arrays'. What is unhelpful/incomplete is that the subsequent code will generate malformed output at line
chunk = chunk ? utf8Decoder.decode(chunk) : '';
where the chunk finishes mid-way through a multi-byte UTF8 character.Specific section or headline?
Processing a text file line by line
What did you expect to see?
A warning at least that the code does not deal with multi-byte UTF8 characters. Up to 3 characters at the end of the chunk need to be retained and passed to the next invocation of the TextDecoder in order that a valid UTF8 sequence is passed to the TextDecoder.
Did you test this? If so, how?
I wrote and tested a 'structured reader' wrapper class that shortened the chunk data to align with the end of a valid UTF8 rune - and which then added the additional 0..3 bytes at the start of the next chunk, so that the data provided to TextDecode.decode() is always a valid sequence of complete UTF8 runes (assuming the source data is valid).
MDN Content page report details
en-us/web/api/fetch_api/using_fetch
The text was updated successfully, but these errors were encountered: