How can I truncate a string to have at most N characters?

Question

The obvious approach, String::truncate(usize), panics because its argument is a byte length and it doesn't consider Unicode characters (which is baffling considering Rust treats strings as Unicode).

let mut s = "ボルテックス".to_string();
s.truncate(4);

thread '<main>' panicked at 'assertion failed: self.is_char_boundary(new_len)'

Additionally, truncate modifies the original string, which is not always desired.

The best I've come up with is to convert to chars and collect into a String.

fn truncate(s: String, max_width: usize) -> String {
    s.chars().take(max_width).collect()
}

e.g.

fn main() {
    assert_eq!(truncate("ボルテックス".to_string(), 0), "");
    assert_eq!(truncate("ボルテックス".to_string(), 4), "ボルテッ");
    assert_eq!(truncate("ボルテックス".to_string(), 100), "ボルテックス");
    assert_eq!(truncate("hello".to_string(), 4), "hell");
}

However, this feels very heavy-handed.

Unicode is freaking complicated. Are you sure you want char (which corresponds to code points) as unit and not grapheme clusters? — user395760, Commented Jul 19, 2016 at 14:35
Actually, the other direction is just as valid: Impose a limit on the number of bytes the UTF-8 encoding takes (you need some care to chop off whole characters — take as many chars as possible without going over N bytes). While this does not match people's perception of character counts, it is reasonable when the restriction is storage-motivated (e.g., the size of a database column). — user395760, Commented Jul 19, 2016 at 14:50
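The byte-limit variant described in the comment above could be sketched roughly as follows. The name truncate_to_bytes is hypothetical (it is not a standard library function); the sketch only relies on str::len and str::is_char_boundary, and it cuts at code point boundaries, not grapheme clusters.

```rust
/// Hypothetical helper: returns the longest prefix of `s` that fits in
/// `max_bytes` bytes without splitting a UTF-8 encoded character.
fn truncate_to_bytes(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk backwards from the byte limit until we land on a char boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

Each character of "ボルテックス" takes 3 bytes in UTF-8, so truncate_to_bytes("ボルテックス", 7) keeps "ボル" (6 bytes) and truncate_to_bytes("ボルテックス", 2) returns an empty string.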

Community · Accepted Answer · 2017-05-23 11:47:21Z

Make sure you read and understand delnan's point:

Unicode is freaking complicated. Are you sure you want char (which corresponds to code points) as unit and not grapheme clusters?

The rest of this answer assumes you have a good reason for using char and not graphemes.

"which is baffling considering Rust treats strings as Unicode"

This is not correct; Rust treats strings as UTF-8. In UTF-8, every code point is mapped to a variable number of bytes. There's no O(1) algorithm to convert "6 characters" to "N bytes", so the standard library doesn't hide that from you.
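To make the byte/character mismatch concrete, here is a small sketch. Both helper names are hypothetical and exist only for illustration; they use str::len, chars().count() and char::len_utf8 from the standard library.

```rust
/// Hypothetical helper: returns (UTF-8 byte length, number of `char`s).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    // `len` is O(1) and counts bytes; `chars().count()` has to decode the
    // whole string, so it is O(n).
    (s.len(), s.chars().count())
}

/// Hypothetical helper: how many bytes each `char` occupies in UTF-8.
fn char_byte_widths(s: &str) -> Vec<usize> {
    s.chars().map(|c| c.len_utf8()).collect()
}
```

For "ボルテックス" the first helper gives (18, 6): six chars of three bytes each. Byte index 4 therefore falls inside the second character (bytes 3 to 5), which is why s.truncate(4) panics. For "hello" the result is (5, 5), which is why byte-based truncation appears to work on ASCII text.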

You can use char_indices to step through the string character by character and get the byte index of that character:

fn truncate(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        None => s,
        Some((idx, _)) => &s[..idx],
    }
}

fn main() {
    assert_eq!(truncate("ボルテックス", 0), "");
    assert_eq!(truncate("ボルテックス", 4), "ボルテッ");
    assert_eq!(truncate("ボルテックス", 100), "ボルテックス");
    assert_eq!(truncate("hello", 4), "hell");
}

Because this returns a slice, you can copy it into a new allocation if you need to, or use its length to truncate a String in place:

// May not be as efficient as inlining the code...
fn truncate_in_place(s: &mut String, max_chars: usize) {
    let bytes = truncate(&s, max_chars).len();
    s.truncate(bytes);
}

fn main() {
    let mut s = "ボルテックス".to_string();
    truncate_in_place(&mut s, 0);
    assert_eq!(s, "");
}
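For the "new allocation" case, a minimal sketch might look like this. truncate_owned is a hypothetical name; truncate is repeated from the answer so that the block compiles on its own.

```rust
/// Same approach as the `truncate` function above, repeated here so the
/// sketch is self-contained.
fn truncate(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        None => s,
        Some((idx, _)) => &s[..idx],
    }
}

/// Hypothetical helper: copies at most `max_chars` chars into a new `String`,
/// leaving the input untouched.
fn truncate_owned(s: &str, max_chars: usize) -> String {
    truncate(s, max_chars).to_owned()
}
```

Unlike the chars().take(n).collect() version from the question, this takes a &str, so the caller does not have to give up ownership of the original, and the copy is a single slice copy instead of a char-by-char collect.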

How is using char_indices() different from my use of chars()? — Peter Uhnak, Commented Jul 20, 2016 at 7:41
@Peter chars only returns the characters. char_indices is similar in concept to chars().enumerate() except it returns the actual index of the u8 that character starts at in the original str. — Linear, Commented Jul 20, 2016 at 11:05
@Veedrac Every. Single. Time. I will never remember it! Clippy feature request! — Shepmaster, Commented Jul 20, 2016 at 12:48
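The difference between char_indices() and chars().enumerate() discussed in these comments can be seen side by side in a short sketch. The helper name offsets is hypothetical and only serves the illustration.

```rust
/// Hypothetical helper: for every char, returns
/// (byte offset from `char_indices`, position from `enumerate`, the char).
fn offsets(s: &str) -> Vec<(usize, usize, char)> {
    s.char_indices()
        .enumerate()
        .map(|(char_pos, (byte_idx, c))| (byte_idx, char_pos, c))
        .collect()
}
```

For "aボb" this yields (0, 0, 'a'), (1, 1, 'ボ'), (4, 2, 'b'): the two indices agree until the first multi-byte character and then drift apart. Only the byte offset is valid for slicing with &s[..idx].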
