
Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

-2 votes
0 answers
9 views

What is the best language model for fine-tuning with a Persian-language dataset? [closed]

I am trying to fine-tune the Llama 2 language model with a dataset that I created in Persian. But when I tokenize this dataset, I notice that the Llama 2 tokenizer splits it at the character level, not the word ...
user23446017
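
A quick way to confirm the behaviour described here is to print the tokens directly; a minimal sketch, assuming the stock meta-llama/Llama-2-7b-hf checkpoint (substitute whichever Llama 2 variant is in use):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; any Llama 2 variant behaves the same way.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "این یک جمله فارسی است"  # "This is a Persian sentence"
print(tok.tokenize(text))
# Llama 2's SentencePiece vocabulary contains few Persian subwords, so most
# of the text falls back to byte-level pieces, which looks like
# character-level tokenization.
```
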
0 votes
0 answers
18 views

How can I run an entire HuggingFace iterable_dataset through a function before it reaches another function?

I am building my own tokenizer using the byte-pair encoding algorithm and applying it to a HuggingFace dataset. Due to my computer's memory constraints, I am first converting the dataset to a ...
FilmCode
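
One approach for this kind of pipeline: `.map()` on an `IterableDataset` is lazy, so a preprocessing function attached once runs record by record as any downstream consumer iterates. A sketch, with wikitext standing in for the asker's dataset:

```python
from datasets import load_dataset

# Streaming load returns an IterableDataset that is never fully materialized.
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", streaming=True)

def preprocess(example):
    example["text"] = example["text"].lower()  # stand-in for any per-record step
    return example

# .map() on an IterableDataset is lazy: preprocess runs as a downstream
# consumer (e.g. a BPE trainer) pulls records, so the whole corpus is piped
# through one function before it reaches the next.
ds = ds.map(preprocess)

for example in ds.take(3):
    print(example["text"][:60])
```
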
-3 votes
0 answers
52 views

How do I resolve this .lower() AttributeError when calling it in a tokenizer function in my code using TensorFlow?

This is the actual code; the previous one was posted by mistake. As explained before, my code tries to call .lower() in the tokenizer function, but instead of an output I get ...
Isesele Victor
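
The usual cause of this error is calling `.lower()` on something that is not a string (a list, or a float/NaN produced by pandas). A minimal sketch of the working pattern with the Keras `Tokenizer`, which can lower-case for you:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["Hello World", "TensorFlow Tokenizer DEMO"]

# texts.lower() would raise AttributeError: 'list' object has no attribute
# 'lower' -- .lower() is a str method, so apply it per string or delegate
# it to the Tokenizer via lower=True.
tokenizer = Tokenizer(lower=True)
tokenizer.fit_on_texts(texts)            # expects an iterable of strings
print(tokenizer.texts_to_sequences(texts))
```
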
1 vote
1 answer
71 views

Pythonic refactor advice needed

I am writing a tokenizer for files that have an ignored preamble. The files are written in Markdown, and there is a list of keywords in H1 titles that change the state of the parser. When an EOF is ...
pawkw
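
A common Pythonic shape for this kind of parser is a generator driven by an explicit state enum. A sketch under assumed inputs (the keyword names here are hypothetical):

```python
import enum

class State(enum.Enum):
    PREAMBLE = enum.auto()   # everything before the first keyword H1 is ignored
    BODY = enum.auto()

KEYWORDS = {"Introduction", "Rules"}   # hypothetical H1 keywords

def tokenize(lines):
    """Yield (kind, value) tokens, skipping the ignored preamble."""
    state = State.PREAMBLE
    for line in lines:
        if line.startswith("# ") and line[2:].strip() in KEYWORDS:
            state = State.BODY            # a keyword H1 switches parser state
            yield ("HEADING", line[2:].strip())
        elif state is State.BODY:
            yield ("TEXT", line.rstrip("\n"))

# Usage: for kind, value in tokenize(open("notes.md")): ...
# EOF needs no special case: the generator simply ends with the file.
```
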
0 votes
0 answers
26 views

C++ program using JackTokenizer fails to add tokens to XML output

Description: I'm developing a C++ program that converts .jack files into XML format using a JackTokenizer class. The program is supposed to read each .jack file, tokenize its content, and generate a ...
Ilan Vinograd
0 votes
1 answer
174 views

How to Track Token Usage with TikToken Library for Anthropic Models in llama-index Query Engine?

I'm facing an issue with tracking token usage for Anthropic models using the TikToken library. The tiktoken library natively supports OpenAI models, but I'm working with the Claude-3 model family from ...
Mohil
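
tiktoken only ships OpenAI encodings, so any local count for Claude is an approximation; exact numbers come from the usage block Anthropic returns with each response. A sketch of the estimate route:

```python
import tiktoken

# cl100k_base is an OpenAI encoding, at best a rough proxy for Claude's
# tokenizer -- treat the result as an estimate, not billing-grade truth.
enc = tiktoken.get_encoding("cl100k_base")

def approx_tokens(text: str) -> int:
    return len(enc.encode(text))

print(approx_tokens("How many tokens is this prompt?"))
# For exact counts, read the usage field (input_tokens / output_tokens)
# from the Anthropic API response instead of re-tokenizing locally.
```
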
0 votes
0 answers
50 views

Calculate token utilization for streaming endpoints in Gemini

I want to get the token utilization for the Google Gemini multimodal streaming endpoint, to which I pass an image as input. Note that for non-streaming endpoints, token information is returned by the Gemini models, ...
s_v
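
One hedged workaround while streamed chunks carry no usage data is to count the prompt side up front with `count_tokens`; a sketch with the google-generativeai client (the model name is an assumption):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")             # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")  # substitute your model

# count_tokens takes the same contents you pass to the streaming call, so the
# prompt-side count is available even if the stream reports no usage metadata.
print(model.count_tokens("Describe this image."))
```
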
0 votes
1 answer
22 views

Elasticsearch: implement an off-the-shelf language analyser but use a custom tokeniser

This may be a duplicate but I've done a bit of searching and found no answer. I have a simple requirement: I want to use the French (for example) analyser and I simply want to tweak it slightly so ...
mike rodent
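
Built-in language analysers cannot have their tokeniser swapped in place; the documented route is to rebuild the analyser as a custom one and substitute the tokeniser there. A sketch via the Python client, adapted from the reference rebuild of the `french` analyser in the Elasticsearch docs:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

settings = {
    "analysis": {
        "filter": {
            "french_elision": {
                "type": "elision",
                "articles_case": True,
                "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c",
                             "jusqu", "quoiqu", "lorsqu", "puisqu"],
            },
            "french_stop": {"type": "stop", "stopwords": "_french_"},
            "french_stemmer": {"type": "stemmer", "language": "light_french"},
        },
        "analyzer": {
            "rebuilt_french": {
                "tokenizer": "standard",  # swap in your custom tokeniser here
                "filter": ["french_elision", "lowercase", "french_stop",
                           "french_stemmer"],
            }
        },
    }
}
es.indices.create(index="my-index", settings=settings)
```
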
0 votes
2 answers
66 views

XSLT: How to split strings for multiple fields simultaneously

I've seen a number of posts about using tokenize() to split strings, but they all involve a single field. I have a slightly different situation and I'm not sure how to approach it. My XML can ...
DR - Idemia
0 votes
0 answers
12 views

Can I increase tiktoken throughput?

Hello, I'm trying to speed up processing when using tiktoken. Is there a default limitation on processing documents with tiktoken, or can I somehow change the thread settings? Would ...
Ben
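
tiktoken does expose a threading knob on batch encoding; a sketch (the default thread count is 8 in current releases, so raising it may or may not help depending on available cores):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
docs = ["some document text"] * 10_000   # stand-in corpus

# encode_batch fans the work out across threads; num_threads is the knob.
token_lists = enc.encode_batch(docs, num_threads=16)
print(sum(len(t) for t in token_lists))
```
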
1 vote
1 answer
33 views

Reordering GPT2Tokenizer tokens by frequency leads to unrecognized tokens

I am trying to create a new tokenizer by reordering the token ids in my existing tokenizer based on frequency. In theory, the order of token ids has no effect on performance or usability, but it ...
Cade Harger
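
A point worth checking before reordering: in GPT-2's BPE the id mapping and the segmentation rules live in different files. A small sketch for inspecting both sides, assuming the stock gpt2 checkpoint:

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
vocab = tok.get_vocab()    # token string -> id (vocab.json)
print(len(vocab))          # 50257

# Ids come from vocab.json, but segmentation is driven by the ranked merge
# list (merges.txt). Renumbering ids is safe only if the merge ranks stay
# untouched; reordering merges by frequency changes how text is split and
# produces unrecognized tokens.
print(tok.tokenize("unrecognized tokens"))
```
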
0 votes
0 answers
166 views

Open Source LLM Repeating Tokens Until Max Tokens Reached - How to Fix?

I'm working with an open-source language model (LLM) for generating text in Portuguese, and I'm encountering an issue where the model keeps repeating tokens until the maximum number of tokens is ...
Miguel Casagrande
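
Repetition loops of this kind are usually addressed at decoding time rather than in the tokenizer. A sketch with Hugging Face `generate` (gpt2 stands in for the asker's model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in checkpoint; substitute your Portuguese model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Era uma vez", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=100,
    repetition_penalty=1.2,    # down-weights recently generated tokens
    no_repeat_ngram_size=3,    # hard-blocks any repeated 3-gram
    do_sample=True,            # pure greedy decoding is a common cause of loops
    top_p=0.9,
    pad_token_id=tok.eos_token_id,  # silences the missing-pad-token warning
)
print(tok.decode(out[0], skip_special_tokens=True))
```
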
1 vote
0 answers
64 views

LLM dataset tokenization issues for question-answering fine-tuning

I am using Hugging Face to fine-tune an LLM for question answering. I am trying to figure out how to write a data preprocessing/tokenization function for this dataset. I am using the nvidia/...
LeGOATJames23
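
For extractive QA the usual preprocessing pairs question with context and windows the context with a stride; a sketch following the standard Hugging Face recipe (the model name and the question/context field names are assumptions about the dataset):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in model

def preprocess(batch):
    # Truncate only the context so long passages are split into overlapping
    # windows instead of silently dropping the answer span.
    return tok(
        batch["question"],
        batch["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,  # needed to map answer spans to tokens
        padding="max_length",
    )

# Typical use with a datasets.Dataset:
#   ds.map(preprocess, batched=True, remove_columns=ds.column_names)
```
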
0 votes
0 answers
29 views

Using the BERT tokenizer to tokenize a sentence

How can I remove the ## signs when I am using the BERT tokenizer? Tokens length: 200 Tokens: ['[CLS]', 'educational', 'background', 'computer', 'applications', 'masters', 'degree', 'software', '...
basit khan
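
The `##` prefix is WordPiece's continuation marker, not noise to strip by hand; `convert_tokens_to_string` merges the pieces back into words. A minimal sketch:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tok.tokenize("educational background in computer applications")

# "##" marks a WordPiece continuation piece; joining via the tokenizer
# reassembles whole words rather than leaving marker-stripped fragments.
print(tokens)
print(tok.convert_tokens_to_string(tokens))
```
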
0 votes
0 answers
26 views

Does a different string normalization form affect the tokenizing phase? [duplicate]

I have two strings with different types of normalization: # Text from a source I crawled string1 = 'Thu điếu' # Text from my keyboard string2 = 'Thu điếu' print('string1:', string1.encode('utf-8')) ...
Long Trần
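
Yes: NFC and NFD strings contain different code points, so they tokenize differently unless normalized first. A minimal sketch with the standard library:

```python
import unicodedata

s1 = "Thu điếu"                         # composed (NFC), e.g. from a web page
s2 = unicodedata.normalize("NFD", s1)   # decomposed, as some keyboards emit

print(s1 == s2)   # False: same rendering, different code points
print(unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2))  # True

# Normalizing to one form (NFC is the common choice) before tokenizing keeps
# visually identical strings from producing different token sequences.
```
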
