
Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

-2 votes
0 answers
9 views

What is the best language model for fine-tuning with a Persian-language dataset? [closed]

I am trying to fine-tune the Llama 2 language model with a dataset that I created in Persian. But when I tokenize this dataset, I notice that the Llama 2 tokenizer splits it at the character level, not the word ...
user23446017
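
A quick way to confirm the behaviour described here is to print the tokens directly; a minimal sketch, assuming the stock meta-llama/Llama-2-7b-hf checkpoint (substitute whichever Llama 2 variant is in use):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; any Llama 2 variant behaves the same way.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "این یک جمله فارسی است"  # "This is a Persian sentence"
print(tok.tokenize(text))
# Llama 2's SentencePiece vocabulary contains few Persian subwords, so most
# of the text falls back to byte-level pieces, which looks like
# character-level tokenization.
```
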
0 votes
0 answers
18 views

How can I run an entire HuggingFace iterable_dataset through a function before it reaches another function?

I am building my own tokenizer using the byte-pair encoding algorithm and applying it to a HuggingFace dataset. Due to my computer's memory constraints, I am first converting the dataset to a ...
FilmCode
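
One approach for this kind of pipeline: `.map()` on an `IterableDataset` is lazy, so a preprocessing function attached once runs record by record as any downstream consumer iterates. A sketch, with wikitext standing in for the asker's dataset:

```python
from datasets import load_dataset

# Streaming load returns an IterableDataset that is never fully materialized.
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", streaming=True)

def preprocess(example):
    example["text"] = example["text"].lower()  # stand-in for any per-record step
    return example

# .map() on an IterableDataset is lazy: preprocess runs as a downstream
# consumer (e.g. a BPE trainer) pulls records, so the whole corpus is piped
# through one function before it reaches the next.
ds = ds.map(preprocess)

for example in ds.take(3):
    print(example["text"][:60])
```
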
-3 votes
0 answers
52 views

How do I resolve this .lower() AttributeError when calling it in a tokenizer function in my code using TensorFlow?

This is the actual code; the previous one was posted by mistake. As explained before, my code tries to call .lower() in the tokenizer function, but instead of an output I get ...
Isesele Victor
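
The usual cause of this error is calling `.lower()` on something that is not a string (a list, or a float/NaN produced by pandas). A minimal sketch of the working pattern with the Keras `Tokenizer`, which can lower-case for you:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["Hello World", "TensorFlow Tokenizer DEMO"]

# texts.lower() would raise AttributeError: 'list' object has no attribute
# 'lower' -- .lower() is a str method, so apply it per string or delegate
# it to the Tokenizer via lower=True.
tokenizer = Tokenizer(lower=True)
tokenizer.fit_on_texts(texts)            # expects an iterable of strings
print(tokenizer.texts_to_sequences(texts))
```
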
1 vote
1 answer
71 views

Pythonic refactor advice needed

I am writing a tokenizer for files that have an ignored preamble. The files are written in Markdown, and there is a list of keywords in H1 titles that change the state of the parser. When an EOF is ...
pawkw
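
A common Pythonic shape for this kind of parser is a generator driven by an explicit state enum. A sketch under assumed inputs (the keyword names here are hypothetical):

```python
import enum

class State(enum.Enum):
    PREAMBLE = enum.auto()   # everything before the first keyword H1 is ignored
    BODY = enum.auto()

KEYWORDS = {"Introduction", "Rules"}   # hypothetical H1 keywords

def tokenize(lines):
    """Yield (kind, value) tokens, skipping the ignored preamble."""
    state = State.PREAMBLE
    for line in lines:
        if line.startswith("# ") and line[2:].strip() in KEYWORDS:
            state = State.BODY            # a keyword H1 switches parser state
            yield ("HEADING", line[2:].strip())
        elif state is State.BODY:
            yield ("TEXT", line.rstrip("\n"))

# Usage: for kind, value in tokenize(open("notes.md")): ...
# EOF needs no special case: the generator simply ends with the file.
```
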
0 votes
0 answers
26 views

C++ program using JackTokenizer fails to add tokens to XML output

Description: I'm developing a C++ program that converts .jack files into XML format using a JackTokenizer class. The program is supposed to read each .jack file, tokenize its content, and generate a ...
Ilan Vinograd
0 votes
1 answer
174 views

How to Track Token Usage with TikToken Library for Anthropic Models in llama-index Query Engine?

I'm facing an issue with tracking token usage for Anthropic models using the TikToken library. The tiktoken library natively supports OpenAI models, but I'm working with the Claude-3 model family from ...
Mohil
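
tiktoken only ships OpenAI encodings, so any local count for Claude is an approximation; exact numbers come from the usage block Anthropic returns with each response. A sketch of the estimate route:

```python
import tiktoken

# cl100k_base is an OpenAI encoding, at best a rough proxy for Claude's
# tokenizer -- treat the result as an estimate, not billing-grade truth.
enc = tiktoken.get_encoding("cl100k_base")

def approx_tokens(text: str) -> int:
    return len(enc.encode(text))

print(approx_tokens("How many tokens is this prompt?"))
# For exact counts, read the usage field (input_tokens / output_tokens)
# from the Anthropic API response instead of re-tokenizing locally.
```
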
0 votes
0 answers
50 views

Calculate token utilization for streaming endpoints in Gemini

I want to get the token utilization for the Google Gemini multimodal streaming endpoint, to which I pass an image as input. Note that for non-streaming endpoints, token information is returned by the Gemini models, ...
s_v
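
One hedged workaround while streamed chunks carry no usage data is to count the prompt side up front with `count_tokens`; a sketch with the google-generativeai client (the model name is an assumption):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")             # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")  # substitute your model

# count_tokens takes the same contents you pass to the streaming call, so the
# prompt-side count is available even if the stream reports no usage metadata.
print(model.count_tokens("Describe this image."))
```
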
0 votes
1 answer
22 views

Elasticsearch: implement an off-the-shelf language analyser but use a custom tokeniser

This may be a duplicate but I've done a bit of searching and found no answer. I have a simple requirement: I want to use the French (for example) analyser and I simply want to tweak it slightly so ...
mike rodent
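
Built-in language analysers cannot have their tokeniser swapped in place; the documented route is to rebuild the analyser as a custom one and substitute the tokeniser there. A sketch via the Python client, adapted from the reference rebuild of the `french` analyser in the Elasticsearch docs:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

settings = {
    "analysis": {
        "filter": {
            "french_elision": {
                "type": "elision",
                "articles_case": True,
                "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c",
                             "jusqu", "quoiqu", "lorsqu", "puisqu"],
            },
            "french_stop": {"type": "stop", "stopwords": "_french_"},
            "french_stemmer": {"type": "stemmer", "language": "light_french"},
        },
        "analyzer": {
            "rebuilt_french": {
                "tokenizer": "standard",  # swap in your custom tokeniser here
                "filter": ["french_elision", "lowercase", "french_stop",
                           "french_stemmer"],
            }
        },
    }
}
es.indices.create(index="my-index", settings=settings)
```
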
0 votes
2 answers
66 views

XSLT: How to split strings for multiple fields simultaneously

I've seen a number of posts about using tokenize() to split strings, but they all involve a single field. I have a slightly different situation and I'm not sure how to approach it. My XML can ...
DR - Idemia
0 votes
0 answers
12 views

Can I increase tiktoken throughput?

Hello, I'm trying to speed up processing when using tiktoken. Is there a default limitation on processing documents with tiktoken, or can I somehow change the thread settings? Would ...
Ben
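
tiktoken does expose a threading knob on batch encoding; a sketch (the default thread count is 8 in current releases, so raising it may or may not help depending on available cores):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
docs = ["some document text"] * 10_000   # stand-in corpus

# encode_batch fans the work out across threads; num_threads is the knob.
token_lists = enc.encode_batch(docs, num_threads=16)
print(sum(len(t) for t in token_lists))
```
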
1 vote
1 answer
33 views

Reordering GPT2Tokenizer tokens by frequency leads to unrecognized tokens

I am trying to create a new tokenizer by reordering the token ids in my existing tokenizer based on frequency. In theory, the order of token ids has no effect on performance or usability, but it ...
Cade Harger
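
A point worth checking before reordering: in GPT-2's BPE the id mapping and the segmentation rules live in different files. A small sketch for inspecting both sides, assuming the stock gpt2 checkpoint:

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
vocab = tok.get_vocab()    # token string -> id (vocab.json)
print(len(vocab))          # 50257

# Ids come from vocab.json, but segmentation is driven by the ranked merge
# list (merges.txt). Renumbering ids is safe only if the merge ranks stay
# untouched; reordering merges by frequency changes how text is split and
# produces unrecognized tokens.
print(tok.tokenize("unrecognized tokens"))
```
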
0 votes
0 answers
166 views

Open Source LLM Repeating Tokens Until Max Tokens Reached - How to Fix?

I'm working with an open-source language model (LLM) for generating text in Portuguese, and I'm encountering an issue where the model keeps repeating tokens until the maximum number of tokens is ...
Miguel Casagrande
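
Repetition loops of this kind are usually addressed at decoding time rather than in the tokenizer. A sketch with Hugging Face `generate` (gpt2 stands in for the asker's model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in checkpoint; substitute your Portuguese model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Era uma vez", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=100,
    repetition_penalty=1.2,    # down-weights recently generated tokens
    no_repeat_ngram_size=3,    # hard-blocks any repeated 3-gram
    do_sample=True,            # pure greedy decoding is a common cause of loops
    top_p=0.9,
    pad_token_id=tok.eos_token_id,  # silences the missing-pad-token warning
)
print(tok.decode(out[0], skip_special_tokens=True))
```
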
1 vote
0 answers
64 views

LLM dataset tokenization issues for question-answering fine-tuning

I am using Hugging Face to fine-tune an LLM for question answering. I am trying to figure out how to write a data preprocessing/tokenization function for this dataset. I am using the nvidia/...
LeGOATJames23
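
For extractive QA the usual preprocessing pairs question with context and windows the context with a stride; a sketch following the standard Hugging Face recipe (the model name and the question/context field names are assumptions about the dataset):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in model

def preprocess(batch):
    # Truncate only the context so long passages are split into overlapping
    # windows instead of silently dropping the answer span.
    return tok(
        batch["question"],
        batch["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,  # needed to map answer spans to tokens
        padding="max_length",
    )

# Typical use with a datasets.Dataset:
#   ds.map(preprocess, batched=True, remove_columns=ds.column_names)
```
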
0 votes
0 answers
29 views

Using the BERT tokenizer to tokenize a sentence

How can I remove the ## signs when I am using the BERT tokenizer? Tokens length: 200 Tokens: ['[CLS]', 'educational', 'background', 'computer', 'applications', 'masters', 'degree', 'software', '...
basit khan
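
The `##` prefix is WordPiece's continuation marker, not noise to strip by hand; `convert_tokens_to_string` merges the pieces back into words. A minimal sketch:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tok.tokenize("educational background in computer applications")

# "##" marks a WordPiece continuation piece; joining via the tokenizer
# reassembles whole words rather than leaving marker-stripped fragments.
print(tokens)
print(tok.convert_tokens_to_string(tokens))
```
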
0 votes
0 answers
26 views

Does a different string normalization form affect the tokenizing phase? [duplicate]

I have two strings with different types of normalization: # Text from a source I crawled string1 = 'Thu điếu' # Text from my keyboard string2 = 'Thu điếu' print('string1:', string1.encode('utf-8')) ...
Long Trần
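
Yes: NFC and NFD strings contain different code points, so they tokenize differently unless normalized first. A minimal sketch with the standard library:

```python
import unicodedata

s1 = "Thu điếu"                         # composed (NFC), e.g. from a web page
s2 = unicodedata.normalize("NFD", s1)   # decomposed, as some keyboards emit

print(s1 == s2)   # False: same rendering, different code points
print(unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2))  # True

# Normalizing to one form (NFC is the common choice) before tokenizing keeps
# visually identical strings from producing different token sequences.
```
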
