
All Questions

0 votes
0 answers
18 views

Is it common for an LM (hundreds of millions of parameters) to beat an LLM (billions of parameters) on a binary classification task?

Preface: I am trying to fine-tune transformer-based models (an LM and an LLM). The LM I used is DeBERTa, and the LLM is LLaMA 3. The task is to classify whether a text contains condescending language ...
asked by sempraEdic
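As a rough reference for the setup described above, here is a minimal sketch of fine-tuning an encoder LM for binary classification with Hugging Face Transformers; the microsoft/deberta-v3-base checkpoint, the toy examples, and the hyperparameters are illustrative assumptions, not the asker's actual configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2  # binary head, randomly initialized
)

texts = ["You probably wouldn't understand this.", "Thanks for the help!"]
labels = torch.tensor([1, 0])  # 1 = condescending, 0 = not (toy labels)

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=labels)  # cross-entropy loss computed internally
out.loss.backward()
optimizer.step()
```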
1 vote
1 answer
62 views

Improving GPU Utilization in LLM Inference System

I'm trying to build a distributed LLM inference platform with Hugging Face support. The implementation uses Python for model processing and Java for interfacing with external systems. ...
asked by Cardstdani
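One common lever for the utilization problem described here is batching concurrent requests into a single forward pass. A minimal sketch, with gpt2 standing in for the actual model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 defines no pad token
tokenizer.padding_side = "left"             # required for batched generation
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

# Requests arriving in the same time window are padded together, so the
# GPU runs one forward pass per decode step instead of one per request.
prompts = ["Translate to French: hello", "Summarize: the cat sat on the mat"]
batch = tokenizer(prompts, padding=True, return_tensors="pt").to(device)
out = model.generate(**batch, max_new_tokens=32,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```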
0 votes
1 answer
37 views

How do transformer-based architectures generate contextual embeddings?

How do transformer-based architectures, such as RoBERTa, generate contextual embeddings? The issue is that I haven't found any articles that explain this process.
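In short: every encoder layer applies self-attention, so each output vector depends on all tokens in the sentence, not just the token's own identity. A minimal sketch of extracting those vectors from roberta-base:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

def embed(sentence):
    batch = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[0]  # (seq_len, 768): one vector per token

a = embed("I deposited cash at the bank")
b = embed("We sat on the bank of the river")
# The vectors for "bank" differ between the two sentences because each
# vector is computed from the whole context via self-attention.
```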
0 votes
1 answer
59 views

Fine-tuning, feature extraction, or both using RoBERTa?

I'm reading a program that uses the pre-trained RoBERTa model (roberta-base). The code first extracts word embeddings from each caption in the batch, using the last hidden state of the RoBERTa model. ...
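For reference, the two regimes differ only in which parameters receive gradient updates. A sketch with roberta-base and an illustrative linear head (not the program's actual head):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("roberta-base")
head = nn.Linear(encoder.config.hidden_size, 2)

# Feature extraction: freeze RoBERTa, train only the head.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Fine-tuning: unfreeze and train everything, usually at a smaller LR.
for p in encoder.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=2e-5
)
```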
1 vote
1 answer
197 views

What are special tokens used for in RoBERTa?

When I use this code: ...
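For context, RoBERTa's tokenizer inserts markers like <s> (start of sequence, playing the CLS role), </s> (end and separator), and <pad> (batch padding). A quick way to inspect them:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
ids = tok("Hello world", "Second segment")["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['<s>', 'Hello', 'Ġworld', '</s>', '</s>', 'Second', 'Ġsegment', '</s>']
print(tok.special_tokens_map)
# maps bos/eos/sep/cls/pad/unk/mask to <s>, </s>, <pad>, <unk>, <mask>
```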
0 votes
1 answer
57 views

Why was the learning rate decreased for RoBERTa compared to the LSTM?

I'm reading the codebase of a project that uses a Bidirectional-LSTM. Its learning rate is 0.02. Later, someone improved the project by replacing the LSTM with RoBERTa and decreasing the learning rate ...
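The usual reasoning: RoBERTa starts from pretrained weights, and large gradient steps would overwrite what pretraining learned, so fine-tuning rates are typically in the 1e-5 to 5e-5 range, while a from-scratch LSTM can tolerate 0.02. An illustrative contrast (the 2e-5 value is a conventional default, not the project's actual number):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

# Trained from scratch: random init, nothing to preserve.
lstm = nn.LSTM(input_size=300, hidden_size=256, bidirectional=True)
lstm_opt = torch.optim.Adam(lstm.parameters(), lr=0.02)

# Fine-tuned: small steps so pretrained knowledge survives.
roberta = AutoModel.from_pretrained("roberta-base")
roberta_opt = torch.optim.AdamW(roberta.parameters(), lr=2e-5)
```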
2 votes
1 answer
213 views

What do these terms mean in the context of RoBERTa?

When I read articles about RoBERTa, I often encounter the terms "transfer learning" and "fine-tuning". Additionally, they also mention "feature extraction". What are the ...
1 vote
0 answers
228 views

Why do the Llama 2 weights have eight different files?

I downloaded the weights for Llama 2 (70B-chat). This process created a folder titled "llama-2-70b-chat", which contained eight files named consolidated.00.pth, consolidated.01.pth, and so on ...
asked by jskattt797
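For reference, the eight consolidated.NN.pth files are model-parallel shards of a single checkpoint: Meta's 70B reference weights are split for 8-way tensor parallelism, so each file holds a slice of most weight matrices and all eight are needed to reconstruct the full model. A sketch of peeking at one shard (path assumed from the question):

```python
import glob
import torch

shards = sorted(glob.glob("llama-2-70b-chat/consolidated.*.pth"))
print(shards)  # consolidated.00.pth ... consolidated.07.pth

state = torch.load(shards[0], map_location="cpu")  # one shard only
for name, tensor in list(state.items())[:3]:
    print(name, tuple(tensor.shape))  # sharded dims are 1/8 of the full size
```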
0 votes
1 answer
111 views

What are the differences between an Embedding Layer and RoBERTa embeddings?

I'm reading an article about the Embedding Layer: The Embedding Layer learns word embeddings from raw text. It is initialized with small random numbers and can be learned simultaneously with a neural ...
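Mechanically both are lookup tables; the differences are the initialization (small random numbers vs pretrained weights) and that RoBERTa then contextualizes the looked-up vectors with its encoder stack. A small sketch:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

vocab_size, dim = 50265, 768
fresh = nn.Embedding(vocab_size, dim)        # randomly initialized
roberta = AutoModel.from_pretrained("roberta-base")
pretrained = roberta.get_input_embeddings()  # learned during pretraining

ids = torch.tensor([[0, 31414, 2]])          # <s> Hello </s> (roberta-base ids)
print(fresh(ids).shape, pretrained(ids).shape)  # both (1, 3, 768)
# Same lookup mechanics; the difference is where the weights come from,
# and that RoBERTa's full forward pass mixes in context afterwards.
```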
0 votes
2 answers
164 views

What are the differences between the contextual embeddings of a Bidirectional-LSTM and a Transformer?

A Transformer like RoBERTa can generate contextual embeddings using its encoder, similar to a Bidirectional-LSTM that concatenates hidden states. What are the differences between them? Are ...
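One concrete way to see both the similarity and the difference: each model produces one vector per token, but the biLSTM builds context from two directional scans while the transformer mixes all positions via self-attention in every layer. Dimensions below are illustrative:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

x = torch.randn(1, 7, 300)                     # (batch, seq_len, features)
bilstm = nn.LSTM(300, 384, bidirectional=True, batch_first=True)
lstm_out, _ = bilstm(x)
print(lstm_out.shape)                          # (1, 7, 768): 2 x 384 concatenated

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
batch = tok("a short example sentence", return_tensors="pt")
print(model(**batch).last_hidden_state.shape)  # (1, seq_len, 768)
```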
0 votes
1 answer
90 views

Questions about hidden states of bidirectional LSTMs

I read this in an article about bidirectional LSTMs: In a bidirectional LSTM, each word corresponds to two hidden states, one for each direction. Thus, we concatenate these two hidden states to ...
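A sketch of exactly the concatenation the article describes; with bidirectional=True, PyTorch already returns the two directions concatenated along the feature dimension:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=50,
               bidirectional=True, batch_first=True)
x = torch.randn(2, 9, 100)            # (batch, seq_len, features)
out, (h_n, c_n) = lstm(x)

print(out.shape)                      # (2, 9, 100): [forward_50 | backward_50]
forward, backward = out[..., :50], out[..., 50:]
# For word t, forward[:, t] summarizes words 0..t and backward[:, t]
# summarizes words t..end, so their concatenation sees both sides.
```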
2 votes
1 answer
1k views

What are the differences between BPE and byte-level BPE?

In RoBERTa, I'm not sure whether the model uses BPE or byte-level BPE tokenization. Are these techniques different or the same? Can someone explain? Thanks
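RoBERTa uses the GPT-2 byte-level BPE: the merge algorithm is ordinary BPE, but it operates on bytes rather than Unicode characters, so the base alphabet of 256 bytes covers any string and no input ever falls back to <unk>. A quick demonstration (the exact subword splits in the comments may vary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
print(tok.tokenize("unbelievable"))  # subword merges, e.g. ['un', 'bel', 'iev', 'able']
print(tok.tokenize("héllo 🤗"))      # rare characters split into byte-level
                                     # pieces instead of an unknown token
```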
0 votes
1 answer
92 views

Anomaly Detection in Log Data using LSTM

Problem Overview: I am currently working on a project involving anomaly detection in log data. The anomalies are defined by deviations from historical patterns. The log data has a simple structure: [...
asked by Raj
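One common recipe for this setup (in the spirit of DeepLog) is next-event prediction: train an LSTM on windows of normal log-event ids, then flag an event as anomalous when it falls outside the model's top-k predictions. A sketch with assumed sizes, not the asker's actual schema:

```python
import torch
import torch.nn as nn

NUM_EVENTS, WINDOW, K = 50, 10, 5  # assumed vocabulary/window/threshold

class NextEventLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(NUM_EVENTS, 32)
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.head = nn.Linear(64, NUM_EVENTS)

    def forward(self, x):            # x: (batch, WINDOW) event ids
        h, _ = self.lstm(self.emb(x))
        return self.head(h[:, -1])   # logits over the next event id

model = NextEventLSTM()              # train on normal logs first
window = torch.randint(0, NUM_EVENTS, (1, WINDOW))
next_event = torch.tensor([3])
topk = model(window).topk(K).indices
anomalous = next_event.item() not in topk[0].tolist()
```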
0 votes
0 answers
7 views

How can I label a dataset of text pairs so it serves as a universal benchmark for calculating the precision@k metric across different models?

I am facing a semantic search problem. I am fine-tuning different NLU models and I want to use precision@k as my main metric. Is it possible to label a dataset of text pairs to use it as a universal ...
asked by Ir8_mind
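Whatever labeling scheme is chosen, note that precision@k itself is model-agnostic: per query it only needs a ranked list from the model and a shared set of ids labeled relevant, which is what would make one labeled dataset reusable across models. A minimal sketch with hypothetical ids:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved ids that are labeled relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for i in top_k if i in relevant_ids) / k

ranked = ["d7", "d2", "d9", "d1", "d4"]       # model-specific ranking
relevant = {"d2", "d4", "d5"}                 # model-independent labels
print(precision_at_k(ranked, relevant, k=3))  # 1 hit in top 3 -> 0.333...
```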
1 vote
1 answer
63 views

Why do my validation loss and accuracy decay over epochs?

I'm trying to build two simple networks on a cleaned dataset for tweet sentiment classification (0/1): one with all dense layers (binary bag of words), the other with an RNN layer (embedding layer). But both ...
asked by emily
