All Questions
Tagged with neural-network nlp
179 questions
0 votes · 0 answers · 18 views
Is it common for an LM (hundreds of millions of parameters) to beat an LLM (billions of parameters) on a binary classification task?
Preface
I am trying to fine-tune transformer-based models (an LM and an LLM). The LM I used is DeBERTa, and the LLM is LLaMA 3. The task is to classify whether a text contains condescending language ...
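For readers wanting a concrete starting point, here is a minimal sketch of fine-tuning a DeBERTa-class LM for binary classification with the Hugging Face Trainer API; the checkpoint name, hyperparameters, and the `train_ds`/`eval_ds` datasets are placeholders, not the asker's setup:

```python
# A hedged sketch of binary-classification fine-tuning with `transformers`.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2)  # 2 labels = binary task

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

args = TrainingArguments(output_dir="out", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)

# train_ds / eval_ds are assumed to be `datasets.Dataset` objects with
# "text" and "label" columns, already mapped through `tokenize`:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```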
1 vote · 1 answer · 62 views
Improving GPU Utilization in LLM Inference System
I'm trying to build a distributed LLM inference platform with Hugging Face support. The implementation uses Python for model processing and Java for interfacing with external systems. ...
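One common first step for utilization problems is request batching, so a single forward pass serves several prompts. A minimal sketch assuming a Hugging Face causal LM (`gpt2` stands in for the real model; `pending_prompts` is hypothetical):

```python
# Batch several queued requests into one forward pass to raise GPU utilization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

pending_prompts = ["Hello, my name is", "The weather today is"]

# Tokenize the whole batch at once; padding aligns sequence lengths.
inputs = tokenizer(pending_prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```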
0 votes · 1 answer · 37 views
How do transformer-based architectures generate contextual embeddings?
How do transformer-based architectures, such as RoBERTa, generate contextual embeddings? The issue is that I haven't found any articles that explain this process.
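In short: self-attention mixes information from every token into every other token's vector, so the vectors in `last_hidden_state` are contextual. A small sketch assuming the `transformers` library:

```python
# One 768-d contextual vector per token from RoBERTa's encoder.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```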
0 votes · 1 answer · 59 views
Fine-tuning, feature extraction, or both with RoBERTa?
I'm reading a program that uses the pre-trained RoBERTa model (roberta-base). The code first extracts word embeddings from each caption in the batch, using the last hidden state of the RoBERTa model. ...
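A sketch of the two regimes the question contrasts, assuming PyTorch and `transformers` (the task head and learning rates are illustrative, not from the program being read):

```python
# Feature extraction vs. fine-tuning with a pretrained RoBERTa.
import torch
from transformers import RobertaModel

roberta = RobertaModel.from_pretrained("roberta-base")

# Feature extraction: freeze every pretrained weight.
for param in roberta.parameters():
    param.requires_grad = False

# A hypothetical task head trained on top of the frozen features.
head = torch.nn.Linear(roberta.config.hidden_size, 2)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Fine-tuning instead: skip the freezing loop and pass both parameter sets
# to the optimizer, typically with a much smaller learning rate (~2e-5).
```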
1 vote · 1 answer · 197 views
What are special tokens used for in RoBERTa?
When I use this code:
...
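Although the asker's snippet is elided, RoBERTa's special tokens can be inspected directly; a small sketch assuming the `transformers` library:

```python
# <s> and </s> mark sequence boundaries, <pad> fills batches to equal
# length, and <mask> is used during masked-language-model pretraining.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(tokenizer.special_tokens_map)  # bos/eos/sep/cls/pad/unk/mask tokens

ids = tokenizer("Hello world")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['<s>', 'Hello', 'Ġworld', '</s>'] -- boundaries added automatically
```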
0 votes · 1 answer · 57 views
Why was the learning rate decreased for RoBERTa compared to LSTM?
I'm reading the codebase of a project that uses a Bidirectional-LSTM. The learning rate for it is 0.02. Later, someone improved the project by replacing the LSTM with RoBERTa and decreased the learning rate ...
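Illustrative only (the numbers echo the question, not a recommendation): a pretrained transformer is typically fine-tuned with a far smaller learning rate than a model trained from scratch, since large updates would overwrite the pretrained weights:

```python
import torch
from transformers import RobertaModel

# Training from scratch tolerates a large step size.
lstm = torch.nn.LSTM(input_size=300, hidden_size=256, bidirectional=True)
opt_lstm = torch.optim.Adam(lstm.parameters(), lr=0.02)

# Fine-tuning pretrained weights usually uses ~1e-5 to 5e-5.
roberta = RobertaModel.from_pretrained("roberta-base")
opt_roberta = torch.optim.AdamW(roberta.parameters(), lr=2e-5)
```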
2 votes · 1 answer · 213 views
What do these terms mean in the context of RoBERTa?
When I read articles about RoBERTa, I often see the terms "transfer learning" and "fine-tuning"; they also mention "feature extraction". What are the ...
1 vote · 0 answers · 228 views
Why do the Llama 2 weights have eight different files?
I downloaded the weights for Llama 2 (70B-chat). This created a folder named "llama-2-70b-chat" containing 8 files: consolidated.00.pth, consolidated.01.pth, and so on ...
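In Meta's reference release, the 70B checkpoints are model-parallel shards (one per GPU rank, MP=8 for 70B), each holding a slice of the large weight matrices. A hedged inspection sketch, assuming PyTorch and enough RAM to load one shard:

```python
# Inspect the model-parallel shards of the 70B checkpoint.
import glob
import torch

shard_paths = sorted(glob.glob("llama-2-70b-chat/consolidated.*.pth"))
print(len(shard_paths))  # 8 -- one shard per model-parallel rank

# Loading a single shard on CPU just to list parameter names and shapes
# (each shard is still tens of GB, so this needs ample RAM).
shard0 = torch.load(shard_paths[0], map_location="cpu")
for name, tensor in list(shard0.items())[:5]:
    print(name, tuple(tensor.shape))
```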
0 votes · 1 answer · 111 views
What are the differences between an Embedding Layer and RoBERTa embeddings?
I'm reading an article about the Embedding Layer:
The Embedding Layer learns word embeddings from raw text. It is
initialized with small random numbers and can be learned
simultaneously with a neural ...
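The contrast in code, assuming PyTorch and `transformers`: a plain `nn.Embedding` starts from random numbers and gives one static vector per id, while RoBERTa's embedding table is pretrained and is only the input to an encoder that then produces contextual vectors:

```python
import torch
from transformers import RobertaModel

# Randomly initialized lookup table: 50265 ids -> 768-d vectors,
# learned jointly with whatever network sits on top of it.
scratch = torch.nn.Embedding(num_embeddings=50265, embedding_dim=768)

# RoBERTa's table is also an nn.Embedding, but with pretrained weights,
# and its output is further transformed by the self-attention encoder.
roberta = RobertaModel.from_pretrained("roberta-base")
pretrained = roberta.embeddings.word_embeddings
print(type(pretrained))
```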
0 votes · 2 answers · 164 views
What are the differences between the contextual embeddings of a Bidirectional-LSTM and a Transformer?
A Transformer like RoBERTa can generate contextual embeddings using its encoder, similar to a Bidirectional-LSTM that concatenates hidden states. What are the differences between them? Are ...
0 votes · 1 answer · 90 views
Questions about hidden states of bidirectional LSTMs
I read this in an article about bidirectional LSTM:
In bidirectional LSTM, each word corresponds to two hidden states, one
for each direction. Thus, we concatenate these two hidden states to
...
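A small PyTorch sketch of that concatenation: with `bidirectional=True` the output already stacks the forward and backward hidden states for every time step:

```python
import torch

lstm = torch.nn.LSTM(input_size=100, hidden_size=128,
                     bidirectional=True, batch_first=True)
x = torch.randn(1, 10, 100)          # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (1, 10, 256): forward 128 dims + backward 128 dims
print(h_n.shape)     # (2, 1, 128): final hidden state of each direction
```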
2 votes · 1 answer · 1k views
What are the differences between BPE and byte-level BPE?
In RoBERTa, I'm not sure whether the model uses BPE or byte-level BPE tokenization. Are these techniques different or the same? Can someone explain? Thanks
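They are related but not identical: byte-level BPE (used by RoBERTa and GPT-2) runs BPE merges over the 256 raw bytes, so any string tokenizes without `<unk>`, whereas classic character-level BPE can map unseen characters to an unknown token. A quick sketch assuming `transformers`:

```python
from transformers import AutoTokenizer

roberta_tok = AutoTokenizer.from_pretrained("roberta-base")
# Non-ASCII input is decomposed into byte-level symbols, never <unk>.
print(roberta_tok.tokenize("café 😀"))
```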
0 votes · 1 answer · 92 views
Anomaly Detection in Log Data using LSTM
Problem Overview:
I am currently working on a project involving anomaly detection in log data. The anomalies are defined by deviations from historical patterns. The log data has a simple structure: [...
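One common recipe for this setting (a sketch, not the asker's code) is to train an LSTM to predict the next log event id from a window of previous ids, then flag events the model assigns low probability, in the spirit of DeepLog:

```python
import torch

class NextEventLSTM(torch.nn.Module):
    """Predict the next log event id from a window of previous ids."""
    def __init__(self, num_event_types, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = torch.nn.Embedding(num_event_types, emb_dim)
        self.lstm = torch.nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = torch.nn.Linear(hidden, num_event_types)

    def forward(self, windows):            # windows: (batch, window_len)
        h, _ = self.lstm(self.emb(windows))
        return self.out(h[:, -1])          # logits for the next event

model = NextEventLSTM(num_event_types=50)
window = torch.randint(0, 50, (1, 10))     # dummy window of 10 event ids
probs = torch.softmax(model(window), dim=-1)
# Flag the observed next event as anomalous if its probability falls below
# a threshold tuned on normal (historical) data.
```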
0 votes · 0 answers · 7 views
How to label a dataset of text pairs so it can be used as a universal benchmark for calculating precision@k across different models?
I am facing a semantic search problem. I am fine-tuning different NLU models and I want to use precision@k as my main metric. Is it possible to label a dataset of text pairs to use it as a universal ...
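Computing the metric itself is straightforward once pairs are labeled; the harder part, as the question implies, is making labels model-agnostic (one common practice is to label a pool of candidates retrieved by several systems, so judgments don't favor any single model). A minimal sketch with hypothetical ids:

```python
def precision_at_k(ranked_candidates, relevant_set, k):
    """ranked_candidates: candidate ids sorted by model score, best first."""
    top_k = ranked_candidates[:k]
    return sum(1 for c in top_k if c in relevant_set) / k

ranked = ["d3", "d1", "d7", "d2", "d5"]   # hypothetical model ranking
relevant = {"d1", "d2"}                    # human-labeled relevant docs
print(precision_at_k(ranked, relevant, k=3))  # 1 hit in top 3 -> 0.333...
```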
1 vote · 1 answer · 63 views
Why do my validation loss and accuracy decay over epochs?
I'm trying to build two simple networks with a cleaned dataset for tweet sentiment classification (0/1):
one with all dense layers (binary bag of words),
another with an RNN layer (embedding layer).
But they both ...