Questions tagged [transformer]
Use for questions related to the Transformer (encoder-decoder based) architecture in machine learning.
482 questions
0 votes · 0 answers · 9 views
Implementing pytorch temporal fusion transformer on time series
I am trying to run the Temporal Fusion Transformer from the PyTorch package, and to compare its output, on like terms, with the TensorFlow output on p. 15 of this paper: https://arxiv.org/pdf/1912....
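For reference, a minimal loading sketch, assuming the asker means the TemporalFusionTransformer from the pytorch-forecasting package; the data-frame layout and hyperparameters below are hypothetical:

```python
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer

# hypothetical panel data: two series, integer time index, one target column
df = pd.DataFrame({
    "time_idx": list(range(100)) * 2,
    "series_id": ["a"] * 100 + ["b"] * 100,
    "value": [float(i % 10) for i in range(200)],
})

dataset = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="value",
    group_ids=["series_id"],
    max_encoder_length=24,
    max_prediction_length=6,
    time_varying_unknown_reals=["value"],
)
# from_dataset infers the input sizes from the dataset definition
tft = TemporalFusionTransformer.from_dataset(dataset, hidden_size=16)
```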
0 votes · 0 answers · 18 views
Is it common for an LM (hundreds of millions of parameters) to beat an LLM (billions of parameters) on a binary classification task?
Preface
I am trying to fine-tune transformer-based models (an LM and an LLM). The LM I used is DeBERTa, and the LLM is LLaMA 3. The task is to classify whether a text contains condescending language ...
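As a point of reference, a minimal sketch of the LM side, assuming a HuggingFace setup; "microsoft/deberta-v3-base" is one published DeBERTa checkpoint, and the example text is made up:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2  # binary: condescending or not
)

inputs = tokenizer("Well, aren't you clever.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted class index (0 or 1)
```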
0 votes · 0 answers · 16 views
Training a transformer CNN for image output from scratch
I'm trying to train a Transformer-CNN model from scratch. The Transformer is comparable to ViViT's Model 2. The CNN takes the output of the second (temporal) transformer and is ...
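A rough sketch of the general transformer-into-CNN wiring, not the asker's actual model; every shape and layer size here is an assumption:

```python
import torch
import torch.nn as nn

class TransformerCNN(nn.Module):
    def __init__(self, dim=64, tokens=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cnn = nn.Sequential(
            nn.Conv2d(dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),  # single-channel image output
        )
        self.side = int(tokens ** 0.5)

    def forward(self, x):  # x: (batch, tokens, dim)
        x = self.encoder(x)
        # fold the token sequence back into a 2D grid for the CNN head
        x = x.transpose(1, 2).reshape(x.size(0), -1, self.side, self.side)
        return self.cnn(x)

out = TransformerCNN()(torch.randn(2, 256, 64))
print(out.shape)  # (2, 1, 16, 16)
```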
0 votes · 0 answers · 13 views
Do a transformer's embeddings self-organise the same way as word2vec embeddings?
Word2vec embeddings are well known for supporting vector arithmetic: King - Queen ≈ Man - Woman, or Germany - Berlin ≈ France - Paris.
When I first learned about transformers, one of ...
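For concreteness, this classic arithmetic can be reproduced with gensim's pretrained vectors; the GloVe checkpoint name below is just one available option:

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
# king - man + woman should land near queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```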
2 votes · 0 answers · 45 views
Transformer model conditional probability distribution of sub-sentences
I have a simple transformer model (decoder only) which is trained on some dataset containing sentences to do next-word prediction. The model captures a probability distribution $P_{\theta}(\mathbf{a})$...
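A minimal sketch of reading such a distribution out of a decoder-only model, using GPT-2 as a stand-in for the asker's model: the log-probability of a sentence is the sum of the per-token log-probabilities.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("the cat sat on the mat", return_tensors="pt").input_ids
with torch.no_grad():
    logits = lm(ids).logits
log_probs = logits[:, :-1].log_softmax(-1)            # position t predicts token t+1
token_lp = log_probs.gather(2, ids[:, 1:, None]).squeeze(-1)
print(token_lp.sum().item())                          # log P(sentence)
```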
1 vote · 0 answers · 30 views
Can Transformers predict periodic time series data?
I want to use Transformers to predict a noise-free periodic 2D signal $f(t)$. The signal has a period of $T=10$, and since there is no noise, future predictions can be made perfectly from the past 5 ...
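A minimal data-preparation sketch for this setup; the period matches the question, while the window sizes are assumptions:

```python
import numpy as np

t = np.arange(0, 200, 0.1)
signal = np.sin(2 * np.pi * t / 10)  # noise-free, period T = 10

ctx, horizon = 50, 10
n = len(signal) - ctx - horizon
X = np.stack([signal[i:i + ctx] for i in range(n)])               # context windows
y = np.stack([signal[i + ctx:i + ctx + horizon] for i in range(n)])  # targets
print(X.shape, y.shape)  # (n_windows, 50) (n_windows, 10)
```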
0 votes · 1 answer · 31 views
attentions not returned from transformers ViT model when using output_attentions=True
I'm using this code snippet from the docs of the HuggingFace ViT classification model, with one addition: I pass the output_attentions=True parameter. Nevertheless, ...
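For reference, a minimal sketch of the pattern described, using a published ViT checkpoint and a random tensor in place of a preprocessed image; when output_attentions=True reaches the forward call, outputs.attentions should be a tuple with one tensor per layer:

```python
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

pixels = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    outputs = model(pixel_values=pixels, output_attentions=True)
print(len(outputs.attentions))        # one tensor per layer
print(outputs.attentions[0].shape)    # (batch, heads, tokens, tokens)
```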
1 vote · 0 answers · 20 views
The real world implementations of RAG vs the methods explained in the paper
While building a RAG application we:
1. Encode the query
2. Retrieve k docs
3. Concatenate them before the query
4. Pass the entire thing to an LLM and it completes it for you (see the sketch below)
I do not think this is either of RAG-...
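A sketch of that naive pipeline as a single function; the embedder, index, and generator are placeholders for whatever stack is in use:

```python
def naive_rag(query, embed, index, llm, k=5):
    q_vec = embed(query)                          # 1. encode the query
    docs = index.search(q_vec, k)                 # 2. retrieve k docs
    prompt = "\n\n".join(docs) + "\n\n" + query   # 3. concatenate before the query
    return llm(prompt)                            # 4. let the LLM complete it
```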
0 votes · 0 answers · 18 views
How to interpret the token embeddings from decoders?
I am having trouble reasoning about the token embeddings that come out of masked (causal) attention compared to BERT's.
Let's say we have 5 tokens. The embedding of the first token will be used to predict the second token, ...
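A small sketch of the mechanism behind this: the causal mask sets attention scores to -inf above the diagonal, so token i can only attend to positions 0..i.

```python
import torch

n = 5
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = torch.randn(n, n).masked_fill(mask, float("-inf"))
attn = scores.softmax(-1)  # row i mixes only positions 0..i
print(attn)
```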
0 votes · 0 answers · 18 views
Apply Swin transformer to 1d arrays
My input features are 1D arrays of shape (1000,).
I can tokenize the arrays using tf.extract_patches
...
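One hedged sketch of the tokenization step: tf.signal.frame can cut a (1000,) array into fixed-size patches before a learned linear embedding; the patch size and embedding width below are assumptions.

```python
import tensorflow as tf

x = tf.random.normal([1000])
patches = tf.signal.frame(x, frame_length=20, frame_step=20)  # (50, 20) patches
tokens = tf.keras.layers.Dense(64)(patches)                   # (50, 64) patch embeddings
print(tokens.shape)
```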
0 votes · 0 answers · 14 views
How contextual embeddings learned during training a transformer are applied to the input sequence at inference time
I'm trying to understand contextual word embeddings better, and how they are applied at inference time.
Embeddings are learned as parameters while training a transformer. Are the ...
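A minimal sketch of the distinction, using BERT: the input-embedding table is a fixed learned lookup, while the contextual embeddings are recomputed by the forward pass for every new input.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tok("the bank of the river", return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
static = model.get_input_embeddings()(batch.input_ids)  # context-free lookup
contextual = out.last_hidden_state                      # depends on the whole sentence
print(static.shape, contextual.shape)
```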
0 votes · 0 answers · 16 views
In the Swin Transformer, is each token (pre-embedding) value an integer?
The Swin Transformer transforms the image into tokens that are input to the transformer.
Is each token value (before embedding) an integer?
In practice, where is this done? https://github.com/microsoft/Swin-...
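A sketch of the patch-embedding pattern Swin uses, a strided Conv2d (the 4x4 patch size and 96-dim embedding match the Swin-T configuration): the pixel values and the resulting token values are floats, not integers.

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 96, kernel_size=4, stride=4)  # 4x4 patches -> dim 96
img = torch.rand(1, 3, 224, 224)                          # float pixels in [0, 1]
tokens = patch_embed(img).flatten(2).transpose(1, 2)      # (1, 3136, 96)
print(tokens.dtype, tokens.shape)
```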
0 votes · 3 answers · 80 views
Why do we use similarity/cosine between Query and Key in attention?
Let's take an example sentence for translation:
I am going to my home and play with toy house.
For translating 'home', as per my understanding, the Query will be 'house'...
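For reference, a minimal sketch of scaled dot-product attention: the $QK^T$ dot products measure how similar each query is to each key, and the softmax turns those similarities into mixing weights over the values.

```python
import torch

d = 8
Q, K, V = (torch.randn(5, d) for _ in range(3))
scores = Q @ K.T / d ** 0.5   # similarity of every query to every key
weights = scores.softmax(-1)  # each row sums to 1
out = weights @ V             # weighted average of the values
print(out.shape)              # (5, 8)
```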
0 votes · 0 answers · 31 views
Instruction LLM for extracting data from text wrongly continues generating
I'm trying to fine-tune open sourced LLMs, for now let's stick with Mistral-7b-instruct model.
My task is as follows: I have emails that represent "price requests" for shipments sent by our ...
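A hedged sketch of one common mitigation, letting the model stop at its end-of-sequence token; the model name comes from the question, but the prompt and the stopping setup are assumptions:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "mistralai/Mistral-7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "[INST] Extract origin, destination and weight as JSON. [/INST]"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=128, eos_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```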
2 votes · 1 answer · 38 views
Practical Experiments on Self-Attention Mechanisms: QQ^T vs. QK^T
I'm currently exploring the self-attention mechanism used in models like Transformers, and I have a question about the necessity of using a separate key matrix (K) instead of just using the query ...
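A small numerical experiment on exactly this point: with K = Q the raw score matrix $QQ^T$ is symmetric, while a separate key projection makes "how much token i attends to token j" asymmetric.

```python
import torch

torch.manual_seed(0)
X = torch.randn(6, 16)
Wq, Wk = torch.randn(16, 16), torch.randn(16, 16)

Q, K = X @ Wq, X @ Wk
qq, qk = Q @ Q.T, Q @ K.T
print(torch.allclose(qq, qq.T))  # True: symmetric scores
print(torch.allclose(qk, qk.T))  # False: asymmetric scores
```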