6
$\begingroup$

Regarding LSTM neural networks, I cannot understand the relationship between the batch size, the number of neurons in the input layer, and the number of "variables" or "columns" in the input. (I am assuming there is a relationship; despite seeing examples to the contrary, I cannot understand why there would be none.)

For the sake of clarity, I am going to use an example to formulate my query.

Let's assume that the dataset contains three columns of input and one column of output, so each row looks like:

input variable 1 | input variable 2 | input variable 3 | output variable 1

From what I understand, the input layer of the LSTM network has to have 3 neurons, one for each of the input variables; it cannot have fewer or more. I have, however, seen examples like this answer (which appears to be very well described, but which unfortunately I am not able to comprehend).

Now let us say that we have 50 rows of the above 4 columns.

To me, that essentially means we have 50 samples. Now, if the batch size is 5, how many input neurons do we have? Is the number of neurons in the input layer independent of the batch size? The way I understand batch size is that it is the number of samples the neural network will see before updating its weights. So, assuming we have only three neurons in the input layer, we would pass the first row of input variables, then the second row, and so on up to the fifth row, before updating the weights. How is that any different from just passing the fifth row?

$\endgroup$

2 Answers

7
$\begingroup$

In general, I think you've understood both concepts. I'll try to address each of them in more detail.

Input layer size

In Neural Networks (NNs), the size of the input layer is always equal to the number of variables (or features, as we usually call them in Machine Learning) in your data.
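For example, in Keras this constraint shows up as the input dimension of the first layer. Here is a minimal sketch with a made-up 3-feature dataset (the hidden layer size is an arbitrary choice, not fixed by anything):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# input_dim must equal the number of features in the data (3 here);
# the number of hidden units (8) is a free design choice.
model.add(Dense(8, input_dim=3, activation='relu'))
model.add(Dense(1))  # one output variable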

Batch size

This refers to how many samples of data the network will see before updating its weights.

Normally, when performing gradient descent (i.e. the algorithm that trains the NN), the network would be fed all of the samples at once in order to calculate the loss and update the weights. However, especially in Deep Neural Networks (DNNs), this is infeasible: the dataset is too big, and even if the network had enough memory, computing the gradients over the full dataset would be inefficient. What we do instead is pass a few samples (called a batch) at a time and train the model on them. Then another batch is passed, and so on. The number of samples we input to the model at each iteration is called the batch size.
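As a rough sketch of what one epoch of mini-batch training looks like (note: model.loss, model.gradients, and model.apply_updates below are hypothetical placeholders to show the flow, not a real framework API):

import numpy as np

def train_one_epoch(X, y, model, batch_size):
    indices = np.random.permutation(len(X))      # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        loss = model.loss(X[batch], y[batch])    # loss over the whole batch
        grads = model.gradients(loss)            # one gradient computation per batch
        model.apply_updates(grads)               # one weight update per batch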

Relationship between the two

In theory, if you have many features (in the thousands), you can't use a very large batch size, and the more features you have, the smaller the batch size you can afford. However, in practice I have found the two to be practically unrelated! In Deep Neural Networks, the batch size is primarily governed by the size of the model. An easy (but not always accurate) way of estimating this size is through the number of trainable parameters in the model. More parameters mean more memory for the model (and longer gradient computations), leading to a smaller batch size. In DNNs, which have multiple layers, only a fraction of the parameters are those connected to the input layer. Thus, the number of features plays a minor part in the selection of the batch size.
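As a quick back-of-the-envelope illustration (the layer sizes here are made up): a fully-connected layer with n_in inputs and n_out outputs has (n_in + 1) * n_out parameters (weights plus biases), so:

layer_sizes = [3, 128, 128, 1]   # 3 input features, two hidden layers, 1 output
params = [(n_in + 1) * n_out
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
print(params)       # [512, 16512, 129]
print(sum(params))  # 17153 -- the input-facing layer holds only ~3% of the total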

Example

In your example you say you have 50 rows and 4 columns, out of which 3 are the input variables.

The number of neurons in the input layer will be 3, to match the number of input variables (or features). This we cannot change, even if we wanted to!

The batch size is something we can change. Let's say we use a batch size of 5.

1st iteration:
In the first iteration, the NN will be given the first 5 rows of data. For each row, the first three columns will be passed to the first layer of the network, which will calculate an output. After all 5 outputs are generated, they will be compared with the first 5 rows of the 4th column to calculate a loss. Using this loss, the network will compute the gradients and perform the weight updates. Note that this loss corresponds to all of the first 5 samples, not just the 5th.

2nd iteration:
Rows 6-10 will be passed to the network, which will repeat the same procedure as in the first iteration.

10th iteration: Rows 46-50 will be passed to the network. This marks the point where the NN has seen every sample of the data once; in the context of Machine Learning, we call this an epoch. At this point, the data are shuffled and the 1st iteration of the 2nd epoch begins.
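To make the above concrete, here is a minimal runnable Keras sketch with random stand-in data (50 rows, 3 input variables, 1 output; the layer sizes and optimizer are arbitrary choices for illustration):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(50, 3)   # 50 samples, 3 input variables
y = np.random.rand(50, 1)   # 50 samples, 1 output variable

model = Sequential()
model.add(Dense(8, input_dim=3, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='sgd', loss='mse')

# batch_size=5 -> 50 / 5 = 10 weight updates (iterations) per epoch;
# shuffle=True reshuffles the rows between epochs, as described above.
model.fit(X, y, batch_size=5, epochs=2, shuffle=True)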

Note:

In most frameworks, the samples in each batch are processed by the network in parallel rather than sequentially. This makes no difference to the result; I just wanted to point it out.

$\endgroup$
  • 2
$\begingroup$ Thank you for a very comprehensive answer. I have marked the question as resolved. I wanted to vote up, but since I have less than 15 points, it is not allowing me to vote. The system should at least allow me to vote on the question that I had put in. I'm sure it took you at least 20 minutes to type this out, if not more, and I was hoping to show my gratitude by voting for your answer. Thank you anyway. Also, thank you for pointing out that the batches are processed in parallel. $\endgroup$
    – Ramana
    Commented Aug 9, 2018 at 5:23
  • $\begingroup$ You're very welcome, I'm glad I could help :) $\endgroup$
    – Djib2011
    Commented Aug 9, 2018 at 11:20
1
$\begingroup$

I do NOT agree with @Djib2011's statement that "The number of neurons in the input layer will be 3, to match the number of input variables (or features). This we cannot change, even if we wanted to!" The number of neurons, also called units or nodes, is a completely free parameter that is up to you to specify. It does not have to match the number of features. It does, however, determine the dimensionality of the layer's output.

Check the example here https://keras.io/layers/core/, in which the first layer is

model.add(Dense(32, input_shape=(16,)))

Here 32 is the number of neurons and 16 is the number of features. Another hidden layer:

model.add(Dense(32))

where 32 is again the number of neurons; in Keras you do not need to specify the input dimension (i.e. the output of the previous layer) after the first hidden layer (p.s. the input layer is not a real layer but a tensor).
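To see the distinction explicitly, here is the same idea written with Keras' functional API (layer sizes are arbitrary), where the input tensor is declared separately from the first trainable layer:

from keras.models import Model
from keras.layers import Input, Dense

inputs = Input(shape=(16,))                    # the input "layer": a tensor, no parameters
hidden = Dense(32, activation='relu')(inputs)  # first real (trainable) layer: 32 units, a free choice
outputs = Dense(1, activation='sigmoid')(hidden)
model = Model(inputs=inputs, outputs=outputs)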

Also please check this example https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/, in which the author mentioned:

We can piece it all together by adding each layer:

  • The model expects rows of data with 8 variables (the input_dim=8 argument).
  • The first hidden layer has 12 nodes and uses the relu activation function.
  • The second hidden layer has 8 nodes and uses the relu activation function.
  • The output layer has one node and uses the sigmoid activation function.
from keras.models import Sequential
from keras.layers import Dense

# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))  # first hidden layer: 12 nodes, 8 input features
model.add(Dense(8, activation='relu'))                # second hidden layer: 8 nodes
model.add(Dense(1, activation='sigmoid'))             # output layer: 1 node
$\endgroup$
  • $\begingroup$ Though, in your example, your input layer has 16 neurons, not 32. You are confusing the input layer with the first hidden layer. The input layer isn't a trainable layer (meaning that it has no parameters); it is just there to provide an input to the network, and thus we can't change its size. In your example, Keras conveniently allows you to bypass the input layer by adding the input_shape parameter to your first layer. If you were using Keras' functional API, you'd have to specify an Input() layer explicitly. $\endgroup$
    – Djib2011
    Commented Oct 18, 2019 at 21:41
  • $\begingroup$ Hi @Djib2011, thank you for your comment. As I mentioned, the input layer is not an actual layer because no computations are performed in this "layer". It does not even count toward the depth of a DL model. This layer just passes data into the first real layer -- the first hidden layer. In a neuron, some "reaction" is expected; in my opinion, in DL that "reaction" is the computation, which requires introducing and tuning hyperparameters. Apparently the "first layer" does not have that function. 16 is the number of features; maybe you can call those 16 nodes, but not neurons. $\endgroup$ Commented Oct 21, 2019 at 16:55
  • $\begingroup$ Yes, technically it isn't a real layer because it isn't trained, but conventionally it is considered a "layer". For example, from the wiki page for Neural Networks: "Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times." You can find the same convention in virtually any other Neural Network example. $\endgroup$
    – Djib2011
    Commented Oct 21, 2019 at 18:57
  • $\begingroup$ Thanks @Djib2011. I don't think we have a real disagreement on this question. My confusion came from the Keras sequential model's first-layer setup, as in the example I provided above: model.add(Dense(32, input_shape=(16,))). Can we consider it as a combination of the input layer and the first hidden layer? $\endgroup$ Commented Jan 3, 2020 at 3:16
  • $\begingroup$ Yes, exactly. The command model.add(Dense(32, input_shape=(16,))) is equivalent to creating both the input and the hidden layers. $\endgroup$
    – Djib2011
    Commented Jan 3, 2020 at 10:48
