You are partially correct. In CNNs, the output shape of each layer is determined by the number of filters used and by how those filters are applied (stride, padding, dilation, etc.).

## CNN shapes

In your example, your input is `30 x 30 x 3`. Assuming a stride of `1`, 'same' zero-padding (so the spatial size is preserved; with no padding a `5 x 5` filter would shrink it to `26 x 26`), and no dilation on the filter, you will get a spatial shape equal to your input, that is `30 x 30`. Regarding the depth: if you have `10` filters (each of shape `5 x 5 x 3`), you will end up with a `30 x 30 x 10` output at your first layer. Similarly, on the second layer, with `5` filters (each of shape `3 x 3 x 10`; note that the filter depth matches the depth of the previous layer's output) you get a `30 x 30 x 5` output. A single FC neuron then has as many weights as it has inputs (that is, `4500` weights) to form a linear combination of them.
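The shape arithmetic above can be checked with the standard output-size formula. This is a minimal sketch; `conv2d_out` is a helper name chosen here, and the padding values (`2` for the `5 x 5` filter, `1` for the `3 x 3` filter) are the 'same'-padding amounts this walkthrough assumes.

```python
def conv2d_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output spatial size of a convolution along one dimension."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1

# Layer 1: 30x30x3 input, ten 5x5x3 filters, 'same' padding (p = 2).
h1 = conv2d_out(30, kernel=5, padding=2)   # spatial size stays 30
depth1 = 10                                # one output channel per filter

# Layer 2: five 3x3x10 filters, 'same' padding (p = 1).
h2 = conv2d_out(h1, kernel=3, padding=1)   # spatial size stays 30
depth2 = 5

fc_inputs = h2 * h2 * depth2               # 30 * 30 * 5 = 4500 weights per FC neuron
print(h1, h2, fc_inputs)
```

Note that without padding the same formula gives `conv2d_out(30, kernel=5) == 26`, which is why the padding assumption matters for the `4500` figure.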

## CNN vs Convolution

Note that CNNs operate differently from the traditional signal-processing convolution. In a CNN, the convolution operation performs a dot product between the filter and an input patch across all channels, producing a single value per position (plus an optional bias), so the number of output channels equals the number of filters. A per-channel signal-processing convolution, in contrast, preserves the number of channels.

CNNs borrow the idea of a shifting kernel and a kernel response, but they do not apply a convolution operation per se: the kernel is not flipped, so what they actually compute is cross-correlation.
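The single-value response can be sketched in NumPy with random stand-in values (not the question's actual weights): the element-wise product is summed over height, width, *and* channels, yielding one scalar per position, and a true convolution would additionally flip the kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal((5, 5, 3))   # one 5x5 window of the RGB input
filt = rng.standard_normal((5, 5, 3))    # one filter spans all 3 channels
bias = 0.1

# CNN "convolution" at one position: element-wise product summed over
# height, width, AND channels -> a single scalar response.
response = np.sum(patch * filt) + bias

# A signal-processing convolution would flip the kernel first; CNNs skip
# the flip, so they actually compute cross-correlation.
flipped = filt[::-1, ::-1, :]
true_conv_response = np.sum(patch * flipped) + bias
```

Sliding the filter over every position of the padded input produces one `30 x 30` response map per filter, which is where the output depth comes from.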

## Operation over the RGB

The CNN is not operating on each channel separately. Each filter spans all input channels, so the per-channel responses are summed into a single value. The deeper you go, the more these channel responses get mixed with one another.
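This channel merging can be made explicit with a small NumPy check (random stand-in values): computing the response per channel and summing gives the same number as the full 3-D dot product.

```python
import numpy as np

rng = np.random.default_rng(1)
patch = rng.standard_normal((5, 5, 3))   # RGB patch
filt = rng.standard_normal((5, 5, 3))    # filter with one slice per channel

# Response per channel, then summed: the three channels are merged
# into one number rather than kept as separate outputs.
per_channel = [np.sum(patch[:, :, c] * filt[:, :, c]) for c in range(3)]
merged = sum(per_channel)

assert np.isclose(merged, np.sum(patch * filt))
```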

The output of a single FC neuron is just one value. If you want more outputs, you need to add more FC neurons, each producing its own linear combination of the inputs.
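Concretely, each FC neuron is a dot product against the flattened `4500`-element feature map; `k` neurons are just a `(k, 4500)` weight matrix. A sketch with random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(2)
features = rng.standard_normal(4500)   # flattened 30x30x5 feature map

# One FC neuron = one linear combination -> one output value.
w1 = rng.standard_normal(4500)
out1 = features @ w1                   # a single scalar

# For k outputs, stack k weight vectors into a (k, 4500) matrix.
W = rng.standard_normal((8, 4500))
outs = W @ features                    # shape (8,)
```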
