Can multiple synaptic connections enhance MLPs?
It might be that having multiple synaptic connections between pairs of neurons plays a role in the mammalian brain’s ability to learn. So can you enhance an MLP by assigning multiple weights to each connection between a pair of adjacent layers?
Let’s see what happens. Call the two layers $L$ and $L'$, with $n$ neurons in $L$ and $m$ neurons in $L'$. If we give each connection between neuron $L_j$ and neuron $L'_i$ a set of $k$ weights, what does the matrix multiplication for this layer look like?
We can denote the outputs of layer $L$ as $O_1, \dots, O_n$. With a single weight per connection, the matrix multiplication used to look like
\[{\begin{bmatrix} w_{11} & \dots & w_{1n} \\ \vdots & & \vdots \\ w_{m1} & \dots & w_{mn} \end{bmatrix} \begin{bmatrix} O_{1} \\ \vdots \\ O_{n} \end{bmatrix} }\] where the $w_{ij}$ are the weights of layer $L'$. Now, since each connection between neuron $L'_i$ and neuron $L_j$ is assigned $k$ weights, each neuron $L'_i$ assigns $k$ weights to each input, so we can replace $w_{ij}$ in the matrix with a row of $k$ weights $w_{ij1}, w_{ij2}, \dots, w_{ijk}$. This gives an $m \times nk$ matrix, and to match it each $O_j$ is repeated $k$ times in the input vector. When the output of neuron $L'_i$ is computed, each $O_j$ is multiplied by each of the $k$ weights $w_{ij1}, w_{ij2}, \dots, w_{ijk}$ and the products are summed, so the matrix multiplication looks like $$ {\begin{bmatrix} (w_{111} \dots w_{11k}) & \dots & (w_{1n1} \dots w_{1nk}) \\ \vdots & & \vdots \\ (w_{m11} \dots w_{m1k}) & \dots & (w_{mn1} \dots w_{mnk}) \end{bmatrix} \begin{bmatrix} O_{1} \\ \vdots \\ O_{1} \\ \vdots \\ O_{n} \\ \vdots \\ O_{n} \end{bmatrix} } $$ In other words, when the output of neuron $L'_i$ is computed, each $O_j$ is multiplied by the sum $w_{ij1} + w_{ij2} + \dots + w_{ijk}$. If we let $W_{ij} = w_{ij1} + w_{ij2} + \dots + w_{ijk}$, the matrix multiplication becomes $$ {\begin{bmatrix} W_{11} & \dots & W_{1n} \\ \vdots & & \vdots \\ W_{m1} & \dots & W_{mn} \end{bmatrix} \begin{bmatrix} O_{1} \\ \vdots \\ O_{n} \end{bmatrix} } $$ So, in each connection between $L'_i$ and $L_j$, we’re just multiplying $O_j$ by the sum of the $k$ weights.
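To make the collapse concrete, here is a minimal sketch in plain Python that builds both forms of the product and compares them. The sizes `m`, `n`, `k`, the random values, and all variable names are assumptions chosen for illustration, and the layer has no bias or activation, as in the matrices above.

```python
import random

random.seed(0)
# Hypothetical sizes: m neurons in L', n neurons in L, k weights per connection.
m, n, k = 3, 4, 5

# w[i][j] is the list of k weights on the connection from L_j to L'_i.
w = [[[random.uniform(-1, 1) for _ in range(k)] for _ in range(n)] for _ in range(m)]
O = [random.uniform(-1, 1) for _ in range(n)]  # outputs of layer L

# Expanded form: an m x (n*k) matrix times the vector with each O_j repeated k times.
w_expanded = [[w[i][j][t] for j in range(n) for t in range(k)] for i in range(m)]
O_repeated = [O[j] for j in range(n) for _ in range(k)]
out_expanded = [sum(a * b for a, b in zip(row, O_repeated)) for row in w_expanded]

# Collapsed form: W_ij = w_ij1 + ... + w_ijk, an ordinary m x n matrix.
W = [[sum(w[i][j]) for j in range(n)] for i in range(m)]
out_collapsed = [sum(W[i][j] * O[j] for j in range(n)) for i in range(m)]

# The two outputs should differ only by floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(out_expanded, out_collapsed))
print(max_diff)
```

The two results agree up to rounding because the expanded product is the same sum of terms $w_{ijt} O_j$, just grouped differently.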
What are the new gradients of the loss with respect to the outputs of layer $L$? If we let the outputs of layer $L'$ be $O'_i$ and the gradient of the loss $l$ with respect to each $O'_i$ be $\frac{\partial l}{\partial O'_i}$, the matrix multiplication for the gradients with respect to the $O_j$ used to be
\[{\begin{bmatrix} w_{11} & \dots & w_{1n} \\ \vdots & & \vdots \\ w_{m1} & \dots & w_{mn} \end{bmatrix}^T \begin{bmatrix} \frac{\partial l}{\partial O'_1} \\ \vdots \\ \frac{\partial l}{\partial O'_m} \end{bmatrix} }\] Now, since each $\frac{\partial O'_i}{\partial O_j} = W_{ij}$, the matrix multiplication looks like $$ {\begin{bmatrix} W_{11} & \dots & W_{1n} \\ \vdots & & \vdots \\ W_{m1} & \dots & W_{mn} \end{bmatrix}^T \begin{bmatrix} \frac{\partial l}{\partial O'_1} \\ \vdots \\ \frac{\partial l}{\partial O'_m} \end{bmatrix} } $$ So in both the forward and backward passes, it looks like we’ve only replaced each $w_{ij}$ with $W_{ij}$.
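The backward-pass claim can be checked numerically. The sketch below compares the analytic gradient $W^T \frac{\partial l}{\partial O'}$ against central finite differences. The squared-error loss, the targets `y`, the sizes, and the helper names `forward` and `loss` are all assumptions made for this example; the argument in the text does not depend on a particular loss.

```python
import random

random.seed(1)
# Hypothetical sizes: m neurons in L', n neurons in L, k weights per connection.
m, n, k = 3, 4, 2

w = [[[random.uniform(-1, 1) for _ in range(k)] for _ in range(n)] for _ in range(m)]
O = [random.uniform(-1, 1) for _ in range(n)]  # outputs of layer L
y = [random.uniform(-1, 1) for _ in range(m)]  # hypothetical targets for the loss


def forward(inputs):
    """O'_i = sum over j and t of w_ijt * O_j (no bias or activation)."""
    return [
        sum(w[i][j][t] * inputs[j] for j in range(n) for t in range(k))
        for i in range(m)
    ]


def loss(inputs):
    """Assumed loss: half the squared error between O' and the targets."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(forward(inputs), y))


# Analytic gradient with respect to the inputs: W^T times dl/dO'.
W = [[sum(w[i][j]) for j in range(n)] for i in range(m)]
g_out = [o - t for o, t in zip(forward(O), y)]  # dl/dO'_i for this loss
grad_analytic = [sum(W[i][j] * g_out[i] for i in range(m)) for j in range(n)]

# Numerical gradient by central finite differences.
eps = 1e-6
grad_numeric = []
for j in range(n):
    plus = O[:]
    plus[j] += eps
    minus = O[:]
    minus[j] -= eps
    grad_numeric.append((loss(plus) - loss(minus)) / (2 * eps))

max_err = max(abs(a - b) for a, b in zip(grad_analytic, grad_numeric))
print(max_err)
```

Note that `forward` uses all $k$ weights individually, while the analytic gradient uses only the summed matrix `W`, so agreement between the two supports the replacement of $w_{ij}$ by $W_{ij}$.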
But what happens to the gradients of the loss with respect to the weights of layer $L'$? We can look again at the forward pass: $$ {\begin{bmatrix} (w_{111} \dots w_{11k}) & \dots & (w_{1n1} \dots w_{1nk}) \\ \vdots & & \vdots \\ (w_{m11} \dots w_{m1k}) & \dots & (w_{mn1} \dots w_{mnk}) \end{bmatrix} \begin{bmatrix} O_{1} \\ \vdots \\ O_{1} \\ \vdots \\ O_{n} \\ \vdots \\ O_{n} \end{bmatrix} } $$ For every $O_j$, the $k$ weights $w_{ij1}, w_{ij2}, \dots, w_{ijk}$ are involved only in the computation of $O'_i$, and each one multiplies $O_j$, so each of these $k$ weights has the same gradient, $\frac{\partial l}{\partial O'_i} \cdot O_j$. So, for any update rule that depends only on the gradients (plain gradient descent, for example), each of $w_{ij1}, \dots, w_{ijk}$ receives the same update, and $W_{ij}$ changes by $k$ times that update. Notice that we could have also seen this from the multiplication $$ {\begin{bmatrix} W_{11} & \dots & W_{1n} \\ \vdots & & \vdots \\ W_{m1} & \dots & W_{mn} \end{bmatrix} \begin{bmatrix} O_{1} \\ \vdots \\ O_{n} \end{bmatrix} } =\begin{bmatrix} O'_{1} \\ \vdots \\ O'_{m} \end{bmatrix} $$ The gradient $\frac{\partial l}{\partial W_{ij}}$ is also $\frac{\partial l}{\partial O'_i} \cdot O_j$! So no: giving a layer the freedom to assign $k$ weights to each of its inputs is equivalent to a layer that assigns a single weight, equal to the sum of the $k$ weights, to each input. The only difference is that each update to $W_{ij}$ is $k$ times as large, so we’ve effectively just scaled the learning rate by $k$.
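The learning-rate claim can be sketched with a small training loop. The code below trains a $k$-weight layer with plain gradient descent at learning rate `lr`, and a single-weight layer at learning rate `k * lr`, starting from the same summed weights. The squared-error loss, the random data, the sizes, and the use of plain gradient descent are assumptions for illustration only.

```python
import random

random.seed(2)
# Hypothetical sizes and hyperparameters.
m, n, k = 2, 3, 4
lr = 0.01
steps = 20

# k-weight layer: w[i][j] holds the k weights between L_j and L'_i.
w = [[[random.uniform(-1, 1) for _ in range(k)] for _ in range(n)] for _ in range(m)]
# Single-weight layer that starts out computing the same function.
W = [[sum(w[i][j]) for j in range(n)] for i in range(m)]

# Random (input, target) pairs, one per step; the loss is half the squared error.
data = [
    ([random.uniform(-1, 1) for _ in range(n)], [random.uniform(-1, 1) for _ in range(m)])
    for _ in range(steps)
]

for O, y in data:
    # k-weight layer: every one of the k weights gets the gradient dl/dO'_i * O_j.
    out_k = [sum(sum(w[i][j]) * O[j] for j in range(n)) for i in range(m)]
    for i in range(m):
        for j in range(n):
            g = (out_k[i] - y[i]) * O[j]
            for t in range(k):
                w[i][j][t] -= lr * g

    # Single-weight layer: same gradient, but learning rate scaled by k.
    out_1 = [sum(W[i][j] * O[j] for j in range(n)) for i in range(m)]
    for i in range(m):
        for j in range(n):
            W[i][j] -= (k * lr) * (out_1[i] - y[i]) * O[j]

# After training, the summed weights should match the single weights up to rounding.
max_gap = max(abs(sum(w[i][j]) - W[i][j]) for i in range(m) for j in range(n))
print(max_gap)
```

The sketch only covers plain gradient descent. Update rules that also depend on the weight values themselves would need a separate check.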