## Abstract

We show how a Hopfield network with modifiable recurrent connections undergoing slow Hebbian learning can extract the underlying geometry of an input space. First, we use a slow-fast analysis to derive an averaged system whose dynamics derives from an energy function and therefore always converges to equilibrium points. The equilibria reflect the correlation structure of the inputs, a global object extracted through local recurrent interactions only. Second, we use numerical methods to illustrate how learning extracts the hidden geometrical structure of the inputs. Indeed, multidimensional scaling methods make it possible to project the final connectivity matrix onto a Euclidean distance matrix in a high-dimensional space, with the neurons labeled by spatial position within this space. The resulting network structure turns out to be roughly convolutional. The residual of the projection defines the nonconvolutional part of the connectivity, which is minimized in the process. Third, we show how restricting the dimension of the space where the neurons live gives rise to patterns similar to cortical maps. We motivate this restriction using an energy efficiency argument based on wire length minimization. Finally, we show how this approach leads to the emergence of ocular dominance and orientation columns in primary visual cortex via the self-organization of recurrent rather than feedforward connections. In addition, we establish that the nonconvolutional (or long-range) connectivity is patchy and co-aligned in the case of orientation learning.

## 1. Introduction

Activity-dependent synaptic plasticity is generally thought to be the basic cellular substrate underlying learning and memory in the brain. In 1949, Donald Hebb postulated that learning is based on the correlated activity of synaptically connected neurons: if neurons A and B are active at the same time, then the synapses from A to B and from B to A should be strengthened in proportion to the product of the activities of A and B. However, as it stands, Hebb's learning rule diverges. Therefore, various modifications of Hebb's rule have been developed, which basically take one of three forms (see Gerstner & Kistler, 2002, and Dayan & Abbott, 2001). First, a decay term can be added to the learning rule so that each synaptic weight is able to “forget” what it previously learned. Second, each synaptic modification can be normalized or projected onto different subspaces. These constraint-based rules may be interpreted as implementing some form of competition for energy between dendrites and axons (for details, see Miller, 1996; Miller & MacKay, 1996; Ooyen, 2001). Third, a sliding-threshold mechanism can be added to Hebbian learning. For instance, a postsynaptic threshold rule multiplies the presynaptic activity by the deviation of the postsynaptic activity from its average value, which is referred to as covariance learning (see Sejnowski & Tesauro, 1989). Probably the best known of these rules is the BCM rule presented in Bienenstock, Cooper, and Munro (1982). It should be noted that history-based rules can also be defined without changing the qualitative dynamics of the system. Instead of considering the instantaneous value of the neurons’ activity, these rules consider its weighted mean over a time window (see Földiák, 1991; Wallis & Baddeley, 1997). Recent experimental evidence suggests that learning may also depend on the precise timing of action potentials (see Bi & Poo, 2001).
Contrary to most Hebbian rules that detect only correlations, these rules can also encode causal relationships in the patterns of neural activation. However, the mathematical treatment of these spike-timing-dependent rules is much more difficult than rate-based ones.

Hebbian-like learning rules have often been studied within the framework of unsupervised feedforward neural networks (see Oja, 1982; Bienenstock et al., 1982; Miller & MacKay, 1996; Dayan & Abbott, 2001). They also form the basis of most weight-based models of cortical development, assuming fixed lateral connectivity (e.g., Mexican hat) and modifiable vertical connections (see the review of Swindale, 1996).^{1} In these developmental models, the statistical structure of input correlations provides a mechanism for spontaneously breaking some underlying symmetry of the neuronal receptive fields, leading to the emergence of feature selectivity. When such correlations are combined with fixed intracortical interactions, there is a simultaneous breaking of translation symmetry across the cortex, leading to the formation of a spatially periodic cortical feature map. A related mathematical formulation of cortical map formation has been developed in Takeuchi and Amari (1979) and Bressloff (2005) using the theory of self-organizing neural fields. Although very irregular, the two-dimensional cortical maps observed at a given stage of development can be unfolded in higher dimensions to yield smoother geometrical structures. Indeed, Bressloff, Cowan, Golubitsky, Thomas, and Wiener (2001) suggested that the network of orientation pinwheels in V1 is a direct product between a circle for orientation preference and a plane for position, based on a modification of the ice cube model of Hubel and Wiesel (1977). From a more abstract geometrical perspective, Petitot (2003) has associated such a structure with a 1-jet space and used this to develop some applications to computer vision. More recently, more complex geometrical structures, such as spheres and hyperbolic surfaces, that incorporate additional stimulus features such as spatial frequency and texture were considered in Bressloff and Cowan (2003) and Chossat and Faugeras (2009), respectively.

In this letter, we show how geometrical structures related to the distribution of inputs can emerge through unsupervised Hebbian learning applied to recurrent connections in a rate-based Hopfield network. Throughout this letter, the inputs are presented as an external nonautonomous forcing to the system and not as an initial condition, as is often the case in Hopfield networks. It has previously been shown that in the case of a single fixed input, there exists an energy function that describes the joint gradient dynamics of the activity and weight variables (see Dong & Hopfield, 1992). This implies that the system converges to an equilibrium during learning. We use averaging theory to generalize the above result to the case of multiple inputs, under the adiabatic assumption that Hebbian learning occurs on a much slower timescale than both the activity dynamics and the sampling of the input distribution. We then show that the equilibrium distribution of weights, when embedded into $\mathbb{R}^k$ for a sufficiently large integer *k*, encodes the geometrical structure of the inputs. Finally, we show numerically that the embedding of the weights in two dimensions (*k*=2) gives rise to patterns that are qualitatively similar to experimentally observed cortical maps, with the emergence of feature columns and patchy connectivity. In contrast to standard developmental models, cortical map formation arises via the self-organization of recurrent connections rather than feedforward connections from an input layer. Although the mathematical formalism we introduce here could be extended to most of the rate-based Hebbian rules in the literature, we present the theory for Hebbian learning with decay because of the simplicity of the resulting dynamics.

The use of geometrical objects to describe the emergence of connectivity patterns has previously been proposed by Amari in a different context. Based on the theory of information geometry, Amari considers the geometry of the set of all networks and defines learning as a trajectory on this manifold, for perceptron networks in the framework of supervised learning (see Amari, 1998) or for unsupervised Boltzmann machines (see Amari, Kurata, & Nagaoka, 1992). He uses differential and Riemannian geometry to describe an object that lives at a larger scale than the cortical maps considered here. Moreover, Zucker and colleagues are currently developing a nonlinear dimensionality-reduction approach to characterize the statistics of natural visual stimuli (see Zucker, Lawlor, & Holtmann-Rice, 2011; Coifman, Maggioni, Zucker, & Kevrekidis, 2005). Although they do not use learning neural networks and stay closer to the field of computer vision than this letter does, their approach turns out to be similar to the geometrical embedding approach we use.

The structure of the letter is as follows. In section 2, we apply mathematical methods to analyze the behavior of a rate-based learning network. We formally introduce a nonautonomous model, which we then average. This allows us to study the stability of the learning dynamics in the presence of multiple inputs by constructing an appropriate energy function. In section 3, we determine the geometrical structure of the equilibrium weight distribution and show how it reflects the structure of the inputs. We also relate this approach to the emergence of cortical maps. Finally, the results are discussed in section 4.

## 2. Analytical Treatment of a Hopfield-Type Learning Network

### 2.1. Model.

#### 2.1.1. Neural Network Evolution.

The network consists of *N* interconnected neural masses. Neural mass *i* is described by its membrane potential *V _{i}*(*t*) at time *t*. The instantaneous population firing rate is linked to the membrane potential through the relation $\nu_i(t) = s(V_i(t))$, where *s* is a smooth sigmoid function. In the following, we choose

$$ s(v) = \frac{S_m}{1 + e^{-\lambda(v - \theta)}}, $$

where $S_m$, $\lambda$, and $\theta$ are, respectively, the maximal firing rate, the maximal slope, and the offset of the sigmoid.

The membrane potential evolves according to

$$ \tau \frac{dV_i}{dt} = -V_i(t) + \sum_{j=1}^{N} W_{ij}(t)\, s\big(V_j(t)\big) + I_i(t), $$

where the first term is a leak with time constant $\tau$ and the second term sums the recurrent inputs to neural mass *i*, with *W _{ij}*(*t*) the effective synaptic weight from neural mass *j* to neural mass *i*. The synaptic weights are time dependent because they evolve according to a continuous-time Hebbian learning rule (see below). The third term, *I _{i}*(*t*), corresponds to an external input to neural mass *i*, such as information extracted by the retina or thalamocortical connections. We take the inputs to be piecewise constant in time; at regular time intervals, a new input is presented to the network. In this letter, we assume that the inputs are chosen by periodically cycling through a given set of *M* inputs. An alternative approach would be to randomly select each input from a given probability distribution (see Geman, 1979). It is convenient to introduce vector notation by representing the time-dependent membrane potentials by $V(t) \in \mathbb{R}^N$, the time-dependent external inputs by $I(t) \in \mathbb{R}^N$, and the time-dependent network weight matrix by $W(t) \in \mathbb{R}^{N \times N}$. We can then rewrite the above system of ordinary differential equations as a single vector-valued equation,

$$ \tau \dot{V} = -V + W \cdot S(V) + I(t), $$

where $S$ corresponds to the term-by-term application of the sigmoid *s*: $S(V)_i = s(V_i)$.

#### 2.1.2. Correlation-Based Hebbian Learning.

The synaptic weights evolve according to a Hebbian rule with decay,

$$ \tau_w \dot{W} = -W + S(V) \otimes S(V), \qquad \big(S(V) \otimes S(V)\big)_{ij} = s(V_i)\, s(V_j), $$

where *S* is treated as a mapping from $\mathbb{R}^N$ to $\mathbb{R}^N$ and $\tau_w \gg \tau$ is the learning time constant. The tensor product implements Hebb's rule that synaptic modifications involve the product of postsynaptic and presynaptic firing rates. We can then rewrite the combined voltage and weight dynamics as the following nonautonomous (due to the time-dependent inputs) dynamical system:

$$ \begin{cases} \tau \dot{V} = -V + W \cdot S(V) + I(t), \\ \tau_w \dot{W} = -W + S(V) \otimes S(V). \end{cases} $$
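To make the model concrete, the following minimal sketch integrates the coupled voltage and weight dynamics with forward Euler while cycling through piecewise-constant inputs. All numerical values (network size, sigmoid constants, timescales, step size) are illustrative assumptions, not parameters taken from the letter.

```python
import numpy as np

# Illustrative parameters (assumptions, not the letter's values).
N, M = 10, 10                    # neurons, inputs
tau, tau_w = 1.0, 1e3            # fast activity vs. slow learning timescales
S_m, lam, theta = 1.0, 4.0, 0.5  # sigmoid: max rate, slope, offset

def S(V):
    """Term-by-term sigmoid, S(V)_i = s(V_i)."""
    return S_m / (1.0 + np.exp(-lam * (V - theta)))

rng = np.random.default_rng(0)
inputs = rng.uniform(0.0, 1.0, size=(M, N))  # M piecewise-constant inputs

V = np.zeros(N)
W = np.zeros((N, N))
dt, T_per_input = 0.02, 20.0                 # presentation time per input
steps = int(T_per_input / dt)

for epoch in range(5):            # cycle several times through the input set
    for a in range(M):
        I = inputs[a]
        for _ in range(steps):    # forward Euler on the coupled system
            F = S(V)
            V = V + dt / tau * (-V + W @ F + I)
            W = W + dt / tau_w * (-W + np.outer(F, F))
```

Because the Hebbian term `np.outer(F, F)` is symmetric and `W` starts at zero, the weights remain symmetric throughout, which anticipates the convergence to symmetric connectivity discussed in section 2.3.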

Let us make a few remarks about the existence and uniqueness of solutions. First, boundedness of *S* implies boundedness of the solutions: if *I* is bounded, the solutions are bounded. To prove this, note that the right-hand side of the equation for *W* is the sum of a bounded term and a linear decay term in *W*. Therefore, *W* is bounded, and hence the recurrent term $W \cdot S(V)$ is also bounded. The same reasoning applies to *V*. Since *S* is Lipschitz continuous, the right-hand side of the system is Lipschitz. This is sufficient to prove the existence and uniqueness of the solution by applying the Cauchy-Lipschitz theorem. In the following, we derive an averaged autonomous dynamical system, which is well defined for the same reasons.

### 2.2. Averaging the System.

The full system is nonautonomous and difficult to analyze because the inputs change periodically. It has already been studied in the case of a single input (see Dong & Hopfield, 1992), but it remains to be analyzed in the case of multiple inputs. We show in section A.1 that this system can be approximated by an autonomous Cauchy problem, which is much more convenient to handle. This averaging method exploits the multiple timescales in the system.

Recall that the inputs are presented by periodically cycling through *M* fixed inputs, with the period of cycling given by *T*; it follows that *I* is *T*-periodic and piecewise constant. We assume that the presentation time of each input is much longer than the membrane time constant, $\tau \ll T/M$, so that the activity equilibrates for each input. Finally, we assume that the period *T* is small compared to the timescale of the learning dynamics, $T \ll \tau_w$. In section A.1, we simplify the system by applying Tikhonov's theorem for slow and fast systems and then classical averaging methods for periodic systems. This leads to an averaged system that is a good approximation of the original one in this asymptotic regime.

Denote the *M* inputs by $I^{(a)}$, $a = 1, \dots, M$, and denote by $V^{(a)}$ the fixed-point solution of the equation $V^{(a)} = W \cdot S(V^{(a)}) + I^{(a)}$. The averaged dynamics of the weights then reads

$$ \tau_w \dot{W} = -W + \frac{1}{M} \sum_{a=1}^{M} S\big(V^{(a)}\big) \otimes S\big(V^{(a)}\big). $$

To illustrate this approximation, we simulate a simple network with both the exact (nonautonomous) and the averaged evolution equations. For these simulations, the network consists of *N*=10 fully connected neurons and is presented with *M*=10 different random inputs taken uniformly in the interval [0, 1]^{N}. Figure 1 (left) shows the percentage of error between the final connectivities for different values of $\tau_w$ and *T*/*M*. Figure 1 (right) shows the temporal evolution of the norm of the connectivity for both the exact and the averaged system for *T* = 10^{3}.
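Under the adiabatic assumptions above, the averaged dynamics can be sketched as follows: for each input, the activity is relaxed to its fixed point, and the weights follow the input-averaged Hebbian term. The small sigmoid amplitude below enforces the weak-coupling regime in which the fixed-point iteration contracts; all numerical values are illustrative assumptions.

```python
import numpy as np

# Weak-coupling regime (small S_m) so the fixed-point iteration contracts.
S_m, lam, theta = 0.2, 4.0, 0.5
s = lambda V: S_m / (1.0 + np.exp(-lam * (V - theta)))

def fast_equilibrium(W, I, iters=60):
    """Damped relaxation toward the fixed point V = W.S(V) + I."""
    V = I.copy()
    for _ in range(iters):
        V = 0.5 * V + 0.5 * (W @ s(V) + I)
    return V

def averaged_step(W, inputs, dt_over_tau_w=0.1):
    """One Euler step of tau_w dW/dt = -W + (1/M) sum_a S(V^a) S(V^a)^T."""
    C = np.zeros_like(W)
    for I in inputs:
        F = s(fast_equilibrium(W, I))
        C += np.outer(F, F)
    C /= len(inputs)
    return W + dt_over_tau_w * (-W + C)

rng = np.random.default_rng(1)
N, M = 10, 10
inputs = rng.uniform(0.0, 1.0, size=(M, N))
W = np.zeros((N, N))
for _ in range(100):              # evolve the averaged weights
    W = averaged_step(W, inputs)
```

At convergence, `W` approaches the input-averaged correlation of the equilibrium firing rates, illustrating the claim of section 2.4 that the network stores the correlation structure of the inputs.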

In the remainder of the letter, we focus on the averaged system, whose solutions are close to those of the original system provided condition A.2 in the appendix is satisfied, that is, the network is weakly connected. Finally, note that it is straightforward to extend our approach to time-functional rules (e.g., sliding-threshold or BCM rules as described in Bienenstock et al., 1982), which, in this new framework, would be approximated by simple ordinary differential equations (as opposed to time-functional differential equations) provided *S* is redefined appropriately.

### 2.3. Stability.

#### 2.3.1. Lyapunov Function.

For a single input (*M*=1), the exact and averaged systems are equivalent and reduce to the neural network with adapting synapses previously analyzed by Dong and Hopfield (1992). Under the additional constraint that the weights are symmetric ($W_{ij} = W_{ji}$), these authors showed that the simultaneous evolution of the neuronal activity variables and the synaptic weights can be reexpressed as a gradient dynamical system that minimizes a Lyapunov or energy function of state. We can generalize their analysis to the case of multiple inputs (*M*>1) and nonsymmetric weights using the averaged system: following along lines similar to Dong and Hopfield (1992), we introduce an energy function *E* that averages the Dong-Hopfield energy over the *M* inputs and includes a quadratic weight-decay term. In contrast to Dong and Hopfield (1992), we do not require a priori that the weight matrix be symmetric. However, it can be shown that the system always converges to a symmetric connectivity pattern; more precisely, the set of symmetric matrices is an attractor of the averaged system. A proof can be found in section A.2. It can then be shown that, on the set of symmetric weights, *E* is a Lyapunov function of the averaged dynamical system, that is, $\dot{E} \le 0$ along trajectories. The boundedness of *E* and the Krasovskii-LaSalle invariance principle then imply that the system converges to an equilibrium (see Khalil & Grizzle, 1996). We thus have:

**Theorem.** The initial value problem for the averaged system, with symmetric initial connectivity, converges to an equilibrium state.

*Proof.* See section A.3.

It follows that neither oscillatory nor chaotic attractor dynamics can occur.

#### 2.3.2. Linear Stability.

Although we have shown that the system always converges to an equilibrium, not all fixed points are stable. However, a linear stability analysis of the averaged system yields a simple sufficient condition for a fixed point to be stable; the method used in the proof could be extended to more complex rules.

See section A.4.

This condition is strikingly similar to that derived in Faugeras, Grimbert, and Slotine (2008) (in fact, it is stronger than the contraction condition they find). It suggests that the network converges to a weakly connected regime. It also says that the dynamics of the averaged system is likely (the condition is only sufficient) to be contracting and therefore subject to no bifurcations: a fully recurrent learning neural network is likely to have simple dynamics.

### 2.4. Equilibrium Points.

The equilibria of the averaged system satisfy $W = \frac{1}{M} \sum_{a} S(V^{(a)}) \otimes S(V^{(a)})$, with $V^{(a)} = W \cdot S(V^{(a)}) + I^{(a)}$. Note that even if the inputs vanish (*I*=0), the equilibrium condition implies that the activity is nonzero, because the sigmoid satisfies *s*(0)>0; that is, there is spontaneous activity. Combining these observations, we see that the network roughly extracts and stores the correlation matrix of the strongest inputs within the weights of the network.

## 3. Geometrical Structure of Equilibrium Points

### 3.1. From a Symmetric Connectivity Matrix to a Convolutional Network.

Our aim is to embed the equilibrium network in a *k*-dimensional space. This is quite natural since the equilibrium connectivity matrix is symmetric and has positive coefficients, properties shared with a Euclidean distance matrix. More specifically, we want to find an integer *k* and *N* points in $\mathbb{R}^k$, denoted by $x_1, \dots, x_N$, so that the connectivity can roughly be written as $W_{ij} = g(\|x_i - x_j\|)$, where *g* is a positive decreasing real function. If we manage to do so, then the interaction term in the network equations becomes

$$ \sum_{j=1}^{N} g(\|x_i - x_j\|)\, S\big(V(x_j)\big), \tag{3.1} $$

where we redefine the variable *V* as a field such that $V(x_j) = V_j$. This equation says that the network is convolutional with respect to the variables $x_i$, $i = 1, \dots, N$, and the associated convolution kernel is *g*.

In practice, it is not always possible to find a geometry for which the connectivity is a distance matrix. Therefore, we project the appropriate matrix on the set of Euclidean distance matrices, that is, the set of matrices $\Delta$ such that $\Delta_{ij} = \|x_i - x_j\|$ for some points $x_i \in \mathbb{R}^k$. More precisely, we define $D = g^{-1}(W)$, where $g^{-1}$ is applied to the coefficients of *W*. We then search for the distance matrix $\hat{D}$ such that $\|D - \hat{D}\|$ is minimal. In this letter, we consider an *L*_{2}-norm. Although the choice of an *L _{p}*-norm will be motivated by the wire-length minimization argument in section 3.3, note that the choice of an *L*_{2}-norm is somewhat arbitrary and corresponds to penalizing long-distance connections. The minimization turns out to be a least-squares minimization whose parameters are the positions $x_i$. This can be implemented by a set of methods known as multidimensional scaling, which are reviewed in Borg and Groenen (2005). In particular, we use the stress majorization (SMACOF) algorithm for the stress-1 cost function throughout this letter. Writing $\hat{D}$ for the resulting distance matrix, we obtain the decomposition $W = g(\hat{D}) + M$ with residual term *M*.
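As an illustration of this projection step, the following sketch implements a bare-bones SMACOF iteration (stress majorization via the Guttman transform) rather than calling a library routine. The toy connectivity, built from points on a line with the exponential kernel g(x) = e^{−x}, and all sizes are assumptions for the example.

```python
import numpy as np

def smacof(D, k=2, iters=300, seed=0):
    """Minimal SMACOF sketch: find points in R^k whose mutual distances
    approximate the dissimilarity matrix D (stress majorization)."""
    n = D.shape[0]
    X = np.random.default_rng(seed).standard_normal((n, k))
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, 1.0)          # guard the diagonal division
        B = -D / d
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))
        X = B @ X / n                     # Guttman transform
    return X

# Toy connectivity: g(x) = exp(-x) applied to distances of points on a line,
# inverted with g^{-1}(w) = -log(w) before the projection.
pts = np.linspace(0.0, 1.0, 8)[:, None]
Dtrue = np.abs(pts - pts.T)
W = np.exp(-Dtrue)        # convolutional connectivity
D = -np.log(W)            # recovered dissimilarities
X = smacof(D, k=2)
```

Each Guttman transform is guaranteed not to increase the stress, so the recovered configuration `X` reproduces the line geometry up to rotation and translation.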

We now consider two choices of the function *g*:

a decreasing affine (linear) kernel and an exponential kernel, $g(x) = e^{-x}$. In both cases, the connectivity decomposes into a convolutional kernel evaluated at the embedded positions plus a residual,

$$ W_{ij} = g(\|x_i - x_j\|) + M(x_i, x_j), $$

which gives equations 3.2 and 3.3 for the affine and exponential kernels, respectively. For obvious reasons, *M* is called the nonconvolutional connectivity. It is the role of the multidimensional scaling methods to minimize the contribution of the undetermined function *M* in the previous equations, that is, ideally to have $M \equiv 0$ for either choice of *g*. The ideal case of a fully convolutional connectivity can always be obtained if *k* is large enough. Indeed, proposition 1 shows that $D = g^{-1}(W)$ satisfies the triangular inequality ($D_{ij} \le D_{il} + D_{lj}$) for both choices of *g* under some plausible assumptions. Therefore, it has all the properties of a distance matrix (symmetric, positive coefficients, and triangular inequality), and one can find points in $\mathbb{R}^k$ whose distance matrix it is, provided *k* is large enough. In this case, the connectivity on the space defined by these points is fully convolutional, and equation 3.1 is exactly verified.

See section A.5.
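The existence of an exact embedding for large enough *k* can be checked numerically with classical MDS (Torgerson scaling), which complements the SMACOF procedure used in the letter: a symmetric matrix with zero diagonal is a Euclidean distance matrix exactly when its double-centered squared-distance matrix is positive semidefinite (Schoenberg's criterion), and the number of positive eigenvalues gives the minimal dimension. The helix test data below are an illustrative assumption.

```python
import numpy as np

def euclidean_embedding(D):
    """Classical MDS: from an exact Euclidean distance matrix D, recover
    points via the eigendecomposition of B = -1/2 J D^2 J (J = centering)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)             # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]     # sort descending
    k = int(np.sum(vals > 1e-8 * vals.max()))  # numerical rank = dimension
    X = vecs[:, :k] * np.sqrt(np.maximum(vals[:k], 0.0))
    return X, vals

# Points on a 3-D helix embed exactly with k = 3.
t = np.linspace(0.0, 4.0 * np.pi, 20)
P = np.stack([np.cos(t), np.sin(t), 0.2 * t], axis=1)
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
X, vals = euclidean_embedding(D)
```

When the input matrix is only approximately Euclidean, the negative eigenvalues of B quantify the nonconvolutional residual that SMACOF must absorb.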

### 3.2. Unveiling the Geometrical Structure of the Inputs.

We hypothesize that the space defined by the $x_i$ reflects the underlying geometrical structure of the inputs. We have not found a way to prove this, so we provide numerical examples that illustrate the claim; the following is only a numerical proof of concept. For each example, we feed the network with inputs having a defined geometrical structure and then show how this structure can be extracted from the connectivity by the method outlined in section 3.1. In particular, we assume that the inputs are uniformly distributed over a manifold $\mathcal{M}$ with fixed geometry. This strong assumption amounts to considering that the feedforward connectivity (which we do not consider here) has already properly filtered the information coming from the sensory organs. More precisely, define the set of inputs by the matrix with components $I^{(a)}_i = f(\|z_a - y_i\|)$, where the $z_a$ are uniformly distributed points over $\mathcal{M}$, the $y_i$ are the positions on $\mathcal{M}$ that label the *i*th neuron, and *f* is a decreasing function on $\mathbb{R}_+$. The norm is the natural norm defined over the manifold $\mathcal{M}$. For simplicity, we take *f* to be gaussian, so that the inputs are localized bell-shaped bumps on the shape $\mathcal{M}$.

#### 3.2.1. Planar Retinotopy.

We first take the manifold $\mathcal{M}$ to be a planar patch and set *N* = *M* = *K*^{2} with *z _{a}* = *y _{i}* for *i* = *a*. (The numerical results show an identical structure for the final connectivity when the *y _{i}* correspond to random points, but the analysis is harder.) In the simpler case of one-dimensional gaussians with *N* = *M* = *K*, the input matrix takes the form $I = T_f$, where $T_f$ is a symmetric Toeplitz matrix generated by a gaussian. In the two-dimensional case, we set *N* = *M* = *K*^{2} and introduce the labeling $y_{k+(l-1)K} = (u_k, v_l)$ for $k, l = 1, \dots, K$. It follows that, for *i* = *k* + (*l*−1)*K*, we can write $I = T_f \otimes_K T_f$, where $\otimes_K$ is the Kronecker product; the Kronecker product is responsible for the substructure we can observe in Figure 2, with *K* = 10. Note that if we were interested in an *n*-dimensional retinotopy, then the input matrix could be written as a Kronecker product of *n* Toeplitz matrices. As previously mentioned, the final connectivity matrix roughly corresponds to the correlation matrix of the input matrix. It turns out that the correlation matrix of $T_f \otimes_K T_f$ is also a Kronecker product of two Toeplitz matrices generated by a single gaussian (with a different standard deviation). Thus, the connectivity matrix has the same basic form as the input matrix when *z _{a}* = *y _{i}* for *i* = *a*. The inputs and stable equilibrium points of the simulated system are shown in Figure 2. The positions *x _{i}* of the neurons after multidimensional scaling are shown in Figure 3 for different parameters. Note that we find no significant change in the positions *x _{i}* of the neurons when the convolutional kernel *g* varies (as will also be shown in section 3.3.1). Thus, we show results for only one of these kernels, *g*(*x*) = *e*^{−*x*}.
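The Kronecker-Toeplitz structure of the planar inputs is easy to reproduce numerically; the grid size and gaussian width below are illustrative. The final check confirms the property quoted above: the correlation matrix of the inputs is again a Kronecker product of Toeplitz factors.

```python
import numpy as np

K, sigma = 10, 1.5
idx = np.arange(K)

# Symmetric Toeplitz matrix generated by a 1-D gaussian, T_f[i, j] = f(|i-j|).
T_f = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2.0 * sigma ** 2))

# Two-dimensional input matrix: N = M = K^2, one input per row.
I2d = np.kron(T_f, T_f)

# Input correlation matrix: (A kron B)(A kron B)^T = (A A^T) kron (B B^T),
# so the correlation is itself a Kronecker product of Toeplitz matrices.
C = I2d @ I2d.T
```

The mixed-product identity used in the last comment is exactly what makes the connectivity inherit the input matrix's Kronecker substructure.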

If the standard deviation of the input bumps is properly chosen, as in Figure 3b, we observe that the neurons are distributed on a regular grid, which is retinotopically organized. In other words, the network has learned the geometric shape of the inputs. This can also be observed in Figure 3d, which corresponds to the same connectivity matrix as in Figure 3b but is represented in three dimensions. The neurons self-organize on a two-dimensional saddle shape that accounts for the border distortions observed in two dimensions (discussed in the next paragraph). If the standard deviation is too large, as in Figure 3c, the final result is poor. Indeed, the inputs are no longer local and cover most of the visual field. Therefore, the neurons saturate for all the inputs, and no structure can be read in the activity variable. If the standard deviation is small, the neurons still seem to self-organize (as long as the inputs are not completely localized on single neurons) but with significant border effects.

There are several reasons that we observe border distortions in Figure 3. We believe the most important is an unequal average excitation of the neurons. Indeed, the neurons corresponding to the border of the “visual field” are less excited than the others. For example, consider a neuron on the left border of the visual field. It has no neighbors on its left and therefore is less likely to be excited by its neighbors, so it is less excited on average. The consequence is that it is less connected to the rest of the network (see, e.g., the top line of the right side of Figure 2), because the connections depend on the level of excitation through the correlation of the activity. Therefore, it ends up farther away from the other neurons, which is what we observe. When the inputs are localized, the border neurons are even less excited on average and thus are even farther away, as shown in Figure 3a.

Another way to get distortions in the positions *x _{i}* is to excessively reduce or increase the amplitude $I_m = \max_{i,a} |I^{(a)}_i|$ of the inputs. Indeed, if it is small, the equilibrium activity described by equation 2.14 is also small and likely to lie in the flat part of the sigmoid. In this case, neurons tend to be more homogeneously excited and less sensitive to the particular shape of the inputs. Therefore, the network loses some information about the underlying structure of the inputs. Actually, the neurons become relatively more sensitive to the neighborhood structure of the network, and the border neurons behave differently from the rest of the network, as shown in Figure 3e. The maximal firing rate, by contrast, has much less impact on the final shape, since it corresponds only to a homogeneous scaling of the final connectivity.

So far, we have assumed that the inputs were uniformly spread on the manifold . If this assumption is broken, the final position of the neurons will be affected. As shown in Figure 3f, where 50 inputs were added to the case of Figure 3b in only half of the visual field, the neurons that code for this area tend to be closer. Indeed, they tend not to be equally excited on average (as supposed in proposition 1), and a distortion effect occurs. This means that a proper understanding of the role of the vertical connectivity would be needed to complete this geometrical picture of the functioning of the network. This is, however, beyond the scope of this letter.

#### 3.2.2. Toroidal Retinotopy.

We now assume that the inputs are uniformly distributed over a two-dimensional torus $\mathbb{T}^2$. That is, the input labels *z _{a}* are randomly distributed on the torus, and the neuron labels *y _{i}* are regularly and uniformly distributed on the torus. The inputs and final stable weight matrix of the simulated system are shown in Figure 4. The positions *x _{i}* of the neurons after multidimensional scaling for *k*=3 are shown in Figure 5 and appear to form a cloud of points distributed on a torus. In contrast to the previous example, there are no distortions now because there are no borders on the torus. In fact, the neurons are equally excited on average in this case, which makes proposition 1 valid.
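For completeness, here is a small sketch of how such toroidal inputs can be generated; the wraparound metric on [0, 1)^2, the grid size, and the bump width are assumptions made for illustration.

```python
import numpy as np

def torus_dist(p, q):
    """Natural metric on the flat torus [0,1)^2: coordinates wrap around."""
    d = np.abs(p - q)
    d = np.minimum(d, 1.0 - d)
    return np.sqrt((d ** 2).sum(axis=-1))

rng = np.random.default_rng(2)
K, sigma = 8, 0.1
u = (np.arange(K) + 0.5) / K
y = np.stack(np.meshgrid(u, u), axis=-1).reshape(-1, 2)  # regular neuron labels
z = rng.uniform(0.0, 1.0, size=(K * K, 2))               # random input centers

# Gaussian bumps of the toroidal distance: one input per row.
inputs = np.exp(-torus_dist(z[:, None, :], y[None, :, :]) ** 2
                / (2.0 * sigma ** 2))
```

Because the metric wraps around, every neuron is statistically equivalent, which is why no border distortions appear in this case.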

### 3.3. Links with Neuroanatomy.

The brain is subject to energy constraints, which are completely neglected in the above formulation. These constraints most likely have a significant impact on the positions of real neurons in the brain. Indeed, it seems reasonable to assume that the positions and connections of neurons reflect a trade-off between the energy costs of biological tissue and the need to process information effectively. For instance, it has been suggested that a principle of wire-length minimization may operate in the brain (see Swindale, 1996; Chklovskii, Schikorski, & Stevens, 2002). In our neural mass framework, one may consider that the stronger two neural masses are connected, the larger the number of real axons linking them together. Minimizing axonal length then means that the stronger the connection, the closer the neurons, which is consistent with the convolutional part of the weight matrix. However, the underlying geometry of natural inputs is likely to be very high-dimensional, whereas the brain lies in a three-dimensional world. In fact, the cortex is so flat that it is effectively two-dimensional. Hence, the positions of real neurons differ from the positions in a high-dimensional vector space; since the cortex is roughly two-dimensional, the positions could be realized physically only if *k* = 2. Therefore, the three-dimensional toric geometry, or any higher-dimensional structure, could not be perfectly implemented in the cortex without the help of nonconvolutional long-range connections. Indeed, we suggest that the cortical connectivity is made of two parts: (1) a local convolutional connectivity corresponding to the convolutional term *g* in equations 3.2 and 3.3, which is consistent with the requirements of energy efficiency, and (2) a nonconvolutional connectivity corresponding to the factor *M* in equations 3.2 and 3.3, which is required in order to represent various stimulus features. If the cortex were higher-dimensional (*k* large enough), then there would be no nonconvolutional connectivity: *M* would vanish for both the linear and the exponential convolutional kernels.

We illustrate this claim by considering two examples based on the functional anatomy of the primary visual cortex: the emergence of ocular dominance columns and of orientation columns, respectively. We proceed by returning to the case of planar retinotopy (see section 3.2.1) but now with additional input structure. In the first case, the inputs are taken to be binocular and isotropic, whereas in the second case, they are taken to be monocular and anisotropic. The details are presented below. Given a set of prescribed inputs, the network evolves according to equation 2.10, and the lateral connections converge to a stable equilibrium. The resulting weight matrix is then projected onto the set of distance matrices for *k*=2 (as described in section 3.1) using the stress majorization (SMACOF) algorithm for the stress-1 cost function, as described in Borg and Groenen (2005). We thus assign a position $x_i \in \mathbb{R}^2$ to the *i*th neuron. (Note that the position *x _{i}* extracted from the weights using multidimensional scaling is distinct from the “physical” position *y _{i}* of the neuron in the retinocortical plane; the latter determines the center of its receptive field.) The convolutional connectivity (*g* in equations 3.2 and 3.3) is therefore completely defined. On the planar map of points *x _{i}*, neurons are isotropically connected to their neighbors; the closer the neurons are, the stronger is their convolutional connection. Moreover, since the stimulus feature preferences (orientation, ocular dominance) of each neuron *i* are prescribed, we can superimpose these feature preferences on the planar map of points *x _{i}*. In both examples, we find that neurons with the same ocular or orientation selectivity tend to cluster together: interpolating these clusters then generates corresponding feature columns. It is important to emphasize that the retinocortical positions *y _{i}* do not have any columnar structure; that is, they do not form clusters with similar feature preferences. Thus, in contrast to standard developmental models of vertical connections, the columnar structure emerges from the recurrent weights after learning, which are interpreted as Euclidean distances. It follows that neurons coding for the same feature tend to be strongly connected; indeed, the multidimensional scaling algorithm has the property that it positions strongly connected neurons close together. Equations 3.2 and 3.3 also suggest that the connectivity has a nonconvolutional part, *M*, which is a consequence of the low dimensionality (*k*=2). In order to illustrate the structure of the nonconvolutional connectivity, we select a neuron *i* in the plane and draw a link from its position *x _{i}* to the positions *x _{j}* of the neurons for which *M*(*x _{i}*, *x _{j}*) is maximal in Figures 7, 8, and 9. We find that *M* tends to be patchy; it connects neurons having the same feature preferences. In the case of orientation, *M* also tends to be coaligned, that is, it connects neurons with similar orientation preference along a vector in the plane of the same orientation.

Note that the proposed method of obtaining cortical maps is artificial: first the network learns its connectivity, and second, the equilibrium connectivity matrix is used to assign a position to each neuron. In biological tissue, cortical maps are likely to emerge slowly during learning. Therefore, a more biological way of addressing the emergence of cortical maps might consist in repositioning the neurons at each time step, taking the positions of the neurons at time *t* as the initial condition of the algorithm at time *t*+1. This would correspond better to the real effect of the energetic constraints (leading to wire-length minimization), which operate not only after learning but at each time step. However, when learning is finished, the energy landscape corresponding to the minimization of wire length would still be the same. In fact, repositioning the neurons successively during learning changes only the local minimum to which the algorithm converges: it corresponds to a different initialization of the neurons’ positions. Yet we do not think this corresponds to a more relevant minimum, since the positions of the points at time *t*=0 are still randomly chosen. Although not biological, the a posteriori method is not significantly worse than applying the algorithm at each time step, and it gives an intuition of the shapes of the local minima of the functional governing wire-length minimization.

#### 3.3.1. Ocular Dominance Columns and Patchy Connectivity.

We divide the *N* neurons into two sets that code for the left and right eyes, respectively. The *i*th neuron is then given a retinocortical position *y _{i}*, with the *y _{i}* uniformly distributed across the line for Figure 6 and across the plane for Figures 7 and 8. We do not assume a priori that there exist any ocular dominance columns; that is, neurons with similar retinocortical positions *y _{i}* do not form clusters of cells coding for the same eye. We then take the *a*th input to the network to be of the prescribed form, where the *z _{a}* are randomly generated from [0, 1] in the one-dimensional case and from [0, 1]^{2} in the two-dimensional case. For each input *a*, the binocular disparity parameter is randomly and uniformly chosen (see Bressloff, 2005). Thus, if this parameter is positive (negative), the corresponding input is predominantly from the left (right) eye.

First, we illustrate the results of ocular dominance simulations in one dimension in Figure 6. Although it is not biologically realistic, taking the visual field to be one-dimensional makes it possible to visualize the emergence of ocular dominance columns more easily. Indeed, in Figure 6 we analyze the role of the binocular disparity of the network, that is, we vary the disparity parameter. If it is small (the blue curves in Figures 6a and 6b and the top pictures in Figures 6c and 6d), there are virtually no differences between the left and right eyes, and we observe much less segregation than when it is large (the green curves in Figures 6a and 6b and the bottom pictures in Figures 6c and 6d). Increasing the binocular disparity between the two eyes results in the emergence of ocular dominance columns. Yet there does not seem to be any spatial scale associated with these columns: they form on various scales, as shown in Figure 6d.
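A minimal sketch of how such binocular inputs can be generated, assuming a gaussian-bump input form with a uniformly drawn disparity parameter in the spirit of Bressloff (2005); the function name `od_input` and the parameter values `eps_bar` and `sigma` are illustrative, not those used in the letter.

```python
import numpy as np

rng = np.random.default_rng(0)

def od_input(y_left, y_right, eps_bar=0.8, sigma=0.1):
    """One random binocular input over a one-dimensional retinotopic line.

    Assumed form: a gaussian bump at a random position z, weighted by
    (1 + eps)/2 for left-eye neurons and (1 - eps)/2 for right-eye ones,
    with the disparity eps drawn uniformly from [-eps_bar, eps_bar].
    """
    z = rng.uniform(0.0, 1.0)                 # stimulus center
    eps = rng.uniform(-eps_bar, eps_bar)      # binocular disparity
    bump = lambda y: np.exp(-(y - z) ** 2 / (2 * sigma ** 2))
    return np.concatenate([0.5 * (1 + eps) * bump(y_left),
                           0.5 * (1 - eps) * bump(y_right)])

y = np.sort(rng.uniform(0.0, 1.0, size=50))   # retinocortical positions y_i
I = od_input(y, y)                            # one input sample, length 2 * 50
```

Larger `eps_bar` makes the left and right halves of each input more dissimilar, which is the regime in which segregation is observed in Figure 6.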

In Figures 7 and 8, we plot the results of ocular dominance simulations in two dimensions. In particular, we illustrate the role of changing the binocular disparity, changing the standard deviation of the inputs, and using different convolutional kernels *g*. We plot the points *x _{i}* obtained by performing multidimensional scaling on the final connectivity matrix for *k* = 2, superimposing on this the ocular dominance map obtained by interpolating between clusters of neurons with the same eye preference. The convolutional connectivity (*g* in equations 3.2 and 3.3) is implicitly described by the positions of the neurons: the closer the neurons, the stronger their connections. We also illustrate the nonconvolutional connectivity (*M* in equations 3.2 and 3.3) by linking one selected neuron to the neurons to which it is most strongly connected. The color of each link refers to the color of the target neuron. The multidimensional scaling algorithm was applied for each set of parameters with different initial conditions, and the best final solution (i.e., with the smallest nonconvolutional part) was kept and plotted. The initial conditions were random distributions of neurons or artificially created ocular dominance stripes with different numbers of neurons per stripe. It turns out that the algorithm performed better on the latter. (The number of tunable parameters was too high for the system to converge to a global equilibrium from a random initial condition.) Our results show that nonconvolutional or long-range connections tend to link cells with the same ocular dominance, provided the inputs are sufficiently strong and different for each eye.
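The restart-and-select procedure described above can be sketched as follows, using plain gradient descent on the metric MDS stress as a stand-in for the scaling algorithm; the stress function, learning rate, and random initializations are schematic assumptions rather than the letter's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def stress(X, delta):
    """Sum of squared mismatches between embedded and target distances."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    return ((D - delta)[np.triu_indices(len(X), 1)] ** 2).sum()

def mds_descent(delta, X0, lr=0.002, steps=1000):
    """Plain gradient descent on the metric MDS stress."""
    X = X0.copy()
    for _ in range(steps):
        diff = X[:, None] - X[None, :]            # pairwise displacements
        D = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(D, 1.0)                  # avoid division by zero
        coef = (D - delta) / D
        np.fill_diagonal(coef, 0.0)
        X -= lr * 2 * (coef[:, :, None] * diff).sum(axis=1)
    return X

# Target distances derived from a hypothetical learned connectivity.
n = 30
truth = rng.uniform(0.0, 1.0, size=(n, 2))
delta = np.linalg.norm(truth[:, None] - truth[None, :], axis=2)

# Several initial conditions; keep the solution with the smallest residual,
# the analogue of the smallest nonconvolutional part in the text.
best = min((mds_descent(delta, rng.normal(size=(n, 2))) for _ in range(5)),
           key=lambda X: stress(X, delta))
```

Structured initializations (e.g., pre-built stripes) can simply be added to the candidate list passed to `min`, mirroring the stripe initializations used in the text.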

#### 3.3.2. Orientation Columns and Collinear Connectivity.

We divide the *N* neurons into four groups corresponding to different orientation preferences. Thus, if a neuron belongs to a given group, its orientation preference is the corresponding angle. For each group, the neurons are randomly assigned a retinocortical position. Again, we do not assume a priori that there exist any orientation columns; that is, neurons with similar retinocortical positions *y _{i}* do not form clusters of cells coding for the same orientation preference. Each cortical input *I _{i}^{(a)}* is generated by convolving a thalamic input consisting of an oriented gaussian with a Gabor-like receptive field (as in Miikkulainen et al., 2005). Let *R* denote a two-dimensional rigid-body rotation in the plane. The Gabor-like receptive field is defined with respect to the unit vector *e*_{0} = (0, 1), and the amplitudes *A*_{+} and *A*_{−} are chosen to satisfy a normalization condition. Similarly, the thalamic input is given by an anisotropic gaussian. The input parameters, namely the stimulus orientation and the position *z _{a}*, are randomly generated, the latter uniformly from [0, 1]^{2}; the remaining parameters are held fixed in our simulations.

The results of our simulations are shown in Figure 9 (left). In particular, we plot the points *x _{i}* obtained by performing multidimensional scaling on the final connectivity matrix for *k* = 2, superimposing on this the orientation preference map obtained by interpolating between clusters of neurons with the same orientation preference. To avoid border effects, we have zoomed in on the center of the map. We also illustrate the nonconvolutional connectivity by linking a group of neurons gathered in an orientation column to all the other neurons for which *M* is maximal. The patchy, anisotropic nature of the long-range connections is clear; this anisotropy is further quantified in the histogram of Figure 9.
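The input construction above can be sketched as follows, with schematic stand-ins for the Gabor-like receptive field and the anisotropic gaussian stimulus; all parameter values (`sigma`, `freq`, `sig_long`, `sig_short`) and the grid are illustrative assumptions rather than the values used in the letter.

```python
import numpy as np

def rot(theta):
    """Two-dimensional rigid-body rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def gabor(p, theta, sigma=0.15, freq=4.0):
    """Schematic Gabor-like receptive field at orientation theta."""
    q = p @ rot(-theta).T                     # coordinates in the rotated frame
    env = np.exp(-(q ** 2).sum(axis=-1) / (2 * sigma ** 2))
    return env * np.cos(2 * np.pi * freq * q[..., 1])  # carrier along e_0 = (0, 1)

def oriented_gaussian(p, theta, z, sig_long=0.4, sig_short=0.05):
    """Anisotropic gaussian bar centered at z with orientation theta."""
    q = (p - z) @ rot(-theta).T
    return np.exp(-q[..., 0] ** 2 / (2 * sig_long ** 2)
                  - q[..., 1] ** 2 / (2 * sig_short ** 2))

# One thalamic input (a): an oriented bar at a random position and angle.
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 81),
                            np.linspace(-1, 1, 81), indexing="ij"), axis=-1)
theta_a, z = np.pi / 4, np.array([0.3, 0.2])
stimulus = oriented_gaussian(grid, theta_a, z)

def response(y_i, theta_i):
    """Cortical input I_i^(a): overlap of the stimulus with neuron i's Gabor."""
    return float((gabor(grid - y_i, theta_i) * stimulus).sum())
```

A stimulus matching a neuron's position and orientation drives it much more strongly than an orthogonal one, which is the correlation structure that the recurrent Hebbian learning then picks up.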

## 4. Discussion

In this letter, we have shown how a neural network can learn the underlying geometry of a set of inputs. We have considered a fully recurrent neural network whose dynamics is described by a simple nonlinear rate equation, together with unsupervised Hebbian learning with decay that occurs on a much slower timescale. Although several inputs are periodically presented to the network, so that the resulting dynamical system is nonautonomous, we have shown that such a system has fairly simple dynamics: the network connectivity matrix always converges to an equilibrium point. We then demonstrated how this connectivity matrix can be expressed as a distance matrix in a Euclidean space of sufficiently large dimension *k*, which can be related to the underlying geometrical structure of the inputs. If the connectivity matrix is embedded in a lower, two-dimensional space (*k* = 2), then the emerging patterns are qualitatively similar to experimentally observed cortical feature maps, although we have considered simplistic stimuli and taken the feedforward connectivity to be fixed. That is, neurons with the same feature preferences tend to cluster together, forming cortical columns within the embedding space. Moreover, the recurrent weights decompose into a local isotropic convolutional part, which is consistent with the requirements of energy efficiency, and a longer-range nonconvolutional part that is patchy. This suggests a new interpretation of cortical maps: they correspond to two-dimensional embeddings of the underlying geometry of the inputs.

Geometric diffusion methods (see Coifman et al., 2005) are also an efficient way to reveal the underlying geometry of sets of inputs. There are differences with our approach, although both share the same philosophy. The main difference is that geometric harmonics deals with the probability of co-occurrence, whereas our approach is more focused on wiring length, which is indirectly linked to the inputs through Hebbian coincidence. From an algorithmic point of view, our method is concerned with the positions of the *N* neurons and not the *M* inputs, which can be a clear advantage in certain regimes. Indeed, we deal with matrices of size *N* × *N*, whereas the total size of the inputs is of order *M* × *N*, which is potentially much higher. Finally, this letter is devoted to decomposing the connectivity into a convolutional and a nonconvolutional part, and this is why we focus not only on the spatial structure but also on the form of the activity equation on this structure. These two results come together when decomposing the connectivity. In fact, this focus on the connectivity was necessary to use the energy minimization argument of section 2.3.1 and to compute the cortical maps in section 3.3.

One of the limitations of applying simple Hebbian learning to recurrent cortical connections is that it takes into account only excitatory connections, whereas 20% of cortical neurons are inhibitory. Indeed, in most developmental models of feedforward connections, it is assumed that the local and convolutional connections in cortex have a Mexican hat shape with negative (inhibitory) lobes for neurons that are sufficiently far from each other. From a computational perspective, it is possible to obtain such a weight distribution by replacing Hebbian learning with some form of covariance learning (see Sejnowski & Tesauro, 1989). However, it is difficult to prove convergence to a fixed point in the case of the covariance learning rule, and the multidimensional scaling method cannot be applied directly unless the Mexican hat function is truncated so that it is invertible. Another limitation of rate-based Hebbian learning is that it does not take into account causality, in contrast to more biologically detailed mechanisms such as spike-timing-dependent plasticity.

The approach taken here is very different from standard treatments of cortical development (as in Miller, Keller, & Stryker, 1989; Swindale, 1996), in which the recurrent connections are assumed to be fixed and of convolutional Mexican hat form while the feedforward vertical connections undergo some form of correlation-based Hebbian learning. In the latter case, cortical feature maps form in the physical space of retinocortical coordinates *y _{i}*, rather than in the more abstract planar space of points *x _{i}* obtained by applying multidimensional scaling to recurrent weights undergoing Hebbian learning in the presence of fixed vertical connections. A particular feature of cortical maps formed by modifiable feedforward connections is that the mean size of a column is determined by a Turing-like pattern-forming instability and depends on the length scales of the Mexican hat weight function and the two-point input correlations (see Miller et al., 1989; Swindale, 1996). No such Turing mechanism exists in our approach, so the resulting cortical maps tend to be more fractal-like (many length scales) compared to real cortical maps. Nevertheless, we have established that the geometrical structure of cortical feature maps can also be encoded by modifiable recurrent connections. This should have interesting consequences for models that consider the joint development of feedforward and recurrent cortical connections. One possibility is that the embedding space of points *x _{i}* arising from multidimensional scaling of the weights becomes identified with the physical space of retinocortical positions *y _{i}*. The emergence of local convolutional structures, together with sparser long-range connections, would then be consistent with energy efficiency constraints in physical space.

Our letter also draws a direct link between the recurrent connectivity of the network and the positions of the neurons in some vector space. In other words, learning corresponds to moving neurons or nodes so that their final positions match the inputs' geometrical structure. Similarly, the Kohonen algorithm detailed in Kohonen (1990) describes a way to move nodes according to the inputs presented to the network. It also converges toward the underlying geometry of the set of inputs. Although these approaches are not formally equivalent, it seems that both have the same qualitative behavior. However, our method is more general in the sense that no neighborhood structure is assumed a priori; such a structure emerges via the embedding.

Finally, note that we have used a discrete formalism based on a finite number of neurons. However, the resulting convolutional structure obtained by expressing the weight matrix as a distance matrix (see equations 3.2 and 3.3) allows us to take an appropriate continuum limit. This then generates a continuous neural field model in the form of an integro-differential equation whose integral kernel is given by the underlying weight distribution. Neural fields have been used increasingly to study large-scale cortical dynamics (see Coombes, 2005, for a review). Our geometrical learning theory provides a developmental mechanism for the formation of these neural fields. One of the useful features of neural fields from a mathematical perspective is that many of the methods of partial differential equations carry over. Indeed, for a general class of connectivity functions defined over continuous neural fields, a reaction-diffusion equation can be derived whose solution approximates the firing rate of the associated neural field (see Degond & Mas-Gallic, 1989; Cottet, 1995; Edwards, 1996). It appears that the necessary connectivity functions are precisely those that can be written in the form of equation 3.2. This suggests that a network that has been trained on a set of inputs with an appropriate geometrical structure behaves as a diffusion equation in a high-dimensional space together with a reaction term corresponding to the inputs.
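The diffusion approximation mentioned here can be checked numerically in one dimension: for a narrow gaussian kernel *g*, the convolution term *g* ∗ *u* − *u* is close to (σ²/2)*u*″. The grid, kernel width, and test profile below are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check of the diffusion approximation behind the neural-field
# limit: for a narrow gaussian kernel g, the nonlocal term g*u - u acts
# like (sigma^2 / 2) u'' on smooth profiles u.
L, n, sigma = 20.0, 2001, 0.2
x = np.linspace(-L / 2, L / 2, n)
dx = x[1] - x[0]
g = np.exp(-x ** 2 / (2 * sigma ** 2))
g /= g.sum() * dx                        # normalized gaussian kernel

u = np.exp(-x ** 2)                      # smooth test profile
conv = np.convolve(u, g, mode="same") * dx
diffusive = 0.5 * sigma ** 2 * np.gradient(np.gradient(u, dx), dx)
# conv - u matches diffusive up to higher-order corrections in sigma.
```

The residual between the two sides shrinks like σ⁴, so the narrower the kernel relative to the scale of *u*, the better the reaction-diffusion description.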

## Appendix

### A.1. Proof of System's Averaging.

Here, we show that the original nonautonomous system can be approximated by an autonomous Cauchy problem. Indeed, we can simplify the system by applying Tikhonov's theorem for slow-fast systems and then classical averaging methods for periodic systems.

#### A.1.1. Tikhonov's Theorem

Tikhonov's theorem (see Tikhonov, 1952, and Verhulst, 2007, for a clear introduction) deals with slow-fast systems. It says the following.

Assume that:

- The equation 0 = *g*(*x*, *y*, *t*) is solved by an isolated root that is a continuous function of *x* and *t*.
- This root is an asymptotically stable solution of the associated fast equation, uniformly in the parameters *x* and *t*.
- *y*(0) is contained in an interior subset of the domain of attraction of this root.

Then the solutions of the full system converge to those of the reduced system as the timescale separation grows. We now apply this theorem to our system. Following Faugeras et al. (2008), we know that under condition A.2 there exists an isolated root of the fixed-point equation and that it is asymptotically stable; equation A.2 corresponds to the weakly connected case. Moreover, the initial condition belongs to the basin of attraction of this single fixed point. Note that we require the inputs to be presented slowly enough that the membrane potentials have sufficient time to approach the equilibrium associated with a given input before the next input is presented to the network. In fact, this assumption makes it reasonable to neglect the transient activity dynamics due to the switching between inputs.
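The content of Tikhonov's theorem can be illustrated on a scalar caricature of the slow-fast structure, with a fast "membrane" variable *v* and a slow "weight" *w*; the equations and parameter values below are illustrative, not the system of the letter.

```python
# Scalar slow-fast caricature: v relaxes on timescale eps toward the
# quasi-steady state v* = I + w, while w follows Hebbian learning with
# decay on timescale 1. Tikhonov's theorem says the full trajectory of w
# stays close to that of the reduced system obtained by setting v = v*.
eps, I = 0.01, 0.5
dt, T = 1e-4, 10.0
n = int(T / dt)

v, w = 0.0, 0.0
for _ in range(n):                        # full slow-fast system
    v += dt * (-v + I + w) / eps
    w += dt * (v * I - w)
w_full = w

w = 0.0
for _ in range(n):                        # reduced system: v replaced by I + w
    w += dt * ((I + w) * I - w)
w_reduced = w
```

The two weight trajectories agree up to a discrepancy of order eps, which is the sense in which the averaged system of the appendix approximates the original one.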

#### A.1.2. Periodic Averaging

The averaged equation describes the slow evolution of *W* with *T*-periodic forcing due to the presence of *V* on the right-hand side. Since the learning is slow, we can use classical averaging methods to show that solutions of equation A.1 are close, over a long time interval, to solutions of an autonomous averaged system; it follows that solutions of the original system are also close to solutions of the averaged one. Finding the explicit solution *V*(*t*) for each input *I*(*t*) is difficult and requires fixed-point methods (e.g., a Picard algorithm). Therefore, we consider yet another system whose solutions are also close to those of the averaged system and hence to those of the original one. In order to construct it, we need to introduce some additional notation.

Label the *M* inputs by *a* = 1, …, *M*, and denote by *V*^{(a)} the fixed-point solution of the corresponding equilibrium equation. Given the periodic sampling of the inputs, we can rewrite equation A.3 as an average over the *M* inputs. If we now introduce matrices collecting the *V*^{(a)} and the inputs as columns, we can eliminate the tensor product and simply write equation A.4 in the matrix form A.5. A second application of Tikhonov's theorem (in the reverse direction) then establishes that solutions of the system (written in the matrix form A.5) are close to solutions of the reduced matrix system.

In this letter, we focus on this reduced system, whose solutions are close to those of the original system provided condition A.2 is satisfied, that is, the network is weakly connected. Clearly, the two systems share the same fixed points. Therefore, under condition A.2, Tikhonov's theorem can be applied, and the two systems can reasonably be considered good approximations of each other.

### A.2. Proof of the Convergence to the Symmetric Attractor.

We need to prove two points: (1) the set of symmetric connectivity matrices is invariant, and (2) every trajectory converges to this set as *t* → ∞. Since the space of connectivity matrices is the direct sum of the set of symmetric connectivities and the set of antisymmetric connectivities, we write *W* = *W _{S}* + *W _{A}*, where *W _{S}* is symmetric and *W _{A}* is antisymmetric.

- In equation 2.10, the right-hand side is symmetric whenever *W* is. Therefore, if there exists *t*_{1} such that *W _{A}*(*t*_{1}) = 0, then *W* remains symmetric for *t* ≥ *t*_{1}.
- Projecting equation 2.10 onto the antisymmetric component leads to a linear equation whose solution decays exponentially. Therefore, *W _{A}* → 0: the system converges exponentially to the set of symmetric matrices.

### A.3. Proof of Theorem 1.

We write *W* = *W _{S}* + *W _{A}*, where *W _{S}* is symmetric and *W _{A}* is antisymmetric. Rewriting the system of equation 2.10 in terms of this decomposition, we see that the antisymmetric contribution tends to zero exponentially (because the system converges to the symmetric attractor). It then follows, after substituting equation A.10, that the time derivative of the energy along trajectories is given by equation A.13.

_{A}- •
is lower-bounded. Indeed, and

*W*are bounded. Given that and*S*are also bounded, it is clear that is bounded. - •
is negative semidefinite on the trajectories as shown in equation A.13.

Then the invariance principle tells us that the solutions of system approach the set . Equation A.13 implies that . Since and everywhere, *M* consists of the equilibrium points of the system. This completes the proof.

### A.4. Proof of Theorem 2.

The differential of *F* at the equilibrium can be written using a Hadamard (componentwise) product. Assume that there exists an eigenvector of this differential with a given eigenvalue. Taking the second component of the eigenvalue equation and computing the dot product with the appropriate vector yields an expression for one unknown; substituting this expression into the first equation leads to an eigenvalue problem for *W* alone. Define the spectral radius as the magnitude of the largest eigenvalue of the resulting operator. First, note that the two operators involved have the same eigenvalues but different eigenvectors, denoted by *u _{i}* and *v _{i}*, respectively. In the basis spanned by the *u _{i}*, both operators are diagonal with the same eigenvalues. Moreover, one of the operators is the composition of an orthogonal operator (i.e., an isometry) and a diagonal matrix, and the desired bound on the spectral radius follows immediately.

_{i}### A.5. Proof of Proposition 1.

For *i* = 1, …, *N*, consider the quantities defined in the statement of the proposition. For simplicity and without loss of generality, we can assume *c* = 1 (we can rescale *a* and the remaining parameters to treat the general case as long as *a* > *c*). The triangle inequality we want to prove can therefore be rewritten in terms of three vectors, which for readability we denote *u*, *v*, and *w*. These three vectors lie on the unit sphere and have only positive coefficients. There is a natural distance between such vectors, namely the geodesic angle between them. In other words, consider the intersection of span(*u*, *v*) and the *M*-dimensional sphere. This is a circle on which *u* and *v* lie. The angle between the two points defines a distance on the sphere and thus satisfies the triangle inequality. Moreover, all these angles belong to [0, π/2] because *u*, *v*, and *w* have only positive coefficients.

Observe these bounds and separate the two cases for the choice of function *g*:

- In the first case, define the inverse function *h*_{1} accordingly. We want to apply *h*_{1} to equation A.20, but *h*_{1} is monotonic only on part of its domain. Therefore, divide equation A.20 by 2 and apply *h*_{1} to both sides. Now consider the resulting auxiliary function. Because *h*_{1} is increasing, this function reaches its maximum at the boundary of its domain, which proves the desired inequality in this case. Moreover, it is easy to observe that the inequality holds for all *a* > 1. This concludes the proof for the first case.
- If *g*(*x*) = *ae*^{−*x*}, then define *h*_{2} as before. We still want to apply *h*_{2} to equation A.20, but *h*_{2} is not defined everywhere, which is likely to be an issue for the right-hand side of equation A.20. Therefore, we apply *h*_{2} to equation A.20 divided by 2 and use the fact that *h*_{2} is increasing on its domain. First, we use the convexity of *h*_{2}, and then a pointwise bound valid on the relevant interval. This would conclude the proof, but we have to make sure the angles remain in the admissible range. In fact, the coefficients of *u*, *v*, and *w* are strictly positive and larger than *S*(0). Therefore, the angles between them are strictly smaller than π/2, which yields the necessary condition for the result to hold.
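The two geometric facts used in the proof, namely that the geodesic angle satisfies the triangle inequality and that vectors with positive coefficients subtend angles in [0, π/2], can be checked numerically; the dimension and number of trials below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def angle(u, v):
    """Geodesic distance on the unit sphere between two unit vectors."""
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

def random_positive_unit(m):
    """Random unit vector with strictly positive coefficients."""
    x = rng.uniform(0.0, 1.0, size=m)
    return x / np.linalg.norm(x)

# Check the triangle inequality and the [0, pi/2] bound on many samples.
for _ in range(1000):
    u, v, w = (random_positive_unit(10) for _ in range(3))
    assert angle(u, w) <= angle(u, v) + angle(v, w) + 1e-6
    assert 0.0 <= angle(u, v) <= np.pi / 2 + 1e-6
```

The positivity of the coefficients is what keeps all dot products nonnegative, and hence all angles at most π/2, which is exactly the constraint exploited in both cases of the proof.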

## Acknowledgments

M.N.G. and O.D.F. were partially funded by ERC advanced grant NerVi no. 227747. M.N.G. was partially funded by the région PACA, France. This letter was based on work supported in part by the National Science Foundation (DMS-0813677) and by Award KUK-C1-013-4 made by King Abdullah University of Science and Technology. P.C.B. was also partially supported by the Royal Society Wolfson Foundation. This work was partially supported by the IP project BrainScaleS, no. 269921.