Post-Modern Machine Learning: The Deep Revolution

The mysteries of human intelligence are among the final frontiers of science. Despite our pride in what science has achieved across the past century, we have stalled when it comes to understanding intelligence or emulating it. The best we have done so far is through machine learning — harnessing the computational power of computers to begin to mimic the mind, attempting to answer the essential question:

How do we get machines to Know what we Know?

In modern machine learning, the answer is algorithmic.

In post-modern machine learning, the answer is manifestation.

The algorithms of modern machine learning are cause and effect, rules to follow, producing only what the programmer can imagine. But post-modern machine learning has blown past explicit algorithms to embrace deep networks. Deep networks today are defined by neural networks with thousands, or tens of thousands, or even hundreds of thousands, of neurons arrayed in multiple layers of dense connections. The interactivity of so many crossing streams of information defies direct deconstruction of what the networks are doing — they are post-modern. Their outputs manifest themselves, self-assembling into simplified structures and patterns and dependencies that are otherwise buried unseen in complicated data.

Fig. 1 A representation of a deep network with three fully-connected hidden layers. Deep networks are typically three or more layers deep, but each layer can have thousands of neurons. (Figure from the TowardsDataScience blog.)

Deep learning emerged as recently as 2006 and has opened wide new avenues of artificial intelligence that move beyond human capabilities for some tasks.  But deep learning also has pitfalls, some of which are inherited from the legacy approaches of traditional machine learning, and some of which are inherent in the massively high-dimensional spaces in which deep learning takes place.  Nonetheless, deep learning has revolutionized many aspects of science, and there is reason for optimism that the revolution will continue. Fifty years from now, looking back, we may recognize this as the fifth derivative of the industrial revolution (Phase I: Steam. Phase II: Electricity. Phase III: Automation. Phase IV: Information. Phase V: Intelligence).

From Multivariate Analysis to Deep Learning

Conventional machine learning, as we know it today, has had many names.  It began with Multivariate Analysis of mathematical population dynamics around the turn of the last century, pioneered by Francis Galton (1874), Karl Pearson (1901), Charles Spearman (1904) and Ronald Fisher (1922) among others.

The first on-line computers during World War II were developed to quickly calculate the trajectories of enemy aircraft for gunnery control, introducing the idea of feedback control of machines. This was named Cybernetics by Norbert Wiener, who had participated in the development of automated control of antiaircraft guns.

Table I. Evolution of Names for Machine Learning

A decade later, during the Cold War, it became necessary to find hidden objects in large numbers of photographs.  The embryonic electronic digital computers of the day were far too slow with far too little memory to do the task, so the Navy contracted with the Cornell Aeronautical Laboratory in Cheektowaga, New York, a suburb of Buffalo, to create an analog computer capable of real-time image analysis.  This led to the invention of the Perceptron by Frank Rosenblatt as the first neural network-inspired computer [1], building on ideas of neural logic developed by Warren McColloch and Walter Pitts. 

Fig. 2 Frank Rosenblatt working on the Perceptron. (From the Cornell Chronicle)
Fig. 3 Rosenblatt’s conceptual design of the connectionism of the perceptron (1958).

Several decades passed with fits and starts as neural networks remained too simple to accomplish anything profound.  Then in 1986, David Rumelhart and Ronald Williams at UC San Diego with Geoff Hinton at Carnegie-Mellon discovered a way to train multiple layers of neurons, in a process called error back propagation [2].  This publication opened the floodgates of Connectionism — also known as Parallel Distributed Processing.  The late 80’s and much of the 90’s saw an expansion of interest in neural networks, until the increasing size of the networks ran into limits caused by the processing speed and capacity of conventional computers towards the end of the decade.  During this time it had become clear that the most interesting computations required many layers of many neurons, and the number of neurons expanded into the thousands, but it was difficult to train such systems that had tens of thousands of adjustable parameters, and research in neural networks once again went into a lull.

The beginnings of deep learning started with two breakthroughs.  The first was by Yann Lecun at Bell Labs in 1998 who developed, with Leon Bottou, Yoshua Bengio and Patrick Haffner, a Convolutional Neural Network that had seven layers of neurons that classified hand-written digits [3]. The second was from Geoff Hinton in 2006, by then at the University of Toronto, who discovered a fast learning algorithm for training deep layers of neurons [4].  By the mid 2010’s, research on neural networks was hotter than ever, propelled in part by several very public successes, such as Deep Mind’s machine that beat the best player in the world at Go in 2017, self-driving cars, personal assistants like Siri and Alexa, and YouTube recommendations.

The Challenges of Deep Learning

Deep learning today is characterized by neural network architectures composed of many layers of many neurons.  The nature of deep learning brings with it two main challenges:  1) efficient training of the neural weights, and 2) generalization of trained networks to perform accurately on previously unseen data inputs.

Solutions to the first challenge, efficient training, are what allowed the deep revolution in the first place—the result of a combination of increasing computer power with improvements in numerical optimization. This included faster personal computers that allowed nonspecialists to work with deep network programming environments like Matlab’s Deep Learning toolbox and Python’s TensorFlow.

Solutions to the second challenge, generalization, rely heavily on a process known as “regularization”. The term “regularization” has a slippery definition, an obscure history, and an awkward sound to it. Regularization is the noun form of the verb “to regularize” or “to make regular”. Originally, regularization was used to keep certain inverse algorithms from blowing up, like inverse convolutions, also known as deconvolution. Direct convolution is a simple mathematical procedure that “blurs” ideal data based on the natural response of a measuring system. However, if one has experimental data, one might want to deconvolve the system response from the data to recover the ideal data. But this procedure is numerically unstable and can “blow up”, often because of the divide-by-zero problem. Regularization was a simple technique that kept denominators from going to zero.

Regularization became a common method for inverse problems that are notorious to solve because of the many-to-one mapping that can occur in measurement systems. There can be a number of possible causes that produce a single response. Regularization was a way of winnowing out “unphysical” solutions so that the physical inverse solution remained.

During the same time, regularization became a common tool used by quantum field theorists to prevent certain calculated quantities from diverging to infinity. The solution was again to keep denominators from going to zero by setting physical cut-off lengths on the calculations. These cut-offs were initially ad hoc, but the development of renormalization group theory by Kenneth Wilson at Cornell (Nobel Prize in 1982) provided a systematic approach to solving the infinities of quantum field theory.

With the advent of neural networks, having hundreds to thousands to millions of adjustable parameters, regularization became the catch-all term for fighting the problem of over-fitting. Over-fitting occurs when there are so many adjustable parameters that any training data can be fit, and the neural network becomes just a look-up table. Look-up tables are the ultimate hash code, but they have no predictive capability. If a slightly different set of data are fed into the network, the output can be anything. In over-fitting, there is no generalization, the network simply learns the idiosyncrasies of the training data without “learning” the deeper trends or patterns that would allow it to generalize to handle different inputs.

Over the past decades a wide collection of techniques have been developed to reduce over-fitting of neural networks. These techniques include early stopping, k-fold holdout, drop-out, L1 and L2 weight-constraint regularization, as well as physics-based constraints. The goal of all of these techniques is to keep neural nets from creating look-up tables and instead learning the deep codependencies that exist within complicated data.

Table II. Regularization Techniques in Machine Learning

By judicious application of these techniques, combined with appropriate choices of network design, amazingly complex problems can be solved by deep networks and they can generalized too (to some degree). As the field moves forward, we may expect additional regularization tricks to improve generalization, and design principles will emerge so that the networks no longer need to be constructed by trial and error.

The Potential of Deep Learning

In conventional machine learning, one of the most critical first steps performed on a dataset has been feature extraction. This step is complicated and difficult, especially when the signal is buried either in noise or in confounding factors (also known as distractors). The analysis is often highly sensitive to the choice of features, and the selected features may not even be the right ones, leading to bad generalization. In deep learning, feature extraction disappears into the net itself. Optimizing the neural weights subject to appropriate constraints forces the network to find where the relevant information lies and what to do with it.

The key to finding the right information was not just having many neurons, but having many layers, which is where the potential of deep learning emerges. It is as if each successive layer is learning a more abstract or more generalized form of the information than the last. This hierarchical layering is most evident in the construction of convolutional deep networks, where the layers are receiving a telescoping succession of information fields from lower levels. Geoff Hinton‘s Deep Belief Network, which launched the deep revolution in 2006, worked explicitly with this hierarchy in mind through direct design of the network architecture. Since then, network architecture has become more generalized, with less up-front design while relying on the increasingly sophisticated optimization techniques of training to set the neural weights. For instance, a simplified instance of a deep network is shown in Fig. 4 with three hidden layers of neurons.

Fig. 4 General structure of a deep network with three hidden layers. Layers will typically have hundreds or thousands of neurons. Each gray line represents a weight value, and each circle is a neural activation function.

The mathematical structure of a deep network is surprisingly simple. The equations for the network in Fig. 4, that convert an input xa to an output ye, are

These equations use index notation to denote vectors (single superscript) and matrices (double indexes). The repeated index (one up and one down) denotes an implicit “Einstein” summation. The function φ(.) is known as the activation function, which is nonlinear. One of the simplest activation functions to use and analyze, and the current favorite, is known as the ReLU (for rectifying linear unit). Note that these equations represent a simple neural cascade, as the output of one layer becomes the input for the next.

The training of all the matrix elements assumes a surprisingly simply optimization function, known as an objective function or a loss function, that can look like

where the first term is the mean squared error of the network output ye relative to the desired output y0 for the training set, and the second term, known as a regularization term (see section above) is a quadratic form that keeps the weights from blowing up. This loss function is minimized over the set of adjustable matrix weights.

The network in Fig. 4 is just a toy, with only 5 inputs and 5 outputs and only 23 neurons. But it has 30+36+36+30+23 = 155 adjustable weights. If this seems like overkill, but it is nothing compared to neural networks with thousands of neurons per layer and tens of layers. That massive overkill is exactly the power of deep learning — as well as its pitfall.

The Pitfalls of Deep Learning

Despite the impressive advances in deep learning, serious pitfalls remain for practitioners. One of the most challenging problems in deep learning is the saddle-point problem. A saddle-point in an objective function is like a mountain pass through the mountains: at the top of the pass it slopes downward in two opposite directions into the valleys but slopes upward in the two orthogonal directions to mountain peaks. A saddle point is an unstable equilibrium where a slight push this way or that can lead the traveller to two very different valleys separated by high mountain ridges. In our familiar three-dimensional space, saddle points are relatively rare and landscapes are dominated by valleys and mountaintops. But this intuition about landscapes fails in high dimensions.

Landscapes in high dimensions are dominated by neutral ridges that span the domain of the landscape. This key understanding about high-dimensional space actually came from the theory of evolutionary dynamics for the evolution of species. In the early days of evolutionary dynamics, there was a serious challenge to understand how genetic mutations could allow such diverse speciation to occur. If the fitness of a species were viewed as a fitness landscape, and if a highly-adapted species were viewed as a mountain peak in this landscape, then genetic mutations would need to drive the population state point into “valleys of death” that would need to be crossed to arrive at a neighboring fitness peak. It would seem that genetic mutations would likely kill off the species in the valleys before they could rise to the next equilibrium.

However, the geometry of high dimensions does not follow this simple low-dimensional intuition. As more dimensions come available, landscapes have more and more ridges of relatively constant height that span the full space (See my recent blog on random walks in 10-dimensions and my short YouTube video). For a species to move from one fitness peak to another fitness peak in a fitness landscape (in ultra-high-dimensional mutation space), all that is needed is for a genetic mutation to step the species off of the fitness peak onto a neutral ridge where many mutations can keep the species on that ridge as it moves ever farther away from the last fitness peak. Eventually, the neutral ridge brings the species near a new fitness peak where it can climb to the top, creating a new stable species. The point is that most genetic mutations are neutral — they do not impact the survivability of an individual. This is known as the neutral network theory of evolution proposed by Motoo Kimura (1924 – 1994) [5]. As these mutation accumulate, the offspring can get genetically far from the progenitor. And when a new fitness peak comes near, many of the previously neutral mutations can come together and become a positive contribution to fitness as the species climbs the new fitness peak.

The neutral network of genetic mutation was a paradigm shift in the field of evolutionary dynamics, and it also taught everyone how different random walks in high-dimensional spaces are from random walks in 3D. But although neutral networks solved the evolution problem, they become a two-edged sword in machine learning. On the positive side, fitness peaks are just like the minima of objective functions, and the ability for partial solutions to perform random walks along neutral ridges in the objective-function space allows optimal solutions to be found across a broad range of the configuration space of the neural weights. However, on the negative side, ridges are loci of unstable equilibrium. Hence there are always multiple directions that a solution state can go to minimize the objective function. Each successive run of a deep-network neural weight optimizer can find equivalent optimal solutions — but they each can be radically different. There is no hope of averaging the weights of an ensemble of networks to arrive at an “average” deep network. The averaging would simply drive all weights to zero. Certainly, the predictions of an ensemble of equivalently trained networks can be averaged—but this does not illuminate what is happening “under the hood” of the machine, which is where our own “understanding” of what the network is doing would come from.

Post-Modern Machine Learning

Post-modernism is admittedly kind of a joke — it works so hard to pull down every aspect of human endeavor that it falls victim to its own methods. The convoluted arguments made by its proponents sound like ultimate slacker talk — circuitous logic circling itself in an endless loop of denial.

But post-modernism does have its merits. It surfs on the moving crest of what passes as modernism, as modernism passes onward to its next phase. The philosophy of post-modernism moves beyond rationality in favor of a subjectivism in which cause and effect are blurred.  For instance, in post-modern semiotic theory, a text or a picture is no longer an objective element of reality, but fragments into multiple semiotic versions, each one different for each different reader or watcher — a spectrum of collaborative efforts between each consumer and the artist. The reader brings with them a unique set of life experiences that interact with the text to create an entirely new experience in each reader’s mind.

Deep learning is post-modern in the sense that deterministic algorithms have disappeared. Instead of a traceable path of sequential operations, neural nets scramble information into massively-parallel strings of partial information that cross and interact nonlinearly with other massively-parallel strings. It is difficult to impossible to trace any definable part of the information from input to output. The output simply manifests some aspect of the data that was hidden from human view.

But the Age of Artificial Intelligence is not here yet. The vast multiplicity of saddle ridges in high dimensions is one of the drivers for one of the biggest pitfalls of deep learning — the need for massive amounts of training data. Because there are so many adjustable parameters in a neural network, and hence so many dimensions, a tremendous amount of training data is required to train a network to convergence. This aspect of deep learning stands in strong contrast to human children who can be shown a single picture of a bicycle next to a tricycle, and then they can achieve almost perfect classification accuracy when shown any number of photographs of different bicycles and tricycles. Humans can generalize with an amazingly small amount of data, while deep networks often need thousands of examples. This example alone points to the marked difference between human intelligence and the current state of deep learning. There is still a long road ahead.


[1] F. Rosenblatt, “THE PERCEPTRON – A PROBABILISTIC MODEL FOR INFORMATION-STORAGE AND ORGANIZATION IN THE BRAIN,” Psychological Review, vol. 65, no. 6, pp. 386-408, (1958)

[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “LEARNING REPRESENTATIONS BY BACK-PROPAGATING ERRORS,” Nature, vol. 323, no. 6088, pp. 533-536, Oct (1986)

[3] LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). “Gradient-based learning applied to document recognition”. Proceedings of the IEEE. 86 (11): 2278–2324.

[4] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527-1554, Jul (2006)

[5] M. Kimura, The Neutral Theory of Molecular Evolution. Cambridge University Press, 1968.

A Random Walk in 10 Dimensions

Physics in high dimensions is becoming the norm in modern dynamics.  It is not only that string theory operates in ten dimensions (plus one for time), but virtually every complex dynamical system is described and analyzed within state spaces of high dimensionality.  Population dynamics, for instance, may describe hundreds or thousands of different species, each of whose time-varying populations define a separate axis in a high-dimensional space.  Coupled mechanical systems likewise may have hundreds or thousands (or more) of degrees of freedom that are described in high-dimensional phase space

In high-dimensional landscapes, mountain ridges are much more common than mountain peaks. This has profound consequences for the evolution of life, the dynamics of complex systems, and the power of machine learning.

For these reasons, as physics students today are being increasingly exposed to the challenges and problems of high-dimensional dynamics, it is important to build tools they can use to give them an intuitive feeling for the highly unintuitive behavior of systems in high-D.

Within the rapidly-developing field of machine learning, which often deals with landscapes (loss functions or objective functions) in high dimensions that need to be minimized, high dimensions are usually referred to in the negative as “The Curse of Dimensionality”.

Dimensionality might be viewed as a curse for several reasons.  First, it is almost impossible to visualize data in dimensions higher than d = 4 (the fourth dimension can sometimes be visualized using colors or time series).  Second, too many degrees of freedom create too many variables to fit or model, leading to the classic problem of overfitting.  Put simply, there is an absurdly large amount of room in high dimensions.  Third, our intuition about relationships among areas and volumes are highly biased by our low-dimensional 3D experiences, causing us to have serious misconceptions about geometric objects in high-dimensional spaces.  Physical processes occurring in 3D can be over-generalized to give preconceived notions that just don’t hold true in higher dimensions.

Take, for example, the random walk.  It is usually taught starting from a 1-dimensional random walk (flipping a coin) that is then extended to 2D and then to 3D…most textbooks stopping there.  But random walks in high dimensions are the rule rather than the exception in complex systems.  One example that is especially important in this context is the problem of molecular evolution.  Each site on a genome represents an independent degree of freedom, and molecular evolution can be described as a random walk through that space, but the space of all possible genetic mutations is enormous.  Faced with such an astronomically large set of permutations, it is difficult to conceive of how random mutations could possibly create something as complex as, say, ATP synthase which is the basis of all higher bioenergetics.  Fortunately, the answer to this puzzle lies in the physics of random walks in high dimensions. 

Why Ten Dimensions?

This blog presents the physics of random walks in 10 dimensions.  Actually, there is nothing special about 10 dimensions versus 9 or 11 or 20, but it gives a convenient demonstration of high-dimensional physics for several reasons.  First, it is high enough above our 3 dimensions that there is no hope to visualize it effectively, even by using projections, so it forces us to contend with the intrinsic “unvisualizability” of high dimensions.  Second, ten dimensions is just big enough that it behaves roughly like any higher dimension, at least when it comes to random walks.  Third, it is about as big as can be handled with typical memory sizes of computers.  For instance, a ten-dimensional hypercubic lattice with 10 discrete sites along each dimension has 10^10 lattice points (10 Billion or 10 Gigs) which is about the limit of what a typical computer can handle with internal memory.

As a starting point for visualization, let’s begin with the well-known 4D hypercube but extend it to a 4D hyperlattice with three values along each dimension instead of two. The resulting 4D lattice can be displayed in 2D as a network with 3^4 = 81 nodes and 216 links or edges. The result is shown in Fig. 1, represented in two dimensions as a network graph with nodes and edges. Each node has four links with neighbors. Despite the apparent 3D look that this graph has about it, if you look closely you will see the frustration that occurs when trying to link to 4 neighbors, causing many long-distance links.

[See YouTube video for movies showing evolving hyperlattices and random walks in 10D.]

Fig. 1 A 4D hyperlattice with three sites along each of the 4 dimensions. This high dimensional discrete lattice is represented as a network graph in 2D with nodes and edges.

We can also look at a 10D hypercube that has 2^10 = 1024 nodes and 5120 edges, shown in Fig. 2. It is a bit difficult to see the hypercubic symmetry when presented in 2D, but each node has exactly 10 links.

Fig. 2 A 10D hypercube of 1024 nodes and 5120 edges. Each node has exactly 10 links to neighbors

Extending this 10D lattice to 10 positions instead of 2 and trying to visualize it is prohibitive, since the resulting graph in 2D just looks like a mass of overlapping circles. However, our interest extends not just to ten locations per dimension, but to an unlimited number of locations. This is the 10D infinite lattice on which we want to explore the physics of the random walk.

Diffusion in Ten Dimensions

An unconstrained random walk in 10D is just a minimal extension beyond a simple random walk in 1D. Because each dimension is independent, a single random walker takes a random step along any of the 10 dimensions at each iteration so that motion in any one of the 10 dimensions is just a 1D random walk. Therefore, a simple way to visualize this random walk in 10D is simply to plot the walk against each dimension, as in Fig. 3. There is one chance in ten that the walker will take a positive or negative step along any given dimension at each time point.

Fig. 3 A single walker taking random unit steps in 10 dimensions. The position of the walker as a function of time is shown for all ten dimensions.

An alternate visualization of the 10D random walker is shown in Fig. 4 for the same data as Fig. 3. In this case the displacement is color coded, and each column is a different dimension. Time is on the vertical axis (starting at the top and increasing downward). This type of color map can easily be extended to hundreds of dimensions. Each row is a position vector of the single walker in the 10D space

Fig. 4 Same data as in Fig. 3 for a single 10D random walker on a hyperlattice. Distance is color coded. Time is on the vertical axis (increasing downward). Each row is a 10D position vector, and this representation is of a single 10D trajectory.

In the 10D hyperlattice in this section, all lattice sites are accessible at each time point, so there is no constraint preventing the walk from visiting a previously-visited node. There is a possible adjustment that can be made to the walk that prevents it from ever crossing its own path. This is known as a self-avoiding-walk (SAW). In two dimensions, there is a major difference in the geometric and dynamical properties of an ordinary walk and an SAW. However, in dimensions larger than 4, it turns out that there are so many possibilities of where to go (high-dimensional spaces have so much free room) that it is highly unlikely that a random walk will ever cross itself. Therefore, in our 10D hyperlattice we do not need to make the distinction between an ordinary walk and a self-avoiding-walk. However, there are other constraints that can be imposed that mimic how complex systems evolve in time, and these constraints can have important consequences, as we see next.

Random Walk in a Maximally Rough Landscape

In the infinite hyperlattice of the previous section, all lattice sites are the same and are all equally accessible. However, in the study of complex systems, it is common to assign a value to each node in a high-dimensional lattice. This value can be assigned by a potential function, producing a high-dimensional potential landscape over the lattice geometry. Or the value might be the survival fitness of a species, producing a high-dimensional fitness landscape that governs how species compete and evolve. Or the value might be a loss function (an objective function) in a minimization problem from multivariate analysis or machine learning. In all of these cases, the scalar value on the nodes defines a landscape over which a state point executes a walk. The question then becomes, what are the properties of a landscape in high dimensions, and how does it affect a random walker?

As an example, let’s consider a landscape that is completely random point-to-point. There are no correlations in this landscape, making it maximally rough. Then we require that a random walker takes a walk along iso-potentials in this landscape, never increasing and never decreasing its potential. Beginning with our spatial intuition living in 3D space, we might be concerned that such a walker would quickly get confined in some area of the lanscape. Think of a 2D topo map with countour lines drawn on it — If we start at a certain elevation on a mountain side, then if we must walk along directions that maintain our elevation, we stay on a given contour and eventually come back to our starting point after circling the mountain peak — we are trapped! But this intuition informed by our 3D lives is misleading. What happens in our 10D hyperlattice?

To make the example easy to analyze, let’s assume that our potential function is restricted to N discrete values. This means that of the 10 neighbors to a given walker site, on average only 10/N are likely to have the same potential value as the given walker site. This constrains the available sites for the walker, and it converts the uniform hyperlattice into a hyperlattice site percolation problem.

Percolation theory is a fascinating topic in statistical physics. There are many deep concepts that come from asking simple questions about how nodes are connected across a network. The most important aspect of percolation theory is the concept of a percolation threshold. Starting with a complete network that is connected end-to-end, start removing nodes at random. For some critical fraction of nodes removed (on average) there will no longer be a single connected cluster that spans the network. This critical fraction is known as the percolation threshold. Above the percolation threshold, a random walker can get from one part of the network to another. Below the percolation threshold, the random walker is confined to a local cluster.

If a hyperlattice has N discrete values for the landscape potential (or height, or contour) and if a random walker can only move to site that has the same value as the walker’s current value (remains on the level set), then only a fraction of the hyperlattice sites are available to the walker, and the question of whether the walker can find a path the spans the hyperlattice becomes simply a question of how the fraction of available sites relates to the percolation threshold.

The percolation threshold for hyperlattices is well known. For reasonably high dimensions, it is given to good accuracy by

where d is the dimension of the hyperlattice. For a 10D hyperlattice the percolation threshold is pc(10) = 0.0568, or about 6%. Therefore, if more than 6% of the sites of the hyperlattice have the same value as the walker’s current site, then the walker is free to roam about the hyperlattice.

If there are N = 5 discrete values for the potential, then 20% of the sites are available, which is above the percolation threshold, and walkers can go as far as they want. This statement holds true no matter what the starting value is. It might be 5, which means the walker is as high on the landscape as they can get. Or it might be 1, which means the walker is as low on the landscape as they can get. Yet even if they are at the top, if the available site fraction is above the percolation threshold, then the walker can stay on the high mountain ridge, spanning the landscape. The same is true if they start at the bottom of a valley. Therefore, mountain ridges are very common, as are deep valleys, yet they allow full mobility about the geography. On the other hand, a so-called mountain peak would be a 5 surrounded by 4’s or lower. The odds for having this happen in 10D are 0.2*(1-0.8^10) = 0.18. Then the total density of mountain peaks, in a 10D hyperlattice with 5 potential values, is only 18%. Therefore, mountain peaks are rare in 10D, while mountain ridges are common. In even higher dimensions, the percolation threshold decreases roughly inversely with the dimensionality, and mountain peaks become extremely rare and play virtually no part in walks about the landscape.

To illustrate this point, Fig. 5 is the same 10D network that is in Fig. 2, but only the nodes sharing the same value are shown for N = 5, which means that only 20% of the nodes are accessible to a walker who stays only on nodes with the same values. There is a “giant cluster” that remains connected, spanning the original network. If the original network is infinite, then the giant cluster is also infinite but contains a finite fraction of the nodes.

Fig. 5 A 10D cluster that spans the network in Fig. 2 for 1/5 of the nodes sharing the same landscape value. This cluster represents a mountain ridge that spans the space. There are four additional co-existing clusters, each of which separately spans the same 10D space.

The quantitative details of the random walk can change depending on the proximity of the sub-networks (the clusters, the ridges or the level sets) to the percolation threshold. For instance, a random walker in D =10 with N = 5 is shown in Fig. 6. The diffusion is a bit slower than in the unconstrained walk of Figs. 3 and 4. But the ability to wander about the 10D space is retained.

Fig. 6 A random walker on the level-set cluster of Fig. 5

This is then the general important result: In high-dimensional landscapes, mountain ridges are much more common than mountain peaks. This has profound consequences for the evolution of life, the dynamics of complex systems, and the power of machine learning.

Consequences for Evolution and Machine Learning

When the high-dimensional space is the space of possible mutations on a genome, and when the landscape is a fitness landscape that assigns a survival advantage for one mutation relative to others, then the random walk describes the evolution of a species across generations. The prevalence of ridges, or more generally level sets, in high dimensions has a major consequence for the evolutionary process, because a species can walk along a level set acquiring many possible mutations that have only neutral effects on the survivability of the species. At the same time, the genetic make-up is constantly drifting around in this “neutral network”, allowing the species’ genome to access distant parts of the space. Then, at some point, natural selection may tip the species up a nearby (but rare) peak, and a new equilibrium is attained for the species.

One of the early criticisms of fitness landscapes was the (erroneous) criticism that for a species to move from one fitness peak to another, it would have to go down and cross wide valleys of low fitness to get to another peak. But this was a left-over from thinking in 3D. In high-D, neutral networks are ubiquitous, and a mutation can take a step away from one fitness peak onto one of the neutral networks, which can be sampled by a random walk until the state is near some distant peak. It is no longer necessary to think in terms of high peaks and low valleys of fitness — just random walks. The evolution of extremely complex structures, like ATP synthase, can then be understood as a random walk along networks of nearly-neutral fitness — once our 3D biases are eliminated.

The same arguments hold for many situations in machine learning and especially deep learning. When training a deep neural network, there can be thousands of neural weights that need to be trained through the minimization of a loss function, also known as an objective function. The loss function is the equivalent to a potential, and minimizing the loss function over the thousands of dimensions is the same problem as maximizing the fitness of an evolving species.

At first look, one might think that deep learning is doomed to failure. We have all learned, from the earliest days in calculus, that enough adjustable parameter can fit anything, but the fit is meaningless because it predicts nothing. Deep learning seems to be the worst example of this. How can fitting thousands of adjustable parameters be useful when the dimensionality of the optimization space is orders of magnitude larger than the degrees of freedom of the system being modeled?

The answer comes from the geometry of high dimensions. The prevalence of neutral networks in high dimensions gives lots of chances to escape local minima. In fact, local minima are actually rare in high dimensions, and when they do occur, there is a neutral network nearby onto which they can escape (if the effective temperature of the learning process is set sufficiently high). Therefore, despite the insanely large number of adjustable parameters, general solutions, that are meaningful and predictive, can be found by adding random walks around the objective landscape as a partial strategy in combination with gradient descent.

Given the superficial analogy of deep learning to the human mind, the geometry of random walks in ultra-high dimensions may partially explain our own intelligence and consciousness.

Biblography

S. Gravilet, Fitness Landscapes and the Origins of Species. Princeton University Press, 2004.

M. Kimura, The Neutral Theory of Molecular Evolution. Cambridge University Press, 1968.

YouTube Vlog on A Random Walk in 10 Dimensions