A Short History of Neural Networks

When it comes to questions about the human condition, the question of intelligence is at the top. What is the origin of our intelligence? How intelligent are we? And how intelligent can we make other things…things like artificial neural networks?

This is a short history of the science and technology of neural networks, not just artificial neural networks but also the natural, organic type, because theories of natural intelligence are at the core of theories of artificial intelligence. Without understanding our own intelligence, we probably have no hope of creating the artificial type.

Ramon y Cajal (1888): Visualizing Neurons

The story begins with Santiago Ramón y Cajal (1853 – 1934), who received the Nobel Prize in physiology in 1906 for his work illuminating natural neural networks. He built on work by Camillo Golgi, using a stain to give intracellular components contrast [1], and then went further, developing his own silver emulsions like those of early photography (which was one of his hobbies). Cajal was the first to show that neurons were individual constituents of neural matter and that their contacts were sequential: axons of sets of neurons contacted the dendrites of other sets of neurons, never axon-to-axon or dendrite-to-dendrite, to create a complex communication network. This became known as the neuron doctrine, and it is a central idea of neuroscience today.

Fig. 1 One of Cajal’s published plates demonstrating neural synapses. From Link.

McCulloch and Pitts (1943): Mathematical Models

In 1941, Warren S. McCulloch (1898–1969) arrived at the Department of Psychiatry at the University of Illinois at Chicago where he met with the mathematical biology group at the University of Chicago led by Nicolas Rashevsky (1899–1972), widely acknowledged as the father of mathematical biophysics in the United States.

An itinerant member of Rashevsky’s group at the time was a brilliant, young and unusual mathematician, Walter Pitts (1923–1969). He was not enrolled as a student at Chicago, but had simply “shown up” one day as a teenager at Rashevsky’s office door. Rashevsky was so impressed by Pitts that he invited him to attend the group meetings, and Pitts became interested in the application of mathematical logic to biological information systems.

When McCulloch met Pitts, he realized that Pitts had the mathematical background that complemented his own views of brain activity as computational processes. Pitts was homeless at the time, so McCulloch invited him to live with his family, giving the two men ample time to work together on their mutual obsession to provide a logical basis for brain activity in the way that Turing had provided it for computation.

McCulloch and Pitts simplified the operation of individual neurons to their most fundamental character, envisioning a neural computing unit with multiple inputs (received from upstream neurons) and a single on-off output (sent to downstream neurons), with the additional possibility of feedback loops as downstream neurons fed back onto upstream neurons. They also discretized the dynamics in time, using discrete logic and time-difference equations, and succeeded in devising a logical structure with rules and equations for the general operation of nets of neurons. They published their results in a 1943 paper titled “A logical calculus of the ideas immanent in nervous activity” [2], introducing computational language and logic to neuroscience. Their simplified neural unit became the basis for discrete logic, picked up a few years later by von Neumann as an elemental example of a logic gate upon which he began constructing the theory and design of the modern electronic computer.
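
As a minimal sketch (in Python, with illustrative weights and threshold that are not taken from their paper), a McCulloch-Pitts style unit can be written in a few lines; with the right weights it reproduces a logic gate, which is exactly the property von Neumann later exploited:

# A minimal sketch of a McCulloch-Pitts style threshold unit (illustrative only).
# Each input is 0 or 1; the unit fires (outputs 1) when the weighted sum of its
# inputs reaches a threshold. With suitable weights it acts as a logic gate.

def mcp_unit(inputs, weights, threshold):
    """Return 1 if the weighted sum of binary inputs meets the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# An AND gate: both inputs must be on to reach the threshold of 2.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, mcp_unit((a, b), weights=(1, 1), threshold=2))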

Fig. 2 The only figure in McCulloch and Pitts’ “Logical Calculus”.

Donald Hebb (1949): Hebbian Learning

The basic model for learning and adjustment of synaptic weights among neurons was put forward in 1949 by the physiological psychologist Donald Hebb (1904-1985) of McGill University in Canada in a book titled The Organization of Behavior [3].

In Hebbian learning, an initially untrained network consists of many neurons with many synapses having random synaptic weights. During learning, a synapse between two neurons is strengthened when both the pre-synaptic and post-synaptic neurons are firing simultaneously. In this model, it is essential that each neuron makes many synaptic contacts with other neurons, because it requires many input neurons acting in concert to trigger the output neuron. In this way, synapses are strengthened when there is collective action among the neurons. The synaptic strengths are therefore altered through a form of self-organization: a collective response of the network strengthens all those synapses that are responsible for the response, while synapses that do not contribute weaken. Despite the simplicity of this model, it has been surprisingly robust, standing up as a general principle for the training of artificial neural networks.
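
A hedged sketch of the rule in Python (the network size, learning rate and decay factor here are arbitrary placeholders): the weight of a synapse grows in proportion to the coincidence of pre- and post-synaptic activity, while a mild decay lets unused synapses weaken.

import numpy as np

# Illustrative Hebbian update: strengthen a synapse in proportion to the
# coincidence of pre- and post-synaptic activity (delta_w = eta * post * pre).

rng = np.random.default_rng(0)
n = 8
weights = rng.normal(scale=0.1, size=(n, n))    # random initial synaptic weights
eta = 0.05                                      # learning rate (illustrative)

for _ in range(100):
    pre = rng.integers(0, 2, size=n)            # binary activity of presynaptic neurons
    post = (weights @ pre > 1.0).astype(float)  # postsynaptic neurons that fire together
    weights += eta * np.outer(post, pre)        # Hebb: co-active pairs are strengthened
    weights *= 0.99                             # mild decay so unused synapses weaken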

Fig. 3. A Figure from Hebb’s textbook on psychology (1958). From Link.

Hodgkin and Huxley (1952): Neuron Transporter Models

Alan Hodgkin (1914 – 1998) and Andrew Huxley (1917 – 2012) were English biophysicists who received the 1963 Nobel Prize in physiology for their work on the physics behind neural activation. They constructed a differential-equation model of the spiking action potential; their biggest conceptual challenge was the presence of time delays in the voltage signals that were not explained by linear models of the neural conductance. As they began exploring nonlinear models, using their experiments to guide the choice of parameters, they settled on a dynamical model in a four-dimensional phase space. One dimension was the membrane voltage, while the three others were gating variables controlling the sodium and potassium conductances, because sodium and potassium were the ions they had determined to be the major participants in the generation and propagation of the action potential. The nonlinear conductances of their model described the observed time delays and captured the essential neural behavior of the fast spike followed by a slow recovery. Huxley solved the equations on a hand-cranked calculator, taking over three months of tedious cranking to plot the numerical results.

Fig. 4 The Hodgkin-Huxley model of the neuron, including capacitance C, voltage V and bias current I along with the conductances of the potassium (K), sodium (Na) and leak (L) channels.

Hodgkin and Huxley published [4] their measurements and their model (known as the Hodgkin-Huxley model) in a series of six papers in 1952 that led to an explosion of research in electrophysiology. The four-dimensional Hodgkin–Huxley model stands as a classic example of the power of phenomenological modeling when combined with accurate experimental observation. Hodgkin and Huxley were able to ascertain not only the existence of ion channels in the cell membrane, but also their relative numbers, long before these molecular channels were ever directly observed using electron microscopes. The Hodgkin–Huxley model lent itself to simplifications that could capture the essential behavior of neurons while stripping off the details.
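
For reference, the current balance at the heart of their model can be written in the standard modern form, with the sodium and potassium conductances controlled by gating variables m, h and n that obey their own first-order kinetics:

C \frac{dV}{dt} = I - \bar{g}_{\mathrm{Na}}\, m^3 h\,(V - E_{\mathrm{Na}}) - \bar{g}_{\mathrm{K}}\, n^4 (V - E_{\mathrm{K}}) - \bar{g}_{L}\,(V - E_{L})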

Frank Rosenblatt (1958): The Perceptron

Frank Rosenblatt (1928–1971) had a PhD in psychology from Cornell University and was in charge of the cognitive systems section of the Cornell Aeronautical Laboratory (CAL) located in Buffalo, New York. He was tasked with fulfilling a contract from the Navy to develop an analog image processor. Drawing from the work of McCulloch and Pitts, his team first constructed a software system and then a hardware model that adaptively updated the strength of its inputs, which they called neural weights, as it was trained on test images. The machine was dubbed the Mark I Perceptron, and its announcement in 1958 created a small media frenzy [5]. A New York Times article reported that the perceptron was “the embryo of an electronic computer that [the navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”

The perceptron had a simple architecture, with two layers of neurons consisting of an input layer and a processing layer, and it was programmed by adjusting the synaptic weights to the inputs. This computing machine was the first to adaptively learn its functions, as opposed to following predetermined algorithms like digital computers. It seemed like a breakthrough in cognitive science and computing, as trumpeted by the New York Times.  But within a decade, the development had stalled because the architecture was too restrictive.
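
A minimal sketch of the perceptron learning rule in Python (the data here is a toy logic function, not Rosenblatt's images): the weights are nudged only when the thresholded output disagrees with the target.

import numpy as np

# Toy perceptron: a single layer of weights feeding a threshold output unit.
# The weights are adjusted only when the prediction is wrong.

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([0, 0, 0, 1])                                   # target: logical AND
w = np.zeros(2)
b = 0.0
eta = 0.1                                                    # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        w += eta * (target - pred) * xi   # nudge weights toward the correct answer
        b += eta * (target - pred)

print(w, b)   # weights and bias that separate the AND function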

Fig. 5 Frank Rosenblatt with his Perceptron. From Link.

Richard FitzHugh and Jin-Ichi Nagumo (1961): Neural van der Pol Oscillators

In 1961 Richard FitzHugh (1922–2007), a neurophysiology researcher at the National Institute of Neurological Disease and Blindness (NINDB) of the National Institutes of Health (NIH), created a surprisingly simple model of the neuron that retained only a third-order nonlinearity, the same kind of nonlinearity that Rayleigh had proposed and solved in 1883 and that van der Pol extended in 1926. Around the same time that FitzHugh proposed his mathematical model [6], the electronics engineer Jin-Ichi Nagumo (1926–1999) in Japan created an electronic diode circuit with an equivalent circuit model that mimicked neural oscillations [7]. Together, this work by FitzHugh and Nagumo led to the so-called FitzHugh–Nagumo model. The conceptual importance of this model is that it demonstrated that the neuron is a self-oscillator, just like a violin string or wheel shimmy or the pacemaker cells of the heart. Once again, self-oscillators showed themselves to be common elements of a complex world—and especially of life.

Fig. 6 The FitzHugh-Nagumo model of the neuron simplifies the Hodgkin-Huxley model from four dimensions down to two dimensions of voltage V and channel activation n.
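
In one common form (notation varies between sources; the recovery variable is written here as w), the FitzHugh–Nagumo equations are

\frac{dv}{dt} = v - \frac{v^3}{3} - w + I, \qquad \frac{dw}{dt} = \epsilon\,(v + a - b\,w)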

John Hopfield (1982): Spin Glasses and Recurrent Networks

John Hopfield (1933–) received his PhD from Cornell University in 1958, advised by Al Overhauser in solid state theory, and he continued to work on a broad range of topics in solid state physics as he wandered from appointment to appointment at Bell Labs, Berkeley, Princeton, and Caltech. In the 1970s Hopfield’s interests broadened into the field of biophysics, where he used his expertise in quantum tunneling to study quantum effects in biomolecules, and expanded further to include information transfer processes in DNA and RNA. In the early 1980s, he became aware of aspects of neural network research and was struck by the similarities between McCulloch and Pitts’ idealized neuronal units and the physics of magnetism. For instance, there is a type of disordered magnetic material called a spin glass in which a large number of local regions of magnetism are randomly oriented. In the language of solid-state physics, one says that the potential energy function of a spin glass has a large number of local minima into which various magnetic configurations can be trapped. In the language of dynamics, one says that the dynamical system has a large number of basins of attraction. Hopfield realized that a recurrent network of simple two-state neurons with symmetric weights behaves in the same way: stored memories sit at the minima of an energy function, so the network can recall a complete pattern from a partial or noisy cue [8].
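
The analogy can be made concrete with the standard energy function of a Hopfield network of two-state units s_i = ±1 coupled by symmetric weights w_ij:

E = -\frac{1}{2} \sum_{i \neq j} w_{ij}\, s_i s_j, \qquad s_i \leftarrow \mathrm{sign}\!\left( \sum_j w_{ij}\, s_j \right)

Each asynchronous update can only lower (or leave unchanged) the energy, so the network state slides into the nearest basin of attraction, that is, the nearest stored memory.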

The Parallel Distributed Processing Group (1986): Backpropagation

David Rumelhart, a mathematical psychologist at UC San Diego, was joined by James McClelland in 1974 and then by Geoffrey Hinton in 1978 to become what they called the Parallel Distributed Processing (PDP) group. The central tenets of the PDP framework they developed were that 1) processing is distributed across many semi-autonomous neural units, 2) these units learn by adjusting the weights of their interconnections based on the strengths of their signals (i.e., Hebbian learning), and 3) memories and behaviors are an emergent property of the distributed learned weights.

PDP was an exciting framework for artificial intelligence, and it captured the general behavior of natural neural networks, but it had a serious problem: How could all of the neural weights be trained?

In 1986, Rumelhart and Hinton with the mathematician Ronald Williams developed a mathematical procedure for training neural weights called error backpropagation [9]. The idea is simple to state: define a mean squared error between the response of the neural network and an ideal response, then determine, for every neural weight, whether a small change in that weight would increase or decrease the error. Backpropagation does this for all of the weights at once by propagating the error backwards through the layers using the chain rule, and each weight is then nudged in the direction that reduces the error. Working iteratively, tweak by tweak, the procedure minimizes the mean squared error. In this way, large numbers of neural weights can be adjusted as the network is trained to perform a specified task.
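
A minimal sketch of the procedure in Python (illustrative only, not the 1986 implementation): a tiny two-layer network learns the XOR function by propagating the output error back through the layers with the chain rule and nudging every weight downhill on the squared error.

import numpy as np

# Minimal sketch of error backpropagation on a tiny two-layer network.

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)        # ideal responses (XOR)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)         # input-to-hidden weights
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)         # hidden-to-output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = 1.0                                              # learning rate

for step in range(10000):
    H = sigmoid(X @ W1 + b1)                 # forward pass: hidden activations
    Y = sigmoid(H @ W2 + b2)                 # forward pass: network response
    dY = (Y - T) * Y * (1 - Y)               # error signal at the output layer
    dH = (dY @ W2.T) * H * (1 - H)           # chain rule: error sent back to the hidden layer
    W2 -= eta * H.T @ dY;  b2 -= eta * dY.sum(axis=0)
    W1 -= eta * X.T @ dH;  b1 -= eta * dH.sum(axis=0)

print(np.round(Y, 2))   # should approach the XOR targets 0, 1, 1, 0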

Error backpropagation has come a long way from that early 1986 paper, and it now lies at the core of the AI revolution we are experiencing today as networks with millions, and now billions, of neural weights are trained on massive datasets.

Yann LeCun (1989): Convolutional Neural Networks

In 1988, I was a new post-doc at AT&T Bell Labs at Holmdel, New Jersey fresh out of my PhD in physics from Berkeley. Bell Labs liked to give its incoming employees inspirational talks and tours of their facilities, and one of the tours I took was of the neural network lab run by Lawrence Jackel that was working on computer recognition of zip-code digits. The team’s new post-doc, arriving at Bell Labs the same time as me, was Yann LeCun. It is very possible that the demo our little group watched was run by him, or at least he was there, but at the time he was a nobody, so even if I had heard his name, it wouldn’t have meant anything to me.

Fast forward to today, and Yann LeCun’s name is almost synonymous with AI. He is the Chief AI Scientist at Facebook, and his Google Scholar page reports that he gets 50,000 citations per year.

LeCun is famous for developing the convolutional neural network (CNN) in work that he published from Bell Labs in 1989 [10]. It is a biomimetic neural network that takes its inspiration from the receptive fields of the neural networks in the retina. What you think you see, when you look at something, is actually reconstructed by your brain. Your retina is a neural processor with receptive fields that are a far cry from one-to-one. Most prominent in the retina are center-surround fields, or kernels, that respond to the derivatives of the focused image instead of the image itself. It is the derivatives that are sent up your optic nerve to your brain, which then reconstructs the image. This works as a form of image compression, so that broad uniform areas in an image are reduced to their edges.

The convolutional neural network works in much the same way; it is engineered specifically to produce compressed and multiscale codes that capture broad areas as well as the fine details of an image. By constructing many different “kernel” operators at many different scales, it creates a set of features that captures the nuances of the image in quantitative form, which is then processed by training neural weights in downstream neural networks.

Fig. 7 Example of a receptive field of a CNN. The filter is the kernel (in this case a discrete 3×3 Laplace operator) that is stepped sequentially across the image field to produce the Laplacian feature map of the original image. One feature map for every different kernel becomes the input for the next level of kernels in a hierarchical scaling structure.
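
A short sketch in Python of the operation the caption describes (an illustrative kernel and toy image, not LeCun's network): the 3×3 Laplace kernel is stepped across the image, and the resulting feature map responds only at edges.

import numpy as np

# Illustrative sketch: stepping a 3x3 Laplacian kernel across an image to
# produce a feature map, as in the receptive-field picture described above.

laplace = np.array([[0,  1, 0],
                    [1, -4, 1],
                    [0,  1, 0]], dtype=float)     # a discrete 3x3 Laplace operator

def convolve2d(image, kernel):
    """Valid convolution: slide the kernel over every 3x3 patch of the image."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A flat image with a bright square: the feature map responds only at its edges.
img = np.zeros((8, 8)); img[2:6, 2:6] = 1.0
print(convolve2d(img, laplace))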

Geoff Hinton (2006): Deep Belief

It seems like Geoff Hinton has had his finger in almost every pie when it comes to how we do AI today. Backpropagation? Geoff Hinton. Rectified Linear Units? Geoff Hinton. Boltzmann Machines? Geoff Hinton. t-SNE? Geoff Hinton. Dropout regularization? Geoff Hinton. AlexNet? Geoff Hinton. The 2024 Nobel Prize in Physics? Geoff Hinton! He may not have invented all of these, but he was in the midst of it all.

Hinton received his PhD in Artificial Intelligence (a rare field at the time) from the University of Edinburgh in 1978, after which he joined the PDP group at UCSD (see above) as a post-doc. After a time at Carnegie-Mellon, he joined the University of Toronto, Canada, in 1987, where he established one of the leading groups in the world on neural network research. It was from there that he launched so many of the ideas and techniques that have become the core of deep learning.

A central idea of deep learning came from Hinton’s work on Boltzmann Machines, which learn statistical distributions of complex data. This type of neural network is known as an energy-based model, similar to a Hopfield network, and it has strong ties to the statistical mechanics of spin-glass systems. Unfortunately, it is a bitch to train! So Hinton simplified it into the Restricted Boltzmann Machine (RBM), which was much more tractable, and layers of RBMs could be stacked into “Deep Belief Networks” [11] whose hierarchical structure allowed the neural nets to learn layers of abstraction. These were among the first deep networks that were able to do complex tasks at the level of human capabilities (and sometimes beyond).
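
For reference, the energy an RBM assigns to a joint configuration of visible units v and hidden units h has the standard bilinear form (a and b are bias vectors and W is the weight matrix; the “restriction” is that there are no visible-visible or hidden-hidden connections):

E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h} - \mathbf{v}^{\top} W \mathbf{h}, \qquad p(\mathbf{v}, \mathbf{h}) \propto e^{-E(\mathbf{v}, \mathbf{h})}

Low-energy configurations are the probable ones, so training amounts to lowering the energy of configurations that resemble the data.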

The breakthrough that propelled Geoff Hinton to world-wide acclaim was the success of AlexNet, a neural network constructed by his graduate student Alex Krizhevsky at Toronto in 2012, consisting of 650,000 neurons with 60 million parameters that were trained using two early Nvidia GPUs. It won the ImageNet challenge that year, enabled by its deep architecture, and marked the start of an advance that has proceeded unabated to today.

Deep learning is now the rule in AI, supported by the Attention mechanism and Transformers that underpin the large language models, like ChatGPT and others, that are poised to disrupt all the legacy business models based on the previous silicon revolution of 50 years ago.

Further Reading

Sections of this article have been excerpted from Chapter 11 of Galileo Unbound (Oxford University Press).

References

[1] Ramón y Cajal S. (1888). Estructura de los centros nerviosos de las aves. Rev. Trim. Histol. Norm. Pat. 1, 1–10.

[2] McCulloch, W.S. and W. Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity. Bull. Math. Biophys., 1943. 5: p. 115.

[3] Hebb, D. O. (1949). The Organization of Behavior: A Neuropsychological Theory. New York: Wiley and Sons. ISBN 978-0-471-36727-7 – via Internet Archive.

[4] Hodgkin AL, Huxley AF (August 1952). “A quantitative description of membrane current and its application to conduction and excitation in nerve”. The Journal of Physiology. 117 (4): 500–44.

[5] Rosenblatt, Frank (1957). “The Perceptron—a perceiving and recognizing automaton”. Report 85-460-1. Cornell Aeronautical Laboratory.

[6] FitzHugh, Richard (July 1961). “Impulses and Physiological States in Theoretical Models of Nerve Membrane”. Biophysical Journal. 1 (6): 445–466.

[7] Nagumo, J.; Arimoto, S.; Yoshizawa, S. (October 1962). “An Active Pulse Transmission Line Simulating Nerve Axon”. Proceedings of the IRE. 50 (10): 2061–2070.

[8] Hopfield, J. J. (1982). “Neural networks and physical systems with emergent collective computational abilities”. Proceedings of the National Academy of Sciences. 79 (8): 2554–2558.

[9] Rumelhart, D. E., Hinton, G. E. and Williams, R. J., “Learning representations by back-propagating errors,” Nature 323, 533–536 (1986).

[10] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541–551, Winter 1989.

[11] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation 18, 1527-1554 (2006).

Read more in Books by David D. Nolte at Oxford University Press

Ada Lovelace at the Dawn of Cyber Steampunk

Something strange almost happened in 1840s England just a few years into Queen Victoria’s long reign—a giant machine the size of a large shed, built of thousands of interlocking steel gears, driven by steam power, almost came to life—a thinking, mechanical automaton, the very image of Cyber Steampunk.

Cyber Steampunk is a genre of media that imagines an alternate history of a Victorian Age with advanced technology—airships and rockets and robots and especially computers—driven by steam power.  Some of the classics that helped launch the genre are the anime movies Castle in the Sky (1986) by Hayao Miyazaki and Steamboy (2004) by Katsuhiro Otomo and the novel The Difference Engine (1990) by William Gibson and Bruce Sterling.  In the novel, Ada Byron, Lady Lovelace, is pursued through the shadows of London by those who suspect she has devised a programmable machine that can win at gambling using steam and punched cards.  This is not too far off from what might have happened in real life if Ada Lovelace had a bit more sway over one of her unsuitable suitors—Charles Babbage.

But Babbage, part genius, part fool, could not understand what Lovelace understood—for if he had, a Victorian computer built of oiled gears and leaky steam pipes, instead of tiny transistors and metallic leads, might have come a hundred years early as another marvel of the already marvelous Industrial Revolution.  How might our world today be different if Babbage had seen what Lovelace saw?

Fig. 1 Sony Entertainment Ad for Steamboy (2004).

Boundless Babbage

There is no question of Babbage’s genius.  He was so far ahead of his time that he appeared to most people in his day to be a crackpot, and he was often treated as one.  His father thought he was useless, and told him so, because to be a scientist in the early 1800s was to be unemployable, and Babbage was unemployed for years after college.  Science was, literally, natural philosophy, and no one hired a philosopher unless they were faculty at some college.  But Babbage’s friends from Trinity College, Cambridge, like William Whewell (future Master of Trinity) and John Herschel (son of the famous astronomer), knew his worth and were loyal throughout their lives and throughout his trials.

Fig. 2 Charles Babbage

Charles Babbage was a favorite at Georgian dinner parties because he was so entertaining to watch and to listen to.  From personal letters of his friends (and enemies) of the time, one gets a picture of a character not too different from Sheldon Cooper on the TV series The Big Bang Theory—convinced of his own genius and equally convinced of the lack of genius of everyone else, and ready to tell them so.  His mind was so analytic that he talked like a walking computer—although nothing like a computer existed in those days—everything was logic and functions and propositions—hence his entertainment value.  No one understood him, and no one cared—until he ran into a young woman who actually did, but more of that later.

One summer day in 1821, Babbage and Herschel were working on mathematical tables for the Astronomical Society, a dull but important job to ensure that star charts and moon positions could be used accurately for astronomical calculations and navigation.  The numbers filled column after column, page after page. But as they checked the values, the two were shocked by how many entries in the tables were wrong.  In that day, every numerical value of every table or chart was calculated by a person (literally called a computer), and people make mistakes.  Even as they went to correct the numbers, new mistakes would creep in.  In frustration, Babbage exclaimed to Herschel that what they needed was a steam-powered machine that would calculate the numbers automatically.  No sooner had he said it than Babbage had a vision of a mechanical machine, driven by a small steam engine, full of gears and rods, that would print out the tables automatically without flaws.

Being unemployed (and unemployable) Babbage had enough time on his hands to actually start work on his engine.  He called it the Difference Engine because it worked on the Method of Differences—mathematical formulas were put into a form where a number was expressed as a series, and the differences between each number in the series would be calculated by the engine.  He approached the British government for funding, and it obliged with considerable funds.  In the days before grant proposals and government funding, Babbage had managed to jump start his project and, in a sense, gain employment.  His father was not impressed, but he did not live long enough to see what his son Charles could build.  Charles inherited a large sum from his father (the equivalent of about 14 million dollars today), which further freed him to work on his Difference Engine.  By 1832, he had finally completed a seventh part of the Engine and displayed it in his house for friends and visitors to see. 
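
To see how the Method of Differences turns a formula into pure addition, here is a short illustrative sketch in Python (the quadratic p(x) = x**2 + x + 41 is chosen only because its values are easy to check); the Engine did the same thing with columns of gears:

# Tabulating p(x) = x**2 + x + 41 by the Method of Differences: after the first
# few entries, every new value is produced by additions alone, no multiplication.

def difference_table(n_values):
    value = 41          # p(0)
    first_diff = 2      # p(1) - p(0)
    second_diff = 2     # constant second difference of a quadratic
    values = []
    for _ in range(n_values):
        values.append(value)
        value += first_diff        # add the running first difference
        first_diff += second_diff  # add the constant second difference
    return values

print(difference_table(6))   # 41, 43, 47, 53, 61, 71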

This working section of the Difference Engine can be seen today in the London Science Museum.  It is a marvel of steel and brass, consisting of three columns of stacked gears whose enmeshed teeth represent digital numbers.  As a crank handle is turned, the teeth work upon each other, generating new numbers through the permutations of rotated gear teeth.  Carrying tens was initially a problem for Babbage, as it is for school children today, but he designed an ingenious mechanical system to accomplish the carry.

Fig. 3 One-seventh part of Babbage’s Difference Engine.

All was going well, and the government was pleased with progress, until Charles had a better idea that threatened to scrap all he had achieved.  It is not known how this new idea came into being, but it is known that it happened shortly after he met the amazing young woman: Ada Byron.

Lovely Lovelace

Ada Lovelace, born Ada Byron, had the awkward distinction of being the only legitimate child of Lord Byron, lyric genius and poet.  Such was Lord Byron’s hedonistic lifestyle that no one can say for sure how many siblings Ada had, not even Lord Byron himself, which was even more awkward when his half-sister bore a bastard child that may have been his.

Fig. 4 Ada Lovelace

Ada’s high-born mother prudently divorced the wayward poet and was not about to have Ada pulled into her father’s morass.  Where Lord Byron was bewitched (some would say possessed) by art and spirit, the mother sought an antidote, and she encouraged Ada to study hard cold mathematics.  She could not have known that Ada too had a genius like her father’s, only aimed differently, bewitched by the beauty in the sublime symbols of math. 

An insight into the precocious child’s way of thinking can be gained from a letter that the 12-year-old girl wrote to her mother, who was off looking for miracle cures for imaginary ills. At that time in 1828, in a confluence of historical timelines in the history of mathematics, Ada and her mother (and Ada’s cat Puff) were living at Bifrons House, the former estate of Brook Taylor, who had developed the Taylor series a hundred years earlier in 1715. In Ada’s letter, she described a dream she had of a flying machine, which is not so remarkable, but then outlined her plan to her mother to actually make one, which is remarkable. As you read her letter, you see she is already thinking about weights and material strengths and energy efficiencies, thinking like an engineer and designer—at the age of only 12 years!

In later years, Lovelace would become the Enchantress of Number to a number of her mathematical friends, one of whom was the strange man she met at a dinner party in the summer of 1833 when she was 17 years old.  The strange man was Charles Babbage, and when he talked to her about his Difference Engine, expecting to be tolerated as an entertaining side show, she asked pertinent questions, one after another, and the two became locked in conversation. 

Babbage was a recent widower, having lost his wife with whom he had been happily compatible, and one can only imagine how he felt when the attractive and intelligent woman gave him her attention.  But Ada’s mother would never see Charles as a suitable husband for her daughter—she had ambitious plans for her, and she tolerated Babbage only as much as she did because of the affection that Ada had for him.  Nonetheless, Ada and Charles became very close as friends and met frequently and wrote long letters to each other, discussing problems and progress on the Difference Engine.

In December of 1834, Charles invited Lady Byron and Ada to his home where he described with great enthusiasm a vision he had of an even greater machine.  He called it his Analytical Engine, and it would surpass his Difference Engine in a crucial way:  where the Difference Engine needed to be reconfigured by hand before every new calculation, the Analytical Engine would never need to be touched, it just needed to be programmed with punched cards.  Charles was in top form as he wove his narrative, and even Lady Byron was caught up in his enthusiasm.  The effect on Ada, however, was nothing less than a religious conversion. 

Fig. 5 General block diagram of Babbage’s Analytical Engine. From [8].

Ada’s Notes

To meet Babbage as an equal, Lovelace began to study mathematics with an obsession, or one might say, with delusions of grandeur.  She wrote “I believe myself to possess a most singular combination of qualities exactly fitted to make me pre-eminently a discoverer of the hidden realities of nature,” and she was convinced that she was destined to do great things.

Then, in 1835, Ada was married off to a rich but dull aristocrat who was elevated by royal decree to the Earldom of Lovelace, making her the Countess of Lovelace.  The marriage had little effect on Charles’ and Ada’s relationship, and he was invited frequently to the new home where they continued their discussions about the Analytical Engine. 

By this time Charles had informed the British government that he was putting all his effort into the design of his new machine—news that was not received favorably, since he had never delivered even a working Difference Engine.  Just when he hoped to start work on his Analytical Engine, the government ministers pulled their money. This began a decades-long ordeal for Babbage as he continued to try to get monetary support as well as professional recognition from his peers for his ideas. Neither attempt was successful at home in Britain, but he did receive interest abroad, especially from a future prime minister of Italy, Luigi Menabrea, who invited Babbage to give a lecture in Turin on his Analytical Engine. Menabrea later had the lecture notes published in French. When Charles Wheatstone, a friend of Babbage, learned of Menabrea’s publication, he suggested to Lovelace that she translate it into English. Menabrea’s publication was the only existing exposition of the Analytical Engine, because Babbage had never written on the Engine himself, and Wheatstone was well aware of Lovelace’s talents, expecting her to be one of the only people in England who had the ability and the connections to Babbage to accomplish the task.

Ada Lovelace dove into the translation of Menabrea’s “Sketch of the Analytical Engine Invented by Charles Babbage” with the single-mindedness that she was known for. Along with the translation, she expanded on the work with Notes of her own, lettered from A to G. By the time she wrote them, Lovelace had become a top-rate mathematician, possibly surpassing even Babbage, and her Notes were three times longer than the translation itself, providing specific technical details and mathematical examples that Babbage and Menabrea had only alluded to.

On a different level, the character of Ada’s Notes stands in stark contrast to Charles’ exposition as captured by Menabrea: where Menabrea provided only technical details of Babbage’s Engine, Lovelace’s Notes captured the Engine’s potential. She was still a poet by disposition—that inheritance from her father was never lost.

Lovelace wrote:

We may say most aptly, that the Analytical Engine weaves algebraic patterns just as the Jacquard-loom weaves flowers and leaves.

Here she is referring to the punched cards that the Jacquard loom used to program the weaving of intricate patterns into cloth. Babbage had explicitly borrowed this function from Jacquard, adapting it to provide the programmed input to his Analytical Engine.

But it was not all poetics. She also saw the abstract capabilities of the Engine, writing

In studying the action of the Analytical Engine, we find that the peculiar and independent nature of the considerations which in all mathematical analysis belong to operations, as distinguished from the objects operated upon and from the results of the operations performed upon those objects, is very strikingly defined and separated.

Again, it might act upon other things besides number, were objects found whose mutual fundamental relations could be expressed by those of the abstract science of operations, and which should be also susceptible of adaptations to the action of the operating notation and mechanism of the engine.

Supposing, for instance, that the fundamental relations of pitched sounds in the science of harmony and of musical composition were susceptible of such expression and adaptations, the engine might compose elaborate and scientific pieces of music of any degree of complexity or extent.

Here she anticipates computers generating musical scores.

Most striking is Note G. This is where she explicitly describes how the Engine would be used to compute numerical values as solutions to complicated problems. She chose, as her own example, the calculation of Bernoulli numbers which require extensive numerical calculations that were exceptionally challenging even for the best human computers of the day. In Note G, Lovelace writes down the step-by-step process by which the Engine would be programmed by the Jacquard cards to carry out the calculations. In the history of computer science, this stands as the first computer program.
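
As a modern point of comparison only (this is a standard recurrence, not the sequence of Engine operations Lovelace actually laid out in Note G, which used a different indexing of the numbers), the Bernoulli numbers can be generated in a few lines of Python:

from fractions import Fraction
from math import comb

# Standard recurrence: sum_{k=0}^{n} C(n+1, k) * B_k = 0 for n >= 1, with B_0 = 1.
# A modern illustration, not the Engine program of Note G.

def bernoulli(n_max):
    B = [Fraction(1)]
    for n in range(1, n_max + 1):
        s = sum(comb(n + 1, k) * B[k] for k in range(n))
        B.append(-s / (n + 1))
    return B

print(bernoulli(8))   # 1, -1/2, 1/6, 0, -1/30, 0, 1/42, 0, -1/30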

Fig. 6 Table from Lovelace’s Note G on her method to calculate Bernoulli numbers using the Analytical Engine.

When it was time to publish, Babbage read over Lovelace’s notes, checking for accuracy, but he appears to have been uninterested in her speculations, possibly simply glossing over them. He saw his engine as a calculating machine for practical applications. She saw it for what we know today to be the exceptional adaptability of computers to all realms of human study and activity. He did not see what she saw. He was consumed by his Engine to the same degree as she, but where she yearned for the extraordinary, he sought funding for the mundane costs of machining and materials.

Ada’s Business Plan Pitch

Ada Lovelace watched in exasperation as Babbage floundered about with ill-considered proposals to the government while making no real progress towards a working Analytical Engine. Because of her vision into the potential of the Engine, a vision that struck her to her core, and seeing a prime opportunity to satisfy her own yearning to make an indelible mark on the world, she despaired of ever seeing it brought to fruition. Charles, despite his genius, was too impractical, wasting too much time on dead ends and incapable of performing the deft political dances needed to attract support. She, on the other hand, saw the project clearly and had the time and money and the talent, both mathematically and through her social skills, to help.

On Monday August 14, 1843, Ada wrote what might be the most heart-felt and impassioned business proposition in the history of computing. She laid out in clear terms to Charles how she could advance the Analytical Engine to completion if only he would surrender to her the day-to-day authority to make it happen. She was, in essence, proposing to be the Chief Operating Officer in a disruptive business endeavor that would revolutionize thinking machines a hundred years before their time. She wrote (she liked to underline a lot):

Firstly: I want to know whether if I continue to work on & about your own great subject, you will undertake to abide wholly by the judgment of myself (or of any persons whom you may now please to name as referees, whenever we may differ), on all practical matters relating to whatever can involve relations with any fellow-creature or fellow-creatures.

Secondly: can you undertake to give your mind wholly & undividedly, as a primary object that no engagement is to interfere with, to the consideration of all those matters in which I shall at times require your intellectual assistance & supervision; & can you promise not to slur & hurry things over; or to mislay, & allow confusion and mistakes to enter into documents, &c?

Thirdly: if I am able to lay before you in the course of a year or two, explicit & honorable propositions for executing your engine, (such as are approved by persons whom you may now name to be referred to for their approbation), would there be any chance of your allowing myself & such parties to conduct the business for you; your own undivided energies being devoted to the execution of the work; & all other matters being arranged for you on terms which your own friends should approve?

This is a remarkable letter from a self-possessed 27-year-old woman, laying out in explicit terms how she proposed to take on the direction of the project, shielding Babbage from the problems of relating to other people or “fellow-creatures” (which was his particular weakness), giving him time to focus his undivided attention on the technical details (which was his particular strength), while she would be the outward face of the project that would attract the appropriate funding.

In her preface to her letter, Ada adroitly acknowledges that she had been a romantic disappointment to Charles, but she pleads with him not to let their personal history cloud his response to her proposal. She also points out that her keen intellect would be an asset to the project and asks that he not dismiss it because of her sex (which a biased Victorian male would likely do). Despite her entreaties, this is exactly what Babbage did. Pencilled on the top of the original version of Ada’s letter in the Babbage archives is his simple note: “Tuesday 15 saw AAL this morning and refused all the conditions”. He had not even given her proposal 24 hours consideration as he indeed slurred and hurried things over.

Aftermath

Babbage never constructed his Analytical Engine and never even wrote anything about it. All his efforts would have been lost to history if Alan Turing had not picked up on Ada’s Notes and expanded upon them a hundred years later, bringing both her and him to the attention of the nascent computing community.

Ada Lovelace died young in 1852, at the age of 36, of cancer. By then she had moved on from Babbage and was working on other things. But she never was able to realize her ambition of uncovering such secrets of nature as to change the world.

Ada had felt from an early age that she was destined for greatness. She never achieved it in her lifetime and one can only wonder what she thought about this as she faced her death. Did she achieve it in posterity? This is a hotly debated question. Some say she wrote the first computer program, which may be true, but little programming a hundred years later derived directly from her work. She did not affect the trajectory of computing history. Discovering her work after the fact is interesting, but cannot be given causal weight in the history of science. The Vikings were the first Europeans to discover America, but no-one knew about it. They did not affect subsequent history the way that Columbus did.

On the other hand, Ada has achieved greatness in a different way. Now that her story is known, she stands as an exemplar of what scientific and technical opportunities look like, and the risk of ignoring them. Babbage also did not achieve greatness during his lifetime, but he could have—if he had not dismissed her and her intellect. He went to his grave embittered rather than lauded because he passed up an opportunity he never recognized.

By David D. Nolte, June 26, 2023


References

[1] Facsimile of “Sketch of the Analytical Engine Invented by Charles Babbage” translated by Ada Lovelace from Harvard University.

[2] Facsimile of Ada Lovelace’s “Notes by the Translator“.

[3] Stephen Wolfram, “Untangling the Tale of Ada Lovelace“, Wolfram Writings (2015).

[4] J. Essinger, “Charles and Ada : The computer’s most passionate partnership,” (History Press, 2019).

[5] D. Swade, The Difference Engine: Charles Babbage and the quest to build the first computer (Penguin Books, 2002).

[6] W. Gibson, and B. Sterling, The Difference Engine (Bantam Books, 1992).

[7] L. J. Snyder, The Philosophical Breakfast Club : Four remarkable friends who transformed science and changed the world (Broadway Books, 2011).

[8] Allan G. Bromley, Charles Babbage’s Analytical Engine, 1838, Annals of the History of Computing, Volume 4, Number 3, July 1982, pp. 196 – 217


Read more in Books by David Nolte at Oxford University Press

From Coal and Steam to ChatGPT: Chapters in the History of Technology

Mark Twain once famously wrote in a letter from London to a New York newspaper editor:

“I have … heard on good authority that I was dead [but] the report of my death was an exaggeration.”

The same may be true of recent reports on the grave illness and possible impending death of human culture at the hands of ChatGPT and other so-called Large Language Models (LLM).  It is argued that these algorithms have such sophisticated access to the bulk of human knowledge, and can write with apparent authority on virtually any topic, that no-one needs to learn or create anything new. It can all be recycled—the end of human culture!

While there may be a kernel of truth to these reports, they are premature.  ChatGPT is just the latest in a continuing string of advances that have disrupted human life and human culture ever since the invention of the steam engine.  We—humans, that is—weathered the steam engine in the short term and are just as likely to weather the LLM’s. 

ChatGPT: What is it?

For all the hype, ChatGPT is mainly just a very sophisticated statistical language model (SLM). 

To start with a very simple example of an SLM, imagine you are playing a word-scramble game and have the letter “Q”. You can be pretty certain that the “Q“ will be followed by a “U” to make “QU”.  Or if you have the initial pair “TH”, there is a very high probability that it will be followed by a vowel as “THA…”, “THE…”, ”THI…”, “THO…” or “THU…”, and possibly with an “R” as “THR…”.  This almost exhausts the possibilities.  This is all determined by the statistical properties of English.

Statistical language models build probability distributions for the likelihood that some sequence of letters will be followed by another sequence of letters, or that a sequence of words (and punctuation) will be followed by another sequence of words.  As the chains of letters and words get longer, the number of possible permutations grows exponentially.  This is why SLMs usually stop at some moderate order of statistics.  If you build sentences from such a model, it sounds OK for a sentence or two, but then it just drifts around like it’s dreaming or hallucinating in a stream of consciousness without any coherence.
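
A toy illustration of the idea in Python (a character-level bigram model built from one made-up sentence; real SLMs are built from enormous corpora and longer contexts): count which letter follows which, then generate by sampling from those counts.

import random
from collections import defaultdict

# Toy character-level statistical language model: count which letter follows
# which, then generate text by sampling from those conditional counts.

corpus = "the quick brown fox jumps over the lazy dog and the queen quietly quits"
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1                      # how often letter b follows letter a

def sample_next(ch):
    followers = counts[ch]
    letters = list(followers)
    weights = list(followers.values())
    return random.choices(letters, weights)[0]

text = "q"
for _ in range(20):
    text += sample_next(text[-1])          # 'q' is almost always followed by 'u'
print(text)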

ChatGPT works in much the same way.  It just extends the length of the sequences where it sounds coherent up to a paragraph or two.  In this sense, it is no more “intelligent” than the SLM that follows “Q” with “U”.  ChatGPT simply sustains the charade longer.

Now the details of how ChatGPT accomplishes this charade are nothing less than revolutionary.  The acronym GPT means Generative Pre-Trained Transformer.  Transformers were a new type of neural net architecture invented in 2017 by the Google Brain team.  Transformers removed the need to feed sentences word-by-word into a neural net, instead allowing whole sentences and even whole paragraphs to be input in parallel.  Then, by feeding the transformers more than a terabyte of textual data from the web, they absorbed the vast output of virtually all the crowd-sourced information from the past 20 years.  (This is what transformed the model from an SLM to an LLM.)  Finally, using humans to provide scores on what good answers looked like versus bad answers, ChatGPT was further trained to provide human-like responses.  The result is a chatbot that in any practical sense passes the Turing Test—if you query it for an extended period of time, you would be hard pressed to decide whether it was a computer program or a human giving you the answers.  But Turing Tests are boring and not very useful.

Figure. The Transformer architecture broken into the training step and the generation step. In training, pairs of inputs and targets are used to train encoders and decoders to build up word probabilities at the output. In generation, a partial input, or a query, is presented to the decoders that find the most likely missing, or next, word in the sequence. The sentence is built up sequentially in each iteration. It is an important distinction that this is not a look-up table … it is trained on huge amounts of data and learns statistical likelihoods, not exact sequences.
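
The core operation inside a transformer layer can be sketched in a few lines of Python (a conceptual sketch of scaled dot-product attention with placeholder sizes and random projections, not the full GPT architecture): every token compares its query against every other token's key, and the resulting weights mix the values into a context-aware representation.

import numpy as np

# Conceptual sketch of scaled dot-product attention, the core of a transformer.
# The sizes and random matrices below are placeholders, not GPT's parameters.

rng = np.random.default_rng(0)
n_words, d = 5, 16                           # a 5-token "sentence" of 16-dim embeddings
X = rng.normal(size=(n_words, d))            # token embeddings (placeholder values)
Wq, Wk, Wv = (0.2 * rng.normal(size=(d, d)) for _ in range(3))   # learned projections

Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys and values for every token
scores = Q @ K.T / np.sqrt(d)                # similarity of each query to every key
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)      # softmax: attention weights per token
attended = weights @ V                       # each token becomes a weighted mix of values
print(np.round(weights, 2))                  # rows sum to 1: who attends to whom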

The true value of ChatGPT is the access it has to that vast wealth of information (note it is information and not knowledge).  Give it almost any moderately technical query, and it will provide a coherent summary for you—on amazingly esoteric topics—because almost every esoteric topic has found its way onto the net by now, and ChatGPT can find it. 

As a form of search engine, this is tremendous!  Think how frustrating it has always been searching the web for something specific.  Furthermore, the lengthened coherence made possible by the transformer neural net means that a first query that leads to an unsatisfactory answer from the chatbot can be refined, and ChatGPT will find a “better” response, conditioned by the statistics of its first response that was not optimal.  In a feedback cycle, with the user in the loop, very specific information can be isolated.

Or, imagine that you are not a strong writer, or don’t know the English language as well as you would like.  By entering your own text, you can ask ChatGPT to do a copy-edit, even rephrasing your writing where necessary, because ChatGPT above all else has an unequaled command of the structure of English.

Or, for customer service, instead of the frustratingly discrete menu of 5 or 10 potted topics, ChatGPT with a voice synthesizer could respond to continuously finely graded nuances of the customer’s problem—not with any understanding or intelligence, but with probabilistic likelihoods of what the solutions are for a broad range of possible customer problems.

In the midst of all the hype surrounding ChatGPT, it is important to keep in mind two things:  First, we are witnessing the beginning of a revolution and a disruptive technology that will change how we live.  Second, it is still very early days, just like the early days of the first steam engines running on coal.

Disruptive Technology

Disruptive technologies are the coin of the high-tech realm of Silicon Valley.  But this is nothing new.  There have always been disruptive technologies—all the way back to Thomas Newcomen and James Watt and the steam engines they developed between 1712 and 1776 in England.  At first, steam engines were so crude they were used only to drain water from mines, increasing the number of jobs in and around the copper and tin mines of Cornwall (viz. the popular BBC series Poldark) and the coal mines of northern England.  But over the next 50 years, steam engines improved, and they became the power source for textile factories that displaced the cottage industry of spinning and weaving that had sustained marginal farms for centuries before.

There is a pattern to a disruptive technology.  It not only disrupts an existing economic model, but it displaces human workers.  Once-plentiful jobs in an economic sector can vanish quickly after the introduction of the new technology.  The change can happen so fast, that there is not enough time for the workforce to adapt, followed by human misery in some sectors.  Yet other, newer, sectors always flourish, with new jobs, new opportunities, and new wealth.  The displaced workers often never see these benefits because they lack skills for the new jobs. 

The same is likely true for the LLMs and the new market models they will launch. There will be a wealth of new jobs curating and editing LLM outputs. There will also be new jobs in the generation of annotated data and in the technical fields surrounding the support of LLMs. LLMs are incredibly hungry for high-quality annotated data in a form best provided by humans. Jobs unlikely to be at risk, despite prophesies of doom, include teachers who can use ChatGPT as an aide by providing appropriate context to its answers. Conversely, jobs that require a human to assemble information will likely disappear, such as news aggregators. The same will be true of jobs in which effort is repeated, or which follow a set of patterns, such as some computer coding jobs or data analysts. Customer service positions will continue to erode, as will library services. Media jobs are at risk, as well as technical writing. The writing of legal briefs may be taken over by LLMs, along with market and financial analysts. By some estimates, there are 300 million jobs around the world that will be impacted one way or another by the coming spectrum of LLMs.

This pattern of disruption is so set and so clear and so consistent that forward-looking politicians or city and state planners could plan ahead, because we have been on a path of continuing waves of disruption for over two hundred years.

Waves of Disruption

In the history of technology, it is common to describe a series of revolutions as if they were distinct.  The list looks something like this:

First:          Power (The Industrial Revolution: 1760 – 1840)

Second:     Electricity and Connectivity (Technological Revolution: 1860 – 1920)

Third:        Automation, Information, Cybernetics (Digital Revolution: 1950 – )

Fourth:      Intelligence, cyber-physical (Imagination Revolution: 2010 – )

The first revolution revolved around steam power fueled by coal, radically increasing output of goods.  The second revolution shifted to electrical technologies, including communication networks through telegraph and the telephones.  The third revolution focused on automation and digital information.

Yet this discrete list belies an underlying fact:  There is, and has been, only one continuous Industrial Revolution punctuated by waves.

The Age of Industrial Revolutions began around 1760 with the invention of the spinning jenny by James Hargreaves—and that Age has continued, almost without pause, up to today and will go beyond.  Each disruptive technology has displaced the last.  Each newly trained workforce has been displaced by the last.  The waves keep coming. 

Note that the fourth wave is happening now, as artificial intelligence matures. This is ironic, because this latest wave of the Industrial Revolution is referred to as the “Imagination Revolution” by the optimists who believe that we are moving into a period where human creativity is unleashed by the unlimited resources of human connectivity across the web. Yet this moment of human ascension to the heights of creativity is happening at just the moment when LLM’s are threatening to remove the need to create anything new.

So is it the end of human culture? Will all knowledge now just be recycled with nothing new added?

A Post-Human Future?

The limitations of the generative aspects of ChatGPT might be best visualized by using an image-based generative algorithm that has also gotten a lot of attention lately. This is the ability to input a photograph, and input a Van Gogh painting, and create a new painting of the photograph in the style of Van Gogh.

In such an example, the output looks like a Van Gogh painting. It is even recognizable as a Van Gogh. But in fact it is a parody. Van Gogh consciously created something never before seen by humans.

Even if an algorithm can create “new” art, it is a type of “found” art, like a picturesque stone formation or a sunset. The beauty becomes real only in the response it elicits in the human viewer. Art and beauty do not exist by themselves; they only exist in relationship to the internal state of the conscious observer, like a text or symbol signifying to an interpreter. The interpreter is human, even if the artist is not.

ChatGPT, or any LLM like Google’s Bard, can generate original text, but its value only resides in the human response to it. The human interpreter can actually add value to the LLM text by “finding” sections that are interesting or new, or that inspire new thoughts in the interpreter. The interpreter can also “edit” the text, to bring it in line with their aesthetic values. This way, the LLM becomes a tool for discovery. It cannot “discover” anything on its own, but it can present information to a human interpreter who can mold it into something that they recognize as new. From a semiotic perspective, the LLM can create the signifier, but the signified is only made real by the Human interpreter—emphasize Human.

Therefore, ChatGPT and the LLMs become part of the Fourth Wave of the human Industrial Revolution rather than replacing it.

We are moving into an exciting time in the history of technology, giving us a rare opportunity to watch as the newest wave of revolution takes shape before our very eyes. That said … just as the long-term consequences of the steam engine are only now coming home to roost two hundred years later in the form of threats to our global climate, the effect of ChatGPT in the long run may be hard to divine until far in the future—and then, maybe after it’s too late, so a little caution now would be prudent.

Resources

OpenAI ChatGPT: https://openai.com/blog/chatgpt/

Training GPT with human input: https://arxiv.org/pdf/2203.02155.pdf

Generative art: https://github.com/Adi-iitd/AI-Art

Status of Large Language Models: https://www.tasq.ai/blog/large-language-models/

LLMs at Google: https://blog.google/technology/ai/bard-google-ai-search-updates/

How Transformers work: https://towardsdatascience.com/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452

The start of the Transformer: https://arxiv.org/abs/1706.03762

Post-Modern Machine Learning: The Deep Revolution

The mysteries of human intelligence are among the final frontiers of science. Despite our pride in what science has achieved across the past century, we have stalled when it comes to understanding intelligence or emulating it. The best we have done so far is through machine learning — harnessing the computational power of computers to begin to mimic the mind, attempting to answer the essential question:

How do we get machines to Know what we Know?

In modern machine learning, the answer is algorithmic.

In post-modern machine learning, the answer is manifestation.

The algorithms of modern machine learning are cause and effect, rules to follow, producing only what the programmer can imagine. But post-modern machine learning has blown past explicit algorithms to embrace deep networks. Deep networks today are defined by neural networks with thousands, or tens of thousands, or even hundreds of thousands, of neurons arrayed in multiple layers of dense connections. The interactivity of so many crossing streams of information defies direct deconstruction of what the networks are doing — they are post-modern. Their outputs manifest themselves, self-assembling into simplified structures and patterns and dependencies that are otherwise buried unseen in complicated data.

Fig. 1 A representation of a deep network with three fully-connected hidden layers. Deep networks are typically three or more layers deep, but each layer can have thousands of neurons. (Figure from the TowardsDataScience blog.)

Deep learning emerged as recently as 2006 and has opened wide new avenues of artificial intelligence that move beyond human capabilities for some tasks.  But deep learning also has pitfalls, some of which are inherited from the legacy approaches of traditional machine learning, and some of which are inherent in the massively high-dimensional spaces in which deep learning takes place.  Nonetheless, deep learning has revolutionized many aspects of science, and there is reason for optimism that the revolution will continue. Fifty years from now, looking back, we may recognize this as the fifth derivative of the industrial revolution (Phase I: Steam. Phase II: Electricity. Phase III: Automation. Phase IV: Information. Phase V: Intelligence).

From Multivariate Analysis to Deep Learning

Conventional machine learning, as we know it today, has had many names.  It began with Multivariate Analysis of mathematical population dynamics around the turn of the last century, pioneered by Francis Galton (1874), Karl Pearson (1901), Charles Spearman (1904) and Ronald Fisher (1922) among others.

The first on-line computers during World War II were developed to quickly calculate the trajectories of enemy aircraft for gunnery control, introducing the idea of feedback control of machines. This was named Cybernetics by Norbert Wiener, who had participated in the development of automated control of antiaircraft guns.

Table I. Evolution of Names for Machine Learning

A decade later, during the Cold War, it became necessary to find hidden objects in large numbers of photographs. The embryonic electronic digital computers of the day were far too slow, with far too little memory, to do the task, so the Navy contracted with the Cornell Aeronautical Laboratory in Cheektowaga, New York, a suburb of Buffalo, to create an analog computer capable of real-time image analysis. This led to the invention of the Perceptron by Frank Rosenblatt, the first neural-network-inspired computer [1], building on the ideas of neural logic developed by Warren McCulloch and Walter Pitts.

Fig. 2 Frank Rosenblatt working on the Perceptron. (From the Cornell Chronicle)
Fig. 3 Rosenblatt’s conceptual design of the connectionism of the perceptron (1958).

Several decades passed with fits and starts as neural networks remained too simple to accomplish anything profound. Then in 1986, David Rumelhart and Ronald Williams at UC San Diego, together with Geoff Hinton at Carnegie Mellon, published a way to train multiple layers of neurons using a process called error back-propagation [2]. This publication opened the floodgates of Connectionism, also known as Parallel Distributed Processing. The late 1980s and much of the 1990s saw an expansion of interest in neural networks, until the increasing size of the networks ran into the limits of processing speed and memory of conventional computers toward the end of the decade. By that time it had become clear that the most interesting computations required many layers of many neurons, with the number of neurons expanding into the thousands, but training systems with tens of thousands of adjustable parameters proved difficult, and research in neural networks once again went into a lull.

The beginnings of deep learning came with two breakthroughs. The first was by Yann LeCun at Bell Labs, who in 1998 developed, with Léon Bottou, Yoshua Bengio and Patrick Haffner, a Convolutional Neural Network with seven layers of neurons that classified hand-written digits [3]. The second was from Geoff Hinton in 2006, by then at the University of Toronto, who discovered a fast learning algorithm for training deep layers of neurons [4]. By the mid-2010s, research on neural networks was hotter than ever, propelled in part by several very public successes, such as DeepMind's AlphaGo beating the world's best Go player in 2017, self-driving cars, personal assistants like Siri and Alexa, and YouTube recommendations.

The Challenges of Deep Learning

Deep learning today is characterized by neural network architectures composed of many layers of many neurons.  The nature of deep learning brings with it two main challenges:  1) efficient training of the neural weights, and 2) generalization of trained networks to perform accurately on previously unseen data inputs.

Solutions to the first challenge, efficient training, are what allowed the deep revolution in the first place: the result of combining increasing computer power with improvements in numerical optimization. This included faster personal computers that allowed nonspecialists to work with deep-network programming environments like MATLAB's Deep Learning Toolbox and Python frameworks such as TensorFlow.

Solutions to the second challenge, generalization, rely heavily on a process known as “regularization”. The term “regularization” has a slippery definition, an obscure history, and an awkward sound to it. Regularization is the noun form of the verb “to regularize” or “to make regular”. Originally, regularization was used to keep certain inverse algorithms from blowing up, like inverse convolutions, also known as deconvolution. Direct convolution is a simple mathematical procedure that “blurs” ideal data based on the natural response of a measuring system. However, if one has experimental data, one might want to deconvolve the system response from the data to recover the ideal data. But this procedure is numerically unstable and can “blow up”, often because of the divide-by-zero problem. Regularization was a simple technique that kept denominators from going to zero.
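As a concrete illustration, here is a minimal numerical sketch of regularized deconvolution (the signal, the Gaussian blur function, and the constant eps are all hypothetical choices for illustration, not drawn from any particular measurement). The naive inverse filter divides by the system response in the frequency domain and blows up wherever that response is nearly zero; adding a small constant to the denominator keeps the inversion regular.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "ideal" data and a Gaussian system response (blur function).
t = np.linspace(0.0, 1.0, 512)
ideal = np.exp(-((t - 0.5) / 0.02) ** 2)
psf = np.exp(-((t - 0.5) / 0.05) ** 2)
psf /= psf.sum()

# Direct convolution "blurs" the ideal data; add a little measurement noise.
blurred = np.real(np.fft.ifft(np.fft.fft(ideal) * np.fft.fft(psf)))
blurred += 1e-3 * rng.normal(size=t.size)

H = np.fft.fft(psf)

# Naive deconvolution would divide by H and blow up where |H| is tiny:
#     naive = np.fft.ifft(np.fft.fft(blurred) / H)
# A regularized (Wiener/Tikhonov-style) inverse keeps the denominator
# from going to zero with a small constant eps:
eps = 1e-2
recovered = np.real(np.fft.ifft(np.fft.fft(blurred) * np.conj(H)
                                / (np.abs(H) ** 2 + eps)))
```

The choice of eps trades resolution against noise amplification: too small and the noise blows up, too large and the recovered data stays blurred.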

Regularization became a common method for inverse problems, which are notoriously difficult to solve because of the many-to-one mapping that can occur in measurement systems: many different causes can produce the same measured response. Regularization was a way of winnowing out “unphysical” solutions so that the physical inverse solution remained.

Around the same time, regularization became a common tool used by quantum field theorists to prevent certain calculated quantities from diverging to infinity. The solution was again to keep the calculated quantities from blowing up by setting physical cut-off lengths on the calculations. These cut-offs were initially ad hoc, but the development of renormalization group theory by Kenneth Wilson at Cornell (Nobel Prize in 1982) provided a systematic approach to handling the infinities of quantum field theory.

With the advent of neural networks, having hundreds to thousands to millions of adjustable parameters, regularization became the catch-all term for fighting the problem of over-fitting. Over-fitting occurs when there are so many adjustable parameters that any training data can be fit, and the neural network becomes just a look-up table. Look-up tables are the ultimate hash code, but they have no predictive capability. If a slightly different set of data is fed into the network, the output can be anything. In over-fitting there is no generalization: the network simply learns the idiosyncrasies of the training data without “learning” the deeper trends or patterns that would allow it to generalize to handle different inputs.
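A toy example (with made-up data and arbitrary polynomial degrees, purely for illustration) makes the look-up-table behavior concrete: a polynomial with as many coefficients as training points can reproduce the noisy training data essentially exactly, yet between the training points its predictions are far worse than those of a low-order fit that actually captures the trend.

```python
import numpy as np

rng = np.random.default_rng(3)

# Ten noisy samples of a simple underlying trend (a parabola).
x_train = np.linspace(-1.0, 1.0, 10)
y_train = x_train ** 2 + 0.05 * rng.normal(size=x_train.size)

# Degree 9 has as many coefficients as there are data points:
# it memorizes the training set, noise and all.
overfit = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)
# Degree 2 has far fewer parameters: it captures the underlying trend.
simple = np.polynomial.Polynomial.fit(x_train, y_train, deg=2)

x_test = np.linspace(-1.0, 1.0, 101)
print("degree 9, training error:", np.max(np.abs(overfit(x_train) - y_train)))
print("degree 9, error off the training grid:", np.max(np.abs(overfit(x_test) - x_test ** 2)))
print("degree 2, error off the training grid:", np.max(np.abs(simple(x_test) - x_test ** 2)))
```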

Over the past decades a wide collection of techniques has been developed to reduce over-fitting of neural networks. These techniques include early stopping, k-fold holdout, drop-out, L1 and L2 weight-constraint regularization, as well as physics-based constraints. The goal of all of these techniques is to keep neural nets from becoming look-up tables and instead to learn the deep codependencies that exist within complicated data.

Table II. Regularization Techniques in Machine Learning

By judicious application of these techniques, combined with appropriate choices of network design, amazingly complex problems can be solved by deep networks, and they can generalize too (to some degree). As the field moves forward, we may expect additional regularization tricks to improve generalization, and design principles to emerge so that networks no longer need to be constructed by trial and error.
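As a sketch of how a few of these techniques look in practice (using the Keras interface of TensorFlow, with synthetic data and arbitrary hyperparameters rather than any particular published recipe), L2 weight penalties, drop-out, and early stopping can all be attached to a small network in a few lines:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers, callbacks

# Synthetic stand-in for a real training set (hypothetical data).
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 20)).astype("float32")
y = (x @ rng.normal(size=(20, 1))).astype("float32")

# L2 weight-constraint regularization and drop-out, two of the techniques listed above.
model = tf.keras.Sequential([
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping: halt training when the validation loss stops improving.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)

model.fit(x, y, validation_split=0.2, epochs=200,
          callbacks=[early_stop], verbose=0)
```

The point of the sketch is not the particular numbers but the division of labor: the weight penalty and drop-out constrain the network during training, while early stopping monitors held-out data so that training halts before the network starts memorizing.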

The Potential of Deep Learning

In conventional machine learning, one of the most critical first steps performed on a dataset has been feature extraction. This step is complicated and difficult, especially when the signal is buried either in noise or in confounding factors (also known as distractors). The analysis is often highly sensitive to the choice of features, and the selected features may not even be the right ones, leading to bad generalization. In deep learning, feature extraction disappears into the net itself. Optimizing the neural weights subject to appropriate constraints forces the network to find where the relevant information lies and what to do with it.

The key to finding the right information was not just having many neurons, but having many layers, which is where the potential of deep learning emerges. It is as if each successive layer learns a more abstract or more generalized form of the information than the last. This hierarchical layering is most evident in the construction of convolutional deep networks, where each layer receives a telescoping succession of information fields from the levels below. Geoff Hinton's Deep Belief Network, which launched the deep revolution in 2006, worked explicitly with this hierarchy in mind through direct design of the network architecture. Since then, network architecture has become more generalized, with less up-front design, relying instead on increasingly sophisticated optimization techniques during training to set the neural weights. As an example, a simplified deep network with three hidden layers of neurons is shown in Fig. 4.

Fig. 4 General structure of a deep network with three hidden layers. Layers will typically have hundreds or thousands of neurons. Each gray line represents a weight value, and each circle is a neural activation function.

The mathematical structure of a deep network is surprisingly simple. The equations for the network in Fig. 4, which convert an input x^a into an output y^e, take a form like the following (written here with a bias vector β for each layer):
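$$x^{b} = \varphi\!\left(W^{b}{}_{a}\,x^{a} + \beta^{b}\right)$$
$$x^{c} = \varphi\!\left(W^{c}{}_{b}\,x^{b} + \beta^{c}\right)$$
$$x^{d} = \varphi\!\left(W^{d}{}_{c}\,x^{c} + \beta^{d}\right)$$
$$y^{e} = \varphi\!\left(W^{e}{}_{d}\,x^{d} + \beta^{e}\right)$$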

These equations use index notation to denote vectors (single superscript) and matrices (double indices). A repeated index (one up and one down) denotes an implicit “Einstein” summation. The function φ(.) is known as the activation function, which is nonlinear. One of the simplest activation functions to use and analyze, and the current favorite, is the ReLU (rectified linear unit). Note that these equations represent a simple neural cascade, as the output of one layer becomes the input of the next.
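A minimal numerical sketch of this cascade (with randomly chosen weights and the assumption, based on Fig. 4, of three hidden layers of six neurons each) shows how short the forward pass really is:

```python
import numpy as np

def relu(z):
    # ReLU activation: zero for negative inputs, the identity for positive inputs.
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)

# Layer sizes for the toy network of Fig. 4: 5 inputs, three hidden layers
# of 6 neurons each (an assumption about the figure), and 5 outputs.
sizes = [5, 6, 6, 6, 5]
weights = [rng.normal(scale=0.5, size=(m, n))
           for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(scale=0.1, size=m) for m in sizes[1:]]

def forward(x):
    # The neural cascade: each layer's output becomes the next layer's input.
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)
    return x

print(forward(rng.normal(size=5)))

# Counting the adjustable parameters: 30 + 36 + 36 + 30 connection weights
# plus 23 biases gives the 155 quoted below.
print(sum(W.size for W in weights) + sum(b.size for b in biases))
```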

The training of all the matrix elements assumes a surprisingly simple optimization function, known as an objective function or a loss function, that can look, schematically, like
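$$L \;=\; \frac{1}{N}\sum_{n=1}^{N}\bigl\|\,y_{n} - y_{0,n}\bigr\|^{2} \;+\; \lambda \sum_{\text{layers}} \bigl\|W\bigr\|^{2}$$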

where the first term is the mean squared error of the network output y^e relative to the desired output y_0^e over the training set, and the second term, known as a regularization term (see the section above), is a quadratic form that keeps the weights from blowing up, with the constant λ setting its strength. This loss function is minimized over the set of adjustable matrix weights.
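In code, this loss (mean squared error plus a quadratic weight penalty, with an arbitrary value for λ) is only a few lines:

```python
import numpy as np

def loss(y_pred, y_true, weight_matrices, lam=1e-3):
    # First term: mean squared error between network outputs and desired
    # outputs, averaged over the N examples of the training set.
    mse = np.mean(np.sum((y_pred - y_true) ** 2, axis=1))
    # Second term: quadratic regularization that keeps the weights from
    # blowing up, with lam setting its strength.
    penalty = lam * sum(np.sum(W ** 2) for W in weight_matrices)
    return mse + penalty

# Example with random stand-in values (hypothetical shapes and data).
rng = np.random.default_rng(4)
y_pred = rng.normal(size=(100, 5))
y_true = rng.normal(size=(100, 5))
weights = [rng.normal(size=(6, 5)), rng.normal(size=(6, 6)),
           rng.normal(size=(6, 6)), rng.normal(size=(5, 6))]
print(loss(y_pred, y_true, weights))
```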

The network in Fig. 4 is just a toy, with only 5 inputs and 5 outputs and only 23 neurons. Even so, it has 30 + 36 + 36 + 30 = 132 connection weights plus 23 biases, one per neuron, for 155 adjustable parameters. If this seems like overkill, it is nothing compared to neural networks with thousands of neurons per layer and tens of layers. That massive overkill is exactly the power of deep learning, as well as its pitfall.

The Pitfalls of Deep Learning

Despite the impressive advances in deep learning, serious pitfalls remain for practitioners. One of the most challenging problems in deep learning is the saddle-point problem. A saddle point in an objective function is like a mountain pass: at the top of the pass the ground slopes downward in two opposite directions into the valleys, but slopes upward in the two orthogonal directions toward the peaks. A saddle point is an unstable equilibrium where a slight push this way or that can send the traveller into one of two very different valleys separated by high mountain ridges. In our familiar three-dimensional space, saddle points are relatively rare and landscapes are dominated by valleys and mountaintops. But this intuition about landscapes fails in high dimensions.
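A small numerical experiment hints at how different high dimensions really are (it models the curvature at a random critical point as a random symmetric matrix, a common simplification rather than a statement about any particular network): as the dimension grows, the chance that every direction curves upward, making the point a true minimum rather than a saddle, collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)

def fraction_of_minima(dim, trials=2000):
    # Sample random symmetric "Hessians" and count how often all of the
    # eigenvalues are positive (a minimum); otherwise the critical point
    # is a saddle (or a maximum).
    count = 0
    for _ in range(trials):
        A = rng.normal(size=(dim, dim))
        H = (A + A.T) / 2.0
        if np.all(np.linalg.eigvalsh(H) > 0.0):
            count += 1
    return count / trials

for dim in (1, 2, 4, 8, 16):
    print(dim, fraction_of_minima(dim))
```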

Landscapes in high dimensions are dominated by neutral ridges that span the domain of the landscape. This key understanding about high-dimensional space actually came from the theory of evolutionary dynamics for the evolution of species. In the early days of evolutionary dynamics, there was a serious challenge to understand how genetic mutations could allow such diverse speciation to occur. If the fitness of a species were viewed as a fitness landscape, and if a highly-adapted species were viewed as a mountain peak in this landscape, then genetic mutations would need to drive the population state point into “valleys of death” that would need to be crossed to arrive at a neighboring fitness peak. It would seem that genetic mutations would likely kill off the species in the valleys before they could rise to the next equilibrium.

However, the geometry of high dimensions does not follow this simple low-dimensional intuition. As more dimensions become available, landscapes have more and more ridges of relatively constant height that span the full space (see my recent blog on random walks in 10 dimensions and my short YouTube video). For a species to move from one fitness peak to another in a fitness landscape (in ultra-high-dimensional mutation space), all that is needed is for a genetic mutation to step the species off the fitness peak onto a neutral ridge, where further mutations can keep the species on that ridge as it moves ever farther away from the last fitness peak. Eventually, the neutral ridge brings the species near a new fitness peak where it can climb to the top, creating a new stable species. The point is that most genetic mutations are neutral: they do not impact the survivability of an individual. This is known as the neutral network theory of evolution, building on the neutral theory proposed by Motoo Kimura (1924 – 1994) [5]. As these mutations accumulate, the offspring can get genetically far from the progenitor. And when a new fitness peak comes near, many of the previously neutral mutations can come together and become a positive contribution to fitness as the species climbs the new fitness peak.

The neutral network of genetic mutation was a paradigm shift in the field of evolutionary dynamics, and it also taught everyone how different random walks in high-dimensional spaces are from random walks in 3D. But although neutral networks solved the evolution problem, they become a two-edged sword in machine learning. On the positive side, fitness peaks are just like the minima of objective functions, and the ability for partial solutions to perform random walks along neutral ridges in the objective-function space allows optimal solutions to be found across a broad range of the configuration space of the neural weights. However, on the negative side, ridges are loci of unstable equilibrium. Hence there are always multiple directions that a solution state can go to minimize the objective function. Each successive run of a deep-network neural weight optimizer can find equivalent optimal solutions — but they each can be radically different. There is no hope of averaging the weights of an ensemble of networks to arrive at an “average” deep network. The averaging would simply drive all weights to zero. Certainly, the predictions of an ensemble of equivalently trained networks can be averaged—but this does not illuminate what is happening “under the hood” of the machine, which is where our own “understanding” of what the network is doing would come from.

Post-Modern Machine Learning

Post-modernism is admittedly kind of a joke — it works so hard to pull down every aspect of human endeavor that it falls victim to its own methods. The convoluted arguments made by its proponents sound like ultimate slacker talk — circuitous logic circling itself in an endless loop of denial.

But post-modernism does have its merits. It surfs on the moving crest of what passes as modernism, as modernism passes onward to its next phase. The philosophy of post-modernism moves beyond rationality in favor of a subjectivism in which cause and effect are blurred.  For instance, in post-modern semiotic theory, a text or a picture is no longer an objective element of reality, but fragments into multiple semiotic versions, each one different for each different reader or watcher — a spectrum of collaborative efforts between each consumer and the artist. The reader brings with them a unique set of life experiences that interact with the text to create an entirely new experience in each reader’s mind.

Deep learning is post-modern in the sense that deterministic algorithms have disappeared. Instead of a traceable path of sequential operations, neural nets scramble information into massively-parallel strings of partial information that cross and interact nonlinearly with other massively-parallel strings. It is difficult, if not impossible, to trace any definable part of the information from input to output. The output simply manifests some aspect of the data that was hidden from human view.

But the Age of Artificial Intelligence is not here yet. The vast multiplicity of saddle ridges in high dimensions is one of the drivers of one of the biggest pitfalls of deep learning: the need for massive amounts of training data. Because there are so many adjustable parameters in a neural network, and hence so many dimensions, a tremendous amount of training data is required to train a network to convergence. This aspect of deep learning stands in strong contrast to human children, who can be shown a single picture of a bicycle next to a tricycle and can then achieve almost perfect classification accuracy when shown any number of photographs of different bicycles and tricycles. Humans can generalize from an amazingly small amount of data, while deep networks often need thousands of examples. This example alone points to the marked difference between human intelligence and the current state of deep learning. There is still a long road ahead.

By David D. Nolte, April 18, 2022


[1] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, pp. 386-408, (1958)

[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533-536, Oct (1986)

[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, (1998)

[4] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527-1554, Jul (2006)

[5] M. Kimura, The Neutral Theory of Molecular Evolution. Cambridge University Press, 1983.