IBM Aims To Reduce Power Needed For Neural Net Training By 100x

With the rush to incorporate AI into nearly everything, there is an insatiable demand for the computing and electrical power needed. As a result, the power-hungry GPUs that are used today are starting to give way to lower-cost, lower-power, custom devices when it comes to running production-scale trained neural networks. However, the time-consuming training process has been slower to yield to new architectures. IBM Research, which brought TrueNorth — one of the first custom inferencing chips — to life, is aiming to do it again with a hybrid analog-digital chip architecture that can also train fully-connected deep neural networks.

Neural Networks Were Unleashed By the Modern GPU

CPUs have almost always been built on a von Neumann architecture, and have been since their invention. Data and programs are loaded from some type of memory into a processing unit, and results are written back. Early versions were limited to one operation at a time, but of course we now have multi-core CPUs, multi-threaded cores, and other techniques to achieve some parallelism. In contrast, our brains, which were the original inspiration for neural networks, have billions of neurons that are all capable of doing something at the same time. While they aren’t all working on the same task, there can still be a stunning number of parallel operations going on essentially constantly in our minds.

This total mismatch in architecture is one reason that neural networks floundered for decades after their invention. There wasn’t enough performance, even on the fastest computers, to make them a reality. The invention of the modern GPU changed that. By having hundreds or thousands of very-high-speed, relatively simple cores connected to fast memory, it became practical to train and run the types of neural networks that have many layers (called Deep Neural Networks or DNNs) and can be used to solve real-world problems effectively.

Custom Silicon For Inferencing Is Now a Proven Technology

As powerful as they have become, GPUs still have their limits when put up against the industry’s seemingly insatiable demand for AI solutions. For starters, GPUs’ fairly general circuitry has proven to be overkill for the runtime portion of AI, called inferencing, so a crop of specialized chips like Google’s Tensor Processing Unit (TPU) and Intel’s Movidius Myriad chips have been brought to market. Even Facebook is working on its own silicon for running neural networks. Those chips still use a fairly traditional parallel architecture; they’ve just done a great job of optimizing it for the task at hand.

IBM’s TrueNorth chip, in contrast, is built to more directly model the human brain, simulating a million neurons using specialized circuitry. It achieves impressive power saving for inferencing, but isn’t suited to the important task of training networks. Now, IBM researchers think they have found a way to extend the power savings of using neuromorphic (brain-like) circuitry similar to that found in TrueNorth, along with some ideas borrowed from resistive computing, to achieve massive power savings in network training.

Resistive Computing May Come Back As an Efficient AI Platform

One of the largest bottlenecks of traditional computers when they are used to run neural networks is the reading and writing of data. In particular, each node (or neuron) in a neural network needs to store (during training) and retrieve (during both training and inferencing) many weights. Even with fast GPU RAM, fetching them is a bottleneck. So designers have tapped into a technology called resistive computing to find ways to store the weights right in the analog circuitry that implements the neuron. They’re taking advantage of the fact that neurons don’t have to be very precise, so close is often good enough. When we wrote about IBM’s work in this area in 2016, it was mostly aimed at speeding up inferencing. That was because of some of the issues inherent in trying to use it for training. Now one group at IBM thinks they have found the solution to those issues.

The crossbar architecture is modular and also allows for both forward and backward propagation

Hybrid Architecture Aims to Lower AI-Training Power by 100x

The IBM team, writing in the journal Nature, has come up with a hybrid analog plus digital design that aims to address the shortcomings of resistive computing for training. For starters, they have implemented a simulated chip that uses a crossbar architecture, which allows for massively parallel calculation of the output of a neuron based on the sum of all its weighted inputs. Essentially, it’s a hardware implementation of matrix math. Each small crossbar block in the chip can be connected in a variety of ways, so it can model fairly deep or wide networks up to the capacity of the chip — 209,400 synapses in the team’s current simulation version.
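To make the "hardware implementation of matrix math" concrete, here is a minimal sketch in plain Python of what an idealized crossbar computes. It is an illustration under simplifying assumptions, not the IBM circuit: the function names and the sample values are invented for this example, and real devices add noise and nonlinearity. Each crosspoint holds a weight (a conductance), inputs are applied to the rows, and each column sums its weighted inputs. Driving the columns instead and reading the rows applies the same weights in the transposed direction, which is what backward propagation needs.

```python
def crossbar_forward(conductances, row_inputs):
    """Column outputs: out[j] = sum over i of G[i][j] * x[i]."""
    n_rows = len(row_inputs)
    n_cols = len(conductances[0])
    return [
        sum(conductances[i][j] * row_inputs[i] for i in range(n_rows))
        for j in range(n_cols)
    ]


def crossbar_backward(conductances, col_inputs):
    """Row outputs when the columns are driven: the transpose of the forward pass."""
    return [
        sum(g * s for g, s in zip(row, col_inputs))
        for row in conductances
    ]


# Hypothetical 3-input, 2-output block; values chosen only for illustration.
G = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

forward = crossbar_forward(G, [1.0, 0.0, 2.0])   # weighted sums per output neuron
backward = crossbar_backward(G, [1.0, -1.0])     # error signals sent back to inputs
print(forward, backward)
```

In hardware, every one of those multiply-and-add steps happens at once in the analog domain, which is where the parallelism comes from.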

But that doesn’t do any good if all those synapses can’t get the data they need fast enough. Until now, the memory used in this type of experimental AI chip has been either very fast but volatile, with limited precision or dynamic range, or slower Phase-Change Memory (PCM), which is persistent but has lower write performance. The team’s proposed design borrows from the brain to meet both needs by separating the short-term and long-term storage for each neuron. Data needed for computation is kept in volatile, but very fast, short-term analog memory. This includes all the weights needed for each synapse of every neuron. During training, the weights are periodically off-loaded into persistent PCM, which also has a larger capacity. The short-term weights are then reset, so the limited range of the analog memory isn’t overrun.
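The split between short-term and long-term storage can be sketched as follows. This is a toy model, assuming a single synapse whose effective weight is the sum of the two stores; the class name, the clipping limit, and the transfer interval are hypothetical values chosen for illustration and are not taken from the paper.

```python
class HybridSynapse:
    """Toy synapse: fast, range-limited short-term store plus persistent long-term store."""

    def __init__(self, short_term_limit=1.0):
        self.long_term = 0.0           # stands in for persistent PCM
        self.short_term = 0.0          # stands in for fast, volatile analog memory
        self.limit = short_term_limit  # assumed usable range of the short-term store

    @property
    def weight(self):
        return self.long_term + self.short_term

    def update(self, delta):
        # Training updates land in the short-term store, which saturates at its limit.
        raw = self.short_term + delta
        self.short_term = max(-self.limit, min(self.limit, raw))

    def transfer(self):
        # Periodically off-load into long-term storage, then reset the short-term part.
        self.long_term += self.short_term
        self.short_term = 0.0


def train(synapse, deltas, transfer_every):
    for step, delta in enumerate(deltas, start=1):
        synapse.update(delta)
        if step % transfer_every == 0:
            synapse.transfer()
    return synapse


deltas = [0.5] * 8
with_transfer = train(HybridSynapse(), deltas, transfer_every=2)
without_transfer = train(HybridSynapse(), deltas, transfer_every=100)
print(with_transfer.weight, without_transfer.weight)
```

With periodic transfers the synapse accumulates all eight updates; without them the short-term store saturates and later updates are lost, which is the overrun the reset is meant to prevent.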

The concept is pretty simple, but the implementation is not. Device physics heavily affect analog circuitry, so the researchers have proposed a series of techniques, including using voltage differentials and swapping polarities periodically, to minimize the errors that could creep into the system during prolonged operation.

In Simulation, the Chip Is Competitive With Software At 1/100th the Power

The team hasn’t actually built the spiffy 3-transistor circuit that is the heart of the hybrid synapses. But it has modeled it in SPICE, and connected it to real PCM to run tests. Overall, the chip performs quite well compared with a more traditional computer running the same model using TensorFlow, both scoring just over 97 percent on MNIST, for example.

However, since the chip is only capable of running fully-connected layers, like those found at the higher layers of most deep models, there are limits to what it can do. It can run MNIST (the classic digit recognition benchmark) essentially un-aided, but for image recognition tasks like CIFAR it needs to have a pre-trained model for the feature-recognition layers. Fortunately, that type of transfer learning (using a pre-trained model for the feature extraction layers) has become fairly common, so it doesn’t have to be a big stumbling block for the new approach.
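As a rough illustration of that division of labor, the sketch below keeps a "pre-trained" feature stage fixed and trains only a single fully-connected layer on top of it. Everything here is a stand-in: `frozen_features` is a hypothetical hand-written function rather than a real pre-trained network, the four-pixel samples are made up, and the simple perceptron update is used only because it fits in a few lines, not because the chip trains that way.

```python
def frozen_features(pixels):
    """Stand-in for pre-trained feature layers: fixed and never updated."""
    left = sum(pixels[:2])
    right = sum(pixels[2:])
    return [left, right, 1.0]  # final entry acts as a bias input


def predict(weights, features):
    activation = sum(w * f for w, f in zip(weights, features))
    return 1 if activation > 0 else 0


def train_fc_layer(samples, epochs=20, lr=0.1):
    """Train only the fully-connected layer; the feature stage stays frozen."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for pixels, label in samples:
            feats = frozen_features(pixels)
            error = label - predict(weights, feats)
            weights = [w + lr * error * f for w, f in zip(weights, feats)]
    return weights


# Made-up data: label 1 when the left half is brighter, 0 otherwise.
samples = [
    ([1.0, 1.0, 0.0, 0.0], 1),
    ([0.9, 0.8, 0.1, 0.0], 1),
    ([0.0, 0.0, 1.0, 1.0], 0),
    ([0.1, 0.0, 0.7, 0.9], 0),
]

weights = train_fc_layer(samples)
predictions = [predict(weights, frozen_features(p)) for p, _ in samples]
print(predictions)
```

Only `train_fc_layer` touches any weights, which mirrors the idea that the trainable fully-connected portion could live on the new hardware while the feature extraction is done elsewhere.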

Are Hybrid Chips the Future for Neural Networks?

As impressive as these research results are, they come with a lot of very specific device tweaks and compromises. On its own, it’s hard for me to see anything quite this specialized becoming mainstream. What I think is important, and what makes this and the other resistive computing research worth writing about, is that we have an existence proof of the ultimate neuromorphic computer, the brain, and of how powerful and efficient it is. So it makes sense to keep looking for ways we can learn from it and incorporate those lessons into our computing architectures for AI. Don’t be surprised if someday your GPU features hybrid cores.

[Image credits: Nature Magazine]
