Over the past few years, Nvidia has established itself as a major leader in machine learning and artificial intelligence processing. The GPU designer dove into the HPC market over a decade ago when it launched the G80 and its parallel compute platform API, CUDA. Early leadership has paid off for Nvidia; the company holds 87 spots on the TOP500 list of supercomputers, compared with just 10 for Intel. But as machine learning and artificial intelligence workloads proliferate, vendors are emerging to give Nvidia a run for its money, including Google’s new Cloud TPU. New benchmarks from RiseML put both Nvidia and Google’s TPU head-to-head — and the cost curve strongly favors Google.
Because ML and AI are both new and emerging fields, it’s important that tests be conducted fairly and that the benchmark runs don’t favor one architecture over the other simply because best testing parameters aren’t well-known. To that end, RiseML allowed both Nvidia and Google engineers to review drafts of their test results and to offer comments and suggestions. The company also states its figures have been reviewed by an additional panel of outside experts in the field.
The comparison is between four Google TPUv2 chips (which form one Cloud TPU) against 4x Nvidia Volta GPUs. Both have 64GB of total RAM and the data sets were trained in the same fashion. RiseML tested the ResNet-50 model (exact configuration details are available in the blog post) and the team investigated both raw performance (throughput), accuracy, and convergence (an algorithm converges when its output comes closer and closer to a specific value).
The suggested batch size for TPUs is 1024, but other batch sizes were tested at reader request. Nvidia does perform better at those lower batch sizes. In accuracy and convergence, the TPU solution is somewhat better (76.4 percent top-1 accuracy for Cloud TPU, compared with 75.7 percent for Volta). Improvements to top-end accuracy are difficult to come by, and the RiseML team makes the small difference between the two solutions out to be more important than you might think. But where Google’s Cloud TPU really wins, at least right now, is on pricing.
Ultimately, what matters is the time and cost it takes to reach a certain accuracy. If we assume an acceptable solution at 75.7 percent (the best accuracy achieved by the GPU implementation), we can calculate the cost to achieve this accuracy based on required epochs and training speed in images per second. This excludes time to evaluate the model in-between epochs and training startup time.
As shown above, the current pricing of the Cloud TPU allows to train a model to 75.7 percent on ImageNet from scratch for $ 55 in less than 9 hours! Training to convergence at 76.4 percent costs $ 73. While the V100s perform similarly fast, the higher price and slower convergence of the implementation results in a considerably higher cost-to-solution.
Google may be subsidizing its cloud processor pricing, and the exact performance characteristics of ML chips will vary depending on implementation and programmer skill. This is far from the final word on Volta’s performance, or even Volta as compared with Google’s Cloud TPU. But at least for now, in ResNet-50, Google’s cloud TPU appears to offer nearly identical performance at substantially lower prices.