AI Computation – Why use GPUs?

In the semiconductor industry, chips are usually categorized into digital chips and analog chips. Digital chips occupy a larger market share, around 70%.

Digital chips can be further subdivided into logic chips, memory chips, and microcontroller units (MCUs). Logic chips, also known as computing chips, contain various logic gate circuits that can perform arithmetic and logical operations, making them one of the most common types of chips.

Well-known components like CPUs, GPUs, FPGAs, and ASICs all fall under the category of logic chips. The so-called “AI chips” that are particularly popular nowadays also primarily refer to these.

CPU (Central Processing Unit)

Let’s start with the CPU, the most familiar component to everyone. Its full name is the Central Processing Unit, the heart of a computer.

Modern computers are based on the von Neumann architecture developed in the 1940s. This architecture includes components such as the arithmetic logic unit (ALU), control unit (CU), memory, input devices, and output devices.

When data arrives, it is first stored in memory. The control unit then retrieves the appropriate data from memory and passes it to the ALU for processing. Once the calculation is complete, the result is returned to memory.

This process has a more technical name: the instruction cycle of fetch, decode, execute, memory access, and write back.
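The cycle above can be sketched as a toy von Neumann machine in a few lines. Everything here is invented for illustration: the opcodes, the single accumulator register, and the program layout are simplifications, not any real instruction set.

```python
# A toy von Neumann machine: instructions and data share one memory,
# and the control loop runs fetch -> decode -> execute -> write back.
# Opcodes, register names, and the program below are invented for illustration.

def run(memory):
    regs = {"A": 0}          # a single accumulator register
    pc = 0                   # program counter
    while True:
        instr = memory[pc]               # fetch
        op, *args = instr.split()        # decode
        if op == "LOAD":                 # execute: A <- memory[addr]
            regs["A"] = memory[int(args[0])]
        elif op == "ADD":                # execute: A <- A + memory[addr]
            regs["A"] += memory[int(args[0])]
        elif op == "STORE":              # memory access / write back
            memory[int(args[0])] = regs["A"]
        elif op == "HALT":
            return memory
        pc += 1

# Program at addresses 0-3, data at addresses 4-6.
mem = ["LOAD 4", "ADD 5", "STORE 6", "HALT", 2, 3, 0]
print(run(mem)[6])  # the sum 2 + 3 lands in memory cell 6
```

Note that the program and its data live in the same memory array: that shared storage is the defining trait of the von Neumann architecture.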

As you can see, the core functions of the ALU and the control unit are both handled by the CPU. Specifically, the ALU (including adders, subtractors, multipliers, and dividers) performs arithmetic and logical operations – the real work. The control unit reads instructions from memory, decodes them, and executes them – it’s like a choreographer.

Apart from the ALU and control unit, the CPU also includes components like the clock module, registers, and high-speed cache.

The clock module manages the CPU’s timing, providing a stable time base. It periodically issues signals to drive all operations within the CPU and schedule the work of various modules.

Registers are small, high-speed storage units within the CPU used to temporarily hold instructions and data. Together with the cache, they act as a "buffer" between the CPU and main memory (RAM): far faster than regular memory, they prevent memory accesses from stalling the CPU's work.

The capacity and access speed of registers and cache affect how often the CPU must go out to main memory, and thereby the efficiency of the entire system. We'll discuss this further when we cover memory chips.

CPUs are generally classified by instruction set architecture into x86 and non-x86. The x86 architecture primarily uses complex instruction set computing (CISC), while most non-x86 architectures use reduced instruction set computing (RISC).

PCs and most servers use the x86 architecture, dominated by Intel and AMD. Non-x86 architectures are more diverse, with ARM, MIPS, Power, RISC-V, Alpha, and others gaining popularity in recent years. We’ll cover these in more detail later.

GPU (Graphics Processing Unit)

Next, let's look at the GPU, short for Graphics Processing Unit. It is the core component of a graphics card.

However, a GPU is not equivalent to a graphics card. Besides the GPU itself, a graphics card includes components like video memory (VRAM), VRM voltage-regulation modules, buses, cooling fans, and peripheral interfaces.

In 1999, NVIDIA first introduced the concept of the GPU. The reason for introducing the GPU was the rapid development of the gaming and multimedia industries in the 1990s. These industries placed higher demands on computers’ 3D graphics processing and rendering capabilities, which traditional CPUs couldn’t handle, leading to the introduction of the GPU to share this workload.

Based on their form, GPUs can be classified as discrete GPUs (dGPU) and integrated GPUs (iGPU), commonly known as dedicated and integrated graphics, respectively.

Like the CPU, the GPU is a computing chip and includes components such as the ALU, control unit, and registers.

However, because the GPU’s primary responsibility is graphics processing, its internal architecture differs significantly from the CPU’s.

CPUs have a relatively small number of cores (including ALUs), typically no more than a few dozen. But CPUs have a substantial amount of cache and a complex control unit (CU).

This design is due to the CPU being a general-purpose processor. As the computer’s core, it has to handle complex tasks, including different types of data computations and responding to human-computer interactions.

Complex conditions, branches, and task synchronization require a significant amount of branch jumping and interrupt handling. The CPU needs larger caches to store various task states, reducing latency during task switching. It also requires a more complex control unit for logical control and scheduling.

The CPU's strengths lie in management and scheduling. Its raw "working" capacity is comparatively weak: the ALU occupies only about 5-20% of the chip area.

If we consider the processor as a restaurant, the CPU would be like an all-purpose restaurant with a few dozen high-level chefs. This restaurant can handle all types of cuisine, but because it offers so many menu items, it needs to spend a considerable amount of time coordinating and preparing dishes, resulting in relatively slower serving times.

The GPU, on the other hand, is entirely different.

The GPU was born for graphics processing, and its tasks are very specific and singular. Its job is graphics rendering. Graphics are composed of vast numbers of pixels, which are highly uniform and independent data.

Therefore, the GPU’s task is to complete a large amount of parallel computation on homogeneous data in the shortest possible time. The so-called “miscellaneous work” of scheduling and coordination is relatively minimal.

Parallel computation, of course, requires more cores.

The number of cores in a GPU far exceeds that of a CPU, ranging from thousands to tens of thousands (hence the term “massively parallel”).
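The kind of workload this suits is easy to show: a per-pixel operation where every output depends only on its own input. The sketch below uses NumPy on the CPU purely as a stand-in; the image values and the brightness operation are made up for illustration, and the point is that the "parallel" version expresses the whole job as one uniform operation over all pixels, which is the pattern a GPU maps onto thousands of cores.

```python
import numpy as np

# Each pixel's brightness adjustment is independent of every other pixel,
# so the whole image can be processed as one data-parallel operation --
# the same pattern a GPU spreads across thousands of cores.
# (NumPy runs on the CPU; it is used here only to illustrate the idea.)

def brighten_serial(image, delta):
    out = image.copy()
    h, w = image.shape
    for y in range(h):            # one pixel at a time, like a single core
        for x in range(w):
            out[y, x] = min(image[y, x] + delta, 255)
    return out

def brighten_parallel(image, delta):
    # one uniform operation applied to every pixel at once
    return np.minimum(image + delta, 255)

img = np.array([[10, 250], [100, 200]], dtype=np.int32)
assert (brighten_serial(img, 20) == brighten_parallel(img, 20)).all()
print(brighten_parallel(img, 20))  # [[ 30 255] [120 220]]
```

Both versions compute the same result; the difference is that the second has no loop-carried dependencies at all, so the hardware is free to process every pixel simultaneously.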

In NVIDIA's terminology, the GPU's cores are grouped into streaming multiprocessors (SMs), which are independent task-processing units.

The GPU contains multiple such processing areas, and each SM packs hundreds of cores. Each core is equivalent to a stripped-down CPU, capable of integer and floating-point arithmetic, as well as queuing and result collection.

The GPU's control unit is simple, and it has relatively little cache; its ALUs occupy over 80% of the chip area.

Although a single GPU core has weaker processing power than a CPU core, the massive number of cores makes the GPU more suitable for high-intensity parallel computation. Under the same transistor scale, its computing power can surpass that of a CPU.

Using the restaurant analogy again, a GPU would be like a single-purpose restaurant with thousands of entry-level chefs. It’s only suitable for preparing a specific type of cuisine. But because there are many chefs and simple dish preparation, they can collectively cook and serve dishes much faster.

GPU and AI Computation

Everyone knows that current AI computation relies heavily on GPUs, which has been highly profitable for NVIDIA. But why is this the case?

The reason is simple: AI computation, like graphics computation, involves a large amount of high-intensity parallel computation.

Deep learning is currently the dominant approach in artificial intelligence. It involves two stages: training and inference.

During the training stage, a complex neural network model is trained by feeding it a massive amount of data. In the inference stage, the trained model is used to infer various conclusions from a large amount of data.

The training stage involves vast amounts of training data and complex deep neural network structures, requiring tremendous computational power and high-performance chips. The inference stage demands efficient and low-latency repetitive computations on simple and specific tasks.
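The two stages above can be sketched with a toy one-parameter model. The data, learning rate, and iteration count below are made up for illustration: "training" is the compute-heavy, iterative part (repeated gradient updates), while "inference" is a single cheap forward pass with the frozen weight.

```python
# A minimal sketch of the two stages of deep learning, shrunk to a
# one-parameter linear model y = w * x fitted by gradient descent.
# The data and hyperparameters are invented for illustration.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]        # generated by the true rule y = 2x

w = 0.0
lr = 0.01
for _ in range(500):             # training: repeated gradient updates
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

def infer(x):                    # inference: one multiply, no gradients
    return w * x

print(round(w, 3))        # converges near 2.0
print(round(infer(5.0), 2))
```

Real networks have billions of parameters instead of one, but the asymmetry is the same: training loops over the data many times computing gradients, while inference only runs the forward pass.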

The specific algorithms used, such as matrix multiplication, convolution, recurrent layers, and gradient computations, can be broken down into many parallel tasks, effectively shortening the time required to complete the overall task.
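Matrix multiplication illustrates this decomposition directly: each output row depends only on one row of A and all of B, so the rows are fully independent tasks. In this CPU-only sketch a thread pool stands in for GPU cores; the matrices are invented for illustration.

```python
# Matrix multiplication decomposes into independent tasks: each output
# row needs only one row of A plus all of B, with no coordination
# between rows -- exactly the structure GPU cores exploit.
# A thread pool stands in for GPU cores in this CPU-only sketch.

from concurrent.futures import ThreadPoolExecutor

def matmul_row(a_row, B):
    # one independent task: compute a single output row
    cols = len(B[0])
    return [sum(a_row[k] * B[k][j] for k in range(len(B)))
            for j in range(cols)]

def parallel_matmul(A, B):
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda row: matmul_row(row, B), A))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(parallel_matmul(A, B))  # [[19, 22], [43, 50]]
```

On a GPU the split is far finer: every individual output element (or small tile of elements) becomes its own thread, which is why thousands of cores can all stay busy on a single multiplication.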

With its powerful parallel computing capabilities and memory bandwidth, the GPU can effectively handle both training and inference tasks, becoming the industry’s preferred solution for deep learning.

Currently, most enterprise AI training utilizes NVIDIA GPU clusters. With reasonable optimization, a single GPU card can provide computing power equivalent to dozens or even hundreds of CPU servers.

However, the GPU’s market share in the inference stage is not as high. We’ll discuss the reasons for this later.

The application of GPUs to computing beyond graphics processing originated in 2003, when the concept of GPGPU (General-Purpose computing on GPU) was first introduced. It referred to using the GPU’s computing power for more general and broader scientific computing in non-graphics processing areas.

GPGPU took the traditional GPU design and further optimized it to make it better suited for high-performance parallel computing.

In 2009, several scholars at Stanford University first demonstrated the use of GPUs to train deep neural networks, causing a sensation.

A few years later, in 2012, Geoffrey Hinton’s two students – Alex Krizhevsky and Ilya Sutskever – used the “deep learning + GPU” approach to propose the deep neural network AlexNet, which increased recognition accuracy from 74% to 85%, winning the ImageNet challenge championship.

This completely ignited the “AI+GPU” boom. NVIDIA quickly followed up, investing substantial resources to increase GPU performance by 65 times within three years.

In addition to brute-forcing computing power, they actively built an ecosystem around GPUs. They established the CUDA (Compute Unified Device Architecture) ecosystem based on their GPUs, providing a comprehensive development environment and solutions to help developers more easily use GPUs for deep learning development or high-performance computing.

These early strategic efforts ultimately helped NVIDIA reap significant rewards when AI and generative AI took off. Currently, with a market capitalization of $1.22 trillion (nearly six times that of Intel), NVIDIA is truly the uncrowned king of AI.
