FPGA, ASIC, GPU: Who Is the Most Suitable AI Chip?

ASIC (Application Specific Integrated Circuit)

GPUs have strong parallel computing capabilities, but they also have drawbacks such as high power consumption, large size, and high cost.

Entering the 21st century, the demand for computing power has shown two significant trends: first, the usage scenarios for computing power have diversified; second, users’ requirements for computing performance have become increasingly demanding. General-purpose computing chips alone can no longer meet users’ needs.

Therefore, more and more companies have begun to strengthen their research and investment in dedicated computing chips. ASIC (Application Specific Integrated Circuit) is one such chip designed for specific tasks.

The official definition of ASIC is an integrated circuit that is designed and manufactured to meet the requirements of a specific user or the needs of a specific electronic system.

ASICs started in the 1970s-80s. Initially, they were used in computers. Later, they were mainly used for embedded control. In recent years, as mentioned earlier, they have risen to prominence in applications such as AI inference, high-speed searching, and vision and image processing.

When discussing ASICs, it’s impossible not to mention Google’s famous TPU.

TPU stands for Tensor Processing Unit. A tensor is a mathematical entity that contains multiple numbers (a multidimensional array).

Currently, nearly all machine learning systems use tensors as the basic data structure. Therefore, a Tensor Processing Unit can be simply understood as an “AI processing unit.”
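
To make the idea of a tensor as a multidimensional array concrete, here is a minimal sketch in plain Python. The helper name shape and the sample values are made up for illustration; real frameworks use their own optimized tensor types rather than nested lists.

```python
def shape(tensor):
    """Return the dimensions of a rectangular nested list, outermost first."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        # Descend into the first element; stop on an empty list.
        tensor = tensor[0] if tensor else None
    return tuple(dims)


# Illustrative values only.
scalar = 7.0                                               # rank 0: a single number
vector = [1.0, 2.0, 3.0]                                   # rank 1
matrix = [[1, 2, 3], [4, 5, 6]]                            # rank 2: 2 rows x 3 columns
image = [[[0, 0, 0] for _ in range(4)] for _ in range(2)]  # rank 3: 2 x 4 x 3

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("image", image)]:
    print(name, "has shape", shape(value))
```

A chip built around tensors spends most of its silicon on multiplying and adding arrays like these, which is why the name fits.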

In 2015, to better handle its deep learning workloads and increase its AI computing power, Google deployed a chip built specifically for neural network inference, the TPU v1.

Compared to traditional CPUs and GPUs, TPU v1 can achieve 15-30 times the performance improvement in neural network computations, with energy efficiency improvements reaching 30-80 times, causing significant industry impact.

In 2017 and 2018, Google introduced the more powerful TPU v2 and TPU v3, which support both AI training and inference. In 2021, it launched TPU v4, built on a 7nm process with 22 billion transistors. Google reported roughly 10 times the system-level performance of the previous generation and up to 1.7 times the performance of Nvidia’s A100.

Besides Google, many other large companies have also been developing ASICs in recent years.

Intel acquired the Israeli AI chip company Habana Labs at the end of 2019 and released the Gaudi 2 ASIC chip in 2022. IBM Research released an AI ASIC chip, AIU, at the end of 2022.

Samsung also ventured into ASICs a few years ago, focusing on chips specifically for mining machines. Indeed, many people’s first encounter with ASICs was through Bitcoin mining. Compared to GPU and CPU mining, ASIC mining machines are more efficient and consume less energy.

Aside from TPUs and mining machines, two other well-known types of ASIC chips are DPUs and NPUs.

DPU stands for Data Processing Unit, primarily used in data centers.

NPU, or Neural Processing Unit, simulates human neurons and synapses at the circuit level and processes data using a deep learning instruction set.

NPUs are specifically used for neural network inference, capable of efficiently performing operations such as convolution and pooling. This type of chip is often integrated into some smartphone chips.

Speaking of smartphone chips, it’s worth mentioning that the main chip in our phones, commonly known as the SoC chip, is also a type of ASIC chip.

What are the advantages of ASICs as specialized custom chips? Is it simply that a company gets an exclusive chip carrying its own logo and name?

Not at all.

Customization is akin to tailoring. Based on the specific tasks the chip is aimed at, its computing power and efficiency are strictly matched to the task’s algorithms. The number of cores, the ratio of logical computing units to control units, and cache, among other aspects of the chip architecture, are also precisely customized.

Therefore, custom specialized chips can achieve optimal size and power consumption. For their target task, these chips’ reliability, confidentiality, computing power, and energy efficiency are all stronger than those of general-purpose chips (CPU, GPU).

You may have noticed that the companies mentioned earlier as ASIC developers are all major enterprises such as Google, Intel, IBM, and Samsung.

This is because custom chip design requires a high level of R&D technical capability from a company, as well as significant investment.

Creating an ASIC chip first involves complex design processes such as code design, synthesis, and backend, followed by months of production, processing, and packaging tests before the chip can be used to build systems.

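“Tape-out” is a term many have heard of. Strictly, it marks the handoff of the finished design to the foundry; in everyday usage it also covers the first manufacturing run, in which the chip is produced through a series of process steps, much like an assembly line. Simply put, it’s trial production.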

The development process for ASICs requires tape-out. A 14nm process tape-out costs around 3 million USD, while a 5nm process can cost up to 47.25 million USD.

If tape-out fails, the investment is lost, and significant time and energy are wasted. Smaller companies simply cannot afford this.

So, does this mean small companies cannot engage in chip customization?

Certainly not. This is where another marvel comes into play, the FPGA.

FPGA (Field Programmable Gate Array)

FPGA, fully known as Field Programmable Gate Array, has become very popular in the industry in recent years, even being dubbed as the “universal chip.”

Simply put, an FPGA is a reconfigurable chip. It can be reprogrammed again and again after manufacturing, according to the user’s needs, to achieve the desired digital logic functions.

FPGAs can be customized in this do-it-yourself way because of their unique architecture.

An FPGA consists of Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), Programmable Interconnect Resources (PIRs), and Static Random Access Memory (SRAM).

CLBs are the most important part of an FPGA, serving as the basic unit for realizing logic functions and carrying the main circuit functions. They are typically arranged in a grid formation across the chip, known as a Logic Cell Array (LCA).

IOBs mainly facilitate the interface between the chip’s logic and external pins, usually arranged around the chip’s perimeter.

PIRs provide a wealth of wiring resources, including vertical and horizontal interconnects, programmable switch matrices, and programmable connection points. They play a crucial role in connecting different components to form circuits with specific functions.

Static Random Access Memory (SRAM) is used to store the programming data for IOBs, CLBs, and PIRs, thereby controlling them to complete the system’s logic functions.

CLBs themselves mainly consist of Look-Up Tables (LUTs), Multiplexers, and Flip-Flops. These components carry individual logic “gates” within the circuit, enabling the implementation of complex logic functions.

LUTs can be thought of as RAMs that store computed results. When a user describes a logic circuit, the software calculates all possible outcomes and writes them into the LUT. Each logic operation is equivalent to inputting an address to fetch a result from the LUT, which then returns the corresponding outcome.

This “hardware-based” computation method offers faster processing speeds.
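
The look-up idea can be sketched in a few lines of Python. This is only a software model under simplifying assumptions: the function names build_lut and lut_read are invented here, and the full adder is just a convenient example circuit, not something taken from a vendor toolchain.

```python
from itertools import product


def build_lut(logic_fn, num_inputs):
    """Precompute logic_fn for every input combination, as design software would."""
    return [logic_fn(*bits) & 1 for bits in product((0, 1), repeat=num_inputs)]


def lut_read(table, *bits):
    """Evaluate the logic by using the input bits as a memory address."""
    address = 0
    for bit in bits:
        address = (address << 1) | bit  # first input is the most significant bit
    return table[address]


# Example circuit (an assumption for illustration): a 1-bit full adder
# built from two 3-input LUTs, one for the sum bit and one for the carry bit.
SUM_LUT = build_lut(lambda a, b, cin: a ^ b ^ cin, 3)
CARRY_LUT = build_lut(lambda a, b, cin: (a & b) | (b & cin) | (a & cin), 3)

for a, b, cin in product((0, 1), repeat=3):
    s = lut_read(SUM_LUT, a, b, cin)
    c = lut_read(CARRY_LUT, a, b, cin)
    print(f"a={a} b={b} cin={cin} -> sum={s} carry={c}")
```

Once the tables are filled in, no logic is evaluated at run time: each result is a single indexed read, which is the software analogue of what a LUT does in hardware.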

When using an FPGA, users can complete circuit designs using hardware description languages (Verilog or VHDL) and then “program” (write) the design onto the FPGA to implement the corresponding functions.

Upon power-up, the FPGA reads data from an EPROM (Erasable Programmable Read-Only Memory) into the SRAM. Once configured, the FPGA enters its operational state. When the power is turned off, the FPGA reverts to a blank slate, with its internal logic erased. This process can be repeated, enabling “on-site” customization.

FPGAs are incredibly powerful. In theory, if the FPGA provides a sufficiently large scale of gate circuits, it can replicate any ASIC’s logic functions through programming.

Let’s look at the development history of FPGA.

FPGA emerged from the foundation of programmable devices such as PAL (Programmable Array Logic) and GAL (Generic Array Logic), and belongs to a type of semi-custom circuit.

It was invented in 1985 by Xilinx. Later, companies such as Altera, Lattice, and Microsemi also entered the FPGA field, eventually forming a quartet of industry leaders.

In June 2015, Intel announced that it would acquire Altera for a staggering $16.7 billion; after the deal closed, it folded Altera into its PSG (Programmable Solutions Group) division.

In October 2020, not to be outdone, Intel’s competitor AMD announced that it would acquire Xilinx for $35 billion; the deal closed in February 2022.

This resulted in a big four of Xilinx (under AMD), Intel, Lattice, and Microsemi: the owners changed, but the lineup stayed essentially the same.

In 2021, the market shares of these four companies were 51%, 29%, 7%, and 6%, respectively, totaling 93% of the global market share.

In October 2023, Intel announced plans to spin off its PSG division, operating it as an independent business.

Let’s delve into the differences between ASIC and FPGA, as well as their distinctions from CPUs and GPUs.

Both ASIC and FPGA are types of chips at their core. ASICs are fully custom chips with fixed functions that cannot be altered, while FPGAs are semi-custom chips that offer flexibility and high adaptability.

To illustrate the difference between the two, we can use an analogy.

An ASIC is like manufacturing toys with a mold. You need to create a mold beforehand, which is quite a hassle. Once the mold is made, it cannot be modified. If you want to create a new toy, you have to make a new mold.

On the other hand, an FPGA is like building toys with LEGO bricks. You can start building right away and complete your project in a short time. If you’re not satisfied or wish to build something new, you can simply take it apart and start over.

Many of the design tools for ASICs and FPGAs are the same. In terms of design processes, FPGAs are not as complex as ASICs; they eliminate some manufacturing processes and extra design verification steps, roughly amounting to only 50%-70% of the ASIC process. The particularly burdensome tape-out process required for ASICs is not needed for FPGAs.

This means that developing an ASIC could take several months or even more than a year, while FPGA development might only take a few weeks or months.

Regarding the earlier point about FPGAs not requiring tape-out, does this mean that FPGAs are always cheaper than ASICs?

Not necessarily.

FPGAs are manufactured in advance and can be programmed in the lab or on-site, so they incur no non-recurring engineering (NRE) costs. However, as a “general-purpose toy,” an FPGA’s unit cost is about 10 times that of an ASIC (the “mold-made toy”).

If the production volume is low, then FPGAs are cheaper. If the volume is high and the NRE costs for ASICs are amortized, then ASICs become cheaper.

This is similar to mold costs. Creating a mold is expensive, but if the sales volume is high, the investment in the mold becomes worthwhile.

In a typical cost comparison, the 40,000-unit mark serves as a boundary line between the cost-effectiveness of ASICs and FPGAs. Below this threshold, FPGAs are cheaper; above it, ASICs are more economical.
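
The crossover can be worked out with simple arithmetic. The sketch below uses hypothetical prices: the NRE figure and both unit costs are assumptions, chosen only so that the FPGA costs 10 times as much per unit and the break-even lands at 40,000 units. Real numbers vary widely with process node and device.

```python
import math


def total_cost(units, nre, unit_cost):
    """Total spend: one-time NRE plus per-unit cost times volume."""
    return nre + units * unit_cost


def break_even_units(asic_nre, asic_unit_cost, fpga_unit_cost):
    """Smallest volume at which the ASIC total is no higher than the FPGA total."""
    saving_per_unit = fpga_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        raise ValueError("the ASIC must be cheaper per unit for a break-even to exist")
    return math.ceil(asic_nre / saving_per_unit)


# Hypothetical figures in USD, for illustration only.
ASIC_NRE = 3_600_000   # one-time design and tape-out cost (assumed)
ASIC_UNIT = 10         # per-chip cost of the ASIC (assumed)
FPGA_UNIT = 100        # per-chip cost of the FPGA, 10x the ASIC (assumed)

print("break-even volume:", break_even_units(ASIC_NRE, ASIC_UNIT, FPGA_UNIT))
for units in (10_000, 40_000, 100_000):
    asic = total_cost(units, ASIC_NRE, ASIC_UNIT)
    fpga = total_cost(units, 0, FPGA_UNIT)  # FPGA has no NRE in this model
    print(f"{units:>7} units: ASIC ${asic:,} vs FPGA ${fpga:,}")
```

With these assumed inputs the two totals meet at 40,000 units; a higher NRE or a smaller per-unit saving pushes the crossover to a larger volume.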

From a performance and power consumption perspective, as specialized custom chips, ASICs are stronger than FPGAs.

FPGAs are general-purpose and editable chips, which means they tend to have a lot of redundant features. No matter how you design it, there will always be some excess components.

ASICs, on the other hand, are custom-tailored with no waste and use hardwiring. Therefore, they offer stronger performance and lower power consumption.

The relationship between FPGAs and ASICs is not simply one of competition and substitution; rather, they each have their own specific roles.

FPGAs are commonly used for product prototype development, design iterations, and certain applications with low production volumes. They are suitable for products that require short development cycles. FPGAs are also often used for ASIC verification.

ASICs are used for designing large-scale, complex chips, or for products that are highly mature and produced in large volumes.

FPGAs are particularly suitable for beginners learning and participating in competitions. Many universities’ electronics-related majors now use FPGAs for teaching.

From a commercial perspective, the main application areas for FPGAs include telecommunications, defense, aerospace, data centers, medical, automotive, and consumer electronics.

FPGAs were adopted early in the field of communications. Many processing chips in base stations (for baseband processing, beamforming, antenna transceivers, and so on) are FPGAs. They are also used in core networks for encoding and protocol acceleration. Data centers have also employed FPGAs in components such as DPUs.

Later, as many technologies matured and became standardized, telecommunications equipment manufacturers began to replace FPGAs with ASICs to reduce costs.

It’s worth mentioning that the recently popular Open RAN technology mainly uses general-purpose processors (such as Intel CPUs) for computation. This approach’s energy consumption is far higher than that of FPGAs and ASICs. This is one of the main reasons why equipment manufacturers, including Huawei, are reluctant to follow the Open RAN trend.

In the automotive and industrial sectors, FPGAs are valued for their latency advantages, so they are used in ADAS (Advanced Driver Assistance Systems) and servo motor drives.

FPGAs are used in consumer electronics because of the rapid iteration of products. The development cycle for ASICs is too long; by the time a product is developed, the market may have moved on.

FPGA, ASIC, GPU: Who Is the Most Suitable AI Chip?

Finally, we circle back to the topic of AI chips.

AI computing is divided into training and inference. GPUs hold an absolute leading position in training, but not in inference.

First and foremost, it’s important to remember that, purely from a theoretical and architectural standpoint, ASICs and FPGAs have the edge over CPUs and GPUs in performance and cost for a fixed workload.

CPUs and GPUs follow the von Neumann architecture, where every instruction goes through stages such as fetching from memory, decoding, and execution. Shared memory access also involves arbitration and caching.

FPGAs and ASICs, however, are not tied to the von Neumann architecture (their designs are often likened to the Harvard architecture, with separate paths for control and data). Taking FPGAs as an example, they essentially operate without instructions and do not require shared memory.

The functions of FPGA logic units are determined at programming time, representing the hardware implementation of software algorithms. For the need to save states, FPGAs use registers and on-chip memory (BRAM) that belong to their own control logic, avoiding the need for arbitration and caching.

Looking at the proportion of the chip devoted to ALUs (Arithmetic Logic Units), GPUs have a higher ratio than CPUs. FPGAs have almost no control modules, so nearly all of their logic can serve as compute units, giving them an even higher ratio than GPUs.

Therefore, for workloads that map well onto their logic, FPGAs can in principle compute faster than GPUs.

Now, let’s consider power consumption.

GPUs are notoriously power-hungry, with a single chip consuming up to 250W or even 450W (RTX 4090). FPGAs, on the other hand, generally consume only 30~50W.

This is primarily due to memory access. The memory interface of GPUs (GDDR5, HBM, HBM2) has extremely high bandwidth, about 4-5 times that of traditional FPGA DDR interfaces. However, in terms of the chip itself, the energy consumed in reading DRAM is more than 100 times that of SRAM. The frequent DRAM reads by GPUs result in very high power consumption.

Additionally, the operating frequency of FPGAs (below 500MHz) is lower than that of CPUs and GPUs (1~3GHz), which also contributes to their lower power consumption. The lower operating frequency of FPGAs is mainly due to limitations in routing resources. Some connections require longer paths, and if the clock frequency is too high, it becomes unmanageable.
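
The link between clock frequency and power can be illustrated with the standard first-order model for dynamic power in CMOS logic, P = activity x capacitance x voltage squared x frequency. The sketch below is a rough illustration, not a measurement: the activity factor, capacitance, and voltage are placeholder values held equal for both chips, which real devices would not share.

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS dynamic power in watts: a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Placeholder electrical parameters (assumed identical for both cases).
ACTIVITY = 0.2          # fraction of nodes switching per cycle (assumed)
CAPACITANCE = 1e-7      # effective switched capacitance in farads (assumed)
VOLTAGE = 0.9           # supply voltage in volts (assumed)

slow_clock = dynamic_power(ACTIVITY, CAPACITANCE, VOLTAGE, 300e6)   # 300 MHz
fast_clock = dynamic_power(ACTIVITY, CAPACITANCE, VOLTAGE, 1.5e9)   # 1.5 GHz
power_ratio = fast_clock / slow_clock

print(f"300 MHz: {slow_clock:.2f} W, 1.5 GHz: {fast_clock:.2f} W, "
      f"ratio: {power_ratio:.1f}x")
```

With everything else held equal, dynamic power scales linearly with frequency, so running at one fifth of the clock rate cuts this component of power to one fifth. In practice voltage and capacitance differ between chips as well.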

Finally, let’s look at latency.

GPU latency is higher than FPGA.

GPUs typically need to group samples into fixed-size “batches” to maximize parallelism, so several samples must be collected before they can be processed together.

FPGAs operate on a batch-less architecture. They can output immediately after processing a data packet, offering an advantage in latency.
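
A toy timing model shows where the latency gap comes from. All numbers below are hypothetical (batch size, arrival interval, and compute times are assumptions), and the model ignores queuing, data transfer, and overlap between batches.

```python
def batched_latencies(batch_size, arrival_interval_ms, batch_compute_ms):
    """Per-sample latency when a device waits for a full batch before computing."""
    batch_ready_at = (batch_size - 1) * arrival_interval_ms  # last sample arrives
    done_at = batch_ready_at + batch_compute_ms
    # Sample i arrived at i * interval, so it waited from then until done_at.
    return [done_at - i * arrival_interval_ms for i in range(batch_size)]


def streaming_latency(per_sample_compute_ms):
    """Latency when each sample is processed as soon as it arrives."""
    return per_sample_compute_ms


# Hypothetical workload: one sample arrives every millisecond.
BATCH_SIZE = 8
ARRIVAL_INTERVAL_MS = 1.0
BATCH_COMPUTE_MS = 2.0     # time to process a whole batch at once (assumed)
STREAM_COMPUTE_MS = 0.5    # time to process one sample in a pipeline (assumed)

latencies = batched_latencies(BATCH_SIZE, ARRIVAL_INTERVAL_MS, BATCH_COMPUTE_MS)
print("batched, worst case:", max(latencies), "ms")
print("batched, best case: ", min(latencies), "ms")
print("streaming:          ", streaming_latency(STREAM_COMPUTE_MS), "ms")
```

In this model the first sample of each batch waits for seven more arrivals before any computation starts, so its latency is dominated by waiting rather than by compute. Batching can still win on throughput, which is why it suits training better than latency-sensitive inference.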

So, the question arises. If GPUs are inferior to FPGAs and ASICs in these respects, why have they become so popular in AI computing?

The answer is simple. In the pursuit of ultimate computational performance and scale, the industry currently does not care about costs or power consumption.

Thanks to Nvidia’s long-term efforts, the core count and operating frequency of GPUs have continuously increased, and the chip size has grown, representing a brute-force approach to computational power. Power consumption is handled through process technology and cooling methods such as water cooling; as long as the chip doesn’t overheat, that is considered good enough.

Beyond hardware, as mentioned in the previous article, Nvidia has also strategically developed its software and ecosystem.

CUDA, developed by Nvidia, is a core competitive advantage of GPUs. Based on CUDA, even beginners can quickly get started with GPU development. Their years of diligent work have also built a solid user base.

In comparison, the development of FPGAs and ASICs is still too complex and not suitable for widespread adoption.

In terms of interfaces, GPUs mainly rely on a single type (PCIe) and are less flexible than FPGAs, whose programmability lets them connect to almost any standard or non-standard interface. For servers, however, PCIe is sufficient and offers plug-and-play convenience.

Like FPGAs, ASICs also struggle against GPUs in AI, in their case because of high costs, long development cycles, and significant development risks. With AI algorithms changing rapidly, the development cycle for ASICs is a critical issue.

For these reasons, GPUs have enjoyed their current success.

In AI training, the powerful computational abilities of GPUs can significantly improve efficiency.

In AI inference, where the input is generally a single object (like an image), the requirements are lower, and there’s less need for parallelism, making the computational advantage of GPUs less apparent. Many companies have started to opt for cheaper and more energy-efficient FPGAs or ASICs for this work.

For other computational scenarios, the choice depends on the absolute performance requirement for computational power. GPUs are the first choice for those prioritizing outright computational performance. For situations where computational performance demands are not as high, FPGAs or ASICs could be considered for cost savings.
