The Biggest Challenge Of Inference Chips
In less than a year, generative AI has gained global fame and usage through OpenAI’s ChatGPT, a popular chatbot built on the Transformer architecture. Transformer-based models learn the complex relationships between the elements of an input, such as the words of a sentence or question, and turn them into human-like conversation.

Led by Transformer and other large language models (LLMs), software algorithms have made rapid progress, while the processing hardware that executes them has fallen behind. Even the most advanced processors lack the performance required to handle a current ChatGPT query within a timeframe of one or two seconds.


To make up for the shortfall in performance, leading semiconductor companies build systems out of large numbers of their best hardware processors, trading off power consumption, bandwidth/latency, and cost in the process. This approach is suitable for algorithm training, but not for inference deployed on edge devices.


Power Consumption Challenge
Training is typically based on fp32 or fp64 floating-point arithmetic and generates large amounts of data, but it has no strict latency requirements, so high power consumption and high cost can be tolerated.
Inference is quite different. It typically runs on fp8 arithmetic and still produces large amounts of data, but it demands critical latency, low energy consumption, and low cost.
Model training is addressed with large computing installations: they run for days, draw large amounts of electricity, generate large amounts of heat, and are expensive to acquire, install, operate, and maintain. Inference fares even worse, hitting a wall that hinders the proliferation of GenAI on edge devices.
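To make the difference in data volume concrete, here is a minimal sketch comparing the raw weight storage implied by training-style and inference-style number formats; the 20-billion-parameter model size is an arbitrary assumption for illustration.

```python
# Rough weight-storage footprint at different numeric precisions.
# The 20-billion-parameter model size is an arbitrary assumption for illustration.
PARAMS = 20e9

BYTES_PER_PARAM = {"fp64": 8, "fp32": 4, "fp8": 1}

for fmt, nbytes in BYTES_PER_PARAM.items():
    gigabytes = PARAMS * nbytes / 1e9
    print(f"{fmt}: {gigabytes:,.0f} GB of weights")
```

Moving from fp64 training precision to fp8 inference precision cuts the data to be stored and moved by a factor of eight, which is part of why inference can even be contemplated on edge devices.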

The latest technology for generative AI inference at the edge
A successful GenAI inference hardware accelerator must satisfy five requirements:
  • High processing power and efficiency (more than 50%) in the petaflops range
  • Low latency, providing query responses within seconds
  • Energy consumption limited to 50W/Petaflops or less
  • Affordable and compatible with edge applications
  • Field programmability to accommodate software updates or upgrades, avoiding factory hardware modifications
Most existing hardware accelerators meet some, but not all, of these requirements. Legacy CPUs are the worst choice because of their unacceptable execution speed; GPUs offer considerable speed (hence their use for training) but suffer from high power consumption and insufficient latency; FPGAs compromise on performance and latency.
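To make the five requirements concrete, here is a minimal sketch that checks a candidate accelerator's datasheet numbers against them; all figures for the example device are invented purely for illustration.

```python
# Minimal checklist for the five edge GenAI inference requirements listed above.
# The candidate device's figures are invented purely for illustration.
candidate = {
    "peak_petaflops": 1.0,       # peak processing power
    "efficiency": 0.55,          # sustained fraction of peak (>50% required)
    "latency_s": 1.5,            # seconds per query response
    "watts": 45.0,               # power draw at this peak throughput
    "field_programmable": True,
}

checks = {
    "petaflops-class compute":  candidate["peak_petaflops"] >= 1.0,
    "efficiency above 50%":     candidate["efficiency"] > 0.50,
    "response within seconds":  candidate["latency_s"] <= 2.0,
    "under 50 W per Petaflops": candidate["watts"] / candidate["peak_petaflops"] <= 50.0,
    "field programmable":       candidate["field_programmable"],
}

for name, ok in checks.items():
    print(f"{name}: {'pass' if ok else 'fail'}")
```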
The perfect device would be a custom, programmable system-on-chip (SoC) designed to execute Transformer-based algorithms as well as other algorithm types still under development. It should provide enough memory capacity to store the large amounts of data embedded in an LLM, and it should be programmable to accommodate field upgrades.
Two obstacles stand in the way of this goal: memory walls and the high energy consumption of CMOS devices.

Memory Wall
It was observed early in the history of semiconductor development that advances in processor performance were offset by a lack of progress in memory access.
Over time, the gap between the two has grown, forcing the processor to wait longer and longer for memory to deliver data. The result is a drop in processor efficiency from full 100% utilization (Figure 1).
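A back-of-the-envelope way to see the effect: when the processor can compute faster than memory can feed it, utilization is capped by the ratio of the two. A minimal sketch, using illustrative compute, bandwidth, and arithmetic-intensity figures that are assumptions rather than measurements:

```python
# Utilization capped by the memory wall: compute stalls while waiting for data.
# All figures below are illustrative assumptions, not measurements.
peak_flops = 1e15          # 1 Petaflops of peak compute
mem_bandwidth = 3e12       # 3 TB/s of memory bandwidth
arithmetic_intensity = 2   # FLOPs performed per byte fetched (low for LLM inference)

# Maximum FLOP rate the memory system can sustain at this arithmetic intensity.
memory_bound_flops = mem_bandwidth * arithmetic_intensity

utilization = min(1.0, memory_bound_flops / peak_flops)
print(f"Achievable utilization: {utilization:.1%}")   # ~0.6% in this example
```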

To alleviate this loss of efficiency, the industry devised a multi-level memory hierarchy that places faster, more expensive memory technologies and multiple cache levels close to the processor, minimizing traffic to slower main memory and to even slower external memory (Figure 2).
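A standard way to quantify the benefit of such a hierarchy is the average memory access time (AMAT); the access fractions and latencies below are illustrative assumptions.

```python
# Average memory access time (AMAT) across a simple three-level hierarchy.
# Access fractions and latencies are illustrative assumptions, not measured values.
levels = [
    # (name, fraction of accesses served at this level, access latency in ns)
    ("L1 cache",    0.90,   1),
    ("L2 cache",    0.08,   5),
    ("main memory", 0.02, 100),
]

amat_ns = sum(fraction * latency for _, fraction, latency in levels)
print(f"AMAT: {amat_ns:.1f} ns")   # ~3.3 ns vs. 100 ns if every access went to DRAM
```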


CMOS IC energy consumption
Counterintuitively, the power consumption of a CMOS IC is determined primarily by data movement rather than by data processing. According to Stanford University research led by Professor Mark Horowitz (Table 1), memory accesses consume orders of magnitude more energy than basic digital logic operations.

The power consumption of adders and multipliers ranges from less than one picojoule for integer arithmetic to a few picojoules for floating-point arithmetic. By comparison, the energy spent accessing data in cache jumps by an order of magnitude, to 20-100 picojoules, while accessing data in DRAM jumps by three orders of magnitude, to more than 1,000 picojoules.
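To put these orders of magnitude side by side, here is a small sketch; the per-operation energies are representative values within the ranges cited above and in Table 1, not exact figures.

```python
# Energy per operation in picojoules: representative values in the ranges
# cited above and in Table 1 (Horowitz); not exact measurements.
ENERGY_PJ = {
    "integer add":     0.1,
    "fp32 multiply":   4.0,
    "cache access":   50.0,     # 20-100 pJ range
    "DRAM access":  1300.0,     # > 1,000 pJ
}

baseline = ENERGY_PJ["integer add"]
for op, pj in ENERGY_PJ.items():
    print(f"{op:13s}: {pj:7.1f} pJ  (~{pj / baseline:,.0f}x an integer add)")
```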
GenAI accelerators are a prime example of designs dominated by data movement.

The impact of memory walls and energy consumption on latency and efficiency
The impact of memory walls and energy consumption in generative AI processing is becoming difficult to control.
Within a few years, GPT, the base model that powers ChatGPT, has evolved from GPT-2 in 2019, to GPT-3 in 2020, to GPT-3.5 in 2022, and to the current GPT-4. The size of the model and the number of parameters (weights, tokens and states) increase by several orders of magnitude with each generation.
GPT-2 contains 1.5 billion parameters, GPT-3 contains 175 billion, and the latest GPT-4 pushes the count to about 1.7 trillion parameters (official numbers have not been released).
Not only does the sheer number of parameters force memory capacity into the terabyte range, but accessing them simultaneously at high speed during training and inference also pushes memory bandwidth to hundreds of GB/s, if not TB/s. To make matters worse, moving them consumes a great deal of energy.
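A quick calculation shows how parameter count translates into capacity and bandwidth pressure; the bytes-per-parameter and token-rate figures below are assumptions for illustration, and the naive read-every-weight-per-token bound ignores batching and caching.

```python
# Memory capacity and bandwidth implied by parameter count.
# Bytes-per-parameter and token-rate figures are illustrative assumptions.
models = {"GPT-2": 1.5e9, "GPT-3": 175e9, "GPT-4 (estimated)": 1.7e12}

bytes_per_param = 2        # e.g. 16-bit weights
tokens_per_second = 50     # assumed generation rate for a single response stream

for name, params in models.items():
    capacity_gb = params * bytes_per_param / 1e9
    # Naive upper bound: every weight is read once per generated token.
    bandwidth_gbs = params * bytes_per_param * tokens_per_second / 1e9
    print(f"{name}: ~{capacity_gb:,.0f} GB of weights, "
          f"~{bandwidth_gbs:,.0f} GB/s to stream them at {tokens_per_second} tokens/s")
```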
Expensive hardware sits idle
The daunting data-transfer bandwidth required between memory and processor, together with the associated power consumption, crushes processing efficiency. Recent analysis shows that the efficiency of running GPT-4 on cutting-edge hardware drops to around 3%; the expensive hardware designed to run these algorithms sits idle 97% of the time.
The less efficient the execution, the more hardware is needed to perform the same task. For example, suppose a requirement of 1 Petaflops (1,000 Teraflops) can be met by two suppliers, A and B, whose processors offer different processing efficiencies: 5% and 50%, respectively (Table 2).
Vendor A then delivers only 50 Teraflops of effective, as opposed to theoretical, processing power, while Vendor B delivers 500 Teraflops. To provide 1 Petaflops of effective computing power, Vendor A needs 20 processors, whereas Vendor B needs only 2.
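The arithmetic behind Table 2 can be sketched as follows, assuming each device advertises 1 Petaflops of peak compute as in the example above.

```python
# Effective throughput per device and the number of devices needed to deliver
# 1 Petaflops of *effective* compute (the arithmetic behind Table 2).
import math

peak_per_device_tflops = 1000     # each device advertises 1 Petaflops peak
target_effective_tflops = 1000    # 1 Petaflops of effective compute required

vendors = {"Vendor A": 0.05, "Vendor B": 0.50}

for name, efficiency in vendors.items():
    effective = peak_per_device_tflops * efficiency
    devices = math.ceil(target_effective_tflops / effective)
    print(f"{name}: {effective:.0f} effective TFLOPS per device -> {devices} devices needed")
```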


For example, one Silicon Valley startup plans to use 22,000 Nvidia H100 GPUs in its supercomputer data centers. A back-of-the-envelope calculation puts 22,000 H100 GPUs at $800 million — the bulk of its latest funding. This number does not include the cost of the rest of the infrastructure, real estate, energy costs, and all other factors in the total cost of ownership (TCO) of on-premises hardware.
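The back-of-the-envelope figure works out roughly as follows; the per-GPU price is simply the implied average, not a quoted list price.

```python
# Back-of-the-envelope hardware cost for the GPU fleet mentioned above.
# The per-GPU figure is simply the implied average, not a quoted list price.
gpu_count = 22_000
total_cost_usd = 800e6

print(f"Implied average cost per H100: ${total_cost_usd / gpu_count:,.0f}")
print(f"Fleet cost for {gpu_count:,} GPUs: ${total_cost_usd / 1e9:.1f} billion")
```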
The impact of system complexity on latency and efficiency
Another example, based on the current state-of-the-art GenAI training accelerator, helps illustrate the concern. The Silicon Valley startup's GPT-4 configuration would require 22,000 Nvidia H100 GPUs deployed in groups of eight on HGX H100 or DGX H100 systems, for a total of 2,750 systems.
Considering that GPT-4 includes 96 decoders, mapping them across multiple chips may lessen the impact on latency. Since the GPT structure processes decoders sequentially, assigning one decoder to each chip, for a total of 96 chips, might be a reasonable setup.
This configuration translates into 12 HGX/DGX H100 systems, which introduces latency challenges of its own as data moves between individual chips, between boards, and between systems. Using delta transformers can significantly reduce processing complexity, but they require processing and storing state, which in turn increases the amount of data to be handled.
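The system counts follow directly from the eight-GPU-per-system packaging; a minimal sketch of the mapping arithmetic:

```python
# Mapping the deployment onto 8-GPU HGX/DGX H100 systems.
import math

gpus_per_system = 8

# Full fleet at the scale cited above.
total_gpus = 22_000
print(f"Systems for {total_gpus:,} GPUs: {math.ceil(total_gpus / gpus_per_system):,}")   # 2,750

# One decoder per GPU for GPT-4's 96 decoders, as assumed in the text.
decoders = 96
print(f"Systems for {decoders} decoders, one per GPU: {math.ceil(decoders / gpus_per_system)}")  # 12
```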
The bottom line is that the 3% execution efficiency mentioned earlier is optimistic. Once the impact of the system-level implementation and the long delays associated with it are added, the actual efficiency in real applications drops significantly.
Taken together, GPT-3.5 requires far less data than GPT-4, and from a business perspective, deploying something of GPT-3’s complexity is more attractive than GPT-4. The flip side is that GPT-4 is more accurate and would be preferred if its hardware challenges could be solved.
Best guess cost analysis
Let's focus on the implementation costs of a system capable of handling large numbers of queries, say Google-like volumes of 100,000 queries per second.
Using current state-of-the-art hardware, it is reasonable to assume that the total cost of ownership (including acquisition costs, system operations and maintenance costs) is approximately $1 trillion. For the record, that's roughly half the 2021 gross domestic product (GDP) of Italy, the world's eighth-largest economy.
ChatGPT's cost-per-query impact makes it commercially challenging. Morgan Stanley estimates that Google search queries (3.3 trillion queries) will cost £0.20 per query in 2022 (considered the benchmark). The same analysis shows that the cost per query on ChatGPT-3 ranges from 3 to 14 euros, which is 15-70 times higher than the baseline.
The semiconductor industry is actively looking for solutions to the cost-per-query challenge. While all attempts are welcome, the solution must come from novel chip architectures that break down the memory wall and drastically reduce power consumption.


