AI Will Disrupt Processor Design

AI is fundamentally changing processor design, combining custom processing elements for specific AI workloads with more traditional processors for other tasks.


However, this trade-off is increasingly confusing, complex, and difficult to manage. For example, workloads can change faster than the time it takes to customize a design. In addition, AI-specific processing may exceed power and thermal budgets, which can force adjustments to the workloads. And integrating all of these components can create issues that need to be addressed at the system level, not just in the chip.

"Artificial intelligence workloads are disrupting processor architecture," said Rambus researcher and distinguished inventor Steven Woo. "It was clear that the existing architecture was not working very well. As early as 2014, people began to realize that using GPUs could greatly improve transaction performance, which gave a huge boost to the development of artificial intelligence. At that time, people Start saying: 'GPU is a specialized architecture. Can we do more?' At the time, multiply-accumulate, which is so common in AI, was clearly the bottleneck. Now, you've got so much great hardware. We've got it So what else do we need to add to the hardware? That’s what architecture is all about. It’s about finding the tall pegs or long tent poles in the tent and knocking them off.”


Others agree. "Artificial intelligence fits right into the GPU architecture, and that's why Nvidia has a market capitalization in the trillions of dollars," said Rich Goldman, director at Ansys. "Interestingly, Intel has been doing GPUs for a long time, but as video processors driven from inside the CPU. Now they are moving to discrete GPUs as well. AMD also has a very interesting architecture, in which the GPU and CPU share memory. Still, the CPU remains important. Nvidia's Grace Hopper is a CPU-GPU combination, because not all applications lend themselves to a GPU architecture. Even among applications that do, there are parts that run only on small CPUs. For decades we've run everything on x86 CPU architectures, or perhaps RISC architectures, but always on the CPU. Different applications run better on different architectures, and Nvidia happened to focus first on video games, then on animation and movies. The same architecture maps well onto artificial intelligence, which is driving everything today."

The challenge now is to develop more efficient platforms optimized for specific use cases. "The question becomes, when you implement this in truly scalable hardware and not just for a one-off use case, how do you run it?" said Suhas Mitra, director of product marketing for AI at Cadence Tensilica. "Traditionally in a processor we had a CPU, and if you had a mobile platform you also had a GPU, a DSP, and so on. All of that gets muddled because people saw that these workloads are sometimes embarrassingly parallel. That is why GPUs became so popular - they have very good hardware engines for parallel processing - and it was easy for those vendors to benefit immediately."

Sharad Chole, chief scientist at Expedera, said this approach works best when the workload is well-defined. "In that type of architecture, say you are trying to integrate an ISP and an NPU in a tightly coupled way in an edge architecture, SoC leaders are looking at how to reduce the area and power consumption of the design."
The challenge here, Chole said, is understanding the latency implications for the memory portion of the architecture. "What does the memory look like if the NPU is slow? What does it look like when the NPU is fast? Ultimately it comes down to balancing MACs against memory, where we are trying to minimize input and output buffering."
External memory bandwidth is also a critical part of this, especially for edge devices. "No one has enough bandwidth," he added. "So how do we partition the workload, or schedule the neural network, so that external memory bandwidth is sustained and kept as low as possible? That is basically what we try to achieve by packing, or by breaking the neural network into smaller pieces and executing those pieces."

Designing for a rapidly changing future

A big problem with artificial intelligence is that algorithms and compute models are evolving and changing faster than hardware can be designed from scratch to support them.
"If you say you're going to build a CPU that's really good at LSTM (long short-term memory) models, that cycle takes a few years," said Rambus' Woo. "Then, two years later, you realize LSTMs have come and gone as the dominant model. You want to build specialized hardware, but you have to do it faster to keep up. The Holy Grail would be creating hardware as fast as the algorithms change. That would be great, but we can't do it, despite the pressure on the industry to do so."
It also means that processors handling AI workloads will be architected differently from those that are not focused on AI. "If you look at these engines for training, they don't run Linux or Word, because they're not designed for general-purpose branching, a broad instruction mix, or support for multiple languages," Woo said. "They're almost all bare-bones engines that run a few types of operations extremely fast, and they're highly tuned to the specific data movement patterns the computation requires. Take the Google TPU, which uses a systolic array architecture that dates back to the 1980s. It's very good at doing certain kinds of evenly distributed work over large arrays of data, which makes it ideal for these dense neural networks. But running general-purpose code is not what these things are designed for. They're more like massive co-processors that do the really big part of the computation very well, but they still need to connect to a processor that manages the rest of the computation."
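As a rough illustration of the systolic-array idea Woo mentions, the toy Python model below streams matrix operands through a grid of multiply-accumulate cells, one operand pair per cell per cycle. It is a teaching sketch of the general dataflow, not a description of how the TPU or any other NPU is actually built.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy cycle-by-cycle model of an output-stationary systolic array.

    Each processing element (PE) at grid position (i, j) accumulates
    C[i, j]. A's rows flow in from the left and B's columns from the top,
    skewed in time so the matching operands meet at the right PE.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N))              # one accumulator register per PE

    # Run enough cycles for the last (most-skewed) operand pair to meet.
    for t in range(M + N + K - 2):
        for i in range(M):
            for j in range(N):
                k = t - i - j           # operand index reaching PE(i, j) this cycle
                if 0 <= k < K:
                    acc[i, j] += A[i, k] * B[k, j]   # one MAC per PE per cycle
    return acc

A = np.random.rand(4, 6)
B = np.random.rand(6, 3)
assert np.allclose(systolic_matmul(A, B), A @ B)
```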
Even benchmarking is difficult, because it doesn't always compare apples to apples, and that makes it hard to develop an architecture. "It's a difficult topic because different people use different tools to solve the problem," said Expedera's Chole. "In a design engineer's day-to-day job, the task is system-level benchmarking. You benchmark each part of the SoC individually, then extrapolate the required bandwidth from that data. 'This is the performance, this is the latency I'm going to get.' Based on that, you try to estimate what the whole system will look like. But as we get further into the design process, we look at some kind of simulation-based approach - not a full simulation, but a transaction-accurate simulation - to get the exact performance and the exact bandwidth requirements for the different design blocks. For example, there's a RISC-V core and an NPU, and they have to work together and fully coexist. Do they have to be pipelined? Can their workloads be pipelined? How many cycles does the RISC-V really need? For that, we have to compile our program on the RISC-V, compile our program on the NPU, and then co-simulate."

Impact of AI workloads on processor design
All of these variables impact the power, performance, and area/cost of the design.
"The PPA trade-offs for ML workloads are similar to those faced by all architects when considering acceleration - energy efficiency versus area," said Ian Bratt, researcher and senior technical director at Arm. "Over the past few years, with the increase in ML-specific acceleration instructions, CPU performance has improved significantly in processing ML workloads. Many ML workloads run excellently on modern CPUs. However, if you are in a highly energy-constrained environment, it is worth the additional silicon area cost to add a dedicated NPU , because NPUs are more energy efficient than CPUs in ML inference. This energy efficiency comes at the expense of increased silicon area and flexibility; NPU IP can typically only run neural networks. In addition, specialized units like NPUs may also be more efficient than CPUs. More flexible components like the CPU enable higher overall performance (lower latency)".
Russell Klein, program director at Siemens EDA, explained: "There are two main aspects of a design that have the biggest impact on its operating characteristics (PPA). One is the data representation used in the calculations. For most machine learning computations, floating point is really inefficient. Using a more appropriate representation makes a design faster, smaller, and lower power."
The other major factor is the number of compute elements in the design. "Fundamentally, it's how many multipliers you build into the design," Klein said. "That is what provides the parallelism needed to deliver performance. A design can have a large number of multipliers, making it big, power-hungry, and fast, or it can have only a few, making it small and low-power but much slower. Beyond power, performance, and area, there is another very important metric: energy per inference. Anything that is battery-powered or harvests its energy is likely to be more sensitive to energy per inference than to power."
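To make that distinction concrete, the short sketch below computes energy per inference (average power multiplied by latency) for two hypothetical designs. All of the power and latency figures are invented purely for illustration; they describe no real accelerator.

```python
# Energy per inference vs. power: a toy comparison of two hypothetical
# accelerator configurations. All numbers are illustrative assumptions.

designs = {
    # name: (average power in watts, latency per inference in seconds)
    "many_multipliers": (2.0, 0.002),   # big, power-hungry, fast
    "few_multipliers":  (0.2, 0.050),   # small, low-power, slow
}

for name, (power_w, latency_s) in designs.items():
    energy_mj = power_w * latency_s * 1e3    # energy per inference, millijoules
    print(f"{name:18s} power={power_w:4.1f} W  "
          f"latency={latency_s*1e3:5.1f} ms  energy/inference={energy_mj:5.2f} mJ")
```

With these made-up numbers the bigger, higher-power design actually uses less energy per inference because it finishes much sooner, which is exactly why the two metrics have to be tracked separately.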

The numerical representation of features and weights can also have a significant impact on the PPA of the design.
"In the data center, everything is 32-bit floating point. Alternative representations can reduce the size of operators and the amount of data that needs to be moved and stored," he noted. "Most AI algorithms do not require the full range of floating point support. , works well with fixed-point numbers. Fixed-point multipliers typically have only 1/2 the area and power of corresponding floating-point multipliers, and run much faster. There is also generally no need for a 32-bit fixed-point representation. Many algorithms can The bit width of features and weights is reduced to 16 bits, and in some cases even to 8 bits. The size and power of a multiplier is proportional to the square of the size of the data it operates on. Therefore, the area of a 16-bit multiplier is The power is 1/4 that of a 32-bit multiplier. The area and power consumption of an 8-bit fixed-point multiplier is approximately 3% of that of a 32-bit floating-point multiplier. If the algorithm can use 8-bit fixed-point numbers instead of 32-bit floating-point numbers, then only 1/4 of the memory is required to store data, and only 1/4 of the bus bandwidth is required to move the data. This greatly saves area and power consumption. Through quantization-aware training, the required bit width can be further reduced. Typically, with The bit width required for the network trained in the quantization-aware manner is approximately 1/2 of the quantized network after training. In this way, the storage and communication costs can be reduced by 1/2, and the multiplier area and power can be reduced by 3/4. After quantization-aware training Networks usually require only 3-8 bits of fixed-point representation. Sometimes, some layers require only one bit. A 1-bit multiplier is an AND gate.
Furthermore, overflow becomes an important issue when aggressively quantizing a network. "With 32-bit floating-point numbers, developers don't need to worry about values exceeding the representation's capacity. But with small fixed-point numbers, that has to be addressed, and overflows can happen frequently. Using saturating operators is one way to solve the problem. Instead of overflowing, a saturating operator stores the largest value the representation can hold."
It turns out this works very well for machine learning algorithms, because the exact magnitude of a large intermediate sum doesn't matter; knowing that it is large is enough. Using saturating math lets developers shave another bit or two off the size of their fixed-point numbers. Some neural networks do need the dynamic range that a floating-point representation provides. They lose too much accuracy when converted to fixed point, or need more than a 32-bit fixed-point representation to deliver good accuracy.
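A saturating accumulator of the kind described above might look like the toy sketch below, which clamps a 16-bit signed accumulator instead of letting it wrap around. The bit widths and input values are illustrative only.

```python
INT16_MAX, INT16_MIN = 32767, -32768

def saturating_add(a, b):
    """Add two values and clamp to the signed 16-bit range
    instead of wrapping around on overflow."""
    s = a + b
    return max(INT16_MIN, min(INT16_MAX, s))

def saturating_dot(xs, ws):
    """Accumulate a dot product with a saturating 16-bit accumulator.
    The exact size of a huge intermediate sum is lost, but as the text
    notes, 'large' is usually all the network needs to know."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = saturating_add(acc, x * w)
    return acc

# Products of 8-bit values overflow a 16-bit accumulator quickly;
# with saturation the result pins at 32767 instead of going negative.
print(saturating_dot([127] * 10, [127] * 10))   # -> 32767
```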
In these cases one of several smaller floating-point representations can be used. Bfloat16 (or "brain float"), developed by Google for its NPUs, is a 16-bit floating-point format that converts easily to and from traditional floating point. Like smaller fixed-point numbers, it allows smaller multipliers and reduces data storage and movement. "There is also the IEEE-754 16-bit floating-point format, and Nvidia's TensorFloat," Klein added.
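Because bfloat16 keeps float32's 8-bit exponent and simply drops 16 mantissa bits, the conversion can be sketched as bit-level truncation, as below. Real hardware usually rounds to nearest even rather than truncating; this is only meant to show why the format converts so easily.

```python
import numpy as np

def float32_to_bfloat16_bits(x):
    """Truncate a float32 to bfloat16 by keeping its top 16 bits."""
    bits = np.float32(x).view(np.uint32)
    return np.uint16(bits >> 16)

def bfloat16_bits_to_float32(b):
    """Expand the 16 stored bits back to float32 by zero-filling
    the dropped mantissa bits."""
    return np.uint32(np.uint32(b) << 16).view(np.float32)

x = np.float32(3.14159265)
b = float32_to_bfloat16_bits(x)
print(hex(int(b)), float(bfloat16_bits_to_float32(b)))   # 0x4049 3.140625
```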

Using any of these formats results in a smaller, faster, lower-power design.
Furthermore, Woo said, "If you have a general-purpose core, it does a lot of things, but it won't do any of them very well. It's just general-purpose. At any point while you're running a workload, part of that general-purpose core is being used and part of it isn't, and it costs area and power to have all of it there."
"People are starting to realize that Moore's Law is still giving us more transistors, so maybe the right thing to do is to build specialized cores along the AI pipeline that are good at specific tasks," he said. "Sometimes you turn them off, sometimes you turn them on. But that's better than general-purpose cores, where you're always wasting some area and power and never getting the best performance. Add to that a market that's willing to pay - a high-margin, high-value market - and you have a good combination."
Marc Swinnen, director of product marketing at Ansys, said: "It's also a relatively well-understood approach in the hardware engineering world. You build version 1, and once it's in the field you find out what works and what doesn't and try to fix those problems. The applications you run are critical to understanding what those trade-offs need to be. If you can match your hardware to the applications you want to run, you get a much more efficient design than anything off-the-shelf. The chip you make for yourself is perfectly suited to what you want to do."


That's why some generative AI developers are exploring building their own chips, suggesting that in their view, even current semiconductors won't be enough to meet their future needs. This is yet another example of how artificial intelligence is changing processor design and surrounding market dynamics.
AI may also play an important role in the chiplet world, where semi-custom and custom hardware blocks can be characterized and added to a design without having to create everything from scratch. Big chipmakers such as Intel and AMD have been doing this internally for some time, but fabless companies are at a disadvantage.
"The problem is that your chiplet has to compete with existing solutions," said Andy Heinig, head of the Efficient Electronics department in Fraunhofer IIS' Engineering of Adaptive Systems division. "If you're not focused on performance right now, you can't compete. People are focused on getting this ecosystem up and running, but from our perspective it's a chicken-and-egg problem. You need the performance, especially because chiplets are more expensive than an SoC solution. But you can't really focus on performance yet, because you first have to get the ecosystem up and running."

Right Start
Unlike in the past, when many chips were designed for one socket, AI is entirely workload dependent.
"When making these trade-offs, the most important thing is knowing what your goal is," said Expedera's Chole. "If you just say, 'I want to do everything and support everything,' then you're not really optimizing for anything. You're basically just putting a generic solution in there and hoping it meets your power requirements. In our experience, that rarely works. Every neural network and every deployment case on an edge device is unique. If your chip is going into a headset and running an RNN, versus sitting in an ADAS chip and running a transformer, it's a completely different use case. The NPU, the memory system, the configuration, and the power consumption are all completely different. So it's very important to understand the set of workloads we want to target. That can be multiple networks. You have to get the team to agree on which networks are important and optimize around those. That's what's missing when engineering teams think about NPUs. They just want the best thing in the world, but you can't get the best without making some trade-offs. I can give you the best, but where do you want to be the best?"
Cadence's Mitra noted that everyone thinks about PPA in similar terms, but which element of power, performance, and area/cost (PPAC) they emphasize differs. "If you're a data center person, you may accept giving up a little bit of area, because what you're after are very high-throughput machines that can do billions of AI inferences or AI jobs while running huge models and generating massive amounts of data, and market share turns on that. The days when you could think of doing AI model development or inference on a desktop are gone; even inference for some large language models has become quite tricky. That means you need a massive data cluster and massive compute at hyperscale data center scale."
There are other considerations. “Hardware architectural decisions drive this, but software plays a critical role as well,” said William Ruby, director of product management for Synopsys EDA Group, noting that performance and energy efficiency are key. "How much memory is required? How is the memory subsystem partitioned? Can the software code be optimized for energy efficiency? (Yes, it can.) The choice of process technology is also important—for all PPAC reasons."
Additionally, Gordon Cooper, product manager for AI/ML processors at Synopsys, said an embedded GPU can be used if energy efficiency is not a priority. "It gives you the best coding flexibility, but it will never be as power- and area-efficient as a dedicated processor. If you design with an NPU, there are still trade-offs in balancing area and power. Minimizing on-chip memory can significantly reduce your total area budget, but it increases the data transfers from external memory, which significantly increases power consumption. Increasing on-chip memory reduces the power consumed by external memory reads and writes."
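The sketch below puts rough numbers on the trade-off Cooper describes by charging each byte of activation traffic either an on-chip SRAM cost or an external DRAM cost. The per-byte energies and the traffic volume are order-of-magnitude assumptions chosen for illustration, not figures for any real memory technology.

```python
# Rough, illustrative energy accounting for the on-chip vs. external
# memory trade-off. The pJ/byte figures are assumptions for the sake
# of the example, not numbers for any specific process or DRAM.

PJ_PER_BYTE_SRAM = 1.0      # assumed on-chip SRAM access cost
PJ_PER_BYTE_DRAM = 100.0    # assumed external DRAM access cost

def memory_energy_mj(bytes_per_inference, offchip_fraction):
    """Energy (mJ) spent moving one inference's worth of data,
    given the fraction of traffic that spills to external memory."""
    onchip = bytes_per_inference * (1 - offchip_fraction) * PJ_PER_BYTE_SRAM
    offchip = bytes_per_inference * offchip_fraction * PJ_PER_BYTE_DRAM
    return (onchip + offchip) * 1e-9    # pJ -> mJ

traffic = 50e6   # 50 MB of data movement per inference (made up)
for frac in (0.05, 0.25, 0.50):
    print(f"{frac:4.0%} off-chip -> {memory_energy_mj(traffic, frac):6.2f} mJ per inference")
```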


In Conclusion
All of these issues are increasingly becoming system issues, not just chip issues.
"People look at the training part of AI and say, 'Wow, that's really computationally expensive,'" Woo said. "Once you try to throw all this acceleration hardware into it, then other parts of the system start to get hindered." For this reason, we are increasingly seeing these platforms from companies like Nvidia, which have well-designed AI training engines, but may also use Intel Xeon chips. This is because the AI engine is not suitable for other Parts of computing. They're not designed to run universal code, so increasingly it's a heterogeneous systems problem. You have to make everything work together."
Another difficulty is that, on the software side, efficiency can be improved in various ways, such as with reductions. "We realized that in artificial intelligence there are certain computations called reductions, which is a fancy way of saying you take a large set of numbers and reduce them to a single number or a small set of numbers," Woo explained. "It could be adding them all together, or something like that. The traditional approach is, if you have data spread across all the other processors, you send it over the interconnect to one processor and have that processor add it all up. All of that data travels to that processor through the network, through switches. So why don't we just add it up in the switch? The advantage is that it's akin to in-line processing. The appealing thing is that once you've added everything up in the switch, you only have to send one number on, which means less traffic on the network."
Architectural moves like this are worth considering because they solve several problems at once, Woo said. First, moving data across the network is extremely slow, so you want to move as little of it as possible. Second, it eliminates the redundant work of sending the data to a processor, having that processor do the operation, and then sending the result back. Third, it is highly parallel, so each switch can do part of the computation.
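A toy model of the in-switch reduction Woo describes is sketched below. It compares how many numbers must reach the root node when every worker's partial sum is forwarded directly versus when each switch adds up what passes through it. It is purely schematic and models no real interconnect or collective-communication library.

```python
# Toy model of in-network (in-switch) reduction: workers hold partial
# sums; switches either forward everything to one root processor or
# add values as they pass through.

import random

workers = [[random.random() for _ in range(1000)] for _ in range(16)]
partials = [sum(w) for w in workers]            # each worker reduces locally first

# Baseline: every partial sum is forwarded, unmodified, to one processor.
baseline_messages = len(partials)               # 16 numbers cross the network
baseline_result = sum(partials)

# In-switch reduction: 4 leaf switches each add up their 4 workers,
# then the root only has to add the 4 switch-level sums.
leaf_sums = [sum(partials[i:i + 4]) for i in range(0, 16, 4)]
inswitch_messages = len(leaf_sums)              # only 4 numbers reach the root
inswitch_result = sum(leaf_sums)

assert abs(baseline_result - inswitch_result) < 1e-9
print("messages to root, baseline :", baseline_messages)
print("messages to root, in-switch:", inswitch_messages)
```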
Likewise, Expedera's Chole said AI workloads can now be defined by a single graph. "With a graph, it's not a small set of instructions. We're not doing one add; we're doing a million adds at once, or 10 million matrix multiplies at once. That changes your mindset about execution, about instructions, about how you compress instructions, and about how you predict and schedule them. Doing that on a general-purpose CPU isn't practical; the cost is too high. But for a neural network, where the number of MACs active at the same time is enormous, the way you generate, create, compress, and schedule instructions changes dramatically in terms of utilization and bandwidth. That is the huge impact artificial intelligence is having on processor architecture."
