
ASIP eUpdate April 2025

ASIP Designer is Synopsys' solution for efficiently designing and implementing your own application-specific instruction-set processor (ASIP) when you can't find suitable processor IP, or when hardware implementations require more flexibility.

This bi-annual newsletter provides you with easy access to ASIP-related resources.

ASIP Designer


Technology Feature: Conciliate fixed-function HW performance and programmability with ASIP Designer - an LDPC Decoder Case Study

ASIP Designer W-2024.12 includes a new example processor model which demonstrates that fixed-function hardware efficiency and software programmability are not conflicting goals. The example implements a programmable accelerator for low-density parity-check (LDPC) decoding in wireless communications, an application with daunting computational requirements.


Fixed-function RTL versus High-Level Synthesis versus ASIP

For computationally intensive applications with demanding throughput, area, and power requirements, where the flexibility of a software-programmable solution is not strictly required, design teams have traditionally chosen a fixed-function RTL implementation. The aim is to avoid the suspected overhead of a programmable architecture (program memory, instruction fetch and decode logic) as well as the effort of developing the required software tools, such as a compiler.

However, a pure fixed-function approach with RTL design entry has the disadvantage that changes to the specification or requirements, or just multiple design iterations to explore different architectural options, can be implemented only with significant design effort. As a result, design changes, particularly late changes, are expensive, as they require time and RTL design expertise. After tape-out, changes or bug fixes require a complete re-spin of the chip manufacturing process.

High-Level Synthesis (HLS) provides some remedy to this problem. The design entry is typically algorithm code written in C++. From this code, together with user-specified constraints on timing and/or power, an HLS tool automatically generates an RTL implementation. The quality of results in terms of power, performance, and area (PPA), particularly the efficiency of resource sharing, depends on the tool. The user has some control by adapting the C++ algorithm code and the constraints for a new iteration.

This approach allows some flexibility at design time, as the input C++ code can be modified and the design flow can be iterated multiple times, before the chip gets taped out. But in the same way as an RTL-entry approach, changes and fixes after tape-out imply the large costs associated with a silicon redesign.

An ASIP (Application-Specific Instruction-Set Processor) is a programmable architecture that is optimized for a particular application or class of applications. The hardware is tailored to achieve the throughput and/or power consumption required by the applications. The architecture is programmable, that is, the application is written in a high-level programming language such as C or C++. Changes to the specification or bug fixes, no matter how late and even post-production, can be applied in software and do not require a redesign of the chip. In the worst case, a big change to the application code may result in a non-optimal cycle count on the current ASIP architecture, but the code will still run correctly.

The challenges typically associated with designing an ASIP are the fear of the hardware overhead incurred by making the architecture programmable, and the expertise required to design software tools for the architecture, such as an optimizing compiler.

ASIP Designer addresses these challenges. While production-quality RTL code and software tools, including an optimizing compiler, are automatically generated by the tool, the designer retains full control of the PPA quality of results through the modeling of the processor architecture. All hardware resources are explicitly defined in the processor model, and ASIP Designer's optimizing compiler ensures that they are used in the most efficient way.

The following LDPC case study demonstrates that a programmable architecture designed with ASIP Designer can indeed meet daunting throughput requirements with very little area penalty incurred from adding programmability.

 

Case Study: LDPC Decoder

Low-Density Parity Check (LDPC) encoding and decoding is a forward error correction method used in various wireless standards, such as WiFi or 5G Cellular. These applications have high throughput and low latency requirements, and they are traditionally implemented in fixed-function RTL designs.

The ASIP Designer W-2024.12 release comes with an example processor model that implements a programmable accelerator for LDPC decoding, which meets aggressive performance requirements with only a small area overhead for programmability. Explaining the details of the algorithm or the architecture is out of scope for this newsletter. For more detailed information, ASIP Designer users may consult the documentation of the processor model. Here, we just give an overview of the performance requirements, the resulting processor architecture, and the synthesis results.

A typical performance requirement is 100 Mbit/s message throughput. In the worst case, i.e., for the maximum package size specified in the 5G standard, a package contains 8448 message bits. Assuming a 1 GHz target clock rate, this results in a time budget of 84k cycles per package. Within this time budget, the decoder must process 971k so-called "parity operations", each of which is composed of 20-30 "basic operations" (load/store/add/sub/min/max). This translates to approximately 25 million basic operations to be performed in 84k cycles, or about 300 basic operations in parallel to reach 100 Mbit/s throughput. This indicates that a high degree of specialization and parallelization is needed.
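
The budget above can be reproduced with a few lines of arithmetic. The following Python sketch uses only the figures quoted in this section. The value of 25 basic operations per parity operation is an assumption (the midpoint of the quoted 20-30 range), and all names are illustrative.

```python
# Worst-case decoding budget, based on the figures quoted in the text.
THROUGHPUT_BPS = 100_000_000   # required message throughput: 100 Mbit/s
MESSAGE_BITS = 8448            # maximum package size in message bits
CLOCK_HZ = 1_000_000_000       # assumed target clock rate: 1 GHz
PARITY_OPS = 971_000           # parity operations per package
BASIC_OPS_PER_PARITY = 25      # assumption: midpoint of the quoted 20-30 range


def cycle_budget(message_bits, throughput_bps, clock_hz):
    """Clock cycles available to decode one package at the given throughput."""
    return message_bits * clock_hz // throughput_bps


def required_parallelism(total_basic_ops, cycles):
    """Average number of basic operations that must complete per cycle."""
    return total_basic_ops / cycles


cycles = cycle_budget(MESSAGE_BITS, THROUGHPUT_BPS, CLOCK_HZ)
total_basic_ops = PARITY_OPS * BASIC_OPS_PER_PARITY
ops_per_cycle = required_parallelism(total_basic_ops, cycles)

print(f"cycle budget per package : {cycles}")
print(f"basic operations         : {total_basic_ops}")
print(f"basic operations / cycle : {ops_per_cycle:.0f}")
```

With the midpoint assumption, the result lands slightly below the rounded figure of 300 parallel operations quoted above. Using the upper end of the 20-30 range pushes it higher.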

Starting from Trv32p5x, one of the scalar RISC-V base models included with ASIP Designer, the LDPC application was profiled and the processor model was extended over several design iterations, with the compiler and synthesis in the loop, adding several degrees of specialization and parallelization.

The main features of the resulting architecture are:

  • 128-lane x 8-bit SIMD vector processing unit, with specialized instructions for:
    • variable rotation
    • add/sub with saturation
    • minimum detection
    • element selection
  • 8 x 1024-bit vector register file
  • Closely coupled 1024-bit vector memory
  • Dual address generation unit for scalar & vector memories
  • 64-bit instruction width
  • Up to 4-way instruction level parallelism:
    • Scalar RISC-V instructions (including scalar load/store with post-increment & zero-overhead loops)
    • Vector load/store
    • Vector arithmetic
    • Vector rotation
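
To make the vector features in this list more concrete, here is a minimal behavioral sketch in Python. It is not taken from the processor model: the function names, the rotation direction, and the tie-breaking rule in minimum detection are all assumptions chosen for illustration. Only the lane count and lane width come from the list above.

```python
LANES = 128      # number of SIMD lanes, from the feature list
LANE_BITS = 8    # bits per lane, from the feature list


def vector_width_bits(lanes=LANES, lane_bits=LANE_BITS):
    """Total vector width; 128 lanes x 8 bits matches the 1024-bit registers."""
    return lanes * lane_bits


def rotate(vec, amount):
    """Rotate lanes by a run-time amount (assumed direction: towards lane 0)."""
    k = amount % len(vec)
    return vec[k:] + vec[:k]


def min_detect(vec):
    """Return the minimum value and the index of the first lane holding it."""
    smallest = min(vec)
    return smallest, vec.index(smallest)


def select(mask, a, b):
    """Per-lane element selection: a[i] where mask[i] is true, else b[i]."""
    return [x if m else y for m, x, y in zip(mask, a, b)]


lanes = list(range(LANES))
rotated = rotate(lanes, 3)   # lane 0 now holds the former content of lane 3
```

In hardware, each of these functions would operate on all 128 lanes in a single instruction, which is where the parallelism required by the cycle budget comes from.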

 

[Figure 1: Schematic view of the processor datapath]

Figure 1 shows a schematic view of the processor datapath. The units vec and rot are explicitly defined resources in the processor model that implement the SIMD vector processing units. ASIP Designer's compiler tries to utilize these resources in an optimal way.

[Figure 2: Extract of the adapted application software code, using compiler intrinsics]

Figure 2 shows an extract of the adapted application software code, using compiler intrinsics to address the vector operations. The intrinsics vaddsat, vsubsat, and vcndneg are mapped by the compiler onto a single, shared hardware resource as specified by the designer, the vector unit vec.
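
This newsletter does not define the exact semantics of these intrinsics. The Python sketch below is therefore only a guess at their per-lane behavior, inferred from their names: it assumes signed 8-bit lanes, saturating add and subtract, and a conditional negate. The model_ prefix marks the functions as hypothetical stand-ins, not as the intrinsics themselves.

```python
INT8_MIN, INT8_MAX = -128, 127   # assumption: lanes hold signed 8-bit values


def saturate(value):
    """Clamp a result to the signed 8-bit range instead of wrapping around."""
    return max(INT8_MIN, min(INT8_MAX, value))


def model_vaddsat(a, b):
    """Assumed behavior of vaddsat: per-lane add with saturation."""
    return [saturate(x + y) for x, y in zip(a, b)]


def model_vsubsat(a, b):
    """Assumed behavior of vsubsat: per-lane subtract with saturation."""
    return [saturate(x - y) for x, y in zip(a, b)]


def model_vcndneg(cond, a):
    """Assumed behavior of vcndneg: negate a[i] where cond[i] is true."""
    return [saturate(-x) if c else x for c, x in zip(cond, a)]


sums = model_vaddsat([100, -100, 1], [100, -100, 1])
diffs = model_vsubsat([-100, 5], [100, 3])
negated = model_vcndneg([True, False, True], [5, 5, -128])
```

All three operations reduce to an add or subtract followed by a clamp, which suggests why they can plausibly share one hardware unit.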

The final design, synthesized in a 28nm technology, has a size of 430k logic gates (excluding memory), runs at a 900 MHz clock, and achieves a net throughput (message bits) of 114 Mbit/s. At design time, throughput and area can be scaled linearly with the number of SIMD vector lanes. Thanks to the scalar RISC-V unit, arbitrary C code can be compiled for the processor, and any application that can use the same vector intrinsics will be accelerated significantly.
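
As a rough cross-check of these results, the following Python sketch derives the cycles spent per worst-case package (8448 message bits, as quoted earlier) and applies the stated linear scaling to other lane counts. The scaling function is a first-order estimate based solely on the statement above. The lane counts other than 128 are hypothetical and have not been synthesized.

```python
BASE_LANES = 128             # lane count of the synthesized design
BASE_THROUGHPUT_MBPS = 114   # net throughput of the synthesized design
CLOCK_MHZ = 900              # achieved clock frequency
MESSAGE_BITS = 8448          # worst-case package size in message bits


def cycles_per_package(message_bits, throughput_mbps, clock_mhz):
    """Clock cycles the decoder spends on one package at this throughput."""
    return message_bits * clock_mhz / throughput_mbps


def scaled_throughput_mbps(lanes):
    """First-order estimate, assuming throughput scales linearly with lanes."""
    return BASE_THROUGHPUT_MBPS * lanes / BASE_LANES


measured_cycles = cycles_per_package(MESSAGE_BITS, BASE_THROUGHPUT_MBPS, CLOCK_MHZ)
print(f"cycles per worst-case package: {measured_cycles:.0f}")
for lanes in (64, 128, 256):
    print(f"{lanes:3d} lanes -> about {scaled_throughput_mbps(lanes):.0f} Mbit/s")
```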

[Figure 3: Layout view of the design, highlighting the components of the architecture]

Figure 3 shows a layout view of the design, highlighting the different components of the architecture. Unsurprisingly, the largest part of the area is occupied by the vector units and vector registers. The original RISC-V base model (which already included a divider, a multiplier, and a 32x32-bit scalar register file) takes about 22% of the total area. Arithmetic resources and registers similar to this RISC-V base would typically also occur in a fixed-function RTL design.

The area penalty resulting from software programmability is very small. The instruction decoder takes only about 1% of the total area. The program code implementing the LDPC application is only 384 bytes in size. The corresponding program memory is roughly equivalent in area to 3k logic gates, which is less than 1% of the total core area.

In general, the larger the datapath needed to accelerate the time-critical operations (area that a fixed-function RTL design would also need to achieve the same throughput), the smaller the relative area overhead attributable to programmability.
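
The overhead figures can be put side by side with a short calculation. The Python sketch below assumes that the 1% decoder share refers to the 430k logic gates, that the program memory comes on top of that figure, and that every instruction word is 64 bits wide. These are readings of the numbers above, not statements from the synthesis report.

```python
TOTAL_LOGIC_GATES = 430_000    # design size excluding memory
DECODER_PERCENT = 1            # instruction decoder share of the area
PROGRAM_MEMORY_GATES = 3_000   # gate-equivalent area of the program memory
PROGRAM_BYTES = 384            # LDPC program code size
INSTRUCTION_BITS = 64          # instruction width from the feature list


def overhead_share(overhead_gates, other_gates):
    """Fraction of the total area spent on programmability."""
    return overhead_gates / (overhead_gates + other_gates)


# Number of instruction words, assuming all words are 64 bits wide.
instruction_words = PROGRAM_BYTES * 8 // INSTRUCTION_BITS

decoder_gates = TOTAL_LOGIC_GATES * DECODER_PERCENT // 100
overhead_gates = decoder_gates + PROGRAM_MEMORY_GATES
other_gates = TOTAL_LOGIC_GATES - decoder_gates

share = overhead_share(overhead_gates, other_gates)
# Hypothetical design with twice the datapath and the same overhead.
share_doubled = overhead_share(overhead_gates, 2 * other_gates)

print(f"instruction words      : {instruction_words}")
print(f"programmability share  : {share:.1%}")
print(f"with doubled datapath  : {share_doubled:.1%}")
```

Under these assumptions, decoder and program memory together stay below 2% of the area, and the share shrinks further as the datapath grows.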

What's New: ASIP Designer W-2024.12 Release

Since the last edition of this newsletter, we have launched a new feature release for ASIP Designer in December 2024, offering various enhancements and extensions. Below is a categorized summary of these updates (ASIP Designer customers can refer to the official Release Notes for a comprehensive list of details).


Example Processor Models

The following updates have been made to the library of example processor models: 

  • A new example processor model "LDPC" has been added, featuring an accelerator for 5G Low-Density Parity Check decoding. For more details, refer to the section "Technology Feature: Conciliate fixed-function HW performance and programmability with ASIP Designer - an LDPC Decoder Case Study" above.
  • The DLX example processor family has been enhanced with a new variant "MLX", featuring a two-stage fetch pipeline to support a program memory with a latency of two cycles.
  • The Trv (RISC-V ISA) educational models have been extended with:
    • An example of how to model record registers
    • A model implementing the RV32E ISA
    • An example integrating hardware tracing functionality

Processor Modeling

  • Support for structs in the PDG language has been enhanced, including visualization in the ChessDE simulation widgets and expansion for RTL generation.
  • The nML viewer has been enhanced with more flexibility for expanding nML rules.

C/C++ Compiler

  • Improved support for complex load/store patterns, including access to unaligned objects.
  • Improved support for the 2-step constant generation in the RISC-V ISA, including RISC-V ABI compatible relocator expressions.
  • The synthetic member concept has been extended with an index operator, allowing for indirect access on a register file.
  • Non-leaf loop software pipelining has been extended to loops that contain function calls and jump-based if-then-else statements.
  • Integration of ChessCC into CMake has been enabled through support for the GCC-compatible -M family of options, in combination with a CMake toolchain file.
  • The Chess LLVM front-end switches to LLVM version 19.0.
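
For readers unfamiliar with the 2-step constant generation mentioned in the compiler list above: in the standard RISC-V ISA, a 32-bit constant is typically built from an upper 20-bit part (loaded with LUI) and a sign-extended lower 12-bit part (added with ADDI). The Python sketch below illustrates this general split only. It says nothing about how ASIP Designer implements it, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def split_hi_lo(value):
    """Split a 32-bit constant into a LUI part (hi20) and an ADDI part (lo12).

    ADDI sign-extends its 12-bit immediate, so 0x800 is added before taking
    the upper 20 bits to compensate for a negative lower part.
    """
    value &= MASK32
    hi20 = ((value + 0x800) >> 12) & 0xFFFFF
    lo12 = value & 0xFFF
    if lo12 >= 0x800:          # interpret the low 12 bits as a signed value
        lo12 -= 0x1000
    return hi20, lo12


def rebuild(hi20, lo12):
    """Recombine both parts the way LUI followed by ADDI would on RV32."""
    return ((hi20 << 12) + lo12) & MASK32


hi, lo = split_hi_lo(0x12345FFF)   # lower part is negative, upper part rounds up
assert rebuild(hi, lo) == 0x12345FFF
```

The two parts correspond to what the %hi and %lo relocation operators compute in RISC-V assembly.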

ChessDE GUI, Instruction-Set Simulation and Debugging

 
  • The language server support, which was introduced to the editor of the ChessDE IDE in Release U-2022.12, has been further enhanced and extended with additional functionality, including drill-down to the compiler processor header file from a generated LLVM header.
  • Signal-flow-graph (SFG) viewing has been enhanced in the editor.
  • Eclipse support has been updated to version 2024.06, with language server support enabled.
  • Support for application programming in Visual Studio Code has been added, as documented in a new manual "ASIP Extension for Visual Studio Code".
  • A new simulation mode "NI" (No Interface) has been introduced to speed up cycle-accurate simulation of aggressively scheduled code.
  • Support for real-time hardware tracing in on-chip debugging has been added. This functionality is part of a new ASIP Designer Tracing Add-On product.

RTL Generation, Verification, and Synthesis Support

  • Support for vector slicing has been extended, allowing multiple SIMD lanes to be combined into a single slice.
  • Formal ISA Verification, which was introduced in Release V-2023.12 as part of the Advanced Verification Add-On product, now includes program flow verification.

 
