Photos are available in the DATE 2024 Gallery.
The time zone for all times mentioned at the DATE website is CET – Central European Time (UTC+1). AoE = Anywhere on Earth.
REG Registration
Date: Monday, 31 March 2025
Time: 08:00 CET - 08:30 CET
OC Opening Ceremony
Date: Monday, 31 March 2025
Time: 08:30 CET - 09:00 CET
OK1 Opening Keynote 1
Date: Monday, 31 March 2025
Time: 09:00 CET - 09:45 CET
OK2 Opening Keynote 2: Giovanni De Micheli (EPFL, Switzerland)
Date: Monday, 31 March 2025
Time: 09:45 CET - 10:30 CET
ASD01 ASD technical session
Date: Monday, 31 March 2025
Time: 11:00 CET - 12:30 CET
BPA01 D Topic Session 1
Date: Monday, 31 March 2025
Time: 11:00 CET - 12:30 CET
QUANTIFYING TRADE-OFFS IN POWER, PERFORMANCE, AREA, AND TOTAL CARBON FOOTPRINT OF FUTURE THREE-DIMENSIONAL INTEGRATED COMPUTING SYSTEMS Speaker: Danielle Grey-Stewart, Harvard University, US Authors: Danielle Grey-Stewart, Mariam Elgamal, David Kong, Georgios Kyriazidis, Jalil Morris and Gage Hills, Harvard University, US Abstract To address computing's carbon footprint challenge, designers of computing systems are beginning to consider carbon footprint as a first-class figure of merit, alongside conventional metrics such as power, performance, and area. To account for total carbon (tC) footprint of a computing system, carbon footprint models must consider both embodied carbon (Cembodied) due to emissions during manufacturing, and operational carbon (Coperational) from day-to-day use. Models for Coperational are relatively mature due to the direct relationship between Coperational and energy consumed while computing. In contrast, models for Cembodied primarily focus on today's silicon-based technologies, not capturing the wide range of beyond-Si technologies that are actively being developed for future computing systems, including emerging nanomaterials, emerging memory devices, and various three-dimensional (3D) integration techniques. Cembodied models for emerging technologies are essential for accurately predicting which technology directions to pursue without exacerbating computing's carbon footprint. In this paper, we (1) develop Cembodied models for 3D-integrated computing systems that leverage emerging nanotechnologies. We analyze an example fabrication process that is highly promising for energy-efficient computing: 3D integration of carbon nanotube field-effect transistors (CNFETs) and indium gallium zinc oxide (IGZO) FETs fabricated directly on top of Si CMOS at a 7 nm technology node. We show that Cembodied of this process is, on average (considering various energy grids), 1.31× higher per wafer vs. a baseline 7 nm node Si CMOS process. (2) As a case study, we quantify trade-offs in power, performance, area, and tC footprint for an embedded system comprising an ARM Cortex-M0 processor and embedded DRAM, implemented in each of the above processes. For a representative lifetime of the system (running applications from the Embench suite for 2 hours per day over 24 months, with a clock frequency of 500 MHz), we show that the 3D IGZO/CNFET/Si implementation is 1.02× more carbon-efficient per good die (considering yield) vs. the baseline Si implementation, quantified by the product of tC and application execution time (tCDP, an effective metric of carbon efficiency). (3) Finally, we show techniques to quantify carbon efficiency benefits of future computing systems, even when there is uncertainty in carbon footprint models. Specifically, we show how to robustly compare tCDP for multiple computing systems, given underlying uncertainty in Cembodied, computing system lifetime, carbon intensity (in equivalent grams of CO2 emissions per unit energy consumption), and yield./proceedings-archive/2024/DATA/224_pdf_upload.pdf |
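The carbon accounting in this abstract composes from quantities it names explicitly: total carbon tC is the sum of embodied and operational carbon, operational carbon follows from energy drawn over the system lifetime times the grid's carbon intensity, and tCDP is the product of tC and application execution time. The sketch below only illustrates that arithmetic; every number in it is an invented placeholder, not a figure from the paper.

```python
# Illustrative back-of-the-envelope tCDP comparison (all values are placeholders,
# not figures from the paper).

def total_carbon(c_embodied_g, avg_power_w, active_hours, intensity_g_per_kwh):
    """Total carbon tC = Cembodied + Coperational (grams CO2-equivalent)."""
    energy_kwh = avg_power_w * active_hours / 1000.0
    c_operational = energy_kwh * intensity_g_per_kwh
    return c_embodied_g + c_operational

def tcdp(c_embodied_g, avg_power_w, active_hours, intensity_g_per_kwh, exec_time_s):
    """Carbon-delay product: total carbon multiplied by application execution time."""
    return total_carbon(c_embodied_g, avg_power_w, active_hours, intensity_g_per_kwh) * exec_time_s

# Two hypothetical implementations of the same workload over a 24-month lifetime,
# running about 2 hours per day (the usage pattern described in the abstract).
lifetime_hours = 2 * 30 * 24  # 2 h/day * ~30 days/month * 24 months

baseline = tcdp(c_embodied_g=5000.0, avg_power_w=0.05,
                active_hours=lifetime_hours, intensity_g_per_kwh=400.0,
                exec_time_s=1.0)
candidate = tcdp(c_embodied_g=6500.0, avg_power_w=0.03,
                 active_hours=lifetime_hours, intensity_g_per_kwh=400.0,
                 exec_time_s=0.8)

print(f"baseline tCDP:  {baseline:.1f} gCO2e*s")
print(f"candidate tCDP: {candidate:.1f} gCO2e*s  (lower is more carbon-efficient)")
```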
COMPUTE-IN-MEMORY ARRAY DESIGN USING STACKED HYBRID IGZO/SI EDRAM CELLS Speaker: Munhyeon Kim, Seoul National University, KR Authors: Munhyeon Kim1, Yulhwa Kim2 and Jae-Joon Kim1 1Seoul National University, KR; 2Sungkyunkwan University, KR Abstract To effectively accelerate neural networks in compute-in-memory (CIM) systems, higher memory cell density is critical for managing increasing computational workloads and parameters. While CMOS-based embedded dynamic random access memory (eDRAM) is being explored as an alternative, addressing the short retention time (tret) (<1 ms) remains a challenge for system applications. Recent studies highlight that InGaZnO (IGZO)-based eDRAM achieves a significantly longer retention time (>100 s), but additional improvements are needed due to considerable cell variability and slower operating speeds compared to CMOS-based cells. This paper proposes a 3T-based stacked hybrid IGZO/Si eDRAM (Hybrid-3T) cell and array design for CIM systems, alongside a system-level evaluation for deep neural network (DNN) workloads. The Hybrid-3T cell, built on 7-nm FinFET technology, extends the retention time by 100 s compared to IGZO-based 3T eDRAM (IGZO-3T). It also provides 3.4× higher bit cell density compared to 8T SRAM cells and 2× higher density than CMOS-based 3T eDRAM (CMOS-3T), while maintaining similar throughput and variability levels as eDRAM and SRAM systems. Additionally, DNN inference accuracy for vision and natural language processing (NLP) tasks is evaluated using the proposed CIM design, considering the impact of enhanced cell variability and retention time on system-level performance. The retention time required for CIM operation accuracy (tret,CIM) is more than 10^7 times longer in Hybrid-3T than in CMOS-3T, and the retention time accounting for variability (tret,CIM v) is over 3× longer than IGZO-3T eDRAM. Consequently, the proposed Hybrid-3T eDRAM CIM integrates the strengths of both CMOS-3T and IGZO-3T CIM designs, enabling high-performance, reliable systems./proceedings-archive/2024/DATA/974_pdf_upload.pdf |
TIMING-DRIVEN GLOBAL PLACEMENT BY EFFICIENT CRITICAL PATH EXTRACTION Speaker: Yunqi Shi, Nanjing University, CN Authors: Yunqi Shi1, Siyuan Xu2, Shixiong Kai2, Xi Lin3, Ke Xue1, Mingxuan Yuan4 and Chao Qian1 1Nanjing University, CN; 2Huawei Noah's Ark Lab, CN; 3Nanjing University, China, CN; 4Huawei Noah's Ark Lab, HK Abstract Timing optimization during the global placement of integrated circuits has been a significant focus for decades, yet it remains a complex, unresolved issue. Recent analytical methods typically use pin-level timing information to adjust net weights, which is fast and simple but neglects the path-based nature of the timing graph. The existing path-based methods, however, cannot balance the accuracy and efficiency due to the exponential growth of number of critical paths. In this work, we propose a GPU-accelerated timing-driven global placement framework, integrating accurate path-level information into the efficient DREAMPlace infrastructure. It optimizes the fine-grained pin-to-pin attraction objective and is facilitated by efficient critical path extraction. We also design a quadratic distance loss function specifically to align with the RC timing model. Experimental results demonstrate that our method significantly outperforms the current leading timing-driven placers, achieving an average improvement of 40.5% in total negative slack (TNS) and 8.3% in worst negative slack (WNS), as well as an improvement in half-perimeter wirelength (HPWL)./proceedings-archive/2024/DATA/166_pdf_upload.pdf |
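To make the pin-to-pin attraction idea above concrete, the following toy sketch penalizes the squared distance between pin pairs drawn from critical paths and moves positions by gradient descent. The pair list, weights, and step size are invented for illustration; this is not the paper's DREAMPlace-integrated, GPU-accelerated formulation.

```python
import numpy as np

# Toy quadratic pin-to-pin attraction (illustrative only; pairs, weights and
# step size are made up, not the paper's formulation).
pos = np.array([[0.0, 0.0], [8.0, 1.0], [3.0, 7.0], [9.0, 9.0]])  # (x, y) per pin
critical_pairs = [(0, 1), (1, 3), (2, 3)]   # pin pairs extracted from critical paths
weights = np.array([1.0, 2.0, 0.5])         # larger weight = more timing-critical pair

def quadratic_attraction(p):
    """Sum of w * squared Euclidean distance over critical pin pairs, plus gradient."""
    loss, grad = 0.0, np.zeros_like(p)
    for (i, j), w in zip(critical_pairs, weights):
        d = p[i] - p[j]
        loss += w * float(d @ d)
        grad[i] += 2.0 * w * d      # d/dp_i of w * ||p_i - p_j||^2
        grad[j] -= 2.0 * w * d
    return loss, grad

step = 0.05
for it in range(50):                # plain gradient descent on pin positions
    loss, grad = quadratic_attraction(pos)
    pos -= step * grad

print(f"final attraction loss: {loss:.3f}")
```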
ET01 Embedded Tutorial
Date: Monday, 31 March 2025
Time: 11:00 CET - 12:30 CET
FS01 Focus session
Date: Monday, 31 March 2025
Time: 11:00 CET - 12:30 CET
LKS01 Later … with the keynote speakers
Date: Monday, 31 March 2025
Time: 11:00 CET - 12:00 CET
TS01 Session 10 - D14 - Emerging design technologies for future computing
Date: Monday, 31 March 2025
Time: 11:00 CET - 12:30 CET
OPTIMAL SYNTHESIS OF MEMRISTIVE MIXED-MODE CIRCUITS Speaker: Ilia Polian, University of Stuttgart, DE Authors: Ilia Polian1, Xianyue Zhao2, Li-Wei Chen1, Felix Bayhurst1, Ziang Chen2, Heidemarie Schmidt2 and Nan Du2 1University of Stuttgart, DE; 2University of Jena and Leibniz Institute of Photonic Technology, Jena, Germany, DE Abstract Memristive crossbars are attractive for in-memory computing due to their integration density combined with compute and storage capabilities of their basic devices. However, yield and fidelity of emerging memristive technologies can make their reliable operation unattainable, thus raising interest in simpler topologies. In this paper, we consider synthesis of Boolean functions on 1D memristive line arrays. We propose an optimal procedure that can fully utilize the rich electrical behavior of memristive devices, mixing stateful (resistance-input) and nonstateful (voltage-input) operations as desired by the designer, leveraging their respective strengths. The synthesis method is based on Boolean satisfiability (SAT) solving and supports flexible constraints to enforce, e.g., restrictions of the available peripherals. We experimentally validate memristive logic circuits beyond individual logic gates by demonstrating the operation of a Galois field multiplier using a 1D line array of 10 memristors in parallel, highlighting the robust performance of our proposed mixed-mode circuit and its synthesis procedure./proceedings-archive/2024/DATA/114_pdf_upload.pdf |
NVCIM-PT: AN NVCIM-ASSISTED PROMPT TUNING FRAMEWORK FOR EDGE LLMS Speaker: Ruiyang Qin, University of Notre Dame, US Authors: Ruiyang Qin1, Pengyu Ren1, Zheyu Yan2, Liu Liu1, Dancheng Liu3, Amir Nassereldine3, Jinjun Xiong3, Kai Ni1, X. Sharon Hu1 and Yiyu Shi1 1University of Notre Dame, US; 2Zhejiang University, CN; 3University at Buffalo, US Abstract Large Language Models (LLMs) deployed on edge devices, known as edge LLMs, only use constrained resources to learn from user-generated data. Although existing learning methods have demonstrated performance improvements for edge LLMs, their constraints in high resource cost and low learning capacity limit their effectiveness as optimal learning methods for edge LLMs. Prompt tuning (PT), a learning method without these constraints, has significant potential to improve edge LLM performance while modifying only a small portion of LLM parameters. However, PT-based edge LLMs can suffer from user domain shift, leading to repetitive training that neither effectively improves performance nor resource efficiency. Conventional efforts to address domain shifts involve more complex neural network designs and sophisticated training, inevitably resulting in higher resource usage. It remains an open question: how can we avoid domain shift and high resource usage for edge LLM PT? In this paper, we propose a prompt tuning framework for edge LLMs, exploiting the benefits offered by non-volatile computing-in-memory (NVCiM) architectures. We introduce a novel NVCiM-assisted PT framework, where we narrow down the core operations to matrix-matrix multiplication, accelerated by performing in-situ computation on NVCiM. To the best of our knowledge, this is the first work employing NVCiM to improve the edge LLM PT performance./proceedings-archive/2024/DATA/118_pdf_upload.pdf |
PICELF: AN AUTOMATIC ELECTRONIC LAYER LAYOUT GENERATION FRAMEWORK FOR PHOTONIC INTEGRATED CIRCUITS Speaker: Xiaohan Jiang, Hong Kong University of Science and Technology, HK Authors: Xiaohan JIANG1, Yinyi LIU2, Peiyu CHEN1, Wei Zhang1 and Jiang Xu1 1Hong Kong University of Science and Technology, HK; 2Electronic and Computer Engineering Department, The Hong Kong University of Science and Technology, HK Abstract In recent years, the advent of photonic integrated circuits (PICs) has demonstrated great prospects and applications to address critical issues such as limited bandwidth, high latency, and high power consumption in data-intensive systems. However, the field of physical design automation for PICs remains in its infancy, with a notable gap in electronic layer layout design tools. Current research on PIC physical design automation primarily focuses on optical layer layouts, often overlooking the equally crucial electronic layer layouts. Although well-established for conventional integrated circuits (ICs), existing EDA tools are inadequately adapted for PICs due to their unique characteristics and constraints. As PICs grow in integration density and size, traditional manual-based design methods become increasingly inefficient and sub-optimal, potentially compromising overall PIC performance. To address this challenge, we propose PICELF, the first framework in the literature for automatic PIC electronic layer layout generation. Our framework comprises a nonlinear binary programming (NBP)-based netlist generator with scalability optimization and a two-stage router featuring initial parallel routing followed by post-routing optimization. We validate our framework's effectiveness and efficiency using a real PIC chip benchmark established by us. Experimental results demonstrate that our method can efficiently generate high-quality PIC electronic layer layouts and satisfy all design rules, within reasonable CPU times, while related existing methods are not applicable./proceedings-archive/2024/DATA/331_pdf_upload.pdf |
SYSTEM LEVEL PERFORMANCE EVALUATION FOR SUPERCONDUCTING SYSTEMS Speaker: Debjyoti Bhattacharjee, IMEC, BE Authors: Joyjit Kundu, Debjyoti Bhattacharjee, Nathan Josephsen, Ankit Pokhrel, Udara De Silva, Wenzhe Guo, Steven Winckel, Steven Brebels, Manu Perumkunnil, Quentin Herr and Anna Herr, imec, BE Abstract Superconducting Digital (SCD) technology offers significant potential for enhancing the performance of next generation large scale compute workloads. By leveraging advanced lithography and a 300 mm platform, SCD devices can reduce energy consumption and boost computational power. This paper presents an analytical performance modeling approach to evaluate the system-level performance benefits of SCD architectures for LLM training and inference. Our findings, based on experimental data and Pulse Conserving Logic (PCL) design principles, demonstrate substantial improvements in both training and inference. SCD's ability to address memory and interconnect limitations positions it as a promising solution for next-generation compute systems./proceedings-archive/2024/DATA/415_pdf_upload.pdf |
INTEGRATED HARDWARE ANNEALING BASED ON LANGEVIN DYNAMICS FOR ISING MACHINES Speaker: Hui Wu, University of Rochester, US Authors: Yongchao Liu, Lianlong Sun, Michael Huang and Hui Wu, University of Rochester, US Abstract Ising machines are non-von Neumann machines designed to solve combinatorial optimization problems (COP) by searching for the ground state, or the lowest energy configuration, within the Ising model. However, Ising machines often face the challenges of getting trapped in local minima due to the complex energy landscapes. Hardware annealing algorithms help mitigate this issue by using a probabilistic approach to steer the system toward the ground state. In this paper, we present a hardware annealing algorithm for Ising machines based on Langevin dynamics, a stochastic perturbation by random noise. Theoretical analysis, system-level design, and detailed circuit design are carried out. We evaluate the performance of the algorithm through chip-level simulation using a standard 65-nm CMOS technology to demonstrate the algorithm's efficacy. The results show that the proposed hardware annealing algorithm effectively guides the system to reach the ground state with a probability of 86.5%, significantly improving the solution quality by 97.5%. Further, we compare the algorithm with state-of-the-art hardware annealing methods through behavioral-level simulations, highlighting its improved solution quality alongside a 50% reduction in time-to-solution./proceedings-archive/2024/DATA/613_pdf_upload.pdf |
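For readers unfamiliar with the technique named above, Langevin dynamics adds annealed Gaussian noise to a gradient descent on the Ising energy so the state can escape shallow local minima. The sketch below is a generic software analogue on a random coupling matrix, assuming a simple continuous relaxation of the spins; it does not model the paper's 65-nm circuit implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric Ising couplings J (zero diagonal); energy E(s) = -0.5 * s^T J s.
n = 32
J = rng.standard_normal((n, n))
J = (J + J.T) / 2.0
np.fill_diagonal(J, 0.0)

def energy(s):
    return -0.5 * s @ J @ s

# Langevin-style annealing on a continuous relaxation x in [-1, 1]^n of the spins.
x = rng.uniform(-1.0, 1.0, n)
steps, dt = 2000, 0.02
for t in range(steps):
    temperature = 1.0 * (1.0 - t / steps)              # anneal the noise toward zero
    grad = -J @ x                                       # gradient of E(x) = -0.5 x^T J x
    x += -dt * grad + np.sqrt(2.0 * dt * temperature) * rng.standard_normal(n)
    x = np.clip(x, -1.0, 1.0)                           # project back onto the hypercube

s = np.sign(x)
s[s == 0] = 1.0                                         # read out hard spins
print("final Ising energy:", energy(s))
```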
NORA: NOISE-OPTIMIZED RESCALING OF LLMS ON ANALOG COMPUTE-IN-MEMORY ACCELERATORS Speaker: Garrett Gagnon, Rensselaer Polytechnic Institute, US Authors: Yayue Hou1, Hsinyu Tsai2, Kaoutar El Maghraoui2, Tayfun Gokmen2, Geoffrey Burr2 and Liu Liu1 1Rensselaer Polytechnic Institute, US; 2IBM, US Abstract Large Language Models (LLMs) have become critical in AI applications, yet current digital AI accelerators suffer from significant energy inefficiencies due to frequent data movement. Analog compute-in-memory (CIM) accelerators offer a potential solution for improving energy efficiency but introduce non-idealities that can degrade LLM accuracy. While analog CIM has been extensively studied for traditional deep neural networks, its impact on LLMs remains unexplored, particularly concerning the large influence of Analog CIM non-idealities. In this paper, we conduct a sensitivity analysis on the effects of analog-induced noise on LLM accuracy. We find that while LLMs demonstrate robustness to weight-related noise, they are highly sensitive to quantization noise and additive Gaussian noise. Based on these insights, we propose a noise-optimized rescaling method to mitigate LLM accuracy loss by shifting the non-ideality burden from the sensitive input/output to the more resilient weight. Through rescaling, we can implement the OPT-6.7b model on simulated analog CIM hardware with less than 1% accuracy loss from the floating-point baseline, compared to a much higher loss of around 30% without rescaling./proceedings-archive/2024/DATA/628_pdf_upload.pdf |
BOSON−1: UNDERSTANDING AND ENABLING PHYSICALLY-ROBUST PHOTONIC INVERSE DESIGN WITH ADAPTIVE VARIATION-AWARE SUBSPACE OPTIMIZATION Speaker: Haoyu Yang, Nvidia Inc., US Authors: Pingchuan Ma1, Zhengqi Gao2, Amir Begovic3, Meng Zhang3, Haoyu Yang4, Haoxing Ren4, Rena Huang3, Duane Boning2 and Jiaqi Gu1 1Arizona State University, US; 2Massachusetts Institute of Technology, US; 3Rensselaer Polytechnic Institute, US; 4NVIDIA Corp., US Abstract Nanophotonic device design aims to optimize photonic structures to meet specific requirements across various applications. Inverse design has unlocked non-intuitive, high-dimensional design spaces, enabling the discovery of compact, high-performance device topologies beyond traditional heuristic or analytic methods. The adjoint method, which calculates analytical gradients for all design variables using just two electromagnetic simulations, enables efficient navigation of this complex space. However, many inverse-designed structures, while numerically plausible, are difficult to fabricate and highly sensitive to physical variations, limiting their practical use. The discrete material distributions with numerous local-optimal structures also pose significant optimization challenges, often causing gradient-based methods to converge on suboptimal designs. In this work, we formulate inverse design as a fabrication-restricted, discrete, probabilistic optimization problem and introduce BOSON-1, an end-to-end, adaptive, variation-aware subspace optimization framework to address the challenges of manufacturability, robustness, and optimizability. With elegant reparametrization, we explicitly emulate the fabrication process and differentiably optimize the design in the fabricable subspace. To overcome optimization difficulty, we propose dense target-enhanced gradient flows to mitigate misleading local optima and introduce a conditional subspace optimization strategy to create high-dimensional tunnels to escape local optima. Furthermore, we significantly reduce the prohibitive runtime associated with optimizing across exponential variation samples through an adaptive sampling-based robust optimization method, ensuring both efficiency and variation robustness. On three representative photonic device benchmarks, our proposed inverse design methodology BOSON-1 delivers fabricable structures and achieves the best convergence and performance under realistic variations, outperforming prior arts with 74.3% post-fabrication performance./proceedings-archive/2024/DATA/647_pdf_upload.pdf |
BIMAX: A BITWISE IN-MEMORY ACCELERATOR USING 6T-SRAM STRUCTURE Speaker: Nezam Rohbani, BSC, ES Authors: Nezam Rohbani1, Mohammad Arman Soleimani2, Behzad Salami3, Osman Unsal3, Adrian Cristal Kestelman3 and Hamid Sarbazi-Azad4 1Institute for Research in Fundamental Sciences (IPM), IR; 2Sharif University of Technology, IR; 3BSC, ES; 4Sharif University of Technology, IR Abstract The in-memory computing (IMC) paradigm reduces costly and inefficient data transfer between memory modules and processing cores by implementing simple and parallel operations inside the memory subsystem. SRAM, the fastest memory structure in the memory hierarchy, is an appropriate platform to implement IMC. However, the main challenges of implementing IMC in SRAM are the limited operations and unreliable accuracy due to environmental noise and process variations. This work proposes a low-latency, energy-efficient, and noise-robust IMC technique, called Bitwise In-Memory Accelerator using 6T-SRAM Structure (BIMAX). BIMAX performs parallel bitwise operations (i.e., (N)AND, (N)OR, NOT, X(N)OR) as well as row-copy with the capability of writing the computation result back to a target memory row. BIMAX functionality is based on an imbalanced differential sense amplifier (SA) that reads and writes data from and into multiple 6T-SRAM cells. The simulations show BIMAX performs these operations with 52.7% lower energy dissipation compared to the state-of-the-art IMC technique, with a 5.7% higher average performance rate. Furthermore, BIMAX is about 5.4× more robust against environmental noise compared to the state-of-the-art./proceedings-archive/2024/DATA/847_pdf_upload.pdf |
DSC-ROM: A FULLY DIGITAL SPARSITY-COMPRESSED COMPUTE-IN-ROM ARCHITECTURE FOR ON-CHIP DEPLOYMENT OF LARGE-SCALE DNNS Speaker: Tianyi Yu, Tsinghua University, CN Authors: Tianyi Yu, Zhonghao Chen, Yiming Chen, Shuang Wang, Yongpan Liu, Huazhong Yang and Xueqing Li, Tsinghua University, CN Abstract Compute-in-Memory (CiM) is a promising technique to mitigate the memory bottleneck for energy-efficient deep neural network (DNN) inference. Unfortunately, conventional SRAM-based CiM has low density and limited on-chip capacity, resulting in undesired weight reloading from off-chip DRAM. The emerging high-density ROM-based CiM architecture has recently revealed the opportunity of deploying large-scale DNNs on-chip, with optional assisting SRAM to ensure moderate flexibility. However, prior analog-domain ROM CiM still suffers from limited memory density improvement and low computing area efficiency due to stringent array structure and large A/D converter (ADC) overhead. This paper presents DSC-ROM, a fully digital sparsity-compressed compute-in-ROM architecture to address these challenges. DSC-ROM introduces a fully synthesizable macro-level design methodology that achieves a record-high memory density of 27.9 Mb/mm^2 in a 28nm CMOS technology. Experimental results show that the macro area efficiency of DSC-ROM improves by 5.6-6.6x compared with prior analog-based ROM CiM. Furthermore, a novel weight fine-tuning technique is proposed to ensure task transfer flexibility and reduce required assisting SRAM cells by 94.4%. Experimental results show that DSC-ROM designed for ResNet-18 pre-trained on ImageNet dataset achieves <0.5% accuracy loss in CIFAR-10 and FER2013, compared with the fully SRAM-based CiM./proceedings-archive/2024/DATA/1105_pdf_upload.pdf |
COMPACT NON-VOLATILE LOOKUP TABLE ARCHITECTURE BASED ON FERROELECTRIC FET ARRAY THROUGH IN-SITU COMBINATORIAL ONE-HOT ENCODING FOR RECONFIGURABLE COMPUTING Speaker: Weikai Xu, Peking University, CN Authors: Weikai Xu, Meng Li, Qianqian Huang and Ru Huang, Peking University, CN Abstract Lookup tables (LUTs) are widely used for reconfigurable computing applications due to the capability of implementing arbitrary logic functions. Various emerging non-volatile memories (eNVMs) have been introduced for LUT designs with reduced hardware cost and power consumption compared with conventional SRAM-based LUT. However, the existing designs still follow the conventional LUT architecture, where the memory cells are only used for storage of configuration bits, requiring dedicated bulky multiplexer (MUX) for computation of each LUT, resulting in inevitable high area, latency, and energy cost. In this work, a compact and efficient non-volatile LUT architecture based on ferroelectric FET (FeFET) array is proposed, where the configuration bit storage and computation can be implemented within the FeFET array through in-situ combinatorial one-hot encoding, eliminating the need of costly MUX for each LUT. Moreover, multibit LUTs can be efficiently implemented in the FeFET array using only one shared decoder instead of multiple costly MUXs. Due to the eliminated MUX in the calculation path, the proposed LUT can also achieve enhanced computation speed compared with the conventional LUTs. Based on the proposed LUT architecture, the input expansion of LUT, full adder, and content addressable memory are further implemented and demonstrated with reduced hardware and energy cost. Evaluation results show that the proposed FeFET array-based LUT architecture achieves 51.7×/8.3× reduction in area-energy-delay product compared with conventional SRAM-based/FeFET-based LUT architecture, indicating its great potential for reconfigurable computing applications./proceedings-archive/2024/DATA/1139_pdf_upload.pdf |
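As background for the architecture above, a k-input LUT can be read as a one-hot decode of the inputs followed by selection of one of the 2^k stored configuration bits. The sketch below is a plain software model of that behavior only; the paper's contribution, performing the decode and selection in situ inside the FeFET array, is not modeled here.

```python
def lut_eval(config_bits, inputs):
    """Evaluate a k-input LUT: one-hot decode the inputs, then select one config bit."""
    k = len(inputs)
    assert len(config_bits) == 2 ** k
    index = sum(bit << i for i, bit in enumerate(inputs))    # binary inputs -> integer
    one_hot = [1 if i == index else 0 for i in range(2 ** k)]
    # Output is the OR over (one-hot line AND stored configuration bit).
    return max(oh & cb for oh, cb in zip(one_hot, config_bits))

# Example: configure a 2-input LUT as XOR (truth table ordered 00, 01, 10, 11
# with inputs[0] as the least-significant bit).
xor_config = [0, 1, 1, 0]
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", lut_eval(xor_config, [a, b]))
```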
GRAMC: GENERAL-PURPOSE AND RECONFIGURABLE ANALOG MATRIX COMPUTING ARCHITECTURE Speaker: Lunshuai Pan, Peking University, CN Authors: Lunshuai Pan, Shiqing Wang, Pushen Zuo and Zhong Sun, Peking University, CN Abstract In-memory analog matrix computing (AMC) with resistive random-access memory (RRAM) represents a highly promising solution that solves matrix problems in one step. However, the existing AMC circuits each have a specific connection topology to implement a single computing function, lack of the universality as a matrix processor. In this work, we design a reconfigurable AMC macro for general-purpose matrix computations, which is achieved by configuring proper connections between memory array and amplifier circuits. Based on this macro, we develop a hybrid system that incorporates an on-chip write-verify scheme and digital functional modules, to deliver a general-purpose AMC solver for various applications./proceedings-archive/2024/DATA/1195_pdf_upload.pdf |
SHWCIM:A SCALABLE HETEROGENEOUS WORKLOAD COMPUTING-IN-MEMORY ARCHITECTURE Speaker: Yanfeng Yang, School of Microelectronics, South China University of Technology, CN Authors: Yanfeng Yang1, Yi Zou2, Zhibiao Xue2 and Liuyang Zhang3 1School of Integrated Circuits, South China University of Technology, CN; 2School of Microelectronics, South China University of Technology, CN; 3School of Microelectronics, Southern University of Science and Technology, CN Abstract This study introduces HWCIM, a SRAM-based Computing-In-Memory core, and SHWCIM, a CIM-capable Coarse-Grained Reconfigurable Architecture, to enhance resource utilization, multi-functionality, and on-chip memory size in SRAM-based CIM designs. Evaluated using the SMIC 55nm process, HWCIM achieves 1.6× lower power, 2.8× higher energy efficiency, and up to 4.1× smaller area compared to previous CIM and CGRA works. Additionally, SHWCIM delivers an average 105.9× speedup over existing CGRAs and consumes 2–5× less energy than the Nvidia A40 GPU on realistic workloads./proceedings-archive/2024/DATA/1060_pdf_upload.pdf |
TS02 Session 1 - A3 - Secure systems, circuits, and architectures
Date: Monday, 31 March 2025
Time: 11:00 CET - 12:30 CET
FLEXENM: A FLEXIBLE ENCRYPTING-NEAR-MEMORY WITH REFRESH-LESS EDRAM-BASED MULTI-MODE AES Speaker: Hyunseob Shin, Korea University, KR Authors: Hyunseob Shin and Jaeha Kung, Korea University, KR Abstract On-chip cryptography engines face significant challenges in efficiently processing large volumes of data while maintaining security and versatility. Most existing solutions support only a single AES mode, limiting their applicability across diverse use cases. This paper introduces FlexENM, a low-power and area-efficient near-eDRAM encryption engine. The FlexENM implements refresh-less operation by leveraging inherent characteristics of the AES algorithm, reordering AES stages, and employing a simultaneous read and write scheme using dual-port eDRAM. Furthermore, FlexENM supports three AES modes, parallelizing their operations and sharing hardware resources across different modes to improve compute efficiency. Compared to other AES engines, FlexENM achieves 16% lower power consumption and 83% higher throughput per unit area, on average, demonstrating improved power- and area-efficiency for on-chip data protection./proceedings-archive/2024/DATA/201_pdf_upload.pdf |
PASTA ON EDGE: CRYPTOPROCESSOR FOR HYBRID HOMOMORPHIC ENCRYPTION Speaker: Aikata Aikata, TU Graz, AT Authors: Aikata Aikata1, Daniel Sobrino2 and Sujoy Sinha Roy1 1TU Graz, AT; 2Universidad Politécnica de Madrid, ES Abstract Fully Homomorphic Encryption (FHE) enables privacy-preserving computation but imposes significant computational and communication overhead on the client for the public-key encryption. To alleviate this burden, previous works have introduced the Hybrid Homomorphic Encryption (HHE) paradigm, which combines symmetric encryption with homomorphic decryption to enhance performance for the FHE client. While early HHE schemes focused on binary data, modern versions now support integer prime fields, improving their efficiency for practical applications such as secure machine learning. Despite several HHE schemes proposed in the literature, there has been no comprehensive study evaluating their performance or area advantages over FHE for encryption tasks. This paper addresses this gap by presenting the first implementation of an HHE scheme- PASTA. It is a symmetric encryption scheme over integers designed to facilitate fast client encryption and homomorphic symmetric decryption on the server. We provide its performance results for both FPGA and ASIC platforms, including a RISC-V System-on-Chip (SoC) implementation on a low-end 130nm ASIC technology, which achieves a 43–171x speedup compared to a CPU. Additionally, on high-end 7nm and 28nm ASIC platforms, our design demonstrates a 97x speedup over prior public-key client accelerators for FHE. We have made our design public and benchmarked an application to support future research./proceedings-archive/2024/DATA/325_pdf_upload.pdf |
DESIGN, IMPLEMENTATION AND VALIDATION OF NSCP: A NEW SECURE CHANNEL PROTOCOL FOR HARDENED IOT Speaker: Vittorio Zaccaria, Politecnico di Milano, IT Authors: Joan Bushi1, Alberto Battistello2, Guido Bertoni2 and Vittorio Zaccaria1 1Politecnico di Milano, IT; 2Security Pattern, IT Abstract This paper deals with the design, implementation, and validation of a new secure channel protocol to connect microcontrollers and secure elements. The new secure channel protocol (NSCP) relies on a lightweight cryptographic primitive (Xoodyak) and simplified operating principles to provide secure data exchange. The performance of the new protocol is compared with that of GlobalPlatform's Secure Channel Protocol 03 (SCP03), the current de facto standard for hardening the connection between a microcontroller and a secure element in industrial IoT. The evaluation was performed in two scenarios where the secure element was emulated with an Arm Cortex M4 and an OpenHW RISC-V MPU synthesized on an Artix FPGA. The results of the evaluation are an indicator of the potential advantage of the new protocol over SCP03: In the best case, the new protocol is able to apply cryptographic protection to messages 3.64x to 4x faster than SCP03 at its maximum security level. The speedup in the channel initiation process is also considerable, with a factor of up to 3.7. These findings demonstrate that it is possible to conceive a new protocol which offers adequate cryptographic protection, while being more lightweight than the present standard./proceedings-archive/2024/DATA/463_pdf_upload.pdf |
RHYCHEE-FL: ROBUST AND EFFICIENT HYPERDIMENSIONAL FEDERATED LEARNING WITH HOMOMORPHIC ENCRYPTION Presenter: Yujin Nam, University of California, San Diego, US Authors: Yujin Nam1, Abhishek Moitra2, Yeshwanth Venkatesha2, Xiaofan Yu1, Gabrielle De Micheli1, Xuan Wang1, Minxuan Zhou3, Augusto Vega4, Priyadarshini Panda2 and Tajana Rosing1 1University of California, San Diego, US; 2Yale University, US; 3Illinois Tech, US; 4IBM Research, US Abstract Federated learning (FL) is a widely-used collaborative learning approach where clients train models locally without sharing their data with servers. However, privacy concerns remain since clients still upload locally trained models, which could reveal sensitive information. Fully homomorphic encryption (FHE) addresses this issue by enabling clients to share encrypted models and the server to aggregate them without decryption. While FHE resolves the privacy concerns, the encrypted data introduces larger communication and computational complexity. Moreover, ciphertexts are vulnerable to channel noise, where a single bit error can disrupt model convergence. To overcome these limitations, we introduce Rhychee-FL, the first lightweight and noise-resilient FHE-enabled FL framework based on Hyperdimensional Computing (HDC), a low-overhead training method. Rhychee-FL leverages HDC's small model size and noise resilience to reduce communication overhead and enhance model robustness without sacrificing accuracy or privacy. Additionally, we thoroughly investigate the parameter space of Rhychee-FL and propose an optimized system in terms of computation and communication costs. Finally, we show that our global model can successfully converge without being impacted by channel noise. Rhychee-FL achieves comparable final accuracy to CNN, while reaching 90% accuracy in 6x fewer rounds and with 2.2x greater communication efficiency. Our framework shows at least 4.5x faster client side latency compared to previous FHE-based FL works. |
COMPROMISING THE INTELLIGENCE OF MODERN DNNS: ON THE EFFECTIVENESS OF TARGETED ROW PRESS Speaker: Shaahin Angizi, New Jersey Institute of Technology, US Authors: Ranyang Zhou1, Jacqueline Liu2, Sabbir Ahmed3, Shaahin Angizi1 and Adnan Siraj Rakin2 1New Jersey Institute of Technology, US; 2Binghamton University, US; 3Binghamton University (SUNY), US Abstract Recent advancements in side-channel attacks have revealed the vulnerability of modern Deep Neural Networks (DNNs) to malicious adversarial weight attacks. The well-studied RowHammer attack has effectively compromised DNN performance by inducing precise and deterministic bit-flips in the main memory (e.g., DRAM). Similarly, RowPress has emerged as another effective strategy for flipping targeted bits in DRAM. However, the impact of RowPress on deep learning applications has yet to be explored in the existing literature, leaving a fundamental research question unanswered: How does RowPress compare to RowHammer in leveraging bit-flip attacks to compromise DNN performance? This paper is the first to address this question and evaluate the impact of RowPress on DNN applications. We conduct a comparative analysis utilizing a novel DRAM-profile-aware attack designed to capture the distinct bit-flip patterns caused by RowHammer and RowPress. Eleven widely-used DNN architectures trained on different benchmark datasets deployed on a Samsung DRAM chip conclusively demonstrate that they suffer from a drastically more rapid performance degradation under the RowPress attack compared to RowHammer. The difference in the underlying attack mechanism of RowHammer and RowPress also renders existing RowHammer mitigation mechanisms ineffective under RowPress. As a result, RowPress introduces a new vulnerability paradigm for DNN compute platforms and unveils the urgent need for corresponding protective measures./proceedings-archive/2024/DATA/950_pdf_upload.pdf |
COALA: COALESCION-BASED ACCELERATION OF POLYNOMIAL MULTIPLICATION FOR GPU EXECUTION Speaker: Homer Gamil, New York University, US Authors: Homer Gamil, Oleg Mazonka and Michail Maniatakos, New York University Abu Dhabi, AE Abstract In this study, we introduce Coala, a novel framework designed to enhance the performance of finite field transformations for GPU environments. We have developed a GPU-optimized version of the Discrete Galois Transformation (DGT), a variant of the Number Theoretic Transform (NTT). We introduce a novel data access pattern scheme specifically engineered to enable coalesced accesses, significantly enhancing the efficiency of data transfers between global and shared memory. This enhancement not only boosts execution efficiency but also optimizes the interaction with the GPU's memory architecture. Additionally, Coala presents a comprehensive framework that optimizes the allocation of computational tasks across the GPU's architecture and execution kernels, thereby maximizing the use of GPU resources. Lastly, we provide a flexible method to adjust security levels and polynomial sizes through the incorporation of an in-kernel RNS method, and a flexible parameter generation approach. Comparative analysis against current state-of-the-art techniques reveals significant improvements. We observe performance gains of 2.82x - 17.18x against other DGT works on GPUs for different parameters, achieved concurrently with equal or lesser memory utilization./proceedings-archive/2024/DATA/1082_pdf_upload.pdf |
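For context, the Discrete Galois Transform accelerated here is described as a variant of the number theoretic transform (NTT), i.e., a DFT over a finite field that turns polynomial multiplication into pointwise multiplication. The sketch below is a textbook quadratic-time NTT with toy parameters, included only to show the arithmetic; it reflects none of Coala's GPU kernels, coalesced access patterns, or RNS parameter generation.

```python
# Naive O(n^2) number theoretic transform over Z_q, with q = 17, n = 8 and
# omega = 2 (2 has multiplicative order 8 mod 17). Toy parameters for illustration.
Q, N, OMEGA = 17, 8, 2

def ntt(a, omega=OMEGA, q=Q):
    return [sum(a[j] * pow(omega, i * j, q) for j in range(len(a))) % q
            for i in range(len(a))]

def intt(A, omega=OMEGA, q=Q):
    n = len(A)
    inv_n = pow(n, -1, q)               # modular inverse of n (Python 3.8+)
    inv_omega = pow(omega, -1, q)
    return [(inv_n * x) % q for x in ntt(A, omega=inv_omega, q=q)]

poly = [3, 1, 4, 1, 5, 9, 2, 6]
assert intt(ntt(poly)) == poly          # round trip recovers the polynomial

# Pointwise multiplication in the NTT domain corresponds to cyclic convolution of
# the coefficient vectors (negacyclic variants are used in RLWE-based FHE schemes).
prod = intt([x * y % Q for x, y in zip(ntt(poly), ntt([1, 0, 0, 0, 0, 0, 0, 0]))])
print(prod)   # convolving with the unit impulse returns the original polynomial
```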
HEILP: AN ILP-BASED SCALE MANAGEMENT METHOD FOR HOMOMORPHIC ENCRYPTION COMPILER Speaker: Weidong Yang, Shanghai Jiao Tong University, CN Authors: Weidong Yang, Shuya Ji, Jianfei Jiang, Naifeng Jing, Qin Wang, Zhigang Mao and Weiguang Sheng, Shanghai Jiao Tong University, CN Abstract RNS-CKKS, a fully homomorphic encryption (FHE) scheme enabling secure computation on encrypted data, has been widely used in statistical analysis and data mining. However, developing RNS-CKKS programs requires substantial knowledge of cryptography, which is unfriendly to non-expert programmers. A critical obstacle is scale management, which affects the complexity of programming and performance. Different FHE operations impose specific requirements on the scale and level, necessitating programmer intervention to ensure the recoverability of the results. Furthermore, operations at different levels have a significant impact on program performance. Existing methods rely on heuristic insights or iterative methods to manage the scales of ciphertexts. However, these methods lack a holistic understanding of the optimization space, leading to inefficient exploration and suboptimal performance. This work proposes HEILP, the first constrained-optimization-based approach for scale management in FHE. HEILP expresses node scale decisions and the insertion of scale management operations as an integer linear programming model that can be solved with existing mathematical techniques in one shot. Our method creates a more comprehensive optimization space and enables a faster and more efficient exploration. Experimental results demonstrate that HEILP achieves an average performance improvement of 1.72x over the existing heuristic method, and delivers a 1.19x performance improvement with 48.65x faster compilation time compared to the state-of-the-art iteration-based method./proceedings-archive/2024/DATA/1102_pdf_upload.pdf |
A UNIFIED VECTOR PROCESSING UNIT FOR FULLY HOMOMORPHIC ENCRYPTION Speaker: Jiangbin Dong, Xi'an Jiaotong University, CN Authors: Jiangbin Dong1, Xinhua Chen2 and Mingyu Gao3 1Xi'an Jiaotong University, CN; 2Fudan University, CN; 3Tsinghua University, CN Abstract Fully homomorphic encryption (FHE) algorithms enable privacy-preserving computing directly on encrypted data without leaking sensitive contents, while their excessive computational overheads could be alleviated by specialized hardware accelerators. The vector architecture has been prominently used for FHE accelerators to match the underlying polynomial data structures. While most FHE operations can be efficiently supported by vector processing units, the number theoretic transform (NTT) and automorphism operators involve complex and irregular data permutations among vector elements, and thus are handled with separate dedicated hardware units in existing FHE accelerators. In this paper, we present an efficient inter-lane network design and the corresponding dataflow control scheme, in order to realize NTT and automorphism operations among the multiple lanes of a vector unit. An arbitrarily large operator is first decomposed to fit in the fixed width of the vector unit, and the required data permutation and transposition are conducted on the specialized inter-lane network. Compared to previous designs, our solution reduces the hardware resources needed, with up to 9.4x area and 6.0x power savings for only the inter-lane network, and up to 1.2x area and 1.1x power savings for the whole vector unit./proceedings-archive/2024/DATA/1170_pdf_upload.pdf |
TESTING ROBUSTNESS OF HOMOMORPHICALLY ENCRYPTED SPLIT MODEL LLMS Speaker: Lars Folkerts, University of Delaware, US Authors: Lars Folkerts and Nektarios Georgios Tsoutsos, University of Delaware, US Abstract Large language models (LLMs) have recently transformed many industries, enhancing content generation, customer service agents, data analysis, and even software generation. These applications are often hosted on remote servers to protect the neural-network model IP; however, this raises concerns about the privacy of input queries. Fully Homomorphic Encryption (FHE), an encryption technique that allows computations on private data, has been proposed as a solution to this challenge. Nevertheless, due to the increased size of LLMs and the computational overheads of FHE, today's practical FHE LLMs are implemented using a split model approach. Here, a user sends their FHE encrypted data to the server to run an encrypted attention head layer; then the server returns the result of the layer for the user to run the rest of the model locally. By employing this method, the server maintains part of their model IP, while the user still gets to perform private LLM inference. In this work, we evaluate the neural-network model IP protections of single-layer split model LLMs, and demonstrate a novel attack vector that makes it easy for a user to extract the neural network model IP from the server, bypassing the claimed protections for encrypted computation. In our analysis, we demonstrate the feasibility of this attack, and discuss potential mitigations./proceedings-archive/2024/DATA/1346_pdf_upload.pdf |
TARN: TRUST AWARE ROUTING TO ENHANCE SECURITY IN 3D NETWORK-ON-CHIPS Speaker: Hasin Ishraq Reefat, University of Maryland Baltimore County, US Authors: Hasin Ishraq Reefat1, Alec Aversa2, Ioannis Savidis2 and Naghmeh Karimi1 1University of Maryland Baltimore County, US; 2Drexel University, US Abstract The growing complexity and performance demands of modern computing systems have resulted in a shift from traditional System-on-Chip (SoC) designs to Network-on-Chip (NoC) architectures, and further to three-dimensional Network-on-Chip (3D NoC) solutions. Despite their performance and power efficiency, the increased complexity and inter-layer communication of 3D NoCs can create opportunities for adversaries who opt to prevent reliable communications between embedded nodes by inserting hardware Trojans in such nodes. The hardware Trojans, introduced through untrusted third-party Intellectual Property (IP) blocks, can severely compromise 3D NoCs by tampering with data integrity, misrouting packets, or dropping them; thus triggering denial-of-service attacks. Detecting such behaviors is particularly difficult due to their infrequent activation. It is therefore of utmost importance to take the trustworthiness of the embedded nodes into account when routing the packets in the NoCs. Accordingly, this paper proposes a trust-aware routing scheme, called TARN, to significantly reduce the rate of packet loss that can occur due to malicious behaviors of one or more nodes (or interconnects). Our distributed trust-aware path selection protocol bypasses malicious IPs and securely routes packets to their destination. Furthermore, we introduce a low-overhead mechanism for delegating trust scores to neighboring routers, thereby enhancing network efficiency. Experimental results demonstrate significant reductions in packet loss while imposing low performance and energy overhead./proceedings-archive/2024/DATA/1373_pdf_upload.pdf |
C2C: A FRAMEWORK FOR CRITICAL TOKEN CLASSIFICATION IN TRANSFORMER-BASED INFERENCE SYSTEMS Speaker: Sihyun Kim, KAIST, KR Authors: Myeongjae Jang, Jesung Kim, Haejin Nam, Sihyun Kim and Soontae Kim, KAIST, KR Abstract Because embedding vectors in a Transformer-based model represent crucial information about input texts, attacks or errors affecting them can cause severe accuracy degradation. We observe, for the first time, critical tokens that determine the overall accuracy even though their embedding vectors take up only a small portion of the embedding table. Therefore, we propose a framework called C2C that classifies the critical tokens to facilitate their protection in a Transformer-based inference system with a small overhead. For BERT on the GLUE datasets, critical embedding vectors take up only 13.8% of the embedding table. Compromising critical embedding vectors can reduce accuracy by up to 44.8% even if other parameters are not corrupted./proceedings-archive/2024/DATA/102_pdf_upload.pdf |
A DRAM-BASED PROCESSING-IN-MEMORY ACCELERATOR FOR PRIVACY-PROTECTING MACHINE LEARNING Speaker and Author: Bokyung Kim, Rutgers University, US /proceedings-archive/2024/DATA/889_pdf_upload.pdf |
BPA02 BPA 2 - D Topic Session 2
Date: Monday, 31 March 2025
Time: 14:00 CET - 15:30 CET
QGDP: QUANTUM LEGALIZATION AND DETAILED PLACEMENT FOR SUPERCONDUCTING QUANTUM COMPUTERS Speaker: Junyao Zhang, Duke University, US Authors: Junyao Zhang1, Guanglei Zhou1, Feng Cheng1, Jonathan Ku1, Qi Ding2, Jiaqi Gu3, Hanrui Wang4, Hai (Helen) Li1 and Yiran Chen1 1Duke University, US; 2Massachusetts Institute of Technology, US; 3Arizona State University, US; 4University of California, Los Angeles, US Abstract Quantum computers (QCs) are currently limited by qubit numbers. A major challenge in scaling these systems is crosstalk, which arises from unwanted interactions among neighboring components such as qubits and resonators. An innovative placement strategy tailored for superconducting QCs can systematically address crosstalk within limited substrate areas. Legalization is a crucial stage in the placement process, refining post-global-placement configurations to satisfy design constraints and enhance layout quality. However, existing legalizers do not support legalizing quantum placements. We aim to address this gap with qGDP, developed to meticulously legalize quantum components by adhering to quantum spatial constraints and reducing resonator crossings to alleviate various crosstalk effects. Our results indicate that qGDP effectively legalizes and fine-tunes the layout, addressing the quantum-specific spatial constraints inherent in various device topologies. Across diverse benchmarks, qGDP consistently outperforms state-of-the-art legalization engines, delivering substantial improvements in fidelity and reducing spatial violations, with average gains of 34.4x and 16.9x, respectively./proceedings-archive/2024/DATA/343_pdf_upload.pdf |
RVEBS: EVENT-BASED SAMPLING ON RISC-V Speaker: Tiago Rocha, INESC-ID, Instituto Superior Técnico, University of Lisbon, Portugal, PT Authors: Tiago Rocha1, Nuno Neves2, Nuno Roma2, Pedro Tomás3 and Leonel Sousa4 1INESC-ID, PT; 2INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, PT; 3INESC-ID, Instituto Superior Técnico, PT; 4INESC-ID | Universidade de Lisboa, PT Abstract As the RISC-V ISA continues to gain traction for both embedded and high-performance computing, the demand for advanced monitoring tools has become critical to fine-tuning the applications' performance. Current RISC-V hardware performance monitors already provide basic event counting but lack sophisticated features like event-based sampling, which are available in more established architectures such as x86 and ARM. This paper presents the first RISC-V Event-Based Sampling (RVEBS) system for comprehensive performance monitoring and application profiling. The proposed system builds upon existing RISC-V specifications, incorporating necessary modifications to enable the desired functionality. It also presents an OpenSBI extension to provide privileged software access to newly implemented control status registers that manage the sampling process. An implementation use case based on an OpenPiton processor featuring a CVA6 core on 28nm CMOS technology is presented. The results indicate that the proposed scheme is lightweight, highly accurate, and does not impact the processor's critical path while maintaining minimal impact on overall application performance./proceedings-archive/2024/DATA/549_pdf_upload.pdf |
XRAY: DETECTING AND EXPLOITING VULNERABILITIES IN ARM AXI INTERCONNECTS Speaker: Melisande Zonta, ETH Zurich, CH Authors: Melisande Zonta, Nora Hinderling and Shweta Shinde, ETH Zurich, CH Abstract The Arm AMBA Advanced eXtensible Interface (AXI) interconnect is a critical IP in FPGA-based designs. While AXI and interconnect designs are primarily optimized for performance, their security requires closer investigation—any bugs in these components can potentially compromise critical IPs like processing systems and memory. To this end, Xray systematically analyzes AXI interconnects. Specifically, it treats the AXI interconnect as a transaction processing block that is expected to adhere to certain properties (e.g., bus and data isolation, progress). Then, Xray employs a traffic generator that creates transaction workloads with the aim of triggering violations in the AXI interconnects. As the last piece of the puzzle, Xray wrappers automatically flag transaction traces as either compliant, errors, or warnings. Put together, Xray comprises 13 properties, has been tested on 7 interconnects, identifies 41 violations corresponding to 41 vulnerabilities. When compared to existing approaches such as verification IPs (VIPs) and protocol checkers from commercial tools, Xray identifies 19 known and 22 new violations. We show the security impact of Xray by sampling 5 Xray violations to construct 3 proof-of-concept exploits on realistic scenarios deployed on FPGA to leak intermediate data, drop transactions, and corrupt memory./proceedings-archive/2024/DATA/1324_pdf_upload.pdf |
TS03 Session 15 - E1
Date: Monday, 31 March 2025
Time: 14:00 CET - 15:30 CET
MPFS: A SCALABLE USER-SPACE PERSISTENT MEMORY FILE SYSTEM FOR MULTIPLE PROCESSES Speaker: Bo Ding, Huazhong University of Science & Technology, CN Authors: Bo Ding, Wei Tong, Yu Hua, Yuchong Hu, Zhangyu Chen, Xueliang Wei, Qiankun Liu, Dong Huang and Dan Feng, Huazhong University of Science & Technology, CN Abstract Persistent memory (PM) leveraging memory-mapped I/O (MMIO) delivers superior I/O performance, leading to the development of user-space PM file systems based on MMIO. While effective in single-process scenarios, these systems encounter challenges in multi-process environments, such as performance degradation due to repeated page faults and cross-process synchronizations, as well as a large memory footprint from duplicated paging structures. To address these problems, we propose a Multi-process PM File System (MPFS). MPFS builds a shareable page table and shares it among processes, avoiding building duplicate paging structures for distinct processes, thereby significantly reducing the software overhead and memory footprint caused by repeated page faults. MPFS further proposes a PGD-aligned (512GB) mapping method to accelerate page table sharing. Furthermore, MPFS provides a cross-process memory protection mechanism based on the PGD-aligned mapping, ensuring multi-process data reliability with negligible overheads. The experimental results show that MPFS outperforms existing user-space PM file systems by 1560% in multi-process scenarios./proceedings-archive/2024/DATA/66_pdf_upload.pdf |
EILID: EXECUTION INTEGRITY FOR LOW-END IOT DEVICES Speaker: Youngil Kim, University of California, Irvine, US Authors: Sashidhar Jakkamsetti1, Youngil Kim2, Andrew Searles2 and Gene Tsudik2 1Bosch Research, US; 2University of California, Irvine, US Abstract Prior research yielded many techniques to mitigate software compromise for low-end Internet of Things (IoT) devices. Some of them detect software modifications via remote attestation and similar services, while others preventatively ensure software (static) integrity. However, achieving run-time (dynamic) security, e.g., control-flow integrity (CFI), remains a challenge. Control-flow attestation (CFA) is one approach that minimizes the burden on devices. However, CFA is not a real-time countermeasure against run-time attacks since it requires communication with a verifying entity. This poses significant risks if safety- or time-critical tasks have memory vulnerabilities. To address this issue, we construct EILID – a hybrid architecture that ensures software execution integrity by actively monitoring control-flow violations on low-end devices. EILID is built atop CASU, a prevention-based (i.e., active) hybrid Root-of-Trust (RoT) that guarantees software immutability. EILID achieves fine-grained backward-edge and function-level forward-edge CFI via semi-automatic code instrumentation and a secure shadow stack./proceedings-archive/2024/DATA/246_pdf_upload.pdf |
DANCER: DYNAMIC COMPRESSION AND QUANTIZATION ARCHITECTURE FOR DEEP GRAPH CONVOLUTIONAL NETWORK Speaker: Yi Wang, Shenzhen University, CN Authors: Yunhao Dong, Zhaoyu Zhong, Yi Wang, Chenlin Ma and Tianyu Wang, Shenzhen University, CN Abstract Graph Convolutional Networks (GCNs) have been widely applied in fields such as social network analysis and recommendation systems. Recently, deep GCNs have emerged, enabling the exploration of deeper hidden information. Compared to traditional shallow GCNs, deep GCNs feature significantly more layers, leading to considerable computational and data movement challenges. Processing-In-Memory (PIM) offers a promising solution for efficiently handling GCNs by enabling near-data computation, thus reducing data transfer between processing units and memory. However, previous work mainly focused on shallow GCNs and has shown limited performance with deep GCNs. In this paper, we present Dancer, an innovative PIM-based GCN accelerator. Dancer optimizes data movement during the inference process, significantly improving efficiency and reducing energy consumption. Specifically, we introduce a novel compressed graph storage architecture and a dynamic quantization technique to minimize data transfers at each layer of the GCN. Additionally, through a detailed analysis of weight dynamics changes, we propose a sparsity propagation strategy to further alleviate the computational and data transfer burden between layers. Experimental results demonstrate that, compared to current state-of-the-art methods, Dancer achieves a 3.7× speedup, 7.6× higher energy efficiency, and a 9.6× reduction in DRAM accesses on average./proceedings-archive/2024/DATA/642_pdf_upload.pdf |
LOOPLYNX: A SCALABLE DATAFLOW ARCHITECTURE FOR EFFICIENT LLM INFERENCE Speaker: Jianing Zheng, National Sun Yat-Sen University, TW Authors: Jianing Zheng and Gang Chen, National Sun Yat-Sen University, TW Abstract In this paper, we propose LoopLynx, a scalable dataflow architecture for efficient LLM inference that optimizes FPGA usage through a hybrid spatial-temporal design. The design of LoopLynx incorporates a hybrid temporal-spatial architecture, where computationally intensive operators are implemented as large dataflow kernels. This achieves high throughput similar to spatial architecture, and organizing and reusing these kernels in a temporal way together enhances FPGA peak performance. Furthermore, to overcome the resource limitations of a single device, we provide a multi-FPGA distributed architecture that overlaps and hides all data transfers so that the distributed accelerators are fully utilized. By doing so, LoopLynx can be effectively scaled to multiple devices to further explore model parallelism for large-scale LLM inference. Evaluation of GPT-2 model demonstrates that LoopLynx can achieve comparable performance to state-of-the-art single FPGA-based accelerations. In addition, compared to Nvidia A100, our accelerator with a dual-FPGA configuration delivers a 2.52x speed-up in inference latency while consuming only 48.1% of the energy./proceedings-archive/2024/DATA/731_pdf_upload.pdf |
||
REMAPCOM: OPTIMIZING COMPACTION PERFORMANCE OF LSM TREES VIA DATA BLOCK REMAPPING IN SSDS Speaker: Yi Fan, Wuhan University of Technology, CN Authors: Yi Fan1, Yajuan Du1 and Sam H. Noh2 1Wuhan University of Technology, CN; 2UNIST, KR Abstract In LSM-based KV stores, typically deployed on systems with DRAM-SSD storage, compaction degrades write performance and SSD endurance due to significant write amplification. To address this issue, recent proposals have mostly focused on redesigning the structure of LSM trees. In this paper, we observe the prevalence of data blocks that are simply read and written back without being altered during the LSM tree compaction process, which we refer to as Unchanged Data Blocks (UDBs). These UDBs are a source of unnecessary write amplification, leading to performance degradation and a shortened SSD lifetime. To address this duplication issue, we propose a remapping-based compaction method, which we call RemapCom. RemapCom handles both the identification and the retention of UDBs: it designs a lightweight state machine to track the status of the KV items in each data block, as well as a UDB retention strategy to prevent data blocks from being split due to adjacent intersecting blocks. We implement a prototype of RemapCom on LevelDB by providing two primitives for the remapping. Compared to the state-of-the-art, evaluation results demonstrate that RemapCom can reduce the write amplification by up to 53%./proceedings-archive/2024/DATA/754_pdf_upload.pdf |
||
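To make the RemapCom entry above more concrete, here is a deliberately simplified sketch of the core idea (the paper's state machine, remapping primitives, and retention strategy are richer than this): track whether any KV item in a block is touched during compaction, then remap untouched blocks instead of rewriting them.

```python
# Illustrative sketch only: track whether any KV item in a data block is
# touched during compaction; untouched blocks (UDBs) are "remapped"
# (pointer update) instead of being rewritten to flash.
from enum import Enum, auto

class BlockState(Enum):
    UNCHANGED = auto()   # every KV item read back exactly as written
    MODIFIED = auto()    # at least one item updated, deleted, or merged

class DataBlock:
    def __init__(self, block_id, items):
        self.block_id = block_id
        self.items = dict(items)
        self.state = BlockState.UNCHANGED

    def apply_update(self, key, value):
        self.items[key] = value
        self.state = BlockState.MODIFIED

def compact(blocks, updates):
    """Return (remapped_ids, rewritten_ids) after applying updates."""
    for key, value in updates.items():
        for blk in blocks:
            if key in blk.items:
                blk.apply_update(key, value)
    remapped = [b.block_id for b in blocks if b.state is BlockState.UNCHANGED]
    rewritten = [b.block_id for b in blocks if b.state is BlockState.MODIFIED]
    return remapped, rewritten

blocks = [DataBlock(0, {"a": 1, "b": 2}), DataBlock(1, {"c": 3})]
print(compact(blocks, {"c": 30}))   # block 0 is a UDB -> remap; block 1 rewritten
```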
A PRACTICAL LEARNING-BASED FTL FOR MEMORY-CONSTRAINED MOBILE FLASH STORAGE Speaker: Zelin Du, The Chinese University of Hong Kong, CN Authors: Zelin Du1, Kecheng Huang1, Tianyu Wang2, Xin Yao3, Renhai Chen4 and Zili Shao1 1The Chinese University of Hong Kong, HK; 2Shenzhen University, CN; 3Huawei Inc, HK; 4Huawei Inc, CN Abstract The rapidly growing mobile market is pushing flash storage manufacturers to expand capacity into the terabyte range. However, this presents a significant challenge for mobile storage management: more logical-to-physical page mappings need to be efficiently managed and cached while the available caching space is extremely limited. This motivates us to shift toward a new learning-based paradigm: rather than maintaining mappings for individual pages, the learning-based approach can represent mapping relationships for a set of continuous pages. However, to construct linear models, existing methods, which either consume the already-limited memory space or piggyback on flash garbage collection, demonstrate poor model construction capability or significantly degrade flash performance, making them impractical for real-world use. In this paper, we propose LFTL, a practical, learning-based on-demand flash translation layer design for flash management in mobile devices. In contrast to prior work centered on gathering sufficient mappings for linear model construction, our key insight is that linear patterns can be extracted and refined by leveraging the orderly, LPA-aligned write stream typical of mobile devices. By doing this, highly accurate linear models can be constructed despite the mobile device's cache limitations. We have implemented a fully functional prototype of LFTL based on FEMU. Our evaluation results show that LFTL shows preferable adaptability to memory-constrained storage devices compared to state-of-the-art learning-based approaches./proceedings-archive/2024/DATA/920_pdf_upload.pdf |
||
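The LFTL entry above builds linear logical-to-physical models from an LPA-aligned write stream. The following toy sketch illustrates the learned-index idea in general terms (LFTL's actual model construction and refinement are more involved; the log values below are made up): consecutive LPAs written to consecutive PPAs collapse into one linear segment.

```python
# Toy sketch of the learned-index idea behind learning-based FTLs:
# sequential, LPA-aligned writes form runs where PPA = LPA + offset, so a
# whole run is represented by one linear segment instead of per-page entries.

def build_segments(write_log):
    """write_log: list of (lpa, ppa) pairs observed in the write stream."""
    segments = []                       # (start_lpa, end_lpa, offset)
    for lpa, ppa in sorted(write_log):
        off = ppa - lpa
        if segments and segments[-1][2] == off and lpa == segments[-1][1] + 1:
            s = segments[-1]
            segments[-1] = (s[0], lpa, off)   # extend the current segment
        else:
            segments.append((lpa, lpa, off))  # start a new segment
    return segments

def translate(segments, lpa):
    """Resolve an LPA with the piecewise-linear model; None on a miss."""
    for start, end, off in segments:
        if start <= lpa <= end:
            return lpa + off
    return None

log = [(100, 9000), (101, 9001), (102, 9002), (500, 7700), (501, 7701)]
segs = build_segments(log)
print(segs)                 # two segments cover five mappings
print(translate(segs, 101)) # -> 9001
```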
CONZONE: A ZONED FLASH STORAGE EMULATOR FOR CONSUMER DEVICES Speaker: Dingcui Yu, East China Normal University, CN Authors: Dingcui Yu, Jialin Liu, Yumiao Zhao, Wentong Li, Ziang Huang, Zonghuan Yan, Mengyang Ma and Liang Shi, East China Normal University, CN Abstract Considering the potential benefits to lifespan and performance, zoned flash storage is expected to be incorporated into the next generation of consumer devices. However, due to the limited volatile cache and heterogeneous flash cells of consumer-grade flash storage, adopting a zone abstraction requires additional internal hardware design to maximize its benefits. To understand and efficiently improve the hardware design of consumer-grade zoned flash storage, we present ConZone—the first emulator tailored to the characteristics of consumer-grade zoned flash storage. Users can explore the internal architecture and management strategies of consumer-grade zoned flash storage and integrate the optimization with software. We validate the accuracy of ConZone by realizing a hardware architecture for consumer-grade zoned flash storage and comparing it with the state-of-the-art. We also present a case study on read performance with ConZone to explore the design of mapping mechanisms and cache management strategies./proceedings-archive/2024/DATA/962_pdf_upload.pdf |
||
A HARDWARE-ASSISTED APPROACH FOR NON-INVASIVE AND FINE-GRAINED MEMORY POWER MANAGEMENT IN MCUS Speaker: Michael Kuhn, University of Tübingen, DE Authors: Michael Kuhn, Patrick Schmid and Oliver Bringmann, University of Tübingen, DE Abstract The energy demand of embedded systems is crucial and typically dominated by the memory subsystem. Off-the-shelf MCU platforms usually offer a wide range of memory configurations in terms of overall memory size, which may differ in the number of memory banks provided. Split memory banks have the potential to optimize energy demand, but this potential often remains unused in available hardware due to a lack of power management support or the significant manual effort required to leverage the benefits of split-banked memory architectures. This paper proposes an approach that solves the challenge of integrating fine-grained power management support automatically, via a combined hardware/software solution for future off-the-shelf platforms. We present a method to efficiently search for an optimized code and data mapping onto the modules of split memory banks to maximize the idle times of all memory modules. To non-invasively put memory modules into sleep mode, a PC-driven power management controller (PMC) autonomously triggers transitions between power modes during embedded software execution. The evaluation of our optimization flow demonstrates that memory mappings can be explored in seconds, including the generation of the necessary PMC configuration and linker scripts. The application of PC-driven power management enables active memory modules to remain in light sleep mode for approximately 13% to 86% of the execution time, depending on the workload and memory configuration. This results in overall power savings of up to 24% in the memory banks, in terms of static and dynamic power./proceedings-archive/2024/DATA/1146_pdf_upload.pdf |
||
TKD: AN EFFICIENT DEEP LEARNING COMPILER WITH CROSS-DEVICE KNOWLEDGE DISTILLATION Speaker: Chaoyao Shen, Southeast University, CN Authors: Yiming Ma, Chaoyao Shen, Linfeng Jiang, Tao Xu and Meng Zhang, Southeast University, CN Abstract Generating high-performance tensor programs on resource-constrained devices is challenging for current Deep Learning (DL) compilers that use learning-based cost models to predict the performance of tensor programs. Due to the inability of cost models to leverage cross-device information, it is extremely time-consuming to collect data and train a new cost model. To address this problem, this paper proposes TKD, a novel DL compiler that can be efficiently adapted to devices that are resource-constrained. TKD reduces the time budget by over 11x through an adaptive tensor program filter that eliminates redundant and unimportant measurements of tensor programs. Furthermore, by refining the cost model architecture with a multi-head attention module and distilling transferable knowledge from source devices, TKD outperforms state-of-the-art methods in prediction accuracy, compilation time, and compilation quality. We conducted experiments on the edge GPU, NVIDIA Jetson TX2, and the results show that compared to TenSet and TLP, TKD reduces compilation time by 1.58x and 1.16x, while achieving 1.40x and 1.27x speedups of the tensor programs, respectively./proceedings-archive/2024/DATA/1400_pdf_upload.pdf |
||
DISPEED: DISTRIBUTING PACKET FLOW ANALYSES IN A SWARM OF HETEROGENEOUS EMBEDDED PLATFORMS Speaker: Louis Morge-Rollet, ENSTA Institut Polytechnique de Paris, FR Authors: Louis Morge-Rollet1, Camelia Slimani2, Laurent Lemarchand3, Frédéric Leroy4, Jalil Boukhobza5 and David Espes3 1ENSTA Bretagne, FR; 2ENSTA Bretagne, FR; 3University Brest, FR; 4ENSTA Bretagne, FR; 5ENSTA Bretagne Lab-STICC, FR Abstract Security is a major challenge in swarms of drones. Network intrusion detection systems (IDS) are deployed to analyze and detect suspicious packet flows. Traditionally, they are implemented independently on each drone. However, due to the heterogeneity and resource limitations of drones, IDS algorithms can fall short in satisfying Quality of Service (QoS) metrics, such as latency and accuracy. We argue that a drone can benefit from the swarm by delegating part of the analysis of its packet flows to neighbor drones that have more processing power to enforce security. In this paper, we propose two solving methods to distribute the packet flows to be analyzed among drones in a way that ensures they are processed with minimum communication overhead, to limit the attack surface, while meeting the QoS metrics imposed by the drone mission. First, we formulate the distribution problem both as an Integer Linear Program (ILP) and as a Maximum-Flow Minimum-Cost (MFMC) problem. Furthermore, we propose two specific solving methods for the distribution problem: (1) a Greedy Heuristic (GH), a non-exact solving method with a small time overhead, and (2) an Adapted Edmonds-Karp (AEK) algorithm, an exact method with a higher time overhead. GH proved to be very fast (up to more than 2000x faster than ILP with Branch and Bound), while AEK finds the exact solution even when the problem is very difficult./proceedings-archive/2024/DATA/1484_pdf_upload.pdf |
||
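For the DISPEED entry above, a hedged sketch of what a greedy flow-distribution heuristic in the spirit of GH could look like is given below. The paper's exact cost model and QoS constraints are not reproduced here; the capacities, latencies, and offloading costs are invented toy values.

```python
# Hedged sketch of a greedy flow-distribution heuristic (GH-like idea only;
# all numbers and field names below are hypothetical).

def greedy_assign(flows, drones, latency_bound):
    """
    flows:  list of (flow_id, load)
    drones: dict drone_id -> {"capacity": float, "latency": float,
                              "comm_cost": float}  # cost of offloading to it
    Returns flow_id -> drone_id, preferring the cheapest feasible drone.
    """
    assignment = {}
    remaining = {d: info["capacity"] for d, info in drones.items()}
    for flow_id, load in sorted(flows, key=lambda f: -f[1]):  # big flows first
        feasible = [d for d, info in drones.items()
                    if remaining[d] >= load and info["latency"] <= latency_bound]
        if not feasible:
            raise RuntimeError(f"flow {flow_id} cannot meet QoS")
        best = min(feasible, key=lambda d: drones[d]["comm_cost"])
        assignment[flow_id] = best
        remaining[best] -= load
    return assignment

drones = {"local": {"capacity": 4, "latency": 5, "comm_cost": 0},
          "peer1": {"capacity": 10, "latency": 8, "comm_cost": 2},
          "peer2": {"capacity": 6, "latency": 12, "comm_cost": 1}}
print(greedy_assign([("f1", 3), ("f2", 5), ("f3", 2)], drones, latency_bound=10))
```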
ONE GRAY CODE FITS ALL: OPTIMIZING ACCESS TIME WITH BI-DIRECTIONAL PROGRAMMING FOR QLC SSDS Speaker: Tianyu Wang, Shenzhen University, CN Authors: Shaoqi Li1, Tianyu Wang1, Yongbiao Zhu1, Chenlin Ma1, Yi Wang1, Zhaoyan Shen2 and Zili Shao3 1Shenzhen University, CN; 2Shandong University, CN; 3The Chinese University of Hong Kong, HK Abstract Gray code, a voltage-level-to-data-bit translation scheme, is widely used in QLC SSDs. However, it causes the four data bits in QLC to exhibit significantly different read and write performance with up to 8x latency variation, severely impacting the worst-case performance of QLC SSDs. This paper presents BDP, a novel Bi-Directional Programming scheme. Based on a fixed Gray code, BDP combines both the normal (forward) and reverse programming directions to enable runtime programming direction arbitration. Experimental results show that BDP can effectively improve the read and write performance of SSD compared to representative schemes./proceedings-archive/2024/DATA/390_pdf_upload.pdf |
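The Gray-code latency variation that motivates the BDP entry above can be illustrated with a small computation. With a standard 4-bit binary-reflected Gray code over the 16 QLC voltage levels, the number of reference reads a page needs equals how often its bit flips between adjacent levels, which comes out to 8/4/2/1, i.e. up to 8x apart. Real QLC devices use vendor-specific Gray codes, so this is only illustrative; BDP's run-time choice between forward and reverse programming directions sits on top of such a fixed code.

```python
# Sketch: why a fixed Gray code makes QLC page read latencies differ.
# (Binary-reflected Gray code used for illustration only.)

def gray(n):
    return n ^ (n >> 1)

levels = [gray(v) for v in range(16)]          # data bits per voltage level

def transitions_per_bit(levels, bits=4):
    counts = [0] * bits
    for lo, hi in zip(levels, levels[1:]):
        diff = lo ^ hi
        for b in range(bits):
            counts[b] += (diff >> b) & 1
    return counts                              # reference reads per page

print(transitions_per_bit(levels))             # -> [8, 4, 2, 1]
```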
BPA03 BPA 3 - ML Session 4
Add this session to my calendar
Date: Monday, 31 March 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
TYRCA: A RISC-V TIGHTLY-COUPLED ACCELERATOR FOR CODE-BASED CRYPTOGRAPHY Speaker: Alessandra Dolmeta, Politecnico di Torino, IT Authors: Alessandra Dolmeta1, Stefano Di Matteo2, Emanuele Valea3, Mikael Carmona4, Antoine Loiseau4, Maurizio Martina5 and Guido Masera5 1Politecnico di Torino, IT; 2CEA-Leti, CEA-List, FR; 3CEA-List, FR; 4CEA-Leti, FR; 5DET - Politecnico di Torino, IT Abstract Post-quantum cryptography (PQC) has garnered significant attention across various communities, particularly with the National Institute of Standards and Technology (NIST) advancing to the fourth round of PQC standardization. One of the leading candidates is Hamming Quasi-Cyclic (HQC), which received a significant update on February 23, 2024. This update, which introduces a classical dense-dense multiplication approach, has no known dedicated hardware implementations yet. The innovative Core-V eXtension InterFace (CV-X-IF) is a communication interface for RISC-V processors that significantly facilitates the integration of new instructions into the Instruction Set Architecture (ISA) through tightly connected accelerators. In this paper, we present a TightlY-coupled accelerator for RISC-V for Code-based cryptogrAphy (TYRCA), proposing the first fully tightly-coupled hardware implementation of the HQC-PQC algorithm, leveraging the CV-X-IF. The proposed architecture is implemented on the Xilinx Kintex-7 FPGA. Experimental results demonstrate that TYRCA reduces the execution time by 94% to 96% for HQC-128, HQC-192, and HQC-256, showcasing its potential for efficient HQC code-based cryptography./proceedings-archive/2024/DATA/69_pdf_upload.pdf |
||
A SOFT ERROR TOLERANT DUAL STORAGE MODE FLIP-FLOP FOR EFPGA CONFIGURATION HARDENING IN 22NM FINFET PROCESS Speaker: Prashanth Mohan, Carnegie Mellon University, US Authors: Prashanth Mohan1, Siddharth Das1, Oguz Aatli1, Josh Joffrion2 and Ken Mai1 1Carnegie Mellon University, US; 2Sandia National Laboratories, US Abstract We propose a soft error tolerant flip-flop (FF) design to protect configuration storage cells in standard cell-based embedded FPGA fabrics used in SoC designs. Traditional rad-hard FFs such as DICE and Triple Modular Redundant (TMR) use additional redundant storage nodes for soft error tolerance and hence incur high area overheads. Since the eFPGA configuration storage is static, the master latch of the FF is transparent and unused, except when a configuration is loaded. The proposed dual-storage-mode (DSM) FF reuses the master and slave latches as redundant storage along with a C-element for error correction. The DSM FF was fabricated on a 22nm FinFET process along with standard D-FF, pulse DICE FF, and TMR FF designs to evaluate soft error tolerance. The radiation test results show that the DSM FF can reduce the error cross section by more than three orders of magnitude (3735X) compared to the standard D-FF and two orders of magnitude (455X) compared to the pulse DICE FF with a comparable area. Additionally the DSM FF is ~42% smaller than the TMR FF with similar error cross section./proceedings-archive/2024/DATA/1394_pdf_upload.pdf |
||
REBERT: LLM FOR GATE-LEVEL TO WORD-LEVEL REVERSE ENGINEERING Speaker: Lizi Zhang, University of Wisconsin Madison, US Authors: Lizi Zhang1, Azadeh Davoodi2 and Rasit Topaloglu3 1University of Wisconsin Madison, US; 2University of Wisconsin-Madison, US; 3Adeia, US Abstract In this paper, we introduce ReBERT, a specialized large language model (LLM) based on BERT, fine-tuned specifically for grouping bits into words within gate-level netlists. By treating the netlist as a form of language, we encode bits and their fan-in cones into sequences that capture structural dependencies. A novel contribution is augmenting BERT's embedding with a tree-based embedding strategy which mirrors the hierarchical nature of circuit designs in hardware. Leveraging the powerful representational learning capabilities of LLMs, we interpret hardware circuits at a higher level of abstraction. We evaluate ReBERT on various hardware designs, demonstrating that it significantly outperforms a state-of-the-art work based on partial structural matching in recovering word-level groupings. Our improvements range from 12.2% to 218.2% on average, depending on the degree to which the structural patterns are corrupted./proceedings-archive/2024/DATA/1165_pdf_upload.pdf |
TS04 Session 11 - D15
Add this session to my calendar
Date: Monday, 31 March 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
GLEAM: GRAPH-BASED LEARNING THROUGH EFFICIENT AGGREGATION IN MEMORY Speaker: Ivris Raymond, University of Michigan, US Authors: Andrew McCrabb, Ivris Raymond and Valeria Bertacco, University of Michigan, US Abstract Graph Neural Networks (GNNs) have emerged as a powerful tool for analyzing relationship-based data, such as those found in social networks, logistics, weather forecasting, and other domains. Inference and training with GNN models execute slowly, bottlenecked by limited data bandwidths between memory and GPU hosts, as a result of the many irregular memory accesses inherent to GNN-based computation. To overcome these limitations, we present GLEAM, a Processing-in-Memory (PIM) hardware accelerator designed specifically for GNN-based training and inference. GLEAM units are placed per-bank and leverage the much larger, internal bandwidth of HBMs to handle GNNs' irregular memory accesses, significantly boosting performance and reducing the energy consumption entailed by the dominant activity of GNN-based computation: neighbor aggregation. Our evaluation of GLEAM demonstrates up to a 10x speedup for GNN inference over GPU baselines, alongside a significant reduction in energy usage./proceedings-archive/2024/DATA/177_pdf_upload.pdf |
||
PFP: PARALLEL FLOATING-POINT VECTOR MULTIPLICATION ACCELERATION IN MAGIC RERAM Speaker: Wenqing Wang, National University of Defense Technology, CN Authors: Wenqing Wang, Ziming Chen, Quan Deng and Liang Fang, National University of Defense Technology, CN Abstract Emerging applications, e.g., machine learning, large language models (LLMs), and graphic processing, are rapidly developing and are both compute-intensive and memory-intensive. Computing in Memory (CIM) is a promising architecture that accelerates these applications by eliminating the data movement between memory and processing units. Memristor-aided logic (MAGIC) CIM achieves massive parallelism, flexible computing, and non-volatility. However, MAGIC ReRAM performs floating-point (FP) vector multiplication sequentially, which wastes parallel computing resources and is limited by the array size. To solve this issue, we propose a parallel floating-point vector multiplication accelerator in MAGIC ReRAM. We exploit three levels of parallelism during the calculation of FP vector multiplication, referred to as PFP. First, we leverage the parallelism of MAGIC ReRAM. Second, we bring forward the final exponent to make the exponent calculations parallel. Third, we decouple the calculation of exponent, mantissa, and sign, which allows parallel calculation across accumulation. The experimental results show that PFP achieves a performance speedup of 2.51× and 15% energy savings compared to AritPIM when performing FP32 vector multiplication with a vector length of 512./proceedings-archive/2024/DATA/179_pdf_upload.pdf |
||
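The PFP entry above decouples sign, exponent, and mantissa and "brings the final exponent forward" so partial products can be accumulated in parallel. The following floating-point emulation is only a conceptual sketch of that decoupling (it is not PFP's in-memory scheme, and the bit width chosen is arbitrary): all partial products are aligned to one shared exponent so their mantissas can be summed as plain integers and normalized once at the end.

```python
# Illustrative dot product with a shared ("brought forward") exponent.
import math

def aligned_dot(xs, ys, frac_bits=40):
    prods = [x * y for x, y in zip(xs, ys)]           # per-element products
    exps = [math.frexp(p)[1] for p in prods if p != 0.0]
    if not exps:
        return 0.0
    shared_exp = max(exps)                            # final exponent, chosen up front
    acc = 0                                           # integer accumulation of mantissas
    for p in prods:
        if p != 0.0:
            m, e = math.frexp(p)                      # p = m * 2**e, 0.5 <= |m| < 1
            acc += round(m * 2 ** (frac_bits - (shared_exp - e)))
    return acc / 2 ** frac_bits * 2 ** shared_exp

xs, ys = [0.5, -1.25, 3.0], [2.0, 4.0, 0.125]
print(aligned_dot(xs, ys), sum(x * y for x, y in zip(xs, ys)))   # both -3.625
```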
AN EDRAM DIGITAL IN-MEMORY NEURAL NETWORK ACCELERATOR FOR HIGH-THROUGHPUT AND EXTENDED DATA RETENTION TIME Speaker: Jehun Lee, Seoul National University, KR Authors: Inhwan Lee1, Jehun Lee2, Jaeyong Jang2 and Jae-Joon Kim2 1Pohang University of Science and Technology, KR; 2Seoul National University, KR Abstract Computing-in-Memory (CIM) optimizes multiply-and-accumulate (MAC) operations for energy-efficient acceleration of neural network models. While SRAM has been a popular choice for CIM designs due to its compatibility with logic processes, its large cell size restricts storage capacity for neural network parameters. Consequently, gain-cell eDRAM, featuring memory cells with only 2-4 transistors, has emerged as an alternative for CIM cells. While digital CIM (DCIM) structure has been actively adopted in SRAM-based CIMs for better accuracy and scalability than analog CIMs (ACIM), previous eDRAM-based CIMs still employed ACIM structure since the eDRAM CIM cells were not able to perform a complete digital logic operation. In this paper, we propose an eDRAM bit cell for more efficient DCIM operations using only 4 transistors. The proposed eDRAM DCIM structure also maintains consistent and accurate output values over time, improving retention times compared to previous eDRAM ACIM designs. We validate our approach by fabricating an eDRAM DCIM macro chip and conducting hardware validation experiments, measuring retention time and neural network accuracy. Experimental results show that the proposed eDRAM DCIM achieves 3× longer retention time than state-of-the-art eDRAM ACIM designs, along with higher throughput without accuracy loss./proceedings-archive/2024/DATA/305_pdf_upload.pdf |
||
A TWO-LEVEL SLC CACHE HIERARCHY FOR HYBRID SSDS Speaker: Jun Li, Nanjing University of Posts and Telecommunications, CN Authors: Li Cai1, Zhibing Sha1, Jun Li2, Jiaojiao Wu1, Huanhuan Tian1, Zhigang Cai1 and Jianwei Liao1 1Southwest University, CN; 2Nanjing University of Posts and Telecommunications, CN Abstract Although high-density NAND flash memory, such as triple-level-cell (TLC) flash memory, offers high density, its lower write performance and endurance compared to single-level-cell (SLC) flash memory are impediments to the proliferation of TLC products. To overcome such disadvantages of TLC flash memory, hybrid architectures, which integrate a portion of SLC chips and employ them as a write cache, are widely adopted in commercial solid-state disks (SSDs). However, it is challenging to optimize the SLC cache, in terms of the granularity of cached data and cold/hot data separation. In this paper, we propose a two-level hierarchy (i.e., L1 and L2) of SLC cache stores based on varying granularities of cached data. Moreover, we support dynamic segmentation of the L1 and L2 caches in the SLC region, by considering the write size characteristics of user applications. The evaluation results show that our proposal can improve I/O performance by between 12.6% and 25.1%, in contrast to existing cache management schemes for SLC-TLC hybrid storage./proceedings-archive/2024/DATA/382_pdf_upload.pdf |
||
MULTI-MODE BORDERGUARD CONTROLLERS FOR EFFICIENT ON-CHIP COMMUNICATION IN HETEROGENEOUS DIGITAL/ANALOG NEURAL PROCESSING UNITS Speaker: Hong Pang, ETH Zurich, CH Authors: Hong Pang1, Carmine Cappetta2, Riccardo Massa2, Athanasios Vasilopoulos3, Elena Ferro3, Gamze Islamoglu1, Angelo Garofalo4, Francesco Conti5, Luca Benini6, Irem Boybat3 and Thomas Boesch7 1ETH Zurich, CH; 2STMicroelectronics, IT; 3IBM Research Europe - Zurich, CH; 4University of Bologna, ETH Zurich, IT; 5Università di Bologna, IT; 6ETH Zurich, CH | Università di Bologna, IT; 7STMicroelectronics, CH Abstract Driven by the growing demand for data-intensive parallel computation, particularly for Matrix-Vector Multiplications (MVMs), and the pursuit of high energy efficiency, Analog In-Memory Computing (AIMC) has garnered significant attention. AIMC addresses the data movement bottleneck by performing MVMs directly within memory, significantly reducing latency and enhancing energy efficiency. Integrating AIMC with digital units for non-MVM operations yields heterogeneous Neural Processing Units (NPUs) that can be combined in a tiled architecture to deliver promising solutions for end-to-end AI inference. Besides powerful heterogeneous NPUs, an efficient on-chip communication infrastructure is also pivotal for inter-node data transmission and efficient AI model execution. This paper introduces the Borderguard Controller (BG-CTRL), a multi-mode, path-through routing controller designed to support three distinct operating modes—time-scheduling, data-driven, and time-sliced data-driven (TSDD)—each offering varying levels of routing flexibility and energy efficiency depending on the data flow patterns and AI model complexity. To demonstrate the design, BG-CTRLs are integrated into a 9-node system of heterogeneous NPUs, arranged in a 3x3 grid and connected using a 2D mesh topology. The system is synthesized using STM 28nm FD-SOI technology. Experimental results show that the BG-CTRL cluster achieves an aggregate throughput of 983 Gb/s, with an energy efficiency of up to 0.41 pJ/B/hop at 0.64 GHz, and a minimal area overhead of 204 kGE./proceedings-archive/2024/DATA/544_pdf_upload.pdf |
||
MAPPING SPIKING NEURAL NETWORKS TO HETEROGENEOUS CROSSBAR ARCHITECTURES USING INTEGER LINEAR PROGRAMMING Speaker: Devin Pohl, Georgia Tech, US Authors: Devin Pohl1, Aaron Young2, Kazi Asifuzzaman2, Narasinga Miniskar2 and Jeffrey Vetter2 1Georgia Tech, US; 2Oak Ridge National Lab, US Abstract Advances in novel hardware devices and architectures allow Spiking Neural Network (SNN) evaluation using ultra-low power, mixed-signal, memristor crossbar arrays. As individual network sizes quickly scale beyond the dimensional capabilities of single crossbars, networks must be mapped onto multiple crossbars. Crossbar sizes within modern Memristor Crossbar Architectures (MCAs) are determined predominately not by device technology but by network topology; more, smaller crossbars consume less area thanks to the high structural sparsity found in larger, brain-inspired SNNs. Motivated by continuing increases in SNN sparsity due to improvements in training methods, we propose utilizing heterogeneous crossbar sizes to further reduce area consumption. This approach was previously unachievable as prior compiler studies only explored solutions targeting homogeneous MCAs. Our work improves on the state-of-the-art by providing Integer Linear Programming (ILP) formulations supporting arbitrarily heterogeneous architectures. By modeling axonal interactions between neurons, our methods produce better mappings while removing inhibitive a priori knowledge requirements. We first show a 16.7–27.6% reduction in area consumption for square-crossbar homogeneous architectures. Then, we demonstrate 66.9–72.7% further reduction when using a reasonable configuration of heterogeneous crossbar dimensions. Next, we present a new optimization formulation capable of minimizing the number of inter-crossbar routes. When applied to solutions already near-optimal in area, an 11.9–26.4% routing reduction is observed without impacting area consumption. Finally, we present a profile-guided optimization capable of minimizing the number of runtime spikes between crossbars. Compared to the best-area-then-route optimized solutions, we observe a further 0.5–14.8% inter-crossbar spike reduction while requiring 1–3 orders of magnitude less solver time./proceedings-archive/2024/DATA/563_pdf_upload.pdf |
||
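The mapping entry above is formulated as an ILP over heterogeneous crossbars. The sketch below is a drastically simplified version of such a formulation, written with the PuLP package (assumed to be installed): assign each neuron to one crossbar, respect per-crossbar capacity, and minimize the total area of the crossbars actually used. The paper's formulations additionally model axonal interactions and inter-crossbar routes, which are omitted here, and all sizes and areas are toy values.

```python
# Simplified heterogeneous-crossbar mapping ILP (illustrative only).
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

neurons = list(range(6))
crossbars = {"xb_small": {"capacity": 2, "area": 1.0},
             "xb_mid":   {"capacity": 4, "area": 1.8},
             "xb_large": {"capacity": 8, "area": 3.0}}

prob = LpProblem("snn_mapping", LpMinimize)
assign = {(n, c): LpVariable(f"a_{n}_{c}", cat=LpBinary)
          for n in neurons for c in crossbars}
used = {c: LpVariable(f"u_{c}", cat=LpBinary) for c in crossbars}

prob += lpSum(crossbars[c]["area"] * used[c] for c in crossbars)   # objective: total area
for n in neurons:                                                  # each neuron mapped once
    prob += lpSum(assign[n, c] for c in crossbars) == 1
for c, spec in crossbars.items():                                  # capacity, only if used
    prob += lpSum(assign[n, c] for n in neurons) <= spec["capacity"] * used[c]

prob.solve(PULP_CBC_CMD(msg=False))
for c in crossbars:
    members = [n for n in neurons if assign[n, c].value() == 1]
    print(c, "used" if used[c].value() == 1 else "idle", members)
```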
AN EFFICIENT ON-CHIP REFERENCE SEARCH AND OPTIMIZATION ALGORITHMS FOR VARIATION-TOLERANT STT-MRAM READ Speaker: Kiho Chung, Sungkyunkwan University, KR Authors: Kiho Chung, Youjin Choi, Donguk Seo and Yoonmyung Lee, Sungkyunkwan University, KR Abstract A novel reference search algorithm is proposed in this paper to significantly reduce the reference search time of embedded spin transfer torque magnetic random access memory (STT-MRAM). Unlike conventional methods that sequentially search reference levels with linearly increasing references, the proposed Dual Read Reference Search (DRRS) algorithm requires only two array read operations. By analyzing the statistical characteristics of the read data using a customized function, the optimal reference level can be quickly determined in a few steps. Consequently, the number of read operations required for a reference search is reduced, providing a substantial improvement in the reference search time. The DRRS algorithm can be operated on-chip, and its effectiveness was confirmed through simulations. The optimization speed was improved by 85% compared to the conventional methods. Additionally, a Triple Read Reference Search (TRRS) algorithm is proposed to decrease the variation occurring across different cell arrays and to enhance optimization accuracy. STT-MRAM is composed of numerous cell arrays, where the cell distributions in each array exhibit different characteristics. The TRRS algorithm enhances optimization accuracy for variations occurring in each array, achieving over a 2x increase in accuracy compared to the DRRS algorithm. Furthermore, a Simultaneous Reference Search for P and AP (SRS) algorithm that significantly reduces the search time by simultaneously optimizing Parallel (P) and Anti-parallel state (AP) reference cells is also proposed. Lastly, regarding cell degradation after power-up, we enable prompt re-optimization through revolutionary time-saving algorithms (DRRS, TRRS and SRS). This allows for rapid re-optimization in the event of errors caused by cell degradation and ensures regular optimization to maintain maximum read margin even before errors occur, thereby enhancing reliability./proceedings-archive/2024/DATA/716_pdf_upload.pdf |
||
FDAIMC: A FULLY-DIFFERENTIAL ANALOG IN-MEMORY-COMPUTING FOR MAC IN MRAM WITH ACCURACY CALIBRATION UNDER PROCESS AND VOLTAGE VARIATION Speaker: Xiangyu Li, School of Microelectronics Science and Technology, Sun Yat-sen University, CN Authors: Xiangyu Li1, Weichong Chen1, Ruida Hong1, Jinghai Wang2, Ningyuan Yin1 and Zhiyi Yu1 1School of Microelectronics Science and Technology, Sun Yat-sen University, CN; 2National Sun Yat-Sen University, TW Abstract Analog in-memory-computing (AIMC) is adopted extensively in non-volatile memory for multibit multiply-and-accumulate (MAC) operations. However, the low on/off ratio of the magnetic tunnel junction (MTJ) impedes a high-performance AIMC macro based on spin transfer torque magnetic random access memory (STT-MRAM). Secondly, because a mixed-signal system behaves uncertainly under process and voltage variation, calibration support is indispensable. Moreover, the incompatibility between a nonlinear analog signal and a linear digital signal hinders accurate computation and calibration support. To overcome these challenges, this work proposes an STT-MRAM AIMC macro featuring: 1) a 2-level-differential cell array and a linear computing scheme with calibration support in the analog domain; 2) an analog-digital-conversion (ADC) system, including a slew-rate-independent voltage-to-time converter (SRIVTC) scheme and a self-triggered time-to-MAC value converter (STTMC) scheme; 3) a compact layout design for high area efficiency. Finally, an average accuracy of 95.44% is obtained under the TT&0.9V corner. With the calibration strategy, average accuracies of 97.8% and 88.6% are obtained under FF&0.945V and SS&0.855V, respectively, with over 30% enhancement. Furthermore, an area FoM 1.64x to 21.18x better than the state of the art is obtained. An energy efficiency of 87.2~312.4 TOPS/W is obtained./proceedings-archive/2024/DATA/784_pdf_upload.pdf |
||
ARBITER: ALLEVIATING CONCURRENT WRITE AMPLIFICATION IN PERSISTENT MEMORY Speaker: Bolun Zhu, Huazhong University of Science & Technology, CN Authors: Bolun Zhu and Yu Hua, Huazhong University of Science & Technology, CN Abstract Persistent memory (PM) is able to bridge the gap between high performance and persistence, and has thus received much research attention. Concurrency in PM is often constrained due to limited concurrent I/O bandwidth. The I/O requests from different threads are serialized and interleaved in the memory controller. Such concurrent interleaving unintentionally hurts the locality of PM's on-DIMM buffer (XPBuffer) and thus causes significant performance degradation. Existing systems either endure the performance degradation caused by concurrent interleaving or leverage dedicated background threads to asynchronously perform I/O to PM. Unlike conventional designs, we present a non-blocking synchronous I/O scheduling mechanism that can achieve high performance and low I/O amplification. The key insight is that inserting a proper number of delays into I/O can mitigate the I/O amplification and improve the effective bandwidth. We periodically assess the system states and adaptively determine the number of delays to be inserted for each thread. Evaluation results show that our design can significantly alleviate the I/O amplification and improve application performance for concurrent applications./proceedings-archive/2024/DATA/998_pdf_upload.pdf |
||
TRACKSCORER: SKYRMION LOGIC-IN-MEMORY ACCELERATOR FOR TREE-BASED RANKING MODELS Speaker: Elijah Cishugi, University of Twente, NL Authors: Elijah Cishugi1, Sebastian Buschjäger2, Martijn Noorlander1, Marco Ottavi3 and Kuan-Hsun Chen1 1University of Twente, NL; 2The Lamarr Institute for Machine Learning and Artificial Intelligence and TU Dortmund University, DE; 3University of Rome Tor Vergata | University of Twente, IT Abstract Racetrack memories (RTMs) have been shown to have lower leakage power and higher density compared to traditional DRAM/SRAM technologies. However, their efficiency is often hindered by the need to shift the targeted data to access ports for read and write operations. Suitable mapping approaches are therefore essential to unleash their potential. In this work, we explore the mapping of the popular tree-based document ranking algorithm, Quickscorer, onto Skyrmion-based racetrack memories (SK-RTMs). Our approach leverages a Logic-in-Memory (LiM) accelerator, specifically designed to execute simple logic operations directly within SK-RTMs, enabling an efficient mapping of Quickscorer by exploiting its bitvector representation and interleaved traversal scheme of tree structures through bitwise logical operations. We present several mapping strategies, including one based on a quadratic assignment problem (QAP) optimization algorithm for optimal data placement of Quickscorer onto the racetracks. Our results demonstrate a significant reduction in read and write operations and, in certain cases, a decrease in the time spent shifting data during Quickscorer inference./proceedings-archive/2024/DATA/1016_pdf_upload.pdf |
||
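The TrackScorer entry above maps Quickscorer's bitvector traversal onto in-memory logic. To make that traversal concrete, here is a minimal single-tree Quickscorer-style scorer: every "false node" contributes a precomputed bitvector, the bitvectors are ANDed, and the leftmost set bit identifies the exit leaf. The tree, thresholds, and leaf values below are toy data; in practice the bitvectors are precomputed offline for a whole ensemble.

```python
# Minimal Quickscorer-style scoring for one tree (illustrative toy example).

# Leaves are numbered left to right: L0 is the most significant bit.
NUM_LEAVES = 4
ALL_ONES = (1 << NUM_LEAVES) - 1

# Each internal node: (feature index, threshold, bitvector of leaves that
# stay reachable when the node's test is FALSE, i.e. its left subtree dies).
nodes = [
    (0, 5.0, 0b0011),   # root: f0 <= 5 ?
    (1, 2.0, 0b0111),   # left child: f1 <= 2 ?
    (0, 9.0, 0b1101),   # right child: f0 <= 9 ?
]
leaf_values = [0.1, 0.4, -0.2, 0.7]   # scores of L0..L3

def score(doc):
    mask = ALL_ONES
    for feat, thresh, bitvector in nodes:   # order-independent, branch-light
        if doc[feat] > thresh:              # "false node": AND its bitvector
            mask &= bitvector
    leftmost = NUM_LEAVES - mask.bit_length()   # index of leftmost set bit
    return leaf_values[leftmost]

print(score({0: 7.0, 1: 1.0}))   # root false, right child true -> leaf L2 -> -0.2
```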
EF-IMR: EMBEDDED FLASH WITH INTERLACED MAGNETIC RECORDING TECHNOLOGY Speaker: Chenlin Ma, Shenzhen University, CN Authors: Chenlin Ma, Xiaochuan Zheng, Kaoyi Sun, Tianyu Wang and Yi Wang, Shenzhen University, CN Abstract Interlaced Magnetic Recording (IMR), a technology that improves storage density through track overlap, introduces significant latency due to Read-Modify-Write (RMW) operations. Writing to overlapped tracks affects underlying tracks, requiring additional I/O operations to read, back up, and rewrite them, resulting in significant head movement latency. We propose EF-IMR, a new architecture that ensures crash consistency in IMR while minimizing RMW latency and head movement. EF-IMR reduces head movement during RMW operations and decreases redundant RMW operations. Evaluations under real-world, intensive I/O workloads show that EF-IMR reduces RMW latency by 20.11% and head movement latency by 89.37% compared to existing methods./proceedings-archive/2024/DATA/1201_pdf_upload.pdf |
TS05 Session 4 - D1
Add this session to my calendar
Date: Monday, 31 March 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
IMPROVING LLM-BASED VERILOG CODE GENERATION WITH DATA AUGMENTATION AND RL Speaker: Kyungjun Min, Pohang University of Science and Technology, KR Authors: Kyungjun Min, Seonghyeon Park, Hyeonwoo Park, Jinoh Cho and Seokhyeong Kang, Pohang University of Science and Technology, KR Abstract Large language models (LLMs) have recently attracted significant attention for their potential in Verilog code generation. However, existing LLM-based methods face several challenges, including data scarcity and the high computational cost of generating prompts for fine-tuning. Motivated by these challenges, we explore methods to augment training datasets, develop more efficient and effective prompts for fine-tuning, and implement training methods incorporating electronic design automation (EDA) tools. Our proposed framework for fine-tuning LLMs for Verilog code generation includes (1) abstract syntax tree (AST)-based data augmentation, (2) output-relevant code masking, a prompt generation method based on the logical structure of Verilog code, and (3) reinforcement learning with tool feedback (RLTF), a fine-tuning method using EDA tool results. Experimental studies confirm that our framework significantly improves syntax and functional correctness, outperforming commercial and non-commercial models on open-source benchmarks./proceedings-archive/2024/DATA/514_pdf_upload.pdf |
||
SPARDR: ACCELERATING UNSTRUCTURED SPARSE DNN INFERENCE VIA DATAFLOW OPTIMIZATION Speaker: Wei Wang, Beihang University, CN Authors: Wei Wang, Hongxu Jiang, Runhua Zhang, Yongxiang Cao and Yaochen Han, Beihang University, CN Abstract Unstructured sparsity is becoming a key dimension in exploring the inference efficiency of neural networks. However, its irregular data layout makes it difficult to match the parallel computing mode of hardware, resulting in low computational and memory access efficiency. We have studied this issue and found that the main reason is that existing sparse acceleration libraries and compilers explore sparse matrix multiplication optimizations through the splitting and reconstruction of sparse patterns, ignoring the acceleration of sparse convolution operations centered on data streams, and thus may miss some optimization opportunities for sparse operations. In this article, we propose SparDR, a general sparse convolution acceleration method centered on data streams. Through novel feature map data stream reconstruction and convolutional kernel data representation, redundant zero-value calculations are effectively avoided, addressing efficiency is improved, and memory overhead is reduced. SparDR is based on TVM and allows for automatic scheduling across different hardware configurations. Compared with the five current mainstream methods on four types of hardware, inference is accelerated by 1.1-12x and memory usage decreases by 20%./proceedings-archive/2024/DATA/677_pdf_upload.pdf |
||
AN IMITATION AUGMENTED REINFORCEMENT LEARNING FRAMEWORK FOR CGRA DESIGN SPACE EXPLORATION Speaker: Liangji Wu, Southeast University, Nanjing, Jiangsu Province, CN Authors: Liangji Wu, Shuaibo Huang, Ziqi Wang, Shiyang Wu, Yang Chen, Hao Yan and Longxing Shi, Southeast University, CN Abstract Coarse-Grained Reconfigurable Arrays (CGRAs) are a promising architecture that warrants thorough design space exploration (DSE). However, traditional DSE methods for CGRAs often get trapped in local optima due to singularities, i.e., invalid design points caused by CGRA mapping failures. In this paper, we propose a singularity-aware framework based on the integration of reinforcement learning (RL) and imitation learning (IL) for DSE of CGRAs. Our approach learns from both valid and invalid points, substantially reducing the probability of sampling singularities and accelerating the escape from inefficient regions, ultimately achieving high-quality Pareto points. Experimental results demonstrate that our framework improves the hypervolume (HV) of the Pareto front by 23.56% compared to state-of-the-art methods, with a comparable time overhead./proceedings-archive/2024/DATA/708_pdf_upload.pdf |
||
OPERATION DEPENDENCY GRAPH-BASED SCHEDULING FOR HIGH-LEVEL SYNTHESIS Speaker: Aoxiang Qin, National Sun Yat-Sen University, TW Authors: Aoxiang Qin1, Minghua Shen1 and Nong Xiao2 1National Sun Yat-Sen University, TW; 2The School of Computer, Sun Yat-sen University, Panyue, CN Abstract Scheduling determines the execution order and time of operations in a program. The order is related to operation dependencies, including data and resource dependencies. Data dependencies are intrinsic in programs, while resource dependencies are determined by scheduling methods. Existing scheduling methods lack an accurate and complete operation dependency graph (ODG), leading to poor performance. In this paper, we propose an ODG-based scheduling method for HLS with GNN and RL. We adopt GNN to perceive accurate relations between operations. We use the relations to guide an RL agent in building a complete ODG. We perform feedback-guided iterative scheduling with the graph to converge to a high-quality solution. Experiments show that our method reduces 23.8% and 16.4% latency on average, compared with the latest GNN-based and RL-based methods, respectively./proceedings-archive/2024/DATA/768_pdf_upload.pdf |
||
LOCALITY-AWARE DATA PLACEMENT FOR NUMA ARCHITECTURES: DATA DECOUPLING AND ASYNCHRONOUS REPLICATION Speaker: Shuhan Bai, Huazhong University of Science & Technology, CN Authors: Shuhan Bai, Haowen Luo, Burong Dong, Jian Zhou and Fei Wu, Huazhong University of Science & Technology, CN Abstract Non-Uniform Memory Access (NUMA) architectures bring new opportunities and challenges to bridge the gap between computing power and memory performance. Their complex memory hierarchies feature non-uniform access performance, known as NUMA locality, indicating that data placement and access without NUMA-awareness significantly impact performance. Existing NUMA-aware solutions often prioritize fast local access but at the cost of heavy replication overhead, suffering a read-write performance tradeoff and limited scalability. To overcome these limitations, this paper presents Ladapa, a scalable and high-performance locality-aware data placement strategy. The key insight is decoupling data into metadata and data layers, allowing independent management with adaptive asynchronous replication for lower overhead. Additionally, Ladapa employs multi-level metadata management leveraging fast caches for efficient data location, further boosting performance. Experimental results show that Ladapa outperforms typical replication techniques by up to 27.37× in write performance and 1.63× in read performance./proceedings-archive/2024/DATA/782_pdf_upload.pdf |
||
HAVEN: HALLUCINATION-MITIGATED LLM FOR VERILOG CODE GENERATION ALIGNED WITH HDL ENGINEERS Speaker: Yiyao Yang, Shanghai Jiao Tong University, CN Authors: Yiyao Yang1, Fu Teng2, Pengju Liu1, Mengnan Qi1, Chenyang Lv1, Ji Li3, Xuhong Zhang2 and Zhezhi He1 1Shanghai Jiao Tong University, CN; 2Zhejiang University, CN; 3Independent Researcher, CN Abstract Recently, the use of large language models (LLMs) for Verilog code generation has attracted great research interest to enable hardware design automation. However, previous works have shown a gap between the ability of LLMs and the practical demands of hardware description language (HDL) engineering. This gap includes differences in how engineers phrase questions and hallucinations in the code generated. To address these challenges, we introduce HaVen, a novel LLM framework designed to mitigate hallucinations and align Verilog code generation with the practices of HDL engineers. HaVen tackles hallucination issues by proposing a comprehensive taxonomy and employing a chain-of-thought (CoT) mechanism to translate symbolic modalities (e.g. truth tables, state diagrams, etc.) into accurate natural language descriptions. Furthermore, HaVen bridges this gap by using a data augmentation strategy. It synthesizes high-quality instruction-code pairs that match real HDL engineering practices. Our experiments demonstrate that HaVen significantly improves the correctness of Verilog code generation, outperforming state-of-the-art LLM-based Verilog generation methods on the VerilogEval and RTLLM benchmarks. HaVen is publicly available at https://github.com/Intelligent-Computing-Research-Group/HaVen./proceedings-archive/2024/DATA/812_pdf_upload.pdf |
||
ENABLING MEMORY-EFFICIENT ON-DEVICE LEARNING VIA DATASET CONDENSATION Speaker: Gelei Xu, University of Notre Dame, US Authors: Gelei Xu1, Ningzhi Tang1, Jun Xia1, Ruiyang Qin1, Wei Jin2 and Yiyu Shi1 1University of Notre Dame, US; 2Emory University, US Abstract Upon deployment to edge devices, it is often desirable for a model to further learn from streaming data to improve accuracy. However, learning from such data is challenging because it is typically unlabeled, non-independent and identically distributed (non-i.i.d), and only seen once, which can lead to potential catastrophic forgetting. A common strategy to mitigate this issue is to maintain a small data buffer on the edge device to select and retain the most representative data for rehearsal. However, the selection process leads to significant information loss since most data is either never stored or quickly discarded. This paper proposes a framework that addresses this issue by condensing incoming data into informative synthetic samples. Specifically, to effectively handle unlabeled incoming data, we propose a pseudo-labeling technique designed for on-device learning environments. We also develop a dataset condensation technique tailored for on-device learning scenarios, which is significantly faster compared to previous methods. To counteract the effects of noisy labels during the condensation process, we further utilize a feature discrimination objective to improve the purity of class data. Experimental results indicate substantial improvements over existing methods, especially under strict buffer limitations. For instance, with a buffer capacity of just one sample per class, our method achieves a 56.7% relative increase in accuracy compared to the best existing baseline on the CORe50 dataset./proceedings-archive/2024/DATA/853_pdf_upload.pdf |
||
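The on-device learning entry above pseudo-labels streaming data and condenses it into synthetic samples instead of selecting raw samples for a buffer. The sketch below is a heavily simplified stand-in for that pipeline (the paper's condensation and feature-discrimination objectives are richer; here each class keeps one synthetic slot updated as a confidence-filtered exponential moving average, and all thresholds are invented):

```python
# Simplified "condense instead of select" buffer policy (illustrative only).
import numpy as np

class CondensedBuffer:
    def __init__(self, num_classes, feat_dim, momentum=0.9, conf_threshold=0.6):
        self.synthetic = np.zeros((num_classes, feat_dim), dtype=np.float32)
        self.seen = np.zeros(num_classes, dtype=bool)
        self.momentum = momentum
        self.conf_threshold = conf_threshold

    def update(self, features, probs):
        """features: (N, D) stream batch; probs: (N, C) model outputs."""
        labels = probs.argmax(axis=1)
        confidence = probs.max(axis=1)
        for x, y, conf in zip(features, labels, confidence):
            if conf < self.conf_threshold:        # drop low-purity pseudo-labels
                continue
            if not self.seen[y]:
                self.synthetic[y] = x
                self.seen[y] = True
            else:                                  # condense into the class slot
                self.synthetic[y] = (self.momentum * self.synthetic[y]
                                     + (1 - self.momentum) * x)

    def rehearsal_set(self):
        return self.synthetic[self.seen], np.flatnonzero(self.seen)

rng = np.random.default_rng(1)
buf = CondensedBuffer(num_classes=3, feat_dim=8)
buf.update(rng.standard_normal((5, 8)).astype(np.float32),
           rng.dirichlet(np.ones(3), size=5).astype(np.float32))
print(buf.rehearsal_set()[1])
```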
TAICHI: EFFICIENT EXECUTION FOR MULTI-DNNS USING GRAPH-BASED SCHEDULING Speaker: Xilang Zhou, Fudan University, CN Authors: Xilang Zhou, Haodong Lu, Tianchen Wang, Zhuoheng Wan, Jianli Chen, Jun Yu and Kun Wang, Fudan University, CN Abstract Deep Neural Networks (DNNs) are increasingly used for complex tasks (e.g., AR/VR) by constructing different types of DNNs into a workflow. However, efficient frameworks are lacking for accelerating these applications, which have complex connectivity and require real-time processing. We introduce ReFA, an FPGA-based co-design framework for acceleration of real-time multi-DNN workloads. Specifically, on the hardware level, we develop an FPGA-based multi-core accelerator, which adopts a unified template for various DNN models and supports depth-first execution to reduce data movements. On the software level, we design a lightweight scheduler based on a genetic algorithm, which can find high-quality scheduling strategies rapidly from a huge solution space. Our evaluations show that ReFA deployed on Xilinx Alveo U200 achieves up to 10.1-37.3× and 1.4-1.5× reduction in job completion time (JCT), compared with CPU and GPU, respectively. Furthermore, ReFA gains 6.1-9.3×, 7.9×, 5.6-7.1×, and 2.4× reduction in energy-delay product, compared with GPU, Planaria, Herald and H3M, respectively./proceedings-archive/2024/DATA/1030_pdf_upload.pdf |
||
VTOT: AUTOMATIC VERILOG GENERATION VIA LLMS WITH TREE OF THOUGHTS PROMPTING Speaker: Xiangyu Wang, National University of Defense Technology, CN Authors: Yingjie Zhou1, Renzhi Chen2, Xinyu Li1, Jingkai Wang1, Zhigang Fang1, Bowei Wang1, Wenqiang Bai1, Qilin Cao1 and Lei Wang3 1National University of Defense Technology, CN; 2Qiyuan Laboratory, CN; 3Academy of Military Sciences, CN Abstract The automatic generation of Verilog code using Large Language Models (LLMs) presents a compelling solution for enhancing the efficiency of hardware design flow. However, the state-of-the-art performance of LLMs in Verilog generation remains limited when compared to programming languages such as Python. Previous research on Chain of Thought (CoT) prompting has demonstrated that incorporating intermediate reasoning steps can significantly improve the performance of LLMs in code generation. In this paper, we propose the Verilog Tree of Thoughts (VToT) method. This structured prompting technique addresses the abstraction gap between Verilog and CoT by embedding hierarchical design constraints within the prompt. Experimental results on the VerilogEval and RTLLM benchmarks demonstrate that VToT prompting enhances both the syntactic and functional correctness of the generated code. Specifically, under the RTLLM benchmark, VToT achieved a correctness rate of 75.9% at pass@5, representing an improvement of 10.4%. Furthermore, on the VerilogEval benchmark, VToT achieved state-of-the-art performance with a correctness rate of 52.4% at pass@1 (an increase of 8.9%) and 65.4% at pass@5 (an increase of 9.6%)./proceedings-archive/2024/DATA/1088_pdf_upload.pdf |
||
SIGNAL PREDICTION FOR DIGITAL CIRCUITS BY SIGMOIDAL APPROXIMATIONS USING NEURAL NETWORKS Speaker: Josef Salzmann, TU Wien, AT Authors: Josef Salzmann and Ulrich Schmid, TU Wien, AT Abstract Investigating the temporal behavior of digital circuits is a crucial step in system design, usually done via analog or digital simulation. Analog simulators like SPICE iteratively solve the differential equations characterizing the circuits' components numerically. Although unrivaled in accuracy, this is only feasible for small designs, due to the high computational effort even for short signal traces. Digital simulators use digital abstractions for predicting the timing behavior of a circuit. We advocate a novel approach, which generalizes digital traces to traces consisting of sigmoids, each parameterized by threshold crossing time and slope. For a given gate, we use an artificial neural network for implementing the transfer function that predicts, for any trace of input sigmoids, the parameters of the generated output sigmoids. By means of a prototype simulator, which can handle circuits consisting of inverters and NOR gates, we demonstrate that our approach operates substantially faster than an analog simulator, while offering a much better accuracy than a digital simulator./proceedings-archive/2024/DATA/31_pdf_upload.pdf |
||
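The signal-prediction entry above reduces every transition to a (threshold-crossing time, slope) pair and learns a per-gate transfer function over those parameters. The sketch below only illustrates that parameterization; the linear "transfer" stands in for the paper's trained neural network, and the delay and slope-gain constants are made up.

```python
# Sketch of the sigmoid trace abstraction and a placeholder transfer function.
import math

def sigmoid_edge(t, t0, slope, vdd=1.0, rising=True):
    """Analog value of one transition at time t."""
    v = vdd / (1.0 + math.exp(-slope * (t - t0)))
    return v if rising else vdd - v

def toy_inverter_transfer(t0_in, slope_in, delay=0.12, slope_gain=0.9):
    """Placeholder transfer function: output crossing time and slope.
    A real model would be an ANN trained on SPICE traces."""
    return t0_in + delay, slope_in * slope_gain

t0_out, slope_out = toy_inverter_transfer(1.0, 25.0)
for t in (0.9, 1.0, 1.1, 1.2, 1.3):
    print(f"t={t:.1f}  in={sigmoid_edge(t, 1.0, 25.0):.3f}  "
          f"out={sigmoid_edge(t, t0_out, slope_out, rising=False):.3f}")
```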
VERILUA: AN OPEN SOURCE VERSATILE FRAMEWORK FOR EFFICIENT HARDWARE VERIFICATION AND ANALYSIS USING LUAJIT Speaker: Chuyu Zheng, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, CN Authors: Ye Cai1, Chuyu Zheng1, Wei He2 and Dan Tang3 1Shenzhen University, CN; 2Beijing Institute of Open Source Chip, CN; 3Institute of Computing Technology, Chinese Academy of Sciences (ICT) / Beijing Institute of Open Source Chip, CN Abstract The growing complexity of hardware verification highlights limitations in existing frameworks, particularly regarding flexibility and reusability. Current methodologies often require multiple specialized environments for functional verification, waveform analysis, and simulation, leading to toolchain fragmentation and inefficient code reuse. This paper presents Verilua, a unified framework leveraging LuaJIT and the Verilog Procedural Interface (VPI), which integrates three core functionalities: Lua-based functional verification, a scripting engine for RTL simulation, and waveform analysis. By enabling complete code reuse through a unified Lua codebase, the framework achieves a 12× speedup in RTL simulation compared to cocotb and a 70× improvement in waveform analysis over state-of-the-art solutions. Through consolidating verification tasks into a single platform, Verilua enhances efficiency while reducing tool fragmentation and learning overhead, addressing critical challenges in modern hardware design./proceedings-archive/2024/DATA/741_pdf_upload.pdf |
BPA04 BPA 4 - Security - DT - A3 Topic - Session 3
Add this session to my calendar
Date: Tuesday, 01 April 2025
Time: 08:30 CET - 10:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
A LIGHTWEIGHT CNN FOR REAL-TIME PRE-IMPACT FALL DETECTION Speaker: Cristian Turetta, Università di Verona, IT Authors: Cristian Turetta1, Muhammed Ali1, Florenc Demrozi2 and Graziano Pravadelli1 1Università di Verona, IT; 2Department of Electrical Engineering and Computer Science, University of Stavanger, NO Abstract Falls can have significant and far-reaching effects on various groups, particularly the elderly, workers, and the general population. These effects can impact both physical and psychological well-being, leading to long-term health problems, reduced productivity, and a decreased quality of life. Numerous fall detection systems have been developed to prompt first aid in the event of a fall and reduce its impact on people's lives. However, detecting a fall after it has occurred is insufficient to mitigate its consequences, such as trauma. These effects can be further minimized by activating safety systems (e.g., wearable airbags) during the fall itself, specifically in the pre-impact phase, to reduce the severity of the impact when hitting the ground. Achieving this, however, requires recognizing the fall early enough to provide the necessary time for the safety system to become fully operational before impact. To address this challenge, this paper introduces a novel lightweight convolutional neural network (CNN) designed to detect pre-impact falls. The proposed model overcomes the limitations of current solutions regarding deployability on resource-constrained embedded devices, specifically for controlling the inflation of an airbag jacket. We extensively tested and compared our model, deployed on an STM32F722 microcontroller, against state-of-the-art approaches using two different datasets./proceedings-archive/2024/DATA/560_pdf_upload.pdf |
||
COCKTAIL: CHUNK-ADAPTIVE MIXED-PRECISION QUANTIZATION FOR LONG-CONTEXT LLM INFERENCE Speaker: Wei Tao, Huazhong University of Science & Technology, CN Authors: Wei Tao1, Bin Zhang1, Xiaoyang Qu2, Jiguang Wan1 and Jianzong Wang3 1Huazhong University of Science & Technology, CN; 2Ping An Technology (shenzhen)Co., Ltd, CN; 3Ping An Technology, CN Abstract Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerable inference latency and excessive GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache in LLMs at token granularity, which is time-consuming in the search process and hardware-inefficient during computation. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search determines the optimal bitwidth configuration of the KV cache chunks quickly based on the similarity scores between the corresponding context chunks and the query, maintaining the model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets. Our code is available at https://github.com/Sullivan12138/Cocktail./proceedings-archive/2024/DATA/1442_pdf_upload.pdf |
||
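To illustrate the chunk-level quantization search described in the Cocktail entry above, here is a rough sketch of the general idea; the bit-allocation rule, thresholds, and sizes are invented for illustration, and Cocktail's actual search and chunk reordering are more elaborate.

```python
# Rough sketch of chunk-adaptive mixed-precision KV-cache quantization.
import numpy as np

def choose_bitwidths(key_chunks, query, high=8, low=2, keep_ratio=0.5):
    """More query-relevant chunks keep more bits."""
    sims = [float(np.dot(chunk.mean(axis=0), query)) for chunk in key_chunks]
    cutoff = np.quantile(sims, 1.0 - keep_ratio)
    return [high if s >= cutoff else low for s in sims]

def quantize_chunk(chunk, bits):
    """Symmetric per-chunk quantization at the chosen bitwidth."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(chunk).max() / qmax if np.abs(chunk).max() > 0 else 1.0
    q = np.clip(np.round(chunk / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

rng = np.random.default_rng(0)
chunks = [rng.standard_normal((16, 64)).astype(np.float32) for _ in range(4)]
query = rng.standard_normal(64).astype(np.float32)

bitwidths = choose_bitwidths(chunks, query)
quantized = [quantize_chunk(c, b) for c, b in zip(chunks, bitwidths)]
print(bitwidths)   # e.g. [2, 8, 8, 2], depending on the similarity scores
```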
RANKMAP: PRIORITY-AWARE MULTI-DNN MANAGER FOR HETEROGENEOUS EMBEDDED DEVICES Speaker: Iraklis Anagnostopoulos, Southern Illinois University Carbondale, US Authors: Andreas Karatzas1, Dimitrios Stamoulis2 and Iraklis Anagnostopoulos1 1Southern Illinois University Carbondale, US; 2The University of Texas at Austin, US Abstract Modern edge data centers simultaneously handle multiple Deep Neural Networks (DNNs), leading to significant challenges in workload management. Thus, current management systems need to leverage the architectural heterogeneity of new embedded systems, enabling efficient handling of multi-DNN workloads. This paper introduces RankMap, a priority-aware manager specifically designed for multi-DNN tasks on heterogeneous embedded devices. RankMap addresses the extensive solution space of multi-DNN mapping through stochastic space exploration combined with a performance estimator. Experimental results show that RankMap achieves 3.6x higher average throughput compared to existing methods, while effectively preventing DNN starvation under heavy workloads and improving the prioritization of specified DNNs by 57.5x./proceedings-archive/2024/DATA/555_pdf_upload.pdf |
TS06 Session 19 - D16
Add this session to my calendar
Date: Tuesday, 01 April 2025
Time: 08:30 CET - 10:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
EMPOWERING QUANTUM ERROR TRACEABILITY WITH MOE FOR AUTOMATIC CALIBRATION Speaker: Tingting Li, Zhejiang University, CN Authors: Tingting Li1, Ziming Zhao1, Liqiang Lu1, Siwei Tan2 and Jianwei Yin1 1Zhejiang University, CN; 2Zhejiang University, CN Abstract Quantum computing offers the potential for exponential speedups over classical computing in tackling complex tasks, such as large-number factorization and chemical molecular simulation. However, quantum noise remains a significant challenge, hindering the reliability and scalability of quantum systems. Therefore, effective characterization and calibration of quantum noise are critical to advancing these systems. Quantum calibration is a process that heavily relies on expert knowledge, and there currently is a range of research focused on automatic calibration. However, traditional calibration methods often lack an effective error traceback mechanism, leading to repeated calibration attempts without identifying root causes. To address the issue of error traceback in calibration failures, this paper proposes an automatic calibration error traceback algorithm facilitated by a Mixture of Experts (MoE) system inspired by current large language model technologies. Our approach enables traceability of quantum calibration errors, allowing for the rapid identification and correction of deviations from the calibration state. Extensive experimental results demonstrate that the MoE-based automatic calibration method significantly outperforms traditional techniques in error traceability and calibration efficiency. Notably, our approach improved the average visibility of 77 qubits by 25.5%, surpassing the outcomes of fixed calibration processes. This work presents a promising path toward more reliable and scalable quantum computing systems./proceedings-archive/2024/DATA/228_pdf_upload.pdf |
||
OPTIMAL STATE PREPARATION FOR LOGICAL ARRAYS ON ZONED NEUTRAL ATOM QUANTUM COMPUTERS Speaker: Yannick Stade, TU Munich, DE Authors: Yannick Stade, Ludwig Schmid, Lukas Burgholzer and Robert Wille, TU Munich, DE Abstract Quantum computing promises to solve problems previously deemed infeasible. However, high error rates necessitate quantum error correction for practical applications. Seminal experiments with zoned neutral atom architectures have shown remarkable potential for fault-tolerant quantum computing. To fully harness their potential, efficient software solutions are vital. A key aspect of quantum error correction is the initialization of physical qubits representing a logical qubit in a highly entangled state. This process, known as state preparation, is the foundation of most quantum error correction codes and, hence, a crucial step towards fault-tolerant quantum computing. Generating a schedule of target-specific instructions to perform the state preparation is highly complex. First software tools exist but are not suitable for the zoned neutral atom architectures. This work addresses this gap by leveraging the computational power of SMT solvers and generating minimal schedules for the state preparation of logical arrays. Experimental evaluations demonstrate that actively utilizing zones to shield idling qubits consistently results in higher fidelities than solutions disregarding these zones. The complete code is publicly available in open-source as part of the Munich Quantum Toolkit (MQT) at https://github.com/cda-tum/mqt-qmap./proceedings-archive/2024/DATA/260_pdf_upload.pdf |
||
DESIGN OF AN FPGA-BASED NEUTRAL ATOM REARRANGEMENT ACCELERATOR FOR QUANTUM COMPUTING Speaker: Xiaorang Guo, TU Munich, DE Authors: Xiaorang Guo, Jonas Winklmann, Dirk Stober, Amr Elsharkawy and Martin Schulz, TU Munich, DE Abstract Neutral atoms have emerged as a promising technology for implementing quantum computers due to their scalability and long coherence times. However, the execution frequency of neutral atom quantum computers is constrained by image processing procedures, particularly the assembly of defect-free atom arrays, which is a crucial step in preparing qubits (atoms) for execution. To optimize this assembly process, we propose a novel quadrant-based rearrangement algorithm that employs a divide-and-conquer strategy and also enables the simultaneous movement of multiple atoms, even across different columns and rows. We implement the algorithm on Field Programmable Gate Arrays (FPGAs) to handle each quadrant independently (hardware-level optimization) while maximizing parallelization. To the best of our knowledge, this is the first hardware acceleration work for atom rearrangement, and it significantly reduces processing time. This achievement also contributes to the ongoing efforts of tightly integrating quantum accelerators into High-Performance Computing (HPC) systems. Tested on a Zynq RFSoC FPGA at 250 MHz, our hardware implementation is able to complete the rearrangement process of a 30×30 compact target array, derived from a 50×50 initial loaded array, in approximately 1.0 μs. Compared to a comparable CPU implementation and to state-of-the-art FPGA work, we achieved about 54× and 300× speedups in the rearrangement analysis time, respectively. Additionally, the FPGA-based acceleration demonstrates good scalability, allowing for seamless adaptation to varying sizes of the atom array, which makes this algorithm a promising solution for large-scale quantum systems./proceedings-archive/2024/DATA/685_pdf_upload.pdf |
||
IMAGE COMPUTATION FOR QUANTUM TRANSITION SYSTEMS Speaker: Xin Hong, Institute of Software, Chinese Academy of Sciences, CN Authors: Xin Hong1, Dingchao Gao1, Sanjiang Li2, Shenggang Ying1 and Mingsheng Ying3 1Institute of Software, Chinese Academy of Sciences, CN; 2UTS, AU; 3University of Technology Sydney, AU Abstract With the rapid progress in quantum hardware and software, the need for verification of quantum systems becomes increasingly crucial. While model checking is a dominant and very successful technique for verifying classical systems, its application to quantum systems is still an underdeveloped research area. This paper advances the development of model checking quantum systems by providing efficient image computation algorithms for quantum transition systems, which play a fundamental role in model checking. In our approach, we represent quantum circuits as tensor networks and design algorithms by leveraging the properties of tensor networks and tensor decision diagrams. Our experiments demonstrate that our contraction partition-based algorithm can greatly improve the efficiency of image computation for quantum transition systems./proceedings-archive/2024/DATA/700_pdf_upload.pdf |
||
LOW-LATENCY DIGITAL FEEDBACK FOR STOCHASTIC QUANTUM CALIBRATION USING CRYOGENIC CMOS Speaker: Nathan Miller, Georgia Tech, US Authors: Nathan Miller, Laith Shamieh and Saibal Mukhopadhyay, Georgia Tech, US Abstract In order to develop quantum computing systems towards practically useful applications, their physical quantum bits (qubits) must be able to operate with minimal error. Recent work has demonstrated stochastic gate calibration protocols for quantum systems which are meant to track drifting control parameters and tune gate operations to high fidelity. These protocols critically rely on low-latency feedback between the quantum system and its classical control hardware, which is impossible without on-board classical compute from FPGAs or ASICs. In this work, we analyze the performance of a single-shot stochastic calibration protocol for indefinite outcome quantum circuits under various latency conditions based on timing considerations from experimental quantum systems. We also demonstrate the benefits that can be achieved with ASIC implementation of the protocol by synthesizing the classical control logic in a 28 nm CMOS design node, with simulations extended to 14 nm FinFET and at both room and cryogenic temperatures. We show that these classes of quantum calibration protocols can be easily implemented within contemporary control system architectures for low-latency performance without significant power or resource utilization, allowing for the rapid tuning and drift control of any gate-model quantum system towards fault-tolerant computation./proceedings-archive/2024/DATA/840_pdf_upload.pdf |
||
IMPROVING FIGURES OF MERIT FOR QUANTUM CIRCUIT COMPILATION Speaker: Patrick Hopf, TU Munich, DE Authors: Patrick Hopf1, Nils Quetschlich1, Laura Schulz2 and Robert Wille1 1TU Munich, DE; 2Leibniz Supercomputing Centre, DE Abstract Quantum computing is an emerging technology that has seen significant software and hardware improvements in recent years. Executing a quantum program requires the compilation of its quantum circuit for a target Quantum Processing Unit (QPU). Various methods for qubit mapping, gate synthesis, and optimization of quantum circuits have been proposed and implemented in compilers. These compilers try to generate a quantum circuit that leads to the best execution quality - a criterion that is usually approximated by figures of merit such as the number of (two-qubit) gates, the circuit depth, expected fidelity, or estimated success probability. However, it is often unclear how well these figures of merit represent the actual execution quality on a QPU. In this work, we investigate the correlation between established figures of merit and actual execution quality on real machines - revealing that the correlation is weaker than anticipated and that more complex figures of merit are not necessarily more accurate. Motivated by this finding, we propose an improved figure of merit (based on a machine learning approach) that can be used to predict the expected execution quality of a quantum circuit for a chosen QPU without actually executing it. The employed machine learning model reveals the influence of various circuit features on generating high correlation scores. The proposed figure of merit demonstrates a strong correlation and outperforms all previous ones in a case study - achieving an average correlation improvement of 49%./proceedings-archive/2024/DATA/1077_pdf_upload.pdf |
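As a loose illustration of the learned figure of merit described above, the sketch below fits a simple regression from circuit features to a measured execution quality and reports the resulting correlation. The feature set, the synthetic data, and the linear model are assumptions made for the example; the paper trains a proper machine-learning model on real QPU results.

```python
import numpy as np

# Illustrative circuit features: [two-qubit gates, depth, total gates, est. fidelity].
rng = np.random.default_rng(1)
X = rng.uniform([5, 10, 20, 0.5], [200, 400, 800, 0.99], size=(60, 4))
true_w = np.array([-0.004, -0.001, -0.0005, 0.6])
quality = X @ true_w + 0.4 + 0.02 * rng.standard_normal(60)   # synthetic "execution quality"

# Fit a learned figure of merit: quality ~ X_aug @ w (least squares stands in for the ML model).
X_aug = np.hstack([X, np.ones((60, 1))])                      # add a bias column
w, *_ = np.linalg.lstsq(X_aug, quality, rcond=None)

def learned_figure_of_merit(features, w=w):
    """Predict execution quality for a compiled circuit without running it."""
    return np.append(features, 1.0) @ w

pred = X_aug @ w
print("correlation with measured quality:", np.corrcoef(pred, quality)[0, 1])
```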
||
DETERMINISTIC FAULT-TOLERANT STATE PREPARATION FOR NEAR-TERM QUANTUM ERROR CORRECTION: AUTOMATIC SYNTHESIS USING BOOLEAN SATISFIABILITY Speaker: Ludwig Schmid, TU Munich, DE Authors: Ludwig Schmid1, Tom Peham1, Lucas Berent1, Markus Müller2 and Robert Wille1 1TU Munich, DE; 2RWTH Aachen University, DE Abstract To ensure resilience against the unavoidable noise in quantum computers, quantum information needs to be encoded using an error-correcting code, and circuits must have a particular structure to be fault-tolerant. Compilation of fault-tolerant quantum circuits is thus inherently different from the non-fault-tolerant case. However, automated fault-tolerant compilation methods are widely underexplored, and most known constructions are obtained manually for specific codes only. In this work, we focus on the problem of automatically synthesizing fault-tolerant circuits for the deterministic initialization of an encoded state for a broad class of quantum codes that are realizable on current and near-term hardware. To this end, we utilize methods based on techniques from classical circuit design, such as satisfiability solving, resulting in tools for the synthesis of (optimal) fault-tolerant state preparation circuits for near-term quantum codes. We demonstrate the correct fault-tolerant behavior of the synthesized circuits using circuit-level noise simulations. We provide all routines as open-source software as part of [retracted for double-blind review] for general use and to foster research in fault-tolerant circuit synthesis./proceedings-archive/2024/DATA/1130_pdf_upload.pdf |
||
OPTIMIZING QUBIT ASSIGNMENT IN MODULAR QUANTUM SYSTEMS VIA ATTENTION-BASED DEEP REINFORCEMENT LEARNING Speaker: Enrico Russo, University of Catania, IT Authors: Enrico Russo, Maurizio Palesi, Davide Patti, Giuseppe Ascia and Vincenzo Catania, University of Catania, IT Abstract Modular, distributed, and multi-core architectures are considered a promising solution for scaling quantum computing systems. Optimising communication is crucial to preserve quantum coherence. The compilation and mapping of quantum circuits should minimise state transfers while adhering to architectural constraints. To address this problem efficiently, we propose a novel approach using Reinforcement Learning (RL) to learn heuristics for a specific multi-core architecture. Our RL agent uses a Transformer encoder and Graph Neural Networks, encoding quantum circuits with self-attention and producing outputs via an attention-based pointer mechanism to match logical qubits with physical cores efficiently. Experimental results show that our method outperforms the baseline, reducing inter-core communications for random circuits by 28% while minimising time-to-solution./proceedings-archive/2024/DATA/1281_pdf_upload.pdf |
||
NEURAL CIRCUIT PARAMETER PREDICTION FOR EFFICIENT QUANTUM DATA LOADING Speaker: Dohun Kim, Pohang University of Science and Technology, KR Authors: Dohun Kim, Sunghye Park and Seokhyeong Kang, Pohang University of Science and Technology, KR Abstract Quantum machine learning (QML) has demonstrated the potential to outperform classical machine learning algorithms in various fields. However, encoding classical data into quantum states, known as quantum data loading, remains a challenge. Existing methods achieve high accuracy in loading single data, but lack efficiency for large-scale data loading tasks. In this work, we propose Neural Circuit Parameter Prediction, a novel method that leverages classical deep neural networks to predict the parameters of parameterized quantum circuits directly from the input data. This approach benefits from the batch inference capability of neural networks and improves the accuracy of quantum data loading. We introduce real-valued parameterization of quantum circuits and a three-phase training strategy to further enhance training efficiency and accuracy. Experimental results on the MNIST dataset show that our method achieves a 17.31% improvement in infidelity score and 108 times faster runtime compared to existing methods. Our approach provides an efficient solution for quantum data loading, enabling the practical deployment of QML algorithms on large-scale datasets./proceedings-archive/2024/DATA/1328_pdf_upload.pdf |
||
CIM-BASED PARALLEL FULLY FFNN SURFACE CODE HIGH-LEVEL DECODER FOR QUANTUM ERROR CORRECTION Speaker: Hao Wang, Hong Kong University of Science and Technology, HK Authors: Hao Wang1, Erjia Xiao1, Songhuan He2, Zhongyi Ni1, Lingfeng Zhang1, Xiaokun Zhan3, Yifei Cui2, Jinguo Liu1, Cheng Wang2, Zhongrui Wang4 and Renjing Xu1 1Hong Kong University of Science and Technology, HK; 2University of Electronic Science and Technology of China, CN; 3Harbin Institute of Technology, CN; 4Southern University of Science and Technology, CN Abstract In all types of surface code decoders, fully neural network-based high-level decoders offer decoding thresholds that surpass the Minimum Weight Perfect Matching (MWPM) decoder, and exhibit strong scalability, making them one of the ideal solutions for addressing surface code challenges. However, current fully neural network-based high-level decoders can only operate serially and do not meet the current latency requirements (below 440 ns). To address these challenges, we first propose a parallel fully feedforward neural network (FFNN) high-level surface code decoder, and comprehensively measure its decoding performance on a computing-in-memory (CIM) hardware simulation platform. With the currently available hardware specifications, our work achieves a decoding threshold of 14.22%, and achieves high pseudo-thresholds of 10.4%, 11.3%, 12%, and 11.6% with decoding latencies of 197.03 ns, 234.87 ns, 243.73 ns, and 251.65 ns for distances of 3, 5, 7 and 9, respectively./proceedings-archive/2024/DATA/815_pdf_upload.pdf |
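For readers unfamiliar with "fully FFNN high-level decoding", the sketch below shows the basic shape of such a decoder: a small feedforward network that maps a syndrome bit-vector directly to a logical error class. The layer sizes, the synthetic training data, and the distance-3 dimensions are placeholders; the paper's parallel architecture and CIM mapping are not reproduced here.

```python
import torch
import torch.nn as nn

N_SYNDROME, N_CLASSES = 8, 4          # e.g. stabilizer bits and logical classes (I, X, Z, Y)

# Minimal FFNN "high-level decoder": syndrome bits in, logical error class out.
decoder = nn.Sequential(
    nn.Linear(N_SYNDROME, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_CLASSES),
)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic stand-in data: random syndromes with random logical labels,
# only to demonstrate the training loop, not real surface-code sampling.
syndromes = torch.randint(0, 2, (1024, N_SYNDROME)).float()
labels = torch.randint(0, N_CLASSES, (1024,))

for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(decoder(syndromes), labels)
    loss.backward()
    opt.step()

logical_class = decoder(syndromes[:1]).argmax(dim=-1)
print("predicted logical class:", logical_class.item())
```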
TS07 Session 2 - A5
Date: Tuesday, 01 April 2025
Time: 08:30 CET - 10:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
HYPERDYN: DYNAMIC DIMENSIONAL MASKING FOR EFFICIENT HYPER-DIMENSIONAL COMPUTING Speaker: Fangxin Liu, Shanghai Jiao Tong University, CN Authors: Fangxin Liu, Haomin Li, Zongwu Wang, Dongxu Lyu and Li Jiang, Shanghai Jiao Tong University, CN Abstract Hyper-dimensional computing (HDC) is a bio-inspired computing paradigm that mimics cognitive tasks by encoding data into high-dimensional vectors and employing non-complex learning techniques. However, existing HDC solutions face a major challenge hindering their deployment on low-power embedded devices: the costly associative search module, especially in high-precision computations. This module involves calculating the distance between class vectors and query vectors, as well as sorting the distances. In this paper, we present HyperDyn, an efficient dynamic inference framework designed for accurate and efficient hyper-dimensional computing. Our framework first performs an offline analysis of the importance of different dimensions in the associative memory based on the contributions of the dimensions to the classification accuracy. In addition, we introduce a dynamic dimensional importance scaling mechanism for more flexible and accurate dimension contribution judgments. Finally, HyperDyn achieves efficient dynamic associative search through a dimension masking mechanism that adapts to the characteristics of the input sample. We evaluate HyperDyn on datasets from three different fields and the results show that HyperDyn can achieve a 7.65× speedup and 58% energy savings, with less than 0.2% loss in accuracy./proceedings-archive/2024/DATA/8_pdf_upload.pdf |
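A toy version of the dimension-masking idea, using bipolar hypervectors in NumPy: compute a per-dimension importance score offline, then run the associative search over only the highest-scoring dimensions. The importance proxy and the fixed 50% mask are simplifications; HyperDyn additionally scales importance dynamically and adapts the mask to each input sample.

```python
import numpy as np

def dimension_importance(class_hvs, train_hvs, train_labels):
    """Offline: score each dimension by how often it 'votes' for the correct class
    (a simple illustrative proxy, not the paper's contribution metric)."""
    agree = np.sign(class_hvs[train_labels] * train_hvs)      # +1 where the dim agrees
    return agree.mean(axis=0)                                 # per-dimension score

def masked_associative_search(class_hvs, query_hv, importance, keep=0.5):
    """Online: compare only the top `keep` fraction of dimensions."""
    d = class_hvs.shape[1]
    mask = np.argsort(importance)[-int(d * keep):]
    sims = class_hvs[:, mask] @ query_hv[mask]                # reduced dot products
    return int(np.argmax(sims))

# Toy bipolar HDC setup: 3 classes, 1000-dimensional hypervectors.
rng = np.random.default_rng(2)
D, C, N = 1000, 3, 90
class_hvs = rng.choice([-1, 1], size=(C, D))
train_labels = rng.integers(0, C, N)
train_hvs = class_hvs[train_labels] * rng.choice([1, 1, 1, -1], size=(N, D))  # noisy copies
imp = dimension_importance(class_hvs, train_hvs, train_labels)
print("prediction:", masked_associative_search(class_hvs, train_hvs[0], imp),
      "truth:", train_labels[0])
```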
||
C3CIM: CONSTANT COLUMN CURRENT MEMRISTOR-BASED COMPUTATION-IN-MEMORY MICRO-ARCHITECTURE Speaker: Yashvardhan Biyani, TU Delft, NL Authors: Yashvardhan Biyani, Rajendra Bishnoi, Said Hamdioui and Theofilos Spyrou, TU Delft, NL Abstract Advancements in Artificial Intelligence (AI) and Internet-of-Things (IoT) have increased demand for edge AI, but deployment on traditional AI accelerators, like GPUs and TPUs, which use the von Neumann architecture, suffers from inefficiencies due to separate memory and compute units. Computation-in-Memory (CIM), utilizing non-volatile memristor devices to leverage analog computing principles and perform in-place computations, holds great potential in improving computational efficiency by eliminating frequent data movement. However, standard implementation of CIM faces several challenges, primarily high power consumption and the non-linearity it induces, limiting its viability for edge devices. In this paper, we propose C3CIM, a novel memristor-based CIM micro-architecture, featuring a new bit-cell and array design, targeting efficient implementation of Neural Networks (NN). Our architecture uses a constant current source to perform Multiply-and-Accumulate (MAC) operations with a very low computation current (10 to 100 nA), thereby significantly enhancing power efficiency. We adapted C3CIM for Spiking Neural Networks (SNN) and developed a prototype using a TSMC 40nm CMOS node for on-silicon validation. Furthermore, our micro-architecture was benchmarked using two SNN models based on N-MNIST and IBM-Gesture datasets, for comparison against current state-of-the-art (SOTA). Results show up to 35x reduction in power along with 6.7x saving in energy compared to SOTA, demonstrating the promising potential of this work for edge AI applications./proceedings-archive/2024/DATA/1365_pdf_upload.pdf |
||
ASNPC: AN AUTOMATED GENERATION FRAMEWORK FOR SNN AND NEUROMORPHIC PROCESSOR CO-DESIGN Speaker: Xiangyu Wang, National University of Defense Technology, CN Authors: Xiangyu Wang1, Yuan Li2, Zhijie Yang3, Chao Xiao1, Xun Xiao1, Renzhi Chen4, Weixia Xu1 and Lei Wang3 1National University of Defense Technology, CN; 2College of Computer, National University of Defense Technology, CN; 3Academy of Military Sciences, CN; 4Qiyuan Laboratory, CN Abstract Spiking neural networks (SNNs) are considered promising energy-efficient alternatives to traditional deep neural networks. At the same time, neuromorphic processors have garnered increasing attention to support the efficient execution of SNNs. However, current works typically design the two separately, prioritizing a single criterion. Hardware-algorithm co-design allows for the simultaneous consideration of hardware and algorithm characteristics during the design process, effectively reducing resource usage while optimizing the algorithm's performance. In light of this, we developed a hardware-algorithm co-design framework named ASNPC for SNNs and neuromorphic processors. Considering the vast mixed-variable co-design space and the time-expensive function evaluations, we employed the surrogate-based multi-objective optimization algorithm MOTPE to identify Pareto solutions that balance algorithm performance and hardware costs. To rapidly obtain hardware results, we designed an end-to-end methodology that can automatically generate the Register-Transfer Level (RTL) code for neuromorphic processors corresponding to each candidate using templates from the hardware library. The evaluated hardware metrics, such as hardware resource and power consumption, are then fed back to MOTPE for the next candidate selection. Compared to existing works, the proposed approach exhibits the ability to find better Pareto solutions, balancing hardware costs and accuracy within a limited search budget, making it widely applicable to various application scenarios. Additionally, under the same hardware configuration, the neuromorphic processor we generated achieves lower hardware resource usage and higher throughput./proceedings-archive/2024/DATA/336_pdf_upload.pdf |
||
SIMULTANEOUS DENOISING AND COMPRESSION FOR DVS WITH PARTITIONED CACHE-LIKE SPATIOTEMPORAL FILTER Speaker: Qinghang Zhao, Xidian University, CN Authors: Qinghang Zhao, Yixi Ji, Jiaqi Wang, Jinjian Wu and Guangming Shi, Xidian University, CN Abstract Dynamic vision sensor (DVS) is a novel neuromorphic imaging device that asynchronously generates event data corresponding to changes in light intensity at each pixel. However, the differential imaging paradigm of DVS renders it highly sensitive to background noise. Additionally, the substantial volume of event data produced in a very short time presents significant challenges for data transmission and processing. In this work, we present a novel spatiotemporal filter design, named PCLF, to achieve simultaneous denoising and compression for the first time. The PCLF employs a hierarchical memory structure that utilizes symmetric multi-bank cache-like row and column memories to store event data from a partitioned pixel array, which exhibits low memory complexity of O(m+n) for an m×n DVS. Furthermore, we propose a probability-based criterion to effectively control the compression ratio. We have implemented our design on an FPGA, demonstrating capabilities for real-time operation (≤60 ns) and low power consumption (<200 mW). Extensive experiments conducted on real-world DVS data across various tasks indicate that our design enables a reduction of event data by 30% to 68%, while maintaining or even enhancing the performance of the tasks./proceedings-archive/2024/DATA/402_pdf_upload.pdf |
||
PRACTICAL MU-MIMO DETECTION AND LDPC DECODING THROUGH DIGITAL ANNEALING Speaker: Po-Shao Chen, University of Michigan, US Authors: Po-Shao Chen, Wei Tang and Zhengya Zhang, University of Michigan, US Abstract Digital annealing has been successfully applied to solving combinatorial optimization (CO) problems. It is more flexible, robust, and easier to deploy on edge platforms compared to its counterparts including quantum annealing and analog and in-memory Ising machines. In this work, we apply digital annealing to compute-intensive communication digital signal processing problems, including multi-user detection in multiple-input and multiple-output (MU-MIMO) wireless communication systems and decoding low-density parity-check (LDPC) codes. We show that digital annealing can achieve near maximum likelihood (ML) accuracy for MIMO detection with even lower complexity than the conventional minimum mean square error (MMSE) detection. In LDPC decoding, we enhance digital annealing by introducing a new cost function that improves decoding accuracy and reduces computational complexity compared to the standard formulations./proceedings-archive/2024/DATA/538_pdf_upload.pdf |
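To make the MIMO-detection use concrete, here is a software stand-in for a digital annealer on the maximum-likelihood cost ||y − Hx||² with BPSK symbols x ∈ {−1,+1}^K. The cooling schedule, flip rule, and toy 4×4 channel are illustrative; the paper's digital-annealing hardware and its LDPC cost function are not modeled here.

```python
import numpy as np

def mimo_detect_annealing(H, y, n_sweeps=200, T0=2.0, seed=0):
    """Detect BPSK symbols x in {-1,+1}^K by annealing the ML cost ||y - Hx||^2."""
    rng = np.random.default_rng(seed)
    K = H.shape[1]
    x = rng.choice([-1.0, 1.0], K)
    r = y - H @ x                                  # current residual
    for sweep in range(n_sweeps):
        T = T0 * (1 - sweep / n_sweeps) + 1e-3     # linear cooling schedule
        for k in rng.permutation(K):
            # Flipping x[k] changes the residual by dr = 2*x[k]*H[:, k].
            dr = 2.0 * x[k] * H[:, k]
            dE = np.sum((r + dr) ** 2) - np.sum(r ** 2)
            if dE < 0 or rng.random() < np.exp(-dE / T):
                x[k] = -x[k]
                r = r + dr
    return x

# Toy 4x4 MU-MIMO system with BPSK and mild noise.
rng = np.random.default_rng(3)
H = rng.standard_normal((4, 4))
x_true = rng.choice([-1.0, 1.0], 4)
y = H @ x_true + 0.05 * rng.standard_normal(4)
print("detected:", mimo_detect_annealing(H, y), "true:", x_true)
```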
||
LLM-SRAF: SUB-RESOLUTION ASSIST FEATURE GENERATION USING LARGE LANGUAGE MODEL Speaker: Tianyi Li, ShanghaiTech University, CN Authors: Tianyi Li1, Zhexin Tang1, Tao Wu1, Bei Yu2, Jingyi Yu1 and Hao Geng1 1ShanghaiTech University, CN; 2The Chinese University of Hong Kong, HK Abstract As integrated circuit (IC) feature sizes continue to shrink, using sub-resolution assist features (SRAF) becomes increasingly crucial for improving wafer pattern resolution and fidelity. However, model-based SRAF insertion techniques, while accurate, require substantial computational resources and are often impractical for industrial scenarios. This demands more efficient and industry-compatible methods that maintain high performance. In this work, we introduce LLM-SRAF, a novel framework for SRAF generation driven by a large language model fine-tuned on an SRAF dataset. LLM-SRAF accepts semantic prompt inputs, including SRAF generation task descriptions, OPC recipe, lithography conditions, mask rules, and sequential layout descriptions, to directly generate SRAFs. Both supervised fine-tuning and reinforcement learning with human feedback (RLHF) are employed to enable the model to acquire domain-specific knowledge and specialize in SRAF generation. Experimental results show that LLM-SRAF outperforms existing state-of-the-art methods in metrics of mask quality, including edge placement error (EPE) and process variation band (PVB) area. Moreover, the runtime of LLM-SRAF is also 3x faster compared to the Calibre commercial tool./proceedings-archive/2024/DATA/771_pdf_upload.pdf |
||
A MULTI-STAGE POTTS MACHINE BASED ON COUPLED CMOS RING OSCILLATORS Speaker: Yilmaz Ege Gonul, Drexel University, US Authors: Yilmaz Gonul and Baris Taskin, Drexel University, US Abstract This work presents a multi-stage coupled ring oscillator based Potts machine, designed with phase-shifted Sub-Harmonic-Injection-Locking (SHIL) to represent multivalued Potts spins at different solution stages with oscillator phases. The proposed Potts machine is able to solve a certain class of combinatorial optimization problems that natively require multivalued spins with a divide-and-conquer approach, facilitated through the alternating phase-shifted SHILs acting on the oscillators. The proposed architecture eliminates the need for any external intermediary mappings or usage of external memory, as the influence of SHIL allows oscillators to act as both memory and computation units. Planar 4-coloring problems of sizes up to 2116 nodes are mapped to the proposed architecture. Simulations demonstrate that the proposed Potts machine provides exact solutions for smaller problems (e.g. 49 nodes) and generates solutions reaching up to 97% accuracy for larger problems (e.g. 2116 nodes)./proceedings-archive/2024/DATA/845_pdf_upload.pdf |
||
ADAPT-PNC: MITIGATING DEVICE VARIABILITY AND SENSOR NOISE IN PRINTED NEUROMORPHIC CIRCUITS WITH SO ADAPTIVE LEARNABLE FILTERS Speaker: Tara Gheshlaghi, KIT - Karlsruher Institut für Technologie, DE Authors: Tara Gheshlaghi1, Priyanjana Pal1, Haibin Zhao1, Michael Hefenbrock2, Michael Beigl1 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2RevoAI GmbH, DE Abstract The rise of the Internet of Things demands flexible, biocompatible, and cost-effective devices. Printed electronics provide a solution through low-cost and on-demand additive manufacturing on flexible substrates, making them ideal for IoT applications. However, variations in additive manufacturing processes pose challenges for reliable circuit fabrication. Adapting neuromorphic computing to printed electronics could address these issues. Printed neuromorphic circuits offer robust computational capabilities for near-sensor processing in IoT. One limitation of existing printed neuromorphic circuits is their inability to process temporal sensory inputs. To address this, integrating temporal components in printed neuromorphic circuit architectures enables the effective processing of time-series sensory data. Printed neuromorphic circuits face challenges from manufacturing variations such as ink dispersion, sensor noise, and temporal fluctuations, especially when processing temporal data and using time-dependent components like capacitors. To mitigate these challenges, we propose robustness-aware temporal processing neuromorphic circuits with low-pass second-order learnable filters (SO-LF). This approach integrates variation awareness by considering the variation potential of component values during training and using data augmentation to enhance adaptability against physical and sensor data variations. Simulations on 15 benchmark time-series datasets show that our circuit effectively handles noisy temporal information under 10% process variations, achieving average accuracy and power improvements of ≈ 24.7% and ≈ 91% respectively compared to models lacking variation awareness, with ≈ 1.9× more devices./proceedings-archive/2024/DATA/868_pdf_upload.pdf |
||
SELF-ADAPTIVE ISING MACHINES FOR CONSTRAINED OPTIMIZATION Speaker: Corentin Delacour, University of California, Santa Barbara, US Author: Corentin Delacour, University of California, Santa Barbara, US Abstract Ising machines (IMs) are physics-inspired alternatives to von Neumann architectures for solving hard optimization tasks. By mapping binary variables to coupled Ising spins, IMs can naturally solve unconstrained combinatorial optimization problems such as finding maximum cuts in graphs. However, despite their importance in practical applications, constrained problems remain challenging to solve for IMs that require large quadratic energy penalties to ensure the correspondence between energy ground states and constrained optimal solutions. To relax this requirement, we propose a self-adaptive IM that iteratively shapes its energy landscape using a Lagrange relaxation of constraints and avoids prior tuning of penalties. Using a probabilistic-bit (p-bit) IM emulated in software, we benchmark our algorithm with multidimensional knapsack problems (MKP) and quadratic knapsack problems (QKP), the latter being an Ising problem with linear constraints. For QKP with 300 variables, the proposed algorithm finds better solutions than state-of-the-art IMs such as Fujitsu's Digital Annealer and requires 7,500x fewer samples. Our results show that adapting the energy landscape during the search can speed up IMs for constrained optimization./proceedings-archive/2024/DATA/942_pdf_upload.pdf |
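The self-adaptive mechanism can be sketched as an outer loop that updates a Lagrange multiplier from the observed constraint violation around an inner annealing run, instead of fixing a large quadratic penalty up front. The toy quadratic-knapsack instance, the single linear constraint, the Metropolis inner solver, and the step size `alpha` are all illustrative assumptions; the paper emulates a p-bit Ising machine rather than this simple annealer.

```python
import numpy as np

def solve_inner(Q, w, C, lam, x, rng, sweeps=50, T=0.5):
    """Anneal the penalized energy E(x) = -x^T Q x + lam * max(0, w.x - C) with Metropolis flips."""
    def energy(x):
        return -x @ Q @ x + lam * max(0.0, w @ x - C)
    for _ in range(sweeps):
        for k in rng.permutation(len(x)):
            x_new = x.copy()
            x_new[k] = 1 - x_new[k]
            dE = energy(x_new) - energy(x)
            if dE < 0 or rng.random() < np.exp(-dE / T):
                x = x_new
    return x

def self_adaptive_ising(Q, w, C, outer=20, alpha=0.5, seed=0):
    """Outer loop: adapt the Lagrange multiplier to the observed constraint violation."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, len(w)).astype(float)
    lam = 0.0
    for _ in range(outer):
        x = solve_inner(Q, w, C, lam, x, rng)
        violation = w @ x - C
        lam = max(0.0, lam + alpha * violation)      # subgradient-style multiplier update
    return x, lam

# Toy quadratic knapsack: 8 items, profits on the diagonal plus pairwise bonuses.
rng = np.random.default_rng(4)
Q = np.triu(rng.uniform(0, 3, (8, 8))); Q = (Q + Q.T) / 2
w = rng.uniform(1, 5, 8); C = 0.5 * w.sum()
x, lam = self_adaptive_ising(Q, w, C)
print("selection:", x.astype(int), "weight:", round(w @ x, 2),
      "capacity:", round(C, 2), "lambda:", round(lam, 2))
```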
||
ENABLING SNN-BASED NEAR-MEA NEURAL DECODING WITH CHANNEL SELECTION: AN OPEN-HW APPROACH Speaker: Gianluca Leone, Università degli Studi di Cagliari, IT Authors: Gianluca Leone, Luca Martis, Luigi Raffo and Paolo Meloni, Università degli Studi di Cagliari, IT Abstract Advancements in CMOS microelectrode array sensors have significantly improved sensing area and resolution, paving the way to accurate Brain-Machine Interfaces (BMIs). However, near-sensor neural decoding on implantable computing devices is still an open problem. A promising solution is provided by Spiking Neural Networks (SNNs), which leverage event sparsity to improve energy consumption. However, given the typical data rates involved, the workload related to I/O acquisition and spike encoding is dominant and limits the benefits achievable with event-based processing. In this work, we present two power-efficient implementations, on FPGA and ASIC, of a dedicated processor for the decoding of intracortical action potentials from primary motor cortex. The processor leverages lightweight sparse SNNs to achieve state-of-the-art accuracy. To limit the impact of I/O transfers on energy efficiency, we introduced a channel selection scheme that reduced bandwidth requirements by 3x and power consumption by 2.3x and 1.6x on the FPGA and ASIC, respectively, enabling inference at 0.446 µJ and 1.04 µJ, with no significant loss in accuracy. To promote broad adoption in a specialized, research-intensive domain, we have based our implementations on open-source EDA tools, low-cost hardware, and an open PDK./proceedings-archive/2024/DATA/1119_pdf_upload.pdf |
||
TOWARDS FAST AUTOMATIC DESIGN OF SILICON DANGLING BOND LOGIC Speaker: Jan Drewniok, TU Munich, DE Authors: Jan Drewniok1, Marcel Walter1, Samuel Ng2, Konrad Walus2 and Robert Wille1 1TU Munich, DE; 2University of British Columbia, CA Abstract In recent years, Silicon Dangling Bond (SiDB) logic has emerged as a promising beyond-CMOS technology. Unlike conventional circuit technology, where logic is realized through transistors, SiDB logic utilizes quantum dots with variable charge states. By strategically arranging these dots, logic functions can be constructed. However, determining such arrangements is a tremendously complex task. Because of this, automatically obtaining SiDB logic implementations is inefficient. To address this challenge, we propose an idea to speed up the design process by utilizing dedicated search space pruning strategies. Initial results show that the combined pruning techniques yield 1) a drastic reduction of the search space, and 2) a corresponding reduction in runtime by up to a factor of 33./proceedings-archive/2024/DATA/448_pdf_upload.pdf |
||
LOADING-AWARE MIXING-EFFICIENT SAMPLE PREPARATION ON PROGRAMMABLE MICROFLUIDIC DEVICE Speaker: Debraj Kundu, TU Munich, DE Authors: Debraj Kundu1, Tsun-Ming Tseng2, Shigeru Yamashita3 and Ulf Schlichtmann2 1TU Munich (TUM), DE; 2TU Munich, DE; 3Ritsumeikan University, JP Abstract Sample preparation, where a certain number of reagents must be mixed in a specific volumetric ratio, is an integral step for various bio-assays. A programmable microfluidic device (PMD) is an advanced flow-based microfluidic biochip (FMB) platform that is considered to be very effective for sample preparation. However, the impact of mixer placement, reagents' distribution, and mixing time on the automation of sample preparation has not yet been investigated. We consider a mixing efficiency model controlled by the number of alternations "μ" of reagents along the mixing circulation path and propose a loading-aware placement strategy that maximizes the mixing efficiency. We use satisfiability modulo theories (SMT) and propose a one-pass strategy for placing the mixers and the reagents, which successfully enhances the loading and mixing efficiencies./proceedings-archive/2024/DATA/877_pdf_upload.pdf |
TS08 Session 21 - A6+E4
Date: Tuesday, 01 April 2025
Time: 11:00 CET - 12:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
FILTER-BASED ADAPTIVE MODEL PRUNING FOR EFFICIENT INCREMENTAL LEARNING ON EDGE DEVICES Speaker: Jing-Jia Hung, National Taiwan University & TSMC, TW Authors: Jing-Jia Hung1, Yi-Jung Chen2, Hsiang-Yun Cheng3, Hsu Kao4 and Chia-Lin Yang1 1National Taiwan University, TW; 2Department of Computer Science and Information Engineering, National Chi Nan University, TW; 3Academia Sinica, TW | National Taiwan University, TW; 4National Tsing Hua University, TW Abstract Incremental Learning (IL) enhances Machine Learning (ML) models over time with new data, ideal for edge devices at the forefront of data collection. However, executing IL on edges faces challenges due to limited resources. Common methods involve IL followed by model pruning or specialized IL methods for edges. However, the former increases training time due to fine-tuning and compromises accuracy for past classes due to limited retained samples or features. Meanwhile, existing edge-specific IL methods utilize weight pruning, which requires specialized hardware or compilers to speed up and cannot reduce computations on general embedded platforms. In this paper, we propose Filter-based Adaptive Model Pruning (FAMP), the first pruning method designed specifically for IL. FAMP prunes the model before the IL process, allowing fine-tuning to occur concurrently with IL, thereby avoiding extended training time. To maintain high accuracy for both new and past data classes, FAMP adapts the compressed model based on observed data classes and retains filter settings from the previous IL iteration to mitigate forgetting. Across all tests, FAMP achieves the best average accuracy, with only a 2.78% accuracy drop over full ML models with IL. Moreover, unlike the common methods that prolong training time, FAMP takes 35% shorter training time on average than using the full ML models for IL./proceedings-archive/2024/DATA/1044_pdf_upload.pdf |
||
DYLGNN: EFFICIENT LM-GNN FINE-TUNING WITH DYNAMIC NODE PARTITIONING, LOW-DEGREE SPARSITY, AND ASYNCHRONOUS SUB-BATCH Speaker: Zhen Yu, Shanghai Jiao Tong University, CN Authors: Zhen Yu, Jinhao Li, Jiaming Xu, Shan Huang, Jiancai Ye, Ningyi Xu and Guohao Dai, Shanghai Jiao Tong University, CN Abstract Text-Attributed Graph (TAG) tasks involve both textual node information and graph topological structure. The top-k method, using Language Models (LMs) for text encoding and Graph Neural Networks (GNNs) for graph processing, offers the best accuracy while balancing memory and training time. However, challenges still exist: (1) Static sampling of k neighbors reduces performance. Using a fixed k can result in sampling too few or too many nodes, leading to a 3.2% accuracy loss across datasets. (2) Time-consuming processing for non-trainable nodes. After partitioning all nodes into with-gradient trainable and without-gradient non-trainable sets, the number of non-trainable nodes is ∼9-10× larger than trainable nodes, resulting in nearly 70% of the total time. (3) Time-consuming data movement. For processing non-trainable nodes, after the text strings are tokenized into tokens on the CPU side, the data movement from host memory to GPU takes 30%-40% of the time. In this paper, we propose DyLGNN, an efficient end-to-end LM-GNN fine-tuning framework through three innovations: (1) Heuristic Node Partitioning. We propose an algorithm that dynamically and adaptively selects "important" nodes to participate in the training process for downstream tasks. Compared to the static top-k method, we reduce the training memory usage by 24.0%. (2) Low-Degree Sparse Attention. We point out that the embedding of low-degree nodes has minimal impact on the final results (e.g., ∼1.5% accuracy loss); therefore, we perform sparse attention computation on low-degree nodes to further reduce the computation caused by "unimportant" nodes, achieving an average 1.27× speedup. (3) Asynchronous Sub-batch Pipeline. Within the top-k framework, we analyze the time breakdown of the LM inference component. Leveraging our heuristic node partitioning, which effectively minimizes memory demands, we can asynchronously execute data movement and computation, thereby overlapping the time required for data movement. This improves GPU utilization and results in an average 1.1× speedup. We conduct experiments on several common graph datasets, and by combining the three methods mentioned above, DyLGNN achieves a 22.0% reduction in memory usage and a 1.3× end-to-end speedup compared to the top-k strategy./proceedings-archive/2024/DATA/756_pdf_upload.pdf |
||
ITERL2NORM: FAST ITERATIVE L2-NORMALIZATION Speaker: ChangMin Ye, Hanyang University, KR Authors: ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin and Doo Seok Jeong, Hanyang University, KR Abstract Transformer-based large language models are memory-bound models whose operation is based on a large amount of data that is only marginally reused. Thus, the data movement between a host and an accelerator likely dictates the total wall-clock time. Layer normalization is one of the key workloads in the transformer model, following each of the multi-head attention and feed-forward network blocks. To reduce data movement, layer normalization needs to be performed on the same chip as the matrix-matrix multiplication engine. To this end, we introduce an iterative L2-normalization method for 1D input (IterL2Norm), ensuring fast convergence to the steady-state solution within five iteration steps and high precision, outperforming the fast inverse square root algorithm in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. Implemented in 32/28nm CMOS, the IterL2Norm macro normalizes $d$-dimensional vectors, where 64 ≤ d ≤ 1024, with a latency of 116-227 cycles at 100MHz/1.05V./proceedings-archive/2024/DATA/793_pdf_upload.pdf |
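The iterative L2-normalization can be reproduced in a few lines: estimate the inverse square root of the sum of squares with Newton-Raphson steps so that normalization needs only multiplies and adds. The exponent-halving seed and the five-iteration budget below are illustrative choices consistent with the abstract, not the paper's exact hardware datapath.

```python
import numpy as np

def iter_l2_normalize(x, n_iter=5):
    """Normalize x to unit L2 norm without a hardware divide/sqrt: iterate
    y <- y*(3 - s*y^2)/2, which converges to 1/sqrt(s) for s = sum(x^2)."""
    s = float(np.dot(x, x))                      # sum of squares
    y = 2.0 ** (-np.floor(np.log2(s)) / 2.0)     # coarse seed: halve the exponent of s
    for _ in range(n_iter):
        y = 0.5 * y * (3.0 - s * y * y)          # Newton step for f(y) = 1/y^2 - s
    return x * y                                 # multiply-only normalization

x = np.random.default_rng(5).standard_normal(768).astype(np.float32)
x_hat = iter_l2_normalize(x)
print("resulting norm:", np.linalg.norm(x_hat))  # close to 1.0 after five iterations
```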
||
MPTORCH-FPGA: A CUSTOM MIXED PRECISION FRAMEWORK FOR FPGA-BASED DNN TRAINING Speaker: Sami BEN ALI, Inria Rennes, FR Authors: Sami BEN ALI1, Silviu-Ioan Filip1, Olivier Sentieys1 and Guy Lemieux2 1INRIA, FR; 2University of British Columbia, CA Abstract Training Deep Neural Networks (DNNs) is computationally demanding, leading to a growing interest in reduced precision formats to enhance hardware efficiency. Several frameworks explore custom number formats with parameterizable precision through software emulation on CPUs or GPUs. However, they lack comprehensive support for different rounding modes and struggle to accurately evaluate the impact of custom precision for FPGA-based targets. This paper introduces MPTorch-FPGA, an extension of the MPTorch framework for performing custom, multi-precision inference and training computations in CPU, GPU, and FPGA environments in PyTorch. MPTorch-FPGA can generate a model-specific accelerator for DNN training, with customizable sizes and arithmetic implementations, providing bit-level accuracy with respect to emulated low precision DNN training on GPUs or CPUs. An offline matching algorithm selects one of several pre-generated (static) FPGA configurations using a custom performance model to estimate latency. To showcase the versatility of MPTorch-FPGA, we present a series of training benchmarks using diverse DNN models, exploring a range of number format configurations and rounding modes. We report both accuracy and hardware performance metrics, verifying the precision of our performance model by comparing estimated and measured latencies across multiple benchmarks. These results highlight the flexibility and practical value of our framework./proceedings-archive/2024/DATA/1481_pdf_upload.pdf |
||
MEMHD: MEMORY-EFFICIENT MULTI-CENTROID HYPERDIMENSIONAL COMPUTING FOR FULLY-UTILIZED IN-MEMORY COMPUTING ARCHITECTURES Speaker: Do Yeong Kang, Sungkyunkwan University, KR Authors: Do Yeong Kang, Yeong Hwan Oh, Chanwook Hwang, Jinhee Kim, Kang Eun Jeon and Jong Hwan Ko, Sungkyunkwan University, KR Abstract Hyperdimensional Computing (HDC) has shown great potential in brain-inspired computing, but its integration with In-Memory Computing (IMC) faces challenges due to high-dimensional vector operations and memory utilization issues. This paper introduces a novel multi-centroid Associative Memory (AM) structure for HDC implemented on IMC architectures, addressing these challenges while maintaining high accuracy in classification tasks. Our approach compresses dimensions through the multi-centroid model, bringing IMC array utilization for Associative Search close to 100% and significantly reducing computations. This dimension compression substantially decreases memory footprint in both the Encoding Module and Associative Memory, while reducing computational requirements. Additionally, we propose innovative initialization and learning methods for multi-centroid AM, including clustering-based initialization for faster convergence and a quantization-aware iterative learning approach for high-accuracy, IMC-compatible AM training. Our adaptive structure optimizes model design based on available hardware resources by adjusting memory columns and rows. Comprehensive evaluations across various classification datasets demonstrate that our method achieves superior memory efficiency at equivalent accuracy levels and improved accuracy at equivalent memory usage compared to conventional HDC models./proceedings-archive/2024/DATA/1407_pdf_upload.pdf |
||
ODIN: LEARNING TO OPTIMIZE OPERATION UNIT CONFIGURATION FOR ENERGY-EFFICIENT DNN INFERENCING Speaker: Gaurav Narang, Washington State University, US Authors: Gaurav Narang, Jana Doppa and Partha Pratim Pande, Washington State University, US Abstract ReRAM-based Processing-In-Memory (PIM) architectures enable energy-efficient Deep Neural Network (DNN) inferencing. However, ReRAM crossbars suffer from various non-idealities that affect overall inferencing accuracy. To address that, the matrix-vector-multiplication (MVM) operations are computed by activating a subset of the full crossbar, referred to as Operation Unit (OU). However, OU configurations vary with the neural layers' features such as sparsity, kernel size and their impact on predictive accuracy. In this paper, we consider the problem of learning appropriate layer-wise OU configurations in ReRAM crossbars for unseen DNNs at runtime such that performance is maximized without loss in predictive accuracy. We employ a machine learning (ML) based framework called Odin, which selects the OU sizes for different neural layers as a function of the neural layer features and time-dependent ReRAM conductance drift. Our experimental results demonstrate that the energy-delay-product (EDP) is reduced by up to 8.7× over state-of-the-art homogeneous OU configurations without compromising predictive accuracy./proceedings-archive/2024/DATA/856_pdf_upload.pdf |
||
SLIPSTREAM: SEMANTIC-BASED TRAINING ACCELERATION FOR RECOMMENDATION MODELS Speaker: Yassaman Ebrahimzadeh Maboud, University of British Columbia, CA Authors: Yassaman Ebrahimzadeh Maboud1, Muhammad Adnan1, Divya Mahajan2 and Prashant Jayaprakash Nair1 1University of British Columbia, CA; 2Georgia Tech, US Abstract Recommendation models play a crucial role in delivering accurate and tailored user experiences. However, training such models poses significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variation, resulting in saturation. Consequently, updates to these embeddings become redundant, lacking any contribution to model quality. This paper presents Slipstream, a software framework that identifies stale embeddings on the fly and skips their updates to enhance performance. Our experiments demonstrate Slipstream's ability to maintain accuracy while effectively discarding updates to non-varying embeddings. This capability enables Slipstream to achieve substantial speedup, optimize CPU-GPU bandwidth usage, and eliminate unnecessary memory access. Slipstream showcases training time reductions of 2x, 2.4x, 1.2x, and 1.175x across real-world datasets and configurations, compared to Baseline XDL, Intel-optimized DRLM, FAE, and Hotline, respectively./proceedings-archive/2024/DATA/1374_pdf_upload.pdf |
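The core idea of skipping updates to saturated embeddings can be sketched as a small bookkeeping class: watch how much each touched embedding row moves per optimizer step and freeze rows whose updates stay tiny for several consecutive steps. The threshold, patience window, and toy embedding table are assumptions for illustration, not Slipstream's actual staleness criterion.

```python
import torch

class StaleEmbeddingFilter:
    """Mark embedding rows whose recent updates stay below a threshold as 'saturated',
    so their redundant updates can be skipped on later steps."""
    def __init__(self, num_rows, threshold=1e-4, patience=3):
        self.small_count = torch.zeros(num_rows, dtype=torch.int32)
        self.threshold, self.patience = threshold, patience

    def observe(self, row_ids, old_rows, new_rows):
        delta = (new_rows - old_rows).norm(dim=1)
        small = delta < self.threshold
        self.small_count[row_ids[small]] += 1
        self.small_count[row_ids[~small]] = 0          # any large update resets the counter

    def trainable_mask(self, row_ids):
        return self.small_count[row_ids] < self.patience   # False => skip this row's update

# toy usage with a small embedding table
emb = torch.nn.Embedding(100, 16)
filt = StaleEmbeddingFilter(100)
ids = torch.randint(0, 100, (32,))
before = emb.weight.data[ids].clone()
emb.weight.data[ids] += 1e-6 * torch.randn(32, 16)     # pretend this was an optimizer step
filt.observe(ids, before, emb.weight.data[ids])
print("rows still worth updating:", filt.trainable_mask(ids).sum().item(), "/", len(ids))
```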
||
COMPASS: A COMPILER FRAMEWORK FOR RESOURCE-CONSTRAINED CROSSBAR-ARRAY BASED IN-MEMORY DEEP LEARNING ACCELERATORS Speaker: Jihoon Park, Seoul National University, KR Authors: Jihoon Park, Jeongin Choe, Dohyun Kim and Jae-Joon Kim, Seoul National University, KR Abstract Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are currently limited to the case where all weights are assumed to be on-chip. This limitation becomes apparent with the significantly increasing network sizes compared to the in-memory footprint. Weight replacement schemes are essential to address this issue. We propose COMPASS, a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. COMPASS is specially targeted for networks that exceed the capacity of PIM crossbar arrays, necessitating access to external memories. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip. Our scheme takes into account the data dependence between layers, core utilization, and the number of write instructions to minimize latency, memory accesses, and improve energy efficiency. Simulation results demonstrate that COMPASS can accommodate much more networks using a minimal memory footprint, while improving throughput by 1.78X and providing 1.28X savings in energy-delay product (EDP) over baseline partitioning methods./proceedings-archive/2024/DATA/125_pdf_upload.pdf |
||
OPS: OUTLIER-AWARE PRECISION-SLICE FRAMEWORK FOR LLM ACCELERATION Speaker: Fangxin Liu, Shanghai Jiao Tong University, CN Authors: Fangxin Liu1, Ning Yang1, Zongwu Wang1, Xuanpeng Zhu2, Haidong Yao2, Xiankui Xiong2, Qi Sun3 and Li Jiang1 1Shanghai Jiao Tong University, CN; 2ZTE Corporation, CN; 3Zhejiang University, CN Abstract Large language models (LLMs) have transformed numerous AI applications, with on-device deployment becoming increasingly important for reducing cloud computing costs and protecting user privacy. However, the astronomical model size and limited hardware resources pose significant deployment challenges. Model quantization is a promising approach to mitigate this gap, but the presence of outliers in LLMs reduces its effectiveness. Previous efforts addressed this issue by employing compression-based encoding for mixed-precision quantization. These approaches struggle to balance model accuracy with hardware efficiency due to their value-wise outlier granularity and complex encoding/decoding hardware logic. To address this, we propose OPS (Outlier-aware Precision-Slicing), an acceleration framework that exploits massive sparsity in the higher-order part of LLMs by splitting 16-bit values into a 4-bit/12-bit format. Crucially, OPS introduces an early bird mechanism that leverages the high-order 4-bit computation to predict the importance of the full calculation result. This mechanism enables efficient computational skips by continuing execution only for important computations and using preset values for less significant ones. This scheme can be efficiently integrated with existing hardware accelerators like systolic arrays without complex encoding/decoding. As a result, OPS outperforms state-of-the-art outlier-aware accelerators, achieving a 1.3-4.3× performance boost and 14.3-66.7% greater energy efficiency, with minimal model accuracy loss. This approach enables more efficient on-device LLM deployment, effectively balancing computational efficiency and model accuracy./proceedings-archive/2024/DATA/137_pdf_upload.pdf |
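A numeric sketch of the 4-bit/12-bit slicing and the early-bird skip described above: split each signed 16-bit weight exactly into a high 4-bit slice and a low 12-bit slice, rank outputs by the cheap high-slice partial result, and complete the low-slice computation only for the predicted-important outputs. The ranking rule and keep ratio are illustrative; the systolic-array integration is not modeled.

```python
import numpy as np

def slice_16bit(w):
    """Split int16 values into a signed high 4-bit slice and an unsigned low 12-bit
    slice, so that w == hi * 4096 + lo exactly."""
    w = w.astype(np.int64)
    hi = w >> 12                      # arithmetic shift keeps the sign in the high slice
    lo = w & 0xFFF
    return hi, lo

def ops_dot(x, w, keep_ratio=0.25):
    """Early-bird sketch: finish the 12-bit part only for outputs whose high-slice
    partial result looks important; reuse the coarse value elsewhere."""
    x = x.astype(np.int64)
    hi, lo = slice_16bit(w)
    coarse = (hi @ x) * 4096                       # 4-bit partial products, rescaled
    k = max(1, int(len(coarse) * keep_ratio))
    important = np.argsort(np.abs(coarse))[-k:]    # predicted-important outputs
    out = coarse.astype(np.float64)
    out[important] = coarse[important] + lo[important] @ x   # exact where it matters
    return out, important

rng = np.random.default_rng(6)
W = rng.integers(-32768, 32767, size=(16, 64), dtype=np.int16)
x = rng.integers(-8, 8, size=64)
approx, kept = ops_dot(x, W)
exact = W.astype(np.int64) @ x
print("max error on kept outputs:", np.max(np.abs(approx[kept] - exact[kept])))  # 0
```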
||
OPENC2: AN OPEN-SOURCE END-TO-END HARDWARE COMPILER DEVELOPMENT FRAMEWORK FOR DIGITAL COMPUTE-IN-MEMORY MACRO Speaker: Tianchu Dong, Hong Kong University of Science and Technology, HK Authors: Tianchu Dong, Shaoxuan Li, Yihang Zuo, Hongwu Jiang, Yuzhe Ma and Shanshi Huang, Hong Kong University of Science and Technology, HK Abstract Digital Compute-in-Memory (DCIM), which inserts logic circuits into SRAM arrays, presents a significant advancement in CIM architecture. DCIM has shown great potential in applications, and the diversity of applications requires rapid hardware iteration. However, the hardware design flow from user specifications to layout is extremely tedious and time-consuming for manual design. Commercial EDA tools are limited by restrictive licenses and the inability to specifically optimize the datapath, which calls for an open-source end-to-end hardware compiler for DCIM. This paper proposes OpenC2, the first open-source end-to-end development framework for DCIM macro compilation. OpenC2 provides a template-based generation platform for DCIM macros across various technologies, sizes, and configurations. It can automatically generate a datapath-optimized, compact DCIM macro layout based on a hierarchical physical design methodology. Our experiment results show that OpenC2's compact design on FreePDK45 delivers over 30% area reduction and over 40% improvement in area efficiency compared to AutoDCIM on TSMC40./proceedings-archive/2024/DATA/669_pdf_upload.pdf |
||
SPEEDING-UP SUCCESSIVE READ OPERATIONS OF STT-MRAM VIA READ PATH ALTERNATION FOR DELAY SYMMETRY Speaker: Taehwan Kim, Korea University, KR Authors: Taehwan Kim and Jongsun Park, Korea University, KR Abstract Recent research on data-intensive computing systems has demonstrated that system throughput and latency are critically dependent on memory read bandwidth, highlighting the need for fast memory read operations. Although spin-transfer torque magnetic random-access memory (STT-MRAM) has emerged as a promising alternative to CMOS-based embedded memories, STT-MRAM continues to face challenges related to read speed and energy efficiency. This paper introduces a novel read scheme that enhances read speed and energy in successive read operations by alternating read paths between data and reference cells. This approach effectively mitigates worst-case read scenarios by balancing the read voltage swings. HSPICE simulations using 28nm CMOS technology show a 31.5% improvement in read speed and 48.8% reduction in energy consumption compared to the previous approach. SCALE-Sim system simulations also demonstrate that applying the proposed read scheme to STT-MRAM embedded memories in AI accelerators shows a significant reduction in memory energy for CNN inference tasks compared to the SRAM embedded memory./proceedings-archive/2024/DATA/1535_pdf_upload.pdf |
TS09 Session 8 - D9
Date: Tuesday, 01 April 2025
Time: 11:00 CET - 12:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
CO-UP: COMPREHENSIVE CORE AND UNCORE POWER MANAGEMENT FOR LATENCY-CRITICAL WORKLOADS Speaker: Ki-Dong Kang, Electronics and Telecommunications Research Institute, KR Authors: Ki-Dong Kang1, Gyeongseo Park1 and Daehoon Kim2 1Electronics and Telecommunications Research Institute, KR; 2Yonsei University, KR Abstract Improving energy efficiency to reduce costs in server environments has attracted considerable attention. Considering that processors account for a significant portion of energy consumption in servers, Dynamic Voltage and Frequency Scaling (DVFS) enhances their energy efficiency by adjusting the operational speed and power consumption of processors. Additionally, modern high-end processors extend DVFS functionality not only to core components but also to uncore parts. This is because the increasing complexity and integration of System on Chips (SoCs) have made the energy consumption of the uncore substantial. However, existing uncore voltage/frequency scaling fails to effectively consider Latency-Critical (LC) applications, leading to sub-optimal energy efficiency or degraded performance. In this paper, we introduce Co-UP, a power management scheme that simultaneously scales core and uncore frequencies for latency-critical applications, designed to improve energy efficiency without violating Service Level Objectives (SLOs). To this end, Co-UP incorporates a prediction model that estimates outcomes of energy consumption and performance as uncore and core frequencies change. Based on the estimated gains, Co-UP adjusts uncore and/or core frequencies to further enhance energy efficiency or performance. This predictive model can rapidly adapt to new and unlearned loads, enabling Co-UP to operate online without any prior profiling. Our experiments show that Co-UP can reduce energy consumption by up to 28.2% compared to Intel's existing policy and up to 17.6% compared to state-of-the-art power management studies, without SLO violations./proceedings-archive/2024/DATA/240_pdf_upload.pdf |
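Conceptually, the runtime decision described above can be pictured as sweeping candidate (core, uncore) frequency pairs through a latency/energy predictor and keeping the most energy-efficient pair that still meets the SLO. The predictor below is a hand-made stand-in and the frequency grids are arbitrary; Co-UP learns its model online rather than using closed-form formulas.

```python
import itertools

def pick_frequencies(predict, core_freqs, uncore_freqs, slo_latency_ms):
    """Return the (core, uncore) pair with the lowest predicted energy that meets the SLO."""
    best, best_energy = None, float("inf")
    for fc, fu in itertools.product(core_freqs, uncore_freqs):
        latency, energy = predict(fc, fu)
        if latency <= slo_latency_ms and energy < best_energy:
            best, best_energy = (fc, fu), energy
    return best

def toy_predictor(fc, fu):
    """Illustrative model: latency falls with both frequencies, power rises superlinearly."""
    latency = 40.0 / fc + 25.0 / fu                 # ms, made-up core/uncore service split
    power = 2.0 * fc ** 2 + 1.2 * fu ** 2           # W, made-up V/f scaling
    return latency, power * latency / 1000.0        # energy per request (J)

core_freqs = [1.2, 1.8, 2.4, 3.0]                   # GHz
uncore_freqs = [1.0, 1.6, 2.2]                      # GHz
print("chosen (core, uncore):",
      pick_frequencies(toy_predictor, core_freqs, uncore_freqs, slo_latency_ms=45.0))
```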
||
FLEXIBLE THERMAL CONDUCTANCE MODEL (TCM) FOR EFFICIENT THERMAL SIMULATION OF 3-D ICS AND PACKAGES Speaker: Shunxiang Lan, Shanghai Jiao Tong University, CN Authors: Shunxiang Lan, Min Tang and Jun Ma, Shanghai Jiao Tong University, CN Abstract Thermal management plays an increasingly important role in the design of 3-D integrated circuits (ICs) and packages. To deal with the related thermal issues, efficient and accurate evaluation of the thermal performance is obviously essential. In this paper, an efficient approach with the flexible thermal conductance model (TCM) is presented for thermal simulation of 3-D ICs and packages. Firstly, the entire structure is partitioned and classified into two kinds of regions, named region of interest (ROI) and region of fixity (ROF). The ROI usually contains the key components in thermal designs while the ROF holds invariant thermal characteristics. Then, in order to represent the thermal impact of ROF on ROI, a novel technique based on the TCM is developed, which can be treated as the equivalent boundary condition of the ROI. By this means, the solution domain of the whole system is constrained to the ROI, which results in a significant reduction of computational costs. Furthermore, in the representation of ROF, a flexible TCM with elegant rational expressions on the heat convection coefficient is proposed to deal with varying boundary conditions, which greatly expands the applicability of this method. The validity and efficiency of the proposed method are illustrated by numerical examples, where a 138x speedup is achieved compared with commercial software./proceedings-archive/2024/DATA/256_pdf_upload.pdf |
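The idea of replacing the region of fixity with an equivalent boundary condition can be illustrated on a static conductance network: partition the conductance matrix into ROI and ROF blocks and eliminate the ROF nodes with a Schur complement, which yields a reduced system that reproduces the ROI temperatures exactly. The tiny 1-D chain and the steady-state setting are illustrative simplifications of the paper's flexible TCM.

```python
import numpy as np

def equivalent_boundary_conductance(G, P, roi_idx):
    """Return (G_eq, P_eq) such that G_eq @ T_I = P_eq gives the same ROI
    temperatures as solving the full system G @ T = P."""
    n = G.shape[0]
    fix_idx = np.setdiff1d(np.arange(n), roi_idx)
    G_II = G[np.ix_(roi_idx, roi_idx)]
    G_IF = G[np.ix_(roi_idx, fix_idx)]
    G_FI = G[np.ix_(fix_idx, roi_idx)]
    G_FF = G[np.ix_(fix_idx, fix_idx)]
    G_FF_inv = np.linalg.inv(G_FF)
    G_eq = G_II - G_IF @ G_FF_inv @ G_FI              # Schur complement of the ROF block
    P_eq = P[roi_idx] - G_IF @ G_FF_inv @ P[fix_idx]
    return G_eq, P_eq

# Tiny 1-D thermal chain of 6 nodes (conductance g between neighbours, g0 to ambient).
g, g0, n = 1.0, 0.2, 6
G = np.zeros((n, n))
for i in range(n - 1):
    G[i, i] += g; G[i + 1, i + 1] += g
    G[i, i + 1] -= g; G[i + 1, i] -= g
G += g0 * np.eye(n)                                   # every node leaks to ambient
P = np.zeros(n); P[1] = 2.0                           # heat source inside the ROI
roi = np.array([0, 1, 2])                             # first three nodes form the ROI

T_full = np.linalg.solve(G, P)
G_eq, P_eq = equivalent_boundary_conductance(G, P, roi)
T_roi = np.linalg.solve(G_eq, P_eq)
print("ROI temperatures match the full solve:", np.allclose(T_roi, T_full[roi]))
```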
||
THANOS: ENERGY-EFFICIENT KEYWORD SPOTTING PROCESSOR WITH HYBRID TIME-FEATURE-FREQUENCY-DOMAIN ZERO-SKIPPING Speaker: Sangyeon Kim, Sogang University, KR Authors: Sangyeon Kim, Hyunmin Kim and Sungju Ryu, Sogang University, KR Abstract In recent years, the keyword spotting algorithm has gained significant attention for applications such as personalized virtual assistants. However, the keyword spotting system must always be turned on to listen to the input voice for recognition, which worsens the battery constraint problem in edge devices. In this paper, we first analyze the sparsities in the keyword spotting computation. Based on this characteristic, we introduce the keyword spotting processor called Thanos to enable zero-skipping in multiple keyword spotting domains to mitigate the burdensome energy consumption. Experimental results show that our hybrid-domain zero-skipping scheme reduces the latency and the energy consumption by 80.3-87.4% and 48.1-79.8%, respectively, over the baseline architecture./proceedings-archive/2024/DATA/417_pdf_upload.pdf |
||
ALGORITHM-HARDWARE CO-DESIGN OF A UNIFIED ACCELERATOR FOR NON-LINEAR FUNCTIONS IN TRANSFORMERS Speaker: Haonan Du, Zhejiang University, CN Authors: Haonan Du1, Chenyi Wen1, Zhengrui Chen1, Li Zhang2, Qi Sun1, Zheyu Yan1 and Cheng Zhuo1 1Zhejiang University, CN; 2Hubei University of Technology, CN Abstract Non-linear functions (NFs) in Transformers require high-precision computation consuming significant time and energy, despite the aggressive quantization schemes for other components. Piece-wise Linear (PWL) approximation-based methods offer more efficient processing schemes for NFs but fall short in dealing with functions with high non-linearities. Moreover, PWL-based methods still suffer from inevitably high latency introduced by the Multiply-And-Add (MADD) unit. To address these issues, this paper proposes a novel quadratic approximation scheme and a highly integrated, multiplier-less hardware structure, as a unified method to accelerate any unary non-linear function. We also demonstrate implementation examples for GELU, Softmax, and LayerNorm. The experimental results show that the proposed method achieves up to 5.41% higher inference accuracy and 60.12% lower area-delay product./proceedings-archive/2024/DATA/617_pdf_upload.pdf |
||
EFFICIENT HOLD BUFFER OPTIMIZATION BY SUPPLY NOISE-AWARE DYNAMIC TIMING ANALYSIS Speaker: Lishuo Deng, Southeast University, CN Authors: Lishuo Deng, Changwei Yan, Cai Li, Zhuo Chen and Weiwei Shan, Southeast University, CN Abstract As the CMOS process scales down, digital circuits become more susceptible to hold time violations due to increased sensitivity to supply voltage fluctuations. Since hold time violations are fatal, sufficient hold fixing buffers need to be inserted into the short paths to prevent them. However, by assuming a constant power supply level, traditional hold fixing causes imprecise and overly conservative timing analysis and hence leads to circuit overhead and degraded performance. To address this, we propose a power supply noise (PSN)-aware dynamic timing analysis for realistic hold time analysis and efficient hold buffer optimization, which integrates a machine learning-based timing model into the conventional design flow. Building on the highly effective application of the Weibull cumulative distribution function and machine learning for dynamic PSN-aware timing analysis, we propose introducing an additional parameter for PSN amplitude, which has a significant impact on delay, and narrowing the overall parameter range using real PSN waveforms extracted from RedHawk. This approach achieves a prediction error of only 3.45% for cell delay and 5.1% for path delay, while also reducing dataset acquisition costs. To the best of our knowledge, this work is the first to apply PSN-aware dynamic timing analysis specifically for hold optimization, mitigating the pessimism of traditional static timing analysis (STA) and effectively minimizing redundant hold fixing buffers while remaining compatible with existing design workflows. Since short paths often overlap with critical paths, reducing redundant hold buffers not only decreases area overhead but also enhances performance. Applied to a 22 nm, 64-point Fast Fourier Transform (FFT) circuit, our EDA-compatible method combined with a greedy algorithm reduces hold buffers by 55%, achieving not only a 6.79% circuit area reduction but also an 8.1% performance improvement due to the elimination of redundant buffers in short and critical paths./proceedings-archive/2024/DATA/695_pdf_upload.pdf |
||
LARED: EFFICIENT IR DROP PREDICTOR WITH LAYOUT-PRESERVING REBUILDER-ENCODER-DECODER ARCHITECTURE Speaker: Zhou Jin, SSSLab, Dept. of CST, China University of Petroleum-Beijing, China, CN Authors: ChengXuan Yu1, YanShuang Teng1, WenHao Dai1, YongJiang Li1, Wei Xing2, Xiao Wu3, Dan Niu4 and Zhou Jin5 1Super Scientific Software Laboratory, University of Petroleum-Beijing, CN; 2University of Sheffield, GB; 3Huada Empyrean Software Co.Ltd, CN; 4Southeast University, CN; 5Super Scientific Software Laboratory, Dept. of CST, China University of Petroleum-Beijing, CN Abstract In the realm of integrated circuit verification, IR drop analysis plays a crucial role. Recent advancements in machine learning (ML) significantly enhance its efficiency, yet many current approaches fail to fully leverage the input structure of feature maps and the transmission mechanism of Power Delivery Network (PDN) layouts. To bridge these gaps, we introduce the Layout-Preserving Rebuilder-Encoder-Decoder Architecture Predictor (LaRED), which employs a novel Rebuilder-Encoder-Decoder (RED) architecture and utilizes an innovative downsampling approach and upsampling framework to optimize its perception of instances and the transmission of features. LaRED captures information from various regions with asymmetric topological structure while preserving and transferring layout characteristics through deformable convolution, hybrid downsampling, cascaded upsampling, and attentional feature fusion. The rebuilder rebuilds raw input, whereas the encoder ensures comprehensive feature transmission across all instances. The decoder then facilitates seamless transfer of feature information across layers. This approach enables LaRED to integrate chip features of varying topologies and scales, enhancing its representational power. Compared to the current State-Of-The-Art (SOTA), MAUnet, LaRED achieves accuracy improvements of 34.6% to 42.6% in benchmark tests, establishing it as the new standard in static IR drop analysis for integrated circuit design with ML techniques. The code is available at https://github.com/Todi85/LaRED./proceedings-archive/2024/DATA/751_pdf_upload.pdf |
||
COOL3D: COST-OPTIMIZED AND EFFICIENT LIQUID COOLING FOR 3D INTEGRATED CIRCUITS Speaker: Jing Li, Beihang University, CN Authors: Jing Li1, Bingrui Zhang1, Yuquan Sun1, Wei Xing2 and Yuanqing Cheng1 1Beihang University, CN; 2University of Sheffield, GB Abstract CMOS scaling faces challenges due to lithography and device physics issues, leading to increased costs and difficulties in expanding chip footprint. 3D integration technology offers increased integration density without increasing footprint, but elevated power density makes heat dissipation a significant challenge. Microchannel cooling effectively removes heat inside 3D chips. Traditional microchannel optimizations typically focus only on minimizing pump power within a limited parameter design space, leading to suboptimal cooling efficiency. Moreover, existing research rarely considers manufacturing costs, limiting practical application. To address these issues, we propose a high-dimensional non-uniform microchannel design scheme based on Segmented Sampling Bayesian Optimization (SSBO). This multi-parameter collaborative optimization framework comprehensively optimizes microchannel design. Our method reduces pump power by 70% compared to limited parameter design spaces. Additionally, we introduce a cost model for microchannel design, formulating a multi-objective optimization problem that considers both manufacturing cost and pump power consumption. By solving the multi-objective optimization problem by searching for the Pareto front, we demonstrate a balanced design between microchannel manufacturing cost and pump power and provide guidelines for key design parameters./proceedings-archive/2024/DATA/808_pdf_upload.pdf |
||
JOINT DNN PARTITION AND THREAD ALLOCATION OPTIMIZATION FOR ENERGY-HARVESTING MEC SYSTEMS Speaker: Yizhou Shi, Nanjing University of Science and Technology, CN Authors: Yizhou Shi, Liying Li, Yue Zeng, Peijin Cong and Junlong Zhou, Nanjing University of Science and Technology, CN Abstract Deep neural networks (DNNs) have demonstrated exceptional performance, leading to diverse applications across various mobile devices (MDs). Considering factors like portability and environmental sustainability, an increasing number of MDs are adopting energy harvesting (EH) techniques for power supply. However, the computational intensity of DNNs presents significant challenges for their deployment on these resource-constrained devices. Existing approaches often employ DNN partition or offloading to mitigate the time and energy consumption associated with running DNNs on MDs. Nonetheless, existing methods frequently fall short in accurately modeling the execution time of DNNs, and do not consider using thread allocation for further latency and energy optimization. To solve these problems, we propose a dynamic DNN partition and thread allocation method to optimize the latency and energy consumption of running DNNs on EH-enabled MDs. Specifically, we first investigate the relationship between DNN inference latency and allocated threads and establish an accurate DNN latency prediction model. Based on the prediction model, a DRL-based DNN partition (DDP) algorithm is designed to find the optimal partitions for DNNs. A thread allocation (TA) algorithm is proposed to reduce the inference latency. Experimental results from our test-bed platform demonstrate that compared to four benchmarking methods, our scheme can reduce DNN inference latency and energy consumption by up to 37.3% and 38.5%, respectively./proceedings-archive/2024/DATA/837_pdf_upload.pdf |
||
FAST DYNAMIC IR-DROP PREDICTION WITH DUAL-PATH SPATIAL-TEMPORAL ATTENTION Speaker: Bangqi Fu, The Chinese University of Hong Kong, HK Authors: Bangqi Fu1, Lixin Liu2, Qijing Wang1, Yutao Wang1, Martin Wong1 and Evangeline Young1 1The Chinese University of Hong Kong, HK; 2The Chinese University of Hong Kong, CN Abstract The analysis of IR-drop stands as a fundamental step in optimizing the power distribution network (PDN), and subsequently influences the design performance. However, traditional IR-drop analysis using commercial tools proves to be exceedingly time-consuming. Fast and accurate IR-drop analysis is urgently needed to achieve high performance on timing and power. Recently, machine learning approaches have garnered attention owing to their remarkable speed and extensibility in IC designs. However, prior works on dynamic IR-drop prediction showed limited performance since they did not exploit the time-varying activities. In this paper, we propose a dual-path model with spatial-temporal transformers to extract the static spatial features and dynamic time-variant activities for dynamic IR drop prediction. Experimental results on the large-scale advanced dataset CircuitNet show that our model significantly outperforms the state-of-the-art works./proceedings-archive/2024/DATA/961_pdf_upload.pdf |
||
A NOVEL FREQUENCY-SPATIAL DOMAIN AWARE NETWORK FOR FAST THERMAL PREDICTION IN 2.5D ICS Presenter: Dekang Zhang, Southeast University, CN Authors: Dekang Zhang1, Dan Niu1, Zhou Jin2, Yichao Dong1, Jingweijia Tan3 and Changyin Sun4 1Southeast University, CN; 2Super Scientific Software Laboratory, Dept. of CST, China University of Petroleum-Beijing, CN; 3Jilin University, CN; 4Anhui University, CN Abstract In the post-Moore era, 2.5D chiplet-based ICs present significant challenges in thermal management due to increased power density and thermal hotspots. Neural network-based thermal prediction models can perform real-time predictions for many unseen new designs. However, existing CNN-based and GCN-based methods cannot effectively capture the global thermal features, especially for high-frequency components, hindering prediction accuracy enhancement. In this paper, we propose a novel frequency-spatial dual domain aware prediction network (FSA-Heat) for fast and high-accuracy thermal prediction in 2.5D ICs. It integrates a high-to-low frequency and spatial domain encoder (FSTE) module with a frequency-domain cross-scale interaction module (FCIFormer) to achieve high-to-low frequency and global-to-local thermal dissipation feature extraction. Additionally, a frequency-spatial hybrid loss (FSL) is designed to effectively attenuate high-frequency thermal gradient noises and spatial misalignments. The experimental results show that the performance enhancements offered by our proposed method are substantial, outperforming the newly-proposed 2.5D method, GCN+PNA, by considerable margins (over 99% RMSE reduction, 4.23X inference speedup). Moreover, extensive experiments demonstrate that FSA-Heat also exhibits robust generalization capabilities. |
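The frequency-spatial hybrid loss described in the FSA-Heat abstract above can be pictured with a small sketch. The NumPy snippet below is illustrative only and is not the authors' implementation: it blends a spatial-domain MSE with an FFT-magnitude MSE over 2-D thermal maps, and the blend weight `alpha` and the plain-MSE form of both terms are assumptions made for this example.

```python
# Illustrative sketch (not the FSA-Heat code): a frequency-spatial hybrid loss
# that combines a spatial-domain error with a frequency-domain error obtained
# via the 2-D FFT. The weight `alpha` and the MSE formulation are assumptions.
import numpy as np

def frequency_spatial_loss(pred, target, alpha=0.5):
    """Blend spatial MSE with an FFT-magnitude MSE over 2-D thermal maps."""
    spatial_err = np.mean((pred - target) ** 2)        # spatial-domain term
    pred_f = np.abs(np.fft.fft2(pred))                 # frequency magnitudes
    target_f = np.abs(np.fft.fft2(target))
    freq_err = np.mean((pred_f - target_f) ** 2)       # frequency-domain term
    return alpha * spatial_err + (1.0 - alpha) * freq_err

# Toy usage: two 64x64 thermal maps that differ by a small amount of noise.
rng = np.random.default_rng(0)
target = rng.random((64, 64))
pred = target + 0.01 * rng.standard_normal((64, 64))
print(frequency_spatial_loss(pred, target))
```

The frequency term is what lets such a loss penalize high-frequency thermal-gradient noise that a purely spatial MSE tends to average away.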
TS10 Session 3 - A6
Add this session to my calendar
Date: Tuesday, 01 April 2025
Time: 11:00 CET - 12:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
TAIL: EXPLOITING TEMPORAL ASYNCHRONOUS EXECUTION FOR EFFICIENT SPIKING NEURAL NETWORKS WITH INTER-LAYER PARALLELISM Speaker: Haomin Li, Shanghai Jiao Tong University, CN Authors: Haomin Li1, Fangxin Liu1, Zongwu Wang1, Dongxu Lyu1, Shiyuan Huang1, Ning Yang1, Qi Sun2, Zhuoran Song1 and Li Jiang1 1Shanghai Jiao Tong University, CN; 2Zhejiang University, CN Abstract Spiking neural networks (SNNs) are an alternative computational paradigm to artificial neural networks (ANNs) that have attracted attention due to their event-driven execution mechanisms, enabling extremely low energy consumption. However, the existing SNN execution model, based on software simulation or synchronized hardware circuitry, is incompatible with the event-driven nature, thus resulting in poor performance and energy efficiency. The challenge arises from the fact that neuron computations across multiple time steps result in increased latency and energy consumption. To overcome this bottleneck and leverage the full potential of SNNs, we propose TAIL, a pioneering temporal asynchronous execution mechanism for SNNs driven by a comprehensive analysis of SNN computations. Additionally, we propose an efficient dataflow design to support SNN inference, enabling concurrent computation of various time steps across multiple layers for optimal Processing Element (PE) utilization. Our evaluations show that TAIL greatly improves the performance of SNN inference, achieving a 6.94× speedup and a 6.97× increase in energy efficiency on current SNN computing platforms./proceedings-archive/2024/DATA/7_pdf_upload.pdf |
||
EXPLOITING BOOSTING IN HYPERDIMENSIONAL COMPUTING FOR ENHANCED RELIABILITY IN HEALTHCARE Speaker: Sungheon Jeong, University of California, Irvine, US Authors: SungHeon Jeong1, Hamza Errahmouni Barkam1, Sanggeon Yun1, Yeseong Kim2, Shaahin Angizi3 and Mohsen Imani1 1University of California, Irvine, US; 2DGIST, KR; 3New Jersey Institute of Technology, US Abstract Hyperdimensional computing (HDC) enables efficient data encoding and processing in high-dimensional spaces, benefiting machine learning and data analysis. However, underutilization of these spaces can lead to overfitting and reduced model reliability, especially in data-limited systems—a critical issue in sectors like healthcare that demand robustness and consistent performance. We introduce BoostHD, an approach that applies boosting algorithms to partition the hyperdimensional space into subspaces, creating an ensemble of weak learners. By integrating boosting with HDC, BoostHD enhances performance and reliability beyond existing HDC methods. Our analysis highlights the importance of efficient utilization of hyperdimensional spaces for improved model performance. Experiments on healthcare datasets show that BoostHD outperforms state-of-the-art methods. On the WESAD dataset, it achieved an accuracy of 98.37 ± 0.32%, surpassing Random Forest, XGBoost, and OnlineHD. BoostHD also demonstrated superior inference efficiency and stability, maintaining high accuracy under data imbalance and noise. In person-specific evaluations, it achieved an average accuracy of 96.19%, outperforming other models. By addressing the limitations of both boosting and HDC, BoostHD expands the applicability of HDC in critical domains where reliability and precision are paramount./proceedings-archive/2024/DATA/130_pdf_upload.pdf |
||
LCACHE: LOG-STRUCTURED SSD CACHING FOR TRAINING DEEP LEARNING MODELS Speaker: Shucheng Wang, China Mobile (Suzhou) Software Technology, CN Authors: Shucheng Wang1, Zhiguo Xu1, Zhandong Guo1, Jian Sheng2, Kaiye Zhou1 and Qiang Cao3 1China Mobile (Suzhou) Software Technology Co., Ltd., CN; 2Suzhou City University, CN; 3Huazhong University of Science & Technology, CN Abstract Training deep learning models is computation- and data-intensive. Existing approaches utilize local SSDs within training servers to cache datasets, thereby accelerating data loading during model training. However, we experimentally observe that data loading remains a performance bottleneck when randomly retrieving small-sized sample files on SSDs. In this paper, we introduce LCache, a log-structured dataset caching mechanism designed to fully leverage the I/O capabilities of SSDs and reduce I/O-induced training stalls. LCache determines the randomized dataset access order by extracting the pseudo-random seed from the training frameworks. It then aggregates small-sized sample files into larger chunks and stores them in a log file on SSDs, thus enabling sequential I/O requests on data retrieval and improving data loading throughput. Furthermore, LCache proposes a real-time log reordering mechanism that strategically schedules cached data to organize logs across different epochs, which enhances cache utilization and minimizes data retrieval from low-performance remote storage systems. Additionally, LCache incorporates a MetaIndex to enable rapid log traversal and querying. We evaluate LCache with various real-world DL models and datasets. LCache outperforms the native PyTorch Dataloader and NoPFS by up to 9.4x and 7.8x in throughput, respectively./proceedings-archive/2024/DATA/184_pdf_upload.pdf |
||
OLORAS: ONLINE LONG RANGE ACTION SEGMENTATION FOR EDGE DEVICES Speaker: Filippo Ziche, Università di Verona, IT Authors: Filippo Ziche and Nicola Bombieri, Università di Verona, IT Abstract Temporal action segmentation (TAS) is essential for identifying when actions are performed by a subject, with applications ranging from healthcare to Industry 5.0. In such contexts, the need for real-time, low-latency responses and privacy-aware data handling often requires the use of edge devices, despite their limited memory, power, and computational resources. This paper presents OLORAS, a novel TAS model designed for real-time performance on edge devices. By leveraging human pose data instead of video frames and employing linear recurrent units (LRUs), OLORAS efficiently processes long sequences while minimizing memory usage. Tested on the standard Assembly101 dataset, the model outperforms state-of-the-art TAS methods in accuracy with 10x memory footprint reduction, making it well-suited for deployment on resource-constrained devices./proceedings-archive/2024/DATA/475_pdf_upload.pdf |
||
ONLINE LEARNING FOR DYNAMIC STRUCTURAL CHARACTERIZATION IN ELECTRON ENERGY LOSS SPECTROSCOPY Speaker: Lakshmi Varshika Mirtinti, Drexel University, US Authors: Lakshmi Varshika M1, Jonathan Hollenbach2, Nicolas Agostini3, Ankur Limaye3, Antonino Tumeo4 and Anup Das1 1Drexel University, US; 2Johns Hopkins University, US; 3Pacific Northwest National Lab, US; 4Pacific Northwest National Laboratory, US Abstract In-situ Electron Energy Loss Spectroscopy (EELS) is a crucial technique for determining the elemental composition of materials through EELS Spectrum Images (EELS-SI). While recent innovations have made it possible for EELS-SI data acquisition at rates of 400 frames per second with near-zero read noise, the challenge lies in processing this massive stream of real-time data to capture nanoscale dynamic changes. This task demands advanced machine learning methods capable of identifying subtle and complex features in EELS spectra. Furthermore, the EELS data acquired in difficult experimental conditions often suffer from a low signal-to-noise ratio (SNR), leading to unreliable classification and limiting their utility. In response to this critical need, we introduce a spiking neural network (SNN)-based Variational Autoencoder (VAE) that embeds spectral data into a latent space, facilitating precise prediction of structural changes. VAEs are designed to learn efficient low-dimensional representations while capturing the inherent variability in the data, making them highly effective for processing multi-dimensional data. Additionally, SNNs, which use biologically inspired spiking neurons, offer unmatched scalability and energy efficiency by processing information through binary spikes, making them ideal for high-throughput data. We validate our framework using MXene annealing data, achieving denoised spectrum images with an SNR of 28.3dB. For the first time, we present a fully online learning solution for dynamic structural tracking, implemented directly in hardware, eliminating the traditional bottleneck of offline training. Our method achieves reliable, real-time, on-device characterization of high-speed EELS data when evaluated on an FPGA platform. Joint experiments with the SNN-VAE model on both spiking autoencoder hardware and a software-trained hybrid configuration of hardware spiking encoders demonstrated latency reductions of 25.2× and 93.7×, and energy savings of 1.04× and 4.5×, respectively, compared to the baseline./proceedings-archive/2024/DATA/547_pdf_upload.pdf |
||
SCALES: BOOST BINARY NEURAL NETWORK FOR IMAGE SUPER-RESOLUTION WITH EFFICIENT SCALINGS Speaker: Renjie Wei, Peking University, CN Authors: Renjie Wei1, Zechun Liu2, Yuchen Fan3, Runsheng Wang1, Ru Huang1 and Meng Li1 1Peking University, CN; 2Meta Inc, US; 3Meta, US Abstract Deep neural networks for image super-resolution (SR) have demonstrated superior performance. However, the large memory and computation consumption hinders their deployment on resource-constrained devices. Binary neural networks (BNNs), which quantize the floating point weights and activations to 1-bit, can significantly reduce the cost. Although BNNs for image classification have made great progress these days, existing BNNs for SR still suffer from a large performance gap compared with the full-precision (FP) SR networks. To this end, we observe the activation distribution in SR networks and find much larger pixel-to-pixel, channel-to-channel, layer-to-layer, and image-to-image variation in the activation distribution than in image classification networks. However, existing BNNs for SR fail to capture these variations that contain rich information for image reconstruction, leading to inferior performance. To address this problem, we propose SCALES, a binarization method for SR networks that consists of the layer-wise scaling factor, the spatial re-scaling method, and the channel-wise re-scaling method, capturing the layer-wise, pixel-wise, and channel-wise variations efficiently in an input-dependent manner. We evaluate our method across different network architectures and datasets. For CNN-based SR networks, our binarization method SCALES outperforms the prior-art method by 0.2dB with fewer parameters and operations. With SCALES, we achieve the first accurate binary Transformer-based SR network, improving PSNR by more than 1dB compared to the baseline method./proceedings-archive/2024/DATA/1196_pdf_upload.pdf |
||
POROS: ONE-LEVEL ARCHITECTURE-MAPPING CO-EXPLORATION FOR TENSOR ALGORITHMS Speaker: Fuyu Wang, National Sun Yat-Sen University, TW Authors: Fuyu Wang and Minghua Shen, National Sun Yat-Sen University, TW Abstract Tensor algorithms increasingly rely on specialized accelerators to meet growing performance and efficiency demands. Given the rapid evolution of these algorithms and the high cost of designing accelerators, automated solutions for jointly optimizing both architectures and mappings have gained attention. However, the joint design space is non-convex and non-smooth, hindering the finding of optimal or near-optimal designs. Moreover, prior work conducts two-level exploration, resulting in a combinatorial explosion. In this paper, we propose Poros, a one-level architecture-mapping co-exploration framework. Poros directly explores a batch of architecture-mapping configurations and evaluates their performance. It then exploits reinforcement learning to perform gradient-based search in the non-smooth joint design space. By sampling from the policy, Poros keeps exploring new actions to address non-convexity. Experimental results demonstrate that Poros achieves up to 5.32× and 2.15× better EDP compared with hand-designed accelerators and state-of-the-art automatic approaches, respectively. Through its one-level exploration scheme, Poros also converges at least 20% faster than other approaches./proceedings-archive/2024/DATA/772_pdf_upload.pdf |
||
A CNN COMPRESSION METHODOLOGY FOR LAYER-WISE RANK SELECTION CONSIDERING INTER-LAYER INTERACTIONS Speaker: Milad Kokhazadeh, School of Informatics, Aristotle University of Thessaloniki, GR Authors: Milad Kokhazadeh1, Georgios Keramidas2, Vasilios Kelefouras3 and Iakovos Stamoulis4 1Aristotle University of Thessaloniki, GR; 2Aristotle University of Thessaloniki / Think Silicon S.A., GR; 3University of Plymouth, GB; 4Think Silicon, S.A. An Applied Materials Company, GR Abstract Convolutional Neural Networks (CNNs) achieve state-of-the-art performance across various application domains but are often resource-intensive, limiting their use on resource-constrained devices. Low-rank factorization (LRF) has emerged as a promising technique to reduce the computational complexity and memory footprint of CNNs, enabling efficient deployment without significant performance loss. However, challenges still remain in optimizing the rank selection problem, balancing memory reduction and accuracy, and integrating LRF into the training process of CNNs. In this paper, a novel and generic methodology for layer-wise rank selection is presented, considering inter-layer interactions. Our approach is compatible with any decomposition method and does not require additional retraining. The proposed methodology is evaluated on thirteen widely used CNN models, significantly reducing model parameters and Floating-Point Operations (FLOPs). In particular, our approach achieves up to a 94.6% parameter reduction (82.3% on average) and up to 90.7% FLOPs reduction (59.6% on average), with less than a 1.5% drop in validation accuracy, demonstrating superior performance and scalability compared to existing techniques./proceedings-archive/2024/DATA/825_pdf_upload.pdf |
||
FINEQ: SOFTWARE-HARDWARE CO-DESIGN FOR LOW-BIT FINE-GRAINED MIXED-PRECISION QUANTIZATION OF LLMS Speaker: Xilong Xie, Beihang University, CN Authors: Xilong Xie1, Liang Wang1, Limin Xiao1, Meng Han1, Lin Sun2, Shuai Zheng1 and Xiangrong Xu1 1Beihang University, CN; 2Jiangsu Shuguang Optoelectric Co., Ltd., CN Abstract Large language models (LLMs) have significantly advanced the natural language processing paradigm but impose substantial demands on memory and computational resources. Quantization is one of the most effective ways to reduce memory consumption of LLMs. However, advanced single-precision quantization methods experience significant accuracy degradation when quantizing to ultra-low bits. Existing mixed-precision quantization methods are quantized by groups with coarse granularity. Employing high precision for group data leads to substantial memory overhead, whereas low precision severely impacts model accuracy. To address this issue, we propose FineQ, software-hardware co-design for low-bit fine-grained mixed-precision quantization of LLMs. First, FineQ partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters, thus achieving a balance between model accuracy and memory overhead. Then, we propose an outlier protection mechanism within clusters that uses 3 bits to represent outliers and introduce an encoding scheme for index and data concatenation to enable aligned memory access. Finally, we introduce an accelerator utilizing temporal coding that effectively supports the quantization algorithm while simplifying the multipliers in the systolic array. FineQ achieves higher model accuracy compared to the SOTA mixed-precision quantization algorithm at a close average bit-width. Meanwhile, the accelerator achieves up to 1.79× higher energy efficiency and reduces the area of the systolic array by 61.2%./proceedings-archive/2024/DATA/953_pdf_upload.pdf |
||
SOLVING THE COLD-START PROBLEM FOR THE EDGE: CLUSTERING AND ADAPTIVE DEEP LEARNING FOR EMOTION DETECTION Speaker: Junjiao Sun, Centro de Electrónica Industrial, Universidad Politécnica de Madrid, ES Authors: Junjiao Sun1, Laura Gutierrez Martin2, Jose Miranda Calero3, Celia López-Ongil2, Jorge Portilla1 and Jose Andres Otero Marnotes1 1Centro de Electrónica Industrial Universidad Politecnica de Madrid, ES; 2UC3M (Universidad Carlos III de Madrid), ES; 3EPFL, CH Abstract Designing AI-based applications personalized to each user's behavior presents significant challenges due to the cold start problem and the impracticality of extensive individual data labeling. These challenges are further compounded when deploying such applications at the edge, where limited computing resources constrain the design space. This paper introduces a novel approach to AI-driven personalized solutions in biosensing applications by combining deep learning with clustering-based separation techniques. The proposed Clustering and Learning for Emotion Adaptive Recognition (CLEAR) methodology strikes a balance between population-wide models and fully personalized systems by leveraging data-driven clustering. CLEAR demonstrates its effectiveness in emotion recognition tasks, and its integration with fine-tuning enables efficient deployment on edge devices, ensuring data privacy and real-time detection when new users are introduced to the system. We conducted experiments for model personalization on two edge computing platforms: the Coral Edge TPU Dev Board and the Raspberry Pi with an Intel Movidius Neural Compute Stick 2. The results show that initial cluster assignment for new users can be achieved without labeled data, directly addressing the cold-start problem. Compared to baseline validation without clustering, this proposal improves the accuracy from 75% to 81.9%. Furthermore, fine-tuning with minimal labeled data significantly improves accuracy, achieving up to 86.34% for the fear detection task in the WEMAC dataset while remaining suitable for deployment on resource-constrained edge devices./proceedings-archive/2024/DATA/1025_pdf_upload.pdf |
||
KALMMIND: A CONFIGURABLE KALMAN FILTER DESIGN FRAMEWORK FOR EMBEDDED BRAIN-COMPUTER INTERFACES Speaker: Guy Eichler, Columbia University, Department of Computer Science, IL Authors: Guy Eichler, Joseph Zuckerman and Luca Carloni, Columbia University, US Abstract Kalman Filter (KF) is one of the most prominent algorithms to predict motion from measurements of brain activity. However, little effort has been made to optimize the KF for deployment in embedded brain-computer interfaces (BCIs). To address this challenge, we propose a new framework for designing KF hardware accelerators specialized for BCI, which facilitates design-space exploration by providing a tunable balance between latency and accuracy. Through FPGA-based experiments with brain data, we demonstrate improvements in both latency and accuracy compared to the state of the art./proceedings-archive/2024/DATA/43_pdf_upload.pdf |
||
SEGTRANSFORMER: ENHANCING SOFTMAX PERFORMANCE THROUGH SEGMENTATION WITH A RERAM-BASED PIM ACCELERATOR Presenter: YuCheng Wang, (+886)978599918, TW Authors: YuCheng Wang1, Ing-Chao Lin2 and Yuan-Hao Chang3 1(+886)978599918, TW; 2National Cheng Kung University, TW; 3Academia Sinica, TW | National Taiwan University, TW Abstract To accelerate Transformer computations, numerous ReRAM-based Processor-In-Memory (PIM) architectures have been proposed, which effectively speed up matrix multiplication. However, these approaches often shift the performance bottleneck from the attention mechanism to the Softmax computation. Additionally, data sharding for acceleration can disrupt the core logic of the Transformer, and when computing the exponential part of extremely small Euler's numbers, slight output differences lead to inefficiency in Softmax computation. To address these challenges, we propose SegTransformer, a ReRAM-based PIM accelerator that enhances matrix computation speed through segmentation techniques and generates segmented data for local Softmax operations. Moreover, we introduce an Integrated Softmax Processing Unit (ISPU), which computes both local Softmax and global factors to reduce errors and improve efficiency. Experimental results show that SegTransformer outperforms state-of-the-art Transformer accelerators. |
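The local-plus-global Softmax decomposition that SegTransformer's ISPU exploits corresponds to the standard segmented (block-wise) softmax recombination. The sketch below is a minimal NumPy illustration of that arithmetic only, not the paper's hardware datapath; the segment count and the NumPy formulation are assumptions made for the example.

```python
# Illustrative sketch (not the SegTransformer implementation): a segmented,
# numerically stable softmax. Each segment computes a local maximum and a local
# exponential sum; global correction factors then recombine the segments so the
# result matches the monolithic softmax exactly.
import numpy as np

def segmented_softmax(x, num_segments=4):
    segments = np.array_split(x, num_segments)
    local_max = np.array([s.max() for s in segments])            # per-segment max
    local_exp = [np.exp(s - m) for s, m in zip(segments, local_max)]
    local_sum = np.array([e.sum() for e in local_exp])           # per-segment sums
    global_max = local_max.max()
    scale = np.exp(local_max - global_max)                       # global factors
    denom = np.dot(local_sum, scale)                             # global denominator
    return np.concatenate([e * c for e, c in zip(local_exp, scale)]) / denom

x = np.array([3.0, 1.0, 0.2, 5.0, 2.5, -1.0, 0.0, 4.0])
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(segmented_softmax(x), reference)              # exact recombination
print(segmented_softmax(x))
```

Because each segment only needs its own maximum and sum, the expensive exponentials can proceed in parallel per segment, with a cheap scalar correction applied at the end.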
TS11 Session 6 - D8
Add this session to my calendar
Date: Tuesday, 01 April 2025
Time: 14:00 CET - 15:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
ACCELERATING AUTHENTICATED BLOCK CIPHERS VIA RISC-V CUSTOM CRYPTOGRAPHY INSTRUCTIONS Speaker: Yuhang Qiu, State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing, China, CN Authors: Yuhang Qiu, Wenming Li, Tianyu Liu, Zhen Wang, Zhiyuan Zhang, Zhihua Fan, Xiaochun Ye, Dongrui Fan and Zhimin Tang, State Key Lab of Processors, Institute of Computing Technology, CAS, CN Abstract As one of the standardized encryption algorithms, authenticated block ciphers based on Galois/Counter Mode (GCM) are widely used to guarantee accuracy and reliability in data transmission. Across the execution of authenticated block ciphers, the authentication operation is the main performance bottleneck because it introduces operations in a high-dimensional Galois field (GF) that cannot be efficiently executed via existing ISAs. To overcome this problem, we propose a custom ISA extension and combine it with the RISC-V cryptography extension to accelerate the whole process of authenticated block ciphers. Besides, we propose a specific hardware design, including a fully-pipelined GF(2^128) multiplier, to support the extended instructions and integrate it into the multi-issue out-of-order core XT910 without introducing any clock frequency overhead. The proposed design manages to accelerate the main operations in various kinds of authenticated block ciphers. We compare the performance of our designs to other existing acceleration schemes based on RISC-V ISA extensions. Experimental results show that our design outperforms other related work and achieves up to a 17x speedup with a lightweight hardware overhead./proceedings-archive/2024/DATA/32_pdf_upload.pdf |
||
NDPAGE: EFFICIENT ADDRESS TRANSLATION FOR NEAR-DATA PROCESSING ARCHITECTURES VIA TAILORED PAGE TABLE Presenter: Qingcai Jiang, University of Science and Technology of China, CN Authors: Qingcai Jiang, Buxin Tu and Hong An, University of Science and Technology of China, CN Abstract Near-Data Processing (NDP) has been a promising architectural paradigm to address the memory wall problem for data-intensive applications. Practical implementation of NDP architectures calls for system support for better programmability, where having virtual memory (VM) is critical. Modern computing systems incorporate a 4-level page table design to support address translation in VM. However, simply adopting an existing 4-level page table design in NDP systems causes significant address translation overhead because (1) NDP applications generate a lot of address translation requests, and (2) the limited L1 cache in NDP systems cannot cover the accesses to page table entries (PTEs). We extensively analyze the 4-level page table design and observe that (1) the memory access to page table entries is highly irregular, thus cannot benefit from the L1 cache, and (2) the last two levels of page tables are nearly fully occupied. Based on our observations, we propose NDPage, an efficient page table design tailored for NDP systems. The key mechanisms of NDPage are (1) an L1 cache bypass mechanism for PTEs that not only accelerates the memory accesses of PTEs but also prevents the pollution of PTEs in the cache system, and (2) a flattened page table design that merges the last two levels of page tables, allowing the page table to enjoy the flexibility of a 4KB page while reducing the number of PTE accesses. We evaluate NDPage using a variety of data-intensive workloads. Our evaluation shows that in a single-core NDP system, NDPage improves the end-to-end performance over the state-of-the-art address translation mechanism of 14.3%; in 4-core and 8-core NDP systems, NDPage enhances the performance of 9.8% and 30.5%, respectively. |
||
SPIRE: INFERRING HARDWARE BOTTLENECKS FROM PERFORMANCE COUNTER DATA Speaker: Nicholas Wendt, University of Michigan, US Authors: Nicholas Wendt1, Mahesh Ketkar2 and Valeria Bertacco1 1University of Michigan, US; 2Intel Labs, US Abstract The persistent demand for greater computing efficiency, coupled with diminishing returns from semiconductor scaling, has led to increased microarchitecture complexity and diversity. Thus, it has become increasingly difficult for application developers and hardware architects to accurately identify low-level performance bottlenecks. Abstract performance models, such as roofline models, help but strip away important microarchitectural details. In contrast, analyses based on hardware performance counters preserve detail but are challenging to implement. This work proposes SPIRE, a novel performance model that combines the accessibility and generality of roofline models with the microarchitectural detail of performance counters. SPIRE (Statistical Piecewise Linear Roofline Ensemble) uses a collection of roofline models to estimate a processor's maximum throughput, based on data from its performance counters. Training this ensemble simply requires sampling data from a processor's performance counters. After training a SPIRE model on 23 workloads running on a CPU, we evaluated it with 4 new workloads and compared our findings against a commercial performance analysis tool. We found that our SPIRE analysis accurately identified many of the same bottlenecks while requiring minimal deployment effort./proceedings-archive/2024/DATA/96_pdf_upload.pdf |
||
IMPROVING ADDRESS TRANSLATION IN TAGLESS DRAM CACHE BY CACHING PTE PAGES Speaker: Osang Kwon, Sungkyunkwan University, KR Authors: Osang Kwon, Yongho Lee and Seokin Hong, Sungkyunkwan University, KR Abstract This paper proposes a novel caching mechanism for PTE pages to enhance the Tagless DRAM Cache architecture and improve address translation in large in-package DRAM caches. Existing OS-managed DRAM cache architectures have achieved significant performance improvements by focusing on efficient tag management. However, prior studies have been limited in that they only update the PTE after caching pages, without directly accessing PTEs from the DRAM cache. This limitation leads to performance degradation during page walks. To address this issue, we propose a method to copy both data pages and PTE pages simultaneously to the DRAM cache. This approach reduces address translation and cache access latency. Additionally, we introduce a shootdown mechanism to maintain the consistency of PTEs and page walk caches in multi-core systems, ensuring that all cores access the latest information for shared pages. Experimental results demonstrate that the proposed Caching PTE pages can reduce address translation overhead by up to 33.3% compared to traditional OS-managed tagless DRAM caches, improving overall program execution time by an average of 10.5%. This effectively mitigates bottlenecks caused by address translation./proceedings-archive/2024/DATA/1467_pdf_upload.pdf |
||
EXPLORING THE SPARSITY-QUANTIZATION INTERPLAY ON A NOVEL HYBRID SNN EVENT-DRIVEN ARCHITECTURE Speaker: Tosiron Adegbija, University of Arizona, US Authors: Ilkin Aliyev, Jesus Lopez and Tosiron Adegbija, University of Arizona, US Abstract Spiking Neural Networks (SNNs) offer potential advantages in energy efficiency but currently trail Artificial Neural Networks (ANNs) in versatility, largely due to challenges in efficient input encoding. Recent work shows that direct coding achieves superior accuracy with fewer timesteps than traditional rate coding. However, there is a lack of specialized hardware to fully exploit the potential of direct-coded SNNs, especially their mix of dense and sparse layers. This work proposes the first hybrid inference architecture for direct-coded SNNs. The proposed hardware architecture comprises a dense core to efficiently process the input layer and sparse cores optimized for event-driven spiking convolutions. Furthermore, for the first time, we investigate and quantify the quantization effect on sparsity. Our experiments on two variations of the VGG9 network and implemented on a Xilinx Virtex UltraScale+ FPGA (Field-Programmable Gate Array) reveal two novel findings. Firstly, quantization increases the network sparsity by up to 15.2% with minimal loss of accuracy. Combined with the inherent low power benefits, this leads to a 3.4x improvement in energy compared to the full-precision version. Secondly, direct coding outperforms rate coding, achieving a 10% improvement in accuracy and consuming 26.4x less energy per image. Overall, our accelerator achieves ~51x higher throughput and consumes half the power compared to previous work. Our accelerator code is available at: https://github.com/githubofaliyev/SNN-DSE/tree/DATE25./proceedings-archive/2024/DATA/219_pdf_upload.pdf |
||
SWIFT-SIM: A MODULAR AND HYBRID GPU ARCHITECTURE SIMULATION FRAMEWORK Speaker: Xiangrong Xu, Beihang University, CN Authors: Xiangrong Xu, Yuanqiu Lv, Liang Wang, Limin Xiao, Meng Han, Runnan Shen and Jinquan Wang, Beihang University, CN Abstract Simulation tools are critical for architects to quickly estimate the impact of aggressive new features of GPU architecture. Existing cycle-accurate GPU simulators are typically cumbersome and slow to run. We observe that it is time-consuming and unnecessary for cycle-accurate GPU simulators to perform detailed simulations for the entire GPU when exploring the design space of specific components. This paper proposes Swift-Sim, a modular and hybrid GPU simulation framework. With a highly modular design, our framework can choose appropriate modeling approaches for each component according to requirements. For components of interest to architects, we use cycle-accurate simulation to evaluate new GPU architectures. For other components, we use analytical modeling, which accelerates simulation speed with only minor and acceptable degradation in overall accuracy. Based on this simulation framework, we present two working examples of hybrid modeling that simulate the ALU pipeline and memory accesses using analytical models. We further implement two GPU performance simulators with different levels of simplification based on Swift-Sim and evaluate them using configurations from real GPUs. The results show that the two simulators achieve an 82.6x and 211.2x geometric mean speedup compared to Accel-Sim with insignificant accuracy degradation./proceedings-archive/2024/DATA/298_pdf_upload.pdf |
||
HYMM: A HYBRID SPARSE-DENSE MATRIX MULTIPLICATION ACCELERATOR FOR GCNS Speaker: Hunjong Lee, Korea University, KR Authors: Hunjong Lee1, Jihun Lee1, Jaewon Seo1, Yunho Oh1, Myungkuk Yoon2 and Gunjae Koo1 1Korea University, KR; 2Ewha Womans University, KR Abstract Graph convolutional networks (GCNs) are emerging neural network models designed to process graph-structured data. Due to massively parallel computations using irregular data structures by GCNs, traditional processors such as CPUs, GPUs, and TPUs exhibit significant inefficiency when performing GCN inferences. Even though researchers have proposed several GCN accelerators, the prior dataflow architectures struggle with inefficient data utilization due to the divergent and irregularly structured graph data. In order to overcome such performance hurdles, we propose a hybrid dataflow architecture for sparse-dense matrix multiplications (SpDeMMs), called HyMM. HyMM employs disparate dataflow architectures using different data formats to achieve more efficient data reuse across varying degree levels within graph structures, hence HyMM can reduce off-chip memory accesses significantly. We implement the cycle-accurate simulator to evaluate the performance of HyMM. Our evaluation results demonstrate HyMM can achieve up to 4.78x performance uplift by reducing off-chip memory accesses by 91% compared to the conventional non-hybrid dataflow./proceedings-archive/2024/DATA/315_pdf_upload.pdf |
||
BUDDY ECC: MAKING CACHE MOSTLY CLEAN IN CXL-BASED MEMORY SYSTEMS FOR ENHANCED ERROR CORRECTION AT LOW COST Speaker: Yongho Lee, Sungkyunkwan University, KR Authors: Yongho Lee, Junbum Park, Osang Kwon, Sungbin Jang and Seokin Hong, Sungkyunkwan University, KR Abstract As Compute Express Link (CXL) emerges as a key memory interconnect, interest in optimization opportunities and challenges has grown. However, due to the different characteristics of the CXL Memory Module (CMM) compared to traditional DRAM-based Dual In-line Memory Modules (DIMMs), existing optimizations may not be effectively applied. In this paper, we propose a Proactive Write-back Policy that leverages the full-duplex nature and features of the CMM to optimize bandwidth, enhance reliability, and reduce area overhead. First, the Proactive Write-back policy improves bandwidth efficiency by minimizing dirty cachelines in the last-level cache through dead block prediction, proactively identifying and writing back cachelines that are unlikely to be rewritten. Second, the Utilization-aware Policy dynamically monitors the internal bandwidth of the CMM, sending write-back requests only when the module is under a low load, thus preventing performance degradation during high traffic. Finally, the robust Buddy ECC scheme enhances data reliability by separating Error Detection Code (EDC) for clean cachelines and stronger Error Correction Code (ECC) for dirty cachelines. Buddy ECC improved bandwidth utilization by 46%, limited performance degradation to 0.33%, and kept energy consumption increase under 1%./proceedings-archive/2024/DATA/391_pdf_upload.pdf |
||
A PERFORMANCE ANALYSIS OF CHIPLET-BASED SYSTEMS Speaker: Neethu Bal Mallya, Department of Computer Science and Engineering, Chalmers University of Technology, Sweden, SE Authors: Neethu Bal Mallya, Panagiotis Strikos, Bhavishya Goel, Ahsen Ejaz and Ioannis Sourdis, Chalmers University of Technology, SE Abstract As the semiconductor industry struggles to keep Moore's law alive and integrate more functionality on a chip, multi-chiplet chips offer a lower cost alternative to large monolithic chips due to their higher yield. However, chiplet-based chips are naturally Non-Uniform Memory Access (NUMA) systems and therefore suffer from slow remote accesses. NUMA overheads are exacerbated by the limited throughput and higher latency of inter-chiplet communication. This paper offers a comprehensive analysis of chiplet-based systems with different design parameters measuring their performance overheads compared to traditional monolithic multicore designs and their scalability to system and chiplet size. Several design alternatives pertaining to the memory hierarchy, interconnects, and technology aspects are studied. Our analysis shows that although chiplet-based chips can cut (recurring engineering) costs to half, they may give away over a third of the monolithic performance. Part of this performance overhead can be regained with specific design choices./proceedings-archive/2024/DATA/445_pdf_upload.pdf |
||
A HIGH-PERFORMANCE AND FLEXIBLE ACCELERATOR FOR DYNAMIC GRAPH CONVOLUTIONAL NETWORKS Speaker: Ke Wang, University of North Carolina at Charlotte, US Authors: Yingnan Zhao1, Ke Wang2 and Ahmed Louri1 1The George Washington University, US; 2University of North Carolina at Charlotte, US Abstract Dynamic Graph Convolutional Networks (DGCNs) have been applied to various dynamic graph-related applications, such as social networks, to achieve high inference accuracy. Typically, each DGCN layer consists of two distinct modules: a Graph Convolutional Network (GCN) module that captures spatial information, and a Recurrent Neural Network (RNN) module that extracts temporal information from input dynamic graphs. The different functionalities of these modules pose significant challenges for hardware platforms, particularly in achieving high-performance and energy-efficient inference processing. To this end, this paper introduces HiFlex, a high-performance and flexible accelerator designed for DGCN inference. At the architecture level, HiFlex implements multiple homogeneous processing elements (PEs) to perform main computations for GCN and RNN modules, along with a versatile interconnection fabric to optimize data communication and enhance on-chip data reuse efficiency. The flexible interconnection fabric can be dynamically configured to provide various on-chip topologies, supporting point-to-point and multicast communication patterns needed for GCN and RNN processing. At the algorithm level, HiFlex introduces a dynamic control policy that partitions, allocates, and configures hardware resources for distinct modules based on their computational requirements. Evaluation results using real-world dynamic graphs demonstrate that HiFlex achieves, on average, a 38% reduction in execution time and a 42% decrease in energy consumption for DGCN inference, compared to state-of-the-art approaches such as ES-DGCN, ReaDy, and RACE./proceedings-archive/2024/DATA/501_pdf_upload.pdf |
||
AMPHI: PRACTICAL AND INTELLIGENT DATA PREFETCHING FOR THE FIRST-LEVEL CACHE Speaker: Zicong Wang, College of Computer Science and Technology, National University of Defense Technology, CN Authors: Xuan Tang, Zicong Wang, Shuiyi He, Dezun Dong and Xiangke Liao, National University of Defense Technology, CN Abstract Data prefetchers play a crucial role in alleviating the memory wall by predicting future memory accesses. First-level cache prefetchers can observe all memory instructions but often rely on simpler strategies due to limited resources. While emerging machine learning-based approaches cover more memory access patterns, they typically require higher computational and storage resources and are usually deployed in the last-level cache. Other intelligent solutions for the first-level cache show only modest performance gains. To address this, we propose Amphi, the first practical and intelligent data prefetcher specifically designed for the first-level cache. Applying a binarized temporal convolutional network, Amphi significantly reduces storage overhead while maintaining performance comparable to the SOTA intelligent prefetcher. With a storage overhead of only 3.4 KB, Amphi requires only one-eighth of Pythia's storage needs. Amphi paves the way for the broader adoption of intelligence-driven prefetching solutions./proceedings-archive/2024/DATA/14_pdf_upload.pdf |
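For context on the first-level prefetching problem that Amphi targets, the sketch below shows the classic PC-indexed stride prefetcher that resource-constrained L1 prefetchers traditionally implement; it is the kind of "simpler strategy" the abstract contrasts against, not Amphi's binarized temporal convolutional network. The table organization and confidence thresholds are illustrative assumptions.

```python
# Background sketch only: a conventional PC-indexed stride prefetcher for a
# first-level cache. Amphi itself uses a binarized temporal convolutional
# network, which is not reproduced here; this shows the simpler baseline idea.

class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride, confidence)

    def access(self, pc, addr):
        """Return a prefetch address, or None, for a demand access (pc, addr)."""
        last_addr, last_stride, conf = self.table.get(pc, (addr, 0, 0))
        stride = addr - last_addr
        # Reinforce confidence when the stride repeats, decay it otherwise.
        if stride == last_stride and stride != 0:
            conf = min(conf + 1, 3)
        else:
            conf = max(conf - 1, 0)
        self.table[pc] = (addr, stride, conf)
        return addr + stride if conf >= 2 else None

pf = StridePrefetcher()
for a in range(0x1000, 0x1400, 64):            # a streaming access pattern
    hint = pf.access(pc=0x400123, addr=a)
print(hex(hint))                               # steady stride -> prefetch issued
```

Learned prefetchers such as Amphi aim to cover the irregular patterns this table-based scheme misses, while keeping storage small enough (a few kilobytes) to sit beside the first-level cache.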
TS12 Session 23 - A4+A2
Add this session to my calendar
Date: Tuesday, 01 April 2025
Time: 14:00 CET - 15:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
SACK: ENABLING ENVIRONMENTAL SITUATION-AWARE ACCESS CONTROL FOR AUTONOMOUS VEHICLES IN LINUX KERNEL Speaker: Boyan Chen, Peking University, CN Authors: Boyan Chen1, Qingni Shen1, Lei Xue2, Jiarui She1, Xiaolei Zhang1, Xiapu Luo3, Xin Zhang1, Wei Chen1 and Zhonghai Wu1 1Peking University, CN; 2National Sun Yat-Sen University, TW; 3The Hong Kong Polytechnic University, HK Abstract Connected and autonomous vehicles (CAVs) operate in open and evolving environments, which require timely and adaptive permission restriction to address dynamic risks that arise from changes in environmental situations (hereinafter referred to as situations), such as emergency situations due to vehicle crashes. Enforcing situation-aware access control is an effective approach to support adaptive permission restriction. Current works mainly implement situation-aware access control in the permission framework and API monitoring in user space. They are vulnerable to being bypassed and are coarse-grained. Autonomous systems have widely adopted mandatory access control (MAC) to configure and enforce system-wide and fine-grained access control policies. However, the MAC supported by Linux security modules (LSM) relies on pre-defined security contexts (e.g., type) and relatively fixed permission transition conditions (e.g., exec syscall), which lacks consideration of environmental factors. To address these issues, we propose a Situation-aware Access Control framework in the Kernel (SACK), which enforces adaptive permission restriction based on environmental factors for CAVs. Incorporating environmental situations into the LSM framework is not straightforward. SACK introduces situation states as a new security context for abstracting environmental factors in the kernel. Subsequently, SACK utilizes a situation state machine to implement new adaptive permission transitions triggered by situation events. In addition, SACK provides a novel situation-aware policy language that links specific user space permissions to MAC rules while maintaining compatibility with other LSMs such as AppArmor. We develop two prototypes: an independent SACK with its own policies and a SACK-enhanced AppArmor that adaptively updates the corresponding policies of AppArmor. The experimental results demonstrate that SACK can efficiently enforce situation-adaptive permissions with negligible runtime overhead./proceedings-archive/2024/DATA/312_pdf_upload.pdf |
||
EXPLOITING SYSML V2 MODELING FOR AUTOMATIC SMART FACTORIES CONFIGURATION Speaker: Mario Libro, Università di Verona, IT Authors: Mario Libro1, Sebastiano Gaiardelli1, Marco Panato2, Stefano Spellini2, Michele Lora1 and Franco Fummi1 1Università di Verona, IT; 2Factoryal S.r.l., IT Abstract Smart factories are complex environments equipped with both production machinery and computing devices that collect, share, and analyze data. For this reason, the modeling of today's factories can no longer rely on traditional methods, and computer engineering tools, such as SysML, must be employed. At the same time, the current SysML v1.* standard does not provide the rigor required to model the complexity and criticalities of a smart factory. Recently, SysML v2 has been proposed and is about to be released as the new version of the standard. Its release candidate shows that the new version aims to provide a more rigorous and complete modeling language, able to fulfill the requirements of the smart factory domain. In this paper, we explore the capabilities of the new SysML v2 standard by building a rigorous modeling strategy, able to capture the aspects of a smart factory related to the production process, computation, and communication. We apply the proposed strategy to model a fully-fledged smart factory, and we rely on the models to automatically configure the different pieces of equipment and software components in the factory./proceedings-archive/2024/DATA/495_pdf_upload.pdf |
||
HIDP: HIERARCHICAL DNN PARTITIONING FOR DISTRIBUTED INFERENCE ON HETEROGENEOUS EDGE PLATFORMS Speaker: Zain Taufique, University of Turku, FI Authors: Zain Taufique1, Aman Vyas1, Antonio Miele2, Pasi Liljeberg1 and Anil Kanduri1 1University of Turku, FI; 2Politecnico di Milano, IT Abstract Edge inference techniques partition and distribute Deep Neural Network (DNN) inference tasks among multiple edge nodes for low latency inference, without considering the core-level heterogeneity of edge nodes. Further, default DNN inference frameworks also do not fully utilize the resources of heterogeneous edge nodes, resulting in higher inference latency. In this work, we propose a hierarchical DNN partitioning strategy (HiDP) for distributed inference on heterogeneous edge nodes. Our strategy hierarchically partitions DNN workloads at both global and local levels by considering the core-level heterogeneity of edge nodes. We evaluated our proposed HiDP strategy against relevant distributed inference techniques over widely used DNN models on commercial edge devices. On average our strategy achieved 38% lower latency, 46% lower energy, and 56% higher throughput in comparison with other relevant approaches./proceedings-archive/2024/DATA/532_pdf_upload.pdf |
||
COUPLING NEURAL NETWORKS AND PHYSICS EQUATIONS FOR LI-ION BATTERY STATE-OF-CHARGE PREDICTION Speaker: Giovanni Pollo, Politecnico di Torino, IT Authors: Giovanni Pollo1, Alessio Burrello2, Enrico Macii1, Massimo Poncino1, Sara Vinco1 and Daniele Jahier Pagliari1 1Politecnico di Torino, IT; 2Politecnico di Torino | Università di Bologna, IT Abstract Estimating the evolution of the battery's State of Charge (SoC) in response to its usage is critical for implementing effective power management policies and for ultimately improving the system's lifetime. Most existing estimation methods are either physics-based digital twins of the battery or data-driven models such as Neural Networks (NNs). In this work, we propose two new contributions in this domain. First, we introduce a novel NN architecture formed by two cascaded branches: one to predict the current SoC based on sensor readings, and one to estimate the SoC at a future time as a function of the load behavior. Second, we integrate battery dynamics equations into the training of our NN, merging the physics-based and data-driven approaches, to improve the models' generalization over variable prediction horizons. We validate our approach on two publicly accessible datasets, showing that our Physics-Informed Neural Networks (PINNs) outperform purely data-driven ones while also obtaining superior prediction accuracy with a smaller architecture with respect to the state-of-the-art./proceedings-archive/2024/DATA/565_pdf_upload.pdf |
||
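A minimal sketch of the physics-informed ingredient mentioned in the abstract above, assuming a simple Coulomb-counting model dSoC/dt ≈ -I/(3600·Q): the total loss adds a data-fitting term and a physics-residual term. The battery values, SoC traces, and weighting factor are toy numbers; the paper's actual network and battery dynamics equations are not reproduced here.

```python
# Toy physics-informed loss for SoC prediction (assumed setup, not the paper's
# exact formulation): data loss plus a Coulomb-counting residual.
import numpy as np

dt = 1.0                      # sampling period [s]
q_nominal = 3.0 * 3600        # cell capacity [As] (3 Ah, made-up value)

current = np.array([1.5, 1.5, 1.2, 1.0, 0.8])                 # discharge current [A]
soc_true = np.array([0.90, 0.8998, 0.8997, 0.8996, 0.8995])   # measured SoC
soc_pred = np.array([0.90, 0.8999, 0.8996, 0.8995, 0.8994])   # NN output (toy)

data_loss = np.mean((soc_pred - soc_true) ** 2)

# Physics residual: finite-difference SoC derivative vs. Coulomb-counting dynamics.
dsoc_dt_pred = np.diff(soc_pred) / dt
dsoc_dt_phys = -current[:-1] / q_nominal
physics_loss = np.mean((dsoc_dt_pred - dsoc_dt_phys) ** 2)

lam = 0.1                     # weighting between the two terms (hyperparameter)
total_loss = data_loss + lam * physics_loss
print(f"data={data_loss:.2e}  physics={physics_loss:.2e}  total={total_loss:.2e}")
```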
AUTONOMOUS UAV-ASSISTED IOT SYSTEMS WITH DEEP REINFORCEMENT LEARNING BASED DATA FERRY Speaker: Mason Conkel, The University of Texas at San Antonio, US Authors: Mason Conkel1, Wen Zhang2, Mimi Xie1, Yufang Jin1 and Chen Pan1 1The University of Texas at San Antonio, US; 2Wright State University, US Abstract Emerging unmanned aerial vehicle (UAV) technology offers reliable, flexible, and controllable techniques for transferring data collected by wireless internet of things (IoT) devices located in remote areas. However, UAV deployments are limited by the mission distance between recharges, especially when recharging takes place far from the monitoring area. To address these challenges, we propose smart charging stations installed within the monitoring area, equipped with energy-harvesting capabilities and communication modules. These stations can replenish the UAV's energy and act as cluster heads by collecting information from IoT devices within their jurisdiction. This allows a UAV to operate continuously by downloading while charging and forwarding the data to the remote server during flight. Despite these improvements, the unpredictable nature of energy-harvesting devices and charging needs can lead to stale or obsolete information at cluster heads. The limited communication range may prevent the cluster heads from establishing connections with all nodes in their jurisdiction. To overcome these issues, we propose an age-of-information-aware data ferry algorithm using deep reinforcement learning to determine the UAV's flight path. The deep reinforcement learning agent, running on cluster heads, utilizes a global state gathered by the UAV to output the location of the next stop, which can be a cluster head or an IoT device. The experiments show that the algorithm can minimize the age of information without diminishing data collection./proceedings-archive/2024/DATA/939_pdf_upload.pdf |
||
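For readers unfamiliar with the age-of-information (AoI) metric the abstract above optimizes, the toy bookkeeping below shows how each sensor's age grows every time step and resets when the UAV collects from it; the visit order is arbitrary rather than produced by the paper's DRL policy.

```python
# Toy age-of-information (AoI) bookkeeping, illustrative only.
ages = {"s0": 0, "s1": 0, "s2": 0}
visit_plan = ["s1", "s2", "s0", "s1"]    # one UAV stop per time step (arbitrary)

history = []
for stop in visit_plan:
    for node in ages:
        ages[node] += 1                  # information gets staler everywhere
    ages[stop] = 0                       # fresh data collected at this stop
    history.append(dict(ages))

avg_aoi = sum(sum(h.values()) for h in history) / (len(history) * len(ages))
print(history, "average AoI:", round(avg_aoi, 2))
```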
AERODIFFUSION: COMPLEX AERIAL IMAGE SYNTHESIS WITH DYNAMIC TEXT DESCRIPTIONS AND FEATURE-AUGMENTED DIFFUSION MODELS Speaker: Douglas Townsell, Wright State University, US Authors: Douglas Townsell1, Mimi Xie2, Bin Wang1, Fathi Amsaad1, Varshitha Thanam3 and Wen Zhang1 1Wright State University, US; 2The University of Texas at San Antonio, US; 3Wright State University, US Abstract Aerial imagery provides crucial insights for various fields, including remote monitoring, environmental assessment, and autonomous navigation. However, the availability of aerial image datasets is limited due to privacy concerns and imbalanced data distribution, impeding the development of robust deep learning models. Recent advancements in text-guided image synthesis offer a promising approach to enrich and diversify these datasets. Despite progress, existing generative models face challenges in synthesizing realistic aerial images due to the lack of paired text-aerial datasets, the complexity of densely packed objects, and the limitations of modeling object relationships. In this paper, we introduce AeroDiffusion, a novel framework designed to overcome these challenges by leveraging large language models (LLMs) for keypoint-aware text description generation and a feature-augmented diffusion process for realistic image synthesis. Our approach integrates region-level feature extraction to preserve small objects and multi-modal feature alignment to improve textual descriptions of complex aerial scenes. AeroDiffusion is the first to extend deep generative models for high-resolution, text-guided aerial image generation, including the creation of images from novel viewpoints. We contribute a new paired text-aerial image dataset and demonstrate the effectiveness of our model, achieving an FID score of 78.15 across five benchmarks, significantly outperforming state-of-the-art models such as DDPM (217.95), Stable Diffusion (119.13), and ARLDM (111.59)./proceedings-archive/2024/DATA/1188_pdf_upload.pdf |
||
POWER- AND DEADLINE-AWARE DYNAMIC INFERENCE ON INTERMITTENT COMPUTING SYSTEMS Speaker: Hengrui Zhao, University of Southampton, GB Authors: Hengrui Zhao, Lei Xun, Jagmohan Chauhan and Geoff Merrett, University of Southampton, GB Abstract In energy-harvesting intermittent computing systems, balancing power constraints with the need for timely and accurate inference remains a critical challenge. Existing methods often sacrifice significant accuracy or fail to adapt effectively to fluctuating power conditions. This paper presents DualAdaptNet, a power- and deadline-aware neural network architecture that dynamically adapts both its width and depth to ensure reliable inference under variable power conditions. Additionally, a runtime scheduling method is introduced to select an appropriate sub-network configuration based on real-time energy-harvesting conditions and system deadlines. Experimental results on the MNIST dataset demonstrate that our approach completes up to 7.0% more inference tasks within a specified deadline while also improving average accuracy by 15.4% compared to the state-of-the-art./proceedings-archive/2024/DATA/1239_pdf_upload.pdf |
||
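The runtime selection step described in the abstract above can be pictured as a lookup over profiled sub-network configurations: pick the most accurate (width, depth) pair whose latency fits the deadline and whose energy fits the harvested budget. The profile numbers below are made up for illustration and are not measurements from the paper.

```python
# Sketch of deadline/power-aware sub-network selection with assumed profiles.
profiles = [
    # (width_mult, depth_mult, latency_ms, energy_mJ, accuracy)
    (0.25, 0.50, 3.0, 0.4, 0.91),
    (0.50, 0.75, 6.0, 0.9, 0.95),
    (1.00, 1.00, 12.0, 2.1, 0.98),
]

def select_config(deadline_ms, energy_budget_mj):
    feasible = [p for p in profiles if p[2] <= deadline_ms and p[3] <= energy_budget_mj]
    if not feasible:
        return None                      # skip or defer the inference
    return max(feasible, key=lambda p: p[4])

print(select_config(deadline_ms=8.0, energy_budget_mj=1.0))   # mid-size network fits
print(select_config(deadline_ms=4.0, energy_budget_mj=0.3))   # nothing fits -> None
```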
DCHA: DISTRIBUTED-CENTRALIZED HETEROGENEOUS ARCHITECTURE ENABLES EFFICIENT MULTI-TASK PROCESSING FOR SMART SENSING Speaker: Cheng Qu, Beijing University of Posts and Telecommunication, CN Authors: Erxiang Ren1, Cheng Qu2, Li Luo1, Yonghua Li2, Zheyu Liu3, Xinghua Yang4, Qi Wei5 and Fei Qiao5 1Beijing Jiaotong University, CN; 2Beijing University of Posts and Telecommunications, CN; 3MakeSens AI, CN; 4Beijing Forestry University, CN; 5Tsinghua University, CN Abstract The rapid development of artificial intelligence (AI) has accelerated the progression of IoT technology into the smart era. Integrating AI processing capabilities into IoT devices to create smart sensing systems holds significant promise. In this work, we propose a distributed-centralized heterogeneous architecture that enables efficient multi-task processing for smart sensing. This architecture improves the operational efficiency of sensing systems and enhances the deployment scalability through collaborative computing across end, edge, and center nodes. Specifically, we partition the network in traditional centralized sensing systems into several parts and perform algorithm-hardware co-design for each part on its respective deployment platform. We developed a sample design to validate the proposed architecture. By implementing a lightweight image encoder, we achieved an 88x reduction in encoder parameters and up to 9873x energy gain, facilitating deployment on resource-constrained devices. Experimental results demonstrate that the proposed architecture effectively reduces overall energy consumption by 0.0573x to 0.0889x, while maintaining robust multi-task inference capabilities. Moreover, energy consumption reductions of 2.88x to 3.22x on edge nodes and 6311.56x to 10037.23x on end nodes were observed./proceedings-archive/2024/DATA/245_pdf_upload.pdf |
||
FAIRXBAR: IMPROVING THE FAIRNESS OF DEEP NEURAL NETWORKS WITH NON-IDEAL IN-MEMORY COMPUTING HARDWARE Speaker: Cheng Wang, Iowa State University of Science and Technology, US Authors: Sohan Salahuddin Mugdho1, Yuanbo Guo2, Ethan Rogers1, Weiwei Zhao1, Yiyu Shi2 and Cheng Wang1 1Iowa State University of Science and Technology, US; 2University of Notre Dame, US Abstract While artificial intelligence (AI) based on deep neural networks (DNN) has achieved near-human performance in various cognitive tasks, such data-driven models are known to exhibit implicit bias against specific subgroups, leading to fairness issues. Most existing methods for improving model fairness only consider software-based optimizations, while the impact of hardware is largely unexplored. In this work, we investigate the impact of underlying hardware technology on AI fairness as we deploy DNN-based medical diagnosis algorithms onto in-memory computing hardware accelerators. Based on our newly developed framework that characterizes the importance of DNN weight parameters to fairness, we demonstrate that device variability-induced non-idealities such as stuck-at faults and noises due to variation can be exploited to deliver improved fairness (up to 32% improvement) with significantly reduced trade-off (less than 1% loss) of the overall accuracy. We additionally develop a hardware non-idealities-aware training methodology that further mitigates the bias between unprivileged and privileged demographic groups in our experiments on skin lesion diagnosis datasets. Our work suggests exciting opportunities for leveraging the hardware attributes in a cross-layer co-design to enable equitable and fair AI./proceedings-archive/2024/DATA/1009_pdf_upload.pdf |
||
HUMAN-CENTERED DIGITAL TWIN FOR INDUSTRY 5.0 Speaker: Francesco Biondani, Università di Verona, IT Authors: Francesco Biondani1, Luigi Capogrosso1, Nicola Dall'Ora1, Enrico Fraccaroli2, Marco Cristani1 and Franco Fummi1 1Università di Verona, IT; 2University of North Carolina at Chapel Hill, IT Abstract Moving beyond the automation-driven paradigm of Industry 4.0, Industry 5.0 emphasizes human-centric industrial systems where human creativity and instincts complement precise and advanced machines. With this new paradigm, there is a growing need for resource-efficient and user-preferred manufacturing solutions that integrate humans into industrial processes. Unfortunately, methodologies for incorporating human elements into industrial processes remain underdeveloped. In this work, we present the first pipeline for the creation of a human-centered Digital Twin (DT), leveraging Unreal Engine's MetaHuman technology to track worker alertness in real-time. Our findings demonstrate the potential of integrating Artificial Intelligence (AI) and human-centered design within Industry 5.0 to enhance both worker safety and industrial efficiency./proceedings-archive/2024/DATA/692_pdf_upload.pdf |
||
ENERGY-AWARE ERROR CORRECTION METHOD FOR INDOOR POSITIONING AND TRACKING Speaker: Donkyu Baek, Chungbuk National University, KR Authors: Donguk Kim1, Yukai Chen2, Donkyu Baek1, Enrico Macii3 and Massimo Poncino3 1Chungbuk National University, KR; 2IMEC, BE; 3Politecnico di Torino, IT Abstract Indoor positioning is crucial for the effective use of drones in smart environments, enabling precise navigation and control in complex indoor spaces where GPS signals are weak or unavailable and wireless communication-based systems must be used. In order to improve positioning accuracy, various distance measurement techniques and related error correction methods have been proposed in the literature. However, these methods are mostly focused on accuracy and often require a significant amount of computational resources, which is quite inefficient when deployed on battery-operated devices like small robots or drones because of their limited battery capacity. Moreover, conventional error correction methods are of limited effectiveness for tracking moving objects. In this paper, we first analyze the trade-off between energy consumption and accuracy for error correction and identify the most energy-efficient error correction method. Based on this analysis in the accuracy/energy space, we introduce a new energy-efficient error correction method that is especially targeted for tracking a moving object. We validated our solution by implementing an Ultra-Wideband-based indoor positioning system and demonstrated that the proposed method improves positioning accuracy by 15% and reduces energy consumption by 33% compared to the state-of-the-art method./proceedings-archive/2024/DATA/1048_pdf_upload.pdf |
||
DECENTRALIZING IOT DATA PROCESSING: THE RISE OF BLOCKCHAIN-BASED SOLUTIONS Speaker: Daniela De Venuto, Polytechnic University of Bari, IT Authors: Giuseppe Spadavecchia1, Marco Fiore2, Marina Mongiello2 and Daniela De Venuto2 1Private, IT; 2Polytechnic University of Bari, IT Abstract The rise of the Internet of Things has introduced new challenges related to data security and transparency, especially in industries like agri-food where traceability is critical. Traditional cloud-based solutions, while scalable, pose security and privacy risks. This paper proposes a decentralized architecture using Blockchain technology to address these challenges. We deploy IoT sensors connected to a Raspberry Pi for edge processing and utilize Hyperledger Fabric, a private Blockchain, to manage and store data securely. Two approaches were evaluated: computation of a Discomfort Index on the Raspberry Pi (edge processing) versus performing the same computation on-chain using smart contracts. Performance metrics, including latency, throughput, and error rate, were measured using Hyperledger Caliper. The results show that edge processing offers superior performance in terms of latency and throughput, while Blockchain-based computation ensures greater transparency and trust. This study highlights the potential of Blockchain as a viable alternative to centralized cloud systems in IoT environments and suggests future research in scalability, hybrid architectures, and energy efficiency./proceedings-archive/2024/DATA/1453_pdf_upload.pdf |
||
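The abstract above does not state which Discomfort Index formula is computed; the sketch below uses one common formulation of Thom's index purely to illustrate how lightweight the edge-side computation is compared with invoking a smart contract on-chain.

```python
# One common formulation of Thom's Discomfort Index (a stand-in; the paper may
# use a different variant). Temperature in Celsius, relative humidity in %.
def discomfort_index(temp_c: float, rel_humidity_pct: float) -> float:
    return temp_c - 0.55 * (1 - 0.01 * rel_humidity_pct) * (temp_c - 14.5)

readings = [(28.0, 60.0), (31.5, 75.0), (24.0, 40.0)]   # (°C, %RH) sample pairs
for t, rh in readings:
    print(f"T={t:.1f}C RH={rh:.0f}% -> DI={discomfort_index(t, rh):.1f}")
```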
ENABLING A PORTABLE BRAIN COMPUTER INTERFACE FOR REHABILITATION OF SPINAL CORD INJURIES Speaker: Adrian Evans, CEA, FR Authors: Adrian Evans1, Victor Roux-Sibillon2, Joe Saad2, Ivan Miro-Panades2, Tetiana Aksenova3 and Lorena Anghel4 1CEA, FR; 2CEA-List, FR; 3CEA-Leti, FR; 4Grenoble-Alpes University, Grenoble, France, FR Abstract In clinical trials, brain signal decoders combined with spinal stimulation have been shown to be a promising means of restoring mobility to paraplegic and tetraplegic patients. To make this technology available for home use, the complex brain signal decoding must be performed using a low-power, portable, battery-operated system. This case study shows how the decoding algorithm for a Brain-Computer Interface (BCI) system was ported to an embedded platform, resulting in an over 25× power reduction compared to the previous implementation, while respecting real-time and accuracy constraints./proceedings-archive/2024/DATA/510_pdf_upload.pdf |
TS13 Session 24 - E2+E1
Add this session to my calendar
Date: Tuesday, 01 April 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
HARDWARE-ASSISTED RANSOMWARE DETECTION USING AUTOMATED MACHINE LEARNING Speaker: Zhixin Pan, Florida State University, US Authors: Zhixin Pan1 and Ziyu Shu2 1Florida State University, US; 2Washington University in St. Louis, US Abstract Ransomware has emerged as a severe privacy threat, leading to significant financial and data losses worldwide. Traditional detection methods, including static signature-based detection and dynamic behavior-based analysis, have shown limitations in effectively identifying and mitigating ever-evolving ransomware attacks. In this paper, we present a machine learning-based framework with hardware-level microprocessor activity monitoring to enhance detection performance. Specifically, the proposed method incorporates adversarial training to address the weaknesses of conventional static analysis against obfuscation, along with hardware-assisted behavior monitoring to reduce latency, achieving effective and real-time ransomware detection. The proposed method employs a Neural Architecture Search (NAS) algorithm to automate the selection of optimal machine learning models, significantly boosting generalizability. Experimental results demonstrate that our proposed method improves detection accuracy and reduces detection latency compared to existing approaches, while also maintaining high generalizability across diverse ransomware types./proceedings-archive/2024/DATA/358_pdf_upload.pdf |
||
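As a simplified picture of hardware-assisted detection, the snippet below classifies windows of hardware performance counter readings with an ordinary random forest on synthetic data; the paper's NAS-selected model and adversarial training are not reproduced, and the counter choices and value ranges are assumptions made for the example.

```python
# Illustrative pipeline only (synthetic data, generic classifier): label time
# windows of hardware performance counter readings as benign vs. ransomware-like.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Features per window: [LLC misses, branch misses, instructions, page faults]
benign = rng.normal([2e5, 1e4, 5e7, 50], [5e4, 2e3, 1e7, 10], size=(200, 4))
ransom = rng.normal([9e5, 3e4, 4e7, 400], [1e5, 5e3, 1e7, 80], size=(200, 4))

X = np.vstack([benign, ransom])
y = np.array([0] * 200 + [1] * 200)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
window = rng.normal([8.5e5, 2.8e4, 4.1e7, 380], [1e5, 5e3, 1e7, 80], size=(1, 4))
print("ransomware suspected" if clf.predict(window)[0] else "benign")
```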
RICH: HETEROGENEOUS COMPUTING FOR REAL-TIME INTELLIGENT CONTROL SYSTEMS Speaker: Jintao Chen, Shanghai Jiao Tong University, CN Authors: Jintao Chen, Yuankai Xu, Yinchen Ni, An Zou and Yehan Ma, Shanghai Jiao Tong University, CN Abstract Over the past years, intelligent control tasks, such as those based on deep neural networks (DNNs), have demonstrated significant potential in control systems. However, deploying intelligent control policies on heterogeneous computing platforms presents open challenges. These challenges extend beyond the apparent conflict between intensive computation and timing constraints and further encompass the interactions between task executions and complicated control performance. To address these challenges, this paper introduces RICH, a general and end-to-end approach to facilitate intelligent control tasks on heterogeneous computing architectures. RICH incorporates both offline Control-Oriented Computation and Resource Mapping (CCRM) and runtime Most Remaining Accelerator Segment Number First Scheduling (MRAF). Given the control tasks, the CCRM starts with balancing the computation workloads and processor resources with the goal of optimizing overall control performance. Subsequently, the MRAF employs segment-level real-time scheduling to ensure the timely execution of tasks. Extensive experiments on robotic arms (via a hardware-in-the-loop simulator) demonstrate that RICH works as a general, end-to-end approach. These experiments reveal significant improvements in control performance, with enhancements of 50.7% observed for intelligent control applications deployed on heterogeneous computing platforms./proceedings-archive/2024/DATA/427_pdf_upload.pdf |
||
RT-VIRTIO: TOWARDS THE REAL-TIME PERFORMANCE OF VIRTIO IN A TWO-TIER COMPUTING ARCHITECTURE Speaker: Siwei Ye, Shanghai Jiao Tong University, CN Authors: Siwei Ye1, Minqing Sun1, Huifeng Zhu2, Yier Jin3 and An Zou1 1Shanghai Jiao Tong University, CN; 2Washington University in St. Louis, US; 3University of Science and Technology of China, CN Abstract With the popularity of virtualization technology, ensuring reliable I/O operations with timing constraints in virtual environments becomes increasingly critical. Timing-predictable virtual I/O enhances the responsiveness and efficiency of virtualized systems, facilitating their seamless integration into time-critical applications such as industrial automation and robotics. Its significance lies in meeting rigorous performance standards, minimizing latency, and consistently delivering predictable I/O performance. As a result, virtual machines can effectively support mission-critical and time-sensitive workloads. However, due to the complicated system architecture, I/O operations in a virtualized environment face competition both from other I/O operations within the same virtual machine and from interference by other virtual machines whose I/O targets the same host machine. This study presents RT-VirtIO, a practical approach to provide predictable real-time I/O operations. RT-VirtIO addresses the challenges associated with lengthy data paths and complex resource management. Through early-stage characterization, this study identifies key factors contributing to poor I/O real-time performance and then builds an analytical model and a learning-based data-driven model to predict the tail I/O latency. Leveraging these two models, RT-VirtIO effectively captures these dynamics, enabling the development of a general and applicable optimization framework. Experimental results demonstrate that RT-VirtIO significantly improves real-time performance in virtual environments (by 20.07%~30.90%) without necessitating hardware modifications, demonstrating promising applicability across a broader range of scenarios./proceedings-archive/2024/DATA/680_pdf_upload.pdf |
||
ENABLING SECURITY ON THE EDGE: A CHERI COMPARTMENTALIZED NETWORK STACK Speaker: Donato Ferraro, University of Modena and Reggio Emilia, Minerva Systems, IT Authors: Donato Ferraro1, Andrea Bastoni2, Alexander Zuepke3 and Andrea Marongiu4 1Minerva Systems SRL, University of Modena and Reggio Emilia, IT; 2TUM, Minerva Systems, DE; 3TU Munich, DE; 4Università di Modena e Reggio Emilia, IT Abstract The widespread deployment of embedded systems in critical infrastructures, interconnected edge devices like autonomous drones, and smart industrial systems requires robust security measures. Compromised systems increase the risks of operational failures, data breaches, and---in safety-critical environments---potential physical harm to people. Despite these risks, current security measures are often insufficient to fully address the attack surfaces of embedded devices. CHERI provides strong security from the hardware level by enabling fine-grained compartmentalization and memory protection, which can reduce the attack surface and improve the reliability of such devices. In this work, we explore the potential of CHERI to compartmentalize one of the most critical and targeted components of interconnected systems: their network stack. Our case study examines the trade-offs of isolating applications, TCP/IP libraries, and network drivers on a CheriBSD system deployed on the Arm Morello platform. Our results suggest that CHERI has the potential to enhance security while maintaining performance in embedded-like environments./proceedings-archive/2024/DATA/1190_pdf_upload.pdf |
||
TOWARDS RELIABLE SYSTEMS: A SCALABLE APPROACH TO AXI4 TRANSACTION MONITORING Speaker: Chaoqun Liang, Università di Bologna, IT Authors: Chaoqun Liang1, Thomas Benz2, Alessandro Ottaviano2, Angelo Garofalo1, Luca Benini1 and Davide Rossi1 1Università di Bologna, IT; 2ETH Zurich, CH Abstract In safety-critical SoC applications such as automotive and aerospace, reliable transaction monitoring is crucial for maintaining system integrity. This paper introduces a drop-in Transaction Monitoring Unit (TMU) for AXI4 subordinate endpoints that detects transaction failures including protocol violations or timeouts and triggers recovery by resetting the affected subordinates. Two TMU variants address different constraints: a Tiny-Counter solution for tightly area-constrained systems and a Full-Counter solution for critical subordinates in mixed-criticality SoCs. The Tiny-Counter employs a single counter per outstanding transaction, while the Full-Counter uses multiple counters to track distinct transaction stages, offering finer-grained monitoring and reducing detection latencies by up to hundreds of cycles at roughly 2.5× the area cost. The Full-Counter also provides detailed error logs for performance and bottleneck analysis. Evaluations at both IP and system levels confirm the TMU's effectiveness and low overhead. In GF12 technology, monitoring 16–32 outstanding transactions occupies 1330–2616 µm2 for the tiny-Counter and 3452–6787 µm2 for the Full-Counter; moderate prescaler steps reduce these figures by 18–39% and 19–32%, respectively, with no loss of functionality. Results from a full-system integration demonstrate the TMU's robust and precise monitoring capabilities in safety-critical SoC environments./proceedings-archive/2024/DATA/1253_pdf_upload.pdf |
||
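A behavioural sketch (in software, not the RTL) of the Tiny-Counter idea described above: one countdown per outstanding transaction, with a timeout flagged if the response has not arrived by the time the counter reaches zero. The transaction IDs and the timeout value are arbitrary.

```python
# Behavioural model of per-transaction timeout monitoring (illustrative only).
class TinyCounterMonitor:
    def __init__(self, timeout_cycles: int):
        self.timeout = timeout_cycles
        self.outstanding = {}            # txn_id -> remaining cycles

    def issue(self, txn_id):
        self.outstanding[txn_id] = self.timeout

    def respond(self, txn_id):
        self.outstanding.pop(txn_id, None)

    def tick(self):
        expired = []
        for txn_id in list(self.outstanding):
            self.outstanding[txn_id] -= 1
            if self.outstanding[txn_id] == 0:
                expired.append(txn_id)
                del self.outstanding[txn_id]
        return expired                   # IDs to report / subordinates to reset

mon = TinyCounterMonitor(timeout_cycles=3)
mon.issue("AR0"); mon.issue("AW1")
mon.tick(); mon.respond("AR0")           # AR0 completes in time
print(mon.tick(), mon.tick())            # AW1 expires on the third cycle
```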
EXACT SCHEDULABILITY ANALYSIS FOR LIMITED-PREEMPTIVE PARALLEL APPLICATIONS USING TIMED AUTOMATA IN UPPAAL Speaker: Jonas Hansen, Aalborg Universitet, DK Authors: Jonas Hansen1, Srinidhi Srinivasan2, Geoffrey Nelissen3 and Kim Larsen1 1Aalborg Universitet, DK; 2Technische Universiteit Eindhoven (TU/e), NL; 3Eindhoven University of Technology, NL Abstract We study the problem of verifying schedulability and ascertaining response time bounds of limited-preemptive parallel applications with uncertainty, scheduled on multi-core platforms. While sufficient techniques exist for analysing schedulability and response time of parallel applications under fixed-priority scheduling, their accuracy remains uncertain due to the lack of a scalable and exact analysis that can serve as a ground-truth to measure the pessimism of existing sufficient analyses. In this paper, we address this gap using formal methods. We use Timed Automata and the powerful UPPAAL verification engine to develop a generic approach to model parallel applications and provide a scalable and exact schedulability and response time analysis. This work establishes a benchmark for evaluating the accuracy of both existing and future sufficient analysis techniques. Furthermore, our solution is easily extendable to more complex task models thanks to its flexible model architecture./proceedings-archive/2024/DATA/1532_pdf_upload.pdf |
||
MONOMORPHISM-BASED CGRA MAPPING VIA SPACE AND TIME DECOUPLING Speaker: Cristian Tirelli, Università della Svizzera italiana, CH Authors: Cristian Tirelli, Rodrigo Otoni and Laura Pozzi, Università della Svizzera italiana, CH Abstract Coarse-Grain Reconfigurable Arrays (CGRAs) provide flexibility and energy efficiency in accelerating compute-intensive loops. Existing compilation techniques often struggle with scalability, unable to map code onto large CGRAs. To address this, we propose a novel approach to the mapping problem where the time and space dimensions are decoupled and explored separately. We leverage an SMT formulation to traverse the time dimension first, and then perform a monomorphism-based search to find a valid spatial solution. Experimental results show that our approach achieves the same mapping quality of state-of-the-art techniques while significantly reducing compilation time, with this reduction being particularly tangible when compiling for large CGRAs. We achieve approximately 10^5x average compilation speedup for the benchmarks evaluated on a 20x20 CGRA./proceedings-archive/2024/DATA/321_pdf_upload.pdf |
||
ATTENTIONLIB: A SCALABLE OPTIMIZATION FRAMEWORK FOR AUTOMATED ATTENTION ACCELERATION ON FPGA Speaker: Zhenyu Liu, Fudan University, CN Authors: Zhenyu Liu, Xilang Zhou, Faxian Sun, Jianli Chen, Jun Yu and Kun Wang, Fudan University, CN Abstract The self-attention mechanism is a fundamental component within transformer-based models. Nowadays, as the length of sequences processed by large language models (LLMs) continues to increase, the attention mechanism has gradually become a bottleneck in model inference. The LLM inference process can be separated into two phases: prefill and decode. The latter contains memory-intensive attention computation, making FPGA-based accelerators an attractive solution for acceleration. However, designing accelerators tailored for the attention module poses a challenge, requiring substantial manual work. To automate this process and achieve superior acceleration performance, we propose AttentionLib, an MLIR-based framework. AttentionLib automatically performs fusion dataflow optimization for attention computations and generates high-level synthesis code in compliance with hardware constraints. Given the large design space, we provide a design space exploration (DSE) engine to automatically identify optimal fusion dataflows within the specified constraints. Experimental results show that AttentionLib is effective in generating well-suited accelerators for diverse attention computations and achieving superior performance under hardware constraints. Notably, the accelerators generated by AttentionLib exhibit at least a 25.1× improvement compared to the baselines solely automatically optimized by Vitis HLS. Furthermore, these designs outperform GPUs in decode workloads, showcasing over a 2× speedup for short sequences./proceedings-archive/2024/DATA/830_pdf_upload.pdf |
||
ENSURING DATA FRESHNESS FOR IN-STORAGE COMPUTING WITH COOPERATIVE BUFFER MANAGER Speaker: Yang Guo, The Chinese University of Hong Kong, CN Authors: Jin Xue, Yuhong Song, Yang Guo and Zili Shao, The Chinese University of Hong Kong, HK Abstract In-storage computing (ISC) aims to mitigate the excessive data movement between the host memory and storage by offloading computation to storage devices for in-situ execution. However, ensuring data freshness remains a key challenge for practical ISC. For performance considerations, many data processing systems implement a buffer manager to cache part of the on-disk data in the host memory. While the host applications commit updates to the in-memory cached copies of the data, ISC operators offloaded to the device only have access to the on-disk persistent data. Thus, ISC may miss the most recent updates from the host and produce incorrect results after reading the stale and inconsistent data from the persistent storage. With this limitation, current ISC can only be used in read-only settings where the on-disk data are not subject to concurrent updates. To tackle this problem, we propose a cooperative buffer manager for ISC to transparently provide data freshness guarantees to host applications. Proposed methods allow the device to synchronize with the host buffer manager and decide whether to read the most recent copy of data from host memory or flash memory. We implement our method based on a real hardware platform and perform evaluation with a B+-tree based key-value store. Experiments show that our method can provide transparent data freshness for host applications with reduced latency./proceedings-archive/2024/DATA/169_pdf_upload.pdf |
||
EVALUATING COMPILER-BASED RELIABILITY WITH RADIATION FAULT INJECTION Speaker: Davide Baroffio, Politecnico di Milano, IT Authors: Davide Baroffio, Tomas López, Federico Reghenzani and William Fornaciari, Politecnico di Milano, IT Abstract Compiler-based fault tolerance is a cost-effective and flexible family of solutions that transparently improves software reliability. This paper evaluates a compiler tool for fault detection via laser injection and α-particle exposure. A novel memory allocation strategy is proposed to mitigate the effects of multi-bit upsets. We integrated the detection mechanism with a recovery solution based on mixed-criticality scheduling. The results demonstrate the error detection and recovery capabilities in realistic scenarios: reducing undetected errors, enhancing system reliability, and advancing software-implemented fault tolerance./proceedings-archive/2024/DATA/1043_pdf_upload.pdf |
||
UMBRA: AN EFFICIENT FRAMEWORK FOR TRUSTED EXECUTION ON MODERN TRUSTZONE-ENABLED MICROCONTROLLERS Speaker: Stefano Mercogliano, Università di Napoli Federico II, IT Authors: Stefano Mercogliano1 and Alessandro Cilardo2 1Università di Napoli Federico II, IT; 2University of Naples, Federico II, IT Abstract The rise of microcontrollers in critical systems demands robust security measures beyond traditional methods like Memory Protection Units. ARM's TrustZone-M offers enhanced protection for secure applications, yet its potential for deploying Trusted Execution Environments often remains untapped, leaving room for innovation in managing security on resource-constrained devices. This paper presents Umbra, a Rust-based framework that isolates mutually distrustful applications and integrates with untrusted embedded OSes. Leveraging modern security hardware, Umbra features an efficient secure caching mechanism that encrypts all code exposed to attackers, decrypting and validating only necessary blocks during execution, achieving practical Trusted Execution Environments on modern microcontrollers. Index Terms—ARM TrustZone-M, Trusted Execution Environment, Rust for Secure Development, Lightweight Security Mechanisms/proceedings-archive/2024/DATA/802_pdf_upload.pdf |
||
HARDWARE/SOFTWARE CO-ANALYSIS FOR WORST CASE EXECUTION TIME BOUNDS Speaker: Can Joshua Lehmann, Karlsruhe Institute of Technology, DE Authors: Can Lehmann1, Lars Bauer2, Hassan Nassar1, Heba Khdr1 and Joerg Henkel1 1Karlsruhe Institute of Technology, DE; 2Independent Scholar, DE Abstract Ensuring that safety-critical systems meet timing constraints is crucial to avoid disastrous failures. To verify that timing requirements are met, a worst-case execution time (WCET) bound is computed. However, traditional WCET tools require a predefined timing model for each target processor, which is not available when using custom instruction set extensions. We introduce a novel approach based on hardware-software co-analysis that employs an instrumented hardware description of the target processor, removing the requirement for a separate timing model. We demonstrate this approach by extending the FemtoRV32 Individua RISC-V processor with a custom instruction set extension and show that it accurately models the timing behavior of the resulting system./proceedings-archive/2024/DATA/516_pdf_upload.pdf |
TS14 Session 7 - D8
Add this session to my calendar
Date: Tuesday, 01 April 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
SPARSYNERGY: UNLOCKING FLEXIBLE AND EFFICIENT DNN ACCELERATION THROUGH MULTI-LEVEL SPARSITY Speaker: Jingkui Yang, National University of Defense Technology, CN Authors: Jingkui Yang1, Mei Wen1, Junzhong Shen2, Jianchao Yang1, Yasong Cao1, Jun He1, Minjin Tang3, Zhaoyun Chen1 and Yang Shi4 1National University of Defense Technology, CN; 2Key Laboratory of Advanced Microprocessor Chips and Systems, National University of Defense Technology, CN; 3National University of Defense Technology, Key Laboratory of Advanced Microprocessor Chips and Systems, CN; 4National Key Laboratory for Parallel and Distributed Processing and Department of Computer, National University of Defense Technology, CN Abstract To more effectively address the computational and memory requirements of deep neural networks (DNNs), leveraging multi-level sparsity---including value-level and bit-level sparsity---has emerged as a pivotal strategy. While substantial research has been dedicated to exploring value-level and bit-level sparsity individually, the combination of both has largely been overlooked until now. In this paper, we propose SparSynergy, which---to the best of our knowledge---is the first accelerator that synergistically integrates multi-level sparsity into a unified framework, maximizing computational efficiency and minimizing memory usage. However, jointly considering multi-level sparsity is non-trivial, as it presents several challenges: (1) increased hardware overhead due to the complexity of incorporating multiple sparsity levels, (2) bandwidth-intensive data transmission during multiplexing, and (3) decreased throughput and scalability caused by bottlenecks in bit-serial computation. Our proposed SparSynergy addresses these challenges by introducing a unified sparsity format and a co-optimized hardware design. Experimental results demonstrate that SparSynergy achieves a 5.38x geometric mean improvement in the energy-delay product (EDP) when compared with the tensor core, across workloads with varying degrees of sparsity. Furthermore, SparSynergy significantly improves accuracy retention compared to state-of-the-art accelerators for representative DNNs./proceedings-archive/2024/DATA/553_pdf_upload.pdf |
||
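To make the two sparsity levels concrete, the worked example below uses generic shift-and-add arithmetic: zero operands are skipped entirely (value-level sparsity) and non-zero weights only spend work on their set bits (bit-level sparsity). This illustrates the general principle, not SparSynergy's unified format or dataflow.

```python
# Worked example of value-level plus bit-level sparsity in a dot product.
def sparse_dot(activations, weights):
    acc, value_ops, bit_ops = 0, 0, 0
    for a, w in zip(activations, weights):
        if a == 0 or w == 0:
            continue                     # value-level sparsity: skip the MAC
        value_ops += 1
        for bit in range(8):             # bit-level sparsity: only set bits
            if (w >> bit) & 1:
                acc += a << bit
                bit_ops += 1
    return acc, value_ops, bit_ops

acts = [3, 0, 7, 0, 1]
wgts = [5, 9, 0, 4, 130]                 # 5 = 0b101, 130 = 0b10000010
result, v_ops, b_ops = sparse_dot(acts, wgts)
print(result, v_ops, b_ops)              # 3*5 + 1*130 = 145, 2 MACs, 4 bit-ops
```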
PS-GS: GROUP-WISE PARALLEL RENDERING WITH STAGE-WISE COMPLEXITY REDUCTIONS FOR REAL-TIME 3D GAUSSIAN SPLATTING Speaker: Joongho Jo, Korea University, KR Authors: Joongho Jo and Jongsun Park, Korea University, KR Abstract 3D Gaussian Splatting (3D-GS) is an emerging rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and image quality. Despite its advantages, running 3D-GS on mobile or edge devices in real-time remains challenging due to its large computational complexity. In this paper, we introduce PS-GS, specialized low-complexity hardware designed to enhance the pipeline parallelism of the 3D-GS rendering process. In this work, we first observe that 3D-GS rendering can be parallelized when the approximate order of Gaussians, from those closest to the camera to those farthest, is known in advance. However, to enhance 3D-GS rendering speed via parallel processing, an efficient viewpoint-adaptive grouping method with low computational costs is essential. Two key computational bottlenecks of viewpoint-adaptive grouping are the grouping of invisible Gaussians and depth-based sorting. For efficient group-wise parallel rendering with low-complexity viewpoint-adaptive grouping, we propose three key techniques—cluster-based preprocessing, sorting, and grouping—all seamlessly incorporated into the PS-GS architecture. Our experimental results demonstrate that PS-GS delivers an average speedup of 1.20× with negligible peak signal-to-noise ratio (PSNR) degradation./proceedings-archive/2024/DATA/1426_pdf_upload.pdf |
||
TXISC: TRANSACTIONAL FILE PROCESSING IN COMPUTATIONAL SSDS Speaker: Penghao Sun, Shanghai Jiao Tong University, CN Authors: Penghao Sun1, Shengan Zheng1, Kaijiang Deng1, Guifeng Wang1, Jin Pu1, Jie Yang2, Maojun Yuan2, Feng Zhu2, Shu Li2 and Linpeng Huang1 1Shanghai Jiao Tong University, CN; 2Alibaba Group, CN Abstract Computational SSDs implement the in-storage computing (ISC) paradigm and benefit applications by taking over I/O-intensive tasks from the host. Existing works have proposed various frameworks aiming at easy access to ISC functionalities, and among them generic frameworks with file-based abstractions offer better usability. However, since intermediate output by ISC tasks may leave files in a dirty state, concurrent access to and the integrity of file data should be properly managed, which has not been fully addressed. In this paper, we present TxISC, a generic ISC framework that coordinates the host kernel and device firmware to offer a versatile file-based programming model. Under the hood, TxISC turns each invocation of an ISC task into a transaction with full ACID guarantee, fully covering concurrency control and data protection. TxISC implements transactions at low cost by leveraging the out-of-place write characteristic of NAND flash. Evaluation on full-stack hardware shows that transactions incur almost no runtime performance penalty compared with existing ISC architectures. Application case studies demonstrate that the programming model of TxISC can be used to offload complex logic and deliver significant speedup over host-only solutions./proceedings-archive/2024/DATA/738_pdf_upload.pdf |
||
ARAXL: A PHYSICALLY SCALABLE, ULTRA-WIDE RISC-V VECTOR PROCESSOR DESIGN FOR FAST AND EFFICIENT COMPUTATION ON LONG VECTORS Speaker: Navaneeth Kunhi Purayil, ETH Zurich, CH Authors: Navaneeth Kunhi Purayil1, Matteo Perotti1, Tim Fischer1 and Luca Benini2 1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT Abstract The ever-growing scale of data parallelism in today's HPC and ML applications presents a big challenge for computing architectures' energy efficiency and performance. Vector processors address the scale-up challenge by decoupling Vector Register File (VRF) and datapath widths, allowing the VRF to host long vectors and increase register-stored data reuse while reducing the relative cost of instruction fetch and decode. However, even the largest vector processor designs today struggle to scale to more than 8 vector lanes with double-precision Floating Point Units (FPUs) and 256 64-bit elements per vector register. This limitation is induced by difficulties in the physical implementation, which becomes wire-dominated and inefficient. In this work, we present AraXL, a modular and scalable 64-bit RISC-V V vector architecture targeting long-vector applications for HPC and ML. AraXL addresses the physical scalability challenges of state-of-the-art vector processors with a distributed and hierarchical interconnect, supporting up to 64 parallel vector lanes and reaching the maximum Vector Register File size of 64 Kibit/vreg permitted by the RISC-V V 1.0 ISA specification. Implemented in a 22-nm technology node, our 64-lane AraXL achieves a performance peak of 146 GFLOPs on computation-intensive HPC/ML kernels (>99% FPU utilization) and energy efficiency of 40.1 GFLOPs/W (1.15 GHz, TT, 0.8V), with only 3.8x the area of a 16-lane instance./proceedings-archive/2024/DATA/882_pdf_upload.pdf |
||
PERFORMANCE IMPLICATIONS OF MULTI-CHIPLET NEURAL PROCESSING UNITS ON AUTONOMOUS DRIVING PERCEPTION Speaker: Luke Chen, University of California, Irvine, US Authors: Mohanad Odema, Luke Chen, Hyoukjun Kwon and Mohammad Al Faruque, University of California, Irvine, US Abstract We study the application of emerging chiplet-based Neural Processing Units to accelerate vehicular AI perception workloads in constrained automotive settings. The motivation stems from how chiplet technology is becoming integral to emerging vehicular architectures, providing a cost-effective trade-off between performance, modularity, and customization; and from perception models being the most computationally demanding workloads in an autonomous driving system. Using the Tesla Autopilot perception pipeline as a case study, we first break down its constituent models and profile their performance on different chiplet accelerators. From the insights, we propose a novel scheduling strategy to efficiently deploy perception workloads on multi-chip AI accelerators. Our experiments using a standard DNN performance simulator, MAESTRO, show our approach realizes 82% and 2.8× increases in throughput and processing engine utilization compared to monolithic accelerator designs./proceedings-archive/2024/DATA/883_pdf_upload.pdf |
||
LT-OAQ: LEARNABLE THRESHOLD BASED OUTLIER-AWARE QUANTIZATION AND ITS ENERGY-EFFICIENT ACCELERATOR FOR LOW-PRECISION ON-CHIP TRAINING Speaker: Qinkai Xu, Nanjing University, CN Authors: Qinkai Xu, Yijin Liu, Yuan Meng, Yang Chen, Yunlong Mao, Li Li and Yuxiang Fu, Nanjing University, CN Abstract Low-precision training has emerged as a powerful technique for reducing computational and storage costs in Deep Neural Network (DNN) model training, enabling on-chip training or fine-tuning on edge devices. However, existing low-precision training methods often require higher bit-widths to maintain accuracy as model sizes increase. In this paper, we introduce an outlier-aware quantization strategy for low-precision training. While traditional value-aware quantization methods require costly online distribution statistics operations on computational data, impeding the efficiency gains of low-precision training, our approach addresses this challenge through a novel Learnable Threshold based Outlier-Aware Quantization (LT-OAQ) training framework. This method concurrently updates outlier thresholds and model weights through gradient descent, eliminating the need for costly data-statistics operations. To efficiently support the LT-OAQ training framework, we designed a hardware accelerator based on the systolic array architecture. This accelerator introduces a processing element (PE) fusion mechanism that dynamically fuses adjacent PEs into clusters to support outlier computations, optimizing the mapping of outlier computation tasks, enabling mixed-precision training, and implementing online quantization. Our approach maintains model accuracy while significantly reducing computational complexity and storage resource requirements. Experimental results demonstrate that our design achieves a 2.9x speedup in performance and a 2.17x reduction in energy consumption compared to state-of-the-art low-precision accelerators./proceedings-archive/2024/DATA/943_pdf_upload.pdf |
||
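A simplified numpy rendering of outlier-aware quantization as described above: values below a threshold are mapped to low-bit integers while the rare outliers keep higher precision. In LT-OAQ the threshold is learned jointly with the weights via gradient descent; here it is a fixed constant so the mechanics stay visible.

```python
# Simplified outlier-aware quantization with a fixed (not learned) threshold.
import numpy as np

def outlier_aware_quantize(x, threshold, inlier_bits=4):
    outlier_mask = np.abs(x) > threshold
    qmax = 2 ** (inlier_bits - 1) - 1
    scale = threshold / qmax
    inliers_q = np.clip(np.round(x / scale), -qmax - 1, qmax)   # low-bit integer grid
    dequant = np.where(outlier_mask, x, inliers_q * scale)      # outliers kept as-is
    return dequant, outlier_mask.mean()

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.05, size=1000)
w[:5] = [0.9, -0.8, 0.7, 1.1, -0.95]          # a few large outliers
w_hat, outlier_ratio = outlier_aware_quantize(w, threshold=0.2)
print("outlier ratio:", outlier_ratio, "max error:", np.abs(w - w_hat).max())
```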
LIGNN: ACCELERATING GNN TRAINING THROUGH LOCALITY-AWARE DROPOUT Speaker: Gongjian Sun, SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, CN Authors: Gongjian Sun1, Mingyu Yan2, Dengke Han3, Runzhen Xue4, Xiaochun Ye1 and Dongrui Fan1 1SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN; 3SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN; 4State Key Lab of Processors, Institute of Computing Technology, CAS; School of Computer Science and Technology, University of Chinese Academy of Sciences, CN Abstract Graph Neural Networks (GNNs) have demonstrated significant success in graph learning and are widely adopted across various critical domains. However, the irregular connectivity between vertices leads to inefficient neighbor aggregation, resulting in substantial irregular and coarse-grained DRAM accesses. This lack of data locality presents significant challenges for execution platforms, ultimately degrading performance. While previous accelerator designs have leveraged on-chip memory and data access scheduling strategies to address this issue, they still inevitably access features at irregular addresses from DRAM. In this work, we propose LiGNN, a hardware-based solution that enhances locality and applies dropout to aggregation to accelerate GNN training. Unlike algorithmic dropout approaches that primarily focus on improving accuracy and neglect hardware costs, LiGNN is specifically designed to drop graph features with data locality awareness, directly targeting the reduction of irregular DRAM accesses while maintaining accuracy. LiGNN introduces locality-aware ordering and a DRAM row integrity policy, enabling configurable burst and row-granularity dropout at the DRAM level. This approach improves data locality and ensures more efficient DRAM access. Compared to state-of-the-art methods, under a typical 0.5 drop rate, LiGNN achieves a 1.6~2.2x speedup, reduces DRAM accesses by 44~50% and DRAM row activations by 41~82%, all without losing accuracy./proceedings-archive/2024/DATA/1081_pdf_upload.pdf |
||
COUPLEDCB: ELIMINATING WASTED PAGES IN COPYBACK-BASED GARBAGE COLLECTION FOR SSDS Speaker: Jun Li, Nanjing University of Posts and Telecommunications, CN Authors: Jun Li1, Xiaofei Xu2, Zhibing Sha3, Xiaobai Chen1, Jieming Yin1 and Jianwei Liao4 1Nanjing University of Posts and Telecommunications, CN; 2RMIT University, AU; 3Southwest University, CN; 4Southwest University of China, CN Abstract The management of garbage collection poses significant challenges in high-density NAND flash-based SSDs. The introduction of the copyback command aims to expedite the migration of valid data. However, its odd/even constraint causes wasted pages during migrations, limiting the efficiency of garbage collection. Additionally, while full-sequence programming enhances write performance in high-density SSDs, it increases write granularity and exacerbates the issue of wasted pages. To address the problem of wasted pages, we propose a novel method called CoupledCB, which utilizes coupled blocks to fill up the wasted space in copyback-based garbage collection. By taking into account the access characteristics of the candidate coupled blocks and workloads, we develop a coupled block selection model assisted by logistic regression. Experimental results show that our proposal significantly enhances garbage collection efficiency and I/O performance compared to state-of-the-art schemes./proceedings-archive/2024/DATA/1459_pdf_upload.pdf |
||
LIGHTMAMBA: EFFICIENT MAMBA ACCELERATION ON FPGA WITH QUANTIZATION AND HARDWARE CO-DESIGN Speaker: Renjie Wei, Peking University, CN Authors: Renjie Wei, Songqiang Xu, Linfeng Zhong, Zebin Yang, Qingyu Guo, Yuan Wang, Runsheng Wang and Meng Li, Peking University, CN Abstract State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computation complexity with the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator get drastically improved. We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65∼6.06× higher energy efficiency over the GPU baseline. When evaluated on Alveo U280 FPGA, LightMamba reaches 93 tokens/s, which is 1.43× that of the GPU baseline./proceedings-archive/2024/DATA/1180_pdf_upload.pdf |
||
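As background for the power-of-two quantization mentioned in the LightMamba abstract above, the generic sketch below snaps values to the nearest power of two so that multiplications reduce to shifts; it does not reproduce LightMamba's rotation-assisted scheme or its SSM-specific handling, and the exponent range is an arbitrary choice.

```python
# Generic power-of-two quantization sketch (illustrative only).
import numpy as np

def quantize_pow2(x, min_exp=-8, max_exp=0):
    sign = np.sign(x)
    exp = np.clip(np.round(np.log2(np.abs(x) + 1e-12)), min_exp, max_exp)
    return sign * np.power(2.0, exp)

a = np.array([0.030, -0.26, 0.55, 0.11])
a_q = quantize_pow2(a)
print(a_q)                            # -> 0.03125, -0.25, 0.5, 0.125
print(np.abs(a - a_q) / np.abs(a))    # per-element relative error of the snap
```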
EVALUATING IOMMU-BASED SHARED VIRTUAL ADDRESSING FOR RISC-V EMBEDDED HETEROGENEOUS SOCS Speaker: Cyril Koenig, ETH Zurich, CH Authors: Cyril Koenig, Enrico Zelioli and Luca Benini, ETH Zurich, CH Abstract Embedded heterogeneous Systems-on-Chips (SoCs) rely on domain-specific hardware accelerators to improve performance and energy efficiency. In particular, programmable multicore accelerators feature a cluster of processing elements and tightly coupled scratchpad memories to balance performance, energy efficiency, and flexibility. In embedded systems running a general-purpose OS, accelerators access data via dedicated, physically addressed memory regions. This negatively impacts memory utilization and performance by requiring a copy from the virtual host address to the physical accelerator address space. Input-Output Memory Management Units (IOMMUs) overcome this limitation by allowing devices and hosts to use a shared virtual, paged address space. However, resolving IO virtual addresses can be particularly costly on high-latency memory systems as it requires up to three sequential memory accesses on IOTLB miss. In this work, we present a quantitative evaluation of shared virtual addressing in RISC-V heterogeneous embedded systems. We integrate an IOMMU in an open source heterogeneous RISC-V SoC consisting of a 64-bit host with a 32-bit accelerator cluster. We evaluate the system performance by emulating the design on FPGA and implementing compute kernels from the RajaPERF benchmark suite using heterogeneous OpenMP programming. We measure transfers and computation time on the host and accelerators for systems with different DRAM access latencies. We first show that IO virtual address translation can account for 4.2% up to 17.6% of the accelerator's runtime for GEMM (General Matrix Multiplication) at low and high memory bandwidth. Then, we show that in systems containing a last-level cache, this IO address translation cost falls to 0.4% and 0.7% under the same conditions, making shared-virtual addressing and zero-copy offloading suitable for such RISC-V heterogeneous SoCs./proceedings-archive/2024/DATA/1304_pdf_upload.pdf |
TS15 Session 25 - A1+D8+E4
Add this session to my calendar
Date: Tuesday, 01 April 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
LESS IS MORE: OPTIMIZING FUNCTION CALLING FOR LLM EXECUTION ON EDGE DEVICES Speaker: Iraklis Anagnostopoulos, Southern Illinois University Carbondale, US Authors: Varatheepan Paramanayakam1, Andreas Karatzas2, Iraklis Anagnostopoulos2 and Dimitrios Stamoulis3 1Southern Illinois University, US; 2Southern Illinois University Carbondale, US; 3The University of Texas at Austin, US Abstract The advanced function-calling capabilities of foundation models open up new possibilities for deploying agents to perform complex API tasks. However, managing large amounts of data and interacting with numerous APIs makes function calling hardware-intensive and costly, especially on edge devices. Current Large Language Models (LLMs) struggle with function calling at the edge because they cannot handle complex inputs or manage multiple tools effectively. This results in low task-completion accuracy, increased delays, and higher power consumption. In this work, we introduce Less-is-More, a novel fine-tuning-free function-calling scheme for dynamic tool selection. Our approach is based on the key insight that selectively reducing the number of tools available to LLMs significantly improves their function-calling performance, execution time, and power efficiency on edge devices. Experimental results with state-of-the-art LLMs on edge hardware show agentic success rate improvements, with execution time reduced by up to 70% and power consumption by up to 40%./proceedings-archive/2024/DATA/611_pdf_upload.pdf |
||
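The key idea of pruning the tool list before prompting can be pictured with a trivial keyword-overlap ranking, as below; the actual selection criterion in Less-is-More is not reproduced here, and the tool names and descriptions are invented for the example.

```python
# Illustrative tool shortlisting before building a function-calling prompt.
TOOLS = {
    "get_weather": "current weather forecast temperature city",
    "send_email": "send email message recipient subject",
    "calendar_add": "add calendar event meeting schedule date time",
    "unit_convert": "convert units length weight temperature",
    "play_music": "play music song artist playlist",
}

def select_tools(query: str, k: int = 2):
    q = set(query.lower().split())
    scored = {name: len(q & set(desc.split())) for name, desc in TOOLS.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

query = "schedule a meeting with Ana next Tuesday at 10"
shortlist = select_tools(query, k=2)
print(shortlist)        # only these tool schemas would go into the prompt
```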
SSMDVFS: MICROSECOND-SCALE DVFS BASED ON SUPERVISED AND SELF-CALIBRATED ML ON GPGPUS Speaker: Minqing Sun, Shanghai Jiao Tong University, CN Authors: Minqing Sun1, Ruiqi Sun1, Yingtao Shen1, Wei Yan2, Qinfen Hao2 and An Zou1 1Shanghai Jiao Tong University, CN; 2The Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract Over the past decade, as GPUs have evolved to achieve higher computational performance, their power density has also accelerated. Consequently, improving energy efficiency and reducing power consumption has become critically important. Dynamic voltage and frequency scaling (DVFS) is an effective technique for enhancing energy efficiency. With the advent of integrated voltage regulators, DVFS can now operate on microsecond (µs) timescales. However, developing a practical and effective strategy to guide rapid DVFS remains a significant challenge. This paper proposes a supervised and self-calibrated machine learning framework (SSMDVFS) to guide microsecond-scale GPU voltage and frequency scaling. This framework features an end-to-end design that encompasses data generation, neural network model design, training, compression, and final runtime calibration. Unlike analytical models, which struggle to accurately represent GPU architectures, and reinforcement learning approaches, which can be challenging to converge during runtime, the SSMDVFS offers a practical solution for guiding microsecond-scale voltage and frequency scaling. Experimental results demonstrate that the proposed framework improves energy-delay product (EDP) by 11.09% and outperforms analytical models and reinforcement learning approaches by 13.17% and 36.80%, respectively./proceedings-archive/2024/DATA/715_pdf_upload.pdf |
||
A 3D DESIGN METHODOLOGY FOR INTEGRATED WEARABLE SOCS: ENABLING ENERGY EFFICIENCY AND ENHANCED PERFORMANCE AT ISO-AREA FOOTPRINT Speaker: Ekin Sumbul, Meta, US Authors: H. Ekin Sumbul1, Arne Symons2, Lita Yang2, Huichu Liu2, Tony Wu2, Matheus Trevisan Moreira2, Debabrata Mohapatra2, Abhinav Agarwal2, Kaushik Ravindran2, Chris Thompson2, Yuecheng Li2 and Edith Beigne2 1Meta, US; 2META, US Abstract Augmented Reality (AR) System-on-Chips (SoCs) have strict power budgets and form-factor limitations for wearable, all-day use AR glasses running high-performance applications. Limited compute and memory resources that can fit within the strict industrial design area footprint of an AR SoC, however, create performance bottlenecks for demanding workloads such as Pixel Codec Avatars (PiCA) group-calling which connects multiple users with their photorealistic representations. To alleviate this unique wearables challenge, 3D integration with hybrid-bonding technology offers energy-efficient 3D stacking of more silicon resources within the same SoC footprint. Implementing such 3D architectures, however, is another challenge as current EDA tools and flows offer limited 3D design control. In this work, we present a 3D design methodology for robust 3D clock network and datapath design using current EDA tools. To validate the proposed methodology, we implemented a 3D integrated prototype AR SoC housing a 3D-stacked Machine Learning (ML) accelerator utilizing TSMC SoIC™bonding technology. Silicon measurements demonstrate that the 3D ML accelerator enables running PiCA AR group call at 30 frames-per-second (fps) by 3D-expanding its memory resources by 4× to achieve 2× better energy-efficiency when compared to a 2D baseline accelerator at iso-footprint./proceedings-archive/2024/DATA/885_pdf_upload.pdf |
||
A LOW-POWER MIXED-PRECISION INTEGRATED MULTIPLY-ACCUMULATE ARCHITECTURE FOR QUANTIZED DEEP NEURAL NETWORKS Speaker: Hu Xiaolu, Department of Micro-Nano Electronics, Shanghai Jiao Tong University, Shanghai, China, CN Authors: Xiaolu Hu1, Xinkuang Geng1, Zhigang Mao2, Jie Han3 and Honglan Jiang1 1Shanghai Jiao Tong University, CN; 2Department of Micro-Nano Electronics, CN; 3University of Alberta, CA Abstract As mixed-precision quantization techniques have been widely considered for balancing computational efficiency and flexibility in quantized deep neural networks (DNNs), mixed-precision multiply-accumulate (MAC) units are increasingly important in DNN accelerators. However, conventional mixed-precision MAC architectures support either signed×signed or unsigned×unsigned multiplications. Signed×unsigned multiplication, which enhances the computing efficiency of DNNs with ReLU activations, has never been considered in the design of mixed-precision MACs. Thus, this work proposes a mixed-precision MAC architecture supporting six operation modes: int8×int8, int8×uint8, two int4×int4, two int4×uint4, four int2×int2, and four int2×uint2. In this design, to balance the power and delay of different modes, the multiplication is implemented based on four precision-split 4×4 multipliers (PS4Ms). The accumulation is integrated into the partial product accumulation of the multiplication to eliminate redundant switching activities in separate compression. With a 10% area reduction, the proposed MAC, denoted as PS4MAC, reduces the power by over 35%, 42%, and 56% for 8-bit, 4-bit, and 2-bit operations, respectively, compared with the design based on the Synopsys DesignWare (DW) multipliers. Additionally, it achieves over 23% power savings for 8-bit operations compared to state-of-the-art (SotA) mixed-precision MAC designs. To save more power, an approximate computing mode for 8-bit multiplication is further designed, resulting in a MAC unit enabling eight operation modes, referred to as PS4MAC_AP. Finally, output-stationary systolic arrays (SAs) are explored using the above-mentioned MAC designs to implement DNNs operating under a 1 GHz clock. Our designs show the highest energy efficiency and outstanding area efficiency in all 8-bit, 4-bit, and 2-bit operation modes. Compared with the traditional SA with high-precision-split multipliers, PS4MAC_AP improves the energy efficiency for 8-bit operations by 0.6 TOPS/W, and PS4MAC achieves 0.4 TOPS/W - 0.7 TOPS/W improvement for all operation modes./proceedings-archive/2024/DATA/1246_pdf_upload.pdf |
||
FEDERATED REINFORCEMENT LEARNING FOR OPTIMIZING THE POWER EFFICIENCY OF EDGE DEVICES Speaker: Benedikt Dietrich, Karlsruhe Institute of Technology, DE Authors: Benedikt Dietrich1, Rasmus Müller-Both2, Heba Khdr3 and Joerg Henkel3 1Chair for Embedded Systems, Karlsruhe Institute of Technology, DE; 2-, DE; 3Karlsruhe Institute of Technology, DE Abstract Reinforcement learning (RL) holds great promise for adaptively optimizing microprocessor performance under power constraints. It allows for online learning of application characteristics at runtime and enables adjustment to varying system dynamics such as changes in the workload, user preferences or ambient conditions. However, online policy optimization remains resource-intensive, with high computational demand and a need for many samples to converge, making it challenging to deploy to edge devices. In this work, we overcome both of these obstacles and present federated power control using dynamic voltage and frequency scaling (DVFS). Our technique leverages federated RL and enables multiple independent power controllers running on separate devices to collaboratively train a shared DVFS policy, consolidating experience from a multitude of different applications, while ensuring that no privacy-sensitive information leaves the devices. This leads to faster convergence and to increased robustness of the learned policies. We show that our federated power control achieves 57% average performance improvements over a policy that is only trained on local data. Compared to a state-of-the-art collaborative power control, our technique leads to 22% better performance on average for the running applications under the same power constraint./proceedings-archive/2024/DATA/1254_pdf_upload.pdf |
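The aggregation step at the heart of such federated schemes can be sketched as plain federated averaging over policy parameters. The snippet below is our own illustration under that assumption (sample-count-weighted FedAvg over per-device weights), not the authors' training loop.

```python
import numpy as np

# Minimal FedAvg-style aggregation sketch for a shared DVFS policy.
# Each edge device trains locally and only uploads its policy parameters
# plus a sample count; no raw power or workload traces leave the device.

def federated_average(local_params, sample_counts):
    """local_params: list of dicts {layer_name: ndarray}; sample_counts: list of int."""
    total = float(sum(sample_counts))
    global_params = {}
    for name in local_params[0]:
        global_params[name] = sum(
            (n / total) * p[name] for p, n in zip(local_params, sample_counts)
        )
    return global_params

# Toy usage: three devices with a single-layer linear policy.
rng = np.random.default_rng(0)
devices = [{"w": rng.normal(size=(4, 3)), "b": rng.normal(size=3)} for _ in range(3)]
counts = [120, 80, 200]   # local experience collected per device
shared = federated_average(devices, counts)
print(shared["w"].shape, shared["b"].shape)
```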
||
AXON: A NOVEL SYSTOLIC ARRAY ARCHITECTURE FOR IMPROVED RUN TIME AND ENERGY EFFICIENT GEMM AND CONV OPERATION WITH ON-CHIP IM2COL Speaker: Md Mizanur Rahaman Nayan, Georgia Tech, US Authors: Md Mizanur Rahaman Nayan1, Ritik Raj1, Gouse Shaik Basha1, Tushar Krishna2 and Azad J Naeemi1 1Georgia Tech, US; 2Georgia Tech, US Abstract General matrix multiplication (GeMM) is a core operation in virtually all AI applications. Systolic array (SA) based architectures have shown great promise as GeMM hardware accelerators thanks to their speed and energy efficiency. Unfortunately, SAs incur a linear delay in filling the operands, due to unidirectional propagation via pipeline latches. In this work, we propose a novel in-array data orchestration technique in SAs where we enable data feeding on the principal diagonal followed by bi-directional propagation. This improves the runtime by up to 2× at minimal hardware overhead. In addition, the proposed data orchestration enables convolution lowering (known as im2col) using simple hardware support to fully exploit input feature map reuse opportunities and significantly lower the off-chip memory traffic, resulting in 1.2× throughput improvement and 2.17× inference energy reduction during YOLOv3 and ResNet50 workloads on average. In contrast, conventional data orchestration would require more elaborate hardware and control signals to implement im2col in hardware because of the data skew. We have synthesized and conducted place and route for 16×16 systolic arrays based on the novel and conventional orchestrations using the ASAP 7nm PDK and found that our proposed approach results in 0.211% area and 1.6% power overheads./proceedings-archive/2024/DATA/128_pdf_upload.pdf |
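The convolution-lowering step the abstract refers to, im2col, is a standard transformation; a minimal NumPy sketch for stride-1, unpadded convolution is shown below so the lowered matrix can be fed to a GeMM engine such as a systolic array. This is a generic illustration, not the paper's on-chip im2col hardware.

```python
import numpy as np

def im2col(x, kh, kw):
    """Lower a (C, H, W) input to a (C*kh*kw, out_h*out_w) matrix (stride 1, no padding)."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # each row of `cols` is one (channel, kernel-offset) slice of the input
                cols[idx] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                idx += 1
    return cols

# Convolution as GeMM: weights reshaped to (out_channels, C*kh*kw).
x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(16, 3, 3, 3).astype(np.float32)
out = (w.reshape(16, -1) @ im2col(x, 3, 3)).reshape(16, 6, 6)

# Cross-check one output position against a direct convolution.
ref = np.sum(x[:, 0:3, 0:3] * w[0])
assert np.allclose(out[0, 0, 0], ref, atol=1e-4)
print(out.shape)
```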
||
TEMPUS CORE: AREA-POWER EFFICIENT TEMPORAL-UNARY CONVOLUTION CORE FOR LOW-PRECISION EDGE DLAS Speaker: Prabhu Vellaisamy, Carnegie Mellon University, US Authors: Prabhu Vellaisamy1, Harideep Nair1, Thomas Kang1, Yichen Ni1, Haoyang Fan1, Bin Qi1, Hsien-Fu Hung1, Jeff Chen1, Shawn Blanton1 and John Shen2 1Carnegie Mellon University, US; 2Carnegie Mellon University, US Abstract The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep learning accelerators (DLA) remain limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with a highly scalable unary-based PE array comprising tub (temporal-unary-binary) multipliers that seamlessly integrates with the NVDLA (NVIDIA's open-source DLA for accelerating CNNs) while maintaining dataflow compliance and boosting hardware efficiency. Analysis across various datapath granularities shows that for INT8 precision in 45nm CMOS, Tempus Core's PE cell unit (PCU) yields 59.3% and 15.3% reductions in area and power consumption, respectively, over NVDLA's CMAC unit. Considering a 16x16 PE array in Tempus Core, area and power improve by 75% and 62%, respectively, while delivering 5x and 4x iso-area throughput improvements for INT8 and INT4 precisions. Post-place and route analysis of Tempus Core's PCU shows that the 16x4 PE array for INT4 precision in 45nm CMOS requires only 0.017mm^2 die area and consumes only 6.2mW of total power. We demonstrate that area-power efficient unary-based hardware can be seamlessly integrated into conventional DLAs, paving the path for efficient unary hardware for edge AI inference./proceedings-archive/2024/DATA/1115_pdf_upload.pdf |
||
ADAPTIVE MULTI-THRESHOLD ENCODING FOR ENERGY-EFFICIENT ECG CLASSIFICATION ARCHITECTURE USING SPIKING NEURAL NETWORK Speaker: Mohammad Amin Yaldagard, TU Delft, NL Authors: Sumit Diware, Yingzhou Dong, Mohammad Amin Yaldagard and Rajendra Bishnoi, TU Delft, NL Abstract Timely identification of cardiac arrhythmia (abnormal heartbeats) is vital for early diagnosis of cardiovascular diseases. Wearable healthcare devices facilitate this process by recording heartbeats through electrocardiogram (ECG) signals and using AI-driven hardware to classify them into arrhythmia classes. Spiking neural networks (SNNs) are well-suited for such hardware as they consume low energy due to event-driven operation. However, their energy-efficiency is constrained by encoding methods that translate real-valued ECG data into spikes. In this paper, we present an SNN-based ECG classification architecture featuring a new adaptive multi-threshold spike encoding scheme. This scheme adjusts encoding window and granularity based on the importance of ECG data samples, to capture essential information with fewer spikes. We develop a high-accuracy SNN model for such spike representation, by proposing a technique specifically tailored to our encoding. We design a hardware architecture for this model, which incorporates optimized layer post-processing for energy-efficient data-flow and employs fixed-point quantization for computational efficiency. Moreover, we integrate this architecture with our encoding scheme into a system-on-chip implementation using TSMC 40nm technology. Results show that our proposed approach achieves better energy-efficiency compared to state-of-the-art, with high ECG classification accuracy./proceedings-archive/2024/DATA/1216_pdf_upload.pdf |
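As a rough, non-adaptive illustration of multi-threshold spike encoding (the paper's scheme additionally adapts the window and granularity to sample importance), the sketch below emits one spike per threshold level a sample crosses. The threshold ladder and toy signal are our own choices.

```python
import numpy as np

def multi_threshold_encode(signal, thresholds):
    """Encode a 1-D real-valued signal into a (levels, T) binary spike raster.

    spikes[k, t] = 1 if |signal[t]| exceeds thresholds[k]; an adaptive scheme
    would additionally tune the thresholds/window per ECG segment importance.
    """
    signal = np.asarray(signal, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    return (np.abs(signal)[None, :] > thresholds[:, None]).astype(np.uint8)

# Toy ECG-like segment: a flat baseline with one sharp peak.
t = np.linspace(0, 1, 200)
ecg = 0.05 * np.sin(2 * np.pi * 3 * t)
ecg[95:105] += np.hanning(10)                  # stand-in for an R-peak

raster = multi_threshold_encode(ecg, thresholds=[0.1, 0.3, 0.6, 0.9])
print("spikes per level:", raster.sum(axis=1))  # the peak region produces all spikes
```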
||
LOWGRADQ: ADAPTIVE GRADIENT QUANTIZATION FOR LOW-BIT CNN TRAINING VIA KERNEL DENSITY ESTIMATION-GUIDED THRESHOLDING AND HARDWARE-EFFICIENT STOCHASTIC ROUNDING UNIT Speaker: Sangbeom Jeong, Seoul National University of Science and Technology, KR Authors: Sangbeom Jeong1, Seungil Lee1 and Hyun Kim2 1Seoul National University of Science and Technology, Department of Electrical and Information Engineering, KR; 2Seoul National University of Science and Technology, KR Abstract This paper proposes a hardware-efficient INT8 training framework with dual-scale adaptive gradient quantization (DAGQ) to cope with the growing need for efficient on-device CNN training. DAGQ captures both small- and large-magnitude gradients, ensuring robust low-bit training with minimal quantization error. Additionally, to reduce the computational and memory demands of stochastic rounding in low-bit training, we introduce a reusable LFSR-based stochastic rounding unit (RLSRU), which efficiently generates and reuses random numbers, minimizing hardware complexity. The proposed framework achieves stable INT8 training across various networks with minimal accuracy loss while being implementable on RTL-based hardware accelerators, making it well-suited for resource-constrained environments./proceedings-archive/2024/DATA/199_pdf_upload.pdf |
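LFSR-driven stochastic rounding can be sketched in a few lines; the 16-bit Fibonacci LFSR taps and the fixed-point grid below are illustrative choices, not the RLSRU design.

```python
def lfsr16(state):
    """One step of the classic 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def stochastic_round(x, frac_bits, state):
    """Round x onto a grid of 2**-frac_bits steps, rounding up with probability
    equal to the discarded fraction. Returns (rounded value, new LFSR state)."""
    scaled = x * (1 << frac_bits)
    floor_val = int(scaled // 1)
    frac = scaled - floor_val
    state = lfsr16(state)
    rand01 = state / float(1 << 16)            # pseudo-random draw in [0, 1)
    rounded = floor_val + (1 if rand01 < frac else 0)
    return rounded / float(1 << frac_bits), state

# Usage: repeated stochastic rounding is unbiased on average.
state, acc, n = 0xACE1, 0.0, 10000
for _ in range(n):
    r, state = stochastic_round(0.3, frac_bits=2, state=state)   # 0.25-wide grid
    acc += r
print(round(acc / n, 3))   # close to 0.3, unlike round-to-nearest (0.25)
```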
||
PFASWARE: QUANTIFYING THE ENVIRONMENTAL IMPACT OF PER- AND POLYFLUOROALKYL SUBSTANCES (PFAS) IN COMPUTING SYSTEMS Speaker: Mariam Elgamal, Harvard University, US Authors: Mariam Elgamal1, Abdulrahman Mahmoud2, Gu-Yeon Wei1, David Brooks1 and Gage Hills1 1Harvard University, US; 2Mohamed bin Zayed University of Artificial Intelligence, AE Abstract PFAS (per- and poly-fluoroalkyl substances), also known as forever chemicals, are widely used in electronics and semiconductor manufacturing. PFAS are environmentally persistent and bioaccumulative synthetic chemicals, which have recently received considerable regulatory attention. Manufacturing semiconductors and electronics, including integrated circuits (IC), batteries, displays, etc., currently accounts for a staggering 10% of the total PFAS-containing fluoropolymers used in Europe alone. Now, computer system designers have an opportunity to reduce the use of PFAS in semiconductors and electronics at the design phase. In this work, we quantify the environmental impact of PFAS in computing systems, and outline how designers can optimize their designs to use less PFAS. We show that manufacturing an IC design at a 7 nm technology node using Extreme Ultraviolet (EUV) lithography uses 20% less volume of PFAS-containing chemicals versus manufacturing the same design at a 7 nm node using Deep Ultraviolet (DUV) immersion lithography (instead of EUV). We also show that manufacturing an IC design at a 16 nm technology node results in 15% less volume of PFAS than manufacturing the same design at a 28 nm node due to its smaller area./proceedings-archive/2024/DATA/546_pdf_upload.pdf |
||
FAST MACHINE LEARNING BASED PREDICTION FOR TEMPERATURE SIMULATION USING COMPACT MODELS Speaker: Ayse Coskun, Boston University, US Authors: Mohammadamin Hajikhodaverdian1, Sherief Reda2 and Ayse Coskun1 1Boston University, US; 2Brown University, US Abstract As transistor densities increase, managing thermal challenges in 3D IC designs becomes more complex. Traditional methods like finite element methods and compact thermal models (CTMs) are computationally expensive, while existing machine learning (ML) models require large datasets and a long training time. To address these challenges with the ML models, we introduce a novel ML framework that integrates with CTMs to accelerate steady-state thermal simulations without needing large datasets. Our approach achieves up to 70× speedup over state-of-the-art simulators, enabling real-time, high-resolution thermal simulations for 2D and 3D IC designs./proceedings-archive/2024/DATA/1167_pdf_upload.pdf |
||
CPP-SGS: CYCLE-ACCURATE POWER PREDICTION FRAMEWORK VIA SNN AND GENETIC SIGNAL SELECTION Speaker: Tong Liu, Hong Kong University of Science and Technology, HK Authors: Tong Liu1, Zijun Jiang2 and Yangdi Lyu1 1Hong Kong University of Science and Technology, HK; 2Hong Kong University of Science & Technology (Guangzhou), CN Abstract Effective power management is crucial for optimizing the performance and longevity of integrated circuits. Cycle-accurate power prediction can help power management during runtime. This paper introduces a Cycle-accurate Power Prediction framework via Spiking neural networks (SNNs) and Genetic signal Selection (CPP-SGS), which integrates SNNs and Genetic Algorithms (GAs) to predict the real-time power consumption of chips. We apply GAs to select the most relevant signals as the input to SNNs to reduce the model size and inference time, making it well-suited for dynamic power estimation in real-time scenarios. The experimental results show that CPP-SGS outperforms the state-of-the-art approaches, with a normalized root mean squared error (NRMSE) of less than 1.6%./proceedings-archive/2024/DATA/1457_pdf_upload.pdf |
TS16 Session 26 = T1+T2
Add this session to my calendar
Date: Tuesday, 01 April 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
FUSIS: FUSING SURROGATE MODELS AND IMPORTANCE SAMPLING FOR EFFICIENT YIELD ESTIMATION Speaker: Wei Xing, University of Sheffield, GB Authors: Yanfang Liu1 and Wei Xing2 1Beihang University, CN; 2University of Sheffield, GB Abstract As process nodes continue to shrink, yield estimation has become increasingly critical in modern circuit design. Traditional approaches face significant challenges: surrogate-based methods often struggle with robustness and accuracy, whereas importance sampling (IS)-based methods suffer from high simulation costs. To address these challenges simultaneously, we propose FUSIS, a unified framework that combines the strengths of surrogate-based and IS-based approaches. Unlike conventional surrogate-based methods that directly replace SPICE simulations for performance predictions, FUSIS employs a Deep Kernel support vector machine (SVM) as an approximation of the indicator function, which is further utilized to construct a quasi-optimal proposal distribution for IS to accelerate convergence. To further mitigate yield estimation bias caused by surrogate inaccuracies, we introduce a novel correction factor to adjust the IS-based yield estimation. Experiments conducted on SRAM and analog circuits demonstrate that FUSIS significantly improves accuracy by up to 24.84% (8.67% on average) while achieving up to 29.54x (10.30x on average) speedup in efficiency compared to seven state-of-the-art methods./proceedings-archive/2024/DATA/83_pdf_upload.pdf |
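The importance-sampling backbone of such yield flows is the standard estimator P(fail) ≈ (1/N) Σ 1{fail(x_i)} · p(x_i)/q(x_i) with x_i drawn from a shifted proposal q. The sketch below shows only these mechanics on a synthetic failure region; FUSIS additionally derives the proposal from a deep-kernel SVM surrogate and applies a correction factor, neither of which is modeled here.

```python
import numpy as np

# Importance-sampling failure-probability estimator:
#   P_fail ~= mean( indicator(fail(x)) * p(x) / q(x) ),  x ~ q
# p is the nominal process-variation distribution (standard normal here) and
# q is a mean-shifted proposal centred near the failure region. The "circuit"
# below is a synthetic stand-in, not a SPICE-level model.

rng = np.random.default_rng(1)
DIM, N = 6, 20_000

def fails(x):                        # synthetic failure region, rare under p
    return x.sum(axis=1) > 12.0

shift = np.full(DIM, 2.0)            # proposal mean, pushed toward the failures
x = rng.normal(loc=shift, size=(N, DIM))

# log-density ratio p(x)/q(x) for unit-variance Gaussians with different means
log_w = -0.5 * (x ** 2).sum(axis=1) + 0.5 * ((x - shift) ** 2).sum(axis=1)
p_fail = np.mean(fails(x) * np.exp(log_w))

print(f"IS estimate of P(fail): {p_fail:.2e}")
```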
||
ROTA: ROTATIONAL TORUS ACCELERATOR FOR WEAR LEVELING OF NEURAL PROCESSING ELEMENTS Speaker: Taesoo Lim, Yonsei University, KR Authors: Taesoo Lim, Hyeonjin Kim, Jingu Park, Bogil Kim and William Song, Yonsei University, KR Abstract This paper introduces a reliability-aware neural accelerator design with a wear-leveling solution that balances the utilization of processing elements (PEs). Neural accelerators deploy many PEs to exploit data-level parallelism, but their designs and operations have focused mostly on performance and energy efficiency metrics. Directional dataflows in PE arrays and dimensional misalignment with variable-sized neural layers cause the underutilization of PEs, which is biased to PE locations and gradually accumulated over time. Consequently, the accelerators experience severe usage imbalance between PEs. To resolve the problem, this paper proposes a rotational torus accelerator (RoTA) with an optimized wear-leveling scheme that shuffles PE utilization spaces to eliminate PE usage imbalance. Evaluation results show that RoTA improves lifetime reliability by 1.69x./proceedings-archive/2024/DATA/303_pdf_upload.pdf |
||
LOCATION IS ALL YOU NEED: EFFICIENT LITHOGRAPHIC HOTSPOT DETECTION USING ONLY POLYGON LOCATIONS Speaker: Kang Liu, Huazhong University of Science & Technology, CN Authors: Yujia Wang1, Jiaxing Wang1, Dan Feng1, Yuzhe Ma2 and Kang Liu1 1Huazhong University of Science & Technology, CN; 2Hong Kong University of Science and Technology, HK Abstract With integrated circuits at advanced technology nodes shrinking in feature size, lithographic hotspot detection has become increasingly important. Deep learning, especially convolutional neural networks (CNNs) and graph neural networks (GNNs), has recently succeeded in lithographic hotspot detection, where layout patterns, represented as images or graph features, are classified into hotspots and non-hotspots. However, with increasingly sophisticated CNN architectural designs, CNN-based hotspot detection requires excessive training and inference costs with expanding model sizes but only marginally improves detection accuracy. Existing GNN-based hotspot detectors still require a more intuitive and efficient layout graph feature representation. Driven by the understanding that lithographic hotspots result from complex interactions among metal polygons through the light system, we propose that the absolute and relative locations of metal polygons are all that is needed to detect hotspots in a layout clip. We propose a novel layout graph feature representation for hotspot detection where the coordinates of each polygon and the distances between them are taken as node and edge features, respectively. We design an advanced GNN architecture using graph attention and different feature update functions for different edge types of polygons. Our experimental results demonstrate that our architecture achieves the highest hotspot accuracy and the lowest false alarm rate on different datasets. Notably, we employ one-third of the graph features of the previous GNN hotspot detector and achieve higher accuracy. We outperform all CNN hotspot detectors with higher accuracy, up to 32x speedup in inference time, and 64x reduction in model size./proceedings-archive/2024/DATA/425_pdf_upload.pdf |
||
EFFICIENT MODULATED STATE SPACE MODEL FOR MIXED-TYPE WAFER DEFECT PATTERN RECOGNITION Speaker: Mu Nie, Anhui Polytechnic University, CN Authors: Mu Nie1, ShiDong Zhu1, Aibin Yan2, Zhuo Chen3, Xiaoqing Wen4 and Tianming Ni1 1Anhui Polytechnic University, CN; 2Hefei University of Technology, CN; 3Zhejiang University, CN; 4Kyushu Institute of Technology, JP Abstract Accurate and efficient wafer defect detection is crucial in semiconductor manufacturing to maintain product quality and optimize yield. Traditional methods struggle with the complexity and diversity of modern wafer defect patterns. While deep learning approaches are effective, they are often resource-intensive, posing challenges for real-time deployment in industrial settings. To solve these problems, we propose an Efficient Modulated State Space Model (EM-SSM) for mixed-type wafer defect recognition, optimized with knowledge distillation to balance accuracy and efficiency. Our framework captures size-dependent relationships and improves defect-specific feature representation to recognize complex defects precisely. Specifically, we introduce an efficient directional modulation mechanism to refine spatial recognition of defect patterns. To further improve inference efficiency, we propose a deep-to-shallow distillation method that transfers knowledge from deeper networks to lighter networks, reducing inference time without compromising classification accuracy. Experimental results on the MixedWM38 wafer dataset with 38 defect types show that our model achieves 99.0% accuracy, outperforming traditional methods in both accuracy and efficiency. Our model offers a scalable solution for modern semiconductor defect detection./proceedings-archive/2024/DATA/818_pdf_upload.pdf |
||
MORE-STRESS: MODEL ORDER REDUCTION BASED EFFICIENT NUMERICAL ALGORITHM FOR THERMAL STRESS SIMULATION OF TSV ARRAYS IN 2.5D/3D IC Presenter: Tianxiang Zhu, Peking University, CN Authors: Tianxiang Zhu, Qipan Wang, Yibo Lin, Runsheng Wang and Ru Huang, Peking University, CN Abstract Thermomechanical stress induced by through-silicon vias (TSVs) plays an important role in the performance and reliability analysis of 2.5D/3D ICs. While the finite element method (FEM) adopted by commercial software can provide accurate simulation results, it is very time- and memory-consuming for large-scale analysis. Over the past decade, the linear superposition method has been utilized to perform fast thermal stress estimations of TSV arrays, but it suffers from a lack of accuracy. In this paper, we propose MORE-Stress, a novel strict numerical algorithm for efficient thermal stress simulation of TSV arrays based on model order reduction. Extensive experimental results demonstrate that our algorithm can realize a 153-504x reduction in simulation time and a 39-115x reduction in memory usage compared with the commercial software ANSYS, with negligible errors less than 1%. Our algorithm is as efficient as the linear superposition method, with an order of magnitude smaller errors and fast convergence. |
||
DYNAMIC IR-DROP PREDICTION THROUGH A MULTI-TASK U-NET WITH PACKAGE EFFECT CONSIDERATION Speaker: Yu-Hsuan Chen, National Tsing Hua University, Taiwan, TW Authors: Yu-Hsuan Chen1, Yu-Chen Cheng1, Yong-Fong Chang2, Yu-Che Lee1, Jia-Wei Lin2, Hsun-Wei Pao2, Peng-Wen Chen2, Po-Yu Chen2, Hao-Yun Chen2, Yung-Chih Chen3, Chun-Yao Wang1 and Shih-Chieh Chang1 1National Tsing Hua University, TW; 2Mediatek Inc, Taiwan, TW; 3National Taiwan University of Science and Technology, TW Abstract Dynamic IR drop analysis is a critical step in the design signoff stage for verifying the power integrity of a chip. Since the analysis is extremely time-consuming, it has led to the emergence of machine learning (ML)-based methods to expedite the procedure. While previous ML approaches have demonstrated the feasibility of IR drop prediction, they often neglect package effects and do not address diverse IR criteria for memory and standard cells. Thus, this paper introduces a novel ML-based approach designed for a fast and accurate prediction of multi-type IR drop, considering package effects. We develop new package-related features to account for the package impact on IR drop. The proposed model is based on a multi-task U-net architecture that not only predicts two types of IR drops simultaneously but also increases prediction accuracy through comprehensive learning. To further enhance the model performance, we introduce the Input Fusion Block (IFB), which unifies units across channels within the input feature maps, leading to improved prediction accuracy. The experimental results show the across-pattern transferability of the proposed IR drop prediction method, demonstrating an RMSE of less than 5mV and an MAE of less than 2mV on the unseen simulation patterns. Additionally, our proposed method achieves a 5X speed-up compared to the commercial tool./proceedings-archive/2024/DATA/1051_pdf_upload.pdf |
||
MINIMUM TIME MAXIMUM FAULT COVERAGE TESTING OF SPIKING NEURAL NETWORKS Speaker: Spyridon Raptis, Sorbonne Université, CNRS, LIP6, FR Authors: Spyridon Raptis1 and Haralampos-G. Stratigopoulos2 1Sorbonne Université, CNRS, LIP6, FR; 2Sorbonne University, CNRS, LIP6, FR Abstract We present a novel test generation algorithm for hardware accelerators of Spiking Neural Networks (SNNs). The algorithm is based on advanced optimization tailored for the spiking domain. It adaptively crafts input samples towards high coverage of hardware-level faults. Time-consuming fault simulation during test generation is circumvented by defining loss functions targeting the maximization of fault sensitisation and fault effect propagation to the output. Comparing the proposed algorithm to the existing ones on three benchmarks, it scales up for large SNN models, and it drastically reduces the test generation runtime from days to hours and the test duration from minutes to seconds. The resultant test input shows near perfect fault coverage and has a duration equivalent to a few dataset samples, thus, besides post-manufacturing testing, it is also suited for in-field testing./proceedings-archive/2024/DATA/508_pdf_upload.pdf |
||
EGIS: ENTROPY GUIDED IMAGE SYNTHESIS FOR DATASET-AGNOSTIC TESTING OF RRAM-BASED DNNS Speaker: Anurup Saha, Georgia Tech, US Authors: Anurup Saha, Chandramouli Amarnath, Kwondo Ma and Abhijit Chatterjee, Georgia Tech, US Abstract While resistive random access memory (RRAM) based deep neural networks (DNN) are important for low-power inference in IoT and edge applications, they are vulnerable to the effects of manufacturing process variations that degrade their performance (classification accuracy). However, to test the same post-manufacture, the (image) dataset used to train the associated machine learning applications may not be available to the RRAM crossbar manufacturer for privacy reasons. As such, the performance of DNNs needs to be assessed with carefully crafted dataset-agnostic synthetic test images that expose anomalies in the crossbar manufacturing process to the maximum extent possible. In this work, we propose a dataset-agnostic post-manufacture testing framework for RRAM-based DNNs using Entropy Guided Image Synthesis (EGIS). We first create a synthetic image dataset such that the DNN outputs corresponding to the synthetic images minimize an entropy-based loss metric. Next, a small subset (consisting of 10-20 images) of the synthetic image dataset, called the compact image dataset, is created to expedite testing. The response of the device under test (DUT) to the compact image dataset is passed to a machine learning based outlier detector for pass/fail labeling of the DUT. It is seen that the test accuracy using such synthetic test images is very close to that of contemporary test methods./proceedings-archive/2024/DATA/1338_pdf_upload.pdf |
||
NVSRLO: A FEFET-BASED NON-VOLATILE AND SEU-RECOVERABLE LATCH DESIGN WITH OPTIMIZED OVERHEAD Speaker: Wangjin Jiang, Hefei University of Technology, CN Authors: Aibin Yan1, Wangjin Jiang1, Han Bao1, Zhengfeng Huang1, Tianming Ni2, Xiaoqing Wen3 and Patrick Girard4 1Hefei University of Technology, CN; 2Anhui Polytechnic University, CN; 3Kyushu Institute of Technology, JP; 4LIRMM, FR Abstract This paper presents a FeFET-based non-volatile and single-event upset (SEU) recoverable latch, namely NVSRLO, which does not require any extra control signals. Simulation results show that the proposed latch provides non-volatility and SEU-recovery with optimized overhead. Compared with existing non-volatile latches, NVSRLO significantly reduces delay, power, and delay-power-area product at the cost of area./proceedings-archive/2024/DATA/969_pdf_upload.pdf |
||
INTERA-ECC: INTERCONNECT-AWARE ERROR CORRECTION IN STT-MRAM Speaker: Surendra Hemaram, Karlsruhe Institute of Technology, DE Authors: Surendra Hemaram1, Mahta Mayahinia1, Mehdi Tahoori1, Francky Catthoor2, Siddharth Rao2, Sebastien Couet2, Tommaso Marinelli3, Anita Farokhnejad2 and Gouri Kar2 1Karlsruhe Institute of Technology, DE; 2IMEC, BE; 3imec, BE Abstract Spin-transfer torque magnetic random access memory (STT-MRAM) is a promising alternative to existing memory technologies. However, STT-MRAM faces reliability challenges, primarily due to stochastic switching, process variation, and manufacturing defects. These reliability challenges become even worse due to interconnect parasitic resistive-capacitive effects, potentially compromising the reliability of memory cells located far from the write driver. This can severely impair the manufacturing yield and large-scale industrial adoption. To address this, we propose an interconnect-aware error correction coding (InterA-ECC), which provides non-uniform error correction to a different zone of the memory subarray. The proposed InterA-ECC strategy selectively applies robust error-correction code (ECC) to specific rows within the subarray rather than uniformly across all rows, reducing ECC parity bits while enhancing bit error rate resiliency in the most vulnerable memory zone./proceedings-archive/2024/DATA/1164_pdf_upload.pdf |
||
ASSESSING SOFT ERROR RELIABILITY IN VECTORIZED KERNELS: VULNERABILITY AND PERFORMANCE TRADE-OFFS ON ARM AND RISC-V ISAS Speaker and Author: Geancarlo Abich, UFRGS, BR Abstract The demand for advanced processing capabilities is paramount in the ever-evolving landscape of radiation-resilient computing exploration. With the standardization of vector extensions on Arm and RISC-V ISAs, leading technology companies are adopting high-performance processors to exploit vector capabilities. In this regard, this work proposes an automated register cross-section reliability evaluation while extending the uniform random register file fault injection to assess the increased vulnerability with the vector register length. Such a technique enables soft error reliability assessment of vector extensions from RISC-V and Arm while comparing them with their scalar counterparts over different integer and FP precisions. The obtained results show that soft error criticality correlates to the registers' cross-section, and the vectorized benchmarks exhibited up to 78% error susceptibility, compared to 6% for the scalar versions, varying with precision. This emphasizes the necessity of balancing performance and reliability in emerging onboard platforms with vector capabilities./proceedings-archive/2024/DATA/1320_pdf_upload.pdf |
||
EARLY FUNCTIONAL SAFETY AND PPA EVALUATION OF DIGITAL DESIGNS Speaker: Michelangelo Bartolomucci, Politecnico di Torino, IT Authors: Michelangelo Bartolomucci1, David Kingston2, Teo Cupaiuolo3, Alessandra Nardi4 and Riccardo Cantoro1 1Politecnico di Torino, IT; 2Synopsys, GB; 3Synopsys, IT; 4Synopsys, US Abstract The use of semiconductor devices in safety-critical scenarios is increasing in both quantity and complexity. This paper presents a novel approach to support safety requirements from RTL exploration through to implementation, with the aid of a Safety Specification Format (SSF), thereby minimizing costly development iterations and reducing the Time-To-Market. An assessment of the results is given for the CV32E40P open source RISC-V processor./proceedings-archive/2024/DATA/1203_pdf_upload.pdf |
TS17 Session 5 - D2
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 08:30 CET - 10:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
FLOPPYFLOAT: AN OPEN SOURCE FLOATING POINT LIBRARY FOR INSTRUCTION SET SIMULATORS Speaker: Niko Zurstraßen, RWTH Aachen University, DE Authors: Niko Zurstraßen, Nils Bosbach and Rainer Leupers, RWTH Aachen University, DE Abstract Instruction Set Simulators (ISSs) are important software tools that facilitate the simulation of arbitrary compute systems. One of the most challenging aspects of ISS development is the modeling of Floating Point (FP) arithmetic. Despite an industry standard specifically created to avoid fragmentation, every Instruction Set Architecture (ISA) comes with an individual definition of FP arithmetic. Hence, many simulators, such as gem5 or Spike, do not use the Floating Point Unit (FPU) of the host system, but resort to soft float libraries. These libraries offer great flexibility and portability by calculating FP instructions by means of integer arithmetic. However, using tens or hundreds of integer instructions to model a single FP instruction is detrimental to the simulator's performance. Tackling the poor performance of soft float libraries, we present FloppyFloat - an open-source FP library for ISSs. FloppyFloat leverages the host FPU for basic calculations and rectifies corner cases in software. In comparison to the popular Berkeley SoftFloat, FloppyFloat achieves speedups of up to 5.5x for individual instructions. As a replacement for SoftFloat in the RISC-V golden reference simulator Spike, FloppyFloat accelerates common FP benchmarks by up to 1.41x./proceedings-archive/2024/DATA/21_pdf_upload.pdf |
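The core trick, using the host FPU and rectifying ISA-specific corner cases afterwards, can be illustrated with NaN handling: RISC-V requires arithmetic to return the canonical quiet NaN, whereas x86 hosts propagate NaN payloads. The Python sketch below is our own conceptual illustration of that idea, not FloppyFloat's C++ implementation.

```python
import struct

RISCV_CANONICAL_NAN_F32 = 0x7FC00000          # single-precision canonical quiet NaN

def f32_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]

def riscv_fadd_s(a, b):
    """Single-precision add in the 'host FPU + fix-ups' style: let the host do
    the arithmetic, round back to binary32, then rectify the RISC-V-specific
    corner case that any NaN result must be the canonical NaN (an x86 host
    would instead propagate a NaN payload)."""
    r = bits_f32(f32_bits(a)) + bits_f32(f32_bits(b))   # host arithmetic
    r = bits_f32(f32_bits(r))                           # round back to binary32
    if r != r:                                          # NaN check
        return bits_f32(RISCV_CANONICAL_NAN_F32)
    return r

nan_with_payload = bits_f32(0x7FC00123)
print(hex(f32_bits(riscv_fadd_s(nan_with_payload, 1.0))))   # -> 0x7fc00000
print(riscv_fadd_s(1.5, 2.25))                              # -> 3.75
```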
||
HANDLING LATCH LOOPS IN TIMING ANALYSIS WITH IMPROVED COMPLEXITY AND DIVERGENT LOOP DETECTION Speaker: Xizhe Shi, Peking University, CN Authors: Xizhe Shi, Zizheng Guo, Yibo Lin, Runsheng Wang and Ru Huang, Peking University, CN Abstract Latch loops introduce feedback cycles in timing graphs for static timing analysis (STA), disrupting timing propagation in topological order. Existing timers handle latch loops by checking the convergence of global iterations in timing propagation without lookahead detection of divergent loops. Such a strategy ends up with the worst-case runtime complexity O(n²), where n is the number of pins in the timing graph. This can be extremely time-consuming when n grows to millions and beyond. In this paper, we address this challenge by proposing a new algorithm consisting of two steps. First, we identify the strongly connected components (SCCs) and levelize them into different stages. Second, we implement parallelized arrival time (AT) propagation between SCCs while conducting sequential iterations inside each SCC. This strategy significantly reduces the runtime complexity to O(∑(k_i)²) from the previous global propagation, where k_i is the number of pins in each SCC. Our timer also detects loops with divergent timing information in advance, avoiding over-iteration. Experimental results on industrial designs demonstrate 10.31× and 8.77× speed-up over PrimeTime and OpenSTA on average, respectively./proceedings-archive/2024/DATA/76_pdf_upload.pdf |
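The two-step structure described in the abstract, collapse the strongly connected components and iterate only inside each one while propagating once between them in topological order, can be sketched as follows. This is our own simplified illustration (sequential, with a crude divergence flag), not the paper's parallel timer.

```python
from collections import defaultdict
from graphlib import TopologicalSorter

def find_sccs(graph):
    """Kosaraju SCCs of {node: [successors]} -> (list of SCCs, node -> SCC id)."""
    order, seen = [], set()
    def dfs(u):
        seen.add(u)
        for v in graph.get(u, []):
            if v not in seen:
                dfs(v)
        order.append(u)
    for u in graph:
        if u not in seen:
            dfs(u)
    rev = defaultdict(list)
    for u in graph:
        for v in graph[u]:
            rev[v].append(u)
    comp = {}
    def assign(u, cid):
        comp[u] = cid
        for v in rev[u]:
            if v not in comp:
                assign(v, cid)
    n = 0
    for u in reversed(order):
        if u not in comp:
            assign(u, n)
            n += 1
    sccs = [[] for _ in range(n)]
    for u, c in comp.items():
        sccs[c].append(u)
    return sccs, comp

def propagate(graph, delay, at):
    """Max arrival-time propagation: once between SCCs (topological order of the
    condensation), iteratively inside each SCC, with a crude divergence flag."""
    sccs, comp = find_sccs(graph)
    preds = {c: set() for c in range(len(sccs))}
    for u in graph:
        for v in graph[u]:
            if comp[u] != comp[v]:
                preds[comp[v]].add(comp[u])
    for c in TopologicalSorter(preds).static_order():
        for _ in range(len(sccs[c]) + 2):            # sequential sweeps inside the SCC
            changed = False
            for u in sccs[c]:
                for v in graph[u]:
                    cand = at.get(u, float("-inf")) + delay[u, v]
                    if cand > at.get(v, float("-inf")):
                        at[v] = cand
                        changed = True
            if not changed:
                break
        else:
            print("divergent loop in SCC", sccs[c])  # still changing: flag it
    return at

# Toy timing graph: a -> b <-> c (a latch loop) -> d
g = {"a": ["b"], "b": ["c"], "c": ["b", "d"], "d": []}
dly = {("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "b"): -3.0, ("c", "d"): 1.0}
print(propagate(g, dly, {"a": 0.0}))   # {'a': 0.0, 'b': 1.0, 'c': 3.0, 'd': 4.0}
```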
||
STATIC GLOBAL REGISTER ALLOCATION FOR DYNAMIC BINARY TRANSLATORS Speaker: Niko Zurstraßen, RWTH Aachen University, DE Authors: Niko Zurstraßen, Nils Bosbach, Lennart Reimann and Rainer Leupers, RWTH Aachen University, DE Abstract Dynamic Binary Translators (DBTs) facilitate the execution of binaries across different Instruction Set Architectures (ISAs). Similar to a just-in-time compiler, they recompile machine code from one ISA to another, and subsequently execute the generated code. To achieve near-native execution speed, several challenges must be overcome. This includes the problem of register allocation (RA). In classical compiler engineering, RA is often performed by global methods. However, due to the nature of DBTs, established global methods like graph coloring or linear scan are hardly applicable. This is why state-of-the-art DBTs, like QEMU, use basic-block-local methods, which come with several disadvantages. Addressing these flaws, we propose a novel global method based on static target-to-host mappings. As most applications only work on a small set of registers, mapping them statically from target to host significantly reduces load/store overhead. In a case study using our RISC-V-on-ARM64 user-mode simulator RISE SIM, we demonstrate speedups of up to 1.4× compared to basic-block-local methods./proceedings-archive/2024/DATA/81_pdf_upload.pdf |
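The underlying idea, profile which guest registers the code actually touches and pin the hottest ones to host registers for the whole translation while spilling the rest to a memory-backed register file, can be sketched as below. The toy instruction stream and the choice of ARM64 callee-saved registers are our own illustration, not RISE SIM's allocator.

```python
from collections import Counter

# Hypothetical guest (RISC-V-like) instruction stream: (mnemonic, dest, src1, src2).
guest_code = [
    ("add",  "x5", "x5", "x6"),
    ("addi", "x6", "x6", None),
    ("lw",   "x7", "x5", None),
    ("add",  "x5", "x5", "x7"),
    ("beq",  None, "x5", "x0"),
]

HOST_REGS = ["x19", "x20", "x21"]   # ARM64 callee-saved registers reserved for the mapping

def static_map(code, host_regs):
    """Map the most frequently used guest registers to host registers; everything
    else stays in the in-memory guest register file (loaded/stored on use)."""
    usage = Counter(r for insn in code for r in insn[1:] if r is not None)
    ranked = [reg for reg, _ in usage.most_common(len(host_regs))]
    return dict(zip(ranked, host_regs))

mapping = static_map(guest_code, HOST_REGS)
print(mapping)   # e.g. {'x5': 'x19', 'x6': 'x20', 'x7': 'x21'}
```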
||
CORRECTBENCH: AUTOMATIC TESTBENCH GENERATION WITH FUNCTIONAL SELF-CORRECTION USING LLMS FOR HDL DESIGN Speaker: Ruidi Qiu, TU Munich, DE Authors: Ruidi Qiu1, Grace Li Zhang2, Rolf Drechsler3, Ulf Schlichtmann1 and Bing Li4 1TU Munich, DE; 2TU Darmstadt, DE; 3University of Bremen | DFKI, DE; 4University of Siegen, DE Abstract Functional simulation is an essential step in digital hardware design. Recently, there has been a growing interest in leveraging Large Language Models (LLMs) for hardware testbench generation tasks. However, the inherent instability associated with LLMs often leads to functional errors in the generated testbenches. Previous methods do not incorporate automatic functional correction mechanisms without human intervention and still suffer from low success rates, especially for sequential tasks. To address this issue, we propose CorrectBench, an automatic testbench generation framework with functional self-validation and self-correction. Utilizing only the RTL specification in natural language, the proposed approach can validate the correctness of the generated testbenches with a success rate of 88.85%. Furthermore, the proposed LLM-based corrector employs bug information obtained during the self-validation process to perform functional self-correction on the generated testbenches. The comparative analysis demonstrates that our method achieves a pass ratio of 70.13% across all evaluated tasks, compared with the previous LLM-based testbench generation framework's 52.18% and a direct LLM-based generation method's 33.33%. Specifically, in sequential circuits, our work's performance is 62.18% higher than the previous work and almost 5 times the pass ratio of the direct method. The codes and experimental results are open-sourced at the link: https://anonymous.4open.science/r/CorrectBench-8CEA./proceedings-archive/2024/DATA/97_pdf_upload.pdf |
||
CISGRAPH: A CONTRIBUTION-DRIVEN ACCELERATOR FOR PAIRWISE STREAMING GRAPH ANALYTICS Speaker: Songyu Feng, Institute of Computing Technology, Chinese Academy of Sciences, CN Authors: Songyu Feng1, Mo Zou2 and Tian Zhi2 1Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract Recent research has observed that pairwise query is practical enough in real-world streaming graph analytics. Given a pair of distinct vertices, existing approaches coalesce or prune vertex activations to decrease computations. However, they still suffer from severe invalid computations because they ignore contribution variations in graph updates, hindering performance improvement. In this work, we propose to enhance pairwise analytics by taking update contributions into account. We first identify that graph updates from one batch have distinct impacts on query results and incur noticeably different computation overheads. We then introduce CISGraph, a novel Contribution-driven pairwise accelerator with valuable updates Identification and Scheduling. Specifically, inspired by the triangle inequality, CISGraph categorizes graph updates into three levels according to contributions, prioritizes valuable updates, delays possibly valuable updates, and drops useless updates to eliminate wasteful computations. As far as we know, CISGraph is the first hardware accelerator that supports efficient pairwise queries on streaming graphs. Experimental results show that CISGraph substantially outperforms state-of-the-art streaming graph processing systems by 25× on average in response time./proceedings-archive/2024/DATA/159_pdf_upload.pdf |
||
HIGH-PERFORMANCE ARM-ON-ARM VIRTUALIZATION FOR MULTICORE SYSTEMC-TLM-BASED VIRTUAL PLATFORMS Speaker: Nils Bosbach, RWTH Aachen University, DE Authors: Nils Bosbach1, Rebecca Pelke1, Niko Zurstraßen1, Jan Weinstock2, Lukas Jünger2 and Rainer Leupers1 1RWTH Aachen University, DE; 2MachineWare GmbH, DE Abstract The increasing complexity of hardware and software requires advanced development and test methodologies for modern systems on chips. This paper presents a novel approach to ARM-on-ARM virtualization within SystemC-based simulators using Linux's KVM to achieve high-performance simulation. By running target software natively on ARM-based hosts with hardware-based virtualization extensions, our method eliminates the need for instruction-set simulators, which significantly improves performance. We present a multicore SystemC-TLM-based CPU model that can be used as a drop-in replacement for an instruction-set simulator. It places no special requirements on the host system, making it compatible with various environments. Benchmark results show that our ARM-on-ARM-based virtual platform achieves up to 10x speedup over traditional instruction-set-simulator-based models on compute-intensive workloads. Depending on the benchmark, speedups increase to more than 100x./proceedings-archive/2024/DATA/160_pdf_upload.pdf |
||
RTHETER: SIMULATING REAL-TIME SCHEDULING OF MULTIPLE TASKS IN HETEROGENEOUS ARCHITECTURES Speaker: Yinchen Ni, Shanghai Jiao Tong University, CN Authors: Yinchen Ni1, Jiace Zhu1, Yier Jin2 and An Zou1 1Shanghai Jiao Tong University, CN; 2University of Science and Technology of China, CN Abstract The rising popularity of AI applications is driving the adoption of heterogeneous computing architectures to handle complex computations. However, as these heterogeneous architectures grow more complex, optimizing the scheduling of multiple tasks and meeting strict timing constraints becomes significantly challenging. Current studies on real-time scheduling on heterogeneous processors lack agile and flexible simulation tools that can quickly adapt to varying system settings, leading to inefficiencies in system design. Additionally, the high costs associated with evaluating real-time performance in terms of human and facility efforts further complicate the development process. To address these challenges, this paper introduces a comprehensive hierarchical simulation approach and a corresponding simulator designed for flexible heterogeneous computing platforms. The simulator supports ideal or practical, off-the-shelf or customizable heterogeneous architectures, upon which the simulator can execute both parallel and dependent tasks. Utilizing this simulator, we present two case studies that were previously time-consuming but are now easily carried out with the proposed simulator. The first case study reveals the possibility of using policy-based reinforcement learning to explore novel scheduling strategies; the second explores the dominant processors within heterogeneous architectures, providing insights for optimizing the heterogeneous architecture design./proceedings-archive/2024/DATA/323_pdf_upload.pdf |
||
FAST INTERPRETER-BASED INSTRUCTION SET SIMULATION FOR VIRTUAL PROTOTYPES Speaker: Manfred Schlägl, Institute for Complex Systems, Johannes Kepler University Linz, AT Authors: Manfred Schlaegl and Daniel Grosse, Johannes Kepler University Linz, AT Abstract The Instruction Set Simulators (ISSs) used in Virtual Prototypes (VPs) are typically implemented as interpreters with the goal to be easy to understand, and fast to adapt and extend. However, the performance of instruction interpretation is very limited and the ever-increasing complexity of Hardware (HW) poses an increasing challenge to this approach. In this paper, we present optimization techniques for interpreter-based ISSs that significantly boost performance while preserving comprehensibility and adaptability. We consider the RISC-V ISS of an existing, SystemC-based open-source VP with extensive capabilities such as running Linux and interactive graphical applications. The optimization techniques feature a Dynamic Basic Block Cache (DBBCache) to accelerate ISS instruction processing and a Load/Store Cache (LSCache) to speed up ISS load and store operations to and from memory. In our evaluation, we consider 12 Linux-based benchmark workloads and compare our optimizations to the original VP as well as to the very efficient official RISC-V reference simulator Spike maintained by RISC-V International. Overall, we achieve up to 406.97 Million Instructions per Second (MIPS) and a significant average performance increase, by a factor of 8.98 over the original VP and 1.65 over the Spike simulator. To showcase the retention of both comprehensibility and adaptability, we implement support for the RISC-V half-precision floating-point extension (Zfh) in both the original and the optimized VP. A comparison of these implementations reveals no significant differences, ensuring that the stated qualities remain unaffected. The optimized VP including Zfh is available as open-source on GitHub./proceedings-archive/2024/DATA/529_pdf_upload.pdf |
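The dynamic basic block cache idea, decode a straight-line run of instructions once into bound handler functions and re-dispatch the cached block on every later visit, can be sketched with a toy ISA. The sketch below is our own illustration, not the VP's RISC-V ISS.

```python
# Toy ISS sketch of a Dynamic Basic Block Cache: decode each basic block once,
# store the decoded handlers, and skip fetch/decode on later visits.
# Toy ISA (not RISC-V): ADDI rd rs imm, BNEZ rs target, HALT.

PROGRAM = {
    0: ("ADDI", 1, 1, -1),   # r1 -= 1
    1: ("ADDI", 2, 2, 3),    # r2 += 3
    2: ("BNEZ", 1, 0),       # loop back to pc=0 while r1 != 0
    3: ("HALT",),
}

def make_addi(rd, rs, imm):
    def op(regs):
        regs[rd] = regs[rs] + imm
        return None
    return op

def make_branch(rs, target, fallthrough):
    def op(regs):
        return target if regs[rs] != 0 else fallthrough
    return op

def decode_block(pc):
    """Decode from pc up to (and including) the block-ending branch/halt."""
    ops = []
    while True:
        insn = PROGRAM[pc]
        if insn[0] == "ADDI":
            ops.append(make_addi(*insn[1:]))
            pc += 1
        elif insn[0] == "BNEZ":
            ops.append(make_branch(insn[1], insn[2], pc + 1))
            return ops
        else:                                   # HALT
            ops.append(lambda regs: -1)
            return ops

def run(regs):
    block_cache = {}                            # the cache: pc -> decoded block
    pc = 0
    while pc >= 0:
        if pc not in block_cache:               # decode only on the first visit
            block_cache[pc] = decode_block(pc)
        for op in block_cache[pc]:
            next_pc = op(regs)
        pc = next_pc                            # block-ending op yields the next pc
    return regs

print(run({1: 5, 2: 0}))                        # -> {1: 0, 2: 15}
```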
||
C2C-GEM5: FULL SYSTEM SIMULATION OF CACHE-COHERENT CHIP-TO-CHIP INTERCONNECTS Speaker: Luis Bertran Alvarez, LIRMM, FR Authors: Luis Bertran Alvarez1, Ghassan Chehaibar2, Stephen Busch2, Pascal Benoit3 and David Novo3 1LIRMM / Eviden, FR; 2Eviden, FR; 3Université de Montpellier, FR Abstract High-Performance Computing (HPC) is shifting toward chiplet-based System-on-Chip (SoC) architectures, necessitating advanced simulation tools for design and optimization. In this work, we extend the gem5 simulator to support cache-coherent multi-chip systems by introducing a new chip-to-chip interconnect model within the Ruby framework. Our implementation is adaptable to various coherence protocols, such as Arm CHI. Calibrated with real hardware, our model is evaluated using PARSEC workloads, demonstrating its accuracy in simulating coherent chip-to-chip interactions and its effectiveness in capturing key performance metrics early in the design flow./proceedings-archive/2024/DATA/717_pdf_upload.pdf |
||
A 101 TOPS/W AND 1.73 TOPS/MM² 6T SRAM-BASED DIGITAL COMPUTE-IN-MEMORY MACRO FEATURING A NOVEL 2T MULTIPLIER Speaker: Priyanshu Tyagi, IIT Roorkee, IN Authors: Priyanshu Tyagi and Sparsh Mittal, IIT Roorkee, IN Abstract In this paper, we propose a 6T SRAM-based all-digital Compute-in-memory (CIM) macro for multi-bit multiply-and-accumulate (MAC) operations. We propose a novel 2T bitwise multiplier, which is a direct improvement over the previously proposed 4T NOR gate-based multiplier. The 2T multiplier also eliminates the need to invert the input bits, which is required when using NOR gates for multipliers. We propose an efficient digital MAC computation flow based on a barrel shifter, which significantly reduces the latency of the shift operation. This brings down the overall latency incurred while performing MAC operations to 13ns/25ns for 4b/8b operands (in 65nm CMOS @ 0.6V), compared to 10ns/18ns (in 22nm CMOS @ 0.72V) for the previous work. The proposed CIM macro is fully reconfigurable in weight bits (4/8/12/16) and input bits (4/8). It can perform concurrent MAC and weight update operations. Moreover, its fully digital implementation circumvents the challenges associated with analog CIM macros. For MAC operation with 4b weight and input, the macro achieves 24 TOPS/W at 1.2 V and 81 TOPS/W at 0.7 V. When using low-threshold-voltage transistors in the 2T multiplier, the macro works reliably even at 0.6V while achieving 101 TOPS/W./proceedings-archive/2024/DATA/988_pdf_upload.pdf |
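The digital MAC flow such macros implement reduces to single-bit products, per-bit-position accumulation, and shift-alignment of the partial sums. The sketch below reproduces that bit-plane arithmetic for unsigned operands only; the macro's signed handling, barrel shifter, and 2T multiplier cell are not modeled.

```python
# Bit-plane MAC sketch for a digital compute-in-memory macro (unsigned case):
# the array produces AND(weight_bit, input_bit) products, accumulates them per
# bit position, and a shifter aligns the partial sums into the final MAC value.

def cim_mac(weights, inputs, w_bits=4, i_bits=4):
    """Sum of weights[i] * inputs[i], computed only from single-bit products."""
    assert all(0 <= w < 2 ** w_bits for w in weights)
    assert all(0 <= x < 2 ** i_bits for x in inputs)
    acc = 0
    for wb in range(w_bits):                      # weight bit plane
        for ib in range(i_bits):                  # input bit plane
            # column of 1-bit multiplications (what the bitwise multiplier cells produce)
            partial = sum(((w >> wb) & 1) & ((x >> ib) & 1)
                          for w, x in zip(weights, inputs))
            acc += partial << (wb + ib)           # shift-align and accumulate
    return acc

weights = [3, 7, 12, 5]
inputs  = [9, 2, 15, 1]
assert cim_mac(weights, inputs) == sum(w * x for w, x in zip(weights, inputs))
print(cim_mac(weights, inputs))
```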
TS18 Session 16 - E3
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 08:30 CET - 10:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
DE²R: UNIFYING DVFS AND EARLY-EXIT FOR EMBEDDED AI INFERENCE VIA REINFORCEMENT LEARNING Speaker: Yuting He, University of Nottingham Ningbo China, CN Authors: Yuting He1, Jingjin Li1, Chengtai Li1, Qingyu Yang1, Zheng Wang2, Heshan Du1, Jianfeng Ren1 and Heng Yu1 1University of Nottingham Ningbo China, CN; 2Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, CN Abstract Executing neural networks on resource-constrained embedded devices faces challenges. Efforts have been made at the application and system levels to reduce the execution cost. Among them, early-exit networks reduce computational cost through intermediate exits, while Dynamic Voltage and Frequency Scaling (DVFS) offers system energy reduction. Existing works strive to unify early-exit and DVFS for combined benefits in both timing and energy flexibility, yet limitations exist: 1) varying time constraints, under which different exit points become more or less important for inference accuracy, are not accounted for; and 2) the optimal decisions for unifying DVFS and early-exit as a multi-objective optimization problem are not reached due to the large configuration space. To address these challenges, we propose De²R, a reinforcement learning-based framework that jointly optimizes early-exit points and DVFS settings for continuous inference. In particular, De²R includes a cross-training mechanism that fine-tunes the early-exit network to accommodate dynamic time constraints and system conditions. Experimental results demonstrate that De²R achieves up to 22.03% energy reduction and 3.23% accuracy gain compared to contemporary techniques./proceedings-archive/2024/DATA/18_pdf_upload.pdf |
||
CONTINUOUS GNN-BASED ANOMALY DETECTION ON EDGE USING EFFICIENT ADAPTIVE KNOWLEDGE GRAPH LEARNING Speaker: Sanggeon Yun, University of California, Irvine, US Authors: Sanggeon Yun1, Ryozo Masukawa1, William Chung1, Minhyoung Na2, Nathaniel Bastian3 and Mohsen Imani1 1University of California, Irvine, US; 2Kookmin University, KR; 3United States Military Academy at West Point, US Abstract The increasing demand for robust security solutions across various industries has made Video Anomaly Detection (VAD) a critical task in applications such as intelligent surveillance, evidence investigation, and violence detection. Traditional approaches to VAD often rely on finetuning large pre-trained models, which can be computationally expensive and impractical for real-time or resource-constrained environments. To address this, MissionGNN introduced a more efficient method by training a graph neural network (GNN) using a fixed knowledge graph (KG) derived from large language models (LLMs) like GPT-4. While this approach demonstrated significant efficiency in computational power and memory, it faces limitations in dynamic environments where frequent updates to the KG are necessary due to evolving behavior trends and shifting data patterns. These updates typically require cloud-based computation, posing challenges for edge computing applications. In this paper, we propose a novel framework that facilitates continuous KG adaptation directly on edge devices, overcoming the limitations of cloud dependency. Our method dynamically modifies the KG through a three-phase process: pruning, alternating, and creating nodes, enabling real-time adaptation to changing data trends. This continuous learning approach enhances the robustness of anomaly detection models, making them more suitable for deployment in dynamic and resource-constrained environments./proceedings-archive/2024/DATA/280_pdf_upload.pdf |
||
BMP-SD: MARRYING BINARY AND MIXED-PRECISION QUANTIZATION FOR EFFICIENT STABLE DIFFUSION INFERENCE Speaker: Cheng Gu, Shanghai Jiao Tong University, CN Authors: Cheng Gu1, Gang Li2, Xiaolong Lin1, Jiayao Ling1, Jian Cheng3 and Xiaoyao Liang1 1Shanghai Jiao Tong University, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN; 3Institute of Automation, CN Abstract Stable Diffusion (SD) is an emerging deep neural network (DNN) model that has demonstrated impressive capabilities in generative tasks such as text-to-image generation. However, the iterative denoising stage with UNet in the SD model is extremely expensive in both computations and memory accesses, making it challenging for fast and energy-efficient edge deployment. To alleviate the overhead of denoising, in this paper we propose BMP-SD, a post-training quantization framework for hardware-efficient SD inference. BMP-SD employs binary weight quantization to significantly reduce the computational complexity and memory footprint of iterative denoising, along with dynamic, step-aware mixed-precision activation quantization, based on the observation that not all denoising steps are equally important. Experiments on the text-to-image generation task show that BMP-SD achieves mixed-precision (W1.73A4.87) with minimal accuracy loss on MS-COCO 2014. We also evaluate the BMP-SD quantized model on multiple bit-flexible DNN accelerators, results reveal that our method can deliver up to 5.14x performance and 3.85x energy efficiency improvements compared to W8A8 quantization./proceedings-archive/2024/DATA/384_pdf_upload.pdf |
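Binary weight quantization of the BinaryConnect/XNOR style, which such frameworks typically start from, keeps one scale per output channel and the sign of each weight; a minimal NumPy sketch is below, together with a uniform activation quantizer whose bit-width a step-aware scheme would vary per denoising step. Both functions are our own illustration, not BMP-SD's calibration.

```python
import numpy as np

def binarize_weights(w):
    """Per-output-channel binary quantization: W ~= alpha * sign(W),
    alpha = mean(|W|) over each output channel (BinaryConnect/XNOR-style)."""
    alpha = np.mean(np.abs(w), axis=tuple(range(1, w.ndim)), keepdims=True)
    return alpha * np.sign(w)

def quantize_activations(x, bits):
    """Uniform symmetric quantization to the given bit-width; a step-aware
    scheme would pick `bits` per denoising step instead of keeping it fixed."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))          # (out_ch, in_ch, kh, kw)
x = rng.normal(size=(4, 16, 16))

w_bin = binarize_weights(w)
x_q4 = quantize_activations(x, bits=4)
print("weight quantization MSE:", np.mean((w - w_bin) ** 2).round(4))
print("distinct |weight| values in channel 0:", np.unique(np.abs(w_bin[0])).size)
```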
||
DISTRIBUTED INFERENCE WITH MINIMAL OFF-CHIP TRAFFIC FOR TRANSFORMERS ON LOW-POWER MCUS Speaker: Victor Jung, ETH Zurich, CH Authors: Severin Bochem1, Victor Jung1, Arpan Suravi Prasad1, Francesco Conti2 and Luca Benini3 1ETH Zurich, CH; 2Università di Bologna, IT; 3ETH Zurich, CH | Università di Bologna, IT Abstract Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technological revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power microcontroller units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic, exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy consumption of 0.64 mJ, a latency of 0.54 ms per inference, an above-linear speedup of 26.07x, and an energy-delay-product (EDP) improvement of 27.22x, compared to a single-chip system. On MobileBERT, the distributed system's runtime is 38.84 ms, with an above-linear 4.69x speedup when using 4 MCUs compared to a single-chip system./proceedings-archive/2024/DATA/518_pdf_upload.pdf |
||
HIFI-SAGE: HIGH FIDELITY GRAPHSAGE-BASED LATENCY ESTIMATORS FOR DNN OPTIMIZATION Speaker: Shambhavi Balamuthu Sampath, BMW Group, DE Authors: Shambhavi Balamuthu Sampath1, Leon Hecht2, Moritz Thoma1, Lukas Frickenstein1, Pierpaolo Mori3, Nael Fasfous1, Manoj Rohit Vemparala1, Alexander Frickenstein1, Claudio Passerone3, Daniel Mueller-Gritschneder4 and Walter Stechele2 1BMW Group, DE; 2TU Munich, DE; 3Politecnico di Torino, IT; 4TU Wien, AT Abstract As deep neural networks (DNNs) are increasingly deployed on resource-constrained edge devices, optimizing and compressing them for real-time performance becomes crucial. Traditional hardware-aware DNN search methods often rely on inaccurate proxy metrics, expensive latency lookup tables, or slow hardware-in-the-loop (HIL) evaluations. To address this, quasi-generalized latency estimators, typically meta-learning-based, were proposed to replace HIL evaluations and accelerate the search. These come with a one-time data collection and training cost and can adapt to new hardware with few measurements. However, they still have some drawbacks: (1) They increase complexity by trying to generalize across a range of diverse hardware types; (2) They depend on handcrafted hardware descriptors, which may fail to capture hardware characteristics; (3) They often perform poorly on new, unseen hardware that significantly differs from their initial training set. To overcome these challenges, this paper turns to the more straightforward platform-specific estimators that do not require hardware descriptors and can be easily trained on any hardware. We introduce HiFi-SAGE, a high-fidelity GraphSAGE-based platform-specific latency estimator. When trained from scratch on only 100 latency measurements, our novel dual-head estimator design surpasses the state-of-the-art (SoTA) on the 10% error bound metric by up to 17.4 p.p. while achieving an impressive fidelity score of 99% on the diverse LatBench dataset. We demonstrate that applying HiFi-SAGE to a genetic algorithm-based DNN compression search achieved a Pareto front comparable to real HIL feedback with a mean absolute percentage error (MAPE) of 2.54%, 2.48%, and 4.16% for InceptionV3, DenseNet169, and ResNet50, respectively. Compared to existing platform-specific works, the lower number of latency measurements and higher fidelity scores positions HiFi-SAGE as an attractive alternative to replace expensive HIL setups. Code is available at: https://github.com/shamvbs/HiFi-SAGE./proceedings-archive/2024/DATA/861_pdf_upload.pdf |
||
SOLARML: OPTIMIZING SENSING AND INFERENCE FOR SOLAR-POWERED TINYML PLATFORMS Speaker: Hao Liu, TU Delft, NL Authors: Hao Liu, Qing Wang and Marco Zuniga, TU Delft, NL Abstract Machine learning models can now run on microcontrollers. Thanks to the advances in neural architectural search, we can automatically identify tiny machine learning (tinyML) models that satisfy stringent memory and energy requirements. However, existing methods often overlook the energy used during event detection and data gathering. This is critical for devices powered by renewable energy sources like solar power, where energy efficiency is paramount. To address it, we introduce SolarML, a solution designed specifically for solar-powered tinyML platforms, which optimizes the end-to-end system's inference accuracy and energy consumption, from data gathering and processing to model inference. Considering two applications of gesture recognition and keywords spotting, SolarML has the following contributions: 1) a hardware platform with an optimal event detection mechanism that reduces event detection costs by up to 10× compared to state-of-the-art alternatives; 2) a joint optimization framework eNAS that reduces the energy consumption of the sensor and inference model by up to 2×, compared to methods that only optimize the inference model. Jointly, they enable SolarML to run end-to-end gesture and audio inference on a battery-free tinyML platform by only harvesting solar energy for 30 and 57 seconds, respectively, in an office environment (500 lux)./proceedings-archive/2024/DATA/865_pdf_upload.pdf |
||
SAFELOC: OVERCOMING DATA POISONING ATTACKS IN HETEROGENEOUS FEDERATED MACHINE LEARNING FOR INDOOR LOCALIZATION Speaker: Akhil Singampalli, Colorado State University, US Authors: Akhil Singampalli, Danish Gufran and Sudeep Pasricha, Colorado State University, US Abstract Machine learning (ML) based indoor localization solutions are critical for many emerging applications, yet their efficacy is often compromised by hardware/software variations across mobile devices (i.e., device heterogeneity) and the threat of ML data poisoning attacks. Conventional methods aimed at countering these challenges show limited resilience to the uncertainties created by these phenomena. In response, we introduce SAFELOC, a novel framework that not only minimizes localization errors under these challenging conditions but also ensures model compactness for efficient mobile device deployment. SAFELOC introduces a novel fused neural network architecture that performs data poisoning detection and localization, with a low model footprint using federated learning (FL). Additionally, a dynamic saliency map-based aggregation strategy is designed to adapt based on the severity of the detected data poisoning scenario. Experimental evaluations demonstrate that SAFELOC achieves improvements of up to 5.9× in mean localization error, 7.8× in worst-case localization error, and a 2.1× reduction in model inference latency compared to state-of-the-art indoor localization frameworks across diverse indoor environments and data poisoning attack scenarios./proceedings-archive/2024/DATA/891_pdf_upload.pdf |
||
HYBRID TOKEN SELECTOR BASED ACCELERATOR FOR VITS Speaker: Anadi Goyal, Indian Institute of Technology Jodhpur, IN Authors: Akshansh Yadav, Anadi Goyal and Palash Das, Indian Institute of Technology, Jodhpur, IN Abstract Vision Transformers (ViTs) have shown great success in computer vision but suffer from high computational complexity due to the quadratic growth in the number of tokens processed. Token selection/pruning has emerged as a promising solution; however, early methods introduce significant overhead and complexity. Applying a token selector in the early layers of a ViT can yield substantial computational savings (GFLOPs) compared to using it in later layers. However, this approach often leads to significant accuracy loss, particularly with the popular Attention-based Token Selection (ATS) technique. To address these issues, we propose a hybrid token selection (HTS) strategy that integrates our Keypoint-based Token Selection (KTS) with the existing ATS method. KTS dynamically selects important tokens based on image content in the early layers, while ATS refines token pruning in the later layers. This hybrid approach reduces computational costs while maintaining accuracy. Additionally, we design custom hardware modules to accelerate the execution of the proposed methods and the ViT backbone. The proposed HTS delivers a 35.85% reduction in execution time relative to the baseline without any token selection. Furthermore, our results demonstrate that HTS achieves up to a 0.39% increase in accuracy and offers up to 6.05% savings in GFLOPs compared to existing method./proceedings-archive/2024/DATA/1036_pdf_upload.pdf |
||
DAOP: DATA-AWARE OFFLOADING AND PREDICTIVE PRE-CALCULATION FOR EFFICIENT MOE INFERENCE Speaker: Yujie Zhang, National University of Singapore, SG Authors: Yujie Zhang, Shivam Aggarwal and Tulika Mitra, National University of Singapore, SG Abstract Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory, often negating GPU speed advantages. To address this, we present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP dynamically allocates experts between CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on CPUs to minimize transfer latency. This approach enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. Comprehensive evaluations across various datasets show that DAOP outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy./proceedings-archive/2024/DATA/1041_pdf_upload.pdf |
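The allocation policy outlined in the abstract can be pictured with a few lines of code. The snippet below is a minimal illustration only, assuming a simple greedy rule (keep the most frequently activated experts resident on the GPU and treat the rest as CPU-offloaded); the class and method names are invented for the example, and DAOP's predictor, pre-calculation, and graceful-degradation mechanisms are not reproduced.

```python
from collections import Counter

class GreedyExpertCache:
    """Toy GPU-resident expert cache keyed by activation frequency.

    An illustrative sketch of activation-aware expert placement, not the DAOP
    engine itself: experts activated most often for the current sequence are
    kept on the 'GPU'; the rest are assumed to stay on the CPU.
    """

    def __init__(self, num_experts: int, gpu_slots: int):
        self.gpu_slots = gpu_slots
        self.counts = Counter()                                  # per-sequence activation counts
        self.on_gpu = set(range(min(gpu_slots, num_experts)))    # initial cache contents

    def record(self, activated_experts):
        """Update activation statistics with the experts routed for one token."""
        self.counts.update(activated_experts)
        hottest = {e for e, _ in self.counts.most_common(self.gpu_slots)}
        if hottest:
            self.on_gpu = hottest

    def placement(self, expert_id: int) -> str:
        """Where would this expert execute under the current placement?"""
        return "gpu" if expert_id in self.on_gpu else "cpu"


if __name__ == "__main__":
    cache = GreedyExpertCache(num_experts=8, gpu_slots=2)
    # Simulated router decisions (top-2 experts per token).
    for routed in [(0, 3), (3, 5), (3, 0), (5, 3)]:
        cache.record(routed)
    print({e: cache.placement(e) for e in range(8)})  # expert 3 plus one other stay on the GPU
```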
||
SPIKESTREAM: ACCELERATING SPIKING NEURAL NETWORK INFERENCE ON RISC-V CLUSTERS WITH SPARSE COMPUTATION EXTENSIONS Speaker: Simone Manoni, Università di Bologna, IT Authors: Simone Manoni1, Paul Scheffler2, Luca Zanatta3, Andrea Acquaviva1, Luca Benini4 and Andrea Bartolini1 1Università di Bologna, IT; 2ETH Zurich, CH; 3NTNU, NO; 4ETH Zurich, CH | Università di Bologna, IT Abstract Spiking Neural Network (SNN) inference has a clear potential for high energy efficiency as computation is triggered by events. However, the inherent sparsity of events poses challenges for conventional computing systems, driving the development of specialized neuromorphic processors, which come with high silicon area costs and lack the flexibility needed for running other computational kernels, limiting widespread adoption. In this paper, we explore the low-level software design, parallelization, and acceleration of SNNs on general-purpose multicore clusters with a low-overhead RISC-V ISA extension for streaming sparse computations. We propose SpikeStream, an optimization technique that maps weights accesses to affine and indirect register-mapped memory streams to enhance performance, utilization, and efficiency. Our results on the end-to-end Spiking-VGG11 model demonstrate a significant 4.39× speedup and an increase in utilization from 9.28% to 52.3% compared to a non-streaming parallel baseline. Additionally, we achieve an energy efficiency gain of 3.46× over LSMCore and a performance gain of 2.38× over Loihi./proceedings-archive/2024/DATA/1227_pdf_upload.pdf |
||
REACT: RANDOMIZED ENCRYPTION WITH AI-CONTROLLED TARGETING FOR NEXT-GEN SECURE COMMUNICATION Speaker: Hossein Sayadi, California State University, Long Beach, US Authors: Zhangying He and Hossein Sayadi, California State University, Long Beach, US Abstract This work introduces REACT (Randomized Encryption with AI-Controlled Targeting), a novel framework leveraging Deep Reinforcement Learning (DRL) and Moving Target Defense (MTD) to secure chaotic communication in resource-constrained environments. REACT employs a random generator to dynamically assign encryption modes, creating unpredictable patterns that thwart interception. At the receiver's end, four DRL agents collaborate to identify encryption modes and apply decryption methods, ensuring secure, synchronized communication. Evaluation results demonstrate up to 100% decryption accuracy and a 51% reduction in attack success probability, establishing REACT as a robust and adaptive defense for secure and reliable communication/proceedings-archive/2024/DATA/1345_pdf_upload.pdf |
||
DUSGAI: A DUAL-SIDE SPARSE GEMM ACCELERATOR WITH FLEXIBLE INTERCONNECTS Speaker: Wujie Zhong, Hong Kong University of Science and Technology, HK Authors: Wujie Zhong and Yangdi Lyu, Hong Kong University of Science and Technology, HK Abstract Sparse general matrix multiplication (SpGEMM) is a crucial operation of deep neural networks (DNNs), leading to the development of numerous specialized SpGEMM accelerators. These accelerators leverage flexible interconnects, thereby outperforming their rigid counterparts. However, the suboptimal utilization of sparsity patterns limits overall performance efficiency. In this work, we propose DuSGAI, a sparse GEMM accelerator that employs a parallel index intersection structure to utilize dual-side sparsity. Our evaluation of DuSGAI with five popular DNN models demonstrates a 3.03× performance improvement compared to the state-of-the-art SpGEMM accelerator./proceedings-archive/2024/DATA/193_pdf_upload.pdf |
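The index-intersection structure at the heart of dual-side sparse GEMM reduces, per output element, to intersecting the nonzero index lists of an activation row and a weight column. The sketch below shows that intersection with a sequential two-pointer sweep standing in for DuSGAI's parallel comparator hardware; the function name and data layout are illustrative assumptions, not the paper's.

```python
import numpy as np

def sparse_dot_by_intersection(a_idx, a_val, b_idx, b_val):
    """Dot product of two sparse vectors stored as sorted (index, value) pairs.

    Only indices present in BOTH operands contribute -- the index-intersection
    idea behind dual-side sparse GEMM. A two-pointer sweep stands in for the
    parallel intersection hardware.
    """
    acc, i, j = 0.0, 0, 0
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            acc += a_val[i] * b_val[j]
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1
        else:
            j += 1
    return acc


if __name__ == "__main__":
    # Sparse activation row: nonzeros at columns 1, 4, 7.
    a_idx, a_val = [1, 4, 7], [0.5, -2.0, 1.0]
    # Sparse weight column: nonzeros at rows 0, 4, 7, 9.
    b_idx, b_val = [0, 4, 7, 9], [3.0, 1.5, 2.0, -1.0]
    dense_a = np.zeros(10); dense_a[a_idx] = a_val
    dense_b = np.zeros(10); dense_b[b_idx] = b_val
    assert np.isclose(sparse_dot_by_intersection(a_idx, a_val, b_idx, b_val),
                      dense_a @ dense_b)  # only indices 4 and 7 contribute: -3.0 + 2.0 = -1.0
```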
TS19 Session 14 - DT6
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 08:30 CET - 10:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
DE2: SAT-BASED SEQUENTIAL LOGIC DECRYPTION WITH A FUNCTIONAL DESCRIPTION Speaker: Hai Zhou, Northwestern University, US Authors: You Li, Guannan Zhao, Yunqi He and Hai Zhou, Northwestern University, US Abstract Logic locking is a promising approach to protect the intellectual properties of integrated circuits. Existing logic locking schemes assume that an adversary must possess a cycle-accurate oracle circuit to launch an I/O attack. This paper presents DE2, a novel and rigorous attacking algorithm based on a new adversarial model. DE2 only takes a high-level functional specification of the victim chip. Such specifications are increasingly prevalent in the modern IC design flow. DE2 closes the timing gap between the specification and the circuit with an automatic alignment mechanism, which enables effective logic decryption without cycle-accurate information. An essential enabler of DE2 is a synthesis-based sequential logic decryption algorithm called LIM, which introduces only a minimal overhead in every iteration. Experiments show that DE2 can efficiently attack logic-locked benchmarks without access to a cycle-accurate oracle circuit. Besides, LIM can solve 20% more ISCAS'89 benchmarks than state-of-the-art sequential logic decryption algorithms./proceedings-archive/2024/DATA/100_pdf_upload.pdf |
||
HARDWARE/SOFTWARE RUNTIME FOR GPSA PROTECTION IN RISC-V EMBEDDED CORES Speaker: Louis Savary, INRIA, FR Authors: Louis Savary1, Simon Rokicki2 and Steven Derrien3 1INRIA, FR; 2Irisa, FR; 3Université de Bretagne Occidentale | Lab-STICC, FR Abstract State-of-the-art hardware countermeasures against fault attacks are based, among others, on control flow and code integrity checking. Generalized Path Signature Analysis and Continuous Signature Monitoring can assert these integrity properties. However, supporting such mechanisms requires a dedicated compiler flow and does not support indirect jumps. This work proposes a technique based on a hardware/software runtime to generate those signatures while executing unmodified off-the-shelf RISC-V binaries. The proposed approach has been implemented on a pipelined processor, and experimental results show an average slowdown of x3 compared to unprotected implementations while being completely compiler-independent./proceedings-archive/2024/DATA/264_pdf_upload.pdf |
||
ANALOG CIRCUIT ANTI-PIRACY SECURITY BY EXPLOITING DEVICE RATINGS Speaker: Hazem Hammam, Sorbonne Université, CNRS, LIP6, FR Authors: Hazem Hammam1, Hassan Aboushady1 and Haralampos-G. Stratigopoulos2 1Sorbonne Université, CNRS, LIP6, FR; 2Sorbonne University, CNRS, LIP6, FR Abstract We propose a novel anti-piracy security technique for analog and mixed-signal (AMS) circuits. The circuit is re-designed by obfuscating transistors and capacitors with key-controlled versions. We obfuscate both the device geometries and their ratings, which define the maximum allowable current, voltage, and power dissipation. The circuit is designed to function correctly only with a specific key. Loading any other incorrect key degrades performance and for the vast majority of these keys the chip is damaged because of electrical over-stress. This prevents counter-attacks that employ a chip to search for the correct key. The methodology is demonstrated on a low-dropout regulator (LDO) designed in the 22nm FDSOI technology by GlobalFoundries. By locking the LDO, the entire chip functionality breaks unless the LDO is unlocked first. The secured LDO shows no performance penalty and area overhead is justifiable and less than 25%, while it is protected against all known counter-attacks in the AMS domain./proceedings-archive/2024/DATA/527_pdf_upload.pdf |
||
SIDE-CHANNEL COLLISION ATTACKS AGAINST ASCON Speaker: Hao Zhang, Nanjing University of Science and Technology, CN Authors: Hao Zhang, Yiwen Gao, Yongbin Zhou and Jingdian Ming, Nanjing University of Science and Technology, CN Abstract Side-channel attacks pose a significant threat to the security of electronic devices, particularly IoT/AIoT terminals. By leveraging side-channel leakages, collision attacks can efficiently extract secret keys from cryptographic devices while requiring considerably less computational effort. In this paper, we investigate side-channel collision attacks against ASCON, a lightweight cipher designed for resource-constrained devices that has been standardized by NIST. For the first time, we propose a side-channel key recovery attack against ASCON by identifying collisions in the linear diffusion layer. Using the Pearson correlation coefficient and Euclidean distance for internal collision detection, our attack successfully recovers the secret key with approximately 5,000 power traces from an 8-bit software implementation on an AVR device. To further reduce attack complexity, we introduce a novel metric, Locally-Weighted Sum (LWS), which focuses on the most likely points of leakage, thereby decreasing the number of power traces required for a successful attack. Our experiment on the same target demonstrates that the LWS-based collision attack can recover the full secret key with approximately 3,000 power traces, a reduction of 40 percent. Our study indicates that ASCON is susceptible to side-channel collision attacks, and bitslice implementations remain vulnerable to such threats./proceedings-archive/2024/DATA/674_pdf_upload.pdf |
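To make the collision-detection step concrete, the sketch below scores a candidate collision between two power-trace windows with the Pearson correlation coefficient, one of the two distinguishers the abstract mentions. The trace data, window length, and decision rule are synthetic placeholders; the paper's Locally-Weighted Sum metric is not reproduced.

```python
import numpy as np

def pearson_collision_score(trace_a: np.ndarray, trace_b: np.ndarray) -> float:
    """Pearson correlation between two power-trace windows.

    A high score suggests the two windows processed the same intermediate
    value (an internal collision); a low score suggests different values.
    """
    return float(np.corrcoef(trace_a, trace_b)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    leakage = rng.normal(size=200)                    # hypothetical data-dependent leakage shape
    colliding = leakage + 0.3 * rng.normal(size=200)  # same intermediate value, fresh noise
    unrelated = rng.normal(size=200)                  # different intermediate value
    print("colliding:", round(pearson_collision_score(leakage, colliding), 3))
    print("unrelated:", round(pearson_collision_score(leakage, unrelated), 3))
    # The colliding pair scores far higher; a collision attack applies this
    # comparison across key-dependent pairings to narrow down the key.
```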
||
CUTE-LOCK: BEHAVIORAL AND STRUCTURAL MULTI-KEY LOGIC LOCKING USING TIME BASE KEYS Speaker: Amin Rezaei, California State University, Long Beach, US Authors: Kevin Lopez and Amin Rezaei, California State University, Long Beach, US Abstract The outsourcing of semiconductor manufacturing raises security risks, such as piracy and overproduction of hardware intellectual property. To overcome this challenge, logic locking has emerged to lock a given circuit using additional key bits. While single-key logic locking approaches have demonstrated serious vulnerability to a wide range of attacks, multi-key solutions, if carefully designed, can provide a reliable defense against not only oracle-guided logic attacks, but also removal and dataflow attacks. In this paper, using time base keys, we propose, implement and evaluate a family of secure multi-key logic locking algorithms called Cute-Lock that can be applied both in RTL-level behavioral and netlist-level structural representations of sequential circuits. Our extensive experimental results under a diverse range of attacks confirm that, compared to vulnerable state-of-the-art methods, employing the Cute-Lock family drives attacking attempts to a dead end without additional overhead./proceedings-archive/2024/DATA/827_pdf_upload.pdf |
||
SAFELIGHT: ENHANCING SECURITY IN OPTICAL CONVOLUTIONAL NEURAL NETWORK ACCELERATORS Speaker: Salma Afifi, Colorado State University, US Authors: Salma Afifi1, Ishan Thakkar2 and Sudeep Pasricha1 1Colorado State University, US; 2University of Kentucky, US Abstract The rapid proliferation of deep learning has revolutionized computing hardware, driving innovations to improve computationally expensive multiply-accumulate operations in deep neural networks. Among these innovations are integrated silicon-photonic systems that have emerged as energy-efficient platforms capable of achieving light speed computation and communication, positioning optical neural network (ONN) platforms as a transformative technology for accelerating deep learning models such as convolutional neural networks (CNNs). However, the increasing complexity of optical hardware introduces new vulnerabilities, notably the risk of hardware trojan (HT) attacks. Despite the growing interest in ONN platforms, little attention has been given to how HT-induced threats can compromise performance and security. This paper presents an in-depth analysis of the impact of such attacks on the performance of CNN models accelerated by ONN accelerators. Specifically, we show how HTs can compromise microring resonators (MRs) in a state-of-the-art non-coherent ONN accelerator and reduce classification accuracy across CNN models by up to 7.49% to 80.46% by just targeting 10% of MRs. We then propose techniques to enhance ONN accelerator robustness against these attacks and show how the best techniques can effectively recover the accuracy drops./proceedings-archive/2024/DATA/902_pdf_upload.pdf |
||
ONE MORE MOTIVATION TO USE EVALUATION TOOLS, THIS TIME FOR HARDWARE MULTIPLICATIVE MASKING OF AES Speaker: Hemin Rahimi, TU Darmstadt, DE Authors: Hemin Rahimi and Amir Moradi, TU Darmstadt, DE Abstract Safeguarding cryptographic implementations against the increasing threat of Side-Channel Analysis (SCA) attacks is essential. Masking, a countermeasure that randomizes intermediate values, is a cornerstone of such defenses. In particular, SCA-secure implementation of AES, the most-widely used encryption standard, can employ Boolean masking as well as multiplicative masking due to its underlying Galois field operations. However, multiplicative masking is susceptible to vulnerabilities, including the zero-value problem, which has been identified right after the introduction of multiplicative masking. At CHES 2018, De Meyer et al. proposed a hardware-based approach to manage these challenges and implemented multiplicative masking for AES, incorporating a Kronecker delta function and randomness optimization. In this work, we evaluate their design using the PROLEAD evaluation tool under the glitch- and transition-extended probing model. Our findings reveal a critical vulnerability in their first-order implementation of the Kronecker delta function, stemming from the employed randomness optimization. This leakage compromises the security of their presented masked AES Sbox. After pinpointing the source of such a leakage, we propose an alternative randomness optimization to address this issue, and demonstrate its effectiveness through rigorous evaluations by means of PROLEAD./proceedings-archive/2024/DATA/1026_pdf_upload.pdf |
||
THREE EYED RAVEN: AN ON-CHIP SIDE CHANNEL ANALYSIS FRAMEWORK FOR RUN-TIME EVALUATION Speaker: M Dhilipkumar, IIT Kanpur, IN Authors: M Dhilipkumar, Priyanka Bagade and Debapriya Basu Roy, IIT Kanpur, IN Abstract Side-channel attacks exploit physical leakages from hardware components, such as power consumption, to break secure cryptographic algorithms and retrieve their secret keys. Therefore, evaluating implementations of cryptographic algorithms against such analysis is of paramount importance. A typical side-channel evaluation framework requires external devices such as a sampling oscilloscope along with a customized analysis board, which makes the evaluation both expensive and time-consuming. However, recent advancements in developing on-chip sensors on FPGAs for monitoring side-channel information pave the path towards a fully on-chip side-channel analysis framework without the requirement of any external devices, reducing both the cost and time required to carry out these experiments. In this paper, we propose our on-chip side-channel analysis framework, RAVEN, which is augmented with hardware implementations of Test Vector Leakage Assessment (TVLA), Correlation Power Analysis (CPA), and Deep Learning based Leakage Assessment (DL-LA). The presence of on-chip hardware implementations of these side-channel evaluation algorithms, coupled with on-chip sensors, allows RAVEN to assess the side-channel security of the cryptographic implementation in a fast and efficient manner. Our proposed implementation of DL-LA can also be trained on-chip and does not require pre-trained weight values. The resource consumption of RAVEN is modest, as the entire design, along with the sensors, fits into an AMD-Xilinx PYNQ board. We have validated the proposed RAVEN framework on AES-128 traces, and the results of the hardware implementations of TVLA, CPA, and DL-LA closely resemble those of software implementations while requiring significantly less time and storage./proceedings-archive/2024/DATA/1217_pdf_upload.pdf |
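As a reference point for the evaluation blocks listed above, the sketch below computes the fixed-vs-random Welch's t-statistic that TVLA is built on, using the conventional |t| > 4.5 failure threshold. This is a generic software formulation of TVLA, not RAVEN's hardware implementation, and the injected leakage is synthetic.

```python
import numpy as np

def tvla_t_statistic(fixed: np.ndarray, random: np.ndarray) -> np.ndarray:
    """Welch's t-statistic per sample point between fixed- and random-input traces.

    fixed, random: arrays of shape (n_traces, n_samples).
    |t| > 4.5 at any sample point is the conventional TVLA failure criterion.
    """
    m_f, m_r = fixed.mean(axis=0), random.mean(axis=0)
    v_f, v_r = fixed.var(axis=0, ddof=1), random.var(axis=0, ddof=1)
    n_f, n_r = fixed.shape[0], random.shape[0]
    return (m_f - m_r) / np.sqrt(v_f / n_f + v_r / n_r)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_traces, n_samples = 2000, 500
    random_set = rng.normal(size=(n_traces, n_samples))
    fixed_set = rng.normal(size=(n_traces, n_samples))
    fixed_set[:, 250] += 0.3          # inject a small data-dependent leak at sample 250
    t = tvla_t_statistic(fixed_set, random_set)
    print("max |t|:", round(float(np.max(np.abs(t))), 2))
    print("leaky sample points:", np.flatnonzero(np.abs(t) > 4.5))  # expect [250]
```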
||
RTL-BREAKER: ASSESSING THE SECURITY OF LLMS AGAINST BACKDOOR ATTACKS ON HDL CODE GENERATION Speaker: Lakshmi Likhitha Mankali, New York University, US Authors: Lakshmi Likhitha Mankali1, Jitendra Bhandari1, Manaar Alam2, Ramesh Karri1, Michail Maniatakos2, Ozgur Sinanoglu2 and Johann Knechtel2 1New York University, US; 2New York University Abu Dhabi, AE Abstract Large language models (LLMs) have demonstrated remarkable potential with code generation/completion tasks for hardware design. However, the reliance on such automation introduces critical security risks. Notably, given that LLMs have to be trained on vast datasets of codes that are typically sourced from publicly available repositories, often without thorough validation, LLMs are susceptible to so-called data poisoning or backdoor attacks. Here, attackers inject malicious code for the training data, which can be carried over into the hardware description code (HDL) generated by LLMs. This threat vector can compromise the security and integrity of entire hardware systems. In this work, we propose RTL-Breaker, a novel backdoor attack framework on LLM-based HDL code generation. RTL-Breaker provides an in-depth analysis of essential aspects of this novel problem: 1) various trigger mechanisms versus their effectiveness for inserting malicious modifications, and 2) side-effects by backdoor attacks on code generation in general, i.e., impact on code quality. RTL-Breaker emphasizes the urgent need for more robust measures to safeguard against such attacks. Toward that end, we open-source our framework and all data./proceedings-archive/2024/DATA/1438_pdf_upload.pdf |
||
MC3: MEMORY CONTENTION-BASED COVERT CHANNEL COMMUNICATION ON SHARED DRAM SYSTEM-ON-CHIPS Speaker: Ismet Dagli, Colorado School of Mines, US Authors: Ismet Dagli1, James Crea1, Soner Seckiner2, Yuanchao Xu3, Selcuk Kose2 and Mehmet Belviranli1 1Colorado School of Mines, US; 2University of Rochester, US; 3University of California, Santa Cruz, US Abstract Shared memory system-on-chips (SM-SoCs) are ubiquitously employed by a wide range of computing platforms, including edge/IoT devices, autonomous systems, and smart-phones. In SM-SoCs, system-wide shared memory enables a convenient and financially feasible way to make data accessible across dozens of processing units (PUs), such as CPU cores and domain-specific accelerators. Due to the diverse computational characteristics of the PUs they embed, SM-SoCs often do not employ a shared last-level cache (LLC). While the literature studies covert channel attacks for shared memory systems, high-throughput communication is currently possible only through either relying on an LLC or having privileged/physical access to the shared memory subsystem. In this study, we introduce a new memory-contention-based covert communication attack, MC3, which specifically targets shared system memory in mobile SoCs. Unlike existing attacks, our approach achieves high-throughput communication without the need for an LLC or elevated access to the system. We explore the effectiveness of our methodology by demonstrating the trade-off between the channel transmission rate and the robustness of the communication. We evaluate MC3 on NVIDIA Orin AGX, NX, and Nano platforms and achieve transmission rates up to 6.4 Kbps with less than 1% error rate./proceedings-archive/2024/DATA/1454_pdf_upload.pdf |
||
COMB FREQUENCY DIVISION MULTIPLEXING: A NON-BINARY MODULATION FOR AIRGAP COVERT CHANNEL TRANSMISSION Speaker: Mohamed-alla-eddine BAHI, Univ Rennes, INSA Rennes, CNRS, IETR - UMR 6164, F-35000 Rennes, France, FR Authors: Mohamed-alla-eddine BAHI1, Maria MENDEZ REAL2 and Maxime PELCAT2 1Univ Rennes, INSA Rennes, IETR, UMR CNRS 6164, FR; 2IETR - UMR CNRS 6164, FR Abstract Isolated networks ensure the confidentiality of sensitive data on a system by eliminating all physical connections to public networks or external devices, making the system air-gapped. However, previous work has shown that Electromagnetic (EM) emanations when correlated with secret data, can lead to side or covert channels. Specifically, EM emissions caused by clocks can modulate high-frequency signals, enabling unauthorized data transmission to cross the air-gap. This work focuses on covert channels where a software or hardware Trojan inserted in the victim system induces side channel emissions that the attacker can recover through the covert channel, producing an intentional transmission and leakage of sensitive information. This paper introduces a novel encoding method for covert channels called Comb Frequency Division Multiplexing (CFDM). CFDM leverages modulated signals emitted by the victim system, which are evenly spaced across the frequency spectrum, creating a comb-like pattern. Moreover, the uncontrolled nature of the side channel modulation can make each subcarrier carry different information. Unlike traditional methods such as Frequency Shift Keying (FSK) and Amplitude Shift Keying (ASK), CFDM encodes information in both the frequency and amplitude dimensions of the covert channel harmonic sub-carriers./proceedings-archive/2024/DATA/1066_pdf_upload.pdf |
||
MULTI-SENSOR DATA FUSION FOR ENHANCED DETECTION OF LASER FAULT INJECTION ATTACKS IN CRYPTOGRAPHIC HARDWARE: PRACTICAL RESULTS Speaker: Naghmeh Karimi, University of Maryland Baltimore County, US Authors: Mohammad Ebrahimabadi1, Raphael Viera2, Sylvain Guilley3, Jean Luc Danger4, Jean-Max Dutertre5 and Naghmeh Karimi1 1University of Maryland Baltimore County, US; 2Ecole de Mines de Saint-Etienne, FR; 3Secure-IC, FR; 4Télécom ParisTech, FR; 5Mines Saint-Etienne, FR Abstract Though considered secure, cryptographic hardware can be compromised by adversaries injecting faults during runtime to leak secret keys from faulty outputs. Among fault injection methods, laser illumination has gained the most attention due to its precision in targeting specific areas and its fine temporal control. Accordingly, to tackle such attacks, this paper proposes a low-cost detection scheme that leverages Time-To-Digital Converters (TDC) to sense the IR drops caused by laser illumination. To mitigate the false alarm rate while maintaining a high detection rate, our method embeds multiple sensors (as few as two, as discussed in the text). To evaluate the impact of laser illumination and the effectiveness of our proposed scheme, we conducted extensive experiments (≈200k) using a real laser setup to illuminate the targeted AES module implemented on an Artix-7 FPGA. The results confirm the high accuracy of our detection method; achieving 82% fault detection with less than 0.01% false alarms and a detection latency of just 4 clock cycles. Notably, it enabled preventive actions in 70% of cases where illumination occurred but the AES outcome had not changed, greatly enhancing circuit security against key leakage./proceedings-archive/2024/DATA/1353_pdf_upload.pdf |
TS20 Session 9 - D13
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 11:00 CET - 12:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
MEGAROUTE: UNIVERSAL AUTOMATED LARGE-SCALE PCB ROUTING METHOD WITH ADAPTIVE STEP-SIZE SEARCH Speaker: Haiyun Li, Tsinghua University, CN Authors: Haiyun Li1 and Jixin Zhang2 1School of Computer Science, Hubei University of Technology, Wuhan, China; Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, CN; 2Hubei University of Technology, CN Abstract The automation of very large-scale PCB routing has long been an unresolved problem within the industry due to the variant electronic components and complex design rules. Existing automated PCB routing methods are primarily designed for single component (e.g., BGA, BTB, etc.) or for simple and small-scale PCBs, and often fail to meet the industry requirements for large-scale PCBs. The biggest challenge is to ensure nearly 100% routability and DRC compliance while achieving high efficiency for large-scale PCBs with various components. To address this challenge, we propose MegaRoute, a precise, efficient, and universal PCB routing method that surpasses the routing routability and DRC compliance of existing methods, including commercial tools, for PCBs with thousands of nets. MegaRoute introduces an adaptive step-size search algorithm that adjusts exploration steps based on design rules and surrounding obstacles, improving both routability and efficiency. We incorporate shape-based obstacle detection for strict DRC compliance and use routing optimization techniques to enhance routability. We conduct extensive experiments on hundreds of real-world PCBs, including mainboard PCBs with thousands of nets. The results show that MegaRoute achieves over 98% routability across all PCBs with DRC-free results, significantly outperforming the state-of-the-art methods and mainstream commercial tools./proceedings-archive/2024/DATA/62_pdf_upload.pdf |
||
TIMING-DRIVEN GLOBAL PLACEMENT WITH HYBRID HEURISTICS AND NADAM-BASED NET WEIGHTING Speaker: Linhao Lu, Southwest University of Science and Technology, CN Authors: Linhao Lu, Wenxin Yu, Hongwei Tian, Chenjin Li, Xinmiao Li, Zhaoqi Fu and Zhengjie Zhao, Southwest University of Science and Technology, CN Abstract Timing optimization is critical throughout the design flow of very-large-scale integration (VLSI) circuits, and global placement is pivotal in achieving timing closure. However, most global placement algorithms focus on optimizing wirelength rather than timing. To address this gap, this paper proposes a timing-driven global placement algorithm that utilizes a Nadam-based net-weighting strategy, complemented by a hybrid heuristic approach for adaptive, dynamic adjustment of net weights. Experimental results on the ICCAD 2015 contest benchmarks show that, compared to RePlAce, our algorithm significantly improves WNS and TNS by 40.7% and 56.5%, respectively./proceedings-archive/2024/DATA/115_pdf_upload.pdf |
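The abstract does not spell out its weighting formula, so the sketch below applies a textbook Nadam update to a vector of per-net weights, using normalized slack as the "gradient" so that timing-critical nets see their weights grow across placement iterations. The hyperparameters, the slack-to-gradient mapping, and the function name are assumptions for illustration only.

```python
import numpy as np

def nadam_step(w, g, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nadam update of per-net weights w given a criticality signal g.

    Here g is the normalized slack, so the descent step pushes the weights of
    negative-slack (critical) nets up and relaxes the rest. This is a generic
    Nadam step, not the exact weighting scheme of the cited placer.
    """
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov-accelerated blend of the bias-corrected momentum and the raw gradient.
    update = (beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)) / (np.sqrt(v_hat) + eps)
    return w - lr * update, m, v


if __name__ == "__main__":
    slack_ns = np.array([0.8, 0.1, -0.3, -1.2])   # hypothetical per-net worst slack (ns)
    g = slack_ns / np.max(np.abs(slack_ns))       # normalized: negative for critical nets
    w = np.ones_like(slack_ns)                    # initial net weights
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, 21):                        # e.g. one step per placement iteration
        w, m, v = nadam_step(w, g, m, v, t)
    print(np.round(w, 3))                         # critical nets end up with larger weights
```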
||
IR-FUSION: A FUSION FRAMEWORK FOR STATIC IR DROP ANALYSIS COMBINING NUMERICAL SOLUTION AND MACHINE LEARNING Speaker: Feng Guo, Beijing University of Posts and Telecommunications, CN Authors: Feng Guo1, Jianwang Zhai1, Jingyu Jia1, Jiawei Liu1, Kang Zhao1, Bei Yu2 and Chuan Shi1 1Beijing University of Posts and Telecommunications, CN; 2The Chinese University of Hong Kong, HK Abstract IR drop analysis for on-chip power grids (PGs) is vital but computationally challenging due to the rapid growth in integrated circuit (IC) scale. Traditional numerical methods employed by current EDA software are accurate but extremely time-consuming. To achieve rapid analysis of IR drop, various machine learning (ML) methods have been introduced to address the inefficiency of numerical methods. However, issues of interpretability and scalability have limited practical applications. In this work, we propose IR-Fusion, which combines numerical methods with ML to achieve a trade-off and complementarity between accuracy and efficiency in static IR drop analysis. Specifically, the numerical method is used to obtain rough solutions, and ML models are utilized to further improve accuracy. In our framework, an efficient numerical solver, AMG-PCG, is applied to obtain rough numerical solutions. Then, based on the numerical solution, a fusion of hierarchical numerical-structural information representing the multilayer structure of the PG is employed, and an Inception Attention U-Net model is designed to capture details and interactions of features at different scales. To cope with the limitations and diversity of PG designs, an augmented curriculum learning strategy is applied during the training phase. Evaluation of IR-Fusion shows that its accuracy is significantly better than previous ML-based methods while requiring considerably fewer solver iterations to reach the same accuracy as purely numerical methods./proceedings-archive/2024/DATA/284_pdf_upload.pdf |
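The "rough numerical solution" stage can be illustrated with a small conductance system G·v = i solved by preconditioned conjugate gradients. The sketch below builds a toy resistive grid and runs a few Jacobi-preconditioned CG iterations with SciPy as a stand-in for the paper's AMG-PCG solver; the grid size, conductance values, and load currents are invented, and the learned refinement model is omitted.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def grid_conductance_matrix(n: int, g_wire: float = 1.0, g_pad: float = 0.05):
    """Conductance matrix of an n-by-n resistive power grid.

    Each node connects to its four neighbours through g_wire and weakly to the
    package supply through g_pad, which keeps the system positive definite.
    """
    lap = sp.lil_matrix((n * n, n * n))
    for r in range(n):
        for c in range(n):
            i = r * n + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    j = rr * n + cc
                    lap[i, i] += g_wire
                    lap[i, j] -= g_wire
            lap[i, i] += g_pad
    return sp.csr_matrix(lap)


if __name__ == "__main__":
    n = 32
    G = grid_conductance_matrix(n)
    rng = np.random.default_rng(0)
    currents = rng.uniform(0.0, 1e-3, size=n * n)    # per-node load currents (A)
    # Rough solution: a few preconditioned CG iterations (Jacobi preconditioner
    # stands in for the AMG preconditioner used in the paper).
    M = sp.diags(1.0 / G.diagonal())
    v_rough, info = spla.cg(G, currents, M=M, maxiter=30)
    print("max rough IR drop (V):", round(float(v_rough.max()), 4))
    # In IR-Fusion this rough IR-drop map would then be refined by the learned model.
```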
||
TIMING-DRIVEN DETAILED PLACEMENT WITH UNSUPERVISED GRAPH LEARNING Speaker: Dhoui Lim, Ulsan National Institute of Science and Technology, KR Authors: DhouI Lim1 and Heechun Park2 1Kookmin University, School of Electrical Engineering, KR; 2Ulsan National Institute of Science and Technology (UNIST), KR Abstract Detailed placement is a crucial stage in VLSI design that starts from the global placement result to determine the final legal locations of each cell through fine-grained optimization. Traditional detailed placement methods focus on minimizing the half-perimeter wire length (HPWL) as in global placement. However, incorporating timing-driven placement becomes essential with the increasing complexity of VLSI designs and tighter performance constraints. In this paper, we propose a timing-driven detailed placement framework that leverages unsupervised graph learning techniques. Specifically, we integrate timing-related metrics into the objective function for detailed placement and formulate it into the loss function of a graph neural network (GNN) model. The loss function includes overlap, legality, and timing-related arc lengths, with appropriate weights using Bayesian optimization. Experimental results show that our framework achieves comparable or improved HPWL while significantly reducing total negative slack (TNS) by 5.5%, compared to existing methods./proceedings-archive/2024/DATA/493_pdf_upload.pdf |
||
EFFICIENT AND EFFECTIVE MACRO PLACEMENT FOR VERY LARGE SCALE DESIGNS USING RL AND MCTS INTEGRATION Speaker: Zong-Ze Lee, National Cheng Kung University, TW Authors: Jai-Ming Lin1, Zong-Ze Lee1 and Nan-Chu Lin2 1Department of Electrical Engineering, National Cheng Kung University, TW; 2National Cheng Kung University, TW Abstract Macro placement plays a critical role in modern designs. With the rise of artificial intelligence, some researchers have turned to reinforcement learning (RL) techniques to handle this problem. However, these approaches usually require substantial computing resources and runtime for training, making them impractical for very large-scale integration (VLSI) designs. To address these challenges, this paper proposes an effective placer based on the Monte Carlo Tree Search (MCTS) algorithm, guided by a pre-trained RL agent. To reduce the complexities of RL and MCTS, we transform the macro placement problem into a macro group allocation problem. Additionally, we propose a new reward function to facilitate training convergence in RL. Moreover, to reduce runtime without affecting placement quality, we use the pre-training result to directly evaluate the placement quality in MCTS for non-terminal nodes, significantly reducing the number of placement runs required. Experiments show that our MCTS-based placer can achieve high-quality results even in the early stages of RL training. Moreover, our method outperforms state-of-the-art placers./proceedings-archive/2024/DATA/1418_pdf_upload.pdf |
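One standard way to let a pre-trained RL agent guide MCTS is PUCT-style node selection, where the agent's policy prior biases exploration toward promising actions. The sketch below shows that selection rule on toy macro-group placement candidates; the statistics, exploration constant, and node layout are illustrative and do not reproduce the paper's reward function or tree design.

```python
import math

def puct_select(children, c_puct: float = 1.4):
    """Pick the child node with the highest PUCT score.

    children: list of dicts with keys 'visits' (int), 'value_sum' (float) and
    'prior' (float, from the pre-trained RL policy). The prior term is how the
    agent steers the tree toward promising placement actions before many
    rollouts have been spent on them.
    """
    total_visits = sum(ch["visits"] for ch in children)

    def score(ch):
        q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
        u = c_puct * ch["prior"] * math.sqrt(total_visits + 1) / (1 + ch["visits"])
        return q + u

    return max(children, key=score)


if __name__ == "__main__":
    # Three candidate regions for the next macro group (toy numbers).
    children = [
        {"name": "region_A", "visits": 10, "value_sum": 6.0, "prior": 0.2},
        {"name": "region_B", "visits": 2,  "value_sum": 1.5, "prior": 0.5},
        {"name": "region_C", "visits": 0,  "value_sum": 0.0, "prior": 0.3},
    ]
    print(puct_select(children)["name"])   # region_B: good value estimate plus a strong prior
```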
||
DAMIL-DCIM: A DIGITAL CIM LAYOUT SYNTHESIS FRAMEWORK WITH DATAFLOW-AWARE FLOORPLAN AND MILP-BASED DETAILED PLACEMENT Speaker: Chuyu Wang, Fudan University, CN Authors: Chuyu Wang, Ke Hu, Fan Yang, Keren Zhu and Xuan Zeng, Fudan University, CN Abstract Digital computing-in-memory (DCIM) systems integrate complex digital logic with parasitic-sensitive bitcell arrays. Conventional physical design strategies degrade DCIM performance due to a lack of dataflow regularity and excessive wirelength. As a result, current DCIM design often relies on manual layout, which is time-consuming and a bottleneck in the design cycle. Existing layout synthesis frameworks for DCIM often mimic the manual approach and employ a template-based method for DCIM placement. However, overly constrained templates lead to excessive core area, resulting in high costs in practice. In this work, we introduce DAMIL-DCIM, a novel placement framework that bridges template-based techniques with optimization-based placement methods. DAMIL-DCIM utilizes a global dataflow-aware floorplan inspired by template methods and further optimizes the layout using MILP(Mixed Integer Linear Programming)-based detailed placement. The combination of global floorplanning and placement optimization reduces total wire length while maintaining dataflow regularity, resulting in lower parasitic and enhanced performance. Experimental results show, on a practical 28nm DCIM circuit, our approach improves frequency by 25.2% and reduces power consumption by 19.6% compared to Cadence Innovus, while maintaining the same core area./proceedings-archive/2024/DATA/568_pdf_upload.pdf |
||
BI-LEVEL OPTIMIZATION ACCELERATED DRC-AWARE PHYSICAL DESIGN AUTOMATION FOR PHOTONIC DEVICES Speaker: Hao Chen, Hong Kong University of Science and Technology, HK Authors: Hao Chen1, Yuzhe Ma1 and Yeyu Tong2 1Hong Kong University of Science and Technology, HK; 2The Hong Kong University of Science and Technology (Guangzhou)), CN Abstract Photonic integrated circuits (PICs) design has been challenged by the complex physics behind various integrated photonic devices. Inverse design offers an effective design automation solution for obtaining high-performance and compact photonic devices using computational algorithms and electromagnetic (EM) simulations. However, the challenge lies in transforming the fabrication-infeasible device geometries obtained from computational algorithms into reliable while optimal physical design. Incorporating fabrication constraints into the optimization iterations can extend running time and lead to performance compromise. In this work, we proposed a novel DRC-aware photonic inverse design framework, leveraging the bi-level optimization to enable end-to-end gradient-based device optimization. Our method can guarantee all intermediate devices on the optimization trajectory adhere to fabrication requirements and rules. The proposed workflow eliminates the need for a binarization process and fabrication constraint adaption, thus enabling a fast and efficient search for high-performance and reliable integrated photonic devices. Experimental results demonstrate the benefits of our proposed method, including improved device performance and reduced EM simulations and running time./proceedings-archive/2024/DATA/629_pdf_upload.pdf |
||
GTN-CELL: EFFICIENT STANDARD CELL CHARACTERIZATION USING GRAPH TRANSFORMER NETWORK Speaker: LIHAO LIU, State Key Lab of Integrated Chips and Systems, and School of Microelectronics, Fudan University, Shanghai, China., CN Authors: Lihao Liu, Beisi Lu, Yunhui Li, Li Shang and Fan Yang, Fudan University, CN Abstract Lookup table (LUT)-based libraries of standard cell characterization is crucial to accurate static timing analysis (STA). However, with the continuous scaling of technology nodes and the increasing complexity of circuit designs, the traditional non-linear delay model (NLDM) is progressively unable to meet the required accuracy for cell modeling. The current source model (CSM) offers a more precise characterization of cells at advanced nodes and is able to handle arbitrary electrical waveforms. However, the CSM is highly time-consuming because it requires extensive transistor-level simulations, posing severe challenges to efficient standard cell library design. This work presents GTN-Cell, an efficient graph transformer network (GTN)-based method for library-compatible LUT-based CSM waveform prediction of standard cell characterization. GTN-Cell represents the transistor-level structures of standard cells as graphs, learning the local structural information of each cell. By incorporating the transformer encoder into the model and embedding path-related positional encodings, GTN-Cell captures the global relationships between distant nodes within each cell. Compared with HSPICE, the GTN-Cell achieves an average error of 2.27% on predicted voltage waveforms among different standard cells and timing arcs while reducing the number of simulations by 70%./proceedings-archive/2024/DATA/694_pdf_upload.pdf |
||
WIRE-BONDING FINGER PLACEMENT FOR FBGA SUBSTRATE LAYOUT DESIGN WITH FINGER ORIENTATION CONSIDERATION Speaker: Yu-En Lin, National Taiwan University of Science and Technology, Department of Computer Science and Information Engineering, TW Authors: Yu-En Lin and Yi-Yu Liu, National Taiwan University of Science and Technology, TW Abstract Wire bonding is a mature packaging technique that enables chip pins to transmit signals to bonding fingers on the substrate through bonding wires. Such commodity technology is also essential in supporting the rapid development of the system in package and heterogeneous integration technologies. However, the automation tools are relatively deficient compared to other packaging techniques, resulting in tremendous manual design time and engineering effort due to numerous wire-bonding design constraints. This paper addresses the finger placement problem and serves as the first work considering the orientation constraint of fingers. The finger placement flow is divided into three stages. First, an integer linear programming (ILP) formulation is developed to allocate each net finger row. After that, we utilize mixed-integer quadratic programming (MIQP) to place the bonding fingers and consider the wire crossing constraint. Finally, the locations of the bonding finger are refined by considering both the bonding finger orientation angle and the finger spacing constraints. The final layouts generated by our integrated finger placement and substrate routing framework outperform manual designs in terms of the design time, the total wirelength, and the routing completion rate./proceedings-archive/2024/DATA/932_pdf_upload.pdf |
||
A PARALLEL FLOATING RANDOM WALK SOLVER FOR REPRODUCIBLE AND RELIABLE CAPACITANCE EXTRACTION Speaker: Jiechen Huang, Dept. Computer Science & Tech., Tsinghua University, CN Authors: Jiechen Huang1, Shuailong Liu2 and Wenjian Yu1 1Tsinghua University, CN; 2Exceeda Inc., CN Abstract The floating random walk (FRW) method is a popular and promising tool for capacitance extraction, but its stochastic nature leads to critical limitations in reproducibility and physics-related reliability. In this work, we present FRW-RR, a parallel FRW solver with enhancements for Reproducible and Reliable capacitance extraction. First, we propose a novel parallel FRW scheme that ensures reproducible results, regardless of the degree of parallelism (DOP) or machine used. We further optimize its parallel efficiency and enhance the numerical stability. Then, to guarantee the physical properties of capacitances and reliability for downstream tasks, we propose a regularization technique based on constrained multi-parameter estimation to postprocess FRW's results. Experiments on actual IC structures demonstrate that, FRW-RR ensures DOP-independent reproducibility (with at least 12 decimal significant digits) and physics-related reliability with negligible overhead. It has remarkable advantages over existing FRW solvers, including the one in [1]./proceedings-archive/2024/DATA/1091_pdf_upload.pdf |
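A reproducibility mechanism of the kind the abstract describes can be obtained by deriving each walk's random stream from the master seed and the walk index alone, so results do not depend on which thread executes which walk. The sketch below demonstrates that property with NumPy's SeedSequence on a placeholder Monte Carlo estimator (it estimates π, not capacitance); the seeds, walk count, and estimator are illustrative, and the paper's FRW transitions and regularization step are not reproduced.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

MASTER_SEED = 2024
N_WALKS = 10_000

def run_walk(k: int) -> float:
    """One 'walk' with an RNG derived only from (master seed, walk index).

    Because the random stream depends on k alone, walk k consumes the same
    numbers no matter which worker runs it -- the property needed for
    DOP-independent, reproducible estimates. The estimator itself is a
    placeholder (hit-or-miss estimate of pi), not an FRW transition rule.
    """
    rng = np.random.default_rng(np.random.SeedSequence([MASTER_SEED, k]))
    x, y = rng.uniform(size=2)
    return 4.0 if x * x + y * y <= 1.0 else 0.0


def parallel_estimate(workers: int) -> float:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        samples = list(pool.map(run_walk, range(N_WALKS)))
    return sum(samples) / N_WALKS


if __name__ == "__main__":
    # Identical result regardless of the degree of parallelism.
    print(parallel_estimate(workers=1))
    print(parallel_estimate(workers=8))
```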
||
A COMPREHENSIVE INDUCTANCE-AWARE MODELING APPROACH TO POWER DISTRIBUTION NETWORK IN HETEROGENEOUS 3D INTEGRATED CIRCUITS Speaker: Yuanqing Cheng, Beihang University, CN Authors: Quansen Wang1, Vasilis Pavlidis2 and Yuanqing Cheng1 1Beihang University, CN; 2Aristotle University of Thessaloniki, GR Abstract Heterogeneous 3D integration technology is a cost-effective and high-performance alternative to planar integrated circuits (ICs). In this paper, we propose an on-chip power distribution network (PDN) modeling technique for heterogeneous 3D-ICs (H3D-ICs), which explicitly takes the effects of on-chip inductance into account. The proposed model facilitates efficient transient and AC simulations with integrated inductive effects, enabling accurate noise characterization at high frequencies and facilitating the exploration of early-stage PDN design. The model is validated via HSPICE simulations, demonstrating a maximum error below 1% and achieving average speedups of 1.5x in transient and 8.5x in AC simulations./proceedings-archive/2024/DATA/791_pdf_upload.pdf |
TS21 Session 17 - E4
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 11:00 CET - 12:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
SPNERF: MEMORY EFFICIENT SPARSE VOLUMETRIC NEURAL RENDERING ACCELERATOR FOR EDGE DEVICES Speaker: Yipu Zhang, Hong Kong University of Science and Technology, HK Authors: Yipu Zhang, Jiawei Liang, Jian Peng, Jiang Xu and Wei Zhang, Hong Kong University of Science and Technology, HK Abstract Neural rendering has gained prominence for its high-quality output, which is crucial for AR/VR applications. However, its large voxel grid data size and irregular access patterns challenge real-time processing on edge devices. While previous works have focused on improving data locality, they have not adequately addressed the issue of large voxel grid sizes, which necessitate frequent off-chip memory access and substantial on-chip memory. This paper introduces SpNeRF, a software-hardware co-design solution tailored for sparse volumetric neural rendering. We first identify memory-bound rendering inefficiencies and analyze the inherent sparsity in the voxel grid data of neural rendering. To enhance efficiency, we propose novel preprocessing and online decoding steps, reducing memory size for the voxel grid. The preprocessing step employs hash mapping to support irregular data access while maintaining a minimal memory size. The online decoding step enables efficient on-chip sparse voxel grid processing, incorporating bitmap masking to mitigate PSNR loss caused by hash collisions. To further optimize performance, we design a dedicated hardware architecture supporting our sparse voxel grid processing technique. Experimental results demonstrate that SpNeRF achieves an average 21.07× reduction in memory size while maintaining comparable PSNR levels. When benchmarked against Jetson XNX, Jetson ONX, RT-NeRF.Edge and NeuRex.Edge, our design achieves speedups of 95.1×, 63.5×, 1.5× and 10.3×, and improves energy efficiency by 625.6×, 529.1×, 4×, and 4.4×, respectively./proceedings-archive/2024/DATA/155_pdf_upload.pdf |
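The hash-mapping and bitmap-masking steps can be pictured as follows: occupied voxel coordinates are hashed into a compact feature table, and a one-bit-per-voxel occupancy bitmap suppresses any contribution from empty voxels, including hash collisions that would otherwise alias onto stored entries. The sketch below uses a common spatial-hash construction with invented table sizes and feature widths; it illustrates the idea only and is not SpNeRF's preprocessing or decoder.

```python
import numpy as np

GRID = 128                       # voxel grid resolution per axis (illustrative)
TABLE_SIZE = 1 << 15             # compact hash table, much smaller than GRID**3
PRIMES = np.array([1, 2_654_435_761, 805_459_861], dtype=np.uint64)  # common spatial-hash primes

def voxel_hash(coords: np.ndarray) -> np.ndarray:
    """Hash integer (x, y, z) voxel coordinates into the compact feature table."""
    c = coords.astype(np.uint64)
    return ((c[:, 0] * PRIMES[0]) ^ (c[:, 1] * PRIMES[1]) ^ (c[:, 2] * PRIMES[2])) % TABLE_SIZE


def lookup(coords, table, occupancy_bitmap):
    """Fetch features for sampled voxels, masking out empty space.

    occupancy_bitmap holds one bit per voxel of the full grid; a 0 bit means
    the voxel is empty, so even if its hash collides with a stored entry the
    contribution is suppressed -- the role the bitmap plays in the abstract.
    """
    flat = coords[:, 0] * GRID * GRID + coords[:, 1] * GRID + coords[:, 2]
    occupied = occupancy_bitmap[flat].astype(bool)
    feats = table[voxel_hash(coords)]
    feats[~occupied] = 0.0
    return feats


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    table = rng.normal(size=(TABLE_SIZE, 4)).astype(np.float32)   # hashed voxel features
    occupancy_bitmap = np.zeros(GRID ** 3, dtype=np.uint8)
    occupied_voxels = rng.integers(0, GRID, size=(5000, 3))
    occupancy_bitmap[occupied_voxels[:, 0] * GRID * GRID
                     + occupied_voxels[:, 1] * GRID
                     + occupied_voxels[:, 2]] = 1
    queries = rng.integers(0, GRID, size=(8, 3))
    print(lookup(queries, table, occupancy_bitmap))   # rows for unoccupied queries are zeroed
```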
||
SBQ: EXPLOITING SIGNIFICANT BITS FOR EFFICIENT AND ACCURATE POST-TRAINING DNN QUANTIZATION Speaker: Jiayao Ling, Shanghai Jiao Tong University, CN Authors: Jiayao Ling1, Gang Li2, Qinghao Hu2, Xiaolong Lin1, Cheng Gu1, Jian Cheng3 and Xiaoyao Liang1 1Shanghai Jiao Tong University, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN; 3Institute of Automation, CN Abstract Post-Training Quantization is an effective technique for deep neural network acceleration. However, as the bit-width decreases to 4 bits and below, PTQ faces significant challenges in preserving accuracy, especially for attention-based models like LLMs. The main issue lies in considerable clipping and rounding errors induced by the limited number of quantization levels and narrow data range in conventional low-precision quantization. In this paper, we present an efficient and accurate PTQ method that targets 4 bits and below through algorithm and architecture co-design. Our key idea is to dynamically extract a small portion of significant bit terms from high-precision operands to perform low-precision multiplications under the given computational budget. Specifically, we propose Significant-Bit Quantization (SBQ). It exploits a product-aware method to dynamically identify significant terms and an error-compensated computation scheme to minimize compute errors. We present a dedicated inference engine to unleash the power of SBQ. Experiments on CNNs, ViTs, and LLMs reveal that SBQ consistently outperforms prior PTQ methods under 2~4-bit quantization. We also compare the proposed inference engine with state-of-the-art bit-operation-based quantization architectures TQ and Sibia. Results show that SBQ can achieve the highest area and energy efficiency./proceedings-archive/2024/DATA/388_pdf_upload.pdf |
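The "significant bit terms" idea can be illustrated by decomposing one operand into its power-of-two terms and multiplying with only the top few, turning the product into a short sequence of shifts and adds. The sketch below shows that approximation and its error as the term budget grows; the decomposition and function names are illustrative, and SBQ's product-aware term selection and error-compensated computation are not reproduced.

```python
def top_bit_terms(value: int, k: int):
    """Return the k most significant power-of-two terms of |value|, with sign.

    Example: 90 = 0b1011010 = 64 + 16 + 8 + 2; with k = 2 only 64 and 16 are
    kept, approximating 90 by 80.
    """
    sign = -1 if value < 0 else 1
    mag = abs(value)
    terms = []
    bit = mag.bit_length() - 1
    while bit >= 0 and len(terms) < k:
        if (mag >> bit) & 1:
            terms.append(sign * (1 << bit))
        bit -= 1
    return terms


def approx_multiply(x: int, w: int, k: int) -> int:
    """Multiply x by only the k most significant bit terms of w.

    Each kept term turns the multiplication into a shift-and-add, which is the
    hardware-friendly operation a significant-bit quantizer relies on.
    """
    return sum(x * t for t in top_bit_terms(w, k))


if __name__ == "__main__":
    x, w = 7, 90
    for k in (1, 2, 3, 4):
        approx = approx_multiply(x, w, k)
        print(f"k={k}: approx={approx}  exact={x * w}  error={x * w - approx}")
```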
||
AIRCHITECT V2: LEARNING THE HARDWARE ACCELERATOR DESIGN SPACE THROUGH UNIFIED REPRESENTATIONS Speaker: Akshat Ramachandran, Georgia Tech, US Authors: Akshat Ramachandran1, Jamin Seo1, Yu-Chuan Chuang2, Anirudh Itagi1 and Tushar Krishna1 1Georgia Tech, US; 2National Taiwan University, TW Abstract Design space exploration (DSE) plays a crucial role in enabling custom hardware architectures, particularly for emerging applications like AI, where optimized and specialized designs are essential. With the growing complexity of deep neural networks (DNNs) and the introduction of advanced foundational models (FMs), the design space for DNN accelerators is expanding at an exponential rate. Additionally, this space is highly non-uniform and non-convex, making it increasingly difficult to navigate and optimize. Traditional DSE techniques rely on search-based methods, which involve iterative sampling of the design space to find the optimal solution. However, this process is both time-consuming and often fails to converge to the global optima for such design spaces. Recently, AIrchitect v1, the first attempt to address the limitations of search-based techniques, transformed DSE into a constant-time classification problem using recommendation networks. In this work, we propose AIrchitect v2, a more accurate and generalizable learning-based DSE technique applicable to large-scale design spaces that overcomes the shortcomings of earlier approaches. Specifically, we devise an encoder-decoder transformer model that (a) encodes the complex design space into a uniform intermediate representation using contrastive learning and (b) leverages a novel unified representation blending the advantages of classification and regression to effectively explore the large DSE space without sacrificing accuracy. Experimental results evaluated on 10^5 real DNN workloads demonstrate that, on average, AIrchitect v2 outperforms existing techniques by 15% in identifying optimal design points. Furthermore, to demonstrate the generalizability of our method, we evaluate performance on unseen model workloads (LLMs) and attain a 1.7x improvement in inference latency on the identified hardware architecture. Code and dataset are available at: https://github.com/maestro-project/AIrchitect-v2./proceedings-archive/2024/DATA/389_pdf_upload.pdf |
||
ZEBRA: LEVERAGING DIAGONAL ATTENTION PATTERN FOR VISION TRANSFORMER ACCELERATOR Speaker: Sukhyun Han, Sungkyunkwan University, KR Authors: Sukhyun Han, Seongwook Kim, Gwangeun Byeon, Jihun Yoon and Seokin Hong, Sungkyunkwan University, KR Abstract Vision Transformers (ViTs) have achieved remarkable performance in computer vision, but their computational complexity and challenges in optimizing memory bandwidth limit hardware acceleration. A major bottleneck lies in the self-attention mechanism, which leads to excessive data movement and unnecessary computations despite high input sparsity and low computational demands. To address this challenge, existing transformer accelerators have leveraged sparsity in attention maps. However, their performance gains are limited due to low hardware utilization caused by the irregular distribution of non-zero values in the sparse attention maps. Self-attention often exhibits strong diagonal patterns in the attention map, as the diagonal elements tend to have higher values than others. To exploit this, we introduce Zebra, a hardware accelerator framework optimized for diagonal attention patterns. A core component of Zebra is the Striped Diagonal (SD) pruning technique, which prunes the attention map by preserving only the diagonal elements at runtime. This reduces computational load without requiring offline pre-computation or causing significant accuracy loss. Zebra features a reconfigurable accelerator architecture that supports optimized matrix multiplication method, called Striped Diagonal Matrix Multiplication (SDMM), which computes only the diagonal elements of matrices. With this novel method, Zebra addresses low hardware utilization, a key barrier to leveraging the diagonal patterns. Experimental results demonstrate that Zebra achieves a 57x speedup over a CPU and 1.7x over the state-of-the-art ViT accelerator with similar inference accuracy./proceedings-archive/2024/DATA/480_pdf_upload.pdf |
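Striped-diagonal pruning amounts to restricting the attention map to a band around the diagonal before the softmax. The sketch below applies such a band mask to a toy single-head attention computation; the band width and tensor shapes are illustrative, and Zebra's striped layout, runtime pruning decision, and SDMM dataflow are not modeled.

```python
import numpy as np

def striped_diagonal_attention(q, k, band: int):
    """Attention restricted to a diagonal band of half-width `band`.

    Scores outside the band are discarded before the softmax, so only the
    near-diagonal entries -- where ViT attention tends to concentrate -- are
    kept. The band width here is purely illustrative.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    rows = np.arange(n)[:, None]
    cols = np.arange(n)[None, :]
    scores = np.where(np.abs(rows - cols) <= band, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, dim = 8, 16
    q, k = rng.normal(size=(tokens, dim)), rng.normal(size=(tokens, dim))
    attn = striped_diagonal_attention(q, k, band=1)
    print(np.round(attn, 2))   # nonzero only on the tri-diagonal band
```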
||
PUSHING UP TO THE LIMIT OF MEMORY BANDWIDTH AND CAPACITY UTILIZATION FOR EFFICIENT LLM DECODING ON EMBEDDED FPGA Speaker: Jindong Li, Institute of Computing Technology, Chinese Academy of Sciences, CN Authors: Jindong Li, Tenglong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang and Yi Zeng, Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract The extremely high computational and storage demands of large language models have excluded most edge devices, which were widely used for efficient machine learning, from being viable options. A typical edge device usually only has 4GB of memory capacity and a bandwidth of less than 20GB/s, while a large language model quantized to 4-bit precision with 7B parameters already requires 3.5GB of capacity, and its decoding process is purely bandwidth-bound. In this paper, we aim to explore these limits by proposing a hardware accelerator for large language model (LLM) inference on the Zynq-based KV260 platform, equipped with 4GB of 64-bit 2400Mbps DDR4 memory. We successfully deploy a LLaMA2-7B model, achieving a decoding speed of around 5 token/s, utilizing 93.3% of the memory capacity and reaching 85% decoding speed of the theoretical memory bandwidth limit. To fully reserve the memory capacity for model weights and key-value cache, we develop the system in a bare-metal environment without an operating system. To fully reserve the bandwidth for model weight transfers, we implement a customized dataflow with an operator fusion pipeline and propose a data arrangement format that can maximize the data transaction efficiency. This research marks the first attempt to deploy a 7B level LLM on a standalone embedded field programmable gate array (FPGA) device. It provides key insights into efficient LLM inference on embedded FPGA devices and provides guidelines for future architecture design./proceedings-archive/2024/DATA/498_pdf_upload.pdf |
||
LEVERAGING COMPUTE-IN-MEMORY FOR EFFICIENT GENERATIVE MODEL INFERENCE IN TPUS Speaker: Zhantong Zhu, Peking University, CN Authors: Zhantong Zhu, Hongou Li, Wenjie Ren, Meng Wu, Le Ye, Ru Huang and Tianyu Jia, Peking University, CN Abstract With the rapid advent of generative models, efficiently deploying these models on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations for improving efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace conventional digital systolic arrays in matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference. Building upon the observed design insights, we further explore various CIM-based TPU architectural design choices. Up to 44.2% and 33.8% performance improvement for large language model and diffusion transformer inference, and 27.3x reduction in MXU energy consumption can be achieved with different design choices, compared to the baseline TPUv4i architecture./proceedings-archive/2024/DATA/627_pdf_upload.pdf |
||
SPARSEINFER: TRAINING-FREE PREDICTION OF ACTIVATION SPARSITY FOR FAST LLM INFERENCE Speaker: Jiho Shin, University of Seoul, KR Authors: Jiho Shin1, Hoeseok Yang2 and Youngmin Yi3 1University of Seoul, KR; 2Santa Clara University, US; 3Sogang University, KR Abstract Leveraging sparsity is crucial for optimizing large language model (LLM) inference; however, modern LLMs employing SiLU as their activation function exhibit minimal activation sparsity. Recent research has proposed replacing SiLU with ReLU to induce significant activation sparsity and showed no downstream task accuracy degradation through fine-tuning. However, taking full advantage of it required training a predictor to estimate this sparsity. In this paper, we introduce SparseInfer, a simple, light-weight, and training-free predictor for activation sparsity of ReLU-fied LLMs, in which activation sparsity is predicted by comparing only the sign bits of inputs and weights. To compensate for possible prediction inaccuracy, an adaptive tuning of the predictor's conservativeness is enabled, which can also serve as a control knob for optimizing LLM inference. The proposed method achieves approximately 21% faster inference speed over the state-of-the-art, with negligible accuracy loss of within 1%p./proceedings-archive/2024/DATA/638_pdf_upload.pdf |
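One plausible reading of the sign-bit predictor is a majority vote: for each output neuron, count how many inputs share the sign of the corresponding weight, and predict the ReLU output to be nonzero only when agreements sufficiently outnumber disagreements, with a margin acting as the conservativeness knob. The sketch below implements that reading in NumPy; the scoring rule, margin, and shapes are assumptions, not the paper's exact predictor.

```python
import numpy as np

def predict_active(x, W, margin: int = 0):
    """Guess which ReLU outputs will be nonzero using sign information only.

    For output i, every input j whose sign matches the sign of W[i, j]
    contributes positively to x @ W[i]; the prediction is a majority vote of
    sign agreements. A larger `margin` makes the predictor more conservative
    (more neurons predicted active, fewer wrongly skipped). This scoring rule
    is an illustrative reading, not the paper's exact definition.
    """
    sign_match = (x > 0)[None, :] == (W > 0)          # shape: (out_dim, in_dim)
    votes = sign_match.sum(axis=1) - (~sign_match).sum(axis=1)
    return votes >= -margin


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    out_dim, in_dim = 64, 256
    W = rng.normal(size=(out_dim, in_dim))
    x = rng.normal(size=in_dim)
    predicted = predict_active(x, W, margin=8)
    actual = (W @ x) > 0                              # what a ReLU would really keep
    recall = (predicted & actual).sum() / max(actual.sum(), 1)
    print(f"predicted active: {predicted.mean():.0%}, recall of true actives: {recall:.0%}")
```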
||
LOW-RANK COMPRESSION FOR IMC ARRAYS Speaker: Kang Eun Jeon, Sungkyunkwan University, KR Authors: Kang Eun Jeon, Johnny Rhe and Jong Hwan Ko, Sungkyunkwan University, KR Abstract In this study, we address the challenge of low-rank model compression in the context of in-memory computing (IMC) architectures. Traditional pruning approaches, while effective in model size reduction, necessitate additional peripheral circuitry to manage complex dataflows and mitigate dislocation issues, leading to increased area and energy overheads, especially when model sparsity does not meet a specific threshold. To circumvent these drawbacks, we propose leveraging low-rank compression techniques, which, unlike pruning, streamline the dataflow and seamlessly integrate with IMC architectures. However, low-rank compression presents its own set of challenges, notably suboptimal IMC array utilization and compromised accuracy compared to traditional pruning methods. To address these issues, we introduce a novel approach employing the shift and duplicate kernel (SDK) mapping technique, which exploits idle IMC columns for parallel processing, and group low-rank convolution, which mitigates the information imbalance in the decomposed matrices. Our experimental results, using ResNet-20 and Wide ResNet16-4 networks on CIFAR-10 and CIFAR-100 datasets, demonstrate that our proposed method not only matches the performance of existing pruning techniques on ResNet-20 but also achieves up to 2.5x speedup and +20.9% accuracy boost on Wide ResNet16-4./proceedings-archive/2024/DATA/918_pdf_upload.pdf |
||
INTEGER UNIT-BASED OUTLIER-AWARE LLM ACCELERATOR PRESERVING NUMERICAL ACCURACY OF FP-FP GEMM Speaker: Jehun Lee, Seoul National University, KR Authors: Jehun Lee and Jae-Joon Kim, Seoul National University, KR Abstract The proliferation of large language models (LLMs) has significantly heightened the importance of quantization to alleviate the computational burden given the surge in the number of parameters. However, quantization often targets a subset of an LLM and relies on floating-point (FP) arithmetic for matrix multiplication of specific subsets, leading to performance and energy overhead. Additionally, to compensate for the quality degradation incurred by quantization, retraining methods are frequently employed, demanding significant efforts and resources. This paper proposes OwL-P, an outlier-aware LLM inference accelerator which preserves the numerical accuracy of FP arithmetic while enhancing hardware efficiency with an integer (INT)-based arithmetic unit for general matrix multiplication (GEMM), through the use of a shared exponent and efficient management of outlier data. It also mitigates off-chip bandwidth requirements by employing a compressed number format. The proposed number format leverages outliers and shared exponents to facilitate the compression of both model weights and activations. We evaluate this work across 10 different transformer-based benchmarks, and the results demonstrate that the proposed integer-based LLM accelerator achieves an average 2.70× performance gain and 3.57× energy savings while maintaining the numerical accuracy of FP arithmetic./proceedings-archive/2024/DATA/1191_pdf_upload.pdf |
||
LEVERAGING HOT DATA IN A MULTI-TENANT ACCELERATOR FOR EFFECTIVE SHARED MEMORY MANAGEMENT Speaker: Chunmyung Park, Seoul National University, KR Authors: Chunmyung Park, Jicheon Kim, Eunjae Hyun, Xuan Truong Nguyen and Hyuk-Jae Lee, Seoul National University, KR Abstract Multi-tenant neural networks (MTNN) have been emerging in various domains. To effectively handle multi-tenant workloads, modern hardware systems typically incorporate multiple compute cores with shared memory systems. While prior works have intensively studied compute- and bandwidth-aware allocation, on-chip memory allocation for MTNN accelerators has not been well studied. This work identifies two key challenges of on-chip memory allocation in MTNN accelerators: on-chip memory shortages, which force data eviction to off-chip memory, and on-chip memory underutilization, where memory remains idle due to coarse-grained allocation. Both issues lead to increased external memory accesses (EMAs), significantly degrading system performance. To address these challenges, we propose HotPot, a novel multi-tenant accelerator with a runtime temperature-aware memory allocator. HotPot prioritizes hot data for global on-chip memory allocation, reducing unnecessary EMAs and optimizing memory utilization. Specifically, HotPot introduces a temperature score that quantifies reuse potential and guides runtime memory allocation decisions. Experimental results demonstrate that HotPot improves system throughput (STP) by up to 1.88× and average normalized turnaround time (ANTT) by 1.52× compared to baseline methods./proceedings-archive/2024/DATA/1543_pdf_upload.pdf |
||
DOTS: DRAM-PIM OPTIMIZATION FOR TALL AND SKINNY GEMM OPERATIONS IN LLM INFERENCE Speaker: Gyeonghwan Park, Seoul National University, KR Authors: Gyeonghwan Park, Sanghyeok Han, Yoon Byungkuk and Jae-Joon Kim, Seoul National University, KR Abstract For large language models (LLMs), increasing token lengths require smaller batch sizes due to the increase in memory requirements for KV caching, leading to under-utilization of processing units and a memory bandwidth bottleneck in NPUs. To address the challenge, we propose DOTS, a new DRAM-PIM architecture that can handle both GEMV and GEMM efficiently, even outperforming NPUs in GEMM operations when batch sizes are small. The proposed DRAM-PIM reduces power consumption and latency caused by frequent DRAM row activation switching in conventional DRAM PIMs with negligible hardware overhead. Simulation results show that our proposed design achieves throughput improvements of 1.83x, 1.92x, and 1.7x over GPU, NPU, and heterogeneous NPU/PIM systems, respectively, for models as large as or larger than OPT-175B./proceedings-archive/2024/DATA/368_pdf_upload.pdf |
||
LLM4GV: AN LLM-BASED FLEXIBLE PERFORMANCE-AWARE FRAMEWORK FOR GEMM VERILOG GENERATION Speaker: Meiqi Wang, National Sun Yat-Sen University, TW Authors: Dingyang Zou1, Gaoche Zhang1, Kairui Sun2, Wen Zhe3, Meiqi Wang2 and Zhongfeng Wang1 1Nanjing University, CN; 2National Sun Yat-Sen University, TW; 3sysu, CN Abstract Advancements in AI have increased the demand for specialized AI accelerators, with the design of general matrix multiplication (GEMM) modules being crucial but time-consuming. While large language models (LLMs) show promise for automating GEMM design, challenges arise from GEMM's vast design space and performance requirements. Existing LLM-based frameworks for RTL code generation often lack flexibility and performance awareness. To overcome these challenges, we propose LLM4GV, a multi-agent LLM-based framework that integrates hardware optimization techniques (HOTs) and performance modeling, improving correctness and performance of the generated code over prior works./proceedings-archive/2024/DATA/349_pdf_upload.pdf |
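The SparseInfer entry above predicts activation sparsity in ReLU-fied LLMs by comparing only the sign bits of inputs and weights. The sketch below illustrates that general idea in NumPy, under stated assumptions: the exact scoring rule (a sign-agreement vote relaxed by a conservativeness margin) and all variable names are choices made for this illustration, not the paper's published formulation.

```python
import numpy as np

def predict_active_channels(x, W, margin=0):
    """Sign-bit-only sparsity prediction, in the spirit of SparseInfer.

    x      : (d_in,) input activation vector
    W      : (d_out, d_in) weight matrix of a ReLU-fied FFN layer
    margin : conservativeness knob; larger values predict more channels
             as active (fewer skipped), trading speed for safety.

    Returns a boolean mask over output channels; True means "compute it".
    """
    x_neg = np.signbit(x)                  # sign bit of each input element
    w_neg = np.signbit(W)                  # sign bit of each weight
    # A product x_i * W[j, i] is non-negative iff the two sign bits agree.
    agree = (w_neg == x_neg[None, :])
    # Score each output neuron by (#positive products - #negative products).
    score = agree.sum(axis=1) - (~agree).sum(axis=1)
    # Predict "active after ReLU" if the vote, relaxed by the margin, is non-negative.
    return score >= -margin

# Toy usage: compute (and apply ReLU to) only the channels predicted active.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W = rng.standard_normal((256, 64))
mask = predict_active_channels(x, W, margin=4)
y = np.zeros(256)
y[mask] = np.maximum(W[mask] @ x, 0.0)
```

The margin plays the role of an adjustable conservativeness knob: raising it marks more channels as active, trading recovered speed for a smaller chance of wrongly skipping a nonzero output.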
TS22 Session 13 - DT5
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 11:00 CET - 12:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
USING OFF-SET ONLY FOR CORRUPTING CIRCUIT TO RESIST STRUCTURAL ATTACK IN CAC LOCKING Speaker: Hsaing-Chun Cheng, National Tsing Hua University, TW Authors: Hsaing-Chun Cheng, RuiJie Wang and TingTing Hwang, National Tsing Hua University, TW Abstract Corrupt-and-Correct (CAC) Logic Lockings [1]–[4] are state-of-the-art hardware security techniques designed to protect IC/IP designs from IP piracy, reverse engineering, overproduction, and unauthorized use. Although these techniques are resilient to SAT-based attacks, they remain vulnerable to structural attacks, which exploit structural traces left by the synthesis tool to recover the original form. In this paper, we propose a novel method that uses only the OFF-set to corrupt the circuit. This approach helps the added circuitry better merge with the original circuit, thereby thwarting structural attacks while maintaining resilience to SAT-based attacks. Additionally, we demonstrate that our proposed method can incur less area overhead compared to previous locking methods in HIID [5]. Compared to SFLL-rem [4], our method can achieve comparable area overhead while effectively resisting structural attacks, including Valkyrie [6] and SPI attacks [7]./proceedings-archive/2024/DATA/106_pdf_upload.pdf |
||
RUNTIME SECURITY ANALYSIS OF MONOLITHIC 3D EMBEDDED DRAM WITH OXIDE-CHANNEL TRANSISTOR Speaker: Eduardo Ortega, Arizona State University, US Authors: Eduardo Ortega1, Jungyoun Kwak2, Shimeng Yu2 and Krishnendu Chakrabarty1 1Arizona State University, US; 2Georgia Tech, US Abstract We present the first security and disturbance study of monolithic 3D (M3D) embedded DRAM (eDRAM) with a 2T gain cell using oxide-channel transistors. We explore the Rowhammer/Rowpress vulnerabilities on amorphous indium tungsten oxide (IWO) transistors for eDRAM with standalone 2D integration and memory-on-memory M3D integration. In addition, we examine M3D-specific electrical disturbances from memory-on-logic M3D integration. We evaluate IWO eDRAM's susceptibility to these vulnerabilities/disturbances and discuss the potential impact on M3D integration. We examine physical design and architecture strategies for M3D integration of IWO eDRAM. We provide systematic recommendations to inform security strategies for M3D integration and security of IWO eDRAM. Our results show that limiting the minimum vertical interlayer distance to 300 nm reduces vertical disturbances in memory-on-memory M3D integration. In addition, for memory-on-logic M3D integration, we observed that IWO eDRAM's read bitline is sensitive to crosstalk from high-speed switching logic circuits. In conjunction, we show that IWO eDRAM standalone 2D integration is 30X more resilient to Rowhammer than current state-of-the-art memory because the IWO transistor's ON/OFF current ratio is roughly three orders of magnitude greater than standard memory access transistors./proceedings-archive/2024/DATA/121_pdf_upload.pdf |
||
EXPLORING LARGE INTEGER MULTIPLICATION FOR CRYPTOGRAPHY TARGETING IN-MEMORY COMPUTING Speaker: Florian Krieger, TU Graz, AT Authors: Florian Krieger, Florian Hirner and Sujoy Sinha Roy, TU Graz, AT Abstract Emerging cryptographic systems such as Fully Homomorphic Encryption (FHE) and Zero-Knowledge Proofs (ZKP) are computation- and data-intensive. FHE and ZKP implementations in software and hardware largely rely on the von Neumann architecture, where a significant amount of energy is lost on data movements. A promising computing paradigm is computing in memory (CIM) which enables computations to occur directly within memory thereby reducing data movements and energy consumption. However, efficiently performing large integer multiplications – critical in FHE and ZKP – is an open question, as existing CIM methods are limited to small operand sizes. In this work, we address this question by exploring advanced algorithmic approaches for large integer multiplication, identifying the Karatsuba algorithm as the most effective for CIM applications. Thereafter, we design the first Karatsuba multiplier for resistive CIM crossbars. Our multiplier uses a three-stage pipeline to enhance throughput and, additionally, balances memory endurance with efficient array sizes. Compared to existing CIM multiplication methods, when scaled up to the bit widths required in ZKP and FHE, our design achieves up to 916x in throughput and 281x in area-time product improvements./proceedings-archive/2024/DATA/301_pdf_upload.pdf |
||
A LOW-COMPLEXITY TRUE RANDOM NUMBER GENERATION SCHEME USING 3D-NAND FLASH MEMORY Speaker: Ruibin Zhou, National Sun Yat-Sen University, TW Authors: Ruibin Zhou1, Jian Huang1, Xianping Liu2, Yuhan Wang1, Xinrui Zhang1, Yungen Peng1 and Zhiyi Yu1 1National Sun Yat-Sen University, TW; 2Sun Yat-Sen University and Peng Cheng Laboratory, CN Abstract Unpredictable true random numbers are essential in cryptographic applications and secure communications. However, implementing True Random Number Generators (TRNGs) typically requires specialized hardware devices. In this paper, we propose a low-complexity true random number extraction scheme that can be implemented in endpoint systems containing 3D-NAND flash memory chips, addressing the need for random numbers without requiring additional complex hardware. We successfully utilized the randomness of the rapid charging and discharging of shallow charge traps in 3D-NAND memory as an entropy source. The proposed approach only requires conventional user-mode erase, program, and read operations, without any special timing control. We successfully extracted a random bitstream using this scheme without a post-debiasing process. We evaluated the randomness of the generated bitstream using the NIST SP 800-22 statistical test suite, and it passed all 15 tests./proceedings-archive/2024/DATA/737_pdf_upload.pdf |
||
A SYNTHESIZABLE THYRISTOR-LIKE LEAKAGE-BASED TRUE RANDOM NUMBER GENERATOR Speaker: Seohyun Kim, Ajou University, KR Authors: Seo Hyun Kim, Jang Hyun Kim and Jongmin Lee, Ajou University, KR Abstract As the demand for random data in cryptographic systems continues to rise, the importance of True Random Number Generators (TRNGs) becomes increasingly crucial for securing cryptographic applications. However, designing a TRNG that is reliable, secure, and cost-effective presents a significant challenge in hardware security. In this paper, we propose a synthesizable TRNG design based on a thyristor-like leakage-based (TL) structure, optimized for secure applications with small area and cost-efficiency. Our design has been validated using a 65-nm CMOS process, achieving a throughput of 0.397-Mbps within a compact area of 14.4-μm2, offering considerable cost savings while maintaining high randomness and area-throughput trade-off of 27.57 Gbps/mm2. Moreover, this TRNG can be synthesized as a standard cell through a semi-custom design flow, significantly reducing design costs and enabling design automation, which streamlines the process and reduces the time and effort required compared to traditional full-custom TRNGs. Additionally, as it is library characterized, the number of TL TRNG cells can be freely adjusted to meet specific application requirements, offering flexibility in both performance and scalability. To assess its randomness, the NIST statistical test suite was applied, and the proposed TL TRNG successfully passed all applicable tests, demonstrating its randomness./proceedings-archive/2024/DATA/752_pdf_upload.pdf |
||
GRAFTED TREES BEAR BETTER FRUIT: AN IMPROVED MULTIPLE-VALUED PLAINTEXT-CHECKING SIDE-CHANNEL ATTACK AGAINST KYBER Speaker: Jinnuo Li, School of Computer Science, China University of Geosciences, Wuhan, China, CN Authors: Jinnuo Li1, Chi Cheng1, Muyan Shen2, Peng Chen1, Qian Guo3, Dongsheng Liu4, Liji Wu5 and Jian Weng6 1China University of Geosciences, Wuhan, CN; 2School of Cryptology, University of Chinese Academy of Sciences, Beijing, China, CN; 3Lund University, Lund, Sweden, SE; 4School of Integrated Circuits, Huazhong University of Science and Technology, CN; 5School of Integrated Circuits, Tsinghua University, Beijing, China, CN; 6College of Cyber Security, Jinan University, Guangzhou, China, CN Abstract As a prominent category of side-channel attacks (SCAs), plaintext-checking (PC) oracle-based SCAs offer the advantages of generality and operational simplicity on a targeted device. At TCHES 2023, Rajendran et al. and Tanaka et al. independently proposed the multiple-valued (MV) PC oracle, significantly reducing the required number of queries (a.k.a., traces) in the PC oracle. However, in practice, when dealing with environmental noise or inaccuracies in the waveform classifier, they still rely on majority voting or the other technique that usually results in three times the number of queries compared to the ideal case. In this paper, we propose an improved method to further reduce the number of queries of the MV-PC oracle, particularly in scenarios where the oracle is imperfect. Compared to the state-of-the-art at TCHES 2023, our proposed method reduces the number of queries for a full key recovery by more than 42.5%. The method involves three rounds. Our key observation is that coefficients recovered in the first round can be regarded as prior information to significantly aid in retrieving coefficients in the second round. This improvement is achieved through a newly designed grafted tree. Notably, the proposed method is generic and can be applied to both the NIST key encapsulation mechanism (KEM) standard Kyber and other significant candidates, such as Saber and Frodo. We have conducted extensive software simulations against Kyber-512, Kyber-768, Kyber-1024, FireSaber, and Frodo-1344 to validate the efficiency of the proposed method. An electromagnetic attack conducted on real-world implementations, using an STM32F407G board equipped with an ARM Cortex-M4 microcontroller and Kyber implementation from the public library pqm4, aligns well with our simulations./proceedings-archive/2024/DATA/921_pdf_upload.pdf |
||
CAS-PUF: CURRENT-MODE ARRAY-TYPE STRONG PUF FOR SECURE COMPUTING IN AREA CONSTRAINED SOCS Speaker: Dimosthenis Georgoulas, University of Ioannina, GR Authors: Dimosthenis Georgoulas, Yiorgos Tsiatouhas and Vasileios Tenentes, University of Ioannina, GR Abstract Secure computing necessitates the integration in Systems-on-Chips (SoCs) of strong Physical Unclonable Functions (PUFs) that can generate a vast amount of Challenge Response Pairs (CRPs) for cryptographic keys generation, identification and authentication. However, the excessive area cost of strong PUF designs imposes integration difficulties to SoCs of area constrained applications, such as the IoT and mobile computing. In this paper, we present a novel strong PUF design, with silicon area requirements significantly lower than those of previous strong PUFs. The proposed Current-mode Array-type Strong PUF (CAS-PUF) is based on a current source topology of only six minimum size transistors, which is tolerant to power supply variation for enhanced reliability. Compared to previous strong PUFs, the CAS-PUF achieves the same number of CRPs with 20% to 72% less area size; while for the same area size, it provides 19 to 53 orders of magnitude higher number of CRPs. Furthermore, extensive Monte Carlo simulations on CAS-PUF show a reliability of 96.45% under ±10% power supply fluctuation; and 97.69% under temperature variation (0°C to 80°C), with an average uniqueness and uniformity of 50.01% and 49.54%, respectively. Therefore, the CAS-PUF can be used as a hardware root of trust mechanism to secure computing in area constrained SoCs./proceedings-archive/2024/DATA/1235_pdf_upload.pdf |
||
FLASH: AN EFFICIENT HARDWARE ACCELERATOR LEVERAGING APPROXIMATE AND SPARSE FFT FOR HOMOMORPHIC ENCRYPTION Speaker: Tengyu Zhang, Peking University, CN Authors: Tengyu Zhang1, Yufei Xue2, LING LIANG1, Zhen Gu3, Yuan Wang1, Runsheng Wang1, Ru Huang1 and Meng Li1 1Peking University, CN; 2Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, HK; 3Alibaba Group, CN Abstract Private convolutional neural network (CNN) inference based on hybrid homomorphic encryption (HE) and two-party computation (2PC) emerges as a promising technique for sensitive user data protection. However, homomorphic convolutions (HConvs) suffer from high computation costs due to the extensive number theoretic transforms (NTTs). While customized accelerators have been proposed, they usually overlook the intrinsic error resilience and native sparsity of DNNs and hybrid HE/2PC protocols. In this paper, we propose FLASH, leveraging these key characteristics for highly efficient HConv. Specifically, we observe the private DNN inference is robust to computation errors and propose approximate fast Fourier transforms (FFTs) to replace NTTs and avoid the expensive modular reduction operations. We also design a flexible sparse FFT dataflow leveraging the high sparsity of weight plaintexts. With extensive experiments, we demonstrate FLASH improves the power efficiency by 90.7x for weight transforms and by 9.7x for all transforms in HConvs compared to existing works. As for the HConvs in ResNet-18 and ResNet-50, FLASH achieves about 87.3% energy consumption reduction./proceedings-archive/2024/DATA/1451_pdf_upload.pdf |
||
HFL: HARDWARE FUZZING LOOP WITH REINFORCEMENT LEARNING Speaker: Lichao Wu, TU Darmstadt, DE Authors: Lichao Wu, Mohamadreza Rostami, Huimin Li and Ahmad-Reza Sadeghi, TU Darmstadt, DE Abstract As hardware systems grow increasingly complex, ensuring their security becomes more critical. This complexity often introduces vulnerabilities that are difficult and costly to address after fabrication. Traditional verification methods, such as formal and dynamic approaches, encounter limitations in scalability and efficiency when applied to complex hardware designs. While hardware fuzzing presents a promising solution for efficient and effective vulnerability detection, current methods face several challenges, including coverage saturation, long simulation times, and limited vulnerability detection capabilities. This paper introduces Hardware Fuzzing Loop (HFL), a novel fuzzing framework designed to address these limitations. We demonstrate that Long Short-Term Memory (LSTM), a machine learning model commonly used in natural language processing, can effectively capture the semantics of test cases and accurately predict hardware coverage. Building on this insight, we leverage reinforcement learning to optimize the test generation strategy dynamically within a hardware fuzzing loop. Our approach utilizes a multi-head LSTM to generate sophisticated RISC-V assembly instruction sequences, along with an LSTM-based predictor that evaluates the quality of these instructions. By dynamically interacting with the hardware, HFL efficiently explores complex instruction sequences with minimal fuzzing iterations, allowing it to uncover hard-to-detect vulnerabilities. We evaluated HFL on three RISC-V cores, and the results show that it achieves higher coverage using fewer than 1% of the test cases required by leading hardware fuzzers, effectively mitigating the issue of coverage saturation. Furthermore, HFL identified all known vulnerabilities in the tested systems and discovered four previously unknown high-severity issues, demonstrating its significant potential in improving hardware security assessments./proceedings-archive/2024/DATA/1551_pdf_upload.pdf |
||
REAP-NVM: RESILIENT ENDURANCE-AWARE NVM-BASED PUF AGAINST LEARNING-BASED ATTACKS Speaker: Hassan Nassar, Karlsruhe Institute of Technology, DE Authors: Hassan Nassar1, Ming-Liang Wei2, Chia-Lin Yang2, Joerg Henkel1 and Kuan-Hsun Chen3 1Karlsruhe Institute of Technology, DE; 2National Taiwan University, TW; 3University of Twente, NL Abstract NVM-based PUFs offer secure authentication and cryptographic applications by exploiting NVMs' MLC to generate diverse, ML-attack-resistant responses. Yet, frequent writes degrade these PUFs, lowering reliability and lifespan. This paper presents a model to assess endurance effects on NVM PUFs, guiding the creation of more robust PUFs. Our novel NVM PUF design enhances endurance by evenly distributing writes, thus mitigating cell stress, achieving a 62x improvement over current solutions while preserving security against learning-based attacks./proceedings-archive/2024/DATA/485_pdf_upload.pdf |
||
ACCELERATING OBLIVIOUS TRANSFER WITH A PIPELINED ARCHITECTURE Speaker: Xiaolin Li, Institute of Computing Technology, Chinese Academy of Sciences, CN Authors: Xiaolin Li1, Wei Yan1, Yong Zhang2, Hongwei Liu1, Qinfen Hao1, Yong Liu2 and Ninghui Sun1 1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Zhongguancun Laboratory, CN Abstract With the rapid development of machine learning and big data technologies, ensuring user privacy has become a pressing challenge. Secure multi-party computation offers a solution to this challenge by enabling privacy-preserving computations, but it also incurs significant performance overhead, thus limiting its further application. Our analysis reveals that the oblivious transfer protocol accounts for up to 96.64% of execution time. To address these challenges, we propose POTA, a high-performance pipelined OT hardware acceleration architecture supporting the silent OT protocol. Finally, we implement a POTA prototype on Xilinx VCU129 FPGAs. Experimental results demonstrate that under various network settings, POTA achieves significant speedups, with maximum improvements of 22.67× for OT efficiency and 192.57× for basic operations in MPC applications./proceedings-archive/2024/DATA/698_pdf_upload.pdf |
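The large-integer multiplication entry above builds on the Karatsuba algorithm, which replaces one of the four sub-products of a split multiplication with additions. The snippet below is a textbook Python sketch of that recursion only; the paper's contribution, mapping it onto a pipelined resistive CIM crossbar, is not reproduced here, and the `threshold_bits` cutoff is an arbitrary choice made for the sketch.

```python
def karatsuba(a: int, b: int, threshold_bits: int = 64) -> int:
    """Textbook Karatsuba multiplication of non-negative integers."""
    if a.bit_length() <= threshold_bits or b.bit_length() <= threshold_bits:
        return a * b                         # base case: native multiply
    m = max(a.bit_length(), b.bit_length()) // 2
    a_hi, a_lo = a >> m, a & ((1 << m) - 1)  # split a = a_hi*2^m + a_lo
    b_hi, b_lo = b >> m, b & ((1 << m) - 1)  # split b = b_hi*2^m + b_lo
    z2 = karatsuba(a_hi, b_hi, threshold_bits)
    z0 = karatsuba(a_lo, b_lo, threshold_bits)
    # The middle term reuses z2 and z0, so only three recursive multiplies are needed.
    z1 = karatsuba(a_hi + a_lo, b_hi + b_lo, threshold_bits) - z2 - z0
    return (z2 << (2 * m)) + (z1 << m) + z0

# Sanity check against Python's built-in big-integer multiplication.
a, b = 2**1000 + 12345, 2**997 + 67890
assert karatsuba(a, b) == a * b
```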
TS23 Session 18 - D11+A3
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 11:00 CET - 12:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
FPGA-BASED ACCELERATION OF MCMC ALGORITHM THROUGH SELF-SHRINKING FOR BIG DATA Speaker: Shuanglong Liu, Hunan Normal University, CN Authors: Shuanglong Liu, Shiyu Peng and Wan Shen, Hunan Normal University, CN Abstract Markov chain Monte Carlo (MCMC) algorithms are widely used in Bayesian inference to compute the posterior distribution of complex models, facilitating sampling from probability distributions. However, the computational burden of evaluating the likelihood function in MCMC poses significant challenges in big data applications. To address this, sub-sampling methods have been introduced to approximate the target distribution by using subsets of the data rather than the entire dataset. Unfortunately, these methods often lead to biased samples, making them impractical for real-world applications. This paper proposes a novel scaling MCMC method that achieves exact sampling by utilizing a subset (mini-batch) of the data with locally bounded approximations of the target distribution. Our method adaptively adjusts the mini-batch size by automatically tuning a hyperparameter based on the sample acceptance ratio, ensuring optimal balance between sample efficiency and computational cost. Moreover, we introduce a highly optimized hardware architecture to efficiently implement the proposed MCMC method onto FPGA. Our accelerator is evaluated on an AMD Zynq UltraScale+ FPGA device using a Bayesian logistic regression model on the MNIST dataset. The results demonstrate that our design achieves unbiased sampling with a 47.6 times speedup over the standard MCMC design, while also significantly reducing estimation errors compared to state-of-the-art MCMC methods./proceedings-archive/2024/DATA/12_pdf_upload.pdf |
||
ATE-GCN: AN FPGA-BASED GRAPH CONVOLUTIONAL NETWORK ACCELERATOR WITH ASYMMETRICAL TERNARY QUANTIZATION Speaker: Ruiqi Chen, Vrije Universiteit Brussel, BE Authors: Ruiqi Chen1, Jiayu Liu2, Shidi Tang3, Yang Liu4, Yanxiang Zhu5, Ming Ling3 and Bruno da Silva1 1Vrije Universiteit Brussel, BE; 2University College London, GB; 3Southeast University, CN; 4Fudan University, CN; 5VeriMake Innovation Laboratory, CN Abstract Ternary quantization can effectively simplify matrix multiplication, which is the primary computational operation in neural network models. It has shown success in FPGA-based accelerator designs for emerging models such as GAT and Transformer. However, existing ternary quantization methods can lead to substantial accuracy loss under certain weight distribution patterns, such as GCN. Furthermore, current FPGA-based ternary weight designs often focus on reducing resource consumption while neglecting full utilization of FPGA DSP blocks, limiting maximum performance. To address these challenges, we propose ATE-GCN, an FPGA-based asymmetrical ternary quantization GCN accelerator using a software-hardware co-optimization approach. First, we adopt an asymmetrical quantization strategy with specific interval divisions tailored to the bimodal distribution of GCN weights, reducing accuracy loss. Second, we design a unified processing element (PE) array on FPGA to support various matrix computation forms, optimizing FPGA resource usage while leveraging the benefits of cascade design and ternary quantization, significantly boosting performance. Finally, we implement the ATE-GCN prototype on the VCU118 FPGA board. The results show that ATE-GCN maintains an accuracy loss below 2%. Additionally, ATE-GCN achieves average performance improvements of 224.13× and 11.1×, with up to 898.82× and 69.9× energy consumption saving compared to CPU and GPU, respectively. Moreover, compared to state-of-the-art FPGA-based GCN accelerators, ATE-GCN improves DSP efficiency by 63% with an average latency reduction of 11%./proceedings-archive/2024/DATA/54_pdf_upload.pdf |
||
PREVV: ELIMINATING STORE QUEUE VIA PREMATURE VALUE VALIDATION FOR DATAFLOW CIRCUIT ON FPGA Speaker: Kuangjie Zou, Fudan University, CN Authors: Kuangjie Zou, Yifan Zhang, Zicheng Zhang, Guoyu Li, Jianli Chen, Kun Wang and Jun Yu, Fudan University, CN Abstract Dynamic scheduling in high-level synthesis (HLS) maximizes pipeline performance by enabling out-of-order scheduling of load and store requests at runtime. However, this method introduces unpredictable memory dependencies, leading to data disambiguation challenges. Load-store queues (LSQs), commonly used in superscalar CPUs, offer a potential solution for HLS. However, LSQs in dynamically scheduled HLS implementations often suffer from high resource overhead and scalability limitations. In this paper, we introduce PreVV, an architecture based on premature value validation designed to address memory disambiguation with minimal resource overhead. Our approach substitutes LSQ with several PreVV components and a straightforward premature queue. We prevent potential deadlocks by incorporating a specific tag that can send 'fake' tokens to prevent the accumulation of outdated data. Furthermore, we demonstrate that our design has scalability potential. We implement our design using several hardware templates and an LLVM pass to generate targeted dataflow circuits with PreVV. Experimental results on various benchmarks with data hazards show that, compared to state-of-the-art dynamic HLS, PreVV16 (a version with a premature queue depth of 16) reduces LUT usage by 43.91% and FF usage by 33.09%, with minimal impact on timing performance. Meanwhile, PreVV64 (a version with a premature queue depth of 64) reduces LUT usage by 27.21% and FF usage by 33.10%, without affecting timing performance./proceedings-archive/2024/DATA/601_pdf_upload.pdf |
||
PEARL: FPGA-BASED REINFORCEMENT LEARNING ACCELERATION WITH PIPELINED PARALLEL ENVIRONMENTS Speaker: Jiayi Li, Peking University, CN Authors: Jiayi Li, Hongxiao Zhao, Wenshuo Yue, Yihan Fu, Daijing Shi, Anjunyi Fan, Yuchao Yang and Bonan Yan, Peking University, CN Abstract Reinforcement learning (RL) is an effective machine learning approach that enables artificial intelligence agents to perform complex tasks and make decisions in dynamic situations. Training an RL agent demands its repetitive interaction with the environment to learn optimal policies. To efficiently collect training data, parallelizing environments is a widely used technique by enabling simultaneous interactions between multiple agents and environments. However, existing CPU-based RL software frameworks face a key challenge of slow multi-environmental update computation. To solve this problem, we present a novel FPGA-based RL accelerating framework--PEARL. PEARL instantiates multiple parallel environments and accelerates them with a carefully designed pipeline scheme to hide data transfer latency within the computation time. We evaluate PEARL on respective RL environments and achieve 4.36× to 972.6× speedup over the existing fastest software-based framework for parallel environment execution. When scaling the number of environments from 1024 to 43008 (42×) in CliffWalking benchmark, the power consumption increases marginally by 3%, while LUT and flip-flops utilization rise by 2.24× and 3.08×, respectively. This demonstrates efficient resource usage and power management in PEARL. Further, PEARL allows users to define and add their environments within the framework flexibly. We have established an open-source repository for users to utilize and expand. We also implement PEARL with the existing RL algorithm and achieve acceleration. It is available online at https://github.com/Selinaee/FPGA_Gym./proceedings-archive/2024/DATA/682_pdf_upload.pdf |
||
AISPGEMM: ACCELERATING IMBALANCED SPGEMM ON FPGAS WITH FLEXIBLE INTERCONNECT AND INTRA-ROW PARALLEL MERGING Speaker: Yuanfang Wang, Fudan University, CN Authors: Enhao Tang1, Shun Li2, Hao Zhou3, Guohao Dai3, Jun Lin4 and Kun Wang1 1Fudan University, CN; 2Southeast University, CN; 3Shanghai Jiao Tong University, CN; 4Nanjing University, CN Abstract The row-wise product algorithm shows significant potential for sparse matrix-matrix multiplication (SpGEMM) on hardware accelerators. Recent studies have made notable progress in accelerating SpGEMM using this algorithm. However, several challenges remain in accelerating imbalanced SpGEMM, where the distribution of non-zero elements across different rows is imbalanced. These challenges include: (1) the fixed dataflow of the merger tree, which leads to lower PE utilization, and (2) highly imbalanced data distributions, such as single rows with numerous non-zero elements, which result in intensive computations. This imbalance significantly challenges SpGEMM acceleration, leading to time-consuming processes that dominate overall computation time. In this paper, we propose AiSpGEMM to accelerate imbalanced SpGEMM on FPGAs. First, we improved the C2SR format to adapt it for imbalanced SpGEMM acceleration based on the row-wise product algorithm. This reduces off-chip memory bank conflicts and increases data reuse of matrix B. Secondly, we design a reconfigurable merger (R-merger) with flexible interconnects to improve PE utilization. Additionally, we propose an intra-row parallel merging algorithm and its corresponding hardware architecture, the parallel merger (P-merger), to accelerate intensive operations. Experimental results demonstrate that AiSpGEMM achieves a geometric mean (geomean) speedup of 5.8× compared to the state-of-the-art FPGA-based SpGEMM accelerator. In Geomean, AiSpGEMM achieves a 3.0× speedup and a 9.8× improvement in energy efficiency compared to the NVIDIA cuSPARSE library running on an NVIDIA A6000 GPU. Moreover, AiSpGEMM-21 demonstrated a 4× increase in average throughput compared to the same GPU./proceedings-archive/2024/DATA/773_pdf_upload.pdf |
||
FAMERS: AN FPGA ACCELERATOR FOR MEMORY-EFFICIENT EDGE-RENDERED 3D GAUSSIAN SPLATTING Speaker: Yuanfang Wang, Fudan University, CN Authors: Yuanfang Wang, Yu Li, Jianli Chen, Jun Yu and Kun Wang, Fudan University, CN Abstract This paper introduces FAMERS, a tile-based hardware accelerator designed for efficient 3D Gaussian Splatting (3DGS) inference on edge-deployed Field Programmable Gate Arrays (FPGAs). 3DGS has emerged as a powerful technique for photorealistic image rendering, leveraging anisotropic Gaussians to balance computational efficiency and visual fidelity. However, the high memory and processing demands of 3DGS pose significant challenges for real-time applications on resource-constrained edge devices. To address these limitations, we present a novel architecture that optimizes both computational and memory overheads through model pruning and compression techniques, enabling high-quality rendering within the constrained memory and processing capabilities of edge platforms. Experimental results demonstrate that our implementation on the Xilinx XC7K325T FPGA achieves a 1.99× speedup and 13.46× energy efficiency compared to NVIDIA RTX 3060M Laptop GPU, underscoring the viability of our approach for real-time applications in virtual and augmented reality./proceedings-archive/2024/DATA/957_pdf_upload.pdf |
||
SMARTMAP: ARCHITECTURE-AGNOSTIC CGRA MAPPING USING GRAPH TRAVERSAL AND REINFORCEMENT LEARNING Speaker: Ricardo Ferreira, Federal University of Viçosa, BR Authors: Fábio Ramos1, Pedro Realino1, Wagner Junior1, Alex Vieira2, Ricardo Ferreira1 and José Nacif1 1Federal University of Viçosa, BR; 2Federal University of Juiz de Fora, BR Abstract Coarse-Grained Reconfigurable Architectures (CGRAs) have been the subject of extensive research due to their balance between performance, energy efficiency, and flexibility. CGRAs must be capable of executing a dataflow graph (DFG), which depends on a compiler producing high-quality valid mappings with feasible running time and on the ability to port DFG mappings across different CGRA architectures. Machine learning-based compilers have shown promising results by presenting high quality and performance but offer limited portability. Moreover, some approaches do not explore efficient placement methods or do not demonstrate whether they scale to more challenging, less connected architectures. This paper presents SmartMap, an architecture-agnostic framework that uses an actor-critic reinforcement learning method applied to a Monte-Carlo Tree Search (MCTS) to learn how to map a DFG onto a CGRA. This framework offers full portability using a state-action representation layer in the policy network instead of a probability distribution over actions. SmartMap uses a graph traversal placement method to provide scalability and improve the efficiency of MCTS by enabling more efficient exploration during the search. Our results show that SmartMap has 2.81x more mapping capacity, a 16.82x speed-up in compilation time, and consumes fewer resources compared to the state-of-the-art./proceedings-archive/2024/DATA/1152_pdf_upload.pdf |
||
DATAFLOW OPTIMIZED RECONFIGURABLE ACCELERATION FOR FEM-BASED CFD SIMULATIONS Speaker: Aggelos Ferikoglou, National TU Athens, GR Authors: Anastassis Kapetanakis, Aggelos Ferikoglou, Georgios Anagnostopoulos and Sotirios Xydis, National TU Athens, GR Abstract Computational Fluid Dynamics (CFD) simulations are essential for analyzing and optimizing fluid flows in a wide range of real-world applications. These simulations involve approximating the solutions of the Navier-Stokes differential equations using numerical methods, which are highly compute- and memory-intensive due to their need for high-precision iterations. In this work, we introduce a high-performance FPGA accelerator specifically designed for numerically solving the Navier-Stokes equations. We focus on the Finite Element Method (FEM) due to its ability to accurately model complex geometries and intricate setups typical of real-world applications. Our accelerator is implemented using High-Level Synthesis (HLS) on an AMD Alveo U200 FPGA, leveraging the reconfigurability of FPGAs to offer a flexible and adaptable solution. The proposed solution achieves 7.9x higher performance than optimized Vitis-HLS implementations and 45% lower latency with 3.64x less power compared to a software implementation on a high-end server CPU. This highlights the potential of our approach to solve Navier-Stokes equations more effectively, paving the way for tackling even more challenging CFD simulations in the future./proceedings-archive/2024/DATA/1158_pdf_upload.pdf |
||
A RESOURCE-AWARE RESIDUAL-BASED GAUSSIAN BELIEF PROPAGATION ACCELERATOR TOOLFLOW Speaker: Omar Sharif, Imperial College London, GB Authors: Omar Sharif and Christos Bouganis, Imperial College London, GB Abstract Gaussian Belief Propagation (GBP) is a graphical method of statistical inference that provides an approximate solution to the probability distribution of a system. In recent years, GBP has emerged as a powerful computational framework with numerous applications in domains such as SLAM and image processing. In pursuit of high performance efficiency (i.e., inference per watt), streaming-based reconfigurable hardware solutions have demonstrated significant performance gains compared to leading-edge processors and high-power, server-grade CPUs. However, this class of architectures suffers from performance degradation at scale when on-chip memory is limited. This paper addresses this challenge by building on previous GBP architectural and algorithmic developments, introducing a novel hardware method that dynamically prioritizes node computations by monitoring information gain. By leveraging the inherent properties of the GBP algorithm, we demonstrate how convergence-driven optimizations can push the performance envelope of state-of-the-art reconfigurable accelerators despite on-chip memory constraints. The performance of our architecture is rigorously evaluated against this state of the art across both real-world and synthetic SLAM and image-denoising benchmarks. For equal resources, our work achieves a convergence rate improvement of up to 3.5x for large graphs, demonstrating its effectiveness for real-time inference tasks./proceedings-archive/2024/DATA/1285_pdf_upload.pdf |
||
UNIT: A HIGHLY UNIFIED AND MEMORY-EFFICIENT FPGA-BASED ACCELERATOR FOR TORUS FHE Speaker: Yuying ZHANG, Hong Kong University of Science and Technology, HK Authors: Yuying ZHANG1, Sharad Sinha2, Jiang Xu1 and Wei Zhang1 1Hong Kong University of Science and Technology, HK; 2Indian Institute of Technology (IIT) Goa, IN Abstract Fully Homomorphic Encryption (FHE) has emerged as a promising solution for the secure computation on encrypted data without leaking user privacy. Among various FHE schemes, Torus FHE (TFHE) distinguishes itself by its ability to perform exact computations on non-linear functions within the encrypted domain, satisfying the crucial requirement for privacy-preserving AI applications. However, the high computational overhead and strong data dependency in TFHE's bootstrapping process present significant challenges to its practical adoption and efficient hardware implementation. Existing TFHE accelerators on various hardware platforms still face limitations in terms of performance, flexibility, and area efficiency. In this work, we propose UNIT, a novel and highly unified accelerator for Programmable Bootstrapping (PBS) in TFHE, featuring carefully designed computation units. We introduce a unified architecture for negacyclic (inverse) number theoretic transform (I)NTT with fused twisting steps, which reduces computing resources by 33% and the memory utilization of pre-stored factors by nearly 66%. Another key feature of UNIT is the innovative design of the monomial number theoretic transform unit, called OF-MNTT, which leverages on-the-fly twiddle factor generation to eliminate memory traffic and overhead. This memory-efficient and highly parallelizable approach for MNTT is proposed for the first time in TFHE acceleration. Furthermore, UNIT is highly reconfigurable and scalable, supporting various parameter sets and performance-resource requirements. Our proposed accelerator is evaluated on the Xilinx Alveo U250 FPGA platform. Experimental results demonstrate its superior performance compared to the state-of-the-art GPU and FPGA-based implementations with the improvement of 8.3x and 3.63x, respectively. In comparison with the most advanced FPGA implementation, UNIT achieves 30% enhanced area efficiency and 3.2x reduced power with much better flexibility./proceedings-archive/2024/DATA/139_pdf_upload.pdf |
||
RGHT-Q: RECONFIGURABLE GEMM UNIT FOR HETEROGENEOUS-HOMOGENEOUS TENSOR QUANTIZATION Speaker: Seungho Lee, Sungkyunkwan University, KR Authors: Seungho Lee, Donghyun Nam and Jeongwoo Park, Sungkyunkwan University, KR Abstract The high computational demands of large language models (LLMs) are limited by the lack of GPU hardware support for heterogeneous quantization, which mixes integers and floating points. To address this limitation, we propose an LLM processing element (PE), RGHT-Q, which features reconfigurable general-matrix multiplication (GEMM) for both heterogeneous and homogeneous tensor quantization. The RGHT-Q introduces a novel design that leverages butterfly routing and multi-precision multipliers. As a result, we achieve significant performance improvements, offering 3.14× higher energy efficiency, and 1.56× better area efficiency compared to prior designs./proceedings-archive/2024/DATA/1116_pdf_upload.pdf |
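Several entries in this session, AiSpGEMM in particular, build on the row-wise-product (Gustavson) formulation of sparse matrix-matrix multiplication. The following Python sketch shows that baseline dataflow on dictionary-of-dictionaries matrices; it is a reference illustration only and does not reproduce the C2SR format, the reconfigurable merger, or the intra-row parallel merging described in the abstract.

```python
from collections import defaultdict

def spgemm_row_wise(A, B):
    """Row-wise-product SpGEMM (Gustavson's algorithm) in plain Python.

    A and B are sparse matrices stored as {row: {col: value}} dictionaries.
    For each nonzero A[i][k], row k of B is scaled and merged into output row i.
    """
    C = {}
    for i, row_a in A.items():
        acc = defaultdict(float)            # accumulator for output row i
        for k, a_ik in row_a.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] += a_ik * b_kj       # scale row k of B and merge
        if acc:
            C[i] = dict(acc)
    return C

# Toy usage.
A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0, 3: 7.0}}
print(spgemm_row_wise(A, B))   # {0: {1: 16.0, 3: 14.0}, 1: {0: 15.0}}
```

Rows of A with many nonzeros trigger correspondingly many merge steps into the accumulator, which is exactly the row-level imbalance such accelerators have to manage.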
TS24 Session 20 - D12
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 14:00 CET - 15:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
HYBRID EXACT AND HEURISTIC EFFICIENT TRANSISTOR NETWORK OPTIMIZATION FOR MULTI-OUTPUT LOGIC Speaker: Lang Feng, National Sun Yat-Sen University, TW Authors: Lang Feng1, Rongjian Liang2 and Hongxin Kong3 1National Sun Yat-Sen University, TW; 2NVIDIA Corp., US; 3Texas A&M University, US Abstract With the approaching post-Moore era, it is becoming increasingly impractical to decrease the transistor size in digital VLSI for better performance. To address this issue, one approach is to optimize the digital circuit at the transistor level to reduce the transistor count. Although previous works have explored ways to conduct transistor network optimization, most of these efforts have focused on single-output networks or applied heuristics only, limiting their scope or optimization quality. In this paper, we propose an exact transistor network optimization algorithm that supports multi-output logic and is formulated as a SAT problem. Our approach maintains a high optimization level by employing the exact algorithm, while also incorporating a hybrid process that uses a heuristic algorithm to predict the solution range as a guidance for better efficiency. Experimental results show that the proposed algorithm has a 5.32% better optimization level given 54% less runtime compared with the state-of-the-art work./proceedings-archive/2024/DATA/152_pdf_upload.pdf |
||
MAXIMUM FANOUT-FREE WINDOW ENUMERATION: TOWARDS MULTI-OUTPUT SUB-STRUCTURE SYNTHESIS Speaker: Ruofei TANG, Hong Kong Baptist University, CN Authors: Ruofei Tang1, Xuliang Zhu2, Xing Li3, Lei Chen4, Xin Huang1, Mingxuan Yuan4 and Jianliang Xu5 1Hong Kong Baptist University, HK; 2Antai College of Economics and Management, Shanghai Jiaotong University, CN; 3Huawei Noah's Ark Lab, CN; 4Huawei Noah's Ark Lab, HK; 5HKBU, CN Abstract Peephole optimization is commonly used in And-Inverter Graphs (AIGs) optimization algorithms. The efficiency of these algorithms heavily relies on the enumeration process of sub-structures. One common sub-structure is the cut, known for its efficient enumeration method and single-output characteristic. However, an increasing number of optimization algorithms now target sub-structures that incorporate multiple outputs. In this paper, we explore Maximum Fanout-Free Windows (MFFWs), a novel sub-structure with a multi-output nature, as well as its practical applications and enumeration algorithms. To accommodate various algorithm execution processes, we propose two different enumeration styles: Dynamic and Static. The Dynamic approach provides flexibility in adapting to changes in the AIG structure, whereas the Static method ensures efficiency as long as the AIG structure remains unchanged during execution. We apply these methods to rewriting and technology mapping to improve their runtime performance. Experimental results on pure enumeration and practical scenarios show the scalability and efficiency of the proposed MFFW enumeration methods./proceedings-archive/2024/DATA/259_pdf_upload.pdf |
||
SIMGEN: SIMULATION PATTERN GENERATION FOR EFFICIENT EQUIVALENCE CHECKING Speaker: Carmine Rizzi, ETH Zurich, CH Authors: Carmine Rizzi1, Sarah Brunner1, Alan Mishchenko2 and Lana Josipovic1 1ETH Zurich, CH; 2University of California, Berkeley, US Abstract Combinational equivalence checking for hardware design tends to be slow due to the number and complexity of intermediate node equivalences considered by the SAT solver. This is because the solver often spends extensive time disproving nodes that appear equivalent under random simulation. We propose SimGen, an open-source and expressive simulation pattern generator inspired by Automatic Test Pattern Generation (ATPG); it exploits the circuit's structure and logic information to disprove the equivalence of circuit nodes and avoid excessive SAT calls. We demonstrate the effectiveness of SimGen's simulation patterns over those generated by state-of-the-art random and guided simulation./proceedings-archive/2024/DATA/335_pdf_upload.pdf |
||
ELMAP: AREA-DRIVEN LUT MAPPING WITH K-LUT NETWORK EXACT SYNTHESIS Speaker: Hongyang Pan, Fudan University, CN Authors: Hongyang Pan1, Keren Zhu1, Fan Yang1, Zhufei Chu2 and Xuan Zeng1 1Fudan University, CN; 2Ningbo University, CN Abstract Mapping to k-input lookup tables (k-LUTs) is a critical process in field-programmable gate array (FPGA) synthesis. However, the structure of the subject graph can introduce structural bias, which refers to the dependency of mapping results on the inherent graph structure, often leading to suboptimal results. To address this, we present ELMap, an area-driven LUT mapping framework. It incorporates structural choice during the collapsing phase. This enables dynamic decomposition, maximizing local-to-global optimization transfer. To ensure seamless integration between the optimization and mapping processes, ELMap leverages exact k-LUT synthesis to generate area-optimal sub-LUT networks. Experiments on the EPFL benchmark suite demonstrate that ELMap significantly outperforms state-of-the-art methods. Specifically, in 6-LUT mapping, ELMap reduces the average LUT area by 8.5% and improves the area-depth product (ADP) by 5.8%. In 4-LUT remapping, it reduces the average LUT area by 17.6% and improves the ADP by 2.4%./proceedings-archive/2024/DATA/377_pdf_upload.pdf |
||
APPLICATION OF FORMAL METHODS (SAT/SMT) TO THE DESIGN OF CONSTRAINED CODES Speaker: Sunil Sudhakaran, Student, US Authors: Sunil Sudhakaran1, Clark Barrett2 and Mark Horowitz2 1Student, US; 2Stanford University, US Abstract Constrained coding plays a crucial role in high-speed communication links by restricting bit sequences to reduce the adverse effects imposed by the characteristics of the channel. This technique trades off some bit efficiency for higher transmission rates, thereby boosting overall data throughput. We show how the design of hardware-efficient translation logic to and from the restricted code space can be formulated as a Satisfiability Modulo Theories (SMT) problem. Using SMT, we can not only try to minimize the complexity of this logic and limit the effect of transmission errors on the final decoded output, but also significantly reduce development time—from weeks to just hours. Our initial results demonstrate the efficiency and effectiveness of this approach./proceedings-archive/2024/DATA/592_pdf_upload.pdf |
||
WIDEGATE: BEYOND DIRECTED ACYCLIC GRAPH LEARNING IN SUBCIRCUIT BOUNDARY PREDICTION Speaker: Jiawei Liu, Beijing University of Posts and Telecommunications, CN Authors: Jiawei Liu1, Zhiyan Liu1, Xun He1, Jianwang Zhai1, Zhengyuan Shi2, Qiang Xu2, Bei Yu2 and Chuan Shi1 1Beijing University of Posts and Telecommunications, CN; 2The Chinese University of Hong Kong, HK Abstract Subcircuit boundary prediction is an important application of machine learning in logical analysis, effectively supporting tasks such as functional verification and logic optimization. Existing methods often convert circuits into and-inverter graphs and then use directed acyclic graph neural networks to perform this task. However, two key characteristics of subcircuit boundary prediction do not align with the fundamental assumptions of DAG learning, which limits the model's expressiveness and generalization capabilities. To break these assumptions, we propose WideGate, which includes a receptive field generation module that extends beyond the fanin cone and fanout cone, as well as an adaptive aggregation module that focuses on boundaries. Extensive experiments show that WideGate significantly outperforms existing methods in terms of prediction accuracy and training efficiency for subcircuit boundary prediction. The code is available at https://github.com/BUPT-GAMMA/WideGate./proceedings-archive/2024/DATA/673_pdf_upload.pdf |
||
BIAS BY DESIGN: DIVERSITY QUANTIFICATION TO MITIGATE STRUCTURAL BIAS EFFECTS IN AIG LOGIC OPTIMIZATION Speaker: Isabella Venancia Gardner, Universiteit van Amsterdam, NL Authors: Isabella Venancia Gardner1, Marcel Walter2, Yukio Miyasaka3, Robert Wille2 and Michael Cochez4 1Universiteit van Amsterdam, NL; 2TU Munich, DE; 3University of California, Berkeley, US; 4Vrije Universiteit Amsterdam, NL Abstract And-Inverter Graphs (AIGs) are a fundamental data structure in logic optimization and are widely used in modern electronic design automation. A persistent challenge in AIG optimization is structural bias, where the initial graph structure significantly influences optimization quality by restricting the search space, often resulting in suboptimal outcomes. Existing methods address this issue by running multiple optimization workflows in parallel, relying on a trial-and-error approach that lacks a systematic way to measure structural diversity or assess effectiveness, making them computationally expensive and inefficient. This paper introduces a novel framework for systematically evaluating and reducing structural bias by measuring structural diversity, defined as the degree of dissimilarity between AIG graphs. Several traditional graph similarity measures and newly proposed AIG-specific metrics, including the Rewrite, Refactor, and Resub Scores, are explored. Results reveal limitations in traditional graph similarity metrics and highlight the effectiveness of the proposed AIG-specific measures in quantifying structural dissimilarity. Notably, the RRR Score shows a strong correlation (Pearson correlation coefficient, r = 0.79) with post-optimization structural differences, demonstrating the reliability of the metric in capturing meaningful variations between AIG structures. This work addresses the challenge of quantifying structural bias and offers a methodology that could potentially improve optimization outcomes, with future extensions applicable to other types of logic graphs./proceedings-archive/2024/DATA/851_pdf_upload.pdf |
||
TIMING-DRIVEN APPROXIMATE LOGIC SYNTHESIS BASED ON DOUBLE-CHASE GREY WOLF OPTIMIZER Speaker: Xiangfei Hu, Southeast University, CN Authors: Xiangfei Hu1, Yuyang Ye2, Tinghuan Chen2, Hao Yan1 and Bei Yu3 1Southeast University, CN; 2The Chinese University of Hong Kong, CN; 3The Chinese University of Hong Kong, HK Abstract With the shrinking technology nodes, timing optimization becomes increasingly challenging. Approximate logic synthesis (ALS) can perform local approximate changes (LACs) on circuits to optimize timing with the cost of slight inaccuracy. However, existing ALS methods that focus solely on critical path depth reduction or area minimization are not optimal in timing optimization. This paper proposes an effective timing-driven ALS framework, where we employ a double-chase grey wolf optimizer to explore and apply LACs, simultaneously bringing excellent critical path shortening and area reduction under error constraints. Subsequently, it utilizes post-optimization under area constraints to convert area reduction into further timing improvement, thus achieving maximum critical path delay reduction. According to experiments on open-source circuits with 28nm technology, compared to the SOTA method, our framework can generate approximate circuits with greater critical path delay reduction under different error and area constraints./proceedings-archive/2024/DATA/996_pdf_upload.pdf |
||
IRW: AN INTELLIGENT REWRITING Speaker: Haisheng Zheng, Shanghai Artificial Intelligence Laboratory, CN Authors: Haisheng Zheng1, Haoyuan WU2, Zhuolun He2, Yuzhe Ma3 and Bei Yu2 1Shanghai AI Laboratory, CN; 2The Chinese University of Hong Kong, HK; 3Hong Kong University of Science and Technology, HK Abstract This paper proposes a novel machine learning-driven rewriting algorithm to optimize And-Inverter Graphs (AIGs) for refining combinational logic prior to technology mapping. The algorithm, called iRw, iteratively extracts subcircuits in AIGs and replaces them with more streamlined implementations. These subcircuits are identified using an original extraction algorithm, while the compact implementations are produced through rewriting techniques guided by a machine learning model. This approach efficiently enables the generation of logically equivalent subcircuits with minimal overhead. Experiments on benchmark circuits indicate that the proposed methodology outperforms state-of-the-art AIG rewriting techniques in both quality and runtime./proceedings-archive/2024/DATA/24_pdf_upload.pdf |
||
AUTOMATIC ROUTING FOR PHOTONIC INTEGRATED CIRCUITS UNDER DELAY MATCHING CONSTRAINTS Speaker: Yuchao Wu, Hong Kong University of Science and Technology, HK Authors: Yuchao Wu1, Weilong Guan1, Yeyu Tong2 and Yuzhe Ma1 1Hong Kong University of Science and Technology, HK; 2The Hong Kong University of Science and Technology (Guangzhou), CN Abstract Optical interconnects have emerged as a promising solution for rack-, board-scale, and even in-package communications, thanks to their high available optical bandwidth and minimal latency. However, the optical waveguides are intrinsically different from traditional metal wires, especially the phase matching constraints, which impose new challenges for routing in the photonic integrated circuits design. In this paper, we propose a comprehensive and efficient optical routing framework that introduces a diffuse-based length-matching method and bend modification methods to ensure phase-matching constraints. Furthermore, we present a congestion-based A* formulation with a negotiated congestion-based rip-up and reroute strategy on new rectangular grids with an aspect ratio of 1:√3 to reduce insertion loss. Experimental results based on real photonic integrated designs show that our optical routing flow can reduce total insertion loss by 11% and maximum insertion loss by 108%, while effectively satisfying matching constraints, compared to manual results./proceedings-archive/2024/DATA/743_pdf_upload.pdf |
||
ML-BASED AIG TIMING PREDICTION TO ENHANCE LOGIC OPTIMIZATION Speaker: Sachin Sapatnekar, University of Minnesota, US Authors: Wenjing Jiang1, Jin Yan2 and Sachin S. Sapatnekar1 1University of Minnesota, US; 2Google, US Abstract Traditional logic optimization relies on proxy metrics to approximate post-mapping performance and area, which may not correlate well with post-mapping delay and area. This paper explores a ground-truth-based optimization flow that directly incorporates the post-mapping delay and area during optimization using decision tree-based machine learning models. Results show high prediction accuracy and generalization to unseen designs./proceedings-archive/2024/DATA/1303_pdf_upload.pdf |
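The decision-tree idea in the last entry above ("ML-BASED AIG TIMING PREDICTION…") can be illustrated with a minimal sketch. Everything below — the feature names, training values, and model settings — is invented for illustration and is not taken from the paper; the only assumption carried over is that a tree-based regressor maps cheap AIG statistics to post-mapping delay.

```python
# Illustrative sketch only: a tree-based regressor that maps simple AIG
# statistics to post-mapping delay. Feature names and data are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training rows: [aig_node_count, logic_depth, max_fanout,
# and_gate_ratio]; targets are post-mapping critical-path delays in ns.
X_train = np.array([
    [1200, 18, 14, 0.82],
    [3400, 25, 22, 0.79],
    [560,  12,  9, 0.88],
    [2100, 21, 17, 0.80],
])
y_train = np.array([0.94, 1.31, 0.62, 1.12])

model = DecisionTreeRegressor(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# A candidate AIG produced during optimization can be scored without mapping.
candidate = np.array([[1800, 20, 15, 0.81]])
print(f"predicted post-mapping delay: {model.predict(candidate)[0]:.3f} ns")
```

In a real flow, such a model would be queried inside the logic-optimization loop to rank candidate AIG transformations without running technology mapping for each one.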
TS25 Session 27 - A5+D1+D2+D8+DT4+DT6
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 14:00 CET - 15:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
HYATTEN: HYBRID PHOTONIC-DIGITAL ARCHITECTURE FOR ACCELERATING ATTENTION MECHANISM Speaker: Huize Li, National University of Singapore, SG Authors: Huize Li, Dan Chen and Tulika Mitra, National University of Singapore, SG Abstract The wide adoption and substantial computational resource requirements of attention-based Transformers have spurred the demand for efficient hardware accelerators. Unlike digital-based accelerators, there is growing interest in exploring photonics due to its high energy efficiency and ultra-fast processing speeds. However, the significant signal conversion overhead limits the performance of photonic-based accelerators. In this work, we propose HyAtten, a photonic-based attention accelerator with minimized signal conversion overhead. HyAtten incorporates a signal comparator to classify signals into two categories based on whether they can be processed by low-resolution converters. HyAtten integrates low-resolution converters to process all low-resolution signals, thereby boosting the parallelism of photonic computing. For signals requiring high-resolution conversion, HyAtten uses digital circuits instead of signal converters to reduce area and latency overhead. Compared to the state-of-the-art photonic-based Transformer accelerator, HyAtten achieves 9.8× performance/area and 2.2× energy-efficiency/area improvement./proceedings-archive/2024/DATA/87_pdf_upload.pdf |
||
SEGA-DCIM: DESIGN SPACE EXPLORATION-GUIDED AUTOMATIC DIGITAL CIM COMPILER WITH MULTIPLE PRECISION SUPPORT Speaker: Haikang Diao, Peking University, CN Authors: Haikang Diao, Haoyi Zhang, Jiahao Song, Haoyang Luo, Yibo Lin, Runsheng Wang, Yuan Wang and Xiyuan Tang, Peking University, CN Abstract Digital computing-in-memory (DCIM) has been a popular solution for addressing the memory wall problem in recent years. However, the DCIM design still heavily relies on manual efforts, and the optimization of DCIM is often based on human experience. These disadvantages lengthen the time to market while increasing the design difficulty of DCIMs. This work proposes a design space exploration-guided automatic DCIM compiler (SEGA-DCIM) with multiple precision support, including integer and floating-point data precision operations. SEGA-DCIM can automatically generate netlists and layouts of DCIM designs by leveraging a template-based method. With a multi-objective genetic algorithm (MOGA)-based design space explorer, SEGA-DCIM can easily select appropriate DCIM designs for a specific application considering the trade-offs among area, power, and delay. As demonstrated by the experimental results, SEGA-DCIM offers solutions with a wide design space, including integer and floating-point precision designs, while maintaining competitive performance compared to state-of-the-art (SOTA) DCIMs./proceedings-archive/2024/DATA/254_pdf_upload.pdf |
||
SOFTMAP: SOFTWARE-HARDWARE CO-DESIGN FOR INTEGER-ONLY SOFTMAX ON ASSOCIATIVE PROCESSORS Speaker: Mariam Rakka, University of California, Irvine, US Authors: Mariam Rakka1, Jinhao Li2, Guohao Dai3, Ahmed Eltawil4, Mohammed Fouda5 and Fadi Kurdahi1 1University of California, Irvine, US; 2Shanghai Jiao Tong University, CN; 3Qingyuan Research Institute, Shanghai Jiao Tong University, CN; 4King Abdullah University of Science and Technology, SA; 5Rain AI, US Abstract Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance./proceedings-archive/2024/DATA/589_pdf_upload.pdf |
||
COMPREHENSIVE RISC-V FLOATING-POINT VERIFICATION: EFFICIENT COVERAGE MODELS AND CONSTRAINT-BASED TEST GENERATION Speaker: Tianyao Lu, College of Information Science and Electronic Engineering, Zhejiang University, CN Authors: Tianyao Lu, Anlin Liu, Bingjie Xia and Peng Liu, Zhejiang University, CN Abstract The increasing complexity of processor architectures necessitates more rigorous functional verification. Floating-point operations, in particular, present significant challenges due to their extensive range of computational cases that require verification. This paper proposes a comprehensive approach for generating floating-point instruction sequences to enhance the verification of RISC-V. We introduce a constraint-based method for floating-point test generation and design efficient coverage models as input constraints for this process. The resulting representative floating-point tests are integrated with RISC-V instruction sequence generation through a memory-bound register update method. Experimental results demonstrate that our approach improves the functional coverage of RISC-V floating-point instruction sequences from 93.32% to 98.34%, while simultaneously reducing the number of required instructions by 66.67% compared to the Google RISCV-DV generator. Additionally, our method achieves more comprehensive coverage of floating-point types in instruction write-back data compared to RISCV-DV. Using the proposed approach, we successfully detect representative floating-point-related faults injected into the RISC-V processor CV32E40P, thereby demonstrating its effectiveness./proceedings-archive/2024/DATA/1141_pdf_upload.pdf |
||
WINACC: WINDOW-BASED ACCELERATION OF NEURAL NETWORKS USING BLOCK FLOATING POINT Speaker: Xin Ju, National University of Defense Technology, CN Authors: Xin Ju, Jun He, Mei Wen, Jing Feng, Yasong Cao, Junzhong Shen, Zhaoyun Chen and Yang Shi, National University of Defense Technology, CN Abstract Deep Neural Networks (DNNs) impose significant computational demands, necessitating optimizations for computational and energy efficiencies. Per-vector scaling, which applies a scaling factor to blocks of elements using narrow integer types, effectively reduces storage and computational overhead. However, the frequent occurrence of floating-point accumulations between vectors limits further improvements in energy efficiency. State-of-the-art accelerators address this challenge by grouping and summing vector products based on their exponent differences, thereby reducing the overhead associated with intra-group shifting and accumulation. Nevertheless, this approach increases the complexity of register usage and grouping logic, leading to limited energy benefits and hardware efficiency. In this context, we introduce WinAcc, a novel algorithm and architecture co-designed solution that utilizes a low-cost accumulator to handle the majority of data in DNNs, offering low area overhead and high energy efficiency gains. Our key insight is that the data of DNNs follows a Laplace-like distribution, which enables the use of a customized data format with a narrow dynamic range to encode most of the data. This allows for the design of a low-cost accumulator with narrow shifters and adders, significantly reducing reliance on floating-point accumulator and consequently improving energy efficiency. Compared with state-of-the-art architecture Bucket, WinAcc achieves 33.95% energy reduction across seven representative DNNs and reduces area by 9.5% while maintaining superior model performance./proceedings-archive/2024/DATA/675_pdf_upload.pdf |
||
SACPLACE: MULTI-AGENT DEEP REINFORCEMENT LEARNING FOR SYMMETRY-AWARE ANALOG CIRCUIT PLACEMENT Speaker: Lei Cai, Wuhan University of Technology, CN Authors: Lei Cai1, Guojing Ge2, Guibo Zhu2, Jixin Zhang3, Jinqiao Wang2, Bowen Jia1 and Ning Xu1 1Wuhan University of Technology, CN; 2Institute of Automation, Chinese Academy of Sciences, CN; 3Hubei University of Technology, CN Abstract The placement of analog Integrated Circuits (ICs) plays a critical role in their physical design. The objective is to minimize the Half-Perimeter Wire Length (HPWL) while satisfying complex analog IC constraints, such as symmetry. Unlike digital ICs, analog ICs are highly sensitive to parasitic effects, making device symmetry crucial for optimal circuit performance. However, existing methods, including both machine learning-based and analytical approaches, struggle to meet strict symmetry constraints. In machine learning-based methods, training a general model is challenging due to the limited diversity of the training data. In analytical methods, the difficulty lies in formulating symmetry constraints as a convex function, which is necessary for gradient-based optimization of the placement. To address this issue, we formulate the placement process as a Markov decision process and propose SACPlace, a multi-agent deep reinforcement learning method for Symmetry-Aware analog Circuit Placement. SACPlace initially extracts layout information and various constraints as the input information for placement refinement and evaluation. Subsequently, SACPlace constructs multi-agent policy networks for symmetry-aware placement by refining placement guided by the evaluation of optimal symmetry quality. Following this, SACPlace constructs multi-layer perceptron-based critic networks to embed placement information for evaluating symmetry quality. This evaluation reward will be used for guiding placement refinement. Experimental results from four public analog IC datasets demonstrate that our method achieves the lowest HPWL while fully satisfying symmetry and common constraints, outperforming state-of-the-art methods. Additionally, simulation results on real-world analog ICs show better performance than these methods and even manual designs./proceedings-archive/2024/DATA/267_pdf_upload.pdf |
||
LINEARIZATION OF QUADRATURE DIGITAL POWER AMPLIFIERS BY NEURAL NETWORK OF ULR_LSTM: UNSUPERVISED LEARNING RESIDUAL LSTM Speaker: Jiayu Yang, State Key Laboratory of Integrated Chips and Systems, School of Microelectronics, Fudan University, Shanghai, China, CN Authors: Jiayu Yang, Luyi Guo, Yicheng Li, Wang Wang, Zixu Li, Manni Li, Zijian Huang, Yinyin Lin, Yun Yin and Hongtao Xu, Fudan University, CN Abstract For the first time, this paper presents an unsupervised learning residual long short-term memory (ULR_LSTM) neural network to develop a digital predistortion (DPD) method for the linearization of digital power amplifiers (DPAs). Our method eliminates the need for iterative learning control (ILC) to obtain the ideal input of the DPA required by state-of-the-art (SOTA) methods, which leads to high computational complexity and extensive training time. We perform behavioral modeling of the DPA using the R_LSTM network. After determining the optimal behavioral model architecture, the corresponding DPD model is obtained through an inverse training process. A 15-bit transformer-based quadrature DPA chip incorporating Class-G and IQ-cell-sharing techniques was implemented in a 28nm CMOS process to validate our proposed method. Experimental results demonstrate outstanding linearization performance compared to prior art, achieving an error vector magnitude (EVM) of -40.4dB for the 802.11ax 40MHz 64QAM signal./proceedings-archive/2024/DATA/689_pdf_upload.pdf |
||
COMPATIBILITY GRAPH ASSISTED AUTOMATIC HARDWARE TROJAN INSERTION FRAMEWORK Speaker: Anjum Riaz, IIT Jammu, IN Authors: Gaurav Kumar, Ashfaq Shaik, Anjum Riaz, Yamuna Prasad and Satyadev Ahlawat, IIT Jammu, IN Abstract Hardware Trojans (HTs) pose substantial security threats to Integrated Circuits (ICs), compromising their integrity, confidentiality, and functionality. Various HT detection methods have been developed to mitigate these risks. However, the limited availability of comprehensive HT benchmarks necessitates designers to create their own for evaluation purposes. Moreover, the existing benchmarks exhibit several deficiencies, including a restricted range of trigger nodes, susceptibility to detection through random patterns, lengthy HT instance creation and validation process, and a limited number of HT instances per circuit. To address these limitations, we propose a Compatibility Graph assisted automatic Hardware Trojan insertion framework for HT benchmark generation. Given a netlist, this framework generates a design incorporating single or multiple HT instances according to user-defined properties. It allows various configurations of HTs, such as a large number of trigger nodes, low activation probability and large number of unique HT instances. The experimental results demonstrate that the generated HT benchmarks exhibit exceptional resistance to state-of-the-art HT detection schemes. Additionally, the proposed framework achieves an average improvement of 37815.7x and 989.4x over the insertion times of the Random and Reinforcement Learning based HT insertion frameworks, respectively./proceedings-archive/2024/DATA/1525_pdf_upload.pdf |
||
TOWARDS ROBUST RRAM-BASED VISION TRANSFORMER MODELS WITH NOISE-AWARE KNOWLEDGE DISTILLATION Speaker: Wenyong Zhou, University of Hong Kong, HK Authors: Wenyong Zhou, Taiqiang Wu, Chenchen Ding, Yuan Ren, Zhengwu Liu and Ngai Wong, University of Hong Kong, HK Abstract Resistive random-access memory (RRAM)-based compute-in-memory (CIM) systems show promise in accelerating Transformer-based vision models but face challenges from inherent device non-idealities. In this work, we systematically investigate the vulnerability of Transformer-based vision models to RRAM-induced perturbations. Our analysis reveals that earlier Transformer layers are more vulnerable than later ones, and feed-forward networks (FFNs) are more susceptible to noise than multi-head self-attention (MHSA). Based on these observations, we propose a noise-aware knowledge distillation framework that enhances model robustness by aligning both intermediate features and final outputs between weight-perturbed and noise-free models. Experimental results demonstrate that our method improves accuracy by up to 1.54% and 1.49% on ViT and DeiT models under various noise conditions compared to their vanilla counterparts./proceedings-archive/2024/DATA/1103_pdf_upload.pdf |
||
HYIMC: ANALOG-DIGITAL HYBRID IN-MEMORY COMPUTING SOC FOR HIGH-QUALITY LOW-LATENCY SPEECH ENHANCEMENT Speaker: Wanru Mao, Beihang University, CN Authors: Wanru Mao1, Hanjie Liu2, Guangyao Wang1, Tianshuo Bai1, Jingcheng Gu1, Han Zhang1, Xitong Yang3, Aifei Zhang3, Xiaohang Wei3, Meng Wang3 and Wang Kang1 1Beihang University, CN; 2Beihang university, CN; 3Zhicun Research Lab, CN Abstract In-memory computing (IMC) holds significant promise for accelerating deep learning-based speech enhancement (DL-SE). However, existing IMC architectures face challenges in simultaneously achieving high precision, energy efficiency, and the necessary parallelism for DL-SE's inherent temporal dependencies. This paper introduces HyIMC, a novel hybrid analog-digital IMC architecture designed to address these limitations. HyIMC features: 1) a hybrid analog-digital design optimized for DL-SE algorithms; 2) a schedule controller that efficiently manages recurrent dataflow within skip connections; and 3) non-key dimension shrinkage, a model compression technique that preserves accuracy. Implemented on a 40nm eFlash-based IMC SoC prototype, HyIMC achieves 160 TOPS/W energy efficiency, compresses the DL-SE model size by ∼600%, improves the feature of merit by ∼1200%, and enhances perceptual evaluation of speech quality by ∼120%./proceedings-archive/2024/DATA/252_pdf_upload.pdf |
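As a rough illustration of the per-vector (block) scaling that the WINACC entry above builds on, the sketch below quantizes a tensor in blocks that share one scale factor and are stored as narrow integers. This is a generic textbook formulation with an assumed block size and bit width, not WinAcc's data format or accumulator design.

```python
# Illustrative sketch only: per-vector ("block") scaling with a shared scale
# per block and narrow-integer storage. Block size and bit width are assumed.
import numpy as np

def block_quantize(x, block_size=16, bits=4):
    """Quantize a 1-D array using one shared scale factor per block."""
    n_blocks = int(np.ceil(x.size / block_size))
    padded = np.zeros(n_blocks * block_size, dtype=np.float64)
    padded[:x.size] = x
    blocks = padded.reshape(n_blocks, block_size)
    qmax = 2 ** (bits - 1) - 1
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-12) / qmax
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def block_dequantize(q, scales, size):
    """Reconstruct the original array from integer blocks and their scales."""
    return (q * scales).reshape(-1)[:size]

x = np.random.randn(40)
q, s = block_quantize(x)
x_hat = block_dequantize(q, s, x.size)
print("mean absolute quantization error:", np.mean(np.abs(x - x_hat)))
```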
TS26 Session 12 - DT4
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 14:00 CET - 15:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
INTO-OA: INTERPRETABLE TOPOLOGY OPTIMIZATION FOR OPERATIONAL AMPLIFIERS Speaker: Jinyi Shen, Fudan University, CN Authors: Jinyi Shen, Fan Yang, Li Shang, Zhaori Bi, Changhao Yan, Dian Zhou and Xuan Zeng, Fudan University, CN Abstract This paper presents INTO-OA, an interpretable topology optimization method for operational amplifiers (op-amps). We propose a Bayesian optimization-based approach to effectively explore the high-dimensional, discrete topology design space of op-amps. Our method integrates a Gaussian process surrogate model with the Weisfeiler-Lehman graph kernel to extract structural features from a dedicated circuit graph representation. It also employs a candidate generation strategy that combines random sampling with mutation to balance global exploration and local exploitation. Additionally, INTO-OA enhances interpretability by assessing the impact of circuit structures on performance, providing designers with valuable insights into generated topologies and enabling the interpretable refinement of existing designs. Experimental results demonstrate that INTO-OA achieves higher success rates, a 1.84× to 19.10× improvement in op-amp performance, and a 3.20× to 14.33× increase in topology optimization efficiency compared to state-of-the-art methods./proceedings-archive/2024/DATA/235_pdf_upload.pdf |
||
EFFECTIVE ANALOG ICS FLOORPLANNING WITH RELATIONAL GRAPH NEURAL NETWORKS AND REINFORCEMENT LEARNING Speaker: Davide Basso, University of Trieste, IT Authors: Davide Basso1, Luca Bortolussi1, Mirjana Videnovic-Misic2 and Husni Habal3 1University of Trieste, IT; 2Infineon Technologies AT, AT; 3Infineon Technologies, DE Abstract Analog integrated circuit (IC) floorplanning is typically a manual process with the placement of components (devices and modules) planned by a layout engineer. This process is further complicated by the interdependence of floorplanning and routing steps, numerous electric and layout-dependent constraints, as well as the high level of customization expected in analog design. This paper presents a novel automatic floorplanning algorithm based on reinforcement learning. It is augmented by a relational graph convolutional neural network model for encoding circuit features and positional constraints. The combination of these two machine learning methods enables knowledge transfer across different circuit designs with distinct topologies and constraints, increasing the generalization ability of the solution. Applied to 6 industrial circuits, our approach surpassed established floorplanning techniques in terms of speed, area, and half-perimeter wire length. When integrated into a procedural generator for layout completion, overall layout time was reduced by 67.3% with an 8.3% mean area reduction compared to manual layout./proceedings-archive/2024/DATA/337_pdf_upload.pdf |
||
FORMALLY VERIFYING ANALOG NEURAL NETWORKS WITH DEVICE MISMATCH VARIATIONS Speaker: Tobias Ladner, TU Munich, DE Authors: Yasmine Abu-Haeyeh1, Thomas Bartelsmeier2, Tobias Ladner3, Matthias Althoff3, Lars Hedrich4 and Markus Olbrich2 1University of Frankfurt, DE; 2Leibniz University Hannover, DE; 3TU Munich, DE; 4Goethe University Frankfurt, DE Abstract Training and running inference of large neural networks come with excessive cost and power consumption. Thus, realizing these networks as analog circuits is an energy- and area-efficient alternative. However, analog neural networks suffer from inherent deviations within their circuits, requiring extensive testing for their correct behavior under these deviations. Unfortunately, tests based on Monte Carlo simulations are extremely time- and resource-intensive. We present an alternative approach to proving the correctness of the neural network using formal neural network verification techniques and developing a modeling methodology for these analog neural circuits. Our experimental results compare two methods based on reachability analysis, showing their effectiveness by reducing the test time from days to milliseconds. Thus, they offer a faster, more scalable solution for verifying the correctness of analog neural circuits./proceedings-archive/2024/DATA/474_pdf_upload.pdf |
||
POST-LAYOUT AUTOMATED OPTIMIZATION FOR CAPACITOR ARRAY IN DIGITAL-TO-TIME CONVERTER Speaker: Hefei Wang, Southern University of Science and Technology, CN Authors: Hefei Wang1, Jianghao Su1, Junhe Xue1, Haoran Lv1, Junhua Zhang2, Longyang Lin1, Kai Chen1, Lijuan Yang2 and Shenghua Zhou3 1Southern University of Science and Technology, CN; 2International Quantum Academy, CN; 3Southern University of Science and Technology; International Quantum Academy, CN Abstract The integral non-linearity (INL) of Digital-to-Time Converter (DTC) in fractional-N phase-locked loops introduces fractional spurs, especially at near-integer channels, resulting in increased jitter. To meet the strict jitter and spur performance requirements of high-performance wireless transceivers, minimizing the INL in DTC designs is crucial. This work presents a computer-aided, automated optimization methodology that focuses on addressing issues stemming from the uniform capacitor unit structure within the capacitor array in Variable-Slope DTC. These issues include parasitic resistance and capacitance, which distort the charging and discharging behavior of the capacitors, contributing to INL. By systematically optimizing the capacitor layout and mitigating parasitic effects, the methodology allows precise tuning of each capacitor unit in capacitor array to reduce INL, enhancing the overall performance of the DTC./proceedings-archive/2024/DATA/636_pdf_upload.pdf |
||
TIME-DOMAIN 3D ELECTROMAGNETIC FIELDS ESTIMATION BASED ON PHYSICS-INFORMED DEEP LEARNING FRAMEWORK Speaker: Huifan Zhang, ShanghaiTech University, CN Authors: Huifan Zhang, Yun Hu and Pingqiang Zhou, ShanghaiTech University, CN Abstract Electromagnetic simulation is important and time-consuming in RF/microwave circuit design. Physics-informed deep learning is a promising method to learn a family of parametric partial differential equations. In this work, we propose a physics-informed deep learning framework to estimate time-domain 3D electromagnetic fields. Our method leverages physics-informed loss functions to model Maxwell's equations which govern electromagnetic fields. Our post-trained model produces accurate results with over 200x speedup over the FDTD simulation. We reduce the mean square error by at least 14% and 15%, with respect to purely data-driven learning and the Fourier operator learning method FNO. In order to optimize data and physical loss simultaneously, we introduce a self-adaptive scaling factors updating algorithm, which has 8.4% less error than the loss balancing method ReLoBRaLo./proceedings-archive/2024/DATA/663_pdf_upload.pdf |
||
TPC-GAN: BATCH TOPOLOGY SYNTHESIS FOR PERFORMANCE-COMPLIANT OPERATIONAL AMPLIFIERS USING GENERATIVE ADVERSARIAL NETWORKS Speaker: Jinglin Han, Beihang University, CN Authors: Yuhao Leng1, Jinglin Han1, Yining Wang2 and Peng Wang1 1Beihang University, CN; 2Corelink Technology (Qingdao) Co., Ltd., CN Abstract The operational amplifier is one of the most important analog building blocks. Existing automated synthesis strategies for operational amplifiers focus solely on the optimization of a single topology, making them unsuitable for scenarios requiring batch synthesis, such as dataset augmentation. In this paper, we introduce TPC-GAN, a generative model for batch topology synthesis of operational amplifiers in accordance with performance specifications. To be specific, it incorporates a reward network of circuit performance into generative adversarial networks (GANs). This enables direct synthesis of novel and feasible circuit topologies meeting performance specifications. Experimental results demonstrate that our proposed method can achieve a validity rate of 98% in circuit generation, among which 99.7% are novel relative to the training dataset. With the introduction of a reward network, a significant portion (82.8%) of the generated circuits satisfy performance specifications, a substantial improvement over those without. Transistor-level experimental results further demonstrate the practicality and competitiveness of our generated circuits with nearly 3x improvement over manual designs./proceedings-archive/2024/DATA/696_pdf_upload.pdf |
||
NANOELECTROMECHANICAL BINARY COMPARATOR FOR EDGE-COMPUTING APPLICATIONS Speaker: Victor Marot, University of Bristol, GB Authors: Victor Marot, Manu Krishnan, Mukesh Kulsreshath, Elliott Worsey, Roshan Weerasekera and Dinesh Pamunuwa, University of Bristol, GB Abstract Bitwise comparison is a fundamental operation in many digital arithmetic functions and is ubiquitous in both datapath and control elements; for example, many machine learning algorithms depend on binary comparison. This work proposes a new class of binary comparator circuit using 4-terminal nanoelectromechanical (NEM) relays that uses just 6 devices compared to 9 transistors in CMOS implementations. Moreover, NEM implementations are capable of withstanding much higher temperatures, up to 300°C, and radiation levels, well over 1 Mrad absorbed dose, conditions which are common across many industrial edge applications, with near zero standby power. One-bit magnitude and equality comparators, each comprising two in-plane silicon 4-terminal relays, were fabricated on a silicon-on-insulator substrate and electrically characterized as a proof of concept, the first such demonstration. Using the 1-bit comparators as building blocks, a scalable tree-based topology is proposed to implement higher-order comparators, resulting in an ≈47% reduction in device count over a CMOS implementation for a 64-bit comparator. Circuit-level simulations of the comparator using accurate device models show that a single operation consumes at most 21 fJ, a 9-fold reduction over the best CMOS offering in an equivalent process node./proceedings-archive/2024/DATA/809_pdf_upload.pdf |
||
CLOCK AND POWER SUPPLY-AWARE HIGH ACCURACY PHASE INTERPOLATOR LAYOUT SYNTHESIS Speaker: Hung-Ming Chen, National Yang Ming Chiao Tung University, TW Authors: Siou-Sian Lin1, Shih-Yu Chen1, Yu-Ping Huang1, Tzu-Chuan Lin1, Hung-Ming Chen2 and Wei-Zen Chen1 1NYCU, TW; 2National Yang Ming Chiao Tung University, TW Abstract Motivated by a popular request from clock and data recovery (CDR) designers facing the inefficiency of generating high-accuracy phase interpolators (PIs), in this work we have developed a layout generator for such circuits, different from conventional constraint-driven works. In the first stage, we propose customized template floorplanning plus pin generation as demanded by the users. In the second stage, in order to generate a high-accuracy layout, we implement a gridless router for signal, power supply, and clock. Experiments with several configurations indicate that our approach can generate high-quality corresponding layouts that align with user expectations, and even surpass the quality of manual designs on structurally regular high-performance PIs, which are not easy to generate efficiently with prior primitive/grid-based methods./proceedings-archive/2024/DATA/922_pdf_upload.pdf |
||
ML-BASED FAST AND ACCURATE PERFORMANCE MODELING AND PREDICTION FOR HIGH-SPEED MEMORY INTERFACES ACROSS DIFFERENT TECHNOLOGIES Speaker: Taehoon Kim, Seoul National University, KR Authors: Taehoon Kim1, Minjeong Kim1, Hankyu Chi2, Byungjun Kang2, Eunji Song2 and Woo-Seok Choi1 1Seoul National University, KR; 2SK hynix, KR Abstract The chip industry is undergoing a market transition from mass production to mass customization. Rapid market changes require agile responses and diversified product designs, particularly in interface circuits managing chip-to-chip communication. To facilitate these shifts, this paper proposes a machine learning-based method for rapidly and accurately predicting and analyzing the performance of high-speed transceivers, along with an evaluation methodology utilizing the proposed approach. Especially, using the process technology information as input in the dataset, this is the first work to predict the performance of a design across different technologies, which will be invaluable in architecting and optimizing designs during the early stages of development. By simulating each functional block, we gather a dataset for parameterized design and performance and incorporate device characteristics from lookup tables. The transmitter, which operates like digital circuits, is trained using parameterized signals with a DNN, while the receiver, containing analog blocks and feedback structures, employs hybrid LSTM-DNN learning with time-series input and output. Our model, trained with a 40nm design, demonstrates high accuracy in predicting performance even with different foundries and technologies. The majority of performance parameters show an R^2 value exceeding 0.9, indicating strong predictive accuracy under varying conditions. This method provides valuable insights for early-stage design optimization and process technology scaling, offering potential for broader applications in other circuit design areas./proceedings-archive/2024/DATA/1014_pdf_upload.pdf |
||
ACCELERATING OTA CIRCUIT DESIGN: TRANSISTOR SIZING BASED ON A TRANSFORMER MODEL AND PRECOMPUTED LOOKUP TABLES Speaker: Subhadip Ghosh, University of Minnesota, US Authors: Subhadip Ghosh1, Endalk Gebru1, Chandramouli Kashyap2, Ramesh Harjani1 and Sachin S. Sapatnekar1 1University of Minnesota, US; 2Cadence Design Systems, US Abstract Device sizing is crucial for meeting performance specifications in operational transconductance amplifiers (OTAs), and this work proposes an automated sizing framework based on a transformer model. The approach first leverages the driving-point signal flow graph (DP-SFG) to map an OTA circuit and its specifications into transformer-friendly sequential data. A specialized tokenization approach is applied to the sequential data to expedite the training of the transformer on a diverse range of OTA topologies, under multiple specifications. Under specific performance constraints, the trained transformer model is used to accurately predict DP-SFG parameters in the inference phase. The predicted DP-SFG parameters are then translated to transistor sizes using a precomputed look-up table-based approach inspired by the gm/Id methodology. In contrast to previous conventional or machine-learning-based methods, the proposed framework achieves significant improvements in both speed and computational efficiency by reducing the need for expensive SPICE simulations within the optimization loop; instead, almost all SPICE simulations are confined to the one-time training phase. The method is validated on a variety of unseen specifications, and the sizing solution demonstrates over 90% success in meeting specifications with just one SPICE simulation for validation, and 100% success with 3-5 additional SPICE simulations./proceedings-archive/2024/DATA/1355_pdf_upload.pdf |
||
A 10PS-ORDER FLEXIBLE RESOLUTION TIME-TO-DIGITAL CONVERTER WITH LINEARITY CALIBRATION AND LEGACY FPGA Speaker: Kentaroh Katoh, Fukuoka University, JP Authors: Kentaroh Katoh1, Toru Nakura2 and Haruo Kobayashi3 1Fukuoka University, JP; 2Fukuoka University, JP; 3Gunma University, JP Abstract This paper presents a 10ps-order flexible resolution time-to-digital converter (TDC) consisting of only Lookup Tables and Flip-Flops that can be applied to legacy FPGAs, which is industry-friendly. The proposed TDC is a Vernier delay-line based TDC. By using MUX chains as the delay adjustable buffers, it realizes a flexible, high-resolution 10ps-order TDC. By controlling the control values of each MUX chain independently, the nonlinearity of the TDC is compensated. In the evaluation using the AMD Artix-7 FPGA, the DNL and INL were [-0.26 LSB, 0.91 LSB] and [-0.84 LSB, 2.27 LSB], respectively, at a resolution of 8.92 ps./proceedings-archive/2024/DATA/1220_pdf_upload.pdf |
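The look-up-table step mentioned in the OTA sizing entry above ("ACCELERATING OTA CIRCUIT DESIGN…") follows the widely used gm/Id methodology; a minimal sketch is given below. The table values, target gm, and gm/Id operating point are placeholders invented for illustration, and a real flow would use per-length, per-corner characterization data rather than a single curve.

```python
# Illustrative sketch only: gm/Id-style sizing from a precomputed lookup table.
# The characterization values below are invented placeholders.
import numpy as np

# Hypothetical table for one channel length: gm/Id (1/V) vs. Id/W (uA/um).
gm_over_id = np.array([4.0, 8.0, 12.0, 16.0, 20.0])
id_over_w  = np.array([60.0, 28.0, 12.0, 4.5, 1.2])

def size_transistor(gm_target_uS, gm_id_choice):
    """Return (bias current in uA, width in um) for a target transconductance."""
    i_d = gm_target_uS / gm_id_choice                         # Id = gm / (gm/Id)
    density = np.interp(gm_id_choice, gm_over_id, id_over_w)  # Id/W from the LUT
    return i_d, i_d / density

i_bias, width = size_transistor(gm_target_uS=500.0, gm_id_choice=14.0)
print(f"Id = {i_bias:.1f} uA, W = {width:.2f} um")
```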
TS27 Session 30 - D7+D13
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 14:00 CET - 15:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
HIPERNOC: A HIGH-PERFORMANCE NETWORK-ON-CHIP FOR FLEXIBLE AND SCALABLE FPGA-BASED SMARTNICS Speaker: Klajd Zyla, TU Munich, DE Authors: Klajd Zyla, Marco Liess, Thomas Wild and Andreas Herkersdorf, TU Munich, DE Abstract A recent approach that the research community has proposed to address the steep growth of network traffic and the attendant rise in computing demands is in-network computing. This paradigm shift is bringing about an increase in the types of computations performed by network devices. Consequently, processing demands are becoming more varied, requiring flexible packet-processing architectures. State-of-the-art switch-based smart network interface cards (SmartNICs) provide high versatility without sacrificing performance but do not scale well concerning resource usage. In this paper, we introduce HiPerNoC—a flexible and scalable field-programmable gate array (FPGA)-based SmartNIC architecture deploying a 2D-mesh network-on-chip (NoC) with a novel router design to manage network traffic with diverse processing demands. The NoC can forward incoming network packets to the available processing engines in the required sequence at a traffic load of up to 91.1 Gbit/s (0.89 flit/node/cycle). Each router applies distributed switch allocation and avoids head-of-line blocking by deploying queues at the switch crosspoints of input-output connections used by the routing algorithm. It also prevents deadlocks by employing non-blocking virtual cut-through switching. We implemented a prototype of HiPerNoC as a 4x4 2D-mesh NoC in SystemVerilog and evaluated it with synthetic network traffic via cycle-accurate register-transfer level simulations in Vivado. The evaluation results show that HiPerNoC achieves up to 53% higher saturation throughput, occupies 53 % fewer lookup tables and block RAMs, and consumes 16 % less power on an Alveo U55C than ProNoC—a state-of-the-art FPGA-based NoC./proceedings-archive/2024/DATA/348_pdf_upload.pdf |
||
NEUROHEXA: A 2D/3D-SCALABLE MODEL-ADAPTIVE NOC ARCHITECTURE FOR NEUROMORPHIC COMPUTING Speaker: Yi Zhong, Peking University, CN Authors: Yi Zhong, Zilin Wang, Yipeng Gao, Xiaoxin Cui, Xing Zhang and Yuan Wang, Peking University, CN Abstract Neuromorphic computing has given rise to a novel computing paradigm that entails a bio-inspired architecture to reproduce the remarkable functionalities of the human brain, such as massively parallel processing and extremely low power consumption. However, those promising merits can be largely negated by a mismatched communication infrastructure in large-scale hardware implementation, in view of the vast degree of neural connectivity, the unstructured spike dataflow, and the unbalanced model workload assignment. In an effort to tackle those challenges, this work presents NeuroHexa, a network-on-chip (NoC) architecture intended for multi-core neuromorphic design. NeuroHexa adopts a customized intra-chip hexagonal topology, which can be further cascaded in 6 directions by either 2D or 3D chiplet integration. Designed in a globally asynchronous, locally synchronous (GALS) methodology, groups of processing nodes can operate at independent paces to further improve resource utilization. To satisfy the varied requirements of data reuse across the chip, NeuroHexa proposes a flexible multicast routing mechanism to best adapt to the model-defined dataflow. Under specific congestion scenarios, NeuroHexa can switch its routing algorithm between deterministic routing and fully adaptive routing modes. The presented NoC router is evaluated in 28nm CMOS, where it achieves a maximum throughput of 179.2 Gbps and a best energy efficiency of 4.872 pJ/packet with an area overhead of 0.0226 mm²./proceedings-archive/2024/DATA/404_pdf_upload.pdf |
||
SPB: TOWARDS LOW-LATENCY CXL MEMORY VIA SPECULATIVE PROTOCOL BYPASSING Speaker: Junbum Park, Sungkyunkwan University, KR Authors: Junbum Park, Yongho Lee, Sungbin Jang, Wonyoung Lee and Seokin Hong, Sungkyunkwan University, KR Abstract Compute Express Link (CXL) is an advanced interconnect standard designed to facilitate high-speed communication between CPUs, accelerators, and memory devices, making it well-suited for data-intensive applications such as machine learning and real-time analytics. Despite its advantages, CXL memory encounters significant latency challenges due to the complex hierarchy of protocol layers, which can adversely impact performance in latency-sensitive scenarios. To address this issue, we introduce the Speculative Protocol Bypassing (SPB) architecture, which aims to minimize latency during read operations by speculatively bypassing several protocol layers of CXL. To achieve this, SPB employs the Snooper mechanism, which extracts essential read commands from the Flit data at an early stage, allowing it to bypass multiple protocol layers and reduce memory access time. Additionally, the Hazard Filter (HF) prevents Read-After-Write (RAW) hazards between read and write operations, thereby maintaining data integrity and ensuring system reliability. The SPB architecture effectively optimizes CXL memory access latency, providing a robust solution for high-performance computing environments that require both low latency and high efficiency. Its minimal hardware overhead makes it a practical and scalable enhancement for future CXL-based memory./proceedings-archive/2024/DATA/431_pdf_upload.pdf |
||
SRING: A SUB-RING CONSTRUCTION METHOD FOR APPLICATION-SPECIFIC WAVELENGTH-ROUTED OPTICAL NOCS Speaker: Zhidan Zheng, TU Munich, DE Authors: Zhidan Zheng, Meng Lian, Mengchu Li, Tsun-Ming Tseng and Ulf Schlichtmann, TU Munich, DE Abstract Wavelength-routed optical networks-on-chip (WRONoCs) attract ever-increasing attention for supporting high-speed communications with low power and latency. Among all WRONoC routers, optical ring routers attract much interest for their simple structures. However, current designs of ring routers have overlooked the customization problem: when adapting to applications that have specific communication requirements, current designs suffer high propagation loss caused by long worst-case signal paths and high splitter usage in power distribution networks (PDN). To address those problems, we propose a novel customization method to generate application-specific ring routers with multiple sub-rings, SRing. Instead of sequentially connecting all nodes in a large ring, we cluster the nodes and connect them with sub-ring waveguides to reduce the path length. Besides, we propose a mixed integer linear programming model for wavelength assignment to reduce the number of PDN splitters. We compare SRing to three state-of-the-art ring router design methods for six applications. Experimental results show that SRing can greatly reduce the length of the longest signal path, the worst-case insertion loss, and the number of splitters in the PDN, significantly improving the power efficiency./proceedings-archive/2024/DATA/582_pdf_upload.pdf |
||
BEAM: A MULTI-CHANNEL OPTICAL INTERCONNECT FOR MULTI-GPU SYSTEMS Speaker: Chongyi Yang, Microelectronics Thrust, The Hong Kong University of Science and Technology (Guangzhou), CN Authors: Chongyi Yang, Bohan Hu, Peiyu Chen, Yinyi Liu, Wei Zhang and Jiang Xu, Hong Kong University of Science and Technology, HK Abstract High-performance computing and AI applications necessitate high-bandwidth communication between GPUs. Traditional electrical interconnects for GPU-to-GPU communication face challenges over longer distances, including high power consumption, crosstalk noise, and signal loss. In contrast, optical interconnects excel in this domain, offering high bandwidth and consistent power dissipation over long distance. This paper proposes BEAM, a Bandwidth-Enhanced optical interconnect Architecture for Multi-GPU systems. BEAM extends electrical-optical interfaces into the GPU package, positioning them close to GPU compute logic and memory. Unlike existing single-channel approaches, each BEAM optical interface incorporates multiple parallel optical channels, further enhancing bandwidth. An arbitration scheme manages channel usage among data transfers. Evaluation on Rodinia benchmarks and LLM training kernels demonstrates that BEAM achieves a speedup of 1.14 - 1.9× and reduces energy consumption by 29 - 44% compared to the electrical-interconnect system and state-of-the-art schemes, while maintaining comparable chip area consumption./proceedings-archive/2024/DATA/759_pdf_upload.pdf |
||
TCDM BURST ACCESS: BREAKING THE BANDWIDTH BARRIER IN SHARED-L1 RVV CLUSTERS BEYOND 1000 FPUS Speaker: Diyou Shen, ETH Zurich, CH Authors: Diyou Shen1, Yichao Zhang1, Marco Bertuletti1 and Luca Benini2 1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT Abstract As computing demand and memory footprint of deep learning applications accelerate, clusters of cores sharing local (L1) multi-banked memory are widely used as key building blocks in large-scale architectures. When the cluster's core count increases, a flat all-to-all interconnect between cores and L1 memory banks becomes a physical implementation bottleneck, and hierarchical network topologies are required. However, hierarchical, multi-level intra-cluster networks are subject to internal contention which may lead to significant performance degradation, especially for SIMD or vector cores, as their memory access is bursty. We present the TCDM Burst Access architecture, a software-transparent burst transaction support to improve bandwidth utilization in clusters with many vector cores tightly coupled to a multi-banked L1 data memory. In our solution, a Burst Manager dispatches burst requests to L1 memory banks, and multiple 32b words from burst responses are retired in parallel on channels with parametric data-width. We validate our design on a RISC-V Vector (RVV) many-core cluster, evaluating the benefits on different core counts. With minimal logic area overhead (less than 8%), we improve the bandwidth of 16-, 256-, and 1024-Floating Point Unit (FPU) baseline clusters, without Tightly Coupled Data Memory (TCDM) Burst Access, by 118%, 226%, and 77%, respectively. Reaching up to 80% of the cores-memory peak bandwidth, our design demonstrates ultra-high bandwidth utilization and enables efficient performance scaling. Implemented in a 12-nm FinFET technology node, compared to the serialized access baseline, our solution achieves up to 1.9x energy efficiency and 2.76x performance on real-world kernel benchmarks./proceedings-archive/2024/DATA/1013_pdf_upload.pdf |
||
SEDG: STITCH-COMPATIBLE END-TO-END LAYOUT DECOMPOSITION BASED ON GRAPH NEURAL NETWORK Speaker: Yifan Guo, Shanghai Jiao Tong University, CN Authors: Yifan Guo1, Jiawei Chen1, Yexin Li1, Yunxiang Zhang1, Qing Zhang1, Yuhang Zhang2 and Yongfu Li1 1Shanghai Jiao Tong University, CN; 2East China Normal University, CN Abstract Advanced semiconductor lithography faces significant challenges as feature sizes continue to shrink, necessitating effective Multiple Patterning Layout Decomposition (MPLD) algorithms. Existing MPLD algorithms have limited efficiency or cannot support stitch insertion to achieve finer-grained optimal decomposition. This paper introduces an end-to-end GNN-based framework that not only achieves high-quality solutions quickly but also applies to layouts with stitches. Our framework treats layouts as heterogeneous graphs and performs inference through a message-passing mechanism. We deliver ultra-competitive, near-optimal solutions that are 10× faster than the exact algorithm (e.g., integer linear programming) and 3× faster than approximate algorithms (e.g., exact-cover, semi-definite programming)./proceedings-archive/2024/DATA/52_pdf_upload.pdf |
||
MULTISCALE FEATURE ATTENTION AND TRANSFORMER BASED CONGESTION PREDICTION FOR ROUTABILITY-DRIVEN FPGA MACRO PLACEMENT Speaker: Hao Gu, Southeast University, CN Authors: Hao Gu1, Xinglin Zheng1, Youwen Wang1, Keyu Peng1, Ziran Zhu2 and Yang Jun3 1Southeast University, CN; 2School of Integrated Circuits, Southeast University, CN; 3, Abstract As routability has emerged as a critical task in modern field-programmable gate array (FPGA) physical design, it is desirable to develop an effective congestion prediction model during the placement stage. Given that the interconnection congestion level is a critical metric for measuring the routability of FPGA placement, we utilize that level as the model training label. In this paper, we propose a multiscale feature attention (MFA) and transformer based congestion prediction model to extract placement features and strengthen their association with congested areas for effective FPGA macro placement. A convolutional neural network (CNN) component is first designed to extract multiscale features from grid-based placement. Then, a well-designed MFA block is proposed that utilizes the dual attention mechanism on both spatial and channel dimensions to enhance the representation of each multiscale feature. By incorporating MFA blocks and CNN's output at each skip connection layer, our model substantially enhances its capability to learn features and recover more precise congestion level maps. Furthermore, multiple transformer layers that employ dynamic attention mechanisms are utilized to extract global information, which can significantly improve the difference between various congestion levels and enhance the ability to identify these levels. Based on the ten most congested and challenging benchmarks from the MLCAD 2023 FPGA macro placement contest, experimental results show that our model outperforms existing congestion prediction models. Furthermore, our model can achieve the best routability and score among the contest winners when integrated into the macro placer based on DREAMPlaceFPGA./proceedings-archive/2024/DATA/405_pdf_upload.pdf |
||
AN EFFECTIVE AND EFFICIENT CROSS-LINK INSERTION FOR NON-TREE CLOCK NETWORK SYNTHESIS Speaker: Mengshi Gong, Southwest University of Science and Technology, CN Authors: Jinghao Ding1, Jiazhi Wen1, Hao Tang1, Zhaoqi Fu1, Mengshi Gong1, Yuanrui Qi1, Wenxin Yu1 and Jinjia Zhou2 1Southwest University of Science and Technology, CN; 2Hosei University, JP Abstract Clock skew introduces significant challenges to overall system performance. Existing non-tree solutions like cross-link insertion often come with limitations, such as the over-consumption of resources and power. In this work, we propose a cross-link insertion algorithm that effectively reduces the clock skew with minimal power overhead, and prioritizes delay optimization on the paths with high sensitivity to the skew. The experimental results from the ISPD 2010 benchmarks show a 17% reduction in the mean of clock skew, a 45% decrease in the standard deviation of clock skew, and 13% lower power consumption versus advanced non-tree solutions in the literature./proceedings-archive/2024/DATA/707_pdf_upload.pdf |
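For readers unfamiliar with 2D-mesh NoCs such as those in the HiPerNoC and NeuroHexa entries above, the sketch below shows plain XY dimension-order routing on a mesh. This is a generic textbook routing function with assumed coordinates and port names; it is not the switch-allocation or adaptive routing scheme of either paper.

```python
# Illustrative sketch only: deterministic XY routing on a 2D-mesh NoC.
# Coordinates and port names are assumptions, not taken from the papers above.
def xy_route(src, dst):
    """Return the output ports a packet traverses from src to dst,
    routing the X dimension first and then the Y dimension."""
    x, y = src
    dx, dy = dst
    hops = []
    while x != dx:                      # move along the X dimension
        hops.append("EAST" if dx > x else "WEST")
        x += 1 if dx > x else -1
    while y != dy:                      # then move along the Y dimension
        hops.append("NORTH" if dy > y else "SOUTH")
        y += 1 if dy > y else -1
    hops.append("LOCAL")                # eject at the destination router
    return hops

print(xy_route((0, 0), (2, 3)))
# ['EAST', 'EAST', 'NORTH', 'NORTH', 'NORTH', 'LOCAL']
```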
TS28 Session 28 - D3+T2
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
POLYNOMIAL FORMAL VERIFICATION OF SEQUENTIAL CIRCUITS USING WEIGHTED-AIGS Speaker: Mohamed Nadeem, University of Bremen, DE Authors: Mohamed Nadeem1, Chandan Jha1 and Rolf Drechsler2 1University of Bremen, DE; 2University of Bremen | DFKI, DE Abstract Ensuring the functional correctness of a digital system is achievable through formal verification. Despite the increased complexity of modern systems, formal verification still needs to be done in a reasonable time. Hence, Polynomial Formal Verification (PFV) techniques are being explored as they provide a guaranteed upper bound on the run time for verification. Recently, it was shown that combinational circuits characterized by a constant cutwidth can be verified in linear time using Answer Set Programming (ASP). However, most of the designs used in digital systems are sequential. Hence, in this paper, we propose a linear-time formal verification approach using ASP for sequential circuits with constant cutwidth. We achieve this by proposing a new data structure called Weighted-And Inverter Graph (W-AIG). Unlike existing formal verification methods, we prove that our approach can verify any sequential circuit with a constant cutwidth in linear time. Finally, we also implement our approach and experimentally show the results on a variety of sequential circuits like pipelined adders, serial adders, and shift registers to confirm our theoretical findings./proceedings-archive/2024/DATA/257_pdf_upload.pdf |
||
WORD-LEVEL COUNTEREXAMPLE REDUCTION METHODS FOR HARDWARE VERIFICATION Speaker: Zhiyuan Yan, Microelectronics Thrust, The Hong Kong University of Science and Technology (Guangzhou), CN Authors: Zhiyuan Yan1 and Hongce Zhang2 1The Hong Kong University of Science and Technology (Guangzhou), CN; 2Hong Kong University of Science and Technology, HK Abstract Hardware verification is crucial to ensure correctness in the logic design of digital circuits. The purpose of verification is to either find bugs or show their absence. Prior works mostly focus on the bug-finding process and have proposed a range of verification algorithms and techniques to reach a bug or conclude with a proof of correctness faster. However, for a human verification engineer, it also matters how to better analyze counterexample traces to understand the root cause of bugs. This kind of technique remains absent in word-level circuit analysis. In this paper, we investigate counterexample reduction methods. Given the existing techniques for the bit-level circuit model, we first extend current semantic analysis methods to word-level counterexample reduction and then develop a more efficient word-level structural analysis approach. We compare the effectiveness and overhead of these methods on hardware model-checking problems and show the usefulness of such analysis in applications including pivot input analysis, word-level model-checking and counterexample-guided abstraction refinement./proceedings-archive/2024/DATA/488_pdf_upload.pdf |
||
ACCURATE AND EXTENSIBLE SYMBOLIC EXECUTION OF BINARY CODE BASED ON FORMAL ISA SEMANTICS Speaker: Sören Tempel, TU Braunschweig, DE Authors: Sören Tempel1, Tobias Brandt2, Christoph Lüth3, Christian Dietrich4 and Rolf Drechsler3 1TU Braunschweig, DE; 2Independent, DE; 3University of Bremen | DFKI, DE; 4TU Hamburg, DE Abstract Symbolic execution is an SMT-based software verification and testing technique. Symbolic execution requires tracking performed computations during software simulation to reason about branches in the software under test. The prevailing approach on symbolic execution of binary code tracks computations by transforming the code to be tested to an architecture-independent IR and then symbolically executes this IR. However, the resulting IR must be semantically equivalent to the binary code, making this process complex and error-prone. The semantics of the binary code are specified by the targeted ISA, commonly given in natural language and requiring a manual implementation of the transformation to an IR. In recent years, the use of formal languages to describe ISA semantics in a machine-readable way has gained increased popularity. We investigate the utilization of such formal semantics for symbolic execution of binary code, achieving an accurate representation of instruction semantics. We present a prototype for the RISC-V ISA and conduct a case study to demonstrate that it can be easily extended to additional instructions. Furthermore, we perform an experimental comparison with prior work which resulted in the discovery of five previously unknown bugs in the ISA implementation of the popular IR-based symbolic executor angr./proceedings-archive/2024/DATA/584_pdf_upload.pdf |
||
EFFICIENT SAT-BASED BOUNDED MODEL CHECKING OF EVOLVING SYSTEMS Speaker: Sophie Andrews, Stanford University, US Authors: Sophie Andrews, Matthew Sotoudeh and Clark Barrett, Stanford University, US Abstract SAT-based verification is a common technique used by industry practitioners to find bugs in computer systems. However, these systems are rarely designed in a single step: instead, designers repeatedly make small modifications, reverifying after each change. With current tools, this reverification step takes as long as a full, from-scratch verification, even if the design has only been modified slightly. We propose a novel SAT-based verification technique that performs significantly better than the naive approach in the setting of evolving systems. The key idea is to reuse information learned during the verification of earlier versions of the system to speed up the verification of later versions. We instantiate our technique in a bounded model checking tool for SystemVerilog code and apply it to a new benchmark set based on real edit history for a set of open source RISC-V cores. This new benchmark set is now publicly available for further research on verification of evolving systems. Our tool, PrediCore, significantly improves the time required to verify properties on later versions of the cores compared to the current state-of-the-art, verify-from-scratch approach./proceedings-archive/2024/DATA/899_pdf_upload.pdf |
||
HIGH-THROUGHPUT SAT SAMPLING Speaker: Arash Ardakani, University of California, Berkeley, US Authors: Arash Ardakani1, Minwoo Kang1, Kevin He1, Qijing Huang2 and John Wawrzynek1 1University of California, Berkeley, US; 2NVIDIA Corp., US Abstract In this work, we present a novel technique for GPU-accelerated Boolean satisfiability (SAT) sampling. Unlike conventional sampling algorithms that directly operate on conjunctive normal form (CNF), our method transforms the logical constraints of SAT problems by factoring their CNF representations into simplified multi-level, multi-output Boolean functions. It then leverages gradient-based optimization to guide the search for a diverse set of valid solutions. Our method operates directly on the circuit structure of refactored SAT instances, reinterpreting the SAT problem as a supervised multi-output regression task. This differentiable technique enables independent bit-wise operations on each tensor element, allowing parallel execution of learning processes. As a result, we achieve GPU-accelerated sampling with significant runtime improvements ranging from 33.6× to 523.6× over state-of-the-art heuristic samplers. We demonstrate the superior performance of our sampling method through an extensive evaluation on 60 instances from a public domain benchmark suite utilized in previous studies./proceedings-archive/2024/DATA/1208_pdf_upload.pdf |
||
SMT-BASED REPAIRING REAL-TIME TASK SPECIFICATIONS Speaker: Anand Yeolekar, TCS Research, IN Authors: Anand Yeolekar1, Ravindra Metta1 and Samarjit Chakraborty2 1TCS, IN; 2UNC Chapel Hill, US Abstract When addressing timing issues in real-time systems, approaches for systematic timing debugging and repair have been missing due to (i) lack of available feedback: most timing analysis techniques, being closed-form analytical techniques, are unable to provide root cause information when a timing property is violated, which is critical for identifying an appropriate repair, and (ii) pessimism in the analysis: existing schedulability analysis techniques tend to make worst-case assumptions in the presence of non-determinism introduced by real-world factors such as release jitter or sporadic tasks. To address this gap, we propose an SMT encoding of task runs for exact debugging of timing violations, and a procedure to iteratively repair a given task specification. We demonstrate the utility of this procedure by repairing example task sets scheduled under global non-preemptive earliest-deadline-first scheduling, a common choice for many safety-critical systems./proceedings-archive/2024/DATA/1397_pdf_upload.pdf |
||
HACHIFI: A LIGHTWEIGHT SOC ARCHITECTURE-INDEPENDENT FAULT-INJECTION FRAMEWORK FOR SEU IMPACT EVALUATION Speaker: Masanori Hashimoto, Kyoto University, JP Authors: Quan Cheng1, Wang Liao2, Ruilin Zhang1, Hao Yu3, Longyang Lin3 and Masanori Hashimoto1 1Kyoto University, JP; 2Kochi University of Technology, JP; 3Southern University of Science and Technology, CN Abstract Single-Event Upsets (SEUs), triggered by energetic particles, manifest as unexpected bit-flips in memory cells or registers, potentially causing significant anomalies in electronic devices. Driven by the needs of safety-critical applications, it is crucial to evaluate the reliability of these electronic devices before they are deployed. However, traditional reliability analysis techniques, such as irradiation experiments, are costly, while fault injection (FI) simulations often fail to provide full coverage and have limited effectiveness and accuracy. To address these issues, we introduce HachiFI, a lightweight, architecture-independent framework that automates fault injection with 100% coverage via memory and scan-chain accesses and simulates the behavior of SEUs based on specific cross-sections. HachiFI supports configurable fault injection patterns for both system-level and module-level reliability analysis. Using HachiFI, we demonstrate a low hardware overhead (<2%) and a high match (R² = 0.984) between FI and irradiation experiments, verified on a 22nm edge-AI chip./proceedings-archive/2024/DATA/451_pdf_upload.pdf |
||
ACCELERATING CELL-AWARE MODEL GENERATION FOR SEQUENTIAL CELLS USING GRAPH THEORY Speaker: Gianmarco Mongelli, LIRMM, FR Authors: Gianmarco Mongelli1, Eric Faehn2, Dylan Robins2, Patrick Girard3 and Arnaud Virazel3 1LIRMM and STMicroelectronics Crolles, FR; 2STMicroelectronics, FR; 3LIRMM, FR Abstract The Cell-Aware (CA) methodology has become essential to detect and diagnose manufacturing intra-cell defects in modern semiconductor technologies. It characterizes standard cells by creating a defect-detection matrix, which serves as a reference that maps stimuli to the specific defects they can detect. Its limitation is that the CA approach requires a large number of time-consuming analog simulations to create the matrix. In [1], a graph-based methodology called Transistor Undetectable Defect eLiminator (TrUnDeL) was presented to reduce the number of simulations to perform. TrUnDeL can identify undetectable stimulus/defect pairs that are then excluded from the analog simulations. However, its use is limited to combinational cells and does not offer any guidance on handling sequential cells, which are usually the most complex cells. In this paper, we present a new version of TrUnDeL that supports the analysis of sequential cells. Experiments conducted on sequential cells from two industrial standard cell libraries demonstrate that the CA generation time is reduced by 30% without compromising accuracy./proceedings-archive/2024/DATA/465_pdf_upload.pdf |
||
AN EFFICIENT PARALLEL FAULT SIMULATOR FOR FUNCTIONAL PATTERNS ON MULTI-CORE SYSTEMS Speaker: Xiaoze Lin, Institute of Computing Technology, Chinese Academy of Sciences, CN Authors: Xiaoze Lin1, Liyang Lai2, Huawei Li3, Biwei Xie3 and Xingquan Li4 1State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Shantou University, CN; 3Institute of Computing Technology, Chinese Academy of Sciences, CN; 4Peng Cheng Laboratory, CN Abstract Fault simulation targeting functional patterns emerges as an essential mechanism within functional safety, crucial for validating the effectiveness of safety mechanisms. The acceleration of fault simulation for functional patterns is imperative for boosting the efficiency and adaptability of functional safety verification, presenting a significant yet unresolved challenge. In this paper, we propose an efficient fault simulator for functional patterns, utilizing three techniques: fault filtering, fault grouping, and CPU-based parallelism. The integration of these three techniques, tailored to the characteristics of functional patterns, reduces the runtime of fault simulation from different perspectives. The experimental results show that on a 48-core system, our parallel fault simulator achieves an average 79× speedup over a commercial tool./proceedings-archive/2024/DATA/666_pdf_upload.pdf |
||
SPATIAL MODELING WITH AUTOMATED MACHINE LEARNING AND GAUSSIAN PROCESS REGRESSION TECHNIQUES FOR IMPUTING WAFER ACCEPTANCE TEST DATA Speaker: Ming-Chun Wei, National Cheng Kung University, TW Authors: Ming-Chun Wei, Hsun-Ping Hsieh and Chun-Wei Shen, National Cheng Kung University, TW Abstract The Wafer Acceptance Test (WAT) is a significant quality control measurement in the semiconductor industry. However, because the WAT process can be time-consuming and expensive, sampling testing is commonly employed during production. This makes root cause tracing impossible when abnormal products have not been tested. Therefore, in our study, we focus on establishing a reliable method to estimate WAT results for non-tested shots, including both intra- and inter-wafer prediction. Notably, we are the first to combine the use of Chip Probing data with WAT to improve the predictions. Our proposed method first extracts valuable features from Chip Probing test results by using the Automated Machine Learning technique. We then employ Gaussian Process Regression to capture the spatio-temporal correlation. Finally, we adopt a linear regression model to ensemble the two components, yielding the proposed SMART-WAT model that effectively estimates the wafer acceptance test data. Our method has been tested on a real-world dataset from the semiconductor manufacturing industry. The prediction results of four key WAT parameters indicate that our proposed model outperforms the state-of-the-art methods in both intra- and inter-wafer prediction./proceedings-archive/2024/DATA/926_pdf_upload.pdf |
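As a rough scikit-learn sketch of the two-component idea described above (a spatial Gaussian process over shot coordinates, a feature-based regressor on Chip Probing data, and a linear ensemble on top); the synthetic data, model choices, and hyperparameters are assumptions for illustration, not the authors' SMART-WAT pipeline:

```python
# Minimal sketch (placeholder data, not the SMART-WAT pipeline): a spatial GP over
# shot coordinates plus a feature-based regressor, blended by linear regression.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(200, 2))          # shot coordinates on the wafer
cp = rng.normal(size=(200, 5))                  # Chip Probing features (hypothetical)
wat = xy[:, 0] ** 2 + cp[:, 0] + 0.1 * rng.normal(size=200)   # one WAT parameter

train, test = np.arange(150), np.arange(150, 200)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-2)
gp.fit(xy[train], wat[train])
fb = GradientBoostingRegressor().fit(cp[train], wat[train])

# Linear ensemble of the two component predictions.
stack_tr = np.column_stack([gp.predict(xy[train]), fb.predict(cp[train])])
ens = LinearRegression().fit(stack_tr, wat[train])

stack_te = np.column_stack([gp.predict(xy[test]), fb.predict(cp[test])])
print("ensemble MAE:", np.abs(ens.predict(stack_te) - wat[test]).mean())
```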
||
ON THE IMPACT OF WARPAGE ON BEOL GEOMETRY AND PATH DELAYS IN FAN-OUT WAFER-LEVEL PACKAGING Speaker: Dhruv Thapar, Arizona State University, US Authors: Dhruv Thapar1, Arjun Chaudhuri1, Christopher Bailey1, Ravi Mahajan2 and Krishnendu Chakrabarty1 1Arizona State University, US; 2Intel Corporation, US Abstract Warpage is a major concern in fan-out wafer-level packaging (FOWLP) due to the complex thermal processing steps involved in manufacturing. These steps include curing, electroplating, and deposition, which induce residual stresses through differential thermal expansion and contraction of materials. This effect is further amplified by mismatches in the coefficients of thermal expansion (CTE) between different materials. In particular, high-density interconnects in the back-end of line (BEOL), redistribution layers (RDLs), and through-mold vias (TMVs) are susceptible to warpage-induced stress, strain, and deformation. This work conducts structural simulations to analyze warpage in the BEOL stack induced by FOWLP. Our results indicate that the impact of warpage is non-uniform across the entire BEOL geometry of a die, hence it impacts different metal layers differently, and different coordinates within one metal layer differently. We leverage this warpage analysis to calculate parasitics and evaluate the resulting changes in path delays./proceedings-archive/2024/DATA/1368_pdf_upload.pdf |
||
MODELING AND ANALYSIS TECHNIQUE FOR THE FORMAL VERIFICATION OF SYSTEM-ON-CHIP ADDRESS MAPS Speaker: Niels Mook, NXP, NL Authors: Niels Mook1, Erwin de Kock1, Bas Arts1, Soham Chakraborty2 and Arie van Deursen2 1NXP Semiconductors, NL; 2TU Delft, NL Abstract This paper proposes a modeling and analysis technique to verify SoC address maps. The approach involves (i) modeling the specification and implementation address map using a unified graph model, and (ii) analysis of equivalence in terms of address maps between two such models. Using a state-of-the-art mid-size SoC design, we demonstrate the proposed solution is able to analyze and verify address maps of complex SoC designs and to identify the causes of discrepancies./proceedings-archive/2024/DATA/724_pdf_upload.pdf |
||
FREDDY: MODULAR AND EFFICIENT FRAMEWORK TO ENGINEER DECISION DIAGRAMS YOURSELF Speaker: Rune Krauss, DFKI, DE Authors: Rune Krauss1, Jan Zielasko1 and Rolf Drechsler2 1DFKI, DE; 2University of Bremen | DFKI, DE Abstract The hardware complexity in electronic devices used by today's society has increased significantly in recent decades due to technological progress. In order to cope with this complexity, data structures and algorithms in electronic design automation must be continuously improved. Decision Diagrams (DDs) are an important data structure in the design and analysis of circuits because they allow efficient algorithms for their manipulation. The practical relevance of DDs leads to an ongoing quest for appropriate software solutions that enable working with different DD types. Unfortunately, existing DD software libraries focus either on efficiency or usability. The consequence is either a disproportionately high effort for extensions or a considerable loss of performance. To tackle these issues, a modular and efficient Framework to Engineer Decision Diagrams Yourself (FrEDDY) is proposed in this paper. Various experiments demonstrate that no compromise with regard to performance has to be made when using FrEDDY. It is on par with or clearly more efficient than established DD libraries./proceedings-archive/2024/DATA/1377_pdf_upload.pdf |
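For readers unfamiliar with what a DD library has to provide, the sketch below shows the core of any BDD package in a few lines of Python: a unique table that hash-conses (var, low, high) triples and a memoized recursive AND. This is textbook material shown purely as background; it is not FrEDDY's API or architecture.

```python
# Minimal textbook BDD core (not FrEDDY): a unique table ensures each
# (var, low, high) triple exists exactly once; bdd_and is memoized apply().
from functools import lru_cache

FALSE, TRUE = 0, 1                      # terminal node ids
nodes = {0: None, 1: None}              # id -> (var, low, high)
unique = {}                             # (var, low, high) -> id

def mk(var, low, high):
    if low == high:
        return low                      # redundant-test elimination
    key = (var, low, high)
    if key not in unique:
        nid = len(nodes)
        nodes[nid] = key
        unique[key] = nid
    return unique[key]

@lru_cache(maxsize=None)
def bdd_and(u, v):
    if u == FALSE or v == FALSE: return FALSE
    if u == TRUE: return v
    if v == TRUE: return u
    (xu, lu, hu), (xv, lv, hv) = nodes[u], nodes[v]
    if xu == xv: return mk(xu, bdd_and(lu, lv), bdd_and(hu, hv))
    if xu < xv:  return mk(xu, bdd_and(lu, v), bdd_and(hu, v))
    return mk(xv, bdd_and(u, lv), bdd_and(u, hv))

x1 = mk(1, FALSE, TRUE)                 # variable x1
x2 = mk(2, FALSE, TRUE)                 # variable x2
print(bdd_and(x1, x2))                  # id of the node representing x1 AND x2
```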
TS29 Session 31 - D10+D9
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
EFFICIENT APPROXIMATE LOGIC SYNTHESIS WITH DUAL-PHASE ITERATIVE FRAMEWORK Speaker: Ruicheng Dai, Shanghai Jiao Tong University, CN Authors: Ruicheng Dai1, Xuan Wang1, Wenhui Liang1, Xiaolong Shen2, Menghui Xu2, Leibin Ni2, Gezi Li2 and Weikang Qian1 1Shanghai Jiao Tong University, CN; 2Huawei Technologies Co., Ltd., China, CN Abstract Approximate computing is an emerging paradigm to improve the energy efficiency for error-tolerant applications. Many iterative approximate logic synthesis (ALS) methods were proposed to automatically design approximate circuits. However, as the sizes of circuits grow, the runtime of ALS grows rapidly. Thus, a crucial challenge is to ensure circuit quality while improving the efficiency of ALS. This work proposes a dual-phase iterative framework to accelerate the iterative ALS flows. In the first phase, a comprehensive circuit analysis is performed to gather the necessary information, including the error information. In the second phase, minimal incremental computation is employed based on the information from the first phase. The experimental results show that the proposed method achieves an acceleration by up to 21.8× without loss of circuit quality compared to the state-of-the-art methods./proceedings-archive/2024/DATA/306_pdf_upload.pdf |
||
EFFICIENT APPROXIMATE NEAREST NEIGHBOR SEARCH VIA DATA-ADAPTIVE PARAMETER ADJUSTMENT IN HIERARCHICAL NAVIGABLE SMALL GRAPHS Speaker: Huijun Jin, Yonsei University, KR Authors: Huijun Jin, Jieun Lee, Shengmin Piao, Sangmin Seo, Sein Kwon and Sanghyun Park, Yonsei University, KR Abstract Hierarchical Navigable Small World (HNSW) graphs are a state-of-the-art solution for approximate nearest neighbor search, widely applied in areas like recommendation systems, computer vision, and natural language processing. However, the effectiveness of the HNSW algorithm is constrained by its reliance on static parameter settings, which do not account for variations in data density and dimensionality across different datasets. This paper introduces Dynamic HNSW, an adaptive method that dynamically adjusts key parameters — such as M (number of connections per node) and ef (search depth) — based on both local data density and dimensionality of the dataset. The proposed approach improves flexibility and efficiency, allowing the graph to adapt to diverse data characteristics. Experimental results across multiple datasets demonstrate that Dynamic HNSW significantly reduces graph build time by up to 33.11% and memory usage by up to 32.44%, while maintaining comparable recall, thereby outperforming the conventional HNSW in both scalability and efficiency./proceedings-archive/2024/DATA/648_pdf_upload.pdf |
||
HAAN: A HOLISTIC APPROACH FOR ACCELERATING LAYER NORMALIZATION IN LARGE LANGUAGE MODELS Speaker: Sai Qian Zhang, New York University, US Authors: Tianfan Peng1, Tianhua Xia2, Jiajun Qin3 and Sai Qian Zhang4 1Tongji University, CN; 2Independent Researcher, US; 3Zhejiang University, CN; 4New York University, US Abstract Large language models (LLMs) have revolutionized natural language processing (NLP) tasks by achieving state-of-the-art performance across a range of benchmarks. Central to the success of these models is the integration of sophisticated architectural components aimed at improving training stability, convergence speed, and generalization capabilities. Among these components, the normalization operation, e.g., layer normalization (LayerNorm), emerges as a pivotal technique, offering substantial benefits to the overall model performance. However, previous studies have indicated that normalization operations can substantially elevate processing latency and energy usage. In this work, we adopt the principles of algorithm and hardware co-design, introducing a holistic normalization accelerating method named HAAN. The evaluation results demonstrate that HAAN can achieve significantly better hardware performance compared to state-of-the-art solutions./proceedings-archive/2024/DATA/667_pdf_upload.pdf |
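The abstract does not describe HAAN's internals here; as a reference point for what is being accelerated, the snippet below is the textbook LayerNorm in plain numpy, making the per-token mean and variance reductions (the costly, non-pointwise part of the operator) explicit. It is the standard definition, not HAAN's approximation.

```python
# Reference LayerNorm in plain numpy, i.e. the operator such accelerators target;
# this is the textbook definition, not the paper's hardware-friendly variant.
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)          # per-token mean reduction
    var = x.var(axis=-1, keepdims=True)          # per-token variance reduction
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(4, 768).astype(np.float32)   # 4 tokens, hidden size 768
out = layernorm(x, np.ones(768, np.float32), np.zeros(768, np.float32))
print(out.mean(axis=-1), out.std(axis=-1))       # roughly 0 and 1 per token
```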
||
MCTA: A MULTI-STAGE CO-OPTIMIZED TRANSFORMER ACCELERATOR WITH ENERGY-EFFICIENT DYNAMIC SPARSE OPTIMIZATION Speaker: Heng Liu, Harbin Institute of Technology, CN Authors: Heng Liu, Ming Han, Jin Wu, Ye Wang and Jian Dong, Harbin Institute of Technology, CN Abstract As Transformer-based models continue to enhance service quality across various domains, their intensive computational requirements are exacerbating the AI energy crisis. Traditional energy-efficient Transformer architectures primarily focus on optimizing the Attention stage due to its high algorithmic complexity (O(n^2)). However, linear layers can also be significant energy consumers, sometimes accounting for over 70% of total energy usage. Although existing approaches such as sparsity have improved the Attention stage, the optimization space within such linear layers is not fully exploited. In this paper, we introduce the multi-stage co-optimized Transformer accelerator (MCTA) for optimizing energy efficiency. Our approach independently enhances the Query-Key-Value generation, Attention, and Feed-forward Neural Network stages. It employs two novel techniques: Low-overhead Mask Generation (LMG) for dynamically identifying unimportant calculations with minimal energy costs, and Cascaded Mask Derivation (CMD) for streamlining the mask generation process through parallel processing. Experimental results show that MCTA achieves an average energy reduction of 1.48× with only a 1% accuracy loss compared to state-of-the-art accelerators. This work demonstrates the potential for significant energy savings in Transformer models without the need for retraining, paving the way for more sustainable AI applications./proceedings-archive/2024/DATA/755_pdf_upload.pdf |
||
CIRCUITS IN A BOX: COMPUTING HIGH-DIMENSIONAL PERFORMANCE SPACES FOR ANALOG INTEGRATED CIRCUITS Speaker: Juergen Kampe, Ernst-Abbe-Hochschule Jena, DE Authors: Benedikt Ohse, Jürgen Kampe and Christopher Schneider, Ernst-Abbe-Hochschule Jena, DE Abstract Performance spaces contain information about all combinations of attainable performance parameters of analog integrated circuits. Their exploration allows designers to evaluate given circuits without considering implementation details, making them a valuable tool to support the design process. The computation of performance spaces, even for a small number of considered parameters, is time-consuming because it requires solving multi-objective, non-convex optimization problems that involve costly circuit simulations. We present a numerical method for efficiently approximating high-dimensional performance spaces, which is based on the box-coverage method known from Pareto optimization. The resulting implementation not only outperforms state-of-the-art solvers based on the well-known Normal-Boundary Intersection method in terms of computational complexity, but also offers several advantages, such as a practical stopping criterion and the possibility of warm starting. Furthermore, we present an interactive visualization technique to explore performance spaces of any dimension, which can help system designers to make reliable topology decisions even without detailed technical knowledge of the underlying circuits. Numerical experiments that confirm the efficiency of our approach are performed by computing seven-dimensional performance spaces for an analog low-dropout regulator as used in the radio-frequency identification domain./proceedings-archive/2024/DATA/842_pdf_upload.pdf |
||
GRADIENT APPROXIMATION OF APPROXIMATE MULTIPLIERS FOR HIGH-ACCURACY DEEP NEURAL NETWORK RETRAINING Speaker: Chang Meng, EPFL, Switzerland, CN Authors: Chang Meng1, Wayne Burleson2, Weikang Qian3 and Giovanni De Micheli1 1EPFL, CH; 2U Massachusetts Amherst, US; 3Shanghai Jiao Tong University, CN Abstract Approximate multipliers (AppMults) are widely employed in deep neural network (DNN) accelerators to reduce the area, delay, and power consumption. However, the inaccuracies of AppMults degrade DNN accuracy, necessitating a retraining process to recover accuracy. A critical step in retraining is computing the gradient of the AppMult, i.e., the partial derivative of the approximate product with respect to each input operand. Conventional methods approximate this gradient using that of the accurate multiplier (AccMult), often leading to suboptimal retraining results, especially for AppMults with relatively large errors. To address this issue, we propose a difference-based gradient approximation of AppMults to improve retraining accuracy. Experimental results show that compared to the state-of-the-art methods, our method improves the DNN accuracy after retraining by 4.10% and 2.93% on average for the VGG and ResNet models, respectively. Moreover, after retraining a ResNet18 model using the 7-bit AppMult, the final DNN accuracy does not degrade compared to the quantized model using the 7-bit AccMult, while the power consumption is reduced by 51%./proceedings-archive/2024/DATA/1032_pdf_upload.pdf |
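As a hedged sketch of the general idea of a difference-based gradient for a non-differentiable approximate multiplier, wrapped in a custom PyTorch autograd function: both the toy AppMult model (simple operand rounding) and the step size are placeholders for illustration, not the paper's formulation.

```python
# Minimal sketch (not the paper's exact method): a custom autograd Function whose
# backward pass uses a finite-difference approximation of the approximate
# multiplier's gradient instead of the accurate multiplier's gradient.
import torch

def app_mult(a, b):
    # Placeholder approximate multiplier: quantize operands before multiplying.
    return torch.round(a * 8) / 8 * (torch.round(b * 8) / 8)

class AppMultFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return app_mult(a, b)

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        d = 1.0 / 8                                     # step matched to the LSB
        ga = (app_mult(a + d, b) - app_mult(a, b)) / d  # difference-based dP/da
        gb = (app_mult(a, b + d) - app_mult(a, b)) / d  # difference-based dP/db
        return grad_out * ga, grad_out * gb

a = torch.randn(5, requires_grad=True)
b = torch.randn(5, requires_grad=True)
AppMultFn.apply(a, b).sum().backward()
print(a.grad, b.grad)
```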
||
SEGMENT-WISE ACCUMULATION: LOW-ERROR LOGARITHMIC DOMAIN COMPUTING FOR EFFICIENT LARGE LANGUAGE MODEL INFERENCE Speaker: Xinkuang Geng, Shanghai Jiao Tong University, CN Authors: Xinkuang Geng, Yunjie Lu, Hui Wang and Honglan Jiang, Shanghai Jiao Tong University, CN Abstract Logarithmic domain computing (LDC) has great potential for reducing quantization errors and computational complexity in Large Language Models (LLMs). While logarithmic multiplication can be efficiently implemented using fixed-point addition, the primary challenge in multiply-accumulate (MAC) operations is balancing the precision of logarithmic adders with their hardware overhead. Through a detailed analysis of the errors inherent in LDC-based LLMs, we propose segment-wise accumulation (SWA) to mitigate these errors. In addition, a processing element (PE) is introduced to enable SWA in the systolic array architecture. Compared with the accumulation scheme devised for enhancing floating-point computing, the proposed SWA facilitates the integration into existing accelerator architectures, resulting in lower hardware overhead. The experimental results show that SWA allows LDC under low-precision configurations to achieve remarkable accuracy in LLMs, demonstrating higher hardware efficiency than merely increasing the precision of individual computations. Our method, while maintaining a lower hardware overhead than traditional LDC, achieves more than 13.9% improvement in average accuracy across multiple zero-shot benchmarks on Llama-2-7B. Furthermore, compared to integer domain computing, a logarithmic processing element array based on the proposed SWA yields reductions of 24.6% in area and 42.3% in power, while achieving higher accuracy./proceedings-archive/2024/DATA/1252_pdf_upload.pdf |
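One plausible reading of segment-wise accumulation is sketched below with a deliberately coarse lookup table for the LNS addition term log2(1 + 2^-d): each segment of products is reduced in the log domain and the segment totals are then combined. The segment length, LUT resolution, and positive-only operands are assumptions for illustration; the paper's exact scheme and error analysis may differ.

```python
# Minimal sketch (toy LUT and segment length; not the paper's PE design): LNS
# accumulation performed per segment, with segment totals combined at the end.
import numpy as np

# Coarse LUT for f(d) = log2(1 + 2^-d), the nonlinear term in LNS addition.
D = np.linspace(0, 16, 33)
LUT = np.log2(1 + 2.0 ** -D)

def lns_add(x, y):
    """Approximate log2(2^x + 2^y) with the coarse LUT (positive operands only)."""
    hi, lo = max(x, y), min(x, y)
    return hi + LUT[np.argmin(np.abs(D - (hi - lo)))]

def accumulate(logs, segment=4):
    """Reduce each segment in the log domain, then combine the segment totals."""
    segs = [logs[i:i + segment] for i in range(0, len(logs), segment)]
    totals = []
    for s in segs:
        acc = s[0]
        for v in s[1:]:
            acc = lns_add(acc, v)
        totals.append(acc)
    acc = totals[0]
    for v in totals[1:]:
        acc = lns_add(acc, v)
    return acc

vals = np.abs(np.random.randn(16)) + 0.1
logs = np.log2(vals)
print("exact:", np.log2(vals.sum()), "segment-wise LNS:", accumulate(logs))
```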
||
LOOKUP TABLE REFACTORING: TOWARDS EFFICIENT LOGARITHMIC NUMBER SYSTEM ADDITION FOR LARGE LANGUAGE MODELS Speaker: Xinkuang Geng, Shanghai Jiao Tong University, CN Authors: Xinkuang Geng1, Siting Liu2, Hui Wang1, Jie Han3 and Honglan Jiang1 1Shanghai Jiao Tong University, CN; 2ShanghaiTech University, CN; 3University of Alberta, CA Abstract Compared to integer quantization, logarithmic quantization aligns more effectively with the long-tailed distribution of data in large language models (LLMs), resulting in lower quantization errors. Moreover, the logarithmic number system (LNS) employs a fixed-point adder to perform multiplication, indicating a potential reduction in computational complexity for LLM accelerators that require extensive multiply-accumulate (MAC) operations. However, a key bottleneck is that LNS addition requires complex nonlinear functions, which are typically approximated using lookup tables (LUTs). This study aims to reduce the hardware resources needed for LUTs in LNS addition while maintaining high precision. Specifically, we investigate the specific nature of addition operations within LLMs; the relationship between the hardware parameters of the LUT and the computing errors is then mathematically derived. Based on these insights, we propose LUT refactoring to optimize the LUT for enhanced efficiency in LNS addition. With 10.93% and 19.78% reductions in area-delay product (ADP) and power-delay product (PDP), respectively, LUT refactoring results in an accuracy improvement of up to 33.5% in LLM benchmarks compared to the naive design. When compared to integer quantization, our method achieves higher accuracy while reducing area by 18.27% and power by 42.61%./proceedings-archive/2024/DATA/1250_pdf_upload.pdf |
||
EVASION: EFFICIENT KV CACHE COMPRESSION VIA PRODUCT QUANTIZATION Speaker: Zongwu Wang, Shanghai Jiao Tong University, CN Authors: Zongwu Wang1, Fangxin Liu1, Peng Xu1, Qingxiao Sun2, Junping Zhao3 and Li Jiang1 1Shanghai Jiao Tong University, CN; 2China University of Petroleum, Beijing, CN; 3Ant Group, CN Abstract Large language models (LLMs) benefit from longer context lengths, but suffer from quadratic complexity in terms of attention mechanisms. KV caching alleviates this issue by storing pre-computed data, but its memory requirements increase linearly with context length, thereby hindering the intelligent development of LLMs. The traditional weight quantization scheme performs poorly in KV quantization for two reasons: (1) KV requires dynamic quantization and de-quantization, which can lead to significant performance degradation; (2) Outliers are widely present in KV, which poses a challenge to low-bitwidth uniform quantization. This work proposes a novel approach called EVASION to achieve low-bitwidth quantization by product quantization. We thoroughly analyze the distribution of the KV cache and demonstrate the limitations of existing quantization schemes. Then a non-uniform quantization algorithm based on product quantization is introduced, which offers efficient compression while maintaining accuracy. Finally, we design a high-performance GPU inference framework for EVASION, utilizing sparse computation and asynchronous quantization for further acceleration. Comprehensive evaluation results demonstrate that EVASION achieves 4-bit quantization with trivial perplexity and accuracy loss, while also delivering a 1.8x end-to-end inference speedup./proceedings-archive/2024/DATA/136_pdf_upload.pdf |
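As a toy numpy/scikit-learn sketch of product quantization applied to cached key vectors (split each vector into sub-vectors, run k-means per sub-space, store only codes and codebooks, de-quantize on demand); the sizes are arbitrary, and the GPU kernels, outlier handling, and asynchronous quantization of the actual system are not modeled:

```python
# Minimal sketch (toy sizes, not the paper's kernels): product quantization of key
# vectors, storing only the centroid indices (codes) plus the per-sub-space codebooks.
import numpy as np
from sklearn.cluster import KMeans

d, m, k = 64, 8, 16                 # head dim, sub-vectors, centroids per sub-space
sub = d // m
keys = np.random.randn(1024, d).astype(np.float32)   # cached keys (placeholder)

codebooks, codes = [], []
for j in range(m):
    block = keys[:, j * sub:(j + 1) * sub]
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(block)
    codebooks.append(km.cluster_centers_)
    codes.append(km.labels_.astype(np.uint8))         # 4 bits suffice for k = 16
codes = np.stack(codes, axis=1)                        # (num_keys, m) compact codes

# De-quantize on demand during attention.
recon = np.concatenate([codebooks[j][codes[:, j]] for j in range(m)], axis=1)
print("mean reconstruction error:", float(np.mean((keys - recon) ** 2)))
```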
||
SOFTEX: A LOW POWER AND FLEXIBLE SOFTMAX ACCELERATOR WITH FAST APPROXIMATE EXPONENTIATION Speaker: Andrea Belano, University of Bologna, Bologna, Italy, IT Authors: Andrea Belano1, Yvan Tortorella1, Angelo Garofalo2, Davide Rossi1, Luca Benini3 and Francesco Conti1 1Università di Bologna, IT; 2University of Bologna, ETH Zurich, IT; 3ETH Zurich, CH | Università di Bologna, IT Abstract Transformer-based models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. Despite Transformers being computationally dominated by matrix multiplications (MatMul), a non-negligible portion of their runtime is also spent on executing the softmax operator. The softmax is a non-linear and non-pointwise operator that can become a performance bottleneck especially if dedicated hardware is used to decrease the runtime of MatMul operators. We introduce SoftEx, a parametric accelerator for the softmax function of BF16 vectors. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (121× speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). We integrate our design in a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM and 8 general-purpose RISC-V cores as well as a 24×8 systolic array MatMul accelerator. In 12nm technology, SoftEx occupies 0.033 mm², only 2.75% of the cluster, and achieves an operating frequency of 1.12 GHz. Computing the attention probabilities with SoftEx requires up to 10.8× less time and 26.8× less energy compared to a highly optimized software implementation running on the 8 cores, boosting the overall throughput on MobileBERT's attention layer by up to 2.17×, achieving a performance of 324 GOPS at 0.80V or 1.30 TOPS/W at 0.55V at full BF16 accuracy./proceedings-archive/2024/DATA/413_pdf_upload.pdf |
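SoftEx's exponentiation algorithm is not given in the abstract; purely to illustrate the kind of cheap approximate exponentiation that softmax accelerators build on, the sketch below uses the classic Schraudolph-style bit trick in numpy (write a scaled exponent directly into the float32 bit pattern). This is a well-known software approximation with relative error within roughly 6%, not SoftEx's BF16 hardware datapath.

```python
# Classic Schraudolph-style approximate exp in numpy (illustrative only, not SoftEx).
import numpy as np

def fast_exp(x):
    """Approximate e^x: scale to a base-2 exponent and write it straight into the
    float32 exponent/mantissa bits; relative error is up to about 6%."""
    y = x * 1.4426950408889634                         # x / ln(2)
    i = np.round((y + 127.0) * (1 << 23)).astype(np.int64)
    i = np.clip(i, 0, (1 << 31) - 1).astype(np.uint32)
    return i.view(np.float32)

def softmax(x):
    e = fast_exp(x - x.max())                          # subtract max for stability
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0, 0.0], dtype=np.float32)
exact = np.exp(x - x.max()); exact /= exact.sum()
print(softmax(x), exact)
```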
TS30 Session 29 - T3+E5
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
ERASER: EFFICIENT RTL FAULT SIMULATION FRAMEWORK WITH TRIMMED EXECUTION REDUNDANCY Speaker: Jiaping Tang, State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN Authors: Jiaping Tang1, Jianan Mu1, Silin Liu1, Zizhen Liu1, Feng Gu2, Xinyu Zhang1, Leyan Wang1, Shengwen Liang2, Jing Ye1, Huawei Li1 and Xiaowei Li3 1State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/ University of Chinese Academy of Sciences/ CASTEST, China, CN; 2State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/ University of Chinese Academy of Sciences, CN; 3State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/ University of Chinese Academy of Sciences, China, CN Abstract As intelligent computing devices increasingly integrate into human life, ensuring the functional safety of the corresponding electronic chips becomes more critical. A key metric for functional safety is achieving a sufficient fault coverage. To meet this requirement, extensive time-consuming fault simulation of the RTL code is necessary during the chip design phase. The main overhead in RTL fault simulation comes from simulating behavioral nodes (always blocks). Due to the limited fault propagation capacity, fault simulation results often match the good simulation results for many behavioral nodes. A key strategy for accelerating RTL fault simulation is the identification and elimination of redundant simulations. Existing methods detect redundant executions by examining whether the fault inputs to each RTL node are consistent with the good inputs. However, we observe that this input comparison mechanism overlooks a significant amount of implicit redundant execution: although the fault inputs differ from the good inputs, the node's execution results remain unchanged. Our experiments reveal that this overlooked redundant execution constitutes nearly half of the total execution overhead of behavioral nodes, becoming a significant bottleneck in current RTL fault simulation. The underlying reason for this overlooked redundancy is that, in these cases, the true execution paths within the behavioral nodes are not affected by the changes in input values. In this work, we propose a behavior-level redundancy detection algorithm that focuses on the true execution paths. Building on the elimination of redundant executions, we further developed an efficient RTL fault simulation framework, Eraser. Experimental results show that compared to commercial tools, under the same fault coverage, our framework achieves a 3.9 × improvement in simulation performance on average./proceedings-archive/2024/DATA/70_pdf_upload.pdf |
||
PESEC -- A SIMPLE POWER-EFFICIENT SINGLE ERROR CORRECTING CODING SCHEME FOR RRAM Speaker: Shlomo Engelberg, Jerusalem College of Technology, IL Authors: Shlomo Engelberg1 and Osnat Keren2 1Jerusalem College of Technology, IL; 2Bar-Ilan University, IL Abstract The power consumed when writing to Resistive Random Access Memory (RRAM) is significantly greater than that consumed by many charge-based memories such as SRAM, DRAM and NAND-Flash memories. As a result, when used in applications where instantaneous power consumption is constrained, the number of bits that can be set or reset must not exceed a certain threshold. In this paper, we present a power-efficient, single error correcting (PESEC) code for memory macros, which, when combined with bus encoding, ensures low-power operation and reliable data storage. This systematic, multiple-representation based single-error correcting code provides a relatively high rate, with a marginal increase in implementation cost relative to that of a standard Hamming code, and it can be used with any bus encoder./proceedings-archive/2024/DATA/978_pdf_upload.pdf |
||
FROM GATES TO SDCS: UNDERSTANDING FAULT PROPAGATION THROUGH THE COMPUTE STACK Speaker: Odysseas Chatzopoulos, University of Athens, GR Authors: Odysseas Chatzopoulos1, George Papadimitriou1, Dimitris Gizopoulos1, Harish Dixit2 and Sriram Sankar2 1University of Athens, GR; 2Meta Platforms Inc., US Abstract Silent Data Corruption (SDC) is the most severe effect of a silicon defect in a CPU or other computing chip. The arithmetic units of a CPU are, usually, unprotected and are, thus, the ones that most likely produce SDCs (as well as visible malfunctions of programs such as crashes). In this work, we shed light on the traversal of silicon defects from their point of origin deep inside arithmetic units of complex CPUs towards the program result. We employ microarchitecture-level fault injection enhanced with gate-level designs of the arithmetic units of interest. The hybrid setup combines (i) the accuracy of the hardware and fault modeling and (ii) the speed of program simulation to run long programs to end (thus observing SDC incidents); the analysis that this combination delivers is impossible at other abstraction layers which are either hardware-agnostic (software level) or extremely slow (gate-level). We quantify the effects of faults in two stages and with multiple metrics: (a) how faults propagate to the outputs of the arithmetic units when individual instructions are executed, and (b) how faults eventually affect the outcome of the program generating SDCs, crashes, or being masked. Our fine-grain findings can be utilized for informed fault detection and tolerance strategies at the hardware or the software levels./proceedings-archive/2024/DATA/1153_pdf_upload.pdf |
||
RAPID FAULT INJECTION SIMULATION BY HASH-BASED DIFFERENTIAL FAULT EFFECT EQUIVALENCE CHECKS Speaker: Johannes Geier, TU Munich, DE Authors: Johannes Geier1, Leonidas Kontopoulos1, Daniel Mueller-Gritschneder2 and Ulf Schlichtmann1 1TU Munich, DE; 2TU Wien, AT Abstract Assessing a computational system's resilience to hardware faults is essential for safety and security-related systems. Fault Injection (FI) simulation is a valuable tool that can increase confidence in computational systems and guide hardware and software design decisions in the early stages of development. However, simulating hardware at low levels of abstraction, such as Register Transfer Level (RTL), is costly, and minimizing the effort required for large-scale FI campaigns is a significant objective. This work introduces Hash-based Differential Fault Effect Equivalence Checks to automatically terminate experiments early based on predicting their outcome. We achieve this by matching observed fault effects to ones already encountered in previous experiments. We generate these hashes from differentials computed by repurposing existing fast boot checkpoints from a state-of-the-art acceleration method. By integrating these approaches in an automated manner, we can accelerate a large-scale FI simulation of a CPU at RTL. We reduce the average simulation time by a factor of up to 25 compared to a factor of around 2 to 5 for state-of-the-art techniques. While maintaining 100% accuracy, we can recover the faulty state through the stored differentials./proceedings-archive/2024/DATA/1172_pdf_upload.pdf |
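As a minimal, hedged sketch of the hash-and-reuse idea, with a made-up architectural-state format and none of the paper's checkpointing machinery: hash the differential between the faulty and fault-free state, and if that hash was already produced by an earlier experiment, reuse its outcome and terminate early.

```python
# Minimal sketch (hypothetical state format, not the paper's framework): hash the
# differential between a faulty state and the fault-free reference; a repeated hash
# lets the campaign reuse the earlier outcome and terminate the experiment early.
import hashlib

def state_diff_hash(golden: dict, faulty: dict) -> str:
    diff = sorted((k, faulty[k]) for k in golden if faulty[k] != golden[k])
    return hashlib.sha256(repr(diff).encode()).hexdigest()

seen = {}                                  # hash -> outcome from earlier experiments

def classify(golden, faulty, run_to_completion):
    h = state_diff_hash(golden, faulty)
    if h in seen:
        return seen[h], True               # early termination, outcome reused
    outcome = run_to_completion()          # fall back to full simulation
    seen[h] = outcome
    return outcome, False

golden = {"r1": 0, "r2": 5, "pc": 100}
print(classify(golden, {"r1": 1, "r2": 5, "pc": 100}, lambda: "SDC"))
print(classify(golden, {"r1": 1, "r2": 5, "pc": 100}, lambda: "SDC"))  # reuses hash
```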
||
DEAR: DEPENDABLE 3D ARCHITECTURE FOR ROBUST DNN TRAINING Speaker: Ashish Reddy Bommana, Arizona State University, US Authors: Ashish Reddy Bommana1, Farshad Firouzi2, Chukwufumnanya Ogbogu3, Biresh Kumar Joardar4, Janardhan Rao Doppa3, Partha Pratim Pande3 and Krishnendu Chakrabarty1 1Arizona State University, US; 2ASU, US; 3Washington State University, US; 4University of Houston, US Abstract ReRAM-based compute-in-memory (CiM) architectures present an attractive design choice for accelerating deep neural network (DNN) training. However, these architectures are susceptible to stuck-at faults (SAFs) in ReRAM cells, which arise from manufacturing defects and cell wearout over time, particularly due to the continuous weight updates during DNN training. These faults significantly degrade accuracy and compromise dependability. To address this issue, we propose DEAR: dependable 3D architecture for robust DNN training. DEAR introduces a novel online compensation method that employs a digital compensation unit to correct SAF-induced errors dynamically during both forward and backward propagation. This approach mitigates errors induced by SAFs during both the forward and backward phases of DNN training. Additionally, DEAR leverages an HBM-based 3D memory structure to store fault-related error information efficiently. Experimental results show that DEAR limits inferencing accuracy loss to under 2% even when up to 10% of cells are faulty with uniformly distributed faults, and under 2% for up to 5% faulty cells in clustered distributions. This high fault tolerance is achieved with an area overhead of 11.5% and energy overhead of less than 6% for VGG networks and less than 12% for ResNet networks./proceedings-archive/2024/DATA/1232_pdf_upload.pdf |
||
IMPROVING SOFTWARE RELIABILITY WITH RUST: IMPLEMENTATION FOR ENHANCED CONTROL FLOW CHECKING METHODS Speaker: Jacopo Sini, Politecnico di Torino, IT Authors: Jacopo Sini1, Mohammadreza Amel Solouki1, Massimo Violante1 and Giorgio Di Natale2 1Politecnico di Torino, IT; 2TIMA - CNRS, FR Abstract The C language, traditionally used in developing safety-critical systems, often faces memory management issues, leading to potential vulnerabilities. Rust emerges as a safer and more secure alternative, aiming to mitigate these risks with its robust memory protection features, making it suitable for producing reliable code in critical environments, such as the automotive industry. This study proposes employing Rust code hardened by Control Flow Checking (CFC) in real-time embedded systems, whose software is traditionally developed in Assembly and C. The methods have been implemented at the application level, i.e., in the Rust source code, to make them platform-agnostic. A methodology is presented for leveraging Rust's advantages, such as stronger security guarantees and modern features, to implement these methods more effectively. Highlighting a use case in the automotive sector, our research demonstrates Rust's capacity to enhance system reliability through CFC, especially against Random Hardware Faults. Two CFC algorithms from the literature, YACCA and RACFED, have been implemented in the Rust language to assess their effectiveness, obtaining 46.5% Diagnostic Coverage for the YACCA method and 50.1% for RACFED. The proposed approach is aligned with functional safety standards, showcasing how Rust can balance safety requirements and cost considerations in industries reliant on software solutions for critical functionalities./proceedings-archive/2024/DATA/447_pdf_upload.pdf |
||
BRIDGING THE GAP BETWEEN ANOMALY DETECTION AND RUNTIME VERIFICATION: H-CLASSIFIERS Speaker: Hagen Heermann, RPTU Kaiserslautern, DE Authors: Hagen Heermann and Christoph Grimm, University of Kaiserslautern-Landau, DE Abstract Runtime Verification (RV) and Anomaly Detection (AD) are crucial for ensuring the reliability of cyber-physical systems, but existing methods often suffer from high computational costs and lack of explainability. This paper presents a novel approach that integrates formal methods into anomaly detection, transforming complex system models into efficient classification tasks. By combining the strengths of RV and AD, our method significantly improves detection efficiency while providing explainability for failure causes. Our approach offers a promising solution for enhancing the safety and reliability of critical systems./proceedings-archive/2024/DATA/450_pdf_upload.pdf |
||
CRITICALITY AND REQUIREMENT AWARE HETEROGENEOUS COHERENCE FOR MIXED CRITICALITY SYSTEMS Speaker: Mohamed Hassan, McMaster University, CA Authors: Safin Bayes and Mohamed Hassan, McMaster University, CA Abstract We propose CoHoRT, the first heterogeneous cache-coherent solution for mixed criticality systems (MCS), equipped with several features that target the characteristics and requirements of such systems. CoHoRT is requirement-aware. It provides an optimization engine to optimally configure the architecture based on system requirements. CoHoRT is also criticality-aware. It introduces a low-cost novel architecture to enable cores to heterogeneously run different coherence protocols (time-based and MSI-based protocols). Moreover, it enables a run-time switch between these protocols to provide hardware support for operation mode switches, a common challenge in MCS. Our evaluation shows that CoHoRT outperforms existing solutions in both worst-case memory latency and overall average performance. It also illustrates that CoHoRT is able to meet timing requirements in various MCS setups and showcases CoHoRT's ability to adapt to mode switches./proceedings-archive/2024/DATA/896_pdf_upload.pdf |
||
PROTECTING CYBER-PHYSICAL SYSTEMS VIA VENDOR-CONSTRAINED SECURITY AUDITING WITH REINFORCEMENT LEARNING Speaker: Nan Wang, East China University of Science and Technology, CN Authors: Nan Wang1, Kai Li1, Lijun Lu1, Zhiwei Zhao1 and Zhiyuan Ma2 1School of Information Science and Engineering, East China University of Science and Technology, CN; 2Institute of Machine Intelligence, University of Shanghai for Science and Technology, CN Abstract Hardware Trojans may cause security issues in cyber-physical systems (CPSs), and recently proposed mutual auditing frameworks have helped build trustworthy CPSs with untrustworthy devices by requiring neighboring devices to come from different vendors. However, this may cause severe multi-vendor integration challenges: purchasing devices from many vendors is expensive and hard to maintain, and a sufficient number of vendors may not even be available. In this work, we improve the mutual auditing framework by maintaining the security of the CPSs with fewer vendors. First, the vendor-constrained security auditing framework is introduced to enhance the security of the CPS network with limited vendors, where side-auditing detects hardware Trojan collusion between neighboring nodes and infected node isolation stops the spread of active HTs. Second, a multi-agent cooperative reinforcement learning-based method is proposed to assign proper vendors to devices in the context of security auditing, providing solutions that minimize the number of nodes taken offline due to HT infection. The experimental results show that our proposed method reduces the number of vendors needed by 40.95%, while increasing the fraction of infected nodes by only 0.39%./proceedings-archive/2024/DATA/914_pdf_upload.pdf |
||
ADAPTIVE BRANCH-AND-BOUND TREE EXPLORATION FOR NEURAL NETWORK VERIFICATION Speaker: Kota Fukuda, Kyushu University, JP Authors: Kota Fukuda1, Guanqin Zhang2, Zhenya Zhang1, Yulei Sui2 and Jianjun Zhao1 1Kyushu University, JP; 2University of New South Wales, AU Abstract Formal verification is a rigorous approach that can provably ensure the quality of neural networks, and to date, Branch and Bound (BaB) is the state-of-the-art that performs verification by splitting the problem as needed and applying off-the-shelf verifiers to sub-problems for improved performance. However, existing BaB may not be efficient, due to its naive way of exploring the space of sub-problems, which ignores the importance of different sub-problems. To bridge this gap, we first introduce a notion of importance that reflects how likely a counterexample is to be found within a sub-problem, and then we devise a novel verification approach, called ABONN, that explores the sub-problem space of BaB adaptively, in a Monte-Carlo tree search (MCTS) style. The exploration is guided by the importance of different sub-problems, so it favors the sub-problems that are more likely to yield counterexamples. As soon as it finds a counterexample, it can immediately terminate; even if no counterexample is found after visiting all the sub-problems, it can still verify the problem. We evaluate ABONN with 552 verification problems from commonly-used datasets and neural network models, and compare it with the state-of-the-art verifiers as baseline approaches. Experimental evaluation shows that ABONN demonstrates speedups of up to 15.2x on MNIST and 24.7x on CIFAR-10. We further study the influence of hyperparameters on the performance of ABONN and the effectiveness of our adaptive tree exploration./proceedings-archive/2024/DATA/1524_pdf_upload.pdf |
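As a one-dimensional toy of importance-guided sub-problem exploration (not ABONN and not MCTS proper): sub-intervals are kept in a priority queue ordered by a sound lower bound of the property, so the most counterexample-prone region is refined first and the search stops at the first concrete violation. The property and the bound function are contrived for illustration.

```python
# Minimal 1-D sketch (contrived property, not ABONN): branch-and-bound exploration
# ordered by an importance score, here the sound lower bound of f on each interval.
import heapq

def f(x):                            # stand-in for the network output on input x
    return (x - 0.7) ** 2 - 0.0004   # property to check: f(x) >= 0 on [0, 1]

def lower_bound(lo, hi):             # sound lower bound of this particular f
    return -0.0004 if lo <= 0.7 <= hi else min(f(lo), f(hi))

heap = [(lower_bound(0.0, 1.0), 0.0, 1.0)]   # more negative bound = explored first
found = None
while heap:
    bound, lo, hi = heapq.heappop(heap)
    if bound >= 0:
        continue                     # this sub-problem is verified, skip it
    mid = (lo + hi) / 2
    if f(mid) < 0:
        found = mid; break           # concrete counterexample: terminate immediately
    for a, b in ((lo, mid), (mid, hi)):
        heapq.heappush(heap, (lower_bound(a, b), a, b))

if found is not None:
    print("counterexample found at x =", found)
else:
    print("property f(x) >= 0 verified on [0, 1]")
```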
||
TOWARDS COHERENT SEMANTICS: A QUANTITATIVELY TYPED EDSL FOR SYNCHRONOUS SYSTEM DESIGN Speaker: Rui Chen, KTH Royal Institute of Technology, SE Authors: Rui Chen and Ingo Sander, KTH Royal Institute of Technology, SE Abstract We present SynQ, an embedded DSL (EDSL) targeting synchronous system design with quantitative types. SynQ is designed to facilitate semantically coherent system design processes by language embedding and advanced type systems. The current case study indicates the potential for a seamless system design process./proceedings-archive/2024/DATA/522_pdf_upload.pdf |
||
CO-DESIGN OF SUSTAINABLE EMBEDDED SYSTEMS-ON-CHIP Speaker: Dominik Walter, FAU, DE Authors: Jan Spieck, Dominik Walter, Jan Waschkeit and Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Abstract This paper introduces a novel approach to the co-design of sustainable embedded systems through multi-objective design space exploration (DSE). We propose a two-phase methodology that optimizes both the multiprocessor system-on-chip (MPSoC) architecture and application mappings, considering sustainability, reliability, performance, and cost as optimization objectives. Our method thereby accounts for both operational and embodied emissions, providing a more comprehensive assessment of sustainability. First, an individual intra-application DSE is performed to explore Pareto-optimal constraint graphs for each application. The second phase, an inter-application DSE, combines these results to explore sustainable target architectures and corresponding application mappings. Our approach incorporates detailed models for embodied emissions (scope 1 and scope 2), operational emissions, reliability, performance, and cost. The evaluation demonstrates that our sustainability-aware DSE is able to explore design spaces, supported by superior results in four key objectives. This enables the development of sustainable embedded systems whilst achieving high performance and reliability./proceedings-archive/2024/DATA/539_pdf_upload.pdf |
TS31 Session 22 - D14+D15
Add this session to my calendar
Date: Wednesday, 02 April 2025
Time: 16:30 CET - 18:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
GENETIC ALGORITHM-DRIVEN IMC MAPPING FOR CNNS USING MIXED QUANTIZATION AND MLC FEFETS Speaker: Alptekin Vardar, Fraunhofer IPMS, DE Authors: Alptekin Vardar, Franz Müller, Gonzalo Cuñarro Podestá, Nellie Laleni, Nandakishor Yadav and Thomas Kämpfe, Fraunhofer IPMS, DE Abstract Ferroelectric Field-Effect Transistors (FeFETs) are emerging as a highly promising non-volatile memory (NVM) technology for in-memory computing architectures, thanks to their low power consumption and non-volatility. These characteristics make FeFETs particularly well-suited for convolutional neural networks (CNNs), especially in power-constrained environments where minimizing the memory footprint is critical for improving both area efficiency and energy consumption. Two effective strategies for reducing memory requirements are quantization and the use of multi-level cell (MLC) configurations in NVMs. This work proposes a solution that combines mixed quantization schemes with FeFET-based MLC and single-level cell (SLC) configurations to balance memory usage and accuracy. Given the large hyperparameter space introduced by these combinations, we employ a genetic algorithm to efficiently explore and identify Pareto-optimal solutions, allowing flexible adaptation to various application-specific requirements. Our approach achieves significant improvements in both memory efficiency and performance, reducing memory usage by 50% while sacrificing only 3% accuracy compared to the 8-bit ResNet baseline. After a single epoch of retraining, the accuracy matches the baseline while fully retaining the memory savings. Additionally, when compared to the 4-bit baseline, a 46% memory reduction is achieved with virtually no loss in accuracy./proceedings-archive/2024/DATA/1305_pdf_upload.pdf |
||
OPENMFDA: MICROFLUIDIC DESIGN AUTOMATION IN THREE DIMENSIONS Speaker: Ashton Snelgrove, University of Utah, US Authors: Ashton Snelgrove1, Daniel Wakeham1, Skylar Stockham1, Scott Temple2 and Pierre-Emmanuel Gaillardon1 1University of Utah, US; 2Primis AI, US Abstract Current microfluidic design automation (MFDA) solutions are limited by the planarity requirements of current manufacturing techniques. Recent advances in stereolithography 3D printing create an opportunity for new MFDA design methodologies. We propose a methodology for the placement of microfluidic components and the routing of flow and control channels in three dimensions. Additionally, we propose a methodology for generating a printable 3D structure from the layout. We then present OpenMFDA, an open-source MFDA design flow implementing the proposed methodologies. This design flow takes a structural netlist and produces a sliced design for manufacturing using an SLA 3D printer. Our methodology demonstrates short run times and generates devices with 2–20× smaller area compared to state-of-the-art MFDA tools./proceedings-archive/2024/DATA/1390_pdf_upload.pdf |
||
CLAIRE: COMPOSABLE CHIPLET LIBRARIES FOR AI INFERENCE Speaker: Pragnya Nalla, University of Minnesota Twin Cities, US Authors: Pragnya Nalla1, Emad Haque2, Yaotian Liu2, Sachin S. Sapatnekar1, Jeff Zhang2, Chaitali Chakrabarti2 and Yu Cao1 1University of Minnesota, US; 2Arizona State University, US Abstract Artificial intelligence has made a significant impact on fields like computer vision, Natural Language Processing (NLP), healthcare, and robotics. However, recent AI models, such as GPT-4 and LLaMAv3, demand a significant amount of computational resources, pushing monolithic chips to their technological and practical limits. 2.5D chiplet-based heterogeneous architectures have been proposed to address these technological and practical limits. While chiplet optimization for models like Convolutional Neural Networks (CNNs) is well-established, scaling this approach to accommodate diverse AI inference models with different computing primitives, data volumes, and different chiplet sizes is very challenging. A set of hardened IPs and chiplet libraries optimized for a broad range of AI applications is proposed in this work. We derive the set of chiplet configurations that are composable, scalable and reusable by employing an analytical framework trained on a diverse set of AI algorithms. Testing this set of library-synthesized configurations on a different set of algorithms, we achieve a 1.99×–3.99× improvement in non-recurring engineering (NRE) chiplet design costs, with minimal performance overhead compared to custom chiplet-based ASIC designs. Similar to soft IPs for SoC development, the library of chiplets improves flexibility, reusability, and efficiency for AI hardware designs./proceedings-archive/2024/DATA/1405_pdf_upload.pdf |
||
A TALE OF TWO SIDES OF WAFER: PHYSICAL IMPLEMENTATION AND BLOCK-LEVEL PPA ON FLIP FET WITH DUAL-SIDED SIGNALS Speaker: Haoran Lu, Peking University, CN Authors: Haoran Lu, Xun Jiang, Yanbang Chu, Ziqiao Xu, Rui Guo, Wanyue Peng, Yibo Lin, Runsheng Wang, Heng Wu and Ru Huang, Peking University, CN Abstract As the conventional scaling of logic devices comes to an end, functional wafer backside and 3D transistor stacking are the consensus for next-generation logic technology, offering considerable design space extension for powers, signals or even devices on the wafer backside. The Flip FET (FFET), a novel transistor architecture combining 3D transistor stacking and fully functional wafer backside, was recently proposed. With symmetric dual-sided standard cell design, the FFET can deliver around 12.5% cell area scaling and faster but more energy-efficient libraries beyond other stacked transistor technologies such as Complementary FET (CFET). Besides, thanks to the novel cell design with dual-sided pins, the FFET supports dual-sided signal routing, delivering better routability and larger backside design space. In this work, we demonstrated a comprehensive FFET evaluation framework considering physical implementation and block-level power-performance-area (PPA) assessment for the first time, in which key functions are dual-sided routing and dual-sided RC extraction. A 32-bit RISC-V core was used for the evaluation here. Compared to the CFET with single-sided signals, the FFET with single-sided signals (for fair comparison) achieved 23.3% post-P&R core area reduction, 25.0% higher frequency and 11.9% lower power at the same utilization, and 16.0% higher frequency at the same core area. Meanwhile, the FFET supports dual-sided signals, which can benefit further from flexible allocation of cell input pins on both sides. By optimizing the input pin density and BEOL routing layer number on each side, 10.6% frequency gain was realized without power degradation compared to the one with single-sided signal routing. Moreover, the routability and power efficiency of FFET barely degrade even with the routing layer number reduced from 12 to 5 on each side, validating the great space for cost-friendly design enabled by FFET./proceedings-archive/2024/DATA/1464_pdf_upload.pdf |
||
COLUMN-WISE QUANTIZATION OF WEIGHTS AND PARTIAL SUMS FOR ACCURATE AND EFFICIENT COMPUTE-IN-MEMORY ACCELERATORS Speaker: Kang Eun Jeon, Sungkyunkwan University, KR Authors: Jiyoon Kim, Kang Eun Jeon, Yulhwa Kim and Jong Hwan Ko, Sungkyunkwan University, KR Abstract Compute-in-memory (CIM) is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial overhead from analog-to-digital converters (ADCs), especially as ADC precision increases. Low-precision ADCs can reduce this overhead but introduce partial-sum quantization errors degrading accuracy. Additionally, low-bit weight constraints, imposed by cell limitations and the need for multiple cells for higher-bit weights, present further challenges. While fine-grained partial-sum quantization has been studied to lower ADC resolution effectively, weight granularity, which limits overall partial-sum quantized accuracy, remains underexplored. This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level. Our method improves accuracy while maintaining dequantization overhead, simplifies training by removing two-stage processes, and ensures robustness to memory cell variations via independent column-wise scale factors. We also propose an open-source CIM-oriented convolution framework to handle fine-grained weights and partial-sums efficiently, incorporating a novel tiling method and group convolution. Experimental results on ResNet-20 (CIFAR-10, CIFAR-100) and ResNet-18 (ImageNet) show accuracy improvements of 0.99%, 2.69%, and 1.01%, respectively, compared to the best-performing related works. Additionally, variation analysis reveals the robustness of our method against memory cell variations. These findings highlight the effectiveness of our quantization scheme in enhancing accuracy and robustness while maintaining hardware efficiency in CIM-based DNN implementations. Our code is available at https://github.com/jiyoonkm/ColumnQuant./proceedings-archive/2024/DATA/1098_pdf_upload.pdf |
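As a numpy sketch of the column-wise granularity described above, restricted to the weight side: each output column gets its own symmetric scale, and the same per-column scale is reused when de-quantizing that column's partial sums. The ADC-side partial-sum quantization, the training flow, and the CIM tiling from the paper are not modeled, and the sizes and bit-width are placeholders.

```python
# Minimal sketch (toy sizes, weight side only; not the paper's training flow):
# column-wise symmetric quantization with one scale factor per output column,
# reused to de-quantize that column's partial sums.
import numpy as np

def quantize_columns(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0) / qmax          # one scale factor per column
    wq = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return wq, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 16)).astype(np.float32)  # in_features x out_features
x = rng.normal(size=(1, 128)).astype(np.float32)

wq, scale = quantize_columns(w)
psum = x @ wq.astype(np.float32)                  # integer-domain partial sums
y_hat = psum * scale                              # de-quantize column-wise
print("max error vs float matmul:", float(np.abs(y_hat - x @ w).max()))
```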
||
DHD: DOUBLE HARD DECISION DECODING SCHEME FOR NAND FLASH MEMORY Speaker: Lanlan Cui, Xian University of Technology, CN Authors: Lanlan Cui1, Yichuan Wang1, Renzhi Xiao2, Miao Li3, Xiaoxue Liu1 and Xinhong Hei1 1Xi'an University of Technology, CN; 2Jiangxi University of Science and Technology, CN; 3National University of Defense Technology, CN Abstract With the advancement of NAND flash technology, the increased storage density leads to intensified interference, which in turn raises the error rate during data retrieval. To ensure data reliability, low-density parity-check (LDPC) codes are extensively employed for error correction in NAND flash memory. Although LDPC soft decision decoding offers high error correction capability, it comes with a significant latency. Conversely, hard-decision decoding, although faster, lacks sufficient error correction strength. Consequently, flash memory typically initiates with hard-decision decoding and resorts to multiple soft decision decoding upon failure. To minimize decoding latency, this paper proposes a decoding mechanism based on the double hard decision, called DHD. This DHD scheme improves the Log-Likelihood Ratio (LLR) in the hard decision process. After the first hard decision fails, the read reference voltage (RRV) is adjusted to perform the second hard decision decoding. If the second hard decision also fails, soft decision decoding is then employed. Experimental results demonstrate that when the Raw Bit Error Rate (RBER) is 8.5E-3, DHD reduces the Frame Error Rate (FER) by 86.4% compared to the traditional method./proceedings-archive/2024/DATA/1145_pdf_upload.pdf |
||
WRITE-OPTIMIZED PERSISTENT HASH INDEX FOR NON-VOLATILE MEMORY Speaker: Renzhi Xiao, Jiangxi University of Science and Technology, CN Authors: Renzhi Xiao1, Dan Feng2, Yuchong Hu2, Yucheng Zhang2, Lanlan Cui3 and Lin Wang2 1Jiangxi University of Science and Technology, CN; 2Huazhong University of Science & Technology, CN; 3Xi'an University of Technology, CN Abstract A hashing index provides rapid search performance by swiftly locating key-value items. Non-volatile memory (NVM) technologies have driven research into hashing indexes for NVM, combining hard disk persistence with DRAM-level performance. Nevertheless, current NVM-based hashing indexes must tackle data inconsistency challenges caused by NVM write reordering or partial writes, and mitigate rapid local wear due to frequent updates, considering NVM's limited endurance. The temporary allocation of buckets in NVM-based chained hashing to resolve hash collisions prolongs the critical path for writing, thus hampering write performance. This paper presents WOPHI, a write-optimized persistent hash index scheme for NVM. By utilizing log-free failure-atomic writes, WOPHI minimizes data consistency overhead and addresses hash conflicts with bucket pre-allocation. Experimental results underscore WOPHI's significant performance enhancements, with insertion latency slashed by up to 88.2% and deletion latency reduced by up to 82.6% compared to existing state-of-the-art schemes. Moreover, WOPHI substantially mitigates data consistency overhead, reducing cache line flushes by 59.3%, while maintaining robust write throughput for insert and delete operations./proceedings-archive/2024/DATA/1157_pdf_upload.pdf |
DEAR-PIM: PROCESSING-IN-MEMORY ARCHITECTURE WITH DISAGGREGATED EXECUTION OF ALL-BANK REQUESTS Speaker: Jungi Hyun, Seoul National University, KR Authors: Jungi Hyun, Minseok Seo, Seongho Jeong, Hyuk-Jae Lee and Xuan Truong Nguyen, Seoul National University, KR Abstract Emerging transformer-based large language models (LLMs) involve many low-arithmetic-intensity operations, which result in sub-optimal performance on general-purpose CPUs and GPUs. Processing-in-Memory (PIM) has shown promise in enhancing performance by reducing data-movement bottlenecks. Commodity near-bank PIMs enable in-memory computation through bank-level compute units and typically rely on all-bank commands, which operate the compute units of all banks simultaneously to maximize internal bandwidth and parallelism. However, activating all banks at once before issuing all-bank commands requires high peak power, which may exceed the system power limit when multiple PIM devices are stacked for LLM inference. Under a DRAM power constraint, all-bank commands can instead only be issued after all banks have been activated through a sequence of single-bank activations, incurring bubble cycles and degrading overall performance. To address these shortcomings, this study proposes DEAR-PIM, a novel PIM architecture with Disaggregated Execution of All-bank Requests. DEAR-PIM incorporates a disaggregated command queue that buffers all-bank commands and supplies them to each bank individually, without waiting for all banks to be activated. However, since all banks must finish their disaggregated execution before simultaneous post-processing, synchronization between early-activated and last-activated banks is necessary. To tackle this issue, DEAR-PIM introduces a column-aware synchronization command scheme that inserts no-op-like commands into unused columns without modifying the memory controller. Experiments demonstrate that DEAR-PIM achieves a speedup of 2.03-3.33× over an A100 GPU and improves performance by 1.11-1.52× compared to the sequential activation scheme. DEAR-PIM also reduces peak power consumption by 21.3-41.7% compared to the simultaneous activation scheme./proceedings-archive/2024/DATA/1209_pdf_upload.pdf |
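A functional sketch of the disaggregated-queue concept described above, with illustrative names rather than the paper's microarchitecture: a single broadcast command stream is buffered once, each bank replays it as soon as that bank has been activated, and a barrier check gates post-processing until every bank has drained its copy.

```python
# Illustrative model of disaggregated execution of all-bank requests (not the paper's design).
class DisaggregatedQueue:
    def __init__(self, n_banks):
        self.n_banks = n_banks
        self.commands = []                    # buffered all-bank (broadcast) commands
        self.cursor = [0] * n_banks           # per-bank replay position
        self.active = [False] * n_banks

    def push_all_bank(self, cmd):
        self.commands.append(cmd)

    def activate(self, bank):
        self.active[bank] = True              # single-bank ACT under the power limit

    def step(self, bank, execute):
        """Issue the next buffered command to one bank, if it is activated and behind."""
        if self.active[bank] and self.cursor[bank] < len(self.commands):
            execute(bank, self.commands[self.cursor[bank]])
            self.cursor[bank] += 1
            return True
        return False                          # bank idles: not yet activated, or drained

    def synchronized(self):
        """All banks must finish their replay before simultaneous post-processing."""
        return all(c == len(self.commands) for c in self.cursor)

# toy usage: banks are activated one at a time, yet early banks start computing immediately
q = DisaggregatedQueue(n_banks=4)
for col in range(3):
    q.push_all_bank(("MAC", col))
log = []
for cycle in range(16):
    if cycle % 2 == 0 and cycle // 2 < q.n_banks:
        q.activate(cycle // 2)
    for b in range(q.n_banks):
        q.step(b, lambda bank, cmd: log.append((cycle, bank, cmd)))
print(q.synchronized(), len(log))             # True 12
```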
SYNDCIM: A PERFORMANCE-AWARE DIGITAL COMPUTING-IN-MEMORY COMPILER WITH MULTI-SPEC-ORIENTED SUBCIRCUIT SYNTHESIS Speaker: Kunming Shao, Hong Kong University of Science and Technology, HK Authors: Kunming Shao1, Fengshi Tian1, Xiaomeng Wang1, Jiakun Zheng1, Jia Chen2, Jingyu He1, Hui Wu3, Jinbo Chen3, Xihao Guan1, Yi Deng2, Fengbin Tu1, Jie Yang3, Mohamad Sawan3, Tim Cheng1 and Chi Ying Tsui1 1Hong Kong University of Science and Technology, HK; 2AI Chip Center for Emerging Smart Systems (ACCESS), Hong Kong University of Science and Technology, HK; 3Westlake University, CN Abstract Digital computing-in-memory (DCIM) integrates multiply-accumulate (MAC) logic directly into memory arrays to enhance the performance of modern AI computing. However, the need for customized memory cells and logic components currently demands significant manual effort in DCIM design. Existing tools for DCIM macro design struggle to optimize subcircuit synthesis to meet user-defined performance criteria, limiting the system-level acceleration that DCIM can offer. To address these challenges and enable agile design of DCIM macros with optimal architectures, we present SynDCIM, a performance-aware DCIM compiler that employs multi-spec-oriented subcircuit synthesis. SynDCIM features an automated performance-to-layout generation process that aligns with user-defined performance expectations, supported by a scalable subcircuit library and a multi-spec-oriented search algorithm for effective subcircuit synthesis. The effectiveness of SynDCIM is demonstrated through extensive experiments and validated with a test chip fabricated in a 40 nm CMOS process. Test results show that designs generated by SynDCIM achieve performance competitive with state-of-the-art, manually designed DCIM macros./proceedings-archive/2024/DATA/1218_pdf_upload.pdf |
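To make the "multi-spec-oriented subcircuit synthesis" idea concrete, the toy selection step below searches a pre-characterized subcircuit library for a combination that satisfies user-defined delay and energy specs while minimizing area. The library entries, the additive delay model, and the exhaustive search are illustrative assumptions, not SynDCIM's actual algorithm.

```python
# Illustrative multi-spec subcircuit selection (not SynDCIM's search algorithm).
from itertools import product

# each subcircuit type offers pre-characterized variants: (name, delay_ns, energy_pj, area_um2)
LIBRARY = {
    "adder_tree":  [("ripple", 2.0, 1.0, 80.0), ("sklansky", 0.9, 1.6, 140.0)],
    "accumulator": [("serial", 1.5, 0.6, 50.0), ("parallel", 0.7, 1.1, 95.0)],
    "bitcell_mac": [("nor_push", 0.5, 0.8, 60.0), ("xnor", 0.6, 0.7, 55.0)],
}

def synthesize(spec):
    """Pick one variant per subcircuit so the macro meets the delay and energy targets,
    minimizing total area among feasible combinations."""
    best = None
    for combo in product(*LIBRARY.values()):
        delay = sum(v[1] for v in combo)       # toy model: critical path is additive
        energy = sum(v[2] for v in combo)
        area = sum(v[3] for v in combo)
        if delay <= spec["max_delay_ns"] and energy <= spec["max_energy_pj"]:
            if best is None or area < best[1]:
                best = (tuple(v[0] for v in combo), area)
    return best

# toy usage: returns the minimum-area combination meeting both specs
print(synthesize({"max_delay_ns": 3.5, "max_energy_pj": 3.0}))
```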