Switch on Event Multithreading (SoE MT, also known as coarse-grained MT and block MT) processors run multiple threads on a pipelined machine, where the pipeline switches threads on stall events (e.g., a cache miss). The thread switch penalty is determined by the number of pipeline stages that are flushed of in-flight instructions. In this paper, Continuous Flow Multithreading (CFMT), a new …
To address the Dark Silicon problem, architects have increasingly turned to special-purpose hardware accelerators to improve the performance and energy efficiency of common computational kernels, such as encryption and compression. Unfortunately, the latency and overhead required to off-load a computation to an accelerator sometimes outweigh the potential benefits, resulting in a net decrease …
A novel method is proposed in this paper to protect a system against soft errors occurring in virtual address (VA) storing structures such as translation lookaside buffers (TLBs), the physical register file (PRF), and the program counter (PC). The work is motivated by showing how soft errors impact the structures that store virtual page numbers (VPNs). A solution is proposed b…
Power mismatching between supply and demand has emerged as a top issue in modern datacenters that are under-provisioned or powered by intermittent power supplies. Recent proposals are primarily limited to leveraging uninterruptible power supplies (UPS) to handle power mismatching, and therefore lack the capability of efficiently handling irregular peak power mismatches. In this paper we p…
Web browsing on mobile devices is undoubtedly the future. However, with the increasing complexity of webpages, the mobile device’s computation capability and energy consumption become major obstacles to a satisfactory user experience. In this paper, we propose a mechanism to effectively leverage processor frequency scaling in order to balance the performance and energy consumption of mobile w…
JavaScript is a sequential programming language, and Thread-Level Speculation has been proposed to dynamically extract parallelism in order to take advantage of parallel hardware. In previous work, we have shown significant speed-ups with a simple on/off speculation heuristic. In this paper, we propose and evaluate three heuristics for dynamically adapting the speculation: a 2-bit heuristic, an e…
Bitwise operations are an important component of modern day programming, and are used in a variety of applications such as databases. In this work, we propose a new and simple mechanism to implement bulk bitwise AND and OR operations in DRAM, which is faster and more efficient than existing mechanisms. Our mechanism exploits existing DRAM operation to perform a bitwise AND/OR of two DRAM rows c…
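The flavor of in-DRAM bulk AND/OR can be sketched with the triple-row-activation idea used by such mechanisms: simultaneously activating rows A, B, and a control row C makes each bitline settle to the bitwise majority MAJ(A, B, C), which equals A AND B when C is all zeros and A OR B when C is all ones. A minimal Python sketch (row widths and values are illustrative, not the paper's exact design):

```python
def maj(a: int, b: int, c: int, width: int = 8) -> int:
    # Bitwise majority of three DRAM rows: each result bit is 1
    # iff at least two of the three corresponding input bits are 1.
    mask = (1 << width) - 1
    return ((a & b) | (b & c) | (a & c)) & mask

A, B = 0b1100_1010, 0b1010_0110
assert maj(A, B, 0b0000_0000) == A & B   # control row all zeros -> AND
assert maj(A, B, 0b1111_1111) == A | B   # control row all ones  -> OR
```

The key point the abstract relies on is that the majority is computed by the charge-sharing of the cells themselves, so a whole row (kilobytes) is processed in one activation rather than word by word on the CPU.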
We study the tradeoffs between Many-Core machines like Intel’s Larrabee and Many-Thread machines like Nvidia and AMD GPGPUs. We define a unified model describing a superposition of the two architectures, and use it to identify operation zones for which each machine is more suitable. Moreover, we identify an intermediate zone in which both machines deliver inferior performance. We study the sh…
gem5-gpu is a new simulator that models tightly integrated CPU-GPU systems. It builds on gem5, a modular full-system CPU simulator, and GPGPU-Sim, a detailed GPGPU simulator. gem5-gpu routes most memory accesses through Ruby, which is a highly configurable memory system in gem5. By doing this, it is able to simulate many system configurations, ranging from a system with coherent caches and a si…
Over the past few years, there has been vast growth in the area of the web browser as an applications platform. One example of this trend is Google’s Native Client (NaCl) platform, which is a software-fault isolation mechanism that allows the running of native x86 or ARM code on the browser. One of the security mechanisms employed by NaCl is that all branches must jump to the start of a valid…
Consider a workload comprising a consecutive sequence of program execution segments, where each segment can either be executed on a general-purpose processor or offloaded to a hardware accelerator. An analytical optimization framework, based on the MultiAmdahl framework and Lagrange multipliers, for selecting the optimal set of accelerators and for allocating resources among them under constrained are…
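The optimization the abstract describes can be stated in MultiAmdahl style (the notation here is ours for illustration; the symbols $t_i$, $s_i$, and $A$ are assumptions, not necessarily the paper's):

```latex
% Segment i takes fraction t_i of baseline time; an accelerator of
% area a_i speeds it up by s_i(a_i). Minimize total time under an
% area budget A:
\min_{a_1,\dots,a_n} \; T(\mathbf{a}) = \sum_{i} \frac{t_i}{s_i(a_i)}
\qquad \text{s.t.} \qquad \sum_{i} a_i \le A .
% The Lagrangian L = T + \lambda \bigl(\sum_i a_i - A\bigr) yields the
% stationarity condition
\frac{t_i \, s_i'(a_i)}{s_i(a_i)^2} = \lambda \quad \text{for all } i,
```

that is, at the optimum every selected accelerator delivers the same marginal reduction in execution time per unit of additional area, which is what makes a Lagrange-multiplier formulation natural here.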
Memory access times are the primary bottleneck for many applications today. This “memory wall” is due to the performance disparity between processor cores and main memory. To address the performance gap, we propose the use of custom memory subsystems tailored to the application rather than attempting to optimize the application for a fixed memory subsystem. Custom subsystems can take advant…
With the trend towards an increasing number of cores in multicore processors, the on-chip network that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks in on-chip networks and describe how the flattened butterfly topology can be mapped to on-chip networks. By using high-radix routers to reduce the diameter of the network, the flattened butterfly o…
The Roofline model graphically represents the attainable upper bound performance of a computer architecture. This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache awareness, thus significantly improving the guidelines for application optimization. The proposed model was experim…
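For context, the original Roofline bound that the abstract extends is the standard formulation: attainable performance is the minimum of the machine's peak compute rate and the product of memory bandwidth and operational intensity.

```latex
% P_peak : peak performance [FLOP/s]
% B      : peak memory bandwidth [byte/s]
% I      : operational intensity [FLOP/byte]
P(I) = \min\bigl( P_{\text{peak}}, \; B \times I \bigr)
```

A cache-aware variant, in the spirit of this abstract, replaces the single bandwidth $B$ with the bandwidth of each level of the memory hierarchy, yielding one "roof" per level rather than a single DRAM roof.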
DRAM scaling has been the prime driver of increasing capacity of main memory systems. Unfortunately, lower technology nodes worsen the cell reliability as it increases the coupling between adjacent DRAM cells, thereby exacerbating different failure modes. This paper investigates the reliability problem due to Row Hammering, whereby frequent activations of a given row can cause data loss for its…
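One common mitigation style for Row Hammer is counter-based: track per-row activation counts within a refresh window and issue an extra refresh to the physically adjacent victim rows once a count crosses a threshold. The sketch below is illustrative (the function, threshold, and neighbor model are our assumptions, not necessarily this paper's mechanism):

```python
from collections import defaultdict

def track_activations(activations, threshold):
    """Return the set of victim rows to refresh early.

    Each activation of row r increments a counter; when the counter
    reaches `threshold`, the adjacent rows r-1 and r+1 are scheduled
    for an extra refresh and the counter resets.
    """
    counts = defaultdict(int)
    victims = set()
    for row in activations:
        counts[row] += 1
        if counts[row] >= threshold:
            victims.update({row - 1, row + 1})
            counts[row] = 0
    return victims

# A row activated 5 times with threshold 3 flags its two neighbours.
assert track_activations([7] * 5 + [2], threshold=3) == {6, 8}
```

Real mitigations must also bound the counter storage (e.g., tracking only the most frequently activated rows), which is where much of the design-space tension the abstract alludes to comes from.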
We present a method for accelerating server applications using a hybrid CPU+FPGA architecture and demonstrate its advantages by accelerating Memcached, a distributed key-value system. The accelerator, implemented on the FPGA fabric, processes request packets directly from the network, avoiding the CPU in most cases. The accelerator is created by profiling the application to determine the most c…
Accelerators integrated on-die with General-Purpose CPUs (GP-CPUs) can yield significant performance and power improvements. Their extensive use, however, is ultimately limited by their area overhead; due to their high degree of specialization, the opportunity cost of investing die real estate on accelerators can become prohibitive, especially for general-purpose architectures. In this paper we…
The Network-on-Chip (NoC) paradigm is rapidly evolving into an efficient interconnection network to handle the strict communication requirements between the increasing number of cores on a single chip. Diminishing transistor sizes are making the NoC increasingly vulnerable to both hard faults and soft errors. This paper concentrates on soft errors in NoCs. A soft error in an NoC router results in sig…
Replication of values causes poor utilization of on-chip cache memory resources. This paper addresses the question: How much cache resources can be theoretically and practically saved if value replication is eliminated? We introduce the concept of value-aware caches and show that a sixteen times smaller value-aware cache can yield the same miss rate as a conventional cache. We then make a case f…
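The saving from eliminating value replication can be illustrated with a toy deduplicated store (our illustration, not the paper's design): the tag array maps addresses to value IDs, and each distinct cache-block value is stored only once.

```python
def dedup_footprint(blocks):
    """blocks: dict mapping block address -> block value.

    Returns (conventional, value_aware) storage in value entries:
    a conventional cache stores one value per resident block, while
    a value-aware cache stores each distinct value exactly once.
    """
    return len(blocks), len(set(blocks.values()))

# Two all-zero blocks collapse into a single stored value.
blocks = {0x00: b"\x00" * 64, 0x40: b"\x00" * 64, 0x80: b"A" * 64}
assert dedup_footprint(blocks) == (3, 2)
```

In real workloads a large fraction of resident blocks hold a handful of common values (zero blocks especially), which is what makes the sixteen-fold data-array reduction claimed in the abstract plausible.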
To protect multicores from soft-error perturbations, resiliency schemes have been developed with high coverage but high power/performance overheads [2]. We observe that not all soft errors affect program correctness; some only affect program accuracy, i.e., the program completes with certain acceptable deviations from the soft-error-free outcome. Thus, it is practical to improve proce…
Hardware specialization has emerged as a promising paradigm for future microprocessors. Unfortunately, it is natural to develop and evaluate such architectures within end-to-end vertical silos spanning application, language/compiler, hardware design and evaluation tools, leaving little opportunity for cross-architecture analysis and innovation. This paper develops a novel program re…
Energy consumption by software applications is a critical issue that determines the future of multicore software development. In this article, we propose a hardware-software cooperative approach that uses hardware support to efficiently gather energy-related hardware counters during program execution, and utilizes parameter estimation models in software to compute the energy cons…
The number of cores in a multicore chip design has been increasing in the past two decades. The rate of increase will continue for the foreseeable future. With a large number of cores, the on-chip communication has become a very important design consideration. The increasing number of cores will push the communication complexity level to a point where managing such highly complex systems requir…
Optical interconnect is a promising alternative to electrical interconnect for intra-chip communications. The topology of an optical Network-on-Chip (ONoC) has a great impact on network performance. However, the size of an ONoC is limited by power consumption and crosstalk noise, which mainly result from the waveguide crossings in the topology. In this paper, a diagonal M…
Understanding data reuse patterns of a computing system is crucial to effective design optimization. The emerging SIMT (Single Instruction Multiple Threads) processor adopts a programming model that is fundamentally disparate from conventional scalar processors. There is a lack of analytical approaches to quantify the data reuse of SIMT applications. This paper presents a quantitative method t…
Deadlock remains a central problem in interconnection networks. In this paper, we establish a new theory of deadlock-free flow control for k-ary n-cube mesh networks, which enables the use of any minimal-path adaptive routing algorithm while avoiding deadlock. We prove that the proposed flow control algorithm is a sufficient condition for deadlock freedom in any minimal-path, adaptive routing a…
Memory bandwidth is critical to GPGPU performance. Exploiting locality in caches can better utilize memory bandwidth. However, memory requests issued by excessive threads cause cache thrashing and saturate memory bandwidth, degrading performance. In this paper, we propose adaptive cache and concurrency allocation (CCA) to prevent cache thrashing and improve the utilization of bandwid…
The paper presents an adaptive wear-leveling scheme based on several wear-thresholds in different periods. The basic idea behind this scheme is that blocks can have different wear-out speeds and the wear-leveling mechanism does not conduct data migration until the erasure counts of some hot blocks hit a threshold. Through a series of emulation experiments based on several realistic disk traces,…
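The threshold idea the abstract describes can be sketched as follows (the function and parameters are illustrative assumptions): no data migration happens until some hot block's erase count reaches the current period's threshold, and the threshold is then raised for the next period so that migrations stay infrequent.

```python
def needs_migration(erase_counts, threshold):
    # Trigger wear-leveling data migration only for "hot" blocks
    # whose erase count has reached the current period's threshold.
    return [blk for blk, n in erase_counts.items() if n >= threshold]

counts = {"blk0": 120, "blk1": 15, "blk2": 98}
assert needs_migration(counts, threshold=100) == ["blk0"]
# After migrating cold data into blk0's place, the controller would
# raise the threshold for the next period (e.g., threshold += step),
# deferring the next round of migrations.
```

The design choice this captures is the trade-off in the abstract: a low fixed threshold evens wear aggressively but pays constant migration overhead, while period-adaptive thresholds migrate only when wear imbalance actually appears.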
The divide-and-conquer paradigm can be used to express many computationally significant problems, but an important subset of these applications is inherently load-imbalanced. Load balancing is a challenge for irregular parallel divide-and-conquer algorithms, and efficiently solving these applications will be a key requirement for future many-core systems. To address the load imbalance issue, ins…
Flash-based solid-state drive (SSD) is now being widely deployed in cloud computing platforms due to the potential advantages of better performance and less energy consumption. However, current virtualization architecture lacks support for high-performance I/O virtualization over persistent storage, which results in sub-optimal I/O performance for guest virtual machines (VMs) on SSD. Further, cu…
This paper presents a lifetime reliability characterization of many-core processors based on a full-system simulation of integrated microarchitecture, power, thermal, and reliability models. Under normal operating conditions, our model and analysis reveal that the mean-time-to-failure of cores on the die follows a normal distribution. From the processor-level perspective, the key insight is that red…
Increased power dissipation in computing devices has led to a sharp rise in thermal hotspots, creating thermal runaway. To reduce the additional power requirement caused by increased temperature, current approaches apply cooling mechanisms to remove heat or apply management techniques to avoid thermal emergencies by slowing down heat generation. This paper proposes to tackle the heat management…
We have developed and evaluated Argus-G, an error detection scheme for general purpose GPU (GPGPU) cores. Argus-G is a natural extension of the Argus error detection scheme for CPU cores, and we demonstrate how to modify Argus such that it is compatible with GPGPU cores. Using an RTL prototype, we experimentally show that Argus-G can detect the vast majority of injected errors at relatively lo…
Recently, researchers have explored way-based hybrid SRAM-NVM (non-volatile memory) last level caches (LLCs) to bring the best of SRAM and NVM together. However, the limited write endurance of NVMs restricts the lifetime of these hybrid caches. We present AYUSH, a technique to enhance the lifetime of hybrid caches, which works by using data-migration to preferentially use SRAM for storing freq…
Graphics Processing Units accelerate data-parallel graphics calculations using wide SIMD vector units. Compiling programs to use the GPU’s SIMD architecture requires converting multiple control flow paths into a single stream of instructions. IF-conversion is a compiler transformation that converts control dependencies into data dependencies, and it is used by vectorizing compilers to elimi…
Caches are critical to performance, yet their behavior is hard to understand and model. In particular, prior work does not provide closed-form solutions of cache performance, i.e., simple expressions for the miss rate of a specific access pattern. Existing cache models instead use numerical methods that, unlike closed-form solutions, are computationally expensive and yield limited insight. We pr…
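As a concrete instance of what a closed-form miss-rate expression looks like (our textbook example, not this paper's model): for a repeated sequential scan over W distinct blocks on a fully associative LRU cache of C blocks, LRU always evicts the block about to be reused, so the miss rate is 1 when W > C and only the W cold misses remain when W <= C. The sketch below checks that closed form against a simulation:

```python
from collections import OrderedDict

def lru_miss_rate(trace, capacity):
    """Simulate a fully associative LRU cache; return the miss rate."""
    cache, misses = OrderedDict(), 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)       # refresh LRU position
        else:
            misses += 1
            if len(cache) >= capacity:    # evict least recently used
                cache.popitem(last=False)
        cache[addr] = True
    return misses / len(trace)

trace = list(range(8)) * 100              # scan W = 8 blocks, 100 times
assert lru_miss_rate(trace, capacity=4) == 1.0       # W > C: thrashing
assert lru_miss_rate(trace, capacity=8) == 8 / 800   # only cold misses
```

Numerical cache models effectively rediscover such relations by simulation; a closed-form model states them directly, which is the insight gap the abstract targets.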
Faulty cells have become major problems in cost-sensitive main-memory DRAM devices. Conventional solutions to reduce device failure rates due to cells with permanent faults, such as populating spare rows and relying on error-correcting codes, have had limited success due to high area overheads. In this paper, we propose CIDR, a novel cache-inspired DRAM resilience architecture, which substantia…
STT-RAM technology has recently emerged as one of the most promising memory technologies. However, its major problems, limited write endurance and high write energy, are still preventing it from being used as a drop-in replacement of SRAM cache. In this paper, we propose a novel coding scheme for STT-RAM last level cache based on the concept of value locality. We reduce switching probability in…
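A well-known write-reduction code illustrates the flavor of switching-probability reduction (this is Flip-N-Write-style inversion coding, named here for comparison; it is not necessarily the paper's scheme): each word is stored either as-is or inverted, whichever flips fewer cells, at the cost of one flag bit per word.

```python
def flips(a: int, b: int) -> int:
    # Number of bit positions that differ, i.e., STT-RAM cells
    # that would have to switch when overwriting a with b.
    return bin(a ^ b).count("1")

def flip_n_write(old: int, new: int, width: int = 8):
    """Return (stored_word, inverted_flag) minimizing cell switches."""
    inv = new ^ ((1 << width) - 1)
    if flips(old, inv) < flips(old, new):
        return inv, True
    return new, False

old, new = 0b0000_0000, 0b1111_0111
stored, inverted = flip_n_write(old, new)
assert inverted and flips(old, stored) == 1   # 1 switch instead of 7
```

Since write energy and endurance wear in STT-RAM scale with the number of switched cells, any coding that biases writes toward the low-flip encoding attacks exactly the two problems the abstract names.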
A growing number of processors have CPU cores that implement the same instruction set architecture (ISA) using different microarchitectures. The underlying motivation for single-ISA heterogeneity is that a diverse set of cores can enable runtime flexibility. Modern processors are subject to strict power budgets, and heterogeneity provides the runtime scheduler with more latitude to decide the l…
The rapid random access of SSDs enables efficient searching of redundant data and their deduplication. However, the space earned from deduplication cannot be used as permanent storage because it must be reclaimed when deduplication is cancelled as a result of an update to the deduplicated data. To overcome this limitation, we propose a novel FTL scheme that enables the gained capacity to be us…
Improving energy efficiency is crucial for both mobile and high-performance computing systems while a large fraction of total energy is consumed to transfer data between storage and processing units. Thus, reducing data transfers across the memory hierarchy of a processor (i.e., off-chip memory, on-chip caches, and register file) can greatly improve the energy efficiency. To this end, we propos…
In this paper we present DRAMSim2, a cycle accurate memory system simulator. The goal of DRAMSim2 is to be an accurate and publicly available DDR2/3 memory system model which can be used in both full system and trace-based simulations. We describe the process of validating DRAMSim2 timing against manufacturer Verilog models in an effort to prove the accuracy of simulation results. We outline t…
3D logic-on-logic technology is a promising approach for extending the validity of Moore’s law when technology scaling stops. 3D technology can also lead to a paradigm shift in on-chip communication design by providing orders of magnitude higher bandwidth and lower latency for inter-layer communication. To turn the 3D technology bandwidth and latency benefits into network latency reductions …
The performance of user-facing applications is critical to client platforms. Many of these applications are event-driven and exhibit “bursty” behavior: the application is generally idle but generates bursts of activity in response to human interaction. We study one example of a bursty application, web-browsers, and produce two important insights: (1) Activity bursts contain false parallelis…
Interconnection networks are a critical component of most modern systems. Both off-chip networks, in HPC systems, data centers, and cloud servers, and on-chip networks, in chip multiprocessors (CMPs) and multiprocessor systems-on-chip (MPSoCs), play an increasing role, as their performance is vital for the performance of the whole system. One of the key components of any interconnect i…
Next generation byte addressable nonvolatile memory (NVM) technologies like PCM are attractive for end-user devices as they offer memory scalability as well as fast persistent storage. In such environments, NVM’s limitations of slow writes and high write energy are magnified for applications that need atomic, consistent, isolated and durable (ACID) updates. This is because, for satisfying cor…
Over the lifetime of a microprocessor, the Hot Carrier Injection (HCI) phenomenon degrades the threshold voltage, which causes slower transistor switching and eventually results in timing violations and faulty operation. This effect appears when the memory cell contents flip from logic ‘0’ to ‘1’ and vice versa. In caches, the majority of cell flips are concentrated into only a few of t…
The performance of data-intensive applications, when running on modern multi- and many-core processors, is largely determined by their memory access behavior. Its most important contributors are the frequency and latency of off-chip accesses and the extent to which long-latency memory accesses can be overlapped with useful computation or with each other. In this paper we present two methods to…
Hardware prefetching improves system performance by hiding and tolerating the latencies of lower levels of cache and off-chip DRAM. An accurate prefetcher improves system performance whereas an inaccurate prefetcher can cause cache pollution and consume additional bandwidth. Prefetch address filtering techniques improve prefetch accuracy by predicting the usefulness of a prefetch address and ba…
For the last few years, the major driving force behind the rapid performance improvement of SSDs has been the increase in the number of parallel bus channels between the flash controller and flash memory packages inside solid-state drives (SSDs). However, other internal parallelisms inside SSDs are yet to be explored. In order to improve performance further by utilizing this parallelism, this paper sugges…