Since most current multi-core processors use a large last-level cache (LLC), efficient use of the LLC is critical for the overall performance of multi-cores. To improve caching efficiency, page coloring is a representative software-based approach that allows the OS to control the placement of pages in the LLC, improving cache utility and avoiding conflicts among cores. However, system v…
Abstract—The challenge of fast, low-cost deployment of ubiquitous personalized e-Health services has prompted us to propose a new framework architecture for such services. We studied the operational features and environment of e-Health services, and this led us to a framework structure that extends the European Telecommunications Standards Institute (ETSI)/Parlay architecture, which is…
Faxian traveled regions that we now lump together as Asia. He wrote about what he discovered, translated the texts he brought back, suffered aches and pains, felt homesick, returned transformed, and, in turn, transformed the world’s understanding of religion, Buddhism, travel, geography, ritual, art, and many other things. His traveling companions died, left him, joined him, pursued other pat…
Technology scaling has raised the specter of myriads of cheap, but unreliable and/or stochastic devices that must be creatively combined to create a reliable computing system. This has renewed the interest in computing that exploits stochasticity—embracing, not combating the device physics. If a stochastic representation is used to implement a programmable general-purpose architecture akin to…
Long-latency cache accesses cause significant delays for both in-order and out-of-order processor systems. To address these delays, runahead pre-execution has been shown to produce speedups by warming up cache structures during stalls caused by long-latency memory accesses. While improving cache-related performance, basic runahead approaches do not otherwise utilize resul…
Weighted speedup is nowadays the most commonly used multiprogram workload performance metric. Weighted speedup is a weighted-IPC metric, i.e., the multiprogram IPC of each program is first weighted by its isolated IPC. Recently, Michaud questioned the validity of weighted-IPC metrics, arguing that they are inconsistent and that weighted speedup favors unfairness [4]. Instead, he advocates us…
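As a concrete illustration of how a weighted-IPC metric is computed, here is a minimal sketch; the function name and the IPC values are invented for illustration, not taken from the cited work:

```python
def weighted_speedup(ipc_shared, ipc_alone):
    # Each program's multiprogram (shared-mode) IPC is weighted, i.e.
    # normalized, by its isolated IPC; weighted speedup is the sum of
    # these per-program ratios.
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# Hypothetical two-program workload: each program runs at half its
# isolated IPC when co-scheduled.
ws = weighted_speedup([1.0, 0.5], [2.0, 1.0])
print(ws)  # -> 1.0
```

A value of 1.0 here means the two co-scheduled programs together retain the throughput of one isolated program, which is the kind of aggregate statement the metric is designed to make.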
An Associative Processor (AP) combines data storage and data processing, functioning simultaneously as a massively parallel SIMD array processor and as memory. Traditionally, APs are based on CMOS technology, like other classes of massively parallel SIMD processors. The main component of an AP is a Content Addressable Memory (CAM) array. As CMOS feature scaling slows down, CAM experiences scalabil…
To make applications with dynamic data sharing among threads benefit from GPU acceleration, we propose a novel software transactional memory system for GPU architectures (GPU-STM). The major challenges include ensuring good scalability with respect to the massive multithreading of GPUs, and preventing livelocks caused by the SIMT execution paradigm of GPUs. To this end, we propose (1) a hiera…
Flash storage devices behave quite differently from hard disk drives (HDDs); a page on flash has to be erased before it can be rewritten, and the erasure has to be performed on a block which consists of a large number of contiguous pages. It is also important to distribute writes evenly among flash blocks to avoid premature wearing. To achieve interoperability with existing block I/O subsystem…
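The interoperability layer alluded to here is conventionally a flash translation layer (FTL). A minimal page-mapping sketch (class and field names are illustrative assumptions, not any specific device's design) shows the two core rules: writes go out-of-place, and the superseded copy becomes garbage to be reclaimed later at block granularity:

```python
class PageMapFTL:
    """Minimal page-mapping FTL sketch (illustrative only)."""

    def __init__(self, num_blocks, pages_per_block):
        self.pages_per_block = pages_per_block
        # Free physical pages, lowest-numbered first.
        self.free = list(range(num_blocks * pages_per_block))
        self.l2p = {}          # logical page number -> physical page number
        self.invalid = set()   # stale physical pages awaiting block erase

    def write(self, lpn):
        # Flash pages cannot be rewritten in place: allocate a fresh page
        # and invalidate the previous copy, if any.
        if lpn in self.l2p:
            self.invalid.add(self.l2p[lpn])
        ppn = self.free.pop(0)
        self.l2p[lpn] = ppn
        return ppn

ftl = PageMapFTL(num_blocks=2, pages_per_block=4)
first = ftl.write(7)   # fresh write lands on physical page 0
second = ftl.write(7)  # rewrite goes out-of-place, to physical page 1
print(first, second, first in ftl.invalid)  # -> 0 1 True
```

Garbage collection and wear leveling, omitted here, would copy the still-valid pages of a victim block elsewhere and erase the whole block at once.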
Dynamic voltage and frequency scaling (DVFS) is a key technique for reducing processor power consumption in mobile devices. In recent years, mobile system-on-chips (SoCs) have supported DVFS for embedded graphics processing units (GPUs) as the processing power of embedded GPUs has been increasing steadily. The major challenge of applying DVFS to a processing unit is to meet the quality of serv…
Given the growth of managed-code runtime environments in the desktop, server, and mobile segments, agile, flexible, and accurate performance monitoring capabilities are required in order to perform informed code transformations and optimizations. Common profiling strategies, mainly based on instrumentation and current performance monitoring units (PMUs), are not adequate, and new innovative designs ar…
We present for the first time the concept of per-task energy accounting (PTEA) and relate it to per-task energy metering (PTEM). We show the benefits of supporting both in future computing systems. Using the shared last-level cache (LLC) as an example: (1) we illustrate the complexities in providing PTEM and PTEA; (2) we present an idealized PTEM model and an accurate and low-cost implementation…
Non-volatile memory (NVM) technology holds promise to replace SRAM and DRAM at various levels of the memory hierarchy. The interest in NVM is motivated by the difficulty of scaling DRAM beyond 22 nm and, in the long term, by its lower cost per bit. While offering higher density and negligible static power (leakage and refresh), NVM suffers from increased latency and energy per memory access. This paper dev…
In out-of-order (OoO) processors, speculative execution with high branch prediction accuracy is employed to achieve good single-thread performance. In these processors, the branch prediction unit (BPU) tables are accessed in parallel with the instruction cache, before it is known whether a fetch group contains branch instructions. For integer applications, we find that 85 percent of BPU lookups are d…
This paper proposes persistent transactional memory (PTM), a new design that adds durability to transactional memory (TM) by incorporating emerging non-volatile memory (NVM). PTM dynamically tracks transactional updates to cache lines to ensure the ACI (atomicity, consistency, and isolation) properties during cache flushes, and leverages an undo log in NVM to ensure PTM can always consis…
The memory bottleneck has always been a major cause of limited performance in computer systems. While in the past latency was the major concern, today a lack of bandwidth is becoming a limiting factor as well: exploiting more parallelism with the growing number of cores per die intensifies the pressure on the memory bus. In such an environment, any additional traffic to memor…
Integrated CPU-GPU architectures with a fully addressable shared memory completely eliminate any CPU-GPU data transfer overhead. Since such architectures are relatively new, it is unclear what level of interaction between the CPU and GPU attains the best energy efficiency. Too-coarse-grained (larger) kernels with fairly low CPU-GPU interaction could cause poor utilization of the shared resou…
Abstract—In this letter, a flexible memory simulator, NVMain 2.0, is introduced to help the community model not only commodity DRAMs but also emerging memory technologies, such as die-stacked DRAM caches, non-volatile memories (e.g., STT-RAM, PCRAM, and ReRAM) including multi-level cells (MLC), and hybrid non-volatile-plus-DRAM memory systems. Compared to existing memory simulators, N…
Many-Accelerator (MA) systems have been introduced as a promising architectural paradigm that can boost performance and improve power efficiency of general-purpose computing platforms. In this paper, we focus on the problem of resource under-utilization, i.e., Dark Silicon, in FPGA-based MA platforms. We show that, besides the typically expected peak power budget, on-chip memory resources form a severe under…
Switch-on-Event Multithreading (SoE MT, also known as coarse-grained MT and block MT) processors run multiple threads on a pipelined machine, where the pipeline switches threads on stall events (e.g., cache misses). The thread-switch penalty is determined by the number of pipeline stages that are flushed of in-flight instructions. In this paper, Continuous Flow Multithreading (CFMT), a new …
To address the Dark Silicon problem, architects have increasingly turned to special-purpose hardware accelerators to improve the performance and energy efficiency of common computational kernels, such as encryption and compression. Unfortunately, the latency and overhead required to off-load a computation to an accelerator sometimes outweighs the potential benefits, resulting in a net decrease …
A novel method to protect a system against soft errors occurring in virtual address (VA) storing structures, such as translation lookaside buffers (TLBs), the physical register file (PRF), and the program counter (PC), is proposed in this paper. The work is motivated by showing how soft errors impact the structures that store virtual page numbers (VPNs). A solution is proposed b…
Power mismatching between supply and demand has emerged as a top issue in modern datacenters that are under-provisioned or powered by intermittent power supplies. Recent proposals are primarily limited to leveraging uninterruptible power supplies (UPS) to handle power mismatching, and therefore lack the capability to efficiently handle irregular peak power mismatches. In this paper we p…
Web browsing on mobile devices is undoubtedly the future. However, with the increasing complexity of webpages, the mobile device’s computation capability and energy consumption become major obstacles to a satisfactory user experience. In this paper, we propose a mechanism to effectively leverage processor frequency scaling in order to balance the performance and energy consumption of mobile w…
JavaScript is a sequential programming language, and Thread-Level Speculation has been proposed to dynamically extract parallelism in order to take advantage of parallel hardware. In previous work, we showed significant speed-ups with a simple on/off speculation heuristic. In this paper, we propose and evaluate three heuristics for dynamically adapting the speculation: a 2-bit heuristic, an e…
Bitwise operations are an important component of modern-day programming, and are used in a variety of applications such as databases. In this work, we propose a new and simple mechanism to implement bulk bitwise AND and OR operations in DRAM, which is faster and more efficient than existing mechanisms. Our mechanism exploits existing DRAM operation to perform a bitwise AND/OR of two DRAM rows c…
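One published realization of such in-DRAM bitwise operations activates three rows simultaneously, so that charge sharing settles each bitline to the bitwise majority of two data rows and a control row; presetting the control row to all zeros yields AND, and to all ones yields OR. A sketch of that logic (the 8-bit row width and the function names are illustrative assumptions):

```python
def triple_row_majority(a, b, c):
    # Charge sharing across three simultaneously activated rows settles
    # each bitline to the majority value of the three cells.
    return (a & b) | (b & c) | (a & c)

WORD = 0xFF  # assume 8-bit rows for illustration

def dram_and(a, b):
    return triple_row_majority(a, b, 0)      # control row all zeros

def dram_or(a, b):
    return triple_row_majority(a, b, WORD)   # control row all ones

print(bin(dram_and(0b1100, 0b1010)))  # -> 0b1000
print(bin(dram_or(0b1100, 0b1010)))   # -> 0b1110
```

The identity behind it: with c = 0 the majority reduces to a AND b, and with c = all-ones it reduces to a OR b, so one physical operation serves both logical operations.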
We study the tradeoffs between Many-Core machines like Intel’s Larrabee and Many-Thread machines like Nvidia’s and AMD’s GPGPUs. We define a unified model describing a superposition of the two architectures, and use it to identify operation zones for which each machine is more suitable. Moreover, we identify an intermediate zone in which both machines deliver inferior performance. We study the sh…
gem5-gpu is a new simulator that models tightly integrated CPU-GPU systems. It builds on gem5, a modular full-system CPU simulator, and GPGPU-Sim, a detailed GPGPU simulator. gem5-gpu routes most memory accesses through Ruby, a highly configurable memory system in gem5, enabling it to simulate many system configurations, ranging from a system with coherent caches and a si…
Over the past few years, there has been vast growth of the web browser as an application platform. One example of this trend is Google’s Native Client (NaCl) platform, a software-fault-isolation mechanism that allows native x86 or ARM code to run in the browser. One of the security mechanisms employed by NaCl is that all branches must jump to the start of a valid…
Consider a workload comprising a consecutive sequence of program execution segments, where each segment can either be executed on a general-purpose processor or offloaded to a hardware accelerator. An analytical optimization framework, based on the MultiAmdahl framework and Lagrange multipliers, for selecting the optimal set of accelerators and for allocating resources among them under constrained are…
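In general terms, such a constrained allocation can be written as follows (a sketch under assumed notation, not necessarily the paper's): let segment $i$ occupy fraction $f_i$ of the baseline execution time and be sped up by a factor $s_i(a_i)$ when given accelerator area $a_i$, under a total area budget $A$:

```latex
\min_{a_1,\dots,a_n} \; T = \sum_{i=1}^{n} \frac{f_i}{s_i(a_i)}
\qquad \text{s.t.} \qquad \sum_{i=1}^{n} a_i \le A .
```

Forming the Lagrangian $\mathcal{L} = \sum_i f_i / s_i(a_i) + \lambda\left(\sum_i a_i - A\right)$ and setting $\partial \mathcal{L} / \partial a_i = 0$ gives the balance condition

```latex
\frac{f_i \, s_i'(a_i)}{s_i(a_i)^2} = \lambda \quad \text{for all } i,
```

i.e., at the optimum every accelerator delivers the same marginal reduction in execution time per unit of additional area.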
Memory access times are the primary bottleneck for many applications today. This “memory wall” is due to the performance disparity between processor cores and main memory. To address the performance gap, we propose the use of custom memory subsystems tailored to the application rather than attempting to optimize the application for a fixed memory subsystem. Custom subsystems can take advant…
With the trend towards an increasing number of cores in multicore processors, the on-chip network that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks in on-chip networks and describe how the flattened butterfly topology can be mapped to on-chip networks. By using high-radix routers to reduce the diameter of the network, the flattened butterfly o…
The Roofline model graphically represents the attainable upper bound performance of a computer architecture. This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache awareness, thus significantly improving the guidelines for application optimization. The proposed model was experim…
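The original model's attainable-performance bound can be stated in a few lines; a minimal sketch (the machine numbers are made up):

```python
def roofline(operational_intensity, peak_gflops, peak_bw_gbs):
    # Attainable GFLOP/s is capped by whichever roof is lower: the
    # compute roof, or the memory roof (peak bandwidth in GB/s times
    # operational intensity in FLOPs/byte).
    return min(peak_gflops, peak_bw_gbs * operational_intensity)

# Hypothetical machine: 100 GFLOP/s peak compute, 50 GB/s peak bandwidth.
print(roofline(0.5, 100.0, 50.0))   # memory-bound kernel -> 25.0
print(roofline(10.0, 100.0, 50.0))  # compute-bound kernel -> 100.0
```

The cache-aware extension proposed in the paper refines exactly this bound: it replaces the single DRAM bandwidth roof with per-level memory roofs, which this sketch does not attempt to model.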
DRAM scaling has been the prime driver of the increasing capacity of main memory systems. Unfortunately, lower technology nodes worsen cell reliability by increasing the coupling between adjacent DRAM cells, thereby exacerbating different failure modes. This paper investigates the reliability problem due to Row Hammering, whereby frequent activations of a given row can cause data loss for its…
We present a method for accelerating server applications using a hybrid CPU+FPGA architecture and demonstrate its advantages by accelerating Memcached, a distributed key-value system. The accelerator, implemented on the FPGA fabric, processes request packets directly from the network, avoiding the CPU in most cases. The accelerator is created by profiling the application to determine the most c…
Accelerators integrated on-die with General-Purpose CPUs (GP-CPUs) can yield significant performance and power improvements. Their extensive use, however, is ultimately limited by their area overhead; due to their high degree of specialization, the opportunity cost of investing die real estate on accelerators can become prohibitive, especially for general-purpose architectures. In this paper we…
The Network-on-Chip (NoC) paradigm is rapidly evolving into an efficient interconnection network to handle the strict communication requirements between the increasing number of cores on a single chip. Diminishing transistor size is making the NoC increasingly vulnerable to both hard faults and soft errors. This paper concentrates on soft errors in NoCs. A soft error in an NoC router results in sig…
Replication of values causes poor utilization of on-chip cache memory resources. This paper addresses the question: how much cache capacity can be theoretically and practically saved if value replication is eliminated? We introduce the concept of value-aware caches and show that a value-aware cache sixteen times smaller can yield the same miss rate as a conventional cache. We then make a case f…
To protect multicores from soft-error perturbations, resiliency schemes have been developed with high coverage but high power/performance overheads [2]. We observe that not all soft errors affect program correctness; some only affect program accuracy, i.e., the program completes with certain acceptable deviations from the soft-error-free outcome. Thus, it is practical to improve proce…
Abstract—Hardware specialization has emerged as a promising paradigm for future microprocessors. Unfortunately, it is natural to develop and evaluate such architectures within end-to-end vertical silos spanning application, language/compiler, hardware design, and evaluation tools, leaving little opportunity for cross-architecture analysis and innovation. This paper develops a novel program re…
Abstract—Energy consumption by software applications is a critical issue that determines the future of multicore software development. In this article, we propose a hardware-software cooperative approach that uses hardware support to efficiently gather energy-related hardware counters during program execution, and utilizes parameter-estimation models in software to compute the energy cons…
The number of cores in multicore chip designs has been increasing over the past two decades, and the rate of increase will continue for the foreseeable future. With a large number of cores, on-chip communication has become a very important design consideration. The increasing number of cores will push the communication complexity to a point where managing such highly complex systems requir…
Optical interconnect is a promising substitute for electrical interconnect in intra-chip communication. The topology of an optical Network-on-Chip (ONoC) has a great impact on network performance. However, the size of an ONoC is limited by power consumption and crosstalk noise, which mainly result from the waveguide crossings in the topology. In this paper, a diagonal M…
Understanding data reuse patterns of a computing system is crucial to effective design optimization. The emerging SIMT (Single Instruction Multiple Threads) processor adopts a programming model that is fundamentally disparate from conventional scalar processors. There is a lack of analytical approaches to quantify the data reuse of SIMT applications. This paper presents a quantitative method t…
Deadlock remains a central problem in interconnection networks. In this paper, we establish a new theory of deadlock-free flow control for k-ary n-cube mesh networks, which enables the use of any minimal-path adaptive routing algorithm while avoiding deadlock. We prove that the proposed flow control algorithm is a sufficient condition for deadlock freedom in any minimal-path adaptive routing a…
Abstract—Memory bandwidth is critical to GPGPU performance. Exploiting locality in caches can better utilize memory bandwidth. However, memory requests issued by excessive threads cause cache thrashing and saturate memory bandwidth, degrading performance. In this paper, we propose adaptive cache and concurrency allocation (CCA) to prevent cache thrashing and improve the utilization of bandwid…
The paper presents an adaptive wear-leveling scheme based on several wear-thresholds in different periods. The basic idea behind this scheme is that blocks can have different wear-out speeds and the wear-leveling mechanism does not conduct data migration until the erasure counts of some hot blocks hit a threshold. Through a series of emulation experiments based on several realistic disk traces,…
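The threshold-triggered policy described above can be sketched as follows; the data structures and the cold-block selection rule here are illustrative assumptions, not the paper's algorithm:

```python
def select_migrations(erase_counts, threshold, cold_margin=2):
    # No data migration happens until some hot block's erase count hits
    # the threshold; then each hot block is paired with a little-worn
    # block so cold data can be moved onto the worn blocks.
    hot = sorted(b for b, c in erase_counts.items() if c >= threshold)
    if not hot:
        return []  # below threshold: wear leveling stays idle
    coldest = min(erase_counts.values())
    cold = sorted(b for b, c in erase_counts.items()
                  if c <= coldest + cold_margin and b not in hot)
    return list(zip(hot, cold))

# Blocks 0 and 2 are hot (>= 100 erasures); blocks 1 and 3 are cold.
counts = {0: 120, 1: 5, 2: 118, 3: 6}
print(select_migrations(counts, threshold=100))  # -> [(0, 1), (2, 3)]
```

Raising the threshold over successive periods, as the abstract suggests, would simply re-invoke this selection with a larger `threshold` value, deferring migrations until the wear spread actually warrants them.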
The divide-and-conquer paradigm can be used to express many computationally significant problems, but an important subset of these applications is inherently load-imbalanced. Load balancing is a challenge for irregular parallel divide-and-conquer algorithms, and efficiently solving these applications will be a key requirement for future many-core systems. To address the load imbalance issue, ins…
Flash-based solid-state drives (SSDs) are now widely deployed in cloud computing platforms due to their potential advantages of better performance and lower energy consumption. However, current virtualization architectures lack support for high-performance I/O virtualization over persistent storage, which results in sub-optimal I/O performance for guest virtual machines (VMs) on SSDs. Further, cu…
This paper presents a lifetime reliability characterization of many-core processors based on full-system simulation of integrated microarchitecture, power, thermal, and reliability models. Under normal operating conditions, our model and analysis reveal that the mean time to failure of cores on the die follows a normal distribution. From the processor-level perspective, the key insight is that red…