Blackburn, Cai, Chen, Yang, Zhang, Zigman,
Rethinking Java Performance Analysis,
in ASPLOS ’25, March 30-April 3, 2025, Rotterdam, Netherlands., 2025.
Representative workloads and principled methodologies are the foundation of performance analysis, which in turn provides the empirical grounding for much of the innovation in systems research. However, benchmarks are hard to maintain, methodologies are hard to develop, and our field moves fast. The tension between our fast-moving fields and their need to maintain their methodological foundations is a serious challenge. This paper explores that challenge through the lens of Java performance analysis. Lessons we draw extend to other languages and other fields of computer science. In this paper we: i) introduce a complete overhaul of the DaCapo benchmark suite, characterizing 22 new and/or refreshed workloads across 47 dimensions, using principal components analysis to demonstrate their diversity, ii) demonstrate new methodologies and how they are integrated into an easy to use framework, iii) use this framework to conduct an analysis of the state of the art in production Java performance, and iv) motivate the need to invest in renewed methodologies and workloads, using as an example a review of contemporary production garbage collector performance. We highlight the danger of allowing methodologies to lag innovation and respond with a suite and new methodologies that nudge forward some of our field’s methodological foundations. We offer guidance on maintaining the empirical rigor we need to encourage profitable research directions and quickly identify unprofitable ones.
|
Sareen, Blackburn, Hamouda, Gidra,
Memory Management on Mobile Devices,
in ISMM ’24, June 25, 2024, Copenhagen, Denmark., 2024.
The performance of mobile devices directly affects billions of people worldwide. Yet, despite memory management being key to their responsiveness, energy efficiency, and cost, mobile devices are understudied in the literature. A paucity of suitable methodologies and benchmarks is likely both a cause and a consequence of this gap. It also reflects the challenges of evaluating mobile devices due to: i) their inherently multi-tenanted nature, ii) the scarcity of widely-used open source workloads suitable as benchmarks, iii) the challenge of determinism and reproducibility given mobile devices’ extensive use of GPS and network services, iv) the complexity of mobile performance criteria. We study this problem using the Android Runtime (ART), which is particularly interesting because it is open sourced, garbage collected, and its market extends from the most advanced to the most basic mobile devices available, with a commensurate diversity of performance expectations. Our study makes the following contributions: i) we identify pitfalls and challenges to the sound evaluation of garbage collection in ART, ii) we describe a framework for the principled performance evaluation of overheads in ART, iii) we curate a small benchmark suite comprised of widely-used real-world applications, and iv) we conduct an evaluation of these real-world workloads as well as some DaCapo benchmarks and a micro-benchmark. For a modestly sized heap, we find that the lower bound on garbage collection overheads vary considerably among the benchmarks we evaluate, from 2% to 51%, and that overall, overheads are similar to those identified in recent studies of Java workloads running on OpenJDK. We hope that this work will demystify the challenges of studying memory management in the Android Runtime. By doing so,we hope to open up research and lead to more innovation in this highly impactful and memory-sensitive domain.
|
Huang, Blackburn, Cai,
Improving Garbage Collection Observability with Performance Tracing,
in MPLR ’23, September 22, 2023, Cascais, Portugal, 2023.
Debugging garbage collectors for performance and correctness is notoriously difficult. Among the arsenal of tools available to systems engineers, support for one of the most powerful, _tracing_, is lacking in most garbage collectors. Instead, engineers must rely on counting, sampling, and logging. Counting and sampling are limited to statistical analyses while logging is limited to hard-wired metrics. This results in cognitive friction, curtailing innovation and optimization. We find that tracing allows: (i) cheap insertion of tracepoints---just 14 lines of code and no measurable run-time overhead, (ii) decoupling of the declaration of tracepoints from tracing logic, (iii) high fidelity measurement able to detect subtle performance regressions, while also allowing (iv) interrogation of a running binary. Our tools crisply highlight several classes of performance bug, such as poor scalability in multi-threaded GCs, and lock contention in the allocation sequence. These observations uncover optimization opportunities in collectors, and even reveal bugs in application programs. We showcase tracing as a powerful tool for GC designers and practitioners. Tracing can uncover missed opportunities and lead to novel algorithms and new engineering practices.
|
Sareen, Blackburn,
Better Understanding the Costs and Benefits of Automatic Memory Management,
in MPLR ’22, September 14–15, 2022, Brussels, Belgium, 2022.
Automatic memory management relieves programmers of the burden of having to reason about object lifetimes in order to soundly reclaim allocated memory.
|
Xu, Moss, Blackburn,
Towards a Model Checking Framework for a New Collector Framework,
in MPLR ’22, September 14–15, 2022, Brussels, Belgium, 2022.
Garbage collectors provide memory safety, an important step toward program correctness.
|
Blackburn, ,
ISMM Keynote: We live in interesting times,
in ISMM '22, June 12, 2022, San Diego, CA, USA., 2022.
Memory management is rich with deep problems and challenging engineering, and fundamental to much of computer science which is probably why the likes of McCarthy, Dijkstra, Knuth, Steele, and Liskov have all made important contributions to our field. With so much water under the bridge, and such a rich history, perhaps we’re about done? Perhaps memory management is a solved problem? Far from it. Seismic technology shifts, unprecedented scale, concerns for energy efficiency and security, a diversity of programming languages and above all, ubiquity, have conspired to make memory management more challenging, interesting, and important than ever. In this talk, I’ll reflect on some of my experience in the field and reflect on why now is a more interesting (exciting!) time to be working on memory management than any in its six-decade history.
|
Zhao, Blackburn, McKinley,
Low-Latency, High-Throughput Garbage Collection,
in PLDI ’22, June 13-17, 2022, San Diego, CA, USA., 2022.
To achieve short pauses, state-of-the-art concurrent copying collectors such as C4, Shenandoah, and ZGC use substantially more CPU cycles and memory than simpler collectors. They suffer from design limitations: i) _concurrent copying_ with inherently expensive read and write barriers, ii) _scalability_ limitations due to tracing, and iii) _immediacy_ limitations for mature objects that impose memory overheads. This paper takes a different approach to optimizing responsiveness and throughput. It uses the insight that regular, brief stop-the-world collections deliver sufficient responsiveness at greater efficiency than concurrent evacuation. It introduces LXR, where stop-the-world collections use reference counting (RC) and judicious copying. RC delivers scalability and immediacy, promptly reclaiming young and mature objects. RC, in a hierarchical Immix heap structure, reclaims most memory without any copying. Occasional concurrent tracing identifies cyclic garbage. LXR introduces (i) RC remembered sets for judicious copying of mature objects; (ii) a novel low-overhead write barrier that combines coalescing reference counting, concurrent tracing, and remembered set maintenance; (iii) object reclamation while performing a concurrent trace; (iv) lazy processing of decrements; and (v) novel survival rate triggers that modulate pause duration. LXR combines excellent responsiveness and throughput, improving over production collectors. On the widely-used Lucene search engine in a tight heap, LXR delivers 7.8X better throughput and 10X better 99.99% tail latency than Shenandoah. On 17 diverse modern workloads in a moderate heap, LXR outperforms OpenJDK\u0027s default G1 on throughput by 4% and Shenandoah by 43%.
|
Ma, Liu, Wang, Qiao, Bond, Blackburn, Kim, Xu,
Mako: A Low-Pause, High-Throughput Evacuating Collector for Memory-Disaggregated Datacenters,
in PLDI ’22, June 13-17, 2022, San Diego, CA, USA., 2022.
Resource disaggregation has gained much traction as an emerging datacenter architecture, as it improves resource utilization and simplifies hardware adoption. Under resource disaggregation, different types of resources (memory, CPUs, etc) are disaggregated into dedicated servers connected by high-speed network fabrics. Memory disaggregation brings efficiency challenges to concurrent garbage collection (GC), which is widely used for latency-sensitive cloud applications, because GC and mutator threads simultaneously run and constantly compete for memory and swap resources. Mako is a new concurrent and distributed GC designed for memory-disaggregated environments. Key to Mako’s success is its ability to offload both tracing and evacuation onto memory servers and run these tasks concurrently when the CPU server executes mutator threads. A major challenge is how to let servers efficiently synchronize as they do not share memory. We tackle this challenge with a set of novel techniques centered around the heap indirection table (HIT), where entries provide one-hop indirection for heap pointers. Our evaluation shows that Mako achieves about 12 ms at the 90th-percentile pause time and outperforms Shenandoah by an average of 3X in throughput.
|
Cai, Blackburn, Bond, Maas,
Distilling the Real Cost of Production Garbage Collectors,
in 2022 IEEE International Symposium on Performance Analysis of Systems and Software, May 22-24, 2022, Singapore., 2022.
Despite the long history of garbage collection (GC) and its prevalence in modern programming languages, there is surprisingly little clarity about its true cost. Without understanding their true cost, crucial tradeoffs made by garbage collectors (GCs) often go unnoticed. This can lead to misguided design constraints and evaluation criteria used by GC researchers and users, hindering the development of high-performance, low-cost GCs. In this paper, we develop a methodology that allows us to empirically estimate the absolute cost of GC for any given set of metrics. This fundamental quantification has eluded the research community, even when using modern, well-established methodologies. By distilling out the explicitly identifiable GC cost, we estimate the intrinsic application execution cost using different GCs. The minimum distilled cost forms a baseline. Subtracting this baseline from the total execution costs, we can then place an empirical lower bound on the absolute costs of different GCs. Using this methodology, we study five production GCs in OpenJDK 17, a high-performance Java runtime. We measure the cost of these collectors, and expose their respective key performance tradeoffs. We find that with a modestly sized heap, production GCs incur substantial overheads across a diverse suite of modern benchmarks, spending at least 7-82% more wall-clock time and 6-92% more CPU cycles relative to the baseline cost. We show that these costs can be masked by concurrency and generous provisioning of memory/compute. In addition, we find that newer low-pause GCs are significantly more expensive than older GCs, and, surprisingly, sometimes deliver worse application latency than stop-the-world GCs. Our findings reaffirm that GC is by no means a solved problem and that a low-cost, low-latency GC remains elusive. We recommend adopting the distillation methodology together with a wider range of cost metrics for future GC evaluations. This will not only help the community more comprehensively understand the performance characteristics of different GCs, but also reveal opportunities for future GC optimizations.
|
Cai, Blackburn, Bond,
Understanding and Utilizing Hardware Transactional Memory Capacity,
in ISMM ’21, June 22, 2021, Virtual, Canada., 2021.
Hardware transactional memory (HTM) provides a simpler programming model than lock-based synchronization. However, HTM has limits that mean that transactions may suffer costly capacity aborts. Understanding HTM capacity is therefore critical. Unfortunately, crucial implementation details are undisclosed. In practice HTM capacity can manifest in puzzling ways. It is therefore unsurprising that the literature reports results that appear to be highly contradictory, reporting capacities that vary by nearly three orders of magnitude. We conduct an in-depth study into the causes of HTM capacity aborts using four generations of Intel’s Transactional Synchronization Extensions (TSX). We identify the apparent contradictions among prior work, and shed new light on the causes of HTM capacity aborts. In doing so, we reconcile the apparent contradictions. We focus on how replacement policies and the status of the cache can affect HTM capacity. One source of surprising behavior appears to be the cache replacement policies used by the processors we evaluated. Both invalidating the cache and warming it up with the transactional working set can significantly improve the read capacity of transactions across the microarchitectures we tested. A further complication is that a physically indexed LLC will typically yield only half the total LLC capacity. We found that methodological differences in the prior work led to different warmup states and thus to their apparently contradictory findings. This paper deepens our understanding of how the underlying implementation and cache behavior affect the apparent capacity of HTM. Our insights on how to increase the read capacity of transactions can be used to optimize HTM applications, particularly those with large read-mostly transactions, which are common in the context of optimistic parallelization.
|
Zhao, Blackburn,
Deconstructing the Garbage-First Collector,
in VEE ’20, March 17, 2020, Lausanne, Switzerland., 2020.
Garbage-First is among today’s most widely used garbage collectors. It is used in the HotSpot and OpenJDK virtual machines, and shares algorithmic foundations with three other important contemporary collectors: Shenandoah, C4, and ZGC. However, the design of the core algorithms and the performance tradeoffs they manifest have not been carefully analyzed in the literature. In this work, we deconstruct the G1 algorithm and re-implement it from first principles. We retrospectively develop a concurrent, region-based evacuating collector, CRE, which captures the principal design elements shared by G1, Shenandoah, C4, and ZGC. We then evaluate the impact of each of the major elements of G1 on performance, including pause time, remembered set footprint and barrier overheads. We find that G1’s concurrent marking and generational collection reduces the 95-percentile GC pauses by 64% and 93% respectively. We find that the space overhead of G1’s remembered sets is very low, typically under 1%. We also independently measure the barriers used by G1 and find that they have an overhead of around 12% with respect to total performance. This analysis gives users and collector designers insights into the garbage-first collector and the other fixed-size region-based concurrent evacuating collectors, which we hope will lead to better use of the collectors and provoke future improvements.
|
Blackburn, Xie, McKinley,
Author Growth Outstrips Publication Growth in Computer Science and Publication Quality Correlates with Collaboration,
in arXiv, 2019.
Although the computer science community successfully harnessed exponential increases in computer performance to drive societal and economic change, the exponential growth in publications is proving harder to accommodate. To gain a deeper understanding of publication growth and inform how the computer science community should handle this growth, we analyzed publication practices from several perspectives: ACM sponsored publications in the ACM Digital Library as a whole: subdisciplines captured by ACM's Special Interest Groups (SIGs); ten top conferences; institutions; four top U.S. departments; authors; faculty; and PhDs between 1990 and 2012. ACM publishes a large fraction of all computer science research. We first summarize how we believe our main findings inform (1) expectations on publication growth, (2) how to distinguish research quality from output quantity; and (3) the evaluation of individual researchers. We then further motivate the study of computer science publication practices and describe our methodology and results in detail.
|
Amiri, Blackburn, Hosking, Norrish,
Designing a Low-Level Virtual Machine for Implementing Real-Time Managed Languages,
in VMIL ’19, October 22, 2019, Athens, Greece., 2019.
Applications of real-time systems have grown significantly in both diversity and popularity, and the appetite for real-time software has never been higher. In contrast, the choice of programming languages used to develop such systems has stagnated, mostly limited to decades-old languages, specifically Ada and C/C++, and more recently real-time Java. We posit that the high cost and difficulty of developing new programming languages for real-time systems is the main reason for this mono-culture. To tackle the lack of diversity, we propose the design of a micro virtual machine on which managed programming languages for real-time systems can be developed. Our design facilitates bringing the advantages of correct managed languages to the real-time domain. We build on a previously published micro virtual machine specification, named Mu, and propose a set of modifications to its abstractions over concurrency and memory management to make it suitable for real-time systems. The resulting design is a basis for a new micro virtual machine specification we call RTMu, designed as a reliable and efficient foundation for the development of managed languages for real-time systems.
|
Shin, Soen, Readshaw, Blackburn, Whitelaw, Xie,
Influence Flowers of Academic Entities,
in VAST '19, IEEE Conference on Visual Analytics Science and Technology, Vancouver, Oct 2019., 2019.
We present the Influence Flower, a new visual metaphor for the influence profile of academic entities, including people, projects, institutions, conferences, and journals. While many tools quantify influence, we aim to expose the flow of influence between entities. The Influence Flower is an ego-centric graph, with a query entity placed in the centre. The petals are styled to reflect the strength of influence to and from other entities of the same or different type. For example, one can break down the incoming and outgoing influences of a research lab by research topics. The Influence Flower uses a recent snapshot of Microsoft Academic Graph, consisting of 212million authors, their 176 million publications, and 1.2 billion citations. An interactive web app, Influence Map, is constructed around this central metaphor for searching and curating visualisations. We also propose a visual comparison method that highlights change in influence patterns over time. We demonstrate through several case studies that the Influence Flower supports data-driven inquiries about the following: researchers careers over time; paper(s) and projects, including those with delayed recognition; the interdisciplinary profile of a research institution; and the shifting topical trends in conferences. We also use this tool on influence data beyond academic citations, by contrasting the academic and Twitter activities of a researcher.
|
Blackburn,
Design and Analysis of Field-Logging Write Barriers,
in Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management (ISMM '19), June 23, 2019, Phoenix, AZ, USA, 2019.
Write barriers are a fundamental mechanism that most production garbage collection algorithms depend on. They inform the collector of mutations to the object graph, enabling partial heap collections, concurrent collection, and reference counting. While in principle, write barriers remember only the pointers within the object graph that were changed and do so just once, widely-used write barriers are less precise, sacrificing temporal and/or spatial precision to performance. The idea of precisely remembering just the pointers that were changed is not new, however implementing performant field-precise barriers has proved elusive. We describe a technique for efficiently implementing field-logging barriers. We implement a number of barriers and evaluate them on a range of x86 hardware. A generational field-logging barrier performs with 0.1% to 1% mutator overhead compared to a highly tuned object-logging barrier, while a preliminary implementation of a reference counting field-logging barrier performs with around 1% to 2% overhead compared to a highly tuned object-logging reference counting barrier. These results suggest that garbage collection algorithms that require the precision of exactly remembering field mutations without sacrificing performance may now be possible, adding a new mechanism to the design toolkit available to garbage collection researchers.
|
Berger, Blackburn, Brodley, Jagadish, McKinley, Nascimento, Shin, Wang, Xie,
GOTO Rankings Considered Helpful,
in Communications of the ACM, 2019.
Rankings are a fact of life. Whether or not one likes them (a previous Communications editorial argued we should eschew rankings altogether4), they exist and are influential. Within academia, and in computer science in particular, rankings not only capture our attention but also widely influence people who have a limited understanding of computing science research, including prospective students, university administrators, and policymakers. In short, rankings matter.
|
Wang, Blackburn, Hosking, Norrish,
Hop, Skip, & Jump: Practical On-Stack Replacement for a Cross-Platform Language-Neutral VM,
in VEE’18: Proceedings of the 2017 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, March 24–25, 2018, Williamsburg, VA, USA, 2018.
On-stack replacement (OSR) is a performance-critical technology for many languages, especially dynamic languages. Conventional wisdom, apparent in JavaScript engines such as V8 and SpiderMonkey, is that OSR must be implemented in a low-level (ie, in assembly) and language-specific way. This paper presents an OSR abstraction based on Swapstack, materialized as the API for a low-level virtual machine, and shows how the abstraction of resumption protocols facilitates an elegant implementation of this API on real hardware. Using an experimental JavaScript implementation, we demonstrate that this API enables the language implementation to perform OSR without the need to deal with machine-level details. We also show that the API itself is implementable on concrete hardware. This work helps crystallize OSR abstractions and, by providing a reusable implementation, brings OSR within reach for more language implementers.
|
Yang, Blackburn, McKinley,
Elfen scheduling: Fine-grain principled borrowing from latency-critical workloads using simultaneous multithreading,
in Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC'16), Denver, CO, June 22-24, 2016.
Web services from search to games to stock trading impose strict Service Level Objectives (SLOs) on tail latency. Meeting these objectives is challenging because the computational demand of each request is highly variable and load is bursty. Consequently, many servers run at low utilization (10 to 45%); tur off simultaneous multithreading (SMT); and execute only a single service — wasting hardware, energy, and money. Although co-running batch jobs with latency critical requests to utilize multiple SMT hardware contexts (lanes) is appealing, unmitigated sharing of core resources induces non-linear effects on tail latency and SLO violations. We introduce principled borrowing to control SMT hardware execution in which batch threads borrow core resources. A batch thread executes in a reserved batch SMT lane when no latency-critical thread is executing in the partner request lane. We instrument batch threads to quickly detect execution in the request lane, step out of the way, and promptly return the borrowed resources. We introduce the nanonap system call to stop the batch thread’s execution without yielding its lane to the OS scheduler, ensuring that requests have exclusive use of the core’s resources. We evaluate our approach for co-locating batch workloads with latency-critical requests using the Apache Lucene search engine. A conservative policy that executes batch threads only when request lane is idle improves utilization between 90% and 25% on one core depending on load, without compromising request SLOs. Our approach is straightforward, robust, and unobtrusive, opening the way to substantially improved resource utilization in datacenters running latency-critical workloads.
|
Jibaja, Cao, Blackburn, McKinley,
Portable performance on asymmetric multicore processors,
in Proceedings of the 14th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2016.
|
Lin, Blackburn, Hosking, Norrish,
Rust as a language for high performance GC implementation,
in Proceedings of the Sixteenth ACM SIGPLAN International Symposium on Memory Management, ISMM '16, Santa Barbara, CA, June 13, 2016.
High performance garbage collectors build upon performance-critical low-level code, typically exhibit multiple levels of concurrency, and are prone to subtle bugs. Implementing, debugging and maintaining such collectors can therefore be extremely challenging. The choice of implementation language is a crucial consideration when building a collector. Typically, the drive for performance and the need for efficient support of low-level memory operations leads to the use of low-level languages like C or C++, which offer little by way of safety and software engineering benefits. This risks undermining the robustness and flexibility of the collector design. Rust’s ownership model, lifetime specification, and reference borrowing deliver safety guarantees through a powerful static checker with little runtime overhead. These features make Rust a compelling candidate for a collector implementation language, but they come with restrictions that threaten expressiveness and efficiency. We describe our experience implementing an Immix garbage collector in Rust and C. We discuss the benefits of Rust, the obstacles encountered, and how we overcame them. We show that our Immix implementation has almost identical performance on micro benchmarks, compared to its implementation in C, and outperforms the popular BDW collector on the gcbench micro benchmark. We find that Rust’s safety features do not create significant barriers to implementing a high performance collector. Though memory managers are usually considered low-level, our high performance implementation relies on very little unsafe code, with the vast majority of the implementation benefiting from Rust’s safety. We see our experience as a compelling proof-of-concept of Rust as an implementation language for high performance garbage collection.
|
Kumar, Dolby, Blackburn,
Integrating asynchronous task parallelism and data-centric atomicity,
in Proceedings of PPPJ’16, the 13th International Conference on Principles and Practices of Programming on the Java platform: Virtual machines, languages, and tools, Lugano, Switzerland, 2016.
Processor design has turned toward parallelism and heterogeneous cores to achieve performance and energy efficiency. Developers find high-level languages attractive as they use abstraction to offer productivity and portability over these hardware complexities. Over the past few decades, researchers have developed increasingly advanced mechanisms to deliver performance despite the overheads naturally imposed by this abstraction. Recent work has demonstrated that such mechanisms can be exploited to attack overheads that arise in emerging high-level languages, which provide strong abstractions over parallelism. However, current implementation of existing popular high-level languages, such as Java, offer little by way of abstractions that allow the developer to achieve performance in the face of extensive hardware parallelism. In this paper, we present a small set of extensions to the Java programming language that aims to achieve both high performance and high productivity with minimal programmer effort. We incorporate ideas from languages like X10 and AJ to develop five annotations in Java for achieving asynchronous task parallelism and data-centric concurrency control. These annotations allow the use of a highly efficient implementation of a work-stealing scheduler for task parallelism. We evaluate our proposal by refactoring classes from a number of existing multithreaded open source projects to use our new annotations. Our results suggest that these annotations significantly reduce the programming effort while achieving performance improvements up to 30% compared to conventional approaches.
|
Blackburn, Diwan, Hauswirth, Sweeney, Amaral, Brecht, Bulej, Click, Eeckhout, Fischmeister, Frampton, Hendren, Hind, Hosking, Jones, Kalibera, Keynes, Nystrom, Zeller,
The truth, the whole truth, and nothing but the truth: A pragmatic guide to assessing empirical evaluations,
in ACM Trans. Program. Lang. Syst., 2016.
An unsound claim can misdirect a field, encouraging the pursuit of unworthy ideas and the abandonment of promising ideas. An inadequate description of a claim can make it difficult to reason about the claim, for example, to determine whether the claim is sound. Many practitioners will acknowledge the threat of unsound claims or inadequate descriptions of claims to their field. We believe that this situation is exacerbated, and even encouraged, by the lack of a systematic approach to exploring, exposing, and addressing the source of unsound claims and poor exposition. This article proposes a framework that identifies three sins of reasoning that lead to unsound claims and two sins of exposition that lead to poorly described claims and evaluations. Sins of exposition obfuscate the objective of determining whether or not a claim is sound, while sins of reasoning lead directly to unsound claims. Our framework provides practitioners with a principled way of critiquing the integrity of their own work and the work of others. We hope that this will help individuals conduct better science and encourage a cultural shift in our research community to identify and promulgate sound claims.
|
Lin, Wang, Blackburn, Norrish, Hosking,
Stop and go: Understanding yieldpoint behavior,
in Proceedings of the Fourteenth ACM SIGPLAN International Symposium on Memory Management, ISMM '15, Portland, OR, June 14, 2015.
Yieldpoints are critical to the implementation of high performance garbage collected languages, yet the design space is not well understood. Yieldpoints allow a running program to be interrupted at well-defined points in its execution, facilitating exact garbage collection, biased locking, on-stack replacement, profiling, and other important virtual machine behaviors. In this paper we identify and evaluate yieldpoint design choices, including previously undocumented designs and optimizations. One of the designs we identify opens new opportunities for very low overhead profiling. We measure the frequency with which yieldpoints are executed and establish a methodology for evaluating the common case execution time overhead. We also measure the median and worst case time-to-yield. We find that Java benchmarks execute about 100 M yieldpoints per second, of which about 1/20000 are taken. The average execution time overhead for untaken yieldpoints on the VM we use ranges from 2.5% to close to zero on modern hardware, depending on the design, and we find that the designs trade off total overhead with worst case time-to-yield. This analysis gives new insight into a critical but overlooked aspect of garbage collector implementation, and identifies a new optimization and new opportunities for very low overhead profiling.
|
Yang, Blackburn, McKinley,
Computer performance microscopy with Shim,
in ISCA’15: The 42nd International Symposium on Computer Architecture, 2015.
Developers and architects spend a lot of time trying to understand and eliminate performance problems. Unfortunately, the root causes of many problems occur at a fine granularity that existing continuous profiling and direct measurement approaches cannot observe. This paper presents the design and implementation of Shim, a continuous profiler that samples at resolutions as fine as 15 cycles; three to five orders of magnitude finer than current continuous profilers. Shim’s fine-grain measurements reveal new behaviors, such as variations in instructions per cycle (IPC) within the execution of a single function. A Shim observer thread executes and samples autonomously on unutilized hardware. To sample, it reads hardware performance counters and memory locations that store software state. Shim improves its accuracy by automatically detecting and discarding samples affected by measurement skew. We measure Shim’s observer effects and show how to analyze them. When on a separate core, Shim can continuously observe one software signal with a 2% overhead at a 1200 cycle resolution. At an overhead of 61%, Shim samples on software signal on the same core with SMT at a 15 cycle resolution. Modest hardware changes could significantly reduce overheads and add greater analytical capability to Shim. We vary prefetching and DVFS policies in case studies that show the diagnostic power of fine-grain IPC and memory bandwidth results. By repurposing existing hardware, we deliver a practical tool for fine-grain performance microscopy for developers and architects.
|
Jibaja, Jensen, Hu, Haghighat, McCutchan, Gohman, Blackburn, McKinley,
Vector parallelism in Javascript: Language and compiler support for SIMD,
in Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), 2015.
|
Wang, Lin, Blackburn, Norrish, Hosking,
Draining the swamp: Micro virtual machines as solid foundation for language development,
in 1st Summit on Advances in Programming Languages (SNAPL 2015), 2015.
Many of today’s programming languages are broken. Poor performance, lack of features and hard-to-reason-about semantics can cost dearly in software maintenance and inefficient execution. The problem is only getting worse with programming languages proliferating and hardware becoming more complicated. An important reason for this brokenness is that much of language design is implementation-driven. The difficulties in implementation and insufficient understanding of concepts bake bad designs into the language itself. Concurrency, architectural details and garbage collection are three fundamental concerns that contribute much to the complexities of implementing managed languages. We propose the micro virtual machine, a thin abstraction designed specifically to relieve implementers of managed languages of the most fundamental implementation challenges that currently impede good design. The micro virtual machine targets abstractions over memory (garbage collection), architecture (compiler backend), and concurrency. We motivate the micro virtual machine and give an account of the design and initial experience of a concrete instance, which we call Mu, built over a two year period. Our goal is to remove an important barrier to performant and semantically sound managed language design and implementation.
|
Shahriyar, Blackburn, McKinley,
Fast conservative garbage collection,
in OOPSLA ’14: Proceeding of the 25th ACM SIGPLAN conference on Object Oriented Programming Systems Languages and Applications, 2014.
Garbage collectors are exact or conservative. An exact collector identifies all references precisely and may move referents and update references, whereas a conservative collector treats one or more of stack, register, and heap references as ambiguous. Ambiguous references constrain collectors in two ways. (1) Since they may be pointers, the collectors must retain referents. (2) Since they may be values, the collectors cannot modify them, pinning their referents. We explore conservative collectors for managed languages, with ambiguous stacks and registers. We show that for Java benchmarks they retain and pin remarkably few heap objects: <0.01% are falsely retained and 0.03% are pinned. The larger effect is collector design. Prior conservative collectors (1) use mark-sweep and unnecessarily forgo moving all objects, or (2) use mostly copying and pin entire pages. Compared to generational collection, overheads are substantial: 12% and 45% respectively. We introduce high performance conservative Immix and reference counting (RC). Immix is a mark-region collector with fine line-grain pinning and opportunistic copying of unambiguous referents. Deferred RC simply needs an object map to deliver the first conservative RC. We implement six exact collectors and their conservative counterparts. Conservative Immix and RC come within 2 to 3% of their exact counterparts. In particular, conservative RC Immix is slightly faster than a well-tuned exact generational collector. These findings show that for managed languages, conservative collection is compatible with high performance.
|
Kumar, Blackburn, Grove,
Friendly barriers: Efficient work-stealing with return barriers,
in VEE’14: Proceedings of the 2014 ACM SIGPLAN/SIGOPS international conference on Virtual Execution Environments, 2014.
This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy that a runtime may use to keep otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Recent work identified sequential overheads of work-stealing, those that occur even when no stealing takes place, as a significant source of overhead. That work was able to reduce sequential overheads to just 15%. In this work, we turn to dynamic overheads, those that occur each time a steal takes place. We show that the dynamic overhead is dominated by introspection of the victim’s stack when a steal takes place. We exploit the idea of a low overhead return barrier to reduce the dynamic overhead by approximately half, resulting in total performance improvements of as much as 20%. Because, unlike prior work, we attack the overheads directly due to stealing and therefore attack the overheads that grow as parallelism grows, we improve the scalability of work-stealing applications. This result is complementary to recent work addressing the sequential overheads of work-stealing. This work therefore substantially relieves work-stealing of the increasing pressure due to increasing intra-node hardware parallelism.
|
Sartor, Heirman, Blackburn, Eeckout, McKinley,
Cooperative cache scrubbing,
in PACT ’14 Proceedings of the 23rd International Conference on Parallel Architectures and Compilation Techniques, 2014.
Managing the limited resources of power and memory bandwidth while improving performance on multicore hardware is challenging. In particular, more cores demand more memory bandwidth, and multi-threaded applications increasingly stress memory systems, leading to more energy consumption. However, we demonstrate that not all memory traffic is necessary. For modern Java programs, 10 to 60% of DRAM writes are useless, because the data on these lines are dead - the program is guaranteed to never read them again. Furthermore, reading memory only to immediately zero initialize it wastes bandwidth. We propose a software/hardware cooperative solution: the memory manager communicates dead and zero lines with cache scrubbing instructions. We show how scrubbing instructions satisfy MESI cache coherence protocol invariants and demonstrate them in a Java Virtual Machine and multicore simulator. Scrubbing reduces average DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57% on a range of configurations. Cooperative software/hardware cache scrubbing reduces memory bandwidth and improves energy efficiency, two critical problems in modern systems.
|
Shahriyar, Blackburn, Yang, McKinley,
Taking off the gloves with reference counting immix,
in OOPSLA ’13: Proceeding of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications, 2013.
Despite some clear advantages and recent advances, reference counting remains a poor cousin to high-performance tracing garbage collectors. The advantages of reference counting include a) immediacy of reclamation, b) incrementality, and c) local scope of its operations. After decades of languishing with hopelessly bad performance, recent work narrowed the gap between reference counting and the fastest tracing collectors to within 10%. Though a major advance, this gap remains a substantial barrier to adoption in performance-concious application domains. Our work identifies heap organization as the principal source of the remaining performance gap. We present the design, implementation, and analysis of a new collector, RC Immix, that replaces reference counting’s traditional free-list heap organization with the line and block heap structure introduced by the Immix collector. The key innovations of RC Immix are 1) to combine traditional reference counts with per-line live object counts to identify reusable memory and 2) to eliminate fragmentation by integrating copying with reference counting of new objects and with backup tracing cycle collection. In RC Immix, reference counting offers efficient collection and the line and block heap organization delivers excellent mutator locality and efficient allocation. With these advances, RC Immix closes the 10% performance gap, outperforming a highly tuned production generational collector. By removing the performance barrier, this work transforms reference counting into a serious alternative for meeting high performance objectives for garbage collected languages.
|
Gao, Strauss, Blackburn, McKinley, Burger, Larus,
Using managed runtime systems to tolerate holes in wearable memories,
in PLDI ’13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, Seattle, WA, USA, June, 2013.
New memory technologies, such as phase-change memory (PCM), promise denser and cheaper main memory, and are expected to displace DRAM. However, many of them experience permanent failures far more quickly than DRAM. DRAM mechanisms that handle permanent failures rely on very low failure rates and, if directly applied to PCM, are extremely inefficient: Discarding a page when the first line fails wastes 98% of the memory. This paper proposes low complexity cooperative software and hardware that handle failure rates as high as 50%. Our approach makes error handling transparent to the application by using the memory abstraction offered by managed languages. Once hardware error correction for a memory line is exhausted, rather than discarding the entire page, the hardware communicates the failed line to a failure-aware OS and runtime. The runtime ensures memory allocations never use failed lines and moves data when lines fail during program execution. This paper describes minimal extensions to an Immix mark-region garbage collector, which correctly utilizes pages with failed physical lines by skipping over failures. This paper also proposes hardware support that clusters failed lines at one end of a memory region to reduce fragmentation and improve performance under failures. Contrary to accepted hardware wisdom that advocates for wear-leveling, we show that with software support non-uniform failures delay the impact of memory failure. Together, these mechanisms incur no performance overhead when there are no failures and at failure levels of 10% to 50% sufffer an oaverage overhead of 4% and 12%, respectively. These results indicate that hardware and software cooperation can greatly extend the life of wearable memories.
|
Cao, Blackburn, Gao, McKinley,
The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software,
in ISCA ’12: The 39th International Symposium on Computer Architecture, 2012.
On the hardware side, asymmetric multicore processors present software with the challenge and opportunity of optimizing in two dimensions: performance and power. Asymmetric multicore processors (AMP) combine general-purpose big (fast, high power) cores and small (slow, low power) cores to meet power constraints. Realizing their energy efficiency opportunity requires workloads with differentiated performance and power characteristics. On the software side, managed workloads written in languages such as C#, Java, JavaScript, and PHP are ubiquitous. Managed languages abstract over hardware using Virtual Machine (VM) services (garbage collection, interpretation, and/or just-in-time compilation) that together impose substantial energy and performance costs, ranging from 10% to over 80% We show that these services differing degrees, they are parallel, asynchronous, communicate infrequently, and are not on the application's critical path. We identify a synergy between AMP and VM services that we exploit to attack the 40% average energy overhead due to VM services. Using measurements and very conservative models, we show that adding small cores tailored for VM services should deliver, at least, improvements in performance of 13%, energy of 7%, and performance per energy of 22%. The yin of VM services is overhead, but it meets the yang of small cores on an AMP. The yin of AMP is exposed hardware complexity, but it meets the yang of abstraction in managed languages. VM services fulfill the AMP requirement for an asynchronous, non-critical, differentiated, parallel, and ubiquitous workload to deliver energy efficiency. Generalizing this approach beyond system software to applications will require substantially more software and hardware investment, but these results show the potential energy efficiency gains are significant.
|
Esmaeilzadeh, Cao, Yang, Blackburn, McKinley,
Looking back and looking forward: Power, performance, and upheaval,
in Communications of the ACM, 2012.
The past 10 years have delivered two significant revolutions. (1) Microprocessor design has been transformed by the limits of chip power, wire latency, and Dennard scaling—leading to multicore processors and heterogeneity. (2) Managed languages and an entirely new software landscape emerged—revolutionizing how software is deployed, is sold, and interacts with hardware. Researchers most often examine these changes in isolation. Architects mostly grapple with microarchitecture design through the narrow software context of native sequential SPEC CPU benchmarks, while language researchers mostly consider microarchitecture in terms of performance alone. This work explores the clash of these two revolutions over the past decade by measuring power, performance, energy, and scaling, and considers what the results may mean for the future. Our diverse findings include the following: (a) native sequential workloads do not approximate managed workloads or even native parallel workloads; (b) diverse application power profiles suggest that future applications and system software will need to participate in power optimization and management; and (c) software and hardware researchers need access to real measurements to optimize for power and energy.
|
Esmaeilzadeh, Cao, Yang, Blackburn, McKinley,
What is happening to power, performance, and software?,
in IEEE Micro, 2012.
Systematically exploring power, performance and energy sheds new light on the clash of two trends that unfolded over the past decade: the rise of parallel processors in response to technology constraints on power, clock speed, and wire delay; and the rise of managed high-level, portable programming languages.
|
Kumar, Frampton, Blackburn, Grove, Tardieu,
Work-stealing without the baggage,
in Proceedings of the 2012 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA 2012), Tucson, AZ, October 19-26, 2012.
Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. Prior work has demonstrated that work-stealing is very effective in practice. However, work-stealing comes with a substantial overhead: as much as 2X to 12X slowdown over orthodox sequential code. In this paper we identify the key sources of overhead in work-stealing schedulers and present two significant refinements to their implementation. We evaluate our work-stealing designs using a range of benchmarks, four different work-stealing implementations, including the popular fork-join framework, and a range of architectures. On these benchmarks, compared to orthodox sequential Java, our fastest design has an overhead of just 15 fork-join has a 2.3X overhead and the previous implementation of the system we use has an overhead of 4.1X. These results and our insight into the sources of overhead for work-stealing implementations give further hope to an already promising technique for exploiting increasingly available hardware parallelism.
|
Lin, Blackburn, Frampton,
Unpicking the knot: Teasing apart vm/application interdependencies,
in VEE ’12: Proceedings of the 2012 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2012.
Flexible and efficient runtime design requires an understanding of the dependencies among the components internal to the runtime and those between the application and the runtime. These dependencies are frequently unclear. This problem exists in all runtime design, and is most vivid in a metacircular runtime — one that is implemented in terms of itself. Metacircularity blurs boundaries between application and runtime implementation, making it harder to understand and make guarantees about overall system behavior, affecting isolation, security, and resource management, as well as reducing opportunities for optimization. Our goal is to shed new light on VM interdependencies, helping all VM designers understand these dependencies and thereby engineer better runtimes. We explore these issues in the context of a high-performance Java-in-Java virtual machine. Our approach is to identify and instrument transition points into and within the runtime, which allows us to establish a dynamic execution context. Our contributions are: 1) implementing and measuring a system that dynamically maintains execution context with very low overhead, 2) demonstrating that such a framework can be used to improve the software engineering of an existing runtime, and 3) analyzing the behavior and runtime characteristics of our runtime across a wide range of benchmarks. Our solution provides clarity about execution state and allowable transitions, making it easier to develop, debug, and understand managed runtimes.
|
Shahriyar, Blackburn, Frampton,
Down for the count? Getting reference counting back in the ring,
in Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM ’12, Beijing, China, June 15-16, 2012.
Reference counting and tracing are the two fundamental approaches that have underpinned garbage collection since 1960. However, despite some compelling advantages, reference counting is almost completely ignored in implementations of high performance systems today. In this paper we take a detailed look at reference counting to understand its behavior and to improve its performance. We identify key design choices for reference counting and analyze how the behavior of a wide range of benchmarks might affect design decisions. As far as we are aware, this is the first such quantitative study of reference counting. We use insights gleaned from this analysis to introduce a number of optimizations that significantly improve the performance of reference counting. We find that an existing modern implementation of reference counting has an average 30% overhead compared to tracing, and that in combination, our optimizations are able to completely eliminate that overhead. This brings the performance of reference counting on par with that of a well tuned mark-sweep collector. We keep our in-depth analysis of reference counting as general as possible so that it may be useful to other garbage collector implementers. Our finding that reference counting can be made directly competitive with well tuned mark-sweep should shake the community’s prejudices about reference counting and perhaps open new opportunities for exploiting reference counting’s strengths, such as localization and immediacy of reclamation.
|
Yang, Blackburn, Frampton, Hosking,
Barriers reconsidered, friendlier still!,
in Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM ’12, Beijing, China, June 15-16, 2012.
Read and write barriers mediate access to the heap allowing the collector to control and monitor mutator actions. For this reason, barriers are a powerful tool in the design of any heap management algorithm, but the prevailing wisdom is that they impose significant costs. However, changes in hardware and workloads make these costs a moving target. Here, we measure the cost of a range of useful barriers on a range of modern hardware and workloads. We confirm some old results and overturn others. We evaluate the microarchitectural sensitivity of barrier performance and the differences among benchmark suites. We also consider barriers in context, focusing on their behavior when used in combination, and investigate a known pathology and evaluate solutions. Our results show that read and write barriers have average overheads as low as 5.4% and 0.9% respectively. We find that barrier overheads are more exposed on the workload provided by the modern DaCapo benchmarks than on old SPECjvm98 benchmarks. Moreover, there are differences in barrier behavior between in-order and out-of-order machines, and their respective memory subsystems, which indicate different barrier choices for different platforms. These changing costs mean that algorithm designers need to reconsider their design choices and the nature of their resulting algorithms in order to exploit the opportunities presented by modern hardware.
|
Lin, Blackburn,
Bypassing portability pitfalls of high-level low-level programming,
in The 6th Workshop on Virtual Machines and Intermediate Languages (VMIL 2012), Tucson, AZ, October 21, 2012.
Program portability is an important software engineering consideration. However, when high-level languages are extended to effectively implement system projects for software engineering gain and safety, portability is compromised—high-level code for low-level programming cannot execute on a stock runtime, and, conversely, a runtime with special support implemented will not be portable across different platforms. We explore the portability pitfall of high-level low-level programming in the context of virtual machine implementation tasks. Our approach is designing a restricted high-level language called RJava, with a flexible restriction model and effective low-level extensions, which is suitable for different scopes of virtual machine implementation, and also suitable for a low-level language bypass for improved portability. Apart from designing such a language, another major outcome from this work is clearing up and sharpening the philosophy around language restriction in virtual machine design. In combination, our approach to solving portability pitfalls with RJava favors virtual machine design and implementation in terms of portability and robustness.
|
Kumar, Blackburn,
Faster work stealing with return barriers,
in The 6th Workshop on Virtual Machines and Intermediate Languages (VMIL 2012), Tucson, AZ, October 21, 2012, 2012.
Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Our prior work demonstrates that using the exception handling mechanism of modern VMs and gathering the runtime information directly from the victim’s execution stack can significantly reduce these overheads. In this paper we identify the overhead associated with managing the work-stealing related information on a victim’s execution stack. A return barrier is a mechanism for intercepting the popping of a stack frame, and thus is a powerful tool for optimizing mechanisms that involve scanning of stack state. We present the design and preliminary findings of using return barriers on a victim’s execution stack to reduce these overheads. We evaluate our design using classical work-stealing benchmarks. On these benchmarks, compared to our prior design, we are able to reduce the overheads by as much as 58%. These preliminary findings give further hope to an already promising technique of harnessing rich features of a modern VM inside a work-stealing scheduler.
|
Esmaeilzadeh, Cao, Yang, Blackburn, McKinley,
Looking back on the language and hardware revolutions: Measured power, performance, and scaling,
in Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, Newport Beach, CA, USA, March 5 - 11, 2011.
This paper reports and analyzes measured chip power and performance on five process technology generations executing 61 diverse benchmarks with a rigorous methodology. We measure representative Intel IA32 processors with technologies ranging from 130nm to 32nm while they execute sequential and parallel benchmarks written in native and managed languages. During this period, hardware and software changed substantially: (1) hardware vendors delivered chip multiprocessors instead of uniprocessors, and independently (2) software developers increasingly chose managed languages instead of native languages. This quantitative data reveals the extent of some known and previously unobserved hardware and software trends. Two themes emerge. (I) Workload The power, performance, and energy trends of native workloads do not approximate managed workloads. For example, (a) the SPEC CPU2006 native benchmarks on the i7-920 and i5-670 draw significantly less power than managed or scalable native benchmarks; and (b) managed runtimes exploit parallelism even when running single-threaded applications. The results recommend architects always include native and managed workloads to design and evaluate energy efficient designs. (II) Architecture Clock scaling, microarchitecture, simultaneous multithreading, and chip multiprocessors each elicit a huge variety of processor power, performance, and energy responses. This variety and the difficulty of obtaining power measurements recommends exposing on-chip power meters and when possible structure specific power meters for cores, caches, and other structures. Just as hardware event counters provide a quantitative grounding for performance innovations, power meters are necessary for optimizing energy.
|
Yang, Blackburn, Frampton, Sartor, McKinley,
Why nothing matters: The impact of zeroing,
in Proceedings of the 2011 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages & Applications (OOPSLA 2011), Portland, OR, October 22-27, 2011.
Memory safety defends against inadvertent and malicious misuse of memory that may compromise program correctness and security. A critical element of memory safety is zero initialization. The direct cost of zero initialization is surprisingly high: up to 12.7%, with average costs ranging from 2.7 to 4.5% on a high performance virtual machine on IA32 architectures. Zero initialization also incurs indirect costs due to its memory bandwidth demands and cache displacement effects. Existing virtual machines either: a) minimize direct costs by zeroing in large blocks, or b) minimize indirect costs by zeroing in the allocation sequence, which reduces cache displacement and bandwidth. This paper evaluates the two widely used zero initialization designs, showing that they make different tradeoffs to achieve very similar performance. Our analysis inspires three better designs: (1) bulk zeroing with cache-bypassing (non-temporal) instructions to reduce the direct and indirect zeroing costs simultaneously, (2) concurrent non-temporal bulk zeroing that exploits parallel hardware to move work off the application’s critical path, and (3) adaptive zeroing, which dynamically chooses between (1) and (2) based on available hardware parallelism. The new software strategies offer speedups sometimes greater than the direct overhead, improving total performance by 3% on average. Our findings invite additional optimizations and microarchitectural support.
|
Garner, Blackburn, Frampton,
A comprehensive evaluation of object scanning techniques,
in Proceedings of the Tenth ACM SIGPLAN International Symposium on Memory Management, ISMM ’11, San Jose, CA, USA, June 4 - 5, 2011.
At the heart of all garbage collectors lies the process of identifying and processing reference fields within an object. Despite its key role, and evidence of many different implementation approaches, to our knowledge no comprehensive quantitative study of this design space exists. The lack of such a study means that implementers must rely on ‘conventional wisdom’, hearsay, and their own costly analysis. Starting with mechanisms described in the literature and a variety of permutations of these, we explore the impact of a number of dimensions including: a) the choice of data structure, b) levels of indirection from object to metadata, and c) specialization of scanning code. We perform a comprehensive examination of these tradeoffs on four different architectures using eighteen benchmarks and hardware performance counters. We inform the choice of mechanism with a detailed study of heap composition and object structure as seen by the garbage collector on these benchmarks. Our results show that choice of scanning mechanism is important. We find that a careful choice of scanning mechanism alone can improve garbage collection performance by 16% and total time by 2.5%, on average, over a well tuned baseline. We observe substantial variation in performance among architectures, and find that some mechanisms–particularly specialization, layout of reference fields in objects, and encoding metadata in object headers–yield consistent, significant advantages.
|
Jibaja, Blackburn, Haghighat, McKinley,
Deferred gratification: Engineering for high performance garbage collection from the get go,
in Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC 2011), San Jose, CA, June 5, 2011, 2011.
Implementing a new programming language system is a daunting task. A common trap is to punt on the design and engineering of exact garbage collection and instead opt for reference counting or conservative garbage collection (GC). For example, AppleScript, Perl, Python, and PHP implementers chose reference counting (RC) and Ruby chose conservative GC. Although easier to get working, reference counting has terrible performance and conservative GC is inflexible and performs poorly when allocation rates are high. However, high performance GC is central to performance for managed languages and only becoming more critical due to relatively lower memory bandwidth and higher memory latency of modern architectures. Unfortunately, retrofitting support for high performance collectors later is a formidable software engineering task due to their exact nature. Whether they realize it or not, implementers have three routes: (1) forge ahead with reference counting or conservative GC, and worry about the consequences later; (2) build the language on top of an existing managed runtime with exact GC, and tune the GC to scripting language workloads; or (3) engineer exact GC from the ground up and enjoy the correctness and performance benefits sooner rather than later. We explore this conundrum using PHP, the most popular server side scripting language. PHP implements reference counting, mirroring scripting languages before it. Because reference counting is incomplete, the implementors must (a) also implement tracing to detect cyclic garbage, or (b) prohibit cyclic data structures, or (c) never reclaim cyclic garbage. PHP chose (a), AppleScript chose (b), and Perl chose (c). We characterize the memory behavior of five typical PHP programs to determine whether their implementation choice was a good one in light of the growing demand for high performance PHP. The memory behavior of these PHP programs is similar to other managed languages, such as Java — they allocate many short lived objects, a large variety of object sizes, and the average allocated object size is small. These characteristics suggest copying generational GC will attain high performance.Language implementers who are serious about correctness and performance need to understand deferred gratification: paying the software engineering cost of exact GC up front will deliver correctness and memory system performance later.
|
Cao, Blackburn, McKinley,
Virtual machine services: An opportunity for hardware customization,
in Workshop on energy-efficient computing for a sustainable world, Porto Alegre, Brazil, Dec 4, 2011.
|
Sartor, Blackburn, Frampton, Hirzel, McKinley,
Z-rays: Divide arrays and conquer speed and flexibility,
in ACM sigplan conference on programming language design and implementation, 2010.
Arrays are the ubiquitous organization for indexed data. Throughout programming language evolution, implementations have laid out arrays contiguously in memory. This layout is problematic in space and time. It causes heap fragmentation, garbage collection pauses in proportion to array size, and wasted memory for sparse and over-provisioned arrays. Because of array virtualization in managed languages, an array layout that consists of indirection pointers to fixed-size discontiguous memory blocks can mitigate these problems transparently. This design however incurs significant overhead, but is justified when real-time deadlines and space constraints trump performance. This paper proposes z-rays, a discontiguous array design with flexibility and efficiency. A z-ray has a spine with indirection pointers to fixed-size memory blocks called arraylets, and uses five optimizations: (1) inlining the first N array bytes into the spine, (2) lazy allocation, (3) zero compression, (4) fast array copy, and (5) arraylet copy-on-write. Whereas discontiguous arrays in prior work improve responsiveness and space efficiency, z-rays combine time efficiency and flexibility. On average, the best z-ray configuration performs within 12.7% of an unmodified Java Virtual Machine on 19 benchmarks, whereas previous designs have two to three times higher overheads. Furthermore, language implementers can configure z-ray optimizations for various design goals. This combination of performance and flexibility creates a better building block for past and future array optimization.
|
Esmaeilzadeh, Blackburn, Yang, McKinley,
Power and performance of native and Java benchmarks on 130nm to 32nm process technologies,
in Sixth Annual Workshop on Modeling, Benchmarking and Simulation, MOBS 2010, Saint-Malo, France, 2010.
Over the past decade, chip fabrication technology shrank from 130nm to 32nm. This reduction was generally considered to provide performance improvements together with chip power reduc tions. This paper examines how well process technology and microarchitecture delivered on this assumption. This paper evaluates power and performance of native and Java workloads across a selection of IA32 processors from five technology generations (130nm, 90nm, 65nm, 45nm, and 32nm). We use a Hall effect sensor to accurately measure chip power. This paper reports a range findings in three areas. 1) Methodology: TDP is unsurprisingly a poor predictor of application power consumption for a particular processor, but worse, TDP is a poor predictor of relative power consumption between processors. 2) Power-performance trends: Processors appear to have already hit the power wall at 45nm. 3) Native versus Java workloads and their relationship to processor technology: Single threaded Java workloads exploit multiple cores. These results indicate that Java workloads offer different opportunities and challenges compared to native workloads. Our findings challenge prevalent methodologies and offer new insight into how microarchitectures have traded power and performance as process technology shrank.
|
Ha, Arnold, Blackburn, McKinley,
A concurrent dynamic analysis framework for multicore hardware,
in OOPSLA ’09: Proceeding of the 24th ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications, 2009.
Software has spent the bounty of Moore’s law by solving harder problems and exploiting abstractions, such as high-level languages, virtual machine technology, binary rewriting, and dynamic analysis. Abstractions make programmers more productive and programs more portable, but usually slow them down. Since Moore’s law is now delivering multiple cores instead of faster processors, future systems must either bear a relatively higher cost for abstractions or use some cores to help tolerate abstraction costs. This paper presents the design, implementation, and evaluation of a novel concurrent, configurable dynamic analysis framework that efficiently utilizes multicore cache architectures. It introduces Cache-friendly Asymmetric Buffering (CAB), a lock-free ring-buffer that implements efficient communication between application and analysis threads. We guide the design and implementation of our framework with a model of dynamic analysis overheads. The framework implements exhaustive and sampling event processing and is analysis-neutral. We evaluate the framework with five popular and diverse analyses, and show performance improvements even for lightweight, low-overhead analyses. Efficient inter-core communication is central to high performance parallel systems and we believe the CAB design gives insight into the subtleties and difficulties of attaining it for dynamic analysis and other parallel software.
|
Frampton, Blackburn, Cheng, Garner, Grove, Moss, Salishev,
Demystifying magic: High-level low-level programming,
in VEE ’09: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2009.
The power of high-level languages lies in their abstraction over hardware and software complexity, leading to greater security, better reliability, and lower development costs. However, opaque abstractions are often show-stoppers for systems programmers, forcing them to either break the abstraction, or more often, simply give up and use a different language. This paper addresses the challenge of opening up a high-level language to allow practical low-level programming without forsaking integrity or performance. The contribution of this paper is three-fold: 1) we draw together common threads in a diverse literature, 2) we identify a framework for extending high-level languages for low-level programming, and 3) we show the power of this approach through concrete case studies. Our framework leverages just three core ideas: extending semantics via intrinsic methods, extending types via unboxing and architectural-width primitives, and controlling semantics via scoped semantic regimes. We develop these ideas through the context of a rich literature and substantial practical experience. We show that they provide the power necessary to implement substantial artifacts such as a high-performance virtual machine, while preserving the software engineering benefits of the host language. The time has come for high-level low-level programming to be taken more seriously: 1) more projects now use high-level languages for systems programming, 2) increasing architectural heterogeneity and parallelism heighten the need for abstraction, and 3) a new generation of high-level languages are under development and ripe to be influenced.
|
Blackburn, McKinley,
Immix: A mark-region garbage collector with space efficiency, fast collection, and mutator performance,
in ACM SIGPLAN Conference on Programming Language Design and Implementation, 2008.
Programmers are increasingly choosing managed languages for modern applications, which tend to allocate many short-to-medium lived small objects. The garbage collector therefore directly determines program performance by making a classic space-time tradeoff that seeks to provide space efficiency, fast reclamation, and mutator performance. The three canonical tracing garbage collectors: semi-space, mark-sweep, and mark-compact each sacrifice one objective. This paper describes a collector family, called mark-region, and introduces opportunistic defragmentation, which mixes copying and marking in a single pass. Combining both, we implement immix, a novel high performance garbage collector that achieves all three performance objectives. The key insight is to allocate and reclaim memory in contiguous regions, at a coarse block grain when possible and otherwise in groups of finer grain lines. We show that immix outperforms existing canonical algorithms, improving total application performance by 7 to 25% on average across 20 benchmarks. As the mature space in a generational collector, immix matches or beats a highly tuned generational collector, e.g. it improves SPEC jbb by 5%. These innovations and the identification of a new family of collectors open new opportunities for garbage collector design.
|
Blackburn, McKinley, Garner, Hoffmann, Khan, Bentzur, Diwan, Feinberg, Frampton, Guyer, Hirzel, Hosking, Jump, Lee, Eliot, Moss, Phansalkar, Stefanovic, VanDrunen, Dincklage, Wiedermann,
Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century,
in Communications of the ACM, 2008.
Evaluation methodology underpins all innovation in experi- mental computer science. It requires relevant workloads, appropriate experimental design, and rigorous analysis. Unfortunately, methodology is not keeping pace with the changes in our field. The rise of managed languages such as Java, C#, and Ruby in the past decade and the imminent rise of commodity multicore architectures for the next de- cade pose new methodological challenges that are not yet widely understood. This paper explores the consequences of our collective inattention to methodology on innovation, makes recommendations for addressing this problem in one domain, and provides guidelines for other domains. We describe benchmark suite design, experimental design, and analysis for evaluating Java applications. For example, we introduce new criteria for measuring and selecting diverse applications for a benchmark suite. We show that the complexity and nondeterminism of the Java runtime system make experimental design a first-order consideration, and we recommend mechanisms for addressing complexity and nondeterminism. Drawing on these results, we suggest how to adapt methodology more broadly. To continue to deliver innovations, our field needs to significantly increase participation in and funding for developing sound methodological foundations.
|
Blackburn, Salishev, Danilov, Mokhovikov, Nashatyrev, Novodvorsky, Bogdanov, Li, Ushakov,
The Moxie JVM experience,
in ANU Technical Report TR-CS-08-01, 2008.
By January 1998, only two years after the launch of the first Java virtual machine, almost all JVMs in use today had been architected. In the nine years since, technology has advanced enormously, with respect to the underlying hardware, language implementation, and in the application domain. Although JVM technology has moved forward in leaps and bounds, basic design decisions made in the 90’s has anchored JVM implementation. The Moxie project set out to explore the question: "How would we design a JVM from scratch knowing what we know today?’ Amid the mass of design questions we faced, the tension between performance and flexibility was pervasive, persistent and problematic. In this experience paper we describe the Moxie project and its lessons, a process which began with consulting experts from industry and academia, and ended with a fully working prototype.
|
Ha, Gustafsson, Blackburn, McKinley,
Microarchitectural characterization of production JVMs and Java workloads,
in IBM CAS Workshop, Austin, TX, 2008.
Understanding and comparing Java Virtual Machine (JVM) performance at a microarchitectural level can identify JVM performance anomalies and potential opportunities for optimization. The two primary tools for microarchitectural performance analysis are hardware performance counters and cycle accurate simulators. Unfortunately, the nondeterminism, complexity, and size of modern JVMs make these tools difficult to apply and therefore the microarchitectural performance of JVMs remains under-studied. We propose and use new methodologies for measuring unmodified production JVMs using both performance counters and a cycle accurate simulator. Our experimental design controls nondeterminism within a single measurement by using multiple iterations after a steady state is reached. We also use call-backs provided by the production JVMs to isolate application performance from the garbage collector, and where supported, the JIT. Finally, we use conventional statistical approaches to understand the effect of the remaining sources of measurement noise such as nondeterministic JIT optimization plans. This paper describes these methodologies and then reports on work in progress using these methodologies to compare IBM J9, BEA JRockit, and Sun HotSpot JVM performance with hardware performance counters and simulation. We examine one benchmark in detail to give a flavor of the depth and type of analyses possible with this methodology.
|
Blackburn, Hertz, McKinley, Moss, Yang,
Profile-based pretenuring,
in ACM Transactions on Programming Languages and Systems, 2007.
Pretenuring can reduce copying costs in garbage collectors by allocating long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on pretenuring as follows: (1) We produce pretenuring advice that is neutral with respect to the garbage collector algorithm and configuration. We thus can and do combine advice from different applications. We find for our benchmarks that predictions using object lifetimes at each allocation site in Java programs are accurate, which simplifies the pretenuring implementation. (2) We gather and apply advice to both applications and Jikes RVM, a compiler and runtime system for Java written in Java. Our results demonstrate that building combined advice into Jikes RVM from different application executions improves performance, regardless of the application Jikes RVM is compiling and executing. This build-time advice thus gives user applications some benefits of pretenuring, without any application profiling. No previous work uses profile feedback to pretenure in the runtime system. (3) We find that application-only advice also consistently improves performance, but that the combination of build-time and application-specific advice is almost always noticeably better. (4) Our same advice improves the performance of generational, Older First, and Beltway collectors, illustrating that it is collector neutral. (5) We include an immortal allocation space in addition to a nursery and older generation, and show that pretenuring to immortal space has substantial benefit.
|
Garner, Blackburn, Frampton,
Effective prefetch for mark-sweep garbage collection,
in The 2007 International Symposium on Memory Management, ISMM'07, 2007.
Garbage collection is a performance-critical feature of most modern object oriented languages, and is characterized by poor locality since it must traverse the heap. In this paper we show that by combining two very simple ideas we can significantly improve the performance of the canonical mark-sweep collector, resulting in improvements in application performance. We make three main contributions: 1) we develop a methodology and framework for accurately and deterministically analyzing the tracing loop at the heart of the collector, 2) we offer a number of insights and improvements over conventional design choices for mark-sweep collectors, and 3) we find that two simple ideas — edge order traversal and software prefetch — combine to greatly improve garbage collection performance although each is unproductive in isolation. We perform a thorough analysis in the context of MMTk and Jikes RVM on a wide range of benchmarks and four different architectures. Our baseline system (which includes a number of our improvements) is very competitive with highly tuned alternatives. We show a simple marking mechanism which offers modest but consistent improvements over conventional choices. Finally, we show that enqueuing the edges (pointers) of the object graph rather than the nodes (objects) significantly increases opportunities for software prefetch, despite increasing the total number of queue operations. Combining edge ordered enqueuing with software prefetching yields average performance improvements over a large suite of benchmarks of 20-30% in garbage collection time and 4-6% of total application performance in moderate heaps, across four architectures.
|
Blackburn, Garner, Hoffmann, Khang, McKinley, Bentzur, Diwan, Feinberg, Frampton, Guyer, Hirzel, Hosking, Jump, Lee, Eliot, Moss, Phansalkar, Stefanovic, VanDrunen, Dincklage, Wiedermann,
The DaCapo benchmarks: Java benchmarking development and analysis,
in OOPSLA ’06: Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, 2006.
Since benchmarks drive computer science research and industry product development, which ones we use and how we evaluate them are key questions for the community. Despite complex runtime tradeoffs due to dynamic compilation and garbage collection required for Java programs, many evaluations still use methodologies developed for C, C++, and Fortran. SPEC, the dominant purveyor of benchmarks, compounded this problem by institutionalizing these methodologies for their Java benchmark suite. This paper recommends benchmarking selection and evaluation methodologies, and introduces the DaCapo benchmarks, a set of open source, client-side Java benchmarks. We demonstrate that the complex interactions of (1) architecture, (2) compiler, (3) virtual machine, (4) memory management, and (5) application require more extensive evaluation than C, C++, and Fortran which stress (4) much less, and do not require (3). We use and introduce new value, time-series, and statistical metrics for static and dynamic properties such as code complexity, code size, heap composition, and pointer mutations. No benchmark suite is definitive, but these metrics show that DaCapo improves over SPEC Java in a variety of ways, including more complex code, richer object behaviors, and more demanding memory system requirements. This paper takes a step towards improving methodologies for choosing and evaluating benchmarks to foster innovation in system design and implementation for Java and other managed languages.
|
Huang, Blackburn, Grove, McKinley,
Fast and efficient partial code reordering: Taking advantage of dynamic recompilation,
in The 2006 Iinternational Symposium on Memory Management (ISMM 2006), Ottawa, Ontario, Canada, 2006.
Poor instruction cache locality can degrade performance on modern architectures. For example, our simulation results show that eliminating all instruction cache misses improves performance by as much as 16% for a modestly sized instruction cache. In this paper, we show how to take advantage of dynamic code generation in a Java Virtual Machine (VM) to improve instruction locality at run-time. We develop a dynamic code reordering (DCR) system; a low overhead, online approach for improving instruction locality. DCR has three optimizations: (1) Interprocedural method separation; (2) Intraprocedural code splitting; and (3) Code padding. DCR uses the dynamic call graph and an edge profile that most VMs already collect to separate hot/cold methods and hot/cold code within a method. It also puts padding between methods to minimize conflict misses between frequent caller/callee pairs. It incrementally performs these optimizations only when the VM is optimizing a method at a higher level. We implement DCR in Jikes RVM and show its overhead is negligible. Extensive simulation and run-time experiments show that a simple code space improves average performance on a Pentium 4 by around 6% on SPEC and DaCapo Java benchmarks. These programs however have very small instruction cache footprints that limit opportunities for DCR to improve performance. Consequently, DCR optimizations on average show little effect, sometimes degrading performance and occasionally improving performance by up to 5%. Our work shows that the VM has the potential to dynamically improve instruction locality incrementally by simply piggybacking on hotspot recompilation.
|
Hertz, Blackburn, Moss, McKinley, Stefanovic,
Generating object lifetime traces with Merlin,
in ACM Transactions on Programming Languages and Systems, 2006.
Programmers are writing a rapidly growing number of programs in object-oriented languages, such as Java and C\#, that require garbage collection. Garbage collection traces and simulation speed up research by enabling deeper understandings of object lifetime behavior and quick exploration and design of new garbage collection algorithms. When generating perfect traces, the brute-force method of computing object lifetimes requires a whole-heap garbage collection at every potential collection point in the program. Because this process is prohibitively expensive, researchers often use granulated traces by collecting only periodically, for example, every 32 KB of allocation.We extend the state of the art for simulating garbage collection algorithms in two ways. First, we develop a systematic methodology for simulation studies of copying garbage collection and present results showing the effects of trace granularity on these simulations. We show that trace granularity often distorts simulated garbage collection results compared with perfect traces. Second, we present and measure the performance of a new algorithm called Merlin for computing object lifetimes. Merlin timestamps objects and later uses the timestamps of dead objects to reconstruct when they died. The Merlin algorithm piggybacks on garbage collections performed by the base system. Experimental results show that Merlin can generate traces over two orders of magnitude faster than the brute-force method which collects after every object allocation. We also use Merlin to produce visualizations of heap behavior that expose new object lifetime behaviors.
|
Blackburn, McKinley,
Transient caches and object streams,
in ANU Technical Report TR-CS-06-03, 2006.
Memory latency limits program performance. Object-oriented languages such as C\# and Java exacerbate this problem, but their software engineering benefits make them increasingly popular. We show that current memory hierarchies are not particularly well suited to Java in which object streams write and read a window of short-lived objects that pollute the cache. These observations motivate the exploration of transient caches which assist a parent cache. For an L1 parent cache, transient caches are positioned similarly to a classic L0, providing one cycle access time. Their distinguishing features are (1) they are tiny (4 to 8 lines), (2) they are highly associative, and (3) the processor may seek them in parallel with their parent. They can assist any cache level. To address object stream behavior, we explore policies for read and write instantiation, promotion, filtering, and valid bits to implement no-fetch on write. Good design points include a parallel L0 (PL0) which improves Java programs by 3 simulation over a two-cycle 32KB, 128B line, 2-way L1. A transient qualifying cache (TQ) improves further by a) minimizing pollution in the parent by filtering short-lived lines without temporal reuse, and b) using a write no-fetch policy with per-byte valid bits to eliminate wasted fetch bandwidth. TQs at L1 and L2 improve Java programs by 5 even achieves improvements when the parent has half the capacity or associativity compared to the original larger L1. The one-cycle access time, a write no-fetch policy, and filtering bestow these benefits. Java motivates this approach, but it also improves for C programs.
|
Alpern, Augart, Blackburn, Butrico, Cocchi, Cheng, Dolby, Fink, Grove, Hind, McKinley, Mergen, Moss, Ngo, Sarkar, Trapp,
The Jikes Research Virtual Machine project: Building an open source research community,
in IBM Systems Journal, 2005.
This paper describes the evolution of the Jikes Research Virtual Machine project from an IBM internal research project, called Jalapeno, into an open-source project. After summarizing the original goals of the project, we discuss the motivation for releasing it as an open-source project and the activities performed to ensure the success of the project. Throughout, we highlight the unique challenges of developing and maintaining an open-source project designed specifically to support a research community.
|
Paz, Petrank, Blackburn,
Age-oriented concurrent garbage collection,
in 14th International Conference on Compiler Construction (cc), Edinburgh, Scotland, April 4-8, 2005.
Generational collectors are well known as a tool for shortening pause times incurred by garbage collection and for improving garbage collection efficiency. In this paper, we investigate how to best use generations with on-the-fly collectors. On-the-fly collectors run concurrently with the program threads and induce very short program pauses. Thus, the motivation for incorporating generations is focused at improving the throughput; pauses do not matter, since they are already very short. We propose a new collection approach, denoted age-oriented collection, for exploiting the generational hypothesis to obtain better efficiency. This approach is particularly useful when reference counting is used to collect the old generation, yielding a highly efficient and non-obtrusive on-the-fly collector. Finally, an implementation is provided demonstrating how the age-oriented collector outperforms both the non-generational and the generational collectors’ efficiency.
|
Huang, Blackburn, McKinley, Moss, Wang, Cheng,
The garbage collection advantage: Improving program locality,
in Proceedings of the 2004 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA 2004), Vancouver, Canada, October 24-28, 2004.
As improvements in processor speed continue to outpace improvements in cache and memory speed, poor locality increasingly degrades performance. Because copying garbage collectors move objects, they have an opportunity to improve locality. However, no static copying order is guaranteed to match program traversal orders. This pa- per introduces online object reordering (OOR) which includes a new dynamic, online class analysis for Java that detects program traversal patterns and exploits them in a copying collector. OOR uses run-time method sampling that drives just-in-time (JIT) compilation. For each hot (frequently executed) method, OOR analysis identifies the hot field accesses. At garbage collection time, the OOR collector then copies referents of hot fields together with their parent. Enhancements include static analysis to exclude accesses in cold basic blocks, heuristics that decay heat to respond to phase changes, and a separate space for hot objects. The overhead of OOR is on average negligible and always less than 2% on Java benchmarks in Jikes RVM with MMTk. We compare program performance of OOR to static class-oblivious copying orders (e.g., breadth and depth first). Performance variation due to static orders is often low, but can be up to 25%. More troubling is the absence of a consistent winner. In contrast, OOR matches or improves upon the best static order since its history-based copying tunes memory layout to program traversal.
|
Blackburn, Cheng, McKinley,
Oil and water? High performance garbage collection in Java with MMTk,
in ICSE 2004, 26th International Conference on Software Engineering, Edinburgh, Scotland, May 23-28, 2004.
Increasingly popular languages such as Java and C# require efficient garbage collection. This paper presents the design, implementation, and evaluation of MMTk, a Memory Management Toolkit for and in Java. MMTk is an efficient, composable, extensible, and portable framework for building garbage collectors. MMTk uses design patterns and compiler cooperation to combine modularity and efficiency. The resulting system is more robust, easier to maintain, and has fewer defects than monolithic collectors. Experimental comparisons with monolithic Java and C implementations reveal MMTk has significant performance advantages as well. Performance critical system software typically uses monolithic C at the expense of flexibility. Our results refute common wisdom that only this approach attains efficiency, and suggest that performance critical software can embrace modular design and high-level languages.
|
Blackburn, Cheng, McKinley,
Myths and realities: The performance impact of garbage collection,
in SIGMETRICS – Performance 2004, Joint International Conference on Measurement and Modeling of Computer Systems, New York, NY, USA, June 12–16, 2004.
This paper explores and quantifies garbage collection (GC) behavior for three whole heap collectors and generational counterparts: copying semi-space, mark-sweep, and reference counting, the canonical algorithms from which essentially all other GC algorithms are derived. Efficient implementations in the memory management toolkit (MMTk) in Jikes RVM share all common mechanisms to provide a clean experimental platform. Performance counters and instrumentation measure timing and memory performance, separating GC and program behavior. Experiments on SPEC JVM Benchmarks reveal key algorithmic features and how they match program characteristics to explain the direct cost of GC as a function of heap size, and the indirect impact of GC on application performance. Results include the finding that the choice of GC algorithms can improve mutator locality, disputing the myth that "no GC is good GC." We show how the trade-offs in space utilization versus allocation and tracing costs in copying and mark-sweep collectors motivates a copying nursery for newly allocated objects, even without high nursery mortality. We find that object locality and pointer mutations demographics in the mature space are much less sensitive to GC algorithm, but that copying and mark-sweep for the mature space occasionally excel in different programs. This study is unique in its breadth of GC algorithms and its depth of analysis.
|
Blackburn, Hosking,
Barriers: Friend or foe?,
in The 2004 International Symposium on Memory Management (ISMM 2004), Vancouver, Canada, October 24-25, 2004.
Modern garbage collectors rely on read and write barriers imposed on heap accesses by the mutator, to keep track of references between different regions of the garbage collected heap, and to synchronize actions of the mutator with those of the collector. It has been a long-standing untested assumption that barriers impose significant overhead to garbage-collected applications. As a result, researchers have devoted effort to development of optimization approaches for elimination of unnecessary barriers, or proposed new algorithms for garbage collection that avoid the need for barriers while retaining the capability for independent collection of heap partitions. On the basis of the results presented here, we dispel the assumption that barrier overhead should be a primary motivator for such efforts. We present a methodology for precise measurement of mutator overheads for barriers associated with mutator heap accesses. We provide a taxonomy of different styles of barrier and measure the cost of a range of popular barriers used for different garbage collectors within Jikes RVM. Our results demonstrate that barriers impose surprisingly low cost to the mutator, though results vary by architecture. We found that the average overhead for a reasonable generational write barrier was less than 2% on average, and less than 6% overhead of an unconditional read barrier on the PowerPC was only 0.85%, while on the AMD it was 8.05%. With both read and write barriers, we found that second order locality effects were sometimes more important that the overhead of the barriers themselves, leading to counter-intuitive speedups in a number of situations.
|
Jump, Blackburn, McKinley,
Dynamic object sampling for pretenuring,
in The 2004 International Symposium on Memory Management (ISMM 2004), Vancouver, Canada, October 24-25, 2004.
Many state-of-the-art garbage collectors are generational, collecting the young nursery objects more frequently than old objects. These collectors perform well because young objects tend to die at a higher rate than old ones. However, these collectors do not examine object lifetimes with respect to any particular program or allocation site. This paper introduces low-cost object sampling to dynamically determine lifetimes. The sampler marks an object and records its allocation site every n bytes of allocation. The collector then computes per-site nursery survival rates. Sampling degrades total performance by only 3% on average for sample rates of 256 bytes in Jikes RVM, a rate at which overall lifetime accuracy compares well with sampling every object. An adaptive collector can use this information to tune itself. For example, pretenuring decreases nursery collection work by allocating new, but long-lived, objects directly into the mature space. We introduce a dynamic pretenuring mechanism that detects long-lived allocation sites and pretenures them, given sufficient samples. To react to phase changes, it occasionally backsamples. As with previous online pretenuring, consistent performance improvements on SPECjvm98 benchmarks are difficult to attain since only two combine sufficient allocation load with high nursery survival. Our pre-tenuring system consistently improves one of these, 213 javac, by 2% to 9% of total time by decreasing collection time by over a factor of two. Sampling and pretenuring overheads slow down all the others. This paper thus provides an efficient sampling mechanism that accurately predicts lifetimes, but leaves open optimization policies that can exploit this information.
|
Blackburn, McKinley,
Ulterior reference counting: Fast garbage collection without a long wait,
in Proceedings of the 2003 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA 2003), Anaheim, CA, USA, October 26-30, 2003.
General purpose garbage collectors have yet to combine short pause times with high throughput. For example, generational collectors can achieve high throughput. They have modest average pause times, but occasionally collect the whole heap and consequently incur long pauses. At the other extreme, concurrent collectors, including reference counting, attain short pause times but with significant performance penalties. This paper introduces a new hybrid collector that combines copying generational collection for the young objects and reference counting the old objects to achieve both goals. It restricts copying and reference counting to the object demographics for which they perform well. Key to our algorithm is a generalization of deferred reference counting we call Ulterior Reference Counting. Ulterior reference counting safely ignores mutations to select heap objects. We compare a generational reference counting hybrid with pure reference counting, pure mark-sweep, and hybrid generational mark-sweep collectors. This new collector combines excellent throughput, matching a high performance generational mark-sweep hybrid, with low maximum pause
|
Blackburn, Jones, McKinley, Moss,
Beltway: Getting around garbage collection gridlock,
in Proceedings of SIGPLAN 2002 Conference on Programming Languages Design and Implementation, PLDI’02, Berlin, June, 2002.
We present the design and implementation of a new garbage collection framework that significantly generalizes existing copying collectors. The Beltway framework exploits and separates object age and incrementality. It groups objects in one or more increments on queues called belts, collects belts independently, and collects increments on a belt in first-in-first-out order. We show that Beltway configurations, selected by command line options, act and perform the same as semi-space, generational, and older-first collectors, and encompass all previous copying collectors of which we are aware. The increasing reliance on garbage collected languages such as Java requires that the collector perform well. We show that the generality of Beltway enables us to design and implement new collectors that are robust to variations in heap size and improve total execution time over the best generational copying collectors of which we are aware by up to 40 small to moderate heap sizes. New garbage collection algorithms are rare, and yet we define not just one, but a new family of collectors that subsumes previous work. This generality enables us to explore a larger design space and build better collectors.
|
Hertz, Blackburn, Moss, McKinley, Stefanovic,
Error-free garbage collection traces: How to cheat and not get caught,
in SIGMETRICS 2002, International Conference on Measurements and Modeling of Computer Systems, June 15-19, 2002, Marina del Rey, California, USA, 2002.
Programmers are writing a large and rapidly growing number of programs in object-oriented languages such as Java that require garbage collection (GC). To explore the design and evaluation of GC algorithms quickly, researchers are using simulation based on traces of object allocation and lifetime behavior. The brute force method generates perfect traces using a whole-heap GC at every potential GC point in the program. Because this process is prohibitively expensive, researchers often use granulated traces by collecting only periodically, e.g., every 32K bytes of allocation. We extend the state of the art for simulating GC algorithms in two ways. First, we present a systematic methodology and results on the effects of trace granularity for a variety of copying GC algorithms. We show that trace granularity often distorts GC performance results compared with perfect traces, and that some GC algorithms are more sensitive to this effect than others. Second, we introduce and measure the performance of a new precise algorithm for generating GC traces which is over 800 times faster than the brute force method. Our algorithm, called Merlin, frequently timestamps objects and later uses the timestamps of dead objects to reconstruct precisely when they died. It performs only periodic garbage collections and achieves high accuracy at low cost, eliminating any reason to use granulated traces.
|
Blackburn, McKinley,
In or out? Putting write barriers in their place,
in Proceedings of the third International Symposium on Memory Management, ISMM ’11, Berlin, Germany, 2002.
In many garbage collected systems, the mutator performs a write barrier for every pointer update. Using generational garbage collectors, we study in depth three code placement options for remembered-set write barriers: inlined, out-of-line, and partially inlined (fast path inlined, slow path out-of-line). The fast path determines if the collector needs to remember the pointer update. The slow path records the pointer in a list when necessary. Efficient implementations minimize the instructions on the fast path, and record few pointers (from 0.16 to 3 stores in our benchmarks). We find the mutator performs best with a partially inlined barrier, by a modest 1.5 inlining. We also study the compilation cost of write-barrier code placement. We find that partial inlining reduces the compilation cost by 20 to 25 just-in-time compilation, the application is exposed to compiler activity. Regardless of the level of compiler activity, partial inlining consistently gives a total running time performance advantage over full inlining on the SPEC JVM98 benchmarks. When the compiler optimizes all application methods on demand and compiler load is highest, partial inlining improves total performance on average by 10.2
|
Stefanovic, Hertz, Blackburn, McKinley, Moss,
Older-first garbage collection in practice: Evaluation in a java virtual machine,
in Proceedings of the first ACM SIGPLAN Workshop on Memory System Performance (MSP), Berlin, Germany, June 16, 2002.
Until recently, the best performing copying garbage collectors used a generational policy which repeatedly collects the very youngest objects, copies any survivors to an older space, and then infrequently collects the older space. A previous study that used garbage-collection simulation pointed to potential improvements by using an Older-First copying garbage collection algorithm. The Older-First algorithm sweeps a fixed-sized window through the heap from older to younger objects, and avoids copying the very youngest objects which have not yet had sufficient time to die. We describe and examine here an implementation of the Older-First algorithm in the Jikes RVM for Java. This investigation shows that Older-First can perform as well as the simulation results suggested, and greatly improves total program performance when compared to using a fixed-size nursery generational collector. We further compare Older-First to a flexible-size nursery generational collector in which the nursery occupies all of the heap that does not contain older objects. In these comparisons, the flexible-nursery collector is occasionally the better of the two, but on average the Older-First collector performs the best.
|
Blackburn, Singhai, Hertz, McKinley, Moss,
Pretenuring for Java,
in Proceedings of the 2001 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA 2001), Tampa, FL, October 14-18, 2001.
Pretenuring can reduce copying costs in garbage collectors by allocating long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on pretenuring as follows. (1) We produce pretenuring advice that is neutral with respect to the garbage collector algorithm and configuration. We thus can and do combine advice from different applications. We find that predictions using object lifetimes at each allocation site in Java programs are accurate, which simplifies the pretenuring implementation. (2) We gather and apply advice to applications and the Jalapeno JVM, a compiler and run-time system for Java written in Java. Our results demonstrate that building combined advice into Jalapeno from different application executions improves performance regardless of the application Jalapeno is compiling and executing. This build-time advice thus gives user applications some benefits of pretenuring without any application profiling. No previous work pretenures in the run-time system. (3) We find that application-only advice also improves performance, but that the combination of build-time and application-specific advice is almost always noticeably better. (4) Our same advice improves the performance of generational and Older First collection, illustrating that it is collector neutral.
|
Marquez, Blackburn, Mercer, Zigman,
Implementing orthogonally persistent Java,
in Advances in persistent object systems: Proceedings of the 9th International Workshop on Persistent Object Systems (POS 9), Sep 6–8, 2000, Lillehammer, Norway, 2001.
Orthogonally persistent Java combines the power of abstraction over persistence with Java’s rich programming environment. In this paper we report our experience in designing and implementing orthogonally persistent Java. Our design approach is anchored by the view that any system that brings together Java and orthogonal persistence should as far as possible avoid diluting the strengths of Java or the principles of orthogonal persistence. Our approach is thus distinguished by three features: complete transparency of persistence, support for both intra and inter application concurrency through ACID transactions, and the preservation of Java’s property of portability. In addition to discussing design and implementation, we present results that show that our approach performs credibly.
|
He, Blackburn, Kirby, Zigman,
Platypus: The design and implementation of a flexible high performance object store.,
in Advances in persistent object systems: Proceedings of the 9th International Workshop on Persistent Object Systems (POS 9), Sep 6–8, 2000, Lillehammer, Norway, 2001.
This paper reports the design and implementation of Platypus, a transactional object store. The twin goals of flexibility and performance dominate the design of Platypus. The design includes: strong support for SMP concurrency; stand-alone, client-server and client-peer distribution configurations; configureable logging and recovery; and object management which can accommodate garbage collection and clustering mechanisms. The first implementation of Platypus incorporates a number of innovations. (1) A new recovery algorithm derived from ARIES that removes the need for log seqeuence numbers to be present on store pages. (2) A zero-copy memory-mapped buffer manager with controlled write-back behavior. (3) New data structures for highly concurrent locking and map querying. We present performance results comparing Platypus with a stream-lined implemention of the SHORE object store. For both medium and small OO7 workloads Platypus outperforms SHORE across a wide range of benchmark operations in both ‘hot’ and ‘cold’ settings.
|
Zigman, Blackburn, Moss,
TMOS: A transactional garbage collector,
in Advances in persistent object systems: Proceedings of the 9th International Workshop on Persistent Object Systems (POS 9), Sep 6–8, 2000, Lillehammer, Norway, 2001.
Defining persistence in terms of reachability is fundamental to achieving orthogonality of persistence. It is implicit to the principles of orthogonal persistence and is a part of the ODMG 3.0 data objects standard. Although space reclamation in the context of persistence by reachability can be achieved automatically using garbage collection, relatively few papers address the problem of implementing garbage collection in a transactional storage system. A transactional GC algorithm must operate correctly in the face of failure, and in particular must deal with the problem of transaction abort, which by undoing changes such as the deletion of references, subverts the GC reachability axiom of ‘once garbage always garbage’once garbage always garbage.In this paper we make two key contributions. First, we present a generic approach to the design of transactional collectors that promotes clarity, generality, and understandability, and then using this approach, we present a new transactional garbage collection algorithm, TMOS. Our design approach brings together three independent components—a mutator, a transactional store, and a GC algorithm. TMOS represents the application of the Mature Object Space family of GC algorithms to the transactional context through our approach to transactional GC design.
|
Blackburn, Hudson, Morrison, Moss, Munro, Zigman,
Starting with termination: A methodology for building distributed garbage collection algorithms,
in Australian Computer Science Conference, Proceedings, Jan 29–31, 2001, Gold Coast, Australia, 2001.
We propose an effective methodology in which a distributed garbage collector may be derived from a distributed termination algorithm and a centralized garbage collector in a manner that preserves interesting properties of the original collector, such as completeness. To illustrate our technique we show how two distributed termination algorithms, credit recovery and task balancing, may be suitably described; and then map four centralized garbage collectors reference counting, mark/scan, a generational scheme, and the Mature Object Space collector (MOS) onto this description. The advantage of our approach is that, by separating the issues of distribution and collection, we alleviate the difficulty of inventing, understanding, and comparing distributed garbage collection techniques.
|
Marquez, Blackburn,
Addressing complexity and scale in a high performance object server,
in Succeeding with object databases, 2001.
The so called ‘information explosion’ has lead to an enormous demand for networked information, which has in turn placed enormous pressure on information servers. The design of high performance server technology has thus become a hot topic in computer science. In this paper we describe how our exposure to a large real-world application has shaped our response to two key issues for high performance object server technologies—the management of complexity and the management of scale. The Australian Bureau of Statistics’ Business Register (BR) is an OODB application that forms a register of all businesses in Australia. The register is fundamental to much of the ABS’ economic survey activity and is used by many branches of the large organization. The basis of the register is an object model that reflects the business structure of all Australian businesses, from large multinationals through to corner stores. There are over 100,000,000 objects in the register and it is constantly updated, both through operator-driven data entry and the batch processing of data from other government agencies. Specific requirements are acceptable response time for interactive access for more that 100 users, long and short transaction support, efficient storage and online access to historic information, flexible object version views and schema evolution support. The challenges raised by the BR are dominated by the problems of complexity and scale. Our response to these challenges has been guided by our view that abstraction has a basic role in addressing complexity and scale, and by our use of Java as an implementation context. In this paper we report not only our approach, but also novel technological outcomes which stem from our exposure to the challenges highlighted by the ABS-BR. These outcomes include: portable orthogonally persistent Java, a system for object versioning, a proposal for schema versioning, and a system architecture for scalable object servers based on a separation of programming language and storage system concerns. These outcomes are examined in the context of an ABS-BR technology demonstrator—a small-scale version of the ABS-BR built to test and showcase new technologies for high performance object servers.
|
He, Marquez, Blackburn,
Opportunistic prioritised clustering framework (OPCF),
in Objects and databases, international symposium, sophia antipolis, france, june 13, 2000, proceedings, 2001.
Ever since the ‘early days’ of database management systems, clustering has proven to be one of the most effective performance enhancement techniques for object oriented database management systems. The bulk of the work in the area has been on static clustering algorithms which re-cluster the object base when the database is static. However, this type of re-clustering cannot be used when 24-hour database access is required. In such situations dynamic clustering is required, which allows the object base to be reclustered while the database is in operation. We believe that most existing dynamic clustering algorithms lack three important properties. These include: the use of opportunism to imposes the smallest I/O footprint for re-organisation; the re-use of prior research on static clustering algorithms; and the prioritisation of re-clustering so that the worst clustered pages are re-clustered first. In this paper, we present OPCF, a framework in which any existing static clustering algorithm can be made dynamic and given the desired properties of I/O opportunism and clustering prioritisation. In addition, this paper presents a performance evaluation of the ideas suggested above and in particular shows the importance of I/O opportunism in improving the performance of dynamic clustering algorithms in a variety of situations. The main contribution of this paper is the observation that existing static clustering algorithms, when transformed via a simple transformation framework such as OPCF, can produce dynamic clustering algorithms that out-perform complex existing dynamic algorithms, in a variety of situations. This makes the solution presented in this paper particularly attractive to real OODBMS system implementers who often prefer to opt for simpler solutions.
|
Marquez, Zigman, Blackburn,
Fast portable orthogonally persistent Java,
in Software: Practice and Experience, 2000.
A powerful feature of the Java programming language is its user-definable class loading policy, which when combined with the namespace independence between class loaders, allows portable implementation of semi-dynamic program transformations. Such transformations can be used for a range of purposes, including optimization and semantic extension. In this paper we present a framework for semantic extensions in Java. This framework consists of a number of simple but powerful transformations that, among other things, allow us to semantically extend Java to provide orthogonal persistence. The use of semi-dynamic program transformations lends our orthogonally persistent Java a number of important qualities, including simplicity, portability and a clean model of persistence. Significantly, our implementations are efficient and can outperform in some cases PJama, a well-known orthogonally persistent Java, which is based on a modified virtual machine. In addition to describing the application of these transformations to orthogonally persistent Java, we foreshadow their use in a number of other contexts, including dynamic instance versioning and instrumentation.
|
Zigman, Blackburn,
Java finalize method, orthogonal persistence and transactions,
in Advances in Persistent Object Systems: Proceedings of the 8th International Workshop on Persistent Object Systems (POS8) and proceedings of the 3rd International Workshop on Persistence and Java (PJW3), Aug 30–Sep 1, 1998, Tiburon, CA, USA, 1999.
Java is a popular, object oriented language that is runtime type safe. As such, it has been seen as an attractive basis for the implementation of orthogonally persistent systems by several research groups. Transactions are widely used as a means of enforcing consistency of the stable image in the face of concurrency, and have been adopted by most groups developing persistent Java systems. However, Java has a user definable *finalize* method which provides an asynchronous cleanup mechanism. The strict temporal semantics of transactions and the asynchrony of the finalize method seem at odds. This paper describes this conflict and provides a strategy for resolving the problem.
|
Blackburn, Zigman,
Concurrency—The fly in the ointment?,
in Advances in Persistent Object Systems: Proceedings of the 8th International Workshop on Persistent Object Systems (POS8) and proceedings of the 3rd International Workshop on Persistence and Java (PJW3), Aug 30–Sep 1, 1998, Tiburon, CA, USA, 1999.
Concurrency is a central pillar of the Java programming language, is implicit in the transactional model of computation adopted by most persistent systems, and has been widely studied in the context of orthogonal persistence. We argue that despite the substantial literature on concurrency control and transaction models for orthogonal persistence, a basic question as to the interaction between concurrency and orthogonal persistence has yet to be adequately addressed. The demands of orthogonality appear to place orthogonal persistence at odds with more general approaches to concurrency control. Given its stated objective of providing a ‘complete’ computational environment, the future of the orthogonal persistence vision in general, and persistent Java in particular, will depend, to some extent, on the release of this tension through the realization of new approaches to the integration of concurrency and persistence. This paper addresses both strict transactional and more ‘open’ approaches to managing concurrency in face of persistence, and examines difficulties with each approach. A simple transaction model that relieves some of these difficulties is presented and its limitations briefly examined.
|
Blackburn, Stanton,
The transactional object cache: A foundation for high performance persistent system construction.,
in Advances in Persistent Object Systems: Proceedings of the 8th International Workshop on Persistent Object Systems (POS8) and proceedings of the 3rd International Workshop on Persistence and Java (PJW3), Aug 30–Sep 1, 1998, Tiburon, CA, USA, 1999.
This paper argues that caching, atomicity and layering are fundamental to persistent systems, and that the transactional object cache architecture, as an embodiment of these concerns, provides a foundation for high performance persistent system construction. Central to the paper is a description of the semantics of an abstract transactional object cache architecture that supports a wide range of transactional models and is open to a broad spectrum of transactional cache coherency algorithms. The centrality of the abstraction is a consequence of its role in facilitating the definition of a transactional interface, which is the key to the practical application of the transactional object cache architecture. The utility of the architectural framework in general, and the interface in particular, is argued for in the context of existing systems and systems currently under construction within that framework.
|
Blackburn, Stanton,
Scalable multicomputer object spaces: A foundation for high performance systems,
in Third International Working Conference on Massively Parallel Programming Models, Nov 12–14, 1997, London, UK, 1998.
The development of scalable architectures at store levels of a layered model has concentrated on processor parallelism balanced against scalable memory bandwidth, primarily through distributed memory structures of one kind or another. A great deal of attention has been paid to hiding the distribution of memory to produce a single store image across the memory structure. It is unlikely that the distribution and concurrency aspects of scalable computing can be completely hidden at that level. The paper argues for a store layer which respects the need for caching and replication, and to do so at an "object" level granularity of memory use. These facits are interrelated through atomic processes, leading to an interface for the store which is strongly transactional in character. The paper describes the experimental performance of such a layer on a scalable multi-computer architecture. The behaviour of the store supports the view that a scalable cached "transactional" store architecture is a practical objective for high performance based on parallel computation across distributed memories.
|