A Research Report on the Golang Green Tea Garbage Collector

Posted by LB on Sun, Aug 17, 2025


I. Executive Summary

The Green Tea garbage collector (GC) is an experimental, memory-aware parallel marking algorithm for the Go runtime. It represents a fundamental redesign of the garbage collection process, shifting from an object-centric to a memory-block-centric approach to combat performance bottlenecks caused by poor memory locality in modern many-core systems. This report analyzes its design, performance characteristics, and strategic importance.

The core finding is that Green Tea successfully addresses its primary goal: reducing GC CPU costs and improving scalability by enhancing memory access patterns. It achieves this by scanning memory in large, contiguous blocks called “spans” instead of chasing pointers to individual objects across the heap. This strategy improves both spatial and temporal locality, leading to significant performance gains in GC-intensive workloads. In select microbenchmarks, Green Tea demonstrates a 10–50% reduction in GC CPU costs, with improvements scaling directly with the number of CPU cores. It also halves the number of L1 and L2 cache misses in these scenarios.

However, the performance on real-world applications is more nuanced and highly dependent on the application’s heap topology. Workloads with high-fanout data structures that naturally create locality, such as the tile38 benchmark, see substantial improvements (a 35% reduction in GC overhead). Conversely, workloads with low-fanout, frequently mutated structures that shuffle memory, like the bleve-index benchmark, struggle to benefit and may even see minor regressions on lower-core machines, though they still gain from improved scalability on high-core-count systems.

The Green Tea algorithm is not a replacement for the entire garbage collector but an enhancement to its most expensive phase: concurrent marking. It coexists with the traditional marking algorithm, applying its span-based strategy to small objects where the overheads are highest, while larger objects are handled by the existing mechanism. This new algorithm is a direct response to the growing “memory wall” problem and represents a critical step toward ensuring Go’s performance on future hardware architectures.

II. Introduction: The Memory Wall and Go’s GC

2.1 The Performance Challenge: Poor Memory Locality

Modern high-performance computing is increasingly constrained by memory latency and bandwidth. While CPU core counts and clock speeds have grown, the speed of DRAM access has not kept pace. This divergence, often called the “memory wall,” means that the efficiency of memory access patterns—spatial and temporal locality—is now a critical factor in application performance.

Go’s standard garbage collector, a highly optimized tri-color concurrent mark-sweep algorithm, functions by traversing the graph of live objects. This process, at its core, is a “graph flood,” where the collector follows pointers from one object to the next. This approach has no inherent consideration for the physical memory location of these objects. As a result, the GC can jump between disparate memory regions, leading to extremely poor spatial locality. This inefficiency is quantifiable: over 35% of CPU cycles during the GC’s core “scan loop” are spent stalled, waiting for data from memory. As systems move toward more cores and non-uniform memory access (NUMA) architectures, this problem is expected to worsen significantly.
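To make the locality problem concrete, here is a minimal Go sketch of such a "graph flood" worklist loop. The object type and markGraphFlood function are illustrative stand-ins, not the runtime's internals: the point is that each pop follows pointers to wherever the children happen to live, with no regard for physical address order.

```go
package main

import "fmt"

// object is a hypothetical heap object: a mark flag plus outgoing pointers.
type object struct {
	marked   bool
	children []*object
}

// markGraphFlood is an object-centric mark loop: pop one object at a time,
// mark it, and push its children. Each child may live anywhere in the heap,
// so the next iteration can jump to a completely different memory region.
func markGraphFlood(roots []*object) int {
	work := append([]*object(nil), roots...) // worklist of individual objects
	visited := 0
	for len(work) > 0 {
		obj := work[len(work)-1] // pop one object
		work = work[:len(work)-1]
		if obj.marked {
			continue
		}
		obj.marked = true
		visited++
		// Following these pointers one object at a time is what
		// destroys spatial locality in practice.
		work = append(work, obj.children...)
	}
	return visited
}

func main() {
	leaf := &object{}
	root := &object{children: []*object{leaf, {}, {children: []*object{leaf}}}}
	fmt.Println(markGraphFlood([]*object{root})) // 4 distinct live objects reached
}
```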

2.2 The Green Tea Solution: A Memory-Aware Approach

The Green Tea algorithm is a direct answer to this challenge. It re-architects the parallel marking phase to be “memory-aware.” Instead of treating the heap as an abstract graph of objects, it operates on coarse, contiguous blocks of physical memory. By processing objects that are physically close to one another together, it aims to dramatically improve cache utilization and reduce memory stalls. This new algorithm is currently available as an experiment in Go 1.25 and has shown a significant reduction in GC CPU costs on suitable workloads, unlocking new potential for future optimizations like SIMD acceleration.

III. Algorithmic Design and Mechanics

3.1 Core Concept: From Object-Centric to Span-Centric Marking

The central idea behind Green Tea is simple yet profound: instead of queuing individual objects to be scanned, the GC queues large, contiguous blocks of memory called spans. A span is a run of memory, typically 8 KiB, that contains objects of a single size class.

The core hypothesis is that by delaying the scan of a span, it will accumulate multiple marked objects within its bounds. When a GC worker eventually dequeues and processes the span, it can scan several objects that are physically co-located, leading to better cache performance and amortizing the overhead of processing the span itself. This contrasts sharply with the old algorithm, where processing a single object might prefetch data that is never used again as the GC immediately jumps to a new object in a completely different memory region.
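Because a span holds objects of a single size class, mapping a pointer to its object slot is simple index arithmetic, which is one reason span-local metadata is cheap to maintain. A sketch (spanBase and elemSize are hypothetical inputs; the real runtime derives them from span metadata):

```go
package main

import "fmt"

const spanSize = 8 << 10 // 8 KiB span, as described in the text

// objIndex maps a pointer into its slot number within a span whose
// objects all share one element size.
func objIndex(spanBase, ptr, elemSize uintptr) uintptr {
	return (ptr - spanBase) / elemSize
}

func main() {
	base := uintptr(0x1000)
	// Two pointers into the same 64-byte size-class span land in
	// neighboring slots, so scanning them together stays cache-friendly.
	fmt.Println(objIndex(base, base+3*64, 64), objIndex(base, base+4*64, 64)) // 3 4
}
```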

3.2 Prototype Implementation Details

The experimental implementation of Green Tea focuses its efforts on small objects (up to 512 bytes), as the per-object overhead of the old GC is most pronounced in this category. Larger objects continue to be handled by the traditional object-centric algorithm.

  • Tri-Color Marking on Spans: The classic tri-color abstraction (white, gray, black) is maintained, but its metadata is stored within the span itself. Each span contains two bits per object: a gray bit and a black bit. An object is white (undiscovered) if neither bit is set, gray (discovered, needs scanning) if the gray bit is set, and black (scanned) if both bits are set.
  • The Marking Process: When the GC finds a pointer to a small object, it sets that object’s gray bit within its corresponding span. If the gray bit was not already set and the span is not currently in the work queue, the entire span is enqueued. When a worker dequeues the span, it identifies all gray-but-not-black objects, scans them for more pointers, and then sets their black bits.
  • Work Distribution and Scalability: To minimize contention, which is a significant problem in the standard GC’s global worklists, Green Tea uses a separate work queue for spans. This queue is based on the distributed, work-stealing runqueues used by the Go scheduler. This design allows worker goroutines to steal work directly from each other, reducing reliance on a central, contended queue. Furthermore, queuing a few large spans instead of millions of tiny objects inherently lowers the pressure on the queuing system.
  • Single-Object Scan Optimization: A potential inefficiency arises when a span is dequeued with only a single gray object, in which case the span-based approach would cost more than the old object-centric method. To mitigate this, Green Tea employs an optimization. When a span is first enqueued, the object that triggered it becomes the span’s “representative,” and a “hit” flag is added to the span. If another object in the span is marked gray while the span is queued, the hit flag is set. When a worker dequeues the span, if the hit flag is not set, it knows there is only one object to process and can scan the representative directly, bypassing the more expensive logic of iterating over the span’s metadata.

IV. Architectural Integration

Green Tea is not a standalone collector but an alternative marking strategy that integrates into the existing GC framework. The decision of which algorithm to use is made dynamically when a pointer is found.

The runtime maintains a bitmap that indicates whether any given 8 KiB page in the heap belongs to a small object span. When scanning an object, the GC consults this bitmap for each outgoing pointer it finds. If the pointer targets a small object span, the Green Tea marking logic is invoked; otherwise, the pointed-to object is processed using the traditional algorithm. This allows the system to gain the locality benefits for the vast majority of objects (which are small) without adding complexity to the handling of large objects.
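The per-pointer dispatch can be sketched with a page bitmap as follows; all names here are illustrative, and the real runtime's bitmap lookup and span metadata are more involved:

```go
package main

import "fmt"

const pageSize = 8 << 10 // 8 KiB heap pages, as described above

// smallSpanPages is a hypothetical stand-in for the runtime's bitmap that
// records which heap pages belong to small-object spans.
type smallSpanPages []uint64

func (b smallSpanPages) isSmallObjectSpan(heapBase, ptr uintptr) bool {
	page := (ptr - heapBase) / pageSize
	return b[page/64]&(1<<(page%64)) != 0
}

// dispatch shows the per-pointer decision: Green Tea marking for pointers
// into small-object spans, the traditional object-centric path otherwise.
func dispatch(b smallSpanPages, heapBase, ptr uintptr) string {
	if b.isSmallObjectSpan(heapBase, ptr) {
		return "green-tea" // set the gray bit in the span, maybe enqueue it
	}
	return "traditional" // existing object-centric marking
}

func main() {
	base := uintptr(0x100000)
	bitmap := make(smallSpanPages, 1)
	bitmap[0] |= 1 << 2 // mark page 2 as part of a small-object span
	fmt.Println(dispatch(bitmap, base, base+2*pageSize+128)) // green-tea
	fmt.Println(dispatch(bitmap, base, base+5*pageSize))     // traditional
}
```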

4.1 Architectural Diagram

The following diagram illustrates the decision-making and workflow of the Green Tea marking algorithm.

graph TD
    subgraph AppSide["Application or GC Worker"]
        A[Object Reference] -->|Finds pointer to B| P{Is target B in a small object span?}
    end
    subgraph Runtime["GC Runtime"]
        P -- No --> OldGC[Traditional object-centric marking]
        P -- Yes --> SetGray[Mark as gray]
        SetGray --> Q{Is span S already queued?}
        Q -- Yes --> End1(( ))
        Q -- No --> Enqueue[Add to work queue]
        Enqueue --> WQ[Work Queue]
        subgraph Workers["GC Worker Pool"]
            W[GC Worker] -->|Steals/Dequeues| WQ
            W -->|Checks 'hit' flag| Opt{Single object?}
            Opt -- Yes --> ScanRep[Scan single object]
            Opt -- No --> ScanAll[Iterate span metadata and scan all gray objects]
        end
    end
    style A fill:#FFE4B5,stroke:#FFA500
    style OldGC fill:#FFB6C1,stroke:#FF69B4
    style SetGray fill:#98FB98,stroke:#2E8B57
    style Q fill:#E6E6FA,stroke:#9370DB
    style Enqueue fill:#ADD8E6,stroke:#4682B4
    style WQ fill:#F0E68C,stroke:#DAA520
    style W fill:#FFA07A,stroke:#FF6347
    style Opt fill:#E6E6FA,stroke:#9370DB
    style ScanRep fill:#90EE90,stroke:#3CB371
    style ScanAll fill:#87CEFA,stroke:#1E90FF
    style AppSide fill:#FFFACD,stroke:#EEE8AA
    style Runtime fill:#F5F5F5,stroke:#D3D3D3

V. Empirical Performance Analysis

The prototype was evaluated across a variety of benchmarks on multi-core Linux VMs. The results confirm the algorithm’s strengths and weaknesses.

5.1 Benchmark Performance Summary

The performance of Green Tea is highly dependent on the workload’s memory access patterns and data structures.

| Benchmark Category | Workload Description | Green Tea GC Performance | Key Insight |
| --- | --- | --- | --- |
| GC-heavy microbenchmarks | garbage, binary-trees | 10–50% reduction in GC CPU cost | Scales better with more cores; halves L1/L2 cache misses. |
| tile38 | High-fanout tree | Substantial improvement (35% reduction in GC overhead) | High fan-out generates high scan density, the ideal case for Green Tea. |
| bleve-index | Low-fanout, frequently mutated binary tree | Performance is a wash (slight regression on 16-core, improvement on 72/88-core) | Struggles to generate locality from a shuffled heap; scalability wins on many-core systems. |
| Go compiler | Compiling Go code | Slight, inconsistent regression (~0.5%) | Largely insensitive to this GC change. |

Table 1: Summary of Green Tea GC Benchmark Results

5.2 Analysis of Results

The data reveals a clear pattern: Green Tea excels when an application’s data structures have good inherent locality. In the tile38 benchmark, the high-fanout tree means that objects pointing to each other are often located within the same memory regions, allowing Green Tea to efficiently find and scan dense spans.

In contrast, the bleve-index benchmark, which involves frequent tree rotations, actively shuffles the heap’s memory layout. This creates a worst-case scenario where pointers cross large regions of the heap, and Green Tea struggles to find spans with more than one object to scan. In this case, its performance relies heavily on the single-object scan optimization to remain competitive, and its main advantage comes from better scalability on high-core-count machines. This highlights a key characteristic: Green Tea capitalizes on existing locality but cannot create it out of a randomly organized heap.

VI. Future Work and Strategic Direction

The span-based marking approach opens up new avenues for optimization that were not feasible with the object-centric model.

  • SIMD-Accelerated Scanning: Because objects within a span have a regular, predictable memory layout, it becomes possible to use Single Instruction, Multiple Data (SIMD) CPU instructions (like AVX512) to process them. A prototype SIMD-based scanning kernel has already demonstrated the potential to reduce GC overheads by an additional 15-20%. This work is ongoing, as it requires a high density of pointers to be effective.
  • Concentrator Network: An earlier, more complex design for Green Tea involved a “concentrator network” to sort and group pointers, artificially creating the high density required for efficient SIMD processing. While this approach was deferred due to its complexity, it remains a viable path for future exploration to generate locality even in challenging workloads.

VII. Conclusion

The Green Tea garbage collector is a significant and promising evolution of Go’s runtime. It is a direct and well-reasoned attack on the “memory wall,” a fundamental performance barrier for modern software. By shifting the marking phase from a locality-agnostic “graph flood” to a memory-aware, block-based scanning algorithm, it substantially reduces GC CPU overhead and improves scalability on many-core systems for a wide range of workloads.

While not a universal performance win—its effectiveness is tied to the inherent locality of an application’s heap—it delivers major improvements where it matters most: in GC-heavy applications with favorable data structures. The development of Green Tea is a strategic investment in Go’s future performance, ensuring the language remains efficient and competitive as hardware continues to evolve toward massively parallel, memory-constrained architectures. Its experimental release provides a valuable opportunity for the community to test it on real-world workloads and help shape its final design.