Multithreading Overview
Instruction-Level Parallelism (ILP) and Data-Level Parallelism (DLP):
- Extracting ILP or DLP from a single sequential thread becomes increasingly difficult due to dependencies and pipeline hazards.
Thread-Level Parallelism (TLP):
- Multiprogramming: Running multiple independent jobs.
- Multithreaded Applications: Accelerating a single job by splitting it into parallel threads.
- Multithreading utilizes TLP to keep the processor busy by interleaving multiple threads.
Utilization Improvement:
- TLP improves the utilization of a single processor pipeline by filling idle cycles caused by hazards or dependencies in a single thread.
Pipeline Hazards
Problem: Instruction dependencies lead to delays in pipeline execution:
- Example: Each stage (Fetch, Decode, Execute, Memory, Writeback) can stall if the next instruction depends on data not yet produced.
Solutions to Cope with Hazards:
- Interlocks: Insert pipeline stalls until the dependency resolves; simple but slow.
- Bypassing/Forwarding: Hardware routes results directly between pipeline stages to reduce stalls. (The sketch below compares the two options.)
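To make the cost of these two options concrete, here is a minimal Python sketch, assuming a classic 5-stage pipeline with two back-to-back dependent instructions; the stage numbering and writeback timing are assumptions, not from the notes:

```python
# Minimal sketch: bubbles caused by a read-after-write (RAW) hazard in a
# classic 5-stage pipeline (IF=1, ID=2, EX=3, MEM=4, WB=5), comparing an
# interlock (stall until writeback) with EX->EX forwarding.

def raw_stall_cycles(forwarding: bool) -> int:
    # The producer occupies stage s in cycle s; the consumer, issued one
    # cycle later, would occupy stage s in cycle s + 1 if never stalled.
    if forwarding:
        # EX->EX bypass: the result leaves the producer's EX at the end of
        # cycle 3 and can feed the consumer's EX, earliest in cycle 4.
        needed, unstalled = 4, 4
    else:
        # Interlock only: the consumer reads registers in ID, so it must
        # sit in ID no earlier than the producer's WB cycle (cycle 5).
        needed, unstalled = 5, 3
    return max(0, needed - unstalled)

print("bubbles with interlocks only:", raw_stall_cycles(False))  # 2
print("bubbles with forwarding:     ", raw_stall_cycles(True))   # 0
```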
Multithreading to Address Hazards
- Goal: Eliminate dependency issues by interleaving instructions from different threads.
- Approach: Execute instructions from multiple threads in a round-robin or demand-based fashion on the same pipeline.
Example: Thread Interleaving
- Interleaving instructions from 4 threads (T1-T4) in a 5-stage pipeline:
- Ensures that writeback for one thread's instruction completes before the next instruction from the same thread begins.
- Dependencies are inherently avoided because interleaved instructions belong to different threads, as the sketch below makes visible.
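A small Python sketch of this schedule (thread and stage names are illustrative): it prints which thread occupies each stage per cycle.

```python
# Fixed round-robin interleaving of 4 threads (T1-T4) through a 5-stage
# pipeline. Each printed row shows the thread whose instruction occupies
# each stage in that cycle; within a column, threads rotate every cycle.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]
THREADS = ["T1", "T2", "T3", "T4"]

def occupant(cycle: int, stage: int) -> str:
    # The instruction in stage `stage` this cycle was fetched `stage` cycles ago.
    fetched = cycle - stage
    return THREADS[fetched % len(THREADS)] if fetched >= 0 else "--"

print("cycle  " + "  ".join(f"{s:>3}" for s in STAGES))
for cycle in range(8):
    row = "  ".join(f"{occupant(cycle, s):>3}" for s in range(len(STAGES)))
    print(f"{cycle:>5}  {row}")
```

Adjacent stages always hold different threads, and a thread's next instruction is fetched only in the cycle its previous instruction reaches writeback, so intra-thread dependencies cannot create hazards.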
Multithreaded Pipeline Design
- Thread Select Mechanism:
- Keeps track of which thread is being executed at each pipeline stage to ensure state correctness.
- To software, the multithreaded pipeline appears as multiple slower CPUs.
Example:
- A 2-thread pipeline:
- Each thread has its own Program Counter (PC) and General-Purpose Registers (GPRs).
- Shared resources like caches and memory are accessed based on the active thread, as the sketch below illustrates.
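A minimal Python sketch of the replicated state, with field and function names of my own choosing: each thread gets a private PC and register file, and the thread-select tag travels down the pipeline with the instruction.

```python
# Per-thread state in a 2-thread pipeline: each thread has a private PC and
# GPR file; shared structures (caches, memory) would be indexed using the
# active thread's tag. Names and values here are illustrative.

from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0
    gprs: list = field(default_factory=lambda: [0] * 32)  # private registers

contexts = [ThreadContext(pc=0x1000), ThreadContext(pc=0x2000)]

def fetch(cycle: int):
    tid = cycle % len(contexts)   # simple alternating thread select
    ctx = contexts[tid]
    pc = ctx.pc
    ctx.pc += 4                   # advance only the active thread's PC
    return tid, pc                # the tid tag travels with the instruction

for cycle in range(4):
    tid, pc = fetch(cycle)
    print(f"cycle {cycle}: fetch T{tid} @ {pc:#x}")
```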
Costs of Multithreading
State Overheads:
- Each thread requires its own:
- User state (PC, GPRs).
- System state (page table, exception registers).
Resource Conflicts:
- Increased cache and TLB contention among threads.
- May require larger caches or additional OS scheduling overhead.
Complex Scheduling:
- Hardware or software must efficiently manage which threads execute and when.
Thread Scheduling Policies
Fixed Interleave:
- Cycles are equally divided among threads.
- A thread not ready in its slot introduces pipeline bubbles.
Software-Controlled Interleave:
- OS allocates slots, and hardware interleaves instructions based on availability.
Hardware-Controlled Scheduling:
- Hardware monitors thread readiness and dynamically prioritizes execution; the toy model below contrasts this with fixed interleave.
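The difference between the first and last policy can be shown with a toy Python model (readiness patterns invented for illustration): fixed interleave wastes a slot whenever its owner is not ready, while hardware-controlled selection bubbles only when no thread is ready.

```python
# Toy comparison of scheduling policies. Each thread is a list of booleans:
# True means it could issue an instruction that cycle.
ready = {
    "T1": [True, False, False, True, True, True],
    "T2": [True, True, True, True, False, True],
}
CYCLES = len(ready["T1"])

def fixed_interleave_bubbles() -> int:
    # Each thread owns every Nth slot; an unready owner wastes its slot.
    names = list(ready)
    return sum(1 for c in range(CYCLES) if not ready[names[c % len(names)]][c])

def hardware_scheduled_bubbles() -> int:
    # Any ready thread may claim the slot; bubble only if none is ready.
    return sum(1 for c in range(CYCLES) if not any(r[c] for r in ready.values()))

print("bubbles, fixed interleave:  ", fixed_interleave_bubbles())   # 1
print("bubbles, hardware scheduled:", hardware_scheduled_bubbles()) # 0
```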
Key Takeaways
- Multithreading enhances CPU performance by addressing pipeline hazards and maximizing utilization through TLP.
- Design complexity and overheads require careful balance between thread scheduling, resource allocation, and performance gains.
Coarse-Grain Hardware Multithreading
Definition: Coarse-Grain Hardware Multithreading (CGMT) runs a single thread at full speed and switches the pipeline to another ready thread only on high-latency events, such as cache misses. Rather than interleaving threads every cycle, the processor keeps one thread resident until it would otherwise stall on a long-latency operation, then swaps in another thread to keep the pipeline active.
Key Characteristics:
Designed for Infrequent Stalls:
- Coarse-grain multithreading targets systems with relatively few pipeline stalls. It balances thread-switching overhead against the time spent on useful execution, which works best for workloads where stalls are infrequent.
Latency Hiding:
- The primary aim of CGMT is to hide latency introduced by high-latency events (e.g., cache misses). When one thread encounters such a latency, the processor can switch to another thread to continue processing, thus reducing idle time and improving overall throughput.
Thread Support and Context Switching:
- CGMT typically supports a smaller number of threads. The processor switches between these threads based on specific events, such as cache misses, which may cause the currently running thread to stall. This helps ensure that the execution pipeline remains occupied, minimizing idle time.
Minimal Context Switching Overhead:
- Thread switching in CGMT occurs less frequently than in fine-grained multithreading, and the switching mechanism is designed to minimize overhead while handling latency events efficiently; the simulation sketch below illustrates the behavior.
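A rough Python simulation of this behavior (miss cycles, miss latency, and switch penalty are invented numbers, and miss cycles are given as global cycle counts for simplicity): one thread runs until it misses, the processor pays a small pipeline-refill penalty to switch, and the miss latency overlaps with the other thread's execution.

```python
# Coarse-grained switching: run the current thread until a cache miss, then
# switch to a ready thread. All parameters below are assumptions.

MISS_LATENCY = 20    # cycles a missing thread stays blocked
SWITCH_PENALTY = 3   # pipeline-refill cycles per thread switch

def run_cgmt(miss_cycles: dict, total: int) -> float:
    blocked_until = {t: 0 for t in miss_cycles}
    current, cycle, useful = next(iter(miss_cycles)), 0, 0
    while cycle < total:
        if cycle in miss_cycles[current]:
            blocked_until[current] = cycle + MISS_LATENCY
            ready = [t for t in miss_cycles
                     if t != current and blocked_until[t] <= cycle]
            if ready:
                current = ready[0]
                cycle += SWITCH_PENALTY   # refill the pipeline
                continue
        if blocked_until[current] <= cycle:
            useful += 1                   # one instruction completes
        cycle += 1
    return useful / total

util = run_cgmt({"T1": {10, 60}, "T2": {35}}, total=100)
print(f"pipeline utilization: {util:.0%}")   # 91% in this toy run
```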
Advantages:
Simpler Implementation Compared to Fine-Grained Threading
Efficient in Low-Latency Stall Scenarios
Better Control of Execution
Limitations:
Underutilization of Pipeline During Frequent Stalls
Thread Switching Latency
Fine-Grained Hardware Multithreading
Definition: Fine-Grained Multithreading (FGMT) switches between hardware threads every cycle (often in round-robin order, skipping threads that are stalled), interleaving instructions from many threads at the finest possible granularity. Dividing execution into single-cycle slices gives the hardware tight control over which thread advances each cycle.
Advantages of Fine-Grained Multithreading:
- High flexibility: The hardware selects a thread every cycle, so scheduling can react immediately to stalls.
- Improved stall hiding: Even short stalls caused by data dependencies are hidden, because another thread's instruction can issue in the very next cycle.
- Enhanced parallelism: With enough ready threads, the pipeline stays full every cycle, exploiting thread-level parallelism more aggressively than coarse-grained switching.
Limitations:
- Higher complexity: The pipeline must hold many thread contexts and select among them every cycle, complicating both the hardware and its management.
- Lower single-thread throughput: Because cycles are shared across threads, each individual thread runs slower than it would with the pipeline to itself.
Historical Examples of Hardware Multithreading
1. Denelcor HEP (1982):
- First commercial machine to implement hardware threading in the CPU.
- Specifications:
- 120 threads per processor.
- 10 MHz clock rate.
- Configurable up to 8 processors.
- Significance:
- A precursor to more advanced multithreaded architectures like the Tera MTA and Cray XMT.
2. Tera (Cray) MTA (1990):
- Advanced architecture focusing on multithreaded performance.
- Specifications:
- Up to 256 processors, each supporting 128 active threads.
- Sparse 3D torus interconnection for processors and memory.
- No data cache but a flat, shared main memory.
- Sustained one memory access per cycle per processor.
- Technology:
- Prototype used GaAs (Gallium Arsenide) logic consuming 1 kW/processor at 260 MHz.
- Later versions (e.g., MTA-2) transitioned to CMOS for reduced power (50 W/processor).
Pipeline Architecture of MTA:
- Pipeline Stages:
- 21-cycle instruction pipeline.
- Memory operations have ~150 cycles of latency.
- Execution:
- Each cycle, one VLIW (Very Long Instruction Word) instruction from an active thread is launched.
- Memory, retry, and write pools are used to manage inter-thread resource contention; the back-of-the-envelope check below shows why so many threads are needed.
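These two latency figures suggest how many threads each processor must keep active; a back-of-the-envelope check in Python, under the assumption that each thread has at most one instruction in flight (as in barrel-style designs):

```python
# Why the MTA needs many hardware threads (assuming one in-flight
# instruction per thread).
pipeline_depth = 21    # a thread can re-issue only every 21 cycles
memory_latency = 150   # approximate cycles for a memory operation

print("threads needed to keep the pipeline full:", pipeline_depth)  # 21
print("threads needed if every op hits memory:  ", memory_latency)  # 150
print("threads supported per MTA processor:      128")
```

With 128 threads per processor, the MTA can always fill the 21-stage pipeline and can hide most, though not all, of the latency of a purely memory-bound instruction stream.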
3. MIT Alewife (1990):
- Focused on shared memory systems with hardware threading.
- Specifications:
- Modified SPARC chips.
- Register windows store multiple thread contexts.
- Up to 4 threads per node.
- Feature:
- Thread switching on local cache misses, ensuring efficient use of resources.
Oracle/Sun Niagara Processors
Designed for datacenter workloads like web servers and databases with high concurrency.
Key Features:
- Multiple simple cores with hardware threading.
- Reduced energy per operation but lower single-thread performance.
Generations:
- Niagara-1 (2004): 8 cores, 4 threads per core.
- Niagara-2 (2007): 8 cores, 8 threads per core.
- Niagara-3 (2009): 16 cores, 8 threads per core.
Advantage in Datacenters:
- Ideal for workloads with many concurrent requests, trading single-thread speed for high throughput and energy efficiency.
Comparison and Takeaways
Coarse-Grain Multithreading:
- Efficient for scenarios with infrequent pipeline stalls.
- Thread switching occurs on major events (e.g., cache misses).
Fine-Grain Multithreading (e.g., Niagara):
- Designed for workloads requiring high throughput.
- Multiple threads execute in a tightly interleaved manner, minimizing stalls.
Historical Impact:
- From Denelcor HEP to modern Niagara processors, multithreading has evolved to address diverse performance and energy efficiency needs in computing.
Simultaneous Multithreading (SMT)
SMT enables instructions from multiple threads to be executed simultaneously within a single processor core, utilizing resources more effectively. Unlike traditional multithreading (vertical or coarse-grained), SMT leverages the fine-grain control mechanisms already present in OOO superscalar processors.
Out-of-order (OOO) superscalar processors are a type of processor that can execute independent instructions out of order, as long as the correct program output is maintained.
Key Features of SMT:
Resource Sharing:
- Multiple threads share functional units (e.g., ALUs, FPUs) and execution pipelines.
- This maximizes the utilization of idle execution units, which are often underused in single-threaded or OOO designs.
Thread Interleaving:
- Instructions from different threads can be issued in the same cycle, leading to better overall throughput.
Efficient Resource Utilization:
- SMT addresses both horizontal waste (underutilization of issue slots) and vertical waste (idle cycles caused by stalls).
Key Innovations in SMT:
Tullsen, Eggers, and Levy's 1995 Work:
- Proposed interleaving instructions from multiple threads across multiple issue slots with no restrictions.
- Highlighted the inefficiency of idle execution units in OOO superscalars and demonstrated SMT's potential to fill these gaps.
OOO Simultaneous Multithreading (1996):
- Added multiple contexts (e.g., Program Counters, register sets) and fetch engines to fetch and issue instructions from different threads.
- Adapted the existing OOO instruction window to schedule instructions from multiple threads.
- Allowed single-thread execution to fully utilize machine resources if necessary.
Multithreading Efficiency:
Types of Waste:
Vertical Waste:
- Entire cycles are idle when no instructions are ready for issue.
- Addressed partially by coarse-grained and vertical multithreading.
Horizontal Waste:
- Issue slots are underutilized in cycles where fewer instructions are available than the issue width.
- SMT significantly reduces this by interleaving multiple threads, as the toy example below shows.
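A toy Python example (issue counts invented) that separates the two kinds of waste for a 4-wide machine:

```python
# Counting horizontal vs. vertical waste for a single thread on a 4-wide
# issue machine. Each entry is how many instructions the thread could
# issue that cycle; the numbers are invented for illustration.
ISSUE_WIDTH = 4
issued = [3, 0, 1, 4, 0, 2]   # six cycles of a single thread

vertical = sum(ISSUE_WIDTH for n in issued if n == 0)        # fully idle cycles
horizontal = sum(ISSUE_WIDTH - n for n in issued if n > 0)   # partly idle cycles
total = ISSUE_WIDTH * len(issued)

print(f"vertical waste:   {vertical} of {total} slots")    # 8 of 24
print(f"horizontal waste: {horizontal} of {total} slots")  # 6 of 24
# Coarse/fine-grained multithreading can fill the vertical waste with other
# threads; only SMT can also fill the horizontal waste within a cycle.
```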
Adaptation to Parallelism:
- High Thread-Level Parallelism (TLP):
- SMT allows multiple threads to share the machine width, increasing overall throughput.
- Low Thread-Level Parallelism:
- SMT provides the entire machine width to a single thread, leveraging instruction-level parallelism (ILP).
Architectural Examples:
Power 4 (IBM):
- Pioneered many techniques, but lacked SMT support.
Power 5 (IBM):
- Added SMT support to Power 4:
- Increased cache sizes and associativity (L1 to L3).
- Added per-thread load/store queues and virtual registers.
- Increased instruction issue queues.
- Resulted in a 24% larger core area compared to Power 4.
Pentium 4 Hyperthreading (2002):
- First commercial SMT implementation (2-way SMT).
- Logical processors shared nearly all resources, with minimal die area overhead (~5%).
- Revived in Nehalem (2008) after being dropped in Pentium-M and Core Duo architectures.
Initial SMT Performance:
- Pentium 4 Extreme SMT:
- Speedup: 1.01x (SPECint_rate), 1.07x (SPECfp_rate).
- Speedups for paired SPEC benchmarks varied from 0.90 to 1.58, averaging 1.20.
- Power 5:
- 1.23x faster for SPECint_rate, 1.16x for SPECfp_rate with SMT.
- Performance gains varied, with floating-point apps showing minimal improvements due to cache conflicts.
ICount Choosing Policy:
- SMT often uses policies like ICount, which fetches instructions from the thread with the fewest instructions in flight.
- This reduces contention and enhances throughput; a minimal sketch of the policy follows.
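A minimal Python sketch of the selection step, with an interface invented here (real implementations track per-thread counts in the fetch and issue stages):

```python
# ICount: each cycle, fetch from the thread with the fewest instructions
# in flight between fetch and issue. The counts below are illustrative.
in_flight = {"T0": 7, "T1": 3, "T2": 5}

def icount_pick(in_flight: dict) -> str:
    # The thread with the fewest in-flight instructions is moving through
    # the front end fastest and is least likely to clog the issue queue.
    return min(in_flight, key=in_flight.get)

print("fetch from:", icount_pick(in_flight))   # T1
```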
Comparison of Multithreading Techniques:
| Multithreading Technique | Description | Waste Type Addressed | Key Characteristics |
|---|---|---|---|
| Superscalar | Single-thread execution with multiple execution units. | None; exhibits both horizontal waste (idle issue slots) and vertical waste (idle cycles). | Executes multiple instructions from a single thread per cycle, but wastes execution resources whenever not all units can be used. |
| Fine-Grained Multithreading | Cycles through threads at each pipeline stage. | Vertical waste (unused cycles when a thread stalls). | Switches threads every cycle, keeping the pipeline active even if one thread is stalled. |
| Coarse-Grained Multithreading | Switches threads at larger intervals, typically on events like cache misses. | Vertical waste during high-latency events. | Thread switching occurs on major events (e.g., cache misses), hiding latency by running other threads. |
| Simultaneous Multithreading (SMT) | Interleaves instructions from multiple threads within a single cycle. | Both horizontal and vertical waste (minimizes idle execution units and idle cycles). | Issues instructions from multiple threads in the same cycle, using multiple execution units per cycle. |
| Chip Multiprocessing (CMP) | Splits workloads across multiple cores, each of which may itself be multithreaded. | Horizontal waste (each simpler core leaves fewer issue slots idle); vertical waste remains within a core. | Multiple independent cores, each with its own pipeline, allowing threads to run in parallel across cores. |