Understanding Task Parallelism
Task parallelism is about splitting a program into distinct units of work (tasks) that can be executed concurrently, rather than splitting up just the data. Each task may do something different, work on different data, or follow a different control flow, but they can run at the same time when resources (cores, nodes, GPUs) are available.
Where data parallelism typically means “do the same operation on many data items,” task parallelism means “do different operations that can overlap in time.”
Characteristics of Task Parallelism
Independent vs. Dependent Tasks
Tasks may have different relationships:
- Independent tasks
Tasks that can execute in any order, or simultaneously, without affecting correctness.
Example: Running three unrelated simulations with different parameters.
- Dependent tasks (with precedence constraints)
Task B can only start after Task A finishes because it needs A’s result.
This can be represented as a directed acyclic graph (DAG), where:
- Nodes = tasks
- Edges = “must happen before” dependencies
Schedulers (in runtimes or job systems) try to run as many tasks as possible at once while respecting these dependencies.
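As a minimal sketch of how such a dependency might be expressed in code (assuming Python's standard `concurrent.futures` module and made-up task bodies), the future returned for task A acts as the A → B edge, while an independent task C overlaps with A:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tasks: A produces a result that B needs; C is independent.
def task_a():
    return sum(range(1_000_000))          # pretend this is expensive

def task_b(a_result):
    return a_result * 2                   # depends on A's output

def task_c():
    return "independent work done"        # no dependencies

with ThreadPoolExecutor(max_workers=2) as pool:
    future_a = pool.submit(task_a)        # A starts immediately
    future_c = pool.submit(task_c)        # C runs concurrently with A
    # B is only submitted once A's result is available (the A -> B edge in the DAG)
    future_b = pool.submit(task_b, future_a.result())
    print(future_b.result(), future_c.result())
```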
Coarse-Grained vs. Fine-Grained Tasks
- Coarse-grained tasks:
Each task does a relatively large amount of work.
Pros:
- Lower overhead for creating and scheduling tasks
- Easier to manage and debug
Cons:
- Fewer tasks may limit available parallelism
- Fine-grained tasks:
Each task does a small amount of work.
Pros:
- Lots of potential parallelism
Cons:
- Overhead can dominate; scheduling and context switching can be more expensive than the work itself
In practice, effective task parallelism usually requires grouping work into tasks that are “big enough” to amortize overhead while still exposing enough concurrency.
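A rough illustration of this trade-off, using Python's `concurrent.futures` with an arbitrary `chunk_size` as the granularity knob (the item-squaring workload is just a stand-in):

```python
from concurrent.futures import ProcessPoolExecutor

def process_item(x):
    return x * x                          # tiny amount of work per item

def process_chunk(chunk):
    return [x * x for x in chunk]         # coarser task: many items at once

items = list(range(100_000))
chunk_size = 10_000                       # tuning knob: larger chunks mean fewer, coarser tasks
chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Fine-grained: one task per item -- scheduling overhead dominates.
        # fine = list(pool.map(process_item, items))
        # Coarse-grained: one task per chunk -- overhead is amortized over many items.
        results = [r for chunk_result in pool.map(process_chunk, chunks)
                   for r in chunk_result]
        print(len(results))
```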
Heterogeneous Work
Task parallelism naturally models programs where different parts do different kinds of work:
- Different algorithms or code paths for different tasks
- CPU-only tasks and GPU-accelerated tasks
- I/O-heavy tasks and compute-heavy tasks
This is useful when workloads are not uniform and do not fit neatly into a single data-parallel loop.
Examples of Task Parallelism
Parameter Sweeps and Ensembles
You may need to run the same application many times with different inputs or parameters (e.g., scanning over temperatures, energies, or model hyperparameters). Each run is a separate task:
- Task 1: run with parameter set A
- Task 2: run with parameter set B
- Task 3: run with parameter set C
- …
If the runs are independent (no communication between them), they can be executed in parallel on different cores or nodes. This is sometimes called embarrassingly parallel, but from the scheduler’s point of view it is a form of task parallelism: a pool of independent tasks.
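A sketch of such a sweep using a Python process pool; `run_simulation` is a hypothetical stand-in for launching the real application with one parameter set:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

# Hypothetical stand-in for "run the application with one parameter set".
def run_simulation(params):
    temperature, seed = params
    # ... the real code would be called here ...
    return {"temperature": temperature, "seed": seed, "energy": temperature * 0.1}

parameter_sets = [(t, s) for t in (100, 200, 300) for s in (1, 2)]

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(run_simulation, p): p for p in parameter_sets}
        # Results arrive in completion order, not submission order:
        for future in as_completed(futures):
            print(futures[future], "->", future.result())
```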
Producer–Consumer Pipelines
A common pattern is to structure a workflow as a pipeline of tasks:
- Task A (producer): read or generate data
- Task B (worker): process or transform data
- Task C (consumer): analyze or write results
If implemented as separate tasks, they can overlap:
- While Task B processes batch 1, Task A can start reading batch 2, and Task C might write out results from batch 0.
This is task parallelism because different tasks perform different roles, and the system coordinates data and control between them.
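One possible shape for this pattern, sketched with Python threads and bounded queues (the batch names and transformations are placeholders):

```python
import queue
import threading

raw = queue.Queue(maxsize=4)          # producer -> worker
processed = queue.Queue(maxsize=4)    # worker -> consumer
SENTINEL = None                       # signals "no more batches"

def producer(n_batches):
    for i in range(n_batches):
        raw.put(f"batch-{i}")         # e.g. read batch i from disk
    raw.put(SENTINEL)

def worker():
    while (item := raw.get()) is not SENTINEL:
        processed.put(item.upper())   # e.g. transform the batch
    processed.put(SENTINEL)

def consumer():
    while (item := processed.get()) is not SENTINEL:
        print("writing", item)        # e.g. write results out

threads = [threading.Thread(target=producer, args=(3,)),
           threading.Thread(target=worker),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the three roles run in separate threads and hand batches over through queues, reading, transforming, and writing can all be in flight for different batches at the same time.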
Multi-Stage Scientific Workflows
Complex HPC workflows often consist of several distinct stages, for example:
- Preprocessing: convert raw experimental data to simulation-ready input.
- Simulation: run a large-scale numerical model.
- Postprocessing: reduce, filter, or compress simulation output.
- Analysis/Visualization: compute statistics, create plots, or generate images.
Each stage can be implemented as a task or a group of tasks. Multiple datasets or parameter sets might flow through this pipeline, providing additional opportunities for task-level parallelism (e.g., one dataset is being simulated while another is being preprocessed).
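A compact way to sketch this overlap in Python is to submit one pipeline call per dataset to a pool; the stage functions below are hypothetical placeholders:

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical stage functions; each stands in for a real workflow step.
def preprocess(dataset):
    return f"{dataset}-clean"

def simulate(prepared):
    return f"{prepared}-simulated"

def postprocess(result):
    return f"{result}-reduced"

def run_pipeline(dataset):
    # Stages run in order for one dataset...
    return postprocess(simulate(preprocess(dataset)))

if __name__ == "__main__":
    datasets = ["ds0", "ds1", "ds2", "ds3"]
    with ProcessPoolExecutor(max_workers=2) as pool:
        # ...but different datasets move through the pipeline concurrently,
        # so ds1 can be preprocessed while ds0 is still simulating.
        for outcome in pool.map(run_pipeline, datasets):
            print(outcome)
```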
Task Parallelism Inside a Program
Some applications have a natural decomposition into distinct units:
- Different time windows, spatial regions, or components of a model are handled by separate tasks that exchange results periodically.
- Different numerical kernels (e.g., assembling a matrix, applying boundary conditions, computing diagnostics) are scheduled as tasks over parts of the domain.
Modern programming models can represent such internal tasks, and a runtime system decides where and when to run them.
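As a rough sketch (assuming a shared-memory setting and invented kernel names), independent kernel invocations over sub-domains can be submitted as tasks and left to the runtime to place:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical kernels applied to sub-domains of a decomposed problem.
def assemble_block(subdomain):
    return f"matrix block for {subdomain}"

def compute_diagnostics(subdomain):
    return f"diagnostics for {subdomain}"

subdomains = ["region-0", "region-1", "region-2", "region-3"]

with ThreadPoolExecutor() as pool:
    # Each independent kernel invocation becomes a task; the runtime
    # (here, a simple thread pool) decides where and when each one runs.
    tasks = [pool.submit(kernel, sd)
             for sd in subdomains
             for kernel in (assemble_block, compute_diagnostics)]
    for t in tasks:
        print(t.result())
```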
Task Parallelism vs. Data Parallelism
Task parallelism and data parallelism are complementary:
- Data parallelism:
- Same operation on many data elements
- Ideal for vectorization and regular loops over arrays
- Example: apply a filter to every pixel in an image
- Task parallelism:
- Different operations or workflows on the same or different data
- Ideal for irregular control flow, pipelines, or many independent jobs
- Example: run three different analysis codes on simulation output
Key contrasts:
- Control flow:
- Data parallelism: mostly identical for all elements
- Task parallelism: tasks may follow different branches and do different types of work
- Synchronization:
- Data parallel: often synchronized at loop boundaries or collective operations
- Task parallel: synchronization happens via task dependencies, barriers, or futures/promises
- Load balancing:
- Data parallel: try to split data evenly
- Task parallel: try to distribute tasks so that no worker is idle for long, which can be more challenging if tasks have varying cost
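The sketch below puts the two styles side by side in Python: a single function mapped over many elements (data parallel) versus several different analysis functions submitted as separate tasks (task parallel). The function names and data are invented for illustration:

```python
from concurrent.futures import ProcessPoolExecutor

# Data parallelism: the same operation applied to every element.
def blur(pixel):
    return pixel * 0.5

# Task parallelism: different analyses of the same output (hypothetical).
def mean_energy(data):
    return sum(data) / len(data)

def max_energy(data):
    return max(data)

def count_peaks(data):
    return sum(1 for x in data if x > 0.8)

if __name__ == "__main__":
    image = [0.1, 0.9, 0.4, 0.95]
    with ProcessPoolExecutor() as pool:
        # Data parallel: one function mapped over many elements.
        blurred = list(pool.map(blur, image))
        # Task parallel: different functions submitted as separate tasks.
        analyses = [pool.submit(f, image) for f in (mean_energy, max_energy, count_peaks)]
        print(blurred, [a.result() for a in analyses])
```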
Task Graphs and Dependencies
Task parallel programs are frequently described using task graphs (often DAGs):
- A task graph is a set of tasks with directed edges representing “must happen before” relationships.
- If task $T_2$ depends on $T_1$, then:
$$ T_1 \rightarrow T_2 $$
means $T_2$ can only start after $T_1$ completes.
This representation:
- Makes concurrency explicit: any task with no unsatisfied incoming dependencies can run immediately.
- Helps runtime systems and job schedulers decide which tasks to execute in parallel.
- Is well-suited to representing workflows, pipelines, and multi-stage simulations.
In practice, you might see:
- Workflow systems that let you declare tasks and their input/output file dependencies.
- Programming models that let you annotate functions as tasks and specify which data they read or write.
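To make the "run whatever is ready" idea concrete, here is a small hand-rolled sketch of a DAG executor in Python; real workflow systems and task-based runtimes do this far more robustly, and the task graph here is made up:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# A made-up task graph: deps[name] lists the tasks that must finish first.
def make_task(name):
    def run():
        return f"{name} done"
    return run

tasks = {name: make_task(name) for name in ["prep", "simA", "simB", "analysis"]}
deps = {"prep": set(), "simA": {"prep"}, "simB": {"prep"}, "analysis": {"simA", "simB"}}

with ThreadPoolExecutor() as pool:
    done, running = set(), {}
    while len(done) < len(tasks):
        # Launch every task whose dependencies are all satisfied.
        for name, task in tasks.items():
            if name not in done and name not in running and deps[name] <= done:
                running[name] = pool.submit(task)
        # Wait for at least one running task to finish, then record it.
        finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
        for name in [n for n, f in running.items() if f in finished]:
            print(running.pop(name).result())
            done.add(name)
```

Note how `simA` and `simB` are launched together as soon as `prep` completes, while `analysis` waits for both of them.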
Load Balancing in Task Parallelism
In task parallelism, load balancing means distributing tasks so that all resources are kept as busy as possible:
- If tasks have similar and predictable costs, you can use a simple static assignment (e.g., divide tasks equally among workers).
- If tasks are irregular or have unknown runtime, more sophisticated strategies are needed, such as dynamic task queues.
Common approaches:
- Global queue: all workers pull tasks from a shared queue.
- Work stealing: each worker has its own task queue; when idle, it steals tasks from others.
- Priority-based scheduling: some tasks (e.g., those on the critical path of the task graph) are given higher priority to reduce total runtime.
Good task-level load balancing is crucial for parallel efficiency, especially when tasks vary widely in runtime.
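A minimal sketch of the global-queue approach in Python, with artificial random task costs standing in for irregular work:

```python
import queue
import random
import threading
import time

task_queue = queue.Queue()

# Tasks with unpredictable cost (simulated by a random sleep).
for i in range(20):
    task_queue.put((i, random.uniform(0.01, 0.1)))

def worker(worker_id, results):
    # Each worker repeatedly pulls the next task from the shared (global) queue,
    # so workers that finish quickly naturally pick up more tasks.
    while True:
        try:
            task_id, cost = task_queue.get_nowait()
        except queue.Empty:
            return
        time.sleep(cost)                  # stand-in for real, irregular work
        results.append((worker_id, task_id))

results = []
workers = [threading.Thread(target=worker, args=(w, results)) for w in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(f"{len(results)} tasks completed by {len(set(w for w, _ in results))} workers")
```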
Task Parallelism at Different Scales
Within a Node
Inside a single compute node, task parallelism may involve:
- Threads executing independent or lightly interacting tasks.
- Asynchronous operations, such as overlapping computation with I/O or communication via separate tasks.
The runtime (e.g., a threading or task library) schedules tasks onto CPU cores.
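For instance, a compute task and an I/O task can be overlapped with a small thread pool; the sketch below uses a sleep as a stand-in for real I/O:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def compute():
    return sum(i * i for i in range(2_000_000))   # CPU-side work on the current batch

def fetch_input():
    time.sleep(0.5)                               # stand-in for disk or network I/O
    return "next input batch"

with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.perf_counter()
    compute_future = pool.submit(compute)         # compute the current batch
    io_future = pool.submit(fetch_input)          # prefetch the next batch concurrently
    compute_future.result(), io_future.result()
    print(f"overlapped in {time.perf_counter() - start:.2f}s")
```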
Across Nodes
At the cluster level, task parallelism appears as:
- Many independent jobs submitted to a scheduler (parameter sweeps, ensembles).
- Complex workflows broken into separate jobs connected by dependencies.
- Hybrid setups where each task corresponds to a multi-process or multi-threaded computation.
The batch system’s scheduler decides on which nodes each task will run, based on queue policies and resource availability.
When Task Parallelism Is a Good Fit
Task parallelism is especially useful when:
- Your work can be broken down into many independent or loosely coupled subproblems.
- Different parts of your computation perform different operations or have complex control flow.
- You need to manage pipelines or multi-stage workflows.
- Work units have unpredictable or heterogeneous runtimes, and you want the runtime system to dynamically schedule them.
In many realistic HPC applications, a combination of task and data parallelism is used: data-parallel kernels inside each task, and a task-parallel structure to organize the overall workflow.