Question: implement the previously uploaded abstract and the IEEE paper.
Submit code with comments + output screenshot +
a 2-page report on how the code is implemented.

Project Implementation
1. For the project, which involves implementation, either a demo or a presentation (not
both) can be performed to discuss the details. The option is left to the individual
students.
A. For a demo, a maximum of 8 minutes is allocated per student. The
student needs to bring a laptop, portable platform, boards, etc. to the class.
B. For a presentation, please prepare approximately 10 slides to cover an 8-minute
presentation. You should upload the presentation using this link.
This will make it easier for you to download it on the class PC and use it.
2. Project final report submission. The report can be a 10- to 20-page writeup.
Please provide information about the project as follows:
(1) It should have a title.
(2) What is the project all about?
(3) How much of the project have you implemented already?
(4) What is left to be implemented that you could not complete?
(5) What are the difficulties and challenges you faced in this project?
(6) Design and simulation figures and tables.
(7) Formal references.
Note:
(1) The text and figures to be used should be your own. Draw the figures that you
want to use. Write in your own words.
(2) Use LaTeX for your writing, as it is considered better for technical writing.
A Survey of Parallel Programming Models
and Tools in the Multi and Many-Core Era
Javier Diaz, Camelia Muñoz-Caro, and Alfonso Niño
Abstract—In this work, we present a survey of the different parallel programming models and tools available today with special
consideration to their suitability for high-performance computing. Thus, we review the shared and distributed memory approaches, as
well as the current heterogeneous parallel programming model. In addition, we analyze how the partitioned global address space
(PGAS) and hybrid parallel programming models are used to combine the advantages of shared and distributed memory systems. The
work is completed by considering languages with specific parallel support and the distributed programming paradigm. In all cases, we
present characteristics, strengths, and weaknesses. The study shows that the availability of multi-core CPUs has given new impulse to
the shared memory parallel programming approach. In addition, we find that hybrid parallel programming is the current way of
harnessing the capabilities of computer clusters with multi-core nodes. On the other hand, heterogeneous programming is found to be
an increasingly popular paradigm, as a consequence of the availability of multi-core CPUs+GPUs systems. The use of open industry
standards like OpenMP, MPI, or OpenCL, as opposed to proprietary solutions, seems to be the way to uniformize and extend the use
of parallel programming models.
Index Terms: Parallelism and concurrency, distributed programming, heterogeneous (hybrid) systems.
1 INTRODUCTION
MICROPROCESSORS have driven performance increases and cost reduc-
tions in computer applications for more than two decades.
However, this process reached a limit around 2003 due to
heat dissipation and energy consumption issues [1]. These
problems have limited the increase of CPU clock frequen-
cies and the number of tasks that can be performed within
each clock period. The solution adopted by processor
developers was to switch to a model where the micro-
processor has multiple processing units known as cores [2].
Nowadays, we can speak of two approaches [2]. The first,
multi-core approach, integrates a few cores (currently
between two and ten) into a single microprocessor, seeking
to keep the execution speed of sequential programs. Current
laptops and desktops incorporate this kind of processor.
The second, the many-core approach, uses a large number of
cores (currently as many as several hundred) and is
specially oriented to the execution throughput of parallel
programs. This approach is exemplified by the Graphical
Processing Units (GPUs) available today. Thus, parallel
computers are no longer expensive, elitist devices, but
commodity machines we find everywhere. Clearly, this
change of paradigm has had (and will have) a huge impact
on the software developing community [3].
Most software applications are developed following the sequential programming
model and implemented on traditional single-core microprocessors.
Therefore, each new, more efficient, generation of single-
core processors translates into a performance increase of the
available sequential applications. However, the current
stalling of clock frequencies prevents further performance
improvements. In this sense, it has been said that
"sequential programming is dead" [4], [5]. Thus, in the
present scenario we cannot rely on more efficient cores to
improve performance, but on the appropriate, coordinated use
of several cores, i.e., on concurrency. So, the applications
that can benefit from performance increases with each
generation of new multi-core and many-core processors are
the parallel ones. This new interest in parallel program
development has been called the “concurrency revolution”
[3]. Therefore, parallel programming, once almost relegated
to the High Performance Computing (HPC) community, is
taking on a new starring role on the stage.
Parallel computing can increase application performance
by executing applications on multiple processors. Unfortu-
nately, the scaling of application performance has not
matched the scaling of peak speed, and the programming
burden remains significant. This is particularly
problematic because the vision of seamless scalability needs
the applications to scale automatically with the number of
processors. However, for this to happen, the applications
have to be programmed to exploit parallelism in the most
efficient way. Thus, the responsibility for achieving the vision
of scalable parallelism falls on the applications developer [6].
In this sense, there are two main approaches to
parallelize applications: autoparallelization and parallel
programming [7]. They differ in the achievable application
performance and ease of parallelization. In the first case, the
sequential programs are automatically parallelized using
ILP (instruction level parallelism) or parallel compilers.
Thus, the main advantage is that existing applications just
need to be recompiled with a parallel compiler, without
modifications. However, due to the complexity of auto-
matically transforming sequential algorithms into parallel
ones, the amount of parallelism reached using this approach
is low. On the other hand, in the parallel programming
approach, the applications are specifically developed to
exploit parallelism. Therefore, developing a parallel appli-
cation involves the partitioning of the workload into tasks,
and the mapping of the tasks into workers (i.e., the
computers where the tasks will be processed). In general,
parallel programming obtains a higher performance than
autoparallelization but at the expense of more paralleliza-
tion efforts. Fortunately, there are some typical kinds of
parallelism in computer programs such as task, data,
recursive, and pipelined parallelism [8], [9], [10]. In
addition, much literature is available about the suitability
of algorithms for parallel execution [11], [12] and about the
design of parallel programs [10], [13], [14], [15]. From the
design point of view, different patterns for exploiting
parallelism have been proposed [8], [10]. A pattern is a
strategy for solving recurring problems in a given field. In
addition, the patterns can be organized as part of a pattern
language, allowing the user to use the patterns to build
complex systems. This approach applied to parallel pro-
gramming is presented in [10]. Here, the pattern language is
organized in four design spaces or phases: finding con-
currency, algorithm structure, supporting structures, and
implementation mechanisms. A total of 19 design patterns
are recognized and organized around the first three phases.
In particular, four patterns corresponding to the supporting
structures phase can be related to the different parallel
programming models [10]. These are: Single Program
Multiple Data (SPMD, where the same program is executed
several times with different data), Master/Worker (where a
master process sets up a pool of worker processes and a bag
of tasks), loop parallelism (where different iterations of one
or more loops are executed concurrently), and fork/join
(where a main process forks off several other processes that
execute concurrently until they finally join in a single
process again).
Parallel systems, or architectures, fall into two broad
categories: shared memory and distributed memory [8]. In
shared memory architectures we have a single memory
address space accessible to all the processors. Shared
memory machines have existed for a long time in the servers
and high-end workstations segment. However, at present,
common desktop machines fall into this category since in
multi-core processors all the cores share the main memory.
On the other hand, in distributed memory architectures
there is no global address space. Each processor owns its
own memory. This is a popular architectural model
encountered in networked or distributed environments such
as clusters or Grids of computers. Of course, hybrid shared-
distributed memory systems can be built.
The conventional parallel programming practice involves a pure shared
memory model [8], usually using the OpenMP API [16], in shared memory
architectures, or a pure message passing model [8], using the MPI API [17],
on distributed memory systems. The largest and fastest
computers today employ both shared and distributed
memory architectures. This provides flexibility when
tuning the parallelism in the programs to generate max-
imum efficiency and an appropriate balance of the
computational and communication loads. In addition, the
availability of General Purpose computation on GPUs
(GPGPUs) in current multi-core systems has led to the
Heterogeneous Parallel Programming (HPP) model. HPP
seeks to harness the capabilities of multi-core CPUs and
many-core GPUs. Across all these hybrid archi-
tectures, different parallel programming models can be
mixed in what is called hybrid parallel programming. A
wise implementation of hybrid parallel programs can
generate massive speedups over otherwise pure MPI or
pure OpenMP implementations [18]. The same can be
applied to hybrid programming involving GPUs and
distributed architectures [19], [20].
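As an illustration of the hybrid model just described, the following hedged C sketch (our own, assuming an MPI library and a compiler with OpenMP support, e.g., mpicc -fopenmp) combines MPI processes across nodes with OpenMP threads inside each process; each rank works on its own block of the index space and OpenMP splits that block among threads.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    const int n = 1000000;
    double local_sum = 0.0, global_sum = 0.0;

    /* Request a thread level that allows OpenMP threads alongside MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Distributed memory level: each MPI process takes one block (SPMD). */
    int begin = rank * (n / nprocs);
    int end   = (rank == nprocs - 1) ? n : begin + n / nprocs;

    /* Shared memory level: OpenMP threads share the block of this process. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = begin; i < end; i++)
        local_sum += 1.0 / (i + 1.0);

    /* Combine the per-process partial sums on rank 0. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("harmonic sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}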
In this paper, we review the parallel programming models
with special consideration of their suitability for High
Performance Computing applications. In addition, we con-
sider the associated programming tools. Thus, in Section 2 we
present a classification of parallel programming models in
use today. Sections 3 to 8 review the different models
presented in Section 2. Finally, in Section 9 we collect the
conclusions of the work.
2 CLASSIFICATION OF PARALLEL PROGRAMMING
MODELS
Strictly speaking, a parallel programming model is an
abstraction of the computer system architecture [10].
Therefore, it is not tied to any specific machine type.
However, there are many possible models for parallel
computing because of the different ways several processors
can be put together to build a parallel system. In addition,
separating the model from its actual implementation is
often difficult. Parallel programming models and their
associated implementations, i.e., the parallel programming
environments defined by Mattson et al. [10], are over-
whelming in number. However, in the late 1990s two approaches
became predominant in the HPC parallel programming
landscape: OpenMP for shared memory and MPI for
distributed memory [10]. This allows us to define the
classical or pure parallel models. In addition, the new
processor architectures, multi-core CPUs and many-core
GPUs, have produced heterogeneous parallel programming
models. Also, the simulation of a global memory space in a
distributed environment leads to the Partitioned Global
Address Space (PGAS) model. Finally, the architectures
available today allow definition of hybrid, shared-distributed
memory + GPU, models. The parallel computing landscape
would not be complete without considering the languages
with parallel support and the distributed programming model.
All these topics are presented in the next sections.
3 PURE PARALLEL PROGRAMMING MODELS
TABLE 1
Pure Parallel Programming Models Implementations

                      | Pthreads       | OpenMP         | MPI
Model                 | Threads        | Shared Memory  | Message Passing
System Architecture   | Shared memory  | Shared memory  | Distributed and Shared memory
Communication Model   | Shared Address | Shared Address | Message Passing or Shared Address
Granularity           | Coarse or Fine | Fine           | Coarse or Fine
Synchronization       | Explicit       | Implicit       | Implicit or Explicit
Implementation        | Library        | Compiler       | Library

Here, we consider parallel programming models using a
pure shared or distributed memory approach. As such, we
consider the threads, shared memory OpenMP, and
distributed memory message passing models. Table 1
collects the characteristics of the usual implementations of
these models.
3.1 POSIX Threads
In this model, we have several concurrent execution paths
(the threads) that can be controlled independently. A thread
is a lightweight process having its own program counter
and execution stack [9]. The model is very flexible, but low
level, and is usually associated with shared memory and
operating systems. In 1995 a standard was released [21]: the
POSIX.1c, Threads extensions (IEEE Std 1003.1c-1995), or as
it is usually called Pthreads.
The Pthreads, or Portable Operating System Interface
(POSIX) Threads, is a set of C programming language types
and procedure calls [7], [22], [23]. Pthreads is implemented
as a header (pthread.h) and a library for creating and
manipulating each thread. The Pthreads library provides
functions for creating and destroying threads and for
coordinating thread activities via constructs designed to
ensure exclusive access to selected memory locations (locks
and condition variables). This model is especially appro-
priate for the fork/join parallel programming pattern [10].
In the POSIX model, the dynamically allocated heap
memory, and obviously the global variables, is shared by
the threads. This can cause programming difficulties. Often,
one needs a variable that is global to the routines called
within a thread but that is not shared between threads. A
set of Pthreads functions is used to manipulate thread local
storage to address these requirements. Moreover, when
multiple threads access the shared data, programmers have
to be aware of race conditions and deadlocks. To protect
critical sections, i.e., the portions of code where only one
thread at a time may access shared data, Pthreads provides mutexes
(mutual exclusion) and semaphores [24]. A mutex permits
only one thread to enter a critical section at a time, whereas
a semaphore can allow several threads to enter a critical section.
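The following minimal C sketch (our own illustration, assuming a POSIX system providing pthread.h) shows the fork/join usage described above: threads are created with pthread_create, joined with pthread_join, and a mutex protects the critical section that updates a shared counter.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;                         /* shared data            */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread body: increment the shared counter inside a critical section. */
static void *work(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);               /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);             /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)           /* fork */
        pthread_create(&tid[i], NULL, work, NULL);

    for (int i = 0; i < NTHREADS; i++)           /* join */
        pthread_join(tid[i], NULL);

    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}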
In general, Pthreads is not recommended as a general-purpose
parallel program development technology. While it has its place in
specialized situations, and in the hands of expert programmers, the
unstructured nature of Pthreads constructs makes the development of
correct and maintainable programs difficult. In addition, the number
of threads is not related to the number of processors
available. These characteristics make Pthreads programs
not easily scalable to a large number of processors [6]. For
all these reasons, the explicitly managed threads model is
not well suited for the development of HPC applications.
3.2 Shared Memory OpenMP
Strictly speaking, this is also a multithreaded model, like the
previous one. However, here we refer to a shared memory
parallel programming model that is task oriented and
works at a higher abstraction level than threads. This model
is in practice inseparable from its practical implementation:
OpenMP.
OpenMP [25] is a shared memory application program-
ming interface (API) whose aim is to ease shared memory
parallel programming. The OpenMP multithreading inter-
face [16] is specifically designed to support HPC programs. It
is also portable across shared memory architectures.
OpenMP differs from Pthreads in several significant ways.
While Pthreads is purely implemented as a library, OpenMP
is implemented as a combination of a set of compiler
directives, pragmas, and a runtime providing both manage-
ment of the thread pool and a set of library routines. These
directives instruct the compiler to create threads, perform
synchronization operations, and manage shared memory.
Therefore, OpenMP does require specialized compiler sup-
port to understand and process these directives. At present,
an increasing number of OpenMP versions for Fortran, C,
and C++ are available in free and proprietary compilers, see
Appendix 1, which can be found on the Computer Society
Digital Library at http://doi.ieeecomputersociety.org/
10.1109/TPDS.2011.308.
In OpenMP the use of threads is highly structured
because it was designed specifically for parallel applica-
tions. In particular, the switch between sequential and
parallel sections of code follows the fork/join model [9].
This is a block-structured approach for introducing con-
currency. A single thread of control splits into some number
of independent threads (the fork). When all the threads
have completed the execution of their specified tasks, they
resume the sequential execution (the join). A fork/join block
corresponds to a parallel region, which is defined using the
PARALLEL and END PARALLEL directives.
The parallel region enables a single task to be replicated
across a set of threads. However, in parallel programs the
distribution of different tasks across a set of threads is very
common, such as parallel iterations over the index set of a
loop. Thus, there is a set of directives enabling each thread
to execute a different task. This procedure is called
worksharing. Therefore, OpenMP is especially suited for
the loop parallel program structure pattern, although the
SPMD and fork/join patterns also benefit from this
programming environment [10].
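A minimal C sketch of the loop worksharing just described (our own example, assuming a compiler with OpenMP support, e.g., gcc -fopenmp): the parallel for directive forks a team of threads, distributes the loop iterations among them, and joins them at the implicit barrier at the end of the loop; the reduction clause combines the per-thread partial sums.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0 / (i + 1.0);

    /* Fork a team of threads; iterations are shared among them. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        b[i] = 2.0 * a[i];
        sum += b[i];
    }                                  /* implicit join/barrier here */

    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}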
OpenMP provides application-oriented synchronization
primitives, which make it easier to write parallel programs.
By including these primitives as basic OpenMP operations,
it is possible to generate efficient code more easily than, for
instance, using Pthreads and working in terms of mutex
and condition variables.
In May 2008 the OpenMP 3.0 version was released [26].
The major change in this version was the support for
explicit tasks. Explicit tasks ease the parallelization of
applications where units of work are generated dynami-
cally, as in recursive structures or in while loops. This new
characteristic is very powerful. By supporting while loops
and other iterative control structures, it is possible to handle
graph algorithms and dynamic data structures, for instance.
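As a hedged sketch of these explicit tasks (our own example, assuming an OpenMP 3.0-capable compiler), the following C code traverses a linked list inside a while loop and creates one task per node, a case that is awkward to express with the classic loop worksharing directives because the number of iterations is not known in advance.

#include <stdio.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

static void process(struct node *p) { p->value *= 2; }

int main(void)
{
    /* Build a small linked list. */
    struct node *head = NULL;
    for (int i = 0; i < 10; i++) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next  = head;
        head = n;
    }

    #pragma omp parallel
    {
        #pragma omp single              /* one thread walks the list...      */
        {
            struct node *p = head;
            while (p != NULL) {
                #pragma omp task firstprivate(p)   /* ...spawning one task   */
                process(p);                        /* per node, executed by  */
                p = p->next;                       /* any thread in the team */
            }
            #pragma omp taskwait        /* wait for all generated tasks      */
        }
    }

    for (struct node *p = head; p != NULL; p = p->next)
        printf("%d ", p->value);
    printf("\n");
    return 0;
}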
The characteristics of OpenMP allow for a high abstrac-
tion level, making it well suited for developing HPC
applications in shared memory systems. The pragma
directives make it easy to obtain concurrent code from serial
code. In addition, the existence of specific directives eases the
parallelization of loop-based code. However, the high cost of
traditional multiprocessor machines prevented the wide-
spread use of OpenMP. Nevertheless, the ubiquitous
availability of multi-core processors has renewed the
interest for this parallel programming model.
3.3 Message Passing
Message Passing is a parallel programming model where
communication between processes is done by interchanging
messages. This is a natural model for a distributed memory
system, where communication cannot be achieved by
sharing variables. There are more or less pure realizations
of this model such as Aggregate Remote Memory Copy
Interface (ARMCI), which allows a programming approach
between message passing and shared memory. ARMCI is
detailed later in Section 5.2.1. However, over time, a
standard has evolved and dominated for this model: the
Message Passing Interface (MPI).
MPI is a specification for message passing operations [6],
[27], [28], [29]. MPI is a library, not a language. It specifies
the names, calling sequences, and results of the subroutines
or functions to be called from Fortran, C, or C++ programs.
Thus, the programs can be compiled with ordinary
compilers but must be linked with the MPI library. MPI is
currently the de facto standard for HPC applications on
distributed architectures. By its nature it favors the SPMD
and, to a lesser extent, the Master/Worker program
structure patterns [10]. Appendix 2, available in the online
supplemental material, collects some well-known MPI
implementations. It is interesting to note that MPICH-G2
and GridMPI are MPI implementations for computational
Grid environments. Thus, MPI applications can be run on
different nodes of computational Grids implementing well-
established middlewares such as Globus (the de facto
standard, see Section 8.1 later) [30].
MPI addresses the message-passing model [6], [27], [28].
In this model, the processes executed in parallel have
separate memory address spaces. Communication occurs
when part of the address space of one process is copied into
the address space of another process. This operation is
cooperative and occurs only when the first process executes
a send operation and the second process executes a receive
operation. In MPI, the workload partitioning and task
mapping have to be done by the programmer, similarly to
Pthreads. Programmers have to manage what tasks are to be
computed by each process. Communication models in MPI
comprise point-to-point, collective, one-sided, and parallel
I/O operations. Point-to-point operations such as the
"MPI_Send"/"MPI_Recv" pair facilitate communications
between processes. Collective operations such as “MPI_
Bcast" ease communications involving more than two
processes. Regular MPI send/receive communication uses
a two-sided model. This means that matching operations by
sender and receiver are required. Therefore, some amount of
synchronization is needed to manage the matching of sends
and receives, and the associated message buffer space.
However, starting from MPI-2 [31], one-sided communica-
tions are possible. Here, no sender-receiver matching is
needed. Thus, one-sided communication decouples data
transfer from synchronization. One-sided communication
allows remote memory access. Three communication calls
are provided: "MPI_Put” (remote write), “MPI_Get” (re-
mote read), and “MPI_Accumulate" (remote update).
Finally, parallel I/O is a major component of MPI-2,
providing access to external devices exploiting data types
and communicators [28].
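To make the point-to-point and collective operations above concrete, here is a minimal C sketch of our own (assuming an MPI implementation with the usual mpicc/mpirun tools): rank 0 broadcasts a parameter with MPI_Bcast, and every other rank then sends its partial result back with MPI_Send, which rank 0 matches with MPI_Recv.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, param = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Collective operation: rank 0 sends 'param' to every process. */
    MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank != 0) {
        /* Point-to-point: each worker sends its result to rank 0. */
        int result = rank * param;
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 1; src < nprocs; src++) {
            int result;
            MPI_Recv(&result, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", result, src);
        }
    }

    MPI_Finalize();
    return 0;
}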
On the other hand, with Symmetric Multiprocessing
(SMP) machines being commonly available, and multi-core
processors becoming the norm, a programming model to be
considered is a mixture of message passing and multi-
threading. In this model, user programs consist of one or
more MPI processes on each SMP node or multi-core
processor, with each MPI process itself comprising multiple
threads. The MPI-2 Standard [31] has clearly defined the
interaction between MPI and user created threads in an MPI
program. This specification was written with the goal of
allowing users to write multithreaded MPI programs easily.
Thus, MPI supports four "levels" of thread safety that a
user must explicitly select:
• MPI_THREAD_SINGLE. A process has only one
thread of execution.
• MPI_THREAD_FUNNELED. A process may be
multithreaded, but only the thread that initialized
MPI can make MPI calls.
• MPI_THREAD_SERIALIZED. A process may be
multithreaded, but only one thread at a time can
make MPI calls.
• MPI_THREAD_MULTIPLE. A process may be
multithreaded and multiple threads can call MPI
functions simultaneously.
Further details about thread safety are provided in [31]. In
addition, in [32] the authors analyze and discuss critical
issues of thread-safe MPI implementations.
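A short C sketch of our own showing how a program requests one of these thread-safety levels at start-up; MPI_Init_thread is part of the MPI-2 standard, and the level actually granted is returned in the provided argument.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int required = MPI_THREAD_MULTIPLE;   /* level the program asks for     */
    int provided;                         /* level the MPI library grants   */

    MPI_Init_thread(&argc, &argv, required, &provided);

    if (provided < required)
        fprintf(stderr, "warning: MPI granted a lower thread level (%d)\n",
                provided);

    /* ... multithreaded MPI code would go here ... */

    MPI_Finalize();
    return 0;
}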
In summary, MPI is well suited for applications where
portability, both in space (across different systems existing
now) and in time (across generations of computers), is
important. MPI is also an excellent choice for task-parallel
computations and for applications where the data structures
are dynamic, such as unstructured mesh computations.
Over the last two decades (the computer cluster era),
message passing, and specifically MPI, has become the HPC
standard approach. Thus, most of the current scientific code
allows for parallel execution under the message passing
model. Examples are the molecular electronic structure
codes NWChem [33] and GAMESS [34], or mpiBLAST [35],
the parallel version of the Basic Local Alignment Search
Tool (BLAST) used to find regions of local similarity
between nucleotide or protein sequences. Message passing
(colloquially understood as MPI) is so tied to HPC and
scientific computing that, at present, in many scientific
fields HPC is synonymous with MPI programming.
4 HETEROGENEOUS PARALLEL PROGRAMMING
MODELS
At the beginning of 2001, NVIDIA introduced the first
programmable GPU: the GeForce3. Later, in 2003, the Siggraph/
Eurographics Graphics Hardware workshop, held in San
Diego, showed a shift from graphics to nongraphics
applications of the GPUs [36]. Thus, the GPGPU concept
was born. Today, it is possible to have, in a single system,
one or more host CPUs and one or more GPUs. In this
sense, we can speak of heterogeneous systems. Therefore, a
programming model oriented toward these systems has
appeared. The heterogeneous model is expected to
become a mainstream approach due to the microprocessor
industry's interest in the development of Accelerated
Processing Units (APUs). An APU integrates the CPU
(multi-core) and a GPU on the same die. This design allows
for a better data transfer rate and lower power consump-
tion. AMD Fusion [37] and Intel Sandy Bridge [38] APUs
are examples of this tendency.
In the first CPU+GPU systems, languages such as Brook [39] or
Cg [40] were used. However, NVIDIA has popularized
CUDA (Compute Unified Device Architecture) [41] as the
primary model and language to program their GPUs. More
recently, the industry has worked together on the Open
Computing Language (OpenCL) standard [42] as a common
model for heterogeneous programming. In addition, differ-
ent proprietary solutions, such as Microsoft's DirectCompute
[43] or Intel's Array Building Blocks (ArBB) [44], are
available. This section reviews these approaches.
4.1 CUDA
CUDA is a parallel programming model developed by
NVIDIA [41]. The CUDA project started in 2006 with the
first CUDA SDK released in early 2007. The CUDA model
is designed to develop applications scaling transparently
with the increasing number of processor cores provided
by the GPUs [1], [45]. CUDA provides a software
environment that allows developers to use C as a high-level
programming language. In addition, other language
bindings and application programming interfaces are
supported; see Appendix 3, available in the online
supplemental material.

Fig. 1. CUDA (OpenCL) architecture and memory model. [Figure: a grid
(device) contains blocks (work-groups); each block contains threads (work-
items) with private registers, a per-block shared (local) memory, and access
to the device-wide global/constant memory, which the host can read and write.]
For CUDA, a parallel system consists of a host (i.e.,
CPU) and a computation resource or device (i.e., GPU).
The computation of tasks is done in the GPU by a set of
threads running in parallel. The GPU thread architecture
consists of a two-level hierarchy, namely the block and the
grid; see Fig. 1.
The block is a set of tightly coupled threads, each
identified by a thread ID. On the other hand, the grid is a set
of loosely coupled blocks with similar size and dimension.
There is no synchronization at all between the blocks, and
an entire grid is handled by a single GPU. The GPU is
organized as a collection of multiprocessors, with each
multiprocessor responsible for handling one or more blocks
in a grid. A block is never divided across multiple
multiprocessors. Threads within a block can cooperate by
sharing data through some shared memory, and by
synchronizing their execution to coordinate memory ac-
cesses. More detailed information can be found in [41], [46].
Moreover, there is a best practices guide that can be useful
to programmers [47]. CUDA is well suited for implement-
ing the SPMD parallel design pattern [10].
Worker management in CUDA is done implicitly. That
is, programmers do not manage thread creation and
destruction. They just need to specify the dimensions of
the grid and block required to process a certain task.
Workload partitioning and worker mapping in CUDA are
done explicitly. Programmers have to define the workload
to be run in parallel by writing a kernel (a function declared
with the __global__ qualifier) and specifying the dimension
and size of the grid and of each block.
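The following CUDA C sketch (our own minimal example, assuming the CUDA toolkit and the nvcc compiler) illustrates this: a kernel declared with the __global__ qualifier defines the per-thread work, and the host launches it by specifying the grid and block dimensions between the <<< >>> brackets.

#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

/* Kernel: each thread processes one element, identified by its
 * block index, block dimension, and thread index. */
__global__ void scale(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = 2.0f * in[i];
}

int main(void)
{
    float h_in[N], h_out[N];
    float *d_in, *d_out;

    for (int i = 0; i < N; i++)
        h_in[i] = (float)i;

    /* Allocate device (global) memory and copy the input from the host. */
    cudaMalloc((void **)&d_in,  N * sizeof(float));
    cudaMalloc((void **)&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    /* Launch: a grid of 4 blocks with 256 threads per block (4 * 256 = N). */
    scale<<<4, 256>>>(d_in, d_out);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_out[%d] = %f\n", N - 1, h_out[N - 1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}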
The CUDA memory model is shown in Fig. 1. At the
bottom of the figure, we see the global and constant
memories. These are the memories that the host code can
write to and read from. Constant memory allows read-only
access by the device. Inside a block, we have the shared
memory and the registers or local memory. The shared
memory can be accessed by all threads in a block. The
registers are independent for each thread.
Finally, we would like to mention a recent initiative by
Intel. This initiative is called Knights Ferry [48], [49], and is
Bridging the Gap: Examining Parallel Programming Models for
High-Performance Computing and Parallelizing Uniprocessor
Simulators
1 Project Description
The objective of this research is to examine the current evolution of computer architecture, specifically in
relation to the increasing prevalence of processors with multiple core architectures. The work is divided into
two components. Firstly, we provide a programming approach to parallelize single-processor simulators in
order to efficiently mimic multi-core systems. Furthermore, we provide a comprehensive analysis of parallel
programming paradigms and tools, emphasising their relevance and appropriateness for tasks associated
with high-performance computing (HPC). Our study aims to enhance simulation technology and parallel
computing approaches by establishing a connection between parallel programming paradigms and simula-
tion environments. We aim to foster innovation in computational research and development by making a
significant contribution to the advancement of parallel computing methodologies and simulation approaches.
2 Rationale
In order to keep up with the increasing prevalence of multiple-core processor designs, it is crucial to de-
velop simulation environments that accurately depict these architectures. Furthermore, the increasing need
for high-performance computing solutions underscores the importance of understanding and using parallel
programming methodologies to maximise benefits. The objective of our work is to address these significant
concerns and improve parallel computing methods and simulation technologies via the investigation of par-
allel programming paradigms and the parallelization of uniprocessor simulators. Our objective is to explore
the complexities of parallel programming and simulation in order to meet the changing requirements of com-
putational research and facilitate the development of more effective and scalable computing solutions across
many fields. Our study aims to connect theoretical principles with practical applications in computational
science and engineering, promoting innovation and progress.
3 What Will Be Implemented?
The project's execution heavily relies on the development and enhancement of a programming approach
to parallelize existing uniprocessor simulators. This technology enables the simulation of multiple-core
architectures, providing developers and researchers with valuable tools to explore and validate innovative
notions. We will conduct a comprehensive review and analysis of parallel programming paradigms and
tools, evaluating their appropriateness and effectiveness for various high-performance computing applications.
Through practical inquiry and assessment, our objective is to identify the primary benefits and drawbacks
of different parallel programming approaches. Our goal is to contribute to the development of parallel
computing frameworks that are both resilient and scalable, capable of properly utilising the computational
capacity of contemporary hardware architectures. We do this via systematic experimentation and iterative
improvement. The aim of our study is to provide valuable insights into the practical use and improvement
of parallel algorithms. This will help in creating and using effective parallel computing solutions in many
scientific and industrial fields.
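As a first, heavily hedged sketch of this parallelization approach (our own illustration; the data structures and the function simulate_core_cycle are hypothetical placeholders, not part of Gem5, Simics, or any existing simulator), the per-core step functions of a uniprocessor simulator could be replicated and executed concurrently within each simulated cycle using OpenMP, with the implicit barrier at the end of the loop keeping the simulated cores cycle-accurate with respect to each other.

#include <stdio.h>

#define NCORES  4        /* number of simulated cores (assumption)          */
#define NCYCLES 1000     /* length of the simulation (assumption)           */

/* Hypothetical per-core state and step function standing in for the
 * real simulator's data structures and cycle-level model. */
struct core_state { long cycle; long retired; };

static void simulate_core_cycle(struct core_state *c)
{
    c->retired += 1;     /* placeholder for fetch/decode/execute work       */
    c->cycle++;
}

int main(void)
{
    struct core_state cores[NCORES] = {{0, 0}};

    for (long cycle = 0; cycle < NCYCLES; cycle++) {
        /* Within one simulated cycle the cores are assumed independent,
         * so their step functions can run in parallel. */
        #pragma omp parallel for
        for (int i = 0; i < NCORES; i++)
            simulate_core_cycle(&cores[i]);

        /* Shared resources (caches, interconnect) would be updated
         * sequentially here, between cycles. */
    }

    printf("core 0 retired %ld instructions\n", cores[0].retired);
    return 0;
}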
4 Tools Needed to Implement It
• Programming languages, such as C and C++, are used to implement the parallelization approach.
• Employ simulation environments and frameworks such as Gem5 or Simics to examine and authenticate
parallelized simulators.
• Perform comprehensive research on parallel programming models and techniques by exploring databases,
academic articles, and conference proceedings.
• Benchmarking suites and performance analysis tools, such as Intel VTune Profiler or SPEC CPU
benchmarks, are used to evaluate the effectiveness and efficiency of parallel programming models in
high-performance computing (HPC) environments.
• Incorporating machine learning approaches to optimise parallel algorithms and enhance performance
in dynamic computing settings.
• Engaging in partnerships with industrial partners and academic institutions to use state-of-the-art
technology and promote multidisciplinary research in parallel computing.
• Investigation of nascent hardware designs, such as GPUs and FPGAs, to use their concurrent processing
capabilities and augment computational efficacy.
• Creation of extensive documentation and tutorials to enable the adoption of parallel programming
approaches and tools by the wider scientific community.
References
[1] J. Diaz, C. Muñoz-Caro, and A. Niño, "A Survey of Parallel Programming Models and Tools in the Multi
and Many-Core Era," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1369-1386,
Aug. 2012, doi: 10.1109/TPDS.2011.308.
[2] R. Kumar, V. Zyuban, and D. M. Tullsen, "Interconnections in multi-core architectures: understanding
mechanisms, overheads and scaling," in Proc. 32nd International Symposium on Computer Architecture
(ISCA'05), Madison, WI, USA, 2005, pp. 408-419, doi: 10.1109/ISCA.2005.34.
[3] J. Donald and M. Martonosi, "An Efficient, Practical Parallelization Methodology for Multicore
Architecture Simulation," IEEE Computer Architecture Letters, vol. 5, no. 2, pp. 14-14, 2006,
doi: 10.1109/L-CA.2006.14.