Question

I have uploaded the previously prepared abstract and the IEEE paper that should be implemented. Submit code with comments + output screenshot + a 2-page report on how the code is implemented.

Project Implementation

1. For the project which involves implementation, either a demo or a presentation (not both) can be performed to discuss the details. The option is left to the individual students.
A. For a demo, a maximum of 8 minutes is allocated per student. The student needs to bring a laptop, portable platform, boards, etc. to the class.
B. For a presentation, please prepare approximately 10 slides to cover an 8-minute presentation. You should upload the presentation using this link. This will make it easier for you to download it on the class PC and use it.

2. Project final report submission. The report can be a 10 to 20 page writeup. Please provide the following information about the project:
(1) It should have a title.
(2) What is the project all about?
(3) How much have you implemented already in this project?
(4) What is left to be implemented that you could not do?
(5) What are the difficulties and challenges you faced in this project?
(6) Design and simulation figures and tables.
(7) Formal references.

Note: (1) The text and figures you use should be your own. Draw the figures that you want to use. Write in your own words. (2) Use LaTeX for your writing, as it is considered better for technical writing.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 8, AUGUST 2012

A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era
Javier Diaz, Camelia Muñoz-Caro, and Alfonso Niño

Abstract: In this work, we present a survey of the different parallel programming models and tools available today with special consideration to their suitability for high-performance computing. Thus, we review the shared and distributed memory approaches, as well as the current heterogeneous parallel programming model. In addition, we analyze how the partitioned global address space (PGAS) and hybrid parallel programming models are used to combine the advantages of shared and distributed memory systems. The work is completed by considering languages with specific parallel support and the distributed programming paradigm. In all cases, we present characteristics, strengths, and weaknesses. The study shows that the availability of multi-core CPUs has given new impulse to the shared memory parallel programming approach. In addition, we find that hybrid parallel programming is the current way of harnessing the capabilities of computer clusters with multi-core nodes. On the other hand, heterogeneous programming is found to be an increasingly popular paradigm, as a consequence of the availability of multi-core CPU+GPU systems. The use of open industry standards like OpenMP, MPI, or OpenCL, as opposed to proprietary solutions, seems to be the way to uniformize and extend the use of parallel programming models.

Index Terms: Parallelism and concurrency, distributed programming, heterogeneous (hybrid) systems.

1 INTRODUCTION

Microprocessors have driven performance increases and cost reductions in computer applications for more than two decades. However, this process reached a limit around 2003 due to heat dissipation and energy consumption issues [1]. These problems have limited the increase of CPU clock frequencies and the number of tasks that can be performed within each clock period.
The solution adopted by processor developers was to switch to a model where the microprocessor has multiple processing units known as cores [2]. Nowadays, we can speak of two approaches [2]. The first, the multi-core approach, integrates a few cores (currently between two and ten) into a single microprocessor, seeking to keep the execution speed of sequential programs. Current laptops and desktops incorporate this kind of processor. The second, the many-core approach, uses a large number of cores (currently as many as several hundred) and is specially oriented to the execution throughput of parallel programs. This approach is exemplified by the Graphical Processing Units (GPUs) available today. Thus, parallel computers are no longer expensive and elitist devices, but commodity machines we find everywhere. Clearly, this change of paradigm has had (and will have) a huge impact on the software development community [3].

Most software applications are developed following the sequential programming model and are implemented on traditional single-core microprocessors. Therefore, each new, more efficient generation of single-core processors translates into a performance increase of the available sequential applications. However, the current stalling of clock frequencies prevents further performance improvements. In this sense, it has been said that "sequential programming is dead" [4], [5]. Thus, in the present scenario we cannot rely on more efficient cores to improve performance but on the appropriately coordinated use of several cores, i.e., on concurrency. So, the applications that can benefit from performance increases with each generation of new multi-core and many-core processors are the parallel ones. This new interest in parallel program development has been called the "concurrency revolution" [3]. Therefore, parallel programming, once almost relegated to the High Performance Computing (HPC) community, is taking on a new starring role.

Parallel computing can increase application performance by executing applications on multiple processors. Unfortunately, the scaling of application performance has not matched the scaling of peak speed, and the programming burden continues to be significant. This is particularly problematic because the vision of seamless scalability requires the applications to scale automatically with the number of processors. However, for this to happen, the applications have to be programmed to exploit parallelism in the most efficient way. Thus, the responsibility for achieving the vision of scalable parallelism falls on the application developer [6]. In this sense, there are two main approaches to parallelizing applications: autoparallelization and parallel programming [7]. They differ in the achievable application performance and ease of parallelization. In the first case, the sequential programs are automatically parallelized using
ILP (instruction level parallelism) or parallel compilers. Thus, the main advantage is that existing applications just need to be recompiled with a parallel compiler, without modifications. However, due to the complexity of automatically transforming sequential algorithms into parallel ones, the amount of parallelism reached using this approach is low. On the other hand, in the parallel programming approach, the applications are specifically developed to exploit parallelism. Therefore, developing a parallel application involves the partitioning of the workload into tasks, and the mapping of the tasks onto workers (i.e., the computers where the tasks will be processed). In general, parallel programming obtains higher performance than autoparallelization, but at the expense of more parallelization effort.

Fortunately, there are some typical kinds of parallelism in computer programs, such as task, data, recursive, and pipelined parallelism [8], [9], [10]. In addition, much literature is available about the suitability of algorithms for parallel execution [11], [12] and about the design of parallel programs [10], [13], [14], [15]. From the design point of view, different patterns for exploiting parallelism have been proposed [8], [10]. A pattern is a strategy for solving recurring problems in a given field. In addition, the patterns can be organized as part of a pattern language, allowing the user to use the patterns to build complex systems. This approach applied to parallel programming is presented in [10]. Here, the pattern language is organized in four design spaces or phases: finding concurrency, algorithm structure, supporting structures, and implementation mechanisms. A total of 19 design patterns are recognized and organized around the first three phases. In particular, four patterns corresponding to the supporting structures phase can be related to the different parallel programming models [10]. These are: Single Program Multiple Data (SPMD, where the same program is executed several times with different data), Master/Worker (where a master process sets up a pool of worker processes and a bag of tasks), loop parallelism (where different iterations of one or more loops are executed concurrently), and fork/join (where a main process forks off several other processes that execute concurrently until they finally join in a single process again).
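As a minimal illustration of the SPMD pattern just described (this sketch is not taken from the survey; the function name and signature are hypothetical), the same C routine is executed by every worker, and the worker's rank selects its share of a data-parallel loop:

```c
#include <stddef.h>

/* SPMD sketch: every worker runs this same function; only `rank` differs.
 * Each worker sums its contiguous share of x[0..n) into partial[rank]. */
void spmd_partial_sum(const double *x, size_t n,
                      int rank, int nworkers, double *partial)
{
    size_t chunk = (n + (size_t)nworkers - 1) / (size_t)nworkers; /* ceiling division */
    size_t lo = (size_t)rank * chunk;
    size_t hi = (lo + chunk < n) ? lo + chunk : n;                /* clamp the last chunk */

    double s = 0.0;
    for (size_t i = lo; i < hi; ++i)
        s += x[i];
    partial[rank] = s;   /* one slot per worker, so there are no write conflicts */
}
```

Pthreads, OpenMP, and MPI differ mainly in how such workers are created, how the rank and worker count are obtained, and how the partial results are finally combined, as the following sections show.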
Parallel systems, or architectures, fall into two broad categories: shared memory and distributed memory [8]. In shared memory architectures we have a single memory address space accessible to all the processors. Shared memory machines have existed for a long time in the servers and high-end workstations segment. However, at present, common desktop machines fall into this category, since in multi-core processors all the cores share the main memory. On the other hand, in distributed memory architectures there is no global address space; each processor owns its own memory. This is a popular architectural model encountered in networked or distributed environments such as clusters or Grids of computers. Of course, hybrid shared-distributed memory systems can be built.

The conventional parallel programming practice involves a pure shared memory model [8], usually using the OpenMP API [16], in shared memory architectures, or a pure message passing model [8], using the MPI API [17], on distributed memory systems. The largest and fastest computers today employ both shared and distributed memory architectures. This provides flexibility when tuning the parallelism in the programs to generate maximum efficiency and an appropriate balance of the computational and communication loads. In addition, the availability of General Purpose computation on GPUs (GPGPU) in current multi-core systems has led to the Heterogeneous Parallel Programming (HPP) model. HPP seeks to harness the capabilities of multi-core CPUs and many-core GPUs. In accordance with all these hybrid architectures, different parallel programming models can be mixed in what is called hybrid parallel programming. A wise implementation of hybrid parallel programs can generate massive speedups over otherwise pure MPI or pure OpenMP implementations [18]. The same applies to hybrid programming involving GPUs and distributed architectures [19], [20].

In this paper, we review the parallel programming models with special consideration of their suitability for High Performance Computing applications. In addition, we consider the associated programming tools. Thus, in Section 2 we present a classification of parallel programming models in use today. Sections 3 to 8 review the different models presented in Section 2. Finally, in Section 9 we collect the conclusions of the work.

2 CLASSIFICATION OF PARALLEL PROGRAMMING MODELS

Strictly speaking, a parallel programming model is an abstraction of the computer system architecture [10]. Therefore, it is not tied to any specific machine type. However, there are many possible models for parallel computing because of the different ways several processors can be put together to build a parallel system. In addition, separating the model from its actual implementation is often difficult. Parallel programming models and their associated implementations, i.e., the parallel programming environments defined by Mattson et al. [10], are overwhelming. However, in the late 1990s two approaches became predominant in the HPC parallel programming landscape: OpenMP for shared memory and MPI for distributed memory [10]. This allows us to define the classical or pure parallel models. In addition, the new processor architectures, multi-core CPUs and many-core GPUs, have produced heterogeneous parallel programming models. Also, the simulation of a global memory space in a distributed environment leads to the Partitioned Global Address Space (PGAS) model. Finally, the architectures available today allow the definition of hybrid, shared-distributed memory + GPU, models. The parallel computing landscape would not be complete without considering the languages with parallel support and the distributed programming model. All these topics are presented in the next sections.

3 PURE PARALLEL PROGRAMMING MODELS

Here, we consider parallel programming models using a pure shared or distributed memory approach. As such, we consider the threads, shared memory OpenMP, and distributed memory message passing models. Table 1 collects the characteristics of the usual implementations of these models.
TABLE 1
Pure Parallel Programming Models Implementations

                        Pthreads         OpenMP            MPI
Programming Model       Threads          Shared memory     Message passing
System Architecture     Shared memory    Shared memory     Distributed and shared memory
Communication Model     Shared address   Shared address    Message passing or shared address
Granularity             Coarse or fine   Fine              Coarse or fine
Synchronization         Explicit         Implicit          Implicit or explicit
Implementation          Library          Compiler          Library

3.1 POSIX Threads

In this model, we have several concurrent execution paths (the threads) that can be controlled independently. A thread is a lightweight process having its own program counter and execution stack [9]. The model is very flexible, but low level, and is usually associated with shared memory and operating systems. In 1995 a standard was released [21]: the POSIX.1c Threads extensions (IEEE Std 1003.1c-1995), usually called Pthreads. Pthreads, or Portable Operating System Interface (POSIX) Threads, is a set of C programming language types and procedure calls [7], [22], [23]. Pthreads is implemented as a header (pthread.h) and a library for creating and manipulating each thread. The Pthreads library provides functions for creating and destroying threads and for coordinating thread activities via constructs designed to ensure exclusive access to selected memory locations (locks and condition variables). This model is especially appropriate for the fork/join parallel programming pattern [10].

In the POSIX model, the dynamically allocated heap memory, and obviously the global variables, are shared by the threads. This can cause programming difficulties. Often, one needs a variable that is global to the routines called within a thread but that is not shared between threads. A set of Pthreads functions is used to manipulate thread-local storage to address these requirements. Moreover, when multiple threads access shared data, programmers have to be aware of race conditions and deadlocks. To protect a critical section, i.e., the portion of code where only one thread must reach shared data, Pthreads provides mutexes (mutual exclusion) and semaphores [24]. A mutex permits only one thread to enter a critical section at a time, whereas semaphores allow several threads to enter a critical section.

In general, Pthreads is not recommended as a general-purpose parallel program development technology. While it has its place in specialized situations, and in the hands of expert programmers, the unstructured nature of Pthreads constructs makes the development of correct and maintainable programs difficult. In addition, recall that the number of threads is not related to the number of processors available. These characteristics make Pthreads programs not easily scalable to a large number of processors [6]. For all these reasons, the explicitly managed threads model is not well suited for the development of HPC applications.
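As a concrete illustration of the constructs discussed in this subsection (a minimal sketch of our own, not code from the survey), the following C program creates a small pool of Pthreads that increment a shared counter inside a mutex-protected critical section, following the fork/join pattern:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS   4
#define INCREMENTS 100000

static long counter = 0;                               /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; ++i) {
        pthread_mutex_lock(&lock);                     /* enter the critical section */
        counter++;                                     /* only one thread at a time */
        pthread_mutex_unlock(&lock);                   /* leave the critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; ++i)                 /* fork: create the threads */
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; ++i)                 /* join: wait for completion */
        pthread_join(tid[i], NULL);

    printf("counter = %ld (expected %d)\n", counter, NTHREADS * INCREMENTS);
    return 0;
}
```

Compile with `cc -pthread`. Removing the mutex introduces a race condition on `counter`; having to manage this kind of locking explicitly is part of what makes large Pthreads programs hard to keep correct and maintainable.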
3.2 Shared Memory OpenMP

Strictly speaking, this is also a multithreaded model, like the previous one. However, here we refer to a shared memory parallel programming model that is task oriented and works at a higher abstraction level than threads. This model is in practice inseparable from its practical implementation: OpenMP.

OpenMP [25] is a shared memory application programming interface (API) whose aim is to ease shared memory parallel programming. The OpenMP multithreading interface [16] is specifically designed to support HPC programs. It is also portable across shared memory architectures. OpenMP differs from Pthreads in several significant ways. While Pthreads is purely implemented as a library, OpenMP is implemented as a combination of a set of compiler directives (pragmas) and a runtime providing both management of the thread pool and a set of library routines. These directives instruct the compiler to create threads, perform synchronization operations, and manage shared memory. Therefore, OpenMP does require specialized compiler support to understand and process these directives. At present, an increasing number of OpenMP versions for Fortran, C, and C++ are available in free and proprietary compilers; see Appendix 1, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2011.308.

In OpenMP the use of threads is highly structured because it was designed specifically for parallel applications. In particular, the switch between sequential and parallel sections of code follows the fork/join model [9]. This is a block-structured approach for introducing concurrency. A single thread of control splits into some number of independent threads (the fork). When all the threads have completed the execution of their specified tasks, they resume the sequential execution (the join). A fork/join block corresponds to a parallel region, which is defined using the PARALLEL and END PARALLEL directives.

The parallel region enables a single task to be replicated across a set of threads. However, in parallel programs the distribution of different tasks across a set of threads, such as parallel iterations over the index set of a loop, is very common. Thus, there is a set of directives enabling each thread to execute a different task. This procedure is called worksharing. Therefore, OpenMP is specially suited for the loop parallelism program structure pattern, although the SPMD and fork/join patterns also benefit from this programming environment [10].

OpenMP provides application-oriented synchronization primitives, which make it easier to write parallel programs. By including these primitives as basic OpenMP operations, it is possible to generate efficient code more easily than, for instance, using Pthreads and working in terms of mutexes and condition variables.

In May 2008 the OpenMP 3.0 version was released [26]. The major change in this version was the support for explicit tasks. Explicit tasks ease the parallelization of applications where units of work are generated dynamically, as in recursive structures or in while loops. This new characteristic is very powerful. By supporting while loops and other iterative control structures, it is possible to handle graph algorithms and dynamic data structures, for instance.
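The following self-contained C sketch (again our own illustration, not code from the survey, assuming a compiler with OpenMP support such as `gcc -fopenmp`) shows the C counterparts of the constructs just described: a parallel region (the fork/join block, the C equivalent of the PARALLEL/END PARALLEL pair), a worksharing loop with a reduction, and an explicit-task version of a recursive computation:

```c
#include <omp.h>
#include <stdio.h>

/* Explicit-task illustration (OpenMP 3.0): naive recursive Fibonacci. */
static long fib(int n)
{
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait              /* wait for the two child tasks */
    return a + b;
}

int main(void)
{
    double sum = 0.0;

    /* Fork/join: the parallel region creates a team of threads. */
    #pragma omp parallel
    {
        #pragma omp single            /* one thread spawns the root task and prints */
        printf("team of %d threads, fib(20) = %ld\n",
               omp_get_num_threads(), fib(20));
    }                                 /* implicit join (barrier) here */

    /* Worksharing: the loop iterations are split among the team, and the
     * reduction clause combines each thread's private partial sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; ++i)
        sum += 1.0;

    printf("sum = %.0f\n", sum);
    return 0;
}
```

The `reduction(+:sum)` clause replaces the explicit mutex of the Pthreads version, and the `task`/`taskwait` pair expresses the dynamically generated units of work introduced in OpenMP 3.0.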
The characteristics of OpenMP allow for a high abstraction level, making it well suited for developing HPC applications in shared memory systems. The pragma directives make it easy to obtain concurrent code from serial code. In addition, the existence of specific directives eases the parallelization of loop-based code. However, the high cost of traditional multiprocessor machines prevented the widespread use of OpenMP. Nevertheless, the ubiquitous availability of multi-core processors has renewed the interest in this parallel programming model.

3.3 Message Passing

Message passing is a parallel programming model where communication between processes is done by interchanging messages. This is a natural model for a distributed memory system, where communication cannot be achieved by sharing variables. There are more or less pure realizations of this model, such as the Aggregate Remote Memory Copy Interface (ARMCI), which allows a programming approach between message passing and shared memory. ARMCI is detailed later in Section 5.2.1. However, over time, a standard has evolved and dominated for this model: the Message Passing Interface (MPI).

MPI is a specification for message passing operations [6], [27], [28], [29]. MPI is a library, not a language. It specifies the names, calling sequences, and results of the subroutines or functions to be called from Fortran, C, or C++ programs. Thus, the programs can be compiled with ordinary compilers but must be linked with the MPI library. MPI is currently the de facto standard for HPC applications on distributed architectures. By its nature it favors the SPMD and, to a lesser extent, the Master/Worker program structure patterns [10]. Appendix 2, available in the online supplemental material, collects some well-known MPI implementations. It is interesting to note that MPICH-G2 and GridMPI are MPI implementations for computational Grid environments. Thus, MPI applications can be run on different nodes of computational Grids implementing well-established middlewares such as Globus (the de facto standard, see Section 8.1 later) [30].

MPI addresses the message-passing model [6], [27], [28]. In this model, the processes executed in parallel have separate memory address spaces. Communication occurs when part of the address space of one process is copied into the address space of another process. This operation is cooperative and occurs only when the first process executes a send operation and the second process executes a receive operation. In MPI, the workload partitioning and task mapping have to be done by the programmer, similarly to Pthreads. Programmers have to manage what tasks are to be computed by each process.

Communication models in MPI comprise point-to-point, collective, one-sided, and parallel I/O operations. Point-to-point operations such as the "MPI_Send"/"MPI_Recv" pair facilitate communications between processes. Collective operations such as "MPI_Bcast" ease communications involving more than two processes. Regular MPI send/receive communication uses a two-sided model. This means that matching operations by sender and receiver are required. Therefore, some amount of synchronization is needed to manage the matching of sends and receives, and the associated buffer space, of messages. However, starting from MPI-2 [31], one-sided communications are possible. Here, no sender-receiver matching is needed. Thus, one-sided communication decouples data transfer from synchronization. One-sided communication allows remote memory access. Three communication calls are provided: "MPI_Put" (remote write), "MPI_Get" (remote read), and "MPI_Accumulate" (remote update). Finally, parallel I/O is a major component of MPI-2, providing access to external devices exploiting data types and communicators [28].
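To illustrate the two-sided, point-to-point model and a simple collective (a minimal sketch of our own, not taken from the survey; build with `mpicc` and run with at least two processes, e.g., `mpirun -np 2`), consider:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal point-to-point example: rank 0 sends an integer to rank 1,
 * then the value is broadcast to every rank. */
int main(int argc, char **argv)
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* two-sided: the send... */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* ...must match a receive */
        printf("rank 1 received %d from rank 0\n", value);
    }

    /* Collective: every rank obtains rank 0's value. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

Every rank runs the same program (the SPMD pattern); the branch on `rank` decides who sends and who receives, and `MPI_Bcast` then distributes the value to all ranks.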
On the other hand, with Symmetric Multiprocessing (SMP) machines being commonly available, and multi-core processors becoming the norm, a programming model to be considered is a mixture of message passing and multithreading. In this model, user programs consist of one or more MPI processes on each SMP node or multi-core processor, with each MPI process itself comprising multiple threads. The MPI-2 Standard [31] has clearly defined the interaction between MPI and user-created threads in an MPI program. This specification was written with the goal of allowing users to write multithreaded MPI programs easily. Thus, MPI supports four "levels" of thread safety that a user must explicitly select:

• MPI_THREAD_SINGLE. A process has only one thread of execution.
• MPI_THREAD_FUNNELED. A process may be multithreaded, but only the thread that initialized MPI can make MPI calls.
• MPI_THREAD_SERIALIZED. A process may be multithreaded, but only one thread at a time can make MPI calls.
• MPI_THREAD_MULTIPLE. A process may be multithreaded and multiple threads can call MPI functions simultaneously.

Further details about thread safety are provided in [31]. In addition, in [32] the authors analyze and discuss critical issues of thread-safe MPI implementations.

In summary, MPI is well suited for applications where portability, both in space (across different systems existing now) and in time (across generations of computers), is important. MPI is also an excellent choice for task-parallel computations and for applications where the data structures are dynamic, such as unstructured mesh computations.

Over the last two decades (the computer cluster era), message passing, and specifically MPI, has become the HPC standard approach. Thus, most of the current scientific code allows for parallel execution under the message passing model. Examples are the molecular electronic structure codes NWChem [33] and Gamess [34], or mpiBLAST [35], the parallel version of the Basic Local Alignment Search Tool (BLAST) used to find regions of local similarity between nucleotide or protein sequences. Message passing (colloquially understood as MPI) is so tied to HPC and scientific computing that, at present, in many scientific fields HPC is synonymous with MPI programming.
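The hybrid message passing + multithreading model described above can be sketched as follows (our own illustration, assuming an MPI library that provides at least the MPI_THREAD_FUNNELED level; build with `mpicc -fopenmp`): OpenMP threads share the work inside each MPI process, and only the master thread calls MPI.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid MPI + OpenMP sketch: one MPI process per node, several threads inside.
 * MPI_THREAD_FUNNELED is enough here because only the master thread calls MPI. */
int main(int argc, char **argv)
{
    int provided, rank, nranks;
    long local = 0, global = 0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not support MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Shared memory level: the threads of this process cooperate on local work. */
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < 1000000; ++i)
        local += 1;

    /* Distributed memory level: MPI is called outside the parallel region,
     * i.e., only by the master thread, as FUNNELED requires. */
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global count = %ld over %d ranks\n", global, nranks);

    MPI_Finalize();
    return 0;
}
```

Requesting only the thread-support level that is actually needed (here FUNNELED) is generally cheaper than asking for MPI_THREAD_MULTIPLE.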
4 HETEROGENEOUS PARALLEL PROGRAMMING MODELS

At the beginning of 2001 NVIDIA introduced the first programmable GPU: the GeForce3. Later, in 2003, the Siggraph/Eurographics Graphics Hardware workshop, held in San Diego, showed a shift from graphics to nongraphics applications of GPUs [36]. Thus, the GPGPU concept was born. Today, it is possible to have, in a single system, one or more host CPUs and one or more GPUs. In this sense, we can speak of heterogeneous systems. Therefore, a programming model oriented toward these systems has appeared. The heterogeneous model is foreseeable to become a mainstream approach due to the microprocessor industry's interest in the development of Accelerated Processing Units (APUs). An APU integrates the (multi-core) CPU and a GPU on the same die. This design allows for a better data transfer rate and lower power consumption. AMD Fusion [37] and Intel Sandy Bridge [38] APUs are examples of this tendency.

In the first CPU+GPU systems, languages such as Brook [39] or Cg [40] were used. However, NVIDIA has popularized CUDA (Compute Unified Device Architecture) [41] as the primary model and language to program their GPUs. More recently, the industry has worked together on the Open Computing Language (OpenCL) standard [42] as a common model for heterogeneous programming. In addition, different proprietary solutions, such as Microsoft's DirectCompute [43] or Intel's Array Building Blocks (ArBB) [44], are available. This section reviews these approaches.

4.1 CUDA

CUDA is a parallel programming model developed by NVIDIA [41]. The CUDA project started in 2006, with the first CUDA SDK released in early 2007. The CUDA model is designed to develop applications that scale transparently with the increasing number of processor cores provided by the GPUs [1], [45]. CUDA provides a software environment that allows developers to use C as a high-level programming language. In addition, other language bindings and application programming interfaces are supported; see Appendix 3, available in the online supplemental material.

[Fig. 1. CUDA (OpenCL) architecture and memory model: the host launches a grid (device) of blocks (work-groups), each containing threads (work-items); each thread has its own registers (private memory), each block has a shared (local) memory, and all threads can access the global/constant memory.]

For CUDA, a parallel system consists of a host (i.e., the CPU) and a computation resource or device (i.e., the GPU). The computation of tasks is done on the GPU by a set of threads running in parallel. The GPU thread architecture consists of a two-level hierarchy, namely the block and the grid; see Fig. 1. The block is a set of tightly coupled threads, each identified by a thread ID. On the other hand, the grid is a set of loosely coupled blocks with similar size and dimension. There is no synchronization at all between the blocks, and an entire grid is handled by a single GPU. The GPU is organized as a collection of multiprocessors, with each multiprocessor responsible for handling one or more blocks in a grid. A block is never divided across multiple multiprocessors. Threads within a block can cooperate by sharing data through some shared memory, and by synchronizing their execution to coordinate memory accesses. More detailed information can be found in [41], [46]. Moreover, there is a best practices guide that can be useful to programmers [47]. CUDA is well suited for implementing the SPMD parallel design pattern [10].

Worker management in CUDA is done implicitly. That is, programmers do not manage thread creation and destruction. They just need to specify the dimension of the grid and block required to process a certain task. Workload partitioning and worker mapping in CUDA are done explicitly. Programmers have to define the workload to be run in parallel by writing kernel ("global") functions and specifying the dimension and size of the grid and of each block.

The CUDA memory model is shown in Fig. 1. At the bottom of the figure, we see the global and constant memories. These are the memories that the host code can write to and read from. Constant memory allows read-only access by the device. Inside a block, we have the shared memory and the registers or local memory. The shared memory can be accessed by all threads in a block. The registers are independent for each thread.
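CUDA kernels are written in an extended dialect of C, so the hierarchy above can also be conveyed with a plain C sketch (our own illustration; it is not CUDA code, launches nothing on a GPU, and all names in it are hypothetical). The nested loop emulates, sequentially, how each thread of a grid derives a global element index from its block and thread IDs:

```c
#include <stdio.h>

#define N         8
#define BLOCK_DIM 4                                    /* threads (work-items) per block */
#define GRID_DIM  ((N + BLOCK_DIM - 1) / BLOCK_DIM)    /* blocks (work-groups) per grid */

/* Body of a hypothetical element-wise kernel: one (block, thread) pair
 * handles one element, exactly as a CUDA thread would. */
static void kernel_body(int blockIdx, int threadIdx, float *y, const float *x)
{
    int i = blockIdx * BLOCK_DIM + threadIdx;          /* global index, as in CUDA */
    if (i < N)                                         /* guard the last, partial block */
        y[i] = 2.0f * x[i];
}

int main(void)
{
    float x[N], y[N];
    for (int i = 0; i < N; ++i) x[i] = (float)i;

    /* Sequential emulation of the grid/block hierarchy: a GPU would run these
     * iterations concurrently, one hardware thread per (block, thread) pair. */
    for (int b = 0; b < GRID_DIM; ++b)
        for (int t = 0; t < BLOCK_DIM; ++t)
            kernel_body(b, t, y, x);

    for (int i = 0; i < N; ++i) printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```

In real CUDA code the two loops disappear: the hardware enumerates the (block, thread) pairs concurrently, and the guard on the global index handles the final, partially filled block.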
Finally, we would like to mention a recent initiative by Intel, called Knights Ferry [48], [49].

Bridging the Gap: Examining Parallel Programming Models for High-Performance Computing and Parallelizing Uniprocessor Simulators

1 Project Description

The objective of this research is to examine the current evolution of computer architecture, specifically in relation to the increasing prevalence of processors with multiple-core architectures. The work is divided into two components. First, we provide a programming approach to parallelize single-processor simulators in order to efficiently mimic multi-core systems. Second, we provide a comprehensive analysis of parallel programming paradigms and tools, emphasising their relevance and appropriateness for tasks associated with high-performance computing (HPC). Our study aims to enhance simulation technology and parallel computing approaches by establishing a connection between parallel programming paradigms and simulation environments. We aim to foster innovation in computational research and development by making a significant contribution to the advancement of parallel computing methodologies and simulation approaches.

2 Rationale

In order to keep up with the increasing prevalence of multiple-core processor designs, it is crucial to develop simulation environments that accurately depict these architectures. Furthermore, the increasing need for high-performance computing solutions underscores the importance of understanding and using parallel programming methodologies to maximise benefits. The objective of our work is to address these significant concerns and improve parallel computing methods and simulation technologies via the investigation of parallel programming paradigms and the parallelization of uniprocessor simulators. Our objective is to explore the complexities of parallel programming and simulation in order to meet the changing requirements of computational research and facilitate the development of more effective and scalable computing solutions across many fields. Our study aims to connect theoretical principles with practical applications in computational science and engineering, promoting innovation and progress.

3 What Will Be Implemented?

The project's execution heavily relies on the development and enhancement of a programming approach to parallelize existing uniprocessor simulators. This technology enables the simulation of multiple-core architectures, providing developers and researchers with valuable tools to explore and validate innovative notions. We will conduct a comprehensive review and analysis of parallel programming paradigms and tools, evaluating their appropriateness and effectiveness for various high-performance computing applications. Through practical inquiry and assessment, our objective is to identify the primary benefits and drawbacks of different parallel programming approaches. Our goal is to contribute to the development of parallel computing frameworks that are both resilient and scalable, capable of properly utilising the computational capacity of contemporary hardware architectures. We do this via systematic experimentation and iterative improvement. The aim of our study is to provide valuable insights into the practical use and improvement of parallel algorithms.
This will help in creating and using effective parallel computing solutions in many scientific and industrial fields.

4 Tools Needed to Implement It

• Programming languages, such as C and C++, are used to implement the parallelization approach.
• Simulation environments and frameworks such as Gem5 or Simics are employed to examine and validate the parallelized simulators.
• Comprehensive research on parallel programming models and techniques is performed by exploring databases, academic articles, and conference proceedings.
• Benchmarking suites and performance analysis tools, such as Intel VTune Profiler or the SPEC CPU benchmarks, are used to evaluate the effectiveness and efficiency of parallel programming models in high-performance computing (HPC) environments.
• Machine learning approaches are incorporated to optimise parallel algorithms and enhance performance in dynamic computing settings.
• Partnerships with industrial partners and academic institutions are pursued to use state-of-the-art technology and promote multidisciplinary research in parallel computing.
• Nascent hardware designs, such as GPUs and FPGAs, are investigated to exploit their concurrent processing capabilities and augment computational efficacy.
• Extensive documentation and tutorials are created to enable the adoption of parallel programming approaches and tools by the wider scientific community.

References

[1] J. Diaz, C. Muñoz-Caro, and A. Niño, "A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1369-1386, Aug. 2012, doi: 10.1109/TPDS.2011.308.
[2] R. Kumar, V. Zyuban, and D. M. Tullsen, "Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling," 32nd International Symposium on Computer Architecture (ISCA'05), Madison, WI, USA, 2005, pp. 408-419, doi: 10.1109/ISCA.2005.34.
[3] J. Donald and M. Martonosi, "An Efficient, Practical Parallelization Methodology for Multicore Architecture Simulation," IEEE Computer Architecture Letters, vol. 5, no. 2, pp. 14-14, 2006, doi: 10.1109/L-CA.2006.14.
