Boost your journey with 24/7 access to skilled experts, offering unmatched computer organisation and architecture homework help

**Q1:** The topic is Metaverse (Virtual Reality). How does VR help the educational system in design schools?

**Q2:** Operations on vectors with length < 64 elements require: a. The Exception Program Counter (marked wrong). b. None of the choices. c. Multi-lane convergence. d. Branch instructions and some arithmetic instructions. e. Loop statements. f. The Vector Length Register.

**Q4:** Consider the following sequence of instructions to compute x² + 4x + 1 for each element x in a vector stored in V0:

- `multvv.d V1, V0, V0  # V1 = x^2`
- `multvs.d V2, V0, 4   # V2 = 4x`
- `addvv.d V3, V1, V2   # V3 = x^2 + 4x`
- `addvs.d V4, V3, 1    # V4 = x^2 + 4x + 1`

Assume that we have a vector processor with two vector multiplication units whose latency is 7 and two vector addition units whose latency is 6. Let n = 32 represent the length of the vector supported on our processor. The processor has fully pipelined vector execution units. The vector processor supports chaining. How many clock cycles will this instruction sequence take? a. 48 cycles b. 32 cycles c. 52 cycles d. 120 cycles e. 51 cycles f. 100 cycles g. 200 cycles h. 150 cycles i. 74 cycles

**Q5:** Consider the following loop: `for (i=0; i<64; i++) X[i] = a * X[i] + b;` Here is the assembly code for the loop. Assume that prior to the loop, i is in R1, 64*4 is in R2, a is in F0, and b is in F2.

- `I0: ld F4, X(R1)`
- `I1: mul F4, F4, F0`
- `I2: add F4, F4, F2`
- `I3: st F4, X(R1)`
- `I4: addi R1, R1, #4`
- `I5: bne R1, R2, I0`

Consider a 32-element vector processor. The processor has four fully pipelined vector execution units: a 2-cycle load unit, a 2-cycle store unit, a 2-cycle FP adder, and a 4-cycle FP multiplier. The vector processor does not support chaining. How long would it take the original loop to execute on this processor? Ignore overlaps between multiple vector chains and the scalar code for setting up each group of vector operations. a. 200 cycles b. 48 cycles c. 148 cycles d. 32 cycles e. 74 cycles f. 120 cycles g. 100 cycles h. 52 cycles i. 150 cycles

**Q6:** Now assume that the vector processor supports chaining. How long would it take the original loop to execute on this processor? a. 52 cycles b. 148 cycles c. 74 cycles (marked wrong) d. 200 cycles (marked wrong) e. 32 cycles f. 48 cycles g. 120 cycles h. 100 cycles i. 150 cycles

**Q9:** Design and develop a piece of software which interacts directly with computer hardware, including parallel architectures. You are required to deliver a software solution with a report (1500 words). You should ensure the following are included in your development (this list is not exhaustive):

Part A (Design, implement and evaluate programs):

- You can select an application of your choice and parallelize it (e.g. image filtering, discrete wavelet transform, matrix multiplication, discrete cosine transforms, etc.).
- You can use any programming language (Python, Java, C/C++, etc.) with which you are conversant and submit your source code.
- You are free to use any hardware (CPU, GPU, or APU).
- You are free to use any operating system (Linux, Windows, etc.).
- You can use external libraries such as OpenMP, OpenCL, CUDA, etc.

Part B (Report, 1500 words): You are required to submit a report of about 1500 words along with the code (both sequential version and parallel version). Also, you need to provide a demo video/presentation of the working of your code. Your report should contain at least the following information:

- Summary or Introduction
- Programming language and hardware details: In this section, you should include details about programming language and hardware. Also, this is a section to mention external libraries.

**Q10:** 8) For a vector addition, assume that the vector length is 8000, each thread calculates one output element, and the thread block size is 1024 threads. The programmer configures the kernel launch to have a minimal number of thread blocks to cover all output elements. How many threads will be in the grid? (a) 8000 (b) 8196 (c) 8192 (d) 8200

**Q11:** 9) Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel? (a) 1 (b) 1000 (c) 512 (d) 512000

**Q12:** 3. Consider the three kernels below. Assume that you have 4 blocks, each with 4 threads. In each case, write the value of the array a.

- `__global__ void kernel(int *a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = 7; }`
- `__global__ void kernel(int *a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = blockIdx.x; }`
- `__global__ void kernel(int *a) { int idx = blockIdx.x * blockDim.x + threadIdx.x; a[idx] = threadIdx.x; }`

Solution: 1-

**Q13:** Q5. [5 Pts: 2, 3 pts, 10 minutes] 1. Write a CUDA kernel incrementing a float array A of size N for a 1D grid, using 1D thread blocks, and assuming that each thread increments one element. Solution: `__global__ void increment_On_Device(float *A_d, int N) {`

**Q14:** 2. Assuming that each thread block counts 64 threads, write the host code launching the kernel (including memory allocation on the device and host-device data transfers). Solution: `#include <cuda.h>` `void main(float *A, int N) { float *A_h; float *A_d;`

**Q15:** Q6. [6 Pts: 3 each, 20 minutes] Consider the following array: [4 6 7 1 2 8 5 2]. a. Perform a parallel inclusive prefix scan on the array using the Kogge-Stone algorithm. Report the intermediate states of the array after each step. How many add operations? Solution: b. Repeat using the work-efficient algorithm (Brent-Kung). Solution:

**Q16:** Q7. [3 Pts, 10 minutes] Assume we have a matrix of 80 by 100 and we want to cover and index it with thread blocks of size 32 by 32. Analyze this case. What is the performance impact of divergent warps? Solution:

**Q17:** Q2. [8 Pts: 1 pt. each, 15 minutes] Indicate whether the following statements are true or false.

1. [ ] Functions annotated with the `__device__` qualifier may be called on the host or the device.
2. [ ] Page faults cannot be handled by software because the overhead is too large.
3. [ ] Virtual memory space has to be bigger than the physical memory space.
4. [ ] You can have a miss in the TLB, a hit in the page table, and a miss in the cache for a single memory access.
5. [ ] Shared memory in CUDA is accessible to both the host and GPU.
6. [ ] In the case of warp divergence, all possible execution paths are run by all threads in a warp serially so that thread instructions do not diverge.
7. [ ] All thread blocks involved in the same computation use the same kernel.
8. [ ] It is possible to multiply two 1024×1024 matrices using a tiled matrix multiplication code with 1,024 thread blocks on a device with a block size of 512 threads. Note that each thread in a thread block calculates one element of the result matrix.

**Q18:** Q3. [6 Pts: 1.5, 1.5, 3 pts, 15 minutes] Conceptual questions:

1. Suppose we are building a computer system around a processor/core that is capable of running at 500 MHz. The processor cycle time is 2 ns (nanoseconds). Suppose we can build a data cache that takes less than 2 ns to respond to a read request, so a cache hit would not slow down the processor at all, and a main memory that takes 100 ns, or 50 cycles, to similarly respond, forcing the processor to stall for those 50 cycles. We are considering a range of cache sizes in our design. Suppose we chose a cache size that would give a cache hit rate of 99%; what is the average memory access time? Solution:
2. We want to use each thread to calculate two elements of a vector addition. Each thread block processes 2*blockDim.x consecutive elements that form two sections. All threads in each block will first process one section, each processing one element. They will then all move to the next section, each processing one element. Assume that variable i should be the index for the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index of the first element? Solution:

**Q19:** Question 1 (20 pts): Implement a 32x64 register file (32 registers; each register is 64 bits wide). Below is a specification of the register file:

- Inputs Ra and Rb are read register indices. Input Ra indexes the register whose value is on BusA, and input Rb indexes the register whose value is on BusB.
- Input Rw is the write register index.
- When RegWr is high, the data on BusW is stored in the register specified by Rw, on the negative (falling) clock edge. Register reading should occur after the register write (on the negative clock edge), but before the positive clock edge.
- Register 31 must always read zero, even if it has not been written to.
- The RegisterFile module should have the following interface: `module RegisterFile(BusA, BusB, BusW, RA, RB, RW, RegWr, Clk); output [63:0] BusA; output [63:0] BusB; input [63:0] BusW; input [4:0] RA; input [4:0] RB; input [4:0] RW; input RegWr; input Clk; reg [63:0] registers [31:0];`

[Figure: block diagram of the 32x64 register file with inputs Ra, Rb, Rw, BusW, RegWr, and Clk, and outputs BusA and BusB]

Save your new register file as RegisterFile.v. Test your code against the provided testbench to make sure it is working (if you are not working in Vivado, you may need an include "RegisterFile.v" command). Demo for your TA. Attach a zip/tar file containing your completed module along with a screenshot of the waveform. Provided file: registerfile tb.v

**Q20:** Question 2 (20 pts): Write a behavior model to calculate the next PC for an instruction. It will use information from the processor control module and the ALU to determine the destination for the next PC. It will contain two adders for calculating the two possible NextPC values and choose between the two possibilities using the logic depicted below.

[Figure: next-PC datapath showing the Uncondbranch, Branch, and ALU Zero signals, a shift-left-2 unit, and an adder producing the branch target]

Use the following module interface (which you can find in the provided file): `module NextPCLogic(NextPC, CurrentPC, SignExtImm64, Branch, ALUZero, Uncondbranch); input [63:0] CurrentPC, SignExtImm64; input Branch, ALUZero, Uncondbranch; output [63:0] NextPC; reg [63:0] tmp; /* write your code here */ endmodule`

Branch (CBZ) is true if the current instruction is a conditional branch instruction, Uncondbranch is true if the current instruction is an unconditional branch (B), and ALUZero is the Zero output of the ALU. A template and testbench are provided (if you work in Vivado, comment out the include command at the very top). Complete the next PC logic; add a few test cases to the testbench to improve it. Demonstrate your program to the TA. Attach a zip/tar file containing your completed module, testbench, and a screenshot of the waveform. Provided files: NextPClogic.v, NextPClogicTest.v
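
The launch-configuration arithmetic behind Q10 (a vector of 8000 elements, one element per thread, 1024 threads per block) can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `grid_threads` is hypothetical and does not come from the original question, and it assumes the usual ceiling-division rule for choosing the smallest number of blocks that covers every element.

```python
def grid_threads(n_elements, block_size):
    """Return (blocks, total_threads) for the smallest grid covering n_elements.

    Hypothetical helper for illustration; assumes one thread per element.
    """
    # Ceiling division: round up so the last, partially used block is included.
    blocks = (n_elements + block_size - 1) // block_size
    return blocks, blocks * block_size


blocks, threads = grid_threads(8000, 1024)
print(blocks, threads)
```

Because the grid is launched in whole blocks, some threads in the last block fall past the end of the vector, which is why kernels of this kind normally guard the access with a bounds check on the element index.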

- C/C++
- Java
- Python
- Agile Software Development
- Android App Development
- Artificial Intelligence
- Assembly Programming
- Big Data
- C#
- Cloud Computing
- Compiler Design
- Computer Graphics
- Computer Networks
- Computer Organisation And Architecture
- Cryptography
- Cyber Security
- Data Mining
- Data Science
- Data Structures And Algo
- Data Warehousing
- DBMS
- Deep Learning
- Distributed Computing
- Formal Language Automata
- Haskell Programming
- Internet Of Things
- Machine Learning
- Mobile Computing
- Multimedia Technology
- Natural Language Processing
- Object Oriented Analysis And Design
- Operating System
- Programming Language Principle And Paradigm
- Prolog Programming
- Real Time System
- Software Engineering
- Web Designing And Development
- Design And Analysis Of Algorithms
- React
- Coding

TutorBin believes that distance should never be a barrier to learning. With more than 500,000 orders and 100,000 happy customers, TutorBin has become the name that keeps learning fun in the UK, USA, Canada, Australia, Singapore, and UAE.