ICOM 5995
Performance Instrumentation and
Visualization for High Performance Computer Systems
Department of Electrical & Computer Engineering
University of Puerto Rico – Mayaguez
Fall 2002
Lecture 3: September 18, 2002
Announcements:
· Please check that your name appears in the official list of students of ICOM 5995
· Attendance is a must. The attendance report to the university was already generated.
Topics:
Overview of parallel and distributed systems
System architectures
Parallel and Distributed Systems: Architectures
The computing power requirements of scientific applications have led to different approaches for meeting the demands of processing speed, memory size and speed, and data input/output rates. Increases in performance have come from several advances, some of which are described below.
Goal: To execute as many operations in the processor per clock cycle as possible.
How: by performing several operations at the same time in the processor.
This leads to different approaches in architecture.
Classes of processors:
· Superscalar
· Superpipeline
· Long Instruction Word
Superscalar Processor
Many functional units in the processor work at the same time. The compiler generates an instruction mix of independent instructions to keep the units busy.
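As a hedged illustration (not from the lecture notes), the C sketch below contrasts a loop whose additions form a single dependency chain with a loop that uses four independent accumulators; the second form gives a superscalar processor independent instructions that its multiple units can issue in the same cycle.

    /* Illustrative sketch: exposing independent instructions to a superscalar CPU. */
    #include <stdio.h>

    #define N 1024

    /* One accumulator: every addition depends on the previous one. */
    double sum_chained(const double *x)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += x[i];
        return s;
    }

    /* Four independent accumulators: the additions within an iteration are
     * independent, so several functional units can be kept busy per cycle. */
    double sum_unrolled(const double *x)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < N; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    int main(void)
    {
        double x[N];
        for (int i = 0; i < N; i++)
            x[i] = 1.0;
        printf("%f %f\n", sum_chained(x), sum_unrolled(x));
        return 0;
    }

Both functions compute the same sum; the only difference is how much instruction-level parallelism they expose to the hardware.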
Superpipelined Processors
The stages of the processor's functional units are divided into simpler stages that work together. This increases the overall speed of the pipeline, allowing higher clock speeds and therefore higher throughput.
Long Instruction Word
Each instruction word joins two or more regular instructions into a single long instruction word; the simpler instructions it contains are executed simultaneously. The processor relies on the compiler to generate these instruction words. VLIW stands for Very Long Instruction Word. Several floating-point, integer, branch, and memory operations may be initiated each clock cycle, so the instructions exercise multiple functional units on every cycle.
Other Advanced Features:
Branch Prediction: Avoids the penalty of flushing and refilling the pipeline when a conditional branch is encountered. Several schemes exist to predict which alternative of a branch decision will be taken. One approach is to record which way the processor has gone in the past and predict future behavior from that history.
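As a hedged illustration (not from the lecture notes), the sketch below runs the same data-dependent branch over random data and over data arranged in long runs; a history-based predictor does well in the second case and poorly in the first, where frequent mispredictions force pipeline flushes.

    /* Illustrative sketch: the same branch is unpredictable on random data
     * and highly predictable on data with long runs of identical outcomes. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 100000

    long count_large(const int *x)
    {
        long count = 0;
        for (int i = 0; i < N; i++) {
            if (x[i] >= 128)          /* conditional branch evaluated per element */
                count++;
        }
        return count;
    }

    int main(void)
    {
        static int data[N];

        for (int i = 0; i < N; i++)   /* random values: outcome hard to predict */
            data[i] = rand() % 256;
        printf("random data: %ld\n", count_large(data));

        for (int i = 0; i < N; i++)   /* increasing values: long predictable runs */
            data[i] = (i * 256) / N;
        printf("sorted data: %ld\n", count_large(data));
        return 0;
    }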
Parallel Architectures
Scalable parallel processors (SPPs) are also used to meet performance demands. SPPs consist of hundreds or thousands of state-of-the-art interconnected processors. The goal of SPPs is to obtain fast computers through highly parallel designs and substantial parallelism within each processor. Issues to consider when designing SPPs include whether to use Single-Instruction Multiple-Data (SIMD), Multiple-Instruction Multiple-Data (MIMD), or Very Long Instruction Word (VLIW) processors. In the SIMD architecture, the same instruction is executed on multiple processors at the same time. In MIMD, each processor operates on its own instructions and data. The real challenge with SPPs is making the system perform at its full potential. Some of the challenges include decisions on the interconnection hardware, memory system design, compilers, and algorithm design.
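As a hedged sketch (not from the lecture notes), the code below contrasts the two models in plain C: a data-parallel vector addition captures the SIMD idea of one instruction stream applied to many data elements, while the worker function mimics the MIMD idea of each processor running its own instruction stream, selected here by a hypothetical rank identifier.

    #include <stdio.h>

    #define N 8

    /* SIMD view: the same operation is applied to every data element. */
    void vector_add(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* MIMD view: each processor executes its own instruction stream on its
     * own data; 'rank' is a hypothetical processor identifier. */
    void mimd_worker(int rank)
    {
        if (rank == 0)
            printf("rank 0: distributing work\n");
        else
            printf("rank %d: computing on its local data\n", rank);
    }

    int main(void)
    {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
        vector_add(a, b, c, N);
        printf("c[%d] = %.1f\n", N - 1, c[N - 1]);

        for (int rank = 0; rank < 4; rank++)   /* pretend four MIMD processors */
            mimd_worker(rank);
        return 0;
    }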
In high-performance systems, efficient memory access schemes are needed. Multiple levels of cache are used to speed up data and instruction access. Instruction reordering and data prefetching are used to hide the latency caused by slow memories. Compilers are thus left with the task of generating efficient code that takes advantage of the hardware. Memory access in shared-memory systems must incorporate cache coherence mechanisms to avoid reading stale data from the caches.
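As a hedged illustration (not from the lecture notes), the sketch below sums the same matrix twice: the row-major traversal matches C's memory layout and reuses cache lines, while the column-major traversal strides across memory and suffers far more cache misses.

    /* Illustrative sketch: memory access order determines cache behavior. */
    #include <stdio.h>

    #define N 512

    static double a[N][N];

    double sum_row_major(void)        /* unit stride: good cache locality */
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_column_major(void)     /* stride of N doubles: poor locality */
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
        printf("%f %f\n", sum_row_major(), sum_column_major());
        return 0;
    }

Both functions return the same result; only the traversal order, and therefore the cache behavior, differs.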
Metacomputing, also called grid computing, is another approach to obtaining high-performance systems. The idea behind metacomputing is to build a highly powerful distributed system out of physically distributed computers, so that the best available resources can be used jointly to solve a problem. The system should be transparent to the users, who concentrate on the solution of their respective problems and not on the computational requirements of the problem. There are several issues in the development of such a global computing platform, such as software compatibility, high-performance networks, security, and user-friendly interfaces. Metacomputing involves the interconnection of high-performance networks, implementing a distributed file system, coordinating user access to different computational structures, and making the environment easy to use and transparent to the user.
Ethernet is used as the local-area network connection in network-of-workstations (NOW) architectures. In a network of workstations, each node of the system is a workstation that collaborates with the others via message passing.
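Since the course cluster runs LAM/MPI, the following is a minimal message-passing sketch in C (an assumed example, not part of the lecture notes): every non-zero rank sends its rank number to node 0, which prints what it receives. With LAM such a program would typically be compiled with mpicc and launched with mpirun once the nodes have been booted.

    /* Minimal message-passing sketch for a network of workstations. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this workstation's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of workstations */

        if (rank == 0) {
            int value;
            for (int src = 1; src < size; src++) {
                MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
                printf("node 0 received %d from node %d\n", value, src);
            }
        } else {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }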
Plan
Kennie’s manual
General Instructions for ICOM 5995 Course
-----------------------------------------
* LAM Hosts
There are a total of 8 LAM hosts (clients). These are: aramana, netlab02, albizu, netlab04, yuisa, netlab03, bayrex, betances. There is no particular order.
* Generate SSH keys
To be able to work with the system you will need to generate SSH authentication keys, so that the remote nodes will let you execute commands without an interactive login.
1. ssh-keygen -t rsa
> accept the defaults
> do not set a passphrase
2. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
3. Log in to every node of the LAM topology.
After this you can now play with LAM.