Vector Processing

In many science and engineering applications, the problems can be formulated in terms of vectors and matrices that lend themselves to vector processing.
Computers with vector processing capabilities are in demand in specialized applications, for example:
- Long-range weather forecasting
- Petroleum exploration
- Seismic data analysis
- Medical diagnosis
- Artificial intelligence and expert systems
- Image processing
- Mapping the human genome
To achieve the required level of performance, it is necessary to use the fastest and most reliable hardware and to apply innovative procedures from vector and parallel processing techniques.

Vector Operations

Many scientific problems require arithmetic operations on large arrays of numbers.
A vector is an ordered set of data items arranged as a one-dimensional array.
A vector V of length n is represented as a row vector by V = [V1, V2, …, Vn].
To examine the difference between a conventional scalar processor and a vector processor, consider the following Fortran DO loop:

      DO 20 I = 1, 100
   20 C(I) = B(I) + A(I)

This is implemented in machine language by the following sequence of operations.

   Initialize I = 1
20 Read A(I)
   Read B(I)
   Store C(I) = A(I) + B(I)
   Increment I = I + 1
   If I <= 100 go to 20
   Continue

A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in the program loop. It allows the whole operation to be specified with a single vector instruction of the form:

C(1:100) = A(1:100) + B(1:100)
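
The contrast between the two forms can be sketched in Python. This is only an illustration, not a model of any real machine: the function names are hypothetical, and the count of five steps per iteration (two reads, one store, one increment, one test-and-branch) is an assumption taken from the operation sequence listed above.

```python
def scalar_add(a, b):
    """Add two equal-length lists one element at a time, counting loop steps."""
    n = len(a)
    c = [0] * n
    steps = 1                  # initialize the index
    i = 0
    while i < n:
        c[i] = a[i] + b[i]     # read A(I), read B(I), store C(I)
        i += 1                 # increment the index
        steps += 5             # 2 reads + 1 store + 1 increment + 1 test-and-branch
    return c, steps


def vector_add(a, b):
    """Model the same work as a single vector instruction over the whole range."""
    c = [x + y for x, y in zip(a, b)]
    return c, 1                # one instruction to fetch and decode


a = list(range(1, 101))
b = list(range(101, 201))
c_scalar, scalar_steps = scalar_add(a, b)
c_vector, vector_steps = vector_add(a, b)
print(scalar_steps, vector_steps)
```

Both functions produce the same result vector. Under the stated assumption, the scalar version accounts for 501 steps for 100 elements, while the vector version is specified once.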

A possible instruction format for a vector instruction is shown in Fig. 4-11.
- This assumes that the vector operands reside in memory.
It is also possible to design the processor with a large number of registers and store all operands in registers prior to the addition operation.
- In that case, the base address and length in the vector instruction specify a group of CPU registers.

Fig 4-11: Instruction format for vector processor
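
One way to picture such an instruction is as a small record that is interpreted against memory. The sketch below is an assumption, not the format in the figure: the text only mentions base addresses and a length, so the three-address layout, the field names, and the "ADD"/"MUL" mnemonics are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VectorInstruction:
    opcode: str        # operation to apply, e.g. "ADD" (hypothetical mnemonic)
    src1_base: int     # base address of the first source vector
    src2_base: int     # base address of the second source vector
    dest_base: int     # base address of the destination vector
    length: int        # number of elements in each vector


def execute(instr, memory):
    """Apply the instruction element by element to a flat memory list."""
    ops = {"ADD": lambda x, y: x + y, "MUL": lambda x, y: x * y}
    op = ops[instr.opcode]
    for k in range(instr.length):
        memory[instr.dest_base + k] = op(memory[instr.src1_base + k],
                                         memory[instr.src2_base + k])


# Two 4-element vectors at addresses 0 and 4; the result goes to address 8.
memory = [1, 2, 3, 4, 10, 20, 30, 40, 0, 0, 0, 0]
execute(VectorInstruction("ADD", src1_base=0, src2_base=4, dest_base=8, length=4), memory)
print(memory[8:12])
```

If the operands were held in registers instead, the same base and length fields could index a register file rather than the memory list.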

The multiplication of two n x n matrices consists of n² inner products or n³ multiply-add operations.
- Consider, for example, the multiplication of two 3 x 3 matrices A and B.
- c11 = a11 b11 + a12 b21 + a13 b31
- This requires three multiplications and (after initializing c11 to 0) three additions.
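
The operation counts can be checked with a short Python sketch. The function name and the sample matrices are hypothetical; the code simply counts one inner product per result element and one multiply-add per product term.

```python
def matmul_counted(a, b):
    """Multiply two n x n matrices and count the work involved."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    inner_products = 0
    multiply_adds = 0
    for i in range(n):
        for j in range(n):
            inner_products += 1                 # one inner product per c[i][j]
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]    # one multiply-add per term
                multiply_adds += 1
    return c, inner_products, multiply_adds


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
c, inner_products, multiply_adds = matmul_counted(a, b)
print(inner_products, multiply_adds)
```

For the 3 x 3 case this counts 9 inner products and 27 multiply-add operations, matching n² and n³.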

In general, the inner product consists of the sum of k product terms of the form C = A1 B1 + A2 B2 + A3 B3 + … + Ak Bk.
- In a typical application k may be equal to 100 or even 1000.

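Assuming the adder pipeline has four segments, the product terms are accumulated into four interleaved partial sums, which are added at the end to form the final result:

C = A1 B1 + A5 B5 + A9 B9 + A13 B13 + ...
  + A2 B2 + A6 B6 + A10 B10 + A14 B14 + ...
  + A3 B3 + A7 B7 + A11 B11 + A15 B15 + ...
  + A4 B4 + A8 B8 + A12 B12 + A16 B16 + ...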

Fig 4-12: Pipeline for calculating an inner product
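
The grouping of product terms into interleaved partial sums can be sketched as follows. This models only the arithmetic grouping, not the pipeline timing; the function name is hypothetical, and the default of four partial sums is an assumption corresponding to a four-segment adder pipeline.

```python
def pipelined_inner_product(a, b, segments=4):
    """Accumulate products into `segments` interleaved partial sums, then combine."""
    partial = [0] * segments
    for i, (x, y) in enumerate(zip(a, b)):
        # Terms 1, 5, 9, ... go to partial[0]; terms 2, 6, 10, ... to partial[1]; etc.
        partial[i % segments] += x * y
    return sum(partial), partial


a = list(range(1, 17))      # A1 .. A16
b = [2] * 16                # B1 .. B16
total, partial = pipelined_inner_product(a, b)
print(total, partial)
```

The combined result equals the ordinary inner product; only the order in which the terms are summed differs.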

Pipeline and vector processors often require simultaneous access to memory from two or more sources.
- An instruction pipeline may require the fetching of an instruction and an operand at the same time from two different segments.
- An arithmetic pipeline usually requires two or more operands to enter the pipeline at the same time.
Instead of using two memory buses for simultaneous access, the memory can be partitioned into a number of modules connected to common memory address and data buses.
- A memory module is a memory array together with its own address and data registers.

Fig 4-13: Multiple module memory organization

The advantage of a modular memory is that it allows the use of a technique called interleaving.
In an interleaved memory, different sets of addresses are assigned to different memory modules.
By staggering the memory access, the effective memory cycle time can be reduced by a factor close to the number of modules.
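
A minimal sketch of one common interleaving scheme follows. It assumes low-order interleaving, in which consecutive addresses fall in consecutive modules; the text does not specify the scheme. The timing estimate is a deliberately simplified model that assumes sequential addresses and a new access starting every cycle_time / num_modules. All names and the 100 ns cycle time are hypothetical.

```python
def module_and_offset(address, num_modules):
    """Low-order interleaving: consecutive addresses map to consecutive modules."""
    return address % num_modules, address // num_modules


def sequential_access_time(num_accesses, cycle_time):
    """Single module: every access waits for a full memory cycle."""
    return num_accesses * cycle_time


def staggered_access_time(num_accesses, num_modules, cycle_time):
    """Interleaved modules: the first access takes a full cycle, and each
    following access to the next module completes cycle_time / num_modules later."""
    return cycle_time + (num_accesses - 1) * cycle_time / num_modules


modules = [module_and_offset(addr, 4)[0] for addr in range(8)]
plain = sequential_access_time(100, 100)           # hypothetical 100 ns cycle
interleaved = staggered_access_time(100, 4, 100)
print(modules, plain / interleaved)
```

With four modules, addresses 0 to 7 map to modules 0, 1, 2, 3, 0, 1, 2, 3, and under this model the speedup for 100 sequential accesses is a little below 4, in line with "a factor close to the number of modules".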

A commercial computer with vector instructions and pipelined floating-point arithmetic operations is referred to as a supercomputer.
- To speed up the operation, the components are packed tightly together to minimize the distance that the electronic signals have to travel.
The standard instruction set is augmented by instructions that process vectors and combinations of scalars and vectors.
A supercomputer is a computer system best known for its high computational speed, fast and large memory systems, and the extensive use of parallel processing.
- It is equipped with multiple functional units and each unit has its own pipeline configuration.

A supercomputer is specifically optimized for numerical calculations involving vectors and matrices of floating-point numbers.
Supercomputers are limited in their use to a number of scientific applications, such as numerical weather forecasting, seismic wave analysis, and space research.
The number of floating-point operations per second that a computer can perform, referred to as flops, is a common measure of its performance.
A typical supercomputer has a basic cycle time of 4 to 20 ns.
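
The cycle time can be related to a peak flops figure with a one-line calculation. The sketch below assumes that a pipeline delivers one floating-point result per cycle, which is an idealization not stated above; the function name is hypothetical.

```python
def peak_megaflops(cycle_time_ns):
    """Peak rate in megaflops, assuming one floating-point result per cycle."""
    # 1 result per cycle_time_ns nanoseconds = 1e9 / cycle_time_ns results per second,
    # which is 1e3 / cycle_time_ns million results per second.
    return 1e3 / cycle_time_ns


fast = peak_megaflops(4)     # 4 ns cycle
slow = peak_megaflops(20)    # 20 ns cycle
print(slow, fast)
```

Under this assumption, cycle times of 4 to 20 ns correspond to roughly 250 down to 50 megaflops per pipeline.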
Examples of supercomputers:
- Cray-1: uses vector processing with 12 distinct functional units operating in parallel and a large number of registers (over 150); multiprocessor configurations are available in the Cray X-MP and Cray Y-MP
- Fujitsu VP-200: 83 vector instructions and 195 scalar instructions; 300 megaflops