Keynote
Optimizing software for the new generation of parallel computing environments is the key to opening new markets for your engineering solutions.
Doing a job in parallel is not a new idea; it is natural thinking. When we do something at home or at the workplace, we routinely divide it among a workforce so that tasks run in parallel and finish faster. The computing world has had the same idea for many decades. But for many reasons, and unfortunately, the processor industry gave undue importance to speeding up the single CPU over these past decades. So, other than on a few supercomputers, programmers never got an opportunity to apply their 'parallel philosophy' while solving computing problems. In the recent past, a silent but steady revolution has happened in the processor industry. Parallel processing is now at our fingertips: almost all desktops, laptops, tablets and phones have this processing capability. The introduction of multicore CPUs and GPUs (Graphics Processing Units) is behind this move. Interestingly, this hardware came about because of a compelling need from the software community in certain computing domains. But the majority of software is yet to utilize this advantage. Typical legacy software, which might have been very good in the era of serial processing, is wasting the capacity of this wonderful platform that the end user has paid for. So programmers have to attempt a change. If you are managing a system built on software, this is a matter of securing the interests of your users and customers. Also, since we are optimizing the computing power delivered per watt, it is eco-friendly thinking too.
More about processors
Computing, in a simple sense, means arithmetic. So the ALU (arithmetic and logic unit) is perceived as the main part of a typical processor. Interestingly, this changed over the course of time; in other words, the evolution of the microprocessor forced designers to change it. Moore's law still stands good: it says that the number of transistors that can be packed into a chip doubles roughly every 18 months to two years. Transistor count is the de facto standard for describing the complexity and scale of the electronics in an integrated circuit. This growth gave chip designers the opportunity to pack in more digital logic. The CPU and the other types of chips are discussed briefly below.
The CPU story: The influence of CISC (a family of microprocessor designs that accepts fairly specific instructions for fairly specific purposes), especially in the personal computing field, made the real estate on the silicon crowded with non-arithmetic logic. More and more specific functions were hardwired so that each completes in a single instruction. Then how does an old IBM PC-DOS program run faster on a new computer? It is largely due to the speed enhancement of the clock (the pulse that drives the chip to execute the instructions of a program). The majority of the processor is certainly not utilized at any given time. The various functional units of a typical microprocessor can work independently, but unfortunately the programming interface was designed for serial programming. It is like a group of people on a work-ground who are ready to do different jobs, but with only one supervisor who accepts a piece of work, schedules it, and waits for its completion before accepting the next. Certain improvements happened over time; the important one is pipelining, where the supervisor wisely takes up the next couple of instructions while a few workers are still busy. A level of parallelism came into the system, even though it is invisible to the programmer. In the recent past the x86 processor community, which is prominent in the PC market, also moved to the concept of accommodating more 'processors' in one die; such processors are branded as 'multicore'. The reasons for the shift are many: the instruction set already addresses almost everything even the CISC world needs, and increases in clock speed have almost stopped. The system software (operating systems and middleware) gave an almost no-change feel to programmers, because the threading system of the OS can utilize the additional cores without any change in the application software. To address concurrent systems, the software community had already moved to multithreading in the age of single-core systems. Thus task-based parallelism and instruction-level parallelism have both been accommodated.
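As a minimal sketch of this task-based parallelism on a multicore CPU (illustrative only; the function and variable names are assumptions, not taken from this article), the fragment below launches two independent tasks as OS threads, and the thread scheduler is free to place them on different cores:

    #include <thread>
    #include <functional>
    #include <vector>
    #include <numeric>
    #include <iostream>

    // Two independent tasks: the OS scheduler may run them on separate cores.
    void sumTask(const std::vector<double>& v, double& out) {
        out = std::accumulate(v.begin(), v.end(), 0.0);
    }
    void scaleTask(std::vector<double>& v, double factor) {
        for (double& x : v) x *= factor;
    }

    int main() {
        std::vector<double> a(1000000, 1.0), b(1000000, 2.0);  // placeholder data
        double total = 0.0;

        std::thread t1(sumTask, std::cref(a), std::ref(total)); // task 1 on its own thread
        std::thread t2(scaleTask, std::ref(b), 3.0);            // task 2 runs concurrently

        t1.join();
        t2.join();
        std::cout << "sum = " << total << "\n";
        return 0;
    }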
There is another instruction-set philosophy, called RISC, which has been employed in microprocessors from their early days. Put simply, it accommodates only a basic set of instructions, so compared to a CISC program a RISC program will need more instructions to do the same task.
In factories, a typical method to increase production is to increase parallelization: the same kind of job is executed by many workers at the same time. A similar pattern was adopted in computing under the name SIMD (Single Instruction, Multiple Data). A good example is the DSP, which was developed for signal and image processing. Because Moore's law held well, these processors had the option of going to a parallel design early; in other words, it was the more natural path for them. Since the complexity of their instructions is low, the arithmetic/computing logic takes a good share of the silicon's real estate, and those parts can naturally accommodate more execution paths. Image-processing-style software problems are known as 'embarrassingly parallel'. Another example that utilized this is the GPU (Graphics Processing Unit). Besides DSPs, CPU cores also offer SIMD extensions with special instructions (for example, Intel's SSE).
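As an illustrative sketch of such a SIMD extension, assuming Intel's SSE intrinsics, 16-byte-aligned arrays and a length that is a multiple of four (the function and array names are placeholders), the loop below adds four floats per instruction instead of one:

    #include <xmmintrin.h>   // SSE intrinsics

    // Add two float arrays four elements at a time.
    // Assumes n is a multiple of 4 and the pointers are 16-byte aligned.
    void addSSE(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);          // load 4 floats from a
            __m128 vb = _mm_load_ps(b + i);          // load 4 floats from b
            _mm_store_ps(c + i, _mm_add_ps(va, vb)); // one instruction adds all 4
        }
    }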
For displaying pictures, a display controller (CRTC) was present from the early days of the PC. As the demand for display processing grew, a new kind of processor slowly evolved: the GPU. GPUs have special hardwired logic, the graphics pipeline, for handling 3D geometry, lighting and shading. Since the entertainment and game industries funded this processor market, it gradually acquired a compute-oriented design on silicon. But mainstream programmers did not pay attention to this area, because the GPU in its early days was mainly meant for display; the primary intention was fast geometric computation and pixel processing. In the past eight years these processors have been adopted by a set of programmers for general-purpose programming. When our team started parallelizing software on the GPU, back in 2005, the movement was named GPGPU. Then NVIDIA, one of the prominent companies, provided a general-purpose programming language for their platform: CUDA. Where the CPU gives importance to task-based parallelism, systems like the GPU give importance to a method called data parallelism. Since it operates in the same way on huge volumes of data, its massively parallel ALU array is best in class; it is more like a supercomputer at your desktop. Interestingly, the current supercomputers are also using these inexpensive (relative to processing capacity) processors at large scale.
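As a minimal sketch of this data-parallel style in CUDA (the kernel name, array names and launch parameters are illustrative assumptions), each of thousands of GPU threads applies the same operation to one element of a large array:

    #include <cuda_runtime.h>

    // Each GPU thread handles exactly one element: same instruction, multiple data.
    __global__ void scaleAndAdd(const float* x, const float* y, float* out,
                                float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a * x[i] + y[i];
    }

    // Host-side launch over device arrays d_x, d_y, d_out of length n.
    void launch(const float* d_x, const float* d_y, float* d_out, float a, int n) {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;  // enough blocks to cover all n elements
        scaleAndAdd<<<blocks, threads>>>(d_x, d_y, d_out, a, n);
        cudaDeviceSynchronize();
    }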
In summary, today's processors offer different kinds of optimized processing for the logic that we have, but without changing the software we do not utilize the full power of the platform. It is true that the huge investment in existing software has to be carefully analyzed before deciding on a better optimization.
Engineering systems and their future
As described before, the entertainment and game industries were the first users of these massively parallel systems. In fact, programmers considered only embarrassingly parallel problems in the first five years.
In the recent past, another revolution has taken place in the production and deployment of sensors, with concepts like the Internet of Things emerging around it. In engineering systems, the data produced by these sensors is logged predominantly for human validation. Since the data is far too large for humans to digest, areas like data analytics have gained importance.
Irrespective of the industry vertical, massively parallel solutions are in demand for enhancing existing engineering solutions so that 'informed' decisions can be taken. For that, the machine has to process the vast amount of data produced by these sensors. Users are ready to welcome suggestions from computers, from mere statistical analysis to machine learning. Terminology like 'big data' has been coined by computing enthusiasts around this. Machine learning is used here to arrive at relations in data that would look unrelated to humans. Even in saturated markets, this 'sixth sense' will gain importance at the user level, as it helps the user take informed decisions.
Steps of adaptation
Even though parallel systems are natural in everyday life, in software development it is usually difficult to craft a parallel algorithm under real memory and platform constraints. Expertise in this field is also scarce, and building it through training is hard: the theoretical knowledge stays at an abstract level, and the available case studies come mostly from the supercomputing field. So time-to-market will be a major concern when attempting totally new software. Considering this business aspect, there are multiple steps for reaching excellence in performance.
As a first step, it is normally good to invest in optimizing the existing software. I suggest boosting the existing software with the help of tools (like Intel Parallel Studio). Instruction pipelining and task parallelism can be attempted; if the system is deployed in a distributed model across a network, the approach can be extended accordingly. The software should also be given an automatic hardware adaptation feature, as sketched below. This can be attempted for the first parallel-optimized release. The SIMD style can be tried at certain bottleneck areas; it will give extra mileage from the CPU. Since the cores of a CPU are limited in number, this method gives maximum value for money at this stage.
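A minimal sketch of such automatic hardware adaptation (assuming a CUDA-capable build; the messages and fallback policy are purely illustrative) is to query the platform at start-up and pick the execution path accordingly:

    #include <cuda_runtime.h>
    #include <thread>
    #include <cstdio>

    // Query the platform at start-up and choose an execution strategy.
    int main() {
        unsigned cpuThreads = std::thread::hardware_concurrency();  // logical CPU cores
        int gpuCount = 0;
        cudaError_t err = cudaGetDeviceCount(&gpuCount);            // fails if no device/driver

        if (err == cudaSuccess && gpuCount > 0)
            std::printf("Using GPU path (%d device(s))\n", gpuCount);
        else if (cpuThreads > 1)
            std::printf("Using multithreaded CPU path (%u threads)\n", cpuThreads);
        else
            std::printf("Falling back to serial path\n");
        return 0;
    }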
After the power of the CPU has been exploited, the next stage in making the best software is adapting it to massively parallel hardware such as the GPU. The massively parallel approach lies in analysing the possibilities for data-parallel algorithms. The major step in arriving at such an algorithm is data-flow analysis, which can be done on the existing software using custom tools. Depending on the data-flow density and the amount of intermediate calculation, the problem has to be solved with data-parallel methods like stream programming. Usually the implementation demands a heterogeneous execution environment, such as a CPU + GPU configuration with the CPU as the supervising host and the GPU as the massively parallel processor. Depending on the need, more than one GPU can be integrated into the platform; such multi-GPU solutions demand answers to another set of issues related to memory sharing between the GPUs. Finally, for maximum performance, different hardware-level heuristics are applied, as the GPU industry is not yet that stabilized. Even though it is not that common, some instruction combinations perform better on certain processor architectures (like Fermi or Kepler from NVIDIA). Another aspect that affects the overall design is the size of the data at each stage of computation. So data-structure design (for better organization of data, for easy access and compression) and the compilation of information into precomputed tables (like a logarithm table for humans) are important. The cache architecture of the target processor should always be kept in mind, even when we decide to keep the tables intact (for example, a read-only modifier is a hint to the cache manager).
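As a hedged CUDA sketch of these two ideas, a precomputed table plus a read-only hint (the table name, size and index scaling are assumptions for illustration), the fragment below keeps a lookup table in constant memory and marks the input pointer read-only so that the hardware can serve it from a cached path:

    // A small precomputed lookup table, analogous to a logarithm table for humans.
    // Placed in __constant__ memory so it is served from the constant cache;
    // fill it from the host with cudaMemcpyToSymbol before launching the kernel.
    __constant__ float c_logTable[1024];

    // 'const ... __restrict__' tells the compiler the input is read-only, so
    // Kepler-class GPUs can route those loads through the read-only data cache.
    __global__ void applyTable(const float* __restrict__ in,
                               float* __restrict__ out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int idx = min(1023, max(0, (int)(in[i] * 1023.0f)));  // map input into table range
            out[i] = c_logTable[idx];
        }
    }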
Normally the core architecture of the HMI (human-machine interface) is not involved in this modification, so automated testing can be employed for the final tune-up and for repeated releases.
Conclusion
The new generation of users is ready, and there are now good changes on the hardware side to accommodate parallel thinking. It is not just the business analytics or entertainment industries that benefit: core engineering solutions can also incorporate these changes. After getting the first-level enhancement, try for the more time-consuming but thorough solution to earn the appreciation of the user of the future.
Courtesy: Network Systems and Technologies (P) Ltd, Trivandrum, India. www.nestsoftware.com
Articles from NVIDIA.com and Intel.com
Author's Paper ID: V130829