Thursday, August 29, 2013



Time to go parallel…


Keynote 

Optimizing software for the new generation of parallel computing environments is the key to opening new markets for your engineering solution.
Doing a job in parallel is not a new idea; it is natural thinking. When we do something at home or at the workplace and want it finished faster, we bring in more hands so the work gets done in parallel. In the computing world too, parallelism has existed for decades. But for many reasons, I should say unfortunately, the processor industry gave excessive importance to speeding up a single CPU over the past decades. So, other than on a few supercomputers, programmers never got an opportunity to apply their 'parallel philosophy' while solving computing problems. In the recent past, a silent but steady revolution has happened in the processor industry. Now parallel processing is at our fingertips: almost all desktops, laptops, tablets and phones have this processing capability. The introduction of multicore CPUs and GPUs (Graphics Processing Units) is behind this move. Interestingly, this hardware came about because of a compelling need from the software community in certain computing domains, but the majority of software is yet to utilize the advantage. Typical legacy software, which may have been very good in the era of serial processing, is surely wasting the capacity of this wonderful platform that the end user has paid for. So programmers have to attempt a change. If you are managing a system built on software, it is a matter of securing the interests of your users and customers. And since we are optimizing the processing delivered per watt, it is eco-friendly thinking too.

More about processors

Computing, in a simple sense, means arithmetic. So the ALU (arithmetic and logic unit) is perceived as the main part of a typical processor. Interestingly, that changed over the course of time; in other words, the evolution of the microprocessor forced designers to change. Moore's law still holds good: the number of transistors that can be packed onto a chip doubles roughly every 18 to 24 months. The transistor count is the de facto standard for expressing the complexity and scale of the electronics in an integrated circuit, and it gave chip designers the opportunity to pack in more digital logic. The CPU and the other kinds of chips deserve a brief discussion.
The CPU story: The influence of CISC (a kind of microprocessor design that provides specialized instructions for specific purposes), especially in the personal computing field, made the real estate on the piece of silicon crowded with non-arithmetic logic. More and more specific functions were hardwired so that each could be completed in a single instruction. Then how does an old IBM PC-DOS program run faster on a new computer? It is related to the speed-up of the clock (the pulse that drives the chip to execute the instructions of a program). Certainly, the majority of the processor is not utilized at any given time. The various functional units of a typical microprocessor can basically work independently, but unfortunately the programming interface was designed for serial programming. It is like a group of people on a work site ready to do different jobs, but with only one supervisor who accepts a piece of work, schedules it, and waits for its completion before accepting the next one. Certain improvements happened over time. The important one is pipelining: here the supervisor wisely takes in the next couple of instructions while a few workers are still busy. A level of parallelism came into the system, even though it is invisible to the programmer. In the recent past, the x86 processor community, which is prominent in the PC market, also moved to the concept of accommodating more 'processors' in one die; that kind of processor is branded 'multicore'. The reasons for the shift are many: the instruction set already addresses almost everything even the CISC world needs, and increasing the clock speed has almost come to a stop. The system software, operating systems and middleware, gave an almost no-change feel to programmers, because the threading system of the OS can utilize the additional cores without any change in the application software. To address concurrent systems, the software community had already moved to multithreading in the age of single-core machines. Thus task-based parallelism and instruction-level parallelism have been accommodated.
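As a small illustration of task-based parallelism on a multicore CPU, here is a minimal host-side C++ sketch using std::thread; the OS scheduler is free to place the two threads on different cores. The function name produce_samples and the work it does are illustrative assumptions, not part of any particular product.

    #include <thread>
    #include <functional>
    #include <vector>
    #include <numeric>

    // Two independent tasks run as separate threads; on a multicore CPU the OS
    // scheduler can place them on different cores with no change to the tasks.
    // The task content (filling a sample buffer) is an illustrative placeholder.
    void produce_samples(std::vector<double>& samples)
    {
        samples.assign(1000000, 1.0);   // stand-in for real work, e.g. reading a log
    }

    int main()
    {
        std::vector<double> a, b;
        std::thread worker1(produce_samples, std::ref(a));  // task 1
        std::thread worker2(produce_samples, std::ref(b));  // task 2, runs concurrently
        worker1.join();
        worker2.join();

        double total = std::accumulate(a.begin(), a.end(), 0.0)
                     + std::accumulate(b.begin(), b.end(), 0.0);
        return total > 0.0 ? 0 : 1;
    }

The point of the sketch is that the application expresses independent tasks, and the same binary automatically benefits when more cores are available.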
There is an instruction set philosophy called RISC that has been employed in microprocessors from their inception. In simple terms, it accommodates only a basic instruction set, so compared to a CISC program, a RISC program will need more instructions to do the same task.
At factories, a typical method for increasing production is to increase parallelization: mostly, jobs of the same kind are executed by different workers at the same time. A similar pattern was adopted in computing under the name SIMD (Single Instruction, Multiple Data). A good example is the DSP, which was developed for signal and image processing. Because Moore's law held well, these processors got the chance to adopt parallel designs early. Put another way, it was more natural for them: since the complexity of their instructions is low, the arithmetic/computing logic takes a good share of the silicon real estate, and those parts could naturally accommodate more execution paths. Software problems of the image-processing kind are known as 'embarrassingly parallel'. Another example that utilizes this is the GPU (Graphics Processing Unit). Besides DSPs, CPU cores also offer SIMD extensions with special instructions (for example, Intel's SSE).
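To give a flavour of the SIMD style on the CPU, here is a minimal sketch using Intel's SSE intrinsics; it assumes an SSE-capable processor, and the array names and the multiple-of-four length are illustrative simplifications.

    #include <xmmintrin.h>   // SSE intrinsics

    // Add two float arrays four elements at a time using 128-bit SSE registers.
    // For brevity, n is assumed to be a multiple of 4.
    void add_arrays_sse(const float* a, const float* b, float* out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va   = _mm_loadu_ps(a + i);   // load 4 floats from a
            __m128 vb   = _mm_loadu_ps(b + i);   // load 4 floats from b
            __m128 vsum = _mm_add_ps(va, vb);    // one instruction, four additions
            _mm_storeu_ps(out + i, vsum);        // store 4 results
        }
    }

One instruction operates on four data elements at once, which is exactly the factory pattern of identical jobs being done side by side.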
For displaying pictures, a display controller (the CRTC) was present from the early days of the PC. But as the need for display processing grew, a new kind of processor slowly evolved: the GPU. It has a special hardwired logic called the graphics pipeline for handling 3D geometry, lighting and shading. Since the entertainment and game industries funded this processor market, it gradually acquired a good compute-oriented design on silicon. But mainstream programmers did not give attention to this area, as in its early days the GPU was mainly meant for display: the primary intention was fast geometric computation and pixel processing. In the past eight years, these processors have been adopted by a set of programmers for general-purpose programming. When our team started parallelizing software on the GPU, back in 2005, the movement was known as GPGPU. Then NVIDIA, one of the prominent GPU companies, released a general-purpose programming platform for their hardware: CUDA. Where the CPU gives importance to task-based parallelism, systems like the GPU give importance to a method called data parallelism. Since it performs the same kind of operation on huge amounts of data, its massively parallel array of ALUs is the best in class; it is more like a supercomputer on your desktop. Interestingly, current supercomputers are also using these inexpensive (relative to their processing capacity) processors at large scale.
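To show what data parallelism looks like in practice, here is a minimal CUDA sketch in which every GPU thread applies the same operation to a different element, while the CPU plays the supervising role. The kernel name, the scaling operation and the launch configuration are illustrative assumptions.

    #include <cuda_runtime.h>

    // Each GPU thread handles one element: the same instruction applied to
    // different data (data parallelism).
    __global__ void scale_kernel(const float* in, float* out, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * factor;
    }

    // Host-side launch: the CPU acts as the supervisor, the GPU does the bulk work.
    void scale_on_gpu(const float* host_in, float* host_out, float factor, int n)
    {
        float *d_in = nullptr, *d_out = nullptr;
        cudaMalloc((void**)&d_in,  n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));
        cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

        int block = 256;
        int grid  = (n + block - 1) / block;   // enough blocks to cover n elements
        scale_kernel<<<grid, block>>>(d_in, d_out, factor, n);

        cudaMemcpy(host_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_in);
        cudaFree(d_out);
    }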
In summary, each type of processor offers a different kind of optimized processing for the logic we have. Without changing the software, we are not utilizing the full power of the platform. It is true that the huge investment in existing software has to be analyzed carefully before a better optimization is attempted.

Engineering systems and their future

As described before, the entertainment and game industries were the first users of these massively parallel systems. In fact, programmers considered only embarrassingly parallel problems during the first five years.
In the recent past, another revolution has taken place in the production and deployment of sensors, with concepts like the Internet of Things. In engineering systems, the data produced by these sensors is logged predominantly for human validation. Since the data is far too large for humans to digest, areas like data analytics have gained importance.
Irrespective of the industry vertical, massively parallel solutions are in demand for enhancing existing engineering solutions so that 'informed' decisions can be taken. To take informed decisions, the machine has to process the vast amount of data produced by these sensors. Users are ready to welcome suggestions from computers, from mere statistical analysis all the way to machine learning. Terminologies like 'big data' have been coined by computing enthusiasts around this. Here machine learning comes in to discover relations in data that appear unrelated to humans. Even in saturated markets, this 'sixth sense' will gain importance at the user level, as it helps the user take informed decisions.

Steps of adoption

Even though parallel systems are natural in everyday life, in software development it is usually difficult to craft a parallel algorithm under real memory and platform constraints. Also, expertise in this field is scarce, and building it through training is hard: the theoretical knowledge is only at an abstract level, and the available case studies come from the supercomputing field alone. So time-to-market becomes a major concern when attempting totally new software. Considering this business aspect, there are multiple steps for reaching excellence in performance.
As a first step, it is normally good to invest in optimizing the existing software. I suggest boosting it with the help of tools (like Intel Parallel Studio). Instruction pipelining and task parallelism can be attempted, and if the system is deployed in a distributed model across a network, an extended approach is possible. The software should be given an automatic hardware-adaptation feature. This can be targeted for the first parallel-optimized release. The SIMD style can be tried in certain bottleneck areas; it will extract extra mileage from the CPU. Since the cores of a CPU are limited in number, this approach gives the maximum value for money at this stage; a minimal sketch follows.
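To suggest how little change this first step can require, here is a minimal sketch of loop-level parallelism using an OpenMP pragma on an existing hotspot; the smoothing loop and the function name are illustrative placeholders for real engineering code, and an OpenMP-enabled compiler is assumed.

    #include <omp.h>
    #include <vector>

    // An existing serial hotspot, parallelized across the CPU cores with one pragma.
    // Each iteration is independent, so iterations can run on different cores.
    void smooth_signal(const std::vector<float>& in, std::vector<float>& out)
    {
        const int n = static_cast<int>(in.size());
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i) {
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;   // simple 3-point average
        }
    }

The attraction of this stage is that the algorithm and the data structures stay as they are; only the bottleneck loops are annotated.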
After the power of the CPU has been exploited, the next stage in making the best software is adapting it to massively parallel hardware like the GPU. The massively parallel approach lies in analysing the possibilities for data-parallel algorithms. The major step towards such an algorithm is data-flow analysis, which can be done on the existing software using custom tools. Depending on the density of the data flow and the amount of intermediate calculation, the problem has to be solved with data-parallel methods like stream programming. Usually the implementation demands a heterogeneous environment to execute, such as a CPU + GPU configuration: the CPU is the supervising host and the GPU is the massively parallel processor. Depending on the need, more than one GPU can be integrated into the platform; these multi-GPU solutions demand answers to another set of issues related to memory sharing between the GPUs. Finally, for maximum performance, different hardware-level heuristics are applied, as the GPU industry is not yet that stabilized. Even though it is not that common, some instructions perform better in certain combinations on certain processor architectures (like NVIDIA's Fermi or Kepler). Another aspect that affects the overall design is the size of the data at each stage of computation. So data-structure design (for better organization of data, easy access and compression) and the compilation of information into pre-computed tables (like logarithm tables for humans) are important. The cache architecture of the target processor should always be kept in mind, even when we decide to keep the tables intact (for example, a read-only modifier is a hint to the cache manager). Normally the core architecture of the HMI (human-machine interface) is not involved in this modification, so automated testing can be employed for the final tune-up and for repeated releases. A sketch of such a read-only lookup-table kernel follows.
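As a sketch of the pre-computed-table idea with the read-only hint mentioned above, here is a CUDA kernel where the table pointer is marked const and __restrict__ so that, on Kepler-class GPUs, the compiler may route those loads through the read-only data cache. The table, the index array and the kernel name are illustrative assumptions; allocation and host-to-device copies would be done as in the earlier sketch.

    #include <cuda_runtime.h>

    // Data-parallel lookup in a pre-computed table. Marking the pointers
    // "const ... __restrict__" promises the compiler that the data is read-only
    // for the whole kernel, which lets Kepler-class GPUs use the read-only cache.
    __global__ void apply_table(const float* __restrict__ table,
                                const int*   __restrict__ indices,
                                float*       result,
                                int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            result[i] = table[indices[i]];   // same instruction, different data
    }

    // Host-side launch; d_table, d_indices and d_result are device pointers.
    void run_apply_table(const float* d_table, const int* d_indices,
                         float* d_result, int n)
    {
        int block = 256;
        int grid  = (n + block - 1) / block;
        apply_table<<<grid, block>>>(d_table, d_indices, d_result, n);
        cudaDeviceSynchronize();   // wait for the GPU before using the results
    }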

Conclusion

The new generation of users is ready, and there are now good changes on the hardware side to accommodate parallel thinking. It is not just business analytics or the entertainment industry that benefit from this; core engineering solutions can also incorporate these changes. After getting the first level of enhancement, try for the time-consuming but perfect solution, to earn the appreciation of the user of the future.

Courtesy: Network Systems and Technologies (P) Ltd, Trivandrum, India. www.nestsoftware.com
Articles from NVIDIA.com and Intel.com
Author’s Paper ID:V130829