Multithreading & Gpgpu, A quick tour

I just read a great post by Mark Alexander on the current state of Multithreading and GpGPU work. Mark gave me permission to re-posted it here. Opinions stated within are Mark’s own, not his employer.

Threading

There has been a lot more emphasis on multithreading now that multiprocessor systems have become more accessible over the past few years. Faced with a “GHz wall” where chips simply can’t run faster due to heat buildup and pesky electrical things like current leakage and capacitance, chip manufacturers have made a 90′ turn to improve parallelism instead. Basically, it’s trying to do more things at once, rather than trying to do those things faster, and overall things get done quicker.

Early parallelism attempts can be seen in things like MMX, SSE and Hyperthreading, which gave very specific applications a bit of a boost (from about 10% to 800%). MMX, SSE and Altivec are vector-processing instruction sets which operate on several variables at once (usually 4 or 8). This is very useful in 3D processing and general array processing, but the vast majority of a program (even Houdini) doesn’t fit into this category. For one, the instruction sets are fairly limited in what they can do, and two, even 3D processing has a lot of non-3D overhead (loops, conditional branches, setup, etc). So even though they can accelerate things by 4-8x, they tend to only be active for a small portion of the time (say 5-10%), which only accelerates the program on the whole by about 1.2-1.8x. Useful, just not groundbreaking.

“Hyperthreading” is an Intel marketing term for something that’s been around in PowerPC and MIPS processors for some time. It allows two programs to be running on a single CPU at the same time, and they share the limited functional units. So, if a CPU has 3 floating point units, 2 integer units and 1 memory unit, hopefully between the two programs they can be kept busy most of the time (usually a single program can’t accomplish this). If all the units are working 100% of the time, their work gets finished a little bit faster than if the programs were executed one at a time. Intel’s hyperthreading on the P4 usually saw anywhere from a -10% to 40% improvement, averaging around 10% (so your hour-long render might instead take around 55 mins — woooo hoo).

Now, with multiple CPUs on a die, there is a lot more opportunity for parallelism. There are two ways to go about using more than one CPU – multi-processing and multi-threading. Multi-processing allows you to run more than one program at the same time – such as running Houdini and mantra concurrently – which generally increases productivity. If you have the memory, it can improve performance as well, like running 2 renders on a frame range – one on odd frames and the other on evens. Multi-threading is the term given to a program that runs several threads at a time (a “thread” being a separately running task inside a program). Multi-threading doesn’t re-allocate all the resources of the program when it splits, so it’s a lot leaner from a memory point-of-view than running the same process twice.

Single CPU systems appear to run many applications at a time, but in fact very rapidly switch between them (time-slicing), so they aren’t actually running at the same time. Multi-CPU systems still use time-slicing, since there are always more programs than CPUs, but it is now possible to run 2 or more programs at exactly the same time. So, a multi-CPU system can provide more performance equal to the number of processors available (2x,4x,8x currently). The trick is keeping the processors busy since idle processors, even for mere fractions of a second, eat into your nice 8x speedup that you paid big bucks for.

So, when it comes down to actually completing a task, it’s tempting to think that with 4 processors it can be completed 4x faster. This is where most programmers wish the CPU manufacturers had been able to just keep running up the clock speeds… because this just generally isn’t the case. There are a number of factors standing in the way.

First, you need to figure out how to split the work up. As any TD can attest, this isn’t trivial. Ideally, you give the first half to one, and the second to another, and some time that actually works (which is great; you can go home by 5). Applying gamma to an image can be done this way pretty easily by farming out scanlines, and normalizing normals on geometry can also be split up nicely. Problem is that the simple stuff generally doesn’t really take that long anyways (you always want to optimize the code that’s taking the longest to run).

Second, you need to make sure that the work you’re doing actually takes longer than the time to start up and stop a thread (or threads). Unfortunately, they aren’t free and take a pretty significant chunk of time to startup (since the OS often has to switch from user to system mode to create and destroy a thread – if that made no sense to you, don’t worry – just think ‘slow’). An image has to be pretty large before it’ll benefit from threading when just applying gamma, and in fact, most small images will see a performance slowdown instead.

Third, data dependencies can really be a killer. For example, a simulation can’t be threaded by computing frames 1 and 2 simultaneously; frame 2 depends on the results from frame 1. Similarly, a subdivide, fluid or cloth algorithm depends on a lot of neighbouring points and this can make divying up the work difficult. Algorithms that work in multiple passes also have similar problems, as do ones that modify global data as they work.

Fourth, even though you have multiple CPUs, you still only have one memory controller and one I/O system (or disk target). So tasks doing heavy memory accesses or disk I/O in parallel will generally slow things down as they’re bottlenecked by the single resource. Multiple graphics cards are also bottlenecked by the single driver that runs them (some drivers are threaded, but have exactly the same issues listed here).

Fifth, threads can’t just do whatever they want. Imagine two workers at a desk, grabbing papers and writing on them whenever they need to. No useful work is going to get done if Bob keeps grabbing a paper he needs while Joe’s writing or reading it. If someone kept grabbing my stuff while I was working on it, I’d likely end up hitting them, and that’s probably the real-world equivalent of a multithreading crash. So if Joe’s working with a certain paper, Bob needs to wait for him to finish. This does two things – One, it eats into that nice 2x-8x speedup because Bob’s sitting there doing nothing (Bob isn’t smart enough to do something else – no offense, Bob). Two, the manager (programmer) has to first determine all the places where the two workers can’t work at the same time, which gets more difficult as the threads get longer. As more areas of the program are restricted to one worker, this situation occurs more frequently. This gets worse when Sally and Jorge show up to help. In reality, most tasks that are highly dual-threaded see about a 1.6-1.8x speedup, and quad-threaded tasks see a 3.2-3.6x speedup. Sound at all familiar?

Finally, threading increases the complexity of the underlying code a lot (like “maybe I can help figure out a way to solve the GHz wall problem”-lot). Researchers have suggested that it’s about 10x harder to write multithreaded code than singlethreaded. This translates into longer development time, more bugs and more maintenance in the long term. Adding additional features to the task can be difficult, as can passing the code onto another developer. Put another way, this means that fewer features are added each cycle with the same development crew. So you want to make sure that the performance increase is worth the trouble. With dual-CPU systems this was a bit difficult to justify some times. At least now with quad core and 8-core systems available, that part’s become easier smile.gif

After running through this list of threading issues with a bunch of potential candidate tasks, you’ll find some of them can’t be done (1x speedup) and others can’t reach their full potential (which is generally 80-90% of expected speedup). Sometimes it’s better to spend the time figuring out a better way to implement the algorithm – SIGgraph papers can show approximations and structures that accelerate things by orders of magnitude, not just by a paltry 2-4x. This is not to say that threading isn’t useful – it is tremendously – just that there are a lot of cases for which it isn’t well suited.

There’s other ways to use threading as well, besides for doubling up performance. One is to hand off slow tasks from quick ones, like spawning a thread to write out a file, or do network I/O (which works nicely even on single-processor systems). Another is to separate the user interface from the underlying processing, giving the program a feeling of responsiveness that the user perceives as “fast” which never hurts.

As multi-CPU systems continue to climb rapidly into 8-core systems and above, hopefully more software development tools will be made available so that developers, and by extension users, can take advantage of them. For now, it feels like software developers are playing catch-up. After years of software developers banging on their doors for more processing power, the hardware guys must be grinning.

GPGPU

GPGPU, or “General Purpose computing on a Graphic Processing Unit” is the art of doing non-graphics related tasks on a graphics device. Graphics Processors (or GPUs) have changed over the years from rigid graphics pipelines to programmable graphics shading. Along the way, the precision of these devices has increased from 256 colors, to 24bit color (16million colors, or 1 fixed byte per RGB channel), to full 32bit floating point per channel color (don’t know the exactly #colors, but way more than your eye could hope to see, even with pharmaceutical help). Together, this has made them viable platforms for doing more than just computing pixels. Because of the nature of shading pixels is highly repetitive, it can be done in parallel. Hardware developers exploited this by adding multiple shading units to their GPUs. Slowly starting around 4-8 units, this has exploded into hundreds, making GPUs very capable number crunchers. For processing huge amounts of data, this would be the perfect device. Well, in theory…

Several years ago, GPU manufacturers realized they had a good product for doing more than just graphics. The marketing teams did what they do best and GPGPU was born; the only problem was that it was a tad premature. There were a lot of limitations placed on the shaders by the hardware, which in turn limited what could be done outside of the graphics realm (but Doom3 sure looked cool – when you could see it). Some of these early issues were:

  • Very small shaders, at 256 and then 4096 instructions.
  • Reduced precision at 16bit or 24bit floating point, which isn’t enough precision to do most general purpose math.
  • No integer support above 8 or 16bit, which is generally required for task control.
  • Extremely limited branching and looping. If you could actually do it, it was very slow (like “don’t even bother” slow).
  • Limited number of inputs. Only being able to access 2 or 4 arrays (textures) and a small set of constants at a time from a shader required multiple passes with different shaders, increasing the complexity and slowing the process down.
  • Texture sizes were limited to powers of 2, and limited to a maximum size of 2048×2048 or 4096×4096.
  • Limited onboard GPU memory. 64 or 128MB equipped boards were already running your OS’s framebuffer and all your graphics.
  • Bus bandwidth. The AGP bus took some time to get the data to the GPU and back, which added significant overhead.
  • Proprietary extensions. There were no real standards for certain workarounds to the above issues, so it was almost like multi-platform development.
  • Crappy ass driver support. Some of the features of OpenGL or DirectX just plain didn’t work (Oops… personal bias creeping in)

As graphics hardware progressed from the nv6800/ati9800 to the nv8800/ati1900, the situation became much more favourable for GPGPU. PCI-Express improved the bandwidth, DX9 enforced 32bit processing, GPU memory sizes increased dramatically, shader programs’ sizes could be much larger and the power-of-2 limitation was relaxed. Additionally, the vertex shader unit and the fragment (pixel) shader unit merged into a single processing unit, inheriting the strengths of both. Some of the restrictions still exist, but aren’t quite as severe, making programming on the GPU a more reasonable proposition.

Current hardware (NV 8800, ATI X3800 at the time of writing) provides a massively parallel environment to play in. While CPUs are at sitting around 4 functional cores, GPUs have hundreds. They can run the same program on hundreds of pieces of data simultaneously, in contrast to a CPU which with SSE support may run on dozens. The clock speed of the GPU is a lot slower – 600-900MHz – but the added parallelism is great for large datasets.

The biggest problem is the “distance” of the hardware from the CPU. Besides having to travel along a bus to the GPU and back, all processing must go through a software driver. It’s almost like outsourcing – it gets the job done, but you have to package everything up and do a lot of communication in addition to the work that gets done. So, while the process might run 500x faster on the GPU, the transmission back & forth can take a huge amount of time (graphics processing generally only goes one way; GPGPU by necessity has to be two way). You don’t want to send away the work unless it’s a lot of work (it has to be a lot of data, and it has to do a fair amount per data element). Because of this, most tasks are immediately disqualified.

So the trick is finding a task that runs frequently enough to make a difference when optimized, and runs well on the GPU vs. the CPU. In this case, speedups of 10-20x aren’t uncommon. In addition, you can even cleverly have the CPU do something else while waiting for the result, turning the GPU into an additional parallel processor (also known as Asymmetrical Multi-processing).

Being in the 3D business, we’re graced with a lot of algorithms that will benefit from the GPU’s parallel processing and its vector nature. Still, the biggest weakness of the GPU is that it’s still not general enough for general purpose programming. Most algorithms require more advanced data structures than 1D or 2D arrays, which can be difficult to represent and modify within the GPU. Shaders generally operate on a 1-to-1 basis, making algorithms that create or destroy data inefficient. Looping and branching is still far behind the CPU in terms of performance, and is very common in advanced algorithms.

GPGPU projects like nVidia’s CUDA have begun to appear, which should definitely help with many of these issues. Designed as a general purpose dataset computation toolset, it removes all the graphics API overhead and allows programming in more generic fashion, with more of the conveniences developers are used to. Also on the horizon are GPUs with 64bit floating point support, needed for serious mathematical computation. AMD/ATI’s Fusion project is planning on marrying a GPU with a CPU on a single chip, hopefully eliminating a lot of the data transmission issues. And GPUs will continue to fight over the highest FPSs in whatever games happen to be hot, pushing the performance even higher.

The next few years should be very interesting for GPGPU, after a bit of a slow start. I, for one, just hope the drivers work.

Parallel Hardware

Try as I might I haven’t been able to see into the future, but there are several trends in hardware that can be extrapolated a bit for some fancy guesswork.

  • First off, with AMD aquiring ATI, and Intel working hard on their GMA (Graphics Media Accelerators), I think we’ll definitely begin to see a blurring of the lines between what constitutes a CPU and a GPU.
  • AMD has announced their Fusion project, and while the first step in that is to put a CPU & GPU on the same die for power reduction in the mobile space, I don’t see a streaming FP unit too far behind. Hopefully this would be a very low latency (compared to a PCI-E graphics card), high bandwidth processor with direct memory access.
  • AMD has also announced their Torrenza project, which amounts to providing a co-processor socket connected to the CPU via a hypertransport. Initial guesses point to a vector floating point stream processor as a good contender (given ATI’s closeness to that area), but they could also be dedicated things like cryptography processors, integer accelerators and other things. It will be interesting to see how that fits with Fusion.
  • Intel has tested an 80-processor chip, mostly to come up with a good interconnect system between the cores. The cores themselves weren’t really the focus of the project, which seems to indicate they have processor clusters in mind.
  • IBM, Sony & Toshiba developed the Cell processor, which is a stripped down PowerPC with 8 streaming vector processors (known as SPEs) connected via a high speed bus. This gives a bit of an early preview to what assymetrical CPU designs might operate like. The PowerPC handles control processes, and the SPEs do the hard work.
  • Microchip process manufacturing is coming close to hitting a wall when it comes to shrinking transistors, which has so far managed to allow several processing units on a single chip. Currently at 45nm and planned to go on to 32nm, there is a bit of an issue coming up because a silicon atom is about 1/4nm in size. So, we won’t be forever making transistors smaller and cramming more of them on to a single chip. You can make the chip bigger, but as that happens there’s much more of a chance of a defect, which increases costs. This will either spur new developments in processor design (much like the GHz wall did) or new manufacturing processes (or both).

So, with all the above coming into play, I wouldn’t expect to see a 16 core processor as we know multi-core processors today. It seems rather wasteful to have 16 processors each with their own SSE unit, for example – it would be difficult to keep all of them busy in the general usage case. Unused units still use power and die space and generate heat, which isn’t desirable. I could see several processors sharing the use of specialized units like SSE and cache memory, organized into larger blocks (like 4 blocks of communal quad-processors making up a 16 processor chip).

I do wonder if the SSE vector engine now present in CPUs will take more of a backburner role to future floating point accelerators or stream processors, much like the original FP unit did when SSE came onto the scene. It seems like a stream processor would pretty much supersede the instruction set, though I suspect SSE would still be supported.

Also, with all the talk of accelerators (FP stream processor or otherwise), I could see a traditional dual/tri/quad-core processor with an accelerator or graphics processor on die emerging. Finally, the Cell and Intel’s 80-core CPU test might indicate a dual or quad core CPU with a dozen smaller, specialized processing units on board. And I wouldn’t be too shocked to see processors fabbed with different configurations – an 8 core SMP processor; a 4 core processor + GPU; dual core + several FP stream processors – targeted to to the industries that could benefit from the specialized configuration.

The most interesting aspect of processor design to me is the fact that there are currently a lot of potential solutions to processing, and it’s really unknown which ones will be successful. However, I think it’s a safe bet to say that whatever comes, multithreaded programming will be very important, as will stream processing techniques. So you can’t go too wrong investing in either. I think that their success will be measured by how easily software developers can take advantage of their capabilities rather than their performance statistics, because at the end of the day it’s not about the hardware, it’s what you can do with it.

Software Development Tools for Threading

This part is a little more involved, and I’m afraid that’s just the way it is. I won’t hold it against anyone if they don’t make it through this section smile.gif

Before getting into the tools available (and things I wish were available), it’s probably a good idea to look at the different ways of threading software. There are generally three types: coarse grained threading, explicit fine grained threading, and implicit fine grained threading.

  • Coarse grained threading separates relatively independent tasks into different threads. For example, MPlay’s user interface might run in a thread, the audio handled in another thread, interactive renders received in another thread, and playback processing done in a final thread (some of that is actually the case now). All the threads have a long duration (generally as long as the program’s running) and their primary goal isn’t so much to improve performance by 2-4x, but to allow simultaneous and prompt processing of various events (mouse click response, render bucket received, audio packet sent to sound card, image updated). Most of the time, the threads sit idle waiting for their target event to occur. However, if there were a sudden influx in rendering packets, there’s no other thread to help the render processing thread out with the load. This is where fine-grained multithreading comes in.
  • Fine grained threading involves splitting up a single task over multiple processors, and it’s geared towards improving performance. There are two types – implicit and explicit. Explicit fine grain threading is when the software developer hand-codes the task for threading. This involves making sure shared resources are properly shared, dividing the work between the threads and keeping the threads busy. Because all the threads are working on the same task, unlike coarse grained threading, there can be a lot of contention to resolve. This makes generally makes it harder to develop than the other two types (longer, more bugs, etc).
  • Implicit fine grained multithreading uses higher level tools to do much of the grunt work involved in fine grained threading. Tools such as OpenMP and Rapidmind ( http://www.openmp.org http://www.rapidmind.com ) attempt to break up the workload automatically and transparently, so that much of the thread-specific code is simply not present in the source code. This makes it easier to develop, but it’s also a bit more limited in what can be accomplished.

For example, looking at a scene in Houdini, one way to divide the work is to determine all the objects which require cooking, spawn 4 threads, and start handing an object to each thread to cook. Another way is to thread the cook of each individual SOP. While implicit fine-grained threading can handle the latter case, the scope of a SOP network cooking in a thread alongside other threads is simply beyond what the current tools can provide. Cooking a full SOP network is not a nice, discrete task that can be automatically parallelized by software — not yet, anyways. (There’s always the possibility of using both techniques as well; it’s not really either/or).

So threading a SOP node seems like a decent proposition, a good balance between development effort and performance. The main issue is that it’s hard to keep all the processors busy most of the time. There’s some initial thread startup overhead, probably some contention for shared resources, and the SOP must wait for all threads to finish (they never finish all at the same time, even given equal work, due to the fact that other things are going on in the OS, so there’s some idle time at the end). Then we move on to the next SOP and repeat the process. In a large SOP network, those waiting periods add up to soften the potential speedup of all those processors.

On the other hand, cooking separate nodes in parallel doesn’t have as much waiting involved (until data dependencies between nodes become an issue). The threads can individually grab the next node in line to be processed, regardless of what the other threads are doing. This is similar to how the compositor and mantra currently run. The major downside is that it requires a lot of code to be 100% threadsafe, identification of all shared resources, and strict policies on how and when data is written. It’s for this sort of threading that I’d like more tools developed.

Currently, the main industry emphasis is on implicit fine grained threading, because the types of tasks it performs can be automatically parallelized. As mentioned, RapidMind and OpenMP are two of the many players concentrating on making this type of threading more accessible. Intel is also looking into automatic threading support in their compilers.

There are some existing tools that exist for helping developers with issues in explicit fine grained and coarse grained multithreading. Valgrind has a helgrind tool ( http://valgrind.org/info/tools.html#helgrind ) which attempts to find race conditions between threads (more on that in a minute). Intel also has a tool which tries to find code that should be made exclusive to one thread to avoid data stomping and crashes. (There are others as well, but I’m not writing a book here. Well, at least not intentionally)

I’d really like to see threading support become much more native to programming languages. Things like native vector support, thread support, exclusive critical section support, atomically-updated variables, etc. Languages like C++ seem a little behind the times when in comes to parallel processing. Using vector intrinsics and add-on libraries does help, it just seems a little “tacked on” at times, I suppose, and it isn’t always the same across platforms. If it were natively part of C++, then hopefully those cross platform/hardware issues go away.

There’s always the argument that “you can build it in C++”, and yep, I have, thanks. The reason I like native support is because it ain’t my problem anymore. The less code that I have to maintain, the better I sleep at night. And after working in GLSL and Cg with excellent vector processing, it’s kind of disappointing to go back to C++, even with vector classes.

Debugging threaded code is another issue which could use improvements to existing software development tools. Take the following code, which shares “block” between several threads: if( block == NULL)<br/> block = new Block();<br/> block->doSomething();<br/>

This breaks down into several discrete assembly instructions (pseudo code here): 1. load block from mem<br/> 2. compare block, NULL<br/> 3. if not equal, jump to End <br/> 4. create new Block<br/> 5. store block back to mem<br/> 6. End<br/> 7. doSomething()<br/>

Now, if two threads happen to come across this code at about the same time, they both load ‘block’ from memory into their processors (a copy of block, that is). For both of them, it’s NULL, so they both create a new block and write it back to the same memory location, and then proceed to do something useful with it using their own local copies. Whichever store happens last “wins”, and this situation is (rather ironically) named a “race”. You’re left with a memory leak, and very likely the wrong result because one of the thread’s results was discarded. As long as thread B executes instruction #1 before thread A gets to instruction #5, there will be a bug – a very small window of opportunity, but it exists nonetheless. There’s also a possibility that the OS will decide to switch to another thread while A is between 1-5, suspending its operation, which opens the window even wider.

“Well, it seems like the likelihood of that actually happening is so remote in a large program, so it’s probably not a serious issue,” you might say. Unfortunately, it is because of that exact fact that it is a serious problem. The bug is very difficult to trigger, and thus it stands an excellent chance of getting by QA. One day you’re happily working along and boom you either get a crash, a hang or some weird result. One that you can’t reproduce again to save your life, and so it persists in the software for a very long time because no one can track down the exact cause. The software developer has to get very lucky to find it (or sit there for days pouring through code, which happens as well).

This is where software tools to determine these hotspots come in very handy, and could still use some development. I’m not looking for miracles to fix all my problems for me, just some good solid help. And it’s coming, thankfully.

So, I guess if I were to sum it up, I’d want good, solid thread debugging tools and good threading & vector support in my programming language of choice. Coming in second would be good implicit fine-grained threaded tools, because they’re still pretty useful when you find a task they can do well – they save you a lot of work.