Threading On Multi-Core CPUs

A multi-core processor combines two or more independent cores on a single chip. This multi-core architecture is adopted by most modern processors enabling execution of multiple threads simultaneously.

This article concentrates on how to optimize multithreading on multi-core CPUs.

Understanding Thread Behavior

Multithreading does not actually mean that a CPU core processes multiple threads at a time. No matter how many cores the CPU has, each core can process only one thread at a time (Note that in Hyper threading enabled CPUs the operating system treats the single CPU as 2 cores).

Consider the example of a shopping mall billing counter. Let us assume that there are 2 billing counters which represent CPU cores and each billing counter can bill for only 10 items at a time. If you wanted to buy 100 items you would have to go in and out of the billing queue 10 times.

To get the billing done faster you a take a friend along, who stands in the other queue. This time each one of you will have to stand in the queue 5 times each, and your work will be done in half the time. But however this does not mean that if there were 10 of you your work will be done 10 times faster. Since there are only 2 billing counters only 20 items will be processed at a time. If there were 10 billing counters you will definitely get your work done 10 times faster.

In the above example you and your friends represented one thread each, which will get work done only if it gets CPU time. So it is quite clear that the number of threads should ideally be the same as the number of CPUs or CPU cores right?

Well not just yet. We forgot one minor detail. You are not the only customer in the mall there are other customers too. Similarly we do have other processes and their threads running on the computer that require processing time. So if there were 10 of you the probability of you or one of your friends getting in the line is higher.

Thread execution time is managed by the operating system and is a highly unpredictable activity. CPU execution time is known as a time slice or a quantum. If we create more and more threads the operating system spends a fraction of this time slice in deciding which thread goes first, thus reducing the actual execution time each thread gets. In other words each thread will do lesser work if there were a large number of threads queued up.

Calculating the Optimal Thread Concurrency

To confirm the theory above and to arrive at a conclusion as to how many threads should be declared for a doing the same task, a small console application which records the CPU time for each thread is shown below. We will try to determine if we get higher performance when we declare threads equal to the processor count or greater than the processor count.

We record the time for multiple threading scenarios. Processor count is used to determine how many concurrent threads should be tested. If processor count is ‘N’ we determine result for threads in range from N-1 to N+2 concurrent threads.

The processing done by each thread is quite simple. Each thread has to run for 1 second. This means that each thread should get a total execution time of one second. The thread might not get continuous execution cycles causing the thread to wait for CPU time before it can start executing again. So one second of execution time might actually take more than one second if the thread does not get dedicated CPU time.

Once the thread has finished execution it signals the WaitHandle. After all the threads have finished execution the total time per thread is calculated. Time per thread is the total time required from start of first thread to end of last thread divided by number of threads. This gives the average time required for each thread to process.

Show Code

Show Code

The Result

Data was recorded for the following CPUs

  1. AMD X2 64 3800+ (Dual Core)
  2. Intel Dual Core 3.0Ghz (Dual Core)
  3. Intel Xeon E5450 3.0 GHz (Dual Core) *
  4. Intel Xeon E5310 1.6 GHz (Quad Core)

* I could not find a dual core version of Xeon E5450 on Intel’s web site however the server I tested it on was actually running a dual core version of this processor and not a quad core one.

The testing results are shown in the graphs below. Figures to the right of the bars indicate the time/thread in milliseconds. Lesser time/thread indicates faster execution.

Dualcore Multi-threading Report

Dualcore Multi-threading Report

Quadcore Multi-threading Report

Quadcore Multi-threading Report

Conclusions

From the result it is quite clear that dual core CPUs give the best performance for 2 concurrent threads and the single quad core that was tested gives best performance at 4 concurrent threads. More detailed analysis can be done by recording data for longer time periods to give a more accurate reading.

When a dual core CPU runs just one thread it takes almost twice as much time as it would if 2 concurrent threads were running, which is quite expected. As the number of concurrent threads increases to 3 there is a slight increase in time required. This is probably due to the operating system switching between the threads.

If this data was to be recorded at high CPU load then the results might have been different.

Also setting the thread priority as high or highest the thread can be given more dedicated CPU time for execution.

Download Sample
MultithreadingSample.zip

    • Gastón C. Hillar
    • March 2nd, 2009 8:26pm

    Hi Rohan,

    It’s a very nice benchmark! Very interesting.
    However, I do believe it will be more useful with some real-life processing added to each thread.
    Besides, you can benchmark native threads vs. Parallel Extensions.
    You can check the source code available from Packt’s website for my book “C# 2008 and 2005 threaded programming”: http://www.packtpub.com/beginners-guide-for-C-sharp-2008-and-2005-threaded-programming/book
    You seem to have access to many different hardware. That’s a great experience to let your blog followers understand the differences in the awesome multicore island.
    You can also check the ConcurrentQueue (new in ParallelExtensions), it is very interesting to work with it in order to solve the producer-consumer problem.

    Cheers,

    Gastón Hillar

  1. Thank you for your suggestion Gaston, Parallel Extensions is definitely something I would like to look forward to.

  2. Great site this csharp-codesamples.com and I am really pleased to see you have what I am actually looking for here and this this post is exactly what I am interested in. I shall be pleased to become a regular visitor :)

    • atin
    • April 9th, 2009 10:40am

    Rohan to think about it..

    Multi core means capability to handle multiple processes at a given point of time.. rather than multiple threads.. and as a process can have several threads .. multi processes automatically becomes multi threaded..

    So even a single core can be multi threaded, but more cores means more process and hence much more no of threads running at the same time..

    I am not 100% sure but still pretty confident that in a single core processor also, u can have multiple threads of 1 process, processing at the same time.

    Hmm.. m talkin more from a java standpoint.. and m still learning so whatever i am writing might be totally wrong.. please correct me if thats the case

  3. Thank you for your reply @atin

    A process in simple terms is an application. Each application is itself runs threads to execute tasks. If an application is written in such way that it can perform two or more tasks simultaneously it can be termed as multithreaded.

    Yes most processors singlecore or multicore run applications simultaneously as we see in our day to day life. But process is just a container. A thread is the most basic unit a processor executes. And each processor no matter how advances will only execute one thread in a timeslice (most processors have a timeslice of 1.56 usecs)

    So when you say that you can have multiple threads of one process, processing at the same time. It is merely an illusion most programmers have. In reality the operating system jugles these threads and assignes random CPU timeslices to each of these processes, which makes them appear as though they are running concurrently.

    This is independent of the programming language you use. It is just the way windows works with native threads. Im sure linux and other OS too work the same way.

  4. Great post. It shows that we should be coscious about thread count depending on machine to machine. But hopefully I’m waiting for .Net 4.0 and Task Class which seemingly manage all the thing we should consider in Threads.

  5. This is a nice proof that multicore processors can execute code at the same time in each core (I’m not sure exactly how that handle slicing external bus activity as there is still only one memory for all cores, I’m assuming in this example the code fits into cache so this may not matter.) However as pointed out the example shows execution of totally independent threads, and in real life this is generally not the case, all threads tend to act in unison for a single cause. That said it does not mean there are not instances where this independence does not exists and for those applications this is a useful study. But I would also suggest that if you want to know how your application performs you should be able to easily vary your “worker” thread count to find the best number for your application. For example if your applications fetches web pages for data you may want a how bunch of threads as most of the time each thread will be waiting for the web to respond, but if you’re crunching numbers you may want to split that into a thread count equal to the number of cores.

  1. April 5th, 2009

Spam protection by WP Captcha-Free