Monday, March 10, 2008

ICC vs GCC-4.3

Since GCC-4.3.0 is about to be released, I decided to take a look at its new Intel Core 2 tuning and SSSE3 code generation by emerging the package from Dirtyepic's overlay. I compared how long it takes to re-encode a video with ffmpeg and to encode a WAV sample with oggenc. The video clip I used can be found here (1920x816 MOV, 1:46, 128.3MB); the WAV file is simply the audio track extracted from it.

I used these four compiler collections and their CFLAGS:
  1. GCC 4.1.2 (-march=nocona -O3 -pipe -msse3)
  2. GCC 4.2.3 (-march=nocona -O3 -pipe -msse3)
  3. GCC 4.3.0-pre20080302 (-march=core2 -O3 -pipe -mssse3)
  4. ICC 10.1 20080112 (-O3 -xT -ipo -gcc)
My system's specs:
  • Q6600(B3) @ 3.21GHz
  • 400MHz FSB (266MHz northbridge strap)
  • 2GB PC3-15000 1603MHz (8-8-8-24)
  • kernel 2.6.24-gentoo-r3 (kernel lock preemption and preemptible kernel model, 1000Hz timer freq, see config)
I recompiled the following packages with emerge after changing my environment to the appropriate compiler using gcc-config:
  • x11-libs/libXau-1.0.3 USE="-debug"
  • x11-libs/libXdmcp-1.0.2 USE="-debug"
  • x11-libs/libXext-1.0.4 USE="-debug"
  • x11-libs/libX11-1.1.3-r1 USE="ipv6 -debug -xcb"
  • media-libs/libogg-1.1.3
  • media-libs/faac-1.26-r1
  • media-sound/lame-3.97-r1 USE="-debug -mp3rtp"
  • media-libs/xvid-1.1.3-r3 USE="(-altivec) -examples"
  • media-libs/x264-svn-20080301 USE="threads -debug"
  • media-libs/a52dec-0.7.4-r5 USE="-djbfft -oss"
  • media-libs/amrnb-
  • media-libs/faad2-2.6.1 USE="-drm"
  • media-libs/libpng-1.2.25
  • dev-libs/libxml2-2.6.31 USE="ipv6 python readline -bootstrap -build -debug -doc -examples -test"
  • media-libs/libvorbis-1.2.0 USE="-doc"
  • media-libs/speex-1.2_beta3 USE="ogg sse"
  • media-libs/flac-1.2.1-r2 USE="cxx ogg sse -3dnow (-altivec) -debug -doc"
  • media-libs/libtheora-1.0_beta2-r1 USE="encode -doc -examples"
  • media-libs/freetype-2.3.5-r2 USE="X -bindist -debug -doc -utils"
  • media-libs/giflib-4.1.6 USE="X -rle"
  • media-sound/vorbis-tools-1.2.0 USE="flac nls ogg123 speex"
  • media-video/ffmpeg-0.4.9_p20070616-r2 USE="X a52 aac amr doc encode ieee1394 imlib ipv6 mmx ogg sdl theora threads truetype v4l vorbis x264 xvid zlib (-altivec) -debug -network -oss -test"
All remaining system libraries which ffmpeg and oggenc might link to were compiled with gcc 4.3.0 (e.g. glibc).

Note on ICC: Multi-file interprocedural optimization didn't work for lame, flac, a52dec and faad2, so for those packages I fell back to single-file interprocedural optimization and used '-O3 -xT -ip -gcc'. ICC also failed to compile ffmpeg, so ffmpeg was built with GCC instead. Because of that I had to recompile libX11, libXau, libXdmcp and libXext with gcc-4.3.0, or else ffmpeg would complain about symbol lookup errors.

I used the following command for re-encoding the video clip:

ffmpeg -y -i \
-f avi -vcodec mpeg4 -b 800k -g 300 \
-bf 2 -acodec libfaac output.avi

I repeated it 5 times and got these results:

GCC-4.1.2: 437.24 sec
GCC-4.2.3: 436.98 sec
GCC-4.3.0: 436.17 sec
ICC 10.1: 429.72 sec
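
The post does not say how the repeated runs were timed, so the following is only a minimal sketch of one way to do it, not the method used here. The helper name `time_repeated` and the placeholder command are assumptions for illustration; the real ffmpeg invocation would be substituted for the placeholder.

```python
import subprocess
import sys
import time


def time_repeated(cmd, runs):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True aborts the benchmark if a run fails, so a crashed
        # encoder is not mistaken for a very fast one.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Hypothetical placeholder command: replace with the encoder command line.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    times = time_repeated(placeholder_cmd, runs=5)
    print(f"total: {sum(times):.2f} sec over {len(times)} runs")
```

Keeping the per-run durations rather than only the total makes it possible to check afterwards how much the individual runs scatter.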

For ogg encoding I first extracted the audio track of the clip.

ffmpeg -y -i \

and then encoded it with oggenc:

rm -f output.ogg; oggenc output.wav

This command was repeated 30 times and resulted in the following times:

GCC-4.1.2: 217.00 sec
GCC-4.2.3: 216.97 sec
GCC-4.3.0: 206.90 sec
ICC 10.1: 191.91 sec
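
To put both result tables on a common scale, the times can be converted into a percentage relative to the oldest compiler. This is a small sketch using only the numbers reported above; the function name `percent_faster` is made up for illustration, and because the post does not state how the times were aggregated over the repeated runs, they are used exactly as given.

```python
# Times in seconds, copied from the two result tables in the post.
FFMPEG = {"GCC-4.1.2": 437.24, "GCC-4.2.3": 436.98,
          "GCC-4.3.0": 436.17, "ICC 10.1": 429.72}
OGGENC = {"GCC-4.1.2": 217.00, "GCC-4.2.3": 216.97,
          "GCC-4.3.0": 206.90, "ICC 10.1": 191.91}


def percent_faster(times, baseline):
    """Percent less time each entry needs compared with the baseline entry."""
    base = times[baseline]
    return {name: 100.0 * (base - t) / base for name, t in times.items()}


for label, times in (("ffmpeg", FFMPEG), ("oggenc", OGGENC)):
    print(label)
    for name, pct in percent_faster(times, "GCC-4.1.2").items():
        print(f"  {name}: {pct:5.2f}% less time than GCC-4.1.2")
```

On these numbers ICC needs about 1.7% less time than GCC 4.1.2 for ffmpeg and about 11.6% less for oggenc, while GCC 4.3.0 needs about 0.2% and 4.7% less respectively.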

When drawing the graphs I decided to truncate the bars and show only the relevant upper parts. The graphs therefore don't represent absolute values; they show the differences in execution time between the code produced by each compiler collection:

ffmpeg chart
oggenc chart
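
How strongly truncated bars exaggerate a difference depends on where the axis starts. The sketch below compares the ratio of drawn bar lengths with and without truncation for the oggenc times. The axis start of 170 seconds is a purely hypothetical value chosen for illustration, since the actual cutoff used in the charts is not stated, and `bar_ratio` is an illustrative helper name.

```python
def bar_ratio(a, b, axis_start=0.0):
    """Ratio of drawn bar lengths for values a and b on an axis starting at axis_start."""
    if axis_start >= min(a, b):
        raise ValueError("axis_start must be below both values")
    return (a - axis_start) / (b - axis_start)


GCC_412, ICC = 217.00, 191.91      # oggenc times from the post, in seconds
HYPOTHETICAL_AXIS_START = 170.0    # assumption: the real cutoff is unknown

print(f"full bars:      {bar_ratio(GCC_412, ICC):.2f}x")
print(f"truncated bars: {bar_ratio(GCC_412, ICC, HYPOTHETICAL_AXIS_START):.2f}x")
```

With this assumed cutoff the GCC 4.1.2 bar is drawn roughly 2.1 times as long as the ICC bar, although the underlying times differ by a factor of only about 1.13.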
It turns out that the GCC 4.3 branch yields a noticeable performance boost for oggenc (about 4.6% less time than GCC 4.2.3), probably thanks to its new Core 2 tuning option, while the ffmpeg times barely change. ICC's optimizations are still unmatched and show that GCC has room for improvement. Note, however, that ICC's lead in video encoding was most probably caused only by its shared libraries (e.g. flac), because ffmpeg itself was compiled with GCC (see above).

In conclusion, GCC, and especially the upcoming release, produces code that is more than fast enough for a normal desktop system. Even with libraries that benefit greatly from ICC's vectorization techniques, the advantage of ICC over GCC is negligible and wouldn't justify the time spent on recompilation and porting.


Anonymous said...

Your graphs aren't optimal, because if you look only at the graphs, you think icc is twice as fast as gcc.

Falko said...

That's right. I chose to truncate the graphs to show the relative difference, for example that GCC 4.3 sits roughly between ICC and GCC 4.2, which would be impossible to see if I had chosen absolute bar lengths.
Anyway, the y axis is properly labeled, so at second glance at least it should become clear.

Anonymous said...

Great post. Was about to do something similar. Now I don't have to ;-) Thanks!

Anonymous said...

It would be very interesting to know how things behave with profiling (i.e. -fprofile-generate and -fprofile-use). ICC enables some more aggressive loop optimizations (such as loop unrolling, which you have to request from GCC explicitly with -funroll-loops). Enabling the vectorizer and using -ffast-math might also help and bring GCC closer to what ICC really does.


Falko said...

Very insightful, thanks a lot. I'll look into it for sure and might use some of these flags on a per-package basis or even as global flags (such as -ffast-math).

eile said...

I've observed similar speed differences, though I haven't tried gcc 4.3 yet. This is on my own code, more details can be found here.

Anonymous said...

I've tried to compare icc 10.1 against gcc 4.2.3 on a Pentium 4 "C" microprocessor with my own code, using the following options:
-O2 -march=pentium4 on GCC
-fast -march=pentium4 on ICC
The results scared me because the binary generated by icc was twice as fast as the gcc one.
After that I tried to see what was going on, so I compiled with the -S option and figured out that gcc doesn't generate SSE instructions. Finally I got similar speeds using
-O2 -march=pentium4 -mfpmath=sse
forcing gcc to use the SSE instruction set.

Anonymous said...

How about icc's option "-parallel"? It would be interesting to see if autoparallelization improves anything.

Pengvado said...

march=nocona is no good for anything other than Netburst.
To optimize for core2 before march=core2 was introduced, you should have used march=k8, and then you would see much less difference between gcc 4.2 and 4.3.

Will said...

I've built ffmpeg before and it's not exactly clean code. From what I remember, you could only build it with older versions of binutils.

As for the 'missing symbols', link it... that's right, add it to your CFLAGS or LDFLAGS (-Wl,-?) and that's it. I used to do it all the time with Compaq C (-Wl,-lots).

I betcha, if you turn up the CFLAGS, you would see more of a difference! Something like -O3 -fstrict-aliasing -funroll-loops -mieee -mtune=cpu etc... I can tell ya, the code is damn near the same size and runs about 30-50% faster for GCC and ICC!

Mothersh1p said...

Perhaps you want some more math related killbits -mno-ieee-fp -IPF_fp_relaxed -rcd -ftz -fp-model fast=1 -fp-port -auto_ilp32 -prec-sqrt -pc64 -no-prec-div

MCKAY Brothers said...

Those are the facts. The curse of current GCC, and of all recent projects, is that developers don't care about code performance. All recent apps and projects are very resource-hungry because, as you can see, everybody has a powerful, resource-rich machine like yours, with dual cores, gigabytes of RAM, etc. On such machines performance differences are much less perceptible!

A great example is the VisualBoyAdvance emulator, which does not compile with optimization flags, hangs and uses a lot of memory.

For my live CD I have to compile the majority of apps with gcc-3.4 CVS. The 3.4 branch, installed in parallel, is a better solution than gcc-4.X for some builds! I recommend it!

Many distributions still include both versions: gcc-4.X for recent code and also 3.4.

Anonymous said...

Generally I do not post on blogs, but I would like to say that this post really forced me to do so! really nice post.


MCKAY Brothers said...

Just as said here, those are the facts: current GCC is poor, and developers today don't work well because having too many resources encourages the curse of lazy programming. More modest, minimal-resource hardware is better for development.

I updated the massenkoh devel solution here.

Pedro Larroy said...

Shitty and misleading barcharts.