
Not very familiar with Visual Studio C++


Chris_F

So I've been doing a bit of programming lately with GCC in Linux.

Having made a few number crunching programs I've noticed that if you compile without any optimization flags, GCC will give you nothing but x87 FPU code. The execution time of my routine was ~8 seconds.

Using the -O2 or -O3 flag for optimization produced SSE code that did the same thing in about ~0.7 seconds. Not bad at all.
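For reference, roughly the flags in play (hypothetical file name; note that on 32-bit x86 targets, scalar SSE math generally also needs -mfpmath=sse, while x86-64 targets use SSE for floating point by default):

Code:
# no optimization: x87 FPU code on a 32-bit x86 target
gcc -O0 crunch.c -o crunch

# optimized: -O3 turns on -ftree-vectorize; -msse2 -mfpmath=sse
# steer scalar floating point through SSE on 32-bit targets
gcc -O3 -msse2 -mfpmath=sse crunch.c -o crunch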

Anyway, I've been doing a bit of coding in Visual Studio Express now, and I don't see any options for optimizations. I don't even know if VS9.0 Express (or otherwise) supports SSE (or later) or automatic vectorization.
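For what it's worth, the command-line compiler does expose analogous switches (in the IDE they live under Project Properties -> C/C++ -> Optimization and -> Code Generation). A sketch with a hypothetical file name, from a Visual Studio 2008 command prompt:

Code:
rem /O2 = maximize speed, /arch:SSE2 = allow SSE/SSE2 code generation,
rem /fp:fast = relax floating-point rules for speed
cl /O2 /arch:SSE2 /fp:fast crunch.cpp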

I am able to write inline ASM code for manual SSE vectorization, and the speed boost is great, but it doesn't seem to offer the automatic optimizations that GCC offers.

Maybe Intel's compiler is really the best compiler for Windows? I've never used it. I thought Microsoft's would have been just as good.
 
I confirmed with OllyDbg that VS9.0 is generating x87 code, and not even all that well optimized at that.
 
Can't offer anything helpful or useful, but I did want to comment...

Having made a few number crunching programs I've noticed that if you compile without any optimization flags, GCC will give you nothing but x87 FPU code. The execution time of my routine was ~8 seconds.

Using the -O2 or -O3 flag for optimization produced SSE code that did the same thing in about ~0.7 seconds. Not bad at all.

...on this. I noticed the same thing while writing some simple decimal-to-binary conversion programs in C. I came up with my own version, which used division and modulus; then I found a version online that used pointers (to traverse the array), bit shifting, and bitwise AND. (I also wrote a third version that was a sort of hybrid of the two, but I'll set that aside for now.) With no optimizations, the pointers-and-bitwise version was faster, and the speed difference became more and more pronounced the more numbers were converted at once. With optimizations, however, my division-and-modulus version was faster, and the difference was even bigger.
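Roughly what the two flavors look like, as a sketch with hypothetical names (both write the bits of an unsigned value most-significant-bit first into a caller-supplied buffer of BITS + 1 chars):

Code:
#include <stdio.h>

#define BITS 16

/* division-and-modulus version */
void to_binary_divmod(unsigned n, char *out)
{
	int i;
	for (i = BITS - 1; i >= 0; i--)
	{
		out[i] = '0' + (n % 2);  /* remainder is the low bit */
		n /= 2;
	}
	out[BITS] = '\0';
}

/* pointer, shift, and bitwise-AND version */
void to_binary_bitwise(unsigned n, char *out)
{
	char *p = out + BITS;
	*p = '\0';
	while (p != out)
	{
		*--p = '0' + (n & 1);  /* mask off the low bit */
		n >>= 1;
	}
}

With optimizations on, a compiler will typically strength-reduce unsigned division and modulus by 2 into the same shifts and masks anyway, so the handwritten-bitwise advantage evaporating is not surprising.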

Additionally, my optimized version took only 19% of the time of its unoptimized build, while the online version's optimized build took 40% of the time of its unoptimized one. I drew up some Excel spreadsheets and everything for the various tests, but that's outside the scope of this post.

That was longer than I wanted it to be, but basically, I was using Code::Blocks to do my coding and it uses "GNU GCC Compiler" under the selected compilers section. I haven't used VS much because it way over-complicates things. I had access to it at school, but I haven't touched it since, because it was waaaaaaay too complicated and far too headache-inducing. It felt like taking a chrome-and-titanium RoboHammer 9000 v9.3 with "are you sure?" prompts and safety switches just to drive a simple nail into a wall to hang a picture.
 
Ok I see.

Still, though, I see that it is now using the /Ot, /O2, and /arch flags, and yet there has been no speed increase, and it is still generating only x87 code.
 
OK, another odd problem I'm having: I can't declare a variable inside of a for loop, which I should be able to do in C++. What the heck?
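For what it's worth, a likely cause, assuming the source file has a .c extension: VS compiles .c files as C89, where a declaration in the for-init clause is an error. Renaming the file to .cpp (or hoisting the declaration) gets past it:

Code:
#include <stdio.h>

int main(void)
{
	int j;

	for (int i = 0; i < 4; i++)   /* fine in C++ (and C99); an error under C89 */
		printf("%d\n", i);

	for (j = 0; j < 4; j++)       /* C89-compatible form */
		printf("%d\n", j);

	return 0;
}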
 
Even stranger yet (keep in mind I'm still not very good with C/C++)

Anyway, here is my code:

The program creates 2 arrays of 8192 random floating-point numbers between 0 and 1. It then multiplies the first array by the second, 250,000 times to make it take longer: once using a multiplication routine written in C++, and once using an inline-assembly routine with SSE instructions. It measures the time and computes the performance.

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define d_size 8192
#define runcount 250000

float *init_dataset(int size);
void mul_datasets(float *p1, float *p2, float *p3, int size);
void sse_mul_datasets(float *p1, float *p2, float *p3, int size);

int main()
{
	int i;
	char buff[256];
	clock_t cStart, cFinish;
	float runtime, flops;

	float *dataset_a;
	float *dataset_b;
	float *result_1;
	float *result_2;

	dataset_a = init_dataset(d_size);
	dataset_b = init_dataset(d_size);
	result_1 = init_dataset(d_size);
	result_2 = init_dataset(d_size);


	printf("Strating C++ code benchmark...\n\n");

	cStart = clock();

	for(i=0; i<runcount; i++)
	{
		mul_datasets(dataset_a, dataset_b, result_1, d_size);
	}

	cFinish = clock();

	runtime = (cFinish - cStart) / (float)CLOCKS_PER_SEC;

	flops = ((d_size * runcount) / runtime) / 1000000000;

	printf("Finished in %.*f seconds.\n%.*f Gflops\n\n", 3, runtime, 3, flops);

	printf("Strating ASM code benchmark...\n\n");

	cStart = clock();

	for(i=0; i<runcount; i++)
	{
		sse_mul_datasets(dataset_a, dataset_b, result_2, d_size);
	}

	cFinish = clock();

	runtime = (cFinish - cStart) / (float)CLOCKS_PER_SEC;

	flops = ((d_size * runcount) / runtime) / 1000000000;

	printf("Finished in %.*f seconds.\n%.*f Gflops\n\n", 3, runtime, 3, flops);

	free(dataset_a);
	free(dataset_b);

	printf("Comparing results...\n\n");

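	/* the SSE routine only writes the first 8192 bytes (2048 floats,
	   i.e. d_size/4 elements), so only that range is compared */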
	for(i=0; i<d_size / 4; i++)
	{
		if(result_1[i] != result_2[i])
		{
			printf("Results %f did not match %f at offset %d.\nAn error has occured!\n", result_1[i], result_2[i], i);
			free(result_1);
			free(result_2);
			fgets(buff, sizeof(buff), stdin);
			exit(1);
		}
	}

	printf("Both results match perfectly!\n");
	free(result_1);
	free(result_2);
	fgets(buff, sizeof(buff), stdin);
	exit(0);
}

float *init_dataset(int size)
{
	int i;
	float *dataset = (float*)malloc(size * sizeof(float));
	
	for (i = 0; i < size; i++)
	{
		dataset[i] = (float)rand()/(float)RAND_MAX;
	}

	return dataset;
}

void mul_datasets(float *p1, float *p2, float *p3, int size)
{
	int i;
	for(i=0; i<size; i++)
	{
		p3[i] = p1[i] * p2[i];
	}
}

void sse_mul_datasets(float *p1, float *p2, float *p3, int size)
{
	_asm{
		mov ecx, 0
		mov ebx, p1
		mov edx, p2
		mov esi, p3
asmlp:
		movups xmm0, [ebx+ecx]
		movups xmm1, [edx+ecx]
		movups xmm2, [ebx+ecx+16]
		movups xmm3, [edx+ecx+16]
		mulps xmm0, xmm1
		mulps xmm2, xmm3
		movups [esi+ecx], xmm0
		movups [esi+ecx+16], xmm2
		add ecx, 32
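		; 8192 is a byte count: 2048 floats, i.e. d_size/4 elements per call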
		cmp ecx, 8192
		jl asmlp
	}
}

When I compile it in the Debug configuration, it generates a 29KB executable. The output from running it is:

Code:
Starting C++ code benchmark...

Finished in 8.578 seconds.
0.239 Gflops

Starting ASM code benchmark...

Finished in 0.288 seconds.
7.111 Gflops

Comparing results...

Both results match perfectly!

When I change it to compile as a release version, it creates a smaller 8KB executable, which is logical. However, I would expect that the C++ routine runs a bit faster than in the debug version, while the ASM version runs about the same. Right?

Code:
Starting C++ code benchmark...

Finished in 6.192 seconds.
0.331 Gflops

Starting ASM code benchmark...

Finished in 0.474 seconds.
4.321 Gflops

Comparing results...

Both results match perfectly!

Ah, the C++ is running about 30% faster, as expected. But what happened to the inline ASM, which is now running an unenthusiastic 40% slower?
 
Check in the project settings; there should be options for optimizations.
(I'll take a look on my dev computer a bit later.)

Now, AFAIK Intel's compiler beats GCC in speed and its warnings are much better, with MSVC fitting in between the two.

As for SSE code, I am unsure whether MSVC has it. Here's a quick Google page result, though:
http://msdn.microsoft.com/en-us/library/t467de55(VS.80).aspx
 
For whomever it may concern, I've figured out the performance anomaly.

SSE performs best when the memory locations being accessed are aligned to 16 bytes; unaligned accesses carry a significant performance penalty (in my case, 40%).

Using float *dataset = (float*)_aligned_malloc(size*4, 16); was the answer.
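A minimal sketch of that pattern (hypothetical function name; _aligned_malloc and _aligned_free live in <malloc.h>, and buffers from _aligned_malloc must be released with _aligned_free, not free):

Code:
#include <malloc.h>   /* _aligned_malloc, _aligned_free */
#include <stdlib.h>   /* rand, RAND_MAX */

float *init_dataset_aligned(int size)
{
	int i;
	/* 16-byte alignment avoids the unaligned-access penalty and
	   also permits the aligned movaps instruction */
	float *dataset = (float*)_aligned_malloc(size * sizeof(float), 16);

	for (i = 0; i < size; i++)
		dataset[i] = (float)rand() / (float)RAND_MAX;

	return dataset;
}

With the buffers guaranteed 16-byte aligned, the movups loads and stores in the asm routine could also be swapped for movaps.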

As for MSVC's vectorization abilities, I guess I'll read up on intrinsics and inline assembly, as it seems 2008 leaves much to be desired (future releases are supposed to address this). I can always look into Intel's compiler if I'm really serious anyway. Theirs would probably be the best; after all, it is Intel, they created x86. :D
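As a sketch of the intrinsics route (hypothetical function name; the _mm_* operations come from <xmmintrin.h> and map more or less one-to-one onto the SSE instructions in the asm version; this assumes 16-byte-aligned buffers and a size that is a multiple of 4):

Code:
#include <xmmintrin.h>   /* SSE intrinsics */

void sse_mul_datasets_intrin(float *p1, float *p2, float *p3, int size)
{
	int i;
	for (i = 0; i < size; i += 4)
	{
		__m128 a = _mm_load_ps(p1 + i);          /* aligned 4-float load */
		__m128 b = _mm_load_ps(p2 + i);
		_mm_store_ps(p3 + i, _mm_mul_ps(a, b));  /* multiply, aligned store */
	}
}

The compiler handles register allocation and the loop bookkeeping, and unlike __asm blocks this also compiles for x64, where MSVC drops inline assembly entirely.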
 
Also, for anyone interested, it turns out that Visual Studio Express (the free version) is crippled. It only has 32-bit support (it can't target x64), and it has no support for automatic SIMD code generation (SSE, SSE2, SSE3, etc.) and no optimizations.

I guess that officially makes MSVC the worst free compiler available (not a shock.)
 