The HD7900 has plenty of memory (3GB) and plenty of compute units (32 CUs), so I wonder why.
I can think of a few possible causes:
- Is the NVIDIA GPU's branch granularity half that of the AMD GPU (32-thread warp vs. 64-thread wavefront)?
  Does that cause many more branch penalties on the AMD GPU?
- A driver bug?
- Is this program heavily optimized for NVIDIA GPUs? Is the working buffer size tuned to fit the NVIDIA GPU's L2 texture cache?
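
To make the branch-granularity guess concrete, here is a toy Python sketch. It is not a measurement of this program: the function name `divergent_fraction` and the branch pattern are made up for illustration. It only counts how many SIMD groups contain threads that disagree on a branch, for a group size of 32 versus 64.

```python
def divergent_fraction(takes_branch, group_size):
    """Fraction of SIMD groups in which threads disagree on the branch."""
    groups = [takes_branch[i:i + group_size]
              for i in range(0, len(takes_branch), group_size)]
    # A group is divergent when some, but not all, of its threads take the branch.
    divergent = sum(1 for g in groups if any(g) and not all(g))
    return divergent / len(groups)

# Hypothetical pattern: 32 consecutive threads take the branch, the next 32 do not.
pattern = [(i // 32) % 2 == 0 for i in range(1024)]

print(divergent_fraction(pattern, 32))  # 0.0: every 32-thread group is uniform
print(divergent_fraction(pattern, 64))  # 1.0: every 64-thread group is split
```

With a pattern like this, a 32-wide group never diverges while a 64-wide group always does, so both sides of the branch would be executed for every wavefront. Whether the real program branches in such a pattern is only my assumption.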
Could somebody please share their opinion?
Regards.