Hi,
You can also learn a lot from the AMD Sea Islands ISA manual: http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf
Some easy hints:
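- Don't use more than 128 vector registers (VGPRs) per kernel. As far as I know, above that only one wavefront fits on each SIMD, so latency can't be hidden.
- Schedule at least 4x more work-items than the number of stream processors in the device.
- Use 24-bit integer math where possible (e.g. mul24/mad24 if you are writing OpenCL).
- Make sure the kernel code, or at least its main loop, is not larger than 32KB, so that it fits in the instruction cache.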
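
To make the first three hints a bit more concrete, here is a rough C sketch of the arithmetic behind them. The function names are made up for illustration, and the 256-VGPR budget and the 10-wavefront cap per SIMD are my assumptions about GCN hardware rather than anything stated above, so check them against the ISA manual for your chip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: each SIMD offers 256 VGPRs per lane and holds at most
 * 10 wavefronts. Both numbers are my reading of GCN, not from the post. */
#define ASSUMED_VGPR_BUDGET 256u
#define ASSUMED_MAX_WAVES   10u

/* Hypothetical helper: how many wavefronts can be resident on one SIMD
 * when the kernel uses `vgprs` vector registers. */
static unsigned waves_per_simd(unsigned vgprs)
{
    assert(vgprs >= 1 && vgprs <= ASSUMED_VGPR_BUDGET);
    unsigned waves = ASSUMED_VGPR_BUDGET / vgprs;
    return waves > ASSUMED_MAX_WAVES ? ASSUMED_MAX_WAVES : waves;
}

/* Hypothetical helper: the "at least 4x" rule of thumb for the number
 * of work-items to schedule. */
static size_t min_work_items(size_t stream_processors)
{
    return 4u * stream_processors;
}

/* Hypothetical helper: nonzero if an unsigned value fits in 24 bits,
 * i.e. is safe to feed to a 24-bit multiply. */
static int fits_u24(uint32_t x)
{
    return x <= 0xFFFFFFu;
}
```

Under those assumptions, a kernel with 128 VGPRs still gets two wavefronts per SIMD, while 129 drops it to one. That is where the 128 limit comes from.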