I discussed some OTB-CUDA experiments that had been done with Emmanuel. The experiments tell the age-old story of transfer bandwidth cost versus compute speed-up: if your problem is too simple, the performance gain from CUDA is not going to be much. They did one of the first tests specifically with remote sensing type computations of varying complexity. Amdahl's law is fairly intuitive, and even the embarrassingly parallel problems that we encounter in raster processing cannot break the curse of sequential segments.
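To put a number on that curse, here is a minimal sketch of Amdahl's law in Python. The function name `amdahl_speedup` and the 90% parallel fraction are my own illustrative assumptions, not figures from the OTB-CUDA experiments.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Overall speed-up when only `parallel_fraction` of the runtime
    can be spread across `n_workers` (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


if __name__ == "__main__":
    # Hypothetical raster job: 90% per-pixel work, 10% sequential I/O and setup.
    for n in (1, 8, 64, 512):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With a 10% sequential share the speed-up can never reach 10x, however many GPU cores you throw at the per-pixel part.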
In GPGPU computing, one of the core things to tune is the kernel size and its computational complexity. You don't want to be performing CPU-to-GPU transfers all the time just for the GPU to perform a very simple operation, no matter how fast it does it. On the other hand, GPUs are rather constrained in the complexity of operations they can carry out and will always need assistance from the CPU for loading data from disk, etc.
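The trade-off can be sketched with a back-of-the-envelope cost model. Everything here is an assumption for illustration: the function names, the linear model itself (ignoring caches, overlap of transfer and compute, and so on), and all the hardware numbers, none of which were measured.

```python
def cpu_time(n_pixels, ops_per_pixel, cpu_ops_per_s):
    """Seconds to process the raster on the CPU alone."""
    return n_pixels * ops_per_pixel / cpu_ops_per_s


def gpu_time(n_pixels, ops_per_pixel, gpu_ops_per_s,
             bytes_per_pixel, bus_bytes_per_s, launch_overhead_s):
    """Seconds on the GPU: kernel launch + upload and download + compute."""
    transfer = 2 * n_pixels * bytes_per_pixel / bus_bytes_per_s  # to device and back
    compute = n_pixels * ops_per_pixel / gpu_ops_per_s
    return launch_overhead_s + transfer + compute


if __name__ == "__main__":
    # Hypothetical hardware figures, chosen only to show the shape of the curve.
    hw = dict(gpu_ops_per_s=50e9, bytes_per_pixel=4,
              bus_bytes_per_s=4e9, launch_overhead_s=1e-5)
    n_pixels = 1_000_000
    for ops in (1, 10, 100, 1000):
        t_cpu = cpu_time(n_pixels, ops, cpu_ops_per_s=1e9)
        t_gpu = gpu_time(n_pixels, ops, **hw)
        print(ops, "ops/pixel:", "GPU wins" if t_gpu < t_cpu else "CPU wins")
```

Under these made-up numbers a one-operation-per-pixel kernel loses to the CPU because the transfer dominates, while a heavier kernel amortises it.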
MSVC Express 2010. As a penalty for being on the bleeding edge, Nvidia CUDA and OpenCL don't work with 2010, so I had to time-warp back to 2008. Even then the OTB-CUDA code needs some porting due to missing timing code; maybe the sample here will provide some inspiration.