Friday, February 26, 2010

Reading Fermi CUDA stuff!

Fermi comp guide:
CUBINs are only compatible forward up to major revision so 1.x cubins only work on Tesla arch not Fermi..
Nvidia 195 drivers and up support forcing JIT compilation of kernels in PTX for execution setting CUDA_FORCE_PTX_JIT, which  is a way of checking for Fermi support of CUDA programs. I.e. if executable doesn't contain PTX codes will fail..
For optimizing for Fermi arch (since CUDA 3.0) is better to add explicitly code=compute_20 to code=compute_10 so it generates better code (?) also add sm_10 sm_20 for cubins.
Seems that cubin are cached by the driver so generated once and survive reboots, crashes,etc.. (where are they stored?)..
For CUDA driver API use nvcc -ptx and load using cumoduleloaddataex..
Since CUDA 2.1 compiling with arch=sm_xx (default standard (?))  evaluates to code=sm_xx (cubin) and code=compute_xx (ptx) so PTX code is inserted..

Fermi tuning guide& Programming guide 3.0:
*New graphics interop API tex interop and DX11 supports: (pags 37 cudart 63 driver api)
*interop cudart driver api(pag 72):
->allocate mem with whatever API
->if initiated context with driverapi first CUDA runtime call doesn't create context->cublas cufft work from CUDA driver api
-don't work with emulation and cuCtx{push,pop} functions..
*Use concurrent kernels:
check cudaGetDeviceProperties() concurrentKernels
and use multiple streams..

(up to 4 ONLY and from only one context? I supposed up to 16 as is the number of SMs so seems one kernel per Graphics processing cluter and not per SM) also from the same context invalidates running multiple parallel CUDA executables for extracting more perf (so no similar to use CPU cores running multiple single threaded apps. This is a shame as hardware has fast context switching but with bad coded CUDA program in parallel only aleviates overhead in switching it but no in parallel..
I suppose it's software implementation issues and fixed in software in CUDA 3.x or if not would be fixed for Fermi 2 so we can run as kernels as SMs and any number on contexts in parallel altough possibly every SM can run only one context..
*Arithmetic Instruction perf table (pag 90):
remember tesla 8 cores per sm's and good ops execute one warp in 4 clocks so 8 inst/s per SM.
Fermi is 32 cores per SM. So 16 Sms.
Note 32 bit integer is on Fermi as good as floating point so imad=mad in perf.. must see.
 *All global mem and shared mem access is done per warp not half-warp as before so check all goes well.
Shared mem is expanded to 32 banks.
Now with cache global mem coalescing seems less a requirement and also shared mem is much better as only bank conflict are "when 2 o more threads request data in different words to same bank" i.e it has multiple words broadcast, etc.. So 8-bit access,16 bit , 32(always fast) 64bit (doubles) and even 96bit (was always)  sequential access is good now as 32bit and also 32bits with 8bit offset for example.. well I don't know if a 8 bit offset for 32 bit words is bad or not as that would require breaking every word in two banks and don't know if that is served jointly or not but I pressume runs without bank conflicts fast!
*Similar to DC which seems to require knowing at compile time(DispatchIndirect is only for grid size?):
and OpenCL having __attribute__((reqd_work_group_size(X, Y, Z)))
The workgroup size that must be used as the local_work_size argument to clEnqueueNDRangeKernel. This allows the compiler to optimize the generated code appropriately for this kernel.
CUDA introduces __launchbounds__  to be appended to kernels for specifying min blocks per SM desired ocuppancy max workgroup size so it can optimize register (spilling) usage..

*By default (i.e. compiling source programs without change) L1 cache size will be 16kb so shared mem would be increased 3x per SM.. The function for setting is
*We know global mem was cached by hardware cache (L2) and know that there is a L1 cache at least 16kb in size.. I presumed this was used for caching global mem but turns out that L1 caching of global mem can be disabled (compile time would be better at runtime).. So what is used L1 for? local mem for register spilling for example that can't be disabled..
*a read only place from global mem (like const variables in C++) used along all threads in kernel is cached using constant cache (doesn't require __constant address space)..
*Don't use 24 int integers are slow on CUDA check at compile time with CUDA_ARCH only device code
but guide says..
*FP ops are higher precision so results can differ from Tesla
*As Fermi supports 64bit address space if passing -m64 to nvcc compiles to 64 host code and device code which is slower than 32 bits.. So if you don't need 64 address space but compile to 64 bits host (i.e. the GPUs your program would run are less than 4gb or program needs already less than so compile) compile separately kernel code from host code..
*CUDA C++:
function overloading: f(int a) f(double a)
default parameters: f(a,b=0);
namespaces: namespace nv{ int a;} namespace ati{ int a;} nv::a=2; ati::a; using namespace nv; a=3;(nv )
operator overloading uchar4 operator+()  uchar4 a,b,c; c=a+b;
implicit, explicit and specialized templates: f(x) or int x; f(x) and f(x){return(2);} f ret(3);
Fermi stuff:
classes and functors.

Seems support for virtual functions is missing yet and function pointers.. but coming..
Recursion and mem allocation inside kernel still lacking and coming much later (?)..
Remember all that supported in hardware..

Search fermi new insts in b.5, b.6, b.11 (103,104)
I don't know if b.12 is new __prof_trigger which exposes 8 counters which are incremented per warp each time and can be queried by profiler.. would be good if you can read with another inst in kernel? must think..
b.14 has launchbound(pag 112) doc.
Appendix G has the architecture feature chart (G1)
LACKING documentation from the guide:
*Launching of 3D grids! (well in 102 b.4 you find griddim is dim3 type but in pag 8 2.2 you see blocks are 1d or 2d thing and well in b.13 in 111 is said grid is dim3 but .z=1) (DC 5.0 has it OpenCL model (the API) supports that also)
*Surface functions (I hope are no left for CUDA 3.1 or later as Fermi supports it and even Tesla as is used for RWTexture in DC and image writes in OpenCL driver)
*Info that Fermi allows D2H H2D simultaneous transfers via async functions (check concurrent bancwith 1.1)

Also somethings I was unaware of:
use of __restrict__ in cuda pointers and some SLI info about cuda, SLI and D3D graphics interop..

2 comentarios:

  1. Did you collect some news regarding write access to 3D textures?

    ReplyDelete
  2. For what I know, Fermi surfaces are only 2D...
    But why do you want to write into a 3D texture ? ;-)

    ReplyDelete