Sunday, March 21, 2010

What's next for CUDA 3.1 and OpenGL 3.3/4.1!

Let's see what CUDA 3.0 final adds over the beta:

*adds full BLAS support
*OpenCL local atomics
*OpenCL and CUDA D3D9-11 interop
*updated guides since the beta
*still no PTX 1.5/2.0 specs
*the NVIDIA OpenCL extensions are also published now: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/opencl_extensions/cl_nv_compiler_options.txt

Interesting notes:
*Float16 (half) textures are supported in the runtime
*CUBLAS is complete and IEEE 754 compliant on Fermi
 *SGEMM performance on Fermi-based GPU is 30% lower than expected.
    It will be fixed in 3.1.
*The stability of the large-prime FFT transform (signals with a length
    that is prime and >64k samples) is extremely variable, giving
    single-precision accuracy in the range 0.005->0.025. In general, smaller signals
    experience greater accuracy.
*This package will work on Mac OS X running 32/64-bit.
 *CUDA applications built as 32/64-bit (CUDA Driver API) are supported.
 *CUDA applications built as 32-bit (CUDA Runtime API) are supported.
    (10.5.x Leopard and 10.6 Snow Leopard)
Note: x86_64 is not currently working for Leopard or Snow Leopard
*CUDA applications built with the CUDA driver API can run as either 32/64-bit applications.
 *CUDA applications using CUDA Runtime APIs can only be built as 32-bit applications.
SDK Release 3.0 Final:
* Replaced 3dfd sample with FDTD3d (Finite Difference sample has been updated)
* Added support for Fermi Architecture (Compute 2.0 profile) to the SDK samples
* Updated Graphics/CUDA samples to use the new unified graphics interop
* Several samples with Device Emulation have been removed.  Device Emulation is 
  deprecated for CUDA 3.0, and will be removed with CUDA 3.1.
* Added new samples:
   concurrentKernels (Fermi Capability)
* Bug Fixes
They have also added simpleMPI.
I have to test it with Intel MPI 4.0.

Mac notes:
cuda.dylib is 64-bit and has the 195 API; there are 195 and 185 dylibs, versioned as 195_96 or 185_55.
*has cuda-memcheck but no cuda-gdb
*the CUDA kext is a fat binary that includes 64-bit, and so is cuda.dylib, so CUDA driver API applications
can be compiled and run as 64-bit.
Note you can also boot the 64-bit kernel thanks to the kext.
cudart is 32-bit only.
So in theory we could write a cudart wrapper over the CUDA driver API and compile it as 64-bit, all the more
now that cudart is stateless and interoperates with CUDA driver memory allocations.

All that would still be needed is 64-bit builds of CUBLAS and CUFFT.
We have source code for CUDPP, Thrust and CUSP, and in the meantime Volkov's matmul, FFT and LAPACK codes,

so all of these could be compiled as 64-bit if we had a 64-bit cudart, and then we would see what happens.
Well, I have compiled deviceQueryDrv and matrixMulDrv as 64-bit.
(Am I the first in the world to have 64-bit CUDA binaries on Apple, outside NVIDIA?)
I got rid of cutil, though compiling it as 64-bit would be no problem. Some notes:
nvcc on Mac defaults to 32-bit, while gcc defaults to 64-bit on Snow Leopard,
so to build 64-bit with nvcc you must pass -m64.
For CUDA driver API projects nvcc is hardly needed, since you can build the host code with g++ and compile the CUDA
files to PTX with nvcc -ptx.
With nvcc -m64 you get 64-bit CPU code, and with -ptx you also get PTX code
using 64-bit pointers (for Fermi?).
Since Fermi can also use 32-bit pointers, it is better to use 32-bit pointers.
So for matrixMulDrv, use nvcc -ptx for 32-bit pointers and g++ (-m64) for the host code, and you get a 64-bit host binary.
But cuModuleLoadDataEx gives me this error:
CUDA_ERROR_POINTER_IS_64BIT     = 800,      ///< Attempted to retrieve 64-bit pointer via 32-bit API function
I get this error loading the PTX whether it was built with nvcc -m64 or plain nvcc (both with -ptx),
so PTX with 32- or 64-bit pointers doesn't change that.
I have to compare the PTX files with 32- and 64-bit pointers to see the differences, also with sm_20.
Also note that nvcc -m64 needs /usr/local/cuda/lib64 to exist even when it is not used,
so I copied lib to lib64 (a symlink also works).
After that you can run it.
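Since nvcc and g++ default to different widths on Snow Leopard, it can help to have the host binary report which width it was actually built as. A minimal sketch in plain C++ with no CUDA dependency; the names host_pointer_bits and print_host_pointer_bits are made up for illustration:

```cpp
#include <climits>
#include <cstddef>
#include <cstdio>

// Width, in bits, of a host pointer in the mode this file was compiled for
// (typically 32 with -m32 and 64 with -m64).
constexpr std::size_t host_pointer_bits() {
    return sizeof(void*) * CHAR_BIT;
}

// Prints the width so a build can be checked at a glance.
inline void print_host_pointer_bits() {
    std::printf("host pointers are %lu bits wide\n",
                static_cast<unsigned long>(host_pointer_bits()));
}
```

Calling print_host_pointer_bits() from main in the g++ -m64 build should report 64. This only describes the host code, not the pointer size used inside the PTX.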
I have to write a tutorial on using CUDA and nvcc to build Mac OS fat binaries (i386 and x86_64).
*I see the nvcuvid library for Mac in the GPU Computing SDK, but only 32-bit:
/C/common/lib
and /C/common/inc/cuvid

Anyway, I have a 64-bit libcuvid (vs libnvcuvid) in /usr/local/cuda (where did I get it from?)
*There is also a preference pane with auto-update that shows the GPU driver version and the CUDA driver version.


Note the OpenCL samples on Mac won't work until 10.6.3.


Good: NVIDIA documents the behavior OpenCL leaves undefined (implementation specific):
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_OpenCL_ImplementationNotes_3.0.txt

issues with mac..

A perfect OpenGL 4.1/3.3 release would have:
*ext_direct_state_access
*ext_separate_shader_objects
*RW textures (3D also): ext_image_load_store
*binary shaders (GL ES 2.0 API)
In theory you could use some IR from the 3Dlabs front-end compiler source,
or translate to HLSL via some translator (AMD hlsl2glsl?) and then use a binary HLSL shader.
Another good translator:
http://code.google.com/p/angleproject/
It has a flex/bison GLSL parser and also a GLSL-to-HLSL translator (ES 2.0).
Going from binary to DX IL via:
fxc /dumpbin
But DX IL to binary? And how to go from DX IL to HLSL or GLSL directly?
I have also found that Wine more or less parses DXBC files:
/dlls/d3d10/effect.c
static HRESULT parse_shade

NV OGL extensions:
*Fermi function pointers and recursion for GLSL?
It would be a good addition to the bindless extensions and shader buffer load.

CUDA 3.1:
*cuda-gdb hardware debugging support for OpenCL
*pinned GPU memory interop with MPI over InfiniBand (announced at SC09 for spring 2010)

*template for a DirectCompute project
Currently there is no template for a DirectCompute project, but NVIDIA will be
    providing one soon.
*Fix CUBLAS SGEMM performance on Fermi (30% faster)
*Fix the CUFFT performance regression vs 3.0 beta: it went from 180-190 GFLOPS to 150 GFLOPS
*Provide an official cudaasm/decuda, or documentation of the cubin/ELF format for sm_20 devices? Also for sm_10?
*PTX 1.5 and 2.0 docs?
*Updated OpenCL best practices guide for Fermi? The CUDA best practices guide is updated, but for Fermi?

*Surface functions: RW textures with x,w addressing etc.. also 3d image writes.. headers and exported functions in beta but removed in final..

Also a CUDA-to-CPU compiler, or is gpuocelot mature enough, with Mac and Windows ports available?
A direct PTX-to-CPU code converter would be good, using the gpuocelot lib as cudart and CUDA API.

Mac

*add cuda-gdb (with OpenCL too) and the OpenCL visual profiler

On Mac, 2 OpenCL examples don't work

CUDA with OpenGL is slow on Mac
ship
is going to work with fermi cuda.kext


*Related: the first WHQL driver for Quadros from the 195 series (197) enables OpenCL on these devices:
Adds support for CUDA 3.0 for improved performance in GPU Computing applications. See CUDA for more details. 
This driver resolves fan speed issues reported with version 196.75 drivers.
Adds support for the Open Computing Language (OpenCL) 1.0 in Quadro FX Series x700 and newer as well as the FX4600 and FX5600.
*NVIDIA mentions the compute cluster driver, but it is still 196.28, not updated since early February. Anyway, the D3D interop
that was finally added is not needed there.
*According to Pierre Boudier, we can expect OpenGL 4.0 drivers soon, and also an image write and random access extension a la D3D11 RWTexture.
Ubuntu 10.04 fglrx 8.72:
fglrx-installer (2:8.721-0ubuntu1) lucid; urgency=low 

* New upstream release: 
- Restore compatibility with kernel 2.6.32 and xserver 1.7 (LP: #494699). 
- Add Passive Stereo support on workstation (FireGL/FirePro) hardware. 
- Add Eyefinity support (more than 2 monitors on Radeon HD 5xxx hardware). 
Officially WS-only but should work on consumer boards as well. 
GL_EXT_shader_subroutine GL_EXT_timer_query

Also, what about 3D stereo on Linux:
*3D Vision for OpenGL QB (quad-buffered stereo) on Quadros with the stereo connector is already here.
*3DTV for Linux, so OpenGL QB can be output over HDMI 1.4 on Linux? This could also work on low-profile Quadros, since the stereo connector is not needed (it is not really needed for 3D Vision either; it is NVIDIA's way of artificially limiting it to very high-end Quadros, except perhaps for better sync).
Also, if they add H.264 MVC to VDPAU and you decrypt Blu-ray 3D with AnyDVD HD, you would in theory be able to watch it on Linux with GPU-accelerated decoding, sent to TVs via HDMI 1.4.
Let's also see how Windows is handled, since DXVA 2.0 does not support MVC and neither does cuvid, so let's see if they add it to cuvid too.
So it seems CyberLink and the rest will get some library from NVIDIA, or what?
*ATI has hooks for D3D9 and 10 (D3D11?) in Catalyst 10.3, and fglrx 8.72 adds passive stereo for OpenGL QB (active stereo is already there, right? But for 120 Hz LCDs too?)
Let's also see how ATI manages output to HDMI 1.4 TVs, via the iZ3D partnership or what? In fact I expect iZ3D only hooks D3D stereo and AMD will add HDMI 1.4 stereo output on top of these hooks, so an SDK or documentation for the hooks would be good.
It would also be good for NVIDIA to publish its stereo SDK (promised at GDC 2010), and I hope its hooks (D3D9-11) will also work with 3DTV and output to HDMI 1.4 TVs. In fact they should, as Avatar and 3D Vision presumably use these hooks.
Mac is out of this scope.

Also, NVIDIA may be late with Fermi, but not with the software supporting it.
D3D11 with CS 5.0 is here now, and we have D3D11 interop for CUDA in 3.0, D3D11 interop through an OpenCL extension, and OptiX D3D11 interop.
We have D3D11 interop with:
*CUDA 3.0
*OpenCL
*OptiX
HW debugging:
Nsight.
All that remains to be released is Nsight, which will also bring D3D11 support (HW debug and profile). It will be good to HW debug CUDA, D3D11 CS and CUDA with D3D11 interop, and to trace OpenCL and OpenGL (will 4.0 be traced?).

Also, will Cg 3.0 support D3D11, and SM 5.0 / OpenGL 4.0, i.e. tessellation shaders with GLSL output?
Note cgc 3.0 is shipping in the Tegra SDK and also as part of the OpenGL compiler in the NVIDIA 195 drivers.
I have seen CgFX working with OptiX and CUDA in a blog, so I hope they ship an example soon:
http://lorachnroll.blogspot.com/2010/03/mixing-nvidia-technologies-thanks-to.html


GeForce GTX 480:
- GPU: GF100 @ 700MHz
- CUDA cores: 480 @ 1401MHz
- Memory: 1536MB GDDR5 @ 1848MHz 384-bit
- TDP: 250W
- Price: $499US
GeForce GTX 470:
- GPU: GF100 @ 607MHz
- CUDA cores: 448 @ 1215MHz
- Memory: 1280MB GDDR5 @ 1674MHz 320-bit
- TDP: 225W
- Price: $349US

- 3D APIs: OpenGL 4.0 and Direct3D 11
- GPU Computing: OpenCL, CUDA and DirectCompute
- 3-way SLI support



GeForce GTX 480 : 480 SP, 700/1401/1848MHz core/shader/mem, 384-bit, 1536MB, 250W TDP, US$499

GeForce GTX 470 : 448 SP, 607/1215/1674MHz core/shader/mem, 320-bit, 1280MB, 225W TDP, US$349
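
As a rough sanity check on these numbers, theoretical peaks can be derived from the clocks. This sketch assumes 2 single-precision flops per CUDA core per shader clock (one fused multiply-add) and that the listed memory clock is half of the effective GDDR5 data rate. Both are my assumptions, not part of the spec list, and the function names are made up:

```cpp
#include <cstdio>

// Peak single-precision GFLOPS, assuming one FMA (2 flops) per core per clock.
constexpr double peak_sp_gflops(int cuda_cores, double shader_mhz) {
    return cuda_cores * shader_mhz * 2.0 / 1000.0;
}

// Peak memory bandwidth in GB/s, assuming two transfers per listed memory clock.
constexpr double peak_bandwidth_gbs(double mem_mhz, int bus_bits) {
    return mem_mhz * 2.0 * (bus_bits / 8.0) / 1000.0;
}

// Prints both figures for one card.
inline void print_peaks(const char* name, int cores, double shader_mhz,
                        double mem_mhz, int bus_bits) {
    std::printf("%s: %.1f GFLOPS, %.1f GB/s\n", name,
                peak_sp_gflops(cores, shader_mhz),
                peak_bandwidth_gbs(mem_mhz, bus_bits));
}
```

Under those assumptions the GTX 480 figures (480, 1401, 1848, 384) work out to about 1345 GFLOPS and 177.4 GB/s, and the GTX 470 figures (448, 1215, 1674, 320) to about 1089 GFLOPS and 133.9 GB/s.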

Note we also have C++ libraries with GLSL- and OpenCL-like vec4 and other types:
*GLM has strict GLSL compliance,
and its experimental GTX extensions even have SIMD implementations.
*DX SDK Feb 2010 has the XNAMATH 2.02 SIMD math library
Also read:
http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php

Good HDR maps:

http://www.hdrlabs.com/sibl/archive.html


NVIDIA employees' blogs:


http://timothylottes.blogspot.com/
http://jamesdolan.blogspot.com/
http://industrialarithmetic.blogspot.com/
http://castano.ludicon.com/blog/

http://twitter.com/castano

http://twitter.com/tmurray_cmpxchg

showing max cuda mem:
http://forums.nvidia.com/index.php?showtopic=102682 cuda maxmem

Caustic patents:
US patent applications: 20090096788, 20090096789, and especially 20090128562.


The LLVM 2.7 binaries are available for testing:
http://llvm.org/pre-releases/2.7/pre-release1/
http://amnoid.de/tmp/clangtut/tut.html
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2009-May/005167.html
http://synopsis.fresco.org/
Performance inconsistencies when testing various bit-counting methods 
ubuntu cheat cube:119834-cheat-cube-ub
ie9 VML to SVG Migration Guide
Windows Phone 7:
*XNA 4.0 CTP available; it works on PC, but only with the Reach profile, not HiDef.
*unlocked image with all apps: instructions on a blog
*Petzold samples and book excerpt available
*also a SQLite port -> csharp-sqlite.wp

Windows 7 XP Mode now supports CPUs without hardware virtualization (Intel VT-x / AMD-V) support.
Windows 7 SP1 virtualization news:
"With Microsoft RemoteFX, users will be able to work remotely in a Windows Aero desktop environment, watch full-motion video, enjoy Silverlight animations, and run 3D applications," Microsoft's Max Herrmann writes, "All with the fidelity of a local-like performance when connecting over the LAN."
Will CUDA work with it? I.e. no need for the compute cluster driver, plus OpenGL, DX and interop support.
Q: Will RemoteFX also support OpenGL hardware acceleration, which is the high-level 3D API used by professional applications like CAD systems or medical applications?

A: RemoteFX will support certain OpenGL applications. However, as the development of RemoteFX is still ongoing, it is too early to provide any specifics at this point.
Q: Are you planning to introduce RemoteFX for Windows 7 as well, since there are many scenarios where the remote system is not a server but a high-end workstation?
A: RemoteFX has been designed as a Windows Server capability to support the growing demand for multi-user, media-rich centralized desktop environments. Windows 7 will be supported as a virtual guest OS under Hyper-V.

Dynamic Memory is an improvement to Hyper-V which allows users to pool all available physical host memory together, and dynamically allocate it to virtual machines. In other words, if the workload changes, VMs can get access to extra memory without having to shut them down.

XNA forums:
Updated list of D3D12 suggestions
Unable to perform a recursive call with DirectCompute? 
How to AttachBuffersAndPrecompute to ID3DX11FFT
RWStructuredBuffer counter
IncrementCounter is 4 times faster than InterlockedAdd(Buffer[0], 1).
Gamefest 2010 presentations?
D3D11 / D2D Interoperativity
329M pairs/sec radix sort performance, 408M keys/sec - crushes CUDPP numbers
AppendStructuredBuffer driver bug?
How to debug DirectX 11 Compute Shaders?
Creating a Shared Surface with DXGI


atomic
I have some questions about RWStructuredBuffer:
1. How to copy the hidden counter to system memory? Answer: CopyStructureCount.
2. How to reset the counter to zero? Answer: the last argument of OMSetRenderTargetsAndUnorderedAccessViews.
3. Why is this counter so much faster than InterlockedAdd on a buffer element? (HD 5670)
IncrementCounter is 4 times faster than InterlockedAdd(Buffer[0], 1).
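
To make the three questions concrete: both IncrementCounter and InterlockedAdd(Buffer[0], 1) hand each thread a unique slot index by atomically bumping a counter. The sketch below is only a CPU-side analogy in C++ with std::atomic, not D3D11 code, and it says nothing about why the hidden counter is faster on the GPU. The AppendBuffer type and its member names are hypothetical:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU analogy of a structured buffer with a hidden counter.
struct AppendBuffer {
    std::vector<int> data;
    std::atomic<std::size_t> counter;

    explicit AppendBuffer(std::size_t capacity) : data(capacity), counter(0) {}

    // Like IncrementCounter: atomically reserve a slot and return the
    // pre-increment index, then write the element there.
    std::size_t append(int value) {
        const std::size_t slot = counter.fetch_add(1);
        assert(slot < data.size() && "buffer overflow");
        data[slot] = value;
        return slot;
    }

    // Like CopyStructureCount: read back how many elements were appended.
    std::size_t count() const { return counter.load(); }

    // Like passing an initial count of 0 when the UAV is bound.
    void reset() { counter.store(0); }
};
```

Appending 10 and then 20 to a fresh buffer returns slots 0 and 1, count() then reads 2, and reset() brings it back to 0.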
How to AttachBuffersAndPrecompute to ID3DX11FFT?

http://gephi.org/

http://forums.xna.com/forums/t/49607.aspx
Thank you. I forgot about debug version of the D3DCSX. Debug message proved to be helpful. For the record: 1. The number of buffers attached must be exactly the same as in D3DX11_FFT_BUFFER_INFO. 2. The views MUST be created with the D3D11_BUFFER_UAV_FLAG_RAW flag (although it wasn't mentioned in documentation).



The Chrome dev channel release has support for an Open GL ES 2.0 interface 
for Native Client. This is something we said we would do sometime last year. 
When we consider it stable, documented etc. we will do more of an 
announcement.

Google are announcing that NaCl now also supports x86-64 and ARM.
http://www.osnews.com/story/23021/Native_Client_Portability_Almost_Native_Graphics_Layer_Engine
NaCl_SFI: Adapting Software Fault Isolation to Contemporary CPU Architectures
pnacl: Portable Native Client Executables

from GDC:

These are also graphics API translations:
Cider & Cedega: Direct3D on OpenGL
GameTree.tv: Direct3D on OpenGL ES
SwiftShader: DX Software Rendering (also WARP)
ANGLE Project: WebGL (OGL ES 2.0) on Direct3D

Now we need the same for GPGPU APIs, so:
cuda on opencl?
cuda on cal?
directcompute on opencl?
opencl on directcompute?

Posted on the OpenGL and CUDA forums:
Questions to NVIDIA:
*Is NVIDIA going to expose ext_gpu_shader_fp64 on GT2xx hardware with double precision, or is it only for D3D11 hardware?
For example the GTX 275.
AMD seems to support double precision in GLSL via doublepAMD even on 4850 cards.
Also, with the initial GL 4.0 drivers, is NVIDIA finally going to publish documentation for wgl_nv_dx_interop, and ship the texture writing and random access support shown at GTC,
via ext_image_load_store?
Please post the PTX 1.5 and 2.0 documents.
I'm also summing up here the things NVIDIA promised would come soon, so let's see how long it takes before we get:
*cuda-gdb support for hardware debugging of OpenCL kernels
*cuda-gdb GPU debugger for Mac (with OpenCL support also)

Mac related:
Is 64-bit Mac supported? The release notes say:
This package will work on Mac OS X running 32/64-bit.
           CUDA applications built as 32/64-bit (CUDA Driver API) are supported.
           CUDA applications built as 32-bit (CUDA Runtime API) are supported.
           (10.5.x Leopard and 10.6 Snow Leopard)
Note: x86_64 is not currently working for Leopard or Snow Leopard
CUDA applications built with the CUDA driver API can run as either 32/64-bit applications.
    CUDA applications using CUDA Runtime APIs can only be built as 32-bit applications.




My mac notes:

nvcc matrixMul_kernel.cu matrixMulDrv.cpp  -I../../common/inc/  ../../lib/libcutil_i386.a matrixMul_gold.cpp -Xlinker /usr/local/cuda/lib/libcuda.dylib
nvcc matrixMul_kernel.cu -c -m64
g++ matrixMul_gold.cpp matrixMulDrv.cpp  -I../../common/inc/ -I$CUDA_INC_PATH -L$CUDA_LIB_PATH /usr/local/cuda/lib/libcuda.dylib ../../lib/libcutil_i386.a

for nvcc -m64, create lib64 as a copy of lib
nvcc -m64 deviceQueryDrv.cpp  -I../../common/inc/ -I../../../shared/inc -Xlinker /usr/local/cuda/lib/libcuda.dylib
remove cutil
nvcc defaults to 32 bits
gcc defaults to 64 bits
with g++:
g++  deviceQueryDrv.cpp  -I../../common/inc/ -I../../../shared/inc  /usr/local/cuda/lib/libcuda.dylib -I$CUDA_INC_PATH

//#include
#define CU_SAFE_CALL_NO_SYNC(a) a
//CUT_EXIT(argc, argv);
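
Defining CU_SAFE_CALL_NO_SYNC(a) as plain a removes the cutil dependency, but it also drops the error check. A possible replacement is sketched here without cuda.h so that it stands alone: it relies only on the driver API reporting success as 0 (CUDA_SUCCESS). The names cu_succeeded and CU_CHECK are made up, and in real code the int would be a CUresult:

```cpp
#include <cstdio>
#include <cstdlib>

// True when a driver API status code means success (CUDA_SUCCESS == 0).
constexpr bool cu_succeeded(int status) {
    return status == 0;
}

// Hypothetical cutil-free check: evaluate the call once, then report the
// failing expression with file and line and exit on any non-zero status.
#define CU_CHECK(call)                                                   \
    do {                                                                 \
        const int cu_status_ = static_cast<int>(call);                   \
        if (!cu_succeeded(cu_status_)) {                                 \
            std::fprintf(stderr, "%s failed with error %d at %s:%d\n",   \
                         #call, cu_status_, __FILE__, __LINE__);         \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)
```

With this, something like CU_CHECK(cuInit(0)) would stop at the first failing call, so an error such as CUDA_ERROR_POINTER_IS_64BIT (800) gets reported with its location instead of being silently ignored.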

export CUDA_BIN_PATH=/usr/local/cuda/bin
export CUDA_LIB_PATH=/usr/local/cuda/lib
export CUDA_INC_PATH=/usr/local/cuda/include
export PATH=$PATH:/usr/local/cuda/bin


