Sunday, March 7, 2010

GPU computing toys!

Hi, I would like to release some lame but hopefully useful tools:
https://dl.dropbox.com/u/1416327/cld3d.rar

First, OpenCL-D3D interop headers and specs for Nvidia and AMD, plus a tool for checking the current status of the interop.
The headers are in h and cover D3D 9, 10 and 11 for NV and D3D 9 and 10 for AMD.
#include the header for each D3D version you need and call initcld3d() in your code, and voila, you have the D3D interop functions.
If you #define INCAMD, the AMD functions are included as well and you can avoid the AMD headers.

With these I have compiled four exes named cl_xx_interop, which check D3D 9, 9Ex, 10 and 11.
They check extension reporting, try to create a shared context in several ways, then associate a D3D object and textures with OpenCL, and acquire and release them prior to use.

Also, the cl_d3d10_interop build shows the image formats available to OpenCL images; see the next post.

Testing OCL-D3D11 interop
Checking D3D interop extensions support for platform: NVIDIA Corporation
 nv D3D  9 interop extension:  Found.
 nv D3D 10 interop extension:  Found.
 nv D3D 11 interop extension:  Found.

Using device: GeForce GTX 275
Enabling texture interop checks: image support is supported.
clGetDeviceIDsFromD3D11NV pointer: Found
 and it works! (returns d3d associated ocl device)
clCreateFromD3D11BufferNV pointer: Found
clCreateFromD3D11Texture2DNV pointer: Found
clCreateFromD3D11Texture3DNV pointer: Found
clEnqueueAcquireD3D11ObjectsNV pointer: Found
clEnqueueReleaseD3D11ObjectsNV pointer: Found
Testing context creation with
 no dev (clCreateContextFromType): OK.
dev info (getdeviceids): OK.
dev info (clGetDeviceIDsFromD3DNV CL_PREFERRED_DEVICES_FOR_D3D9_NV): OK.
Testing clCreateFromD3D11BufferNV: OK.
Testing aquire release stuff: Ok.. releasing it: Ok.
Testing clCreateFromD3D11Texture2DNV: OK.
Testing aquire release stuff: Ok.. releasing it: Ok.
Testing clCreateFromD3D11Texture3DNV: OK.
Testing aquire release stuff: Ok.. releasing it: Ok.
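
The extension check at the top of that output comes down to looking for a name in the space-separated extension string the platform reports. Here is a minimal Python sketch of that test, using plain string handling rather than a real OpenCL call. The helper name has_extension is made up for illustration, and the sample strings are copied from the clinfo dump later in this post:

```python
# Extension strings as reported by the two platforms in the clinfo dump.
NVIDIA_EXTENSIONS = (
    "cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing "
    "cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_nv_d3d11_sharing "
    "cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll"
)
ATI_EXTENSIONS = "cl_khr_icd"


def has_extension(extensions: str, name: str) -> bool:
    """Hypothetical helper: exact match against the whitespace-separated list."""
    # Split first so that "cl_nv_d3d1" does not match "cl_nv_d3d10_sharing".
    return name in extensions.split()


for version in (9, 10, 11):
    name = f"cl_nv_d3d{version}_sharing"
    found = has_extension(NVIDIA_EXTENSIONS, name)
    print(f" nv D3D {version:2d} interop extension:  {'Found.' if found else 'Not found.'}")
```

Splitting the string before matching matters: a plain substring search would also accept partial names.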



The archive also contains optd3d, which displays the four optional D3D11 features (cap bits):

On my GTX 200 series card it displays:


multithreaded comand lists: 0
multithreaded Concurrent Creates: 1
Double precision: 0
Compute Shader: 1

On an ATI 5850 it displays:


multithreaded comand lists: 0
multithreaded Concurrent Creates: 1
Double precision: 1
Compute Shader: 1

Anyway, double precision is not working with loops.
This shows that multithreaded command lists are still not supported by ATI (is this supposed to be an implementation issue or a hardware limitation?).
The same goes for Nvidia and the upcoming Fermi.

I include a CLinfo (not mine) for checking CL info.

report.bat creates a report.txt with the output of all these executables.
I also include 2dbench for checking the GDI performance issues in Windows 7; AMD will fix them in Catalyst 10.4.

There is also a highly efficient matmul for CUDA and AMD cards, and a peak flops test (flopspeak.exe) for AMD cards.

%
%  compute C = A*B, A:mxk, B:kxn, C:mxn
%
%  cubin file = ../method1/decuda_ldsb32_cudasm.cubin
%  kernel function = method1_variant_sgemmNN
%  use device: GeForce GTX 275
%  m=n=k    gpu_time (ms)   flops (Gflops/s)
     32         0.044         1.391
    128         0.120        32.451
    224         0.194       107.870
    320         0.302       201.802
    416         0.445       301.033
    512         0.619       403.979
    608         1.277       327.914
    704         1.582       410.719
    800         2.618       364.210
    896         3.135       427.439
    992         4.401       413.123
   1088         6.014       398.868
   1184         6.981       442.860
   1280         8.751       446.365
   1376        10.911       444.746
   1472        13.403       443.262
   1568        16.377       438.470
   1664        18.901       454.051
   1760        22.437       452.594
   1856        25.820       461.218
   1952        31.233       443.566
   2048        33.317       480.229
   2144        39.834       460.841
   2240        44.989       465.337
   2336        51.643       459.765
   2432        56.514       474.095
   2528        64.183       468.859
   2624        72.540       463.923
   2720        79.686       470.387
   2816        85.826       484.626
   2912        96.003       479.094
   3008       108.801       465.942
   3104       121.579       458.181
   3200       126.446       482.699
   3296       138.522       481.473
   3392       153.544       473.440
   3488       168.797       468.268
   3584       177.873       482.085
   3680       193.298       480.227
   3776       212.160       472.675
   3872       229.596       470.947
   3968       246.403       472.280
   4064       260.086       480.699
Same test with the clock at 1620 MHz:
%  m=n=k    gpu_time (ms)   flops (Gflops/s)
     32         0.040         1.516
    128         0.108        36.044
    224         0.173       120.900
    320         0.265       229.925
    416         0.393       341.338
    512         0.535       467.090
    608         1.107       378.021
    704         1.371       474.163
    800         2.270       420.030
    896         2.751       486.983
    992         3.804       477.992
   1088         5.205       460.925
   1184         6.003       514.983
   1280         7.609       513.393
   1376         9.396       516.463
   1472        11.555       514.134
   1568        14.145       507.666
   1664        16.427       522.442
   1760        19.387       523.784
   1856        22.182       536.854
   1952        26.860       515.777
   2048        28.642       558.623
   2144        34.530       531.627
   2240        39.585       528.868
   2336        44.440       534.292
   2432        49.141       545.226
   2528        55.274       544.429
   2624        63.241       532.134
   2720        68.451       547.592
   2816        74.160       560.865
   2912        82.945       554.516
   3008        94.150       538.449
   3104       104.581       532.653
   3200       108.907       560.436
   3296       119.277       559.158
   3392       131.982       550.785
   3488       146.003       541.376
   3584       154.088       556.502
   3680       166.307       558.166
   3776       184.523       543.469
   3872       198.692       544.196
   3968       214.158       543.390
   4064       223.720       558.838
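
The flops column can be cross-checked from the other two, since an n x n x n multiply costs 2*n^3 floating-point operations. Below is a small Python sketch of that arithmetic. The function name is mine, and the divisor is an inference from the numbers, not something taken from the benchmark's source: the table values only match if a "Gflop" is counted as 2^30 operations rather than 10^9.

```python
def sgemm_gflops(n: int, gpu_time_ms: float, giga: float = 2.0**30) -> float:
    """Throughput of an n x n x n matrix multiply (2*n^3 flops) in Gflops/s."""
    flops = 2.0 * n**3
    return flops / (gpu_time_ms * 1e-3) / giga


# (n, gpu_time in ms, reported Gflops/s), rows copied from the two tables above.
rows = [
    (2048, 33.317, 480.229),   # first run
    (2048, 28.642, 558.623),   # second run (clock 1620)
    (4064, 223.720, 558.838),  # second run (clock 1620)
]
for n, ms, reported in rows:
    binary = sgemm_gflops(n, ms)             # 2^30 flops per Gflop
    decimal = sgemm_gflops(n, ms, giga=1e9)  # 10^9 flops per Gflop
    print(f"n={n}: reported {reported:.1f}, 2^30-based {binary:.1f}, 10^9-based {decimal:.1f}")
```

If that reading is right, the same rows expressed in decimal Gflops would be about 7% higher, for example roughly 600 instead of 558.6 for n=2048 in the second run.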

It's a cubin, so it will not work on Fermi.

ATI 5850 at stock clocks:

flopspeak.exe
Device            0
target            8
localRAM          1024 MB
uncachedRemoteRAM 2047 MB
cachedRemoteRAM   2047 MB
engineClock       725 MHz
memoryClock       1000 MHz
wavefrontSize     64
numberOfSIMD      18
doublePrecision   1
localDataShare    1
globalDataShare   1
globalGPR         1
computeShader     1
memExport         1
pitch_alignment   256
surface_alignment 4096
Device 0: execution time 7913.45 ms, achieved 2041.80 gflops
Overclocked to 950 MHz:

flopspeak.exe

engineClock       950 MHz
memoryClock       1000 MHz

Device 0: execution time 6039.35 ms, achieved 2675.40 gflops
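
To put those two results in context: a Cypress SIMD has 16 stream cores of 5 ALUs each, and each ALU can issue one multiply-add (2 flops) per clock, so the theoretical single-precision peak follows from numberOfSIMD and engineClock. This Python sketch does that arithmetic. The function name and the per-SIMD constants are my additions, not output of flopspeak.exe:

```python
ALUS_PER_SIMD = 16 * 5   # 16 stream cores x 5 ALUs (Cypress, assumed here)
FLOPS_PER_ALU_CLOCK = 2  # one multiply-add per clock


def peak_gflops(num_simd: int, engine_clock_mhz: float) -> float:
    """Theoretical single-precision peak in gflops (10^9 flops per second)."""
    return num_simd * ALUS_PER_SIMD * FLOPS_PER_ALU_CLOCK * engine_clock_mhz / 1000.0


# (engineClock in MHz, achieved gflops reported by flopspeak.exe)
for clock_mhz, achieved in [(725, 2041.80), (950, 2675.40)]:
    peak = peak_gflops(18, clock_mhz)
    print(f"{clock_mhz} MHz: peak {peak:.0f} gflops, achieved {achieved / peak:.1%}")
```

With 18 SIMDs this gives 2088 gflops at 725 MHz and 2736 gflops at 950 MHz, so both runs land at roughly 98% of the theoretical peak.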



matmul.exe 2048 2048 100

Device 0: execution time 1415.08 ms, achieved 1214.06 gflops
Overclocked to 950 MHz:
Device 0: execution time 1114.06 ms, achieved 1542.09 gflops
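
The matmul figures are consistent with reading the arguments as a 2048 x 2048 problem repeated 100 times. That is my interpretation of the command line, not something documented here. Under that assumption, and counting 10^9 flops per gflop, the arithmetic in Python is:

```python
def matmul_gflops(n: int, iterations: int, time_ms: float) -> float:
    """Gflops for `iterations` square n x n multiplies (2*n^3 flops each)."""
    total_flops = 2.0 * n**3 * iterations
    return total_flops / (time_ms * 1e-3) / 1e9


stock = matmul_gflops(2048, 100, 1415.08)  # reported: 1214.06 gflops
oc = matmul_gflops(2048, 100, 1114.06)     # reported: 1542.09 gflops
print(f"stock {stock:.2f} gflops, overclocked {oc:.2f} gflops, speedup {oc / stock:.2f}x")
```

The speedup works out to about 1.27x against a 1.31x engine clock increase (725 to 950 MHz). The memory clock stayed at 1000 MHz in both runs, which may be part of the reason.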

UPDATE 1:
Nvidia and ATI working together! This uses the opencl.dll from the ATI Stream SDK 2.0.1:

Found 2 platform(s).
platform[01104BA0]: profile: FULL_PROFILE
platform[01104BA0]: version: OpenCL 1.0 CUDA 3.0.1
platform[01104BA0]: name: NVIDIA CUDA
platform[01104BA0]: vendor: NVIDIA Corporation
platform[01104BA0]: extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
platform[01104BA0]: Found 1 device(s).
        device[01104C08]: NAME: GeForce GTX 275
        device[01104C08]: VENDOR: NVIDIA Corporation
        device[01104C08]: PROFILE: FULL_PROFILE
        device[01104C08]: VERSION: OpenCL 1.0 CUDA
        device[01104C08]: EXTENSIONS: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64
        device[01104C08]: DRIVER_VERSION: 196.75

        device[01104C08]: Type: GPU
        device[01104C08]: EXECUTION_CAPABILITIES: Kernel
        device[01104C08]: GLOBAL_MEM_CACHE_TYPE: None (0)
        device[01104C08]: CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
        device[01104C08]: SINGLE_FP_CONFIG: 0x3e
        device[01104C08]: QUEUE_PROPERTIES: 0x3

        device[01104C08]: VENDOR_ID: 4318
        device[01104C08]: MAX_COMPUTE_UNITS: 30
        device[01104C08]: MAX_WORK_ITEM_DIMENSIONS: 3
        device[01104C08]: MAX_WORK_GROUP_SIZE: 512
        device[01104C08]: PREFERRED_VECTOR_WIDTH_CHAR: 1
        device[01104C08]: PREFERRED_VECTOR_WIDTH_SHORT: 1
        device[01104C08]: PREFERRED_VECTOR_WIDTH_INT: 1
        device[01104C08]: PREFERRED_VECTOR_WIDTH_LONG: 1
        device[01104C08]: PREFERRED_VECTOR_WIDTH_FLOAT: 1
        device[01104C08]: PREFERRED_VECTOR_WIDTH_DOUBLE: 1
        device[01104C08]: MAX_CLOCK_FREQUENCY: 1404
        device[01104C08]: ADDRESS_BITS: 32
        device[01104C08]: MAX_MEM_ALLOC_SIZE: 229998592
        device[01104C08]: IMAGE_SUPPORT: 1
        device[01104C08]: MAX_READ_IMAGE_ARGS: 128
        device[01104C08]: MAX_WRITE_IMAGE_ARGS: 8
        device[01104C08]: IMAGE2D_MAX_WIDTH: 8192
        device[01104C08]: IMAGE2D_MAX_HEIGHT: 8192
        device[01104C08]: IMAGE3D_MAX_WIDTH: 2048
        device[01104C08]: IMAGE3D_MAX_HEIGHT: 2048
        device[01104C08]: IMAGE3D_MAX_DEPTH: 2048
        device[01104C08]: MAX_SAMPLERS: 16
        device[01104C08]: MAX_PARAMETER_SIZE: 4352
        device[01104C08]: MEM_BASE_ADDR_ALIGN: 256
        device[01104C08]: MIN_DATA_TYPE_ALIGN_SIZE: 16
        device[01104C08]: GLOBAL_MEM_CACHELINE_SIZE: 0
        device[01104C08]: GLOBAL_MEM_CACHE_SIZE: 0
        device[01104C08]: GLOBAL_MEM_SIZE: 919994368
        device[01104C08]: MAX_CONSTANT_BUFFER_SIZE: 65536
        device[01104C08]: MAX_CONSTANT_ARGS: 9
        device[01104C08]: LOCAL_MEM_SIZE: 16384
        device[01104C08]: ERROR_CORRECTION_SUPPORT: 0
        device[01104C08]: PROFILING_TIMER_RESOLUTION: 1000
        device[01104C08]: ENDIAN_LITTLE: 1
        device[01104C08]: AVAILABLE: 1
        device[01104C08]: COMPILER_AVAILABLE: 1
platform[0313A434]: profile: FULL_PROFILE
platform[0313A434]: version: OpenCL 1.0 ATI-Stream-v2.0.1
platform[0313A434]: name: ATI Stream
platform[0313A434]: vendor: Advanced Micro Devices, Inc.
platform[0313A434]: extensions: cl_khr_icd
platform[0313A434]: Found 2 device(s).
        device[0338CA70]: NAME: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
        device[0338CA70]: VENDOR: GenuineIntel
        device[0338CA70]: PROFILE: FULL_PROFILE
        device[0338CA70]: VERSION: OpenCL 1.0 ATI-Stream-v2.0.1
        device[0338CA70]: EXTENSIONS: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store
        device[0338CA70]: DRIVER_VERSION: 1.0

        device[0338CA70]: Type: CPU
        device[0338CA70]: EXECUTION_CAPABILITIES: Kernel
        device[0338CA70]: GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
        device[0338CA70]: CL_DEVICE_LOCAL_MEM_TYPE: Global (2)
        device[0338CA70]: SINGLE_FP_CONFIG: 0x7
        device[0338CA70]: QUEUE_PROPERTIES: 0x2

        device[0338CA70]: VENDOR_ID: 4098
        device[0338CA70]: MAX_COMPUTE_UNITS: 8
        device[0338CA70]: MAX_WORK_ITEM_DIMENSIONS: 3
        device[0338CA70]: MAX_WORK_GROUP_SIZE: 1024
        device[0338CA70]: PREFERRED_VECTOR_WIDTH_CHAR: 16
        device[0338CA70]: PREFERRED_VECTOR_WIDTH_SHORT: 8
        device[0338CA70]: PREFERRED_VECTOR_WIDTH_INT: 4
        device[0338CA70]: PREFERRED_VECTOR_WIDTH_LONG: 2
        device[0338CA70]: PREFERRED_VECTOR_WIDTH_FLOAT: 4
        device[0338CA70]: PREFERRED_VECTOR_WIDTH_DOUBLE: 0
        device[0338CA70]: MAX_CLOCK_FREQUENCY: 2698
        device[0338CA70]: ADDRESS_BITS: 32
        device[0338CA70]: MAX_MEM_ALLOC_SIZE: 536870912
        device[0338CA70]: IMAGE_SUPPORT: 0
        device[0338CA70]: MAX_READ_IMAGE_ARGS: 0
        device[0338CA70]: MAX_WRITE_IMAGE_ARGS: 0
        device[0338CA70]: IMAGE2D_MAX_WIDTH: 0
        device[0338CA70]: IMAGE2D_MAX_HEIGHT: 0
        device[0338CA70]: IMAGE3D_MAX_WIDTH: 0
        device[0338CA70]: IMAGE3D_MAX_HEIGHT: 0
        device[0338CA70]: IMAGE3D_MAX_DEPTH: 0
        device[0338CA70]: MAX_SAMPLERS: 0
        device[0338CA70]: MAX_PARAMETER_SIZE: 4096
        device[0338CA70]: MEM_BASE_ADDR_ALIGN: 32768
        device[0338CA70]: MIN_DATA_TYPE_ALIGN_SIZE: 128
        device[0338CA70]: GLOBAL_MEM_CACHELINE_SIZE: 64
        device[0338CA70]: GLOBAL_MEM_CACHE_SIZE: 65536
        device[0338CA70]: GLOBAL_MEM_SIZE: 1073741824
        device[0338CA70]: MAX_CONSTANT_BUFFER_SIZE: 65536
        device[0338CA70]: MAX_CONSTANT_ARGS: 8
        device[0338CA70]: LOCAL_MEM_SIZE: 32768
        device[0338CA70]: ERROR_CORRECTION_SUPPORT: 0
        device[0338CA70]: PROFILING_TIMER_RESOLUTION: 1
        device[0338CA70]: ENDIAN_LITTLE: 1
        device[0338CA70]: AVAILABLE: 1
        device[0338CA70]: COMPILER_AVAILABLE: 1
        device[04A30050]: NAME: Cypress
        device[04A30050]: VENDOR: Advanced Micro Devices, Inc.
        device[04A30050]: PROFILE: FULL_PROFILE
        device[04A30050]: VERSION: OpenCL 1.0 ATI-Stream-v2.0.1
        device[04A30050]: EXTENSIONS: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics
        device[04A30050]: DRIVER_VERSION: CAL 1.4.556

        device[04A30050]: Type: GPU
        device[04A30050]: EXECUTION_CAPABILITIES: Kernel
        device[04A30050]: GLOBAL_MEM_CACHE_TYPE: None (0)
        device[04A30050]: CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
        device[04A30050]: SINGLE_FP_CONFIG: 0x6
        device[04A30050]: QUEUE_PROPERTIES: 0x2

        device[04A30050]: VENDOR_ID: 4098
        device[04A30050]: MAX_COMPUTE_UNITS: 18
        device[04A30050]: MAX_WORK_ITEM_DIMENSIONS: 3
        device[04A30050]: MAX_WORK_GROUP_SIZE: 256
        device[04A30050]: PREFERRED_VECTOR_WIDTH_CHAR: 16
        device[04A30050]: PREFERRED_VECTOR_WIDTH_SHORT: 8
        device[04A30050]: PREFERRED_VECTOR_WIDTH_INT: 4
        device[04A30050]: PREFERRED_VECTOR_WIDTH_LONG: 2
        device[04A30050]: PREFERRED_VECTOR_WIDTH_FLOAT: 4
        device[04A30050]: PREFERRED_VECTOR_WIDTH_DOUBLE: 0
        device[04A30050]: MAX_CLOCK_FREQUENCY: 725
        device[04A30050]: ADDRESS_BITS: 32
        device[04A30050]: MAX_MEM_ALLOC_SIZE: 268435456
        device[04A30050]: IMAGE_SUPPORT: 0
        device[04A30050]: MAX_READ_IMAGE_ARGS: 0
        device[04A30050]: MAX_WRITE_IMAGE_ARGS: 0
        device[04A30050]: IMAGE2D_MAX_WIDTH: 0
        device[04A30050]: IMAGE2D_MAX_HEIGHT: 0
        device[04A30050]: IMAGE3D_MAX_WIDTH: 0
        device[04A30050]: IMAGE3D_MAX_HEIGHT: 0
        device[04A30050]: IMAGE3D_MAX_DEPTH: 0
        device[04A30050]: MAX_SAMPLERS: 0
        device[04A30050]: MAX_PARAMETER_SIZE: 1024
        device[04A30050]: MEM_BASE_ADDR_ALIGN: 4096
        device[04A30050]: MIN_DATA_TYPE_ALIGN_SIZE: 128
        device[04A30050]: GLOBAL_MEM_CACHELINE_SIZE: 0
        device[04A30050]: GLOBAL_MEM_CACHE_SIZE: 0
        device[04A30050]: GLOBAL_MEM_SIZE: 268435456
        device[04A30050]: MAX_CONSTANT_BUFFER_SIZE: 65536
        device[04A30050]: MAX_CONSTANT_ARGS: 8
        device[04A30050]: LOCAL_MEM_SIZE: 32768
        device[04A30050]: ERROR_CORRECTION_SUPPORT: 0
        device[04A30050]: PROFILING_TIMER_RESOLUTION: 1
        device[04A30050]: ENDIAN_LITTLE: 1
        device[04A30050]: AVAILABLE: 1
        device[04A30050]: COMPILER_AVAILABLE: 1
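
A few of the raw numbers in that dump are easier to read once decoded. VENDOR_ID is a PCI vendor ID, and SINGLE_FP_CONFIG and QUEUE_PROPERTIES are bitfields whose bit values come from the OpenCL 1.0 header (cl.h). Here is a Python sketch; the decode helper is hypothetical, not part of clinfo:

```python
# Bit values as defined in the OpenCL 1.0 cl.h header.
FP_CONFIG_BITS = {
    0x01: "CL_FP_DENORM",
    0x02: "CL_FP_INF_NAN",
    0x04: "CL_FP_ROUND_TO_NEAREST",
    0x08: "CL_FP_ROUND_TO_ZERO",
    0x10: "CL_FP_ROUND_TO_INF",
    0x20: "CL_FP_FMA",
}
QUEUE_BITS = {
    0x1: "CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE",
    0x2: "CL_QUEUE_PROFILING_ENABLE",
}


def decode(value: int, bits: dict) -> list:
    """Hypothetical helper: names of the flags set in a bitfield."""
    return [name for bit, name in bits.items() if value & bit]


# (device, VENDOR_ID, SINGLE_FP_CONFIG, QUEUE_PROPERTIES) from the dump above.
devices = [
    ("GeForce GTX 275", 4318, 0x3E, 0x3),
    ("Core i7 920", 4098, 0x07, 0x2),
    ("Cypress", 4098, 0x06, 0x2),
]
for name, vendor_id, fp_config, queue_props in devices:
    print(f"{name}: vendor 0x{vendor_id:04X}")
    print("  fp:   ", ", ".join(decode(fp_config, FP_CONFIG_BITS)))
    print("  queue:", ", ".join(decode(queue_props, QUEUE_BITS)))
```

4318 is 0x10DE (NVIDIA) and 4098 is 0x1002 (ATI/AMD); the CPU device reports the latter because it is exposed through the ATI platform. Read this way, the GTX 275 reports every single-precision flag except denormals, and it is the only device of the three advertising out-of-order queues.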

UPDATE 2:
DX formats are now included in optd3d.
