CUDA
A technology for computation on Nvidia graphics processors
Introduction
Hardware
See the Nvidia Tesla article on Wikipedia.
Technical specifications

Tesla A30 on THEOR4 (micro-architecture: Ampere GA100)
- Peak FP64: 5.2 TF
- Peak FP64 Tensor Core: 10.3 TF
- Peak FP32: 10.3 TF
- Peak TF32 Tensor Core: 82 TF | 165 TF*
- Peak BFLOAT16 Tensor Core: 165 TF | 330 TF*
- Peak FP16 Tensor Core: 165 TF | 330 TF*
- Peak INT8 Tensor Core: 330 TOPS | 661 TOPS*
- Peak INT4 Tensor Core: 661 TOPS | 1321 TOPS*
- Media engines:
  1 optical flow accelerator (OFA)
  1 JPEG decoder (NVJPEG)
  4 video decoders (NVDEC)
- GPU memory: 24 GB HBM2
- GPU memory bandwidth: 933 GB/s
- Interconnect: PCIe Gen4 64 GB/s; third-gen NVIDIA® NVLink® 200 GB/s**
- Form factor: 2-slot, full height, full length (FHFL)
- Max thermal design power (TDP): 165 W
- Multi-Instance GPU (MIG):
  4 MIG instances @ 6 GB each
  2 MIG instances @ 12 GB each
  1 MIG instance @ 24 GB
- Virtual GPU (vGPU) software support: NVIDIA AI Enterprise, NVIDIA Virtual Compute Server
(* with sparsity; ** NVLink bridge for two GPUs)
Tesla P100 on THEOR3 (micro-architecture: Pascal GP100)
RTX A2000 on i9a (micro-architecture: Ampere GA106)
- Form factor: PCIe x16
- Number of CUDA cores: 3328
- Number of Tensor Cores: 104
- CUDA compute capability: 8.6
- CUDA core frequency: up to 1.2 GHz
- Peak double-precision (FP64) performance: 249 Gflops
- Peak single-precision (FP32) performance: 7.987 Tflops
- Total dedicated memory: 12 GB GDDR6*
- Memory speed: 1.5 GHz
- Memory interface: 192-bit
- Memory bandwidth: 288 GB/s
- Power consumption: 70 W TDP
- System interface: PCIe x16
Tesla C2075 on Theor2 (micro-architecture: Fermi GF100)
- Form factor: 9.75 in., PCIe x16
- Number of CUDA cores: 448
- CUDA core frequency: 1.15 GHz
- Peak double-precision (FP64) performance: 515 Gflops
- Peak single-precision (FP32) performance: 1.03 Tflops
- Total dedicated memory: 6 GB GDDR5*
- Memory speed: 1.5 GHz
- Memory interface: 384-bit
- Memory bandwidth: 144 GB/s
- Power consumption: 225 W TDP
- System interface: PCIe x16 Gen2
- Thermal solution: active fansink
- Display support: one dual-link DVI-I; maximum display resolution 1600x1200
Software
NVIDIA CUDA Toolkit
- CUDA-12 on THEOR3 (see the directory /usr/local/cuda-12/)
- CUDA-13 on THEOR4 (see the directory /usr/local/cuda-13/)
- CUDA-13 and CUDA-12 on i9A (see the directory /usr/local/cuda-13/ or -12/)
Note: the Tesla C2075 GPU is not supported by current software.
Performance

Performance is estimated by timing a naive matrix multiplication:

    for (i = 0; i < MatrixSize; i++)
        for (j = 0; j < MatrixSize; j++)
            for (k = 0; k < MatrixSize; k++)
                C[j][i] += A[j][k] * B[k][i];
GFlops = 2 * MatrixSize^3 / 10^9 / ExecutionTime
An example with Maple
theor2:> maple test_cuda.mpl
|\^/| Maple 16 (X86 64 LINUX)
._|\| |/|_. Copyright (c) Maplesoft, a division of Waterloo Maple Inc. 2012
\ MAPLE / All rights reserved. Maple is a trademark of
<____ ____> Waterloo Maple Inc.
| Type ? for help.
> CUDA:-IsEnabled();
false
> CUDA:-Enable(true);
false
> CUDA:-IsEnabled();
true
>
> CUDA:-HasDoubleSupport();
table([0 = true])
>
> with(LinearAlgebra):
> M:=RandomMatrix(4000,outputoptions=[datatype=float[4]]);
[ 4000 x 4000 Matrix ]
M := [ Data Type: float[4] ]
[ Storage: rectangular ]
[ Order: Fortran_order ]
> N:=RandomMatrix(4000,outputoptions=[datatype=float[4]]);
memory used=124.1MB, alloc=126.0MB, time=0.88
[ 4000 x 4000 Matrix ]
N := [ Data Type: float[4] ]
[ Storage: rectangular ]
[ Order: Fortran_order ]
>
> time[real](MatrixMatrixMultiply(M,N));
memory used=185.2MB, alloc=187.1MB, time=0.92
0.617
> CUDA:-Enable(false);
true
> time[real](MatrixMatrixMultiply(M,N));
5.623
>
>
> CUDA:-Enable(true);
false
> M:=RandomMatrix(4000,outputoptions=[datatype=float[8]]);
memory used=368.4MB, alloc=248.1MB, time=7.48
[ 4000 x 4000 Matrix ]
M := [ Data Type: float[8] ]
[ Storage: rectangular ]
[ Order: Fortran_order ]
> N:=RandomMatrix(4000,outputoptions=[datatype=float[8]]);
memory used=490.6MB, alloc=370.2MB, time=7.88
[ 4000 x 4000 Matrix ]
N := [ Data Type: float[8] ]
[ Storage: rectangular ]
[ Order: Fortran_order ]
>
> time[real](MatrixMatrixMultiply(M,N));
1.640
>
> CUDA:-Enable(false);
true
>
> time[real](MatrixMatrixMultiply(M,N));
10.614
>
> CUDA:-Properties();
[table(["Max Threads Dimensions" = [1024, 1024, 64], "Clock Rate" = 1147000,
"Max Grid Size" = [65535, 65535, 65535], "Memory Pitch" = 2147483647,
"Max Threads Per Block" = 1024, "Warp Size" = 32,
"Kernel Exec Timeout Enabled" = false, "Registers Per Block" = 32768,
"ID" = 0, "Texture Alignment" = 512, "Minor" = 0,
"MultiProcessor Count" = 14, "Shared Memory Per Block" = 49152,
"Total Global Memory" = 4294967295, "Major" = 2, "Name" = "Tesla C2075",
"Total Constant Memory" = 65536,
"Device Overlap" = 1
])]
> quit
memory used=734.8MB, alloc=614.3MB, time=20.10
Information sources

LTP Computer Group
February 20, 2013
e-mail: super@theor.jinr.ru, telepuzik@theor.jinr.ru
e-mail: yoda@theor.jinr.ru, godzilla@theor.jinr.ru
Last updated: 2025-11-27 21:25:34