
Deep Learning Environment Setup (Ubuntu 18.04 + CUDA 10.0 + cuDNN 7.6.5 + TensorFlow 2.0)


@ Bergen, Norway

My first attempt at installing CUDA was maddening: all sorts of inexplicable bugs and pitfalls along the way. For example, after installing CUDA and rebooting I could no longer get into the desktop (just a black screen); the mouse and keyboard stopped working; or CUDA installed fine but TensorFlow-GPU refused to install... After going through the tutorials floating around online, I found the official guide is still the best.

For my first post of the new year, I'm sharing my notes on installing Ubuntu 18.04 + CUDA 10.0 + cuDNN 7.6.5 + TensorFlow 2.0, in the hope that they help you avoid some of these pitfalls.

The overall installation flow is roughly: install the graphics driver -> install CUDA[1] -> install cuDNN[2] -> install tensorflow-gpu and test it.

Contents:

Ubuntu installation and updates

Installing the graphics driver

Installing CUDA

Installing cuDNN

Installing TensorFlow 2.0 GPU and testing

1. Ubuntu installation and updates

Start with a basic installation and update of Ubuntu 18.04. The OS installation itself is straightforward and omitted here; there are plenty of tutorials online.

sudo apt-get update    # refresh the package lists
sudo apt-get upgrade   # upgrade installed packages
sudo apt-get install vim

2. Installing the graphics driver

2.1 Disabling the Nouveau driver

Note: there are two ways to install CUDA on Linux: Package Manager Installation (.deb) and Runfile Installation (.run). This article uses the first, which is also the officially recommended approach. If you install CUDA from the deb package you can skip this step (tested and working on my machine). If you install CUDA from the runfile, you need to disable the Nouveau driver that ships with the system manually:

lsmod | grep nouveau  # check whether Nouveau is loaded; after the steps below this should produce no output

vim /etc/modprobe.d/blacklist-nouveau.conf
# add the following two lines:
#######################################################
blacklist nouveau
options nouveau modeset=0
#######################################################
# save, then regenerate the initramfs and reboot:
sudo update-initramfs -u
sudo reboot
# run the command again; no output means the setting took effect
lsmod | grep nouveau

2.2 Installing a suitable graphics driver[3]

# first remove any existing NVIDIA driver and its dependencies, then reboot
sudo apt-get remove --purge nvidia*
sudo apt autoremove
sudo reboot

# add the PPA and install the latest driver
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
ubuntu-drivers devices
sudo apt install nvidia-driver-440
# to avoid compatibility problems caused by automatic driver updates, pin the driver version:
sudo apt-mark hold nvidia-driver-440
# nvidia-driver-440 set on hold.

In the [Software & Updates] menu, under Additional Drivers, you should now find the newly installed nvidia-driver-440; select it. After rebooting with sudo reboot, run nvidia-smi; if it prints the usual driver/GPU status table, the graphics driver is ready:

lsmod | grep nvidia  # output here means the driver is installed; no output indicates a problem

Alternatively, you can download the matching installer from the official website and install the driver manually[4]; see the sketch below.
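
A minimal sketch of the manual route, assuming the downloaded file is named NVIDIA-Linux-x86_64-440.82.run (a hypothetical filename; use whatever version you actually downloaded). The Nouveau-blacklist step above still applies:

# hypothetical filename; replace with the .run installer you actually downloaded
chmod +x NVIDIA-Linux-x86_64-440.82.run
# run from a text console (Ctrl+Alt+F3) with the display manager stopped and Nouveau disabled
sudo ./NVIDIA-Linux-x86_64-440.82.run
sudo reboot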

# to monitor GPU usage in real time:
watch -n 1 nvidia-smi      # refresh every 1 second
watch -n 0.01 nvidia-smi   # or refresh every 0.01 s
# gpustat works as well
pip install gpustat
gpustat -i 1 -P

3. Installing CUDA

From Baidu Baike: CUDA (Compute Unified Device Architecture) is a computing platform launched by the GPU vendor NVIDIA[5]. CUDA is a general-purpose parallel computing[6] architecture introduced by NVIDIA that enables GPUs[7] to solve complex computational problems.

On Linux there are two ways to install CUDA: Package Manager Installation (.deb) and Runfile Installation (.run). This article uses the first, which is also the officially recommended one.

In addition, CUDA has strict requirements on the system environment (supported distribution, kernel, gcc and glibc versions). The exact requirements for CUDA 10.0, and for other versions, are listed in the corresponding Online Documentation[8].

3.1 Pre-installation preparation

Before installing CUDA, make sure the environment is ready, so you don't end up with a pile of confusing bugs and nowhere to start. Quoting the official documentation:

Some actions must be taken before the CUDA Toolkit and Driver can be installed on Linux:

Verify the system has a CUDA-capable GPU.

Verify the system is running a supported version of Linux.

Verify the system has gcc installed.

Verify the system has the correct kernel headers and development packages installed.

Download the NVIDIA CUDA Toolkit.

Handle conflicting installation methods.

3.1.1 Verify that you have a CUDA-capable GPU

lspci | grep -i nvidia | grep VGA

3.1.2 Verify your Linux version

uname -m && cat /etc/*release
uname -a
# The x86_64 line indicates you are running on a 64-bit system.

3.1.3 Verify the gcc version

gcc --version
# gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
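
If gcc turns out not to be installed, a minimal way to get it (my addition, not from the original) is the build-essential meta-package:

# install gcc (plus g++ and make) if it is missing
sudo apt-get install build-essential
gcc --version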

3.1.4 Install the kernel headers for your kernel version

Check the kernel version:

uname -r
# 5.0.0-37-generic

This is the version of the kernel headers and development packages that must be installed prior to installing the CUDA Drivers.

Install the headers matching that kernel version:

sudo apt-get install linux-headers-$(uname -r)

3.1.5 Choose an installation method

Download the corresponding installation package (using the officially recommended Deb package method as the example)[9].

The CUDA Toolkit can be installed using either of two different installation mechanisms: distribution-specific packages (RPM and Deb packages), or a distribution-independent package (runfile packages).

(1) The distribution-independent package has the advantage of working across a wider set of Linux distributions, but does not update the distribution's native package management system.

(2) The distribution-specific packages interface with the distribution's native package management system. It is recommended to use the distribution-specific packages, where possible.

3.1.6 Completely remove previously installed related packages to avoid conflicts

If this is a fresh Ubuntu install, you can skip this part and go straight to section 3.2.

If a previous version was installed with RPM/Deb packages on Ubuntu:

sudo apt-get --purge remove <package_name>
sudo apt autoremove
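
For example, a hypothetical cleanup of an old deb-based CUDA install (check dpkg for the package names that actually exist on your machine):

dpkg -l | grep cuda                            # list installed CUDA packages
sudo apt-get --purge remove cuda-toolkit-10-0  # hypothetical package name; substitute your own
sudo apt autoremove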

If it was installed with a runfile:

sudo /usr/bin/nvidia-uninstall
sudo /usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl

3.2 Installation

First make sure the matching .deb file has been downloaded, then run:

sudo dpkg -i cuda-repo-ubuntu1804-10-0-local-10.0.130-410.48_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-<version>/7fa2af80.pub
# use the key path printed by the first command; in my case it was:
# sudo apt-key add /var/cuda-repo-10-0-local-10.0.130-410.48/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda-toolkit-10-0
# note: install cuda-toolkit-10-0 rather than cuda, since the driver was already installed in step 2

3.3 Post-installation

After installation, some manual configuration is needed before CUDA works properly.

export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
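
The export above only lasts for the current shell. To make it permanent, append it to ~/.bashrc (a small addition of mine; the LD_LIBRARY_PATH line is mainly needed if CUDA libraries are not found at runtime, e.g. after a runfile install):

echo 'export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
source ~/.bashrc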

nvcc -V  # check whether CUDA was installed successfully
# OUTPUT:
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) - NVIDIA Corporation
# Built on Sat_Aug_25_21:08:01_CDT_
# Cuda compilation tools, release 10.0, V10.0.130

It's best to disable automatic system updates, so that a working environment doesn't suddenly break:

sudo vi /etc/apt/apt.conf.d/10periodic
# change the contents to:
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Download-Upgradeable-Packages "0";
APT::Periodic::AutocleanInterval "0";

This can also be done from the desktop: System Settings => Software & Updates => Updates
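
Along the same lines, you can pin the CUDA and driver packages with apt-mark, just as the driver was pinned in section 2.2 (my addition; the package names match what was installed above):

# keep apt from silently upgrading the toolkit or driver
sudo apt-mark hold cuda-toolkit-10-0 nvidia-driver-440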

4. Installing cuDNN[10]

NVIDIA cuDNN is a GPU-accelerated library for deep neural networks. First, register and download the cuDNN package matching your CUDA version: link[11].

For CUDA 10.0, for example, I downloaded cudnn-10.0-linux-x64-v7.6.5.32.tgz. Extract it and copy the files into the CUDA directory:

tar -zxvf cudnn-10.0-linux-x64-v7.6.5.32.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

Verify that the installation succeeded:

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
# output:
"""
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
"""

Installing from the Debian packages is actually the more recommended route, because the bundled samples can be used to verify that cuDNN was installed successfully. First download the following three files:

# download the three .deb files, then install each of them:
sudo dpkg -i libcudnn7_7.6.5.32-1+cuda10.0_amd64.deb
sudo dpkg -i libcudnn7-dev_7.6.5.32-1+cuda10.0_amd64.deb
sudo dpkg -i libcudnn7-doc_7.6.5.32-1+cuda10.0_amd64.deb
# after installation, verify with the bundled sample:
cp -r /usr/src/cudnn_samples_v7/ $HOME
cd $HOME/cudnn_samples_v7/mnistCUDNN
make clean && make
./mnistCUDNN
# Test passed!

Alternatively, cudatoolkit and cuDNN can be installed with conda, as long as the NVIDIA driver is already in place.

conda install cudatoolkit=10.0
conda install -c anaconda cudnn
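
To see which versions conda actually pulled in (an extra check, not in the original), list them in the active environment:

# show the cudatoolkit / cudnn versions installed in the current conda environment
conda list | grep -E "cudatoolkit|cudnn"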

5. Installing TensorFlow 2.0 GPU and testing

# install conda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda create -y -n tf2 python=3.7
conda activate tf2
pip install --upgrade pip
pip config set global.index-url https://pypi.tuna./simple
pip install tensorflow-gpu==2.0.0  # pinned to 2.0.x, which is built against CUDA 10.0
pip install catboost

Test:

import tensorflow as tf

print(tf.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
"""
2.0.0
Num GPUs Available: 2
"""
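
As an additional sanity check (my own addition), enable device-placement logging and run a tiny matrix multiplication; the log should show the MatMul op being executed on a GPU device:

# expect a log line similar to "Executing op MatMul in device .../device:GPU:0"
python -c "import tensorflow as tf; tf.debugging.set_log_device_placement(True); print(tf.matmul(tf.random.normal([2, 2]), tf.random.normal([2, 2])))"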

"""测试程序:源链接:/dragen1860/TensorFlow-2.x-Tutorials/blob/master/08-ResNet/main.py"""import osos.environ["CUDA_VISIBLE_DEVICES"] = "1" # os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"import tensorflow as tfimport numpy as npfrom tensorflow import kerastf.random.set_seed(22)np.random.seed(22)os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'assert tf.__version__.startswith('2.')(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()x_train, x_test = x_train.astype(np.float32) / 255., x_test.astype(np.float32) / 255.# [b, 28, 28] => [b, 28, 28, 1]x_train, x_test = np.expand_dims(x_train, axis=3), np.expand_dims(x_test,axis=3)# one hot encode the labels. convert back to numpy as we cannot use a combination of numpy# and tensors as input to kerasy_train_ohe = tf.one_hot(y_train, depth=10).numpy()y_test_ohe = tf.one_hot(y_test, depth=10).numpy()print(x_train.shape, y_train.shape)print(x_test.shape, y_test.shape)# 3x3 convolutiondef conv3x3(channels, stride=1, kernel=(3, 3)):return keras.layers.Conv2D(channels,kernel,strides=stride,padding='same',use_bias=False,kernel_initializer=tf.random_normal_initializer())class ResnetBlock(keras.Model):def __init__(self, channels, strides=1, residual_path=False):super(ResnetBlock, self).__init__()self.channels = channelsself.strides = stridesself.residual_path = residual_pathself.conv1 = conv3x3(channels, strides)self.bn1 = keras.layers.BatchNormalization()self.conv2 = conv3x3(channels)self.bn2 = keras.layers.BatchNormalization()if residual_path:self.down_conv = conv3x3(channels, strides, kernel=(1, 1))self.down_bn = tf.keras.layers.BatchNormalization()def call(self, inputs, training=None):residual = inputsx = self.bn1(inputs, training=training)x = tf.nn.relu(x)x = self.conv1(x)x = self.bn2(x, training=training)x = tf.nn.relu(x)x = self.conv2(x)# this module can be added into self.# however, module in for can not be added.if self.residual_path:residual = self.down_bn(inputs, training=training)residual = tf.nn.relu(residual)residual = self.down_conv(residual)x = x + residualreturn xclass ResNet(keras.Model):def __init__(self, block_list, num_classes, initial_filters=16, **kwargs):super(ResNet, self).__init__(**kwargs)self.num_blocks = len(block_list)self.block_list = block_listself.in_channels = initial_filtersself.out_channels = initial_filtersself.conv_initial = conv3x3(self.out_channels)self.blocks = keras.models.Sequential(name='dynamic-blocks')# build all the blocksfor block_id in range(len(block_list)):for layer_id in range(block_list[block_id]):if block_id != 0 and layer_id == 0:block = ResnetBlock(self.out_channels,strides=2,residual_path=True)else:if self.in_channels != self.out_channels:residual_path = Trueelse:residual_path = Falseblock = ResnetBlock(self.out_channels,residual_path=residual_path)self.in_channels = self.out_channelsself.blocks.add(block)self.out_channels *= 2self.final_bn = keras.layers.BatchNormalization()self.avg_pool = keras.layers.GlobalAveragePooling2D()self.fc = keras.layers.Dense(num_classes)def call(self, inputs, training=None):out = self.conv_initial(inputs)out = self.blocks(out, training=training)out = self.final_bn(out, training=training)out = tf.nn.relu(out)out = self.avg_pool(out)out = self.fc(out)return outdef main():num_classes = 10batch_size = 128epochs = 2# build model and optimizermodel = ResNet([2, 2, 2], num_classes)pile(optimizer=keras.optimizers.Adam(0.001),loss=keras.losses.CategoricalCrossentropy(from_logits=True),metrics=['accuracy'])model.build(input_shape=(None, 28, 28, 1))print("Number of variables 
in the model :", len(model.variables))model.summary()# trainmodel.fit(x_train,y_train_ohe,batch_size=batch_size,epochs=epochs,validation_data=(x_test, y_test_ohe),verbose=1)# evaluate on test setscores = model.evaluate(x_test, y_test_ohe, batch_size, verbose=1)print("Final test loss and accuracy :", scores)if __name__ == '__main__':main()

Monitor GPU usage:

watch -n 0.01 nvidia-smi

Test that CatBoost uses the GPU:

from catboost.datasets import titanic
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

train_df, test_df = titanic()
null_value_stats = train_df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)

X = train_df.drop('Survived', axis=1)
y = train_df.Survived

X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)
X_test = test_df

categorical_features_indices = np.where(X.dtypes != np.float)[0]

model = CatBoostClassifier(task_type="GPU",
                           custom_metric=['Accuracy'],
                           random_seed=666,
                           logging_level='Silent')
model.fit(X_train, y_train,
          cat_features=categorical_features_indices,
          eval_set=(X_validation, y_validation),
          logging_level='Verbose',  # you can comment this for no text output
          plot=True)

Monitor GPU usage:

watch -n 0.01 nvidia-smi

REFERENCE

[1] Installing CUDA: /cuda-toolkit-archive
[2] Installing cuDNN: /rdp/cudnn-download
[3] Installing a suitable graphics driver: /home/how-to-install-latest-nvidia-drivers-in-linux
[4] Manually downloading the driver installer from the official website: /drivers
[5] NVIDIA: /item/NVIDIA
[6] Parallel computing: /item/并行计算/113443
[7] GPU: /item/GPU
[8] Online Documentation: /cuda-toolkit-archive
[9] Downloading the installer (officially recommended Deb package method): /cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal
[10] Installing cuDNN: /rdp/cudnn-download
[11] Link: /rdp/cudnn-download
[12] Official NVIDIA CUDA Installation Guide for Linux: /cuda/archive/10.0/cuda-installation-guide-linux/index.html
[13] CUDA Quick Start Guide (PDF): https://developer./compute/cuda/10.0/Prod/docs/sidebar/CUDA_Quick_Start_Guide.pdf
[14] CUDA Installation Guide Linux (PDF): https://developer./compute/cuda/10.0/Prod/docs/sidebar/CUDA_Installation_Guide_Linux.pdf
[15] Official cuDNN installation guide: /deeplearning/sdk/cudnn-install/index.html#install-linux
[16] [How To] Install Latest NVIDIA Drivers In Linux: /home/how-to-install-latest-nvidia-drivers-in-linux

