• Getting Started with TensorFlow: Building with Bazel (with GPU Support)


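    This series is basically written as I go: I document each step when I reach it.

    I recently installed a GPU (GTX 750 Ti) together with CUDA 9.0 and cuDNN 7.1, so I want TensorFlow to run on the GPU. This post also makes good on an earlier promise.

    That is the motivation. Because the prebuilt TensorFlow packages do not work well with the newer CUDA versions, the only option is to build from source.

    The process breaks down roughly into the following steps.

    1. Install the dependencies (I had already done this, so I won't go through it; see the dependencies in the earlier posts, which are basically the same)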

    sudo apt-get install openjdk-8-jdk

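    The JDK is required by bazel.

    2. Install Git (skip this step if you already have it)

    3. Install bazel, TensorFlow's build tool

    4. Configure and build the TensorFlow source

    5. Install and configure environment variables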

    1. Install the dependencies (covered above)

    2. Install Git

    Use:

    sudo apt-get install git
    git clone --recursive https://github.com/tensorflow/tensorflow

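    3. Install bazel, TensorFlow's build tool

    This step is a bit of a hassle because bazel is not available through apt-get.

    You therefore have to download it from GitHub first and then install it. The download page is https://github.com/bazelbuild/bazel/releases

    Pick the right version to download. You need to check which bazel versions your TensorFlow version requires; the exact requirement can be found in configure.py. For example, my version contains the following: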

    _TF_BAZELRC_FILENAME = '.tf_configure.bazelrc'
    _TF_WORKSPACE_ROOT = ''
    _TF_BAZELRC = ''
    _TF_CURRENT_BAZEL_VERSION = None
    _TF_MIN_BAZEL_VERSION = '0.27.1'
    _TF_MAX_BAZEL_VERSION = '1.1.0'

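    The meaning of each field is clear from its name. _TF_BAZELRC_FILENAME is the configuration file used when building with bazel (I have not studied it in detail; https://www.cnblogs.com/shouhuxianjian/p/9416934.html has an explanation), and _TF_MIN_BAZEL_VERSION = '0.27.1' is the minimum bazel version required.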
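The range check implied by _TF_MIN_BAZEL_VERSION and _TF_MAX_BAZEL_VERSION can be sketched as follows. This is only an illustration, not the code in configure.py: the helper names parse_version and bazel_version_ok are made up here, and the sketch assumes plain dotted numeric versions such as 0.27.1.

```python
def parse_version(version):
    """Turn a dotted version string such as '0.27.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def bazel_version_ok(current, minimum="0.27.1", maximum="1.1.0"):
    """Return True if `current` lies within [minimum, maximum], inclusive.

    The defaults are the values quoted from configure.py above.
    """
    return parse_version(minimum) <= parse_version(current) <= parse_version(maximum)


if __name__ == "__main__":
    for candidate in ("0.14.1", "0.27.1", "0.29.1", "2.0.0"):
        print(candidate, bazel_version_ok(candidate))
```

Comparing integer tuples rather than raw strings matters here: as strings, '0.9.0' would sort after '0.27.1'.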

    Then install the downloaded .sh file with sudo:

    sudo chmod +x ./bazel*.sh
    sudo ./bazel-0.*.sh
    

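    4. Configure and build the TensorFlow source

    Configuration comes first; you can pick and trim features to suit your needs. This step is particularly tedious because there are many options to answer. My choices were as follows: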

    jourluohua@jour:~/tools/tensorflow$ ./configure
    WARNING: Running Bazel server needs to be killed, because the startup options are different.
    You have bazel 0.14.1 installed.
    Please specify the location of python. [Default is /usr/bin/python]: 


    Found possible Python library paths:
      /usr/local/lib/python2.7/dist-packages
      /usr/lib/python2.7/dist-packages
    Please input the desired Python library path to use.  Default is [/usr/local/lib/python2.7/dist-packages]

    Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: Y
    jemalloc as malloc support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
    No Google Cloud Platform support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
    No Hadoop File System support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
    No Amazon S3 File System support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
    No Apache Kafka Platform support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with XLA JIT support? [y/N]: y
    XLA JIT support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with GDR support? [y/N]: y
    GDR support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with VERBS support? [y/N]: y
    VERBS support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
    No OpenCL SYCL support will be enabled for TensorFlow.

    Do you wish to build TensorFlow with CUDA support? [y/N]: y
    CUDA support will be enabled for TensorFlow.

    Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 8


    Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 


    Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 


    Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


    Do you wish to build TensorFlow with TensorRT support? [y/N]: N
    No TensorRT support will be enabled for TensorFlow.

    Please specify the NCCL version you want to use. [Leave empty to default to NCCL 1.3]: 


    Please specify a list of comma-separated Cuda compute capabilities you want to build with.
    You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
    Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 5.0]


    Do you want to use clang as CUDA compiler? [y/N]: N
    nvcc will be used as CUDA compiler.

    Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 


    Do you wish to build TensorFlow with MPI support? [y/N]: N
    No MPI support will be enabled for TensorFlow.

    Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 


    Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: N
    Not configuring the WORKSPACE for Android builds.

    Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
        --config=mkl             # Build with MKL support.
        --config=monolithic      # Config for mostly static monolithic build.
    Configuration finished
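Answering these prompts by hand every time gets tedious. As far as I know, the configure script of TensorFlow 1.x reads its answers from environment variables when they are set and only prompts for the rest. The sketch below collects a few of the answers from the log above into an environment dictionary. The function name configure_env is made up for illustration, and the variable names should be checked against the configure.py in your own checkout, since they can differ between TensorFlow versions.

```python
import os


def configure_env(base_env=None):
    """Return a copy of the environment with some ./configure answers preset.

    Values mirror the interactive log above; the variable names are
    assumed from TensorFlow 1.x configure.py and may vary by version.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "PYTHON_BIN_PATH": "/usr/bin/python",   # location of python
        "TF_ENABLE_XLA": "1",                   # XLA JIT support: y
        "TF_NEED_CUDA": "1",                    # CUDA support: y
        "TF_CUDA_VERSION": "8.0",               # CUDA SDK version entered in the log
        "TF_CUDNN_VERSION": "7",                # cuDNN version
        "TF_CUDA_COMPUTE_CAPABILITIES": "5.0",  # default shown in the log
        "TF_NEED_MPI": "0",                     # MPI support: N
        "CC_OPT_FLAGS": "-march=native",        # flags used by --config=opt
    })
    return env


if __name__ == "__main__":
    for key, value in sorted(configure_env({}).items()):
        print("{}={}".format(key, value))
```

The returned dictionary could then be passed as the env argument of subprocess.run(["./configure"], ...) from the source directory; any question without a preset value is still asked interactively.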

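    Then build with bazel (this step is very error-prone and particularly time-consuming). In bazel, -c opt builds a release (optimized) version and -c dbg builds a debug version; the command below uses --config=opt, which applies the optimization flags chosen during ./configure.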

    bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
    bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

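    You will run into quite a few problems along the way. Here are some errors that are not easy to look up.

    1) You may hit a CXX error whose actual cause is hard to track down (the output only shows which line of which build file failed, not the specific error). To see the detailed error message, add the --verbose_failures option.

    2) A CXX error may also be caused by the gcc version (anyone who does build work knows that mature C++ code is usually stable, compatible, and easy to port, and rarely runs into compiler or environment problems). In that case you can add --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0"

    3) virtual memory exhausted: Cannot allocate memory. This happens when no swap space is configured or the swap space is too small; free -m shows the current memory and swap sizes. The fix is to add swap space, roughly as follows: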

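    # Write a swap file of 4096000 blocks x 1024 bytes (about 4 GB)
    dd if=/dev/zero of=/home/jourluohua/swap bs=1024 count=4096000
    # Restrict permissions so swapon does not complain (an addition to the original commands)
    chmod 600 /home/jourluohua/swap
    # Format the file as swap and enable it
    mkswap /home/jourluohua/swap
    sudo swapon /home/jourluohua/swap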

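    This writes 4096000 blocks of 1024 bytes each, about 4 GB in total (4,194,304,000 bytes). If the problem persists, keep in mind that bazel builds with multiple parallel jobs by default; you can add the -j 2 option to cap the build at 2 parallel jobs.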
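The flags mentioned in items 1) to 3) can be combined on one command line. The sketch below only assembles the argument list and does not run bazel. The function name bazel_build_command and its parameters are hypothetical, while the flags themselves are the ones discussed above (--jobs is the long form of -j).

```python
def bazel_build_command(verbose_failures=False, old_cxx_abi=False, jobs=None):
    """Assemble the bazel build command from the troubleshooting flags above."""
    cmd = ["bazel", "build", "--config=opt", "--config=cuda"]
    if verbose_failures:
        # Print the full failing command instead of only the BUILD file location.
        cmd.append("--verbose_failures")
    if old_cxx_abi:
        # Work around gcc ABI mismatches.
        cmd.append("--cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0")
    if jobs is not None:
        # Limit parallel jobs to reduce peak memory use.
        cmd.append("--jobs={}".format(jobs))
    cmd.append("//tensorflow/tools/pip_package:build_pip_package")
    return cmd


if __name__ == "__main__":
    print(" ".join(bazel_build_command(verbose_failures=True, jobs=2)))
```

Keeping the command as a list makes it straightforward to hand to subprocess.run without worrying about shell quoting.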

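    4) AttributeError: 'module' object has no attribute 'IntEnum'. This one is rather obscure: python -c "import enum" runs without error, yet the module really has no IntEnum attribute. After some searching, the fix turned out to be installing the enum34 package. One of Python's weaker points is how messy its packages can get.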

    pip install enum34 --user
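A quick way to see which enum module Python actually picks up, and whether it provides IntEnum, is sketched below. The helper name has_int_enum is made up for illustration. On Python 3 the standard library enum always has IntEnum, so the check is mainly useful on a Python 2.7 setup like the one in this post.

```python
import enum


def has_int_enum(module=enum):
    """Return True if the given enum module exposes IntEnum."""
    return hasattr(module, "IntEnum")


if __name__ == "__main__":
    # The file path shows whether a third-party 'enum' package shadows enum34.
    print("enum loaded from:", getattr(enum, "__file__", "<unknown>"))
    print("IntEnum available:", has_int_enum())
```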

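    5) AttributeError: attribute '__doc__' of 'type' objects is not writable. This one is actually quite tricky. My background is computer architecture and I mostly write C++, so I am not very familiar with Python; perhaps my build environment was at fault? I looked it up: __doc__ is the docstring in Python.

    I first wrote a small program to reproduce the problem: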

    #!/usr/bin/python
    from functools import wraps
    
    #from https://stackoverflow.com/questions/39010366/functools-wrapper-attributeerror-attribute-doc-of-type-objects-is-not
    def memoize(f):
        """ Memoization decorator for functions taking one or more arguments.
            Saves repeated api calls for a given value, by caching it.
        """
        @wraps(f)
        class memodict(dict):
           """memodict"""
           def __init__(self, f):
               self.f = f
           def __call__(self, *args):
               return self[args]
           def __missing__(self, key):
               ret = self[key] = self.f(*key)
               return ret
        return memodict(f)
    
    @memoize
    def a():
        """blah"""
        pass

    It produced the same error:

    Traceback (most recent call last):
      File "ipy.py", line 20, in <module>
        @memoize
      File "ipy.py", line 9, in memoize
        class memodict(dict):
      File "/usr/lib/python2.7/functools.py", line 33, in update_wrapper
        setattr(wrapper, attr, getattr(wrapped, attr))
    AttributeError: attribute '__doc__' of 'type' objects is not writable
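The traceback comes from functools.wraps trying to assign __doc__ on a class, which Python 2.7 does not allow for new-style classes. For the small reproduction above, one possible way around it (a sketch, not something taken from the TensorFlow source) is to copy the metadata onto the instance rather than the class, using functools.update_wrapper inside __init__. The function square is only a hypothetical example.

```python
from functools import update_wrapper


class memodict(dict):
    """Cache of results keyed by the tuple of positional arguments."""

    def __init__(self, f):
        dict.__init__(self)
        self.f = f
        # Copy __name__, __doc__, etc. onto this instance, not onto the class.
        update_wrapper(self, f)

    def __call__(self, *args):
        return self[args]

    def __missing__(self, key):
        # Called by dict on a cache miss: compute, store, and return the value.
        ret = self[key] = self.f(*key)
        return ret


def memoize(f):
    """Memoization decorator for functions taking positional arguments."""
    return memodict(f)


calls = []


@memoize
def square(x):
    """Return x squared."""
    calls.append(x)
    return x * x
```

Because the attributes are set on the instance, the class object itself is never modified, so the read-only class __doc__ is not touched.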

    I opened the Python file that triggers the error. The original code looks like this:

    @tf_export(v1=["VariableAggregation"])
    class VariableAggregation(enum.Enum):
      NONE = 0
      SUM = 1
      MEAN = 2
      ONLY_FIRST_REPLICA = 3
      ONLY_FIRST_TOWER = 3  # DEPRECATED
      
      def __hash__(self):
        return hash(self.value)
    
    
    # LINT.ThenChange(//tensorflow/core/framework/variable.proto)
    #
    # Note that we are currently relying on the integer values of the Python enums
    # matching the integer values of the proto enums.
    
    VariableAggregation.__doc__ = (
        VariableAggregationV2.__doc__ +
        "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n  ")

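    Roughly, this sets the docstring of VariableAggregation to that of VariableAggregationV2 plus one extra line, "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`. ". My guess was: since assigning it outside the class declaration is not allowed, would setting it directly inside the class body work?

    The modified code is as follows: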

    @tf_export(v1=["VariableAggregation"])
    class VariableAggregation(enum.Enum):
      NONE = 0
      SUM = 1
      MEAN = 2
      ONLY_FIRST_REPLICA = 3
      ONLY_FIRST_TOWER = 3  # DEPRECATED
      __doc__ = (VariableAggregationV2.__doc__ +
                 "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n  ")
      def __hash__(self):
        return hash(self.value)
    
    
    # LINT.ThenChange(//tensorflow/core/framework/variable.proto)
    #
    # Note that we are currently relying on the integer values of the Python enums
    # matching the integer values of the proto enums.
    
    #VariableAggregation.__doc__ = (
    #    VariableAggregationV2.__doc__ +
    #    "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n  ")

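The same idea can be tried in isolation. In the sketch below, AggregationV2 and Aggregation are hypothetical stand-ins for the TensorFlow classes; it shows that a __doc__ assignment inside an Enum class body is treated as the class docstring rather than as an enum member.

```python
import enum


class AggregationV2(enum.Enum):
    """Hypothetical stand-in for VariableAggregationV2.

    * `NONE`: no aggregation.
    """
    NONE = 0


class Aggregation(enum.Enum):
    # Build the docstring inside the class body instead of assigning
    # Aggregation.__doc__ after the class has been created.
    __doc__ = (AggregationV2.__doc__ +
               "* `ONLY_FIRST_TOWER`: Deprecated alias for `ONLY_FIRST_REPLICA`.\n")
    NONE = 0
    ONLY_FIRST_REPLICA = 3
    ONLY_FIRST_TOWER = 3  # alias of ONLY_FIRST_REPLICA


if __name__ == "__main__":
    print(Aggregation.__doc__)
```

Names with double underscores on both sides are not turned into enum members, which is why the assignment is safe here.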
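    6) LargeZipFile: Zipfile size would require ZIP64 extensions. The cause is fairly obvious: the file is too large, so the ZIP64 option has to be enabled when compressing, and it appears to be off by default. Edit /usr/lib/python2.7/dist-packages/wheel/archive.py and change

    zip = zipfile.ZipFile(open(zip_filename, "wb+"), "w",compression=zipfile.ZIP_DEFLATED)

    to

    zip = zipfile.ZipFile(open(zip_filename, "wb+"), "w",compression=zipfile.ZIP_DEFLATED, allowZip64=True)

    To be honest, though, the debug build is still too large: even then it exceeds what zip can handle, mainly failing at the CRC32 check. Since I do not urgently need it, I left this alone. Python 2.7 is no longer updated, so it is not worth the effort, and Python 3.5 and later have no problem here.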
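For reference, this is what the allowZip64 argument looks like in a standalone use of the standard zipfile module. The helper write_archive is made up for this sketch and writes to memory instead of patching wheel's archive.py. In Python 3.4 and later allowZip64 already defaults to True, which matches the observation that newer Python versions do not hit this error.

```python
import io
import zipfile


def write_archive(files, allow_zip64=True):
    """Write a {name: bytes} mapping into an in-memory zip and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=allow_zip64) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


if __name__ == "__main__":
    payload = write_archive({"hello.txt": b"hello"})
    print(len(payload), "bytes written")
```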

    There are also various missing-library errors, but those are generally easy to search for, so I will not list them here.

    5. Install and configure environment variables

    Install with pip:

    $ pip install /tmp/tensorflow_pkg/tensorflow --user
    
    # Type no space after "tensorflow"; press Tab before Enter so the shell completes the wheel filename

    Finally, test the installation:

    import tensorflow as tf
    sess = tf.InteractiveSession()
    sess.close()

    If every step runs without errors, TensorFlow has been built and installed successfully.
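The session test only shows that TensorFlow imports and starts. To also check that the GPU is visible, something like the following sketch may help. It assumes a TensorFlow 1.x install where tensorflow.python.client.device_lib is available; the helper name list_gpu_devices is made up, and it returns None when TensorFlow cannot be imported.

```python
def list_gpu_devices():
    """Return the names of GPU devices TensorFlow can see, or None if TensorFlow is missing."""
    try:
        # Assumed TensorFlow 1.x module path; it may differ in other versions.
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


if __name__ == "__main__":
    gpus = list_gpu_devices()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif gpus:
        print("GPU devices:", gpus)
    else:
        print("TensorFlow imported, but no GPU device was found")
```

An empty list after a CUDA build usually points to a driver or library path problem rather than a build failure.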

  • Original article: https://www.cnblogs.com/jourluohua/p/9180709.html