Better performance with tf.function


    Original page: https://tensorflow.google.cn/guide/function

    In the TensorFlow 2.x releases, eager execution is enabled by default. The user interface is intuitive and flexible, but this comes at the expense of performance and deployability.
    To get a model that is both performant and portable, use tf.function to generate graphs from your program. However, tf.function has some pitfalls to watch out for; this guide helps you understand what tf.function is doing under the hood so you can master it.

    
    ** Tracing: the process by which TensorFlow builds a Graph from the inputs. **
    ** Retracing: rebuilding the Graph when the arguments change. **
    
    

    The main takeaways and recommendations are:

    • Debug in eager mode, then decorate with @tf.function.
    • Don't rely on Python side effects like object mutation or list appends.
    • tf.function works best with TensorFlow ops; NumPy and Python calls are converted to constants.

    Setup

    import tensorflow as tf
    

    Define a helper function to demonstrate the kinds of errors you might encounter:

    import traceback
    import contextlib
    
    # Some helper code to demonstrate the kinds of errors you might encounter.
    @contextlib.contextmanager
    def assert_raises(error_class):
      try:
        yield
      except error_class as e:
        print('Caught expected exception \n  {}:'.format(error_class))
        traceback.print_exc(limit=2)
      except Exception as e:
        raise e
      else:
        raise Exception('Expected {} to be raised but no error was raised!'.format(
            error_class))
    

    Basics

    Once decorated with tf.function, a function you define behaves like a core TensorFlow operation: you can execute it eagerly, take gradients through it, and so on.

    For example, define an add function and execute it eagerly:

    @tf.function
    def add(a, b):
      return a + b
    
    add(tf.ones([2, 2]), tf.ones([2, 2]))  #  [[2., 2.], [2., 2.]]
    

    The output is:

    <tf.Tensor: shape=(2, 2), dtype=float32, numpy=
    array([[2., 2.],
           [2., 2.]], dtype=float32)>
    

    Compute a gradient through add:

    v = tf.Variable(1.0)
    with tf.GradientTape() as tape:
      result = add(v, 1.0)
    tape.gradient(result, v)
    

    The output is:

    <tf.Tensor: shape=(), dtype=float32, numpy=1.0>
    

    You can also call a tf.function you have already defined inside another tf.function:

    @tf.function
    def dense_layer(x, w, b):
      return add(tf.matmul(x, w), b)
    
    dense_layer(tf.ones([3, 2]), tf.ones([2, 2]), tf.ones([2]))
    
    <tf.Tensor: shape=(3, 2), dtype=float32, numpy=
    array([[3., 3.],
           [3., 3.],
           [3., 3.]], dtype=float32)>
    

    A tf.function defined this way can be faster than eagerly executed code, especially for graphs with many small ops. For graphs dominated by a few expensive ops (such as convolutions), however, you may not see much speedup:

    import timeit
    conv_layer = tf.keras.layers.Conv2D(100, 3)
    
    @tf.function
    def conv_fn(image):
      return conv_layer(image)
    
    image = tf.zeros([1, 200, 200, 100])
    # warm up
    conv_layer(image); conv_fn(image)
    print("Eager conv:", timeit.timeit(lambda: conv_layer(image), number=10))
    print("Function conv:", timeit.timeit(lambda: conv_fn(image), number=10))
    print("Note how there's not much difference in performance for convolutions")
    
    Eager conv: 0.004070537999723456
    Function conv: 0.0023154040000008536
    Note how there's not much difference in performance for convolutions
    

    Debugging

    In general, debugging code is easier in eager mode than inside tf.function. Before decorating a function with tf.function, make sure it runs without errors in eager mode. To help with debugging, you can call tf.config.run_functions_eagerly(True) to globally disable and re-enable tf.function.
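
    For example, a minimal sketch of this toggle (the buggy function below is made up for illustration, not from the original guide):

    tf.config.run_functions_eagerly(True)   # tf.function bodies now run eagerly

    @tf.function
    def buggy(x):
      return x / (x - x)  # divides by zero, yielding inf; easy to inspect eagerly

    print(buggy(tf.constant(1.0)))  # ordinary print/pdb now work line by line

    tf.config.run_functions_eagerly(False)  # restore graph execution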

    When debugging issues that only appear inside tf.function, here are some tips:

    • Plain old Python print calls only execute during tracing, helping you track down when your function gets (re)traced.
    • tf.print calls execute every time, and can help you track down intermediate values during execution.
    • tf.debugging.enable_check_numerics is an easy way to track down where NaNs and Inf are created.
    • pdb can help you understand what's going on during tracing. (Caveat: pdb will drop you into AutoGraph-transformed source code.)

    Tracing and polymorphism

    Python's dynamic typing means you can call a function with a variety of argument types, and Python will behave differently for each.

    In contrast, TensorFlow graphs require static dtypes and shapes. tf.function bridges this gap by retracing the function when needed to generate the correct graph. Most of the subtleties of using tf.function stem from this retracing behavior.

    Call the function with arguments of different types and observe what happens:

    # Functions are polymorphic
    
    @tf.function
    def double(a):
      print("Tracing with", a)
      return a + a
    
    print(double(tf.constant(1)))
    print()
    print(double(tf.constant(1.1)))
    print()
    print(double(tf.constant("a")))
    print()
    
    Tracing with Tensor("a:0", shape=(), dtype=int32)
    tf.Tensor(2, shape=(), dtype=int32)
    
    Tracing with Tensor("a:0", shape=(), dtype=float32)
    tf.Tensor(2.2, shape=(), dtype=float32)
    
    Tracing with Tensor("a:0", shape=(), dtype=string)
    tf.Tensor(b'aa', shape=(), dtype=string)
    

    To control this retracing behavior, you can use the following techniques.

    Create a new tf.function; separate tf.function objects are guaranteed not to share traces:

    def f():
      print('Tracing!')
      tf.print('Executing')
    
    tf.function(f)()
    tf.function(f)()
    
    Tracing!
    Executing
    Tracing!
    Executing
    

    Use get_concrete_function to get a specific trace:

    print("Obtaining concrete trace")
    double_strings = double.get_concrete_function(tf.TensorSpec(shape=None, dtype=tf.string))
    print("Executing traced function")
    print(double_strings(tf.constant("a")))
    print(double_strings(a=tf.constant("b")))
    print("Using a concrete trace with incompatible types will throw an error")
    with assert_raises(tf.errors.InvalidArgumentError):
      double_strings(tf.constant(1))
    
    Obtaining concrete trace
    Tracing with Tensor("a:0", dtype=string)
    Executing traced function
    tf.Tensor(b'aa', shape=(), dtype=string)
    tf.Tensor(b'bb', shape=(), dtype=string)
    Using a concrete trace with incompatible types will throw an error
    Caught expected exception 
      <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>:
    
    Traceback (most recent call last):
      File "<ipython-input-3-73d0ca52e838>", line 8, in assert_raises
        yield
      File "<ipython-input-10-5351d0a2eda2>", line 8, in <module>
        double_strings(tf.constant(1))
    tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute __inference_double_183 as input #0(zero-based) was expected to be a string tensor but is a int32 tensor [Op:__inference_double_183]
    

    Specify an input_signature in tf.function to limit tracing:

    @tf.function(input_signature=(tf.TensorSpec(shape=[None], dtype=tf.int32),))
    def next_collatz(x):
      print("Tracing with", x)
      return tf.where(x % 2 == 0, x // 2, 3 * x + 1)
    
    print(next_collatz(tf.constant([1, 2])))
    # We specified a 1-D tensor in the input signature, so this should fail.
    with assert_raises(ValueError):
      next_collatz(tf.constant([[1, 2], [3, 4]]))
    
    
    Tracing with Tensor("x:0", shape=(None,), dtype=int32)
    tf.Tensor([4 1], shape=(2,), dtype=int32)
    Caught expected exception 
      <class 'ValueError'>:
    
    Traceback (most recent call last):
      File "<ipython-input-3-73d0ca52e838>", line 8, in assert_raises
        yield
      File "<ipython-input-11-9939c82c1507>", line 9, in <module>
        next_collatz(tf.constant([[1, 2], [3, 4]]))
    ValueError: Python inputs incompatible with input_signature:
      inputs: (
        tf.Tensor(
    [[1 2]
     [3 4]], shape=(2, 2), dtype=int32))
      input_signature: (
        TensorSpec(shape=(None,), dtype=tf.int32, name=None))
    

    When to retrace

    A polymorphic tf.function keeps a cache of the concrete functions generated by tracing. The cache keys are effectively tuples of keys generated from the function's args and kwargs. The key generated for a tf.Tensor argument is its number of dimensions and its dtype; the key generated for a Python primitive is its value. For all other Python types, the key is based on the object's id(), so that methods are traced independently for each instance of a class. In the future, TensorFlow may add more sophisticated caching for Python objects that can be safely converted to tensors.
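
    As a rough sketch of these cache keys (this example is not from the original guide), tensors with the same dtype and rank reuse a trace, while each distinct Python value gets a trace of its own:

    @tf.function
    def square(x):
      print("Tracing with", x)  # Python print: runs only when a new trace is created
      return x * x

    square(tf.constant(2))  # traces once for scalar int32 tensors
    square(tf.constant(3))  # same dtype/shape key, so the cached trace is reused
    square(2)               # the Python value 2 is part of the key: new trace
    square(3)               # the Python value 3: yet another trace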

    See also: Concrete functions.

    Python or Tensor args?

    Often, Python arguments are used to control hyperparameters and graph construction, for example num_layers=10, training=True, or nonlinearity='relu'. If a Python argument changes, the graph must be retraced.

    However, it is possible that a Python argument is not being used to control graph construction, in which case a change in the Python value triggers needless retracing. Take this training loop, which AutoGraph will dynamically unroll: despite the multiple traces, the generated graphs are actually identical, so this is a bit inefficient.

    def train_one_step():
      pass
    
    @tf.function
    def train(num_steps):
      print("Tracing with num_steps = {}".format(num_steps))
      for _ in tf.range(num_steps):
        train_one_step()
    
    train(num_steps=10)
    train(num_steps=20)
    
    
    Tracing with num_steps = 10
    Tracing with num_steps = 20
    
    

    If your arguments do not affect the shape of the generated graph, convert them to Tensors:

    train(num_steps=tf.constant(10))
    train(num_steps=tf.constant(20))
    
    Tracing with num_steps = Tensor("num_steps:0", shape=(), dtype=int32)
    

    Side effects in tf.function

    In general, Python side effects (printing, mutating objects, and so on) only happen during tracing. How, then, can you reliably trigger Python side effects from tf.function?

    The rule of thumb is to use Python side effects only for debugging. Otherwise, TensorFlow ops like tf.Variable.assign, tf.print, and tf.summary are the best way to ensure your code is traced and executed by the TensorFlow runtime on every call. In general, a functional style yields the best results:

    @tf.function
    def f(x):
      print("Traced with", x)
      tf.print("Executed with", x)
    
    f(1)
    f(1)
    f(2)
    
    
    Traced with 1    (Python print: executed only during tracing)
    Executed with 1
    Executed with 1
    Traced with 2    (Python print: executed only during tracing)
    Executed with 2
    
    

    If you want to execute Python code on each invocation of a tf.function, tf.py_function is an escape hatch. The drawbacks of tf.py_function are that it is neither portable nor particularly performant, and it does not work well in distributed (multi-GPU, TPU) settings. Also, since tf.py_function has to be wired into the graph for differentiability, it casts all inputs/outputs to tensors.

    external_list = []
    
    def side_effect(x):
      print('Python side effect')
      external_list.append(x)
    
    @tf.function
    def f(x):
      tf.py_function(side_effect, inp=[x], Tout=[])
    
    f(1)
    f(1)
    f(1)
    assert len(external_list) == 3
    # .numpy() call required because py_function casts 1 to tf.constant(1)
    assert external_list[0].numpy() == 1
    
    Python side effect
    Python side effect
    Python side effect
    

    Beware of Python state

    Many Python features, such as iterators and generators, rely on the Python runtime to keep track of state. While these constructs generally work as expected in eager mode, many unexpected things can happen inside a tf.function because of tracing behavior.

    For example, advancing iterator state is a Python side effect and therefore only happens during tracing:

    external_var = tf.Variable(0)
    @tf.function
    def buggy_consume_next(iterator):
      external_var.assign_add(next(iterator))
      tf.print("Value of external_var:", external_var)
    
    iterator = iter([0, 1, 2, 3])
    buggy_consume_next(iterator)
    # This reuses the first value from the iterator, rather than consuming the next value.
    buggy_consume_next(iterator)
    buggy_consume_next(iterator)
    
    
    Value of external_var: 0
    Value of external_var: 0
    Value of external_var: 0
    
    

    Variables

    Using the same idea of respecting the code's intended execution order makes creating and using variables in tf.function very easy.

    There is one very important caveat, though: with variables it is possible to write code that behaves differently in eager mode and graph mode.
    Specifically, this happens when you create a new variable on every call. Because of tracing semantics, tf.function would reuse the same variable on each call, while eager mode would create a new variable on each call. To guard against this mistake, tf.function raises an error when it detects dangerous variable-creation behavior:

    @tf.function
    def f(x):
      v = tf.Variable(1.0)
      v.assign_add(x)
      return v
    
    with assert_raises(ValueError):
      f(1.0)
    
    
    WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
    Instructions for updating:
    If using Keras pass *_constraint arguments to layers.
    Caught expected exception 
      <class 'ValueError'>:
    
    Traceback (most recent call last):
      File "<ipython-input-3-73d0ca52e838>", line 8, in assert_raises
        yield
      File "<ipython-input-17-73e410646579>", line 8, in <module>
        f(1.0)
    ValueError: in user code:
    
        <ipython-input-17-73e410646579>:3 f  *
            v = tf.Variable(1.0)
        /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/ops/variables.py:261 __call__  **
            return cls._variable_v2_call(*args, **kwargs)
        /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/ops/variables.py:255 _variable_v2_call
            shape=shape)
        /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/ops/variables.py:66 getter
            return captured_getter(captured_previous, **kwargs)
        /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py:511 invalid_creator_scope
            "tf.function-decorated function tried to create "
    
        ValueError: tf.function-decorated function tried to create variables on non-first call.
    
    

    Unambiguous code is fine, though:

    v = tf.Variable(1.0)
    
    @tf.function
    def f(x):
      return v.assign_add(x)
    
    print(f(1.0))  # 2.0
    print(f(2.0))  # 4.0
    
    
    tf.Tensor(2.0, shape=(), dtype=float32)
    tf.Tensor(4.0, shape=(), dtype=float32)
    
    

    You can also create variables inside a tf.function, as long as you can guarantee they are created only the first time the function executes:

    class C:
      pass
    
    obj = C()
    obj.v = None
    
    @tf.function
    def g(x):
      if obj.v is None:
        obj.v = tf.Variable(1.0)
      return obj.v.assign_add(x)
    
    print(g(1.0))  # 2.0
    print(g(2.0))  # 4.0
    
    
    tf.Tensor(2.0, shape=(), dtype=float32)
    tf.Tensor(4.0, shape=(), dtype=float32)
    

    Variable initializers can depend on function arguments and on the values of other variables. We can work out the correct initialization order using the same method we use to generate control dependencies:

    state = []
    @tf.function
    def fn(x):
      if not state:
        state.append(tf.Variable(2.0 * x))
        state.append(tf.Variable(state[0] * 3.0))
      return state[0] * x * state[1]
    
    print(fn(tf.constant(1.0)))
    print(fn(tf.constant(3.0)))
    
    
    tf.Tensor(12.0, shape=(), dtype=float32)
    tf.Tensor(36.0, shape=(), dtype=float32)
    
    

    AutoGraph transformations

    AutoGraph is a library that is on by default in tf.function. It transforms a subset of Python code into graph-compatible TensorFlow ops, including control flow such as if, for, and while.

    TensorFlow ops like tf.cond and tf.while_loop also work, but control flow is often easier to write and understand in Python:

    # Simple loop
    
    @tf.function
    def f(x):
      while tf.reduce_sum(x) > 1:
        tf.print(x)
        x = tf.tanh(x)
      return x
    
    f(tf.random.uniform([5]))
    
    [0.42992723 0.425026417 0.735794306 0.224515557 0.623353]
    [0.405260503 0.401156455 0.626597464 0.2208177 0.553458273]
    [0.384441048 0.380938053 0.555704892 0.217297286 0.503107667]
    ...
    ...
    [0.203162178 0.202645645 0.219479173 0.161014691 0.215892553]
    
    <tf.Tensor: shape=(5,), dtype=float32, numpy=
    array([0.20041241, 0.19991657, 0.2160216 , 0.1596375 , 0.21259971],
          dtype=float32)>
    

    You can inspect the code that AutoGraph generates with the following call:

    print(tf.autograph.to_code(f.python_function))
    
    
    def tf__f(x):
        do_return = False
        retval_ = ag__.UndefinedReturnValue()
        with ag__.FunctionScope('f', 'fscope', ag__.ConversionOptions(recursive=True, user_requested=True, optional_features=(), internal_convert_user_code=True)) as fscope:
    
            def get_state():
                return (x,)
    
            def set_state(loop_vars):
                nonlocal x
                (x,) = loop_vars
    
            def loop_body():
                nonlocal x
                ag__.converted_call(tf.print, (x,), None, fscope)
                x = ag__.converted_call(tf.tanh, (x,), None, fscope)
    
            def loop_test():
                return (ag__.converted_call(tf.reduce_sum, (x,), None, fscope) > 1)
            ag__.while_stmt(loop_test, loop_body, get_state, set_state, ('x',), {})
            try:
                do_return = True
                retval_ = fscope.mark_return_value(x)
            except:
                do_return = False
                raise
        (do_return,)
        return ag__.retval(retval_)
    
    

    · Conditionals

    AutoGraph converts some if <condition> statements into equivalent tf.cond calls. This substitution is made if <condition> is a Tensor; otherwise, the if statement executes as plain Python.

    A Python conditional executes during tracing, so exactly one branch of the conditional is added to the graph. Without AutoGraph, such a traced graph would be unable to take the alternate branch when there is data-dependent control flow.

    tf.cond traces and adds both branches of the conditional to the graph, dynamically selecting a branch at execution time. Tracing can have unintended side effects; see AutoGraph tracing effects for more information.

    @tf.function
    def fizzbuzz(n):
      for i in tf.range(1, n + 1):
        print('Tracing for loop')
        if i % 15 == 0:
          print('Tracing fizzbuzz branch')
          tf.print('fizzbuzz')
        elif i % 3 == 0:
          print('Tracing fizz branch')
          tf.print('fizz')
        elif i % 5 == 0:
          print('Tracing buzz branch')
          tf.print('buzz')
        else:
          print('Tracing default branch')
          tf.print(i)
    
    fizzbuzz(tf.constant(5))
    fizzbuzz(tf.constant(20))
    
    
    Tracing for loop
    Tracing fizzbuzz branch
    Tracing fizz branch
    Tracing buzz branch
    Tracing default branch
    1
    2
    fizz
    4
    buzz
    1
    2
    fizz
    4
    buzz
    fizz
    7
    8
    fizz
    buzz
    11
    fizz
    13
    14
    fizzbuzz
    16
    17
    fizz
    19
    buzz
    
    

    See the reference documentation for additional restrictions on AutoGraph-converted if statements.

    · Loops

    AutoGraph converts some for and while statements into equivalent TensorFlow ops, such as tf.while_loop. If not converted, the for or while loop executes as plain Python.

    This substitution is made in the following situations:

    • for x in y: if y is a Tensor, convert to tf.while_loop. In the special case where y is a tf.data.Dataset, a combination of tf.data.Dataset ops is generated.
    • while <condition>: if <condition> is a Tensor, convert to tf.while_loop.

    A Python loop executes during tracing, adding additional ops to the graph for every iteration of the loop.

    A TensorFlow loop traces the body of the loop and dynamically selects how many iterations to run at execution time. The loop body appears only once in the generated tf.Graph.
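
    A rough sketch of the contrast (not from the original guide): a Python loop bakes every iteration into the graph, while a tensor loop is traced once:

    @tf.function
    def python_loop():
      x = tf.constant(0)
      for i in range(3):     # Python loop: unrolled during tracing into three adds
        x += i
      return x

    @tf.function
    def tf_loop():
      x = tf.constant(0)
      for i in tf.range(3):  # tensor loop: converted to a single tf.while_loop
        x += i
      return x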

    See the reference documentation for additional restrictions on AutoGraph-converted for and while statements.

    · Looping over Python data

    A common pitfall is to loop over NumPy/Python data within a tf.function. The loop executes during tracing, adding a copy of your model to the graph for each iteration of the loop.

    If you want to wrap the entire training loop in tf.function, the safest way is to wrap your data as a tf.data.Dataset, so that AutoGraph will dynamically unroll the training loop:

    def measure_graph_size(f, *args):
      g = f.get_concrete_function(*args).graph
      print("{}({}) contains {} nodes in its graph".format(
          f.__name__, ', '.join(map(str, args)), len(g.as_graph_def().node)))
    
    @tf.function
    def train(dataset):
      loss = tf.constant(0)
      for x, y in dataset:
        loss += tf.abs(y - x) # Some dummy computation.
      return loss
    
    small_data = [(1, 1)] * 3
    big_data = [(1, 1)] * 10
    measure_graph_size(train, small_data)
    measure_graph_size(train, big_data)
    
    measure_graph_size(train, tf.data.Dataset.from_generator(
        lambda: small_data, (tf.int32, tf.int32)))
    measure_graph_size(train, tf.data.Dataset.from_generator(
        lambda: big_data, (tf.int32, tf.int32)))
    
    
    train([(1, 1), (1, 1), (1, 1)]) contains 11 nodes in its graph
    train([(1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1)]) contains 32 nodes in its graph
    train(<FlatMapDataset shapes: (<unknown>, <unknown>), types: (tf.int32, tf.int32)>) contains 8 nodes in its graph
    train(<FlatMapDataset shapes: (<unknown>, <unknown>), types: (tf.int32, tf.int32)>) contains 8 nodes in its graph
    
    

    When wrapping NumPy/Python data in a Dataset, be mindful of the difference between tf.data.Dataset.from_generator and tf.data.Dataset.from_tensors. The former keeps the data in Python and fetches it via tf.py_function, which has performance implications; the latter bundles a copy of the data as one large tf.constant node in the graph, which has memory implications.
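
    A minimal sketch of the two wrapping styles (the toy data is illustrative, not from the original guide):

    numpy_data = [(1, 1)] * 3

    # Keeps the data in Python; elements are fetched through tf.py_function.
    gen_ds = tf.data.Dataset.from_generator(
        lambda: numpy_data, (tf.int32, tf.int32))

    # Embeds a copy of the data in the graph as a single constant node.
    const_ds = tf.data.Dataset.from_tensors(numpy_data)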

    Reading data from files via TFRecordDataset/CsvDataset/etc. is the most efficient way to consume data: TensorFlow itself can manage the asynchronous loading and prefetching of the data, without involving Python. To learn more, see the tf.data guide.
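
    A minimal sketch of such a file-based pipeline; the filename and feature spec below are placeholders, not from the original guide:

    def parse_example(record):
      # Hypothetical feature spec; adapt it to your data.
      features = {"x": tf.io.FixedLenFeature([], tf.float32)}
      return tf.io.parse_single_example(record, features)

    ds = (tf.data.TFRecordDataset(["data.tfrecord"])  # placeholder filename
          .map(parse_example)
          .prefetch(tf.data.experimental.AUTOTUNE))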

    · Accumulating values in a loop

    A common pattern is to accumulate intermediate values in a loop. Normally this is accomplished by appending to a Python list or adding entries to a Python dictionary. As these are Python side effects, they will not work as expected in a dynamically unrolled loop. Use tf.TensorArray to accumulate results from a dynamically unrolled loop:

    batch_size = 2
    seq_len = 3
    feature_size = 4
    
    def rnn_step(inp, state):
      return inp + state
    
    @tf.function
    def dynamic_rnn(rnn_step, input_data, initial_state):
      # [batch, time, features] -> [time, batch, features]
      input_data = tf.transpose(input_data, [1, 0, 2])
      max_seq_len = input_data.shape[0]
    
      states = tf.TensorArray(tf.float32, size=max_seq_len)
      state = initial_state
      for i in tf.range(max_seq_len):
        state = rnn_step(input_data[i], state)
        states = states.write(i, state)
      return tf.transpose(states.stack(), [1, 0, 2])
      
    dynamic_rnn(rnn_step,
                tf.random.uniform([batch_size, seq_len, feature_size]),
                tf.zeros([batch_size, feature_size]))
    
    
    <tf.Tensor: shape=(2, 3, 4), dtype=float32, numpy=
    array([[[0.96471524, 0.233114  , 0.1417228 , 0.14083493],
            [1.6257136 , 0.9389272 , 0.73989546, 0.8011714 ],
            [2.233508  , 1.827873  , 1.1567426 , 1.5585394 ]],
    
           [[0.67377114, 0.42712367, 0.5697857 , 0.71173656],
            [1.5520021 , 0.806401  , 0.9260858 , 1.5265073 ],
            [1.8115815 , 1.6316041 , 1.2245122 , 1.9724467 ]]], dtype=float32)>
    

    Further reading

    To learn about the graph optimizations performed after tracing a tf.function, see the Grappler guide.
    To learn how to optimize your data pipeline and profile your model, see the Profiler guide.
