Programmer's Guide
==================

Introduction
------------

The mission of OVL is to provide a framework for easily defining high-performance, python-native
operators that work transparently with TensorFlow on a variety of different processor architectures.
OVL operators are written by users in OVL's python-embedded domain specific language (DSL). The
programming model of operators is similar to that of CUDA or OpenCL: conceptually, an operator is a
stateless function that is mapped over a set of user-defined thread or worker indices. Operators are
defined as the python interpreter encounters them but are evaluated lazily, allowing numerical and
performance optimizations to be applied to the entire graph of defined operators before evaluation
time.

The OVL DSL is designed to strike a balance between productivity, performance, and portability, and
as such there are some important restrictions that differentiate it from similar approaches:

* The DSL is strongly typed and tensor shapes are part of the type system. This means that inputs
  and outputs must have a concrete shape - there is no support for dynamic tensor sizing.
* Individual worker elements cannot communicate and, as such, there are no synchronization barriers.
* Operators have no persistent state between calls.

Troubleshooting
~~~~~~~~~~~~~~~

General usage and setup questions should be posted to StackOverflow. Please post bugs and feature
requests to GitHub with as much information as possible, including:

* System environment information
* Library versions: ``pip show tensorflow opveclib``
* Logs, e.g. by using ``logging.basicConfig(filename='output.log', level=logging.DEBUG)`` in your code
* Instructions or code snippets for reproducing the issue

Common Pitfalls
~~~~~~~~~~~~~~~

OVL currently uses the python interpreter to parse operators into its own intermediate
representation, an approach which comes with some limitations. These limitations result in a
programming API that has two common pitfalls, each of which raises a syntax error when encountered.
These pitfalls are highlighted here to prevent confusion; examples of proper usage are detailed in
the examples section below. These limitations may eventually disappear (i.e. not throw a syntax
error and work as expected) as improvements are made to library internals.

Assignment
__________

Once variables are defined within an operator, the python-native assignment operator ``=`` cannot
be used on them. Use the ``<<=`` operator instead. For example, the following raises an error:

.. code-block:: python

    import opveclib as ovl

    x = ovl.variable(0, ovl.float32)
    x = 2.0

The correct way to assign a value to a previously-defined variable is as follows:

.. code-block:: python

    import opveclib as ovl

    x = ovl.variable(0, ovl.float32)
    x <<= 2.0

Conditionals
____________

The python-native ``if``, ``elif``, and ``else`` statements cannot be used with OVL expressions.
OVL conditionals must be invoked from within a ``with`` statement. More specifically, the following
raises an error:

.. code-block:: python

    import opveclib as ovl

    x = ovl.variable(0, ovl.float32)
    if x < -1:
        x <<= -1
    elif x > 1:
        x <<= 1

The correct way to use conditionals is as follows:

.. code-block:: python

    import opveclib as ovl

    x = ovl.variable(0, ovl.float32)
    with ovl.if_(x < -1):
        x <<= -1
    with ovl.elif_(x > 1):
        x <<= 1
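The examples below also mention an ``else_`` conditional. As a minimal sketch of a complete clamp
(the exact calling convention of ``else_`` is assumed here, by analogy with ``if_`` and ``elif_``):

.. code-block:: python

    import opveclib as ovl

    x = ovl.variable(0, ovl.float32)
    y = ovl.variable(0, ovl.float32)
    with ovl.if_(x < -1):
        y <<= -1
    with ovl.elif_(x > 1):
        y <<= 1
    # assumed: else_ takes no condition, mirroring the python-native else
    with ovl.else_():
        y <<= x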
Guide by examples
-----------------

Hello World
~~~~~~~~~~~

Designing the simplest OVL operator requires understanding a few key concepts and their
corresponding implementations in the OVL API. This example takes the absolute value of an input
tensor and shows all of the basic functionality necessary to create and use an operator. New
concepts introduced here include:

* An OVL operator is defined by creating a python function and decorating it with the
  ``operator()`` decorator.
* Arguments to the operator are the tensors that it will operate on at evaluation time.
* Output tensors are the only thing that can be returned from operators. They are defined with the
  ``output`` and ``output_like`` functions.
* Operators are implicitly mapped over a set of workgroup positions. The workgroup shape must be
  statically defined based on the arguments to the operator and must be either a single int or a
  list of ints. The ``position_in`` function is used to define the workgroup shape and returns a
  ``PositionTensor`` object which is used to identify the position of the current worker element.
  A workgroup shape can be defined to be any number of dimensions that makes sense for the problem
  at hand. A ``PositionTensor`` can be indexed to obtain the current worker's position along a
  specific axis.
* Data elements in input tensors are accessed by indexing into them. Indexing into an input tensor
  yields a ``Scalar``. A ``Scalar`` can be transformed into a new ``Scalar`` by applying scalar
  operators to it, in this case the ``absolute`` function.
* Output tensors are written to by setting ``Scalar`` values at a specific position. In this case
  the position is just the ``PositionTensor``, but more complicated write patterns follow in other
  examples.
* Operators can be tested independently of the TensorFlow runtime using the OVL test
  infrastructure via the ``evaluate`` function. Testing operators this way is recommended since it
  isolates the operator from the TF runtime. An explicit ``evaluate`` function is used so that
  operators can be lazily evaluated, increasing the opportunity for optimization.
* Operators are linked into the TensorFlow runtime by explicitly converting operator outputs to
  TensorFlow tensors with the ``as_tensorflow`` function.

.. testcode::

    import numpy as np
    import tensorflow as tf
    import opveclib as ovl

    @ovl.operator()
    def absolute(input_tensor):
        # define the output tensor
        output_tensor = ovl.output_like(input_tensor)

        # define the workgroup shape and get the workgroup position reference
        wg_position = ovl.position_in(input_tensor.shape)

        # read the input element at the current workgroup position
        input_element = input_tensor[wg_position]

        # apply the absolute function
        abs_element = ovl.absolute(input_element)

        # set the output element
        output_tensor[wg_position] = abs_element

        return output_tensor

    # define a numpy input
    in_np = np.arange(-3, 4, dtype=np.float32)

    # apply the operator
    out = absolute(in_np)

    # lazily evaluate with the OVL test infrastructure
    print(ovl.evaluate([out]))

    # explicitly convert to a tensorflow tensor
    out_tf = ovl.as_tensorflow(out)

    # lazily evaluate using tensorflow
    sess = tf.Session()
    print(sess.run([out_tf])[0])

Outputs:

.. testoutput::

    [ 3.  2.  1.  0.  1.  2.  3.]
    [ 3.  2.  1.  0.  1.  2.  3.]
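Workgroup shapes are not limited to one dimension. As a minimal sketch of a multi-dimensional
workgroup, here is a hypothetical ``transpose`` operator (not part of the original examples;
passing a shape list to ``output`` and multi-index reads and writes with per-axis positions are
assumed to work as in the matrix-vector example below):

.. code-block:: python

    import numpy as np
    import opveclib as ovl

    @ovl.operator()
    def transpose(input_tensor):
        assert input_tensor.rank == 2
        rows = input_tensor.shape[0]
        cols = input_tensor.shape[1]

        # the output shape is the input shape reversed
        # (assumed: output accepts a shape list, like position_in)
        output_tensor = ovl.output([cols, rows], input_tensor.dtype)

        # one worker per input element, across a 2-D workgroup shape
        pos = ovl.position_in([rows, cols])

        # index the PositionTensor along each axis and swap indices on the write
        output_tensor[pos[1], pos[0]] = input_tensor[pos[0], pos[1]]

        return output_tensor

    in_np = np.arange(6, dtype=np.float32).reshape(2, 3)
    print(ovl.evaluate([transpose(in_np)]))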
Constants
~~~~~~~~~

Many operators depend on an input value that does not change throughout the lifetime of the
operator and is not dependent on the data values of the input tensors. Examples include summing
along a constant axis, applying a constant threshold, and applying a constant power to an input
tensor. This example implements the ``power`` function, which raises an input tensor to a
specified power, and introduces the new concept of a constant:

* Constant arguments are differentiated from input tensors by explicitly giving the operator
  function argument a default value. If the default value is ``None``, an error is raised when the
  argument is not set at the time the operator is applied. When calling the operator, constants
  are specified by passing them to the operator as keyword arguments.

.. testcode::

    import numpy as np
    import tensorflow as tf
    import opveclib as ovl

    @ovl.operator()
    def power(input_tensor, exponent=None):
        # define the output tensor
        output_tensor = ovl.output_like(input_tensor)

        # define the workgroup shape and get the workgroup position reference
        wg_position = ovl.position_in(input_tensor.shape)

        # read the input element at the current workgroup position
        input_element = input_tensor[wg_position]

        # apply the power function and set the output element
        output_tensor[wg_position] = ovl.power(input_element, exponent)

        return output_tensor

    # define a numpy input
    in_np = np.arange(-3, 4, dtype=np.float32)

    # apply the operator
    # note that constants must be explicitly set as keyword arguments
    out = power(in_np, exponent=2)

    # lazily evaluate with the OVL test infrastructure
    print(ovl.evaluate([out]))

    # explicitly convert to a tensorflow tensor
    out_tf = ovl.as_tensorflow(out)

    # lazily evaluate using tensorflow
    sess = tf.Session()
    print(sess.run([out_tf])[0])

Outputs:

.. testoutput::

    [ 9.  4.  1.  0.  1.  4.  9.]
    [ 9.  4.  1.  0.  1.  4.  9.]

Conditionals and Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~

Operators can exhibit control flow and use thread-local memory by using OVL conditionals and
variables. This example implements the ``clip`` function, which clips the values of the input
tensor to within specified boundaries. The following new concepts are illustrated here:

* The ``Variable``, a worker-local scalar whose state can be set with the ``<<=`` operator.
* The conditionals ``with if_()``, ``with elif_()`` and ``with else_()``, which are used to
  conditionally execute a segment of the operator. Python-native conditionals can also be used
  from within an operator, but only when they operate on a constant. This example differentiates
  between the two types of conditional.
* The ``forbid_none_valued_constants`` option to the ``operator`` decorator, which overrides the
  default behavior and allows ``None`` as a valid constant value.
.. testcode::

    import numpy as np
    import opveclib as ovl

    # override the default behavior and allow None-valued constants
    @ovl.operator(forbid_none_valued_constants=False)
    def clip(arg, threshold1=None, threshold2=None):
        pos = ovl.position_in(arg.shape)
        clipped = ovl.output_like(arg)

        # define a worker-local variable
        clipped_val = ovl.variable(0, arg.dtype)

        # assign the value of variables with the <<= operator
        clipped_val <<= arg[pos]

        # python if statements can be used on constants to control how the operator behaves
        if threshold1 is not None:
            # when using conditionals on OVL expressions, you must use the OVL "with if_" function
            # these conditionals are evaluated at run time and are input-data dependent
            with ovl.if_(clipped_val < threshold1):
                clipped_val <<= threshold1

        # this python-native if statement controls how the operator behaves at definition time
        if threshold2 is not None:
            # this conditional controls how the operator deals with run-time data
            with ovl.if_(clipped_val > threshold2):
                clipped_val <<= threshold2

        clipped[pos] = clipped_val

        return clipped

    # define a numpy input
    in_np = np.arange(-3, 4, dtype=np.float32)

    # apply the operator
    # not all constants have to be supplied here since None-valued constants are permitted
    out1 = clip(in_np, threshold1=-1)
    out2 = clip(in_np, threshold2=1)
    out3 = clip(in_np, threshold1=-1, threshold2=1)

    # lazily evaluate with the OVL test infrastructure
    res1, res2, res3 = ovl.evaluate([out1, out2, out3])
    print(res1)
    print(res2)
    print(res3)

Outputs:

.. testoutput::

    [-1. -1. -1.  0.  1.  2.  3.]
    [-3. -2. -1.  0.  1.  1.  1.]
    [-1. -1. -1.  0.  1.  1.  1.]

Loops and non-local IO
~~~~~~~~~~~~~~~~~~~~~~

OVL supports iterating over ranges and defining arbitrary workgroup shapes that may or may not be
the same size as the input or output tensors. This simple example of a naive implementation of
matrix-vector multiplication shows the use of three new concepts:

* The ``arange()`` iterator, which works like the python-native ``range`` but allows for runtime
  iteration based off of either constants or data from one of the input tensors (see the sketch
  after this example).
* Arbitrary workgroup shapes - in this case the workgroup shape is determined by the number of
  rows in the matrix.
* Non-local IO - in this example each worker reads multiple elements from the input matrix and
  vector.

.. testcode::

    import numpy as np
    import opveclib as ovl

    @ovl.operator()
    def matmul(mat, vec):
        # assert type properties of the two input tensors
        assert mat.rank == 2
        assert vec.rank == 1
        assert mat.shape[1] == vec.shape[0]
        assert mat.dtype == vec.dtype

        rows = mat.shape[0]
        cols = mat.shape[1]

        # define the output to be one-dimensional
        y = ovl.output(rows, mat.dtype)

        # define a number of workers equal to the number of rows
        row = ovl.position_in(rows)[0]

        # define the accumulator and iterate over the matrix columns
        accum = ovl.variable(0, mat.dtype)
        for col in ovl.arange(cols):
            # accumulate the product of each element
            accum <<= accum + mat[row, col]*vec[col]

        y[row] = accum

        return y

    m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=np.float32)
    x = np.array([1, 2, 3], dtype=np.float32)

    out = matmul(m, x)
    print(ovl.evaluate([out]))

Outputs:

.. testoutput::

    [  8.  26.  44.]
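As noted in the list above, ``arange`` can also iterate based on data read from an input tensor.
A minimal sketch of this usage follows, with a hypothetical ``sum_first_n`` operator (the ability
to pass a ``Scalar`` read from an input tensor as the bound of ``arange``, and integer input
tensors, are assumptions based on the description above):

.. code-block:: python

    import numpy as np
    import opveclib as ovl

    @ovl.operator()
    def sum_first_n(x, n):
        # a single worker produces a single-element output;
        # note that the output shape itself must still be statically defined
        total = ovl.output(1, x.dtype)
        pos = ovl.position_in(1)

        accum = ovl.variable(0, x.dtype)
        # assumed: the loop bound is a Scalar read from the input tensor n,
        # so the iteration count is determined by run-time data
        for i in ovl.arange(n[0]):
            accum <<= accum + x[i]

        total[pos] = accum
        return total

    x_np = np.arange(10, dtype=np.float32)
    n_np = np.array([4], dtype=np.int32)
    print(ovl.evaluate([sum_first_n(x_np, n_np)]))  # expected: 0 + 1 + 2 + 3 = 6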
Worker-local tensors
~~~~~~~~~~~~~~~~~~~~

Similar to the OVL ``Variable``, OVL also supports dynamically declared worker-local tensors of
any dimension that use thread-local memory. Worker-local tensors are generally small (on the order
of hundreds of bytes) and should fit in register space on the device or in L1 cache. An OVL
worker-local tensor is defined by its shape and type and can be initialized to either all zeros or
all ones using the ``zeros()`` or ``ones()`` functions.

.. code-block:: python

    import opveclib as ovl

    # declare a 1x3 worker-local tensor of type float32, initialized to all zeros
    a = ovl.zeros((1, 3), ovl.float32)
    for col in ovl.arange(3):
        a[0, col] = 0.9

Gradients
~~~~~~~~~

An OVL gradient operator is a special type of operator. It is defined by creating a python
gradient function and decorating it with the ``operator()`` decorator as well as the ``gradient()``
decorator. The argument to the ``gradient()`` decorator is the operator it is the gradient for.
The gradient function must compute the gradients with respect to each of the operator's inputs,
given the gradients with respect to the operator's outputs. Its arguments are the operator's input
tensors as well as the output gradient tensors - the gradients from the next higher layer in the
case of back-propagation through a neural network.

Below is a simple operator that computes the function

.. math::

    z_{i} = x_{i}^{2} + 3.0 \cdot y_{i}

and its gradient, and tests the gradient against TensorFlow:

.. testcode::

    import numpy as np
    import opveclib as ovl
    import tensorflow as tf

    @ovl.operator()
    def func(x, y):
        assert x.shape == y.shape
        pos = ovl.position_in(x.shape)
        z = ovl.output_like(x)

        # compute the function
        z[pos] = x[pos] * x[pos] + 3.0 * y[pos]

        return z

    @ovl.gradient(func)
    @ovl.operator()
    def func_grad(x, y, dz):
        assert x.shape == y.shape
        assert x.shape == dz.shape
        pos = ovl.position_in(x.shape)
        dx = ovl.output_like(x)
        dy = ovl.output_like(y)

        # compute the gradients w.r.t. each input
        dx[pos] = 2.0 * x[pos] * dz[pos]
        dy[pos] = 3.0 * dz[pos]

        return dx, dy

    # define some input values
    a = tf.constant(np.arange(5.0))
    b = tf.constant(np.arange(5.0))
    grad_above = tf.constant(np.random.random(5))

    # run the operator in tensorflow
    with tf.Session() as s:
        func_ovl = ovl.as_tensorflow(func(a, b))

        # define a tensorflow reference operation
        func_tf = a * a + 3.0 * b

        # register the gradients of the ovl operator and the reference in tensorflow
        func_grad_ovl = tf.gradients(func_ovl, [a, b], grad_ys=grad_above)
        func_grad_tf = tf.gradients(func_tf, [a, b], grad_ys=grad_above)

        # evaluate the outputs
        func_grad_ovl, func_grad_tf = s.run([func_grad_ovl, func_grad_tf])
        assert np.all(np.equal(func_grad_ovl, func_grad_tf))

Operator Merging
~~~~~~~~~~~~~~~~

Typical modern GPUs are able to perform arithmetic operations faster than they can transfer data
values in and out of memory, and are thus memory-bandwidth limited. If two operators that are
performed in sequence can be merged into a single operator that shares data values in device
memory, performance can be improved. In an application using multiple OVL operators in sequence,
the OVL optimizer will automatically merge any operators that it can. Two OVL operators in
sequence, *from* and *to*, will be merged if:

1) the workgroup shape of both operators is the same
2) the write indexing pattern of the *from* operator matches the read indexing pattern of the
   *to* operator
3) the tensor shape of the output of the *from* operator matches the tensor shape of the input to
   the *to* operator

Optimization by automatically merging OVL operators is determined by the ``opt_level`` flag that
is passed to the ``evaluate`` and ``as_tensorflow`` functions. Merging is performed when
``opt_level >= 3``, which is the default optimization level. When operators are merged, their
gradient operators will also be merged when possible.
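As an illustration, consider two hypothetical elementwise operators applied in sequence (not from
the original guide). Both use the same workgroup shape, the write pattern of the first matches the
read pattern of the second, and the intermediate tensor shapes match, so all three merge criteria
above are satisfied; passing ``opt_level`` as a keyword argument is assumed here:

.. code-block:: python

    import numpy as np
    import opveclib as ovl

    @ovl.operator()
    def square(x):
        out = ovl.output_like(x)
        pos = ovl.position_in(x.shape)
        out[pos] = x[pos] * x[pos]
        return out

    @ovl.operator()
    def add_one(x):
        out = ovl.output_like(x)
        pos = ovl.position_in(x.shape)
        out[pos] = x[pos] + 1.0
        return out

    in_np = np.arange(5, dtype=np.float32)

    # both operators share the workgroup shape and indexing pattern, so the
    # optimizer can merge them and keep the intermediate result in device memory
    out = add_one(square(in_np))

    # merging is performed at opt_level >= 3, the default
    # (opt_level as a keyword argument is an assumption)
    print(ovl.evaluate([out], opt_level=3))  # expected: 1, 2, 5, 10, 17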