Research with TensorFlow (TF Dev Summit ’20)


[MUSIC PLAYING] ALEXANDRE PASSOS:
Hello, my name is Alex, and I work on TensorFlow. I am here today to tell you
all a little bit about how you can use TensorFlow to do
deep learning research more effectively. What we’re going to
do today is we’re going to take a little tour of
a few TensorFlow features that show you how controllable,
flexible, and composable TensorFlow is. We’ll take a quick look at
those features, some old and some new. These are by no means
all the features that are useful for research, but they
let you accelerate your research using
TensorFlow in ways that perhaps you’re
not aware of. And I want to start by helping
you control how TensorFlow represents state. If you’ve used
TensorFlow before, and I am sure you
have at this point, you know that a lot
of our libraries use TF variables
to represent state, like your model parameters. And for example, a
Keras dense layer has one kernel matrix
and an optional bias vector stored in it. And these parameters are updated
when you train your model. And part of the whole
point of training models is so that we find out what
value those parameters should have had in the first place. And if you’re making
your own layers library, you can control absolutely
everything about how that state is represented. But you can also crack
open the black box and control how
state is represented, even inside the libraries
that we give you. So for example, we’re going to
use this little running example of what if I wanted to
re-parametrize a Keras layer so it does some
computation to generate the kernel matrix,
say to save space or to get the correct
inductive bias. The way to do this is to use
tf.variable_creator_scope. It is a tool we have that lets
you take control of the state creation process in TensorFlow. It’s a context manager, and
all variables created under it go through a
function you specify. And this function can
choose to do nothing. It can delegate. Or it can modify how
variables are created. Under the hood, this is what
a tf.distribute.Strategy scope does. So it’s the same
tool that we use to build TensorFlow that
we make available to you, so you can extend it. And here, if I wanted to do this
re-parametrization of the Keras layer, it’s actually
pretty simple. First, I define what type I want
to use to store those things. Here, I’m using this
factorized variable type, which is a tf.Module. tf.Module is a
very convenient type. You can have
variables as members, and we can track
them automatically for you and all
sorts of nice things. And once we define
this type, it’s really just a left
half and right half. I can tell TensorFlow
how to use objects of this type as a part
of TensorFlow computations. And what we do here is we
do a matrix multiplication of the left component
and the right component. And now that I know how to use
this object, I can create it. And this is all
that I need to make my own little
variable_creator_scope. In this case, I want
to peek at the shape. And if I’m not
creating a matrix, just delegate to
whatever TensorFlow would have done, normally. And if I am creating
a matrix, instead of creating a single
matrix, I’m going to create this
factorized variable that has the left half
and the right half. And finally, I now
get to just use it. And here, I create a
little Keras layer. I apply it. And I can check that
it is indeed using my factorized representation. This gives you a lot of power, because now you can take
large libraries of code that you did not write and
do dependency injection to change how they behave. Probably if you’re going
to do this at scale, you might want to
implement your own layer so you can have full control. But it’s also very
valuable for you to be able to extend the
ones that we provide you. So use tf.variable_creator_scope
to control the state.
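In code, the pattern looks roughly like this. It is a minimal sketch rather than the exact code from the slides; the FactorizedVariable name, the rank, and the argument handling are illustrative and may need adjustment:

    import tensorflow as tf

    class FactorizedVariable(tf.Module):
      # Stores an [m, n] kernel as the product of [m, k] and [k, n] factors.
      def __init__(self, shape, rank=2, dtype=tf.float32):
        self.left = tf.Variable(tf.random.normal([shape[0], rank], dtype=dtype))
        self.right = tf.Variable(tf.random.normal([rank, shape[1]], dtype=dtype))

    # Tell TensorFlow how to use this object wherever a tensor is expected:
    # multiply the two factors back together.
    tf.register_tensor_conversion_function(
        FactorizedVariable, lambda value, *args, **kwargs: value.left @ value.right)

    def factorized_creator(next_creator, **kwargs):
      init = kwargs["initial_value"]
      shape = (init() if callable(init) else tf.convert_to_tensor(init)).shape
      if len(shape) != 2:
        # Not a matrix (e.g. a bias vector): delegate to the normal creator.
        return next_creator(**kwargs)
      return FactorizedVariable(shape)

    with tf.variable_creator_scope(factorized_creator):
      layer = tf.keras.layers.Dense(10)
      layer(tf.zeros([3, 100]))
    print(type(layer.kernel))  # FactorizedVariable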
A big part of TensorFlow, and why we use these libraries
to do research at all, as opposed to just
writing plain Python code, is that deep learning is
really dependent on very fast computation. And one thing that we’re
making more and more easy to use in TensorFlow is our
underlying compiler, XLA, which we’ve always used for TPUs. But now, we’re making
it easier for you to use for CPUs and GPUs, as well. And the way we’re doing this
is using tf.function with the experimental_compile=True
annotation. What this means is if you
mark a function as a function that you want to compile,
we will compile it, or we’ll raise an error. So you can trust
that the code you write inside such a block is going to
run as quickly as if you had handwritten your own fused
TensorFlow kernel for CPUs, and a fused CUDA kernel for GPUs, and
all the machinery yourself. But you get to write high-level,
fast, Python TensorFlow code. One example where
you might easily find yourself writing your
own little custom kernel is if you want to do research
on activation functions, which is something that
people want to do. In activation functions,
this is a terrible one. But they tend to look
a little like this. They have a bunch of
nonlinear operations and a bunch of
element-wise things. But in general,
they apply lots of little element-wise operations
to each element of your vector. And these things, if
you try to run them in the normal
TensorFlow interpreter, they’re going to be rather
slow, because they’re going to do a new memory
allocation and copy things around for every single one
of these little operations. Whereas if you were to write
a single fused kernel, you would just write a single expression
for each coordinate that does the exponentiation, and
logarithm, and addition, and all the things like that. But what we can see here is
that if I take this function, and I wrap it with
experimental_compile=True, and I benchmark running a
compiled version versus running a non-compiled version,
on this tiny benchmark, I can already see a 25% speedup. And it’s even better
than this, because we see speedups of this sort
of magnitude or larger, even on fairly large
models, including BERT. Because in large models, we
can fuse more computation into the linear operations,
and your reductions, and things like that. And this can get you
compounding wins. So try using
experimental_compile=True for automatic compilation
in TensorFlow. You should be able to apply
it to small pieces of code and replace what you’d normally
have to do with fused kernels.
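As a rough illustration of the kind of comparison being described (the activation function below is made up, not the one from the slide):

    import time
    import tensorflow as tf

    def silly_activation(x):
      # Lots of small element-wise ops that XLA can fuse into one kernel.
      return tf.math.log1p(tf.exp(x)) * tf.tanh(x) + 0.1 * tf.square(x)

    plain = tf.function(silly_activation)
    compiled = tf.function(silly_activation, experimental_compile=True)

    x = tf.random.normal([1000, 1000])
    plain(x).numpy()
    compiled(x).numpy()  # warm up tracing and XLA compilation

    for name, fn in [("plain", plain), ("compiled", compiled)]:
      start = time.time()
      for _ in range(100):
        result = fn(x)
      result.numpy()  # force the computation to finish before reading the clock
      print(name, time.time() - start)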
So you know what type of research code a lot of people rely on that has
lots of very small element-wise operations and would
greatly benefit from the fusion powers of a compiler? I think it’s optimizers. And a nice thing about doing
your optimizer research in TensorFlow is that
Keras makes it very easy for you to implement your own
stochastic gradient in style optimizer. You can make a class
that subclasses that TF Keras optimizer
and override three methods. You can define
your initialization while you compute your
learning rate or whatever, and you’re in it. You can create any accumulator
variables, like your momentum, or higher order powers of
gradients, or anything else you need, and create slots. And you can define how
to apply this optimizer update to a single variable. Once you’ve defined
those three things, you have everything
TensorFlow needs to be able to run
your custom optimizer. And normally,
TensorFlow optimizers are written with
hand-fused kernels, which can make the code very
complicated to read, but ensure that they
run very quickly. What I’m going to show
here is an example of a very simple
optimizer– again, not a particularly good one. This is a weird
variation that has some momentum and some
higher order powers, but it doesn’t train very well. However, it has the same
sorts of operations that you would have on a real optimizer. And I can just write them as
regular TensorFlow operations in my model. And by just adding this line
with experimental_compile=True, I can get it to run just as
fast as a hand-fused kernel. And the benchmarks
are written here. It was over a 2x speed up. So this can really
matter when you’re doing a lot of research
that looks like this. So Keras
optimizers plus compilation let you experiment really fast
with fairly intricate things, and I hope you will use this
to accelerate your research.
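For reference, here is a minimal sketch of the three-method pattern, written against the 2020-era tf.keras.optimizers.Optimizer interface (newer Keras versions use a different optimizer API, and this toy momentum update is illustrative, not the optimizer from the slide):

    import tensorflow as tf

    class ToyMomentum(tf.keras.optimizers.Optimizer):
      def __init__(self, learning_rate=0.01, momentum=0.9, name="ToyMomentum", **kwargs):
        # 1. Initialization: register the hyperparameters.
        super().__init__(name, **kwargs)
        self._set_hyper("learning_rate", learning_rate)
        self._set_hyper("momentum", momentum)

      def _create_slots(self, var_list):
        # 2. Create accumulator variables (slots), one per model variable.
        for var in var_list:
          self.add_slot(var, "velocity")

      @tf.function(experimental_compile=True)  # let XLA fuse the element-wise math
      def _resource_apply_dense(self, grad, var, apply_state=None):
        # 3. Apply the update to a single variable.
        lr = tf.cast(self._get_hyper("learning_rate"), var.dtype)
        mu = tf.cast(self._get_hyper("momentum"), var.dtype)
        velocity = self.get_slot(var, "velocity")
        new_velocity = velocity.assign(mu * velocity - lr * grad)
        return var.assign_add(new_velocity)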
The next thing I want to talk about is vectorization. It’s, again, super
important for performance. I’m sure you’ve
heard, at this point, that Moore’s Law is
over, and we’re no longer going to get a
free lunch in terms of processors getting faster. The way we’re making our
machine learning models faster is by doing more and
more things in parallel. And this is great,
because we get to unlock the potential of GPUs and TPUs. This is also a little
scary, because now, even though we know what we want
to do to a single, little data point, we have to write these
batched operations, which can be fairly complicated. In TensorFlow, we’ve
been developing, recently, automatic
vectorization for you, where you can write the
element-wise code that you want to write and get the performance
of the batched computation that you want. So the working example I’m
going to use here is Jacobians. If you’re familiar with
TensorFlow’s gradient tape, you know that tape.gradient
computes the gradient of
a scalar, not the gradient of a vector-valued or
matrix-valued function. And if you want the
Jacobian of a vector-valued or matrix-valued
function, you can just call tape.gradient
many, many times. And here, I have a very,
very simple function that is just the exponential
of the square of a matrix. And I want to
compute the Jacobian. And I do this by
writing this double for loop, where for every
row, for every column, I compute the gradient of
the output at that row and column, and then stack
the results together to get my higher
order Jacobian tensor. This is fine. This has always worked. However, you can replace
these explicit loops with tf.vectorized_map. And one, you get a
small readability win. Because now we’re
saying that, yes, you’re just applying this
operation everywhere. But also, you get a very
big performance win. And this version that
uses tf.vectorized_map is substantially faster than the
version that doesn’t use it.
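tf.vectorized_map is a general tool, not just for gradients. As a minimal illustration of the API (a made-up batched matrix-vector example rather than the gradient loop from the slide), you write the per-example computation and tf.vectorized_map adds the batch dimension for you:

    import tensorflow as tf

    mats = tf.random.normal([8, 3, 3])
    vecs = tf.random.normal([8, 3])

    def per_example(args):
      m, v = args
      return tf.linalg.matvec(m, v)   # written for a single example

    out = tf.vectorized_map(per_example, (mats, vecs))  # batched result, shape [8, 3]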
But of course, you don’t want to have to write this all
the time, which is why, really, for Jacobians,
we implemented it directly in the gradient tape. And you can call tape.jacobian
to get the Jacobian computed for you. And if you do this, it’s over
10 times faster on this example than doing the manual loop
yourself, because we can do the automatic vectorization.
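Concretely, the two versions look something like this. This is a sketch; the function f is a stand-in for the one on the slide, and the manual loop recomputes f for every output element to keep the code simple:

    import tensorflow as tf

    def f(x):
      return tf.exp(x * x)   # toy stand-in for the function from the talk

    x = tf.random.normal([4, 4])

    # Manual Jacobian: one tape.gradient call per output coordinate.
    rows = []
    for i in range(4):
      cols = []
      for j in range(4):
        with tf.GradientTape() as tape:
          tape.watch(x)
          y_ij = f(x)[i, j]   # index inside the tape so the slice is recorded
        cols.append(tape.gradient(y_ij, x))
      rows.append(tf.stack(cols))
    jacobian_loop = tf.stack(rows)       # shape [4, 4, 4, 4]

    # Vectorized Jacobian: let the gradient tape do it for you.
    with tf.GradientTape() as tape:
      tape.watch(x)
      y = f(x)
    jacobian_fast = tape.jacobian(y, x)  # same shape, vectorized under the hood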
But the reason why I opened this black box and showed you
the previous slide is so you can know how to
implement something that is not a Jacobian but is like
a Jacobian yourself, and how you can use TensorFlow’s
automatic vectorization capabilities together
with the other tools you have in your research to
make you more productive. So remember to use
automatic vectorization, so you can write short code
that actually runs really fast. And let us add the batched
dimensions ourselves. And here is another
interesting performance point. Because with TensorFlow,
we have always had the big, rectangular array
or hyper-array, the tensor, as the core data structure. And tensors are great. In a world where
we live in today, where we need to leverage
as much parallelism as we can to make our models
go fast, operations on tensors tend to be naturally
highly parallel by default. It’s a very intuitive API
to program the capabilities of these supercomputers we
have today, with many GPUs and TPUs wired together. And as long as you can stay
within this tensor box, you are happy. You get peak performance. And everything’s great. However, as deep
learning becomes more and more
successful, and as we want to do research on more and
more different types of data, we start to want to work
with things that don’t really look like these big,
rectangular arrays: a structure that is ragged
and has a different shape. And in TensorFlow, we’ve
been recently working really hard at adding native
support for ragged data. So here’s an example. Pretend it’s 10 years ago
and you have a bunch of sentences. They all have different lengths. And you want to turn them into
embeddings so you can feed them into a neural network. So what you want to
do here is you’re going to start with all
the words in that sentence. You’re going to look up their
index in your vocabulary table. Then you’re going to
use the index to look up a row in an embedding table. And finally, you want to average
the embeddings of all the words in a sentence to get
an embedding for each sentence, which you can then use in
the rest of your model. And even though we’re
working with ragged data here, because all the sentences
have different lengths, if you think about the underlying
operations that we’re doing here, most of
them don’t actually have to care about
this raggedness. So we can make this
run very efficiently by decomposing
this representation into two things: a tensor that concatenates
across the ragged dimension and a separate
tensor that tells you how to find the individual
ragged elements in there. And once you have
this representation, it’s very easy and efficient
to do all the computations that we wanted to
do to solve the task from the previous slide.
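To make the decomposition concrete, here is a small sketch with made-up sentences; the flat values hold all the words concatenated, and the row splits record where each sentence begins and ends:

    import tensorflow as tf

    values = tf.constant(["the", "cat", "sat", "on", "the", "mat", "hi", "there"])
    row_splits = tf.constant([0, 6, 8], dtype=tf.int64)   # sentence boundaries

    sentences = tf.RaggedTensor.from_row_splits(values, row_splits)
    print(sentences)              # [["the", "cat", "sat", "on", "the", "mat"], ["hi", "there"]]
    print(sentences.values)       # the concatenated tensor
    print(sentences.row_splits)   # the partition that recovers the raggedness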
You have always been able to do this manually in TensorFlow. We’ve always had the features
and capabilities for you to do this. Now, with tf.RaggedTensor,
we’re taking over the management of this from you
and just giving you an object, a ragged tensor, that
looks like a tensor. It can be manipulated
like a tensor, but is represented like this. And so it has ragged
shapes and can represent much more
flexible data structures than you could otherwise. So let’s go over a little
bit of a code example, here. Here is my data, same one
from the previous slides. It’s just a Python list. And I can take this
Python list and turn it into a ragged tensor by
using tf.ragged.constant. And the right thing
is going to happen. TensorFlow is going to
automatically concatenate across the ragged dimension
and keep this array of indices under the hood. Then I can define my vocabulary
table and do my lookup. And here, I’m showing you how to
do your lookup or any operation on a ragged tensor where that
operation hasn’t actually been rewritten to
support raggedness. You can always use
tf.ragged.map_flat_values to access the underlying
values of your ragged tensor, and apply operations to them. Once we’ve created
an embedding matrix, we can use it directly, because many of the
TensorFlow core operations have been adapted to
work with ragged tensors. So in this case, if you
want to do a tf.gather to find out the correct
rows of the embedding matrix for each
word, you can just apply your tf.gather
on the ragged tensor, and the right thing will happen. And similarly, if you want
to reduce and average out the ragged dimension,
it’s very easy to do. You can just use the
standard tf.reduce_mean. And the nice thing is that,
at this point, because we’ve reduced out the
ragged dimension, we have no ragged dimension. And we just have a
dense tensor that has the original shape
you expected to have. And I think this is really
important, because now, it’s much easier, much more
intuitive and affordable for you to work with data that
doesn’t necessarily look like the big,
rectangular data that TensorFlow
is optimized for. And yet, it lets you get
most of the performance that you’d get with the
big, rectangular data. It’s a win-win situation, and
I’m really looking forward to see what interesting
applications you all are going to work on
that use and exploit this notion of raggedness. So please, play with tf.ragged. Try it out. It’s very exciting. So next up, we’re
going to go over a particularly interesting
example of research done with TensorFlow. And Akshay here, who is a PhD
student at Stanford University, is going to come and tell us
all about convex optimization layers in TensorFlow. Thank you. [MUSIC PLAYING]