[MUSIC PLAYING] ALEXANDRE PASSOS:

Hello, my name is Alex, and I work on TensorFlow. I am here today to tell you

all a little bit about how you can use TensorFlow to do

deep learning research more effectively. What we’re going to

do today is we’re going to take a little tour of

a few TensorFlow features that show you how controllable,

flexible, and composable TensorFlow is. We’ll take a quick look at

those features, some old and some new. These are, by far,

not all the features that are useful for research. But these features

let you accelerate your research using

TensorFlow in ways that perhaps you’re

not aware of. And I want to start by helping

you control how TensorFlow represents state. If you’ve used

TensorFlow before, and I am sure you

have at this point, you know that a lot

of our libraries use TF variables

to represent state, like your model parameters. And for example, a

Keras dense layer has one kernel matrix

and an optional bias vector stored in it. And these parameters are updated

when you train your model. And part of the whole

point of training models is so that we find out what

value those parameters should have had in the first place. And if you’re making

your own layers library, you can control absolutely

everything about how that state is represented. But you can also crack

open the black box and control how

state is represented, even inside the libraries

that we give you. So for example, we’re going to

use this little running example of what if I wanted to

re-parametrize a Keras layer so it does some

computation to generate the kernel matrix,

say to save space or to get the correct

inductive bias. The way to do this is to use

tf.variable_creator_scope. It is a tool we have that lets

you take control of the state creation process in TensorFlow. It’s a context manager, and

all variables created under it go through a

function you specify. And this function can

choose to do nothing. It can delegate. Or it can modify how

variables are created. Under the hood, this is what

a distribution strategy's scope does. So it’s the same

tool that we use to build TensorFlow that

we make available to you, so you can extend it. And here, if I wanted to do this

re-parametrization of the Keras layer, it’s actually

pretty simple. First, I define what type I want

to use to store those things. Here, I’m using this

factorized variable type, which is a tf.Module. tf.Modules are a

very convenient type. You can have

variables as members, and we can track

them automatically for you and all

sorts of nice things. And once we define

this type, which is really just a left

half and a right half, I can tell TensorFlow

how to use objects of this type as part

of TensorFlow computations. And what we do here is we

do a matrix multiplication of the left component

and the right component. And now that I know how to use

this object, I can create it. And this is all

that I need to make my own little

variable_creator_scope. In this case, I want

to peek at the shape. And if I’m not

creating a matrix, just delegate to

whatever TensorFlow would have done normally. And if I am creating

a matrix, instead of creating a single

matrix, I’m going to create this

factorized variable that has the left half

and the right half. And finally, I now

get to just use it. And here, I create a

little Keras layer. I apply it. And I can check that

it is indeed using my factorized representation. This gives you a lot of power, because now you can take

large libraries of code that you did not write and

do dependency injection to change how they behave. Probably if you’re going

to do this at scale, you might want to

implement your own layer so you can have full control. But it’s also very

valuable for you to be able to extend the

ones that we provide you. So use tf.variable_creator_scope

to control the state.
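
To make that concrete, here is a rough sketch of the factorized-kernel idea in Python. The names FactorizedVariable, RANK, and make_factorized_variable are mine, invented for illustration, and the exact creator kwargs and Keras internals can vary between TensorFlow versions, so treat this as a sketch rather than a drop-in implementation; tf.Module, tf.variable_creator_scope, and tf.register_tensor_conversion_function are the real TensorFlow APIs involved.

import tensorflow as tf

RANK = 4  # assumed low rank for the factorization (made up for this sketch)

class FactorizedVariable(tf.Module):
    """Stores a matrix as the product of a left and a right factor."""
    def __init__(self, shape, dtype):
        self.left = tf.Variable(tf.random.normal([shape[0], RANK], dtype=dtype))
        self.right = tf.Variable(tf.random.normal([RANK, shape[1]], dtype=dtype))

# Tell TensorFlow how to use this object in computations: whenever a plain
# tensor is needed, multiply the two factors back together.
tf.register_tensor_conversion_function(
    FactorizedVariable, lambda v, *args, **kwargs: tf.matmul(v.left, v.right))

def make_factorized_variable(next_creator, **kwargs):
    initial_value = kwargs["initial_value"]
    value = initial_value() if callable(initial_value) else initial_value
    if len(value.shape) != 2:
        # Not a matrix (e.g. a bias vector): delegate to the default creator.
        return next_creator(**kwargs)
    return FactorizedVariable(value.shape, value.dtype)

with tf.variable_creator_scope(make_factorized_variable):
    layer = tf.keras.layers.Dense(10)
    layer.build([None, 100])
print(type(layer.kernel))  # should be our FactorizedVariable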

A big part of TensorFlow and why we use these libraries

to do research at all, as opposed to just

writing plain Python code, is that deep learning is

really dependent on very fast computation. And one thing that we’re

making more and more easy to use in TensorFlow is our

underlying compiler, XLA, which we’ve always used for TPUs. But now, we’re making

it easier for you to use for CPUs and GPUs, as well. And the way we’re doing this

is using tf.function with the experimental_compile=True

annotation. What this means is if you

mark a function as a function that you want to compile,

we will compile it, or we’ll raise an error. So you can trust

the code you write inside a block is going to

run as quickly as if you had handwritten your own fused

TensorFlow kernel for CPUs, or a fused CUDA kernel for GPUs, and

all the machinery yourself. But you get to write high-level,

fast, Python TensorFlow code. One example where

you might easily find yourself writing your

own little custom kernel is if you want to do research

on activation functions, which is something that

people want to do. Here is an activation function,

admittedly a terrible one, but activation functions tend to look

a little like this. They have a bunch of

nonlinear operations and a bunch of

element-wise things. But in general,

they apply lots of little element-wise operations

to each element of your vector. And these things, if

you try to run them in the normal

TensorFlow interpreter, they’re going to be rather

slow, because they’re going to do a new memory

allocation and copy things around for every single one

of these little operations. Whereas if you were to make

a single fused kernel, you would just write a single thing

for each coordinate that does the exponentiation, and

logarithm, and addition, and all the things like that. But what we can see here is

that if I take this function, and I wrap it with

experimental_compile=True, and I benchmark running a

compiled version versus running a non-compiled version,

on this tiny benchmark, I can already see a 25% speedup. And it’s even better

than this, because we see speedups of this sort

of magnitude or larger, even on fairly large

models, including BERT. Because in large models, we

can fuse more computation into the linear operations,

and your reductions, and things like that. And this can get you

compounding wins. So try using

experimental_compile=True for automatic compilation

in TensorFlow. You should be able to apply

it to small pieces of code and replace what you’d normally

have to do with fused kernels.
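
As a concrete illustration, here is a small sketch of what this looks like. The activation function itself is made up, much like the deliberately terrible one in the slides; the relevant part is passing experimental_compile=True to tf.function so XLA fuses the chain of element-wise operations into one kernel.

import tensorflow as tf

@tf.function(experimental_compile=True)
def weird_activation(x):
    # A pile of small element-wise ops that XLA can fuse into a single kernel.
    return tf.math.log1p(tf.exp(-tf.abs(x))) + tf.maximum(x, 0.0) * tf.tanh(x)

x = tf.random.normal([1024, 1024])
y = weird_activation(x)  # compiled (or an error is raised) on the first call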

So, what type of research code that a lot of people rely on has

lots of very small element-wise operations and would

greatly benefit from the fusion powers of a compiler? I think it’s optimizers. And a nice thing about doing

your optimizer research in TensorFlow is that

Keras makes it very easy for you to implement your own

stochastic-gradient-descent-style optimizer. You can make a class

that subclasses the tf.keras Optimizer class

and override three methods. You can define

your initialization, where you set up your

learning rate or whatever, in your __init__. You can create any accumulator

variables, like your momentum, or higher order powers of

gradients, or anything else you need, and create slots. And you can define how

to apply this optimizer update to a single variable. Once you’ve defined

those three things, you have everything

TensorFlow needs to be able to run

your custom optimizer. And normally,

TensorFlow optimizers are written with

hand-fused kernels, which can make the code very

complicated to read, but ensure that they

run very quickly. What I’m going to show

here is an example of a very simple

optimizer– again, not a particularly good one. This is a weird

variation that has some momentum and some

higher order powers, but it doesn’t train very well. However, it has the same

sorts of operations that you would have on a real optimizer. And I can just write them as

regular TensorFlow operations in my model. And by just adding this line

with experimental_compile=True, I can get it to run just as

fast as a hand-fused kernel. And the benchmarks

are shown here. It was over a 2x speedup. So this can really

matter when you’re doing a lot of research

that looks like this.
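
Here is a rough sketch of what such an optimizer can look like. The class name, hyperparameters, and update rule are all made up (and, like the one in the talk, not a good optimizer); the point is the three overridden methods and the experimental_compile annotation. Details of the Optimizer base class have shifted across Keras versions, so treat this as a sketch rather than a drop-in implementation.

import tensorflow as tf

class WeirdOptimizer(tf.keras.optimizers.Optimizer):
    def __init__(self, learning_rate=0.01, name="WeirdOptimizer", **kwargs):
        # Initialization: set up the learning rate (or whatever you need).
        super().__init__(name, **kwargs)
        self._set_hyper("learning_rate", learning_rate)

    def _create_slots(self, var_list):
        # One accumulator per variable, like a momentum slot.
        for var in var_list:
            self.add_slot(var, "momentum")

    @tf.function(experimental_compile=True)  # let XLA fuse the whole update
    def _resource_apply_dense(self, grad, var, apply_state=None):
        # How to apply the update to a single variable.
        lr = tf.cast(self._get_hyper("learning_rate"), var.dtype)
        m = self.get_slot(var, "momentum")
        m.assign(0.9 * m + grad + 0.1 * tf.square(grad))
        return var.assign_sub(lr * m)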

So Keras optimizers plus compilation let you experiment really fast

with fairly intricate things, and I hope you will use this

to accelerate your research. The next thing I want to

talk about is vectorization. It’s, again, super

important for performance. I’m sure you’ve

heard, at this point, that Moore’s Law is

over, and we’re no longer going to get a

free lunch in terms of processors getting faster. The way we’re making our

machine learning models faster is by doing more and

more things in parallel. And this is great,

because we get to unlock the potential of GPUs and TPUs. This is also a little

scary, because now, even though we know what we want

to do to a single, little data point, we have to write these

batched operations, which can be fairly complicated. In TensorFlow, we’ve

recently been developing automatic

vectorization for you, where you can write the

element-wise code that you want to write and get the performance

of the batched computation that you want.
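
For instance, here is a minimal sketch of tf.vectorized_map on a made-up per-example function: you write the code for one example, and TensorFlow vectorizes it over the batch dimension for you.

import tensorflow as tf

batch = tf.random.normal([32, 10, 10])

def per_example(x):
    # Operates on a single [10, 10] example.
    return tf.linalg.matvec(x, tf.ones([10]))

result = tf.vectorized_map(per_example, batch)  # shape [32, 10], no Python loop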

So the working example I’m going to use here is Jacobians. If you’re familiar with

TensorFlow’s gradient tape, you know that tape.gradient

computes the gradient of

a scalar, not the gradient of a vector-valued or

matrix-valued function. And if you want the

Jacobian of a vector-valued or matrix-valued

function, you can just call tape.gradient

many, many times. And here, I have a very,

very simple function that is just the exponential

of the square of a matrix. And I want to

compute the Jacobian. And I do this by

writing this double for loop, where for every

row, for every column, I compute the gradient

of the output element at that row and column, and then stack

the results together to get my higher

order Jacobian tensor. This is fine. This has always worked. However, you can replace

these explicit loops with tf.vectorized_map. For one, you get a

small readability win. Because now we’re

saying that, yes, you’re just applying this

operation everywhere. But also, you get a very

big performance win. And this version that

uses tf.vectorized_map is substantially faster than the

version that doesn’t use it. But of course, you

don’t want to have to write this all

the time, which is why, really, for Jacobians,

we implemented it directly in the gradient tape. And you can call tape.jacobian

to get the Jacobian computed for you. And if you do this, it’s over

10 times faster on this example than doing the manual loop

yourself, because we can do the automatic vectorization.
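
Here is a small sketch of that, assuming the function from the slides is roughly the exponential of a matrix square; the shapes are my own choice.

import tensorflow as tf

x = tf.random.normal([8, 8])

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.exp(tf.matmul(x, x))

# tape.jacobian vectorizes the loop over output elements for you.
jacobian = tape.jacobian(y, x)
print(jacobian.shape)  # (8, 8, 8, 8): one gradient per output element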

But the reason why I opened this black box and showed you

the previous slide is so you can know how to

implement something that is not a Jacobian but is

Jacobian-like yourself, and how you can use TensorFlow’s

automatic vectorization capabilities together

with the other tools you have in your research to

make you more productive. So remember to use

automatic vectorization, so you can write short code

that actually runs really fast, and let us add the batch

dimensions for you. And here is another

interesting performance point. Because with TensorFlow,

we have always had the big, rectangular array

or hyper-array, the tensor, as the core data structure. And tensors are great. In the world

we live in today, where we need to leverage

as much parallelism as we can to make our models

go fast, operations on tensors tend to be naturally

highly parallel by default. It’s a very intuitive API

to program the capabilities of these supercomputers we

have today, with many GPUs and TPUs wired together. And as long as you can stay

within this tensor box, you are happy. You get peak performance. And everything’s great. However, as deep

learning becomes more and more

successful, and as we want to do research on more and

more different types of data, we start to want to work

with things that don’t really look like these big,

rectangular arrays: structures that are ragged

and have varying shapes. And in TensorFlow, we’ve

recently been working really hard at adding native

support for ragged data. So here’s an example. Pretend it’s 10 years ago

and you have a bunch of sentences. They all have different lengths. And you want to turn them into

embeddings so you can feed them into a neural network. So what you want to

do here is you’re going to start with all

the words in that sentence. You’re going to look up their

index in your vocabulary table. Then you’re going to

use the index to look up a row in an embedding table. And finally, you want to average

the embeddings of all the words

in a sentence to get

an embedding for each sentence, which you can then use in

the rest of your model. And even though we’re

working with ragged data here, because all the sentences

have different lengths, if you think about the underlying

operations that we’re doing here, most of

them don’t actually have to care about

this raggedness. So we can make this

run very efficiently by decomposing

this representation into two things– a tensor that concatenates

across the ragged dimension and a separate

tensor that tells you how to find the individual

ragged elements in there. And once you have

this representation, it’s very easy and efficient

to do all the computations that we wanted to

do to solve the task from the previous slide.
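
Concretely, this is what that decomposition looks like on a tiny made-up ragged tensor; values and row_splits are real RaggedTensor properties.

import tensorflow as tf

rt = tf.ragged.constant([[1, 2, 3], [4], [5, 6]])
print(rt.values)      # [1 2 3 4 5 6]: everything concatenated across the ragged dimension
print(rt.row_splits)  # [0 3 4 6]: where each row begins and ends in that flat tensor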

You have always been able to do this manually in TensorFlow. We’ve always had the features

and capabilities for you to do this. Now, with tf.RaggedTensor,

we’re taking over the management of this from you

and just giving you an object, a ragged tensor, that

looks like a tensor. It can be manipulated

like a tensor, but is represented like this. And so it has ragged

shapes and can represent much more

flexible data structures than you could otherwise. So let’s go over a little

bit of a code example, here. Here is my data, same one

from the previous slides. It’s just a Python list. And I can take this

Python list and turn it into a ragged tensor by

using tf.ragged.constant. And the right thing

is going to happen. TensorFlow is going to

automatically concatenate across the ragged dimension

and keep this array of indices under the hood. Then I can define my vocabulary

table and do my lookup. And here, I’m showing you how to

do your lookup or any operation on a ragged tensor where that

operation hasn’t actually been rewritten to

support raggedness. You can always use

tf.ragged.map_flat_values to access the underlying

values of your ragged tensor and apply operations on them. Once we declare

an embedding matrix, note that many of the

TensorFlow core operations have been adapted to

work with ragged tensors. So in this case, if you

want to do a tf.gather to find out the correct

rows of the embedding matrix for each

word, you can just apply your tf.gather

on the ragged tensor, and the right thing will happen. And similarly, if you want

to reduce and average out the ragged dimension,

it’s very easy to do. You can just use the

standard tf.reduce_mean. And the nice thing is that,

at this point, because we’ve reduced out the

ragged dimension, we have no ragged dimension. And we just have a

dense tensor that has the original shape

you expected to have. And I think this is really

important, because now, it’s much easier, much more

intuitive and affordable for you to work with data that

doesn’t necessarily look like the big,

rectangular data that TensorFlow

is optimized for. And yet, it lets you get

most of the performance that you’d get with the

big, rectangular data. It’s a win-win situation, and

I’m really looking forward to seeing what interesting

applications you all are going to work on

that use and exploit this notion of raggedness. So please, play with tf.ragged. Try it out. It’s very exciting.
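
Putting the whole pipeline together, here is a rough end-to-end sketch of the sentence-embedding example. The sentences, vocabulary, and embedding size are made up; tf.ragged.constant, tf.ragged.map_flat_values, tf.gather, and tf.reduce_mean are the operations discussed above.

import tensorflow as tf

sentences = tf.ragged.constant([
    ["what", "makes", "you", "think", "she", "is", "a", "witch"],
    ["she", "turned", "me", "into", "a", "newt"],
    ["a", "newt"],
    ["well", "i", "got", "better"],
])

# Vocabulary lookup, applied to the flat (non-ragged) values.
words = ["what", "makes", "you", "think", "she", "is", "a", "witch",
         "turned", "me", "into", "newt", "well", "i", "got", "better"]
vocab = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(words, tf.range(len(words), dtype=tf.int64)),
    num_oov_buckets=1)
word_ids = tf.ragged.map_flat_values(vocab.lookup, sentences)

# Embedding lookup: tf.gather understands ragged indices.
embedding_table = tf.random.normal([len(words) + 1, 8])
word_embeddings = tf.gather(embedding_table, word_ids)         # ragged [4, None, 8]

# Average over the ragged (word) dimension to get one dense row per sentence.
sentence_embeddings = tf.reduce_mean(word_embeddings, axis=1)  # dense [4, 8]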

So next up, we’re going to go over a particularly interesting

example of research done with TensorFlow. And Akshay here, who is a PhD

student at Stanford University, is going to come and tell us

all about convex optimization layers in TensorFlow. Thank you. [MUSIC PLAYING]