{"id":52,"date":"2022-04-29T04:16:30","date_gmt":"2022-04-29T04:16:30","guid":{"rendered":"https:\/\/blogs.oregonstate.edu\/pimode\/?p=52"},"modified":"2022-04-29T04:16:30","modified_gmt":"2022-04-29T04:16:30","slug":"gradient-descent","status":"publish","type":"post","link":"https:\/\/blogs.oregonstate.edu\/pimode\/2022\/04\/29\/gradient-descent\/","title":{"rendered":"Gradient Descent"},"content":{"rendered":"\n<p>Last week, I wrote about forward propagation in a basic neural network. This week, I\u2019m going to cover back propagation and gradient descent. To keep it simple, I\u2019m not going to dive deeply into the math, but I will describe it in very rudimentary terms.<\/p>\n\n\n\n<p>Let\u2019s do a quick recap of forward propagation in training the model. We start with the 3 main layers: the input layer, the hidden layer, and the output layer.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"562\" height=\"295\" src=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Layered-Neural-Network.png\" alt=\"\" class=\"wp-image-42\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Layered-Neural-Network.png 562w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Layered-Neural-Network-300x157.png 300w\" sizes=\"auto, (max-width: 562px) 100vw, 562px\" \/><figcaption>basic neural network layers<\/figcaption><\/figure>\n\n\n\n<p>The input layer just takes numerical data as input in the form of a tensor. In the case of an image file, the image is created from a matrix of pixels and each pixel is represented by a color of 3 dimensions (red, green, and blue). Each color in each pixel is encoded as a number between <code>0<\/code> and <code>255<\/code> which determines its strength. 
A video adds the dimension of time, since it is a sequence of images in order.<\/p>\n\n\n\n<p>This small network has two hidden layers. Each hidden layer is associated with an activation function, and each neuron in a hidden layer has its own bias (<code>b<\/code>), assigned randomly at first. The network is connected by weighted edges, and each edge is also initially assigned a random weight (<code>w<\/code>). The fundamental computation at each neuron is the dot product of the outputs from the previous layer (<code>x<\/code>) and the weights of the incoming edges (<code>w<\/code>), plus the bias (<code>b<\/code>).<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"378\" height=\"141\" src=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/w-dot-x-plus-b.png\" alt=\"\" class=\"wp-image-45\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/w-dot-x-plus-b.png 378w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/w-dot-x-plus-b-300x112.png 300w\" sizes=\"auto, (max-width: 378px) 100vw, 378px\" \/><\/figure><\/div>\n\n\n\n<p>The result here is <code>z<\/code>, which is fed into the activation function of the neuron. The most common activation function is ReLU, which is <code>f(x) = max(0, x)<\/code>. 
The activation function adds non-linearity, which lets each layer turn the tensor into a more abstract representation as it moves forward through the network.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"430\" height=\"180\" src=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Activation-Function.png\" alt=\"\" class=\"wp-image-46\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Activation-Function.png 430w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Activation-Function-300x126.png 300w\" sizes=\"auto, (max-width: 430px) 100vw, 430px\" \/><\/figure>\n\n\n\n<p>Finally, we reach the output layer, where we calculate the difference between the network\u2019s current output and the expected output (from our labeled training data). Formally, this is calculated by a loss function that might average the squared differences, as you would with mean squared error.<\/p>\n\n\n\n<p>So, we\u2019ve reached the end of the network and the difference between our current output and the expected output is very large. The labeled picture says \u201ccat\u201d and our current network thinks the image is most likely a \u201cdog\u201d. What do we do now?<\/p>\n\n\n\n<p>Backward propagation and gradient descent to the rescue!<\/p>\n\n\n\n<p>This is an optimization problem and there are two kinds of variables we can change. We can update weights and we can update biases. We want to adjust the weights and biases to get the loss function as close to 0 as possible. We can think of the calculations we are using in forward propagation as a chain of functions. 
Each of these functions is differentiable; we can take their derivatives to determine the slope at each step along a tensor\u2019s path.<\/p>\n\n\n\n<p>Since we are working with tensors, we combine partial derivatives to determine the gradient (the direction and rate of fastest increase) and then adjust the weights and biases in the opposite direction. The gradient tells us which weights and biases have the most impact on the resulting loss. We tweak the weights and biases a lot on those edges and neurons that have the most impact, and we tweak them only a little on those edges and neurons that have little impact. Then we run the input through again and check the loss function again. In this way, we work the loss closer and closer to zero. We are looking for the global minimum.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"606\" height=\"380\" src=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Local-Min.png\" alt=\"\" class=\"wp-image-54\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Local-Min.png 606w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Local-Min-300x188.png 300w\" sizes=\"auto, (max-width: 606px) 100vw, 606px\" \/><figcaption>Stuck in a local minimum<\/figcaption><\/figure>\n\n\n\n<p>What if we get stuck in a local minimum? This is a real danger. Even when traversing an n-dimensional space defined by thousands or millions of parameters (the weights and biases), it is possible to get caught in a local minimum without realizing it. This is where stochastic gradient descent (SGD) comes in.<\/p>\n\n\n\n<p>With stochastic gradient descent, the model is trained with mini-batches. The entire training set is split up into bite-sized batches (often of about 128 inputs) which are run one after another. Every batch updates the same shared set of weights and biases, but because each batch is a different random sample of the data, its gradient is a slightly noisy estimate of the gradient over the full training set. 
So, even if the model settles into a shallow local minimum on one batch, the noise from the following batches can jostle it back out, although this does not guarantee that it will find the global minimum. The smaller batches also mean that a CPU will not be crushed under the weight of the entire training set crunching at once. We can also use momentum here; each update carries over a fraction of the previous update, so the model keeps moving in a consistent direction, which increases our odds of pushing past local minima.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"606\" height=\"380\" src=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Global-Min.png\" alt=\"\" class=\"wp-image-55\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Global-Min.png 606w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/Global-Min-300x188.png 300w\" sizes=\"auto, (max-width: 606px) 100vw, 606px\" \/><\/figure>\n\n\n\n<p>To find the gradient, we use back propagation. Let\u2019s explore this in a very basic way. 
As the neural network works forward, it keeps a directed acyclic graph of each operation.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/w-dot-x-plus-b.png\" alt=\"\" class=\"wp-image-45\" width=\"209\" height=\"77\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/w-dot-x-plus-b.png 378w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/w-dot-x-plus-b-300x112.png 300w\" sizes=\"auto, (max-width: 209px) 100vw, 209px\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"193\" height=\"578\" src=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/operation-graph-1.png\" alt=\"\" class=\"wp-image-57\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/operation-graph-1.png 193w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/operation-graph-1-100x300.png 100w\" sizes=\"auto, (max-width: 193px) 100vw, 193px\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p>Since each operation is tracked, we can find the derivative of each node with respect to the previous node. 
We use the chain rule to combine these derivatives and determine how each node contributes to the overall loss function.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"503\" height=\"654\" src=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/backprop-graph.png\" alt=\"\" class=\"wp-image-58\" srcset=\"https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/backprop-graph.png 503w, https:\/\/osu-wams-blogs-uploads.s3.amazonaws.com\/blogs.dir\/5421\/files\/2022\/04\/backprop-graph-231x300.png 231w\" sizes=\"auto, (max-width: 503px) 100vw, 503px\" \/><figcaption>Backpropagation<\/figcaption><\/figure>\n\n\n\n<p>Because it&#8217;s difficult to picture backpropagation and gradient descent using words and static pictures, I\u2019m going to leave you with the beautiful work of Grant Sanderson of 3Blue1Brown, who explains the backpropagation of neural networks with the help of his animated pi-gals and pi-guys.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"What is backpropagation really doing? | Chapter 3, Deep learning\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/Ilg3gGewQ5U?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<p><strong>References:<\/strong><\/p>\n\n\n\n<p>Chollet, F. (November 2021). <em>Deep Learning with Python<\/em> (2<sup>nd<\/sup> edition). Manning Publications. 
<a href=\"https:\/\/learning.oreilly.com\/library\/view\/deep-learning-with\/9781617296864\/\">https:\/\/learning.oreilly.com\/library\/view\/deep-learning-with\/9781617296864\/<\/a><\/p>\n\n\n\n<p>Sanderson, G. (November 2017). <em>Neural Networks: The basics of neural networks, and the math behind how they learn<\/em>. 3Blue1Brown. <a href=\"https:\/\/www.3blue1brown.com\/topics\/neural-networks\">https:\/\/www.3blue1brown.com\/topics\/neural-networks<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last week, I wrote about forward propagation in a basic neural network. This week, I\u2019m going to cover back propagation and gradient descent. To keep it simple, I\u2019m not going to dive deeply into the math, but I will describe it in very rudimentary terms. Let\u2019s do a quick recap of forward propagation in training [&hellip;]<\/p>\n","protected":false},"author":12249,"featured_media":53,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-52","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/posts\/52","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/users\/12249"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/comments?post=52"}],"version-history":[{"count":1,"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/posts\/52\/revisions"}],"predecessor-version":[{"id":59,"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/posts\/52\/revisions\/59"}],"wp:featuredmedia":[{
"embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/media\/53"}],"wp:attachment":[{"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/media?parent=52"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/categories?post=52"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/pimode\/wp-json\/wp\/v2\/tags?post=52"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}