Very deep neural networks are difficult to train because of vanishing and exploding gradient problems. In practice, with a plain network, the training error decreases as you add layers up to a point, but then tends to go back up. In theory, making a neural network deeper should only help it do better and better on the training set.

In reality, the training error of a plain network gets worse if you pick a network that is too deep. With ResNet, the training error can keep going down even as the number of layers grows.
ResNet enables you to train very deep neural networks, sometimes with more than 100 layers. ResNet is built from residual blocks.
Residual Block
Consider two layers of a neural network: you start with some activation a[l], go to a[l+1], and then to a[l+2]. In other words, for information from a[l] to flow to a[l+2], it needs to go through all of these steps, which we call the main path of this set of layers.

In a ResNet, we make a change to this: we take a[l] and fast-forward it, copying it much further into the neural network, to just before a[l+2]. We add a[l] before applying the non-linearity, and this path is the shortcut.
Shortcut Connections
A shortcut connection is also called a skip connection.
Rather than following the main path, the information from a[l] can now follow the shortcut to go much deeper into the neural network. This means the last equation of the main path, a[l+2] = g(z[l+2]), goes away, and we instead have a[l+2] = g(z[l+2] + a[l]), where g is the non-linearity.
Adding a[l] in this way is what makes this a residual block.
Using residual blocks allows you to train much deeper neural networks. You build a ResNet by taking many of these blocks and stacking them together to form a deep network.
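To make the effect of the shortcut concrete, here is a minimal NumPy sketch of a two-layer block with and without the added a[l]. It assumes fully connected layers and ReLU as the non-linearity; the names relu, plain_block, and residual_block are illustrative and not part of any library. With the second layer's weights set to zero, the plain block outputs zeros, while the residual block passes a[l] through unchanged, even when many blocks are stacked.

```python
import numpy as np


def relu(z):
    """Element-wise ReLU non-linearity."""
    return np.maximum(z, 0.0)


def plain_block(a_l, W1, b1, W2, b2):
    """Main path only: a[l] -> a[l+1] -> a[l+2]."""
    a_l1 = relu(W1 @ a_l + b1)
    z_l2 = W2 @ a_l1 + b2
    return relu(z_l2)


def residual_block(a_l, W1, b1, W2, b2):
    """Same main path, but a[l] is added before the last non-linearity."""
    a_l1 = relu(W1 @ a_l + b1)
    z_l2 = W2 @ a_l1 + b2
    return relu(z_l2 + a_l)


rng = np.random.default_rng(0)
n = 4
# Activations that come out of a ReLU are non-negative.
a_l = relu(rng.normal(size=n))
W1, b1 = rng.normal(size=(n, n)), rng.normal(size=n)
# Hypothetical extreme case: the second layer has learned nothing.
W2, b2 = np.zeros((n, n)), np.zeros(n)

out_plain = plain_block(a_l, W1, b1, W2, b2)
out_residual = residual_block(a_l, W1, b1, W2, b2)

# Stack 50 such residual blocks: the signal still reaches the output.
stacked = a_l
for _ in range(50):
    stacked = residual_block(stacked, W1, b1, W2, b2)
```

In this setting the identity function is easy for a residual block to represent, which is one common intuition for why adding more blocks need not hurt the training error.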
Building ResNet in TensorFlow using Keras API
Based on the plain network, we insert shortcut connections which turn the network into its counterpart residual version. The identity shortcuts can be directly used when the input and output are of the same dimensions.

from tensorflow.keras import backend, layers, models, regularizers

# Hyperparameters referenced by the blocks in this article. The original
# snippet does not define them; the values below are assumptions.
L2_WEIGHT_DECAY = 1e-4
BATCH_NORM_DECAY = 0.9
BATCH_NORM_EPSILON = 1e-5


def identity_block(input_tensor, kernel_size, filters):
    """The identity block is the block that has no conv layer at shortcut.

    # Arguments
        input_tensor: input tensor
        kernel_size: the kernel size of middle conv layer at main path
            (3 in ResNet50)
        filters: list of integers, the filters of 3 conv layer at main path

    # Returns
        Output tensor for the block.
    """
    filters1, filters2, filters3 = filters
    if backend.image_data_format() == 'channels_last':
        bn_axis = 3
    else:
        bn_axis = 1

    # 1x1 convolution: reduce the number of channels.
    x = layers.Conv2D(filters1, (1, 1), use_bias=False,
                      kernel_initializer='he_normal',
                      kernel_regularizer=regularizers.l2(L2_WEIGHT_DECAY))(input_tensor)
    x = layers.BatchNormalization(axis=bn_axis,
                                  momentum=BATCH_NORM_DECAY,
                                  epsilon=BATCH_NORM_EPSILON)(x)
    x = layers.Activation('relu')(x)

    # kernel_size x kernel_size convolution on the reduced channels.
    x = layers.Conv2D(filters2, kernel_size, padding='same', use_bias=False,
                      kernel_initializer='he_normal',
                      kernel_regularizer=regularizers.l2(L2_WEIGHT_DECAY))(x)
    x = layers.BatchNormalization(axis=bn_axis,
                                  momentum=BATCH_NORM_DECAY,
                                  epsilon=BATCH_NORM_EPSILON)(x)
    x = layers.Activation('relu')(x)

    # 1x1 convolution: restore the number of channels.
    x = layers.Conv2D(filters3, (1, 1), use_bias=False,
                      kernel_initializer='he_normal',
                      kernel_regularizer=regularizers.l2(L2_WEIGHT_DECAY))(x)
    x = layers.BatchNormalization(axis=bn_axis,
                                  momentum=BATCH_NORM_DECAY,
                                  epsilon=BATCH_NORM_EPSILON)(x)

    # Identity shortcut: add the input before the final non-linearity.
    x = layers.add([x, input_tensor])
    x = layers.Activation('relu')(x)
    return x
Projection Shortcuts
The dimensions of x and F must be equal for an identity shortcut. If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection Ws by the shortcut connections to match the dimensions:
y = F(x, {Wi}) + Ws x.
The identity mapping is sufficient for addressing the degradation problem and is economical, so Ws is only used when matching dimensions.
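The difference between the two kinds of shortcut can be sketched on plain vectors. This is a minimal NumPy illustration, assuming F(x) has already been computed and that Ws is an ordinary matrix; the helper name shortcut_add is hypothetical and not part of Keras.

```python
import numpy as np


def shortcut_add(F_x, x, W_s=None):
    """Return F(x) + x (identity) or F(x) + W_s x (projection)."""
    if W_s is None:
        # Identity shortcut: only valid when the dimensions already match.
        if F_x.shape != x.shape:
            raise ValueError("identity shortcut needs matching dimensions")
        return F_x + x
    # Projection shortcut: W_s maps x into the dimension of F(x).
    return F_x + W_s @ x


rng = np.random.default_rng(0)
x = rng.normal(size=4)
F_same = rng.normal(size=4)      # residual branch keeps 4 dimensions
F_wide = rng.normal(size=8)      # residual branch widens to 8 dimensions
W_s = rng.normal(size=(8, 4))    # projection from 4 to 8 dimensions

y_identity = shortcut_add(F_same, x)
y_projection = shortcut_add(F_wide, x, W_s)
```

In the convolutional blocks of this article, the role of Ws is played by a 1×1 convolution on the shortcut path.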
def conv_block(input_tensor, kernel_size, filters, strides=(2, 2)):
    """A block that has a conv layer at shortcut.

    # Arguments
        input_tensor: input tensor
        kernel_size: the kernel size of middle conv layer at main path
            (3 in ResNet50)
        filters: list of integers, the filters of 3 conv layer at main path
        strides: strides of the middle conv layer and of the shortcut conv

    # Returns
        Output tensor for the block.

    Note that from stage 3, the second conv layer at main path is with
    strides=(2, 2), and the shortcut should have strides=(2, 2) as well.
    """
    filters1, filters2, filters3 = filters
    if backend.image_data_format() == 'channels_last':
        bn_axis = 3
    else:
        bn_axis = 1

    # 1x1 convolution: reduce the number of channels.
    x = layers.Conv2D(filters1, (1, 1), use_bias=False,
                      kernel_initializer='he_normal',
                      kernel_regularizer=regularizers.l2(L2_WEIGHT_DECAY))(input_tensor)
    x = layers.BatchNormalization(axis=bn_axis,
                                  momentum=BATCH_NORM_DECAY,
                                  epsilon=BATCH_NORM_EPSILON)(x)
    x = layers.Activation('relu')(x)

    # kernel_size x kernel_size convolution; downsamples when strides > 1.
    x = layers.Conv2D(filters2, kernel_size, strides=strides, padding='same',
                      use_bias=False, kernel_initializer='he_normal',
                      kernel_regularizer=regularizers.l2(L2_WEIGHT_DECAY))(x)
    x = layers.BatchNormalization(axis=bn_axis,
                                  momentum=BATCH_NORM_DECAY,
                                  epsilon=BATCH_NORM_EPSILON)(x)
    x = layers.Activation('relu')(x)

    # 1x1 convolution: increase the number of channels.
    x = layers.Conv2D(filters3, (1, 1), use_bias=False,
                      kernel_initializer='he_normal',
                      kernel_regularizer=regularizers.l2(L2_WEIGHT_DECAY))(x)
    x = layers.BatchNormalization(axis=bn_axis,
                                  momentum=BATCH_NORM_DECAY,
                                  epsilon=BATCH_NORM_EPSILON)(x)

    # Projection shortcut: a 1x1 convolution matches channels and stride.
    shortcut = layers.Conv2D(filters3, (1, 1), strides=strides, use_bias=False,
                             kernel_initializer='he_normal',
                             kernel_regularizer=regularizers.l2(L2_WEIGHT_DECAY))(input_tensor)
    shortcut = layers.BatchNormalization(axis=bn_axis,
                                         momentum=BATCH_NORM_DECAY,
                                         epsilon=BATCH_NORM_EPSILON)(shortcut)

    x = layers.add([x, shortcut])
    x = layers.Activation('relu')(x)
    return x
For each residual function F, we use a stack of 3 layers. The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions.
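A quick weight count shows why the bottleneck is economical. The sketch below assumes the 64, 64, 256 filter configuration of the first stage applied to a 256-channel input, ignores biases and batch normalization parameters, and compares against two 3×3 convolutions at the full 256 channels. That comparison block is an illustrative assumption, not something defined in this article, and conv_params is a hypothetical helper.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution without bias."""
    return kernel * kernel * c_in * c_out


# Bottleneck: 1x1 reduces 256 -> 64, 3x3 works on 64 channels,
# 1x1 restores 64 -> 256.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

# Illustrative alternative: two 3x3 convolutions at the full 256 channels.
two_3x3 = 2 * conv_params(3, 256, 256)

print(bottleneck)   # 69632
print(two_3x3)      # 1179648
```

Under these assumptions the bottleneck uses roughly 17 times fewer weights, because the expensive 3×3 convolution only sees the reduced 64 channels.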
Implement ResNet50
Building blocks are shown in brackets with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

def resnet50(num_classes, input_shape):
    img_input = layers.Input(shape=input_shape)

    if backend.image_data_format() == 'channels_first':
        x = layers.Lambda(lambda x: backend.permute_dimensions(x, (0, 3, 1, 2)),
                          name='transpose')(img_input)
        bn_axis = 1
    else:  # channels_last
        x = img_input
        bn_axis = 3

    # Conv1 (7x7, 64, stride=2)
    x = layers.ZeroPadding2D(padding=(3, 3))(x)
    x = layers.Conv2D(64, (7, 7), strides=(2, 2), padding='valid',
                      use_bias=False, kernel_initializer='he_normal',
                      kernel_regularizer=regularizers.l2(L2_WEIGHT_DECAY))(x)
    x = layers.BatchNormalization(axis=bn_axis,
                                  momentum=BATCH_NORM_DECAY,
                                  epsilon=BATCH_NORM_EPSILON)(x)
    x = layers.Activation('relu')(x)
    x = layers.ZeroPadding2D(padding=(1, 1))(x)

    # 3x3 max pool, stride=2
    x = layers.MaxPooling2D((3, 3), strides=(2, 2))(x)

    # Conv2_x
    # 1×1, 64
    # 3×3, 64
    # 1×1, 256
    x = conv_block(x, 3, [64, 64, 256], strides=(1, 1))
    x = identity_block(x, 3, [64, 64, 256])
    x = identity_block(x, 3, [64, 64, 256])

    # Conv3_x
    # 1×1, 128
    # 3×3, 128
    # 1×1, 512
    x = conv_block(x, 3, [128, 128, 512])
    x = identity_block(x, 3, [128, 128, 512])
    x = identity_block(x, 3, [128, 128, 512])
    x = identity_block(x, 3, [128, 128, 512])

    # Conv4_x
    # 1×1, 256
    # 3×3, 256
    # 1×1, 1024
    x = conv_block(x, 3, [256, 256, 1024])
    x = identity_block(x, 3, [256, 256, 1024])
    x = identity_block(x, 3, [256, 256, 1024])
    x = identity_block(x, 3, [256, 256, 1024])
    x = identity_block(x, 3, [256, 256, 1024])
    x = identity_block(x, 3, [256, 256, 1024])

    # Conv5_x
    # 1×1, 512
    # 3×3, 512
    # 1×1, 2048
    x = conv_block(x, 3, [512, 512, 2048])
    x = identity_block(x, 3, [512, 512, 2048])
    x = identity_block(x, 3, [512, 512, 2048])

    # average pool, num_classes-d fc, softmax
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(
        num_classes, activation='softmax',
        kernel_regularizer=regularizers.l2(L2_WEIGHT_DECAY),
        bias_regularizer=regularizers.l2(L2_WEIGHT_DECAY))(x)

    # Create model.
    return models.Model(img_input, x, name='resnet50')
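As a sanity check on the stage layout in resnet50, the following pure-Python sketch counts the weighted layers and traces the spatial size of the feature map. It assumes a 224×224 input, which the article does not state, and the helpers padded_valid_out and same_out are illustrative names. The depth count follows the usual convention of counting the first convolution, the three convolutions in each block, and the final dense layer, but not the convolutions on projection shortcuts.

```python
import math


def padded_valid_out(size, kernel, stride, pad):
    """Output size of a 'valid' conv or pool applied after explicit zero padding."""
    return (size + 2 * pad - kernel) // stride + 1


def same_out(size, stride):
    """Output size of a convolution with padding='same'."""
    return math.ceil(size / stride)


# Blocks per stage (Conv2_x .. Conv5_x), read off the resnet50 function.
blocks_per_stage = [3, 4, 6, 3]
# conv1 + three convolutions per block + final dense layer.
depth = 1 + 3 * sum(blocks_per_stage) + 1

size = 224                                  # assumed input height and width
size = padded_valid_out(size, 7, 2, 3)      # Conv1: 7x7, stride 2, padding 3
conv1_size = size
size = padded_valid_out(size, 3, 2, 1)      # max pool: 3x3, stride 2, padding 1

stage_sizes = {}
stage_strides = [1, 2, 2, 2]                # stride of the first block in each stage
for name, stride in zip(["conv2_x", "conv3_x", "conv4_x", "conv5_x"], stage_strides):
    size = same_out(size, stride)
    stage_sizes[name] = size

print(depth)        # 50
print(conv1_size)   # 112
print(stage_sizes)  # {'conv2_x': 56, 'conv3_x': 28, 'conv4_x': 14, 'conv5_x': 7}
```

Each stride-2 conv_block halves the feature map, so a 224×224 input ends as a 7×7 map with 2048 channels before global average pooling.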
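We adopt batch normalization (BN) right after each convolution and before the activation. The ResNet paper compares options for shortcuts that increase dimensions: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free; (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity. The conv_block above uses a projection on its shortcut, which corresponds to option (B).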