Skip to content
geeksforgeeks
  • Tutorials
    • Python
    • Java
    • Data Structures & Algorithms
    • ML & Data Science
    • Interview Corner
    • Programming Languages
    • Web Development
    • CS Subjects
    • DevOps And Linux
    • School Learning
    • Practice Coding Problems
  • Courses
    • DSA to Development
    • Get IBM Certification
    • Newly Launched!
      • Master Django Framework
      • Become AWS Certified
    • For Working Professionals
      • Interview 101: DSA & System Design
      • Data Science Training Program
      • JAVA Backend Development (Live)
      • DevOps Engineering (LIVE)
      • Data Structures & Algorithms in Python
    • For Students
      • Placement Preparation Course
      • Data Science (Live)
      • Data Structure & Algorithm-Self Paced (C++/JAVA)
      • Master Competitive Programming (Live)
      • Full Stack Development with React & Node JS (Live)
    • Full Stack Development
    • Data Science Program
    • All Courses
  • Data Science
  • Data Science Projects
  • Data Analysis
  • Data Visualization
  • Machine Learning
  • ML Projects
  • Deep Learning
  • NLP
  • Computer Vision
  • Artificial Intelligence
Open In App
Next Article:
Masked Autoencoders in Deep Learning
Next article icon

Masked Autoencoders in Deep Learning

Last Updated : 08 Jul, 2024
Comments
Improve
Suggest changes
Like Article
Like
Report

Masked autoencoders are neural network models designed to reconstruct input data from partially masked or corrupted versions, helping the model learn robust feature representations. They are significant in deep learning for tasks such as data denoising, anomaly detection, and improving model generalization by training on incomplete data.

In this article, we are going to explore the working and architecture of masked autoencoders.

Table of Content

  • Understanding Masked Autoencoders
  • Architecture of Masked Autoencoders
    • 1. Masking Mechanism
    • 2. Encoder
    • 3. Latent Space
    • 4. Decoder
    • 5. Output Layer
  • Mechanism of Masked Autoencoders
    • 1. Masking Strategy
    • 2. Encoding Phase
    • 3. Decoding Phase
  • Implementation of Masked Autoencoders

Autoencoders are a class of artificial neural networks used to learn efficient representations of data, typically for the purpose of dimensionality reduction or feature learning. They consist of an encoder, which compresses the input into a latent-space representation, and a decoder, which reconstructs the input from this representation. Over the years, autoencoders have evolved significantly, leading to various advancements in deep learning.

Understanding Masked Autoencoders

Masked Autoencoders (MAEs) represent a recent advancement in the autoencoder architecture, primarily aimed at improving the efficiency and effectiveness of models in learning representations from high-dimensional data. MAEs have gained significant attention, especially in natural language processing (NLP) and computer vision, for their ability to model complex dependencies within data.

Masked Autoencoders introduce a novel approach by randomly masking portions of the input data and training the model to reconstruct the missing parts. This encourages the model to learn robust and meaningful representations that capture the underlying structure of the data.

Architecture of Masked Autoencoders

1. Masking Mechanism

The input data is partially masked or corrupted before being fed into the autoencoder. This masking can be implemented by randomly setting a portion of the input elements to zero or some constant value. The masking process creates an incomplete version of the original input that the autoencoder will learn to reconstruct.

2. Encoder

The encoder is designed to process the masked input and produce a compact, latent representation. This typically involves several layers, such as:

  • Convolutional Layers (for image data): These layers apply convolution operations to capture spatial hierarchies and features in the data.
  • Fully Connected Layers: These layers are used to compress the data into a lower-dimensional space, retaining essential features.
  • Activation Functions: Common activation functions like ReLU (Rectified Linear Unit) are used to introduce non-linearity and help the network learn complex patterns.

3. Latent Space

The latent space is the compressed representation of the input data. It captures the most important features and patterns while discarding redundant information. The size of the latent space is a critical hyperparameter that determines the balance between compression and the ability to reconstruct the input accurately.

4. Decoder

The decoder is responsible for reconstructing the original input from the latent representation. It typically mirrors the encoder's architecture but in reverse, involving:

  • Deconvolutional (Transposed Convolutional) Layers: These layers expand the latent representation back to the original input size by reversing the convolution operations.
  • Fully Connected Layers: These layers gradually increase the dimensionality of the data to match the original input.
  • Activation Functions: Non-linear activation functions help the decoder reconstruct complex patterns.

5. Output Layer

The output layer of the decoder produces the final reconstructed version of the input data. For image data, this often involves a sigmoid or tanh activation function to ensure the output values are within a valid range (e.g., [0, 1] for normalized pixel values).

Role of Masking in Learning Representations in Autoencoders

Masking introduces an element of noise or missing information in the input data, forcing the autoencoder to learn robust and meaningful representations to reconstruct the original data. By learning to infer the missing parts, the autoencoder becomes better at generalizing and capturing the underlying patterns in the data. This improves its ability to handle incomplete or noisy inputs in real-world applications.

Visual Representation of Architecture

Here's a simple visual representation of the masked autoencoder architecture:

Input Image (with Masking) -> Encoder -> Latent Space -> Decoder -> Reconstructed Image

Each component in this pipeline plays a crucial role in learning and reconstructing the data. The encoder compresses the masked input, the latent space captures essential features, and the decoder reconstructs the original data from this compact representation.

Key Points:

  • Masking Mechanism: Introduces missing information in the input.
  • Encoder: Compresses the input into a latent representation.
  • Latent Space: Holds the essential features of the input.
  • Decoder: Reconstructs the original input from the latent space.
  • Output Layer: Produces the final reconstructed image.

By combining these elements, masked autoencoders effectively learn to reconstruct missing or corrupted parts of the input data, making them powerful tools for various applications in deep learning.

Mechanism of Masked Autoencoders

1. Masking Strategy

There are two types of masking Strategies:

  1. Random Masking: In this strategy, random elements of the input data are masked. This can be represented as: \tilde{x}=M⊙x. Here, \tilde{x} is the masked input, M is a binary mask matrix (with elements 0 or 1), and ⊙ denotes element-wise multiplication. The mask M is generated randomly, masking a fixed percentage of the input.
  2. Structured Masking: In structured masking, specific patterns or blocks of the input data are masked. This is useful in scenarios where the data has inherent structure, such as images or sequences. The mask M can be designed to hide contiguous blocks or specific regions of the input.

Impact of Different Masking Strategies on Learning

  • Random Masking: Encourages the model to learn general features that are distributed across the input. This can improve robustness but may lead to less efficient learning if important features are frequently masked.
  • Structured Masking: Forces the model to understand and reconstruct specific patterns or regions, which can be beneficial for tasks with spatial or sequential dependencies. However, it might lead to overfitting to certain patterns if not applied carefully.

2. Encoding Phase

In the encoding phase, the masked input \tilde{x} is processed to obtain a latent representation z. The encoder function f maps the masked input to the latent space:

z = f(\tilde{x})

Here, the encoder is typically a neural network that processes the masked input and captures essential features in the latent representation.

The goal of the encoder is to learn robust representations that can capture the underlying structure of the input data, even when parts of it are missing. The robustness is achieved by training the model to minimize the reconstruction error, ensuring that the latent representation z contains enough information to accurately reconstruct the original input x.

3. Decoding Phase

Reconstruction from Masked Latent Representations

In the decoding phase, the latent representation z is used to reconstruct the original input. The decoder function g maps the latent representation back to the input space:

\hat{x} = g(z)

where \hat{x} is the reconstructed input. The decoder is designed to fill in the missing parts of the input based on the learned representation.

Handling Missing Information

The reconstruction process involves generating the missing information that was masked during the input phase. The effectiveness of this process depends on the quality of the latent representation z learned by the encoder. The reconstruction error L is typically measured using a loss function such as Mean Squared Error (MSE):

L(x, \hat{x}) = \| x - \hat{x} \|^2

The model is trained to minimize this error, ensuring that the reconstructed input \hat{x} is as close as possible to the original input x.

Implementation of Masked Autoencoders

Step 1: Import Necessary Libraries

This step imports the required libraries for building and training the masked autoencoder.

import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
import matplotlib.pyplot as plt

Step 2: Load and Preprocess the MNIST Dataset

The MNIST dataset is loaded, normalized, and reshaped to include a channel dimension.

# Load the MNIST dataset
(x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Add a channel dimension to the images
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)

Step 3: Define the Masking Layer

This custom Keras layer applies random masking to the input data based on a specified mask ratio.

class MaskingLayer(layers.Layer):
def __init__(self, mask_ratio=0.5, **kwargs):
super(MaskingLayer, self).__init__(**kwargs)
self.mask_ratio = mask_ratio

def call(self, inputs):
# Create a random mask
mask = tf.random.uniform(shape=tf.shape(inputs)) > self.mask_ratio
return tf.where(mask, inputs, tf.zeros_like(inputs))

Step 4: Apply the Masking Layer and Visualize the Result

The masking layer is applied to a sample of the training data, and the original and masked images are plotted.

# Example of using the MaskingLayer
masking_layer = MaskingLayer(mask_ratio=0.5)
masked_x_train = masking_layer(x_train[:5])

# Plot original and masked images
plt.figure(figsize=(10, 5))
for i in range(5):
plt.subplot(2, 5, i + 1)
plt.imshow(x_train[i].squeeze(), cmap='gray')
plt.axis('off')

plt.subplot(2, 5, i + 6)
plt.imshow(masked_x_train[i].numpy().squeeze(), cmap='gray')
plt.axis('off')

plt.show()

Step 5: Build the Encoder

The encoder network is built using convolutional layers to compress the input data into a latent representation.

# Encoder
def build_encoder(input_shape):
inputs = layers.Input(shape=input_shape)
x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)
x = layers.MaxPooling2D((2, 2), padding='same')(x)
x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = layers.MaxPooling2D((2, 2), padding='same')(x)
latent = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
return models.Model(inputs, latent, name='encoder')

Step 6: Build the Decoder

The decoder network is built using deconvolutional (upsampling) layers to reconstruct the input data from the latent representation.

# Decoder
def build_decoder(latent_shape):
inputs = layers.Input(shape=latent_shape)
x = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(inputs)
x = layers.UpSampling2D((2, 2))(x)
x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = layers.UpSampling2D((2, 2))(x)
x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
outputs = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
return models.Model(inputs, outputs, name='decoder')

Step 7: Initialize Encoder and Decoder

The encoder and decoder models are instantiated.

input_shape = x_train.shape[1:]
encoder = build_encoder(input_shape)
latent_shape = encoder.output_shape[1:]
decoder = build_decoder(latent_shape)

Step 8: Build the Masked Autoencoder Model

The masked autoencoder model is created by combining the masking layer, encoder, and decoder.

# Masked Autoencoder
inputs = layers.Input(shape=input_shape)
masked_inputs = MaskingLayer(mask_ratio=0.5)(inputs)
latent = encoder(masked_inputs)
outputs = decoder(latent)

autoencoder = models.Model(inputs, outputs, name='masked_autoencoder')
autoencoder.compile(optimizer='adam', loss='mse')

Step 9: Train the Masked Autoencoder

The autoencoder model is trained using the MNIST training data.

# Train the model
autoencoder.fit(x_train, x_train, epochs=10, batch_size=128, validation_data=(x_test, x_test))

Step 10: Encode and Decode Images

The trained encoder and decoder are used to encode and decode a sample of the masked training data.

# Encode and decode some images
encoded_imgs = encoder.predict(masked_x_train)
decoded_imgs = decoder.predict(encoded_imgs)

Step 11: Visualize the Original, Masked, and Reconstructed Images

The original, masked, and reconstructed images are plotted for comparison.

# Plot original, masked, and reconstructed images
plt.figure(figsize=(15, 5))
for i in range(5):
# Original images
plt.subplot(3, 5, i + 1)
plt.imshow(x_train[i].squeeze(), cmap='gray')
plt.axis('off')

# Masked images
plt.subplot(3, 5, i + 6)
plt.imshow(masked_x_train[i].numpy().squeeze(), cmap='gray')
plt.axis('off')

# Reconstructed images
plt.subplot(3, 5, i + 11)
plt.imshow(decoded_imgs[i].squeeze(), cmap='gray')
plt.axis('off')

plt.show()

Complete Code

Python
import tensorflow as tf from tensorflow.keras import layers, models import numpy as np import matplotlib.pyplot as plt  # Load the MNIST dataset (x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data() x_train = x_train.astype('float32') / 255.0 x_test = x_test.astype('float32') / 255.0  # Add a channel dimension to the images x_train = np.expand_dims(x_train, axis=-1) x_test = np.expand_dims(x_test, axis=-1)  class MaskingLayer(layers.Layer):     def __init__(self, mask_ratio=0.5, **kwargs):         super(MaskingLayer, self).__init__(**kwargs)         self.mask_ratio = mask_ratio      def call(self, inputs):         # Create a random mask         mask = tf.random.uniform(shape=tf.shape(inputs)) > self.mask_ratio         return tf.where(mask, inputs, tf.zeros_like(inputs))  # Example of using the MaskingLayer masking_layer = MaskingLayer(mask_ratio=0.5) masked_x_train = masking_layer(x_train[:5])  # Plot original and masked images plt.figure(figsize=(10, 5)) for i in range(5):     plt.subplot(2, 5, i + 1)     plt.imshow(x_train[i].squeeze(), cmap='gray')     plt.axis('off')      plt.subplot(2, 5, i + 6)     plt.imshow(masked_x_train[i].numpy().squeeze(), cmap='gray')     plt.axis('off')  plt.show()  # Encoder def build_encoder(input_shape):     inputs = layers.Input(shape=input_shape)     x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)     x = layers.MaxPooling2D((2, 2), padding='same')(x)     x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)     x = layers.MaxPooling2D((2, 2), padding='same')(x)     latent = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)     return models.Model(inputs, latent, name='encoder')  # Decoder def build_decoder(latent_shape):     inputs = layers.Input(shape=latent_shape)     x = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(inputs)     x = layers.UpSampling2D((2, 2))(x)     x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)     x = layers.UpSampling2D((2, 2))(x)     x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)     outputs = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)     return models.Model(inputs, outputs, name='decoder')  input_shape = x_train.shape[1:] encoder = build_encoder(input_shape) latent_shape = encoder.output_shape[1:] decoder = build_decoder(latent_shape)  # Masked Autoencoder inputs = layers.Input(shape=input_shape) masked_inputs = MaskingLayer(mask_ratio=0.5)(inputs) latent = encoder(masked_inputs) outputs = decoder(latent)  autoencoder = models.Model(inputs, outputs, name='masked_autoencoder') autoencoder.compile(optimizer='adam', loss='mse')  # Train the model autoencoder.fit(x_train, x_train, epochs=10, batch_size=128, validation_data=(x_test, x_test))  # Encode and decode some images encoded_imgs = encoder.predict(masked_x_train) decoded_imgs = decoder.predict(encoded_imgs)  # Plot original, masked, and reconstructed images plt.figure(figsize=(15, 5)) for i in range(5):     # Original images     plt.subplot(3, 5, i + 1)     plt.imshow(x_train[i].squeeze(), cmap='gray')     plt.axis('off')      # Masked images     plt.subplot(3, 5, i + 6)     plt.imshow(masked_x_train[i].numpy().squeeze(), cmap='gray')     plt.axis('off')      # Reconstructed images     plt.subplot(3, 5, i + 11)     plt.imshow(decoded_imgs[i].squeeze(), cmap='gray')     plt.axis('off')  plt.show() 

Output:

Epoch 1/10
469/469 [==============================] - 508s 1s/step - loss: 0.1127 - val_loss: 0.1140
Epoch 2/10
469/469 [==============================] - 488s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 3/10
469/469 [==============================] - 487s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 4/10
469/469 [==============================] - 479s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 5/10
469/469 [==============================] - 482s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 6/10
469/469 [==============================] - 479s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 7/10
469/469 [==============================] - 480s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 8/10
469/469 [==============================] - 482s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 9/10
469/469 [==============================] - 491s 1s/step - loss: 0.1120 - val_loss: 0.1140
Epoch 10/10
469/469 [==============================] - 481s 1s/step - loss: 0.1120 - val_loss: 0.1140
download-(18)
First Row represents original image, second row represents masked image and the third row represent the reconstructed image.

Note: If we increase the number of epochs, we can get better output of the reconstructed image.

Conclusion

Masked autoencoders, through their unique approach of reconstructing data from masked inputs, enhance model robustness and generalization. They are crucial in advancing deep learning applications such as data denoising and anomaly detection, making them indispensable in the field.


Next Article
Masked Autoencoders in Deep Learning

G

geeksforgn62g
Improve
Article Tags :
  • Blogathon
  • Artificial Intelligence
  • AI-ML-DS
  • AI-ML-DS With Python
  • Data Science Blogathon 2024

Similar Reads

    Sparse Autoencoders in Deep Learning
    Sparse autoencoders are a specific form of autoencoder that's been trained for feature learning and dimensionality reduction. As opposed to regular autoencoders, which are trained to reconstruct the input data in the output, sparse autoencoders add a sparsity penalty that encourages the hidden layer
    5 min read
    Autoencoders in Machine Learning
    Autoencoders are a special type of neural networks that learn to compress data into a compact form and then reconstruct it to closely match the original input. They consist of an:Encoder that captures important features by reducing dimensionality.Decoder that rebuilds the data from this compressed r
    8 min read
    Denoising AutoEncoders In Machine Learning
    Autoencoders are types of neural network architecture used for unsupervised learning. The architecture consists of an encoder and a decoder. The encoder encodes the input data into a lower dimensional space while the decoder decodes the encoded data back to the original input. The network is trained
    10 min read
    Challenges in Deep Learning
    Deep learning, a branch of artificial intelligence, uses neural networks to analyze and learn from large datasets. It powers advancements in image recognition, natural language processing, and autonomous systems. Despite its impressive capabilities, deep learning is not without its challenges. It in
    7 min read
    Role of KL-divergence in Variational Autoencoders
    Variational Autoencoders Variational autoencoder was proposed in 2013 by Knigma and Welling at Google and Qualcomm. A variational autoencoder (VAE) provides a probabilistic manner for describing an observation in latent space. Thus, rather than building an encoder that outputs a single value to desc
    9 min read
geeksforgeeks-footer-logo
Corporate & Communications Address:
A-143, 7th Floor, Sovereign Corporate Tower, Sector- 136, Noida, Uttar Pradesh (201305)
Registered Address:
K 061, Tower K, Gulshan Vivante Apartment, Sector 137, Noida, Gautam Buddh Nagar, Uttar Pradesh, 201305
GFG App on Play Store GFG App on App Store
Advertise with us
  • Company
  • About Us
  • Legal
  • Privacy Policy
  • In Media
  • Contact Us
  • Advertise with us
  • GFG Corporate Solution
  • Placement Training Program
  • Languages
  • Python
  • Java
  • C++
  • PHP
  • GoLang
  • SQL
  • R Language
  • Android Tutorial
  • Tutorials Archive
  • DSA
  • Data Structures
  • Algorithms
  • DSA for Beginners
  • Basic DSA Problems
  • DSA Roadmap
  • Top 100 DSA Interview Problems
  • DSA Roadmap by Sandeep Jain
  • All Cheat Sheets
  • Data Science & ML
  • Data Science With Python
  • Data Science For Beginner
  • Machine Learning
  • ML Maths
  • Data Visualisation
  • Pandas
  • NumPy
  • NLP
  • Deep Learning
  • Web Technologies
  • HTML
  • CSS
  • JavaScript
  • TypeScript
  • ReactJS
  • NextJS
  • Bootstrap
  • Web Design
  • Python Tutorial
  • Python Programming Examples
  • Python Projects
  • Python Tkinter
  • Python Web Scraping
  • OpenCV Tutorial
  • Python Interview Question
  • Django
  • Computer Science
  • Operating Systems
  • Computer Network
  • Database Management System
  • Software Engineering
  • Digital Logic Design
  • Engineering Maths
  • Software Development
  • Software Testing
  • DevOps
  • Git
  • Linux
  • AWS
  • Docker
  • Kubernetes
  • Azure
  • GCP
  • DevOps Roadmap
  • System Design
  • High Level Design
  • Low Level Design
  • UML Diagrams
  • Interview Guide
  • Design Patterns
  • OOAD
  • System Design Bootcamp
  • Interview Questions
  • Inteview Preparation
  • Competitive Programming
  • Top DS or Algo for CP
  • Company-Wise Recruitment Process
  • Company-Wise Preparation
  • Aptitude Preparation
  • Puzzles
  • School Subjects
  • Mathematics
  • Physics
  • Chemistry
  • Biology
  • Social Science
  • English Grammar
  • Commerce
  • World GK
  • GeeksforGeeks Videos
  • DSA
  • Python
  • Java
  • C++
  • Web Development
  • Data Science
  • CS Subjects
@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved
We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy
Lightbox
Improvement
Suggest Changes
Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.
geeksforgeeks-suggest-icon
Create Improvement
Enhance the article with your expertise. Contribute to the GeeksforGeeks community and help create better learning resources for all.
geeksforgeeks-improvement-icon
Suggest Changes
min 4 words, max Words Limit:1000

Thank You!

Your suggestions are valuable to us.

What kind of Experience do you want to share?

Interview Experiences
Admission Experiences
Career Journeys
Work Experiences
Campus Experiences
Competitive Exam Experiences