Tutorials
There will be a lunch break from 12:30 to 13:30.
Tutorial 1: (Half-Day) Mobile and Edge Intelligence via Green Learning
Presenters: C.-C. Jay Kuo, University of Southern California, USA
Date and Time: Sunday, 14 September 2025, 09:00 – 12:30 (including a 30-minute break)
Location: TBD
Artificial intelligence and machine learning technologies have developed rapidly over the last decade. At their core are large amounts of annotated training data and deep learning networks. Although deep learning networks have significantly impacted application domains such as computer vision, natural language processing, autonomous driving, and robotics navigation, they have several inherent shortcomings: they are mathematically intractable, vulnerable to adversarial attacks, and demand large amounts of annotated training data. Their training is also computationally intensive because of the use of backpropagation for end-to-end network optimization, and their large model sizes make deployment on mobile and edge devices a significant challenge.
Mobile and edge intelligence will prevail in the modern AI era and is a hot topic nowadays. Most researchers pursue this goal through deep-learning-based model compression, which can reduce the model size by 50-80% with slight performance degradation. However, model compression relies on an existing larger model, so the training cost of that large model remains, and the compression step itself demands additional resources.
In contrast, the emerging green learning methodology can reduce the model size of its deep-learning counterpart by 95-99%. Training can be conducted from scratch, and the resulting AI model is small without any compression, making it ideal for mobile and edge devices.
I have worked on green learning and AI since 2014, published many papers on this topic (see the recent publication list), and coined the term "green learning" for this emerging field. Green learning demands low power consumption in both training and inference. In addition, it has several attractive characteristics: small model sizes, fewer training samples, mathematical transparency, ease of incremental learning, etc. It is particularly attractive for mobile/edge computing.
Green learning relies more heavily on signal-processing disciplines and concepts such as filter banks, linear algebra, subspace learning, and probability theory. Although it exploits optimization, it avoids end-to-end system optimization, which is a non-convex problem. Instead, it adopts modularized optimization, where each subproblem can be cast as a convex optimization. Green learning therefore suits researchers, engineers, and students with signal/image processing backgrounds.
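To illustrate the filter-bank flavour of this methodology, the following is a minimal NumPy sketch of one unsupervised, PCA-based filter-bank stage learned from image patches. It is only an illustration of the idea; the actual green learning transforms (e.g., Saab/PixelHop) differ in important details, and all dimensions here are toy values.

```python
# Minimal sketch (not the presenter's code): one unsupervised filter-bank stage
# learned by PCA over local patches, illustrating how green learning derives
# convolution-like filters without backpropagation.
import numpy as np

def extract_patches(images, patch=3, stride=1):
    """Collect patch x patch neighborhoods from a batch of grayscale images."""
    n, h, w = images.shape
    out = []
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            out.append(images[:, i:i + patch, j:j + patch].reshape(n, -1))
    return np.concatenate(out, axis=0)            # (num_patches, patch*patch)

def learn_pca_filters(patches, num_filters=8):
    """Learn data-driven filters as the top principal components of the patches."""
    mean = patches.mean(axis=0, keepdims=True)
    cov = np.cov(patches - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]              # strongest components first
    return mean, eigvecs[:, order[:num_filters]]   # (patch*patch, num_filters)

def apply_filters(patches, mean, filters):
    """Project patches onto the learned filters (one feed-forward pass, no labels)."""
    return (patches - mean) @ filters

# Toy usage with random stand-in "images"
rng = np.random.default_rng(0)
imgs = rng.standard_normal((16, 32, 32))
patches = extract_patches(imgs)
mean, filters = learn_pca_filters(patches)
features = apply_filters(patches, mean, filters)
print(features.shape)
```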
I organized three tutorials on green learning at ICIP 2020, ICIP 2021, and ICIP 2022 to promote this emerging area. At ICIP 2020, I linked convolution layers with the unsupervised representation learning module of green learning. At ICIP 2021, I interpreted unsupervised representation learning using filter bank theory and showed several application examples, such as face biometrics and point cloud classification, segmentation, and registration. At ICIP 2022, I introduced two new green learning modules and showed more applications.
There has been significant progress in green learning since 2022. Related publications from 2023 and 2024 are listed in the "Recent Publications" section. I will give a completely new tutorial on this topic at ICIP 2025. Besides the basic theory, I will focus on mobile/edge intelligence applications.
The outline of the proposed tutorial is given below.
- Introduction to Green Learning
a. Unsupervised Representation Learning
b. Supervised Feature Learning
c. Supervised Decision Learning
- Green Pre-Trained Model
a. Green-Learning-based Pre-Trained Models
b. Comparison between DL-based and GL-based Pre-Trained Models
- Mobile/Edge Intelligence Applications
a. Object Detection/Classification
b. Face Recognition
c. Green Forensics
Tutorial 2: (Half-Day) Generation of super resolution images by the application of Generative Adversarial deep learning networks (GANs)
Presenters: Prof Xiaohong (Sharon) Gao, Professor in Computer Vision and Imaging Science, Middlesex University, London, UK
Date and Time: Sunday, 14 September 2025, 09:00 – 12:30 (including a 30-minute break)
Location: TBD
A super-resolution (SR) image is an image with enhanced resolution that depicts increased clarity, sharpness and detail (usually four-fold, ×4) without compromising its original content or character. SR is employed to reveal an image in fine detail. While SR images can be produced by emerging optical devices, e.g. super-resolution microscopy (SRM), which changes the optical resolution from ~250 nm to ~10 nm, computational models appear to be more viable, in particular with the advances of current state-of-the-art artificial intelligence (AI) techniques. This tutorial aims to address AI-based techniques for generating SR images, with a focus on the family of generative adversarial deep learning networks (GANs). The architecture of the texture transformer (TTSR) will also be discussed in conjunction with its application to the detection of human papillomavirus (HPV) from microscopic images.
In the computer vision field, there are broadly two ways to approach the fundamental low-level single image super-resolution (SISR) problem: one from a theoretical point of view and another based on subjects' visual appearance evaluation. While SISR attempts to recover a high-resolution (HR) image from a single low-resolution (LR) one, the application of deep learning neural networks can achieve state-of-the-art results. One of these models is the GAN, an approach to generative modelling using deep learning methods such as convolutional neural networks (CNNs). Subsequently, a number of network architectures have been proposed to improve SR performance, mainly by improving Peak Signal-to-Noise Ratio (PSNR) values. This, however, tends to be in disagreement with human observers' evaluations, as pointed out in SRGAN, one of the seminal works on improving the visual quality of generated SR images. Towards this end, several perceptual-driven methods have been advanced, including the incorporation of a perceptual loss to optimize SR models in a feature space instead of pixel space, and segmentation of semantic images prior to recovering detailed textures. Significantly, the application of a GAN in SRGAN considerably improves the overall visual quality of reconstruction over PSNR-oriented methods by encouraging the network to favour solutions that look more like natural images. In addition, an enhanced SRGAN (ESRGAN) further improves visual quality by reducing the accompanying generated artefacts. ESRGAN introduces a relativistic GAN in which the Residual-in-Residual Dense Block (RRDB) without batch normalization is utilised as the basic network building block, whereas SRGAN is built with residual blocks; ESRGAN offers consistently better visual quality with more realistic and natural textures.
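To make the RRDB idea concrete, below is a minimal PyTorch sketch of a Residual-in-Residual Dense Block in the spirit of ESRGAN: densely connected convolutions, no batch normalization, and residual scaling. Channel counts, the number of inner convolutions, and the 0.2 scaling factor are illustrative assumptions rather than the published configuration.

```python
# Minimal PyTorch sketch of ESRGAN's Residual-in-Residual Dense Block (RRDB) idea:
# densely connected convolutions, no batch normalization, residual scaling.
# Channel counts and the 0.2 scaling factor are illustrative assumptions.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels + i * growth, growth, 3, padding=1) for i in range(4)]
        )
        self.fuse = nn.Conv2d(channels + 4 * growth, channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(self.act(conv(torch.cat(feats, dim=1))))
        return x + 0.2 * self.fuse(torch.cat(feats, dim=1))   # local residual, scaled

class RRDB(nn.Module):
    """Three dense blocks wrapped in an outer residual connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.blocks = nn.Sequential(*[DenseBlock(channels) for _ in range(3)])

    def forward(self, x):
        return x + 0.2 * self.blocks(x)

# Toy forward pass on a 64-channel feature map
features = torch.randn(1, 64, 24, 24)
print(RRDB()(features).shape)   # torch.Size([1, 64, 24, 24])
```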
Recently, vision transformers (ViT) have emerged and started to show potential in computer vision tasks such as image recognition. Built upon self-attention architectures and being a leading model in natural language processing (NLP), ViT demonstrates excellent performance when trained on sufficient data, outperforming comparable state-of-the-art CNNs with four times fewer computational resources.
In this tutorial, the architecture of the texture transformer (TTSR) as well as GAN-based networks are elaborated, in conjunction with the detection of HPV-like particles (HPVLPs), or HPV viral factories. The four state-of-the-art GAN models covered are ESRGAN, CycleGAN, Pix2pix and Pix2pixHD.
The tutorial will address the following topics:
- Deep learning techniques for SR
- Evaluation metrics for generated SR images – visual appearance & PSNR metrics
- GAN architecture
- GAN-enhanced networks: ESRGAN, CycleGAN, Pix2pix and Pix2pixHD
- Vision transformer-based AI network for SR generation
- Application of SR to identification of HPV for medical applications
- Hands-on experience using Matlab and Python to generate SR images
- Future directions

Tutorial 3: (Half-Day) ICIP 2025 Tutorial on Polygonal Mesh Coding Standard
Presenters: Shan Liu, Tencent, Pranav Kadam, Tencent, Ondrej Stava, Google
Date and Time: Sunday, 14 September 2025, 09:00 – 12:30 (including a 30-minute break)
Location: TBD
3D objects are commonly modeled as polygonal meshes. A mesh describes a surface using a set of vertices, their positions in 3D space, and the incidence relations among vertices that compose edges and faces. Other mesh attributes include texture coordinates, normals, etc. Mesh compression has been an active research topic for over a decade. However, growing demand from areas such as gaming, animation, AR and VR, online commerce, and cultural heritage applications has created the need for efficient coding technologies with reduced network bandwidth and optimal rate-distortion performance.
To overcome the challenges of existing state-of-the-art mesh coding methods and support the next generation of 3D immersive experiences, the Visual Volumetric Media (VVM) working group in AOMedia issued a Call for Proposals (CfP) on Static Polygonal Mesh Coding (PMC) in May 2023. The standard is expected to be finalized by the end of 2025. This tutorial will first introduce mesh representation and mesh coding, followed by an overview of the standard development timeline and common test conditions. Then, the fundamental building blocks in PMC and some of the major lossless coding tools will be discussed in depth, including geometry traversal and connectivity coding, geometry predictive coding, attribute coding, and advanced prediction schemes. Later, the lossy PMC codec will be discussed, which offers spatial scalability via decimation, iterative mesh subdivision, and displacement coding.
PMC has already achieved an average coding gain of more than 30% over Draco, a leading mesh compression library. Moreover, PMC offers a more general solution to mesh coding because of its ability to handle higher-polygon-count meshes and non-manifold topologies. These aspects of the standard make this tutorial highly relevant to students and industry professionals working on a wide range of topics, from conventional video coding to immersive media processing.

Tutorial 4: (Half-Day) Tutorial on Diffusion Models for Imaging and Vision
Presenters: Stanley Chan, Purdue University
Date and Time: Sunday, 14 September 2025, 09:00 – 12:30 (including a 30-minute break)
Location: TBD
The tutorial will be based on https://www.nowpublishers.com/article/Details/CGV-112
ArXiv version is available at: https://arxiv.org/abs/2403.18103
The astonishing growth of generative tools in recent years has empowered many exciting applications in text-to-image and text-to-video generation. The underlying principle behind these generative tools is the concept of diffusion, a particular sampling mechanism that has overcome some longstanding shortcomings of previous approaches. While there is an ocean of diffusion papers, Python demos, online blogs, etc., I have found it very difficult to understand the underlying mathematical principles from them. Not only do these online materials lack substance beyond superficial treatments, but often they just recycle another online source, with the same technical holes propagating from one to the other. Many students claim they know the diffusion equations, but when asked more deeply about the physical meanings, no one can clearly explain what they are.
This ICIP tutorial is based on a 90-page tutorial, "Tutorial on Diffusion Models for Imaging and Vision," that I wrote in 2024. The purpose is to explain the concepts as clearly as possible through first-principle arguments, derivations, proofs, toy examples, and figures. There are five topics in this tutorial:
- Variational AutoEncoder (VAE)
a. Encoder and Decoder
b. Evidence Lower Bound (ELBO)
c. Reparametrization
- Denoising Diffusion Probabilistic Model (DDPM)
a. Transition Distributions
b. DDPM's Evidence Lower Bound
c. Reverse Process
d. Training and Inference
- Score Matching Langevin Dynamics (SMLD)
a. Sampling
b. Stein's score functions
c. Score-matching techniques
- Stochastic Differential Equations (SDE)
a. Forward SDE
b. Reverse SDE
c. How DDPM and SMLD can be formulated as SDE
- Physics and Fokker-Planck Equations
a. Brownian motion
b. Markov properties and the Chapman-Kolmogorov Equation
c. Master Equation for dynamical processes
d. Kramers-Moyal expansion and Fokker-Planck equation
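For orientation, the DDPM portion of this outline builds on the standard forward transition distributions and the simplified noise-prediction objective, written below in the common convention; the monograph derives these carefully from the ELBO.

```latex
% Standard DDPM relations (common convention; see the monograph for full derivations).
\begin{align*}
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) &= \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right), \\
q(\mathbf{x}_t \mid \mathbf{x}_0) &= \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s), \\
\mathcal{L}_{\text{simple}} &= \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}}
\left[\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ t\right) \right\|^2\right].
\end{align*}
```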

Tutorial 5: (Half-Day) The New Video-Based Dynamic Mesh Coding (V-DMC) International Standard
Presenters: Dr. Marius Preda, Associate Professor at Institut MINES-Télécom, Dr. Lukasz Kondrad, Principal Standardization Specialist at Nokia, Dr. Wenjie Zou, Associate Professor at Xidian University, Dr. Danillo Bracco Graziosi, Manager at Sony Corporation of America
Date and Time: Sunday, 14 September 2025, 13:30 – 16:30 (including a 30-minute break)
Location: TBD
The proposed tutorial on Video-Based Dynamic Mesh Coding (V-DMC) begins with an introduction to the need for efficient compression of dynamic 3D meshes, emphasizing the challenges in volumetric media applications and the role of MPEG in addressing these through standardization. It highlights the key differences between V-DMC and other approaches used by the industry, alongside its practical applications in gaming, virtual reality, and digital twins.
The tutorial then delves into the V-DMC architecture, explaining the decomposition of the input 3D mesh frames into a sequence of base meshes, displacement vectors, attribute components, and associated timed metadata used for mesh reconstruction. The structure of the bitstream and the encoding and decoding processes, supported by visual diagrams, are described in detail.
Participants will gain insight into the core technologies underlying V-DMC, such as methods for base mesh compression (e.g., Edgebreaker), displacement encoding using wavelet transforms and quantization, and the compression of texture attributes through remapping and video coding techniques.
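As a rough illustration of the displacement-coding idea (wavelet transform followed by quantization), the sketch below applies one Haar-style lifting step and uniform quantization to a toy 1D displacement signal. It is a generic example only, not the transform specified in V-DMC.

```python
# Generic illustration (not the V-DMC-specified transform): one lifting step of a
# Haar-style wavelet on a 1D displacement signal, followed by uniform quantization.
import numpy as np

def haar_lifting_forward(signal):
    """Split into even/odd samples, predict odds from evens, update evens."""
    even, odd = signal[0::2].astype(float), signal[1::2].astype(float)
    detail = odd - even             # prediction residual (high-pass)
    approx = even + 0.5 * detail    # updated coarse signal (low-pass)
    return approx, detail

def haar_lifting_inverse(approx, detail):
    even = approx - 0.5 * detail
    odd = detail + even
    out = np.empty(even.size + odd.size)
    out[0::2], out[1::2] = even, odd
    return out

def quantize(x, step):
    return np.round(x / step).astype(int)

def dequantize(q, step):
    return q * step

# Toy displacement magnitudes along one attribute channel
disp = np.array([0.10, 0.12, 0.40, 0.38, 0.05, 0.07, 0.90, 0.88])
approx, detail = haar_lifting_forward(disp)
qa, qd = quantize(approx, 0.05), quantize(detail, 0.05)   # coarser step -> more loss
recon = haar_lifting_inverse(dequantize(qa, 0.05), dequantize(qd, 0.05))
print(np.abs(recon - disp).max())
```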
A focused analysis of the bitstream syntax and semantics follows, detailing the organization of data and the integration with V3C technologies. The tutorial then explores the reconstruction process using timed atlas metadata, where the base mesh is refined through displacement vectors and texture attributes are mapped to achieve a seamless final output. The atlas metadata includes SEI messages for further enhancing reconstruction capabilities. Post-reconstruction methods, such as Zippering, are also explained.
The tutorial evaluates V-DMC’s performance, discussing the evolution of V-DMC’s compression efficiency during the standardization activity, visual quality metrics used, and real-world case studies from MPEG experiments. The session concludes with a discussion on future directions, highlighting ongoing research in motion vector simplification, lossy displacement coding, and V-DMC’s integration within the broader application ecosystem. The tutorial identifies opportunities for further innovation in dynamic mesh compression and invites participants to explore this evolving field.
Tutorial Outline
The tutorial follows the outline presented below:
- Introduction to V-DMC (Marius Preda – 10 minutes)
- V-DMC Architecture (Lukasz Kondrad – 15 minutes)
- Core technologies and coding features (Danillo Graziosi – 1 hour)
- Syntax and semantics (Lukasz Kondrad – 30 minutes)
- Reconstruction techniques (Lukasz Kondrad + Danillo Graziosi – 30 minutes)
- Performance metrics and evaluation (Wenjie Zou – 25 minutes)
- Conclusions and future directions (Marius Preda – 10 minutes)

Tutorial 6: (Half-Day) Quantum Machine Learning meets Image Processing
Presenters: Mihai DATCU, POLITEHNICA Bucharest
Date and Time: Sunday, 14 September 2025, 13:30 – 16:30 (including a 30-minute break)
Location: TBD
At present, quantum computing and AI are key technologies of the digital era. The progress and transfer of quantum resources into practical applications is constantly accelerating. Quantum computing, quantum annealing, quantum circuits, and simulators for quantum computing are now easily accessible. The exploitation of quantum physics effects such as superposition and entanglement opens new, still unexplored perspectives. Yet, even with very limited capacities of hundreds of qubits, these systems draw attention and stimulate the new area of quantum machine learning. In this context, the presentation will focus on relevant aspects of quantum algorithms for image processing. With the goal of identifying whether a quantum algorithm may bring any advantage over classical methods, the data complexity (i.e., data as prediction advantage) will first be analysed. Secondly, the complexity classes of the algorithms will be presented. Thirdly, methods for image data embedding will be presented.
While quantum information has an innately physical nature, the tutorial will address the case of satellite remote sensing images. Generally, imaging sensors generate an isomorphic representation of the observed scene. This is not the case for satellite remote sensing: these observations are a doppelgänger of the scattered field, an indirect signature of the imaged object. That is, the images are instrument records; in addition to spatial information, they sense physical parameters, and they mainly sense outside of the visual spectrum.
Non-quantum data are often "artificially" encoded at the input of quantum computers, so quantum algorithms may not be efficient. For instance, polarimetric images are represented on the Poincaré sphere, which maps in a natural way to the qubit Bloch sphere. Thus, polarimetric images are no longer processed as a "signal" but directly as a physical signature. The advantages of quantum annealing (D-Wave) for solving local optimization of non-convex problems will then be discussed, as well as the potential and advantages of the recent TensorFlow Quantum and the implementation of parametrized quantum circuits (PQCs). The presentation will address the entire image analysis cycle, encompassing the particular features of data acquisition, understanding and modelling of the image sensor, followed by information extraction. The quantum ML techniques are practically implemented using open access to various quantum computers, such as those of D-Wave, IBM, or Google. Hybrid methods will be discussed for satellite observations, i.e., managing the I/O of the data while maximally using the resources of quantum computers and quantum algorithms.
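As a small worked example of the "natural" embedding mentioned above, the following sketch maps a normalized Stokes vector (a point on the Poincaré sphere) to Bloch-sphere angles and single-qubit amplitudes. It covers only the embedding step, with toy values, and is not part of the presenter's pipeline.

```python
# Toy illustration of the Poincare-to-Bloch mapping mentioned above: a normalized
# Stokes vector (point on the Poincare sphere) becomes a single-qubit state on the
# Bloch sphere. This is only the embedding step, not a full quantum ML pipeline.
import numpy as np

def stokes_to_qubit(s0, s1, s2, s3):
    """Map normalized Stokes parameters to Bloch angles and qubit amplitudes."""
    s = np.array([s1, s2, s3], dtype=float) / s0    # unit vector for full polarization
    theta = np.arccos(np.clip(s[2], -1.0, 1.0))     # polar angle on the Bloch sphere
    phi = np.arctan2(s[1], s[0])                    # azimuthal angle
    qubit = np.array([np.cos(theta / 2.0),
                      np.exp(1j * phi) * np.sin(theta / 2.0)])
    return theta, phi, qubit

# Example: a fully right-circularly polarized pixel (S3 = S0) maps to one pole
theta, phi, qubit = stokes_to_qubit(1.0, 0.0, 0.0, 1.0)
print(theta, phi, np.abs(qubit) ** 2)
```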

Tutorial 7: (Half-Day) Synthetic Realities: Impact, Advancements, Ethical Considerations and the Future of Digital Forensics
Presenters: Gabriel Bertocco, Recod.ai/Unicamp, Anderson Rocha, Recod.ai/Unicamp
Date and Time: Sunday, 14 September 2025, 13:30 – 16:30 (including a 30-minute break)
Location: TBD
In this tutorial, we will explore the burgeoning landscape of synthetic realities: their impact, technological advancements, and ethical quandaries. Synthetic realities provide innovative solutions and opportunities for immersive experiences in various sectors, including education, healthcare, and commerce. However, these advances also present substantial challenges, such as the propagation of misinformation, privacy concerns, and ethical dilemmas. We will discuss the specifics of synthetic media, including deepfakes and their generation and detection techniques, modern AI-empowered multimedia manipulations, (mis)information, and (dis)information. We will also touch upon the imperative need for robust detection and explainable methods to combat the potential misuse of such technologies. We will show the dual-edged nature of synthetic realities and advocate for interdisciplinary research, informed public discourse, and collaborative efforts to harness their benefits while mitigating risks. We also present future trends and perspectives on synthetic realities. This tutorial contributes to the discourse on the responsible development and application of artificial intelligence and synthetic media in modern society. We will cover real case studies where people were fooled by deepfakes, causing financial loss or harm to their reputations. Building on these cases, we will show state-of-the-art generators and detectors in the image, video, audio and text modalities. On the generation side, we will cover the main strategies, such as Face Swap, Face Reenactment, Lip Syncing, Facial Attribute Manipulation, Entire Face Synthesis, Text-to-Speech, Voice Conversion, and LLM-based text generation. On the detection side, we will delve into different solutions in each modality and discuss their explainability and deployability in real-world applications. We will also cover legislative and political aspects of the employment of synthetic realities in society.
The tutorial has the following outline:
- Introduction to Synthetic Realities
- Applications of Synthetic Realities in advertising, entertainment and health campaigns
- Synthetic Realities synthesis and case studies
a. Face Swap
b. Face Reenactment (Puppet-mastery)
c. Lip-syncing
d. Entire Face Synthesis
e. Face Attribute Manipulation
f. Text-to-Speech Synthesis
g. Voice Conversion
h. Text generation with Large-Language Models (LLMs)
- Detection methods
a. Image- and video-based Deepfake Detection
b. Detection of LLM-generated content
c. Generalizable Deepfake Detection
d. Self-Supervised Learning for Deepfake Detection
- Social, Technical and Political Challenges
- Legislation
- Education and Standardization
- Future perspectives
- Practical Session with Detection Methods

Tutorial 8: (Half-Day) Semantic Communication for Media Compression and Transmission in Next Generation Communication Networks
Presenters: Anil Fernando, Department of Computer and Information Sciences, University of Strathclyde, UK
Date and Time: Sunday, 14 September 2025, 09:00 – 12:30 (including a 30-minute break)
Location: TBD
Semantic communication, a concept first discussed by Shannon and Weaver in 1949, classifies communication challenges into three distinct levels: physical, semantic, and effectiveness. The physical problem concerns the accurate and reliable transmission of the raw data content of a message, which led to the development of information theory—a field that has profoundly influenced modern communication technologies. The semantic problem, in contrast, deals with ensuring that the intended meaning or context of a message is accurately delivered to the receiver. Finally, the effectiveness problem focuses on determining whether the message achieves its intended purpose or prompts the desired action from the recipient.

While advancements in physical communications have progressed exponentially since the early days of information theory, laying the groundwork for today's high-performance gaming, entertainment, and media ecosystems, semantic communication has remained underexplored for decades. This stagnation can largely be attributed to the absence of computational and theoretical tools required to implement semantic communication systems effectively. Recent advancements in deep learning, natural language processing (NLP), and computational performance have made it possible to revisit semantic communication as a practical and transformative paradigm. Unlike traditional communication methods that prioritize transmitting raw data with high fidelity, semantic communication focuses on delivering meaning, intent, or relevance while minimizing unnecessary data redundancy. This shift is particularly relevant for addressing modern challenges, such as the growing demand for bandwidth-intensive applications, low-latency connectivity, and efficient energy use in data transmission. Semantic communication enables intelligent and context-aware transmission, making it a promising solution to improve the capacity, scalability, and reliability of current and future communication systems.

In summary, semantic communication is revolutionizing media compression and transmission by emphasizing meaning over raw data. This paradigm aligns well with the challenges of next-generation networks, such as 5G, 6G, and IoT, which require solutions for bandwidth optimization, latency reduction, and scalability. Its applications span a wide range of fields, including entertainment, gaming, smart devices, and autonomous systems, making it a critical component of future communication systems.
In this Tutorial, we delve into how semantic communication concepts can complement conventional multimedia communication systems, with a focus on image and video compression and transmission. Our early experiments and results in this field are highly promising, demonstrating that semantic communication can achieve better-quality reconstructions of images and videos for a given bandwidth compared to state-of-the-art compression techniques like HEIF/JPEG, H.264/H.265/H.266, and AV1/AV2. This improvement is achieved by selectively encoding and transmitting semantically relevant features rather than raw pixel data, effectively optimizing resource utilization. However, significant challenges remain before semantic communication can be widely adopted in commercial applications. These challenges include the development of robust and generalizable semantic models, ensuring compatibility with existing infrastructure, addressing computational complexities, and safeguarding data privacy and security. Additionally, standardized frameworks and protocols for semantic communication are needed to facilitate widespread deployment. We present an overview of the historical background, current state of research, and a future roadmap for leveraging semantic communication in multimedia compression and transmission. By addressing these challenges and exploring its potential, semantic communication is poised to become a cornerstone technology in transforming the way multimedia data is encoded, transmitted, and consumed.
This tutorial will explore how semantic communication can address these emerging challenges in media communications. We will examine how semantic communication can reduce the strain on bandwidth by transmitting only the relevant features needed for specific tasks, thereby optimizing network resources. Additionally, the tutorial will discuss the potential challenges of implementing semantic communication systems and outline strategies to overcome these obstacles. Beyond the communication between people and machines, we will also focus on communication between devices, processes, and objects, highlighting how semantic communication can support the growing complexity of machine-to-machine and device-to-device communication in next-generation networks.
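As a purely structural sketch of this idea (illustrative assumptions: toy dimensions, untrained networks, a simple AWGN channel), the snippet below shows a transmitter that sends a compact task-relevant feature vector instead of raw pixels, and a receiver that decodes directly to a task decision; in practice, both ends would be trained jointly.

```python
# Minimal structural sketch (illustrative only, not the presenter's system): a
# semantic transmitter sends a compact task-relevant feature vector across a noisy
# channel instead of raw pixels, and the receiver decodes directly to a decision.
import torch
import torch.nn as nn

class SemanticTransmitter(nn.Module):
    def __init__(self, pixels=64 * 64, features=32):
        super().__init__()
        self.encode = nn.Sequential(nn.Flatten(), nn.Linear(pixels, 256),
                                    nn.ReLU(), nn.Linear(256, features))

    def forward(self, image):
        return self.encode(image)           # only `features` numbers are transmitted

class SemanticReceiver(nn.Module):
    def __init__(self, features=32, classes=10):
        super().__init__()
        self.decode = nn.Sequential(nn.Linear(features, 128), nn.ReLU(),
                                    nn.Linear(128, classes))

    def forward(self, received):
        return self.decode(received)        # task decision, no pixel reconstruction

def awgn_channel(x, snr_db=10.0):
    """Add white Gaussian noise at the given signal-to-noise ratio."""
    power = x.pow(2).mean()
    noise_power = power / (10 ** (snr_db / 10))
    return x + noise_power.sqrt() * torch.randn_like(x)

tx, rx = SemanticTransmitter(), SemanticReceiver()
image = torch.rand(1, 1, 64, 64)            # 4096 pixel values at the transmitter
logits = rx(awgn_channel(tx(image)))        # only 32 values crossed the channel
print(logits.shape)
```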
The outline of the proposed tutorial is:
- Introduction to Semantic Communication: This part will begin with an introduction to semantic communication, explaining its fundamental principles and how it differs from traditional communication paradigms. It will cover the motivation for semantic communication, emphasizing its focus on transmitting the meaning or intent of data rather than raw information, and highlight its significance in addressing challenges like bandwidth constraints, energy efficiency, and the growing complexity of communication networks in the era of 5G, 6G, and IoT/VIoT.
- Key Concepts and Frameworks: Introduce the theoretical underpinnings of semantic communication, including concepts like semantic entropy, semantic noise, and mutual understanding. Explain the importance of aligning transmitter and receiver semantic models and discuss frameworks for encoding, transmitting, and decoding meaning. Provide examples of how semantic communication is applied in practical scenarios, such as natural language processing (NLP), image transmission, and video streaming.
- Semantic Information Theory: Discuss the role of information theory in semantic communication, extending traditional metrics like Shannon entropy to account for meaning. Cover concepts such as semantic relevance, context-aware transmission, and the trade-offs between efficiency and accuracy in semantic encoding. Explain how these principles influence the design of semantic communication systems.
- Artificial Intelligence and Machine Learning Integration: Explain how AI and machine learning enable semantic communication by extracting and interpreting meaning from data. Discuss key techniques, including feature extraction, deep learning models (e.g., transformers, convolutional neural networks), and knowledge representation methods (e.g., ontologies, graphs). Highlight applications such as semantic compression, personalized data delivery, and intelligent edge computing.
- Applications of Semantic Communication: Explore real-world applications where semantic communication plays a crucial role. These include media streaming, IoT networks, smart cities, autonomous vehicles, augmented reality (AR), and virtual reality (VR). Discuss how semantic systems improve efficiency, reduce latency, and enhance user experiences in these domains.
- Challenges and Future Directions: Discuss the challenges of implementing semantic communication, such as computational complexity, standardization issues, and privacy concerns. Highlight ongoing research efforts to address these challenges and explore emerging trends like bio-inspired communication, neuromorphic computing, and hybrid human-machine communication systems.
- Case Studies and Practical Demonstrations: Include case studies that showcase successful implementations of semantic communication systems, such as semantic image and video compression and transmission. If possible, provide hands-on demonstrations or simulations to help participants visualize key concepts and their practical impact.
Tutorial 9: (Half-Day) Foundations and Recent Trends in Robust Multimodal Learning
Presenters: M. Salman Asif, Md. Kaykobad Reza
Date and Time: Sunday, 14 September 2025, 09:00 – 12:30 (including a 30-minute break)
Location: TBD
Multimodal learning is an emerging field at the intersection of machine learning and multimodal data processing. Information from multiple sources — such as vision, text, audio, and different sensor data — is integrated to build more robust, adaptive, and reliable models for different real-world applications. This tutorial will provide a comprehensive overview of the foundations, recent advancements, and open challenges in this domain.
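As a minimal illustration of one robustness concern covered in the tutorial, the sketch below shows late fusion over whichever modalities are present, so a missing stream degrades gracefully. The dimensions and untrained encoders are toy assumptions, not a method from the tutorial.

```python
# Minimal sketch (toy dimensions, untrained encoders) of late fusion with a
# missing-modality fallback, one of the robustness concerns this tutorial covers.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, dims=None, hidden=64, classes=5):
        super().__init__()
        dims = dims or {"vision": 512, "text": 300, "audio": 128}
        self.encoders = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.head = nn.Linear(hidden, classes)

    def forward(self, inputs):
        # Encode whichever modalities are present and average them, so a missing
        # or dropped modality degrades gracefully instead of breaking the model.
        embeddings = [torch.relu(self.encoders[m](x)) for m, x in inputs.items() if x is not None]
        fused = torch.stack(embeddings, dim=0).mean(dim=0)
        return self.head(fused)

model = LateFusionClassifier()
batch = {"vision": torch.randn(4, 512), "text": torch.randn(4, 300), "audio": None}
print(model(batch).shape)   # works even though the audio stream is unavailable
```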
Outline
The tutorial will be divided into four parts to help the audience gain a deeper understanding of the topic:
- Fundamentals of Multimodal Learning (30 mins): Cover core concepts, frameworks, and methods for multimodal data fusion, alignment, and representation learning.
- Challenges in Robust Multimodal Learning (1 hour): Discuss challenges related to noisy, incomplete, or unaligned modalities and existing solutions to overcome these challenges. We will also discuss computational bottlenecks in practical scenarios.
- Recent Advances and Applications (30 mins): Highlight state-of-the-art techniques and applications, including their impact on healthcare, autonomous systems, and surveillance, for instance vision-language models (Stable Diffusion, GPT-4, Gemini-2, LLAMA, etc.) and their applications in these domains.
- Open Questions and Future Directions (20 mins): Discuss critical open problems, including novel approaches to test-time adaptation, cost-efficient integration of new modalities, and real-world deployment challenges.
Goals of the Tutorial
By the end of this tutorial, the audience will:
- Gain a Strong Foundation: Understand the core principles of multimodal learning, including data fusion, alignment, and representation learning.
- Learn Robust Techniques: Explore methods to address challenges like noisy, incomplete, and unaligned data, ensuring system reliability in real-world scenarios.
- See Real-World Applications: Learn through practical examples in areas like healthcare, autonomous systems, and surveillance.
- Identify Future Opportunities: Explore open challenges and cutting-edge research directions to inspire future work in multimodal learning.

Tutorial 10: (Half-Day) A Practical Overview of Implicit Image Representations: Techniques and Applications
Presenters: Lorenzo Catania, University of Catania, Dario Allegra, University of Catania
Date and Time: Sunday, 14 September 2025, 13:30 – 16:30 (including a 30-minute break)
Location: TBD
Implicit Neural Representations (INRs) are a recent paradigm for representing signals, in which discrete data are interpreted as continuous functions mapping coordinates to samples. For images, this function maps each pixel's coordinates to its color. A neural network is then overfit to this function, and the image is reconstructed by querying the network. This workflow encodes the image data as the network's parameters and enables the minimization of a specific criterion by including it in the loss function. Recent research has shown that this paradigm rivals traditional approaches and learned codecs in reconstruction quality and rate-distortion performance, without the drawbacks of block artifacts and color aliasing.
This tutorial begins with an overview of state-of-the-art Implicit Neural Representations for images and a live coding session where the authors present a simple INR-based image encoding pipeline. The pipeline’s components will be engineered to pursue various tasks, including compression and generic image processing, requiring minimal modifications. These examples demonstrate the versatility of the INR paradigm, providing the audience with the knowledge needed to apply these concepts to their own research.
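For readers unfamiliar with the basic mechanism, the following is a minimal sketch of the INR idea described above (not the presenters' pipeline): a small MLP is overfit to map normalized pixel coordinates to gray values, so the "encoded image" is simply the network's parameters. Practical INRs typically use sinusoidal activations or positional encodings for sharper fits.

```python
# Minimal sketch of the INR idea (not the presenters' pipeline): a small MLP is
# overfit to map normalized pixel coordinates to gray values, so the "encoded image"
# is just the network's parameters.
import torch
import torch.nn as nn

def coordinate_grid(h, w):
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    return torch.stack([xs, ys], dim=-1).reshape(-1, 2)     # (h*w, 2)

inr = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1), nn.Sigmoid())

image = torch.rand(32, 32)                  # stand-in for the image to be encoded
coords, target = coordinate_grid(32, 32), image.reshape(-1, 1)
optimizer = torch.optim.Adam(inr.parameters(), lr=1e-3)

for step in range(500):                     # overfit the network to this single image
    loss = nn.functional.mse_loss(inr(coords), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

reconstruction = inr(coords).detach().reshape(32, 32)   # decode by querying the network
print(float(loss))
```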
Tutorial 11: (Half-Day) Fine-tuning hyperparameters for stochastic optimization: A review
Presenters: Paul Rodriguez, Pontifical Catholic University of Peru
Date and Time: Sunday, 14 September 2025, 09:00 – 12:30 (including a 30-minute break)
Location: TBD
While the impact of convolutional neural networks (CNNs) / deep learning (DL) / artificial intelligence (AI) is still being assessed in several everyday technological and ethical aspects of our societies, stochastic optimizers, which encompass stochastic gradient descent (SGD) and its variants (e.g. Momentum, ADAM, etc.), and the selection of their associated hyperparameters play a crucial role in the successful training of such models.
The SGD algorithm, which may be succinctly explained as the classical gradient descent (GD) algorithm with a (very) noisy gradient, has only one hyperparameter, the learning rate (LR), which directly affects the practical rate of convergence. However, more effective (and popular) algorithms, such as ADAM and derived methods, have several hyperparameters whose influence is neither as direct nor as well understood as that of the LR in the SGD case.
The aim of this tutorial is twofold: (i) to give a direct overview of the SGD algorithm and its most relevant variants, with particular emphasis on how their associated hyperparameters directly influence their performance, and (ii) to summarize the different methods to fine-tune the most influential hyperparameters, from grid-search strategies to adaptive schemes, providing both theoretical analysis and computational examples.
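To preview the kind of hyperparameters involved, the sketch below writes out the vanilla SGD update (a single learning rate) next to the ADAM update (learning rate, beta1, beta2, epsilon) on a toy noisy quadratic. The specific values are illustrative; note that the learning rate is deliberately re-tuned per optimizer, which is precisely the sensitivity this tutorial examines.

```python
# Hedged sketch of the update rules discussed above: vanilla SGD exposes a single
# learning rate, while ADAM adds beta1, beta2 and epsilon. Values are illustrative.
import numpy as np

def sgd_step(w, grad, lr=0.1):
    return w - lr * grad                              # one hyperparameter: the learning rate

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad                # first-moment (momentum-like) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2           # second-moment (per-coordinate scale)
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)   # bias correction
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

# Toy quadratic objective with a "noisy gradient" to mimic the stochastic setting
rng = np.random.default_rng(0)
w_sgd = w_adam = np.array([5.0])
state = (np.zeros(1), np.zeros(1), 0)
for _ in range(200):
    noisy_grad = 2 * w_sgd + rng.normal(scale=0.5)
    w_sgd = sgd_step(w_sgd, noisy_grad)
    noisy_grad = 2 * w_adam + rng.normal(scale=0.5)
    w_adam, state = adam_step(w_adam, noisy_grad, state, lr=0.05)  # re-tuned per optimizer
print(w_sgd, w_adam)
```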
This 3-hour tutorial is planned to be delivered in two parts, each 80 minutes long, plus a 10-minute break in the middle and 10 minutes for questions/discussion at the end. The estimated breakdown of the tutorial, not counting the break or the questions/discussion, is as follows:
- Gradient descent (GD) and stochastic GD (25 mins.)
This sub-section will focus on highlighting the similarities and differences between GD and SGD. It will also include a succinct list of theoretical aspects needed to understand well-known SGD practices (such as batch size and learning rate scheduling).
- Accelerated GD (AGD) (20 mins.)
This sub-section will succinctly summarize the key algorithms used to accelerate GD. This is included here since several SGD variants are based on (deterministic) Polyak's momentum, Nesterov's, Anderson's and triple momentum accelerations.
- SGD variants and associated hyperparameters (55 mins.)
This sub-section will describe the most influential SGD variants (e.g. momentum, ADAM, AdaBound, AdaFactor, LookAhead, etc.), highlighting the fact that such variants may be understood as a set of add-on features over vanilla SGD, and giving particular emphasis to the theoretical understanding of their associated hyperparameters.
- Hyperparameter fine-tuning (60 mins.)
This sub-section will summarize the different strategies for fine-tuning the most influential hyperparameters, from grid search to adaptive schemes, including recent strategies such as the so-called "tuning-free" approach. Several computational examples will be provided to highlight the dependencies among hyperparameters in connection with optimal performance across diverse tasks, such as image classification and language modeling.
The depth at which the topics are covered is sufficient to follow the associated simulations; generally speaking, the baseline knowledge required is equivalent to that of a first-year graduate student in data science.

Tutorial 12: (Half-Day) Big Visual Data analytics for Natural Disaster Management
Presenters: Prof. Ioannis Pitas, Aristotle University of Thessaloniki (AUTH), Dr. Vasileios Mygdalis, Aristotle University of Thessaloniki (AUTH), Nikolaos Marios Militsis, Aristotle University of Thessaloniki (AUTH)
Date and Time: Sunday, 14 September 2025, 13:30 – 16:30 (including a 30-minute break)
Location: TBD
This short course on Big Data Analytics for Natural Disaster Management (NDM) provides a comprehensive overview and in-depth presentation of advanced technologies involved in the acquisition and analysis of Big Data for NDM. NDM can be greatly improved by developing automated means for precise semantic mapping and phenomenon evolution predictions in real time. Several extreme data sources can significantly help towards achieving this goal: a) autonomous devices and smart sensors at the edge, equipped with AI capabilities; b) satellite images; c) topographical data; d) official meteorological data, predictions or warnings published on the Web; and e) geosocial media data (including text, image and video). The course will focus on drone image analysis for Natural Disaster Management.
The course consists of 4 lectures, covering important topics and presenting state-of-the-art technologies in:
- Sensors and Big Visual Data Analytics for Natural Disaster Management (NDM).
- Forest fire detection and fire/burnt region segmentation on drone images
- Flood region segmentation on drone images
- Simulation and visualization of forest fires and floods.
The presented technologies find practical application in developing an advanced NDM support system that dynamically exploits multiple data sources and AI technologies for providing an accurate assessment of an evolving crisis situation.
This short course overviews research topics dealt with in the European R&D project TEMA: https://tema-project.eu/
Prof. Ioannis Pitas is the coordinator of this large R&D project (20 university and company partners).
Tutorial 13: (Half-Day) Deepfakes: From Creation to Detection, and Future Challenges
Presenters: Simon S. Woo, Sungkyunkwan University, South Korea
Date and Time: Sunday, 14 September 2025, 13:30 – 16:30 (including a 30-minute break)
Location: TBD
Introduction to Traditional Deepfake Generation Techniques: The tutorial begins by introducing the concept of deepfakes, tracing the origins of GAN-based and face-swapping models. We will discuss how these traditional techniques enabled the creation of increasingly realistic facial manipulations and explore the societal and ethical implications that have made deepfake detection an urgent research area. This section provides the foundational background for understanding how deepfake technology first emerged and its initial challenges.
Deepfake Detection Framework: Based on the SoK: Facial Deepfake Detectors paper, this segment presents a structured framework that organizes over 50 traditional deepfake detectors by their focus on artifacts such as spatial, temporal, and frequency patterns. While this framework primarily addresses GAN-based and face-swapping models, it lays the groundwork for understanding deepfake detection methodologies more broadly. Participants will gain insights into the types of artifacts these detectors are designed to capture and how these detectors respond to early forms of deepfake manipulation.
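As one generic example of a frequency-domain cue of the kind such detectors exploit (not a detector from the SoK paper), the sketch below computes the azimuthally averaged power spectrum of an image; upsampling artifacts of some generators appear as abnormal high-frequency energy in this 1D profile.

```python
# Generic illustration of a frequency-artifact feature (not a detector from the SoK
# paper): the azimuthally averaged power spectrum of an image, a 1D curve in which
# upsampling artifacts of some generators show up as abnormal high-frequency energy.
import numpy as np

def radial_power_spectrum(image, num_bins=32):
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    h, w = spectrum.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    bins = np.linspace(0, radius.max(), num_bins + 1)
    profile = [spectrum[(radius >= lo) & (radius < hi)].mean()
               for lo, hi in zip(bins[:-1], bins[1:])]
    return np.array(profile)        # feature vector fed to a simple classifier in practice

fake_or_real = np.random.default_rng(0).random((128, 128))   # stand-in for a face crop
print(radial_power_spectrum(fake_or_real).shape)
```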
Diffusion-Based Deepfake Generation and Detection: Moving into the latest advancements, this section introduces diffusion-based deepfake generation, which represents a new frontier in generative AI. Unlike GANs, diffusion models use iterative generation processes that allow for fine control over visual detail, achieving exceptionally realistic media. This section will illustrate the unique challenges diffusion-based deepfakes pose for detection and present emerging detection methods specifically designed to counter these sophisticated techniques. We will discuss new detection approaches that target statistical cues and distinct artifacts generated by diffusion models, as well as the need for novel datasets and benchmarks to properly evaluate these advanced detectors.
Challenges, Open Questions, and Future Directions: The tutorial concludes by examining real-world challenges, such as maintaining robustness against evolving attack types and achieving generalizability across diverse scenarios. Open research questions will be presented, including how to further improve diffusion-based detection, how to address the ethical considerations associated with evolving generative techniques, and what regulatory measures may need to adapt to meet these challenges.