GCPR Papers and Sessions

Joint Oral Session I

InvGAN: Invertible GANs
Partha Ghosh (Max Planck Inst. for Int. Sys)*; Dominik Zietlow (Max Planck Institute for Intelligent Systems); Michael J. Black (Max Planck Institute for Intelligent Systems); Larry Davis (AMAZON COM SERVICES INC); Xiaochen Hu (AMAZON COM SERVICES INC)
Generation of photo-realistic images, semantic editing and representation learning are only a few of many applications of high-resolution generative models. Recent progress in GANs has established them as an excellent choice for such tasks. However, since they do not provide an inference model, downstream tasks such as classification cannot be easily applied to real images using the GAN latent space. Despite numerous efforts to train an inference model or design an iterative method to invert a pre-trained generator, previous methods are dataset (e.g., human face images) and architecture (e.g., StyleGAN) specific. These methods are nontrivial to extend to novel datasets or architectures. We propose a general framework that is agnostic to architecture and datasets. Our key insight is that, by training the inference and the generative model together, we allow them to adapt to each other and to converge to a better quality model. Our InvGAN, short for Invertible GAN, successfully embeds real images in the latent space of a high quality generative model. This allows us to perform image inpainting, merging, interpolation and online data augmentation. We demonstrate this with extensive qualitative and quantitative experiments.

Local Attention Guided Joint Depth Upsampling
Arijit Mallick; Raphael Braun; Andreas Engelhardt; Hendrik Lensch
Image super resolution is a classical computer vision problem. One branch of super resolution deals with guided depth super resolution. Here, the goal is to accurately upsample a given low resolution depth map with the help of features aggregated from the high resolution color image of that particular scene. Recently, the development of transformers has improved performance for general image processing tasks thanks to self-attention. Unlike previous methods for guided joint depth upsampling which rely only on CNNs, we efficiently compute self-attention with the help of local image attention which avoids the quadratic growth typically found in self-attention layers. Our work combines CNNs and transformers to analyze the two input modalities and employs a cross-modal fusion network in order to predict both a weighted per-pixel filter kernel and a residual for the depth estimation. To further enhance the final output, we integrate a differentiable, trainable deep guided filtering network which provides an additional depth prior. An ablation study and empirical trials demonstrate the importance of each proposed module. Our method shows performance that is comparable to or exceeds the state of the art on the guided depth upsampling task.
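
A minimal sketch of the fused output step described above, i.e. applying a predicted per-pixel filter kernel plus a residual to a bilinearly upsampled depth map; tensor shapes and the softmax normalization are assumptions for illustration, not details taken from the paper:

```python
import torch
import torch.nn.functional as F

def apply_pixelwise_kernel(depth_up, kernels, residual, k=3):
    """depth_up: (B, 1, H, W) bilinearly upsampled depth; kernels: (B, k*k, H, W) predicted
    per-pixel filter weights; residual: (B, 1, H, W) predicted correction term."""
    B, _, H, W = depth_up.shape
    weights = torch.softmax(kernels, dim=1)                  # normalize each pixel's k*k kernel
    patches = F.unfold(depth_up, k, padding=k // 2)          # (B, k*k, H*W) local neighborhoods
    patches = patches.view(B, k * k, H, W)
    filtered = (weights * patches).sum(dim=1, keepdim=True)  # weighted local average
    return filtered + residual
```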

ArtFID: Quantitative Evaluation of Neural Style Transfer
Matthias S Wright (Ludwig Maximilian University of Munich)*; Bjorn Ommer (University of Munich)
The field of neural style transfer has experienced a surge of research exploring different avenues ranging from optimization-based approaches and feed-forward models to meta-learning methods. The developed techniques have not just progressed the field of style transfer, but also led to breakthroughs in other areas of computer vision, such as visual synthesis as a whole. However, whereas quantitative evaluation and benchmarking have become pillars of computer vision research, the reproducible, quantitative assessment of style transfer models is still lacking. Even in comparison to other fields of visual synthesis, where widely used metrics exist, the quantitative evaluation of style transfer is still lagging behind. To support the automatic comparison of different style transfer approaches and to study their respective strengths and weaknesses, the field would greatly benefit from a quantitative measurement of stylization performance. Therefore, we propose a method to complement the currently mostly qualitative evaluation schemes. We provide extensive evaluations and a large-scale user study to show that the proposed metric strongly coincides with human judgment.
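
Since ArtFID builds on the Fréchet Inception Distance family of metrics, the sketch below shows the standard Fréchet distance between two sets of deep features as background; it is not the paper's exact ArtFID formula, and the function name and feature source are assumptions:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets of shape (N, D),
    e.g. Inception features of reference and stylized images."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    covmean = covmean.real                                   # drop tiny imaginary parts
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2 * covmean))
```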

Joint Oral Session II

Evaluation of Volume Representation Networks for Meteorological Ensemble Compression
Kevin Höhlein; Sebastian Weiss; Rüdiger Westermann
Recently it has been shown that volume scene representation networks are a powerful tool to transform 3D scalar fields into an extremely compact representation from which the initial field samples can be randomly accessed. In this work, we evaluate the capabilities of such networks to compress meteorological ensemble data, which comprise many single weather forecast simulations. We analyze whether and how these networks can effectively exploit similarities between the ensemble members, and whether alternative classical compression approaches show similar capabilities. Since meteorological ensembles contain different physical parameters with different statistical characteristics and variations on multiple scales of magnitude, we analyze the impact of data normalization schemes on learning quality. Along with an evaluation of the trade-offs between reconstruction quality and network model parameterization, we compare compression ratios and reconstruction quality for different model architectures and alternative compression schemes.
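
For illustration, a volume scene representation network of the kind evaluated here can be as simple as a coordinate MLP that maps a 3D position to a scalar field value; the trained weights then serve as the compact representation. Layer sizes and activations below are assumptions, not the architecture used in the paper:

```python
import torch
import torch.nn as nn

class VolumeMLP(nn.Module):
    """Maps a 3D coordinate to a scalar field value; the fitted weights compress one volume."""
    def __init__(self, hidden=128, depth=4):
        super().__init__()
        layers, in_dim = [], 3
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, xyz):            # xyz: (N, 3) coordinates, e.g. scaled to [-1, 1]
        return self.net(xyz)
```

Training amounts to minimizing a reconstruction loss between network outputs and randomly sampled field values of the ensemble member.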

InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction
Yinghao Huang (Max-Planck-Institute for Intelligent Systems)*; Omid Taheri (Max Planck Institute for Intelligent Systems); Michael J. Black (Max Planck Institute for Intelligent Systems); Dimitrios Tzionas (University of Amsterdam)
Humans constantly interact with daily objects to accomplish tasks. To understand such interactions, computers need to reconstruct these from cameras observing whole-body interaction with scenes. This is challenging due to occlusion between the body and objects, motion blur, depth/scale ambiguities, and the low image resolution of hands and graspable object parts. To make the problem tractable, the community focuses either on interacting hands, ignoring the body, or on interacting bodies, ignoring hands. The GRAB dataset addresses dexterous whole-body interaction but uses marker-based MoCap and lacks images, while BEHAVE captures video of body-object interaction but lacks hand detail. We address the limitations of prior work with InterCap, a novel method that reconstructs interacting whole-bodies and objects from multi-view RGB-D data, using the parametric whole-body model SMPL-X and known object meshes. To tackle the above challenges, InterCap uses two key observations: (i) Contact between the hand and object can be used to improve the pose estimation of both. (ii) Azure Kinect sensors allow us to set up a simple multi-view RGB-D capture system that minimizes the effect of occlusion while providing reasonable inter-camera synchronization. With this method we capture the InterCap dataset, which contains 10 subjects (5 males and 5 females) interacting with 10 objects of various sizes and affordances, including contact with the hands or feet. In total, InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames, each containing 6 RGB-D images. Our method provides pseudo ground-truth body meshes and objects for each video frame. Our InterCap method and dataset fill an important gap in the literature and support many research directions. Our data and code are available for research purposes at https://intercap.is.tue.mpg.de.

Neural Adaptive Scene Tracing (NAScenT)
Rui Li; Darius Rückert; Wolfgang Heidrich
Neural rendering with implicit neural networks has recently emerged as an attractive proposition for scene reconstruction, achieving excellent quality albeit at high computational cost. While the most recent generation of such methods has made progress on the rendering (inference) times, very little progress has been made on improving the reconstruction (training) times. In this work we present Neural Adaptive Scene Tracing (NAScenT), the first neural rendering method based on directly training a hybrid explicit-implicit neural representation. NAScenT uses a hierarchical octree representation with one neural network per leaf node and combines this representation with a two-stage sampling process that concentrates ray samples where they matter most – near object surfaces. As a result, NAScenT is capable of reconstructing challenging scenes, including both large, sparsely populated volumes like UAV-captured outdoor environments and small scenes with high geometric complexity. NAScenT outperforms existing neural rendering approaches in terms of both quality and training time.

GCPR Oral Session I: Machine Learning Methods

Blind Single Image Super-Resolution via Iterated Shared Prior Learning
Thomas Pinetz (University of Bonn)*; Erich Kobler (University of Linz); Thomas Pock (Graz University of Technology); Alexander Effland (University of Bonn)
In this paper, we adapt shared prior learning for blind single image super-resolution (SISR). From a variational perspective, we aim to minimize an energy functional consisting of a learned data fidelity term and a data-driven prior, where the learnable parameters are computed in a mean-field optimal control problem. In the associated loss functional, we combine a supervised loss evaluated on synthesized observations and an unsupervised Wasserstein loss for real observations, in which local statistics of images with different resolutions are compared. In shared prior learning, only the parameters of the prior are shared among both loss functions. The kernel estimate is updated iteratively after each step of shared prior learning. In numerous numerical experiments, we achieve state-of-the-art results for blind SISR with a low number of learnable parameters and small training sets to account for real applications.
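
In generic notation (not the paper's), the variational objective sketched above takes the form below, with a learned data-fidelity term, a learned prior, an estimated blur kernel k, a downsampling operator S and the low-resolution observation y:

```latex
% Generic blind-SISR energy; all symbols are illustrative notation, not the paper's.
\hat{x} \in \operatorname*{arg\,min}_{x} \; \mathcal{D}_{\theta}\!\big(S(k \ast x),\, y\big) \;+\; \mathcal{R}_{\phi}(x)
```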

Auto-Compressing Subset Pruning for Semantic Image Segmentation
Konstantin J Ditschuneit (Merantix Labs GmbH)*; Johannes S Otterbach (Merantix Labs)
State-of-the-art semantic segmentation models are characterized by high parameter counts and slow inference times, making them unsuitable for deployment in resource-constrained environments. To address this challenge, we propose Auto-Compressing Subset Pruning, ACoSP, as a new online compression method. The core of ACoSP consists of learning a channel selection mechanism for individual channels of each convolution in the segmentation model based on an effective temperature annealing schedule. We show a crucial interplay between providing a high-capacity model at the beginning of training and the compression pressure forcing the model to compress concepts into retained channels. We apply ACoSP to SegNet and PSPNet architectures and show its success when trained on the CamVid, Cityscapes, Pascal VOC2012, and Ade20k datasets. The results are competitive with existing baselines for compression of segmentation models at low compression ratios and outperform them significantly at high compression ratios, yielding acceptable results even when removing more than 93% of the parameters. In addition, ACoSP is conceptually simple, easy to implement, and can readily be generalized to other data modalities, tasks, and architectures. Our code is available at https://github.com/merantix/acosp.
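
A minimal sketch of a temperature-annealed channel selection gate of the kind described above; the parameterization (one logit per channel, sigmoid gating) is an assumption for illustration, not the exact ACoSP mechanism:

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Soft channel selection: one learnable logit per channel, pushed towards a hard
    0/1 choice as the temperature is annealed during training."""
    def __init__(self, num_channels):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x, temperature):
        gate = torch.sigmoid(self.logits / temperature)   # saturates to 0/1 as temperature -> 0
        return x * gate.view(1, -1, 1, 1)                 # x: (B, C, H, W) conv activations
```

Channels whose gates saturate to zero can be removed after training, which is what yields the compressed model.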

Improving Robustness and Calibration in Ensembles with Diversity Regularization
Hendrik A. Mehrtens (Technische Universität Darmstadt)*; Camila Gonzalez (Technische Universität Darmstadt); Anirban Mukhopadhyay (TU Darmstadt)
Calibration and uncertainty estimation are crucial topics in high-risk environments. Following the recent interest in the diversity of ensembles, we systematically evaluate the viability of explicitly regularizing ensemble diversity to improve robustness and calibration on in-distribution data as well as under dataset shift. We introduce a new diversity regularizer for classification tasks that uses out-of-distribution samples and increases the overall accuracy, calibration and out-of-distribution detection capabilities of ensembles. We demonstrate that diversity regularization is highly beneficial in architectures where weights are partially shared between the individual members and even allows us to use fewer ensemble members to reach the same level of robustness. Experiments on CIFAR-10, CIFAR-100, and SVHN show that regularizing diversity can have a significant impact on calibration and robustness, as well as out-of-distribution detection.
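
To make the idea concrete, the sketch below combines a standard ensemble classification loss with a penalty on the pairwise similarity of member predictions on an out-of-distribution batch; minimizing the similarity encourages diversity. This is an illustrative stand-in, not the paper's exact regularizer:

```python
import torch.nn.functional as F

def ensemble_loss(logits_in, targets, logits_ood, lam=0.1):
    """logits_in / logits_ood: lists with one (B, C) logit tensor per ensemble member,
    computed on an in-distribution and an out-of-distribution batch, respectively."""
    ce = sum(F.cross_entropy(l, targets) for l in logits_in) / len(logits_in)
    probs = [F.softmax(l, dim=1) for l in logits_ood]
    sim, pairs = 0.0, 0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            sim = sim + F.cosine_similarity(probs[i], probs[j], dim=1).mean()
            pairs += 1
    return ce + lam * sim / max(pairs, 1)   # penalizing agreement on OOD data encourages diversity
```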

GCPR Oral Session II: Computer Vision in Life and Natural Sciences

I-MuPPET: Interactive Multi-Pigeon Pose Estimation and Tracking
Urs Waldmann (University of Konstanz)*; Hemal Naik (Max Planck Institute of Animal Behavior); Nagy Máté (MTA-ELTE “Lendulet” Collective Behaviour Research Group); Fumihiro Kano (University of Konstanz); Iain D Couzin (Max Planck Institute of Animal Behavior); Oliver Deussen (University of Konstanz); Bastian Goldluecke (University of Konstanz)
While most tracking data encompasses humans, the availability of annotated tracking data for animals is limited, especially for multiple objects. To overcome this obstacle, we present I-MuPPET, a system to estimate and track 2D keypoints of multiple pigeons at interactive speed. We train a Keypoint R-CNN on single pigeons in a fully supervised manner and infer keypoints and bounding boxes of multiple pigeons with that neural network. We use a state-of-the-art tracker to track the individual pigeons in video sequences. I-MuPPET is tested quantitatively on single pigeon motion capture data, and we achieve comparable accuracy to state-of-the-art 2D animal pose estimation methods in terms of Root Mean Square Error (RMSE). Additionally, we test I-MuPPET to estimate and track poses of multiple pigeons in video sequences with up to four pigeons and obtain stable and accurate results with up to 17 fps. To establish a baseline for future research, we perform a detailed quantitative tracking evaluation, which yields encouraging results.

Interpretable Prediction of Pulmonary Hypertension in Newborns using Echocardiograms
Hanna Ragnarsdottir (ETH Zurich); Laura Manduchi (ETH Zurich); Holger Michel (Barmherzige Regensburg); Fabian Laumer (ETH); Sven Wellmann (Barmherzige Regensburg); Ece Ozkan (ETH Zurich)*; Julia Vogt (ETH Zurich)
Pulmonary hypertension (PH) in newborns and infants is a complex condition associated with several pulmonary, cardiac, and systemic diseases contributing to morbidity and mortality. Therefore, accurate and early detection of PH is crucial for successful management. In echocardiography, the primary diagnostic tool in pediatrics, human assessment is both time-consuming and expertise-demanding, raising the need for an automated approach. In this work, we present an interpretable multi-view video-based deep learning approach to predict PH for a cohort of 194 newborns using echocardiograms. We use spatio-temporal convolutional architectures for the prediction of PH from each view, and aggregate the predictions of the different views using majority voting. To the best of our knowledge, this is the first work for an automated assessment of PH in newborns using echocardiograms. Our results show a mean F1-score of 0.84 for severity prediction and 0.92 for binary detection using 10-fold cross-validation. We complement our predictions with saliency maps and show that the learned model focuses on clinically relevant cardiac structures, motivating its usage in clinical practice. Keywords: Echocardiography, Pediatrics, Computer Assisted Diagnosis (CAD), Interpretable Machine Learning.
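
A tiny sketch of the per-view aggregation described above: each echocardiographic view yields its own prediction and the final label is the majority vote. The class names are placeholders, not the paper's label set:

```python
from collections import Counter

def aggregate_views(view_predictions):
    """view_predictions: one predicted label per echocardiographic view."""
    return Counter(view_predictions).most_common(1)[0][0]

print(aggregate_views(["severe", "mild", "severe"]))   # -> severe
```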

GCPR Oral Session III: Systems and Applications

GazeTransformer: Gaze Forecasting for Virtual Reality using Transformer Networks
Tim Rolff (Universität Hamburg)*; Harm Matthias Harms (QUIS); Frank Steinicke (Universität Hamburg); Simone Frintrop (University of Hamburg)
In this paper, we propose GazeTransformer, a transformer architecture for forecasting egocentric gaze points in virtual environments (VEs) presented on immersive head-mounted displays (HMDs). In contrast to previous architectures, we do not rely on information that depends on the application state, but rather focus on data modalities provided by the eye-tracker or sent to the HMD. GazeTransformer can forecast multiple types of eye movements, including saccades and fixations; to this end, we create two unfiltered datasets that use the raw gaze from the eye-tracker for forecasting. Moreover, we analyze six different image encoding backends with respect to their ability to forecast gaze positions. To evaluate the performance of our model, we compare all architectures on the generated datasets. The results show that our architecture with all chosen backends outperforms the current state-of-the-art approaches in forecasting egocentric gaze points in VEs.

Improving traffic sign recognition by active search
Sami Jaghouar (University Of Technology of Compiegne); Hannes Gustafsson (Chalmers University of Technology); Bernhard Mehlig (University of Gothenburg); Erik Werner (Zenseact)*; Niklas Gustafsson (Zenseact)
We describe an iterative active-learning algorithm to recognise rare traffic signs. A standard ResNet is trained on a training set containing only a single sample of the rare class. We demonstrate that by sorting the samples of a large, unlabeled set by the estimated probability of belonging to the rare class, we can efficiently identify samples from the rare class. This works despite the fact that this estimated probability is usually quite low. A reliable active-learning loop is obtained by labeling these candidate samples, including them in the training set, and iterating the procedure. Further, we show that we get similar results starting from a single synthetic sample. Our results are important as they indicate a straightforward way of improving traffic-sign recognition for automated driving systems. In addition, they show that we can make use of the information hidden in low confidence outputs, which is usually ignored.
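
One round of such an active-search loop could look like the sketch below; the scikit-learn-style `fit`/`predict_proba` interface is an assumption for brevity (the paper trains a ResNet), and the function name is made up:

```python
import numpy as np

def active_search_round(model, x_labeled, y_labeled, x_pool, rare_class, k=100):
    """One round of the loop: retrain, rank the unlabeled pool by the (typically small)
    estimated probability of the rare class, and return the top-k candidates for labeling."""
    model.fit(x_labeled, y_labeled)
    probs = model.predict_proba(x_pool)[:, rare_class]
    return np.argsort(probs)[::-1][:k]
```

The returned indices are sent for labeling, added to the training set, and the procedure is iterated.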

GCPR Oral Session IV: 3D Vision, Stereo and Motion

SF2SE3: Clustering Scene Flow into SE(3)-Motions via Proposal and Selection
Leonhard Sommer (University of Freiburg)*; Philipp Schröppel (University of Freiburg); Thomas Brox (University of Freiburg)
We propose SF2SE3, a novel approach to estimate scene dynamics in the form of a segmentation into independently moving rigid objects and their SE(3)-motions. SF2SE3 operates on two consecutive stereo or RGB-D images. First, noisy scene flow is obtained by application of existing optical flow and depth estimation algorithms. SF2SE3 then iteratively (1) samples pixel sets to compute SE(3)-motion proposals, and (2) selects the best SE(3)-motion proposal with respect to a maximum coverage formulation. Finally, objects are formed by assigning pixels uniquely to the selected SE(3)-motions based on consistency with the input scene flow and spatial proximity. The main novelties are a more informed strategy for the sampling of motion proposals and a maximum coverage formulation for the proposal selection. We conduct evaluations on multiple datasets regarding application of SF2SE3 for scene flow estimation, object segmentation and visual odometry. SF2SE3 performs on par with the state of the art for scene flow estimation and is more accurate for segmentation and odometry.
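
For background on the proposal step, an SE(3) motion can be fitted to a sampled set of 3D point correspondences (points before and after the motion, taken from the scene flow) with the standard Kabsch/Umeyama procedure; this is a generic sketch, not SF2SE3's exact proposal routine:

```python
import numpy as np

def fit_se3(src, dst):
    """Least-squares rigid motion (R, t) with dst ≈ R @ src + t, for matched 3D points
    src, dst of shape (N, 3)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t
```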

Online Marker-free Extrinsic Camera Calibration using Person Keypoint Detections
Bastian Pätzold (University of Bonn)*; Simon Bultmann (University of Bonn); Sven Behnke (University of Bonn)
Calibration of multi-camera systems, i.e. determining the relative poses between the cameras, is a prerequisite for many tasks in computer vision and robotics. Camera calibration is typically achieved using offline methods that use checkerboard calibration targets. These methods, however, are often cumbersome and lengthy, considering that a new calibration is required each time any camera pose changes. In this work, we propose a novel, marker-free online method for the extrinsic calibration of multiple smart edge sensors, relying solely on 2D human keypoint detections that are computed locally on the sensor boards from RGB camera images. Our method assumes the intrinsic camera parameters to be known and requires priming with a rough initial estimate of the camera poses. The person keypoint detections from multiple views are received at a central backend where they are synchronized, filtered, and assigned to person hypotheses. We use these person hypotheses to repeatedly solve optimization problems in the form of factor graphs. Given suitable observations of one or multiple persons traversing the scene, the estimated camera poses converge towards a coherent extrinsic calibration within a few minutes. We evaluate our approach in real-world settings and show that the calibration with our method achieves lower reprojection errors compared to a reference calibration generated by an offline method using a traditional calibration target.
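
The kind of measurement factor that enters such a factor graph is a reprojection residual between a person-keypoint hypothesis and its 2D detection; the sketch below is illustrative only, and the function name and argument layout are assumptions:

```python
import numpy as np

def reprojection_residual(K, R, t, point_world, keypoint_px):
    """K: (3, 3) known intrinsics; (R, t): camera pose being optimized; point_world: (3,)
    person keypoint hypothesis; keypoint_px: (2,) detected keypoint."""
    p_cam = R @ point_world + t
    p_img = K @ p_cam
    return p_img[:2] / p_img[2] - keypoint_px
```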

NeuralMeshing: Differentiable Meshing of Implicit Neural Representations
Mathias Vetsch (Department of Computer Science, ETH Zurich, Switzerland); Sandro Lombardi (ETH Zurich)*; Marc Pollefeys (ETH Zurich / Microsoft); Martin R. Oswald (ETH Zurich)
The generation of triangle meshes from point clouds, i.e. meshing, is a core task in computer graphics and computer vision. Traditional techniques directly construct a surface mesh using local decision heuristics, while some recent methods based on neural implicit representations try to leverage data-driven approaches for this meshing process. However, it is challenging to define a learnable representation for triangle meshes of unknown topology and size, and for this reason neural implicit representations rely on non-differentiable post-processing in order to extract the final triangle mesh. In this work, we propose a novel differentiable meshing algorithm for extracting surface meshes from neural implicit representations. Our method produces the mesh in an iterative fashion, which makes it applicable to shapes of various scales and adaptive to the local curvature of the shape. Furthermore, our method produces meshes with regular tessellation patterns and fewer triangle faces compared to existing methods. Experiments demonstrate comparable reconstruction performance and favorable mesh properties compared to baselines.
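
For context, the non-differentiable post-processing that the paper contrasts itself with is typically marching cubes on a sampled grid of the implicit function; below is a standard scikit-image sketch with a sphere SDF standing in for a trained network:

```python
import numpy as np
from skimage import measure

# Sample the implicit function on a grid (a sphere SDF stands in for a trained network).
grid = np.linspace(-1.0, 1.0, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5

verts, faces, normals, values = measure.marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)
```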

GCPR Oral Session V: Self-Supervised Learning

FastSiam: Resource-Efficient Self-Supervised Learning on a Single GPU
Daniel Pototzky (Robert Bosch GmbH)*; Azhar Sultan (Robert Bosch GmbH); Lars Schmidt-Thieme (Universität Hildesheim)
Self-supervised pretraining has shown impressive performance in recent years, matching or even outperforming ImageNet weights on a broad range of downstream tasks. Unfortunately, existing methods require massive amounts of computing power with large batch sizes and batch norm statistics synchronized across multiple GPUs. This effectively excludes substantial parts of the computer vision community, who do not have access to extensive computing resources, from the benefits of self-supervised learning. To address that, we develop FastSiam with the aim of matching ImageNet weights given as little computing power as possible. We find that a core weakness of previous methods like SimSiam is that they compute the training target based on a single augmented crop (or “view”), leading to target instability. We show that by using multiple views per image instead of one, the training target can be stabilized, allowing for faster convergence and substantially reduced runtime. We evaluate FastSiam on multiple challenging downstream tasks including object detection, instance segmentation and keypoint detection and find that it matches ImageNet weights after 25 epochs of pretraining on a single GPU with a batch size of only 32.
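
A SimSiam-style loss with the multi-view target stabilization described above might look like the sketch below; symmetrization and other training details are omitted, and the function name and split of views are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def multi_view_target_loss(encoder, predictor, views):
    """views: list of >= 2 augmented crops (B, C, H, W) of the same images. The training
    target is the stop-gradient average over several views instead of a single view."""
    embeddings = [encoder(v) for v in views]
    target = torch.stack(embeddings[1:]).mean(dim=0).detach()   # stabilized, no gradient
    prediction = predictor(embeddings[0])
    return -F.cosine_similarity(prediction, target, dim=1).mean()
```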

Self-Supervised Learning for Unintentional Action Prediction
Olga Zatsarynna (University of Bonn)*; Yazan Abu Farha (Birzeit University); Jürgen Gall (University of Bonn)
Distinguishing if an action is performed as intended or if an intended action fails is an important skill that not only humans have, but that is also important for intelligent systems that operate in human environments. Recognizing if an action is unintentional or anticipating if an action will fail, however, is not straightforward due to lack of annotated data. While videos of unintentional or failed actions can be found on the Internet in abundance, high annotation costs are a major bottleneck for learning networks for these tasks. In this work, we thus study the problem of self-supervised representation learning for unintentional action prediction. While previous works learn the representation based on a local temporal neighborhood, we show that the global context of a video is needed to learn a good representation for the three downstream tasks: unintentional action classification, localization and anticipation. In the supplementary material, we show that the learned representation can be used for detecting anomalies in videos as well.

GCPR Oral Session VI: Photogrammetry and Remote Sensing

Probabilistic Biomass Estimation with Conditional Generative Adversarial Networks
Johannes Leonhardt (University of Bonn)*; Lukas Drees (University of Bonn – IGG); Peter Jung (Technische Universität Berlin); Ribana Roscher (University of Bonn)
Biomass is an important variable for our understanding of the terrestrial carbon cycle, facilitating the need for satellite-based global and continuous monitoring. However, current machine learning methods used to map biomass often cannot model the complex relationship between biomass and satellite observations or cannot account for the estimation’s uncertainty. In this work, we exploit the stochastic properties of Conditional Generative Adversarial Networks for quantifying aleatoric uncertainty. Furthermore, we use generator Snapshot Ensembles in the context of epistemic uncertainty and show that unlabeled data can easily be incorporated into the training process. The methodology is tested on a newly presented dataset for satellite-based estimation of biomass from multispectral and radar imagery, using lidar-derived maps as reference data. The experiments show that the final network ensemble captures the dataset’s probabilistic characteristics, delivering accurate estimates and well-calibrated uncertainties.
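
Exploiting the stochasticity of a conditional generator for aleatoric uncertainty boils down to sampling it repeatedly for the same input; the sketch below assumes a `generator(condition, z)` signature and a latent dimension of 128, both of which are illustrative assumptions:

```python
import torch

@torch.no_grad()
def predictive_stats(generator, condition, num_samples=32, latent_dim=128):
    """Draw repeated generator samples for the same satellite observation (condition)
    and summarize them as a pixel-wise mean and standard deviation."""
    samples = []
    for _ in range(num_samples):
        z = torch.randn(condition.size(0), latent_dim, device=condition.device)
        samples.append(generator(condition, z))
    samples = torch.stack(samples)              # (S, B, 1, H, W)
    return samples.mean(dim=0), samples.std(dim=0)
```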

Time Dependent Image Generation of Plants from Incomplete Sequences with CNN-Transformer
Lukas Drees (University of Bonn – IGG)*; Immanuel Weber (AB|EX PLEdoc); Marc Rußwurm (École Polytechnique Fédérale de Lausanne); Ribana Roscher (University of Bonn)
Data imputation of incomplete image sequences is an essential prerequisite for analyzing and monitoring all development stages of plants in precision agriculture. For this purpose, we propose TransGrow, a conditional Wasserstein generative adversarial network that combines convolutions for spatial modeling and a transformer for temporal modeling, enabling time-dependent image generation of above-ground plant phenotypes. Thereby, we achieve the following advantages over comparable data imputation approaches: (1) The model is conditioned by an incomplete image sequence of arbitrary length, the input time points, and the requested output time point, allowing multiple growth stages to be generated in a targeted manner; (2) By considering a stochastic component and generating a distribution for each point in time, the uncertainty in plant growth is considered and can be visualized; (3) Besides interpolation, also test-extrapolation can be performed to generate future plant growth stages. Experiments based on two datasets of different complexity levels are presented: Laboratory single plant sequences with Arabidopsis thaliana and agricultural drone image sequences showing crop mixtures. When comparing TransGrow to interpolation in image space as well as variational and adversarial autoencoders, it demonstrates significant improvements in image quality, measured by multi-scale structural similarity, peak signal-to-noise ratio, and Fréchet inception distance. To our knowledge, TransGrow is the first approach for time- and image-dependent, high-quality generation of plant images based on incomplete sequences.

GCPR Posters I

Reiterative Domain Aware Multi-Target Adaptation
Sudipan Saha (Technical University of Munich)*; Shan Zhao (Technical University of Munich); Nasrullah Sheikh (IBM); Xiaoxiang Zhu (Technical University of Munich, German Aerospace Center)
Multi-Target Domain Adaptation (MTDA) is a recently popular and powerful setting in which a single classifier is learned for multiple unlabeled target domains. A popular MTDA approach is to sequentially adapt one target domain at a time. While only one pass is made through each target domain, the adaptation process for each target domain may consist of many iterations. Inspired by spaced learning in neuroscience, we instead propose a reiterative approach where we make several passes/reiterations through each target domain. This leads to a better episodic learning that effectively retains features for multiple targets. The reiterative approach does not increase the total number of training iterations, as we simply decrease the number of iterations per domain per reiteration. To build a multi-target classifier, it is also important to have a backbone feature extractor that generalizes well across domains. Towards this, we adopt a Transformer as the feature extraction backbone. We perform extensive experiments on three popular MTDA datasets: Office-Home, Office-31, and DomainNet, a large-scale dataset. Our experiments separately show the benefits of both the reiterative approach and the superior Transformer-based feature extraction backbone.

Augmentation Learning for Semi-Supervised Classification
Tim H.A. Frommknecht (Goethe University)*; Pedro Alves Zipf (Goethe Universität Frankfurt); Quanfu Fan (IBM Research); Nina Shvetsova (Goethe University Frankfurt); Hilde Kuehne (University of Frankfurt)
Recently, a number of new Semi-Supervised Learning methods have emerged. As the accuracy for ImageNet and similar datasets increased over time, the performance on tasks beyond the classification of natural images is yet to be explored. Most Semi-Supervised Learning methods rely on a carefully manually designed data augmentation pipeline that is not transferable to learning on images of other domains. In this work, we propose a Semi-Supervised Learning method that automatically selects the most effective data augmentation policy for a particular dataset. We build upon the FixMatch method and extend it with meta-learning of augmentations. The augmentation is learned in additional training before the classification training and makes use of bi-level optimization to optimize the augmentation policy and maximize accuracy. We evaluate our approach on two domain-specific datasets, containing satellite images and hand-drawn sketches, and obtain state-of-the-art results. In an ablation study, we further investigate the different parameters relevant for learning augmentation policies and show how policy learning can be used to adapt augmentations to datasets beyond ImageNet.

Multi-Attribute Open Set Recognition
Piyapat Saranrittichai (Bosch Center for Artificial Intelligence)*; Chaithanya Kumar Mummadi (Bosch Center for Artificial Intelligence); Claudia Blaiotta (Bosch Center for Artificial Intelligence); Mauricio Munoz (Bosch Center for Artificial Intelligence); Volker Fischer (Bosch Center for Artificial Intelligence)
Open Set Recognition (OSR) extends image classification to an open-world setting, by simultaneously classifying known classes and identifying unknown ones. While conventional OSR approaches can detect Out-of-Distribution (OOD) samples, they cannot provide explanations indicating which underlying visual attribute(s) (e.g., shape, color or background) cause a specific sample to be unknown. In this work, we introduce a novel problem setup that generalizes conventional OSR to a multi-attribute setting, where multiple visual attributes are simultaneously recognized. Here, OOD samples can be not only identified but also categorized by their unknown attribute(s). We propose simple extensions of common OSR baselines to handle this novel scenario. We show that these baselines are vulnerable to shortcuts when spurious correlations exist in the training dataset. This leads to poor OOD performance which, according to our experiments, is mainly due to unintended cross-attribute correlations of the predicted confidence scores. We provide empirical evidence showing that this behavior is consistent across different baselines on both synthetic and real world datasets.

A Syntactic Pattern Recognition based Approach to Online Anomaly Detection and Identification on Electric Motors
Kutalmış Coşkun (Marmara University)*; Zeynep Kumralbaş (Marmara University); Hazel Çavuş (Marmara University); Borahan Tumer (Marmara University)
Online anomaly detection and identification is a major task of many Industry 4.0 applications. Electric motors, being one of the most crucial parts of many products, are subjected to end-of-line tests to pick up faulty ones before being mounted to other devices. With this study, we propose a Syntactic Pattern Recognition based approach to online anomaly detection and identification on electric motors. Utilizing Variable Order Markov Models and Probabilistic Suffix Trees, we apply both unsupervised and supervised approaches to cluster motor conditions and diagnose them. Besides being explainable, the diagnosis method we propose is completely online and suitable for parallel computing, which makes it a favorable method to use synchronously with a physical test system. We evaluate the proposed method on a benchmark dataset and on a case study carried out within the scope of a European Union-funded research project on reliability.

Sparse Visual Counterfactual Explanations in Image Space
Valentyn Boreiko (University of Tübingen)*; Maximilian Augustin (University of Tuebingen); Francesco Croce (University of Tübingen); Philipp Berens (University of Tübingen); Matthias Hein (University of Tübingen)
Visual counterfactual explanations (VCEs) in image space are an important tool to understand decisions of image classifiers as they show under which changes of the image the decision of the classifier would change. Their generation in image space is challenging and requires robust models due to the problem of adversarial examples. Existing techniques to generate VCEs in image space suffer from spurious changes in the background. Our novel perturbation model for VCEs together with its efficient optimization via our novel Auto-Frank-Wolfe scheme yields sparse VCEs which lead to subtle changes specific for the target class. Moreover, we show that VCEs can be used to detect undesired behavior of ImageNet classifiers due to spurious features in the ImageNet dataset.

Improving Unsupervised Label Propagation for Pose Tracking and Video Object Segmentation
Urs Waldmann (University of Konstanz)*; Jannik Bamberger (University of Konstanz); Ole Johannsen (University of Konstanz); Oliver Deussen (University of Konstanz); Bastian Goldluecke (University of Konstanz)
Label propagation is a challenging task in computer vision with many applications. One approach is to learn representations of visual correspondence. In this paper, we study recent works on label propagation based on correspondence, carefully evaluate the effect of various aspects of their implementation, and improve upon various details. Our pipeline assembled from these best practices outperforms the previous state of the art in terms of PCK_0.1 on the JHMDB dataset by 6.5%. We also propose a novel joint framework for tracking and keypoint propagation, which in contrast to the base pipeline is applicable to tracking small objects and obtains results that substantially exceed the performance of the core pipeline. Finally, for VOS, we extend our pipeline to a fully unsupervised one by initializing the first frame with the self-attention layer from DINO. Our pipeline for VOS runs online and can handle static objects. It outperforms unsupervised frameworks with these characteristics.
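
The core of correspondence-based label propagation is transporting reference-frame labels to a target frame via feature affinities; a minimal sketch of that step is shown below (the full softmax without top-k restriction and the temperature value are simplifying assumptions):

```python
import torch
import torch.nn.functional as F

def propagate_labels(feat_ref, labels_ref, feat_tgt, temperature=0.07):
    """feat_ref, feat_tgt: (C, N) per-pixel features of reference and target frame;
    labels_ref: (K, N) one-hot (or soft) labels of the reference frame."""
    feat_ref = F.normalize(feat_ref, dim=0)
    feat_tgt = F.normalize(feat_tgt, dim=0)
    affinity = torch.softmax(feat_ref.t() @ feat_tgt / temperature, dim=0)  # (N_ref, N_tgt)
    return labels_ref @ affinity                                            # (K, N_tgt)
```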

D-InLoc++: Indoor Localization in Dynamic Environments
Martina Dubenova (Czech Technical University in Prague)*; Anna Zderadickova (Czech Technical University in Prague); Ondrej Kafka (Czech Technical University in Prague); Tomas Pajdla (Czech Technical University); Michal Polic (Czech Technical University in Prague)
Most state-of-the-art localization algorithms rely on robust relative pose estimation and geometry verification to obtain moving object agnostic camera poses in complex indoor environments. However, this approach is prone to mistakes if a scene contains repetitive structures, e.g., desks, tables, boxes, or moving people. We show that movable objects incorporate non-negligible localization error and present a new straightforward method to predict the six-degree-of-freedom (6DoF) pose more robustly. We equip the localization pipeline InLoc with the real-time instance segmentation network YOLACT++. The masks of dynamic objects are employed in the relative pose estimation step and in the final sorting of camera pose proposals. First, we filter out the matches lying on masks of the dynamic objects. Second, we skip the comparison of query and synthetic images on the area related to the moving object. This procedure leads to a more robust localization. Lastly, we describe and mitigate the errors caused by gradient-based comparison between synthetic and query images and publish a new pipeline for simulation of environments with movable objects from the Matterport scans. All the codes are available at github.com/dubenma/D-InLocpp.

Global Hierarchical Attention for 3D Point Cloud Analysis
Dan Jia (RWTH Aachen University)*; Alexander Hermans (RWTH Aachen University); Bastian Leibe (RWTH Aachen University)
We propose a new attention mechanism, called Global Hierarchical Attention (GHA), for 3D point cloud analysis. GHA approximates the regular global dot-product attention via a series of coarsening and interpolation operations over multiple hierarchy levels. The advantage of GHA is two-fold. First, it has linear complexity with respect to the number of points, enabling the processing of large point clouds. Second, GHA inherently possesses the inductive bias to focus on spatially close points, while retaining the global connectivity among all points. Combined with a feedforward network, GHA can be inserted into many existing network architectures. We experiment with multiple baseline networks and show that adding GHA consistently improves performance across different tasks and datasets. For the task of semantic segmentation, GHA gives a +1.7% mIoU increase to the MinkowskiEngine baseline on ScanNet. For the 3D object detection task, GHA improves the CenterPoint baseline by +0.5% mAP on the nuScenes dataset, and the 3DETR baseline by +2.1% mAP25 and +1.5% mAP50 on ScanNet.

Robust Object Detection Using Knowledge Graph Embeddings
Christopher Lang (Robert Bosch GmbH)*; Alexander Nicolas Braun (Bosch); Abhinav Valada (University of Freiburg)
Object recognition for the most part has been treated as a one-hot problem that assumes classes to be discrete and unrelated. In this work, we challenge the prevalence of the one-hot approach in closed-set object detection. We evaluate the error statistics of learned class embeddings from a one-hot approach against knowledge embeddings that inherit semantic structure from natural language processing or knowledge graphs, as widely applied in open world object detection. Extensive experimental results on multiple knowledge embeddings as well as distance metrics indicate that knowledge-based class representations result in more semantically grounded misclassifications while performing on par with one-hot methods on the challenging COCO and Cityscapes object detection benchmarks. We generalize our findings to multiple object detection architectures by proposing a knowledge-embedded design, KGE, for keypoint-based and transformer-based object detection architectures.

On Hyperbolic Embeddings in Object Detection
Christopher Lang (Robert Bosch GmbH)*; Lars Schillingmann (Robert Bosch GmbH); Alexander Nicolas Braun (Bosch); Abhinav Valada (University of Freiburg)
Object detection, for the most part, has been formulated in Euclidean space, where Euclidean or spherical geodesic distances measure the similarity of an image region to an object class prototype. In this work, we study whether a hyperbolic geometry better matches the underlying structure of the object classification space. We incorporate a hyperbolic classifier in two-stage, keypoint-based, and transformer-based object detection architectures and evaluate them on large-scale, long-tailed, and zero-shot object detection benchmarks. In our extensive experimental evaluations, we observe categorical class hierarchies emerging in the structure of the classification space, resulting in lower classification errors and boosting the overall object detection performance.
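
For reference, the hyperbolic counterpart to the Euclidean distance mentioned above is the geodesic distance in the Poincaré ball model; the standard formula is sketched below (the clamping constant is an implementation detail of mine):

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance between embeddings u, v inside the Poincaré ball (norm < 1)."""
    sq_diff = ((u - v) ** 2).sum(dim=-1)
    nu = torch.clamp(1 - (u ** 2).sum(dim=-1), min=eps)
    nv = torch.clamp(1 - (v ** 2).sum(dim=-1), min=eps)
    return torch.acosh(1 + 2 * sq_diff / (nu * nv))
```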

GCPR Posters II

A Bhattacharyya Coefficient-Based Framework for Noise Model-Aware Random Walker Image Segmentation
Dominik Drees (University of Münster); Florian Eilers (University of Münster); Ang Bian (Sichuan University); Xiaoyi Jiang (University of Münster)*
One well-established method of interactive image segmentation is the random walker algorithm. Considerable research on this family of segmentation methods has been continuously conducted in recent years with numerous applications. These methods commonly use a simple Gaussian weight function which depends on a parameter that strongly influences the segmentation performance. In this work we propose a general framework of deriving weight functions based on probabilistic modeling. This framework can be concretized to cope with virtually any parametric noise model. It eliminates the critical parameter and thus avoids time-consuming parameter search. We derive the specific weight functions for common noise types and show their superior performance on synthetic data as well as different biomedical image data (MRI images from the NYU fastMRI dataset, larvae images acquired with the FIM technique). Our framework could also be used in multiple other applications, e.g., the graph cut algorithm and its extensions.
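
As background, the Bhattacharyya coefficient named in the title measures the overlap of two distributions; the sketch below computes it for discrete histograms and uses it directly as an edge weight, which is only an assumed, simplified form and not the paper's noise-model-specific derivation:

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Overlap of two discrete distributions (e.g. local pixel statistics); 1 = identical."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(p * q).sum())

# One conceivable edge weight for the random walker graph: large when the statistics of
# neighbouring pixels overlap, small across likely object boundaries.
def edge_weight(hist_i, hist_j):
    return bhattacharyya_coefficient(hist_i, hist_j)
```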

Optimizing Edge Detection for Image Segmentation with Multicut Penalties
Steffen Jung (MPII)*; Sebastian Ziegler (University of Mannheim); Amirhossein Kardoost (University of Mannheim); Margret Keuper (University of Mannheim)
The Minimum Cost Multicut Problem (MP) is a popular way of obtaining a graph decomposition by optimizing binary edge labels over edge costs. While the formulation of an MP from independently estimated costs per edge is highly flexible and intuitive, solving the MP is NP-hard and computationally expensive. As a remedy, recent work proposed to predict edge probabilities with awareness of potential conflicts by incorporating cycle constraints in the prediction process. We argue that such a formulation, while providing a first step towards end-to-end learnable edge weights, is suboptimal, since it is built upon a loose relaxation of the MP. We therefore propose an adaptive CRF that allows us to progressively consider more violated constraints and, in consequence, to issue solutions with higher validity. Experiments on the BSDS500 benchmark for natural image segmentation as well as on electron microscopic recordings show that our approach yields more precise edge detection and image segmentation.
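
The notion of "validity" used above can be checked directly on a binary edge labeling: a labeling decomposes the graph iff no cut edge connects two nodes of the same component induced by the join edges. A small union-find sketch of this check (names and the label convention are assumptions):

```python
def is_valid_multicut(num_nodes, edges, labels):
    """edges: list of (u, v); labels: 1 = cut, 0 = join."""
    parent = list(range(num_nodes))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for (u, v), lab in zip(edges, labels):
        if lab == 0:
            parent[find(u)] = find(v)        # join edges merge components
    return all(find(u) != find(v) for (u, v), lab in zip(edges, labels) if lab == 1)
```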

Hyperspectral Demosaicing of Snapshot Camera Images Using Deep Learning
Eric L Wisotzky (Fraunhofer HHI)*; Charul Daudkhane (Fraunhofer HHI); Anna Hilsmann (Fraunhofer HHI); Peter Eisert (Fraunhofer Heinrich Hertz Institute)
Spectral imaging technologies have rapidly evolved during the past decades. The recent development of single-camera-one-shot techniques for hyperspectral imaging allows multiple spectral bands to be captured simultaneously (3×3, 4×4 or 5×5 mosaic), opening up a wide range of applications. Examples include intraoperative imaging, agricultural field inspection and food quality assessment. To capture images across a wide spectral range, i.e. to achieve high spectral resolution, the sensor design sacrifices spatial resolution. With increasing mosaic size, this effect becomes increasingly detrimental. Furthermore, demosaicing beyond Bayer interpolation is challenging. Without incorporating edge, shape and object information during interpolation, chromatic artifacts are likely to appear in the obtained images. Recent approaches use neural networks for demosaicing, enabling direct information extraction from image data. However, obtaining training data for these approaches poses a challenge as well. This work proposes a parallel neural network based demosaicing procedure trained on a new ground truth dataset captured in a controlled environment by a hyperspectral snapshot camera with a 4×4 mosaic pattern. The dataset is a combination of captured scenes with images from publicly available data adapted to the 4×4 mosaic pattern. To obtain real world ground-truth data, we performed multiple camera captures with 1-pixel shifts in order to compose the entire data cube. Experiments show that the proposed network outperforms state-of-the-art networks.
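
To make the snapshot-mosaic layout concrete: each k×k tile of the raw sensor image holds one sample per spectral band, so the raw frame can be rearranged into a k*k-channel cube at 1/k of the spatial resolution, which demosaicing then upsamples. A minimal sketch (function name and cropping behavior are assumptions):

```python
import numpy as np

def demultiplex_mosaic(raw, k=4):
    """Rearrange a k×k snapshot-mosaic image (H, W) into a (k*k, H//k, W//k) band cube."""
    H, W = raw.shape
    H, W = H - H % k, W - W % k                          # crop to a multiple of the mosaic period
    bands = [raw[i:H:k, j:W:k] for i in range(k) for j in range(k)]
    return np.stack(bands, axis=0)
```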

A Framework for Benchmarking Real-Time Embedded Object Detection
Michael Teutsch (Hensoldt Optronics)*; Daniel König (Hensoldt Optronics GmbH); Michael Schlosser (HENSOLDT Optronics GmbH)
Object detection is one of the key tasks in many applications of computer vision. Deep Neural Networks (DNNs) are undoubtedly a well-suited approach for object detection. However, such DNNs need highly adapted hardware together with hardware-specific optimization to guarantee high efficiency during inference. This is especially the case when aiming for efficient object detection in video streaming applications on limited hardware such as edge devices. Comparing vendor-specific hardware and related optimization software pipelines in a fair experimental setup is a challenge. In this paper, we propose a framework that uses a host computer with a host software application together with a light-weight interface based on the Message Queuing Telemetry Transport (MQTT) protocol. Various target devices with target apps can be connected to this host computer via MQTT. With well-defined and standardized MQTT messages, object detection results can be reported to the host computer, where the results are evaluated without harming or influencing the processing on the device. With this quite generic framework, we can measure the object detection performance, the runtime, and the energy efficiency at the same time. The effectiveness of this framework is demonstrated in multiple experiments that offer deep insights into the optimization of DNNs.
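
To illustrate the kind of light-weight reporting channel described above, the sketch below publishes per-frame detection results over MQTT. It assumes the paho-mqtt 1.x client library; the host name, topic and message schema are placeholders and not the framework's actual format:

```python
import json
import paho.mqtt.client as mqtt        # assuming the paho-mqtt 1.x client API

HOST, TOPIC = "host-computer.local", "benchmark/detections"   # placeholder names

client = mqtt.Client()
client.connect(HOST, 1883)
client.loop_start()

message = {                             # illustrative schema only
    "frame_id": 42,
    "boxes": [[100, 120, 180, 240]],
    "scores": [0.91],
    "classes": ["person"],
    "inference_ms": 13.7,
}
client.publish(TOPIC, json.dumps(message), qos=1)
```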

Image-based Detection of Structural Defects using Hierarchical Multi-Scale Attention
Christian Benz (Bauhaus-Universität Weimar)*; Volker Rodehorst (Bauhaus-Universität Weimar)
With improving acquisition technologies, the inspection and monitoring of structures has become a field of application for deep learning. While other research focuses on the design of neural network architectures, this work points out the applicability of transfer learning for detecting cracks and other structural defects. Being a high performer on the Cityscapes benchmark, hierarchical multi-scale attention [43] also proves suitable for transfer learning in the domain of structural defects. Using the joint scales of 0.25, 0.5, and 1.0, the approach achieves 92% mean intersection-over-union on the test set. The effectiveness of multi-scale attention is demonstrated for class demarcation on large scales and class determination on lower scales. Furthermore, a line-based tolerant intersection-over-union metric is introduced for more robust benchmarking in the field of crack detection. The dataset of 743 images covering crack, spalling, corrosion, efflorescence, vegetation, and control point is unprecedented in terms of quantity and realism.

A Dataset for Analysing Complex Document Layouts in the Digital Humanities and its Evaluation with Krippendorff’s Alpha
David Eike Tschirschwitz (Bauhaus-Universität Weimar)*; Franziska Klemstein (Bauhaus-Universität Weimar); Benno Stein (Bauhaus-Universität Weimar); Volker Rodehorst (Bauhaus-Universität Weimar)
We introduce a new research resource in the form of a high-quality, domain-specific dataset for analysing the document layout of historical documents. The dataset provides an instance segmentation ground truth with 19 classes based on historical layout structures that stem (a) from the publication production process and the respective genres (life sciences, architecture, art, decorative arts, etc.) and (b) from selected text registers (such as monograph, trade journal, illustrated magazine). Altogether, the dataset contains more than 52,000 instances annotated by experts. A baseline has been tested with the well-known Mask R-CNN and compared to the state-of-the-art model VSR. Inspired by evaluation practices from the field of Natural Language Processing (NLP), we have developed a new method for evaluating annotation consistency. Our method is based on Krippendorff’s alpha (K-α), a statistic for quantifying the so-called “inter-annotator agreement”. In particular, we propose an adaptation of K-α that treats annotations as a multipartite graph for assessing the agreement of a variable number of annotators. The method is adjustable with regard to evaluation strictness, and it can be used in 2D or 3D as well as for a variety of tasks such as semantic segmentation, instance segmentation, and 3D point cloud segmentation.
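
For background, the standard Krippendorff's alpha for nominal data, on which the paper's graph-based adaptation builds, can be computed as 1 minus the ratio of observed to expected disagreement; below is a plain-Python sketch of that textbook form, not the paper's adapted metric:

```python
from collections import Counter
import itertools

def krippendorff_alpha_nominal(units):
    """units: list of lists; each inner list holds the labels assigned to one unit
    (one entry per annotator that labeled it)."""
    coincidence = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue                                     # single annotations carry no information
        for a, b in itertools.permutations(range(m), 2):
            coincidence[(labels[a], labels[b])] += 1.0 / (m - 1)
    n_c = Counter()
    for (c, _), v in coincidence.items():
        n_c[c] += v
    n = sum(n_c.values())
    observed = sum(v for (c, k), v in coincidence.items() if c != k)
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - observed / expected

print(krippendorff_alpha_nominal([[1, 1, 1], [1, 2, 1], [2, 2, 2], [1, 2]]))
```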

End-to-End Single Shot Detector using Graph-based Learnable Duplicate Removal
Shuxiao Ding (Mercedes-Benz AG)*; Eike C Rehder (Daimler R&D); Lukas Schneider (Daimler); Marius Cordts (Daimler); Jürgen Gall (University of Bonn)
Non-Maximum Suppression (NMS) is widely used to remove duplicates in object detection. In strong disagreement with the deep learning paradigm, NMS often remains the only heuristic step. Learnable NMS methods have been proposed that are either designed for Faster-RCNN or rely on separate networks. In contrast, learning NMS for SSD models is not well investigated. In this paper, we show that even a very simple rescoring network can be trained end-to-end with an underlying SSD model to solve the duplicate removal problem efficiently. For this, detection scores and boxes are refined from image features by modeling relations between detections in a Graph Neural Network (GNN). Our approach is applicable to the large number of object proposals in SSD using a pre-filtering head. It can easily be employed in arbitrary SSD-like models with a weight-shared box predictor. Experiments on MS-COCO and KITTI show that our method improves accuracy compared with other duplicate removal methods at significantly lower inference time.
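
For reference, the heuristic step that this work replaces with a learned rescoring network is classical greedy NMS, sketched below in its standard form:

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; returns indices of the kept detections."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_threshold]
    return keep
```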

Localized Vision-Language Matching for Open-vocabulary Object Detection
Maria A Bravo (University of Freiburg)*; Sudhanshu Mittal (University of Freiburg); Thomas Brox (University of Freiburg)
In this work, we propose an open-vocabulary object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner and second specializes the model for the object detection task using known class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-vocabulary detection approaches while being data-efficient. Source code is available at https://github.com/lmb-freiburg/locov.

Diverse Video Captioning by Adaptive Spatio-temporal Attention
Zohreh Ghaderi (University of Tübingen)*; Leonard Salewski (University of Tübingen); Hendrik P. A. Lensch (University of Tübingen)
To generate proper captions for videos, the inference needs to identify relevant concepts and pay attention to the spatial relationships between them as well as to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures, an adapted transformer for a single joint spatio-temporal video analysis as well as a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme to reduce the number of required incoming frames while maintaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample. Our approach achieves state-of-the-art results on the MSVD, as well as on the large-scale MSR-VTT and the VATEX benchmark datasets considering multiple Natural Language Generation (NLG) metrics. Additional evaluations on diversity scores highlight the expressiveness and diversity in the structure of our generated captions.

Nectar Track Oral Session

A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching
Paul Roetzer (TUM)*; Paul Swoboda (MPI fuer Informatik, Saarbruecken); Daniel Cremers (TU Munich); Florian Bernard (University of Bonn)
We present a scalable combinatorial algorithm for globally optimizing over the space of geometrically consistent mappings between 3D shapes. We use the mathematically elegant formalism proposed by Windheuser et al. (ICCV, 2011) where 3D shape matching was formulated as an integer linear program over the space of orientation-preserving diffeomorphisms. Until now, the resulting formulation had limited practical applicability due to its complicated constraint structure and its large size. We propose a novel primal heuristic coupled with a Lagrange dual problem that is several orders of magnitude faster compared to previous solvers. This allows us to handle shapes with substantially more triangles than previously solvable. We demonstrate compelling results on diverse datasets and even showcase that we can address the challenging setting of matching two partial shapes without availability of complete shapes. Our code is publicly available at http://github.com/paul0noah/sm-comb.

LiDAR Snowfall Simulation for Robust 3D Object Detection
Martin Hahner (ETH Zurich)*; Christos Sakaridis (ETH Zurich); Mario Bijelic (Princeton University); Felix Heide (Princeton University); Fisher Yu (ETH Zurich); Dengxin Dai (MPI for Informatics); Luc Van Gool (ETH Zurich)
3D object detection is a central task for applications such as autonomous driving, in which the system needs to localize and classify surrounding traffic agents, even in the presence of adverse weather. In this paper, we address the problem of LiDAR-based 3D object detection under snowfall. Due to the difficulty of collecting and annotating training data in this setting, we propose a physically based method to simulate the effect of snowfall on real clear-weather LiDAR point clouds. Our method samples snow particles in 2D space for each LiDAR line and uses the induced geometry to modify the measurement for each LiDAR beam accordingly. Moreover, as snowfall often causes wetness on the ground, we also simulate ground wetness on LiDAR point clouds. We use our simulation to generate partially synthetic snowy LiDAR data and leverage these data for training 3D object detection models that are robust to snowfall. We conduct an extensive evaluation using several state-of-the-art 3D object detection methods and show that our simulation consistently yields significant performance gains on the real snowy STF dataset compared to clear-weather baselines and competing simulation approaches, while not sacrificing performance in clear weather.

Towards Understanding Adversarial Robustness of Optical Flow Networks
Simon Schrodi (University of Freiburg)*; Tonmoy Saikia (University of Freiburg); Thomas Brox (University of Freiburg)
View abstract Recent work demonstrated the lack of robustness of optical flow networks to physical patch-based adversarial attacks. The possibility of physically attacking a basic component of automotive systems is a reason for serious concern. In this paper, we analyze the cause of the problem and show that the lack of robustness is rooted in the classical aperture problem of optical flow estimation in combination with bad choices in the details of the network architecture. We show how these mistakes can be rectified in order to make optical flow networks robust to physical patch-based attacks. Additionally, we examine global white-box attacks in the scope of optical flow. We find that targeted white-box attacks can be crafted to bias flow estimation models towards any desired output, but this requires access to the input images and model weights. However, in the case of universal attacks, we find that optical flow networks are robust. Code is available at https://github.com/lmb-freiburg/understanding_flow_robustness.
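For context, targeted white-box attacks of this kind are typically crafted by gradient descent on the distance between the predicted and the desired flow. The PGD-style sketch below assumes a generic flow_net(img1, img2) callable returning a (B, 2, H, W) flow and an L-infinity budget eps; both are illustrative placeholders, not the attack used in the paper.

    import torch

    def targeted_flow_attack(flow_net, img1, img2, target_flow,
                             eps=8 / 255, steps=40, lr=1 / 255):
        """PGD-style targeted attack on an optical flow network (schematic sketch).

        All image tensors are assumed to lie in [0, 1]; the perturbation is shared
        by both frames and bounded in L-infinity norm by eps.
        """
        delta = torch.zeros_like(img1, requires_grad=True)
        for _ in range(steps):
            flow = flow_net((img1 + delta).clamp(0, 1), (img2 + delta).clamp(0, 1))
            loss = (flow - target_flow).abs().mean()   # distance to the desired output
            loss.backward()
            with torch.no_grad():
                delta -= lr * delta.grad.sign()        # descend on the distance
                delta.clamp_(-eps, eps)                # stay within the attack budget
                delta.grad.zero_()
        return delta.detach()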

Tracking Every Thing in the Wild
Siyuan Li (ETH Zurich)*; Martin Danelljan (ETH Zurich); Henghui Ding (ETH Zurich); Thomas E Huang (ETH Zürich); Fisher Yu (ETH Zurich)
View abstract Current multi-category Multiple Object Tracking (MOT) metrics use class labels to group tracking results for per-class evaluation. Similarly, MOT methods typically only associate objects with the same class predictions. These two prevalent strategies in MOT implicitly assume that the classification performance is near-perfect. However, this is far from the case in recent large-scale MOT datasets, which contain large numbers of classes with many rare or semantically similar categories. Therefore, the resulting inaccurate classification leads to sub-optimal tracking and inadequate benchmarking of trackers. We address these issues by disentangling classification from tracking. We introduce a new metric, Track Every Thing Accuracy (TETA), which breaks tracking measurement into three sub-factors: localization, association, and classification, allowing comprehensive benchmarking of tracking performance even under inaccurate classification. TETA also deals with the challenging incomplete annotation problem in large-scale tracking datasets. We further introduce a Track Every Thing tracker (TETer), which performs association using Class Exemplar Matching (CEM). Our experiments show that TETA evaluates trackers more comprehensively, and TETer achieves significant improvements on the challenging large-scale datasets BDD100K and TAO compared to the state-of-the-art.

Nectar Track Posters I

Kernel-Based Generalized Median Computation for Consensus Learning
Andreas Nienkötter (University of Münster); Xiaoyi Jiang (University of Münster)*
View abstract Computing a consensus object from a set of given objects is a core problem in machine learning and pattern recognition. One popular approach is to formulate it as an optimization problem using the generalized median. Previous methods like the Prototype and Distance-Preserving Embedding methods transform objects into a vector space, solve the generalized median problem in this space, and inversely transform the solution back into the original space. Both of these methods have been successfully applied to a wide range of object domains in which the generalized median problem has inherently high computational complexity (typically NP-hard) and therefore requires approximate solutions. Previously, explicit embedding methods were used in the computation, which often do not reflect the spatial relationship between objects exactly. In this work we introduce a kernel-based generalized median framework that is applicable to both positive definite and indefinite kernels. This framework computes the relationship between objects and their generalized median in kernel space, without the need for an explicit embedding. We show that the spatial relationship between objects is more accurately represented in kernel space than in an explicit vector space using easy-to-compute kernels, and we demonstrate superior performance of generalized median computation on datasets from three different domains. A software toolbox resulting from our work is made publicly available to encourage other researchers to explore generalized median computation and its applications.
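To make the kernel-space view concrete: for an embedding induced by a kernel k, the squared distance between two embedded objects is k(x,x) - 2 k(x,y) + k(y,y), so pairwise distances can be read off a Gram matrix without ever constructing the embedding. The sketch below computes these distances and a simple set-median baseline; the actual framework optimizes the generalized median over the whole kernel space (and handles indefinite kernels), so this is only an illustration.

    import numpy as np

    def kernel_distances(K):
        """Pairwise kernel-induced squared distances from a Gram matrix K,
        d^2(i, j) = k(i, i) - 2 k(i, j) + k(j, j)."""
        diag = np.diag(K)
        return diag[:, None] - 2.0 * K + diag[None, :]

    def set_median(K):
        """Index of the set median: the input object minimizing the summed
        kernel-induced distance to all others (a baseline; the generalized
        median is not restricted to the input set)."""
        d2 = np.maximum(kernel_distances(K), 0.0)   # clip numerical noise
        return int(np.argmin(np.sqrt(d2).sum(axis=1)))

For vector data with the linear kernel K = X @ X.T, these kernel-induced distances coincide with ordinary Euclidean distances.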

Ranking Info Noise Contrastive Estimation: Boosting Contrastive Learning via Ranked Positives
David T. Hoffmann (University of Freiburg; Bosch Center for Artificial Intelligence)*; Nadine Behrmann (Bosch Center for Artificial Intelligence); Juergen Gall (University of Bonn); Thomas Brox (University of Freiburg); Mehdi Noroozi (Bosch Center for Artificial Intelligence)
View abstract This paper introduces Ranking Info Noise Contrastive Estimation (RINCE), a new member in the family of InfoNCE losses that preserves a ranked ordering of positive samples. In contrast to the standard InfoNCE loss, which requires a strict binary separation of the training pairs into similar and dissimilar samples, RINCE can exploit information about a similarity ranking for learning a corresponding embedding space. We show that the proposed loss function learns favorable embeddings compared to the standard InfoNCE whenever at least noisy ranking information can be obtained or when the definition of positives and negatives is blurry. We demonstrate this for a supervised classification task with additional superclass labels and noisy similarity scores. Furthermore, we show that RINCE can also be applied to unsupervised training, with experiments on unsupervised representation learning from videos. In particular, the resulting embedding yields higher classification accuracy and retrieval rates, and performs better on out-of-distribution detection, than the standard InfoNCE loss.
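The exact RINCE loss is given in the paper; to make the contrast concrete, the sketch below shows the standard InfoNCE loss it generalizes, for a single query with one positive and several negatives (the binary case that RINCE relaxes into a ranking over positives). Shapes and the temperature value are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def info_nce(query, positive, negatives, temperature=0.1):
        """Standard InfoNCE loss for one query (the special case RINCE generalizes).

        query: (D,), positive: (D,), negatives: (N, D); embeddings are L2-normalized.
        """
        pos = (query * positive).sum() / temperature   # similarity to the positive
        neg = negatives @ query / temperature          # (N,) similarities to negatives
        logits = torch.cat([pos.view(1), neg])         # the positive sits at index 0
        return F.cross_entropy(logits.view(1, -1), torch.zeros(1, dtype=torch.long))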

Learning Where To Look – Generative NAS is Surprisingly Efficient
Steffen Jung (MPII)*; Jovita Lukasik (University of Mannheim); Margret Keuper (University of Mannheim)
View abstract The efficient, automated search for well-performing neural architectures (NAS) has drawn increasing attention in the recent past. Thereby, the predominant research objective is to reduce the necessity of costly evaluations of neural architectures while efficiently exploring large search spaces. To this aim, surrogate models embed architectures in a latent space and predict their performance, while generative models for neural architectures enable optimization-based search within the latent space the generator draws from. Both surrogate and generative models aim to facilitate query-efficient search in a well-structured latent space. In this paper, we further improve the trade-off between query efficiency and promising architecture generation by leveraging the advantages of both efficient surrogate models and generative design. To this end, we propose a generative model, paired with a surrogate predictor, that iteratively learns to generate samples from increasingly promising latent subspaces. This approach leads to very effective and efficient architecture search while keeping the query amount low. In addition, our approach allows us to jointly optimize for multiple objectives, such as accuracy and hardware latency, in a straightforward manner. We show the benefit of this approach not only for the optimization of architectures for the highest classification accuracy but also in the context of hardware constraints, and we outperform state-of-the-art methods on several NAS benchmarks for single and multiple objectives. We also achieve state-of-the-art performance on ImageNet. The code is available at https://github.com/jovitalukasik/AG-Net.
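The generate-predict-evaluate loop described above can be summarized as follows; the callables generate, predict and evaluate are placeholders for the latent generator, the surrogate predictor and the costly ground-truth evaluation, and the loop is a schematic sketch rather than the AG-Net implementation.

    def latent_search(generate, predict, evaluate, rounds=10, samples=256, top_k=16):
        """Schematic query-efficient architecture search loop (not the AG-Net code).

        generate(n)    -> list of n candidate architectures from the current latent model
        predict(arch)  -> cheap surrogate score
        evaluate(arch) -> expensive ground-truth score, queried only for top_k per round
        Returns the best queried architecture and the archive of all queries.
        """
        archive = []
        for _ in range(rounds):
            candidates = generate(samples)
            candidates.sort(key=predict, reverse=True)   # rank by the surrogate
            for arch in candidates[:top_k]:              # query only promising candidates
                archive.append((arch, evaluate(arch)))
            # in the actual method, generator and surrogate are now retrained on the
            # archive so that later rounds sample increasingly promising latent subspaces
        return max(archive, key=lambda t: t[1]), archive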

A Unified Framework for Implicit Sinkhorn Differentiation
Marvin Eisenberger (TU Munich)*; Aysim Toker (TUM); Laura Leal-Taixé (TUM); Florian Bernard (University of Bonn); Daniel Cremers (TU Munich)
View abstract The Sinkhorn operator has recently experienced a surge of popularity in computer vision and related fields. One major reason is its ease of integration into deep learning frameworks. To allow for an efficient training of respective neural networks, we propose an algorithm that obtains analytical gradients of a Sinkhorn layer via implicit differentiation. In comparison to prior work, our framework is based on the most general formulation of the Sinkhorn operator. It allows for any type of loss function, while both the target capacities and cost matrices are differentiated jointly. We further construct error bounds of the resulting algorithm for approximate inputs. Finally, we demonstrate that for a number of applications, simply replacing automatic differentiation with our algorithm directly improves the stability and accuracy of the obtained gradients. Moreover, we show that it is computationally more efficient, particularly when resources like GPU memory are scarce.
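For context, a plain (non-log-domain) Sinkhorn operator mapping a cost matrix and marginals to a transport plan looks as follows; backpropagating through this loop is the automatic-differentiation baseline that the paper's implicit differentiation replaces. The iteration count and regularization strength are illustrative values.

    import torch

    def sinkhorn(C, a, b, eps=0.1, iters=100):
        """Entropically regularized optimal transport via Sinkhorn iterations
        (no log-domain stabilization; illustrative only).

        C: (n, m) cost matrix; a: (n,) and b: (m,) marginals summing to one.
        Returns the transport plan P with row sums approx. a and column sums b.
        """
        K = torch.exp(-C / eps)
        u = torch.ones_like(a)
        v = torch.ones_like(b)
        for _ in range(iters):
            u = a / (K @ v)
            v = b / (K.t() @ u)
        return u[:, None] * K * v[None, :]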

Image Segmentation Using Text and Image Prompts
Timo Lüddecke (University of Göttingen)*; Alexander S Ecker (University of Göttingen)
View abstract Image segmentation is usually addressed by training a model for a fixed set of object classes. Incorporating additional classes or more complex queries later is expensive as it requires re-training the model on a dataset that encompasses these expressions. Here we propose a system that can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text or an image. This approach enables us to create a unified model (trained once) for three common segmentation tasks, which come with distinct challenges: referring expression segmentation, zero-shot segmentation and one-shot segmentation. We build upon the CLIP model as a backbone which we extend with a transformer-based decoder that enables dense prediction. After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query. We analyze different variants of the latter image-based prompts in detail. This novel hybrid input allows for dynamic adaptation not only to the three segmentation tasks mentioned above, but to any binary segmentation task where a text or image query can be formulated. Finally, we find our system to adapt well to generalized queries involving affordances or properties. Code is available at https://eckerlab.org/code/clipseg
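The key property exploited here is that text and image prompts live in the same CLIP embedding space, so a single decoder can be conditioned on either. The sketch below only computes such an interchangeable prompt embedding with the openly available CLIP package (assumed to be installed); it is not the CLIPSeg decoder.

    import torch
    import clip                       # OpenAI CLIP package, assumed installed
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32", device="cpu")

    def prompt_embedding(text=None, image_path=None):
        """Map a text or an image prompt into the shared CLIP embedding space."""
        with torch.no_grad():
            if text is not None:
                emb = model.encode_text(clip.tokenize([text]))
            else:
                img = preprocess(Image.open(image_path)).unsqueeze(0)
                emb = model.encode_image(img)
        return emb / emb.norm(dim=-1, keepdim=True)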

Nectar Track Posters II

Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors
Cathrin Elich (Max Planck Institute for Intelligent Systems)*; Martin R. Oswald (ETH Zurich); Marc Pollefeys (ETH Zurich / Microsoft); Joerg Stueckler (Max-Planck-Institute for Intelligent Systems)
View abstract Representing scenes at the granularity of objects is a prerequisite for scene understanding and decision making. We propose PriSMONet, a novel approach based on Prior Shape knowledge for learning Multi-Object 3D scene decomposition and representations from single images. Our approach learns to decompose images of synthetic scenes with multiple objects on a planar surface into their constituent scene objects and to infer their 3D properties from a single view. A recurrent encoder regresses a latent representation of the 3D shape, pose and texture of each object from an input RGB image. By differentiable rendering, we train our model to decompose scenes from RGB-D images in a self-supervised way. The 3D shapes are represented continuously in function space as signed distance functions which we pre-train from example shapes in a supervised way. These shape priors provide weak supervision signals to better condition the challenging overall learning task. We evaluate the accuracy of our model in inferring the 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out the benefits of the learned representation.

DiffSDFSim: Differentiable Rigid-Body Dynamics With Implicit Shapes
Michael Strecke (Max Planck Institute for Intelligent Systems)*; Joerg Stueckler (Max-Planck-Institute for Intelligent Systems)
View abstract Differentiable physics is a powerful tool in computer vision and robotics for scene understanding and reasoning about interactions. Existing approaches have frequently been limited to objects with simple shapes or shapes that are known in advance. In this paper, we propose a novel approach to differentiable physics with frictional contacts which represents object shapes implicitly using signed distance fields (SDFs). Our simulation supports contact point calculation even when the involved shapes are nonconvex. Moreover, we propose ways of differentiating the dynamics with respect to the object shape to facilitate shape optimization using gradient-based methods. In our experiments, we demonstrate that our approach allows for model-based inference of physical parameters such as friction coefficients, mass, forces or shape parameters from trajectory and depth image observations in several challenging synthetic scenarios and a real image sequence.
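The two SDF queries such a simulation builds on are the signed distance itself (negative inside an object, so it directly measures penetration) and its gradient (the contact normal). The sketch below shows them for an analytic sphere SDF with finite-difference gradients; it is a toy illustration, not the paper's contact solver.

    import numpy as np

    def sphere_sdf(p, center=np.zeros(3), radius=1.0):
        """Signed distance to a sphere: negative inside, positive outside."""
        return np.linalg.norm(p - center) - radius

    def contact_query(sdf, p, h=1e-4):
        """Penetration depth and contact normal of point p w.r.t. an SDF."""
        phi = sdf(p)
        grad = np.array([(sdf(p + h * e) - sdf(p - h * e)) / (2 * h) for e in np.eye(3)])
        normal = grad / (np.linalg.norm(grad) + 1e-12)
        penetration = max(0.0, -phi)      # > 0 only if p lies inside the object
        return penetration, normal

For example, contact_query(sphere_sdf, np.array([0.5, 0.0, 0.0])) reports a penetration depth of 0.5 with contact normal (1, 0, 0).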

HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images
Ali Athar (RWTH Aachen); Jonathon Luiten (RWTH Aachen University); Alexander Hermans (RWTH Aachen University)*; Deva Ramanan (Carnegie Mellon University); Bastian Leibe (RWTH Aachen University)
View abstract Existing state-of-the-art methods for Video Object Segmentation (VOS) learn low-level pixel-to-pixel correspondences between frames to propagate object masks across a video. This requires a large amount of densely annotated video data, which is costly to annotate and largely redundant, since frames within a video are highly correlated. In light of this, we propose HODOR: a novel method that tackles VOS by effectively leveraging annotated static images for understanding object appearance and scene context. We encode object instances and scene information from an image frame into robust high-level descriptors which can then be used to re-segment those objects in different frames. As a result, HODOR achieves state-of-the-art performance on the DAVIS and YouTube-VOS benchmarks compared to existing methods trained without video annotations. Without any architectural modification, HODOR can also learn from video context around single annotated video frames by utilizing cyclic consistency, whereas other methods rely on dense, temporally consistent annotations. Source code is available at: https://github.com/Ali2500/HODOR

AirPose: Multi-View Fusion Network for Aerial 3D Human Pose and Shape Estimation
Nitin Saini (Max Planck Institute for Intelligent Systems); Elia Bonetto (Max Planck Institute For Intelligent Systems); Eric Price (University Stuttgart)*; Aamir Ahmad (Max Planck Institute for Intelligent Systems); Michael J. Black (Max Planck Institute for Intelligent Systems)
View abstract In this letter, we present a novel markerless 3D human motion capture (MoCap) system for unstructured, outdoor environments that uses a team of autonomous unmanned aerial vehicles (UAVs) with on-board RGB cameras and computation. Existing methods are limited by calibrated cameras and off-line processing. Thus, we present the first method (AirPose) to estimate human pose and shape using images captured by multiple extrinsically uncalibrated flying cameras. AirPose itself calibrates the cameras relative to the person instead of relying on any pre-calibration. It uses distributed neural networks running on each UAV that communicate viewpoint-independent information with each other about the person (i.e., their 3D shape and articulated pose). The person’s shape and pose are parameterized using the SMPL-X body model, resulting in a compact representation that minimizes communication between the UAVs. The network is trained using synthetic images of realistic virtual environments, and fine-tuned on a small set of real images. We also introduce an optimization-based post-processing method (AirPose+) for offline applications that require higher MoCap quality. We make our method’s code and data available for research at https://github.com/robot-perception-group/AirPose. A video describing the approach and results is available at https://youtu.be/xLYe1TNHsfs.

A Benchmark and a Baseline for Robust Multi-view Depth Estimation
Philipp Schröppel (University of Freiburg)*; Jan Bechtold (Bosch Center for Artificial Intelligence); Artemij Amiranashvili (University of Freiburg); Thomas Brox (University of Freiburg)
View abstract Recent deep learning approaches for multi-view depth estimation are employed either in a depth-from-video or a multi-view stereo setting. Despite the different settings, these approaches are technically similar: they correlate multiple source views with a keyview to estimate a depth map for the keyview. In this work, we introduce the Robust Multi-View Depth Benchmark, which is built upon a set of public datasets and allows evaluation in both settings on data from different domains. We evaluate recent approaches and find that their performance is imbalanced across domains. Further, we consider a third setting, where camera poses are available and the objective is to estimate the corresponding depth maps with their correct scale. We show that recent approaches do not generalize across datasets in this setting, because their cost volume output runs out of distribution. To resolve this, we present the Robust MVD Baseline model for multi-view depth estimation, which is built upon existing components but employs a novel scale augmentation procedure. It can be applied for robust multi-view depth estimation, independent of the target data. We provide code for the proposed benchmark and baseline model at https://github.com/lmb-freiburg/robustmvd.
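The core idea behind scale augmentation is simple: multiplying the camera translations and the ground-truth depth by a common random factor keeps the geometry consistent while varying the absolute scale the cost volume has to handle. The sketch below illustrates this; the pose convention, the scale range and the name scale_augment are assumptions, and the benchmark's exact procedure is in the released code.

    import numpy as np

    rng = np.random.default_rng(0)

    def scale_augment(poses, depth, lo=0.1, hi=10.0):
        """Jointly rescale camera translations and ground-truth depth (sketch).

        poses: (V, 4, 4) camera poses; depth: (H, W) ground-truth depth map.
        """
        s = np.exp(rng.uniform(np.log(lo), np.log(hi)))   # log-uniform scale factor
        poses = poses.copy()
        poses[:, :3, 3] *= s                              # scale only the translations
        return poses, depth * s, s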