Paper1 Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
摘要原文: Manipulated videos often contain subtle inconsistencies between their visual and audio signals. We propose a video forensics method, based on anomaly detection, that can identify these inconsistencies, and that can be trained solely using real, unlabeled data. We train an autoregressive model to generate sequences of audio-visual features, using feature sets that capture the temporal synchronization between video frames and sound. At test time, we then flag videos that the model assigns low probability. Despite being trained entirely on real videos, our model obtains strong performance on the task of detecting manipulated speech videos. Project site: https://cfeng16.github.io/audio-visual-forensics.
Summary: This paper presents an anomaly-detection-based video forensics method that identifies subtle inconsistencies between the visual and audio signals of manipulated videos. An autoregressive model is trained to generate sequences of audio-visual features that capture the temporal synchronization between video frames and sound, and at test time videos to which the model assigns low probability are flagged. Although trained entirely on real videos, the model performs strongly at detecting manipulated speech videos.
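A minimal sketch of the test-time flagging rule described above, assuming a trained autoregressive model that scores a sequence of fused audio-visual features; the tiny Gaussian GRU model, the feature shapes, and the threshold below are placeholders for illustration, not the authors' released code:

```python
import torch
import torch.nn as nn

class TinyARModel(nn.Module):
    """Toy autoregressive model over audio-visual feature sequences.

    Predicts the next feature vector from the past and scores it under a
    unit-variance Gaussian; stands in for the paper's sequence model.
    """
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)

    def log_prob(self, feats):
        # feats: (B, T, D) fused audio-visual features
        h, _ = self.gru(feats[:, :-1])            # condition on the past
        pred = self.head(h)                       # predicted next features
        err = feats[:, 1:] - pred
        # log N(x; pred, I) up to a constant, summed over feature dims
        return -0.5 * (err ** 2).sum(-1)          # (B, T-1)

@torch.no_grad()
def flag_manipulated(model, feats, threshold):
    """Flag videos whose average log-probability falls below a threshold.

    In practice the threshold would be calibrated on held-out real videos.
    """
    score = model.log_prob(feats).mean(dim=1)     # per-video likelihood score
    return score < threshold                      # True = flagged as anomalous

# Usage with random stand-in features: 4 videos, 50 time steps, 64-dim features.
model = TinyARModel()
feats = torch.randn(4, 50, 64)
print(flag_manipulated(model, feats, threshold=-60.0))
```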
Paper2 LG-BPN: Local and Global Blind-Patch Network for Self-Supervised Real-World Denoising
摘要原文: Despite the significant results on synthetic noise under simplified assumptions, most self-supervised denoising methods fail under real noise due to the strong spatial noise correlation, including the advanced self-supervised blind-spot networks (BSNs). For recent methods targeting real-world denoising, they either suffer from ignoring this spatial correlation, or are limited by the destruction of fine textures for under-considering the correlation. In this paper, we present a novel method called LG-BPN for self-supervised real-world denoising, which takes the spatial correlation statistic into our network design for local detail restoration, and also brings the long-range dependencies modeling ability to previously CNN-based BSN methods. First, based on the correlation statistic, we propose a densely-sampled patch-masked convolution module. By taking more neighbor pixels with low noise correlation into account, we enable a denser local receptive field, preserving more useful information for enhanced fine structure recovery. Second, we propose a dilated Transformer block to allow distant context exploitation in BSN. This global perception addresses the intrinsic deficiency of BSN, whose receptive field is constrained by the blind spot requirement, which can not be fully resolved by the previous CNN-based BSNs. These two designs enable LG-BPN to fully exploit both the detailed structure and the global interaction in a blind manner. Extensive results on real-world datasets demonstrate the superior performance of our method. https://github.com/Wang-XIaoDingdd/LGBPN
Summary: Despite strong results on synthetic noise under simplified assumptions, most self-supervised denoising methods, including advanced self-supervised blind-spot networks (BSNs), fail on real noise because of its strong spatial correlation; recent real-world denoising methods either ignore this correlation or, by under-modeling it, destroy fine textures. The paper proposes LG-BPN, a self-supervised real-world denoising method that incorporates spatial correlation statistics into the network design for local detail restoration and brings long-range dependency modeling to previously CNN-based BSNs. First, based on the correlation statistics, a densely sampled patch-masked convolution module takes more neighboring pixels with low noise correlation into account, yielding a denser local receptive field that preserves more useful information for fine-structure recovery. Second, a dilated Transformer block enables distant context exploitation in the BSN, addressing the receptive-field limitation imposed by the blind-spot requirement that previous CNN-based BSNs could not fully resolve. Together these designs let LG-BPN exploit both detailed structure and global interaction in a blind manner; extensive results on real-world datasets demonstrate its superior performance.
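To make the blind-spot idea behind the patch-masked convolution concrete, here is a generic masked convolution in which a small central region of every kernel is zeroed out so the output never depends on the pixels being denoised; the 7x7 kernel and the 3x3 masked hole are illustrative choices, not LG-BPN's exact densely-sampled module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterMaskedConv2d(nn.Conv2d):
    """Convolution whose kernel center is masked out (blind-spot style).

    Zeroing a central hole keeps the output independent of the pixels being
    denoised; LG-BPN additionally decides which neighbors to keep based on
    measured noise correlation, which is omitted here.
    """
    def __init__(self, in_ch, out_ch, kernel_size=7, hole=3):
        super().__init__(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        mask = torch.ones_like(self.weight)
        c = kernel_size // 2
        r = hole // 2
        mask[:, :, c - r:c + r + 1, c - r:c + r + 1] = 0.0  # blind central patch
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# Usage: a noisy RGB image passes through the masked convolution.
layer = CenterMaskedConv2d(3, 16)
print(layer(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 16, 64, 64])
```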
Paper3 Object Detection With Self-Supervised Scene Adaptation
摘要原文: This paper proposes a novel method to improve the performance of a trained object detector on scenes with fixed camera perspectives based on self-supervised adaptation. Given a specific scene, the trained detector is adapted using pseudo-ground truth labels generated by the detector itself and an object tracker in a cross-teaching manner. When the camera perspective is fixed, our method can utilize the background equivariance by proposing artifact-free object mixup as a means of data augmentation, and utilize accurate background extraction as an additional input modality. We also introduce a large-scale and diverse dataset for the development and evaluation of scene-adaptive object detection. Experiments on this dataset show that our method can improve the average precision of the original detector, outperforming the previous state-of-the-art self-supervised domain adaptive object detection methods by a large margin. Our dataset and code are published at https://github.com/cvlab-stonybrook/scenes100.
Summary: This paper proposes a novel method that improves a trained object detector on scenes with fixed camera perspectives through self-supervised adaptation. For a given scene, the detector is adapted with pseudo ground-truth labels generated by the detector itself and an object tracker in a cross-teaching manner. With a fixed camera perspective, the method exploits background equivariance by introducing artifact-free object mixup as data augmentation and uses accurate background extraction as an additional input modality. The authors also introduce a large and diverse dataset for developing and evaluating scene-adaptive object detection; experiments on it show that the method improves the original detector's average precision and outperforms previous state-of-the-art self-supervised domain-adaptive detection methods by a large margin. The dataset and code are released at https://github.com/cvlab-stonybrook/scenes100.
Paper4 Learning Common Rationale To Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems
摘要原文: Self-supervised learning (SSL) strategies have demonstrated remarkable performance in various recognition tasks. However, both our preliminary investigation and recent studies suggest that they may be less effective in learning representations for fine-grained visual recognition (FGVR) since many features helpful for optimizing SSL objectives are not suitable for characterizing the subtle differences in FGVR. To overcome this issue, we propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes, dubbed as common rationales in this paper. Intuitively, common rationales tend to correspond to the discriminative patterns from the key parts of foreground objects. We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective without using any pre-trained object parts or saliency detectors, making it seamlessly to be integrated with the existing SSL process. Specifically, we fit the GradCAM with a branch with limited fitting capacity, which allows the branch to capture the common rationales and discard the less common discriminative patterns. At the test stage, the branch generates a set of spatial weights to selectively aggregate features representing an instance. Extensive experimental results on four visual tasks demonstrate that the proposed method can lead to a significant improvement in different evaluation settings.
Summary: This passage notes that self-supervised learning (SSL) strategies, despite strong performance on many recognition tasks, can be less effective for fine-grained visual recognition (FGVR), because many features that help optimize SSL objectives are unsuitable for characterizing the subtle differences in FGVR. To address this, the authors propose learning an additional screening mechanism that identifies discriminative clues commonly seen across instances and classes, called common rationales, which intuitively correspond to discriminative patterns from the key parts of foreground objects. A common-rationale detector can be learned simply from the GradCAM induced by the SSL objective, without pre-trained part or saliency detectors, so it integrates seamlessly into existing SSL pipelines. Specifically, the GradCAM is fitted with a branch of limited capacity, which captures the common rationales and discards less common discriminative patterns; at test time this branch produces spatial weights used to selectively aggregate an instance's features. Extensive experiments on four visual tasks show significant improvements under different evaluation settings.
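A rough sketch of the "fit GradCAM with a low-capacity branch, then reweight features" idea, with the GradCAM map assumed to be given and a single 1x1 convolution standing in for the limited-capacity branch; names, shapes, and the sigmoid/MSE choices are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RationaleBranch(nn.Module):
    """Low-capacity branch that regresses a GradCAM-like map from features
    and reuses its output as spatial weights for feature aggregation."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Conv2d(feat_dim, 1, kernel_size=1)  # limited capacity

    def forward(self, feats, gradcam=None):
        # feats: (B, C, H, W); gradcam: (B, 1, H, W) map induced by the SSL loss
        weights = torch.sigmoid(self.score(feats))
        fit_loss = F.mse_loss(weights, gradcam) if gradcam is not None else None
        # Weighted average pooling: aggregate features where rationales fire.
        pooled = (feats * weights).sum(dim=(2, 3)) / weights.sum(dim=(2, 3)).clamp(min=1e-6)
        return pooled, fit_loss

branch = RationaleBranch()
feats, cam = torch.randn(2, 256, 7, 7), torch.rand(2, 1, 7, 7)
pooled, loss = branch(feats, cam)
print(pooled.shape, loss.item())
```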
Paper5 Fully Self-Supervised Depth Estimation From Defocus Clue
摘要原文: Depth-from-defocus (DFD), modeling the relationship between depth and defocus pattern in images, has demonstrated promising performance in depth estimation. Recently, several self-supervised works try to overcome the difficulties in acquiring accurate depth ground-truth. However, they depend on the all-in-focus (AIF) images, which cannot be captured in real-world scenarios. Such limitation discourages the applications of DFD methods. To tackle this issue, we propose a completely self-supervised framework that estimates depth purely from a sparse focal stack. We show that our framework circumvents the needs for the depth and AIF image ground-truth, and receives superior predictions, thus closing the gap between the theoretical success of DFD works and their applications in the real world. In particular, we propose (i) a more realistic setting for DFD tasks, where no depth or AIF image ground-truth is available; (ii) a novel self-supervision framework that provides reliable predictions of depth and AIF image under the challenging setting. The proposed framework uses a neural model to predict the depth and AIF image, and utilizes an optical model to validate and refine the prediction. We verify our framework on three benchmark datasets with rendered focal stacks and real focal stacks. Qualitative and quantitative evaluations show that our method provides a strong baseline for self-supervised DFD tasks. The source code is publicly available at https://github.com/Ehzoahis/DEReD.
Summary: Depth-from-defocus (DFD), which models the relationship between depth and the defocus pattern in images, has shown promising depth-estimation performance. Recent self-supervised works try to avoid the difficulty of acquiring accurate depth ground truth, but they depend on all-in-focus (AIF) images that cannot be captured in real-world scenarios, limiting the applicability of DFD methods. To tackle this, the paper proposes a fully self-supervised framework that estimates depth purely from a sparse focal stack, removing the need for depth and AIF ground truth while producing superior predictions and thereby closing the gap between the theoretical success of DFD and its real-world applications. Concretely, it contributes (i) a more realistic DFD setting in which no depth or AIF ground truth is available, and (ii) a novel self-supervision framework that yields reliable depth and AIF predictions under this challenging setting, using a neural model to predict depth and the AIF image and an optical model to validate and refine the predictions. The framework is verified on three benchmark datasets with rendered and real focal stacks; qualitative and quantitative evaluations show that it provides a strong baseline for self-supervised DFD. The source code is publicly available at https://github.com/Ehzoahis/DEReD.
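The optical side of depth-from-defocus is usually built on the thin-lens blur model; the sketch below computes the circle-of-confusion diameter of a point at a given depth, which is the kind of differentiable forward model such a framework can use to check a depth/AIF prediction against the observed focal stack. The formula is the standard thin-lens relation, not necessarily the paper's exact renderer:

```python
def circle_of_confusion(depth_m, focus_dist_m, focal_len_m, f_number):
    """Blur-circle diameter (meters) of a point at `depth_m` for a thin lens
    focused at `focus_dist_m`, with focal length `focal_len_m` and aperture
    diameter focal_len_m / f_number."""
    aperture = focal_len_m / f_number
    return (aperture
            * abs(depth_m - focus_dist_m) / depth_m
            * focal_len_m / (focus_dist_m - focal_len_m))

# Example: 50 mm f/2 lens focused at 2 m; blur of points at several depths.
for d in (1.0, 2.0, 4.0, 8.0):
    c = circle_of_confusion(d, focus_dist_m=2.0, focal_len_m=0.05, f_number=2.0)
    print(f"depth {d:4.1f} m -> CoC {c * 1e6:7.1f} um")
```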
Paper6 StepFormer: Self-Supervised Step Discovery and Localization in Instructional Videos
摘要原文: Instructional videos are an important resource for learning procedural tasks from human demonstrations, but the instruction steps in such videos are typically short and sparse, and most of the video is irrelevant to the procedure. This motivates the need to temporally localize the instruction steps in such videos, a task known as key-step localization. Traditional key-step localization methods require video-level human annotations and therefore do not scale to large datasets. In this work, we tackle the problem without any human supervision and introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video. StepFormer is a transformer decoder that attends to the video with learnable queries and produces a sequence of slots capturing the key steps of the video. We train our system on a large dataset of instructional videos, using their automatically generated subtitles as the only source of supervision; in particular, we supervise the system with a sequence of text narrations through an order-aware loss function that filters out irrelevant phrases. We show that our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization by a large margin on three challenging benchmarks. Moreover, our model demonstrates an emergent property to solve zero-shot multi-step localization and outperforms all relevant baselines at this task.
Summary: This passage discusses the importance of instructional videos for learning procedural tasks, noting that instruction steps in such videos are typically short and sparse and that most of the content is irrelevant to the procedure, which motivates temporally localizing the steps, i.e., key-step localization. Traditional key-step localization methods require video-level human annotations and therefore do not scale to large datasets. The authors propose StepFormer, a self-supervised model that discovers and localizes instruction steps without human supervision. StepFormer is a transformer decoder that attends to the video with learnable queries and produces a sequence of slots capturing the video's key steps. The system is trained on a large dataset of instructional videos using only automatically generated subtitles as supervision, specifically a sequence of text narrations with an order-aware loss function that filters out irrelevant phrases. Results show that the model outperforms all previous unsupervised and weakly supervised step detection and localization methods by a large margin on three challenging benchmarks, and it exhibits an emergent ability to perform zero-shot multi-step localization, beating all relevant baselines on that task.
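The query-to-slots mechanism can be pictured with a few lines of standard PyTorch: learnable queries cross-attend to frame features through a transformer decoder and come out as an ordered set of step slots. The dimensions and layer counts below are arbitrary, and the narration-matching, order-aware loss is omitted:

```python
import torch
import torch.nn as nn

class StepSlotDecoder(nn.Module):
    """Learnable queries attend to video features and return K step slots."""
    def __init__(self, feat_dim=512, num_slots=32, num_layers=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_slots, feat_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, video_feats):
        # video_feats: (B, T, D) per-frame features from a video backbone
        B = video_feats.size(0)
        tgt = self.queries.unsqueeze(0).expand(B, -1, -1)   # (B, K, D)
        slots = self.decoder(tgt=tgt, memory=video_feats)   # cross-attend to frames
        return slots                                        # (B, K, D) step slots

# Usage: 2 videos, 200 frames, 512-dim features -> 32 candidate step slots each.
decoder = StepSlotDecoder()
print(decoder(torch.randn(2, 200, 512)).shape)  # torch.Size([2, 32, 512])
```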
Paper7 Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning
摘要原文: Self-supervised learning (SSL) has made remarkable progress in visual representation learning. Some studies combine SSL with knowledge distillation (SSL-KD) to boost the representation learning performance of small models. In this study, we propose a Multi-mode Online Knowledge Distillation method (MOKD) to boost self-supervised visual representation learning. Different from existing SSL-KD methods that transfer knowledge from a static pre-trained teacher to a student, in MOKD, two different models learn collaboratively in a self-supervised manner. Specifically, MOKD consists of two distillation modes: self-distillation and cross-distillation modes. Among them, self-distillation performs self-supervised learning for each model independently, while cross-distillation realizes knowledge interaction between different models. In cross-distillation, a cross-attention feature search strategy is proposed to enhance the semantic feature alignment between different models. As a result, the two models can absorb knowledge from each other to boost their representation learning performance. Extensive experimental results on different backbones and datasets demonstrate that two heterogeneous models can benefit from MOKD and outperform their independently trained baseline. In addition, MOKD also outperforms existing SSL-KD methods for both the student and teacher models.
Summary: This work addresses self-supervised learning (SSL) for visual representation learning, where some studies combine SSL with knowledge distillation (SSL-KD) to boost the representations of small models. The authors propose Multi-mode Online Knowledge Distillation (MOKD). Unlike existing SSL-KD methods that transfer knowledge from a static pre-trained teacher to a student, in MOKD two different models learn collaboratively in a self-supervised manner through two distillation modes: self-distillation, in which each model performs SSL independently, and cross-distillation, which enables knowledge interaction between the models. In cross-distillation, a cross-attention feature search strategy strengthens semantic feature alignment between the models, so the two models absorb knowledge from each other and improve their representations. Extensive experiments across backbones and datasets show that two heterogeneous models both benefit from MOKD and outperform their independently trained baselines, and that MOKD also surpasses existing SSL-KD methods for both the student and the teacher model.
Paper8 Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
摘要原文: Self-supervised monocular depth estimation that does not require ground truth for training has attracted attention in recent years. It is of high interest to design lightweight but effective models so that they can be deployed on edge devices. Many existing architectures benefit from using heavier backbones at the expense of model sizes. This paper achieves comparable results with a lightweight architecture. Specifically, the efficient combination of CNNs and Transformers is investigated, and a hybrid architecture called Lite-Mono is presented. A Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module are proposed. The former is used to extract rich multi-scale local features, and the latter takes advantage of the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that Lite-Mono outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters. Our codes and models are available at https://github.com/noahzn/Lite-Mono.
Summary: This paper addresses self-supervised monocular depth estimation, which requires no ground truth for training and has attracted growing attention, focusing on lightweight yet effective models that can be deployed on edge devices. Many existing architectures gain accuracy from heavier backbones at the cost of model size. The paper presents Lite-Mono, a lightweight hybrid architecture that achieves comparable results by efficiently combining CNNs and Transformers. It introduces a Consecutive Dilated Convolutions (CDC) module to extract rich multi-scale local features and a Local-Global Features Interaction (LGFI) module that uses self-attention to encode long-range global information into the features. Experiments show that Lite-Mono outperforms Monodepth2 by a large margin in accuracy with about 80% fewer trainable parameters. Code and models are available at https://github.com/noahzn/Lite-Mono.
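A minimal sketch of a consecutive-dilated-convolutions block: a few depthwise 3x3 convolutions with increasing dilation applied back-to-back so the receptive field grows cheaply, followed by a pointwise fusion. The dilation schedule, residuals, and channel handling here are illustrative, not Lite-Mono's exact CDC design:

```python
import torch
import torch.nn as nn

class ConsecutiveDilatedConvs(nn.Module):
    """Stack of depthwise 3x3 convolutions with growing dilation rates,
    followed by a pointwise mixing convolution."""
    def __init__(self, channels=64, dilations=(1, 2, 3)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels, bias=False),   # depthwise, dilated
                nn.BatchNorm2d(channels),
                nn.GELU(),
            )
            for d in dilations
        ])
        self.mix = nn.Conv2d(channels, channels, 1)        # pointwise fusion

    def forward(self, x):
        out = x
        for block in self.blocks:
            out = block(out) + out                         # residual per conv
        return self.mix(out)

print(ConsecutiveDilatedConvs()(torch.randn(1, 64, 48, 160)).shape)
```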
Paper9 Beyond Appearance: A Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks
摘要原文: Human-centric visual tasks have attracted increasing research attention due to their widespread applications. In this paper, we aim to learn a general human representation from massive unlabeled human images which can benefit downstream human-centric tasks to the maximum extent. We call this method SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework. Unlike the existing self-supervised learning methods, prior knowledge from human images is utilized in SOLIDER to build pseudo semantic labels and import more semantic information into the learned representation. Meanwhile, we note that different downstream tasks always require different ratios of semantic information and appearance information. For example, human parsing requires more semantic information, while person re-identification needs more appearance information for identification purpose. So a single learned representation cannot fit for all requirements. To solve this problem, SOLIDER introduces a conditional network with a semantic controller. After the model is trained, users can send values to the controller to produce representations with different ratios of semantic information, which can fit different needs of downstream tasks. Finally, SOLIDER is verified on six downstream human-centric visual tasks. It outperforms state of the arts and builds new baselines for these tasks. The code is released in https://github.com/tinyvision/SOLIDER.
Summary: This paper targets human-centric visual tasks, which have drawn increasing research attention because of their wide applications. The authors aim to learn a general human representation from massive unlabeled human images that benefits downstream human-centric tasks as much as possible, and propose SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework. Unlike existing self-supervised methods, SOLIDER exploits prior knowledge in human images to build pseudo semantic labels and inject more semantic information into the learned representation. The authors further note that different downstream tasks require different ratios of semantic and appearance information; for example, human parsing needs more semantic information, while person re-identification needs more appearance information for identification, so a single fixed representation cannot satisfy all needs. SOLIDER therefore introduces a conditional network with a semantic controller: after training, users can feed values to the controller to produce representations with different semantic ratios that fit different downstream tasks. SOLIDER is verified on six downstream human-centric visual tasks, where it outperforms the state of the art and establishes new baselines. The code is released at https://github.com/tinyvision/SOLIDER.
Paper10 Coreset Sampling From Open-Set for Fine-Grained Self-Supervised Learning
摘要原文: Deep learning in general domains has constantly been extended to domain-specific tasks requiring the recognition of fine-grained characteristics. However, real-world applications for fine-grained tasks suffer from two challenges: a high reliance on expert knowledge for annotation and necessity of a versatile model for various downstream tasks in a specific domain (e.g., prediction of categories, bounding boxes, or pixel-wise annotations). Fortunately, the recent self-supervised learning (SSL) is a promising approach to pretrain a model without annotations, serving as an effective initialization for any downstream tasks. Since SSL does not rely on the presence of annotation, in general, it utilizes the large-scale unlabeled dataset, referred to as an open-set. In this sense, we introduce a novel Open-Set Self-Supervised Learning problem under the assumption that a large-scale unlabeled open-set is available, as well as the fine-grained target dataset, during a pretraining phase. In our problem setup, it is crucial to consider the distribution mismatch between the open-set and target dataset. Hence, we propose SimCore algorithm to sample a coreset, the subset of an open-set that has a minimum distance to the target dataset in the latent space. We demonstrate that SimCore significantly improves representation learning performance through extensive experimental settings, including eleven fine-grained datasets and seven open-sets in various downstream tasks.
Summary: This passage discusses how deep learning in general domains keeps being extended to domain-specific tasks that require recognizing fine-grained characteristics. Real-world fine-grained applications face two challenges: heavy reliance on expert knowledge for annotation, and the need for a versatile model that serves various downstream tasks in a specific domain (e.g., predicting categories, bounding boxes, or pixel-wise annotations). Self-supervised learning (SSL) is a promising way to pretrain a model without annotations as an effective initialization for any downstream task; since SSL does not depend on annotations, it typically uses a large-scale unlabeled dataset, referred to as an open-set. The authors introduce a novel Open-Set Self-Supervised Learning problem in which a large-scale unlabeled open-set is available alongside the fine-grained target dataset during pretraining. In this setting the distribution mismatch between the open-set and the target dataset is crucial, so they propose the SimCore algorithm to sample a coreset, the subset of the open-set with minimum distance to the target dataset in the latent space. Extensive experiments with eleven fine-grained datasets and seven open-sets across various downstream tasks show that SimCore significantly improves representation learning performance.
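At its simplest, the coreset selection step amounts to ranking open-set samples by their distance to the target set in feature space and keeping the nearest ones; the sketch below does exactly that with precomputed embeddings. The nearest-neighbor distance criterion and fixed budget are simplifications of SimCore's actual sampling procedure:

```python
import torch
import torch.nn.functional as F

def sample_coreset(openset_feats, target_feats, budget):
    """Return indices of the `budget` open-set samples closest to the target
    dataset in the (shared) latent space.

    openset_feats: (N, D) embeddings of the unlabeled open-set
    target_feats:  (M, D) embeddings of the fine-grained target dataset
    """
    openset_feats = F.normalize(openset_feats, dim=1)
    target_feats = F.normalize(target_feats, dim=1)
    # Distance of each open-set sample to its nearest target sample.
    dists = torch.cdist(openset_feats, target_feats)        # (N, M)
    nearest = dists.min(dim=1).values                        # (N,)
    return torch.topk(nearest, k=budget, largest=False).indices

# Usage: pick 500 of 20,000 open-set images closest to 2,000 target images.
open_emb, tgt_emb = torch.randn(20_000, 128), torch.randn(2_000, 128)
coreset_idx = sample_coreset(open_emb, tgt_emb, budget=500)
print(coreset_idx.shape)  # torch.Size([500])
```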
Paper11 Self-Supervised Super-Plane for Neural 3D Reconstruction
摘要原文: Neural implicit surface representation methods show impressive reconstruction results but struggle to handle texture-less planar regions that widely exist in indoor scenes. Existing approaches addressing this leverage image prior that requires assistive networks trained with large-scale annotated datasets. In this work, we introduce a self-supervised super-plane constraint by exploring the free geometry cues from the predicted surface, which can further regularize the reconstruction of plane regions without any other ground truth annotations. Specifically, we introduce an iterative training scheme, where (i) grouping of pixels to formulate a super-plane (analogous to super-pixels), and (ii) optimizing of the scene reconstruction network via a super-plane constraint, are progressively conducted. We demonstrate that the model trained with super-planes surprisingly outperforms the one using conventional annotated planes, as individual super-plane statistically occupies a larger area and leads to more stable training. Extensive experiments show that our self-supervised super-plane constraint significantly improves 3D reconstruction quality even better than using ground truth plane segmentation. Additionally, the plane reconstruction results from our model can be used for auto-labeling for other vision tasks. The code and models are available at https://github.com/botaoye/S3PRecon.
Summary: This passage discusses how neural implicit surface representations achieve impressive reconstruction results but struggle with the texture-less planar regions common in indoor scenes. Existing approaches rely on image priors that require assistive networks trained on large-scale annotated datasets. This work introduces a self-supervised super-plane constraint that exploits free geometric cues from the predicted surface to further regularize the reconstruction of planar regions without any additional ground-truth annotations. Specifically, an iterative training scheme alternates between grouping pixels into super-planes (analogous to super-pixels) and optimizing the scene reconstruction network with a super-plane constraint. The model trained with super-planes surprisingly outperforms one using conventional annotated planes, because an individual super-plane statistically covers a larger area and leads to more stable training. Extensive experiments show that the self-supervised super-plane constraint significantly improves 3D reconstruction quality, even beyond using ground-truth plane segmentation, and the plane reconstruction results can be used for auto-labeling in other vision tasks. Code and models are available at https://github.com/botaoye/S3PRecon.
Paper12 Self-Supervised 3D Scene Flow Estimation Guided by Superpoints
摘要原文: 3D scene flow estimation aims to estimate point-wise motions between two consecutive frames of point clouds. Superpoints, i.e., points with similar geometric features, are usually employed to capture similar motions of local regions in 3D scenes for scene flow estimation. However, in existing methods, superpoints are generated with the offline clustering methods, which cannot characterize local regions with similar motions for complex 3D scenes well, leading to inaccurate scene flow estimation. To this end, we propose an iterative end-to-end superpoint based scene flow estimation framework, where the superpoints can be dynamically updated to guide the point-level flow prediction. Specifically, our framework consists of a flow guided superpoint generation module and a superpoint guided flow refinement module. In our superpoint generation module, we utilize the bidirectional flow information at the previous iteration to obtain the matching points of points and superpoint centers for soft point-to-superpoint association construction, in which the superpoints are generated for pairwise point clouds. With the generated superpoints, we first reconstruct the flow for each point by adaptively aggregating the superpoint-level flow, and then encode the consistency between the reconstructed flow of pairwise point clouds. Finally, we feed the consistency encoding along with the reconstructed flow into GRU to refine point-level flow. Extensive experiments on several different datasets show that our method can achieve promising performance.
Summary: 3D scene flow estimation aims to estimate point-wise motion between two consecutive point cloud frames. Superpoints, i.e., points with similar geometric features, are commonly used to capture similar motions of local regions for scene flow estimation. In existing methods, however, superpoints are generated by offline clustering, which cannot characterize local regions with similar motions well in complex 3D scenes and thus leads to inaccurate scene flow. The authors propose an iterative end-to-end superpoint-based scene flow estimation framework in which the superpoints are dynamically updated to guide point-level flow prediction. The framework consists of a flow-guided superpoint generation module and a superpoint-guided flow refinement module. In the superpoint generation module, the bidirectional flow from the previous iteration is used to find matching points between points and superpoint centers, building soft point-to-superpoint associations from which superpoints are generated for the paired point clouds. With the generated superpoints, the flow of each point is first reconstructed by adaptively aggregating superpoint-level flow, then the consistency between the reconstructed flows of the paired point clouds is encoded, and finally the consistency encoding and the reconstructed flow are fed into a GRU to refine the point-level flow. Extensive experiments on several datasets show that the method achieves promising performance.
Paper13 Semi-Supervised Learning Made Simple With Self-Supervised Clustering
摘要原文: Self-supervised learning models have been shown to learn rich visual representations without requiring human annotations. However, in many real-world scenarios, labels are partially available, motivating a recent line of work on semi-supervised methods inspired by self-supervised principles. In this paper, we propose a conceptually simple yet empirically powerful approach to turn clustering-based self-supervised methods such as SwAV or DINO into semi-supervised learners. More precisely, we introduce a multi-task framework merging a supervised objective using ground-truth labels and a self-supervised objective relying on clustering assignments with a single cross-entropy loss. This approach may be interpreted as imposing the cluster centroids to be class prototypes. Despite its simplicity, we provide empirical evidence that our approach is highly effective and achieves state-of-the-art performance on CIFAR100 and ImageNet.
Summary: This passage discusses how self-supervised learning models can learn rich visual representations without human annotations, while in many real-world scenarios labels are partially available, motivating semi-supervised methods inspired by self-supervised principles. The paper proposes a conceptually simple yet empirically powerful approach that turns clustering-based self-supervised methods such as SwAV or DINO into semi-supervised learners. Specifically, it introduces a multi-task framework that merges a supervised objective using ground-truth labels and a self-supervised objective relying on clustering assignments through a single cross-entropy loss, which can be interpreted as forcing the cluster centroids to be class prototypes. Despite its simplicity, the approach is highly effective and achieves state-of-the-art performance on CIFAR100 and ImageNet.
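The "one cross-entropy for both objectives" idea can be sketched directly: a single prototype layer produces logits, labeled samples are trained against their ground-truth class, and unlabeled samples against a pseudo-assignment (here simply the argmax from another augmented view), so the cluster centroids double as class prototypes. This is a simplified sketch, not the exact SwAV/DINO-style assignment used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prototype_loss(z_labeled, y, z_view1, z_view2, prototypes, temp=0.1):
    """Single cross-entropy over class prototypes for labeled and unlabeled data.

    z_*: embeddings to be L2-normalized; prototypes: (num_classes, D) learnable.
    Unlabeled targets are pseudo-assignments from the other augmented view,
    a stand-in for the clustering assignment of SwAV/DINO-style methods.
    """
    protos = F.normalize(prototypes, dim=1)
    logits = lambda z: F.normalize(z, dim=1) @ protos.t() / temp
    sup = F.cross_entropy(logits(z_labeled), y)              # ground-truth labels
    with torch.no_grad():
        pseudo = logits(z_view2).argmax(dim=1)               # assignment from view 2
    unsup = F.cross_entropy(logits(z_view1), pseudo)         # consistency for view 1
    return sup + unsup

prototypes = nn.Parameter(torch.randn(100, 128))
loss = prototype_loss(torch.randn(8, 128), torch.randint(0, 100, (8,)),
                      torch.randn(16, 128), torch.randn(16, 128), prototypes)
print(loss.item())
```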
Paper14 Evolved Part Masking for Self-Supervised Learning
摘要原文: Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those patterns resort to different criteria to mask local regions, sticking to a fixed pattern leads to limited vision cues modeling capability. This paper proposes an evolved part-based masking to pursue more general visual cues modeling in self-supervised learning. Our method is based on an adaptive part partition module, which leverages the vision model being trained to construct a part graph, and partitions parts with graph cut. The accuracy of partitioned parts is on par with the capability of the pre-trained model, leading to evolved mask patterns at different training stages. It generates simple patterns at the initial training stage to learn low-level visual cues, which hence evolves to eliminate accurate object parts to reinforce the learning of object semantics and contexts. Our method does not require extra pre-trained models or annotations, and effectively ensures the training efficiency by evolving the training difficulty. Experiment results show that it substantially boosts the performance on various tasks including image classification, object detection, and semantic segmentation. For example, it outperforms the recent MAE by 0.69% on imageNet-1K classification and 1.61% on ADE20K segmentation with the same training epochs.
Summary: Existing masked image modeling methods apply fixed mask patterns to guide self-supervised training; because such patterns follow different criteria for masking local regions, sticking to a single fixed pattern limits the visual cues the model can learn. This paper proposes an evolved part-based masking scheme to pursue more general visual cue modeling in self-supervised learning. The method is built on an adaptive part-partition module that uses the vision model being trained to construct a part graph and partitions parts with graph cut. The accuracy of the partitioned parts keeps pace with the capability of the model under training, yielding mask patterns that evolve across training stages: simple patterns at the beginning to learn low-level visual cues, evolving toward masking accurate object parts to reinforce the learning of object semantics and context. The method requires no extra pre-trained models or annotations and maintains training efficiency by evolving the training difficulty. Experiments show substantial gains on image classification, object detection, and semantic segmentation; for example, with the same number of training epochs it outperforms the recent MAE by 0.69% on ImageNet-1K classification and 1.61% on ADE20K segmentation.
Paper15 BKinD-3D: Self-Supervised 3D Keypoint Discovery From Multi-View Videos
摘要原文: Quantifying motion in 3D is important for studying the behavior of humans and other animals, but manual pose annotations are expensive and time-consuming to obtain. Self-supervised keypoint discovery is a promising strategy for estimating 3D poses without annotations. However, current keypoint discovery approaches commonly process single 2D views and do not operate in the 3D space. We propose a new method to perform self-supervised keypoint discovery in 3D from multi-view videos of behaving agents, without any keypoint or bounding box supervision in 2D or 3D. Our method, BKinD-3D, uses an encoder-decoder architecture with a 3D volumetric heatmap, trained to reconstruct spatiotemporal differences across multiple views, in addition to joint length constraints on a learned 3D skeleton of the subject. In this way, we discover keypoints without requiring manual supervision in videos of humans and rats, demonstrating the potential of 3D keypoint discovery for studying behavior.
Summary: This passage discusses the importance of quantifying 3D motion for studying the behavior of humans and other animals, where manual pose annotation is expensive and time-consuming. Self-supervised keypoint discovery is a promising strategy for estimating 3D poses without annotations, but current approaches typically process single 2D views and do not operate in 3D space. The authors propose a new method, BKinD-3D, for self-supervised 3D keypoint discovery from multi-view videos of behaving agents, without any 2D or 3D keypoint or bounding-box supervision. BKinD-3D uses an encoder-decoder architecture with a 3D volumetric heatmap, trained to reconstruct spatiotemporal differences across multiple views, together with joint-length constraints on a learned 3D skeleton of the subject. In this way, keypoints are discovered without manual supervision in videos of humans and rats, demonstrating the potential of 3D keypoint discovery for behavior studies.
Paper16 Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence
摘要原文: Next generation cellular networks will implement radio sensing functions alongside customary communications, thereby enabling unprecedented worldwide sensing coverage outdoors. Deep learning has revolutionised computer vision but has had limited application to radio perception tasks, in part due to lack of systematic datasets and benchmarks dedicated to the study of the performance and promise of radio sensing. To address this gap, we present MaxRay: a synthetic radio-visual dataset and benchmark that facilitate precise target localisation in radio. We further propose to learn to localise targets in radio without supervision by extracting self-coordinates from radio-visual correspondence. We use such self-supervised coordinates to train a radio localiser network. We characterise our performance against a number of state-of-the-art baselines. Our results indicate that accurate radio target localisation can be automatically learned from paired radio-visual data without labels, which is important for empirical data. This opens the door for vast data scalability and may prove key to realising the promise of robust radio sensing atop a unified communication-perception cellular infrastructure. Dataset will be hosted on IEEE DataPort.
Summary: This passage discusses how next-generation cellular networks will implement radio sensing alongside conventional communication, enabling unprecedented worldwide outdoor sensing coverage. Deep learning has revolutionised computer vision but has seen limited use in radio perception, partly because of the lack of systematic datasets and benchmarks dedicated to studying the performance and promise of radio sensing. To fill this gap, the authors present MaxRay, a synthetic radio-visual dataset and benchmark for precise target localisation in radio. They further propose learning to localise targets in radio without supervision by extracting self-coordinates from radio-visual correspondence, and use these self-supervised coordinates to train a radio localiser network, characterising its performance against a number of state-of-the-art baselines. The results indicate that accurate radio target localisation can be learned automatically from paired radio-visual data without labels, which matters for empirical data. This opens the door to vast data scalability and may prove key to realising robust radio sensing atop a unified communication-perception cellular infrastructure. The dataset will be hosted on IEEE DataPort.
Paper17 Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning
摘要原文: We propose universally slimmable self-supervised learning (dubbed as US3L) to achieve better accuracy-efficiency trade-offs for deploying self-supervised models across different devices. We observe that direct adaptation of self-supervised learning (SSL) to universally slimmable networks misbehaves as the training process frequently collapses. We then discover that temporal consistent guidance is the key to the success of SSL for universally slimmable networks, and we propose three guidelines for the loss design to ensure this temporal consistency from a unified gradient perspective. Moreover, we propose dynamic sampling and group regularization strategies to simultaneously improve training efficiency and accuracy. Our US3L method has been empirically validated on both convolutional neural networks and vision transformers. With only once training and one copy of weights, our method outperforms various state-of-the-art methods (individually trained or not) on benchmarks including recognition, object detection and instance segmentation.
Summary: This passage introduces US3L, a universally slimmable self-supervised learning method aimed at better accuracy-efficiency trade-offs when deploying self-supervised models across different devices. The authors observe that directly applying self-supervised learning (SSL) to universally slimmable networks misbehaves, with training frequently collapsing. They find that temporally consistent guidance is key to making SSL work for universally slimmable networks, and propose three loss-design guidelines that ensure this temporal consistency from a unified gradient perspective. They also propose dynamic sampling and group regularization strategies that improve training efficiency and accuracy simultaneously. US3L is empirically validated on both convolutional neural networks and Vision Transformers; with only one training run and a single copy of weights, it outperforms various state-of-the-art methods (individually trained or not) on benchmarks including recognition, object detection, and instance segmentation.
Paper18 SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow
摘要原文: Scene flow estimation is a long-standing problem in computer vision, where the goal is to find the 3D motion of a scene from its consecutive observations. Recently, there have been efforts to compute the scene flow from 3D point clouds. A common approach is to train a regression model that consumes source and target point clouds and outputs the per-point translation vector. An alternative is to learn point matches between the point clouds concurrently with regressing a refinement of the initial correspondence flow. In both cases, the learning task is very challenging since the flow regression is done in the free 3D space, and a typical solution is to resort to a large annotated synthetic dataset. We introduce SCOOP, a new method for scene flow estimation that can be learned on a small amount of data without employing ground-truth flow supervision. In contrast to previous work, we train a pure correspondence model focused on learning point feature representation and initialize the flow as the difference between a source point and its softly corresponding target point. Then, in the run-time phase, we directly optimize a flow refinement component with a self-supervised objective, which leads to a coherent and accurate flow field between the point clouds. Experiments on widespread datasets demonstrate the performance gains achieved by our method compared to existing leading techniques while using a fraction of the training data. Our code is publicly available.
Summary: This passage discusses scene flow estimation, a long-standing computer vision problem whose goal is to recover the 3D motion of a scene from consecutive observations; recent efforts compute scene flow from 3D point clouds. A common approach trains a regression model that consumes source and target point clouds and outputs a per-point translation vector; an alternative learns point matches between the clouds while regressing a refinement of the initial correspondence flow. In both cases learning is very challenging because flow is regressed in free 3D space, and the typical remedy is a large annotated synthetic dataset. The authors introduce SCOOP, a new scene flow method that can be learned from a small amount of data without ground-truth flow supervision. Unlike previous work, they train a pure correspondence model focused on learning point feature representations and initialize the flow as the difference between a source point and its softly corresponding target point. At run time, a flow refinement component is directly optimized with a self-supervised objective, yielding a coherent and accurate flow field between the point clouds. Experiments on widely used datasets demonstrate performance gains over existing leading techniques while using only a fraction of the training data. The code is publicly available.
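The flow initialization described above, i.e., the difference between a source point and its softly corresponding target point, takes a few lines once point features are available; the sketch below uses a plain softmax over feature similarities, leaving out SCOOP's confidence weighting and run-time refinement:

```python
import torch
import torch.nn.functional as F

def init_flow_from_correspondence(src_xyz, tgt_xyz, src_feat, tgt_feat, temp=0.03):
    """Initial scene flow as (soft-matched target point) - (source point).

    src_xyz/tgt_xyz:   (N, 3) / (M, 3) point coordinates
    src_feat/tgt_feat: (N, D) / (M, D) learned point features
    """
    sim = F.normalize(src_feat, dim=1) @ F.normalize(tgt_feat, dim=1).t()  # (N, M)
    weights = torch.softmax(sim / temp, dim=1)   # soft correspondence per source point
    matched = weights @ tgt_xyz                  # softly corresponding target location
    return matched - src_xyz                     # per-point translation vector

flow = init_flow_from_correspondence(torch.randn(2048, 3), torch.randn(2048, 3),
                                     torch.randn(2048, 64), torch.randn(2048, 64))
print(flow.shape)  # torch.Size([2048, 3])
```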
Paper19 PlaneDepth: Self-Supervised Depth Estimation via Orthogonal Planes
摘要原文: Multiple near frontal-parallel planes based depth representation demonstrated impressive results in self-supervised monocular depth estimation (MDE). Whereas, such a representation would cause the discontinuity of the ground as it is perpendicular to the frontal-parallel planes, which is detrimental to the identification of drivable space in autonomous driving. In this paper, we propose the PlaneDepth, a novel orthogonal planes based presentation, including vertical planes and ground planes. PlaneDepth estimates the depth distribution using a Laplacian Mixture Model based on orthogonal planes for an input image. These planes are used to synthesize a reference view to provide the self-supervision signal. Further, we find that the widely used resizing and cropping data augmentation breaks the orthogonality assumptions, leading to inferior plane predictions. We address this problem by explicitly constructing the resizing cropping transformation to rectify the predefined planes and predicted camera pose. Moreover, we propose an augmented self-distillation loss supervised with a bilateral occlusion mask to boost the robustness of orthogonal planes representation for occlusions. Thanks to our orthogonal planes representation, we can extract the ground plane in an unsupervised manner, which is important for autonomous driving. Extensive experiments on the KITTI dataset demonstrate the effectiveness and efficiency of our method. The code is available at https://github.com/svip-lab/PlaneDepth.
Summary: This paper proposes PlaneDepth, an orthogonal-planes representation consisting of vertical planes and ground planes for self-supervised monocular depth estimation. PlaneDepth estimates the depth distribution of an input image with a Laplacian Mixture Model defined over orthogonal planes, and these planes are used to synthesize a reference view that provides the self-supervision signal. The authors further find that the widely used resizing and cropping augmentation breaks the orthogonality assumptions and degrades plane prediction, and they address this by explicitly constructing the resize-crop transformation to rectify the predefined planes and the predicted camera pose. They also propose an augmented self-distillation loss supervised with a bilateral occlusion mask to make the orthogonal-plane representation more robust to occlusions. Thanks to the orthogonal-plane representation, the ground plane can be extracted in an unsupervised manner, which is important for autonomous driving. Extensive experiments on the KITTI dataset demonstrate the effectiveness and efficiency of the method. The code is available at https://github.com/svip-lab/PlaneDepth.
Paper20 HaLP: Hallucinating Latent Positives for Skeleton-Based Self-Supervised Learning of Actions
摘要原文: Supervised learning of skeleton sequence encoders for action recognition has received significant attention in recent times. However, learning such encoders without labels continues to be a challenging problem. While prior works have shown promising results by applying contrastive learning to pose sequences, the quality of the learned representations is often observed to be closely tied to data augmentations that are used to craft the positives. However, augmenting pose sequences is a difficult task as the geometric constraints among the skeleton joints need to be enforced to make the augmentations realistic for that action. In this work, we propose a new contrastive learning approach to train models for skeleton-based action recognition without labels. Our key contribution is a simple module, HaLP – to Hallucinate Latent Positives for contrastive learning. Specifically, HaLP explores the latent space of poses in suitable directions to generate new positives. To this end, we present a novel optimization formulation to solve for the synthetic positives with an explicit control on their hardness. We propose approximations to the objective, making them solvable in closed form with minimal overhead. We show via experiments that using these generated positives within a standard contrastive learning framework leads to consistent improvements across benchmarks such as NTU-60, NTU-120, and PKU-II on tasks like linear evaluation, transfer learning, and kNN evaluation. Our code can be found at https://github.com/anshulbshah/HaLP.
Summary: Supervised learning of skeleton-sequence encoders for action recognition has recently received significant attention, but learning such encoders without labels remains challenging. Prior work shows promising results by applying contrastive learning to pose sequences, yet the quality of the learned representations is closely tied to the data augmentations used to craft positives, and augmenting pose sequences is difficult because the geometric constraints among skeleton joints must be respected for the augmentation to remain realistic for the action. This work proposes a new contrastive learning approach for training skeleton-based action recognition models without labels. The key contribution is a simple module, HaLP, which Hallucinates Latent Positives for contrastive learning by exploring the latent space of poses in suitable directions to generate new positives. The authors present a novel optimization formulation that solves for these synthetic positives with explicit control over their hardness, along with approximations that admit closed-form solutions with minimal overhead. Experiments show that using the generated positives within a standard contrastive learning framework yields consistent improvements on benchmarks such as NTU-60, NTU-120, and PKU-II across linear evaluation, transfer learning, and kNN evaluation. The code is available at https://github.com/anshulbshah/HaLP.
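The core "hallucinate positives in latent space" operation can be pictured as moving an anchor embedding a controlled distance toward a prototype on the unit sphere; the sketch below uses simple convex mixing plus renormalization with a hardness knob, which is a simplification of the paper's closed-form optimization:

```python
import torch
import torch.nn.functional as F

def hallucinate_positive(anchor, prototype, hardness=0.5):
    """Generate a synthetic positive by moving `anchor` toward `prototype`.

    anchor, prototype: (B, D) embeddings, normalized inside the function.
    hardness in [0, 1]: 0 returns the anchor itself, larger values move the
    positive further away (a harder positive), here via convex mixing.
    """
    anchor = F.normalize(anchor, dim=1)
    prototype = F.normalize(prototype, dim=1)
    mixed = (1.0 - hardness) * anchor + hardness * prototype
    return F.normalize(mixed, dim=1)             # keep positives on the hypersphere

# Usage: positives at two hardness levels for a batch of skeleton embeddings.
z = F.normalize(torch.randn(32, 256), dim=1)
proto = F.normalize(torch.randn(32, 256), dim=1)
easy, hard = hallucinate_positive(z, proto, 0.2), hallucinate_positive(z, proto, 0.7)
print(F.cosine_similarity(z, easy).mean(), F.cosine_similarity(z, hard).mean())
```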
Paper21 Mixed Autoencoder for Self-Supervised Visual Representation Learning
摘要原文: Masked Autoencoder (MAE) has demonstrated superior performance on various vision tasks via randomly masking image patches and reconstruction. However, effective data augmentation strategies for MAE still remain open questions, different from those in contrastive learning that serve as the most important part. This paper studies the prevailing mixing augmentation for MAE. We first demonstrate that naive mixing will in contrast degenerate model performance due to the increase of mutual information (MI). To address, we propose homologous recognition, an auxiliary pretext task, not only to alleviate the MI increasement by explicitly requiring each patch to recognize homologous patches, but also to perform object-aware self-supervised pre-training for better downstream dense perception performance. With extensive experiments, we demonstrate that our proposed Mixed Autoencoder (MixedAE) achieves the state-of-the-art transfer results among masked image modeling (MIM) augmentations on different downstream tasks with significant efficiency. Specifically, our MixedAE outperforms MAE by +0.3% accuracy, +1.7 mIoU and +0.9 AP on ImageNet-1K, ADE20K and COCO respectively with a standard ViT-Base. Moreover, MixedAE surpasses iBOT, a strong MIM method combined with instance discrimination, while accelerating training by 2x. To our best knowledge, this is the very first work to consider mixing for MIM from the perspective of pretext task design. Code will be made available.
Summary: This paper studies Masked Autoencoders (MAE), which achieve strong performance on various vision tasks by randomly masking image patches and reconstructing them, noting that effective data augmentation strategies for MAE remain an open question, unlike in contrastive learning where augmentation is central. The paper examines the popular mixing augmentation for MAE and first shows that naive mixing actually degrades performance because it increases mutual information (MI). To address this, the authors propose homologous recognition, an auxiliary pretext task that alleviates the MI increase by explicitly requiring each patch to recognize its homologous patches, and that also provides object-aware self-supervised pre-training for better downstream dense perception. Extensive experiments show that the proposed Mixed Autoencoder (MixedAE) achieves state-of-the-art transfer results among masked image modeling (MIM) augmentations on different downstream tasks with notable efficiency: with a standard ViT-Base, MixedAE outperforms MAE by +0.3% accuracy on ImageNet-1K, +1.7 mIoU on ADE20K, and +0.9 AP on COCO, and it surpasses iBOT, a strong MIM method combined with instance discrimination, while accelerating training by 2x. To the authors' knowledge this is the first work to consider mixing for MIM from the perspective of pretext-task design. Code will be made available.
Paper22 Vid2Avatar: 3D Avatar Reconstruction From Videos in the Wild via Self-Supervised Scene Decomposition
摘要原文: We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult. Solving it requires accurately separating humans from arbitrary backgrounds. Moreover, it requires reconstructing detailed 3D surface from short video sequences, making it even more challenging. Despite these challenges, our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly, parameterized via two separate neural fields. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives are introduced for a clean separation of dynamic human and static background, yielding detailed and robust 3D human reconstructions. The evaluation of our method shows improvements over prior art on publicly available datasets.
Summary: This paper presents Vid2Avatar, a method for learning human avatars from monocular in-the-wild videos. Reconstructing naturally moving humans from such videos is difficult: it requires accurately separating the human from arbitrary backgrounds and reconstructing detailed 3D surfaces from short video sequences. Despite these challenges, the method needs no ground-truth supervision, no priors extracted from large datasets of clothed human scans, and no external segmentation modules. Instead, it solves scene decomposition and surface reconstruction directly in 3D by jointly modeling the human and the background with two separate neural fields. Specifically, it defines a temporally consistent human representation in canonical space and formulates a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives cleanly separate the dynamic human from the static background, yielding detailed and robust 3D human reconstructions. Evaluations show improvements over prior art on publicly available datasets.
Paper23 Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching
摘要原文: The matching of 3D shapes has been extensively studied for shapes represented as surface meshes, as well as for shapes represented as point clouds. While point clouds are a common representation of raw real-world 3D data (e.g. from laser scanners), meshes encode rich and expressive topological information, but their creation typically requires some form of (often manual) curation. In turn, methods that purely rely on point clouds are unable to meet the matching quality of mesh-based methods that utilise the additional topological structure. In this work we close this gap by introducing a self-supervised multimodal learning strategy that combines mesh-based functional map regularisation with a contrastive loss that couples mesh and point cloud data. Our shape matching approach allows to obtain intramodal correspondences for triangle meshes, complete point clouds, and partially observed point clouds, as well as correspondences across these data modalities. We demonstrate that our method achieves state-of-the-art results on several challenging benchmark datasets even in comparison to recent supervised methods, and that our method reaches previously unseen cross-dataset generalisation ability.
Summary: This passage discusses 3D shape matching, which has been studied extensively for shapes represented as surface meshes and as point clouds. Point clouds are a common representation of raw real-world 3D data (e.g., from laser scanners), while meshes encode rich and expressive topological information but typically require some form of (often manual) curation to create; methods that rely purely on point clouds therefore cannot match the quality of mesh-based methods that exploit the additional topological structure. This work closes the gap with a self-supervised multimodal learning strategy that combines mesh-based functional map regularisation with a contrastive loss coupling mesh and point cloud data. The resulting shape matching approach yields intramodal correspondences for triangle meshes, complete point clouds, and partially observed point clouds, as well as correspondences across these data modalities. The method achieves state-of-the-art results on several challenging benchmarks, even compared with recent supervised methods, and reaches previously unseen cross-dataset generalisation ability.
Paper24 Self-Supervised Implicit Glyph Attention for Text Recognition
摘要原文: The attention mechanism has become the de facto module in scene text recognition (STR) methods, due to its capability of extracting character-level representations. These methods can be summarized into implicit attention based and supervised attention based, depended on how the attention is computed, i.e., implicit attention and supervised attention are learned from sequence-level text annotations and character-level bounding box annotations, respectively. Implicit attention, as it may extract coarse or even incorrect spatial regions as character attention, is prone to suffering from an alignment-drifted issue. Supervised attention can alleviate the above issue, but it is category-specific, which requires extra laborious character-level bounding box annotations and would be memory-intensive when the number of character categories is large. To address the aforementioned issues, we propose a novel attention mechanism for STR, self-supervised implicit glyph attention (SIGA). SIGA delineates the glyph structures of text images by jointly self-supervised text segmentation and implicit attention alignment, which serve as the supervision to improve attention correctness without extra character-level annotations. Experimental results demonstrate that SIGA performs consistently and significantly better than previous attention-based STR methods, in terms of both attention correctness and final recognition performance on publicly available context benchmarks and our contributed contextless benchmarks.
Summary: This passage discusses how the attention mechanism has become the de facto module in scene text recognition (STR) because of its ability to extract character-level representations. Such methods can be grouped into implicit-attention-based and supervised-attention-based, depending on how attention is computed: implicit attention is learned from sequence-level text annotations, while supervised attention is learned from character-level bounding-box annotations. Implicit attention may extract coarse or even incorrect spatial regions as character attention and is prone to alignment drift; supervised attention alleviates this but is category-specific, requires laborious character-level bounding-box annotations, and becomes memory-intensive when the number of character categories is large. To address these issues, the authors propose a new attention mechanism for STR, self-supervised implicit glyph attention (SIGA). SIGA delineates the glyph structures of text images via jointly self-supervised text segmentation and implicit attention alignment, which serve as supervision to improve attention correctness without extra character-level annotations. Experiments show that SIGA performs consistently and significantly better than previous attention-based STR methods in both attention correctness and final recognition performance on public context benchmarks and the authors' contributed contextless benchmarks.
Paper25 SkyEye: Self-Supervised Bird’s-Eye-View Semantic Mapping Using Monocular Frontal View Images
摘要原文: Bird’s-Eye-View (BEV) semantic maps have become an essential component of automated driving pipelines due to the rich representation they provide for decision-making tasks. However, existing approaches for generating these maps still follow a fully supervised training paradigm and hence rely on large amounts of annotated BEV data. In this work, we address this limitation by proposing the first self-supervised approach for generating a BEV semantic map using a single monocular image from the frontal view (FV). During training, we overcome the need for BEV ground truth annotations by leveraging the more easily available FV semantic annotations of video sequences. Thus, we propose the SkyEye architecture that learns based on two modes of self-supervision, namely, implicit supervision and explicit supervision. Implicit supervision trains the model by enforcing spatial consistency of the scene over time based on FV semantic sequences, while explicit supervision exploits BEV pseudolabels generated from FV semantic annotations and self-supervised depth estimates. Extensive evaluations on the KITTI-360 dataset demonstrate that our self-supervised approach performs on par with the state-of-the-art fully supervised methods and achieves competitive results using only 1% of direct supervision in BEV compared to fully supervised approaches. Finally, we publicly release both our code and the BEV datasets generated from the KITTI-360 and Waymo datasets.
Summary: This passage discusses how Bird's-Eye-View (BEV) semantic maps have become an essential component of automated driving pipelines because of the rich representation they provide for decision-making, while existing approaches for generating them still follow a fully supervised paradigm and therefore depend on large amounts of annotated BEV data. The authors propose the first self-supervised approach for generating a BEV semantic map from a single monocular frontal-view (FV) image, realized in the SkyEye architecture. During training, the need for BEV ground-truth annotations is removed by leveraging the more easily available FV semantic annotations of video sequences, using two modes of self-supervision: implicit supervision, which trains the model by enforcing spatial consistency of the scene over time based on FV semantic sequences, and explicit supervision, which exploits BEV pseudolabels generated from FV semantic annotations and self-supervised depth estimates. Extensive evaluation on KITTI-360 shows that the self-supervised approach performs on par with state-of-the-art fully supervised methods and achieves competitive results with only 1% of direct BEV supervision. The code and the BEV datasets generated from KITTI-360 and Waymo are publicly released.
Paper26 Self-Supervised AutoFlow
摘要原文: Recently, AutoFlow has shown promising results on learning a training set for optical flow, but requires ground truth labels in the target domain to compute its search metric. Observing a strong correlation between the ground truth search metric and self-supervised losses, we introduce self-supervised AutoFlow to handle real-world videos without ground truth labels. Using self-supervised loss as the search metric, our self-supervised AutoFlow performs on par with AutoFlow on Sintel and KITTI where ground truth is available, and performs better on the real-world DAVIS dataset. We further explore using self-supervised AutoFlow in the (semi-)supervised setting and obtain competitive results against the state of the art.
Summary: AutoFlow has recently shown promising results for learning a training set for optical flow, but it needs ground-truth labels in the target domain to compute its search metric. Observing a strong correlation between the ground-truth search metric and self-supervised losses, the authors introduce self-supervised AutoFlow to handle real-world videos without ground-truth labels. Using a self-supervised loss as the search metric, self-supervised AutoFlow performs on par with AutoFlow on Sintel and KITTI, where ground truth is available, and performs better on the real-world DAVIS dataset. The authors further explore self-supervised AutoFlow in the (semi-)supervised setting and obtain results competitive with the state of the art.
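A typical self-supervised loss that could play the role of the search metric is a photometric reconstruction error: warp the second frame toward the first with the predicted flow and compare. The sketch below implements that warping with grid_sample; it is a generic unsupervised flow loss, not necessarily the exact metric used in the paper:

```python
import torch
import torch.nn.functional as F

def photometric_loss(img1, img2, flow):
    """L1 error between img1 and img2 warped toward img1 by `flow`.

    img1, img2: (B, 3, H, W); flow: (B, 2, H, W) in pixels (x, y displacements).
    """
    B, _, H, W = img1.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img1)   # (2, H, W) pixel coords
    coords = grid.unsqueeze(0) + flow                      # sampling locations
    # Normalize to [-1, 1]; grid_sample expects (B, H, W, 2) ordered as (x, y).
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)
    warped = F.grid_sample(img2, sample_grid, align_corners=True)
    return (img1 - warped).abs().mean()

loss = photometric_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                        torch.zeros(1, 2, 64, 64))
print(loss.item())
```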
Paper27 Spatial-Then-Temporal Self-Supervised Learning for Video Correspondence
摘要原文: In low-level video analyses, effective representations are important to derive the correspondences between video frames. These representations have been learned in a self-supervised fashion from unlabeled images/videos, using carefully designed pretext tasks in some recent studies. However, the previous work concentrates on either spatial-discriminative features or temporal-repetitive features, with little attention to the synergy between spatial and temporal cues. To address this issue, we propose a novel spatial-then-temporal self-supervised learning method. Specifically, we firstly extract spatial features from unlabeled images via contrastive learning, and secondly enhance the features by exploiting the temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss to ensure the learning not to forget the spatial cues, and we design a local correlation distillation loss to combat the temporal discontinuity that harms the reconstruction. The proposed method outperforms the state-of-the-art self-supervised methods, as established by the experimental results on a series of correspondence-based video analysis tasks. Also, we performed ablation studies to verify the effectiveness of the two-step design as well as the distillation losses.
Summary: In low-level video analysis, effective representations are important for deriving correspondences between video frames. Recent studies learn such representations in a self-supervised fashion from unlabeled images or videos with carefully designed pretext tasks, but previous work focuses on either spatially discriminative features or temporally repetitive features, paying little attention to the synergy between spatial and temporal cues. To address this, the authors propose a spatial-then-temporal self-supervised learning method: spatial features are first extracted from unlabeled images via contrastive learning and then enhanced by exploiting temporal cues in unlabeled videos via reconstructive learning. In the second step, a global correlation distillation loss ensures the learning does not forget the spatial cues, and a local correlation distillation loss combats the temporal discontinuity that harms reconstruction. The method outperforms state-of-the-art self-supervised approaches on a series of correspondence-based video analysis tasks, and ablation studies verify the effectiveness of the two-step design and the distillation losses.
Paper28 SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations
摘要原文: Reverse engineering CAD models from raw geometry is a classic but strenuous research problem. Previous learning-based methods rely heavily on labels due to the supervised design patterns or reconstruct CAD shapes that are not easily editable. In this work, we introduce SECAD-Net, an end-to-end neural network aimed at reconstructing compact and easy-to-edit CAD models in a self-supervised manner. Drawing inspiration from the modeling language that is most commonly used in modern CAD software, we propose to learn 2D sketches and 3D extrusion parameters from raw shapes, from which a set of extrusion cylinders can be generated by extruding each sketch from a 2D plane into a 3D body. By incorporating the Boolean operation (i.e., union), these cylinders can be combined to closely approximate the target geometry. We advocate the use of implicit fields for sketch representation, which allows for creating CAD variations by interpolating latent codes in the sketch latent space. Extensive experiments on both ABC and Fusion 360 datasets demonstrate the effectiveness of our method, and show superiority over state-of-the-art alternatives including the closely related method for supervised CAD reconstruction. We further apply our approach to CAD editing and single-view CAD reconstruction. The code is released at https://github.com/BunnySoCrazy/SECAD-Net.
Summary: This passage introduces SECAD-Net, an end-to-end neural network that reconstructs compact and easily editable CAD models in a self-supervised manner. Inspired by the modeling language most commonly used in modern CAD software, the method learns 2D sketches and 3D extrusion parameters from raw shapes, from which a set of extrusion cylinders is generated by extruding each sketch from a 2D plane into a 3D body; combined with the Boolean union operation, these cylinders closely approximate the target geometry. The method represents sketches with implicit fields, which allows creating CAD variations by interpolating latent codes in the sketch latent space. Extensive experiments on the ABC and Fusion 360 datasets demonstrate its effectiveness and its superiority over state-of-the-art alternatives, including a closely related supervised CAD reconstruction method. The approach is further applied to CAD editing and single-view CAD reconstruction. The code is released at https://github.com/BunnySoCrazy/SECAD-Net.
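The sketch-extrude primitive itself is easy to write down: a 2D implicit sketch (here an analytic circle standing in for the learned implicit field) is extruded along z over a height range, a point is inside the cylinder if it is inside the sketch and within that range, and a union of such primitives approximates the target shape. The learned networks and losses are omitted:

```python
import torch

def circle_sketch_sdf(xy, radius=0.4):
    """Analytic 2D signed distance of a circle; stands in for SECAD-Net's
    learned implicit sketch field (negative = inside)."""
    return xy.norm(dim=-1) - radius

def extrude_occupancy(points, sketch_sdf, half_height=0.3):
    """Occupancy of 3D `points` (N, 3) for a sketch extruded along z."""
    inside_sketch = sketch_sdf(points[:, :2]) <= 0.0
    inside_height = points[:, 2].abs() <= half_height
    return inside_sketch & inside_height

def union(*occupancies):
    """Boolean union of several extrusion cylinders."""
    out = occupancies[0]
    for occ in occupancies[1:]:
        out = out | occ
    return out

pts = torch.rand(10_000, 3) * 2.0 - 1.0                   # query points in [-1, 1]^3
cyl_a = extrude_occupancy(pts, circle_sketch_sdf, half_height=0.3)
cyl_b = extrude_occupancy(pts + torch.tensor([0.3, 0.0, -0.2]), circle_sketch_sdf, 0.2)
print(union(cyl_a, cyl_b).float().mean())                 # rough occupied volume fraction
```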
Paper29 Self-Supervised Representation Learning for CAD
摘要原文: Virtually every object in the modern world was created, modified, analyzed and optimized using computer aided design (CAD) tools. An active CAD research area is the use of data-driven machine learning methods to learn from the massive repositories of geometric and program representations. However, the lack of labeled data in CAD’s native format, i.e., the parametric boundary representation (B-Rep), poses an obstacle at present difficult to overcome. Several datasets of mechanical parts in B-Rep format have recently been released for machine learning research. However, large-scale databases are mostly unlabeled, and labeled datasets are small. Additionally, task-specific label sets are rare and costly to annotate. This work proposes to leverage unlabeled CAD geometry on supervised learning tasks. We learn a novel, hybrid implicit/explicit surface representation for B-Rep geometry. Further, we show that this pre-training both significantly improves few-shot learning performance and achieves state-of-the-art performance on several current B-Rep benchmarks.
Summary: This passage discusses how virtually every object in the modern world is created, modified, analyzed, and optimized with computer-aided design (CAD) tools, and how an active CAD research area is using data-driven machine learning to learn from massive repositories of geometric and program representations. The lack of labeled data in CAD's native format, the parametric boundary representation (B-Rep), remains a difficult obstacle: several B-Rep datasets of mechanical parts have recently been released for machine learning research, but large-scale databases are mostly unlabeled, labeled datasets are small, and task-specific label sets are rare and costly to annotate. This work proposes leveraging unlabeled CAD geometry for supervised learning tasks by learning a novel hybrid implicit/explicit surface representation for B-Rep geometry, and shows that this pre-training both significantly improves few-shot learning performance and achieves state-of-the-art results on several current B-Rep benchmarks.
Paper30 CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation
摘要原文: Existing semantic segmentation approaches are often limited by costly pixel-wise annotations and predefined classes. In this work, we present CLIP-S^4 that leverages self-supervised pixel representation learning and vision-language models to enable various semantic segmentation tasks (e.g., unsupervised, transfer learning, language-driven segmentation) without any human annotations and unknown class information. We first learn pixel embeddings with pixel-segment contrastive learning from different augmented views of images. To further improve the pixel embeddings and enable language-driven semantic segmentation, we design two types of consistency guided by vision-language models: 1) embedding consistency, aligning our pixel embeddings to the joint feature space of a pre-trained vision-language model, CLIP; and 2) semantic consistency, forcing our model to make the same predictions as CLIP over a set of carefully designed target classes with both known and unknown prototypes. Thus, CLIP-S^4 enables a new task of class-free semantic segmentation where no unknown class information is needed during training. As a result, our approach shows consistent and substantial performance improvement over four popular benchmarks compared with the state-of-the-art unsupervised and language-driven semantic segmentation methods. More importantly, our method outperforms these methods on unknown class recognition by a large margin.
Summary: This passage discusses how existing semantic segmentation approaches are often limited by costly pixel-wise annotations and predefined classes. The authors present CLIP-S^4, which combines self-supervised pixel representation learning with vision-language models to enable various semantic segmentation tasks (unsupervised, transfer learning, language-driven segmentation) without any human annotations or knowledge of unknown classes. Pixel embeddings are first learned via pixel-segment contrastive learning over different augmented views of images. To further improve the embeddings and enable language-driven segmentation, two kinds of consistency guided by vision-language models are designed: embedding consistency, which aligns the pixel embeddings to the joint feature space of the pre-trained vision-language model CLIP, and semantic consistency, which forces the model to make the same predictions as CLIP over a set of carefully designed target classes with both known and unknown prototypes. CLIP-S^4 thus enables a new class-free semantic segmentation task that requires no unknown-class information during training. The approach shows consistent and substantial improvements over state-of-the-art unsupervised and language-driven segmentation methods on four popular benchmarks, and it outperforms them on unknown-class recognition by a large margin.
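The semantic-consistency idea, i.e., making the segmentation model agree with CLIP's zero-shot predictions over a set of class prototypes, can be sketched with precomputed CLIP text embeddings (random stand-ins below): CLIP-style cosine logits give pseudo-labels for the pixels, and the model is trained to match them with a cross-entropy. This is a simplified rendering of one of the two consistency terms, with the function and tensor names being illustrative:

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(pixel_emb, clip_pixel_emb, text_protos, temp=0.07):
    """Cross-entropy between the model's class predictions and CLIP's.

    pixel_emb:      (N, D) pixel embeddings from the segmentation model
    clip_pixel_emb: (N, D) the same pixels embedded in CLIP's joint space
    text_protos:    (C, D) CLIP text embeddings of the target class prompts
    """
    protos = F.normalize(text_protos, dim=1)
    model_logits = F.normalize(pixel_emb, dim=1) @ protos.t() / temp
    with torch.no_grad():
        clip_logits = F.normalize(clip_pixel_emb, dim=1) @ protos.t() / temp
        pseudo = clip_logits.argmax(dim=1)          # CLIP's zero-shot class choice
    return F.cross_entropy(model_logits, pseudo)

loss = semantic_consistency_loss(torch.randn(4096, 512), torch.randn(4096, 512),
                                 torch.randn(20, 512))
print(loss.item())
```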
Paper31 Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation
摘要原文: The objective of this paper is self-supervised learning of video object segmentation. We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts usually relying on an oblique solution — cheaply “copying” labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels for creating pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught, mask embedding scheme, so as to ensure the generic nature of the learnt representation and avoid cluster degeneracy. Our algorithm sets state-of-the-arts on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS, in terms of both performance and network architecture design. Our full code will be released.
Summary: The goal of this paper is self-supervised learning of video object segmentation (VOS). The authors develop a unified framework that simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding, so it directly learns mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts that usually rely on an oblique solution of cheaply "copying" labels according to pixel-wise correlations. Concretely, the algorithm alternates between (i) clustering video pixels to create pseudo segmentation labels from scratch and (ii) using these pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught mask-embedding scheme to keep the learned representation generic and to avoid cluster degeneracy. The algorithm sets a new state of the art on two standard benchmarks (DAVIS17 and YouTube-VOS), narrowing the gap between self-supervised and fully supervised VOS in both performance and network architecture design. The full code will be released.
Paper32 DLBD: A Self-Supervised Direct-Learned Binary Descriptor
摘要原文: For learning-based binary descriptors, the binarization process has not been well addressed. The reason is that the binarization blocks gradient back-propagation. Existing learning-based binary descriptors learn real-valued output, and then it is converted to binary descriptors by their proposed binarization processes. Since their binarization processes are not a component of the network, the learning-based binary descriptor cannot fully utilize the advances of deep learning. To solve this issue, we propose a model-agnostic plugin binary transformation layer (BTL), making the network directly generate binary descriptors. Then, we present the first self-supervised, direct-learned binary descriptor, dubbed DLBD. Furthermore, we propose ultra-wide temperature-scaled cross-entropy loss to adjust the distribution of learned descriptors in a larger range. Experiments demonstrate that the proposed BTL can substitute the previous binarization process. Our proposed DLBD outperforms SOTA on different tasks such as image retrieval and classification.
Summary: This passage discusses the binarization problem of learning-based binary descriptors: binarization blocks gradient back-propagation, so existing learning-based binary descriptors learn real-valued outputs that are then converted to binary descriptors by separately proposed binarization processes. Because these processes are not part of the network, such descriptors cannot fully exploit the advances of deep learning. To solve this, the authors propose a model-agnostic plug-in binary transformation layer (BTL) that lets the network generate binary descriptors directly, and then present DLBD, the first self-supervised, direct-learned binary descriptor. They further propose an ultra-wide temperature-scaled cross-entropy loss to spread the distribution of learned descriptors over a larger range. Experiments show that BTL can replace previous binarization processes and that DLBD outperforms the state of the art on tasks such as image retrieval and classification.
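One standard way to let a network directly generate binary descriptors while keeping gradients flowing is a sign layer with a straight-through estimator; the sketch below shows that well-known pattern as a stand-in for the paper's binary transformation layer, whose exact formulation may differ:

```python
import torch
import torch.nn as nn

class BinarySign(torch.autograd.Function):
    """sign() in the forward pass, clipped identity gradient in the backward
    pass: the classic straight-through estimator."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1.0).float()   # pass gradient only near zero

class BinaryTransformationLayer(nn.Module):
    """Plug-in layer turning real-valued descriptors into +/-1 codes."""
    def forward(self, x):
        return BinarySign.apply(x)

desc = torch.randn(8, 256, requires_grad=True)
codes = BinaryTransformationLayer()(desc)
codes.sum().backward()                               # gradients reach `desc`
print(codes.unique(), desc.grad.abs().sum().item() > 0)
```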
Paper33 Geometric Visual Similarity Learning in 3D Medical Image Self-Supervised Pre-Training
Original abstract: Learning inter-image similarity is crucial for 3D medical images self-supervised pre-training, due to their sharing of numerous same semantic regions. However, the lack of the semantic prior in metrics and the semantic-independent variation in 3D medical images make it challenging to get a reliable measurement for the inter-image similarity, hindering the learning of consistent representation for same semantics. We investigate the challenging problem of this task, i.e., learning a consistent representation between images for a clustering effect of same semantic features. We propose a novel visual similarity learning paradigm, Geometric Visual Similarity Learning, which embeds the prior of topological invariance into the measurement of the inter-image similarity for consistent representation of semantic regions. To drive this paradigm, we further construct a novel geometric matching head, the Z-matching head, to collaboratively learn the global and local similarity of semantic regions, guiding the efficient representation learning for different scale-level inter-image semantic features. Our experiments demonstrate that the pre-training with our learning of inter-image similarity yields more powerful inner-scene, inter-scene, and global-local transferring ability on four challenging 3D medical image tasks. Our codes and pre-trained models will be publicly available in https://github.com/YutingHe-list/GVSL.
Summary: For self-supervised pre-training on 3D medical images, learning inter-image similarity is crucial because such images share many identical semantic regions. However, the lack of a semantic prior in the similarity metric and the semantics-independent variation across 3D medical images make reliable inter-image similarity measurement difficult, hindering consistent representations for the same semantics. The authors study this problem, i.e., learning consistent representations across images so that features of the same semantics cluster together, and propose Geometric Visual Similarity Learning, which embeds a topological-invariance prior into the inter-image similarity measure for consistent representation of semantic regions. To drive this paradigm, they construct a new geometric matching head, the Z-matching head, which jointly learns the global and local similarity of semantic regions and guides efficient representation learning of inter-image semantic features at different scales. Experiments show that pre-training with this inter-image similarity learning yields stronger inner-scene, inter-scene, and global-local transfer ability on four challenging 3D medical image tasks. Code and pre-trained models will be made publicly available at https://github.com/YutingHe-list/GVSL.
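As a loose illustration only (the abstract does not specify the Z-matching head's design), the sketch below scores two 3D volumes with a global similarity of pooled features and a local, voxel-wise similarity computed after warping one feature map by a small predicted displacement field. Every architectural detail here is an assumption made for illustration, not the GVSL implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def identity_grid(shape):
    """Identity sampling grid for F.grid_sample on a 5D volume, values in [-1, 1]."""
    d, h, w = shape
    z, y, x = torch.meshgrid(torch.linspace(-1, 1, d), torch.linspace(-1, 1, h),
                             torch.linspace(-1, 1, w), indexing="ij")
    return torch.stack([x, y, z], dim=-1)          # grid_sample expects (x, y, z) order

class Encoder3D(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv3d(16, dim, 3, padding=1))
    def forward(self, x):
        return self.net(x)                         # (B, dim, D, H, W)

class ZMatchingSketch(nn.Module):
    """Hypothetical head scoring both global (pooled) and local (voxel-wise) similarity."""
    def __init__(self, dim=32):
        super().__init__()
        self.flow = nn.Conv3d(2 * dim, 3, 3, padding=1)        # small displacement field
    def forward(self, fa, fb):
        ga = F.normalize(fa.mean(dim=(2, 3, 4)), dim=1)
        gb = F.normalize(fb.mean(dim=(2, 3, 4)), dim=1)
        global_sim = (ga * gb).sum(dim=1)
        disp = 0.1 * torch.tanh(self.flow(torch.cat([fa, fb], dim=1)))
        grid = identity_grid(fa.shape[2:]).to(fa) + disp.permute(0, 2, 3, 4, 1)
        fb_warp = F.grid_sample(fb, grid, align_corners=True)  # warp fb toward fa
        local_sim = F.cosine_similarity(fa, fb_warp, dim=1).mean(dim=(1, 2, 3))
        return global_sim, local_sim

enc, head = Encoder3D(), ZMatchingSketch()
vol_a, vol_b = torch.rand(1, 1, 16, 16, 16), torch.rand(1, 1, 16, 16, 16)
g, l = head(enc(vol_a), enc(vol_b))
loss = -(g.mean() + l.mean())                      # pull same-semantics representations together
loss.backward()
print(float(g), float(l))
```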
Paper34 Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture
Original abstract: This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
Summary: This paper presents an approach for learning highly semantic image representations without relying on hand-crafted data augmentations. It introduces the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative method for self-supervised learning from images. The idea is simple: from a single context block, predict the representations of several target blocks in the same image. The masking strategy is the core design choice guiding I-JEPA toward semantic representations: (a) target blocks should be sampled at a sufficiently large (semantic) scale, and (b) the context block should be sufficiently informative and spatially distributed. Empirically, I-JEPA is highly scalable when combined with Vision Transformers; for example, a ViT-Huge/14 trained on ImageNet with 16 A100 GPUs in under 72 hours achieves strong downstream performance on a wide range of tasks, from linear classification to object counting and depth prediction.
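A toy sketch of the context-to-target prediction idea might look as follows; it replaces the Vision Transformer with small MLPs over image patches and simplifies the masking, pooling, and loss, so it illustrates the principle rather than reproducing the released I-JEPA code. The block indices, EMA rate, and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

patch, dim = 8, 64
n_patches = (32 // patch) ** 2                    # toy 32x32 images -> a 4x4 grid of patches

def patchify(img):                                # (B, 3, 32, 32) -> (B, N, 3*patch*patch)
    p = img.unfold(2, patch, patch).unfold(3, patch, patch)
    return p.permute(0, 2, 3, 1, 4, 5).reshape(img.shape[0], n_patches, -1)

context_enc = nn.Sequential(nn.Linear(3 * patch * patch, dim), nn.ReLU(), nn.Linear(dim, dim))
target_enc  = nn.Sequential(nn.Linear(3 * patch * patch, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor   = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
target_enc.load_state_dict(context_enc.state_dict())
for p in target_enc.parameters():                 # the target encoder is only updated by EMA
    p.requires_grad_(False)

opt = torch.optim.Adam(list(context_enc.parameters()) + list(predictor.parameters()), lr=1e-3)

img = torch.rand(4, 3, 32, 32)
tokens = patchify(img)
ctx_idx = torch.tensor([0, 1, 4, 5, 8, 9])        # a spatially distributed context block
tgt_idx = torch.tensor([10, 11, 14, 15])          # one reasonably large target block

ctx = context_enc(tokens[:, ctx_idx]).mean(dim=1)          # summarize the context (toy pooling)
with torch.no_grad():
    tgt = target_enc(tokens[:, tgt_idx])                   # representations to be predicted
pred = predictor(ctx)[:, None, :].expand_as(tgt)           # predict each target representation
loss = F.smooth_l1_loss(pred, tgt)
loss.backward(); opt.step()

with torch.no_grad():                                      # EMA update of the target encoder
    for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
        p_t.mul_(0.996).add_(p_c, alpha=0.004)
print(float(loss))
```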
Paper35 Defending Against Patch-Based Backdoor Attacks on Self-Supervised Learning
Original abstract: Recently, self-supervised learning (SSL) was shown to be vulnerable to patch-based data poisoning backdoor attacks. It was shown that an adversary can poison a small part of the unlabeled data so that when a victim trains an SSL model on it, the final model will have a backdoor that the adversary can exploit. This work aims to defend self-supervised learning against such attacks. We use a three-step defense pipeline, where we first train a model on the poisoned data. In the second step, our proposed defense algorithm (PatchSearch) uses the trained model to search the training data for poisoned samples and removes them from the training set. In the third step, a final model is trained on the cleaned-up training set. Our results show that PatchSearch is an effective defense. As an example, it improves a model’s accuracy on images containing the trigger from 38.2% to 63.7% which is very close to the clean model’s accuracy, 64.6%. Moreover, we show that PatchSearch outperforms baselines and state-of-the-art defense approaches including those using additional clean, trusted data. Our code is available at https://github.com/UCDvision/PatchSearch
Summary: Self-supervised learning (SSL) was recently shown to be vulnerable to patch-based data-poisoning backdoor attacks: an adversary can poison a small fraction of the unlabeled data so that an SSL model trained on it contains a backdoor the adversary can exploit. This work defends SSL against such attacks with a three-step pipeline: first, a model is trained on the poisoned data; second, the proposed defense algorithm, PatchSearch, uses that model to search the training data for poisoned samples and removes them; third, a final model is trained on the cleaned-up training set. Results show PatchSearch is an effective defense; for example, it raises a model's accuracy on images containing the trigger from 38.2% to 63.7%, close to the clean model's 64.6%. PatchSearch also outperforms baselines and state-of-the-art defenses, including those that use additional clean, trusted data. Code is available at https://github.com/UCDvision/PatchSearch.
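The three-step pipeline can be summarized schematically as below. The `poison_score` function is a deliberately naive, hypothetical placeholder, and `train_ssl` is an empty stub; the actual PatchSearch scoring procedure is not described in the abstract and is more involved.

```python
import torch
import torch.nn as nn

def train_ssl(model, data):
    """Placeholder for any self-supervised training routine (details omitted)."""
    return model

def poison_score(model, image):
    """Hypothetical stand-in for PatchSearch's scoring of candidate poisoned samples;
    here it just returns a dummy feature-norm statistic."""
    with torch.no_grad():
        return model(image[None]).norm().item()

# Step 1: train a first model on the (possibly poisoned) unlabeled data.
data = [torch.rand(3, 32, 32) for _ in range(100)]
model = train_ssl(nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128)), data)

# Step 2: score every training sample and drop the most suspicious fraction.
scores = torch.tensor([poison_score(model, x) for x in data])
keep = scores.argsort()[: int(0.9 * len(data))]   # keep the 90% with the lowest scores
cleaned = [data[int(i)] for i in keep]

# Step 3: retrain a final model on the cleaned-up training set.
final_model = train_ssl(nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128)), cleaned)
print(f"{len(data)} -> {len(cleaned)} samples after filtering")
```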
Paper36 Patch-Craft Self-Supervised Training for Correlated Image Denoising
Original abstract: Supervised neural networks are known to achieve excellent results in various image restoration tasks. However, such training requires datasets composed of pairs of corrupted images and their corresponding ground truth targets. Unfortunately, such data is not available in many applications. For the task of image denoising in which the noise statistics is unknown, several self-supervised training methods have been proposed for overcoming this difficulty. Some of these require knowledge of the noise model, while others assume that the contaminating noise is uncorrelated, both assumptions are too limiting for many practical needs. This work proposes a novel self-supervised training technique suitable for the removal of unknown correlated noise. The proposed approach neither requires knowledge of the noise model nor access to ground truth targets. The input to our algorithm consists of easily captured bursts of noisy shots. Our algorithm constructs artificial patch-craft images from these bursts by patch matching and stitching, and the obtained crafted images are used as targets for the training. Our method does not require registration of the different images within the burst. We evaluate the proposed framework through extensive experiments with synthetic and real image noise.
Summary: Supervised neural networks achieve excellent results on image restoration, but training them requires pairs of corrupted images and ground-truth targets, which are unavailable in many applications. For denoising with unknown noise statistics, several self-supervised training methods have been proposed, but some require knowledge of the noise model and others assume the contaminating noise is uncorrelated, both of which are too restrictive in practice. This work proposes a self-supervised training technique for removing unknown correlated noise that requires neither a noise model nor ground-truth targets. Its input is easily captured bursts of noisy shots; the algorithm builds artificial patch-craft images from these bursts via patch matching and stitching, and the crafted images serve as training targets. The method does not require registering the images within a burst. The framework is evaluated through extensive experiments on synthetic and real image noise.
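A rough sketch of the patch-craft idea, under simplifying assumptions (axis-aligned non-overlapping patches, a tiny burst, plain L2 patch matching), might look like this; it is not the authors' implementation, and the toy denoiser and sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def craft_target(burst, ps=8):
    """Build a patch-craft target for burst[0]: each patch is replaced by its nearest
    patch (in L2 distance) taken from the *other* shots in the burst."""
    ref, others = burst[0], burst[1:]
    C, H, W = ref.shape
    cand = torch.cat([f.unfold(1, ps, ps).unfold(2, ps, ps)      # (C, H/ps, W/ps, ps, ps)
                        .permute(1, 2, 0, 3, 4).reshape(-1, C * ps * ps)
                      for f in others], dim=0)                   # candidate patches as vectors
    target = torch.zeros_like(ref)
    for i in range(0, H, ps):
        for j in range(0, W, ps):
            q = ref[:, i:i + ps, j:j + ps].reshape(1, -1)
            nearest = cand[torch.cdist(q, cand).argmin()]
            target[:, i:i + ps, j:j + ps] = nearest.reshape(C, ps, ps)
    return target

burst = torch.rand(4, 3, 32, 32)          # four noisy shots of (roughly) the same scene
target = craft_target(burst)              # crafted image used as the training target

denoiser = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 3, 3, padding=1))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
loss = F.mse_loss(denoiser(burst[0][None]), target[None])
loss.backward(); opt.step()
print(float(loss))
```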
Paper37 Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion
Original abstract: StyleGAN has achieved great progress in 2D face reconstruction and semantic editing via image inversion and latent editing. While studies over extending 2D StyleGAN to 3D faces have emerged, a corresponding generic 3D GAN inversion framework is still missing, limiting the applications of 3D face reconstruction and semantic editing. In this paper, we study the challenging problem of 3D GAN inversion where a latent code is predicted given a single face image to faithfully recover its 3D shapes and detailed textures. The problem is ill-posed: innumerable compositions of shape and texture could be rendered to the current image. Furthermore, with the limited capacity of a global latent code, 2D inversion methods cannot preserve faithful shape and texture at the same time when applied to 3D models. To solve this problem, we devise an effective self-training scheme to constrain the learning of inversion. The learning is done efficiently without any real-world 2D-3D training pairs but proxy samples generated from a 3D GAN. In addition, apart from a global latent code that captures the coarse shape and texture information, we augment the generation network with a local branch, where pixel-aligned features are added to faithfully reconstruct face details. We further consider a new pipeline to perform 3D view-consistent editing. Extensive experiments show that our method outperforms state-of-the-art inversion methods in both shape and texture reconstruction quality.
Summary: StyleGAN has enabled great progress in 2D face reconstruction and semantic editing via image inversion and latent editing, and studies extending 2D StyleGAN to 3D faces have emerged, but a generic 3D GAN inversion framework is still missing, limiting 3D face reconstruction and semantic editing. This paper studies 3D GAN inversion: predicting a latent code from a single face image that faithfully recovers its 3D shape and detailed texture. The problem is ill-posed, since innumerable combinations of shape and texture could render to the same image, and the limited capacity of a global latent code prevents 2D inversion methods from preserving faithful shape and texture simultaneously when applied to 3D models. To address this, the authors devise an effective self-training scheme to constrain the inversion learning; training is done efficiently without real 2D-3D pairs, using proxy samples generated by a 3D GAN instead. In addition to a global latent code capturing coarse shape and texture, the generation network is augmented with a local branch in which pixel-aligned features are added to faithfully reconstruct facial details. A new pipeline for 3D view-consistent editing is also proposed. Extensive experiments show the method outperforms state-of-the-art inversion methods in both shape and texture reconstruction quality.
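Schematically, the self-training loop could be sketched as below, with a trivial placeholder standing in for the frozen 3D-aware generator and with simplified losses; every module, dimension, and loss term here is an assumption made for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator3D(nn.Module):
    """Trivial placeholder standing in for a frozen, pretrained 3D-aware GAN generator."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.fc = nn.Linear(z_dim, 3 * 32 * 32)
    def forward(self, z, local_feats=None):
        img = self.fc(z).view(-1, 3, 32, 32)
        if local_feats is not None:              # local branch adds pixel-aligned detail
            img = img + local_feats
        return torch.tanh(img)

class InversionEncoder(nn.Module):
    """Predicts a global latent code plus a pixel-aligned local feature map."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.to_global = nn.Linear(16 * 32 * 32, z_dim)
        self.to_local = nn.Conv2d(16, 3, 3, padding=1)
    def forward(self, img):
        h = self.backbone(img)
        return self.to_global(h.flatten(1)), self.to_local(h)

gan = Generator3D()
for p in gan.parameters():
    p.requires_grad_(False)                      # the pretrained generator stays frozen

enc = InversionEncoder()
opt = torch.optim.Adam(enc.parameters(), lr=1e-4)

# Self-training on proxy samples: images rendered from known random latents.
z_true = torch.randn(8, 64)
proxy = gan(z_true).detach()
z_pred, local = enc(proxy)
recon = gan(z_pred, local)
loss = F.mse_loss(recon, proxy) + F.mse_loss(z_pred, z_true)   # image + latent supervision
loss.backward(); opt.step()
print(float(loss))
```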
Paper38 DrapeNet: Garment Generation and Self-Supervised Draping
Original abstract: Recent approaches to drape garments quickly over arbitrary human bodies leverage self-supervision to eliminate the need for large training sets. However, they are designed to train one network per clothing item, which severely limits their generalization abilities. In our work, we rely on self-supervision to train a single network to drape multiple garments. This is achieved by predicting a 3D deformation field conditioned on the latent codes of a generative network, which models garments as unsigned distance fields. Our pipeline can generate and drape previously unseen garments of any topology, whose shape can be edited by manipulating their latent codes. Being fully differentiable, our formulation makes it possible to recover accurate 3D models of garments from partial observations – images or 3D scans – via gradient descent. Our code is publicly available at https://github.com/liren2515/DrapeNet.
Summary: Recent approaches that quickly drape garments over arbitrary human bodies use self-supervision to eliminate the need for large training sets, but they train one network per clothing item, which severely limits generalization. This work uses self-supervision to train a single network to drape multiple garments. This is achieved by predicting a 3D deformation field conditioned on the latent codes of a generative network that models garments as unsigned distance fields. The pipeline can generate and drape previously unseen garments of any topology, and their shape can be edited by manipulating the latent codes. Because the formulation is fully differentiable, accurate 3D garment models can be recovered from partial observations, such as images or 3D scans, via gradient descent. Code is publicly available at https://github.com/liren2515/DrapeNet.
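A minimal sketch of the two ingredients, a latent-conditioned unsigned distance field and a latent- and body-conditioned deformation field, plus latent fitting by gradient descent, might look like this. All dimensions, module designs, and the body-parameter placeholder are assumptions, not the DrapeNet code.

```python
import torch
import torch.nn as nn

class GarmentUDF(nn.Module):
    """Unsigned distance field of a garment, conditioned on a latent code."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 + latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, pts, z):
        z = z.expand(pts.shape[0], -1)
        return self.net(torch.cat([pts, z], dim=1)).abs()       # unsigned distance >= 0

class DrapingField(nn.Module):
    """Predicts a 3D displacement for each garment point, conditioned on the garment
    latent code and (here, a placeholder for) body shape/pose parameters."""
    def __init__(self, latent_dim=32, body_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 + latent_dim + body_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 3))
    def forward(self, pts, z, body):
        cond = torch.cat([pts, z.expand(pts.shape[0], -1),
                          body.expand(pts.shape[0], -1)], dim=1)
        return pts + self.net(cond)                             # draped point positions

udf, drape = GarmentUDF(), DrapingField()
z = torch.randn(1, 32)
body = torch.randn(1, 10)
pts = torch.rand(256, 3)                     # points sampled near the garment surface
draped = drape(pts, z, body)                 # one network drapes any garment latent
print(udf(pts, z).shape, draped.shape)

# Because everything is differentiable, a latent code can be fitted to partial
# observations (e.g. points from a 3D scan) by gradient descent on z:
z_opt = torch.randn(1, 32, requires_grad=True)
opt = torch.optim.Adam([z_opt], lr=1e-2)
scan_pts = torch.rand(128, 3)
for _ in range(10):
    opt.zero_grad()
    loss = udf(scan_pts, z_opt).mean()       # drive the UDF toward zero on observed points
    loss.backward()
    opt.step()
print(float(loss))
```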