PLAD: Learning to Infer Shape Programs
with Pseudo-Labels and Approximate Distributions


R. Kenny Jones1      Homer Walke2      Daniel Ritchie1     
1Brown University      2UC Berkeley     

Paper | Code | Video | Supplemental

CVPR 2022





We group a family of shape program inference methods under a single conceptual framework, where training is performed with maximum likelihood updates sourced from either Pseudo-Labels or an Approximate Distribution (PLAD). Compared with policy gradient reinforcement learning (RL), we show that PLAD techniques infer more accurate shape programs and converge significantly faster.



Bibtex

@article{jones2022PLAD,
  title={PLAD: Learning to Infer Shape Programs with Pseudo-Labels and Approximate Distributions},
  author={Jones, R. Kenny and Walke, Homer and Ritchie, Daniel},
  journal={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}


Abstract

Inferring programs which generate 2D and 3D shapes is important for reverse engineering, editing, and more. Training models to perform this task is complicated because paired (shape, program) data is not readily available for many domains, making exact supervised learning infeasible. However, it is possible to get paired data by compromising the accuracy of either the assigned program labels or the shape distribution. Wake-sleep methods use samples from a generative model of shape programs to approximate the distribution of real shapes. In self-training, shapes are passed through a recognition model, which predicts programs that are treated as ‘pseudo-labels’ for those shapes. Related to these approaches, we introduce a novel self-training variant unique to program inference, where program pseudo-labels are paired with their executed output shapes, avoiding label mismatch at the cost of an approximate shape distribution.

We propose to group these regimes under a single conceptual framework, where training is performed with maximum likelihood updates sourced from either Pseudo-Labels or an Approximate Distribution (PLAD). We evaluate these techniques on multiple 2D and 3D shape program inference domains. Compared with policy gradient reinforcement learning, we show that PLAD techniques infer more accurate shape programs and converge significantly faster. Finally, we propose to combine updates from different PLAD methods within the training of a single model, and find that this approach outperforms any individual technique.




Shape Program Inference





Shape program inference is a type of visual program induction problem. The specification is a visual representation of an object, and we aim to infer a program whose output geometrically matches the input.

Our formulation assumes the following as input: a language L with executor E (programs from L executed with E produce shapes), a shape distribution of interest X* (e.g. CAD models), and a reconstruction metric M that compares the visual similarity of two shapes (e.g. IoU). Our goal is to learn a recognition network that is able to infer programs Z when run over X*, such that the executions of Z achieve good reconstruction accuracy using M against X*.
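As a toy illustration of this setup (the executor and all names here are hypothetical stand-ins, not the paper's), inference amounts to finding the program whose execution scores best under M:

```python
def iou(a, b):
    """Reconstruction metric M: intersection-over-union of two occupancy sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def best_program(candidates, execute, target):
    """Pick the candidate program z whose execution E(z) best matches the target under M."""
    return max(candidates, key=lambda z: iou(execute(z), target))

# Toy black-box executor: a "program" is just a frozenset of filled cells.
execute = lambda z: set(z)
target = {(0, 0), (0, 1)}
candidates = [frozenset({(0, 0)}), frozenset({(0, 0), (0, 1)})]
assert best_program(candidates, execute, target) == frozenset({(0, 0), (0, 1)})
```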




In this problem statement, we assume that the program executor is a black box. The de facto approach to solving this problem has been to use reinforcement learning. Within this context, the recognition model can be treated as a policy network, and the reconstruction error between the input and the inferred program's output defines the reward function.
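In code, this RL view reduces to a reward that scores an inferred program by reconstruction quality; subtracting a mean baseline is one common variance-reduction trick, shown here for illustration only (the executor and metric are hypothetical):

```python
def reward(program, target, execute, metric):
    """Policy-gradient reward: visual similarity between the inferred
    program's execution and the input shape."""
    return metric(execute(program), target)

def advantages(rewards):
    """Subtract a mean baseline from a batch of rewards; a standard trick
    to reduce the variance of policy-gradient updates."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Toy executor and metric (illustrative only).
execute = lambda z: set(z)
iou = lambda a, b: len(a & b) / len(a | b) if a | b else 1.0
assert reward(frozenset({(0, 0)}), {(0, 0)}, execute, iou) == 1.0
assert advantages([0.0, 1.0]) == [-0.5, 0.5]
```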

Typically, RL has been used to fine-tune a model that has been pretrained on synthetically generated data. Synthetic data for shape program inference problems is usually easy to generate: we can sample random programs from the language and record their executions, producing paired (shape, program) data. When paired data is available, supervised learning with maximum likelihood estimation updates is used.
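Synthetic pretraining data generation can be sketched as follows, with a hypothetical single-primitive grammar standing in for a real shape language:

```python
import random

def sample_program(rng):
    """Sample a random program from a toy grammar: one axis-aligned
    rectangle on an 8x8 grid (a stand-in for a real shape language)."""
    x0, y0 = rng.randrange(8), rng.randrange(8)
    w, h = rng.randrange(1, 4), rng.randrange(1, 4)
    return ("rect", x0, y0, w, h)

def execute(program):
    """Toy executor E: rasterize the program to a set of filled cells."""
    _, x0, y0, w, h = program
    return {(x, y) for x in range(x0, x0 + w) for y in range(y0, y0 + h)}

def synthetic_pairs(n, seed=0):
    """Sample n random programs and record their executions,
    yielding exactly-labeled (shape, program) pairs for pretraining."""
    rng = random.Random(seed)
    return [(execute(z), z) for z in (sample_program(rng) for _ in range(n))]

pairs = synthetic_pairs(5)
assert all(shape == execute(z) for shape, z in pairs)  # labels exact by construction
```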

The main problem with policy gradient RL is its instability due to high variance gradients, which can slow convergence.



PLAD


We propose PLAD, a conceptual framework to group a family of related self-supervised learning techniques for shape program inference.

PLAD methods try to avoid the pitfalls of policy gradient RL by attempting to emulate supervised learning. Like supervised learning, PLAD methods create paired (shape, program) data that is used to train the recognition model with maximum likelihood updates. However, as ground-truth program annotations are usually unavailable, we can't construct exactly correct paired data. For this reason, PLAD methods must compromise on the accuracy of either the assigned program labels or the shape distribution in the paired data.

For methods that use pseudo-labels, each program is an inexact label for its paired shape, so executing the program does not exactly recreate that shape. For methods that use approximate distributions, the programs are valid labels for their paired shapes (e.g. executing the program set would reproduce the shape set), but the shape set that has valid programs does not exactly match the target shape distribution of interest; it is only an approximation of this target distribution.



To fine-tune the recognition model towards the distribution of interest, PLAD methods iterate through the following steps:

(1) The recognition model is run over the shapes from X* to infer a set of programs Z.
(2) Using the reconstruction metric and the inferred programs Z, a data structure is updated to track the best matching discovered program for each shape.
(3) Using the best-program data structure, construct (X, Z) paired data.
(4) Fine-tune the recognition model with maximum likelihood estimation updates on batches from (X, Z).
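These four steps can be sketched as one training round; `infer`, `execute`, `metric`, `make_pairs`, and `train` are placeholders for the domain-specific components, not the paper's actual implementation:

```python
def plad_round(shapes, infer, execute, metric, best, make_pairs, train):
    """One PLAD fine-tuning round over the target shapes X*."""
    # (1) Run the recognition model over X* to infer programs.
    # (2) Track the best matching discovered program per shape.
    for i, x in enumerate(shapes):
        z = infer(x)
        if i not in best or metric(execute(z), x) > metric(execute(best[i]), x):
            best[i] = z
    # (3) Construct (X, Z) paired data from the best-program structure.
    pairs = make_pairs(shapes, best)
    # (4) Fine-tune with maximum likelihood updates on batches from (X, Z).
    train(pairs)
    return best

# Toy usage: a fixed "model" that always predicts the same program.
execute = lambda z: set(z)
iou = lambda a, b: len(a & b) / len(a | b) if a | b else 1.0
shapes = [{(0, 0), (0, 1)}]
best = plad_round(shapes, lambda x: frozenset({(0, 0)}), execute, iou,
                  {0: frozenset({(1, 1)})},            # stale, worse program
                  lambda xs, b: [(xs[i], b[i]) for i in b],
                  lambda pairs: None)                  # training stubbed out
assert best[0] == frozenset({(0, 0)})  # better program replaced the stale one
```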

PLAD methods differ by how they construct the (X, Z) paired data (step 3). Self-training (ST) uses X* as X, and the entries of the best-program data structure as Z (pseudo-labels). Latent execution self-training (LEST) uses the entries of the best-program data structure as Z, and defines X as the executions of Z (approximate distribution). Wake-sleep (WS) trains a generative model over the entries of the best-program data structure; Z is populated by sampling programs from this generative model, and X is defined as the executions of those sampled programs (approximate distribution).
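The difference between ST and LEST in step (3) is small enough to show directly (WS, which samples Z from a learned generative model, is omitted); `best` is the best-program data structure and the executor is a toy stand-in:

```python
def self_training_pairs(shapes, best):
    """ST: pair each real shape with its pseudo-label program.
    X equals X*, but the labels may be inexact."""
    return [(shapes[i], best[i]) for i in range(len(shapes))]

def lest_pairs(shapes, best, execute):
    """LEST: pair each best program with its own execution.
    Labels are exact by construction, but X only approximates X*."""
    return [(execute(best[i]), best[i]) for i in range(len(shapes))]

# Toy check with a set-based executor.
execute = lambda z: set(z)
shapes = [{(0, 0), (0, 1)}]
best = {0: frozenset({(0, 0)})}
assert self_training_pairs(shapes, best)[0][0] == {(0, 0), (0, 1)}  # real shape kept
x, z = lest_pairs(shapes, best, execute)[0]
assert x == execute(z)  # label is exact for its (approximate) shape
```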

PLAD methods can be combined while fine-tuning a single recognition network. To do so, we propose to first construct a distinct paired (X, Z) distribution for each PLAD method; then, during training, each batch is randomly sampled from one of these paired distributions.
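This combination scheme can be sketched as a batch sampler that picks one method's paired distribution per batch (all names here are illustrative):

```python
import random

def combined_batches(paired_sources, n_batches, batch_size, seed=0):
    """Yield training batches, each drawn from one PLAD method's
    (X, Z) distribution chosen uniformly at random (e.g. ST, LEST, WS)."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        source = rng.choice(paired_sources)
        yield [rng.choice(source) for _ in range(batch_size)]

st_src = [("x_real", "z_pseudo")]    # ST: real shapes, pseudo-labels
lest_src = [("x_exec", "z_best")]    # LEST: executed shapes, exact labels
batches = list(combined_batches([st_src, lest_src], 4, 2))
assert all(len(set(batch)) == 1 for batch in batches)  # each batch from one source
```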



Shape Program Domains




We run experiments across three shape program domains: 2D constructive solid geometry (CSG), 3D CSG, and ShapeAssembly.

In CSG languages, shapes are created by declaring parametric primitives and combining them with Boolean operations. CSG inference is non-trivial as intersection and union are non-additive operations.
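A minimal illustration of CSG on boolean occupancy grids (a sketch of the idea, not the paper's executor) makes the non-additivity concrete: a union covers fewer cells than the primitives' total, because overlapping material is counted once.

```python
import numpy as np

def rect(grid_size, x0, y0, x1, y1):
    """Rasterize an axis-aligned rectangle primitive onto a boolean grid."""
    g = np.zeros((grid_size, grid_size), dtype=bool)
    g[x0:x1, y0:y1] = True
    return g

a = rect(8, 0, 0, 4, 4)   # 16 cells
b = rect(8, 2, 2, 6, 6)   # 16 cells, overlapping a in a 2x2 region

union = a | b             # CSG union
intersection = a & b      # CSG intersection
difference = a & ~b       # CSG difference

assert union.sum() == 28          # 16 + 16 - 4: overlap counted once
assert intersection.sum() == 4    # only the 2x2 overlap survives
assert difference.sum() == 12     # a with the overlap carved away
```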

ShapeAssembly is a domain-specific language for specifying the part structure of manufactured 3D objects. It creates objects by declaring cuboid part geometries and assembling those parts together via attachment and symmetry operators.


Reconstruction Accuracy





We evaluate the ability of PLAD methods to fine-tune a recognition network towards a distribution of interest. The recognition model is pretrained on synthetically generated paired data, and each approach fine-tunes it towards a shape distribution of interest (CAD manufactured shapes). We compare individual PLAD approaches (ST, LEST, WS) against combined PLAD approaches (LEST+ST, LEST+ST+WS), reinforcement learning (RL) and the pretrained model (SP).

We track test-set reconstruction metrics for each fine-tuning method. For 2D shapes, we use Chamfer distance. For 3D shapes, we use voxel IoU. Across all of the domains we studied, using the combined PLAD method (LEST+ST+WS) achieves the best fine-tuning performance.
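Both metrics are standard; straightforward NumPy versions (implementation details may differ from the evaluation code) look like:

```python
import numpy as np

def voxel_iou(a, b):
    """IoU between two boolean voxel grids (used for 3D shapes)."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def chamfer(p, q):
    """Symmetric Chamfer distance between two 2D point sets:
    mean nearest-neighbor distance in both directions."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.zeros((2, 2, 2), dtype=bool); a[0] = True      # 4 voxels
b = np.zeros((2, 2, 2), dtype=bool); b[:, 0] = True   # 4 voxels, 2 shared
assert abs(voxel_iou(a, b) - 2 / 6) < 1e-9
p = np.array([[0.0, 0.0], [1.0, 0.0]])
assert chamfer(p, p) == 0.0
```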

We also share qualitative comparisons of each method on test-set examples that back up the quantitative trends.

Reconstructions for 2D CSG:



Reconstructions for 3D CSG:



Reconstructions for ShapeAssembly:




Effect of training with less data





Within the 2D CSG domain, we designed an experiment to study how each fine-tuning approach is affected by varying the number of shapes in the training set. On the X-axis we plot the number of shapes in the training set; on the Y-axis we plot the test-set Chamfer distance achieved by each fine-tuning method. Combined PLAD methods (LEST+ST+WS) achieve the best performance even when access to training data is limited. In fact, LEST+ST+WS with 10% of the training data outperforms RL with access to all of it.


Method Convergence Speed





Also within the 2D CSG domain, we track the convergence rates of different fine-tuning approaches. On the X-axis we plot the wall-clock time of a training run in hours; on the Y-axis we plot the test-set Chamfer distance achieved by each fine-tuning method. PLAD methods converge much faster, and to better minima, compared with policy gradient RL. For this domain, RL took over 36 hours to converge, while the LEST+ST+WS PLAD variant was able to match this performance level with less than an hour of training.


Acknowledgements

We would like to thank the anonymous reviewers for their helpful suggestions. This work was funded in part by NSF award #1941808 and a Brown University Presidential Fellowship. Daniel Ritchie is an advisor to Geopipe and owns equity in the company. Geopipe is a start-up that is developing 3D technology to build immersive virtual copies of the real world with applications in various fields, including games and architecture.