Learning to Edit Visual Programs
with Self-Supervision


R. Kenny Jones1      Renhao Zhang2      Aditya Ganeshan1      Daniel Ritchie1     
1Brown University      2UMass Amherst     

Paper | Code | Video (coming soon)

NeurIPS 2024



We design a system that learns how to edit visual programs. Given an initial program and a visual target, our network predicts local edit operations that can be applied to the input program to improve its similarity to the target.


Bibtex

@inproceedings{jones2024VPIEdit,
  title = {Learning to Edit Visual Programs with Self-Supervision},
  author = {Jones, R. Kenny and Zhang, Renhao and Ganeshan, Aditya and Ritchie, Daniel},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2024}
}


Edit networks learn how to locally improve visual programs



We design a model that learns how to edit visual programs in a goal-directed manner. Our network consumes a complete input program, this program’s executed state, and a visual target. It then proposes a local edit operation that modifies the input program to better match the target. In contrast with approaches that try to find programs that explain visual inputs in a single shot, this framing allows our network to explicitly reason over a complete program and its execution in order to decide how the program should be modified.
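
A minimal sketch of this interface, assuming a PyTorch-style model; the names here (EditNetwork, prog_encoder, and so on) are illustrative, not the paper's released API:

import torch
import torch.nn as nn

class EditNetwork(nn.Module):
    """Proposes a local edit, conditioned on a program, its execution, and a target."""

    def __init__(self, prog_encoder, vis_encoder, edit_decoder):
        super().__init__()
        self.prog_encoder = prog_encoder  # encodes program tokens
        self.vis_encoder = vis_encoder    # encodes executed-state and target images
        self.edit_decoder = edit_decoder  # predicts (edit type, token location, parameters)

    def forward(self, prog_tokens, executed_img, target_img):
        # Condition on all three inputs: the program itself, what it currently
        # renders to, and what we would like it to render to.
        prog_emb = self.prog_encoder(prog_tokens)
        vis_emb = self.vis_encoder(torch.stack([executed_img, target_img], dim=1))
        # Decode a distribution over local edit operations.
        return self.edit_decoder(prog_emb, vis_emb)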


Jointly training edit and one-shot networks



We train edit networks without access to any ground-truth program annotations. To accomplish this, we propose an integration of our edit network with prior self-supervised bootstrapping approaches for one-shot VPI models (PLAD). As depicted above, during iterative finetuning rounds, we source paired training data for our edit network by first constructing pairs of start and end programs, and then using a domain-aware algorithm to find a set of edit operations that would bring about this transformation.
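
In pseudocode, the per-round data sourcing might look like the sketch below; perturb, find_edits, and render are hypothetical helpers standing in for the paper's pairing strategy, domain-aware edit-discovery algorithm, and program executor:

def source_edit_training_data(end_programs, perturb, find_edits, render):
    """Build supervised training examples for the edit network (a sketch)."""
    dataset = []
    for end_prog in end_programs:
        # Construct a (start, end) program pair; how start programs are chosen
        # is an assumption here, not the paper's exact procedure.
        start_prog = perturb(end_prog)
        # Domain-aware analysis: edits that transform start_prog into end_prog.
        for edit in find_edits(start_prog, end_prog):
            # Each edit becomes one example: given the start program and the
            # end program's rendering as the visual target, predict this edit.
            dataset.append((start_prog, render(end_prog), edit))
    return dataset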

This process jointly finetunes both our edit network and a one-shot network, and, as illustrated below, we use an integrated inference algorithm that leverages the strengths of both paradigms. The one-shot model produces rough estimates that are then refined by the edit network in an SMC-style (sequential Monte Carlo) algorithm. We find that this joint self-supervised learning setup forms a virtuous cycle: the one-shot model provides a good initialization for the edit network, and the edit network improves inner-loop inference, creating better bootstrapped training data for the one-shot model.
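
A minimal sketch of this integrated inference loop, assuming hypothetical one_shot_sample, propose_edit, apply_edit, and score helpers; the top-k survivor selection here is a simplification of a proper weighted SMC resampling step:

def integrated_inference(target, one_shot_sample, propose_edit, apply_edit,
                         score, num_particles=32, num_rounds=8):
    """Seed a particle population with one-shot samples, refine with edits."""
    # Initialization: rough program estimates from the one-shot network.
    particles = [one_shot_sample(target) for _ in range(num_particles)]
    for _ in range(num_rounds):
        # Propose one local edit per particle, conditioned on the target.
        candidates = [apply_edit(p, propose_edit(p, target)) for p in particles]
        # Resample: keep the programs whose executions best match the target.
        pool = particles + candidates
        pool.sort(key=lambda prog: score(prog, target), reverse=True)
        particles = pool[:num_particles]
    return particles[0]  # best program found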


Edit networks improve reconstruction performance



We experimentally compare integrating our edit network into this joint paradigm against using one-shot models alone. Controlling for inference time, we find that using the edit network improves reconstruction performance across multiple visual programming domains.



For 2D CSG, we compare reconstruction accuracy (Chamfer distance, lower is better, Y-axis) between using an edit network and using only a one-shot network while varying time spent on inference (left) and training set size (right). We find that the performance gap between these paradigms is more pronounced when more time is spent on program search or when less training data is available.



We evaluate reconstruction accuracy on challenge tasks drawn from concepts not present in the training set (e.g., snowmen). We compare against GPT-4V in a zero-shot setting (column 1), when an in-context example (ICE) is provided in the prompt (column 2), and when the one-shot model’s predicted program is provided as input (column 3). Our approach (column 5) finds more accurate reconstructions of these out-of-distribution targets (column 6) than using only the one-shot network (column 4) or any of the GPT-4V variants.

Local edit operations for visual programs



There are many ways to parameterize the space of program edits. We constrain the edit operations our network can produce by having it select from a set of local editing operations designed for visual programs. For instance, for functional visual programming DSLs with transformation and combination functions, we allow seven edit operations: modifying a transform’s parameters (MP), modifying a transform (MT), adding a transform (AT), removing a transform (RT), modifying a combinator (MC), removing a combinator (RC), or adding a combinator (AC). Each of these edit operations is applied to a program at a specific token location and results in a local change.
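
As a concrete illustration, this operation set could be represented as follows; the Python sketch is ours, not the released code:

from dataclasses import dataclass
from enum import Enum

class EditType(Enum):
    MP = "modify_transform_params"
    MT = "modify_transform"
    AT = "add_transform"
    RT = "remove_transform"
    MC = "modify_combinator"
    RC = "remove_combinator"
    AC = "add_combinator"

@dataclass
class EditOp:
    kind: EditType   # which of the seven local operations to apply
    token_idx: int   # token location in the program where the edit applies
    payload: tuple   # new function name and/or parameters, when needed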

Given an input program, how do we know which edit operations are helpful? If we have access not only to a visual target but also to its corresponding program, we can find a set of edit operations that would transform the input program into this target program. We follow this logic to source training data for our edit network: given a start program and an end program, we use a findEdits function to analytically identify a set of edit operations that would bring about this transformation. We can then convert this set of edit operations into a large set of (input, output) pairs that our network can train on in a supervised fashion.
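
A simplified sketch of what such a findEdits function might do, reusing the EditOp sketch above and assuming programs are expression trees; helpers like add_kind, remove_kind, modify_kind, zip_children, and child_loc are hypothetical, and the paper's domain-aware algorithm is more involved:

def find_edits(start, end):
    """Recursively diff two program trees into local edit operations (a sketch)."""
    edits = []

    def visit(a, b, loc):
        if a is None and b is not None:
            # Node only in the end program: add a transform or combinator.
            edits.append(EditOp(add_kind(b), loc, (b.fn, b.params)))
        elif a is not None and b is None:
            # Node only in the start program: remove a transform or combinator.
            edits.append(EditOp(remove_kind(a), loc, ()))
        elif a.fn != b.fn:
            # Same position, different function: modify transform or combinator.
            edits.append(EditOp(modify_kind(b), loc, (b.fn, b.params)))
        elif a.params != b.params:
            # Same function, different parameters: modify parameters (MP).
            edits.append(EditOp(EditType.MP, loc, (b.params,)))
        if a is not None and b is not None:
            for i, (ca, cb) in enumerate(zip_children(a, b)):
                visit(ca, cb, child_loc(loc, i))

    visit(start, end, loc=0)
    return edits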




Related Publications

PLAD: Learning to Infer Shape Programs with Pseudo-Labels and Approximate Distributions
R. Kenny Jones, Homer Walke, Daniel Ritchie
CVPR 2022 Paper | Project Page | Code

Improving Unsupervised Visual Program Inference with Code Rewriting Families
Aditya Ganeshan, R. Kenny Jones, Daniel Ritchie
ICCV 2023 (Oral) Paper | Project Page | Code

Neurosymbolic Models for Computer Graphics
Daniel Ritchie, Paul Guerrero, R. Kenny Jones, Niloy J. Mitra, Adriana Schulz, Karl D. D. Willis, and Jiajun Wu
Eurographics 2023 STAR Paper


Acknowledgements

We would like to thank the anonymous reviewers for their helpful suggestions. Renderings of 3D shapes were produced using the Blender Cycles renderer. This work was funded in part by NSF award #1941808 and a Brown University Presidential Fellowship. Daniel Ritchie is an advisor to Geopipe and owns equity in the company. Geopipe is a start-up developing 3D technology to build immersive virtual copies of the real world, with applications in various fields including games and architecture.