Learning to Infer Generative Template Programs for Visual Concepts

R. Kenny Jones1      Siddhartha Chaudhuri2      Daniel Ritchie1     
1Brown University      2Adobe Research     

Paper | Code

ICML 2024



Bibtex

@inproceedings{jones2024TemplatePrograms,
  title = {Learning to Infer Generative Template Programs for Visual Concepts},
  author = {Jones, R. Kenny and Chaudhuri, Siddhartha and Ritchie, Daniel},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2024}
}


Abstract

People grasp flexible visual concepts from a few examples. We explore a neurosymbolic system that learns how to infer programs that capture visual concepts in a domain-general fashion. We introduce Template Programs: programmatic expressions from a domain-specific language that specify structural and parametric patterns common to an input concept. Our framework supports multiple concept-related tasks, including few-shot generation and co-segmentation through parsing. We develop a learning paradigm that allows us to train networks that infer Template Programs directly from visual datasets that contain concept groupings. We run experiments across multiple visual domains: 2D layouts, Omniglot characters, and 3D shapes. We find that our method outperforms task-specific alternatives, and performs competitively against domain-specific approaches for the limited domains where they exist.




Template Programs



In this work, we aim to endow machines with the ability to learn flexible, general-purpose visual concepts. Our system consumes a group of inputs from some visual concept; each input group has elements of shared structure and elements of divergent structure. To represent such visual concepts, we introduce Template Programs. A Template Program is a partial program specification from a DSL. To capture shared attributes, Template Programs define a hierarchy of function calls and optionally specify relationships between parameter arguments (such as variable re-use or static assignment). To account for divergent attributes, Template Programs admit a distribution of complete program instantiations. Instantiations use the specified functions and relations, but are allowed to vary freely wherever the Template Program uses a special HOLE construct or leaves a parameter free. So, given an input group, our goal is to find a Template Program that admits instantiations corresponding to the group members.
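To make the idea concrete, here is a minimal Python sketch of a hypothetical Template Program for a toy layout DSL. The grammar, function names, and HOLE mechanics below are illustrative stand-ins for exposition, not the DSLs used in the paper.

import random

HOLE = object()  # marks a parameter the template leaves unspecified

# A hypothetical Template Program for a toy "chair layout" concept: the call
# hierarchy and shared variables are fixed; HOLEs are filled per instantiation.
template = {
    "fn": "layout",
    "children": [
        {"fn": "rect", "params": {"w": "W", "h": 0.1}},   # seat; width tied to shared variable W
        {"fn": "rect", "params": {"w": "W", "h": HOLE}},  # back; height left free
        {"fn": "repeat", "params": {"count": HOLE},       # legs; count left free
         "children": [{"fn": "rect", "params": {"w": 0.05, "h": 0.4}}]},
    ],
}

def instantiate(node, shared):
    """Complete one program: resolve shared variables, keep static assignments,
    and sample a value for every HOLE."""
    params = {}
    for key, val in node.get("params", {}).items():
        if val is HOLE:
            params[key] = random.choice([2, 3, 4]) if key == "count" else round(random.uniform(0.2, 0.8), 2)
        elif isinstance(val, str):   # shared variable, re-used across function calls
            params[key] = shared[val]
        else:                        # static assignment specified by the template
            params[key] = val
    children = [instantiate(c, shared) for c in node.get("children", [])]
    return {"fn": node["fn"], "params": params, "children": children}

# Two instantiations of the same concept: identical structure, divergent details.
shared_vars = {"W": round(random.uniform(0.5, 1.0), 2)}
print(instantiate(template, shared_vars))
print(instantiate(template, shared_vars))

In this toy example, all instantiations share the call hierarchy and the tied width W, while the back height and leg count vary freely, mirroring how a Template Program separates shared from divergent structure.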



We solve this difficult inverse prediction problem with a series of learned inference networks. First, we encode each visual input into a latent code. We then use a Transformer to autoregressively produce a Template Program conditioned on the set of visual codes. Following this, another Transformer-based module is given a single visual code along with the inferred Template Program, and from these it produces a full program instantiation that can be executed to form a visual output. By feeding in each visual latent code in turn, we find a set of instantiations that represents the members of the input group well.
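As a rough sketch of how such a pipeline could be wired up in PyTorch, the snippet below shows a set-conditioned autoregressive decoder over template tokens. The module names, sizes, and the flatten-and-project visual encoder are placeholder choices for illustration; they are not the released architecture.

import torch
import torch.nn as nn

class GroupToTemplate(nn.Module):
    """Encode each group member into a latent code, then autoregressively decode
    Template Program tokens conditioned on the set of visual codes."""
    def __init__(self, vocab_size=128, d_model=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_model))  # stand-in visual encoder
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, group_images, template_tokens):
        # group_images: (k, C, H, W); template_tokens: (1, T) partial template token sequence
        codes = self.encoder(group_images).unsqueeze(0)        # (1, k, d_model) memory set
        tgt = self.tok_emb(template_tokens)                    # (1, T, d_model)
        causal = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, codes, tgt_mask=causal)
        return self.head(hidden)                               # next-token logits

# Smoke test with dummy inputs: a group of five 64x64 images and 12 template tokens.
model = GroupToTemplate()
logits = model(torch.randn(5, 1, 64, 64), torch.randint(0, 128, (1, 12)))
# An analogous decoder, conditioned on a single visual code plus the inferred template,
# would produce a complete program instantiation for each group member in turn.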



Our method is quite flexible, and we explore its uses across multiple visual domains, including 2D primitive layouts, Omniglot stroke characters, and 3D shape structures. Solving this visual program induction problem through the shared structural intermediary of our Template Program allows us to parse members of input concepts in a consistent fashion. For each input concept (Inputs rows, above), we visualize the parse our method creates (Recons rows, above).



Once we have found a Template Program that captures an input visual concept, we can sample new instantiations from it to generate novel outputs that remain consistent with the examples from the input concept (Gens rows).



We develop an unsupervised learning methodology that allows us to train our inference networks starting from only an input DSL and a dataset that contains concept groupings, such as class labels. Our approach here is inspired by our past PLAD work on single-input visual program induction, and at a high level we break this learning task into two stages. In the first stage, we use the DSL to sample random programs that we convert into synthetic data with known input-output mappings. Then, in a second stage, we employ a bootstrapped finetuning procedure that specializes the inference networks towards a target dataset of interest. This process alternates between an inference step, where we run a search to find approximate solutions over tasks from the target dataset, and a training step, where we derive paired training data from these inferences.
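The schematic loop below sketches the shape of that two-stage recipe. All of the names here (sample_random_program, execute, StubModel, and its methods) are illustrative placeholders standing in for the paper's components, not its actual implementation.

import random

# --- Stage 1: pretrain on synthetic (visual, program) pairs sampled from the DSL ---
def sample_random_program():
    """Placeholder: draw a random program from the DSL grammar."""
    return [random.randrange(128) for _ in range(random.randrange(4, 16))]

def execute(program):
    """Placeholder: run the DSL executor to render a program into a visual output."""
    return {"rendered": program}

def pretrain(model, steps):
    for _ in range(steps):
        program = sample_random_program()
        model.train_step(execute(program), program)   # supervised: visual -> program tokens

# --- Stage 2: bootstrapped finetuning on the target dataset of concept groups ---
def finetune(model, target_groups, rounds):
    for _ in range(rounds):
        # Inference step: search for approximate (template, instantiations) explanations.
        pseudo_labels = [(group, model.search_for_template(group)) for group in target_groups]
        # Training step: treat the best inferences found so far as paired supervision.
        for group, solution in pseudo_labels:
            model.train_step(group, solution)

class StubModel:
    """Trivial stand-in so the loop runs end to end; a real model updates its networks."""
    def train_step(self, inputs, target): pass
    def search_for_template(self, group): return {"template": None, "instantiations": group}

model = StubModel()
pretrain(model, steps=3)
finetune(model, target_groups=[["x1", "x2"], ["y1", "y2"]], rounds=2)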



Experiments




Across multiple visual domains, we quantitatively evaluate few-shot generation and co-segmentation performance. Our method outperforms domain-general but task-specific alternatives, and is competitive with approaches that specialize for Omniglot.



We visualize few-shot generation results for Omniglot characters (above). While we again offer much-improved performance over the task-specific alternatives, we note that the methods that specialize for Omniglot typically demonstrate a wider range of output variability. We hypothesize this difference arises because BPL and GNS learn priors over human stroke patterns (i.e., how people typically produce characters). In contrast, our method finds a Template Program attuned to the visual data present in the input group, without regard for structured priors beyond the input DSL.



We designed the Layout domain so that we could evaluate the out-of-distribution generalization capabilities of different approaches. We visualize few-shot generations that different methods make for the Layout domain on concepts that become progressively more out-of-distribution (above). From left to right, we present example few-shot generations for an easy, a medium, and a hard concept. The easy concept (a side-facing chair) has a set of attributes that have all individually been seen in the training set, but presents them in a new combination. The medium concept (a crab) introduces a new attribute not seen in the training set: extended, vertical arms. The hard concept (a bookshelf) introduces a new meta-concept never seen in the training set. The second row of the figure shows the input prompt set; in the top row, we show the nearest neighbor in the training set to each image in the prompt, according to our reconstruction metric. The third row shows generations produced by the arVHE comparison condition, and the bottom row shows generations produced by our method. While arVHE does reasonably well on the easy case, it begins to generate nonsensical outputs as the input prompts become more out-of-distribution. Our approach, on the other hand, scales much better to out-of-distribution inputs, even though they don't match any images from the training set.



Our method natively supports co-analysis tasks. When we infer a Template Program and instantiations that explain an input visual group, we can use the shared structure of the Template Program to parse the group members in a consistent fashion. Above, we compare the co-segmentations of voxelized shapes (Input) produced by our method and by BAE-NET against ground-truth annotations (GT).
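A toy sketch of how a consistent parse turns into a co-segmentation, assuming each executed primitive records which template function produced it (the data layout here is made up purely for illustration):

# Because every instantiation of one Template Program shares the same call hierarchy,
# the template function that produced each primitive can serve as a part label that is
# consistent across the whole group.
def co_segment(instantiations):
    """instantiations: one program per group member, each a list of
    (template_function_id, primitive) pairs recorded during execution."""
    return [[fn_id for fn_id, _ in program] for program in instantiations]

group = [
    [(0, "seat"), (1, "back"), (2, "leg"), (2, "leg")],
    [(0, "seat"), (1, "back"), (2, "leg"), (2, "leg"), (2, "leg")],
]
print(co_segment(group))   # [[0, 1, 2, 2], [0, 1, 2, 2, 2]] -- same labels, varying counts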





Our Template Program framework is able to sample novel concepts unconditionally. To produce the visualizations above, we use the networks trained during the wake-sleep phase of our finetuning process. We first sample a Template Program, and then sample five program instantiations from it. Each bottom row shows the executed versions of these five samples; above each sample, we show its nearest-neighbor character in the training set according to our reconstruction metric.
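A minimal sketch of that sampling procedure, with the trained wake-sleep generator and executor replaced by trivial stand-ins (none of the names below correspond to the released code):

import random

# Unconditional generation: sample one template (a new concept), then draw several
# instantiations from it and execute each one. The "prior" and executor here are
# toy stand-ins, not the trained wake-sleep networks.
def sample_template():
    return {"n_strokes": random.randint(2, 5)}          # stand-in template prior

def sample_instantiation(template):
    return [round(random.uniform(0.0, 1.0), 2) for _ in range(template["n_strokes"])]

def execute(program):
    return program                                      # stand-in for rendering a character

template = sample_template()
concept_samples = [execute(sample_instantiation(template)) for _ in range(5)]
print(concept_samples)   # five variations that share the sampled concept's structure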



Related Publications

PLAD: Learning to Infer Shape Programs with Pseudo-Labels and Approximate Distributions
R. Kenny Jones, Homer Walke, Daniel Ritchie
CVPR 2022 Paper | Project Page | Code

Neurosymbolic Models for Computer Graphics
Daniel Ritchie, Paul Guerrero, R. Kenny Jones, Niloy J. Mitra, Adriana Schulz, Karl D. D. Willis, and Jiajun Wu
Eurographics 2023 STAR Paper



Acknowledgements

We would like to thank the anonymous reviewers for their helpful suggestions and all of our perceptual study participants for their time. Renderings of 3D shapes were produced using the Blender Cycles renderer. This work was funded in part by NSF award #1941808 and a Brown University Presidential Fellowship. Daniel Ritchie is an advisor to Geopipe and owns equity in the company. Geopipe is a start-up that is developing 3D technology to build immersive virtual copies of the real world with applications in various fields, including games and architecture. Part of this work was done while R. Kenny Jones was an intern at Adobe Research.