Fast Contextual Scene Graph Generation With Unbiased Context Augmentation
Tianlei Jin · Fangtai Guo · Qiwei Meng · Shiqiang Zhu · Xiangming Xi · Wen Wang · Zonghao Mu · Wei Song
West Building Exhibit Halls ABC 208
Scene graph generation (SGG) methods have historically suffered from long-tail bias and slow inference speed. In this paper, we observe that humans can analyze relationships between objects relying solely on context descriptions, and that this abstract cognitive process may be guided by experience. For example, given descriptions of a cup and a table with their spatial locations, humans can infer possible relationships such as < cup, on, table > or < table, near, cup >. Even without visual appearance information, some impossible predicates like "flying in" and "looking at" can be empirically excluded. Accordingly, we propose a contextual scene graph generation (C-SGG) method that uses no visual information, and we introduce a context augmentation method. We posit that slight perturbations in the position and size of objects do not essentially affect the relationship between them. Therefore, at the context level, we can produce diverse context descriptions by applying context augmentation to the original dataset. These diverse context descriptions can be used for unbiased training of C-SGG to alleviate long-tail bias. In addition, we introduce a context-guided visual scene graph generation (CV-SGG) method, which leverages the C-SGG experience to guide vision to focus on possible predicates. Extensive experiments on a publicly available dataset show that C-SGG alleviates long-tail bias and omits the heavy computation of visual feature extraction, enabling real-time SGG. CV-SGG achieves a good trade-off between common predicates and tail predicates.
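The context augmentation idea, that small changes in object position and size leave the relationship label unchanged, can be illustrated with a minimal sketch. This is not the authors' implementation: the `ObjectContext` record, the `perturb` and `augment_triplet` helpers, the (cx, cy, w, h) box format, and the 5% jitter bounds are all assumptions made here for illustration.

```python
import random
from typing import NamedTuple


class ObjectContext(NamedTuple):
    """Hypothetical context description: class label plus box (cx, cy, w, h)."""
    label: str
    cx: float
    cy: float
    w: float
    h: float


def perturb(obj, rng, max_shift=0.05, max_scale=0.05):
    """Return a copy of obj with slightly jittered position and size.

    The shift is drawn relative to the box size and the scale relative to 1.0;
    both bounds are assumed values, not taken from the paper.
    """
    dx = rng.uniform(-max_shift, max_shift) * obj.w
    dy = rng.uniform(-max_shift, max_shift) * obj.h
    sw = 1.0 + rng.uniform(-max_scale, max_scale)
    sh = 1.0 + rng.uniform(-max_scale, max_scale)
    return ObjectContext(obj.label, obj.cx + dx, obj.cy + dy, obj.w * sw, obj.h * sh)


def augment_triplet(subj, predicate, obj, n, seed=0):
    """Produce n perturbed copies of a <subject, predicate, object> triplet.

    The predicate is kept unchanged, reflecting the assumption that slight
    perturbations do not alter the relationship.
    """
    rng = random.Random(seed)
    return [(perturb(subj, rng), predicate, perturb(obj, rng)) for _ in range(n)]


cup = ObjectContext("cup", 100.0, 80.0, 20.0, 30.0)
table = ObjectContext("table", 110.0, 140.0, 200.0, 100.0)
samples = augment_triplet(cup, "on", table, n=5)
```

Generating more perturbed copies for rare predicates than for frequent ones would be one way to rebalance the training distribution, though the sampling scheme actually used is not described in this abstract.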