Title: Visual-Textual Joint Relevance Learning for Tag-based Social Image Search, IEEE Transactions on Image Processing, 2013.
Motivation: With the development of social media, a new type of search has become increasingly popular - social search - and this paper focuses on image search in social media. Given a tag query, for example "apple", a good search engine should return a set of images that are highly relevant yet diverse, e.g., showing apple fruit, cellphones, and MacBooks. Tag-based social image search typically leverages user-generated tags to compute each image's relevance score; however, such tags are noisy, which makes it difficult to form an optimal ranking strategy. Therefore, this paper simultaneously utilizes tags and visual information for image relevance learning.
Fig 1. Framework of the proposed visual-textual joint relevance learning method.
Method: The basic framework is shown in Fig 1. (I) Given a set of images, each image is represented by two kinds of features - visual features and textual features. (II) Based on these two types of representations, a hypergraph is constructed. It should be highlighted that a hypergraph differs from an ordinary graph: its edges, called "hyperedges", do not represent pairwise interactions but instead connect a whole set of images. Specifically, images sharing the same tag can be linked by a textual-based hyperedge, and images sharing the same visual "word" can be connected by a visual-based hyperedge. Fig 2 shows textual-based hyperedges (left) and visual-based hyperedges (right). (III) A joint image relevance learning process is performed on a set of pseudo-relevant samples, i.e., images collected by tag matching and treated as pseudo-labeled relevant examples. The authors propose an objective function that learns a relevance vector f, with each element indicating an image's relevance score. (IV) Based on the learned relevance scores, the algorithm returns the top-K images to the user.
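The paper's exact formulation is not reproduced here; as a sketch, hypergraph-based relevance learning of this kind is commonly written with the normalized hypergraph regularizer below, where H is the incidence matrix, W = diag(w) holds the hyperedge weights, D_v and D_e are the vertex and hyperedge degree matrices, y encodes the pseudo-relevant labels, and lambda, mu are assumed trade-off parameters:

```latex
\Theta = D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2}, \qquad \Delta = I - \Theta
\min_{f,\,w}\; f^{\top} \Delta f \;+\; \lambda \lVert f - y \rVert^{2} \;+\; \mu \lVert w \rVert^{2}
\quad \text{s.t.}\ \textstyle\sum_{i} w_{i} = 1
% For fixed w, setting the gradient w.r.t. f to zero gives the closed form
% f^{*} = \lambda \bigl( (1+\lambda) I - \Theta \bigr)^{-1} y .
```

The first term encourages images connected by strongly weighted hyperedges to receive similar scores, the second term keeps f close to the pseudo-relevant labels, and the weight regularizer with the sum-to-one constraint lets informative hyperedges be emphasized.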
Fig 2. Examples of hyperedge construction. The left figure shows textual-based hyperedges and the right one shows visual-based hyperedges.
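To make the hyperedge construction and relevance propagation concrete, here is a minimal numpy sketch following the formulation above. The toy tag/visual-word ids, the equal initial hyperedge weights, and the value of lam are illustrative assumptions; the paper additionally learns the hyperedge weights w (this is what distinguishes HG-WE from the equal-weight HG baseline), which this sketch keeps fixed for simplicity.

```python
import numpy as np

def build_incidence(memberships, vocab_size):
    """Incidence matrix H: H[v, e] = 1 if image v belongs to hyperedge e.
    Each tag (or visual word) defines one hyperedge grouping all images
    that contain it."""
    H = np.zeros((len(memberships), vocab_size))
    for v, edge_ids in enumerate(memberships):
        for e in edge_ids:
            H[v, e] = 1.0
    return H

# Toy data (hypothetical): per-image tag ids and visual-word ids.
tag_ids  = [[0, 1], [0, 2], [1, 2], [2]]   # 3 textual hyperedges
word_ids = [[0], [0, 1], [1], [1]]         # 2 visual hyperedges

# Joint hypergraph: textual and visual hyperedges side by side.
H = np.hstack([build_incidence(tag_ids, 3), build_incidence(word_ids, 2)])
n_images, n_edges = H.shape

w  = np.full(n_edges, 1.0 / n_edges)       # hyperedge weights (fixed here, learned in the paper)
Dv = H @ w                                 # vertex degrees d(v) = sum_e w(e) h(v, e)
De = H.sum(axis=0)                         # hyperedge degrees delta(e)

# Normalized adjacency Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}
Dv_inv_sqrt = np.diag(1.0 / np.sqrt(Dv))
Theta = Dv_inv_sqrt @ H @ np.diag(w / De) @ H.T @ Dv_inv_sqrt

# Pseudo-relevant labels: images initially retrieved by the query tag.
y   = np.array([1.0, 1.0, 0.0, 0.0])
lam = 10.0                                 # smoothness vs. label-fit trade-off (assumed value)

# Closed-form relevance scores for fixed weights:
# f = lambda * ((1 + lambda) * I - Theta)^{-1} y
f = lam * np.linalg.solve((1 + lam) * np.eye(n_images) - Theta, y)

top_k = np.argsort(-f)[:3]                 # step (IV): return the top-K images
print(f, top_k)
```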
Experiment: They perform experiments on a Flickr image dataset (104,000 images, 83,999 tags) and compare the proposed visual-textual joint hypergraph learning approach (HG-WE-joint) with five state-of-the-art baselines: graph-based semi-supervised learning, sequential social image relevance learning, tag ranking, tag relevance combination, and hypergraph-based relevance learning with equal weights (HG). They also examine the performance of the proposed approach when using only a single modality, HG-WE-visual or HG-WE-textual. Results show that HG-WE-joint outperforms all baselines and is robust to parameter settings. However, HG-WE-joint incurs the highest computational cost among the compared methods to achieve the best retrieval performance.