
A Content-Based Image Retrieval System Based on Hadoop and Lucene


Chunhao GU

Suzhou Branch Office

Oracle (China) Software Systems Co., Ltd.

Suzhou, China

guch2010@

Yang GAO

Department of Computer Science & Technology

Nanjing University

Nanjing, China

gaoy@

Abstract—This paper introduces a content-based image retrieval (CBIR) system built on Hadoop and Lucene. Hadoop is open-source software with powerful parallelization and scalability that has in recent years become a popular technique for processing and storing big data. We design and implement the system to overcome the performance bottlenecks caused by computational complexity and large data volumes when constructing a CBIR system. We present our ideas, the system's design, and its results in this paper.

Keywords: CBIR; Hadoop; HBase; Lucene

I. INTRODUCTION

Recently, with the development of the Internet and multimedia techniques, the number of digital images worldwide has been increasing at an unprecedented speed. Retrieving and managing these distributed images has therefore become an important research issue.

Because of the differences between text data and image data, e.g. an image's texture is hard to describe in text, traditional text-based image retrieval systems have not been able to satisfy all requests. Many companies and organizations have therefore begun to develop content-based image retrieval (CBIR) systems.

Hadoop is open-source software under the Apache Foundation [6, 18]. Owing to its powerful parallelization and scalability, Hadoop has gradually become a popular technique for processing and storing big data. Yahoo!, eBay, Facebook and Baidu have applied Hadoop in many products, such as data mining, search, recommendation, etc.

For a long time, the heavy computation caused by algorithmic complexity and large data volumes during storage, indexing, etc. has been a bottleneck in constructing a CBIR system. We therefore study, design and implement a system based on Hadoop to explore a solution to this problem.

This paper is organized as follows: we discuss previous related works in Section II; we introduce the design of the system's architecture and modules in Section III; in Section IV, we demonstrate the system's prototype.

II. PREVIOUS RELATED WORKS

A. CBIR

CBIR techniques date from the 1990s [2]. Since then, many well-known CBIR systems have appeared, such as QBIC [4], Virage [1], MARS [15], TinEye [10], etc. For all CBIR systems, algorithms for image feature extraction, methods of similarity measurement between images, and methods of relevance feedback are the most basic issues.

1) Feature Extraction Algorithms

Low-level feature extraction algorithms are very important in CBIR research. These algorithms extract color, texture, shape and spatial features from an image into feature vectors. Commonly used algorithms for color features are the color histogram, color layout, color correlogram, etc. [13]; for texture features, the Tamura texture, Gabor filter, etc. [17].
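As a concrete illustration of the color-histogram idea (a generic sketch, not the system's actual implementation), each RGB channel can be quantized into a few bins and the pixel counts normalized into a feature vector; the bin count and pixel format below are assumptions:

```python
def color_histogram(pixels, bins_per_channel=4):
    """Quantize each RGB channel into bins_per_channel bins, count pixels
    per (r_bin, g_bin, b_bin) cell, and normalize to a feature vector."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1
    total = len(pixels) or 1
    return [h / total for h in hist]

# Tiny synthetic "image": two reddish pixels and one blue pixel.
feat = color_histogram([(255, 0, 0), (250, 5, 3), (0, 0, 255)])
```

With 4 bins per channel the vector has 64 dimensions; real systems typically use finer quantization or perceptual color spaces such as HSV.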

2) Similarity Measure

The similarity between two images is measured by the distance between their feature vectors. Commonly used methods are Euclidean distance, histogram intersection, Mahalanobis distance, etc.
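The first two of these measures can be sketched in a few lines; note that Euclidean distance is smaller for more similar vectors, while histogram intersection (for normalized histograms) is larger:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors: smaller = more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def histogram_intersection(u, v):
    """Sum of bin-wise minima: for normalized histograms, 1.0 = identical."""
    return sum(min(a, b) for a, b in zip(u, v))

d = euclidean([1.0, 0.0], [0.0, 1.0])
s = histogram_intersection([0.5, 0.5], [0.5, 0.5])
```

Mahalanobis distance additionally requires the covariance matrix of the feature distribution, so it is omitted from this sketch.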

3) Relevance Feedback

Relevance feedback is a human-computer interaction technique for improving a system's performance. It was first introduced in MARS [15]. Commonly used methods are naïve Bayes, SVM, dynamic adjustment of the weights of feature vectors, etc.
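The weight-adjustment idea can be sketched as follows. The inverse-variance heuristic shown here is one common choice (dimensions that are consistent across the images a user marks relevant get higher weight), not necessarily the exact method used in MARS:

```python
def adjust_weights(positive_feats, eps=1e-6):
    """Re-weight feature dimensions by the inverse variance observed over
    the user's positive (relevant) examples, then normalize the weights."""
    n = len(positive_feats)
    dims = len(positive_feats[0])
    weights = []
    for d in range(dims):
        vals = [f[d] for f in positive_feats]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        weights.append(1.0 / (var + eps))  # low variance -> high weight
    total = sum(weights)
    return [w / total for w in weights]

# Dimension 0 is stable across relevant images; dimension 1 fluctuates.
weights = adjust_weights([[0.5, 0.1], [0.5, 0.9]])
```

The resulting weights would then scale each dimension in the distance computation on the next query round.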

B. Hadoop

Hadoop consists of two components, HDFS and MapReduce, which are open-source implementations of Google's GFS and MapReduce, respectively [18].

HDFS is a distributed file system for big data storage with outstanding scalability and fault tolerance [16].

MapReduce is a distributed framework for data processing, especially of big data [3]. A MapReduce job consists of two steps, Map and Reduce. Splits of data are input to the Map step, which outputs intermediate key-value pairs. Lists of these pairs, grouped by common key, are then input to the Reduce step, which outputs the final key-value pairs. On Hadoop, a programmer writes a custom MapReduce program and submits it to the cluster as a job to run in parallel.
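The key-value flow described above can be simulated in plain Python (this illustrates only the data flow, not Hadoop's actual Java API or its distributed execution):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(splits, mapper):
    """Apply the mapper to each input split, collecting intermediate pairs."""
    pairs = []
    for split in splits:
        pairs.extend(mapper(split))
    return pairs

def reduce_phase(pairs, reducer):
    """Shuffle/sort: group intermediate pairs by key, then reduce each group."""
    pairs.sort(key=itemgetter(0))
    return [reducer(key, [v for _, v in grp])
            for key, grp in groupby(pairs, key=itemgetter(0))]

# Classic word count: the mapper emits (word, 1); the reducer sums per word.
splits = ["big data", "data processing"]
pairs = map_phase(splits, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(pairs, lambda k, vs: (k, sum(vs)))
```

In a real Hadoop job, the splits come from HDFS blocks and the shuffle/sort between the two phases is handled by the framework across machines.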

HBase is a NoSQL database built on Hadoop [5, 7]. A record in an HBase table consists of a unique row key and some column families. Every column family has one or more columns defined by qualifiers.
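The row-key / column-family / qualifier layout can be pictured as a nested map; the family and qualifier names below are hypothetical, chosen only to suggest how an image record might be laid out:

```python
# HBase data model sketch: row key -> {column family: {qualifier: value}}.
image_row = {
    "img_00001": {
        "info": {"filename": "cat.jpg", "width": "640", "height": "480"},
        "feature": {"color_hist": "0.12,0.03,0.85", "texture": "0.4,0.6"},
    }
}

# Cells are addressed by (row key, family, qualifier), as in HBase's Get API.
filename = image_row["img_00001"]["info"]["filename"]
```

In HBase itself, values are stored as byte arrays and each cell is also versioned by timestamp, which this sketch omits.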

2012 Second International Conference on Cloud and Green Computing

978-0-7695-4864-7/12 $26.00 © 2012 IEEE DOI 10.1109/CGC.2012.33

