
Implementation and Performance Evaluation of a Hybrid Distributed System for Storing and Processing Images from the Web

Murali Krishna (Amazon, Bangalore, India), Balaji Kannan (Yahoo!, Bangalore, India), Anand Ramani (Yahoo!, Bangalore, India), Sriram J Sathish (Yahoo!, Bangalore, India)


Abstract — Multimedia applications have undergone such tremendous changes in the recent past that they now call for a scalable and reliable processing and storage framework. Image processing algorithms such as pornographic content detection become far more challenging in terms of accuracy, recall, and speed when run on billions of images. This paper presents the design and implementation of a hybrid distributed architecture that uses the Hadoop Distributed File System for storage and the Map/Reduce paradigm for processing images crawled from the web. The architecture draws on the power of the Hadoop framework where tasks can be parallelized as Map/Reduce jobs, and uses standalone crawler nodes to fetch relevant content from the web. Evaluations on real-world web data indicate that the system can store and process billions of images in a few hours.

Keywords: Distributed Multimedia; Hadoop; HDFS; Map/Reduce.

I. INTRODUCTION

Over the past decade, the capabilities and usage of multimedia systems and applications have increased manifold. Multimedia finds application in areas such as advertising, education, engineering, and medicine. Image and video content consumption on the Internet is increasing at a rapid rate, and there is a proliferation of image and video sharing sites (Flickr, YouTube, etc.) and image and video search engines (Yahoo!, Google, etc.). The number of images on Flickr has recently exceeded 5 billion [3], and the Internet today is estimated to contain about 100 billion images.

There is a wide range in the size and quality of images on the web. The nature of image content ranges from professionally created to user uploaded, and from very low to very high resolutions. Since the processing capabilities offered by a single node are limited, storing and processing a large volume of multimedia content calls for a distributed setup. Associated with any distributed system are problems such as reliability and how efficiently the platform can scale. Building and maintaining such a large-scale multimedia system is heavily constrained by high setup and maintenance costs.

A large-scale multimedia content system typically fetches media content from the web, processes it, and stores it. Fetching the content is similar to a web crawl, where we discover links, schedule crawls, honor politeness constraints, and so on. It is well known that crawling all the documents on the web is a daunting task. Further, not all documents are crawlable, whether intentionally or unintentionally. The performance of the crawl is also constrained by the network bandwidth between the crawler and the destination host being crawled. Sophisticated crawlers follow a number of crawling policies to gain the maximum benefit from the crawl operation.

Despite these constraints, crawlers can easily crawl a few billion images from the web. Storing and processing such a large number of images and videos poses many scalability and reliability issues. The system is also responsible for running a number of multimedia processing algorithms, such as offensive content detection, face recognition, various filters, attribute extraction, and fingerprint generation, on Internet-scale data.

A typical distributed approach would follow a technique such as consistent hashing [4], where the multimedia documents are distributed among the nodes of the cluster by hashing the image URLs. However, this method does not address the storage issues: it becomes difficult to adhere to locality of reference, whereby the data is stored close to the node where it is processed. Further, when the architecture is scaled by adding more machines, a complete redistribution of the entire multimedia data set may be required.
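For concreteness, the baseline approach described above can be sketched as a minimal consistent-hash ring in Java that maps image URLs to processing nodes. This is our illustration of the generic technique, not code from any system in this paper; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

/** Minimal consistent-hash ring: image URLs are mapped to processing nodes. */
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(node + "#" + i));
    }

    /** Returns the node responsible for the given image URL. */
    public String nodeFor(String imageUrl) {
        if (ring.isEmpty()) throw new IllegalStateException("no nodes");
        SortedMap<Long, String> tail = ring.tailMap(hash(imageUrl));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            // Use the first 8 bytes of the MD5 digest as the position on the ring.
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
```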

This paper presents a framework built using Hadoop to store and process images from the web. The framework is applicable wherever there is a need to fetch, process, and store large volumes of multimedia content. Though there are a number of open-source platforms that offer storage and processing capabilities at web scale, this architecture is beneficial when the fetching service needs to be decoupled from the processing service. An illustration of how pornographic content can be detected in the images is also given.

The rest of the paper is organized as follows. Section II briefly discusses related work. Section III presents the architecture overview of our system. Section IV provides the implementation details and Hadoop system tuning. Section V presents the experimental results and their analysis. Section VI summarizes our work.



II. RELATED WORK

The authors of Nutch [5], an open-source web search engine, have proposed a scheme that uses Hadoop as the platform for storing and processing data. The system offers a Map/Reduce-based solution for fetching web content and has its own indexer for creating the inverted index of the documents. However, Nutch is not well tuned for multimedia content systems, where one has to apply filters such as image sharpening, offensive content detection, etc.

There are a number of vertical specific distributed crawlers such as Cobra [6], a blog aggregator and content-based filtering system. Cobra uses a three-tiered network of crawlers that scan web feeds, filters that match crawled articles to user subscriptions, and reflectors that provide users with an RSS feed containing search results.

Amazon’s Dynamo [7] is a highly available key-value storage system. It has the properties of both databases and distributed hash tables. It is designed to meet strict SLA requirements and also supports versioning of the objects it stores. There are quite a few implementations of the distributed hash table (DHT) concept that offer high scalability and availability. DHTs use hash functions such as SHA-1 and MD5 to store and locate objects in the system. The content is also replicated on a set of nodes to handle failures.

In this paper, we discuss the design of a hybrid architecture for multimedia content storage and processing based on the Hadoop framework. The architecture has been designed and implemented for handling a large number of multimedia objects collected from the web. It utilizes both the processing power and the storage capability of the Hadoop framework so that the processing system is not physically separated from the storage. The system also parallelizes the content fetching effort and the actual data processing pipeline. Finally, the architecture supports logical grouping of multimedia content based on the SLAs under which the content must be processed.

III. ARCHITECTURE

Figure 1 shows an illustration of the distributed image processing architecture, depicting its three major components: the content fetcher, the storage framework, and the data processors. The architecture uses the Hadoop Distributed File System (HDFS) for storage and the Map/Reduce (M/R) framework for processing. The content serving components are beyond the scope of this paper and are not discussed here.

The input to the whole system is a list of image URLs and the contextual information, i.e., the metadata of each image. The entire system is modeled for efficiently processing incremental data. Any incoming feed is compared against the existing dataset on HDFS to generate the incremental data. Changes to the metadata of already crawled images are updated directly on HDFS, and any new URLs are scheduled for crawling by the content fetcher module.

Figure 1. An illustration of Distributed Image Processing Architecture

The content fetcher module runs as a daemon and periodically downloads the URL lists from HDFS onto the local file system. The URL data is then used to fetch the images from the web. After fetching each image binary, the module also creates thumbnails and generates a unique key per image. The crawled binary data, along with the thumbnail data and the unique key, is then uploaded back to HDFS on a periodic basis. A merge Map/Reduce job is then triggered to merge the metadata associated with an image with the crawler-generated image data. The resultant complete image document is written back to the HDFS store.
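A minimal sketch, using only Java SE utilities, of the two fetcher-side steps mentioned above: deriving a unique content key from the downloaded image bytes and producing a thumbnail. The paper does not specify the hash function or the thumbnailing method; SHA-1 and a simple JPEG rescale are our assumptions.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Helpers a fetcher node might use after downloading an image binary. */
public class FetchedImageUtil {

    /** Content-based key: hex-encoded digest of the image bytes (assumed; the paper
     *  only says a unique key is generated per image). */
    public static String contentKey(byte[] imageBytes) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(imageBytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    /** Fixed-width thumbnail, JPEG-encoded. */
    public static byte[] thumbnail(byte[] imageBytes, int width) throws IOException {
        BufferedImage src = javax.imageio.ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (src == null) throw new IOException("unsupported image format");
        int height = Math.max(1, src.getHeight() * width / src.getWidth());
        BufferedImage dst = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.drawImage(src, 0, 0, width, height, null);   // simple scale; real pipelines use better filters
        g.dispose();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        javax.imageio.ImageIO.write(dst, "jpg", out);
        return out.toByteArray();
    }
}
```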

The data is partitioned in storage based on its access patterns to minimize the amount of data read. For instance, the binary image data, the image attributes, and the surrounding text are kept in separate folders in HDFS with a common field between them, which is used during join operations to produce the desired views. This ensures that image-processing algorithms can access only the image data, while jobs that operate only on the metadata, such as duplicate removal, can use the text data independently.

The entire system itself is partitioned according to the SLA requirements associated with the multimedia content. Based on the frequency of data refresh and the size of the data, the system is logically partitioned into multiple clusters in HDFS by allocating a root-level folder for each cluster. This ensures that we do not have to process all the data all the time. With this provision we were able to test our architecture for four different SLA requirements: real-time, daily, weekly, and monthly content updates.


IV. IMPLEMENTATION DETAILS

A. Map/Reduce on Hadoop

Internet-scale image processing is a huge computational challenge. An unsatisfactory solution to processing billions of images is to restrict the breadth and scope of the underlying algorithms and to use only the computationally simplest techniques. Some of our content processing algorithms are multimodal in nature, i.e., they rely on both content information and textual information and combine them in a natural way to achieve their performance. The multimodal nature of these algorithms brings its own challenges. A natural solution to handle the scale and complexity involved in processing the images is to perform distributed computing over a large cluster of machines.

The Map/Reduce programming abstraction was introduced and implemented by Google for the distributed processing of large data sets on clusters of machines. Map/Reduce is very efficient when the input data can be split into multiple independent chunks. The individual chunks of data are processed in parallel on the cluster nodes. A Map/Reduce task has three phases:

Map phase: The data is split into chunks and processed in parallel on multiple nodes by map tasks. The output for each data point is a (key, value) tuple. The key is typically a unique identifier for the processed data and the value is the result of processing.

Sort (or shuffle) phase: The key-value pairs output by the map tasks are sorted by key and distributed, based on the key, to the nodes on which reduce tasks are running. Two key-value pairs with the same key, even if they come from different map tasks, are always sent to the same reduce node.

Reduce phase: The reduce phase aggregates the key-value pairs by key into tuples like (key1, value11, value12, ...) and (key2, value21, value22, ...) for further processing.
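The skeleton below illustrates these three phases with Hadoop's Java API: a mapper emits (key, value) pairs and a reducer sees all values for a key after the shuffle. The record format (URL, tab, metadata) and the class names are illustrative, not taken from the paper.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Minimal Hadoop job skeleton illustrating the three phases described above. */
public class GroupByUrlExample {

    /** Map phase: parse one input record and emit (key, value). */
    public static class UrlMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            if (parts.length == 2) {
                ctx.write(new Text(parts[0]), new Text(parts[1]));  // key = URL, value = metadata
            }
        }
    }

    /** Reduce phase: all values for one key arrive together after the shuffle/sort. */
    public static class UrlReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text url, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder merged = new StringBuilder();
            for (Text v : values) {
                if (merged.length() > 0) merged.append('|');
                merged.append(v.toString());
            }
            ctx.write(url, new Text(merged.toString()));  // (key, aggregated values)
        }
    }
}
```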

This architecture uses Hadoop, an open source Map/Reduce implementation, for storing and processing images. Content processing and textual processing are performed on Hadoop independently as two separate Map/Reduce jobs. There is a final Map/Reduce job to combine the outputs from the two previous jobs and construct the final image document. We have experimentally verified that with a cluster of 200 computational units, it takes about 11 hours to process a corpus of 1 billion images.

B. Data Flow and Input Parsing

Figure 2 illustrates the flow of multimedia content in the system. The input to the system is a set of image URLs and their contextual information. The scale is huge: a few billion documents. This paper does not address the mechanisms of obtaining these seed URLs.

Figure 2. Data Flow Diagram

This system takes input of the format 'URL, contextual data (CD)', where URL is the location of the image on the web and the contextual data includes multiple key-value pairs such as the page URL, terms around the image, the title, anchor text, etc. This tuple has the following representation:

{ImageURL, pageurl, termvector, title, anchortext}

These terabytes of raw data are uploaded to HDFS. The feeds are then parsed and mapped to a common document model on which the rest of the components work. During the reduce phase, the generated data is compared with what is already present in the full storage set. The reducer asserts whether the incoming data is new, modified, or deleted based on the business logic. The parsers are modeled as RecordReaders of the mapper job, and the framework supports pluggable parsers. The Map/Reduce model for this job is shown below:

Input:
  feed, storage
Mapper:
  Emits: {NEW_URL, NEW_CD}
         {OLD_URL, OLD_CD}
Reducer:
  Emits: URL, {NEW_CD | MODIFIED_CD | DELETED_CD}
Output:
  NEW_URLS (to be fetched by the crawler)
  DELTA_STORAGE (temporary storage for NEW_CD)
  storage/CD (updated with MODIFIED_CD)
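A hedged sketch of how such a diff reducer could look with Hadoop's Java API. The "FEED:"/"STORE:" value tags and the single text output are our simplifications; the actual job also routes results to separate folders (NEW_URLS, DELTA_STORAGE, storage/CD), which would typically be done with something like MultipleOutputs.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Sketch of the feed-diff reduce step. Values are assumed to be tagged by the
 *  mapper with "FEED:" (from the incoming feed) or "STORE:" (from existing storage);
 *  that tagging convention is ours, not the paper's. */
public class FeedDiffReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        String feedCd = null, storedCd = null;
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("FEED:"))  feedCd   = s.substring(5);
            if (s.startsWith("STORE:")) storedCd = s.substring(6);
        }
        if (feedCd != null && storedCd == null) {
            ctx.write(url, new Text("NEW\t" + feedCd));          // schedule for crawling
        } else if (feedCd == null && storedCd != null) {
            ctx.write(url, new Text("DELETED"));                 // no longer in the feed
        } else if (feedCd != null && !feedCd.equals(storedCd)) {
            ctx.write(url, new Text("MODIFIED\t" + feedCd));     // metadata changed
        }                                                        // unchanged URLs emit nothing
    }
}
```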


C. Image Fetching Service

The NEW_URLS folder contains the list of new URLs that need to be downloaded from the web. This stage is the most expensive part of the content processing pipeline, as it needs a lot of bandwidth to fetch the images and involves external resources and constraints. There is also a need to adhere to the typical politeness of a web crawler, honor robots.txt, and so on. Further, we need to keep refreshing the data periodically to detect missing or moved images (e.g., HTTP status 404). This means that the crawlers should run continuously in the background and should not be stopped and restarted to make way for other processes in the pipeline. Therefore, the fetching operation is not distributed across boxes using Map/Reduce, as it is in Nutch, to prevent the task slots from being occupied forever. Our fetching service has been designed to be distributed in nature without using the Map/Reduce framework. The /NEW_URLS folder contains one subfolder per crawler node, and each folder is mapped to a crawler node. The crawler machines (which are part of the Hadoop cluster) retrieve the list of URLs they need to crawl by downloading the set of files assigned to them from HDFS onto the local file system. The URLs are then crawled from the web, thumbnails are generated, and the binaries are uploaded back to HDFS.
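A sketch of the crawler-side synchronization using Hadoop's FileSystem API, under the assumption that each node is identified by an id and that the per-node subfolders are named crawler-<id>; only the folder names /NEW_URLS and CRAWL_DELTA come from the paper, the rest is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** What one crawler node might do to pick up its work and publish results. */
public class CrawlerNodeSync {
    public static void main(String[] args) throws Exception {
        String nodeId = args[0];                       // e.g. "7"
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem hdfs = FileSystem.get(conf);

        // 1. Download the URL lists assigned to this node onto the local disk.
        Path assigned = new Path("/NEW_URLS/crawler-" + nodeId);
        for (FileStatus f : hdfs.listStatus(assigned)) {
            hdfs.copyToLocalFile(f.getPath(), new Path("/tmp/urls/" + f.getPath().getName()));
        }

        // ... fetch images, generate thumbnails and content keys locally ...

        // 2. Periodically upload the crawled binaries and attributes back to HDFS.
        hdfs.copyFromLocalFile(new Path("/tmp/crawled/batch-0001.seq"),
                               new Path("/CRAWL_DELTA/crawler-" + nodeId + "/batch-0001.seq"));
    }
}
```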

The implementation uses the Mercator framework [8] to fetch the content, and it is modeled as a typical web crawler that follows standard crawl criteria. Since the input data is in HDFS, the source of data for the crawlers is reliable. Frequent checkpoints are created to enable redistribution of uncrawled data to other nodes in case of a node failure. The crawler cluster is designed to be stateless and is agnostic to node failures. This is an advantage over modeling the crawl process as a Map/Reduce job, where the entire crawl operation could be aborted if some nodes fail. The fetching service outputs the following details:

{ImageURL, image binary, image attributes}

This data is periodically uploaded to the CRAWL_DELTA folder in HDFS by all the crawler machines.

D. Data Merge and Duplicate Removal

The data merge job is solely responsible for merging the text attributes and image attributes in order to provide a single document view of the multimedia content.

Input:
  CRAWL_DELTA (generated by crawlers)
  DELTA_STORAGE (temporary storage for NEW_CD generated by the feed diff)
  storage/image (old storage holding old data)
Mapper:
  Emits: URL, {image binary, image attributes}
         URL, NEW_CD
         URL, {old image binary, old image attributes} (in case of a refresh)
Reducer:
  Emits: URL, {NEW_CD, image binary, image attributes}
Output:
  storage/image (merged with new image data, keyed by hash of the content)
  storage/CD (merged with new CD, keyed by URL)
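One possible shape for the merge reducer, sketched with Hadoop's MultipleOutputs so that the image view and the contextual-data view land in separate outputs. The "IMG:"/"CD:" value tags, the record layout, and the named outputs (which would have to be registered in the job driver with MultipleOutputs.addNamedOutput) are our assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

/** Sketch of the merge reducer: per URL, join the crawled image record with the
 *  contextual data and write the two views to separate named outputs. */
public class MergeReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context ctx) {
        out = new MultipleOutputs<>(ctx);
    }

    @Override
    protected void reduce(Text url, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        String imageRecord = null, contextualData = null;
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("IMG:")) imageRecord = s.substring(4);   // contentHash \t attributes
            if (s.startsWith("CD:"))  contextualData = s.substring(3);
        }
        if (imageRecord != null) {
            String contentHash = imageRecord.split("\t", 2)[0];
            out.write("image", new Text(contentHash), new Text(imageRecord));  // keyed by content hash
        }
        if (contextualData != null) {
            out.write("cd", url, new Text(contextualData));                    // keyed by URL
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        out.close();
    }
}
```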

Duplicate detection helps in storing only one copy of each thumbnail in the system. Since all the crawled data from the crawlers is uploaded back to HDFS, a duplicate-removal M/R job can scan through the files and emit only unique elements. Experiments show a 20% saving in storage space due to the duplicate detection operation. Another form of duplication arises when multiple sites link to the same image URL. These are identified as duplicates at parse time, so that the content fetcher service has to download each image only once. We have observed a 3:1 ratio between the number of pages that reference an image and the number of distinct image URLs. The duplicate-removal job is modeled as follows:

Input:
  CRAWL_DELTA
  storage/image
Mapper:
  Emits: hash(content), content (from crawled data)
         hash, null (from storage)
Reducer:
  Emits: content (only for new hashes)
Output:
  storage/image
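A minimal sketch of the duplicate-removal reduce step: keep one copy per content hash and skip hashes already present in storage. The empty-value marker used to flag already-stored hashes is an assumed convention, not taken from the paper.

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Sketch: the mapper emits (contentHash, imageBytes) for newly crawled images
 *  and (contentHash, empty) for hashes already in storage; only hashes with no
 *  "already stored" marker are written out. */
public class DedupReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text contentHash, Iterable<BytesWritable> values, Context ctx)
            throws IOException, InterruptedException {
        BytesWritable firstNewCopy = null;
        boolean alreadyStored = false;
        for (BytesWritable v : values) {
            if (v.getLength() == 0) {
                alreadyStored = true;            // marker emitted from existing storage
            } else if (firstNewCopy == null) {
                firstNewCopy = new BytesWritable();
                firstNewCopy.set(v);             // keep only the first crawled copy
            }
        }
        if (!alreadyStored && firstNewCopy != null) {
            ctx.write(contentHash, firstNewCopy);
        }
    }
}
```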

E. Content Processing Algorithms

Most multimedia applications need to extract content features. Typical use cases are object detection, adult-content detection, face/skin detection, color analysis, etc. In some cases they also have to modify the image, for example by resizing or sharpening it. These operations can be expensive and may take a few milliseconds per image. These image-processing algorithms are plugged into the content processing pipeline and run as M/R jobs:

Input:
  storage/image
  storage/imageattributes
Mapper:
  Emits: hash, binaryImage
         hash, attributes
Reducer:
  Emits: hash, Processor.processedImage(binaryImage, attributes)
         hash, Processor.getAttributes(binaryImage, attributes)
Output:
  storage/image
  storage/imageattributes

[where Processor is a pluggable image-processing algorithm]
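For concreteness, the pluggable Processor could be modeled as a small Java interface like the one below; the interface name and method signatures are our guesses, not an API defined in the paper.

```java
import java.util.Map;

/** A shape the pluggable "Processor" referred to above might take. */
public interface ImageProcessor {
    /** Returns the (possibly modified) image binary, e.g. after resizing or sharpening. */
    byte[] processedImage(byte[] imageBinary, Map<String, String> attributes);

    /** Returns new or updated attributes, e.g. an adult-content score or detected faces. */
    Map<String, String> getAttributes(byte[] imageBinary, Map<String, String> attributes);
}
```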

Figure 3. Illustration of the pipelined approach for pornographic content detection

We now show an illustration of how pornographic content was detected in images at Internet scale. In the pipelined approach, we adopt the configuration shown in Figure 3. In this figure, the components that classify an image as pornographic are depicted as red boxes, while the components that classify an image as benign are depicted as green boxes. The pipeline combines the various components in a disjunctive setup: an image propagates through the pipeline until a component reaches a conclusion about it. Thus, if an image is regarded as pornographic in the first body-part detection stage, it is not tested by the subsequent stages. In this approach, some images pass out of the pipeline without any component coming to a conclusive verdict; such images are referred to as unclassified (unknowns) and can be dealt with by appending suitable stages to the end of the pipeline. In choosing this particular ordering for the pipeline, we ensured that the more precise stages (determined experimentally) occur towards the beginning. Further, in training some of the subsequent stages (shape based, bag of colors, and textons) explicit care was taken to ensure that they performed well on the examples missed by the earlier stages. Note that the pipeline schematic in Figure 3 shows repeated instances of the shape-based and text-based components. These components act as both pornographic and benign image classifiers; thus the two instances of the shape-based classifier differ in that one filters out pornographic images and the other filters out non-pornographic ones. Though these components are repeated in the diagram, in the actual implementation effective caching is used to minimize re-computation.
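The early-exit behaviour of this disjunctive pipeline can be summarized in a few lines of Java; the three-way verdict and the Stage interface are illustrative abstractions over the classifiers shown in Figure 3, not the paper's actual implementation.

```java
import java.util.List;

/** Minimal sketch of the disjunctive classification pipeline described above. */
public class ClassifierPipeline {

    public enum Verdict { PORNOGRAPHIC, BENIGN, UNKNOWN }

    /** One stage, e.g. body-part detection, shape-based, text-based, bag-of-colors, textons. */
    public interface Stage {
        Verdict classify(byte[] imageBinary, String surroundingText);
    }

    private final List<Stage> stages;

    public ClassifierPipeline(List<Stage> stagesInPrecisionOrder) {
        this.stages = stagesInPrecisionOrder;   // most precise stages first, as in Figure 3
    }

    /** The image stops at the first stage that reaches a conclusive verdict. */
    public Verdict classify(byte[] imageBinary, String surroundingText) {
        for (Stage stage : stages) {
            Verdict v = stage.classify(imageBinary, surroundingText);
            if (v != Verdict.UNKNOWN) {
                return v;
            }
        }
        return Verdict.UNKNOWN;   // fell out of the pipeline unclassified
    }
}
```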

F. Redistribution of Load

Hadoop is known for its scalability and reliability, and there are several articles that discuss the scalability benefits offered by Hadoop [9]. Since this architecture is built using Hadoop, the scalability and availability issues are already taken care of by the platform. The only machine-dependent function is the mapping of URLs to crawlers, which is tied to the number of machines. When nodes are added to or removed from the cluster, the documents need to be redistributed, and this is done as an M/R job. The redistribution job figures out what is still pending to be crawled and redistributes it among the available crawlers (each of which corresponds to a folder in HDFS). The redistribution Map/Reduce job is modeled as follows:

Input:
  NEW_URLS (n1 subfolders for n1 crawlers)
  CRAWL_DELTA (already crawled data)
Mapper:
  IdentityMapper
Reducer:
  Emits: URL, null (only for URLs for which there is no CRAWL_DELTA entry)
Output:
  NEW_URLS (N subfolders corresponding to the N crawlers now available)
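One simple way to realize the fan-out to N crawler folders is to run the redistribution job with N reducers and a hash partitioner on the URL, as sketched below; the paper only specifies the job's inputs and outputs, so this partitioner is our assumption.

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/** Spread still-uncrawled URLs over N reducers, each of which writes one
 *  /NEW_URLS subfolder for one crawler node (numReduceTasks = number of crawlers). */
public class CrawlerPartitioner extends Partitioner<Text, NullWritable> {
    @Override
    public int getPartition(Text url, NullWritable value, int numCrawlers) {
        // Mask the sign bit so the bucket index is always non-negative.
        return (url.hashCode() & Integer.MAX_VALUE) % numCrawlers;
    }
}
```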

V. EXPERIMENTS

We conducted an experiment by turning off the compression of Hadoop outputs to achieve higher speed. We ran a test to write 1 million multimedia documents into HDFS using two serialization techniques: one used the generic Writable interface provided by Hadoop, and the other used ObjectOutputStream to write records to HDFS. The results are shown in Table 1.

Table 1: Performance of Hadoop Serialization Methods

Output Type                           Disk Usage    Time (in sec)
ObjectOutputStream - Compressed       534           100.9
Generic Writable - Compressed         407           76.5
ObjectOutputStream - Uncompressed     715           28.6
Generic Writable - Uncompressed       577           16.9

Choosing the correct serialization technique helped us improve the performance of the architecture. We also observed that each crawler could crawl more than a million URLs per day. We increased the number of parallel copies run by reducers to fetch outputs from a large number of mappers to 100, and set dfs_datanode_max_xceivers to 8192.
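For reference, a hand-rolled Hadoop Writable for the image document, the kind of serialization that outperformed ObjectOutputStream in Table 1, might look like the sketch below; the field set is illustrative, not the paper's actual record layout.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

/** Sketch of a compact, hand-written Writable for one image document. */
public class ImageDocumentWritable implements Writable {
    private String url = "";
    private String contextualData = "";
    private byte[] imageBytes = new byte[0];

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);                 // writeUTF is fine for short strings;
        out.writeUTF(contextualData);      // very long fields would need a length-prefixed byte array
        out.writeInt(imageBytes.length);
        out.write(imageBytes);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        contextualData = in.readUTF();
        imageBytes = new byte[in.readInt()];
        in.readFully(imageBytes);
    }
}
```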


With these settings the implementation was tested on three different setups whose configuration details are described in Table 2.

Table 2: Experimental Setups

            Num Nodes     Disk Space    Memory      CPU
            in Cluster    Per Node      Per Node
Setup I     2             400 GB        4 GB        2 x Xeon 3.00 GHz
Setup II    100           1.8 TB        4 GB        2 x Xeon 2.80 GHz
Setup III   200           4 TB          16 GB       2 x Xeon 2.50 GHz (8 cores)

Setup I was used to test small feeds of about 6,000 documents on average. Setup II was used to process a few million documents, and Setup III was targeted at processing more than a billion documents. The setups were tested with different types of feeds, such as XML, CSV, and TSV. We measured the time taken to process feeds with varying numbers of documents. The processed data was written to HDFS as SequenceFiles.

Figure 4. Performance of three different setups

Figure 4 shows the content processing time for the three different setups. It does not include the crawl time, as the crawl operation was a continuous background process. The graph shows an exponential increase in processing time as the number of documents increases, for all three setups. The two-node setup performed reasonably well and met our sub-minute SLA when the number of documents was fewer than 1,000, which was usually the case with real-time news and event feeds.

The 100-node cluster was built to handle a few million documents. However, the setup did not perform well beyond 300M documents, for which the processing time was close to 10 hours. We observed a high initial processing time, and system memory was a major bottleneck, as we had to remember a large number of documents in order to remove duplicate entries in the system.

The 200-node cluster was targeted at handling more than a billion documents efficiently. There was no significant improvement in time over the 100-node cluster for document counts in the order of a few thousand. However, with millions of documents the 200-node cluster performed roughly 6 times better than the 100-node cluster.

VI. SUMMARY

This paper describes a hybrid architecture designed to efficiently process incremental multimedia feeds. The architecture can be scaled easily by adding more machines, and the redistribution logic for spreading the load across the nodes of the cluster is fairly simple. This architecture is suitable for multimedia applications that need to fetch, process, and store a large number of documents.

Our future work will involve testing the architecture on other multimedia content such as video, and integrating the content serving layer with the storage architecture.

ACKNOWLEDGMENT

We would like to thank Ashwinder Alhuwalia, Nathan Wang, Eric Zhang, Mattias Larson, Quoc Pham, Patrick Mccormack, Vijayanand, Bala, Greg, Hari Vasudev and the Hadoop engineering team at Yahoo! for supporting and guiding us during the course of development of this architecture.

REFERENCES

[1] Hadoop: A Distributed Computing Platform. http://hadoop.apache.org
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Sixth Symposium on Operating System Design and Implementation (OSDI), December 2004.
[3] Flickr Blog. http://blog.flickr.net/en/2010/09/19/5000000000/
[4] Consistent Hashing, Wikipedia. http://en.wikipedia.org/wiki/Consistent_hashing
[5] Nutch: Open Source Web-Search Software. http://lucene.apache.org/nutch/about.html
[6] Ian Rose, Rohan Murty, Peter Pietzuch, Jonathan Ledlie, Mema Roussopoulos, and Matt Welsh. Cobra: Content-based Filtering and Aggregation of Blogs and RSS Feeds.
[7] Dynamo Storage System, Wikipedia. http://en.wikipedia.org/wiki/Dynamo_(storage_system)
[8] Mercator: A Scalable, Extensible Web Crawler. www.mias.uiuc.edu/files/tutorials/mercator.pdf
[9] High Scalability. http://highscalability.com/product-hadoop
