When to Choose Hadoop

Apache Hadoop has rapidly become a leading open source Big Data management software used for situations where large scopes of data must be dealt with. But there are some scenarios when Hadoop is especially useful:

1. Complex information processing is needed (But this applies to parallelizable algorithms only.)

2. Unstructured data processing is needed.

3. It is not critical to have results in real time.

4. Machine learning tasks are involved.

5. Data sets are too large to fit into the RAM or discs; or require too many cores to process them (few TBs and more).

Let`s consider some typical Hadoop usage examples.

  • Log Files/Click Stream Analysis

    Log processing is one of the most common tasks solved by Hadoop. Large data centers can generate gigabytes of logs every single day. These logs may be extremely useful for troubleshooting and analysis, but storage and processing of such volumes of data presents a serious technical challenge. Even a simple search becomes an issue when dealing with such extensive volumes of data. A single machine can’t perform such a job effectively, which leads to the necessity of having a cluster of distributed machines to work on the task. This is where Hadoop becomes useful as it provides tools to store (HDFS) and process (MapReduce) large volumes of log files.

  • Recommendation Engines/Ad Targeting

    In order to suggest the most appropriate ads for a given user, it`s necessary to analyze all the available information about the users (profile, web browsing history, clicks history, email headers to, etc.). The recommendations engine should understand behavior and preferences of each individual user and be able to estimate the probability of the user’s interest in each specific ad.

    Another type of similar analysis could be described by use cases when an organization will process all the available data from multiple data sources related to each customer in order to understand: what could be done (campaigns, benefit programs, personal treatment, etc.) to increase customer satisfaction?
    Such tasks can be effectively solved by Hadoop as it allows for analyzing different parts of data (e.g. data about each customer) separately and in parallel.

  • Search Index Generation

    This is where Hadoop was originally started as an attempt to build an open source search engine. After Google published their GFS (Google’s distributed File System) and MapReduce papers, Hadoop was designed and implemented according to the design principles defined in these papers.

    The problem was to effectively store and process extremely large volumes of crawled data. The main result of this processing is a generated search index. Each individual piece of data (e.g. crawled web site) could be processed separately from other parts of data. But there was a need in the infrastructure and programming model to store and process huge volumes of data, and this is where HDFS/MapReduce were designed to complete the task.