
People with limited knowledge of Hadoop sometimes ask me why we need a new data storage technology. Why not stick with tried and tested relational database technology? Why not indeed? In this post I will discuss the main differences between Hadoop and relational databases and some reasons why we might choose one over the other.

Hadoop is technically not a database, so comparing it to relational databases may look like comparing apples to oranges. But Hadoop is a distributed file system, and as such it is used to store data sets across a cluster of computers. It is designed to store very large files and is fault-tolerant because it replicates blocks of data within the cluster. From the point of view of storing large volumes of data, we can therefore still compare it to relational databases.

I am by no means suggesting that we have to use Hadoop rather than traditional databases just because it is newer. There is always a time, a place and a justification for any technology, and there are certainly many use cases out there that warrant the use of a relational database. On the other hand, there are many use cases, especially when we talk about Big Data, that call for Hadoop. Let’s examine and compare the two so that we can make informed decisions about what to use when.

Amount of data

The term Big Data is often used to describe data sets that are too large to be stored using traditional relational databases and must therefore go into Hadoop. This is a generalization that doesn’t necessarily apply in all cases. We can store very large data sets in relational databases that have been sized accordingly. And of course, we can store small data sets in Hadoop if we want to.

Very large data sets stored in relational databases can cause headaches, especially when it comes to query performance and the amount of time it takes to back up such data sets. This is where we can consider Hadoop as an alternative: its fault-tolerant replication reduces the need for separate backups, and it enables fast access to data through MapReduce.
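To make the MapReduce idea concrete, here is a minimal word-count sketch in pure Python. It only illustrates the programming model; a real Hadoop job would run the map and reduce phases in parallel across the cluster, and the input lines stand in for blocks of a file on HDFS.

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Toy input standing in for a large distributed data set.
data = ["big data needs big storage", "data storage at scale"]
result = reduce_phase(map_phase(data))
```

In Hadoop proper, the framework shuffles the mapper output so that all pairs for one key reach the same reducer; here the `defaultdict` plays that role in a single process.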

Data consistency and mutability

Data in relational databases can be updated through transactions. Relational databases typically provide ACID properties (Atomicity, Consistency, Isolation and Durability) which guarantee that transactions in the database are valid. We cannot write half of a transaction if something unexpected happens while we are performing the write. We either write the complete transaction or nothing at all, which guarantees consistency of our data. Relational databases also allow us to make updates to the data as we see fit.
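The all-or-nothing behaviour of a transaction can be demonstrated with any transactional database; the following sketch uses Python’s built-in sqlite3, with a hypothetical `accounts` table standing in for real data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'"
        )
        raise RuntimeError("simulated failure mid-transfer")
        # The matching credit to bob is never reached.
except RuntimeError:
    pass

# Atomicity: the half-finished transfer was rolled back, so alice keeps 100.
balance = con.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
```

Because the failure occurred inside the transaction, the debit from alice is undone automatically; the database never exposes the inconsistent in-between state.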

The Hadoop file system is immutable, meaning that we can write data only once and we can read it many times afterwards. Hadoop guarantees fault-tolerance, meaning that if one node in the cluster goes down, we can still read the data because blocks of data are replicated across the cluster. But we can’t update data as we would in relational databases. And Hadoop does not guarantee consistency because it has no notion of a transaction. Data is written as it comes in and there is no rollback of a transaction. The best we can achieve is eventual consistency as we can expect that data will be written or replicated eventually.

Here we have a clear distinction between relational database technology and Hadoop. If we want to update the data frequently, we have to use relational database technology and not Hadoop. However, if we just want to store large volumes of data, then we can go with Hadoop.

Schema

A relational database needs a predefined schema into which data is written. For example, a table with a predefined structure must exist before data can be written into the database. Modern relational databases allow additional data structures to be stored within relational tables, such as binary blobs, or XML or JSON structures. But even so, the fact that we want to store these types of data must be known up-front. We refer to this type of data processing as schema-on-write, because the schema must be known at the time of writing the data.
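Schema-on-write is easy to see in practice: writing to a table that has not been defined yet is simply an error. A small sketch with sqlite3 (the `users` table is made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Schema-on-write: inserting before the table exists is rejected.
try:
    con.execute("INSERT INTO users VALUES (1, 'alice')")
    insert_failed = False
except sqlite3.OperationalError:  # "no such table: users"
    insert_failed = True

# Only once the schema is declared can the data be written.
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.execute("INSERT INTO users VALUES (1, 'alice')")
rows = con.execute("SELECT * FROM users").fetchall()
```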

Hadoop has the ability to process and store all kinds of data, whether structured, semi-structured or unstructured. We don’t have to know the schema of the data in order to store it in a Hadoop cluster. On the other hand, in order to read the data and make sense of it, we have to know the schema at the time of reading. We refer to this type of data processing as schema-on-read.
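With schema-on-read, the raw records are stored exactly as they arrive, and structure is imposed only when the data is consumed. A minimal sketch, where the JSON lines stand in for files on HDFS and the field names are invented for illustration:

```python
import json

# Raw records stored as-is, with no declared schema up-front.
raw = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "country": "DE"}',  # extra field is harmless
]

def read_clicks(lines):
    """Schema-on-read: decide which fields matter only at read time."""
    return {rec["user"]: rec["clicks"] for rec in map(json.loads, lines)}

clicks = read_clicks(raw)
```

Note that the second record carries a field the reader never asked for; nothing had to be declared when the data was written, and a different reader could apply a different schema to the same files.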

In comparison, we typically use relational databases to manage structured and semi-structured data, while we use Hadoop primarily to process large amounts of unstructured data.

Performance

In Hadoop we can scan huge amounts of data much faster than we would in relational databases, but we cannot retrieve a particular record from a data set as efficiently as an indexed lookup in a large relational table.

Writing data is typically faster in Hadoop as compared to relational databases, because Hadoop doesn’t have to deal with the overhead that comes with guaranteeing consistency or rollback of incomplete transactions.

Scalability

Relational databases typically provide vertical scalability, meaning that with demands for higher workloads we have to add more resources such as memory and CPU to the server which can become expensive.

Hadoop provides horizontal scalability, meaning that we add more computers to the existing cluster and these computers can be just commodity hardware. This additionally supports fault tolerance, because due to replication of data across the cluster, we can still access data even if one of the computers in the cluster fails.

The bottom line

Last but not least, one of the decisive factors in choosing a technology is cost, not necessarily in terms of initial set up and licensing, but more in terms of total cost of ownership. Hadoop is a free open source software framework that can be set up on a cluster of cheap commodity computers which we may already have lying around. Many relational databases are rather expensive both in licensing and in the hardware that is required to make them operational. But this is just one point of view.

Because Hadoop is a much more recent technology than relational databases, there is no abundance of seasoned experts on the market with the knowledge and experience to set up a Hadoop cluster so that it works reliably and efficiently. Sure, Hadoop is supposed to be fault-tolerant, but that only holds if the cluster is set up correctly and monitored to ensure smooth operation.

As a recent technology, the Hadoop ecosystem is still evolving. Many technologies are involved, and it takes considerable time and effort to master them all and keep up with new developments. Compared with relational database technology, there aren’t as many experienced Hadoop developers on the market, and it can be expensive to hire those who are available.


In summary, as stated at the beginning, there is always a time and a place and a justification for choosing one technology over another. If we want transactional reliability for an ERP system, we would go with a relational database. If we want cheap long-term storage of large amounts of quickly accessible immutable data, we would go with Hadoop. For anything else, we would weigh the benefits and drawbacks of each technology and make a decision based on the many factors listed above.
