From Application Developer to Big Data Engineer
I started my career as an application developer, working mostly with OOP languages, databases, REST APIs, and a little client-side programming. Around that time I heard about “Big Data” and “Hadoop”, but I didn’t know much about either. My journey into Big Data began when the project I was working on started migrating to Hadoop to support our client’s large volume of data. In my first month, I did a few things that most people do when they start learning Big Data:
- read blogs and books about Hadoop and MapReduce
- manually set up a small standalone cluster with the help of Google (mostly copying and pasting the commands and config)
- copied the word count program from the internet, fixed all the errors, ran it successfully on the cluster, and was happy
- tried to understand the syntax and workings of MapReduce, and looked for other problems similar to word count, but there were not many
- played with different configs in core-site.xml and a few other XML files. If the solution threw an exception after a config change, I searched the internet and fixed it.
These things helped me understand Big Data to some extent, but I didn’t get the whole picture. Here are a few reasons:
- continued using my earlier programming approaches for big data problems, forgetting that they mostly work for silo computing (a single machine, data in MBs)
- found it hard to understand how Hadoop could solve the same problems that we had solved in a general-purpose language once the data size increased (I wondered how Hadoop could solve most problems with just key/value pairs as input)
- was not able to come up with a solution to a problem when the data was partitioned and stored on different machines
- ignored the downstream impact of each API and how exactly it would affect performance and memory usage
- was not able to decide which tool or framework would solve my problem efficiently, because the Hadoop ecosystem has many tools and frameworks and the number grows day by day
While working on big data projects, I observed a few things that we need to focus on:
Mindset change
While solving problems for silo computing, we mostly focus on performant APIs, immutability, heap memory, and a few other things. For distributed computing, you also have to consider in-memory vs. disk I/O, CPU usage, and network bottlenecks. At every stage, from analysis to production deployment, we need to test and verify all of these parameters for every change, however small.
Problem solving approach
Let’s say we want to compute the average of 1,000 numbers. We can implement it in any language and run it on a single machine. But what if there are 100 billion numbers? We cannot solve this efficiently with the earlier approach. There are two main things to consider: the volume of data and parallel execution. We need to split the data, store it on multiple machines, and process the splits in parallel to get the answer in minimum time.
The average problem using MapReduce: find the average of the numbers 1 to 10, partitioned across three machines.
As you may know, MapReduce is a two-phase process: a map phase and a reduce phase. In the map phase, the mapper takes a Hadoop directory as input and reads it one line at a time as a key-value pair. The goal of the mapper is to take a key-value pair as input, process it, and emit another key-value pair. Between the map and reduce phases, all the values emitted by the mappers are grouped by key. The reduce phase takes one key and all of its associated values at a time, processes them, and emits another key-value pair.
In the above diagram, all mappers run in parallel, and each one emits the number of elements in its split as the key and the sum of those elements as the value. The grouped values are then consumed by the reduce task, which emits zero as the key and a tuple of the number of elements and their total as the value. The emitted pair is passed to one more mapper, which gives the average of the numbers one to ten.
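The average example can be sketched in plain Python. This is only a single-process illustration of the idea, not Hadoop code: the split of 1 to 10 into three lists and the names `map_partition`, `reduce_pairs`, and `distributed_average` are assumptions made for the example.

```python
from functools import reduce

# Hypothetical split of the numbers 1..10 across three machines.
partitions = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]


def map_partition(numbers):
    """Map step: summarise one split as a (count, sum) pair."""
    return (len(numbers), sum(numbers))


def reduce_pairs(a, b):
    """Reduce step: merge two (count, sum) pairs into one."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_average(parts):
    # On a real cluster each split would be mapped on its own machine;
    # here the map step runs in a simple loop.
    mapped = [map_partition(p) for p in parts]
    count, total = reduce(reduce_pairs, mapped)
    # Final step: turn the combined (count, sum) into the average.
    return total / count


average = distributed_average(partitions)
print(average)
```

Only one small (count, sum) pair per split has to leave each machine, not the numbers themselves. Averaging the per-split averages directly would give the wrong answer whenever the splits have different sizes, which is why the count travels with the sum.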
Some points to remember:
- Unlearn the app dev programming approach.
- While implementing a solution to any problem for a large volume of data, you have to make it run in parallel.
- Don’t think about logic that directly gives you the final result. Break the problem into parts and decide on the intermediate outputs (and make sure all the splits run in parallel). In the average problem, step 1 sums the numbers in every split in parallel, and the result goes on to further processing in steps 2 and 3.
- Keep network I/O to a minimum.
Try thinking about how to solve problems like cumulative sum, matrix transpose, or the top N most frequently used words in a distributed fashion.
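As one possible answer for the cumulative sum, here is a minimal single-process sketch in Python. It assumes the data is already split into ordered partitions. The function names are hypothetical and this is just one way to break the problem into intermediate outputs.

```python
from itertools import accumulate


def local_totals(partitions):
    """Step 1 (parallel per split): each split reports only its own total."""
    return [sum(p) for p in partitions]


def partition_offsets(totals):
    """Step 2 (small, on one machine): the offset of a split is the
    sum of the totals of all splits that come before it."""
    return [0] + list(accumulate(totals))[:-1]


def distributed_cumsum(partitions):
    totals = local_totals(partitions)
    offsets = partition_offsets(totals)
    # Step 3 (parallel per split): local running sum shifted by the offset.
    return [
        [offset + value for value in accumulate(part)]
        for part, offset in zip(partitions, offsets)
    ]


result = distributed_cumsum([[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]])
print(result)
```

The only data that has to move between machines is one total per split. The heavy work in steps 1 and 3 stays local to each split.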
Do an impact analysis of the APIs that you are going to use
Frameworks like Spark provide higher-level abstractions such as the RDD (Resilient Distributed Dataset). When you perform an operation on an RDD, behind the scenes it executes on each of the splits on different machines. That’s why it is important to understand the internal workings of each API that you are going to use for a transformation. For example, word count can be computed using either reduceByKey or groupByKey. Both give the correct answer, but reduceByKey performs much better on a large volume of data because it combines values within each partition before the shuffle, so less data crosses the network.
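To see why the two approaches differ, here is a small plain-Python simulation. It does not use Spark at all. The functions `group_by_key_style` and `reduce_by_key_style` are hypothetical stand-ins that only count how many records would have to be shuffled between machines, and the two word lists are made-up partitions.

```python
from collections import Counter, defaultdict

# Hypothetical words stored on two machines.
partitions = [
    ["big", "data", "big", "data", "big"],
    ["data", "big", "hadoop"],
]


def group_by_key_style(parts):
    """Ship every (word, 1) pair, then group and sum after the shuffle."""
    shuffled = [(word, 1) for part in parts for word in part]
    grouped = defaultdict(list)
    for word, one in shuffled:
        grouped[word].append(one)
    counts = {word: sum(ones) for word, ones in grouped.items()}
    return counts, len(shuffled)


def reduce_by_key_style(parts):
    """Sum within each partition first, then ship one pair per word per partition."""
    local_counts = [Counter(part) for part in parts]
    shuffled = [(word, n) for local in local_counts for word, n in local.items()]
    counts = Counter()
    for word, n in shuffled:
        counts[word] += n
    return dict(counts), len(shuffled)


grouped_counts, grouped_records = group_by_key_style(partitions)
reduced_counts, reduced_records = reduce_by_key_style(partitions)
print(grouped_counts == reduced_counts, grouped_records, reduced_records)
```

Both functions return the same word counts, but the second one shuffles fewer records. The gap widens as words repeat more often within a partition.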
Try new tools or frameworks in an environment that is close to a copy of production
When I started exploring tools in the Big Data space, I was confused because there are many tools and frameworks that fulfill a single requirement, each with its own pros and cons. Here is a snapshot of what the big data landscape looks like:
Some frameworks give more flexibility but don’t perform well on very large volumes of data. Some tools are good at iterative processing and some are not. Some frameworks are good at writing data and some are good at querying it. Understand your use cases, the volume of data you want the framework to process, and the performance you need. Before finalizing any tool or framework, I would suggest spiking it out in a test environment and measuring its performance in a production-like environment to get a better understanding of how it behaves on a large data set. If everything goes right, use it in production!
Originally published on medium.com