Hadoop on Windows Azure: Hive vs. JavaScript for processing big data

By Sergey Klimov and Andrei Paleyes, senior R&D engineers at Altoros Systems Inc., Network World

This large difference in performance may have another explanation. The JavaScript query writes its results to the outputFile argument of the runJs("codeFile", "inputFiles", "outputFile") command using a single Reduce task, as indicated in the table above.
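To illustrate why a single Reduce task can become a bottleneck, here is a minimal sketch of hash-modulo partitioning, the general scheme Hadoop's default partitioner uses to assign keys to Reduce tasks. The function names, the stand-in string hash and the list of year keys are assumptions made for illustration; they are not taken from the tested queries.

```javascript
// Stand-in string hash (31-multiplier recurrence); the real hash depends on the key type.
function stringHash(s) {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (Math.imul(31, h) + s.charCodeAt(i)) | 0; // wrap to signed 32-bit
  }
  return h;
}

// Hash partitioning: clear the sign bit, then take the remainder.
function partitionFor(key, numReduceTasks) {
  return (stringHash(key) & 0x7fffffff) % numReduceTasks;
}

// Count how many keys land in each Reduce task.
function keysPerReducer(keys, numReduceTasks) {
  const counts = new Array(numReduceTasks).fill(0);
  for (const key of keys) {
    counts[partitionFor(key, numReduceTasks)] += 1;
  }
  return counts;
}

// Hypothetical grouping keys: one per year.
const years = [];
for (let y = 1987; y <= 2008; y++) {
  years.push(String(y));
}

console.log(keysPerReducer(years, 1));  // one Reduce task receives every key
console.log(keysPerReducer(years, 10)); // keys are spread over ten Reduce tasks
```

With one Reduce task, every key and all of the output pass through a single process, while ten Reduce tasks can share that work in parallel.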

Dependency between the block size and the number of Map tasks

We also analyzed how the block size of the distributed file system influenced the number of Map tasks triggered by Hive and JavaScript queries.

For a 64MB block, the HQL query ran 37 Map tasks and 10 Reduce tasks. When a JavaScript query was processed, the task manager divided the total load into 150 Map tasks and a single Reduce task.

Referring to the table, we can conclude that the number of Reduce tasks does not depend on the block size and is equal to 10 for Hive queries and to 1 for JavaScript queries.
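As a rough rule of thumb for plain file input, Hadoop launches about one Map task per block-sized split, so the number of Map tasks falls as the block size grows. The sketch below shows only that arithmetic. The 1GB input size and the function name are hypothetical, and the real count also depends on the input format, the number of files and the query engine, which is why Hive and JavaScript produced different numbers of Map tasks for the same block size.

```javascript
const MB = 1024 * 1024;

// Rough estimate: one Map task per block-sized split of the input.
function estimateMapTasks(totalBytes, blockBytes) {
  if (totalBytes <= 0 || blockBytes <= 0) {
    throw new RangeError("sizes must be positive");
  }
  return Math.ceil(totalBytes / blockBytes);
}

// Hypothetical 1GB input, not the data set used in the tests above.
const inputBytes = 1024 * MB;

for (const blockMb of [64, 128, 256]) {
  const tasks = estimateMapTasks(inputBytes, blockMb * MB);
  console.log(blockMb + "MB block -> about " + tasks + " Map tasks");
}
```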

Dependency between performance and the number of Map/Reduce tasks

We also analyzed how the number of Map and Reduce tasks influenced the speed of processing Hive and JavaScript queries.

From this diagram, you can see that Hive queries were properly optimized and the block size had almost no impact on execution time. In JavaScript, on the contrary, the processing speed depended directly on the number of Map tasks.

Dependency between performance and the type of a query

Below you can see the diagram that shows how the processing speed depends on the query type for a data set of 64MB.

The difference between the first and the second, as well as between the third and the fourth, queries was in the number of grouping parameters. The first query calculated flight delay times by year. In the second query, we added such parameters as month and day. The third query returned the average flight delay times by year, which is a different arithmetic operation. The fourth query calculated the average flight delay times by year, month and day.
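The four query shapes can be sketched as map and reduce functions run by a tiny in-memory driver. This is an illustration only: the record layout ("year,month,day,delayMinutes"), the sample rows, the helper names and the choice of a sum for the first two queries are all assumptions, and the driver stands in for the cluster rather than reproducing the runJs API.

```javascript
// Hypothetical record layout: "year,month,day,delayMinutes".
const COLUMN = { year: 0, month: 1, day: 2 };

// Build a map function that emits (grouping key, delay) for each record.
function makeMap(groupBy) {
  return function map(line, emit) {
    const fields = line.split(",");
    const delay = parseFloat(fields[3]);
    if (!Number.isFinite(delay)) return; // skip rows without a numeric delay
    const key = groupBy.map((name) => fields[COLUMN[name]]).join("-");
    emit(key, delay);
  };
}

function sumReduce(values) {
  return values.reduce((a, b) => a + b, 0);
}

function avgReduce(values) {
  return sumReduce(values) / values.length;
}

// In-memory stand-in for the cluster: map, group by key, reduce.
function runQuery(lines, map, reduce) {
  const groups = new Map();
  for (const line of lines) {
    map(line, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduce(values);
  }
  return result;
}

// Made-up sample rows for illustration.
const sample = ["2007,1,1,10", "2007,1,1,20", "2007,1,2,30", "2008,3,5,40"];

const q1 = runQuery(sample, makeMap(["year"]), sumReduce);
const q2 = runQuery(sample, makeMap(["year", "month", "day"]), sumReduce);
const q3 = runQuery(sample, makeMap(["year"]), avgReduce);
const q4 = runQuery(sample, makeMap(["year", "month", "day"]), avgReduce);

console.log(q1, q2, q3, q4);
```

Adding month and day to the key changes only the map side and produces more distinct groups, while switching from a total to an average changes only the reduce side.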

Judging by the diagram, additional grouping parameters had a much greater influence on JavaScript queries than the arithmetic operations performed. In the case of Hive, operations such as transforming, converting and computing data caused the processing speed to degrade significantly, which can be seen from the difference in processing time between the first and the third queries. The fourth query both calculated average values and grouped by year, month and day, which resulted in the slowest processing speed.

Conclusion


Originally published on Network World.