Monday, November 3, 2014

Impala Performance Consideration

Introduction

Recently, I’ve done load testing of impala. During testing I’ve found by doing some tiny changes query performance can be improve drastically. I would like to share my experience to all. Majorly I’ve found you should tune impala by two ways.

Less your scanning time of HDFS(storage):

It’s always primary concern to read data from distributed mode. Data scatter across the multiple machines. You’re limited to disk throughput and network I/O. We’ve found scanning of HDFS taking so long time then to do actual aggregation jobs.  There are basically two reasons for taking long time:
1)      Scan data from disk:
Disk throughput is limited to MBPS. In our testing I’ve found it goes to max 150mbps(on m3.Xlarge instance on aws EC2 and 10 IOPS). i.e. if you are reading 50 GB data from scatter disk it will take around 6 minutes only for reading your data and We expect query response in less than 3 to 5 sec in real time. I would also like to point your direction on concurrent read. Reading tens of concurrent user from disk can kill your system for endless time period. So does it mean we can’t reduce the time?
Answer is no, we can. Newer Cloudera distribution come up with the concept of HDFS caching. You can cache your data into physical memory. I’ve seen the reading from cache data gives you GBPS speed. It can drastically change scanning time of HDFS. And also this reflection you can see on concurrent reads also.

2)      Scan full data:
Scanning full data is sometimes panic for system and that also if I consider it sequential scanning. What if I interested only particular app id, your system will scan complete data and filter your data row by row. What if I get something where I get already filtered data there I don’t need to scan my complete data set. And this mechanism called Partition. We can partition our data by some predefined filters. So that impala will scan only those data set which you are put into filter conditions.

 Aggregate level

I’ve also found you can tune impala by your query also. As for example we can do the date parsing before giving the input to impala. So that Impala not take much time on parsing and filtering the rows. Joining optimization can also be consider.

Summary:

Impala is a great tool; If you optimize it from storage level to processing level. Storage and processing equally put the impact on system. Sometimes, an optimization technique improves scalability more than performance. For example, reducing memory usage for a query might not change the query performance much, but might improve scalability by allowing more Impala queries or other kinds of jobs to run at the same time without running out of memory.