Introduction
Recently, I did some load testing of Impala. During testing I found that a few
small changes can improve query performance drastically, and I would like to
share my experience. Broadly, I found that you should tune Impala in two ways.
Reduce your HDFS (storage) scan time:
Reading data in a distributed setup is always a primary concern. Data is
scattered across multiple machines, and you are limited by disk throughput and
network I/O. We found that scanning HDFS took much longer than the actual
aggregation work. There are basically two reasons for the long scan time:
1) Scan data from disk:
Disk throughput is limited to a matter of MB/s. In our testing it peaked at
about 150 MB/s (on an m3.xlarge instance on AWS EC2 with 10 IOPS). That is, if
you are reading 50 GB of data scattered across disks, it takes around 6 minutes
just to read the data, while we expect a query response in less than 3 to 5
seconds in real time. I would also like to draw your attention to concurrent
reads. Tens of concurrent users reading from disk can stall your system for a
very long time. So does this mean we can’t reduce the time?
No, we can. Newer Cloudera distributions come with the concept of HDFS caching:
you can cache your data in physical memory. I have seen reads from cached data
reach GB/s speeds, which can drastically change the HDFS scan time. You can see
the same effect on concurrent reads.
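To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. The 50 GB size and the 150 MB/s disk figure come from the test described above; the 1 GB/s cache figure is only an assumption standing in for "GB/s speeds", and the function name is made up for illustration.

```python
# Back-of-the-envelope scan time: data size divided by read throughput.
def scan_seconds(data_gb: float, throughput_mb_per_s: float) -> float:
    """Seconds needed to read data_gb gigabytes at the given throughput."""
    return data_gb * 1024 / throughput_mb_per_s

DATA_GB = 50            # data set size from the text
DISK_MB_PER_S = 150     # disk throughput measured in the text
CACHE_MB_PER_S = 1024   # ASSUMPTION: 1 GB/s stands in for "GB/s speeds"

disk_s = scan_seconds(DATA_GB, DISK_MB_PER_S)
cache_s = scan_seconds(DATA_GB, CACHE_MB_PER_S)

print(f"disk:  {disk_s / 60:.1f} min")   # disk:  5.7 min
print(f"cache: {cache_s:.0f} s")         # cache: 50 s
```

In Impala itself, a table is typically placed in an HDFS cache pool with a statement of the form ALTER TABLE ... SET CACHED IN '<pool>', where the pool is one your administrator has created. Actual speedups will depend on your hardware and data.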
2) Scan full data:
Scanning the full data set can be painful for the system, even when the scan is
sequential. What if I am interested only in a particular app id? The system
will scan the complete data set and filter it row by row. What if the data were
already filtered, so that I did not need to scan the complete data set? This
mechanism is called partitioning. We can partition our data by some predefined
filter columns, so that Impala scans only the partitions that match your filter
conditions.
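The difference between a full scan and a partitioned scan can be illustrated with a small toy model. This is not Impala code, only a sketch of the idea: the rows, the app_id values, and the function names are all made up for illustration. In Impala, the equivalent layout is declared with a PARTITIONED BY clause on the table.

```python
from collections import defaultdict

# Hypothetical rows: (app_id, event). Values are illustrative only.
rows = [(1, "open"), (2, "click"), (1, "close"),
        (3, "open"), (2, "open"), (1, "click")]

def full_scan(rows, app_id):
    """Read every row and filter row by row; returns (matches, rows_read)."""
    matches = [r for r in rows if r[0] == app_id]
    return matches, len(rows)

def build_partitions(rows):
    """Group rows by app_id, the way a table partitioned on app_id stores them."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    return partitions

def partition_scan(partitions, app_id):
    """Read only the partition for app_id; returns (matches, rows_read)."""
    matches = partitions.get(app_id, [])
    return matches, len(matches)

partitions = build_partitions(rows)
full_matches, full_read = full_scan(rows, 1)
part_matches, part_read = partition_scan(partitions, 1)

print(full_read, part_read)  # 6 3
```

Both scans return the same rows, but the partitioned scan touches only the rows for the requested app id.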
Aggregate level
I also found that you can tune Impala through your queries. For example, we can
do the date parsing before giving the input to Impala, so that Impala does not
spend much time parsing and filtering rows. Join optimization can also be
considered.
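As a minimal sketch of parsing dates before the data reaches Impala, the snippet below converts date strings into integer year, month, and day columns during a preprocessing step. The input format, the sample rows, and the function name are assumptions for illustration; your raw data will likely look different.

```python
from datetime import datetime

def preparse(raw_rows, fmt="%d/%m/%Y %H:%M:%S"):
    """Turn (date_string, value) rows into (year, month, day, value) rows."""
    out = []
    for date_str, value in raw_rows:
        dt = datetime.strptime(date_str, fmt)  # parse once, outside Impala
        out.append((dt.year, dt.month, dt.day, value))
    return out

# Hypothetical raw input rows.
raw = [("05/03/2015 14:22:10", 7), ("17/11/2014 09:01:55", 3)]
clean = preparse(raw)

print(clean)  # [(2015, 3, 5, 7), (2014, 11, 17, 3)]
```

With the date already split into integer columns, a query can filter on plain integer comparisons instead of parsing a string for every row.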
Summary:
Impala is a great tool if you optimize it from the storage level to the
processing level. Storage and processing have an equal impact on the system.
Sometimes, an optimization technique improves scalability more
than performance. For example, reducing memory usage for a query might not
change the query performance much, but might improve scalability by allowing
more Impala queries or other kinds of jobs to run at the same time without
running out of memory.
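The scalability point can be sketched with simple arithmetic. All numbers here are assumptions chosen for illustration (64 GB of memory available to queries, 8 GB versus 4 GB per query), not measurements from the test.

```python
def max_concurrent(total_mem_gb: float, per_query_gb: float) -> int:
    """How many queries of a given memory footprint fit in memory at once."""
    return int(total_mem_gb // per_query_gb)

NODE_MEM_GB = 64   # ASSUMPTION: memory available to queries

before = max_concurrent(NODE_MEM_GB, 8)  # 8 GB per query before tuning
after = max_concurrent(NODE_MEM_GB, 4)   # 4 GB per query after tuning

print(before, after)  # 8 16
```

Each query may run no faster on its own, yet twice as many fit in memory at the same time.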