You have a ton of data, gigabytes of it, all in thousands - if not hundreds of thousands - of separate files. How on earth do you go about sorting it out and creating a database? After all, maybe databases aren't your thing. Or worse, you don't have time to create fancy databases. One of the biggest hurdles Troy Hunt describes when updating haveibeenpwned.com is the time it takes to batch-load hundreds of millions of records into a database.
If you read my last post, you will know I experimented with Amazon Redshift, an insanely fast SQL database with regex support. The issue I had was the cost: roughly £50,000 per year for the size and speed I was running. For my project I certainly didn't need the speed, so why pay for capability I don't need?
AWS Athena is literally, in name and usability, a godsend. Athena is a full SQL database: you create tables and write queries in the same manner as any other SQL database, but the difference is how records are ingested - there are no INSERT or COPY statements to be found. The brilliance of Athena is that you create an S3 bucket and just point your database at it.
Athena seamlessly trawls data in your requested format straight from S3 and runs queries on it. It is a bit slower than Redshift, but the hours it saves you in upfront data wrangling are incredible.
Let's work through an example with just 20GB of data. You will need an S3 bucket and an AWS account with access to Athena - set up a trial account as per the last blog post, or carry on using the same S3 bucket (as I will here in this post).
First let's get our data up into S3. Athena works best on files between roughly 300MB and 1GB: any larger and it bogs Athena down, any smaller and the nodes behind the scenes have to do more work per file. Another advantage of this solution is that Athena can deal with compressed files, so let's upload 20GB of gzipped CSV files (roughly 22 million rows).
That's all we have to do to get data into the database we are about to create; as you add more files to the bucket, they can be scanned for data too. If you want data removed from the database, just delete the file. It really doesn't get any simpler than this - the speed of "inserts" is now limited only by your transfer speed into S3!
OK, let's move over to AWS Athena. When you select it via the Services tab in your account, you will see a not-very-inspiring user interface:
It isn't exactly obvious from here, but first we need to create a database; we do that either by running a query or by hitting the Create Table link.
Here is a simple query for creating a table over the data we just uploaded (just paste it in and hit Run Query):
CREATE EXTERNAL TABLE IF NOT EXISTS s3crt_small.crt (
  partitionkey string,
  record int,
  rowkey string,
  base64 string,
  domains string,
  num1 float,
  num2 float
)
-- LazySimpleSerDe reads plain delimited text; the properties below set the comma delimiter
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
-- Point the table at the S3 folder holding the gzipped CSV files
LOCATION 's3://redshifts3dev/argon2019small/'
TBLPROPERTIES ('has_encrypted_data'='false');
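One thing to note: the CREATE EXTERNAL TABLE statement above assumes the s3crt_small database already exists. If it doesn't, you can create it first with a one-liner (the name here simply matches the table statement above):

-- Create the database (schema) that the external table lives in
CREATE DATABASE IF NOT EXISTS s3crt_small;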
However, the Create Table link gives you a little wizard that does the same thing in a much simpler way:
All very self-explanatory, I hope.
The last option is to create partitions; if your dataset needs them, create them here. My data didn't need any, so I skipped this optional step.
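For reference, if your data did warrant partitioning, a partitioned version of the table above might look something like the sketch below - the dt column and the dt=2019-01 folder are purely illustrative, not part of my actual dataset:

CREATE EXTERNAL TABLE IF NOT EXISTS s3crt_small.crt_partitioned (
  partitionkey string,
  record int,
  rowkey string,
  base64 string,
  domains string,
  num1 float,
  num2 float
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.format' = ',', 'field.delim' = ',')
LOCATION 's3://redshifts3dev/argon2019small/';

-- Register each partition folder as you add it, so queries only scan the folders they need
ALTER TABLE s3crt_small.crt_partitioned
ADD PARTITION (dt = '2019-01') LOCATION 's3://redshifts3dev/argon2019small/dt=2019-01/';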
That's it, we are done!
Let's run a query and see what we get and how fast this is:
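A simple lookup against the domains column, something along these lines (the domain value here is just a placeholder):

SELECT partitionkey, domains, base64
FROM s3crt_small.crt
WHERE domains LIKE '%example.com%'
LIMIT 100;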
Not exactly speedy, but considering we did this in essentially four steps and didn't have to worry about hours of inserts and data wrangling to get everything in order, 45 seconds to parse 20GB of unordered data is pretty cool. To be honest, the way Athena is built (scaling out rather than up) means the speed doesn't vary much however much data it looks through; it never gets much slower or faster. The same data in Redshift came back in 0.8ms, so you really need to weigh up the cost/speed benefits of each.
The crazy thing with Athena, though, is that REGEX queries are faster than straight queries like the one above:
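In Athena (which runs Presto under the hood) that means using regexp_like() rather than LIKE; a query along these lines, with a purely illustrative pattern:

SELECT partitionkey, domains
FROM s3crt_small.crt
WHERE regexp_like(domains, '(^|\.)example\.(com|co\.uk)$')
LIMIT 100;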
That really is all there is to using Athena with S3 buckets. It's super simple for analysing large datasets without the faff or expense of building SQL servers, using Storage Tables, etc. As I throw more data up into the S3 bucket, the same queries will search through that too.
Talking of costs, Athena bills per byte of data scanned, and of course you also need to factor in S3 storage costs.
Other benefits are that all query results are saved into your S3 bucket and you can export results too; you could even put the results into another bucket folder and create a smaller Athena table over them (greatly reducing the data scanned, and therefore the cost).
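Athena's CREATE TABLE AS SELECT (CTAS) makes that last idea fairly painless; a sketch, with the output path, format and filter all being assumptions of mine:

CREATE TABLE s3crt_small.crt_subset
WITH (
  format = 'PARQUET',  -- columnar output, so later queries scan fewer bytes
  external_location = 's3://redshifts3dev/argon2019small-subset/'
) AS
SELECT partitionkey, domains, base64
FROM s3crt_small.crt
WHERE domains LIKE '%example.com%';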
If you play around with Redshift and Athena, I hope these guides will help you and that you'll let me know how you get on!