You have a ton of data, gigabytes of it, all in thousands - if not hundreds of thousands - of seperate files. How on earth do you go about sorting it out and creating a database? After all maybe databases aren't your thing. Or worse, you don't have time to create fancy databases. One of the greatest hurdles that Troy Hunt describes having when updating haveibeenpwned.com is the time it takes to batch load hundreds of millions of records into a database.
If you read my last post, then you will know I experimented with Amazon's RedShift, an insanely fast sql database with the ability to regex. The issue I had was the cost, roughly £50,000 per year for the size and speed I was running. For my project I certainly didnt need the speed, so why pay the costs of having a function I don't need?
AWS Athena is literally, in name and useability, a godsend. Athena is a full sql database, you create tables and queries in the same manner as any other sql database, but the difference is ingestion of records, no INSERT or COPY statements to be found. The brilliance of Athena is that you create an S3 Bucket and just point your database to it.
Athena seemlessly trawls data in your requested format straight from S3 and runs queries on it. It is a bit slower than Redshift but the hours it saves you fumbling with data upfront is incredible.
Let's work through an example with just 20GB of data. You will need to create an S3 Bucket and an account for Athena, setting up a trial account as per the last blog post, or continue on using the same S3 bucket (as I will here in this post).
First let's get our data up into S3. Athena works best on files between 300GB and 1TB, any larger and it bogs Athena down, any smaller and the nodes behind the scenes have to do more work. Another advatange to this solution is that Athena can deal with compressed files, so let's upload 20GB of gzip'd csv files (roughly 22 million rows).
That's all we have to do to get data into our the database which we are about to create; as you add more files to the bucket, they can be scanned for data. If you want data removed from a database, just delete the file. It really doesn't get any simpler than this - speed of "inserts" are now limited to your transfer speed into S3!
OK, let's move over to AWS Athena. When you select it via the services tab in your account you will see a not-very-inspiring user interface:
It isn't exactly obvious from here but first we need to create a database; we do that by either running a query or hitting the Create Table link.
Here is a simple query for creating a database (just paste and hit Run Query):
CREATE EXTERNAL TABLE IF NOT EXISTS s3crt_small.crt (
partitionkey string,
record int,
rowkey string,
base64 string,
domains string,
num1 float,
num2 float
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://redshifts3dev/argon2019small/'
TBLPROPERTIES ('has_encrypted_data'='false');
However, the Create Table link gives you a little wizard for doing things in a much simpler method:
All very self explanatory I hope.
The last option is to create partitions, if your dataset needs one, then create it here. For my data, I did not need it so I skipped the optional step.
That's it we are done!
Let's run a query and see what we get and how fast this is:
Not exactly speedy, but considering we did this in essentially four steps and didn't have to worry about doing hours of inserts and data wrangling to get everything in order, 45 seconds to parse 20GB of unordered data is pretty cool. To be honest, the way Athena is built (scaling out rather than up) means that the speed doesn't vary much no matter how much data it looks through, never getting much slower or faster. The same data in Redshift was 0.8ms, so you need to really consider the cost/speed benefits of each.
The crazy thing with Athena though is that REGEX queries are faster than straight queries like that above:
That really is all there is to using Athena with S3 Buckets. It's super simple for analysis of large data without the faff or expense of building SQL servers or using Storage Tables etc. As I throw more data up into the S3 Bucket, the search queries will be able to search those too.
Talking of costs, Athena works on billing per bytes scanned and of course you also need to factor in S3 costs as well.
Other benefits are that all searches are saved into your S3 Bucket and you can also export results too, maybe even put the results into another bucket folder and create a smaller Athena search from that (greatly reducing costs).
If you play around with Redshift and Athena, I hope these guides will help you and that you'll let me know how you get on!
I appreciate Henry for making me realise the truth to a certified hacker who knows a lot about what he is doing. I strongly recommend you hire him because he’s the best out there and always delivers. I have referred over 10 people to him and all had positive results. He can help you hack into any devices, social networks including – Facebook, Hangout, iMessages, Twitter accounts, Snap chat , Instagram, Whatsapp, wechat, text messages ,smartphone cloning,tracking emails and also any other social media messenger or sites. It’s advisable to hire a professional hacker.Thank me later. Contact him here., Henryclarkethicalhacker@gmail.com and you can text, call and Whatsapp him on +1(201)4305865, or +1(219)7960574.....
Contact him for any type of hacking, he is a professional hacker that specializes in exposing cheating spouses, and every other hacking related issues. he is a cyber guru, he helps catch cheating spouses by hacking their communications like call, Facebook, text, emails, Skype, whats-app and many more. I have used this service before and he did a very good job, he gave me every proof I needed to know that my fiancee was cheating. You can contact him on his email to help you catch your cheating spouse, or for any other hacking related problems, like hacking websites, bank statement, grades and many more. he will definitely help you, he has helped a lot of people, contact him on,…
I’m excited to write about Henry Hacker, he is a great and brilliant hacker who penetrated my spouse’s phone without a physical installation app. And I was able to access my spouse’s phone, SMS, Whatsapp, Instagram, Facebook, Wechat, Snapchat, Call Logs, Kik, Twitter and all social media. The most amazing thing there is that he restores all phone deleted text messages. And I also have access to everything including the phone gallery without touching the phone.I can see the whole secret of my spouse. Contact him for any hacking service. He is also a genius in repairing Credit Score, increasing school grade, Clear Criminal Record etc. His service is fast. Contact:, Henryclarkethicalhacker@gmail.com and you can text, call him on whatsapp…