How to build a data warehouse with AWS Redshift
Updated: Jan 22
I've been working on an idea for a while now that needed the ability to store and search through a large amount of data, we are not talking about Big Data but just normal day-to-day Terabytes of data.
First a little background as to what I tried and why it didn't work:
I initially looked at building a normal RDBMS system, something like MySQL or Postgres as they are systems I am already comfortable with, but the downside is that I need this dataset on the move wherever I am in the world. I can build it on my laptop but that means zero space for anything else, and building it onprem on our Petabyte array wouldn't give me the portability I need without exposing ports to the internet. To the cloud I went!
My first attempts involved an Azure VM with SQL running, but this turned out to be unfeasible because of the speed and lack of scalability. So I spoke with two people that are well versed in this areas: @troyhunt (on Setting up Table Storage) and @Scott_Helme (on Hacking Table Storage). Both have massively large datasets that are queried by a webapp, the end goal I had in mind!
They suggested I use a Azure Table Storage. It was fantastic: super easy to chuck data into and the speed was amazing off the bat. But it had one flaw, it doesn't support full SQL queries and the one thing I needed was the "like" query. I abandoned my now third attempt at building this and looked at alternatives. Thanks to @DeathsPirate's suggestion, I turned to Amazon services and Redshift, a data warehouse solution for massive data. This turned out to be exactly what I was looking for.
You need a two things to start with to make your first data warehouse, the first being an Amazon AWS account, the second is your data.
If you're going to give Redshift a go, I highly recommend that you use the 2 month trial if you haven't already got an account.
Redshift is a scale-out rather than a scale-up cluster system, which means you don't keep adding more and more ram or CPU. You add nodes to a cluster to massively parallel the workload, Redshift takes care of all the mechanics of that but for our example we need to build that cluster first.
Select the new cluster button and go with the defaults, there are many types of cluster available but for POC testing the default is more than enough. This builds a Dense Compute cluster with one Lead Node and 2 Cluster Nodes underneath it, each with 2 CPU cores, 15GB of ram and 160GB of SSD storage attached.
Once it is up and running you will be given a summary page with a few key bits of information you need going forward. Specifically the EndPoint (highlighted) and the JDBC URL further down the page.
We now need to ingest data somehow and whilst you might be tempted to think that Insert statements from your SQL days are the easiest, you should consider using the COPY command and S3 Buckets. Obviously your data might be coming from somewhere else but in our example here we are going to be dealing with CSV files.
Setting up an S3 Bucket is incredibly easy on AWS, just select S3 from the services menu and you will be presented with the option to create a new bucket; beyond giving it a name and selecting the region there's not much to configure. I made my Redshift Cluster in US-East-2 (Ohio), so it makes sense I select the same region for speed reasons.
Once we have the S3 bucket running we can upload a file to it, in our case a single CSV file called 'input_headless.csv'.
The next step is to be able to interact with Redshift and create the tables and import the file from S3. There are several ways you can do this, first is to use SQL Workbench/J (note this is not Oracle's SQL Workbench). By using SQL workbench/J we can feed it the JBDC URL from above and connect to the database. The second method is probably easiest for most starting out. The AWS Redshift Console has a new function called Query Editor.
From here or Workbench/J you can create your table. Here is my example one:
Before we can inport data from S3 into Redshift we have to make sure that the permissions are set correctly. Select the IAM service from the services menu in AWS and create a new role for Redshift and select Redshift customisable:
There are loads of premade ones, but we only need readonly access so search for S3 and you will see AmazonS3ReadOnlyAccess:
After the role is created you will see an ARN - copy this as we will need this soon. For now we need to associate this new role with our Redshift Cluster. You do this by going back into Redshift Cluster settings and managing the IAM roles.
Select your new role from the list and finally we are ready to import our data straight from the S3 bucket. We do that by using the COPY command as mentioned earlier:
Once all the data is imported we can now start using the database as normal.
Hopefully this has been helpful in explaining the first steps in getting working with Amazon Redshift.