
Building a Data Warehouse with AWS Redshift and Ruby

Most startups eventually need a robust solution for storing large amounts of data for analytics. Perhaps you’re running a video app and trying to understand user drop-off, or you’re studying user behavior on your website, as we do at Credible.

You might start with a few tables in your primary database. Soon you may create a separate web app with a nightly cron job to sync data. Before you know it, you have more data than you can handle, jobs are taking way too long, and you’re being asked to integrate data from more sources. This is where a data warehouse comes in handy. It allows your team to store and query terabytes or even petabytes of data from many sources without writing a bunch of custom code.

In the past, only big companies like Amazon had data warehouses because they were expensive, hard to set up, and time-consuming to maintain. With AWS Redshift and Ruby, we’ll show you how to set up your own simple, inexpensive, and scalable data warehouse. We’ll provide sample code that will show you how to extract, transform, and load (ETL) data into Redshift, as well as how to access the data from a Rails app.

Part I: Setting up AWS Redshift

Creating a Redshift Cluster

We chose AWS’s Redshift offering because it’s easy to set up, inexpensive (it’s AWS after all), and its interface is pretty similar to that of Postgres, so you can manage it with a Postgres database manager for OS X and query it from Ruby via an ActiveRecord adapter. Let’s begin by logging into your AWS console and creating a new Redshift cluster. Make sure to write down your cluster info as we’ll need it later.

We’re going with a single node here for development and QA environments, but for production you’ll want to create a multi-node cluster so you get faster importing and querying and can handle more data.

You can optionally encrypt the data and enable other security settings here. For the purposes of this tutorial, you can go with the defaults the rest of the way. Note that you’ll start incurring charges once you create the cluster ($0.25 an hour for a dc1.large node, with the first two months free).
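If you’d rather script cluster creation than click through the console, the aws-sdk-redshift gem exposes the same operation. Here’s a minimal sketch; the cluster identifier, database name, and credentials are placeholders you’d replace with your own:

require 'aws-sdk-redshift'

redshift = Aws::Redshift::Client.new(region: 'us-east-1')

# Creates a single-node dc1.large cluster; for production, switch to
# cluster_type: 'multi-node' and set number_of_nodes.
redshift.create_cluster(
  cluster_identifier: 'redshift-ruby-tutorial', # placeholder name
  node_type: 'dc1.large',
  cluster_type: 'single-node',
  db_name: 'dev',                               # placeholder database name
  master_username: 'masteruser',                # placeholder credentials
  master_user_password: 'ChangeMe1'
)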

When you’re done, you’ll see a summary page for the cluster. Jot down the hostname shown under Endpoint.

By default, nothing is allowed to connect to the cluster. You can authorize a connection for your computer by going to Security > Add Connection Type > Authorize; AWS will automatically fill in your current IP address for convenience.

Verifying Your Cluster

Now, let’s try connecting to your cluster using your Postgres client. You’ll need to create a Favorite and fill in the info you used to create the cluster. Note that the Endpoint URL you got from the Redshift cluster contains both the host and the port; you’ll need to put them in separate fields.

If everything is filled in correctly, the connection will succeed.
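You can also verify the connection from Ruby with the pg gem, since Redshift speaks the Postgres wire protocol. The endpoint and credentials below are placeholders for your own cluster info:

require 'pg'

conn = PG.connect(
  host: 'your-cluster.abc123.us-east-1.redshift.amazonaws.com', # placeholder endpoint
  port: 5439,            # Redshift's default port
  dbname: 'dev',
  user: 'masteruser',
  password: 'ChangeMe1'
)

# A trivial query proves the cluster is reachable.
puts conn.exec('SELECT version();').first['version']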

Congrats, you’ve created your first data warehouse! For your production environment, you may want to beef up the security or use a multi-node cluster for redundancy and performance.

The next step is to configure Redshift so we can load data into it. Redshift behaves like Postgres for the most part: you need to create tables ahead of time and specify the data types for each column. There are some differences that may trip you up, though. We ran into issues at first because the default Rails data types don’t map over directly. Here are some Rails data types and the Redshift types they should map to:

  • integer => int
  • string => varchar
  • date => date
  • datetime => timestamp
  • boolean => bool
  • text => varchar(65535)
  • decimal(precision, scale) => decimal(precision, scale)

Note that the ID column should be of type “bigint”. The Redshift documentation has more details. Here’s roughly how the “users” table maps for the sample app.
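As a sketch, here’s what that DDL looks like using the mappings above; the columns are illustrative of a typical Rails users table rather than the sample app’s exact schema, and conn is the connection from the verification snippet earlier:

# Illustrative Redshift DDL for a Rails-style users table.
conn.exec(<<~SQL)
  CREATE TABLE users (
    id         bigint,          -- Rails integer id => bigint
    email      varchar,         -- string => varchar
    bio        varchar(65535),  -- text => varchar(65535)
    admin      bool,            -- boolean => bool
    created_at timestamp,       -- datetime => timestamp
    updated_at timestamp        -- datetime => timestamp
  );
SQL
# Note: no password column here -- see the advice below on omitting
# sensitive fields.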

You should also note that we didn’t map every field. You’ll want to omit sensitive fields like “password” and add other fields only on an as-needed basis to reduce complexity and cost.

Part II: Extracting, Transforming, and Loading (ETL)

Create an S3 Bucket

You’ll need to create an S3 bucket either via the AWS Console or through their API. For this sample, we’ve created one called “redshift-ruby-tutorial”.
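If you go the API route, the aws-sdk-s3 gem does it in a couple of lines. This assumes your AWS credentials are already available to the SDK, and the region is just an example:

require 'aws-sdk-s3'

# Credentials come from ENV or ~/.aws/credentials; region is an example.
s3 = Aws::S3::Client.new(region: 'us-east-1')
s3.create_bucket(bucket: 'redshift-ruby-tutorial')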

Set Up the Sample App

We created a sample Rails app for this part. It contains a User table, some seed data, and a Loader class that will perform the ETL. The high-level approach is to output the User data to CSV files, upload the files to an AWS S3 bucket, and then trigger Redshift to load the CSV files.
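To make the moving parts concrete before we dig into the real code, here’s a condensed sketch of that flow. The sample app’s Loader differs in the details; the bucket name, column list, and environment variables here are placeholders:

require 'csv'
require 'aws-sdk-s3'
require 'pg'

class Loader
  BUCKET = 'redshift-ruby-tutorial' # the bucket created earlier

  def self.run
    # 1. Extract: dump the users table to CSV (User is the ActiveRecord model).
    csv = CSV.generate do |rows|
      User.find_each do |user|
        rows << [user.id, user.email, user.created_at, user.updated_at]
      end
    end

    # 2. Upload the CSV to S3.
    Aws::S3::Resource.new(region: 'us-east-1')
      .bucket(BUCKET).object('users.csv').put(body: csv)

    # 3. Load: have Redshift ingest the file with COPY.
    conn = PG.connect(host: ENV['REDSHIFT_HOST'], port: 5439, dbname: 'dev',
                      user: ENV['REDSHIFT_USER'], password: ENV['REDSHIFT_PASSWORD'])
    conn.exec(<<~SQL)
      COPY users (id, email, created_at, updated_at)
      FROM 's3://#{BUCKET}/users.csv'
      CREDENTIALS 'aws_access_key_id=#{ENV['AWS_ACCESS_KEY_ID']};aws_secret_access_key=#{ENV['AWS_SECRET_ACCESS_KEY']}'
      CSV;
    SQL
  end
end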

Let’s start by cloning the app:

git clone git@github.com:tuesy/redshift-ruby-tutorial.git
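Once the app is configured with your cluster and AWS credentials (after the usual bundle install and database setup), a loader like the sketch above would be kicked off from a Rails console or a nightly cron-driven task:

# e.g. in `rails console`, or wrapped in a scheduled rake task
Loader.run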
