
cd redshift-ruby-tutorial

Next, update your environment variables by editing and sourcing ~/.bash_profile, using the values from when you created your cluster:

# redshift-ruby-tutorial
export REDSHIFT_HOST=redshift-ruby-tutorial.ccmj2nxbsay7.us-east-1.redshift.amazonaws.com
export REDSHIFT_PORT=5439
export REDSHIFT_USER=deploy
export REDSHIFT_PASSWORD=<your password here>
export REDSHIFT_DATABASE=analytics
export REDSHIFT_BUCKET=redshift-ruby-tutorial
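
Then source the file so the new values take effect in your current shell:

source ~/.bash_profile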

We’re ready to bundle our gems, create our database, and seed the dummy data:

bundle install
bundle exec rake db:setup

Before we run ETL, let’s check the connection to Redshift. This should return “0 users” because we haven’t loaded any data yet:

bundle exec rails c
RedshiftUser.count

Now let’s run ETL and then count users again (there should be some users now):

require 'loader'
Loader.load
RedshiftUser.count

Here’s an example of the output you should see:

~/git/redshift-ruby-tutorial(master)$ bundle exec rails c
Loading development environment (Rails 4.2.3)
irb(main):001:0> RedshiftUser.count
unknown OID 16: failed to recognize type of 'attnotnull'. It will be treated as String.
   (1055.2ms)  SELECT COUNT(*) FROM "users"
=> 0
irb(main):002:0> require 'loader'
=> true
irb(main):003:0> Loader.load
  User Load (0.2ms)  SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT 1000
INFO:  Load into table 'users' completed, 6 record(s) loaded successfully.
=> #<PG::Result:0x007ff31da1de08 status=PGRES_COMMAND_OK ntuples=0 nfields=0 cmd_tuples=0>
irb(main):004:0> RedshiftUser.count
   (95.7ms)  SELECT COUNT(*) FROM "users"
=> 6
irb(main):005:0> RedshiftUser.first
  RedshiftUser Load (1528.4ms)  SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT 1
=> #<RedshiftUser id: 1, name: "Data", email: "data@enterprise.fed", sign_in_count: 0, current_sign_in_at: nil, last_sign_in_at: nil, current_sign_in_ip: nil, last_sign_in_ip: nil, created_at: nil, updated_at: nil>

How to Connect to Redshift

Rails lets you configure each model to connect to a separate database, so we created a base class for all the models that will connect to Redshift:

class RedshiftBase < ActiveRecord::Base
  establish_connection Rails.application.secrets.redshift_config

  self.abstract_class = true
end

For the RedshiftUser class, we only need to specify the table name; otherwise Rails would look for a table named “redshift_users”. This is also necessary because we have our own User class for the local database.

class RedshiftUser < RedshiftBase
  self.table_name = :users
end

With this configured, you can query the table. For associations, you’ll need a bit more customization if you want niceties like “@user.posts”.
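
As a minimal sketch of what that could look like, assuming a hypothetical posts table in Redshift with a user_id column (not part of this tutorial’s schema):

# sketch only -- assumes a hypothetical "posts" table in Redshift with a user_id column
class RedshiftPost < RedshiftBase
  self.table_name = :posts
end

class RedshiftUser < RedshiftBase
  self.table_name = :users

  # point the association at the Redshift-backed class rather than a local Post model
  has_many :posts, class_name: 'RedshiftPost', foreign_key: :user_id
end

Because both classes inherit the RedshiftBase connection, @user.posts would run its query against Redshift.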

How to ETL

This task is performed by the Loader class. We begin by connecting to AWS and Redshift:

# setup AWS credentials
Aws.config.update({
  region: 'us-east-1',
  credentials: Aws::Credentials.new(
    ENV['AWS_ACCESS_KEY_ID'],
    ENV['AWS_SECRET_ACCESS_KEY'])
})

# connect to Redshift
db = PG.connect(
  host:     ENV['REDSHIFT_HOST'],
  port:     ENV['REDSHIFT_PORT'],
  user:     ENV['REDSHIFT_USER'],
  password: ENV['REDSHIFT_PASSWORD'],
  dbname:   ENV['REDSHIFT_DATABASE'],
)

This is the heart of the process. The source data comes from the User table. We’re fetching users in fixed-size batches to avoid timeouts. For now, we’re querying for all users, but you can modify this to return only active users, for example.

Don’t be alarmed by all the nested blocks: we’re just creating temporary files, building an array of column values for each record, and compressing the data with gzip to save time and money. We’re not doing any transformation here, but you could do things like format a column or generate new columns. We upload each CSV file to our S3 bucket as each batch is processed, but you could upload after everything is generated if desired.

# extract data to CSV files and upload to S3
User.find_in_batches(batch_size: BATCH_SIZE).with_index do |group, batch|
  Tempfile.open(TABLE) do |f|
    Zlib::GzipWriter.open(f) do |gz|
      csv_string = CSV.generate do |csv|
        group.each do |record|
          csv << COLUMNS.map{|x| record.send(x)}
        end
      end
      gz.write csv_string
    end

    # upload to s3
    s3 = Aws::S3::Resource.new
    key = "#{TABLE}/data-#{batch}.gz"
    obj = s3.bucket(BUCKET).object(key)
    obj.upload_file(f)
  end
end
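
If you only wanted to export active users, as mentioned above, you could scope the relation before batching. This is just a sketch and assumes a hypothetical active boolean column on users:

# sketch only -- assumes users has a hypothetical "active" boolean column
User.where(active: true).find_in_batches(batch_size: BATCH_SIZE).with_index do |group, batch|
  # same CSV generation, gzip, and S3 upload as above
end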

Finally, we clear existing data in this Redshift table and tell Redshift to load the new data from S3. Note that we are specifying the column names for the table so that the right data goes to the right columns in the database. We also specify “GZIP” so that Redshift knows that the files are compressed. Using multiple files also allows Redshift to load data in parallel if you have multiple nodes.

# clear existing data for this table
db.exec <<-EOS
  TRUNCATE #{TABLE}
EOS

# load the data, specifying the order of the fields
db.exec <<-EOS
  COPY #{TABLE} (#{COLUMNS.join(', ')})
  FROM 's3://#{BUCKET}/#{TABLE}/data'
  CREDENTIALS 'aws_access_key_id=#{ENV['AWS_ACCESS_KEY_ID']};aws_secret_access_key=#{ENV['AWS_SECRET_ACCESS_KEY']}'
  CSV EMPTYASNULL GZIP
EOS

There are other improvements you can add. For example, using a manifest file, you can have full control over which CSVs are loaded. Also, while the current approach truncates and reloads the table on each run, which can be slow, you can do incremental loads.
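
As a rough sketch of the manifest approach (not part of the tutorial’s code): collect the S3 keys as each batch is uploaded, write a manifest JSON file to the bucket, and add the MANIFEST option to the COPY. Here, keys stands in for an array of the uploaded object keys:

# sketch only -- "keys" is assumed to be the array of S3 keys collected during the upload loop
manifest = {
  entries: keys.map { |key| { url: "s3://#{BUCKET}/#{key}", mandatory: true } }
}
s3 = Aws::S3::Resource.new
s3.bucket(BUCKET).object("#{TABLE}/manifest.json").put(body: manifest.to_json)

# COPY loads only the files listed in the manifest
db.exec <<-EOS
  COPY #{TABLE} (#{COLUMNS.join(', ')})
  FROM 's3://#{BUCKET}/#{TABLE}/manifest.json'
  CREDENTIALS 'aws_access_key_id=#{ENV['AWS_ACCESS_KEY_ID']};aws_secret_access_key=#{ENV['AWS_SECRET_ACCESS_KEY']}'
  CSV EMPTYASNULL GZIP MANIFEST
EOS

For incremental loads, you would skip the TRUNCATE and export only the records that changed since the last run.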
