Data and Analytics

Athena

Query service that let you analyze data stored in S3. It uses SQL language to do the query.

You will be putting data into S3 then you will query it with Athena.

You pay per TB of data scanned. No need to pay for any server since it is serverless.

It is commonly used with Quicksights for reporting/dashboards.

Use case: Business intelligence / analytics / reporting and analyze logs

Serverless SQL to analyze S3 use Athena!

Improving performance

To improve Athena for performance you want to scan less data. Apache Parquet or ORC is recommended, you want to use Glue to convert your data. So that you can get column data and not all the data.

Do some compression for smaller retrievals/

Partition your S3 so that it is easier for querying virtual columns.

You only want to get subset of the column so you will be paying less.

Use larger files since it is easier to scan > lots of smaller files

Federated query

Allows you to run SQl queries across data stored in relational, non-relational, object, and custom data source. So basically let you run Athena on any type of databases even on-premise database.

You need a Data Source Connectors that runs on AWS Lambda to run Federated Queries.

Redshift

It is a database but also do analytics. The databse is based on PostgresSQL. For data warehouse and analytics.

Load your data into Redshift and then do analytics. Column of storage data and then do parallel query engine.

SQL interface for performing queries.

Comparing with Athena, you have to load data into Redshift first, then it will have much faster query and join compared to Athena. It has indexes to have high performance. If many query, joints needed Redshift is better than Athena.

Redshift Cluster

Provision the node size in advance, can use reserved instances for cost saving

Has leader node for query planning and result aggregation.

Compute node: Actually carry out the queries

Snapshots and disaster recovery

Redshift can have multi-AZ for some clusters.

Snapshots are point in time backups of a cluster stored in S3, you can restore the snapshot to a new cluster. Snapshots can be taken automatically or you can do manual snapshot retained until you delete it.

You can also automatically copy snapshot of a cluster to another AWS region to do disaster recovery.

Inserting data into Redshift

You can insert data into Redshift via Kinesis data firehose (which will deposit it into S3 then it will issue a copy from S3 to redshift)

You can also copy the data from S3 from Redshift by issuing a copy command with the correct IAM role.

Or you can write from you EC2 instances into Redshift, but do it in large batch much more efficient.

Redshift spectrum

This is how you can query data that is already in S3 without loading it into Redshift. How do we do it? We have a redshift cluster available that's a must, then you will submit the query from your cluster, that query will reach the Redshift spectrum nodes (they sit in front of your S3 and then query the result, there are thousands of them so it is very efficient), then it will return you the data to the one that asked for the query.

You can leverage the nodes in Redshift spectrum rather than your own cluster to perform much more efficient query.

OpenSearch

Successor to ElasticSearch