AWS Athena

AWS Athena is a cloud-based query service built on Presto that lets you query data in S3 using standard SQL. Athena is serverless, so there is no infrastructure to provision or manage. Customers only pay for the queries they run.

Athena is very easy to use. The user defines the schema based on the data in S3, which can be a single file or multiple files in a bucket. Athena can parse files in CSV, JSON, ORC, Avro, and Parquet formats.

Athena is cheap: users are charged based on the amount of data scanned by each query. Files can be compressed (for example, with gzip), which reduces the data that needs to be scanned and therefore lowers cost and improves performance. Files can also be stored in a columnar format and partitioned, both of which further reduce the data scanned per query.
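To make the pay-per-scan idea concrete, here is a rough back-of-the-envelope sketch. The price per terabyte, the compression ratio, and the column count are all assumptions chosen for illustration, not quoted AWS figures, so check the current Athena pricing page before relying on any number.

```python
# Rough cost estimate for a single query, based only on bytes scanned.
# ASSUMPTION: the price per terabyte is a placeholder, not a quoted AWS rate.
ASSUMED_USD_PER_TB = 5.00
BYTES_PER_TB = 1024 ** 4


def estimate_query_cost(bytes_scanned, usd_per_tb=ASSUMED_USD_PER_TB):
    """Return the estimated cost in USD for scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


# Hypothetical dataset sizes, purely for illustration.
raw_csv = 1024 ** 4          # 1 TB of uncompressed CSV
compressed = raw_csv // 4    # assume compression shrinks it to a quarter
one_column = compressed // 10  # assume a columnar layout lets the query read 1 of 10 equal columns

for label, size in [
    ("uncompressed CSV", raw_csv),
    ("compressed", compressed),
    ("compressed, columnar, one column", one_column),
]:
    print(f"{label}: ${estimate_query_cost(size):.2f}")
```

The point of the sketch is only the proportionality: anything that cuts the bytes scanned cuts the bill by the same factor.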

Athena is fast, even over large datasets. It automatically executes queries in parallel, so most queries return results within seconds.

Athena is integrated with the AWS Glue Data Catalog. Glue can create a unified metadata repository across various services, crawl data sources to discover schemas, populate the Data Catalog with new and modified table and partition definitions, and maintain schema versioning. Glue can also convert data into columnar formats, which lowers cost and improves performance.

To create a table in Athena, open the Athena console, go to Catalog Manager, and click Add Table. Fill in the table details, select the data format, and define the columns. The table is then ready to query.
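The console wizard is not the only route: a table can also be defined with a DDL statement run in the query editor. The sketch below builds such a statement for comma-delimited files. The helper build_create_table, the database and table names, the columns, and the bucket path are all hypothetical and exist only for illustration.

```python
def build_create_table(database, table, columns, s3_location, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for delimited text files.

    `columns` is a list of (name, type) pairs. All names used in the
    example call below are hypothetical.
    """
    column_lines = ",\n".join(
        f"  `{name}` {col_type}" for name, col_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}'"
    )


ddl = build_create_table(
    database="sales_db",                        # hypothetical database
    table="orders",                             # hypothetical table
    columns=[
        ("order_id", "bigint"),
        ("customer_id", "bigint"),
        ("amount", "double"),
    ],
    s3_location="s3://example-bucket/orders/",  # hypothetical S3 prefix
)
print(ddl)
```

The printed statement can be pasted into the query editor. For other formats the ROW FORMAT clause would change (for example, STORED AS PARQUET for Parquet files).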

The table can then be queried using standard SQL. Multiple tables can be set up and joined in a single query. Other standard features such as GROUP BY and ORDER BY are also available.
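As an illustration of the kind of query this enables, the sketch below joins two tables, groups the result, and orders it. To keep it runnable without an AWS account, it uses Python's built-in sqlite3 module as a local stand-in for Athena. The tables, columns, and rows are hypothetical. The SELECT statement sticks to plain standard SQL, so the same text should carry over to Athena with little or no change.

```python
import sqlite3

# Local in-memory database standing in for Athena (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 2, 35.0), (12, 3, 5.0), (13, 2, 10.0);
    """
)

# Join the two tables, aggregate per region, and sort by total amount.
QUERY = """
SELECT c.region,
       COUNT(*)      AS order_count,
       SUM(o.amount) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region
ORDER BY total_amount DESC
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)

conn.close()
```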