Amazon Athena
User Guide
Amazon Athena: User Guide
Copyright © 2018 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner
that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not
owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by
Amazon.
Table of Contents
What is Amazon Athena? .................................................................................................................... 1
When should I use Athena? ......................................................................................................... 1
Accessing Athena ....................................................................................................................... 1
Understanding Tables, Databases, and the Data Catalog ................................................................. 2
Release Notes .................................................................................................................................... 4
February 2, 2018 ....................................................................................................................... 4
January 19, 2018 ....................................................................................................................... 4
November 13, 2017 ................................................................................................................... 5
November 1, 2017 ..................................................................................................................... 5
October 19, 2017 ....................................................................................................................... 5
October 3, 2017 ........................................................................................................................ 6
September 25, 2017 ................................................................................................................... 6
August 14, 2017 ........................................................................................................................ 6
August 4, 2017 .......................................................................................................................... 6
June 22, 2017 ........................................................................................................................... 6
June 8, 2017 ............................................................................................................................. 6
May 19, 2017 ............................................................................................................................ 6
Improvements .................................................................................................................... 7
Bug Fixes .......................................................................................................................... 7
April 4, 2017 ............................................................................................................................. 7
Features ............................................................................................................................ 7
Improvements .................................................................................................................... 7
Bug Fixes .......................................................................................................................... 7
March 24, 2017 ......................................................................................................................... 8
Features ............................................................................................................................ 8
Improvements .................................................................................................................... 9
Bug Fixes .......................................................................................................................... 9
February 20, 2017 ...................................................................................................................... 9
Features ............................................................................................................................ 9
Improvements .................................................................................................................. 11
Setting Up ....................................................................................................................................... 12
Sign Up for AWS ...................................................................................................................... 12
To create an AWS account ................................................................................................. 12
Create an IAM User .................................................................................................................. 12
To create a group for administrators ................................................................................... 12
To create an IAM user for yourself, add the user to the administrators group, and create a
password for the user ....................................................................................................... 13
Attach Managed Policies for Using Athena .................................................................................. 13
Getting Started ................................................................................................................................ 14
Prerequisites ............................................................................................................................ 14
Step 1: Create a Database ......................................................................................................... 14
Step 2: Create a Table .............................................................................................................. 15
Step 3: Query Data .................................................................................................................. 16
Integration with AWS Glue ................................................................................................................ 18
Upgrading to the AWS Glue Data Catalog Step-by-Step ................................................................ 19
Step 1 - Allow a User to Perform the Upgrade ..................................................................... 19
Step 2 - Update Customer-Managed/Inline Policies Associated with Athena Users ..................... 19
Step 3 - Choose Upgrade in the Athena Console .................................................................. 20
FAQ: Upgrading to the AWS Glue Data Catalog ............................................................................ 21
Why should I upgrade to the AWS Glue Data Catalog? .......................................................... 21
Are there separate charges for AWS Glue? ........................................................................... 22
Upgrade process FAQ ........................................................................................................ 22
Best Practices When Using Athena with AWS Glue ........................................................................ 23
Database, Table, and Column Names .................................................................................. 24
Using AWS Glue Crawlers .................................................................................................. 24
Working with CSV Files ..................................................................................................... 28
Using AWS Glue Jobs for ETL with Athena ........................................................................... 30
Connecting to Amazon Athena with ODBC and JDBC Drivers .................................................................. 33
Using Athena with the JDBC Driver ............................................................................................ 33
Download the JDBC Driver ................................................................................................ 33
Specify the Connection String ............................................................................................ 33
Specify the JDBC Driver Class Name ................................................................................... 33
Provide the JDBC Driver Credentials ................................................................................... 33
Configure the JDBC Driver Options ..................................................................................... 34
Connecting to Amazon Athena with ODBC .................................................................................. 35
Amazon Athena ODBC Driver License Agreement ................................................................. 35
Windows ......................................................................................................................... 35
Linux .............................................................................................................................. 35
OSX ................................................................................................................................ 36
ODBC Driver Connection String .......................................................................................... 36
Documentation ................................................................................................................ 36
Security ........................................................................................................................................... 37
Setting User and Amazon S3 Bucket Permissions .......................................................................... 37
IAM Policies for User Access .............................................................................................. 37
AmazonAthenaFullAccess Managed Policy ........................................................................... 37
AWSQuicksightAthenaAccess Managed Policy ....................................................................... 39
Access through JDBC Connections ...................................................................................... 40
Amazon S3 Permissions .................................................................................................... 40
Cross-account Permissions ................................................................................................. 40
Configuring Encryption Options ................................................................................................. 41
Permissions for Encrypting and Decrypting Data .................................................................. 42
Creating Tables Based on Encrypted Datasets in Amazon S3 .................................................. 42
Encrypting Query Results Stored in Amazon S3 .................................................................... 44
Encrypting Query Results stored in Amazon S3 Using the JDBC Driver ..................................... 45
Working with Source Data ................................................................................................................. 46
Tables and Databases Creation Process in Athena ......................................................................... 46
Requirements for Tables in Athena and Data in Amazon S3 ................................................... 46
Functions Supported ......................................................................................................... 47
CREATE TABLE AS Type Statements Are Not Supported ......................................................... 47
Transactional Data Transformations Are Not Supported ......................................................... 47
Operations That Change Table States Are ACID .................................................................... 47
All Tables Are EXTERNAL ................................................................................................... 47
UDF and UDAF Are Not Supported ..................................................................................... 47
To create a table using the AWS Glue Data Catalog .............................................................. 47
To create a table using the wizard ...................................................................................... 48
To create a database using Hive DDL .................................................................................. 48
To create a table using Hive DDL ....................................................................................... 49
Names for Tables, Databases, and Columns ................................................................................. 50
Table names and table column names in Athena must be lowercase ........................................ 50
Athena table, database, and column names allow only underscore special characters ................. 50
Names that begin with an underscore ................................................................................. 50
Table names that include numbers ..................................................................................... 50
Table Location in Amazon S3 ..................................................................................................... 51
Partitioning Data ...................................................................................................................... 51
Scenario 1: Data already partitioned and stored on S3 in hive format ...................................... 52
Scenario 2: Data is not partitioned ..................................................................................... 53
Converting to Columnar Formats ............................................................................................... 55
Overview ......................................................................................................................... 55
Before you begin .............................................................................................................. 56
Example: Converting data to Parquet using an EMR cluster .................................................... 57
Querying Data in Amazon Athena Tables ............................................................................................ 59
Query Results .......................................................................................................................... 59
Saving Query Results ........................................................................................................ 59
Viewing Query History .............................................................................................................. 60
Viewing Query History ...................................................................................................... 60
Querying Arrays ....................................................................................................................... 61
Creating Arrays ................................................................................................................ 61
Concatenating Arrays ........................................................................................................ 62
Converting Array Data Types ............................................................................................. 63
Finding Lengths ............................................................................................................... 64
Accessing Array Elements .................................................................................................. 64
Flattening Nested Arrays ................................................................................................... 65
Creating Arrays from Subqueries ........................................................................................ 67
Filtering Arrays ................................................................................................................ 68
Sorting Arrays .................................................................................................................. 69
Using Aggregation Functions with Arrays ............................................................................ 69
Converting Arrays to Strings .............................................................................................. 70
Querying Arrays with ROWS and STRUCTS .................................................................................. 70
Creating a ROW ................................................................................................................. 70
Changing Field Names in Arrays Using CAST ........................................................................ 71
Filtering Arrays Using the . Notation .................................................................................. 71
Filtering Arrays with Nested Values .................................................................................... 72
Filtering Arrays Using UNNEST ........................................................................................... 73
Finding Keywords in Arrays ................................................................................................ 73
Ordering Values in Arrays .................................................................................................. 74
Querying Arrays with Maps ....................................................................................................... 75
Examples ......................................................................................................................... 75
Querying JSON ........................................................................................................................ 76
Tips for Parsing Arrays with JSON ...................................................................................... 76
Extracting Data from JSON ................................................................................................ 78
Searching for Values ......................................................................................................... 80
Obtaining Length and Size of JSON Arrays .......................................................................... 81
Querying Geospatial Data ................................................................................................................. 83
What is a Geospatial Query? ...................................................................................................... 83
Input Data Formats and Geometry Data Types ............................................................................. 83
Input Data Formats .......................................................................................................... 83
Geometry Data Types ....................................................................................................... 84
List of Supported Geospatial Functions ....................................................................................... 84
Before You Begin ............................................................................................................. 84
Constructor Functions ....................................................................................................... 85
Geospatial Relationship Functions ...................................................................................... 86
Operation Functions ......................................................................................................... 88
Accessor Functions ........................................................................................................... 89
Examples: Geospatial Queries .................................................................................................... 92
Querying AWS Service Logs ............................................................................................................... 94
Querying AWS CloudTrail Logs ................................................................................................... 94
Creating the Table for CloudTrail Logs ................................................................................ 95
Tips for Querying CloudTrail Logs ...................................................................................... 96
Querying Amazon CloudFront Logs ............................................................................................ 97
Creating the Table for CloudFront Logs ............................................................................... 97
Example Query for CloudFront logs .................................................................................... 98
Querying Classic Load Balancer Logs .......................................................................................... 98
Creating the Table for Elastic Load Balancing Logs ............................................................... 98
Example Queries for Elastic Load Balancing Logs .................................................................. 99
Querying Application Load Balancer Logs .................................................................................. 100
Creating the Table for ALB Logs ....................................................................................... 100
Example Queries for ALB logs .......................................................................................... 101
Querying Amazon VPC Flow Logs ............................................................................................. 101
Creating the Table for VPC Flow Logs ............................................................................... 101
Example Queries for Amazon VPC Flow Logs ..................................................................... 102
Monitoring Logs and Troubleshooting ............................................................................................... 103
Logging Amazon Athena API Calls with AWS CloudTrail ............................................................... 103
Athena Information in CloudTrail ...................................................................................... 103
Understanding Athena Log File Entries .............................................................................. 104
Troubleshooting ..................................................................................................................... 105
SerDe Reference ............................................................................................................................. 107
Using a SerDe ........................................................................................................................ 107
To Use a SerDe in Queries ............................................................................................... 107
Supported SerDes and Data Formats ........................................................................................ 108
Avro SerDe .................................................................................................................... 109
RegexSerDe for Processing Apache Web Server Logs ........................................................... 111
CloudTrail SerDe ............................................................................................................. 112
OpenCSVSerDe for Processing CSV ................................................................................... 114
Grok SerDe .................................................................................................................... 115
JSON SerDe Libraries ...................................................................................................... 117
LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files ................................................. 120
ORC SerDe ..................................................................................................................... 125
Parquet SerDe ................................................................................................................ 127
Compression Formats .............................................................................................................. 130
DDL and SQL Reference .................................................................................................................. 131
Data Types ............................................................................................................................ 131
List of Supported Data Types in Athena ............................................................................ 131
DDL Statements ..................................................................................................................... 132
ALTER DATABASE SET DBPROPERTIES ............................................................................... 132
ALTER TABLE ADD PARTITION .......................................................................................... 133
ALTER TABLE DROP PARTITION ........................................................................................ 134
ALTER TABLE RENAME PARTITION .................................................................................... 134
ALTER TABLE SET LOCATION ........................................................................................... 135
ALTER TABLE SET TBLPROPERTIES ................................................................................... 135
CREATE DATABASE .......................................................................................................... 136
CREATE TABLE ............................................................................................................... 136
DESCRIBE TABLE ............................................................................................................ 139
DROP DATABASE ............................................................................................................ 140
DROP TABLE .................................................................................................................. 140
MSCK REPAIR TABLE ....................................................................................................... 141
SHOW COLUMNS ........................................................................................................... 141
SHOW CREATE TABLE ..................................................................................................... 142
SHOW DATABASES ......................................................................................................... 142
SHOW PARTITIONS ......................................................................................................... 142
SHOW TABLES ............................................................................................................... 143
SHOW TBLPROPERTIES ................................................................................................... 143
SQL Queries, Functions, and Operators ..................................................................................... 144
SELECT .......................................................................................................................... 144
Unsupported DDL ................................................................................................................... 148
Limitations ............................................................................................................................. 149
Code Samples and Service Limits ..................................................................................................... 150
Code Samples ........................................................................................................................ 150
Create a Client to Access Athena ...................................................................................... 150
Start Query Execution ..................................................................................................... 151
Stop Query Execution ..................................................................................................... 154
List Query Executions ...................................................................................................... 155
Create a Named Query .................................................................................................... 156
Delete a Named Query .................................................................................................... 156
List Named Queries ........................................................................................................ 157
Service Limits ......................................................................................................................... 158
What is Amazon Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon
Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management
Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run
ad-hoc queries and get results in seconds.
Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries
you run. Athena scales automatically—executing queries in parallel—so results are fast, even with large
datasets and complex queries.
When should I use Athena?
Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3.
Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC. You
can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data
into Athena.
Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your
data in Amazon S3. This allows you to create tables and query data in Athena based on a central
metadata store available throughout your AWS account and integrated with the ETL and data discovery
features of AWS Glue. For more information, see Integration with AWS Glue (p. 18) and What is AWS
Glue in the AWS Glue Developer Guide.
Athena integrates with Amazon QuickSight for easy data visualization.
You can use Athena to generate reports or to explore data with business intelligence tools or SQL
clients connected with a JDBC or an ODBC driver. For more information, see What is Amazon QuickSight
in the Amazon QuickSight User Guide and Connecting to Amazon Athena with ODBC and JDBC
Drivers (p. 33).
You can create named queries with AWS CloudFormation and run them in Athena. Named queries
allow you to map a query name to a query and then call the query multiple times referencing it
by its name. For information, see CreateNamedQuery in the Amazon Athena API Reference, and
AWS::Athena::NamedQuery in the AWS CloudFormation User Guide.
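For illustration, a named query resource in a CloudFormation template might look like the following sketch. The database name, query string, and logical resource name shown here are placeholders rather than values from this guide.

{
    "Resources": {
        "ExampleNamedQuery": {
            "Type": "AWS::Athena::NamedQuery",
            "Properties": {
                "Database": "mydatabase",
                "Description": "Example named query created with AWS CloudFormation",
                "Name": "RequestsPerOs",
                "QueryString": "SELECT os, COUNT(*) AS count FROM cloudfront_logs GROUP BY os"
            }
        }
    }
}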
Accessing Athena
You can access Athena using the AWS Management Console, through a JDBC connection, using the
Athena API, or using the Athena CLI.
To get started with the console, see Getting Started (p. 14).
To learn how to use JDBC, see Connecting to Amazon Athena with JDBC (p. 33).
To use the Athena API, see the Amazon Athena API Reference.
To use the CLI, install the AWS CLI and then type aws athena help from the command line to see
available commands. For information about available commands, see the AWS Athena command line
reference.
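For example, after installing the AWS CLI you can submit a query from the command line and direct the results to an Amazon S3 location of your choosing. The bucket name below is a placeholder.

aws athena start-query-execution \
    --query-string "SELECT 1" \
    --result-configuration OutputLocation=s3://my-example-results-bucket/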
Understanding Tables, Databases, and the Data Catalog
In Athena, tables and databases are containers for the metadata definitions that define a schema for
underlying source data. For each dataset, a table needs to exist in Athena. The metadata in the table
tells Athena where the data is located in Amazon S3, and specifies the structure of the data, for example,
column names, data types, and the name of the table. Databases are a logical grouping of tables, and
also hold only metadata and schema information for a dataset.
For each dataset that you'd like to query, Athena must have an underlying table it will use for obtaining
and returning query results. Therefore, before querying data, a table must be registered in Athena. The
registration occurs when you either create tables automatically or manually.
Regardless of how the tables are created, the table creation process registers the dataset with Athena.
This registration occurs either in the AWS Glue Data Catalog, or in the internal Athena data catalog and
enables Athena to run queries on the data.
To create a table automatically, use an AWS Glue crawler from within Athena. For more information
about AWS Glue and crawlers, see Integration with AWS Glue (p. 18). When AWS Glue creates a
table, it registers it in its own AWS Glue Data Catalog. Athena uses the AWS Glue Data Catalog to store
and retrieve this metadata, using it when you run queries to analyze the underlying dataset.
The AWS Glue Data Catalog is accessible throughout your AWS account. Other AWS services can share
the AWS Glue Data Catalog, so you can see databases and tables created throughout your organization
using Athena and vice versa. In addition, AWS Glue lets you automatically discover data schema and
extract, transform, and load (ETL) data.
Note
You use the internal Athena data catalog in regions where AWS Glue is not available and where
the AWS Glue Data Catalog cannot be used.
To create a table manually:
Use the Athena console to run the Create Table Wizard.
Use the Athena console to write Hive DDL statements in the Query Editor.
Use the Athena API or CLI to execute a SQL query string with DDL statements.
Use the Athena JDBC or ODBC driver.
When you create tables and databases manually, Athena uses HiveQL data definition language (DDL)
statements such as CREATE TABLE, CREATE DATABASE, and DROP TABLE under the hood to create
tables and databases in the AWS Glue Data Catalog, or in its internal data catalog in those regions where
AWS Glue is not available.
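For example, creating a database and a table manually comes down to running HiveQL DDL statements like the following sketch (run one statement at a time in the Query Editor). The table name, columns, and Amazon S3 location are placeholders, not objects from this guide.

CREATE DATABASE IF NOT EXISTS mydatabase;

-- The table is EXTERNAL; dropping it removes only the metadata, not the data in Amazon S3.
CREATE EXTERNAL TABLE IF NOT EXISTS mydatabase.example_logs (
  request_time STRING,
  request_ip STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-example-bucket/example-logs/';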
Note
If you have tables in Athena created before August 14, 2017, they were created in an Athena-
managed data catalog that exists side-by-side with the AWS Glue Data Catalog until you
choose to upgrade. For more information, see Upgrading to the AWS Glue Data Catalog Step-by-
Step (p. 19).
When you query an existing table, under the hood, Amazon Athena uses Presto, a distributed SQL
engine. We have examples with sample data within Athena to show you how to create a table and then
issue a query against it using Athena. Athena also has a tutorial in the console that helps you get started
creating a table based on data that is stored in Amazon S3.
For a step-by-step tutorial on creating a table and writing queries in the Athena Query Editor, see
Getting Started (p. 14).
Run the Athena tutorial in the console. This launches automatically if you log in to https://
console.aws.amazon.com/athena/ for the first time. You can also choose Tutorial in the console to
launch it.
Release Notes
Describes Amazon Athena features, improvements, and bug fixes by release date.
Contents
February 2, 2018 (p. 4)
January 19, 2018 (p. 4)
November 13, 2017 (p. 5)
November 1, 2017 (p. 5)
October 19, 2017 (p. 5)
October 3, 2017 (p. 6)
September 25, 2017 (p. 6)
August 14, 2017 (p. 6)
August 4, 2017 (p. 6)
June 22, 2017 (p. 6)
June 8, 2017 (p. 6)
May 19, 2017 (p. 6)
Improvements (p. 7)
Bug Fixes (p. 7)
April 4, 2017 (p. 7)
Features (p. 7)
Improvements (p. 7)
Bug Fixes (p. 7)
March 24, 2017 (p. 8)
Features (p. 8)
Improvements (p. 9)
Bug Fixes (p. 9)
February 20, 2017 (p. 9)
Features (p. 9)
Improvements (p. 11)
February 2, 2018
Published on 2018-02-12
Added an ability to securely offload intermediate data to disk for memory-intensive queries that use the
GROUP BY clause. This improves the reliability of such queries, preventing "Query resource exhausted"
errors.
January 19, 2018
Published on 2018-01-19
Athena uses Presto, an open-source distributed query engine, to run queries.
With Athena, there are no versions to manage. We have transparently upgraded the underlying engine in
Athena to a version based on Presto version 0.172. No action is required on your end.
With the upgrade, you can now use Presto 0.172 Functions and Operators, including Presto 0.172
Lambda Expressions in Athena.
Major updates for this release, including the community-contributed fixes, include:
Support for ignoring headers. You can use the skip.header.line.count property when defining
tables, to allow Athena to ignore headers.
Support for the CHAR(n) data type in STRING functions. The range for CHAR(n) is [1, 255], while
the range for VARCHAR(n) is [1, 65535].
Support for correlated subqueries.
Support for Presto Lambda expressions and functions.
Improved performance of the DECIMAL type and operators.
Support for filtered aggregations, such as SELECT sum(col_name) FILTER (WHERE id > 0); see the sketch after this list.
Push-down predicates for the DECIMAL, TINYINT, SMALLINT, and REAL data types.
Support for quantified comparison predicates: ALL, ANY, and SOME.
Added functions: arrays_overlap(), array_except(), levenshtein_distance(),
codepoint(), skewness(), kurtosis(), and typeof().
Added a variant of the from_unixtime() function that takes a timezone argument.
Added the bitwise_and_agg() and bitwise_or_agg() aggregation functions.
Added the xxhash64() and to_big_endian_64() functions.
Added support for escaping double quotes or backslashes using a backslash with a JSON path
subscript to the json_extract() and json_extract_scalar() functions. This changes the
semantics of any invocation using a backslash, as backslashes were previously treated as normal
characters.
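As a brief illustration of two of the items above, the following sketch defines a CSV-backed table that skips its header row and then runs a filtered aggregation against it. The table name, columns, and Amazon S3 location are placeholders, not objects from this guide.

CREATE EXTERNAL TABLE IF NOT EXISTS example_events (
  id INT,
  os STRING,
  bytes INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/events/'
TBLPROPERTIES ('skip.header.line.count'='1');

-- Filtered aggregation, available with the Presto 0.172 upgrade
SELECT os, sum(bytes) FILTER (WHERE id > 0) AS total_bytes
FROM example_events
GROUP BY os;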
For a complete list of functions and operators, see SQL Queries, Functions, and Operators (p. 144) in
this guide, and Presto 0.172 Functions.
Athena does not support all of Presto's features. For more information, see Limitations (p. 149).
November 13, 2017
Published on 2017-11-13
Added support for connecting Athena to the ODBC Driver. For information, see Connecting to Amazon
Athena with ODBC (p. 35).
November 1, 2017
Published on 2017-11-01
Added support for querying geospatial data, and for Asia Pacific (Seoul), Asia Pacific (Mumbai), and
EU (London) regions. For information, see Querying Geospatial Data (p. 83) and AWS Regions and
Endpoints.
October 19, 2017
Published on 2017-10-19
Added support for EU (Frankfurt). For a list of supported regions, see AWS Regions and Endpoints.
October 3, 2017
Published on 2017-10-03
Create named Athena queries with CloudFormation. For more information, see
AWS::Athena::NamedQuery in the AWS CloudFormation User Guide.
September 25, 2017
Published on 2017-09-25
Added support for Asia Pacific (Sydney). For a list of supported regions, see AWS Regions and Endpoints.
August 14, 2017
Published on 2017-08-14
Added integration with the AWS Glue Data Catalog and a migration wizard for updating from the Athena
managed data catalog to the AWS Glue Data Catalog. For more information, see Integration with AWS
Glue (p. 18).
August 4, 2017
Published on 2017-08-04
Added support for Grok SerDe, which provides easier pattern matching for records in unstructured text
files such as logs. For more information, see Grok SerDe (p. 115). Added keyboard shortcuts to scroll
through query history using the console (CTRL + ⇧/⇩ using Windows, CMD + ⇧/⇩ using Mac).
June 22, 2017
Published on 2017-06-22
Added support for Asia Pacific (Tokyo) and Asia Pacific (Singapore). For a list of supported regions, see
AWS Regions and Endpoints.
June 8, 2017
Published on 2017-06-08
Added support for EU (Ireland). For more information, see AWS Regions and Endpoints.
May 19, 2017
Published on 2017-05-19
Added an Amazon Athena API and AWS CLI support for Athena; updated JDBC driver to version 1.1.0;
fixed various issues.
The Amazon Athena API enables application programming for Athena. For more information, see the Amazon
Athena API Reference. The latest AWS SDKs include support for the Athena API. For links to
documentation and downloads, see the SDKs section in Tools for Amazon Web Services.
The AWS CLI includes new commands for Athena. For more information, see the AWS CLI Reference for
Athena.
A new JDBC driver 1.1.0 is available, which supports the new Athena API as well as the latest features
and bug fixes. Download the driver at https://s3.amazonaws.com/athena-downloads/drivers/
AthenaJDBC41-1.1.0.jar. We recommend upgrading to the latest Athena JDBC driver; however, you
may still use the earlier driver version. Earlier driver versions do not support the Athena API. For more
information, see Using Athena with the JDBC Driver (p. 33).
Actions specific to policy statements in earlier versions of Athena have been deprecated. If you
upgrade to JDBC driver version 1.1.0 and have customer-managed or inline IAM policies attached to
JDBC users, you must update the IAM policies. In contrast, earlier versions of the JDBC driver do not
support the Athena API, so you can specify only deprecated actions in policies attached to earlier
version JDBC users. For this reason, you shouldn't need to update customer-managed or inline IAM
policies.
These policy-specific actions were used in Athena before the release of the Athena API. Use these
deprecated actions in policies only with JDBC drivers earlier than version 1.1.0. If you are upgrading
the JDBC driver, replace policy statements that allow or deny deprecated actions with the appropriate
API actions as listed or errors will occur:
Deprecated Policy-Specific Action      Corresponding Athena API Action
athena:RunQuery                        athena:StartQueryExecution
athena:CancelQueryExecution            athena:StopQueryExecution
athena:GetQueryExecutions              athena:ListQueryExecutions
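For example, a policy statement that previously allowed the deprecated actions could be rewritten along the following lines. This is a hedged sketch, not the full managed policy.

{
    "Effect": "Allow",
    "Action": [
        "athena:StartQueryExecution",
        "athena:StopQueryExecution",
        "athena:ListQueryExecutions"
    ],
    "Resource": [ "*" ]
}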
Improvements
Increased the query string length limit to 256 KB.
Bug Fixes
Fixed an issue that caused query results to look malformed when scrolling through results in the
console.
Fixed an issue where a \u0000 character string in Amazon S3 data files would cause errors.
Fixed an issue that caused requests to cancel a query made through the JDBC driver to fail.
Fixed an issue that caused the AWS CloudTrail SerDe to fail with Amazon S3 data in US East (Ohio).
Fixed an issue that caused DROP TABLE to fail on a partitioned table.
April 4, 2017
Published on 2017-04-04
Added support for Amazon S3 data encryption and released JDBC driver update (version 1.0.1) with
encryption support, improvements, and bug fixes.
Features
Added the following encryption features:
Support for querying encrypted data in Amazon S3.
Support for encrypting Athena query results.
A new version of the driver supports new encryption features, adds improvements, and fixes issues.
Added the ability to add, replace, and change columns using ALTER TABLE. For more information, see
Alter Column in the Hive documentation.
Added support for querying LZO-compressed data.
For more information, see Configuring Encryption Options (p. 41).
Improvements
Better JDBC query performance with page-size improvements, returning 1,000 rows instead of 100.
Added ability to cancel a query using the JDBC driver interface.
Added ability to specify JDBC options in the JDBC connection URL. For more information, see Using
Athena with the JDBC Driver (p. 33).
Added PROXY setting in the driver, which can now be set using ClientConfiguration in the AWS SDK for
Java.
Bug Fixes
Fixed the following bugs:
Throttling errors would occur when multiple queries were issued using the JDBC driver interface.
The JDBC driver would abort when projecting a decimal data type.
The JDBC driver would return every data type as a string, regardless of how the data type
was defined in the table. For example, selecting a column defined as an INT data type using
resultSet.GetObject() would return a STRING data type instead of INT.
The JDBC driver would verify credentials at the time a connection was made, rather than at the time a
query would run.
Queries made through the JDBC driver would fail when a schema was specified along with the URL.
March 24, 2017
Published on 2017-03-24
Added the AWS CloudTrail SerDe, improved performance, fixed partition issues.
Features
Added the AWS CloudTrail SerDe. For more information, see CloudTrail SerDe (p. 112). For detailed
usage examples, see the AWS Big Data Blog post, Analyze Security, Compliance, and Operational
Activity Using AWS CloudTrail and Amazon Athena.
Improvements
Improved performance when scanning a large number of partitions.
Improved performance on the MSCK REPAIR TABLE operation.
Added ability to query Amazon S3 data stored in regions other than your primary region. Standard
inter-region data transfer rates for Amazon S3 apply in addition to standard Athena charges.
Bug Fixes
Fixed a bug where a "table not found error" might occur if no partitions are loaded.
Fixed a bug to avoid throwing an exception with ALTER TABLE ADD PARTITION IF NOT EXISTS
queries.
Fixed a bug in DROP PARTITIONS.
February 20, 2017
Published on 2017-02-20
Added support for AvroSerDe and OpenCSVSerDe, US East (Ohio) region, and bulk editing columns in the
console wizard. Improved performance on large Parquet tables.
Features
Introduced support for new SerDes:
Avro SerDe (p. 109)
OpenCSVSerDe for Processing CSV (p. 114)
US East (Ohio) region (us-east-2) launch. You can now run queries in this region.
You can now use the Add Table wizard to define table schema in bulk. Choose Catalog Manager, Add
table, and then choose Bulk add columns as you walk through the steps to define the table.
Type the column name and data type pairs in the text box and choose Add.
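For example, the text box expects a comma-separated list of column name and data type pairs, along these lines (the column names and types here are placeholders):

request_date date, request_time string, location string, bytes int, request_ip string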
Improvements
Improved performance on large Parquet tables.
Setting Up
If you've already signed up for Amazon Web Services (AWS), you can start using Amazon Athena
immediately. If you haven't signed up for AWS, or if you need assistance querying data using Athena, first
complete the tasks below:
Sign Up for AWS
When you sign up for AWS, your account is automatically signed up for all services in AWS, including
Athena. You are charged only for the services that you use. When you use Athena, you use Amazon S3 to
store your data. Athena has no AWS Free Tier pricing.
If you have an AWS account already, skip to the next task. If you don't have an AWS account, use the
following procedure to create one.
To create an AWS account
1. Open http://aws.amazon.com/, and then choose Create an AWS Account.
2. Follow the online instructions. Part of the sign-up procedure involves receiving a phone call and
entering a PIN using the phone keypad.
Note your AWS account number, because you need it for the next task.
Create an IAM User
An AWS Identity and Access Management (IAM) user is an account that you create to access services. It is
a different user from your main AWS account. As a security best practice, we recommend that you use the
IAM user's credentials to access AWS services. Create an IAM user, and then either add the user to an IAM
group with administrative permissions or grant this user administrative permissions. You can then access
AWS using a special URL and the credentials for the IAM user.
If you signed up for AWS but have not created an IAM user for yourself, you can create one using the IAM
console. If you aren't familiar with using the console, see Working with the AWS Management Console.
To create a group for administrators
1. Sign in to the IAM console at https://console.aws.amazon.com/iam/.
2. In the navigation pane, choose Groups, Create New Group.
3. For Group Name, type a name for your group, such as Administrators, and choose Next Step.
4. In the list of policies, select the check box next to the AdministratorAccess policy. You can use the
Filter menu and the Search field to filter the list of policies.
5. Choose Next Step, Create Group. Your new group is listed under Group Name.
To create an IAM user for yourself, add the user to the administrators group, and create a password for the user
1. In the navigation pane, choose Users, and then Create New Users.
2. For 1, type a user name.
3. Clear the check box next to Generate an access key for each user, and then choose Create.
4. In the list of users, select the name (not the check box) of the user you just created. You can use the
Search field to search for the user name.
5. Choose Groups, Add User to Groups.
6. Select the check box next to the administrators group and choose Add to Groups.
7. Choose the Security Credentials tab. Under Sign-In Credentials, choose Manage Password.
8. Choose Assign a custom password. Then type a password in the Password and Confirm Password
fields. When you are finished, choose Apply.
9. To sign in as this new IAM user, sign out of the AWS console, then use the following URL, where
your_aws_account_id is your AWS account number without the hyphens (for example, if your AWS
account number is 1234-5678-9012, your AWS account ID is 123456789012):
https://your_aws_account_id.signin.aws.amazon.com/console/
It is also possible that the sign-in link uses your account alias instead of your account ID. To verify the
sign-in link for IAM users for your account, open the IAM console and check under IAM users sign-in link
on the dashboard.
Attach Managed Policies for Using Athena
Attach Athena managed policies to the IAM account you use to access Athena. There are two managed
policies for Athena: AmazonAthenaFullAccess and AWSQuicksightAthenaAccess. These policies
grant permissions to Athena to query Amazon S3 as well as write the results of your queries to a
separate bucket on your behalf. For more information and step-by-step instructions, see Attaching
Managed Policies in the AWS Identity and Access Management User Guide. For information about policy
contents, see IAM Policies for User Access (p. 37).
Note
You may need additional permissions to access the underlying dataset in Amazon S3. If
you are not the account owner or otherwise have restricted access to a bucket, contact the
bucket owner to grant access using a resource-based bucket policy, or contact your account
administrator to grant access using an identity-based policy. For more information, see Amazon
S3 Permissions (p. 40). If the dataset or Athena query results are encrypted, you may need
additional permissions. For more information, see Configuring Encryption Options (p. 41).
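As an illustration, a bucket owner might grant another AWS account read access to a dataset with a resource-based bucket policy along the following lines. The account ID and bucket name are placeholders.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
            "Action": [ "s3:GetObject", "s3:ListBucket" ],
            "Resource": [
                "arn:aws:s3:::my-example-data-bucket",
                "arn:aws:s3:::my-example-data-bucket/*"
            ]
        }
    ]
}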
Getting Started
This tutorial walks you through using Amazon Athena to query data. You'll create a table based on
sample data stored in Amazon Simple Storage Service, query the table, and check the results of the
query.
The tutorial uses live resources, so you are charged for the queries that you run. You aren't charged
for the sample datasets that you use, but if you upload your own data files to Amazon S3, charges do
apply.
Prerequisites
If you have not already done so, sign up for an account in Setting Up (p. 12).
Step 1: Create a Database
You first need to create a database in Athena.
To create a database
1. Open the Athena console.
2. If this is your first time visiting the Athena console, you'll go to a Getting Started page. Choose Get
Started to open the Query Editor. If it isn't your first time, the Athena Query Editor opens.
3. In the Athena Query Editor, you see a query pane with an example query. Start typing your query
anywhere in the query pane.
4. To create a database named mydatabase, enter the following CREATE DATABASE statement, and
then choose Run Query:
CREATE DATABASE mydatabase
5. Confirm that the catalog display refreshes and mydatabase appears in the DATABASE list in the
Catalog dashboard on the left side.
Step 2: Create a Table
Now that you have a database, you're ready to create a table that's based on the sample data file. You
define columns that map to the data, specify how the data is delimited, and provide the location in
Amazon S3 for the file.
To create a table
1. Make sure that mydatabase is selected for DATABASE and then choose New Query.
2. In the query pane, enter the following CREATE TABLE statement, and then choose Run Query:
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://
athena-examples-us-east-1/path/to/data/.
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` DATE,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
os STRING,
Browser STRING,
BrowserVersion STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s
+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^
\/]+)[\/](.*)$"
) LOCATION 's3://athena-examples-myregion/cloudfront/plaintext/';
The table cloudfront_logs is created and appears in the Catalog dashboard for your database.
Step 3: Query Data
Now that you have the cloudfront_logs table created in Athena based on the data in Amazon S3, you
can run queries on the table and see the results in Athena.
To run a query
1. Choose New Query, enter the following statement anywhere in the query pane, and then choose
Run Query:
SELECT os, COUNT(*) count
FROM cloudfront_logs
WHERE date BETWEEN date '2014-07-05' AND date '2014-08-05'
GROUP BY os;
Results are returned that show a request count for each operating system value.
2. Optionally, you can save the results of a query to CSV by choosing the file icon on the Results pane.
You can also view the results of previous queries, or of queries that may take some time to complete.
Choose History, and then either search for your query or choose View or Download to view or download
the results of previously completed queries. The History page also displays the status of queries that are
currently running. Query history is retained for 45 days. For information, see Viewing Query History (p. 60).
Query results are also stored in Amazon S3 in a bucket called aws-athena-query-
results-ACCOUNTID-REGION. You can change the default location in the console and encryption options
by choosing Settings in the upper right pane. For more information, see Query Results (p. 59).
Integration with AWS Glue
AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data,
clean it, enrich it, and move it reliably between various data stores. AWS Glue crawlers automatically
infer database and table schema from your source data, storing the associated metadata in the AWS Glue
Data Catalog. When you create a table in Athena, you can choose to create it using an AWS Glue crawler.
In regions where AWS Glue is supported, Athena uses the AWS Glue Data Catalog as a central location
to store and retrieve table metadata throughout an AWS account. The Athena execution engine requires
table metadata that instructs it where to read data, how to read it, and other information necessary to
process the data. The AWS Glue Data Catalog provides a unified metadata repository across a variety
of data sources and data formats, integrating not only with Athena, but with Amazon S3, Amazon RDS,
Amazon Redshift, Amazon Redshift Spectrum, Amazon EMR, and any application compatible with the
Apache Hive metastore.
For more information about the AWS Glue Data Catalog, see Populating the AWS Glue Data Catalog
in the AWS Glue Developer Guide. For a list of regions where AWS Glue is available, see Regions and
Endpoints in the AWS General Reference.
Separate charges apply to AWS Glue. For more information, see AWS Glue Pricing and Are there separate
charges for AWS Glue? (p. 22) For more information about the benefits of using AWS Glue with
Athena, see Why should I upgrade to the AWS Glue Data Catalog? (p. 21)
Topics
Upgrading to the AWS Glue Data Catalog Step-by-Step (p. 19)
FAQ: Upgrading to the AWS Glue Data Catalog (p. 21)
Best Practices When Using Athena with AWS Glue (p. 23)
Upgrading to the AWS Glue Data Catalog Step-by-Step
Amazon Athena manages its own data catalog until AWS Glue becomes available in the region where you run Athena. At that time, if you previously created databases and tables using Athena or Amazon Redshift Spectrum, you can choose to upgrade Athena to the AWS Glue Data Catalog. If you are new to Athena, you don't need to make any changes; databases and tables are available to Athena using the AWS Glue Data Catalog, and vice versa. For more information about the benefits of using the AWS Glue Data Catalog, see FAQ: Upgrading to the AWS Glue Data Catalog (p. 21). For a list of regions where AWS Glue is available, see Regions and Endpoints in the AWS General Reference.
Until you upgrade, the Athena-managed data catalog continues to store your table and database
metadata, and you see the option to upgrade at the top of the console. The metadata in the Athena-
managed catalog isn't available in the AWS Glue Data Catalog or vice versa. While the catalogs exist side-
by-side, you aren't able to create tables or databases with the same names, and the creation process in
either AWS Glue or Athena fails in this case.
We created a wizard in the Athena console to walk you through the steps of upgrading to the AWS
Glue Data Catalog. The upgrade takes just a few minutes, and you can pick up where you left off. For more
information about each upgrade step, see the topics in this section. For more information about working
with data and tables in the AWS Glue Data Catalog, see the guidelines in Best Practices When Using
Athena with AWS Glue (p. 23).
Step 1 - Allow a User to Perform the Upgrade
By default, the action that allows a user to perform the upgrade is not allowed in any policy, including
any managed policies. Because the AWS Glue Data Catalog is shared throughout an account, this extra
failsafe prevents someone from accidentally migrating the catalog.
Before the upgrade can be performed, you need to attach a customer-managed IAM policy, with a policy
statement that allows the upgrade action, to the user who performs the migration.
The following is an example policy statement.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:ImportCatalogToGlue "
],
"Resource": [ "*" ]
}
]
}
Step 2 - Update Customer-Managed/Inline Policies Associated with Athena Users
If you have customer-managed or inline IAM policies associated with Athena users, you need to update
the policy or policies to allow actions that AWS Glue requires. If you use the managed policy, it is
automatically updated. The AWS Glue policy actions to allow are listed in the example policy below. For
the full policy statement, see IAM Policies for User Access (p. 37).
{
"Effect":"Allow",
"Action":[
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource":[
"*"
]
}
Step 3 - Choose Upgrade in the Athena Console
After you make the required IAM policy updates, choose Upgrade in the Athena console. Athena moves
your metadata to the AWS Glue Data Catalog. The upgrade takes only a few minutes. After you upgrade,
the Athena console has a link to open the AWS Glue Catalog Manager from within Athena.
When you create a table using the console, you now have the option to create a table using an AWS Glue
crawler. For more information, see Using AWS Glue Crawlers (p. 24).
FAQ: Upgrading to the AWS Glue Data Catalog
If you created databases and tables using Athena in a region before AWS Glue was available in that
region, metadata is stored in an Athena-managed data catalog, which only Athena and Amazon Redshift
Spectrum can access. To use AWS Glue features together with Athena and Redshift Spectrum, you must
upgrade to the AWS Glue Data Catalog. Athena can only be used together with the AWS Glue Data
Catalog in regions where AWS Glue is available. For a list of regions, see Regions and Endpoints in the
AWS General Reference.
Why should I upgrade to the AWS Glue Data Catalog?
AWS Glue is a fully managed extract, transform, and load (ETL) service. It has three main
components:
An AWS Glue crawler can automatically scan your data sources, identify data formats, and infer
schema.
A fully managed ETL service allows you to transform and move data to various destinations.
The AWS Glue Data Catalog stores metadata information about databases and tables, pointing to a
data store in Amazon S3 or a JDBC-compliant data store.
For more information, see AWS Glue Concepts.
Upgrading to the AWS Glue Data Catalog has the following benefits.
Unified metadata repository
The AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources
and data formats. It provides out-of-the-box integration with Amazon Simple Storage Service (Amazon
S3), Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Amazon Redshift Spectrum,
Athena, Amazon EMR, and any application compatible with the Apache Hive metastore. You can create
your table definitions one time and query across engines.
For more information, see Populating the AWS Glue Data Catalog.
Automatic schema and partition recognition
AWS Glue crawlers automatically crawl your data sources, identify data formats, and suggest schema and
transformations. Crawlers can help automate table creation and automatic loading of partitions that you
can query using Athena, Amazon EMR, and Redshift Spectrum. You can also create tables and partitions
directly using the AWS Glue API, SDKs, and the AWS CLI.
For more information, see Cataloging Tables with a Crawler.
Easy-to-build pipelines
The AWS Glue ETL engine generates Python code that is entirely customizable, reusable, and portable.
You can edit the code using your favorite IDE or notebook and share it with others using GitHub. After
your ETL job is ready, you can schedule it to run on the fully managed, scale-out Spark infrastructure of
AWS Glue. AWS Glue handles provisioning, configuration, and scaling of the resources required to run
your ETL jobs, allowing you to tightly integrate ETL with your workflow.
For more information, see Authoring AWS Glue Jobs in the AWS Glue Developer Guide.
Are there separate charges for AWS Glue?
Yes. With AWS Glue, you pay a monthly rate for storing and accessing the metadata stored in the AWS
Glue Data Catalog, an hourly rate billed per second for AWS Glue ETL jobs and crawler runtime, and an
hourly rate billed per second for each provisioned development endpoint. The AWS Glue Data Catalog
allows you to store up to a million objects at no charge. If you store more than a million objects, you
are charged USD$1 for each 100,000 objects over a million. An object in the AWS Glue Data Catalog is a
table, a partition, or a database. For more information, see AWS Glue Pricing.
Upgrade process FAQ
Who can perform the upgrade? (p. 22)
My users use a managed policy with Athena and Redshift Spectrum. What steps do I need to take to
upgrade? (p. 22)
What happens if I don’t upgrade? (p. 23)
Why do I need to add AWS Glue policies to Athena users? (p. 23)
What happens if I don’t allow AWS Glue policies for Athena users? (p. 23)
Is there risk of data loss during the upgrade? (p. 23)
Is my data also moved during this upgrade? (p. 23)
Who can perform the upgrade?
You need to attach a customer-managed IAM policy with a policy statement that allows the upgrade
action to the user who will perform the migration. This extra check prevents someone from accidentally
migrating the catalog for the entire account. For more information, see Step 1 - Allow a User to Perform
the Upgrade (p. 19).
My users use a managed policy with Athena and Redshift Spectrum. What steps do I need to take to upgrade?
The Athena managed policy has been automatically updated with new policy actions that allow Athena
users to access AWS Glue. However, you still must explicitly allow the upgrade action for the user who
performs the upgrade. To prevent accidental upgrade, the managed policy does not allow this action.
What happens if I don’t upgrade?
If you don’t upgrade, you are not able to use AWS Glue features together with the databases and tables
you create in Athena or vice versa. You can use these services independently. During this time, Athena
and AWS Glue both prevent you from creating databases and tables that have the same names in the
other data catalog. This prevents name collisions when you do upgrade.
Why do I need to add AWS Glue policies to Athena users?
Before you upgrade, Athena manages the data catalog, so Athena actions must be allowed for your users
to perform queries. After you upgrade to the AWS Glue Data Catalog, Athena actions no longer apply to
accessing the AWS Glue Data Catalog, so AWS Glue actions must be allowed for your users. Remember,
the managed policy for Athena has already been updated to allow the required AWS Glue actions, so no
action is required if you use the managed policy.
What happens if I don’t allow AWS Glue policies for Athena users?
If you upgrade to the AWS Glue Data Catalog and don't update a user's customer-managed or inline IAM
policies, Athena queries fail because the user won't be allowed to perform actions in AWS Glue. For the
specific actions to allow, see Step 2 - Update Customer-Managed/Inline Policies Associated with Athena
Users (p. 19).
Is there risk of data loss during the upgrade?
No.
Is my data also moved during this upgrade?
No. The migration only affects metadata.
Best Practices When Using Athena with AWS Glue
When using Athena with the AWS Glue Data Catalog, you can use AWS Glue to create databases and
tables (schema) to be queried in Athena, or you can use Athena to create schema and then use them in
AWS Glue and related services. This topic provides considerations and best practices when using either
method.
Under the hood, Athena uses Presto to execute DML statements and Hive to execute the DDL statements
that create and modify schema. With these technologies, there are a couple of conventions to follow so
that Athena and AWS Glue work well together.
In this topic
Database, Table, and Column Names (p. 24)
Using AWS Glue Crawlers (p. 24)
Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in Sync (p. 24)
Using Multiple Data Sources with Crawlers (p. 25)
Syncing Partition Schema to Avoid "HIVE_PARTITION_SCHEMA_MISMATCH" (p. 27)
Updating Table Metadata (p. 27)
Working with CSV Files (p. 28)
CSV Data Enclosed in Quotes (p. 28)
CSV Files with Headers (p. 30)
Using AWS Glue Jobs for ETL with Athena (p. 30)
Creating Tables Using Athena for AWS Glue ETL Jobs (p. 30)
Using ETL Jobs to Optimize Query Performance (p. 31)
Converting SMALLINT and TINYINT Datatypes to INT When Converting to ORC (p. 32)
Changing Date Data Types to String for Parquet ETL Transformation (p. 32)
Automating AWS Glue Jobs for ETL (p. 32)
Database, Table, and Column Names
When you create schema in AWS Glue to query in Athena, consider the following:
A database name cannot be longer than 252 characters.
A table name cannot be longer than 255 characters.
A column name cannot be longer than 128 characters.
The only acceptable characters for database names, table names, and column names are lowercase
letters, numbers, and the underscore character.
You can use the AWS Glue Catalog Manager to rename columns, but at this time table names and
database names cannot be changed using the AWS Glue console. To correct database names, you need to
create a new database and copy tables to it (in other words, copy the metadata to a new entity). You can
follow a similar process for tables. You can use the AWS Glue SDK or AWS CLI to do this.
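The following is a minimal Athena-side sketch of that approach, assuming hypothetical names old_db, new_db, and my_table and a hypothetical column list and location; in practice, you would reuse the output of SHOW CREATE TABLE rather than retype the definition.
-- Create the correctly named database.
CREATE DATABASE new_db;
-- Display the existing definition so that you can reproduce it.
SHOW CREATE TABLE old_db.my_table;
-- Re-issue the definition in the new database, keeping the same LOCATION
-- so that the new table reads the same data in Amazon S3.
CREATE EXTERNAL TABLE new_db.my_table (
  id STRING,
  amount INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/path/to/data/';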
Using AWS Glue Crawlers
AWS Glue crawlers help discover and register the schema for datasets in the AWS Glue Data Catalog.
The crawlers go through your data, and inspect portions of it to determine the schema. In addition, the
crawler can detect and register partitions. For more information, see Cataloging Data with a Crawler in
the AWS Glue Developer Guide.
Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in Sync
AWS Glue crawlers can be set up to run on a schedule or on demand. For more information, see Time-
Based Schedules for Jobs and Crawlers in the AWS Glue Developer Guide.
If you have data that arrives for a partitioned table at a fixed time, you can set up an AWS Glue crawler
to run on schedule to detect and update table partitions. This can eliminate the need to run a potentially
long and expensive MSCK REPAIR command or manually execute an ALTER TABLE ADD PARTITION
command. For more information, see Table Partitions in the AWS Glue Developer Guide.
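For reference, these are the statements that a scheduled crawler can save you from running by hand; the table name orders, the partition key values, and the bucket are hypothetical.
-- Discover and load all partitions that follow the Hive key=value folder layout.
MSCK REPAIR TABLE orders;
-- Or add a single known partition explicitly.
ALTER TABLE orders ADD PARTITION (year='2018', month='01', day='15')
LOCATION 's3://my-bucket/orders/year=2018/month=01/day=15/';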
Using Multiple Data Sources with Crawlers
When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to
determine where the root for a table is in the directory structure, and which directories are partitions
for the table. In some cases, where the schema detected in two or more directories is similar, the crawler
may treat them as partitions instead of separate tables. One way to help the crawler discover individual
tables is to add each table's root directory as a data store for the crawler.
The following partitions in Amazon S3 are an example:
s3://bucket01/folder1/table1/partition1/file.txt
s3://bucket01/folder1/table1/partition2/file.txt
s3://bucket01/folder1/table1/partition3/file.txt
s3://bucket01/folder1/table2/partition4/file.txt
s3://bucket01/folder1/table2/partition5/file.txt
If the schema for table1 and table2 are similar, and a single data source is set to s3://bucket01/
folder1/ in AWS Glue, the crawler may create a single table with two partition columns: one partition
column that contains table1 and table2, and a second partition column that contains partition1
through partition5.
To have the AWS Glue crawler create two separate tables as intended, use the AWS Glue console to set
the crawler to have two data sources, s3://bucket01/folder1/table1/ and s3://bucket01/folder1/table2, as shown in the following procedure.
To add another data store to an existing crawler in AWS Glue
1. In the AWS Glue console, choose Crawlers, select your crawler, and then choose Action, Edit crawler.
2. Under Add information about your crawler, choose additional settings as appropriate, and then
choose Next.
3. Under Add a data store, change Include path to the table-level directory. For instance, given the
example above, you would change it from s3://bucket01/folder1 to s3://bucket01/folder1/table1/.
Choose Next.
4. For Add another data store, choose Yes, Next.
5. For Include path, enter your other table-level directory (for example, s3://bucket01/folder1/table2/)
and choose Next.
a. Repeat steps 3-5 for any additional table-level directories, and then finish the crawler configuration. The new values for Include locations appear under data stores.
Syncing Partition Schema to Avoid "HIVE_PARTITION_SCHEMA_MISMATCH"
For each table within the AWS Glue Data Catalog that has partition columns, the schema is stored
at the table level and for each individual partition within the table. The schemas for partitions are
populated by an AWS Glue crawler based on the sample of data that it reads within the partition. For
more information, see Using Multiple Data Sources with Crawlers (p. 25).
When Athena runs a query, it validates the schema of the table and the schema of any partitions
necessary for the query. The validation compares the column data types in order and makes sure
that they match for the columns that overlap. This prevents unexpected operations such as adding
or removing columns from the middle of a table. If Athena detects that the schema of a partition
differs from the schema of the table, Athena may not be able to process the query and fails with
HIVE_PARTITION_SCHEMA_MISMATCH.
There are a few ways to fix this issue. First, if the data was accidentally added, you can remove the data
files that cause the difference in schema, drop the partition, and re-crawl the data. Second, you can drop
the individual partition and then run MSCK REPAIR within Athena to re-create the partition using the
table's schema. This second option works only if you are confident that the schema applied will continue
to read the data correctly.
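A minimal sketch of the second option, assuming a hypothetical table my_table partitioned by a dt key:
-- Drop only the partition whose schema no longer matches the table.
ALTER TABLE my_table DROP PARTITION (dt='2018-01-15');
-- Re-create the partition metadata using the table's schema.
MSCK REPAIR TABLE my_table;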
Updating Table Metadata
After a crawl, the AWS Glue crawler automatically assigns certain table metadata to help make it
compatible with other external technologies like Apache Hive, Presto, and Spark. Occasionally, the
crawler may incorrectly assign metadata properties. Manually correct the properties in AWS Glue before
querying the table using Athena. For more information, see Viewing and Editing Table Details in the AWS
Glue Developer Guide.
AWS Glue may mis-assign metadata when a CSV file has quotes around each data field, getting the
serializationLib property wrong. For more information, see CSV Data Enclosed in Quotes (p. 28).
Working with CSV Files
CSV files occasionally have quotes around the data values intended for each column, and there may be
header values included in CSV files, which aren't part of the data to be analyzed. When you use AWS Glue
to create schema from these files, follow the guidance in this section.
CSV Data Enclosed in Quotes
If you run a query in Athena against a table created from a CSV file with quoted data values, update
the table definition in AWS Glue so that it specifies the right SerDe and SerDe properties. This allows
the table definition to use the OpenCSVSerDe. For more information about the OpenCSV SerDe, see
OpenCSVSerDe for Processing CSV (p. 114).
In this case, you need to change the serializationLib property in the SerDeInfo field in
the table to org.apache.hadoop.hive.serde2.OpenCSVSerde and enter appropriate values for
separatorChar, quoteChar, and escapeChar.
For example, for a CSV file with records such as the following:
"John","Doe","123-555-1231","John said \"hello\""
"Jane","Doe","123-555-9876","Jane said \"hello\""
The separatorChar value is a comma, the quoteChar value is a double quote, and the escapeChar value is a backslash.
You can use the AWS Glue console to edit table details as shown in this example:
Alternatively, you can update the table definition in AWS Glue to have a SerDeInfo block such as the
following:
"SerDeInfo": {
"name": "",
"serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
"parameters": {
"separatorChar": ","
"quoteChar": """
"escapeChar": "\\"
}
},
For more information, see Viewing and Editing Table Details in the AWS Glue Developer Guide.
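Alternatively, if you create the table from Athena rather than with a crawler, you can specify the OpenCSVSerDe and the same properties directly in the DDL. The following is a minimal sketch; the table name, columns, and bucket are hypothetical.
CREATE EXTERNAL TABLE IF NOT EXISTS contacts (
  first_name STRING,
  last_name STRING,
  phone STRING,
  note STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://my-bucket/contacts/';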
CSV Files with Headers
If you are writing CSV files from AWS Glue to query using Athena, you must remove the CSV headers
so that the header information is not included in Athena query results. One way to achieve this is to
use AWS Glue jobs, which perform extract, transform, and load (ETL) work. You can write scripts in AWS
Glue using a language that is an extension of the PySpark Python dialect. For more information, see
Authoring Jobs in Glue in the AWS Glue Developer Guide.
The following example shows a function in an AWS Glue script that writes out a dynamic frame
using from_options, and sets the writeHeader format option to false, which removes the header
information:
glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3",
    connection_options = {"path": "s3://MYBUCKET/MYTABLEDATA/"}, format = "csv",
    format_options = {"writeHeader": False}, transformation_ctx = "datasink2")
Using AWS Glue Jobs for ETL with Athena
AWS Glue jobs perform ETL operations. An AWS Glue job runs a script that extracts data from sources,
transforms the data, and loads it into targets. For more information, see Authoring Jobs in Glue in the
AWS Glue Developer Guide.
Creating Tables Using Athena for AWS Glue ETL Jobs
Tables that you create from within Athena must have a table property called classification added to them, which identifies the format of the data. This allows AWS Glue to use the tables for ETL jobs. The classification values can be csv, parquet, orc, avro, or json. An example
create table statement in Athena follows:
CREATE EXTERNAL TABLE sampleTable (
column1 INT,
column2 INT
) STORED AS PARQUET
TBLPROPERTIES (
'classification'='parquet')
If the table property was not added when the table was created, the property can be added using the
AWS Glue console.
To change the classification property using the console
1. Choose Edit Table.
2. For Classification, select the file type and choose Apply.
For more information, see Working with Tables in the AWS Glue Developer Guide.
Using ETL Jobs to Optimize Query Performance
AWS Glue jobs can help you transform data to a format that optimizes query performance in Athena.
Data formats have a large impact on query performance and query costs in Athena.
We recommend the Parquet and ORC formats. AWS Glue supports writing to both of these data formats,
which can make it easier and faster for you to transform data to an optimal format for Athena. For more
information about these formats and other ways to improve performance, see Top Performance Tuning
tips for Amazon Athena.
Converting SMALLINT and TINYINT Datatypes to INT When Converting to ORC
To reduce the likelihood that Athena is unable to read the SMALLINT and TINYINT data types produced
by an AWS Glue ETL job, convert SMALLINT and TINYINT to INT when using the wizard or writing a
script for an ETL job.
Changing Date Data Types to String for Parquet ETL Transformation
Athena currently does not support the DATE data type for Parquet files. Convert DATE data types to
STRING when using the wizard or writing a script for an AWS Glue ETL job.
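If the dates are stored as strings in ISO format (YYYY-MM-DD), you can still filter and group on them by casting at query time. A minimal sketch, assuming a hypothetical table sample_events with a string column event_date:
-- event_date is stored as a STRING in the Parquet files.
SELECT CAST(event_date AS DATE) AS event_day,
       COUNT(*) AS events
FROM sample_events
WHERE CAST(event_date AS DATE) BETWEEN DATE '2018-01-01' AND DATE '2018-01-31'
GROUP BY CAST(event_date AS DATE);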
Automating AWS Glue Jobs for ETL
You can configure AWS Glue ETL jobs to run automatically based on triggers. This feature is ideal when
data from outside AWS is being pushed to an S3 bucket in a suboptimal format for querying in Athena.
For more information, see Triggering AWS Glue Jobs in the AWS Glue Developer Guide.
Connecting to Amazon Athena with ODBC and JDBC Drivers
To explore and visualize your data with business intelligence tools, download, install, and configure an
ODBC (Open Database Connectivity) or JDBC (Java Database Connectivity) driver.
Topics
Using Athena with the JDBC Driver (p. 33)
Connecting to Amazon Athena with ODBC (p. 35)
Using Athena with the JDBC Driver
You can use a JDBC connection to connect Athena to business intelligence tools, such as SQL Workbench.
To do this, download, install, and configure the Athena JDBC driver, using the following link on Amazon
S3.
Download the JDBC Driver
1. Download the driver (JDBC 4.1 and Java 8 compatible) from this location in Amazon S3: https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.1.0.jar.
2. Alternatively, use the AWS CLI with the following command:
aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.1.0.jar [local_directory]
Specify the Connection String
To specify the JDBC driver connection URL in your custom application, use the string in this format:
jdbc:awsathena://athena.{REGION}.amazonaws.com:443
where {REGION} is a region identifier, such as us-west-2. For information about Athena regions, see Regions.
Specify the JDBC Driver Class Name
To use the driver in custom applications, set up your Java class path to the location of the JAR file that you downloaded from Amazon S3 (https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.1.0.jar) in the previous section. This makes the classes within the JAR available for use.
The main JDBC driver class is com.amazonaws.athena.jdbc.AthenaDriver.
Provide the JDBC Driver Credentials
To gain access to AWS services and resources, such as Athena and the Amazon S3 buckets, provide JDBC
driver credentials to your application.
To provide credentials in the Java code for your application:
1. Use a class that implements the AWSCredentialsProvider interface.
2. Set the JDBC property, aws_credentials_provider_class, equal to the class name, and include it
in your classpath.
3. To include constructor parameters, set the JDBC property
aws_credentials_provider_arguments as specified in the following section about configuration
options.
Another method to supply credentials to BI tools, such as SQL Workbench, is to provide the AWS access key and AWS secret key as the JDBC user and password properties, respectively.
Users who connect through the JDBC driver and have custom access policies attached to their profiles
need permissions for policy actions in addition to those in the Amazon Athena API Reference.
Policies
You must allow JDBC users to perform a set of policy-specific actions. These actions are not part of the
Athena API. If the following actions are not allowed, users will be unable to see databases and tables:
athena:GetCatalogs
athena:GetExecutionEngine
athena:GetExecutionEngines
athena:GetNamespace
athena:GetNamespaces
athena:GetTable
athena:GetTables
Configure the JDBC Driver Options
You can configure the following options for the JDBC driver. With this version of the driver, you can also pass parameters using the standard JDBC URL syntax, for example: jdbc:awsathena://athena.us-west-1.amazonaws.com:443?max_error_retries=20&connection_timeout=20000.
Options for the JDBC Driver

s3_staging_dir
   The S3 location to which your query output is written, for example s3://query-results-bucket/folder/, which is established under Settings in the Athena Console, https://console.aws.amazon.com/athena/. The JDBC driver then asks Athena to read the results and provide rows of data back to the user.
   Default: N/A. Required: Yes.

query_results_encryption_option
   The encryption method to use for the directory specified by s3_staging_dir. If not specified, the location is not encrypted. Valid values are SSE_S3, SSE_KMS, and CSE_KMS.
   Default: N/A. Required: No.

query_results_aws_kms_key
   The Key ID of the AWS customer master key (CMK) to use if query_results_encryption_option specifies SSE-KMS or CSE-KMS. For example, 123abcde-4e56-56f7-g890-1234h5678i9j.
   Default: N/A. Required: No.

aws_credentials_provider_class
   The credentials provider class name, which implements the AWSCredentialsProvider interface.
   Default: N/A. Required: No.

aws_credentials_provider_arguments
   Arguments for the credentials provider constructor as comma-separated values.
   Default: N/A. Required: No.

max_error_retries
   The maximum number of retries that the JDBC client attempts to make a request to Athena.
   Default: 10. Required: No.

connection_timeout
   The maximum amount of time, in milliseconds, to make a successful connection to Athena before an attempt is terminated.
   Default: 10,000. Required: No.

socket_timeout
   The maximum amount of time, in milliseconds, to wait for a socket in order to send data to Athena.
   Default: 10,000. Required: No.

retry_base_delay
   Minimum delay amount, in milliseconds, between retries when attempting to connect to Athena.
   Default: 100. Required: No.

retry_max_backoff_time
   Maximum delay amount, in milliseconds, between retries when attempting to connect to Athena.
   Default: 1000. Required: No.

log_path
   Local path of the Athena JDBC driver logs. If no log path is provided, then no log files are created.
   Default: N/A. Required: No.

log_level
   Log level of the Athena JDBC driver logs. Valid values: INFO, DEBUG, WARN, ERROR, ALL, OFF, FATAL, TRACE.
   Default: N/A. Required: No.
Connecting to Amazon Athena with ODBC
Download the ODBC driver, the Amazon Athena ODBC driver License Agreement, and the documentation
for the driver using the following links.
Amazon Athena ODBC Driver License Agreement
License Agreement
Windows
Windows 32 bit ODBC Driver
Windows 64 bit ODBC Driver
Linux
Linux 32 bit ODBC Driver
Linux 64 bit ODBC Driver
OSX
OSX ODBC Driver
ODBC Driver Connection String
For information about the connection string to use with your application, see ODBC Driver Installation
and Configuration Guide.
Documentation
ODBC Driver Installation and Configuration Guide.
Security
Amazon Athena uses IAM policies to restrict access to Athena operations. Encryption options enable you
to encrypt query result files in Amazon S3 and query data encrypted in Amazon S3. Users must have the
appropriate permissions to access the Amazon S3 locations and decrypt files.
Topics
Setting User and Amazon S3 Bucket Permissions (p. 37)
Configuring Encryption Options (p. 41)
Setting User and Amazon S3 Bucket Permissions
To run queries in Athena, you must have the appropriate permissions for:
The Athena actions.
The Amazon S3 locations where the underlying data that you are going to query in Athena is stored.
If you are an administrator for other users, make sure that they have appropriate permissions associated
with their user profiles.
IAM Policies for User Access
To allow or deny Athena service actions for yourself or other users, use IAM policies attached to
principals, such as users or groups.
Each IAM policy consists of statements that define the actions that are allowed or denied. For a list of
actions, see the Amazon Athena API Reference.
Managed policies are easy to use and are automatically updated with the required actions as the service
evolves.
The AmazonAthenaFullAccess policy is the managed policy for Athena. Attach this policy to users and
other principals who need full access to Athena. For more information and step-by-step instructions for
attaching a policy to a user, see Attaching Managed Policies in the AWS Identity and Access Management
User Guide.
Customer-managed and inline policies allow you to specify more granular Athena actions within a policy
to fine-tune access. We recommend that you use the AmazonAthenaFullAccess policy as a starting
point and then allow or deny specific actions listed in the Amazon Athena API Reference. For more
information about inline policies, see Managed Policies and Inline Policies in the AWS Identity and Access
Management User Guide.
If you also have principals that connect using JDBC, you must allow additional actions not listed in the
API. For more information, see Service Actions for JDBC Connections (p. 40).
AmazonAthenaFullAccess Managed Policy
Managed policy contents change, so the policy shown here may be out-of-date. Check the IAM console
for the most up-to-date policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:*"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::athena-examples*"
]
}
]
}
AWSQuicksightAthenaAccess Managed Policy
An additional managed policy, AWSQuicksightAthenaAccess, grants access to actions that Amazon
QuickSight needs to integrate with Athena. This policy includes deprecated actions for Athena that are
not in the API. Attach this policy only to principals who use Amazon QuickSight in conjunction with
Athena.
Managed policy contents change, so the policy shown here may be out-of-date. Check the IAM console
for the most up-to-date policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:BatchGetQueryExecution",
"athena:CancelQueryExecution",
"athena:GetCatalogs",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetNamespaces",
"athena:GetQueryExecution",
"athena:GetQueryExecutions",
"athena:GetQueryResults",
"athena:GetTable",
"athena:GetTables",
"athena:ListQueryExecutions",
"athena:RunQuery",
"athena:StartQueryExecution",
"athena:StopQueryExecution"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
}
]
}
Access through JDBC Connections
To gain access to AWS services and resources, such as Athena and the Amazon S3 buckets, provide JDBC
driver credentials to your application. See Connect with the JDBC Driver (p. 33).
Amazon S3 Permissions
In addition to the allowed actions for Athena that you define in policies, if you or your users need to
create tables and work with underlying data, you must grant appropriate access to the Amazon S3
location of the data.
You can do this using user policies, bucket policies, or both. For detailed information and scenarios about
how to grant Amazon S3 access, see Example Walkthroughs: Managing Access in the Amazon Simple
Storage Service Developer Guide. For more information and an example of which Amazon S3 actions to
allow, see the example bucket policy later in this topic.
Note
Athena does not support restricting or allowing access to Amazon S3 resources based on the
aws:SourceIp condition key.
Cross-account Permissions
A common scenario is granting access to users in an account different from the bucket owner so that
they can perform queries. In this case, use a bucket policy to grant access.
The following example bucket policy, created and applied to bucket s3://my-athena-data-bucket
by the bucket owner, grants access to all users in account 123456789123, which is a different account.
{
"Version": "2012-10-17",
"Id": "MyPolicyID",
"Statement": [
{
"Sid": "MyStatementSid",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789123:root"
},
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::my-athena-data-bucket",
"arn:aws:s3:::my-athena-data-bucket/*"
]
}
]
}
To grant access to a particular user in an account, replace the Principal key with
a key that specifies the user instead of root. For example, for user profile Dave, use
arn:aws:iam::123456789123:user/Dave.
Configuring Encryption Options
You can use Athena to query encrypted data in Amazon S3 by indicating data encryption when you
create a table. You can also choose to encrypt the results of all queries in Amazon S3, which Athena
stores in a location known as the S3 staging directory. You can encrypt query results stored in Amazon
S3 whether the underlying dataset is encrypted in Amazon S3 or not. You set up query-result encryption
using the Athena console or, if you connect using the JDBC driver, by configuring driver options. You
specify the type of encryption to use and the Amazon S3 staging directory location. Query-result
encryption applies to all queries.
These options encrypt data at rest in Amazon S3. Regardless of whether you use these options, transport
layer security (TLS) encrypts objects in-transit between Athena resources and between Athena and
Amazon S3. Query results stream to JDBC clients as plain text and are encrypted in transit using SSL.
Important
The setup for querying an encrypted dataset in Amazon S3 and the options in Athena to encrypt
query results are independent. Each option is enabled and configured separately. You can use
different encryption methods or keys for each. This means that reading encrypted data in
Amazon S3 doesn't automatically encrypt Athena query results in Amazon S3. The opposite is
also true. Encrypting Athena query results in Amazon S3 doesn't encrypt the underlying dataset
in Amazon S3.
Athena supports the following S3 encryption options, both for encrypted datasets in Amazon S3 and for
encrypted query results:
Server-side encryption with an Amazon S3-managed key (SSE-S3)
Server-side encryption with an AWS KMS-managed key (SSE-KMS).
Note
With SSE-KMS, Athena does not require you to indicate data is encrypted when creating a
table.
Client-side encryption with an AWS KMS-managed key (CSE-KMS)
For more information about AWS KMS encryption with Amazon S3, see What is AWS Key Management
Service and How Amazon Simple Storage Service (Amazon S3) Uses AWS KMS in the AWS Key
Management Service Developer Guide.
Athena does not support SSE with customer-provided keys (SSE-C), nor does it support client-side
encryption using a client-side master key. To compare Amazon S3 encryption options, see Protecting
Data Using Encryption in the Amazon Simple Storage Service Developer Guide.
Athena does not support running queries from one region on encrypted data stored in Amazon S3 in
another region.
Permissions for Encrypting and Decrypting Data
If you use SSE-S3 for encryption, Athena users require no additional permissions for encryption and
decryption. Having the appropriate Amazon S3 permissions for the appropriate Amazon S3 location
(and for Athena actions) is enough. For more information about policies that allow appropriate
Athena and Amazon S3 permissions, see IAM Policies for User Access (p. 37) and Amazon S3
Permissions (p. 40).
For data that is encrypted using AWS KMS, Athena users must be allowed to perform particular AWS
KMS actions in addition to Athena and S3 permissions. You allow these actions by editing the key policy
for the KMS customer master keys (CMKs) that are used to encrypt data in Amazon S3. The easiest
way to do this is to use the IAM console to add key users to the appropriate KMS key policies. For
information about how to add a user to a KMS key policy, see How to Modify a Key Policy in the AWS Key
Management Service Developer Guide.
Note
Advanced key policy administrators may want to fine-tune key policies. kms:Decrypt is the
minimum allowed action for an Athena user to work with an encrypted dataset. To work with
encrypted query results, the minimum allowed actions are kms:GenerateDataKey and
kms:Decrypt.
When using Athena to query datasets in Amazon S3 with a large number of objects that are encrypted
with AWS KMS, AWS KMS may throttle query results. This is more likely when there are a large number
of small objects. Athena backs off retry requests, but a throttling error might still occur. In this case, visit
the AWS Support Center and create a case to increase your limit. For more information about limits and
AWS KMS throttling, see Limits in the AWS Key Management Service Developer Guide.
Creating Tables Based on Encrypted Datasets in Amazon S3
You indicate to Athena that a dataset is encrypted in Amazon S3 when you create a table (this is not
required when using SSE-KMS). For both SSE-S3 and KMS encryption, Athena is able to determine the
proper materials to use to decrypt the dataset and create the table, so you don't need to provide key
information.
Users that run queries, including the user who creates the table, must have the appropriate permissions
as described earlier.
Important
If you use Amazon EMR along with EMRFS to upload encrypted Parquet files, you must disable
multipart uploads (set fs.s3n.multipart.uploads.enabled to false); otherwise, Athena
is unable to determine the Parquet file length and a HIVE_CANNOT_OPEN_SPLIT error occurs.
For more information, see Configure Multipart Upload for Amazon S3 in the EMR Management
Guide.
Indicate that the dataset is encrypted in Amazon S3 in one of the following ways. This is not required if
SSE-KMS is used.
Use the CREATE TABLE (p. 136) statement with a TBLPROPERTIES clause that specifies 'has_encrypted_data'='true', as shown in the example later in this section.
Use the JDBC driver (p. 33) and set the TBLPROPERTIES value as above when you execute CREATE
TABLE (p. 136) using statement.executeQuery().
Use the Add table wizard in the Athena console, and then choose Encrypted data set when you
specify a value for Location of input data set.
Tables based on encrypted data in Amazon S3 appear in the Database list with an encryption icon.
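For example, the TBLPROPERTIES option from the list above might look like the following minimal sketch; the table, columns, delimiter, and bucket name are hypothetical.
CREATE EXTERNAL TABLE IF NOT EXISTS encrypted_logs (
  request_time STRING,
  request_ip STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-encrypted-bucket/logs/'
TBLPROPERTIES ('has_encrypted_data'='true');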
Encrypting Query Results Stored in Amazon S3
You use the Athena console or JDBC driver properties to specify that query results, which Athena stores
in the S3 staging directory, are encrypted in Amazon S3. This setting applies to all Athena query results.
You can't configure the setting for individual databases, tables, or queries.
To encrypt query results stored in Amazon S3 using the console
1. In the Athena console, choose Settings.
2. For Query result location, enter a custom value or leave the default. This is the Amazon S3 staging
directory where query results are stored.
3. Choose Encrypt query results.
4. For Encryption type, choose CSE-KMS, SSE-KMS, or SSE-S3.
If you chose SSE-KMS or CSE-KMS, for Encryption key, specify one of the following:
If your account has access to an existing KMS CMK, choose its alias, or
Choose Enter a KMS key ARN and then enter an ARN.
To create a new KMS key, choose Create KMS key, use the IAM console to create the key, and then
return to specify the key by alias or ARN as described in the previous steps. For more information,
see Creating Keys in the AWS Key Management Service Developer Guide.
5. Choose Save.
Encrypting Query Results Stored in Amazon S3 Using the JDBC Driver
You can configure the JDBC Driver to encrypt your query results using any of the encryption protocols
that Athena supports. For more information, see JDBC Driver Options (p. 34).
Working with Source Data
Topics
Tables and Databases Creation Process in Athena (p. 46)
Names for Tables, Databases, and Columns (p. 50)
Table Location in Amazon S3 (p. 51)
Partitioning Data (p. 51)
Converting to Columnar Formats (p. 55)
Tables and Databases Creation Process in Athena
Amazon Athena uses Apache Hive data definition language (DDL) statements to define and query
external tables where data resides on Amazon Simple Storage Service.
You can run DDL statements:
In the Athena console
Using a JDBC or an ODBC driver
Using the Athena Create Table wizard.
When you create a new table schema in Athena, Athena stores the schema in a data catalog and uses it
when you run queries.
Athena uses an approach known as schema-on-read, which means a schema is projected onto your data
at the time you execute a query. This eliminates the need for data loading or transformation.
Athena does not modify your data in Amazon S3.
Athena uses Apache Hive to define tables and create databases, which are essentially logical namespaces of tables.
When you create a database and table in Athena, you are simply describing the schema and the location
where the table data are located in Amazon S3 for read-time querying. Database and table, therefore,
have a slightly different meaning than they do for traditional relational database systems because the
data isn't stored along with the schema definition for the database and table.
When you query, you query the table using standard SQL and the data is read at that time. You can find
guidance for how to create databases and tables using Apache Hive documentation, but the following
provides guidance specifically for Athena.
Hive supports multiple data formats through the use of serializer-deserializer (SerDe) libraries. You
can also define complex schemas using regular expressions. For a list of supported SerDe libraries, see
Supported Data Formats, SerDes, and Compression Formats (p. 108).
Requirements for Tables in Athena and Data in Amazon S3
When you create a table, you specify an Amazon S3 bucket location for the underlying data using the
LOCATION clause. Consider the following:
You must have the appropriate permissions to work with data in the Amazon S3 location. For more
information, see Setting User and Amazon S3 Bucket Permissions (p. 37).
If the data is not encrypted in Amazon S3, it can be stored in a different region from the primary region
where you run Athena. Standard inter-region data transfer rates for Amazon S3 apply in addition to
standard Athena charges.
If the data is encrypted in Amazon S3, it must be stored in the same region, and the user or principal
who creates the table in Athena must have the appropriate permissions to decrypt the data. For more
information, see Configuring Encryption Options (p. 41).
Athena does not support different storage classes within the bucket specified by the LOCATION
clause, does not support the GLACIER storage class, and does not support Requester Pays buckets. For
more information, see Storage Classes, Changing the Storage Class of an Object in Amazon S3, and
Requester Pays Buckets in the Amazon Simple Storage Service Developer Guide.
Functions Supported
The functions supported in Athena queries are those found within Presto. For more information, see
Presto 0.172 Functions and Operators in the Presto documentation.
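For example, the following query uses a few standard Presto functions against the cloudfront_logs table from the Getting Started tutorial; this is a sketch, so adjust the table and column names to your own data.
SELECT os,
       approx_distinct(requestip) AS unique_ips,
       round(avg(bytes), 2) AS avg_bytes
FROM cloudfront_logs
GROUP BY os;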
CREATE TABLE AS Type Statements Are Not Supported
Athena does not support CREATE TABLE AS type statements, for example, CREATE TABLE AS SELECT,
which creates a table from the result of a SELECT query statement.
Transactional Data Transformations Are Not Supported
Athena does not support transaction-based operations (such as the ones found in Hive or Presto) on
table data. For a full list of keywords not supported, see Unsupported DDL (p. 148).
Operations That Change Table States Are ACID
When you create, update, or delete tables, those operations are guaranteed ACID-compliant. For
example, if multiple users or clients attempt to create or alter an existing table at the same time, only
one will be successful.
All Tables Are EXTERNAL
If you use CREATE TABLE without the EXTERNAL keyword, Athena issues an error; only tables with the
EXTERNAL keyword can be created. We recommend that you always use the EXTERNAL keyword. When
you drop a table in Athena, only the table metadata is removed; the data remains in Amazon S3.
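For example, dropping a table removes only its definition from the catalog; a minimal sketch with a hypothetical table name:
-- The table definition is removed from the data catalog.
-- The underlying files at the table's LOCATION remain in Amazon S3.
DROP TABLE IF EXISTS my_old_table;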
UDF and UDAF Are Not Supported
User-defined functions (UDFs and UDAFs) and stored procedures are not supported.
To create a table using the AWS Glue Data Catalog
1. Open the Athena console at https://console.aws.amazon.com/athena/.
2. Choose AWS Glue Data Catalog. You can now create a table with the AWS Glue Crawler. For more
information, see Using AWS Glue Crawlers (p. 24).
To create a table using the wizard
1. Open the Athena console at https://console.aws.amazon.com/athena/.
2. Under the database display in the Query Editor, choose Add table, which displays a wizard.
3. Follow the steps for creating your table.
To create a database using Hive DDL
A database in Athena is a logical grouping for tables you create in it.
1. Open the Athena console at https://console.aws.amazon.com/athena/.
2. Choose Query Editor.
3. Enter CREATE DATABASE myDataBase and choose Run Query.
4. Select your database from the menu. It is likely to be an empty database.
To create a table using Hive DDL
The Athena Query Editor displays the current database. If you create a table and don't specify a
database, the table is created in the database chosen in the Databases section on the Catalog tab.
1. In the database that you created, create a table by entering the following statement and choosing
Run Query:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` Date,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
OS String,
Browser String,
BrowserVersion String
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s
+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION 's3://athena-examples/cloudfront/plaintext/';
2. If the table was successfully created, you can then run queries against your data.
Names for Tables, Databases, and Columns
Use these tips for naming items in Athena.
Table names and table column names in Athena must be lowercase
If you are interacting with Apache Spark, then your table names and table column names must be
lowercase. Athena is case-insensitive and turns table names and column names to lower case, but Spark
requires lowercase table and column names.
Queries with mixedCase column names, such as profileURI, or upper case column names do not work.
Athena table, database, and column names allow only underscore special characters
Athena table, database, and column names cannot contain special characters, other than underscore
(_).
Names that begin with an underscore
Use backticks to enclose table or column names that begin with an underscore. For example:
CREATE TABLE `_myUnderScoreTable` (
`_id` string,
`_index` string,
...
Table names that include numbers
Enclose table names that include numbers in quotation marks. For example:
CREATE TABLE "Table123"
`_id` string,
`_index` string,
...
Table Location in Amazon S3
When you run a CREATE TABLE query in Athena, you register your table with the data catalog
that Athena uses. If you migrated to AWS Glue, this is the catalog from AWS Glue. You also specify the
location in Amazon S3 for your table in this format: s3://bucketname/keyname.
Use these tips and examples when you specify the location in Amazon S3.
Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and
cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path
only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location,
remove any empty files, and move unneeded files to another location.
In the LOCATION clause, use a trailing slash for your folder or bucket.
Use:
s3://bucketname/keyname/
Do not use filenames, wildcards, or glob patterns for specifying file locations.
Do not add the full HTTP notation, such as s3.amazon.com to the Amazon S3 bucket path.
Do not use:
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
s3://bucketname/keyname/filename.csv
s3://test-bucket.s3.amazon.com
arn:aws:s3:::bucketname/keyname
Partitioning Data
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving
performance and reducing cost. Athena leverages Hive for partitioning data. You can partition your
data by any key. A common practice is to partition the data based on time, often leading to a multi-
level partitioning scheme. For example, a customer who has data coming in every hour might decide to
partition by year, month, date, and hour. Another customer, who has data coming from many different
sources but loaded one time per day, may partition by a data source identifier and date.
To create a table with partitions, you must define it during the CREATE TABLE statement. Use
PARTITIONED BY to define the keys by which to partition data. There are two scenarios discussed below:
1. Data is already partitioned, stored on Amazon S3, and you need to access the data on Athena.
2. Data is not partitioned.
Scenario 1: Data already partitioned and stored on S3 in Hive format
Storing Partitioned Data
Partitions are stored in separate folders in Amazon S3. For example, here is the partial listing for sample
ad impressions:
aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
PRE dt=2009-04-12-14-10/
PRE dt=2009-04-12-14-15/
PRE dt=2009-04-12-14-20/
PRE dt=2009-04-12-15-00/
PRE dt=2009-04-12-15-05/
Here, logs are stored with the column name (dt) set equal to date, hour, and minute increments. When
you give a DDL with the location of the parent folder, the schema, and the name of the partitioned
column, Athena can query data in those subfolders.
Creating a Table
To make a table out of this data, create a partition along 'dt' as in the following Athena DDL statement:
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string,
number string,
processId string,
browserCookie string,
requestEndTime string,
timers struct<modelLookup:string, requestTime:string>,
threadId string,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer,
userAgent, userCookie, ip' )
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;
This table uses Hive's native JSON serializer-deserializer to read JSON data stored in Amazon S3. For
more information about the formats supported, see Supported Data Formats, SerDes, and Compression
Formats (p. 108).
After you execute this statement in Athena, choose New Query and execute:
MSCK REPAIR TABLE impressions
Athena loads the data in the partitions.
Query the Data
Now, query the data from the impressions table using the partition column. Here's an example:
SELECT dt,impressionid FROM impressions WHERE dt<'2009-04-12-14-00' and
dt>='2009-04-12-13-00' ORDER BY dt DESC LIMIT 100
This query should show you data similar to the following:
2009-04-12-13-20 ap3HcVKAWfXtgIPu6WpuUfAfL0DQEc
2009-04-12-13-20 17uchtodoS9kdeQP1x0XThKl5IuRsV
2009-04-12-13-20 JOUf1SCtRwviGw8sVcghqE5h0nkgtp
2009-04-12-13-20 NQ2XP0J0dvVbCXJ0pb4XvqJ5A4QxxH
2009-04-12-13-20 fFAItiBMsgqro9kRdIwbeX60SROaxr
2009-04-12-13-20 V4og4R9W6G3QjHHwF7gI1cSqig5D1G
2009-04-12-13-20 hPEPtBwk45msmwWTxPVVo1kVu4v11b
2009-04-12-13-20 v0SkfxegheD90gp31UCr6FplnKpx6i
2009-04-12-13-20 1iD9odVgOIi4QWkwHMcOhmwTkWDKfj
2009-04-12-13-20 b31tJiIA25CK8eDHQrHnbcknfSndUk
Scenario 2: Data is not partitioned
A layout like the following does not, however, work for automatically adding partition data with MSCK
REPAIR TABLE:
aws s3 ls s3://athena-examples/elb/plaintext/ --recursive
2016-11-23 17:54:46 11789573 elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:46 8776899 elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:46 9309800 elb/plaintext/2015/01/01/part-r-00002-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:47 9412570 elb/plaintext/2015/01/01/part-r-00003-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:47 10725938 elb/plaintext/2015/01/01/part-r-00004-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:46 9439710 elb/plaintext/2015/01/01/part-r-00005-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:47 0 elb/plaintext/2015/01/01_$folder$
2016-11-23 17:54:47 9012723 elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:47 7571816 elb/plaintext/2015/01/02/part-r-00007-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:47 9673393 elb/plaintext/2015/01/02/part-r-00008-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:48 11979218 elb/plaintext/2015/01/02/part-r-00009-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:48 9546833 elb/plaintext/2015/01/02/part-r-00010-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:48 10960865 elb/plaintext/2015/01/02/part-r-00011-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:48 0 elb/plaintext/2015/01/02_$folder$
2016-11-23 17:54:48 11360522 elb/plaintext/2015/01/03/part-r-00012-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:48 11211291 elb/plaintext/2015/01/03/part-r-00013-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:48 8633768 elb/plaintext/2015/01/03/part-r-00014-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:49 11891626 elb/plaintext/2015/01/03/part-r-00015-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:49 9173813 elb/plaintext/2015/01/03/part-r-00016-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:49 11899582 elb/plaintext/2015/01/03/part-r-00017-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:49 0 elb/plaintext/2015/01/03_$folder$
2016-11-23 17:54:50 8612843 elb/plaintext/2015/01/04/part-r-00018-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:50 10731284 elb/plaintext/2015/01/04/part-r-00019-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:50 9984735 elb/plaintext/2015/01/04/part-r-00020-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:50 9290089 elb/plaintext/2015/01/04/part-r-00021-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:50 7896339 elb/plaintext/2015/01/04/part-r-00022-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:51 8321364 elb/plaintext/2015/01/04/part-r-00023-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:51 0 elb/plaintext/2015/01/04_$folder$
2016-11-23 17:54:51 7641062 elb/plaintext/2015/01/05/part-r-00024-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:51 10253377 elb/plaintext/2015/01/05/part-r-00025-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:51 8502765 elb/plaintext/2015/01/05/part-r-00026-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:51 11518464 elb/plaintext/2015/01/05/part-r-00027-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:51 7945189 elb/plaintext/2015/01/05/part-r-00028-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:51 7864475 elb/plaintext/2015/01/05/part-r-00029-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:51 0 elb/plaintext/2015/01/05_$folder$
2016-11-23 17:54:51 11342140 elb/plaintext/2015/01/06/part-r-00030-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:51 8063755 elb/plaintext/2015/01/06/part-r-00031-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:52 9387508 elb/plaintext/2015/01/06/part-r-00032-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:52 9732343 elb/plaintext/2015/01/06/part-r-00033-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:52 11510326 elb/plaintext/2015/01/06/part-r-00034-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:52 9148117 elb/plaintext/2015/01/06/part-r-00035-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:52 0 elb/plaintext/2015/01/06_$folder$
2016-11-23 17:54:52 8402024 elb/plaintext/2015/01/07/part-r-00036-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:52 8282860 elb/plaintext/2015/01/07/part-r-00037-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:52 11575283 elb/plaintext/2015/01/07/part-r-00038-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:53 8149059 elb/plaintext/2015/01/07/part-r-00039-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:53 10037269 elb/plaintext/2015/01/07/part-r-00040-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:53 10019678 elb/plaintext/2015/01/07/part-r-00041-ce65fca5-d6c6-40e6-
b1f9-190cc4f93814.txt
2016-11-23 17:54:53 0 elb/plaintext/2015/01/07_$folder$
2016-11-23 17:54:53 0 elb/plaintext/2015/01_$folder$
2016-11-23 17:54:53 0 elb/plaintext/2015_$folder$
In this case, you would have to use ALTER TABLE ADD PARTITION to add each partition manually.
For example, to load the data in s3://athena-examples/elb/plaintext/2015/01/01/, you can run the
following:
ALTER TABLE elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='01')
location 's3://athena-examples/elb/plaintext/2015/01/01/'
You can also automate adding partitions by using the JDBC driver (p. 33).
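If you have many partitions, running a separate statement for each one is tedious. As a sketch based on the same table and layout shown above, you can also list several PARTITION clauses in a single ALTER TABLE ADD statement; only the first few days are shown here:
ALTER TABLE elb_logs_raw_native_part ADD
  PARTITION (year='2015',month='01',day='01') location 's3://athena-examples/elb/plaintext/2015/01/01/'
  PARTITION (year='2015',month='01',day='02') location 's3://athena-examples/elb/plaintext/2015/01/02/'
  PARTITION (year='2015',month='01',day='03') location 's3://athena-examples/elb/plaintext/2015/01/03/'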
Converting to Columnar Formats
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats, such as Apache Parquet or ORC.
You can do this for existing Amazon S3 data sources by creating a cluster in Amazon EMR and converting the
data using Hive. The following example using the AWS CLI shows you how to do this with a script and data
stored in Amazon S3.
Overview
The process for converting to columnar formats using an EMR cluster is as follows:
1. Create an EMR cluster with Hive installed.
2. In the step section of the cluster create statement, specify a script stored in Amazon S3, which points
to your input data and creates output data in the columnar format in an Amazon S3 location. In this
example, the cluster auto-terminates.
The full script is located on Amazon S3 at:
s3://athena-examples/conversion/write-parquet-to-s3.q
Here's an example script beginning with the CREATE TABLE snippet:
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core-1.0.0-amzn-5.jar;
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string,
number string,
processId string,
browserCookie string,
requestEndTime string,
timers struct<modelLookup:string, requestTime:string>,
threadId string,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer,
userAgent, userCookie, ip' )
LOCATION 's3://${REGION}.elasticmapreduce/samples/hive-ads/tables/impressions' ;
Note
Replace REGION in the LOCATION clause with the Region where you are running queries. For
example, if your console is in us-east-1, REGION is us-east-1, and the LOCATION becomes
s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions.
This creates the table in Hive on the cluster, using samples located in the Amazon EMR samples
bucket.
3. On Amazon EMR release 4.7.0, include the ADD JAR line to find the appropriate JsonSerDe. The
prettified sample data looks like the following:
{
"number": "977680",
"referrer": "fastcompany.com",
"processId": "1823",
"adId": "TRktxshQXAHWo261jAHubijAoNlAqA",
"browserCookie": "mvlrdwrmef",
"userCookie": "emFlrLGrm5fA2xLFT5npwbPuG7kf6X",
"requestEndTime": "1239714001000",
"impressionId": "1I5G20RmOuG2rt7fFGFgsaWk9Xpkfb",
"userAgent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR
2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; InfoPa",
"timers": {
"modelLookup": "0.3292",
"requestTime": "0.6398"
},
"threadId": "99",
"ip": "67.189.155.225",
"modelId": "bxxiuxduad",
"hostname": "ec2-0-51-75-39.amazon.com",
"sessionId": "J9NOccA3dDMFlixCuSOtl9QBbjs6aS",
"requestBeginTime": "1239714000000"
}
4. To load the data from the partitions, the script runs the following in Hive:
MSCK REPAIR TABLE impressions;
The script then creates a table that stores your data in a Parquet-formatted file on Amazon S3:
CREATE EXTERNAL TABLE parquet_hive (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string
) STORED AS PARQUET
LOCATION 's3://myBucket/myParquet/';
The data are inserted from the impressions table into parquet_hive:
INSERT OVERWRITE TABLE parquet_hive
SELECT
requestbegintime,
adid,
impressionid,
referrer,
useragent,
usercookie,
ip FROM impressions WHERE dt='2009-04-14-04-05';
The script stores the above impressions table columns for the date 2009-04-14-04-05 into
s3://myBucket/myParquet/ in a Parquet-formatted file.
5. After your EMR cluster is terminated, create your table in Athena, which uses the data in the format
produced by the cluster.
Before you begin
This procedure requires you to create an Amazon EMR cluster. For more information about Amazon EMR, see the Amazon EMR
Management Guide.
Follow the instructions found in Setting Up (p. 12).
Example: Converting data to Parquet using an EMR cluster
1. Use the AWS CLI to create a cluster. If you need to install the AWS CLI, see Installing the AWS
Command Line Interface in the AWS Command Line Interface User Guide.
2. You need roles to use Amazon EMR, so if you haven't used Amazon EMR before, create the default
roles using the following command:
aws emr create-default-roles
3. Create an Amazon EMR cluster using the emr-4.7.0 release to convert the data using the following
AWS CLI emr create-cluster command:
export REGION=us-east-1
export SAMPLEURI=s3://${REGION}.elasticmapreduce/samples/hive-ads/tables/impressions/
export S3BUCKET=myBucketName
aws emr create-cluster \
--applications Name=Hadoop Name=Hive Name=HCatalog \
--ec2-attributes KeyName=myKey,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-mySubnetId \
--service-role EMR_DefaultRole \
--release-label emr-4.7.0 \
--instance-type m4.large \
--instance-count 1 \
--steps Type=HIVE,Name="Convert to Parquet",ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://athena-examples/conversion/write-parquet-to-s3.q,-hiveconf,INPUT=${SAMPLEURI},-hiveconf,OUTPUT=s3://${S3BUCKET}/myParquet,-hiveconf,REGION=${REGION}] \
--region ${REGION} \
--auto-terminate
For more information, see Create and Use IAM Roles for Amazon EMR in the Amazon EMR
Management Guide.
A successful request gives you a cluster ID.
4. Monitor the progress of your cluster using the AWS Management Console, or using the cluster ID with
the list-steps subcommand in the AWS CLI:
aws emr list-steps --cluster-id myClusterID
Look for the script step status. If it is COMPLETED, then the conversion is done and you are ready to
query the data.
5. Create the same table that you created on the EMR cluster.
You can use the same statement as above. Log into Athena and enter the statement in the Query
Editor window:
CREATE EXTERNAL TABLE parquet_hive (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string
) STORED AS PARQUET
LOCATION 's3://myBucket/myParquet/';
Choose Run Query.
6. Run the following query to show that you can query this data:
SELECT * FROM parquet_hive LIMIT 10;
Alternatively, you can select the view (eye) icon next to the table's name in Catalog:
The results should show output similar to this:
Querying Data in Amazon Athena Tables
Topics
Query Results (p. 59)
Viewing Query History (p. 60)
Querying Arrays (p. 61)
Querying Arrays with ROWS and STRUCTS (p. 70)
Querying Arrays with Maps (p. 75)
Querying JSON (p. 76)
Query Results
Athena stores query results in Amazon S3.
Each query that you run has:
A results file stored automatically in a CSV format (*.csv), and
A metadata file (*.csv.metadata) that includes header information, such as column type.
If necessary, you can access the result files to work with them. Athena stores query results in this Amazon
S3 bucket by default: aws-athena-query-results-<ACCOUNTID>-<REGION>.
To view or change the default location for saving query results, choose Settings in the upper right pane.
Note
You can delete metadata files (*.csv.metadata) without causing errors, but important
information about the query is lost.
Query results are saved in an Amazon S3 location based on the name of the query and the date the
query ran, as follows:
{QueryLocation}/{QueryName|Unsaved}/{yyyy}/{mm}/{dd}/{QueryID}/
In this notation:
QueryLocation is the base location for all query results. To view or change this location, choose
Settings. You can enter a new value for Query result location at any time. You can also choose to
encrypt query results in Amazon S3. For more information, see Configuring Encryption Options (p. 41).
QueryName is the name of the query for which the results are saved. If the query wasn't saved,
Unsaved appears. To see a list of queries and examine their SQL statements, choose Saved queries.
yyyy/mm/dd/ is the date the query ran.
QueryID is the unique ID of the query.
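For example, with the default settings, the files for an unsaved query that ran on February 2, 2018 would be stored under a path of the following form (the account ID shown is a hypothetical placeholder, and QueryID is the ID assigned to that query):
s3://aws-athena-query-results-123456789012-us-east-1/Unsaved/2018/02/02/<QueryID>/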
Saving Query Results
After you run the query, the results appear in the Results pane.
To save the results of the most recent query to CSV, choose the file icon.
To save the results of a query you ran previously, choose History, locate your query, and use Download
Results.
Viewing Query History
To view your recent query history, use History. Athena retains query history for 45 days.
Note
Starting on December 1, 2017, Athena retains query history for 45 days.
To retain query history for a longer period, write a program (for example, in Java) that uses methods from the Athena API and the
AWS CLI to periodically retrieve the query history and save it to a data store:
1. Retrieve the query IDs with ListQueryExecutions.
2. Retrieve information about each query based on its ID with GetQueryExecution.
3. Save the obtained information in a data store, such as Amazon S3, using the Amazon S3 put-object AWS CLI
command.
Viewing Query History
1. To view a query in your history for up to 45 days after it ran, choose History and select a query.
You can also see which queries succeeded and failed, download their results, and view query IDs, by
clicking the status value.
Querying Arrays
Amazon Athena lets you create arrays, concatenate them, convert them to different data types, and then
filter, flatten, and sort them.
Topics
Creating Arrays (p. 61)
Concatenating Arrays (p. 62)
Converting Array Data Types (p. 63)
Finding Lengths (p. 64)
Accessing Array Elements (p. 64)
Flattening Nested Arrays (p. 65)
Creating Arrays from Subqueries (p. 67)
Filtering Arrays (p. 68)
Sorting Arrays (p. 69)
Using Aggregation Functions with Arrays (p. 69)
Converting Arrays to Strings (p. 70)
Creating Arrays
To build an array literal in Athena, use the ARRAY keyword, followed by brackets [ ], and include the
array elements separated by commas.
Examples
This query creates one array with four elements.
SELECT ARRAY [1,2,3,4] AS items
It returns:
+-----------+
| items |
+-----------+
| [1,2,3,4] |
+-----------+
This query creates two arrays.
SELECT ARRAY[ ARRAY[1,2], ARRAY[3,4] ] AS items
It returns:
+--------------------+
| items |
+--------------------+
| [[1, 2], [3, 4]] |
+--------------------+
To create an array from selected columns of compatible types, use a query, as in this example:
WITH
dataset AS (
SELECT 1 AS x, 2 AS y, 3 AS z
)
SELECT ARRAY [x,y,z] AS items FROM dataset
This query returns:
+-----------+
| items |
+-----------+
| [1,2,3] |
+-----------+
In the following example, two arrays are selected and returned as a welcome message.
WITH
dataset AS (
SELECT
ARRAY ['hello', 'amazon', 'athena'] AS words,
ARRAY ['hi', 'alexa'] AS alexa
)
SELECT ARRAY[words, alexa] AS welcome_msg
FROM dataset
This query returns:
+----------------------------------------+
| welcome_msg |
+----------------------------------------+
| [[hello, amazon, athena], [hi, alexa]] |
+----------------------------------------+
To create an array of key-value pairs, use the MAP operator that takes an array of keys followed by an
array of values, as in this example:
SELECT ARRAY[
MAP(ARRAY['first', 'last', 'age'],ARRAY['Bob', 'Smith', '40']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Jane', 'Doe', '30']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Billy', 'Smith', '8'])
] AS people
This query returns:
+------------------------------------------------------------------------------------------------------+
| people                                                                                                 |
+------------------------------------------------------------------------------------------------------+
| [{last=Smith, first=Bob, age=40}, {last=Doe, first=Jane, age=30}, {last=Smith, first=Billy, age=8}]    |
+------------------------------------------------------------------------------------------------------+
Concatenating Arrays
To concatenate multiple arrays, use the double pipe || operator between them.
SELECT ARRAY [4,5] || ARRAY[ ARRAY[1,2], ARRAY[3,4] ] AS items
This query returns:
+--------------------------+
| items |
+--------------------------+
| [[4, 5], [1, 2], [3, 4]] |
+--------------------------+
To combine multiple arrays into a single array, use the concat function.
WITH
dataset AS (
SELECT
ARRAY ['hello', 'amazon', 'athena'] AS words,
ARRAY ['hi', 'alexa'] AS alexa
)
SELECT concat(words, alexa) AS welcome_msg
FROM dataset
This query returns:
+------------------------------------+
| welcome_msg |
+------------------------------------+
| [hello, amazon, athena, hi, alexa] |
+------------------------------------+
Converting Array Data Types
To convert data in arrays to supported data types, use the CAST operator, as CAST(value AS type).
Athena supports all of the native Presto data types.
SELECT
ARRAY [CAST(4 AS VARCHAR), CAST(5 AS VARCHAR)]
AS items
This query returns:
+-------+
| items |
+-------+
| [4,5] |
+-------+
Create two arrays with key-value pair elements, convert them to JSON, and concatenate, as in this
example:
SELECT
ARRAY[CAST(MAP(ARRAY['a1', 'a2', 'a3'], ARRAY[1, 2, 3]) AS JSON)] ||
ARRAY[CAST(MAP(ARRAY['b1', 'b2', 'b3'], ARRAY[4, 5, 6]) AS JSON)]
AS items
This query returns:
+--------------------------------------------------+
| items |
+--------------------------------------------------+
| [{"a1":1,"a2":2,"a3":3}, {"b1":4,"b2":5,"b3":6}] |
+--------------------------------------------------+
Finding Lengths
The cardinality function returns the length of an array, as in this example:
SELECT cardinality(ARRAY[1,2,3,4]) AS item_count
This query returns:
+------------+
| item_count |
+------------+
| 4 |
+------------+
Accessing Array Elements
To access array elements, use the [] operator, with 1 specifying the first element, 2 specifying the
second element, and so on, as in this example:
WITH dataset AS (
SELECT
ARRAY[CAST(MAP(ARRAY['a1', 'a2', 'a3'], ARRAY[1, 2, 3]) AS JSON)] ||
ARRAY[CAST(MAP(ARRAY['b1', 'b2', 'b3'], ARRAY[4, 5, 6]) AS JSON)]
AS items )
SELECT items[1] AS item FROM dataset
This query returns:
+------------------------+
| item |
+------------------------+
| {"a1":1,"a2":2,"a3":3} |
+------------------------+
To access the elements of an array at a given position (known as the index position), use the
element_at() function and specify the array name and the index position:
If the index is greater than 0, element_at() returns the element that you specify, counting from the
beginning to the end of the array. It behaves as the [] operator.
If the index is less than 0, element_at() returns the element counting from the end to the beginning
of the array.
The following query creates an array words, and selects the first element hello from it as the
first_word, the second element amazon (counting from the end of the array) as the middle_word,
and the third element athena, as the last_word.
WITH dataset AS (
SELECT ARRAY ['hello', 'amazon', 'athena'] AS words
)
SELECT
element_at(words, 1) AS first_word,
element_at(words, -2) AS middle_word,
element_at(words, cardinality(words)) AS last_word
FROM dataset
This query returns:
+----------------------------------------+
| first_word | middle_word | last_word |
+----------------------------------------+
| hello | amazon | athena |
+----------------------------------------+
Flattening Nested Arrays
When working with nested arrays, you often need to expand nested array elements into a single array, or
expand the array into multiple rows.
Examples
To flatten a nested array's elements into a single array of values, use the flatten function. This query
returns a single array that contains all of the elements of the nested arrays.
SELECT flatten(ARRAY[ ARRAY[1,2], ARRAY[3,4] ]) AS items
This query returns:
+-----------+
| items |
+-----------+
| [1,2,3,4] |
+-----------+
To flatten an array into multiple rows, use CROSS JOIN in conjunction with the UNNEST operator, as in
this example:
WITH dataset AS (
SELECT
'engineering' as department,
ARRAY['Sharon', 'John', 'Bob', 'Sally'] as users
)
SELECT department, names FROM dataset
CROSS JOIN UNNEST(users) as t(names)
This query returns:
+------------------------+
| department  | names    |
+------------------------+
| engineering | Sharon   |
| engineering | John     |
| engineering | Bob      |
| engineering | Sally    |
+------------------------+
To flatten an array of key-value pairs, transpose selected keys into columns, as in this example:
WITH
dataset AS (
SELECT
'engineering' as department,
ARRAY[
MAP(ARRAY['first', 'last', 'age'],ARRAY['Bob', 'Smith', '40']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Jane', 'Doe', '30']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Billy', 'Smith', '8'])
] AS people
)
SELECT names['first'] AS
first_name,
names['last'] AS last_name,
department FROM dataset
CROSS JOIN UNNEST(people) AS t(names)
This query returns:
+--------------------------------------+
| first_name | last_name | department |
+--------------------------------------+
| Bob | Smith | engineering |
| Jane | Doe | engineering |
| Billy | Smith | engineering |
+--------------------------------------+
From a list of employees, select the employee with the highest combined scores. UNNEST can be used
in the FROM clause without a preceding CROSS JOIN as it is the default join operator and therefore
implied.
WITH
dataset AS (
SELECT ARRAY[
CAST(ROW('Sally', 'engineering', ARRAY[1,2,3,4]) AS ROW(name VARCHAR, department
VARCHAR, scores ARRAY(INTEGER))),
CAST(ROW('John', 'finance', ARRAY[7,8,9]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER))),
CAST(ROW('Amy', 'devops', ARRAY[12,13,14,15]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER)))
] AS users
),
users AS (
SELECT person, score
FROM
dataset,
UNNEST(dataset.users) AS t(person),
UNNEST(person.scores) AS t(score)
)
SELECT person.name, person.department, SUM(score) AS total_score FROM users
GROUP BY (person.name, person.department)
ORDER BY (total_score) DESC
LIMIT 1
This query returns:
+---------------------------------+
| name | department | total_score |
+---------------------------------+
| Amy | devops | 54 |
+---------------------------------+
From a list of employees, select the employee with the highest individual score.
WITH
dataset AS (
SELECT ARRAY[
CAST(ROW('Sally', 'engineering', ARRAY[1,2,3,4]) AS ROW(name VARCHAR, department
VARCHAR, scores ARRAY(INTEGER))),
CAST(ROW('John', 'finance', ARRAY[7,8,9]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER))),
CAST(ROW('Amy', 'devops', ARRAY[12,13,14,15]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER)))
] AS users
),
users AS (
SELECT person, score
FROM
dataset,
UNNEST(dataset.users) AS t(person),
UNNEST(person.scores) AS t(score)
)
SELECT person.name, score FROM users
ORDER BY (score) DESC
LIMIT 1
This query returns:
+--------------+
| name | score |
+--------------+
| Amy | 15 |
+--------------+
Creating Arrays from Subqueries
Create an array from a collection of rows.
WITH
dataset AS (
SELECT ARRAY[1,2,3,4,5] AS items
)
SELECT array_agg(i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
This query returns:
+-----------------+
| array_items |
+-----------------+
| [1, 2, 3, 4, 5] |
+-----------------+
To create an array of unique values from a set of rows, use the distinct keyword.
WITH
dataset AS (
SELECT ARRAY [1,2,2,3,3,4,5] AS items
)
SELECT array_agg(distinct i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
This query returns the following result. Note that ordering is not guaranteed.
+-----------------+
| array_items |
+-----------------+
| [1, 2, 3, 4, 5] |
+-----------------+
Filtering Arrays
Create an array from a collection of rows if they match the filter criteria.
WITH
dataset AS (
SELECT ARRAY[1,2,3,4,5] AS items
)
SELECT array_agg(i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
WHERE i > 3
This query returns:
+-------------+
| array_items |
+-------------+
| [4, 5] |
+-------------+
Filter an array based on whether one of its elements contains a specific value, such as 2, as in this
example:
WITH
dataset AS (
SELECT ARRAY
[
ARRAY[1,2,3,4],
ARRAY[5,6,7,8],
ARRAY[9,0]
] AS items
)
SELECT i AS array_items FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
WHERE contains(i, 2)
This query returns:
+--------------+
| array_items |
+--------------+
| [1, 2, 3, 4] |
+--------------+
Sorting Arrays
Create a sorted array of unique values from a set of rows.
WITH
dataset AS (
SELECT ARRAY[3,1,2,5,2,3,6,3,4,5] AS items
)
SELECT array_sort(array_agg(distinct i)) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
This query returns:
+--------------------+
| array_items |
+--------------------+
| [1, 2, 3, 4, 5, 6] |
+--------------------+
Using Aggregation Functions with Arrays
To add values within an array, use SUM, as in the following example.
To aggregate multiple rows within an array, use array_agg. For information, see Creating Arrays from
Subqueries (p. 67).
WITH
dataset AS (
SELECT ARRAY
[
ARRAY[1,2,3,4],
ARRAY[5,6,7,8],
ARRAY[9,0]
] AS items
),
item AS (
SELECT i AS array_items
FROM dataset, UNNEST(items) AS t(i)
)
SELECT array_items, sum(val) AS total
FROM item, UNNEST(array_items) AS t(val)
GROUP BY array_items
This query returns the following results. The order of returned results is not guaranteed.
+----------------------+
| array_items | total |
+----------------------+
| [1, 2, 3, 4] | 10 |
| [5, 6, 7, 8] | 26 |
| [9, 0] | 9 |
+----------------------+
Converting Arrays to Strings
To convert an array into a single string, use the array_join function.
WITH
dataset AS (
SELECT ARRAY ['hello', 'amazon', 'athena'] AS words
)
SELECT array_join(words, ' ') AS welcome_msg
FROM dataset
This query returns:
+---------------------+
| welcome_msg |
+---------------------+
| hello amazon athena |
+---------------------+
Querying Arrays with ROWS and STRUCTS
Your source data often contains arrays with complex data types and nested structures. Examples in this
section show how to change an element's data type, locate elements within arrays, order values, and find
keywords using Athena queries.
Creating a ROW (p. 70)
Changing Field Names in Arrays Using CAST (p. 71)
Filtering Arrays Using the . Notation (p. 71)
Filtering Arrays with Nested Values (p. 72)
Filtering Arrays Using UNNEST (p. 73)
Finding Keywords in Arrays (p. 73)
Ordering Values in Arrays (p. 74)
Creating a ROW
Note
The examples in this section use ROW as a means to create sample data to work with. When
you query tables within Athena, you do not need to create ROW data types, as they are already
created from your data source by Presto. When you use CREATE TABLE, Athena defines a
STRUCT in it, and relies on Presto for populating it with data. In turn, Presto creates the ROW
data type for you, for each row in the dataset. The underlying ROW data type consists of named
fields of any SQL data types supported in Presto.
WITH dataset AS (
SELECT
ROW('Bob', 38) AS users
)
SELECT * FROM dataset
This query returns:
+-------------------------+
| users |
+-------------------------+
| {field0=Bob, field1=38} |
+-------------------------+
Changing Field Names in Arrays Using CAST
To change the field name in an array that contains ROW values, you can CAST the ROW declaration:
WITH dataset AS (
SELECT
CAST(
ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)
) AS users
)
SELECT * FROM dataset
This query returns:
+--------------------+
| users |
+--------------------+
| {NAME=Bob, AGE=38} |
+--------------------+
Note
In the example above, you declare name as a VARCHAR because this is its type in Presto. If you
declare this STRUCT inside a CREATE TABLE statement, use String type because Hive defines
this data type as String.
Filtering Arrays Using the . Notation
In the following example, select the accountId field from the userIdentity column of an AWS
CloudTrail logs table by using the dot . notation. For more information, see Querying AWS CloudTrail
Logs (p. 94).
SELECT
CAST(useridentity.accountid AS bigint) as newid
FROM cloudtrail_logs
LIMIT 2;
This query returns:
+--------------+
| newid |
+--------------+
| 112233445566 |
+--------------+
| 998877665544 |
+--------------+
To query an array of values, issue this query:
WITH dataset AS (
SELECT ARRAY[
CAST(ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Alice', 35) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Jane', 27) AS ROW(name VARCHAR, age INTEGER))
] AS users
)
SELECT * FROM dataset
It returns this result:
+-----------------------------------------------------------------+
| users |
+-----------------------------------------------------------------+
| [{NAME=Bob, AGE=38}, {NAME=Alice, AGE=35}, {NAME=Jane, AGE=27}] |
+-----------------------------------------------------------------+
Filtering Arrays with Nested Values
Large arrays often contain nested structures, and you need to be able to filter, or search, for values
within them.
To define a dataset for an array of values that includes a nested BOOLEAN value, issue this query:
WITH dataset AS (
SELECT
CAST(
ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
) AS sites
)
SELECT * FROM dataset
It returns this result:
+----------------------------------------------------------+
| sites |
+----------------------------------------------------------+
| {HOSTNAME=aws.amazon.com, FLAGGEDACTIVITY={ISNEW=true}} |
+----------------------------------------------------------+
Next, to filter and access the BOOLEAN value of that element, continue to use the dot . notation.
WITH dataset AS (
SELECT
CAST(
ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
) AS sites
)
SELECT sites.hostname, sites.flaggedactivity.isnew
FROM dataset
This query selects the nested fields and returns this result:
+------------------------+
| hostname | isnew |
+------------------------+
| aws.amazon.com | true |
+------------------------+
Filtering Arrays Using UNNEST
To filter an array that includes a nested structure by one of its child elements, issue a query with an
UNNEST operator. For more information about UNNEST, see Flattening Nested Arrays (p. 65).
For example, this query finds hostnames of sites in the dataset.
WITH dataset AS (
SELECT ARRAY[
CAST(
ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
),
CAST(
ROW('news.cnn.com', ROW(false)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
),
CAST(
ROW('netflix.com', ROW(false)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
)
] as items
)
SELECT sites.hostname, sites.flaggedActivity.isNew
FROM dataset, UNNEST(items) t(sites)
WHERE sites.flaggedActivity.isNew = true
It returns:
+------------------------+
| hostname | isnew |
+------------------------+
| aws.amazon.com | true |
+------------------------+
Finding Keywords in Arrays
To search a dataset for a keyword within an element inside an array, use the regexp_like function.
Consider an array of sites containing their hostname, and a flaggedActivity element. This element
includes an ARRAY, containing several MAP elements, each listing different popular keywords and their
popularity count. Assume you want to find a particular keyword inside a MAP in this array.
To search this dataset for sites with a specific keyword, use the regexp_like function.
Note
Instead of using the SQL LIKE operator, this query uses the regexp_like function. This
function takes as an input a regular expression pattern to evaluate, or a list of terms separated
by |, known as an OR operator. Additionally, searching for a large number of keywords is more
efficient with the regexp_like function, because its performance exceeds that of the SQL
LIKE operator.
WITH dataset AS (
SELECT ARRAY[
CAST(
ROW('aws.amazon.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']),
MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '74'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('news.cnn.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['politics', '241']),
MAP(ARRAY['term', 'count'], ARRAY['technology', '211']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '170'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('netflix.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']),
MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']),
MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '4'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
)
] AS items
),
sites AS (
SELECT sites.hostname, sites.flaggedactivity
FROM dataset, UNNEST(items) t(sites)
)
SELECT hostname
FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags)
WHERE regexp_like(flags['term'], 'politics|bigdata')
GROUP BY (hostname)
This query returns two sites:
+----------------+
| hostname |
+----------------+
| aws.amazon.com |
+----------------+
| news.cnn.com |
+----------------+
Ordering Values in Arrays
To order values in an array, use ORDER BY. The query in the following example adds up the total
popularity scores for the sites matching your search terms with the regexp_like function, and then
orders them from highest to lowest.
WITH dataset AS (
SELECT ARRAY[
CAST(
ROW('aws.amazon.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']),
MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '74'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('news.cnn.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['politics', '241']),
MAP(ARRAY['term', 'count'], ARRAY['technology', '211']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '170'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('netflix.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']),
MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']),
MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '4'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
)
] AS items
),
sites AS (
SELECT sites.hostname, sites.flaggedactivity
FROM dataset, UNNEST(items) t(sites)
)
SELECT hostname, array_agg(flags['term']) AS terms, SUM(CAST(flags['count'] AS INTEGER)) AS
total
FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags)
WHERE regexp_like(flags['term'], 'politics|bigdata')
GROUP BY (hostname)
ORDER BY total DESC
This query returns this result:
+------------------------------------------------+
| hostname | terms | total |
+------------------------------------------------+
| news.cnn.com | [politics,serverless] | 241 |
+------------------------------------------------+
| aws.amazon.com | [serverless] | 10 |
+------------------------------------------------+
Querying Arrays with Maps
Maps are key-value pairs that consist of data types available in Athena.
To create maps, use the MAP operator and pass it two arrays: the first is the column (key) names, and
the second is values. All values in the arrays must be of the same type. If any of the map value array
elements need to be of different types, you can convert them later.
Examples
This example selects a user from a dataset. It uses the MAP operator and passes it two arrays. The first
array includes values for column names, such as "first", "last", and "age". The second array consists of
values for each of these columns, such as "Bob", "Smith", "35".
WITH dataset AS (
SELECT MAP(
ARRAY['first', 'last', 'age'],
ARRAY['Bob', 'Smith', '35']
) AS user
)
SELECT user FROM dataset
This query returns:
+---------------------------------+
| user |
+---------------------------------+
| {last=Smith, first=Bob, age=35} |
+---------------------------------+
You can retrieve Map values by selecting the field name followed by [key_name], as in this example:
WITH dataset AS (
SELECT MAP(
ARRAY['first', 'last', 'age'],
ARRAY['Bob', 'Smith', '35']
) AS user
)
SELECT user['first'] AS first_name FROM dataset
This query returns:
+------------+
| first_name |
+------------+
| Bob |
+------------+
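As noted earlier in this topic, all of the map values in this example are strings, but you can convert them later. The following sketch, using the same hypothetical user map, casts the age value to an integer:
WITH dataset AS (
  SELECT MAP(
    ARRAY['first', 'last', 'age'],
    ARRAY['Bob', 'Smith', '35']
  ) AS user
)
-- convert the string value stored under the 'age' key to an integer
SELECT CAST(user['age'] AS INTEGER) AS age FROM dataset
This query returns 35 as an INTEGER value rather than as a string.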
Querying JSON
Amazon Athena lets you parse JSON-encoded values, extract data from JSON, search for values, and find
length and size of JSON arrays.
Topics
Tips for Parsing Arrays with JSON (p. 76)
Extracting Data from JSON (p. 78)
Searching for Values (p. 80)
Obtaining Length and Size of JSON Arrays (p. 81)
Tips for Parsing Arrays with JSON
JavaScript Object Notation (JSON) is a common method for encoding data structures as text. Many
applications and tools output data that is JSON-encoded.
In Amazon Athena, you can create tables from external data and include the JSON-encoded data
in them. For such types of source data, use Athena together with one of the JSON SerDe Libraries (p. 117).
When Athena creates tables backed by JSON data, it parses the data based on the existing and
predefined schema.
However, not all of your data may have a predefined schema. To simplify schema management in such
cases, it is often useful to convert fields in source data that have an undetermined schema to JSON
strings in Athena, and then use JSON SerDe Libraries (p. 117).
For example, consider an IoT application that publishes events with common fields from different
sensors. One of those fields must store a custom payload that is unique to the sensor sending the event.
In this case, since you don't know the schema, we recommend that you store the information as a JSON-
encoded string. To do this, convert data in your Athena table to JSON, as in the following example. You
can also convert JSON-encoded data to Athena data types.
Converting Athena Data Types to JSON (p. 77)
Converting JSON to Athena Data Types (p. 77)
Converting Athena Data Types to JSON
To convert Athena data types to JSON, use CAST.
WITH dataset AS (
SELECT
CAST('HELLO ATHENA' AS JSON) AS hello_msg,
CAST(12345 AS JSON) AS some_int,
CAST(MAP(ARRAY['a', 'b'], ARRAY[1,2]) AS JSON) AS some_map
)
SELECT * FROM dataset
This query returns:
+-------------------------------------------+
| hello_msg | some_int | some_map |
+-------------------------------------------+
| "HELLO ATHENA" | 12345 | {"a":1,"b":2} |
+-------------------------------------------+
Converting JSON to Athena Data Types
To convert JSON data to Athena data types, use CAST.
Note
In this example, to denote strings as JSON-encoded, start with the JSON keyword and use single
quotes, such as JSON '12345'
WITH dataset AS (
SELECT
CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
CAST(JSON '12345' AS INTEGER) AS some_int,
CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
)
SELECT * FROM dataset
This query returns:
+-------------------------------------+
| hello_msg | some_int | some_map |
+-------------------------------------+
| HELLO ATHENA | 12345 | {a:1,b:2} |
+-------------------------------------+
Extracting Data from JSON
You may have source data containing JSON-encoded strings that you do not necessarily want to
deserialize into a table in Athena. In this case, you can still run SQL operations on this data, using the
JSON functions available in Presto.
Consider this JSON string as an example dataset.
{"name": "Susan Smith",
"org": "engineering",
"projects":
[
{"name":"project1", "completed":false},
{"name":"project2", "completed":true}
]
}
Examples: extracting properties
To extract the name and projects properties from the JSON string, use the json_extract function
as in the following example. The json_extract function takes the column containing the JSON string,
and searches it using a JSONPath-like expression with the dot . notation.
Note
JSONPath performs a simple tree traversal. It uses the $ sign to denote the root of the JSON
document, followed by a period and an element nested directly under the root, such as $.name.
WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},
{"name":"project2", "completed":true}]}'
AS blob
)
SELECT
json_extract(blob, '$.name') AS name,
json_extract(blob, '$.projects') AS projects
FROM dataset
The returned value is a JSON-encoded string, and not a native Athena data type.
+----------------+--------------------------------------------------------------------------------+
| name           | projects                                                                       |
+----------------+--------------------------------------------------------------------------------+
| "Susan Smith"  | [{"name":"project1","completed":false}, {"name":"project2","completed":true}] |
+----------------+--------------------------------------------------------------------------------+
To extract the scalar value from the JSON string, use the json_extract_scalar function. It is similar
to json_extract, but returns only scalar values (Boolean, number, or string).
Note
Do not use the json_extract_scalar function on arrays, maps, or structs.
WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT
json_extract_scalar(blob, '$.name') AS name,
json_extract_scalar(blob, '$.projects') AS projects
FROM dataset
This query returns:
+---------------------------+
| name | projects |
+---------------------------+
| Susan Smith | |
+---------------------------+
To obtain the first element of the projects property in the example array, use the json_array_get
function and specify the index position.
WITH dataset AS (
SELECT '{"name": "Bob Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT json_array_get(json_extract(blob, '$.projects'), 0) AS item
FROM dataset
It returns the value at the specified index position in the JSON-encoded array.
+---------------------------------------+
| item |
+---------------------------------------+
| {"name":"project1","completed":false} |
+---------------------------------------+
To return an Athena string type, use the [] operator inside a JSONPath expression, then use
the json_extract_scalar function. For more information about [], see Accessing Array
Elements (p. 64).
WITH dataset AS (
SELECT '{"name": "Bob Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT json_extract_scalar(blob, '$.projects[0].name') AS project_name
FROM dataset
It returns this result:
+--------------+
| project_name |
+--------------+
| project1 |
+--------------+
Searching for Values
To determine if a specific value exists inside a JSON-encoded array, use the json_array_contains
function.
The following query lists the names of the users who are participating in "project2".
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
(JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1",
"project2", "project3"]}'),
(JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
) AS t (users)
)
SELECT json_extract_scalar(users, '$.name') AS user
FROM dataset
WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')
This query returns a list of users.
+-------------+
| user |
+-------------+
| Susan Smith |
+-------------+
| Jane Smith |
+-------------+
The following query example lists the names of users who have completed projects along with the total
number of completed projects. It performs these actions:
Uses nested SELECT statements for clarity.
Extracts the array of projects.
Converts the array to a native array of key-value pairs using CAST.
Extracts each individual array element using the UNNEST operator.
Filters obtained values by completed projects and counts them.
Note
When using CAST to MAP you can specify the key element as VARCHAR (native String in Presto),
but leave the value as JSON, because the values in the MAP are of different types: String for the
first key-value pair, and Boolean for the second.
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name": "Bob Smith",
"org": "legal",
"projects": [{"name":"project1", "completed":false}]}'),
(JSON '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project2", "completed":true},
{"name":"project3", "completed":true}]}'),
(JSON '{"name": "Jane Smith",
"org": "finance",
"projects": [{"name":"project2", "completed":true}]}')
) AS t (users)
),
employees AS (
SELECT users, CAST(json_extract(users, '$.projects') AS
ARRAY(MAP(VARCHAR, JSON))) AS projects_array
FROM dataset
),
names AS (
SELECT json_extract_scalar(users, '$.name') AS name, projects
FROM employees, UNNEST (projects_array) AS t(projects)
)
SELECT name, count(projects) AS completed_projects FROM names
WHERE cast(element_at(projects, 'completed') AS BOOLEAN) = true
GROUP BY name
This query returns the following result:
+----------------------------------+
| name | completed_projects |
+----------------------------------+
| Susan Smith | 2 |
+----------------------------------+
| Jane Smith | 1 |
+----------------------------------+
Obtaining Length and Size of JSON Arrays
Example: json_array_length
To obtain the length of a JSON-encoded array, use the json_array_length function.
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name":
"Bob Smith",
"org":
"legal",
"projects": [{"name":"project1", "completed":false}]}'),
(JSON '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project2", "completed":true},
{"name":"project3", "completed":true}]}'),
(JSON '{"name": "Jane Smith",
"org": "finance",
"projects": [{"name":"project2", "completed":true}]}')
) AS t (users)
)
SELECT
json_extract_scalar(users, '$.name') as name,
json_array_length(json_extract(users, '$.projects')) as count
FROM dataset
ORDER BY count DESC
This query returns this result:
+---------------------+
| name | count |
+---------------------+
| Susan Smith | 2 |
+---------------------+
| Bob Smith | 1 |
+---------------------+
| Jane Smith | 1 |
+---------------------+
Example: json_size
To obtain the size of a JSON-encoded array or object, use the json_size function, and specify the
column containing the JSON string and the JSONPath expression to the array or object.
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name": "Bob Smith", "org": "legal", "projects": [{"name":"project1",
"completed":false}]}'),
(JSON '{"name": "Susan Smith", "org": "engineering", "projects": [{"name":"project2",
"completed":true},{"name":"project3", "completed":true}]}'),
(JSON '{"name": "Jane Smith", "org": "finance", "projects": [{"name":"project2",
"completed":true}]}')
) AS t (users)
)
SELECT
json_extract_scalar(users, '$.name') as name,
json_size(users, '$.projects') as count
FROM dataset
ORDER BY count DESC
This query returns this result:
+---------------------+
| name | count |
+---------------------+
| Susan Smith | 2 |
+---------------------+
| Bob Smith | 1 |
+---------------------+
| Jane Smith | 1 |
+---------------------+
Querying Geospatial Data
Geospatial data contains identifiers that specify a geographic position for an object. Examples of
this type of data include weather reports, map directions, tweets with geographic positions, store
locations, and airline routes. Geospatial data plays an important role in business analytics, reporting, and
forecasting.
Geospatial identifiers, such as latitude and longitude, allow you to convert any mailing address into a set
of geographic coordinates.
Topics
What is a Geospatial Query? (p. 83)
Input Data Formats and Geometry Data Types (p. 83)
List of Supported Geospatial Functions (p. 84)
Examples: Geospatial Queries (p. 92)
What is a Geospatial Query?
Geospatial queries are specialized types of SQL queries supported in Athena. They differ from non-
spatial SQL queries in the following ways:
Using the following specialized geometry data types: point, line, multiline, polygon, and
multipolygon.
Expressing relationships between geometry data types, such as distance, equals, crosses,
touches, overlaps, disjoint, and others.
Using geospatial queries in Athena, you can run these and other similar operations:
Find the distance between two points.
Check whether one area (polygon) contains another.
Check whether one line crosses or touches another line or polygon.
For example, to obtain the coordinates of Mount Rainier as a pair of double type values in Athena, use the
ST_POINT(double, double) geospatial function, specifying the latitude and longitude:
ST_POINT(46.8527, -121.7602)
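As a further sketch, you can pass two such points to one of the geospatial functions described later in this chapter. The following query assumes the ST_DISTANCE operation function and a second, hypothetical coordinate pair, and returns the distance between the two points:
SELECT ST_DISTANCE(
  ST_POINT(46.8527, -121.7602),
  -- hypothetical second point, used only for illustration
  ST_POINT(47.6062, -122.3321)
)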
Input Data Formats and Geometry Data Types
To use geospatial functions in Athena, input your data in the WKT or WKB formats, or use the Hive JSON
SerDe. You can also use the geometry data types supported in Athena.
Input Data Formats
To handle geospatial queries, Athena supports input data in these data formats:
WKT (Well-known Text). In Athena, WKT is represented as a varchar data type.
WKB (Well-known binary). In Athena, WKB is represented as a varbinary data type with a spatial
reference ID (WKID) 4326. For information about both of these types, see the Wikipedia topic on Well-
known text.
JSON-encoded geospatial data. To parse JSON files with geospatial data and create tables for them,
Athena uses the Hive JSON SerDe. For more information about using this SerDe in Athena, see JSON
SerDe Libraries (p. 117).
Geometry Data Types
To handle geospatial queries, Athena supports these specialized geometry data types:
point
line
polygon
multiline
multipolygon
List of Supported Geospatial Functions
Geospatial functions in Athena have these characteristics:
The functions follow the general principles of Spatial Query.
The functions are implemented as a Presto plugin that uses the ESRI Java Geometry Library. This
library has an Apache 2 license.
The functions rely on the ESRI Geometry API.
Not all of the ESRI-supported functions are available in Athena. This topic lists only the ESRI
geospatial functions that are supported in Athena.
Athena supports four types of geospatial functions:
Constructor Functions (p. 85)
Geospatial Relationship Functions (p. 86)
Operation Functions (p. 88)
Accessor Functions (p. 89)
Before You Begin
Create two tables, earthquakes and counties, as follows:
CREATE external TABLE earthquakes
(
earthquake_date STRING,
latitude DOUBLE,
longitude DOUBLE,
depth DOUBLE,
magnitude DOUBLE,
magtype string,
mbstations string,
gap string,
distance string,
rms string,
source string,
eventid string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 's3://my-query-log/csv'
CREATE external TABLE IF NOT EXISTS counties
(
Name string,
BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-query-log/json'
Some of the subsequent examples are based on these tables and rely on two sample files stored in an
Amazon S3 location. These files are not included with Athena and are used for illustration purposes only:
An earthquakes.csv file, which lists earthquakes that occurred in California. This file has fields that
correspond to the fields in the table earthquakes.
A california-counties.json file, which lists JSON-encoded county data in the ESRI-compliant
format, and includes many fields, such as AREA, PERIMETER, STATE, COUNTY, and NAME. The
counties table is based on this file and has two fields only: Name (string), and BoundaryShape
(binary).
Constructor Functions
Use constructor geospatial functions to obtain binary representations of a point, line, or polygon.
You can also convert a particular geometry data type to text, and obtain a binary representation of a
geometry data type from text (WKT).
ST_POINT(double, double)
Returns a value in the point data type, which is a binary representation of the geometry data type
point.
Syntax:
SELECT ST_POINT(longitude, latitude)
FROM earthquakes
LIMIT 1;
In the alternative syntax, you can also specify the point as WKT text that contains the two coordinate values:
SELECT ST_POINT('point (0 0)')
FROM earthquakes
LIMIT 1;
Example. This example uses specific longitude and latitude coordinates from earthquakes.csv:
SELECT ST_POINT(61.56, -158.54)
FROM earthquakes
LIMIT 1;
It returns this binary representation of a geometry data type point:
00 00 00 00 01 01 00 00 00 48 e1 7a 14 ae c7 4e 40 e1 7a 14 ae 47 d1 63 c0
ST_LINE(varchar)
Returns a value in the line data type, which is a binary representation of the geometry data type line.
Example:
SELECT ST_Line('linestring(1 1, 2 2, 3 3)')
ST_POLYGON(varchar)
Returns a value in the polygon data type, which is a binary representation of the geometry data type
polygon. Example:
SELECT ST_Polygon('polygon ((1 1, 4 1, 1 4))')
ST_GEOMETRY_TO_TEXT (varbinary)
Converts each of the specified geometry data types to text. Returns a value of type varchar, which is a
WKT representation of the geometry data type. Example:
SELECT ST_GEOMETRY_TO_TEXT(ST_POINT(61.56, -158.54))
ST_GEOMETRY_FROM_TEXT (varchar)
Converts text into a geometry data type. Returns a value in a geometry data type, which is a binary
representation of the geometry data type. Example:
SELECT ST_GEOMETRY_FROM_TEXT(ST_GEOMETRY_TO_TEXT(ST_Point(1, 2)))
Geospatial Relationship Functions
The following functions express relationships between two different geometries that you specify as
input. They return results of type boolean. The order in which you specify the pair of geometries
matters: the first geometry value is called the left geometry, and the second geometry value is called
the right geometry.
These functions return:
TRUE if and only if the relationship described by the function is satisfied.
FALSE if and only if the relationship described by the function is not satisfied.
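Because the argument order matters, swapping the two geometries can change the result. The following comparison is a minimal sketch (the coordinate values are illustrative only); the first expression returns TRUE and the second returns FALSE, because a point cannot contain a polygon:
SELECT ST_CONTAINS(ST_POLYGON('polygon((1 1, 1 4, 4 4, 4 1))'), ST_POINT(2, 2)),
       ST_CONTAINS(ST_POINT(2, 2), ST_POLYGON('polygon((1 1, 1 4, 4 4, 4 1))'))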
ST_CONTAINS (geometry, geometry)
Returns TRUE if and only if the left geometry contains the right geometry. Examples:
SELECT ST_CONTAINS('POLYGON((0 2,1 1,0 -1,0 2))', 'POLYGON((-1 3,2 1,0 -3,-1 3))')
SELECT ST_CONTAINS('POLYGON((0 2,1 1,0 -1,0 2))', ST_Point(0, 0));
SELECT ST_CONTAINS(ST_GEOMETRY_FROM_TEXT('POLYGON((0 2,1 1,0 -1,0 2))'),
ST_GEOMETRY_FROM_TEXT('POLYGON((-1 3,2 1,0 -3,-1 3))'))
ST_CROSSES (geometry, geometry)
Returns TRUE if and only if the left geometry crosses the right geometry. Example:
SELECT ST_CROSSES(ST_LINE('linestring(1 1, 2 2 )'), ST_LINE('linestring(0 1, 2 2)'))
ST_DISJOINT (geometry, geometry)
Returns TRUE if and only if the intersection of the left geometry and the right geometry is empty.
Example:
SELECT ST_DISJOINT(ST_LINE('linestring(0 0, 0 1)'), ST_LINE('linestring(1 1, 1 0)'))
ST_EQUALS (geometry, geometry)
Returns TRUE if and only if the left geometry equals the right geometry. Example:
SELECT ST_EQUALS(ST_LINE('linestring( 0 0, 1 1)'), ST_LINE('linestring(1 3, 2 2)'))
ST_INTERSECTS (geometry, geometry)
Returns TRUE if and only if the left geometry intersects the right geometry. Example:
SELECT ST_INTERSECTS(ST_LINE('linestring(8 7, 7 8)'), ST_POLYGON('polygon((1 1, 4 1, 4 4, 1
4))'))
ST_OVERLAPS (geometry, geometry)
Returns TRUE if and only if the left geometry overlaps the right geometry. Example:
SELECT ST_OVERLAPS(ST_POLYGON('polygon((2 0, 2 1, 3 1))'), ST_POLYGON('polygon((1 1, 1 4, 4
4, 4 1))'))
ST_RELATE (geometry, geometry)
Returns TRUE if and only if the left geometry has the specified Dimensionally Extended nine-Intersection
Model (DE-9IM) relationship with the right geometry. For more information, see the Wikipedia topic
DE-9IM. Example:
SELECT ST_RELATE(ST_LINE('linestring(0 0, 3 3)'), ST_LINE('linestring(1 1, 4 4)'),
'T********')
ST_TOUCHES (geometry, geometry)
Returns TRUE if and only if the left geometry touches the right geometry.
Example:
SELECT ST_TOUCHES(ST_POINT(8, 8), ST_POLYGON('polygon((1 1, 1 4, 4 4, 4 1))'))
ST_WITHIN (geometry, geometry)
Returns TRUE if and only if the left geometry is within the right geometry.
Example:
SELECT ST_WITHIN(ST_POINT(8, 8), ST_POLYGON('polygon((1 1, 1 4, 4 4, 4 1))'))
Operation Functions
Use operation functions to perform operations on geometry data type values. For example, you can
obtain the boundaries of a single geometry data type; intersections between two geometry data types;
difference between left and right geometries, where each is of the same geometry data type; or an
exterior buffer or ring around a particular geometry data type.
All operation functions take as an input one of the geometry data types and return their binary
representations.
ST_BOUNDARY (geometry)
Takes as an input one of the geometry data types, and returns a binary representation of the boundary
geometry data type.
Examples:
SELECT ST_BOUNDARY(ST_LINE('linestring(0 1, 1 0)'))
SELECT ST_BOUNDARY(ST_POLYGON('polygon((1 1, 1 4, 4 4, 4 1))'))
ST_BUFFER (geometry, double)
Takes as an input a geometry data type and a distance (as type double). Returns a binary representation
of the geometry data type buffered by the specified distance.
Example:
SELECT ST_BUFFER(ST_Point(1, 2), 2.0)
ST_DIFFERENCE (geometry, geometry)
Returns a binary representation of a difference between the left geometry and right geometry. Example:
SELECT ST_GEOMETRY_TO_TEXT(ST_DIFFERENCE(ST_POLYGON('polygon((0 0, 0 10, 10 10, 10 0))'),
ST_POLYGON('polygon((0 0, 0 5, 5 5, 5 0))')))
ST_ENVELOPE (geometry)
Takes as an input one of the geometry data types and returns a binary representation of an envelope,
where an envelope is a rectangle around the specified geometry data type. Examples:
SELECT ST_ENVELOPE(ST_LINE('linestring(0 1, 1 0)'))
SELECT ST_ENVELOPE(ST_POLYGON('polygon((1 1, 1 4, 4 4, 4 1))'))
ST_EXTERIOR_RING (geometry)
Returns a binary representation of the exterior ring of the input type polygon. Examples:
SELECT ST_EXTERIOR_RING(ST_POLYGON('polygon((1 1, 1 4, 4 1))'))
SELECT ST_EXTERIOR_RING(ST_POLYGON('polygon ((0 0, 8 0, 0 8, 0 0), (1 1, 1 5, 5 1, 1 1))'))
ST_INTERSECTION (geometry, geometry)
Returns a binary representation of the intersection of the left geometry and right geometry. Examples:
SELECT ST_INTERSECTION(ST_POINT(1,1), ST_POINT(1,1))
SELECT ST_INTERSECTION(ST_LINE('linestring(0 1, 1 0)'), ST_POLYGON('polygon((1 1, 1 4, 4
4, 4 1))'))
SELECT ST_GEOMETRY_TO_TEXT(ST_INTERSECTION(ST_POLYGON('polygon((2 0, 2 3, 3 0))'),
ST_POLYGON('polygon((1 1, 4 1, 4 4, 1 4))')))
ST_SYMMETRIC_DIFFERENCE (geometry, geometry)
Returns a binary representation of the geometrically symmetric difference between left geometry and
right geometry. Example:
SELECT ST_GEOMETRY_TO_TEXT(ST_SYMMETRIC_DIFFERENCE(ST_LINE('linestring(0 2, 2 2)'),
ST_LINE('linestring(1 2, 3 2)')))
Accessor Functions
Accessor functions are useful to obtain values in types varchar, bigint, or double from different
geometry data types, where geometry is any of the geometry data types supported in Athena: point,
line, polygon, multiline, and multipolygon. For example, you can obtain the area of a polygon
geometry data type, the maximum and minimum X and Y values for a specified geometry data type, the
length of a line, or the number of points in a specified geometry data type.
ST_AREA (geometry)
Takes as an input a geometry data type polygon and returns an area in type double. Example:
SELECT ST_AREA(ST_POLYGON('polygon((1 1, 4 1, 4 4, 1 4))'))
ST_CENTROID (geometry)
Takes as an input a geometry data type polygon, and returns a point that is the center of the polygon's
envelope in type varchar. Example:
SELECT ST_CENTROID(ST_GEOMETRY_FROM_TEXT('polygon ((0 0, 3 6, 6 0, 0 0))'))
ST_COORDINATE_DIMENSION (geometry)
Takes as input one of the supported geometry types, and returns the count of coordinate components in
type bigint. Example:
SELECT ST_COORDINATE_DIMENSION(ST_POINT(1.5,2.5))
ST_DIMENSION (geometry)
Takes as an input one of the supported geometry types, and returns the spatial dimension of a geometry
in type bigint. Example:
SELECT ST_DIMENSION(ST_POLYGON('polygon((1 1, 4 1, 4 4, 1 4))'))
ST_DISTANCE (geometry, geometry)
Returns the distance in type double between the left geometry and the right geometry. Example:
SELECT ST_DISTANCE(ST_POINT(0.0,0.0), ST_POINT(3.0,4.0))
ST_IS_CLOSED (geometry)
Returns TRUE (type boolean) if and only if the line is closed. Example:
SELECT ST_IS_CLOSED(ST_LINE('linestring(0 2, 2 2)'))
ST_IS_EMPTY (geometry)
Returns TRUE (type boolean) if and only if the specified geometry is empty. Example:
SELECT ST_IS_EMPTY(ST_POINT(1.5, 2.5))
ST_IS_RING (geometry)
Returns TRUE (type boolean) if and only if the line type is closed and simple. Example:
SELECT ST_IS_RING(ST_LINE('linestring(0 2, 2 2)'))
ST_LENGTH (geometry)
Returns the length of line in type double. Example:
SELECT ST_LENGTH(ST_LINE('linestring(0 2, 2 2)'))
ST_MAX_X (geometry)
Returns the maximum X coordinate of a geometry in type double. Example:
SELECT ST_MAX_X(ST_LINE('linestring(0 2, 2 2)'))
ST_MAX_Y (geometry)
Returns the maximum Y coordinate of a geometry in type double. Example:
SELECT ST_MAX_Y(ST_LINE('linestring(0 2, 2 2)'))
ST_MIN_X (geometry)
Returns the minimum X coordinate of a geometry in type double. Example:
SELECT ST_MIN_X(ST_LINE('linestring(0 2, 2 2)'))
ST_MIN_Y (geometry)
Returns the minimum Y coordinate of a geometry in type double. Example:
SELECT ST_MIN_Y(ST_LINE('linestring(0 2, 2 2)'))
ST_START_POINT (geometry)
Returns the first point of a line geometry data type in type point. Example:
SELECT ST_START_POINT(ST_LINE('linestring(0 2, 2 2)'))
ST_END_POINT (geometry)
Returns the last point of a line geometry data type in type point. Example:
SELECT ST_END_POINT(ST_LINE('linestring(0 2, 2 2)'))
ST_X (point)
Returns the X coordinate of a point in type double. Example:
SELECT ST_X(ST_POINT(1.5, 2.5))
ST_Y (point)
Returns the Y coordinate of a point in type double. Example:
SELECT ST_Y(ST_POINT(1.5, 2.5))
ST_POINT_NUMBER (geometry)
Returns the number of points in the geometry in type bigint. Example:
SELECT ST_POINT_NUMBER(ST_POINT(1.5, 2.5))
ST_INTERIOR_RING_NUMBER (geometry)
Returns the number of interior rings in the polygon geometry in type bigint. Example:
SELECT ST_INTERIOR_RING_NUMBER(ST_POLYGON('polygon ((0 0, 8 0, 0 8, 0 0), (1 1, 1 5, 5 1, 1
1))'))
Examples: Geospatial Queries
The following examples create two tables and issue a query against them.
These examples rely on two files stored in an Amazon S3 location:
An earthquakes.csv sample file, which lists earthquakes that occurred in California. This file has
fields that correspond to the fields in the table earthquakes in the following example.
A california-counties.json file, which lists JSON-encoded county data in the ESRI-compliant
format, and includes many fields, such as AREA, PERIMETER, STATE, COUNTY, and NAME. The
following example shows the counties table from this file with two fields only: Name (string), and
BoundaryShape (binary).
Note
These files contain sample data and are not guaranteed to be accurate. They are used in the
documentation for illustration purposes only and are not included with the product.
The following code example creates a table called earthquakes:
CREATE external TABLE earthquakes
(
earthquake_date STRING,
latitude DOUBLE,
longitude DOUBLE,
depth DOUBLE,
magnitude DOUBLE,
magtype string,
mbstations string,
gap string,
distance string,
rms string,
source string,
eventid string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 's3://my-query-log/csv'
The following code example creates a table called counties:
CREATE external TABLE IF NOT EXISTS counties
(
Name string,
BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-query-log/json'
The following code example uses a CROSS JOIN of the two tables created earlier.
Additionally, for both tables, it uses ST_CONTAINS and asks for counties whose boundaries include a
geographical location of the earthquakes, specified with ST_POINT. It then groups such counties by
name, orders them by count, and returns them.
SELECT counties.name,
COUNT(*) cnt
FROM counties
CROSS JOIN earthquakes
WHERE ST_CONTAINS (counties.boundaryshape, ST_POINT(earthquakes.longitude,
earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt DESC
This query returns:
+-----------------+-----+
| name            | cnt |
+-----------------+-----+
| Kern            |  36 |
| San Bernardino  |  35 |
| Imperial        |  28 |
| Inyo            |  20 |
| Los Angeles     |  18 |
| Riverside       |  14 |
| Monterey        |  14 |
| Santa Clara     |  12 |
| San Benito      |  11 |
| Fresno          |  11 |
| San Diego       |   7 |
| Santa Cruz      |   5 |
| Ventura         |   3 |
| San Luis Obispo |   3 |
| Orange          |   2 |
| San Mateo       |   1 |
+-----------------+-----+
Querying AWS Service Logs
This section includes several procedures for using Amazon Athena to query popular datasets, such as
AWS CloudTrail logs, Amazon CloudFront logs, Classic Load Balancer logs, Application Load Balancer
logs, and Amazon VPC flow logs.
The tasks in this section use the Athena console, but you can also use other tools that connect via JDBC.
For more information, see Accessing Amazon Athena with JDBC, the AWS CLI, or the Amazon Athena API
Reference.
The topics in this section assume that you have set up both an IAM user with appropriate permissions to
access Athena and the Amazon S3 bucket where the data to query should reside. For more information,
see Setting Up (p. 12) and Getting Started (p. 14).
Topics
Querying AWS CloudTrail Logs (p. 94)
Querying Amazon CloudFront Logs (p. 97)
Querying Classic Load Balancer Logs (p. 98)
Querying Application Load Balancer Logs (p. 100)
Querying Amazon VPC Flow Logs (p. 101)
Querying AWS CloudTrail Logs
AWS CloudTrail is a service that records AWS API calls and events for AWS accounts. CloudTrail logs
include details about any API calls made to your AWS services, including the console. CloudTrail
generates encrypted log files and stores them in Amazon S3. You can use Athena to query these log
files directly from Amazon S3, specifying the LOCATION of log files. For more information, see the AWS
CloudTrail User Guide.
Using Athena with CloudTrail logs is a powerful way to enhance your analysis of AWS service activity. For
example, you can use queries to identify trends and further isolate activity by attribute, such as source IP
address or user.
A common application is to use CloudTrail logs to analyze operational activity for security and
compliance. For a detailed example, see the AWS Big Data Blog post, Analyze Security, Compliance, and
Operational Activity Using AWS CloudTrail and Amazon Athena.
CloudTrail saves logs as JSON text files in compressed gzip format (*.json.gzip). The location of log files
depends on how you set up trails, the region or regions you are logging, and other factors.
For more information about where logs are stored, the JSON structure, and the record file contents, see
the following topics in the AWS CloudTrail User Guide:
Finding Your CloudTrail Log Files
CloudTrail Log File Examples
CloudTrail Record Contents
CloudTrail Event Reference
To collect logs and save them to Amazon S3, enable CloudTrail for the console. For more information,
see Creating a Trail in the AWS CloudTrail User Guide.
Note the destination Amazon S3 bucket where you save the logs, as you need it in the next section.
Replace the LOCATION clause with the path to the CloudTrail log location and set of objects with which
to work. The example uses a LOCATION value of logs for a particular account, but you can use the degree
of specificity that suits your application.
For example:
To analyze data from multiple accounts, you can roll back the LOCATION specifier to indicate all
AWSLogs by using LOCATION 's3://MyLogFiles/AWSLogs/'.
To analyze data from a specific date, account, and region, use LOCATION
's3://MyLogFiles/123456789012/CloudTrail/us-east-1/2016/03/14/'.
Using the highest level in the object hierarchy gives you the greatest flexibility when you query using
Athena.
To query the CloudTrail logs, you use the CloudTrail SerDe (p. 112).
Creating the Table for CloudTrail Logs (p. 95)
Tips for Querying CloudTrail Logs (p. 96)
Creating the Table for CloudTrail Logs
To create the CloudTrail table
1. Copy and paste the following DDL statement into the Athena console.
2. Modify LOCATION 's3://cloudtrail_bucket_name/AWSLogs/bucket_ID/' to point to the S3 bucket
that contains your log data.
Verify that fields are listed correctly. For a full list of fields in a CloudTrail record, see CloudTrail Record
Contents.
In this example, the fields requestParameters, responseElements, and additionalEventData
are included as part of STRUCT data type used in JSON. To get data out of these fields, use
JSON_EXTRACT functions. For more information, see Extracting Data From JSON (p. 78).
CREATE EXTERNAL TABLE cloudtrail_logs (
eventversion STRING,
userIdentity STRUCT<
type:STRING,
principalid:STRING,
arn:STRING,
accountid:STRING,
invokedby:STRING,
accesskeyid:STRING,
userName:STRING,
sessioncontext:STRUCT<
attributes:STRUCT<
mfaauthenticated:STRING,
creationdate:STRING>,
sessionIssuer:STRUCT<
type:STRING,
principalId:STRING,
arn:STRING,
accountId:STRING,
userName:STRING>>>,
eventTime STRING,
eventSource STRING,
eventName STRING,
awsRegion STRING,
sourceIpAddress STRING,
userAgent STRING,
errorCode STRING,
errorMessage STRING,
requestParameters STRING,
responseElements STRING,
additionalEventData STRING,
requestId STRING,
eventId STRING,
resources ARRAY<STRUCT<
ARN:STRING,
accountId:STRING,
type:STRING>>,
eventType STRING,
apiVersion STRING,
readOnly STRING,
recipientAccountId STRING,
serviceEventDetails STRING,
sharedEventID STRING,
vpcEndpointId STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://cloudtrail_bucket_name/AWSLogs/bucket_ID/';
3. Run the query in the Athena console. After the query completes, Athena registers cloudtrail_logs,
making the data in it ready for you to issue queries.
Tips for Querying CloudTrail Logs
To explore the CloudTrail logs data, use these tips:
Before querying the logs, verify that your logs table looks the same as the one in Creating the
Table for CloudTrail Logs (p. 95). If it does not, delete the existing table: DROP TABLE
cloudtrail_logs;.
Recreate the table. For more information, see Creating the Table for CloudTrail Logs (p. 95).
Verify that fields in your Athena query are listed correctly. For a full list of fields in a CloudTrail record,
see CloudTrail Record Contents.
If your query includes fields in JSON formats, such as STRUCT, extract data from JSON. See Extracting
Data From JSON (p. 78).
Now you are ready to issue queries against your CloudTrail table.
Start by looking at which IAM users called which API operations and from which source IP addresses.
Use the following basic SQL query as your template. Paste the query to the Athena console and run it.
SELECT
useridentity.arn,
eventname,
sourceipaddress,
eventtime
FROM cloudtrail_logs
LIMIT 100;
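If you need a value from inside one of the JSON-formatted STRING fields, you can combine this template with the JSON_EXTRACT functions. The following query is a minimal sketch; the $.bucketName path and the eventsource filter are illustrative assumptions that depend on which events appear in your logs:
SELECT
    eventname,
    json_extract_scalar(requestparameters, '$.bucketName') AS bucket_name
FROM cloudtrail_logs
WHERE eventsource = 's3.amazonaws.com'
LIMIT 100;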
Modify this query further to explore your data.
To improve performance and prevent long-running queries, include the LIMIT clause to return a
specified subset of rows.
For more information, see the AWS Big Data blog post Analyze Security, Compliance, and Operational
Activity Using AWS CloudTrail and Amazon Athena.
Querying Amazon CloudFront Logs
You can configure Amazon CloudFront CDN to export Web distribution access logs to Amazon Simple
Storage Service. Use these logs to explore users’ surfing patterns across your web properties served by
CloudFront.
Before you begin querying the logs, enable web distribution access logging on your preferred CloudFront
distribution. For information, see Access Logs in the Amazon CloudFront Developer Guide.
Make a note of the Amazon S3 bucket to which to save these logs.
Note
This procedure works for the Web distribution access logs in CloudFront. It does not apply to
streaming logs from RTMP distributions.
Creating the Table for CloudFront Logs (p. 97)
Example Query for CloudFront logs (p. 98)
Creating the Table for CloudFront Logs
To create the CloudFront table
1. Copy and paste the following DDL statement into the Athena console. Modify the LOCATION for the
S3 bucket that stores your logs.
This query uses the LazySimpleSerDe (p. 120) by default, so the SerDe name is omitted from the statement.
CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
`date` DATE,
time STRING,
location STRING,
bytes BIGINT,
requestip STRING,
method STRING,
host STRING,
uri STRING,
status INT,
referrer STRING,
useragent STRING,
querystring STRING,
cookie STRING,
resulttype STRING,
requestid STRING,
hostheader STRING,
requestprotocol STRING,
requestbytes BIGINT,
timetaken FLOAT,
xforwardedfor STRING,
sslprotocol STRING,
sslcipher STRING,
responseresulttype STRING,
httpversion STRING,
filestatus STRING,
encryptedfields INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://your_log_bucket/prefix/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )
2. Run the query in the Athena console. After the query completes, Athena registers the cloudfront_logs
table, making the data in it ready for you to issue queries.
Example Query for CloudFront logs
The following query adds up the number of bytes served by CloudFront between June 9 and June 11,
2017. Surround the date column name with double quotes because it is a reserved word.
SELECT SUM(bytes) AS total_bytes
FROM cloudfront_logs
WHERE "date" BETWEEN DATE '2017-06-09' AND DATE '2017-06-11'
LIMIT 100;
In some cases, you need to eliminate empty values from the results of the CREATE TABLE query for
CloudFront. To do so, run SELECT DISTINCT * FROM cloudfront_logs LIMIT 10;
For more information, see the AWS Big Data Blog post Build a Serverless Architecture to Analyze Amazon
CloudFront Access Logs Using AWS Lambda, Amazon Athena, and Amazon Kinesis Analytics.
Querying Classic Load Balancer Logs
Use Classic Load Balancer logs to analyze and understand traffic patterns to and from Elastic Load
Balancing instances and backend applications. You can see the source of traffic, latency, and bytes
transferred.
Before you begin to analyze the Elastic Load Balancing logs, configure them for saving in the destination
Amazon S3 bucket. For more information, see Enable Access Logs for Your Classic Load Balancer.
Creating the Table for ELB Logs (p. 98)
Example Queries for ELB Logs (p. 99)
Creating the Table for Elastic Load Balancing Logs
Create the table that you can later query. This table must include the exact location of your Elastic Load
Balancing logs in Amazon S3.
To create the Elastic Load Balancing table
1. Copy and paste the following DDL statement into the Athena console.
2. Modify the LOCATION S3 bucket to specify the destination of your Elastic Load Balancing logs.
CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs (
request_timestamp string,
elb_name string,
request_ip string,
request_port int,
backend_ip string,
backend_port int,
request_processing_time double,
backend_processing_time double,
client_response_time double,
elb_response_code string,
backend_response_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
url string,
protocol string,
user_agent string,
ssl_cipher string,
ssl_protocol string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) ([-.0-9]*)
([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (-
|[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/elasticloadbalancing/';
3. Run the query in the Athena console. After the query completes, Athena registers the elb_logs
table, making the data in it ready for queries.
Example Queries for Elastic Load Balancing Logs
Use a query similar to this example. It lists the backend application servers that returned a 4XX or 5XX
error response code. Use the LIMIT clause to limit the number of logs to query at a time.
SELECT
request_timestamp,
elb_name,
backend_ip,
backend_response_code
FROM elb_logs
WHERE backend_response_code LIKE '4%' OR
backend_response_code LIKE '5%'
LIMIT 100;
Use a subsequent query to sum up the response time of all the transactions grouped by the backend IP
address and Elastic Load Balancing instance name.
SELECT sum(backend_processing_time) AS
total_ms,
elb_name,
backend_ip
FROM elb_logs WHERE backend_ip <> ''
GROUP BY backend_ip, elb_name
LIMIT 100;
For more information, see Analyzing Data in S3 using Athena.
Querying Application Load Balancer Logs
An Application Load Balancer is a load balancing option for Elastic Load Balancing that enables traffic
distribution in a microservices deployment using containers. Querying Application Load Balancer logs
allows you to see the source of traffic, latency, and bytes transferred to and from Elastic Load Balancing
instances and backend applications.
Before you begin, enable access logging for Application Load Balancer logs to be saved to your Amazon
S3 bucket.
Creating the Table for ALB Logs (p. 100)
Example Queries for ALB logs (p. 101)
Creating the Table for ALB Logs
1. Copy and paste the following DDL statement into the Athena console, and modify values in LOCATION
's3://your_log_bucket/prefix/AWSLogs/your_ID/elasticloadbalancing/'. For a full list
of fields present in the ALB logs, see Access Log Entries.
Create the alb_logs table as follows.
CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
type string,
time string,
elb string,
client_ip string,
client_port int,
target_ip string,
target_port int,
request_processing_time double,
target_processing_time double,
response_processing_time double,
elb_status_code string,
target_status_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
request_url string,
request_proto string,
user_agent string,
ssl_cipher string,
ssl_protocol string,
target_group_arn string,
trace_id string,
domain_name string,
chosen_cert_arn string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) ([-.0-9]*)
([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |
[^ ]*)\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) (.*) (.*) (.*)' )
LOCATION 's3://your-alb-logs-directory/AWSLogs/elasticloadbalancing/';
2. Run the query in the Athena console. After the query completes, Athena registers the alb_logs
table, making the data in it ready for you to issue queries.
Example Queries for ALB logs
The following query counts the number of requests received by the load balancer, grouped by request
verb (such as GET) and client IP address.
SELECT COUNT(request_verb) AS
count,
request_verb,
client_ip
FROM alb_logs
GROUP BY request_verb, client_ip
LIMIT 100;
Another query shows the URLs visited by Safari browser users.
SELECT request_url
FROM alb_logs
WHERE user_agent LIKE '%Safari%'
LIMIT 10;
Querying Amazon VPC Flow Logs
Amazon Virtual Private Cloud flow logs capture information about the IP traffic going to and from
network interfaces in a VPC. Use the logs to investigate network traffic patterns and identify threats and
risks across your VPC network.
Before you begin querying the logs in Athena, enable VPC flow logs and export log data to Amazon S3.
After you create the logs, let them run for a few minutes to collect some data.
Creating the Table for VPC Flow Logs (p. 101)
Example Queries for Amazon VPC Flow Logs (p. 102)
Creating the Table for VPC Flow Logs
To create the Amazon VPC table
1. Copy and paste the following DDL statement into the Athena console.
2. Modify the LOCATION 's3://your_log_bucket/prefix/' to point to the S3 bucket that contains
your log data.
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
ts string,
version int,
account string,
interfaceid string,
sourceaddress string,
destinationaddress string,
sourceport int,
destinationport int,
protocol int,
numpackets int,
numbytes int,
starttime string,
endtime string,
action string,
logstatus string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
( "input.regex" = "^([^ ]+)\\s+([0-9]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\
\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)\\s
+([^ ]+)\\s+([^ ]+)$" )
LOCATION 's3://your_log_bucket/prefix/';
3. Run the query in the Athena console. After the query completes, Athena registers the vpc_flow_logs
table, making the data in it ready for you to issue queries.
Example Queries for Amazon VPC Flow Logs
The following query lists all of the rejected TCP connections. The query uses a datetime function to
convert the timestamp field ts, and extracts only the day of the week for which these events occurred.
SELECT day_of_week(from_iso8601_timestamp(ts)) AS
day,
interfaceid,
sourceaddress,
action,
protocol
FROM vpc_flow_logs
WHERE action = 'REJECT' AND protocol = 6
LIMIT 100;
To see which one of your servers is receiving the highest number of HTTPS requests, use this query. It
counts the number of packets received on HTTPS port 443, groups them by destination IP address, and
returns the top 10.
SELECT SUM(numpackets) AS
packetcount,
destinationaddress
FROM vpc_flow_logs
WHERE destinationport = 443
GROUP BY destinationaddress
ORDER BY packetcount DESC
LIMIT 10;
For more information, see the AWS Big Data blog post Analyzing VPC Flow Logs with Amazon Kinesis
Firehose, Athena, and Amazon QuickSight.
Monitoring Logs and
Troubleshooting
Topics
Logging Amazon Athena API Calls with AWS CloudTrail (p. 103)
Troubleshooting (p. 105)
Logging Amazon Athena API Calls with AWS
CloudTrail
Athena is integrated with CloudTrail, a service that captures all of the Athena API calls and delivers the
log files to an Amazon S3 bucket that you specify. CloudTrail captures API calls from the Athena console
or from your code to the Athena API operations. Using the information collected by CloudTrail, you
can determine the request that was made to Athena, the source IP address from which the request was
made, who made the request, when it was made, and so on.
You can also use Athena to query CloudTrail log files for insight. For more information, see CloudTrail
SerDe (p. 112). To learn more about CloudTrail, including how to configure and enable it, see the AWS
CloudTrail User Guide.
Athena Information in CloudTrail
When CloudTrail logging is enabled in your AWS account, API calls made to Athena actions are tracked in
CloudTrail log files, where they are written with other AWS service records. CloudTrail determines when
to create and write to a new file based on a time period and file size.
All Athena actions are logged by CloudTrail and are documented in the Amazon Athena API Reference.
For example, calls to the StartQueryExecution and GetQueryResults actions generate entries in the
CloudTrail log files.
Every log entry contains information about who generated the request. The user identity information in
the log entry helps you determine the following:
Whether the request was made with root or IAM user credentials
Whether the request was made with temporary security credentials for a role or federated user
Whether the request was made by another AWS service
For more information, see CloudTrail userIdentity Element in the AWS CloudTrail User Guide.
You can store your log files in your S3 bucket for as long as you want, but you can also define Amazon
S3 lifecycle rules to archive or delete log files automatically. By default, your log files are encrypted with
Amazon S3 server-side encryption (SSE).
To be notified upon log file delivery, you can configure CloudTrail to publish Amazon SNS notifications
when new log files are delivered. For more information, see Configuring Amazon SNS Notifications for
CloudTrail.
You can also aggregate Athena log files from multiple AWS regions and multiple AWS accounts into a
single S3 bucket.
For more information, see Receiving CloudTrail Log Files from Multiple Regions and Receiving CloudTrail
Log Files from Multiple Accounts.
Understanding Athena Log File Entries
CloudTrail log files can contain one or more log entries. Each entry lists multiple JSON-formatted events.
A log entry represents a single request from any source and includes information about the requested
action, the date and time of the action, request parameters, and so on. Log entries are not an ordered
stack trace of the public API calls, so they do not appear in any specific order.
The following examples demonstrate CloudTrail log entries for:
StartQueryExecution (Successful) (p. 104)
StartQueryExecution (Failed) (p. 104)
CreateNamedQuery (p. 105)
StartQueryExecution (Successful)
{
"eventVersion":"1.05",
"userIdentity":{
"type":"IAMUser",
"principalId":"EXAMPLE_PRINCIPAL_ID",
"arn":"arn:aws:iam::123456789012:user/johndoe",
"accountId":"123456789012",
"accessKeyId":"EXAMPLE_KEY_ID",
"userName":"johndoe"
},
"eventTime":"2017-05-04T00:23:55Z",
"eventSource":"athena.amazonaws.com",
"eventName":"StartQueryExecution",
"awsRegion":"us-east-1",
"sourceIPAddress":"77.88.999.69",
"userAgent":"aws-internal/3",
"requestParameters":{
"clientRequestToken":"16bc6e70-f972-4260-b18a-db1b623cb35c",
"resultConfiguration":{
"outputLocation":"s3://athena-johndoe-test/test/"
},
"query":"Select 10"
},
"responseElements":{
"queryExecutionId":"b621c254-74e0-48e3-9630-78ed857782f9"
},
"requestID":"f5039b01-305f-11e7-b146-c3fc56a7dc7a",
"eventID":"c97cf8c8-6112-467a-8777-53bb38f83fd5",
"eventType":"AwsApiCall",
"recipientAccountId":"123456789012"
}
StartQueryExecution (Failed)
{
"eventVersion":"1.05",
"userIdentity":{
"type":"IAMUser",
"principalId":"EXAMPLE_PRINCIPAL_ID",
"arn":"arn:aws:iam::123456789012:user/johndoe",
"accountId":"123456789012",
"accessKeyId":"EXAMPLE_KEY_ID",
"userName":"johndoe"
},
"eventTime":"2017-05-04T00:21:57Z",
"eventSource":"athena.amazonaws.com",
"eventName":"StartQueryExecution",
"awsRegion":"us-east-1",
"sourceIPAddress":"77.88.999.69",
"userAgent":"aws-internal/3",
"errorCode":"InvalidRequestException",
"errorMessage":"Invalid result configuration. Should specify either output location or
result configuration",
"requestParameters":{
"clientRequestToken":"ca0e965f-d6d8-4277-8257-814a57f57446",
"query":"Select 10"
},
"responseElements":null,
"requestID":"aefbc057-305f-11e7-9f39-bbc56d5d161e",
"eventID":"6e1fc69b-d076-477e-8dec-024ee51488c4",
"eventType":"AwsApiCall",
"recipientAccountId":"123456789012"
}
CreateNamedQuery
{
"eventVersion":"1.05",
"userIdentity":{
"type":"IAMUser",
"principalId":"EXAMPLE_PRINCIPAL_ID",
"arn":"arn:aws:iam::123456789012:user/johndoe",
"accountId":"123456789012",
"accessKeyId":"EXAMPLE_KEY_ID",
"userName":"johndoe"
},
"eventTime":"2017-05-16T22:00:58Z",
"eventSource":"athena.amazonaws.com",
"eventName":"CreateNamedQuery",
"awsRegion":"us-west-2",
"sourceIPAddress":"77.88.999.69",
"userAgent":"aws-cli/1.11.85 Python/2.7.10 Darwin/16.6.0 botocore/1.5.48",
"requestParameters":{
"name":"johndoetest",
"queryString":"select 10",
"database":"default",
"clientRequestToken":"fc1ad880-69ee-4df0-bb0f-1770d9a539b1"
},
"responseElements":{
"namedQueryId":"cdd0fe29-4787-4263-9188-a9c8db29f2d6"
},
"requestID":"2487dd96-3a83-11e7-8f67-c9de5ac76512",
"eventID":"15e3d3b5-6c3b-4c7c-bc0b-36a8dd95227b",
"eventType":"AwsApiCall",
"recipientAccountId":"123456789012"
}
Troubleshooting
Use these resources to troubleshoot problems with Amazon Athena.
Service Limits (p. 158)
Athena topics in the AWS Knowledge Center
Athena discussion forum
Athena posts in the AWS Big Data Blog
SerDe Reference
Athena supports several SerDe libraries for parsing data from different data formats, such as CSV, JSON,
Parquet, and ORC. Athena does not support custom SerDes.
Topics
Using a SerDe (p. 107)
Supported SerDes and Data Formats (p. 108)
Compression Formats (p. 130)
Using a SerDe
A SerDe (Serializer/Deserializer) is a way in which Athena interacts with data in various formats.
It is the SerDe you specify, and not the DDL, that defines the table schema. In other words, the SerDe can
override the DDL configuration that you specify in Athena when you create your table.
To Use a SerDe in Queries
To use a SerDe when creating a table in Athena, use one of the following methods:
Use DDL statements to describe how to read and write data to the table and do not specify a ROW
FORMAT, as in this example. This omits listing the actual SerDe type and the native LazySimpleSerDe
is used by default.
In general, Athena uses the LazySimpleSerDe if you do not specify a ROW FORMAT, or if you specify
ROW FORMAT DELIMITED.
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
ESCAPED BY '\\'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'
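For example, the following CREATE TABLE statement relies on the default LazySimpleSerDe because no SerDe is named in ROW FORMAT. This is a minimal sketch; the table name, columns, and Amazon S3 location are placeholders:
CREATE EXTERNAL TABLE IF NOT EXISTS my_delimited_table (
  id string,
  name string,
  value string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/path/to/data/';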
Explicitly specify the type of SerDe Athena should use when it reads and writes data to the table. Also,
specify additional properties in SERDEPROPERTIES, as in this example.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'colelction.delim' = '|',
'mapkey.delim' = ':',
'escape.delim' = '\\'
)
Supported SerDes and Data Formats
Athena supports creating tables and querying data from files in CSV, TSV, custom-delimited, and JSON
formats; files from Hadoop-related formats: ORC, Apache Avro and Parquet; log files from Logstash, AWS
CloudTrail logs, and Apache WebServer logs.
To create tables and query data from files in these formats in Athena, specify a serializer-deserializer
class (SerDe) so that Athena knows which format is used and how to parse the data.
This table lists the data formats supported in Athena and their corresponding SerDe libraries.
A SerDe is a custom library that tells the data catalog used by Athena how to handle the data. You
specify a SerDe type by listing it explicitly in the ROW FORMAT part of your CREATE TABLE statement in
Athena. In some cases, you can omit the SerDe name because Athena uses some SerDe types by default
for certain types of file formats.
Supported Data Formats and SerDes

CSV (Comma-Separated Values)
Description: In a CSV file, each line represents a data record, and each record consists of one or more
fields, separated by commas.
SerDe types supported in Athena: Use the LazySimpleSerDe for CSV, TSV, and Custom-Delimited
Files (p. 120) if your data does not include values enclosed in quotes. Use the OpenCSVSerDe for
Processing CSV (p. 114) when your data includes quotes in values, or different separator or escape
characters.

TSV (Tab-Separated Values)
Description: In a TSV file, each line represents a data record, and each record consists of one or more
fields, separated by tabs.
SerDe types supported in Athena: Use the LazySimpleSerDe for CSV, TSV, and Custom-Delimited
Files (p. 120) and specify the separator character as FIELDS TERMINATED BY '\t'.

Custom-Delimited files
Description: In a file in this format, each line represents a data record, and records are separated by
custom delimiters.
SerDe types supported in Athena: Use the LazySimpleSerDe for CSV, TSV, and Custom-Delimited
Files (p. 120) and specify custom delimiters.

JSON (JavaScript Object Notation)
Description: In a JSON file, each line represents a data record, and each record consists of
attribute–value pairs and arrays, separated by commas.
SerDe types supported in Athena: Use the Hive JSON SerDe (p. 118). Use the OpenX JSON
SerDe (p. 118).

Apache Avro
Description: A format for storing data in Hadoop that uses JSON-based schemas for record values.
SerDe types supported in Athena: Use the Avro SerDe (p. 109).

ORC (Optimized Row Columnar)
Description: A format for optimized columnar storage of Hive data.
SerDe types supported in Athena: Use the ORC SerDe (p. 125) and ZLIB compression.

Apache Parquet
Description: A format for columnar storage of data in Hadoop.
SerDe types supported in Athena: Use the Parquet SerDe (p. 127) and SNAPPY compression.

Logstash log files
Description: A format for storing log files in Logstash.
SerDe types supported in Athena: Use the Grok SerDe (p. 115).

Apache WebServer log files
Description: A format for storing log files in Apache WebServer.
SerDe types supported in Athena: Use the RegexSerDe for Processing Apache Web Server
Logs (p. 111).

CloudTrail log files
Description: A format for storing log files in CloudTrail.
SerDe types supported in Athena: Use the CloudTrail SerDe (p. 112) to query most fields in CloudTrail
logs. Use the OpenX JSON SerDe (p. 118) for a few fields where their format depends on the service.
For more information, see CloudTrail SerDe (p. 112).
Topics
Avro SerDe (p. 109)
RegexSerDe for Processing Apache Web Server Logs (p. 111)
CloudTrail SerDe (p. 112)
OpenCSVSerDe for Processing CSV (p. 114)
Grok SerDe (p. 115)
JSON SerDe Libraries (p. 117)
LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files (p. 120)
ORC SerDe (p. 125)
Parquet SerDe (p. 127)
Avro SerDe
SerDe Name
Avro SerDe
Library Name
org.apache.hadoop.hive.serde2.avro.AvroSerDe
Examples
Athena does not support using avro.schema.url to specify table schema for security reasons. Use
avro.schema.literal. To extract schema from an Avro file, you can use the Apache avro-tools-
<version>.jar with the getschema parameter. This returns a schema that you can use in your WITH
SERDEPROPERTIES statement. For example:
java -jar avro-tools-1.8.2.jar getschema my_data.avro
The avro-tools-<version>.jar file is located in the java subdirectory of your installed Avro
release. To download Avro, see Apache Avro Releases. To download Apache Avro Tools directly, see the
Apache Avro Tools Maven Repository.
After you obtain the schema, use a CREATE TABLE statement to create an Athena table based on
underlying Avro data stored in Amazon S3. In ROW FORMAT, specify the Avro SerDe as follows: ROW
FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'. In SERDEPROPERTIES,
specify the schema, as shown in this example.
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
CREATE EXTERNAL TABLE flights_avro_example (
yr INT,
flightdate STRING,
uniquecarrier STRING,
airlineid INT,
carrier STRING,
flightnum STRING,
origin STRING,
dest STRING,
depdelay INT,
carrierdelay INT,
weatherdelay INT
)
PARTITIONED BY (year STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
{
"type" : "record",
"name" : "flights_avro_subset",
"namespace" : "default",
"fields" : [ {
"name" : "yr",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "flightdate",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "uniquecarrier",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "airlineid",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "carrier",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "flightnum",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "origin",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "dest",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "depdelay",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "carrierdelay",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "weatherdelay",
"type" : [ "null", "int" ],
"default" : null
} ]
}
')
STORED AS AVRO
LOCATION 's3://athena-examples-myregion/flight/avro/';
Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata.
MSCK REPAIR TABLE flights_avro_example;
Query the top 10 departure cities by number of total departures.
SELECT origin, count(*) AS total_departures
FROM flights_avro_example
WHERE year >= '2000'
GROUP BY origin
ORDER BY total_departures DESC
LIMIT 10;
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
RegexSerDe for Processing Apache Web Server Logs
SerDe Name
RegexSerDe
Library Name
org.apache.hadoop.hive.serde2.RegexSerDe
Examples
The following example creates a table from CloudFront logs using the RegexSerDe from the Getting
Started tutorial.
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` DATE,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
os STRING,
Browser STRING,
BrowserVersion STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s
+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION 's3://athena-examples-myregion/cloudfront/plaintext/';
CloudTrail SerDe
AWS CloudTrail is a service that records AWS API calls and events for AWS accounts. CloudTrail generates
encrypted log files and stores them in Amazon S3. You can use Athena to query these log files directly
from Amazon S3, specifying the LOCATION of log files.
To query CloudTrail logs in Athena, create a table from the log files and use the CloudTrail SerDe to
deserialize the log data.
In addition to using the CloudTrail SerDe, there are cases where you need to use a different SerDe or to
extract data from JSON. Certain fields in CloudTrail logs are STRING values that may have a variable
data format, which depends on the service. As a result, the CloudTrail SerDe is unable to predictably
deserialize them. To query the following fields, identify the data pattern and then use a different
SerDe, such as the OpenX JSON SerDe (p. 118). Alternatively, to get data out of these fields, use
JSON_EXTRACT functions. For more information, see Extracting Data From JSON (p. 78).
requestParameters
responseElements
additionalEventData
serviceEventDetails
SerDe Name
CloudTrail SerDe
Library Name
com.amazon.emr.hive.serde.CloudTrailSerde
Examples
The following example uses the CloudTrail SerDe on a fictional set of log files to create a table based on
them.
In this example, the fields requestParameters, responseElements, and additionalEventData
are included as part of STRUCT data type used in JSON. To get data out of these fields, use
JSON_EXTRACT functions. For more information, see Extracting Data From JSON (p. 78).
CREATE EXTERNAL TABLE cloudtrail_logs (
eventversion STRING,
userIdentity STRUCT<
type:STRING,
principalid:STRING,
arn:STRING,
accountid:STRING,
invokedby:STRING,
accesskeyid:STRING,
userName:STRING,
sessioncontext:STRUCT<
attributes:STRUCT<
mfaauthenticated:STRING,
creationdate:STRING>,
sessionIssuer:STRUCT<
type:STRING,
principalId:STRING,
arn:STRING,
accountId:STRING,
userName:STRING>>>,
eventTime STRING,
eventSource STRING,
eventName STRING,
awsRegion STRING,
sourceIpAddress STRING,
userAgent STRING,
errorCode STRING,
errorMessage STRING,
requestParameters STRING,
responseElements STRING,
additionalEventData STRING,
requestId STRING,
eventId STRING,
resources ARRAY<STRUCT<
ARN:STRING,
accountId:STRING,
type:STRING>>,
eventType STRING,
apiVersion STRING,
readOnly STRING,
recipientAccountId STRING,
serviceEventDetails STRING,
sharedEventID STRING,
vpcEndpointId STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://cloudtrail_bucket_name/AWSLogs/bucket_ID/';
The following query returns the logins that occurred over a 24-hour period.
SELECT
useridentity.username,
sourceipaddress,
eventtime,
additionaleventdata
FROM default.cloudtrail_logs
WHERE eventname = 'ConsoleLogin'
AND eventtime >= '2017-02-17T00:00:00Z'
AND eventtime < '2017-02-18T00:00:00Z';
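To look inside the JSON stored in additionaleventdata, you can use json_extract_scalar. The following variation is a minimal sketch; the $.MFAUsed path is an illustrative assumption about the shape of ConsoleLogin events in your logs:
SELECT
    useridentity.username,
    sourceipaddress,
    eventtime,
    json_extract_scalar(additionaleventdata, '$.MFAUsed') AS mfa_used
FROM default.cloudtrail_logs
WHERE eventname = 'ConsoleLogin'
    AND eventtime >= '2017-02-17T00:00:00Z'
    AND eventtime < '2017-02-18T00:00:00Z';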
For more information, see Querying AWS CloudTrail Logs (p. 94).
OpenCSVSerDe for Processing CSV
When you create a table from a CSV file in Athena, determine what types of values it contains:
If the file contains values enclosed in quotes, use the OpenCSV SerDe to deserialize the values in
Athena.
If the file does not contain values enclosed in quotes, you can omit specifying any SerDe. In this case,
Athena uses the default LazySimpleSerDe. For information, see LazySimpleSerDe for CSV, TSV, and
Custom-Delimited Files (p. 120).
CSV SerDe (OpenCSVSerde)
The OpenCSV SerDe behaves as follows:
Allows you to specify separator, quote, and escape characters, such as: WITH SERDEPROPERTIES
("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\" )
Does not support embedded line breaks in CSV files.
Converts all column type values to STRING.
Relies on the Presto parser to recognize data types other than STRING, and converts the values from
STRING into those data types if it can recognize them.
In particular, for data types other than STRING this SerDe behaves as follows:
Recognizes BOOLEAN, BIGINT, INT, and DOUBLE data types and parses them without changes.
Recognizes the TIMESTAMP type if it is specified in the UNIX format, such as yyyy-mm-dd
hh:mm:ss[.f...], as the type LONG.
Does not support TIMESTAMP in the JDBC-compliant java.sql.Timestamp format, such as "YYYY-
MM-DD HH:MM:SS.fffffffff" (9 decimal place precision). If you are processing CSV files from Hive,
use the UNIX format for TIMESTAMP.
Recognizes the DATE type if it is specified in the UNIX format, such as YYYY-MM-DD, as the type LONG.
Does not support DATE in another format. If you are processing CSV files from Hive, use the UNIX
format for DATE.
SerDe Name
CSV SerDe
Library Name
To use this SerDe, specify its fully qualified class name in ROW FORMAT. Also specify the delimiters inside
SERDEPROPERTIES, as follows:
...
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "`",
"escapeChar" = "\\"
)
Example
This example presumes a source CSV file saved in s3://mybucket/mycsv/ with the following data
contents:
"a1","a2","a3","a4"
"1","2","abc","def"
"a","a1","abc3","ab4"
Use a CREATE TABLE statement to create an Athena table based on this CSV file and reference the
OpenCSVSerDe class in ROW FORMAT, also specifying SerDe properties for character separator, quote
character, and escape character.
CREATE EXTERNAL TABLE myopencsvtable (
col1 string,
col2 string,
col3 string,
col4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://location/of/csv/';
Query all values in the table.
SELECT * FROM myopencsvtable;
The query returns the following values.
col1 col2 col3 col4
-----------------------------
a1 a2 a3 a4
1 2 abc def
a a1 abc3 ab4
Grok SerDe
The Logstash Grok SerDe is a library with a set of specialized patterns for deserialization of unstructured
text files, usually logs. Each Grok pattern is a named regular expression. You can identify and re-use
these deserialization patterns as needed. This makes it easier to use Grok compared with using regular
expressions. Grok provides a set of pre-defined patterns. You can also create custom patterns.
To specify the Grok SerDe when creating a table in Athena, use the ROW FORMAT SERDE
'com.amazonaws.glue.serde.GrokSerDe' clause, followed by the WITH SERDEPROPERTIES clause
that specifies the patterns to match in your data, where:
The input.format expression defines the patterns to match in the data file. It is required.
The input.grokCustomPatterns expression defines a named custom pattern, which you can
subsequently use within the input.format expression. It is optional.
The STORED AS INPUTFORMAT and OUTPUTFORMAT clauses are required.
The LOCATION clause specifies an Amazon S3 bucket, which can contain multiple source data files. All
files in the bucket are deserialized to create the table.
Examples
These examples rely on the list of predefined Grok patterns. See pre-defined patterns.
Example 1
This example uses a single fictional text file saved in s3://mybucket/groksample with the following
data, which represents Postfix maillog entries.
Feb 9 07:15:00 m4eastmail postfix/smtpd[19305]: B88C4120838: connect from
unknown[192.168.55.4]
Feb 9 07:15:00 m4eastmail postfix/smtpd[20444]: B58C4330038: client=unknown[192.168.55.4]
Feb 9 07:15:03 m4eastmail postfix/cleanup[22835]: BDC22A77854: message-
The following statement creates a table in Athena called mygroktable from the source data file, using a
custom pattern and the predefined patterns that you specify.
CREATE EXTERNAL TABLE `mygroktable`(
 `syslogbase` string,
 `queue_id` string,
 `syslog_message` string
 )
ROW FORMAT SERDE
'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.grokCustomPatterns' = 'POSTFIX_QUEUEID [0-9A-F]{7,12}',
'input.format'='%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/groksample';
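After the table is created, you can query it like any other Athena table. The following query is a minimal sketch:
SELECT queue_id, syslog_message
FROM mygroktable
LIMIT 10;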
Start with something simple like %{NOTSPACE:column} to get the columns mapped first and then
specialize the columns if you want to.
Example 2
In the following example, you create a query for Log4j logs. The example log file has the entries in this
format:
2017-09-12 12:10:34,972 INFO - processType=AZ, processId=ABCDEFG614B6F5E49, status=RUN,
threadId=123:amqListenerContainerPool23[P:AJ|ABCDE9614B6F5E49||
2017-09-12T12:10:11.172-0700],
executionTime=7290, tenantId=12456, userId=123123f8535f8d76015374e7a1d87c3c,
shard=testapp1,
jobId=12312345e5e7df0015e777fb2e03f3c, messageType=REAL_TIME_SYNC,
action=receive, hostname=1.abc.def.com
To query this log file:
Add the Grok pattern to the input.format for each column. For example, for timestamp, add
%{TIMESTAMP_ISO8601:timestamp}. For loglevel, add %{LOGLEVEL:loglevel}.
Make sure the pattern in input.format matches the format of the log exactly, by mapping the
dashes (-) and the commas that separate the entries in the log format.
CREATE EXTERNAL TABLE bltest (
timestamp STRING,
loglevel STRING,
processtype STRING,
processid STRING,
status STRING,
threadid STRING,
executiontime INT,
tenantid INT,
userid STRING,
shard STRING,
jobid STRING,
messagetype STRING,
action STRING,
hostname STRING
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
"input.grokCustomPatterns" = 'C_ACTION receive|send',
"input.format" = "%{TIMESTAMP_ISO8601:timestamp}, %{LOGLEVEL:loglevel} - processType=
%{NOTSPACE:processtype}, processId=%{NOTSPACE:processid}, status=%{NOTSPACE:status},
threadId=%{NOTSPACE:threadid}, executionTime=%{POSINT:executiontime}, tenantId=
%{POSINT:tenantid}, userId=%{NOTSPACE:userid}, shard=%{NOTSPACE:shard}, jobId=
%{NOTSPACE:jobid}, messageType=%{NOTSPACE:messagetype}, action=%{C_ACTION:action},
hostname=%{HOST:hostname}"
) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://mybucket/samples/';
JSON SerDe Libraries
In Athena, you can use two SerDe libraries for processing JSON files:
The native Hive JSON SerDe (p. 118)
The OpenX JSON SerDe (p. 118)
These SerDes process JSON documents, most commonly events. Such events are represented
as blocks of JSON-encoded text separated by a new line. The SerDes can also parse more complex
JSON documents with nested structures; however, this requires a matching DDL that represents
the complex data types. There are only two properties: ignore.malformed.json, which lets you
skip malformed JSON records, and dots.in.keys, which, when set to true, causes dots in key names
to be replaced with underscores.
SerDe Names
Hive-JsonSerDe
Openx-JsonSerDe
Library Names
Use one of the following:
org.apache.hive.hcatalog.data.JsonSerDe
org.openx.data.jsonserde.JsonSerDe
Hive JSON SerDe
The Hive JSON SerDe is used to process JSON documents, most commonly events. These events are
represented as blocks of JSON-encoded text separated by a new line.
You can also use the Hive JSONSerDe to parse more complex JSON documents with nested structures.
See Example: Deserializing Nested JSON (p. 119).
This SerDe has two useful optional properties you can specify when creating tables in Athena, to help
deal with inconsistencies in the data:
'ignore.malformed.json', if set to TRUE, lets you skip records with malformed JSON syntax.
'dots.in.keys', if set to TRUE, specifies that key names may include dots and replaces those dots
with underscores.
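For example, a minimal sketch that sets both properties as documented above; the table, column, and bucket names are hypothetical:
CREATE EXTERNAL TABLE event_data (
  eventid string,
  payload string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true',
  'dots.in.keys' = 'true'
)
LOCATION 's3://mybucket/myjsondata/';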
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
The following DDL statement uses the Hive JsonSerDe:
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string,
number string,
processId string,
browserCookie string,
requestEndTime string,
timers struct<modelLookup:string, requestTime:string>,
threadId string, hostname string,
sessionId string
) PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent,
userCookie, ip' )
LOCATION 's3://myregion.elasticmapreduce/samples/hive-ads/tables/impressions';
OpenX JSON SerDe
The following DDL statement uses the OpenX SerDe:
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string,
number string,
processId string,
browserCookie string,
requestEndTime string,
timers struct<modelLookup:string, requestTime:string>,
threadId string, hostname string,
sessionId string
) PARTITIONED BY (dt string)
ROW FORMAT serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent,
userCookie, ip' )
LOCATION 's3://myregion.elasticmapreduce/samples/hive-ads/tables/impressions';
Example: Deserializing Nested JSON
JSON data can be challenging to deserialize when creating a table in Athena. When dealing with complex
nested JSON, there are common issues you may encounter. For more information about these issues and
troubleshooting practices, see the AWS Knowledge Center Article I receive errors when I try to read JSON
data in Amazon Athena. For more information about common scenarios and query tips, see Create Tables
in Amazon Athena from Nested JSON and Mappings Using JSONSerDe.
The following example demonstrates a simple approach to creating an Athena table from a nested JSON
file. This example presumes a JSON file with the following structure:
{
"DocId": "AWS",
"User": {
"Id": 1234,
"Username": "bob1234",
"Name": "Bob",
"ShippingAddress": {
"Address1": "123 Main St.",
"Address2": null,
"City": "Seattle",
"State": "WA"
},
"Orders": [
{
"ItemId": 6789,
"OrderDate": "11/11/2017"
},
{
"ItemId": 4352,
"OrderDate": "12/12/2017"
}
]
}
}
The following CREATE TABLE command uses the Openx-JsonSerDe with collection data types like
struct and array to establish groups of objects.
CREATE external TABLE complex_json (
DocId string,
`USER` struct<Id:INT,
Username:string,
Name:string,
ShippingAddress:struct<Address1:string,
Address2:string,
City:string,
State:string>,
Orders:array<struct<ItemId:INT,
OrderDate:string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/myjsondata/';
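After the table is created, you can reference the nested fields with dot notation. The following is a hedged sketch; USER is enclosed in double quotation marks because user is a reserved word in the query engine:
SELECT docid,
  "user".id,
  "user".username,
  "user".shippingaddress.city
FROM complex_json;
The Orders array can be expanded with UNNEST or addressed by ordinal position, as described in SELECT (p. 144).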
LazySimpleSerDe for CSV, TSV, and Custom-
Delimited Files
Specifying this SerDe is optional. This is the SerDe for files in CSV, TSV, and custom-delimited formats
that Athena uses by default. This SerDe is used if you don't specify any SerDe and only specify ROW
FORMAT DELIMITED. Use this SerDe if your data does not have values enclosed in quotes.
Library Name
The class library name for the LazySimpleSerDe is
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe. For more information, see
LazySimpleSerDe.
Examples
The following examples show how to create tables in Athena from CSV and TSV files, using the
LazySimpleSerDe. To deserialize a custom-delimited file using this SerDe, specify the delimiters
in the same way as in these examples; a pipe-delimited sketch follows the TSV example.
CSV Example (p. 120)
TSV Example (p. 122)
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
CSV Example
Use the CREATE TABLE statement to create an Athena table from the underlying CSV file stored in
Amazon S3.
CREATE EXTERNAL TABLE flight_delays_csv (
yr INT,
quarter INT,
month INT,
dayofmonth INT,
dayofweek INT,
flightdate STRING,
uniquecarrier STRING,
airlineid INT,
carrier STRING,
tailnum STRING,
flightnum STRING,
originairportid INT,
originairportseqid INT,
origincitymarketid INT,
origin STRING,
origincityname STRING,
originstate STRING,
originstatefips STRING,
originstatename STRING,
originwac INT,
destairportid INT,
destairportseqid INT,
destcitymarketid INT,
dest STRING,
destcityname STRING,
deststate STRING,
deststatefips STRING,
deststatename STRING,
destwac INT,
crsdeptime STRING,
deptime STRING,
depdelay INT,
depdelayminutes INT,
depdel15 INT,
departuredelaygroups INT,
deptimeblk STRING,
taxiout INT,
wheelsoff STRING,
wheelson STRING,
taxiin INT,
crsarrtime INT,
arrtime STRING,
arrdelay INT,
arrdelayminutes INT,
arrdel15 INT,
arrivaldelaygroups INT,
arrtimeblk STRING,
cancelled INT,
cancellationcode STRING,
diverted INT,
crselapsedtime INT,
actualelapsedtime INT,
airtime INT,
flights INT,
distance INT,
distancegroup INT,
carrierdelay INT,
weatherdelay INT,
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/csv/';
Run MSCK REPAIR TABLE to refresh partition metadata each time a new partition is added to this table.
MSCK REPAIR TABLE flight_delays_csv;
Query the top 10 routes delayed by more than 1 hour.
SELECT origin, dest, count(*) as delays
FROM flight_delays_csv
WHERE depdelayminutes > 60
GROUP BY origin, dest
ORDER BY 3 DESC
LIMIT 10;
TSV Example
This example presumes a source TSV file saved in s3://mybucket/mytsv/.
Use a CREATE TABLE statement to create an Athena table from the TSV file stored in Amazon S3.
Notice that this example does not reference a SerDe class in ROW FORMAT because it uses the
LazySimpleSerDe, which can be omitted. The example specifies the field and line delimiters and
an escape character:
CREATE EXTERNAL TABLE flight_delays_tsv (
yr INT,
quarter INT,
month INT,
dayofmonth INT,
dayofweek INT,
flightdate STRING,
uniquecarrier STRING,
airlineid INT,
carrier STRING,
tailnum STRING,
flightnum STRING,
originairportid INT,
originairportseqid INT,
origincitymarketid INT,
origin STRING,
origincityname STRING,
originstate STRING,
originstatefips STRING,
originstatename STRING,
originwac INT,
destairportid INT,
destairportseqid INT,
destcitymarketid INT,
dest STRING,
destcityname STRING,
deststate STRING,
deststatefips STRING,
deststatename STRING,
destwac INT,
crsdeptime STRING,
deptime STRING,
depdelay INT,
depdelayminutes INT,
depdel15 INT,
departuredelaygroups INT,
deptimeblk STRING,
taxiout INT,
wheelsoff STRING,
wheelson STRING,
taxiin INT,
crsarrtime INT,
arrtime STRING,
arrdelay INT,
arrdelayminutes INT,
arrdel15 INT,
arrivaldelaygroups INT,
arrtimeblk STRING,
cancelled INT,
cancellationcode STRING,
diverted INT,
crselapsedtime INT,
actualelapsedtime INT,
airtime INT,
flights INT,
distance INT,
distancegroup INT,
carrierdelay INT,
weatherdelay INT,
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/tsv/';
Run MSCK REPAIR TABLE to refresh partition metadata each time a new partition is added to this table.
MSCK REPAIR TABLE flight_delays_tsv;
Query the top 10 routes delayed by more than 1 hour.
SELECT origin, dest, count(*) as delays
FROM flight_delays_tsv
WHERE depdelayminutes > 60
GROUP BY origin, dest
ORDER BY 3 DESC
LIMIT 10;
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
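As noted at the start of this section, the same SerDe also handles other single-character delimiters. The following hedged sketch (hypothetical table, columns, and bucket) creates a table over a pipe-delimited file:
CREATE EXTERNAL TABLE pipe_delimited_sample (
  id INT,
  name STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/mypipedata/';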
ORC SerDe
SerDe Name
OrcSerDe
Library Name
This is the SerDe class for data in the ORC format. It passes objects from the ORC file to the reader and
from the writer to the ORC file: OrcSerDe.
Examples
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
The following example creates a table for the flight delays data in ORC. The table includes partitions:
DROP TABLE flight_delays_orc;
CREATE EXTERNAL TABLE flight_delays_orc (
yr INT,
quarter INT,
month INT,
dayofmonth INT,
dayofweek INT,
flightdate STRING,
uniquecarrier STRING,
airlineid INT,
carrier STRING,
tailnum STRING,
flightnum STRING,
originairportid INT,
originairportseqid INT,
origincitymarketid INT,
origin STRING,
origincityname STRING,
originstate STRING,
originstatefips STRING,
originstatename STRING,
originwac INT,
destairportid INT,
destairportseqid INT,
destcitymarketid INT,
dest STRING,
destcityname STRING,
deststate STRING,
deststatefips STRING,
deststatename STRING,
destwac INT,
crsdeptime STRING,
deptime STRING,
depdelay INT,
depdelayminutes INT,
depdel15 INT,
departuredelaygroups INT,
deptimeblk STRING,
taxiout INT,
wheelsoff STRING,
wheelson STRING,
taxiin INT,
crsarrtime INT,
arrtime STRING,
arrdelay INT,
arrdelayminutes INT,
arrdel15 INT,
arrivaldelaygroups INT,
arrtimeblk STRING,
cancelled INT,
cancellationcode STRING,
diverted INT,
crselapsedtime INT,
actualelapsedtime INT,
airtime INT,
flights INT,
distance INT,
distancegroup INT,
carrierdelay INT,
weatherdelay INT,
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year String)
STORED AS ORC
LOCATION 's3://athena-examples-myregion/flight/orc/'
tblproperties ("orc.compress"="ZLIB");
Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata.
MSCK REPAIR TABLE flight_delays_orc;
Use this query to obtain the top 10 routes delayed by more than 1 hour:
SELECT origin, dest, count(*) as delays
FROM flight_delays_orc
WHERE depdelayminutes > 60
GROUP BY origin, dest
ORDER BY 3 DESC
LIMIT 10;
Parquet SerDe
SerDe Name
ParquetHiveSerDe is used for files stored in Parquet. For more information, see Parquet Format.
Library Name
Athena uses this class when it needs to deserialize files stored in Parquet:
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Example: Querying a File Stored in Parquet
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
Use the CREATE TABLE statement to create an Athena table from the underlying data stored in
Amazon S3 in the Parquet format.
CREATE EXTERNAL TABLE flight_delays_pq (
yr INT,
quarter INT,
month INT,
dayofmonth INT,
dayofweek INT,
flightdate STRING,
uniquecarrier STRING,
airlineid INT,
carrier STRING,
tailnum STRING,
flightnum STRING,
originairportid INT,
originairportseqid INT,
origincitymarketid INT,
origin STRING,
origincityname STRING,
originstate STRING,
originstatefips STRING,
originstatename STRING,
originwac INT,
destairportid INT,
destairportseqid INT,
destcitymarketid INT,
dest STRING,
destcityname STRING,
deststate STRING,
deststatefips STRING,
deststatename STRING,
destwac INT,
crsdeptime STRING,
deptime STRING,
depdelay INT,
depdelayminutes INT,
depdel15 INT,
departuredelaygroups INT,
deptimeblk STRING,
taxiout INT,
wheelsoff STRING,
wheelson STRING,
taxiin INT,
crsarrtime INT,
arrtime STRING,
arrdelay INT,
arrdelayminutes INT,
arrdel15 INT,
arrivaldelaygroups INT,
arrtimeblk STRING,
cancelled INT,
cancellationcode STRING,
diverted INT,
crselapsedtime INT,
actualelapsedtime INT,
airtime INT,
flights INT,
distance INT,
distancegroup INT,
carrierdelay INT,
weatherdelay INT,
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
STORED AS PARQUET
LOCATION 's3://athena-examples-myregion/flight/parquet/'
tblproperties ("parquet.compress"="SNAPPY");
Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata.
MSCK REPAIR TABLE flight_delays_pq;
Query the top 10 routes delayed by more than 1 hour.
SELECT origin, dest, count(*) as delays
FROM flight_delays_pq
WHERE depdelayminutes > 60
GROUP BY origin, dest
ORDER BY 3 DESC
LIMIT 10;
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
Compression Formats
Athena supports the following compression formats:
SNAPPY. This is the default compression format for files in the Parquet format. For Amazon Kinesis
Firehose logs, SNAPPY compression is not supported. Use GZIP instead.
ZLIB. This is the default compression format for files in the ORC format.
GZIP
LZO
DDL and SQL Reference
Athena supports a subset of DDL statements and ANSI SQL functions and operators.
Topics
Data Types (p. 131)
DDL Statements (p. 132)
SQL Queries, Functions, and Operators (p. 144)
Unsupported DDL (p. 148)
Limitations (p. 149)
Data Types
When you run CREATE TABLE, you must specify column names and their data types. For the complete
syntax of this command, see CREATE TABLE (p. 136).
The field col_name specifies the name for each column in the table Athena creates, along with the
column's data type. If col_name begins with an underscore, enclose it in backticks, for example
`_mycolumn`.
List of Supported Data Types in Athena
The data_type value in the col_name field of CREATE TABLE can be any of the following:
primitive_type
TINYINT
SMALLINT
INT
BIGINT
BOOLEAN
DOUBLE
FLOAT
STRING
TIMESTAMP
DECIMAL [ (precision, scale) ]
DATE (not supported for PARQUET file_format)
CHAR. Fixed length character data, with a specified length between 1 and 255, such as
char(10). For more information, see CHAR Hive Data Type.
VARCHAR. Variable length character data, with a specified length between 1 and 65535, such as
varchar(10). For more information, see VARCHAR Hive Data Type.
array_type
ARRAY < data_type >
map_type
MAP < primitive_type, data_type >
struct_type
STRUCT < col_name : data_type [COMMENT col_comment] [, ...] >
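For example, a hedged sketch (hypothetical table, columns, and bucket) that combines several of these types in a single table definition:
CREATE EXTERNAL TABLE datatype_sample (
  id BIGINT,
  price DECIMAL(10,2),
  is_active BOOLEAN,
  created TIMESTAMP,
  tags ARRAY<STRING>,
  attributes MAP<STRING, STRING>,
  address STRUCT<street:STRING, city:STRING, zip:STRING>
)
LOCATION 's3://mybucket/datatype-sample/';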
DDL Statements
Use the following DDL statements directly in Athena.
Athena DDL is based on HiveQL DDL.
Athena does not support all DDL statements. For information, see Unsupported DDL (p. 148).
Topics
ALTER DATABASE SET DBPROPERTIES (p. 132)
ALTER TABLE ADD PARTITION (p. 133)
ALTER TABLE DROP PARTITION (p. 134)
ALTER TABLE RENAME PARTITION (p. 134)
ALTER TABLE SET LOCATION (p. 135)
ALTER TABLE SET TBLPROPERTIES (p. 135)
CREATE DATABASE (p. 136)
CREATE TABLE (p. 136)
DESCRIBE TABLE (p. 139)
DROP DATABASE (p. 140)
DROP TABLE (p. 140)
MSCK REPAIR TABLE (p. 141)
SHOW COLUMNS (p. 141)
SHOW CREATE TABLE (p. 142)
SHOW DATABASES (p. 142)
SHOW PARTITIONS (p. 142)
SHOW TABLES (p. 143)
SHOW TBLPROPERTIES (p. 143)
ALTER DATABASE SET DBPROPERTIES
Sets one or more properties for a database. The use of DATABASE and SCHEMA is interchangeable;
they mean the same thing.
Synopsis
ALTER (DATABASE|SCHEMA) database_name
SET DBPROPERTIES ('property_name'='property_value' [, ...] )
Parameters
SET DBPROPERTIES ('property_name'='property_value' [, ...] )
Specifies one or more properties for the database, each named property_name, and sets the value
of each to the corresponding property_value. If property_name already exists,
the old value is overwritten with property_value.
Examples
ALTER DATABASE jd_datasets
SET DBPROPERTIES ('creator'='John Doe', 'department'='applied mathematics');
ALTER SCHEMA jd_datasets
SET DBPROPERTIES ('creator'='Jane Doe');
ALTER TABLE ADD PARTITION
Adds one or more partitions to the table. Each partition consists of one or more distinct
column name/value combinations. A separate data directory is created for each specified combination,
which can improve query performance in some circumstances. Partitioned columns don't exist within the
table data itself, so if you use a column name that has the same name as a column in the table itself, you
get an error. For more information, see Partitioning Data (p. 51).
Synopsis
ALTER TABLE table_name ADD [IF NOT EXISTS]
PARTITION
(partition_col1_name = partition_col1_value
[,partition_col2_name = partition_col2_value]
[,...])
[LOCATION 'location1']
[PARTITION
(partition_colA_name = partition_colA_value
[,partition_colB_name = partition_colB_value
[,...])]
[LOCATION 'location2']
[,...]
Parameters
[IF NOT EXISTS]
Causes the error to be suppressed if a partition with the same definition already exists.
PARTITION (partition_col_name = partition_col_value [,...])
Creates a partition with the column name/value combinations that you specify. Enclose
partition_col_value in quotation marks only if the data type of the column is a string.
[LOCATION 'location']
Specifies the directory in which to store the partitions defined by the preceding statement.
Examples
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN');
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN')
PARTITION (dt = '2016-05-15', country = 'IN');
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/
INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/
INDIA_15_May_2016';
ALTER TABLE DROP PARTITION
Drops one or more specified partitions for the named table.
Synopsis
ALTER TABLE table_name DROP [IF EXISTS] PARTITION (partition_spec) [, PARTITION
(partition_spec)]
Parameters
[IF EXISTS]
Suppresses the error message if the partition specified does not exist.
PARTITION (partition_spec)
Each partition_spec specifies a column name/value combination in the form
partition_col_name = partition_col_value [,...].
Examples
ALTER TABLE orders DROP PARTITION (dt = '2014-05-14', country = 'IN');
ALTER TABLE orders DROP PARTITION (dt = '2014-05-14', country = 'IN'), PARTITION (dt =
'2014-05-15', country = 'IN');
ALTER TABLE RENAME PARTITION
Renames the partition, partition_spec, of the table named table_name to
new_partition_spec.
Synopsis
ALTER TABLE table_name PARTITION (partition_spec) RENAME TO PARTITION (new_partition_spec)
Parameters
PARTITION (partition_spec)
Each partition_spec specifies a column name/value combination in the form
partition_col_name = partition_col_value [,...].
Examples
ALTER TABLE orders PARTITION (dt = '2014-05-14', country = 'IN') RENAME TO PARTITION (dt =
'2014-05-15', country = 'IN');
ALTER TABLE SET LOCATION
Changes the location for the table named table_name, and optionally a partition with
partition_spec.
Synopsis
ALTER TABLE table_name [ PARTITION (partition_spec) ] SET LOCATION 'new location'
Parameters
PARTITION (partition_spec)
Specifies the partition with parameters partition_spec whose location you want to change. The
partition_spec specifies a column name/value combination in the form partition_col_name
= partition_col_value.
SET LOCATION 'new location'
Specifies the new location, which must be an Amazon S3 location.
Examples
ALTER TABLE customers PARTITION (zip='98040', state='WA') SET LOCATION 's3://mystorage/
custdata';
ALTER TABLE SET TBLPROPERTIES
Adds custom metadata properties to a table and sets their assigned values.
Managed tables are not supported, so setting 'EXTERNAL'='FALSE' has no effect.
Synopsis
ALTER TABLE table_name SET TBLPROPERTIES ('property_name' = 'property_value' [ , ... ])
Parameters
SET TBLPROPERTIES ('property_name' = 'property_value' [ , ... ])
Specifies the metadata properties to add as property_name and the value for each as
property_value. If property_name already exists, its value is reset to property_value.
Examples
ALTER TABLE orders SET TBLPROPERTIES ('notes'="Please don't drop this table.");
CREATE DATABASE
Creates a database. The use of DATABASE and SCHEMA is interchangeable. They mean the same thing.
Synopsis
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT 'database_comment']
[LOCATION 'S3_loc']
[WITH DBPROPERTIES ('property_name' = 'property_value') [, ...]]
Parameters
[IF NOT EXISTS]
Causes the error to be suppressed if a database named database_name already exists.
[COMMENT database_comment]
Establishes the metadata value for the built-in metadata property named comment and the value
you provide for database_comment.
[LOCATION S3_loc]
Specifies the location where database files and metastore will exist as S3_loc. The location must be
an Amazon S3 location.
[WITH DBPROPERTIES ('property_name' = 'property_value') [, ...] ]
Allows you to specify custom metadata properties for the database definition.
Examples
CREATE DATABASE clickstreams;
CREATE DATABASE IF NOT EXISTS clickstreams
COMMENT 'Site Foo clickstream data aggregates'
LOCATION 's3://myS3location/clickstreams'
WITH DBPROPERTIES ('creator'='Jane D.', 'Dept.'='Marketing analytics');
CREATE TABLE
Creates a table with the name and the parameters that you specify.
Synopsis
CREATE [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name [(col_name data_type [COMMENT col_comment] [, ...] )]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
[STORED AS file_format] [WITH SERDEPROPERTIES (...)]
[LOCATION 's3_loc']
[TBLPROPERTIES ( ['has_encrypted_data'='true | false',]
['classification'='aws_glue_classification',] property_name=property_value [, ...] ) ]
Parameters
[EXTERNAL]
Specifies that the table is based on an underlying data file that exists in Amazon S3, in the
LOCATION that you specify. When you create an external table, the data referenced must comply
with the default format or the format that you specify with the ROW FORMAT, STORED AS, and WITH
SERDEPROPERTIES clauses.
[IF NOT EXISTS]
Causes the error message to be suppressed if a table named table_name already exists.
[db_name.]table_name
Specifies a name for the table to be created. The optional db_name parameter specifies the
database where the table exists. If omitted, the current database is assumed. If the table name
includes numbers, enclose table_name in quotation marks, for example "table123". If
table_name begins with an underscore, use backticks, for example, `_mytable`. Special
characters (other than underscore) are not supported.
Athena table names are case-insensitive; however, if you work with Apache Spark, Spark requires
lowercase table names.
[ ( col_name data_type [COMMENT col_comment] [, ...] ) ]
Specifies the name for each column to be created, along with the column's data type. Column names
do not allow special characters other than underscore (_). If col_name begins with an underscore,
enclose the column name in backticks, for example `_mycolumn`. The data_type value can be any
of the following:
primitive_type
TINYINT
SMALLINT
INT
BIGINT
BOOLEAN
DOUBLE
FLOAT
STRING
TIMESTAMP
DECIMAL [ (precision, scale) ]
DATE (not supported for PARQUET file_format)
CHAR. Fixed length character data, with a specified length between 1 and 255, such as
char(10). For more information, see CHAR Hive Data Type.
VARCHAR. Variable length character data, with a specified length between 1 and 65535,
such as varchar(10). For more information, see VARCHAR Hive Data Type.
array_type
ARRAY < data_type >
map_type
MAP < primitive_type, data_type >
struct_type
STRUCT < col_name : data_type [COMMENT col_comment] [, ...] >
[COMMENT table_comment]
Creates the comment table property and populates it with the table_comment you specify.
[PARTITIONED BY (col_name data_type [ COMMENT col_comment ], ... ) ]
Creates a partitioned table with one or more partition columns that have the col_name,
data_type and col_comment specified. A table can have one or more partitions, which consist of a
distinct column name and value combination. A separate data directory is created for each specified
combination, which can improve query performance in some circumstances. Partitioned columns
don't exist within the table data itself. If you use a value for col_name that is the same as a table
column, you get an error. For more information, see Partitioning Data (p. 51).
Note
After you create a table with partitions, run a subsequent query that consists of the MSCK
REPAIR TABLE (p. 141) clause to refresh partition metadata, for example, MSCK REPAIR
TABLE cloudfront_logs;.
[ROW FORMAT row_format]
Specifies the row format of the table and its underlying source data if applicable. For row_format,
you can specify one or more delimiters with the DELIMITED clause or, alternatively, use the SERDE
clause as described below. If ROW FORMAT is omitted or ROW FORMAT DELIMITED is specified, a
native SerDe is used.
[DELIMITED FIELDS TERMINATED BY char [ESCAPED BY char]]
[DELIMITED COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
[NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
--OR--
SERDE 'serde_name' [WITH SERDEPROPERTIES ("property_name" = "property_value",
"property_name" = "property_value" [, ...] )]
The serde_name indicates the SerDe to use. The WITH SERDEPROPERTIES clause allows you to
provide one or more custom properties allowed by the SerDe.
[STORED AS file_format]
Specifies the file format for table data. If omitted, TEXTFILE is the default. Options for
file_format are:
SEQUENCEFILE
TEXTFILE
RCFILE
ORC
PARQUET
AVRO
INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
[LOCATION 'S3_loc']
Specifies the location of the underlying data in Amazon S3 from which the table is created, for
example, `s3://mystorage/`. For more information about considerations such as data format
and permissions, see Requirements for Tables in Athena and Data in Amazon S3 (p. 46).
Use a trailing slash for your folder or bucket. Do not use file names or glob characters.
Use: s3://mybucket/myfolder/
Don't use: s3://path_to_bucket, s3://path_to_bucket/*, or s3://path_to_bucket/mydatafile.dat
[TBLPROPERTIES ( ['has_encrypted_data'='true | false',] ['classification'='aws_glue_classification',]
property_name=property_value [, ...] ) ]
Specifies custom metadata key-value pairs for the table definition in addition to predefined table
properties, such as "comment".
Athena has a built-in property, has_encrypted_data. Set this property to true to indicate
that the underlying dataset specified by LOCATION is encrypted. If omitted, false is assumed. If
omitted or set to false when underlying data is encrypted, the query results in an error. For more
information, see Configuring Encryption Options (p. 41).
To run ETL jobs, AWS Glue requires that you create a table with the classification property
to indicate the data type for AWS Glue as csv, parquet, orc, avro, or json. For example,
'classification'='csv'. ETL jobs will fail if you do not specify this property. You can
subsequently specify it using the AWS Glue console, API, or CLI. For more information, see Using
AWS Glue Jobs for ETL with Athena (p. 30) and Authoring Jobs in Glue in the AWS Glue Developer
Guide.
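For example, a hedged sketch (hypothetical table, columns, and bucket) of a table over encrypted CSV data that is also registered for AWS Glue ETL:
CREATE EXTERNAL TABLE IF NOT EXISTS encrypted_csv_sample (
  id STRING,
  value STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/encrypted-csv/'
TBLPROPERTIES ('has_encrypted_data'='true', 'classification'='csv');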
Examples
CREATE EXTERNAL TABLE IF NOT EXISTS mydatabase.cloudfront_logs (
Date DATE,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
os STRING,
Browser STRING,
BrowserVersion STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s
+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)
[\/](.*)$"
) LOCATION 's3://athena-examples/cloudfront/plaintext/';
DESCRIBE TABLE
Shows the list of columns, including partition columns, for the named table. This allows you to
examine the attributes of a complex column.
Synopsis
DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name [PARTITION partition_spec] [col_name
( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )]
Parameters
[EXTENDED | FORMATTED]
Determines the format of the output. If you specify EXTENDED, all metadata for the table is
output in Thrift serialized form. This is useful primarily for debugging and not for general use. Use
FORMATTED or omit the clause to show the metadata in tabular format.
[PARTITION partition_spec]
Lists the metadata for the partition with partition_spec if included.
[col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ]
Specifies the column and attributes to examine. You can specify .field_name for an element of a
struct, '$elem$' for array element, '$key$' for a map key, and '$value$' for map value. You can
specify this recursively to further explore the complex column.
Examples
DESCRIBE orders;
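The following hedged sketches (hypothetical column names) show the formatted output for a table and drill into a field of a struct column:
DESCRIBE FORMATTED salesdata.orders;
DESCRIBE orders orderinfo.itemid;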
DROP DATABASE
Removes the named database from the catalog. If the database contains tables, you must either drop the
tables before executing DROP DATABASE or use the CASCADE clause. The use of DATABASE and SCHEMA
is interchangeable. They mean the same thing.
Synopsis
DROP {DATABASE | SCHEMA} [IF EXISTS] database_name [RESTRICT | CASCADE]
Parameters
[IF EXISTS]
Causes the error to be suppressed if database_name doesn't exist.
[RESTRICT|CASCADE]
Determines how tables within database_name are regarded during the DROP operation. If you
specify RESTRICT, the database is not dropped if it contains tables. This is the default behavior.
Specifying CASCADE causes the database and all its tables to be dropped.
Examples
DROP DATABASE clickstreams;
DROP SCHEMA IF EXISTS clickstreams CASCADE;
DROP TABLE
Removes the metadata table definition for the table named table_name. When you drop an external
table, the underlying data remains intact because all tables in Athena are EXTERNAL.
Synopsis
DROP TABLE [IF EXISTS] table_name [PURGE]
Parameters
[ IF EXISTS ]
Causes the error to be suppressed if table_name doesn't exist.
[ PURGE ]
Applies to managed tables. Ignored for external tables. Specifies that data should be removed
permanently rather than being moved to the .Trash/Current directory.
Examples
DROP TABLE fulfilled_orders;
DROP TABLE IF EXISTS fulfilled_orders PURGE;
MSCK REPAIR TABLE
Recovers partitions and data associated with partitions. Use this statement when you add partitions
to the catalog. It is possible it will take some time to add all partitions. If this operation times out, it
will be in an incomplete state where only a few partitions are added to the catalog. You should run
the statement on the same table until all partitions are added. For more information, see Partitioning
Data (p. 51).
Synopsis
MSCK REPAIR TABLE table_name
Examples
MSCK REPAIR TABLE orders;
SHOW COLUMNS
Lists the columns in the table schema.
Synopsis
SHOW COLUMNS IN table_name
Examples
SHOW COLUMNS IN clicks;
SHOW CREATE TABLE
Analyzes an existing table named table_name to generate the query that created it.
Synopsis
SHOW CREATE TABLE [db_name.]table_name
Parameters
TABLE [db_name.]table_name
The db_name parameter is optional. If omitted, the context defaults to the current database.
Examples
SHOW CREATE TABLE orderclickstoday;
SHOW CREATE TABLE salesdata.orderclickstoday;
SHOW DATABASES
Lists all databases defined in the metastore. You can use DATABASES or SCHEMAS. They mean the same
thing.
Synopsis
SHOW {DATABASES | SCHEMAS} [LIKE 'regular_expression']
Parameters
[LIKE 'regular_expression']
Filters the list of databases to those that match the regular_expression you specify. Wildcards
can only be *, which indicates any character, or |, which indicates a choice between characters.
Examples
SHOW SCHEMAS;
SHOW DATABASES LIKE '*analytics';
SHOW PARTITIONS
Lists all the partitions in a table.
Synopsis
SHOW PARTITIONS table_name
Examples
SHOW PARTITIONS clicks;
SHOW TABLES
Lists all the base tables and views in a database.
Synopsis
SHOW TABLES [IN database_name] ['regular_expression']
Parameters
[IN database_name]
Specifies the database_name from which tables will be listed. If omitted, the database from the
current context is assumed.
['regular_expression']
Filters the list of tables to those that match the regular_expression you specify. Only the
wildcards *, which indicates any character, or |, which indicates a choice between characters, can be
used.
Examples
SHOW TABLES;
SHOW TABLES IN marketing_analytics 'orders*';
SHOW TBLPROPERTIES
Lists table properties for the named table.
Synopsis
SHOW TBLPROPERTIES table_name [('property_name')]
Parameters
[('property_name')]
If included, only the value of the property named property_name is listed.
Examples
SHOW TBLPROPERTIES orders;
SHOW TBLPROPERTIES orders('comment');
SQL Queries, Functions, and Operators
Use the following functions directly in Athena.
The Amazon Athena query engine is based on Presto 0.172. For more information about these functions, see
Presto 0.172 Functions and Operators.
Athena does not support all of Presto's features. For information, see Limitations (p. 149).
SELECT (p. 144)
Logical Operators
Comparison Functions and Operators
Conditional Expressions
Conversion Functions
Mathematical Functions and Operators
Bitwise Functions
Decimal Functions and Operators
String Functions and Operators
Binary Functions
Date and Time Functions and Operators
Regular Expression Functions
JSON Functions and Operators
URL Functions
Aggregate Functions
Window Functions
Color Functions
Array Functions and Operators
Map Functions and Operators
Lambda Expressions and Functions
Teradata Functions
SELECT
Retrieves rows from zero or more tables.
Synopsis
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] union_query ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ LIMIT [ count | ALL ] ]
Parameters
[ WITH with_query [, ....] ]
You can use WITH to flatten nested queries, or to simplify subqueries.
Using the WITH clause to create recursive queries is not supported.
The WITH clause precedes the SELECT list in a query and defines one or more subqueries for use
within the SELECT query.
Each subquery defines a temporary table, similar to a view definition, which you can reference in the
FROM clause. The tables are used only when the query runs.
with_query syntax is:
subquery_table_name [ ( column_name [, ...] ) ] AS (subquery)
Where:
subquery_table_name is a unique name for a temporary table that defines the results of the
WITH clause subquery. Each subquery must have a table name that can be referenced in the
FROM clause.
column_name [, ...] is an optional list of output column names. The number of column
names must be equal to or less than the number of columns defined by subquery.
subquery is any query statement.
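For example, the following sketch uses the flight_delays_csv table created earlier in this guide to define a subquery and reference it in the FROM clause:
WITH top_origins AS (
  SELECT origin, count(*) AS total_flights
  FROM flight_delays_csv
  GROUP BY origin
)
SELECT origin, total_flights
FROM top_origins
ORDER BY total_flights DESC
LIMIT 10;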
[ ALL | DISTINCT ] select_expr
select_expr determines the rows to be selected.
ALL is the default. Using ALL is treated the same as if it were omitted; all rows for all columns are
selected and duplicates are kept.
Use DISTINCT to return only distinct values when a column contains duplicate values.
FROM from_item [, ...]
Indicates the input to the query, where from_item can be a view, a join construct, or a subquery as
described below.
The from_item can be either:
table_name [ [ AS ] alias [ (column_alias [, ...]) ] ]
Where table_name is the name of the target table from which to select rows, alias is the name
to give the output of the SELECT statement, and column_alias defines the columns for the
alias specified.
-OR-
join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]
Where join_type is one of:
[ INNER ] JOIN
LEFT [ OUTER ] JOIN
RIGHT [ OUTER ] JOIN
FULL [ OUTER ] JOIN
CROSS JOIN
ON join_condition | USING (join_column [, ...]), where join_condition allows you to
specify column names for join keys in multiple tables, and join_column requires
join_column to exist in both tables.
[ WHERE condition ]
Filters results according to the condition you specify.
[ GROUP BY [ ALL | DISTINCT ] grouping_expressions [, ...] ]
Divides the output of the SELECT statement into rows with matching values.
ALL and DISTINCT determine whether duplicate grouping sets each produce distinct output rows. If
omitted, ALL is assumed.
grouping_expressions allow you to perform complex grouping operations.
The grouping_expressions element can be any function, such as SUM, AVG, or COUNT, performed
on input columns, or be an ordinal number that selects an output column by position, starting at
one.
GROUP BY expressions can group output by input column names that don't appear in the output of
the SELECT statement.
All output expressions must be either aggregate functions or columns present in the GROUP BY
clause.
You can use a single query to perform analysis that requires aggregating multiple column sets.
These complex grouping operations don't support expressions comprising input columns. Only
column names or ordinals are allowed.
You can often use UNION ALL to achieve the same results as these GROUP BY operations, but
queries that use GROUP BY have the advantage of reading the data one time, whereas UNION ALL
reads the underlying data three times and may produce inconsistent results when the data source is
subject to change.
GROUP BY CUBE generates all possible grouping sets for a given set of columns. GROUP BY
ROLLUP generates all possible subtotals for a given set of columns.
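For example, the following sketch against the flight_delays_csv table created earlier returns delay counts per origin and destination, subtotals per origin, and a grand total:
SELECT origin, dest, count(*) AS delays
FROM flight_delays_csv
WHERE depdelayminutes > 60
GROUP BY ROLLUP (origin, dest);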
[ HAVING condition ]
Used with aggregate functions and the GROUP BY clause. Controls which groups are selected,
eliminating groups that don't satisfy condition. This filtering occurs after groups and aggregates
are computed.
[ UNION [ ALL | DISTINCT ] union_query ]
Combines the results of more than one SELECT statement into a single query. ALL or DISTINCT
control which rows are included in the final result set.
ALL causes all rows to be included, even if the rows are identical.
DISTINCT causes only unique rows to be included in the combined result set. DISTINCT is the
default.
Multiple UNION clauses are processed left to right unless you use parentheses to explicitly define the
order of processing.
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
Sorts a result set by one or more output expressions.
When the clause contains multiple expressions, the result set is sorted according to the first
expression. Then the second expression is applied to rows that have matching values from the
first expression, and so on.
Each expression may specify output columns from SELECT or an ordinal number for an output
column by position, starting at one.
ORDER BY is evaluated as the last step after any GROUP BY or HAVING clause. ASC and DESC
determine whether results are sorted in ascending or descending order.
The default null ordering is NULLS LAST, regardless of ascending or descending sort order.
LIMIT [ count | ALL ]
Restricts the number of rows in the result set to count. LIMIT ALL is the same as omitting the
LIMIT clause. If the query has no ORDER BY clause, the results are arbitrary.
TABLESAMPLE BERNOULLI | SYSTEM (percentage)
Optional operator to select rows from a table based on a sampling method.
BERNOULLI selects each row to be in the table sample with a probability of percentage. All
physical blocks of the table are scanned, and certain rows are skipped based on a comparison
between the sample percentage and a random value calculated at runtime.
With SYSTEM, the table is divided into logical segments of data, and the table is sampled at this
granularity.
Either all rows from a particular segment are selected, or the segment is skipped based on a
comparison between the sample percentage and a random value calculated at runtime. SYSTEM
sampling is dependent on the connector. This method does not guarantee independent sampling
probabilities.
[ UNNEST (array_or_map) [WITH ORDINALITY] ]
Expands an array or map into a relation. Arrays are expanded into a single column. Maps are
expanded into two columns (key, value).
You can use UNNEST with multiple arguments, which are expanded into multiple columns with as
many rows as the highest cardinality argument.
Other columns are padded with nulls.
The WITH ORDINALITY clause adds an ordinality column to the end.
UNNEST is usually used with a JOIN and can reference columns from relations on the left side of the
JOIN.
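For example, a self-contained sketch that uses an inline VALUES row, so it does not depend on any table, and adds an ordinality column:
SELECT student, score, attempt
FROM (VALUES ('alice', ARRAY[70, 85, 92])) AS t (student, scores)
CROSS JOIN UNNEST(scores) WITH ORDINALITY AS u (score, attempt);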
Examples
SELECT * FROM table;
SELECT os, COUNT(*) count FROM cloudfront_logs WHERE date BETWEEN date '2014-07-05' AND
date '2014-08-05' GROUP BY os;
For more examples, see Querying Data in Amazon Athena Tables (p. 59).
Unsupported DDL
The following native Hive DDLs are not supported by Athena:
ALTER INDEX
ALTER TABLE table_name ARCHIVE PARTITION
ALTER TABLE table_name CLUSTERED BY
ALTER TABLE table_name EXCHANGE PARTITION
ALTER TABLE table_name NOT CLUSTERED
ALTER TABLE table_name NOT SKEWED
ALTER TABLE table_name NOT SORTED
ALTER TABLE table_name NOT STORED AS DIRECTORIES
ALTER TABLE table_name partitionSpec ADD COLUMNS
ALTER TABLE table_name partitionSpec CHANGE COLUMNS
ALTER TABLE table_name partitionSpec COMPACT
ALTER TABLE table_name partitionSpec CONCATENATE
ALTER TABLE table_name partitionSpec REPLACE COLUMNS
ALTER TABLE table_name partitionSpec SET FILEFORMAT
ALTER TABLE table_name RENAME TO
ALTER TABLE table_name SET SKEWED LOCATION
ALTER TABLE table_name SKEWED BY
ALTER TABLE table_name TOUCH
ALTER TABLE table_name UNARCHIVE PARTITION
COMMIT
CREATE INDEX
CREATE ROLE
CREATE TABLE table_name LIKE existing_table_name
CREATE TEMPORARY MACRO
CREATE VIEW
DELETE FROM
DESCRIBE DATABASE
DFS
DROP INDEX
DROP ROLE
DROP TEMPORARY MACRO
EXPORT TABLE
GRANT ROLE
IMPORT TABLE
INSERT INTO
LOCK DATABASE
LOCK TABLE
REVOKE ROLE
ROLLBACK
SHOW COMPACTIONS
SHOW CURRENT ROLES
SHOW GRANT
SHOW INDEXES
SHOW LOCKS
SHOW PRINCIPALS
SHOW ROLE GRANT
SHOW ROLES
SHOW TRANSACTIONS
START TRANSACTION
UNLOCK DATABASE
UNLOCK TABLE
Limitations
Athena does not support the following features, which are supported by the open source Presto
version 0.172.
User-defined functions (UDFs or UDAFs).
Stored procedures.
A particular subset of data types is supported. For more information, see Data Types (p. 131).
CREATE TABLE AS SELECT statements.
INSERT INTO statements.
Prepared statements. You cannot run EXECUTE with USING.
CREATE TABLE LIKE.
DESCRIBE INPUT and DESCRIBE OUTPUT.
EXPLAIN statements.
Federated connectors. For more information, see Connectors.
Code Samples and Service Limits
Topics
Code Samples (p. 150)
Service Limits (p. 158)
Code Samples
Use the examples in this topic as a starting point for writing Athena applications that use the AWS SDK for Java.
Java Code Samples
Create a Client to Access Athena (p. 150)
Working with Query Executions
Start Query Execution (p. 151)
Stop Query Execution (p. 154)
List Query Executions (p. 155)
Working with Named Queries
Create a Named Query (p. 156)
Delete a Named Query (p. 156)
List Named Queries (p. 157)
Note
These samples use constants (for example, ATHENA_SAMPLE_QUERY) for strings, which are
defined in an ExampleConstants class declaration not shown in this topic. Replace these
constants with your own strings or defined constants.
Create a Client to Access Athena
package com.amazonaws.services.athena.sdksamples;
import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.InstanceProfileCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
/**
* AthenaClientFactory
* -------------------------------------
* This code shows how to create and configure an Amazon Athena client.
*/
public class AthenaClientFactory
{
/**
* AmazonAthenaClientBuilder to build Athena with the following properties:
* - Set the region of the client
* - Use the instance profile from the EC2 instance as the credentials provider
* - Configure the client to increase the execution timeout.
*/
private final AmazonAthenaClientBuilder builder = AmazonAthenaClientBuilder.standard()
.withRegion(Regions.US_EAST_1)
.withCredentials(InstanceProfileCredentialsProvider.getInstance())
.withClientConfiguration(new
ClientConfiguration().withClientExecutionTimeout(ExampleConstants.CLIENT_EXECUTION_TIMEOUT));
public AmazonAthena createClient()
{
return builder.build();
}
}
Start Query Execution
package com.amazonaws.services.athena.sdksamples;
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.model.ColumnInfo;
import com.amazonaws.services.athena.model.GetQueryExecutionRequest;
import com.amazonaws.services.athena.model.GetQueryExecutionResult;
import com.amazonaws.services.athena.model.GetQueryResultsRequest;
import com.amazonaws.services.athena.model.GetQueryResultsResult;
import com.amazonaws.services.athena.model.QueryExecutionContext;
import com.amazonaws.services.athena.model.QueryExecutionState;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.Row;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;
import com.amazonaws.services.athena.model.StartQueryExecutionResult;
import java.util.List;
/**
* StartQueryExample
* -------------------------------------
* This code shows how to submit a query to Athena for execution, wait till results
* are available, and then process the results.
*/
public class StartQueryExample
{
public static void main(String[] args) throws InterruptedException
{
// Build an AmazonAthena client
AthenaClientFactory factory = new AthenaClientFactory();
AmazonAthena client = factory.createClient();
String queryExecutionId = submitAthenaQuery(client);
waitForQueryToComplete(client, queryExecutionId);
processResultRows(client, queryExecutionId);
}
/**
* Submits a sample query to Athena and returns the execution ID of the query.
*/
private static String submitAthenaQuery(AmazonAthena client)
{
// The QueryExecutionContext allows us to set the Database.
QueryExecutionContext queryExecutionContext = new
QueryExecutionContext().withDatabase(ExampleConstants.ATHENA_DEFAULT_DATABASE);
// The result configuration specifies where the results of the query should go in S3 and encryption options
ResultConfiguration resultConfiguration = new ResultConfiguration()
// You can provide encryption options for the output that is written.
// .withEncryptionConfiguration(encryptionConfiguration)
.withOutputLocation(ExampleConstants.ATHENA_OUTPUT_BUCKET);
// Create the StartQueryExecutionRequest to send to Athena, which will start the query.
StartQueryExecutionRequest startQueryExecutionRequest = new
StartQueryExecutionRequest()
.withQueryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.withQueryExecutionContext(queryExecutionContext)
.withResultConfiguration(resultConfiguration);
StartQueryExecutionResult startQueryExecutionResult =
client.startQueryExecution(startQueryExecutionRequest);
return startQueryExecutionResult.getQueryExecutionId();
}
/**
* Wait for an Athena query to complete, fail or to be cancelled. This is done by polling
Athena over an
* interval of time. If a query fails or is cancelled, then it will throw an exception.
*/
private static void waitForQueryToComplete(AmazonAthena client, String
queryExecutionId) throws InterruptedException
{
GetQueryExecutionRequest getQueryExecutionRequest = new GetQueryExecutionRequest()
.withQueryExecutionId(queryExecutionId);
GetQueryExecutionResult getQueryExecutionResult = null;
boolean isQueryStillRunning = true;
while (isQueryStillRunning) {
getQueryExecutionResult = client.getQueryExecution(getQueryExecutionRequest);
String queryState =
getQueryExecutionResult.getQueryExecution().getStatus().getState();
if (queryState.equals(QueryExecutionState.FAILED.toString())) {
throw new RuntimeException("Query Failed to run with Error Message: " +
getQueryExecutionResult.getQueryExecution().getStatus().getStateChangeReason());
}
else if (queryState.equals(QueryExecutionState.CANCELLED.toString())) {
throw new RuntimeException("Query was cancelled.");
}
else if (queryState.equals(QueryExecutionState.SUCCEEDED.toString())) {
isQueryStillRunning = false;
}
else {
// Sleep an amount of time before retrying again.
Thread.sleep(ExampleConstants.SLEEP_AMOUNT_IN_MS);
}
System.out.println("Current Status is: " + queryState);
}
}
/**
* This code calls Athena and retrieves the results of a query.
* The query must be in a completed state before the results can be retrieved and
* paginated. The first row of results are the column headers.
*/
private static void processResultRows(AmazonAthena client, String queryExecutionId)
{
GetQueryResultsRequest getQueryResultsRequest = new GetQueryResultsRequest()
// Max Results can be set, but if it's not set,
// it will choose the maximum page size
// As of the writing of this code, the maximum value is 1000
// .withMaxResults(1000)
.withQueryExecutionId(queryExecutionId);
GetQueryResultsResult getQueryResultsResult =
client.getQueryResults(getQueryResultsRequest);
List<ColumnInfo> columnInfoList =
getQueryResultsResult.getResultSet().getResultSetMetadata().getColumnInfo();
while (true) {
List<Row> results = getQueryResultsResult.getResultSet().getRows();
for (Row row : results) {
// Process the row. The first row of the first page holds the column names.
processRow(row, columnInfoList);
}
// If nextToken is null, there are no more pages to read. Break out of the loop.
if (getQueryResultsResult.getNextToken() == null) {
break;
}
getQueryResultsResult = client.getQueryResults(
getQueryResultsRequest.withNextToken(getQueryResultsResult.getNextToken()));
}
}
private static void processRow(Row row, List<ColumnInfo> columnInfoList)
{
for (int i = 0; i < columnInfoList.size(); ++i) {
switch (columnInfoList.get(i).getType()) {
case "varchar":
// Convert and Process as String
break;
case "tinyint":
// Convert and Process as tinyint
break;
case "smallint":
// Convert and Process as smallint
break;
case "integer":
// Convert and Process as integer
break;
case "bigint":
// Convert and Process as bigint
break;
case "double":
// Convert and Process as double
break;
case "boolean":
// Convert and Process as boolean
break;
case "date":
// Convert and Process as date
break;
case "timestamp":
// Convert and Process as timestamp
break;
default:
throw new RuntimeException("Unexpected Type is not expected" +
columnInfoList.get(i).getType());
}
}
}
}
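The cases in processRow are intentionally left empty. As a minimal sketch (an addition for illustration, not part of the original sample), assuming the varchar case, you can read a cell's value through the Datum objects returned by Row.getData(). The helper name below is hypothetical.

// Additional imports needed at the top of the file:
// import com.amazonaws.services.athena.model.Datum;
// import com.amazonaws.services.athena.model.Row;
// Hypothetical helper showing one way to fill in the "varchar" case of processRow.
// All values arrive as strings in Datum.getVarCharValue(); numeric types must be parsed.
private static void printVarcharColumns(Row row, List<ColumnInfo> columnInfoList)
{
List<Datum> data = row.getData();
for (int i = 0; i < columnInfoList.size(); ++i) {
if ("varchar".equals(columnInfoList.get(i).getType())) {
// getVarCharValue() may be null for NULL cells.
String value = data.get(i).getVarCharValue();
System.out.println(columnInfoList.get(i).getName() + ": " + value);
}
}
}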
Stop Query Execution
package com.amazonaws.services.athena.sdksamples;
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.model.GetQueryExecutionRequest;
import com.amazonaws.services.athena.model.GetQueryExecutionResult;
import com.amazonaws.services.athena.model.QueryExecutionContext;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;
import com.amazonaws.services.athena.model.StartQueryExecutionResult;
import com.amazonaws.services.athena.model.StopQueryExecutionRequest;
import com.amazonaws.services.athena.model.StopQueryExecutionResult;
/**
* StopQueryExecutionExample
* -------------------------------------
* This code runs an example query, immediately stops the query, and checks the status of
* the query to ensure that it was cancelled.
*/
public class StopQueryExecutionExample
{
public static void main(String[] args) throws Exception
{
// Build an Athena client
AthenaClientFactory factory = new AthenaClientFactory();
AmazonAthena client = factory.createClient();
String sampleQueryExecutionId = getExecutionId(client);
// Submit the stop query request.
StopQueryExecutionRequest stopQueryExecutionRequest = new StopQueryExecutionRequest()
.withQueryExecutionId(sampleQueryExecutionId);
StopQueryExecutionResult stopQueryExecutionResult =
client.stopQueryExecution(stopQueryExecutionRequest);
// Ensure that the query was stopped
GetQueryExecutionRequest getQueryExecutionRequest = new GetQueryExecutionRequest()
.withQueryExecutionId(sampleQueryExecutionId);
GetQueryExecutionResult getQueryExecutionResult =
client.getQueryExecution(getQueryExecutionRequest);
if (getQueryExecutionResult.getQueryExecution().getStatus().getState()
.equals(ExampleConstants.QUERY_STATE_CANCELLED))
{
// Query was cancelled.
System.out.println("Query has been cancelled");
}
}
/**
* Submits an example query and returns a query execution ID of a running query to stop.
*/
public static String getExecutionId(AmazonAthena client)
{
QueryExecutionContext queryExecutionContext = new
QueryExecutionContext().withDatabase(ExampleConstants.ATHENA_DEFAULT_DATABASE);
ResultConfiguration resultConfiguration = new ResultConfiguration()
.withOutputLocation(ExampleConstants.ATHENA_OUTPUT_BUCKET);
StartQueryExecutionRequest startQueryExecutionRequest = new
StartQueryExecutionRequest()
.withQueryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.withQueryExecutionContext(queryExecutionContext)
.withResultConfiguration(resultConfiguration);
StartQueryExecutionResult startQueryExecutionResult =
client.startQueryExecution(startQueryExecutionRequest);
return startQueryExecutionResult.getQueryExecutionId();
}
}
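Depending on timing, the single status check above may run before the cancellation has fully taken effect. As a minimal sketch (an addition to the sample, not part of the guide), you could poll until the query leaves the RUNNING state before reporting its final state. The helper name and polling interval below are arbitrary choices for this sketch.

// Hypothetical helper: poll until the query is no longer RUNNING, then report its final
// state. Requires an additional import of
// com.amazonaws.services.athena.model.QueryExecutionState.
private static void waitForCancellation(AmazonAthena client, String queryExecutionId)
throws InterruptedException
{
GetQueryExecutionRequest request = new GetQueryExecutionRequest()
.withQueryExecutionId(queryExecutionId);
String state = client.getQueryExecution(request).getQueryExecution().getStatus().getState();
while (state.equals(QueryExecutionState.RUNNING.toString())) {
Thread.sleep(1000); // polling interval chosen arbitrarily for this sketch
state = client.getQueryExecution(request).getQueryExecution().getStatus().getState();
}
System.out.println("Final state: " + state);
}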
List Query Executions
package com.amazonaws.services.athena.sdksamples;
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.model.ListQueryExecutionsRequest;
import com.amazonaws.services.athena.model.ListQueryExecutionsResult;
import java.util.List;
/**
* ListQueryExecutionsExample
* -------------------------------------
* This code shows how to obtain a list of query execution IDs.
*/
public class ListQueryExecutionsExample
{
public static void main(String[] args) throws Exception
{
// Build an Athena client
AthenaClientFactory factory = new AthenaClientFactory();
AmazonAthena client = factory.createClient();
// Build the request
ListQueryExecutionsRequest listQueryExecutionsRequest = new
ListQueryExecutionsRequest();
// Get the list results.
ListQueryExecutionsResult listQueryExecutionsResult =
client.listQueryExecutions(listQueryExecutionsRequest);
// Process the results.
boolean hasMoreResults = true;
while (hasMoreResults) {
List<String> queryExecutionIds =
listQueryExecutionsResult.getQueryExecutionIds();
// process queryExecutionIds.
// If nextToken is not null, then there are more results. Get the next page of results.
if (listQueryExecutionsResult.getNextToken() != null) {
listQueryExecutionsResult = client.listQueryExecutions(
listQueryExecutionsRequest.withNextToken(listQueryExecutionsResult.getNextToken()));
}
else {
hasMoreResults = false;
}
}
}
}
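The query execution IDs returned by ListQueryExecutions can be resolved into full execution details. As a minimal sketch (an addition, not part of the original sample), you might pass a page of IDs to BatchGetQueryExecution and print each query's state. The helper name below is hypothetical.

// Additional imports needed at the top of the file:
// import com.amazonaws.services.athena.model.BatchGetQueryExecutionRequest;
// import com.amazonaws.services.athena.model.BatchGetQueryExecutionResult;
// import com.amazonaws.services.athena.model.QueryExecution;
// Hypothetical helper: resolve a page of query execution IDs into their states.
private static void printExecutionStates(AmazonAthena client, List<String> queryExecutionIds)
{
BatchGetQueryExecutionRequest request = new BatchGetQueryExecutionRequest()
.withQueryExecutionIds(queryExecutionIds);
BatchGetQueryExecutionResult result = client.batchGetQueryExecution(request);
for (QueryExecution execution : result.getQueryExecutions()) {
System.out.println(execution.getQueryExecutionId() + ": "
+ execution.getStatus().getState());
}
}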
Create a Named Query
package com.amazonaws.services.athena.sdksamples;
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.model.CreateNamedQueryRequest;
import com.amazonaws.services.athena.model.CreateNamedQueryResult;
/**
* CreateNamedQueryExample
* -------------------------------------
* This code shows how to create a named query.
*/
public class CreateNamedQueryExample
{
public static void main(String[] args) throws Exception
{
// Build an Athena client
AthenaClientFactory factory = new AthenaClientFactory();
AmazonAthena client = factory.createClient();
// Create the named query request.
CreateNamedQueryRequest createNamedQueryRequest = new CreateNamedQueryRequest()
.withDatabase(ExampleConstants.ATHENA_DEFAULT_DATABASE)
.withQueryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.withDescription("Sample Description")
.withName("SampleQuery");
// Call Athena to create the named query. If it fails, an exception is thrown.
CreateNamedQueryResult createNamedQueryResult =
client.createNamedQuery(createNamedQueryRequest);
}
}
Delete a Named Query
package com.amazonaws.services.athena.sdksamples;
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.model.CreateNamedQueryRequest;
import com.amazonaws.services.athena.model.CreateNamedQueryResult;
import com.amazonaws.services.athena.model.DeleteNamedQueryRequest;
import com.amazonaws.services.athena.model.DeleteNamedQueryResult;
/**
* DeleteNamedQueryExample
* -------------------------------------
* This code shows how to delete a named query by using the named query ID.
*/
public class DeleteNamedQueryExample
{
private static String getNamedQueryId(AmazonAthena athenaClient)
{
// Create the named query request.
CreateNamedQueryRequest createNamedQueryRequest = new CreateNamedQueryRequest()
.withDatabase(ExampleConstants.ATHENA_DEFAULT_DATABASE)
.withQueryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.withName("SampleQueryName")
.withDescription("Sample Description");
// Create the named query. If it fails, an exception is thrown.
CreateNamedQueryResult createNamedQueryResult =
athenaClient.createNamedQuery(createNamedQueryRequest);
return createNamedQueryResult.getNamedQueryId();
}
public static void main(String[] args) throws Exception
{
// Build an Athena client
AthenaClientFactory factory = new AthenaClientFactory();
AmazonAthena client = factory.createClient();
String sampleNamedQueryId = getNamedQueryId(client);
// Create the delete named query request
DeleteNamedQueryRequest deleteNamedQueryRequest = new DeleteNamedQueryRequest()
.withNamedQueryId(sampleNamedQueryId);
// Delete the named query
DeleteNamedQueryResult deleteNamedQueryResult =
client.deleteNamedQuery(deleteNamedQueryRequest);
}
}
List Named Queries
package com.amazonaws.services.athena.sdksamples;
import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.model.ListNamedQueriesRequest;
import com.amazonaws.services.athena.model.ListNamedQueriesResult;
import java.util.List;
/**
* ListNamedQueryExample
* -------------------------------------
* This code shows how to obtain a list of named query IDs.
*/
public class ListNamedQueryExample
{
public static void main(String[] args) throws Exception
{
// Build an Athena client
AthenaClientFactory factory = new AthenaClientFactory();
AmazonAthena client = factory.createClient();
// Build the request
ListNamedQueriesRequest listNamedQueriesRequest = new ListNamedQueriesRequest();
// Get the list results.
ListNamedQueriesResult listNamedQueriesResult =
client.listNamedQueries(listNamedQueriesRequest);
// Process the results.
boolean hasMoreResults = true;
while (hasMoreResults) {
List<String> namedQueryIds = listNamedQueriesResult.getNamedQueryIds();
// process named query IDs
// If nextToken is not null, there are more results. Get the next page of results.
if (listNamedQueriesResult.getNextToken() != null) {
listNamedQueriesResult = client.listNamedQueries(
listNamedQueriesRequest.withNextToken(listNamedQueriesResult.getNextToken()));
}
else {
hasMoreResults = false;
}
}
}
}
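The list above contains only named query IDs. As a minimal sketch (an addition, not part of the original sample), each ID can be resolved into its name and query string with GetNamedQuery. The helper name below is hypothetical.

// Additional imports needed at the top of the file:
// import com.amazonaws.services.athena.model.GetNamedQueryRequest;
// import com.amazonaws.services.athena.model.GetNamedQueryResult;
// import com.amazonaws.services.athena.model.NamedQuery;
// Hypothetical helper: fetch the definition of a single named query by ID.
private static void printNamedQuery(AmazonAthena client, String namedQueryId)
{
GetNamedQueryRequest request = new GetNamedQueryRequest()
.withNamedQueryId(namedQueryId);
GetNamedQueryResult result = client.getNamedQuery(request);
NamedQuery namedQuery = result.getNamedQuery();
System.out.println(namedQuery.getName() + ": " + namedQuery.getQueryString());
}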
Service Limits
Note
You can contact AWS Support to request a limit increase for the limits listed here.
By default, concurrency limits on your account allow you to run five DDL queries (used for creating
tables and adding partitions) and five SELECT queries at a time. This is a soft limit, and you can
request an increase for concurrent queries.
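When the concurrent query limit is exceeded, the StartQueryExecution call is rejected. A minimal sketch (an addition, not part of the guide's samples) of retrying with a simple backoff, assuming the SDK's TooManyRequestsException is what the call receives in that case; the helper name, attempt count, and delay are arbitrary choices for this sketch.

// Additional import needed at the top of the file:
// import com.amazonaws.services.athena.model.TooManyRequestsException;
// Hypothetical helper: retry StartQueryExecution a few times with a fixed backoff
// when the account's concurrent query limit is exceeded.
private static String startWithRetry(AmazonAthena client, StartQueryExecutionRequest request)
throws InterruptedException
{
final int maxAttempts = 5;
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return client.startQueryExecution(request).getQueryExecutionId();
} catch (TooManyRequestsException e) {
if (attempt == maxAttempts) {
throw e;
}
// Back off before retrying; tune the delay for your workload.
Thread.sleep(1000L * attempt);
}
}
throw new IllegalStateException("unreachable");
}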
If you use Athena in regions where AWS Glue is available, migrate to the AWS Glue Data Catalog. See
Upgrading to the AWS Glue Data Catalog Step-by-Step (p. 19). If you have migrated to the AWS Glue
Data Catalog, see AWS Glue Limits for service limits on tables, databases, and partitions in Athena.
If you have not migrated to the AWS Glue Data Catalog, you can request a limit increase.
You might encounter the Amazon S3 limit of 100 buckets per account. Athena also needs a separate
bucket for query results.
Query timeout: 30 minutes