AWS Big Data Study Notes – AWS QuickSight, Athena, Glue, and ES
AWS QuickSight
This is the cheat sheet on AWS QuickSight, AWS Athena, AWS Glue and AWS Elasticsearch.
- Business analytics service for visualizations and perform ad hoc analysis
- Visuals: a graphical representation of data visualization
- Sheets: a set of visuals that are all based on the same data source and are all viewed together
- Stories: a set of one or more scenes (captured visuals) that you can play like a slideshow
- Discover AWS data sources and also works with your data sources
- Data sources: local/s3 file data, Salesforce data, Athena/RDS/Redshift/external database
- Preparing data: joining tables, using SQL query, selecting fields, using operators/functions/calculating on fields, adding filters, creating geographic hierarchies
- Robust in-memory engine – SPICE (Super-fast, Parallel, In-memory Calculation Engine) capacity:
Total logical row size in bytes = (Number of Numeric Fields * 8 bytes per field) + (Number of Date Fields * 8 bytes per field) + (Number of Text Fields * (8 bytes + UTF-8 encoded character length per field) )
Total bytes of data = Number of rows * Total logical row size in bytes
GB of SPICE Capacity Needed = Total bytes of data / 1,073,741,824
- QuickSight can perform typical arithmetic and comparison functions; conditional functions such as if, then; and date, numeric, and string calculations.
- QuickSight is integrated with AWS CloudTrail. This service provides a record of actions taken by a user, role, or an AWS service in QuickSight
Visual Types
- Comparison and distribution
- Bar charts (horizontal or vertical) create single-measure(e.g. average delay time by flight number), multi-measure(e.g. sales total and profit total by automobile mode), or clustered bar charts (e.g. sales total by state, grouped by region).
- Stacked bar charts similar to a clustered bar chart with color blocks
- Combo chart is known as line and column charts
- KPI visualizes a comparison between a key-value and its target value. A KPI displays a value comparison, the two values being compared, and a progress bar
- Changes over time
- Line graphs: compare changes in measured values over a period of time
- Area line charts: differ from regular line charts in that each value is represented by a colored area of the chart instead of just a line
- Correlation
- Scatter plots: visualize two or three measures for a dimension
- Heat maps: the intersection of two dimensions with color-coding
- Aggregation
- Pie graphs: compare values for items in a dimension
- Tree Maps: visualizes one or two measures for a dimension
- Tabular
- Pivot tables: show measure values for the intersection of two dimensions
Security
- QuickSight supports multi-factor authentication (MFA)
- Private VPC (Virtual Private Cloud) Access for QuickSight uses an Elastic Network Interface (ENI) for secure
- If your VPC has been set up with public connectivity, you can add QuickSight’s IP address range to database instances’ security group rules to enable traffic flow into the VPC and database instance
- Restrict Access to a Data Set by Using Row-Level Security
- You can do this before or after you have shared the data set.
- Only the people you shared with can see any of the data. By adding row-level security, you can further control their access.
- To do this, you create a query or file that has one column named UserName, GroupName, or both. You can also think of this as adding a rule for that user or group. Then you can add one column to the query or file for each field that you want to grant or restrict access to. For each user or group name that you add, you add the values for each field. You can use NULL (no value) to mean all values.
AWS Athena
- An interactive query serverless service to analyze data directly in S3 using standard SQL.
- Access through AWS Management Console, through a JDBC or ODBC connection, using the Athena API, or using the Athena CLI.
- The tables creation process registers the dataset with Athena – either in the AWS Glue Data Catalog or in the internal Athena data catalog (if Glue is not available in the region).
- Analyze unstructured, semi-structured, and structured data stored in S3. Support a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet. Athena also supports compressed data in Snappy, Zlib, LZO, and GZIP formats.
- Athena uses Presto to execute DML statements and Hive to execute the DDL statements that create and modify schema. Athena can handle complex analysis, including large joins, window functions, and arrays.
- Athena uses SerDes(Serializer/Deserializer) to interpret the data to read from Amazon S3.
- Partition data with PARTITIONED BY clause in the CREATE TABLE statement. To add new data from S3 into an existing table after data partitioned, you need to run a metadata query (ALTER TABLE ADD PARTITION)
- Control access: Identity and Access Management (IAM) policies, Access Control Lists (ACLs), and Amazon S3 bucket policies
- Encrypted data in S3: Server-Side Encryption with S3-Managed Encryption Keys, Server-Side Encryption with AWS Key Management Service (KMS) – Managed Keys, and Client-Side Encryption with keys managed by KMS
AWS Glue
- AWS Glue is a fully managed ETL (extract, transform, and load) service
- AWS Glue Data Catalog: central metadata repository to store structural and operational metadata. It can be used by Athena, Redshift Spectrum, EMR, and Apache Hive Metastore
- AWS Glue Crawlers and Classifiers: scan data in all kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog
- AWS Glue ETL Operation: autogenerate Scala or PySpark (the Python API for Apache Spark) scripts with AWS Glue extensions that you can use and modify to perform various ETL operations
- AWS Glue Jobs system: a flexible scheduler that handles dependency resolution, job monitoring, and retries
- ETL jobs can either be triggered on a schedule or on a job completion event.
- Multiple jobs can be triggered in parallel or sequentially by triggering them on a job completion event.
- You can also trigger one or more Glue jobs from an external source such as an AWS Lambda function.
- ETL is batch-oriented with at a minimum of 5 min intervals. While it can process micro-batches, it does not handle streaming data.
- AWS Glue allocates 10 DPUs(Data Processing Units) to each ETL job. A development endpoint is provisioned with 5 DPUs by default.
- AWS Glue natively supports data stored in RDS (Aurora/MySQL/Oracle/PostgreSQL/SQL Server), Redshift, and S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2.
- Security: server-side encryption for data at rest and SSL for data in motion
Elasticsearch
- Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis.
- AWS Elasticsearch Service domains are Elasticsearch clusters created using the AWS Elasticsearch Service console, CLI, or API.
- A collection of connected nodes is called a cluster
- Dedicated master nodes: recommend three dedicated master nodes to performs cluster management tasks.
- Data nodes: hold data and perform data-related operations such as CRUD, search, and aggregations.
- ElasticSearch creates an index for each field
- Data is stored in indexes and distributed across shards
- Shards are primary or replica. The primary shard count can’t be changed. ElasticSearch distributes the shards to instances elastically. Primary and replica are distributed to different instances
- Instance types: T (3.5TB, dev/QA), M (150TB, data/queries are average), I (150TB, higher request volumes/larger documents/heavy aggregation), C (150TB, high concurrency), I (1.5PB, XL storage needs)
- (Source Data + Room to Grow) * (1 + Indexing Overhead) / Desired Shard Size = Approximate Number of Primary Shards
- Source Data * (1 + Number of Replicas) * 1.45 = Minimum Storage Requirement
- ES domains offer encryption of data at rest, a security feature that helps prevent unauthorized access to your data. The feature uses the AWS Key Management Service (AWS KMS) to store and manage your encryption keys. If enabled, it encrypts the following aspects of a domain:
- Indices
- Automated snapshots
- Elasticsearch logs
- Swap files
- All other data in the application directory
- The following are not encrypted when you enable encryption of data at rest, but you can take additional steps to protect them:
- Manual snapshots
- Slow logs and error logs
- AWS Elasticseach vs. AWS CloudSeach:
- ES: Analytics
- CS: search on web pages, document files, forum posts, product information. Features: Full-text search, Boolean search, Prefix searches, Range searches, Term boosting, Faceting, Highlighting, Autocomplete Suggestions
Notes List
- How to Pass AWS Certified Big Data SpecialtyAWS Big Data Study Notes – AWS Kinesis
- AWS Big Data Study Notes – EMR and Redshift
- AWS Big Data Study Notes – AWS Machine Learning and IoT
- AWS Big Data Study Notes – AWS QuickSight, Athena, Glue, and ES
- AWS Big Data Study Notes – AWS DynamoDB, S3 and SQS
- AWS Kinesis Data Streams vs. Kinesis Data Firehose
- Streaming Platforms: Apache Kafka vs. AWS Kinesis
- When Should Use Amazon DynamoDB Accelerator (AWS DAX)?
- Data Storage for Big Data: Aurora, Redshift or Hadoop?
- Apache Hadoop Ecosystem Cheat Sheet