AWS Data Engineering Cheatsheet
Hello, dear readers! Here you can find cheat sheets for the AWS services most commonly used in data engineering:
- Amazon Redshift Cheat Sheet
- Amazon S3 Cheat Sheet
- Amazon Athena Cheat Sheet
- Amazon Kinesis Cheat Sheet
Amazon Redshift Cheat Sheet
Overview
Amazon Redshift is a fully managed, petabyte-scale data warehouse service that extends data warehouse queries to your data lake. It allows you to run analytic queries against petabytes of data stored locally in Redshift and directly against exabytes of data stored in S3. Redshift is designed for OLAP (Online Analytical Processing).
Redshift has traditionally supported only Single-AZ deployments; Multi-AZ deployments are now available for RA3 node types.
Features
- Columnar Storage: Redshift uses columnar storage, data compression, and zone maps to minimize the amount of I/O needed for queries.
- Parallel Processing: It utilizes a massively parallel processing (MPP) data warehouse architecture to distribute SQL operations across multiple nodes.
- Machine Learning: Redshift leverages machine learning to optimize throughput based on workloads.
- Result Caching: Provides sub-second response times for repeat queries.
- Automated Backups: Redshift continuously backs up your data to S3 and can replicate snapshots to another region for disaster recovery.
Components
- Cluster: Comprises a leader node and one or more compute nodes. When you provision a cluster, a database is created that you can load data into and run queries against.
- Scaling: Clusters can be scaled in/out by adding/removing nodes and scaled up/down by changing node types.
- Maintenance Window: Redshift assigns a 30-minute maintenance window at random from an 8-hour block of time per region, once per week. During this window the cluster is unavailable for normal operations (see the CLI sketch after this list).
- Deployment Platforms: Clusters are launched in a VPC (the EC2-VPC platform); the legacy EC2-Classic platform has been retired.
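The maintenance window can be adjusted per cluster with the CLI. A minimal sketch, assuming the my-redshift-cluster cluster used in the examples later in this sheet (the window value is illustrative):
# Move the weekly 30-minute maintenance window (format ddd:hh24:mi-ddd:hh24:mi, UTC)
aws redshift modify-cluster \
  --cluster-identifier my-redshift-cluster \
  --preferred-maintenance-window sun:03:00-sun:03:30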
Redshift Nodes
- Leader Node: Manages client connections, parses queries, and coordinates execution plans with compute nodes.
- Compute Nodes: Execute query plans, exchange data, and send intermediate results to the leader node for aggregation.
Node Types
- RA3: Nodes with Redshift Managed Storage that let you scale compute and storage independently; required for features such as data sharing and cross-database queries.
- Dense Compute (DC): Optimized for performance-intensive workloads using SSD storage.
- Dense Storage (DS): Legacy nodes for large data workloads using HDD storage.
Parameter Groups
Parameter groups apply to all databases within a cluster. The default parameter group has preset values and cannot be modified.
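As a sketch, you can create a custom parameter group and change a parameter in it with the CLI (the group name and parameter are illustrative):
# Create a custom parameter group and enable user activity logging in it
aws redshift create-cluster-parameter-group \
  --parameter-group-name my-param-group \
  --parameter-group-family redshift-1.0 \
  --description "Custom settings for my cluster"
aws redshift modify-cluster-parameter-group \
  --parameter-group-name my-param-group \
  --parameters ParameterName=enable_user_activity_logging,ParameterValue=true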
Database Querying Options
- Query Editor: Use the AWS Management Console to connect to your cluster and run queries.
- SQL Client Tools: Connect via standard ODBC and JDBC connections.
- Enhanced VPC Routing: Forces traffic between your cluster and your data repositories (for example, COPY and UNLOAD to S3) through your VPC, so you can use VPC features such as security groups and flow logs.
Redshift Spectrum
- Query Exabytes of Data: Run queries against data in S3 without loading or transforming it.
- Columnar Format: Scans only the needed columns for your query, reducing data processing.
- Compression Algorithms: Scans less data when data is compressed with supported algorithms.
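A minimal sketch of registering an external schema for Spectrum, assuming an IAM role with access to S3 and the AWS Glue Data Catalog (the role ARN and names are placeholders); the SQL can be run from any client, here via the Redshift Data API CLI:
# Create an external schema backed by the AWS Glue Data Catalog
aws redshift-data execute-statement \
  --cluster-identifier my-redshift-cluster \
  --database dev \
  --db-user masteruser \
  --sql "CREATE EXTERNAL SCHEMA spectrum_schema FROM DATA CATALOG DATABASE 'spectrum_db' IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole' CREATE EXTERNAL DATABASE IF NOT EXISTS;"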
Redshift Streaming Ingestion
- Streaming Data: Consume and process data directly from streaming sources like Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (MSK).
- Low Latency: Provides high-speed ingestion without staging data in S3.
Redshift ML
- Machine Learning: Train and deploy machine learning models using SQL commands within Redshift.
- In-Database Inference: Perform in-database predictions without moving data.
- SageMaker Integration: Utilizes Amazon SageMaker Autopilot to find the best model for your data.
Redshift Data Sharing
- Live Data Sharing: Securely share live data across Redshift clusters, within or across AWS accounts, without copying or moving it.
- Up-to-Date Information: Users always access the most current data in the warehouse.
- No Additional Cost: Available on Redshift RA3 clusters without extra charges.
Redshift Cross-Database Query
- Query Across Databases: Allows querying across different databases within a Redshift cluster, regardless of the database you are connected to. This feature is available on Redshift RA3 node types at no extra cost.
Cluster Snapshots
- Types: There are two types of snapshots, automated and manual. Both are stored in S3 and transferred over an encrypted SSL connection.
- Automated Snapshots: Taken every eight hours or after every 5 GB per node of data changes (whichever comes first); enabled by default. They are deleted at the end of the retention period, which defaults to one day and can be modified.
- Manual Snapshots: Retained indefinitely unless manually deleted. Can be shared with other AWS accounts.
- Cross-Region Snapshots: Snapshots can be copied to another AWS Region for disaster recovery, with a default retention period of seven days.
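A sketch of enabling cross-Region snapshot copy for an existing cluster (the destination Region and retention period are illustrative):
# Automatically copy snapshots to another Region for disaster recovery
aws redshift enable-snapshot-copy \
  --cluster-identifier my-redshift-cluster \
  --destination-region us-east-1 \
  --retention-period 7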
Monitoring
- Audit Logging: Tracks authentication attempts, connections, disconnections, user definition changes, and queries. Logs are stored in S3 (see the CLI sketch after this list).
- Event Tracking: Redshift retains information about events for several weeks.
- Performance Metrics: Uses CloudWatch to monitor physical aspects like CPU utilization, latency, and throughput.
- Query/Load Performance Data: Helps monitor database activity and performance.
- CloudWatch Alarms: Optionally configured to monitor disk space usage across cluster nodes.
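A sketch of enabling audit logging to S3 (the bucket name and prefix are placeholders; the bucket needs a policy that allows Redshift log delivery):
# Deliver connection, user, and user-activity logs to an S3 bucket
aws redshift enable-logging \
  --cluster-identifier my-redshift-cluster \
  --bucket-name my-redshift-audit-logs \
  --s3-key-prefix audit/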
Security
- Access Control: By default, only the AWS account that creates the cluster can access it.
- IAM Integration: Create user accounts and manage permissions using IAM.
- Security Groups: Use VPC security groups to control inbound access to clusters; Redshift cluster security groups apply only to the retired EC2-Classic platform.
- Encryption: Optionally encrypt clusters upon provisioning. Encrypted clusters' snapshots are also encrypted.
Pricing
- Billing: Pay per second based on the type and number of nodes in your cluster.
- Spectrum Scanning: Pay for the number of bytes scanned by Redshift Spectrum.
- Reserved Instances: Save costs by committing to 1 or 3-year terms.
Cluster Management
Creating a Cluster
aws redshift create-cluster \
--cluster-identifier my-redshift-cluster \
--node-type dc2.large \
--master-username masteruser \
--master-user-password 'MyPassword123' \
--cluster-type multi-node \
--number-of-nodes 2
Deleting a Cluster
aws redshift delete-cluster \
--cluster-identifier my-redshift-cluster \
--skip-final-cluster-snapshot
Describing a Cluster
aws redshift describe-clusters \
--cluster-identifier my-redshift-cluster
Database Management
Connecting to the Database
Use a PostgreSQL-compatible tool such as psql or a SQL client:
psql -h my-cluster.cduijjmc4xkx.us-west-2.redshift.amazonaws.com -p 5439 -U masteruser -d dev
Creating a Database
CREATE DATABASE mydb;
Dropping a Database
DROP DATABASE mydb;
User Management
Creating a User
CREATE USER myuser WITH PASSWORD 'MyPassword123';
Dropping a User
DROP USER myuser;
Granting Permissions
GRANT ALL PRIVILEGES ON DATABASE mydb TO myuser;
Revoking Permissions
REVOKE ALL PRIVILEGES ON DATABASE mydb FROM myuser;
Table Management
Creating a Table
CREATE TABLE mytable (
id INT PRIMARY KEY,
name VARCHAR(50),
age INT
);
Dropping a Table
DROP TABLE mytable;
Inserting Data
INSERT INTO mytable (id, name, age) VALUES (1, 'John Doe', 30);
Updating Data
UPDATE mytable SET age = 31 WHERE id = 1;
Deleting Data
DELETE FROM mytable WHERE id = 1;
Querying Data
SELECT * FROM mytable;
Performance Tuning
Analyzing a Table
ANALYZE mytable;
Vacuuming a Table
VACUUM mytable;
Redshift Distribution Styles
- AUTO: Redshift assigns an optimal distribution style based on the table size (the default when no style is specified).
- KEY: Distributes rows based on the values in one column.
- EVEN: Distributes rows evenly across all node slices in a round-robin fashion.
- ALL: Copies the entire table to each node.
Example: Creating a Table with Distribution Key
CREATE TABLE mytable (
id INT,
name VARCHAR(50),
age INT
)
DISTSTYLE KEY
DISTKEY(id);
Backup and Restore
Creating a Snapshot
aws redshift create-cluster-snapshot \
--snapshot-identifier my-snapshot \
--cluster-identifier my-redshift-cluster
Restoring from a Snapshot
aws redshift restore-from-cluster-snapshot \
--snapshot-identifier my-snapshot \
--cluster-identifier my-new-cluster
Security
Enabling SSL
In psql or your SQL client, use the sslmode parameter:
psql "host=my-cluster.cduijjmc4xkx.us-west-2.redshift.amazonaws.com port=5439 dbname=dev user=masteruser password=masterpassword sslmode=require"
Managing VPC Security Groups
Clusters launched in a VPC use standard VPC security groups; associate one or more with a cluster using:
aws redshift modify-cluster --cluster-identifier my-redshift-cluster --vpc-security-group-ids sg-0123456789abcdef0
Maintenance
Resizing a Cluster
aws redshift modify-cluster \
--cluster-identifier my-redshift-cluster \
--node-type dc2.large \
--number-of-nodes 4
Monitoring Cluster Performance
Use Amazon CloudWatch to monitor:
- CPU Utilization
- Database Connections
- Read/Write IOPS
- Network Traffic
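For example, CPU utilization can be pulled from CloudWatch with the CLI (a sketch; the time window, period, and cluster name are illustrative):
# Fetch average CPU utilization for the cluster over a one-hour window
aws cloudwatch get-metric-statistics \
  --namespace AWS/Redshift \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterIdentifier,Value=my-redshift-cluster \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T01:00:00Z \
  --period 300 \
  --statistics Average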
Viewing Cluster Events
aws redshift describe-events \
--source-identifier my-redshift-cluster \
--source-type cluster
Amazon S3 Cheat Sheet
Overview
Amazon S3 (Simple Storage Service) stores data as objects within buckets. Each object includes a file and optional metadata that describes the file. A key is a unique identifier for an object within a bucket, and storage capacity is virtually unlimited.
Buckets
- Access Control: For each bucket, you can control access, create, delete, and list objects, view access logs, and choose the geographical region for storage.
- Naming: Bucket names must be unique DNS-compliant names across all existing S3 buckets. Once created, the name cannot be changed and is visible in the URL pointing to the objects in the bucket.
- Limits: By default, you can create up to 100 buckets per AWS account. The region of a bucket cannot be changed after creation.
- Static Website Hosting: Buckets can be configured to host static websites.
- Deletion Restrictions: Buckets with 100,000 or more objects cannot be deleted via the S3 console. Buckets with versioning enabled cannot be deleted via the AWS CLI.
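A minimal sketch of creating a bucket and listing its contents (the bucket name is a placeholder and must be globally unique):
# Create a bucket in a specific Region, then list the objects in it
aws s3api create-bucket \
  --bucket my-unique-data-bucket \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2
aws s3 ls s3://my-unique-data-bucket/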
Data Consistency Model
- Strong Consistency: S3 delivers strong read-after-write consistency for PUTs and DELETEs of objects in all regions, for both new objects and overwrites, and for list operations.
- Eventual Consistency: Bucket configurations follow an eventual consistency model; for example, a deleted bucket may still appear when listing all buckets, and enabling versioning on a bucket for the first time may take time to propagate.
Storage Classes
Frequently Accessed Objects
- S3 Standard: General-purpose storage for frequently accessed data.
- S3 Express One Zone: High-performance, single-AZ storage class for latency-sensitive applications, offering improved access speeds and reduced request costs compared to S3 Standard.
Infrequently Accessed Objects
- S3 Standard-IA: For long-lived but less frequently accessed data, with redundant storage across multiple AZs.
- S3 One Zone-IA: Less expensive, stores data in one AZ, and is not resilient to AZ loss. Suitable for objects over 128 KB stored for at least 30 days.
Amazon S3 Intelligent-Tiering
- Automatic Cost Optimization: Moves data between frequent and infrequent access tiers based on access patterns.
- Monitoring: Moves objects to the Infrequent Access tier after 30 consecutive days without access, to the Archive Instant Access tier after 90 days, and to the optional Archive Access and Deep Archive Access tiers after 90 and 180 days without access when those tiers are activated.
- No Retrieval Fees: Optimizes costs without performance impact.
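The storage class is chosen per object at upload time; a sketch with placeholder file and bucket names:
# Upload an object directly into Intelligent-Tiering (STANDARD_IA, ONEZONE_IA, etc. work the same way)
aws s3 cp data.csv s3://my-unique-data-bucket/data.csv --storage-class INTELLIGENT_TIERING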
S3 Glacier
- Long-Term Archive: Provides storage classes like Glacier Instant Retrieval, Glacier Flexible Retrieval, and Glacier Deep Archive for long-term archiving.
- Access: Objects in Glacier Flexible Retrieval and Glacier Deep Archive must be restored before they can be accessed (Glacier Instant Retrieval provides millisecond access); archived objects remain visible and managed only through S3.
Retrieval Options
- Expedited: Access data within 1-5 minutes for urgent requests.
- Standard: Default option, typically completes within 3-5 hours.
- Bulk: Lowest-cost option for retrieving large amounts of data, typically completes within 5-12 hours.
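A sketch of restoring an archived object with an expedited retrieval (bucket, key, and restore window are illustrative):
# Make an archived object readable for 7 days using the Expedited tier
aws s3api restore-object \
  --bucket my-unique-data-bucket \
  --key archive/data.csv \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Expedited"}}'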
Additional Information
- Object Storage: For S3 Standard, Standard-IA, and Glacier classes, objects are stored across multiple devices in at least three AZs.
Amazon Athena Cheat Sheet
Overview
Amazon Athena is an interactive query service that allows you to analyze data directly in Amazon S3 and other data sources using SQL. It is serverless and uses open-source, distributed SQL query engines (Trino and Presto) optimized for low-latency, ad hoc analysis.
Features
- Serverless: No infrastructure to manage.
- Built-in Query Editor: Allows you to write and execute queries directly in the Athena console.
- Wide Data Format Support: Supports formats such as CSV, JSON, ORC, Avro, and Parquet.
- Parallel Query Execution: Executes queries in parallel to provide fast results, even for large datasets.
- Amazon S3 Integration: Uses S3 as the underlying data store, ensuring high availability and durability.
- Data Visualization: Integrates with Amazon QuickSight.
- AWS Glue Integration: Works seamlessly with AWS Glue for data cataloging.
- Managed Data Catalog: Stores metadata and schemas for your S3-stored data.
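A sketch of submitting a query from the CLI (the database, table, and output location are placeholders):
# Submit a query; Athena returns a QueryExecutionId and writes results to the S3 output location
aws athena start-query-execution \
  --query-string "SELECT * FROM mytable LIMIT 10" \
  --query-execution-context Database=mydb \
  --result-configuration OutputLocation=s3://my-athena-results/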
Queries
- Geospatial Data: You can query geospatial data.
- Log Data: Supports querying various log types.
- Query Results: Results are stored in S3.
- Query History: Retains history for 45 days.
- User-Defined Functions (UDFs): Supports scalar UDFs, executed with AWS Lambda, to process records or groups of records.
- Data Types: Supports both simple (e.g., INTEGER, DOUBLE, VARCHAR) and complex (e.g., MAPS, ARRAY, STRUCT) data types.
- Requester Pays Buckets: Supports querying data in S3 Requester Pays buckets.
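Query status and results can be fetched with the execution ID returned when a query is submitted (a sketch; the ID below is a placeholder):
# Check query status, then page through the result set stored in S3
aws athena get-query-execution --query-execution-id <query-execution-id>
aws athena get-query-results --query-execution-id <query-execution-id>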
Athena Federated Queries
- Data Connectors: Allows querying data sources beyond S3 using data connectors implemented in Lambda functions via the Athena Query Federation SDK.
- Pre-built Connectors: Available for popular data sources like MySQL, PostgreSQL, Oracle, SQL Server, DynamoDB, MSK, Redshift, OpenSearch, CloudWatch Logs, CloudWatch metrics, and DocumentDB.
- Custom Connectors: You can write custom data connectors or customize pre-built ones using the Athena Query Federation SDK.
Optimizing Query Performance
- Data Partitioning: Partitioning data by column values (e.g., date, country, region) reduces the amount of data scanned by a query.
- Columnar Formats: Converting data to columnar formats like Parquet and ORC improves performance.
- File Compression: Compressing files reduces the amount of data scanned.
- Splittable Files: Using splittable files allows Athena to read them in parallel, speeding up query completion. Formats like Avro, Parquet, and ORC are splittable regardless of the compression codec; among compressed text files, only those compressed with BZIP2 or LZO are splittable.
Cost Controls
- Workgroups: Isolate queries by teams, applications, or workloads and enforce cost controls.
- Per-Query Limit: Sets a threshold for the total amount of data scanned per query, canceling any query that exceeds this limit.
- Per-Workgroup Limit: Limits the total amount of data scanned by all queries within a specified timeframe, with multiple limits based on hourly or daily data scan totals.
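A sketch of creating a workgroup with a per-query scan limit (the name, output location, and 1 GB limit are illustrative; the limit is specified in bytes):
# Workgroup that cancels any query scanning more than ~1 GB
aws athena create-work-group \
  --name analytics-team \
  --configuration 'ResultConfiguration={OutputLocation=s3://my-athena-results/},BytesScannedCutoffPerQuery=1073741824'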
Amazon Athena Security
- Access Control: Use IAM policies, access control lists, and S3 bucket policies to control data access.
- Encrypted Data: Queries can be performed directly on encrypted data in S3.
Amazon Athena Pricing
- Pay Per Query: Charged based on the amount of data scanned by each query.
- No Charge for Failed Queries: You are not charged for queries that fail.
- Cost Savings: Compressing, partitioning, or converting data to columnar formats reduces the amount of data scanned, leading to cost savings and performance gains.
Amazon Kinesis Cheat Sheet
Overview
Amazon Kinesis makes it easy to collect, process, and analyze real-time streaming data. It can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications.
Kinesis Video Streams
A fully managed service for streaming live video from devices to the AWS Cloud or building applications for real-time video processing or batch-oriented video analytics.
Benefits
- Device Connectivity: Connect and stream from millions of devices.
- Custom Retention Periods: Configure video streams to durably store media data for custom retention periods, generating an index based on timestamps.
- Serverless: No infrastructure setup or management required.
- Security: Enforces TLS-based encryption for data streaming and encrypts all data at rest using AWS KMS.
Components
- Producer: Source that puts data into a Kinesis video stream.
- Kinesis Video Stream: Enables the transportation, optional storage, and real-time or batch consumption of live video data.
- Consumer: Retrieves data from a Kinesis video stream to view, process, or analyze it.
- Fragment: A self-contained sequence of frames with no dependencies on other fragments.
Video Playbacks
- HLS (HTTP Live Streaming): For live playback.
- GetMedia API: For building custom applications to process video streams in real time with low latency.
Metadata
- Nonpersistent Metadata: Ad hoc metadata for specific fragments.
- Persistent Metadata: Metadata for consecutive fragments.
Pricing
- Pay for the volume of data ingested, stored, and consumed.
Kinesis Data Streams
A scalable, durable data ingestion and processing service optimized for streaming data.
Components
- Data Producer: Application emitting data records to a Kinesis data stream, assigning partition keys to records.
- Data Consumer: Application or AWS service retrieving data from all shards in a stream for real-time analytics or processing.
- Data Stream: A logical grouping of shards. Data is retained for 24 hours by default, extendable up to 7 days with extended retention or up to 365 days with long-term retention.
- Shard: The base throughput unit, ingesting up to 1,000 records or 1 MB per second and supporting up to 2 MB per second of reads. Record order is preserved within a shard by arrival time.
Data Record
- Record: Unit of data in a stream with a sequence number, partition key, and data blob (max 1 MB).
- Partition Key: Identifier (e.g., user ID, timestamp) used to route records to shards.
Sequence Number
- Unique identifier for each data record, assigned by Kinesis when data is added.
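A sketch of creating a stream and writing one record (the stream name, partition key, and payload are placeholders; with AWS CLI v2 the data blob must be base64-encoded):
# Create a stream with 2 shards
aws kinesis create-stream --stream-name my-stream --shard-count 2
# "SGVsbG8sIEtpbmVzaXMh" is base64 for "Hello, Kinesis!"; the response returns the ShardId and SequenceNumber
aws kinesis put-record \
  --stream-name my-stream \
  --partition-key user-123 \
  --data "SGVsbG8sIEtpbmVzaXMh"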
Monitoring
- Monitor shard-level metrics using CloudWatch, Kinesis Agent, and Kinesis libraries. Log API calls with CloudTrail.
Security
- Automatically encrypt sensitive data with AWS KMS.
- Use IAM for access control and VPC endpoints to keep traffic within the Amazon network.
Pricing
- Charged per shard hour, PUT Payload Unit, and enhanced fan-out usage. Extended data retention incurs additional charges.
Kinesis Data Firehose
The easiest way to load streaming data into data stores and analytics tools.
Features
- Scalable: Automatically scales to match data throughput.
- Data Transformation: Can batch, compress, and encrypt data before loading it.
- Destination Support: Captures, transforms, and loads data into S3, Redshift, Amazon OpenSearch Service (formerly Elasticsearch), HTTP endpoints, and service providers like Datadog, New Relic, MongoDB, and Splunk.
- Batch Size and Interval: Control data upload frequency and size.
Data Delivery and Transformation
- Lambda Integration: Transforms incoming data before delivery.
- Format Conversion: Converts JSON to Parquet or ORC for storage in S3.
- Buffer Configuration: Controls data buffering before delivery to destinations.
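A sketch of writing a record to an existing delivery stream (the stream name is a placeholder; with AWS CLI v2 the Data field must be base64-encoded):
# "eyJldmVudCI6ICJjbGljayJ9" is base64 for {"event": "click"}
aws firehose put-record \
  --delivery-stream-name my-delivery-stream \
  --record '{"Data":"eyJldmVudCI6ICJjbGljayJ9"}'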
Pricing
- Pay for the volume of data transmitted. Additional charges for data format conversion.
Kinesis Data Analytics
Analyze streaming data, gain insights, and respond to business needs in real time.
General Features
- Serverless: Automatically manages infrastructure.
- Scalable: Elastically scales to handle data volume.
- Low Latency: Provides sub-second processing latencies.
SQL Features
- Standard ANSI SQL: Integrates with Kinesis Data Streams and Firehose.
- Input Types: Supports streaming and reference data sources.
- Schema Editor: Recognizes standard formats like JSON and CSV.
Java Features
- Apache Flink: Uses open-source libraries for building streaming applications.
- State Management: Stores state in encrypted, incrementally saved running application storage.
- Exactly Once Processing: Ensures processed records affect results exactly once.
Components
- Input: Streaming source for the application.
- Application Code: SQL statements processing input data.
- In-Application Streams: Stores data for processing.
- Kinesis Processing Units (KPU): Provides memory, computing, and networking resources.
Pricing
- Charged based on the number of KPUs used. Additional charges for Java application orchestration and storage.