- 50 scored multiple-choice questions + 15 unscored questions
- 180 minutes
- Pass/fail grade, need ~75% correct to pass
- No penalty for guessing
- Domains
- Collection — 18%
- Storage & Data Management — 22%
- Data Processing — 24%
- Analysis and Visualization — 18%
- Security — 18%
- Can upload files into S3 through console, CLI, and SDKs
- Problem: Users geographically away from the S3 bucket suffer from higher download and upload latency
- Download-side solution: CloudFront
- CloudFront’s global CDN uses edge locations to cache frequently accessed data closer to the end-user
- Upload-side solution: Transfer Acceleration
- Transfer Acceleration uses CloudFront’s globally distributed edge locations to route uploads to S3 over an optimized network path
- Must be enabled per bucket
- Uses special endpoints: bucketname.s3-accelerate.amazonaws.com, bucketname.s3-accelerate.dualstack.amazonaws.com
- Additional cost — $0.04/GB for the US, Europe, and Japan, $0.08/GB for all other AWS edge locations
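- A minimal boto3 sketch of enabling Transfer Acceleration and then uploading through the accelerate endpoint (bucket and file names are hypothetical):
```python
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Transfer Acceleration must be enabled per bucket
s3.put_bucket_accelerate_configuration(
    Bucket="my-example-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# A client configured to use the bucketname.s3-accelerate.amazonaws.com endpoint
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("big-file.bin", "my-example-bucket", "uploads/big-file.bin")
```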
- Multipart Uploads
- Moving large files all at once is impractical, time consuming, and expensive
- Single S3 put only allows 5GiB, but an S3 object can be up to 5TiB, so large files must be broken down to smaller chunks for upload
- Three API calls
- CreateMultipartUpload — returns bucket, key, and upload id
- UploadPart — provide bucket, key, part number, and upload id; returns an ETag
- CompleteMultipartUpload — provide bucket, key, upload id, all part numbers, and ETags; returns bucket, ETag, and key
- Considerations
- Max 10,000 parts, each part (except the final one) must be at least 5MiB
- Recommended to use MultipartUpload for files larger than 100MiB
- Parts can be overwritten during upload in case of updates or failure
- Configure a lifecycle policy to auto-abort the multipart upload after a specified time period in case of failure
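- A minimal boto3 sketch of the three multipart calls above (bucket, key, and file names are hypothetical; the 8 MiB chunk size is just an example above the 5 MiB minimum):
```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "backups/large-object.bin"

# 1. CreateMultipartUpload — returns an UploadId
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

# 2. UploadPart — every part except the last must be at least 5 MiB
parts = []
with open("large-file.bin", "rb") as f:
    part_number = 1
    while chunk := f.read(8 * 1024 * 1024):
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=upload_id, Body=chunk)
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1

# 3. CompleteMultipartUpload — pass back all part numbers and ETags
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                             MultipartUpload={"Parts": parts})
```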
- S3 Storage Classes
- Design
- Durability (data persistence) — 99.999999999% (11 nines)
- Availability (data uptime) — 99.5% - 99.99% (2-4 nines depending on storage class)
- Infrequent access and archival storage classes are less expensive storage solutions, but add an additional write charge and retrieval charge
- Storage Classes (ordered from most expensive to least expensive)
- Standard — General purpose, data that is regularly accessed
- Intelligent Tiering — Unknown access pattern
- Standard Infrequent Access — Weekly/monthly access
- One-Zone Infrequent Access — Easily replaceable data
- Glacier — Archive, acceptable minutes to hours of retrieval time
- Glacier Deep Archive — Archive, acceptable hours of retrieval time
- In the infrequent access classes, objects smaller than 128KB are billed as if they were 128KB objects, so it’s best to bundle smaller files into objects of at least 128KB
- Lifecycle policies allow you to set rules to move data to archival storage classes after X days
- Written in JSON or XML or using the Wizard in the console
- Filter — which objects the lifecycle policy applies to
- Transition — when data should be moved from one storage class to another
- Expiration — when data should be deleted
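- A minimal boto3 sketch of a lifecycle rule combining a filter, a transition, an expiration, and the multipart auto-abort mentioned earlier (bucket name, prefix, and day counts are hypothetical):
```python
import boto3

s3 = boto3.client("s3")

# Transition objects under a prefix to Glacier after 90 days, delete them after 365
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},                                # Filter
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],     # Transition
            "Expiration": {"Days": 365},                                  # Expiration
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    },
)
```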
- S3 Security and Encryption
- Encryption
- In-Transit
- Client TLS support is mandatory
- TLS 1.2 and above recommended, but all S3 clients are required to support at least TLS 1.0
- Clients must support Perfect Forward Secrecy (PFS) cipher suites
- Ephemeral Diffie-Hellman (DHE)
- Elliptic Curve Diffie-Hellman Ephemeral (ECDHE)
- VPC endpoints to limit bucket access to within VPC
- Data travels over private secure connection to private subnets, not over public internet
- Multiple VPN options for access from outside VPC
- Options: Build own VPN, AWS Site-to-Site VPN, or Direct Connect
- Access control and auditing
- Access Control (from least to most granular)
- Service Control Policies (Organization-level)
- IAM — Users, Roles, Policies
- Bucket ACL
- Bucket Policies
- Object ACL
- Object Policy
- Auditing
- CloudTrail logs
- S3 server access logs
- Access Analyzer for S3
- At-Rest
- Client-Side Encryption (through SDKs)
- Customer Master Key (CMK) in Key Management Service (KMS)
- Application requests object from S3 bucket, S3 returns encrypted object along with a cipher blob for the encryption key
- Application then calls the KMS Decrypt API, passing the cipher blob to KMS, which returns the plaintext data key used to decrypt the S3 object
- Application-Stored Key
- Server-Side Encryption (using KMS/S3)
- S3 Managed Encryption Keys (SSE-S3)
- Application server requests object, S3 automatically decrypts the object and sends it back
- CMK stored in KMS (SSE-KMS)
- Application server requests object, S3 automatically fetches decryption key from KMS, decrypts object, and sends it back
- Customer-provided Encryption Keys (SSE-C)
- Customer provides keys and exchange is handled by the application code, so decryption is not automatic. S3 just stores and sends the encrypted object when it is requested
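- A minimal boto3 sketch of the SSE-KMS flow above — the caller only names the CMK, and S3/KMS handle the data-key wrapping and unwrapping (bucket, key, and CMK alias are hypothetical):
```python
import boto3

s3 = boto3.client("s3")

# Upload with SSE-KMS: S3 asks KMS to generate and wrap the data key
s3.put_object(
    Bucket="my-example-bucket",
    Key="reports/q1.csv",
    Body=b"region,amount\nus-east-1,100\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-app-key",
)

# GetObject decrypts transparently, provided the caller has kms:Decrypt on the CMK
obj = s3.get_object(Bucket="my-example-bucket", Key="reports/q1.csv")
print(obj["Body"].read())
```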
- Object Protection
- Locking — write-once read-many mode (WORM)
- Multi-factor authentication (MFA) Delete
- Versioning — can rollback to a prior version even after “deletion”
- Replication
- Cross-region replication
- Replicates the bucket’s objects into a bucket in a different region
- Other bucket will need to have a unique/different name
- Will need to set up automatic failover in application code to point to the replica bucket if the original region goes down
- Must enable versioning to turn on cross-region replication to prevent overwriting
- S3 FAQs
- S3 best-fit use-cases
- Store large amounts of frequently-accessed data at low cost
- Aggregate data over a short interval (buffering)
- Cloud-local cache for on-prem data (Single-zone IA)
- Easy static website hosting
- Archiving — tiered automatic lifecycle archiving
- Long-term archive
- Archiving source code
- S3 limits
- No hard limit on how much data a bucket can hold
- Single object can be up to 5TB
- Single PUT upload is max 5GB, so multipart uploads are used for bigger files
- S3 security and compliance overview
- S3 is ideal for security compliance standards like HIPAA
- Implements in-flight and at-rest encryption
- Excellent data archival capability
- GovCloud region provides ITAR compliance
- Database Engines
- Relational engines
- Data normalization — storing data once and referencing it in other tables
- Tabular (row-based) table access — Aurora, MySQL, MariaDB, Oracle, SQL Server
- Used for Online Transaction Processing (OLTP)
- Used for rapid transaction, protecting data through transaction rollback
- All-or-nothing architecture
- Ideal for low-latency applications
- Columnar (column-based) table access — Aurora, PostgreSQL, Oracle, Redshift
- Used for Online Analytics Processing (OLAP)
- Used for analytics workloads that manage large amounts of data
- Handle complex long-running query operations
- OLTP databases are often a data source in analytics workflows
- Non-relational engines
- Key-value
- Elasticache (Memcached & Redis)
- Data is stored in memory
- Fast but fragile
- Memory access is much faster than disk access
- But all data can be lost if there is a hardware failure
- Redis has disk persistence, but at the cost of speed
- DynamoDB
- Both flexible and inflexible — can be used as a key-value or document-style database
- Data model dictates style
- Table structure can’t be changed after creation
- Primary key
- Partition key should be unique identifier, sort key is optional
- Partition key + sort key can make a composite primary key
- The primary key is needed to query the table
- Secondary Indexes can be used to query tables using alternate keys
- Global Secondary Index — different partition key than table
- Local Secondary Index — same partition key as table, but different sort key
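- A minimal boto3 sketch of a table with a composite primary key plus a Global Secondary Index on a different partition key (table and attribute names are hypothetical):
```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Orders",
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
        {"AttributeName": "status", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "customer_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "order_date", "KeyType": "RANGE"},   # sort key
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "status-index",                           # GSI: different partition key
        "KeySchema": [{"AttributeName": "status", "KeyType": "HASH"}],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)
```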
- Document
- DynamoDB (see above)
- DocumentDB
- For structured or semi-structured document storage (JSON/YAML)
- Metadata is stored with document records
- Can hit critical scaling failures (more about this below)
- Graph
- Neptune
- Relational+ / Hyperrelational
- Data is structured as nodes connected with edges (i.e. shared attributes between nodes)
- Can reveal otherwise hidden patterns in data
- Relational Database Service (RDS)
- Managed service, so AWS is in charge of maintaining the OS and server
- Database instance is configured through a Management API, and can run the following DB engines:
- Amazon Aurora (MySQL/PostgreSQL)
- MySQL
- MariaDB
- PostgreSQL
- SQL Server
- Oracle
- “Second-level service” i.e. uses a bunch of other services under the hood, such as EC2, EBS, S3, and Route53
- Since we don’t have operating system access, the DB engines are configured through Parameter Groups and Option Groups
- Parameter Groups are used to manage engine parameters
- Logging, Memory management, and other engine variables
- Option Groups are used to manage plugins and plugin settings
- Usage varies by engine: Not used at all by PostgreSQL, heavily used by Oracle
- Uses Multi-AZ deployments for Disaster Recovery
- Keeps a secondary instance dormant in a different AZ, and monitors the primary instance
- Block level replication helps keep primary and secondary RDS instances in sync
- If any part of primary instance fails, RDS automatically switches to the secondary instance, which becomes the primary
- After the failover, RDS will create a new secondary instance in the original AZ
- Single-AZ deployments are not recommended for production workloads
- Multi-AZ failover takes roughly 60 seconds, whereas deploying a new RDS instance can take 5 minutes
- Neptune
- Graph is comprised of nodes, and these nodes are connected by edges
- Nodes contain data
- Edges are common data points between nodes
- Interface languages to traverse nodes across edges
- Apache — Tinkerpop Gremlin
- Graph Structure: Property
- Interface: WebSocket
- Query Pattern: Traversal
- W3C — SPARQL Protocol and RDF Query Language
- Graph Structure: Resource Description Framework (RDF)
- Interface: HTTP REST
- Query Pattern: SQL
- Similarities to RDS
- Fully-managed service
- Parameter/Option Groups
- Multi-AZ Deployments* — more like Aurora, as replication happens at the storage level, not the engine level
- Instance Monitoring
- Snapshots/Backup
- Used for mainly pattern recognition
- Security — detecting fraud
- Social Media — identify connections between users, or targeted advertising
- Science — scientific modeling
- DocumentDB
- Document store database management system, stores data in JSON
- Able to index JSON data structures at scale (a failure point of MongoDB)
- Fully managed service, and storage automatically scales upward
- MongoDB compatible
- Use cases
- Social media profiles
- Content management systems (CMS)
- Object catalogues
- Serverless Database Options
- S3 SelectObjectContent API Call
- Available through CLI or SDKs
- Reads data from CSV, JSON, or Parquet
- Must be encoded in UTF-8 format
- Compression must be GZIP or BZIP2 for CSV and JSON, and only GZIP columnar compression for Parquet
- Server-side encryption is supported, no client-side encryption
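- A minimal boto3 sketch of SelectObjectContent against a GZIP-compressed CSV object (bucket, key, and column names are hypothetical):
```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-example-bucket",
    Key="sales/2023.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT s.region, s.amount FROM S3Object s WHERE CAST(s.amount AS FLOAT) > 100",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; Records events carry the query output
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```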
- Athena
- SQL interface for data stored in S3
- Fully featured service, used for more complex queries than S3 Select
- Runs Presto under the hood
- Supports CSV, JSON, Parquet, ORC and Avro data formats
- Service integrations with Glue and Quicksight
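- A minimal boto3 sketch of running an Athena query and paging the results (database, table, and output bucket are hypothetical):
```python
import time
import boto3

athena = boto3.client("athena")

# Standard SQL over data catalogued in Glue; results land in an S3 output location
qid = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then read the result rows
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])
```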
- DynamoDB
- Key-value and document store
- Great for OLTP use-cases
- Key features include
- Local Secondary Indexes — use another attribute as primary key to query
- Global Secondary Indexes
- Streams
- Service integrations with Lambda, DAX, and IAM
- Aurora Serverless
- Runs MySQL/PostgreSQL compatible versions for an “Infrequent Access” Aurora DB
- A proxy sits in front of the DB to handle incoming connections
- Compute instances (Aurora Capacity Units or ACUs)
- Provisioned when a user connects to the proxy, and on standby for queries
- Runs two engines
- MySQL — scales from 1-256 ACUs
- PostgreSQL — scales from 2-384 ACUs
- But these scaling events can fail during long-running operations, so not ideal for critical production systems
- Ideal for infrequently run short-running queries
- Storage Cluster
- 10 GiB storage nodes, so virtually no IOPS limit
- Databases FAQs
- What services can manage data between these databases?
- Database Migration Service (DMS), Glue, Elastic Map Reduce (EMR)
- Other database engines not covered in this section
- Amazon Quantum Ledger Database (QLDB) — cannot delete a record, any changes are tracked, used for financial transactions
- Amazon Timestream — timeseries data store, useful for IoT
- Amazon Keyspaces — Managed Apache Cassandra (an analytics DB engine)
- What is streaming data?
- Data loses decision-making value over time
- Streaming data is fresh/realtime, and usually has the most decision-making value
- Amazon Kinesis Family Overview
- Kinesis Data Streams — collect and process large streams of data records in realtime
- Kinesis Data Firehose — deliver streaming data to various data sources
- Kinesis Video Streams — stream live videos from devices to the cloud, and build realtime applications around these streams
- Kinesis Data Analytics — run SQL queries in realtime over streaming data
- Kinesis Data Streams
- Overview
- Kinesis Data Streams aggregate realtime data and load it to other services like EMR/RedShift
- Managed service that’s durable (won’t lose data) and elastic (scales up or down)
- Allows for parallel applications reading from the data streams, which differentiates it from queueing services
- Process
- Data producers (i.e. logs, social media streams, video streams, etc.) push data onto the Kinesis Data Streams
- Shards package and aggregate the data and send it to Data Consumers
- Default limit is 500 shards, but it can be increased
- Data consumers are processing tools such as Kinesis Data Analytics, Kinesis Data Firehose, EMR, EC2, Lambda etc.
- Can process, analyze or create dashboards to visualize the data
- Storage
- Storage isn’t necessary for streaming data as data gets less valuable over time
- But different storage options are available, such as data lakes on S3, data warehousing on Redshift, or DynamoDB
- Shards
- Each shard consists of a sequence of data records, which can be ingested at 1000 records per second
- Records with the same partition key go to the same shard, and each record has a unique sequence number
- Data payload per record can be up to 1MB, but most cases will be a few Kilobytes
- 1 shard can process 1 MB/sec input data, and 2 MB/sec output data
- 2 shards can process 2 MB/sec input data, and 4 MB/sec output data
- and so on...
- Scaling up shards is the way to scale Kinesis streams (e.g. add more shards if user activity increases)
- Shards are transient data storage
- Data records are stored for 24h by default, and can be extended up to 365 days
- Extended data retention enables storage for up to 7 days
- Long-term data retention enables storage for up to 365 days
- Reading To/Writing From Kinesis Streams
- Kinesis Producer Library (KPL) — Easy to use library to write to a Kinesis Data Stream
- Uses Kinesis API under-the-hood, providing a layer of abstraction specific for ingesting data
- Automatic and configurable retry mechanism
- Processing delays can occur for higher packing efficiencies and better performance
- Only usable with a Java wrapper
- Kinesis Client Library (KCL) — Integrated with KPL for consumer applications to consume/process data from Kinesis Data Stream
- Kinesis Agent — Pre-built Java application to be installed on Linux servers
- Kinesis API (AWS SDK) — API operations to send data to Kinesis Data Stream using PutRecord or PutRecords
- Uses PutRecords and GetRecords to manually handle streams
- No delays in processing
- Usable through any AWS SDK
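- A minimal boto3 sketch of the raw API path — PutRecords with per-record partition keys; failed records must be retried by the caller (stream name and payloads are hypothetical):
```python
import json
import boto3

kinesis = boto3.client("kinesis")

# The partition key determines which shard each record lands on
records = [
    {"Data": json.dumps({"user": "u1", "action": "click"}).encode(), "PartitionKey": "u1"},
    {"Data": json.dumps({"user": "u2", "action": "view"}).encode(), "PartitionKey": "u2"},
]
resp = kinesis.put_records(StreamName="clickstream", Records=records)

# Unlike the KPL, the raw API has no automatic retries
print("Failed records:", resp["FailedRecordCount"])
```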
- Kinesis Data Firehose
- Managed, server-less streaming application to load data into a few options (mostly other AWS services)
- Receives records from multiple producers and can transform the data before storing or delivering to data consumers
- Data producers can produce records as large as 1000KB
- Can copy raw data to transform it in a Lambda function before storing in S3, while also storing raw data in a different bucket
- To store in Redshift, data is always staged in S3 first, and then Firehose issues a COPY command to load the data from S3 into Redshift
- Can store in Redshift, S3, ElasticSearch, Splunk, and to an HTTP Endpoint (so data can be indirectly sent to SNS/API Gateway as well)
- Firehose buffers and compresses all data you push to it, causing a short delay
- Buffer Interval — Ranges from 60 seconds to 900 seconds
- Buffer Sizes
- S3 — 1MB to 128MB
- Elasticsearch — 1MB to 100MB
- Redshift — depends on how fast your cluster can finish the COPY command
- Splunk — 5MB buffer size, and 60s interval
- Firehose Data Streams can be encrypted using KMS
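- A minimal boto3 sketch of pushing a record to a delivery stream; Firehose buffers it until the size or interval threshold is hit, then delivers it to the target (stream name and payload are hypothetical):
```python
import json
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    # Newline-delimited JSON is a common convention for S3 delivery
    Record={"Data": (json.dumps({"user": "u1", "action": "click"}) + "\n").encode()},
)
```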
- Kinesis Video Streams
- Uses Kinesis Video Stream Libraries
- Manage media sources and stream lifecycle as data flows from source to Kinesis Video Streams
- Installed on-device to securely connect to Kinesis Video Streams
- Can do real-time video analysis or batch-oriented video analysis after storing videos durably
- Uses Kinesis Video Stream Libraries
- Kinesis Data Analytics
- Run SQL queries on streaming data
- Can join streaming data with data in S3
- Query output can be streamed to Kinesis Data Streams/Firehose and sent to other data consumers
- Amazon Managed Streaming for Kafka (MSK)
- Apache Kafka — open source streaming data service, using ZooKeeper
- Publish/subscribe to streams
- Store streams in a durable way — Unlimited durability
- Process streams as they occur — Very low latency
- Amazon Managed Streaming for Kafka
- With the managed service, you don’t need to:
- Manage cluster
- Ensure software updates
- Ensure high availability
- Structure
- VPC
- Broker Nodes
- Live in different availability zones (for high-availability)
- Contain streaming data
- Subscribed to by producer and consumer nodes outside the VPC
- ZooKeeper Nodes — coordinate data between broker nodes
- Topic Creators
- Take care of different partitions
- Comparisons to Kinesis Data Streams
- Throughput provisioning (with Shards) or cluster provisioning
- Seamless scaling for KDS, but MSK scaling is not seamless to client
- KDS stores data for 7 days max, but MSK can for unlimited time
- KDS has deep AWS integration, whereas MSK has 3rd party tooling
- MSK is the lowest-latency streaming solution
- KDS has maximum payload of 1MB, whereas MSK provides configurable data payload sizes up to 6MB
- AWS Direct Connect and AWS Snow Family
- Moving mass amounts (GB/TB/PB) of data with minimal disruption, cost, and time
- Managed tools to move data from on-premise to AWS network
- Hybrid Cloud Storage
- AWS Direct Connect
- Direct Connect Location — connects 1 or 10Gbps lines directly to AWS region
- Customer location connects into Direct Connect Location (physical cable) without the internet
- Can also use hosted connections by AWS partners for connection speeds ranging from 50Mbps-10Gbps
- Cost — billing for port hours, and data transfer out of AWS
- Online Data Transfer — via internet (CLI, console, SDK)
- AWS DataSync — one-time data migration, automate moving data between on-premise, S3, EFS, FSx etc.
- FTP/SFTP/FTPS into and out of S3
- S3 Transfer Acceleration — faster transfer of data into S3 over the internet
- Kinesis Data Firehose — load streaming data into S3
- AWS Snowcone — collect, process, and move data to AWS online through AWS DataSync
- Offline Data Transfer
- AWS Snow Family
- Snowcone
- Load data through WiFi or wired 10GbE networking, and ship data to AWS for offline data transfer
- Snowball
- Petabyte-scale data transport with import/export to S3
- Snowball Edge
- Local storage and large-scale data transfer like Snowball
- Local compute — Lambda, EC2, and AWS IoT Greengrass
- Ideal for data processing / ML for places with poor internet connectivity
- Snowmobile
- Exabyte-scale data transport in a secure 40-foot shipping container
- Solves security concerns, network costs, and transfer times for massive data transfer
- AWS Database Migration Service (DMS)
- Easily and securely migrates databases and warehouses, by replicating data so databases stay fully operational during migration
- Eases many common concerns like:
- On-premises to cloud migrations
- Commercial databases to open source database engine or S3 datalake migrations
- Avoiding long periods of downtime for users
- Scenarios/Use-cases
- Upgrade versions of database software with no downtime
- Archive old data in more cost efficient storage
- Migrate business-critical databases, from classic EC2 to VPC, costly license driven data-warehouse to Redshift
- Migrate databases from NoSQL to SQL, SQL to NoSQL, and SQL to SQL
- Heterogenous and homogenous migrations
- Can use a Snowball device to migrate large terabyte/petabytes of data
- Replication — can be used for one-time data migration, or continuous replication
- Change data capture — only replicate the changes
- Cross region replication — Keep databases in-sync across regions to reduce latency
- Offload Analytics to another database with replication
- Keep data in-sync between testing, staging, and production environments
- Cost
- Pay for on-demand instances spun up at the start of migration process
- Data transfer between DMS and databases in RDS/EC2 in the same AZ is free
- Data transfer cost when data leaves an AZ, Region, or outside of AWS
- AWS Data Pipeline
- Move and process data between compute and storage services
- Similar to EMR/Lambda, but a more managed service for scheduling ETL workloads
- Pipeline Jargon
- Data nodes — end destination for data
- Activities — action that Data Pipeline initiates on your behalf
- Preconditions — must be met before activity is ran
- Schedules — define when/how many times the activities run on data
- Can be used on-premise by installing a Task Runner
- Move tables between production database on RDS and non-prod db on-premise
- Remotely execute stored procedures on-premise
- AWS Lambda
- Event-driven service allowing code to run without managing infrastructure
- Important Limits
- 1,000 concurrent executions per region (default soft limit, applies across sync and async invocations)
- 15-minute function timeout
- 6MB invocation payload (for sync invocations), 256 KB (for async)
- Amazon API Gateway
- Server-less API service for REST, HTTP, and WebSocket APIs
- Amazon CloudFront
- CDN with edge locations that delivers data faster to end-users
- Comparing data migration options
- 4 W’s and 1 H of Data Migration
- Why are we migrating to AWS?
- What data is being migrated? What applications are attached to it?
- Where is the data going?
- When does the data need to finish?
- How much data do you have? How much network capacity do you have?
- FAQs about Data Migration
- How quickly can I migrate X amount of data?
- Typically about 100 TBs in a week, but Snowmobile can transfer data at a rate of up to 1 TB/s, where 100 PBs can be migrated in a few weeks
- Standard activities in Data Pipeline
- CopyActivity — copy data between S3 and JDBC data sources
- HiveActivity — execute Hive queries
- EMRActivity — run EMR jobs
- ShellCommandActivity — run arbitrary Linux shell commands/programs
- Standard preconditions in Data Pipeline
- Exists precondition — check if DynamoDBTable/DynamoDBData/S3Key/S3Prefix Exists
- Shell command precondition — runs shell script on resources and checks if it succeeds
- MapReduce
- Algorithm for batch-processing large amounts of data
- Batch workload is distributed evenly across many different nodes
- Distributed File Systems are used to split files and replicate data across several nodes
- Open source Apache Hadoop software makes it easy to run distributed file systems
- Phases
- Map — each node computes/processes the result for its own batch
- Reduce — nodes get together and aggregate batch results into a final result
- EMR Cluster Components — all nodes must be in a single AZ to minimize latency
- Primary/Master node(s) — single or three primary nodes (for high availability)
- Manages cluster resources by coordinating all the nodes
- Tracks and directs HDFS
- YARN resource management
- Conducts monitoring and health checks for core and task nodes
- Secondary/Slave nodes
- Run tasks for the primary node
- Coordinates data storage
- Multiple core nodes, but only one core instance group
- Task nodes
- Optional helpers
- Add parallel computing power to help core nodes process data
- But these nodes don’t store anything
- Amazon Elastic Map Reduce
- Amazon’s Managed Hadoop cluster in AWS to store, process, and analyze big data
- Storage Options
- Hadoop Distributed Filesystem (HDFS)
- Fast but ephemeral
- Used for caching results produced by intermediate job-flow steps
- Elastic Map Reduce File System (EMRFS)
- Direct reads and writes to Amazon S3
- Integrated with persistent S3, advantages include server-side encryption and consistency
- Lower IOPS compared to HDFS/LFS
- Local File System (Instance Storage and EBS)
- Speed/size determined by instance type
- Ephemeral, so may be used for temporary data (caches/buffers)
- High IOPS at low cost
- EMR Operations
- Transient
- Shut down the cluster when it’s not being used to save money
- Not using HDFS as primary storage (using EMRFS/S3 instead)
- Intensive & iterative data processing
- Long-running
- Frequently running processing jobs
- Processing jobs have an input-output dependency on one another
- Cost-effective to store your data on HDFS compared to S3
- Requirement of higher performance provided by HDFS
- Choosing EMR Instance Types
- Primary Node — does not have large computational requirements
- Clusters with 50 or fewer nodes — M5
- Clusters with more than 50 nodes — M4
- Core & Task Nodes — depends on type of processing
- General purpose — M5
- Batch processing/HPC/ML (CPU Intensive) — C4/C5/z1d
- Graphics processing/ML (GPU Intensive) — G3/P2/P3
- Spark applications (Memory Intensive) — R4/R5
- Large HDFS/MapReduce (High IOPS) — H1/I2/I3/D2
- Choosing the Number of EMR Instances
- Primary Node — optional one or three (for high-availability)
- Core & Task Nodes
- Run tasks — fewer instances will be too slow, too many instances will be costly
- Store HDFS data — data is replicated across nodes
- Default replication factors (changeable) — number of times data is replicated
- 1-3 nodes — replication factor of 1
- 4-9 nodes — replication factor of 2
- 10+ nodes — replication factor of 3
- Types of Storage
- EMRFS
- HDFS — high IOPS requirement
- Will need to add extra EBS capacity if instance store capacity across all nodes is smaller than data size × replication factor
- Or you can add more instances
- AWS suggests using a smaller cluster of larger nodes
- Reduces failure possibilities
- Reduces maintenance required
- When would you run Spot Instances for EMR
- Low-cost requirement
- Interruptions acceptable
- Partial work loss is acceptable
- HDFS data is lost if Spot Instances are interrupted
- Monitoring and Resizing EMR Clusters
- Key CloudWatch metrics for:
- Tracking cluster progress
- RunningMapTasks / RemainingMapTasks
- RunningReduceTasks / RemainingReduceTasks
- Detecting idle clusters — IsIdle
- Check if HDFS storage is running out of capacity — HDFSUtilization
- Monitoring can also be done with web UI through SSH tunneling and port forwarding
- Manual resizing
- Removing nodes — EMR will automatically remove idle nodes first, and replicate data to other nodes
- Cannot have fewer core nodes than the current replication factor allows — e.g. reducing from 11 core nodes to 9
- Must change the replication factor in hdfs-site.xml and restart NameNode daemon before resizing cluster
- Autoscaling
- EMR-managed scaling — set a minimum and maximum number of nodes, the on-demand limit, and maximum core node limit
- Custom autoscaling policy — set a minimum and maximum number of nodes, and then define scale out and scale in policies based on CloudWatch metrics
- EMR File Storage and Compression
- Hadoop splits files into smaller chunks, and a single map task is run on each chunk
- Files stored in HDFS are split automatically
- Files stored in S3/EMRFS are split in multiple HTTP range requests
- EMR supported compression algorithms
- Splittable: bzip2 (speed: slow), LZO (fast)
- Non-splittable: GZIP (speed: medium), Snappy (very fast)
- Benefits of Compression
- Better performance as less data is transferred between S3, mappers, and reducers
- Less network traffic
- Reduced storage costs for smaller files
- EMR File Formats
- Text — csv, json, tsv
- Parquet — column-oriented commonly used in Hadoop (Hive, HBase, MapReduce, Pig, Spark)
- ORC — row-column file format highly efficient for Hive data
- SequenceFile — flat-file of binary key-value pairs, used extensively with MapReduce input/output formats
- Avro — row-oriented data serialization framework, developed and used by Hadoop project
- Best practices for file sizes
- 1-2 GB files may use a non-splittable compression algorithm
- 2-4 GB files should use a splittable compression algorithm
- Avoid smaller files (100MB or less), plan for fewer large files
- Files should ideally map to block size for max cost efficiency
- For smaller files,
- Reduce HDFS block size from default 128 MB
- Use S3DistCp command to combine smaller files together into a single large file
- The S3DistCp command can also be used to compress files
- Copy files within a cluster, or to another cluster
- Can be run as a Step or on the Primary node
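- A minimal boto3 sketch of submitting s3-dist-cp as an EMR step to combine small files into larger compressed objects (the cluster id, buckets, and groupBy pattern are hypothetical):
```python
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "combine-small-files",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-raw-bucket/small-files/",
                "--dest", "s3://my-curated-bucket/combined/",
                "--groupBy", ".*(\\d{4}-\\d{2}-\\d{2}).*",  # merge files matching the same group
                "--targetSize", "1024",                     # target output size in MiB
                "--outputCodec", "gz",                      # compress the combined output
            ],
        },
    }],
)
```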
- FAQs
- Pricing
- Pay for EC2 instances running, and S3 usage — expensive!
- Also pay for data transfer
- May use dedicated/reserved/spot instances to save on EMR, along with autoscaling policies
- Security Measures between EMR and S3
- EMR always uses HTTPS to send data between S3 and EC2 — encryption in-transit
- Encrypt input data before uploading to S3 and decrypt within EMR — encryption at-rest
- Security Groups
- Two security groups are setup in EMR — one for primary and one for secondary nodes
- These two only allow communication with each other by default
- How to configure custom Hadoop settings on my cluster?
- Configure pre-defined bootstrap actions on startup
- Console/CLI/SDK — by shorthand syntax or config object in JSON file
- Amazon Redshift Architecture
- Cluster — Contains a leader node and worker nodes
- Leader node deals with schema management, warehouse metadata, query planning
- Worker nodes perform query execution and slice management
- Node — Leader or worker nodes
- Individual compute instances with storage attached
- Three types to choose from:
- DC2 — Dense compute — more compute and memory, but less storage
- Storage attached is less but faster
- 5TB — 326TB of storage
- DS2 — Dense storage — slower storage but more of it
- 64TB — 2PB of storage
- RA3 nodes — split the difference between dense compute and dense storage
- Storage backed by S3, using local storage cache to provide “hot data”
- Only pay for the managed storage that you use
- Might be slower if storage being accessed isn’t cached and needs to be retrieved from S3
- Slice
- Divides compute/memory/storage of a node into smaller separate pieces
- Each node has 2 slices by default (but this can be adjusted)
- Slices receive query scripts from the leader node, execute the script, and return the result
- Query Process
- User passes query to leader node
- Leader node generates query plan, which is passed to slices in each worker node
- Slices execute script and return results to the leader node
- Leader node aggregates and returns a final result to the user
- Use Cases for Amazon Redshift
- Warehouse vs. Lake — Data warehouses are structured, and better catalogued than data lakes, so accessing data is much faster
- Redshift’s speed, scalability, and AWS service integrations differentiate it from other warehousing solutions
- Use cases for Redshift
- Central analytics repository for performance and usage of an application
- Manage complex relational data
- Training, refining, and reinforcing machine learning models
- Table Design
- Columnar databases — access columns of data more efficiently
- So Redshift is better for OLAP — much more efficient
- Compression — many different types of compression for different data types, not super important for exam
- Sort Keys — to set a column as a sort key, its compression must be “Raw” (no compression)
- Compound sort keys for complex or hierarchical data
- Interleaved for single column JOIN data
- Distribution Styles
- Even — Round Robin, even distribution of data across slices
- Key — JOIN tables, identical data stored in same slice
- All — Small fact/static tables, all slices store this data
- Auto — Redshift-assigned auto distribution based on size of table data
- Assigns to All while table data is small
- Eventually upgrades to Even as data grows larger
- Constraints — used for query planning hints, optional
- Spectrum
- Create foreign tables from data stored in S3 using the EXTERNAL keyword for schema/table creation
- Passes through Redshift Cluster, Redshift Spectrum, Glue, Athena, and S3 under the hood
- Access control through AWS IAM or AWS Lake Formation
- Allows SELECT and INSERT (so limited read-write access)
- Disallows UPDATE and DELETE
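- A minimal sketch using the Redshift Data API to register a Spectrum table over Parquet data in S3; it assumes an external schema named spectrum_schema already exists, and the cluster, database, user, and S3 path are hypothetical:
```python
import boto3

rsd = boto3.client("redshift-data")

# EXTERNAL table definition over S3 data (read via Redshift Spectrum)
ddl = """
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id BIGINT,
    region  VARCHAR(16),
    amount  DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-datalake/sales/';
"""
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=ddl,
)
```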
- Resizing a cluster
- Classic Resize — can take hours to days
- Provisions new cluster and transfers data to it
- Cluster is available in read-only mode throughout the resize
- Elastic Resize — takes 10-20 minutes
- Creates a snapshot of existing node, and provisions a new cluster with the snapshot, and transfers all remaining data from source cluster
- Cluster is unavailable for a few minutes. Connections held open and queries are paused.
- Redshift Maintenance
- Vacuums help reclaim disk space by removing data with a delete marker
- Maintain optimal query performance
- Sort data and reindex table
- Types: Full, Sort only, Delete only, Reindex, To threshold percent, Boost
- Automatic vacuum occurs when cluster activity is low
- Deep copy quickly sorts large amounts of data, so it’s faster than a vacuum
- With original table DDL, Create Like Table, Temp table & truncate
- Backup & Restore
- Snapshots
- Point in time backup of whole cluster
- Manual/scheduled
- Retention period is configurable
- Restoring from snapshot
- Creates a new cluster, where data is lazy loaded as queries request it
- Snapshots loaded from S3 using COPY command
- Other COPY sources include EMR, DynamoDB, SSH
- Backing up a single table can be done in S3 using UNLOAD
- Backup in parquet format is efficient for Athena
- Monitoring Redshift
- Redshift Console allows for
- Cluster/node metrics, engine-specific metrics
- Query monitoring (duration, throughput, wait time)
- CloudWatch provides a more granular view, and allows combining metric graphs
- AWS Glue
- Server-less managed ETL service
- Extract, transform, and load pipeline — categorize/clean/enrich data
- Move data between data stores
- Use Cases
- Query data in S3 by crawling files
- Joining data with a warehouse using Redshift Spectrum
- Create a centralized data catalog from data stored in different locations
- Components
- Data Source — S3/DynamoDB/RDS/Redshift/EC2/Any JDBC source
- Crawler finds data schema and creates a catalog
- Finds important metadata
- Job transforms data from catalog, and can store to another location
- Data Catalog
- Persistent metadata store
- Centralized repository — one per AWS region
- Provides comprehensive auditing — tracks schema changes and data access control
- Converts semi-structured schema into a relational schema
- Glue Jobs
- Business logic to perform the extract, transform, load on the data
- Workflow
- Data is indexed in catalog
- Job processing environment pulls data, runs Scala or Python script to transform data, and stores in new store
- Jobs can be run on-demand, scheduled, or triggered by an event
- Job bookmarks process only new data that has not been processed before
- Enabled
- Disabled
- Pause — set a from/to bookmark value, and only data in between is processed
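- A minimal boto3 sketch of starting a job with bookmarks enabled; valid values for --job-bookmark-option are job-bookmark-enable, job-bookmark-disable, and job-bookmark-pause (the job name is hypothetical):
```python
import boto3

glue = boto3.client("glue")

# Process only data that previous runs have not already seen
glue.start_job_run(
    JobName="daily-clean-and-load",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```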
- Output File Formats — JSON*, CSV*, ORC, Parquet, Avro
- *Optional compression (gzip, bzip2)
- DPU — Data Processing Unit
- More DPUs make job faster if there’s a ton of data
- Up to a certain threshold, overutilizing or underutilizing DPUs makes process cost-inefficient
- Job types
- Apache Spark
- Min DPU — 2
- Max DPU — 100
- Default — 10
- Spark Streaming
- Min DPU — 2
- Max DPU — 100
- Default — 5
- Python Shell
- Min DPU — 0.0625 or 1
- Max DPU — 1
- Default — 0.0625
- Cost per DPU is generally region dependent (~$0.44/hr)
- Right-sizing DPUs
- Security
- Glue jobs run on temporary virtual resources in isolation
- Glue jobs need
- Data sources and targets in VPC scope
- IAM role
- VPC ID/subnet ID/security group access
- Traffic in/out is governed by VPC policies
- Except for API calls, which can be audited through AWS CloudTrail
- Amazon Athena
- Athena is a serverless service to query data in S3 and other data sources, scales automatically
- Standard SQL syntax
- Federated queries connect to non-S3 JDBC-compliant data sources
- Data Formats & Integration
- Unstructured, semi-structured, and structured data
- CSV, TSV, JSON, Textfiles, Parquet, ORC file formats
- Snappy, Zlib, LZO, GZIP compression
- Integrates with Quicksight and AWS Glue
- Query logs from several AWS services
- Use cases
- Ad-hoc queries and BI tools
- Join data from multiple sources
- Create ETL pipelines
- Comparing Athena to other services
- S3 Select & Glacier Select allow running queries on data in S3
- Cons: allows only simple SQL expressions, limited compression support, and only a single object can be queried at a time
- EMR & Redshift allow for large scale data processing
- But Athena allows query processing without managing servers
- Amazon QuickSight
- Create visualizations with data without managing additional infrastructure
- Source Data
- Usable data types
- Relational data from AWS sources — RDS/Aurora, etc.
- File data — CSV/TXT/JSON etc
- SaaS data from GitHub, Jira, Adobe Analytics, Salesforce, etc
- Load into SPICE (Super-fast Parallel In-memory calculation engine)
- Query data after it is loaded into SPICE
- Data stored can be reused without additional cost
- SPICE capacity is shared between all users for each AWS region
- Direct query without SPICE is possible but generally slow and inefficient for data reuse
- Permissions for Dashboard Sharing
- Specify which users have access
- Differentiate between viewers vs owners (editors)
- Can embed in website or application (only for Enterprise edition)
- Security and Authentication
- Data Encryption
- SPICE data is encrypted using AWS-managed keys
- Encryption at rest is ONLY supported for Enterprise edition
- Encryption in-transit is always available using SSL for both tiers
- All encryption keys are AWS-managed
- Connecting to AWS Resources
- Relational data
- Ensure the following mechanisms are setup correctly: Database usernames, passwords, ports, public/private resources, security group, VPC resources, NACLs
- Security group must contain an inbound rule authorizing access from the IP address range for QuickSight servers in the AWS Region: For RDS, RedShift, and EC2
- Connection paths:
- Manual connect — username must have SELECT permissions on system tables to allow QuickSight to discover table schemas and estimate table size
- Auto-discovery — The resources must be in the same region as your QuickSight account
- S3 Access
- Ensure the following mechanisms are setup correctly: IAM role access, bucket policies, manifest files (metadata)
- IAM in QuickSight
- QuickSight User Types
- Setup users using IAM credentials
- QuickSight-only user accounts using email addresses (filters through IAM)
- User Sign-On Integrations
- Free: IAM & SSO (Single-Sign On) through SAML 2.0 ("Security Assertion Markup Language")
- Enterprise Edition Only: AWS Directory Service, AWS Directory Service with AD (“Active Directory”) Connector, On-Premise AD with SSO or AD Connector, SSO using AWS SSO service or third-party federation service
- Best Practices
- Firewall — Allow HTTPS and WebSockets protocol access for QuickSight to reach databases not on AWS
- SSL — Use SSL to connect to databases, with publicly recognized certificate authorities
- Enhanced Security
- Encrypt data stored at-rest in SPICE
- Integrate AD/SSO auth
- Limit access with low-level security IAM rules
- VPC — Use VPC for data in AWS data sources, or Direct Connect to link on-premises resources to VPC and QuickSight
- Amazon Elasticsearch/OpenSearch
- Runs most of the ELK (Elasticsearch, Logstash, Kibana) stack
- Elasticsearch — organizes data as indexes rather than tables
- Logstash — ingests, processes, and stores log data
- Kibana — powerful visualization engine
- Using Elasticsearch
- Organizes data in JSON documents
- Top-level organizational unit is the index
- Type is the second-level used to categorize data
- Interface
- REST API with either GET, PUT, POST, or DELETE methods
- Anything interacting with REST API endpoint can read/update Elasticsearch
- Load data from anything that can generate JSON
- Direct Elasticsearch Service Integrations without writing code:
- Kinesis Data Firehose
- CloudWatch
- IoT
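- A minimal sketch of the REST interface — indexing and searching a JSON document with plain HTTP calls; the domain endpoint is hypothetical, and authentication (IAM SigV4 or fine-grained access control) is omitted:
```python
import json
import requests  # any HTTP client works against the REST endpoint

endpoint = "https://search-my-domain.us-east-1.es.amazonaws.com"

# PUT a JSON document into the "movies" index with id 1
doc = {"title": "Interstellar", "year": 2014}
requests.put(f"{endpoint}/movies/_doc/1", json=doc)

# Query it back with a simple search
resp = requests.get(f"{endpoint}/movies/_search", params={"q": "title:Interstellar"})
print(json.dumps(resp.json(), indent=2))
```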
- FAQ
- Security for Elasticsearch Domains
- API access controlled with IAM
- VPC security groups and ACLs secure the Elasticsearch domain
- Elasticsearch provides fine-grained IAM controls for security
- How do I backup ElasticSearch?
- Automatic/manual snapshots
- Snapshots allow migration of ES domains between regions/accounts
- Logstash and Elasticsearch
- Logstash is not included in Elasticsearch Service domains
- ES domains are a compatible backend for logstash implementations
Other Notes (from practice test questions that I got wrong)
- Secrets Manager has fully automated secret rotation for RDS, DocumentDB, and Redshift
- Windowed Queries
- Manage query queues in Redshift
- Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries. Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues
- EMR Uniform Instance Groups vs Instance Fleets
- Cannot have a cluster with both instance groups and instance fleets
- KCL/KPL retries
- KPL — Retries are automatic when using the KPL and records will be included in subsequent Kinesis Data Streams API calls, in order to improve throughput and reduce complexity while enforcing the Kinesis Data Streams record’s time-to-live value. There is no backoff algorithm, making this a relatively aggressive retry strategy. Spamming due to excessive retries is prevented by rate limiting. KPL Retries and Rate Limiting
- KCL — The KCL automatically takes care of failed process records in a shard, so you would not need to implement this logic separately.
- Since KCL takes care of tracking by passing a checkpointer when processing records, if the worker fails, the KCL will use this information to restart the processing of the shard at the last known processed record.
- Kinesis Data Streams requires the record processor to keep track of the records that have already been processed in a shard. The KCL takes care of this tracking for you by passing a checkpointer (IRecordProcessorCheckpointer) to processRecords(). The record processor calls the checkpoint method on this interface to inform the KCL of how far it has progressed in processing the records in the shard. If the worker fails, the KCL uses this information to restart the processing of the shard at the last known processed record. Developing a Kinesis Client Library Consumer in Java
- Athena Workgroups
- By default, all Athena queries execute in the primary workgroup. As an administrator, you can create new workgroups to separate different types of workloads. Administrators commonly turn to workgroups to separate analysts running ad-hoc queries from automated reports
- Redshift — Manifest to specify data files loaded using COPY commands
- Amazon S3 provides strong read-after-write consistency for COPY, UNLOAD, INSERT (external table), CREATE EXTERNAL TABLE AS, and Amazon Redshift Spectrum operations on Amazon S3 buckets in all AWS Regions. You can use a manifest to ensure that the COPY command loads all of the required files, and only the required files, for a data load. Managing Data Consistency
- Redshift — For a COPY command to be successful, two things are required:
- IAM role with appropriate S3 permissions
- Explicit COMMIT at the end of COPY command to save changes uploaded from S3; or ensuring that data is auto-committed
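- A minimal sketch of a manifest-driven COPY issued through the Redshift Data API, which commits the statement automatically; the cluster, IAM role ARN, table, and S3 path are hypothetical:
```python
import boto3

rsd = boto3.client("redshift-data")

# The manifest lists exactly which S3 files to load; the IAM role grants S3 read access
copy_sql = """
COPY sales
FROM 's3://my-load-bucket/sales/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
MANIFEST
FORMAT AS CSV;
"""
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=copy_sql,
)
```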
- Glue — Cannot process irregular files
- While Glue can process text-based files with a common delimiter, this data has irregular delimiters that will require something more customizable to handle processing.
- Because you have irregularly formatted data, you need to perform manual data transforms. So, the bulk of the work here is to map a line from a file in each bucket prefix and then apply that transformation where appropriate. After that, you can easily deliver the results to a Kinesis Firehose stream, which can have an Elasticsearch domain configured as the target.
- Athena — Process (CSV→Parquet & Partition & Compress) data in-place without using Glue
- You can process the data "in place" with just Athena. By default, CTAS queries will store the output in Parquet format, and from there it's relatively simple to create partitions and configure column compression. All of these things will improve query performance and reduce the cost of querying the data.
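- A minimal boto3 sketch of that in-place conversion with a CTAS query — Parquet output, Snappy column compression, and a partition column (database, table, and bucket names are hypothetical; partition columns must come last in the SELECT list):
```python
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE analytics.sales_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    partitioned_by = ARRAY['region'],
    external_location = 's3://my-curated-bucket/sales_parquet/'
) AS
SELECT sale_id, amount, region
FROM analytics.sales_csv;
"""
athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```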
- EMR — Calculate # of mappers
- Each file less than the mapper block size will require its own mapper
- Each file larger than the mapper block size will require (file_size/block_size) mappers
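- A quick worked sketch of that rule, assuming you round up for files that span multiple blocks (block size and file sizes are hypothetical):
```python
import math

block_size_mb = 128
file_sizes_mb = [40, 90, 700, 1300]  # mix of small and large files

# Small files need one mapper each; large files need ceil(file_size / block_size)
mappers = sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)
print(mappers)  # 40→1, 90→1, 700→6, 1300→11  => 19 mappers
```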
- EMR — Encryption Types
- EBS Encryption
- EBS Encryption — Beginning with Amazon EMR version 5.24.0, you can choose to enable EBS encryption. The EBS encryption option encrypts the EBS root device volume and attached storage volumes.
- LUKS Encryption — Linux Unified Key Setup, or LUKS encryption is supported for encrypting data on EBS volumes attached to EC2 instances in an EMR cluster.
- Open-source HDFS encryption
- EMRFS Encryption
- SSE-S3
- CSE-KMS