Ultimate AWS Certified Solutions Architect Associate Course
- Getting started with AWS
- IAM & AWS CLI
- EC2 Fundamentals
- More EC2 Topics
- EC2 Instance Storage
- High Availability and Scalability: ELB & ASG
- AWS RDS, Aurora, ElastiCache
- Route 53
- Classic Solutions Architecture Discussion
- S3 Buckets
- S3 Bucket Security
- CloudFront & Global Accelerator
- More Storage Options
- SQS, SNS, Kinesis, Active MQ
- ECS, Fargate, ECR, & EKS
- Serverless, Lambdas, DynamoDB, Cognito, API Gateway
- Serverless Architecture Discussion
- Picking the Right Databases
- Data and Analytics
- Machine Learning
- AWS Monitoring CloudWatch, CloudTrail, Config
- IAM Advanced
- KMS, SSM Parameter Store, Shield, WAF
- VPC
- Disaster Recovery and Migration
- Even More Architecture Discussion
- Some Random Services
Getting started with AWS
History
AWS was launched internally at Amazon; they then realized they could offer these services to other companies, so they launched SQS as their first public product. Later they relaunched the AWS cloud with SQS, S3 & EC2.
After initially launching only in the US, they expanded to Europe.
AWS pioneered cloud services and holds the largest market share, so learning AWS is useful!
You can build pretty much anything on AWS, and it is applicable to a diverse set of industries: backup storage, big data analytics, website hosting, or even hosting game servers.
AWS global infrastructure
AWS is a global service, available worldwide. AWS services are organized into regions and availability zones spread across the world.
AWS regions are geographical locations, each made up of a collection of availability zones. Each region has its own power, water supply, etc., so if one region fails, other regions are not impacted.
Each region contains at least three availability zones, and each availability zone is backed by one or more physical data centers. Each zone in a region has redundant, separate power and networking to reduce the likelihood of two zones failing simultaneously, and data within a region is replicated across its availability zones.
Basically, AWS divides its infrastructure into regions, which are collections of availability zones. Each region contains multiple availability zones for redundancy, and regions operate independently of each other.
How do you choose an AWS Region?
If you need to launch a new application, where should you deploy it? It depends on several factors:
- Compliance with data governance and legal requirements: data never leaves a region without your explicit permission. For example, a government may require that data about French users stays in France.
- Proximity to customers: to reduce latency.
- Available services within a region: not all regions have every service. When deploying, make sure the region you choose offers the services you need.
- Pricing: pricing varies from region to region; services tend to be more expensive in more obscure regions.
AWS Availability zones
Each region has a minimum of 3 and a maximum of 6 availability zones (usually 3).
For example, we have region ap-southeast-2
Then for availability zone we have ap-southeast-2a, ap-southeast-2b, ap-southeast-2c
Again, each availability zone is backed by one or more data centers. They each have their own power supply and networking, so there is isolation within a region as well.
The availability zones in a region are connected to each other with high-bandwidth, low-latency networking to form the region.
AWS points of presence
AWS has 216 points of presence (edge locations and regional caches) around the world.
Content is delivered to end users with lower latency.
AWS console
Some services show "Global" as their region. This means the service is global and you do not need to pick a region.
For example, Route 53 is a global service: it is not tied to any one region, so you can use it no matter which region you have selected.
IAM & AWS CLI
IAM: users and groups
Identity and Access Management. It is also a global service: identities and permissions apply across all regions, not to any one region in particular.
A root account is created by default when you register for an AWS account. It should not be used or shared; use it only to create other users with appropriate permissions, because the root account has access to everything.
Users are people within your organization, and they can be grouped.
Groups can only contain users, not other groups.
Users don't have to belong to a group, and a user can belong to multiple groups.
Why groups and users?
Users or groups can be assigned JSON documents called policies, which describe what the user/group is allowed to do.
IAM policies define the permissions of users.
By default, a user is not allowed to do anything. Apply the least-privilege principle: don't give more permissions than a user needs to complete a task. For example, if a user only needs to work on one EC2 instance, give them permission for that instance only, not for all EC2 instances.
A user that's in a group inherits the permission policies assigned to that group.
Tags
Logging in as IAM user
To log in as an IAM user, you need the account ID (or the account alias, if you created one), plus the IAM user's username and password.
For the root user you do not need the account ID; you just log in with the email address.
IAM policies inheritance
If a user is part of a group, it inherits the policies attached to that group. If a user is part of multiple groups, it inherits the policies of all of those groups.
You can also create an inline policy, which is attached directly to a single user rather than inherited from a group.
IAM policy structure
A JSON document that consists of:
- Version: policy language version
- Id: An identifier for the policy, this is optional
- Statement: one or more individual statements, this is required
Furthermore, each statement consists of:
- Sid: an identifier for the statement, optional
- Effect: whether the statement allows or denies access (Allow/Deny)
- Principal: the account/user/role to which this policy applies
- Action: list of actions this policy allows or denies
- Resource: list of resources to which the actions apply
- Condition: conditions for when this policy is in effect, optional
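As a rough sketch of what such a document looks like in practice (the policy name and bucket are made up for illustration), here is a policy created via the CLI. Note that identity-based policies attached to users/groups omit the Principal element; it is used in resource-based policies.
# create a customer-managed policy from an inline JSON document
aws iam create-policy --policy-name ExampleS3ReadOnly --policy-document '{
  "Version": "2012-10-17",
  "Id": "ExamplePolicyId",
  "Statement": [{
    "Sid": "AllowListAndRead",
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::example-bucket", "arn:aws:s3:::example-bucket/*"]
  }]
}'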
IAM password policy
Strong password = higher security for your account
You can set up policy to prevent password re-use
You should also use multi-factor authentication (MFA): a secondary device that you own to verify your identity.
Password + security device you own = good, strong security.
Even if your password is lost or stolen, your account won't be compromised, because the attacker would also need the security device that only you have.
MFA device options in AWS
- Virtual MFA device: Google Authenticator, Authy. Supports multiple tokens on a single device. An app you install to authenticate yourself.
- Universal 2nd Factor (U2F) security key, e.g. YubiKey by Yubico. It is a physical device.
- Hardware key fob MFA Device
- Hardware key fob MFA device for AWS GovCloud
AWS access keys
Access to the management console is protected by password + MFA.
CLI access is protected by access keys that you generate from your account.
The Software Development Kit (SDK) is also protected by access keys, meaning you need valid access keys to use the CLI and SDKs.
Access keys are generated via the AWS console. They should be treated as secrets; do not share access keys!
AWS CLI
A tool that allows you to interact with AWS services using commands in your command-line shell.
It gives you direct access to the public APIs of AWS services.
Alternative to management console.
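For example (a small sketch, assuming you have already generated access keys for your IAM user), you configure the CLI once on your own machine and can then call any service API:
# store the access key ID, secret access key, and default region locally
aws configure
# check which identity the CLI is using
aws sts get-caller-identity
# list your S3 buckets, same result as clicking around the console
aws s3 ls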
AWS SDK
A language-specific library that lets you access and manage AWS services programmatically. Lots of languages are supported.
Mobile SDKs, and IoT device SDKs are also available.
AWS CLI is actually built using the Python AWS SDK.
IAM Roles for services
Some AWS services need to perform actions on your behalf. To allow that, you assign permissions to AWS services using IAM roles. The service assumes the role and gets temporary credentials with the appropriate permissions to carry out whatever it needs to do on AWS.
IAM roles are just like users, but they are meant to be assumed by AWS services rather than by physical people.
Example: we create an EC2 instance and it needs to perform some actions on AWS, e.g. read some resources. We then have to assign it an IAM role to give it that permission; otherwise, the EC2 instance has no permissions at all.
Common roles: EC2 instance roles, Lambda function roles, roles for CloudFormation. But there are tons of services that support IAM roles.
IAM Security Tools
IAM Credentials report: a report that lists all your account's users and the status of their various credentials.
IAM Access Advisor: shows the service permissions granted to a user and when those services were last accessed. You can then trim their permissions according to actual usage. Remember the principle of least privilege.
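The credentials report is also reachable from the CLI; a minimal sketch:
# ask AWS to build the report, then download it (it comes back as base64-encoded CSV)
aws iam generate-credential-report
aws iam get-credential-report --query Content --output text | base64 --decode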
IAM best practices
Do not use the root account except for initial account setup (e.g. creating the first IAM users).
One physical user = One AWS user
Assign users to groups and give groups policy
Use a strong password policy + MFA!
Create and use roles for giving permissions to AWS services; otherwise, they cannot act on AWS on your behalf.
Use access keys for SDK + CLI
IAM Advanced
EC2 Fundamentals
Setting up billing alert
If you would like your IAM users to also be able to view billing and set up billing alerts, you have to enable that setting (IAM user access to billing information) under Account as the root user.
EC2
The most popular AWS service. Elastic Compute Cloud: Infrastructure as a Service (renting virtual servers).
What can you do with EC2?
- You can have your own virtual machine, EC2 instances
- You can store data on virtual drives attached to EC2 instances, called Elastic Block Store (EBS) volumes
- Distribute load across machines using an Elastic Load Balancer
- Finally, scale the service using an Auto Scaling Group (creates or terminates EC2 instances based on demand)
EC2 sizing & configuration options
- Operating system: Linux, Windows or Mac OS
- How much CPU you want
- How much RAM
- How much storage space:
- Network-attached (EBS and EFS)
- Hardware (EC2 instance store)
- Network card: how fast the network is, and what public IP address the instance gets
- Firewall rules: What traffic can go in and out
- Bootstrap script (run at first launch)/EC2 User Data
EC2 User Data
Bootstrapping means launching commands when a machine starts. With EC2 User Data you provide a bash script that runs ONLY when the EC2 instance is FIRST BOOTED; it will not run again on restarts of the same instance.
Use it to automate boot tasks like installing updates or software, downloading common files from the internet, or anything else you can think of.
It runs as the root user, so keep that in mind.
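A minimal sketch of a User Data script (the packages are just an example for Amazon Linux; swap in whatever your application needs):
#!/bin/bash
# runs once, as root, on first boot
yum update -y
yum install -y httpd
systemctl enable --now httpd
echo "<h1>Hello from $(hostname -f)</h1>" > /var/www/html/index.html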
EC2 status
You can stop an instance to stop it from running; AWS does not charge you for compute while it is Stopped. The attached EBS storage is kept, so data on disk stays intact until the next start.
You can also terminate the instance, deleting it entirely, which also deletes the attached storage if it is configured to do so (the root volume is, by default).
You can start an instance again after it is stopped: the OS boots, but the EC2 User Data script does not run again (first boot only).
Every time you stop and start an EC2 instance, it gets a new public IPv4 address! The private IPv4 address always stays the same.
EC2 instance types
There is a variety of EC2 instance types optimized for different kinds of work/use cases. AWS uses a naming convention for its instance types:
m5.2xlarge
m: the instance class
5: the generation of the instance class (improved over time)
2xlarge: the size within the instance class (how much CPU, memory, and networking capability it has)
General purpose: great for a diversity of workloads like web servers or code repositories.
They balance compute power, memory, and networking. t2.micro is a general purpose EC2 instance.
Compute optimized: Optimized for compute-intensive tasks that require high performance processors.
Great for batch processing workloads, media transcoding, high performance web servers, high performance computing, dedicated gaming servers.
Memory optimized: Fast performance for workloads that process large data sets in memory
Use it for high-performance relational/non-relational databases, distributed web-scale cache stores, and applications performing real-time processing of big unstructured data.
Storage optimized: great for storage-intensive tasks that require high, sequential read and write access to large data sets on local storage.
Use it for high-frequency online transaction processing (OLTP) systems, relational and NoSQL databases, caches for in-memory databases, distributed file systems.
Security groups
They are the firewall on EC2 instances. They let you control what kind of traffic is allowed into or out of your EC2 instances.
One EC2 instance can have multiple security groups; the rules from all of them simply add up.
They regulate access to ports and authorized IP ranges, and control inbound and outbound network traffic.
For example: you can allow inbound TCP traffic on port 443 to said EC2 instances.
Security groups only contain allow rules, and rules can reference IP ranges or other security groups (security groups can reference each other).
Additional information
You can attach a security group to multiple instances. Security groups are locked to a region/VPC combination: if you switch to another region or set up another VPC, you have to recreate the security group, as it is not carried over.
Good to maintain one separate security group for SSH access.
Referencing other security groups
You can set up security groups to reference other security groups. What does that mean? On an EC2 instance's security group you can authorize inbound traffic from Security Group 1 and Security Group 2. Any other EC2 instance that has Security Group 1 or Security Group 2 attached can then communicate directly with this instance, without you having to whitelist explicit IP addresses or add extra inbound/outbound rules.
If another EC2 instance has, say, Security Group 3 attached, it will not be able to send any traffic to this instance, since Security Group 3 is not an authorized security group.
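As a sketch (the group IDs are placeholders), authorizing Security Group 1 as a source on the current instance's security group looks like this in the CLI:
# allow HTTPS in, but only from instances that carry sg-0bbb... (Security Group 1)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaa0000000000000 \
  --protocol tcp --port 443 \
  --source-group sg-0bbb1111111111111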
Classic ports to know
port 22 is for SSH, let you log into EC2 instance
port 21 is for FTP, upload files into a file share
port 22 is also for SFTP, upload file using SSH
port 80 for HTTP, access unsecured websites
port 443 for HTTPS, access secured websites
port 3389 for RDP (Remote Desktop Protocol), log into a Windows instance
SSH EC2 instance
SSH allows you to remotely log into a machine and interact with it using the command line. The default user created on an EC2 instance (Amazon Linux) is ec2-user.
The EC2 instance doesn't use password login; only the private key can establish the SSH connection. To use the .pem private key file you would do something like: ssh -i key.pem ec2-user@<ipv4 address>
The -i option supplies the private key file for logging into the EC2 instance.
More on SSH HERE
After SSHing into EC2
If you are going to use the aws command-line tool, which is installed by default, DO NOT put your AWS credentials on the instance with aws configure. Once they are there, everybody who has access to the EC2 instance can inspect your credentials: both your access key ID and your secret access key. So do not store any personal AWS credentials on your EC2 instances.
What you should do instead is use IAM roles! You attach an IAM role to the EC2 instance (or any compatible resource) to give it permission to run certain AWS CLI commands.
To give an EC2 instance an IAM role: Actions -> Security -> Modify IAM role, then select the role you want to attach. For example, if the EC2 instance should be able to read all IAM users, give it a role that contains read-only access to IAM.
Now you can run aws iam list-users on the EC2 instance without providing any credentials, because the instance has assumed the IAM role, which gives it temporary credentials to carry out that command.
EC2 purchasing options
EC2 On Demand
This is the pay-for-what-you-use model. Linux and Windows are billed per second after the first minute; all other operating systems are billed per hour.
This has the highest cost, but there is no upfront payment and no long-term commitment.
Recommended for short-term, uninterrupted workloads where you can't predict how the application will behave.
EC2 Reserved Instances
You get a big discount compared to On-Demand. You reserve specific instance attributes (instance type, the region/availability zone you are reserving in, tenancy — whether you share hardware with other customers — and OS) over a long period of time.
Reservations can be made for 1 or 3 years, with 3 years offering the biggest discount.
You can pay no upfront, partial upfront, or all upfront; paying all upfront nets you the biggest discount.
Recommended for steady-state applications like databases.
You can also buy or sell reservations in the Reserved Instance Marketplace if you no longer need the instances but still hold the reservation.
Convertible reserved instance
Another type of reserved instance that allows you to change the EC2 instance type, instance family, OS, scope and tenancy.
EC2 Savings Plans
You get a discount based on long-term usage. You commit to a certain amount of usage (e.g. $10/hour for 1 or 3 years); any usage beyond the Savings Plan commitment is billed On-Demand.
With a Savings Plan you are locked to a specific instance family and AWS region, e.g. the M5 family in us-east-1, but you can switch between instance sizes (m5.xlarge to m5.2xlarge) and freely change the OS and tenancy.
EC2 Spot Instance
These give you the biggest discount compared to On-Demand, but they are instances you can lose at any point in time if the maximum price you are willing to pay drops below the current spot price. It's like an auction: as long as you outbid the spot price you keep the instance; otherwise, you lose it.
This is the most cost-efficient option in AWS.
Recommended for workloads that are resilient to failure: batch jobs, data analysis, image processing, distributed workloads, workloads with a flexible start and end time.
NOT RECOMMENDED FOR CRITICAL JOBS OR HOSTING DATABASE.
EC2 Dedicated Host
A physical server with EC2 instance capacity fully dedicated to your use; this is the most expensive option.
Recommended for compliance requirements and for using your existing server-bound software licenses. It helps with compliance because you get visibility and control over the underlying physical server, so companies can better satisfy regulatory requirements.
This is basically giving you your own server.
For dedicated host, you can buy it on-demand or can also do reserved.
EC2 Dedicated Instance
Instances run on hardware that is dedicated to you, but you may share that hardware with other instances in the SAME account.
You have no control over where the instance is placed.
With both options you get dedicated hardware to host your EC2 instances. However, with a Dedicated Host the physical server stays the same (you are literally renting the server), whereas with Dedicated Instances an instance may land on a different dedicated server after a stop/start; it doesn't have to be the same one.
With a Dedicated Host you pay per host, while with Dedicated Instances you still pay per instance.
EC2 Capacity Reservations
You can reserve On-Demand instance capacity in a specific availability zone for any duration.
You are guaranteed that the capacity will be available when you need it.
There is no time commitment (you can create or cancel the reservation at any time), but there is also no billing discount.
You have to combine it with Regional Reserved Instances or Savings Plans to actually get billing discounts.
While the reservation is in effect, you pay the On-Demand price whether you run the instances or not.
Recommended for short-term, uninterrupted workloads that need to be in a specific availability zone.
More on spot instances
You define a max spot price you are willing to pay; as long as the current spot price stays below your max price, you keep the instance.
The current spot price changes over time based on offer and capacity. If the spot price rises above what you are willing to pay, you get a 2-minute grace period during which your instance is either stopped or terminated (your choice). Stopping lets you resume the instance with its state once you get capacity again; terminating means starting fresh with a new instance when you regain a spot.
Another strategy was Spot Blocks: blocking a spot instance for 1-6 hours without interruption so it would not be reclaimed, but this is no longer supported.
How to terminate spot instances
You first create a spot request (the maximum price you are willing to pay, the desired number of instances, the launch specification, whether it is one-time or persistent, and the date range for the request).
If it is one-time: as soon as the spot instances are launched, the request is fulfilled and goes away.
If it is persistent: if your spot instances get stopped or interrupted, the request remains and will try to claim spot instances again. The request stays active for the specified date range.
You can only cancel spot requests that are open, active, or disabled.
Cancelling a spot request does not terminate any spot instances! You must first cancel the spot request and then terminate the associated spot instances.
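A sketch of the request/cancel/terminate flow with the CLI (the IDs and the launch specification file are placeholders):
# create a persistent spot request for one instance with a max price of $0.05/hour
aws ec2 request-spot-instances --instance-count 1 --type persistent \
  --spot-price "0.05" --launch-specification file://spec.json
# cancel the request first...
aws ec2 cancel-spot-instance-requests --spot-instance-request-ids sir-example1
# ...then terminate the instance it launched, otherwise it keeps running
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0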
Spot fleet
A way to get a set of spot instances + optional on-demand instances.
The spot fleet will try to meet the target capacity within your price constraints.
How it works: you define a set of launch pools; a pool consists of an instance type, OS, and availability zone. You define multiple pools so the fleet can choose among them depending on the strategy.
The spot fleet stops launching instances when it reaches the target capacity you defined or the max cost.
Strategies the spot fleet can use:
- lowestPrice: launch from the pool with the lowest price (cost optimized, good for short workloads)
- diversified: launch instances spread across all the pools you have defined (good for availability, long workloads)
- capacityOptimized: launch from the pool with the largest amount of available capacity
Ultimately, spot fleets let you automatically request spot instances at the lowest price once you have defined the pools, since the fleet picks the cheapest instances for you.
More EC2 Topics
Private vs Public vs Elastic IP
Networking has two kinds of internet addresses: IPv4 and IPv6.
A public IP address means the host can be reached over the internet. Public IPs are unique across the internet.
A private network has its own range of private IP addresses that cannot be reached from the public internet. The network is exposed to the outside world only through an internet gateway (with a public IP), which allows communication with the internet. Hosts inside the private network can still talk to each other using their private IP addresses.
Two different private networks can use the same private IPs; that is perfectly fine.
Elastic IPs
When you stop and start an EC2 instance, its public IP changes. If you need a fixed public IP, you need an Elastic IP. You attach it to one EC2 instance, and as long as you don't release it, you own that public IPv4 address.
So even if you stop and restart your EC2 instance, it keeps the same Elastic IP.
You can only have 5 Elastic IPs per account; avoid them if possible. Prefer using a random public IP with a DNS name pointing to it.
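A sketch of allocating and attaching an Elastic IP with the CLI (the IDs are placeholders):
# allocate a new Elastic IP in the account
aws ec2 allocate-address --domain vpc
# associate it with an instance; the instance keeps this public IP across stop/start
aws ec2 associate-address --instance-id i-0123456789abcdef0 --allocation-id eipalloc-0123456789abcdef0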
Placement groups
Use placement groups when you want some control over how EC2 instances are placed within the AWS infrastructure. You don't get direct access to the hardware placement, but you can give AWS a placement strategy.
You can specify three strategies:
- Cluster: clusters instances into a low-latency group in a single availability zone
- Spread: spreads instances across distinct underlying hardware (max 7 instances per group per AZ); use this for critical applications
- Partition: similar to spread, but instances are spread across many partitions (each backed by a different set of racks) within an AZ. Scales to 100s of EC2 instances per group.
Cluster
The EC2 instances are on the same rack (same hardware) in the same availability zone.
Gives you super low latency and high bandwidth (10 Gbps network) between instances.
Cons: if the rack fails, all instances fail at the same time. Big risk.
Recommended for big data jobs that need to complete fast, and applications that need extremely low latency and high network throughput between instances.
Spread
All EC2 instances are located on different hardware. They can be spread across different availability zones.
This reduces the risk of simultaneous failure, since the EC2 instances are on different physical hardware.
But you are limited to 7 instances per availability zone per placement group.
Recommended for applications that need high availability, where failures need to be isolated from each other.
Partition
Up to 7 partitions per availability zone. Each partition can contain many EC2 instances, and each partition corresponds to a rack of servers.
Up to 100s of EC2 instances.
A partition failure won't affect the other partitions.
Recommended for big data applications.
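Creating a placement group and launching into it is a two-step operation; a sketch (the names and AMI ID are placeholders):
# create the group with one of the three strategies
aws ec2 create-placement-group --group-name my-cluster-pg --strategy cluster
# launch instances into that group
aws ec2 run-instances --image-id ami-0123456789abcdef0 --count 2 \
  --instance-type c5.large --placement GroupName=my-cluster-pg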
Elastic Network Interface
A logical component in a VPC that represents a virtual network card. ENIs give EC2 instances access to the network (internet and the private network) and hold the instance's public and private IPs.
ENI can have attributes:
- Primary private IPv4, one or more secondary IPv4
- One Elastic IP (or one public IPv4) per private IPv4
- One or more security groups, plus a MAC address
ENIs can be created independently and attached on the fly (moved from one EC2 instance to another).
They are bound to a specific availability zone.
Good for failover: just move the ENI (and the private IP associated with it) to another EC2 instance to cover for the failed instance.
In addition, when you terminate an EC2 instance, the default ENI attached to it is deleted too. ENIs you create yourself and attach remain after the instance is deleted, and you have more control over the private IP with your own ENI.
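A sketch of the failover idea with the CLI (subnet, security group, and instance IDs are placeholders):
# create a standalone ENI holding a fixed private IP
aws ec2 create-network-interface --subnet-id subnet-0123456789abcdef0 \
  --groups sg-0123456789abcdef0 --private-ip-address 10.0.1.50
# attach it to the healthy instance as a secondary interface (device index 1)
aws ec2 attach-network-interface --network-interface-id eni-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 --device-index 1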
EC2 Hibernate
A state for EC2 in which the in-memory (RAM) state is preserved. The instance boots much faster because the OS is not fully stopped; whatever is in RAM is written to a file on the root EBS volume.
When the instance is started again, RAM is restored from that file, so in effect it is as if the instance was never stopped; its state was just restored.
Recommended for long-running processing, saving the RAM state, or services that take a long time to initialize.
Hibernation is supported for many instance families, with a limit on RAM size. The root volume must be an encrypted EBS volume.
Available for on-demand, reserved, and spot instances.
You cannot hibernate for more than a certain period of time: 60 days as of recording.
EC2 Instance Storage
EBS Volume
An Elastic Block Store volume is a network drive you can attach to your EC2 instance while it runs. It is a provisioned volume: you define in advance how many GB of EBS storage you want and the IOPS.
It is storage for your EC2 instances, like an SSD/hard drive for your virtual machine: it stores data and persists it just like a normal drive, except it is connected over the network rather than physically.
An EBS volume can be mounted to only one instance at a time, and it is bound to a specific availability zone: you cannot attach an EBS volume in us-east-1a to an EC2 instance in us-east-1b. To move a volume to another availability zone you have to snapshot it first.
They can be detached and attached to another EC2 instance quickly, compared to a physical drive you would have to unplug physically.
You will be billed for the provisioned capacity.
Delete on termination attribute
By default, the root EBS volume of an EC2 instance is deleted on termination (this option is checked by default).
By default, any other attached EBS volumes are not deleted on termination (this option is unchecked by default).
You can choose to preserve the root EBS volume by unchecking the box if you want to keep its data.
EBS snapshots
A backup of your EBS volume at a point in time. You don't need to detach the volume to take a snapshot, but it is recommended.
You can then copy snapshots across AZs or regions.
You can also recreate an EBS volume from a snapshot, which contains all the data from that snapshot. This is how you copy data across regions/AZs.
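A sketch of the snapshot / copy / restore flow (the IDs and regions are placeholders):
# snapshot the volume
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "backup before migration"
# copy it to another region (run against the destination region)
aws ec2 copy-snapshot --region eu-west-1 --source-region us-east-1 \
  --source-snapshot-id snap-0123456789abcdef0
# recreate a volume from the copied snapshot, in whichever AZ you need
aws ec2 create-volume --availability-zone eu-west-1a --snapshot-id snap-0fedcba9876543210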
EBS snapshot archive
You can move a snapshot to an "archive tier" that is 75% cheaper; the trade-off is that restoring from the archive takes 24 to 72 hours. Cheap storage, but slow and more expensive retrieval.
EBS recycle bin
You can set up rules to retain deleted snapshots for anywhere from 1 day to 1 year, so you can recover them in case of accidental deletion.
Fast snapshot restore
Forces full initialization of the snapshot so there is no latency on first use. Recommended if your snapshot is large and needs to be fast on first access, but it costs a lot of money.
EBS volume type
When you pick EBS you get to choose the type of hardware backing your storage. Currently there are six types:
- gp2/gp3 SSD: general purpose SSD volumes. Balance price and performance
- io1/io2 SSD: highest-performance SSD for mission-critical, low-latency or high-throughput workloads. Use this if your EC2 instance runs an important workload that needs the IO speed
- st1 HDD: low-cost hard disk drive designed for frequently accessed, throughput-intensive workloads
- sc1 HDD: lowest-cost hard disk drive for less frequently accessed workloads
Only gp2/gp3 and io1/io2 can be used as root (boot) volumes.
Throughput = how fast your storage can read/write data, measured in MB/s: how much data you can transfer to or from the disk per second.
IOPS = input/output operations per second: the number of read/write operations your storage can perform per second. If this number is small, then under heavy use the drive will not keep up with the requests, e.g. when someone is reading a picture from the drive at the same time another person is writing to the disk.
General purpose SSD (gp2/gp3)
It is cost effective storage with low latency.
1 GiB - 16 TiB capacity.
gp3 is the newer volume type (gp2 is the older one). With gp3 you can increase IOPS up to 16,000 and throughput up to 1,000 MiB/s, independently of each other.
With gp2, the volume size and IOPS are linked: 3 IOPS per GB, with a maximum of 16,000 IOPS.
Provisioned IOPS SSD (io1/io2)
This is for critical business applications that require the HIGHEST IO performance, or that need more than 16,000 IOPS.
Good for database workloads that are sensitive to input/output performance.
4 GiB - 16 TiB capacity
The IOPS for io1/io2 are capped at 32,000 on normal EC2 instances, but on Nitro EC2 instances they go up to 64,000 IOPS.
Nowadays there is little reason to pick io1, since io2 offers more durability and more IOPS per GiB at the same price.
io2 Block Express: gives you even more IOPS, up to a maximum of 256,000, with sub-millisecond latency.
io1/io2 volumes support multi-attach!
Hard disk drives
Cannot be used as boot volume
125 GiB - 16 TiB
Throughput Optimized HDD is st1, with a max of 500 IOPS. Good for big data, data warehouses, log processing: dumping large amounts of data.
For data that is infrequently accessed, use Cold HDD (sc1). Max IOPS is 250, and it gives you the lowest cost.
EBS Multi-attach
This feature is only available for the io1/io2 family. It lets you attach the same EBS volume to multiple EC2 instances in the same availability zone (the EBS volume must, of course, be in that same availability zone).
Recommended for higher application availability in clustered Linux applications, or when your application needs to do concurrent writes.
You are limited to 16 EC2 instances at a time. No more!
EBS encryption
If your EBS is encrypted then you get the following:
- Data is encrypted inside the volume
- Data in transit to and from an instance is encrypted
- All snapshots of the volume are encrypted
- All volumes created from the snapshot are encrypted as well
Encryption leverages keys from KMS and has minimal impact on performance.
Copying an unencrypted snapshot allows encryption: create an EBS snapshot of the unencrypted volume -> copy the snapshot with the Copy snapshot function and enable encryption -> create a new EBS volume from the now-encrypted snapshot.
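The same copy-snapshot operation can perform the encryption step; a sketch (the snapshot ID and key alias are placeholders):
# copy the unencrypted snapshot and encrypt the copy with a KMS key
aws ec2 copy-snapshot --source-region us-east-1 \
  --source-snapshot-id snap-0123456789abcdef0 \
  --encrypted --kms-key-id alias/my-ebs-key
# any volume created from the encrypted copy is encrypted automatically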
EFS
Elastic File System. It is a managed network file system (NFS — a shared file system, not a network-attached block device like EBS!) that can be mounted on many EC2 instances at the same time.
EFS works with EC2 instances across different availability zones.
It is highly available and scalable, but more expensive than EBS. It has a pay-per-use model: you pay for the GiB you actually use, with no capacity planning.
Recommended for content management, web serving, data sharing between EC2 instances, and WordPress.
A security group is used to control access to the EFS. EFS is only compatible with Linux-based AMIs (no Windows machine can access it). You can enable encryption at rest with KMS, and it behaves as a standard POSIX file system on Linux.
When you create an EFS you attach security groups to it; EC2 instances whose security group is authorized can then mount it. Remember: in a security group you can define explicit traffic rules or reference another security group; if the EC2 instance is in that referenced group, it is allowed access.
EFS scales automatically, is pay-per-use, and you don't need to provision capacity.
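On the EC2 side, mounting an EFS file system looks roughly like this (the file system ID is a placeholder; Amazon Linux provides the amazon-efs-utils mount helper):
# install the EFS mount helper and mount the file system
sudo yum install -y amazon-efs-utils
sudo mkdir -p /mnt/efs
sudo mount -t efs fs-0123456789abcdef0:/ /mnt/efs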
EFS configurations
It supports 1000s of concurrent NFS clients with 10 GB+/s of throughput, and can grow to petabyte scale automatically.
There are performance modes you can set:
- General purpose (default): latency-sensitive use cases such as a CMS (WordPress) or a web server
- Max I/O: higher latency, but higher throughput and highly parallel. Good for big data and media processing
There are throughput modes you can set:
- Bursting: 1 TB = 50 MiB/s baseline + bursts of up to 100 MiB/s. Throughput scales with the file system size.
- Provisioned: set your throughput regardless of storage size, e.g. 1 GiB/s for 1 TB of storage
- Elastic: automatically scales throughput up or down based on your workload, good for unpredictable workloads
There are storage tiers:
- Standard: for frequently accessed files
- Infrequent Access (EFS-IA): lower price to store, but it costs money to retrieve files; you enable it with a lifecycle policy
You set up a lifecycle policy for files, e.g. move a file into EFS-IA after N days without access.
Availability and durability:
- Standard: EFS can be set up multi-AZ, great for production
- One Zone: a single AZ, good for development; backup enabled by default; compatible with IA (EFS One Zone-IA) for big discounts
EBS vs EFS
- EBS can only attach to one instance at a time. The only exception is io1/io2 multi-attach, which is still limited to 16 EC2 instances
- EBS volumes are locked to a specific availability zone; you need a snapshot to replicate across AZs
- The root EBS volume gets terminated by default when the EC2 instance is terminated
- EFS can be mounted by 100s of EC2 instances, even across AZs!
- Good for sharing website files, but only for Linux EC2 instances (POSIX)
- EFS costs more than EBS, since it is more advanced and multi-AZ, but you can use EFS-IA for cost savings on infrequently accessed files
EC2 instance store
There is another type of block storage for an EC2 instance: the EC2 instance store. EBS volumes are network-attached drives, so their performance is limited by the network (though still good). The instance store, by contrast, is a physical disk attached to the server the instance runs on.
The EC2 instance store has better I/O performance than EBS.
But the caveat is that the instance store is ephemeral: the data on it is lost when you stop or terminate the EC2 instance.
The EC2 instance store is recommended for buffers / caches / scratch data / temporary content. It is NOT for long-term storage!
Backup and replication are YOUR responsibility if you choose the EC2 instance store.
AMI
Stands for Amazon Machine Image. An AMI is like a template of a machine's root drive: it contains the OS and software, all pre-packaged into an image.
When you launch an EC2 instance you launch it from an AMI, which initializes the root volume with the OS and software packaged in that AMI.
You can use a public AMI provided by AWS, build and maintain your own AMI, or use an AWS Marketplace AMI that someone else created and sells. An AMI you build is tied to a specific region, but you can copy it across regions.
Custom AMI creation process
To make your own AMI, here is the process:
- Start an EC2 instance and customize it, i.e. install the specific software you want your AMI to contain
- Right-click the EC2 instance once you're done customizing and choose Image and templates -> Create image
- To launch an EC2 instance from your own AMI, select it from the My AMIs tab
Again, the AMI has all the software pre-installed, so you can go on your way without installing it yourself.
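The same create/copy flow via the CLI, as a sketch (the IDs and names are placeholders):
# build an AMI from the customized, running instance
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "my-webserver-ami"
# copy the AMI to another region if needed (run against the destination region)
aws ec2 copy-image --source-image-id ami-0123456789abcdef0 \
  --source-region us-east-1 --region eu-west-1 --name "my-webserver-ami"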
High Availability and Scalability: ELB & ASG
Scalability
There are two types of scalability: vertical scalability and horizontal scalability.
Vertical scalability/scalable
It means increasing the size of a single instance to meet workload growth; it tends to be a more permanent change.
For example, think of a call center where the instances are the operators taking calls. A junior operator can take 5 calls per minute; to scale the operator vertically, we hire a senior operator who can take 10 calls per minute instead.
For a pizza chain, vertical scaling would mean making one restaurant bigger (a larger kitchen, more ovens) so it can serve more customers; opening additional restaurants would be horizontal scaling instead.
For AWS, this means switching from a t2.micro to a t2.large to meet workload demand.
Vertical scaling is common for non-distributed systems such as a database, and there is usually a hardware limit to how far you can scale vertically.
Horizontal scalability/elasticity
You increase the number of instances to meet peaks in workload demand.
Going back to the call center example: instead of hiring a senior operator, you just hire more junior operators to handle the calls.
Horizontal scaling implies a distributed system.
High availability
You run your application/system in at least 2 data centers (availability zones). The goal of high availability is to survive a data center loss: if one instance goes down, you can keep running or recover quickly using the instance(s) in the other data center.
High availability can be passive or active: an exact standby replica of the application, or a smaller replica that you scale up after a disaster.
High availability & scalability for EC2
You can increase the instance size to get more hardware capability: vertical scaling.
You can increase the number of identical instances: horizontal scaling. This can be done automatically by an Auto Scaling Group together with a load balancer.
Finally, you run instances of the same application across multiple availability zones, so that if one zone goes down, the EC2 instances in the other availability zones are fine. This is high availability.
ELB
Elastic Load Balancer. A load balancer is a server that distributes traffic to downstream servers (the EC2 instances that actually handle the requests) in order to avoid overloading any single instance with too many requests.
You use a load balancer to spread load across multiple downstream instances. It exposes a single point of access (DNS name) to your application.
Failures of downstream instances are handled gracefully because ELB has built-in health checks. It also provides SSL termination (HTTPS) for your websites.
AWS's Elastic Load Balancer is managed, so you don't have to worry about upgrades or making it highly available; AWS takes care of that. It costs more than setting up your own load balancer, but it is far less effort.
Health checks
A way for the ELB to verify whether an EC2 instance is working properly. If an instance is not healthy, the load balancer will not forward traffic to it.
The health check is done on a port and a route on the EC2 instance, e.g. port 5677, endpoint /health: if a successful response comes back from that endpoint, the instance is healthy; otherwise, it is not.
Types of load balancer
There are 4 kinds of managed load balancers.
Classic load balancer
This is the old generation of load balancer; AWS no longer recommends using it.
Application load balancer (ALB)
Supports HTTP, HTTPS, WebSocket.
Target groups: they can contain EC2 instances, ECS tasks, Lambda functions, or IP addresses (must be private, e.g. your own on-premises servers). They designate the set of resources the ALB routes traffic to.
An ALB can route to multiple target groups, not just one. Health checks are done at the target group level.
You can route traffic to different target groups based on path, hostname, or even query strings and headers.
ALBs are great for microservices & container-based applications.
When you create an ALB you get a fixed hostname. In addition, the application servers (the target group instances) do not see the client's IP directly; it is inserted in the X-Forwarded-For (plus X-Forwarded-Port and X-Forwarded-Proto) headers, because the load balancer, not the client, makes the request to the backend. If the application server wants to know the client's IP, it has to inspect those headers.
When you create an ALB, you define a security group for what kind of traffic is allowed into the load balancer. The ALB then forwards traffic to the EC2 instances; those instances in the target group should only accept traffic from the ALB, which means their security group should reference the security group the load balancer uses.
In addition, you can define different routing rules based on hostname, query string, and path to send traffic to different target groups.
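A sketch of wiring up an ALB, a target group with a /health check, and a listener with the CLI (all IDs/ARNs are placeholders):
# target group with a health check on /health
aws elbv2 create-target-group --name my-tg --protocol HTTP --port 80 \
  --vpc-id vpc-0123456789abcdef0 --health-check-path /health
# the ALB itself, in at least two subnets, with its own security group
aws elbv2 create-load-balancer --name my-alb --type application \
  --subnets subnet-aaa111 subnet-bbb222 --security-groups sg-0123456789abcdef0
# register instances and forward listener traffic to the target group
aws elbv2 register-targets --target-group-arn <tg-arn> --targets Id=i-0123456789abcdef0
aws elbv2 create-listener --load-balancer-arn <alb-arn> --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn=<tg-arn>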
Network load balancer (NLB)
Supports TCP, TLS, and UDP. Instead of only forwarding HTTP/HTTPS, you can forward TCP/UDP traffic: one layer down the network stack (layer 4), dealing with lower-level network traffic.
The network load balancer is very high performance: it handles millions of requests per second with lower latency than the ALB.
It has one static IP per AZ and supports assigning Elastic IPs. This means that if you use a network load balancer for your application, you end up with one IP address per availability zone it is deployed in within a region.
Target groups for an NLB can contain EC2 instances, IP addresses (must be private IPs, e.g. on-premises servers again), or an ALB (you can front an ALB with an NLB).
Why would you front an ALB with an NLB? You get fixed IP addresses (from the NLB) while still using the ALB to route the HTTP traffic.
Health checks performed on NLB target groups support the TCP, HTTP, and HTTPS protocols.
When you create a network load balancer, you don't define a security group, meaning traffic is routed straight through to the EC2 instances; the security group attached to the EC2 instances is what takes effect, so it should contain rules that accept the public traffic arriving through the NLB.
Furthermore, the network load balancer provides both a DNS name and a static IP per availability zone you deploy it in. The static IPs matter because many non-HTTP clients and firewall rules need to target or whitelist fixed IP addresses rather than hostnames; the DNS name is still provided (like with the ALB) and resolves to those per-AZ IPs.
Gateway load balancer (GWLB)
Operates at layer 3 (network layer): it looks at the IP packets.
How does it work? You deploy 3rd-party security virtual appliances; these can be EC2 instances that inspect the traffic and tell you whether it is malicious: intrusion detection, deep packet inspection, checking whether the payload has been manipulated, and so on.
You then configure your route tables so traffic goes through the gateway load balancer, which forwards it to the target group of security appliances. They examine and analyze the traffic; if it is good, it goes back to the gateway load balancer, which finally forwards it to the actual application.
It uses the GENEVE protocol on port 6081.
Target groups can contain EC2 instances or IP addresses: again, the 3rd-party security virtual appliances or on-premises virtual appliances.
Security group for load balancer
Users access the load balancer over HTTP/HTTPS from anywhere. This means the security group attached to the load balancer should accept ports 80 and 443 from any IP range.
The EC2 instances, on the other hand, are fronted by the load balancer and should not accept any external traffic except from the load balancer itself. So the security group of the EC2 instances should reference the load balancer's security group. Remember: if you reference a security group, any traffic coming from an entity carrying that security group is accepted; in this case, only traffic forwarded by the load balancer is accepted by the EC2 instances.
Sticky sessions
Sticky sessions make the load balancer redirect a given client to the same instance behind the load balancer. For example, if Client A is routed by the load balancer to EC2 instance 1, its next requests will also be routed to EC2 instance 1 rather than to another instance by chance.
You do this to persist session data, such as login/authentication state, at the cost of potentially creating imbalance in the load across instances.
Sticky sessions are implemented with a cookie.
There are two types of cookies: application-based cookies, which you generate yourself and can include custom attributes, and duration-based cookies, generated automatically by the load balancer.
Cross-zone load balancing
This feature is enabled by default for application load balancers: traffic is distributed evenly across all registered targets, regardless of their availability zone.
For example, say an ALB has targets in availability zone A and other targets in availability zone B. Traffic is distributed evenly across all the instances as if they were in a single availability zone, regardless of where each instance lives.
https://wiki.tamarinne.me/uploads/images/gallery/2023-02/Sxsimage.png
If you turn this feature off, traffic is first split between availability zones, and then distributed among the instances within each availability zone.
https://wiki.tamarinne.me/uploads/images/gallery/2023-02/Q33image.png
For the ALB you are not charged for inter-AZ data when this feature is on (and it is on by default).
However, for NLB and GWLB, if you enable this feature you will be charged for data that crosses availability zones.
SSL/TLS certificates
An SSL certificate allows traffic between the client and the load balancer to be encrypted in transit.
TLS is the newer version of SSL, but people still refer to it as SSL. SSL certificates are issued by certificate authorities.
How does this come into play for a load balancer? The user connects to the load balancer over HTTPS, which is encrypted; the load balancer then talks to the EC2 instances over plain HTTP inside the private VPC, where encryption is not needed.
Server name indication
How do you load multiple SSL certificates onto one web server? How do you host example.com, foo.com, and bar.com all on one server? Back in the day it was one hostname per IP; nowadays servers support virtual hosts, meaning many different hostnames can be served from one machine.
How is this achieved? Via Server Name Indication (SNI): the client indicates the hostname it wants to reach in the initial SSL handshake, and the web server then fetches the corresponding SSL certificate for the hostname the client asked for.
SNI on the ALB lets you host many SSL certificates, so traffic is terminated with the right certificate and directed to the correct EC2 instances.
The Classic Load Balancer supports only one SSL certificate. ALB and NLB support multiple SSL certificates using SNI.
Connection draining
Also called deregistration delay: it gives in-flight requests (requests already sent) time to complete while the instance is de-registering or unhealthy, and stops sending new requests to the de-registering EC2 instance. New requests are not sent to a "draining" instance.
You can set the length of the draining period: use a low value if your requests are short, and a higher value if your requests are long-lived; the trade-off is that your EC2 instances take longer to drain.
Auto scaling group
Lets you scale horizontally, out or in, automatically to match the workload: it adds EC2 instances when load increases and removes them when load decreases, all handled by the Auto Scaling Group.
You set a minimum and maximum number of running EC2 instances. New instances are automatically registered with the load balancer. The ASG also terminates and recreates unhealthy EC2 instances for you (the ELB performs the health checks and passes the results to the Auto Scaling Group, which then terminates the unhealthy ones).
When you create an Auto Scaling Group, it contains a configuration similar to an EC2 instance (a launch template) plus the load balancer you pair it with. The instances can be launched across multiple AZs; the ASG automatically balances them across the AZs you chose.
You also pair the Auto Scaling Group with CloudWatch alarms to trigger scale-out or scale-in events.
Dynamic scaling policies
There are three types:
- Target tracking scaling: the easiest to set up; for example, you can ask for the average ASG CPU utilization to stay at around 40% (see the sketch after this list)
- Simple / step scaling: requires you to set up your own CloudWatch alarms on a metric, such as CPU utilization. If it is > 70% add X units; if it is < 30% remove X units
- Scheduled actions: scaling based on known usage patterns; if you know you get a big spike of requests every Friday, you can increase the minimum capacity on Fridays
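The target tracking example from the first bullet, as a CLI sketch (the ASG and policy names are placeholders):
# keep the group's average CPU utilization around 40%
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name keep-cpu-at-40 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 40.0
  }'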
Predictive scaling
It forecasts the request load and schedules scaling ahead of time, using machine learning, which is pretty cool!
Metrics
A good metric to scale on is CPU utilization: if it is high, your instances are most likely doing a lot of work.
RequestCountPerTarget: target a specific number of requests per EC2 instance, say 3 requests on average.
Average network in/out: you can also scale on the amount of data going into or out of the EC2 instances.
Every time a scaling activity occurs there is a cooldown (default 5 minutes). During the cooldown the ASG will not launch or terminate instances; this allows the metrics to stabilize.
AWS RDS, Aurora, ElastiCache
AWS RDS
Relational Database Service. It is a managed database service that provides databases using SQL as the query language, with AWS managing them for you.
Engines you can create: Postgres, MySQL, MariaDB, Oracle, Microsoft SQL Server, and AWS Aurora (AWS's proprietary database).
Why RDS over Database on EC2
RDS is managed for you, so you don't have to deploy the database yourself on an EC2 instance. You get automated continuous backups and point-in-time restore.
A monitoring dashboard is provided, read replicas for improved read performance, Multi-AZ setup for disaster recovery, and maintenance windows for upgrades.
You can scale vertically by increasing the instance size and horizontally by adding read replicas. Storage is backed by EBS (gp2 or io1).
However, you can't SSH into the underlying EC2 instance; you can only interact with the database through a database client.
Storage auto scaling
When you first provision RDS you specify the initial storage capacity; with this feature, RDS detects when you are running out of database storage and scales it up automatically.
You have to set a Maximum Storage Threshold (the maximum size that storage auto scaling is allowed to grow the storage to).
It then automatically increases the storage if the free space is < 10% of allocated storage, the low-storage condition lasts at least 5 minutes, and at least 6 hours have passed since the last modification.
This feature is good if you can't predict the amount of storage you will be taking up.
Read replicas
Read replicas are used for read scalability.
Your application talks to an RDS database instance, and most of that traffic is read requests, lots of them. The database can get overwhelmed, so you set up read replicas: copies of the main database instance that serve read requests only. They cannot take write operations, since they replicate from the original.
Read replicas can be set up in the same AZ, across AZs, or even across regions.
Replication is asynchronous, so reads are eventually consistent: the replicas catch up over time, which means an application reading from a replica may see slightly stale data, but it will eventually reflect the latest changes.
A replica can be promoted to its own standalone database, at which point it can take writes.
Use cases for read replicas
A good use case: you want to run analytics against the production database. If you query the production database directly, you might overload it while the production application is also using it.
Instead, you create a read replica of the production database and point the analytics application at the replica; that way the production application is not slowed down or affected.
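Creating such a replica is a one-liner; a sketch (the identifiers are placeholders):
# spin up a read replica of the production instance for the analytics workload
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-db-analytics-replica \
  --source-db-instance-identifier prod-db
# later, if ever needed, promote it to a standalone writable database
aws rds promote-read-replica --db-instance-identifier prod-db-analytics-replica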
Network cost
Normally there is a network cost when data moves from one AZ to another, but there are exceptions depending on the service, for example cross-zone load balancing on an ALB.
For RDS read replicas within the same region, you don't pay a cross-AZ network fee. For example, if your main RDS instance is in us-east-1a and your read replica is in us-east-1b, the asynchronous replication traffic crosses AZs, but because it stays within the same region you don't pay for it.
However, if your replica is in another region, for example eu-west-1b instead of us-east-1b, then you will pay a replication network fee.
Multi AZ
This is for disaster recovery mainly.
You have your application that talks to the main database instance like usual, then you set up another RDS DB as a standby in another AZ and data is going to be replicated SYNCHRONOUSLY, meaning every write on the master database will be also pushed to the standby instance as well.
Then with multi AZ database set up you will get one DNS name which behind it has two instances of RDS database, one that's the master and the other as standby, if the master database failed due to AZ failure, the DNS name will automatically switch over to the standby instance.
This increases overall availability, and this is completely transparent to the application because they just need to talk to the DNS name that it is fronted with.
This is not for scaling since the standby instance is just for backup. You can also set up your read replicas as Multi-AZ for disaster recovery; that is possible.
From single AZ to multi AZ
To change from single AZ to Multi-AZ there is no need to stop the database; you just need to modify it.
In the background, a snapshot of the original database is taken and a new standby instance is restored from that snapshot. Then synchronous replication is established between the two databases for the Multi-AZ setup.
RDS Custom
This is only for managed Oracle and Microsoft SQL Server databases. It lets you access the underlying EC2 instance that AWS manages for the RDS instance.
This is so that you can configure settings, install patches, enable native features, and SSH into the EC2 instances that you normally don't get to see when you create an RDS instance.
You should disable automation mode while you are performing customization on the underlying EC2 instance, and take a snapshot just in case you wreck your database.
Amazon Aurora
Aurora is AWS's proprietary database, but it is compatible with Postgres and MySQL: connecting to an Aurora DB, you can query it as if it were a Postgres or MySQL database.
AWS claims it is cloud optimized, with better performance than MySQL on RDS.
Aurora storage automatically grows in increments of 10GB, up to 128TB, so you don't even need to provision the initial disk space.
You can have up to 15 replicas compared to 5 for MySQL. Failover in Aurora is instantaneous and it is highly available by default. But it costs more.
High availability and read scaling
Every time you write to the database, Aurora stores 6 copies of your data across 3 AZs (2 copies in each AZ). The underlying storage volume is shared across the AZs!
It needs 4 out of 6 copies to acknowledge a write and 3 out of 6 copies for a read.
Self healing with peer-to-peer replication so if corruption occurs in one database it will be automatically corrected.
Only the master instance takes writes, and automated failover for the master happens in less than 30 seconds. You can have up to 15 replicas plus the master for your database.
It also support cross region replication.
Replication lag is in the millisecond range, which is extremely fast.
Aurora database cluster
Writer endpoint: Points to the master instance for writing operations. The client will talk to this endpoint in order to invoke write operations.
Reader endpoint: Acts like a load balancer. If the replicas auto scale, how do you keep track of which instances are part of the cluster? The reader endpoint handles it: it automatically points to all read replicas, including those added by auto scaling, and load balances read requests among them.
Then the client will communicate with the reader endpoint in order to read from the database.
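A rough sketch of how a client might use the two endpoints, assuming a MySQL-compatible cluster and the pymysql driver (the endpoints, credentials, and table are made up):

```python
import pymysql  # assuming a MySQL-compatible Aurora cluster

# Hypothetical endpoints copied from the Aurora console
WRITER = "mycluster.cluster-abc123.us-east-1.rds.amazonaws.com"
READER = "mycluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com"

# Writes always go to the writer endpoint (it follows the current master after a failover)
conn = pymysql.connect(host=WRITER, user="admin", password="secret", database="app")
with conn.cursor() as cur:
    cur.execute("INSERT INTO orders (item) VALUES (%s)", ("book",))
conn.commit()
conn.close()

# Reads go to the reader endpoint, which load balances across the read replicas
conn = pymysql.connect(host=READER, user="admin", password="secret", database="app")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM orders")
    print(cur.fetchone())
conn.close()
```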
Aurora custom endpoints
This allows you to group a subset of Aurora instances behind a custom endpoint. For example, you can run some replicas on larger instance types and put them behind a custom endpoint dedicated to analytics queries, while the other replicas serve normal reads behind another custom endpoint.
Once you define custom endpoints, the default reader endpoint is generally not used anymore; you would define another custom endpoint pointing to the instances you still want to use for regular reads.
Aurora serverless
Automated database instantiation and auto scaling based on actual usage, similar to Lambda functions.
Good for infrequent, intermittent or unpredictable workloads.
No capacity planning needed and is pay per second which can be more cost-effective.
You set up a proxy fleet which is managed by Aurora, and it will instantiate Aurora database instances based on the incoming requests.
Aurora multi-master
Use this when you want immediate failover for the writer node. Every node can do reads and writes, versus the default behavior of promoting a read replica to be the new master.
Global Aurora
Aurora cross region read replicas: Allow cross region read replica which is useful for disaster recovery.
Aurora global database: You set one primary region for read/write, then up to 5 secondary read-only regions, each of which can have up to 16 read replicas. This helps you decrease latency for data access all over the world.
It has a recovery time objective of < 1 minute, i.e. how long it takes to recover by promoting another region.
Typical cross-region replication lag is less than 1 second; that figure is a hint to use Global Aurora.
Aurora machine learning
Let you add machine learning based prediction to your application via SQL interface to do things like fraud detection, ads targeting, sentiment analysis, product recommendation all in Aurora.
Amazon SageMaker and Amazon Comprehend are supported.
RDS backup
- Automated backup: daily full backup of the database. Transaction logs are backed up by RDS every 5 minutes, which means you can restore to any point in time up to 5 minutes ago. Backups are kept for 1 - 35 days.
- Manual database snapshots: Manually triggered by user, but you can keep the backup for as long as you want
If you are going to stop an RDS database for a while, you will still pay for storage. To save cost, take a snapshot and delete the database, which costs way less. Then just restore the database from the snapshot once you are ready to use it again.
Aurora backup
- Automated backup: 1 to 35 days of retention, and you cannot disable it. Point-in-time recovery within that timeframe, meaning you can restore data up to 35 days back.
- Manual database snapshot: Same triggered by user, keep backup as long as you want
Restore options
Restoring RDS/Aurora backup or a snapshot creates a new database.
You can also migrate an on-premise database to RDS by making a backup of your on-premise database, putting it in an S3 bucket, and creating a new RDS database from it.
To migrate an on-premise database to an Aurora cluster, you use a tool called Percona XtraBackup to create the backup file, store it in S3, and create Aurora from it.
Aurora database cloning
Create a new Aurora database cluster from an existing one.
Basically let you clone an existing Aurora, this is faster than snapshotting it and restoring it.
The cloned database initially uses the same data volume as the original; new storage is only allocated when writes are made. This is the copy-on-write optimization.
RDS & Aurora security
You can have at-rest encryption: the master and the replicas are encrypted using KMS.
If the master is not encrypted then the read replicas can't be encrypted. To encrypt an unencrypted database you have to take a snapshot and restore it as encrypted.
You can also have in-flight encryption: it is TLS-ready by default, so data in transit is encrypted.
IAM Authentication: You can use IAM roles to connect to your databases instead of username and password.
You can allow or block specific ports / IP using security groups.
No SSH access except with RDS Custom. These are the security options for RDS / Aurora.
RDS proxy
A database proxy for RDS.
The proxy allows apps to pool and share DB connections established with the database. This improves database efficiency by reducing the number of open connections, saving CPU and RAM resources.
Definitely needed for Lambda functions that need database access. This is because lambda function appear and disappear really quickly, if every lambda function open up a database connection and close it that quickly then it will overwhelm the RDS / Aurora database very fast, this is why a proxy is needed to pool and share the connection.
It is serverless, auto scaling, and highly available through multi-AZ. It can reduce failover time by up to 66%.
It supports RDS and Aurora, and no code change is needed. It can also enforce IAM authentication for the database instead of username/password.
The proxy is only available inside the VPC; it cannot be accessed publicly.
Amazon ElastiCache
RDS is a managed relational database.
ElastiCache is a managed Redis or Memcached. Caches are basically memory that are high performance and have low latency, you use them to store frequently accessed data so that you don't have to query the data in slower disks.
Helps reduce load off of databases for read-intensive workloads: if the data exists in the cache there is no need to hit the disk, which costs more time. It also helps make your application stateless.
Using ElastiCache need heavy code changes!
Use ElastiCache as database Cache
Your application will first query ElastiCache; if the data exists, it is a cache hit and there is no need to hit RDS. On a cache miss, you query RDS and then store the result in ElastiCache.
This helps relieve load off RDS from the same request coming in over and over again.
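A minimal sketch of this cache-aside (lazy loading) pattern using the redis-py client; the endpoint, key format, and `db.query_user` helper are hypothetical:

```python
import json
import redis  # redis-py client, assuming a Redis-mode ElastiCache cluster

# Hypothetical ElastiCache endpoint copied from the console
cache = redis.Redis(host="my-cache.abc123.0001.use1.cache.amazonaws.com", port=6379)

def get_user(user_id, db):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                    # cache hit: skip the database entirely
        return json.loads(cached)
    user = db.query_user(user_id)             # cache miss: hit RDS (hypothetical helper)
    cache.setex(key, 3600, json.dumps(user))  # store for next time, expire after 1 hour
    return user
```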
Use ElastiCache as user session store
The user logs into your application, and your application stores the session data in ElastiCache instead of a database. Any instance can then retrieve the session data from ElastiCache and treat the user as already logged in.
Redis vs Memcached
Both are in-memory data structure server. Both utilizes key/value store.
Redis allows multi-AZ with auto-failover, can have read replicas to scale reads, and is highly available. It has backup and restore features and supports data persistence.
Memcached, however, doesn't have replication and thus is not highly available, has no backup and restore, and is not persistent: if you restart the service you will lose your data!
ElastiCache security
IAM authentication is supported for Redis only; otherwise you use username and password.
Redis AUTH: Set password/token when you create a Redis cluster, extra layer of security. Gain in-flight encryption.
Memcached supports SASL based authentication.
Pattern for ElastiCache
- Lazy loading: All read data is cached, but data can become stale in the cache. You only cache data on a miss; this is simple to implement.
- Write through: Add or update data in the cache when writing to a database, more complicated.
- Session store: Store temporary session data in a cache
Redis Use Case
Gaming leaderboards are computationally complex. Redis sorted sets give you uniqueness and element ordering: each time a new element is added, it is re-ranked in real time.
So you can use Redis as a game leaderboard
Route 53
Domain name system
It is a lookup server that help you translate domain names to IP addresses. It is the backbone of the internet.
It uses a hierarchical naming structure, starting from the root DNS servers, which know about the Top Level Domain (TLD) name servers, which in turn know about the authoritative name servers for each domain.
Terminologies
Domain registrar: Amazon Route 53, GoDaddy
DNS Records: A, AAAA, CNAME, NS
Zone file: Contain DNS records
Name server: Resolves DNS queries. These are the servers that answer your DNS requests
Top Level Domain: .com, .us, .gov
How DNS works
When you visit a website, say example.com, your web browser asks your local DNS server (managed by your ISP), "do you know what example.com is?" The local DNS server says no, then goes out and asks the root DNS server whether it knows example.com.
The root DNS server responds, "I don't, but I do know who manages the .com TLD, here is its address."
Then the local DNS server asks the .com name server, "do you know what example.com is?" The .com name server doesn't know either, but it knows the address of the authoritative server for example.com.
Finally the local DNS server queries the example.com authoritative server and gets the A record, which contains the IP address for the query.
The result will be cached into local DNS server.
Route 53
Highly available, scalable, fully managed and authoritative DNS. You can update the DNS records.
It is also a domain registrar, let you buy domain name.
Records
Records define how you want to route traffic for a domain. Each record has a record type (A or CNAME), the value it maps to, a time to live (TTL), and a routing policy.
Route 53 supports A, AAAA, CNAME, NS; these are the most important ones
- A: Maps the domain name to IPv4, example.com = 1.1.1.1
- AAAA: Maps the domain name to IPv6
- CNAME: Map a hostname to another hostname, the hostname it is mapped to must have an A or AAAA record.
Think of a CNAME record as an alias. You can't create a CNAME for example.com (the zone apex), only for subdomains like www.example.com.
- NS: Name servers, these are the servers that can respond to queries for your domain
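For reference, a hedged boto3 sketch of creating (or updating) a simple A record; the hosted zone ID, record name, and IP are placeholders:

```python
import boto3

route53 = boto3.client("route53")

# UPSERT creates the record if it doesn't exist, or updates it if it does
route53.change_resource_record_sets(
    HostedZoneId="Z123456ABCDEFG",  # hypothetical hosted zone ID
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "A",
            "TTL": 300,
            "ResourceRecords": [{"Value": "1.2.3.4"}],
        },
    }]},
)
```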
Hosted zones
A container for records that define how to route traffic to a domain and subdomain. Basically one hosted zone for your domain and the subdomains that you have.
- Public hosted zone: Has records that specify how to route traffic on the internet (this is for public domain names)
- Private hosted zone: Has records that specify how to route traffic within one or more VPCs. This is used, for example, inside a company that has reserved internal URLs pointing to internal sites. The public cannot access these.
You pay $0.50 per month per hosted zone
With a private hosted zone, you can give your EC2 instances an internal domain name like example.com; as the name implies, it is resolved internally and is not for the public internet.
Your EC2 instances within the VPC can use Route 53 to resolve the IP addresses of other EC2 instances in the same hosted zone. Remember, this only works for instances inside the VPC associated with the private hosted zone, not for public usage.
TTL
Time to live: the amount of time a client caches the result of a DNS query. If the TTL is long there is less traffic on Route 53, at the cost of clients holding outdated records. If the TTL is short there is more traffic on Route 53, because every expired DNS query has to be fetched again, but you can change your records easily since the cache expires often.
Every record has a mandatory TTL, except Alias records, for which you cannot set a TTL.
CNAME
CNAME (canonical name) is a record that allows a domain name to be used as an alias for another, true domain. When a DNS resolver gets back a CNAME record it doesn't stop there; it looks up the canonical name it points to and ultimately returns that A record's IP address.
A CNAME chain can be as long as you want, but long chains are discouraged because they are inefficient.
Simple example would be www.example.com being an alias for example.com. So you would set the CNAME of www.example.com to be example.com, as example.com is the canonical name, the true name for www.example.com.
Limitation: a CNAME cannot be created at the root domain. This is because SOA and NS records must be present at the root domain, and a CNAME cannot coexist with any other record for the same name. So the root domain can be the value of a CNAME, but it cannot be the alias itself.
Alias
This is specific to Route 53. It allows you to point a hostname to an AWS resource. It works for both root and non-root domains, so it doesn't have the CNAME limitation.
It is free of charge and have native health check.
An example would be giving example.com an Alias record that points to a load balancer's DNS name. The load balancer's actual IP addresses change over time, which is why you point to its DNS name.
You can use example.com to point to the load balancer's DNS name. With CNAME you cannot do that because you are not allowed to use root domain for CNAME records!
Alias type record is always A/AAAA. You cannot set the TTL.
Alias target can be:
- Elastic load balancer
- Amazon CloudFront
- API Gateway
- Elastic Beanstalk
- S3 Website
- VPC Interface endpoints
- Global Accelerator accelerator
- Route 53 record in the same hosted zone
You cannot set alias for an EC2 DNS name!
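A hedged boto3 sketch of an Alias record at the zone apex pointing to an ALB; the zone IDs and ALB DNS name are placeholders:

```python
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z123456ABCDEFG",  # your hosted zone (hypothetical)
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "example.com",   # the root domain works with Alias, unlike CNAME
            "Type": "A",             # Alias records are A/AAAA; note: no TTL field
            "AliasTarget": {
                "HostedZoneId": "Z35SXDOTRQ7X7K",  # the ALB's own hosted zone ID (placeholder)
                "DNSName": "my-alb-1234.us-east-1.elb.amazonaws.com",
                "EvaluateTargetHealth": True,      # native health check on the target
            },
        },
    }]},
)
```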
Routing policies
Define how Route 53 responds to DNS queries. It is only responding to DNS requests! That's all, not load balancer routing!
Simple policy
Route traffic to a single resource.
A simple record can hold a single value or multiple values; with multiple values, Route 53 returns all of them and the client picks one randomly. A typical example is a plain A record for a hostname.
When Alias is enabled you can only specify one AWS resource as the target; you can't do multiple values.
Simple policy has no health checks.
Weighted policy
Control the percentage of requests that go to each specific resource.
Say you can route 30% of the traffic to one EC2 instance, and 70% to another. The DNS records must have the same name and type. Can use health checks with this policy.
The use cases would be load balancing between regions at the DNS level, and test new application versions by routing a small percentage of the traffic to the new application.
Giving a record a weight of 0 stops sending traffic to that resource, and if all records have a weight of 0 then traffic is distributed equally across all of them.
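A hedged sketch of two weighted records for the same name (zone ID and IPs are placeholders); the share each record receives is its weight divided by the sum of all weights, so 70 and 30 give roughly a 70/30 split:

```python
import boto3

route53 = boto3.client("route53")

for identifier, weight, ip in [("blue", 70, "1.1.1.1"), ("green", 30, "2.2.2.2")]:
    route53.change_resource_record_sets(
        HostedZoneId="Z123456ABCDEFG",
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "SetIdentifier": identifier,  # required to distinguish weighted records
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )
```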
Latency policy
Redirect to the resource that has the least latency. Latency is measured between users and AWS regions: people in North America will be redirected to resources in us-east-1 rather than resources deployed in ap-southeast-1, since it has lower latency for them.
You can create latency records pointing to EC2 instances and specify the region each resource is located in; Route 53 then returns the lowest-latency resource based on which region the user is closest to.
Failover policy (active-passive)
We associate a health check with the primary record pointing at the main EC2 instance (the health check is mandatory), and there is a failover EC2 instance behind the secondary record. When the health check detects the primary instance is unhealthy, traffic is routed to the secondary record, i.e. the failover EC2 instance.
There can only be one primary and one secondary resource.
Geolocation policy
Routing is done based on user location not based on latency!
Specify location by continent, country, or US state. This is used for website localization and for restricting content distribution in countries with strict regulations.
Can be associated with health checks.
Ex: German users should go to this IP, France go to this IP, and rest go to this IP.
Geoproximity policy
You can shift more traffic to a resource based on a defined bias: increase the bias to send more traffic to that resource, decrease it to shrink the traffic it receives.
The resources can be AWS resources, or Non-AWS resources.
Geoproximity policy is useful when you need to shift traffic from one region to another by increasing the bias for that region.
The visual for this is that if you have bias of 0 for two regions, then traffic will be split equally by putting a line in between them. Users on their corresponding side will be routed to that region. However, if you change the bias of that setting with the same setup to say 50 for one region, then the line will be shifted to the other side, incorporating more traffic from the other side because the bias is increased. It is favored more.
Multi-value policy
Used when routing traffic to multiple resources: Route 53 returns multiple values/resources. This is different from the simple policy with multiple values, because simple policy doesn't allow health checks while multi-value policy does, ensuring the returned values are healthy!
You can associate health checks. Up to 8 healthy records can be returned from multi-value query.
It is not a substitute for an ELB; it is client-side load balancing. With the multiple values returned, it is up to the client to pick one of the healthy results.
Health Check
A way for checking the health only for public resources. This is for when you are deploying into multi-region for high availability and then you let Route 53 route the traffic based on latency or geoproximity. If one of the region goes down, then your traffic will be rerouted to other region because there is a health check built in (automatic DNS failover).
This gives you automated DNS failover. Health check have three types:
- Health checks that monitor an endpoint (application, server, other AWS resource)
- Health check that monitor other health checks
- Health checks that monitor CloudWatch alarms
A health check is an entity that you create. Each health check is backed by about 15 global health checkers that actually perform the checks.
Health Check - Monitoring endpoint
There are about 15 global health checkers who will be checking the endpoint for healthiness.
There are health checkers in us-east-1, us-west-1, sa-east-1 and so on. If you set one up to monitor an endpoint in eu-west-1, then all of these ~15 health checkers will ping the endpoint in eu-west-1, and if they receive a 200 status it is considered healthy.
You can decide the threshold for what counts as healthy or unhealthy. The default interval is 30 seconds. HTTP, HTTPS, and TCP health checks are supported.
Health checks only pass with 2xx or 3xx status codes. A health check can also be based on the text in the first 5120 bytes of the response.
You have to configure it to allow Route 53 health checker's traffic in your resource's security group.
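A hedged boto3 sketch of creating an endpoint health check; the domain, path, and thresholds are placeholders:

```python
import boto3
import uuid

route53 = boto3.client("route53")

route53.create_health_check(
    CallerReference=str(uuid.uuid4()),  # idempotency token
    HealthCheckConfig={
        "FullyQualifiedDomainName": "api.example.com",  # hypothetical public endpoint
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/health",
        "RequestInterval": 30,   # the default 30-second interval
        "FailureThreshold": 3,   # failed checks before the endpoint is marked unhealthy
    },
)
```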
Health Check - Calculated Health Check
Combine results of multiple health checks into single health check.
Basically you can set up child health check that monitors their own individual instances. Then you have a master health check that will pool all those child health check's result together and then specify how many of them need to pass to make the parent pass.
This is useful when you are doing maintenance on your website and don't want one instance's failing health check to make the whole thing fail.
Health Check - Private resources
Route 53 health checkers live outside of VPC, so they cannot access private endpoints that lives inside VPC.
To do it, you create a CloudWatch metric and alarm for the private resource, then associate that CloudWatch alarm with the health check. This is how you can monitor private resources.
Domain registrar vs DNS Service
A domain registrar is the place where you buy a domain (Amazon Route 53, GoDaddy). Registrars usually also provide a DNS service; Namecheap, for example, lets you set DNS records after you purchase the domain name.
However, you do not have to use the registrar's DNS service. It is perfectly valid to purchase your domain from GoDaddy and then use Route 53 as the DNS service: you just change the NS records of your domain to point to Route 53's name servers. All DNS queries are then handled by Route 53, so you can use Route 53 for DNS even though you didn't buy the domain from AWS.
Classic Solutions Architecture Discussion
Solution architecture
How do you use all these components to make them all work together into one architecture.
We will study solution architecture and how to come up with them via case studies.
WhatsTheTime.com
Let people know what time it is. We don't need a database because every EC2 instances know the time.
We want to start small and downtime is acceptable, then scale vertically and horizontally with no downtime.
Initial solution
We start with a public t2.micro EC2 instance; a user asks what time it is, and it just spits back the current time. We attach an Elastic IP address so the IP address of the EC2 instance is static.
Now users starts to come into our app, now our t2.micro can't keep up with the load, maybe we should scale it vertically and make it into m5 instances. We have to stop our app, and change our EC2 instance size to be m5. We have downtime when upgrading our app. This isn't great.
Now even more people come in, we scale it horizontally to three EC2 instances. But users needs to know about the IP address of those horizontally scaled EC2 instances.
To fix this, we can leverage Route 53: set an A record that points to those three EC2 instances, with a TTL of 1 hour. Now users can access the time API through a common endpoint, api.whatisthetime.com.
Now if we are going to make an upgrade and take down one of the instances that the Route 53 points to, it is going to make some users suffer because the TTL is 1 hour. They won't be routed to other EC2 instances that are still up! They will be unhappy!
How do we remediate this?
We make our EC2 instances private and front them with an Elastic Load Balancer with health checks. Route 53 then needs an Alias record that points to the ELB. Now it works properly, with no downtime for any user thanks to the health checks.
Manually launching instances is tedious, so we add an auto scaling group to scale on demand. We set the min, max, and desired count of instances.
But what happens if the availability zone our setup lives in just had an earthquake? Our application will still go down! To solve this, we deploy our ELB across multiple AZs, say AZ 1-3, and our auto scaling group also launches instances in different AZs. Now it is highly available, great!
Now after optimizing the architecture, you will switch to thinking about cost saving. You can reserve capacity for cost saving! Reserving minimum capacity of our auto scaling group we can save lots of money!
Now this is a good architecture. We are considering the 5 pillars of a well-architected application: cost (reserved instances + ASG for optimized cost), performance (vertical scaling, ELB, adapting performance over time), reliability (Route 53, multi-AZ deployment), security (security groups linking the ELB to the EC2 instances), and operational excellence.
MyClothes.com (stateful app)
Now let's try to make a stateful app. This e-com web app will have a shopping cart and we need some place to store all these user informations. We want to keep our web app as stateless as possible, and user should not lose their shopping cart when they refresh their pages.
Details such as address should be in the database.
We will have the same setup as the WhatsTheTime app with ELB, ASG, Multi-AZ, and Route 53. Now whenever the user adds to cart and the page refreshes, they lose their data because they are redirected to a different EC2 instance than before. How do we remediate this? We introduce stickiness so that a user always talks to the same EC2 instance.
But if the EC2 instance is terminated data will still be lost, so stickiness isn't a complete solution.
We introduce server sessions: we set a cookie called session_id and use ElastiCache to store the user session. As long as the user presents the same session_id, we retrieve the same data for that user.
We can also introduce Amazon RDS to store user data in a database. But now there are too many reads, what do we do? We add read replicas to RDS (up to 5). We can also add lazy loading with ElastiCache on top of it; this pattern requires code changes to your application, but it is more efficient since frequently accessed data sits in ElastiCache and doesn't need to hit RDS constantly.
Now how do we make it Multi-AZ in order to survive disasters? Route 53 is already highly available, so we don't have to worry about it. We make the ELB multi-AZ, the ASG multi-AZ, ElastiCache has multi-AZ if you use Redis, and RDS can be multi-AZ with a standby instance in case the master goes down.
For security group, we restrict traffic only from the resources that it is fronted with.
MyWordPress.com (stateful app)
We want to make a scalable WordPress website where users can upload and access pictures.
RDS layer with multi-AZ for handling user data. Or we can have Aurora MySQL to scale better than RDS.
Storing images we will do it with EBS initially. Image will have to go through ELB to EC2 then EC2 will store it into EBS volume. Problem is when we start scaling horizontally, what if image is stored into one of the EBS volume but not the other? Then user won't be able to access the image that they have uploaded to one of the EBS volume.
To solve this we use EFS instead of EBS: it scales automatically as you use more storage, and the storage is shared between many EC2 instances.
Instantiating applications quickly
For EC2 instances, we can use a golden AMI (also called a custom AMI), which contains the pre-installed software and packages needed to run your application. All your other EC2 instances can be created from this AMI at a much faster rate.
You can also use user data for dynamic configurations.
Elastic Beanstalk will combine both golden AMI and user data to quickly spin up your applications
For RDS databases restoring it from a snapshot is much faster.
EBS volumes also restoring it from a snapshot will be much faster.
Elastic Beanstalk
So far the architecture is ELB with EC2 in multi-AZ, then we have RDS and ElastiCache for caching frequently read data and session data.
Most of the application follow these type of structure, and if we are going to deploy many application then it is going to be a pain to deploy these manually for every application!
Most of the application follow these same architecture, ALB + ASG mainly, so Elastic Beanstalk provide this one way of deploying the code without you having to worry about provisioning all these resources yourself. All you have to do is write the code.
Elastic Beanstalk is going to use all of the components that we have seen before. It is a managed service, so it automatically handles capacity provisioning, load balancing, scaling, and application health monitoring. We still have full control over the configuration, but it puts less burden on the developer.
Beanstalk itself is free, but the resources that it manages will not be! Will be priced accordingly.
Beanstalk supports lots of platforms: Python, Java, you name it. Even if your platform isn't supported by default, you can create a custom one.
Components
Application: The collection of Elastic Beanstalk components
Application version: An iteration of your application code
Environment: Collection of AWS resources running application version (running only one version at a time)
Tier: Web server environment tier vs worker environment tier; you can create multiple environments
So the process is you create application, upload it with a version number, and you launch into an environment, then you can upload updated version and relaunch it with the updated version.
Web server tier: The traditional architecture that we know, EC2 instances is managed by auto scaling group and is fronted with ELB, it will be deployed under a DNS name from ELB.
Worker environment: No clients accessing EC2 instances directly, the EC2 instance will be consuming SQS messages that comes from SQS queues. You push messages to the SQS queue to kick start the process.
Deployment modes
Single instances: Good for development, an EC2 instance with Elastic IP
High availability with load balancer: This is good for production, this is the traditional architecture in AWS that we see already. EC2 instances managed by ASG in multi-AZ, fronted by ELB.
S3 Buckets
S3
Advertised as infinitely scaling storage. Many AWS services also use S3 as part of their own service.
You can use S3 for:
- Backup
- Storage
- Disaster recovery
- Archive
- Static website
- Software delivery
- Data lakes and big data analytics
Objects are stored in buckets (think of them as directories). Each bucket name must be globally unique (across all regions and accounts), but buckets are created in a specific region. Naming restrictions: no uppercase, no underscores.
Each file stored in a bucket has a key, and the key is the full path. However, S3 has no real concept of directories: what looks like folders is just part of the prefix. The object key is the prefix (which can look like a folder path) + the file name itself.
The value the key maps to is the content of the file itself, with a maximum object size of 5TB. If you're uploading a file larger than 5GB, you must use multi-part upload.
S3 Bucket security
- User-based rules: Attach IAM policies to a user to specify which API calls to the S3 bucket that user can make.
- Resource-based: Have three types
- Bucket policies: Bucket-wide rules from the S3 console; these allow cross-account access, letting other AWS accounts access the bucket
- Object access control list: finer grain
- Bucket access control list: less common now
So basically you can attach a policy to the user to give it permission on the S3 bucket, or attach a policy to the bucket to control who is able to access it. You can also attach an IAM role to a resource (say an EC2 instance) so that it can access the bucket, and you can grant cross-account access through the bucket policy.
An IAM principal (the who) can access an S3 object if its IAM permissions allow it OR the resource policy allows it, AND there is no explicit deny.
You can also encrypt the object using encryption keys to add more security.
S3 static website hosting
S3 can be used for hosting static websites. The static website URL will be the bucket's public URL. In order to make this work you must configure the bucket to be publicly accessible.
S3 versioning
You can enable versioning for your files in S3. Any changes over the same key will create a version for that particular key. For example if you upload the same file twice it will just overwrite the file with version 2.
You can roll back to a previous version to recover from accidental corruption or deletion. Files uploaded before versioning was enabled have version null.
S3 replication
You must enable versioning in both the source and destination buckets of the replication. The buckets can be in different AWS accounts.
The replication occurs asynchronously.
There are two flavors:
- CRR (Cross region replication): Compliance, lower latency access, replication across accounts
- SRR (Same region replication): Log aggregation, live replication between production and test accounts
Only new objects are replicated, old objects won't be replicated you must use S3 batch replication to do so.
Deletions with a version ID are not replicated. A plain delete (which only places a delete marker) can be replicated.
S3 storage classes
Durability: the chance of your object being lost. On average you can expect to lose a single object once every 10,000 years if you store 10,000,000 objects. Durability is the same for all storage classes.
Availability: how readily available S3 is. S3 Standard is down about 53 minutes per year on average.
General purpose
This is used for frequently accessed data. Use it for big data analytics, mobile and gaming application. No cost for retrieval, only for storage.
Infrequent access
For data that is less frequent accessed, but that needs rapid access sometimes.
Has lower cost than S3 standard but has cost when retrieving.
Standard-infrequent access: Use for disaster recovery and backups
One Zone-Infrequent Access: Very high durability in a single availability zone, but the data can be lost if the AZ is destroyed. Use it for storing secondary backup
Glacier
Low cost when storing for archiving and backup.
You pay a low storage price but a high object retrieval cost
Glacier instant retrieval: Give you millisecond get time. Good for data accessed once a quarter.
Glacier flexible retrieval: Expedited (1-5 minutes), standard (3-5 hours), bulk (5-12 hours) to get data back
Glacier deep archive: Meant for long archive. Standard (12 hour), bulk (48 hours)
Intelligent-tiering
Let you move objects automatically based on access pattern. No retrieval charges. But you need to pay monitoring and auto-tiering fee.
- Frequent access tier: default tier
- Infrequent access tier: if obj hasn't been access for 30 days
- Archive instant access tier: 90 days
- Archive access tier: 90 days to 700+ days
- Deep archive access tier: 180 to 700+ days
Lifecycle rules
You can transition objects between storages. Moving object is done by life cycle rules. You specify the storage class you want to move the object to, and the amount of days that it has to pass for the transition to occur.
You can also set up expiration actions, say deleting the object after some time. You can also delete incomplete multi-part uploads.
You can also move non-current versions of an object to other storage classes; the rules don't have to apply only to current versions.
S3 Analytics will give you recommendations on when to transition objects to the right storage class: it helps you put together lifecycle rules that make sense!
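A hedged boto3 sketch of such a lifecycle rule (bucket name, prefix, and day counts are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Move "logs/" objects to Standard-IA after 30 days and Glacier after 90,
# expire them after 365 days, and clean up incomplete multi-part uploads after 7 days
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "archive-logs",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }]},
)
```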
Requester pays
Owner of the S3 bucket pay for storage + downloading cost, but there is the option to make the requester who is asking for the file to pay for the network cost.
S3 event notification
When something happens in an S3 bucket, whether a file is deposited or deleted, S3 can send out events. These events can go to a couple of places:
- SNS topic
- SQS queue
- Lambda function
- EventBridge: This is the latest option. You can send all events to EventBridge and then forward them to many other AWS services, expanding well beyond the 3 destinations above.
Services can then react to these events to say do something with the file that was just deposited.
S3 performance
It scales automatically to a high request rate with 100-200 ms latency: at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second PER prefix. Remember, a prefix acts like a "directory" even though S3 doesn't have a real directory concept.
So if you have like 2 folders, then they each get a set of request rates per prefix.
Use multi-part upload for files > 5GB (required) and for files > 100MB (recommended). It speeds up uploads by taking advantage of parallelism: you divide your big file into parts, upload the parts in parallel, and S3 reconstructs the big file from the parts afterwards.
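A short sketch of letting the SDK handle multi-part upload (bucket, key, and thresholds are placeholders):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files above 100 MB are split into 100 MB parts and uploaded with 10 parallel threads
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=10,
)
s3.upload_file("big-backup.tar", "my-bucket", "backups/big-backup.tar", Config=config)
```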
S3 Transfer acceleration is used if you need to transfer files across region. This is done by uploading your file to edge location which then can forward the file to the S3 bucket via a private network which is much faster than public internet to the bucket in another region. (Remember public internet requires packet hopping across different routers).
S3 byte-range fetches are like multi-part upload but for downloads: you request specific byte ranges of the file in parts, say bytes 0-50, bytes 51-100, and so on, to take advantage of parallel downloads. You can also retrieve just a partial byte range (for example, only a file's header).
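A minimal sketch of a ranged GET (bucket and key are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Fetch only the first 100 bytes of the object, e.g. to read a file header;
# several ranged GETs like this can also be issued in parallel to speed up a download
resp = s3.get_object(Bucket="my-bucket", Key="data/big-file.bin", Range="bytes=0-99")
header = resp["Body"].read()
```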
S3 batch operations
Make lots of operation on existing S3 objects with one request.
- Encrypt un-encrypted objects
- Restore objects from S3 glacier
- Modify ACLs
- Invoke lambda to do whatever function you want on the object
So a batch operation is a job that contains the action to take and the objects to take it on.
S3 batch operation have built-in retries, progress tracking, and generate reports.
You can use S3 Inventory to get the object list and S3 Select to filter it down to the objects you want to perform the operation on.
S3 Bucket Security
S3 encryption
There are four flavors.
You can also set up a bucket policy that only allows files encrypted with one of the server-side encryption schemes. You express that in the bucket's JSON policy; for example, you can accept only files encrypted with KMS and deny files encrypted with any other scheme.
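A hedged sketch of such a policy applied with boto3 (the bucket name is a placeholder); it denies any PutObject that does not declare SSE-KMS:

```python
import json
import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyNonKmsUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
        # reject uploads whose encryption header is not aws:kms
        "Condition": {"StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}},
    }],
}
s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```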
Server-side encryption S3
Enabled by default for new buckets and new objects. The encryption uses keys that are handled, managed, and owned by AWS S3.
The object is uploaded then when it reaches S3 it will be encrypted with the key owned by the S3 bucket, and stored into the bucket.
Set header for x-amz-server-side-encryption: AES256
Server-side encryption KMS
The encryption uses key that's handled and managed by AWS KMS. It is also encrypted on the server side
This is good because it has user control and logs every key usage.
Limitation: you will be using the KMS API, which has its own quotas (5,500, 10,000, or 30,000 requests per second depending on the region).
Set header for x-amz-server-side-encryption: aws:kms
Server-side encryption-C (custom key)
The customer-provided key is sent to AWS to be used for encryption; the key is never stored, it is discarded right after use.
HTTPS must be used because you will be sending the encryption key over the wire.
Client-side encryption
Client encrypt the file themselves before sending to S3, and the client must decrypt the file themselves when retrieving it from S3.
Therefore, the client manages the key and encryption cycle themselves.
The file is encrypted by the client before upload, and S3 doesn't do any encryption. When the file is downloaded, the client is responsible for decrypting it.
Encryption in transit
How data is encrypted while it is being transferred to and from S3.
HTTPS for S3 bucket endpoint have encryption in flight.
HTTP for S3 endpoint have no encryption in flight.
You don't have to worry since most client uses HTTPS by default.
S3 CORS
origin = protocol + host + port
CORS tells the browser whether the current website is allowed to use resources fetched from another origin. Requests to the same origin are allowed by default (the same-origin policy); for a different origin, CORS headers must explicitly say that this website is allowed to access the resources, otherwise the browser blocks them.
If a client makes a cross-origin request to your S3 bucket, the bucket needs to return the correct CORS headers. You can configure the headers to allow a specific origin or * (all origins).
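A hedged boto3 sketch of a CORS configuration that lets one (hypothetical) frontend origin read objects from the bucket:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_cors(
    Bucket="my-assets-bucket",  # placeholder bucket name
    CORSConfiguration={"CORSRules": [{
        "AllowedOrigins": ["https://www.example.com"],  # or ["*"] for all origins
        "AllowedMethods": ["GET"],
        "AllowedHeaders": ["*"],
        "MaxAgeSeconds": 3000,   # how long the browser may cache the preflight response
    }]},
)
```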
S3 MFA delete
Force user to generate a code from their device before doing important operations on S3
Deleting object, suspend versioning on the bucket. They are dangerous operations. Versioning must be enabled to use MFA delete.
Only root account can enable/disable MFA delete.
S3 Access logs
You might want to log all access to your S3 bucket. The data can then be analyzed.
The log will be deposited to another S3 bucket and the logging bucket must be in the same AWS region.
Do not log the logging bucket into itself: that creates a logging loop and you will pay lots of money in the end!
S3 Pre-signed URL
You can generate these URL using console, CLI, or SDK.
There will be an expiration date for the URL. The pre-signed URL inherit the permissions of the user that generated the URL.
This is used for giving out the files stored in a private S3 bucket to other users. So that they can access the file as well.
They will be able to upload to the precise file location as well!
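A minimal boto3 sketch of generating a time-limited download link for a private object (bucket and key are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# The URL carries the permissions of whoever generated it and expires in one hour
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-private-bucket", "Key": "reports/q1.pdf"},
    ExpiresIn=3600,
)
print(url)
```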
S3 Glacier vault lock
Adopts the write once read many (WORM) model. Once you lock the vault with a vault lock policy, the objects you put into it can no longer be changed or deleted.
This is helpful for compliance and data retention.
S3 object lock
Enable versioning first, also let you do WORM model. It is a lock for each object, it prevents deletion for a specified amount of time.
- Compliance mode: An object version can't be overwritten or deleted by anyone, not even the root user. The retention mode can't be changed and the retention period can't be shortened
- Governance mode: Most users can't overwrite or delete an object version or alter the lock settings, but special users can
Retention period: how long you protect the object; it can be extended but not shortened.
Legal hold: protects the object indefinitely, independently of the retention period. Legal holds can be removed.
S3 access points
You can define an access point associated with a policy that grants R/W permission to a specific prefix of an S3 bucket for a particular IAM user. Each access point has its own DNS name.
This allows you to control who is able to read what "folders" from S3 bucket.
You can also define a VPC origin, so the access point can only be reached from within a VPC. To make this work you need to create a VPC endpoint, and the VPC endpoint policy must allow access to the access point and the bucket.
S3 object lambda
This is a use case for access points: change the object before it is retrieved from the access point.
Uses a lambda function, the lambda can sit in front of the access point to modify the object before it is returned to the user at the access point.
You need Access Point + Object Lambda access point to make this work.
Example use cases: a redacting Lambda function applied to the object before it is returned to the user, or enriching the data before it is returned to the application.
CloudFront & Global Accelerator
AWS CloudFront
Content delivery network, they cache content at different locations world-wide that are closer to the user to improve read performance.
CloudFront has DDoS protection since the network is worldwide, and it integrates with AWS Shield.
Origins
There are many resources that CloudFront can get the original content from, remember it is still a server that is caching content so it has to get the content from somewhere.
- S3 bucket: You can serve files from an S3 bucket and cache them in the CDN. OAC (origin access control) is used for security: it is specific to S3 origins and makes the bucket only accept requests coming from CloudFront.
- HTTP backend: You can also get content from an HTTP endpoint, such as an ALB, EC2 instances, or an S3 static website. They can all be cached in the CDN because they are still serving content
CloudFront vs S3 cross region replication
CloudFront is a CDN: it has over 200 edge locations forming a network that speeds up delivering content to users. Files are cached up to a TTL, after which they expire. Good for static content.
S3 replication, on the other hand, has to be configured per destination region, so it is only good if you need to speed up content delivery in a few regions rather than worldwide. There is no caching involved because the files are actually stored in each region. Good for dynamic content.
Using CloudFront
After you have set up the origin for CloudFront you get a new DNS name; that DNS name is used to access the content you want cached in the network, so it is delivered quickly to the customer.
ALB or EC2 as origin
Since CloudFront's edge locations (servers) aren't part of your VPC, to use an ALB or EC2 as the origin you have to make them public so CloudFront can reach them. The user visits the CloudFront DNS name, then CloudFront makes the request on the user's behalf and caches the content to serve it quicker.
CloudFront geo restriction
Restrict who can access your CDN based on the country the user is visiting from.
You can set up allowlist (whitelist) or blacklist.
This is good for controlling access to content. It is very similar to Route 53 geolocation routing used to restrict content.
Cache invalidation
CloudFront won't know that you updated your content and that it needs to refresh the cache; it only refreshes after the TTL expires.
But you can force a full or partial cache refresh by using a CloudFront invalidation: you can invalidate all files or a specific path. Invalidated content is re-fetched from the origin.
CloudFront pricing
Because CloudFront is a world-wide CDN it is priced differently based on which edge location (server) is used to retrieve the content.
Some countries are of course more expensive than others.
There are three pricing models for CloudFront
- Price class all: You use ALL regions / CDN servers; this gets you the best performance, at the cost of the higher pricing in some countries
- Price class 200: Gets you a CDN in most regions, but excludes the most expensive ones
- Price class 100: Only the least expensive regions.
Global accelerator
To understand why this service is needed, here is the problem it solves: say you deploy an application in India but have users all over the world. A customer in, say, the US would have to make many packet hops over the public internet to reach the application in India. We want to minimize that latency.
Unicast IP: One server hold one IP address, this is the IP address that we are familiar with
Anycast IP: All server will hold the same IP and client is routed to the nearest one
Global Accelerator solves the problem of users' packets having to travel far over the public internet by minimizing the amount of public internet they traverse. When you use Global Accelerator you get 2 Anycast IPs for your application; they route traffic to the edge location closest to the user.
The Anycast IPs route traffic to the edge locations, which then send the traffic to your actual application over the private AWS network (which is much faster and more optimized than the congested public internet).
Benefits
You can use global accelerator for elastic IP, EC2 instances, ALB, NLB, public or private resources
You will get consistent performance because it routes with lower latency. It has built-in health check for fast failover.
It also has the same DDoS protection, because it uses the edge locations like CloudFront.
CloudFront vs Accelerator
They both use the same worldwide edge location network, and both use AWS Shield for DDoS protection.
CloudFront is used for caching: content is delivered from the edge location, not from your application. It can also serve dynamic content.
Global Accelerator, on the other hand, is meant to minimize public internet hops: traffic is sent over the private AWS network to speed it up. There is no caching, because content is served directly from your application. It also gives fast failover thanks to built-in health checks.
More Storage Options
SnowFamily
These are secure, portable devices that allow you to collect and process data at the edge AND migrate data in and out of AWS. Two use cases.
They are used to perform offline data migration when transferring over the network would take too much time: if it would take you more than a week to transfer over the network, you should use Snowball devices. You basically get a physical device shipped to you, load your data onto it, and ship it back; AWS then takes your device and imports the data into an S3 bucket.
There are three different device families
Snowball Edge
Allows you to move TBs or PBs of data in or out of AWS. You will be paying per data transfer job.
There are two kinds: Snowball Edge Storage Optimized, which gives you 80 TB of HDD capacity, and Snowball Edge Compute Optimized, which gives you 42 TB of HDD. The difference is in the compute power you get from the machine, which we will discuss later; for now we only focus on transferring data.
Snowcone and Snowcone SSD
Meant for environment where you will only collect little bit of data. Can withstand harsh weather.
Snowcone: 8TB
Snowcone SSD: 14 TB.
You must provide your own battery / cables. You can send data back by shipping the device, or use the online option via AWS DataSync. The online option is only available for the Snowcone family.
Snowmobile
An actual truck that you can load your data into. It can hold exabytes of data! It is a lot of freaking data.
There is GPS, 24/7 video surveillance. Use snowmobile if you need to transfer more than 10 PB.
Edge computing
There are two device types that can also do offline computing at the edge. What's an edge location here? Somewhere you can't access the internet or the cloud and have no compute power: a mining station, a ship at sea, or a truck on the road.
Snowcone and Snowball edge are the two devices that can also do edge computing. They can preprocess data, do machine learning, and if you need to transfer the data back to AWS you can ship it right back.
This is good for if you need to work on the data close to the site that you are collecting from.
Snowcone & Snowcone SSD: 2 CPU, 4GB of memory, can have wired and wireless access
Snowball edge compute optimized: 52 CPU! 208 GB of RAM! Even add GPU, but only 42 TB of storage
Snowball edge storage optimized: 40 CPU! 80 GB of RAM.
These devices can run EC2 instances and AWS Lambda functions using AWS OpsHub (software you install on your computer or laptop to manage your Snow Family devices). Use it to spin up EC2 instances or Lambdas.
Discounted pricing is offered for long-term deployments.
Data to glacier
What if you want to transfer data into Glacier? The Snow Family can only import into an S3 bucket, so you have to set up a lifecycle policy to transition that data into Glacier.
Amazon FSx
Managed file system service.
FSx for Windows file server
Fully managed windows file system. Can be mounted even by Linux EC2 instances.
You can access it from on-premise infrastructure by VPN or direct connect. You can also configure to be multi-AZ, backed-up daily to S3.
Storage option: SSD for low latency and HDD.
FSx for Lustre
Distributed file system for large-scale computing. Used for machine learning and high performance computing!
FSx for NetApp ONTAP
Use this if you already use ONTAP to move it to cloud.
This works with lots of platform, Linux, Windows, MacOS, very broad compatibility.
Storage shrinks or grow.
FSx for OpenZFS
Compatible only for ZFS.
Move workload already on ZFS.
Also broad compatibility.
Storage gateway
Gives your on-premise data center access to virtually unlimited cloud storage.
There are four different gateway types. If you don't have the virtualization capacity to run these gateways yourself, you can order a hardware appliance from Amazon and install it in your server room.
S3 file gateway
You connect your on-premise server to S3 via the S3 File Gateway. On-premise systems access it using the NFS or SMB protocol, and underneath the gateway translates those into HTTPS requests to S3.
The S3 file gateway will also cache mostly used data.
FSx file gateway
You deploy a Windows File Server on FSx, then add an FSx File Gateway for caching frequently used data on premises. This gives you low latency, which is the main reason to add the gateway.
Volume gateway
On-premise data volumes are backed by S3 and can then be backed up as EBS snapshots.
Cached volumes: data is stored in S3, with a local cache of the most recently used data for low-latency access.
Stored volumes: Entire dataset is on premise, scheduled backup to S3. For disaster recovery.
Tape gateway
Some companies have a backup process that uses physical tapes. With a Tape Gateway you keep the same process but back the tapes up to the cloud.
AWS transfer family
Send files in and out of Amazon S3 or EFS using FTP-style protocols.
Support FTP, FTPS (adds security on FTP), SFTP (Uses ssh)
It is a managed infrastructure!
Users can access the service via the endpoint provided by Transfer Family, or you can use your own DNS name with Route 53.
DataSync
Moves large amounts of data between on-premise / other cloud locations and AWS; an agent is required to move the data.
You can also move data between different AWS storage services, in which case an agent is not needed!
S3, EFS, FSx are all supported. The replication tasks can be scheduled hourly, daily, weekly. Not going to be synchronized immediately.
File permissions and metadata are preserved.
Snowcone has the DataSync agent pre-installed, so you can send data back to AWS over the network instead of shipping the device back.
SQS, SNS, Kinesis, Active MQ
Decoupling mechanism
When we deploy multiple applications they will inevitably need to communicate. There are two ways for applications to communicate.
- Synchronous communication: Direct connection for direct communication between application
- Asynchronous / event based communication: Application is connected to a queue, then the message will come asynchronous and the application will react to that message to do whatever it needs to do
Synchronous can be a problem if there is sudden spike of traffic. It is better to decouple your application so that your application can scale independently, without worrying about direct communication
Amazon SQS (Simple queue service)
It is a queue, it will have messages. There will be a producer that will be sending messages to the SQS queue. There can be multiple producer sending the messages.
Message can be anything.
A consumer polls the queue for messages ("hey, do you have any messages for me?"), processes the messages, and then deletes them from the queue.
There are two types of SQS, standard queues and FIFO queues.
Standard queues
The first service in AWS. Used for decoupling applications (if two applications are directly connected, we break that up by introducing an intermediate service that delivers the messages asynchronously, so they scale better).
For standard queue you get unlimited throughput (send as many as you want) and unlimited messages in queue.
Default retention of messages: 4 days, up to a maximum of 14 days. Low latency: < 10 ms for publish and receive.
Size < 256KB
Duplicate messages can happen! The standard queue is an at-least-once delivery service, so a message may occasionally be delivered more than once. You need to take that into account in your application.
It is also best effort ordering, so messages can be out of order not in the order that they are sent.
Sending/reading messages
Applications produce messages using the SDK (SendMessage API).
Consumers can run on EC2 instances, your own servers, or AWS Lambda (a queue can trigger a Lambda). Consumers poll for messages, can receive up to 10 messages at a time, and process them, e.g. insert them into an RDS database or just print them out.
After consuming a message you delete it using the DeleteMessage API.
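A minimal sketch of this produce / consume loop with boto3 (the queue URL and message body are placeholders):

    import boto3

    sqs = boto3.client("sqs")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

    # Producer: send a message (SendMessage API)
    sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

    # Consumer: poll up to 10 messages, process them, then delete each one
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        print("processing", msg["Body"])  # real code might insert into RDS instead
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])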
Multiple consumers
SQS can have multiple consumers processing messages in parallel. If a consumer doesn't process a message fast enough, the message can be received by another consumer, since it hasn't been deleted yet.
This is why SQS is at-least-once delivery and best-effort ordering.
ASG with metric of SQS
You can scale consumers using an Auto Scaling Group that polls from SQS. The metric to scale on is the queue length (ApproximateNumberOfMessagesVisible). Once the queue length goes over a certain level, it triggers the Auto Scaling Group to spin up more EC2 instances to handle those messages, so you can process more messages per second.
SQS as database buffer!
Using the same setup as above: if requests go into an Auto Scaling Group and the EC2 instances write transactions straight into a database, a sudden spike of requests will OVERWHELM the database and some writes will fail! How do we solve this? We use SQS as a buffer to hold the writes, because it is effectively infinitely scalable and can store an unlimited number of messages. Another Auto Scaling Group then polls the messages and performs the inserts, and a message is only deleted once the insert succeeds. If it fails, we don't delete the message and simply try again.
Without the SQS buffer there is a chance the database write fails and the transaction is lost. This works whenever the write can be done asynchronously, which it usually can.
Decoupling frontend and backend application
If you have an application that receives request to process a video and store it into a S3 bucket. If you have many request then it might not be able to handle it quickly. So you can decouple your application separating the request from the handling of the request, by storing the request into a SQS then process it with another application.
This is so that your frontend application will not be lagged or stuck at receiving request and HANDLING it at the same time. You can scale the frontend and backend independently.
SQS Security
It has in-flight encryption using HTTPS.
At-rest encryption using KMS keys.
Client-side encryption must be done by the client itself.
Access control is done with IAM policies, but you also have SQS access policies (like S3 bucket policies) for allowing other services or accounts to access the SQS queue.
Message visibility timeout
After a message is polled by a consumer, it becomes invisible to other consumers for some time. That visibility timeout is 30 seconds by default, which means the message has 30 seconds to be processed. If it hasn't been deleted by then, it is "put back in the queue" and another consumer can receive the same message again.
This is how a message can be delivered multiple times if it isn't processed fast enough.
If your application knows it needs a bit more time before it finishes, the consumer can call ChangeMessageVisibility to extend the timeout. If you set the timeout very high and a consumer crashes, it will take a long time before the message is seen again; if you set it too low, duplicate processing can occur.
The visibility timeout is per message.
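A sketch of extending the timeout for one in-flight message (queue URL is a placeholder; the receipt handle comes from the ReceiveMessage call):

    import boto3

    sqs = boto3.client("sqs")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    for msg in resp.get("Messages", []):
        # processing will take longer than the default 30 seconds,
        # so keep this message invisible to other consumers for 5 more minutes
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
            VisibilityTimeout=300,
        )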
Long polling
A consumer can wait for messages to arrive if there are none in the queue. This reduces the number of API calls to SQS and decreases the latency between a message arriving and your application seeing it. This is referred to as long polling.
Long polling can last from 1 to 20 seconds; 20 seconds is preferred.
You can configure it at the queue level (every poll becomes a long poll) or at the API level by specifying how long each call should wait.
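A sketch of both options (queue URL is a placeholder):

    import boto3

    sqs = boto3.client("sqs")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

    # Queue level: every ReceiveMessage call on this queue becomes a 20-second long poll
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"ReceiveMessageWaitTimeSeconds": "20"},
    )

    # API level: wait up to 20 seconds on this particular call only
    resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)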
FIFO queues
First in, first out. Message order is guaranteed, as opposed to standard queues where ordering is only best effort. When you poll for messages you receive them in the order they were sent.
It has limited throughput: 300 messages per second, or 3,000 messages per second with batching.
Messages are delivered exactly once, and the consumer processes them in the order they were sent.
SNS (Simple notification service)
One producer can send the same message to many receivers! This is the publish-subscribe pattern.
The producer sends a message to an SNS topic, and consumers who want to receive the message subscribe to that topic; there can be many subscribers, and each one processes the message.
A producer sends a message to one SNS topic. Consumers listen to the topics they want. You can have 12M+ subscriptions per topic, which is a lot!
SNS can send data to many kinds of subscribers: email, mobile notifications, HTTP endpoints, SQS, Lambda, and Kinesis Data Firehose.
Lots of AWS services publish messages too, sending them to a specified SNS topic; an ASG scaling event, for example.
Publish
To publish message you use SDK, you create topic, create subscription, then you publish to the topic
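A minimal sketch of that flow with boto3 (the email address is a placeholder subscriber):

    import boto3

    sns = boto3.client("sns")

    topic = sns.create_topic(Name="order-events")      # create (or reuse) the topic
    sns.subscribe(TopicArn=topic["TopicArn"],
                  Protocol="email",
                  Endpoint="ops@example.com")           # placeholder subscriber
    sns.publish(TopicArn=topic["TopicArn"],
                Message="Order 42 was shipped")          # every subscriber receives this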
Security
In-flight encryption, at-rest encryption using KMS keys. If client wants to handle encryption they have to do it.
IAM policies to regulate access to SNS resources. Also can define SNS policies to define what other services can access SNS.
SNS + SQS Fan out pattern
You want a message to go to multiple SQS queues. If you send it to each queue individually, delivery to all of them is not guaranteed. Use the fan-out pattern!
Publish once to an SNS topic, with the SQS queues as subscribers. This guarantees the message is delivered to every queue.
This is fully decoupled, with no data loss. SQS adds data persistence and lets you delay or retry the processing work since the messages are stored in queues.
You need to allow SNS to write to the SQS queues (queue access policy). The queues can even be in other regions and it will still work!
S3 event to multiple queues
An S3 event (like an object being created) can only be sent to one destination, but what if you want it to go to multiple places? Use the SNS + SQS fan-out pattern: publish the event notification to an SNS topic, have SQS queues subscribe to it, and the event can then go to as many places as you like — services, other queues, Lambdas!
SNS to S3 via Kinesis data firehose
Kinesis Data Firehose lets you deliver streaming data (data that is continuously produced, e.g. customer logs) to a destination, and you can apply transformations before storing the data in, say, an S3 bucket.
SNS can also be used to send data to Kinesis: you send the data to an SNS topic, and Kinesis Data Firehose subscribes to it. Because Firehose lets you set a destination for the streaming data, you can point it at an S3 bucket to store everything there, or transform the data before storing it.
FIFO topic + SQS FIFO
Just like a FIFO queue, messages published to a FIFO topic are ordered.
You get ordering and no duplication. However, only SQS FIFO queues can subscribe, and throughput is limited just like a FIFO queue.
You use this to fan out to SQS FIFO queues.
Message filtering
You can attach a JSON filter policy to each subscription. Based on that policy, the subscriber only receives the messages it filters for.
If a subscription doesn't have a filter policy, it receives every message.
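A sketch of a filter policy, assuming messages are published with a message attribute such as order_type (ARNs are placeholders):

    import boto3, json

    sns = boto3.client("sns")

    # Only deliver messages whose "order_type" attribute equals "refund" to this subscriber
    sns.set_subscription_attributes(
        SubscriptionArn="arn:aws:sns:us-east-1:123456789012:order-events:1234-abcd",  # placeholder
        AttributeName="FilterPolicy",
        AttributeValue=json.dumps({"order_type": ["refund"]}),
    )

    # Publishers attach the attribute the policy filters on
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",  # placeholder
        Message="refund issued",
        MessageAttributes={"order_type": {"DataType": "String", "StringValue": "refund"}},
    )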
Amazon Kinesis
It let you collect, process, and analyze streaming data in real-time. The streaming data can be from logs, metrics, IoT telemetry data. As long as the data is generated fast then is considered streaming data.
Kinesis data stream, Kinesis data firehose, Kinesis data analytics, Kinesis video streams. Four Kinesis services.
Kinesis data stream
A way for you to stream and output your streaming data to consumers.
It is made up of multiple shards that you provision ahead of time. Each shard provides a channel for your data to flow through.
Producers send data to your Kinesis Data Stream using the SDK or the Kinesis Producer Library (which uses the SDK at a lower level), for example to stream logs.
Producers send records to the data stream; a record is made up of a partition key (which determines the shard the record goes to, after hashing) and a data blob (the actual data, up to 1 MB).
A producer can send 1 MB/sec or 1,000 records/sec per shard.
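A sketch of a producer call; the stream name is a placeholder, and the partition key decides which shard the record lands on:

    import boto3, json

    kinesis = boto3.client("kinesis")

    kinesis.put_record(
        StreamName="truck-gps",                                    # placeholder stream name
        PartitionKey="truck_17",                                   # hashed to pick the shard
        Data=json.dumps({"lat": 48.85, "lon": 2.35}).encode(),     # data blob, up to 1 MB
    )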
After the data is stored in Kinesis data stream it can be consumed by application using SDK or Kinesis client library which again also uses SDK. Lambda, Kinesis data firehose, or Kinesis data analytics.
Consumers receive records made up of a partition key, a sequence number, and the data blob. The sequence number represents where the record sits within its shard.
Consumers read at 2 MB/sec per shard shared across all consumers, OR
at 2 MB/sec per shard per consumer (enhanced fan-out).
Properties of data stream
Data in the stream can be retained for up to a year. Data cannot be deleted once inserted (it expires with the retention period). This gives you the ability to replay the data.
Records that share the same partition key are ordered, because they go to the same shard.
Capacity mode
Provisioned mode: You pick the number of shard the data stream has in advance.
You also pay per shard provisioned per hour.
On-demand mode: No need to provision or manage the capacity. Default capacity provision is 4 shards. You will be paying per stream per hour and data in/out per GB.
Pick provisioned mode if you know what you need in advance.
Security
IAM policies to control where the data goes.
HTTPS encryption in-flight, and at rest using KMS. You can also encrypt data yourself.
VPC endpoint can be used to Kinesis access in a VPC.
Kinesis firehose
A managed service that also ingests streaming data and writes it to a destination, including AWS services (S3, OpenSearch, and Redshift — know these by heart). You can also deliver to third-party destinations (Datadog, Splunk) or to your own HTTP endpoints.
The producers for Firehose are the same as for Data Streams: applications and clients sending streaming data via the SDK or the Kinesis Producer Library. In addition, a Kinesis Data Stream, CloudWatch, or an SNS topic can act as the source.
Streaming data can optionally be transformed with a Lambda function.
Records can be up to 1 MB. Data is delivered to the destination in batches, which is why Firehose is near real time: the buffer is flushed once it reaches the configured size (1 MB minimum) or after the buffer interval (60 seconds minimum) if the batch isn't full.
You can also send failed or backup data to another S3 bucket.
Firehose scales automatically and is serverless; you pay for the data going through it.
Data stream vs Firehose
- Data Streams: you write your own custom code for producers and consumers. It is real time! You have to provision capacity, and there is data retention with the ability to replay the stream
- Firehose: fully managed, nothing to provision. You use it to send data to S3 / Redshift / OpenSearch / third parties / HTTP endpoints. However, it is near real time because data is delivered in batches
- Firehose has no data storage and no replay capability
Kinesis Data Analytics
Not really needed to know but here it is.
Kinesis data analytics for SQL application
The source of the streaming data can be from Kinesis Data Stream or Kinesis Data Firehose. Then you can perform SQL statement for real-time analytics.
Along the analytics you can enrich the data from a S3 source, add more data to it.
Destination that you can sent it to is the same, Data Stream and Firehose, then you can sent it to its final destination that these two services can sent to.
Kinesis data analytics for Apache Flink
Use Apache Flink on the service. Use this if you want more powerful query compared to SQL.
Sources can be Kinesis Data Streams and Amazon MSK.
This is so that you can run any Flink application on managed cluster on AWS.
It does not read from firehose!
Data ordering Kinesis vs SQS FIFO
Kinesis
For Kinesis, imagine you have 20 trucks sending their GPS coordinates to 5 shards. The partition key will be truck_id where id is the number for the truck.
Now before inserting into one of the shard for data stream, it has to figure out which shard it goes into right? Well it will take the truck_id and hash it to figure out the appropriate place it is in and then sends the record there. Just like how HashMap does. It hashes the key and insert the data.
Same key will always go to the same place, so if your truck_id doesn't change your GPS will be sent to the same shard always. Therefore data for Kinesis data stream are ordered on shard level. Multiple truck mapped to same shard will sent the record to the same shard.
SQS FIFO
Standard queue: No ordering is guaranteed because of best effort ordering and at least once delivery.
FIFO queue: if you do not use a group ID, messages are consumed in the order they were sent, but with only one consumer.
You can group related messages using a group ID; the messages are then ordered FIFO at the group level. Group ID 1 has its own FIFO ordering, group ID 2 has its own FIFO ordering, and you can have multiple consumers when you have multiple group IDs.
In the truck GPS scenario, each truck's GPS messages get their own group ID: 20 trucks allow up to 20 consumers. Throughput is still 300 messages per second, or 3,000 with batching.
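A sketch of sending to a FIFO queue: MessageGroupId gives per-truck ordering, and MessageDeduplicationId is required unless content-based deduplication is enabled (names and values are placeholders):

    import boto3, json

    sqs = boto3.client("sqs")

    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/truck-gps.fifo",  # placeholder
        MessageBody=json.dumps({"lat": 48.85, "lon": 2.35}),
        MessageGroupId="truck_17",                 # messages in this group stay strictly ordered
        MessageDeduplicationId="truck_17-0001",    # or enable content-based deduplication instead
    )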
Amazon MQ
SQS and SNS use AWS proprietary protocols. On premises you may already use open protocols such as MQTT, AMQP, STOMP, OpenWire, or WSS.
Amazon MQ is a managed message broker service that supports these open protocols, which makes migrating to the cloud easier.
However, it doesn't scale as much as SQS / SNS. It runs on servers, so you need Multi-AZ with failover.
For failover you run a standby MQ instance; both mount the same EFS, and when one AZ fails, traffic fails over to the standby instance.
ECS, Fargate, ECR, & EKS
Amazon ECS
Elastic container service. It is Amazon's own container platform. Let you launch container instance and run them.
If you want to launch docker container on AWS you will be launching ECS Tasks. Task definition tells what docker image to use to run the container, how much CPU and RAM are given to each container, and other configuration. The ECS task will be launched within ECS clusters.
ECS clusters can be EC2 launch type or fargate launch type.
EC2 launch type
Your cluster is made of EC2 instances underneath, which you have to provision in advance. Your containers need to run on some kind of host, and in this case that host is your EC2 instances.
To form the cluster, each EC2 instance runs the ECS Agent, which registers the instance into the specified ECS service and cluster.
Only after the ECS Agent is running and the instance is registered can it run ECS tasks. AWS then takes care of starting / stopping the containers.
Fargate launch type
This time you don't need to provision any EC2 instance underneath, it is all serverless (although there will be servers underneath but you don't have to worry about it!).
It is still considered a ECS cluster even though it is serverless :)
For fargate launch type you just need to define task definitions, then AWS will run the task for you without having you worrying about allocating the servers. Launch more tasks and it will scale automatically because is serverless!
Launching ECS tasks
When you launch your ECS tasks you get the option to pick which launch type. Either via the EC2 launch type from the cluster you defined. Or via fargate which you don't need to manage the infrastructure yourself. You get to pick it.
Service and tasks in a cluster
The task definition tells AWS how to spin up the container: the image link, configuration, and how much CPU and RAM to give the container.
A task is a running container created from a task definition.
A service runs tasks inside the cluster. It lets you specify the number of tasks that should be running at all times; if a task becomes unhealthy, the service stops it and replaces it with a healthy one.
A service uses the cluster's resources, so it could run one task on one EC2 instance and another task on a different instance.
You can also run tasks directly without a service, which is good for short-lived or one-off tasks.
IAM for ECS
EC2 instance profile role: only for the EC2 launch type, because only the EC2 launch type has the ECS Agent. The profile is used by the ECS Agent to make API calls to the ECS service, to ECR (for pulling Docker images), and to CloudWatch (for sending logs).
ECS task role: valid for both Fargate and EC2 launch types. It gives each task its own IAM role, so each container can have access to different AWS services: Task A needs access to S3, Task B needs access to DynamoDB.
Load balancer integrations
You can expose containers running in clusters in front of a load balancer. ALB is supported.
NLB is only recommended if you need high throughput / high performance or pair it with AWS private link.
Don't use the classic load balancer
Data volumes (EFS)
To have persistent data, because container's data are not saved, you need to use EFS file system. You would mount them in your container in order to save data from the container.
Otherwise, files you write in the container will just be deleted after the container finishes.
EFS is multi-AZ also serverless because you don't have to worry about provisioning the storage it has. Scales automatically.
If you pair it with Fargate + EFS, this is serverless design. Fargate again you do not need to manage the infrastructure that runs the container yourself, it is all serverless. Scales automatically for you.
S3 cannot be used as a mount file system!
ECS Auto scaling
There are two levels of scaling: scaling at the service (task) level and scaling at the EC2 level.
To automatically scale the number of tasks running in a service you can use AWS Application Auto Scaling; the metrics you can use are:
- Scale it on average CPU utilization on Service
- Memory utilization
- ALB request count per target
Target tracking (scale it based on target metric value), step scaling (scale it based on specified CloudWatch alarm), or scheduled scaling (Scale it based on a specific time).
Remember that scaling the tasks doesn't mean the EC2 instances are scaled. Application Auto Scaling just increases the number of running containers when, say, the CPU utilization for that service exceeds the threshold (or whatever your scaling policy is). It scales per service: you increase the number of tasks (containers).
To scale the EC2 instances you can use an ASG based on CPU utilization, as before. The smarter way is the ECS Cluster Capacity Provider, which scales the ASG as soon as you lack capacity to launch tasks. It is paired with an ASG, so if you use an ASG with an ECS cluster, use a capacity provider!
ECS Solution architecture
1. ECS task invoked by event bridge
An S3 bucket sends an event to EventBridge; you set a rule to run an ECS task that processes whatever the S3 event describes, then the result is sent to DynamoDB.
This is serverless architecture.
2. Event bridge schedule
Can schedule a rule to trigger every hour to run a ECS task every hour. The container can do something every hour.
3. SQS queue
Message sent to SQS queue, then service in ECS cluster poll for messages and then let the task process it. You can scale ECS for task level based on the queue length as well. I.e increase the number of task based on the length of queue.
ECR
Elastic container registry. This is where you can store and pull images on AWS. There are both private and public registry for you to store images.
Fully integrated with ECS. When your EC2 instances want to pull image from ECR it needs sufficient permission from IAM role.
There is image vulnerability scanning built-in to ECR.
EKS
Elastic Kubernetes services. Another way of managing containers.
Kubernetes is a system for automating the deployment, scaling, and management of containers. It is an open-source alternative to ECS.
You can launch Kubernetes backed by EC2 instances or Fargate which is serverless.
Use EKS if your company is already using Kubernetes on-premises or in another cloud and is migrating to AWS. Kubernetes itself work with any cloud provider.
Tasks in Kubernetes are called EKS pods.
EKS node are like EC2 instances that the pods are running in.
Node types
Managed node groups: AWS create and manages nodes (EC2 instances) for you. Can do on-demand or spot instances
Self-managed nodes: You have to create the nodes yourself and register them to the cluster. On-demand or spot instances
Fargate: You can also use fargate with EKS, you don't have to manage any nodes.
Data volumes
Data volumes can be attached to the cluster for persistent data. EKS uses the Container Storage Interface (CSI).
EBS, EFS (the only one that works with Fargate), FSx for Lustre, and FSx for NetApp ONTAP can be used as volumes.
App Runner
Managed service to make it easy to deploy web apps and APIs at scale
You don't need to know any infrastructures.
You start with source code or container image, then you do configuration on CPU, RAM, for your web application. Then AWS will build and deploy your web app. Magic.
Automatic scaling, highly available, load balancer, encryption, you can connect to database, cache, and message broker.
Serverless, Lambdas, DynamoDB, Cognito, API Gateway
Lambda function
Why is serverless good? Well if you are using EC2 instances then you have to provision them if you need more compute power. You will be paying for those servers that are continuously running. It has limited RAM and CPU.
Lambda function you don't have to manage servers, you just need to write code then deploy it and it will run on-demand. And the scaling is automatic.
Can be triggered with many of AWS services.
Language support for lambda
There are a ton of languages supported by Lambda, even Rust! You can also write your own custom runtime API.
Lambda can even run a container image, but the image must implement the Lambda runtime API; prefer ECS / Fargate for running arbitrary containers.
Pricing on lambda
You will be paying per function invoke and the amount of execution time the function took.
$0.20 per 1 million request
$1.00 for 600,000 GB-second
So it is very cheap
Lambda integrations
API Gateway: Create a REST API endpoint to invoke lambda functions
DynamoDB: Triggers for lambda whenever there is a db update
S3: When files are stored into S3 it will have a trigger
Event Bridge: Can set up CRON task to run a function every hour say. What statement-pdf-generator used
Lots of services!
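As an illustration of the S3 trigger above, a minimal handler sketch; the event shape is what S3 notifications deliver, and the processing is a placeholder:

    # lambda_function.py -- invoked by an S3 "object created" notification
    def lambda_handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            print(f"new object s3://{bucket}/{key}")  # real code would process the file here
        return {"status": "ok"}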
Lambda limits
The limits are per region, for each account.
Memory allocation ranges from 128 MB up to a max of 10 GB.
Maximum execution time is 15 minutes; anything longer than that is not a good use case for Lambda.
4 KB of environment variables.
Disk capacity of up to 10 GB of temporary storage in /tmp, used to load large files at runtime.
Concurrent executions: 1,000 by default, but you can request an increase. This is how many Lambda functions can run at the same time.
The deployment package limit is 50 MB compressed / 250 MB uncompressed; if you need more space, load additional files into /tmp at startup.
Edge functions
A code that you write and attach to CloudFront distribution (CDN). The idea is that these function will run close to user to minimize latency.
Two types CloudFront functions and Lambda@Edge. These functions are deployed globally. This is used to customize CDN content that's coming out of CloudFront.
CloudFront function
Viewer request and viewer response. So basically client will sent request to CloudFront and CloudFront will make the request on behalf of the viewer to the origin, then give the response back to the client.
High performance and used for customizing CDN.
This is native feature to CloudFront and you would write code in JavaScript
Very quick response < 1 ms for execution time.
Lambda@Edge
Functions written in Node.js or Python, used to change CloudFront requests and responses: the origin request and origin response, in addition to the viewer request and viewer response.
Lets you run code closer to the user.
You write the function in one region (us-east-1) and it is replicated to the other locations.
Can do a lot more, with execution times of 5 - 10 seconds.
Use cases
- CloudFront functions: Cache key normalization, header manipulation, URL rewrites and redirect, request authentication and authorization
- Lambda@Edge: Longer execution time, adjustable CPU or memory, Network access to use external services for processing, file system access or access to body of HTTP request
Lambda in VPC
By default Lambda functions launch outside of your own VPC (in an AWS-owned VPC), so they cannot access resources inside your VPC such as RDS, ElastiCache, or internal ELBs.
You have to launch your Lambda in your VPC to access those private resources. It then gets an ENI to reach the private subnet.
Use Lambda with RDS Proxy to pool database connections, so functions don't open and quickly close too many connections. The Lambda connects to the proxy, which keeps database connections open and shares them across function invocations to minimize connections to the RDS instance.
RDS Proxy is never public, so you must launch your Lambda in your VPC to take advantage of connection pooling.
Amazon DynamoDB
Fully managed database, just like RDS. However, rather than MySQL it is a NoSQL database, not a relational database.
Use case: scale to massive workloads, since it is internally distributed. Millions of requests per second, trillions of rows.
You get consistently fast, single-digit millisecond performance!
Security is managed via IAM. Low cost and always available!
Standard and Infrequent Access (IA) table classes.
Basics
DynamoDB is made of tables; the database itself already exists, you don't need to create it.
Each table has a primary key, and you add items (rows). Attributes are the columns, and they can be added over time very easily.
Max size of an item is 400 KB, so it can't store big objects.
A great choice if your schema needs to evolve rapidly, because updating an RDS schema is a complicated process.
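A sketch of basic reads and writes, assuming a table named Users with partition key user_id:

    import boto3

    table = boto3.resource("dynamodb").Table("Users")   # placeholder table name

    # Attributes beyond the primary key can differ from item to item (flexible schema)
    table.put_item(Item={"user_id": "u-42", "name": "Alice", "plan": "pro"})

    resp = table.get_item(Key={"user_id": "u-42"})
    print(resp.get("Item"))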
Capacity modes
Provisioned mode: you provision how many reads/writes per second you need in advance.
For steady, smooth workloads.
You can auto-scale the RCUs (read capacity units) and WCUs (write capacity units): increase them when needed, decrease them when not.
On-demand mode: read and write capacity scale automatically with your workload, but it is more expensive than provisioned mode since you pay per read/write your app performs. For unpredictable workloads and sudden spikes.
DynamoDB Accelerator
Fully managed highly available in-memory cache for DynamoDB.
This solve for read congestion by caching. Microsecond latency!
No application logic changes are required to use it.
DynamoDB Accelerator vs ElastiCache
ElastiCache is used for caching aggregation results.
DAX is an individual object cache (for the items you retrieve).
DynamoDB stream processing
Stream processing lets you stream item-level modifications from your table.
It allows you to react to changes in real time as your table changes; for example, you can invoke an AWS Lambda function on changes to your DynamoDB table.
DynamoDB Streams: 24-hour retention, processed using Lambda triggers or the Kinesis adapter, but can only be consumed by a limited number of consumers.
Kinesis Data Streams: send the changes to a Kinesis Data Stream for longer retention (up to 1 year) and more consumers; consumers can be AWS Lambda, Kinesis Data Analytics, Firehose, and more.
DynamoDB global table
Global table is replicated cross region. You can write to either table and it will be replicated.
Global table is to make table accessible with low latency in multiple-regions. Application can read and write table in any region. This is active-active replication.
Have to enable DynamoDB stream to have replication possible.
DynamoDB Time to live
Automatically delete items after some time. As soon as the time expires then the item is deleted.
This is good for web session handling. Their session can have a expiry time then the session data is deleted after.
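A sketch of enabling TTL and writing a session item with an expiry timestamp (table and attribute names are placeholders):

    import time, boto3

    dynamodb = boto3.client("dynamodb")
    dynamodb.update_time_to_live(
        TableName="Sessions",                                             # placeholder table
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )

    # The item is deleted automatically some time after the epoch timestamp passes
    boto3.resource("dynamodb").Table("Sessions").put_item(Item={
        "session_id": "s-123",
        "expires_at": int(time.time()) + 3600,  # expire in one hour
    })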
Backups for disaster recovery
Continuous backups using point-in-time recovery. Enable for last 35 days. The recovery creates new table
On-demand backups: Full backup for long-term retention until deleted explicitly.
Integration with S3
You can export a table to S3 (PITR must be enabled), which lets you query the data in S3 with Athena.
The export can also serve as a snapshot of your table.
Export is in DynamoDB JSON or ION format; you can then import back from S3 using CSV, DynamoDB JSON, or ION.
Amazon API Gateway
There are multiple ways a Lambda gets triggered by a client: the client invoking it directly, or fronting the Lambda with an ALB to expose it as an HTTP endpoint. There is another way.
API Gateway is a serverless service that creates a REST API endpoint which can proxy requests to the Lambda, triggering it. You get a DNS endpoint for invoking the API after you deploy it.
With proxy integration, API Gateway passes the full HTTP request as the event to your integration; for example, your Lambda uses that event to generate the correct response.
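A minimal sketch of a proxy-integration handler: the event carries the method, path, headers, query string, and body, and the function must return a response object in this shape:

    import json

    def lambda_handler(event, context):
        # read a query string parameter passed through API Gateway (placeholder logic)
        name = (event.get("queryStringParameters") or {}).get("name", "world")
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"message": f"hello {name}"}),
        }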
Features
API Gateway handles API versioning, v1, v2,...
It also handles security for authentication and authorization, and handles different environments.
Transform and validate requests and response. Generate SDK and API specification, and cache API responses. This provides way more feature than a simple ALB that front the lambda function.
Combination
You usually use gateway with lambda function.
But it can also be combine with HTTP endpoint in the backend. This adds rate limiting, caching, user authentication
Combine with any AWS service: You can expose any AWS API through the API gateway, start AWS step function workflow, post a message to SQS. You would do this to add authentication, deploy your services publicly via a REST API. Basically you can expose your existing AWS resources via REST endpoint to the public.
Endpoint types
Edge-optimized (default): makes your API accessible from anywhere in the world. The gateway itself still lives in one region, but requests are routed through CloudFront edge locations to improve latency.
Regional: you expect all users to be in the region you deploy in. You can manually combine it with CloudFront for more control.
Private: not public, only accessible from within your VPC using an ENI.
Security
Authentication: you can authenticate users using IAM roles (e.g. an EC2 instance that wants to call the API), Cognito for external users, or your own custom authorizer implemented with a Lambda function.
Custom domain name with HTTPS: integrated with AWS Certificate Manager; if you want HTTPS on your endpoint you need a certificate. For edge-optimized endpoints the certificate must be in us-east-1; for regional endpoints it must be in the region where you expose the gateway. You then set up the CNAME or A record in Route 53.
Step Function
You use visual workflow to orchestrate your lambda function.
It can actually integrate with other services besides lambda function like EC2, ECS, API gateway!
You can have work flow of your data that goes through the step function, and then processes it at every stage, then also have human approval at certain stages.
Use for: Data processing, order fulfillment, any workflow that need graph to visualize.
Amazon Cognito
Gives users an identity to interact with a web or mobile application. They don't have AWS accounts, but they still need to use the resources you have deployed.
Cognito User Pools: sign-in functionality for app users.
Cognito Identity Pools (federated identities, what we are using!): give users temporary AWS credentials to access resources directly.
Cognito vs IAM: Cognito is for mobile and web application users who sit outside of AWS; think "hundreds of users" who can authenticate with whatever identity provider you want.
Cognito user pools
Serverless database of user for your web and mobile apps. Can have simple login and password reset. MFA. Federated identities (users from other platform)
You can use CUP with API Gateway and ALB to authenticate user.
Cognito identity pools
Gives users direct access to AWS resources using temporary AWS credentials.
This is how Cloudsentry gives us that temporary AWS credential access.
You can use CIP together with CUP to retrieve temporary credentials, and apply fine-grained, row-level access to DynamoDB so a user cannot read/write every row in the table.
Serverless Architecture Discussion
MyTodoList
We want to make a mobile todo with REST API with HTTPS using serverless architecture. User should be able to interact with their own folder in S3. Users can write and read to-dos, but mostly read them. Database should scale and have high read throughput.
How do we do it?
API Gateway will be used for exposing endpoint. Behind it will invoke AWS lambda, and the lambda will be retrieving data from DynamoDB.
To add authentication we can use Amazon Cognito along with API Gateway to do verification.
To give user access to the S3 bucket, we will generate temporary credentials using Amazon Cognito then the client can use that temporary credentials to store/retrieve files from the S3 bucket.
To handle the high read throughput on DynamoDB we can add DAX (DynamoDB Accelerator) so frequently accessed data is served from the cache. Caching of REST responses can also be done at the API Gateway level.
Nothing is managed by us, it is all serverless.
MyBlog.com
This website should scale globally, rarely write blogs but read by many
Some websites is purely static files, the rest is a dynamic REST API. Caching should be implemented where possible, and new users should receive welcome email. Photo uploaded to the blog should generate a thumbnail.
CloudFront fronted S3 to expose the S3 files globally. Add origin access control so that the bucket can only be accessed by the CloudFront.
To add RESTful API you add API gateway, then invoke AWS lambda, that read from DynamoDB. Add DAX as well.
To send the welcome email, you can use DynamoDB Streams and trigger a Lambda function that sends the email via Simple Email Service.
You can also trigger the thumbnail creation by a lambda whenever the file is uploaded to S3 bucket.
Microservice Architecture
Synchronous pattern: API Gateway, load balancer, this allows you to make explicit call to the microservice
Asynchronous pattern: SQS, Kinesis, SNS, Lambda triggers, the invocation to the microservice will be done at a later time
There are challenges with microservices, but some of them can be solved by using serverless patterns.
Software updates offloading
If your EC2 instances host software updates, there will be a sudden surge in traffic whenever people want to download an update. We want to optimize cost and CPU; how do we do it?
We just put CloudFront in front of it; we don't need to change our architecture at all. CloudFront scales for us, so the ASG doesn't have to scale as much. This is the easiest solution: CloudFront makes the application more scalable and cheaper because it caches the updates, so requests don't have to hit the EC2 instances as often.
Picking the Right Databases
How to pick right database
Depend on the question, read-heavy, write-heavy, or balanced workload? throughput needs?
How much data to store and for how long? Will it grow? Data durability?
Latency requirements?
Strong schema or flexibility? It depends on a lot of factors
RDS
Fully managed PostgreSQL / MySQL / Oracle / SQL Server / MariaDB.
You need to provision the RDS instance size and EBS volume type, but you can enable storage auto-scaling.
Read replicas and Multi-AZ are supported; read replicas add read capacity, or can serve another application so it doesn't affect production access.
Use case: relational workloads, OLTP (online transaction processing) such as purchases.
Aurora
Compatible with Postgres and MySQL. It is AWS's proprietary database.
Data are stored in 6 replicas, across 3 AZ, highly available and have self-healing, also auto-scaling out of the box.
You can define custom endpoints for writer and reader database instances.
Aurora serverless: For unpredictable / intermittent workloads, no capacity planning
Aurora Multi-Master: for fast failover of writers
Aurora Global: up to 16 database read instances in each region
Cloning Aurora is much faster than snapshot restoring.
ElastiCache
In-memory data store giving sub-millisecond latency. You need to provision the EC2 instance type.
Redis gives you Multi-AZ and read replicas, and data can be preserved across restarts.
However, using ElastiCache requires heavy changes to your application code.
ElastiCache is good for user session store.
DynamoDB
Fully managed, serverless NoSQL database with millisecond latency. You can migrate MongoDB-style data here because it is also key-based with a primary key.
Provisioned capacity: Used for smooth workload
On-demand capacity: Used for unpredictable workload
DynamoDB can replace ElastiCache as key/value store, can automatically expire data.
Highly available, multi-AZ, you can scale read/write independently.
DAX accelerator for microsecond read latency.
You can enable stream to react to database changes, insertion, update, delete and make it invoke lambda functions.
Good for schemas that are constantly changing.
S3
Key / value store, good for storing big objects. Max object size is 5 TB! Can have version capability.
DocumentDB
Aurora version of MongoDB, it is NoSQL database.
Used to store, query, and index JSON data.
Scales automatically, data replicated across 3 AZ.
Neptune
Fully managed graph database. Have edges and nodes.
Keyspaces
Managed Apache Cassandra. Serverless, scalable, highly available.
Another NoSQL distributed databases.
QLDB
Quantum ledger database. Ledger is a book recording financial transactions.
Again is highly available, serverless.
Review history of all changes made to your application. Immutable and cryptographically verifiable, basically the blockchain.
No decentralization, it is a central database
Timestream
Time + value. Basically a time series database.
Store and analyze trillions of events per day.
It is very fast if you are using time based data.
Data and Analytics
Athena
Query service that let you analyze data stored in S3. It uses SQL language to do the query.
You will be putting data into S3 then you will query it with Athena.
You pay per TB of data scanned. No need to pay for any server since it is serverless.
It is commonly used with Quicksights for reporting/dashboards.
Use case: Business intelligence / analytics / reporting and analyze logs
Serverless SQL to analyze S3 use Athena!
Improving performance
To improve Athena performance you want to scan less data. Columnar formats like Apache Parquet or ORC are recommended; use Glue to convert your data, so queries read only the columns they need instead of all the data.
Compress the data for smaller retrievals.
Partition your data in S3 so queries on the partition (virtual) columns scan less data.
Because you only read a subset of the columns, you pay less.
Use fewer, larger files; they are easier to scan than lots of small files.
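For example, a sketch of running a partitioned query through boto3 (database, table, partition values, and output bucket are placeholders):

    import boto3

    athena = boto3.client("athena")

    athena.start_query_execution(
        QueryString=(
            "SELECT status, COUNT(*) "
            "FROM access_logs "
            "WHERE year = '2024' AND month = '05' "   # partition columns limit the data scanned
            "GROUP BY status"
        ),
        QueryExecutionContext={"Database": "weblogs"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
    )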
Federated query
Allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources, so you can effectively run Athena against almost any database, even on-premises ones.
You need a Data Source Connector, which runs on AWS Lambda, to run federated queries.
Redshift
It is a database that also does analytics, based on PostgreSQL, used for data warehousing and analytics.
You load your data into Redshift and then run analytics. Storage is columnar, with a parallel query engine.
SQL interface for performing queries.
Compared with Athena, you have to load the data into Redshift first, but then queries and joins are much faster; it has indexes for high performance. If you need many queries and joins, Redshift is better than Athena.
Redshift Cluster
Provision the node size in advance, can use reserved instances for cost saving
Has leader node for query planning and result aggregation.
Compute node: Actually carry out the queries
Snapshots and disaster recovery
Redshift can have multi-AZ for some clusters.
Snapshots are point in time backups of a cluster stored in S3, you can restore the snapshot to a new cluster. Snapshots can be taken automatically or you can do manual snapshot retained until you delete it.
You can also automatically copy snapshot of a cluster to another AWS region to do disaster recovery.
Inserting data into Redshift
You can ingest data into Redshift via Kinesis Data Firehose (which deposits it into S3 and then issues a COPY from S3 into Redshift).
You can also copy data from S3 into Redshift yourself by issuing a COPY command with the correct IAM role. Use Enhanced VPC Routing so the traffic stays inside your VPC rather than going over the public internet (S3 is otherwise publicly accessible).
Or you can write from your EC2 instances into Redshift, but do it in large batches, which is much more efficient.
Redshift spectrum
This is how you query data that is already in S3 without loading it into Redshift. You must have a Redshift cluster available; you submit the query from your cluster, the query is fanned out to Redshift Spectrum nodes (thousands of them sit in front of S3 and run the query, so it is very efficient), and the results come back to the cluster that asked for the query.
You leverage the Spectrum nodes rather than your own cluster to run a much more efficient query.
OpenSearch
Successor to ElasticSearch
With DynamoDB you can only query by primary key or indexes, but with OpenSearch you can search on any field. You typically use OpenSearch as a complement to another database, and you can also do analytics on it.
You need to back it with a cluster, since it is not serverless.
The query language is not SQL; it has its own query language.
Common pattern with OpenSearch
You have DynamoDB holding your data; changes go to DynamoDB Streams, a Lambda function reacts to them and inserts the items into OpenSearch. You can then search in OpenSearch and retrieve the full item from DynamoDB.
You can also send CloudWatch Logs to a Lambda and on to OpenSearch for searching.
A Kinesis Data Stream can send data to Kinesis Data Firehose, optionally transform it, and deliver it to OpenSearch in near real time.
Or a Kinesis Data Stream can feed a Lambda function that reads the records and writes them into OpenSearch.
EMR
Elastic MapReduce. Creates Hadoop clusters for big data analysis.
A Hadoop cluster can be hundreds of EC2 instances. EMR comes bundled with lots of tools for big data scientists.
Apache Spark, HBase, Presto, Flink — EMR takes care of setting up those tools for you.
Use case: data processing, machine learning, web indexing, big data
Master node: Manages the cluster, coordinate, and manage health, must be long running
Core node: Runs tasks and store data
Task node: Just to run tasks, usually spot instances
Purchasing options
On-demand: reliable, predictable, won't be terminated
Reserved: cost saving, used for master node and core node
Spot instances: Used for task nodes
QuickSight
Serverless machine learning powered business intelligence service to create interactive dashboards.
Fast, automatically scalable, embeddable, with per-session pricing.
Uses in-memory computation using SPICE engine, only if the data is imported into QuickSight. Doesn't work if you don't import the data.
Column-level security to prevent others seeing some columns.
You can use QuickSight with RDS, Aurora, Redshift, Athena, S3, OpenSearch, and Timestream.
Dashboard and Analysis
Users and Groups only exist in the QuickSight they are not IAM.
Dashboard is read-only snapshot of an analysis that you can share.
Dashboard is then shared with users or groups. Users can also see the underlying data.
Glue
Managed extract, transform, and load ETL service. Fully managed
used to prepare and transform data for analytics.
Help convert data to parquet format
Parquet is column data, much better for filtering with Athena.
Glue can be used to transform data to parquet format.
Glue data catalog
Data crawler connect to databases. Write metadata of the columns to Data Catalog, then it is used to perform ETL.
Other things
Glue job bookmarks: Prevent re-processing old data
Glue elastic views: Combine and replicate data across multiple data stores using SQL.
Glue databrew: Clean and normalize data
Glue studio: GUI to create, run and monitor ETL jobs
Glue streaming ETL: Process streaming data as well
Lake formation
Data lake is a central place to store your data so you can do analytics on it.
Lake formation is managed service that make it easy to set up data lake in days. It is actually backed by a S3 underneath.
Automate collecting, cleansing, moving, cataloging data.
You can combine structured and unstructured data in the data lake. You can also migrate storage S3, RDS, NoSQL in AWS all to Lake formation.
You can also have row and column level control on Lake formation, finer grain of security.
Sources
Data source can be from S3, RDS, Aurora, sent these data into Lake formation.
Consumer
Athena, Redshift, EMR can be used to perform analytics.
Centralized permissions
There are multiple places to manage security so it is a mess. Lake formation solves this, you have a one place to manage your security on row and column level because all data are sent to Lake formation. So you don't have to manage the security everywhere.
Amazon MSK (Managed streaming for Apache Kafka)
Kafka is an alternative to Kinesis; both let you stream data.
There are brokers, which play a role similar to shards in a Kinesis Data Stream.
Producers produce data and send it to a broker, and the data is then replicated across the other brokers.
Consumers consume the data, process it, and send it on to other places: Kinesis Data Analytics for Apache Flink, Glue, Lambda, or your own application running on ECS, EKS, or EC2.
MSK manages Apache Kafka on AWS for you.
Data is stored on EBS volumes for as long as you want.
You also have the option to run Apache Kafka serverless.
Data stream vs MSK
Kinesis Data Streams has a 1 MB message size limit per record; MSK can be configured with higher limits.
In Kinesis Data Streams you can add and remove shards; in MSK you can only add, not remove.
Big data ingestion pipeline
IoT devices produce lots of data, which is sent in real time to Kinesis Data Streams.
You can then forward the data to a Firehose, which delivers it to an S3 bucket.
The S3 deposit event can go to an SQS queue; a Lambda receives the event, runs an Athena query, and writes the query results back to S3.
Then you can build QuickSight dashboards on the data, or load it into Redshift for deeper analytics — but remember Redshift isn't serverless!
Machine Learning
Rekognition
A machine learning service to find object, people, text, scenes in images and videos. Facial analysis and facial search to do user verification works as well.
You can use it to do labeling, content moderation, text detection, and many things.
Content moderation: Detect content that is inappropriate, unwanted, or offensive images or videos. You can use it in social media, broadcast media, ads, so you can disallow porn, racist contents.
After the image and video is submitted to Rekognition you can have final human review to comply with regulation just in case you need monitor inappropriate contents
Transcribe
Converts speech to text using deep learning of course. You can also redact Personally Identifiable Information with redaction as well, it will detect any names, address, age to redact it.
It supports automatic language identification for multi-lingual audio.
You would use this to transcribe customer service calls, automatic closed captioning and subtitling.
Polly
You turn text into speech using deep learning. Text-to-speech basically.
Lexicon: Customize pronunciation of words with Pronunciation Lexicons. Customize acronyms to make it say the entire word
SSML (Speech synthesis markup language): Use it to emphasize specific words or phrases, include breathing sounds, whispers, customize the way speech is said. More customization, lexicon is used to hint on how to pronounce the word.
Translate
Natural and accurate language translation. It let you localize content like website and application. Basically Google Translate.
Lex + Connect
Amazon Lex: Same technology that powers Alexa. It recognize speech to text and it understands natural language processing. It help basically help you build a chatbot for call center.
Amazon Connect: Virtual call center that can receive calls, create contact flows and is all in the cloud. No upfront payments, 80% cheaper than traditional call center.
Connect will stream the user's data to Amazon Lex to recognize the intention of the user, then it will invoke the correct lambda to carry out the correct action
Comprehend
Used for Natural language processing. Fully managed and serverless. Used to find relation in the text.
Find key phrases, what type of language it is, how positive or negative the text is. It will try to understand your text.
NLP is used for customer interaction analysis. Group articles that are related together instead of grouping it yourself.
Comprehend Medical
Detect useful information in unstructured clinical text. So basically comprehend but specialize in detecting health related information.
Test results, discharge summaries, physician's notes.
SageMaker
Fully managed service for building machine learning model.
All the services we have seen so far uses some ML model, SageMaker can help you make machine learning model to help you do something else.
If you want to make your own model then you need the server to do the training, then you would pass in the data and do some confidence threshold to make some prediction from historical data. Then you have to train and tune the model to better fit the data and output.
SageMaker will help you labeling, building, train and tune all for you. At the end you just need to use the model lol.
Forecast
Let you do forecast (prediction) of your sales. It is 50% more accurate than looking at the data itself.
You can use it to do your product demand planning, financial planning, resource planning.
You feed it the historical time data, then use Forecast to do your prediction.
Kendra
Managed document search service using Machine Learning. Let you extract what you asked from a document.
Document can be pdf, text, HTML, powerpoint, MS word. Then it uses machine learning to search through those document using natural language.
"Hey which floor is the bathroom" -> "1st floor"
Personalize
Machine learning service to make apps with real-time personalized recommendation.
An app that gives you recommendation based on what you did prior. Like recommending products after you bought some. It is what Amazon.com use!
Textract
Uses AI and machine learning to extract text, including handwriting, from any scanned document.
PDFs, images they all work!
Think scanning your check for a deposit.
AWS Monitoring CloudWatch, CloudTrail, Config
CloudWatch
CloudWatch Metrics
CloudWatch provides metrics (variable that you want to monitor) for every services in AWS. A metric belong to namespaces which is per services.
10 Dimension per metrics, associated instance id, environments, ...etc.
You can even make your own custom metrics to say monitor the RAM usage for example
Metric stream
You can stream metrics outside of CloudWatch to a destination via Kinesis Data Firehose, with near real-time delivery and low latency; from there you can send them anywhere. A filtering option lets you stream only a subset of the metrics.
C1's Splunk is how this is done basically.
CloudWatch logs groups
You create log groups for your logs. Inside each group you have log streams, which are the logs coming from an application instance.
You can set log expiration policies, and logs can be sent to different places: S3, Kinesis Data Streams, Firehose, Lambda, and OpenSearch are all valid destinations.
You send logs to CloudWatch Logs using the SDK, the CloudWatch Logs agent, or the unified agent.
ECS automatically collects logs from containers, Lambda does as well, and API Gateway too, so most services have logging pre-configured.
Metric filter
You can filter the logs to search for specifically the line that you want, then it can be created as a new metric!
Metric filter can also trigger CloudWatch alarms because it is a custom metric!
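A sketch of turning a log pattern into a custom metric (log group, namespace, and pattern are placeholders); a CloudWatch alarm can then be created on the resulting metric:

    import boto3

    logs = boto3.client("logs")

    # Count every log line containing "ERROR" as a data point of a new custom metric
    logs.put_metric_filter(
        logGroupName="/my-app/production",        # placeholder log group
        filterName="error-count",
        filterPattern="ERROR",
        metricTransformations=[{
            "metricName": "ErrorCount",
            "metricNamespace": "MyApp",
            "metricValue": "1",
        }],
    )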
Exporting logs to S3
Exporting logs directly to S3 is not real time or even near real time. If you need that, use a subscription filter instead and send the logs to Lambda, Kinesis Data Firehose, or Kinesis Data Streams.
CloudWatch Agent and Logs Agent
By default no logs go from an EC2 instance to CloudWatch. You need to install an agent on the EC2 instance to push the log files you want to CloudWatch.
The EC2 instance must have an IAM role that allows sending the logs. The agent can also run on-premises, which is really nice.
Logs agent (older): runs on virtual servers and can only send logs to CloudWatch Logs.
Unified agent (newer): also collects system-level metrics such as RAM and processes, and has centralized configuration.
CPU, disk, RAM, netstat, process, and swap space metrics can be collected by the unified agent.
CloudWatch Alarms
Used to trigger action from any metrics even filter metrics.
- Ok: Not triggered
- Insufficient data: Not enough data
- Alarm: Triggered
Alarm have three main targets that you can do the action to.
- EC2 instances: Stop it, terminate it, reboot, or recover
- ASG: Trigger auto scaling action
- SNS: send a notification to an SNS topic, then do whatever you want with the subscribers of that topic
Composite alarm
A regular CloudWatch alarm is on a single metric. A composite alarm monitors the states of multiple other alarms, combining them with AND and OR conditions.
EventBridge
Formerly known as CloudWatch event
You can schedule CRON job, scripts to run periodically.
You can also use it to react to services doing things. For example, react to an IAM root user sign-in event and send a message to an SNS topic saying the root user has signed in.
EventBridge rules
Services emit events to EventBridge when they do something. You write rules to react to those events.
You then set up "destinations" (targets) that react to those events. Events are JSON documents, which the target receives.
There is a default event bus created for each account; you can send events to it or to your own custom event bus.
You can choose to send EVERY event that occurs in AWS to a bus, but this is very expensive; you should cherry-pick only the events you care about.
A partner event bus can receive events from third-party software such as Datadog and Zendesk, so EventBridge can react to events from outside AWS as well!
You can also send your own application's events to EventBridge!
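A sketch of a rule for the root sign-in example above, sending matching events to an SNS topic; the event pattern follows the commonly documented console sign-in shape and the ARN is a placeholder, so verify both in your account:

    import boto3, json

    events = boto3.client("events")

    events.put_rule(
        Name="root-user-signin",
        EventPattern=json.dumps({
            "source": ["aws.signin"],
            "detail-type": ["AWS Console Sign In via CloudTrail"],
            "detail": {"userIdentity": {"type": ["Root"]}},
        }),
    )
    events.put_targets(
        Rule="root-user-signin",
        Targets=[{"Id": "notify",
                  "Arn": "arn:aws:sns:us-east-1:123456789012:security-alerts"}],  # placeholder topic
    )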
Schema registry
EventBridge analyze the events in your bus and infer the schema.
The schema registry let you write code so that it will know in advance how data is structured in the event bus that is sent to the destination.
Basically the JSON file that defines how the event that is going to sent to the destination looks like.
Resource-based policy
Set permission for a specific event bus. Allow / deny event from another AWS account or AWS region.
CloudWatch insights
CloudWatch container insights
You can collect metrics and logs from containers. From ECS, EKS, Kubernetes.
For EKS the metrics and logs are collected using a containerized version of the CloudWatch Agent to find those containers.
CloudWatch lambda insights
Monitoring and troubleshooting solutions for serverless application on lambda.
It collects CPU time, memory, disk, network, cold starts for lambda.
CloudWatch contributor insights
See contributor data from time series. Find top talkers and understand who or what is impacting system performance.
Finding bad host who is doing malicious thing.
CloudWatch application insights
Give automated dashboards that show potential problems with monitored applications. The apps running on EC2 instances can be monitored but only certain technologies.
Those apps can link to other AWS services, and application insights can show you what issues those services connected can have. Helps troubleshooting.
CloudTrail
Audit logs for your AWS account. It provides you a history of events / API calls made within your AWS.
Console, SDK, CLI, and AWS services. You can store the logs to S3 or CloudWatch.
You can monitor all region or single region.
If resources are deleted then look into CloudTrail! To check who did it!
Events
Management events: operations performed on resources in your AWS account. Two kinds: read events (which don't modify resources) and write events (which may modify resources). Reading IAM roles or adding an IAM role are both management events.
Data events: not logged by default. These are object-level events, e.g. S3 GetObject, DeleteObject, PutObject (again separable into read and write events). Lambda function invocations (the Invoke API) are also data events.
CloudTrail Insights events: detect unusual activity in your account.
Insights looks at historical data to learn what normal activity looks like in your account, then flags unusual usage patterns. You pay extra for these events.
Event retentions
By default stored for 90 days in CloudTrail. To store it longer put it into S3, then use Athena for analytics.
Intercept API calls
Any API call will be logged in CloudTrail, then the event will be stored into EventBridge, then you can set up rules to alert to SNS topic for a specific API call usage.
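A minimal boto3 sketch of such a rule, assuming CloudTrail is already enabled; the rule name, the chosen API call (DeleteBucket), and the SNS topic ARN are placeholders.

```python
import json
import boto3

events = boto3.client("events")

# Match the S3 DeleteBucket call as recorded by CloudTrail...
events.put_rule(
    Name="alert-on-delete-bucket",
    EventPattern=json.dumps({
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["DeleteBucket"],
        },
    }),
)

# ...and forward matching events to an SNS topic.
events.put_targets(
    Rule="alert-on-delete-bucket",
    Targets=[{"Id": "sns", "Arn": "arn:aws:sns:us-east-1:123456789012:security-alerts"}],
)
```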
AWS Config
This helps monitor compliance of your AWS resources. It gives you a dashboard of the resources you are monitoring.
- I want this bucket not public -> Compliant or not compliant
- Every EBS disk is type gp2
- Each EC2 instance I deploy must be t2.micro
You can define rules or use the ones provided by AWS (see the sketch below).
It doesn't do the remediation for you! It will just tell you that something isn't compliant. You can set up an SSM Automation document to do remediation. It can have retries in case remediation fails.
To receive notifications on non-compliant resources you can set up EventBridge rules, or you can filter the notifications in SNS.
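As an illustration, a minimal boto3 sketch enabling one AWS-managed rule; the rule name here is arbitrary, and the managed rule identifier shown is the one for flagging publicly readable S3 buckets.

```python
import boto3

config = boto3.client("config")

# Use an AWS-managed rule: flag S3 buckets that allow public read access.
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "s3-no-public-read",   # arbitrary name
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED",
        },
    }
)
```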
Summary
CloudWatch: Used for performance monitoring; you can also receive logs and do analysis on specific metrics.
CloudTrail: Used for API call auditing. Define trails for more specific resources; it is a global service.
Config: Records configuration changes for your resources, and also defines what is compliant and what is not.
IAM Advanced
Organization
Lets you manage multiple AWS accounts at the same time. There is one main account, the management account; the other accounts are member accounts.
Billing is all sent to the management account. You get pricing benefits from aggregated usage across all member accounts, which is nice.
Organization unit
You first have the Root organizational unit (the outermost OU), in which your management account lives. Then you can create other organizational units within the root one, kind of like sub-groups, and within those sub-groups is where you place member accounts. You can nest organizational units within each other.
Benefit
You can enable CloudTrail for all accounts and have a central logging S3 bucket in one account. Central CloudWatch Logs as well.
Service control policies (SCP): IAM-style policies applied to an OU or account to restrict users and roles. They apply to everything except the management account, so even if you deny a certain resource to the management account it will not apply. For an SCP you can specify an allow list or a block list. With an allow list you explicitly allow actions while denying everything else; with a block list you allow everything and then deny certain actions.
You apply SCPs to each OU, and accounts inherit the SCPs from their parent OU. An explicit deny wins regardless of any allow that targets that specific account.
IAM Condition
Used to conditionally apply statements in an IAM policy. It is a field under the Statement key, named "Condition". There are several condition keys available (an example policy follows this list).
- aws:SourceIp: Don't allow access unless the caller's IP is in the listed ranges
- aws:RequestedRegion: Don't allow access if the request is not for a certain region
- ec2:ResourceTag: Allow access only to EC2 instances with the corresponding tag
- aws:MultiFactorAuthPresent: Only allow these actions if the caller authenticated with MFA
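A sketch of a policy statement combining two of these condition keys; the region and the broad ec2:* action are just for illustration.

```python
# IAM policy document (shown as a Python dict): allow EC2 actions only
# when the request targets eu-west-1 and the caller used MFA.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ec2:*",
            "Resource": "*",
            "Condition": {
                "StringEquals": {"aws:RequestedRegion": "eu-west-1"},
                "Bool": {"aws:MultiFactorAuthPresent": "true"},
            },
        }
    ],
}
```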
IAM for S3
When writing IAM policies for S3 there are two kinds of permissions you can write.
Bucket-level permissions define which buckets you have access to.
Object-level permissions define which objects you have access to; you need to add /* (or a more specific path) after the bucket ARN to pick out the objects the policy applies to, because you are working at the object level.
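For example (the bucket name is a placeholder), note how the object-level statement uses the /* suffix:

```python
# IAM policy document (as a Python dict) mixing bucket-level and
# object-level S3 permissions.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # bucket-level action -> bucket ARN
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::my-bucket",
        },
        {   # object-level actions -> bucket ARN + /*
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-bucket/*",
        },
    ],
}
```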
Resource policies and aws:PrincipalOrgID
Restricts a resource to only principals that belong to the specified organization.
IAM roles vs Resource-based policies
When a user or AWS service assumes a role, it gives up the original permissions it had before.
If you use resource-based policies, policies attached to the resource rather than the user/role, then the caller doesn't have to give up its original permissions.
Lambda, SNS, SQS, CloudWatch Logs, and API Gateway use resource-based policies: the resource defines who can access it and what kind of API calls they can make.
Kinesis Data Streams and ECS tasks use IAM roles, so the caller has to assume a role, which defines what it can do with which resources.
IAM permission boundaries
An IAM permission boundary is another IAM policy document that defines the upper bound of the permissions an IAM user or role can get. You then attach the actual IAM permissions, which are only effective where they fall within the boundary. If a permission you give to the user is outside of the boundary, it grants nothing!
This can be used in combination with organization SCPs. The intersection of the SCP, the permission boundary, and the IAM policy defines what the user can do.
Permissions are evaluated as follows:
- If there is an explicit deny anywhere, the request is denied, even if there is an allow!
- Then the SCP is checked; if there isn't an explicit allow, the request is implicitly denied
- If there is a resource-based policy, it is evaluated
- Then the identity-based IAM policies that the role / user has
- Then the permission boundary is evaluated
- Session policies are also evaluated
So all these policies come into play to allow or deny access.
No explicit deny and no explicit allow means the request is denied.
IAM Identity Center
Single sign-on for all your AWS accounts: one login to access all your accounts and applications.
With one login into multiple AWS accounts, you choose which account and role to be logged in as. The identity source can be AWS's built-in identity store or a third party.
After you log in, a permission set defines what the user can or cannot do.
You can define fine-grained permissions and assignment of policies.
Directory services
Microsoft Active Directory: A database that contains objects: users, accounts, computers, printers, file shares, and security groups.
It provides centralized security: a domain controller verifies the users and machines that log in.
AWS Directory services
AWS Managed Microsoft AD: A managed Microsoft Active Directory in AWS so you can manage users. You can use an on-premise AD as another database for user logins by setting up a "trust" relationship. The integration is out of the box.
AD Connector: A proxy that redirects to the on-premise AD; supports MFA.
Simple AD: A standalone AD-compatible directory that you can use in the cloud.
However, if you have your own self-managed directory, you need to create a two-way trust relationship using AWS Managed Microsoft AD, or you can use AD Connector.
AWS Control tower
An easy way to set up a secure and compliant multi-account AWS environment. It uses AWS Organizations to create the accounts.
Guardrails provide governance for the accounts set up from Control Tower.
Preventive guardrails use SCPs to restrict access.
Detective guardrails use AWS Config to identify non-compliance.
Basically it abstracts away SCPs and AWS Config for you if you don't want to deal with them.
KMS, SSM Parameter Store, Shield, WAF
Encryption 101
Encryption in flight: Achieved via SSL/TLS. Your connection to the web server is encrypted, so no one can man-in-the-middle you and sniff the packets you're sending to find sensitive data.
SSL certificates are used to establish the secure connection. The certificate is used to authenticate the other party.
Server-side encryption at rest: The data at rest stored in the database is encrypted. The encryption and decryption keys must be stored and managed somewhere, a key server; the server talks to it to retrieve the keys to do encryption and decryption. The data is decrypted before being sent back to the client.
Client-side encryption: The data is encrypted by the client and never decrypted by the server. The data will be decrypted by the receiving client. Could leverage envelope encryption to do client-side encryption.
KMS
Key Management Service. A service that manages encryption keys for us. Uses IAM to authorize access to the keys stored in KMS.
Every key usage will be audited using CloudTrail! Which is very nice to track who used the key and when.
Don't store secrets in plaintext, this is very bad!
KMS keys
Symmetric (AES-256 keys): A single key used for encryption and decryption. All AWS services that integrate with KMS use symmetric keys.
Asymmetric (RSA & ECC key pairs): Public and private key. Encrypt/decrypt, sign/verify operations. Used for encryption outside of AWS by users who can't call the KMS API.
AWS owned keys are free, e.g. the default keys behind SSE-S3 and SSE-SQS.
AWS managed keys are also free, but can only be used within the service they are assigned to.
Customer managed keys: $1 / month. You also pay for the API calls to KMS.
Keys will be automatically rotated every 1 year.
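A minimal boto3 sketch of using a customer managed symmetric key, assuming a key exists under the made-up alias alias/my-app-key; both calls show up in CloudTrail.

```python
import boto3

kms = boto3.client("kms")

# Encrypt a small payload with the key referenced by its alias.
encrypted = kms.encrypt(
    KeyId="alias/my-app-key",
    Plaintext=b"super secret value",
)

# Decrypt: for symmetric keys the key ID is embedded in the ciphertext.
decrypted = kms.decrypt(CiphertextBlob=encrypted["CiphertextBlob"])
print(decrypted["Plaintext"])   # b'super secret value'
```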
Key scopes
Keys are scoped per region, so one region cannot use a KMS key from another region. To move an encrypted volume across regions you copy a snapshot and restore it with encryption enabled under a KMS key in the target region; the same key cannot be used!!!
Key policies
Default KMS key policy (if you don't define one): it allows everyone in this account to have access to this key.
Custom KMS key policy: You define the users and roles that have access to the KMS key. You can share a KMS key via cross-account access in the key policy, and then you can copy encrypted snapshots across different accounts, which is cool.
Multi-region keys
The same key is replicated across multiple regions, so you can encrypt in one region and decrypt it in another region.
This is possible because the replicas share the same key ID, key material, and automatic rotation.
Multi-region keys are not global! Each replica is managed independently, so they are not recommended in general. Use them only for use cases like global client-side encryption.
DynamoDB global tables + multi-region keys + client-side encryption
This lets you encrypt specific attributes of a DynamoDB table so they can only be decrypted by specific clients. Protection even against a database admin.
The same concept can also be applied to Aurora.
S3 Replication Encryption
For objects encrypted with SSE-KMS you need to enable replication of KMS-encrypted objects and specify which key to encrypt with. You are allowed to replicate the encrypted objects of an S3 bucket to another target bucket using a different key.
Objects encrypted with SSE-C are never replicated.
Objects encrypted with SSE-S3 are replicated by default.
Sharing encrypted AMI
Share the encrypted AMI with the other account, then also share the KMS key with that account and give it sufficient permissions to use the key.
Then they can launch EC2 instances from the encrypted AMI, and they can create their own copy of the AMI encrypted with their own KMS keys.
SSM Parameter Store
Part of Systems Manager. A storage place for you to put secrets, passwords, and configuration values.
It is serverless, scalable, durable, and has an easy SDK for retrieving and storing those values.
You can store plaintext configuration or encrypt passwords and secrets using KMS.
There is a hierarchy of paths in the parameter store, just like in an S3 bucket.
Standard and advanced parameter
Standard parameters don't have parameter policies and are free to use.
Advanced parameters aren't free and can have parameter policies.
Parameter policies: Allow you to assign a TTL to a parameter to force users to update or delete sensitive data like passwords. You can have multiple policies at a time.
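A minimal boto3 sketch of reading a SecureString parameter; the parameter name is made up.

```python
import boto3

ssm = boto3.client("ssm")

# WithDecryption asks Parameter Store to decrypt the SecureString with KMS.
resp = ssm.get_parameter(
    Name="/my-app/prod/db-password",
    WithDecryption=True,
)
print(resp["Parameter"]["Value"])
```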
AWS Secret Manager
Used for storing secrets. Different from SSM Parameter Store because it can force rotation of secrets every X days, and it can generate secrets on rotation using a Lambda function.
Secrets can be encrypted using KMS. Integrates really well with RDS (e.g. MySQL, PostgreSQL) and Aurora.
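A minimal boto3 sketch of reading a secret, assuming it was stored as a JSON string under the made-up name prod/my-app/rds:

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

# Fetch the secret and parse the JSON credentials it holds.
resp = secrets.get_secret_value(SecretId="prod/my-app/rds")
creds = json.loads(resp["SecretString"])
print(creds["username"])
```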
Multi-region secrets
The secret can be replicated across regions; the read replicas are kept in sync with the primary secret.
Easy failover and promotion if one region fails.
AWS ACM
Certificate Manager. Lets you provision, manage, and deploy TLS certificates for HTTPS. This gives you in-flight encryption for your website.
The ALB works with ACM to get the certificate for HTTPS. Supports both public and private TLS certificates.
Automatic TLS certificate renewal as well.
You cannot use ACM with EC2!
Process:
- List out the domain names to include in the certificate
- DNS validation or email validation. DNS validation is preferred, using a CNAME record
- The certificate will take a few hours to get verified
- Auto renewal
Importing public certificate
You can generate the certificate outside of ACM and then import it, but this won't give you automatic renewal.
You can set up AWS Config to make sure the certificate is still valid.
Or you can set up EventBridge to react to daily expiration events from ACM and invoke SNS notifications.
Integration with ALB
There will be a redirect to HTTPS when a user visits over HTTP, then the ALB leverages ACM to present the certificate in order to establish a secure connection.
Integration with API Gateway
If you use an edge-optimized API, the TLS certificate must be in the region where CloudFront lives, which is us-east-1.
If you use a regional API, the TLS certificate must be in the same region as the API Gateway.
Web Application Firewall
Protects your web applications from common web exploits at the HTTP level, e.g. XSS.
Deploy on ALB, API Gateway, CloudFront, Cognito User Pool.
Set up Web ACL (Access Control List) rules: filter based on IP sets to only let those IPs through, filter based on HTTP headers and body, geo-match to allow or block specific countries.
Rate-based rules to protect against DDoS.
Fixed IP while using WAF with load balancer
WAF doesn't support NLB, and ALB doesn't have fixed IPs. To solve this, use Global Accelerator for fixed IPs and attach WAF to the ALB; traffic goes from Global Accelerator to the ALB.
AWS Shield
Helps protect against DDoS attacks: distributed denial of service from all around the world.
AWS Shield Standard: Already deployed for every AWS customer and is free. Protection against SYN/UDP flood attacks and reflection attacks.
AWS Shield Advanced: Optional and costs about $3,000 a month. Protects against more sophisticated attacks and gives you access to a response team.
Firewall Manager
Manage firewall rules in all accounts of an AWS organization. The security policies you define are applied across the entire AWS organization.
Rules for WAF, Shield Advanced, security groups for EC2.
So it is a central place for you to manage security rules for your organization.
Newly deployed resources will automatically get the security rules you have set in Firewall Manager.
Differences
AWS WAF is good for granular protection of your resources. Use it together with Firewall Manager to automate the deployment of WAF configuration on new resources.
AWS Shield gives you DDoS protection; Advanced gives you a dedicated response team to help you mitigate DDoS attacks if you're prone to them.
Amazon GuardDuty
Intelligent threat discovery to protect your AWS account.
It looks at CloudTrail event logs (data events and management events), VPC Flow Logs, DNS logs, and EKS audit logs.
Basically machine learning protecting your AWS account from unusual activity. The findings can trigger EventBridge rules, which can then notify you accordingly.
Different from CloudTrail Insights events in that this is broader and looks at far more logs.
Very good protection against cryptocurrency mining attacks!
Amazon Inspector
Run automated security assessments on these resources:
- EC2 instances: Find known vulnerabilities
- Container images: See whether the container image has any vulnerabilities
- Lambda functions: Find software vulnerabilities in the function code and the dependencies it uses
Continuous scanning, only when it is needed: when the CVE database is updated, affected resources are scanned again. Also assesses network reachability for EC2.
Reports findings to AWS Security Hub and sends events to EventBridge.
Amazon Macie
Uses machine learning and pattern matching to find and protect sensitive data in AWS.
Identifies sensitive data like personally identifiable information and notifies you via EventBridge. Analyzes data in, say, S3 buckets.
VPC
Networking 101
IP Addresses
Every host or device on a network must be addressable, meaning it should have something it can be referenced by in order to reach it as a destination under a defined system of addresses. That thing is called an IP address, and it is how we address hosts in a network.
If one computer wants to communicate with another computer, their addresses are what they use to reach each other and send information.
An IP address must be unique on its own network.
IPv4 Addresses
IPv4 addresses are made up of two parts: the network portion, which identifies the network the address belongs to, and the remaining part, which identifies the host within that network.
IPv4 in the old days was divided into five classes, A through E:
- Class A: The first bit is 0, so network addresses range from 0.0.0.0 to 127.0.0.0. 24 bits are left for the host.
- Class B: The first two bits are 10, so network addresses range from 128.0.0.0 to 191.255.0.0. 16 bits are left for the host.
- Class C: The first three bits are 110, so network addresses range from 192.0.0.0 to 223.255.255.0. 8 bits are left for the host.
So if a company wanted an IPv4 block, it would first pick a class of address, which dictated how many hosts it could have in its network. But as you can see this is limited and the sizes aren't a good fit, since the difference in size between two classes is huge.
After you receive a network address from a class, you can further divide the network into smaller sections; this is called subnetting. By default a network has only one subnet, containing all the host addresses defined within it. To do subnetting you provide a subnet mask that masks out the subnet you are looking at.
The netmask is used to identify the network that the destination IP address falls under (think of the classes): it answers "which network do you belong to". Then you use the subnet mask, if you have subnets, to find out which subnet you belong to, and finally the remaining bits identify the actual host.
Example:
You're issued the IP address block 10.10.0.0/16, so the netmask is 255.255.0.0. If you are going to divide it into 4 different subnets, you need a /18 subnet mask, which gives you:
- 10.10.0.0/18
- 10.10.64.0/18
- 10.10.128.0/18
- 10.10.192.0/18
The remaining 14 bits are used to identify each host within each of the subnets.
2^14 * 4 = 2^16, so this still covers all the hosts the original /16 block had, just divided further for ease of management (see the sketch below).
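A small sketch using Python's standard ipaddress module to reproduce the split above:

```python
import ipaddress

network = ipaddress.ip_network("10.10.0.0/16")
print(network.netmask)                      # 255.255.0.0

# Splitting the /16 into /18 blocks yields the four subnets listed above.
for subnet in network.subnets(new_prefix=18):
    print(subnet, subnet.num_addresses)     # each /18 holds 2**14 = 16384 addresses
```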
CIDR
Stands for classless inter-domain routing.
CIDR lets you have a variable-length subnet mask, which is much more efficient compared to the classes we saw in the previous section.
CIDR notation consists of two parts: the network address, then a number (the netmask) indicating how many bits, from left to right, mask the network address.
Let's take an example:
11000000.10101000.01111011.10000100 -- Destination IP address (192.168.123.132)
11111111.11111111.11111111.00000000 -- Subnet mask (255.255.255.0)
192.168.123.132 / 24
The first 24 bits (the masked part) are the network address, while the last 8 bits give the host address. When the packet arrives it will arrive in the 192.168.123.0 subnet and then be processed at the destination address of 192.168.123.132.
To know for AWS: CIDR basically just defines the IP ranges, that's all.
An IPv4 address is made up of 4 segments or octets. /32 means no bits can change, so it is exactly one host address.
/24 means only the last octet can change for the host addresses, and so on.
For example: 192.168.0.0/16 means the IPs range from 192.168.0.0 to 192.168.255.255; the last two octets can change (see the sketch below).
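A small sketch checking which CIDR block an address falls into, reusing the 192.168.123.132 / 24 example:

```python
import ipaddress

destination = ipaddress.ip_address("192.168.123.132")
subnet = ipaddress.ip_network("192.168.123.0/24")

print(destination in subnet)    # True: the first 24 bits match the network
print(subnet.num_addresses)     # 256 addresses in a /24
```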
Public vs Private IP
Public IPs are allocated under IANA, which also establishes the standard ranges for private IP use.
Private IP ranges are as follows:
- 10.0.0.0 - 10.255.255.255 (10.0.0.0/8): for big networks
- 172.16.0.0 - 172.31.255.255 (172.16.0.0/12): the AWS default VPC range falls in here
- 192.168.0.0 - 192.168.255.255 (192.168.0.0/16): for home networks
Then rest of the IP addresses are public.
VPC
Virtual private cloud.
Max of 5 VPCs per region, but you can increase that limit. Max of 5 CIDR blocks per VPC, so you are allowed to add more CIDRs to your VPC; it isn't limited to just one CIDR block (the block defines the range of IPs you can have for your EC2 instances).
Only private IPv4 ranges are allowed for the CIDR because it is a private cloud.
Your VPC CIDR should not overlap with your other networks, in case you want to connect two VPCs together. So you can overlap, but you shouldn't.
Default VPC
All new AWS accounts have a default VPC. New EC2 instances are launched into the default VPC if no subnet is specified.
The default VPC has internet connectivity, and all EC2 instances inside it get public and private IPv4 addresses. You also get public and private IPv4 DNS names.
Subnets
You divide the CIDR block you are given into further subnets for better management. This is done by increasing the number of subnet mask bits.
Subnets are associated with an availability zone, so you put subnets into different availability zones to get high availability.
For each subnet you create, AWS reserves 5 IPs (the first 4 and the last 1 in the subnet), which means they cannot be used for assignment.
Ex: 10.0.0.0/24, then:
- 10.0.0.0: Is for Network address
- 10.0.0.1: Reserved by AWS for VPC router
- 10.0.0.2: Reserved by AWS for mapping Amazon provided DNS
- 10.0.0.3: Reserved for future use
- 10.0.0.255: For network broadcast, but isn't supported in VPC so it is reserved
So when you are creating your subnets, keep the 5 reserved IPs in mind. If you need 29 IP addresses for a subnet, you can't use a subnet size of /27: 2^5 = 32, and 32 - 5 = 27, which is less than 29.
Subnets can be public or private. For a public subnet choose a small CIDR block, since it only holds public front-facing resources like an ALB.
Internet gateway
Internet gateway allows resources like EC2 instances in a VPC to connect to the internet. It scales horizontally and is highly available and redundant.
One VPC can only be attached to one internet gateway and vice versa.
An internet gateway on its own doesn't provide internet access; you also have to configure the route tables. You edit the route table so that traffic from the EC2 instances is routed to the internet gateway, which finally gives them internet access.
Route table
The route table tells how network traffic from your subnet is directed: where it goes.
Each route table has subnet associations. Route tables work together with the internet gateway to actually provide internet access for the instances launched in the subnet.
If the destination is a local address, it is routed locally; any other destination IP is routed to the internet gateway. It is just route table configuration to enable internet access (see the sketch below).
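A minimal boto3 sketch of that route table entry; the route table and internet gateway IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Default route: anything that is not a local address goes to the
# internet gateway. A NAT gateway route would use NatGatewayId instead.
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",
    DestinationCidrBlock="0.0.0.0/0",
    GatewayId="igw-0123456789abcdef0",
)
```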
Bastion Hosts
Users want to access EC2 instances in a private subnet. How do we do it? We use a bastion host: an EC2 instance in a public subnet that has access to the private subnet. To reach a private EC2 instance, the user first connects to the bastion host and then connects from the bastion host to the EC2 instance in the private subnet.
The bastion host's security group must allow access on port 22 for SSH, but only from a restricted CIDR, for example your company's CIDR or your own IP address. This is for security: you don't want everyone on the internet able to SSH into your bastion host.
The security group of the private EC2 instance must also allow the security group of the bastion host, or the private IP of the bastion host, to SSH into it.
NAT instances
NAT = Network Address Translation. Outdated but still in the exam.
NAT instances are just EC2 instances that act as a NAT device. They are not highly available or resilient out of the box, which is why they are outdated.
A private EC2 instance doesn't have public internet access because there is no route to an internet gateway. So how do we give it internet access? NAT instances give EC2 instances in a private subnet a way to connect to the internet.
A NAT instance must be launched in a public subnet and must have an Elastic IP attached to it. You must disable the source / destination IP check, otherwise it won't work.
Then you configure the private subnet's route table to route internet-bound traffic to the NAT instance. The NAT instance is responsible for proxying the traffic, meaning the IP address of the private EC2 instance is hidden; packets are sent out with the IP address of the NAT instance. When the response comes back, the NAT instance remembers which EC2 instance requested the packet and forwards it back.
NAT gateway
Compared to a NAT instance it has higher bandwidth, higher availability, and requires no administration. Up to 45 Gbps and fully managed by AWS.
A NAT gateway is created in a specific availability zone and uses an Elastic IP.
It can't be used by EC2 instances in the same subnet; it is only used from other subnets. It must be launched in a public subnet.
No need to manage any security groups; they are not needed!
Then all you need to do is configure the route table to send traffic from the private subnet to the NAT gateway.
Availability
A NAT gateway is only resilient within a single AZ. You have to create NAT gateways in multiple AZs for fault tolerance.
Security groups and NACLs
Network traffic goes through the NACL before entering the subnet. It has inbound / outbound rules. NACLs are stateless: if a request is allowed in, the response (after passing the security group) is checked again against the outbound rules, because the NACL doesn't remember that it allowed the request in.
After the traffic reaches the subnet it goes through the security group, which also has inbound and outbound rules. Security groups are stateful: if a request is accepted in, the response from the EC2 instance is automatically allowed out.
If instead the request is outgoing from the EC2 instance, the security group's outbound rules are evaluated first, then it reaches the NACL and its outbound rules are evaluated. When the response comes back in, the NACL evaluates its inbound rules again, but the security group, being stateful, doesn't evaluate it and automatically accepts it in.
NACL
One NACL per subnet; new subnets are assigned the default NACL. By default the default NACL allows everything in and out of the subnets it is associated with.
You define numbered rules in the NACL: a lower number means higher precedence, and the first rule that matches drives the decision.
The last rule is * and denies the request if no other rule matches.
It also supports deny rules. It is stateless, so all traffic needs to be evaluated; security group return traffic is automatically accepted.
Ephemeral ports with NACL
Clients receive responses on ephemeral ports, so you need to allow the appropriate ephemeral port range in the NACL (see the sketch below).
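A minimal boto3 sketch of an outbound NACL rule covering a common ephemeral range; the NACL ID, rule number, and port range are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Outbound (egress) rule letting return TCP traffic leave the subnet on
# ephemeral ports; remember NACLs are stateless so this is needed.
ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",
    RuleNumber=100,
    Protocol="6",                            # TCP
    RuleAction="allow",
    Egress=True,
    CidrBlock="0.0.0.0/0",
    PortRange={"From": 1024, "To": 65535},
)
```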
VPC peering
Privately connect two VPCs using AWS's network. This makes them behave as if they were in the same network, which is why their CIDRs must not overlap, otherwise it will not work!
Peering connections are not transitive. If A can connect to B, and B can connect to C, A still cannot connect to C; it needs an explicit peering connection.
You must update the route tables in each VPC's subnets to ensure EC2 instances can communicate with each other after you create the VPC peering connection. Basically, in both route tables you send traffic destined for the other VPC through the peering connection.
Peering connections can be cross-account and cross-region.
VPC endpoints
If you want to access AWS services using AWS's private network instead of going through the public internet, you can use VPC endpoints. An endpoint is associated with a VPC.
There are two kinds of endpoints:
- Interface endpoint (powered by PrivateLink): Provisions an ENI + security group. The ENI gives you a private IP which is the entry point to access the service. This kind lets you communicate with lots of AWS services.
- Gateway endpoint: Used as a target in a route table, so there is no IP address to worry about. This only works for S3 and DynamoDB, but it is free and scales automatically.
In the exam the gateway endpoint is usually preferred because it is free and scales automatically, even though you could use an interface endpoint.
An interface endpoint is only preferable if you want access from your on-premise data center to the service over the private network. A gateway endpoint for S3 is sketched below.
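A minimal boto3 sketch of a gateway endpoint for S3; the VPC ID, route table ID, and region in the service name are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Gateway endpoint: added as a target in the given route table so S3
# traffic stays on the AWS network.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    VpcEndpointType="Gateway",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```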
VPC Flow Logs
Capture information about IP traffic going to your interfaces.
Helps monitor and troubleshoot connectivity issues. The logs can be sent to S3 or CloudWatch Logs (see the sketch below).
You can query them with Athena when they are in S3, or with CloudWatch Logs Insights.
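A minimal boto3 sketch enabling flow logs for a VPC with delivery to S3; the VPC ID and bucket ARN are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Capture accepted and rejected traffic for the whole VPC and store it in S3.
ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-0123456789abcdef0"],
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-logs-bucket",
)
```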
AWS Site-to-Site VPN
To enable a connection from on-premise to an AWS VPC you need two things. First, a Virtual Private Gateway (VGW): the gateway created and attached to the VPC on the AWS side.
You also need a Customer Gateway (CGW), which is set up in the on-premise data center you want to connect. It is either a software appliance or a physical device.
The connection is VGW <-> CGW. If your CGW is behind a private IP you need to use NAT: CGW <-> NAT <-> VGW.
You need to enable route propagation for the Virtual Private Gateway to make the site-to-site connection work.
CloudHub
Lets you set up secure communication between multiple on-premise sites if you have multiple VPN connections.
It is low cost and uses a hub-and-spoke model.
Direct Connect
You get a dedicated private connection from a remote network to your VPC.
It is used to get a more consistent network experience and to increase bandwidth and throughput. You can access both public and private resources over the same connection.
However, setting up Direct Connect requires you to go to a physical Direct Connect location and connect to it. You get private connectivity; none of the traffic is routed via the public internet.
Connect to one or more VPC
If you want to connect the on-premise data center to one or more VPCs using Direct Connect, you need a Direct Connect Gateway, which lets you connect to multiple VPCs in multiple regions as well.
Connection type
Dedicated connections: 1 Gbps, 10 Gbps, and 100 Gbps.
Hosted connections: 50 Mbps, 500 Mbps, up to 10 Gbps; capacity can be added or removed.
However, it can take more than one month to establish a new connection.
Encryption
Data in transit is not encrypted, but that matters less because the connection is private. If you want the connection to be encrypted you can run a VPN on top of it.
Good for an extra level of security.
Resiliency
For resiliency, you set up Direct Connect connections to multiple Direct Connect locations from multiple on-premise data centers.
For maximum resiliency for critical workloads, each data center connects to two Direct Connect locations that are independent of each other.
Site to Site VPN and Direct Connect
Use Direct Connect as the primary connection, then use Site-to-Site VPN as the backup in case DX fails.
Site-to-Site VPN traffic goes through the public internet; however, since it is a VPN, the traffic is encrypted. Direct Connect on the other hand doesn't encrypt the data (though you can set that up); this is fine because the connection is private and doesn't go through the public internet.
Direct Connect requires more effort and physical work to set up, while Site-to-Site VPN is way easier to set up.
Transit gateway
This allows you to have transitive peering between thousands of VPCs and on-premises connections. VPC peering isn't transitive, but Transit Gateway gives you that.
It is a regional resource but can work cross-region. Limiting which VPC can talk to which is done through the Transit Gateway route tables.
It is the only AWS service that supports IP multicast.
Site-to-site VPN ECMP
Equal-cost multi-path routing: allows you to forward a packet over multiple best paths. This is used to create multiple Site-to-Site VPN connections to increase the bandwidth of your connection to AWS.
Share direct connect between multiple account
Transit Gateway allows you to share a Direct Connect connection between multiple accounts.
VPC Traffic mirroring
Allow you to capture and inspect network traffic in your VPC.
You define the source ENI and the target (an ENI or an NLB). Traffic mirroring copies the traffic sent to the source ENI and also sends it to the target ENI or NLB, so you can analyze the traffic separately without disturbing the workload.
Source and target can be in the same VPC or in different VPCs connected via VPC peering.
Content inspection, threat monitoring, and troubleshooting.
IPv6 for VPC
IPv4 will soon be exhausted, so IPv6 was created so that addresses don't run out any time soon. In AWS, all IPv6 addresses are public.
Format is x:x:x:x:x:x:x:x where each x is a group of 4 hexadecimal digits (8 groups of 16 bits).
You can enable IPv6 support for your VPC.
Egress-only internet gateway
Used for IPv6 only. Similar to a NAT gateway but for IPv6.
Allows instances in your VPC to make outbound connections over IPv6 while preventing the internet from initiating an IPv6 connection to your instances. You have to update the route table to make it work.
Networking cost in AWS
Incoming traffic to EC2 instances is free. Communication between EC2 instances within the same availability zone using their private IPs is also free.
Cross-AZ traffic using public IPs / Elastic IPs (i.e. the traffic leaves AWS and has to come back in) costs $0.02 per GB.
However, if you use private IPs cross-AZ it is $0.01 per GB. This is the preferred method: use private IPs for the cheaper and faster option.
Cross-region traffic is $0.02 per GB.
- Use private IPs instead of public IPs for good savings and better network performance
- Use the same AZ for maximum savings; however, you lose high availability.
Minimizing egress traffic network cost
Egress traffic = outbound, from AWS to outside
Ingress traffic = inbound traffic, from outside to AWS. This is typically free.
Try to keep as much traffic as possible within AWS to minimize the cost.
S3 data transfer pricing
Ingress traffic is free. However, egress traffic is not free: about $0.09 per GB.
S3 transfer acceleration is additional cost per GB to get faster data transfer.
S3 to CloudFront is free.
Cross region replication is $0.02 per GB.
NAT Gateway vs Gateway VPC endpoint
Using the public route via a NAT gateway to access, say, an S3 bucket costs $0.045 per NAT gateway hour + $0.045 per GB of data processed. So it isn't cheap.
However, if you use a gateway VPC endpoint, the endpoint itself is free; you only pay about $0.01 per GB of data transfer in/out within the same region. Significantly lower cost.
AWS Network Firewall
Used to protect your entire Amazon VPC, from layer 3 to layer 7.
It inspects any traffic, from the internet into the VPC and out of it.
Internally it uses the AWS Gateway Load Balancer, but instead of setting up our own 3rd-party appliances, AWS manages it for us.
You have fine-grained control and can support thousands of rules: allow, drop, or alert on traffic that matches the rules.
Logs can also be analyzed.
Disaster Recovery and Migration
Disaster recovery
Disaster recovery is about preparing for disasters that can happen to the data center.
RPO: Recovery Point Objective. How often you run backups; how far back you can go to just before the data loss. "How much data did you lose in the disaster?" It is measured in time; the lower the better, meaning you lose little to no data.
RTO: Recovery Time Objective. How long it takes you to recover from the disaster; how much downtime after the disaster. Also measured in time, and again the lower the better.
Disaster recovery strategies
All these strategies will have different RPO and RTO and different cost.
Backup and restore
High RPO!
You make regular backups to the cloud; when disaster strikes you just restore from the snapshots, e.g. for RDS or EBS.
This is cheap, the only cost is storing the backups, but it has high RPO and high RTO, depending on how often you make the backups.
Pilot light
A small version of the app is always running in the cloud: the useful, critical part of the application.
Faster than backup and restore since the critical systems are already up; you just have to spin up the other non-core systems on the fly.
Lower RPO and RTO, but a little bit more expensive.
Warm standby
A full replica of the system, but scaled down. When disaster strikes you scale it up.
Much faster RTO and lower RPO, but you are spending more now.
Multi Site / Hot Site Approach
You have a full replica, not scaled down, running concurrently in the cloud, ready to go.
Very fast RTO and low RPO; this is the most expensive.
All AWS Multi Region
Deploy everything to the cloud in multiple regions; when one region is down just fail over to the other.
Recovery tips
Backups: EBS snapshots, RDS automated backups. Push those snapshots and backups to S3.
High availability: use Route 53 to automatically migrate DNS from region to region.
RDS multi-AZ
DMS
Database Migration Service. You can quickly and securely migrate on-premise databases to AWS. During migration the source database remains available.
Supports lots of engines: Oracle to Oracle, Postgres to Postgres, even Microsoft SQL Server to Aurora.
Supports continuous data replication. You have to create an EC2 instance to do the migration (and the continuous replication if you need it). The EC2 instance basically pulls data from the source database and migrates it to the target.
Source and targets
Sources: On-premise and EC2 instance databases, even S3.
Targets: On-premise and EC2 instance databases, RDS, Redshift, DynamoDB, S3, Kinesis Data Streams.
You can transform from any source to any target; they all work.
Schema conversion tools (SCT)
Converts a database schema from one engine to another. If the engines aren't the same you will need this tool; if the database engines are the same you don't need it.
Continuous replication
To set up continuous replication and not just a one-time migration: you still need SCT if the engines aren't the same, then the EC2 instance does a full migration plus change data capture (CDC) to continuously replicate new changes to the database.
RDS and Aurora Migration
RDS MySQL to Aurora
- Take a DB snapshot from RDS, then restore it as Aurora. But there is downtime while you take the snapshot and restore it
- Create an Aurora read replica from your RDS instance, let replication catch up, then promote the Aurora replica to its own DB cluster. No downtime but there is a cost
External MySQL to Aurora
- Use Percona XtraBackup to back up your on-premise MySQL to S3, then restore Aurora from S3
- Or use mysqldump and pipe it into Aurora, but this is slower since it doesn't use S3
If both databases are up and running
Use DMS to do continuous replication.
RDS PostgreSQL to Aurora
- Take a DB snapshot and restore it as Aurora PostgreSQL, since Aurora supports both Postgres and MySQL
- Or create an Aurora read replica, let replication catch up, then promote it to its own DB cluster, same as with MySQL
External Postgres to Aurora
- Create a backup and put it in S3, then import it using an Aurora extension
Other on-premise services
VM Import / Export: Move existing VMs to AWS or export them back to on-premise.
AWS Application Discovery Service: Discover your on-premise server layout and plan migrations.
AWS Database Migration Service: Helps migrate on-premise databases to the cloud.
AWS Server Migration Service: Helps migrate on-premise servers to the cloud incrementally.
AWS Application Migration Service: Migrate existing applications, servers, VMs, and data to the cloud. A lift-and-shift solution that simplifies migration to the cloud: it basically replicates your existing data center onto the cloud.
AWS Backup
Centrally manage and automate backups across AWS services. No need for custom scripts and manual processes.
Supports lots of services. Supports cross-region and cross-account backups for disaster recovery.
On-demand and scheduled backups; you define how frequently the backups are taken.
Data is backed up to S3.
You can apply write once read many (Vault Lock), so the backups cannot be deleted.
Transferring large amount of data to AWS
Example: 200 TB of data and a 100 Mbps internet connection.
- Over the internet / Site-to-Site VPN: Immediate setup but a very long transfer time (roughly 185 days, see the calculation below), so this isn't going to work
- Direct Connect (1 Gbps): About a month to set up DX, then still a lot of time to transfer.
- Snowball: Order Snowball devices, then about one week to load the data and send them back.
- For on-going replication afterwards: Site-to-Site VPN, DataSync, DMS, or DX.
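Back-of-the-envelope check of the transfer time over the 100 Mbps link:

```python
terabytes = 200
link_mbps = 100

total_bits = terabytes * 1e12 * 8            # 1.6e15 bits
seconds = total_bits / (link_mbps * 1e6)     # 1.6e7 seconds
print(seconds / 86400)                       # roughly 185 days
```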
VMware Cloud on AWS
Run VMware virtual machines on the cloud if you use VMware on-premise.
The sub-services under VMware are available, and you have access to AWS cloud services like S3, EC2, and Direct Connect.
Even More Architecture Discussion
Event processing in AWS
SQS + Lambda
The Lambda service polls from SQS; however, there can be a problem if a message cannot be processed and goes into an infinite retry loop. In that case a dead letter queue can be set up to receive problematic messages after a certain number of retries.
With SQS FIFO, if a message isn't processed it blocks the rest of the queue; again a dead letter queue can be used to receive messages whose retries didn't work.
SNS + Lambda
The subscriber will be a Lambda function and messages will be sent to it. If the processing fails at the Lambda level you can set up an SQS dead letter queue for the Lambda to send the messages it is unable to process.
Fan-out pattern: Deliver to multiple SQS queues
Use SNS and make the SQS queues subscribers to make it reliable.
S3 events
Use EventBridge to expand the capability to react to S3 events.
EventBridge can also intercept API calls, to react when a specific API call occurs.
Caching strategies
Caching at the edge: CloudFront; the cache is near the user. However, the cache might be out of date compared to the backend, so you need to set a TTL.
Caching at API Gateway: Caching at API Gateway is doable. It is cached in the region the API is deployed in.
Database cache: Redis, DAX, ElastiCache; they keep frequently read data so your database isn't overwhelmed.
Lots of ways of doing caching in AWS.
Blocking IP address
The first line of defense is the NACL for the VPC.
Then you have security groups for the EC2 instances, which only have allow rules, so you just allow a specific IP range.
Install WAF to do IP filtering.
If the ALB is fronted by CloudFront, then because CloudFront is outside the VPC, WAF can do the IP filtering on CloudFront, and you can use geo restriction in CloudFront to restrict traffic coming from certain countries.
High performance computing on AWS
Cloud is good for HPC because you can spin up EC2 instances on demand and just pay for the capacity you have used.
You can use HPC to do modeling, machine learning, a lot of things!
Services that help to do HPC:
- AWS Direct Connect: Move data to the cloud via a private network.
- Snowball: Move lots of data to AWS offline.
- DataSync: Transfer data from on-premise to the cloud.
- EC2 instances: CPU-optimized instances, cost savings via spot instances and spot fleets.
- Cluster placement groups: EC2 instances are placed on the same rack so they can communicate very fast.
- Elastic Network Adapter (ENA) to speed up the network for your EC2 instances.
- Elastic Fabric Adapter (EFA): an improved ENA for HPC, but it only works on Linux. Good for inter-node communication.
- FSx for Lustre: Dedicated for HPC, can be backed by S3. Linux only.
- AWS ParallelCluster: Deploy an HPC cluster on AWS. Automates the creation of EC2 instances and the cluster type. EFA can be enabled on the cluster to improve network performance.
Highly available EC2 instances
By default an EC2 instance is launched in one availability zone. How can we make it more highly available and resistant to failure?
We can have a standby EC2 instance in another availability zone, with a CloudWatch alarm/event monitoring the health of the EC2 instance and doing the failover if the primary EC2 instance fails.
Some Random Services
CloudFormation
Declarative way for outlining your AWS infrastructure using a template.
"I want a security, I want two EC2 instance using security group"
Then CloudFormation will create your instances, infrastructure as code, you will never create the resources manually. Can recreate the resources on different regions quickly and automatically, you just need to give the template.
Changes to infrastructure are reviewed through code. You can schedule infrastructure delete and creation for cost saving.
You don't need to figure out the order of creating the resources since it is smart enough to do it. You can use existing template.
Deleting the resources allocated by the template is very easy, just one button click.
AWS SES
Simple Email Service. A fully managed service that lets you send emails globally at scale. Can send and receive emails.
Gives statistics on whether or not emails are opened. Can send bulk email as well.
Amazon Pinpoint
Two-way marketing communication service: email, SMS, and push notifications.
Lets you create marketing and transactional SMS messages. Replies can be received and managed.
Used as an SMS service.
With SNS and SES your application needs to manage the messaging itself.
With Pinpoint you create the message in the service and it sends it.
SSM Session Manager
Lets you start a secure shell on EC2 and on-premise servers without SSH access, bastion hosts, or SSH keys. You do not need to open port 22 on your EC2 instances.
The SSM Agent on your EC2 instance has the correct IAM permissions, so it is able to execute commands on the instance.
SSM Run Command
Lets you execute a script or a command on instances without needing SSH (see the sketch below).
Output can be sent to S3 or CloudWatch Logs. The status of the command can be sent to SNS.
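A minimal boto3 sketch of running a command on one instance through the AWS-RunShellScript document; the instance ID and bucket name are placeholders.

```python
import boto3

ssm = boto3.client("ssm")

# Run a shell command on a managed instance; no SSH or port 22 involved.
ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["yum -y update"]},
    OutputS3BucketName="my-run-command-output",   # optional: keep the output in S3
)
```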
SSM Patch Manager
Automatically applies OS updates, security updates, and application patches.
Patch on demand or on a schedule.
System Manager Automation
Simplifies common maintenance and deployment tasks of EC2 instances: restart instances, create an AMI, take an EBS snapshot.
Executes pre-defined actions on your EC2 instances.
Cost Explorer
Visualize and understand your AWS cost and usage over time. Get dashboards and reports for your usage and cost.
You can choose a savings plan to lower your cost, and find EC2 instances that are under-utilized.
Can also forecast usage based on the past.
Elastic Transcoder
Convert media files stored in S3 into other formats.
An S3 bucket stores an MP4, you pipe it into the transcoder, it converts it into AVI, MP3, etc. and stores it in an output bucket.
It is scalable and cost effective; no need for an EC2 instance to do the transcoding.
AWS Batch
Do batch processing at any scale.
A batch job "has a start and an end".
AWS Batch dynamically launches EC2 instances or spot instances to deal with your batch jobs. You just submit your job as a Docker image and it runs on ECS (see the sketch below).
It helps with cost optimization and lets you focus less on infrastructure.
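A minimal boto3 sketch of submitting a job, assuming a job queue and a job definition (both names made up) already exist:

```python
import boto3

batch = boto3.client("batch")

# Submit a containerized job to an existing queue / job definition.
batch.submit_job(
    jobName="nightly-report",
    jobQueue="default-queue",
    jobDefinition="report-generator:1",
)
```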
AWS AppFlow
Securely transfer data between Software-as-a-Service applications and AWS.
Sources can be Salesforce, Slack, and ServiceNow.
Destinations can be Amazon S3, Redshift, or Snowflake.
You just use AppFlow to do the data transfer so you don't have to write the transfer code yourself.