Public Cloud – Andrew Baker's Blog Posts

AWS: Install and configure the AWS CLI on a Macbook

You can absolutely get the following from the AWS help pages; but this is the lazy way to get everything you need for a simple single account setup.

Run the two commands below to drop the package on your Mac.

$ curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
$ sudo installer -pkg AWSCLIV2.pkg -target /

Then check the versions you have installed:

$ which aws
$ aws --version

Next you need to setup your environment. Note: This is NOT the recommended way (as it uses long term credentials).

The following example configures a default profile using sample values. Replace them with your own values as described in the following sections.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: secretaccesskey
Default region name [None]: af-south-1
Default output format [None]: json

You can also use named profiles. The following example configures a profile named userprod using sample values. Replace them with your own values as described in the following sections.

$ aws configure --profile userprod
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: secretaccesskey
Default region name [None]: af-south-1
Default output format [None]: json

Get your access keys

Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
In the navigation pane of the IAM console, select Users and then select the User name of the user that you created previously.
On the user’s page, select the Security credentials page. Then, under Access keys, select Create access key.
For Create access key Step 1, choose Command Line Interface (CLI).
For Create access key Step 2, enter an optional tag and select Next.
For Create access key Step 3, select Download .csv file to save a .csv file with your IAM user’s access key and secret access key. You need this information for later.
Select Done.

AWS: Automatically Stop and Start your EC2 Services

Below is a quick (am busy) outline on how to automatically stop and start your EC2 instances.

Step 1: Tag your resources

In order to decide which instances stop and start you first need to add an auto-start-stop: Yes tag to all the instances you want to be affected by the start / stop functions. Note: You can use “Resource Groups and Tag Editor” to bulk apply these tags to the resources you want to be affected by the lambda functions you are going to create. See below (click the orange button called “Manage tags of Selected Resources”).

Step 2: Create a new role for our lambda functions

First we need to create the IAM role to run the Lambda functions. Go to IAM and click the “Create Role” button. Then select “AWS Service” from the “Trusted entity options”, and select Lambda from the “Use Cases” options. Then click “Next”, followed by “Create Policy”. To specify the permission, simply Click the JSON button on the right of the screen and enter the below policy (swapping the region and account id for your region and account id):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:StartInstances",
                "ec2:DescribeTags",
                "logs:*",
                "ec2:DescribeInstanceTypes",
                "ec2:StopInstances",
                "ec2:DescribeInstanceStatus"
            ],
            "Resource": "arn:aws:ec2:<region>:<accountID>:instance/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/auto-start-stop": "Yes"
                }
            }
        }
    ]
}

Hit next and under “Review and create”, save the above policy as ec2-lambda-start-stop by clicking the “Create Policy” button. Next, search for this newly created policy and select it as per below and hit “Next”.

You will now see the “Name, review, and create” screen. Here you simply need to hit “Create Role” after you enter the role name as ec2-lambda-start-stop-role.

Note the policy is restricted to only have access to EC2 instances that contains auto-start-stop: Yes tags (least privileges).

If you want to review your role, this is how it should look. You can see I have filled in my region and account number in the policy:

Step 3: Create Lambda Functions To Start/Stop EC2 Instances

In this section we will create two lambda functions, one to start the instances and the other to stop the instances.

Step 3a: Add the Stop EC2 instance function

Goto Lambda console and click on create function
Create a lambda function with a function name of stop-ec2-instance-lambda, python3.11 runtime, and ec2-lambda-stop-start-role (see image below).

Next add the lamdba stop function and save it as stop-ec2-instance. Note, you will need to change the value of the region_name parameter accordingly.

import json
import boto3

ec2 = boto3.resource('ec2', region_name='af-south-1')
def lambda_handler(event, context):
   instances = ec2.instances.filter(Filters=[{'Name': 'instance-state-name', 'Values': ['running']},{'Name': 'tag:auto-start-stop','Values':['Yes']}])
   for instance in instances:
       id=instance.id
       ec2.instances.filter(InstanceIds=[id]).stop()
       print("Instance ID is stopped:- "+instance.id)
   return "success"

This is how your Lambda function should look:

Step 3b: Add the Start EC2 instance function

Goto Lambda console and click on create function
Create lambda functions with start-ec2-instance, python3.11 runtime, and ec2-lambda-stop-start-role.
Then add the below code and save the function as start-ec2-instance-lambda.

Note, you will need to change the value of the region_name parameter accordingly.

import json
import boto3

ec2 = boto3.resource('ec2', region_name='af-south-1')
def lambda_handler(event, context):
   instances = ec2.instances.filter(Filters=[{'Name': 'instance-state-name', 'Values': ['stopped']},{'Name': 'tag:auto-start-stop','Values':['Yes']}])
   for instance in instances:
       id=instance.id
       ec2.instances.filter(InstanceIds=[id]).stop()
       print("Instance ID is stopped:- "+instance.id)
   return "success"

4. Summary

If either of the above lambda functions are triggered, they will start or stop your EC2 instances based on the instance state and the value of auto-start-stop tag. To automate this you can simply setup up cron jobs, step functions, AWS Event Bridge, Jenkins etc.

How to trigger Scaling Events using Stress-ng Command

If you are testing how your autoscaling policies respond to CPU load then a really simple way to test this is using the “stress” command. Note: this is a very crude mechanism to test and wherever possible you should try and generate synthetic application load.

#!/bin/bash

# DESCRIPTION: After updating from the repo, installs stress-ng, a tool used to create various system load for testing purposes.
yum update -y
# Install stress-ng
sudo apt install stress-ng

# CPU spike: Run a CPU spike for 5 seconds
uptime
stress-ng --cpu 4 --timeout 5s --metrics-brief
uptime

# Disk Test: Start N (2) workers continually writing, reading and removing temporary files:
stress-ng --disk 2 --timeout 5s --metrics-brief

# Memory stress test
# Populate memory. Use mmap N bytes per vm worker, the default is 256MB. 
# You can also specify the size as % of total available memory or in units of 
# Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g:
# Note: The --vm 2 will start N workers (2 workers) continuously calling 
# mmap/munmap and writing to the allocated memory. Note that this can cause 
# systems to trip the kernel OOM killer on Linux systems if not enough 
# physical memory and swap is not available
stress-ng --vm 2 --vm-bytes 1G --timeout 5s

# Combination Stress
# To run for 5 seconds with 4 cpu stressors, 2 io stressors and 1 vm 
# stressor using 1GB of virtual memory, enter:
stress-ng --cpu 4 --io 2 --vm 1 --vm-bytes 1G --timeout 5s --metrics-brief

Part 2: Increasing your Cloud consumption (the sane way)

Introduction

This article follows on from the “Cloud Migrations Crusade” blog post…

A single tenancy datacenter is a fixed scale, fixed price service on a closed network. The costs of the resources in the datacenter are divided up and shared out to the enterprise constituents on a semi-random basis. If anyone uses less resources than the forecast this generates waste which is shared back to the enterprise. If there is more demand than forecasted, it will either generate service degradation, panic or an outage! This model is clearly fragile and doesn’t respond quickly to change; it is also wasteful as it requires a level of overprovisioning based on forecast consumption (otherwise you will experience delays in projects, service degradation or have reduced resilience).

Cloud, on the other hand is a multi-tenanted on demand software service which you pay for as you use. But surely having multiple tenants running on the same fixed capacity actually increases the risks, and just because its in the cloud it doesn’t mean that you can get away without over provisioning – so who sits with the over provisioned costs? The cloud providers have to build this into their rates. So cloud providers have to deal with a balance sheet of fixed capacity shared amongst customers running on demand infrastructure. They do this with very clever forecasting, very short provisioning cycles and asking their customers for forecasts and then offering discounts for pre-commits.

Anything that moves you back towards managing resources levels / forecasting will destroy a huge portion of the value of moving to the cloud in the first instance. For example, if you have ever been to a Re:Invent you will be flawed by the rate of innovation and also how easy it is to absorb these new innovative products. But wait – you just signed a 5yr cost commit and now you learn about Aurura’s new serverless database model. You realise that you can save millions of dollars; but you have to wait for your 5yr commits to expire before you adopt or maybe start mining bitcoin with all your excess commits! This is anti-innovation and anti-customer.

Whats even worse is that pre-commits are typically signed up front on day 1- this is total madness!!! At the point where you know nothing about your brave new world, you use the old costs as a proxy to predict the new costs so that you can squeeze a lousy 5px saving at the risk of 100px of the commit size! What you will start to learn is that your cloud success is NOT based on the commercial contract that you sign with your cloud provider; its actually based on the quality of the engineering talent that your organisation is able to attract. Cloud is a IP war – its not a legal/sourcing war. Allow yourself to learn, don’t box yourself in on day 1. When you sign the pre-commit you will notice your first year utilisation projections are actually tiny and therefore the savings are small. So whats the point of signing so early on when the risk is at a maximum and the gains are at a minimum? When you sign this deal you are essentially turning the cloud into a “financial data center” – you have destroyed the cloud before you even started!

A Lesson from the field – Solving Hadoop Compute Demand Spike:

We moved 7000 cores of burst compute to AWS to solve a capacity issue on premise. That’s expensive, so lets “fix the costs”! We can go a sign a RI (reserved instance), play with spot, buy savings plans or even beg / barter for some EDP relief. But instead we plugged the service usuage into Quicksight and analysed the queries. We found one query was using 60 percent of the entire banks compute! Nobody confessed to owning the query, so we just disabled it (if you need a reason for your change management; describe the change as “disabling a financial DDOS”). We quickly found the service owner and explained that running a table scan across billions of rows to return a report with just last months data is not a good idea. We also explained that if they don’t fix this we will start billing them in 6 weeks time (a few million dollars). The team deployed a fix and now we run the banks big data stack at half the costs – just by tuning one query!!!

So the point of the above is that there is no substitute for engineering excellence. You have to understand and engineer the cloud to win, you cannot contract yourself into the cloud. The more contracts you sign the more failures you will experience. This leads me to point 2…

Step 2: Training, Training, Training

Start the biggest training campaign you possibly can – make this your crusade. Train everyone; business, finance, security, infrastructure – you name it, you train it. Don’t limit what anyone can train on, training is cheap – feast as much as you can. Look at Udemy, ACloudGuru, Youtube, WhizLabs etc etc etc. If you get this wrong then you will find your organisation fills up with expensive consultants and bespoke migration products that you don’t need ++ can easily do yourself, via opensource or with your cloud provider toolsets. In fact I would go one step further – if your not prepared to learn about the cloud, your not ready to go there.

Step 3: The OS Build

When you do start your cloud migration and begin to review your base OS images – go right back to the very beginning, remove every single product in all of these base builds. Look at what you can get out the box from your cloud provider and really push yourself hard on what do I really need vs nice to have. But the trick is that to get the real benefit from a cloud migration, you have to start by making your builds as “naked” as possible. Nothing should move into the base build without a good reason. Ownership and report lines are not a good enough reason for someones special “tool” to make it into the build. This process, if done correctly, should deliver you between 20-40px of your cloud migration savings. Do this badly and your costs, complexity and support will all head in the wrong direction.

Security HAS to be a first class citizen of your new world. In most organizations this will likely make for some awkward cultural collisions (control and ownership vs agility) and some difficult dialogs. The cloud, by definition, should be liberating – so how do you secure it without creating a “cloud bunker” that nobody can actually use? More on this later… 🙂

Step 4: Hybrid Networking

For any organisation with data centers – make no mistake, if you get this wrong its over before it starts.

AWS: Making use of S3s ETags to check if a file has been altered

I was playing with S3 the other day an I noticed that a file which I had uploaded twice, in two different locations had an identical ETag. This immediately made me think that this tag was some kind of hash. So I had a quick look AWS documentation and this ETag turns out to be marginally useful. ETag is an “Entity Tag” and its basically a MD5 hash of the file (although once the file is bigger than 5gb it appears to use another hashing algorithm).

So if you ever want to compare a local copy of a file with an AWS S3 copy of a file you just need to install MD5 (the below steps are for ubuntu linux):

# Update your ubunto
# Download the latest package lists
sudo apt update
# Perform the upgrade
sudo apt-get upgrade -y
# Now install common utils (inc MD5)
sudo apt install -y ucommon-utils
# Upgrades involving the Linux kernel, changing dependencies, adding / removing new packages etc
sudo apt-get dist-upgrade

Next to view the MD5 hash of a file simple type:

# View MD5 hash of
md5sum myfilename.myextension
2aa318899bdf388488656c46127bd814  myfilename.myextension
# The first number above will match your S3 Etag if its not been altered

Below is the screenshot of the properties that you will see in S3 with a matching MD5 hash:

Using TPC-H tools to Create Test Data for AWS Redshift and AWS EMR

If you need to test out your big data tools below is a useful set of scripts that I have used in the past for aws emr and redshift the below might be helpful:

install git
 sudo yum install make git -y
 install the tpch-kit
 git clone https://github.com/gregrahn/tpch-kit
 cd tpch-kit/dbgen
 sudo yum install gcc -y
 Compile the tpch kit
 make OS=LINUX
 Go home
 cd ~
 Now make your emr data
 mkdir emrdata
 Tell tcph to use the this dir
 export DSS_PATH=$HOME/emrdata
 cd tpch-kit/dbgen
 Now run dbgen in verbose mode, with tables (orders), 10gb data size
 ./dbgen -v -T o -s 10
 move the data to a s3 bucket
 cd $HOME/emrdata
 aws s3api create-bucket -- bucket andrewbakerbigdata --region af-south-1 --LocationConstraint=af-south-1
 aws s3 cp $HOME/emrdata s3://andrewbakerbigdata/emrdata --recursive
 cd $HOME
 mkdir redshiftdata
 Tell tcph to use the this dir
 export DSS_PATH=$HOME/redshiftdata
 Now make your redshift data
 cd tpch-kit/dbgen
 Now run dbgen in verbose mode, with tables (orders), 40gb data size
 ./dbgen -v -T o -s 40
 These are big files, so lets find out how big they are and split them
 Count lines
 cd $HOME/redshiftdata
 wc -l orders.tbl
 Now split orders into 15m lines per file
 split -d -l 15000000 -a 4 orders.tbl orders.tbl.
 Now split line items
 wc -l lineitem.tbl
 split -d -l 60000000 -a 4 lineitem.tbl lineitem.tbl.
 Now clean up the master files
 rm orders.tbl
 rm lineitem.tbl
 move the split data to a s3 bucket
 aws s3 cp $HOME/redshiftdata s3://andrewbakerbigdata/redshiftdata --recursive

AWS: Please Fix Poor Error Messages, API standards and Bad Defaulting

3 Ways to Handle Work Frustration (Without Quitting) | Business Markets and Stocks News | madison.com

This is a short blog, and its actually just simple a plea to AWS. Please can you do three things?

North Virginia appears to be the AWS master node. Having this region as a master region causes a large number of support issues (for example S3, KMS, Cloudfront, ACM all use this pet region and all of their APIs suffer as a result). This coupled with point 2) creates some material angst.
Work a little harder on your error messages – they are often really (really) bad. I will post some examples at the bottom of this post over time. But you have to do some basics like reject unknown parameters (yes it’s useful to know there is a typo vs just ignore the parameter).
Use standard parameters across your APIs (eg make specifying the region consistent (even within single products its not consistently applied) and make your verbs consistent).

As a simple example, below i am logged into an EC2 instances in af-south-1 and I can create an S3 bucket in North Virginia, but not in af-south-1. I am sure there is a “fix” (change some config, find out an API parameter was invalid and was silently ignored etc) – but this isn’t the point. The risk (and its real) is that in an attempt to debug this, developers will tend to open up security groups, open up NACLs, widen IAM roles etc. When the devs finally fix the issue; they will be very unlikely to retrace all their steps and restore everything else that they changed. This means that you end up with debugging scars that create overly permission services, due to poor errors messages, inconsistent API parameters/behaviors and a regional bias. Note: I am aware of commercial products, like Radware’s CWP – but that’s not the point. I shouldn’t ever need to debug by dialling back security. Observability was supposed to be there from day 1. The combination of tangential error messages, inconsistent APIs and lack of decent debug information from core services like IAM and S3, are creating a problem that shouldn’t exist.

AWS is a global cloud provider – services should work identically across all regions, and APIs should have standards, APIs shouldn’t silently ignore mistyped parameters, the base config required should either come from context (ie am running in region x) or config (aws config) – not a global default region.

Please note: I deleted the bucket between running the two commands ++ awsconfigure seemed to be ignored by createbucket

[ec2-user@ip-172-31-24-139 emrdata]$ aws s3api create-bucket --bucket ajbbigdatabucketlab2021
{
    "Location": "/ajbbigdatabucketlab2021"
}
[ec2-user@ip-172-31-24-139 emrdata]$ aws s3api create-bucket --bucket ajbbigdatabucketlab2021 --region af-south-1

An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: 
The unspecified location constraint is incompatible for the region specific endpoint this request was sent to.

Note, I worked around the createbucket behavior by replacing it with mb:

[ec2-user@ip-172-31-24-139 emrdata]$ aws s3 mb s3://ajbbigdatabucketlab2021 --region af-south-1
make_bucket: ajbbigdatabucketlab2021

Thanks to the AWS dudes for letting me know how to get this working. It turns out the create-bucket and mb APIs, dont use standard parameters. See below (region tag needs to be replaced by a verbose bucket config tag):

aws s3api create-bucket --bucket ajbbigdatabucketlab2021 --create-bucket-configuration LocationConstraint=af-south-1

Part 1: The Great Public Cloud Crusade…

“Not all cloud transformations are created equally…!”

The cloud is hot…. not just a little hot, but smokin hot!! Covid is messing with the economy, customers are battling financially, the macro economic outlook is problematic, vendor costs are high and climbing and security needs more investment every year. What on earth do we do??!! I know…. lets start a crusade – lets go to the cloud!!!!

Cloud used to be just for the cool kids, the start ups, the hipsters… but not anymore, now corporates are coming and they are coming in their droves. The cloud transformation conversation is playing out globally for almost all sectors, from health care, to pharmaceuticals and finance. The hype and urban legends around public cloud are a creating a lot of FOMO.

For finance teams under severe cost pressures, the cloud has to be an obvious place to seek out some much need pain relief. CIOs are giving glorious on stage testimonials, decrying victory after having gone live with their first “bot in the cloud”. So what is there to blog about, it’s all wonderful right…? Maybe not…

The Backdrop…

Imagine your a CIO or CTO, you haven’t cut code for a while or maybe you have a finance background. Anyway your architecture skills are a bit rusty/vacant, you have been outsourcing technology work for years, you are awash with vendor products, all the integration points are “custom” (aka arc welded) and and hence your stack is very fragile. In fact its so fragile you can trigger outages when someone closes your datacentre door a little too hard! Your technology teams all have low/zero cloud knowledge and now you have been asked to transform your organisation by shipping it off to the cloud… So what do you do???

Lots of organisations believe this challenge is simply a case of finding the cheapest cloud provider, write a legal document, some SLAs, find a vendor who can whiz your servers into the cloud – then you simply cut a cheque. But the truth is the cloud requires IP and if you don’t have IP (aka engineers) then you have a problem…

Plan A: Project Borg

This is an easy, problem – right? Just ask the AWS borg to assimilate you!!! The “Borg” strategy can be achieved by:

Install some software agents in your data centers to come up with a total thumb suck on how much you think you will spend in the cloud. Note: your lack of any real understanding of how the cloud works should not ring any warning bells.
Factor down this thumb suck using another made up / arbitrary “risk factor”.
Next, sign an intergalactic cloud commit with your cloud provider of choice and try to squeeze more than a 10px discount out for taking this enormous risk.
Finally pick up the phone to one of the big 5 consultants and get them to “assimilate” you in the cloud (using some tool to perform a bitwise copy of your servers into the cloud).

Before you know it your peppering your board and excos with those ghastly cloud packs, you are sending out group wide emails with pictures of clouds on them, you are telling your teams to become “cloud ready”. What’s worse your burning serious money as the consultancy team you called in did the usual land and expand. But you cant seem to get a sense of any meaningful progress (and no, a BOT in the cloud doesn’t count as progress).

To fund this new cloud expense line you have to start strangling your existing production spending, maybe you are running your servers for an extra year or two, strangling the network spend, keep these storage arrays for just a little while longer. But don’t worry, before you know it you will be in the cloud – right??

The Problem Statement

The problem is that public cloud was never about physically your iffy datacentre software with someone else; it’s was supposed to be about transformation of this software. The legacy software in your datacentre is almost certainly poisonous and in interdependencies will be as lethal as they are opaque. If you move it, pain will follow and you wont see any real commercial benefits for years.

Put another way, your datacentre is the technical equivalent of a swap. Luckily those lovely cloud people have built you a nice clean swimming pool. BUT don’t go and pump your swamp into this new swimming pool!

Crusades have never give us rational outcomes, you forgot to imagine where the customer was in this painful sideways move, what exactly did you want from this? In fact cloud crusades suffer from a list of oversights, weaknesses and risks:

Actual digital “transformation” will take years to realise (if ever). All you did was just changed your hosting and how you pay for technology – nothing else actually changed.
Your customer value proposition will be totally unchanged, sadly you are still as digital as a fax machine!
Key infrastructure teams will start realising their is no future for them and start wandering. Creating even more instability.
Stability will be problematic as your hybrid network has created a BGP birds nest.
Your company signed a 5 year cloud commit. You took your current tech spend, halved it and then asked your cloud provider to give you discounts on this projected spend. You will likely see around a 10px-15px reduction in your EDP (enterprise discount program) rates, and for this you are taking ENORMOUS downside risks. You’re also accidentally discouraging efficient utilisation of resources. in favour of a culture of “ram it in the cloud and review it once our EDP period expires”.
Your balance sheet will balloon, such that you will end up with a cost base of not dissimilar to NASA, you will need a PhD to diagnose issues and your delivery cadence will be close to zero. Additionally, you will need to create an impairment factory to deal with all your stranded assets.

So what does this approach actually achieve? You will of added a ton of intangible assets by balance sheeting a bunch of profees, you will likely be less stable and even be less secure (more on this later), you know that this is an unsustainable project and that it is the equivalent of an organisational heart transplant. The only people that now understand your organisation, are a team of well paid consultants on a 5x salary multiple and sadly you cannot stop this process – you have to keep paying and praying. Put simply, cloud mass migration (aka assimilation) is a bad idea – so don’t do it!

The key here is that your tech teams have to transform themselves. Nothing can act on them, the transformation has to come from within. When you review organisations that have been around for a while, they may have had a few mergers, have high vendor dependencies and low technology skills; you will tend to end up with the combined/systemic complexity suffering from something similar to Quantum Entanglement. Next we ask an external agency with a suite of tools to unpack this unobservable, irreducible complexity with a few tools, then we get expensive external forces to reverse engineer these entangled systems and recreate them somewhere else. This is not reasonable or rationale – its daft and we should stop doing this.

If not this, then what?

The “then what” bit is even longer that the “not this” bit. So I am posting this as it and if I get 100 hits I will write up the other way – little by little 🙂

Click here to read the work in progress link on another approach to scaling cloud usage…