If you are testing how your autoscaling policies respond to CPU load, a really simple way to do this is with the “stress-ng” command. Note: this is a very crude mechanism, and wherever possible you should try to generate synthetic application load instead.
#!/bin/bash
# DESCRIPTION: After updating the package lists, installs stress-ng, a tool used to create various types of system load for testing purposes.
sudo apt update -y
# Install stress-ng
sudo apt install -y stress-ng
# CPU spike: Run a CPU spike for 5 seconds
uptime
stress-ng --cpu 4 --timeout 5s --metrics-brief
uptime
# Disk test: Start N (2) workers continually writing, reading and removing temporary files:
stress-ng --hdd 2 --timeout 5s --metrics-brief
# Memory stress test
# Populate memory. Use mmap N bytes per vm worker, the default is 256MB.
# You can also specify the size as % of total available memory or in units of
# Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g:
# Note: --vm 2 will start N (2) workers continuously calling
# mmap/munmap and writing to the allocated memory. Note that this can cause
# systems to trip the kernel OOM killer on Linux if not enough
# physical memory and swap is available.
stress-ng --vm 2 --vm-bytes 1G --timeout 5s
# Combination Stress
# To run for 5 seconds with 4 cpu stressors, 2 io stressors and 1 vm
# stressor using 1GB of virtual memory, enter:
stress-ng --cpu 4 --io 2 --vm 1 --vm-bytes 1G --timeout 5s --metrics-brief
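While the stress run is in progress, you can watch the scaling response from the CLI rather than the console. The snippet below is just a minimal sketch; the Auto Scaling group name and instance ID are placeholders you will need to substitute.
# Placeholders: substitute your own Auto Scaling group name and instance ID
ASG_NAME="my-asg"
INSTANCE_ID="i-0123456789abcdef0"
# Average CPU utilisation over the last 10 minutes, in 60 second buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average
# Recent scaling activities for the Auto Scaling group
aws autoscaling describe-scaling-activities --auto-scaling-group-name "$ASG_NAME" --max-items 5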
I was playing with S3 the other day and I noticed that a file which I had uploaded twice, in two different locations, had an identical ETag. This immediately made me think that this tag was some kind of hash. So I had a quick look at the AWS documentation, and this ETag turns out to be marginally useful. ETag stands for “Entity Tag” and it’s basically an MD5 hash of the file (although once the file is bigger than 5GB, or is uploaded using multipart upload, it appears to use a different scheme).
So if you ever want to compare a local copy of a file with an AWS S3 copy of a file, you just need to install MD5 (the steps below are for Ubuntu Linux):
# Update your Ubuntu
# Download the latest package lists
sudo apt update
# Perform the upgrade
sudo apt-get upgrade -y
# Now install common utils (inc MD5; note that md5sum is part of coreutils, so it may already be installed)
sudo apt install -y ucommon-utils
# Upgrades involving the Linux kernel, changing dependencies, adding / removing new packages etc
sudo apt-get dist-upgrade
Next, to view the MD5 hash of a file, simply type:
# View the MD5 hash of the file
md5sum myfilename.myextension
2aa318899bdf388488656c46127bd814 myfilename.myextension
# The hash above will match your S3 ETag if the object has not been altered (and was not a multipart upload)
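If you would rather compare against S3 from the CLI instead of the console, something like the sketch below should work (the bucket and key names here are placeholders):
# Placeholders: substitute your own bucket, key and local filename
BUCKET=mybucket
KEY=myfilename.myextension
# Local MD5 hash
md5sum $KEY
# ETag that S3 holds for the same object (the returned value is wrapped in quotes)
aws s3api head-object --bucket $BUCKET --key $KEY --query ETag --output text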
Below is the screenshot of the properties that you will see in S3 with a matching MD5 hash:
If you need to test out your big data tools, below is a useful set of scripts that I have used in the past for AWS EMR and Redshift:
# Install git
sudo yum install make git -y
# Install the tpch-kit
git clone https://github.com/gregrahn/tpch-kit
cd tpch-kit/dbgen
sudo yum install gcc -y
# Compile the tpch-kit
make OS=LINUX
# Go home
cd ~
# Now make your EMR data directory
mkdir emrdata
# Tell tpch to use this dir
export DSS_PATH=$HOME/emrdata
cd tpch-kit/dbgen
# Now run dbgen in verbose mode, order tables only (orders and lineitem), scale factor 10 (roughly 10GB)
./dbgen -v -T o -s 10
# Move the data to an S3 bucket
cd $HOME/emrdata
aws s3api create-bucket --bucket andrewbakerbigdata --region af-south-1 --create-bucket-configuration LocationConstraint=af-south-1
aws s3 cp $HOME/emrdata s3://andrewbakerbigdata/emrdata --recursive
cd $HOME
mkdir redshiftdata
# Tell tpch to use this dir
export DSS_PATH=$HOME/redshiftdata
# Now make your Redshift data
cd tpch-kit/dbgen
# Now run dbgen in verbose mode, order tables only (orders and lineitem), scale factor 40 (roughly 40GB)
./dbgen -v -T o -s 40
# These are big files, so let's find out how big they are and split them
# Count lines
cd $HOME/redshiftdata
wc -l orders.tbl
# Now split orders into 15m lines per file
split -d -l 15000000 -a 4 orders.tbl orders.tbl.
# Now split lineitem
wc -l lineitem.tbl
split -d -l 60000000 -a 4 lineitem.tbl lineitem.tbl.
# Now clean up the master files
rm orders.tbl
rm lineitem.tbl
# Move the split data to an S3 bucket
aws s3 cp $HOME/redshiftdata s3://andrewbakerbigdata/redshiftdata --recursive
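Once the split files are in S3, each table can be loaded into Redshift with a single COPY pointed at the file prefix. The below is just a minimal sketch: it assumes an orders table already exists in the cluster, and the cluster endpoint, database, user and IAM role ARN are all placeholders. One caveat: dbgen terminates every row with a trailing '|', which COPY may reject as an extra column; if it does, strip it (e.g. sed -i 's/|$//' *.tbl) before uploading.
# Placeholders: substitute your own cluster endpoint, database, user and IAM role ARN
psql -h mycluster.abc123xyz.af-south-1.redshift.amazonaws.com -p 5439 -U awsuser -d dev -c "
COPY orders
FROM 's3://andrewbakerbigdata/redshiftdata/orders.tbl.'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
DELIMITER '|';"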
This is a short blog, and it's actually just a simple plea to AWS. Please can you do three things?
1) North Virginia (us-east-1) appears to be the AWS master region. Having a master region causes a large number of support issues (for example S3, KMS, CloudFront and ACM all rely on this pet region, and all of their APIs suffer as a result). This, coupled with point 2), creates some material angst.
2) Work a little harder on your error messages – they are often really (really) bad. I will post some examples at the bottom of this post over time. But you have to do some basics, like rejecting unknown parameters (yes, it's useful to know there is a typo rather than having the parameter silently ignored).
3) Use standard parameters across your APIs (eg make specifying the region consistent – even within single products it's not consistently applied – and make your verbs consistent).
As a simple example, below I am logged into an EC2 instance in af-south-1, and I can create an S3 bucket in North Virginia but not in af-south-1. I am sure there is a "fix" (change some config, find out an API parameter was invalid and was silently ignored, etc) – but this isn't the point. The risk (and it's real) is that in an attempt to debug this, developers will tend to open up security groups, open up NACLs, widen IAM roles etc. When the devs finally fix the issue, they will be very unlikely to retrace all their steps and restore everything else that they changed. This means that you end up with debugging scars that create overly permissive services, due to poor error messages, inconsistent API parameters/behaviors and a regional bias. Note: I am aware of commercial products, like Radware's CWP – but that's not the point. I shouldn't ever need to debug by dialling back security. Observability was supposed to be there from day 1. The combination of tangential error messages, inconsistent APIs and a lack of decent debug information from core services like IAM and S3 is creating a problem that shouldn't exist.
AWS is a global cloud provider – services should work identically across all regions, APIs should follow standards, APIs shouldn't silently ignore mistyped parameters, and the base config required should come either from context (ie I am running in region x) or from config (aws configure) – not from a global default region.
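For what it's worth, the default region can at least be pinned explicitly, either in the CLI config or per shell session (af-south-1 here is just the region used in this example):
# Set the default region in the CLI config
aws configure set region af-south-1
# Or set it for the current shell session only
export AWS_DEFAULT_REGION=af-south-1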
Please note: I deleted the bucket between running the two commands, and aws configure seemed to be ignored by create-bucket.
[ec2-user@ip-172-31-24-139 emrdata]$ aws s3api create-bucket --bucket ajbbigdatabucketlab2021
{
"Location": "/ajbbigdatabucketlab2021"
}
[ec2-user@ip-172-31-24-139 emrdata]$ aws s3api create-bucket --bucket ajbbigdatabucketlab2021 --region af-south-1
An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation:
The unspecified location constraint is incompatible for the region specific endpoint this request was sent to.
Note: I worked around the create-bucket behavior by replacing it with mb:
Thanks to the AWS dudes for letting me know how to get this working. It turns out the create-bucket and mb APIs don't use standard parameters. See below (the region tag needs to be replaced by a verbose bucket config tag):
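The two working variants look something like this (using the same bucket name as above):
# The higher-level mb command accepts a plain --region flag
aws s3 mb s3://ajbbigdatabucketlab2021 --region af-south-1
# s3api create-bucket needs the verbose bucket configuration instead
aws s3api create-bucket --bucket ajbbigdatabucketlab2021 --region af-south-1 --create-bucket-configuration LocationConstraint=af-south-1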