Technology Culture: The Sinking Car Syndrome

This is (hopefully) a short blog that will give you back a small piece of your life…

In technology, we rightly spend hours poring over failure so that we can understand it, fix it and avoid it in the future. This seems a reasonable approach: learn from your mistakes, understand failure, plan your remediation, etc. But is it possible that there are some instances where doing this is inappropriate? To answer this simple question, let me give you an analogy…

You decide that you want to travel from London to New York. Sounds reasonable so far…. But you decide you want to go by car! The reasoning for this is as follows:

  1. Cars are “tried and tested”.
  2. We have an existing deal with multiple car suppliers and we get great discounts.
  3. The key decision maker is a car enthusiast.
  4. The incumbent team understand cars and can support this choice.
  5. Cars are what we have available right now and we want to start execution tomorrow, so let’s just make it work.

You first try a small hatchback and only manage to get around 3m off the coast of Scotland. Next up you figure you need a more durable car, so you get a truck – but sadly this only makes 2m headway from the beach. You report back to the team and they send you a brand new Porsche. This time you give yourself an even bigger run up at the sea and you manage a whopping 4m before the car sinks. The team now analyse all the data to figure out why each car sank and what they can do to make this better. The team continue to experiment with various cars and progress is observed over time. After 6 months the team has managed to travel 12m towards their goal of driving to New York. The main reason for the progress is that the sunken cars are starting to form a land bridge. The leadership has now spent over 200 million USD on this venture and doesn’t feel it can pivot, so it starts to brainstorm how to make this work.

Maybe wind the windows up a little tighter, maybe the cars need more underseal, maybe over inflate the tyres or maybe we simply need way more cars? All of these may or may not make a difference. But here’s the challenge: you made a bad engineering choice and anything you do will just be a variant of bad. It will never be good and you cannot win with your choice.

The above obviously sounds a bit daft (and it is), but the point is that I am often called in after downtime to review an architecture, find a root cause and suggest remediation. What is not always understood is that a bad technology choice can be about as likely to succeed as driving from London to New York. Sometimes you simply need to look at alternatives: you need a boat or a plane. The product architecture can be terminal; it won’t ever be what you want it to be and no amount of analysis or spend will change this. The trick is to accept the brutal reality of your situation and move your focus towards choosing the technology that you need to transition to. Next, try to figure out how quickly you can do this pivot…

How to Install Apps From Anywhere on Apple Mac

Previously Macs would allow you to install software from anywhere. Now you will see the error message “NMAPxx.mpkg cannot be opened because its from an unidentified developer”. If you want to fix this and enable apps to be installed from anywhere, you will need to run the following command in a terminal:

sudo spctl --master-disable

Once you have run the command you should then see the “Anywhere” option in the System Preferences > Security & Privacy tab!

Part 2: Increasing your Cloud consumption (the sane way)

Introduction

This article follows on from the “Cloud Migrations Crusade” blog post…

A single tenancy datacenter is a fixed scale, fixed price service on a closed network. The costs of the resources in the datacenter are divided up and shared out to the enterprise constituents on a semi-random basis. If anyone uses fewer resources than forecast, this generates waste which is shared back to the enterprise. If there is more demand than forecast, it will generate service degradation, panic or an outage! This model is clearly fragile and doesn’t respond quickly to change; it is also wasteful, as it requires a level of overprovisioning based on forecast consumption (otherwise you will experience delays in projects, service degradation or reduced resilience).

Cloud, on the other hand, is a multi-tenanted, on-demand software service which you pay for as you use. But surely having multiple tenants running on the same fixed capacity actually increases the risks, and just because it’s in the cloud doesn’t mean that you can get away without over provisioning – so who sits with the over provisioned costs? The cloud providers have to build this into their rates. So cloud providers have to deal with a balance sheet of fixed capacity shared amongst customers running on-demand infrastructure. They do this with very clever forecasting, very short provisioning cycles, and by asking their customers for forecasts and then offering discounts for pre-commits.

Anything that moves you back towards managing resource levels / forecasting will destroy a huge portion of the value of moving to the cloud in the first instance. For example, if you have ever been to a re:Invent you will be floored by the rate of innovation and also by how easy it is to absorb these new innovative products. But wait – you just signed a 5yr cost commit and now you learn about Aurora’s new serverless database model. You realise that you can save millions of dollars, but you have to wait for your 5yr commits to expire before you adopt it, or maybe start mining bitcoin with all your excess commits! This is anti-innovation and anti-customer.

What’s even worse is that pre-commits are typically signed up front on day 1 – this is total madness! At the point where you know nothing about your brave new world, you use the old costs as a proxy to predict the new costs, so that you can squeeze out a lousy 5% saving at the risk of 100% of the commit size! What you will start to learn is that your cloud success is NOT based on the commercial contract that you sign with your cloud provider; it’s actually based on the quality of the engineering talent that your organisation is able to attract. Cloud is an IP war – it’s not a legal/sourcing war. Allow yourself to learn; don’t box yourself in on day 1. When you sign the pre-commit you will notice your first year utilisation projections are actually tiny and therefore the savings are small. So what’s the point of signing so early on, when the risk is at a maximum and the gains are at a minimum? When you sign this deal you are essentially turning the cloud into a “financial data center” – you have destroyed the cloud before you even started!
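To make the asymmetry concrete, here is a rough sketch with made-up numbers. The commit size, the discount, the usage figures and the rule that overage is billed at on-demand rates are all illustrative assumptions, not terms from any real contract. It simply compares what you would pay on demand with what you pay once you have committed.

```python
def commit_outcome(commit, discount, actual_usage):
    """Compare on-demand spend with spend under an upfront commit.

    commit       -- on-demand value of the spend you promised (hypothetical)
    discount     -- fractional discount earned by committing, e.g. 0.05
    actual_usage -- on-demand value of what you really consumed
    """
    on_demand = actual_usage
    # Assumption: the discounted commit is owed whether or not it is used,
    # and anything above the commit is billed at on-demand rates.
    committed = commit * (1 - discount) + max(0.0, actual_usage - commit)
    return {"on_demand": on_demand, "committed": committed, "saving": on_demand - committed}


# Forecast was right: you keep the small discount.
print(commit_outcome(commit=100.0, discount=0.05, actual_usage=100.0))

# Engineering (or a new product) cut your usage to 40: the commit is now a cost.
print(commit_outcome(commit=100.0, discount=0.05, actual_usage=40.0))
```

In the first case the saving is about 5 units; in the second you are roughly 55 units worse off than if you had never signed. The upside is capped at the discount, while the downside grows with every improvement you make after day 1.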

A Lesson from the field – Solving Hadoop Compute Demand Spike:

We moved 7000 cores of burst compute to AWS to solve a capacity issue on premise. That’s expensive, so let’s “fix the costs”! We could go and sign an RI (reserved instance), play with spot, buy savings plans or even beg / barter for some EDP relief. But instead we plugged the service usage into QuickSight and analysed the queries. We found one query was using 60 percent of the entire bank’s compute! Nobody confessed to owning the query, so we just disabled it (if you need a reason for your change management, describe the change as “disabling a financial DDoS”). We quickly found the service owner and explained that running a table scan across billions of rows to return a report with just last month’s data is not a good idea. We also explained that if they didn’t fix this we would start billing them in 6 weeks’ time (a few million dollars). The team deployed a fix and now we run the bank’s big data stack at half the cost – just by tuning one query!
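The analysis itself does not need to be sophisticated. As a minimal sketch of the idea (we used QuickSight; this is plain Python, and the record format, query names and numbers are invented for illustration), you total the compute consumed per query and rank the queries by their share of the whole:

```python
from collections import defaultdict


def compute_share_by_query(usage_records):
    """Return (query_id, share_of_total) pairs, largest consumer first."""
    totals = defaultdict(float)
    for query_id, core_hours in usage_records:
        totals[query_id] += core_hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(
        ((query_id, total / grand_total) for query_id, total in totals.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical usage log: (query id, core-hours consumed).
records = [
    ("monthly_report", 300.0),
    ("monthly_report", 300.0),
    ("risk_batch", 250.0),
    ("adhoc_42", 150.0),
]

top_query, top_share = compute_share_by_query(records)[0]
print(f"{top_query} uses {top_share:.0%} of compute")  # monthly_report uses 60% of compute
```

Once the ranking exists, the conversation shifts from "how do we discount this bill" to "who owns the query at the top of the list".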

So the point of the above is that there is no substitute for engineering excellence. You have to understand and engineer the cloud to win, you cannot contract yourself into the cloud. The more contracts you sign the more failures you will experience. This leads me to point 2…

Step 2: Training, Training, Training

Start the biggest training campaign you possibly can – make this your crusade. Train everyone: business, finance, security, infrastructure – you name it, you train it. Don’t limit what anyone can train on; training is cheap, so feast as much as you can. Look at Udemy, ACloudGuru, YouTube, WhizLabs etc. If you get this wrong then you will find your organisation fills up with expensive consultants and bespoke migration products that you don’t need and could easily replace yourself, via open source or with your cloud provider’s toolsets. In fact I would go one step further – if you’re not prepared to learn about the cloud, you’re not ready to go there.

Step 3: The OS Build

When you do start your cloud migration and begin to review your base OS images, go right back to the very beginning and remove every single product from all of these base builds. Look at what you can get out of the box from your cloud provider and really push yourself hard on what you really need vs what is nice to have. To get the real benefit from a cloud migration, you have to start by making your builds as “naked” as possible. Nothing should move into the base build without a good reason. Ownership and reporting lines are not a good enough reason for someone’s special “tool” to make it into the build. This process, if done correctly, should deliver you between 20-40% of your cloud migration savings. Do this badly and your costs, complexity and support will all head in the wrong direction.

Security HAS to be a first class citizen of your new world. In most organizations this will likely make for some awkward cultural collisions (control and ownership vs agility) and some difficult dialogs. The cloud, by definition, should be liberating – so how do you secure it without creating a “cloud bunker” that nobody can actually use? More on this later… 🙂

Step 4: Hybrid Networking

For any organisation with data centers – make no mistake, if you get this wrong it’s over before it starts.

The Least Privileged Lie

In technology, there is a tendency to solve a problem badly by using gross simplification, then come up with a catchy one liner and then broadcast this as doctrine or a principle. Nothing ticks more boxes in this regard than the principle of least privileges. The ensuing enterprise scale deadlocks created by a crippling implementation of least privileges are almost certainly lost on its evangelists. This blog will try to put an end to the slavish efforts of the many security teams that are trying to ration out micro permissions and hope the digital revolution can fit into some break glass approval process.

What is this “Least Privileged” thing? Why does it exist? What are the alternatives? Wikipedia gives you a good overview of this here. The first line contains an obvious and glaring issue: “The principle means giving a user account or process only those privileges which are essential to perform its intended function”. Here the principle is being applied equally to users and processes/code. The principle also states only give privileges that are essential. What this principle is trying to say, is that we should treat human beings and code as the same thing and that we should only give humans “essential” permissions. Firstly, who on earth figures out what that bar for essential is and how do they ascertain what is and what is not essential? Do you really need to use storage? Do you really need an API? If I give you an API, do you need Puts and Gets?

Human beings are NOT deterministic. If I have a team of humans that can operate under the principle of least privileges then I don’t need them in the first place. I can simply replace them with some AI/RPA. Imagine the brutal pain of a break glass activity every time someone needed to do something “unexpected”. “Hi boss, I need to use the bathroom on the 1st floor – can you approve this? <Gulp> Boss you took too long… I no longer need your approval!”. Applying least privileges to code would seem to make some sense, BUT only if you never update the code; and if you do update the code, you need to make sure you have 100% test coverage.

So why did some bright spark want to duct tape the world to such a brittle, pain yielding principle? At the heart of this are three issues: Identity, Immutability, and Trust. If there are other ways to solve these issues then we don’t need the pain and risks of trying to implement something that will never actually work, creates friction and critically creates a false sense of security. Least Privileges will never save anyone; you will just be told that if you could have performed this security miracle then you would have been fine. But you cannot and so you are not.

What’s interesting to me is that the least privileged lie is so widely ignored. For example, just think about how we implement user access. If we truly believed in least privileges then every user would have a unique set of privileges assigned to them. Instead, because we acknowledge this is burdensome, we approximate the privileges that a user will need using policies which we attach to groups. The moment we add a user to one of these groups, we are approximating their required privileges and start to become overly permissive.
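A small sketch shows how quickly the approximation leaks. The group names, permission strings and the user’s needs below are hypothetical, chosen only to illustrate the point; the function compares what a user actually needs with what their group memberships grant.

```python
def effective_access(needed, member_of, group_policies):
    """Compare what a user needs with what their groups actually grant."""
    granted = set()
    for group in member_of:
        granted |= group_policies[group]
    return {
        "granted": granted,
        "excess": granted - needed,    # permitted but never required
        "missing": needed - granted,   # required but not permitted
    }


# Hypothetical policies attached to groups rather than to individuals.
group_policies = {
    "developers": {"storage:get", "storage:put", "queue:send"},
    "operators": {"storage:get", "compute:restart"},
}

# A user who only ever needs to read storage and restart a service.
result = effective_access(
    needed={"storage:get", "compute:restart"},
    member_of=["developers", "operators"],
    group_policies=group_policies,
)
print(sorted(result["excess"]))  # ['queue:send', 'storage:put']
```

No single group covers this user, so they get both, and with them two permissions nobody decided they should have. Multiply that by every user and every group and "least" stops meaning anything.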

Let’s be clear with each other: anyone trying to implement least privileges is living a lie. The extent of the lie normally only becomes clear after the event. So this blog post is designed to re-point energy towards sustainable alternatives that work, and additionally remove the need for the myriad of micro permissive handbrakes (that routinely get switched off to debug outages and issues).

Who are you?

This is the biggest issue and still remains the largest risk in technology today. If I don’t know who you are then I really, really want to limit what you can do. Experiencing a root/super user account takeover is a doomsday scenario for any organisation. So let’s limit the blast zone of these accounts, right?

This applies equally to code and humans. For code this problem has been solved a long time ago, and if you look

Is this really my code?

The DAO Ethereum Recursion Bug: El Gordo!

If you found my article, I would consider it a reasonable assumption that you already understand the importance of this

Brief Introduction

The splitDAO function was created in order for some members of the DAO to separate themselves and their tokens from the main DAO, creating a new ‘child DAO’, for example in case they found themselves in disagreement with the majority.

The child DAO goes through the same 27 day creation period as the original DAO. Pre-requisite steps in order to call a splitDAO function are the creation of a new proposal on the original DAO and designation of a curator for the child DAO.

The child DAO created by the attacker has been referred to as ‘darkDAO’ on reddit and the name seems to have stuck. The proposal and split process necessary for the attack was initiated at least 7 days prior to the incident.

The attack combined two exploits, described below. The first alone would have been economically unviable (the attacker would have needed to put up 1/20th of the stolen amount upfront in the original DAO), and the second alone would have been time intensive, because normally only one splitDAO call could be done per transaction.

One way to see this is that the attacker performed a fraud, or a theft. Another way, more interesting for its implications, is that the attacker took the contract literally and followed the rule of code.

In their (allegedly) own words:

@stevecalifornia on Hacker News – https://news.ycombinator.com/item?id=11926150

“DAO, I closely read your contract and agreed to execute the clause where I can withdraw eth repeatedly and only be charged for my initial withdraw.

Thank you for the $70 million. Let me know if you draw up any other contracts I can participate in.

Regards, 0x304a554a310c7e546dfe434669c62820b7d83490”

The HACK

An “attacker” managed to combine two “exploits” in the DAO.

1) The attacker called the splitDAO function recursively (up to 20 times).

2) To make the attack more efficient, the attacker also managed to replicate the incident from the same two addresses (using the same tokens over and over again, approx 250 times).

Quote from “Luigi Renna”: To put this instance in a natural language perspective the Attacker requested the DAO “I want to withdraw all my tokens, and before that I want to withdraw all my tokens, and before that I want to… etc.” And be charged only once.

The Code That was Hacked

Below is the now infamous splitDAO function in all its glory:

function splitDAO(
    uint _proposalID,
    address _newCurator
) noEther onlyTokenholders returns (bool _success) {
    ...
    // Get the ether to be moved. Notice that this is done first!
    uint fundsToBeMoved =
        (balances[msg.sender] * p.splitData[0].splitBalance) /
        p.splitData[0].totalSupply;
    if (p.splitData[0].newDAO.createTokenProxy.value(fundsToBeMoved)(msg.sender) == false) // << This is the line the attacker wants to run more than once
        throw;
    ...
    // Burn DAO Tokens
    Transfer(msg.sender, 0, balances[msg.sender]);
    withdrawRewardFor(msg.sender); // be nice, and get his rewards
    // XXXXX Notice the preceding line is critically before the next few
    totalSupply -= balances[msg.sender]; // THIS IS DONE LAST
    balances[msg.sender] = 0; // THIS IS DONE LAST TOO
    paidOut[msg.sender] = 0;
    return true;
}

Source: http://hackingdistributed.com/2016/06/18/analysis-of-the-dao-exploit/

The basic idea behind the hack was:

1) Propose a split.
2) Execute the split.
3) When the DAO goes to withdraw your reward, call the function to execute a split before that withdrawal finishes (i.e. recursion).
4) The function will start running again without updating your balance!!!
5) The line we marked above as “This is the line the attacker wants to run more than once” will run more than once!
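The steps above can be reproduced in a toy model. This is a Python simulation, not Solidity and not the DAO’s actual code: the class names, the `receive_reward` callback (standing in for the point where withdrawRewardFor hands control to the caller) and all the numbers are assumptions made for illustration. It keeps only the one property that matters, which is the ordering: pay out, call out, then zero the balance.

```python
class VulnerableDAO:
    """Toy model of the ordering bug: pay out, call out, then zero the balance."""

    def __init__(self, pool, balances):
        self.pool = pool                # ether held by the DAO
        self.balances = dict(balances)  # token balance per address
        self.child_dao = {}             # ether moved out, per address

    def split(self, member):
        funds = self.balances[member.address]
        if funds == 0 or funds > self.pool:
            return False
        # 1. Move the ether first...
        self.pool -= funds
        self.child_dao[member.address] = self.child_dao.get(member.address, 0) + funds
        # 2. ...then hand control to the caller...
        member.receive_reward(self)
        # 3. ...and only now zero the balance. Too late.
        self.balances[member.address] = 0
        return True


class Attacker:
    def __init__(self, address, max_depth):
        self.address = address
        self.max_depth = max_depth
        self.depth = 0

    def receive_reward(self, dao):
        # Re-enter split() before the outer call has zeroed our balance.
        if self.depth < self.max_depth:
            self.depth += 1
            dao.split(self)


dao = VulnerableDAO(pool=1000, balances={"attacker": 10, "honest": 990})
attacker = Attacker("attacker", max_depth=4)
dao.split(attacker)
print(dao.child_dao["attacker"])  # 50: five payouts of 10 against a balance of 10
```

Each nested call sees the same untouched balance of 10, so one legitimate split plus four re-entrant ones moves 50 out of the pool.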

Thoughts on the Hack

The code is easy to fix: you can simply zero the balances immediately after calculating fundsToBeMoved, before any ether is moved or any external call is made. But this bug is not the real issue for me – the main problem can be split into two areas:

  1. Ethereum’s contract language is Turing complete, with a stack/heap, exceptions and recursion. This means that even with the best intentions, we will create a vast number of routes through the code base that allow similar recursion and other bugs to be exposed. The only barrier really is how much ether it will cost to expose the bugs.
  2. There is no Escape Hatch for “bad” contracts. Emin Gun Sirer wrote a great article on this here: Escape hatches for smart contracts
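As for the easy fix mentioned above (zero the balance straight after calculating the funds, before anything else happens), here is a hedged Python sketch of the reordering. As with any model of this kind, it is an illustration and not the DAO’s Solidity: `FixedDAO`, `ReentrantCaller` and `receive_reward` are invented names, and the callback stands in for the external call that hands control to the caller.

```python
class FixedDAO:
    """Toy model with the ordering corrected: zero the balance, then pay out."""

    def __init__(self, pool, balances):
        self.pool = pool                # ether held by the DAO
        self.balances = dict(balances)  # token balance per address
        self.child_dao = {}             # ether moved out, per address

    def split(self, member):
        funds = self.balances[member.address]
        if funds == 0 or funds > self.pool:
            return False
        # Update our own state BEFORE moving ether or calling anyone.
        self.balances[member.address] = 0
        self.pool -= funds
        self.child_dao[member.address] = self.child_dao.get(member.address, 0) + funds
        # A re-entrant call from here now sees a zero balance.
        member.receive_reward(self)
        return True


class ReentrantCaller:
    def __init__(self, address, max_depth):
        self.address = address
        self.max_depth = max_depth
        self.depth = 0

    def receive_reward(self, dao):
        # Same trick as the attack: try to split again mid-call.
        if self.depth < self.max_depth:
            self.depth += 1
            dao.split(self)


fixed = FixedDAO(pool=1000, balances={"attacker": 10, "honest": 990})
caller = ReentrantCaller("attacker", max_depth=4)
fixed.split(caller)
print(fixed.child_dao["attacker"])  # 10: the re-entrant call finds nothing to take
```

The re-entrant call still happens, it just finds a balance of zero and returns. That is why the ordering fix works, and also why it is fragile: it relies on every function in every contract getting the ordering right.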

Whilst a lot of focus is going into 2) – I believe that more focus needs to be put into making the language “safer”. For me, this would involve basic changes like:

  1. Blocking recursion. Blocking recursion at compile time and runtime would be a big step forward. I struggle to see a downside to this, as in general recursion doesn’t scale, is slow and can always be replaced with iteration (see: Replace Recursion with Iteration).
  2. Sandbox parameterisation. Currently there are no sandbox parameters to sit the contracts in. This means the economic impact of these contracts is more empirical than deterministic. If you could abstract core parameters of the contract, like the amount, wallet addresses etc., and place these key values in an immutable wrapper, then unwanted outcomes would be harder to achieve.
  3. Transactionality. Currently there is no obvious mechanism to perform economic functions wrapped in an ATOMIC transaction. This ability would mean that economic values could be copied from the heap to the stack and moved as desired; but critically, recursive calls could not revisit the heap to essentially “duplicate” the value. It would also mean that if a benign fault occurred, the execution of the contract would be idempotent. Obviously blocking recursive calls goes some way towards this, but adding transactionality would be a material improvement.
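To show what points 1 and 3 might look like together, here is a rough Python sketch. It is not a proposal for Ethereum’s actual design; `GuardedLedger`, `ReentrancyError` and `atomic` are hypothetical names. The wrapper refuses nested entry (blocking recursion at runtime) and restores a snapshot of the balances if anything inside it fails (all-or-nothing execution).

```python
class ReentrancyError(Exception):
    """Raised when an operation tries to re-enter the ledger mid-transaction."""


class GuardedLedger:
    def __init__(self, balances):
        self.balances = dict(balances)
        self._in_transaction = False

    def atomic(self, operation):
        """Run operation(self) all-or-nothing and refuse nested entry."""
        if self._in_transaction:
            raise ReentrancyError("re-entrant call blocked")
        snapshot = dict(self.balances)
        self._in_transaction = True
        try:
            operation(self)
        except Exception:
            self.balances = snapshot  # roll back every partial change
            raise
        finally:
            self._in_transaction = False


def greedy_withdraw(ledger):
    # Takes the funds, then tries to come back for more mid-transaction.
    ledger.balances["attacker"] -= 10
    ledger.atomic(greedy_withdraw)


ledger = GuardedLedger({"attacker": 10, "honest": 990})
try:
    ledger.atomic(greedy_withdraw)
except ReentrancyError:
    pass
print(ledger.balances)  # {'attacker': 10, 'honest': 990}: the whole attempt was rolled back
```

The nested call is rejected, and because the rejection surfaces as a failure inside the outer transaction, the first withdrawal is undone as well. The contract author no longer has to get the statement ordering right in every function for the balances to stay consistent.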

I have other ideas including segregation of duties using API layers between core and contract code – but am still getting my head around Ethereum’s architecture.