This is (hopefully) a short blog that will give you back a small piece of your life…
In technology, we rightly spend hours poring over failure so that we can understand it, fix it and avoid it in the future. This seems a reasonable approach: learn from your mistakes, understand failure, plan your remediation and so on. But is it possible that there are some instances where doing this is inappropriate? To answer this simple question, let me give you an analogy…
You decide that you want to travel from London to New York. Sounds reasonable so far…. But you decide you want to go by car! The reasoning for this is as follows:
Cars are “tried and tested”.
We have an existing deal with multiple car suppliers and we get great discounts.
The key decision maker is a car enthusiast.
The incumbent team understand cars and can support this choice.
Cars are what we have available right now and we want to start execution tomorrow, so let's just make it work.
You first try a small hatchback and only manage to get around 3m off the coast of Scotland. Next up, you figure you will get a more durable car, so you get a truck – but sadly this only makes 2m headway from the beach. You report back to the team and they send you a brand new Porsche; this time you give yourself an even bigger run up at the sea and you manage a whopping 4m before the car sinks. The team now analyse all the data to figure out why each car sank and what they can do to make this better. The team continue to experiment with various cars and progress is observed over time. After 6 months the team has managed to travel 12m towards their goal of driving to New York. The main reason for the progress is that the sunken cars are starting to form a land bridge. The leadership has now spent over 200m USD on this venture and don't feel they can pivot, so they start to brainstorm how to make this work.
Maybe wind the windows up a little tighter, maybe the cars need more underseal, maybe over inflate the tyres or maybe we simply need way more cars? All of these may or may not make a difference. But here’s the challenge: you made a bad engineering choice and anything you do will just be a variant of bad. It will never be good and you cannot win with your choice.
The above obviously sounds a bit daft (and it is), but the point is that I am often called in after downtime to review an architecture, find a root cause and suggest remediation. What is not always understood is that bad technology choices can be as likely to succeed as driving from London to New York. Sometimes you simply need to look at alternatives – you need a boat or a plane. The product architecture can be terminal: it won't ever be what you want it to be, and no amount of analysis or spend will change this. The trick is to accept the brutal reality of your situation and move your focus towards choosing the technology that you need to transition to. Next, try to figure out how quickly you can do this pivot…
I have seen many organisations restructure their technology teams over and over, but whichever model they opt for, they never seem to get the desired results with respect to speed, resilience and quality. For this reason, organisations tend to oscillate between centralised teams, which are organised around skills and reuse, and federated teams, which are organised around products and time to market. This article examines the failure modes of both centralised and federated teams and then asks whether there are better alternatives.
Day 1: Centralisation
Centralised technology teams tend to create frustrated product teams, long backlogs and lots of prioritisation escalations. Quality is normally fairly consistent (it can be consistently bad or consistently good), but speed is generally considered problematic.
These central teams will often institute some kind of ticketing system to coordinate their activities. They will even use tickets to coordinate activities between their own, narrowly focused central teams. These teams will create reports that demonstrate “millions” of tickets have been closed each month, and this will be sold as some form of success / progress. Dependent product teams, on the other hand, will still struggle to get their workloads into production and will frequently escalate the bottlenecks created by the centralised approach.
Central teams tend to focus on reusability, creating large consolidated central services, and cost compression. Their architecture will tend to create massive “risk concentrators”, whereby the same infrastructure is reused for the entire organisation. Any upgrade to these central services tends to be a life-threatening event, making things like minor version changes and even patching extremely challenging. These central services will have poorly designed logical boundaries, which means that “bad” consumers of these shared services can create saturation outages that affect the entire enterprise. These teams will be comfortable with mainframes and large physical datacenters, and will have a poor culture of learning. The technology stack will be at least 20 years old and you will often hear the term “tried and tested”. They will view change as a bad thing and will create reports showing that change is causing outages. They will periodically suggest a slowdown, or freezes, to combat the great evil of “change”. There will be no attempt made to get better at delivering change and everything will be described as a “journey”. It will take years to get anything done in this world and the technology stack will be a legacy, expensive, immutable blob of tightly coupled vendor products.
Day 2: Let's Federate!
Eventually delivery pressure builds and the organisation capitulates into the chaotic world of federation. This is quickly followed by an explosion in headcount, as each product team attempts to reach the utopian state of “end to end autonomy”. End to end autonomy is the equivalent of absolute freedom – it simply does not and cannot exist. Why can't you have an absolute state of full autonomy? It turns out that unless you're a one-product startup, you will have to share certain services and channels with other products. This means that any single product's “autonomy”, expressed in any shared channel/framework, ends up becoming another product's “constraint”.
A great example of this is a client-facing channel, like an app or a web site. Imagine if you carved up a channel into little product-sized pieces. Imagine how hard it would be to support your clients – where do you route support queries? Even something basic, like trying to keep the channel available, would be difficult, as there is no single team stopping other teams from injecting failure planes and vendor SDKs into critical shared areas. In this world, each product does what it wants, when it wants, how it wants, and the channel ends up yielding an inconsistent, frustrating and unstable customer experience. No product team will ever get around to dealing with complex issues like fraud, cyber security or even basics like observability. Instead they will naturally chase PnL. In the end, you will have to resort to using social media to field complaints and resolve issues. Game theory describes this as the “Tragedy of the Commons” – it's these common assets that die in the federated world.
In the federated world, the lack of scope for staff results in ballooning headcount, and in aggregating roles across multiple disciplines for the staff you have managed to hire. Highly skilled staff tend to get very bored with their “golden cages” and search out more challenging roles at other companies.
You will see lots of key man risk in this world. Because product teams can never fully fund their end to end autonomy, you will commonly see a single individual who looks after networking, DBA, storage and cyber. When this person eventually resigns, the risks from their undocumented “tactical fixes” quickly start to materialise as a flow of outages takes grip. You will also struggle to hire highly skilled resources into this model, as the scope of the roles you advertise is restrictively narrow, eg a DBA to look after a single database, or a senior UX person to look after 3 screens. Obviously, if you manage to hire a senior UX person, you can then show them the 2 databases you also want them to manage 😂
If not this, then what?
Is the issue that we didn't try hard enough? Maybe we should have given the model more time to bear fruit? Maybe we didn't get the team's buy-in? So what am I saying? I am saying BOTH federated and centralised models will eventually fail, because they are extremes. These are not the only choices on the table; there are a vast number of states in between, depending on your architecture, the size of your organisation and the pools of skills you have.
Before you start tinkering with your organisation's structure, it's key that you agree on its purpose. Specifically – what are you trying to do, and how are you trying to do it? Centralists will argue that economies of scale and better quality are key, while federation proponents will point to time to market and speed. So how do you design your organisation?
There are two main parameters that you should try to optimise for:
Domain Optimisation: Design your structure around people and skills (from the centralised model). Give your staff enough domain / scope to be able to solve complex problems and add value across multiple products in your enterprise. The benefit of teams with wide domains is that you can put your best resources on your biggest problems. But watch out, because as the domain of each team increases, so will the dependencies on that team.
Dependency Optimisation: Design your structure around flow/output by removing dependencies and enabling self-service (from the federated model). Put simply, try to pay down dependencies by changing your architecture to enable self-service, such that product teams can execute quickly whilst benefiting from high-quality, reusable building blocks.
These two parameters are antagonistic, with your underlying architecture being the lever to change the gradient of the yield.
Your company cannot be successful if you narrow the scope of your roles down to a single product. Senior, skilled engineers need domain, and companies need to make sure that complicated problems flow to those who can best solve them. Having myopically scoped roles not only balloons your headcount, it also means that your best staff might be sat on your easiest problems. More than this, how do you practically hire the various disciplines you need, if your scope is restricted to a single product that might only occasionally use a fraction of your skills?
You need to give staff from a skill / discipline enough scope to make sure they are stretched and that you're getting value from them. If this domain creates a bottleneck, then you should look to fracture these pools of skills by creating multiple layers: one team close to operational workloads (to reduce dependencies) and a second team looking at more strategic problems. For example, you can have a UX team look after a single channel, but also have a strategic UX team look after longer-dated / complex UX challenges (like peer analysis, telemetry insights, redesigns etc).
As we already discussed, end to end autonomy is a bogus construct. But teams should absolutely look to shed as many dependencies as possible, so that they can yield flow without begging other teams to do their jobs. There are two ways of reducing dependency:
Reduce the scope of a role and look at creating multiple pools of skills with different scopes.
Change your technology architecture.
Typically only item 1) is considered, and this is the crux of this article. Performing periodic org structure rewrites simply gives you a new poison to swallow. This is great, maybe you like strawberry flavoured arsenic! But my question is, why not stop taking poison altogether?
If you look at the anecdotal graph below, you can see the relationship between domain / scope and dependency. The graph shows that as you reduce domain you reduce dependency. Put simply, the lower your dependencies, the more “federated” your organisation; the more domain your staff have, the more “centralised” your organisation is.
What you will also observe is that poorly architected systems exhibit a “dependency cliff”. This means that even as you reduce the scope of your roles, you will not see any dependency benefit. Your systems are so tightly coupled that no amount of org structure gymnastics will give you lower dependencies. If you attempt to carve up shared systems that are exhibiting a dependency cliff, you will have to hire more staff, and you will get more outages, less output and more escalations.
To resolve any dependency cliffs, you have a few choices:
De-aggregate/re-architect workloads. If all your products sit in a single core banking platform, then do NOT buy a new core banking platform. Instead, rearchitect these products to separate the shared services (eg ledger) from the product services (eg home loans). This is a complex topic and needs a detailed debate.
Optimise workloads. Acknowledge that a channel or platform can be a product in its own right and ensure that most of the changes product teams want to make on a channel / platform can be automated. Create product specific pipelines, create product enclaves (eg PWA), allow the product teams the ability to store and update state on the channel without having to go through testing release cycles.
Ensure any central services are opensourced. This will enable product teams to contribute changes and not feel “captive” to the cadence of central teams.
Deliver all services with REST APIs to ensure all shared services can be self-service.
There is no easy win when it comes to org structure, because it's typically your architecture that drives all your issues. So shuffling people around from one line manager to another will achieve very little. If you want to be successful, you will need to look at each service and product in detail and try to remove dependencies by making architectural changes, such that product teams can self-serve wherever possible. When you remove architectural constraints, you steepen the gradient of the line and give your staff broad domains without adding dependencies and bottlenecks.
I am done with this article. I really thought it was going to be quick to write and I have run out of energy. I will probably do another push on this in a few weeks. Please DM me with spelling mistakes or anything that doesn't make sense.
I think you're a genius! You found this blog and you're reading it – what more evidence do I need?! So why do you keep asking others to think for you?
There is a harmful bias built into most technology projects that assumes “the customer knows best”, and this is simply a lie. The customer will know what works and what doesn't when you give them a product; but that's not the same as being able to provide specifications/requirements. Sadly, somehow technologists have been relegated to order takers who are unable to make decisions or move forwards without detailed requirements. I disagree.
In general, everyone (including technologists) should fixate on understanding customers, collaborating across all disciplines, testing ideas with customers, making decisions and executing. If you get it wrong: learn, get feedback, fix issues, then rinse and repeat. If you are going through a one-way door or making a big call, then by all means validate. But don't forget that you're a genius and you work with other geniuses. So stop asking for requirements, switch your brain on and show off your unfiltered genius. You may even meet requirements that your customers haven't even dreamt of!
Many corporate technology teams are unable to operate without an analyst to gather, collate and serve up pages of requirements. This learnt helplessness is problematic. There are definitely times, especially on complex projects, where analysts working together with technologists can create more focus and speed up product development. But there is also a balance to be found: technology teams should feel confident to ideate solutions themselves.
Finally, one of the biggest causes of large delays on technology workstreams is the lack of challenge around requirements. If your customer wants an edge-case feature that's extremely difficult to do, then you should consider delaying it or even not doing it. Try to find a way around complex requirements, develop other features, or evolve the feature into something that is deliverable. Never get bogged down on a requirement that will sink your project. You should always have way more features than you can ever deliver, so if you deliver everything your customer wanted there is an argument that this is wasteful and indulgent. You will also be constantly disappointed when your customers change their minds!
Bonuscide is a term used to describe incentive schemes that progressively poison an organisation by ensuring the flow of discretionary pay does not serve the organisation's goals. These schemes can be observed in two main ways: the loss of key staff, or the reduction in the client/customer base.
Bonuscide becomes more observable during a major crisis (for example COVID-19). Companies that practise it create self-harm by amplifying the fiscal impact of the crisis on a specific population of staff who are key to the company's success. For example, legacy organisations will tend to target skills that the board or exco don't understand, disproportionately targeting their technology teams whilst protecting their many layers of management.
The kinds of symptoms that will be visible are listed below:
Rolling Downside Metrics: A metric will be used to reduce the discretionary pay pool, even though that metric was never previously used as an upside metric.
Pivot Upside Metrics: If the financial measure chosen above improves in the future, a new/alternative unfavourable financial measure will be substituted.
Status Quo: Discretionary pay will always favour the preservation of the status quo. Incentives will never flow to those involved in execution or change, because these companies are governed by Pournelle's Iron Law of Bureaucracy.
Panic Pay: Companies that practise bonuscide are periodically forced to roll out poorly-thought-through emergency incentives to their residual staff. This creates a negative selection process (whereby they lock in the tail performers after losing their top talent).
Trust Vacuum: Leaders involved in managing this pay process will feel compromised, as they know that the trusted relationship with their teams will be indefinitely tainted.
Business Case: The savings generated by the reduced discretionary compensation will be a small fraction of the additional costs and revenue impact that the exercise creates. This phenomenon is well covered in my previous post on Constraint Theory.
Put simply, if a business case was created for this exercise, it wouldn’t see the light of day. The end result of bonuscide is the creation of a corporate trust / talent vacuum that leads to significant long term harm and brand damage.
A single tenancy datacenter is a fixed-scale, fixed-price service on a closed network. The costs of the resources in the datacenter are divided up and shared out to the enterprise constituents on a semi-random basis. If anyone uses fewer resources than the forecast, this generates waste, which is shared back to the enterprise. If there is more demand than forecast, it will generate service degradation, panic or an outage! This model is clearly fragile and doesn't respond quickly to change; it is also wasteful, as it requires a level of overprovisioning based on forecast consumption (otherwise you will experience delays in projects, service degradation or reduced resilience).
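As a rough illustration of the paragraph above, here is a minimal sketch (all numbers are hypothetical) of how forecast error in a fixed-capacity datacenter turns straight into shared waste:

```python
# Illustrative sketch with made-up numbers: a datacenter is sized to a
# forecast, and any capacity that goes unused is still paid for and
# shared back to the enterprise as waste.
def datacenter_waste(provisioned_cores, used_cores, cost_per_core):
    """Cost of paid-for capacity that nobody consumed."""
    unused = max(provisioned_cores - used_cores, 0)
    return unused * cost_per_core

# Forecast said 10,000 cores; teams actually used 6,500.
waste = datacenter_waste(10_000, 6_500, cost_per_core=1_000)
print(waste)  # prints 3500000 - 3,500 idle cores, paid for in full
```

Under-forecasting is the mirror image: `datacenter_waste` returns zero, but that is exactly the degradation/panic/outage case, because the missing capacity cannot be bought back quickly.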
Cloud, on the other hand, is a multi-tenanted, on-demand software service which you pay for as you use it. But surely having multiple tenants running on the same fixed capacity actually increases the risks? And just because it's in the cloud doesn't mean you can get away without overprovisioning – so who sits with the overprovisioned costs? The cloud providers have to build this into their rates. So cloud providers have to manage a balance sheet of fixed capacity, shared amongst customers running on-demand infrastructure. They do this with very clever forecasting, very short provisioning cycles, and by asking their customers for forecasts and then offering discounts for pre-commits.
Anything that moves you back towards managing resource levels / forecasting will destroy a huge portion of the value of moving to the cloud in the first place. For example, if you have ever been to a re:Invent you will be floored by the rate of innovation and also by how easy it is to absorb these new innovative products. But wait – you just signed a 5yr cost commit and now you learn about Aurora's new serverless database model. You realise that you can save millions of dollars; but you have to wait for your 5yr commits to expire before you adopt it – or maybe start mining bitcoin with all your excess commits! This is anti-innovation and anti-customer.
What's even worse is that pre-commits are typically signed up front on day 1 – this is total madness!!! At the point where you know nothing about your brave new world, you use the old costs as a proxy to predict the new costs, so that you can squeeze out a lousy 5% saving at the risk of 100% of the commit size! What you will learn is that your cloud success is NOT based on the commercial contract that you sign with your cloud provider; it's actually based on the quality of the engineering talent that your organisation is able to attract. Cloud is an IP war – it's not a legal/sourcing war. Allow yourself to learn; don't box yourself in on day 1. When you sign the pre-commit you will notice that your first-year utilisation projections are actually tiny, and therefore the savings are small. So what's the point of signing so early, when the risk is at a maximum and the gains are at a minimum? When you sign this deal you are essentially turning the cloud into a “financial data center” – you have destroyed the cloud before you even started!
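To make the asymmetry concrete, here is a hedged back-of-envelope sketch (the commit size, discount and utilisation are all made-up numbers) comparing a small pre-commit discount against the cost of under-using the commit:

```python
# Hypothetical numbers: a pre-commit discount only applies to what you
# actually consume, while the unused share of the commit is paid for
# nothing (e.g. a cheaper serverless option appeared mid-term).
def precommit_outcome(commit_usd, discount, utilisation):
    """Net cost vs pure on-demand. Positive = the commit LOST you money."""
    on_demand = commit_usd * utilisation      # pay only for what you use
    committed = commit_usd * (1 - discount)   # pay for all of it, discounted
    return round(committed - on_demand, 2)

# A 5% discount on a $10m commit, but you only consume 80% of it:
print(precommit_outcome(10_000_000, 0.05, 0.80))
# prints 1500000.0 - the "saving" cost $1.5m vs staying on-demand
```

The break-even is utilisation of (1 − discount): below 95% consumption here, the commit is a loss, which is why the downside dwarfs the headline saving.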
A Lesson from the Field – Solving a Hadoop Compute Demand Spike:
We moved 7000 cores of burst compute to AWS to solve a capacity issue on premise. That's expensive, so let's “fix the costs”! We could go and sign an RI (reserved instance), play with spot, buy savings plans or even beg / barter for some EDP relief. But instead we plugged the service usage into QuickSight and analysed the queries. We found one query was using 60 percent of the entire bank's compute! Nobody confessed to owning the query, so we just disabled it (if you need a reason for your change management, describe the change as “disabling a financial DDOS”). We quickly found the service owner and explained that running a table scan across billions of rows to return a report with just last month's data is not a good idea. We also explained that if they didn't fix this we would start billing them in 6 weeks' time (a few million dollars). The team deployed a fix and now we run the bank's big data stack at half the cost – just by tuning one query!!!
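The analysis itself is simple once the usage data is in front of you. This is not the actual QuickSight work, just a sketch of the idea with hypothetical log fields: aggregate compute per query fingerprint and rank, before reaching for RIs, spot or savings plans.

```python
# Hedged sketch: find the dominant offender in any engine's query log.
# Field names ("fingerprint", "core_hours") are invented for illustration.
from collections import defaultdict

def top_offenders(query_log, n=3):
    """Sum compute per query pattern and return the top n with their share."""
    usage = defaultdict(float)
    for entry in query_log:
        usage[entry["fingerprint"]] += entry["core_hours"]
    total = sum(usage.values()) or 1.0
    ranked = sorted(usage.items(), key=lambda kv: kv[1], reverse=True)
    return [(fp, hours, hours / total) for fp, hours in ranked[:n]]

log = [
    {"fingerprint": "full-table-scan-report", "core_hours": 4200.0},
    {"fingerprint": "daily-etl", "core_hours": 1800.0},
    {"fingerprint": "adhoc-select", "core_hours": 1000.0},
]
print(top_offenders(log, 1))  # one query holding ~60% of all compute
```

A ranking like this is what turns "our cloud bill is too high" into "this one query is the bill".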
So the point of the above is that there is no substitute for engineering excellence. You have to understand and engineer the cloud to win; you cannot contract yourself into the cloud. The more contracts you sign, the more failures you will experience. This leads me to point 2…
Step 2: Training, Training, Training
Start the biggest training campaign you possibly can – make this your crusade. Train everyone: business, finance, security, infrastructure – you name it, you train it. Don't limit what anyone can train on; training is cheap – feast as much as you can. Look at Udemy, ACloudGuru, YouTube, WhizLabs etc. If you get this wrong, you will find your organisation fills up with expensive consultants and bespoke migration products that you don't need and can easily do yourself, via opensource or with your cloud provider's toolsets. In fact I would go one step further – if you're not prepared to learn about the cloud, you're not ready to go there.
Step 3: The OS Build
When you do start your cloud migration and begin to review your base OS images, go right back to the very beginning and remove every single product from these base builds. Look at what you can get out of the box from your cloud provider and really push yourself hard on what you really need vs what is merely nice to have. The trick is that to get the real benefit from a cloud migration, you have to start by making your builds as “naked” as possible. Nothing should move into the base build without a good reason. Ownership and reporting lines are not a good enough reason for someone's special “tool” to make it into the build. This process, if done correctly, should deliver between 20-40% of your cloud migration savings. Do it badly and your costs, complexity and support will all head in the wrong direction.
Security HAS to be a first-class citizen of your new world. In most organisations this will likely make for some awkward cultural collisions (control and ownership vs agility) and some difficult dialogues. The cloud, by definition, should be liberating – so how do you secure it without creating a “cloud bunker” that nobody can actually use? More on this later… 🙂
Step 4: Hybrid Networking
For any organisation with data centers – make no mistake, if you get this wrong it's over before it starts.
In technology, there is a tendency to solve a problem badly using gross simplification, then come up with a catchy one-liner and broadcast it as doctrine or principle. Nothing ticks more boxes in this regard than the principle of least privilege. The ensuing enterprise-scale deadlocks created by a crippling implementation of least privilege are almost certainly lost on its evangelists. This blog will try to put an end to the slavish efforts of the many security teams that are trying to ration out micro-permissions and hope the digital revolution can fit into some break-glass approval process.
What is this “Least Privilege” thing? Why does it exist? What are the alternatives? Wikipedia gives you a good overview of it here. The first line contains an obvious and glaring issue: “The principle means giving a user account or process only those privileges which are essential to perform its intended function”. Here the principle is being applied equally to users and processes/code, and it states that we should only grant privileges that are essential. In other words, we should treat human beings and code as the same thing, and we should only give humans “essential” permissions. Firstly, who on earth figures out where that bar for essential sits, and how do they ascertain what is and is not essential? Do you really need to use storage? Do you really need an API? If I give you an API, do you need Puts and Gets?
Human beings are NOT deterministic. If I have a team of humans that can operate under the principle of least privilege, then I don't need them in the first place – I can simply replace them with some AI/RPA. Imagine the brutal pain of a break-glass activity every time someone needed to do something “unexpected”. “Hi boss, I need to use the bathroom on the 1st floor – can you approve this? <Gulp> Boss, you took too long… I no longer need your approval!”. Applying least privilege to code would seem to make some sense; BUT only if you never updated the code – and if you did update the code, you would need 100% test coverage.
So why did some bright spark want to duct-tape the world to such a brittle, pain-yielding principle? At the heart of this are three issues: identity, immutability and trust. If there are other ways to solve these issues, then we don't need the pain and risks of trying to implement something that will never actually work, that creates friction and that, critically, creates a false sense of security. Least privilege will never save anyone; you will just be told that if you could have performed this security miracle then you would have been fine. But you cannot, and so you are not.
What's interesting to me is that the least privilege lie is so widely ignored. For example, just think about how we implement user access. If we truly believed in least privilege, then every user would have a unique set of privileges assigned to them. Instead, because we acknowledge this is burdensome, we approximate the privileges that a user will need using policies which we attach to groups. The moment we add a user to one of these groups, we are approximating their required privileges and starting to become overly permissive.
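A tiny sketch of that approximation (the user names and action strings are invented for illustration): a shared group grant has to cover everyone's needs, so it ends up being the union of those needs, and every member is over-permissioned by construction.

```python
# Illustrative sketch: group-based access is the union of all members'
# needs, so each member holds privileges they personally never use.
def excess_privileges(group_grant, user_needs):
    """Per-user privileges that were granted via the group but not needed."""
    return {user: group_grant - needs for user, needs in user_needs.items()}

# Hypothetical users and action names (loosely IAM-flavoured):
user_needs = {
    "ana":   {"s3:Get"},
    "ben":   {"s3:Get", "s3:Put"},
    "chris": {"kms:Decrypt"},
}
# The group policy must cover everyone, so it is the union of all needs:
group_grant = set().union(*user_needs.values())

print(excess_privileges(group_grant, user_needs))
# ana ends up holding s3:Put and kms:Decrypt she never uses, and so on
```

The more users share a group, the larger the union grows relative to any one member's needs, which is exactly the drift from "least" to "overly permissive" the paragraph describes.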
Let's be clear with each other: anyone trying to implement least privilege is living a lie. The extent of the lie normally only becomes clear after the event. So this blog post is designed to re-point energy towards sustainable alternatives that work, and additionally to remove the need for the myriad of micro-permissive handbrakes (that routinely get switched off to debug outages and issues).
Who are you?
This is the biggest issue and still remains the largest risk in technology today. If I don't know who you are, then I really, really want to limit what you can do. Experiencing a root/super user account takeover is a doomsday scenario for any organisation. So let's limit the blast zone of these accounts, right?
This applies equally to code and humans. For code, this problem was solved a long time ago, and if you look
In most large corporates, technology will typically report into either finance or operations. This means that it will tend to be subject to cultural inheritance, which is not always a good thing. One example of where the cultural default should be challenged is when managing IP duplication. In finance or operations, duplication rarely yields any benefits and will often result in unnecessary costs and/or inconsistent customer experiences. Because of this, technology teams will tend to be asked to centrally analyse all incoming workstreams for convergence opportunities. If any seemingly overlapping effort is discovered, it will typically be extracted into a central, “do it once” team. Experienced technologists will likely remark that the analysis process is generally very slow, the overlaps are small, the cost of extracting them is high, additional complexity is introduced, backlogs become unmanageable, testing the consolidated “swiss army knife” product is problematic and, critically, the teams are reduced to crawling speed as they try to transport context and requirements to the central delivery team. I have called the above process “Triplication”, simply because it creates more waste and costs more than duplication ever could (and also because my finance colleagues seem to connect with the term).
The article below attempts to explain why we fear duplication and why slavishly trying to remove all duplication is a mistake. Having said this, a purely federated model, or an abundant-resource model with no collaboration, leads to similarly chronic issues (I will write an article about “Federated Strangulation” shortly).
The Three Big Corporate Fears
The fear of doing something badly.
The fear of doing something twice (duplication).
The fear of doing nothing at all.
In my experience, most corporates focus on fears 1) and 2). They will typically focus on layers of governance, contractual bindings, interlocks and magic metric tracking (SLA, OLA, KPI etc). The governance is typically multi-layered, with each forum meeting infrequently and ingesting the data in a unique format (no sense in not duplicating the governance overhead, right?!). As a result, these large corporates typically achieve fear 3) – they do nothing at all.
Most start-ups/tech companies worry almost exclusively about 3) – and as a result they achieve a bit of 1) and 2). Control is highly federated, decision trees are short, and teams are self-empowered and self-organising. Dead ends are found quickly, and bad ideas are cancelled or remediated as the work progresses. Given my rather biased narrative above, it won’t be a surprise to learn that I believe 3) is the greatest of all evils. To allow yourself to be overtaken, to watch a race that you should be running, is the most extreme form of failure.
For me, managed duplication can be a positive thing. But the key is that you have to manage it properly. You will often see divergence and consolidation in equal measure as the various work streams mature. The key to managing duplication is to enforce scarcity of resources and collaboration. Additionally, you may find that a decentralised team becomes conflicted when it is asked to manage multiple business units’ interests. This is actually success! It means the team has created something that has been virally absorbed by other parts of the business – it means you have created something that’s actually good! When this happens, look at your contribution options; sometimes it may make sense to split the product team into several business-facing teams and a core platform engineering team. If, however, there is no collaboration and an abundance of resources is thrown at all problems, you end up with material and avoidable waste. Additionally, observe exactly what you’re duplicating – never duplicate a commodity and never federate data. You also need to avoid a snowflake culture and make sure that, where it makes sense, you are trying to share.
Triplication happens when two or more products are misunderstood to be “similar” and an attempt is then made to fuse them together. The over-aggregation of your product development streams will yield most of the following:
1) Cripplingly slow and expensive to develop.
2) Risk concentration/instability. Every release will cause trauma to multiple customer bases.
3) Unsupportable. It will take you days to work out what went wrong and how on earth you can fix the issue as you will suffer from Quantum Entanglement.
4) Untestable. The complexity of the product will guarantee each release causes distress.
5) Low grade client experience.
Initially these problems will be described as “teething problems”. After a while it becomes clearer that the problem is not fixing itself. Next you will likely start the “stability” projects. A year or so later, after the next pile of cash is burnt, there will be a realisation that this is as good as it gets. At this point, senior managers start to see the writing on the wall and will quickly distance themselves from the product. Luckily for them, nobody will likely remember exactly who in the many approval forums thought this was a good idea in the first place. Next the product starts to get linked to the term “legacy”. The final chapter for this violation of common sense is the multi-year decommissioning process. BUT – it’s highly likely that the strategic replacement contains the exact same flaws as the legacy product…
To conclude, I created the term “Triplication” because I needed a way to succinctly explain that things can get worse when you lump them together without a good understanding of why you’re doing this. I needed a way to help challenge statements like, “you have to be able to extract efficiencies if you just lump all your teams together”. This thinking is equivalent to saying: “hey, I have a great idea…! We ALL like music, right?? So let’s save money – let’s go buy a single CD for all of us!”
The reality for those that have played out the triplication scenario in real life is that you will see costs balloon, progress grind to a halt, revenues fall off a cliff, and the final step in the debacle is usually a loss of trust – followed by the inevitable outsourcing pill. On the other hand, collaboration, scarcity, lean, quick MVPs, shared learning, cloud, open source, common rails and internal mobility are the friends of fast deliverables, customer satisfaction and yes – low costs!
The cloud is hot…. not just a little hot, but smokin’ hot!! Covid is messing with the economy, customers are battling financially, the macroeconomic outlook is problematic, vendor costs are high and climbing and security needs more investment every year. What on earth do we do??!! I know…. let’s start a crusade – let’s go to the cloud!!!!
Cloud used to be just for the cool kids, the start-ups, the hipsters… but not anymore; now the corporates are coming, and they are coming in their droves. The cloud transformation conversation is playing out globally across almost all sectors, from healthcare to pharmaceuticals and finance. The hype and urban legends around public cloud are creating a lot of FOMO.
For finance teams under severe cost pressures, the cloud has to be an obvious place to seek out some much needed pain relief. CIOs are giving glorious on-stage testimonials, declaring victory after having gone live with their first “bot in the cloud”. So what is there to blog about – it’s all wonderful, right…? Maybe not…
Imagine you’re a CIO or CTO. You haven’t cut code for a while, or maybe you have a finance background. Anyway, your architecture skills are a bit rusty/vacant, you have been outsourcing technology work for years, you are awash with vendor products, all the integration points are “custom” (aka arc welded) and hence your stack is very fragile. In fact it’s so fragile you can trigger outages when someone closes your datacentre door a little too hard! Your technology teams all have low/zero cloud knowledge and now you have been asked to transform your organisation by shipping it off to the cloud… So what do you do???
Lots of organisations believe this challenge is simply a case of finding the cheapest cloud provider, writing a legal document and some SLAs, finding a vendor who can whiz your servers into the cloud – and then you simply cut a cheque. But the truth is the cloud requires IP, and if you don’t have IP (aka engineers) then you have a problem…
Plan A: Project Borg
This is an easy problem – right? Just ask the AWS borg to assimilate you!!! The “Borg” strategy can be achieved as follows:
Install some software agents in your datacentres to come up with a total thumb suck of how much you think you will spend in the cloud. Note: your lack of any real understanding of how the cloud works should not ring any warning bells.
Factor down this thumb suck using another made up / arbitrary “risk factor”.
Next, sign an intergalactic cloud commit with your cloud provider of choice and try to squeeze more than a 10px discount out for taking this enormous risk.
Finally, pick up the phone to one of the big 5 consultancies and get them to “assimilate” you into the cloud (using some tool to perform a bitwise copy of your servers into the cloud).
Before you know it you’re peppering your board and excos with those ghastly cloud packs, you are sending out group-wide emails with pictures of clouds on them, and you are telling your teams to become “cloud ready”. What’s worse, you’re burning serious money as the consultancy team you called in did the usual land and expand. But you can’t seem to get a sense of any meaningful progress (and no, a bot in the cloud doesn’t count as progress).
To fund this new cloud expense line you have to start strangling your existing production spend: maybe you run your servers for an extra year or two, squeeze the network spend, keep those storage arrays for just a little while longer. But don’t worry, before you know it you will be in the cloud – right??
The Problem Statement
The problem is that public cloud was never about physically rehousing your iffy datacentre software with someone else; it was supposed to be about the transformation of this software. The legacy software in your datacentre is almost certainly poisonous and its interdependencies will be as lethal as they are opaque. If you move it, pain will follow and you won’t see any real commercial benefits for years.
Put another way, your datacentre is the technical equivalent of a swamp. Luckily, those lovely cloud people have built you a nice clean swimming pool. BUT don’t go and pump your swamp into this new swimming pool!
Crusades have never given us rational outcomes. You forgot to imagine where the customer was in this painful sideways move – what exactly did you want from this? In fact, cloud crusades suffer from a list of oversights, weaknesses and risks:
Actual digital “transformation” will take years to realise (if ever). All you did was change your hosting and how you pay for technology – nothing else actually changed.
Your customer value proposition will be totally unchanged, sadly you are still as digital as a fax machine!
Key infrastructure teams will start realising there is no future for them and start wandering off, creating even more instability.
Stability will be problematic as your hybrid network has created a BGP birds nest.
Your company signed a 5 year cloud commit. You took your current tech spend, halved it and then asked your cloud provider to give you discounts on this projected spend. You will likely see around a 10px-15px reduction in your EDP (enterprise discount program) rates, and for this you are taking ENORMOUS downside risks. You’re also accidentally discouraging efficient utilisation of resources in favour of a culture of “ram it in the cloud and review it once our EDP period expires”.
Your balance sheet will balloon, such that you will end up with a cost base not dissimilar to NASA’s, you will need a PhD to diagnose issues and your delivery cadence will be close to zero. Additionally, you will need to create an impairment factory to deal with all your stranded assets.
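To make the commit risk above concrete, here is a minimal sketch of the asymmetry. All the numbers are invented for illustration, and the model is deliberately simplified (real EDP terms vary by provider and contract – this is an assumption, not any provider’s actual billing logic):

```python
# Hypothetical model of a "use it or lose it" cloud spend commit.
# Assumption: you pay the discounted rate on actual usage, but never
# less than the discounted committed amount. Real contracts differ.

def commit_outcome(annual_commit: float, actual_usage: float, discount: float) -> float:
    """Return what you actually pay for one year of an enterprise commit."""
    billed = actual_usage * (1 - discount)
    floor = annual_commit * (1 - discount)  # unused commit is money burnt
    return max(billed, floor)

# The team halved current tech spend to guess a 10m USD/year commit and
# negotiated a 12% discount -- but real usage lands at only 6m USD/year.
paid = commit_outcome(10_000_000, 6_000_000, 0.12)   # 8.8m USD
# You pay 8.8m for 6m of usage: the "12% discount" is actually a ~47% premium.
```

If usage overshoots the commit instead, you simply pay the discounted rate on the higher number – the provider wins either way, which is why the discount on offer is so modest relative to the risk you are carrying.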
So what does this approach actually achieve? You will have added a ton of intangible assets by balance-sheeting a bunch of professional fees, you will likely be less stable and maybe even less secure (more on this later), and you know that this is an unsustainable project – the equivalent of an organisational heart transplant. The only people that now understand your organisation are a team of well paid consultants on a 5x salary multiple, and sadly you cannot stop this process – you have to keep paying and praying. Put simply, cloud mass migration (aka assimilation) is a bad idea – so don’t do it!
The key here is that your tech teams have to transform themselves. Nothing can act on them; the transformation has to come from within. When you review organisations that have been around for a while – perhaps a few mergers, high vendor dependencies and low technology skills – you will tend to find the combined/systemic complexity suffering from something similar to Quantum Entanglement. We then ask an external agency to unpack this unobservable, irreducible complexity with a suite of tools, and get expensive external forces to reverse engineer these entangled systems and recreate them somewhere else. This is not reasonable or rational – it’s daft and we should stop doing this.
If not this, then what?
The “then what” bit is even longer than the “not this” bit. So I am posting this as is, and if I get 100 hits I will write up the other way – little by little 🙂
Click here to read the work in progress link on another approach to scaling cloud usage…
There are two fundamental ways to run technology inside your company (and various states in-between)
Before we get started, what lies beneath is actually a rant. It’s not even a well disguised rant, and the reason for this is twofold. Firstly, I don’t expect many people to read this, so the return on the effort required to dress it up would be low. Secondly, I wrote this for my own benefit, to relieve frustration – nothing to do with trying to educate or some greater good thingy. So if you continue to read this article, you are essentially paying my “frustration tax”… So thanks for that 🙂
Large corporates will tend to confuse a bad culture with poor performing staff. My experience is that a lot of execution issues centre on cultural issues.
There are broadly two fundamental models of organising tech inside a corporate (and various states in-between). To assess which world you live in, I have described the general characteristics of the two models below:
Model 1

Your technology leadership/technology exco consists of auditors, CFOs, HR, risk management and generic management. They have never, will never and can never make a technology decision, as there is no understanding of technology. To solve for this, the leadership simply syndicate every single decision out to the widest possible group. Decisions take ages, as most of these groups also don’t understand technology. When an eventual decision is made there can be no accountability, as it’s unclear who made the call. By default they will always favour the status quo, they are extremely uncomfortable with all forms of change, they are fragile leaders and will focus a lot of effort on blame offloading. They still believe you can’t get fired for buying IBM (no disrespect to IBM).
You will spend a number of months collating voluminous documents containing all possible requirements for your products over the next n years. This is often mistaken to be a “strategy document”.
You believe in developing technology away from your customer in “the center”. One day you feel you will “extract synergies” and “economies of scale” if you do this.
You believe each business head should have a “single face off” into technology. This “engagement manager” (aka human inbox) attempts to use “influence” by routing escalation emails through the organisation’s hierarchy to try to generate outcomes.
Your central functions haven’t quite delivered yet, cost a small fortune and have a top-heavy management structure, BUT will “definitely deliver next year”. These functions manage using “gates and governance” and often send out new multi-page “Target Operating Model” documents which nobody reads/understands.
You believe you can fix technology with manually captured metrics like SLAs, OLAs, KPIs etc etc. You have hundreds of these measures and you pay a small army to collate these metrics each month. These measures always come back as “Green”, but you are still unhappy with the performance of your technology teams. You feel that if you could just come up with that “killer metric” that would show how bad technology really is; you could then finally sign an outsourcing deal to “fix it”.
You ask for documents from technology showing exactly what will be delivered over the next 3-5 years. You believe in these documents and feel that they are valuable. Historically they have never been close to accurate. You will likely see an item in 2025 which states “Digital Onboarding of Customers” – around the same time that the rest of the world will be enjoying the benefits of teleportation!
You hire an army of people to track that technology is “delivering on time”; it is not delivering on time and you still don’t know why. To scale output you hire more people to track why technology is not delivering on time.
You are constantly underwhelmed with what eventually gets delivered. It doesn’t do anything you really wanted it to and your customer feedback is lackluster at best.
There are clear lines of distinction between technology and the business. Phrases like “I could make my budget… if I could only get some delivery from tech” are commonplace.
The products delivered have all the charisma and charm of a road traffic accident.
You believe, if you can just get the governance right and capture the requirements, then you will get the product you wanted.
Your engagements with technology will take the form of “I need EVERYTHING from product X delivered by date Y. How much will it cost?”. When you get the reply a few months later, you will ALWAYS reply “that’s crazy!”. You will then cycle requirements down until you get beneath a “magic” number (that’s unknown to technology). This process can take years.
Your teams spend more time in meetings or delivering static documents, than delivering business value.
You layer governance meetings with various disconnected management teams, all with competing interests. This essentially turns an approval process into a multi-week/month attrition circus. The participants of this process have to explain and re-explain the exact same business case in multiple formats and leave the process with zero added insights, but minus a chunk of morale.
If you ever looked at any documents signed related to technology spend, there are likely to be dozens of signatures on the documents and nobody quite recalls why they signed the document and who owns the actual decision for the spend.
You direct as much technology spend as possible through a project governance process. This gives the illusion of a highly variable tech spend, focused on strategic investments. Your aggregate technology spend is nothing short of intergalactic. Funding abruptly expires at the end of each project and the resources promptly vanish. Most of your products are immature/half finished, you have a high number of (expensive) contractors, you leak IP, you love NDAs, nobody knows how to maintain or improve the products in production (as the contractors that delivered them have all left) and if you ever tried to reduce the SI spend, the “world would end”.
You can’t quite understand why everything in technology costs 10x what it would if you gave it to a start up.
Technology is a grudge purchase, and internal technology spend is an absolute last resort. Most of the people in your technology teams have never written a line of code. You urge your CIO to sign outsourcing documents with a “low cost location”. You believe this will give you a 20px reduction in costs over 5 yrs. All the previous attempts to do this have failed. Next time will be different as this time you will capture the correct SLAs to fix technology. You use expressions like “we are a bank, not a technology company”.
You believe all the value is created in the selling of the digital artefact, not in the creation of the artefact itself. Your solution to poorly designed artefacts that are hard to sell is to hire more sales staff.
You fear open source and prefer to use multimillion dollar vendors for basics like middleware (please see Spring), business process management and databases. You build everything in your own datacentre because you believe “it’s a regulatory requirement”. It takes you several months to deliver hardware or engage with your vendors.
You actively promote monoculture. You believe that “do it once, everywhere, for everything, forever” is a meaningful sentence. You install single instances of monolithic services that everyone is forced to use. These services are not automated and so generate hugely painful backlogs, both for maintenance and for adding new services. You can never upgrade these monoliths without asking everyone to “stop what you’re doing and retest everything”. Teams spend a chunk of their time trying to engineer ways of “hacking” these monoliths in order to deliver value to customers.
You believe that most things in software are “utilities” and therefore belong in the centre. You never look to prove that you get any benefits from this centralisation, like lower costs, better quality or faster delivery times. The consumer of the service and the service provider will now both hire managers to work the functional boundary.
In response to chronic stability issues you create various heavily populated central functions to solve things like “Capacity Management”, “Service Management”, “Change Management”, “Stability Management”. You can prepend the words “We are really bad at” to any of the above. Every time there is an outage the various management functions lead a charge into the application teams to report on what went wrong. The application teams spend more time speaking to the various functions than solving the actual issue. Getting a root cause is lengthy, as the various teams cannot agree on one version of the truth and there is infighting as each area looks to absolve itself of any responsibility. You have effectively created “functional gridlock”.
The business tends to export poor unilateral technology decisions to technology, and technology exports poor unilateral business decisions back. Technology will add arbitrary points of friction/governance layers and change freezes, make material spend decisions and freeze hiring on critical products. Business will in turn sign low grade technology deals and hand them over the fence to technology to “make good” the decision.
Your technology leaders will slavishly stick to their multiyear strategies; irrespective of how painful these choices may prove to be. If any innovation or alternatives are discovered during the sluggish execution of these strategies, they will resist any attempts to alter direction and try something new.
Bonus is assigned in a single amorphous blob at the top of the function. Both performing and non-performing technology teams are equally “punished” with a low grade aggregate pot. Business, customer feedback and product quality have close to zero influence on either the pot size or its distribution. There are obvious advantages to hiring lower performing candidates and just a few rockstars/key men (as you would not be able to cover fair compensation for teams of high performers). It is highly likely that your organisation institutes some kind of bell curve on PD grades, which further enforces that you can only have a rationed percentage of “A players”. This serves to foster the “key man” culture, ignores the benefit of creating high-performing teams and creates a built-in mechanism for “bonus casualties”.
In an attempt to create efficiencies within an inefficient structure, you will tend to triplicate (find superficially similar products and artificially force them together at great expense, reduced output and increased cost/complexity). See The Triplication Paradigm
Alignment costs cripple your organisation. You often find yourself aligning, re-aligning and then aligning again. Minutes are tacitly approved, then disputed; memories of what was agreed between the various functional areas diverge. You periodically get back in a room with the same people, to discuss the same subjects and cover the same ground. Eventually you will tend to yield – resigning yourself to doing nothing (or finding a “work around”).
Your organisation believe rules will protect quality and consistency (the scaled efficiency model). Your process-focused culture will drive out the high-performing employees. When the market shifts quickly due to new technology, competitors, or business models, you can’t keep up and lose customers to competitors who adapt. You are slow-moving, rule-oriented and over time will grind “painfully into irrelevance”.
Model 2

You hire technologists to make technology decisions. They are both collaborative and inclusive, but are happy to weigh up the facts and make decisions. These leaders constantly learn and love to share and teach.
Business and technology is a pure partnership, with product development often starting with a sentence similar to, “I have x USD – work with me to get an MVP out in 3 months.”
You realise the centre is not the place for client centric delivery. Technology is seen as an intimate collaborative process.
You believe in self organising and self navigation. No layering, no aggregation, no grandstanding.
Your central functions are thin, cheap and deliver scaled learning to the business facing teams. Other teams want to talk to them to share success, they are non political, encourage collaboration and do not sell gates and governance. The functions serve using the “zero wait” principle (i.e. they run with you to help, or let you run ahead). They NEVER block.
You have no manual metrics like SLAs, OLAs, KPIs etc. You just talk to the customer and fix what’s broken. All your products have comprehensive dashboards showing machine generated metrics, that would help track issues.
You have no idea what you will be doing in 3 – 5 years.
You have zero people tracking/reporting delivery. If you need to look in the rear view mirror, get it yourself from Jira.
You are constantly floored by the innovative solutions your teams come up with. You could actually sell your products to other companies!
There are no clear lines of distinction between technology and the business.
The products delivered are designed iteratively with the customer at the centre, they have an “inspired” experience.
You don’t believe in any governance, except macro finance.
You neither read nor write requirements documents. Product documentation is developed after the fact and where possible is automated (e.g. Swagger).
Your teams reject most meetings, have a 15 min daily standup, and use instant messaging groups, phone calls etc etc.
Approval always sits at the lowest federated level, ideally with a single person or product owner.
If you ever looked at any documents signed relating to technology spend, there would be a single signatory. This person would be the actual decision maker, not an arbitrary person high up in the organisational hierarchy.
You have zero projects. You work with finance at a macro level, not micro. Everything is a product and your variable tech spend is low. You have a high number of permanent staff and encourage internal mobility. Changing existing products is easy. You can’t remember when you last signed an NDA.
You are able to both compete with and partner with start-ups. Your teams review start-up investments and often find them “lacking”.
Big outsourcing deals are never considered. You either buy and build (e.g. frameworks, cloud platforms) or build (with focus on open source). You actively recruit world class developers. You encourage a policy of “eat what you kill” to decommission expensive vendor solutions that are choking investment in the rest of the estate.
You believe value is created equally in both selling and creation of digital artefacts; and you reward symmetrically.
All new products are developed in the cloud, with an open source first approach. Your teams even share back to the wider open source community, e.g. https://twitter.com/donovancmuller. Your developers are proactively trained in security, you anonymise data, you carry out automatic penetration testing as part of your CI, and hardware is just a debate about the efficient consumption of compute (e.g. serverless compute https://aws.amazon.com/lambda/details/).
You actively encourage scaled learning, width of thought and interop – not points of singularity.
You have no systemic issues and therefore no central functions to track them; the teams are empowered to solve the customer experience end to end.
All decisions (technology or business) are co-authored and validated as a single product team.
Pay is dealt with holistically, can often include revenue share, and you are incentivised to hire the best and create the best products. There are no bell curves; good feedback is not rationed.
Your technology leaders will not publish strategy documents, but instead focus on scaled learning, constant iteration, experimentation and prototypes. They will tend to evolve their strategies as they go and will actively look to absorb new ideas and constantly course correct.
You pay zero for alignment; you have self organising teams that own their customers and stacks end to end.
You believe in accountability, empowerment and Learning at Scale. You have principles, but no rules or heavy processes to “protect” quality/output.
I appreciate this article might be provocative, and that’s by design. The main point from the above is that the incumbent tech status quo in most corporates is not really challenged. Instead the tendency is to continuously draw and then redraw functional boundaries – in the process traumatising staff, relearning the exact same lessons and creating an internal talent vacuum. It’s interesting that the removal of these boundaries altogether is rarely questioned. In fact, in response to failed tech strategies most corporates will actually choose to scale these failed strategies rather than question the validity of the approach. This, in my view, is a material mistake – technology is all about intimacy, and to get it right you need to embed it into your company’s DNA.
To summarise: if you build an institution with “padded walls”, you would expect a certain type of individual to be comfortable living there. Similarly, if you build a corporate that does not reflect its customers, if the roles of the leaders of this organisation are so generic it’s not clear what sits where and who does what, and if layering your successful teams is culturally acceptable – then the staff working in this organisation will typically be staff that don’t care about details like the customer and being successful. This, I would argue, is not a long term game plan.