devops tale
I read an interesting article by @dseven about a company going bankrupt after a faulty deployment. Here are my notes.
Notes
- Knightmare devops tale
- Blog - https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
- Knight Capital Group was a stock-market trading firm.
- It was one of the largest traders in US equities.
- It traded using high-speed algorithms.
- Knight updated “SMARS”, its automated high-speed algorithmic order router, with new code.
  - Reason: NYSE launched a new program, the “Retail Liquidity Program”, meant to improve pricing for retail investors through retail brokers.
- SMARS
  - Splits large orders into smaller orders and sends them to the market.
  - The new Retail Liquidity Program code replaced old, unused functionality called “Power Peg”.
  - The feature flag that enabled the new code was the same flag that used to activate “Power Peg”.
  - Execution via the feature flag was tested.
- Power Peg
  - Routed child orders based on a parent order.
  - Bug: child orders kept being sent without tracking whether the parent order had already been fulfilled.
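To see why sharing one flag between old and new functionality is risky, here is a minimal sketch. It is not Knight's actual code: `Server`, `handle_order_shared_flag`, `handle_order_dedicated_flags` and the flag names are all hypothetical, made up for illustration. With a shared flag, the meaning of "on" depends on which build a server happens to run.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Server:
    name: str
    build: str  # "new" has the replacement code, "old" still has the legacy code


def handle_order_shared_flag(server: Server, flag_on: bool) -> str:
    """One reused flag: the code path taken depends on the deployed build."""
    if not flag_on:
        return "default"
    return "new_feature" if server.build == "new" else "legacy_feature"


def handle_order_dedicated_flags(server: Server, flags: Dict[str, bool]) -> str:
    """Separate flags (hypothetical names): a stale build never sees the new flag."""
    if server.build == "new" and flags.get("new_feature_enabled", False):
        return "new_feature"
    if server.build == "old" and flags.get("legacy_feature_enabled", False):
        return "legacy_feature"
    return "default"


# Seven servers got the new build, the eighth was missed.
servers = [Server(f"srv{i}", "new") for i in range(1, 8)] + [Server("srv8", "old")]

shared = {s.name: handle_order_shared_flag(s, flag_on=True) for s in servers}
dedicated = {
    s.name: handle_order_dedicated_flags(s, {"new_feature_enabled": True})
    for s in servers
}
```

With the shared flag, `srv8` ends up on the legacy path; with dedicated flags it simply stays on the default path.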
- Deployment
  - A technician manually deployed the updated software to the 8 servers.
  - One of the servers was missed and kept running the old code.
  - There was no process for a second person to verify the deployment.
  - Enabling the feature flag activated the old Power Peg functionality on the 8th server, the one that had not been updated.
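A post-deployment check for the missed-server case could look roughly like this. This is only a sketch under assumptions: it presumes every server can report the version it is running, and how that mapping is collected (SSH, a health endpoint, etc.) is left out. The function names and version strings are hypothetical.

```python
from typing import Dict, List


def find_stale_servers(deployed: Dict[str, str], expected: str) -> List[str]:
    """Return the names of servers whose reported version differs from `expected`."""
    return sorted(name for name, version in deployed.items() if version != expected)


def verify_deployment(deployed: Dict[str, str], expected: str) -> None:
    """Raise if any server is not on the expected version; call before enabling flags."""
    stale = find_stale_servers(deployed, expected)
    if stale:
        raise RuntimeError(
            "deployment incomplete, stale servers: " + ", ".join(stale)
        )


# Hypothetical data: seven servers updated, the eighth still on the old version.
deployed = {f"srv{i}": "2.0" for i in range(1, 8)}
deployed["srv8"] = "1.0"

stale = find_stale_servers(deployed, expected="2.0")
```

Running `verify_deployment(deployed, "2.0")` here raises, which is the point: the flag should not be switched on until this check passes.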
- Impact
  - The bug sent a flood of unintended orders into the market.
  - Hard to debug in a high-speed market that is doing 700 million shares/minute.
  - Knight Capital lost $460 million in 45 minutes.
  - It had only $365 million in cash and equivalents.
- Recovery
  - There was no kill switch.
  - The new code was reverted on the servers where it had been deployed correctly.
  - This amplified the issue: with the flag still enabled, Power Peg became active on all servers.
  - Finally, the service was shut down.
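For my own reference, a kill switch can be as simple as a shared flag that every outgoing order checks first. A minimal sketch, with hypothetical `KillSwitch` and `OrderGateway` classes (nothing here comes from the article):

```python
import threading
from typing import List, Optional


class KillSwitch:
    """Thread-safe stop flag; once tripped it stays tripped."""

    def __init__(self) -> None:
        self._tripped = threading.Event()
        self.reason: Optional[str] = None

    def trip(self, reason: str) -> None:
        self.reason = reason
        self._tripped.set()

    @property
    def tripped(self) -> bool:
        return self._tripped.is_set()


class OrderGateway:
    """Stand-in for whatever sends orders out; it only records them here."""

    def __init__(self, kill_switch: KillSwitch) -> None:
        self.kill_switch = kill_switch
        self.sent: List[str] = []

    def send(self, order: str) -> bool:
        # Refuse to send anything once the switch has been tripped.
        if self.kill_switch.tripped:
            return False
        self.sent.append(order)
        return True


switch = KillSwitch()
gateway = OrderGateway(switch)

first = gateway.send("order-1")           # accepted
switch.trip("unexpected order volume")    # an operator or a monitor pulls the switch
second = gateway.send("order-2")          # rejected
```

The useful property is that stopping does not require a deployment or a rollback, so it cannot make things worse the way the revert did.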
- Takeaways
  - Clean code
    - Power Peg code was never cleaned up. It had been lying unused for 8 years.
  - Deployment
    - Lack of process/checklists
    - Missing automated deployment
    - Missing repeatable process
    - Missing verification system after deployment
  - Monitoring
    - Real-time monitoring for new features
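One simple form of real-time monitoring is a sliding-window rate check on outgoing orders. The sketch below is an assumption of mine about what such a check might look like; `RateMonitor` and its thresholds are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import deque
from typing import Deque


class RateMonitor:
    """Flags when more than `max_events` occur within `window_seconds`."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._times: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one event; return True if the rate limit is now exceeded."""
        self._times.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self.max_events


# Allow at most 3 orders per second (made-up threshold).
monitor = RateMonitor(max_events=3, window_seconds=1.0)
alerts = [monitor.record(t) for t in (0.0, 0.1, 0.2, 0.3, 2.0)]
# alerts == [False, False, False, True, False]
```

An alert like this could page someone or trip a kill switch directly, which matters when losses accumulate by the minute.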