Read an interesting article by @dseven about a company going bankrupt after a faulty deployment. Here are my notes.

Notes

  • Knightmare: a DevOps cautionary tale
    • Blog - https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
    • Knight Capital Group was a stock market trading firm.
      • They were one of the largest traders in US equities.
      • Used high-speed algorithms for trading.
    • Knight updated “SMARS”, its automated high-speed “algorithmic router”, with new functionality
      • Reason - NYSE launched a new program - “Retail Liquidity Program” (RLP) - to improve pricing for retail investors through retail brokers
    • SMARS
      • Splits large (parent) orders into smaller (child) orders and sends them to the market.
      • The update replaced old, unused functionality called “Power Peg”
      • The feature flag used to enable the new RLP code was repurposed from “Power Peg”
        • Execution via the feature flag was tested.
    • Power Peg
      • Routed child orders based on a parent order
      • Bug: Child orders did not track whether the parent order was fulfilled.
    • Deployment
      • Technician manually rolled out the update across 8 servers
      • Missed one of the servers, which kept the old code
      • There was no process for verification by a second pair of eyes.
      • When the feature flag was enabled, Power Peg functionality was activated on the 8th server
    • Impact
      • The bug caused a high volume of erroneous trades in the market
      • Difficult to debug in a high-speed market that is doing 700 million shares/minute.
      • Knight Capital had a $460 million loss in 45 minutes
        • It had only $365 million in cash and equivalents
    • Recovery
      • No kill switch
      • Reverted the new code on the servers where it had been deployed.
        • This amplified the issue. With the flag still on, Power Peg got enabled on all servers.
      • Eventually shut the service down
    • Takeaways
      • Clean code
        • Power Peg code was not cleaned up. It had been sitting unused for 8 years.
      • Deployment
        • Lack of process/checklists
        • Missing automated deployment
          • Missing repeatable process
        • Missing verification system after deployment.
      • Monitoring
        • Real-time monitoring for new features
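
To make the "verification after deployment" takeaway concrete, here is a minimal sketch of my own (not from the article) of a check that refuses to enable a feature flag unless every server reports the expected build. The host names, version strings, and function names are all hypothetical, and a real check would query each server rather than read a dictionary.

```python
def find_mismatched_hosts(deployed_versions, expected_version):
    """Return the hosts whose reported build differs from the expected one."""
    return sorted(
        host
        for host, version in deployed_versions.items()
        if version != expected_version
    )


def safe_to_enable_flag(deployed_versions, expected_version):
    """Allow the flag to be switched on only when every host matches."""
    return not find_mismatched_hosts(deployed_versions, expected_version)


# Hypothetical inventory: seven servers updated, one missed.
reported = {f"smars-{i}": "2.0-rlp" for i in range(1, 8)}
reported["smars-8"] = "1.9-power-peg"

print(find_mismatched_hosts(reported, "2.0-rlp"))  # ['smars-8']
print(safe_to_enable_flag(reported, "2.0-rlp"))    # False
```

Run as a gate in an automated deployment, a check like this would plausibly have flagged the one stale server before the shared flag was turned on.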