Tail latency
Read an interesting paper on tail latency in services by Jeffrey Dean and Luiz André Barroso. The paper discusses how tail latency affects a service's overall SLA, how managing it improves the overall service experience, and various approaches to keeping tail latencies under control.
- The tail at scale - https://dl.acm.org/doi/abs/10.1145/2408776.2408794
- At large scale, temporary high-latency episodes can come to dominate overall service performance.
- Reasons why component response times increase and lead to high tail latency in a service:
- Shared resources
- Resource contention
- e.g. CPU contention between different services on the same machine, or between different requests of the same application.
- Global resource sharing: applications that run on separate servers still contend for global resources such as network switches and shared file systems.
- Daemons
- Background daemons can generate multi-millisecond hiccups when they are scheduled, even though their average utilisation looks modest once idle time is factored in.
- Maintenance
- Background activities such as data reconstruction in distributed file systems, periodic log compaction, and garbage collection in garbage-collected languages.
- Queueing
- Multiple layers of queueing in intermediate servers and network switches
- A common technique for reducing latency is to parallelise sub-operations across many different machines.
- Variability in the latency distribution is magnified by this fan-out, since the slowest sub-operation gates the whole request (see the quick calculation below).
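A quick back-of-the-envelope sketch (mine, not a figure from the notes) of the magnification effect: if each leaf server misses the latency target with probability p, a request that must wait for all n leaves misses it with probability 1 - (1 - p)^n.

```python
# Sketch: why fan-out magnifies tail latency.
# If each leaf server exceeds the latency budget with probability p,
# a request that must wait for all n leaves exceeds it with
# probability 1 - (1 - p)**n.

def fraction_slow(p: float, n: int) -> float:
    """Probability that at least one of n parallel sub-requests is slow."""
    return 1 - (1 - p) ** n

# One server that is slow once per 100 requests: 1% of requests are slow.
print(fraction_slow(0.01, 1))    # 0.01
# Fan the same request out to 100 such servers: ~63% of requests are slow.
print(fraction_slow(0.01, 100))  # ~0.634
```

This matches the paper's example of a service whose leaves are slow one time in a hundred: with a fan-out of 100, nearly two-thirds of user-facing requests hit the tail.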
- Reducing component variability
- Differentiate service classes and use higher-level queuing
- Use differentiated service classes to prefer scheduling interactive requests (ones a user is waiting on) over non-interactive requests (sketched after this list).
- Keep the low-level queues short so that high-priority requests are served quickly.
- Reduce head-of-line blocking
- Break long-running requests into a sequence of smaller requests, which allows other short requests to be interleaved between them.
- Manage background activities and synchronise disruption
- Throttle background activities and break heavy operations into smaller ones.
- Schedule them for times when overall load is lower.
- In systems with large fan-out, synchronise background activity across machines so that the disruption lands in one brief shared window instead of constantly affecting some fraction of interactive requests.
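A rough illustration of differentiated service classes and reduced head-of-line blocking (my own sketch; the class names and queue structure are invented, not from the paper): keep a priority queue keyed by service class so interactive work is always drained first, and enqueue long batch jobs as many small chunks so no short request sits behind a long-running operation.

```python
import heapq
import itertools

# Sketch: differentiated service classes with a priority queue.
# Interactive requests (class 0) are always dequeued before batch work (class 1),
# and long batch jobs are split into small chunks so an interactive request
# never waits behind a single long-running operation.

INTERACTIVE, BATCH = 0, 1
_order = itertools.count()   # tie-breaker keeps FIFO order within a class
queue = []

def submit(service_class, work):
    heapq.heappush(queue, (service_class, next(_order), work))

def submit_batch_job(chunks):
    # Enqueue a long job as many small work items instead of one big one.
    for chunk in chunks:
        submit(BATCH, chunk)

def run():
    while queue:
        _, _, work = heapq.heappop(queue)
        work()   # each unit of work is short, so queues stay short

# Example: the interactive request jumps ahead of already-queued batch chunks.
submit_batch_job([lambda i=i: print(f"batch chunk {i}") for i in range(3)])
submit(INTERACTIVE, lambda: print("interactive request"))
run()
```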
- Within-request short-term adaptations
- These techniques use replication to curb tail latency within a single request; they work best for requests that can be served from read-only, loosely consistent data sets.
- Hedged requests
- Keep multiple replicas of the data. Send the request to one replica first; if no response arrives within the 95th-percentile expected latency, send the same request to a secondary replica and use whichever response comes back first (see the sketch below). Google's measurements show this is an effective way to reduce 99th-percentile latencies.
- Stats: for a benchmark that reads 1,000 keys stored in BigTable, sending hedged requests brought the tail (99.9th-percentile) latency down from 1,800ms to 74ms while sending only about 2% more requests.
- Problem: multiple servers may end up handling the same request. Resource consumption stays efficient only if the redundant copy can be cancelled once another server picks the request up for execution.
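A minimal asyncio sketch of a hedged request (my own illustration; the replica functions and the `p95_latency_s` parameter are assumptions, not an API from the paper): send to one replica, and if nothing comes back within the 95th-percentile expected latency, issue the same request to a second replica and keep whichever answer arrives first, cancelling the other.

```python
import asyncio

# Sketch of a hedged request for an idempotent, read-only operation:
# hedge to a second replica only after the first request has been
# outstanding longer than the 95th-percentile expected latency.

async def hedged_request(replicas, request, p95_latency_s):
    tasks = [asyncio.create_task(replicas[0](request))]
    done, _ = await asyncio.wait(tasks, timeout=p95_latency_s)
    if not done and len(replicas) > 1:
        # The first replica is slow; hedge to a secondary replica.
        tasks.append(asyncio.create_task(replicas[1](request)))
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    winner = done.pop()
    for t in tasks:              # cancel whichever request is still in flight
        if t is not winner:
            t.cancel()
    return winner.result()

# Hypothetical replicas: the primary is having a slow moment, the secondary is fast.
async def slow_replica(req):
    await asyncio.sleep(1.0)
    return f"{req} served by slow replica"

async def fast_replica(req):
    await asyncio.sleep(0.01)
    return f"{req} served by fast replica"

async def main():
    result = await hedged_request(
        [slow_replica, fast_replica], "get key", p95_latency_s=0.05)
    print(result)   # answered by the secondary after the 50ms hedge delay

asyncio.run(main())
```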
- Tied requests
- The biggest source of variability in serving time is queuing delay; once a request is actually dequeued and starts executing, the variability of its completion time drops quickly.
- Tied requests: enqueue the request on multiple servers, have the servers exchange status updates about it, and cancel the copy on one server once the other picks the request up for execution (see the sketch below).
- BUT what if both servers pick up the request for execution at the same time? This can happen when the queue length on both is 0.
- The client can introduce a small delay (about twice the average network message delay) before sending the request to the secondary server, giving the first server's cancellation message time to arrive.
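A rough simulation of tied requests (my own sketch; the server names and the shared `claimed` set are stand-ins for real cross-server cancellation messages): the client enqueues the request on two servers, waiting about twice the average network message delay before the second copy, and whichever server dequeues it first claims it so the other drops its copy.

```python
import asyncio
import itertools

# Sketch of tied requests: the same request is enqueued on two servers; when
# one server dequeues it, the peer cancels its copy instead of executing it.
# A shared `claimed` set stands in for the cross-server cancellation message.

claimed = set()                  # request ids some server has already picked up
request_ids = itertools.count()

async def server(name, queue):
    while True:
        req_id, work = await queue.get()
        if req_id in claimed:            # peer already started it: drop our copy
            print(f"{name}: dropping tied request {req_id}")
        else:
            claimed.add(req_id)          # "tell" the peer we are executing it
            await asyncio.sleep(0.01)    # pretend to do the work
            print(f"{name}: executed request {req_id} ({work})")
        queue.task_done()

async def tied_submit(q_primary, q_secondary, work, avg_network_delay_s=0.001):
    req_id = next(request_ids)
    await q_primary.put((req_id, work))
    # Wait ~2x the average network message delay before enqueueing the second
    # copy, so a cancellation has time to arrive if both queues are empty.
    await asyncio.sleep(2 * avg_network_delay_s)
    await q_secondary.put((req_id, work))

async def main():
    q_a, q_b = asyncio.Queue(), asyncio.Queue()
    workers = [asyncio.create_task(server("server-a", q_a)),
               asyncio.create_task(server("server-b", q_b))]
    await tied_submit(q_a, q_b, "read block")
    await asyncio.gather(q_a.join(), q_b.join())
    for w in workers:
        w.cancel()

asyncio.run(main())
```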
- Why not probe the remote queues first and send the request to the server with the shortest queue?
- This turns out to be less effective, for several reasons:
- load levels can change between probe time and request time;
- request service times can be hard to estimate because of underlying system and hardware variability;
- clients can create hotspots by all picking the apparently least-loaded server at the same time.
- Distributed shortest-positioning-time-first system
- A variation in which the request is sent to one server and forwarded to replicas only if the primary fails to answer within a certain delay.
- All of these techniques assume that the cause of the latency variability is not a common issue that affects multiple replicas at once.
- Cross-request long-term adaptations
- Techniques for reducing latency variability caused by service-time and load imbalance.
- Partition the data so that load is balanced across machines.
- A static, single-partition-per-machine assignment is rarely sufficient:
- the performance of the underlying machines is not uniform;
- there is data-induced load imbalance (some items are simply hotter than others).
- Micro-partitions
- Create many more partitions than machines, assign multiple partitions to each machine, and keep multiple replicas of a partition where needed (see the sketch below).
- Failure recovery is faster because many machines each pick up a small piece, and load balancing becomes a matter of moving responsibility for a partition from one machine to another.
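A toy sketch of micro-partitioning (the numbers and machine names are invented): keep an explicit partition-to-machine map with many more partitions than machines, so load balancing is just reassigning a small partition from the busiest machine to the least busy one.

```python
# Sketch: micro-partitions. With far more partitions than machines, load
# balancing means moving a few small partitions, and a failed machine's
# partitions can be re-absorbed by many machines in parallel.

NUM_PARTITIONS = 1000
machines = ["m1", "m2", "m3", "m4"]

# Initial assignment: spread partitions round-robin across machines.
assignment = {p: machines[p % len(machines)] for p in range(NUM_PARTITIONS)}

def machine_load(machine, partition_load):
    return sum(partition_load[p] for p, m in assignment.items() if m == machine)

def rebalance_one(partition_load):
    """Move one small partition from the busiest machine to the least busy one."""
    loads = {m: machine_load(m, partition_load) for m in machines}
    hot, cold = max(loads, key=loads.get), min(loads, key=loads.get)
    candidates = [p for p, m in assignment.items() if m == hot]
    victim = min(candidates, key=lambda p: partition_load[p])  # cheap to move
    assignment[victim] = cold
    return victim, hot, cold

# Example: partition 7 suddenly becomes hot, overloading its machine.
partition_load = {p: 1 for p in range(NUM_PARTITIONS)}
partition_load[7] = 500
print(rebalance_one(partition_load))   # a small partition moves off the hot machine
```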
- Selective replication
- Detect or predict items that are likely to cause data imbalance and, as a result, load imbalance.
- Create additional replicas of these items to spread the load (sketched below).
- Google web search uses this approach,
- making copies of popular documents in multiple micro-partitions.
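A hedged sketch of selective replication (the threshold and partition names are invented): count accesses per item and, once an item looks hot, register extra replicas for it so subsequent reads spread across several micro-partitions.

```python
import random
from collections import Counter, defaultdict

# Sketch: selective replication. Items that attract disproportionate traffic
# get extra replicas, and reads for a hot item are spread across its replicas.

HOT_THRESHOLD = 1000                              # accesses before an item is "hot"
access_counts = Counter()
replicas = defaultdict(lambda: ["partition-0"])   # every item starts with one home

def record_access(item):
    access_counts[item] += 1
    if access_counts[item] == HOT_THRESHOLD:
        # Hot item detected: add replicas on additional micro-partitions.
        replicas[item] += ["partition-1", "partition-2"]

def pick_replica(item):
    record_access(item)
    return random.choice(replicas[item])          # load spreads once replicas exist

# Example: a popular document ends up served from three micro-partitions.
for _ in range(1500):
    pick_replica("popular-doc")
print(replicas["popular-doc"])
```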
- Latency-induced probation
- Observe the latency of the various machines serving traffic.
- Temporarily remove a machine whose latency has increased from live traffic; counter-intuitively, taking capacity away can improve the overall latency of the service (see the sketch below).
- Keep collecting statistics from the problematic machine via shadow requests, and add it back into rotation once its latency returns to normal.
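A sketch of latency-induced probation (the thresholds and server names are my own assumptions): track recent per-server latencies, exclude a server from live traffic when its observed latency degrades, keep feeding it shadow-request measurements, and reinstate it once it looks healthy again.

```python
from collections import defaultdict, deque

# Sketch: latency-induced probation. Servers whose recent latency degrades are
# excluded from live traffic but keep receiving shadow requests; they are
# reinstated once their observed latency returns to normal.

WINDOW = 100            # recent latency samples kept per server
PROBATION_MS = 50.0     # put a server on probation above this median latency
REINSTATE_MS = 20.0     # take it off probation below this median latency

samples = defaultdict(lambda: deque(maxlen=WINDOW))
on_probation = set()

def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

def record_latency(server, latency_ms):
    """Feed in observations from both live and shadow requests."""
    samples[server].append(latency_ms)
    m = median(samples[server])
    if server not in on_probation and m > PROBATION_MS:
        on_probation.add(server)        # stop sending it live traffic
    elif server in on_probation and m < REINSTATE_MS:
        on_probation.discard(server)    # healthy again, bring it back

def eligible(servers):
    """Servers that should receive live (non-shadow) traffic."""
    return [s for s in servers if s not in on_probation]

# Example: server-b degrades, is probed in shadow mode, then recovers.
for _ in range(20):
    record_latency("server-a", 5.0)
    record_latency("server-b", 120.0)
print(eligible(["server-a", "server-b"]))   # ['server-a']
for _ in range(200):
    record_latency("server-b", 5.0)         # shadow requests show recovery
print(eligible(["server-a", "server-b"]))   # ['server-a', 'server-b']
```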
- Large information retrieval systems
- Speed is a key quality metric
- Returning good results quickly is better than returning the best results slowly.
- Good enough
- Rather than waiting for the slowest servers, return "good enough" results from the servers that have already responded, in order to hold the latency target (see the sketch below).
- Remain careful to ensure that good-enough responses stay rare.
- Skip non-essential sub-systems to stay within the latency budget; this implies ranking sub-systems by priority.
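A small sketch of returning good-enough results (the latency budget, leaf count, and the 80% threshold are invented): fan the query out to all leaves, wait only up to the latency budget, and answer from whichever leaves have responded, provided enough of them made it.

```python
import asyncio
import random

# Sketch: "good enough" results. Wait for leaf responses only up to a latency
# budget; if most leaves have answered, return their results rather than
# waiting for the stragglers.

LATENCY_BUDGET_S = 0.05
MIN_FRACTION = 0.8           # require answers from at least 80% of leaves

async def leaf(i):
    # Most leaves respond quickly; the occasional one straggles.
    await asyncio.sleep(0.2 if random.random() < 0.05 else 0.01)
    return f"results from leaf {i}"

async def query(num_leaves=20):
    tasks = [asyncio.create_task(leaf(i)) for i in range(num_leaves)]
    done, pending = await asyncio.wait(tasks, timeout=LATENCY_BUDGET_S)
    if len(done) >= MIN_FRACTION * num_leaves:
        for t in pending:
            t.cancel()               # skip the stragglers: good-enough answer
        return [t.result() for t in done]
    # Too few leaves answered in time; wait for the rest after all.
    done, _ = await asyncio.wait(tasks)
    return [t.result() for t in done]

results = asyncio.run(query())
print(f"{len(results)} of 20 leaf responses used")
```

In a real system one would also track how often the good-enough path is taken, so that serving partial results stays rare.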
- Canary requests
- In systems with very high fan-out, a bad request could crash or stall thousands of leaf servers at once. To guard against this, the root sends the request first to one or two "canary" leaf servers and fans it out to the remaining servers only once a canary has responded successfully within a reasonable time (see the sketch below).
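A sketch of a canary request (my own illustration; the canary count, timeout, and leaf functions are assumptions): the root sends the query to a couple of canary leaves first and fans it out to the remaining leaves only after the canaries respond successfully within a reasonable time.

```python
import asyncio

# Sketch: canary requests. The root queries one or two "canary" leaves first;
# only if they respond successfully within a reasonable time is the query
# fanned out to all remaining leaves.

CANARY_COUNT = 2
CANARY_TIMEOUT_S = 1.0

async def fan_out_with_canary(leaves, query):
    canaries, rest = leaves[:CANARY_COUNT], leaves[CANARY_COUNT:]
    try:
        canary_results = await asyncio.wait_for(
            asyncio.gather(*(leaf(query) for leaf in canaries)),
            timeout=CANARY_TIMEOUT_S,
        )
    except Exception as exc:
        # A canary crashed or timed out: refuse to fan the query out further.
        raise RuntimeError(f"canary check failed for {query!r}: {exc!r}")
    remaining = await asyncio.gather(*(leaf(query) for leaf in rest))
    return list(canary_results) + list(remaining)

# Hypothetical leaf servers that just echo the query after a short delay.
def make_leaf(i):
    async def leaf(query):
        await asyncio.sleep(0.01)        # pretend to search this leaf's shard
        return f"leaf-{i}: {query}"
    return leaf

async def main():
    leaves = [make_leaf(i) for i in range(100)]
    results = await fan_out_with_canary(leaves, "search terms")
    print(len(results), "leaf responses")

asyncio.run(main())
```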