Distributed Systems: System Models
2.2 SYSTEM MODELS⌗
- Two generals problem: A model of networks. We assume that nodes are honest, but messages can get lost.
- Byzantine generals problem: A model of node behaviour. We assume that messages get delivered, but nodes can be dishonest / manipulated.
In real systems, both nodes and network might be faulty.
We want to capture assumptions in a system model consisting of:
- Network behaviour (e.g message loss)
- Node behaviour (e.g Crashes)
- Timing behaviour (e.g latency)
class ServiceExcption(Exception):
@staticmethod
def hello(string):
print(string)
Network Behaviour⌗
Assume bidi point-to-point communication between two nodes. We can have:
- Reliable (perfect) links: A message is received if and only if it is sent.
- Fair-loss links: Messages may be lost, duplicated or reordered. If you keep retrying, a message eventually gets through.
- Arbitrary (Active adversary) links: A malicious adversary may interfere with messages (eavesdrop, modify, drop, spoof, replay).
Network Partition is when a link between nodes is interrupted for a finite amount of time.
Interesting thing with this model of network behaviour is that the links can be upgraded.
- Fair-loss link -> Reliable link: Using Retry + Deduplication logic.
- Arbitrary link -> Fair-loss link: Using TLS protocol. This can only be upgraded though, if the active adversary does not entirely stop the transmission of packets.
Node Behaviour⌗
Each node executes a specified algorithm, assuming one of the following:
- Crash-stop (fail-stop): A node is faulty if it crashes (at any moment). After a crash, it stops executing forever.
- Crash-recovery (fail-recovery): A node may crash at any moment, losing its in-memory state. It mar resume executing some other time.
- Byzantine (fail-arbitrary): A node is faulty if it deviates from its algorithm. This could be crashing or malicious behaviour.
A node that is not faulty is called correct. An assumption here is that a node does not know if another node is faulty, that can be handled by Fault Detection mechanisms.
Synchrony (Timing) Behaviour⌗
Assume one of the following for network and nodes:
- Synchronous: Message latency no greater than a known upper bound. Nodes execute algorithm at a known speed.
- Partially Synchronous: The system is asynchronous for some finite (but unknown) periods of time, synchronous otherwise.
- Asynchronous: Messages can be delayed arbitrarily. Nodes can pause execution arbitrarily. No timing guarantees at all.
Here are some violations of synchrony in practice:
Violations of predictable latency:
- Message loss requiring retry due to a network partition.
- Congestion/contention causing queueing.
- Network/route reconfiguration.
Violations of predictable code execution speed:
- OS scheduling issues e.g priority inversion.
- Stop-the-world GC pauses.
- Page faults, swap, thrashing.
RTOS provide scheduling guarantees, but most distributed systems do not use RTOS.