GitHub - davidkwlam/status-monitor

#Status Monitor

A status monitor for machines that is robust to failures at any node. Each node displays its current knowledge of the status of the other nodes.

This was an assignment for a distributed systems course taken at UBC.

##Usage

Add participating nodes to the nodes file (one hostname per line). Note that only a subset of the nodes needs to be added to this list.

The script can be run without arguments ("python status_monitor.py").

To see the status page at a node, go to http://MY.NODES.ADDRESS:8000.

##Notes

#####The architecture is a modified gossip protocol

Each node fires a status update to M random nodes every T seconds (user configurable; in this case M=8 and T=60)

Each node listens for incoming status updates and checks whether it has received each one before: if it has not, it forwards that update to M random nodes; otherwise it ignores the update

Each status update is timestamped with the sending node’s local time, so receiving nodes can compare an incoming update to the last update they hold from that node (i.e. it acts as a logical timestamp)
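
The receive-and-forward rule above can be sketched roughly as follows. This is an illustration only, not the code in status_monitor.py: the names `pick_targets` and `handle_update`, the `(origin, timestamp)` update shape, and the `last_seen` dictionary are all assumptions. The sketch also assumes an update is forwarded only the first time it is seen, so that gossip eventually stops.

```python
import random

M = 8  # fan-out used in the README


def pick_targets(known_nodes, self_name, m=M, rng=random):
    """Choose up to m distinct random peers, never including ourselves."""
    peers = [n for n in known_nodes if n != self_name]
    return rng.sample(peers, min(m, len(peers)))


def handle_update(last_seen, update, known_nodes, self_name, m=M, rng=random):
    """Process one incoming update and return the peers to forward it to.

    last_seen maps node name -> newest timestamp seen from that node.
    update is an (origin, timestamp) pair (an assumed message shape).
    """
    origin, timestamp = update
    if timestamp <= last_seen.get(origin, float("-inf")):
        return []  # already seen (or older than what we hold): ignore
    last_seen[origin] = timestamp  # remember the newest update per origin
    return pick_targets(known_nodes, self_name, m, rng)
```

Comparing timestamps only against earlier updates from the same origin means clocks on different machines never need to agree with each other.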

#####Nodes can join the service by simply sending a status update to a node

When a joining node A sends a status update to node B, B adds A to its list of known nodes (services are bootstrapped with a short list of known nodes)

Thus as a new node’s status gets forwarded to all nodes, that new node will begin receiving updates

Nodes will halt forwarding status updates to nodes that are down until they come back online (thus leaving the service is a matter of halting the sending of status updates)
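
A minimal sketch of this membership bookkeeping, under stated assumptions: `bootstrap`, `register_sender` and `live_peers` are hypothetical helper names, and seeding the bootstrap nodes with the start time (so they count as up until T*4 seconds pass without news) is a choice made for this example rather than something the README specifies.

```python
T = 60  # seconds between status updates, as in the README


def bootstrap(nodes, now):
    """Seed the known-node table from the nodes file (assumed behaviour)."""
    return {name: now for name in nodes}


def register_sender(last_seen, sender, timestamp):
    """Record an update from sender; return True if the node was unknown."""
    is_new = sender not in last_seen
    if is_new or timestamp > last_seen[sender]:
        last_seen[sender] = timestamp
    return is_new


def live_peers(last_seen, self_name, now, t=T):
    """Peers still worth forwarding to: heard from within the last 4*t seconds."""
    return sorted(
        name for name, ts in last_seen.items()
        if name != self_name and now - ts <= 4 * t
    )
```

A node that stops sending simply ages out of `live_peers`, and it reappears as soon as one of its updates is registered again.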

#####A node is considered down if the last status update received was >T*4 seconds ago

Since a node is supposed to emit an update every T seconds, the likelihood of a node not receiving a sent status update in the last T*4 seconds is:

M/(N-1) = likelihood of being forwarded a status update S from another node (N = # participating nodes, M = # of nodes chosen for status forwarding)

1 - M/(N-1) = likelihood of not receiving a status update S from a single node

(1 - M/(N-1)) ^ (N-1) = likelihood of not receiving a status update S from ANY node

((1 - M/(N-1)) ^ (N-1)) ^ 4 = likelihood of not receiving a status update S from ANY node in the last T*4 seconds

If N=160 and M=8, the likelihood of node A not receiving a status update from node B in the last T*4 seconds (despite B sending one every T seconds) is about 5.5e-15 (i.e. this is the inaccuracy, expressed as a probability rather than a percentage).

Note: this doesn’t consider dropped packets, though the increase in inaccuracy should not be material for this number of nodes. Since M=8 nodes are chosen at random to be forwarded a status update, even a 50% packet loss means that on average M*50% = 4 servers will still receive it.
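
The estimate above can be reproduced with a few lines of Python. The function name `miss_probability` is hypothetical, and the `loss` parameter is an added assumption: it models packet loss by scaling M to its expected surviving fan-out, M*(1 - loss), which is the same simplification the note above uses.

```python
def miss_probability(n, m, rounds=4, loss=0.0):
    """Probability of missing every update from one node over `rounds` periods.

    n      -- number of participating nodes (N)
    m      -- nodes chosen for each forward (M)
    rounds -- update periods in the window (4 for a T*4 timeout)
    loss   -- assumed fraction of forwarded messages that are dropped
    """
    effective_m = m * (1.0 - loss)            # expected fan-out after loss
    per_node = 1.0 - effective_m / (n - 1)    # one forwarder does not pick us
    per_round = per_node ** (n - 1)           # no forwarder picks us
    return per_round ** rounds                # ...for `rounds` updates in a row


p = miss_probability(160, 8)
p_lossy = miss_probability(160, 8, loss=0.5)
print(f"no loss:  {p:.1e}")   # about 5.5e-15, matching the figure above
print(f"50% loss: {p_lossy:.1e}")
```

Under this simplified model the 50% loss case is many orders of magnitude worse than the loss-free case, yet still well below one in a million.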

#####The information displayed can be easily extended

Adding more status information is a matter of simply adding a member variable and a Unix command to the “Status” class (see: status_monitor.py) and updating the webpage.
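
As a rough illustration of that pattern, the sketch below pairs each member variable with one Unix command. It is a hypothetical stand-in and not the actual “Status” class from status_monitor.py: the `run` helper, the field names, and the specific commands are all assumptions.

```python
import subprocess


def run(command):
    """Run a shell command and return its stripped stdout ('' on failure)."""
    try:
        out = subprocess.check_output(
            command, shell=True, stderr=subprocess.DEVNULL
        )
        return out.decode().strip()
    except (subprocess.CalledProcessError, OSError):
        return ""


class Status:
    """Hypothetical status record: one member variable per Unix command."""

    def __init__(self):
        self.hostname = run("hostname")
        self.uptime = run("uptime")
        # Extending the record: add one more member and one more command.
        self.disk_usage = run("df -h /")
```

The webpage would then need one more column or row to display the new field.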

#####Every participating node is capable of serving a status webpage

The webpage displays the node’s list of last received status updates from each node:

Green: last status received <T*2 seconds ago
Yellow: last status received between T*2 and T*4 seconds ago
Red: last status received >T*4 seconds ago
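
The color rule can be written as a small function. This is a sketch: the name `status_color` is hypothetical, and treating the exact T*2 and T*4 boundaries as yellow is an assumption read off the inequalities above.

```python
T = 60  # seconds between status updates, as in the README


def status_color(age_seconds, t=T):
    """Map the age of the last received status update to a display color."""
    if age_seconds < 2 * t:
        return "green"   # fresh: under T*2 seconds old
    if age_seconds <= 4 * t:
        return "yellow"  # late: between T*2 and T*4 seconds old
    return "red"         # considered down: over T*4 seconds old
```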
