PYME.IO.clusterIO module

API interface to PYME cluster storage.

The cluster storage effectively aggregates storage across a number of cluster nodes. In contrast to a standard file-system, we only support atomic reads and writes of entire files, and only support a single write to a given file (no modification or replacement). Under the hood, the file system is implemented as a bunch of HTTP servers, with the clusterIO library simply presenting a unified view of the files on all servers.

Note

PYME cluster storage has a number of design aspects which trade some safety for speed. Importantly it lacks features such as locking found in traditional filesystems. This is largely offset by only supporting atomic writes, and being write once (no deletion or modification), but does rely on clients being well behaved (i.e. not trying to write to the same file concurrently) - see put_file() for more details. Directory listings etc .. are also not guaranteed to update immediately after file puts, with a latency of up to 2s for directory updates being possible.

The most important functions are:

get_file() which performs an atomic get
put_file() which performs an atomic put to a randomly 1 chosen server
listdir() which performs a (unified) directory listing

There are also a bunch of utility functions, mirroring some of those available in the os, os.path, and glob modules:

And finally some functions to facilitate efficient (and data local) operations

is_local() which determines if a file is hosted on the local machine
get_local_path() which returns a local path if a file is local
locate_file() which returns a list of http urls where the given file can be found
put_files() which implements a streamed, high performance, put of multiple files

There are are also higher level functions in PYME.IO.unifiedIO which allows files to be accessed in a consistent way given either a cluster URI or a local path and PYME.IO.clusterResults which helps with saving tabular data to the cluster.

Cluster identification

clusterIO automatically identifies the individual data servers that make up the cluster using the mDNS (zeroconf) protocol. If you run a (or multiple) server(s) (see PYME.cluster.HTTPDataServer) clusterIO will find them with no additional configuration. To support multiple clusters on a single network segment, however, we have the concept of a cluster name. Servers will broadcast the name of the cluster they belong to as part of their mDNS advertisement.

On the client side, you can select which cluster you want to talk to with the serverfilter argument to the clusterIO functions. Each function will access all servers which include the value of serverfilter in their name. If unspecified, the value of the PYME.config configuration option dataserver-filter is used, which in turn defaults to the local computer name. This enables the use of a local “cluster of one” for analysis without any additional configuration and without interfering with any clusters which are already on the network. When setting up a proper, multi-computer cluster, set the dataserver-filter config option on all members of the cluster o a common value. Once setup in this manner, it should not be necessary to actually specify serverfilter when using clusterIO.

Note

The nature of serverfilter matches is both powerful and requires some care. A value of serverfilter='' will match every data server running on the local network and present them as an aggregated cluster. Similarly serverfilter='cluster' will match cluster, cluster1, cluster2, 3cluster etc and prevent them all as an aggregated cluster. This cluster aggregation opens up interesting postprocessing and display options, but to avoid unexpected effects it would be prudent to follow the following recommendations.

  1. Avoid cluster names which are substrings of other cluster names

  2. Always use the full cluster name for serverfilter.

Depending on how useful aggregation actually proves to be serverfilter might change to either requiring an exact match or compiling to a regex at some point in the future.

Footnotes

1

technically we do some basic load balancing

Function Documentation

PYME.IO.clusterIO.cglob(pattern, serverfilter='fv-az569-89')

Find files matching a given glob on the cluster. Analogous to the python glob.glob function.

Parameters
patternstring glob
serverfiltercluster name (optional)
Returns
a list of files matching the glob
PYME.IO.clusterIO.exists(name, serverfilter='fv-az569-89')

Test whether a file exists on the cluster. Analogue to os.path.exists for local files.

Parameters
namestring, file path
serverfiltername of the cluster (optional)
Returns
True if file exists, else False
PYME.IO.clusterIO.get_dir_manager(serverfilter='fv-az569-89')
PYME.IO.clusterIO.get_file(filename, serverfilter='fv-az569-89', numRetries=3, use_file_cache=True, local_short_circuit=True, timeout=5)

Get a file from the cluster.

Parameters
filenamestring

filename relative to cluster root

serverfilterstring

A filter to use when finding servers - used to facilitate the operation for multiple clusters on the one network segment. Note that this is still not fully supported.

numRetriesint

The number of times to retry on failure

use_file_cachebool

By default we cache the last 100 files requested locally. This cache never expires, although entries are dropped when we get over 100 entries. Under our working assumption that data on the cluster is immutable, this is generally safe, with the exception of log files and files streamed using the _aggregate functionality. We can optionally request a non-cached version of the file.

local_short_circuit: bool

if file exists locally, load/read/return contents directly in this thread unless this flag is False in which case we will get the contents through the dataserver over the network.

Returns
PYME.IO.clusterIO.get_local_path(filename, serverfilter)

Get a local path, if available based on a cluster filename

Parameters
filename
serverfilter
Returns
PYME.IO.clusterIO.get_status(serverfilter='fv-az569-89')

Get status of cluster servers (currently only used in the clusterIO web service)

Parameters
serverfilter: str

the cluster name (optional), to select a specific cluster

Returns
status_list: list
a status dictionary for each node. See PYME.cluster.HTTPDataServer.updateStatus
Disk: dict
total: int

storage on the node [bytes]

used: int

used storage on the node [bytes]

free: int

available storage on the node [bytes]

CPUUsage: float

cpu usage as a percentile

MemUsage: dict
total: int

total RAM [bytes]

available: int

free RAM [bytes]

percent: float

percent usage

used: int

used RAM [bytes], calculated differently depending on platform

free: int

RAM which is zero’d and ready to go [bytes]

[other]:

more platform-specific fields

Network: dict
send: int

bytes sent per second since the last status update

recv: int

bytes received per second since the last status update

GPUUsage: list of float

[optional] returned for NVIDIA GPUs only. Should be compute usage per gpu as percent?

GPUMem: list of float

[optional] returned for NVIDIA GPUs only. Should be memory usage per gpu as percent?

PYME.IO.clusterIO.is_local(filename, serverfilter)

Test to see if a file is held on the local computer.

This is used in the distributed scheduler to score tasks.

Parameters
filename
serverfilter
Returns
PYME.IO.clusterIO.isdir(name, serverfilter='fv-az569-89')

Tests if a given path on the cluster is a directory. Analogous to os.path.isdir

Parameters
name
serverfilter
Returns
True or False
PYME.IO.clusterIO.listdir(dirname, serverfilter='fv-az569-89')

Lists the contents of a directory on the cluster. Similar to os.listdir, but directories are indicated by a trailing slash

PYME.IO.clusterIO.listdirectory(dirname, serverfilter='fv-az569-89', timeout=5)

Lists the contents of a directory on the cluster.

Returns a dictionary mapping filenames to clusterListing.FileInfo named tuples.

PYME.IO.clusterIO.locate_file(filename, serverfilter='fv-az569-89', return_first_hit=False)

Searches the cluster to find which server(s) a given file is stored on

Parameters
filenamestr

The file name

serverfilterstr

The name of our cluster (allows for having multiple clusters on the same network segment)

return_first_hitbool

Whether to try and find all locations, or return when we find the first copy

Returns
PYME.IO.clusterIO.mirror_file(filename, serverfilter='fv-az569-89')

Copies a given file to another server on the cluster (chosen by algorithm)

The actual copy is performed peer to peer.

This is used in cluster duplication and should not (usually) be called by end-user code

PYME.IO.clusterIO.parseURL(URL)
PYME.IO.clusterIO.put_file(filename, data, serverfilter='fv-az569-89', timeout=10)

Put a file to the cluster. The server on which the file resides is chosen by a crude load-balancing algorithm designed to uniformly distribute data across the servers within the cluster. The target file must not exist.

Warning

Putting a file is not strictly safe when run from multiple processes, and might result in unexpected behaviour if puts with identical filenames are made concurrently (within ~2s). It is up to the calling code to ensure that such filename collisions cannot occur. In practice this is reasonably easy to achieve when machine generated filenames are used, but implies that interfaces which allow the user to specify arbitrary filenames should run through a single user interface with external locking (e.g. clusterUI), particularly if there is any chance that multiple users will be creating files simultaneously.

Parameters
filenamestring

path to new file, which much not exist

databytes

the data to put

serverfilterstring

the cluster name (optional)

timeout: float

timeout in seconds for http operations. Warning: alter from the default setting of 1s only with extreme care. If operations are timing out it is usually an indication that something else is going wrong and you should usually fix this first. The serverless and lockless architecture depends on having low latency.

Returns
PYME.IO.clusterIO.put_files(files, serverfilter='fv-az569-89', timeout=30)

Put a bunch of files to a single server in the cluster (chosen by algorithm)

This uses a long-lived http2 session with keep-alive to avoid the connection overhead in creating a new session for each file, and puts files before waiting for a response to the last put. This function exists to facilitate fast streaming

As it reads the replies after attempting to put all the files, this is currently not as safe as put_file (in handling failures we assume that no attempts were successful after the first failed file).

Parameters
fileslist of tuple

a list of tuples of the form (<string> filepath, <bytes> data) for the files to be uploaded

serverfilter: str

the cluster name (optional), to select a specific cluster

Returns
PYME.IO.clusterIO.stat(name, serverfilter='fv-az569-89')

Cluster analog to os.stat

Parameters
name
serverfilter
Returns
PYME.IO.clusterIO.to_bytes(input)

Helper function for python3k to force urls etc to byte strings

Parameters
input
Returns
PYME.IO.clusterIO.walk(top, topdown=True, on_error=None, followlinks=False, serverfilter='fv-az569-89')

Directory tree generator. Adapted from the os.walk function in the python std library.

see docs for os.walk for usage details