PYME.IO.clusterIO module¶

API interface to PYME cluster storage.¶

The cluster storage effectively aggregates storage across a number of cluster nodes. In contrast to a standard file-system, we only support atomic reads and writes of entire files, and only support a single write to a given file (no modification or replacement). Under the hood, the file system is implemented as a bunch of HTTP servers, with the clusterIO library simply presenting a unified view of the files on all servers.

Note

PYME cluster storage has a number of design aspects which trade some safety for speed. Importantly it lacks features such as locking found in traditional filesystems. This is largely offset by only supporting atomic writes, and being write once (no deletion or modification), but does rely on clients being well behaved (i.e. not trying to write to the same file concurrently) - see put_file() for more details. Directory listings etc .. are also not guaranteed to update immediately after file puts, with a latency of up to 2s for directory updates being possible.

The most important functions are:

get_file() which performs an atomic get

put_file() which performs an atomic put to a randomly 1 chosen server

listdir() which performs a (unified) directory listing

There are also a bunch of utility functions, mirroring some of those available in the os, os.path, and glob modules:

exists() cglob() walk() isdir() stat()

And finally some functions to facilitate efficient (and data local) operations

is_local() which determines if a file is hosted on the local machine

get_local_path() which returns a local path if a file is local

locate_file() which returns a list of http urls where the given file can be found

put_files() which implements a streamed, high performance, put of multiple files

There are are also higher level functions in PYME.IO.unifiedIO which allows files to be accessed in a consistent way given either a cluster URI or a local path and PYME.IO.clusterResults which helps with saving tabular data to the cluster.

Cluster identification¶

clusterIO automatically identifies the individual data servers that make up the cluster using the mDNS (zeroconf) protocol. If you run a (or multiple) server(s) (see PYME.cluster.HTTPDataServer) clusterIO will find them with no additional configuration. To support multiple clusters on a single network segment, however, we have the concept of a cluster name. Servers will broadcast the name of the cluster they belong to as part of their mDNS advertisement.

On the client side, you can select which cluster you want to talk to with the serverfilter argument to the clusterIO functions. Each function will access all servers which include the value of serverfilter in their name. If unspecified, the value of the PYME.config configuration option dataserver-filter is used, which in turn defaults to the local computer name. This enables the use of a local “cluster of one” for analysis without any additional configuration and without interfering with any clusters which are already on the network. When setting up a proper, multi-computer cluster, set the dataserver-filter config option on all members of the cluster o a common value. Once setup in this manner, it should not be necessary to actually specify serverfilter when using clusterIO.

Note

The nature of serverfilter matches is both powerful and requires some care. A value of serverfilter='' will match every data server running on the local network and present them as an aggregated cluster. Similarly serverfilter='cluster' will match cluster, cluster1, cluster2, 3cluster etc and prevent them all as an aggregated cluster. This cluster aggregation opens up interesting postprocessing and display options, but to avoid unexpected effects it would be prudent to follow the following recommendations.

Avoid cluster names which are substrings of other cluster names

Always use the full cluster name for serverfilter.

Depending on how useful aggregation actually proves to be serverfilter might change to either requiring an exact match or compiling to a regex at some point in the future.

Footnotes

1: technically we do some basic load balancing

Function Documentation¶

PYME.IO.clusterIO.cglob(pattern, serverfilter='fv-az569-89')¶

Find files matching a given glob on the cluster. Analogous to the python glob.glob function.

Parameters

patternstring glob
serverfiltercluster name (optional)

Returns

a list of files matching the glob

PYME.IO.clusterIO.exists(name, serverfilter='fv-az569-89')¶

Test whether a file exists on the cluster. Analogue to os.path.exists for local files.

Parameters

namestring, file path
serverfiltername of the cluster (optional)

Returns

True if file exists, else False

PYME.IO.clusterIO.get_dir_manager(serverfilter='fv-az569-89')¶

PYME.IO.clusterIO.get_file(filename, serverfilter='fv-az569-89', numRetries=3, use_file_cache=True, local_short_circuit=True, timeout=5)¶

Get a file from the cluster.

Parameters

filenamestring: filename relative to cluster root
serverfilterstring: A filter to use when finding servers - used to facilitate the operation for multiple clusters on the one network segment. Note that this is still not fully supported.
numRetriesint: The number of times to retry on failure
use_file_cachebool: By default we cache the last 100 files requested locally. This cache never expires, although entries are dropped when we get over 100 entries. Under our working assumption that data on the cluster is immutable, this is generally safe, with the exception of log files and files streamed using the _aggregate functionality. We can optionally request a non-cached version of the file.
local_short_circuit: bool: if file exists locally, load/read/return contents directly in this thread unless this flag is False in which case we will get the contents through the dataserver over the network.

Returns

PYME.IO.clusterIO.get_local_path(filename, serverfilter)¶

Get a local path, if available based on a cluster filename

Parameters

filename
serverfilter

Returns

PYME.IO.clusterIO.get_status(serverfilter='fv-az569-89')¶

Get status of cluster servers (currently only used in the clusterIO web service)

Parameters

serverfilter: str: the cluster name (optional), to select a specific cluster

Returns

status_list: list

a status dictionary for each node. See PYME.cluster.HTTPDataServer.updateStatus

Disk: dict

total: int: storage on the node [bytes]
used: int: used storage on the node [bytes]
free: int: available storage on the node [bytes]

CPUUsage: float

cpu usage as a percentile

MemUsage: dict

total: int: total RAM [bytes]
available: int: free RAM [bytes]
percent: float: percent usage
used: int: used RAM [bytes], calculated differently depending on platform
free: int: RAM which is zero’d and ready to go [bytes]
[other]:: more platform-specific fields

Network: dict

send: int: bytes sent per second since the last status update
recv: int: bytes received per second since the last status update

GPUUsage: list of float

[optional] returned for NVIDIA GPUs only. Should be compute usage per gpu as percent?

GPUMem: list of float

[optional] returned for NVIDIA GPUs only. Should be memory usage per gpu as percent?

PYME.IO.clusterIO.is_local(filename, serverfilter)¶

Test to see if a file is held on the local computer.

This is used in the distributed scheduler to score tasks.

Parameters

filename
serverfilter

Returns

PYME.IO.clusterIO.isdir(name, serverfilter='fv-az569-89')¶

Tests if a given path on the cluster is a directory. Analogous to os.path.isdir

Parameters

name
serverfilter

Returns

True or False

PYME.IO.clusterIO.listdir(dirname, serverfilter='fv-az569-89')¶: Lists the contents of a directory on the cluster. Similar to os.listdir, but directories are indicated by a trailing slash

PYME.IO.clusterIO.listdirectory(dirname, serverfilter='fv-az569-89', timeout=5)¶

Lists the contents of a directory on the cluster.

Returns a dictionary mapping filenames to clusterListing.FileInfo named tuples.

PYME.IO.clusterIO.locate_file(filename, serverfilter='fv-az569-89', return_first_hit=False)¶

Searches the cluster to find which server(s) a given file is stored on

Parameters

filenamestr: The file name
serverfilterstr: The name of our cluster (allows for having multiple clusters on the same network segment)
return_first_hitbool: Whether to try and find all locations, or return when we find the first copy

Returns

PYME.IO.clusterIO.mirror_file(filename, serverfilter='fv-az569-89')¶

Copies a given file to another server on the cluster (chosen by algorithm)

The actual copy is performed peer to peer.

This is used in cluster duplication and should not (usually) be called by end-user code

PYME.IO.clusterIO.parseURL(URL)¶

PYME.IO.clusterIO.put_file(filename, data, serverfilter='fv-az569-89', timeout=10)¶

Put a file to the cluster. The server on which the file resides is chosen by a crude load-balancing algorithm designed to uniformly distribute data across the servers within the cluster. The target file must not exist.

Warning

Putting a file is not strictly safe when run from multiple processes, and might result in unexpected behaviour if puts with identical filenames are made concurrently (within ~2s). It is up to the calling code to ensure that such filename collisions cannot occur. In practice this is reasonably easy to achieve when machine generated filenames are used, but implies that interfaces which allow the user to specify arbitrary filenames should run through a single user interface with external locking (e.g. clusterUI), particularly if there is any chance that multiple users will be creating files simultaneously.

Parameters

filenamestring: path to new file, which much not exist
databytes: the data to put
serverfilterstring: the cluster name (optional)
timeout: float: timeout in seconds for http operations. Warning: alter from the default setting of 1s only with extreme care. If operations are timing out it is usually an indication that something else is going wrong and you should usually fix this first. The serverless and lockless architecture depends on having low latency.

Returns

PYME.IO.clusterIO.put_files(files, serverfilter='fv-az569-89', timeout=30)¶

Put a bunch of files to a single server in the cluster (chosen by algorithm)

This uses a long-lived http2 session with keep-alive to avoid the connection overhead in creating a new session for each file, and puts files before waiting for a response to the last put. This function exists to facilitate fast streaming

As it reads the replies after attempting to put all the files, this is currently not as safe as put_file (in handling failures we assume that no attempts were successful after the first failed file).

Parameters

fileslist of tuple: a list of tuples of the form (<string> filepath, <bytes> data) for the files to be uploaded
serverfilter: str: the cluster name (optional), to select a specific cluster

Returns

PYME.IO.clusterIO.stat(name, serverfilter='fv-az569-89')¶

Cluster analog to os.stat

Parameters

name
serverfilter

Returns

PYME.IO.clusterIO.to_bytes(input)¶

Helper function for python3k to force urls etc to byte strings

Parameters

input

Returns

PYME.IO.clusterIO.walk(top, topdown=True, on_error=None, followlinks=False, serverfilter='fv-az569-89')¶

Directory tree generator. Adapted from the os.walk function in the python std library.

see docs for os.walk for usage details