PYME.IO.clusterIO module¶
API interface to PYME cluster storage.¶
The cluster storage effectively aggregates storage across a number of cluster nodes. In contrast to a standard file-system, we only support atomic reads and writes of entire files, and only support a single write to a given file (no modification or replacement). Under the hood, the file system is implemented as a bunch of HTTP servers, with the clusterIO library simply presenting a unified view of the files on all servers.
Note
PYME cluster storage has a number of design aspects which trade some safety for speed. Importantly it lacks features such as locking found in traditional filesystems. This is largely offset by only supporting atomic writes, and being write once (no deletion or modification), but does rely on clients being well behaved (i.e. not trying to write to the same file concurrently) - see put_file() for more details. Directory listings etc .. are also not guaranteed to update immediately after file puts, with a latency of up to 2s for directory updates being possible.
The most important functions are:
get_file()
which performs an atomic getput_file()
which performs an atomic put to a randomly 1 chosen serverlistdir()
which performs a (unified) directory listing
There are also a bunch of utility functions, mirroring some of those available in the os, os.path, and glob modules:
And finally some functions to facilitate efficient (and data local) operations
is_local()
which determines if a file is hosted on the local machineget_local_path()
which returns a local path if a file is locallocate_file()
which returns a list of http urls where the given file can be foundput_files()
which implements a streamed, high performance, put of multiple files
There are are also higher level functions in PYME.IO.unifiedIO
which allows files to be accessed in a consistent
way given either a cluster URI or a local path and PYME.IO.clusterResults
which helps with saving tabular data to
the cluster.
Cluster identification¶
clusterIO automatically identifies the individual data servers that make up the cluster using the mDNS (zeroconf) protocol.
If you run a (or multiple) server(s) (see PYME.cluster.HTTPDataServer
) clusterIO will find them with no additional
configuration. To support multiple clusters on a single network segment, however, we have the concept of a cluster name.
Servers will broadcast the name of the cluster they belong to as part of their mDNS advertisement.
On the client side, you can select which cluster you want to talk to with the serverfilter
argument to the
clusterIO functions. Each function will access all servers which include the value of serverfilter
in their name. If
unspecified, the value of the PYME.config
configuration option dataserver-filter
is used, which in turn defaults
to the local computer name. This enables the use of a local “cluster of one” for analysis without any additional configuration
and without interfering with any clusters which are already on the network. When setting up a proper, multi-computer cluster,
set the dataserver-filter
config option on all members of the cluster o a common value. Once setup in this manner, it
should not be necessary to actually specify serverfilter
when using clusterIO.
Note
The nature of
serverfilter
matches is both powerful and requires some care. A value ofserverfilter=''
will match every data server running on the local network and present them as an aggregated cluster. Similarlyserverfilter='cluster'
will matchcluster
,cluster1
,cluster2
,3cluster
etc and prevent them all as an aggregated cluster. This cluster aggregation opens up interesting postprocessing and display options, but to avoid unexpected effects it would be prudent to follow the following recommendations.
Avoid cluster names which are substrings of other cluster names
Always use the full cluster name for
serverfilter
.Depending on how useful aggregation actually proves to be
serverfilter
might change to either requiring an exact match or compiling to a regex at some point in the future.
Footnotes
- 1
technically we do some basic load balancing
Function Documentation¶
- PYME.IO.clusterIO.cglob(pattern, serverfilter='fv-az569-89')¶
Find files matching a given glob on the cluster. Analogous to the python glob.glob function.
- Parameters
- patternstring glob
- serverfiltercluster name (optional)
- Returns
- a list of files matching the glob
- PYME.IO.clusterIO.exists(name, serverfilter='fv-az569-89')¶
Test whether a file exists on the cluster. Analogue to os.path.exists for local files.
- Parameters
- namestring, file path
- serverfiltername of the cluster (optional)
- Returns
- True if file exists, else False
- PYME.IO.clusterIO.get_dir_manager(serverfilter='fv-az569-89')¶
- PYME.IO.clusterIO.get_file(filename, serverfilter='fv-az569-89', numRetries=3, use_file_cache=True, local_short_circuit=True, timeout=5)¶
Get a file from the cluster.
- Parameters
- filenamestring
filename relative to cluster root
- serverfilterstring
A filter to use when finding servers - used to facilitate the operation for multiple clusters on the one network segment. Note that this is still not fully supported.
- numRetriesint
The number of times to retry on failure
- use_file_cachebool
By default we cache the last 100 files requested locally. This cache never expires, although entries are dropped when we get over 100 entries. Under our working assumption that data on the cluster is immutable, this is generally safe, with the exception of log files and files streamed using the _aggregate functionality. We can optionally request a non-cached version of the file.
- local_short_circuit: bool
if file exists locally, load/read/return contents directly in this thread unless this flag is False in which case we will get the contents through the dataserver over the network.
- Returns
- PYME.IO.clusterIO.get_local_path(filename, serverfilter)¶
Get a local path, if available based on a cluster filename
- Parameters
- filename
- serverfilter
- Returns
- PYME.IO.clusterIO.get_status(serverfilter='fv-az569-89')¶
Get status of cluster servers (currently only used in the clusterIO web service)
- Parameters
- serverfilter: str
the cluster name (optional), to select a specific cluster
- Returns
- status_list: list
- a status dictionary for each node. See PYME.cluster.HTTPDataServer.updateStatus
- Disk: dict
- total: int
storage on the node [bytes]
- used: int
used storage on the node [bytes]
- free: int
available storage on the node [bytes]
- CPUUsage: float
cpu usage as a percentile
- MemUsage: dict
- total: int
total RAM [bytes]
- available: int
free RAM [bytes]
- percent: float
percent usage
- used: int
used RAM [bytes], calculated differently depending on platform
- free: int
RAM which is zero’d and ready to go [bytes]
- [other]:
more platform-specific fields
- Network: dict
- send: int
bytes sent per second since the last status update
- recv: int
bytes received per second since the last status update
- GPUUsage: list of float
[optional] returned for NVIDIA GPUs only. Should be compute usage per gpu as percent?
- GPUMem: list of float
[optional] returned for NVIDIA GPUs only. Should be memory usage per gpu as percent?
- PYME.IO.clusterIO.is_local(filename, serverfilter)¶
Test to see if a file is held on the local computer.
This is used in the distributed scheduler to score tasks.
- Parameters
- filename
- serverfilter
- Returns
- PYME.IO.clusterIO.isdir(name, serverfilter='fv-az569-89')¶
Tests if a given path on the cluster is a directory. Analogous to os.path.isdir
- Parameters
- name
- serverfilter
- Returns
- True or False
- PYME.IO.clusterIO.listdir(dirname, serverfilter='fv-az569-89')¶
Lists the contents of a directory on the cluster. Similar to os.listdir, but directories are indicated by a trailing slash
- PYME.IO.clusterIO.listdirectory(dirname, serverfilter='fv-az569-89', timeout=5)¶
Lists the contents of a directory on the cluster.
Returns a dictionary mapping filenames to clusterListing.FileInfo named tuples.
- PYME.IO.clusterIO.locate_file(filename, serverfilter='fv-az569-89', return_first_hit=False)¶
Searches the cluster to find which server(s) a given file is stored on
- Parameters
- filenamestr
The file name
- serverfilterstr
The name of our cluster (allows for having multiple clusters on the same network segment)
- return_first_hitbool
Whether to try and find all locations, or return when we find the first copy
- Returns
- PYME.IO.clusterIO.mirror_file(filename, serverfilter='fv-az569-89')¶
Copies a given file to another server on the cluster (chosen by algorithm)
The actual copy is performed peer to peer.
This is used in cluster duplication and should not (usually) be called by end-user code
- PYME.IO.clusterIO.parseURL(URL)¶
- PYME.IO.clusterIO.put_file(filename, data, serverfilter='fv-az569-89', timeout=10)¶
Put a file to the cluster. The server on which the file resides is chosen by a crude load-balancing algorithm designed to uniformly distribute data across the servers within the cluster. The target file must not exist.
Warning
Putting a file is not strictly safe when run from multiple processes, and might result in unexpected behaviour if puts with identical filenames are made concurrently (within ~2s). It is up to the calling code to ensure that such filename collisions cannot occur. In practice this is reasonably easy to achieve when machine generated filenames are used, but implies that interfaces which allow the user to specify arbitrary filenames should run through a single user interface with external locking (e.g. clusterUI), particularly if there is any chance that multiple users will be creating files simultaneously.
- Parameters
- filenamestring
path to new file, which much not exist
- databytes
the data to put
- serverfilterstring
the cluster name (optional)
- timeout: float
timeout in seconds for http operations. Warning: alter from the default setting of 1s only with extreme care. If operations are timing out it is usually an indication that something else is going wrong and you should usually fix this first. The serverless and lockless architecture depends on having low latency.
- Returns
- PYME.IO.clusterIO.put_files(files, serverfilter='fv-az569-89', timeout=30)¶
Put a bunch of files to a single server in the cluster (chosen by algorithm)
This uses a long-lived http2 session with keep-alive to avoid the connection overhead in creating a new session for each file, and puts files before waiting for a response to the last put. This function exists to facilitate fast streaming
As it reads the replies after attempting to put all the files, this is currently not as safe as put_file (in handling failures we assume that no attempts were successful after the first failed file).
- Parameters
- fileslist of tuple
a list of tuples of the form (<string> filepath, <bytes> data) for the files to be uploaded
- serverfilter: str
the cluster name (optional), to select a specific cluster
- Returns
- PYME.IO.clusterIO.stat(name, serverfilter='fv-az569-89')¶
Cluster analog to os.stat
- Parameters
- name
- serverfilter
- Returns
- PYME.IO.clusterIO.to_bytes(input)¶
Helper function for python3k to force urls etc to byte strings
- Parameters
- input
- Returns
- PYME.IO.clusterIO.walk(top, topdown=True, on_error=None, followlinks=False, serverfilter='fv-az569-89')¶
Directory tree generator. Adapted from the os.walk function in the python std library.
see docs for os.walk for usage details