FileSet

class typhon.files.fileset.FileSet(path, handler=None, name=None, info_via=None, time_coverage=None, info_cache=None, exclude=None, placeholder=None, max_threads=None, max_processes=None, worker_type=None, read_args=None, write_args=None, post_reader=None, compress=True, decompress=True, temp_dir=None, fs=None)[source]

Provides methods for handling a set of files.

For more examples and a user guide, see this tutorial.

Examples

FileSet with multiple files:

from typhon.files import FileSet

# Define a fileset consisting of multiple files:
files = FileSet(
    path="/dir/{year}/{month}/{day}/{hour}{minute}{second}.nc",
    name="TestData",
    # If the time coverage of the data cannot be retrieved from the
    # filename, set this to "handler" and pass a file handler to
    # this object:
    info_via="filename"
)

# Find some files of the fileset:
for file in files.find("2017-01-01", "2017-01-02"):
    # Should print the path of the file and its time coverage:
    print(file)

FileSet with a single file:

# Define a fileset consisting of a single file:
file = FileSet(
    # Simply use the path without placeholders:
    path="/path/to/file.nc",
    name="TestData2",
    # The time coverage of the data cannot be retrieved from the
    # filename (because there are no placeholders). You can use the
    # file handler get_info() method with info_via="handler" or you
    # can define the time coverage here directly:
    time_coverage=("2007-01-01 13:00:00", "2007-01-14 13:00:00")
)

FileSet to open MHS files:

from typhon.files import FileSet, MHS_HDF

# Define a fileset consisting of multiple files:
files = FileSet(
    path="/dir/{year}/{month}/{day}/{hour}{minute}{second}.nc",
    name="MHS",
    handler=MHS_HDF(),
)

# Find some files of the fileset:
for file in files.find("2017-01-01", "2017-01-02"):
    # Should print the path of the file and its time coverage:
    print(file)

References

The FileSet class is inspired by the dataset classes implemented in atmlab, developed by Gerrit Holl.

__init__(path, handler=None, name=None, info_via=None, time_coverage=None, info_cache=None, exclude=None, placeholder=None, max_threads=None, max_processes=None, worker_type=None, read_args=None, write_args=None, post_reader=None, compress=True, decompress=True, temp_dir=None, fs=None)[source]

Initialize a FileSet object.

Parameters:

path – A string with the complete path to the files. The string can contain placeholders such as {year}, {month}, etc. See below for a complete list. Restricted regular expressions can also be used directly. Note that the asterisk ‘*’, not the dot ‘.’, is interpreted as the wildcard. If no placeholders are given, the path must point to a single file, and the fileset is then treated as a single-file fileset. You can also define your own placeholders with the parameter placeholder.
name – The name of the fileset.
handler – An object which can handle the fileset files. This fileset class does not care which format its files have when this file handler object is given. You can use a file handler class from typhon.files, use FileHandler or write your own class. If no file handler is given, an adequate one is automatically selected for the most common filename suffixes. Please note that if no file handler is specified (and none could be set automatically), this fileset’s functionality is restricted.
info_via – Defines how further information about the file will be retrieved (e.g. time coverage). Possible options are filename, handler or both. Default is filename. That means that the placeholders in the file’s path will be parsed to obtain information. If this is handler, the get_info() method is used. If this is both, both options will be executed but the information from the file handler overwrites conflicting information from the filename.
info_cache – Retrieving further information (such as time coverage) about a file may take a while, especially when info_via is set to handler. Therefore, if the file information is cached, multiple calls of find() (for time periods that are close) are significantly faster. Specify a filename here (the file need not exist yet) if you wish to save the cached information to a file. When you restart your script, this cache is used.
time_coverage – If this fileset consists of multiple files, this parameter is the relative time coverage of each file, given as a timedelta object or a string with time information (e.g. “1 hour” or “2 seconds”). If the ending time of a file cannot be retrieved from its file handler or filename, it is set to its starting time + time_coverage. If time_coverage is not given, the missing ending time of each file is set to its starting time. If this fileset consists of a single file, this parameter is its absolute time coverage: set it to a tuple of two timestamps (datetime objects or strings). Otherwise the period between year 1 and 9999 is used as the default time coverage.
exclude – A list of time periods (tuples of two timestamps) or filenames (strings) that will be excluded when searching for files of this fileset.
placeholder – A dictionary with pairs of placeholder name and a regular expression matching its content. These are user-defined placeholders; the standard temporal placeholders do not have to be defined.
max_threads – Maximum number of threads that will be used to parallelise some methods (e.g. writing in background). This also sets the default for map()-like methods (default is 3).
max_processes – Maximum number of processes that will be used to parallelise some methods. This also sets the default for map()-like methods (default is 8).
worker_type – The type of the workers that will be used to parallelise some methods. Can be process (default) or thread.
read_args – Additional keyword arguments in a dictionary that should always be passed to read().
write_args – Additional keyword arguments in a dictionary that should always be passed to write().
post_reader – A reference to a function that will be called after reading a file. Can be used for post-processing or field selection, etc. Its signature must be callable(file_info, file_data).
temp_dir – The temporary directory that this FileSet object should use for compressing and decompressing files. By default, it uses the directory returned by tempfile.gettempdir().
compress – If true and path ends with a compression suffix (such as .zip, .gz, .bz2, etc.), newly created files will be compressed after writing them to disk. Default value is true.
decompress – If true and path ends with a compression suffix (such as .zip, .gz, .bz2, etc.), files will be decompressed before reading them. Default value is true.
fs – Instance of implementation of fsspec.spec.AbstractFileSystem. By passing a remote filesystem implementation this allows for searching for and opening files on remote file systems such as Amazon S3 using s3fs.S3FileSystem.
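The fallback order described for time_coverage in a multi-file fileset can be sketched in a few lines. This is only an illustration of the documented rule using the standard library, not typhon's implementation; the function name resolve_coverage is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def resolve_coverage(
    start: datetime,
    end: Optional[datetime] = None,
    time_coverage: Optional[timedelta] = None,
) -> Tuple[datetime, datetime]:
    """Return (start, end) following the documented fallback order.

    Hypothetical helper for illustration; it is not part of typhon.
    """
    if end is not None:
        # The end time was found in the filename or by the file handler.
        return start, end
    if time_coverage is not None:
        # Missing end time: starting time + relative time coverage.
        return start, start + time_coverage
    # Neither is available: the end time is set to the starting time.
    return start, start


start = datetime(2017, 1, 1, 12, 0, 0)
with_coverage = resolve_coverage(start, time_coverage=timedelta(hours=1))
without_coverage = resolve_coverage(start)
print(with_coverage)
print(without_coverage)
```

With a relative coverage of one hour, a file starting at 12:00 is treated as ending at 13:00; without it, start and end coincide.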

You can use regular expressions or placeholders in path to generalize the file paths. Placeholders are captured and returned by file-finding methods such as find(). Temporal placeholders are converted to datetime objects and represent a file’s time coverage. Allowed temporal placeholders in the path argument are:

Placeholder	Description	Example
year	Four digits indicating the year.	1999
year2	Two digits indicating the year. [1]	58 (=2058)
month	Two digits indicating the month.	09
day	Two digits indicating the day.	08
doy	Three digits indicating the day of the year.	002
hour	Two digits indicating the hour.	22
minute	Two digits indicating the minute.	58
second	Two digits indicating the second.	58
millisecond	Three digits indicating the millisecond.	999

All of these placeholders can also carry the prefix end_ (e.g. end_year). They then represent the end of the time coverage.
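To make the idea of temporal placeholders concrete, the following sketch turns a path template into a regular expression with named groups and builds start and end datetime objects from the captured digits. It uses only the standard library and covers a subset of the placeholders in the table. The helper names are hypothetical, and inheriting missing end fields from the start fields is an assumption of this sketch, not a statement about typhon's internals.

```python
import re
from datetime import datetime

# Regex fragments for a subset of the temporal placeholders listed above.
TEMPORAL = {
    "year": r"\d{4}",
    "month": r"\d{2}",
    "day": r"\d{2}",
    "hour": r"\d{2}",
    "minute": r"\d{2}",
    "second": r"\d{2}",
}


def template_to_regex(template: str) -> re.Pattern:
    """Convert '{name}' placeholders into named capture groups."""
    # re.split with one group yields: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal text between placeholders
        else:
            base = part[4:] if part.startswith("end_") else part
            pattern += f"(?P<{part}>{TEMPORAL[base]})"
    return re.compile(pattern)


def parse_times(template: str, filename: str):
    """Return (start, end) parsed from filename (hypothetical helper)."""
    match = template_to_regex(template).fullmatch(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match the template")
    fields = {key: int(value) for key, value in match.groupdict().items()}
    start = {k: v for k, v in fields.items() if not k.startswith("end_")}
    # Assumption of this sketch: end fields not given are taken from start.
    end = dict(start)
    end.update({k[4:]: v for k, v in fields.items() if k.startswith("end_")})
    return datetime(**start), datetime(**end)


template = (
    "/dir/{year}/{month}/{day}/"
    "{hour}{minute}{second}-{end_hour}{end_minute}{end_second}.nc"
)
start, end = parse_times(template, "/dir/2017/01/01/120000-130000.nc")
print(start, end)
```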

Moreover, you can define your own placeholders by using the parameter placeholder or set_placeholders(). Their names must consist of alphanumeric characters (underscores are also allowed).
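A user-defined placeholder is a pair of a name and a regular expression. The sketch below shows how such a dictionary could be validated and applied to a path with the standard library alone. The placeholder names satellite and orbit, the example paths, and the function build_pattern are hypothetical; this is not how typhon itself is implemented.

```python
import re

# Hypothetical user-defined placeholders: name -> regular expression.
user_placeholders = {"satellite": r"noaa\d{2}", "orbit": r"\d{5}"}

# Letters, digits and underscores only. This sketch also rejects a leading
# digit, because Python named groups must be valid identifiers.
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_pattern(template: str, placeholders: dict) -> re.Pattern:
    """Compile a path template that uses user-defined placeholders."""
    for name in placeholders:
        if not VALID_NAME.fullmatch(name):
            raise ValueError(f"invalid placeholder name: {name!r}")
    # re.split with one group yields: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal text
        else:
            pattern += f"(?P<{part}>{placeholders[part]})"
    return re.compile(pattern)


pattern = build_pattern("/data/{satellite}/orbit_{orbit}.nc", user_placeholders)
captured = pattern.fullmatch("/data/noaa18/orbit_01234.nc").groupdict()
print(captured)
```

The captured values stay plain strings here; only temporal placeholders are described as being converted to datetime objects.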

Methods

`__init__`(path[, handler, name, info_via, ...])	Initialize a FileSet object.
`align`(other[, start, end, matches, ...])	Collect files from this fileset and a matching other fileset
`collect`([start, end, files, return_info])	Load all files between two dates sorted by their starting time
`copy`()	Create a so-called deep-copy of this fileset object
`delete`([dry_run])	Remove files in this fileset from the disk
`detect`(test, *args, **kwargs)	Search for anomalies in fileset
`dislink`(name_or_fileset)	Remove the link between this and another fileset
`exclude_files`(filenames)
`exclude_times`(periods)
`find`([start, end, sort, only_path, bundle, ...])	Find all files of this fileset in a given time period.
`find_closest`(timestamp[, filters])	Find the closest file to a timestamp
`get_filename`(times[, template, fill])	Generate the full path and name of a file for a time period
`get_info`(file_info[, retrieve_via])	Get information about a file.
`get_placeholders`()	Get placeholders for this FileSet.
`icollect`([start, end, files])	Load all files between two dates sorted by their starting time
`imap`(*args, **kwargs)	Apply a function on files and return the result immediately
`is_excluded`(file)	Checks whether a file is excluded from this FileSet.
`link`(other_fileset[, linker])	Link this fileset with another FileSet
`load_cache`(filename)	Load the information cache from a JSON file
`make_dirs`(filename)
`map`(func[, args, kwargs, files, on_content, ...])	Apply a function on files of this fileset with parallel workers
`match`(other[, start, end, max_interval, ...])	Find matching files between two filesets
`move`([target, convert, copy])	Move (or copy) files from this fileset to another location
`parse_filename`(filename[, template])	Parse the filename with temporal and additional regular expressions.
`read`(file_info, **read_args)	Open and read a file
`reset_cache`()	Reset the information cache
`save_cache`(filename)	Save the information cache to a JSON file
`set_placeholders`(**placeholders)	Set placeholders for this FileSet.
`to_dataframe`([include_times])	Create a pandas.DataFrame from this FileSet
`write`(data, file_info[, in_background])	Write content to a file by using the FileSet's file handler.

Attributes

`default_handler`
`name`	Get or set the fileset's name.
`path`	Gets or sets the path to the fileset's files.
`time_coverage`	Get and set the time coverage of the files of this fileset
`year2_threshold`