FileSet

class typhon.files.fileset.FileSet(path, handler=None, name=None, info_via=None, time_coverage=None, info_cache=None, exclude=None, placeholder=None, max_threads=None, max_processes=None, worker_type=None, read_args=None, write_args=None, post_reader=None, compress=True, decompress=True, temp_dir=None, fs=None)[source]

Provide methods to handle a set of multiple files

For more examples and a user guide, look at this tutorial.

Examples

FileSet with multiple files:

from typhon.files import FileSet

# Define a fileset consisting of multiple files:
files = FileSet(
    path="/dir/{year}/{month}/{day}/{hour}{minute}{second}.nc",
    name="TestData",
    # If the time coverage of the data cannot be retrieved from the
    # filename, set this to "handler" and pass a file handler to
    # this object:
    info_via="filename"
)

# Find some files of the fileset:
for file in files.find("2017-01-01", "2017-01-02"):
    # Should print the path of the file and its time coverage:
    print(file)

FileSet with a single file:

# Define a fileset consisting of a single file:
file = FileSet(
    # Simply use the path without placeholders:
    path="/path/to/file.nc",
    name="TestData2",
    # The time coverage of the data cannot be retrieved from the
    # filename (because there are no placeholders). You can use the
    # file handler get_info() method with info_via="handler" or you
    # can define the time coverage here directly:
    time_coverage=("2007-01-01 13:00:00", "2007-01-14 13:00:00")
)

FileSet to open MHS files:

from typhon.files import FileSet, MHS_HDF

# Define a fileset consisting of multiple files:
files = FileSet(
    path="/dir/{year}/{month}/{day}/{hour}{minute}{second}.nc",
    name="MHS",
    handler=MHS_HDF(),
)

# Find some files of the fileset:
for file in files.find("2017-01-01", "2017-01-02"):
    # Should print the path of the file and its time coverage:
    print(file)

References

The FileSet class is inspired by the dataset classes implemented in atmlab, developed by Gerrit Holl.

__init__(path, handler=None, name=None, info_via=None, time_coverage=None, info_cache=None, exclude=None, placeholder=None, max_threads=None, max_processes=None, worker_type=None, read_args=None, write_args=None, post_reader=None, compress=True, decompress=True, temp_dir=None, fs=None)[source]

Initialize a FileSet object.

Parameters:
  • path – A string with the complete path to the files. The string can contain placeholders such as {year}, {month}, etc. See below for a complete list. Restricted regular expressions can also be used directly. Note that the asterisk ‘*’, not the dot ‘.’, is interpreted as the wildcard. If no placeholders are given, the path must point to a single file, and the fileset is then treated as a single-file fileset. You can also define your own placeholders by using the parameter placeholder.

  • name – The name of the fileset.

  • handler – An object which can handle the fileset files. When a file handler object is given, this fileset class does not care which format its files have. You can use a file handler class from typhon.files, use FileHandler or write your own class. If no file handler is given, an adequate one is automatically selected for the most common filename suffixes. Please note that if no file handler is specified (and none could be selected automatically), this fileset’s functionality is restricted.

  • info_via – Defines how further information about the file will be retrieved (e.g. time coverage). Possible options are filename, handler or both. Default is filename. That means that the placeholders in the file’s path will be parsed to obtain information. If this is handler, the get_info() method is used. If this is both, both options will be executed but the information from the file handler overwrites conflicting information from the filename.

  • info_cache – Retrieving further information (such as time coverage) about a file may take a while, especially when info_via is set to handler. Therefore, if the file information is cached, multiple calls of find() (for time periods that are close) are significantly faster. Specify the name of a file here (which need not exist yet) if you wish to save the information data to a file. When you restart your script, this cache is used.

  • time_coverage – If this fileset consists of multiple files, this parameter is the relative time coverage of each file, given as a timedelta object or a string with time information (e.g. “1 hour” or “2 seconds”). If the ending time of a file cannot be retrieved from its file handler or filename, it is set to its starting time + time_coverage. If time_coverage is not given, the missing ending time of each file is set to its starting time. If this fileset consists of a single file, this parameter is its absolute time coverage: set it to a tuple of two timestamps (datetime objects or strings). Otherwise, the period between year 1 and year 9999 is used as the default time coverage.

  • exclude – A list of time periods (tuples of two timestamps) or filenames (strings) that will be excluded when searching for files of this fileset.

  • placeholder – A dictionary with pairs of placeholder name and a regular expression matching its content. These are user-defined placeholders; the standard temporal placeholders do not have to be defined.

  • max_threads – Maximum number of threads that will be used to parallelise some methods (e.g. writing in background). This also sets the default for map()-like methods (default is 3).

  • max_processes – Maximum number of processes that will be used to parallelise some methods. This also sets the default for map()-like methods (default is 8).

  • worker_type – The type of the workers that will be used to parallelise some methods. Can be process (default) or thread.

  • read_args – Additional keyword arguments in a dictionary that should always be passed to read().

  • write_args – Additional keyword arguments in a dictionary that should always be passed to write().

  • post_reader – A reference to a function that will be called after reading a file. Can be used for post-processing or field selection, etc. Its signature must be callable(file_info, file_data).

  • temp_dir – The temporary directory that this FileSet object should use for compressing and decompressing files. By default, it uses the directory returned by tempfile.gettempdir().

  • compress – If true and path ends with a compression suffix (such as .zip, .gz, .bz2, etc.), newly created files will be compressed after writing them to disk. Default value is true.

  • decompress – If true and path ends with a compression suffix (such as .zip, .gz, .bz2, etc.), files will be decompressed before reading them. Default value is true.

  • fs – An instance of an implementation of fsspec.spec.AbstractFileSystem. Passing a remote filesystem implementation allows searching for and opening files on remote file systems, such as Amazon S3 via s3fs.S3FileSystem.
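The rules described for info_via="both" and for a relative time_coverage can be illustrated with a small standalone sketch. It uses only the standard library and does not call typhon; the helper names merge_info and resolve_end, and the "satellite" key, are hypothetical and chosen only for illustration. It is not the library's implementation.

```python
from datetime import datetime, timedelta


def merge_info(from_filename, from_handler):
    """Combine both sources; handler values win on conflicting keys."""
    merged = dict(from_filename)
    merged.update(from_handler)
    return merged


def resolve_end(start, end=None, time_coverage=None):
    """Fill in a missing end time for one file of a multi-file fileset."""
    if end is not None:
        return end
    if time_coverage is not None:
        # Relative time coverage: end = start + time_coverage
        return start + time_coverage
    # No time_coverage given: the end time falls back to the start time
    return start


# Hypothetical information parsed from the filename and read by a handler
filename_info = {"start": datetime(2017, 1, 1, 12), "satellite": "A"}
handler_info = {"start": datetime(2017, 1, 1, 12, 0, 30)}

info = merge_info(filename_info, handler_info)
end = resolve_end(info["start"], time_coverage=timedelta(hours=1))
print(info["start"], "to", end)
```

The handler's start time replaces the one parsed from the filename, while keys that only the filename provides are kept.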

You can use regular expressions or placeholders in path to generalize the file paths. Placeholders are captured and returned by file-finding methods such as find(). Temporal placeholders are converted to datetime objects and represent a file’s time coverage. Allowed temporal placeholders in the path argument are:

Placeholder   Description                                     Example
year          Four digits indicating the year.                1999
year2         Two digits indicating the year. [1]             58 (=2058)
month         Two digits indicating the month.                09
day           Two digits indicating the day.                  08
doy           Three digits indicating the day of the year.    002
hour          Two digits indicating the hour.                 22
minute        Two digits indicating the minute.               58
second        Two digits indicating the second.               58
millisecond   Three digits indicating the millisecond.        999

All of these placeholders may also carry the prefix end_ (e.g. end_year). They then represent the end of the time coverage.
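To make the placeholder mechanism more concrete, the following standalone sketch turns a path template into a regular expression with named groups and derives a start and end time from a matching path. It uses only the standard library and is not typhon's implementation. The names TEMPORAL, template_to_regex and parse_times are hypothetical, only a subset of the placeholders is covered, and the rule that missing end_ fields fall back to the start fields is an assumption made for this example.

```python
import re
from datetime import datetime

# Regex fragment for a subset of the temporal placeholders
TEMPORAL = {
    "year": r"\d{4}", "month": r"\d{2}", "day": r"\d{2}",
    "hour": r"\d{2}", "minute": r"\d{2}", "second": r"\d{2}",
}


def template_to_regex(template):
    """Turn '{year}'-style fields into named capture groups."""
    # re.split with one capturing group alternates literal text and names
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal path text
        else:
            base = part[len("end_"):] if part.startswith("end_") else part
            pattern += f"(?P<{part}>{TEMPORAL[base]})"
    return re.compile(pattern)


def parse_times(template, path):
    """Return (start, end) datetimes, or None if the path does not match."""
    match = template_to_regex(template).fullmatch(path)
    if match is None:
        return None
    fields = {key: int(value) for key, value in match.groupdict().items()}
    start = {k: v for k, v in fields.items() if not k.startswith("end_")}
    # Assumption: end fields that are not given are copied from the start
    end = dict(start)
    end.update({k[4:]: v for k, v in fields.items() if k.startswith("end_")})
    return datetime(**start), datetime(**end)


template = "/dir/{year}/{month}/{day}/{hour}{minute}-{end_hour}{end_minute}.nc"
times = parse_times(template, "/dir/2017/01/02/1200-1330.nc")
print(times)
```

A path that does not fit the template, such as one with a different suffix, makes parse_times return None.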

Moreover, you can define your own placeholders by using the parameter placeholder or set_placeholders(). Their names must consist of alphanumeric characters (underscores are also allowed).
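As a rough illustration of user-defined placeholders, the sketch below builds a dictionary of placeholder names and regular expressions, the shape described for the placeholder parameter, and checks the names against the naming rule. It runs without typhon; the placeholder names satellite and orbit_number, the helper check_placeholders and the example path are hypothetical.

```python
import re

# Names may contain alphanumeric characters and underscores only
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def check_placeholders(placeholders):
    """Return the placeholder names that break the naming rule."""
    return [name for name in placeholders if not NAME_PATTERN.fullmatch(name)]


# Hypothetical user-defined placeholders: name -> regex matching its content
placeholders = {
    "satellite": r"[a-z]+\d*",
    "orbit_number": r"\d{5}",
}

# A name with a hyphen is rejected by the check
bad_names = check_placeholders({**placeholders, "orbit-number": r"\d{5}"})
print(bad_names)

# Show what each expression would capture from an example path
pattern = re.compile(
    rf"/data/(?P<satellite>{placeholders['satellite']})"
    rf"_(?P<orbit_number>{placeholders['orbit_number']})\.nc"
)
captured = pattern.fullmatch("/data/noaa18_01234.nc").groupdict()
print(captured)
```

A dictionary like placeholders is the kind of value the placeholder parameter is described as accepting; the captured values stay strings, since only temporal placeholders are converted to datetime objects.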

Methods

__init__(path[, handler, name, info_via, ...])      Initialize a FileSet object.
align(other[, start, end, matches, ...])            Collect files from this fileset and a matching other fileset
collect([start, end, files, return_info])           Load all files between two dates sorted by their starting time
copy()                                              Create a so-called deep-copy of this fileset object
delete([dry_run])                                   Remove files in this fileset from the disk
detect(test, *args, **kwargs)                       Search for anomalies in fileset
dislink(name_or_fileset)                            Remove the link between this and another fileset
exclude_files(filenames)
exclude_times(periods)
find([start, end, sort, only_path, bundle, ...])    Find all files of this fileset in a given time period.
find_closest(timestamp[, filters])                  Find the closest file to a timestamp
get_filename(times[, template, fill])               Generate the full path and name of a file for a time period
get_info(file_info[, retrieve_via])                 Get information about a file.
get_placeholders()                                  Get placeholders for this FileSet.
icollect([start, end, files])                       Load all files between two dates sorted by their starting time
imap(*args, **kwargs)                               Apply a function on files and return the result immediately
is_excluded(file)                                   Checks whether a file is excluded from this FileSet.
link(other_fileset[, linker])                       Link this fileset with another FileSet
load_cache(filename)                                Load the information cache from a JSON file
make_dirs(filename)
map(func[, args, kwargs, files, on_content, ...])   Apply a function on files of this fileset with parallel workers
match(other[, start, end, max_interval, ...])       Find matching files between two filesets
move([target, convert, copy])                       Move (or copy) files from this fileset to another location
parse_filename(filename[, template])                Parse the filename with temporal and additional regular expressions.
read(file_info, **read_args)                        Open and read a file
reset_cache()                                       Reset the information cache
save_cache(filename)                                Save the information cache to a JSON file
set_placeholders(**placeholders)                    Set placeholders for this FileSet.
to_dataframe([include_times])                       Create a pandas.DataFrame from this FileSet
write(data, file_info[, in_background])             Write content to a file by using the FileSet's file handler.

Attributes

default_handler
name              Get or set the fileset's name.
path              Get or set the path to the fileset's files.
time_coverage     Get or set the time coverage of the files of this fileset
year2_threshold