FileSet
- class typhon.files.fileset.FileSet(path, handler=None, name=None, info_via=None, time_coverage=None, info_cache=None, exclude=None, placeholder=None, max_threads=None, max_processes=None, worker_type=None, read_args=None, write_args=None, post_reader=None, compress=True, decompress=True, temp_dir=None, fs=None)
Provide methods to handle a set of multiple files.
For more examples and a user guide, see the tutorial.
Examples
FileSet with multiple files:
    from typhon.files import FileSet

    # Define a fileset consisting of multiple files:
    files = FileSet(
        path="/dir/{year}/{month}/{day}/{hour}{minute}{second}.nc",
        name="TestData",
        # If the time coverage of the data cannot be retrieved from the
        # filename, set this to "handler" and pass a file handler to
        # this object:
        info_via="filename",
    )

    # Find some files of the fileset:
    for file in files.find("2017-01-01", "2017-01-02"):
        # Should print the path of the file and its time coverage:
        print(file)
FileSet with a single file:
    from typhon.files import FileSet

    # Define a fileset consisting of a single file:
    file = FileSet(
        # Simply use the path without placeholders:
        path="/path/to/file.nc",
        name="TestData2",
        # The time coverage of the data cannot be retrieved from the
        # filename (because there are no placeholders). You can use the
        # file handler's get_info() method with info_via="handler", or
        # you can define the time coverage here directly:
        time_coverage=("2007-01-01 13:00:00", "2007-01-14 13:00:00"),
    )
FileSet to open MHS files:
    from typhon.files import FileSet, MHS_HDF

    # Define a fileset consisting of multiple files:
    files = FileSet(
        path="/dir/{year}/{month}/{day}/{hour}{minute}{second}.nc",
        name="MHS",
        handler=MHS_HDF(),
    )

    # Find some files of the fileset:
    for file in files.find("2017-01-01", "2017-01-02"):
        # Should print the path of the file and its time coverage:
        print(file)
References
The FileSet class is inspired by the implemented dataset classes in atmlab developed by Gerrit Holl.
- __init__(path, handler=None, name=None, info_via=None, time_coverage=None, info_cache=None, exclude=None, placeholder=None, max_threads=None, max_processes=None, worker_type=None, read_args=None, write_args=None, post_reader=None, compress=True, decompress=True, temp_dir=None, fs=None)
Initialize a FileSet object.
- Parameters:
path – A string with the complete path to the files. The string can contain placeholders such as {year}, {month}, etc. See below for a complete list. Restricted regular expressions can also be used directly. Please note that the asterisk ‘*’, not the dot ‘.’, is interpreted as a wildcard. If no placeholders are given, the path must point to a single file, and this fileset is then treated as a single-file set. You can also define your own placeholders by using the parameter placeholder.
name – The name of the fileset.
handler – An object which can handle the fileset files. This fileset class does not care which format its files have when this file handler object is given. You can use a file handler class from typhon.files, use FileHandler or write your own class. If no file handler is given, an adequate one is automatically selected for the most common filename suffixes. Please note that if no file handler is specified (and none could be set automatically), this fileset’s functionality is restricted.
info_via – Defines how further information about the file (e.g. time coverage) will be retrieved. Possible options are filename, handler or both. Default is filename, which means that the placeholders in the file’s path are parsed to obtain information. If this is handler, the get_info() method is used. If this is both, both options are executed, but the information from the file handler overwrites conflicting information from the filename.
info_cache – Retrieving further information (such as time coverage) about a file may take a while, especially when info_via is set to handler. If the file information is cached, multiple calls of find() (for time periods that are close) are significantly faster. Specify the name of a file here (which need not exist yet) if you wish to save the information data to a file. When you restart your script, this cache is used.
time_coverage – If this fileset consists of multiple files, this parameter is the relative time coverage (i.e. a timedelta, e.g. “1 hour”) of each file. It can be a timedelta object or a string with time information (e.g. “2 seconds”). If the ending time of a file cannot be retrieved from its file handler or filename, the ending time is set to its starting time + time_coverage. If time_coverage is not given, the missing ending time of each file is set to its starting time. If this fileset consists of a single file, this parameter is its absolute time coverage. Set it to a tuple of timestamps (datetime objects or strings). Otherwise the period between year 1 and 9999 is used as the default time coverage.
exclude – A list of time periods (tuples of two timestamps) or filenames (strings) that will be excluded when searching for files of this fileset.
placeholder – A dictionary with pairs of placeholder name and a regular expression matching its content. These are user-defined placeholders; the standard temporal placeholders do not have to be defined.
max_threads – Maximum number of threads that will be used to parallelise some methods (e.g. writing in the background). This also sets the default for map()-like methods (default is 3).
max_processes – Maximum number of processes that will be used to parallelise some methods. This also sets the default for map()-like methods (default is 8).
worker_type – The type of the workers that will be used to parallelise some methods. Can be process (default) or thread.
read_args – Additional keyword arguments in a dictionary that should always be passed to read().
write_args – Additional keyword arguments in a dictionary that should always be passed to write().
post_reader – A reference to a function that will be called after reading a file. Can be used for post-processing or field selection, etc. Its signature must be callable(file_info, file_data).
temp_dir – Your own temporary directory that this FileSet object should use for compressing and decompressing files. By default it uses the directory returned by tempfile.gettempdir().
compress – If true and path ends with a compression suffix (such as .zip, .gz, .bz2, etc.), newly created files will be compressed after writing them to disk. Default value is true.
decompress – If true and path ends with a compression suffix (such as .zip, .gz, .bz2, etc.), files will be decompressed before reading them. Default value is true.
fs – An instance of an implementation of fsspec.spec.AbstractFileSystem. Passing a remote filesystem implementation allows searching for and opening files on remote file systems, such as Amazon S3 using s3fs.S3FileSystem.
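The rules for info_via="both" and for a missing ending time can be illustrated with a minimal sketch. It is not typhon's implementation: the helper resolve_file_info and the dictionary keys start, end and source are hypothetical names chosen for this example only.

```python
from datetime import datetime, timedelta


def resolve_file_info(filename_info, handler_info=None, time_coverage=None):
    """Combine file information as the parameter descriptions suggest.

    Hypothetical helper for illustration; not part of typhon.
    """
    # info_via="both": start from the filename information and let the
    # file handler overwrite any conflicting entries.
    info = dict(filename_info)
    if handler_info:
        info.update(handler_info)

    # Missing ending time: use start + time_coverage if a relative
    # coverage is given, otherwise fall back to the starting time.
    if info.get("end") is None:
        if time_coverage is not None:
            info["end"] = info["start"] + time_coverage
        else:
            info["end"] = info["start"]
    return info


from_filename = {"start": datetime(2017, 1, 1, 12), "end": None, "source": "filename"}
from_handler = {"source": "handler"}

with_coverage = resolve_file_info(from_filename, from_handler, timedelta(hours=1))
without_coverage = resolve_file_info(from_filename)
print(with_coverage["end"], with_coverage["source"])
print(without_coverage["end"], without_coverage["source"])
```

With a relative coverage of one hour the ending time becomes 13:00; without it, the ending time equals the starting time.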
You can use regular expressions or placeholders in path to generalize the files path. Placeholders are captured and returned by file-finding methods such as find(). Temporal placeholders are converted to datetime objects and represent a file’s time coverage. Allowed temporal placeholders in the path argument are:

===========  ============================================  ==========
Placeholder  Description                                   Example
===========  ============================================  ==========
year         Four digits indicating the year.              1999
year2        Two digits indicating the year. [1]           58 (=2058)
month        Two digits indicating the month.              09
day          Two digits indicating the day.                08
doy          Three digits indicating the day of the year.  002
hour         Two digits indicating the hour.               22
minute       Two digits indicating the minute.             58
second       Two digits indicating the second.             58
millisecond  Three digits indicating the millisecond.      999
===========  ============================================  ==========
All of these placeholders may also carry the prefix end_ (e.g. end_year). They then represent the end of the time coverage.
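How a path template with temporal placeholders could map to a starting and ending time is sketched below. This is a simplified illustration, not typhon's parser: the helpers template_to_regex and parse_times are hypothetical, only a subset of the placeholders is handled, all literal text is escaped (so regular expressions and the ‘*’ wildcard are not supported), and letting missing end_ fields fall back to the start fields is an assumption of this sketch.

```python
import re
from datetime import datetime

# Number of digits per placeholder (subset of the documented placeholders).
FIELDS = {"year": 4, "month": 2, "day": 2, "hour": 2, "minute": 2, "second": 2}
DEFAULTS = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0}


def template_to_regex(template):
    """Build a regex with one named group per placeholder (hypothetical)."""
    # re.split with a capturing group yields literal text at even
    # indices and placeholder names at odd indices.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)
        else:
            base = part[4:] if part.startswith("end_") else part
            pattern += f"(?P<{part}>\\d{{{FIELDS[base]}}})"
    return re.compile(pattern + "$")


def parse_times(template, path):
    """Return (start, end) parsed from path, or None if it does not match."""
    match = template_to_regex(template).match(path)
    if match is None:
        return None
    found = {name: int(value) for name, value in match.groupdict().items()}
    start = {**DEFAULTS,
             **{k: v for k, v in found.items() if not k.startswith("end_")}}
    # Assumption: end fields that are not given reuse the start fields.
    end = {**start,
           **{k[4:]: v for k, v in found.items() if k.startswith("end_")}}
    return datetime(**start), datetime(**end)


template = ("/dir/{year}/{month}/{day}/"
            "{hour}{minute}{second}-{end_hour}{end_minute}{end_second}.nc")
times = parse_times(template, "/dir/2017/01/01/120000-130000.nc")
print(times)
```

For the example path, the starting time is 2017-01-01 12:00:00 and the ending time is 2017-01-01 13:00:00.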
Moreover, you can define your own placeholders by using the parameter placeholder or set_placeholders(). Their names must consist of alphanumeric characters (underscores are also allowed).
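The idea behind user-defined placeholders, a mapping from placeholder name to a regular expression, can be sketched in plain Python. The helper add_placeholders and the placeholder names satellite and orbit are hypothetical and only illustrate the naming rule and the name-to-regex mapping; typhon's own handling may differ.

```python
import re


def add_placeholders(template, placeholders):
    """Replace {name} by a named regex group for each user placeholder.

    Hypothetical helper for illustration; not part of typhon.
    """
    for name, pattern in placeholders.items():
        # Names may only contain alphanumeric characters and underscores.
        if not re.fullmatch(r"[A-Za-z0-9_]+", name):
            raise ValueError(f"Invalid placeholder name: {name!r}")
        template = template.replace("{" + name + "}", f"(?P<{name}>{pattern})")
    return template


user_placeholders = {"satellite": r"[a-z]+\d+", "orbit": r"\d{5}"}
regex = re.compile(
    add_placeholders(r"/dir/{satellite}_{orbit}\.nc", user_placeholders)
)

match = regex.match("/dir/noaa18_01234.nc")
captured = match.groupdict()
print(captured)
```

With FileSet, the same kind of mapping would be passed through the placeholder parameter or as keyword arguments to set_placeholders(), and the captured values are returned by the file-finding methods.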
Methods
__init__(path[, handler, name, info_via, ...]) – Initialize a FileSet object.
align(other[, start, end, matches, ...]) – Collect files from this fileset and a matching other fileset
collect([start, end, files, return_info]) – Load all files between two dates sorted by their starting time
copy() – Create a so-called deep-copy of this fileset object
delete([dry_run]) – Remove files in this fileset from the disk
detect(test, *args, **kwargs) – Search for anomalies in fileset
dislink(name_or_fileset) – Remove the link between this and another fileset
exclude_files(filenames)
exclude_times(periods)
find([start, end, sort, only_path, bundle, ...]) – Find all files of this fileset in a given time period.
find_closest(timestamp[, filters]) – Find the closest file to a timestamp
get_filename(times[, template, fill]) – Generate the full path and name of a file for a time period
get_info(file_info[, retrieve_via]) – Get information about a file.
get_placeholders() – Get placeholders for this FileSet.
icollect([start, end, files]) – Load all files between two dates sorted by their starting time
imap(*args, **kwargs) – Apply a function on files and return the result immediately
is_excluded(file) – Checks whether a file is excluded from this FileSet.
link(other_fileset[, linker]) – Link this fileset with another FileSet
load_cache(filename) – Load the information cache from a JSON file
make_dirs(filename)
map(func[, args, kwargs, files, on_content, ...]) – Apply a function on files of this fileset with parallel workers
match(other[, start, end, max_interval, ...]) – Find matching files between two filesets
move([target, convert, copy]) – Move (or copy) files from this fileset to another location
parse_filename(filename[, template]) – Parse the filename with temporal and additional regular expressions.
read(file_info, **read_args) – Open and read a file
reset_cache() – Reset the information cache
save_cache(filename) – Save the information cache to a JSON file
set_placeholders(**placeholders) – Set placeholders for this FileSet.
to_dataframe([include_times]) – Create a pandas.Dataframe from this FileSet
write(data, file_info[, in_background]) – Write content to a file by using the FileSet's file handler.
Attributes
default_handler
name – Get or set the fileset's name.
path – Gets or sets the path to the fileset's files.
time_coverage – Get and set the time coverage of the files of this fileset
year2_threshold