map

Collocations.map(func, args=None, kwargs=None, files=None, on_content=False, pass_info=None, read_args=None, output=None, max_workers=None, worker_type=None, return_info=False, error_to_warning=False, **find_kwargs)

Apply a function on files of this fileset with parallel workers

This method can use multiple worker processes or threads to speed up processing significantly. Depending on the system you work on, you should try different numbers for max_workers.

Use this if you need to process the files as fast as possible without needing to retrieve the results immediately. Otherwise you should consider using imap() in a for-loop.
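
The difference between collecting all results first and consuming them as they arrive can be sketched with the standard library alone. The snippet below is only an analogy built on concurrent.futures, not this library's implementation; process_file and the file names are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(filename):
    # Hypothetical stand-in for the work done on a single file.
    return filename.upper()


filenames = ["a.nc", "b.nc", "c.nc"]

# map()-like usage: wait until every file is processed, then use the list.
with ThreadPoolExecutor(max_workers=2) as pool:
    all_results = list(pool.map(process_file, filenames))

# imap()-like usage: handle each result as soon as it becomes available,
# while later files may still be in progress.
streamed = []
with ThreadPoolExecutor(max_workers=2) as pool:
    for result in pool.map(process_file, filenames):
        streamed.append(result)
```

Both variants yield the same values here; the difference lies only in when the calling code gets to see them.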

Notes

This method sorts the results by the starting time of the files unless sort is False.

Parameters
  • func – A reference to a function that should be applied.

  • args – A list/tuple with positional arguments that should be passed to func. It will be extended with the file arguments, i.e. a FileInfo object if on_content is false or - if on_content and pass_info are true - the read content of a file and its corresponding FileInfo object.

  • kwargs – A dictionary with keyword arguments that should be passed to func.

  • files – If you already have a list of files that you want to process, pass it here. The list can contain filenames or lists (bundles) of filenames. If this parameter is given, start and end must not be set.

  • on_content – If true, the file will be read before func will be applied. The content will then be passed to func.

  • pass_info – If on_content is true, this decides whether also the FileInfo object of the read file should be passed to func. Default is false.

  • read_args – Additional keyword arguments that will be passed to the reading function (see read() for more information). Will be ignored if on_content is False.

  • output – Set this to a path containing placeholders or a FileSet object and the return value of func will be copied there if it is not None.

  • max_workers – Maximum number of parallel workers to use. If performance is poor, try changing this number.

  • worker_type – The type of the workers that will be used to parallelize func. Can be process or thread. If func is a function that needs to share a lot of data with its parallelized copies, you should set this to thread. Note that this may reduce the performance due to Python’s Global Interpreter Lock (GIL, see https://stackoverflow.com/q/1294382).

  • worker_initializer – DEPRECATED! Must be a reference to a function that is called once when initialising a new worker. Can be used to preload variables into a worker’s workspace. See also https://docs.python.org/3.1/library/multiprocessing.html#module-multiprocessing.pool for more information.

  • worker_initargs – DEPRECATED! A tuple with arguments for worker_initializer.

  • return_info – If true, return a FileInfo object with each return value indicating to which file the function was applied.

  • error_to_warning – Normally, if an exception is raised during reading of a file, this method is aborted. However, if you set this to true, only a warning is given and None is returned. This parameter will be ignored if on_content=True.

  • **find_kwargs – Additional keyword arguments that are allowed for find() such as start or end.
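
How args is extended with the file arguments, as described for args, on_content and pass_info above, can be sketched as follows. build_call_args is a hypothetical helper written only for illustration and is not part of the library; the order "content first, then FileInfo" is an assumption taken from the wording of the args description, and plain strings and dicts stand in for real FileInfo objects and file contents.

```python
def build_call_args(args, file_info, content=None,
                    on_content=False, pass_info=False):
    """Hypothetical helper: positional arguments that func would receive."""
    call_args = list(args or ())
    if not on_content:
        # func works on the file itself and only gets its FileInfo.
        call_args.append(file_info)
    else:
        # func works on the already-read content ...
        call_args.append(content)
        if pass_info:
            # ... optionally followed by the FileInfo (assumed order).
            call_args.append(file_info)
    return tuple(call_args)


info = "FileInfo(file_a)"    # placeholder for a FileInfo object
data = {"data": [1, 2, 3]}   # placeholder for the content of the file

only_info = build_call_args(("value1",), info)
only_content = build_call_args(("value1",), info, data, on_content=True)
content_and_info = build_call_args(
    ("value1",), info, data, on_content=True, pass_info=True
)
```

Under these assumptions, a function used with on_content=True and pass_info=True needs one parameter for each entry of args, followed by one for the content and one for the FileInfo object.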

Returns

A list with tuples of a FileInfo object and the return value of the function applied to this file. If output is set, the second element is not the return value but a boolean value indicating whether the return value was not None.
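
The effect of output on the returned list can be illustrated with a small sketch. summarize_outputs is a hypothetical helper, not a library function, and plain strings stand in for FileInfo objects; it only mirrors the rule stated above.

```python
def summarize_outputs(results):
    """Hypothetical helper: replace each return value by a 'was not None' flag."""
    return [(info, value is not None) for info, value in results]


# What func returned for two files (placeholders for FileInfo objects).
raw = [("file_a", (1.5, 3.0)), ("file_b", None)]

# Shape of the list if output were set: values become booleans.
flags = summarize_outputs(raw)
```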

Examples

## Imagine you want to calculate some statistical values from the
## data of the files
def calc_statistics(content, file_info=None):
    # file_info is only passed if pass_info is true, so keep it optional
    # return the mean and maximum value
    return content["data"].mean(), content["data"].max()

results = fileset.map(
    calc_statistics, start="2018-01-01", end="2018-01-02",
    on_content=True, return_info=True,
)

# This will be run after processing all files...
for file, result in results:
    print(file) # prints the FileInfo object
    print(result) # prints the mean and maximum value

## If you need the results directly, you can use imap instead:
results = fileset.imap(
    calc_statistics, start="2018-01-01", end="2018-01-02",
    on_content=True,
)

for result in results:
    # After the first file has been processed, this will be run
    # immediately ...
    print(result) # prints the mean and maximum value

If you need to pass some args to the function, use the parameters args and kwargs:

def calc_statistics(arg1, content, kwarg1=None):
    # return the mean and maximum value
    return content["data"].mean(), content["data"].max()

# Note: If you do not use the start or the end parameter, all
# files in the fileset are going to be processed:
results = fileset.map(
    calc_statistics, args=("value1",),
    kwargs={"kwarg1": "value2"}, on_content=True,
)