sda.api.file_discovery#

File and path discovery utilities for SDA test data.

This module contains functions for discovering test files, resolving paths, and scanning directory structures for data files.

Attributes#

Functions#

resolve_test_path(test_name[, data_sharepoint])

Resolve test name to folder path.

discover_data_file(folder_path, test_name[, ...])

Find the actual data file to load in the specified folder.

get_base_folders()

Get the base folder paths for the data and Spark sites.

get_data_folder(test_name)

Get the data folder based on the test name.

list_all_files([filter, max_recursion_level, verbose])

List all files in the data folders, using filter.

list_all_tests([include_2021_2022, ...])

List all available test names from the data folders.

list_all_files_in_test(test_name[, filter, ...])

List all files available for a specific test (T*** tests only).

Module Contents#

sda.api.file_discovery.CH5_CAMPAIGN_TESTS = ['T218', 'T219', 'T220', 'T221', 'T235', 'T236', 'T240', 'T241', 'T244', 'T245', 'T253', 'T256',...#
sda.api.file_discovery.resolve_test_path(test_name, data_sharepoint='auto')#

Resolve test name to folder path.

Parameters:
  • test_name (str | Path) – Name of the test to load, e.g. T135, or direct path to a file.

  • data_sharepoint (str, optional) – Name of data sharepoint where the test is located, by default “auto”.

Returns:

Path to the folder containing the test data.

Return type:

Path

Raises:

ValueError – If data_sharepoint is specified together with a direct path, or if the sharepoint does not exist.

Examples

>>> from sda.api.file_discovery import resolve_test_path
>>> folder_path = resolve_test_path("T183")
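The ValueError rule above (a direct path cannot be combined with an explicit data_sharepoint) can be illustrated with a minimal sketch. The helper name check_sharepoint_argument and the rule used to recognize a direct path are assumptions made for illustration; they are not taken from the sda source.

```python
from pathlib import Path


def check_sharepoint_argument(test_name, data_sharepoint="auto"):
    """Return True if test_name is treated as a direct path (hypothetical helper)."""
    # Assumption: a Path object, or a string containing a path separator,
    # counts as a direct path. The real detection rule may differ.
    is_direct_path = isinstance(test_name, Path) or any(
        sep in str(test_name) for sep in ("/", "\\")
    )
    if is_direct_path and data_sharepoint != "auto":
        raise ValueError(
            f"data_sharepoint cannot be combined with a direct path: {test_name!r}"
        )
    return is_direct_path


# A bare test name is not a direct path, so any sharepoint value is accepted.
name_is_path = check_sharepoint_argument("T183")
# A Path object is a direct path and is only valid with the default "auto".
path_is_path = check_sharepoint_argument(Path("data/T183/T183.xlsx"))
```

Checking the argument combination up front keeps the error close to the caller's mistake instead of surfacing later as a missing folder.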

See also

get_data_folder()

Get data folder for test name

discover_data_file()

Find data file in resolved folder

sda.api.file_discovery.discover_data_file(folder_path, test_name, datafilename_filter='*.xls*')#

Find the actual data file to load in the specified folder.

Parameters:
  • folder_path (Path) – Path to the folder to search in.

  • test_name (str | Path) – Name of the test to search for.

  • datafilename_filter (str, optional) – Filter to apply to the data files, by default “*.xls*” (matches .xlsx, .xlsm, .xls).

Returns:

Path to the data file.

Return type:

Path

Raises:

FileNotFoundError – If no file or multiple files are found matching the criteria.

Examples

>>> from pathlib import Path
>>> from sda.api.file_discovery import discover_data_file
>>> folder = Path("~/data/T183").expanduser()  # Path does not expand "~" by itself
>>> file_path = discover_data_file(folder, "T183", "*.xls*")
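The "exactly one match" behavior described under Raises can be sketched with the standard library alone. This is an illustration under assumptions, not the sda implementation: the helper name find_single_file is hypothetical, and it ignores the test_name argument that the real function also uses.

```python
import tempfile
from pathlib import Path


def find_single_file(folder_path, pattern="*.xls*"):
    """Return the only file in folder_path matching pattern (illustrative helper)."""
    matches = sorted(p for p in Path(folder_path).glob(pattern) if p.is_file())
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern!r} in {folder_path}")
    if len(matches) > 1:
        names = ", ".join(p.name for p in matches)
        raise FileNotFoundError(
            f"Several files match {pattern!r} in {folder_path}: {names}"
        )
    return matches[0]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "T183.xlsx").touch()
    (folder / "notes.txt").touch()  # ignored by the "*.xls*" filter
    found_name = find_single_file(folder).name

    # A second Excel file makes the lookup ambiguous.
    (folder / "T183_copy.xlsm").touch()
    try:
        find_single_file(folder)
        ambiguous_raises = False
    except FileNotFoundError:
        ambiguous_raises = True
```

Raising on an ambiguous match, rather than picking the first file, avoids silently loading the wrong data file.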

See also

resolve_test_path()

Resolve test name to folder path

sda.api.file_discovery.get_base_folders()#

Get the base folder paths for the data and Spark sites.

Returns:

Dictionary containing the paths to the data and Spark sites, with the user home shortcut (~) expanded.

Return type:

dict[str, Path]

Raises:
  • ValueError – If the configuration file does not contain the required keys.

  • FileNotFoundError – If the data folder or Spark sites path does not exist.

Notes

This function reads the configuration file to get the paths for the old data site and Spark sites. It checks for the required keys in the configuration and raises appropriate errors if they are missing. The paths are expanded to their full absolute paths.

In CI environments, or when the SDA_USE_TEST_DATA environment variable is set, test data fixtures are automatically included.
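The steps in the notes (read the configuration, check the required keys, expand the paths, verify that they exist) can be sketched as follows. The key names DATA_FOLDER and SPARK_SITES and the helper load_base_folders are hypothetical, since the actual keys of the configuration file are not listed here.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical key names, chosen only for this illustration.
REQUIRED_KEYS = ("DATA_FOLDER", "SPARK_SITES")


def load_base_folders(config_file):
    """Read folder paths from a JSON file, validating keys and existence."""
    config = json.loads(Path(config_file).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"Missing keys in {config_file}: {missing}")
    # expanduser() turns a leading "~" into the user's home directory.
    folders = {key: Path(config[key]).expanduser() for key in REQUIRED_KEYS}
    for key, path in folders.items():
        if not path.exists():
            raise FileNotFoundError(f"{key} path does not exist: {path}")
    return folders


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "spark").mkdir()
    config_file = root / "sda.json"
    config_file.write_text(
        json.dumps({"DATA_FOLDER": str(root / "data"), "SPARK_SITES": str(root / "spark")}),
        encoding="utf-8",
    )
    folders = load_base_folders(config_file)
    folder_keys = sorted(folders)
    all_exist = all(path.is_dir() for path in folders.values())

    # Dropping a required key triggers the ValueError branch.
    config_file.write_text(json.dumps({"DATA_FOLDER": str(root / "data")}), encoding="utf-8")
    try:
        load_base_folders(config_file)
        missing_key_raises = False
    except ValueError:
        missing_key_raises = True
```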

sda.api.file_discovery.get_data_folder(test_name)#

Get the data folder based on the test name.

Parameters:

test_name (str) – Name of the test to load, e.g. T135.

Returns:

Path to the data folder.

Return type:

Path

Notes

The corresponding data sharepoint site is automatically determined by the test name:

  • if TXXX with 239<=XXX; returns “6. DATA 2025 S1/Gif/[test_name]”

  • if TXXX with 192<=XXX<239; returns “6. DATA 2024 S2/Gif/[test_name]”

  • if TXXX with 100<=XXX<192; returns “SPARK/6. Données/2024/[test_name]” (except T102, T104 in “2023/231218” and T105 in “2023/231221”)

  • if 22XX, returns “SPARK/6. Données/2022/[test_name]”

  • if 21XX, returns “SPARK/6. Données/2021/[test_name]”

  • if 20XX, returns “SPARK/6. Données/2020/[test_name]”

  • some test names also correspond to CH-5 campaigns; these are stored on the “7. DATA 2024-2025CH5” Sharepoint. The list of these tests is given in CH5_CAMPAIGN_TESTS.

Test data fixtures (T097, T083) are supported for CI testing.
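As a rough illustration of the name-based dispatch, the sketch below covers only the SPARK-site rules listed above (T100 to T191 and the year-prefixed names). The helper spark_relative_folder is hypothetical, and the placement of the T102, T104 and T105 exceptions inside the quoted subfolders is an assumption. Tests numbered T192 and above, CH-5 campaign tests and CI fixtures are deliberately left out and return None.

```python
import re

SPARK_ROOT = "SPARK/6. Données"

# Exceptions quoted in the notes. Assumption: the quoted subfolder replaces
# the year folder, and the test folder sits directly inside it.
EXCEPTIONS_2023 = {
    "T102": "2023/231218",
    "T104": "2023/231218",
    "T105": "2023/231221",
}


def spark_relative_folder(test_name):
    """Return the SPARK-site relative folder for a test name, or None if not covered."""
    if test_name in EXCEPTIONS_2023:
        return f"{SPARK_ROOT}/{EXCEPTIONS_2023[test_name]}/{test_name}"
    match = re.fullmatch(r"T(\d{3})", test_name)
    if match:
        number = int(match.group(1))
        if 100 <= number < 192:
            return f"{SPARK_ROOT}/2024/{test_name}"
        return None  # outside the T100 to T191 range: not handled by this sketch
    if test_name[:2] in ("20", "21", "22"):
        # Year-prefixed names such as "21s16" map to the matching year folder.
        return f"{SPARK_ROOT}/20{test_name[:2]}/{test_name}"
    return None


folder_t135 = spark_relative_folder("T135")
folder_21s16 = spark_relative_folder("21s16")
folder_t105 = spark_relative_folder("T105")
folder_t239 = spark_relative_folder("T239")
```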

sda.api.file_discovery.list_all_files(filter='*.xls*', max_recursion_level=None, verbose=0)#

List all files in the data folders, using filter.

Data folder path is read from the configuration file ~/sda.json.

Parameters:
  • filter (str, optional) – Filter to apply to the file names, by default “*.xls*” (matches .xlsx, .xlsm, .xls). Use * for all files.

  • max_recursion_level (int, optional) – Maximum recursion depth for directory scanning. If None (default), recursion is unlimited (the original behavior). A value of 2 gives about a 38x speedup with no loss of test discovery accuracy, provided the data folder architecture is respected; a value of 3 gives about a 7x speedup. Higher values reduce the performance gain but remain compatible with deeper file structures.

  • verbose (int, optional) – Verbosity level, by default 0.

Returns:

List of Path objects pointing to discovered files.

Return type:

list

Examples

Use no filter:

>>> from sda.api.file_discovery import list_all_files
>>> datafiles = list_all_files("*")

Get only Excel files:

>>> excel_files = list_all_files("*.xls*")
>>> print(f"Found {len(excel_files)} Excel files")

Use limited recursion for massive performance improvement:

>>> # 38x faster for test discovery with perfect accuracy
>>> fast_files = list_all_files("*.xls*", max_recursion_level=2)
>>> # 7x faster with deeper file compatibility
>>> files = list_all_files("*.xls*", max_recursion_level=3)

Print files in max_recursion_level=3 but not in max_recursion_level=2:

file_paths_2 = list_all_files(filter="*.xls*", max_recursion_level=2)
file_paths_3 = list_all_files(filter="*.xls*", max_recursion_level=3)
print("Files in max_recursion_level=3 but not in max_recursion_level=2:")
for path in file_paths_3:
    if path not in file_paths_2:
        print(path)
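To show why a recursion limit speeds up scanning, here is a minimal depth-limited walk built on os.walk. It is a sketch under assumptions, not the sda implementation: the helper scan_files is hypothetical, and its depth convention (the root folder is depth 0) may not match the meaning of max_recursion_level.

```python
import fnmatch
import os
import tempfile
from pathlib import Path


def scan_files(root, pattern="*.xls*", max_depth=None):
    """Return files under root matching pattern, descending at most max_depth levels.

    Depth convention (an assumption): the root folder itself is at depth 0.
    """
    root = Path(root)
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = len(Path(dirpath).relative_to(root).parts)
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # prune in place so os.walk does not descend further
        results.extend(
            Path(dirpath) / name
            for name in filenames
            if fnmatch.fnmatch(name, pattern)
        )
    return sorted(results)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    deep = root / "2024" / "T183" / "raw"
    deep.mkdir(parents=True)
    (root / "2024" / "T183" / "T183.xlsx").touch()  # folder at depth 2
    (deep / "extra.xlsx").touch()                    # folder at depth 3
    shallow = [p.name for p in scan_files(root, max_depth=2)]
    full = [p.name for p in scan_files(root)]
```

Pruned subtrees are never listed at all, which is where the time is saved; the cost is that files stored deeper than the limit are not reported.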

See also

list_all_tests()

List available test names from files

list_all_files_in_test()

List all files for a specific test

sda.api.file_discovery.list_all_tests(include_2021_2022=True, include_2023_current=True, filter='*', max_recursion_level=2, verbose=0)#

List all available test names from the data folders.

This function discovers test files and extracts their test names, providing a dynamic way to find available tests without hardcoding.

Parameters:
  • include_2021_2022 (bool, default True) – Whether to include 2021-2022 test data.

  • include_2023_current (bool, default True) – Whether to include 2023-current test data.

  • filter (str, default "*") – Filter pattern for test names (supports wildcards). Examples: “T1*” for tests starting with T1, “*2021*” for tests containing 2021.

  • max_recursion_level (int, optional) – Maximum recursion depth for directory scanning, by default 2. If None, recursion is unlimited. A value of 2 gives about a 38x speedup with no loss of accuracy, provided the data folder architecture is respected.

  • verbose (int, default 0) – Verbosity level.

Returns:

List of available test names (e.g., [“21s16”, “T183”, “T196”]).

Return type:

list

Examples

>>> from sda.api.file_discovery import list_all_tests
>>> tests = list_all_tests()
>>> print(f"Found {len(tests)} tests: {tests[:5]}...")  # Show first 5
>>> # Get only 2023+ tests
>>> recent_tests = list_all_tests(include_2021_2022=False)
>>> # Get tests starting with T1
>>> t1_tests = list_all_tests(filter="T1*")
>>> # Get tests containing specific patterns
>>> filtered_tests = list_all_tests(filter="*183*")
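The wildcard patterns accepted by filter behave like shell-style globs. The sketch below reproduces that matching with the standard fnmatch module on a hand-written list of names; the helper filter_test_names is hypothetical, and case-sensitive matching is an assumption about the real function.

```python
from fnmatch import fnmatchcase


def filter_test_names(names, pattern="*"):
    """Keep the names matching a shell-style wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


names = ["21s16", "T183", "T196", "T240"]
t1_tests = filter_test_names(names, "T1*")          # names starting with "T1"
containing_183 = filter_test_names(names, "*183*")  # names containing "183"
everything = filter_test_names(names)               # default "*" keeps all names
```

Note that a pattern without wildcards, such as "183", only matches a name that is exactly "183"; surrounding it with asterisks is what turns it into a "contains" filter.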

See also

list_all_files()

List all data files in folders

list_all_files_in_test()

List all files for a specific test

sda.api.file_discovery.list_all_files_in_test(test_name, filter='*', max_recursion_level=2, verbose=0)#

List all files available for a specific test (T*** tests only).

This function discovers all files within a test’s data folder, providing users with a way to explore what data files are available for analysis.

Note: This function only works with 2023+ test names that start with “T” (e.g., “T183”, “T196”). For 2021-2022 tests (e.g., “21s16”), use list_all_files() with appropriate filters instead.

Parameters:
  • test_name (str) – Name of the test to search for files, must start with “T” (e.g., “T183”, “T196”).

  • filter (str, optional) – Filter pattern for file names, by default “*” (all files). Examples: “*.xlsx” for Excel files, “*.csv” for CSV files, “*.txt” for text files.

  • max_recursion_level (int, optional) – Maximum recursion depth for directory scanning, by default 2. Provides optimal performance while maintaining compatibility.

  • verbose (int, optional) – Verbosity level, by default 0.

Returns:

List of Path objects pointing to all files found in the test folder.

Return type:

list

Raises:
  • ValueError – If the test name doesn’t start with “T” (not a 2023+ test format).

  • FileNotFoundError – If the test folder cannot be found or accessed.

Examples

List all files for a T*** test:

>>> from sda.api.file_discovery import list_all_files_in_test
>>> files = list_all_files_in_test("T183")
>>> print(f"Found {len(files)} files for T183")

List only Excel files:

>>> excel_files = list_all_files_in_test("T183", filter="*.xls*")
>>> print(f"Excel files: {[f.name for f in excel_files]}")

List with different file types:

>>> # Get all CSV files
>>> csv_files = list_all_files_in_test("T183", filter="*.csv")
>>>
>>> # Get all image files
>>> images = list_all_files_in_test("T183", filter="*.png")
>>>
>>> # Get all files (default)
>>> all_files = list_all_files_in_test("T183")
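Once a list of Path objects has been returned, a quick summary by extension helps to see what kinds of data a test folder holds. The sketch below works on a hand-written list so that it runs without access to the data folders; the file names and the helper count_by_extension are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(files):
    """Count files per lowercase extension; files without one are grouped under ''."""
    return Counter(Path(f).suffix.lower() for f in files)


# Hypothetical file list standing in for the result of list_all_files_in_test("T183").
files = [
    Path("T183/T183.xlsx"),
    Path("T183/scope/shot1.CSV"),
    Path("T183/scope/shot2.csv"),
    Path("T183/README"),
]
summary = count_by_extension(files)
```

Lowercasing the suffix groups "shot1.CSV" and "shot2.csv" together, which matters when files come from instruments that use different naming conventions.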

See also

list_all_tests()

List all available test names

list_all_files()

List all files across all tests

get_data_folder()

Get the data folder for a test name