sda.api.file_discovery
======================

.. py:module:: sda.api.file_discovery

.. autoapi-nested-parse::

   File and path discovery utilities for SDA test data.

   This module contains functions for discovering test files, resolving paths,
   and scanning directory structures for data files.


Attributes
----------

.. autoapisummary::

   sda.api.file_discovery.CH5_CAMPAIGN_TESTS


Functions
---------

.. autoapisummary::

   sda.api.file_discovery.resolve_test_path
   sda.api.file_discovery.discover_data_file
   sda.api.file_discovery.get_base_folders
   sda.api.file_discovery.get_data_folder
   sda.api.file_discovery.list_all_files
   sda.api.file_discovery.list_all_tests
   sda.api.file_discovery.list_all_files_in_test


Module Contents
---------------

.. py:data:: CH5_CAMPAIGN_TESTS
   :value: ['T218', 'T219', 'T220', 'T221', 'T235', 'T236', 'T240', 'T241', 'T244', 'T245', 'T253', 'T256',...


.. py:function:: resolve_test_path(test_name, data_sharepoint = 'auto')

   Resolve a test name to its folder path.

   :param test_name: Name of the test to load, e.g. ``T135``, or a direct path to a file.
   :type test_name: :py:class:`str | Path`
   :param data_sharepoint: Name of the data sharepoint where the test is located, by default "auto".
   :type data_sharepoint: :py:class:`str`, *optional*

   :returns: Path to the folder containing the test data.
   :rtype: :py:class:`Path`

   :raises ValueError: If ``data_sharepoint`` is specified together with a direct path, or if the sharepoint doesn't exist.

   .. rubric:: Examples

   >>> from sda.api.file_discovery import resolve_test_path
   >>> folder_path = resolve_test_path("T183")

   .. seealso::

      :py:func:`~sda.api.file_discovery.get_data_folder`
          Get data folder for test name

      :py:func:`~sda.api.file_discovery.discover_data_file`
          Find data file in resolved folder


.. py:function:: discover_data_file(folder_path, test_name, datafilename_filter = '*.xls*')

   Find the actual data file to load in the specified folder.

   :param folder_path: Path to the folder to search in.
   :type folder_path: :py:class:`Path`
   :param test_name: Name of the test to search for.
   :type test_name: :py:class:`str | Path`
   :param datafilename_filter: Filter to apply to the data files, by default ``"*.xls*"`` (matches .xlsx, .xlsm, .xls).
   :type datafilename_filter: :py:class:`str`, *optional*

   :returns: Path to the data file.
   :rtype: :py:class:`Path`

   :raises FileNotFoundError: If no file or multiple files are found matching the criteria.

   .. rubric:: Examples

   >>> from pathlib import Path
   >>> from sda.api.file_discovery import discover_data_file
   >>> folder = Path("~/data/T183").expanduser()
   >>> file_path = discover_data_file(folder, "T183", "*.xls*")

   .. seealso::

      :py:func:`~sda.api.file_discovery.resolve_test_path`
          Resolve test name to folder path


.. py:function:: get_base_folders()

   Get the base folder paths for the data and Spark sites.

   :returns: Dictionary containing the paths to the Spark data sites, with the user home prefix (``~/``) expanded.
   :rtype: :py:class:`dict[str, Path]`

   :raises ValueError: If the configuration file does not contain the required keys.
   :raises FileNotFoundError: If the data folder or Spark sites path does not exist.

   .. rubric:: Notes

   This function reads the configuration file to get the paths for the old data site and Spark sites.
   It checks for the required keys in the configuration and raises appropriate errors if they are missing.
   The paths are expanded to their full absolute paths.

   In CI environments, or when the ``SDA_USE_TEST_DATA`` environment variable is set, test data fixtures are automatically included.


.. py:function:: get_data_folder(test_name)

   Get the data folder based on the test name.

   :param test_name: Name of the test to load, e.g. ``T135``.
   :type test_name: :py:class:`str`

   :returns: Path to the data folder.
   :rtype: :py:class:`Path`

   .. rubric:: Notes

   The corresponding data sharepoint site is automatically determined by the test name:

   - if TXXX with 239<=XXX; returns "6. DATA 2024 S2/Gif/[test_name]"
   - if TXXX with 192<=XXX<239; returns "6.
     DATA 2025 S1/Gif/[test_name]"
   - if TXXX with 100<=XXX<192; returns "SPARK/6. Données/2024/[test_name]" (except T102, T104 in "2023/231218" and T105 in "2023/231221")
   - if 22XX, returns "SPARK/6. Données/2022/[test_name]"
   - if 21XX, returns "SPARK/6. Données/2021/[test_name]"
   - if 20XX, returns "SPARK/6. Données/2020/[test_name]"
   - some test names also correspond to CH-5 campaigns; they are on the "7. DATA 2024-2025CH5" Sharepoint. See the list of tests in :py:data:`~sda.api.file_discovery.CH5_CAMPAIGN_TESTS`.

   Test data fixtures (T097, T083) are supported for CI testing.


.. py:function:: list_all_files(filter='*.xls*', max_recursion_level=None, verbose=0)

   List all files in the data folders, using filter.

   The data folder path is read from the configuration file ``~/sda.json``.

   :param filter: Filter to apply to the file names, by default ``"*.xls*"`` (matches .xlsx, .xlsm, .xls). Use ``*`` for all files.
   :type filter: :py:class:`str`, *optional*
   :param max_recursion_level: Maximum recursion depth for directory scanning. If ``None`` (default), uses unlimited recursion (original behavior). If 2, provides an optimal 38x speedup with perfect test discovery accuracy. If 3, provides a 7x speedup. Higher values reduce performance gains while maintaining compatibility with deeper file structures.
   :type max_recursion_level: :py:class:`int`, *optional*
   :param verbose: Verbosity level.
   :type verbose: :py:class:`int`

   :returns: List of Path objects pointing to discovered files.
   :rtype: :py:class:`list`

   .. rubric:: Examples

   Use no filter:

   >>> from sda.api.file_discovery import list_all_files
   >>> datafiles = list_all_files("*")

   Get only Excel files:

   >>> excel_files = list_all_files("*.xls*")
   >>> print(f"Found {len(excel_files)} Excel files")

   Use limited recursion for a large performance improvement:

   >>> # 38x faster for test discovery with perfect accuracy
   >>> fast_files = list_all_files("*.xls*", max_recursion_level=2)

   >>> # 7x faster with deeper file compatibility
   >>> files = list_all_files("*.xls*", max_recursion_level=3)

   Print files in max_recursion_level=3 but not in max_recursion_level=2::

       file_paths_2 = list_all_files(filter="*.xls*", max_recursion_level=2)
       file_paths_3 = list_all_files(filter="*.xls*", max_recursion_level=3)
       print("Files in max_recursion_level=3 but not in max_recursion_level=2:")
       for path in file_paths_3:
           if path not in file_paths_2:
               print(path)

   .. seealso::

      :py:func:`~sda.api.file_discovery.list_all_tests`
          List available test names from files

      :py:func:`~sda.api.file_discovery.list_all_files_in_test`
          List all files for a specific test


.. py:function:: list_all_tests(include_2021_2022=True, include_2023_current=True, filter='*', max_recursion_level=2, verbose=0)

   List all available test names from the data folders.

   This function discovers test files and extracts their test names,
   providing a dynamic way to find available tests without hardcoding.

   :param include_2021_2022: Whether to include 2021-2022 test data.
   :type include_2021_2022: :py:class:`bool`, *default* :py:obj:`True`
   :param include_2023_current: Whether to include 2023-current test data.
   :type include_2023_current: :py:class:`bool`, *default* :py:obj:`True`
   :param filter: Filter pattern for test names (supports wildcards). Examples: ``"T1*"`` for tests starting with T1, ``"*2021*"`` for tests containing 2021.
   :type filter: :py:class:`str`, *default* ``"*"``
   :param max_recursion_level: Maximum recursion depth for directory scanning. If ``None``, uses unlimited recursion.
       If 2, provides an optimal ~38x speedup with perfect accuracy (if the data file architecture is respected). Default 2.
   :type max_recursion_level: :py:class:`int`, *optional*
   :param verbose: Verbosity level.
   :type verbose: :py:class:`int`, *default* ``0``

   :returns: List of available test names (e.g., ``["21s16", "T183", "T196"]``).
   :rtype: :py:class:`list`

   .. rubric:: Examples

   >>> from sda.api.file_discovery import list_all_tests
   >>> tests = list_all_tests()
   >>> print(f"Found {len(tests)} tests: {tests[:5]}...")  # Show first 5

   >>> # Get only 2023+ tests
   >>> recent_tests = list_all_tests(include_2021_2022=False)

   >>> # Get tests starting with T1
   >>> t1_tests = list_all_tests(filter="T1*")

   >>> # Get tests containing specific patterns
   >>> filtered_tests = list_all_tests(filter="*183*")

   .. seealso::

      :py:func:`~sda.api.file_discovery.list_all_files`
          List all data files in folders

      :py:func:`~sda.api.file_discovery.list_all_files_in_test`
          List all files for a specific test


.. py:function:: list_all_files_in_test(test_name, filter='*', max_recursion_level=2, verbose=0)

   List all files available for a specific test (T*** tests only).

   This function discovers all files within a test's data folder, providing
   users with a way to explore what data files are available for analysis.

   **Note**: This function only works with 2023+ test names that start with "T"
   (e.g., "T183", "T196"). For 2021-2022 tests (e.g., "21s16"), use
   :py:func:`~sda.api.file_discovery.list_all_files` with appropriate filters instead.

   :param test_name: Name of the test to search for files, must start with "T" (e.g., "T183", "T196").
   :type test_name: :py:class:`str`
   :param filter: Filter pattern for file names, by default ``"*"`` (all files). Examples: ``"*.xlsx"`` for Excel files, ``"*.csv"`` for CSV files, ``"*.txt"`` for text files.
   :type filter: :py:class:`str`, *optional*
   :param max_recursion_level: Maximum recursion depth for directory scanning, by default 2. Provides optimal performance while maintaining compatibility.
   :type max_recursion_level: :py:class:`int`, *optional*
   :param verbose: Verbosity level, by default 0.
   :type verbose: :py:class:`int`, *optional*

   :returns: List of Path objects pointing to all files found in the test folder.
   :rtype: :py:class:`list`

   :raises ValueError: If the test name doesn't start with "T" (not a 2023+ test format).
   :raises FileNotFoundError: If the test folder cannot be found or accessed.

   .. rubric:: Examples

   List all files for a T*** test:

   >>> from sda.api.file_discovery import list_all_files_in_test
   >>> files = list_all_files_in_test("T183")
   >>> print(f"Found {len(files)} files for T183")

   List only Excel files:

   >>> excel_files = list_all_files_in_test("T183", filter="*.xls*")
   >>> print(f"Excel files: {[f.name for f in excel_files]}")

   List with different file types:

   >>> # Get all CSV files
   >>> csv_files = list_all_files_in_test("T183", filter="*.csv")
   >>>
   >>> # Get all image files
   >>> images = list_all_files_in_test("T183", filter="*.png")
   >>>
   >>> # Get all files (default)
   >>> all_files = list_all_files_in_test("T183")

   .. seealso::

      :py:func:`~sda.api.file_discovery.list_all_tests`
          List all available test names

      :py:func:`~sda.api.file_discovery.list_all_files`
          List all files across all tests

      :py:func:`~sda.api.file_discovery.get_data_folder`
          Get the data folder for a test name
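
The effect of a recursion limit such as ``max_recursion_level`` can be illustrated with a small standalone sketch. This is not the ``sda`` implementation: ``scan_limited`` and its ``max_depth`` argument are hypothetical names, and the sketch assumes that level 1 means files located directly inside the scanned folder. The demo builds a throwaway folder tree rather than touching real test data.

```python
import fnmatch
import os
import tempfile
from pathlib import Path


def scan_limited(root, pattern="*", max_depth=None):
    """Return files under ``root`` whose names match ``pattern``.

    Hypothetical helper for illustration only (not part of sda).
    Assumption: depth 1 is the root folder itself, depth 2 its direct
    subfolders, and so on. ``max_depth=None`` means unlimited recursion.
    """
    root = Path(root)
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        current = Path(dirpath)
        depth = len(current.relative_to(root).parts) + 1
        dirnames.sort()  # deterministic traversal order
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # prune in place so os.walk stops descending
        for name in sorted(filenames):
            if fnmatch.fnmatch(name, pattern):
                matches.append(current / name)
    return matches


# Throwaway folder tree that mimics a test folder with a nested subfolder.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "T183" / "raw" / "deep").mkdir(parents=True)
    (base / "T183" / "T183.xlsx").touch()
    (base / "T183" / "raw" / "scope.csv").touch()
    (base / "T183" / "raw" / "deep" / "old.xlsx").touch()

    shallow = scan_limited(base, "*.xls*", max_depth=2)
    full = scan_limited(base, "*.xls*")

print([p.name for p in shallow])  # ['T183.xlsx']
print([p.name for p in full])     # ['T183.xlsx', 'old.xlsx']
```

Pruning the directory list in place is what saves time: folders below the limit are never opened, so the cost depends on the number of shallow folders and not on the size of the whole tree. The trade-off is that files stored deeper than the limit, such as ``old.xlsx`` above, are not reported.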