sda.api#

SDA API module for loading and processing experimental data.

This module provides functions for loading test data from various sources including SharePoint sites, local files, and databases.

Submodules#

Exceptions#

TableNotFoundError

Custom exception raised when no Excel tables are found in a file.

Functions#

analyze_database_files([file_filter, max_files, ...])

Analyze database files across multiple data sources.

create_summary_dataframe(results_df, all_tests, all_files)

Create a comprehensive summary DataFrame from analysis results.

generate_database_report(results_df[, save_to_file])

Generate a detailed database report from parsing results.

print_quality_thermometer(results_df)

Print a visual quality thermometer showing data health metrics.

discover_data_file(folder_path, test_name[, ...])

Find the actual data file to load in the specified folder.

get_base_folders()

Get the base folder paths for the data and Spark sites.

get_data_folder(test_name)

Get the data folder based on the test name.

list_all_files([filter, max_recursion_level, verbose])

List all files in the data folders, using the given filter.

list_all_files_in_test(test_name[, filter, ...])

List all files available for a specific test (T*** tests only).

list_all_tests([include_2021_2022, ...])

List all available test names from the data folders.

resolve_test_path(test_name[, data_sharepoint])

Resolve test name to folder path.

load_test(test_name[, datafilename_filter, ...])

Load data from a specific test by name or path.

load_tests([test_names, include_2021_2022, ...])

Load multiple tests by names or discover all available tests.

parse_files(files[, command, columns_to_keep, ...])

Parse multiple files using the unified parser approach.

load_data(instrument_code[, start_time, end_time])

Load data from a specific table in the database based on the instrument code and time range.

parse_test_files(files[, drop_unnamed_columns, ...])

Parse experiment files using unified Excel table detection.

get_excel_core_properties(path)

Extract core properties from Excel file using direct XML parsing.

Package Contents#

sda.api.analyze_database_files(file_filter='*.xls*', max_files=None, verbose=True, include_details=False)#

Analyze database files across multiple data sources.

This function discovers data files, parses them individually while measuring success rates, and returns a comprehensive analysis of table detection, worksheets, and data structure.

Parameters:
  • file_filter (str, default "*.xls*") – Filter pattern for files to analyze (e.g., "*.xls*", "*.csv", "*").

  • max_files (int, optional) – Maximum number of files to process (useful for quick testing).

  • verbose (bool, default True) – Whether to print progress information.

  • include_details (bool, default False) – Whether to include detailed column information in the results.

Returns:

DataFrame with columns:

  • test_name: Name of the test

  • file_name: Name of the parsed file

  • file_path: Full path to the file

  • file_size_mb: Size of the file in megabytes

  • num_points: Number of data points (rows) in the parsed DataFrame

  • num_columns: Number of columns in the parsed DataFrame

  • parse_time_seconds: Time taken to parse the file

  • points_per_second: Parsing rate (points per second)

  • mb_per_second: File processing rate (MB per second)

  • status: Success/Error status

  • error_message: Error details if parsing failed

  • plume_url: URL to the test path

  • full_path: Full path to the file (only if include_details is True)

Return type:

pd.DataFrame

Examples

>>> from sda.api.performance import analyze_database_files

Analyze all Excel files:

>>> results = analyze_database_files(file_filter="*.xls*")

Quick test with the first 5 files:

>>> results = analyze_database_files(max_files=5, verbose=True)

Analyze specific file types:

>>> csv_results = analyze_database_files(file_filter="*.csv")

See also

generate_database_report

Generate detailed reports from analysis results

print_quality_thermometer

Display visual quality assessment

sda.api.create_summary_dataframe(results_df, all_tests, all_files)#

Create a comprehensive summary DataFrame from analysis results.

This function generates a structured summary of the analysis including:

  • Overall discovery metrics (tests, files)

  • File analysis metrics (success/failure rates)

  • Data volume metrics

  • Performance metrics

Parameters:
  • results_df (pd.DataFrame) – Results from analyze_database_files()

  • all_tests (list) – List of all discovered tests

  • all_files (list) – List of all discovered files

Returns:

Summary DataFrame with metrics organized by category

Return type:

pd.DataFrame

Examples

>>> from sda.api.performance import (
...     analyze_database_files,
...     create_summary_dataframe,
... )
>>> from sda.api import list_all_tests, list_all_files
>>>
>>> results = analyze_database_files(max_files=10)
>>> all_tests = list_all_tests()
>>> all_files = list_all_files()
>>> summary_df = create_summary_dataframe(results, all_tests, all_files)

sda.api.generate_database_report(results_df, save_to_file=None)#

Generate a detailed database report from parsing results.

Parameters:
  • results_df (pd.DataFrame) – Results from analyze_database_files()

  • save_to_file (str, optional) – Path to save the report (Excel format)

Returns:

Dictionary containing summary statistics

Return type:

dict

Examples

>>> from sda.api.performance import (
...     analyze_database_files,
...     generate_database_report,
... )
>>> results = analyze_database_files(max_files=10)
>>> summary = generate_database_report(results, save_to_file="report.xlsx")

See also

analyze_database_files

Run performance analysis on data files

print_quality_thermometer

Display visual quality metrics

sda.api.print_quality_thermometer(results_df)#

Print a visual quality thermometer showing data health metrics.

Parameters:

results_df (pd.DataFrame) – Results DataFrame from analyze_database_files()

Examples

>>> from sda.api.performance import (
...     analyze_database_files,
...     print_quality_thermometer,
... )
>>> results = analyze_database_files(max_files=10)
>>> print_quality_thermometer(results)

sda.api.discover_data_file(folder_path, test_name, datafilename_filter='*.xls*')#

Find the actual data file to load in the specified folder.

Parameters:
  • folder_path (Path) – Path to the folder to search in.

  • test_name (str | Path) – Name of the test to search for.

  • datafilename_filter (str, optional) – Filter to apply to the data files, by default "*.xls*" (matches .xlsx, .xlsm, .xls).

Returns:

Path to the data file.

Return type:

Path

Raises:

FileNotFoundError – If no file, or more than one file, matches the criteria.

Examples

>>> from pathlib import Path
>>> from sda.api.file_discovery import discover_data_file
>>> folder = Path("~/data/T183").expanduser()
>>> file_path = discover_data_file(folder, "T183", "*.xls*")

See also

resolve_test_path()

Resolve test name to folder path

sda.api.get_base_folders()#

Get the base folder paths for the data and Spark sites.

Returns:

Dictionary containing the paths to the Spark data sites, with the user directory (~/) expanded.

Return type:

dict[str, Path]

Raises:
  • ValueError – If the configuration file does not contain the required keys.

  • FileNotFoundError – If the data folder or Spark sites path does not exist.

Notes

This function reads the configuration file to get the paths for the old data site and Spark sites. It checks for the required keys in the configuration and raises appropriate errors if they are missing. The paths are expanded to their full absolute paths.

In CI environments or when SDA_USE_TEST_DATA environment variable is set, test data fixtures are automatically included.
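
Examples

A minimal usage sketch; the key names in the returned dictionary depend on your ~/sda.json configuration, so the loop just prints whatever sites are configured:

>>> from sda.api import get_base_folders
>>> folders = get_base_folders()
>>> for site, path in folders.items():
...     print(site, "->", path)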

sda.api.get_data_folder(test_name)#

Get the data folder based on the test name.

Parameters:

test_name (str) – Name of the test to load, e.g., T135.

Returns:

Path to the data folder.

Return type:

Path

Notes

The corresponding data SharePoint site is automatically determined by the test name:

  • if TXXX with 239 <= XXX, returns "6. DATA 2024 S2/Gif/[test_name]"

  • if TXXX with 192 <= XXX < 239, returns "6. DATA 2025 S1/Gif/[test_name]"

  • if TXXX with 100 <= XXX < 192, returns "SPARK/6. Données/2024/[test_name]" (except T102 and T104, in "2023/231218", and T105, in "2023/231221")

  • if 22XX, returns "SPARK/6. Données/2022/[test_name]"

  • if 21XX, returns "SPARK/6. Données/2021/[test_name]"

  • if 20XX, returns "SPARK/6. Données/2020/[test_name]"

  • some test names also correspond to CH-5 campaigns; they are stored on the "7. DATA 2024-2025CH5" SharePoint.

Test data fixtures (T097, T083) are supported for CI testing.
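
Examples

A short sketch of the name-based resolution described above; the parent folders depend on your configured SharePoint sites, but the final path component is the test name:

>>> from sda.api import get_data_folder
>>> folder = get_data_folder("T135")
>>> folder.name
'T135'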

sda.api.list_all_files(filter='*.xls*', max_recursion_level=None, verbose=0)#

List all files in the data folders, using the given filter.

Data folder path is read from the configuration file ~/sda.json.

Parameters:
  • filter (str, optional) – Filter to apply to the file names, by default "*.xls*" (matches .xlsx, .xlsm, .xls). Use "*" for all files.

  • max_recursion_level (int, optional) – Maximum recursion depth for directory scanning. If None (default), recursion is unlimited (the original behavior). A value of 2 scans roughly 38x faster while still discovering every test (assuming the standard folder layout); a value of 3 is roughly 7x faster. Higher values reduce the performance gain but remain compatible with deeper file structures.

  • verbose (int) – Verbosity level.

Returns:

List of Path objects pointing to discovered files.

Return type:

list

Examples

Use no filter:

>>> from sda.api.file_discovery import list_all_files
>>> datafiles = list_all_files("*")

Get only Excel files:

>>> excel_files = list_all_files("*.xls*")
>>> print(f"Found {len(excel_files)} Excel files")

Use limited recursion for massive performance improvement:

>>> # 38x faster for test discovery with perfect accuracy
>>> fast_files = list_all_files("*.xls*", max_recursion_level=2)
>>> # 7x faster with deeper file compatibility
>>> files = list_all_files("*.xls*", max_recursion_level=3)

Print files in max_recursion_level=3 but not in max_recursion_level=2:

file_paths_2 = list_all_files(filter="*.xls*", max_recursion_level=2)
file_paths_3 = list_all_files(filter="*.xls*", max_recursion_level=3)
print("Files in max_recursion_level=3 but not in max_recursion_level=2:")
for path in file_paths_3:
    if path not in file_paths_2:
        print(path)

See also

list_all_tests()

List available test names from files

list_all_files_in_test()

List all files for a specific test

sda.api.list_all_files_in_test(test_name, filter='*', max_recursion_level=2, verbose=0)#

List all files available for a specific test (T*** tests only).

This function discovers all files within a test’s data folder, providing users with a way to explore what data files are available for analysis.

Note: This function only works with 2023+ test names that start with “T” (e.g., “T183”, “T196”). For 2021-2022 tests (e.g., “21s16”), use list_all_files() with appropriate filters instead.

Parameters:
  • test_name (str) – Name of the test to search for files, must start with “T” (e.g., “T183”, “T196”).

  • filter (str, optional) – Filter pattern for file names, by default "*" (all files). Examples: "*.xlsx" for Excel files, "*.csv" for CSV files, "*.txt" for text files.

  • max_recursion_level (int, optional) – Maximum recursion depth for directory scanning, by default 2. Provides optimal performance while maintaining compatibility.

  • verbose (int, optional) – Verbosity level, by default 0.

Returns:

List of Path objects pointing to all files found in the test folder.

Return type:

list

Raises:
  • ValueError – If the test name doesn’t start with “T” (not a 2023+ test format).

  • FileNotFoundError – If the test folder cannot be found or accessed.

Examples

List all files for a T*** test:

>>> from sda.api.file_discovery import list_all_files_in_test
>>> files = list_all_files_in_test("T183")
>>> print(f"Found {len(files)} files for T183")

List only Excel files:

>>> excel_files = list_all_files_in_test("T183", filter="*.xls*")
>>> print(f"Excel files: {[f.name for f in excel_files]}")

List with different file types:

>>> # Get all CSV files
>>> csv_files = list_all_files_in_test("T183", filter="*.csv")
>>>
>>> # Get all image files
>>> images = list_all_files_in_test("T183", filter="*.png")
>>>
>>> # Get all files (default)
>>> all_files = list_all_files_in_test("T183")

See also

list_all_tests()

List all available test names

list_all_files()

List all files across all tests

get_data_folder()

Get the data folder for a test name

sda.api.list_all_tests(include_2021_2022=True, include_2023_current=True, filter='*', max_recursion_level=2, verbose=0)#

List all available test names from the data folders.

This function discovers test files and extracts their test names, providing a dynamic way to find available tests without hardcoding.

Parameters:
  • include_2021_2022 (bool, default True) – Whether to include 2021-2022 test data.

  • include_2023_current (bool, default True) – Whether to include 2023-current test data.

  • filter (str, default "*") – Filter pattern for test names (supports wildcards). Examples: "T1*" for tests starting with T1, "*2021*" for tests containing 2021.

  • max_recursion_level (int, optional) – Maximum recursion depth for directory scanning, by default 2. If None, recursion is unlimited. A value of 2 gives roughly a 38x speedup with full accuracy, assuming the standard data folder layout.

  • verbose (int, default 0) – Verbosity level.

Returns:

List of available test names (e.g., [“21s16”, “T183”, “T196”]).

Return type:

list

Examples

>>> from sda.api.file_discovery import list_all_tests
>>> tests = list_all_tests()
>>> print(f"Found {len(tests)} tests: {tests[:5]}...")  # Show first 5
>>> # Get only 2023+ tests
>>> recent_tests = list_all_tests(include_2021_2022=False)
>>> # Get tests starting with T1
>>> t1_tests = list_all_tests(filter="T1*")
>>> # Get tests containing specific patterns
>>> filtered_tests = list_all_tests(filter="*183*")

See also

list_all_files()

List all data files in folders

list_all_files_in_test()

List all files for a specific test

sda.api.resolve_test_path(test_name, data_sharepoint='auto')#

Resolve test name to folder path.

Parameters:
  • test_name (str | Path) – Name of the test to load (e.g., T135), or a direct path to the file.

  • data_sharepoint (str, optional) – Name of data sharepoint where the test is located, by default “auto”.

Returns:

Path to the folder containing the test data.

Return type:

Path

Raises:

ValueError – If data_sharepoint is specified together with a direct path, or if the specified sharepoint doesn't exist.

Examples

>>> from sda.api.file_discovery import resolve_test_path
>>> folder_path = resolve_test_path("T183")
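
The sharepoint can also be pinned explicitly instead of auto-detected; the site name below is an assumption standing in for whatever sites your configuration defines:

>>> folder_path = resolve_test_path("T135", data_sharepoint="SPARK")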

See also

get_data_folder()

Get data folder for test name

discover_data_file()

Find data file in resolved folder

sda.api.load_test(test_name, datafilename_filter='*.xls*', data_sharepoint='auto', command=None, columns_to_keep=None, verbose=1, column_not_found='raise', table_not_found='raise', suppress_polars_warnings=False, **kwargs)#

Load data from a specific test by name or path.

This is a high-level function that handles the complete workflow: path resolution, file discovery, and data parsing using the unified parser approach.

Parameters:
  • test_name (str | Path) – Name of the test to load (e.g., T135), or a direct path to the file.

  • datafilename_filter (str, optional) – Filter to apply to the data files, by default "*.xls*" (matches .xlsx, .xlsm, .xls).

  • data_sharepoint (str, optional) – Name of data sharepoint where the test is located, by default “auto”.

  • command (dict, optional) – Legacy reading commands. Now largely ignored in favor of automatic Excel table detection.

  • columns_to_keep (list, optional) – List of columns to keep (default: None).

  • verbose (int, default 1) – Verbosity level.

  • column_not_found (str, default 'raise') – What to do if a column is not found. Can be 'raise', 'warn', or 'ignore'.

  • table_not_found (str, default 'raise') – What to do if no Excel tables are found in a file. Can be ‘raise’ or ‘warn’.

  • suppress_polars_warnings (bool, default False) – If True, suppress Polars dtype warning messages during reading.

  • **kwargs – Additional arguments passed to the parser function.

Returns:

pandas.DataFrame containing the test data with automatic table detection.

Return type:

pandas.DataFrame

Examples

Load by test name:

>>> from sda.api.load import load_test
>>> df = load_test("T183")

Load with specific filter:

>>> df = load_test("T183", datafilename_filter="*_processed.xlsx")

Load from direct path:

>>> df = load_test("/path/to/data/T183_experiment.xlsx")

See also

resolve_test_path()

Resolve a test name or path to the canonical folder path.

discover_data_file()

Find a data file in the resolved test folder using a filename pattern.

parse_test_files()

Unified parser for all supported years (2021+).

parse_files()

Parse one or many files directly when you already know the paths.

sda.api.load_tests(test_names=None, include_2021_2022=True, include_2023_current=True)#

Load multiple tests by names or discover all available tests.

This function provides a convenient way to load multiple test datasets at once, with automatic discovery if no specific test names are provided.

Parameters:
  • test_names (list, optional) – List of test names to load. If None, discovers all available tests.

  • include_2021_2022 (bool, default True) – Whether to include 2021-2022 test data in discovery.

  • include_2023_current (bool, default True) – Whether to include 2023-current test data in discovery.

Returns:

pandas.DataFrame combined from all loaded tests with automatic table detection.

Return type:

pandas.DataFrame

Examples

Load specific tests:

>>> from sda.api.load import load_tests
>>> df = load_tests(["T183", "T196"])

Load all available tests:

>>> df = load_tests()  # Discovers and loads all tests

Load only recent tests:

>>> df = load_tests(include_2021_2022=False)

See also

list_all_tests()

Discover available test identifiers.

load_test()

Load a single test.

parse_files()

Parse multiple files directly.

sda.api.parse_files(files, command=None, columns_to_keep=None, verbose=1, column_not_found='warn', table_not_found='raise', **kwargs)#

Parse multiple files using the unified parser approach.

Since all files now use the same unified parser (parse_test_files with ExcelFileHandler), this function directly calls that parser without any grouping logic.

Parameters:
  • files (str, Path, list, or dict) – Files to parse. Same format as parse_test_files (unified approach).

  • command (dict, optional) – Default command for reading files. If None, will use parser-specific defaults.

  • columns_to_keep (list, optional) – List of columns to keep (only applies to 2023+ data).

  • verbose (int, default 1) – Verbosity level.

  • column_not_found (str, default 'warn') – What to do if a column is not found. Can be 'raise', 'warn', or 'ignore'.

  • table_not_found (str, default 'raise') – What to do if no Excel tables are found in a file. Can be ‘raise’ or ‘warn’.

  • **kwargs – Additional arguments passed to parser functions.

Returns:

pandas.DataFrame combined from all files with automatic table detection.

Return type:

pandas.DataFrame

Examples

Parse a single file:

>>> from sda.api.load import parse_files
>>> df = parse_files("T183.xlsx")

Parse multiple files:

>>> files = ["T183.xlsx", "T196.xlsx"]
>>> df = parse_files(files)

Parse with legacy command format:

>>> files = {"data.xlsx": {}}  # Empty dict uses table detection
>>> df = parse_files(files)

See also

parse_test_files()

Unified parser function used under the hood.

list_all_files()

Discover eligible files to parse inside a test folder.

load_test()

High-level convenience wrapper to load a single test by name.

sda.api.load_data(instrument_code, start_time=None, end_time=None)#

Load data from a specific table in the database based on the instrument code and time range.

Parameters:
  • instrument_code (str) – The code of the instrument used to determine the table and columns to query.

  • start_time (str, optional) – The start timestamp for filtering the data. Defaults to None.

  • end_time (str, optional) – The end timestamp for filtering the data. Defaults to None.

Returns:

A DataFrame containing the selected data from the specified time range and instrument.

Return type:

pandas.DataFrame

Raises:
  • ValueError

    • If the instrument code is not found in the table mapping.

    • If the database configuration is incomplete or contains invalid values.

  • ConnectionError

    • If the connection to the database fails.

    • If the SQL query fails to execute due to connection issues.

Notes

  • The function reads database configurations from ~/sda.json.

    It is essential that this file contains the following keys:

      • DB_USER: Database username.

      • DB_PASSWORD: Database password.

      • DB_HOST: Hostname or IP address of the database server.

      • DB_PORT: Port number the database server is listening on.

      • DB_NAME: Name of the database to connect to.

  • If neither start_time nor end_time is provided, the function returns all data from the specified table.

  • If only start_time is provided, the function returns data from start_time to the current time.

  • Ensure that the database server is accessible and the credentials provided are correct.

  • For more information on setting up the configuration file, visit: https://github.com/spark-cleantech/sda/blob/main/README.md#postgres-credentials--connection
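
Examples

A hedged sketch; "FT01" is a placeholder instrument code, not necessarily present in your table mapping, and timestamps are passed as strings per the parameter descriptions above:

>>> from sda.api import load_data
>>> df = load_data("FT01", start_time="2024-01-01 00:00:00", end_time="2024-01-02 00:00:00")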

sda.api.parse_test_files(files, drop_unnamed_columns=True, command=None, dropna=True, columns_to_keep=None, verbose=1, column_not_found='raise', table_not_found='raise', suppress_polars_warnings=False)#

Parse experiment files using unified Excel table detection.

Uses ExcelFileHandler with XML Discovery + Polars reading for maximum performance and accuracy. Automatically detects Excel Table objects (Insert > Table in Excel) for robust data loading.

Parameters:
  • files (str | Path | list | dict[str, dict] | dict[Path, dict]) –

    Files to parse. Can be:

    • str: filename

    • Path: filename

    • list: list of filenames

    • dict: {filename : special_reading_commands} (legacy parameter, mostly ignored)

      • filename: str

      • special_reading_commands: dict (legacy, mostly ignored in favor of table detection)

  • drop_unnamed_columns (bool, optional) – If True, drop unnamed columns (columns with no name), defaults to True.

  • command (dict, optional) – Legacy reading commands. Now largely ignored in favor of automatic Excel table detection. For CSV files, standard pandas parameters still apply.

  • dropna (bool) – Drop rows where all values are NaN, defaults to True.

  • columns_to_keep (list) – List of columns to keep for filtering (default: None)

  • verbose (int) – Verbosity level (default: 1). If 0, no output. If 1, print progress with basic information. If 2, print detailed file processing information.

  • column_not_found (str) –

    What to do if a column is not found in the file. Can be ‘raise’, ‘warn’, or ‘ignore’.

    • ’raise’: raise an error if a column is not found.

    • ’warn’: print a warning if a column is not found.

    • ’ignore’: do nothing if a column is not found.

  • table_not_found (str) –

    What to do if no Excel tables are found in a file. Can be ‘raise’ or ‘warn’.

    • ’raise’: raise a TableNotFoundError if no tables are found (default).

    • ’warn’: print a warning if no tables are found and skip the file.

  • suppress_polars_warnings (bool) – If True, suppress Polars dtype warning messages during reading (default: False).

Returns:

Combined DataFrame from all processed files with automatic table detection

Return type:

pd.DataFrame

Examples

>>> files = ["T137.xlsx", "T153.xlsx"]
>>> df = parse_test_files(files)
>>> files = {"test_data.xlsx": {}}  # Empty dict uses table detection
>>> df = parse_test_files(files)
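
Both failure modes documented above can be relaxed at once; the column names below are hypothetical placeholders:

>>> df = parse_test_files(
...     ["T137.xlsx", "T153.xlsx"],
...     columns_to_keep=["time", "pressure"],  # hypothetical column names
...     column_not_found="warn",
...     table_not_found="warn",
... )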

Notes

Uses ExcelFileHandler with XML-based table discovery and Polars reading engine for optimal performance. Legacy pandas sheet-based reading is deprecated.

exception sda.api.TableNotFoundError(message, test_name=None, file_name=None)#

Bases: Exception

Custom exception raised when no Excel tables are found in a file.

This exception provides specific information about table detection failures and can be caught separately from other errors for custom handling.

test_name = None#
file_name = None#
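
Examples

Because the exception exposes test_name and file_name, table-detection failures can be handled separately from other errors; a minimal sketch:

>>> from sda.api import load_test, TableNotFoundError
>>> try:
...     df = load_test("T183")
... except TableNotFoundError as exc:
...     print(f"No tables found in {exc.file_name} (test {exc.test_name})")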

sda.api.get_excel_core_properties(path)#

Extract core properties from Excel file using direct XML parsing.

This is much faster than loading the entire workbook with openpyxl: roughly 57x faster than openpyxl.load_workbook().

Parameters:

path (str | Path) – Path to the Excel file.

Returns:

Dictionary with core properties: title, subject, creator, description, keywords, lastModifiedBy.

Return type:

dict[str, str | None]

Examples

>>> from sda.api.parse_utils import get_excel_core_properties
>>> props = get_excel_core_properties("test_file.xlsx")
>>> print(f"Keywords: {props['keywords']}")
Keywords: STT0.1

This function was previously used to identify test files when the STT (Spark Test Template) convention was in use.

See also

get_test_template_number()

Extract metadata from Excel file properties.

select_parser()

Choose parser based on file metadata