Feature Request: Prepare the Package for Glitch Generation
Description
We need to enhance the MDC package to support glitch generation reliably. Currently, glitch extraction relies on deepextractor, which has installation issues, and the trained model is stored directly in the deepextractor git repo, which is not best practice. Together, these block seamless integration for users.
To resolve this, I propose refactoring the glitch handling. Below are the key issues and two viable approaches to proceed.
Current Issues
- Non-installable DL package: deepextractor, used for extracting glitches from data, cannot be installed via standard methods (e.g., pip, setup.py, or pyproject.toml).
- Model storage in Git repo: The trained model is committed directly to the repo, which bloats the repository size. It should be hosted externally (e.g., on the Hugging Face Hub) for versioned access and easier sharing.
Proposed Approaches
- Option 1: On-the-fly extraction with model hosting
- Host the model in a Hugging Face repository for easy loading (e.g., via transformers or direct download).
- Refactor deepextractor to make it installable (e.g., fix dependencies, create a proper PyPI package).
- Integrate on-the-fly glitch extraction into MDC: Run the model during package execution to subtract glitches dynamically from input data.
- Pros: Keeps the pipeline flexible; no pre-processing needed.
- Cons: Increases runtime for users with limited compute; requires deepextractor to be well-maintained.
- Option 2: Pre-extracted glitches via Zenodo
- Pre-run the glitch extraction offline (using the current setup) and reconstruct glitches.
- Upload the resulting glitch datasets to a Zenodo repository (with DOIs for citability).
- Download pre-extracted glitches from Zenodo and apply them directly in the generation workflow.
- Pros: Eliminates DL dependencies and runtime overhead; ensures reproducibility.
- Cons: Requires upfront data generation and hosting; less flexible for custom datasets.
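
To make the comparison concrete, here is a minimal sketch of how the two options could sit behind one interface, so the generation workflow would not need to know where a glitch came from. Everything here is hypothetical: `GlitchSource`, `OnTheFlySource`, `PreExtractedSource`, and `inject_glitch` are illustrative names, not existing MDC or deepextractor APIs, and the `extractor` callable only stands in for the model. Plain lists of floats are used to keep the sketch dependency-free; real code would presumably use NumPy arrays.

```python
from typing import Callable, List, Protocol, Sequence

Samples = List[float]  # stand-in for a NumPy array, to keep the sketch dependency-free


class GlitchSource(Protocol):
    """Anything that can hand the generation workflow a glitch waveform."""

    def get_glitch(self, index: int) -> Samples: ...


class OnTheFlySource:
    """Option 1: run an extractor on a raw segment whenever a glitch is requested."""

    def __init__(self, segments: Sequence[Samples],
                 extractor: Callable[[Samples], Samples]) -> None:
        self.segments = segments
        # `extractor` is a placeholder for the model; its real API is not assumed here.
        self.extractor = extractor

    def get_glitch(self, index: int) -> Samples:
        return self.extractor(self.segments[index])


class PreExtractedSource:
    """Option 2: serve glitches extracted offline (e.g. downloaded from Zenodo)."""

    def __init__(self, glitches: Sequence[Samples]) -> None:
        self.glitches = glitches

    def get_glitch(self, index: int) -> Samples:
        # Return a copy so callers cannot modify the stored dataset.
        return list(self.glitches[index])


def inject_glitch(background: Samples, glitch: Samples, start: int) -> Samples:
    """Add `glitch` to `background` beginning at sample `start`; returns a new list."""
    if start < 0 or start + len(glitch) > len(background):
        raise ValueError("glitch does not fit inside the background segment")
    out = list(background)
    for offset, value in enumerate(glitch):
        out[start + offset] += value
    return out
```

With something like this, choosing between Option 1 and Option 2 becomes a choice of which source to construct, and the DL dependency is only needed when the on-the-fly source is actually used.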