An Orchestrated Framework for Automated Speech Data Processing and Alignment
Abstract
Large-scale, diverse speech datasets are crucial for state-of-the-art Automatic Speech Recognition (ASR), yet creating them remains a significant bottleneck. This paper introduces an orchestrated multi-pipeline framework that automates the process end to end, from YouTube content discovery to the generation of phonetically aligned data suitable for ASR training. The system integrates three components: a configurable data crawler with proxy and cookie management for content acquisition; a neural processing pipeline that combines Voice Activity Detection (VAD), ASR, speaker diarization, and automated quality assessment; and a pronunciation alignment system that uses the Montreal Forced Aligner (MFA) to produce word-level timing annotations. The components are implemented as containerized services orchestrated by Apache Airflow. In our evaluation, the framework processed over 1000 hours of raw Vietnamese YouTube audio and yielded 813 hours of high-quality, aligned data, with end-to-end processing throughput exceeding 4.x times real-time and 98% of the workflow automated. This substantially reduces manual effort compared with traditional methods and enables systematic quality control through integrated filtering mechanisms. The architecture’s modularity and scalability make it adaptable to other languages and extensible beyond ASR to other audio-based machine learning applications.
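
As a rough illustration of the stage ordering summarized in the abstract, the sketch below models the workflow as a dependency graph and derives one valid execution order using only the Python standard library. The stage names, the dependency edges (for example, that ASR and diarization both consume VAD output), and the helper functions are assumptions made for illustration; they are not the Airflow DAG definition used by the framework.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names; each maps to the set of stages it depends on.
# The edges are an assumed reading of the abstract, not the actual DAG.
STAGES = {
    "crawl": set(),
    "vad": {"crawl"},
    "asr": {"vad"},
    "diarization": {"vad"},
    "quality_filter": {"asr", "diarization"},
    "mfa_alignment": {"quality_filter"},
}


def execution_order(stages):
    """Return one execution order that respects every dependency."""
    return list(TopologicalSorter(stages).static_order())


def retained_fraction(hours_in, hours_out):
    """Fraction of the input audio that survives filtering and alignment."""
    return hours_out / hours_in


order = execution_order(STAGES)
print(" -> ".join(order))

# The abstract reports "over 1000 hours" in and 813 hours out, so taking
# exactly 1000 input hours gives an upper bound on the retained fraction.
print(f"retained fraction (upper bound): {retained_fraction(1000, 813):.3f}")
```

In an orchestrator such as Airflow, each key would correspond to a task and each dependency set to its upstream tasks; the plain-Python graph is used here only to keep the sketch self-contained.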
