For frontier multimodal models, processing text is not enough. Explore our dedicated Card Views below for deep technical specifications on Automatic Speech Recognition (ASR) pipelines.
Automated transcript accuracy scoring.
Submit large audio datasets together with your model's generated transcripts to our API. We automatically calculate the Word Error Rate (WER) and Character Error Rate (CER), and flag critical alignment deviations.
POST /v1/asr/evaluate
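import asyncio

import acadify

async def run_asr_eval():
    # Submit the audio dataset and the model's transcripts as one batch job.
    job = await acadify.asr.evaluate_batch(
        dataset_uri="s3://enterprise-data/audio_samples/",
        model_predictions="s3://enterprise-data/model_v2_transcripts.jsonl",
        language="en-US",
        metrics=["WER", "CER"],
    )
    print(f"ASR Job ID: {job.id}")

# evaluate_batch is awaited, so the coroutine needs an event loop to run.
asyncio.run(run_asr_eval())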
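WER is derived from the word-level Levenshtein (edit) distance between the reference transcript and the model transcript: WER = (S + D + I) / N, where S is the number of substitutions, D is deletions, I is insertions, and N is the total number of words in the reference transcript. Because insertions are counted, WER can exceed 1.0 (100%). CER applies the same formula to characters instead of words.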
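As an illustration of the formula above, the following minimal sketch computes WER and CER with a standard dynamic-programming edit distance. It is not our production scorer: the function names are hypothetical, and it assumes whitespace tokenization with no text normalization (casing, punctuation, number formatting), which a real pipeline would typically apply first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: (S + D + I) / N over whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: the same ratio computed over characters."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

The edit distance returns only the total S + D + I; recovering the individual counts would require backtracking through the alignment, which this sketch omits.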
SME-verified corrections for heavy accents and noise.
Automated WER scoring struggles with heavy regional accents, severe background noise, and specialized domain jargon. For these cases, our SME network produces human-verified, timestamped VTT/SRT correction files.
We accept .WAV, .FLAC, and .MP3 files. For optimal linguist review, we require a minimum sampling rate of 16 kHz. Files should be split into chunks of less than 15 minutes each.
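A pre-submission check against these requirements could look like the sketch below. The helper names are hypothetical and not part of our API. It assumes you already know each file's sampling rate and duration (for example from your own audio tooling), and it splits long recordings into equal-length chunks, which is only one possible chunking strategy.

```python
from pathlib import Path

ACCEPTED_EXTENSIONS = {".wav", ".flac", ".mp3"}
MIN_SAMPLE_RATE_HZ = 16_000
MAX_CHUNK_SECONDS = 15 * 60  # each chunk must be shorter than this


def check_audio(filename, sample_rate_hz, duration_s):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if Path(filename).suffix.lower() not in ACCEPTED_EXTENSIONS:
        problems.append(f"{filename}: unsupported file type")
    if sample_rate_hz < MIN_SAMPLE_RATE_HZ:
        problems.append(f"{filename}: sampling rate below 16 kHz")
    if duration_s >= MAX_CHUNK_SECONDS:
        problems.append(f"{filename}: must be under 15 minutes, split it first")
    return problems


def chunk_plan(duration_s):
    """Split a recording into equal chunks, each shorter than 15 minutes."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    n_chunks = int(duration_s // MAX_CHUNK_SECONDS) + 1
    size = duration_s / n_chunks
    return [(i * size, (i + 1) * size) for i in range(n_chunks)]


print(check_audio("call_0042.mp3", 8_000, 1_800))
print(chunk_plan(1_800))
```

Equal-length chunks can cut through the middle of a word; splitting at silences instead would be kinder to reviewers but needs voice activity detection, which is outside this sketch.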
Every transcript is reviewed by two independent audio engineers. If they disagree on a transcription, the audio segment is routed to a Senior Linguist for final arbitration.
Identifying catastrophic homophone failures.
Not all ASR errors are equal. Dropping an "uh" or "um" is harmless, but transcribing "hyper" instead of "hypo" in a medical context is catastrophic. Acadify automatically parses transcripts for high-risk semantic inversions.
| Error Type | Reference Audio | Model Transcript | Severity |
|---|---|---|---|
| Filler Deletion | "I think, um, it is right." | "I think it is right." | Low |
| Homophone Collision | "The patient was prescribed hypothyroid meds." | "The patient was prescribed hyperthyroid meds." | Critical |
| Entity Hallucination | "Call John Doe." | "Call John Smith." | High |
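To make the severity idea in the table concrete, here is a minimal sketch of rule-based flagging. It is an assumption-laden illustration, not a description of how Acadify's parser works: the filler list, the prefix pairs, and the function names are hypothetical, and Python's `difflib.SequenceMatcher` stands in for a proper minimum-edit-distance word alignment. Anything the rules do not recognize, such as the entity swap in the last row, is only marked for review.

```python
import re
from difflib import SequenceMatcher

FILLERS = {"uh", "um"}                   # assumed filler vocabulary
OPPOSING_PREFIXES = [("hypo", "hyper")]  # assumed high-risk prefix pairs


def tokenize(text):
    """Lowercase and drop punctuation so alignment compares bare words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_prefix_inversion(ref_word, hyp_word):
    """True if the words differ only by swapping one opposing prefix."""
    for a, b in OPPOSING_PREFIXES:
        for x, y in ((a, b), (b, a)):
            if (ref_word.startswith(x) and hyp_word.startswith(y)
                    and ref_word[len(x):] == hyp_word[len(y):]):
                return True
    return False


def flag_errors(reference, hypothesis):
    """Return (severity, reference_words, hypothesis_words) per difference."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    findings = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        ref_span, hyp_span = ref[i1:i2], hyp[j1:j2]
        if tag == "delete" and all(w in FILLERS for w in ref_span):
            severity = "Low"
        elif (tag == "replace" and len(ref_span) == len(hyp_span)
              and any(is_prefix_inversion(r, h)
                      for r, h in zip(ref_span, hyp_span))):
            severity = "Critical"
        else:
            severity = "Needs review"
        findings.append((severity, " ".join(ref_span), " ".join(hyp_span)))
    return findings


print(flag_errors("The patient was prescribed hypothyroid meds.",
                  "The patient was prescribed hyperthyroid meds."))
```

A fixed prefix list only catches inversions someone anticipated; domain-specific confusion lists (drug names, dosages, negations) would need to be curated per use case.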
"Who spoke when?" accuracy testing.
Diarization evaluates a model's ability to segment audio by distinct speakers. Acadify uses the Diarization Error Rate (DER) metric, which sums False Alarm speaker time, Missed speaker time, and Speaker Confusion time, and divides the total by the reference speech time.
{
  "audio_id": "meeting_001.wav",
  "segments": [
    {"speaker": "Speaker_1", "start_time": 0.00, "end_time": 5.23, "transcript": "Hello everyone."},
    {"speaker": "Speaker_2", "start_time": 5.24, "end_time": 8.10, "transcript": "Hi, let's begin."}
  ]
}
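The sketch below shows one simple way to compute DER from two segment lists in the format above. DER is conventionally the sum of false alarm, missed, and confusion time divided by total reference speech time. The function names are hypothetical, and the sketch makes several simplifying assumptions that a production scorer would not: speech is sampled in 10 ms frames, no two segments in a list overlap, no forgiveness collar is applied at boundaries, and hypothesis speaker labels have already been mapped to reference labels.

```python
def speaker_at(segments, t):
    """Return the speaker active at time t, or None during silence."""
    for seg in segments:
        if seg["start_time"] <= t < seg["end_time"]:
            return seg["speaker"]
    return None


def diarization_error_rate(reference, hypothesis, step=0.01):
    """Frame-based DER: (false alarm + missed + confusion) / reference speech."""
    end_time = max(seg["end_time"] for seg in reference + hypothesis)
    false_alarm = missed = confusion = ref_speech = 0
    for k in range(round(end_time / step)):
        t = (k + 0.5) * step  # sample each frame at its midpoint
        ref = speaker_at(reference, t)
        hyp = speaker_at(hypothesis, t)
        if ref is not None:
            ref_speech += 1
        if ref is None and hyp is not None:
            false_alarm += 1  # model reports speech where there is none
        elif ref is not None and hyp is None:
            missed += 1       # model misses real speech
        elif ref is not None and ref != hyp:
            confusion += 1    # speech detected, wrong speaker
    if ref_speech == 0:
        raise ValueError("reference contains no speech")
    return (false_alarm + missed + confusion) / ref_speech


reference = [
    {"speaker": "Speaker_1", "start_time": 0.0, "end_time": 5.0},
    {"speaker": "Speaker_2", "start_time": 5.0, "end_time": 10.0},
]
hypothesis = [
    {"speaker": "Speaker_1", "start_time": 0.0, "end_time": 6.0},
    {"speaker": "Speaker_2", "start_time": 6.0, "end_time": 9.0},
]
print(diarization_error_rate(reference, hypothesis))
```

In this example the model assigns one second of Speaker_2's speech to Speaker_1 (confusion) and drops the final second entirely (missed), against ten seconds of reference speech.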