ASR Feedback Protocols | Acadify AI Docs

Multimodal Voice AI Evaluation

For frontier multimodal models, processing text is not enough. The sections below give detailed technical specifications for evaluating Automatic Speech Recognition (ASR) pipelines.

1. Word Error Rate (WER) API

Automated transcript accuracy scoring.

Submit large audio datasets, together with your model's generated transcripts, to our API. The service calculates the Word Error Rate (WER) and Character Error Rate (CER) and flags critical alignment deviations.

Endpoint Details

POST /v1/asr/evaluate

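import asyncio

import acadify

async def run_asr_eval():
    # Submit the audio dataset and the model's transcripts for batch scoring.
    job = await acadify.asr.evaluate_batch(
        dataset_uri="s3://enterprise-data/audio_samples/",
        model_predictions="s3://enterprise-data/model_v2_transcripts.jsonl",
        language="en-US",
        metrics=["WER", "CER"],
    )
    print(f"ASR Job ID: {job.id}")

# Run the coroutine from synchronous code.
asyncio.run(run_asr_eval())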

The Calculation

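WER is derived from the word-level Levenshtein (edit) distance between the model transcript and the reference transcript: WER = (S + D + I) / N, where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the total number of words in the reference transcript. Because insertions are counted against N, WER can exceed 1.0 when the model transcript is much longer than the reference.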
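To make the formula concrete, here is a minimal sketch of a word-level edit-distance computation that tracks S, D, and I separately. It is an illustration only, not the scoring code behind the API; the function name `wer_breakdown` is hypothetical, and tokenization is assumed to be a plain whitespace split with no text normalization.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Return S, D, I, N and WER for two whitespace-tokenized transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # table[i][j] holds (edits, S, D, I) for aligning ref[:i] with hyp[:j].
    table = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # empty hypothesis: every word is a deletion
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # empty reference prefix: every word is an insertion

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # words match, no cost
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = table[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = table[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Keep the cheapest of the three candidate alignments.
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])

    edits, s, d, ins = table[-1][-1]
    return {"S": s, "D": d, "I": ins, "N": len(ref), "WER": edits / len(ref)}


result = wer_breakdown("i think um it is right", "i think it is right")
print(result["D"], result["N"])  # one deletion against six reference words
```

A character-level variant for CER would follow the same recurrence, tokenizing each transcript into characters instead of words.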

2. Human-in-the-Loop Audio Alignment

SME-verified corrections for heavy accents and noise.

Automated WER scoring is least reliable on audio with heavy regional accents, severe background noise, or specialized domain jargon. For these cases, our SME network produces human-verified, timestamped VTT/SRT correction files.

Supported Formats

We accept .WAV, .FLAC, and .MP3 files. For optimal linguist review, we require a minimum sampling rate of 16 kHz. Files should be chunked into segments of less than 15 minutes each.
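A pre-submission check for these requirements could look like the sketch below. The function name `check_audio` and its parameters are hypothetical, and the sample rate and duration are assumed to be already known (for example, read from file metadata by your own tooling).

```python
from pathlib import Path

ACCEPTED_EXTENSIONS = {".wav", ".flac", ".mp3"}
MIN_SAMPLE_RATE_HZ = 16_000
MAX_DURATION_S = 15 * 60  # files should be shorter than 15 minutes


def check_audio(filename: str, sample_rate_hz: int, duration_s: float) -> list:
    """Return a list of human-readable problems; an empty list means the file passes."""
    problems = []
    if Path(filename).suffix.lower() not in ACCEPTED_EXTENSIONS:
        problems.append(f"{filename}: unsupported format")
    if sample_rate_hz < MIN_SAMPLE_RATE_HZ:
        problems.append(f"{filename}: sample rate {sample_rate_hz} Hz is below 16 kHz")
    if duration_s >= MAX_DURATION_S:
        problems.append(f"{filename}: {duration_s:.0f} s is not under 15 minutes")
    return problems


print(check_audio("meeting_001.wav", 16_000, 300.0))  # passes all three checks
print(check_audio("call.ogg", 8_000, 1200.0))         # fails all three checks
```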

Linguist Validation

Every transcript is reviewed by two independent audio engineers. If they disagree on a transcription, the audio segment is routed to a Senior Linguist for final arbitration.
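The two-reviewer rule can be sketched as a small routing function. This is an assumption-laden illustration: the names `normalize` and `route_segment`, the status strings, and the choice to ignore case and extra whitespace when comparing the two reviews are all hypothetical, not a description of the internal workflow.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as disagreement."""
    return " ".join(text.lower().split())


def route_segment(segment_id: str, review_a: str, review_b: str) -> dict:
    """Accept a segment when both reviews agree, otherwise send it to arbitration."""
    if normalize(review_a) == normalize(review_b):
        return {"segment_id": segment_id, "status": "accepted", "transcript": review_a}
    return {
        "segment_id": segment_id,
        "status": "needs_arbitration",  # to be resolved by a Senior Linguist
        "candidates": [review_a, review_b],
    }


print(route_segment("seg-1", "Call John Doe.", "call  john doe."))
print(route_segment("seg-2", "Call John Doe.", "Call John Smith."))
```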

3. Contextual Hallucination Flagging

Identifying catastrophic homophone failures.

Not all ASR errors are equal. Dropping an "uh" or "um" is harmless, but transcribing "hyper" instead of "hypo" in a medical context is catastrophic. Acadify automatically parses transcripts for high-risk semantic inversions.

Error Type	Reference Audio	Model Transcript	Severity
Filler Deletion	"I think, um, it is right."	"I think it is right."	Low
Homophone Collision	"The patient was prescribed hypothyroid meds."	"The patient was prescribed hyperthyroid meds."	Critical
Entity Hallucination	"Call John Doe."	"Call John Smith."	High
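As a rough illustration of how severity can depend on which words differ, the sketch below aligns the reference and model transcripts with Python's standard `difflib.SequenceMatcher` and labels each difference. The filler list, the high-risk pair list, and the `flag_errors` function are hypothetical; entity detection is out of scope here, so any other difference is simply marked for review rather than rated.

```python
import string
from difflib import SequenceMatcher

FILLERS = {"uh", "um"}
# Hypothetical lexicon of word pairs whose confusion inverts the meaning.
HIGH_RISK_PAIRS = {frozenset({"hypothyroid", "hyperthyroid"}), frozenset({"hypo", "hyper"})}


def words(text: str) -> list:
    """Lowercase the text and strip surrounding punctuation from each word."""
    stripped = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in stripped if w]


def flag_errors(reference: str, transcript: str) -> list:
    """Return (error_type, severity, reference_words, transcript_words) per difference."""
    ref, hyp = words(reference), words(transcript)
    flags = []
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        ref_span, hyp_span = ref[i1:i2], hyp[j1:j2]
        if tag == "delete" and all(w in FILLERS for w in ref_span):
            flags.append(("Filler Deletion", "Low", ref_span, hyp_span))
        elif (tag == "replace" and len(ref_span) == 1 and len(hyp_span) == 1
              and frozenset(ref_span + hyp_span) in HIGH_RISK_PAIRS):
            flags.append(("Homophone Collision", "Critical", ref_span, hyp_span))
        else:
            flags.append(("Unclassified", "Needs review", ref_span, hyp_span))
    return flags


print(flag_errors("I think, um, it is right.", "I think it is right."))
print(flag_errors("The patient was prescribed hypothyroid meds.",
                  "The patient was prescribed hyperthyroid meds."))
```

A fixed word-pair list only catches confusions someone has anticipated; it is shown here to illustrate the idea of severity weighting, not as a complete detection strategy.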

4. Speaker Diarization Validation

"Who spoke when?" accuracy testing.

Diarization validation measures a model's ability to segment audio by distinct speakers. Acadify uses the Diarization Error Rate (DER) metric, which sums False Alarm speaker time, Missed speaker time, and Speaker Confusion time and divides the result by the total reference speaker time.
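In its standard form, DER is the sum of the three error durations divided by the total duration of speech in the reference annotation. The sketch below shows that arithmetic only; the function name and the example durations are made up for illustration, and how each duration is measured (for example, any forgiveness collar around segment boundaries) is not covered.

```python
def diarization_error_rate(false_alarm_s: float, missed_s: float,
                           confusion_s: float, reference_speech_s: float) -> float:
    """DER = (false alarm + missed + confusion) / total reference speaker time."""
    if reference_speech_s <= 0:
        raise ValueError("reference speech duration must be positive")
    return (false_alarm_s + missed_s + confusion_s) / reference_speech_s


# Illustrative durations in seconds, not real measurements.
der = diarization_error_rate(false_alarm_s=1.0, missed_s=2.0,
                             confusion_s=1.0, reference_speech_s=40.0)
print(f"DER: {der:.1%}")
```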

Submitting Diarization Schemas

{
  "audio_id": "meeting_001.wav",
  "segments": [
    {"speaker": "Speaker_1", "start_time": 0.00, "end_time": 5.23, "transcript": "Hello everyone."},
    {"speaker": "Speaker_2", "start_time": 5.24, "end_time": 8.10, "transcript": "Hi, let's begin."}
  ]
}
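Before submitting, it can help to sanity-check a payload locally. The sketch below assumes the field names shown in the schema above; the validation rules themselves (every field required, non-negative start time, end after start) are assumptions rather than documented API constraints, and `validate_payload` is a hypothetical helper.

```python
import json

REQUIRED_KEYS = {"speaker", "start_time", "end_time", "transcript"}


def validate_payload(payload: dict) -> list:
    """Return a list of problems found in a diarization payload; empty means none."""
    problems = []
    if not payload.get("audio_id"):
        problems.append("missing audio_id")
    for index, segment in enumerate(payload.get("segments", [])):
        missing = REQUIRED_KEYS - segment.keys()
        if missing:
            problems.append(f"segment {index}: missing {sorted(missing)}")
            continue
        if segment["start_time"] < 0 or segment["end_time"] <= segment["start_time"]:
            problems.append(f"segment {index}: invalid time range")
    return problems


example = json.loads("""
{
  "audio_id": "meeting_001.wav",
  "segments": [
    {"speaker": "Speaker_1", "start_time": 0.00, "end_time": 5.23, "transcript": "Hello everyone."},
    {"speaker": "Speaker_2", "start_time": 5.24, "end_time": 8.10, "transcript": "Hi, let's begin."}
  ]
}
""")
print(validate_payload(example))  # no problems for the example payload
```

Overlapping segments are deliberately not treated as errors here, since two speakers can legitimately talk at the same time.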