M.E.AI.Abstractions - Speech to Text Abstraction #5838
Conversation
@dotnet-policy-service agree company="Microsoft"
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=938431&view=codecoverage-tab
Add unit tests for the new `IAudioTranscriptionClient` interface and related classes.

* **AudioTranscriptionClientTests.cs**
  - Add tests for `CompleteAsync` and `CompleteStreamingAsync` methods.
  - Verify invalid arguments throw exceptions.
  - Test the creation of text messages asynchronously.
* **AudioTranscriptionCompletionTests.cs**
  - Add tests for constructors and properties.
  - Verify JSON serialization and deserialization.
  - Test the `ToString` method and `ToStreamingAudioTranscriptionUpdates` method.
* **AudioTranscriptionChoiceTests.cs**
  - Add tests for constructors and properties.
  - Verify JSON serialization and deserialization.
  - Test the `Text` property and `Contents` list.
* **StreamingAudioTranscriptionUpdateTests.cs**
  - Add tests for constructors and properties.
  - Verify JSON serialization and deserialization.
  - Test the `Kind` property with existing and random values.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/RogerBarreto/extensions/tree/audio-transcription-abstraction?shareId=XXXX-XXXX-XXXX-XXXX).
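A minimal sketch of what one of the argument-validation tests described above might look like, assuming xunit and the pre-rename `IAudioTranscriptionClient`/`CompleteAsync` names used in this comment; the stub client type and exact method signature are hypothetical:

```csharp
using System;
using System.Threading.Tasks;
using Xunit;

public class AudioTranscriptionClientTests
{
    [Fact]
    public async Task CompleteAsync_WithNullInput_ThrowsArgumentNullException()
    {
        // Hypothetical test double implementing IAudioTranscriptionClient.
        IAudioTranscriptionClient client = new TestAudioTranscriptionClient();

        // Invalid arguments are expected to surface as ArgumentNullException.
        await Assert.ThrowsAsync<ArgumentNullException>(
            () => client.CompleteAsync(null!));
    }
}
```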
Add unit tests for AudioTranscription
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=942860&view=codecoverage-tab
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=945523&view=codecoverage-tab
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=945918&view=codecoverage-tab
```csharp
        }

        return bytesRead;
    }
```
Depending on how the stream is used, it could be nice to override `CopyToAsync`. That implementation can just enumerate and write out each chunk.
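A minimal sketch of that override, assuming the stream wraps the chunk enumerator in a field named `_chunks` and that `DataContent.Data` exposes the chunk bytes; the `WriteAsync(ReadOnlyMemory<byte>, CancellationToken)` overload shown requires .NET Core 2.1 or later:

```csharp
// Sketch: enumerate the remaining chunks and write each one directly to the destination.
public override async Task CopyToAsync(Stream destination, int bufferSize, CancellationToken cancellationToken)
{
    while (await _chunks.MoveNextAsync().ConfigureAwait(false))
    {
        // Each DataContent chunk is written as-is; bufferSize is not needed here
        // because no intermediate copy is made.
        await destination.WriteAsync(_chunks.Current.Data, cancellationToken).ConfigureAwait(false);
    }
}
```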
I'm not fully sure we can get much of a benefit, as this method also expects a `bufferSize`, and I would need to be careful about how to handle that if each `DataContent` has a different buffer size (which is taken care of within the `ReadAsync` override implementation).
```csharp
#endif
        {
            yield return (T)Activator.CreateInstance(typeof(T), [(ReadOnlyMemory<byte>)buffer, mediaType])!;
        }
```
This is handing out the same buffer multiple times. It's not going to be obvious to a caller that if they grab a buffer and `MoveNext`, that `MoveNext` will have overwritten their buffer.
Nice observation, fixed.
The issue still exists in the .NET 8+ path.
I think this method should not be public. We can ensure that we're consuming it appropriately in our own uses, but as a public method, we have to accommodate the possibility of misuse.
…o-transcription-abstraction
cc: @Swimburger for visibility. Feedback is appreciated. Thanks!
```csharp
/// Represents an error content.
/// </summary>
[DebuggerDisplay("{DebuggerDisplay,nq}")]
public class ErrorContent : AIContent
```
Where is this being constructed / used? I only see it used in a few tests.
```csharp
{
    get
    {
        string display = $"Message = {Message} ";
```
string display = $"Message = {Message} "; | |
string display = $"Error = {Message} "; |
```csharp
}

/// <summary>Gets or sets the error message.</summary>
public string Message { get; set; }
```
The setter should validate the non-nullness just like the ctor.
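A minimal sketch of the suggested change, reusing the repo's `Throw.IfNull` helper (seen elsewhere in this PR) to mirror the constructor's validation:

```csharp
private string _message;

/// <summary>Gets or sets the error message.</summary>
public string Message
{
    get => _message;
    set => _message = Throw.IfNull(value); // validate non-nullness, matching the ctor
}
```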
/// <param name="cancellationToken">The <see cref="CancellationToken"/> to monitor for cancellation requests. The default is <see cref="CancellationToken.None"/>.</param> | ||
/// <returns>The text generated by the client.</returns> | ||
Task<SpeechToTextResponse> GetResponseAsync( | ||
IList<IAsyncEnumerable<DataContent>> speechContents, |
Are there any scenarios where an implementation is expected to mutate this? With chat, this is expected to be a history, but with speech-to-text, presumably it's generally more of a one-and-done kind of thing? Maybe this should be an IEnumerable instead of an IList?
Wait, I just noticed, this is an `IList<IAsyncEnumerable<DataContent>>` rather than an `IAsyncEnumerable<DataContent>`? The intent here is this handles multiple inputs, each of which is an asynchronously produced sequence of content?
/// <param name="options">The speech to text options to configure the request.</param> | ||
/// <param name="cancellationToken">The <see cref="CancellationToken"/> to monitor for cancellation requests. The default is <see cref="CancellationToken.None"/>.</param> | ||
/// <returns>The text generated by the client.</returns> | ||
Task<SpeechToTextResponse> GetResponseAsync( |
Should this be `TranscribeAsync`? It's a more specific operation than on `IChatClient`. Or are there other uses than transcription?
```csharp
#pragma warning disable VSTHRD200 // Use "Async" suffix for async methods
#pragma warning disable CS1998 // Async method lacks 'await' operators and will run synchronously
private static async IAsyncEnumerable<T> ToAsyncEnumerable<T>(this IEnumerable<T> source)
```
Eventually we'll be able to replace this with the one in System.Linq.AsyncEnumerable, but I expect we'll need to wait until that's out of preview.
```csharp
    SpeechToTextOptions? options = null,
    CancellationToken cancellationToken = default)
{
    IEnumerable<DataContent> speechContents = [Throw.IfNull(speechContent)];
```
This is an unnecessary intermediate enumerable. You could have another overload of ToAsyncEnumerable that just yields a single T.
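A sketch of the suggested overload, which avoids the intermediate collection by yielding the single item directly:

```csharp
#pragma warning disable CS1998 // Async method lacks 'await' operators and will run synchronously
private static async IAsyncEnumerable<T> ToAsyncEnumerable<T>(T item)
{
    // Yields exactly one element; no IEnumerable<T> allocation is needed.
    yield return item;
}
#pragma warning restore CS1998
```

The call site above would then become `ToAsyncEnumerable(Throw.IfNull(speechContent))`.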
/// </param> | ||
/// <param name="providerUri">The URL for accessing the speech to text provider, if applicable.</param> | ||
/// <param name="modelId">The ID of the speech to text model used, if applicable.</param> | ||
public SpeechToTextClientMetadata(string? providerName = null, Uri? providerUri = null, string? modelId = null) |
Is there any other common metadata that all known providers support?
```csharp
public class SpeechToTextOptions
{
    private CultureInfo? _speechLanguage;
    private CultureInfo? _textLanguage;
```
Why are these CultureInfos? Where are these culture info objects used?
/// <summary>Initializes a new instance of the <see cref="SpeechToTextResponse"/> class.</summary> | ||
/// <param name="choices">The list of choices in the response, one message per choice.</param> | ||
[JsonConstructor] | ||
public SpeechToTextResponse(IList<SpeechToTextMessage> choices) |
What does choices map to here? Does that map to the multiple inputs provided to the GetResponseAsync method? Choices is the wrong name for that, I think.
```csharp
private static void ProcessUpdate(SpeechToTextResponseUpdate update, Dictionary<int, SpeechToTextMessage> choices, SpeechToTextResponse response)
{
    response.ResponseId ??= update.ResponseId;
    response.ModelId ??= update.ModelId;
```
My draft PR at #5998 switches this to be on a last-wins model. I think that's a more desirable ordering, especially if multiple responses might be coming back, in which case you want this response object to more effectively represent the last rather than first.
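For illustration, a sketch of the difference between the two orderings (not the exact #5998 diff):

```csharp
// First-wins, as written above: a value, once set, is never replaced.
response.ResponseId ??= update.ResponseId;

// Last-wins, as proposed in #5998: a later non-null update replaces the earlier
// value, so the response reflects the most recent information received.
response.ResponseId = update.ResponseId ?? response.ResponseId;
```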
```csharp
public static SpeechToTextResponseUpdateKind TextUpdated { get; } = new("textupdated");

/// <summary>Gets when the generated text session is closed.</summary>
public static SpeechToTextResponseUpdateKind SessionClose { get; } = new("sessionclose");
```
Is the expectation that in an update sequence you always get a session open, then zero or more pairs of textupdating/textupdated, and then a session close, with zero or more errors sprinkled throughout?
```csharp
public override long Length => throw new NotSupportedException();

/// <inheritdoc/>
public override async Task CopyToAsync(Stream destination, int bufferSize, CancellationToken cancellationToken)
```
Is this override needed? It looks like what the base implementation will end up doing.
```csharp
namespace Microsoft.Extensions.AI;

/// <summary>A delegating speech to text client that wraps an inner client with implementations provided by delegates.</summary>
public sealed class AnonymousDelegatingSpeechToTextClient : DelegatingSpeechToTextClient
```
This was made internal for chat / embeddings
/// </param> | ||
/// <param name="cancellationToken">The <see cref="CancellationToken"/> to monitor for cancellation requests. The default is <see cref="CancellationToken.None"/>.</param> | ||
/// <returns>A <see cref="Task"/> that represents the completion of the operation.</returns> | ||
public delegate Task GetResponseSharedFunc( |
Chat changed this to just use `Func<>`.
```csharp
private static ISpeechToTextClient CreateSpeechToTextClient(HttpClient httpClient, string modelId) =>
    new OpenAIClient(new ApiKeyCredential("apikey"), new OpenAIClientOptions { Transport = new HttpClientPipelineTransport(httpClient) })
        .AsSpeechToTextClient(modelId);
}
```
Can we avoid adding these large wav files? We especially don't want to add them multiple times.
Once merged, they'll be in the repo's history forever.
@RogerBarreto, anything I can do to help move this along? Thanks!
# ADR - Introducing Speech To Text Abstraction

## Problem Statement
The project requires the ability to transcribe and translate speech audio to text. The project is a proof of concept to validate the `ISpeechToTextClient` abstraction against different transcription and translation APIs, providing a consistent interface for the project to use.

> **Note**
> The names used for the proposed abstractions below are open and can be changed at any time given a bigger consensus.
## Considered Options
### Option 1: Generic Multi-Modality Abstraction `IModelClient<TInput, TOutput>` (Discarded)

This option would have provided a generic abstraction for all models, including audio transcription. However, it would have made the abstraction too generic, and it brought up some questions during the meeting:
- **Usability Concerns:** The generic interface could make the API less intuitive and harder to use, as users would not be guided towards the specific options they need. [1]
- **Naming and Clarity:** Generic names like "complete streaming" do not convey the specific functionality, making it difficult for users to understand what the method does. Specific names like "transcribe" or "generate song" would be clearer. [2]
- **Implementation Complexity:** Implementing a generic interface would still require concrete implementations for each permutation of input and output types, which could be complex and cumbersome. [3]
- **Specific Use Cases:** Different services have specific requirements and optimizations for their modalities, which may not be effectively captured by a generic interface. [4]
- **Future Proofing vs. Practicality:** While a generic interface aims to be future-proof, it may not be practical for current needs and could lead to an explosion of permutations that are not all relevant. [5]
- **Separation of Streaming and Non-Streaming:** There was a concern about separating streaming and non-streaming interfaces, as it could complicate the API further. [6]
### Option 2: Speech to Text Abstraction `ISpeechToTextClient` (Preferred)

This option would provide a specific abstraction for audio transcription and audio translation, which would be more intuitive and easier to use. The specific interface would allow for better optimization and customization for each service.

Initially the thought was to have different interfaces, one for the streaming API and another for the non-streaming API, but after some discussion it was decided to have a single interface, similar to what we have in `IChatClient`.

> **Note**
> Further modality abstractions will mostly follow this as a standard moving forward.
**Inputs:**

`IAsyncEnumerable<DataContent>`: as a simpler and more recent interface, it allows uploading streaming audio data contents to the service.

This API also enables the usage of large audio files or real-time transcription (without having to load the full file in memory) and can easily be extended to support different audio input types, like a single `DataContent` or a `Stream` instance, supporting scenarios like the following (see the usage sketch after this list):

- Single `DataContent` type input extension
- `Stream` type input extension
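A hypothetical usage sketch of those extension overloads; the factory method, the file name, and the `response.Message.Text` shape follow the descriptions in this ADR but are assumptions, not the final API:

```csharp
ISpeechToTextClient client = GetSpeechToTextClient(); // assumed factory, for illustration

// Single DataContent input:
var audio = new DataContent(File.ReadAllBytes("speech.wav"), "audio/wav");
SpeechToTextResponse fromContent = await client.GetResponseAsync(audio);

// Stream input, without loading the whole file into memory:
using FileStream file = File.OpenRead("speech.wav");
SpeechToTextResponse fromStream = await client.GetResponseAsync(file);

Console.WriteLine(fromContent.Message.Text);
```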
`SpeechToTextOptions`: analogous to the existing `ChatOptions`, it allows providing additional options to the service on both the streaming and non-streaming APIs, such as language, model, or other parameters.

- `ResponseId` is a unique identifier for the completion of the transcription. This can be useful when using the non-streaming API to track the completion status of a specific long-running transcription process (batch).

  > **Note**
  > Usage of `ResponseId` follows the convention for Chat.

- `ModelId` is a unique identifier for the model to use for transcription.
- `SpeechLanguage` is the language of the audio content. (Azure Cognitive Speech - Supported languages)
- `SpeechSampleRate` is the sample rate of the audio content. Real-time speech to text generation requires a specific sample rate.

**Outputs:**
`SpeechToTextResponse`: for the non-streaming API, analogous to the existing `ChatResponse`, it provides the generated text result and additional information about the speech response.

- `ResponseId` is a unique identifier for the response. This can be useful when using the non-streaming API to track the completion status of a specific long-running speech to text generation process (batch).

  > **Note**
  > Usage of `Response` as a prefix initially follows the convention of the `ChatResponse` type, for consistency.

- `ModelId` is a unique identifier for the model used for transcription.
- `Choices` is a list of generated text `SpeechToTextMessage`s, each referring to the generated text for the given speech `DataContent` index. In the majority of cases this will be a single message, which can also be accessed through the `Message` property, similar to `ChatResponse`.
- `StartTime` and `EndTime` represent the timestamps where the text started and ended relative to the speech audio length. For example, if the audio starts with 30 seconds of instrumental music before any speech, the transcription should start from 30 seconds forward; the same applies to the end time.

  > **Note**
  > `TimeSpan` is used to represent the timestamps as it is more intuitive and easier to work with; some services give the time in milliseconds, ticks, or other formats.

`SpeechToTextResponseUpdate`: for the streaming API, analogous to the existing `ChatResponseUpdate`, it provides the speech to text result as multiple chunks of updates that represent the generated content as well as any important information about the processing.

- `ResponseId` is a unique identifier for the speech to text response.
- `StartTime` and `EndTime` for the given transcribed chunk represent the timestamps where it starts and ends relative to the audio length. For example, if the audio starts with 30 seconds of instrumental music before any speech, the transcription chunk will flush with the `StartTime` from 30 seconds forward until the last word of the chunk, which will represent the end time.

  > **Note**
  > `TimeSpan` is used to represent the timestamps as it is more intuitive and easier to work with; some services give the time in milliseconds, ticks, or other formats.

- `Contents` is a list of `AIContent` objects that represent the transcription result. In 99% of use cases this will be a single `TextContent` object, which can be retrieved from the `Text` property, similar to `Text` in `ChatMessage`.
- `Kind` is a `struct`, similar to `ChatRole`.

The decision to use a `struct`, similar to `ChatRole`, will allow more flexibility and customization for different API updates: providers can surface extra update definitions that are very specific and don't fall neatly into the general categories described below, allowing implementers not to skip such updates and instead provide a more specific `Kind` update.

**General Update Kinds:**
- `Session Open` - When the transcription session is open.
- `Text Updating` - When the speech to text is in progress, without waiting for silence. (Preferable for UI updates.) Different APIs use different names for this, e.g.:
  - `PartialTranscriptReceived`
  - `SegmentData`
  - `RecognizingSpeech`
- `Text Updated` - When a speech to text block is complete after a small period of silence. Different API names for this, e.g.:
  - `FinalTranscriptReceived`
  - `RecognizedSpeech`
- `Session Close` - When the transcription session is closed.
- `Error` - When an error occurs during the speech to text process.

Errors can happen during streaming and normally won't block the ongoing process, but they can provide more detailed information about the failure. For this reason, instead of throwing an exception, the error can be provided as part of the ongoing stream using a dedicated content type I'm calling `ErrorContent` here.

The idea of providing an `ErrorContent` is mainly to avoid using `TextContent` to combine the error title, code, and details in a single string, which can be harder to parse, leads to a poorer user experience, and sets a bad precedent for error handling / error content.

Similarly to `UsageContent` in Chat, if an update wants to provide more detailed error information as part of the ongoing stream, adding the `ErrorContent` that represents the error message, code, and details may work best for providing more specific error details that are part of an ongoing process. A hypothetical consumption sketch follows.
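This sketch assumes a streaming method named `GetStreamingResponseAsync` and the update/content shapes described above; the `client` and `audioContents` variables, the UI/logging helpers, and the custom `"speakerchanged"` kind are all illustrative assumptions:

```csharp
await foreach (SpeechToTextResponseUpdate update in client.GetStreamingResponseAsync(audioContents))
{
    if (update.Kind == SpeechToTextResponseUpdateKind.TextUpdating)
    {
        RenderPartialText(update.Text); // assumed UI helper
    }
    else if (update.Kind == SpeechToTextResponseUpdateKind.Error)
    {
        // Errors arrive as ErrorContent rather than exceptions, so streaming continues.
        foreach (ErrorContent error in update.Contents.OfType<ErrorContent>())
        {
            LogError(error.Message); // assumed logging helper
        }
    }
    else if (update.Kind == new SpeechToTextResponseUpdateKind("speakerchanged"))
    {
        // Because Kind is a struct over a string, providers can surface
        // kinds beyond the general categories without them being skipped.
    }
}
```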
**Specific API categories:**

**Additional Extensions:**
`Stream` -> `ToAsyncEnumerable<T> where T : DataContent`

This extension method allows converting a `Stream` to an `IAsyncEnumerable<T>` where `T` is a `DataContent` type. This allows the usage of a `Stream` as an input for the `ISpeechToTextClient` without the need to load the entire stream into memory, simplifying the usage of the API for the majority of mainstream scenarios where the `Stream` type is used.

As we already have extensions for `Stream`, this could eventually be dropped, but it proved to be useful when callers wanted to easily consume a `Stream` as an `IAsyncEnumerable<T>`.
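A usage sketch under the signature described above (the file name is illustrative, and chunk sizing is an implementation detail):

```csharp
using FileStream audioFile = File.OpenRead("speech.wav");

// Convert the stream into asynchronously produced DataContent chunks.
IAsyncEnumerable<DataContent> chunks = audioFile.ToAsyncEnumerable<DataContent>();

SpeechToTextResponse response = await client.GetResponseAsync(chunks);
```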
.IAsyncEnumerable<T> -> ToStream<T> : where T : DataContent
Allows converting an `IAsyncEnumerable<T>` to a `Stream` where `T` is a `DataContent` type.

This extension will be very useful for implementers of the `ISpeechToTextClient`, providing a simple way to convert the `IAsyncEnumerable<T>` to a `Stream` for the underlying service to consume, which the majority of the services' SDKs currently support. (Azure AI Speech SDK - Example)
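A sketch of how an `ISpeechToTextClient` implementation might use it to bridge to a `Stream`-based SDK; `_sdkClient.TranscribeAsync` is a placeholder for the provider call, and the response construction follows this ADR's `Choices`/`SpeechToTextMessage` description but is an assumption:

```csharp
public async Task<SpeechToTextResponse> GetResponseAsync(
    IAsyncEnumerable<DataContent> speechContents,
    SpeechToTextOptions? options = null,
    CancellationToken cancellationToken = default)
{
    // Materialize the async chunks as a Stream the underlying SDK can consume.
    using Stream audioStream = speechContents.ToStream();

    var result = await _sdkClient.TranscribeAsync(audioStream, cancellationToken); // placeholder SDK call
    return new SpeechToTextResponse([new SpeechToTextMessage { Text = result.Text }]);
}
```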
## SK Abstractions and Adapters

Similarly to how we have `ChatClient` and `ChatService` abstractions, we will have `SpeechToTextClient` and `AudioToTextService` abstractions, where the `SpeechToTextClient` will be the main entry point for the project to consume the audio transcription services, and the `AudioToTextService` will be the main entry point for the services to implement the audio transcription services.