M.E.AI.Abstractions - Speech to Text Abstraction #5838
Conversation
@dotnet-policy-service agree company="Microsoft"
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=938431&view=codecoverage-tab
Add unit tests for the new `IAudioTranscriptionClient` interface and related classes.

* **AudioTranscriptionClientTests.cs**
  - Add tests for `CompleteAsync` and `CompleteStreamingAsync` methods.
  - Verify invalid arguments throw exceptions.
  - Test the creation of text messages asynchronously.
* **AudioTranscriptionCompletionTests.cs**
  - Add tests for constructors and properties.
  - Verify JSON serialization and deserialization.
  - Test the `ToString` method and `ToStreamingAudioTranscriptionUpdates` method.
* **AudioTranscriptionChoiceTests.cs**
  - Add tests for constructors and properties.
  - Verify JSON serialization and deserialization.
  - Test the `Text` property and `Contents` list.
* **StreamingAudioTranscriptionUpdateTests.cs**
  - Add tests for constructors and properties.
  - Verify JSON serialization and deserialization.
  - Test the `Kind` property with existing and random values.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/RogerBarreto/extensions/tree/audio-transcription-abstraction?shareId=XXXX-XXXX-XXXX-XXXX).
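A minimal sketch of what one of the argument-validation tests described above might look like, assuming xunit and the pre-rename `IAudioTranscriptionClient`/`CompleteAsync` names used in this comment; the stub client type and exact method signature are hypothetical:

```csharp
using System;
using System.Threading.Tasks;
using Xunit;

public class AudioTranscriptionClientTests
{
    [Fact]
    public async Task CompleteAsync_WithNullInput_ThrowsArgumentNullException()
    {
        // Hypothetical test double implementing IAudioTranscriptionClient.
        IAudioTranscriptionClient client = new TestAudioTranscriptionClient();

        // Invalid arguments are expected to surface as ArgumentNullException.
        await Assert.ThrowsAsync<ArgumentNullException>(
            () => client.CompleteAsync(null!));
    }
}
```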
Add unit tests for AudioTranscription
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=942860&view=codecoverage-tab
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=945523&view=codecoverage-tab
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=945918&view=codecoverage-tab
```csharp
        }

        return bytesRead;
    }
```
Depending on how the stream is used, it could be nice to override `CopyToAsync`. That implementation can just enumerate and write out each chunk.
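A minimal sketch of that override, assuming the stream wraps the chunk enumerator in a field named `_chunks` and that `DataContent.Data` exposes the chunk bytes; the `WriteAsync(ReadOnlyMemory<byte>, CancellationToken)` overload shown requires .NET Core 2.1 or later:

```csharp
// Sketch: enumerate the remaining chunks and write each one directly to the destination.
public override async Task CopyToAsync(Stream destination, int bufferSize, CancellationToken cancellationToken)
{
    while (await _chunks.MoveNextAsync().ConfigureAwait(false))
    {
        // Each DataContent chunk is written as-is; bufferSize is not needed here
        // because no intermediate copy is made.
        await destination.WriteAsync(_chunks.Current.Data, cancellationToken).ConfigureAwait(false);
    }
}
```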
I'm not fully sure we can get much of a benefit, as this method also expects a `bufferSize`, and I would need to be careful about how to handle that if each `DataContent` has a different buffer size (which is taken care of within the `ReadAsync` override implementation).
```csharp
#endif
        {
            yield return (T)Activator.CreateInstance(typeof(T), [(ReadOnlyMemory<byte>)buffer, mediaType])!;
        }
```
This is handing out the same buffer multiple times. It's not going to be obvious to a caller that if they grab a buffer and `MoveNext`, that `MoveNext` will have overwritten their buffer.
Nice observation, fixed.
The issue still exists in the .NET 8+ path.
I think this method should not be public. We can ensure that we're consuming it appropriately in our own uses, but as a public method, we have to accommodate the possibility of misuse.
…o-transcription-abstraction
cc: @Swimburger for visibility. Feedback is appreciated. Thanks!
```csharp
/// Represents an error content.
/// </summary>
[DebuggerDisplay("{DebuggerDisplay,nq}")]
public class ErrorContent : AIContent
```
Where is this being constructed / used? I only see it used in a few tests.
```csharp
{
    get
    {
        string display = $"Message = {Message} ";
```
string display = $"Message = {Message} "; | |
string display = $"Error = {Message} "; |
```csharp
}

/// <summary>Gets or sets the error message.</summary>
public string Message { get; set; }
```
The setter should validate the non-nullness just like the ctor.
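A minimal sketch of the suggested change, reusing the repo's `Throw.IfNull` helper (seen elsewhere in this PR) to mirror the constructor's validation:

```csharp
private string _message;

/// <summary>Gets or sets the error message.</summary>
public string Message
{
    get => _message;
    set => _message = Throw.IfNull(value); // validate non-nullness, matching the ctor
}
```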
/// <param name="cancellationToken">The <see cref="CancellationToken"/> to monitor for cancellation requests. The default is <see cref="CancellationToken.None"/>.</param> | ||
/// <returns>The text generated by the client.</returns> | ||
Task<SpeechToTextResponse> GetResponseAsync( | ||
IList<IAsyncEnumerable<DataContent>> speechContents, |
Are there any scenarios where an implementation is expected to mutate this? With chat, this is expected to be a history, but with speech-to-text, presumably it's generally more of a one-and-done kind of thing? Maybe this should be an IEnumerable instead of an IList?
Wait, I just noticed, this is an `IList<IAsyncEnumerable<DataContent>>` rather than an `IAsyncEnumerable<DataContent>`? The intent here is this handles multiple inputs, each of which is an asynchronously produced sequence of content?
/// <param name="options">The speech to text options to configure the request.</param> | ||
/// <param name="cancellationToken">The <see cref="CancellationToken"/> to monitor for cancellation requests. The default is <see cref="CancellationToken.None"/>.</param> | ||
/// <returns>The text generated by the client.</returns> | ||
Task<SpeechToTextResponse> GetResponseAsync( |
Should this be `TranscribeAsync`? It's a more specific operation than on `IChatClient`. Or are there other uses than transcription?
```csharp
#pragma warning disable VSTHRD200 // Use "Async" suffix for async methods
#pragma warning disable CS1998 // Async method lacks 'await' operators and will run synchronously
private static async IAsyncEnumerable<T> ToAsyncEnumerable<T>(this IEnumerable<T> source)
```
Eventually we'll be able to replace this with the one in System.Linq.AsyncEnumerable, but I expect we'll need to wait until that's out of preview.
```csharp
    SpeechToTextOptions? options = null,
    CancellationToken cancellationToken = default)
{
    IEnumerable<DataContent> speechContents = [Throw.IfNull(speechContent)];
```
This is an unnecessary intermediate enumerable. You could have another overload of ToAsyncEnumerable that just yields a single T.
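A sketch of the suggested overload, which avoids the intermediate collection by yielding the single item directly:

```csharp
#pragma warning disable CS1998 // Async method lacks 'await' operators and will run synchronously
private static async IAsyncEnumerable<T> ToAsyncEnumerable<T>(T item)
{
    // Yields exactly one element; no IEnumerable<T> allocation is needed.
    yield return item;
}
#pragma warning restore CS1998
```

The call site above would then become `ToAsyncEnumerable(Throw.IfNull(speechContent))`.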
/// </param> | ||
/// <param name="providerUri">The URL for accessing the speech to text provider, if applicable.</param> | ||
/// <param name="modelId">The ID of the speech to text model used, if applicable.</param> | ||
public SpeechToTextClientMetadata(string? providerName = null, Uri? providerUri = null, string? modelId = null) |
Is there any other common metadata that all known providers support?
```csharp
public class SpeechToTextOptions
{
    private CultureInfo? _speechLanguage;
    private CultureInfo? _textLanguage;
```
Why are these CultureInfos? Where are these culture info objects used?
/// <summary>Initializes a new instance of the <see cref="SpeechToTextResponse"/> class.</summary> | ||
/// <param name="choices">The list of choices in the response, one message per choice.</param> | ||
[JsonConstructor] | ||
public SpeechToTextResponse(IList<SpeechToTextMessage> choices) |
What does choices map to here? Does that map to the multiple inputs provided to the GetResponseAsync method? Choices is the wrong name for that, I think.
```csharp
private static void ProcessUpdate(SpeechToTextResponseUpdate update, Dictionary<int, SpeechToTextMessage> choices, SpeechToTextResponse response)
{
    response.ResponseId ??= update.ResponseId;
    response.ModelId ??= update.ModelId;
```
My draft PR at #5998 switches this to be on a last-wins model. I think that's a more desirable ordering, especially if multiple responses might be coming back, in which case you want this response object to more effectively represent the last rather than first.
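For illustration, a sketch of the difference between the two orderings (not the exact #5998 diff):

```csharp
// First-wins, as written above: a value, once set, is never replaced.
response.ResponseId ??= update.ResponseId;

// Last-wins, as proposed in #5998: a later non-null update replaces the earlier
// value, so the response reflects the most recent information received.
response.ResponseId = update.ResponseId ?? response.ResponseId;
```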
```csharp
public static SpeechToTextResponseUpdateKind TextUpdated { get; } = new("textupdated");

/// <summary>Gets when the generated text session is closed.</summary>
public static SpeechToTextResponseUpdateKind SessionClose { get; } = new("sessionclose");
```
Is the expectation that in an update sequence you always get a session open, then zero or more pairs of textupdating/textupdated, and then a session close, with zero or more errors sprinkled throughout?
```csharp
public override long Length => throw new NotSupportedException();

/// <inheritdoc/>
public override async Task CopyToAsync(Stream destination, int bufferSize, CancellationToken cancellationToken)
```
Is this override needed? It looks like what the base implementation will end up doing.
```csharp
namespace Microsoft.Extensions.AI;

/// <summary>A delegating speech to text client that wraps an inner client with implementations provided by delegates.</summary>
public sealed class AnonymousDelegatingSpeechToTextClient : DelegatingSpeechToTextClient
```
This was made internal for chat / embeddings
/// </param> | ||
/// <param name="cancellationToken">The <see cref="CancellationToken"/> to monitor for cancellation requests. The default is <see cref="CancellationToken.None"/>.</param> | ||
/// <returns>A <see cref="Task"/> that represents the completion of the operation.</returns> | ||
public delegate Task GetResponseSharedFunc( |
Chat changed this to just use `Func<>`.
```csharp
private static ISpeechToTextClient CreateSpeechToTextClient(HttpClient httpClient, string modelId) =>
    new OpenAIClient(new ApiKeyCredential("apikey"), new OpenAIClientOptions { Transport = new HttpClientPipelineTransport(httpClient) })
        .AsSpeechToTextClient(modelId);
}
```
Can we avoid adding these large wav files? We especially don't want to add them multiple times.
Once merged, they'll be in the repo's history forever.
@RogerBarreto, anything I can do to help move this along? Thanks!
# ADR - Introducing Speech To Text Abstraction

## Problem Statement
The project requires the ability to transcribe and translate speech audio to text. The project is a proof of concept to validate the `ISpeechToTextClient` abstraction against different transcription and translation APIs, providing a consistent interface for the project to use.

> **Note**
> The names used for the proposed abstractions below are open and can be changed at any time given a bigger consensus.
## Considered Options
### Option 1: Generic Multi-Modality Abstraction `IModelClient<TInput, TOutput>` (Discarded)

This option would have provided a generic abstraction for all models, including audio transcription. However, it would have made the abstraction too generic, and it brought up some questions during the meeting:
- **Usability Concerns:** The generic interface could make the API less intuitive and harder to use, as users would not be guided towards the specific options they need. [1]
- **Naming and Clarity:** Generic names like "complete streaming" do not convey the specific functionality, making it difficult for users to understand what the method does. Specific names like "transcribe" or "generate song" would be clearer. [2]
- **Implementation Complexity:** Implementing a generic interface would still require concrete implementations for each permutation of input and output types, which could be complex and cumbersome. [3]
- **Specific Use Cases:** Different services have specific requirements and optimizations for their modalities, which may not be effectively captured by a generic interface. [4]
- **Future Proofing vs. Practicality:** While a generic interface aims to be future-proof, it may not be practical for current needs and could lead to an explosion of permutations that are not all relevant. [5]
- **Separation of Streaming and Non-Streaming:** There was a concern about separating streaming and non-streaming interfaces, as it could complicate the API further. [6]
### Option 2: Speech to Text Abstraction `ISpeechToTextClient` (Preferred)

This option would provide a specific abstraction for audio transcription and audio translation, which would be more intuitive and easier to use. The specific interface would allow for better optimization and customization for each service.

Initially the thought was to have different interfaces, one for the streaming API and another for the non-streaming API, but after some discussion it was decided to have a single interface, similar to what we have in `IChatClient`.

> **Note**
> Further modality abstractions will mostly follow this as a standard moving forward.
**Inputs:**

`IAsyncEnumerable<DataContent>`: as a simpler and more recent interface, it allows uploading streaming audio data contents to the service.

This API also enables the usage of large audio files or real-time transcription (without having to load the full file in memory) and can easily be extended to support different audio input types, like a single `DataContent` or a `Stream` instance, supporting scenarios like the following (see the usage sketch after this list):

- Single `DataContent` type input extension
- `Stream` type input extension
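A hypothetical usage sketch of those extension overloads; the factory method, the file name, and the `response.Message.Text` shape follow the descriptions in this ADR but are assumptions, not the final API:

```csharp
ISpeechToTextClient client = GetSpeechToTextClient(); // assumed factory, for illustration

// Single DataContent input:
var audio = new DataContent(File.ReadAllBytes("speech.wav"), "audio/wav");
SpeechToTextResponse fromContent = await client.GetResponseAsync(audio);

// Stream input, without loading the whole file into memory:
using FileStream file = File.OpenRead("speech.wav");
SpeechToTextResponse fromStream = await client.GetResponseAsync(file);

Console.WriteLine(fromContent.Message.Text);
```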
`SpeechToTextOptions`: analogous to the existing `ChatOptions`, it allows providing additional options to the service on both the streaming and non-streaming APIs, such as language, model, or other parameters.

- `ResponseId` is a unique identifier for the completion of the transcription. This can be useful when using the non-streaming API to track the completion status of a specific long-running transcription process (batch).

  > **Note**
  > Usage of `ResponseId` follows the convention for Chat.

- `ModelId` is a unique identifier for the model to use for transcription.
- `SpeechLanguage` is the language of the audio content. (Azure Cognitive Speech - Supported languages)
- `SpeechSampleRate` is the sample rate of the audio content. Real-time speech to text generation requires a specific sample rate.

**Outputs:**
`SpeechToTextResponse`: for the non-streaming API, analogous to the existing `ChatResponse`, it provides the generated text result and additional information about the speech response.

- `ResponseId` is a unique identifier for the response. This can be useful when using the non-streaming API to track the completion status of a specific long-running speech to text generation process (batch).

  > **Note**
  > Usage of `Response` as a prefix initially follows the convention of the `ChatResponse` type, for consistency.

- `ModelId` is a unique identifier for the model used for transcription.
- `Choices` is a list of generated text `SpeechToTextMessage`s, each referring to the generated text for the given speech `DataContent` index. In the majority of cases this will be a single message, which can also be accessed through the `Message` property, similar to `ChatResponse`.
- `StartTime` and `EndTime` represent the timestamps where the text started and ended relative to the speech audio length. For example, if the audio starts with 30 seconds of instrumental music before any speech, the transcription should start from 30 seconds forward; the same applies to the end time.

  > **Note**
  > `TimeSpan` is used to represent the timestamps as it is more intuitive and easier to work with; some services give the time in milliseconds, ticks, or other formats.

`SpeechToTextResponseUpdate`: for the streaming API, analogous to the existing `ChatResponseUpdate`, it provides the speech to text result as multiple chunks of updates that represent the generated content as well as any important information about the processing.

- `ResponseId` is a unique identifier for the speech to text response.
- `StartTime` and `EndTime` for the given transcribed chunk represent the timestamps where it starts and ends relative to the audio length. For example, if the audio starts with 30 seconds of instrumental music before any speech, the transcription chunk will flush with the `StartTime` from 30 seconds forward until the last word of the chunk, which will represent the end time.

  > **Note**
  > `TimeSpan` is used to represent the timestamps as it is more intuitive and easier to work with; some services give the time in milliseconds, ticks, or other formats.

- `Contents` is a list of `AIContent` objects that represent the transcription result. In 99% of use cases this will be a single `TextContent` object, which can be retrieved from the `Text` property, similar to `Text` in `ChatMessage`.
- `Kind` is a `struct`, similar to `ChatRole`.

The decision to use a `struct`, similar to `ChatRole`, will allow more flexibility and customization for different API updates: providers can surface extra update definitions that are very specific and don't fall neatly into the general categories described below, allowing implementers not to skip such updates and instead provide a more specific `Kind` update.

**General Update Kinds:**
- `Session Open` - When the transcription session is open.
- `Text Updating` - When the speech to text is in progress, without waiting for silence. (Preferable for UI updates.) Different APIs use different names for this, e.g.:
  - `PartialTranscriptReceived`
  - `SegmentData`
  - `RecognizingSpeech`
- `Text Updated` - When a speech to text block is complete after a small period of silence. Different API names for this, e.g.:
  - `FinalTranscriptReceived`
  - `RecognizedSpeech`
- `Session Close` - When the transcription session is closed.
- `Error` - When an error occurs during the speech to text process.

Errors can happen during streaming and normally won't block the ongoing process, but they can provide more detailed information about the failure. For this reason, instead of throwing an exception, the error can be provided as part of the ongoing stream using a dedicated content type I'm calling `ErrorContent` here.

The idea of providing an `ErrorContent` is mainly to avoid using `TextContent` to combine the error title, code, and details in a single string, which can be harder to parse, leads to a poorer user experience, and sets a bad precedent for error handling / error content.

Similarly to `UsageContent` in Chat, if an update wants to provide more detailed error information as part of the ongoing stream, adding the `ErrorContent` that represents the error message, code, and details may work best for providing more specific error details that are part of an ongoing process. A hypothetical consumption sketch follows.
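This sketch assumes a streaming method named `GetStreamingResponseAsync` and the update/content shapes described above; the `client` and `audioContents` variables, the UI/logging helpers, and the custom `"speakerchanged"` kind are all illustrative assumptions:

```csharp
await foreach (SpeechToTextResponseUpdate update in client.GetStreamingResponseAsync(audioContents))
{
    if (update.Kind == SpeechToTextResponseUpdateKind.TextUpdating)
    {
        RenderPartialText(update.Text); // assumed UI helper
    }
    else if (update.Kind == SpeechToTextResponseUpdateKind.Error)
    {
        // Errors arrive as ErrorContent rather than exceptions, so streaming continues.
        foreach (ErrorContent error in update.Contents.OfType<ErrorContent>())
        {
            LogError(error.Message); // assumed logging helper
        }
    }
    else if (update.Kind == new SpeechToTextResponseUpdateKind("speakerchanged"))
    {
        // Because Kind is a struct over a string, providers can surface
        // kinds beyond the general categories without them being skipped.
    }
}
```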
**Specific API categories:**

**Additional Extensions:**
`Stream` -> `ToAsyncEnumerable<T> where T : DataContent`

This extension method allows converting a `Stream` to an `IAsyncEnumerable<T>` where `T` is a `DataContent` type. This allows the usage of a `Stream` as an input for the `ISpeechToTextClient` without the need to load the entire stream into memory, simplifying the usage of the API for the majority of mainstream scenarios where the `Stream` type is used.

As we already have extensions for `Stream`, this could eventually be dropped, but it proved to be useful when callers wanted to easily consume a `Stream` as an `IAsyncEnumerable<T>`.
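A usage sketch under the signature described above (the file name is illustrative, and chunk sizing is an implementation detail):

```csharp
using FileStream audioFile = File.OpenRead("speech.wav");

// Convert the stream into asynchronously produced DataContent chunks.
IAsyncEnumerable<DataContent> chunks = audioFile.ToAsyncEnumerable<DataContent>();

SpeechToTextResponse response = await client.GetResponseAsync(chunks);
```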
.IAsyncEnumerable<T> -> ToStream<T> : where T : DataContent
Allows converting an `IAsyncEnumerable<T>` to a `Stream` where `T` is a `DataContent` type.

This extension will be very useful for implementers of the `ISpeechToTextClient`, providing a simple way to convert the `IAsyncEnumerable<T>` to a `Stream` for the underlying service to consume, which the majority of the services' SDKs currently support. (Azure AI Speech SDK - Example)
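A sketch of how an `ISpeechToTextClient` implementation might use it to bridge to a `Stream`-based SDK; `_sdkClient.TranscribeAsync` is a placeholder for the provider call, and the response construction follows this ADR's `Choices`/`SpeechToTextMessage` description but is an assumption:

```csharp
public async Task<SpeechToTextResponse> GetResponseAsync(
    IAsyncEnumerable<DataContent> speechContents,
    SpeechToTextOptions? options = null,
    CancellationToken cancellationToken = default)
{
    // Materialize the async chunks as a Stream the underlying SDK can consume.
    using Stream audioStream = speechContents.ToStream();

    var result = await _sdkClient.TranscribeAsync(audioStream, cancellationToken); // placeholder SDK call
    return new SpeechToTextResponse([new SpeechToTextMessage { Text = result.Text }]);
}
```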
## SK Abstractions and Adapters

Similarly to how we have `ChatClient` and `ChatService` abstractions, we will have `SpeechToTextClient` and `AudioToTextService` abstractions, where the `SpeechToTextClient` will be the main entry point for the project to consume the audio transcription services, and the `AudioToTextService` will be the main entry point for the services to implement the audio transcription services.