Identifying the use of synthesised voices for pre-recorded audio #400

Open
HadrienGardeur opened this issue Sep 11, 2024 · 14 comments

Comments

@HadrienGardeur
Member

HadrienGardeur commented Sep 11, 2024

ONIX supports the concept of unnamed persons in order to identify that a contributor is actually:

  • Unknown, anonymous or a group of various contributors
  • a TTS synthesised voice (male, female, unspecified or based on a real voice actor)
  • or an AI

In our techniques for full audio, displaying that info to the user seems extremely important. This is useful across audiobooks, EPUB and DAISY files, where knowing whether pre-recorded audio is human-narrated or uses a synthesised voice could affect the user's decision to select a publication.
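
As a rough sketch of how a retailer or reading system might turn those ONIX values into something displayable: the code numbers below are my reading of ONIX codelist 19 (Unnamed person(s)) and are an assumption that should be checked against the current codelist. The `describe_unnamed_narrator` helper and the "Not specified" fallback are hypothetical.

```python
# Assumed ONIX codelist 19 (Unnamed person(s)) values; verify the numbers
# against the current codelist before relying on them.
UNNAMED_PERSONS = {
    "01": "Unknown",
    "02": "Anonymous",
    "03": "et al",
    "04": "Various",
    "05": "Synthesised voice – male",
    "06": "Synthesised voice – female",
    "07": "Synthesised voice – unspecified",
    "08": "Synthesised voice – based on real voice actor",
    "09": "AI (Artificial intelligence)",
}

# Codes indicating that the audio was not recorded by a human reader.
SYNTHESISED_CODES = {"05", "06", "07", "08"}


def describe_unnamed_narrator(code):
    """Return a display label and a synthesised flag for an unnamed 'read by' contributor."""
    label = UNNAMED_PERSONS.get(code, "Not specified")
    return {"label": label, "synthesised": code in SYNTHESISED_CODES}
```

Keeping the flag separate from the label lets a catalogue filter on "synthesised" without parsing display strings.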

Since we lack the ability to express this information in EPUB, we should also explore how it could be represented there (probably by refining media:narrator).

@wareid
Contributor

wareid commented Sep 11, 2024

TTS is on the reading system side though, not the EPUB/file side, so I don't think it's valid to push this requirement or metadata into the EPUB. I'd separate that out, but knowing if the synchronized audio or audiobook is AI-narrated would be beneficial, I agree with that.

@HadrienGardeur
Member Author

TTS is often used to automate the production of reflowable EPUB with media overlays or audiobooks. I'm not talking about TTS by the reading system here, but pre-recorded audio produced with a TTS engine.

@madeleinerothberg
Collaborator

madeleinerothberg commented Sep 11, 2024 via email

@wareid
Contributor

wareid commented Sep 11, 2024

Instead of confusing terminology with a re-use of TTS, maybe it's better/clearer to specify "computer generated", "AI generated", etc.?

@HadrienGardeur
Member Author

HadrienGardeur commented Sep 11, 2024

Synthesised voices then? I wouldn't use AI for that, as only a subset of these voices use ML/AI at all.

@HadrienGardeur changed the title from "Identifying the use of TTS for pre-recorded audio" to "Identifying the use of synthesised voices for pre-recorded audio" on Sep 11, 2024
@wareid
Contributor

wareid commented Sep 11, 2024

I do think there is a distinction to be made, since generally the TTS-generated audio is a bit more "mechanical" sounding vs AI-generated/more advanced language models that sound more "natural". We could differentiate on that even, since the goal is setting user expectation around what they will be purchasing/borrowing.

@mattgarrish
Member

> Maybe the time is now to decide the best way to express that in our current metadata and the necessary terms.

Right, we added synchronizedAudioText as a feature, but that doesn't tell you anything about what kind of narration. I'm just not sure if this is a fit for a feature -- it probably belongs in the media overlays metadata.

I don't see that there's anything stopping us from recommending similar values (or pattern) be expressed in the media:narrator property to the ones @HadrienGardeur has already pointed out. Presumably, if you're okay with listing the name of the voice you used, you could swap that in for the generic gender descriptors. I expect users are going to assume a human narrator absent a synthesized label.

The problem with a refinement property off the narrator is I'm not sure everyone will want to list the name of the voice they're using, so then what are you refining in that case?

@HadrienGardeur
Member Author

HadrienGardeur commented Sep 11, 2024

> I do think there is a distinction to be made, since generally the TTS-generated audio is a bit more "mechanical" sounding vs AI-generated/more advanced language models that sound more "natural". We could differentiate on that even, since the goal is setting user expectation around what they will be purchasing/borrowing.

Over the last few months, I've documented hundreds of voices available on various browsers/platforms and I can say that this is a tricky thing to do.

In my case, I used a quality property inspired by the values and descriptions returned by the Android API for voices.

We need to keep in mind that this is a fast-moving target and quality is constantly going up. A few years from now, knowing that a voice is ML/AI-based won't mean much, as the quality profiles will have changed a lot.
Since metadata are rarely updated (if ever), I wouldn't recommend listing such subjective information in an EPUB.

> The problem with a refinement property off the narrator is I'm not sure everyone will want to list the name of the voice they're using, so then what are you refining in that case?

Yeah, that's definitely an issue with metadata in EPUB. If we had an object model (let's say JSON), we would simply add this information under narrator alongside a name. Some of them would include a name, others wouldn't.
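
To make that concrete, here is a minimal sketch of what such an object model could look like. The `synthesised` and `gender` keys, the sample names and the `narrator_label` helper are all hypothetical and not taken from any published manifest vocabulary; the only point is that `name` becomes optional.

```python
import json

# Hypothetical fragment: "synthesised" and "gender" are illustrative keys,
# not taken from any published manifest vocabulary.
FRAGMENT = json.loads("""
{
  "narrator": [
    {"name": "John Doe", "synthesised": true},
    {"synthesised": true, "gender": "female"},
    {"name": "Jane Roe"}
  ]
}
""")


def narrator_label(narrator):
    """Build a display string; a missing name needs no placeholder value."""
    name = narrator.get("name")
    if name is None:
        # Fall back to a generic description when no name is given.
        name = f"{narrator.get('gender', 'unspecified')} voice"
    if narrator.get("synthesised"):
        return f"{name} (synthesised)"
    return name


labels = [narrator_label(n) for n in FRAGMENT["narrator"]]
```

With this shape there is nothing to refine: the information sits on the narrator object whether or not a name is present.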

@clapierre
Collaborator

@HadrienGardeur I get where you are coming from, and the quality of these voices is really becoming impressive. But I think knowing at the very least if this was a human recording or generated will be an important distinction. Folks may want to seek out one over the other for a variety of reasons.

I also think that certain types of voices can be sped up, and knowing whether a book allows that could also be a benefit when deciding whether to purchase it.

@HadrienGardeur
Member Author

HadrienGardeur commented Sep 11, 2024

> […] this was a human recording or generated will be an important distinction. Folks may want to seek out one over the other for a variety of reasons.

Do you mean if the voice was fully generated or based on human recording? ONIX does a fairly good job with a specific code for that: "Synthesised voice – based on real voice actor".

I agree that this is something useful, and a good example where providing media:narrator with the name of the voice actor, then refining it with a code indicating a synthesised voice, would work quite well.

Voice cloning is becoming increasingly common.

It's a key feature of the new ElevenLabs Reader; Amazon announced earlier this week that they're rolling this out in beta for creators; it's at the core of Storytel's experiment; and Apple silicon-based devices even allow users to do that with support for what they call "Personal voices".

Aside from the usual companies in this field, there are also a number of open source voice cloning models.

@mattgarrish
Member

> If we had an object model (let's say JSON), we would simply add this information under narrator alongside a name.

Dare to dream! 😉

I'm sure we could always hack something together for epub with an eye to creating proper metadata for richer formats. Maybe the hack here would be to use "null" as a placeholder, so you might get something like:

<meta property="media:narrator" id="nar01">null</meta>
<meta property="media:voiceType" refines="#nar01">synthesized</meta>
<meta property="schema:gender" refines="#nar01">female</meta>

I don't particularly like it, but if this falls into the display guidance maybe that's the place to handle the presentation to users.

Alternatively, you could bump a property like gender up to be the default name when one isn't provided, as that can be more important to some people than whether the voice is synthesized (e.g., if you have high-frequency hearing loss, male voices are usually easier to comprehend). That would avoid "null" getting displayed by any reading system/vendor that just picks out the narrator property, at least.
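
A minimal sketch of how a reading system might apply that fallback when building a display label, assuming the placeholder pattern above. `media:voiceType` and the use of `schema:gender` as a refinement are only proposals from this thread, and the `narrator_labels` helper and label wording are hypothetical.

```python
import xml.etree.ElementTree as ET

OPF_NS = "{http://www.idpf.org/2007/opf}"

# Fragment following the "null" placeholder pattern. media:voiceType and
# schema:gender as refinements are proposals from this thread, not defined terms.
METADATA = """
<metadata xmlns="http://www.idpf.org/2007/opf">
  <meta property="media:narrator" id="nar01">null</meta>
  <meta property="media:voiceType" refines="#nar01">synthesized</meta>
  <meta property="schema:gender" refines="#nar01">female</meta>
</metadata>
"""


def narrator_labels(metadata_xml):
    """Return one display label per media:narrator, never showing 'null'."""
    metas = ET.fromstring(metadata_xml).findall(f"{OPF_NS}meta")
    labels = []
    for meta in metas:
        if meta.get("property") != "media:narrator":
            continue
        narrator_id = meta.get("id")
        # Collect the properties that refine this narrator via its id.
        refinements = {
            m.get("property"): (m.text or "").strip()
            for m in metas
            if narrator_id and m.get("refines") == "#" + narrator_id
        }
        name = (meta.text or "").strip()
        if not name or name.lower() == "null":
            # Promote gender to the default name when no real name is given.
            gender = refinements.get("schema:gender")
            name = f"{gender.capitalize()} voice" if gender else "Unnamed voice"
        if refinements.get("media:voiceType") == "synthesized":
            name += " (synthesized)"
        labels.append(name)
    return labels
```

A system that only reads the raw narrator value would still show "null", so this only helps where the display logic is aware of the convention.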

At any rate, there are always ways to work around epub's metadata.

@HadrienGardeur
Member Author

> That would avoid "null" getting displayed by any reading system/vendor that just picks out the narrator property, at least.

I was checking ONIX descriptions for these codes again and they're all limited to "read by", which got me thinking.

Currently, media:narrator indicates the presence of a narrator AND contains the name of the narrator at the same time.

What if instead of a media:voiceType we had a property that indicated:

  • that there's a synthesised narrator
  • AND contained the gender as a value

With a synthesised voice based on a real voice, we could use that property to refine media:narrator:

<meta property="media:narrator" id="nar01">John Doe</meta>
<meta property="media:synthetisedNarrator" refines="#nar01">male</meta>

Whereas with a synthesised voice that's not based on a real voice, we could simply omit media:narrator and use this property directly:

<meta property="media:synthetisedNarrator">male</meta>
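
To illustrate how a consumer could tell the two cases apart, here is a small sketch. It assumes the proposed `media:synthetisedNarrator` property exactly as spelled above, which is not a defined property, and the `synthesised_narrators` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

OPF_NS = "{http://www.idpf.org/2007/opf}"
PROP = "media:synthetisedNarrator"  # proposed in this thread, not a defined property


def synthesised_narrators(metadata_xml):
    """Return (voice_actor_or_None, gender) for each synthesised narrator."""
    metas = ET.fromstring(metadata_xml).findall(f"{OPF_NS}meta")
    # Map "#id" references to the names of the declared narrators.
    names = {
        "#" + m.get("id"): (m.text or "").strip()
        for m in metas
        if m.get("property") == "media:narrator" and m.get("id")
    }
    results = []
    for m in metas:
        if m.get("property") == PROP:
            # With refines: a voice based on a named actor.
            # Without refines: a synthesised voice with no name at all.
            actor = names.get(m.get("refines"))
            results.append((actor, (m.text or "").strip()))
    return results
```

One consequence of this design is that a system unaware of the new property would see a plain human narrator in the first case and no narrator at all in the second.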

@chrisONIX

chrisONIX commented Sep 12, 2024 via email

@HadrienGardeur
Member Author

APA recently published recommendations for identifying the use of AI narration: https://www.audiopub.org/naming-guidelines-for-ai-narrated-audiobooks

Since we cover both ONIX and EPUB, I think this could be labeled as an item to discuss in a future revision of the guide.


6 participants