#1230: Captioning for XR Accessibility with W3C’s Michael Cooper

Michael Cooper works for the World Wide Web Consortium’s Web Accessibility Initiative, and he was attending the XR Access Symposium to learn more about existing XR accessibility efforts and to moderate a breakout session about captions in XR. One of Cooper’s big takeaways is that there is no magical, one-size-fits-all solution to captioning in XR: people have different needs, different preferences, and different contexts, which means there is a need for frameworks that make captions easily customizable. The potential customization options for spatial captions include distance from the speaker, text size, text color, weight, layout, the size of the caption box, whether the box is transparent, preventing captions from occluding objects, whether the caption moves with the speaker, and how to handle off-screen speakers.
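To make that list concrete, here is a minimal sketch of what a caption-customization framework might expose as user settings. Every name below is a hypothetical illustration of the options discussed at the symposium, not part of any existing standard, engine, or platform API.

```typescript
// Hypothetical sketch of user-adjustable spatial caption settings.
// The field names simply enumerate the customization axes discussed above;
// they are not drawn from any published specification.
interface SpatialCaptionSettings {
  textSizePt: number;                 // rendered text size in points
  textColor: string;                  // e.g. "#FFFFFF"
  fontWeight: "normal" | "bold";
  boxOpacity: number;                 // 0 = fully transparent caption box, 1 = opaque
  maxDistanceMeters: number;          // beyond this distance, captions fade or collapse to an indicator
  followSpeaker: boolean;             // anchored to the speaker vs. fixed in the user's view
  avoidOcclusion: boolean;            // reposition captions so they never cover scene objects
  offscreenSpeakerIndicator: "arrow" | "edgeLabel" | "none";
}

// A plausible set of defaults that a user could override per app or per context.
const defaultCaptionSettings: SpatialCaptionSettings = {
  textSizePt: 18,
  textColor: "#FFFFFF",
  fontWeight: "bold",
  boxOpacity: 0.6,
  maxDistanceMeters: 10,
  followSpeaker: true,
  avoidOcclusion: true,
  offscreenSpeakerIndicator: "arrow",
};
```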

Gallaudet University’s Christian Vogler warned XR Access participants about the dangers of doing a one-to-one translation of how captions are handled in 2D into how they’re handled in 3D, since there are other modalities like haptics that could help reduce information overload. One of the demos shown at the XR Access Symposium implemented a wide range of these spatial caption options, and so there is a need to develop a captioning framework for the different game engines and for the open web with WebXR, as well as an opportunity to implement one at the platform level in ecosystems like the Apple Vision Pro or Meta Quest.

Cooper told me in this interview, “I do think that we need design guidance. There are a lot of good ways to do captions in XR. There are some bad ways to do it, and so we need people to know about that. Going down the road, I think that we are going to need to develop semantic formats for the captions and for the objects that they represent. So there’s a lot of excitement about that. But again, there’s a big sense of caution that the space is so early that we don’t want to overstandardize. And as a person who works for a standards organization, that’s a big takeaway that I have to take.”

It’s again worth bringing up what Khronos Group President Neil Trevett told me about the process of standardization: “The number one golden rule of standardization is don’t do R&D by standardization committee… Until we have multiple folks doing the awesome necessary work of Darwinian experimentation, until we have multiple examples of a needed technology and everyone is agreeing that it’s needed and how we would do it, but we’re just doing it in annoyingly different ways. That’s the point at which standardization can help.”

It’s still very early days for this type of Darwinian experimentation, with Owlchemy Labs’ captioning innovations starting with Vacation Simulator in October 2019, as well as the captioning experiments and accessibility features by ILM Immersive (formerly ILMxLAB) within Star Wars: Tales from the Galaxy’s Edge. The live captioning within the social VR platform AltspaceVR was also pretty groundbreaking (RIP AltspaceVR), and VRChat has had a number of speech-to-text implementations, including ones that can be integrated into an avatar, such as VRCstt (and RabidCrab’s TTS Patreon), VRCWizard’s TTS-Voice-Wizard, and VRC STT System. There were also a number of unofficial, community-made accessibility mods, such as VRC-CC and the VRC Live Captions Mod, before VRChat’s Easy Anti-Cheat change eliminated all quality-of-life mods. There have also been a number of different strategies within 360 videos over the years that burn in captions at one, two, or three different locations; the more locations, the more one can look around without missing any action in the environment and still be able to read the captions. At Laval Virtual 2023, I saw some integrations of OpenAI’s Whisper doing live transcription and captioning as they were feeding the text into ChatGPT 3.5. I doubt that there are many folks who have experienced all of these caption implementations, and there are likely a lot more that I haven’t included here. But we are still well within the phase of Darwinian experimentation when it comes to captioning and to mapping out the key areas that can be customized.

This is a listener-supported podcast through the Voices of VR Patreon.

Music: Fatality

Rough Transcript

[00:00:05.452] Kent Bye: The Voices of VR Podcast. Hello, my name is Kent Bye, and welcome to the Voices of VR Podcast. It's a podcast that looks at the future of spatial computing. You can support the podcast at patreon.com slash voicesofvr. So this is episode number nine out of 15 of my series on XR accessibility. Today's episode is with Michael Cooper, who works for the World Wide Web Consortium's Web Accessibility Initiative. So Michael was here learning about XR and just listening to the community, and he's been a part of web accessibility for a while now. But coming in with XR, there's all these new expansions for what's this extra third dimension going to add. And so when you talk about captions, we're taking this technology from television that's a 30-year-old technology and just doing a one-to-one translation into 3D doesn't always work the best way. And so there's a group discussion that Michael was a part of leading and trying to get some feedback of what are some of the major issues for how do we start to address or even think about captions within XR. So we'll be covering a lot of the different discussions and major points that are made throughout that group discussion and some of the major themes that were arising, as well as some of the other demos that I saw that were also exploring the ability to customize things. And that was, I guess, spoiler alert, customization and being able to have something that is unique to each person's context is a bit of a recurring theme throughout this conversation where there's no one answer for captions, but that there's a variety of new opportunities for what's possible and being able to explore that full range of possibilities and have customization options, I think is going to be a key issue as we start to move forward. So that's where we're coming on today's episode of the Voices of VR podcast. So this interview with Michael happened on Thursday, June 15th, 2023 at the XR Access Symposium in New York City, New York. So with that, let's go ahead and dive right in.

[00:02:02.827] Michael Cooper: My name is Michael Cooper. I work for the World Wide Web Consortium's Web Accessibility Initiative. We work on making the web accessible to people with disabilities. And as technology evolves, we're always looking at what we need to do next. XR is clearly an evolving area. We're known for creating the Web Content Accessibility Guidelines, which have been very successful for the web. They clearly do not address much of XR, aside from a very generic way. So one of my many reasons to be here is to learn more about what we need to be doing in that space.

[00:02:33.273] Kent Bye: Great. Maybe you could give a bit more context as to your background and your journey into the space.

[00:02:37.694] Michael Cooper: Sure. Well, I came into accessibility sort of by accident, working in my student disability services office. And then I went to grad school to pursue education, but found my technical skills and accessibility kept steering me back to the space. I ended up working for an organization called CAST, which had created the program Bobby, which was the first web accessibility evaluation tool. So I was able to take over management of that project and manage it for a few years just at the time that the Web Content Accessibility Guidelines 1.0 was coming out and Bobby was the only quote implementation of those guidelines at the time. So that connected me to W3C and some years later I took a job at the W3C continuing to work on the evolution of the Web Content Accessibility Guidelines.

[00:03:24.549] Kent Bye: OK, so I know that there is a WebXR as an open standard on the web-friendly interface API to be able to do immersive experiences. And so I've had a number of different conversations around how, actually, a lot of the limitations for taking the WebXR from a recommendation into specification was to make sure that all these accessibility guidelines were being taken care of. And so I'd love to hear some of your initial thoughts in terms of how captions fit into this larger ecosystem of maybe some minimum requirements for what you would like to see to have these WebXR experiences to be the most accessible as they can be.

[00:04:01.177] Michael Cooper: Yeah. So WebXR is, in a sense, a mirror of XR on the web space. It uses web APIs to create the environment. It's currently approaching completion just of its first version, so it provides basic XR functionality for the web that is interoperable so different devices can work with it. As such, it doesn't provide special features beyond simply the ability to define objects, render graphics, all of that stuff, your basic XR stuff. However, the web has a long tradition of enhancing accessibility. For instance, HTML5 added media accessibility complete with transcript, multiple caption tracks, all of that stuff. So WebXR is a technology that has that kind of potential. W3C has a very clear mission to work on that kind of enhancement, and we in the Web Accessibility Initiative would be exploring that. Certainly what I've been learning from this conference is it's premature to decide exactly what, say, a captioning API should be, but I would also say that it would be very helpful for one to exist. We need the interoperability and we need the easy access to accessibility features that just isn't going to come if things are done ad hoc, and that's the way it is both with WebXR and regular XR at this time.
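For reference, the existing HTML5 captioning model that Cooper points to is exposed through standard DOM APIs such as TextTrack and VTTCue. The sketch below only shows that current 2D mechanism; nothing comparable is defined for WebXR scenes yet, so this illustrates the web precedent rather than any WebXR captioning API.

```typescript
// Today's HTML5 captioning model: add a caption track and timed cues to a
// <video> element using standard DOM APIs. No equivalent exists for WebXR.
const video = document.querySelector("video")!;
const track = video.addTextTrack("captions", "English", "en");
track.mode = "showing";

// VTTCue(startTimeSeconds, endTimeSeconds, text)
track.addCue(new VTTCue(0.0, 2.5, "Welcome to the symposium."));
track.addCue(new VTTCue(2.5, 5.0, "[applause]"));
```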

[00:05:19.077] Kent Bye: Great. So you were helping to facilitate a group discussion here at the XR Access Symposium of 2023, leading a group discussion around captions. And so I'd love to hear some of the big takeaways that you had from that session.

[00:05:32.259] Michael Cooper: Yeah, my biggest takeaway is that captions need to be customized. People have different needs, different preferences, different contexts. So there is no one right way to do captions in XR. There are some wrong ways to do it. We've explored some of those. But what we need to do is develop frameworks for captions to be easily customized, ideally on the device. I don't even know if I can go into the list of things that need to be considered in captioning, such as distance from the speaker, text size, text color, weight, layout, the size of the box, whether it's transparent, whether it moves with the speaker, what you do about speakers who are not on the screen. There's just a whole list of things that are somewhat different from static video captioning that we need to pay attention to. And then we have users with different disabilities, working in different environments, trying to do different things. They will find a certain thing works better for them in one context than another. So that's my biggest takeaway is caption customizability is really critical. That said, design guidance for captions is also really important. The entire industry is just trying to figure it out as it goes right now. And some really great creative work is being done, but it's hit and miss in terms of its usefulness. So, you know, while we don't want to say you must do your captions this way, we do want to say, look at these ways of doing your captioning and see what you can do. And ideally, of course, provide as much customization as you can as well. So for me, those are the biggest takeaways. There's a whole lot more details we could talk about, but that was my biggest insight.

[00:07:09.689] Kent Bye: Yeah, I know that earlier in the day, Christian Vogler, who's at Gallaudet University, was warning, I guess, about taking standards that were established as best practices for captioning, standards that originated from what he said was low-resolution TV in the 2D context. But when you put it into a spatial context, it's not always a one-to-one translation. So I'd love to hear any expansion on that as an idea and some of the discussions on what the spatialization gives you: different ways of guiding attention, the use of haptics, all these other multimodal ways where maybe there's options, or maybe there's a caution against just doing a direct one-to-one translation. And one of the things he said was that you don't want to have something that's standardized and then you can't go back, because too many people were adopting something that doesn't actually work in a spatial context.

[00:07:55.868] Michael Cooper: Yeah, so my understanding of what Christian was initially pointing out was that television captions specified, because of the low resolution of the screen, that you could have two or three lines of captions, the text had to be a certain size, a certain weight, it had to be all caps because you couldn't have lowercase, you had a very limited display of captions. As they evolved, you could have lowercase, caps for instance, color, etc. But it's still pretty limited. And that is the model that has been incorporated into many XR projects. So you have a caption window appearing with sometimes black and white captions, or something equally uninspiring. Some of the challenges that brings to XR, beyond that it simply is not nearly as rich as the environment demands: first of all, you have to watch the caption window separately from the action in the XR. That's extremely tiring, and you can lose a lot of context. In a video window, in a small television window, that was less of a concern. You know, as we look at using captions to augment the XR experience, we run into a whole new set of challenges, such as when there are multiple speakers, how do you identify which speaker is speaking? What if they're both speaking at the same time? What if one or both of them is not on screen? There's a whole bunch of issues that television-based captions haven't needed to deal with. So I think that's sort of a second point. Beyond the design limitations of old-style captions that we simply don't need any longer, the usage patterns simply have to adapt to the new technology.

[00:09:22.265] Kent Bye: Yeah, and I know that there were some demos here at the XR Access Symposium looking at different models of, say, as you're turning your head, it sort of tracks with your head. Or in a lot of 360 videos, I've seen it's locked to the world. And I guess one of the things that came up in discussions earlier today is just having a broad range of options that allow people to dial in what type of fidelity of information they want to receive, whether they want to dial it up and get flooded, which risks different aspects of information overload, but some people are happy with that. Or maybe there's ways of having different tiered systems of how much information is being presented. So I don't know if that was another part of the discussion in terms of dealing with this issue of information overload and also potentially having options and customizations for people to have a variety of different types of standards that people can choose from based upon what their needs are.

[00:10:11.372] Michael Cooper: Yeah, we didn't talk specifically about tiered systems for captions in our breakout session, but I think along with a lot of accessibility, that's something that you might want. We did come up with an example where perhaps with a more distant speaker, you've got the problem where if the caption is the same size, you may not see that they're distant, or you may be distracted by what should be a background sound. On the other hand, sometimes it's relevant to overhear a distant conversation, and so we talked about ways to enable that. One thought that came up was, you know, what if the distant speakers have smaller captions, you know, sort of paralleling the softer voice. You need to design that really carefully, but that's a way that we could work on bringing, you know, a richer experience in and work with that spatial environment.
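As a rough illustration of the "softer voice, smaller caption" idea described here, a renderer could map speaker distance to caption scale and opacity. The function below is only a sketch of one possible mapping with made-up constants, not an implementation from the breakout session.

```typescript
// Sketch: shrink and fade a caption with speaker distance, clamped so that
// nearby captions stay readable and distant ones never vanish entirely.
// The curve and constants are illustrative, not taken from any guideline.
function captionAppearance(distanceMeters: number): { scale: number; opacity: number } {
  const near = 1;   // full size and opacity at or inside 1 meter
  const far = 15;   // minimum size and opacity at or beyond 15 meters
  const t = Math.min(Math.max((distanceMeters - near) / (far - near), 0), 1);
  return {
    scale: 1.0 - 0.5 * t,    // shrinks to half size at the far limit
    opacity: 1.0 - 0.4 * t,  // fades to 60% opacity at the far limit
  };
}

// Example: a speaker 8 meters away gets roughly a 0.75x caption at 0.8 opacity.
const appearance = captionAppearance(8);
```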

[00:10:54.501] Kent Bye: Yeah, happening at the same time, I believe, was another session around 360 video descriptions for folks who are either blind or have low vision. And I guess there's this challenge of describing the scene. And when you think about the DOM for a website, you have the ability to have screen readers that are able to digest a whole bunch of different types of information there. And so I'd love to hear any differentiation between things like screen readers being able to describe scenes, versus whether there's an overlap for people who are deaf or hard of hearing who might also be able to use some of that metadata, where different objects that have additional metadata are described as well, so that maybe that's a part of this captioning system that's more object-oriented and spatialized amongst a space. So yeah, I'd love to hear if there's any discussions around the overlap between the different types of systems that may be merging together and revealed to the user.

[00:11:48.237] Michael Cooper: Yeah, I have to say I haven't been involved in many discussions about that here, but I know that is an issue. I can speak to how the DOM works. Originally the DOM is only semantics about the structure of the content, and screen readers know what to do with it and present you a meaningful flow for paragraphs, lists, sections, etc. What you don't get from the DOM is knowledge about what an object is if it's something other than a unit of text. And then that's where we get into other technologies that provide additional information. So there can be various sorts of descriptions and interactions that enhance the object. But what the assistive technology, in the end, is doing is getting a text string that describes the object and not having much more sophisticated interaction than that. In the XR space, we have much less native semantics to draw on for content. We can't say, this is a document with paragraphs. What we need is other kinds of semantics to describe the objects in the scene. And at the moment, we don't have a good definition of what those semantics might be. It's possible to develop taxonomies, ontologies of objects that might exist. And certainly many of those exist. For instance, there are ontologies for city design, and you could label things as buildings. The question then is, is that labeled in a universally recognized enough manner that it actually helps the user of the screen reader? We will need this sort of functionality for XR, but it's going to take a while to get there. I think the first level will be labeling objects with text labels, saying this is a building, this is its name. You can interact with the object, know that it's a singular object, and get its label. Getting more metadata will be an important part of XR, but that's, I think, some years down the road.
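One place such object labels could live today is the "extras" property that glTF 2.0 reserves for application-specific metadata on nodes and other objects. The keys inside "extras" below are hypothetical, since no standard accessibility vocabulary for scene objects exists yet; this is only a sketch of how a labeled node might look.

```typescript
// A glTF-style scene node carrying hypothetical accessibility metadata in its
// "extras" field. glTF 2.0 permits arbitrary application-specific data there,
// but the property names used below are not part of any published vocabulary.
const labeledNode = {
  name: "visitor_center",
  mesh: 4,
  translation: [12.0, 0.0, -3.5],
  extras: {
    accessibilityLabel: "Visitor center building",
    accessibilityRole: "building",
    accessibilityDescription:
      "Two-story brick building with the main entrance facing the plaza.",
  },
};
```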

[00:13:36.012] Kent Bye: Yeah, I know traditionally WebGL has been a bit of a black box, but I'm wondering if, with WebGPU, it's going to be a little bit more like having access to scene graphs or access to that type of specialized information. I know there's glTF as an emerging standard, but yeah, I don't know if you've been following any of the nuances of that type of stuff, because it feels like in the past it's been difficult because it's hard to understand what's happening with painting pixels on a canvas with WebGL. But now that we're moving into WebGPU, but also potentially other methods of integrating the GPU with glTF and other scene graph formats like USD or other emerging standards as we move forward, maybe there'll be more opportunities to have more direct access to what would be the equivalent of a 3D DOM, but using these emerging formats like either USD or glTF.

[00:14:23.532] Michael Cooper: That would certainly be my hope. Unfortunately, I'm not knowledgeable enough about those formats to say too much. I know that WebGPU is undergoing a revision right now. My understanding is that at the moment, those are mainly rendering formats rather than information architecture formats. On the other hand, it certainly should be possible to add informational metadata to them, and that's certainly an approach we might consider.

[00:14:48.206] Kent Bye: Yeah, well, at the wrap-up here with the different group discussions, you got up on stage and were sharing a number of different takeaways from the discussion after you had facilitated this discussion. You had mentioned some of your personal takeaways, but I'm wondering if you could share any of the other things that you were sharing to the wider community in terms of stuff that was worth mentioning to the broader context of this discussion.

[00:15:07.715] Michael Cooper: Yeah, that's a good question. So now I have to think about what did I think was most important. I would reiterate that customizability really came out. We're not prepared to define a design pattern for captions and say, here's how everybody should do it. We need to push for that, and I think that's partly a technological issue. We need to work with that on the platforms, but also with authors in the short term who are going to be asked to figure this all out by themselves. Beyond that, I do think that we need design guidance. There are a lot of good ways to do captions in XR. There are some bad ways to do it, and so we need people to know about that. Going down the road, I think that we are going to need to develop semantic formats for the captions and for the objects that they represent. So there's a lot of excitement about that. But again, there's a big sense of caution that the space is so early that we don't want to overstandardize. And as a person who works for a standards organization, that's a big takeaway that I have to take.

[00:16:04.592] Kent Bye: Great. And finally, wondering if you could share what you think the ultimate potential of immersive media spatial computing with accessibility in mind might be and what it might be able to enable.

[00:16:18.857] Michael Cooper: Sure, I think there are probably a lot of common answers to that. It enables alternate presentation of self, interaction with people in different locales, exploring experiences that are otherwise out of reach to you for whatever reason. I've certainly enjoyed doing the wingsuit videos, the 360 wingsuit videos. I'm not going to do that. So if you're a person with a mobility impairment, some aspects of VR may be very intriguing for that aspect. In the future, I don't really envision that we're all going to go around wearing headsets or even glasses. But VR is a compelling technology that enables new forms of socialization and new forms of work. It is going to come. So what we want to do now is think about what do we want it to be, just as we did with the web. This is actually why I joined the web consortium at the beginning was I want to be in on the ground floor. Let's make this technology accessible from the ground up. We didn't fully succeed, but we did a lot better than if we hadn't tried. We're in the same place with XR and it's going to have probably an equally transformative impact on the world. So we really need it to work for everybody. The world is more aware than ever now that there are people with disabilities in larger numbers than we knew about. There are also people with many different situations that lead to similar needs. For XR to really do anything, it's going to need accessibility. And yeah, so that's what I think we need to have the vision to do.

[00:17:49.914] Kent Bye: Is there anything else that's left unsaid that you'd like to say to the broader Immersive community?

[00:17:55.279] Michael Cooper: I can't think of anything right now because I'm so new to this space. I'm really trying to bring the World Wide Web Consortium into this space. We did some work on WebXR a few years ago, but we've now been kind of watching and waiting. And at least in the accessibility space, we can never do that. So that's what brought me here. It's been very interesting.

[00:18:13.845] Kent Bye: Awesome. Well, Michael, thanks for sharing a bit of what's happening with the W3C. And I think that accessibility is going to be a key part of making this a technology that is inclusive of everyone. So lots of important and challenging problems still yet to be solved. But I'm glad you're here to help facilitate some of those discussions and help to figure it all out. So thanks for joining me here to help break it all down. So thank you.

[00:18:34.434] Michael Cooper: Yeah, thanks very much. I've been very happy to be here.

[00:18:37.773] Kent Bye: So that was Michael Cooper. He works for the World Wide Web Consortium's Web Accessibility Initiative, and he was at the XR Access Symposium leading a group discussion on captions within XR. So I have a number of takeaways about this interview is that, first of all, the theme that kept repeating again and again throughout this conversation was the customizability of these different captions and the multitude of different options: the distance from the speaker, the text size, the text color, the weight, the layout, the size of the box, whether it's transparent, whether or not it moves with the speaker, captions interacting and colliding with the different objects in the scene. What do you do with the offscreen speakers? So lots of different discussions around the multitude of different options, and there didn't seem to be one universal answer at this point. I think folks are still exploring what's possible. I'd point to Owlchemy Labs and the work that they've done on Vacation Simulator, as well as Cosmonius High, where they've added these captions so that as you move around, the bubble kind of moves. So it's still in your field of view, but it's pointing to where the speaker is. And then when you go back to where the speaker is, then it hovers over that speaker's head. So I think that actually is a really good approach, but like Michael said, there's lots of different options for how to handle all these different captions. There's also audio cues with the sound effects and other aspects. And so, you know, one of the things that Christian Vogler said was just trying to minimize the amount of visual overload, because not everything needs to be translated into visual feedback. Sometimes you can start to use other modalities like haptics to be able to explore how to give emphasis or show other types of intensity, or just draw your attention towards something using haptic feedback rather than using just sound alone. You can start to use haptics as a way of filling that gap, at least with the Meta Quest Pro. The Apple Vision Pro doesn't have any hand-tracked controllers, and so haptics becomes a little bit more of an issue there. I think it's worth going back and looking at the Web Content Accessibility Guidelines, the WCAG. They have a new 2.2 draft specification that just came out on May 17, 2023. And they have the perceivable, operable, understandable, and robust principles, and then the fifth section is conformance. And so there's all sorts of different best practices for web accessibility. And there's all sorts of other user requirements that are out there that I mentioned before in a previous episode, but I'll just run through them again because I think it's worth calling out that there's existing web accessibility guidelines that may be applicable for XR as you start to fuse together all these different things. So there's the XR accessibility user requirements, the synchronization accessibility user requirements, the natural language interface accessibility user requirements, the RTC accessibility user requirements, accessibility of remote meetings, the collaboration tools accessibility user requirements, the media accessibility user requirements, the core accessibility API mappings, and the graphics accessibility API mappings. Lots of existing web accessibility guidelines. We didn't really necessarily dive into that in this conversation.
We were really just trying to get an update as to what's happening with captioning, because there's plenty of open questions there in terms of where that's going to go here in the future. So yeah, it seems like it's a little bit too early to do any standardization at this point. There's still a need to push the technology and do different reference implementations and to see what are some best practices for how to start to implement this before other guidelines come up. And there may be a whole range of different things that you can start to tweak and customize based upon what you want, even specific to what you want for different situations. So I think right now there's a lot of basic defaults where you don't actually have any customization options, but there was one demo that I got to see at the XR Access Symposium where you had all sorts of different types of customizations that you could do with captions. I didn't get the name of the project, but it's definitely worth following up and seeing the implementation of getting a wide range of different options for different types of captioning. I'll also point out a film that showed at Sundance this year called The Tuba Thieves. The Tuba Thieves was doing lots of different innovations when it comes to captioning, even using a little bit more of a spatial captioning of different audio cues. And yeah, it's just a really beautiful film that I highly recommend checking out, especially when it comes to innovations in captioning. So definitely check out The Tuba Thieves if you have a chance, because I think they're really pushing all that forward as well. And yeah, as it continues to move forward with accessibility, I think WebXR is something to keep an eye on. Actually, there might be an ability to start to do some leading-edge work when it comes to accessibility with WebXR, since the web already has all sorts of different integrations with things like screen readers. But getting something like a screen reader to work within the context of WebXR again runs into this black box of WebGL and then the future of WebGPU. And so what's going to be the equivalent of the 3D scene graph, like a glTF file as an example, or USD? How can they start to use the architecture of the spatial layout of these different objects, and then potentially even add metadata into that as well, so that screen readers would be able to get additional information about those objects? Again, I'll refer to Cosmonius High from Owlchemy Labs. In that conversation, we dig into a lot more of prototyping what a screen reader type of experience would look like for folks who are low vision or blind in the context of their game Cosmonius High. So we'll be digging into that here in a couple of episodes. So, that's all I have for today, and I just wanted to thank you for listening to the Voices of VR podcast. And if you enjoy the podcast, then please do spread the word, tell your friends, and consider becoming a member of the Patreon. This is a listener-supported podcast, and so I do rely upon donations from people like yourself in order to continue to bring you this coverage. So you can become a member and donate today at patreon.com slash voicesofvr. Thanks for listening.
