Sound of the metaverse: Meta creates AI models to improve virtual audio

2022-07-14

Zoom calls, meetings in the metaverse and virtual events could all be improved in the future thanks to a series of AI models developed by engineers at Meta, which the company says match sound to imagery, mimicking the way humans experience sound in the real world.

Meta’s new AI model can match the sound of an audio stream with the image of a room. (Image by LeoPatrizi / iStock)

The three models, developed in partnership with researchers from the University of Texas at Austin, are known as Visual-Acoustic Matching, Visually-Informed Dereverberation and VisualVoice. Meta has made the models available for developers.

“We need AI models that understand a person’s physical surroundings based on both how they look and how things sound,” the company said in a blog post explaining the new models.

“For example, there’s a big difference between how a concert would sound in a large venue versus in your living room. That’s because the geometry of a physical space, the materials and surfaces in the area, and the proximity of where the sounds are coming from all factor into how we hear audio.”


Meta’s new audio AI models

The Visual-Acoustic Matching model can take an audio clip recorded anywhere, along with an image of a room or other space, and transform the clip so it sounds as though it was recorded in that room.

An example use case could be making everyone on a video call sound as though they are in the same space. If one participant is at home, another in a coffee shop and a third in an office, the audio could be adapted so that each voice sounds as if it were produced in the room you are sitting in.
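As a rough, audio-only analogue of what the model learns end to end, a dry clip can be made to sound like it was recorded in a target room by convolving it with that room's impulse response. The sketch below is a minimal illustration under that assumption, not Meta's model (which infers the acoustics from an image); the file names are placeholders and the clips are assumed to be mono.

```python
# Minimal sketch: make a dry clip sound like a target room by convolving it
# with that room's impulse response (RIR). Meta's Visual-Acoustic Matching
# model instead infers the room's acoustics from a photo; this is only the
# classical signal-processing analogue. File names are placeholders.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("dry_speech.wav")          # clip recorded anywhere (mono)
rir, sr_rir = sf.read("target_room_ir.wav")  # impulse response of the target room
assert sr == sr_rir, "resample one file so the sample rates match"

wet = fftconvolve(dry, rir)                  # apply the target room's reverberation
wet /= np.max(np.abs(wet)) + 1e-9            # normalise to avoid clipping
sf.write("speech_in_target_room.wav", wet, sr)
```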

Visually-Informed Dereverberation does the opposite: it takes the sound and visual cues from a space, then removes the reverberation that the space adds to a recording. For example, it can recover the dry sound of a violin even if it is recorded inside a large train station.
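A purely audio-side illustration of dereverberation is classical spectral subtraction of the estimated late-reverberant tail; Meta's model additionally uses the image of the space to guide the estimate. The sketch below simply assumes a reverberation time (T60) instead of inferring it, and every parameter value is illustrative rather than taken from Meta's work.

```python
# Minimal single-channel dereverberation sketch: subtract an estimate of the
# late reverberant energy from the spectrogram (Lebart-style spectral
# subtraction). The reverberation time below is assumed, not inferred.
import numpy as np
import soundfile as sf
from scipy.signal import stft, istft

x, sr = sf.read("violin_in_station.wav")     # placeholder mono recording
nperseg, hop = 1024, 256
_, _, X = stft(x, sr, nperseg=nperseg, noverlap=nperseg - hop)

t60 = 1.8                                    # assumed reverberation time in seconds
delay = 8                                    # frames after which energy counts as "late" reverb
decay = np.exp(-2 * (3 * np.log(10) / t60) * delay * hop / sr)

power = np.abs(X) ** 2
late = np.zeros_like(power)
late[:, delay:] = decay * power[:, :-delay]  # late reverb estimated from attenuated earlier energy

gain = np.maximum(1.0 - late / (power + 1e-12), 0.1)   # Wiener-style gain with a floor
_, y = istft(gain * X, sr, nperseg=nperseg, noverlap=nperseg - hop)
sf.write("violin_dry.wav", y / (np.max(np.abs(y)) + 1e-9), sr)
```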


Finally, the VisualVoice model uses visual and audio cues to split speech from other background sounds and voices, allowing the listener to focus on a specific conversation. This could be used in a large conference hall with lots of people mingling.

This focused audio technique could also be used to generate better-quality subtitles, or to make it easier for future machine learning systems to understand speech when more than one person is talking, Meta explained.
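On the audio side, this kind of separation is commonly framed as applying a time-frequency mask to the mixture's spectrogram; VisualVoice also uses lip movements and facial appearance to decide which speaker the mask should keep. The sketch below assumes the mask has already been predicted by some separation model and only shows how it would be applied; the file and mask names are placeholders.

```python
# Minimal sketch: isolate one speaker by applying a time-frequency mask to
# the mixture's spectrogram. VisualVoice predicts such a mask from audio plus
# the target speaker's face; here the mask is simply loaded as an input.
import numpy as np
import soundfile as sf
from scipy.signal import stft, istft

nperseg = 1024
mixture, sr = sf.read("crowded_room.wav")   # placeholder mono mixture of voices
mask = np.load("predicted_mask.npy")        # placeholder mask, shape (freq_bins, frames), values in [0, 1]

_, _, X = stft(mixture, sr, nperseg=nperseg)
_, isolated = istft(mask * X, sr, nperseg=nperseg)   # keep only the target speaker's energy
sf.write("isolated_speaker.wav", isolated, sr)
```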

How AI can improve audio in virtual experiences

Rob Godman, reader in music at the University of Hertfordshire and an expert in acoustic spaces, told Tech Monitor this work feeds into a human need to understand where we are in the world and brings it to virtual settings.

“We have to think about how humans perceive sound in their environment,” Godman says. “Human beings want to know where sound is coming from, how big a space is and how small a space is. When listening to sound being created we listen to several different things. One is the source, but you also listen to what happens to sound when combined with the room – the acoustics.”

Being able to capture and mimic that second aspect correctly could make virtual worlds and spaces seem more realistic, he explains, and do away with the disconnect humans might experience if the visuals don’t accurately match the audio.

An example of this could be a concert where a choir is performing outdoors on a beach, but the actual audio was recorded inside a cathedral, complete with significant reverb. That reverb wouldn't be expected on a beach, so the mismatch between sound and visuals would be unexpected and off-putting.

Godman said the biggest change is how the perception of the listener is considered when implementing these AI models. “The position of the listener needs to be thought out a great deal,” he says. “The sound made close to a person compared to metres away is important. It is based around the speed of sound in air so a small delay in the time it takes to get to a person is utterly crucial.”
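To put rough numbers on that point: sound travels at about 343 metres per second in air at room temperature, so every metre between source and listener adds roughly 2.9 milliseconds of delay. A trivial calculation using that commonly quoted figure:

```python
# Rough arithmetic behind Godman's point: the delay a listener hears grows
# linearly with distance, at about 2.9 ms per metre (speed of sound ~343 m/s).
SPEED_OF_SOUND = 343.0  # metres per second in air at ~20 degrees C

for distance_m in (0.5, 2.0, 10.0):
    delay_ms = distance_m / SPEED_OF_SOUND * 1000
    print(f"{distance_m:>4} m -> {delay_ms:.1f} ms delay")
```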

He said part of the problem with improving audio is a lack of decent end-user equipment, explaining that users will “spend thousands of pounds on a curved monitor but won’t pay more than £20 for a pair of headphones”.

Professor Mark Plumbley, EPSRC Fellow in AI for Sound at the University of Surrey, is developing classifiers for different types of sounds so they can be removed or highlighted in recordings. “If you are going to create this realistic experience for people you need the vision and sound to match,” he says.


“It is harder for a computer than I think it would be for people. When we are listening to sounds there is an effect called directional marking that helps us focus on the sound from somebody in front of us and ignore sounds from the side.”

This is something we’re used to doing in the real world, Plumbley says. “If you are in a cocktail party, with lots of conversations going on, you can focus on the conversation of interest, we can block out sounds from the side or elsewhere,” he says. “This is a challenging thing to do in a virtual world.”

He says a lot of this work has come about because of changes in machine learning, with better deep learning techniques that work across different disciplines, including sound and image AI. “A lot of these things are related to signal processing,” Plumbley adds.

“Whether sounds, gravitational waves or time series information from financial data. They are about signals that come over time. In the past researchers had to build individual ways for different types of objects to extract out different things. Now we are finding deep learning models are able to pull out the patterns.”

Read more: Google’s LaMBDA AI is not sentient but could pose a security risk

