Tag: cs.SD

Academic · 1 min

Audio Spatially-Guided Fusion for Audio-Visual Navigation

arXiv:2604.02389v1 Announce Type: cross Abstract: Audio-visual Navigation refers to an agent utilizing visual and auditory information in complex 3D environments to accomplish target localization and …

Xinyu Zhou, Yinfeng Yu

43 views Apr 6

Academic · 1 min

CIPHER: Conformer-based Inference of Phonemes from High-density EEG

arXiv:2604.02362v1 Announce Type: cross Abstract: Decoding speech information from scalp EEG remains difficult due to low SNR and spatial blurring. We present CIPHER (Conformer-based Inference …

Varshith Madishetty

29 views Apr 6

Academic · 1 min

Do Audio-Visual Large Language Models Really See and Hear?

arXiv:2604.02605v1 Announce Type: new Abstract: Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study …

Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha

29 views Apr 6

Academic · 1 min

MuQ-Eval: An Open-Source Per-Sample Quality Metric for AI Music Generation Evaluation

arXiv:2603.22677v1 Announce Type: new Abstract: Distributional metrics such as Fréchet Audio Distance cannot score individual music clips and correlate poorly with human judgments, while the …

Di Zhu, Zixuan Li

60 views Mar 25

Academic · 1 min

ST-GDance++: A Scalable Spatial-Temporal Diffusion for Long-Duration Group Choreography

arXiv:2603.22316v1 Announce Type: new Abstract: Group dance generation from music requires synchronizing multiple dancers while maintaining spatial coordination, making it highly relevant to applications such …

Jing Xu, Weiqiang Wang, Cunjian Chen, Jun Liu, Qiuhong Ke

43 views Mar 25

Academic · 1 min

Abjad-Kids: An Arabic Speech Classification Dataset for Primary Education

arXiv:2603.20255v1 Announce Type: new Abstract: Speech-based AI educational applications have gained significant interest in recent years, particularly for children. However, children's speech research remains limited …

Abdul Aziz Snoubara, Baraa Al_Maradni, Haya Al_Naal, Malek Al_Madrmani, Roaa Jdini, Seedra Zarzour, Khloud Al Jallad

63 views Mar 24

Academic · 1 min

Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation

arXiv:2603.21078v1 Announce Type: new Abstract: This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a …

Tianle Yang, Chengzhe Sun, Phil Rose, Cassandra L. Jacobs, Siwei Lyu

58 views Mar 24

Academic · 1 min

DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models

arXiv:2603.18048v1 Announce Type: new Abstract: Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these …

Jiaqi Xiong, Yunjia Qi, Qi Cao, Yu Zheng, Weisheng Xu, Ziteng Wang, Ruofan Liao, Yutong Zhang, Sichen Liu

90 views Mar 20

Academic · 1 min

DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

arXiv:2603.18612v1 Announce Type: new Abstract: We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and …

Maxime Poli, Manel Khentout, Angelo Ortiz Tandazo, Ewan Dunbar, Emmanuel Chemla, Emmanuel Dupoux

76 views Mar 20

Academic · 1 min

ALIGN: Adversarial Learning for Generalizable Speech Neuroprosthesis

arXiv:2603.18299v1 Announce Type: new Abstract: Intracortical brain-computer interfaces (BCIs) can decode speech from neural activity with high accuracy when trained on data pooled across recording …

Zhanqi Zhang, Shun Li, Bernardo L. Sabatini, Mikio Aoi, Gal Mishne

71 views Mar 20

Academic · 1 min

Multimodal Emotion Regression with Multi-Objective Optimization and VAD-Aware Audio Modeling for the 10th ABAW EMI …

arXiv:2603.13760v1 Announce Type: new Abstract: We participated in the 10th ABAW Challenge, focusing on the Emotional Mimicry Intensity (EMI) Estimation track on the Hume-Vidmimic2 dataset. …

Jiawen Huang, Chenxi Huang, Zhuofan Wen, Hailiang Yao, Shun Chen, Longjiang Yang, Cong Yu, Fengyu Zhang, Ran Liu, Bin Liu

54 views Mar 17

Academic · 1 min

PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

arXiv:2603.14456v1 Announce Type: new Abstract: Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing …

Mohammad Javad Ranjbar Kalahroodi, Mohammad Amini, Parmis Bathayan, Heshaam Faili, Azadeh Shakery

65 views Mar 17
