Topological properties of attention for speech processing
We apply Topological Data Analysis (TDA) methods to speech classification problems and to introspection of a pre-trained speech model. To this end, we introduce a number of topological and algebraic features derived from the transformer's attention maps and embeddings. We empirically show that a simple linear classifier built on top of such features outperforms a fine-tuned classification head. In particular, we achieve an improvement of up to 9% accuracy and 5% EER on four common datasets with the HuBERT model. On the CREMA-D dataset, the proposed feature set establishes a new state of the art by reaching an accuracy of 80.155. Last but not least, we show that topological features are capable of revealing the functional roles of speech transformer heads. For example, we find heads that can precisely distinguish between emotions (sad/happy) and between sample sources (natural/generated), or recognize one of two voices, all without any downstream fine-tuning. To do so, we introduce a ranking function that separates the topological representations produced by a single head. These results demonstrate that TDA is a promising research direction for speech analysis, especially for tasks that require structural prediction.

This webpage provides supplementary and additional material for our article "Topological properties of attention for speech processing", which was accepted to INTERSPEECH 2023. Due to strict constraints on the size of submissions, in our paper we had to omit, or mention only briefly, some of the results we achieved; here we can present them in their entirety without any fear of page count.
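To give a feel for the pipeline, below is a minimal, illustrative sketch of how topological features can be read off attention maps: it extracts per-head attention maps from a pre-trained HuBERT model, binarizes each map at a fixed threshold, and counts the connected components of the resulting attention graph. Note that this is not the exact code or feature set from our repository; the checkpoint name, the threshold of 0.1, and the helper component_counts are our choices for illustration, and the features used in the paper are considerably richer.

```python
# Illustrative sketch: one simple topological feature per (layer, head)
# of HuBERT -- the number of connected components of the attention graph
# obtained by thresholding the attention map.
import torch
import networkx as nx
from transformers import HubertModel, Wav2Vec2FeatureExtractor

MODEL_NAME = "facebook/hubert-base-ls960"  # any pre-trained HuBERT checkpoint
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = HubertModel.from_pretrained(MODEL_NAME).eval()

def component_counts(waveform, sampling_rate=16000, threshold=0.1):
    """Return one scalar per (layer, head): the number of connected
    components of the attention graph binarized at `threshold`."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate,
                               return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    features = []
    for layer_attn in outputs.attentions:      # shape (1, n_heads, seq, seq)
        for head_attn in layer_attn[0]:        # shape (seq, seq)
            # Symmetrize so the thresholded map defines an undirected graph.
            sym = ((head_attn + head_attn.T) / 2) >= threshold
            graph = nx.from_numpy_array(sym.numpy())
            features.append(nx.number_connected_components(graph))
    return features
```

A simple linear classifier (e.g., scikit-learn's LogisticRegression) fitted on vectors of such per-head features is the kind of model we compare against a fine-tuned classification head; the same per-head values can also feed the head-ranking procedure mentioned above.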
Our article
Our article "Topological properties of attention for speech processing" is currently available at arXiv:2211.17223.
Our repository
The GitHub repository with code that allows reproducing the experiments from our paper, or trying them on a completely new task or dataset, is available at github.com/ArGintum/topohubert. Right now it is under construction.
In our work we used several publicly available datasets. Here we would like to thank the authors of those datasets and provide quick links to the websites of their projects.
We used several approaches to studying the topology and spectral properties of the attention maps of the HuBERT model and performed various experiments, but due to tight size constraints we were unable to include them all in our paper. This section presents a compilation of paragraphs, maps, and charts that did not fit into the article or that, under different circumstances, would have made up its appendix.
In our work we extensively use methods of Topological Data Analysis (TDA), a modern field that has developed from numerous works in algebraic topology and computational geometry over the last two decades. We realize that, due to the novelty of TDA, our reader may not be familiar with it; however, the short format did not allow us to include proper explanations in the paper.
Here we attempt to provide the information necessary for a better understanding of the topology-related part of our paper: a glossary, an analysis of the features used in our models, and references to other works applying TDA methods to Transformer models (not only for speech).
      We are the research team behind the article "Topological properties of attention for speech processing" and this webpage: Eduard Tulchinskii, Kristian Kuznetsov, Laida Kushnareva, Daniil Cherniavskii, Serguei Barannikov, Irina Piontkovskaya, Sergey Nikolenko, and Evgeny Burnaev.
Right now you can contact us via e-mail at Eduard.Tulchinskiy@skoltech.ru