
Research Topics

  • Automatic Speech Recognition
  • Natural Language Processing
  • Keywords/Topics Extraction
  • Machine Learning

Current Works

  • 2012-2016 : ContNomina Project
  • The technologies involved in information retrieval in large audio/video databases are often based on the analysis of large, but closed, corpora, and on machine learning techniques and statistical modeling of written and spoken language. The effectiveness of these approaches is now widely acknowledged, but they nevertheless have major flaws, particularly concerning new words and proper names, two types of input that are crucial for interpreting the content but extremely difficult to model from the analysis of closed corpora. In the context of diachronic data (data that change over time), new names appear constantly, requiring dynamic updates of the lexicons and language models used by the speech recognition system. The ContNomina project therefore focuses on the problem of proper names in automatic audio processing systems by exploiting the context of the processed documents as efficiently as possible. To do this, the project will address:

    • the statistical modeling of contexts and of relationships between contexts and proper names;
    • the contextualisation of the recognition module through the dynamic adjustment of the lexicon and of the language model, in order to make them more accurate and more relevant in terms of lexical coverage, particularly with respect to proper names;
    • the detection of proper names, on the one hand in text documents, to build lists of proper names, and on the other hand in the output of the recognition system, to identify spoken proper names in the audio/video data.
    Resources developed during this project will be made accessible to the scientific community. These will consist of a lexicon of phonetized proper names (no such lexicon is currently available for French) and annotations of an audio/video corpus. A web demonstrator will be implemented to validate the scientific developments achieved in the project.

  • 2012-2015 : DECODA Project
  • The goal of the DECODA project is to reduce the development cost of Speech Analytics systems by reducing the need for manual annotation. The project aims to propose robust speech data mining tools for call-center monitoring and evaluation, by means of weakly supervised methods. The application framework of the project is the call-center of the RATP (Paris public transport authority). The project tackles two important open issues in the development of speech mining methods for spontaneous speech recorded in call-centers: robustness (how to extract relevant information from very noisy and spontaneous speech messages) and weak supervision (how to reduce the annotation effort needed to train and adapt recognition and classification models).

  • 2011-2014 : SuMACC Project
  • The search for a concept in multimedia databases or on the Internet encounters major issues due to the diversity of concept representations, which may depend on one or several modalities, such as pictures, video, speech, text, sounds... Typically, a concept such as "Olympic Games" may be mapped to a video of the opening ceremony, to text documents focusing on a specific race, to radio shows, etc. The SuMACC project aims to develop models that support this variability in multimedia content, with a particular focus on the Web.
    Methods for concept discovery and tracking in text documents have been widely studied in recent decades. These methods are now relatively mature and effective. Moreover, the video processing community has made considerable efforts to design methods for tracking concrete objects or object categories, especially in the TRECVID evaluation campaigns.
    Nonetheless, multimodal approaches remain poorly developed, and most previous work proposes solutions for only one modality (video, audio or text). From a technological point of view, most identification methods are based on statistical models, and a large amount of data is required to estimate the model parameters correctly. Collecting and annotating such large corpora is generally too costly, which hinders the emergence of multimedia approaches.
    The SuMACC project addresses these two major issues related to the multimodal representation of concepts and to the training strategies that could enable a low-cost estimate of concept signatures.

  • 2014- : VERA Project
  • The VERA project aims to develop tools for diagnosing, localizing, and measuring automatic transcription errors. This project is based on a consortium of first-rate academic actors in this field. The objective is to study the errors in detail (at the perceptual, acoustic-phonetic, lexical, and syntactic levels) in order to yield a precise diagnosis of possible shortcomings of current classical models on certain classes of linguistic phenomena. At the application level, the VERA project is motivated by the observation that a number of applications offering access to the contents of multimedia data are made possible by automatic speech transcription: subtitling of video, search for precise portions of audio-visual archives, automated meeting reports, and extraction and structuring of information (Speech Analytics) in multimedia contents (Web, call centers, ...). However, large-scale deployment is often slowed down by the fact that transcriptions produced by automatic speech recognition systems contain too many errors. Research and development in speech recognition has focused, successfully until now, on improving the methods and models implemented in the transcription process, measured through the word error rate; however, past a given performance level, the cost of reducing the residual errors increases exponentially.
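The word error rate mentioned above is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not a tool from the VERA project; the function name and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j]: edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One deleted word out of six reference words.
rate = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A single aggregate figure of this kind says nothing about where or why the errors occur, which is the gap that a finer-grained diagnosis is meant to fill.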

Evaluation Campaigns

  • 2013 : MediaEval'13
  • Spoken Web Search Task
    The task involves searching FOR audio content WITHIN audio content USING an audio content query. This task is particularly interesting for speech researchers in the area of spoken term detection or low-resource speech processing.
    A set of un-transcribed audio files from multiple languages and a set of queries will be provided to researchers. The task requires that each occurrence of a query within the audio content be identified. Both the correct audio files and the locations of each query term within the audio files must be found. No transcriptions, language tags or any other metadata will be provided. The task therefore requires researchers to build a language-independent, acoustic-content-independent audio search system.

    Crowdsourcing in Multimedia Task (New)
    The goal of this task is to allow participants to explore how crowdsourcing can enhance visual content analysis, either for creating descriptions of social images or for improving the descriptions that users have contributed (i.e., tags).
    We provide a dataset of Creative Commons social images with a focus on fashion images. The development dataset is annotated by two groups of annotators: AMT annotators (whose annotations may be noisy) and trusted annotators known to the authors (who create the correct annotations). Participants can use the annotations of AMT workers as well as any methods for analyzing visual content or socially-contributed metadata to generate an enhanced set of labels for the images. The task targets two binary labels: whether or not an image is fashion-related and whether or not an image is correctly tagged with a particular fashion item.

    Soundtrack Selection (MusiClef) Task (New)
    The MusiClef 2013: "Soundtrack Selection for Commercials" task aims at analyzing music usage in TV commercials and determining music that fits a given commercial video. Usually, music consultants select a song to advertise a particular brand or a product. The MusiClef benchmarking activity, in contrast, aims at making this process automated by taking into account both context- and content-based information about the video, the brand, and the music. This is a challenging task, in which multimodal information sources should be considered, which do not trivially connect to each other.

Ph.D. Thesis

Subject: Factor analysis for acoustic modeling in speech recognition systems

Thesis committee:

Members:

Laurent Besacier (Professor)
LIG - University of J. Fourier

Régine André-Obrecht (Professor)
IRIT - University of Toulouse 3

Jean-Luc Gauvain (Professor)
LIMSI - University of Paris

Guillaume Gravier (Research Scientist)
IRISA/CNRS - University of Rennes 1

Denis Jouvet (Research Director)
LORIA/INRIA - University of Nancy

Advisors:

Driss Matrouf (HDR)
LIA - University of Avignon

Georges Linarès (Professor)
LIA - University of Avignon

The thesis manuscript is available in French: These_Mohamed_Bouallegue.pdf

Thesis abstract

In this thesis, we propose to use techniques based on factor analysis to build acoustic models for automatic speech processing, especially Automatic Speech Recognition (ASR). First, we were interested in reducing the memory footprint of acoustic models. Our factor analysis-based method demonstrated that it is possible to pool the parameters of acoustic models and still maintain performance similar to that obtained with the baseline models. The proposed modeling leads us to decompose the set of acoustic model parameters into independent parameter subsets, which allows great flexibility for particular adaptations (speakers, genre, new tasks, etc.).

With current modeling techniques, the state of a Hidden Markov Model (HMM) is represented by a mixture of Gaussians (GMM: Gaussian Mixture Model). We propose as an alternative a vector representation of states: the state factors. These state factors enable us to accurately measure the similarity between the states of the HMM, for example by means of a Euclidean distance. Using this vector representation, we propose a simple and effective method for building acoustic models with shared states. This procedure is even more effective when applied to under-resourced languages.
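To illustrate how a vector representation makes state sharing straightforward, here is a minimal sketch that groups HMM states whose factor vectors are close in Euclidean distance. The abstract does not describe the actual sharing procedure, so the greedy threshold grouping, the function name `tie_states`, the state names and the two-dimensional vectors are all assumptions chosen for illustration.

```python
import math

# Hypothetical state factors; real ones would be estimated by factor analysis
# and would typically have more dimensions.
state_factors = {
    "a_1": [0.10, 0.90],
    "a_2": [0.12, 0.88],
    "e_1": [0.80, 0.20],
    "e_2": [0.79, 0.25],
    "s_1": [-0.50, -0.40],
}


def tie_states(factors, threshold):
    """Greedy grouping (an assumption, not the thesis procedure): a state joins
    the first group whose representative vector lies within `threshold`."""
    groups = []  # list of (representative_vector, member_names)
    for name, vector in factors.items():
        for representative, members in groups:
            if math.dist(vector, representative) <= threshold:
                members.append(name)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append((vector, [name]))
    return [members for _, members in groups]


shared = tie_states(state_factors, threshold=0.1)
```

With these toy vectors, the two "a" states and the two "e" states each end up sharing one group, while "s_1" stays on its own. States in the same group could then share a single set of model parameters.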

Finally, we concentrated our efforts on the robustness of speech recognition systems to acoustic variability, particularly that generated by the environment. In our various experiments, we examined speaker variability, channel variability and additive noise. Through our factor analysis-based approach, we demonstrated the possibility of modeling these different types of acoustic variability as an additive component in the cepstral domain. By compensating for this component in the cepstral vectors, we are able to cancel out the harmful effect it has on speech recognition.
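The idea of an additive nuisance component can be sketched in a few lines. The example below assumes, as is common in factor analysis models, that the component takes the form U x, where U is a loading matrix and x a low-dimensional factor vector estimated for the recording; this form, the function names and the toy values are assumptions for illustration, and the thesis may estimate and remove the component differently.

```python
def nuisance_offset(loading_matrix, factors):
    """Additive component U x, assumed constant over the recording."""
    return [sum(u * x for u, x in zip(row, factors)) for row in loading_matrix]


def compensate(frames, offset):
    """Subtract the estimated additive component from every cepstral vector."""
    return [[c - o for c, o in zip(frame, offset)] for frame in frames]


# Toy values: 3 cepstral coefficients, 2 nuisance factors.
U = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
x = [0.5, 0.25]

observed = [[1.5, 2.5, 3.75],
            [0.5, 0.5, 0.75]]
cleaned = compensate(observed, nuisance_offset(U, x))
```

Because the component is additive in the cepstral domain, compensation reduces to a per-frame subtraction once U and x have been estimated.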

Keywords: Automatic speech recognition, factor analysis, compact acoustic modeling, phonetic classification, acoustic variability.

Master in computer science

Subject: The expanding request technique
Master Supervisor: Olivier Kraif

Thesis abstract

The expanding request technique (query expansion) is used in document retrieval to find more relevant documents. Similar techniques are used to find linguistic expressions by analyzing the syntactic and semantic levels, in order to increase the number of relevant expressions returned by a search. For example, the expanding request technique and similar methods are useful for retrieving examples in a master's thesis dedicated to translation and for building more thorough indexes. This work uses this search technique to find examples of polylexical structures that are expressed in a "canonical" form. After analyzing various kinds of canonical expressions that are likely to be used for teaching purposes, whether in handbooks or dictionaries, we chose to focus on the paradigmatic case of verb phrases and verb-based structures.
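As a rough illustration of expanding a request built from a canonical verb-based expression, the sketch below turns the canonical form into a pattern that also matches inflected variants of the verb. It is not the method developed in the thesis: the inflection table, the function name `expand_request` and the example sentences are hypothetical, and a real system would draw on a morphological lexicon and syntactic analysis.

```python
import re

# Hypothetical inflection table, for illustration only.
INFLECTIONS = {
    "take": ["take", "takes", "took", "taken", "taking"],
}


def expand_request(canonical):
    """Build a regex that matches the canonical expression with any listed
    inflection substituted for each of its words."""
    alternatives = []
    for word in canonical.lower().split():
        forms = INFLECTIONS.get(word, [word])
        alternatives.append("(?:" + "|".join(re.escape(f) for f in forms) + ")")
    return re.compile(r"\b" + r"\s+".join(alternatives) + r"\b", re.IGNORECASE)


pattern = expand_request("take into account")
sentences = [
    "She took into account every objection.",
    "The model takes into account the context.",
    "He gave an account of the trip.",
]
matches = [s for s in sentences if pattern.search(s)]
```

Here the first two sentences are retrieved even though neither contains the canonical form literally, while the third is left out.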

Keywords: Expression seeking, Specific grammar, expanding request, consistent expression.