Audio-based classroom activity detection for primary school lessons

Ríos, Braulio

Supervisor(s): Cancela, Pablo - Capdehourat, Germán

Abstract:

Classroom Activity Detection (CAD) is a challenging task, especially for primary school lessons, where student participation is fragmented, short, and often concurrent with teacher speech and background noise. This thesis proposes and evaluates three CAD models: two based on supervised audio classification (trained on a proprietary dataset that was annotated for this work), and one based on unsupervised diarization. The models are assessed by visualizing the estimated label density rather than the segment-level visualizations typically used in CAD, an approach that proves more effective for the highly fragmented segments observed in this use case. The main metric for comparing the models is the correlation coefficient between the estimated and ground-truth label densities, which measures how accurately each model captures the temporal distribution of the different classroom activities. A complementary metric is the error in the total time estimated for each label (e.g., the estimated Teacher Talking Time, or TTT). The supervised models, based on an LSTM neural network and a decision tree classifier, achieve similar classification performance, and both outperform the unsupervised diarization pipeline. Even a small amount of training data is enough for the supervised models to match the performance of the diarization system, and they generalize well to previously unseen voices. The unsupervised diarization model requires no training data for this particular task, but it detects the teacher's voice less accurately than the supervised models and cannot properly distinguish between the labels "single student" and "group work". Overall, the supervised CAD models proposed in this thesis show promising results for primary school lessons, even with limited training data. These models could be used to develop valuable tools to support classroom observation and evaluation.
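
The two metrics named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not say how label density is computed, so the code below assumes per-frame labels, non-overlapping windows, and a plain Pearson correlation. The function names, the window size, the frame duration, and the toy label sequences are all hypothetical.

```python
from math import sqrt


def label_density(frames, label, window):
    """Fraction of frames equal to `label` in each non-overlapping window.

    Assumption: density is a per-window proportion; the thesis may define it differently.
    """
    densities = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        densities.append(sum(1 for f in chunk if f == label) / len(chunk))
    return densities


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def total_time_error(truth, estimate, label, frame_seconds):
    """Signed error (estimate minus truth) in the total time assigned to `label`."""
    return (estimate.count(label) - truth.count(label)) * frame_seconds


# Toy data (hypothetical): T = teacher, S = single student, G = group work.
truth = list("TTTTSSTTGGGG")
estimate = list("TTTSSSTTGGGG")

true_density = label_density(truth, "T", window=4)
estimated_density = label_density(estimate, "T", window=4)
correlation = pearson(true_density, estimated_density)
ttt_error = total_time_error(truth, estimate, "T", frame_seconds=1.0)
```

In this toy case one teacher frame is misclassified as a student frame, so the estimated teacher talking time is one second short, while the correlation between the two density curves remains high because the overall temporal pattern is preserved.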


Bibliographic Details
2023
ANII Master's Scholarship
Classroom activity detection
Classroom monitoring
Diarization
Audio classification
Ceibal
Edtech
Educational technology
Primary school education
LSTM
Speech processing
Machine learning
Supervised learning
Unsupervised learning
Audio processing
English
Universidad de la República
COLIBRI
https://hdl.handle.net/20.500.12008/40734
Open access
Creative Commons Attribution - NonCommercial - NoDerivatives License (CC BY-NC-ND 4.0)