Audio-based classroom activity detection for primary school lessons
Supervisor(s): Cancela, Pablo - Capdehourat, Germán
Abstract:
Classroom Activity Detection (CAD) is a challenging task, especially for primary school lessons, where student participation is fragmented, short, and often concurrent with teacher speech and background noise. This thesis proposes and evaluates three CAD models: two based on supervised audio classification (trained on a proprietary dataset annotated for this work), and one based on unsupervised diarization. The models are assessed by visualizing the estimated label density rather than the segment visualizations typically used in CAD. This approach proves more effective at handling the highly fragmented segments observed in this use case. The main metric for comparing the models is the correlation coefficient between estimated and ground-truth label densities. Together, the density and its correlation measure how accurately each model captures the temporal distribution of the different classroom activities. A complementary metric is the error in the total time estimated for each label (e.g., estimated Teacher Talking Time, or TTT). The supervised models, based on an LSTM neural network and a decision tree classifier, achieve similar classification performance and outperform the unsupervised diarization pipeline. Even a small amount of training data is enough for the supervised models to match the performance of the diarization system, and they generalize well to previously unseen voices. The unsupervised diarization model requires no training data for this particular task, but it detects the teacher's voice less accurately than the supervised models. Additionally, it cannot properly distinguish between the labels "single student" and "group work". Overall, the supervised CAD models proposed in this thesis demonstrate promising results for primary school lessons, even with limited training data. These models could be used to develop valuable tools to support classroom observation and evaluation.
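The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the label density is computed, so the version below assumes it is the fraction of frames carrying a given label within a centered sliding window; the function names, the window length, and the toy label sequences are all hypothetical and are not taken from the thesis.

```python
from math import sqrt


def label_density(frames, label, window):
    """Fraction of frames equal to `label` in a centered sliding window.

    Assumed definition: the abstract does not say how density is computed.
    """
    half = window // 2
    hits = [1.0 if f == label else 0.0 for f in frames]
    density = []
    for i in range(len(hits)):
        lo, hi = max(0, i - half), min(len(hits), i + half + 1)
        density.append(sum(hits[lo:hi]) / (hi - lo))
    return density


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def density_correlation(truth, estimate, label, window):
    """Correlation between ground-truth and estimated label densities."""
    return pearson(label_density(truth, label, window),
                   label_density(estimate, label, window))


def total_time_error(truth, estimate, label, frame_seconds):
    """Signed error, in seconds, of the total time assigned to `label`."""
    t_true = sum(f == label for f in truth) * frame_seconds
    t_est = sum(f == label for f in estimate) * frame_seconds
    return t_est - t_true


# Toy per-frame labels (hypothetical data, one label per frame).
truth = ["teacher"] * 6 + ["student"] * 2 + ["teacher"] * 4
estimate = ["teacher"] * 5 + ["student"] * 3 + ["teacher"] * 4

r = density_correlation(truth, estimate, "teacher", window=3)
ttt_error = total_time_error(truth, estimate, "teacher", frame_seconds=1.0)
```

Under these assumptions, a correlation near 1 indicates that the estimated density follows the ground truth over time, while the total time error reports how far the estimated Teacher Talking Time is from the annotated one. A model can have a small total time error and still place the activity at the wrong moments, which may be why the two metrics are reported together.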
2023
ANII Master's Scholarship
Classroom activity detection; Classroom monitoring; Diarization; Audio classification; Ceibal; Edtech; Educational technology; Primary school education; LSTM; Speech processing; Machine learning; Supervised learning; Unsupervised learning; Audio processing
English
Universidad de la República
COLIBRI
https://hdl.handle.net/20.500.12008/40734
Open access
Creative Commons Attribution - NonCommercial - NoDerivatives License (CC BY-NC-ND 4.0)