This is the accompanying webpage for the paper “Exploratory Study Of Human-AI Interaction For Hindustani Music”, presented at the NeurIPS 2024 Creative AI Track. It contains video snippets from the user interviews along with explanatory text.
Below we present some examples of interaction between the participants and the model for each of the tasks. We present both favored and unfavored example generations, with text and figures explaining why each was perceived as good or bad.
Participants found the task of choosing between the two samples provided by the model either easy or difficult. Below we present two examples of an “easy” choice followed by one example of a “difficult” choice. As discussed below, the choice was sometimes easy because one option was clearly worse than the other, either due to inferior musical quality (example 1) or to being mostly silence (example 2).
As discussed in Section 4.2, participants found that model outputs were sometimes incoherent with the input provided to the model. Below we present an example of such incoherence, along with one example of coherence, and try to explain why that was the case.
The figures presented below are pitch contours extracted from the input audio and the generated audio, with the aim of illustrating why the responses sound good or bad. The contours are converted to a logarithmic scale in which 0 corresponds to the perceived tonic of the participant’s audio input. Light grey regions depict coarse boundaries of the notes used by the participant in their input.
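The tonic-relative logarithmic conversion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `contour_to_cents` and the example tonic of 220 Hz are assumptions made here for clarity.

```python
import numpy as np

def contour_to_cents(f0_hz, tonic_hz):
    """Convert a pitch contour in Hz to cents relative to the tonic.

    On this scale 0 corresponds to the tonic and +1200 to the octave
    above it. Unvoiced frames (f0 <= 0) are mapped to NaN so they are
    skipped when plotting.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    cents = np.full_like(f0, np.nan)
    voiced = f0 > 0
    cents[voiced] = 1200.0 * np.log2(f0[voiced] / tonic_hz)
    return cents

# Example with an assumed tonic of 220 Hz: the tonic itself maps to 0,
# the octave above maps to +1200 cents, and an unvoiced frame to NaN.
print(contour_to_cents([220.0, 440.0, 0.0], 220.0))
```

Plotting such a contour against horizontal bands centered on the scale degrees of the input yields figures like those shown below.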
Note: the contours in the videos were present for debugging purposes during the interviews. While they are correct, they are not presented in a clear and understandable manner. As a result, pitch contour plots are presented below with relevant text; we ask the reader to follow the figures and ignore the plots in the videos.
(Incoherent Sample) In this sample, P1 played a very simple input using six unique notes. The model generated an output with a lot of movement and also introduced two notes that were not present in the input (highlighted in red in the figure below), which made the output sound odd.
Pitch contour representation of an incoherent sample generated by the model. The parts highlighted in red indicate notes that were not in the scale used by the participant. Additionally, the model output has much more melodic movement than the participant's input, which led him to comment that “… it's (model's response) probably not keep like keeping the similar structure” (Section 4.2).
Video of the sample
(Coherent sample) An example where P3 remarks ‘nice’ in response to the model’s output, likely because the model maintains the same scale as the input.
The participant appreciated this output since it maintains the notes provided in the input. Because the reference pitch against which normalization was performed is only a rough approximation, the pitch contours are not precise in position. The author, who has trained extensively in Hindustani music for over 20 years, therefore performed a rough transcription of both samples (shown as a black line) to make the melodic idea easier to parse visually. The participant was attempting to sing in Raag Lalit. The notes of the Raag are highlighted in grey boxes and, as can be seen, the model output uses the same notes, which the participant highly appreciated (as discussed in Section 4.1).
Pitch contour representation of a coherent sample generated by the model. The participant tried to sing a melody in the scale of Raag Lalit, whose notes are highlighted as grey horizontal bars. A rough transcription is provided as a black line to highlight the melodic idea; the notes in the model output match the scale of the input.
Video of the sample.
Below we present some examples of interaction for the melodic reinterpretation task. The participants seemed to enjoy this task more since they could see a clear relationship between the input and output (illustrated in examples 1 and 2 below). The participants also found creative ways to interact with the model, such as trying to convert colloquial phrases into melodic phrases (example 3).