Simultaneous singing makes sense, but simultaneous talking doesn't make sense.
If singing was originally a form of communication, it must have been a form of group communication.
In a song, it totally makes sense for two or more singers to be singing the same words of the same song at the same time.
During normal speech, it does not make sense for more than one speaker to be speaking the same words of the same sentence at the same time.
Normal spontaneous conversational speech is a function of what the speaker is thinking at the exact moment they are speaking, and it is almost impossible for two different people to be thinking exactly the same thought at the same time.
There may have been a time when music, or some ancestor of music as we know it, ie so-called "proto-music", was a major form of communication.
But if proto-music was a form of communication, it was a very different kind of communication from modern conversational speech.
The existence of group singing compared to the non-existence of group talking strongly suggests that proto-music was a form of communication where it totally made sense for multiple "speakers" to say exactly the same thing at once.
This hypothesis does raise an obvious question of coordination. If multiple communicators are going to communicate the same thing at the same time, how do they plan in advance what they are going to communicate and when they are going to communicate it?
A possible answer to this question is to be found in another feature that music has and conversational speech lacks: repetition.
In music, you can "communicate" a certain tune, with or without words. And then it totally makes sense to communicate exactly the same tune (with or without words) again. And again. And again. Maybe for a few minutes, before all the listeners get bored hearing the same tune repeated.
With conversational speech, that does not happen. If any repetition occurs in conversational speech, it is usually limited to very small components of the speech, eg repeated words like "go go go!". It is never normal to repeat a full sentence (unless of course for some reason the speaker believes that the listener didn't hear it the first time).
This observation about repetition suggests a possible scenario for how group communication could have occurred, without the need for a priori planning and coordination:
One individual initiates the communication by "communicating" a tune or a song proto-musically.
The same individual repeats exactly what they just communicated, over and over again.
The other individual or individuals listening, ie the "audience", decide whether they are convinced of the truth or likely truth of what the initial communicator has communicated.
If anyone in the audience agrees with what has been communicated, they join in, repeating exactly the same communication, in unison with the initiator.
Eventually all members of the audience who are going to join in have joined in. There may be some members of the audience who have declined to join in, because they are not convinced enough of the truth or value of what is being communicated by those who have joined in. (Indeed in many cases there will be those who are, based on the social relations between different members of the group, generally skeptical about any communications from that particular initiator, and in general the question of who most often joins in with communications initiated by particular individuals would depend on those social relationships.)
If enough audience members show their agreement by joining in the communication, group action may then be taken with regard to whatever it is that the communication is about (although it doesn't always have to be a communication about something that requires immediate action).
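The steps above can be read as a simple protocol, and a toy simulation may make the sequence easier to follow. The sketch below is only an illustration under stated assumptions: the names `Listener`, `joins_at`, `run_chorus` and `quorum` are hypothetical, and each listener's decision is reduced to a fixed repetition at which they become convinced (or never do). Nothing here is a claim about how such communication actually worked.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Listener:
    name: str
    # Hypothetical parameter: the repetition at which this listener becomes
    # convinced and joins in. None means they never join (a skeptic).
    joins_at: Optional[int]


def run_chorus(
    listeners: List[Listener], max_repetitions: int, quorum: int
) -> Tuple[List[str], bool]:
    """Simulate one initiator repeating a tune while listeners join in.

    Returns the list of singers (in the order they joined) and whether
    enough of them joined to trigger group action. `quorum` is an assumed
    threshold that counts the initiator as one of the singers.
    """
    singers = ["initiator"]  # the initiator sings from the first repetition
    for repetition in range(1, max_repetitions + 1):
        for listener in listeners:
            already_singing = listener.name in singers
            convinced = (
                listener.joins_at is not None
                and listener.joins_at <= repetition
            )
            if convinced and not already_singing:
                singers.append(listener.name)  # joins in unison
    return singers, len(singers) >= quorum


audience = [Listener("A", 1), Listener("B", 3), Listener("C", None)]
print(run_chorus(audience, max_repetitions=5, quorum=3))
# → (['initiator', 'A', 'B'], True)
```

In this toy run, listener C never joins, which corresponds to the audience members who remain unconvinced. Stopping after only two repetitions would leave B out as well, and the assumed quorum of three would not be reached.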
We can summarize how this contrasts with the normal back-and-forth aspect of conversational speech:
In conversational speech:
I say what I think.
You respond with what you think about what I just said.
I respond with what I think about what you just said you think about what I initially said.
Others within earshot might join in and say what they think about everything that has been said so far.
etc
The conversation might eventually result in some of the parties to the conversation taking some coordinated action with regard to the topic of the conversation.
Whereas, in proto-musical communication (as hypothesised):
I say what I think.
I repeat what I just said, endlessly.
If you agree with me, you join in and repeatedly say that you think the same thing.
If anyone else in earshot also decides that they agree with what I think, they may also decide to join in and say that they also think the same thing.
After a certain number of repetitions, it has been determined how many individuals in the group agree with what I initially said, and depending on that number, we might all initiate some coordinated action with respect to it.