# Alice Corpus version 1.0 #

Last update: 2017/09

This corpus is available from <https://TO.BE.DETERMINED>.

This is version 1.0 of a human-machine dialogue corpus collected in a
Wizard-of-Oz (WoZ) setting.  Each dialogue is an information-seeking
conversation between a human and a virtual agent controlled by a WoZ system.

See the "Reference" section of this document for more information.

## XML File Formatting

Dialogue files use the following XML format:

```xml
<dialogue id="[UNIQUE_DIALOGUE_IDENTIFIER]" condition="[CONDITION_IDENTIFIER]" duration="[DIALOGUE_DURATION]">
	<part type="before-interruption">
		<utterance id="[UTTERANCE_SEQUENCE_NUM]" speaker="[UTTERANCE_SPEAKER_IDENTIFIER]">[UTTERANCE_TRANSCRIPTION]</utterance>
		<utterance id="[UTTERANCE_SEQUENCE_NUM]" speaker="[UTTERANCE_SPEAKER_IDENTIFIER]">[UTTERANCE_TRANSCRIPTION]</utterance>
		...
	</part>
	<part type="interruption">
		<utterance id="[UTTERANCE_SEQUENCE_NUM]" speaker="[UTTERANCE_SPEAKER_IDENTIFIER]">[UTTERANCE_TRANSCRIPTION]</utterance>
		<utterance id="[UTTERANCE_SEQUENCE_NUM]" speaker="[UTTERANCE_SPEAKER_IDENTIFIER]">[UTTERANCE_TRANSCRIPTION]</utterance>
		...
	</part>
	<part type="after-interruption">
		<utterance id="[UTTERANCE_SEQUENCE_NUM]" speaker="[UTTERANCE_SPEAKER_IDENTIFIER]">[UTTERANCE_TRANSCRIPTION]</utterance>
		<utterance id="[UTTERANCE_SEQUENCE_NUM]" speaker="[UTTERANCE_SPEAKER_IDENTIFIER]">[UTTERANCE_TRANSCRIPTION]</utterance>
		...
	</part>
</dialogue>
```

Description of the dialogue fields:
* `UNIQUE_DIALOGUE_IDENTIFIER`: a unique dialogue identifier
* `CONDITION_IDENTIFIER`: data collection condition (either "B" or "C")
* `DIALOGUE_DURATION`: dialogue duration in the format `MmSs`, where `M` is the number of minutes and `S` is the number of seconds (e.g., "7m36s")
* `GENDER`: the gender of the participant talking to the agent (either "male" or "female")

Description of the utterance fields:
* `UTTERANCE_SEQUENCE_NUM`: the index of the utterance in the sequence of utterances forming the dialogue
* `UTTERANCE_SPEAKER_IDENTIFIER`: either "A" for the virtual agent, "U" for the user, or "E" for the experimenter
* `UTTERANCE_TRANSCRIPTION`: the utterance transcription. It may include the following special tags/tokens:
   - `NAME`: anonymised name
   - `UNK`: unrecognized token
   - `(laugh)`: indicates laughter
   - `noise`: indicates that the agent's utterance is given after the noise manipulation
   - `missing`: indicates that the agent lacks the information (i.e., does not have the knowledge to answer the question)
   - `wrong`: indicates an intentionally wrong response by the agent (only in condition B)
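
As an illustration of the format described above, here is a minimal Python sketch that reads one dialogue with the standard `xml.etree.ElementTree` module and converts the `MmSs` duration to seconds. The sample dialogue (its identifier and transcriptions) and the helper names `duration_to_seconds` and `read_dialogue` are invented for this example and are not part of the corpus or of any accompanying tooling.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample following the template above; the id and the
# transcriptions are invented for illustration and are not corpus data.
SAMPLE = """
<dialogue id="d001" condition="B" duration="7m36s">
  <part type="before-interruption">
    <utterance id="001" speaker="U">who wrote the book?</utterance>
    <utterance id="002" speaker="A">Lewis Carroll wrote it</utterance>
  </part>
  <part type="interruption">
    <utterance id="003" speaker="E">sorry to interrupt</utterance>
  </part>
  <part type="after-interruption">
    <utterance id="004" speaker="U">(laugh) thank you</utterance>
  </part>
</dialogue>
"""


def duration_to_seconds(value):
    """Convert an MmSs string such as '7m36s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", value)
    if match is None:
        raise ValueError(f"unexpected duration format: {value!r}")
    minutes, seconds = map(int, match.groups())
    return 60 * minutes + seconds


def read_dialogue(xml_text):
    """Return the dialogue attributes and a flat list of utterances."""
    root = ET.fromstring(xml_text.strip())
    turns = []
    for part in root.iter("part"):
        for utt in part.iter("utterance"):
            # itertext() also collects text nested in child elements;
            # split/join collapses runs of whitespace and line breaks.
            text = " ".join("".join(utt.itertext()).split())
            turns.append((part.get("type"), utt.get("id"), utt.get("speaker"), text))
    return {
        "id": root.get("id"),
        "condition": root.get("condition"),
        "seconds": duration_to_seconds(root.get("duration")),
        "turns": turns,
    }


dialogue = read_dialogue(SAMPLE)
for part_type, utt_id, speaker, text in dialogue["turns"]:
    print(part_type, utt_id, speaker, text)
```

To read a real corpus file, `ET.parse(path).getroot()` could replace the `ET.fromstring` call.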


Utterances may contain overlap tags that mark tokens produced
simultaneously by different speakers, e.g.:
```xml
	...
	<utterance id="039" speaker="U">for how long was Alice <overlap id="001">falling?</overlap></utterance>
</part>
<part type="interruption">
	<utterance id="040" speaker="E"><overlap id="001">i'm sorry</overlap>. do you like some water, coffee or tea?</utterance>
</part>
```

Overlap tags are specified as follows:
```xml
	<overlap id="[OVERLAP_ID]">[OVERLAPPING_CONTENT]</overlap>
```
where:
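* `OVERLAP_ID`: an overlap identifier, shared by the segments that overlap each other (in the example above, both segments carry `id="001"`)
* `OVERLAPPING_CONTENT`: the tokens produced at the same time as the other speaker's segment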
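
Overlapping segments can be paired by grouping `<overlap>` elements on their `id`. The following Python sketch does this for the two utterances from the example above. It assumes the utterances are wrapped in a bare `<dialogue>` root so that the excerpt parses on its own; the function name `collect_overlaps` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two utterances from the example above, wrapped in a bare <dialogue>
# root (an assumption made only so that the excerpt is well-formed).
EXCERPT = """
<dialogue>
  <utterance id="039" speaker="U">for how long was Alice <overlap id="001">falling?</overlap></utterance>
  <utterance id="040" speaker="E"><overlap id="001">i'm sorry</overlap>. do you like some water, coffee or tea?</utterance>
</dialogue>
"""


def collect_overlaps(xml_text):
    """Map each overlap id to its (utterance id, speaker, tokens) segments."""
    root = ET.fromstring(xml_text.strip())
    overlaps = defaultdict(list)
    for utt in root.iter("utterance"):
        for ov in utt.iter("overlap"):
            # itertext() yields only the text inside the <overlap> element,
            # not the text that follows its closing tag.
            tokens = "".join(ov.itertext()).strip()
            overlaps[ov.get("id")].append((utt.get("id"), utt.get("speaker"), tokens))
    return dict(overlaps)


overlaps = collect_overlaps(EXCERPT)
for overlap_id, segments in overlaps.items():
    print(overlap_id, segments)
```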

## Reference
TBD
