Minutes of the Second Workshop on Internationalizing SSML

Foundation for Research and Technology - Hellas (FORTH) in Heraklion, Crete,
site of the W3C Office in Greece

30-31 May 2006

[Photo from the second SSML Workshop]

Each session included the presentation of one or two papers, followed by a discussion of at least one item raised in the papers. Some discussions referred to items from several previously presented papers.

 

Attendees

Jerneja Zganec Gros (Alpineon, Slovenia)
Geza Nemeth (BME TMIT)
Geza Kiss (BME TMIT)
Nixon Patel (Bhrigus Inc.)
Raghunath K. Joshi (Centre for Development of Advanced Computing/C-DAC Mumbai)
Chris Vosnidis (Dialogos Speech Communications S.A.)
Davide Bonardo (Loquendo)
Kimmo Parssinen (Audio Applications, Nokia Specific Symbian SW, Technology Platforms)
Ksenia Shalonova (OutsideEcho)
Oumayma Dakkak (HIAST)
Przemyslaw Zdroik (France Telecom R&D Poland)
Paolo Baggia (Loquendo)
Max Froumentin (W3C)
Kazuyuki Ashimura (W3C)
Richard Ishida (W3C)
Dan Burnett (W3C Invited Expert)

 

Tuesday 30 May, 8:30-18:00

Session 1: Introductory

Moderator:
Kazuyuki Ashimura
Scribe:
Max Froumentin
Welcome and meeting logistics — FORTH –[Slides]

none

Workshop expectations — Kazuyuki Ashimura –[Slides]

none

Introduction to W3C and the Voice Browser Working Group — Max Froumentin –[Slides]

none

Internationalization of SSML — Dan Burnett –[Slides]

none

PLS for SSML — Paolo Baggia –[Slides]

Q: If a word is both a homophone and a homograph, what is the hierarchy? In Devanagari the word "kurl" means "hand" and also "do": the spelling is the same but the meanings differ. Which would you handle first?

Paolo: It doesn't matter. If the pronunciation is the same, the TTS will say it right. Disambiguating the meaning is at another level, that of the Semantic Web (outside the scope of this workshop).

Nixon: but the answer is right there.

Paolo: Nothing prevents a lexicon from having two entries with the same graphemes and the same pronunciation but a different "role". We'll return to this problem in the relevant sessions.
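A minimal sketch of what Paolo describes, assuming the draft PLS role mechanism; the grapheme, pronunciation, and role values here are purely illustrative:

    <lexicon version="1.0"
             xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
             alphabet="ipa" xml:lang="hi">
      <!-- two entries: same grapheme, same pronunciation, different role -->
      <lexeme role="noun">
        <grapheme>kurl</grapheme>
        <phoneme>kər</phoneme>
      </lexeme>
      <lexeme role="verb">
        <grapheme>kurl</grapheme>
        <phoneme>kər</phoneme>
      </lexeme>
    </lexicon>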


Q: Why not use SAMPA as the phonetic alphabet?

Paolo: IPA is something you can reference. SAMPA has many variants, and some companies even have their own. IPA is difficult to write, but at least it is one alphabet that tries to cover all sounds.
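For reference, SSML 1.0 already lets an author attach an in-line IPA transcription with the phoneme element, which is the kind of referenceable usage Paolo means (a standard English example):

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme>
    </speak>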

Updates to RFC 3066 — Richard Ishida –[Slides]

Max: what is the script tag for Japanese?

Richard: not sure how that works.


Anna: for Greek, would it be interesting to have a tag for Ancient Greek?

Richard: it would be interesting indeed. But for Modern Greek it would not be needed.


Q: people sometimes abuse that; e.g. Greeks write SMS messages in Latin script and don't use diacritics.

A: Poles do that too; in the Polish case the result is not "real" Polish.


Kazuyuki: what is the difference between Scottish English and Irish or Welsh English?

Richard: dialects


Nixon: we need to come up with a breakdown of dialects; how do we register them?

Richard: register with IANA


Dan: IANA allows you to create registries and specify how to add values (who's responsible).
  E.g. Top Level Domain Names, MIME types, character sets.
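As a hedged illustration of the registry-based tags under discussion, RFC 3066bis script subtags can capture the "Greeklish" case raised above; the subtags come from the IANA registry, while the surrounding markup is ordinary SSML:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="el">
      <!-- el-Latn = Greek language written in Latin script -->
      <s xml:lang="el-Latn">kalimera</s>
    </speak>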

(BREAK)

Session 2: Languages / Dialects

Moderator:
Richard Ishida
Scribe:
Dan Burnett / Paolo Baggia
Ksenia Shalonova: Position paper for SSML workshop in Crete –[Slides, wave1, wave2, wave3]

Topics include: tones, dialects & styles, schwa deletion

Perspective from the Local Language Speech Technology Initiative:
- Provide tools for languages in developing countries: kiSwahili (the whole of East Africa, 20 million speakers), isiZulu (South Africa; the government sponsors African languages), Hindi.
- In developing countries people often have no access to PCs; information comes from (cheap) mobile phones, which also give Internet access. There are many kiosks, and a huge number of illiterate people.
- There are business opportunities in these countries: kiSwahili services in Kenya and Tanzania, isiZulu kiosks in South Africa, Hindi information services to book railway tickets.
- Decomposition of words into constituents.
Nixon Patel: SSML Extensions for Text to Speech systems in Indian Languages –[Slides]

Topics include: syllables, loan words, <dialect>

Speech Language Technology Lab @ Bhrigus:
- Playing a leadership role; 10 members and advisors (3 PhDs + 4 Masters); initiating SSML and VXML chapters in India.

Nature of Indian language scripts:
- The basic units of the writing system are Aksharas.
- Aksharas are syllabic in nature; their forms are V, CV, CCV, CCCV, and in written form they always end with a vowel (or nasalized vowel).
- ~1652 dialects/native languages; 22 officially recognized languages.

Convergence of Indian language (IL) scripts:
- Each Indian language has its own script, but all share a fairly good common phonetic base: a common set of speech sounds across all languages.
- Non-tonal.

How to represent Indian language scripts:
- Unicode: useful for rendering Indian scripts, but not suitable for keying in, nor for building modules such as text normalization.
- Itrans-3 / OM: a transliteration scheme by IISc Bangalore, India and Carnegie Mellon University; useful for keying in and storing the scripts of Indian languages on QWERTY keyboards, and for writing processing modules/rules for letter-to-sound, text normalization, etc.

Issues in TTS rendering in IL:
- TTS should be able to pronounce words as Aksharas; the languages have heavy mutual influence.
- There is <phoneme alphabet="itrans-3" ph="  ">, but a syllable-level element is proposed: <syllable alphabet="itrans-3" syl="naa too">...
- Motivation for a loan-word element <alien>: "BANK" has to be pronounced as /B/ /AE/ /N/ /K/; note the /AE/ phoneme.

Dialect element:
- To load language-specific rules.

Conclusions: proposed new elements <syllable>, <alien>, <dialect>.
Discussion: How should dialects be supported? What are the shortcomings of RFC 3066? –[RFC 3066bis article]
Goal: understand the troubles.

Dan: This came up at the previous workshop. We should distinguish written and spoken language. There is a new version, RFC 3066bis: is that sufficient, or are separate markings needed?

Joshi: In India there are 16 standard languages plus 3 more; those are spoken, not written.

Nixon: From an active implementation perspective there is a trade-off: it is very inefficient to load resources.

Oumayma: There is a similar problem in Arabic with the text-to-phoneme component: there are too many syllables, so diphones are used to save space.

Paolo: Let me try to clarify the ways SSML offers today: <phoneme> and <lexicon>, with possible extensions to deal with ambiguities: a token role pointing into the lexicon, and xml:lang on tokens.

GezaN: This is not for engine developers but for application developers. Dialect or language is the same question: if they are different enough, you need a different engine. [refers to the proposal from the Hungarian university]

Chris: We are having this discussion because the Indian languages share the same phonemes; if the difference is big, create a new engine.

Dan: Is xml:lang enough?

Nixon: Yes.

Richard: xml:lang is the text-processing directive; it describes the content of the element. Other directives are needed for other activities, like loading lexicons, changing voices, etc.

Dan: To kick off the discussion: xml:lang is serving two purposes today. We could add a new attribute like "speak-as" to specify the target language. SSML 1.1 will discuss the new xml:lang to understand whether a single attribute is enough or a second one is better.
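A speculative sketch of the two-attribute idea Dan floats; the speak-as attribute is hypothetical and was never part of SSML 1.0:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="hi">
      <!-- hypothetical speak-as: the content is English text (xml:lang),
           but should be rendered with the Hindi voice's conventions -->
      <s xml:lang="en" speak-as="hi">bank</s>
    </speak>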

 

(LUNCH)

Session 3: Syllables / Tokens

Moderator:
Paolo Baggia
Scribe:
Max Froumentin
Geza Nemeth and Geza Kiss: Proposals for Extending the Speech Synthesis Markup Language (SSML) 1.0 from the Point of View of Hungarian TTS Developers –[Slides]

Topics include: speaking-style, syllable structure, phoneme language, parts of speech, phonetic alphabet for Slovenian, foreign text

Davide Bonardo: SSML Extensions for Multilanguage usage –[Slides, wave1, wave2, wave3, wave4, wave5]

Topics include: interpret as, <token>

Discussion: How should syllable structure / token be represented in SSML?
Paolo: Loquendo prefers "token", a term not too linguistically marked.

Dan: I agree. We've had interesting related discussions regarding Chinese. I've heard three issues in all:
- unit demarcation: the unit might be a word, token, morpheme, syllable, phoneme, or mora;
- unit activities: change pronunciation, emphasis, timing, stress, volume (the phoneme element today);
- link to the lexicon: a token is used so that there's a tie to the lexicon. SRGS has tokens, which are tied to lexicons. There has to be a clearly understood linkage between SSML and PLS.

Nixon: SISR?

Dan: SISR is a separate processing step. The first step is to match the sounds to words; the second is to map words to meaning ("coke" and "coca-cola" both map to coca-cola). SISR is used to set up this mapping.

Paolo: If we know the input is an SMS, that helps, because you can have many acronyms ('mtfbwy'). ASR shares similar problems, but SRGS already has <token> for things like this. E.g. "New York" has coarticulation: it is better to keep it together, so you mark it up as <token>New York</token> in SRGS.

Geza Kiss: Chinese word boundary detection is important.

Dan: Yes, we talked about it in the group. The Chinese don't care about the element name (word, token); they just need it.

Chris: The question is, do we delegate a lot of responsibility to the TTS? Can we assume the TTS can handle SMS script? Also, it's very useful to combine POS with inflected languages; this additional information is useful. Finally, about emotion, and about Italian within English...

Paolo: SMS and emotions are for later discussion. Right now we want to talk about word segmentation. For languages we have to draw a line, otherwise we're going to redefine the whole of SSML. What we offer are a few ways of adjusting the English if it doesn't work well. SSML 2 may go beyond that.

Jerneja: With respect to token/word, it's very important to have PoS for highly inflected languages.

Raghunath: Phoneme/morpheme/word/sentence: the word has to be split in some way. So is token similar to morpheme?

Paolo: Yes, similar to morpheme and other things. A phoneme is a piece of something, but that something differs according to the language.

Raghunath: But a morpheme has semantic bearings.

Paolo: Either we define something like morpheme, which has a precise technical meaning, or we go for something practical: an element with no precise definition but which works for splitting words.

Geza Nemeth: Add an attribute to <token>. In Chinese you have to differentiate the prosodic unit from the pronunciation unit.

Ksenia: Can you add tones for African languages to token?

Paolo: Yes, that was a proposal from the Chinese participants. A token is a decomposition with characterisation.

Dan: We may put features in SSML which we leave half-specified and flexible, but we do have to know the linkage with the lexicon, even if the lexicon entries are not well defined either. What turned out to matter most for the Chinese was that a syllable-based alphabet did most of the work. Here the concern is different: what do we want to do with the segmentation offered?

Geza Kiss: A token cannot mean a syllable; SSML says so.

Dan: If you have a one-syllable word, then yes; otherwise no. To me token means word, except I don't want to say "word", because of some languages.

Paolo: You're saying we're missing <token>, and <phoneme> could change semantics.
Oumayma: You map the phonemes from the mother language: do you have look-up tables to map phonemes? Have you done statistical analysis?

Davide: Each language is a table of phonemes, and we have a patented algorithm that does the mapping, based on linguistic classification. We do it for all 18 languages we support.

Kazuyuki: About Japanese tokenization processing: a Japanese word is a morpheme, carrying not only graphemes but PoS. A second problem is that there are many compound words in Japanese [shows example on whiteboard: /ura niwa / niwa / niwa / niwa tori / ga iru/]. That's a problem for Japanese TTS, which has to do analysis. Japanese lexicons have both separate and compound entries.

Paolo: So it is useful to tokenize?

Kazuyuki: Yes, for both separate and compound words.

Paolo: So, recursive tokens?

Geza Nemeth: Not compatible with SSML 1.0.

Dan: In SSML 1.0, all the examples were ones where you tried to override the TTS. So is this particular thing sufficient to fix the processors; what's the minimal thing we need to do? In this case, would one level of tokenization be sufficient?

Raghunath: A quick comment on the whiteboard example. In Sanskrit there are many compound words. [Gives an example ("Arun-Udei"?) with coarticulation.]

Dan: That example exists in any language [says "Idnstrd" for "I don't understand"]. A TTS may be smart, but may not be able to tokenize everything.

Ksenia: In African tonal languages, you may need several levels of tokens.

Geza Nemeth: 1. Upward compatibility with SSML 1.0? 2. I think there should be a subdivision of token, otherwise it's a mess. 3. SSML could be used after semantic analysis of the text, to be passed to the synthesizer.

Dan: There was an example given in Chinese: a poem which, according to where the boundary was, meant one thing or its opposite.

Max: So the ambiguity exists for humans too. Should SSML do better? Guess?

Dan: The engine that generates the SSML necessarily adds semantic information in any case.

Paolo: There is the problem of backwards compatibility and scalability for future versions. You'll want more than <token>, so you will add lots of new elements.

Przemyslaw: In Arabic TTS, tokenizing the text is also important, then vocalizing it. One level of tokenizing is enough.

Oumayma: Arabic is a syllabic language, so do you rely on this fact?

Przemyslaw: In order to vocalize/vowelize, we need token markup. It's easier.

Oumayma: I don't agree. Is he working on the signal or on the text? On the signal, I disagree; on text, a whole word can be a collection of phonemes, so it's easier to tokenize.
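A hedged sketch of Kazuyuki's whiteboard example marked up with the proposed token element (token was only a proposal at this point; the romaji segmentation follows the whiteboard):

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ja">
      <!-- without boundaries, the repeated "niwa" is ambiguous to the tokenizer -->
      <token>ura niwa</token> <token>niwa</token> <token>niwa</token>
      <token>niwa tori</token> <token>ga iru</token>
    </speak>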

MAJOR POINTS:
- Unit demarcation (approximately word-level and below, i.e. not subphrases of a sentence): the unit might be a word, token, morpheme, syllable, phoneme, or mora. We agree it is important, and <token> might be a good short-term solution.
- Unit-level processing: change pronunciation, emphasis, timing, stress, volume (the phoneme element today).
- Link to lexicon: a token is used so that there is a tie to the lexicon. SRGS has tokens, which are tied to lexicons. There has to be a clearly understood linkage between SSML and PLS. If the token is at the word level, then it can be marked as another language.
- Token and xml:lang. Dan: Do you need lower-than-word-level language identification?

Geza Nemeth: Yes, in German or in agglutinative languages: compound words with different pronunciation requirements.

Richard: Again, it is the question of what a word is. Is "teburu-ni-wa" one word? It has English and Japanese.

 

(BREAK)

Session 4: IPA / Phonetic Alphabets

Moderator:
Kazuyuki Ashimura
Scribe:
Paolo Baggia
Raghunath K. Joshi: The Phonemic model from India for Bi-modal Applications –[Slides]

Topics include: IPA and phonetic model

- A model for multilingual communication (textual/verbal).
- Deshnanagari: a common script for all Indian languages.
- Multilingual happenings: social events in Mumbai.
- Non-semantic sound poems.
- Collaborative research with Dr. Ashok Ranade on a notation system for Indian music.
- Manual typographic activity for many years (syllabic breaks and meaning breaks).
- The Indian oral tradition has a long history: the Veda families went from Oral → Text → Phonetics → Grammar.
- Definition of phonemes (Varnas, ca. 2000 B.C.): vowels and consonants; the formation of articulate sounds and their mode.
- Speech-related issues: Indus signs; the Brahmi script; consonant-sound + vowel-sound renderings in different scripts, plus the rendering of accent marks in Vedic Sanskrit.
- The concept of InPho: correlation with IPA; a proposal of phonemic codes; the range of IPA; InPho issues vs. IPA issues.
- Position statement:
    Concrete Text  ↔ Stylistic Speech
    Formatted Text ↔ Synthetic Speech
    Simple Text    ↔ Monotone Speech
Discussion: What phonetic/phonemic alphabets should be used for SSML? Is IPA satisfactory for representing the pronunciation of words?
Ksenia: Include the schwa in the SSML text, instead of complex processing.

Richard: In English you need to use the lexicon for everything.

Ksenia: It is not possible to enumerate everything for highly inflected languages.

Richard: Hindi is present.

Ksenia: It is an issue whether to eliminate [the schwa] or not.

Przem: In Polish too, it is not a solution, even if it can help.

Joshi: All the Indian languages are based on Sanskrit. Phonemes are added together.

Nixon: In the languages we have done, these cases are solved by an additional schwa.

Dan: A similar issue occurs in Hebrew. SSML is to mark up text a human can read. If the processor has a problem, an alternative pronunciation is given.

Richard: If you look at Hebrew and Arabic, there is more and more; in Hindi it is much less. For SSML you let the processor do more. If you can do that, you should do that.

Nixon: What we did was exactly that: create a dictionary and then...

Paolo: In defence of IPA: it is one way of writing pronunciations. It has many drawbacks (difficult to type, difficult to read), but part of the problem is that...

Dan: At first I strongly disagreed with Paolo. Many people do not like IPA; in China, though, it is taught in school to children, and it is easy to type.

Richard: I totally agree with Dan and Paolo. But even for Chinese, if you need allophones you will use IPA. Back to the schwa: are there morphological rules?

Ksenia: If there is a morphological boundary, one rule applies; if not, another rule.

Richard: If you add a "virama" sign, you will give the missing information.

Dan: If there is a way in the script to adjust the text, it could be done manually or by pre-processing. There could be accessibility concerns in doing this.

(Discussion on Bhrigus in India)

Ksenia: Why do you have <sub>?

Dan: It is a facility.

Chris: IPA is useful as common ground. You can create a resource and prescribe the use of another alphabet, but give IPA a presence.

Richard: This is why you have markup. You can add metadata to describe the difference. [...]

Dan: I was thinking of creating a registry for alphabets, with a process for adding an alphabet. This will be discussed in the group.

Kazuyuki: We will continue this discussion in tomorrow's topics.
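For context on the registry idea: SSML 1.0 defines only "ipa" as a standard alphabet value but permits vendor-defined alphabets with an x- prefix, so a registry would standardize names like the illustrative x-itrans3 below:

    <phoneme alphabet="ipa" ph="ʃwɑ">schwa</phoneme>
    <!-- vendor-defined alphabet; the name x-itrans3 is illustrative only -->
    <phoneme alphabet="x-itrans3" ph="...">...</phoneme>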

 

(DINNER)

 

Wednesday 31 May, 8:30-18:00

Session 5: Multilanguage Issues

Moderator:
Richard Ishida
Scribe:
Max Froumentin
Kimmo Parssinen: Development Challenges of Multilingual Text-to-Speech systems –[Slides]

Topics include: fallback language, language mappings, preprocessor, multilingual

Zdroik Przemysław: Position paper for 2nd W3C Workshop on Internationalizing the Speech Synthesis Markup Language (SSML) –[Slides]

Topics include: token element, missing diacritics, word stress

Discussion: How to represent foreign words?
Ksenia: What about African languages?

Kimmo: We will probably support Zulu and Afrikaans, but they won't be available before the UI supports them, and currently S60 phones don't have them. It is a political decision that all UI languages are supported.

Nixon: What is the footprint of the ASR?

Kimmo: About 200 KB; the TTS engine is 20-30 KB plus language data.

Richard: Spoken vs. written languages?

Kimmo: The processor wouldn't change the voice, but would do its best.

Max: Basically what Davide suggested.

Paolo: Element or attribute?

Kimmo: As long as it's understandable and the requirements are fulfilled.

Géza: You could use say-as with lexicons for that. There could be two layers of lexicons, one provided by the TTS engine provider and one by the application developer. It would be helpful to have this in the standard in some way. However, I'm wary of using the same lexicon for ASR and TTS.

Paolo: We try to accommodate both ASR and TTS needs in a PLS lexicon. The user lexicon would be used for very few adjustments. In many simple cases you want adjustments in both ASR and TTS.

Géza: In ASR, how would you relate the lexicon to the standard grammar?

Paolo: Both the grammar and SSML have the lexicon element, so you can refer to one lexicon from both. What's missing is only the standard lexicon format, and with PLS we're trying to address that simply.

Kimmo: Yes, the key here is the standard.

Paolo: About the list of preferred languages: xml:lang doesn't take more than one value, so we would need a new attribute.

Richard: For later discussion...

Paolo: Can we have more than one value in xml:lang? If not, we need another attribute.

Richard: You could in principle have multiple values in xml:lang, but I strongly discourage that, in order to align with other specs. xml:lang may serve as a default.

Chris: xml:lang describes the content, but it should not be overloaded with additional meaning giving instructions to the TTS. There should be another attribute carrying the extra semantics.

Dan: Yes, there is a need to distinguish between written and spoken attributes.

Max: xml:lang would then not be used?

Paolo: No, we need both as hints to the TTS. Today it isn't clear.

Géza: I suggest <voice lang="...">...</voice>.

Oumayma: You can create a synthesizer for all languages, and SSML should support that.

Przemysław: It's very hard.

Géza: Don't forget the problem of "unknown".

Paolo: I just checked and found that xml:lang is required on <speak>. I suppose "unknown" would be handled by having no xml:lang.

GézaK: That doesn't work for subparts that we want to mark as unknown.

Richard: xml:lang could be empty (""); it may mean "unknown" or "this is not a language", I'm not sure. ISO 639 has an "unknown" tag, I should check. In principle you can have "unknown" anyway.

Dan: Whichever way, if there is an existing mechanism, then we can use it.

GézaK: Worth mentioning in the spec, though.

Kazuyuki: The language specification can affect both SSML and PLS. Is it interesting to specify it separately in PLS, i.e. to select a PLS?

GézaK: Each PLS has one language, so there could be a selection according to the language specified in the SSML instance.

Richard: But lexicons are specified at the top of the SSML document, e.g. <speak xml:lang="en"> The French word for chat is <xml:lang="fr">chat</xml:lang> </speak>.

Paolo: You can have two lexicons after <speak>: english.pls and french.pls. The problem is that the engine has to load both to know the lexicon languages. [Dan reads from the SSML spec about lexicons.]

GézaK: If there isn't a French "chat" in the lexicon, then you should use the English "chat".

Dan: If you had en-GB and en-US in the lexicon, which one do I use if I get lang="en"? So the matching is not simple. We could specify the language, but the lower parts (region/dialect) are up to the processor.

Oumayma: In written text, foreign words can be in italics or surrounded by quotation marks. That provides a hint that the output may sound strange.

Dan: But not necessarily, e.g. "internet", "webcast", "iPod", etc. in German.

Richard: Coming back to the question...

Nixon: We don't have to be so specific: just a tag marking the text as "alien", and let the processor handle it. Going back to "unknown".

Przemysław: You may also find "chat" in two lexicons.

Max: If you have, in English, "tonight we're showing 'la vita è bella' and 'les quatre cents coups'", do you use different normalisation rules for each foreign-language part?

Przemysław: Yes.

Chris: I argue for the synthesizer finding the best way to pronounce a word. That is easy in Greek, because you can easily detect foreign words from the script.

Dan: Just like "foreign".

Richard: That doesn't apply to all languages.

Dan: On synthesizing: people want to avoid changing voice when they change language. You want a piece to be spoken in some different way. In what way do you want it to be spoken in a different voice (which could be the same speaker...)?

Géza: Three ways: 1. phonemes, 2. prosody, 3. accent. So far we haven't said anything about 2, and we haven't separated 1 and 2.

Dan: Sometimes you don't even want to pronounce film titles; for widely different languages you'd just translate.

Jerneja: Movie titles aren't the best examples: people's names are most important. You can't skip or translate them.
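A minimal sketch of the two-lexicon setup Paolo describes, assuming hypothetical lexicon URIs; the lexicon element and per-element xml:lang are standard SSML 1.0:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
      <!-- hypothetical URIs: one lexicon per language; the processor must
           load both to know which languages they cover -->
      <lexicon uri="http://www.example.com/english.pls"/>
      <lexicon uri="http://www.example.com/french.pls"/>
      The French word for cat is <voice xml:lang="fr">chat</voice>.
    </speak>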

 

(BREAK)

Session 6: Disambiguation of Multiple Pronunciation

Moderator:
Jim Larson → Paolo Baggia
Scribe:
Richard Ishida
Oumayma Aldakkak: Computational Methods to Vocalize Arabic Texts –[Slides1, Slides2, Slides3]

Topics include: vowel length identification, POS, type of emotion (mainly mentioned in attachments: SSML_A.dpf.pdf, final_Emotion.pdf)

Paolo: What does "incorporation of the vocalization module" mean? The module takes written language and adds vowels; SSML is unable to work on text fed from a news feed or the like.

Nixon: How do you handle syllables?

OD: Grapheme-to-phoneme conversion provides the vowels; then we convert to semi-syllables.

Nixon: What are the units you use?

OD: Diphones.

[Demo of the vocalizer, showing that there are multiple possibilities for a number of words/phrases. The choice among alternatives is made by an unsupervised learning algorithm.]
Jerneja Zganec Gros: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction –[Slides]

Topics include: pronunciation style, emotion, dialects, pron-source, POS

Paolo: Is Multext-East (language resources) an internationally known standard? We would prefer to use existing standards rather than create our own.

JZG: It is not an ISO standard, but it was developed with a fairly wide group of people.

Chris: Who is the intended user of PLS?

Paolo: The application developer. JZG has just mentioned use for internal development; this is a possible extension, but it would be a great deal of work.
Discussion: How should multiple pronunciations be identified and disambiguated in SSML? Is part-of-speech information useful enough? Should emotions be included in SSML?
Paolo: What can SSML (the interface to the TTS engine) do? Do we need special markup? [Paolo explains that SSML can be associated with other preprocessors, or the work can be done in the TTS. SSML could be used to pass messages to the TTS.]

Dan: You can't do anything inside the SSML document; it's a document. But you can say what changes need to be made by a processor.

OAD: Vocalization should be incorporated in the TTS. POS markup in SSML would help the vocalization.

GezaN: Can you add POS information to 'words'?

RI: No, Arabic words carry multiple morphs, and you need to mark these up to solve the problem.

Kimmo: Why would you mark up the text rather than just add the vowels?

Paolo: The idea is that you can add part of speech to help the vocalization.

OAD: Sometimes text in, say, a news feed is partially vocalized where ambiguity needs to be avoided.

Paolo: So SSML could include information that text is not vocalized, partially vocalized, or vocalized.

Kazuyuki: I think vocalization is not special processing but text analysis included in the TTS. Is the input of vocalization plain text?

OAD: Yes.

Kazuyuki: So vocalization is a module of the TTS.

Paolo: But you could also preprocess the SSML text.

Dan: The main reason for markup in SSML is to direct the processing that follows. So either you do the vocalization on your own, or you need some assistance from the author to help with the vocalization. The only thing that seems to have been mentioned so far is morphological markup. We also heard that there may be several morphs in a single word; that doesn't exist at the moment, and I cannot figure out how to make it happen without extensive work.

GezaN: That also applies to Hungarian, Slovene and other agglutinative languages. [Discussion about whether Arabic and the agglutinative languages are the same with respect to POS.]

OAD: I don't think we need to put markup in SSML for morphs.

JZG: But surely it can help?

Przemek: But why use markup if you could just add the vowels to the text? [Dan explains the distinction between automatically processing text such as news feeds and handcrafting text that will be reused often.]

Geza: We should conserve the original text and use markup to annotate it. It would be useful to provide markup for POS, because it is also useful for prosody and other things.

Chris: SSML is not an annotation tool. It won't be used for morphological annotation. If we add more and more information, where do we stop: why not just provide phonetic transcriptions?

Dan: SSML is a language of hints only; a processor can actually ignore almost everything specified. Almost every TTS vendor has a smart TTS processor and may not agree with what other people suggest. Since it is a hint language, we need to consider what kind of hint is useful, on the assumption that the TTS is pretty good most of the time. POS markup (incl. gender, case, etc.) was not implemented because of the difficulty of working out what the labels should be. Perhaps we can provide a mechanism for people to annotate the way they want. It's useful for people who don't know phonetic alphabets but do know the POS info, and another group that..... One of the best examples is numbers: a number needs to match in gender, case, etc. The TTS needs to respond and know whether you are talking about pencils or tables, and you could tag the number to say what the gender etc. is. Maybe in the future we will be able to standardise POS, but it's probably too early right now.
Nixon: Is it important to keep the original text when vocalizing? [Agreed to talk about that later.]

GezaN and Dan: Summary: There is some interest in trying to standardise at least a subset of the possible values. There is broad interest in enabling labelling.

JZG: There may be some ISO work going on in this area.

Paolo: We'd be very interested to find out about that, because we can't handle this ourselves.

Kazuyuki: JEITA people suggested that in Japanese TTS, POS is useful but is not used in the input to the TTS; Ruby is used instead.

Dan: It's still not clear to us whether ruby needs to be in SSML.
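A purely hypothetical sketch of the POS hint discussed above, using the classic Arabic كتب ambiguity; neither the token element nor a role attribute existed in SSML 1.0:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ar">
      <!-- hypothetical role hint: the same consonant skeleton كتب can be
           vocalized kataba ("he wrote") or kutub ("books") -->
      <token role="verb">كتب</token>
      <token role="noun">كتب</token>
    </speak>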

 

(LUNCH)

Session 7: Say-as Topics

Moderator:
Dan Burnett
Scribe:
Paolo Baggia
Chris Vosnidis: Position paper for the Second W3C Workshop on Internationalizing the Speech Synthesis Markup Language (SSML) –[Slides]

Topics include: inflection, format, details, alias maps

Discussion: How should the <say-as> tag be extended?
Motivations:
- Greek is a heavily inflected language: nouns, adjectives, verbs.
- There are several inflection-related issues.
- How does inflection work?
  - Inflection attributes are shared between certain elements in the same context.
  - Elements might not be neighbouring.

Inflection definition in say-as:
- Provide hints to the synthesis processor: which inflected version to use? For: case, number, gender. Example: the number 3.

We need context-sensitive substitutions:
- An aliasmap element with inflections.

Additional <say-as> points:
- There is no template for describing the way a token should be rendered; a new major version should address this.
- Two small changes to the existing say-as: use details for date, use format for phone.

Say-as telephone, with format:
  <say-as interpret-as="telephone" format="3223" details="30">2156664444</say-as>

Say-as date: details.
(Clarifications on the examples.)

Paolo: The aliasmap seems to be related to the lexicon.

Chris: Yes, but you would need to change the standard.

Max: The "k" is difficult for the lexicon, because you can expand it as "kilos" and also as "kilometers".

Paolo: All of them can stay in the lexicon; you would use the role to reference them, but in this example there is also the inflection. This is a problem for PLS.

Dan: Is it possible to derive the inflection from the sentence?

Chris: Not simple.

Dan: The way say-as was designed, it is more an "interpret-as". It is not about the rendering. There is not enough information.

Richard: If we're dealing with a news feed, this kind of markup will be used many times. Can it be done case by case?

Chris: Orthography for numbers is complex, and the numbers can be generated from a database. I don't want to do something as complex as pre-processing.

Richard: The issue is whether say-as is useful for doing this.

Przem: We don't know the number of e-mails.

Paolo: I agree this is difficult to do in the lexicon.

Dan: Why not use the lexicon?

Paolo: Because there is a number, "3 k.", and you cannot put all the numbers in the lexicon.

Dan: OK.

Max: If you go in that direction, the example of "read" for Russian will also expand into a very large number.

Paolo: You are right, but a general solution for highly inflected and generative languages is for a future version. We would like to discuss it.

Richard/Dan: [On the] relation of context and numbers in Japanese and Russian: you cannot pronounce it unless there is the number.

Oumayma: [Explanation of the variation of gender and case for numbers.]

Dan: Say-as is a tricky element for many reasons. SSML really does not need say-as, because it is a way to convey a semantic category. Everybody likes "date", but "cardinal" seemed to be minor for many languages. Say-as requires a separate effort, because it is too big.

Chris: Returning to the point of NMTOKEN vs. CDATA, to include values like details="blah?param1=xxx&".

Dan: We need to answer what we want to accomplish.

[Discussion on telephone number reading.] A new use case: when you say a phone number, you may not want to use the normal grouping for that country, for understandability. For example, for a Greek person a US number is read in a certain way.

[Discussion on Dan's questions.]

Ksenia: Introduce spelling?

Chris: It is already in the W3C Note, called "characters".

Paolo: The real issue is whether we want to restart this activity and whether there are enough people interested in it. The current situation is not clean for the standard.

Dan: I agree with Paolo.

Richard: The question I have is how I can do spelling in Arabic and Indian languages.

Oumayma: We have 28 letters, and we spell them. We say the name of the letter: "a", "b" (not /bi/). We do not pronounce short vowels in spelling. The spelling describes how it looks on paper.

Richard: What about Indian languages?

Prof. Joshi: Example: spell "W3C". No plural, but "doubleyou three ci". Description of phonemes.

Richard: E.g. spelling "kamra" (which means "room").

Joshi/Nixon: In Hindi the syllables are pronounced, but if people will not understand, the single phonemes will be pronounced.
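A speculative sketch of the inflection hints Chris proposes; the case, number, and gender attributes are hypothetical extensions, not part of the SSML 1.0 say-as element:

    <!-- hypothetical: request the Greek genitive feminine reading of "3" -->
    <say-as interpret-as="cardinal"
            case="genitive" number="singular" gender="feminine">3</say-as>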

 

(BREAK)

Session 8: Remaining Topics

Moderator:
Dan Burnett
Scribe:
Max Froumentin
Discussion: remaining topics and new topics arising during earlier discussions
Remaining topics, with count of interest:
* Specialized content (SMS, email): 11
  GezaN: The problem is that it would be very useful to be able to find out which characters the network supports; the synthesizer may generate differently depending on the character set. Also, SMS should be part of say-as, and there should be a link to the pronunciation lexicon.
  Przemysław: SMS centers cut diacritics.
  Max: If xml:lang="pl" and encoding="us-ascii" and style="sms", then infer the diacritics.
  GezaN shows an example: diacritics are removed, but sometimes only some of them.
* Speaking style: 8
  Ksenia: It is science fiction at this stage; too many parameters are needed.
  GezaK: We already have speaking-style="casual".
  Paolo: The values are difficult to define. An open list?
  Max: Is it generating the style voice, or the values?
  Paolo: You can have a specific database for a style. It is up to the processor whether it has a news style; maybe it changes the voice.
  Ksenia: I don't think that is up for standardisation.
  Paolo: It is a way of addressing the problem; it is possible to customise your TTS, but SSML doesn't let you specify it.
  Ksenia: I have never heard a TTS with a good speaking style.
  GezaN: They exist.
  Oumayma: For style synthesis, you add that later.
  Max: Language tags? RFC 3066?
  Richard: Style is not appropriate for a language tag.
  Raghunath: Speaking style is important. There is a parallel to handwriting style.
* Stress marking: 6
  Przemysław: The best-quality TTS has special signs for stress indicators. There is no way to do that right now except to use IPA.
  Dan: That is something you could use a lexicon for, and a phoneme. Would a different alphabet be a solution? In Chinese, pinyin allows specifying tone.
  Przemysław: One character would be enough, maybe two.
  Paolo: It is not SSML's job to standardise stress markup.
  Dan: Perhaps because many TTS engines stress differently, and they market that difference. It is not clear how you would want to standardize. IPA has primary and secondary stress (see the sketch after this list).
  Przemysław: But that is not practical.
  GezaK: You can use <sub>.
  Dan: I wonder whether the issue is standardising the marking of stress, or designing a language.
  Nixon: I agree that such features should be left to the application. It opens a can of worms how many ways there are to do it.
  Chris: There is IPA, and there are alternative alphabets to use instead of IPA.
  Oumayma: It is all a question of prosody: we can do everything with prosody.
* Stronger linkage to PLS: 5
  Jerneja: There are several interfaces where PLS should link to SSML, e.g. PoS, and maybe some speaking style (pronouncing a word casually or not).
* Emotions: 4
  Oumayma: For emotions we play on three prosodic features (amplitude, contour, ??). In SSML you could put a mark to have a phrase or a word pronounced happily or angrily, which will result in modifying its prosody.
  Nixon: Ellen Eide has a good paper on that.
  Dan+Paolo: She is involved and interested!
  Dan asks who has built a TTS engine and who is comfortable including emotions. Result: a little more than 50%.
* Making changes via orthography or markup: 3 (<sub> or other markup, or changing the original script)
* Metadata: 2
* African tones: 1
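For reference, IPA primary (ˈ) and secondary (ˌ) stress marks can already be carried through the SSML 1.0 phoneme element; the transcription below is a standard English example:

    <!-- ˌ marks secondary stress, ˈ marks primary stress -->
    <phoneme alphabet="ipa" ph="ˌfoʊtəˈɡræfɪk">photographic</phoneme>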

Session 9: Summary and conclusion

Moderator:
Jim Larson → Dan Burnett & Max Froumentin
Scribe:
Dan Burnett → Kazuyuki Ashimura
Discussion: What are the next steps?
Invite experts to the SSML subgroup of the Voice Browser Working Group to update requirements and develop specifications.

The VBWG needs experts on the issues clarified in this workshop, because the group only works on topics of group interest. So we would like to invite experts to the SSML subgroup of the Voice Browser Working Group to update requirements and develop specifications.

Q: What contribution do you need?
Max: If you are a non-member, you can read the public drafts and the public mailing list, and your comments are welcome; every comment will be addressed. If you are a W3C member and VBWG participant, you can take part in f2f meetings and telephone conferences, and can also subscribe to the internal mailing list. In addition, you can participate in all other WG activities, like the Semantic Web.
Paolo: There are two steps to participating in the SSML activity: (1) become a W3C member and (2) become a VBWG participant.
Max: When you become a VBWG participant, Patent Policy agreement is required.
Nemeth: I will check whether my organization is a member or not.
Paolo: Research institutes and universities should get a discount.
Max: Please visit the W3C web site, http://www.w3.org/Consortium/join , for the procedure. And please ask Max and Kaz about the details ;-)

Conduct workshop(s) for other languages.

Dan: In Beijing there were contributions from China, Japan, Korea and Poland. This time we have many other languages: Greek, Indian, Russian, Arabic, Hungarian, ... What and where next?
Ksenia: How about South Africa?
Max: Please join us and propose it ;-)
Ksenia: Especially localized French or English in South Africa.
Nixon: I would like to suggest India.
Richard: Another workshop (not held by the VBWG) will be held at C-DAC in India (the W3C India Office) in August.
Jerneja: There will be an HLT conference.
Dan: Support for a major language family like Slavic is needed.
Chris: Why not Turkey?
Dan: The Middle East is important.
Paolo: Two Turkish participants could not take part in this workshop because of schedule arrangements and deadline issues. We should have given participants more time...
Nixon: Are there any local activities? China and India have big markets; if there is a local chapter, more people are available. The WG might become big enough...
Max: In fact, this workshop was originally planned for Turkey, but it was moved to Greece because of bird flu.
Ksenia: How about Speecon in Russia?
Oumayma: Is individual membership available?
Max: People who should participate because they have knowledge and skill can join the WG as invited experts.
Dan: Thank you for your thoughtful suggestions and comments. If nothing else, we are adjourned.

[Workshop adjourned]


The Call for Participation, the Agenda, and the Logistics Information are available.


Jim Larson and Kazuyuki Ashimura, Workshop Co-chairs
Max Froumentin, Voice Activity Lead
