Microsoft Speech API

Microsoft Speech API: This article is about the Speech API. For other uses, see SAPI (disambiguation).

The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK, or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.

In general all versions of the API have been designed such that a software developer can write an application to perform speech recognition and synthesis by using a standard set of interfaces, accessible from a variety of programming languages. In addition, it is possible for a 3rd-party company to produce their own Speech Recognition and Text-To-Speech engines or adapt existing engines to work with SAPI. In principle, as long as these engines conform to the defined interfaces they can be used instead of the Microsoft-supplied engines.

In general the Speech API is a freely-redistributable component which can be shipped with any Windows application that wishes to use speech technology. Many versions (although not all) of the speech recognition and synthesis engines are also freely redistributable.

There have been two main 'families' of the Microsoft Speech API. SAPI versions 1 through 4 are all similar to each other, with extra features in each newer version. SAPI 5 however was a completely new interface, released in 2000. Since then several sub-versions of this API have been released.

Contents

1 Basic architecture

2 Versions

2.1 SAPI 1-4 API family

2.1.1 SAPI 1

2.1.2 SAPI 3

2.1.3 SAPI 4

2.2 SAPI 5 API family

2.2.1 SAPI 5.0

2.2.2 SAPI 5.1

2.2.3 SAPI 5.2

2.2.4 SAPI 5.3

2.2.5 SAPI 5.4

2.3 SAPI 5 Voices

2.4 Managed code Speech API

3 Speech functionality in Windows Vista

4 Compatibility

4.1 SAPI 5

4.2 SAPI 4

5 Major applications using SAPI

6 Libraries using SAPI output

7 See also

8 External links

9 References

Basic architecture

Broadly the Speech API can be viewed as an interface or piece of middleware which sits between applications and speech engines (recognition and synthesis). In SAPI versions 1 to 4, applications could directly communicate with engines. The API included an abstract interface definition which applications and engines conformed to. Applications could also use simplified higher-level objects rather than directly call methods on the engines.

In SAPI 5 however, applications and engines do not directly communicate with each other. Instead each talk to a runtime component (sapi.dll). There is an API implemented by this component which applications use, and another set of interfaces for engines.

Typically in SAPI 5 applications issue calls through the API (for example to load a recognition grammar; start recognition; or provide text to be synthesized). The sapi.dll runtime component interprets these commands and processes them, where necessary calling on the engine through the engine interfaces (for example, the loading of a grammar from a file is done in the runtime, but then the grammar data is passed to the recognition engine to actually use in recognition). The recognition and synthesis engines also generate events while processing (for example, to indicate an utterance has been recognized or to indicate word boundaries in the synthesized speech). These pass in the reverse direction, from the engines, through the runtime dll, and on to an event sink in the application.

In addition to the actual API definition and runtime dll, other components are shipped with all versions of SAPI to make a complete Speech Software Development Kit. The following components are among those included in most versions of the Speech SDK:

API definition files - in MIDL and as C or C++ header files.

Runtime components - e.g. sapi.dll.

Control Panel applet - to select and configure default speech recognizer and synthesizer.

Text-To-Speech engines in multiple languages.

Speech Recognition engines in multiple languages.

Redistributable components to allow developers to package the engines and runtime with their application code to produce a single installable application.

Sample application code.

Sample engines - implementations of the necessary engine interfaces but with no true speech processing which could be used as a sample for those porting an engine to SAPI.

Documentation.

Versions

Xuedong Huang was a key person who led Microsoft's early SAPI efforts.

SAPI 1-4 API family

SAPI 1

The first version of SAPI was released in 1995, and was supported on Windows 95 and Windows NT 3.51. This version included low-level Direct Speech Recognition and Direct Text To Speech APIs which applications could use to directly control engines, as well as simplified 'higher-level' Voice Command and Voice Talk APIs.

SAPI 3

SAPI 3.0 was released in 1997. It added limited support for dictation speech recognition (discrete speech, not continuous), and additional sample applications and audio sources.

SAPI 4

SAPI 4.0 was released in 1998. This version of SAPI included both the core COM API; together with C++ wrapper classes to make programming from C++ easier; and ActiveX controls to allow drag-and-drop Visual Basic development. This was shipped as part of an SDK that included recognition and synthesis engines. It also shipped (with synthesis engines only) in Windows 2000.

The main components of the SAPI 4 API (which were all available in C++, COM, and ActiveX flavors) were:

Voice Command - high-level objects for command & control speech recognition

Voice Dictation - high-level objects for continuous dictation speech recognition

Voice Talk - high-level objects for speech synthesis

Voice Telephony - objects for writing telephone speech applications

Direct Speech Recognition - objects for direct control of recognition engine

Direct Text To Speech - objects for direct control of synthesis engine

Audio objects - for reading to and from an audio device or file

SAPI 5 API family

The Speech SDK version 5.0, incorporating the SAPI 5.0 runtime was released in 2000. This was a complete redesign from previous versions and neither engines nor applications which used older versions of SAPI could use the new version without considerable modification.

The design of the new API included the concept of strictly separating the application and engine so all calls were routed through the runtime sapi.dll. This change was intended to make the API more 'engine-independent', preventing applications from inadvertently depending on features of a specific engine. In addition this change was aimed at making it much easier to incorporate speech technology into an application by moving some management and initialization code into the runtime.

The new API was initially a pure COM API and could be used easily only from C/C++. Support for VB and scripting languages were added later. Operating systems from Windows 98 and NT 4.0 upwards were supported.

Major features of the API include:

Shared Recognizer. For desktop speech recognition applications, a recognizer object can be used that runs in a separate process (sapisvr.exe). All applications using the shared recognizer communicate with this single instance. This allows sharing of resources, removes contention for the microphone and allows for a global UI for control of all speech applications.

In-proc recognizer. For applications that require explicit control of the recognition process the in-proc recognizer object can be used instead of the shared one.

Grammar objects. Speech grammars are used to specify the words that the recognizer is listening for. SAPI 5 defines an XML markup for specifying a grammar, as well as mechanisms to create them dynamically in code. Methods also exist for instructing the recognizer to load a built-in dictation language model.

Voice object. This performs speech synthesis, producing an audio stream from text. A markup language (similar to XML, but not strictly XML) can be used for controlling the synthesis process.

Audio interfaces. The runtime includes objects for performing speech input from the microphone or speech output to speakers (or any sound device); as well as to and from wave files. It is also possible to write a custom audio object to stream audio to or from a non-standard location.

User lexicon object. This allows custom words and pronunciations to be added by a user or application. These are added to the recognition or synthesis engine's built-in lexicons.

Object tokens. This is a concept allowing recognition and TTS engines, audio objects, lexicons and other categories of object to be registered, enumerated and instantiated in a common way.

SAPI 5.0

This version shipped in late 2000 as part of the Speech SDK version 5.0, together with version 5.0 recognition and synthesis engines. The recognition engines supported continuous dictation and command & control and were released in U.S. English, Japanese and Simplified Chinese versions. In the U.S. English system, special acoustic models were available for children's speech and telephony speech. The synthesis engine was available in English and Chinese. This version of the API and recognition engines also shipped in Microsoft Office XP in 2001.

SAPI 5.1

This version shipped in late 2001 as part of the Speech SDK version 5.1. Automation-compliant interfaces were added to the API to allow use from Visual Basic, scripting languages such as JScript, and managed code. This version of the API and TTS engines was shipped in Windows XP. Windows XP Tablet PC Edition and Office 2003 also include this version, but with a substantially improved version 6 recognition engine and Traditional Chinese.

SAPI 5.2

This was a special version of the API for use only in the Microsoft Speech Server which shipped in 2004. It added support for SRGS and SSML mark-up languages, as well as additional server features and performance improvements. The Speech Server also shipped with the version 6 desktop recognition engine and the version 7 server recognition engine.

SAPI 5.3

This is the version of the API that ships in Windows Vista together with new recognition and synthesis engines. As Windows Speech Recognition is now integrated into the operating system, the Speech SDK and APIs are a part of the Windows SDK. SAPI 5.3 includes the following new features:

Support for W3C XML speech grammars for recognition and synthesis. The Speech Synthesis Markup Language (SSML) version 1.0 provides the ability to mark up voice characteristics, speed, volume, pitch, emphasis, and pronunciation.

The Speech Recognition Grammar Specification (SRGS) supports the definition of context-free grammars, with two limitations:

It does not support the use of SRGS to specify dual-tone modulated-frequency (touch-tone) grammars.

It does not support Augmented Backus–Naur form (ABNF).

Support for semantic interpretation script within grammars. SAPI 5.3 enables an SRGS grammar to be annotated with JavaScript for semantic interpretation to supplement the recognized text.

User-Specified shortcuts in lexicons, which is the ability to add a string to the lexicon and associate it with a shortcut word. When dictating, the user can say the shortcut word and the recognizer will return the expanded string.

Additional functionality and ease-of-programming provided by new types.

Performance improvements, improved reliability and security.

Version 8 of the speech recognition engine ("Microsoft Speech Recognizer")

SAPI 5.4

This is an updated version of the API that ships in Windows 7.

SAPI 5 Voices

Main article: Microsoft text-to-speech voices

Microsoft Sam (Speech Articulation Module) is a commonly-shipped SAPI 5 voice. In addition, Microsoft Office XP and Office 2003 installed L&H Michael and Michelle voices. The SAPI 5.1 SDK installs 2 more voices, Mike and Mary. Windows Vista includes Microsoft Anna which replaces Microsoft Sam. Anna is designed to sound more natural and offer greater intelligibility. The Chinese version of Windows Vista and later Windows client versions also include a female voice named Microsoft Lili. Microsoft Anna is also installed on Windows XP by Microsoft Streets & Trips 2006 and later versions.

Managed code Speech API

A managed code API ships as part of the .NET Framework 3.0^[1]. It has similar functionality to SAPI 5 but is more suitable to be used by managed code applications. The new API is available on Windows XP, Windows Server 2003, Windows Vista, and Windows Server 2008.

The existing SAPI 5 API can also be used from managed code to a limited extent by creating COM Interop code (helper code designed to assist in accessing COM interfaces and classes). This works well in some scenarios however the new API should provide a more seamless experience equivalent to using any other managed code library.

Speech functionality in Windows Vista

See also: Windows Speech Recognition

Windows Vista includes a number of new speech-related features including:

Speech control of the full Windows GUI and applications

New tutorial, microphone wizard, and UI for controlling speech recognition

New version of the Speech API runtime: SAPI 5.3

Built-in updated Speech Recognition engine (Version 8)

New Speech Synthesis engine and SAPI voice Microsoft Anna

Managed code speech API (codenamed SpeechFX)

Speech recognition support for 8 languages at release time: U.S. English, U.K. English, traditional Chinese, simplified Chinese, Japanese, German, French and Spanish, with more language to be released later.

Microsoft Agent most notably, and all other Microsoft speech applications use SAPI 5.

Compatibility

The Speech API is compatible with the following operating systems: ^[2]

SAPI 5

Microsoft Windows 7

Microsoft Windows Vista

Microsoft Windows 2003

Microsoft Windows XP

Microsoft Windows 2000

SAPI 4

Microsoft Windows Millennium Edition

Microsoft Windows 98

Microsoft Windows NT 4.0, Service Pack 6a, in English, Japanese and Simplified Chinese.

Major applications using SAPI

Microsoft Windows XP Tablet PC Edition includes SAPI 5.1 and speech recognition engines 6.1 for English, Japanese, and Chinese (simplified and traditional)

Windows Speech Recognition in Windows Vista

Microsoft Narrator in Windows 2000 and later Windows operating systems

Microsoft Office XP and Office 2003

Microsoft Excel 2002, Microsoft Excel 2003, and Microsoft Excel 2007 for speaking spreadsheet data

Microsoft Voice Command for Windows Pocket PC and Windows Mobile

Microsoft Plus! Voice Command for Windows Media Player

Dragon NaturallySpeaking general-purpose speech recognition application

Adobe Reader uses voice output to read document content

CoolSpeech, text-to-speech application that reads text aloud from a variety of sources

Window-Eyes screen reader

JAWS screen reader

NVDA open-source screen reader http://www.nvda-project.org/

Klango player, client software for the Klango.net social network http://Klango.net

Multi Crew Experience, voice control for Microsoft Flight Simulator http://www.multicrewxp.com

Libraries using SAPI output

FastFormat, via its speech_sink

Pantheios, via its be.speech back-end

See also

List of speech recognition software

Microsoft Speech Application SDK (SASDK)

External links

Microsoft site for SAPI

Microsoft download site for Speech API Software Developers Kit version 5.1

Microsoft Systems Journal Whitepaper by Mike Rozak on the first version of SAPI

Microsoft Speech Team blog

Multi Crew Experience, voice control for Flight Simulator

References

^ Michael Dunn. "Speech synthesis and recognition in .NET - Give applications a voice". Redmond Developer News. http://reddevnews.com/articles/2007/02/15/give-applications-a-voice.aspx?sc_lang=en. Retrieved 2011-11-09.

^ Microsoft Corporation. "SAPI System Requirements". MSDN. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/SAPI51sr/html/system_requirements.asp. Retrieved 2006-04-12.

v · d · eMicrosoft APIs and frameworks

Graphics
Desktop Window Manager · Direct2D · Direct3D (extensions) · GDI / GDI+ · WPF · Windows Color System · Windows Image Acquisition · Windows Imaging Component

Audio
DirectMusic · DirectSound · DirectX plugin · XACT · Speech API · XAudio2

Multimedia
DirectX (Media Objects · Video Acceleration) · DirectInput · DirectShow · Image Mastering API · Managed DirectX · Media Foundation · XNA · Windows Media · Video for Windows

Web
MSHTML · RSS Platform · JScript · VBScript · BHO · XDR · SideBar Gadgets

Data access
Data Access Components · Extensible Storage Engine · ADO.NET · ADO.NET Entity Framework · Sync Framework · Jet Engine · MSXML · OLE DB · OPC

Networking
Winsock (LSP) · Winsock Kernel · Filtering Platform · Network Driver Interface Specification · Windows Rally · BITS · P2P API · MSMQ · MS MPI · DirectPlay

Communication
Messaging API · Telephony API · WCF

Administration and
management

Win32 console · Windows Script Host · WMI (extensions) · PowerShell · Task Scheduler · Offline Files · Shadow Copy · Windows Installer · Error Reporting · Event Log · Common Log File System

Component model
COM · COM+ · ActiveX · Distributed Component Object Model · .NET Framework

Libraries
Base Class Library (BCL) · Microsoft Foundation Classes (MFC) · Active Template Library (ATL) · Windows Template Library (WTL)

Device drivers
Windows Driver Model · Windows Driver Foundation (KMDF · UMDF) · WDDM · NDIS · UAA · Broadcast Driver Architecture · VxD

Security
Crypto API (CAPICOM) · Windows CardSpace · Data Protection API · Security Support Provider Interface (SSPI)

.NET
ASP.NET · ADO.NET · Base Class Library (BCL) · Remoting · Silverlight · TPL · WCF · WCS · WPF · WF

Software factories
EFx Factory · Enterprise Library · Composite UI · CCF · CSF

IPC
MSRPC · Dynamic Data Exchange (DDE) · Remoting · WCF

Accessibility
Active Accessibility · UI Automation

Text and multilingual
support

DirectWrite · Text Services Framework · Text Object Model · Input method editor · Language Interface Pack · Multilingual User Interface · Uniscribe

v · d · eSpeech synthesis

Proprietary software
BrowseAloud · CereProc · DECtalk · IVONA · Microsoft Agent · Microsoft Speech API · Microsoft text-to-speech voices · Readspeaker · Talk It! · Voice browser · Vocaloid · Cantor · Voiceroid

Free software
eSpeak · Gnuspeech · Festival Speech Synthesis System · FreeTTS

Machine
Echo 2 · Pattern playback · Phasor · RIAS · Texas Instruments LPC Speech Chips · TuVox

Applications
AOLbyPhone · DialogOS · Dr. Sbaitso · MBROLA · Microsoft Narrator · Microsoft Speech Server · PlainTalk · Voice font

Protocols
Speech Synthesis Markup Language

Developers/Researchers
Catherine Browman · Franklin Seaney Cooper · Gunnar Fant · Haskins Laboratories · Wolfgang von Kempelen · Ignatius Mattingly · Philip Rubin · VoiceWeb · VoiceXML · Yamaha

Process
Articulatory synthesis · Concatenative synthesis · Currah · Inverse filter · PSOLA · Phase vocoder · SABLE · Self-voicing

Categories:
Microsoft application programming interfaces
Voice technology
Speech synthesis

Игры ⚽ Нужен реферат?

Look at other dictionaries:

Microsoft Speech API — Speech Application Programming Interface (SAPI) интерфейс программирования приложений, основанный на технологии COM, предназначенный для распознавания и синтеза речи. Распознавание речи Распознавание речи процесс преобразования произнесённых слов … Википедия
Microsoft Speech Server — The Microsoft Speech Server is a product from Microsoft designed to allow the authoring and deployment of IVR applications incorporating Speech Recognition, Speech Synthesis and DTMF. The first version of the server was released in 2004 as… … Wikipedia
Microsoft Telephony API — TAPI (англ. Telephony Application Programming Interface интерфейс программирования приложений для телефонии) позволяет подключать ПК, работающие под управлением Windows, к системам передачи голосовой информации офисным… … Википедия
Microsoft Agent — Microsoft provides examples on its website for the use of Agent. Microsoft Agent is a technology developed by Microsoft which employs animated characters, text to speech engines, and speech recognition software to enhance interaction with… … Wikipedia
Speech Synthesis Markup Language — (SSML) (Язык Разметки Синтеза Речи) представляет собой основанный на XML язык разметки для приложений синтеза речи[1]. Он был рекомендован рабочей группой W3C[2]. SSML часто встраивается в сценарии VoiceXML для интерактивных систем телефонии[3].… … Википедия
Microsoft Narrator — A component of Microsoft Windows Screenshot of Microsoft Narrator in … Wikipedia
Speech Application Programming Interface — The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date a number of versions of the API have been released, which have… … Wikipedia
Microsoft text-to-speech voices — The Microsoft text to speech voices are speech synthesizers provided for use with applications that use the Microsoft Speech API (SAPI). Microsoft Sam is the default text to speech male voice in Microsoft Windows 2000 and Windows XP. It is used… … Wikipedia
Speech synthesis — Stephen Hawking is one of the most famous people using speech synthesis to communicate Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented… … Wikipedia
Microsoft UI Automation — (UIA) is an Application Programming Interface (API) for User Interface (UI) accessibility that is designed to help Assistive Technology (AT) products interact with standard and custom UI elements of an application (or the operating system) as… … Wikipedia

Academic Dictionaries and Encyclopedias

Microsoft Speech API

Contents

Basic architecture

Versions

SAPI 1-4 API family

SAPI 1

SAPI 3

SAPI 4

SAPI 5 API family

SAPI 5.0

SAPI 5.1

SAPI 5.2

SAPI 5.3

SAPI 5.4

SAPI 5 Voices

Managed code Speech API

Speech functionality in Windows Vista

Compatibility

SAPI 5

SAPI 4

Major applications using SAPI

Libraries using SAPI output

See also

External links

References

Look at other dictionaries:

Share the article and excerpts

Academic Dictionaries and Encyclopedias

Wikipedia

Microsoft Speech API

Contents

Basic architecture

Versions

SAPI 1-4 API family

SAPI 1

SAPI 3

SAPI 4

SAPI 5 API family

SAPI 5.0

SAPI 5.1

SAPI 5.2

SAPI 5.3

SAPI 5.4

SAPI 5 Voices

Managed code Speech API

Speech functionality in Windows Vista

Compatibility

SAPI 5

SAPI 4

Major applications using SAPI

Libraries using SAPI output

See also

External links

References

Look at other dictionaries:

Share the article and excerpts

Direct link