Galaxy Communicator Documentation:

MITRE Demo: Toy Travel "System"


To illustrate how to assemble an end-to-end system, we've constructed an example which operates according to a preset sequence of messages. Below, we describe the configuration of the servers, how to run them, and what they illustrate.


The configuration file

The configuration file included in the distribution looks like this:
(
( {c dialogue_output
     :frame {c greeting } }
  {c generator_output
     :output_string "Welcome to Communicator. How may I help you?" }
  {c synthesizer_output
     :sample_rate 8000
     :encoding_format "linear16"
     :num_samples 14520 } )

( {c audio_input
     :sample_rate 8000
     :encoding_format "linear16"
     :num_samples 16560 }
  {c text_input
     :input_string "I WANT TO FLY TO LOS ANGELES" }
  {c recognizer_output
     :input_string "I WANT TO FLY LOS ANGELES" }
  {c parser_output
     :frame {c flight
               :destination "LOS ANGELES" } }
  {c dialogue_output
     :frame {c query_departure } }
  {c generator_output
     :output_string "Where are you traveling from?" }
  {c synthesizer_output
     :sample_rate 8000
     :encoding_format "linear16"
     :num_samples 9560 } )

( {c audio_input
     :sample_rate 8000
     :encoding_format "linear16"
     :num_samples 4580 }
  {c text_input
     :input_string "BOSTON" }
  {c recognizer_output
     :input_string "BOSTON" }
  {c parser_output
     :frame {c flight
               :city "BOSTON" } }
  {c backend_query
     :sql_query "select airline, flight_number, departure_datetime from flight_table where departure_airport = 'BOS' and arrival_airport = 'LAX'" }
  {c backend_output
     :column_names ( "airline" "flight_number" "departure_datetime" )
     :nfound 2
     :values ( ( "AA" "115" "1144" )
               ( "UA" "436" "1405" ) ) }
  {c dialogue_output
     :frame {c db_result
               :column_names ( "airline" "flight_number" "departure_datetime" )
               :tuples ( ( "AA" "115" "1144" )
                         ( "UA" "436" "1405" ) ) } }
  {c generator_output
     :output_string "American Airlines flight 115 leaves at 11:44 AM, and United flight 436 leaves at 2:05 PM" }
  {c synthesizer_output
     :sample_rate 8000
     :encoding_format "linear16"
     :num_samples 35068 } )

)

The file is a list of lists. Each list corresponds to the consequences of a single user input (either "typed input" or "spoken input"). Each user input list is a sequence of frames. The name of each frame corresponds to a processing stage, and each processing stage has a set of key-value pairs which are recognized. Here are the steps and keys:
 
Step                                                   Frame name          Keys
Audio gesture from the Audio server to the Recognizer  audio_input         :sample_rate (integer), :encoding_format (string), :num_samples (integer)
Text gesture from the UI server to the Parser          text_input          :input_string (string)
From Recognizer to Parser                              recognizer_output   :input_string (string)
From Parser to Dialogue                                parser_output       :frame (frame)
Query from Dialogue to Backend                         backend_query       :sql_query (string)
Result of the query to the Backend                     backend_output      :column_names (list of strings), :nfound (integer), :values (list of lists of strings)
New message to the user, from Dialogue to Generator    dialogue_output     :frame (frame)
From Generator to Synthesizer or UI                    generator_output    :output_string (string)
From Synthesizer to Audio                              synthesizer_output  :sample_rate (integer), :encoding_format (string), :num_samples (integer)

Each list is treated as a single set of actions. Each server works by looking for an action set whose frame for the appropriate step matches the input the server has just received. If it finds such a set, and the set also contains a frame for the server's output step, it produces that output.
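
To make this matching concrete, here's a rough sketch of how an engine like component_engine.c might represent and search the action sets. The structures and names below are invented for illustration; they are not the actual contents of that file, and real stages match on different keys than the single string used here.

#include <stddef.h>
#include <string.h>

/* Hypothetical sketch only: the real structures live in
 * component_engine.c; the names here are invented. */

typedef struct StageFrame {
  const char *stage;          /* e.g. "recognizer_output" */
  const char *input_string;   /* stand-in match key for this stage */
  struct StageFrame *next;    /* next stage in the same action set */
} StageFrame;

typedef struct ActionSet {
  StageFrame *stages;         /* the frames for one user input */
  struct ActionSet *next;     /* next action set in the file */
} ActionSet;

static StageFrame *find_stage(ActionSet *set, const char *stage)
{
  StageFrame *s;
  for (s = set->stages; s != NULL; s = s->next)
    if (strcmp(s->stage, stage) == 0)
      return s;
  return NULL;
}

/* A server looks for the first action set whose in_stage frame
 * matches the input it just received; if that set also contains an
 * out_stage frame, that frame describes the output to produce. */
static StageFrame *find_output(ActionSet *sets, const char *in_stage,
                               const char *input, const char *out_stage)
{
  ActionSet *a;
  for (a = sets; a != NULL; a = a->next) {
    StageFrame *in = find_stage(a, in_stage);
    if (in != NULL && in->input_string != NULL
        && strcmp(in->input_string, input) == 0)
      return find_stage(a, out_stage);
  }
  return NULL;
}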


The servers

This demo consists of several sample servers. None of these servers actually recognizes speech, parses text, or tracks dialogue; they illustrate only the ways such servers might interact, and how such servers might be constructed (especially in the more complex cases). The servers are described in the subsections below. We've made a serious attempt to make the wrappers for these servers (Parser.c, Generator.c, etc.) into plausible wrappers for the appropriate functionality. The file component_engine.c contains the code which "implements" the appropriate functionality and defines a plausible API for it; this file also contains the code which digests the configuration file.

Parser, generator, database, dialogue

All these servers work fairly straightforwardly. The first three servers simply receive a frame and return a response. The Dialogue server both exercises server-to-server subdialogues using GalSS_EnvDispatchFrame() and issues the result as a new message to the Hub.
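
As a sketch of the Dialogue server's pattern (not the actual contents of Dialogue.c), a dispatch function might look roughly like the following. It assumes the 3.x environment API, in which the dispatch function's void * argument carries the call environment; the SQL string is a placeholder.

#include "galaxy/galaxy_all.h"

/* Sketch of the Dialogue pattern: run a server-to-server subdialogue
 * through the Hub's DBQuery program, then issue the result as a new
 * message. Simplified; see Dialogue.c for the real code. */
Gal_Frame DoDialogue(Gal_Frame f, void *server_data)
{
  /* In the 3.x bindings, server_data is the call environment. */
  GalSS_Environment *env = (GalSS_Environment *) server_data;
  GalIO_MsgType reply_type;
  Gal_Frame query, db_result, out;

  /* Build a DBQuery message; the Hub program of the same name
   * relays it to Backend.Retrieve. */
  query = Gal_MakeFrame("DBQuery", GAL_CLAUSE);
  Gal_SetProp(query, ":sql_query",
              Gal_StringObject("select ... from flight_table ..."));

  /* Blocks until the Backend's reply comes back: the subdialogue. */
  db_result = GalSS_EnvDispatchFrame(env, query, &reply_type);

  /* Issue the dialogue result as a new message; the Hub's
   * FromDialogue program picks it up. */
  out = Gal_MakeFrame("FromDialogue", GAL_CLAUSE);
  Gal_SetProp(out, ":output_frame", Gal_FrameObject(db_result));
  GalSS_EnvWriteFrame(env, out, 0);
  return (Gal_Frame) NULL;   /* no direct reply to this message */
}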

Recognizer, synthesizer

Both these servers use brokering to interact with the Audio server. They exchange 16-bit samples with it, consisting of random data whose length is dictated by the configuration file. These files might serve as useful templates for constructing brokered servers of this type. Notice in the recognizer that when audio data is brokered, whatever processing is to be performed on the data (such as running the recognizer and sending the results to the Hub) has to happen in a broker callback or a broker finalizer, which means that the server must contact the Hub with a new message when it's done.
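
Here's an approximate sketch of the recognizer side of this pattern. Broker callback signatures vary across releases, so treat this as an approximation and consult Recognizer.c and the brokering documentation for the exact calls.

#include "galaxy/galaxy_all.h"

/* Approximate sketch of the Recognizer's brokered-audio pattern.
 * The inbound connection itself is set up in the Recognize dispatch
 * function from the :host, :port and :call_id keys (e.g. via
 * GalSS_EnvBrokerDataInInit()); check your release for the exact
 * signatures. */

/* Data handler: called as brokered 16-bit samples arrive. We can
 * only accumulate here, since more audio may still be in flight. */
static void AudioHandler(GalIO_BrokerStruct *b, void *data,
                         Gal_ObjectType data_type, int n_samples)
{
  /* ...append n_samples 16-bit samples to a buffer... */
}

/* Finalizer: called when the broker connection is done. Only at
 * this point can we "run the recognizer", and since the original
 * dispatch has long since returned, the result must go to the Hub
 * as a new message. */
static void AudioDone(GalIO_BrokerStruct *b, void *caller_data)
{
  GalSS_Environment *env = (GalSS_Environment *) caller_data;
  Gal_Frame f = Gal_MakeFrame("FromRecognizer", GAL_CLAUSE);

  Gal_SetProp(f, ":input_string",
              Gal_StringObject("...recognition result..."));
  GalSS_EnvWriteFrame(env, f, 0);
}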

Audio, text in/text out

These servers are the most complex, because they must monitor some input source other than the Hub. We've used the MITRE stdin polling mechanism to stand in for the audio or GUI input source. When you start the server, it will connect to the Hub; each time you press the carriage return, it will send the next predetermined input; and so on until it runs out of inputs, at which point it will disconnect and shut down. You can restart the server and rerun these inputs as many times as you like; each time, a separate session will be created.
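
The demo itself uses the MITRE stdin polling utilities for this. Purely as a stand-in, the following sketch shows the underlying idea in portable C: poll stdin without blocking, so the server can keep servicing its connection to the Hub.

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/* Stand-in for the MITRE stdin polling mechanism: check stdin for
 * input without blocking the rest of the server. */
static int stdin_ready(void)
{
  fd_set fds;
  struct timeval tv = {0, 0};      /* poll; don't block */

  FD_ZERO(&fds);
  FD_SET(STDIN_FILENO, &fds);
  return select(STDIN_FILENO + 1, &fds, NULL, NULL, &tv) > 0;
}

/* Called periodically from the server's main loop. Each carriage
 * return triggers the next predetermined input; when the inputs run
 * out, the server disconnects and shuts down. */
static void poll_stdin(int *inputs_left)
{
  char line[256];

  if (stdin_ready() && fgets(line, sizeof line, stdin) != NULL) {
    if (*inputs_left > 0) {
      /* ...send the next predetermined input to the Hub... */
      (*inputs_left)--;
    } else {
      /* ...disconnect from the Hub and shut down... */
    }
  }
}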


The program file

The toy travel demo uses the following program file, which we have annotated extensively.
;; The value of LOG_VERSION: will be used in
;; the annotation rules.

LOG_VERSION: "toy travel, version 1"

;; Use extended syntax (new in version 3.0).

PGM_SYNTAX: extended

;; This means that the log directory hierarchy
;; will start in the directory where the Hub is run.

LOG_DIR: .

;; Both audio and UI will be HUB clients, and
;; they will share a port.

SERVICE_TYPE: Audio
CLIENT_PORT: 2800
OPERATIONS: Play

SERVICE_TYPE: UI
CLIENT_PORT: 2800
OPERATIONS: Print

SERVER: Parser
HOST: localhost
PORT: 10000
OPERATIONS: Parse

SERVER: Dialogue
HOST: localhost
PORT: 18500
OPERATIONS: DoDialogue DoGreeting

SERVER: Generator
HOST: localhost
PORT: 16000
OPERATIONS: Generate

SERVER: Backend
HOST: localhost
PORT: 13000
OPERATIONS: Retrieve

SERVER: Recognizer
HOST: localhost
PORT: 11000
OPERATIONS: Recognize

SERVER: Synthesizer
HOST: localhost
PORT: 15500
OPERATIONS: Synthesize

SERVER: IOMonitor
HOST: localhost
PORT: 10050
OPERATIONS: ReportIO

;; We use four crucial functions in the Builtin server.

SERVER: Builtin
OPERATIONS: new_session end_session call_program nop hub_break

;; For logging, I will timestamp everything. Since
;; I'm also logging all the relevant keys, I really
;; don't need to timestamp, since they'll be added
;; automatically, but it's harmless and good practice.

TIMESTAMP: Play Print Parse DoDialogue DoGreeting Generate Retrieve \
Recognize Synthesize new_session end_session call_program \
FromAudio OpenAudioSession OpenTextSession \
FromRecognizer FromUI UserInput \
FromDialogue DBQuery FromSynthesizer

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;;            AUDIO INPUT
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; This first program handles input from the audio
;; server. The :host and :port and :call_id are
;; typically the information required to establish
;; a brokering connection.

PROGRAM: FromAudio

RULE: :host & :port & :call_id --> Recognizer.Recognize
IN: :host :port :encoding_format :sample_rate :call_id
LOG_IN: :host :port :encoding_format :sample_rate :call_id
OUT: none!

;; This program handles opening
;; an audio session. It marks audio available for the session
;; by using a key whose prefix is :hub_session_.

PROGRAM: OpenAudioSession

;; Notice that we use the nop dispatch function to
;; "host" a call to OUT: to set a session variable.

RULE: --> Builtin.nop
OUT: ($in(:audio_available session) 1)

;; Now we create a session. This is not technically
;; necessary, since the audio server creates a
;; session by virtue of how it connects using the
;; listener-in-Hub functionality.

RULE: --> Builtin.new_session

;; Finally, I kick off the system greeting.

RULE: --> Dialogue.DoGreeting

;; This program dispatches recognizer results to
;; the general program which handles user input.

PROGRAM: FromRecognizer

RULE: :input_string --> Builtin.call_program
IN: (:program "UserInput") :input_string
OUT: none!

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;;            TEXT INPUT
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; This program handles opening
;; a text in/text out session. It creates a session
;; and kicks off the user greeting.

PROGRAM: OpenTextSession

RULE: --> Builtin.new_session

RULE: --> Dialogue.DoGreeting

;; This function relays the typed input to the
;; main body of the input processing.

PROGRAM: FromUI

RULE: :input_string --> Builtin.call_program
IN: (:program "UserInput") :input_string

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;;            MAIN INPUT BODY
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; This program handles the main input text processing.
;; It passes the result of the parsing to the dialogue
;; manager, and if there's a response, it relays it
;; to the output processing program.

PROGRAM: UserInput

RULE: :input_string --> IOMonitor.ReportIO
IN: (:utterance :input_string) (:who "user")
OUT: none!

RULE: :input_string --> Parser.Parse
IN: :input_string
OUT: :frame
LOG_IN: :input_string
LOG_OUT: :frame

;; We're not waiting for a reply, so the errors will be
;; signalled by new messages.

RULE: :frame --> Dialogue.DoDialogue
IN: :frame
LOG_IN: :frame
OUT: none!

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;;            MAIN OUTPUT BODY
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; This program handles the response from the dialogue
;; manager which may take one of a number of
;; forms (database tuples to be described, or perhaps
;; an already-formatted frame).

PROGRAM: FromDialogue

RULE: :output_frame --> Generator.Generate
IN: :output_frame
OUT: :output_string
LOG_IN: :output_frame
LOG_OUT: :output_string

RULE: :output_string --> IOMonitor.ReportIO
IN: (:utterance :output_string) (:who "system")
OUT: none!

;; At this point, we need to decide whether to respond
;; using audio or not. We condition this on our Hub
;; session variable.

RULE: $in(:audio_available session) & :output_string --> Synthesizer.Synthesize
IN: :output_string
LOG_IN: :output_string
OUT: none!

RULE: ! $in(:audio_available session) & :output_string --> UI.Print
IN: :output_string
LOG_IN: :output_string
OUT: none!

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;;            DB SUBQUERY
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; This program handles the server-to-server subdialogue
;; through which the dialogue manager queries the database.

PROGRAM: DBQuery

RULE: :sql_query --> Backend.Retrieve
IN: :sql_query
OUT: :column_names :nfound :values
LOG_IN: :sql_query
LOG_OUT: :column_names :nfound :values

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;;            AUDIO OUTPUT
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Finally, the synthesizer chooses to produce a new
;; message. We could also accomplish the same
;; result by returning a value from the synthesizer and
;; adding more rules to the FromDialogue program above.

PROGRAM: FromSynthesizer

RULE: :host & :port & :call_id --> Audio.Play
IN: :host :port :encoding_format :sample_rate :call_id :num_samples
LOG_IN: :host :port :encoding_format :sample_rate :call_id :num_samples


Running the demo

The process monitor configuration file contrib/MITRE/demos/toy-travel/toy-travel.config can be used to start up a process monitor in compressed mode (a single pane with buttons to select the different processes). This process monitor handles the Hub and creates additional process monitors for the UI elements. Start it up like this:
% cd contrib/MITRE/demos/toy-travel
% ../../tools/bin/process_monitor toy-travel.config -- example.frames toy-travel.pgm &
If you select "Process Control --> Restart all", all the servers will be started in order, followed by the Hub. Two additional process monitors will also be started, for the UI and Audio servers. Both these servers use the interaction paradigm described above for audio and text in/text out. Both servers can be running, and connected, at the same time; the sessions will be distinguished appropriately.

Please note that in the case of audio simulation, the audio data being passed around is an array of randomly generated bytes. The computer's audio device is not involved, and you will not hear any audio played at any point.


Examining the logs

As a final step, we've also included an annotation rules file which supports the log analysis tools, so you can examine the results using those tools. You should also consult the logging documentation to learn more about how logs are created and where they're stored.
% cd contrib/MITRE/demos/toy-travel
% ../../tools/bin/process_monitor toy-travel.config -- example.frames toy-travel.pgm &
[...after running the example...]
% ../../tools/bin/xml_summarize --rule_base annotation_rules.xml sls/[datedir]/[sessionnum]
Checking: sls/20000804/018
Reading raw Hub log: sls/20000804/018/sls-20000804-018-hublog.txt
...read.
Converting to XML: sls/20000804/018/sls-20000804-018-hublog.txt
...converted.
Reading rule file: annotation_rules.xml
...read.
Applying rules: annotation_rules.xml
...succeeded.
Resegmenting...
...resegmented.
Fri Aug 4 2000 at 21:40:21.89: Task-specific portion and overall task ended.

Fri Aug 4 2000 at 21:40:16.54: New system turn began.

Fri Aug 4 2000 at 21:40:16.56 to Fri Aug 4 2000 at 21:40:16.58: System started speaking.
Fri Aug 4 2000 at 21:40:16.56 to Fri Aug 4 2000 at 21:40:16.58: System finished speaking.
System said: Welcome to Communicator. How may I help you?
 

Fri Aug 4 2000 at 21:40:17.78: New user turn began.

Fri Aug 4 2000 at 21:40:17.78: User started speaking.
Fri Aug 4 2000 at 21:40:17.78: User finished speaking.
Recognizer heard: I WANT TO FLY LOS ANGELES
 

Fri Aug 4 2000 at 21:40:17.93: New system turn began.

Fri Aug 4 2000 at 21:40:17.95 to Fri Aug 4 2000 at 21:40:17.96: System started speaking.
Fri Aug 4 2000 at 21:40:17.95 to Fri Aug 4 2000 at 21:40:17.96: System finished speaking.
System said: Where are you traveling from?
 

Fri Aug 4 2000 at 21:40:19.02: New user turn began.

Fri Aug 4 2000 at 21:40:19.02: User started speaking.
Fri Aug 4 2000 at 21:40:19.02: User finished speaking.
Recognizer heard: BOSTON
 

Fri Aug 4 2000 at 21:40:19.19: New system turn began.

Fri Aug 4 2000 at 21:40:19.23 to Fri Aug 4 2000 at 21:40:19.25: System started speaking.
Fri Aug 4 2000 at 21:40:19.23 to Fri Aug 4 2000 at 21:40:19.25: System finished speaking.
System said: American Airlines flight 115 leaves at 11:44 AM, and United flight 436 leaves at 2:05 PM

The timestamps for the start and end of speech are identical because our audio server doesn't send separate notifications for audio start and end. It also doesn't log the audio or report the logfile back to the Hub. A true audio server would do both of these things.
Last updated September 24, 2001