
In the previous post, we started an overview of the Machine Learning and Artificial Intelligence services in AWS, including Amazon SageMaker and Amazon Rekognition. In this one we will take a look at Amazon Polly, Amazon Translate, Amazon Transcribe, Amazon Comprehend, and Amazon Textract.

Amazon Polly

Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk and build entirely new categories of speech-enabled products. Polly's Text-to-Speech (TTS) service uses advanced deep learning technologies to synthesize natural-sounding human speech. Amazon Polly supports 31 languages and 9 different voices (the selection may vary by language).

In addition to Standard TTS voices, Amazon Polly offers Neural Text-to-Speech (NTTS) voices that deliver advanced improvements in speech quality through a new machine learning approach. Polly’s Neural TTS technology also supports a Newscaster speaking style that is tailored to news narration use cases.

There are several output file formats available, such as MP3, OGG, PCM, and Speech Marks, with different sample rates (8000 Hz, 16000 Hz, 22050 Hz, 24000 Hz).
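As a minimal boto3 sketch of a synchronous synthesis request (the voice, engine, and output settings below are illustrative assumptions, not requirements):

import boto3

polly_client = boto3.client('polly')

# Request MP3 audio at 24000 Hz; 'Joanna' and the neural engine are example choices.
response = polly_client.synthesize_speech(
    Engine='neural',
    LanguageCode='en-US',
    OutputFormat='mp3',
    SampleRate='24000',
    Text='Hello from Amazon Polly!',
    TextType='text',
    VoiceId='Joanna'
)

# The audio comes back as a stream that can be written to a file.
with open('hello.mp3', 'wb') as audio_file:
    audio_file.write(response['AudioStream'].read())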

The Amazon Polly web console contains just a couple of small tabs where you can try the service.


You can also use Amazon Polly to generate speech from documents marked up with Speech Synthesis Markup Language (SSML). Using SSML-enhanced text gives you additional control over how Amazon Polly generates speech from the text you provide.

For example, you can include a long pause within your text, or change the speech rate or pitch (example below).

<speak>
     Mary had a little lamb <break time="2s"/>Whose fleece was white as snow.
</speak>

Other options include:

  • using phonetic pronunciation

  • using the Newscaster speaking style

  • including breathing sounds

  • emphasizing specific words or phrases (example below)

<speak>
     I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>
  • whispering (example below)

<speak>
     When any voice is made to whisper, <amazon:effect name="whispered">
<prosody rate="-10%">the sound is slower and quieter than normal speech
</prosody></amazon:effect>
</speak>

You can also customize the pronunciation of specific words and phrases by uploading lexicon files in the PLS format.
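As a rough sketch (the lexicon name and content below are made up for illustration), a PLS lexicon can be uploaded with put_lexicon and then referenced during synthesis:

import boto3

polly_client = boto3.client('polly')

# A tiny PLS lexicon that expands an abbreviation; purely illustrative content.
pls_content = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>"""

polly_client.put_lexicon(Name='demoLexicon', Content=pls_content)

# Reference the lexicon by name when synthesizing speech.
response = polly_client.synthesize_speech(
    Text='The W3C maintains web standards.',
    OutputFormat='mp3',
    VoiceId='Joanna',
    LexiconNames=['demoLexicon']
)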


You can try Amazon Polly within the Free Tier. The Free Tier includes 5 million characters per month for speech or Speech Marks requests for the first 12 months, starting from your first request for speech.

After the first year, Amazon Polly’s Standard voices are priced at $4.00 per 1 million characters and Neural voices at $16.00 per 1 million characters for speech or Speech Marks requests.

Amazon Transcribe

Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text.

Amazon Transcribe’s features allow you to ingest audio input, produce easy-to-read transcripts, improve accuracy with language customization, and filter content to ensure customer privacy. Practical use cases for Amazon Transcribe include transcribing and analyzing customer-agent calls and creating closed captions for videos.

With Amazon Transcribe, you can add speech-to-text capabilities to any application.

Amazon Transcribe allows you to perform real-time transcription, submit transcription jobs, and train custom language models for audio that is specific to your use case. The transcription accuracy of a custom language model can be better than that of the general model. You can also create a custom vocabulary that is a collection of words or phrases that improves the transcription accuracy of special terms. These terms are generally domain-specific. You can create a vocabulary filter from a text file containing a list of words that are profane, offensive, or otherwise undesirable to show to the readers of your transcripts. You can use this filter to mask or remove words from the results in your transcription job. You can mask, remove, or tag words in your real-time streams.
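For example, a vocabulary filter can be created from a word list and then referenced when starting a batch transcription job. This is only a sketch; the filter name, job name, words, and S3 URI below are hypothetical.

import boto3

transcribe_client = boto3.client('transcribe')

# Create a vocabulary filter from an explicit word list (hypothetical words).
transcribe_client.create_vocabulary_filter(
    VocabularyFilterName='demo-unwanted-words',
    LanguageCode='en-US',
    Words=['confidential', 'internal']
)

# Reference the filter in a batch job and mask any matching words in the transcript.
transcribe_client.start_transcription_job(
    TranscriptionJobName='demo-filtered-job',
    LanguageCode='en-US',
    Media={'MediaFileUri': 's3://demo-bucket/audio/sample.mp3'},
    Settings={
        'VocabularyFilterName': 'demo-unwanted-words',
        'VocabularyFilterMethod': 'mask'  # 'mask', 'remove' or 'tag'
    }
)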

There are also two specialized sub-services, Call Analytics and Amazon Transcribe Medical, that may be useful for specific companies.


Amazon Transcribe supports 12 languages, including English, Chinese, French, German, Italian, Spanish, Japanese, and Korean. It can also identify or redact one or more types of personally identifiable information (PII) in your transcript, as sketched below.
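PII redaction in a batch job is enabled with the ContentRedaction parameter; the sketch below assumes a hypothetical bucket and job name.

import boto3

transcribe_client = boto3.client('transcribe')

# Produce a transcript with PII replaced by [PII] tags.
transcribe_client.start_transcription_job(
    TranscriptionJobName='demo-redacted-job',
    LanguageCode='en-US',
    Media={'MediaFileUri': 's3://demo-bucket/audio/call.mp3'},
    ContentRedaction={
        'RedactionType': 'PII',
        'RedactionOutput': 'redacted'  # or 'redacted_and_unredacted'
    }
)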

With Amazon Transcribe, you pay as you go based on the seconds of audio transcribed per month. It’s easy to get started with the Amazon Transcribe Free Tier: upon signup, you can transcribe up to 60 audio minutes monthly, free for the first 12 months. After 12 months, pricing depends on the functionality you use, the volume of data, and the AWS Region. For example, standard batch transcription costs $0.024 per minute for the first 250,000 minutes in N. Virginia.

Amazon Textract

Amazon Textract is a service that automatically detects and extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.

How Textract works:


Amazon Textract currently supports PNG, JPEG, TIFF, and PDF formats.

It can detect raw text and tables, and it works well with receipts and invoices.
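A minimal sketch of the synchronous Textract calls is below; the bucket and object names are hypothetical.

import boto3

textract_client = boto3.client('textract')

# Hypothetical input document stored in S3.
document = {'S3Object': {'Bucket': 'demo-bucket', 'Name': 'scans/invoice.png'}}

# Raw text: the response contains blocks; LINE blocks hold the detected lines of text.
text_response = textract_client.detect_document_text(Document=document)
lines = [block['Text'] for block in text_response['Blocks'] if block['BlockType'] == 'LINE']

# Tables and form key-value pairs.
analysis_response = textract_client.analyze_document(
    Document=document,
    FeatureTypes=['TABLES', 'FORMS']
)

# Receipts and invoices (vendor, totals, line items, etc.).
expense_response = textract_client.analyze_expense(Document=document)

# For multi-page PDFs, the asynchronous operations
# (start_document_text_detection / start_document_analysis) are used instead.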

You can get started for free with the AWS Free Tier. For the first three months after account sign-up, new customers can analyze up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API. After three months, the Detect Document Text API costs $0.0015 per page for the first 1 million pages and $0.0006 per page above 1 million pages.

Amazon Comprehend

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend processes any text file in UTF-8 format, and semi-structured documents, like PDF and Word documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document.


Amazon Comprehend allows you to perform real-time analysis, submit analysis jobs, create custom classifiers, and use Amazon Comprehend Medical for the medical field.


Some of the insights that Amazon Comprehend develops about a document include the following (see the API sketch after the list):

  • Entities – Amazon Comprehend returns a list of entities, such as people, places, and locations, identified in a document.


  • Key phrases – Amazon Comprehend extracts key phrases that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score.


  • Language – Amazon Comprehend identifies the dominant language in a document. Amazon Comprehend can identify 100 languages.


  • PII – Amazon Comprehend analyzes documents to detect personal data that could be used to identify an individual, such as an address, bank account number, or phone number.


  • Sentiment – Amazon Comprehend determines the emotional sentiment of a document. Sentiment can be positive, neutral, negative, or mixed.


  • Syntax – Amazon Comprehend parses each word in your document and determines the part of speech for the word. For example, in the sentence "It is raining today in Seattle," "it" is identified as a pronoun, "raining" is identified as a verb, and "Seattle" is identified as a proper noun.

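As a rough sketch of the real-time detection APIs behind these insights (the sample text is made up):

import boto3

comprehend_client = boto3.client('comprehend')

sample_text = 'Amazon Comprehend was announced at AWS re:Invent in Las Vegas.'

# Each call returns a JSON response with the detected items and confidence scores.
language = comprehend_client.detect_dominant_language(Text=sample_text)
entities = comprehend_client.detect_entities(Text=sample_text, LanguageCode='en')
key_phrases = comprehend_client.detect_key_phrases(Text=sample_text, LanguageCode='en')
pii = comprehend_client.detect_pii_entities(Text=sample_text, LanguageCode='en')
sentiment = comprehend_client.detect_sentiment(Text=sample_text, LanguageCode='en')
syntax = comprehend_client.detect_syntax(Text=sample_text, LanguageCode='en')

print(sentiment['Sentiment'])  # e.g. POSITIVE, NEGATIVE, NEUTRAL or MIXED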

Amazon Comprehend pricing depends on the features used and the volume of data.


Amazon Translate

Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms.

With Amazon Translate, you can localize content such as websites and applications for your diverse users, easily translate large volumes of text for analysis, and efficiently enable cross-lingual communication between users.

Intento recently ranked Amazon Translate as the top machine translation provider in 2020 across 14 language pairs, 16 industry sectors and 8 content types.

Amazon Translate supports 75 languages for real-time or batch translation. You can also add a custom terminology to specify how Amazon Translate should translate specific terms, such as brand names, model names, character names, or other unique content, as sketched below.
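For example, a custom terminology can be imported from a small CSV file and referenced at translation time; the terminology name and CSV content below are hypothetical.

import boto3

translate_client = boto3.client('translate')

# CSV terminology: the header row lists language codes, each following row maps a term.
terminology_csv = b'en,es\nExampleBrand,ExampleBrand\n'

translate_client.import_terminology(
    Name='demo-brand-names',
    MergeStrategy='OVERWRITE',
    TerminologyData={'File': terminology_csv, 'Format': 'CSV'}
)

# The terminology keeps 'ExampleBrand' unchanged in the Spanish output.
response = translate_client.translate_text(
    Text='ExampleBrand builds solutions on AWS.',
    SourceLanguageCode='en',
    TargetLanguageCode='es',
    TerminologyNames=['demo-brand-names']
)

print(response['TranslatedText'])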


With Amazon Translate, you pay as you go based on the number of characters of text processed. It’s easy to get started with the Amazon Translate Free Tier: translate up to 2 million characters monthly, free for the first 12 months, starting from your first translation request. After the first year, Standard Translation costs $15.00 per million characters and Active Custom Translation costs $60.00 per million characters.

Example of a serverless application

Let’s take a look at how we can use the above-mentioned services together. Lambda functions (Python, boto3) will be orchestrated by Step Functions and use Transcribe, Translate, Comprehend, and Polly. The high-level flow is as follows:


  1. A client uploads an audio file in English into an S3 bucket. An event notification is configured on the bucket; when an object is uploaded, the event triggers the Lambda “trigger-ai-pipeline” (a possible notification configuration is sketched below).
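One possible way to wire up the S3 event notification is sketched below; the bucket name and Lambda ARN are placeholders, and the Lambda also needs a resource-based policy that allows s3.amazonaws.com to invoke it.

import boto3

s3_client = boto3.client('s3')

# Invoke the pipeline Lambda for every newly created object (placeholder bucket and ARN).
s3_client.put_bucket_notification_configuration(
    Bucket='demo-audio-input-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:trigger-ai-pipeline',
                'Events': ['s3:ObjectCreated:*']
            }
        ]
    }
)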


The Lambda code is below:

import boto3
import os
import json

stepfunctions = boto3.client('stepfunctions')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    input = {
        "Bucket" : bucket,
        "Key": key
    }

    response = stepfunctions.start_execution(
        stateMachineArn=os.environ['STATEMACHINEARN'],
        input=json.dumps(input)
    )

    return json.dumps(response, default=str)

The state machine ARN is provided to the Lambda function through the STATEMACHINEARN environment variable.

2. The state machine definition is below:

{
  "StartAt": "Start Transcribe",
  "States": {
    "Start Transcribe": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:3**********3:function:demo-start-transcribe-lambda:$LATEST",
        "Payload": {
          "Input.$": "$"
        }
      },
      "Next": "Wait for Transcribe"
    },
    "Wait for Transcribe": {
      "Type": "Wait",
      "Seconds": 45,
      "Next": "Check Transcribe Status"
    },
    "Check Transcribe Status": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:3**********3:function:demo-transcribe-status-lambda:$LATEST",
        "Payload": {
          "Input.$": "$"
        }
      },
      "Next": "Is Transcribe Complete"
    },
    "Is Transcribe Complete": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.Payload.TranscriptionJobStatus",
          "StringEquals": "COMPLETED",
          "Next": "Transcript Available"
        },
        {
          "Variable": "$.Payload.TranscriptionJobStatus",
          "StringEquals": "FAILED",
          "Next": "Transcribe Failed"
        }
      ],
      "Default": "Wait for Transcribe"
    },
    "Transcript Available": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Translate text",
          "States": {
            "Translate text": {
              "Type": "Task",
              "Resource": "arn:aws:states:::lambda:invoke",
              "Parameters": {
                "FunctionName": "arn:aws:lambda:us-east-1:3**********3:function:demo-translate-lambda:$LATEST",
                "Payload": {
                  "Input.$": "$"
                }
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "Comprehend sentiment",
          "States": {
            "Comprehend sentiment": {
              "Type": "Task",
              "Resource": "arn:aws:states:::lambda:invoke",
              "Parameters": {
                "FunctionName": "arn:aws:lambda:us-east-1:3**********3:function:demo-comprehend-sentiment-lambda:$LATEST",
                "Payload": {
                  "Input.$": "$"
                }
              },
              "End": true
            }
          }
        }
      ],
      "Next": "Convert text to speech"
    },
    "Convert text to speech": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:3**********3:function:demo-start-polly-lambda:$LATEST",
        "Payload": {
          "Input.$": "$"
        }
      },
      "End": true
    },
    "Transcribe Failed": {
      "Type": "Fail"
    }
  }
}

3. The Step Functions state machine starts with the function “demo-start-transcribe-lambda”. This Lambda is asynchronous and only starts a transcription job, because the maximum duration of a Lambda function is 15 minutes while a Transcribe job can take much longer. The Lambda should not wait for the Transcribe service, because that would be neither optimal nor cost-effective. The code of the Lambda function is below:

import boto3
import os
import uuid

transcribe_client = boto3.client('transcribe')

def lambda_handler(event, context):
    input = event['Input']
    s3Path = f"s3://{input['Bucket']}/{input['Key']}"
    jobName = f"{input['Key']}-{str(uuid.uuid4())}"

    response = transcribe_client.start_transcription_job(
        TranscriptionJobName=jobName,
        LanguageCode=os.environ['LANGUAGECODE'],
        Media={'MediaFileUri': s3Path},
        Settings={
            'ShowSpeakerLabels': False,
            'ChannelIdentification': False
        }
    )

    print(response)

    return {'TranscriptionJobName': response['TranscriptionJob']['TranscriptionJobName']}

4. The next step is a simple wait for 45 seconds.

5. After waiting, the next Lambda function “demo-transcribe-status-lambda” is executed. It calls the Transcribe service and checks the status of the previously started job. The Lambda code is below:

import boto3

transcribe_client = boto3.client('transcribe')

def lambda_handler(event, context):
    payload = event['Input']['Payload']
    transcriptionJobName = payload['TranscriptionJobName']

    response = transcribe_client.get_transcription_job(
        TranscriptionJobName=transcriptionJobName
    )

    transcriptionJob = response['TranscriptionJob']

    transcriptFileUri = "none"
    if 'Transcript' in transcriptionJob:
        if 'TranscriptFileUri' in transcriptionJob['Transcript']:
            transcriptFileUri = transcriptionJob['Transcript']['TranscriptFileUri']

    return {
        'TranscriptFileUri': transcriptFileUri,
        'TranscriptionJobName': transcriptionJobName,
        'TranscriptionJobStatus': transcriptionJob['TranscriptionJobStatus']
    }

6. The next step checks whether the job status is “COMPLETED” or “FAILED”. If it is neither, the workflow returns to the wait step and checks the status again after another 45 seconds.

7. When the status is “COMPLETED” and we have the output transcript file, two steps are executed in parallel: the Lambdas “demo-translate-lambda” and “demo-comprehend-sentiment-lambda”. The “translate” function translates the given text, in our case from English to Spanish.

import boto3
import json
import os
import urllib.request

translate_client = boto3.client('translate')

def lambda_handler(event, context):
    payload = event['Input']['Payload']
    transcriptFileUri = payload['TranscriptFileUri']
    transcriptionJobName = payload['TranscriptionJobName']

    transcriptFile = urllib.request.urlopen(transcriptFileUri).read()
    transcript = json.loads(transcriptFile)
    transcript_text = transcript['results']['transcripts'][0]['transcript']

    response = translate_client.translate_text(
        Text=transcript_text,
        SourceLanguageCode=os.environ['SOURCELANGUAGECODE'],
        TargetLanguageCode=os.environ['TARGETLANGUAGECODE']
    )

    return {
       'TranslatedText': response['TranslatedText'],
       'TranscriptionJobName': transcriptionJobName,
    }

The “Comprehend” Lambda determines whether the text is positive, negative, or neutral. The code is below:

import boto3
import json
import urllib.request

comprehend_client = boto3.client('comprehend')

def lambda_handler(event, context):
    payload = event['Input']['Payload']
    transcriptFileUri = payload['TranscriptFileUri']
    transcriptionJobName = payload['TranscriptionJobName']

    transcriptFile = urllib.request.urlopen(transcriptFileUri).read()
    transcript = json.loads(transcriptFile)
    transcript_text = transcript['results']['transcripts'][0]['transcript']

    response = comprehend_client.detect_sentiment(
        Text=transcript_text,
        LanguageCode='en'
    )

    sentiment = response['Sentiment']

    return {
       'Sentiment': sentiment,
       'TranscriptionJobName': transcriptionJobName
    }

8. The next step is the Lambda “demo-start-polly-lambda”, which takes the text translated into Spanish and creates an audio file. The code is below:

# https://docs.aws.amazon.com/polly/latest/dg/voicelist.html
import boto3
import os

polly_client = boto3.client('polly')

def lambda_handler(event, context):
    payload = event['Input'][0]['Payload']
    payload_other = event['Input'][1]['Payload']

    payload.update(payload_other)

    translatedText = payload['TranslatedText']
    transcriptionJobName = payload['TranscriptionJobName']
    sentiment = payload['Sentiment']

    response = polly_client.start_speech_synthesis_task(
        LanguageCode=os.environ['LANGUAGECODE'],
        OutputFormat='mp3',
        OutputS3BucketName=os.environ['OUTPUTS3BUCKETNAME'],
        OutputS3KeyPrefix=f'{sentiment}/{transcriptionJobName}',
        Text=translatedText,
        TextType='text',
        VoiceId=os.environ['VOICEID']
    )

    return {
        'TaskId': response['SynthesisTask']['TaskId'],
        'TranscriptionJobName': transcriptionJobName
    }

During the test, we upload two different audio files in English. In the first one, a person says: “Gloomy days are the worst. Why can't the sun be out instead? There's always tomorrow, I guess.”

In the second file, a person says: “It is a great day to be you. Seize the day and believe that you can do anything you set your mind to.”


Both transcription jobs completed successfully.


After sentiment analysis, translation, and text-to-speech conversion, we can see two folders in the output bucket: “POSITIVE” and “NEGATIVE”.


In the “POSITIVE” folder we can see the file “GreatDayToBeYou.mp3” in Spanish.


In the “NEGATIVE” folder we can find “GloomyDays.mp3”.


Conclusion

There are many interesting, helpful, and easy-to-use services in the AWS Machine Learning family. We have covered Polly, Transcribe, Textract, Comprehend, and Translate and tried them in a task where we needed to transcribe, translate, analyze, and generate audio files. Of course, you can train and use your own ML models for all of the described cases, but first you need to understand whether it is really worth the time investment. AWS continuously improves and retrains the models behind these services, and you don’t need to look after the infrastructure and logic they are built on.
