
All the things that Amazon Comprehend, Rekognition, Textract, Polly, Transcribe, and Others Do

Developers are programmers, but not necessarily experts in all code-related aspects. Creating ML-dependent functions requires specific knowledge of models and algorithms that not everyone has. Fortunately, there are ready-to-use APIs that leverage pre-trained models to run ML functions without ML knowledge, securely.

Elizabeth Fuentes
Amazon Employee
Published Jun 21, 2023
Last Modified Mar 18, 2024
Developers - those who provide solutions to computer problems, establish base procedures, program, and maintain solutions - are indeed programmers, but that doesn't automatically make them experts in everything related to code. Take, for example, the creation of an ML-dependent function: it necessitates familiarity with models and algorithm training, knowledge that isn't common among all programmers.
Fortunately, there are ready-to-use APIs that leverage existing, previously trained models to execute ML functions, and they can be used without the need for ML knowledge. Additionally, they ensure the security of the information shared with them. Up next, I'll introduce you to some specific ML API services and four use cases to get you familiar with them and let your imagination run wild.

How Do Ready-to-Use ML-Function APIs Work? Just Follow These 3 Simple Steps:

  1. Define the input: either the location of an object in an Amazon S3 bucket or a piece of text.
  2. Invoke the API with this input.
  3. Read the output, which is returned in JSON format.
[Diagram: AI/ML as an API in your app]
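The three steps above can be sketched in Python. This is a minimal sketch rather than the author's code: it assumes the Amazon Rekognition `detect_labels` call from boto3 (the AWS SDK for Python), and the helper names, bucket, and key are hypothetical. The client is passed in as an argument, so the functions work with a real `boto3.client("rekognition")` or with a stub.

```python
def build_image_input(bucket, key):
    """Step 1: describe the input as the location of an object in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def detect_labels(rekognition_client, bucket, key, max_labels=10):
    """Step 2: invoke the API with that input."""
    return rekognition_client.detect_labels(
        Image=build_image_input(bucket, key),
        MaxLabels=max_labels,
    )


def label_names(response):
    """Step 3: the output is a JSON-style dict; pull out the label names."""
    return [label["Name"] for label in response.get("Labels", [])]
```

With AWS credentials configured, a call would look like `label_names(detect_labels(boto3.client("rekognition"), "my-example-bucket", "photo.jpg"))`, where the bucket and key are placeholders.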

Let's Take a Look at the APIs

AWS offers a variety of ML and AI services designed to speed up adding these capabilities to your applications. These services range from those that equip you with the necessary infrastructure to train your own models to those that come as ready-to-use, pre-trained API calls. Let's now focus specifically on some examples of the latter:
API type: 🔎 Analysis of images (.png, .jpg) and videos (.mp4)
Service name: Amazon Rekognition
What you can do:
  - Label detection (predefined or custom).
  - Image properties and moderation.
  - Facial detection, comparison and analysis.
  - Face search.
  - People paths.
  - Personal Protective Equipment.
  - Celebrity recognition.
  - Text in image.
  - Inappropriate or offensive content.

API type: 🔎 Detection and analysis of text in documents (PNG, JPG, PDF or TIFF)
Service name: Amazon Textract
What you can do:
  - Process individual or bundled documents.
  - Detect typed and handwritten text.
  - Recognize documents, like financial reports, medical records, ID documents (driver's licenses and passports) and tax forms.
  - Extract text, forms, and tables from documents with structured data.

API type: 🔎 Natural Language Processing (NLP) and text analysis
Service name: Amazon Comprehend
What you can do:
  - Process documents and extract information such as entities, events, key phrases, dominant language, sentiment, targeted sentiment, and syntax analysis.
  - Custom classification and entity recognition.
  - Managing custom models.

API type: 🔎 Text to speech
Service name: Amazon Polly
What you can do:
  - Supports multiple languages and includes a variety of lifelike voices.
  - Includes a number of Neural Text-to-Speech (NTTS) voices, delivering ground-breaking improvements in speech quality through a new machine learning approach, thereby offering customers the most natural and human-like text-to-speech voices possible.
  - Neural TTS technology also supports a Newscaster speaking style that is tailored to news narration use cases.

API type: 🔎 Speech to text
Service name: Amazon Transcribe
What you can do:
  - Convert audio (in supported formats) to text.
  - Transcribe media in real time (streaming) or transcribe media files located in an Amazon S3 bucket (batch).
  - Improve accuracy for your specific use case with language customization, filter content to ensure customer privacy or audience-appropriate language, analyze content in multi-channel audio, and partition the speech of individual speakers.

API type: 🔎 Translate
Service name: Amazon Translate
What you can do:
  - Translate unstructured text (UTF-8) documents or build applications that work in multiple languages.

🚀 Use Cases

The most effective way to learn programming is by solving problems through code development. The same principle applies when learning how to use a service: you need to actively use it to understand it. The following four use cases are examples of both real and hypothetical problems that I tackled during my learning process.
If you're passionate about utilizing video as a tool for education, it would be ideal to reach as many people as possible. One common barrier to this is language. This application enables you to create subtitles and translate them into any desired language to remove this barrier.
[Diagram: Create subtitles and translate them into the language you want]
  1. Upload the .mp4 video to an Amazon S3 bucket.
  2. A Lambda function calls the Transcribe API.
  3. The subtitle file in the original language is saved to the S3 bucket.
  4. A Lambda function calls the Translate API.
  5. The subtitle file in the new language is saved to the S3 bucket.
Here's the code to create this solution.
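Separately from the author's linked code, here is a minimal sketch of the two API calls involved. It assumes boto3's `start_transcription_job` (Amazon Transcribe) and `translate_text` (Amazon Translate), and that the subtitles are requested in SRT format. The helper names, the default language code, and the line-by-line translation strategy are assumptions made for illustration.

```python
import re

# Matches SRT timestamp lines such as "00:00:01,000 --> 00:00:04,000".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def transcription_job_params(job_name, bucket, video_key, language_code="en-US"):
    """Arguments for Transcribe's start_transcription_job with SRT subtitles."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{video_key}"},
        "MediaFormat": "mp4",
        "LanguageCode": language_code,
        "OutputBucketName": bucket,
        "Subtitles": {"Formats": ["srt"]},
    }


def make_translator(translate_client, source, target):
    """Wrap Translate's translate_text as a one-argument function."""
    def translate_line(text):
        response = translate_client.translate_text(
            Text=text, SourceLanguageCode=source, TargetLanguageCode=target
        )
        return response["TranslatedText"]
    return translate_line


def translate_srt(srt_text, translate_line):
    """Translate caption lines only; keep cue numbers, timestamps, and blanks.

    Simplification: a caption line made only of digits is treated as a cue
    number and left untranslated.
    """
    out = []
    for line in srt_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.isdigit() or TIMESTAMP.match(stripped):
            out.append(line)
        else:
            out.append(translate_line(line))
    return "\n".join(out)
```

A Lambda function could call `transcribe_client.start_transcription_job(**transcription_job_params(...))` for step 2 and `translate_srt(srt_text, make_translator(translate_client, "en", "es"))` for step 4. Translating one line at a time keeps the timing intact, but it may lose context across captions.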
Many people have piles of documents at home: letters from past lovers, medical records, children's school memorabilia, bank statements, and more. Wouldn't it be convenient to neatly store these in the cloud? Explore and learn about the functionalities of Textract and Comprehend with this app.
[Diagram: Detecting entities and sentiment from a document]
  1. Upload the document (PNG, JPG, PDF or TIFF) to an S3 bucket.
  2. A Lambda function calls the Textract API.
  3. With the response from Textract, the Lambda function calls the Comprehend API.
  4. A Lambda function calls the Translate API.
  5. The response is saved in an S3 bucket.
Here's the code to create this solution
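As an illustration separate from the linked code, the Textract-to-Comprehend hand-off might look like the sketch below. It assumes boto3's synchronous `detect_document_text` (Textract) plus `detect_entities` and `detect_sentiment` (Comprehend). The function names and the default language code are hypothetical, and the Translate step is left out for brevity.

```python
def extract_lines(textract_response):
    """Join the LINE blocks of a Textract response into plain text."""
    return "\n".join(
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    )


def analyze_document(textract_client, comprehend_client, bucket, key,
                     language_code="en"):
    """Run Textract on an S3 object, then Comprehend on the extracted text."""
    textract_response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    text = extract_lines(textract_response)
    entities = comprehend_client.detect_entities(
        Text=text, LanguageCode=language_code
    )
    sentiment = comprehend_client.detect_sentiment(
        Text=text, LanguageCode=language_code
    )
    return {
        "text": text,
        "entities": [(e["Text"], e["Type"]) for e in entities["Entities"]],
        "sentiment": sentiment["Sentiment"],
    }
```

The synchronous Textract call suits single-page documents. Multipage PDFs generally need the asynchronous `start_document_text_detection` flow, and Comprehend limits the amount of text per request, so long documents would need to be split first.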
I was curious how an Italian voice speaking Chinese would sound, and since Polly has native voices for each language, I created this notebook to play 😂.
[Diagram: Make Polly Talk]
  1. From a Jupyter Notebook, call the Polly API.
  2. Polly stores the result in an S3 bucket.
  3. Retrieve the audio.
Here's the code to create this solution
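The flow above can also be sketched without the notebook. This is a hedged example, not the linked code: it assumes boto3's `start_speech_synthesis_task` and `get_speech_synthesis_task` (Amazon Polly), uses "Bianca" as an example Italian voice, and the helper names, polling interval, and retry count are made up for illustration.

```python
import time


def synthesis_task_params(text, bucket, voice_id="Bianca", output_format="mp3"):
    """Arguments for Polly's start_speech_synthesis_task (audio lands in S3)."""
    return {
        "Text": text,
        "VoiceId": voice_id,  # "Bianca" is only an example voice choice
        "OutputFormat": output_format,
        "OutputS3BucketName": bucket,
    }


def start_speech_task(polly_client, text, bucket, voice_id="Bianca"):
    """Steps 1 and 2: ask Polly to synthesize the text and store it in S3."""
    response = polly_client.start_speech_synthesis_task(
        **synthesis_task_params(text, bucket, voice_id)
    )
    return response["SynthesisTask"]["TaskId"]


def wait_for_task(polly_client, task_id, poll_seconds=2, max_polls=30,
                  sleep=time.sleep):
    """Step 3 (first half): poll until the task has completed or failed."""
    for _ in range(max_polls):
        task = polly_client.get_speech_synthesis_task(TaskId=task_id)[
            "SynthesisTask"
        ]
        if task["TaskStatus"] in ("completed", "failed"):
            return task
        sleep(poll_seconds)
    raise TimeoutError(f"Polly task {task_id} did not finish in time")
```

Once the task reports `completed`, its `OutputUri` field points at the audio object in the bucket, which can then be downloaded with the S3 client.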
I'm a fan of action movies and wanted to try Rekognition with the trailer of Die Hard, so I created this application and wow! Each dataframe is pure violence 🫣... I invite you to try it with a trailer of your favorite movie.
[Diagram: Video content moderation]
  1. Upload the .mp4 video to an S3 bucket.
  2. A Lambda function calls the Rekognition API.
  3. Once the video review is finished, a new Lambda function retrieves the result and stores it in an S3 bucket.
Here's the code to create this solution
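For readers who want to see the shape of the calls before opening the linked code, here is a minimal sketch. It assumes boto3's `start_content_moderation` and `get_content_moderation` (Amazon Rekognition Video). The function names and the row layout are hypothetical, and the sketch assumes the job has already finished when results are fetched.

```python
def start_moderation(rekognition_client, bucket, video_key, min_confidence=50.0):
    """Start an asynchronous moderation job for a video stored in S3."""
    response = rekognition_client.start_content_moderation(
        Video={"S3Object": {"Bucket": bucket, "Name": video_key}},
        MinConfidence=min_confidence,
    )
    return response["JobId"]


def collect_moderation_rows(rekognition_client, job_id):
    """Page through the results and flatten each label into a plain dict."""
    rows, token = [], None
    while True:
        kwargs = {"JobId": job_id, "SortBy": "TIMESTAMP"}
        if token:
            kwargs["NextToken"] = token
        page = rekognition_client.get_content_moderation(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"Job {job_id} status: {page['JobStatus']}")
        for item in page.get("ModerationLabels", []):
            label = item["ModerationLabel"]
            rows.append({
                "timestamp_ms": item["Timestamp"],  # offset from video start
                "name": label["Name"],
                "parent": label.get("ParentName", ""),
                "confidence": label["Confidence"],
            })
        token = page.get("NextToken")
        if not token:
            return rows
```

The returned list of dicts can be handed straight to `pandas.DataFrame(rows)` if you want to explore the labels as a dataframe.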

Conclusion

You've now learned that AI/ML can be used through an API call to perform a variety of tasks, such as analyzing images and videos, detecting and analyzing text in scanned documents, and applying Natural Language Processing (NLP) to detect the dominant language and extract sentiment, among many other things. You can also convert text to speech and speech to text, and translate between languages, each within reach of a single API call.
This just scratches the surface of what can be achieved by calling AI/ML services through APIs.
No doubt, there's a real or hypothetical problem you'd like to address using one of these services. Even if you don't have one in mind, I've provided these links for you to continue experimenting and learning:

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
