AWS Machine Learning Specialty Bite Size Recap 2/3

Model Retraining

  • Retraining with a larger number of epochs doesn’t make sense if the model’s error has already reached its minimum on the test data; more epochs cannot improve it further.

Dropout Regularization

  • Applying dropout regularization at the flatten layer is incorrect.
  • Dropout is typically used to combat overfitting, and whether to apply it depends on the gap between validation error and training error (see the sketch below for typical placement).
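
As a rough illustration of typical dropout placement (a minimal Keras-style sketch; the layer sizes and input shape are made up), dropout usually sits between dense layers rather than on the flatten operation itself:

```python
# Minimal sketch (hypothetical layer sizes): dropout placed after a dense
# layer to combat overfitting when validation error far exceeds training error.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),  # randomly drops 50% of units during training only
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```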

Model Complexity

  • Increasing model complexity by adding more layers is incorrect.
  • Adding layers can hurt the model and may lead to overfitting.

AWS Glue Data Catalog

  • Contains references to data used in ETL jobs.
  • Essential for creating data warehouses or data lakes.
  • Serves as an index for location, schema, and runtime metrics.
  • Information stored as metadata tables.
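
As a minimal sketch of how this catalog metadata can be read programmatically with boto3 (the database name `sales_db` and table name `orders` are hypothetical):

```python
# Minimal sketch: inspect AWS Glue Data Catalog metadata with boto3.
# "sales_db" and "orders" are hypothetical catalog entries.
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="sales_db", Name="orders")["Table"]

print(table["StorageDescriptor"]["Location"])       # S3 location the table references
for col in table["StorageDescriptor"]["Columns"]:   # schema stored as metadata
    print(col["Name"], col["Type"])
```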

EMR Cluster vs. AWS Glue

  • Creating an EMR cluster involves more configuration effort than AWS Glue.

AWS Data Pipeline and AWS Glue Data Catalog

  • Using AWS Data Pipeline to automate data transformation jobs and the AWS Glue Data Catalog to store metadata is incorrect.
  • This approach still requires you to configure and manage the compute resources (e.g., the EMR cluster) yourself.

Amazon EMR

  • Instantly provisions capacity for data-intensive tasks.
  • Suitable for applications like web indexing, data mining, log file analysis, machine learning, and more.
  • Eliminates the need for time-consuming setup, management, or tuning of clusters.

Amazon QuickSight

  • Scalable, serverless, embeddable BI service.
  • Machine learning-powered business intelligence for the cloud.
  • Enables easy creation and publication of interactive BI dashboards with ML-powered insights.

Generating Precision-Recall Data

  • Amazon EMR is the best choice for generating precision-recall data, especially for big data processing (150TB).
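
For reference, “precision-recall data” is just precision and recall values computed across score thresholds. A toy single-machine sketch with scikit-learn is below; at 150 TB the same computation would run as a distributed job on EMR rather than on one machine.

```python
# Toy sketch of generating precision-recall data with scikit-learn.
# On a 150 TB dataset this step would run as a distributed job on EMR.
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 1, 1, 0, 1]                    # ground-truth labels (toy data)
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9]    # model-predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r in zip(precision, recall):
    print(f"precision={p:.2f} recall={r:.2f}")
```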

Custom CloudWatch Dashboards

  • Direct creation of custom CloudWatch dashboards from S3 data is not possible.

Redshift in the Scenario

  • Redshift has no application in this scenario; it is only used to store the output of EMR.

Pipe Input Mode vs. File Input Mode in SageMaker

  • Pipe Input Mode (see the SDK sketch after this list)
      • Data is streamed on-the-fly into the algorithm container, avoiding disk I/O.
      • Shortens the download step and reduces training startup time.
      • Generally provides better read throughput than File input mode.
      • Enables training on datasets larger than the 16 TB EBS volume size limit.
  • File Input Mode
      • Default mode for training in Amazon SageMaker.
      • Downloads the full dataset to the training instance before training starts, so it lacks Pipe mode’s startup-time and throughput benefits and is not the best choice among the given options.
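
A minimal sketch of selecting Pipe mode with the SageMaker Python SDK (assuming SDK v2; the image URI, IAM role, and S3 paths below are placeholders):

```python
# Minimal sketch: choosing Pipe input mode in the SageMaker Python SDK (v2).
# The image URI, role ARN and S3 paths are hypothetical placeholders.
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",                    # stream data instead of downloading it first
    output_path="s3://my-bucket/output/",
)
estimator.fit({"train": TrainingInput("s3://my-bucket/train/")})
```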

Amazon Elastic Inference

  • Allows attaching low-cost GPU-powered acceleration to Amazon EC2 instances, Amazon SageMaker instances, or Amazon ECS tasks.
  • Reduces deep learning inference costs by up to 75%.
  • Supports TensorFlow, Apache MXNet, PyTorch, and ONNX models.
  • Enables precise configuration of GPU-powered inference acceleration.
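
A minimal sketch of attaching an accelerator at deploy time with the SageMaker Python SDK (the model artifact path, IAM role, framework version, and accelerator size are placeholder assumptions):

```python
# Minimal sketch: attaching an Elastic Inference accelerator when deploying
# a model with the SageMaker Python SDK. The S3 artifact, role ARN and
# framework version are hypothetical placeholders.
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="2.3",
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",         # CPU instance for hosting
    accelerator_type="ml.eia2.medium",   # low-cost GPU-powered acceleration
)
```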

Text Cleaning in NLP

  • Integral stage in NLP pipeline for structured processing of unstructured texts.
  • Examples include lowercase conversion, word tokenization, stop word removal, HTML tag removal, stemming, lemmatization, etc.
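
A small sketch of such a cleaning pipeline with NLTK (the sample review text is made up):

```python
# Minimal text-cleaning sketch with NLTK: HTML tag removal, lowercasing,
# tokenization, stop-word removal and stemming on a made-up sentence.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt")
nltk.download("stopwords")

text = "<p>The staff were AMAZING and the room was very clean!</p>"
text = re.sub(r"<[^>]+>", " ", text).lower()        # strip HTML tags, lowercase
tokens = nltk.word_tokenize(text)                    # word tokenization
tokens = [t for t in tokens if t.isalpha()]          # drop punctuation/numbers
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stop words
stems = [PorterStemmer().stem(t) for t in tokens]    # stemming
print(stems)
```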

Fixing Spelling Errors

  • Correcting one specific misspelling (“niht” to “night”) is impractical to apply across all posts.

Part-of-Speech (PoS) Tagging

  • Primarily used for categorizing words in a text corpus, not for text preprocessing.
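
For contrast, a minimal NLTK sketch of PoS tagging, which labels each word’s grammatical category rather than transforming or cleaning the text:

```python
# Minimal sketch: part-of-speech tagging with NLTK categorizes words;
# it does not clean or normalize them.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The hotel staff were very helpful")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('hotel', 'NN'), ('staff', 'NN'), ...]
```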

One-Hot Encoding vs. Word2Vec

  • One-hot encoding is unsuitable for Word2Vec as it poorly captures semantics between words.
  • Tokenization is a better approach for processing individual words.
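
A toy sketch with gensim showing that tokenized sentences (lists of words, not one-hot vectors) are the input Word2Vec expects; the corpus below is made up:

```python
# Toy sketch: tokenized sentences are fed to Word2Vec, which learns dense
# embeddings that capture semantic similarity between words.
from gensim.models import Word2Vec

sentences = [
    ["the", "room", "was", "clean", "and", "quiet"],
    ["the", "suite", "was", "spotless", "and", "quiet"],
    ["breakfast", "was", "served", "late", "at", "night"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["room"][:5])                   # dense embedding vector
print(model.wv.most_similar("room", topn=2))  # semantically similar words
```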

SageMaker Object2Vec Algorithm Components

  • Two input channels, two encoders (enc0 and enc1), and a comparator.
  • Comparator compares embeddings and outputs scores indicating relationship strength.
  • Encoders convert objects into fixed-length embedding vectors for comparison.
  • Dropout hyperparameter reduces overfitting by trimming codependent neurons.
  • L1 regularization is not an available hyperparameter for Amazon SageMaker Object2Vec; it is typically applied to simpler models such as linear regression.
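
A purely conceptual NumPy sketch of the enc0/enc1-plus-comparator idea (not the actual SageMaker API or training procedure): each encoder maps its object to a fixed-length embedding, and the comparator turns the pair of embeddings into a relationship-strength score.

```python
# Conceptual NumPy sketch of Object2Vec's structure (not the SageMaker API):
# two encoders produce fixed-length embeddings, and a comparator converts the
# pair of embeddings into a relationship-strength score.
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 4))   # toy linear "enc0" for object A
W1 = rng.normal(size=(8, 4))   # toy linear "enc1" for object B

def encode(x, W):
    return np.tanh(x @ W)      # fixed-length embedding vector

def comparator(e0, e1):
    # Toy comparator: combine embeddings and squash to a score in (0, 1).
    return 1 / (1 + np.exp(-np.dot(e0, e1)))

obj_a, obj_b = rng.normal(size=8), rng.normal(size=8)
score = comparator(encode(obj_a, W0), encode(obj_b, W1))
print(f"relationship strength: {score:.3f}")
```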

Happy learning! :)