AWS Machine Learning Specialty Bite Size Recap 2/3

Model Retraining

  • Retraining with a larger number of epochs doesn’t make sense if the model’s error has already reached its minimum on the test data; more epochs cannot improve it further.

Dropout Regularization

  • Applying dropout regularization at the flatten layer is incorrect.
  • Dropout is typically used to combat overfitting, and whether to apply it depends on the gap between validation error and training error (see the sketch below for typical placement).
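
As a rough illustration of typical dropout placement (a minimal Keras-style sketch; the layer sizes and input shape are made up), dropout usually sits between dense layers rather than on the flatten operation itself:

```python
# Minimal sketch (hypothetical layer sizes): dropout placed after a dense
# layer to combat overfitting when validation error far exceeds training error.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),  # randomly drops 50% of units during training only
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```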

Model Complexity

  • Increasing model complexity by adding more layers is incorrect.
  • Adding layers can hurt the model and may lead to overfitting.

AWS Glue Data Catalog

  • Contains references to data used in ETL jobs.
  • Essential for creating data warehouses or data lakes.
  • Serves as an index for location, schema, and runtime metrics.
  • Information stored as metadata tables.
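
As a minimal sketch of how this catalog metadata can be read programmatically with boto3 (the database name `sales_db` and table name `orders` are hypothetical):

```python
# Minimal sketch: inspect AWS Glue Data Catalog metadata with boto3.
# "sales_db" and "orders" are hypothetical catalog entries.
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="sales_db", Name="orders")["Table"]

print(table["StorageDescriptor"]["Location"])       # S3 location the table references
for col in table["StorageDescriptor"]["Columns"]:   # schema stored as metadata
    print(col["Name"], col["Type"])
```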

EMR Cluster vs. AWS Glue

  • Creating an EMR cluster involves more configuration effort than AWS Glue.

AWS Data Pipeline and AWS Glue Data Catalog

  • Using AWS Data Pipeline to automate data transformation jobs and the AWS Glue Data Catalog to store metadata is incorrect.
  • This approach still requires you to configure and manage the compute resources (e.g., the EMR cluster) yourself.

Amazon EMR

  • Instantly provisions capacity for data-intensive tasks.
  • Suitable for applications like web indexing, data mining, log file analysis, machine learning, and more.
  • Eliminates the need for time-consuming setup, management, or tuning of clusters.

Amazon QuickSight

  • Scalable, serverless, embeddable BI service.
  • Machine learning-powered business intelligence for the cloud.
  • Enables easy creation and publication of interactive BI dashboards with ML-powered insights.

Generating Precision-Recall Data

  • Amazon EMR is the best choice for generating precision-recall data, especially for big data processing (150TB).
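
For reference, “precision-recall data” is just precision and recall values computed across score thresholds. A toy single-machine sketch with scikit-learn is below; at 150 TB the same computation would run as a distributed job on EMR rather than on one machine.

```python
# Toy sketch of generating precision-recall data with scikit-learn.
# On a 150 TB dataset this step would run as a distributed job on EMR.
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 1, 1, 0, 1]                    # ground-truth labels (toy data)
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9]    # model-predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r in zip(precision, recall):
    print(f"precision={p:.2f} recall={r:.2f}")
```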

Custom CloudWatch Dashboards

  • Direct creation of custom CloudWatch dashboards from S3 data is not possible.

Redshift in the Scenario

  • Redshift has no application in this scenario; it is only used to store the output of EMR.

Pipe Input Mode vs. File Input Mode in SageMaker

  • Pipe Input Mode (see the SDK sketch after this list)
      • Data is streamed on-the-fly into the algorithm container, avoiding disk I/O.
      • Shortens the download step and reduces training startup time.
      • Generally provides better read throughput than File input mode.
      • Enables training on datasets larger than the 16 TB EBS volume size limit.
  • File Input Mode
      • Default mode for training in Amazon SageMaker.
      • Downloads the full dataset to the training instance before training starts, so it lacks Pipe mode’s startup-time and throughput benefits and is not the best choice among the given options.
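
A minimal sketch of selecting Pipe mode with the SageMaker Python SDK (assuming SDK v2; the image URI, IAM role, and S3 paths below are placeholders):

```python
# Minimal sketch: choosing Pipe input mode in the SageMaker Python SDK (v2).
# The image URI, role ARN and S3 paths are hypothetical placeholders.
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",                    # stream data instead of downloading it first
    output_path="s3://my-bucket/output/",
)
estimator.fit({"train": TrainingInput("s3://my-bucket/train/")})
```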

Amazon Elastic Inference

  • Allows attaching low-cost GPU-powered acceleration to Amazon EC2 instances, Amazon SageMaker instances, or Amazon ECS tasks.
  • Reduces deep learning inference costs by up to 75%.
  • Supports TensorFlow, Apache MXNet, PyTorch, and ONNX models.
  • Enables precise configuration of GPU-powered inference acceleration.
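
A minimal sketch of attaching an accelerator at deploy time with the SageMaker Python SDK (the model artifact path, IAM role, framework version, and accelerator size are placeholder assumptions):

```python
# Minimal sketch: attaching an Elastic Inference accelerator when deploying
# a model with the SageMaker Python SDK. The S3 artifact, role ARN and
# framework version are hypothetical placeholders.
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="2.3",
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",         # CPU instance for hosting
    accelerator_type="ml.eia2.medium",   # low-cost GPU-powered acceleration
)
```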

Text Cleaning in NLP

  • Integral stage in NLP pipeline for structured processing of unstructured texts.
  • Examples include lowercase conversion, word tokenization, stop word removal, HTML tag removal, stemming, lemmatization, etc.
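
A small sketch of such a cleaning pipeline with NLTK (the sample review text is made up):

```python
# Minimal text-cleaning sketch with NLTK: HTML tag removal, lowercasing,
# tokenization, stop-word removal and stemming on a made-up sentence.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt")
nltk.download("stopwords")

text = "<p>The staff were AMAZING and the room was very clean!</p>"
text = re.sub(r"<[^>]+>", " ", text).lower()        # strip HTML tags, lowercase
tokens = nltk.word_tokenize(text)                    # word tokenization
tokens = [t for t in tokens if t.isalpha()]          # drop punctuation/numbers
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stop words
stems = [PorterStemmer().stem(t) for t in tokens]    # stemming
print(stems)
```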

Fixing Spelling Errors

  • Correcting one specific misspelling (“niht” to “night”) is impractical to apply across all posts.

Part-of-Speech (PoS) Tagging

  • Primarily used for categorizing words in a text corpus, not for text preprocessing.
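
For contrast, a minimal NLTK sketch of PoS tagging, which labels each word’s grammatical category rather than transforming or cleaning the text:

```python
# Minimal sketch: part-of-speech tagging with NLTK categorizes words;
# it does not clean or normalize them.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The hotel staff were very helpful")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('hotel', 'NN'), ('staff', 'NN'), ...]
```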

One-Hot Encoding vs. Word2Vec

  • One-hot encoding is unsuitable for Word2Vec as it poorly captures semantics between words.
  • Tokenization is a better approach for processing individual words.
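
A toy sketch with gensim showing that tokenized sentences (lists of words, not one-hot vectors) are the input Word2Vec expects; the corpus below is made up:

```python
# Toy sketch: tokenized sentences are fed to Word2Vec, which learns dense
# embeddings that capture semantic similarity between words.
from gensim.models import Word2Vec

sentences = [
    ["the", "room", "was", "clean", "and", "quiet"],
    ["the", "suite", "was", "spotless", "and", "quiet"],
    ["breakfast", "was", "served", "late", "at", "night"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["room"][:5])                   # dense embedding vector
print(model.wv.most_similar("room", topn=2))  # semantically similar words
```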

SageMaker Object2Vec Algorithm Components

  • Two input channels, two encoders (enc0 and enc1), and a comparator.
  • Comparator compares embeddings and outputs scores indicating relationship strength.
  • Encoders convert objects into fixed-length embedding vectors for comparison.
  • Dropout hyperparameter reduces overfitting by trimming codependent neurons.
  • L1 regularization is not an available hyperparameter for Amazon SageMaker Object2Vec; it is typically applied to simpler models such as linear regression.
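
A purely conceptual NumPy sketch of the enc0/enc1-plus-comparator idea (not the actual SageMaker API or training procedure): each encoder maps its object to a fixed-length embedding, and the comparator turns the pair of embeddings into a relationship-strength score.

```python
# Conceptual NumPy sketch of Object2Vec's structure (not the SageMaker API):
# two encoders produce fixed-length embeddings, and a comparator converts the
# pair of embeddings into a relationship-strength score.
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 4))   # toy linear "enc0" for object A
W1 = rng.normal(size=(8, 4))   # toy linear "enc1" for object B

def encode(x, W):
    return np.tanh(x @ W)      # fixed-length embedding vector

def comparator(e0, e1):
    # Toy comparator: combine embeddings and squash to a score in (0, 1).
    return 1 / (1 + np.exp(-np.dot(e0, e1)))

obj_a, obj_b = rng.normal(size=8), rng.normal(size=8)
score = comparator(encode(obj_a, W0), encode(obj_b, W1))
print(f"relationship strength: {score:.3f}")
```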

Happy learning! :)