AWS Machine Learning Specialty Bite Size Recap 2/3
Model Retraining
- Retraining with a larger number of epochs doesn’t make sense if the model’s error on the test data has already converged to its minimum.
 
Dropout Regularization
- Applying dropout regularization at the flatten layer is incorrect.
 - Dropout is typically used to combat overfitting, and its application depends on the gap between validation error and training error.
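The mechanics behind this note can be sketched without any framework. Below is a minimal pure-Python implementation of inverted dropout (the variant used by most libraries), showing that it is active only during training and is an identity at inference time; the function name and rate are illustrative, not from the source.

```python
import random

def dropout(activations, rate, training=True):
    """Inverted dropout: randomly zero units and rescale survivors.

    Typically applied after fully connected (dense) layers during
    training; at inference time the layer is a no-op.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Zero each unit with probability `rate`, scale the rest by 1/keep
    # so the expected activation magnitude is unchanged.
    return [a / keep if random.random() < keep else 0.0 for a in activations]

# At inference, dropout leaves activations untouched:
print(dropout([0.5, 1.0, 2.0], rate=0.5, training=False))  # [0.5, 1.0, 2.0]
```

Because the decision to use dropout hinges on the validation/training error gap, it is usually added only after that gap indicates overfitting.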
 
Model Complexity
- Augmenting model complexity by increasing the number of layers is incorrect.
 - Increasing layers may negatively impact the model, potentially causing overfitting.
 
AWS Glue Data Catalog
- Contains references to data used in ETL jobs.
 - Essential for creating data warehouses or data lakes.
 - Serves as an index for location, schema, and runtime metrics.
 - Information stored as metadata tables.
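A short sketch of what "stored as metadata tables" means in practice: the catalog entry records the S3 location and schema, not the data itself. The table name, bucket, and columns below are hypothetical; the (uncalled) `lookup_table` helper shows the real Glue `GetTable` API shape and requires AWS credentials to actually run.

```python
def lookup_table(database, table_name):
    """Fetch a table's metadata from the Glue Data Catalog (needs AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without AWS
    return boto3.client("glue").get_table(DatabaseName=database, Name=table_name)

# The catalog stores metadata only -- the data itself stays in S3 (or a JDBC source).
# Illustrative shape of a catalog table entry:
table_entry = {
    "Name": "clickstream_logs",  # hypothetical table name
    "StorageDescriptor": {
        "Location": "s3://my-data-lake/clickstream/",  # where the data lives
        "Columns": [
            {"Name": "user_id", "Type": "string"},
            {"Name": "ts", "Type": "timestamp"},
        ],
    },
}
print(table_entry["StorageDescriptor"]["Location"])
```

ETL jobs and query engines (Athena, EMR, Redshift Spectrum) resolve this metadata at runtime to find and parse the underlying data.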
 
EMR Cluster vs. AWS Glue
- Creating an EMR cluster involves more configuration effort than AWS Glue.
 
AWS Data Pipeline and AWS Glue Data Catalog
- Using AWS Data Pipeline to automate data transformation jobs and AWS Glue Data Catalog for storing metadata is incorrect.
 - Requires configuring and managing compute resources for EMR.
 
Amazon EMR
- Instantly provisions capacity for data-intensive tasks.
 - Suitable for applications like web indexing, data mining, log file analysis, machine learning, and more.
 - Eliminates the need for time-consuming setup, management, or tuning of clusters.
 
Amazon QuickSight
- Scalable, serverless, embeddable BI service.
 - Machine learning-powered business intelligence for the cloud.
 - Enables easy creation and publication of interactive BI dashboards with ML-powered insights.
 
Generating Precision-Recall Data
- Amazon EMR is the best choice for generating precision-recall data, especially for big data processing (150TB).
 
Custom CloudWatch Dashboards
- Direct creation of custom CloudWatch dashboards from S3 data is not possible.
 
Redshift in the Scenario
- Redshift has no application in this scenario; it is only used to store the output of EMR.
 
Pipe Input Mode vs. File Input Mode in SageMaker
- Pipe Input Mode
- Data is streamed on-the-fly into the algorithm container, without involving disk I/O.
 - Shortens download process and reduces startup time.
 - Generally better read throughput than File input mode.
 - Enables training on datasets larger than the 16 TB EBS volume size limit.
- File Input Mode
 - Default mode for training in Amazon SageMaker.
 - Downloads the full dataset to the instance’s storage volume before training starts, lengthening startup time; not the best choice among the given options.
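To make the switch concrete, here is a hedged sketch of selecting Pipe mode, both through the SageMaker Python SDK's `Estimator` (the uncalled function needs the `sagemaker` package and AWS credentials) and through the low-level `CreateTrainingJob` request, where the same toggle is `TrainingInputMode`. Instance type and image URI are placeholders.

```python
def make_estimator(role, image_uri):
    """Hypothetical training setup that streams data with Pipe mode."""
    import sagemaker  # lazy import: only needed if you actually build the estimator
    return sagemaker.estimator.Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",  # placeholder instance type
        input_mode="Pipe",             # stream from S3; the default is "File"
    )

# The low-level CreateTrainingJob API exposes the same switch:
algorithm_spec = {"TrainingImage": "<image-uri>", "TrainingInputMode": "Pipe"}
print(algorithm_spec["TrainingInputMode"])
```

Because nothing is copied to disk first, Pipe mode is what lifts the dataset-size ceiling past the EBS volume limit mentioned above.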
 
Amazon Elastic Inference
- Allows attaching low-cost GPU-powered acceleration to EC2 instances, SageMaker instances, or ECS tasks.
 - Reduces deep learning inference costs by up to 75%.
 - Supports TensorFlow, Apache MXNet, PyTorch, and ONNX models.
 - Enables precise configuration of GPU-powered inference acceleration.
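The "precise configuration" point means the accelerator size is chosen independently of the host instance type. The sketch below shows the `accelerator_type` parameter on a SageMaker `deploy` call (the uncalled function assumes an existing SageMaker `model` object; the instance and accelerator sizes are illustrative choices, not recommendations).

```python
def deploy_with_eia(model):
    """Attach an Elastic Inference accelerator at deploy time (sagemaker SDK)."""
    return model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.large",        # CPU host instance
        accelerator_type="ml.eia2.medium",  # EI accelerator size (illustrative)
    )

# EI decouples the accelerator from the host, so each is right-sized separately:
endpoint_config = {"InstanceType": "ml.m5.large", "AcceleratorType": "ml.eia2.medium"}
print(endpoint_config["AcceleratorType"])
```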
 
Text Cleaning in NLP
- Integral stage in NLP pipeline for structured processing of unstructured texts.
 - Examples include lowercase conversion, word tokenization, stop word removal, HTML tag removal, stemming, lemmatization, etc.
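The steps listed above chain together naturally. Here is a minimal, self-contained cleaning pipeline covering HTML tag removal, lowercasing, word tokenization, and stop word removal (the stop list and regexes are simplified illustrations, not a production tokenizer; stemming/lemmatization would need a library such as NLTK or spaCy).

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "at", "of"}  # tiny illustrative stop list

def clean(text):
    """Minimal cleaning: strip HTML tags, lowercase, tokenize, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tag removal
    tokens = re.findall(r"[a-z0-9']+", text.lower())   # lowercase + tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop word removal

print(clean("<p>The staff at the hotel IS great</p>"))
# ['staff', 'hotel', 'great']
```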
 
Fixing Spelling Errors
- Correcting a specific word (“niht” to “night”) is impractical for all posts.
 
Part-of-Speech (PoS) Tagging
- Primarily used for categorizing words in a text corpus, not for text preprocessing.
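To see why PoS tagging is categorization rather than preprocessing: its output annotates tokens with grammatical roles instead of transforming the text. A toy lexicon-based tagger (the lexicon and `UNK` fallback are purely illustrative; real taggers such as NLTK's are statistical) makes the output shape clear.

```python
# Toy lexicon just to illustrate what PoS tagging produces
LEXICON = {"dogs": "NOUN", "bark": "VERB", "loudly": "ADV"}

def pos_tag(tokens):
    """Tag each token with its part of speech, or UNK if unknown."""
    return [(t, LEXICON.get(t, "UNK")) for t in tokens]

print(pos_tag(["dogs", "bark", "loudly"]))
# [('dogs', 'NOUN'), ('bark', 'VERB'), ('loudly', 'ADV')]
```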
 
One-Hot Encoding vs. Word2Vec
- One-hot encoding is unsuitable for Word2Vec as it poorly captures semantics between words.
 - Tokenization is a better approach for processing individual words.
 
SageMaker Object2Vec Algorithm Components
- Two input channels, two encoders (enc0 and enc1), and a comparator.
 - Comparator compares embeddings and outputs scores indicating relationship strength.
 - Encoders convert objects into fixed-length embedding vectors for comparison.
 - Dropout hyperparameter reduces overfitting by randomly dropping neurons during training, preventing them from becoming codependent.
 - L1 regularization is not available for Amazon SageMaker Object2Vec; it’s used for simple regression models.
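The components above map directly onto Object2Vec's hyperparameters. The dict below is an illustrative configuration (the specific values are assumptions, not a recipe): one encoder network per input channel, the embedding dimension, the comparator operations applied to the two embeddings, and the `dropout` regularizer noted above.

```python
# Illustrative Object2Vec hyperparameter configuration
hyperparameters = {
    "enc0_network": "bilstm",   # encoder for input channel 0
    "enc1_network": "hcnn",     # encoder for input channel 1
    "enc_dim": 1024,            # length of the fixed-length embedding vectors
    "comparator_list": "hadamard,abs_diff",  # how the two embeddings are compared
    "dropout": 0.2,             # regularization against overfitting
}
print(hyperparameters["dropout"])
```

Note there is no L1-regularization knob in this list, consistent with the point above.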
 
Happy learning! :)