AWS Machine Learning Specialty Bite Size Recap 2/3
Model Retraining
- Retraining with more epochs doesn’t make sense if the model’s loss has already converged to its minimum; extra epochs only add training cost and can encourage overfitting.
Dropout Regularization
- Applying dropout regularization at the flatten layer is incorrect.
- Dropout is typically used to combat overfitting, and its application depends on the gap between validation error and training error.
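To make the dropout note concrete, here is a minimal sketch of inverted dropout in plain Python. The function name `dropout` and the `rate`, `training`, and `seed` parameters are illustrative choices, not the API of any particular framework.

```python
import random


def dropout(activations, rate, training=True, seed=None):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and rescale the survivors so the expected activation is unchanged."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # Dropout is switched off at inference time.
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

A larger gap between validation error and training error is the usual signal for trying a higher `rate`; a small gap suggests dropout is not what the model needs.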
Model Complexity
- Increasing model complexity by adding more layers is incorrect.
- Extra layers may hurt the model, potentially causing overfitting.
AWS Glue Data Catalog
- Contains references to data used in ETL jobs.
- Essential for creating data warehouses or data lakes.
- Serves as an index for location, schema, and runtime metrics.
- Information stored as metadata tables.
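As a rough illustration of the catalog acting as an index for location and schema, the sketch below reads one metadata table. It assumes `glue_client` behaves like `boto3.client("glue")`; the helper name `describe_table` and any database or table names passed to it are hypothetical.

```python
def describe_table(glue_client, database, table):
    """Return (location, columns) for one Data Catalog table.

    `glue_client` is assumed to expose the Glue GetTable operation,
    as boto3.client("glue") does.
    """
    response = glue_client.get_table(DatabaseName=database, Name=table)
    descriptor = response["Table"]["StorageDescriptor"]
    # Each column entry carries a name and a type string.
    columns = [(col["Name"], col["Type"]) for col in descriptor["Columns"]]
    return descriptor["Location"], columns
```

With real credentials, a call might look like `describe_table(boto3.client("glue"), "my_database", "my_table")`, where both names are placeholders.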
EMR Cluster vs. AWS Glue
- Creating an EMR cluster involves more configuration effort than AWS Glue.
AWS Data Pipeline and AWS Glue Data Catalog
- Using AWS Data Pipeline to automate data transformation jobs, with AWS Glue Data Catalog storing the metadata, is incorrect.
- This approach requires configuring and managing compute resources for EMR.
Amazon EMR
- Instantly provisions capacity for data-intensive tasks.
- Suitable for applications like web indexing, data mining, log file analysis, machine learning, and more.
- Eliminates the need for time-consuming setup, management, or tuning of clusters.
Amazon QuickSight
- Scalable, serverless, embeddable BI service.
- Machine learning-powered business intelligence for the cloud.
- Enables easy creation and publication of interactive BI dashboards with ML-powered insights.
Generating Precision-Recall Data
- Amazon EMR is the best choice for generating precision-recall data in this scenario, because it is built for big data processing (150 TB here).
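For reference, "precision-recall data" means a precision and recall value per decision threshold. The sketch below computes those points on a single machine in plain Python; at the 150 TB scale mentioned above the same counting would be distributed across the cluster. The function name and the convention of reporting precision as 1.0 when nothing is predicted positive are assumptions of this example.

```python
def precision_recall_points(labels, scores, thresholds):
    """Return a list of (threshold, precision, recall) tuples.

    labels: true classes as 0 or 1; scores: predicted probabilities.
    """
    points = []
    for t in thresholds:
        pairs = list(zip(labels, scores))
        tp = sum(1 for y, s in pairs if s >= t and y == 1)
        fp = sum(1 for y, s in pairs if s >= t and y == 0)
        fn = sum(1 for y, s in pairs if s < t and y == 1)
        # Convention (assumed): precision is 1.0 when no positives are predicted.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((t, precision, recall))
    return points
```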
Custom CloudWatch Dashboards
- Direct creation of custom CloudWatch dashboards from S3 data is not possible.
Redshift in the Scenario
- Redshift has no application in this scenario; it is only used to store the output of EMR.
Pipe Input Mode vs. File Input Mode in SageMaker
- Pipe Input Mode
  - Data is streamed on the fly into the algorithm container without involving disk I/O.
  - Shortens the download process and reduces startup time.
  - Generally gives better read throughput than File input mode.
  - Enables training on datasets larger than the 16 TB EBS volume size limit.
- File Input Mode
  - Default mode for training in Amazon SageMaker.
  - Increases throughput, but is not the best choice among the given options in this scenario.
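The input mode is chosen per training job. As a hedged sketch, the helper below builds the `AlgorithmSpecification` fragment of a SageMaker `CreateTrainingJob` request, where `TrainingInputMode` selects the mode. The helper name is hypothetical, the image URI is whatever training image you use, and this sketch deliberately accepts only the two modes discussed above.

```python
def algorithm_specification(training_image, input_mode="File"):
    """Build the AlgorithmSpecification part of a CreateTrainingJob request.

    "File" is the default mode; "Pipe" streams data into the container.
    """
    if input_mode not in ("File", "Pipe"):
        # This sketch only covers the two modes compared in these notes.
        raise ValueError("input_mode must be 'File' or 'Pipe'")
    return {"TrainingImage": training_image, "TrainingInputMode": input_mode}
```

The returned dictionary would be passed as the `AlgorithmSpecification` argument alongside the other required job settings, which are omitted here.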
Amazon Elastic Inference
- Allows attaching low-cost GPU-powered acceleration to Amazon EC2 instances, SageMaker instances, or Amazon ECS tasks.
- Reduces deep learning inference costs by up to 75%.
- Supports TensorFlow, Apache MXNet, PyTorch, and ONNX models.
- Enables precise configuration of GPU-powered inference acceleration.
Text Cleaning in NLP
- An integral stage of the NLP pipeline that prepares unstructured text for structured processing.
- Examples include lowercase conversion, word tokenization, stop word removal, HTML tag removal, stemming, lemmatization, etc.
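Several of the cleaning steps listed above can be sketched with the Python standard library alone. The stop-word list here is a tiny illustrative sample, and stemming and lemmatization are left out because they normally rely on an NLP library.

```python
import html
import re

# Tiny illustrative stop-word list; real pipelines use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in"}


def clean_text(raw):
    """Strip HTML tags, lowercase, tokenize, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw)       # remove HTML tags
    text = html.unescape(text)                # decode entities such as &amp;
    text = text.lower()                       # lowercase conversion
    tokens = re.findall(r"[a-z0-9']+", text)  # simple word tokenization
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `clean_text("<p>The Night is <b>DARK</b> &amp; cold</p>")` returns `["night", "dark", "cold"]`.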
Fixing Spelling Errors
- Hard-coding a correction for one specific word (“niht” to “night”) is impractical to apply across all posts.
Part-of-Speech (PoS) Tagging
- Primarily used for categorizing words in a text corpus, not for text preprocessing.
One-Hot Encoding vs. Word2Vec
- One-hot encoding is unsuitable as input preparation for Word2Vec, as it captures semantic relationships between words poorly.
- Tokenizing the text into individual words is the better approach.
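A small sketch shows why one-hot vectors carry no semantic information: any two different words are orthogonal, so related and unrelated words look equally dissimilar. The helper names and the three-word vocabulary are made up for the example.

```python
def one_hot(vocabulary):
    """Map each word to a one-hot vector over the vocabulary."""
    size = len(vocabulary)
    return {
        word: [1 if i == position else 0 for i in range(size)]
        for position, word in enumerate(vocabulary)
    }


def dot(u, v):
    """Dot product, used here as a simple similarity score."""
    return sum(a * b for a, b in zip(u, v))


vectors = one_hot(["night", "evening", "car"])
# "night" is as far from "evening" as it is from "car": both products are 0.
similar = dot(vectors["night"], vectors["evening"])
unrelated = dot(vectors["night"], vectors["car"])
```

Learned embeddings such as Word2Vec place related words closer together, which one-hot vectors cannot express.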
SageMaker Object2Vec Algorithm Components
- Two input channels, two encoders (enc0 and enc1), and a comparator.
- Comparator compares embeddings and outputs scores indicating relationship strength.
- Encoders convert objects into fixed-length embedding vectors for comparison.
- Dropout hyperparameter reduces overfitting by trimming codependent neurons.
- L1 regularization is not available for Amazon SageMaker Object2Vec; it’s used for simple regression models.
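The encoder-plus-comparator layout can be pictured with a toy sketch. This is not the SageMaker implementation: the mean-pooling encoder, the cosine-similarity comparator, and the hand-written embedding table are all stand-ins chosen to keep the example short.

```python
import math


def encode(tokens, embeddings):
    """Toy encoder: average token embeddings into one fixed-length vector."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    dim = len(next(iter(embeddings.values())))
    pooled = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(embeddings[token]):
            pooled[i] += value
    return [value / len(tokens) for value in pooled]


def compare(u, v):
    """Toy comparator: cosine similarity as the relationship score."""
    product = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return product / norm if norm else 0.0
```

In this picture, one call to `encode` plays the role of enc0, a second call plays enc1, and `compare` scores how strongly the two encoded objects are related.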
Happy learning! :)