AWS Machine Learning Specialty Bite Size Recap 3/3
AWS Machine Learning Specialty 3/3
Amazon ML Evaluation Metrics
- Area Under the Curve (AUC)
- Measures model’s ability to predict higher scores for positive examples.
- Independent of score cut-off, providing insight into prediction accuracy without threshold selection.
- Receiver Operating Characteristic (ROC) Curve
- Graphical plot showing diagnostic ability as the discrimination threshold varies.
Evaluate with Scatter Plot
- Represent relationships between variables, not model evaluation.
Evaluate with Root Mean Square Error (RMSE)
- BAd for binary classification; used for regression models.
Credit Risk Scenario
- High accuracy might mislead if not balanced with true positives and false positives.
- AUC of 0.9, even with lower accuracy, is considered better for identifying risky loans.
Amazon SageMaker AutoPilot
- Simplifies ML model training by handling feature engineering, model training, and selection.
- Requires only the upload of a training dataset to S3.
k-NN Models
- Use Euclidean distance to measure similarity between target data and a specific class.
AWS Data Pipeline
- Reliably processes and moves data between AWS compute/storage services and on-premises sources.
- Transfers results efficiently to services like Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.
SageMaker Training Job
- Includes training data S3 URL, ML compute instance configuration, output S3 bucket URL, and training code ECR path.
AWS DMS Connection
- Can’t directly set up connections between Amazon RDS SQL Server or Amazon DynamoDB and SageMaker.
Buffering in Amazon Kinesis Data Firehose
- Buffers incoming streaming data based on size or time before delivering to destinations.
- Buffer size ranges from 1MB to 128MB for Amazon S3 and 1MB to 100MB for Amazon Elasticsearch Service.
- Dynamically adjusts buffer size to catch up if data delivery to the destination falls behind data writing.
Evaluation Metrics Formulas
- Recall TP / (TP + FN)
- False Negative Rate FN / (FN + TP)
- Cost Function (3 * FN) + FP
Transfer Learning
- Network initialized with pre-trained weights.
- Top fully connected layer initialized with random weights.
- Fine-tuning of the whole network with new data.
- Suitable for training with smaller datasets.
Imputing Missing Values
- Commonly involves replacing missing values with mean or median.
- Understanding data is crucial before choosing a replacement strategy.
- Supervised learning for approximating missing values often yields better results.
Amazon SageMaker Data Formats
- Protobuf recordIO format recommended for training.
- Pipe mode streams data directly from Amazon S3, providing faster start times and better throughput.
- Pipe mode doesn’t support Apache Parquet; File mode is slower and default but less efficient.
Synthetic Minority Oversampling Technique (SMOTE)
- Oversampling approach for the minority class, creating synthetic examples.
- Involves introducing synthetic examples along line segments of minority class nearest neighbors.
- Preferable for dealing with imbalanced datasets, such as fraudulent cases.
Apache Spark on Amazon EMR
- Best place for running Apache Spark.
- Allows easy creation of managed Spark clusters.
- Includes libraries for machine learning (MLlib), stream processing (Spark Streaming), and graph processing (GraphX).
- Collaborative filtering leverages other users’ experiences.
Apache HBase and Machine Learning
- Apache HBase is a non-relational distributed database, not suitable for machine learning tasks.
- Content-based filtering is more suited for predicting based on product attributes.
AWS Panorama for Computer Vision
- AWS Panorama allows bringing computer vision to on-premises cameras.
- Enables predictions locally with high accuracy and low latency.
- Suitable for leveraging existing IP cameras without AI capabilities.
Term Frequency - Inverse Document Frequency (TfIdf)
- Algorithm to convert text data into a numerical representation.
- Utilizes Term Frequency (word frequency in a sentence) and Inverse Document Frequency (word frequency in the whole corpus).
- Scikit-learn CountVectorizer Class is incorrect; it provides a simple word count.
Amazon S3 Storage Classes
- S3 Standard-IA designed for long-lived, infrequently accessed data.
- Retrieval fee applicable; suitable for infrequent access.
- Incurring costs using an EC2 instance is incorrect; Glacier Deep Archive provides delayed access; 30-day wait before transitioning to Standard-IA.
AWS Step Functions
- Serverless function orchestrator for sequencing AWS Lambda functions and multiple services.
- Visual interface for creating and running checkpointed and event-driven workflows.
- Efficient orchestration of multiple ETL jobs possible.
Amazon SageMaker Object2Vec
- General-purpose neural embedding algorithm.
- Highly customizable; learns low-dimensional dense embeddings.
- Not suitable for extracting embeddings representing compliance in a claim.
SVM with Radial Basis Function (RBF) Kernel
- Variation of SVM for separating non-linear data.
- Efficiently maps data to a higher dimension.
- Suitable for separating randomly distributed data in a 2-D space.
Happy learning! :)