AWS Machine Learning Specialty Bite Size Recap 3/3

Amazon ML Evaluation Metrics

  • Area Under the Curve (AUC)
    • Measures the model's ability to assign higher scores to positive examples than to negative examples.
    • Independent of score cut-off, providing insight into prediction accuracy without threshold selection.
  • Receiver Operating Characteristic (ROC) Curve
    • Graphical plot showing diagnostic ability as the discrimination threshold varies.
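One way to see why AUC does not depend on a cut-off: it can be computed as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below assumes small, made-up `labels` and `scores`; the function name `auc_score` is illustrative only.

```python
def auc_score(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up example data: 1 = positive, 0 = negative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = auc_score(labels, scores)  # 8 of 9 pairs are ranked correctly
```

No threshold appears anywhere in the calculation; only the ordering of the scores matters.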

Evaluate with Scatter Plot

  • Scatter plots show relationships between variables; they are not a model evaluation metric.

Evaluate with Root Mean Square Error (RMSE)

  • Not appropriate for binary classification; RMSE is used to evaluate regression models.
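As a quick illustration, RMSE is the square root of the mean squared difference between actual and predicted values. The numbers below are made up, and `rmse` is just an illustrative helper name.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up regression targets and predictions.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
error = rmse(actual, predicted)  # squared errors 1, 0, 4, 1 -> mean 1.5
```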

Credit Risk Scenario

  • High accuracy might mislead if not balanced with true positives and false positives.
  • AUC of 0.9, even with lower accuracy, is considered better for identifying risky loans.
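A minimal sketch of how accuracy alone can mislead on imbalanced data. The loan counts are invented for illustration: 5 risky loans out of 100, and a model that labels every loan as safe.

```python
# Made-up data: 1 = risky loan, 0 = safe loan.
labels = [1] * 5 + [0] * 95
predictions = [0] * 100  # a model that never flags a risky loan

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)
```

Here accuracy comes out at 0.95 while recall is 0.0, so the model looks strong on accuracy yet finds none of the risky loans.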

Amazon SageMaker AutoPilot

  • Simplifies ML model training by handling feature engineering, model training, and selection.
  • Requires only the upload of a training dataset to S3.

k-NN Models

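  • Use a distance measure, typically Euclidean distance, to find the labeled training points most similar to the target data; the target is assigned the majority class among those neighbors.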
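A small sketch of k-NN classification with Euclidean distance, using only the standard library. The points, labels, and the helper name `knn_predict` are assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)  # Euclidean distance
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2-D training data with two classes.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(train_points, train_labels, (2, 2))
near_b = knn_predict(train_points, train_labels, (8.5, 8.5))
```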

AWS Data Pipeline

  • Reliably processes and moves data between AWS compute/storage services and on-premises sources.
  • Transfers results efficiently to services like Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.

SageMaker Training Job

  • Includes training data S3 URL, ML compute instance configuration, output S3 bucket URL, and training code ECR path.
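For reference, these pieces map onto the request passed to the boto3 SageMaker client's `create_training_job` call. The sketch below only builds the request dictionary; every name, ARN, bucket, and image URI is a placeholder, and the `RoleArn` and `StoppingCondition` entries are additions the API also expects rather than items from the list above.

```python
# All values are hypothetical placeholders, not real resources.
training_job_request = {
    "TrainingJobName": "example-training-job",
    "AlgorithmSpecification": {
        # Training code: path to the container image in Amazon ECR.
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-image:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    # Training data S3 URL.
                    "S3Uri": "s3://example-bucket/train/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
    # Output S3 location for model artifacts.
    "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/output/"},
    # ML compute instance configuration.
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}
```

With real values filled in, the dictionary could be passed as keyword arguments to `create_training_job`.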

AWS DMS Connection

  • Can’t directly set up connections between Amazon RDS SQL Server or Amazon DynamoDB and SageMaker.

Buffering in Amazon Kinesis Data Firehose

  • Buffers incoming streaming data based on size or time before delivering to destinations.
  • Buffer size ranges from 1 MB to 128 MB for Amazon S3 and from 1 MB to 100 MB for Amazon Elasticsearch Service.
  • Dynamically adjusts buffer size to catch up if data delivery to the destination falls behind data writing.
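The "size or time" rule can be pictured with a toy buffer that delivers a batch as soon as either limit is reached. This is a simplified sketch, not the Firehose API: `BufferSketch` and its parameters are hypothetical, the limits are tiny for readability, and the time check only runs when a record arrives, whereas a real service would also flush on a timer.

```python
class BufferSketch:
    """Toy buffer that flushes when a size limit or a time limit is hit."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.buffered_bytes = 0
        self.window_start = None
        self.delivered = []  # each entry is one delivered batch

    def put(self, record, now):
        if self.window_start is None:
            self.window_start = now  # first record opens a new window
        self.records.append(record)
        self.buffered_bytes += len(record)
        size_hit = self.buffered_bytes >= self.max_bytes
        time_hit = now - self.window_start >= self.max_seconds
        if size_hit or time_hit:
            self.flush()

    def flush(self):
        self.delivered.append(self.records)
        self.records = []
        self.buffered_bytes = 0
        self.window_start = None


buf = BufferSketch(max_bytes=10, max_seconds=60)
for record, now in [(b"aaaa", 0), (b"bbbb", 1), (b"cccc", 2)]:
    buf.put(record, now)   # third record pushes the buffer past 10 bytes
buf.put(b"dd", 10)
buf.put(b"ee", 75)         # 65 seconds after the window opened
```

The first batch is delivered because of size, the second because of elapsed time.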

Evaluation Metrics Formulas

  • Recall = TP / (TP + FN)
  • False Negative Rate = FN / (FN + TP)
  • Cost Function = (3 * FN) + FP
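The formulas can be checked with a few lines of Python. The confusion-matrix counts below are made up, and the function names are illustrative.

```python
def recall(tp, fn):
    return tp / (tp + fn)


def false_negative_rate(tp, fn):
    return fn / (fn + tp)


def weighted_cost(fn, fp, fn_weight=3):
    """Cost where each false negative counts fn_weight times a false positive."""
    return fn_weight * fn + fp


# Made-up counts: 80 true positives, 20 false negatives, 10 false positives.
tp, fn, fp = 80, 20, 10
r = recall(tp, fn)                  # 80 / 100
fnr = false_negative_rate(tp, fn)   # 20 / 100
cost = weighted_cost(fn, fp)        # 3 * 20 + 10
```

Recall and the false negative rate share the same denominator, so they always sum to 1.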

Transfer Learning

  • Network initialized with pre-trained weights.
  • Top fully connected layer initialized with random weights.
  • Fine-tuning of the whole network with new data.
  • Suitable for training with smaller datasets.

Imputing Missing Values

  • Commonly involves replacing missing values with mean or median.
  • Understanding data is crucial before choosing a replacement strategy.
  • Supervised learning for approximating missing values often yields better results.
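A short sketch of mean versus median imputation, with `None` standing in for a missing value. The numbers are invented, and the outlier is included on purpose to show why understanding the data matters before choosing a strategy.

```python
import statistics

# Made-up column with two missing entries and one outlier (100.0).
values = [4.0, None, 6.0, 100.0, None, 5.0]
observed = [v for v in values if v is not None]

mean_fill = statistics.mean(observed)      # pulled upward by the outlier
median_fill = statistics.median(observed)  # robust to the outlier

mean_imputed = [mean_fill if v is None else v for v in values]
median_imputed = [median_fill if v is None else v for v in values]
```

In this example the mean fill is 28.75 and the median fill is 5.5, so the two strategies produce very different replacements.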

Amazon SageMaker Data Formats

  • Protobuf recordIO format recommended for training.
  • Pipe mode streams data directly from Amazon S3, providing faster start times and better throughput.
  • Pipe mode doesn't support Apache Parquet; File mode is the default, but it is slower because the data is copied to the training instance before training starts.

Synthetic Minority Oversampling Technique (SMOTE)

  • Oversampling approach for the minority class, creating synthetic examples.
  • Involves introducing synthetic examples along line segments of minority class nearest neighbors.
  • Preferable for dealing with imbalanced datasets, such as fraudulent cases.
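The interpolation idea behind SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the line segment between them. This is a simplified illustration with made-up points and a hypothetical `smote_sketch` helper, not a replacement for a library implementation.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic points between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen sample (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Made-up minority-class points in 2-D.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
synthetic = smote_sketch(minority, n_synthetic=5)
```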

Apache Spark on Amazon EMR

  • Amazon EMR is a managed platform well suited to running Apache Spark.
  • Allows easy creation of managed Spark clusters.
  • Includes libraries for machine learning (MLlib), stream processing (Spark Streaming), and graph processing (GraphX).
  • Collaborative filtering leverages other users’ experiences.

Apache HBase and Machine Learning

  • Apache HBase is a non-relational distributed database, not suitable for machine learning tasks.
  • Content-based filtering is more suited for predicting based on product attributes.

AWS Panorama for Computer Vision

  • AWS Panorama allows bringing computer vision to on-premises cameras.
  • Enables predictions locally with high accuracy and low latency.
  • Suitable for leveraging existing IP cameras without AI capabilities.

Term Frequency - Inverse Document Frequency (TfIdf)

  • Algorithm to convert text data into a numerical representation.
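  • Combines Term Frequency (how often a word appears in a document) with Inverse Document Frequency (which down-weights words that appear in many documents across the corpus).
  • Scikit-learn's CountVectorizer class is not the right choice here because it produces only raw word counts; TfidfVectorizer produces TF-IDF weights.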
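A bare-bones sketch of the textbook TF-IDF calculation, assuming whitespace tokenization and three made-up documents. Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization, so their numbers will differ from this version.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up corpus.
documents = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(documents)
```

A word such as "the" that occurs in every document gets a weight of zero, while a word unique to one document gets the highest weight there.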

Amazon S3 Storage Classes

  • S3 Standard-IA designed for long-lived, infrequently accessed data.
  • Retrieval fee applicable; suitable for infrequent access.
  • Options that rely on an EC2 instance are incorrect because they incur extra cost; Glacier Deep Archive delays access to the data; objects must be stored for 30 days before a lifecycle transition to Standard-IA.

AWS Step Functions

  • Serverless function orchestrator for sequencing AWS Lambda functions and multiple services.
  • Visual interface for creating and running checkpointed and event-driven workflows.
  • Efficient orchestration of multiple ETL jobs possible.

Amazon SageMaker Object2Vec

  • General-purpose neural embedding algorithm.
  • Highly customizable; learns low-dimensional dense embeddings.
  • Not suitable for extracting embeddings representing compliance in a claim.

SVM with Radial Basis Function (RBF) Kernel

  • Variation of SVM for separating non-linear data.
  • Implicitly maps data to a higher-dimensional space via the kernel trick, where the classes can become linearly separable.
  • Suitable for data in a 2-D space whose classes can't be separated by a straight line.
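For reference, the RBF kernel scores the similarity of two points as exp(-gamma * squared distance). The sketch below uses made-up points; `gamma` is a tunable parameter and its value here is arbitrary.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same_point = rbf_kernel((0.0, 0.0), (0.0, 0.0))            # identical points
far_apart = rbf_kernel((0.0, 0.0), (3.0, 4.0), gamma=0.1)  # squared distance 25
```

Identical points score 1.0, and the score decays toward 0 as points move apart, which is what lets the SVM draw curved decision boundaries in the original space.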

Happy learning! :)