AWS Machine Learning Specialty: Bite Size Recap 1/3

Amazon Personalize

  • Configure an event tracker in Amazon Personalize to record real-time user interactions using AWS Python SDK, AWS Amplify, or AWS CLI.
  • A model trained only on historical data may yield increasingly stale recommendations over time. Feeding real-time user interactions through an event tracker helps overcome this.
  • A “recipe” is a preconfigured algorithm suited to a given use case.
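
As a rough sketch of the event-tracker bullet above, the snippet below builds a PutEvents request for the `personalize-events` client in the AWS Python SDK (boto3). It assumes an event tracker already exists and AWS credentials are configured. The helper names `build_put_events_request` and `send_event` are hypothetical, and the tracking ID, user, session, and item values are placeholders.

```python
from datetime import datetime, timezone


def build_put_events_request(tracking_id, user_id, session_id, item_id,
                             event_type="click", sent_at=None):
    """Build keyword arguments for the personalize-events PutEvents call."""
    return {
        "trackingId": tracking_id,  # ID returned when the event tracker is created
        "userId": user_id,
        "sessionId": session_id,
        "eventList": [
            {
                "eventType": event_type,
                "itemId": item_id,
                "sentAt": sent_at or datetime.now(timezone.utc),
            }
        ],
    }


def send_event(request):
    """Send one interaction event; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without the SDK

    client = boto3.client("personalize-events")
    return client.put_events(**request)


# Placeholder values for illustration only.
request = build_put_events_request(
    tracking_id="YOUR_TRACKING_ID",
    user_id="user-123",
    session_id="session-abc",
    item_id="item-42",
)
# send_event(request)  # uncomment in an environment with AWS access
```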

AWS Glue Data Catalog

  • Contains references to data used as sources and targets for extract, transform, and load (ETL) jobs in AWS Glue.

Amazon Athena

  • Allows easy analysis of data in Amazon S3 using standard SQL.
  • Operates in a serverless manner.
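
To make the "standard SQL on S3" point concrete, here is a minimal sketch of starting an Athena query through boto3's `start_query_execution`. The helper names, the `orders` table, the `example_db` database, and the S3 output location are all made up for illustration, and the call itself is left commented out because it needs AWS credentials.

```python
def build_athena_query_request(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 prefix.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Start the query and return its execution ID; needs boto3 and credentials."""
    import boto3  # imported lazily so the builder works without the SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical database, table, and bucket names.
athena_request = build_athena_query_request(
    sql="SELECT status, COUNT(*) AS n FROM orders GROUP BY status",
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
)
# query_id = run_query(athena_request)  # uncomment in an environment with AWS access
```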

AWS Glue

  • Operates in a serverless manner.
  • Used for ETL jobs.

Athena and AWS Glue Combined Usage

  • Use AWS Glue for ETL jobs and Athena for SQL queries on processed data.
  • Athena supports various data formats, including CSV, TSV, JSON, plain text files, Apache ORC, and Apache Parquet.
  • Compression, partitioning, and using columnar formats like Apache Parquet can enhance performance and reduce costs.
  • Parquet and ORC support predicate pushdown, optimizing query execution based on statistics stored in blocks.
  • Athena charges based on the amount of data scanned per query, allowing cost savings through data partitioning, compression, and columnar conversion.
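
The toy model below illustrates why predicate pushdown reduces the data scanned, and therefore the cost. It is not how Parquet or Athena are implemented: `RowGroup` is a hypothetical stand-in for a block that stores min/max statistics, and the sizes and years are invented.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A toy stand-in for a columnar block with min/max column statistics."""
    min_year: int
    max_year: int
    size_bytes: int


def bytes_scanned(row_groups, year):
    """Sum the sizes of row groups whose [min, max] range could contain `year`."""
    return sum(
        g.size_bytes for g in row_groups
        if g.min_year <= year <= g.max_year
    )


# Hypothetical file: three 128 MB row groups, sorted by year.
MB = 1024 * 1024
groups = [
    RowGroup(2019, 2020, 128 * MB),
    RowGroup(2021, 2022, 128 * MB),
    RowGroup(2023, 2024, 128 * MB),
]

full_scan = sum(g.size_bytes for g in groups)
pruned_scan = bytes_scanned(groups, year=2023)
print(f"full scan:     {full_scan // MB} MB")    # 384 MB
print(f"with pushdown: {pruned_scan // MB} MB")  # 128 MB
```

Only the row group whose statistics can match `year = 2023` is read, so a query billed by data scanned would pay for one third of the file in this example.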

Apache Parquet

  • Open-source columnar storage format.
  • Up to 2x faster to unload and takes up to 6x less storage in Amazon S3 compared to text formats.
  • Can be copied from Amazon S3 into an Amazon Redshift cluster.
  • AWS Glue can be configured to run transformation jobs that convert CSV to Parquet.
  • Well-suited for AWS analytics services like Amazon Athena and Amazon Redshift Spectrum.

Kinesis Data Analytics

  • Cannot directly run queries against data stored in an S3 bucket.

AWS Batch

  • Enables easy running of thousands of batch computing jobs on AWS.
  • No need to install and manage batch computing software or server clusters.
  • Lets you focus on analyzing results and solving problems.

AI Knowledge

  • Transfer Learning:
    • Network initialized with pre-trained weights; only the top fully connected layer has random weights.
    • Whole network fine-tuned with new data.
  • Bias and Variance:
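    • Bias: Error introduced when a model makes overly simple assumptions about the relationship between the features and the target variable.
    • Variance: Error introduced when a model is too sensitive to small fluctuations in the training data, so it generalizes poorly to unseen data.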
    • High-bias model is underfitting; high-variance model is overfitting; a balanced model has low bias and low variance.
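  • Reducing Error with Data Augmentation:
    • Adding more images to the training data through data augmentation mainly reduces variance (overfitting) and can correct a dataset skewed toward certain classes. High bias (underfitting) instead calls for a more expressive model or better features.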
  • Neural Network Layers:
    • The number of layers needed depends on the complexity of the problem.
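
The bias and variance bullets above can be illustrated with two deliberately extreme models on the same noisy data. This is a minimal sketch under assumed settings (a sine curve with Gaussian noise, 30 training points, a fixed seed), and every name in it is hypothetical.

```python
import math
import random


def mse(model, points):
    """Mean squared error of `model` over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


def make_data(n, rng):
    """Noisy samples of y = sin(x) on [0, 2*pi]."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 2 * math.pi)
        data.append((x, math.sin(x) + rng.gauss(0, 0.3)))
    return data


rng = random.Random(0)
train = make_data(30, rng)
test = make_data(200, rng)

# High-bias model: ignores x and always predicts the training mean.
mean_y = sum(y for _, y in train) / len(train)


def constant_model(x):
    return mean_y


# High-variance model: 1-nearest-neighbour memorises every training point.
def nearest_neighbour_model(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]


for name, model in [("constant", constant_model), ("1-NN", nearest_neighbour_model)]:
    print(f"{name:>8}: train MSE={mse(model, train):.3f}  test MSE={mse(model, test):.3f}")
```

The constant model underfits, so its error is large on the training set itself. The 1-NN model reproduces the training set exactly (zero training error) but still makes errors on unseen points, which is the signature of overfitting.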

Bayesian Network

  • A directed acyclic graph representing the joint probability distribution of a set of random variables and their conditional dependencies, which are often interpreted as causal relationships.
  • Nodes represent random variables, edges represent direct dependencies, and each node has a conditional probability distribution given its parents.
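
A tiny worked example may help here. The sketch below uses a hypothetical two-node network, Rain -> WetGrass, with made-up probabilities, and computes the joint distribution from the per-node conditional distributions before applying Bayes' rule.

```python
# Hypothetical two-node network: Rain -> WetGrass (probabilities are invented).
p_rain = 0.2                                # P(Rain = true)
p_wet_given_rain = {True: 0.9, False: 0.1}  # P(WetGrass = true | Rain)


def joint(rain, wet):
    """P(Rain = rain, WetGrass = wet) via the chain rule P(R) * P(W | R)."""
    p_r = p_rain if rain else 1 - p_rain
    p_w = p_wet_given_rain[rain] if wet else 1 - p_wet_given_rain[rain]
    return p_r * p_w


# Marginalise over Rain to get P(WetGrass = true).
p_wet = joint(True, True) + joint(False, True)

# Bayes' rule: P(Rain = true | WetGrass = true).
p_rain_given_wet = joint(True, True) / p_wet
print(round(p_wet, 4), round(p_rain_given_wet, 4))
```

With these numbers, observing wet grass raises the probability of rain from 0.2 to roughly 0.69.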

Pearson’s Correlation Coefficient

  • Measures the strength and direction of the linear relationship between two variables.
  • Ranges from -1 to 1: closer to 1 indicates a positive correlation; closer to -1 indicates a negative correlation; near 0 means little or no linear correlation.
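
For reference, the coefficient can be computed directly from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the sample lists are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))  # perfectly increasing line
print(pearson_r(xs, [10, 8, 6, 4, 2]))  # perfectly decreasing line
print(pearson_r(xs, [1, 3, 2, 3, 1]))   # no linear trend
```

The three calls give 1, -1, and 0 respectively. Note that the last series is not random: a coefficient near 0 rules out a linear relationship, not every relationship.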

Logarithmic Transformation

  • Helps positively skewed data conform to normally distributed data.
  • Positively skewed distribution has values clustering to the left with a longer right tail.
  • Normal distribution is symmetrical about the mean.
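
The effect can be checked numerically by comparing skewness before and after the transform. The sample below is a contrived, hypothetical one (exponentials of a symmetric set), chosen so that the logged values come out exactly symmetric; real data will usually only get closer to symmetric.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the std deviation cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Positively skewed sample: most values are small, a few are large.
raw = [math.exp(v) for v in (-2, -1, 0, 1, 2)]
logged = [math.log(v) for v in raw]  # math.log requires strictly positive values

print(f"skewness before: {skewness(raw):.3f}")
print(f"skewness after:  {skewness(logged):.3f}")
```

The raw sample has clearly positive skewness, while the logged sample has skewness of essentially zero.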

Laplace Transform

  • Transformation method simplifying complex differential equations into algebraic equations.
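  • Widely used in control theory and signal processing, chiefly for continuous-time systems (the Z-transform plays the analogous role for discrete-time, digital signals).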
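
As a short worked example of turning a differential equation into an algebraic one, consider a first-order equation with a constant a and initial value y_0, both chosen here purely for illustration:

```latex
\begin{align*}
y'(t) + a\,y(t) &= 0, \qquad y(0) = y_0 \\
\mathcal{L}\{y'(t)\} &= s\,Y(s) - y_0 \\
s\,Y(s) - y_0 + a\,Y(s) &= 0 \\
Y(s) &= \frac{y_0}{s + a} \\
y(t) &= y_0\,e^{-at}
\end{align*}
```

The third line is purely algebraic in Y(s); solving it and inverting the transform recovers the familiar exponential decay.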

Amazon EMR Spot Instances

  • Task nodes process data but do not hold persistent data in HDFS.
  • If a Spot task node is interrupted, for example when Spot prices rise or capacity is reclaimed, no HDFS data is lost.
  • Run Core nodes on Spot Instances only when data loss is tolerable.
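
One way to express this layout is through the `InstanceGroups` list accepted by boto3's EMR `run_job_flow` call, sketched below. The helper name, group names, instance type, and counts are assumptions for illustration, and the rest of the cluster configuration is omitted.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Primary and core nodes On-Demand; only task nodes on Spot."""
    return [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",  # holds HDFS data, so kept On-Demand
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
        {
            "Name": "Task",
            "InstanceRole": "TASK",  # no HDFS data, safe to interrupt
            "Market": "SPOT",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        },
    ]


instance_groups = build_instance_groups(core_count=2, task_count=4)
# Would be passed as Instances={"InstanceGroups": instance_groups, ...}
# to boto3.client("emr").run_job_flow(...) along with the other cluster settings.
```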

Happy learning! :)