AWS Machine Learning Specialty: Bite Size Recap 1/3
Amazon Personalize
- Configure an event tracker in Amazon Personalize to record real-time user interactions using the AWS SDK for Python (Boto3), AWS Amplify, or the AWS CLI.
 - A system trained only on historical data may yield poorer results over time as user behavior changes. An event tracker that records real-time user interactions helps overcome this.
 - A “recipe” is an Amazon Personalize algorithm prepared for a given use case.
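
As a rough sketch of recording one interaction with the AWS SDK for Python (Boto3): the tracking ID, user, session, and item values below are placeholders, and `build_put_events_request` is a hypothetical helper, not part of the SDK. The request is only built here, not sent; the commented lines show where the call would go, assuming the keyword arguments of `put_events` on the `personalize-events` client.

```python
from datetime import datetime, timezone


def build_put_events_request(tracking_id, user_id, session_id, item_id,
                             event_type="click"):
    """Build keyword arguments for the personalize-events PutEvents call.

    Hypothetical helper for illustration; all values passed in are placeholders.
    """
    return {
        "trackingId": tracking_id,  # returned when the event tracker is created
        "userId": user_id,
        "sessionId": session_id,
        "eventList": [
            {
                "eventType": event_type,
                "itemId": item_id,
                "sentAt": datetime.now(timezone.utc),
            }
        ],
    }


request = build_put_events_request(
    tracking_id="EXAMPLE-TRACKING-ID",
    user_id="user-42",
    session_id="session-1",
    item_id="item-7",
)

# With AWS credentials configured, the event could then be sent like this:
#   import boto3
#   boto3.client("personalize-events").put_events(**request)
```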
 
AWS Glue Data Catalog
- Contains references to data used as sources and targets for extract, transform, and load (ETL) jobs in AWS Glue.
 
Amazon Athena
- Allows easy analysis of data in Amazon S3 using standard SQL.
 - Operates in a serverless manner.
 
AWS Glue
- Operates in a serverless manner.
 - Used for ETL jobs.
 
Athena and AWS Glue Combined Usage
- Use AWS Glue for ETL jobs and Athena for SQL queries on processed data.
 - Athena supports various data formats, including CSV, TSV, JSON, text files, Apache ORC, and Apache Parquet.
 - Compression, partitioning, and using columnar formats like Apache Parquet can enhance performance and reduce costs.
 - Parquet and ORC support predicate pushdown, optimizing query execution based on statistics stored in blocks.
 - Athena charges based on the amount of data scanned per query, allowing cost savings through data partitioning, compression, and columnar conversion.
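
To illustrate how block statistics and per-byte pricing interact, here is a toy model in plain Python (not Athena or Parquet code). Each block carries hypothetical min/max statistics for a `year` column; a filter skips blocks whose range cannot match, and the cost is computed from the bytes that remain. The block sizes and the price of 5 USD per TB scanned are assumptions for the arithmetic, so check current Athena pricing for your Region.

```python
PRICE_PER_TB_USD = 5.0      # assumed price per TB scanned, for illustration only
BYTES_PER_TB = 1024 ** 4

# Hypothetical blocks: (min_year, max_year, size_in_bytes)
blocks = [
    (2018, 2019, 200 * 1024 ** 3),
    (2020, 2021, 200 * 1024 ** 3),
    (2022, 2023, 200 * 1024 ** 3),
]


def bytes_scanned(blocks, year):
    """Sum the sizes of blocks whose [min, max] range could contain `year`."""
    return sum(size for lo, hi, size in blocks if lo <= year <= hi)


def query_cost_usd(num_bytes):
    """Cost of a query that scans `num_bytes`, at the assumed price."""
    return num_bytes / BYTES_PER_TB * PRICE_PER_TB_USD


full_scan = sum(size for _, _, size in blocks)   # no statistics: read everything
pruned_scan = bytes_scanned(blocks, year=2022)   # statistics: read one block

print(f"full scan:   ${query_cost_usd(full_scan):.2f}")
print(f"pruned scan: ${query_cost_usd(pruned_scan):.2f}")
```

Partitioning and compression reduce cost through the same mechanism: fewer bytes are read per query.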
 
Apache Parquet
- Open-source columnar storage format.
 - According to AWS, up to 2x faster to unload and takes up to 6x less storage in Amazon S3 compared to text formats.
 - Parquet files can be copied from Amazon S3 into an Amazon Redshift cluster.
 - AWS Glue can configure and run jobs that transform CSV data to Parquet.
 - Well-suited for AWS analytics services like Amazon Athena and Amazon Redshift Spectrum.
 
Kinesis Data Analytics
- Analyzes streaming data; it cannot directly run queries against data stored in an S3 bucket.
 
AWS Batch
- Enables easy running of thousands of batch computing jobs on AWS.
 - No need to install and manage batch computing software or server clusters.
 - Lets you focus on analyzing results and solving problems.
 
AI Knowledge
- Transfer Learning:
    - The network is initialized with pre-trained weights; only the top fully connected layer starts with random weights.
    - The whole network is then fine-tuned with the new data.
 
- Bias and Variance:
    - Bias: error from overly simple assumptions the model makes about the target variable.
    - Variance: error from the model being too sensitive to small fluctuations in the training data, so it generalizes poorly to unseen data.
    - A high-bias model underfits; a high-variance model overfits; a balanced model has low bias and low variance.
 
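- Reducing Bias Error:
    - Add more images to the training data through data augmentation when the bias comes from under-represented cases in the dataset. More data mainly reduces variance (overfitting); an underfitting model usually needs more capacity or better features.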
 
- Neural Network Layers:
    - The number of layers needed depends on the complexity of the problem.
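
The bias and variance definitions above can be made concrete with a small, dependency-free sketch. The data set is made up for illustration: the true relationship is `y = x`, training targets carry alternating +1/-1 noise, and the three "models" (a constant mean predictor, a 1-nearest-neighbour lookup, and a least-squares line) are stand-ins chosen here to represent underfitting, overfitting, and a balanced fit.

```python
def mse(predict, xs, ys):
    """Mean squared error of `predict` over paired inputs and targets."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: the true relationship is y = x; training targets carry +/-1 noise.
train_x = [float(i) for i in range(10)]
train_y = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]
test_x = [x + 0.25 for x in train_x]
test_y = list(test_x)  # noise-free targets from the true relationship

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def underfit(x):
    """High bias: always predict the training mean, ignoring x entirely."""
    return mean_y


def overfit(x):
    """High variance: 1-nearest-neighbour memorises every noisy training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Balanced: a least-squares straight line fitted to the training data.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def balanced(x):
    return intercept + slope * x


for name, model in [("underfit", underfit), ("overfit", overfit),
                    ("balanced", balanced)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

On this toy data the mean predictor has a large error on both sets, while the nearest-neighbour model has zero training error but a larger test error than the straight line.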
 
 
Bayesian Network
- Represents the joint probability distribution of a set of random variables that may be causally related, drawn as a directed acyclic graph.
 - Nodes represent random variables, edges represent conditional dependencies (often read as causal relationships), and each node has a conditional probability distribution given its parents.
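
A minimal sketch of how such a network defines a joint distribution, using the classic rain/sprinkler/wet-grass layout. The structure and every probability below are made-up illustration values; the point is only that the joint probability is the product of each node's conditional distribution given its parents.

```python
from itertools import product

# Hypothetical network: Rain -> Sprinkler, and (Sprinkler, Rain) -> WetGrass.
# All probabilities below are made-up illustration values.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler_given_rain = {
    True: {True: 0.01, False: 0.99},   # P(sprinkler | rain=True)
    False: {True: 0.4, False: 0.6},    # P(sprinkler | rain=False)
}
p_wet_true_given = {  # keyed by (sprinkler, rain) -> P(wet=True | parents)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) as a product of each node's conditional."""
    p_wet_true = p_wet_true_given[(sprinkler, rain)]
    p_wet = p_wet_true if wet else 1.0 - p_wet_true
    return p_rain[rain] * p_sprinkler_given_rain[rain][sprinkler] * p_wet


states = [True, False]
# Marginal P(wet=True): sum the joint over the other two variables.
p_wet_marginal = sum(joint(r, s, True) for r, s in product(states, repeat=2))
total = sum(joint(r, s, w) for r, s, w in product(states, repeat=3))

print(f"P(wet grass) = {p_wet_marginal:.5f}")
print(f"sum over all states = {total:.5f}")
```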
 
Pearson’s Correlation Coefficient
- Measures the strength and direction of the linear relationship between two variables.
 - Values closer to 1 indicate a stronger positive correlation; values closer to -1 indicate a stronger negative correlation; values near 0 indicate little or no linear correlation.
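
A small sketch of the coefficient computed from its definition (covariance divided by the product of the standard deviations), using only the standard library. The sample values are invented to show the three cases.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])   # y rises with x
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])   # y falls as x rises
r_none = pearson_r(xs, [2, 4, 5, 4, 2])        # symmetric hump: no linear trend

print(r_positive, r_negative, r_none)
```

The last case shows why the coefficient only captures linear relationships: the values clearly depend on `x`, yet the linear correlation is zero.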
 
Logarithmic Transformation
- Helps positively skewed data more closely approximate a normal distribution.
 - A positively skewed distribution has values clustered on the left with a longer right tail.
 - A normal distribution is symmetrical about the mean.
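
To see the effect on a concrete sample, the sketch below measures skewness before and after taking logarithms. The sample is hypothetical and deliberately built from powers of ten so that its logarithms are symmetric; real data will rarely end up this clean. Note that the logarithm is only defined for positive values.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positively skewed sample: most values small, one very large.
raw = [1, 10, 10, 100, 100, 100, 1000, 1000, 10000]
logged = [math.log10(v) for v in raw]

skew_before = skewness(raw)
skew_after = skewness(logged)

print(f"skewness before log transform: {skew_before:.2f}")
print(f"skewness after log transform:  {abs(skew_after):.2f}")
```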
 
Laplace Transform
- Transformation method that converts differential equations into algebraic equations, which are easier to solve.
 - Mainly used in signal processing and control theory.
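
As a short worked example of turning a differential equation into an algebraic one, the equation $y' + ay = 0$ (chosen here purely for illustration) can be solved with the standard derivative rule of the Laplace transform:

```latex
\mathcal{L}\{f(t)\} = F(s) = \int_0^\infty f(t)\, e^{-st}\, dt, \qquad
\mathcal{L}\{f'(t)\} = sF(s) - f(0)

y'(t) + a\,y(t) = 0
\;\Longrightarrow\; sY(s) - y(0) + a\,Y(s) = 0
\;\Longrightarrow\; Y(s) = \frac{y(0)}{s + a}
\;\Longrightarrow\; y(t) = y(0)\, e^{-at}
```

The middle step is plain algebra in $s$; the last step uses the known transform pair $\mathcal{L}\{e^{-at}\} = 1/(s + a)$.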
 
Amazon EMR Spot Instances
- Task nodes process data but do not hold persistent data in HDFS.
 - If a task node's Spot Instance is reclaimed (for example, when the Spot price rises above your maximum price), no HDFS data is lost.
 - Core nodes do store HDFS data, so consider running them on Spot Instances only when data loss is tolerable.
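
One way this split might be expressed when launching a cluster with Boto3 is sketched below. `build_instance_groups` is a hypothetical helper, and the instance types and counts are placeholders; only the `InstanceRole` and `Market` values reflect the guidance above. The list is built but nothing is launched.

```python
def build_instance_groups(task_count=4):
    """Instance groups for an EMR cluster with Spot used only for task nodes.

    Hypothetical helper; instance types and counts are placeholder values.
    """
    return [
        {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        # Core nodes hold HDFS data, so they stay On-Demand here.
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 2},
        # Task nodes hold no HDFS data, so losing them costs only compute.
        {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": "m5.xlarge", "InstanceCount": task_count},
    ]


groups = build_instance_groups()
spot_roles = [g["InstanceRole"] for g in groups if g["Market"] == "SPOT"]
print("roles on Spot:", spot_roles)

# With AWS credentials configured, the list could be passed along roughly like
# this (other required arguments such as Name, ReleaseLabel, and IAM roles
# are omitted):
#   import boto3
#   boto3.client("emr").run_job_flow(..., Instances={"InstanceGroups": groups})
```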
 
Happy learning! :)