AWS Machine Learning Specialty: Bite Size Recap 1/3
AWS Machine Learning Specialty 1/3
Amazon Personalize
- Configure an event tracker in Amazon Personalize to record real-time user interactions using AWS Python SDK, AWS Amplify, or AWS CLI.
- If the system is trained on historical data, it may yield poor results over time. Create an event tracker based on real-time user interactions to overcome this challenge.
- A “recipe” is a term specifying an appropriate algorithm for a given use case.
AWS Glue Data Catalog
- Contains references to data used as sources and targets for extract, transform, and load (ETL) jobs in AWS Glue.
Amazon Athena
- Allows easy data analysis in AWS S3 using standard SQL.
- Operates in a serverless manner.
AWS Glue
- Operates in a serverless manner.
- Used for ETL jobs.
Athena and AWS Glue Combined Usage
- Use AWS Glue for ETL jobs and Athena for SQL queries on processed data.
- Supports various data formats, including CSV, TSV, JSON, Textfiles, Apache ORC, and Apache Parquet.
- Compression, partitioning, and using columnar formats like Apache Parquet can enhance performance and reduce costs.
- Parquet and ORC support predicate pushdown, optimizing query execution based on statistics stored in blocks.
- Athena charges based on the amount of data scanned per query, allowing cost savings through data partitioning, compression, and columnar conversion.
Apache Parquet
- Open-source columnar storage format.
- 2x faster and takes up 6x less storage in Amazon S3 compared to other text formats.
- Copyable to Amazon Redshift cluster from Amazon S3.
- Configurable and runnable transformation jobs from CSV to Parquet using AWS Glue.
- Well-suited for AWS analytics services like Amazon Athena and Amazon Redshift Spectrum.
Kinesis Data Analytics
- Cannot directly run queries against data stored in S3 bucket.
AWS Batch
- Enables easy running of thousands of batch computing jobs on AWS.
- No need to install and manage batch computing software or server clusters.
- Focus on analyzing results and solving problems.
AI Knowledge
- Transfer Learning:
- Network initialized with pre-trained weights; only the top fully connected layer has random weights.
- Whole network fine-tuned with new data.
- Bias and Variance:
- Bias: Error when a model simplifies assumptions towards a target variable.
- Variance: Error when a model becomes too sensitive to small fluctuations on unseen data.
- High-bias model is underfitting; high-variance model is overfitting; a balanced model has low bias and low variance.
- Reducing Bias Error:
- Add more images to training data through data augmentation methods.
- Neural Network Layers:
- The number of layers needed depends on the complexity of the problem.
Bayesian Network
- Representation of a joint probability distribution of random variables with a possible mutual causal relationship.
- Nodes represent random variables, edges represent causal relationships, and each node has a conditional probability distribution.
Pearson’s Correlation Coefficient
- Measures statistical relationship between two variables.
- Closer to 1 indicates positive correlation; closer to -1 suggests negative correlation; near 0 means weaker correlation.
Logarithmic Transformation
- Helps positively skewed data conform to normally distributed data.
- Positively skewed distribution has values clustering to the left with a longer right tail.
- Normal distribution is symmetrical about the mean.
Laplace Transform
- Transformation method simplifying complex differential equations into algebraic equations.
- Mainly used for digital signal processing.
Amazon EMR Spot Instances
- Task nodes process data but do not hold persistent data in HDFS.
- If terminated due to rising Spot prices, no data is lost.
- Consider running Core nodes in Spot instances only when data loss is tolerable.
Happy learning! :)