Requirements:
- Programming Experience in a language like Python, Go, etc.
- Solid Knowledge of Operating Systems
- Understanding of ETL pipelines
- Heavy, In-Depth Database Knowledge – SQL and NoSQL
- Data Warehousing – Hadoop, MapReduce, Hive, Pig, Apache Spark, Kafka
- Basic Machine Learning Familiarity
Requirements:
- Programming Experience in a language like Python, R, etc.
- Statistics (familiarity with statistical tests, distributions, maximum likelihood estimators, etc.)
- Basic Machine Learning Skills
- Multivariable Calculus & Linear Algebra
- In-Depth knowledge of SQL
- Data Wrangling and EDA Skills
- Data Visualization and Communication
Requirements:
- Programming Experience in a language like Python, R, Java, etc.
- Advanced Probability and Statistics Knowledge
- Data Modeling & Evaluation
- Advanced Machine Learning
- Software Engineering and System Design
Q. How do you find outliers in data?
A. i. If you already know which values count as outliers, set threshold values and keep only the data that lies inside them.
ii. If you don't know the outlier values in advance, you can apply clustering and drop the points that fall outside the clusters. The same idea applies to other models such as Linear Regression or SVM: fit the model and inspect the points that lie far from what it predicts.
iii. Scatter plots and box plots help you visualize outliers, so use them for the visual check.
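The threshold approach in (i) and the rule a box plot uses to draw its whiskers can be sketched in a few lines. This is a minimal illustration using only the standard library. The function names, the sample `readings`, the threshold range 0 to 50, and the multiplier `k=1.5` are assumptions made for the example, not part of the original answer.

```python
from statistics import quantiles


def filter_by_threshold(values, low, high):
    """Keep only values inside a known valid range [low, high]."""
    return [v for v in values if low <= v <= high]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot rule."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample data with one obvious outlier.
readings = [10, 12, 11, 13, 12, 11, 10, 95]

kept = filter_by_threshold(readings, 0, 50)  # thresholds known in advance
flagged = iqr_outliers(readings)             # thresholds derived from the data
```

The first function fits the case where the valid range is known. The second derives the range from the data itself, which is closer to case (ii) when no thresholds are available up front.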
Q. If the dataset you are using is large and you face runtime issues handling it, how would you handle it?
A. Different approaches:
- Historical Data:
  - Large Dataset:
    - See this
    - Load data in batches
  - Small Datasets: You are good to go with Pandas and NumPy as usual
- Realtime Data:
  - You need to look into big data solutions like Kafka, Hadoop, etc.
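As a rough sketch of "load data in batches": instead of reading the whole file into memory, read a fixed number of rows at a time and keep only a running aggregate. The helper names, the `amount` column, and the in-memory sample file below are hypothetical stand-ins for a real large CSV on disk.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


def batched_mean(csv_file, column, batch_size=2):
    """Compute the mean of one column while holding only one batch in memory."""
    reader = csv.DictReader(csv_file)
    total, count = 0.0, 0
    for batch in iter_batches(reader, batch_size):
        total += sum(float(row[column]) for row in batch)
        count += len(batch)
    return total / count if count else float("nan")


# Hypothetical stand-in for a large file; in practice, pass an open file handle.
sample = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")
mean_amount = batched_mean(sample, "amount", batch_size=2)
```

With Pandas, the same pattern is available through the `chunksize` argument of `read_csv`, which returns an iterator of DataFrames rather than a single one.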
Q. Why are CatBoost and XGBoost better than gradient boosting in sklearn?
A. See here
See here
Q. How does Gradient Boosting work?
A. See here
Q. How do hyperparameters in an algorithm work?
A. See here
Q. How does Linear Regression work?
A. See here