High Level Project Management Overview Blog – Data Science - 360DigiTMG

  • High Level Project Management Overview Blog – Data Science

    • August 18, 2019
    • Posted By : excelr_admin
    1. Business Problem
      1. Business Objective – Minimize Defaulters / Minimize Fraud
      2. Business Constraints – Maximize Profits / Maximize Convenience
    2. Data Collection
      1. Primary Data Sources – Data collected first-hand for the problem at hand – Surveys / Experiments
        1. Costly
        2. Time-consuming / Low quality
        3. Get the exact variable
      2. Secondary Data Sources – Data which is collected beforehand
        1. Quick access to data
        2. Free of cost
        3. May not contain the data of interest

       

    3. Data Cleansing / Data Preparation / Exploratory Data Analysis / Feature Engineering
      1. Data Cleansing / Data Preparation
        1. Outlier Analysis / Treatment – 3R (Rectify, Retain, Remove)
        2. Missingness of data – Imputation – Mean, Median, Mode, Regression, KNN
        3. Normalization ((X - Min(X)) / Range(X)) / Standardization ((X - Mu) / Sigma) – Unitless and Scale Free
        4. Discretization / Binning / Grouping
        5. Transformation (log, exp, etc.)
          1. Non-linear
          2. Non-normal
          3. Heteroscedasticity – unequal variance
          4. Collinearity
        6. Dummy variable creation – One hot encoding
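
The cleansing steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: the function names are hypothetical, missing values are assumed to be encoded as `None`, and the outlier rule assumes the usual box-plot fences of 1.5 × IQR.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def iqr_outliers(values):
    """Return the values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

def min_max_scale(values):
    """(x - min) / range: rescales the values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """(x - mu) / sigma: rescales the values to mean 0 and standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(categories):
    """Create one 0/1 dummy column per distinct category (columns in sorted order)."""
    levels = sorted(set(categories))
    return levels, [[int(c == level) for level in levels] for c in categories]
```

Both scaling functions return unitless values, which is why they are applied before distance-based methods such as KNN or K-Means.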
      2. Exploratory Data Analysis
        1. First moment business decision / Measures of central tendency
          1. Mean, Median, Mode
        2. Second moment business decision / Measures of dispersion
          1. Variance, Standard Deviation, Range
        3. Third moment business decision – Skewness
        4. Fourth moment business decision – Kurtosis
        5. Graphical Representation
          1. Univariate
            1. Box Plot
              1. Primary purpose – Identify outliers
              2. Secondary purpose – Identify shape of distribution
            2. Histogram
              1. Primary purpose – Identify Shape of distribution
              2. Secondary purpose – Identify outliers
            3. Q-Q plot – Data are normal or not
          2. Bivariate
            1. Scatter plot
              1. Primary purposes
                1. Direction – Positive, Negative, No correlation
                2. Strength – Strong, moderate, weak – Subjective (visual judgement); Objective – correlation coefficient r, which ranges from -1 to +1; rule of thumb: |r| > 0.85 => strong; |r| < 0.4 => weak
                3. Linear or Non-linear / Curvilinear
              2. Secondary purposes
                1. Clusters
                2. Outliers
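
The four moments and the correlation coefficient r mentioned above can be computed directly from their definitions. The sketch below assumes population (divide-by-n) formulas and reports excess kurtosis (kurtosis minus 3); libraries often apply sample corrections, so their numbers may differ slightly. Function names are illustrative.

```python
import math

def moments(xs):
    """Return (mean, variance, skewness, excess kurtosis) using population formulas."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    sd = math.sqrt(variance)
    skewness = sum(((x - mean) / sd) ** 3 for x in xs) / n
    kurtosis = sum(((x - mean) / sd) ** 4 for x in xs) / n - 3
    return mean, variance, skewness, kurtosis

def pearson_r(xs, ys):
    """Pearson correlation coefficient; always lies between -1 and +1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

For a symmetric sample such as `[1, 2, 3, 4, 5]` the skewness is 0, and two perfectly linear series give r of +1 or -1 depending on direction.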
      3. Feature Engineering / Feature Extraction – Using your given variables, try to apply domain knowledge to come up with more meaningful derived variables
      4. Feature Selection -> Decision Tree (Information Gain), Random Forest (Variable Importance plot), Hypothesis testing, Lasso regression, Ridge regression
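
As one illustration of the feature selection idea above, information gain (the splitting criterion commonly used by decision trees) can be sketched as the drop in label entropy after grouping rows by a categorical feature. The helper names and toy data are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy within each feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: 'separates' splits the classes perfectly, 'noise' not at all.
labels = ["yes", "yes", "no", "no"]
separates = ["a", "a", "b", "b"]
noise = ["a", "b", "a", "b"]
```

A feature with higher information gain is a stronger candidate to keep; here `separates` scores 1 bit and `noise` scores 0.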

       

    4. Data Mining (Cross sectional)
      1. Supervised Learning / Machine Learning / Predictive Modelling (Y known)
        1. Regression Analysis (Interpret the parameters)
          1. Y = Continuous -> Linear Regression
          2. Y = Discrete (2 categories) -> Logistic Regression
          3. Y = Discrete (> 2 categories) -> Multinomial / Ordinal Regression
          4. Y = Count -> Poisson / Negative Binomial Regression
          5. Y = Count with excess zeros -> ZIP / ZINB / Hurdle models
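
To make "interpret the parameters" concrete for the simplest case in the list above (continuous Y, one predictor), here is a minimal ordinary least squares sketch. The function name `fit_line` is hypothetical; real projects would normally use a statistics library rather than hand-rolled formulas.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares (single predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx              # expected change in y per one-unit increase in x
    intercept = my - slope * mx    # expected y when x is 0
    return intercept, slope
```

For example, `fit_line([1, 2, 3, 4], [3, 5, 7, 9])` recovers an intercept of 1 and a slope of 2, read as "each extra unit of x adds 2 to the expected y".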
        2. KNN
        3. Naive Bayes
        4. Black Box Techniques (parameters are not directly interpretable)
          1. Neural Networks
          2. SVM
        5. Ensemble Techniques
          1. Stacking
          2. Bagging (Random Forest)
          3. Boosting (Decision Tree)
      2. Unsupervised Learning (Y unknown)
        1. Clustering / Segmentation – Reduce the rows
          1. K-Means / non-hierarchical – Determine the number of clusters upfront – Scree plot / Elbow curve
          2. Hierarchical / Agglomerative – Dendrogram
          3. DBSCAN
          4. OPTICS
          5. CLARA
          6. K-medians / K-Medoids / K-modes
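
A bare-bones K-Means sketch shows why the number of clusters has to be chosen upfront and what an elbow curve plots. Assumptions to note: centroids are initialised from the first k points (real implementations use random or k-means++ starts), and the function name and return values are illustrative only.

```python
import math

def k_means(points, k, iterations=100):
    """Lloyd's algorithm; returns (labels, centroids, within-cluster sum of squares)."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: deterministic toy init
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, wcss

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids, wcss_k2 = k_means(points, k=2)
_, _, wcss_k1 = k_means(points, k=1)
```

Plotting the within-cluster sum of squares against k gives the elbow curve; on this toy data the value drops sharply from k = 1 to k = 2, matching the two visible groups.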
        2. Dimension Reduction – Reduce the columns
          1. PCA, Factor Analysis
          2. SVD
        3. Association Rules / Market Basket Analysis / Affinity Analysis
          1. Support
          2. Confidence
          3. Lift Ratio > 1 => Antecedent and Consequent occur together more often than expected if they were independent (positive association)
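
The three association-rule measures above follow directly from transaction counts. The sketch below assumes each transaction is a Python set of item names; the basket data and function names are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence divided by the baseline support of the consequent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))

baskets = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]
```

For the rule bread => butter on these baskets, support is 0.4, confidence is about 0.67, and lift is about 1.11, so the two items co-occur slightly more often than they would if purchased independently.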
        4. Recommender Systems
        5. Network Analytics
          1. Degree
          2. Closeness
          3. Betweenness
          4. Eigenvector
          5. Page Rank
        6. Text Mining & NLP
          1. BoW
          2. TDM / DTM
          3. TF / TFIDF
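
The text mining terms above build on each other: a bag of words gives counts, the counts form a document-term matrix (DTM), and TF-IDF reweights them. This sketch assumes whitespace tokenisation, TF as count divided by document length, and IDF as log(N / document frequency); other TF-IDF variants exist, and the function names are illustrative.

```python
import math
from collections import Counter

def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    counts = [Counter(doc) for doc in tokenized]  # bag of words per document
    return vocab, [[c[term] for term in vocab] for c in counts]

def tf_idf(dtm):
    """Weight each count by term frequency times inverse document frequency."""
    n_docs, n_terms = len(dtm), len(dtm[0])
    doc_freq = [sum(1 for row in dtm if row[j] > 0) for j in range(n_terms)]
    weights = []
    for row in dtm:
        total = sum(row)  # assumes every document has at least one token
        weights.append([(row[j] / total) * math.log(n_docs / doc_freq[j])
                        for j in range(n_terms)])
    return weights

vocab, dtm = document_term_matrix(["data science", "data mining"])
weights = tf_idf(dtm)
```

Because "data" appears in every document its TF-IDF weight is 0, while "science" and "mining" keep positive weights, which is the intended effect of down-weighting common terms.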
      3. Forecasting / Time Series
        1. Model Based Approaches
          1. Trend
            1. Linear
            2. Exponential
            3. Quadratic
          2. Seasonality
            1. Additive
            2. Multiplicative
        2. Data Based Approaches
          1. AR (Autoregressive)
          2. MA (Moving Average)
          3. ES (Exponential Smoothing)
            1. SES (Simple Exponential Smoothing)
            2. Holt's
            3. Holt-Winters
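
Simple exponential smoothing, the first of the smoothing methods listed above, fits in a few lines. The sketch assumes the smoothed series starts at the first observation and that `alpha` (between 0 and 1) is supplied by the user; Holt's and Holt-Winters extend the same recursion with trend and seasonal terms, which are not shown here.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = [series[0]]  # assumption: initialise with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

def forecast_next(series, alpha):
    """One-step-ahead forecast: the last smoothed value."""
    return simple_exponential_smoothing(series, alpha)[-1]
```

With `alpha = 0.5`, the series `[10, 20, 30]` smooths to `[10, 15.0, 22.5]`, and 22.5 is the forecast for the next period. A larger `alpha` reacts faster to recent values; a smaller one smooths more heavily.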