# Module Information

### Masters Modules

*Advanced Portfolio Theory (STA5086Z)*

This course exposes students to the more advanced topics in portfolio theory, portfolio management and risk management. The models draw on statistical techniques such as optimisation, simulation, spectral decomposition of the covariance matrix and robust optimisation. Nevertheless, the emphasis in this course is on the practical application of the models and theories, and thus on the quantification of these measures and the parameterisation of models in a South African (and African) setting. There will also be a focus on interpreting the concepts and the linkages between them.

*Advanced Topics in Regression Analysis (STA5090Z)*

In this module, basic regression concepts shall be examined before moving on to advanced methods that allow for more flexibility in modelling. Topics to be covered include Ordinary Least Squares Regression, Subset Selection, Shrinkage Methods, Principal Component Regression and Partial Least Squares Regression, Piecewise Polynomials, Smoothing Splines, Wavelet Smoothing, Kernel Smoothing Methods, Mixture Models and Generalised Additive Models.

*Bayesian Decision Analysis (STA5061Z)*

The aim is to provide the student with a broad background of the Bayesian approach to decision analysis and statistical inferences, addressing in particular:

- The theoretical, philosophical and behavioural background of subjective probability and subjective expected utility (SEU);
- The interpretation of this background for statistical inference, with examples from a variety of contexts;
- The computational tools needed for implementing Bayesian statistical inference in practice;
- The role of Bayesian networks for modelling of inference and decision making in complex systems.

*Bioinformatics for high-throughput biology (IBS5004Z)*

This course aims to introduce students to bioinformatics techniques related to the processing, analysis and interpretation of high-throughput biological data. It will cover the analysis of next generation sequence (NGS) data of different types (metagenomic, RNA-Seq and full genome); statistical analysis of NGS data in relation to its associated metadata; phylogenetic analysis of sequence data; and medical population genetics using NGS or array data. Students who complete the course will be skilled both in handling big biological data sets and in their downstream interpretation.

*Causal Modelling (STA5062Z)*

This course introduces students to the concept of causality, causal diagrams and causal modelling. Topics to be covered include Counterfactual Theory, Directed Acyclic Graphs, Propensity Scores, Inverse Probability Weighting, Marginal Structural Models, G-estimation, Path Analysis, Confirmatory Factor Analysis, Structural Equation Modelling (SEM), Multiple Group SEM, MIMIC (Multiple Indicator and Multiple Causes) Models, Multilevel SEM, and Latent Growth Curve SEM. The course covers both the theory and the application of the methods with computer software such as R, STATA and LISREL.

*Data Analysis for High Frequency Trading (STA5091Z)*

This course aims to equip students with data science skills required to manage and explore high-frequency financial market data. This includes managing large financial data sets, carrying out statistical analysis of large data sets and knowledge relating to the link between statistical analysis of fast large data sets, the modelling thereof and how this can be used to understand and control real-time trading and risk systems in modern financial markets. The course aims to consolidate prior knowledge relating to the statistical properties of daily sampled financial data and to then extend this to the analysis, exploration and data science of large data sets relating to both limit-order data and real-time transaction data.

*Database for Data Scientists (CSC5007Z)*

This course will introduce students with little or no prior experience to the three cornerstone database technologies for big data, namely relational, NoSQL and Hadoop ecosystems. The course aims to give students an understanding of how data is organised and manipulated at large scale, and practical experience of the design and development of such databases using open source infrastructure. The relational part will cover conceptual, logical and physical database design, including ER modelling and normalisation theory, as well as SQL coding and best practices for performance enhancement. NoSQL databases were developed for big data and semi-structured data applications where relational systems are too inefficient; all four types of NoSQL architecture will be introduced. Distributed data processing is key in manipulating large data sets effectively. The final section of the course will teach the popular Hadoop technologies for distributed data processing, such as MapReduce programming and the execution model of Apache Spark.

*Data Science for Astronomy (AST5004Z)*

This course introduces students to various aspects of data intensive astrophysics, ranging from data visualisation and complex databases, to advanced statistical tools for astronomical data analysis and computational astrophysics. At the core of this module are examples in modern data-intensive astrophysics derived from the global data challenges around MeerKAT, the Square Kilometre Array (SKA), associated projects in radio astronomy, and other large multi-wavelength surveys. Students will be introduced to the use of Bayesian statistics in astronomy, the complexity of visualising large data cubes, optimising database operations in the presence of multi-dimensional data, data mining and discovery tools, and the role of large-scale simulations to interpret the significance of astronomical observations.

*Data Science for Industry (STA5073Z)*

This course seeks to equip the student with the skills required for a career in Data Science within industry. Topics covered include A/B Testing, Design of Experiments (which includes Randomization, Block Design and Replication), Natural Language Processing and Recommendation Systems. It teaches students how to deal with non-standard data sets such as images, audio recordings and network graphs.

*Data Science for Particle Physics (PHY5007Z)*

This course introduces students to the important computational aspects of particle physics research. Using examples from current research at the European Organization for Nuclear Research (CERN), the students are introduced to: the basic principles of particle physics; the Grid computing model employed by the Worldwide LHC Computing Grid (WLCG); the simulation of particle physics data; the ROOT data analysis tool used by all the large particle physics collaborations; and the signal extraction and significance estimation techniques employed in the most recent particle discoveries, including concepts like nuisance parameters and the look-elsewhere effect.

*Data Visualization (CSC5008Z)*

Visualization is the graphical representation of data with the goal of improving comprehension, communication, hypothesis generation and decision making. This course aims to teach the principles of effective visualization of large, multidimensional data sets. We cover the field of visual thinking, outlining current understanding of human perception and demonstrating how we can use this knowledge to create more effective data visualizations.

*Decision Modelling for prescriptive analytics (STA5074Z)*

This course aims to develop an understanding of the role of formal (soft and hard; deterministic and stochastic) modelling in decision support and analysis, to develop an understanding of the key technologies behind decision modelling for prescriptive analytics, and to introduce new tools and techniques for analysing data in new ways in order to improve decision making.

*Design of Clinical Trials (STA5063Z)*

This module will look at the Design of Clinical Trials. Concepts of randomisation, replication and blocking will be discussed. Students will be introduced to the different phases, that is Phases I, II, III and IV of trial designs. Specific designs which will also be covered include, *inter alia*, randomised trials, dose-escalation studies, cross-over trials, PK/PD studies, designs for survival studies and multi-centre trials. The implications of the specific design for the analysis of the data will be discussed.

*Ecological Statistics (STA5064Z)*

This module will cover the latest statistical methods particular to ecological statistics. Topics to be covered include Capture-Mark-Recapture Models (Closed and Open Populations, Multi-state Models), Distance Sampling, Occupancy Models and State-Space Models in Ecology.

*Financial Econometrics (STA5065Z)*

This module comprises an advanced econometric and quantitative perspective of the following key areas: Market efficiency in macro-economic markets including the JSE, bond market and short-term interest rate markets; Characteristics of the JSE and its sectors; appropriate return transformations, the notion of company-specific, sector-specific and market-wide effects; Special focus on the R$ exchange rate; its effect on local markets (JSE and bond); causes of changes and modelling the impact on inflation; Technical modelling of the bond market (Nelson-Siegel parameterisation) and the share market (Black-Scholes; derivatives).

*Longitudinal Data Analysis (STA5067Z)*

This module will look at the latest methods for the analysis of longitudinal data. Longitudinal data arise as a result of repeated measurements of the same variable for a single observational unit. This leads to complex data structures that need to be taken into account, to decisions as to the particular hypothesis of interest, to a consideration of appropriate functional forms to characterize the longitudinal profile, and to potentially complex problems arising as a result of missing data. Topics to be covered include: Introduction to longitudinal data and linear mixed effect models, Generalized Estimating Equations, Generalized linear mixed effect models, Nonlinear Mixed Effect Models (including PK/PD modelling and Growth Curve modelling), Smoothing Spline Models, Missing Data, and Causal Models.

*Machine Learning (STA5068Z)*

This course is highly recommended for those who wish to pursue a career in Data Science. The course serves as an overview of the increasingly important field of Machine Learning. An introduction is given to the key concepts, goals and terminology of Machine Learning. Subsequently, the lectures cover some basic theory and techniques that can serve to guide the analysis of large data sets. The implementation of some popular learning algorithms is examined. This includes Neural Networks, Support Vector Machines, Boosting and Random Forests. Throughout the course, comparisons and contrasts are made with traditional statistical practice.

For students wishing to specialise in Data Science, it is recommended that one also takes the “Programming in Python” and “Database System” modules of the Masters in Information Technology offered by the Computer Science department.

*Mathematical Modelling of Infectious Diseases (STA5066Z)*

Infectious diseases remain a leading cause of morbidity and mortality worldwide, with HIV, tuberculosis and malaria estimated to cause 10% of all deaths each year. Mathematical models are being increasingly used to understand the transmission of infections and to evaluate the potential impact of control programmes in reducing morbidity and mortality. Applications include determining optimal control strategies against new or emergent infections, such as swine flu or Ebola, or against HIV, tuberculosis and malaria, and predicting the impact of vaccination strategies against common infections such as measles and rubella. This course will cover introductory and advanced concepts in mathematical modelling including deterministic and stochastic models, individual based models, and spatial models. Concepts covered include model building, equilibrium analysis, data fitting, sensitivity analysis and an introduction to health economics modelling.

*Multivariate Statistics (STA5069Z)*

In this module, multivariate statistical analysis methods with associated graphical representations will be discussed. Topics to be covered include Principal Component Analysis and PCA biplots, Simple and Multiple Correspondence Analysis, Multidimensional Scaling, Cluster Analysis, Discriminant Analysis, Canonical Variate Analysis, Analysis of Distance.

*Problem Structuring and System Dynamics (STA5070Z)*

Problem Structuring: This section aims to explore a number of tools and methods which support the initial phases of a process of enquiry or analysis. Our interest is in understanding both the epistemological basis of different approaches as well as evaluating the extent to which they add rigour and promote insight. We will be critiquing the efficacy of different approaches through a variety of case studies. System Dynamics: This section extends qualitative systems understanding to more formal and quantified computer-based models that can be used in a simulation mode. The purpose is to understand system effects of complexities such as feedback loops, and to integrate softer subjective insights into quantitative models to explore potential effects.

*Simulation and Optimisation (STA5071Z)*

*Statistical and High Performance Computing (STA5075Z)*

This course aims to provide students with a foundation in statistical computing for data science. The course is divided into three sections, namely Basic Programming, High Performance Computing and Simulation & Optimisation. In the first section students will learn how to write computer programs to analyse data with the R Language and Environment for Statistical Computing. Students will then be taught how to run jobs in parallel on a remote computer cluster using a Linux command prompt. Finally, the course will introduce students to the fundamental principles and uses of simulation and optimisation.

*Supervised Learning (STA5067Z)*

Supervised learning is a set of statistical modelling tools for predicting or estimating the relationships between predictor and target variables in complex data sets. As part of the Masters in Data Science degree, this course aims to familiarise students with the statistical methodology needed to analyse the relationships between predictor and target variables in big data. Students should be able to apply the appropriate statistical methods, such as Generalized Linear Models, Tree-Based Methods, Multivariate Methods, Feature Extraction, Support Vector Machines and Neural Networks, to analyse a big data set and estimate the relationships between the predictor and target variables.

*Survival Analysis (STA5072Z)*

This module will look at the latest methods for the analysis of time to event data, including Censoring mechanisms (Type 1 right censoring, type 2 right censoring, interval censoring and left censoring), Survival likelihood, Kaplan-Meier method and its variance, Confidence interval for survival function, Hypothesis testing in nonparametric setting (logrank test, test for trend), Cox Proportional Hazards model (assumptions, model building, diagnostic techniques, checking proportional odds assumption), Parametric survival models in the proportional hazards metric (Exponential, Weibull), Parametric survival models in the accelerated failure time metric, The extended Cox model, interactions and time-varying covariates, Joint modelling/Multivariate/Clustered survival data, and Frailty models.

*Unsupervised Learning (STA5077Z)*

This course aims to familiarise students with the statistical methodology needed to analyse relationships between variables in big data where no distinction is made between predictor and response variables. Topics covered include association rules and market basket analysis, self-organising maps, multidimensional scaling, cluster analysis and principal component analysis.