Introduction to Data Science

This course is listed in USOS as: Introduction to Data Science, WFAIS.IF-N018.0 (Wprowadzenie do Analityki Danych), 60 hours, 6 ECTS.

When

The course will start in the winter semester of 2017/2018.

Lecturers

Target audience

Students in first- and second-cycle studies (preferably first-cycle).

Prerequisites

  • A course in statistics and/or probability calculus
  • A programming course (preferably Python)

Course outcomes

After finishing this course, students are expected to understand the overall data science process, which consists of:

  • Acquiring the data
  • Cleaning and validating the data
  • Doing exploratory data analysis
  • Modeling the data (linear and logistic regression)
  • Visualizing the data
  • Statistical inference: drawing conclusions, answering questions and testing hypotheses
  • Presenting findings

Students will also gain an overview of the Machine Learning “family tree”: Supervised, Unsupervised and Reinforcement Learning. Specifically, after finishing the course, students should be able to:

  • Acquire structured and, to some extent, unstructured data from various sources such as text files, PDF files, databases (SQL and NoSQL) and the WWW.
  • Clean the acquired data where necessary, including handling missing data.
  • Manipulate the collected data using operations that include slicing, indexing, multi-indexing, grouping, merging and aggregating.
  • Perform exploratory data analysis, including calculating various statistical descriptors and estimating the errors on those descriptors.
  • Perform dimensionality reduction using principal component analysis (PCA).
  • Visualize the data using appropriate plots.
  • Fit a linear model to the data and check its validity and robustness.
  • Use logistic regression for classification.
  • Perform clustering using the k-means algorithm as an example of unsupervised learning.
  • Estimate the errors of the calculated descriptors or parameters using resampling methods such as jackknife and/or bootstrap.
  • State a hypothesis concerning the data and test it at a specified significance level. Evaluate an A/B test.
  • Implement a k-nearest neighbors classifier.
  • Implement a Bayesian classifier.
  • Use PySpark to deal with data sets too large to fit into the memory of a single computer.
  • Present and justify their findings.
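
To give a flavour of one of the outcomes listed above, estimating the error of a calculated descriptor by resampling, the following is a minimal sketch of a bootstrap estimate of the standard error of the mean. It is an illustration only, not course material: the function name bootstrap_standard_error, its parameters and the sample values are all hypothetical, and the sketch uses only the Python standard library so that it runs without any additional packages.

```python
import random
import statistics


def bootstrap_standard_error(sample, statistic=statistics.mean,
                             n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by bootstrap resampling.

    Hypothetical helper written for this illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        # Draw n values from the sample with replacement.
        resample = rng.choices(sample, k=n)
        estimates.append(statistic(resample))
    # The spread of the resampled statistics approximates the standard error.
    return statistics.stdev(estimates)


# Hypothetical measurements, made up for illustration only.
measurements = [4.1, 5.3, 4.8, 5.9, 4.4, 5.1, 5.6, 4.7, 5.0, 5.2]

mean_estimate = statistics.mean(measurements)
bootstrap_se = bootstrap_standard_error(measurements)
# Textbook standard error of the mean, for comparison.
analytic_se = statistics.stdev(measurements) / len(measurements) ** 0.5

print(f"mean = {mean_estimate:.2f}")
print(f"bootstrap SE = {bootstrap_se:.3f}, analytic SE = {analytic_se:.3f}")
```

For the mean, the bootstrap estimate should land close to the analytic value; the resampling approach becomes useful for descriptors such as the median, where no simple formula for the error is available. Passing statistic=statistics.median to the same function would cover that case.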

Students will also acquire a working knowledge of the tools described in the next section.

Tools

We will use the Python language and its packages. The course is based on Jupyter notebooks together with the Python packages SciPy, NumPy, Pandas, scikit-learn, matplotlib, plotly and seaborn. PySpark will additionally be used for the analysis of large data sets.

Assessment

During the course, students will be required to carry out at least two data science projects and present their conclusions for assessment.