## Introduction to Data Science

This course is listed in USOS as Introduction to Data Science, WFAIS.IF-N018.0 (Wprowadzenie do Analityki Danych), 60 hours, 6 ECTS.

### When

The course starts in the winter semester of 2017/2018.

### Target audience

Students of first- and second-cycle studies (preferably first-cycle)

### Prerequisites

• A course in statistics and/or probability theory
• A programming course (preferably Python)

### Course outcomes

After finishing this course, students are expected to understand the overall data science process, which consists of:

• Acquiring the data
• Cleaning and validating the data
• Doing exploratory data analysis
• Modeling the data (linear and logistic regression)
• Visualizing the data
• Statistical inference: drawing conclusions, answering questions and testing hypotheses
• Presenting findings

Students will also gain an overview of the machine learning “family tree”: supervised, unsupervised and reinforcement learning. Specifically, after finishing the course, students should be able to:

• Acquire structured and, to some extent, unstructured data from various sources such as text files, PDF files, databases (SQL and NoSQL) and the web.
• Clean the acquired data where necessary, including handling missing data.
• Manipulate the collected data using operations that include slicing, indexing, multi-indexing, grouping, merging and aggregating.
• Perform exploratory data analysis, including calculating various statistical descriptors and estimating the errors on those descriptors.
• Perform dimensionality reduction using principal component analysis (PCA).
• Visualize the data using appropriate plots.
• Fit a linear model to the data and check its validity and robustness.
• Use logistic regression for classification.
• Perform clustering using the k-means algorithm as an example of unsupervised learning.
• Estimate the errors of the calculated descriptors or parameters using resampling methods such as jackknife and/or bootstrap.
• State a hypothesis concerning the data and test it at a specified significance level. Evaluate an A/B test.
• Implement a k-nearest neighbors classifier.
• Implement a Bayesian classifier.
• Use PySpark to handle data sets that are too large to fit into the memory of a single computer.
• Present and justify their findings.
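
To make one of the outcomes above concrete, the bootstrap error estimate can be sketched as follows. This is only an illustrative sketch, not course material: the function name `bootstrap_se`, the synthetic Gaussian data, the number of resamples and the seeds are all assumptions made for this example, and it uses only the Python standard library so that it runs without extra packages.

```python
import random
import statistics


def bootstrap_se(sample, statistic=statistics.mean, n_resamples=1000, seed=0):
    """Estimate the standard error of `statistic` by bootstrap resampling.

    `bootstrap_se` is a hypothetical helper written for this example.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Each resample draws n values from the sample with replacement.
    estimates = [statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    # The spread of the resampled statistic approximates its standard error.
    return statistics.stdev(estimates)


# Synthetic data (an assumption for illustration): 200 draws from N(10, 2).
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

se_boot = bootstrap_se(data)
# For the mean, the standard error is also known analytically: s / sqrt(n).
se_analytic = statistics.stdev(data) / len(data) ** 0.5

print(f"bootstrap SE of the mean: {se_boot:.4f}")
print(f"analytic SE of the mean:  {se_analytic:.4f}")
```

For the mean the two numbers should agree closely, which is a useful sanity check; the value of the bootstrap is that the same function works for descriptors without a simple error formula, for example by passing `statistic=statistics.median`.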

Students will also acquire a working knowledge of the tools described in the next section.

### Tools

We will use the Python language and its packages. The course is based on Jupyter notebooks together with the Python packages SciPy, NumPy, Pandas, scikit-learn, matplotlib, plotly and seaborn. PySpark will additionally be used for the analysis of large data sets.

### Assessment

During the course, students will be required to carry out at least two data science projects and present their conclusions for assessment.