Statistics play a key role in the jobs of data scientists. Collecting, organizing, analyzing, and interpreting data are all tasks that involve the use of statistics. Statistics help to provide summaries of data and to simplify large amounts of information. The two main types of statistics that data scientists use are descriptive and inferential statistics. Both types are important to data science for different reasons and it is important to know when to use which category.
Very simply, descriptive statistics describe the data you have collected, which is often a sample drawn from the population you are studying. For example, let’s say we wanted to study the work habits of employees at a large company. With over 1,000 staff, it would be very challenging to interview each person. We can, however, survey a sample of 10 individuals.
Measures of central tendency such as mean, median, and mode are used in descriptive statistics to describe the typical or central value of a data set. If we surveyed 10 people about how many hours they work each week, the results might look like this:
Number of hours worked: 35, 42, 40, 37, 41, 35, 35, 45, 44, 49
The mean is calculated by adding all of the numbers together, then dividing by the number of people.
(35 + 42 + 40 + 37 + 41 + 35 + 35 + 45 + 44 + 49) / 10 = 403 / 10 = 40.3
Thus, the mean or the average number of hours worked is 40.3.
To calculate the median, first order the numbers from smallest to largest, then find the middle.
35, 35, 35, 37, 40, 41, 42, 44, 45, 49
In this case, 40 and 41 are the numbers in the middle. We add those together and divide the sum by two to get the median.
Thus, our median is (40 + 41) / 2 = 81 / 2 = 40.5.
The mode is simply the number that appears most often. We can see that the number 35 appears three times, making it the mode for this data set.
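The three calculations above can be reproduced in a few lines of code. This is a minimal sketch that assumes Python and its standard `statistics` module; the variable names are illustrative only.

```python
import statistics

# Weekly hours reported by the 10 surveyed employees in the example.
hours = [35, 42, 40, 37, 41, 35, 35, 45, 44, 49]

mean_hours = statistics.mean(hours)      # sum of the values divided by their count
median_hours = statistics.median(hours)  # average of the two middle values (even count)
mode_hours = statistics.mode(hours)      # most frequently occurring value

print(f"mean={mean_hours}, median={median_hours}, mode={mode_hours}")
# prints: mean=40.3, median=40.5, mode=35
```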
This example used a very small sample of only 10 people. When we are working with large data sets, descriptive statistics are useful for explaining information in a practical manner. For example, if we surveyed 1000 people about their salary, we could present the information by saying that the mean salary for this sample was $46,000. We do not need to share all 1000 salaries for the information to be useful and relevant.
Both quantitative and categorical data are used in the field of statistics. Quantitative data involves numbers that may be discrete (countable values, usually whole numbers) or continuous. For example, the number of pets someone has would be discrete, while temperature would be continuous. Categorical data includes ordinal data, which can be logically ordered, and nominal data, which has no logical order. An example of ordinal data would be an age bracket (such as 18-24 or 25-34), and an example of nominal data could be hair color.
Measures of dispersion used in descriptive statistics include range, variance, and standard deviation, and skew is often reported alongside them to describe how asymmetric the distribution is. Correlation and the chi-square test are two of the most frequently used ways to assess association between variables.
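To make the dispersion measures concrete, here is a short sketch that computes the range, variance, and standard deviation for the hours-worked sample used earlier. It assumes Python's standard `statistics` module and uses the sample (n - 1) versions of variance and standard deviation, which is one common convention when the data is a sample and not the whole population.

```python
import statistics

# Weekly hours reported by the 10 surveyed employees in the earlier example.
hours = [35, 42, 40, 37, 41, 35, 35, 45, 44, 49]

value_range = max(hours) - min(hours)          # largest value minus smallest value
sample_variance = statistics.variance(hours)   # sum of squared deviations / (n - 1)
sample_std_dev = statistics.stdev(hours)       # square root of the sample variance

print(f"range={value_range}")
print(f"variance={sample_variance:.2f}")
print(f"standard deviation={sample_std_dev:.2f}")
# prints: range=14, variance=23.34, standard deviation=4.83 (one value per line)
```

If the 10 values were the entire population of interest, `statistics.pvariance` and `statistics.pstdev` (which divide by n) would be the matching choices.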
Descriptive statistics often use visuals to describe data sets. Pie charts, bar graphs, histograms, and scatter plots are all examples of ways to present your data. Choosing the best graphic depends on several factors, including how many variables you have, how large your data set is, and whether you are showing change over time or a static snapshot.
Inferential statistics use a given sample to make conclusions or inferences about the larger population. This can be helpful when the population is too large to study in full. With our earlier example, we surveyed 10 employees, which was our sample out of a population of 1000. Inferential statistics allow us to estimate values for the whole population, such as the mean number of hours worked, and to quantify how uncertain those estimates are. The conclusions are only trustworthy if the sample is representative of the population, which is why random sampling matters.
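One common way to generalize from the 10-person sample to the full workforce is a confidence interval for the mean. The sketch below is illustrative and rests on several assumptions that the example does not state: the 10 employees were chosen at random, weekly hours are roughly normally distributed, and the sample is small enough relative to the population to ignore any finite-population correction. The critical value 2.262 is taken from a standard Student's t table for 9 degrees of freedom and is hard-coded here.

```python
import math
import statistics

# Weekly hours reported by the 10 surveyed employees in the earlier example.
hours = [35, 42, 40, 37, 41, 35, 35, 45, 44, 49]

n = len(hours)
sample_mean = statistics.mean(hours)
standard_error = statistics.stdev(hours) / math.sqrt(n)

# Two-sided 95% critical value of Student's t with n - 1 = 9 degrees of freedom.
# Assumption: copied from a t-table; it must be changed if the sample size changes.
T_CRITICAL_95_DF9 = 2.262

margin = T_CRITICAL_95_DF9 * standard_error
ci_low, ci_high = sample_mean - margin, sample_mean + margin

print(f"95% CI for mean weekly hours: ({ci_low:.1f}, {ci_high:.1f})")
# prints: 95% CI for mean weekly hours: (36.8, 43.8)
```

The interval is wide because the sample is small; surveying more employees would shrink the standard error and narrow the range.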
We can make estimations, predictions, or generalizations about the population using inferential statistics. For example, new medications are tested in a small sample of the population. Using probability theory, we can then assess whether the results of the sample study are likely to hold for the entire population. By convention, a result is called statistically significant when its p-value is below 0.05, meaning that a result at least this extreme would be expected less than 5% of the time if the medication had no real effect.
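The idea of ruling out chance can be illustrated with a permutation test, which is one simple way to estimate a p-value. The sketch below is not the procedure used in real drug trials. The improvement scores are invented for illustration, and the function name `permutation_p_value` and its parameters are hypothetical. It repeatedly shuffles the group labels and counts how often a difference at least as large as the observed one appears when the labels carry no meaning.

```python
import random
import statistics

# Hypothetical improvement scores, invented for illustration only.
treated = [8, 6, 7, 9, 5, 8, 7, 6]
placebo = [5, 4, 6, 3, 5, 4, 6, 5]


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(statistics.mean(pooled[:size_a]) - statistics.mean(pooled[size_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            at_least_as_extreme += 1

    # Adding 1 to both counts avoids reporting a p-value of exactly zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


p_value = permutation_p_value(treated, placebo)
print(f"estimated p-value: {p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

A small p-value here means that random relabeling rarely produces a gap as large as the one observed between the two groups.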
Hypothesis testing and probability are used in inferential statistics to establish whether an effect or difference observed in a sample is likely to exist in the population. Tests commonly used in inferential statistics include regression analysis, t-tests, and analysis of variance (ANOVA). Regression analysis is typically used to find the relationship between a dependent variable and one or more independent variables. T-tests compare the means of two groups and tell you whether the difference between them is statistically significant. ANOVA extends this idea to three or more groups by comparing the variation between group means with the variation within each group.
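As a small illustration of regression analysis, the sketch below fits a straight line with one independent variable using ordinary least squares. The data (years of experience and salary in thousands of dollars) is hypothetical, and `fit_line` is an illustrative helper, not a library function. A full analysis would also test whether the slope is statistically significant, which is left out here.

```python
import statistics

# Hypothetical data, invented for illustration only.
years_experience = [1, 2, 3, 4, 5]   # independent variable
salary_k = [40, 43, 47, 50, 55]      # dependent variable, in $1000s


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years_experience, salary_k)
print(f"salary_k = {slope:.1f} * years + {intercept:.1f}")
# prints: salary_k = 3.7 * years + 35.9
```

With this made-up data, each additional year of experience is associated with about $3,700 more in salary.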
Descriptive statistics describe the data, while inferential statistics use that data to make a conclusion.
Descriptive statistics focus on the sample, while inferential statistics focus on the population.
Descriptive statistics can be computed on data sets of any size, while the conclusions of inferential statistics generally become more reliable as the sample grows.
Charts and graphs are used to describe results in descriptive statistics, while in inferential statistics, results are reported with probabilities, such as p-values and confidence intervals.
Descriptive statistics are exact for the data at hand, while inferential statistics always carry some uncertainty because they generalize beyond the sample.
Statistics are an integral part of data science and both inferential and descriptive statistics are useful in different areas. When you need to organize, summarize, or describe key aspects of a data set, descriptive statistics are best. If you need to make judgments or conclusions about a larger population, the best option is to use inferential statistics. Knowing when and how to use each type will help you to present the data in a way that makes the most sense.