Depending on how we use data, the study of statistics is divided into two main areas: descriptive and inferential. In descriptive statistics, we describe a situation by collecting, organizing, summarizing, and presenting the data. In inferential statistics, we try to make an inference from our collected data to a population by generalizing, estimating, testing, and making predictions. We will reserve inferential statistics for later and focus on the descriptive branch of statistics here.
Suppose the statistics class just had a test. The teacher checked and recorded the students' test scores. In statistical terms, these scores are called data, and the whole set of scores is called a data set. But these numbers are meaningless if we don't know what they measure and whom they were measured on. Since we know that these are the test scores for the students enrolled in the statistics class, the numbers, when placed in context, may convey important information about class performance, test difficulty, students' abilities, content knowledge, and even the testing environment.
Statisticians call the students elements and each student's score an observation. These observations are part of the teacher's assessment, and she needs to use these data to evaluate the content she taught. If she had over 30 students, it would be hard for her to make sense of the raw data set just by looking at it. It would be much more helpful if she organized the data into tables, drew graphs, or calculated the average.
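As a rough illustration of what organizing and summarizing could look like, here is a minimal Python sketch. The scores and the `grade_band` helper are invented for this example and do not come from any real class; the idea is only to show a frequency table and an average built from a small data set.

```python
from collections import Counter
from statistics import mean

# Hypothetical test scores, invented for illustration (one observation per student).
scores = [88, 92, 75, 88, 64, 92, 88, 79, 95, 71]


def grade_band(score):
    """Map a score to a 10-point band label such as '80-89' (hypothetical helper)."""
    low = (score // 10) * 10
    return f"{low}-{low + 9}"


# Frequency table: how many observations fall in each band.
frequency = Counter(grade_band(s) for s in scores)

# Summary measure: the class average.
average = mean(scores)

for band in sorted(frequency):
    print(band, frequency[band])
print("average:", average)
```

Ten-point bands are just one possible grouping; a different band width would give a different table from the same observations.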
As mentioned earlier, data can refer to numbers or other subjective labels, and they are useless without their context. One easy way to provide context is to answer the Ws—who, what, when, where, why (if possible), and how—of the dataset we're working with.
Knowing who is involved in generating the data we have at hand provides more information about the cases (circumstances) for which (or whom) data is collected. That being said, there are a lot of ways to describe these individuals involved:
Respondents are individuals who answer surveys, providing information about themselves or their opinions on a particular topic. 🦉
Subjects (or participants) refer to individuals (or sometimes other types of units, such as groups or organizations) involved in experiments, where they are exposed to a treatment or intervention and the effect of that treatment is measured. 👩
In addition to human subjects, data can be collected from a wide range of other types of units, such as animals, plants, or inanimate objects. These units are often referred to as experimental units. 🌳
It is important to consider the who of data when designing a study or analysis, as the characteristics of the units being studied can affect the results and conclusions that can be drawn. For example, a study that is conducted with a sample of college students may not be generalizable (you'll learn more about generalizability down the road!) to the broader population, while a study that is conducted with a representative sample of the population may be more generalizable.
Variables are characteristics or attributes that are measured or observed for each individual in a study. The variables should have a name that clearly identifies what has been measured, so that the data collected can be easily understood and analyzed. 🔎
There are different types of variables, including:
Dependent variables: These are the variables that are being measured or observed in a study. The value of the dependent variable is thought to depend on the value of one or more independent variables.
Independent variables: These are the variables that are being manipulated or controlled in a study. The value of the independent variable is thought to influence the value of the dependent variable.
Controlled variables: These are variables that are kept constant or controlled in a study, in order to eliminate their influence on the dependent variable.
It is important to carefully consider the variables that will be measured in a study, as they determine the questions that can be answered and the conclusions that can be drawn. It is also important to measure the variables accurately and consistently, in order to ensure the validity and reliability of the study (we'll learn more about this when we talk about experimental design and set-up!). In addition, section 1.2 goes into more depth on the specifics of variables. 😁
The more we know about the context, the more we'll understand about the data we have! This is where the when and where of our data come in.
The when refers to the time at which the data was collected, which can have an impact on the values that are recorded. For example, values recorded at different points in time may reflect different trends or patterns. ⏰
The where of data refers to the location where the data was collected, which can also have an impact on the values that are recorded. For example, values recorded in different geographical locations may reflect different social, cultural, or economic factors. 🗺️
Both the when and where of data can be important considerations when interpreting the results of a study or analysis. It is important to carefully consider the context in which the data was collected, as it can help to better understand the meaning and implications of the results.
The questions that we ask of a variable, or the why of our analysis, shape how we think about and approach the variable. The questions we ask can influence the way we define and measure the variable, as well as the type of statistical analysis that we use to analyze the data. 🖥️
For example, if we are interested in understanding the relationship between two variables (say, amount of sleep and test scores), we might ask questions such as:
Is there a relationship between the two variables?
If there is a relationship, what is its nature (e.g., positive or negative)?
Is the relationship statistically significant, or could it have occurred by chance?
Answering these types of questions can help us to better understand the data and draw meaningful conclusions. It is important to carefully consider the questions that we want to answer when designing a study or analysis, in order to ensure that the appropriate data is collected and analyzed.
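To make the sleep and test score example concrete, the sketch below computes the Pearson correlation coefficient, one common way to describe the direction and strength of a linear relationship. The data values are invented for illustration, and `pearson_r` is a hypothetical helper written out by hand rather than taken from a library.

```python
from math import sqrt

# Hypothetical data, invented for illustration: one pair per student.
hours_of_sleep = [5, 6, 6, 7, 8, 8, 9]
test_scores = [65, 70, 72, 80, 85, 88, 90]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(hours_of_sleep, test_scores)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f} ({direction})")
```

This only describes the direction and strength of the relationship in the data at hand. Deciding whether the relationship is statistically significant is a question for inferential statistics, which is covered later.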
The how of data collection refers to the methods or techniques that are used to collect the data, and it can have a significant impact on the quality and reliability of the data.
There are many different methods for collecting data, including surveys, experiments, observations, and secondary data sources. Each method has its own strengths and limitations, and it is important to choose the most appropriate method for the research question being addressed. 📜
For example, Internet surveys can be a convenient and cost-effective way to collect data from a large number of respondents, but they may also be unreliable due to biases, such as nonresponse bias (where certain groups are more or less likely to respond to the survey) or response bias (where the responses are not accurate or honest). 😔
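A small arithmetic sketch can show how nonresponse bias distorts a survey result. Everything here is an assumption made up for illustration: the two groups, their sizes, their true averages, and their response rates. The code compares the true population average with the average we would expect among the people who actually respond.

```python
# Hypothetical population, invented for illustration. The two groups differ
# both in their true answer and in how likely they are to respond.
groups = [
    # (name, population size, true average hours online per day, response rate)
    ("heavy internet users", 400, 6.0, 0.50),
    ("light internet users", 600, 2.0, 0.10),
]

# True average across the whole population.
total_size = sum(size for _, size, _, _ in groups)
population_mean = sum(size * avg for _, size, avg, _ in groups) / total_size

# Expected number of respondents, and the average among those respondents.
respondents = sum(size * rate for _, size, _, rate in groups)
survey_mean = sum(size * rate * avg for _, size, avg, rate in groups) / respondents

print(f"population mean: {population_mean:.2f}")
print(f"expected survey mean: {survey_mean:.2f}")
```

Because the group with the higher average is also more likely to respond, the expected survey average comes out above the true population average. The sketch uses expected counts rather than random sampling, so it shows the systematic part of the error only.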
It is important to carefully consider the how of data collection when designing a study or analysis, in order to ensure that the data is of sufficient quality and reliability to support the research question and conclusions.
Tying these factors together, a large data set is hard to read and hard to draw conclusions from. Constructing tables, drawing graphs, and calculating summary measures such as averages make up the descriptive portion of statistics. The next few sections will show how to construct tables, draw graphs, and calculate summary measures. The two branches of statistics are strongly connected, and the knowledge gained in the first few units will help you when you are introduced to inference procedures.
- Descriptive Statistics
- Data
- Data Set
- Element
- Observations