by: Maria Sckolnick
Are you planning to do a research project? This document is meant to help you get started and organized. Try to answer the questions listed in the relevant sections below before you start and before you meet with a statistician to discuss your project. Not all questions apply to all types of data.
Before you start
- Are you conducting an experiment or an observational study?
- What is your hypothesis or research question?
- What software do you plan to use for data collection and storage?
- Would you like to make predictions for future observations?
- Would you like to characterize associations between outcome and treatments/conditions?
- What software do you plan to use for analysis?
If you have an experiment
In this type of study, you randomly assign the conditions you want to study to study participants or other experimental units. Because exposure to treatment conditions is random and thus independent of any pre-existing characteristics in your study group, you will be able to draw conclusions about the causal effect of treatment/exposure on the outcome.
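As an illustration of random assignment, here is a minimal sketch using only the Python standard library. The unit labels, treatment names and the `randomize` helper are hypothetical, and a balanced design with equal group sizes is assumed; your own design may call for a different scheme (for example, randomizing within blocks).

```python
import random

def randomize(units, treatments, seed=None):
    """Randomly assign treatments to units in (near-)equal group sizes."""
    rng = random.Random(seed)          # a fixed seed makes the assignment reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Deal the shuffled units out to the treatments in turn.
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

units = [f"plot{i:02d}" for i in range(1, 13)]      # hypothetical unit labels
treatments = ["control", "low", "high"]             # hypothetical treatment levels
assignment = randomize(units, treatments, seed=42)

for unit in sorted(assignment):
    print(unit, assignment[unit])
```

Recording the seed along with the assignment lets you document exactly how the randomization was done.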
Think about the treatments and/or conditions you want to study
- What conditions/treatments are you studying?
- How many levels do you have in each treatment/condition?
- What experimental unit do you assign treatments/conditions to?
- Are all treatments applied to the same experimental unit?
- How many times do you plan to repeat the same treatment/condition combination?
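The answers to the questions above determine how many experimental units you need. The sketch below assumes a fully crossed (factorial) design in which every combination of levels is repeated the same number of times; the factor names, levels and replicate count are made up for illustration.

```python
from itertools import product

# Hypothetical factors and their levels.
factors = {
    "fertilizer": ["none", "low", "high"],
    "irrigation": ["dry", "wet"],
}
replicates = 4   # assumed number of repeats per treatment combination

# Every combination of one level per factor.
combinations = list(product(*factors.values()))

# One row per experimental unit in the planned design.
runs = [
    dict(zip(factors, combo), replicate=r)
    for combo in combinations
    for r in range(1, replicates + 1)
]

print(len(combinations), "treatment combinations")
print(len(runs), "experimental units needed")
```

A table like `runs` is also a convenient template for the data sheet you will fill in during the experiment.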
Think about the outcomes you are interested in
- How many outcomes will you be measuring?
- Are the outcomes numerical, categorical or time to event?
- If numerical, are they discrete or continuous?
- If categorical, are they ordinal?
- Are you measuring/sampling each experimental unit once or more than once? Over time?
- Do you have prior information about the distribution of your outcome(s)?
- What difference in outcome do you consider meaningful in the context of your study?
Think about anything else that may affect your outcome
- Are there blocking factors or covariates you would like to adjust for? These are conditions that likely affect the outcome but whose effect is not itself of interest in your study.
If your study is observational
In this type of study, you observe a population in its natural environment. You do not assign factors, treatments or conditions to study participants. You can only draw conclusions about associations between the outcome and the factors in which study participants differ. Because these factors may be related to other, unmeasured characteristics, it is difficult to establish causality.
Think about the source of your data
- Are you collecting your own data?
- What population are your data a sample of?
- How are the data collected? – Are there potential biases?
- How are the data structured?
- How are the data formatted?
- How many observations do you have?
Think about the outcomes you are interested in
- What outcomes are you interested in? How many outcomes are you interested in?
- Are the outcomes numerical, categorical or time to event?
- If numerical, are they discrete or continuous?
- If categorical, how many categories? Are they ordinal?
- Do you have prior information about the distribution of your outcome(s)?
- What difference in outcome do you consider meaningful in the context of your study? How many observations do you need to show this meaningful difference?
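To give a feel for how the meaningful difference and the spread of the outcome drive the number of observations, here is a rough sketch for one common case: comparing the means of two equally sized groups with a two-sided test. It uses the normal approximation, which slightly underestimates the sample size a t-test would need. The function name `n_per_group` and the example numbers are assumptions for illustration; other designs and outcome types need other formulas.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate observations per group to detect a mean difference `delta`.

    Assumes two groups of equal size, a common standard deviation `sd`,
    a two-sided test at level `alpha`, and the normal approximation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value of the two-sided test
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)

# Hypothetical numbers: a difference of 5 units matters, SD is about 10 units.
print(n_per_group(delta=5.0, sd=10.0))   # prints 63
```

Halving the meaningful difference roughly quadruples the required sample size, which is why it pays to settle this question before collecting data.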
Think about the covariates/factors that you want to study
- Which covariates/factors would you like to study?
- Which covariates/factors would you like to adjust for?
Once you have your data
As a first step, it is a good idea to summarize your data:
- Descriptive statistics on all variables
  - How many observations in all?
  - How many observations per treatment combination group?
  - Mean/median, SD, …
  - Pairwise correlations among numerical variables
  - Contingency tables, proportions observed
- Graphing – this depends on your type of data. You may want to consider using
  - Histograms (for visualizing the distribution of continuous variables)
  - Box-and-whisker plots (for comparing medians, quartiles and ranges of continuous variables across groups)
  - Scatter plots (for the relationship between two continuous variables)
  - Bar graphs (for the frequency of each category of a categorical variable)
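The descriptive summaries listed above can be sketched with the Python standard library alone. The records, variable names and the `summarize` helper below are hypothetical; in practice you would load your own data, and a statistics package would produce the same numbers with less code.

```python
from collections import Counter, defaultdict
from statistics import mean, median, stdev

# Hypothetical records, one dict per observation.
data = [
    {"group": "control", "sex": "F", "weight": 20.0},
    {"group": "control", "sex": "M", "weight": 22.0},
    {"group": "control", "sex": "F", "weight": 24.0},
    {"group": "treated", "sex": "M", "weight": 25.0},
    {"group": "treated", "sex": "F", "weight": 27.0},
    {"group": "treated", "sex": "M", "weight": 32.0},
]

def summarize(rows, group_key, value_key):
    """Return n, mean, median and sample SD of `value_key` per group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row[value_key])
    return {
        group: {"n": len(v), "mean": mean(v), "median": median(v), "sd": stdev(v)}
        for group, v in grouped.items()
    }

summary = summarize(data, "group", "weight")
for group, stats in sorted(summary.items()):
    print(group, stats)

# Contingency table: counts for each combination of two categorical variables.
table = Counter((row["group"], row["sex"]) for row in data)
for cell, count in sorted(table.items()):
    print(cell, count)
```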
Often you will have to consider doing some data cleaning and formatting. The type of software you are using will determine some of these steps:
- How do you want to treat missing values? – ignore, impute?
- Are data formats specified correctly (categorical, numerical, date-time, nominal, ordinal)?
- Are your data in wide format or long format?
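To make the wide/long distinction concrete: wide format has one row per subject with one column per measurement occasion, while long format has one row per subject and occasion. The sketch below converts a small, made-up wide table to long format and counts missing values, which are represented here as `None`. The helper name `wide_to_long` and the column names are assumptions for illustration.

```python
# Hypothetical wide-format data: one row per subject, one column per week.
wide = [
    {"subject": "s1", "week1": 5.0, "week2": 6.5, "week3": None},
    {"subject": "s2", "week1": 4.2, "week2": None, "week3": 7.1},
]

def wide_to_long(rows, id_key, value_keys, drop_missing=False):
    """Reshape wide rows into one row per (subject, occasion) pair."""
    long_rows = []
    for row in rows:
        for key in value_keys:
            value = row[key]
            if drop_missing and value is None:
                continue              # "ignore" strategy: skip missing values
            long_rows.append({id_key: row[id_key], "time": key, "value": value})
    return long_rows

long_rows = wide_to_long(wide, "subject", ["week1", "week2", "week3"])
n_missing = sum(r["value"] is None for r in long_rows)

print(len(long_rows), "rows in long format,", n_missing, "missing values")
for r in long_rows:
    print(r)
```

Dropping missing values is only one option; whether to ignore or impute them is a decision about your analysis, not about the data format.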
Depending on your research question, you may choose to conduct hypothesis tests:
- Are you comparing your sample to a known population?
- Are you comparing two samples from two different conditions/treatments?
- Are you comparing matched observations from two samples?
- Are you comparing samples from more than 2 conditions/treatments?
- Are you testing for a correlation between two different variables measured on the same observations?
- Decide on a significance level (or, equivalently, a confidence level) before you run the tests
- Adjust for multiple comparisons
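As one example of adjusting for multiple comparisons, the sketch below implements the Holm step-down adjustment, a less conservative alternative to the Bonferroni correction that still controls the family-wise error rate. The p-values are made up, and Holm is only one of several possible methods; which one is appropriate depends on your study.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

p_values = [0.010, 0.040, 0.030, 0.005]   # hypothetical results of four tests
alpha = 0.05                              # significance level chosen in advance

for raw, adj in zip(p_values, holm_adjust(p_values)):
    verdict = "reject" if adj <= alpha else "do not reject"
    print(f"raw p = {raw:.3f}, adjusted p = {adj:.3f}: {verdict}")
```

In this made-up example all four raw p-values are below 0.05, but only two remain significant after adjustment.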
Modeling
Modeling expresses outcome values as a mathematical function of the explanatory variable values. This type of analysis helps to clarify complex relationships between outcomes and treatments or background characteristics in the studied population. Results from modeling can help elucidate the effects of individual characteristics or treatments on the outcome in a population sufficiently similar to the study population. Additionally, modeling can provide a means to predict future outcomes in populations with known characteristics and treatments. Effective modeling requires information about the distribution of outcomes, an adequate sample size and knowledge of the relevant characteristics that are associated with the outcome.
- What model is appropriate for your data? – This depends on the type of observations and conditions you are trying to study. Statistical software will fit almost any model to almost any data, whether the model is appropriate or not. Make sure you understand why the model of your choice is appropriate. To help you evaluate this, think about these questions:
  - Are the model assumptions met?
  - Do your data require transformation?
  - Do you have enough data to allow for the complexity of your chosen model?
- Evaluate and adjust the model
  - Which terms are relevant?
  - Does the model have a good fit for the data?
  - What is the interpretation of your model parameters?
  - For which population is the model valid?
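As a small illustration of fitting and evaluating a model, the sketch below fits a straight line by ordinary least squares to made-up dose/response data, then reports the parameters, R² as one measure of fit, and the residuals. A simple linear model is only one possible choice and is assumed here purely for illustration; the helper names `fit_line` and `r_squared` are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Proportion of the variance in y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: response measured at six dose levels.
dose = [0, 1, 2, 3, 4, 5]
response = [1.1, 2.9, 5.2, 7.1, 8.8, 11.0]

slope, intercept = fit_line(dose, response)
fit = r_squared(dose, response, slope, intercept)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(dose, response)]

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {fit:.4f}")
print("residuals:", [round(r, 3) for r in residuals])
```

Here the slope is interpreted as the expected change in response per unit increase in dose. Plotting the residuals against the explanatory variable is a quick check of whether the straight-line assumption is reasonable, and the fitted line should only be trusted within the range of doses that was actually studied.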