Statistics clinic feedback, January 2016 to April 2019

[1] Many thanks for your help!

Questions discussed:

1. How to get the slope of the linear section of a baroreflex plot. - Difficult question, need additional software to do properly. Potential use of logit function. Plan is for me to go and look into the literature again to see if we can see how other teams have handled the problem.

2. How to do a repeated measures ANOVA given missing data points. - Can impute mean in place of gap and then trial with lowest and highest values in place of mean to check results are still robust. However, in this data set it may be better to run the analysis with both last value carried forward and next value carried backwards, and then compare to see if consistent result.

3. Method for multivariate analysis to see which baseline variables correlate with the change in BP. - Given that the variables aren't independent, the suggestion is to plot the main correlation, with secondary variables shown using data points of different shape or colour, to see if any other relationships become apparent.

4. Left ventricular mass data fails test of normality - should parametric or non-parametric analysis be used. - Physiological data never truly normal distribution as multiple sub-populations are included, plus small dataset, therefore use non-parametric test.
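As an illustration of the comparison described under question 2, the two fill-in rules (last value carried forward and next value carried backward) can be sketched in Python. This is a minimal sketch, not the clinic's procedure: the function names and the sample readings are hypothetical, and a real analysis would apply both rules to every subject before rerunning the ANOVA.

```python
from typing import List, Optional


def locf(values: List[Optional[float]]) -> List[Optional[float]]:
    """Last observation carried forward: fill each gap with the previous value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if no earlier value exists
    return filled


def nocb(values: List[Optional[float]]) -> List[Optional[float]]:
    """Next observation carried backward: fill each gap with the following value."""
    return locf(values[::-1])[::-1]


# Hypothetical repeated measurements for one subject; None marks a missing point.
readings = [120.0, None, 128.0, None, None, 131.0]
print("LOCF:", locf(readings))
print("NOCB:", nocb(readings))
```

If the conclusions agree under both fill-in rules, that gives some reassurance that the result does not hinge on how the gaps were filled.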

Thanks again.
[2] Hello,

I visited the statistics clinic today with a question regarding how to analyse data we collected from a survey of clinicians about how they prefer to treat a particular hand condition. I wanted to investigate preference differences between clinicians and I was advised to calculate the chi-squared statistics and rank them by size for their contribution to the overall model.

This was really helpful advice and I would wholeheartedly recommend the statistics clinic to anyone!

Many thanks for this excellent service.

Best wishes,
[3] Hello

I came to the clinic yesterday. I wanted confirmation of my interpretation of a logistic regression model using interactions and suggestions about how to improve the model I was using (on SPSS). The guys were very helpful and had some very helpful suggestions.

Thank you very much,
[4] Stats team,

Many thanks for your help today. As requested I have outlined the questions I had and the conclusions reached.

Question 1: In my study on medical education I examined the views (using a questionnaire) of two groups. I wanted to know whether in analysing the data I could selectively combine some of the categories ('agree' and 'strongly agree' versus 'disagree' and 'strongly disagree') and omit another category ('neither agree or disagree') in order to compare group A with group B.

Answer 1: The advice was that this would be fine, provided it was clear in the presentation of the results, and that any statistical test (chi square appropriate in this case) was explained to be in relation to those expressing a polarised view (i.e. it didn't include the 'neither agree or disagree' category).

Question 2: In order to assess internal consistency, could I compare one element (e.g. participants view of the usefulness of a meeting, total score 20) with the mean of questions on several elements of the meeting (e.g. participants view of: 'the meeting introduction', 'the meeting content' and 'the meeting closure' total score 60, so 60/3 = 20).

Answer 2: I could compare these (chi square would be appropriate), but any statistical difference may not reflect a lack of internal consistency; it may instead reflect the component parts not equalling the overview, i.e. there was another key element in the meeting that had been missed in the breakdown (e.g. 'discussion groups that took place within the meeting').

Thanks again,
[5] Dear Statistics Clinic team,

I visited the clinic on the 3rd of February.

The purpose of the consultation was to double-check whether the tests and software I used to compare the chemical composition of 3 samples were appropriate.

I would recommend the Statistics Clinic to anyone with statistical issues.

Thanks for your help
[6] Hello,

First of all just to say that this is a really good idea and is very helpful!

The question I had was mainly about reporting results and CIs from multilevel modelling on the original scale, as my outcome variable was transformed and the square root of it was used.

Your advice was, for each coefficient, to calculate the prediction for an average person, do the same for the average plus 1, and take the difference as the coefficient on the original scale. You also suggested using bootstrapping with the raw outcome variable, just to check that the results from the 2 methods are roughly the same.
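The back-transformation described here can be sketched as follows. This is a minimal illustration assuming a model with a single predictor fitted to the square root of the outcome; the function name and the coefficient values are hypothetical and do not come from the original analysis.

```python
def effect_on_original_scale(intercept: float, slope: float, x_mean: float) -> float:
    """Change in the back-transformed prediction when x moves from its mean to mean + 1.

    Assumes the fitted model is sqrt(y) = intercept + slope * x.
    """
    at_mean = (intercept + slope * x_mean) ** 2
    at_mean_plus_one = (intercept + slope * (x_mean + 1.0)) ** 2
    return at_mean_plus_one - at_mean


# Hypothetical coefficients on the square-root scale.
print(effect_on_original_scale(intercept=2.0, slope=0.5, x_mean=4.0))
```

Because the square is non-linear, the size of this difference depends on where it is evaluated, which is why the calculation is anchored at the average person. Squaring a fitted value is also not exactly the mean on the original scale, so a bootstrap on the raw outcome is a sensible cross-check.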

All the best,
[7] Thank you for the opportunity to come to the statistics clinic and talk with you.

My problem is how to choose a good number of principal components.

I was given some good ideas about using a regression step to reduce the dimension of the matrix before principal components analysis. This has given me a useful way to think about my problem. I now have a clearer view of my current work and will pay more attention to the objective of my project. Overall, today's meeting was really helpful to me. Thanks a lot. Best wishes,
[8] Dear Statistics Clinic,

As requested, here is a summary of my question:

I asked how to understand regression when an independent variable mediates its effect through two or more mediator variables. You explained how to put the regression coefficients together additively.
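One common way to combine coefficients additively in this setting is the product-of-coefficients rule for linear models: each mediated path contributes the product of its two coefficients, and the paths are summed with the direct effect. The sketch below assumes that this is what was meant, with parallel mediators and no interactions; the function names and numbers are hypothetical.

```python
from typing import List, Tuple


def indirect_effect(paths: List[Tuple[float, float]]) -> float:
    """Sum of a_i * b_i over mediators.

    a_i: coefficient of the independent variable in the model for mediator i.
    b_i: coefficient of mediator i in the model for the outcome.
    """
    return sum(a * b for a, b in paths)


def total_effect(direct: float, paths: List[Tuple[float, float]]) -> float:
    """Direct effect plus the summed indirect effects."""
    return direct + indirect_effect(paths)


# Hypothetical coefficients for two mediators.
paths = [(0.5, 0.4), (0.3, -0.1)]
print("indirect:", indirect_effect(paths))
print("total:", total_effect(0.2, paths))
```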

As requested by your record-keeping form, I have copied my supervisor into this email.

Many thanks for your help,
[9] Hello,

I visited the clinic today. It was extremely helpful and I wish I had known about it at the beginning of my PhD!

We discussed classification techniques for large nuclear forensic datasets. I already had a script for LDA provided by an industrial sponsor; this was modified and the technique explained.

Very glad I dropped in, and thank you for your time.

Many Thanks,
[10] My question concerned: how to test for significantly enriched proteins in condition A vs. condition B using a dataset containing a list of proteins and their respective abundance under condition A and condition B.

The stats clinic clarified the statistical test I should use, how to implement the test in R and how to test for an overall difference between the two data sets using a GLM.

I was helped with a question relating to comparing values extracted using a sigmoidal function. A couple of approaches were suggested, such as coding the data in terms of correct/incorrect responses and looking at the frequency of these, and also suggested looking into Hotelling's T-squared test.

Many thanks for offering this drop-in clinic.
[11] Hello,

As advised during the Statistics clinic, I am sending a brief description of my question, as well as the recommended course of action.

Question: I am planning to use statistical models to construct a meaningful dissimilarity metric for different sets of symbolic sequences. I wanted to check the general validity of my approach, as well as asking about some more specific concerns I have about my approach.

Recommended course of action: general approach seems to make sense (if I explained it well enough). Use simulation studies to address my specific concerns. Can use multidimensional scaling to visualise dissimilarities once they're found.

It was really useful to have a chat about the problem, thank you very much for your time.

Kind regards,
[12] Nature of question: To find out the appropriate statistical tests I could use to assess the variability in certain properties of my organisms between and within species and also through time, as current tests gave results which did not match the spread of data.

Course of action recommended: transform the data and look more closely at histograms of the data and of the residuals to explain the statistical results.
[13] Thanks for the help on two-way ANOVA test. I was told to plot the data first. If the data within treatments are obviously separated from each other, even though the assumptions for ANOVA test are not met, the conclusion on estimates (e.g. mean of each treatment) can still be made without quoting p value.
[14] Thank you for your statistical recommendations this afternoon. The following sentences summarise the nature of my question and your recommended course of action.

I wanted to know how I could analyse my data statistically using two-way ANOVA with the Bonferroni post hoc test, following a similar previous study in the literature. After I explained my data to you, you recommended that I use the t-test with Bonferroni correction for my statistical analysis, and noted that I can't get much information out of the statistical analysis due to the very small sample size (3). Also, the smaller the sample size, the less likely it is that smaller differences between samples will be detected.

I have copied this email to my supervisor as you requested.
[15] Thank you for your help.

First - it is important to say that your help has been fundamental. There is nowhere else I could have gone for help. Without your help today I would have submitted a manuscript with potentially incorrect statistics.

Summary of questions and answers: 1. Is ANCOVA the most suitable way to control for a covariate, since I have different BMIs for my control and treatment groups? Answer: The team pointed out that I need to know whether BMI actually affects the outcome variable. Plotting a box plot to start with is important. Then go on to complete general linear modelling with age, BMI and sex as covariates. Look into how to include interactions in a GLM using SPSS.

2. I have data that is binary (presence of an anatomical variant or not) in my control and treatment groups. I want to compare the prevalence of this between groups, whilst controlling for covariates. Answer: Simpson's paradox was mentioned when talking about Fisher's exact test. It was suggested that linear regression should be used to test whether the variant prevalence is different between groups, using the predicted risk ratio.

3. Is binary logistic regression the best way to test which variables are related to the presence of disease or not? Answer: yes, but express data regarding odds ratio in a more clinically useful way, e.g. express unit +/- x amount from mean for each variable that is independently associated with the disease.

Thank you so much for your help.
[16] While at the statistics session, I asked if my current methods for survey data analysis were appropriate. I was told they were and that I should elaborate in-text the assumptions I made and the reasoning behind my analysis; including an equation in summation form would be helpful.
[17] Dear Stats Clinic,

I came to the clinic yesterday as I was trying to fit a GLM and test for the significance of the fixed effects, but the distribution of the data and residuals were both leptokurtic. Your advisor recommended bootstrap linear regression to test for the significance of the terms, and also confirmed that a Gamma distribution GLM should also be suitable for my data to avoid bootstrapping.

Many thanks for your help.
[18] I attended the statistics clinic last week and found it very helpful. I have quite a large dataset generated from different types of soil. I asked whether I had applied a correct statistical test in the workflow in order to discriminate between them, and how I could prove that I am doing it correctly. The stats team suggested that I use a leave-one-out cross-validation method for all of my samples. I will definitely come to the next session for further discussion. Thank you.
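Leave-one-out cross-validation holds out each sample in turn, trains on the rest, and checks whether the held-out sample is classified correctly. The sketch below illustrates the idea only: the nearest-centroid classifier is a stand-in for whatever discrimination method is actually used, and the soil labels and measurements are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def nearest_centroid_predict(train: List[Sample], point: Sequence[float]) -> str:
    """Classify `point` by the closest class mean (a stand-in classifier)."""
    groups: Dict[str, List[Sequence[float]]] = {}
    for features, label in train:
        groups.setdefault(label, []).append(features)
    best_label, best_dist = "", float("inf")
    for label, rows in groups.items():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((c - x) ** 2 for c, x in zip(centroid, point))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(data: List[Sample]) -> float:
    """Fraction of samples classified correctly when each is held out in turn."""
    correct = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # everything except sample i
        if nearest_centroid_predict(train, features) == label:
            correct += 1
    return correct / len(data)


# Hypothetical two-feature measurements for two soil types.
soils = [
    ([1.0, 1.0], "clay"), ([1.2, 0.9], "clay"), ([0.9, 1.1], "clay"),
    ([3.0, 3.1], "sand"), ([3.2, 2.9], "sand"), ([2.9, 3.0], "sand"),
]
print("LOOCV accuracy:", loocv_accuracy(soils))
```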
[19] Suitability of behavioural data for 3-way ANOVA, effect of violated assumptions. Approaches for differentiating spindles/seizures, in particular discriminant analysis of spectral characteristics.

Thanks a lot for your help, I found it very useful.
[20] I needed advice how to analyse the risk effects of different combinations of HLA and non-HLA genes on development of multiple autoimmunity in children with type 1 diabetes.

I was able to analyse their separate effects, but was unsure how to figure out how their combinations modulate the initial risks.

The stats clinic was really helpful for me. The principles underlying the calculation of interactions between independent variables and confounders were explained to me, which helped me to go on and run this analysis of my data in SPSS.
[21] I attended the stats clinic on June 22nd. My problem related to how one might identify an inflection point in non-linear data. I was advised to consider an iterative process involving a best fit of two linear models.
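One way to read the two-linear-model advice is as a search over candidate breakpoints: fit a separate straight line on each side of every candidate split and keep the split with the smallest total residual sum of squares. The sketch below uses an exhaustive search rather than an iterative refinement, assumes distinct x-values, and uses hypothetical data; the function names are illustrative.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float, float]:
    """Ordinary least squares for y = a + b*x; returns (a, b, residual sum of squares)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss


def best_breakpoint(xs: List[float], ys: List[float], min_points: int = 3) -> Tuple[float, float]:
    """Return (x where the second segment starts, total RSS) for the best two-line fit."""
    pairs = sorted(zip(xs, ys))
    sx = [p[0] for p in pairs]
    sy = [p[1] for p in pairs]
    best = None
    for k in range(min_points, len(sx) - min_points + 1):
        rss = fit_line(sx[:k], sy[:k])[2] + fit_line(sx[k:], sy[k:])[2]
        if best is None or rss < best[1]:
            best = (sx[k], rss)
    if best is None:
        raise ValueError("not enough points for two segments")
    return best


# Hypothetical data: slope 1 up to x = 4, slope 3 from x = 5 onwards.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 1, 2, 3, 4, 6, 9, 12, 15, 18]
print(best_breakpoint(x, y))
```

The `min_points` guard keeps each segment from being fitted to too few points; with noisy data it is worth checking how sensitive the chosen breakpoint is to that choice.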

The advice I received was tremendously helpful and I am very grateful to the team for the support they provided. I'll be recommending this clinic to colleagues.
[22] I attended the statistics clinic on 6th July with my colleague. We attended the clinic to verify if our planned analyses for a virtual beverage choice study were valid. We were advised that our analysis at the group level was appropriate, but in order to avoid an ecological fallacy, we should visualise our data at the individual level as well (we were advised to perform this analysis in R).

Thank you.
[23] I visited the clinic today to ask for help in modelling the intrinsic variation in a dataset of noisy measurements with truncation.

The advice I received was very helpful. It was confirmed that my likelihood model was appropriate for the problem, but I was given a suggestion to look at implementing a gamma distribution instead of a lognormal to allow an analytic solution.

The clinic was very friendly and helpful - thanks to all involved!
[24] I visited the statistics clinic for help figuring out what information to include when I write up my results (basic descriptive statistics or regression outputs) and whether I should collect any additional information. The recommendation is to include a particular regression output, and also to calculate confidence intervals.
[25] I came to discuss a repeated-measures design for which I was planning to use GLMM in R. We discussed the procedure in R, a bit about error distributions and how to check if the model is a good fit. I thought the session was helpful and gave me a bit more confidence in what I aimed to do.
[26] Summary of question and response:

a) What is the best way to test multivariate continuous data across multiple groups?

Use ANOVA to calculate any differences in means and then use a K-S test to test distributions.

b) What is the best way to compare proportional discrete abundance data?

Use multiple chi-squared tests to compare between each group and/or use a chi-squared test to compare each group against a theoretical mean group.

Thank you for your help.
[27] I visited today to ask about a k-sample test of cumulative frequency distributions; the Kolmogorov-Smirnov test is available for 2 samples but how to test multiple samples is not clear.

You found a downloadable program for R for an alternative test (Anderson-Darling) which works for k samples, and went through how to apply Bonferroni corrections for multiple testing with the K-S test.
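The Bonferroni correction for a set of pairwise tests can be sketched as below: each p-value is multiplied by the number of tests (capped at 1) and compared with the chosen significance level. The p-values are hypothetical; with k samples compared pairwise there are k(k-1)/2 tests.

```python
from typing import List, Tuple


def bonferroni(p_values: List[float], alpha: float = 0.05) -> Tuple[List[float], List[bool]]:
    """Return Bonferroni-adjusted p-values and whether each test is rejected at `alpha`."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject


# Hypothetical p-values from the three pairwise K-S tests among three samples.
adjusted, reject = bonferroni([0.01, 0.04, 0.2])
print(adjusted)
print(reject)
```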
[28] I attended the statistics clinic today to ask whether my statistical analysis using an ethnographic dataset looked OK, particularly with regard to the use of network lag terms, and to ask whether there was anything else I should try. I was advised that my work looked satisfactory, though if I were a statistics student I might try alternative ways of constructing the network matrix to see how that would affect the results.

Thank you so much for your time and your help! All the best.
[29] I had 89 students complete a before and after questionnaire giving a Likert scale answer between 1 and 5 for three different parameters. I had calculated that there was statistically significant change using Excel and t test tables but needed to check
  • whether I had used the appropriate test for my population
  • whether I had calculated this correctly
  • how I could convert my figure into a p < 0.005 etc type answer
The recommendation was that a t test was appropriate but that I should consider a one tailed instead of a two tailed test. The calculation I used computed a s.d. for both sets together instead of separately which I needed to clarify. The t tables were explained to me too in more detail which was really helpful.

Thank you very much for the help.
[30] I am a PhD researcher. I had doubts about the principal components analysis and ANOVA carried out on the cranial measurements of two bat species, both in the analyses themselves and in the interpretation of the results. The clinic was very informative and useful for me because my doubts were resolved.
[31] Thank you for the help last week. I am writing an RCUK grant and needed advice on a possible machine-learning application. Your advice was very helpful in defining the nature of the problem I have and who could potentially help in the University. This is what I needed. Many thanks.
[32] Thanks for your help. This is the first time I've attended and it was incredibly helpful. The consultant was very patient and really helped me with the statistics problem I had.

Nature of my question: I had a metagenomics (DNA samples containing DNA from mixed unknown species of bacteria) dataset and I wanted help to look how I could statistically show which species were present or absent based on a reference dataset of known values.
[33] Thank you so much for organizing the clinic!

I had a question about performing regression analysis of data, choosing types of variables (nominal vs ordinal), comparison of different regression models built with different methods, calculation of the Bayesian Information Criterion and leave-one-out cross-validation for ordinal regression.
[34] I visited the statistics clinic today. It was really helpful! I had questions about the use of Lasso regression for variable selection. My plots look odd because the predictor variables are not much related to the dependent variable, so I am mostly trying to fit noise. I was told that I could try the hdi (high dimensional inference) R package for getting an indication of the significance of the regression coefficients.
[35] My question was: given how noisy MRI data is, what's the best way of quantifying the change (if any) in T2 relaxation between two T2 scans of the same volunteer scanned with the same magnet and protocol?

Answer was to calculate the variability on a voxel wise basis per participant, by computing a squared difference image of the whole brain, grey matter, white matter, or a specific region of interest for each participant and dividing by the number of voxels.
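The voxel-wise calculation described in the answer amounts to a mean squared difference over the chosen region. The sketch below assumes the two scans are already registered and that the voxels of the region have been flattened into two equal-length lists; the function name and T2 values are hypothetical.

```python
from typing import Sequence


def mean_squared_difference(scan_a: Sequence[float], scan_b: Sequence[float]) -> float:
    """Sum of squared voxel-wise differences divided by the number of voxels."""
    if len(scan_a) != len(scan_b):
        raise ValueError("scans must cover the same voxels")
    if not scan_a:
        raise ValueError("region contains no voxels")
    return sum((a - b) ** 2 for a, b in zip(scan_a, scan_b)) / len(scan_a)


# Hypothetical T2 values (ms) for four voxels in one region of interest.
first_scan = [80.0, 82.0, 79.0, 85.0]
second_scan = [81.0, 82.0, 77.0, 85.0]
print(mean_squared_difference(first_scan, second_scan))
```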
[36] I attended the statistics clinic yesterday. I wanted confirmation on the correct use of Generalized Additive Models (GAMs). We checked the use of the random effect, the fit of the models, etc. We also had a broader discussion on GAMs which was very helpful.
[37] I had been asked by a reviewer to re-analyse some data using a mixed effects model rather than an ANOVA, but didn't know what to include as random effects in the model, or how to define the covariance structure. Advisor was very helpful in clarifying these points, and suggested that I use something predicted to affect the measure that I didn't manipulate as a random effect (e.g. age), and helped me understand how the covariance matrix should be used.
[38] My question was to confirm that I do not need any statistical test for data interpretation. The advice I received was that I don't need one because the sample size is too small (n=3).
[39] My question was based on whether a fixed- or random-effects model would be more appropriate for the estimation of the explanatory power of both time-variant and -invariant factors for event counts in provinces measured over a short time series. Besides answering this question I also received tips for model checks as well as a useful R package to do the calculations more quickly and efficiently than the one I had been using.

On another question (interaction in the context of multi-level models) I was referred to the JGI as this seems to be the focus of their work.
[40] Just emailing as requested to summarise my experience at the statistics clinic. I came to the session with a question about the best statistical test to use on my data set, which consists of multiple repeats of detecting the amount of a particular protein bound to particular locations on DNA in three cell types. I found the session extremely helpful and was able to briefly explain my data and the logic behind the statistical test I'd previously used (Student's t-test), and why I didn't think that was appropriate. I received advice on and explanations of several other tests which would be more appropriate (one-way ANOVA and non-parametric tests like the Mann-Whitney and Wilcoxon tests).
[41] Question: What test should I use to conduct the equivalent of a mixed ANOVA with one within-subjects (repeated) measure and one between-subjects (group) measure with discrete data that is not normally distributed in SPSS?

Recommended action: try to do the test in R instead of SPSS. Alternatively answer the two questions separately (i.e. check for within-subjects difference separate from between). Another option may be to transform the data.
[42] The nature of my question was how to use data I had collected mathematically/statistically. Your advice was super helpful and I now feel equipped to move forwards in analysing the data. I would highly recommend the clinic.
[43] I came to the statistics clinic to ask two main questions;

1. How can I assess the significance of my data when one sample is intrinsically heterogeneous, thus giving a huge error. - I was told to try a non-parametric test (Mann-Whitney), or if this fails, to rank the data and analyse e.g. the top 70%.

2. I also asked how I can compare samples that consistently show the same trend, but are numerically very different between experimental repeats. - I was told to subtract one sample from the other, for each repeat, and then do a one sample t-test.
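The approach in point 2 (difference within each repeat, then a one-sample t-test on the differences) can be sketched as follows. The code computes only the t statistic and its degrees of freedom, to be compared against t tables; the function name and measurements are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> Tuple[float, int]:
    """One-sample t statistic for the per-repeat differences a - b, with its degrees of freedom."""
    if len(sample_a) != len(sample_b):
        raise ValueError("each repeat needs a value from both samples")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # stdev uses the n - 1 denominator
    return mean(diffs) / standard_error, n - 1


# Hypothetical repeats: absolute values differ widely, but A is above B every time.
a = [10.0, 52.0, 101.0]
b = [8.0, 47.0, 95.0]
t, df = paired_t_statistic(a, b)
print(f"t = {t:.3f} on {df} degrees of freedom")
```

Working with the differences removes the large repeat-to-repeat shifts, so the test looks only at whether the within-repeat difference is consistently away from zero.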
[44] Nature of my question: I have data measuring the amount of different bacteria species present in an environmental sample. I also have 'artificial data' containing known quantities of known species. I was seeking advice on how to model the artificial data and compare this with the 'real' data.

I was given advice on generating further artificial data for different proportions of species (compared with 1:1 ratio I currently have) and I was given advice on how to model this data, initially staying with a GLM.

Thank you for your help at the statistics clinic.
[45] We talked about 1. Logistic regression model for comparing ventilation responses to hypoxia in different groups

2. Power size calculations and advice regarding appropriate assumptions in doing these.

Helpful suggestions provided and link to useful website resource.
[46] Three problems identified:

1) Attempting to determine clusters of different N-dimensional technology time series in an unbiased manner so as to verify that actually expected groupings are supported, and not merely the result of forcing the data to arbitrarily fit into expected categories as a result of the selection of the technologies to consider.

2) Determining from those time series classifications that have been found to have some statistically significant alignment to the expected classifications which dimensions of the time series (specifically, which combination of patent indicators) provide the most suitable combination for use in a time series classification prediction model.

3) Deriving a model using the identified most relevant patent indicator set(s) for technology time series classification.

Course of action proposed:

1) Use of Dynamic Time Warping (to provide feature-based distance measure between dissimilar time series curves) and K-Medoids clustering on the N-dimensional time series for different patent indicator subsets to identify the indicator/dimension combinations that produce classification results with a statistically significant alignment to the expected classifications. K-Medoids recommended rather than K-Means as K-Means does not necessarily have any meaning for time series. 'PAM' algorithm used for K-Medoids clustering due to small number of observations (i.e. technologies) considered. Due to the small sample size (10 or 12 technologies considered in the dataset), Fisher's exact test was recommended for application to the 2x2 or 3x3 confusion matrices generated for determining indicator subsets that were statistically significant.

2) Cross-validation approach proposed based on sequentially training and then generating test predictions for different subset decompositions of the time series data again using Dynamic Time Warping & K-Medoids clustering (distinct subsets used for training and test data), and then using the average number of misclassified observations as a means to rank each statistically significant patent indicator subset considered. Following this discussion I have adopted 'Leave-p-out' cross-validation due to the small technology sample size considered, and have so far been using 'p = half of the available time series dimensions', to ensure cross-validation is based on the greatest possible number of sampling exercises.

3) For model building purposes Functional Data Analysis was recommended - this is what I am now looking to implement having established the top ranking patent indicator subsets from the previous cross-validation exercise.
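The Dynamic Time Warping distance used in steps 1 and 2 can be illustrated with the standard dynamic-programming recurrence. This is a simplified sketch for one-dimensional series with an absolute-difference local cost, not the N-dimensional implementation used in the project; the resulting pairwise distances would then be passed to a K-Medoids (PAM) routine. The function name and example series are hypothetical.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic Time Warping distance between two 1-D series (absolute-difference cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical series: the second is the first with one value held for longer.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
print(dtw_distance([0.0, 0.0], [1.0, 1.0]))
```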
[47] I went to the Statistics clinic looking for help to know how to manage the problem of missing data in a longitudinal trial (repeated measurement). Due to the low number of repetitions in every group the main suggestion was to explore causal analysis.
[48] My question was about the F-distribution and chi-square distribution in a hypothesis test analysis. My course of action will be to use the F-distribution with p-values, or finv in MATLAB.

I want to say thanks for the help, patience and availability, and for having the statistics clinic.
[49] My question: can I statistically analyse my data (differences between 3 groups: control, treated, epileptic animals, ignoring the within subjects factor) using a t-test instead of my current use of factorial ANOVA (I have 1 factor within subjects & 1 factor between subjects)?

Your answer: yes, use the t-test with Bonferroni correction.
[50] My question was regarding a polynomial regression model for modelling the temperature of components in the drivetrain of a wind turbine. My two questions were regarding:
- reducing the seasonality of my output
- having a statistical way to set a 'threshold' above which the component temperature could be judged to be performing abnormally.

The recommendation I received was to seek greater clarity on the method used in the polynomial regression and to look at multiple linear regression for the first point. For the second point, I was advised to look at statistical process control for setting the threshold.
[51] Today I was helped by two of your volunteers with regards to a multilevel modelling problem. The issue related to whether covariates could be included in a multilevel ordinal probit regression model and what effect this would have on the other parameters estimated in the model. I was advised that covariates would be fine to include in the model and that they would likely just shift the cut points of the model.
[52] I attended today with some questions about comparing the similarity between vectors and different measures. I had great help from your consultants who talked me through the defining properties of distance and how norms can be used - in particular the L_p norm which allows some tuning of the penalties for difference.
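The L_p distance mentioned here is the p-th root of the sum of absolute coordinate differences raised to the power p. A minimal sketch follows; the function name and vectors are hypothetical, and p is restricted to p >= 1 so that the result is a genuine distance.

```python
from typing import Sequence


def lp_distance(u: Sequence[float], v: Sequence[float], p: float = 2.0) -> float:
    """L_p (Minkowski) distance between two equal-length vectors, for p >= 1."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for the triangle inequality to hold")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


# Larger p puts more weight on the largest single coordinate difference.
u, v = [0.0, 0.0], [3.0, 4.0]
for p in (1, 2, 4):
    print(p, lp_distance(u, v, p))
```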

Thanks for the help.
[53] I came to the stats clinic as I had a few questions regarding GLMMs, particularly on the use of different error distributions and underdispersed/overdispersed data. I was given some useful guidance, and recommended to look into more data transformations, other packages for building GLMMs and potentially using dummy variables.
[54] I attended to ask advice about the validity of power calculations (at the planning stage of a study) - I was advised that the approach I had taken was broadly correct, as long as the data used to calculate effect size was normally distributed. Adding a graph showing power vs effect size was suggested.
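The suggested power-versus-effect-size relationship can be tabulated with a normal approximation, as sketched below. The sketch assumes a two-sided comparison of two equal-sized groups with standardised effect size d; the function name, the group size and the grid of effect sizes are hypothetical, and the values could be plotted rather than printed.

```python
from statistics import NormalDist


def power_two_sample(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test for standardised effect size d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region under the alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical planning scenario: 64 participants per group.
for d in (0.2, 0.35, 0.5, 0.8):
    print(f"d = {d:.2f}  power = {power_two_sample(d, 64):.3f}")
```

For small groups a t-based calculation gives slightly lower power than this approximation.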

Very helpful service & I hope to come back to get stats outline for a project proposal reviewed next month,
[55] I visited the statistics clinic to learn the main steps in developing a meta-analysis. The advisors suggested that I start with these basic points: Cochrane Collaboration Meta-analysis and the PRISMA Statement. These were useful recommendations for starting to work by myself and sharing any doubts at the next open clinic. Thanks for your help.
[56] Thank you for your time and help today. The following is a brief summary of today's visit.

I wanted to know the possible reasons why my results were not statistically significant despite having a big effect size (when compared to previous literature), and to have some guidance regarding sample size calculation.

I understood that the variability of the data (due to experimental or biological causes) makes a big difference when comparing results and testing significance. And, I was given some advice regarding sample size calculation.
[57] I asked you for help in analysing my qPCR data, a comparison of 2 groups with 3 replicates for each. You told me that the best thing to do was to plot a graph of the raw data so that the reader can judge for themselves whether the argument is convincing. We also discussed what statistical test I could use and you explained that I can't use a non-parametric test because with only 3 replicates there is not enough power to get a significant p-value. You explained that I should use a t-test for two samples of unequal variance but I should be aware of the assumptions of the test.

We also discussed ways of generating a correlation coefficient for a scatter plot.
[58] I came to discuss regression for a dataset I've compiled for marine phytoplankton, looking at cell size, maximum growth rate, temperature and taxa. We discussed different types of regression (and associated error estimates) and we agreed that I would look in detail into linear mixed effect models to account for different taxa on top of cell size and growth rate. In particular I plan to:

- read more about linear mixed effect models

- check the equation for adding a taxa effect onto alpha (the slope in the log-log transformed model)

- run the regression using R
[59] As part of my research on subliminal priming in human-computer interaction, I came to today's session (20th of September) with the following question: how to analyse response time data, taking into account the different stimulus locations and types studied.

The clinic advised me to use a Multiple linear regression model.
[60] I approached the clinic with a data set involving 54 individuals, split into two groups, treatment and control. Within the treatment group there are non responders. 16 independent markers had been measured on immune cells of each individual and I wanted to compare their expression between treatment and control groups, as well as using PCA to examine the whole data set.

I was advised to exclude non responders by taking out individuals whose growth rate departed significantly from the mean, to use appropriate comparisons and to use specific functions within R to perform a PCA.

Many thanks for your assistance
[61] I came to the clinic thinking I wanted a statistical tool to compare distributions (e.g. Kolmogorov-Smirnov test). It turns out that what I actually wanted was Bayesian model comparison, using the Bayes factor. However, given the difficulties in trying to calculate the marginal likelihood analytically, and the pitfalls in trying to compute it from the MCMC sampling points that I have available, we decided I should use the Akaike information criterion to compare my models.
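
For reference, the Akaike information criterion is AIC = 2k - 2 ln(L), where k is the number of fitted parameters and L the maximised likelihood; the model with the lower value is preferred. A minimal sketch follows, in which the model names, log-likelihoods and parameter counts are hypothetical placeholders rather than values from this project.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion from a maximised log-likelihood."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical (maximised log-likelihood, number of parameters) per model.
models = {"model_A": (-120.5, 3), "model_B": (-118.9, 5)}

scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
best = min(scores, key=scores.get)  # lower AIC is preferred

for name, score in scores.items():
    print(name, score)
print("preferred:", best)
```

Here the richer model fits slightly better but not by enough to pay for its two extra parameters.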
[62] Thank you for your assistance at the Stats clinic today. I had a number of questions regarding the most appropriate way of statistically analysing data related to the use of surgical simulators and their impact on surgical practice. I was given clear advice regarding the type of statistical test I require and the principles of how to do this for my MSc project.
[63] I came in last Wednesday to discuss my plans for taking soil samples in Antarctica to test for human impact. I was advised that the plan looked fine for a basic inside-outside comparison, but that I should follow up with a spatial statistician to double check.

Thanks very much for your help.
[64] In this clinic I asked for assistance with predicting values (specifically predicting the value of 'Interval' when the probability of the binary outcome 'Success' was 0.25) from a mixed effects logistic regression model in R, and with identifying and checking model assumptions. I was informed that use of the predict() function would be most appropriate but since 'Box' (categorical) and 'Trial' (continuous) also had small but significant effects I would either have to run separate models for each value, input the mean/median value for each, or create a new model omitting these (since Box and Interval were counterbalanced), and that there is limited scope for model checking of mixed effects logistic regression models as residuals are not particularly informative due to the binary nature of the outcome variable.
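
The inverse prediction described above (which 'Interval' gives a success probability of 0.25?) can also be done by solving the fitted linear predictor for Interval. The sketch below assumes a fixed-effects predictor of the form intercept + b_interval * Interval + b_trial * Trial, with Box at its reference level and random effects set to zero. All coefficient values are hypothetical, not estimates from this study.

```python
from math import exp, log

def logit(p):
    return log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + exp(-x))

def interval_for_probability(p, intercept, b_interval, b_trial, trial):
    """Solve logit(p) = intercept + b_interval * Interval + b_trial * trial."""
    return (logit(p) - intercept - b_trial * trial) / b_interval

# Hypothetical fixed-effect estimates; Trial held at a chosen value (e.g. its median).
intercept, b_interval, b_trial, trial = 2.0, -0.5, 0.01, 10

x = interval_for_probability(0.25, intercept, b_interval, b_trial, trial)
print(x)
# Check: plugging the solution back in should recover a probability of 0.25.
print(inv_logit(intercept + b_interval * x + b_trial * trial))
```

Holding Trial at its mean or median corresponds to the second of the three options mentioned in the feedback.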
[65] I asked a question about which statistical test to use to analyse the data in an area of research I am pursuing.

I was provided with appropriate advice on which test to use as well as alternatives I could use.

Thank you so much for your help, I will definitely return in future.
[66] My project looked at the flow rate of fluid through dentine and my question was regarding which analysis of my data would work best. The recommended action was to look at random effects analysis of variance tests and find one that fit.
[67] I was wondering if the Zero-Inflated statistical model would be good to analyse my count data with quite a lot of zeros. I was recommended to use a quasipoisson GLM instead as this did fit the data and it is easier to interpret the results. Thank you very much for this session, it really helped me a lot!
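
A quasi-Poisson model keeps the Poisson mean structure but estimates a dispersion parameter, and standard errors are scaled by its square root. As a rough illustration only, the sketch below estimates that dispersion for the simplest case, an intercept-only model, as the Pearson chi-square statistic divided by the residual degrees of freedom. The counts are made up.

```python
def pearson_dispersion(counts):
    """Dispersion estimate for an intercept-only Poisson model."""
    n = len(counts)
    mu = sum(counts) / n  # fitted mean under the intercept-only model
    pearson_chi2 = sum((y - mu) ** 2 / mu for y in counts)
    return pearson_chi2 / (n - 1)  # residual df = n - 1 fitted parameter

# Hypothetical counts with many zeros.
counts = [0, 0, 0, 1, 0, 5, 0, 2, 0, 12]
phi = pearson_dispersion(counts)
print(phi)  # values well above 1 indicate overdispersion relative to Poisson
```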
[68] I came by the clinic today to discuss my fish wounding and repair study, where we have assessed blood vessel regrowth at the wound site with and without an inflammatory response, but were struggling to apply the appropriate ANOVA test. Given the differences in variance between datasets, the recommended course of action was to either perform a nonparametric test (with no assumption of Gaussian distribution), or to transform the data via a log or square root transformation prior to performing a one-way analysis of variance and subsequent post testing (Bonferroni or Dunnett). Thanks again for your help!
[69] I had seven continuous outcome variables from two datasets, which I had been hoping to analyse using mixed effects general linear models, however several of the outcome variables were skewed (4 highly positively, one slightly negatively) and the one variable I had attempted to model had greatly heteroscedastic and non-normal residuals that were not resolved by various transformations, so I came to the clinic for advice on alternative analysis methods.

I was advised that it was more important to assess that the mean-variance relationship for the specified distribution is correct than whether the residuals were normal or homoscedastic, and that for some outcomes (such as those based on mean durations of an event) a gamma-family generalised linear model may be appropriate, though since there are many significant outliers it may be necessary to use other analysis methods such as quasi-likelihood methods or heavy-tailed distributions (neither of which are available in current versions of lme4 but can be performed using other packages) or if necessary non-parametric rank-based tests which are robust to outliers.
[70] My question for stats clinic was about model selection between two multiple linear regression models, one with all available predictors as starting point (M1) and the other with only predictors that didn't show collinearity as starting point (M2). As the final models were both single predictor models, but the M2 explained considerably less of the dependent variable than the M1, I wasn't sure whether the M1 would still be better fit for the data, regardless of collinearity that might have caused elimination of the predictor that ended up being the only significant predictor in the M2.

It was suggested that I test whether the M1 indeed explains more of the DV than the M2 and take a look at variable selection with knockoffs and/or LASSO methods.
[71] Q: Is Cox proportional hazards regression w/ time-varying covariates a suitable model for survival analysis of my data?

A: Certain properties of my dataset were identified that do not fit with this particular model: discrete-time and mixed effects. I was directed to look at survival analysis for discrete-time data (knowing the nomenclature was key here). I was also directed to include mixed effects in my model to account for the multi-level indexing of subjects (e.g. animal, region, cell, bouton) that would introduce considerable error in my outcome. This would involve using R and some already-formulated packages (e.g. coxme, for comparison to coxph and SPSS output already done).

Q: Is it fair to compare the results of survival analysis to results using randomised data for the covariate?

A: It was identified that shuffled data could be used, but randomised data would suffice and that it was reasonable to compare the outcomes of the model with real and randomised data for the covariate to answer the question as to whether the result is merely due to a random factor of the biology (details excluded).

Q: Is the Pearson's correlation of two variables (with the same units) giving me information about the correlation of the changes in those variables?

A: The conclusion was that the correlation of the two variables at each individual timepoint would not necessarily be describing anything about how they change together across time. This is where cross-correlation comes in, or analysis of the correlation of the change in the two variables between pairs of timepoints.

The discussion was very informative and thorough, attempting to identify any potential sources of error or assumptions with my entire approach (not just my particular questions). I would highly recommend the statistics clinic and will hopefully be visiting again with other questions and an update.
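
The last answer above, on correlating levels versus correlating changes, can be illustrated with a toy example. In the sketch below two made-up series both trend upwards, so their levels are strongly correlated, while their step-to-step changes move in opposite directions. The data and helper names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def diffs(x):
    """Changes between consecutive timepoints."""
    return [b - a for a, b in zip(x, x[1:])]

# Two hypothetical variables measured at six timepoints.
x = [1, 3, 4, 6, 7, 9]
y = [2, 3, 5, 6, 8, 9]

print(pearson(x, y))                # levels: close to +1 (shared trend)
print(pearson(diffs(x), diffs(y)))  # changes: negative in this example
```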
[72] We discussed how to plot and handle particle analysis data (counting particles in a set of 50 sequential images). I came away with some ideas for plotting the data generated by ImageJ and will bring back some plots to the next clinic to discuss the findings and what to do next with the data.

The advice I received was very helpful.
[73] I attended the stats clinic today with some questions on sampling methods and data analysis, regarding a project to assess whether there is a relationship between the use of feeding-grounds by gulls and human-related activity at those sites. We were advised that our sampling methods seemed suitable, and recommended to use a pilot study and the inclusion of some more weather variables in our data collection/analysis. We were also advised on what statistical methods appeared to be most suitable at this stage.
[74] My question for today's stats clinic was about p-value correction for the mixed effect regression models that I had run for different time windows and conditions of my ERP experiment.

It was suggested that I use stepwise addition instead of a backward elimination procedure for model selection, and that I create a "super-model" with all the conditions and time windows included.
[75] I needed advice on how to analyse an n-of-1 trial with multiple repeated measures at multiple discrete time points. We decided to go with a mixed regression model with fixed and random effects.
[76] At today's clinic I brought some animal behavioural data with the hope to check whether the generalised additive mixed models I had built were suitable. I was reassured that my approach was statistically sound and given some modifications to try (such as eliminating censored observations) to see if it improved the model fit.
[77] Question: How can I compare data consisting of 30-80 replicate measurements in each of 3 conditions, from an experiment that has been repeated 3 times?

Most of the data are non-normal, with a few abnormally high results.

Recommended actions: (1) Analyse ALL data using two-way ANOVA + multiple comparisons. (2) Repeat the two-way ANOVA + multiple comparisons on data from which the abnormally high values have been excluded. (3) Analyse ALL data using the non-parametric Kruskal-Wallis test followed by Dunn's test. (4) Check for agreement in p-values.


Use Randomisation variation of two-way ANOVA -- 'Edgington's approach'
[78] I attended the stats clinic today to ask about the data collection for our study. We are comparing teaching methods that would supplement simulation teaching for medical students. I brought the data collected from the previous four sessions that we held, and asked for advice on a) whether the way we were collecting data was appropriate and b) how best to analyse the data in a way that would be appropriate for publication.

The advice given to me was that we were collecting data in the correct way for what we were looking for. I was advised to take the data analysis back to basics and compare the data as a whole before subgrouping the data to have a look at the different variables, and to do this using histograms and standard deviation. I was also advised to include the raw data in the analysis for one data collection point, as this was more valuable than an average.

I would just like to say thank you so much for all the help provided today. The consultants that I spoke to were very helpful, and I am sorry that the project is so confusing! Please do let me know their full names (or what recognition you would like) so that we can include them in the acknowledgements section of our paper.
[79] My questions were to do with the problems encountered carrying out the analyses of my data - how to deal with my concern areas in a strong statistical manner. Specifically I discussed the problem with the main outcome variables; specific handling for the very small numbers severely impaired; multiple imputation for missing data; the Stata error messages; and multiple testing.

The recommended course of action was to use Poisson regression rather than logistic regression. I was told that modelling the injury count in this way would be the best for my data, and would also enable me to solve a number of the other issues that have been causing problems within logistic regression. I was also advised that the effect of missing data could be assessed within the model itself, and how to do this was explained to me. Further lecture notes were forwarded to me by one of the statistics lecturers who spoke to me, as she teaches Poisson regression. Some books were also recommended to me (An introduction...., Annette Dobson and Core Statistics, Simon Wood).

Thank you so much for this input - it has confirmed that the issues I was concerned about were valid ones that needed to be considered, and shown me that they can be dealt with satisfactorily. It has also helped with the blockage in my multiple imputation, where certain variables would not impute despite the Stata being correct.
[80] My question to you yesterday was on the appropriate statistics test for a spatial model output on jellyfish ingress and the importance of seasonal vs inter-annual signals. Having run this with a 2-way ANOVA on the raw data, it crashed the HPC node (using R).

You suggested running the analysis on the averaged dataset, which significantly reduces the computational power needed.
[81] This Wednesday Anouk and I attended the statistics clinic with some questions regarding our GLMM analysis. Talking to the consultant was extremely useful, especially the advice to test for an interaction between two of our variables to assess whether its inclusion in our model was necessary, and the advice on how to handle collinearity between some of our environmental variables. We now feel more prepared to begin drafting a first manuscript of our study.
[82] I wanted help interpreting and choosing the correct tests for statistical analysis on linguistic data. I was recommended to normalise the means of each speaker before running individual t-tests on my formant values for a vowel sound. I was also recommended to compute a binomial confidence interval for another set of data. The consultants were incredibly patient and amazingly helpful. Thank you very much.
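
The feedback does not say which binomial confidence interval was recommended. As one common choice, and purely as an assumption here, the sketch below computes the Wilson score interval for a proportion. The counts in the example are hypothetical.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Hypothetical data: 18 of 40 tokens show the feature of interest.
low, high = wilson_interval(18, 40)
print(low, high)
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] even when the observed proportion is 0 or 1.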
[83] I went to the statistics clinic to ask for help clarifying a linear model using the integral of a ring to predict electric charge. The relationship was modelled with an LM initially, then as a GAM to account for heteroscedasticity. The predict function was recommended to obtain charge values from integral values.

Thank you, I found it really helpful.
[84] I attended the statistics clinic on 9.1.19 to seek advice on statistical methods to analyse the field data I have collected during my PhD. I received very helpful advice on the use of mixed factor regression models to fit my data. I now feel much more confident in working with my results. Thanks!
[85] I came to the stats clinic on 23rd Jan, and discussed ways of and alternatives to comparing correlations. The consultant was very helpful in offering advice on how to graph information from correlation analyses as well as offering canonical correlations as a potential option for investigation. We also had a good discussion about appropriate use of statistics & p-values in general. Overall a very good / helpful experience - thanks!
[86] My question was about using ANOVA for dominant control analysis.

Talking with you guys helped me better understand how an unbalanced design and potential correlations among factors play a role in my estimated factor contributions.

Thank you!
[87] Thanks for the input on stat analysis of the data.

Question: In a data set looking at 11 proteins/phosphoproteins across a set of 2 knockdowns vs control upon 4 treatment time points what is the best statistical test and post hoc testing required?

Recommended course of action:

1) 2-way RM ANOVA (to control for gel/experiment variation each day that is presumably random)

2) Check the overall significance of the time factor, the column factor (control and knockdowns) and the interaction (time*column) for changes in the response profile of knockdowns vs control. The time factor (VEGF treatment) would be significant overall due to robust triggering of phosphorylations. If the column factor or the interaction is significant, proceed to any of the recommended post hoc tests (Dunnett more stringent, Sidak more lenient).

3) You may correct for multiple comparisons with Bonferroni's correction, which is very stringent and therefore optional. If the data remain significant after correction, the result is trustworthy.
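
To make the corrections mentioned above concrete, here is a small sketch of the Bonferroni and Sidak adjustments applied to a list of raw p-values. The p-values are hypothetical, and this covers only the generic adjustment formulas, not Dunnett's test or any particular software's post hoc procedure.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def sidak(p_values):
    """Sidak-adjusted p-values: 1 - (1 - p)^m for m tests."""
    m = len(p_values)
    return [1 - (1 - p) ** m for p in p_values]

# Hypothetical raw p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.5]
print(bonferroni(raw))
print(sidak(raw))
```

The Sidak adjustment is never larger than the Bonferroni one, which is why Bonferroni is described as the more stringent option.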
[88] We have relative frequency (%) distribution data allocated to bins of unequal size for control and treatment samples, we would like to know if there's a significant difference between these two data sets.

The data are best shown as cumulative frequency plots and the difference (maximum vertical separation, k) observed should be compared against the distribution of values (k) given if random permutations of the whole data set are generated.
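
The permutation approach described above can be sketched as follows. This assumes that raw per-observation values are available for both samples (the binned percentages would need to be handled differently), and all names and data are hypothetical.

```python
import random
from bisect import bisect_right

def max_separation(a, b):
    """Maximum vertical distance k between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, v) / len(sa) - bisect_right(sb, v) / len(sb))
        for v in sa + sb
    )

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Compare observed k with its distribution under random relabelling."""
    rng = random.Random(seed)
    observed = max_separation(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random permutation of the whole data set
        k = max_separation(pooled[:len(a)], pooled[len(a):])
        if k >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical measurements for control and treatment samples.
control = list(range(0, 20))
treatment = list(range(10, 30))
k, p = permutation_p_value(control, treatment, n_perm=500)
print(k, p)
```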
[89] I came in today with some in vivo (experiment performed in mice) data to have an opinion on the type of stat that I should use to analyse these data.

The person that helped me today confirmed that I was using the right approach and stat type.
[90] Questions:

- How can I improve the performance of Multiple Linear Regression for time series dataset?

- Is there any other statistical method that potentially performs satisfactorily for time series dataset?


- Consider creating dummy variables (One-Hot Encoding) for DateTime related columns such as Month, Day, Hour.

- Try to use VARMAX, VARIMAX, ARIMAX and the like.

- Have a look at the book "Introduction the Linear Model With R".
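
The first suggestion in the list above, dummy variables for date-time columns, might look like the sketch below. It assumes that "Day" means day of the week, and it drops the first level of each factor so that the dummies are not collinear with an intercept. Both choices are assumptions made for illustration.

```python
from datetime import datetime

def one_hot(value, categories, drop_first=True):
    """0/1 indicators for `value`; the first category is the baseline if dropped."""
    cats = list(categories)
    if drop_first:
        cats = cats[1:]
    return [1 if value == c else 0 for c in cats]

def datetime_features(ts):
    """Dummy variables for month, day of week and hour of a timestamp."""
    return (
        one_hot(ts.month, range(1, 13))    # 11 columns, January as baseline
        + one_hot(ts.weekday(), range(7))  # 6 columns, Monday as baseline
        + one_hot(ts.hour, range(24))      # 23 columns, hour 0 as baseline
    )

row = datetime_features(datetime(2019, 3, 15, 14))
print(len(row), sum(row))
```

Each timestamp becomes one row of 40 indicator columns that can be appended to the other predictors in the regression.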
[91] I was looking for some advice on how to analyse my high dimensional data of combinatorial receptor expression analysis of lymphocytes derived from tumour and non-tumour sites in mice. I am trying to analyse differences in expression levels of 7 different receptors (128 possible receptor combinations) between different mice, between tumour and non-tumour sites, and between different treatment conditions.

It was recommended that I continue to learn to use R, and read further about the generalised linear models and hierarchical clustering analysis, and come back in 2 weeks to discuss further.
[92] Thanks for your email. I came along and spoke to one of your colleagues who was very helpful and helped me understand my data a lot better. I have a lot more confidence that I have the right test now. I'm going to sit and redo all my stats and will come along to the next session if needed.
[93] I attended the stats clinic today. My main issue was that my GLM models were not normally distributed. The consultant was very helpful and suggested that I run my models separately (Females and Males) with a Poisson distribution and then do a step-wise variable selection with the function update.

Professor Peter Green, School of Mathematics, University of Bristol, Bristol, BS8 1TW, UK.
Email link Telephone: +44 (0)117 928 7967; Fax: +44 (0)117 928 7999