\(^1\)Department of General Psychology, University of Padova, Italy

\(^2\)Department of Philosophy, Sociology, Pedagogy and Applied Psychology, University of Padova, Italy

\(^3\)Department of Communication Sciences, Humanities and International Studies, University of Urbino Carlo Bo, Italy


Aims and content

The present document describes the data pre-processing steps used to read the different types of data collected over three days from a sample of 105 healthy adults (see the following sections).


For each data type, the following pre-processing steps are implemented to generate the wide and long datasets to be used for subsequent data analysis (see analytical reports 1. Psychometric and descriptive analyses and 2. Regression analyses):

  1. Raw data reading & merging

  2. Raw data recoding & processing

  3. Data filtering based on inclusion criteria and data quality


Here, we remove all objects from the R global environment, and we set the system time zone for better temporal synchronization across data files.

# removing all objects from the workspace
rm(list=ls())

# setting system time zone to GMT (for consistent temporal synchronization)
Sys.setenv(TZ="GMT")


The following R packages are used in this document (see References section):

# required packages
packages <- c("ggplot2","gridExtra","jsonlite","plyr","lubridate","scales","knitr","reshape2","careless",
              "shiny","pracma","lattice")

# generate packages references
knitr::write_bib(c(.packages(), packages),"packagesProc.bib")

# # run to install missing packages
# xfun::pkg_attach2(packages, message = FALSE); rm(list=ls())



1. PrelQS data

Here, the retrospective self-report data collected with the preliminary questionnaire are read and saved in the PrelQS dataset, recoded, and filtered based on the inclusion criteria.

library(ggplot2);library(gridExtra) # loading required packages


1.1. Data reading

First, we read the Preliminary_qs.csv data file exported from Google Forms.

PrelQS <- read.csv("qs preliminare/Preliminary_qs.csv")


1.2. Data recoding

Second, we recode the PrelQS variables to be used for the analyses.

1.2.1. ID recoding

As a first step, we recode participants’ identification codes ID (currently corresponding to their e-mail addresses) using the “HRVXXX” format (e.g., from ‘’ to ‘HRV001’). This is done with the prel.qs_IDrecode() function. Since the participants’ e-mail addresses are confidential, neither the function nor the original dataset is shown.

# renaming the first two variables
colnames(PrelQS)[1:2] <- c("timestamp","ID") 

# loading and using the function to recode the ID variable
source("prel.qs_IDrecode.R") 
(PrelQS <- prel.qs_IDrecode(PrelQS))[1:3,] # showing first three rows
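
Although the actual prel.qs_IDrecode() function cannot be shown, its core logic amounts to a lookup from e-mail address to progressive code. Below is a minimal sketch of this logic, assuming a hypothetical recodeID_sketch() function (the real function additionally assigns the “OUT” prefix to respondents who were not invited; see section 1.3.2):

# minimal sketch of the ID recoding logic (hypothetical function;
# the real one relies on a confidential e-mail-to-code lookup table
# and also marks non-invited respondents as "OUTXXX")
recodeID_sketch <- function(data){
  emails <- sort(unique(data$ID)) # confidential e-mail addresses
  codes <- paste("HRV",sprintf("%03d",1:length(emails)),sep="") # progressive "HRVXXX" codes
  data$ID <- codes[match(data$ID,emails)]
  return(data) }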


1.2.2. Other variables

Second, we rename the other variables, remove those not considered for the present study, and recode the considered variables to be used in the analyses.

# renaming all variables
colnames(PrelQS) <- c("time","ID", # submission timestamp & participants ID
                      "consent", # consent to participate (all 'yes')
                      "age","gender","weight","height","itaMT", # demographics
                      
                      # inclusion criteria
                      "drugs","drugs.which","cvDysf","cvDysf.which",
                      "otherDysf","otherDysf.which","phone","phone.which",
                      
                      # retrospective scales (* = not considered in this study)
                      paste("DASS21",1:21,sep=""), # Depression, Anxiety, and Stress Scale*
                      paste("PANAS",1:20,sep=""), # Positive and Negative Affective Schedule*
                      paste("DERS",1:36,sep=""), # Difficulties in Emotion Regulation Scale*
                      paste("BIS11",c(1:15,17:20,22,23,25:30),sep=""), # Barratt Impulsiveness Scale-11 (3 items out)
                      paste("PSI",1:17,sep="")) # Physical Symptoms Inventory*

# keeping only considered variables (demographics and inclusion criteria)
PrelQS <- PrelQS[,c("ID","time",colnames(PrelQS)[which(colnames(PrelQS)=="age"):which(colnames(PrelQS)=="height")],
                    colnames(PrelQS)[which(colnames(PrelQS)=="drugs"):which(colnames(PrelQS)=="phone.which")])]

# recoding variables
PrelQS[,c("ID","gender")] <- lapply(PrelQS[,c("ID","gender")],as.factor) # ID and gender as factor
PrelQS$time <- as.POSIXct(PrelQS$time,format="%Y/%m/%d %I:%M:%S %p") # time as POSIXct
PrelQS[PrelQS$height<3,"height"] <- PrelQS[PrelQS$height<3,"height"]*100 # correcting heights reported in meters
PrelQS <- cbind(PrelQS[,1:4],BMI=PrelQS$weight/(PrelQS$height/100)^2, # computing BMI, removing weight & height
                PrelQS[,7:ncol(PrelQS)])
for(i in which(colnames(PrelQS)%in%c("drugs","cvDysf","otherDysf","phone"))){ PrelQS[,i] <- gsub("Sì","Yes",PrelQS[,i]) }
PrelQS[,which(colnames(PrelQS)=="drugs"):which(colnames(PrelQS)=="phone.which")] <- # inclusion criteria as factors
  lapply(PrelQS[,which(colnames(PrelQS)=="drugs"):which(colnames(PrelQS)=="phone.which")],as.factor)

# sorting dataset by ID and time
PrelQS <- PrelQS[order(PrelQS$ID,PrelQS$time),]


1.3. Cleaning & filtering

Third, we clean and filter the data based on the inclusion criteria, focusing on those participants who were actually invited to participate in the daily protocol (N = 105).

First, we inspect the original number of cases in the PrelQS dataset. The questionnaire was completed by a total of 164 respondents, of whom a subsample was invited to participate in the daily protocol.

cat("Original No. of responses to the PrelQS =",nrow(PrelQS))
## Original No. of responses to the PrelQS = 164


1.3.1. Double responses

First, we inspect and remove cases of double responses (N = 3 pairs of responses with the same ID value). In these cases, only the first response (i.e., the one with the earliest timestamp) is retained.

# detecting double responses
cat(nrow(PrelQS[duplicated(PrelQS$ID),]),"cases of double responses (same ID)")
## 3 cases of double responses (same ID)
# removing double responses
memory <- PrelQS
PrelQS <- PrelQS[!duplicated(PrelQS$ID),]
cat("Removed",nrow(memory)-nrow(PrelQS),"double responses")
## Removed 3 double responses


1.3.2. Inclusion criteria

Second, we take a look at the variables concerning the inclusion criteria of the study:

  1. Not suffering from dysfunctions (e.g., anxiety disorder) or taking medications affecting the nervous system (e.g., antidepressants)

  2. Not suffering from dysfunctions (e.g., hypertension) or taking medications affecting the cardiovascular system (e.g., beta blockers)

  3. Owning an Android or iOS smartphone


In section 1.2.1., we recoded participants’ identification codes and marked the cases of participants not meeting the inclusion criteria, as well as those not invited to take part in the EMA protocol for other reasons, using the “OUT” label in the ID variable. Of the 161 total responses to the preliminary questionnaire, 105 participants (65%) were actually invited to participate in the daily protocol.

cat("Total No. of responses to the PrelQS =",nrow(PrelQS),"\n -",
    nrow(PrelQS[substr(PrelQS$ID,1,3)!="OUT",]),"invited\n -",nrow(PrelQS[substr(PrelQS$ID,1,3)=="OUT",]),"not invited")
## Total No. of responses to the PrelQS = 161 
##  - 105 invited
##  - 56 not invited


Here, we evaluate in more detail the reasons for the exclusion of the remaining 56 participants.

SUMMARY

  • 9 participants were not invited because they reported suffering from cardiovascular (i.e., premature heart beat, problems with the mitral valve) or other dysfunction (i.e., thalassemia, anxiety, Crohn’s disease) and/or taking antidepressants

  • 1 participant was not invited because she did not own a smartphone

  • A further 46 participants were not invited to join the EMA protocol for other reasons (e.g., calendar incompatibilities, end of the data collection)


Here, we compare the demographic and retrospective variables between included and excluded participants. Most participants who were not invited to take part in the daily protocol were women (likely due to the higher number of women among psychology undergraduates, who constituted most of the sample, whereas invitations aimed at balancing the sample by gender), with no marked differences in terms of age or BMI.

# preparing data
PrelQS$excl <- "IN" # creating excl variable (factor)
PrelQS[substr(PrelQS$ID,1,3)=="OUT","excl"] <- "OUT"
PrelQS$excl <- as.factor(PrelQS$excl)

# plotting
grid.arrange(ggplot(PrelQS,aes(x=excl,fill=gender)) + geom_bar(),
             ggplot(PrelQS,aes(x=excl,y=age,fill=excl)) + geom_violin(),
             ggplot(PrelQS,aes(x=excl,y=BMI,fill=excl)) + geom_violin(),nrow=1) 

# removing excl
PrelQS$excl <- NULL


DYSFUNCTIONS (7)

Here, we show the No. of participants not invited to take part in the EMA protocol due to dysfunctions affecting either the cardiovascular or the nervous system. We can see that 4 females were excluded since they reported suffering from a cardiovascular dysfunction, including premature heartbeat (027, 055) and problems with the mitral valve (033, 053). 3 further females (4.61%) were excluded since they reported suffering from other dysfunctions, including thalassemia (009), anxiety (020), and Crohn’s disease (049).

# inclusion criteria variables
ic <- colnames(PrelQS)[which(colnames(PrelQS)=="drugs"):which(colnames(PrelQS)=="phone.which")]

# cardiovascular dysfunctions
PrelQS[substr(PrelQS$ID,1,3)=="OUT" & PrelQS$cvDysf=="Yes", c("ID","gender",ic)]
# other dysfunctions
PrelQS[substr(PrelQS$ID,1,3)=="OUT" & PrelQS$otherDysf=="Yes", c("ID","gender",ic)]


MEDICATIONS (3)

Here, we show the No. of participants not invited to take part in the EMA protocol due to medications affecting either the cardiovascular or the nervous system. 3 participants (017, F; 029, M; 033, M) were excluded since they took antidepressants (Laroxyl, Venlafaxina, Remeron), one of whom also suffered from a cardiovascular dysfunction.

# medications (antidepressants)
PrelQS[substr(PrelQS$ID,1,3)=="OUT" & PrelQS$drugs=="Yes", c("ID","gender",ic)]


SMARTPHONE (1)

Here, we show the No. of participants not invited to take part in the EMA protocol due to not owning a smartphone. Only 1 female (051) was excluded since she reported not owning a personal smartphone.

# smartphone ownership
PrelQS[substr(PrelQS$ID,1,3)=="OUT" & PrelQS$phone=="No", c("ID","gender",ic)]


OTHER REASONS (46)

The further 46 participants marked with "OUT" were not invited due to other reasons (e.g., calendar incompatibility, end of the study). Here, we inspect the data of these participants and remove them from the dataset. We can see that they were mainly females (probably due to the higher number of female undergraduates participating in psychological studies, and our goal of having a gender-balanced sample), with no other peculiarities compared to the rest of the sample.

# showing participants that were not invited for other reasons
PrelQS[substr(PrelQS$ID,1,3)=="OUT" & !(PrelQS$ID%in%c("OUT033","OUT053","OUT027","OUT055","OUT009","OUT020",
                                                       "OUT049","OUT017","OUT029","OUT051")),]


1.3.3. Data filtering

Third, we remove all participants who were not invited to the daily protocol (marked as “OUT”) from the PrelQS dataset.

new <- PrelQS[!substr(PrelQS$ID,1,3)=="OUT",]
cat("Filtered",nrow(PrelQS)-nrow(new),"cases; new number of cases =",nrow(new))
## Filtered 56 cases; new number of cases = 105
PrelQS <- new


1.3.4. Flagged participants

Fourth, we flag further participants who are not excluded but whose inclusion might bias some of the results.

1.3.4.1. Dysfunctions

First, some of the included participants reported suffering from dysfunctions or taking medications that were considered irrelevant for the current study. Here, we inspect those conditions in more detail (N = 9 with dysfunctions and N = 13 taking medications):

  • 6 included participants (4 females, 2 males) reported suffering from cardiovascular dysfunctions: arrhythmia, tachycardia episodes, and bicuspid aortic valve

  • 3 included participants (2 females, 1 male) reported suffering from other dysfunctions: hypothyroidism, asthma, and allergy

  • 6 females reported taking hormonal contraceptives

  • 8 participants took other medications including antihistamines (2 males, 5 females), and thyroid hormones (EUTIROX, 1 female)

# cardiovascular dysfunctions
PrelQS[PrelQS$cvDysf=="Yes", c("ID","gender",ic)]
# other dysfunctions
PrelQS[PrelQS$otherDysf=="Yes", c("ID","gender",ic)]
# drugs
PrelQS[PrelQS$drugs=="Yes", c("ID","gender",ic)]


Although we considered such conditions as potentially irrelevant for the present study, some of them (especially cardiovascular and other dysfunctions) might play a role. Thus, we create the dysfun variable to flag cases of included participants with cardiovascular and/or other dysfunctions.

# recoding dysfun variable (accounting for both cardiovascular and other dysfunctions)
PrelQS$dysfun <- 0
PrelQS[PrelQS$cvDysf=="Yes" | PrelQS$otherDysf=="Yes","dysfun"] <- 1

# recoding drugs variable
PrelQS$drugs <- gsub("Yes","1",gsub("No","0",PrelQS$drugs))

# removing unnecessary variables
PrelQS[,c("cvDysf","otherDysf","cvDysf.which","otherDysf.which","drugs.which","phone","phone.which")] <- NULL

# summary of dysfun and drugs
PrelQS[,c("dysfun","drugs")] <- lapply(PrelQS[,c("dysfun","drugs")],as.factor) # both as factor
summary(PrelQS[substr(PrelQS$ID,1,3)=="HRV",c("dysfun","drugs")])
##  dysfun drugs 
##  0:96   0:92  
##  1: 9   1:13


1.3.4.2. Postponed day

Second, four of the included participants (4.1%) lost one protocol day (i.e., day 3 for HRV013, HRV021, and HRV091; day 1 for HRV067), which was postponed to a following week. Here, we mark these participants with postponed = 1.

PrelQS$postponed <- 0
PrelQS[PrelQS$ID %in% c("HRV013","HRV021","HRV067","HRV091"),"postponed"] <- 1


1.3.5. Further excluded

Finally, 8 further participants were excluded due to dropping out during the EMA protocol (009, F; 059, M), poor quality of physiological data (015, F; 046, F; 066, M), or technical problems with the mobile app (011, M; 027, M; 029, M). Here, we mark these participants as ‘OUT’ and inspect again the differences between included and excluded participants. We can note that the excluded participants do not substantially deviate from the distributions of the included ones. Then, we remove them from the PrelQS dataset.

# marking 8 excluded participants as "OUT"
memory <- PrelQS
PrelQS$ID <- as.character(PrelQS$ID)
excl <- c("HRV009","HRV011","HRV015","HRV027","HRV029","HRV046","HRV059","HRV066")
PrelQS[PrelQS$ID%in%excl,"ID"] <- gsub("HRV","OUT",PrelQS[PrelQS$ID%in%excl,"ID"]) # marking excluded participants as "OUTXXX"

# plotting
PrelQS$excl <- "IN" # creating excl variable (factor)
PrelQS[substr(PrelQS$ID,1,3)=="OUT","excl"] <- "OUT"
PrelQS$excl <- as.factor(PrelQS$excl)
grid.arrange(ggplot(PrelQS,aes(x=excl,fill=gender)) + geom_bar(),
             ggplot(PrelQS,aes(x=excl,y=age,fill=excl)) + geom_violin(),
             ggplot(PrelQS,aes(x=excl,y=BMI,fill=excl)) + geom_violin(),nrow=1) 

# updating number of included participants
cat(nrow(PrelQS[substr(PrelQS$ID,1,3)=="HRV",]),"included participants out of",
    nrow(memory[substr(memory$ID,1,3)=="HRV",]),"invited participants")
## 97 included participants out of 105 invited participants
# removing excluded participants
PrelQS <- PrelQS[substr(PrelQS$ID,1,3)=="HRV",]
PrelQS$ID <- as.factor(PrelQS$ID) # ID as factor
PrelQS <- PrelQS[order(PrelQS$ID),] # sorting by ID
PrelQS$excl <- NULL # removing excl variable



2. ESM data

Here, the raw data collected with the experience sampling method (ESM) are read and saved in the ESM dataset, recoded, and filtered based on response rate and inclusion criteria.

2.1. Data reading

First, we read the JSON data files exported from the Sensus Mobile app (Xiong et al., 2016).

Specifically, the Protocols created with Sensus were configured to store the recorded .JSON data files in our private AWS S3 bucket, from which they were downloaded and stored in the ESM/data folder. From the Sensus app, we also exported the Probe Definition files, which we use to replace the input IDs (i.e., by default, each item is associated with an alphanumeric code) with the corresponding input names (i.e., the item labels that we set in the protocols). These files were saved in the ESM/probe folder.
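
Under the hood, this substitution amounts to a simple lookup between the code and label columns read from the Probe Definition files. Below is a minimal sketch of the idea, using hypothetical alphanumeric codes in place of the real Sensus input IDs:

# minimal sketch of the inputId -> inputName substitution (hypothetical codes;
# the real codes and labels are read from the Probe Definition JSON files)
lookup <- data.frame(inputId=c("3f2c61a0","9b7d44e2"),
                     inputName=c("v1.male.bene","t1.rilassato.teso"),stringsAsFactors=FALSE)
resp <- data.frame(InputId=c("9b7d44e2","3f2c61a0"),Response=c("4","6"),stringsAsFactors=FALSE)
resp$InputId <- lookup[match(resp$InputId,lookup$inputId),"inputName"] # codes -> item labels
resp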

The readSurveyData function is used to optimize the ESM data reading.

readSurveyData

#' @title Read JSON data exported from the Sensus mobile app
#' @param data.path = character string indicating the full path to the folder where the JSON raw data are stored.
#' @param probe.definition = character string indicating the full path to the folder where the Probe Definition JSON file(s) are stored (from the Sensus mobile app: Protocol > Probe > Scripted Interaction > Share definition).
#' @param messages = logical value indicating whether the processing steps should be printed (default: TRUE).
readSurveyData <- function(data.path,probe.definition,messages=TRUE){ require(jsonlite)

  # 1. Reading data
  # .......................................
  options(digits.secs=3) # setting No. of digits
  paths = list.files(data.path,recursive=TRUE,full.names=TRUE,include.dirs=FALSE) # listing files in data path
  if(messages==TRUE){ cat("\n\nReading",length(paths),"data files from",data.path,"...")}
  var.names <- c("ParticipantId","Timestamp","InputId","Response","RunTimestamp","SubmissionTimestamp",
                 "ScriptName","ProtocolId","$type") # selecting variables of interest
  data <- as.data.frame(matrix(nrow=0,ncol=9)) # empty dataframe creation and population
  colnames(data) <- var.names
  for(path in paths){ 
    if(file.info(path)$size>0){ # only reading Datum files (i.e., containing ScriptDatum, which is sized > 0 Kb)
      new.data <- read_json(path,simplifyDataFrame=TRUE)
      if(class(new.data)=="data.frame" & !is.null(new.data$Response)){ # only keeping files with Response data
        if(class(new.data$Response)=="data.frame"){ # sometimes the Response column is read as dataframe
          new.data$Response <- as.character(new.data$Response$`$values`)}
        data <- rbind(data,new.data[var.names]) }}} 
  data <- data[!is.na(data$ParticipantId),] # only keeping data with ParticipantId information (i.e., participant identification)
  data <- data[!is.na(data$InputId),] # only keeping data with InputId information (i.e., item identification)
  names(data)[9] <- "os" # $type as OS (android or iOS) # mobile OS information
  data[,9] <- gsub("Sensus.Probes.User.Scripts.ScriptDatum, Sensus","",data[,9]) # removing unuseful information
  
  # 2. From Response Ids to Item labels (based on Probe Definition) 
  # ...............................................................
  if(!is.na(probe.definition)){ if(messages==TRUE){ cat("\n\nConverting ResponseId into InputId based on Probe Definition...")}
    # function to read Probe Definition files
    readProbe <- function(path){ probedefinition <- read_json(path,simplifyDataFrame=TRUE) # first probe definition file
      # reading input labels of the first inputGroup
      inputs <- probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`[[1]]$Inputs$`$values`[[1]]$Name
      infos <- probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`[[1]] # input labels
      PROTOCOL <- data.frame(protocolName=probedefinition$Protocol$Name,protocolId=probedefinition$Protocol$Id, # protocol name
                             scriptName=infos$Name,inputName=inputs,inputId=infos$Inputs$`$values`[[1]]$Id) # script name
      if(length(probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`)>1){ # reading the following inputGroups
        for(i in 2:length(probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`)){
          inputs <- probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`[[i]]$Inputs$`$values`[[1]]$Name
          infos <- probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`[[i]]
          PROTOCOL <- rbind(PROTOCOL,data.frame(protocolName=probedefinition$Protocol$Name,
                                                protocolId=probedefinition$Protocol$Id,
                                                scriptName=infos$Name,inputName=inputs,
                                                inputId=infos$Inputs$`$values`[[1]]$Id)) }}
      return(PROTOCOL) }
    paths = list.files(probe.definition,recursive=TRUE,full.names=TRUE,include.dirs=FALSE) # listing files in probe.definition path
    PROTOCOL <- readProbe(paths[1]) # reading the first Probe Definition file
    if(length(list.files(probe.definition))>1){ # reading the following Probe Definition files when > 1
      for(path in paths[2:length(paths)]){ PROTOCOL <- rbind(PROTOCOL,readProbe(path)) }}
    for(i in 1:nrow(data)){ 
      for(j in 1:nrow(PROTOCOL)){ if(!is.na(data[i,3]) & data[i,3]==PROTOCOL[j,5]){ data[i,3] <- as.character(PROTOCOL[j,4]) }}}}
  
  # 3. Cleaning and unlisting Response data
  # ...............................................................
  if(messages==TRUE){ cat("\n\nUnlisting Response data and removing system information...")}
  data$Response <- gsub("list","",data$Response) # cleaning categorical items from Sensus system info
  data$Response <- gsub(paste("c","\\(|\\)",sep=""),"",data$Response)
  data$Response <- gsub("\\(|\\)","",data$Response)
  data$Response <- gsub("\\[|\\]","",data$Response)
  data$Response <- gsub("\\$type` = \"System.Collections.Generic.List`1System.Object, mscorlib, mscorlib\", ","",data$Response)
  data$Response <- gsub("\\$values","",data$Response)
  data$Response <- gsub('``` = ', "",data$Response)
  data$Response <- gsub('\ ', "",data$Response)
  data$Response <- gsub('\"', "",data$Response)
  data$Response <- gsub('\"No\"', "No",data$Response)
  data$Response <- gsub('\"Sì\"', "Si",data$Response)
  if(class(data$Response)=="data.frame"){ data$Response <- as.character(data$Response$`$values`[[1]]) # unlisting Response column
    } else { data$Response <- as.character(data$Response) }
  
  # 4. Encoding time information
  # ...............................................................
  if(messages==TRUE){ cat("\n\nConverting Timestamp variables as POSIXct...")}
  timestamps <- c("Timestamp","RunTimestamp","SubmissionTimestamp") # timestamps variables
  data[,timestamps] <- lapply(data[,timestamps],function(x) as.POSIXct(x,format="%Y-%m-%dT%H:%M:%OS",tz="GMT"))
  
  # 5. Cleaning incomplete double responses
  # ...............................................................
  # in some cases, multiple responses are recorded with the same ParticipantId, RunTimeStamp, and InputId
  if(messages==TRUE){ cat("\n\nLooking for double responses...\n")}
  data$id.r <- paste(data$ParticipantId,data$RunTimestamp,data$InputId,sep="_") # ID-time-item identifier
  data$id.t <- paste(data$ParticipantId,data$RunTimestamp,sep="_") # ID-time identifier
  doubles <- levels(as.factor(data[duplicated(data$id.r),"id.t"])) # double responses (same ID-time-item identifier)
  ndoubles <- rep(length(doubles),2) # No. of double responses
  for(double in doubles){ subId <- levels(as.factor(data[data$id.t==double,"SubmissionTimestamp"])) # submission time
    if(length(subId)>1){ ndoubles[1] <- ndoubles[1] - 1
      maxDouble <- max(as.POSIXct(subId)) # when the SubmissionTimestamp is different -> splitting responses
      data[data$id.t==double,"RunTimestamp"] <- maxDouble - 61.20 # 2nd RunTimeStamp = Submission - 61.20 sec (median resp time) 
    } else { ndoubles[2] <- ndoubles[2] - 1
      data <- rbind(data[data$id.t!=double,], # when the SubmissionTimestamp is also the same -> removing double resp.
                    data[data$id.t==double & !duplicated(data$id.r),]) }}
  cat(" - Removed",ndoubles[1],"double responses (same ID, InputName, RunTimestamp, & SubmissionTimestamp)\n",
      "- Splitted",ndoubles[2],"double responses with different SubmissionTimestamp\n",
      "  (in these cases, RunTimestamp is recomputed as SubmissionTimestamp - 61.20 sec (average response length)")
  
  # 6. Sorting columns and Reshaping
  # ...............................................................
  if(messages==TRUE){ cat("\n\nSorting columns and reshaping (one row per data entry)...")}
  colnames(data)[1] <- "ID" # selecting and sorting columns
  data <- data[,c("ID","os","ProtocolId","ScriptName","RunTimestamp","SubmissionTimestamp","InputId","Response","Timestamp")]
  data <- data[order(data$ID,data$Timestamp),] # sorting rows
  wide <- reshape(data[,1:(ncol(data)-1)],v.names=c("Response"),timevar=c("InputId"), # reshaping from long to wide (resp-by-resp)
                  idvar=c("RunTimestamp","SubmissionTimestamp"),direction=c("wide"),sep="")
  colnames(wide) <- gsub("Response","",gsub("\\-",".",colnames(wide))) # removing label "Response" from ResponseId
  wide <- wide[order(wide$ID,wide$RunTimestamp),] # sorting rows by participant ID and RunTimestamp
  row.names(wide) <- as.character(1:nrow(wide))
  if(messages==TRUE){ cat("\n\nRead",nrow(wide),"responses from",nlevels(as.factor(data$ID)),"participants.") }
  
  # 7. Processing response times
  # ...............................................................
  data$RT <- as.numeric(NA) # computing item response times (i.e., seconds between each response and the previous one)
  for(i in 2:nrow(data)){ if(data[i,"ID"]==data[i-1,"ID"] & data[i,"RunTimestamp"]==data[i-1,"RunTimestamp"]){
    data[i,"RT"] <- difftime(data[i,"Timestamp"],data[i-1,"Timestamp"],units="secs")}}
  data$RT.tot <- as.numeric(NA) # computing survey total response time (i.e., last - first response of each survey)
  data$survey <- as.factor(paste(data$ID,data$RunTimestamp,sep="_"))
  for(survey in levels(data$survey)){ data[data$survey==survey,"RT.tot"] <-
    difftime(data[data$survey==survey,"SubmissionTimestamp"][1],
             data[data$survey==survey,"Timestamp"][1],units="secs")}
  rt <- reshape(data[,c("ID","RunTimestamp","SubmissionTimestamp","RT.tot","InputId","RT")], # reshaping
                v.names=c("RT"),timevar=c("InputId"),idvar=c("RunTimestamp","SubmissionTimestamp"),direction=c("wide"),sep="")
  colnames(rt)[5:ncol(rt)] <- gsub("RT","",colnames(rt)[5:ncol(rt)]) # removing label "RT" from column names
  
  # 8. Returning processed dataset and response time
  # ...............................................................
  return(list(wide,rt))}


Here, we use the readSurveyData function (depicted above) to read and preliminarily recode the .JSON data files, and to merge them into the ESM dataset. In one single case, the JSON string contains an (illegal) UTF8 byte-order-mark.

# data reading, encoding and saving
ESM <- readSurveyData(data.path="ESM/data",probe.definition="ESM/probe")
## Loading required package: jsonlite
## 
## 
## Reading 3200 data files from ESM/data ...
## Warning: JSON string contains (illegal) UTF8 byte-order-mark!
## 
## 
## Converting ResponseId into InputId based on Probe Definition...
## 
## Unlisting Response data and removing system information...
## 
## Converting Timestamp variables as POSIXct...
## 
## Looking for double responses...
##  - Removed 2 double responses (same ID, InputName, RunTimestamp, & SubmissionTimestamp)
##  - Split 10 double responses with different SubmissionTimestamp
##    (in these cases, RunTimestamp is recomputed as SubmissionTimestamp - 61.20 sec, the average response time)
## 
## Sorting columns and reshaping (one row per data entry)...
## 
## Read 1392 responses from 104 participants.
ESM.RT <- ESM[[2]] # dataset with response times
ESM <- ESM[[1]] # dataset with participant responses


We also integrate the ESM dataset with 4 additional responses that were sent via instant messages (i.e., screenshots of the responses) due to technical problems. These were stored in the screenshotX4.RData file.

load("ESM/screenshot_surevyMalfunctioning/screenshotX4.RData") # loading additiona responses
cat("Adding",nrow(screenshotX4),"additional responses sent via instant messages") # N = 4
## Adding 4 additional responses sent via instant messages
ESM <- rbind(ESM,screenshotX4) # adding additional responses to the ESM dataset


Finally, we import the “baseline surveys” data (i.e., ESM data collected in the lab at the beginning of each day, using Google Forms), exported from Google Forms as a .CSV file. Some recoding procedures will be necessary before merging the baseline and the ESM datasets.

# reading raw dataset exported from Google Forms
baseline <- read.csv("baseline/Baseline.csv",header=TRUE)
cat("Reading",nrow(baseline),"responses to the baseline questionnaire from",
    nlevels(as.factor(baseline$Nome.utente)),"participants")
## Reading 311 responses to the baseline questionnaire from 108 participants


2.2. Data recoding

Second, we recode the ESM and baseline variables to be used for the analyses, and we merge the two datasets.

The following packages and functions are used to optimize the ESM data recoding:

library(lubridate); library(plyr); library(scales)
within.day.adjust

within.day.adjust <- function(data){
  
  # creating within.day indicator (default = NA)
  data$within.day <- NA
  
  # scheduled time windows, each widened by a 10-min error margin to account for device variability:
  # survey 1 between 11:20 (- 10 min) and 11:40 + 20-min expiration (up to 12:00 + 10 min)
  # survey 2 between 12:30 (- 10 min) and 12:50 + 20-min expiration (up to 13:10 + 10 min)
  # survey 3 between 13:40 (- 10 min) and 14:00 + 20-min expiration (up to 14:20 + 10 min)
  # survey 4 between 14:50 (- 10 min) and 15:10 + 20-min expiration (up to 15:30 + 10 min)
  # survey 5 between 16:00 (- 10 min) and 16:20 + 20-min expiration (up to 16:40 + 10 min)
  lower <- c("11:10:00","12:20:00","13:30:00","14:40:00","15:50:00")
  upper <- c("12:10:00","13:20:00","14:30:00","15:40:00","16:50:00")
  
  # adjusting within.day based on the scheduled times (comparing times of day, without date)
  daytime <- strftime(data$RunTimestamp,format="%H:%M:%S")
  for(w in 1:5){ data[daytime > lower[w] & daytime < upper[w],"within.day"] <- w }
  
  # sorting dataset based on ID, day and within.day
  data <- data[order(data$ID,data$day,data$within.day),]
  data$within.day <- as.factor(data$within.day)
  rownames(data) <- 1:nrow(data)
  cat("Creating/Adjusting within.day variable...\n Printing summary of cases:\n")
  print(summary(data$within.day))
  return(data) }

Creates or adjusts the within.day variable based on the scheduled temporal windows used in the current study.
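
As a toy illustration (with hypothetical timestamps), a response run at 12:45 falls within the second scheduled window, whereas one run at 16:05 falls within the fifth:

# toy illustration with hypothetical timestamps
toy <- data.frame(ID=c("HRV001","HRV001"),day=c(1,1),
                  RunTimestamp=as.POSIXct(c("2019-03-13 12:45:00","2019-03-13 16:05:00"),tz="GMT"))
within.day.adjust(toy) # within.day = 2 and 5, respectively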

varsRecoding

varsRecoding <- function(data){

  # ID and os as factors
  data[,c("ID","os")] <- lapply(data[,c("ID","os")],as.factor)
  
  # Recoding item scores and reversed items
  ratings <- colnames(data)[c(which(colnames(data)=="t3.nervoso.tranquillo"),  # recoding ratings as integer
                              which(colnames(data)=="e3.affaticato.fresco"):which(colnames(data)=="intensity.NE"),
                              which(colnames(data)=="v1.male.bene"):which(colnames(data)=="v3.positivo.negativo"))]
  data[,ratings] <- lapply(data[,ratings],as.integer)
  data$v3.positivo.negativo <- 8 - data$v3.positivo.negativo # HEDONIC TONE = positive
  colnames(data)[which(colnames(data)=="v2.soddisfatto.insoddisfatto")] <- "v2.insoddisfatto.soddisfatto" # changing wrong label
  data$t1.rilassato.teso <- 8 - data$t1.rilassato.teso # CALMNESS = positive
  data$e2.pieno.privodenergia <- 8 - data$e2.pieno.privodenergia # ENERGETIC AROUSAL = positive
  data[!is.na(data$intensity.NE) & data$intensity.NE > 7,"intensity.NE"] <- 7 # correcting 2 wrongly recorded items (score > 7)

  # Recoding confounders
  data$alcohol <- as.factor(gsub("No","0",gsub("Sì","1",data$alcohol))) # alcohol (from yes/no --> 1/0)
  data$Activity <- as.factor(gsub("Leggerasedutiocoricatiperlamaggiorpartedeltempo","0", # physical activity level (0 = seated)
                                  gsub("Moderatacamminata,farelescale","1", # (1 = moderate)
                                       gsub("Intensaattivitàsportiva,palestra,corsaodiverserampediscale","2", # (2 = vigorous)
                                            data$Activity))))
  data$People.rec <- as.factor(gsub("No,erodasola","0",gsub("No,erodasolo","0", # (0=alone)
                                gsub("Sì,manonsonostatadisturbata","1",gsub("Sì,manonsonostatadisturbato","1", # (1=not interact)
                                     gsub("Sì,esonostataunpo'disturbata","2",gsub("Sì,esonostataunpo'disturbato","2", # (2=disturb)
                                          gsub("Sì,esonostatainterrotta","3",gsub("Sì,esonostatainterrotto","3", #(3 = interrupted)
                                               data$People.rec)))))))))
  data$ProtocolId <- as.factor(gsub("ProtocolStudent","",data$ProtocolId)) # from protocol ID to gender
  colnames(data)[which(colnames(data)=="ProtocolId")] <- "gender"
  data[,c("smoke","coffe")] <- lapply(data[,c("smoke","coffe")],as.factor) # remaining confounders as factor 0/1
  
  # Sorting and renaming columns
  data <- data[,c("ID","os","gender","day","within.day","RunTimestamp", # response information
                  "v1.male.bene","v2.insoddisfatto.soddisfatto","v3.positivo.negativo", # hedonic tone
                  "t1.rilassato.teso","t2.agitato.calmo","t3.nervoso.tranquillo", # calmness
                  "e1.stanco.sveglio","e2.pieno.privodenergia","e3.affaticato.fresco", # energetic arousal
                  "PE","intensity.PE","NE","intensity.NE", # emotional events
                  "smoke","coffe","alcohol","Activity","People.rec")] # confounders
  colnames(data)[7:ncol(data)] <- c("v1","v2","v3","t1","t2","t3","f1","f2","f3", # renaming columns
                                    "PE","PE.int","NE","NE.int","smoke","coffe","alcohol","activity","people")
  
  # final sorting and returning data
  data[,which(colnames(data)=="v1"):which(colnames(data)=="NE.int")] <-
    lapply(data[,which(colnames(data)=="v1"):which(colnames(data)=="NE.int")],as.integer) # slider ratings as integer
  data$sort <- as.numeric(substr(data$ID,4,6))
  data <- data[order(data$sort,data$RunTimestamp),] # sorting by ID and timestamp
  data$sort <- NULL # removing sort column
  return(data) }

Recodes slider and multiple-choice responses collected with the Sensus protocols used in the study, and sorts and renames the columns.

bslRecoding

bslRecoding <- function(data){ 
  
  # selecting included variables, renaming and sorting columns
  colnames(data) <- c("RunTimestamp","ID","v1","t1","f1","v2","t2","f2","v3","t3","f3","PE","PE.int","NE","NE.int",
                      "Sleep1","Sleep2","Sleep3", # baseline sleep ratings are not considered
                      "daybefore.ex","drugs","drugs.which",
                      "day","within.day")
  data <- data[,c("ID","day","within.day","RunTimestamp",
                  "v1","v2","v3","t1","t2","t3","f1","f2","f3", # renaming columns
                  "PE","PE.int","NE","NE.int","daybefore.ex","drugs","drugs.which")]

  # ID as factor
  data$ID <- as.factor(data$ID)
  
  # recoding item scores
  data$v2 <- 8 - data$v2 # hedonic tone = positive
  data$v3 <- 8 - data$v3
  data$t1 <- 8 - data$t1 # calmness = positive
  data$f2 <- 8 - data$f2 # e.arousal = positive

  # other confounding variables
  data$drugs <- as.factor(gsub("Sì","1",gsub("No","0",data$drugs))) # drugs
  data$daybefore.ex <- as.factor(gsub("Sì","1",gsub("No","0",data$daybefore.ex))) # intense exercise on the day before

  # sorting by timestamp
  data <- data[order(data$ID,data$RunTimestamp),] 
  
  return(data)}

Recodes the variables collected with the baseline surveys.


2.2.1. ID recoding

First, we recode participants’ identification codes ID (currently corresponding to their e-mail addresses) using the “HRVXXX” format (e.g., from ‘’ to ‘HRV001’). This is done with the participantID_recoding() and the participantID_recoding_baseline() functions. Again, since the participants’ e-mail addresses are confidential, neither the functions nor the original datasets are shown.

# loading and using the function to recode the ID variable
source("participantID_recoding.R"); source("participantID_recoding_baseline.R")
(ESM <- participantID_recoding(ESM))[1:3,] # recoding ESM IDs and showing first three rows
colnames(baseline)[2] <- "ID"
(baseline <- participantID_recoding_baseline(baseline))[1:3,] # recoding baseline IDs and showing first three rows


2.2.2. Time recoding

Second, we process the timestamps (i.e., temporal variables) in both the ESM and the baseline datasets.

2.2.2.1. Timestamps

Here, we convert the included timestamps from character to POSIXct (the R format for dates and times), using the “GMT” time zone (note that data were collected in the CET/CEST region, but GMT is easier to handle).

# converting temporal variables as POSIXct in the ESM dataset
ESM[,c("RunTimestamp","SubmissionTimestamp")] <- # no need to specify format since they already are POSIXct
  lapply(ESM[,c("RunTimestamp","SubmissionTimestamp")],function(x) as.POSIXct(x,tz="GMT"))

# converting baseline variables as POSIXct in the baseline dataset
colnames(baseline)[which(colnames(baseline)=="Informazioni.cronologiche")] <- "RunTimestamp"
baseline$RunTimestamp <- as.POSIXct(baseline$RunTimestamp,format="%Y/%m/%d %I:%M:%S %p",tz="GMT")


Then, we inspect the data for cases of wrongly encoded timestamps, considering that data collection started on November 6th, 2018 and ended on November 21st, 2019. Only two cases are detected, whose RunTimestamp was recorded with the wrong year.

# min and max RunTimestamp
c(min(ESM$RunTimestamp),max(ESM$RunTimestamp)) # ESM: unexpected values < November 2018
## [1] "2018-01-24 16:40:00.000 GMT" "2019-11-21 15:11:17.835 GMT"
c(min(baseline$RunTimestamp),max(baseline$RunTimestamp)) # baseline: ok
## [1] "2018-11-06 10:12:16 GMT" "2019-11-21 10:45:48 GMT"
# only two cases with RunTimestamp < November 2018 and no SubmissionTimestamp
ESM[ESM$RunTimestamp < as.POSIXct("2018-11-06") | ESM$RunTimestamp > as.POSIXct("2019-11-22"),
    c("ID","RunTimestamp","SubmissionTimestamp")]
# plotting all timestamps from these two participants: isolated points recorded as if collected one year before
grid.arrange(ggplot(ESM[ESM$ID=="HRV046",],aes(RunTimestamp)) + geom_histogram(bins=30) + ggtitle("HRV046"),
             ggplot(ESM[ESM$ID=="HRV048",],aes(RunTimestamp)) + geom_histogram(bins=30) + ggtitle("HRV048"))

# correcting wrongly encoded RunTimestamps (adding 1year) and SubmissionTimestamps (RunTimestamp + 61.20 sec)
ESM[substr(ESM$RunTimestamp,1,7)=="2018-01","SubmissionTimestamp"] <-
  ESM[substr(ESM$RunTimestamp,1,7)=="2018-01","RunTimestamp"] + 61.20
ESM[substr(ESM$RunTimestamp,1,7)=="2018-01",c("RunTimestamp","SubmissionTimestamp")] <-
  ESM[substr(ESM$RunTimestamp,1,7)=="2018-01",c("RunTimestamp","SubmissionTimestamp")] + 24*60*60*365
c(min(ESM$RunTimestamp),max(ESM$RunTimestamp)) # sanity check: min and max matching with the data collection period
## [1] "2018-11-06 10:18:46.547 GMT" "2019-11-21 15:11:17.835 GMT"
# 2 further cases with missing SubmissionTimestamp (replacing with RunTimestamp + 61.20 sec)
ESM[is.na(ESM$SubmissionTimestamp),c("ID","RunTimestamp","SubmissionTimestamp")]
ESM[is.na(ESM$SubmissionTimestamp),"SubmissionTimestamp"] <- ESM[is.na(ESM$SubmissionTimestamp),"RunTimestamp"] + 61.20


2.2.2.2. Daylight saving time

Second, we adjust the temporal coordinates to account for daylight saving time. Timestamps were recorded based on the internal clock of participants’ mobile phones, which automatically synchronizes with time changes, but these time shifts got lost when the data were exported, resulting in systematic offsets of the recorded times. In Italy, during the data collection period, daylight saving time was applied from March 31st to October 27th, 2019. Here, we visually inspect the timestamp of the first response for both the ESM and the baseline datasets.

We can note that in the ESM dataset, the responses recorded during standard time were encoded with one hour less than the actual time (i.e., the first questionnaire was scheduled at 11:30, whereas most responses were recorded at 10:30), whereas those recorded during daylight saving time were encoded with two hours less than the actual time (i.e., most responses recorded at 9:30). In the baseline dataset, the responses recorded during daylight saving time were recorded with one hour less than the actual time (i.e., baseline questionnaires were administered around 10:30, whereas most responses are around 9:30), whereas those recorded during standard time look fine.

# ESM: computing and plotting time without date
ESM$daytime <- as.POSIXct(paste(hour(ESM$RunTimestamp),minute(ESM$RunTimestamp)),format="%H %M",tz="GMT")
ESM$daylight <- "standard"
ESM[ESM$RunTimestamp > as.POSIXct("2019-04-01") & ESM$RunTimestamp < as.POSIXct("2019-10-27"),"daylight"] <- "daylight"
ggplot(ESM[!duplicated(ESM$ID),],aes(y=daytime,fill=daylight)) + geom_bar(position=position_dodge(),width=50) +
  scale_y_datetime(labels = date_format("%H:%M"),date_breaks="60 min") +
  ggtitle("ESM$RunTimestamp of the first response for each participant")

# baseline: computing and plotting time without date
baseline$daytime <- as.POSIXct(paste(hour(baseline$RunTimestamp),minute(baseline$RunTimestamp)),format="%H %M",tz="GMT")
baseline$daylight <- "standard"
baseline[baseline$RunTimestamp>as.POSIXct("2019-04-01") & baseline$RunTimestamp<as.POSIXct("2019-10-27"),"daylight"]<-"daylight"
ggplot(baseline[,],aes(y=daytime,fill=daylight)) + geom_bar(position=position_dodge(),width=50) +
  scale_y_datetime(labels = date_format("%H:%M"),date_breaks="60 min") +
  ggtitle("baseline$RunTimestamp")


Here, we adjust the temporal coordinates in both datasets according to the observed time shifts.

# adding 1 hour to all ESM timestamps
ESM[,c("RunTimestamp","SubmissionTimestamp")] <- ESM[,c("RunTimestamp","SubmissionTimestamp")] + 1*60*60
# adding 1 further hour to ESM timestamps during daylightsaving time
ESM[ESM$daylight=="daylight",c("RunTimestamp","SubmissionTimestamp")] <- 
  ESM[ESM$daylight=="daylight",c("RunTimestamp","SubmissionTimestamp")] + 1*60*60

# adding 1 hour to baseline timestamps during daylightsaving time
baseline[baseline$daylight=="daylight","RunTimestamp"] <- baseline[baseline$daylight=="daylight","RunTimestamp"] + 1*60*60

# ESM: computing and plotting time without date (NOW OK)
ESM$daytime <- as.POSIXct(paste(hour(ESM$RunTimestamp),minute(ESM$RunTimestamp)),format="%H %M",tz="GMT")
ggplot(ESM[!duplicated(ESM$ID),],aes(y=daytime,fill=daylight)) + geom_bar(position=position_dodge(),width=50) +
  scale_y_datetime(labels = date_format("%H:%M"),date_breaks="60 min") +
  ggtitle("ESM$RunTimestamp of the first response for each participant")

# baseline: computing and plotting time without date (NOW OK)
baseline$daytime <- as.POSIXct(paste(hour(baseline$RunTimestamp),minute(baseline$RunTimestamp)),format="%H %M",tz="GMT")
ggplot(baseline[,],aes(y=daytime,fill=daylight)) + geom_bar(position=position_dodge(),width=50) +
  scale_y_datetime(labels = date_format("%H:%M"),date_breaks="60 min") +
  ggtitle("baseline$RunTimestamp of the first response for each participant")

# removing daytime and daylight columns
ESM$daytime <- ESM$daylight <- baseline$daytime <- baseline$daylight <- NULL


2.2.2.3. day & within.day

Finally, we create two variables to better organize the temporal structure of the datasets:

  • day, a weekday indicator (i.e., 1 = Tuesday, 2 = Wednesday, 3 = Thursday)

  • within.day, a row identifier within each day. For baseline responses, we set within.day = 0

# creating indicator for the weekday 
ESM$day <- as.POSIXlt(ESM$RunTimestamp)$wday - 1
baseline$day <- as.POSIXlt(baseline$RunTimestamp)$wday - 1
levels(as.factor(c(levels(as.factor(ESM$day)),levels(as.factor(baseline$day))))) # sanity check: ok
## [1] "1" "2" "3"
# creating row identifier within each day
ESM <- ddply(ESM,c("ID","day"),transform,within.day=seq_along(day))
baseline$within.day <- 0 # assigning zero to the baseline questionnaires

# showing data structure
ESM[,c("ID","day","within.day","RunTimestamp")]


However, the method used above encodes the within.day variable by counting the surveys as they were received (i.e., first, second, third), and not as they were scheduled (i.e., based on the RunTimestamp variable), thus neglecting missing responses.

Here, we use the within.day.adjust() function shown at the beginning of section 2.2 to recode the within.day variable accounting for the scheduled temporal windows and the expiration time (i.e., 20 minutes after the last scheduled time), adding 10 extra minutes before and after each time limit to account for the variability among devices:

  • 0 = 09:45 ~ 11:00 (baseline survey)

  • 1 = 11:20 - 11:40 + 20 min (up to 12:00)

  • 2 = 12:30 - 12:50 + 20 min (up to 13:10)

  • 3 = 13:40 - 14:00 + 20 min (up to 14:20)

  • 4 = 14:50 - 15:10 + 20 min (up to 15:30)

  • 5 = 16:00 - 16:20 + 20 min (up to 16:40)

Here, we apply the function and correct one case of missing within.day (i.e., RunTimestamp = 17:40, recoded as within.day = 5).

# adjusting within.day based on scheduled time windows
ESM <- within.day.adjust(ESM)
## Creating/Adjusting within.day variable...
##  Printing summary of cases:
##    1    2    3    4    5 NA's 
##  275  280  288  267  285    1
# correcting 1 missing case (17:40 -> within.day = 5)
ESM[is.na(ESM$within.day),"within.day"] <- 5

# showing data structure
ESM[,c("ID","day","within.day","RunTimestamp")]


2.2.3. Other variables

Finally, we use the varsRecoding() and the bslRecoding() functions shown at the beginning of section 2.2 to recode the remaining variables (i.e., renaming and reversing item scores, recoding categorical variables, sorting and renaming columns) in the ESM and the baseline datasets, respectively.

# recoding ESM dataset
ESM <- varsRecoding(ESM)

# recoding baseline dataset
baseline <- bslRecoding(baseline)


Here, we conduct a series of sanity checks by inspecting the summary and plots of each variable. In the ESM dataset, we can see that 11 cases from the same participant HRV099 have both smoke and coffe = "NULL", whereas the values of the other confounders look fine. Since these items were likely left unanswered because the condition was not met, we recode those cases with smoke and coffe = 0. Moreover, we can note that in some cases participants selected more than one option for the people variable. In these cases, we recode the score in a conservative way, that is, by assigning the highest selected score.

# inspecting summary of ESM variables
summary(ESM)
##        ID             os       gender       day       within.day
##  HRV083 :  19   Android:1110   F:702   Min.   :1.00   1:275     
##  HRV032 :  17   iOS    : 286   M:694   1st Qu.:1.00   2:280     
##  HRV036 :  16                          Median :2.00   3:288     
##  HRV073 :  16                          Mean   :1.99   4:267     
##  HRV095 :  16                          3rd Qu.:3.00   5:286     
##  HRV003 :  15                          Max.   :3.00             
##  (Other):1297                                                   
##   RunTimestamp                          v1              v2       
##  Min.   :2018-11-06 11:18:46.54   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2018-12-18 13:54:35.94   1st Qu.:4.000   1st Qu.:4.000  
##  Median :2019-03-13 16:15:10.50   Median :5.000   Median :5.000  
##  Mean   :2019-03-22 16:02:26.94   Mean   :4.981   Mean   :4.569  
##  3rd Qu.:2019-05-21 15:05:04.84   3rd Qu.:6.000   3rd Qu.:5.000  
##  Max.   :2019-11-21 16:11:17.83   Max.   :7.000   Max.   :7.000  
##                                                   NA's   :13     
##        v3              t1              t2              t3       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000  
##  Median :5.000   Median :5.000   Median :5.000   Median :5.000  
##  Mean   :4.734   Mean   :4.787   Mean   :4.827   Mean   :4.853  
##  3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.000  
##  NA's   :14      NA's   :8       NA's   :13      NA's   :23     
##        f1              f2              f3              PE       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:1.000  
##  Median :4.000   Median :4.000   Median :4.000   Median :1.000  
##  Mean   :4.255   Mean   :4.239   Mean   :4.167   Mean   :1.856  
##  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.000  
##  NA's   :10      NA's   :14      NA's   :23      NA's   :24     
##      PE.int            NE            NE.int       smoke       coffe     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   0   :1167   0   :1101  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1   : 157   1   : 221  
##  Median :1.000   Median :1.000   Median :1.000   2   :  10   2   :   8  
##  Mean   :1.798   Mean   :1.393   Mean   :1.447   5   :   1   NULL:  11  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:1.000   NULL:  11   NA's:  55  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   NA's:  50              
##  NA's   :27      NA's   :36      NA's   :42                             
##  alcohol     activity       people   
##  0   :1329   0   :921   1      :698  
##  1   :   9   1   :388   0      :345  
##  NA's:  58   2   : 29   2      :249  
##              NA's: 58   3      : 35  
##                         1,0    :  2  
##                         (Other):  7  
##                         NA's   : 60
# cases with smoke or coffe = "NULL"
ESM[!is.na(ESM$coffe) & ESM$coffe=="NULL",c(1,20:ncol(ESM))]
ESM[!is.na(ESM$coffe) & ESM$coffe=="NULL",c("smoke","coffe")] <- "0"
ESM[,c("smoke","coffe")] <- lapply(lapply(ESM[,c("smoke","coffe")],as.character),as.factor)

# inspecting summary of people
summary(ESM$people)
##     0   0,1 0,1,2   0,2     1   1,0     2   2,1   2,3     3   3,0   3,1  NA's 
##   345     1     1     1   698     2   249     1     1    35     1     1    60
ESM$people <- as.factor(gsub("0,1","0", # replacing multi-option responses with the higher selected score
                             gsub("0,1,2","2",
                                  gsub("0,2","2",
                                       gsub("1,0","1",
                                            gsub("2,1","2",
                                                 gsub("2,3","3",
                                                      gsub("3,0","3",
                                                           gsub("3,1","3",ESM$people)))))))))

# re-inspecting recoded variables (no more "NULL" and multi-choice responses)
summary(ESM[,c("smoke","coffe","people")])
##   smoke       coffe       people   
##  0   :1178   0   :1112   0   :345  
##  1   : 157   1   : 221   1   :701  
##  2   :  10   2   :   8   2   :252  
##  5   :   1   NA's:  55   3   : 38  
##  NA's:  50               NA's: 60
# inspecting summary of baseline variables (all look fine)
summary(baseline)
##        ID           day          within.day  RunTimestamp                  
##  HRV001 :  3   Min.   :1.000   Min.   :0    Min.   :2018-11-06 10:12:16.0  
##  HRV002 :  3   1st Qu.:1.000   1st Qu.:0    1st Qu.:2018-12-18 10:19:24.5  
##  HRV003 :  3   Median :2.000   Median :0    Median :2019-03-13 10:27:06.0  
##  HRV004 :  3   Mean   :1.997   Mean   :0    Mean   :2019-03-22 08:11:48.5  
##  HRV005 :  3   3rd Qu.:3.000   3rd Qu.:0    3rd Qu.:2019-05-21 10:33:18.5  
##  HRV006 :  3   Max.   :3.000   Max.   :0    Max.   :2019-11-21 10:45:48.0  
##  (Other):293                                                               
##        v1             v2              v3              t1              t2       
##  Min.   :2.00   Min.   :2.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.00   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000  
##  Median :5.00   Median :4.000   Median :5.000   Median :5.000   Median :5.000  
##  Mean   :4.91   Mean   :4.466   Mean   :4.746   Mean   :4.749   Mean   :4.907  
##  3rd Qu.:6.00   3rd Qu.:5.000   3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000  
##  Max.   :7.00   Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.000  
##                                                                                
##        t3              f1             f2              f3              PE       
##  Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:3.00   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:1.000  
##  Median :5.000   Median :5.00   Median :5.000   Median :4.000   Median :2.000  
##  Mean   :5.013   Mean   :4.28   Mean   :4.347   Mean   :4.212   Mean   :2.199  
##  3rd Qu.:6.000   3rd Qu.:5.00   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:3.000  
##  Max.   :7.000   Max.   :7.00   Max.   :7.000   Max.   :7.000   Max.   :6.000  
##                                                                                
##      PE.int            NE            NE.int      daybefore.ex drugs  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   0:242        0:297  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1: 69        1: 14  
##  Median :1.000   Median :1.000   Median :1.000                       
##  Mean   :1.929   Mean   :1.707   Mean   :1.712                       
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:2.000                       
##  Max.   :6.000   Max.   :6.000   Max.   :6.000                       
##                                  NA's   :2                           
##  drugs.which       
##  Length:311        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 


2.2.3.1. Binary confounders

From the summary above, we can see that some categorical confounders have only a very small number of ratings for certain levels (i.e., smoke has only 11 cases rated as “2” or higher, coffe has only 8 cases rated as “2”, activity has only 29 cases rated as “2”, people has only 38 cases rated as “3”). Here, we simplify such variables by dichotomizing each of them (i.e., 0/1).

# plotting frequency of cases in each level of each confounder
par(mfrow=c(1,4))
for(conf in c("smoke","coffe","activity","people")){ plot(ESM[,conf],main=conf) }

# smoke: 2 and 5 --> 1 (i.e., "1+ cigarettes")
ESM$smoke <- as.factor(gsub("2","1",gsub("5","1",ESM$smoke)))

# coffe: 2 --> 1 (i.e., "1+ coffees")
ESM$coffe <- as.factor(gsub("2","1",ESM$coffe))

# activity: 2 --> 1 (i.e., "moderate/intense physical activity")
ESM$activity <- as.factor(gsub("2","1",ESM$activity))

# people: 1 --> 0 (i.e., "alone or with some people not disturbing"); 2 and 3 --> 1 (i.e., "someone disturbing or interrupting the recording")
ESM$people <- as.factor(gsub("3","1",gsub("2","1",gsub("1","0",ESM$people))))

# re-plotting
par(mfrow=c(1,4))
for(conf in c("smoke","coffe","activity","people")){ plot(ESM[,conf],main=conf) }

# re-summarizing
summary(ESM[,c("smoke","coffe","activity","people")])
##   smoke       coffe      activity    people    
##  0   :1178   0   :1112   0   :921   0   :1046  
##  1   : 168   1   : 229   1   :417   1   : 290  
##  NA's:  50   NA's:  55   NA's: 58   NA's:  60


2.2.4. Data merging

Finally, we can join the baseline dataset with the ESM dataset.
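
For context, plyr::rbind.fill stacks the two data frames by row and fills the columns present in only one of them with NA. A minimal sketch with toy one-row data frames (values assumed purely for illustration):

library(plyr)

a <- data.frame(ID="HRV001",v1=5) # mimics an ESM row
b <- data.frame(ID="HRV001",drugs=0) # mimics a baseline row
rbind.fill(a,b) # two rows; drugs is NA in the first, v1 is NA in the second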

# merging and fixing the last issues
ESM$within.day <- as.integer(as.character(ESM$within.day)) # within.day as integer for compatibility
new <- rbind.fill(ESM,baseline) # merging datasets
cat("sanity check:",nrow(new)==(nrow(ESM)+nrow(baseline))) # sanity check
## sanity check: TRUE
ESM <- new[order(new$ID,new$day,new$within.day),] # sorting by ID, day, and within.day
rownames(ESM) <- 1:nrow(ESM)

# showing merged dataset (everything looks fine)
ESM


2.3. Cleaning & filtering

Third, we clean and filter the data based on the inclusion criteria and data quality. The following packages and functions are used to optimize data filtering:

library(plyr); library(knitr); library(reshape2); library(ggplot2); library(gridExtra); library(careless)
respRate

respRate <- function(data){
  
  # scheduled No. of surveys (1 baseline + 5 ordinary X 3 days)
  tot <- (1 + 5) * 3
  bsl <- 1 * 3
  ord <- 5 * 3
  
  # computing No. of responses and response rate
  data$ID <- as.factor(as.character(data$ID))
  rr <- data.frame(ID=levels(data$ID),tot=as.numeric(table(data$ID))) # No. of responses per participant
  rr$tot.RR <- round(100*rr$tot/tot) # total response rate
  rr$bsl <- as.numeric(table(data[data$within.day==0,"ID"])) # No. of responses to the baseline surveys
  rr$bsl.RR <- round(100*rr$bsl/bsl) # baseline response rate
  rr$ord <- as.numeric(table(data[data$within.day!=0,"ID"])) # No. of responses to the ordinary surveys
  rr$ord.RR <- round(100*rr$ord/ord) # ordinary response rate
  
  # print response rate info
  cat(nrow(rr),"included participants",
      "\n -",sum(rr$tot),"responses, from",min(rr$tot),"to",max(rr$tot),
      ", mean =",round(mean(rr$tot),2),"sd =",round(sd(rr$tot),2),
      ", mean perc =",round(mean(rr$tot.RR),2),"% sd =",round(sd(rr$tot.RR),2),"%",
      "\n  -",sum(rr$bsl),"baseline, from",min(rr$bsl),"to",max(rr$bsl),
      ", mean =",round(mean(rr$bsl),2),"sd =",round(sd(rr$bsl),2),
      ", mean perc =",round(mean(rr$bsl.RR),2),"% sd =",round(sd(rr$bsl.RR),2),"%",
      "\n  -",sum(rr$ord),"ordinary, from",min(rr$ord),"to",max(rr$ord),
      ", mean =",round(mean(rr$ord),2),"sd =",round(sd(rr$ord),2),
      ", mean perc =",round(mean(rr$ord.RR),2),"% sd =",round(sd(rr$ord.RR),2),"%")
  
  return(rr)} # returning response rate dataset

Computes a data frame of response rate values and prints summary information on response rate.

plotResp

plotResp <- function(data,id,items){
  long <- melt(data[,c(id,items)],id.vars=id) # one row per item
  long <- long[order(long[,id],long$variable),]
  ggplot(long,aes(x=value,y=variable)) + geom_vline(xintercept=4,lwd=0.5,lty=2) + geom_point() + xlim(1,7) +
  facet_wrap(id) + theme(text=element_text(size=8)) }

Plots the pattern of responses to mood items for a subsample of responses.


We start by inspecting the original number of respondents and observations in the ESM dataset. We can see that the questionnaires were completed by a total of 104 respondents, one fewer than the 105 participants that were actually invited to take part in the daily protocol. This is because participant HRV009 dropped out without activating the app.

cat("Original No. of responses to the ESM/baseline surveys =",nrow(ESM),"from",nlevels(ESM$ID),"participants")
## Original No. of responses to the ESM/baseline surveys = 1707 from 104 participants


2.3.1. Double responses

First, we inspect and remove cases of double responses with the same ID and RunTimestamp values (N = 10). In these cases, only the first response is retained, with the exception of HRV007 on day 1 and HRV073 on day 2, where the first response contains missing values and the second does not.

# detecting double responses
ESM$idday <- paste(ESM$ID,ESM$RunTimestamp,sep="_")
cat(nrow(ESM[duplicated(ESM$idday),]),"cases of double responses (same ID and RunTimestamp)")
## 10 cases of double responses (same ID and RunTimestamp)
kable(ESM[ESM$idday%in%ESM[duplicated(ESM$idday),"idday"],
    c("ID","day","within.day","RunTimestamp","v1","v2","v3","t1","t2","t3","NE.int","smoke","coffe","alcohol","activity")])
ID day within.day RunTimestamp v1 v2 v3 t1 t2 t3 NE.int smoke coffe alcohol activity
108 HRV007 1 5 2018-11-13 16:21:31.569 4 4 4 5 5 5 3 0 0 0 0
109 HRV007 1 5 2018-11-13 16:21:31.569 3 3 3 4 3 3 NA NA NA NA NA
121 HRV007 3 5 2018-11-15 16:21:49.293 3 3 3 3 4 3 3 0 0 0 0
122 HRV007 3 5 2018-11-15 16:21:49.293 3 2 3 3 3 2 4 0 0 0 0
492 HRV032 1 2 2018-12-18 12:54:50.792 5 5 6 5 5 4 1 1 0 0 1
493 HRV032 1 2 2018-12-18 12:54:50.792 1 1 7 7 1 1 1 1 1 0 1
566 HRV036 2 1 2019-01-09 11:33:48.782 7 5 7 7 7 7 1 0 0 0 0
567 HRV036 2 1 2019-01-09 11:33:48.782 7 5 7 7 7 7 1 0 0 0 0
1169 HRV073 2 3 2019-05-08 13:45:50.606 5 5 5 4 4 5 1 0 0 NA NA
1170 HRV073 2 3 2019-05-08 13:45:50.606 5 4 5 3 4 4 2 0 1 0 0
1252 HRV078 2 2 2019-05-15 12:40:54.675 3 3 3 3 4 3 3 0 1 0 1
1253 HRV078 2 2 2019-05-15 12:40:54.675 4 3 3 3 3 3 3 0 0 0 1
1329 HRV083 1 2 2019-05-21 12:49:08.996 6 5 4 6 6 6 1 1 0 0 0
1330 HRV083 1 2 2019-05-21 12:49:08.996 6 4 4 6 6 6 1 1 0 0 0
1337 HRV083 2 3 2019-05-22 14:10:05.861 4 4 4 5 5 5 2 1 1 0 0
1338 HRV083 2 3 2019-05-22 14:10:05.861 4 4 4 3 5 5 2 1 1 0 0
1342 HRV083 3 1 2019-05-23 11:41:29.187 4 4 4 5 5 6 1 0 0 0 0
1343 HRV083 3 1 2019-05-23 11:41:29.187 4 4 4 6 6 6 1 0 0 0 0
1347 HRV083 3 5 2019-05-23 16:04:47.322 4 4 4 5 5 6 1 1 0 0 0
1348 HRV083 3 5 2019-05-23 16:04:47.322 4 4 4 5 5 6 1 1 0 0 0
# excluding second response for HRV007 (day 1) and HRV073 (day 2)
memory <- ESM
ESM <- ESM[!(ESM$idday=="HRV007_2018-11-13 16:18:32.769" & is.na(ESM$activity)),]
ESM <- ESM[!(ESM$idday=="HRV073_2019-05-08 13:42:51.806" & is.na(ESM$activity)),]

# excluding all remaining double responses
ESM <- ESM[!duplicated(ESM$idday),]

# removing double responses
cat("Removed",nrow(memory)-nrow(ESM),"double responses")
## Removed 10 double responses


Then, we use the same procedure to inspect and remove cases of double responses with the same ID, day, and within.day values. We can see 10 further cases of double responses. Here, we simply retain the first response of each set.

# detecting double responses
ESM$idday <- paste(ESM$ID,ESM$day,ESM$within.day,sep="_")
cat(nrow(ESM[duplicated(ESM$idday),]),"cases of double responses (same ID, day, and within.day)")
## 10 cases of double responses (same ID, day, and within.day)
kable(ESM[ESM$idday%in%ESM[duplicated(ESM$idday),"idday"],
    c("ID","day","within.day","RunTimestamp","v1","v2","v3","t1","t2","t3","NE.int","smoke","coffe","alcohol","activity")])
ID day within.day RunTimestamp v1 v2 v3 t1 t2 t3 NE.int smoke coffe alcohol activity
491 HRV032 1 2 2018-12-18 12:47:00.000 5 4 5 5 5 6 1 1 0 0 0
492 HRV032 1 2 2018-12-18 12:54:50.792 5 5 6 5 5 4 1 1 0 0 1
502 HRV032 2 5 2018-12-19 16:08:00.000 3 4 3 3 2 2 4 0 0 0 0
503 HRV032 2 5 2018-12-19 16:09:17.886 5 5 5 6 3 5 1 1 0 0 0
739 HRV046 3 5 2019-01-24 16:01:42.284 2 2 3 2 2 2 4 1 0 0 1
740 HRV046 3 5 2019-01-24 17:40:00.000 4 3 4 2 2 4 1 1 1 0 1
773 HRV048 3 3 2019-01-31 13:40:00.000 4 2 3 3 4 5 1 0 0 0 0
774 HRV048 3 3 2019-01-31 13:51:13.250 7 7 7 3 2 2 1 0 0 0 1
1085 HRV068 2 2 2019-04-10 12:33:54.719 7 7 7 6 7 7 2 0 0 0 1
1086 HRV068 2 2 2019-04-10 12:43:54.737 7 7 7 6 7 7 1 0 0 0 0
1169 HRV073 2 3 2019-05-08 13:45:50.606 5 5 5 4 4 5 1 0 0 NA NA
1171 HRV073 2 3 2019-05-08 14:03:54.629 4 3 4 5 5 4 3 0 0 0 0
1525 HRV094 3 3 2019-10-17 13:49:35.021 5 2 4 4 4 3 2 0 0 0 1
1526 HRV094 3 3 2019-10-17 13:52:44.679 5 3 4 3 4 3 1 0 0 0 0
1527 HRV094 3 4 2019-10-17 14:51:03.452 4 4 4 3 3 3 1 0 0 0 0
1528 HRV094 3 4 2019-10-17 15:11:23.876 4 2 3 3 3 3 2 0 0 0 0
1531 HRV095 1 2 2019-10-15 12:37:19.946 5 4 5 5 3 5 1 0 0 0 1
1532 HRV095 1 2 2019-10-15 12:37:21.780 5 4 5 5 3 5 1 0 0 0 1
1533 HRV095 1 2 2019-10-15 12:47:56.838 5 4 4 5 5 5 1 0 0 0 1
# excluding double responses
memory2 <- ESM
ESM <- ESM[!duplicated(ESM$idday),]

# removing double responses
cat("Removed further",nrow(memory2)-nrow(ESM),"double responses")
## Removed further 10 double responses


Thus, we removed a total of 20 double responses.

# removing double responses
cat("Removed a total of",nrow(memory)-nrow(ESM),"double responses; new number of responses =",nrow(ESM))
## Removed a total of 20 double responses; new number of responses = 1687


2.3.2. Inclusion criteria

Second, we inspect the drugs and drugs.which variables (i.e., drugs/medications, different from those usually taken, that were taken that morning), considering the inclusion criteria of the study (i.e., not taking medications affecting the nervous or the cardiovascular system).

Only 14 cases are present in the dataset, of which 7 (4 participants) concerned anti-inflammatory medications (Antiinfiammatorio, moment, buscofen), 1 case concerned antibiotics (fosfomicina), 1 case concerned analgesics (paracetamolo), and 3 cases (1 participant) concerned cough syrup (sciroppo per la tosse). None of these cases conflicts with our inclusion criteria, but they will be taken into account later, as they can influence some self-reported and physiological variables.

ESM[substr(ESM$ID,1,3)=="HRV" & ESM$within.day==0 & ESM$drugs==1,c("ID","drugs.which")]


2.3.3. Further excluded

Third, we mark the 8 participants that were excluded for the reasons specified in section 1.3.4. As in that section, such participants are marked as ‘OUT’. As noted above, participant HRV009 was not even included in the ESM dataset because she dropped out before activating the app. Thus, of the initial 104 participants, 7 are excluded (6.73%), whereas 97 (93.27%) are included. Again, we can note that the variables measured from these participants do not substantially deviate from the distributions of the included participants, and we remove them from the ESM dataset.

# marking 8 excluded participants as "OUT"
memory <- ESM
ESM$ID <- as.character(ESM$ID)
excl <- c("HRV009","HRV011","HRV015","HRV027","HRV029","HRV046","HRV059","HRV066")
ESM[ESM$ID%in%excl,"ID"] <- gsub("HRV","OUT",ESM[ESM$ID%in%excl,"ID"]) # marking excluded participants as "OUTXXX"

# plotting
ESM$excl <- "IN" # creating excl variable (factor)
ESM[substr(ESM$ID,1,3)=="OUT","excl"] <- "OUT"
ESM$excl <- as.factor(ESM$excl)
grid.arrange(ggplot(ESM[!is.na(ESM$v1),],aes(x=excl,y=v1,fill=excl)) + geom_violin(),
             ggplot(ESM[!is.na(ESM$t1),],aes(x=excl,y=t1,fill=excl)) + geom_violin(),
             ggplot(ESM[!is.na(ESM$f1),],aes(x=excl,y=f1,fill=excl)) + geom_violin(),
             ggplot(ESM[!is.na(ESM$PE),],aes(x=excl,y=PE,fill=excl)) + geom_violin(),
             ggplot(ESM[!is.na(ESM$NE),],aes(x=excl,y=NE,fill=excl)) + geom_violin(),
             ggplot(ESM[!is.na(ESM$smoke),],aes(x=excl,fill=smoke)) + geom_bar(),
             ggplot(ESM[!is.na(ESM$activity),],aes(x=excl,fill=activity)) + geom_bar(),
             ggplot(ESM[!is.na(ESM$people),],aes(x=excl,fill=people)) + geom_bar(),nrow=2)

# updating number of included participants
cat(nrow(ESM[substr(ESM$ID,1,3)=="HRV",]),"observations from",
    nlevels(as.factor(as.character(ESM[substr(ESM$ID,1,3)=="HRV","ID"]))),"included participants out of",
    nrow(memory),"total observations from",nrow(memory[!duplicated(memory$ID),]),"invited participants")
## 1607 observations from 97 included participants out of 1687 total observations from 104 invited participants
# removing excluded participants
ESM <- ESM[substr(ESM$ID,1,3)=="HRV",]
ESM$ID <- as.factor(ESM$ID) # ID as factor
ESM <- ESM[order(ESM$ID,ESM$RunTimestamp),] # sorting by ID & RunTimestamp
ESM$excl <- NULL # removing excl variable


2.3.4. Flagged cases

Finally, although we did not apply inclusion criteria based on variables other than those specified in section 1.3.2, here we consider the response rate and the coherence of the responses as further criteria to flag those cases that might bias the results.

2.3.4.1. Response rate

First, we inspect the response rate by using the respRate function shown at the beginning of section 2.3. We can note that all ‘baseline’ surveys (i.e., those administered in the lab) were answered each day by every participant, whereas some missing data were present in the ‘ordinary’ surveys (i.e., those administered with the mobile app). Overall, the mean response rate to ‘ordinary’ surveys was relatively high (90%), with no participant showing a response rate < 60%. Thus, we see no reason for excluding or flagging further participants based on response rate.

rrate <- respRate(ESM)
## 97 included participants 
##  - 1607 responses, from 12 to 18 , mean = 16.57 sd = 1.41 , mean perc = 91.91 % sd = 7.79 % 
##   - 291 baseline, from 3 to 3 , mean = 3 sd = 0 , mean perc = 100 % sd = 0 % 
##   - 1316 ordinary, from 9 to 15 , mean = 13.57 sd = 1.41 , mean perc = 90.41 % sd = 9.32 %


In addition to counting the number of received surveys, we must also consider missing responses within each survey. Indeed, a number of surveys were incomplete due to technical problems with the app. We can notice that missing responses mainly concern items on confounder variables (i.e., activity, alcohol, coffe, smoke, and people), which were administered in the last part of each survey and show from 31 to 39 missing values (2-3%). In contrast, missing data concern only 1-2% of the items measuring emotional events (from 8 to 24 missing values), and 1% or less of the mood items (from 0 to 7).

miss <- sapply(ESM[ESM$within.day!=0,1:which(colnames(ESM)=="people")], function(x) sum(is.na(x))) # counting missing
miss <- data.frame(item=names(miss),nMiss=as.numeric(miss), # missing rate
                   percMiss=round(100*as.numeric(miss)/nrow(ESM[ESM$within.day!=0,])))
kable(miss) # showing No. and percentage of missing cases
item nMiss percMiss
ID 0 0
os 0 0
gender 0 0
day 0 0
within.day 0 0
RunTimestamp 0 0
v1 0 0
v2 6 0
v3 7 1
t1 2 0
t2 6 0
t3 7 1
f1 4 0
f2 7 1
f3 7 1
PE 8 1
PE.int 10 1
NE 18 1
NE.int 24 2
smoke 31 2
coffe 35 3
alcohol 38 3
activity 38 3
people 39 3


Here, we remove 7 further surveys (0.4%) due to missing data in almost all items (i.e., those data entries with missing values from the first items onward).

# showing 7 cases with missing response to item v3
ESM[is.na(ESM$v3),c(1,which(colnames(ESM)=="v1"):ncol(ESM)),]
# excluding 7 cases with missing response to item v3
ESM <- ESM[!is.na(ESM$v3),]


2.3.4.2. Careless responses

Second, we flag participants and/or single responses that might be considered as careless respondents/responses.

Indicators and procedure

Following Curran (2016), we consider multiple indicators of potentially careless responses including:

  1. Response times (both item-specific and total)

  2. Response consistency (long string proportions)

  3. Multivariate outlier patterns (Mahalanobis distance)

  4. Response inconsistency (psychometric synonyms)

For each indicator, the numbers in brackets indicate the corresponding number of flagged responses/participants.
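
As a primer on the consistency-based indicators, here is a minimal sketch on simulated ratings (all toy values; not the study data) of the three careless-package functions applied in the subsections below:

library(careless)

# 50 simulated 9-item surveys: a latent level per survey plus item-level noise
set.seed(1)
lat <- sample(1:7,50,replace=TRUE)
toy <- as.data.frame(sapply(1:9,function(j) pmin(7,pmax(1,lat+sample(-1:1,50,replace=TRUE)))))
toy[1,] <- 5 # one fully flat response string

longstring(toy)[1] # longest run of identical ratings (= 9 for the flat survey)
head(mahad(toy,plot=FALSE,flag=TRUE,confidence=.99)) # Mahalanobis distance + outlier flag
head(psychsyn(toy,critval=.60,diag=TRUE)) # within-person corr. among item pairs with r >= .60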

ITEM RESPONSE TIMES (32/2, 66/8)

First, we consider item-by-item and overall survey response times in the ‘ordinary’ ESM questionnaires. This information was encoded in the ESM.RT dataset in section 2.1. Thus, as a first step, we apply to these data the same cleaning and filtering steps applied to the ESM dataset.


# recoding IDs
# ...................................................

ESM.RT <- participantID_recoding(ESM.RT) 

# recoding time
# ...................................................

# converting temporal variables as POSIXct in the ESM dataset
ESM.RT[,c("RunTimestamp","SubmissionTimestamp")] <- # no need to specify format since they already are POSIXct
  lapply(ESM.RT[,c("RunTimestamp","SubmissionTimestamp")],function(x) as.POSIXct(x,tz="GMT"))

# correcting wrongly encoded RunTimestamps (adding 1 year) and SubmissionTimestamps (RunTimestamp + 61.20 sec)
ESM.RT[substr(ESM.RT$RunTimestamp,1,7)=="2018-01","SubmissionTimestamp"] <-
  ESM.RT[substr(ESM.RT$RunTimestamp,1,7)=="2018-01","RunTimestamp"] + 61.20
ESM.RT[substr(ESM.RT$RunTimestamp,1,7)=="2018-01",c("RunTimestamp","SubmissionTimestamp")] <-
  ESM.RT[substr(ESM.RT$RunTimestamp,1,7)=="2018-01",c("RunTimestamp","SubmissionTimestamp")] + 24*60*60*365

# sanity check: min and max matching with the data collection period
c(min(ESM.RT$RunTimestamp),max(ESM.RT$RunTimestamp)) 
## [1] "2018-11-06 10:18:46.547 GMT" "2019-11-21 15:11:17.835 GMT"
# adding 1 hour to all ESM timestamps
ESM.RT[,c("RunTimestamp","SubmissionTimestamp")] <- ESM.RT[,c("RunTimestamp","SubmissionTimestamp")] + 1*60*60

# adding 1 further hour to ESM timestamps during daylight saving time
ESM.RT$daylight <- "standard"
ESM.RT[ESM.RT$RunTimestamp > as.POSIXct("2019-04-01") & ESM.RT$RunTimestamp < as.POSIXct("2019-10-27"),"daylight"] <- "daylight"
ESM.RT[ESM.RT$daylight=="daylight",c("RunTimestamp","SubmissionTimestamp")] <- 
  ESM.RT[ESM.RT$daylight=="daylight",c("RunTimestamp","SubmissionTimestamp")] + 1*60*60

# creating day
ESM.RT$day <- as.POSIXlt(ESM.RT$RunTimestamp)$wday - 1

# creating within.day
ESM.RT <- ddply(ESM.RT,c("ID","day"),transform,within.day=seq_along(day))
ESM.RT <- within.day.adjust(ESM.RT)
## Creating/Adjusting within.day variable...
##  Printing summary of cases:
##   1   2   3   4   5 
## 275 279 287 267 284
# recoding other variables
# ....................................................

# ID as factor
ESM.RT$ID <- as.factor(ESM.RT$ID)

# Sorting and renaming columns
ESM.RT <- ESM.RT[,c("ID","day","within.day","RunTimestamp","SubmissionTimestamp","RT.tot",
                    "v1.male.bene","v2.soddisfatto.insoddisfatto","v3.positivo.negativo", # hedonic tone
                    "t1.rilassato.teso","t2.agitato.calmo","t3.nervoso.tranquillo", # calmness
                    "e1.stanco.sveglio","e2.pieno.privodenergia","e3.affaticato.fresco", # energetic arousal
                    "PE","intensity.PE","NE","intensity.PE", # emotional events
                    "smoke","coffe","alcohol","Activity","People.rec")] # confounders
colnames(ESM.RT)[7:ncol(ESM.RT)] <- c("v1","v2","v3","t1","t2","t3","f1","f2","f3", # renaming columns
                                      "PE","PE.int","NE","NE.int","smoke","coffe","alcohol","activity","people")

# filtering double responses
# ..................................................

# by RunTimestamp
ESM.RT$idday <- paste(ESM.RT$ID,ESM.RT$RunTimestamp,sep="_")
cat(nrow(ESM.RT[duplicated(ESM.RT$idday),]),"cases of double responses (same ID and RunTimestamp)") # 10 (= ESM)
## 10 cases of double responses (same ID and RunTimestamp)
ESM.RT <- ESM.RT[!(ESM.RT$idday=="HRV007_2018-11-13 16:18:32.769" & is.na(ESM.RT$activity)),] # excl 2nd resp for HRV007 & HRV073
ESM.RT <- ESM.RT[!(ESM.RT$idday=="HRV073_2019-05-08 13:42:51.806" & is.na(ESM.RT$activity)),]
ESM.RT <- ESM.RT[!duplicated(ESM.RT$idday),] # excluding all remaining double responses

# by day and within.day
ESM.RT$idday <- paste(ESM.RT$ID,ESM.RT$day,ESM.RT$within.day,sep="_")
cat(nrow(ESM.RT[duplicated(ESM.RT$idday),]),"cases of double responses (same ID, day, and within.day)") # 6 (4 fewer than ESM (?))
## 6 cases of double responses (same ID, day, and within.day)
ESM.RT <- ESM.RT[!duplicated(ESM.RT$idday),] # excluding all double responses

# marking further excluded
# ..................................................

# excluding further 8 participants as "OUT"
ESM.RT$ID <- as.character(ESM.RT$ID)
excl <- c("HRV009","HRV011","HRV015","HRV027","HRV029","HRV046","HRV059","HRV066")
ESM.RT <- ESM.RT[!(ESM.RT$ID%in%excl),]
ESM.RT$ID <- as.factor(as.character(ESM.RT$ID))
ESM.RT <- ESM.RT[order(ESM.RT$ID,ESM.RT$RunTimestamp),]

# excluding cases with missing response to most items
# .................................................

ESM.RT <- ESM.RT[!is.na(ESM.RT$v3),]

# sanity checks
# .................................................

# same number of participants?
nlevels(ESM.RT$ID) == nlevels(ESM$ID)
## [1] TRUE
# same number of included participants?
nlevels(as.factor(as.character(ESM.RT[substr(ESM.RT$ID,1,3)=="HRV","ID"]))) ==
  nlevels(as.factor(as.character(ESM[substr(ESM$ID,1,3)=="HRV","ID"])))
## [1] TRUE
# same number of observations?
nrow(ESM.RT) == nrow(ESM[ESM$within.day!=0,])
## [1] TRUE


Now we can inspect the item-by-item response times (seconds) in the ESM.RT dataset, computed as the difference between the timestamp of each response and that of the previous one (note: the first response of each questionnaire does not have an associated response time).
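
Such response times can be obtained by differencing consecutive item timestamps within each survey. A minimal sketch with made-up timestamps (the actual computation was performed in section 2.1):

# toy item timestamps within one survey (assumed values)
ts <- as.POSIXct(c("2019-01-01 10:00:05.2","2019-01-01 10:00:07.9","2019-01-01 10:00:09.1"),tz="GMT")
diff(as.numeric(ts)) # per-item response times (sec); the 1st item has none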

We can see that responding to a single item required about 2 seconds in most cases (median response time = 2.69 sec), regardless of the type of item. A small number of items were answered in more than 10 seconds (4.4%), with very few cases requiring more than 50 seconds (N = 39, 0.18%). Likewise, a small number of items were answered in less than 1 second (2.3%), with very few cases requiring less than 0.65 seconds (N = 98, 0.44%).

# item-by-item long dataset
rt <- melt(ESM.RT[,c("ID","RunTimestamp",colnames(ESM.RT)[which(colnames(ESM.RT)=="v1"):which(colnames(ESM.RT)=="people")])],
           id=c("ID","RunTimestamp"))
rt <- rt[!is.na(rt$value),] # removing missing values
rt$items <- "mood" # categorizing variables based on the item contents
rt[rt$variable%in%c("PE","PE.int","NE","NE.int"),"items"] <- "event"
rt[rt$variable%in%c("smoke","coffe","alcohol","activity","people"),"items"] <- "conf"

# summarizing response times
summary(rt$value)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    1.942    2.693    4.140    4.131 4378.792
# plotting response times
grid.arrange(ggplot(rt,aes(value))+geom_histogram(bins=300)+ggtitle("Response times (sec)"),
             ggplot(rt,aes(value))+geom_histogram(bins=300)+xlim(0,100)+ggtitle("zooming 0 - 100 sec"),
             ggplot(rt,aes(value))+geom_histogram(bins=300)+xlim(0,10)+ggtitle("zooming 0 - 10 sec"),
             ggplot(rt[rt$items=="mood",],aes(value))+geom_histogram(bins=300,alpha=0.5)+xlim(0,10) + ggtitle("Mood items"),
             ggplot(rt[rt$items=="event",],aes(value))+geom_histogram(bins=300,alpha=0.5)+xlim(0,10) + ggtitle("Event items"),
             ggplot(rt[rt$items=="conf",],aes(value))+geom_histogram(bins=300,alpha=0.5)+xlim(0,10) + ggtitle("Confounders items"),
             nrow=2)

# No. of response times higher than X
for(value in c(10,20,25,50,100,500,1000,2000,4000)){ cat("\n-",nrow(rt[rt$value>value,]),"RTs >",value,
                                                      "s (",round(100*nrow(rt[rt$value>value,])/nrow(rt),2),"% )") }
## 
## - 967 RTs > 10 s ( 4.38 % )
## - 245 RTs > 20 s ( 1.11 % )
## - 151 RTs > 25 s ( 0.68 % )
## - 39 RTs > 50 s ( 0.18 % )
## - 16 RTs > 100 s ( 0.07 % )
## - 2 RTs > 500 s ( 0.01 % )
## - 2 RTs > 1000 s ( 0.01 % )
## - 1 RTs > 2000 s ( 0 % )
## - 1 RTs > 4000 s ( 0 % )
# No. of response times lower than X
for(value in c(2,1.5,1,0.65,0.5,0.25,0.1)){ cat("\n-",nrow(rt[rt$value<value,]),"RTs <",value,
                                           "s (",round(100*nrow(rt[rt$value<value,])/nrow(rt),2),"% )") }
## 
## - 5997 RTs < 2 s ( 27.15 % )
## - 2465 RTs < 1.5 s ( 11.16 % )
## - 509 RTs < 1 s ( 2.3 % )
## - 98 RTs < 0.65 s ( 0.44 % )
## - 36 RTs < 0.5 s ( 0.16 % )
## - 12 RTs < 0.25 s ( 0.05 % )
## - 12 RTs < 0.1 s ( 0.05 % )


Here, we flag all cases with at least one item answered in more than 50 seconds with iRT.slow = 1, and all those with at least one item answered in less than 650 ms with iRT.fast = 1. While the first threshold is essentially based on the observed RT distribution, the second was also recommended by Geeraerts (2020). We also compute the number of items meeting each condition for each survey.

We can see that most cases have only one slow-response-time item (N = 26), with only 6 surveys showing 2 or more items that took more than 50 sec. In contrast, there is a similar proportion of surveys with either 1 (N = 34) or 2 items (N = 32) answered in less than 650 ms. In terms of careless participants, we flag slow and fast respondents considering participants with 3+ slow responses (N = 2) and those with 3+ fast responses (N = 8). Particular attention should be paid to participant HRV099, showing 11 out of 14 responses with fast response times.

ESM.RT$idday <- as.factor(paste(ESM.RT$ID,ESM.RT$RunTimestamp,sep="_"))
rt$idday <- as.factor(paste(rt$ID,rt$RunTimestamp,sep="_"))
ESM.RT$iRT.slow <- ESM.RT$NiRT.slow <- ESM.RT$iRT.fast <- ESM.RT$NiRT.fast <- ESM.RT$slowResp <- ESM.RT$fastResp <- 0
for(idday in levels(ESM.RT$idday)){
  if(any(rt[rt$idday==idday,"value"]>50)){ ESM.RT[ESM.RT$idday==idday,"iRT.slow"] <- 1
    ESM.RT[ESM.RT$idday==idday,"NiRT.slow"] <- nrow(rt[rt$idday==idday & !is.na(rt$value) & rt$value>50,]) }
  if(any(rt[rt$idday==idday,"value"]<0.65)){ ESM.RT[ESM.RT$idday==idday,"iRT.fast"] <- 1
    ESM.RT[ESM.RT$idday==idday,"NiRT.fast"] <- nrow(rt[rt$idday==idday & !is.na(rt$value) & rt$value<0.65,]) }}

# slow response times (N = 32)
summary(as.factor(ESM.RT$NiRT.slow)) # summary of number of slow response times
##    0    1    2    3 
## 1277   26    5    1
(s <- summary(as.factor(as.character(ESM.RT[ESM.RT$iRT.slow==1,"ID"])))) # summary of slow respondents
## HRV001 HRV003 HRV013 HRV014 HRV042 HRV050 HRV051 HRV054 HRV056 HRV060 HRV063 
##      1      1      1      1      1      1      3      1      2      1      1 
## HRV067 HRV070 HRV073 HRV078 HRV079 HRV080 HRV085 HRV088 HRV089 HRV092 HRV094 
##      2      1      1      1      1      1      1      1      1      1      3 
## HRV099 HRV104 
##      2      2
ESM.RT[ESM.RT$ID %in% names(s[s >= 3]),"slowResp"] <- 1 # flagging slow respondents (3+ responses with 1+ items > 50 sec)
summary(as.factor(ESM.RT[!duplicated(ESM.RT$ID),"slowResp"])) # No. of slow respondents (N = 2)
##  0  1 
## 95  2
# fast response times (N = 66)
summary(as.factor(ESM.RT$NiRT.fast)) # summary of number of fast response times
##    0    1    2 
## 1243   34   32
(s <- summary(as.factor(as.character(ESM.RT[ESM.RT$iRT.fast==1,"ID"])))) # summary of fast respondents
## HRV006 HRV008 HRV013 HRV014 HRV020 HRV022 HRV024 HRV039 HRV041 HRV044 HRV049 
##      1      4      3      2      1      6      2      1      1      1      1 
## HRV050 HRV056 HRV061 HRV065 HRV072 HRV073 HRV082 HRV087 HRV088 HRV089 HRV093 
##      1      1      1      4      1      2      4      2      1      1      3 
## HRV098 HRV099 HRV100 HRV102 HRV103 
##      2     11      5      2      2
ESM.RT[ESM.RT$ID %in% names(s[s >= 3]),"fastResp"] <- 1 # flagging fast respondents (i.e., 3+ responses with 1+ items < 650 ms)
summary(as.factor(ESM.RT[!duplicated(ESM.RT$ID),"fastResp"])) # No. of fast respondents (N = 8 + 1 excluded)
##  0  1 
## 89  8
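
As a side note, the same per-survey counts can be obtained without the explicit loop over idday levels. A sketch of an equivalent vectorized computation (merging the counts back into ESM.RT is omitted):

# equivalent per-survey counts of slow and fast item response times
nSlow <- aggregate(value~idday,data=rt,FUN=function(x) sum(x>50))
nFast <- aggregate(value~idday,data=rt,FUN=function(x) sum(x<0.65))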


TOTAL RESPONSE TIMES (27/1, 23/2)

Second, the same procedures used with item response times are applied to the total response time of each survey RT.tot, computed in section 2.1 as the seconds between the first response and the SubmissionTimestamp of each survey. Note that here we do not consider those cases with missing responses to the last items (i.e., from PE onward).

We can see that responding to the whole questionnaire required about 1 min in most cases (median = 61.3 sec). A small number of questionnaires were answered in more than 2 minutes (7.4%), with very few cases requiring more than 3 min (N = 27, 2.06%, flagged as tRT.slow = 1). Likewise, a small number of questionnaires were answered in less than 40 seconds (7.8%), with very few cases requiring less than 35 seconds (N = 23, 1.8%, flagged as tRT.fast = 1). We also flag careless participants as those with 3+ responses that took more than 3 min (N = 1, flagged as slowResp.tot = 1) and those with 3+ responses that took less than 35 seconds (N = 2, flagged as fastResp.tot = 1).

# summarizing total response times (in both seconds and minutes)
ESM.RT$RT.tot.min <- ESM.RT$RT.tot/60
summary(ESM.RT$RT.tot); summary(ESM.RT$RT.tot.min)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.05   48.17   61.31   75.33   81.81 4412.12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5008  0.8028  1.0219  1.2556  1.3635 73.5354
# plotting response times
grid.arrange(ggplot(ESM.RT,aes(RT.tot.min))+geom_histogram(bins=300)+ggtitle("Total response times (min)"),
             ggplot(ESM.RT,aes(RT.tot.min))+geom_histogram(bins=300)+xlim(0,10)+ggtitle("zooming 0 - 10 min"),
             ggplot(ESM.RT,aes(RT.tot.min))+geom_histogram(bins=300)+xlim(0,5)+ggtitle("zooming 0 - 5 min"),
             nrow=1)

# No. of total response times higher than X minutes
for(value in c(2,3,5,10,20,50)){ cat("\n-",nrow(ESM.RT[ESM.RT$RT.tot.min>value,]),"total RTs >",value,
                                                      "min (",round(100*nrow(ESM.RT[ESM.RT$RT.tot.min>value,])/
                                                                      nrow(ESM.RT),2),"% )") }
## 
## - 97 total RTs > 2 min ( 7.41 % )
## - 27 total RTs > 3 min ( 2.06 % )
## - 7 total RTs > 5 min ( 0.53 % )
## - 2 total RTs > 10 min ( 0.15 % )
## - 2 total RTs > 20 min ( 0.15 % )
## - 1 total RTs > 50 min ( 0.08 % )
# No. of total response times lower than X seconds
for(value in c(40,35,30,20)){ cat("\n-",nrow(ESM.RT[!is.na(ESM.RT$PE) & ESM.RT$RT.tot<value,]),"total RTs <",value,
                                                      "sec (",round(100*nrow(ESM.RT[!is.na(ESM.RT$PE) &
                                                                                             ESM.RT$RT.tot<value,])/
                                                                                      nrow(ESM.RT[!is.na(ESM.RT$PE),]),2),"% )") }
## 
## - 102 total RTs < 40 sec ( 7.8 % )
## - 23 total RTs < 35 sec ( 1.76 % )
## - 0 total RTs < 30 sec ( 0 % )
## - 0 total RTs < 20 sec ( 0 % )
# flagging slow total response times > 3 min (N = 27) and fast total response times < 35 sec (N = 23)
ESM.RT$tRT.slow <- ESM.RT$tRT.fast <- ESM.RT$slowResp.tot <- ESM.RT$fastResp.tot <- 0
ESM.RT[ESM.RT$RT.tot.min > 3,"tRT.slow"] <- 1
ESM.RT[!is.na(ESM.RT$PE) & ESM.RT$RT.tot < 35,"tRT.fast"] <- 1

# flagging slow and fast respondents
(s <- summary(as.factor(as.character(ESM.RT[ESM.RT$tRT.slow==1,"ID"])))) # summary of slow respondents
## HRV001 HRV003 HRV013 HRV050 HRV051 HRV054 HRV056 HRV063 HRV067 HRV070 HRV073 
##      1      1      1      1      2      1      1      1      1      1      1 
## HRV075 HRV078 HRV079 HRV080 HRV085 HRV088 HRV089 HRV092 HRV094 HRV099 HRV104 
##      1      1      1      1      1      1      1      1      4      2      1
ESM.RT[ESM.RT$ID %in% names(s[s >= 3]),"slowResp.tot"] <- 1 # flagging slow respondents (3+ responses > 3 min)
summary(as.factor(ESM.RT[!duplicated(ESM.RT$ID),"slowResp.tot"])) # No. of slow respondents (N = 1)
##  0  1 
## 96  1
(s <- summary(as.factor(as.character(ESM.RT[ESM.RT$tRT.fast==1,"ID"])))) # summary of fast respondents
## HRV008 HRV018 HRV019 HRV022 HRV036 HRV043 HRV049 HRV061 HRV063 HRV065 HRV071 
##      2      1      5      1      1      1      1      1      1      2      1 
## HRV084 HRV087 HRV099 
##      2      1      3
ESM.RT[ESM.RT$ID %in% names(s[s >= 3]),"fastResp.tot"] <- 1 # flagging fast respondents (3+ responses < 35 sec)
summary(as.factor(ESM.RT[!duplicated(ESM.RT$ID),"fastResp.tot"])) # No. of fast respondents (N = 2)
##  0  1 
## 95  2


LONG STRING PROPORTION (66/6)

Third, we examine cases with long strings of identical responses to mood items, that is, extremely consistent responses entering the same value in consecutive items using the same response scale (see Curran, 2016). This is done using the careless package. Note that in section 2.2.3 we recoded mood ratings to have the same polarity. Event items are not considered because most respondents reported no events (i.e., ratings of 1).

We can see that in most cases (about 68%) the same rating was entered for up to 3 consecutive items, with 16.1% showing strings of identical responses repeated more than 4 times (i.e., the rule of thumb proposed by Curran (2016) of identical string length > scale range / 2). In a minority of cases (N = 66, 4.1%), respondents entered the same value 7+ times, and we mark them with ls = 1. As done previously, we also flag careless participants as those with 3+ responses with long identical strings (N = 6, flagged as lsResp = 1).

# computing the longest string frequency and the average length of identical consecutive responses
mood <- colnames(ESM)[which(colnames(ESM)=="v1"):which(colnames(ESM)=="f3")]
ESM.LS <- cbind(ESM[,c("ID","RunTimestamp","within.day")],longstring(ESM[,mood],avg=TRUE))

# plotting longest identical string and average identical string length
grid.arrange(ggplot(ESM.LS,aes(longstr))+geom_histogram(bins=30)+ggtitle("Longest identical string in each survey"),
             ggplot(ESM.LS,aes(avgstr))+geom_histogram(bins=30)+ggtitle("Average identical string length"),
             nrow=1)

# No. of longest identical strings longer than X
for(value in 3:6){ cat("\n-",nrow(ESM.LS[ESM.LS$longstr>value,]),"longest identical string >",value,
                       "times (",round(100*nrow(ESM.LS[ESM.LS$longstr>value,])/nrow(ESM.LS),2),"% )") }
## 
## - 516 longest identical string > 3 times ( 32.25 % )
## - 258 longest identical string > 4 times ( 16.12 % )
## - 166 longest identical string > 5 times ( 10.38 % )
## - 66 longest identical string > 6 times ( 4.12 % )
# flagging cases with identical strings > 6 (N = 66)
ESM.LS$lsResp <- ESM.LS$ls <- 0
ESM.LS[ESM.LS$longstr > 6,"ls"] <- 1
summary(as.factor(ESM.LS$ls)) # No. of cases with long string
##    0    1 
## 1534   66
# flagging long string respondents (N = 6)
(s <- summary(as.factor(as.character(ESM.LS[ESM.LS$ls == 1,"ID"])))) # No. of long string responses per respondent
## HRV004 HRV005 HRV006 HRV007 HRV008 HRV017 HRV021 HRV022 HRV023 HRV030 HRV033 
##      1      4      1      3      3      1      1      1      1      1      2 
## HRV034 HRV036 HRV038 HRV039 HRV040 HRV045 HRV050 HRV054 HRV062 HRV065 HRV067 
##      2      2      3      2      1      1      1      1      1      2      1 
## HRV068 HRV070 HRV075 HRV078 HRV080 HRV081 HRV085 HRV086 HRV088 HRV090 HRV091 
##      1      2      1      2      1      1      1      2      2      1      6 
## HRV098 HRV099 HRV100 HRV102 HRV104 HRV105 
##      1      1      2      1      3      2
ESM.LS[ESM.LS$ID %in% names(s[s >= 3]),"lsResp"] <- 1 # flagging long string respondents (i.e., 3+ identical strings > 6)
summary(as.factor(ESM.LS[!duplicated(ESM.LS$ID),"lsResp"]))
##  0  1 
## 91  6


MAHALANOBIS DISTANCE (83/8)

Fourth, we examine cases with unlikely response patterns by performing a multivariate outlier analysis based on the Mahalanobis distance. We can see that 83 responses (5.2%) were flagged as multivariate outliers (md = 1) at a 99% confidence level, with 8 participants showing 3+ flagged responses and thus marked with mdResp = 1.

# computing Mahalanobis distance in mood ratings
ESM.LS <- cbind(ESM.LS,mahad(ESM[,mood],flag=TRUE))

# plotting Mahalanobis distance and marking flagged cases (N = 83)
ESM.LS$md <- as.numeric(gsub("FALSE","0",gsub("TRUE","1",ESM.LS$flagged)))
summary(as.factor(ESM.LS$md)) # No. flagged responses (N = 83)
##    0    1 
## 1517   83
ggplot(ESM.LS,aes(d_sq,fill=as.factor(md)))+geom_histogram(bins=30)+ggtitle("Mahalanobis distance")

# flagging outlier respondents (N = 8)
(s <- summary(as.factor(as.character(ESM.LS[ESM.LS$md == 1,"ID"])))) # No. of flagged responses per respondent
## HRV001 HRV002 HRV004 HRV005 HRV008 HRV010 HRV013 HRV014 HRV020 HRV024 HRV026 
##      1      2      1      1      1      1      2      2      1      2      2 
## HRV030 HRV031 HRV036 HRV037 HRV043 HRV048 HRV049 HRV051 HRV053 HRV055 HRV056 
##      1      5      4      1      1      3      1      7      1      1      1 
## HRV060 HRV061 HRV062 HRV063 HRV068 HRV069 HRV071 HRV076 HRV079 HRV082 HRV084 
##      1      2      1      1      2      1      1      1      2      1      2 
## HRV086 HRV087 HRV088 HRV089 HRV093 HRV094 HRV095 HRV096 HRV098 HRV100 HRV102 
##      4      1      1      1      1      7      1      3      3      2      1 
## HRV103 
##      1
ESM.LS$mdResp <- 0
ESM.LS[ESM.LS$ID %in% names(s[s >= 3]),"mdResp"] <- 1 # flagging outlier respondents (i.e., 3+ extreme Mahalanobis dist)
summary(as.factor(ESM.LS[!duplicated(ESM.LS$ID),"mdResp"])) # No. flagged respondents (N = 8)
##  0  1 
## 89  8


PSYCHOMETRIC SYNONYMS (100/9)

Fifth, we examine cases with negative correlations among psychometric synonyms, that is, cases where the within-person correlation between pairs of items that correlate at r = .60 or stronger across the sample is negative, and specifically below -.30. We can see that 100 responses (6.3%) were flagged with psychsyn = 1, with 9 participants showing 3+ flagged responses and thus marked with psychsynResp = 1.

# computing within-person correlations among psychometric synonyms in mood ratings
ESM.LS <- cbind(ESM.LS,psychsyn(ESM[,mood],diag=TRUE,critval=0.60,resample_na=TRUE))

# note that some cases marked with long identical strings have SD = 0, so the within-person correlation cannot be computed
head(cbind(ESM.LS[is.na(ESM.LS$cor),c("ID","longstr","avgstr","ls","lsResp")],ESM[is.na(ESM.LS$cor),mood]))
# flagging and plotting correlations < -.30 (N = 100)
ESM.LS$psychsyn <- 0
ESM.LS[!is.na(ESM.LS$cor) & ESM.LS$cor < (-.30),"psychsyn"] <- 1
summary(as.factor(ESM.LS$psychsyn)) # No. flagged responses (N = 100)
##    0    1 
## 1500  100
ggplot(ESM.LS,aes(cor,fill=as.factor(psychsyn)))+geom_histogram(bins=100)+ggtitle("Individual corr. among psychometric synonyms")

# flagging psychsyn respondents (N = 9)
(s <- summary(as.factor(as.character(ESM.LS[ESM.LS$psychsyn == 1,"ID"])))) # No. of flagged responses per respondent
## HRV001 HRV002 HRV004 HRV005 HRV006 HRV008 HRV010 HRV013 HRV014 HRV017 HRV019 
##      2      1      1      2      1      2      1      1      1      1      1 
## HRV020 HRV022 HRV024 HRV028 HRV030 HRV032 HRV033 HRV034 HRV035 HRV037 HRV041 
##      2      1      2      3      1      2      1      1      4      3      1 
## HRV048 HRV051 HRV054 HRV056 HRV058 HRV061 HRV062 HRV064 HRV065 HRV067 HRV069 
##      2      2      2      1      1      1      1      4      1      1      3 
## HRV070 HRV072 HRV073 HRV074 HRV076 HRV079 HRV080 HRV081 HRV082 HRV085 HRV086 
##      1      1      3      1      2      2      1      3      3      2      3 
## HRV088 HRV089 HRV090 HRV092 HRV093 HRV094 HRV095 HRV096 HRV100 HRV101 HRV102 
##      2      1      2      2      2      2      2      1      2      2      2 
## HRV103 HRV104 HRV105 
##      1      2      1
ESM.LS$psychsynResp <- 0
ESM.LS[ESM.LS$ID %in% names(s[s >= 3]),"psychsynResp"] <- 1 # flagging psychsyn respondents (i.e., 3+ flagged responses)
summary(as.factor(ESM.LS[!duplicated(ESM.LS$ID),"psychsynResp"])) # No. flagged participants (N = 9)
##  0  1 
## 88  9


Summary

Here, we summarize the number of potentially careless responses and participants for each criterion, and we compute an aggregated variable indicating the number of criteria (from 0 to 7) based on which each response and participant is classified as potentially careless. Then, we flag all responses classified as potentially careless based on 1+ criteria (N = 334) with cLess1 = 1 and those based on 2+ criteria (N = 55) with cLess2 = 1, respectively. Similarly, we flag all participants classified as potentially careless based on 1+ criteria (N = 29) with cLessResp1 = 1, and those based on 2+ criteria (N = 6) with cLessResp2 = 1.

# joining careless criteria
ESM.LS$idday <- as.factor(paste(ESM.LS$ID,ESM.LS$RunTimestamp,sep="_"))
ESM.careless <- join(ESM.LS[,c("ID","within.day","idday","ls","md","psychsyn","lsResp","mdResp","psychsynResp")],
                     ESM.RT[,c("idday","iRT.fast","tRT.fast","iRT.slow","tRT.slow",
                               "fastResp","fastResp.tot","slowResp","slowResp.tot")],by="idday",type="left")

# computing aggregate careless scores
ESM.careless$cLessResp2 <- ESM.careless$cLessResp1 <- ESM.careless$cLess2 <- ESM.careless$cLess1 <- 0
cr <- c("ls","md","psychsyn","iRT.fast","tRT.fast","iRT.slow","tRT.slow")
ESM.careless$carelessN <- rowSums(ESM.careless[,cr],na.rm=TRUE)
ESM.careless[ESM.careless$carelessN>=1,"cLess1"] <- 1 # careless response if meeting 1+ criteria
ESM.careless[ESM.careless$carelessN>=2,"cLess2"] <- 1 # careless response if meeting 2+ criteria
crResp <- c("lsResp","mdResp","psychsynResp","fastResp","fastResp.tot","slowResp","slowResp.tot")
ESM.careless$cLessRespN <- rowSums(ESM.careless[,crResp],na.rm=TRUE)
ESM.clResp <- ESM.careless[ESM.careless$within.day!=0,]
ESM.clResp <- ESM.clResp[!duplicated(ESM.clResp$ID),]
ESM.clResp[ESM.clResp$cLessRespN>=1,"cLessResp1"] <- 1 # careless respondent if meeting 1+ criteria
ESM.clResp[ESM.clResp$cLessRespN>=2,"cLessResp2"] <- 1 # careless respondent if meeting 2+ criteria

# summarizing criteria of potentially careless responses (N = 334/55)
summary(as.data.frame(lapply(ESM.careless[,c(cr,"carelessN","cLess1","cLess2")],as.factor)))
##  ls       md       psychsyn iRT.fast    tRT.fast    iRT.slow    tRT.slow   
##  0:1534   0:1517   0:1500   0   :1240   0   :1283   0   :1274   0   :1279  
##  1:  66   1:  83   1: 100   1   :  66   1   :  23   1   :  32   1   :  27  
##                             NA's: 294   NA's: 294   NA's: 294   NA's: 294  
##                                                                            
##                                                                            
##  carelessN cLess1   cLess2  
##  0:1266    0:1266   0:1545  
##  1: 279    1: 334   1:  55  
##  2:  49                     
##  3:   4                     
##  4:   2
# summarizing criteria of potentially careless respondents (N = 29/6)
summary(as.data.frame(lapply(ESM.clResp[,c(crResp,"cLessRespN","cLessResp1","cLessResp2")],as.factor)))
##  lsResp mdResp psychsynResp fastResp fastResp.tot slowResp slowResp.tot
##  0:91   0:89   0:88         0:89     0:95         0:95     0:96        
##  1: 6   1: 8   1: 9         1: 8     1: 2         1: 2     1: 1        
##                                                                        
##                                                                        
##  cLessRespN cLessResp1 cLessResp2
##  0:68       0:68       0:91      
##  1:23       1:29       1: 6      
##  2: 5                            
##  3: 1
# joining careless scores to the ESM dataset
ESM$idday <- as.factor(paste(ESM$ID,ESM$RunTimestamp,sep="_"))
ESM <- join(ESM,ESM.careless[,c("idday","cLess1","cLess2")],by="idday",type="left")
ESM <- join(ESM,ESM.clResp[,c("ID","cLessResp1","cLessResp2")],by="ID",type="left")


Here, we visually inspect all responses marked with cLess2 = 1. We can note several cases of long identical strings (e.g., HRV008_3_2), as well as cases of inconsistent responses (e.g., HRV060_1_0 for items t2 and t3).

ESM$id <- as.factor(paste(ESM$ID,ESM$day,ESM$within.day,sep="_"))
plotResp(ESM[ESM$cLess2==1,],"id",mood)


Here, we visually inspect all responses from participants marked with cLessResp2 = 1. Also in this case, we can note several cases of long identical strings (e.g., HRV051_2_4), and especially cases of inconsistent responses (e.g., HRV051_1_0 for items f2 and f3).

plotResp(ESM[ESM$cLessResp2==1,],"id",mood)


3. HRV data

Here, we describe the steps implemented to pre-process the pulse interval data recorded with the E4 wristband (Empatica, Milan). Then, pre-processed pulse intervals are read and used for computing HR and HRV metrics, which are subsequently recoded and filtered based on data quality and inclusion criteria.

3.1. Signal pre-processing

The following signal pre-processing steps were implemented to derive pulse intervals from the raw blood volume pulse (BVP):

  1. We selected the recording segments between each pair of markers signaled by participants, focusing on the last 2 minutes of each 3-minute resting interval, and we used consecutive pulse wave foot points in the BVP signal to automatically detect pulse intervals. This was done with the Pulse Intervals Detection App - PIDApp, based on the pracma and shiny R packages.

  2. Artefacts were removed based on visual inspection of both the BVP signal and the pulse interval time series, and recording segments with artefacts in more than 25% of the pulse intervals were excluded from the analyses (marked as out). Specifically, a given pulse interval was classified as an artefact and removed if it met three out of the following five criteria (a sketch of the three numeric criteria follows this list):

    • clear noise in the related raw BVP signal

    • clear rise in the amplitude of the corresponding acceleration signal

    • excessive deviation (i.e., > 300 ms) from the mean pulse interval length of the surrounding 10 pulse intervals

    • excessive deviation (i.e., > 10 bpm) from the instantaneous HR values preceding and following the pulse interval

    • excessive deviation (i.e., > 1 ms) between the RMSSD value computed with and without considering the pulse interval

  3. The resulting pulse interval time series were automatically processed using the freely available ARTiiFACT software (version 2.13) (Kaufmann et al., 2011). Specifically, we spline-interpolated pulse intervals identified as artefacts based on the median absolute deviation method.

  4. A final visual inspection was performed based on the code showed below (not run), and the pre-processed data were saved in the HRV/FinalProcessing folder.
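
To make steps 1-2 concrete, here is a minimal sketch (not the PIDApp or the ARTiiFACT pipeline itself; the 64 Hz E4 BVP sampling rate and the toy signal are assumptions) of foot-point detection and of the three numeric artefact criteria above (the two signal-based criteria require the raw BVP and accelerometer traces):

library(pracma)

# toy BVP-like signal (~72 bpm) sampled at the assumed 64 Hz E4 rate
fs <- 64
t <- seq(0,120,by=1/fs) # one 2-minute segment
bvp <- -cos(2*pi*1.2*t)

# pulse wave feet detected as local minima of the BVP (peaks of the inverted signal)
feet <- findpeaks(-bvp,minpeakdistance=round(0.4*fs))[,2] # foot sample indices
ibi <- diff(feet)/fs*1000 # pulse intervals (ms)

# No. of numeric artefact criteria met by each pulse interval
rmssd <- function(x) sqrt(mean(diff(x)^2))
flagPI <- function(ibi,win=5,dPI=300,dHR=10,dRMSSD=1){
  hr <- 60000/ibi; n <- length(ibi)
  sapply(1:n,function(i){
    near <- setdiff(max(1,i-win):min(n,i+win),i) # ~10 surrounding intervals
    sum(abs(ibi[i]-mean(ibi[near])) > dPI, # deviation from local mean PI
        i > 1 && i < n && all(abs(hr[i]-hr[c(i-1,i+1)]) > dHR), # jump in instantaneous HR
        abs(rmssd(ibi)-rmssd(ibi[-i])) > dRMSSD) }) } # impact on segment RMSSD
table(flagPI(ibi)) # all 0 for this clean toy signal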


# Before executing the following code, pulse intervals were pre-processed as described in steps 1-3.
# Each pre-processed segment was named as "subject_day_interval_Number of artefacts".
# Intervals discarded due to too many artefacts were marked with the label "out" in place of the number of artefacts.
# Missing intervals (i.e., not signaled by participants) were marked as "miss".

# ..............................................
# 4.1. DATA VISUALIZATION AND COMPARISON
# ..............................................

# Here, we show the code (not run) used for visually inspecting the pulse intervals time series processed with ARTiiFACT.
# Automatically processed pulse intervals obtained in step 3 were saved in the "HRV/ARTIIFACTcorrected" folder.
# Automatically processed pulse intervals are also compared with the original (i.e., manually processed) data.
# Manually processed pulse intervals obtained in step 2 were saved in the "HRV/clean_noOUT" folder.

# reading file names
data.path <- "HRV/ARTIIFACTcorrected" # files processed with ARTiiFACT
paths = list.files(data.path, recursive = TRUE,full.names = TRUE, include.dirs = FALSE, pattern = ".txt")
data.path.original <- "HRV/clean_noOUT" # original data
paths.original = list.files(data.path.original, recursive = TRUE,full.names = TRUE, include.dirs = FALSE,pattern=".csv")

# isolating ID x day couples to be plotted 
IDs <- strsplit(gsub(".txt","",gsub("_ArtifactCorrectedWithMethod_1.txt","",gsub("/","",gsub(data.path,"",paths)))),"_")
IDdays <- levels(as.factor(paste(rep(levels(as.factor(unlist(IDs)[grep("HRV",unlist(IDs))])),3),
                                 c(rep(1,length(IDs)),rep(2,length(IDs)),rep(3,length(IDs))),sep="_")))

# reading, processing, plotting, and saving pulse intervals
for(IDday in IDdays[260:length(IDdays)]){
  paths.cor <- grep(IDday,paths,value=TRUE)
  paths.or <- grep(IDday,paths.original,value=TRUE) 
  
  # setting plot interface
  par(mfrow=c(2,4),mar=c(2,2,3,2))
  
  # dataset of HR and HRV
  Summ <- data.frame(interval=NA,RMSSD.cor=NA,RMSSD.or=NA,HR.cor=NA,HR.or=NA)
  
  for(i in 1:length(paths.cor)){
    
    # matching processed and original data
    ID <- gsub(".csv","",gsub("_ArtifactCorrectedWithMethod_1.txt","",gsub("/","",gsub(data.path,"",paths.cor[i]))))
    new.data <- read.csv(paths.cor[i],header=FALSE)
    colnames(new.data)[1] <- "IBI"
    original.data <- read.csv(paths.or[i])
    colnames(original.data)[1] <- "time"
    
    # adding info to dataset
    Summ <- rbind(Summ,data.frame(interval=as.integer(strsplit(ID,"_")[[1]][3]),
                                  RMSSD.cor=round(sqrt(mean(diff(new.data$IBI)^2)),2),
                                  RMSSD.or=round(sqrt(mean(diff(original.data$IBI)^2)),2),
                                  HR.cor=round(mean(60000/new.data$IBI),2),HR.or=round(mean(60000/original.data$IBI),2)))
    
    # plotting corrected over original IBIs
    plot(original.data[1:nrow(original.data),"time"],60000/original.data[1:nrow(original.data),"IBI"],
         type="o",pch=20,cex=1.5,cex.main=1.2,xlab=NA,ylab=NA,
         main=paste(ID,"\nRMSSD =",Summ[i+1,3],"-->",Summ[i+1,2],"- HR =",Summ[i+1,5],"-->",Summ[i+1,4]))
    lines(original.data[1:nrow(original.data),"time"],60000/new.data$IBI,
          type="o",pch=20,cex=1.5,col="red")
    
    # saving corrected data with time information
    write.csv(data.frame(time=original.data$time,IBI=new.data$IBI),gsub("clean_noOUT","SanityCheck",paths.or[i])) }
  
  # plotting RMSSD by interval
  plot(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"RMSSD.or"],main=paste("RMSSD (ms)",IDday),type="o")
  lines(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"RMSSD.cor"],type="o",col="red")
  
  # plotting HR by interval
  plot(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"HR.or"],main=paste("HR (bpm)",IDday),type="o")
  lines(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"HR.cor"],type="o",col="red")
  
  readline(prompt=cat(IDday,"\nPress [enter]")) }

# Based on this visual inspection, some signal segments were manually re-processed by interpolating artefacted pulse intervals.
# Specifically, we only kept corrections leading to substantial absolute change (i.e., > 1 ms) in the RMSSD of each segment.
# Manually re-processed files were saved in the "HRV/SanityCheck2" folder.

# ..............................................
# 4.2. FINAL VISUAL INSPECTION AND DATA EXPORT
# ..............................................

# As a final step, we conducted a last visual inspection and we saved the final data in the "HRV/FinalProcessing" folder.

# reading file names (i.e., "subject_day_interval_Number of artefacts")
data.path <- "HRV/SanityCheck2" # automatically processed & checked data
paths = list.files(data.path, recursive = TRUE,
                   full.names = TRUE, include.dirs = FALSE, pattern = ".csv")

# reading files, recoding, plotting, and saving
for(IDday in IDdays[201:length(IDdays)]){
  
  # setting plot interface
  par(mfrow=c(2,4),mar=c(2,2,3,2))
  
  # dataset of HR and HRV
  Summ <- data.frame(interval=NA,RMSSD=NA,HR=NA)
  
  # reading each interval in a given IDday
  paths.IDday <- grep(IDday,paths,value=TRUE)
  for(i in 1:length(paths.IDday)){
    
    # changing participants' ID
    ID <- strsplit(gsub(" ","",(gsub("/","",gsub(".csv","",(gsub(data.path,"",paths.IDday[i])))))),split = "_")
    if(nchar(ID[[1]][1])==5){
      ID[[1]][1] <- paste("HRV0",substr(ID[[1]][1],4,5),sep="")
      ID <- paste(ID[[1]][1],ID[[1]][2],ID[[1]][3],ID[[1]][4],sep="_")
    } else{ ID <- gsub("/","",gsub(".csv","",(gsub(data.path,"",paths.IDday[i])))) }
    
    if(strsplit(ID,"_")[[1]][4]!="out" & strsplit(ID,"_")[[1]][4]!="miss"){
      
      # changing data format
      new.data <- read.csv(paths.IDday[i])
      if(colnames(new.data)[1]=="X"){ new.data <- new.data[,2:3] }
      if(colnames(new.data)[1]=="time"){colnames(new.data)[1]="t"}
      
      # adding info to dataset
      Summ <- rbind(Summ,data.frame(interval=as.integer(strsplit(ID,"_")[[1]][3]),
                                    RMSSD=round(sqrt(mean(diff(new.data$IBI)^2)),2),
                                    HR=round(mean(60000/new.data$IBI),2)))
      
      # plotting corrected over original IBIs
      plot(new.data$t,60000/new.data$IBI,
           type="o",pch=20,cex=1.5,cex.main=1.2,xlab=NA,ylab=NA,
           main=paste(ID,"\nRMSSD =",Summ[!is.na(Summ$interval)&Summ$interval==i,2],
                      ", HR =",Summ[!is.na(Summ$interval)&Summ$interval==i,3]))
      
      # saving corrected data with time information
      write.csv(new.data,paste("HRV/FinalProcessing/",ID,".csv",sep=""),row.names=FALSE)
    
      } else{ # cases marked as "out" or "miss"
      plot(1,main=paste(ID,"\nRMSSD =",Summ[i+1,2],"- HR =",Summ[i+1,3])) }}  # empty plot
  
  # plotting RMSSD by interval
  plot(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"RMSSD"],main=paste("RMSSD (ms)",IDday),type="b",pch=20,cex=5)
  plot(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"HR"],main=paste("HR (bpm)",IDday),type="b",pch=20,cex=5)
  
  readline(prompt=cat(IDday,"\nPress [enter]")) }


3.2. Data reading & recoding

Here, we read the pre-processed pulse interval time series, and we compute the following metrics for each two-minute recording segment (a toy computation is shown after the list):

  • Mean pulse interval PI, computed as the mean of the pulse intervals (ms) in a given recording segment

  • Mean heart rate HR (bpm), computed as the mean of 60000/PI

  • Root mean square of successive differences in pulse intervals RMSSD (ms), computed as the square root of the mean of the squared differences between consecutive pulse intervals

  • RMSSD coefficient of variation RMSSDcv, computed as 100 * RMSSD / PI, following the recommendation of de Geus et al. (2019) to test and report how adjusting HRV metrics for chronotropic state affects primary study outcomes
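
A toy computation of these metrics on made-up pulse intervals (all values assumed purely for illustration):

ibi <- c(800,810,790,805,815) # toy pulse intervals (ms) for one segment
(PI <- mean(ibi)) # mean pulse interval = 804 ms
(HR <- mean(60000/ibi)) # mean heart rate ~ 74.64 bpm
(RMSSD <- sqrt(mean(diff(ibi)^2))) # sqrt(mean(c(10,-20,15,10)^2)) ~ 14.36 ms
(RMSSDcv <- 100*RMSSD/PI) # ~ 1.79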

The readHRV function is used to process pulse intervals and merge them into the HRV dataset.

readHRV

readHRV <- function(data.path){ 
  
  # empty data.frame to be filled
  HRVdata <- data.frame(ID=NA,day=NA,within.day=NA,PI=NA,HR=NA,RMSSD=NA,RMSSDcv=NA,artefacts=NA,nPI=NA)
  paths = list.files(data.path, recursive = TRUE, full.names = TRUE, include.dirs = FALSE)
  
  # data reading
  for(path in paths){
    ID <- gsub("/","",gsub(data.path,"",gsub(".csv","",path)))
    ID <- strsplit(ID,"_")[[1]]
    if(ID[4]=="out"|ID[4]=="miss"){ # missing or excluded data
      HRVdata <- rbind(HRVdata,
                       data.frame(ID=ID[1],day=ID[2],within.day=ID[3],
                                  PI=NA,HR=NA,RMSSD=NA,RMSSDcv=NA,
                                  artefacts=ID[4],nPI=NA))
    }else{ # complete data
        new.data <- read.csv(path) 
        HRVdata <- rbind(HRVdata,
                         # computing HRV metrics
                         data.frame(ID=ID[1],day=ID[2],within.day=ID[3],
                                    PI=mean(new.data$IBI), # mean pulse interval
                                    HR=mean(60000/new.data$IBI), # mean heart rate
                                    RMSSD=sqrt(mean(diff(new.data$IBI)^2)), # RMSSD
                                    RMSSDcv= 100 * sqrt(mean(diff(new.data$IBI)^2)) / mean(new.data$IBI), # RMSSDcv
                                    artefacts=ID[4],
                                    nPI=nrow(new.data)))}}
  
  # adjusting data
  HRVdata <- HRVdata[2:nrow(HRVdata),] # removing first empty row
  HRVdata$ID <- as.factor(HRVdata$ID) # setting class for ID, day and within.day
  HRVdata[,c("within.day","day")] <- lapply(HRVdata[,c("within.day","day")],as.numeric)
  return(HRVdata)}

(HRV <- readHRV(data.path="HRV/FinalProcessing"))[1:3,] # reading data and showing first three rows


3.3. Data cleaning & filtering

HRV data were already cleaned and filtered in the signal pre-processing step (see section 3.1). Here, we report the details on data quality and we consider some data quality indicators as further criteria to flag those cases that might bias the results. The following packages and functions are used to optimize data filtering:

library(plyr) # loading required packages
complRate

complRate <- function(data,what=c("compliance","inclusion")){
  
  # scheduled No. of surveys (1 baseline + 5 ordinary X 3 days)
  tot <- (1 + 5) * 3
  amb <- 5 * 3
  lab <- 1 * 3
  
  # subsetting data
  data.ok <- data[!(data$artefacts%in%c("out","miss")),] # clean segments
  data.nmiss <- data[data$artefacts!="miss",] # nonmissing segments
  data.miss <- data[data$artefacts=="miss",] # missing segments
  data.out <- data[data$artefacts=="out",] # excluded segments
  
  # creating dataset of compliance and inclusion rates
  cr <- data.frame(ID=levels(data$ID),tot.nmiss=as.numeric(table(data.nmiss$ID))) # No. of nonmissing segments per participant
  cr$lab.nmiss <- as.numeric(table(data.nmiss[data.nmiss$within.day==6,"ID"])) # No. nonmissing segments in lab
  cr$amb.nmiss <- as.numeric(table(data.nmiss[data.nmiss$within.day!=6,"ID"])) # No. nonmissing segments outside lab
  
  # compliance: nonmissing and missing over scheduled
  if("compliance" %in% what){
    
    # computing compliance rate
    cr$tot.CR <- round(100*cr$tot.nmiss/tot,1) # compliance rate
    cr$lab.CR <- round(100*cr$lab.nmiss/lab,1) # compliance rate in lab
    cr$amb.CR <- round(100*cr$amb.nmiss/amb,1) # compliance rate outside lab
  
    cat("\n\nNonmissing segments:",
        "\n -",sum(cr$tot.nmiss),"nonmissing segments (",
        round(100*sum(cr$tot.nmiss)/(tot*nlevels(data$ID)),1),"% of tot scheduled ), from",min(cr$tot.nmiss),"to",max(cr$tot.nmiss),
        ", mean =",round(mean(cr$tot.nmiss),1),"sd =",round(sd(cr$tot.nmiss),1),
        ", mean perc =",round(mean(cr$tot.CR),1),"% sd =",round(sd(cr$tot.CR),1),"%",
        "\n -",sum(cr$amb.nmiss),"from outside lab, from",min(cr$amb.nmiss),"to",max(cr$amb.nmiss),
        ", mean =",round(mean(cr$amb.nmiss),1),"sd =",round(sd(cr$amb.nmiss),1),
        ", mean perc =",round(mean(cr$amb.CR),1),"% sd =",round(sd(cr$amb.CR),1),"%",
        "\n -",sum(cr$lab.nmiss),"from lab, from",min(cr$lab.nmiss),"to",max(cr$lab.nmiss),
        ", mean =",round(mean(cr$lab.nmiss),1),"sd =",round(sd(cr$lab.nmiss),1),
        ", mean perc =",round(mean(cr$lab.CR),1),"% sd =",round(sd(cr$lab.CR),1),"%")
  
    # computing No. of missing segments and missing rate
    cr$tot.miss <- as.numeric(table(data.miss$ID)) # No. of missing segments per participant
    cr$tot.MR <- round(100*cr$tot.miss/tot,1) # missing rate
    cr$lab.miss <- as.numeric(table(data.miss[data.miss$within.day==6,"ID"])) # No. missing segments in lab
    cr$lab.MR <- round(100*cr$lab.miss/lab,1) # missing rate in lab
    cr$amb.miss <- as.numeric(table(data.miss[data.miss$within.day!=6,"ID"])) # No. missing segments outside lab
    cr$amb.MR <- round(100*cr$amb.miss/amb,1) # missing rate outside lab
  
    cat("\n\nMissing segments:",
        "\n -",sum(cr$tot.miss),"missing segments (",
        round(100*sum(cr$tot.miss)/(tot*nlevels(data$ID)),1),"% of tot scheduled ), from",min(cr$tot.miss),"to",max(cr$tot.miss),
        ", mean =",round(mean(cr$tot.miss),1),"sd =",round(sd(cr$tot.miss),1),
        ", mean perc =",round(mean(cr$tot.MR),1),"% sd =",round(sd(cr$tot.MR),1),"%",
       "\n -",sum(cr$amb.miss),"from outside lab, from",min(cr$amb.miss),"to",max(cr$amb.miss),
        ", mean =",round(mean(cr$amb.miss),1),"sd =",round(sd(cr$amb.miss),1),
        ", mean perc =",round(mean(cr$amb.MR),1),"% sd =",round(sd(cr$amb.MR),1),"%",
        "\n -",sum(cr$lab.miss),"from lab, from",min(cr$lab.miss),"to",max(cr$lab.miss),
        ", mean =",round(mean(cr$lab.miss),1),"sd =",round(sd(cr$lab.miss),1),
        ", mean perc =",round(mean(cr$lab.MR),1),"% sd =",round(sd(cr$lab.MR),1),"%")  }
  
  # inclusion: excluded and included over nonmissing
  if("inclusion" %in% what){
    # computing No. of excluded segments and exclusion rate
    cr$tot.out <- as.numeric(table(data.out$ID)) # No. of excluded segments per participant
    cr$tot.ER <- round(100*cr$tot.out/cr$tot.nmiss,1) # exclusion rate
    cr$lab.out <- as.numeric(table(data.out[data.out$within.day==6,"ID"])) # No. excluded segments in lab
    cr$lab.ER <- round(100*cr$lab.out/cr$lab.nmiss,1) # exclusion rate in lab
    cr$amb.out <- as.numeric(table(data.out[data.out$within.day!=6,"ID"])) # No. excluded segments outside lab
    cr$amb.ER <- round(100*cr$amb.out/cr$amb.nmiss,1) # exclusion rate outside lab
  
    cat("\n\nExcluded segments:",
        "\n -",sum(cr$tot.out),"excluded segments (",
        round(100*sum(cr$tot.out)/sum(cr$tot.nmiss),1),"% of nonmissing segments), from",min(cr$tot.out),"to",max(cr$tot.out),
        ", mean =",round(mean(cr$tot.out),1),"sd =",round(sd(cr$tot.out),1),
        ", mean perc =",round(mean(cr$tot.ER),1),"% sd =",round(sd(cr$tot.ER),1),"%",
        "\n -",sum(cr$amb.out),"from outside lab, from",min(cr$amb.out),"to",max(cr$amb.out),
        ", mean =",round(mean(cr$amb.out),1),"sd =",round(sd(cr$amb.out),1),
        ", mean perc =",round(mean(cr$amb.ER)),"% sd =",round(sd(cr$amb.ER)),"%",
        "\n -",sum(cr$lab.out),"from lab, from",min(cr$lab.out),"to",max(cr$lab.out),
        ", mean =",round(mean(cr$lab.out),1),"sd =",round(sd(cr$lab.out),1),
        ", mean perc =",round(mean(cr$lab.ER),1),"% sd =",round(sd(cr$lab.ER),1),"%")
  
    # computing No. of included segments and inclusion rate
    cr$tot.ok <- as.numeric(table(data.ok$ID)) # No. of included segments per participant
    cr$tot.IR <- round(100*cr$tot.ok/cr$tot.nmiss,1) # inclusion rate
    cr$lab.ok <- as.numeric(table(data.ok[data.ok$within.day==6,"ID"])) # No. included segments in lab
    cr$lab.IR <- round(100*cr$lab.ok/cr$lab.nmiss,1) # inclusion rate in lab
    cr$amb.ok <- as.numeric(table(data.ok[data.ok$within.day!=6,"ID"])) # No. included segments outside lab
    cr$amb.IR <- round(100*cr$amb.ok/cr$amb.nmiss,1) # inclusion rate outside lab
  
    cat("\n\nIncluded segments:",
        "\n -",sum(cr$tot.ok),"included segments (",
        round(100*sum(cr$tot.ok)/sum(cr$tot.nmiss),1),"% of nonmissing segments), from",min(cr$tot.ok),"to",max(cr$tot.ok),
        ", mean =",round(mean(cr$tot.ok),1),"sd =",round(sd(cr$tot.ok),1),
        ", mean perc =",round(mean(cr$tot.IR),1),"% sd =",round(sd(cr$tot.IR),1),"%",
        "\n -",sum(cr$amb.ok),"from outside lab, from",min(cr$amb.ok),"to",max(cr$amb.ok),
        ", mean =",round(mean(cr$amb.ok),1),"sd =",round(sd(cr$amb.ok),1),
        ", mean perc =",round(mean(cr$amb.IR),1),"% sd =",round(sd(cr$amb.IR),1),"%",
        "\n -",sum(cr$lab.ok),"from lab, from",min(cr$lab.ok),"to",max(cr$lab.ok),
        ", mean =",round(mean(cr$lab.ok),1),"sd =",round(sd(cr$lab.ok),1),
        ", mean perc =",round(mean(cr$lab.IR),1),"% sd =",round(sd(cr$lab.IR),1),"%") }
  
  return(cr)}

Computes a data frame of compliance and inclusion rate values per participant, and prints summary information on compliance and inclusion rates.

plotPI

plotPI <- function(data.path,IDs){
  paths = list.files(data.path, recursive = TRUE, full.names = TRUE, include.dirs = FALSE)
  for(ID in IDs){ pathIDs <- paths[substr(paths,21,26)==ID]
  par(mfrow=c(2,4))
    for(pathID in pathIDs){ id <- strsplit(gsub("/","",gsub(data.path,"",gsub(".csv","",pathID))),split="_")[[1]][4]
    if(id!="out" & id!="miss"){ pi <- read.csv(pathID)
    plot(pi$t,60000/pi$IBI,type="o",pch=20,cex=.5,lwd=.5,cex.main=1.2,xlab="time (sec)",ylab="HR (bpm)",
         main=gsub(data.path,"",pathID),ylim=c(40,120))}}}}

Reads and plots the pulse intervals time series from the specified subsample of participants.


We start by inspecting the original number of participants and observations in the HRV dataset, which corresponds to the total number of scheduled recordings (i.e., No. participants x 3 days x 6 recordings), that is 1,746. We can see that the dataset includes a total of 97 participants, eight fewer than the 105 participants who were invited to participate in the daily protocol. This is because the eight excluded participants mentioned in sections 1.3.5 and 2.3.3 either were not pre-processed or had their pre-processing interrupted after noticing excessively low signal quality.

cat("Total No. of scheduled HRV recordings =",nrow(HRV),"from",nlevels(HRV$ID),"participants")
## Total No. of scheduled HRV recordings = 1746 from 97 participants


3.3.1. Double recordings

First, we inspect the data for cases of double recording segments, that is, segments with the same ID, day, and within.day values. We can see that no double recordings are included in the HRV dataset.

cat(nrow(HRV[duplicated(paste(HRV$ID,HRV$day,HRV$within.day)),]),"double recordings")
## 0 double recordings


3.3.2. Flagged cases

Second, we inspect the compliance rate and the data quality as further criteria to flag those cases that might bias the results.

3.3.2.1. Compliance rate

First, we inspect the compliance rate, that is, the number of recordings that were not marked as “miss” in the file name over the total number of scheduled recordings, by using the complRate function shown above.

We can see that all lab-based segments were recorded each day for each participant (compliance = 100%), whereas some missing data were present in ambulatory segments (i.e., those recorded outside the lab). Overall, the mean compliance rate with ambulatory recordings was relatively high (92%), with no participant showing a compliance rate < 60%. Thus, we see no reason for excluding or flagging further participants based on compliance rate.

crate <- complRate(HRV,what="compliance")
## 
## 
## Nonmissing segments: 
##  - 1629 nonmissing segments ( 93.3 % of tot scheduled ), from 13 to 18 , mean = 16.8 sd = 1.4 , mean perc = 93.3 % sd = 7.5 % 
##  - 1338 from outside lab, from 10 to 15 , mean = 13.8 sd = 1.4 , mean perc = 92 % sd = 9 % 
##  - 291 from lab, from 3 to 3 , mean = 3 sd = 0 , mean perc = 100 % sd = 0 %
## 
## Missing segments: 
##  - 117 missing segments ( 6.7 % of tot scheduled ), from 0 to 5 , mean = 1.2 sd = 1.4 , mean perc = 6.7 % sd = 7.5 % 
##  - 117 from outside lab, from 0 to 5 , mean = 1.2 sd = 1.4 , mean perc = 8 % sd = 9 % 
##  - 0 from lab, from 0 to 0 , mean = 0 sd = 0 , mean perc = 0 % sd = 0 %
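
As a quick cross-check of the overall compliance rate printed above, it can be recomputed in one line from the artefacts coding created by readHRV (a sketch; it should match the 93.3% reported above):

# cross-check: overall share of nonmissing segments
cat(round(100*sum(HRV$artefacts!="miss")/nrow(HRV),1),"% overall compliance rate")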


3.3.2.2. Artefact rate

Second, we inspect the rate of pulse intervals identified as artefacts in the included segments. Note that recording segments with artefacts in more than 25% of the pulse intervals were excluded and marked as “out”.

We can note that a few included segments (N = 9, 0.7%) still include artefacts in more than 25% of the pulse intervals, and thus we exclude them. Without these cases, the average No. of removed artefacts per segment is 7.34 (5.2%). Here, we also consider a more restrictive criterion by flagging segments with artefacts in more than 10% of the pulse intervals (N = 234 flagged with art10 = 1, 17%).

# separating "miss" and "out" from the No. of artefacts
HRV$out <- HRV$miss <- 0
HRV[HRV$artefacts=="miss","miss"] <- 1
HRV[HRV$artefacts=="out","out"] <- 1
HRV$nArtefacts <- HRV$artefacts
HRV[HRV$artefacts %in% c("miss","out"),"nArtefacts"] <- NA
HRV$nArtefacts <- as.integer(HRV$nArtefacts) # No. of artefacts
HRV$pArtefacts <- 100*HRV$nArtefacts/HRV$nPI # artefact rate

# checking for included cases with artefacts in more than 25% of the pulse intervals
cat("Excluding further",nrow(HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>25,]),"segments with 25%+ artefacts")
## Excluding further 9 segments with 25%+ artefacts
HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>25,"out"] <- 1 # excluding these cases
HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>25,"artefacts"] <- "out"
HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>25,c("PI","HR","RMSSD","RMSSDcv","nPI","nArtefacts","pArtefacts")] <- NA
cat(nrow(HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>25,]),"remaining segments with 25%+ artefacts")
## 0 remaining segments with 25%+ artefacts
# summarizing and plotting No. of artefacts per segment
cat(sum(HRV$nArtefacts,na.rm=TRUE),"artefacts removed from included segments (",
    round(100*sum(HRV$nArtefacts,na.rm=TRUE)/sum(HRV$nPI,na.rm=TRUE),1),"% of pulse intervals )",
    "\n- No. of artefacts per recording ranging from",min(HRV$nArtefacts,na.rm=TRUE),"to",max(HRV$nArtefacts,na.rm=TRUE),
    ", mean =",round(mean(HRV$nArtefacts,na.rm=TRUE),2),"sd =",round(sd(HRV$nArtefacts,na.rm=TRUE),2),
    "\n- Perc. artefacts per recording ranging from",
    min(100*HRV$nArtefacts/HRV$nPI,na.rm=TRUE),"% to",max(100*HRV$nArtefacts/HRV$nPI,na.rm=TRUE),"%, mean =",
    round(mean(100*HRV$nArtefacts/HRV$nPI,na.rm=TRUE),1),"% sd =",round(sd(100*HRV$nArtefacts/HRV$nPI,na.rm=TRUE),1),"%")
## 9906 artefacts removed from included segments ( 4.9 % of pulse intervals ) 
## - No. of artefacts per recording ranging from 0 to 29 , mean = 7.34 sd = 7.61 
## - Perc. artefacts per recording ranging from 0 % to 25 %, mean = 5.2 % sd = 5.5 %
par(mfrow=c(1,2)); hist(HRV$nArtefacts,main="No. removed artefacts"); hist(100*HRV$nArtefacts/HRV$nPI,main="% removed artefacts")

# flagging cases with artefacts in more than 10% of the pulse intervals
HRV[!is.na(HRV$pArtefacts),"art10"] <- 0
HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>10,"art10"] <- 1
summary(as.factor(HRV$art10))
##    0    1 NA's 
## 1115  234  397


3.3.2.3. Inclusion rate

Third, we inspect the quality of the recorded signal in terms of the number of segments excluded due to excessive artefacts (i.e., segments with artefacts in more than 25% of the pulse intervals were marked as “out” in the file name) over the number of nonmissing segments. This is done with the same complRate function shown above.

We can see that only a few lab-based segments (N = 30 out of 291, 10.3%) were excluded due to excessive artefacts, whereas a higher percentage of ambulatory segments (N = 250 out of 1,338, 18.7%) were excluded for the same reason. Overall, 280 segments (17.2% of the nonmissing segments) were excluded due to excessive artefacts, resulting in a total of 1,349 included segments. Here, a number of participants (N = 14, 14.4%) showed inclusion rates < 60% and are flagged with inclRate60 = 1.

# computing inclusion rate
irate <- complRate(HRV,what="inclusion")
## 
## 
## Excluded segments: 
##  - 280 excluded segments ( 17.2 % of nonmissing segments), from 0 to 12 , mean = 2.9 sd = 3.1 , mean perc = 17.2 % sd = 18.1 % 
##  - 250 from outside lab, from 0 to 10 , mean = 2.6 sd = 2.7 , mean perc = 19 % sd = 20 % 
##  - 30 from lab, from 0 to 2 , mean = 0.3 sd = 0.6 , mean perc = 10.3 % sd = 19.5 %
## 
## Included segments: 
##  - 1349 included segments ( 82.8 % of nonmissing segments), from 6 to 18 , mean = 13.9 sd = 3.2 , mean perc = 82.8 % sd = 18.1 % 
##  - 1088 from outside lab, from 4 to 15 , mean = 11.2 sd = 2.9 , mean perc = 81.3 % sd = 19.6 % 
##  - 261 from lab, from 1 to 3 , mean = 2.7 sd = 0.6 , mean perc = 89.7 % sd = 19.5 %
# flagging participants with IR < 60% (N = 14)
irate$inclRate60 <- 0
irate[irate$tot.IR < 60,"inclRate60"] <- 1
HRV <- join(HRV,irate[,c("ID","inclRate60")],by="ID",type="left")
summary(as.factor(irate$inclRate60))
##  0  1 
## 83 14


3.3.2.4. ‘Noisy’ signal

Finally, we inspect the data of 10 included participants (10.3%) whose signal was overall classified as ‘noisy’ due to higher uncertainty in pulse interval detection. The plotPI function shown at the beginning of section 3.3 is used to visualize the pulse intervals time series for these participants and to compare them with a random subsample of 5 ‘clean-signal’ participants.

We can note that, compared to ‘clean-signal’ participants, flagged participants tended to show less variable pulse intervals. Some cases (e.g., HRV032) tend to show ‘fixed’ pulse interval values (i.e., certain values are highly frequent), which is partially due to the sampling frequency of the E4 PPG sensor and can also be seen for some unflagged participants. Other cases (e.g., HRV054) show frequent steep changes of about 10-20 bpm.
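
To quantify the ‘fixed’ pulse intervals described above, one could compute the share of the single most frequent IBI value per participant. The following is a minimal sketch assuming the same file layout read by readHRV and plotPI; fixedness is a hypothetical helper, not part of the original pipeline:

# sketch: modal-value share of pulse intervals per participant (higher = more 'fixed')
fixedness <- function(data.path,IDs){
  paths <- list.files(data.path,recursive=TRUE,full.names=TRUE)
  sapply(IDs,function(id){
    ibi <- unlist(lapply(grep(id,paths,value=TRUE),function(p){
      if(!grepl("_out|_miss",p)){ read.csv(p)$IBI }})) # skipping excluded and missing segments
    round(max(table(ibi))/length(ibi),3) }) }

# e.g., comparing a flagged participant (HRV032) with an unflagged one (HRV001)
# fixedness(data.path="HRV/FinalProcessing",IDs=c("HRV032","HRV001"))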

FLAGGED PARTICIPANTS

Here, we flag the 10 participants showing a ‘noisy’ signal with noisyPI = 1. We can note that the noisyPI variable looks quite independent of the previously created art10 and inclRate60 variables.

noisy <- c("HRV018","HRV020","HRV021","HRV032","HRV033","HRV050","HRV054","HRV063","HRV070","HRV076")
HRV$noisyPI <- 0
HRV[HRV$ID %in% noisy,"noisyPI"] <- 1
summary(as.factor(HRV[!duplicated(HRV$ID),"noisyPI"]))
##  0  1 
## 87 10
# crossing noisyPI cases with art10
table(HRV[,c("noisyPI","art10")])
##        art10
## noisyPI    0    1
##       0 1021  199
##       1   94   35
# crossing noisyPI participants with inclRate60
table(HRV[!duplicated(HRV$ID),c("noisyPI","inclRate60")])
##        inclRate60
## noisyPI  0  1
##       0 75 12
##       1  8  2


PLOT OF ‘NOISY’ SIGNAL

Here, we visualize the pulse intervals of all segments recorded by participants flagged with noisyPI = 1.

plotPI(data.path="HRV/FinalProcessing",IDs=noisy)

PLOT OF ‘CLEAN’ SIGNAL

Here, we visualize the pulse intervals of all segments recorded by a random subsample of 5 participants not flagged with noisyPI = 1.

plotPI(data.path="HRV/FinalProcessing",IDs=sample(levels(HRV$ID)[!(levels(HRV$ID) %in% noisy)],size=10))


4. GNG data

Here, the raw behavioral data from the Go/No-Go task, exported from E-Prime (version 2.0.10) (Psychology Software Tools, Inc., Sharpsburg, PA), are read and saved in the GNG dataset, recoded, and filtered based on the inclusion criteria.

4.1. Data reading

First, we read the CSV data files exported from the E-Prime software. Specifically, we used E-Merge to join the data from all the 105 participants invited to participate in the daily protocol into a single data file Merged_105_final.csv. Here, we read that long-form trial-by-trial dataset.

# reading data
GNG <- read.csv2("GONOGO/Merged_105_final.csv",header=TRUE)


4.2. Data recoding

Then, we recode the GNG variables to be used for the analyses.

The following functions are used to optimize the GNG data recoding:

GNGrecoding

GNGrecoding <- function(data){
  
  # VARIABLE RECODING
  # .............................................................
  
  # keeping only columns of interest
  data <- data[c("ID","Session","Procedure.Block.","SessionDate","SessionTime","TrialType","Image","Jitter", # session info
                 paste("Stimulus",rep(1:6,each=2),".",rep(c("ACC","RT"),6),sep=""))] # accuracy and reaction times
  
  # recoding block information
  colnames(data)[which(names(data)=="Procedure.Block.")] <-"block"
  data <- data[data$block!="TRAINING",] # excluding training data
  data$block <- as.integer(substr(data$block,6,6)) # block as integer from 1 to 6
  
  # merging performance data into single variables (RT and ACC)
  data[is.na(data)] <- 0 # missing data (NA) as 0 to be summed
  data$ACC <- rowSums(data[,which(substr(colnames(data),11,13)=="ACC")]) # accuracy
  data$RT <- rowSums(data[,which(substr(colnames(data),11,12)=="RT")]) # reaction time
  data[,paste("Stimulus",rep(1:6,each=2),".",rep(c("ACC","RT"),6),sep="")] <- NULL # removing stimulus-related columns
  
  # letter stimulus
  data$Image <- as.factor(gsub(".png","",data$Image))
  colnames(data)[which(names(data)=="Image")] <-"letter"
  
  # TrialType as factor
  data$TrialType <- as.factor(data$TrialType)
  
  # FIXING PROBLEMS WITH SINGLE PARTICIPANTS
  # .............................................................
  
  # # checks cases with No. of trials per block != 108
  # for(ID in levels(as.factor(paste(data$ID,data$block)))){
  #     if(nrow(data[paste(data$ID,data$block)==ID,])!=108){
  #       print(paste(ID,nrow(data[paste(data$ID,data$block)==ID,]))) }}
  
  # blocks 5 and 6 of HRV001 were saved as HRV003
  data[data$ID=="HRV003" & data$block==5,][109:216,"ID"] <- "HRV001"
  data[data$ID=="HRV003" & data$block==6,][109:216,"ID"] <- "HRV001"
  
  # HRV031 performed block 6 before block 5, then again block 6 (invalid), then block 5 twice (second invalid)
  data <- rbind(data[!(data$ID=="HRV031" & data$block==5),],
                  data[data$ID=="HRV031" & data$block==5,][1:108,]) # taking only 1st
  data <- rbind(data[!(data$ID=="HRV031" & data$block==6),],
                  data[data$ID=="HRV031" & data$block==6,][1:108,]) # taking only 1st
  data[data$ID=="HRV031" & data$block==5, "block"] <- 9 # inverting 5 and 6
  data[data$ID=="HRV031" & data$block==6, "block"] <- 5
  data[data$ID=="HRV031" & data$block==9, "block"] <- 6
  
  # HRV042 made block 6 twice (first time invalid)
  data <- rbind(data[!(data$ID=="HRV042" & data$block==6),],
                  data[data$ID=="HRV042" & data$block==6,][109:216,]) # taking only 2nd
  
  # HRV060 made block 4 twice (first time invalid)
  data <- rbind(data[!(data$ID=="HRV060" & data$block==4),],
                  data[data$ID=="HRV060" & data$block==4,][109:216,]) # taking only 2nd
  
  # Block 1 is invalid for HRV066 (technical problem) --> marking everything as NA
  data[data$ID=="HRV066" & data$block==1,c("ACC","RT")] <- NA
  
  # HRV086 made block 5 twice (first time invalid)
  data <- rbind(data[!(data$ID=="HRV086" & data$block==5),],
                data[data$ID=="HRV086" & data$block==5,][109:216,]) # taking only 2nd
  
  # FINAL RECODING AND SAVING
  # .............................................................
  
  # renaming Session as "day" and recoding block as either 1 (i.e., 1, 3, and 5) or 2 (i.e., 2, 4, and 6)
  colnames(data)[which(names(data)=="Session")] <-"day"
  data[data$block==3|data$block==5,"block"] <- 1
  data[data$block==4|data$block==6,"block"] <- 2
  
  return(data[order(data$ID,data$SessionDate),])
}

Recodes the behavioral data exported from E-Prime, renames and sorts the columns, and fixes problems with specific participants.

trials2blocks

trials2blocks <- function(data){
  
  # aggregating values by block (mean)
  block <- cbind(aggregate(ACC ~ ID*day*block, # mean accuracy to Go trials
                           data=data[data$TrialType=="Go",],FUN="mean",na.action=na.pass),
                 aggregate(ACC ~ ID*day*block, # mean accuracy to No-Go trials
                           data=data[data$TrialType=="No-go",],FUN="mean",na.action=na.pass)[,4],
                 aggregate(RT ~ ID*day*block, # mean RT to accurate Go trials
                           data=data[data$TrialType=="Go" & (data$ACC==1 | is.na(data$ACC)),],FUN="mean",na.action=na.pass)[,4],
                 aggregate(RT ~ ID*day*block, # sd RT to accurate Go trials
                           data=data[data$TrialType=="Go" & (data$ACC==1 | is.na(data$ACC)),],FUN="sd",na.action=na.pass)[,4])
  
  # sorting dataset, renaming columns and rows
  block <- block[order(block$ID,block$day,block$block),]
  colnames(block)[4:ncol(block)] <- c("Go.ACC","NoGo.ACC","RT","RT.sd")
  rownames(block) <- 1:nrow(block)
  return(block) }

Aggregates the long-form trial-level measures into a block-by-block dataset.


4.2.1. ID recoding

First, we recode participants’ identification codes ID (currently corresponding to their e-mail addresses) using the “HRVXXX” format (e.g., from ‘’ to ‘HRV001’). This is done with the participantID_recoding_GONOGO() function. Again, since the participants’ e-mail addresses are confidential, neither the function nor the original dataset is shown.

# loading and using the function to recode the ID variable
source("participantID_recoding_GONOGO.R")
(GNG <- participantID_recoding_GONOGO(GNG))[1:3,] # recoding GNG IDs and showing first three rows


4.2.2. Time recoding

Second, we recode the SessionDate and SessionTime variables so that they indicate the date and time of each session and the time within day, respectively. We can note that all sessions were recorded within the data collection period (i.e., between November 2018 and November 2019). In terms of time within day, most sessions were recorded between 16:15 and 18:15, with only three sessions recorded before 15:30.

# SessionDate as date and time
GNG$sd <- as.POSIXct(NA)
GNG[substr(GNG$SessionDate,3,3)=="/","sd"] <- # SessionDate format as d/m/y
  as.POSIXct(paste(GNG[substr(GNG$SessionDate,3,3)=="/","SessionDate"],
                   substr(GNG[substr(GNG$SessionDate,3,3)=="/","SessionTime"],12,19)),format="%m/%d/%Y %H:%M:%S",tz="GMT")
GNG[substr(GNG$SessionDate,3,3)=="-","sd"] <- # SessionDate format as m-d-y
  as.POSIXct(paste(GNG[substr(GNG$SessionDate,3,3)=="-","SessionDate"],
                   substr(GNG[substr(GNG$SessionDate,3,3)=="-","SessionTime"],12,19)),format="%m-%d-%Y %H:%M:%S",tz="GMT")
GNG$SessionDate <- GNG$sd

# plotting SessionDate and SessionTime
par(mfrow=c(1,2))
hist(GNG[!duplicated(paste(GNG$ID,GNG$Session)),"SessionDate"],breaks="months",freq=TRUE,main="SessionDate")
hist(GNG[!duplicated(paste(GNG$ID,GNG$Session)),"SessionTime"],breaks="min",freq=TRUE,main="SessionTime")


4.2.3. Other variables

Third, we use the GNGrecoding() function shown above to recode the other variables in the GNG dataset, to drop the columns not used here, and to correct some issues with specific participants (see function details at the beginning of section 4.2).

(GNG <- GNGrecoding(GNG))[1:3,] # recoding GNG and showing first three rows


4.2.4. Block-level dataset

Fourth, we create the block-level BLOCK dataset by using the trials2blocks function to average accuracy ACC and reaction times RT across blocks. Indeed, each participant performed the task once per day, for three consecutive days, and each daily session was divided into two 108-trial blocks with identical stimuli but a different No-Go stimulus. Thus, whereas the GNG data has one row per trial (i.e., 108 * 6 rows per participant), the BLOCK dataset will have one row per block (i.e., 6 rows per participant). Note that mean RT is computed by only considering “Go” trials with ACC = 1.

(BLOCK <- trials2blocks(GNG))[1:3,] # recoding and showing the first three rows
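
As a quick structural check (a sketch), each row of the BLOCK dataset should aggregate exactly 108 GNG trials at this stage:

cat("Sanity check:",nrow(GNG)==108*nrow(BLOCK))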


4.2.5. ICV

Finally, we use the variables computed in the BLOCK dataset to compute the intra-individual coefficient of variation ICV as the ratio between the standard deviation RT.sd and the mean reaction time RT in each block. Again, this is done by only considering “Go” trials with ACC = 1.

BLOCK$ICV <- BLOCK$RT.sd/BLOCK$RT # ICV = RT.sd / mean RT


4.3. Data cleaning & filtering

Here, we clean and filter the data based on the inclusion criteria.

First, we inspect the original number of respondents and observations in the GNG dataset. We can note that the Go/NoGo task was performed by the whole sample of 105 participants invited to participate in the daily protocol. We can also note that the number of trials and blocks does not match the protocol for two participants (see section 4.3.3).

cat("Original No. of trial-by-trial data =",nrow(GNG),"in",nrow(BLOCK),"blocks from",nlevels(GNG$ID),"participants")
## Original No. of trial-by-trial data = 67392 in 624 blocks from 105 participants
# participants with less than 648 trials
for(ID in levels(GNG$ID)){ if(nrow(GNG[GNG$ID==ID,])!=648){ print(paste(ID,nrow(GNG[GNG$ID==ID,]),"trials"))} }
## [1] "HRV009 216 trials"
## [1] "HRV059 432 trials"
# participants with less than 6 blocks
for(ID in levels(BLOCK$ID)){ if(nrow(BLOCK[BLOCK$ID==ID,])!=6){ print(paste(ID,nrow(BLOCK[BLOCK$ID==ID,]),"blocks"))} }
## [1] "HRV009 2 blocks"
## [1] "HRV059 4 blocks"


4.3.1. Double responses

First, we inspect the data for cases of double trials and blocks, that is, blocks with the same ID, day, and block values or with more than 108 trials. We can see that no double blocks or double trials are included in the GNG and BLOCK datasets.

# double blocks
cat(nrow(BLOCK[duplicated(paste(BLOCK$ID,BLOCK$day,BLOCK$block)),]),"double blocks")
## 0 double blocks
# double trials
BLOCK$nTrials <- as.integer(table(as.factor(paste(GNG$ID,GNG$day,GNG$block))))
cat(nrow(BLOCK[BLOCK$nTrials > 108,]),"blocks with double trials")
## 0 blocks with double trials


4.3.2. Inclusion criteria

Second, we apply our inclusion criteria to the task performance; that is, we consider reaction times to “Go” trials < 150 ms as anticipations, and we recode them by setting ACC = 0.

# detecting and filtering anticipations
cat("Recoding",nrow(GNG[GNG$TrialType=="Go" & !is.na(GNG$ACC) & GNG$RT > 0 & GNG$RT < 150,]),"anticipations (",
    round(100*nrow(GNG[GNG$TrialType=="Go" & !is.na(GNG$ACC) & GNG$RT > 0 & GNG$RT < 150,])/
            nrow(GNG[GNG$TrialType=="Go" & !is.na(GNG$ACC) & GNG$ACC==1,]),2),"% ) by setting ACC = 0")
## Recoding 470 anticipations ( 0.96 % ) by setting ACC = 0
GNG[GNG$TrialType=="Go" & !is.na(GNG$ACC) & GNG$RT > 0 & GNG$RT < 150,"ACC"] <- 0

# updating BLOCK dataset
BLOCK <- trials2blocks(GNG)
BLOCK$ICV <- BLOCK$RT.sd/BLOCK$RT # ICV = RT.sd / mean RT


4.3.3. Further excluded

Third, we mark the 8 participants who were excluded for the reasons specified in section 1.3.4. As in that section, such participants are marked as ‘OUT’. Thus, from the initial 105 participants, the number of excluded participants is 8 (7.62%), whereas the total number of included participants is 97 (92.38%). From the BLOCK data, we can note that excluded participants performed slightly differently from included participants (i.e., showing lower ACC to NoGo trials and lower ICV, but also lower RT). Then, we remove excluded participants from the GNG and BLOCK datasets.

# marking 8 excluded participants as "OUT"
memory <- GNG
GNG$ID <- as.character(GNG$ID)
excl <- c("HRV009","HRV011","HRV015","HRV027","HRV029","HRV046","HRV059","HRV066")
GNG[GNG$ID%in%excl,"ID"] <- gsub("HRV","OUT",GNG[GNG$ID%in%excl,"ID"]) # marking excluded participants as "OUTXXX"

# plotting
GNG$excl <- "IN" # creating excl variable (factor)
GNG[substr(GNG$ID,1,3)=="OUT","excl"] <- "OUT"
GNG$excl <- as.factor(GNG$excl)
grid.arrange(ggplot(GNG[GNG$TrialType=="Go" & GNG$ACC==1 & !is.na(GNG$RT),],aes(x=excl,y=RT,fill=excl)) + geom_violin(),
             ggplot(GNG[GNG$TrialType=="Go",],aes(x=excl,y=ACC,fill=excl)) + geom_violin(),
             ggplot(GNG[GNG$TrialType=="No-go",],aes(x=excl,y=ACC,fill=excl)) + geom_violin(),nrow=1) 

# same thing with the BLOCK dataset
memory.BLOCK <- BLOCK
BLOCK$ID <- as.character(BLOCK$ID)
BLOCK[BLOCK$ID%in%excl,"ID"] <- gsub("HRV","OUT",BLOCK[BLOCK$ID%in%excl,"ID"]) # marking excluded participants as "OUTXXX"
BLOCK$excl <- "IN" # creating excl variable (factor)
BLOCK[substr(BLOCK$ID,1,3)=="OUT","excl"] <- "OUT"
BLOCK$excl <- as.factor(BLOCK$excl)
grid.arrange(ggplot(BLOCK,aes(x=excl,y=RT,fill=excl)) + geom_violin(),
             ggplot(BLOCK,aes(x=excl,y=RT.sd,fill=excl)) + geom_violin(),
             ggplot(BLOCK,aes(x=excl,y=ICV,fill=excl)) + geom_violin(),
             ggplot(BLOCK,aes(x=excl,y=Go.ACC,fill=excl)) + geom_violin(),
             ggplot(BLOCK,aes(x=excl,y=NoGo.ACC,fill=excl)) + geom_violin(),nrow=1) 

# updating number of included participants
cat(nrow(GNG[substr(GNG$ID,1,3)=="HRV",]),"included observations in",nrow(BLOCK[substr(BLOCK$ID,1,3)=="HRV",]),"blocks from",
    nlevels(as.factor(as.character(GNG[substr(GNG$ID,1,3)=="HRV","ID"]))),"included participants \nout of",
    nrow(memory),"total observations in",nrow(memory.BLOCK),"blocks from",nrow(memory[!duplicated(memory$ID),]),
    "invited participants")
## 62856 included observations in 582 blocks from 97 included participants 
## out of 67392 total observations in 624 blocks from 105 invited participants
# removing excluded participants
GNG <- GNG[substr(GNG$ID,1,3)=="HRV",]
GNG$ID <- as.factor(GNG$ID) # ID as factor
GNG <- GNG[order(GNG$ID,GNG$SessionDate),] # sorting by ID and SessionDate
BLOCK <- BLOCK[substr(BLOCK$ID,1,3)=="HRV",]
BLOCK$ID <- as.factor(BLOCK$ID)
BLOCK <- BLOCK[order(BLOCK$ID,BLOCK$day,BLOCK$block),]
GNG$excl <- BLOCK$excl <- NULL # removing excl variable


4.3.4. Flagged cases

No further criteria are considered to flag potentially confounding cases in relation to the Go/NoGo data. No missing responses are shown by included participants, and no particular cases are highlighted.
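
As an explicit check of the absence of missing responses (a sketch complementing the summaries below):

cat("Sanity check:",sum(is.na(GNG$ACC)) + sum(is.na(GNG$RT)) == 0)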

summary(GNG) # summary of GNG (no missing data)
##        ID             day        block      SessionDate                    
##  HRV001 :  648   Min.   :1   Min.   :1.0   Min.   :2018-11-06 16:17:55.00  
##  HRV002 :  648   1st Qu.:1   1st Qu.:1.0   1st Qu.:2018-12-19 17:22:41.00  
##  HRV003 :  648   Median :2   Median :1.5   Median :2019-03-19 16:29:10.00  
##  HRV004 :  648   Mean   :2   Mean   :1.5   Mean   :2019-03-27 15:46:28.82  
##  HRV005 :  648   3rd Qu.:3   3rd Qu.:2.0   3rd Qu.:2019-05-22 17:09:35.00  
##  HRV006 :  648   Max.   :3   Max.   :2.0   Max.   :2019-11-21 17:18:17.00  
##  (Other):58968                                                             
##   SessionTime                       TrialType         letter     
##  Min.   :1970-01-01 14:32:43.0000   Go   :47142   E      : 5238  
##  1st Qu.:1970-01-01 16:31:50.0000   No-go:15714   F      : 5238  
##  Median :1970-01-01 16:53:50.5000                 G      : 5238  
##  Mean   :1970-01-01 16:55:45.5343                 L      : 5238  
##  3rd Qu.:1970-01-01 17:18:24.0000                 M      : 5238  
##  Max.   :1970-01-01 18:24:44.0000                 N      : 5238  
##                                                   (Other):31428  
##      Jitter           ACC               RT       
##  Min.   :  0.0   Min.   :0.0000   Min.   :  0.0  
##  1st Qu.: 50.0   1st Qu.:1.0000   1st Qu.:220.0  
##  Median :100.0   Median :1.0000   Median :295.0  
##  Mean   :100.2   Mean   :0.9089   Mean   :249.2  
##  3rd Qu.:150.0   3rd Qu.:1.0000   3rd Qu.:341.0  
##  Max.   :200.0   Max.   :1.0000   Max.   :499.0  
## 
summary(BLOCK) # summary of BLOCK (no missing data)
##        ID           day        block         Go.ACC          NoGo.ACC     
##  HRV001 :  6   Min.   :1   Min.   :1.0   Min.   :0.7778   Min.   :0.2963  
##  HRV002 :  6   1st Qu.:1   1st Qu.:1.0   1st Qu.:0.9506   1st Qu.:0.6667  
##  HRV003 :  6   Median :2   Median :1.5   Median :0.9753   Median :0.7778  
##  HRV004 :  6   Mean   :2   Mean   :1.5   Mean   :0.9637   Mean   :0.7445  
##  HRV005 :  6   3rd Qu.:3   3rd Qu.:2.0   3rd Qu.:1.0000   3rd Qu.:0.8519  
##  HRV006 :  6   Max.   :3   Max.   :2.0   Max.   :1.0000   Max.   :1.0000  
##  (Other):546                                                              
##        RT            RT.sd            ICV         
##  Min.   :234.6   Min.   :25.96   Min.   :0.08186  
##  1st Qu.:299.3   1st Qu.:47.48   1st Qu.:0.14692  
##  Median :318.7   Median :53.21   Median :0.17017  
##  Mean   :318.1   Mean   :53.56   Mean   :0.16961  
##  3rd Qu.:338.3   3rd Qu.:59.57   3rd Qu.:0.19136  
##  Max.   :390.1   Max.   :87.04   Max.   :0.26160  
## 


5. Data merging

Now that we have read, recoded, and filtered the four datasets collected in the study (all including the data from the 97 included participants), we can merge them into the final datasets to be used for the analyses. Specifically, we create four final datasets:

  1. ema includes data from the ESM, HRV, and PrelQS dataset, with one row per measurement of either HRV (5 ambulatory + 1 lab) or ESM ratings (1 ‘baseline’ + 5 ambulatory), whereas PrelQS information is identically repeated in each row of a given participant

  2. between includes all the collected variables, with one row per participant and by including the participant averages (e.g., mean, median, SD) of each variable

  3. gng includes all the collected variables, with one row per trial of the Go/NoGo task, only including the values corresponding to the last daily measurement from the ESM and the HRV datasets

  4. block includes all the collected variables, with one row per block of the Go/NoGo task, only including the values corresponding to the last daily measurement from the ESM and the HRV datasets

The following packages are used to optimize data merging.

library(plyr)


5.1. HRV & ESM

First, we merge the HRV and the ESM by the ID, day, and within.day variables. Note that the within.day variable ranges from 0 to 5 for ESM data (0 = baseline, 1-5 = ordinary questionnaires) and from 1 to 6 for HRV data (1-5 = ambulatory recordings, 6 = laboratory recordings), with only those measurements associated with within.day values from 1 to 5 being matched between the two datasets.
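
To make the matching explicit, the following toy example (with hypothetical values) illustrates the full-join semantics: the row with within.day = 0 gets NA HRV values, and the row with within.day = 6 gets NA ESM values.

# toy full join between a fake HRV day and a fake ESM day (hypothetical values)
toy.hrv <- data.frame(ID="HRV999",day=1,within.day=1:6,RMSSD=c(25,28,30,27,26,32))
toy.esm <- data.frame(ID="HRV999",day=1,within.day=0:5,v1=c(5,4,4,3,4,5))
join(toy.hrv,toy.esm,by=c("ID","day","within.day"),type="full") # 7 rows: v1 = NA at 6, RMSSD = NA at 0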

# merging HRV and ESM
ESM[,c("id","idday")] <- HRV$artefacts <- NULL # removing unnecessary variables
HRVESM <- join(HRV,ESM,by=c("ID","day","within.day"),type="full") # merging datasets
HRVESM <- HRVESM[order(HRVESM$ID,HRVESM$day,HRVESM$within.day),] # sorting
row.names(HRVESM) <- 1:nrow(HRVESM) # row names

# sanity check: no HRV when within.day=0, no ESM when within.day=6
cat("Sanity check:",nrow(HRVESM[(HRVESM$within.day==0 & !is.na(HRVESM$RMSSD)) | 
                                  (HRVESM$within.day==6 & !is.na(HRVESM$v1)),]) == 0)
## Sanity check: TRUE
# sanity check: 97 included participants x 3 days x 7 occasions
cat("Sanity check:",nrow(HRVESM)==97*3*7)
## Sanity check: TRUE
# showing first three lines
HRVESM[1:3,]


Here, we inspect the number of cases from included participants with within.day ranging from 1 to 5 but with missing HRV (N = 7) or ESM values (N = 36).

# missing HRV, nonmissing ESM
nrow(HRVESM[HRVESM$within.day %in% 1:5 & HRVESM$miss==1 & !is.na(HRVESM$v1),]) # 7
## [1] 7
# missing ESM, nonmissing HRV
nrow(HRVESM[HRVESM$within.day %in% 1:5 & HRVESM$miss!=1 & is.na(HRVESM$v1),]) # 36
## [1] 36
# both missing
nrow(HRVESM[HRVESM$within.day %in% 1:5 & HRVESM$miss==1 & is.na(HRVESM$v1),]) # 110
## [1] 110


5.2. HRV, ESM, & PrelQS

Second, we merge the HRVESM dataset created above with the PrelQS dataset.

5.2.1. ema

The first merged dataset ema has a long form with one row per measurement, so that PrelQS information is identically repeated over all rows associated with a given participant.

# merging HRV and ESM
colnames(HRVESM)[which(colnames(HRVESM)%in%c("gender","drugs"))] <- c("gender.ESM","drugs.ESM") # renaming matching variables
ema <- join(HRVESM,PrelQS,by="ID",type="left")
ema <- ema[order(ema$ID,ema$day,ema$within.day),] # sorting rows
row.names(ema) <- 1:nrow(ema) # row names

# sanity check: same nrow and No. of participants as HRVESM
cat("Sanity check:",nrow(HRVESM) == nrow(ema))
## Sanity check: TRUE
cat("Sanity check:",nlevels(HRVESM$ID) == nlevels(ema$ID))
## Sanity check: TRUE
# sorting and showing first three lines
(ema <- ema[order(ema$ID,ema$day,ema$within.day),])[1:3,]


5.2.2. between

The second merged dataset between has a wide form with one row per participant, so that the aggregated values of the HRVESM variables that are meaningful at the between-participant level are joined to the PrelQS data. Specifically, we compute the mean of each continuous variable, renaming it as “varName.cm”, whereas we compute both the mean and the median for RMSSD and RMSSDcv due to their skewed distributions. Raw ESM ratings are not averaged since we first plan to compute the aggregated scores based on the assessment of their psychometric properties (see the related report). As a sanity check, we can see that only 1 participant chose a Sensus protocol gender different from that reported in the preliminary questionnaire (likely due to a mistake in selecting the Sensus protocol).

# computing mean and/or median for quantitative variables
between <- PrelQS
Vars <- c("PI","HR","RMSSD","RMSSDcv","nArtefacts")
for(Var in Vars){ colnames(ema)[which(colnames(ema)==Var)] <- "Var"
  between <- join(between,aggregate(Var ~ ID,data=ema,FUN="mean"),by="ID",type="left") # mean of all vars
  colnames(between)[ncol(between)] <- paste(Var,"cm",sep=".")
  if(Var %in% c("RMSSD","RMSSDcv")){
    between <- join(between,aggregate(Var ~ ID,data=ema,FUN="median",na.action=na.omit),by="ID",type="left") # median for skewed
    colnames(between)[ncol(between)] <- paste(Var,"cMed",sep=".") } 
  colnames(ema)[which(colnames(ema)=="Var")] <- Var }

# joining categorical variables (i.e., flag criteria)
between <- join(between,HRV[!duplicated(HRV$ID),c("ID","inclRate60","noisyPI")],by="ID",type="left") # HRV
between <- join(between,ESM[!duplicated(ESM$ID),c("ID","cLessResp1","cLessResp2")],by="ID",type="left") # ESM
between$smoker <- 0
for(ID in levels(between$ID)){
  if(nrow(ESM[ESM$ID==ID,])>0){ 
    between[between$ID==ID,"os"] <- ESM[ESM$ID==ID & !is.na(ESM$os),"os"][1] # OS
    between[between$ID==ID,"gender.ESM"] <- ESM[ESM$ID==ID & !is.na(ESM$gender),"gender"][1] # gender
    if(any(ESM[ESM$ID==ID & !is.na(ESM$smoke),"smoke"]=="1")){ between[between$ID==ID,"smoker"] <- 1 } # smokers
    }}
  
# sanity check:
cat("No. cases with PrelQS gender != ESM gender =",nrow(between[between$gender!=between$gender.ESM,]))
## No. cases with PrelQS gender != ESM gender = 1
between$gender.ESM <- ema$gender.ESM <-  NULL # removing gender.ESM (trusting PrelQs)

# sorting and showing first three lines
(between <- between[order(between$ID),])[1:3,]
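
As an aside, the temporary renaming used above (each variable is renamed to Var so that it fits the formula interface of aggregate) can be avoided with the data-frame interface of aggregate; a compact sketch of the equivalent computation of the participant means:

# sketch: participant means of all quantitative variables in a single call
cms <- aggregate(ema[,Vars],by=list(ID=ema$ID),FUN=mean,na.rm=TRUE)
colnames(cms)[-1] <- paste(Vars,"cm",sep=".")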


5.3. GNG & ema

Third, we integrate the ema information (including HRV, ESM, and PrelQS variables) into the GNG dataset.

5.3.1. gng

The first merged dataset gng has a long form with one row for each trial of the Go/NoGo task, so that between-participant information is identically repeated over all rows associated with the same participant. Similarly, HRV and ESM values are identically repeated over all gng rows associated with the same day, but here we only consider the values of the last daily recording (i.e., within.day = 6 for HRV and within.day = 5 for ESM). Again, raw ESM item scores are not included now but will be after the evaluation of their psychometric properties (see the related report).

# merging GNG and HRV (last measurement)
gng <- join(GNG,HRV[HRV$within.day==6,],by=c("ID","day"),type="left")
colnames(gng)[which(colnames(gng)=="within.day")] <- "time.HRV"

# merging gng and ESM (last measurement)
gng <- join(gng,ESM[ESM$within.day==5,],by=c("ID","day"),type="left")
colnames(gng)[which(colnames(gng)=="within.day")] <- "time.ESM"

# merging gng and PrelQS
gng <- join(gng,PrelQS,by="ID",type="left")

# sanity check: time.HRV can only be = 6 and time.ESM can only be = 5
summary(as.factor(gng$time.HRV))
##     6 
## 62856
summary(as.factor(gng$time.ESM))
##     5  NA's 
## 57888  4968
gng$time.ESM <- gng$time.HRV <- NULL
cat("Sanity check:",nrow(gng)==nrow(GNG)) # same nrow than GNG
## Sanity check: TRUE
# sorting and showing first three lines
(gng <- gng[order(gng$ID,gng$day,gng$block),])[1:3,]


5.3.2. between

The second merged dataset is just the same wide-form between dataset created above, integrated with the aggregated GNG data.

# computing aggregated values
between <- join(between,aggregate(RT ~ ID,data=GNG[GNG$TrialType=="Go" & GNG$ACC==1,],FUN="mean"), # mean RT to Go trials
                by="ID",type="left")
between <- join(between,aggregate(ACC ~ ID,data=GNG[GNG$TrialType=="Go",],FUN="mean"), # mean ACC to Go trials
                by="ID",type="left")
between <- join(between,aggregate(ACC ~ ID,data=GNG[GNG$TrialType=="No-go",],FUN="mean"), # mean ACC to No-go trials
                by="ID",type="left")
colnames(between)[(ncol(between)-2):ncol(between)] <- paste(c("RT","Go.ACC","NoGo.ACC"),"cm",sep=".") # renaming columns

# sorting and showing first three lines
(between <- between[order(between$ID),])[1:3,]


5.4. BLOCK & ema

Finally, we integrate the ema information (including HRV, ESM, and PrelQS variables) into the BLOCK dataset.

5.4.1. block

The first merged dataset block has a long form with one row for each block of the Go/NoGo task, so that between-participant information is identically repeated over all rows associated with the same participant. Similarly, HRV and ESM values are identically repeated over all block rows associated with the same day, but here we only consider the values of the last daily recording (i.e., within.day = 6 for HRV and within.day = 5 for ESM). Again, raw ESM item scores are not included now but will be after the evaluation of their psychometric properties (see the related report).

# merging BLOCK and HRV (last measurement)
block <- join(BLOCK,HRV[HRV$within.day==6,],by=c("ID","day"),type="left")
colnames(block)[which(colnames(block)=="within.day")] <- "time.HRV"

# merging block and ESM (last measurement)
block <- join(block,ESM[ESM$within.day==5,],by=c("ID","day"),type="left")
colnames(block)[which(colnames(block)=="within.day")] <- "time.ESM"

# merging block and PrelQS
block <- join(block,PrelQS,by="ID",type="left")

# sanity check: time.HRV can only be = 6 and time.ESM can only be = 5
summary(as.factor(block$time.HRV))
##   6 
## 582
summary(as.factor(block$time.ESM))
##    5 NA's 
##  536   46
cat("Sanity check:",nrow(block)==nrow(BLOCK)) # same nrow than GNG
## Sanity check: TRUE
# sorting and showing first three lines
(block <- block[order(block$ID,block$day,block$block),])[1:3,]


5.4.2. between

Finally, we aggregate the RT.sd and ICV data from the BLOCK dataset by computing their mean by participant, and we integrate this information into the between dataset.

# computing aggregated values
between <- join(between,aggregate(RT.sd ~ ID,data=BLOCK,FUN="mean"), # mean RT SD to Go trials
                by="ID",type="left")
between <- join(between,aggregate(ICV ~ ID,data=BLOCK,FUN="mean"), # mean ICV to Go trials
                by="ID",type="left")
colnames(between)[(ncol(between)-1):ncol(between)] <- paste(c("RT.sd","ICV"),"cm",sep=".") # renaming columns

# sorting and showing first three lines
(between <- between[order(between$ID),])[1:3,]



6. Data structure & dictionary

Here, we show the structure of each of the final pre-processed datasets and we provide a data dictionary for each of them.

6.1. ema

The ema dataset includes HRV, ESM, and PrelQS information with one row for each measurement of either HRV (5 ambulatory + 1 lab) or ESM (1 ‘baseline’ + 5 ambulatory).

6.1.1. Data structure

str(ema)
## 'data.frame':    2037 obs. of  49 variables:
##  $ ID          : Factor w/ 97 levels "HRV001","HRV002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day         : num  1 1 1 1 1 1 1 2 2 2 ...
##  $ within.day  : num  0 1 2 3 4 5 6 0 1 2 ...
##  $ PI          : num  NA 649 655 750 657 ...
##  $ HR          : num  NA 92.9 91.9 80.2 91.5 ...
##  $ RMSSD       : num  NA 25.8 23.3 28.8 22.7 ...
##  $ RMSSDcv     : num  NA 3.97 3.55 3.84 3.46 ...
##  $ nPI         : int  NA 183 175 160 180 163 149 NA 197 163 ...
##  $ miss        : num  NA 0 0 0 0 0 0 NA 0 0 ...
##  $ out         : num  NA 0 0 0 0 0 0 NA 0 0 ...
##  $ nArtefacts  : int  NA 8 5 0 7 6 0 NA 0 0 ...
##  $ pArtefacts  : num  NA 4.37 2.86 0 3.89 ...
##  $ art10       : num  NA 0 0 0 0 0 0 NA 0 0 ...
##  $ inclRate60  : num  NA 0 0 0 0 0 0 NA 0 0 ...
##  $ noisyPI     : num  NA 0 0 0 0 0 0 NA 0 0 ...
##  $ os          : Factor w/ 2 levels "Android","iOS": NA 1 1 1 1 1 NA NA 1 1 ...
##  $ RunTimestamp: POSIXct, format: "2018-11-06 10:12:16.000" "2018-11-06 11:38:33.269" ...
##  $ v1          : int  5 5 4 4 4 3 NA 4 4 4 ...
##  $ v2          : num  3 4 4 4 4 4 NA 3 5 3 ...
##  $ v3          : num  4 4 3 3 4 2 NA 5 4 4 ...
##  $ t1          : num  3 3 2 2 4 3 NA 5 2 3 ...
##  $ t2          : int  5 3 3 3 5 4 NA 5 3 4 ...
##  $ t3          : int  6 3 3 2 5 6 NA 3 5 5 ...
##  $ f1          : int  6 6 5 6 3 3 NA 5 4 3 ...
##  $ f2          : num  6 3 3 4 3 3 NA 4 3 3 ...
##  $ f3          : int  6 6 5 5 5 2 NA 6 4 4 ...
##  $ PE          : int  3 1 2 1 2 2 NA 2 1 2 ...
##  $ PE.int      : int  2 1 2 1 6 3 NA 2 1 5 ...
##  $ NE          : int  1 2 1 2 1 4 NA 2 1 1 ...
##  $ NE.int      : int  1 4 1 4 1 4 NA 4 1 1 ...
##  $ smoke       : Factor w/ 2 levels "0","1": NA 1 1 1 1 1 NA NA 1 1 ...
##  $ coffe       : Factor w/ 2 levels "0","1": NA 1 1 1 1 1 NA NA 1 1 ...
##  $ alcohol     : Factor w/ 2 levels "0","1": NA 1 1 1 1 1 NA NA 1 1 ...
##  $ activity    : Factor w/ 2 levels "0","1": NA 1 2 1 1 2 NA NA 1 2 ...
##  $ people      : Factor w/ 2 levels "0","1": NA 1 1 1 1 2 NA NA 1 1 ...
##  $ daybefore.ex: Factor w/ 2 levels "0","1": 1 NA NA NA NA NA NA 1 NA NA ...
##  $ drugs.ESM   : Factor w/ 2 levels "0","1": 1 NA NA NA NA NA NA 1 NA NA ...
##  $ drugs.which : chr  "" NA NA NA ...
##  $ cLess1      : num  0 0 1 0 0 1 NA 1 0 0 ...
##  $ cLess2      : num  0 0 1 0 0 0 NA 0 0 0 ...
##  $ cLessResp1  : num  0 0 0 0 0 0 NA 0 0 0 ...
##  $ cLessResp2  : num  0 0 0 0 0 0 NA 0 0 0 ...
##  $ time        : POSIXct, format: "2018-11-07 11:32:45" "2018-11-07 11:32:45" ...
##  $ age         : num  25 25 25 25 25 25 25 25 25 25 ...
##  $ gender      : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ BMI         : num  19.2 19.2 19.2 19.2 19.2 ...
##  $ drugs       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dysfun      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ postponed   : num  0 0 0 0 0 0 0 0 0 0 ...


6.1.2. Data dictionary

  1. Recording identification:
  • ID = participant identification code

  • day = day of assessment (1, 2 or 3)

  • within.day = scheduled recording number (0 = first ESM laboratory measurements; 1-5 = ambulatory measurements; 6 = last HRV laboratory measurement)


  2. HRV data
  • PI = mean pulse interval (ms), computed as the mean of the PIs in a given recording segment

  • HR = mean heart rate (bpm), computed as 60,000/PI

  • RMSSD = root mean square of successive differences in normal-to-normal intervals (ms)

  • RMSSDcv = RMSSD coefficient of variation, computed as 100 × RMSSD / PI following de Geus et al. (2019)

  • nPI = number of pulse intervals considered for each segment

  • miss = flagging missing segments (i.e., not recorded) (1 = missing, 0 = nonmissing)

  • out = flagging excluded segments due to artefacts in more than 25% of the pulse intervals (1 = excluded; 0 = included)

  • nArtefacts = number of artefacts

  • pArtefacts = percentage of artefacts computed as 100 * nArtefacts / nPI

  • art10 = flagging segments with pArtefacts > 10% (0 = no, 1 = yes)

  • inclRate60 = flagging participants with a rate of included segments < 60% of the nonmissing segments (0 = no, 1 = yes)

  • noisyPI = flagging participants with a relatively ‘noisy’ BVP or PI signal (0 = no, 1 = yes)


  3. ESM data
  • os = operating system (either Android or iOS)

  • RunTimestamp = ESM questionnaire opening timestamp (yyyy-mm-dd hh:mm:ss)

  • v1, … f3 = ratings on the Multidimensional Mood Questionnaire items (1-7)

  • PE, … NE.int = event ratings (1-7)

  • smoke = cigarette smoking over the 20 min before the recording (0 = no, 1 = yes)

  • coffe = coffee drinking over the 60 min before the recording (0 = no, 1 = yes)

  • alcohol = alcohol over the 60 min before the recording (0 = no, 1 = yes)

  • activity = activity intensity before the recording (0 = low/inactivity, 1 = moderate or intense)

  • people = interferences during the recording (0 = alone or with someone not disturbing, 1 = people disturbing or interrupting)

  • daybefore.ex = intense exercise during the day before (0 = no, 1 = yes)

  • drugs = drugs consumed on that day (0 = no, 1 = yes)

  • drugs.which = which drugs have been used on that day (open-ended responses)

  • cLess1 = flagging potentially careless responses based on 1+ criteria

  • cLess2 = flagging potentially careless responses based on 2+ criteria

  • cLessResp1 = flagging potentially careless respondents based on 1+ criteria

  • cLessResp2 = flagging potentially careless respondents based on 2+ criteria


  4. Preliminary questionnaire data
  • time = questionnaire opening timestamp (yyyy-mm-dd hh:mm:ss)

  • age = participants’ age (years)

  • gender = participants’ sex (“F” or “M”)

  • BMI = participants’ body mass index (kg/m2)

  • drugs = flagging participants reporting taking some medication (0 = no, 1 = yes)

  • dysfun = flagging participants reporting minor cardiovascular and/or other dysfunctions (0 = no, 1 = yes)

  • postponed = flagging participants who postponed one protocol day


6.2. between

The between dataset includes all collected variables, with one row for each participant. HRV, ESM, GNG, and BLOCK data are averaged by participant.

6.2.1. Data structure

str(between)
## 'data.frame':    97 obs. of  26 variables:
##  $ ID           : Factor w/ 97 levels "HRV001","HRV002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ time         : POSIXct, format: "2018-11-07 11:32:45" "2018-11-08 00:52:05" ...
##  $ age          : num  25 28 22 22 25 21 20 20 20 22 ...
##  $ gender       : Factor w/ 2 levels "F","M": 2 2 1 1 2 2 1 1 1 2 ...
##  $ BMI          : num  19.2 22.7 26.6 20.2 23.2 ...
##  $ drugs        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
##  $ dysfun       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ postponed    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PI.cm        : num  675 687 799 668 891 ...
##  $ HR.cm        : num  90.2 88.2 76.7 90.5 69 ...
##  $ RMSSD.cm     : num  30.6 32.7 51.7 33.2 70.9 ...
##  $ RMSSD.cMed   : num  29.1 33.6 54.1 31.4 68.6 ...
##  $ RMSSDcv.cm   : num  4.46 4.76 6.33 4.93 8.07 ...
##  $ RMSSDcv.cMed : num  4.36 4.66 6.44 4.54 7.23 ...
##  $ nArtefacts.cm: num  2.83 9.8 11.2 12.12 7.43 ...
##  $ inclRate60   : num  0 0 1 0 1 0 0 0 0 1 ...
##  $ noisyPI      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cLessResp1   : num  0 0 0 0 1 0 1 1 0 0 ...
##  $ cLessResp2   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ smoker       : num  0 1 0 1 1 0 0 0 1 0 ...
##  $ os           : Factor w/ 2 levels "Android","iOS": 1 1 1 1 1 1 1 1 2 1 ...
##  $ RT.cm        : num  314 319 314 364 331 ...
##  $ Go.ACC.cm    : num  0.99 0.965 0.957 0.889 0.994 ...
##  $ NoGo.ACC.cm  : num  0.809 0.71 0.599 0.87 0.833 ...
##  $ RT.sd.cm     : num  38 57.8 54.1 63 33.4 ...
##  $ ICV.cm       : num  0.121 0.182 0.173 0.173 0.101 ...


6.2.2. Data dictionary

  1. Recording identification:
  • ID = participant identification code


  2. Preliminary questionnaire data
  • time = questionnaire opening timestamp (yyyy-mm-dd hh:mm:ss)

  • age = participants’ age (years)

  • gender = participants’ sex (“F” or “M”)

  • BMI = participants’ body mass index (kg/m2)

  • drugs = flagging participants reporting taking some medication (0 = no, 1 = yes)

  • dysfun = flagging participants reporting minor cardiovascular and/or other dysfunctions (0 = no, 1 = yes)

  • postponed = flagging participants who postponed one protocol day


  3. HRV data
  • PI.cm = participant’s mean pulse interval (ms), computed as the mean of the segment-level PI values

  • HR.cm = participant’s mean heart rate (bpm), computed as 60,000/PI

  • RMSSD.cm = participant’s mean root mean square of successive differences in normal-to-normal intervals (ms)

  • RMSSD.cMed = participant’s median root mean square of successive differences in normal-to-normal intervals (ms)

  • RMSSDcv.cm = participant’s mean RMSSD coefficient of variation, computed as 100 × RMSSD / PI following de Geus et al. (2019)

  • RMSSDcv.cMed = participant’s median RMSSD coefficient of variation, computed as 100 × RMSSD / PI following de Geus et al. (2019)

  • nArtefacts.cm = participant’s mean number of artefacts

  • inclRate60 = flagging participants with a rate of included segments < 60% of the nonmissing segments (0 = no, 1 = yes)

  • noisyPI = flagging participants with a relatively ‘noisy’ BVP or PI signal (0 = no, 1 = yes)


  4. ESM data
  • cLessResp1 = flagging potentially careless respondents based on 1+ criteria (1 = Yes, 0 = No)

  • cLessResp2 = flagging potentially careless respondents based on 2+ criteria (1 = Yes, 0 = No)

  • smoker = flagging smokers (1 = Yes, 0 = No)

  • os = operating system (either Android or iOS)


  5. Go/No-go data
  • RT.cm = participant’s mean reaction time (ms) to accurate “Go” trials

  • Go.ACC.cm = participant’s mean accuracy (0-1) to “Go” trials

  • NoGo.ACC.cm = participant’s mean accuracy (0-1) to “No-go” trials

  • RT.sd.cm = participant’s mean standard deviation of reaction times (ms) to accurate “Go” trials

  • ICV.cm = participant’s mean intraindividual coefficient of variation in accurate “Go” trials


6.3. gng

The gng dataset includes all collected variables, with one row for each trial of the go/no-go task. HRV data is included only for the segments recorded in the lab (within.day = 6), whereas ESM information is included only as the last ESM data entry (within.day = 5).

6.3.1. Data structure

str(gng)
## 'data.frame':    62856 obs. of  57 variables:
##  $ ID          : Factor w/ 97 levels "HRV001","HRV002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ block       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SessionDate : POSIXct, format: "2018-11-06 16:17:55" "2018-11-06 16:17:55" ...
##  $ SessionTime : POSIXct, format: "1970-01-01 16:17:55" "1970-01-01 16:17:55" ...
##  $ TrialType   : Factor w/ 2 levels "Go","No-go": 1 2 1 1 1 2 1 1 1 1 ...
##  $ letter      : Factor w/ 12 levels "E","F","G","L",..: 4 10 4 7 1 10 1 4 4 1 ...
##  $ Jitter      : int  200 200 0 150 0 50 150 0 50 50 ...
##  $ ACC         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ RT          : num  343 0 328 360 339 0 306 332 404 355 ...
##  $ PI          : num  831 831 831 831 831 ...
##  $ HR          : num  72.5 72.5 72.5 72.5 72.5 ...
##  $ RMSSD       : num  49.3 49.3 49.3 49.3 49.3 ...
##  $ RMSSDcv     : num  5.93 5.93 5.93 5.93 5.93 ...
##  $ nPI         : int  149 149 149 149 149 149 149 149 149 149 ...
##  $ miss        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ out         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ nArtefacts  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pArtefacts  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ art10       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ inclRate60  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ noisyPI     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ os          : Factor w/ 2 levels "Android","iOS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender      : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ RunTimestamp: POSIXct, format: "2018-11-06 16:09:16.716" "2018-11-06 16:09:16.716" ...
##  $ v1          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ v2          : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ v3          : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ t1          : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ t2          : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ t3          : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ f1          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ f2          : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ f3          : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ PE          : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ PE.int      : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ NE          : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ NE.int      : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ smoke       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ coffe       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ alcohol     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ activity    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ people      : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ daybefore.ex: Factor w/ 2 levels "0","1": NA NA NA NA NA NA NA NA NA NA ...
##  $ drugs       : Factor w/ 2 levels "0","1": NA NA NA NA NA NA NA NA NA NA ...
##  $ drugs.which : chr  NA NA NA NA ...
##  $ cLess1      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cLess2      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cLessResp1  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cLessResp2  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ time        : POSIXct, format: "2018-11-07 11:32:45" "2018-11-07 11:32:45" ...
##  $ age         : num  25 25 25 25 25 25 25 25 25 25 ...
##  $ gender      : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ BMI         : num  19.2 19.2 19.2 19.2 19.2 ...
##  $ drugs       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dysfun      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ postponed   : num  0 0 0 0 0 0 0 0 0 0 ...


6.3.2. Data dictionary

  1. Recording identification:
  • ID = participant identification code

  • day = day of assessment (1, 2 or 3)

  • block = daily task block (1 or 2)


  2. Go/No-go data
  • SessionDate = date of the task session (yyyy-mm-dd hh:mm:ss)

  • SessionTime = time within day of each task session (hh:mm:ss)

  • TrialType = trial type (Go = trials requiring a response, No-go = trials requiring the response to be withheld)

  • letter = trial letter stimulus

  • Jitter = ms randomly added to the inter-stimulus interval (ISI) for each trial (base ISI = 400 ms)

  • ACC = trial accuracy (Go trials: 1 = response, 0 = omission; No-go trials: 1 = inhibition, 0 = commission)

  • RT = trial reaction time in ms (set to 0 when ACC = 0)


  3. HRV data

Same variables as those included in the ema dataset, restricted to the last HRV recording segment of each day (i.e., within.day = 6)


  4. ESM data

Same variables as those included in the ema dataset, restricted to the last ESM entry of each day (i.e., within.day = 5)


  5. Preliminary questionnaire data

Same variables as those included in the ema and between datasets


6.4. block

The block dataset includes all collected variables, with one row for each block of the go/no-go task. Again, HRV data are included only for the segments recorded in the lab (within.day = 6), whereas ESM data are included only for the last ESM entry of each day (within.day = 5). A sketch of the trial-to-block aggregation follows below.
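
The trial-to-block aggregation can be sketched as follows; note that this is a minimal illustration of the logic, and the actual computation may differ in details such as missing-data handling:

# minimal sketch: block-level go/no-go summaries from the trial-level gng data
library(plyr)
block.sketch <- ddply(gng, c("ID","day","block"), function(d){
  go <- d[d$TrialType == "Go" & d$ACC == 1, ] # accurate Go trials only
  data.frame(Go.ACC = mean(d$ACC[d$TrialType == "Go"]),
             NoGo.ACC = mean(d$ACC[d$TrialType == "No-go"]),
             RT = mean(go$RT),
             RT.sd = sd(go$RT),
             ICV = sd(go$RT) / mean(go$RT)) }) # intraindividual coefficient of variation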

6.4.1. Data structure

str(block)
## 'data.frame':    582 obs. of  57 variables:
##  $ ID          : Factor w/ 97 levels "HRV001","HRV002",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ day         : int  1 1 2 2 3 3 1 1 2 2 ...
##  $ block       : num  1 2 1 2 1 2 1 2 1 2 ...
##  $ Go.ACC      : num  0.988 0.988 1 1 1 ...
##  $ NoGo.ACC    : num  0.889 0.778 0.778 0.852 0.741 ...
##  $ RT          : num  334 311 319 301 313 ...
##  $ RT.sd       : num  41.4 35.9 42.4 36.2 37.5 ...
##  $ ICV         : num  0.124 0.116 0.133 0.12 0.12 ...
##  $ time.HRV    : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ PI          : num  831 831 651 651 826 ...
##  $ HR          : num  72.5 72.5 92.4 92.4 73 ...
##  $ RMSSD       : num  49.3 49.3 29.9 29.9 54.5 ...
##  $ RMSSDcv     : num  5.93 5.93 4.59 4.59 6.6 ...
##  $ nPI         : int  149 149 188 188 145 145 178 178 179 179 ...
##  $ miss        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ out         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ nArtefacts  : int  0 0 1 1 0 0 25 25 0 0 ...
##  $ pArtefacts  : num  0 0 0.532 0.532 0 ...
##  $ art10       : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ inclRate60  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ noisyPI     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ os          : Factor w/ 2 levels "Android","iOS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender      : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ time.ESM    : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ RunTimestamp: POSIXct, format: "2018-11-06 16:09:16.716" "2018-11-06 16:09:16.716" ...
##  $ v1          : int  3 3 5 5 4 4 4 4 5 5 ...
##  $ v2          : num  4 4 5 5 4 4 3 3 5 5 ...
##  $ v3          : num  2 2 4 4 4 4 4 4 4 4 ...
##  $ t1          : num  3 3 5 5 4 4 5 5 5 5 ...
##  $ t2          : int  4 4 5 5 3 3 5 5 5 5 ...
##  $ t3          : int  6 6 6 6 4 4 3 3 5 5 ...
##  $ f1          : int  3 3 3 3 3 3 2 2 4 4 ...
##  $ f2          : num  3 3 3 3 3 3 2 2 5 5 ...
##  $ f3          : int  2 2 5 5 3 3 2 2 3 3 ...
##  $ PE          : int  2 2 1 1 NA NA 1 1 1 1 ...
##  $ PE.int      : int  3 3 1 1 NA NA 1 1 2 2 ...
##  $ NE          : int  4 4 2 2 NA NA 1 1 1 1 ...
##  $ NE.int      : int  4 4 5 5 NA NA 1 1 1 1 ...
##  $ smoke       : Factor w/ 2 levels "0","1": 1 1 1 1 NA NA 2 2 2 2 ...
##  $ coffe       : Factor w/ 2 levels "0","1": 1 1 1 1 NA NA 1 1 1 1 ...
##  $ alcohol     : Factor w/ 2 levels "0","1": 1 1 1 1 NA NA 1 1 1 1 ...
##  $ activity    : Factor w/ 2 levels "0","1": 2 2 2 2 NA NA 1 1 2 2 ...
##  $ people      : Factor w/ 2 levels "0","1": 2 2 1 1 NA NA 1 1 1 1 ...
##  $ daybefore.ex: Factor w/ 2 levels "0","1": NA NA NA NA NA NA NA NA NA NA ...
##  $ drugs       : Factor w/ 2 levels "0","1": NA NA NA NA NA NA NA NA NA NA ...
##  $ drugs.which : chr  NA NA NA NA ...
##  $ cLess1      : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ cLess2      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cLessResp1  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cLessResp2  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ time        : POSIXct, format: "2018-11-07 11:32:45" "2018-11-07 11:32:45" ...
##  $ age         : num  25 25 25 25 25 25 28 28 28 28 ...
##  $ gender      : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ BMI         : num  19.2 19.2 19.2 19.2 19.2 ...
##  $ drugs       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dysfun      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ postponed   : num  0 0 0 0 0 0 0 0 0 0 ...


6.4.2. Data dictionary

  1. Recording identification:
  • ID = participant identification code

  • day = day of assessment (1, 2 or 3)

  • block = daily task block (1 or 2)


  2. Go/No-go data
  • Go.ACC = block-related mean accuracy (0-1) on Go trials

  • NoGo.ACC = block-related mean accuracy (0-1) on No-go trials

  • RT = block-related mean reaction time (ms) on accurate Go trials

  • RT.sd = block-related standard deviation of reaction times (ms) on accurate Go trials

  • ICV = block-related intraindividual coefficient of variation (RT.sd / RT, 0-1)


  3. HRV data

Same variables as those included in the ema dataset, restricted to the last HRV recording segment of each day (i.e., within.day = 6)


  4. ESM data

Same variables as those included in the ema dataset, restricted to the last ESM entry of each day (i.e., within.day = 5)


  5. Preliminary questionnaire data

Same variables as those included in the ema and between datasets



7. Data export

Finally, we export the pre-processed datasets to be used in the subsequent analyses. Each dataset is saved in both the RData and the CSV file format.

# saving as RData
save(ema,file="3data/ema_preProcessed.RData")
save(between,file="3data/between_preProcessed.RData")
save(gng,file="3data/gng_preProcessed.RData")
save(block,file="3data/block_preProcessed.RData")

# saving as CSV
write.csv(ema,file="3data/ema_preProcessed.CSV",row.names=FALSE)
write.csv(between,file="3data/between_preProcessed.CSV",row.names=FALSE)
write.csv(gng,file="3data/gng_preProcessed.CSV",row.names=FALSE)
write.csv(block,file="3data/block_preProcessed.CSV",row.names=FALSE)
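
As a quick sanity check, the exported files can be re-loaded to verify that they match the structures shown above:

# optional sanity check: re-load one exported dataset and inspect its dimensions
load("3data/gng_preProcessed.RData")
dim(gng) # should return 62856 rows and 57 columns (see section 6.3.1)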



8. References

  • Curran, P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology, 66, 4-19. https://doi.org/10.1016/j.jesp.2015.07.006

  • de Geus, E. J., Gianaros, P. J., Brindle, R. C., Jennings, J. R., & Berntson, G. G. (2019). Should heart rate variability be “corrected” for heart rate? Biological, quantitative, and interpretive considerations. Psychophysiology, 56(2), e13287. https://doi.org/10.1111/psyp.13287

  • Geeraerts, J. (2020, May 20). Investigating careless responding detection techniques in experience sampling methods. Retrieved from https://osf.io/d798j/

  • Kaufmann, T., Sütterlin, S., Schulz, S. M., & Vögele, C. (2011). ARTiiFACT: A tool for heart rate artifact processing and heart rate variability analysis. Behavior Research Methods, 43(4), 1161–1170. https://doi.org/10.3758/s13428-011-0107-7

  • Xiong, H., Huang, Y., Barnes, L. E., & Gerber, M. S. (2016). Sensus: a cross-platform, general-purpose system for mobile crowdsensing in human-subject studies. Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 415–426. https://doi.org/10.1145/2971648.2971711

8.1. R packages

Auguie, Baptiste. 2017. gridExtra: Miscellaneous Functions for "Grid" Graphics. https://CRAN.R-project.org/package=gridExtra.
Borchers, Hans W. 2022. pracma: Practical Numerical Math Functions. https://CRAN.R-project.org/package=pracma.
Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2022. shiny: Web Application Framework for R. https://shiny.rstudio.com/.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Ooms, Jeroen. 2014. “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO]. https://arxiv.org/abs/1403.2805.
———. 2022. jsonlite: A Simple and Robust JSON Parser and Generator for R. https://CRAN.R-project.org/package=jsonlite.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer. http://lmdvr.r-forge.r-project.org.
———. 2021. lattice: Trellis Graphics for R. http://lattice.r-forge.r-project.org/.
Spinu, Vitalie, Garrett Grolemund, and Hadley Wickham. 2021. lubridate: Make Dealing with Dates a Little Easier. https://CRAN.R-project.org/package=lubridate.
Wickham, Hadley. 2007. “Reshaping Data with the reshape Package.” Journal of Statistical Software 21 (12): 1–20. http://www.jstatsoft.org/v21/i12/.
———. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. https://www.jstatsoft.org/v40/i01/.
———. 2016. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag. https://ggplot2.tidyverse.org.
———. 2020. reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package. https://github.com/hadley/reshape.
———. 2022. plyr: Tools for Splitting, Applying and Combining Data. https://CRAN.R-project.org/package=plyr.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2022. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, and Dana Seidel. 2022. scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.
Xie, Yihui. 2014. “knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Boca Raton, Florida: Chapman & Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.
———. 2015. Dynamic Documents with R and knitr. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. https://yihui.org/knitr/.
———. 2022. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Yentes, Richard, and Francisco Wilhelm. 2021. careless: Procedures for Computing Indices of Careless Responding. https://github.com/ryentes/careless/.