
Steps-specific template

This is a template for analysing, cleaning, and extracting features from step count data collected by a phone and/or a wearable device. It is suitable for steps data with a range of sampling frequencies, such as daily, minute-based, or event-based. It is an extension of the general template, where full details of the pipeline used here can be found. The following assumptions are made about the dataset:

This template begins with some data analyses to gain a deeper understanding of the dataset. The information gained from these analyses is then used to decide how to clean the data and extract features. These features include the total number of steps per day, hour, or minute, as well as some metadata features that describe the data quality.

All functions can be further tailored to your data in two ways:

If your data files contain a mix of both phone and watch data, they will only need to be filtered if the data from different devices overlap. The functions here have been designed to deal with data that comes in blocks from one device at a time (for example, SensorKit pedometer data has previously been found to be in this format). If the data does overlap, you will need two copies of this template, one for each device, and to use the filter_dict optional variable in all the functions.
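As a rough illustration, a filter keeping only the watch's rows might look like the sketch below; the column name value.device and its entries are hypothetical, so substitute whatever identifies the device in your data:

# Hypothetical filter: keep only rows from the watch. The column name and
# values here are placeholders; use whatever your data actually reports.
filter_dictionary = {"value.device": ["Apple Watch"]}
# Then pass filter_dict=filter_dictionary to each function in this template.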

This template uses example data from an Apple Watch, with the filename active_apple_healthkit_steps. However, it should be useful for any type of phone and/or wearable step count data and can be adjusted by changing the variables set in the code snippets.

Data Analysis

First, we do all the imports necessary for this analysis and get files_list, a list of the file paths:

import os
import sys
from pathlib import Path

import pandas as pd
from IPython.display import HTML, display

sys.path.insert(
    0, str(Path().resolve().parent / "src")
)  # Set the path to the src folder so that we can import the functions from there

import additional_funcs
import all_field_summaries
import calculate_durations
import feature_extraction
import helper_funcs
import timestamps_check

base_dir = Path.cwd().parent  # go up one level from where you're running
folder_path = base_dir / "example_data"
folder_path_str = str(folder_path) + "/"

# Set input variables
Folder_structure = 1  # This should be either 1 or 2 (see above)
csv_name = "active_apple_healthkit_steps"  # The standard name for the csv that contains this data
site_list = ["test"]  # The names of the subfolders for each site
input_folder = folder_path_str  # The folder that contains all the site subfolders

# Get a list of the paths to each file to be included in this analysis
files_list = helper_funcs.get_file_paths(
    input_folder, csv_name, Folder_structure, site_list
)
5 files found

Investigate Frequency

This function analyses the time gaps between datapoints and the durations of datapoints (if there is a duration or end time column), in an effort to understand the expected sampling frequency of the data. The mean, median, mode, and range are given. Also included are the number of datapoints that are equal to the mode, within an (adjustable) threshold of the mode, or more than the same threshold below the mode. These give an idea of whether one particular sampling frequency dominates: if there is an intended sampling frequency, the former two are likely to be high and the latter low.
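Conceptually, the gap statistics resemble the following pandas sketch; this is a simplified illustration rather than the function's actual implementation, using the timestamp column from this example data:

# Simplified sketch of the time gap statistics (not the actual implementation)
df = pd.read_csv(files_list[0])
gaps = df["value.time"].sort_values().diff().dropna()
mode = gaps.mode().iloc[0]
thresh = 1  # The (adjustable) threshold around the mode
print("mean/median/mode:", gaps.mean(), gaps.median(), mode)
print("equal to mode:", (gaps == mode).sum())
print("within thresh of mode:", (abs(gaps - mode) <= thresh).sum())
print("more than thresh below mode:", (gaps < mode - thresh).sum())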

Below is an example of this function being run for active_apple_healthkit_steps data. To tailor this to your data, adjust the following variables:

# Edit this dictionary if you need to filter the data.
filter_dictionary = {
    # If you wish to only keep datapoints with certain values in specific columns, edit this
    # dictionary and set filter_dict in the function below to filter_dictionary. The keys
    # here are the names of the columns you want to filter by, and the values are the list
    # of acceptable entries for that column.
    "col1": [1, 3, 5],
    "col2": ["A", "C"],
}

df = calculate_durations.investigate_frequency(
    files_list=files_list,
    thresh=1,  # You may need to adjust this after running the function once to find the mode.
    timestamp_col="value.time",  # Name of the timestamp column
    end_time_col="value.endTime",  # The end time column
    duration_col=None,  # There is no duration column
    convert_to_unix=None,  # The data is already in unix seconds
    filter_dict=None,  # We do not want to filter the data
    df_adjustment_args=[None],  # No adjustments necessary for this data type.
)

html_table = df.to_html(index=False)
styled_html = f"<div style='font-size:14px'>{html_table}</div>"
display(HTML(styled_html))

If the results show a modal time gap of 86400 seconds (one day) and the time gap is rarely below this, the data is likely daily. In this case, you may wish to use the find_time_of_timestamps function below to check that the timestamps tend to be at or close to midnight. To tailor this to your data, adjust the following variables:

# Edit this dictionary if you need to filter the data.
filter_dictionary = {
    # If you wish to only keep datapoints with certain values in specific columns, edit this
    # dictionary and set filter_dict in the function below to filter_dictionary. The keys
    # here are the names of the columns you want to filter by, and the values are the list
    # of acceptable entries for that column.
    "col1": [1, 3, 5],
    "col2": ["A", "C"],
}

d = additional_funcs.find_time_of_timestamps(
    all_file_paths=files_list,
    timestamp_col="value.time",  # The name of the timestamps file
    convert_to_unix=None,  # The data is already in unix seconds
    filter_dict=None,  # The data does not need filtering
)
print("Distribution of times:", d)
Distribution of times: {'00:00:00': 53}

Summarise Fields

We next use the Summarise_fields function to check the distribution of steps reported for each datapoint. The purpose of this function is to get a general idea of the distribution of values in any fields of interest across all the data. This is likely to include the step count field, to check whether the measured values generally seem sensible. It may also include any other fields you wish to analyse; for example, you might want to check that a field reporting the device gives a consistent value. Further details about this function are described in the general template.
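For a quick manual look at one file before running the full summary, pandas' describe gives a similar flavour (the column name below is from this example data):

# Quick manual check of the step count distribution in a single file
df = pd.read_csv(files_list[0])
print(df["value.doubleValue"].describe())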

Below is an example of this tool being run on active_apple_healthkit_steps data. To tailor this to your data, adjust the following variables:

The df_adjustment function can be used here for the following purposes:

If you want to perform any of these operations on the data, set df_adjustment_args = ['steps', steps_dict] and edit the steps_dict variable in the code below; see the comments for instructions.

# Edit this dictionary if you want to make any adjustments to the data before it is analysed.
steps_dict = {
    "steps col": "value.doubleValue",  # Change this to the name of the column that reports steps
    "cumulative": False,  # Change this to True to convert to steps since last datapoint (for cumulative steps)
    "steps per second": False,  # Change this to true to convert to steps per second
    "delete zeros": True,  # Set to True if you want to delete all datapoints reporting zero steps
    "timestamp": "value.time",  # Change to the name of the timestamp column (needed for the calculations)
    "device col": None,  # Change to name of device col (as a string) if one exists and cumulative is True
}

# Edit this dictionary if you need to filter the data.
filter_dictionary = {
    # If you wish to only keep datapoints with certain values in specific columns, edit this
    # dictionary and set filter_dict in the function below to filter_dictionary. The keys
    # here are the names of the columns you want to filter by, and the values are the list
    # of acceptable entries for that column.
    "col1": [1, 3, 5],
    "col2": ["A", "C"],
}

df = all_field_summaries.Summarise_fields(
    files_list=files_list,
    fields=["value.doubleValue"],  # A list of all fields to be analysed
    time_stamp="value.time",  # The name of the column that contains the timestamp
    filter_dict=None,  # No fields need filtering so leave as None.
    df_adjustment_args=[
        "steps",
        steps_dict,
    ],  # Make adjustments to the data based on the values in steps_dict
)

# Then display the results.
df = df.round(
    2
)  # Rounds the numbers for ease of viewing, may need to be adjusted depending on data
html_table = df.to_html(index=False)
styled_html = f"<div style='font-size:12px'>{html_table}</div>"
display(HTML(styled_html))
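For intuition, the 'cumulative' option in steps_dict essentially differences a running total into per-datapoint counts; the following is a minimal sketch of that idea, not the template's actual implementation:

# Minimal sketch: convert a cumulative step counter into per-datapoint steps
cum = pd.Series([0, 120, 350, 350, 900])
per_datapoint = cum.diff().fillna(cum)
print(per_datapoint.tolist())  # [0.0, 120.0, 230.0, 0.0, 550.0]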

Check Timestamp Errors

We next see how frequent timestamp errors are in the data. These include:

The threshold used to define STG errors (timegap_threshold) should be the minimum amount of time expected between datapoints. The results from the investigate_frequency function above may be useful in informing the choice of this threshold; otherwise, choose a sensible value for the minimum amount of time a step is expected to take. The amount by which records are allowed to overlap (EAS_threshold) should be set based on what is a reasonable overlap given the typical duration of a datapoint; again, this choice may be informed by the results from the investigate_frequency function above.
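As a rough illustration, the STG check boils down to flagging time gaps below the threshold; the following is a simplified sketch rather than the function's implementation, using the column name from this example data:

# Simplified sketch: count time gaps below the STG threshold
df = pd.read_csv(files_list[0])
gaps = df["value.time"].sort_values().diff()
timegap_threshold = 0.2
print("Gaps counted as STG errors:", (gaps < timegap_threshold).sum())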

Below is an example of this tool being run for active_apple_healthkit_steps. To tailor this to your data, adjust the following variables:

We can use the df_adjustment_args variable to delete datapoints with zero steps. This can be useful if we saw above that a large proportion of the dataset is datapoints reporting zero steps, as we may be more interested in the fraction of timestamp errors with respect to the total 'useful' datapoints, rather than all datapoints including those that don't actually report any steps. To remove all datapoints with zero steps, set df_adjustment_args = ['steps', steps_dict] and edit the steps_dict variable in the code below; see the comments for instructions.

# Edit this dictionary if you want to make any adjustments to the data before it is analysed
steps_dict = {
    "delete zeros": True,  # Set to True if you want to delete all datapoints reporting zero steps
    "steps col": "value.doubleValue",  # Change this to the name of the column that reports steps
    "timestamp": "value.time",  # Change to the name of the timestamp column
    "cumulative": False,  # Change this to True if steps cumulative (as this means repeated value is zero steps)
    "device col": None,  # Change to name of device col (as a string) if one exists and cumulative is True
}

# Edit this dictionary if you need to filter the data.
filter_dictionary = {
    # If you wish to only keep datapoints with certain values in specific columns, edit this
    # dictionary and set filter_dict in the function below to filter_dictionary. The keys
    # here are the names of the columns you want to filter by, and the values are the list
    # of acceptable entries for that column.
    "col1": [1, 3, 5],
    "col2": ["A", "C"],
}

# Run the timestamps check.
df = timestamps_check.check_timestamp_errors(
    files_list=files_list,
    EAS_threshold=None,  # Leave as None as we do not have durations for this data
    timegap_threshold=0.2,  # The threshold below which a time gap will be counted as a STG
    measurement_cols=[
        "value.doubleValue"
    ],  # A list of all measurement columns to be included.
    timestamp_col="value.time",  # Name of timestamp column
    end_time_col="value.endTime",  # Name ofhe end time column
    duration_col=None,  # We do not have a duration column
    convert_to_unix=None,  # The data is already in unix time
    filter_dict=None,  # We do not need to filter the data
    df_adjustment_args=[
        "steps",
        steps_dict,
    ],  # Make adjustments to the data based on the values in steps_dict
    output_folder="../output/steps/time_stamp_check_files",  # A folder where outputs are stored, # A folder where outputs are stored
    site_col="key.projectId",  # The site column
    participant_ID_col="key.userId",  # The participant column
)

# Display the results
html_table = df.to_html(index=False)
styled_html = f"<div style='font-size:14px'>{html_table}</div>"
display(HTML(styled_html))

Cleaning and Feature Extraction

To clean and extract features, we use a function designed specifically for steps data: general_steps_cleaning_and_FE. This produces:

Below is an example of the data cleaning and metadata feature extraction tool being run for active_apple_healthkit_steps. To tailor this to another data type, adjust the following variables:

As steps data can come in various formats (e.g. cumulative counts, multiple devices), this function also has a few optional variables to deal with varying scenarios. If run with default settings, it assumes that the steps given by meas_col occur exactly at the timestamp and that there is only one device. Alternative scenarios can be catered for by setting the below optional variables; for example, cumulative data recorded per device would need cumulative=True and device_col set to the name of the device column.

output_folder = "../output/"
data_type = "active_apple_healthkit_steps"
interval = "D"  # The interval required, here we want daily features

# Define filter_dictionary if necessary
filter_dictionary = {
    # If you wish to only keep datapoints with certain values in specific columns, edit this
    # dictionary and set filter_dict in the function below to filter_dictionary. The keys
    # here are the names of the columns you want to filter by, and the values are the list
    # of acceptable entries for that column.
    "col1": [1, 3, 5],
    "col2": ["A", "C"],
}

for file_path in files_list:
    # Get ready to save files to output folder
    participant, site = helper_funcs.get_participant_and_site(file_path)
    os.makedirs(output_folder + site, exist_ok=True)
    os.makedirs(output_folder + site + "/" + participant, exist_ok=True)

    # Read in the csv as a df
    try:
        if file_path.endswith(".csv"):
            df = pd.read_csv(file_path)
        elif file_path.endswith(".gz"):
            df = pd.read_csv(file_path, compression="gzip")
        else:
            print(file_path + " has an unrecognised file extension")
            continue
    except Exception:
        print(file_path + " file cannot be read")
        continue
    features, cleaned_df, total_steps = (
        feature_extraction.general_steps_cleaning_and_FE(
            df,
            interval,
            meas_col="value.doubleValue",  # Name of step count column
            timestamp_col="value.time",  # Name of timestamp column
            STG=86400,  # The STG threshold, here one day in seconds (for daily data)
            EAS_thresh=None,
            convert_to_unix=None,  # Data is already in unix seconds
            meas_agg="mean",  # We want to average when merging RT+CM errors
            duration_col=None,  # There is no duration column
            SPS=4,  # The maximum steps allowed a second (we kept it at the default here)
            filter_min=50,  # The number of steps needed for a datapoint before capping (kept as default)
            end_time_col="value.endTime",  # There is no end time column
            STG_fix=False,  # Do not fix STG errors when cleaning the data.
            cumulative=False,  # The steps column does not give cumulative steps
            device_col=None,  # There is no device column
            filter_dict=None,  # We do not need to filter the data
            round_to_midnight=False,  # Not useful for this data
            distribute_steps=False,  # Do not split up datapoints that span midnight
            included_errors=[
                "RT+CM",
                "STG+CM",
            ],  # The errors we want for the metadata features
        )
    )

    # Save output files.
    out_dir = output_folder + site + "/" + participant + "/"
    cleaned_df.to_csv(out_dir + data_type + "_cleaned.csv")
    features.to_csv(out_dir + data_type + "_" + interval + "_metadata.csv")
    total_steps.to_csv(out_dir + data_type + "_" + interval + "_features.csv")

Data Availability

We can now use the metadata features we created to analyse how much data is available. We use the below code to look at how many intervals (in this case days) have a non-zero number of clean datapoints (i.e. datapoints that report at least one step) across all participants.
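For example, a participant with 30 recorded days, 6 of which contain no clean datapoints, would score 1 - 6/30 = 0.8.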

# The input folder for the code below is the output folder from the code above
input_folder = output_folder
csv_name = "active_apple_healthkit_steps_D_metadata"
files_list = helper_funcs.get_file_paths(
    input_folder, csv_name, Folder_structure=2, site_list=site_list
)

all_participants = []
for path in files_list:
    df = pd.read_csv(path)
    df["total clean datapoints"] = (
        df["total counts"] - df["total timestamps with any error"]
    )
    all_participants.append(
        1
        - (
            len(df[df["total clean datapoints"] == 0])
            / len(df["total clean datapoints"])
        )
    )

helper_funcs.draw_boxplot(
    df=all_participants, title="Fraction of days with at least one clean datapoint"
)
5 files found
[Boxplot: Fraction of days with at least one clean datapoint]