
Steps-specific template

This is a template for analysing, cleaning, and extracting features from step count data collected by a phone and/or a wearable device. It is suitable for steps data with a range of sampling frequencies, such as daily, minute-based, or event-based. It is an extension of the general template, where full details of the pipeline used here can be found. The following assumptions are made about the dataset:

This template begins with some data analyses to gain a deeper understanding of the dataset. The information gained from these analyses is then used to decide how to clean the data and extract features. These features include the total number of steps per day, hour, or minute, as well as some metadata features that describe the data quality.

All functions can be further tailored to your data in two ways:

If your data files contain a mix of both phone and watch data, they will only need to be filtered if the data from different devices overlap. The functions here have been designed to deal with data that comes in blocks from one device at a time (for example, SensorKit pedometer data has previously been found to be in this format). If the data does overlap, you will need two copies of this template, one for each device, and to use the filter_dict optional variable in all the functions.
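As a rough illustration, a filter keeping only the watch's rows might look like the sketch below; the column name value.device and its entries are hypothetical, so substitute whatever identifies the device in your data:

# Hypothetical filter: keep only rows from the watch. The column name and
# values here are placeholders; use whatever your data actually reports.
filter_dictionary = {"value.device": ["Apple Watch"]}
# Then pass filter_dict=filter_dictionary to each function in this template.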

This template uses example data from an Apple Watch, with the filename active_apple_healthkit_steps. However, it should be useful for any type of phone and/or wearable step count data and can be adjusted by changing the variables set in the code snippets.

Data Analysis

First, we do all the imports necessary for this analysis and get files_list, a list of the file paths:

import os
import sys
from pathlib import Path

import pandas as pd
from IPython.display import HTML, display

sys.path.insert(
    0, str(Path().resolve().parent / "src")
)  # Set the path to the src folder so that we can import the functions from there

import additional_funcs
import all_field_summaries
import calculate_durations
import feature_extraction
import helper_funcs
import timestamps_check

base_dir = Path.cwd().parent  # go up one level from where you're running
folder_path = base_dir / "example_data"
folder_path_str = str(folder_path) + "/"

# Set input variables
Folder_structure = 1  # This should be either 1 or 2 (see above)
csv_name = "active_apple_healthkit_steps"  # The standard name for the csv that contains this data
site_list = ["test"]  # The names of the subfolders for each site
input_folder = folder_path_str  # The folder that contains all the site subfolders

# Get a list of the paths to each file to be included in this analysis
files_list = helper_funcs.get_file_paths(
    input_folder, csv_name, Folder_structure, site_list
)
5 files found

Investigate Frequency

This function analyses the time gaps between datapoints and the durations of datapoints (if there is a duration or end time column), in an effort to understand the expected sampling frequency of the data. The mean, median, mode, and range are given. Also included are the number of datapoints that are equal to the mode, within an (adjustable) threshold of the mode, or more than the same threshold below the mode. These give an idea of whether one particular sampling frequency dominates: if there is an intended sampling frequency, the former two are likely to be high and the latter low.
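Conceptually, the gap statistics resemble the following pandas sketch; this is a simplified illustration rather than the function's actual implementation, using the timestamp column from this example data:

# Simplified sketch of the time gap statistics (not the actual implementation)
df = pd.read_csv(files_list[0])
gaps = df["value.time"].sort_values().diff().dropna()
mode = gaps.mode().iloc[0]
thresh = 1  # The (adjustable) threshold around the mode
print("mean/median/mode:", gaps.mean(), gaps.median(), mode)
print("equal to mode:", (gaps == mode).sum())
print("within thresh of mode:", (abs(gaps - mode) <= thresh).sum())
print("more than thresh below mode:", (gaps < mode - thresh).sum())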

Below is an example of this function being run for active_apple_healthkit_steps data. To tailor this to your data, adjust the following variables:

# Edit this dictionary if you need to filter the data.
filter_dictionary = {
    # If you wish to only keep datapoints with certain values in specific columns, edit this
    # dictionary and set filter_dict in the function below to filter_dictionary. The keys
    # here are the names of the columns you want to filter by, and the values are the list
    # of acceptable entries for that column.
    "col1": [1, 3, 5],
    "col2": ["A", "C"],
}

df = calculate_durations.investigate_frequency(
    files_list=files_list,
    thresh=1,  # You may need to adjust this after running the function once to find the mode.
    timestamp_col="value.time",  # Name of the timestamp column
    end_time_col="value.endTime",  # The end time column
    duration_col=None,  # There is no duration column
    convert_to_unix=None,  # The data is already in unix seconds
    filter_dict=None,  # We do not want to filter the data
    df_adjustment_args=[None],  # No adjustments necessary for this data type.
)

html_table = df.to_html(index=False)
styled_html = f"<div style='font-size:14px'>{html_table}</div>"
display(HTML(styled_html))

If the results show a modal time gap of 86400 seconds (one day) and the time gap is rarely below this, the data is likely daily. In this case, you may wish to use the find_time_of_timestamps function below to check that the timestamps tend to be at or close to midnight. To tailor this to your data, adjust the following variables:

# Edit this dictionary if you need to filter the data.
filter_dictionary = {
    # If you wish to only keep datapoints with certain values in specific columns, edit this
    # dictionary and set filter_dict in the function below to filter_dictionary. The keys
    # here are the names of the columns you want to filter by, and the values are the list
    # of acceptable entries for that column.
    "col1": [1, 3, 5],
    "col2": ["A", "C"],
}

d = additional_funcs.find_time_of_timestamps(
    all_file_paths=files_list,
    timestamp_col="value.time",  # The name of the timestamps file
    convert_to_unix=None,  # The data is already in unix seconds
    filter_dict=None,  # The data does not need filtering
)
print("Distribution of times:", d)
Distribution of times: {'00:00:00': 53}

Summarise Fields

We next use the Summarise_fields function to check the distribution of steps reported for each datapoint. The purpose of this function is to get a general idea of the distribution of values in any fields of interest across all the data. This is likely to include the step count field, to check whether the measured values generally seem sensible. It may also include any other fields you wish to analyse; for example, you might want to check that a field reporting the device gives a consistent value. Further details about this function are described in the general template.
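For a quick manual look at one file before running the full summary, pandas' describe gives a similar flavour (the column name below is from this example data):

# Quick manual check of the step count distribution in a single file
df = pd.read_csv(files_list[0])
print(df["value.doubleValue"].describe())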

Below is an example of this tool being run on active_apple_healthkit_steps data. To tailor this to your data, adjust the following variables:

The df_adjustment function can be used here for the following purposes:

If you want to perform any of these operations on the data, set df_adjustment_args = ['steps', steps_dict] and edit the steps_dict variable in the code below; see the comments for instructions.

# Edit this dictionary if you want to make any adjustments to the data before it is analysed.
steps_dict = {
    "steps col": "value.doubleValue",  # Change this to the name of the column that reports steps
    "cumulative": False,  # Change this to True to convert to steps since last datapoint (for cumulative steps)
    "steps per second": False,  # Change this to true to convert to steps per second
    "delete zeros": True,  # Set to True if you want to delete all datapoints reporting zero steps
    "timestamp": "value.time",  # Change to the name of the timestamp column (needed for the calculations)
    "device col": None,  # Change to name of device col (as a string) if one exists and cumulative is True
}

# Edit this dictionary if you need to filter the data.
filter_dictionary = {
    # If you wish to only keep datapoints with certain values in specific columns, edit this
    # dictionary and set filter_dict in the function below to filter_dictionary. The keys
    # here are the names of the columns you want to filter by, and the values are the list
    # of acceptable entries for that column.
    "col1": [1, 3, 5],
    "col2": ["A", "C"],
}

df = all_field_summaries.Summarise_fields(
    files_list=files_list,
    fields=["value.doubleValue"],  # A list of all fields to be analysed
    time_stamp="value.time",  # The name of the column that contains the timestamp
    filter_dict=None,  # No fields need filtering so leave as None.
    df_adjustment_args=[
        "steps",
        steps_dict,
    ],  # Make adjustments to the data based on the values in steps_dict
)

# Then display the results.
df = df.round(
    2
)  # Rounds the numbers for ease of viewing, may need to be adjusted depending on data
html_table = df.to_html(index=False)
styled_html = f"<div style='font-size:12px'>{html_table}</div>"
display(HTML(styled_html))
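For intuition, the 'cumulative' option in steps_dict essentially differences a running total into per-datapoint counts; the following is a minimal sketch of that idea, not the template's actual implementation:

# Minimal sketch: convert a cumulative step counter into per-datapoint steps
cum = pd.Series([0, 120, 350, 350, 900])
per_datapoint = cum.diff().fillna(cum)
print(per_datapoint.tolist())  # [0.0, 120.0, 230.0, 0.0, 550.0]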

Check Timestamp Errors

We next see how frequent timestamp errors are in the data. These include:

The threshold used to define STG errors (timegap_threshold) should be the minimum amount of time expected between datapoints. The results from the investigate_frequency function above may be useful in informing the choice of this threshold; otherwise, choose a sensible value for the minimum amount of time a step is expected to take. The amount by which records are allowed to overlap (EAS_threshold) should be set based on what is a reasonable overlap given the typical duration of a datapoint; again, this choice may be informed by the results from the investigate_frequency function above.
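As a rough illustration, the STG check boils down to flagging time gaps below the threshold; the following is a simplified sketch rather than the function's implementation, using the column name from this example data:

# Simplified sketch: count time gaps below the STG threshold
df = pd.read_csv(files_list[0])
gaps = df["value.time"].sort_values().diff()
timegap_threshold = 0.2
print("Gaps counted as STG errors:", (gaps < timegap_threshold).sum())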

Below is an example of this tool being run for active_apple_healthkit_steps. To tailor this to your data, adjust the following variables:

We can use the df_adjustment_args variable to delete datapoints with zero steps. This can be useful if we saw above that a large proportion of the dataset is datapoints reporting zero steps, as we may be more interested in the fraction of timestamp errors with respect to the total 'useful' datapoints, rather than all datapoints including those that don't actually report any steps. To remove all datapoints with zero steps, set df_adjustment_args = ['steps', steps_dict] and edit the steps_dict variable in the code below; see the comments for instructions.

# Edit this dictionary if you want to make any adjustments to the data before it is analysed
steps_dict = {
    "delete zeros": True,  # Set to True if you want to delete all datapoints reporting zero steps
    "steps col": "value.doubleValue",  # Change this to the name of the column that reports steps
    "timestamp": "value.time",  # Change to the name of the timestamp column
    "cumulative": False,  # Change this to True if steps cumulative (as this means repeated value is zero steps)
    "device col": None,  # Change to name of device col (as a string) if one exists and cumulative is True
}

# Edit this dictionary if you need to filter the data.
filter_dictionary = {
    # If you wish to only keep datapoints with certain values in specific columns, edit this
    # dictionary and set filter_dict in the function below to filter_dictionary. The keys
    # here are the names of the columns you want to filter by, and the values are the list
    # of acceptable entries for that column.
    "col1": [1, 3, 5],
    "col2": ["A", "C"],
}

# Run the timestamps check.
df = timestamps_check.check_timestamp_errors(
    files_list=files_list,
    EAS_threshold=None,  # Leave as None as we do not have durations for this data
    timegap_threshold=0.2,  # The threshold below which a time gap will be counted as a STG
    measurement_cols=[
        "value.doubleValue"
    ],  # A list of all measurement columns to be included.
    timestamp_col="value.time",  # Name of timestamp column
    end_time_col="value.endTime",  # Name ofhe end time column
    duration_col=None,  # We do not have a duration column
    convert_to_unix=None,  # The data is already in unix time
    filter_dict=None,  # We do not need to filter the data
    df_adjustment_args=[
        "steps",
        steps_dict,
    ],  # Make adjustments to the data based on the values in steps_dict
    output_folder="../output/steps/time_stamp_check_files",  # A folder where outputs are stored, # A folder where outputs are stored
    site_col="key.projectId",  # The site column
    participant_ID_col="key.userId",  # The participant column
)

# Display the results
html_table = df.to_html(index=False)
styled_html = f"<div style='font-size:14px'>{html_table}</div>"
display(HTML(styled_html))

Cleaning and Feature Extraction

To clean and extract features, we use a function designed specifically for steps data: general_steps_cleaning_and_FE. This produces:

Below is an example of the data cleaning and metadata feature extraction tool being run for active_apple_healthkit_steps. To tailor this to another data type, adjust the following variables:

As steps data can come in various formats (e.g. cumulative counts, multiple devices), this function also has a few optional variables to deal with varying scenarios. If run with default settings, it assumes that the steps given by meas_col occur exactly at the timestamp and that there is only one device. Alternative scenarios can be catered for by setting the below optional variables; for example, cumulative data recorded per device would need cumulative=True and device_col set to the name of the device column.

output_folder = "../output/"
data_type = "active_apple_healthkit_steps"
interval = "D"  # The interval required, here we want daily features

# Define filter_dictionary if necessary
filter_dictionary = {
    # If you wish to only keep datapoints with certain values in specific columns, edit this
    # dictionary and set filter_dict in the function below to filter_dictionary. The keys
    # here are the names of the columns you want to filter by, and the values are the list
    # of acceptable entries for that column.
    "col1": [1, 3, 5],
    "col2": ["A", "C"],
}

for file_path in files_list:
    # Get ready to save files to output folder
    participant, site = helper_funcs.get_participant_and_site(file_path)
    os.makedirs(output_folder + site, exist_ok=True)
    os.makedirs(output_folder + site + "/" + participant, exist_ok=True)

    # Read in the csv as a df
    try:
        if file_path.endswith(".csv"):
            df = pd.read_csv(file_path)
        elif file_path.endswith(".gz"):
            df = pd.read_csv(file_path, compression="gzip")
        else:
            print(file_path + " has an unrecognised file extension")
            continue
    except Exception:
        print(file_path + " file cannot be read")
        continue
    features, cleaned_df, total_steps = (
        feature_extraction.general_steps_cleaning_and_FE(
            df,
            interval,
            meas_col="value.doubleValue",  # Name of step count column
            timestamp_col="value.time",  # Name of timestamp column
            STG=86400,  # The STG threshold, here one day in seconds (for daily data)
            EAS_thresh=None,
            convert_to_unix=None,  # Data is already in unix seconds
            meas_agg="mean",  # We want to average when merging RT+CM errors
            duration_col=None,  # There is no duration column
            SPS=4,  # The maximum steps allowed a second (we kept it at the default here)
            filter_min=50,  # The number of steps needed for a datapoint before capping (kept as default)
            end_time_col="value.endTime",  # There is no end time column
            STG_fix=False,  # Do not fix STG errors when cleaning the data.
            cumulative=False,  # The steps column does not give cumulative steps
            device_col=None,  # There is no device column
            filter_dict=None,  # We do not need to filter the data
            round_to_midnight=False,  # Not useful for this data
            distribute_steps=False,  # Do not split up datapoints that span midnight
            included_errors=[
                "RT+CM",
                "STG+CM",
            ],  # The errors we want for the metadata features
        )
    )

    # Save output files.
    out_dir = output_folder + site + "/" + participant + "/"
    cleaned_df.to_csv(out_dir + data_type + "_cleaned.csv")
    features.to_csv(out_dir + data_type + "_" + interval + "_metadata.csv")
    total_steps.to_csv(out_dir + data_type + "_" + interval + "_features.csv")

Data Availability

We can now use the metadata features we created to analyse how much data is available. We use the below code to look at how many intervals (in this case days) have a non-zero number of clean datapoints (i.e. datapoints that report at least one step) across all participants.
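For example, a participant with 30 recorded days, 6 of which contain no clean datapoints, would score 1 - 6/30 = 0.8.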

# The input folder for the code below is the output folder from the code above
input_folder = output_folder
csv_name = "active_apple_healthkit_steps_D_metadata"
files_list = helper_funcs.get_file_paths(
    input_folder, csv_name, Folder_structure=2, site_list=site_list
)

all_participants = []
for path in files_list:
    df = pd.read_csv(path)
    df["total clean datapoints"] = (
        df["total counts"] - df["total timestamps with any error"]
    )
    all_participants.append(
        1
        - (
            len(df[df["total clean datapoints"] == 0])
            / len(df["total clean datapoints"])
        )
    )

helper_funcs.draw_boxplot(
    df=all_participants, title="Fraction of days with at least one clean datapoint"
)
5 files found
[Boxplot: Fraction of days with at least one clean datapoint]