DP Mobility Report

Geolife Dataset - privacy

Configuration

Max. trips per user: 5
Privacy budget: 50.00
User privacy: True
Budget split: Visits per tile: 10
Evaluation dev. mode: False
Excluded analyses: OD flows, Travel time, Jump length, Trips per user, User time delta, Radius of gyration, User tile count, Mobility entropy

Noise has been added to provide differential privacy. The 95% confidence interval gives an intuition about the reliability of the noisy results.

Differential Privacy

This report provides differential privacy guarantees. The idea behind differential privacy is that the output of an algorithm remains nearly unchanged if the records of one individual are added or removed. This limits the impact of any single individual on the analysis outcome and prevents the reconstruction of an individual's data. Broadly speaking, this is achieved by adding calibrated noise to the output; the amount of noise is determined by the privacy_budget. Depending on the user_privacy setting, the noise is calibrated either to protect single trips (item-level privacy) or to protect users (user-level privacy). The privacy budget is split between all analyses. The configuration table lists the privacy_budget, budget_split and user_privacy settings used, and each analysis states the share of the privacy budget it consumed.
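The split can be checked against the figures in this report. The sketch below rests on an assumption that the report does not state explicitly: every included analysis has a weight of 1 unless budget_split overrides it, and each analysis receives privacy_budget * weight / (sum of all weights). The analysis labels and the helper split_budget are illustrative, not identifiers taken from the report.

```python
# Assumption: each analysis receives privacy_budget * weight / total_weight,
# where the weight is 1 unless budget_split overrides it.
privacy_budget = 50.0
budget_split = {"Visits per tile": 10}

# The seven analyses that appear in this report (labels are illustrative).
analyses = [
    "Dataset statistics",
    "Missing values",
    "Trips over time",
    "Trips per weekday",
    "Trips per hour",
    "Visits per tile",
    "Visits per time window",
]


def split_budget(total, names, weights):
    """Return the budget share of every analysis under the assumed rule."""
    total_weight = sum(weights.get(name, 1) for name in names)
    return {name: total * weights.get(name, 1) / total_weight for name in names}


shares = split_budget(privacy_budget, analyses, budget_split)
for name, share in shares.items():
    print(f"{name}: {share}")
```

Under this assumption the weights sum to 16, so a weight of 1 corresponds to 50 / 16 = 3.125 and Visits per tile receives 31.25, which agrees with the budgets listed for the individual analyses.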

The Laplace mechanism is used for counts and the Exponential mechanism for the five-number summaries. Details on the notion of differential privacy and on the mechanisms used are provided in the documentation.
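As a rough illustration of how a Laplace-noised count and its 95% confidence interval relate, here is a minimal sketch using only the standard library. The function names are hypothetical, and the sensitivity values used in the example are assumptions (the report does not state them); this is not the report generator's actual code.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, sensitivity, epsilon, rng):
    """Laplace mechanism: add noise with scale = sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def ci95_half_width(sensitivity, epsilon):
    """Half-width t with P(|noise| > t) = 0.05, i.e. t = scale * ln(1 / 0.05)."""
    return (sensitivity / epsilon) * math.log(1 / 0.05)


rng = random.Random(0)
# Assumed sensitivity of 10: up to 5 trips per user with two records each.
print(noisy_count(345, sensitivity=10, epsilon=31.25, rng=rng))
print(round(ci95_half_width(sensitivity=10, epsilon=31.25), 2))
```

With the assumed sensitivity of 10 and a budget of 31.25, the half-width is roughly 0.96, which is in line with the ± 1.0 visit(s) reported for the tile map. Treat this as a plausibility check rather than a statement about how the report was computed.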

The following table shows the key figures of the dataset.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate and the 95% confidence interval.

privacy budget: 3.125

Statistic                                             estimate  95% CI: +/-
Number of records                                        1,709         38.2
Number of distinct trips                                   861         38.1
Number of complete trips (start and end point)             848         38.3
Number of incomplete trips (single point)                   13         19.2
Number of distinct users                                   186          3.8
Number of distinct locations (lat & lon combination)     1,697         38.3

The following table shows the number of missing values for each column of the dataset.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate and the 95% confidence interval.

privacy budget: 3.125

Column                estimate  95% CI: +/-
User ID (uid)                0         47.9
Trip ID (tid)                0         47.9
Timestamp (datetime)         0         47.9
Latitude (lat)               0         47.9
Longitude (lng)             26         47.9

This visualization shows the relative number of trips on a timeline. Depending on the timespan of the dataset, trips are aggregated by day, week, or month (indicated below the graph).

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (blue line) and the 95% confidence interval. The confidence interval is visualized as the shaded error band.

The y-axis shows the percentage of trips while the x-axis shows the timeline.

privacy budget: 3.125

95% CI: +/- 1.3 %

Timestamps have been aggregated by week.

Min. 2007-04-15
Max. 2012-04-14

This histogram shows the relative number of trips per weekday.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

The y-axis shows the percentage of trips while the x-axis shows the weekdays.

privacy budget: 3.125

95% CI: +/- 0.6 %

This line chart shows the relative number of trips per hour over the course of a day, split into weekdays and weekends.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (lines). The confidence interval is stated below but omitted from the graph for visual clarity.

The legend shows the time categories (weekday start, weekday end, weekend start, weekend end), which indicate whether a timestamp marks the start or the end of a trip and whether the trip took place during the week or on the weekend.

The y-axis shows the percentage of trips while the x-axis shows the hour of the day.
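To make the four time categories concrete, the following sketch assigns each trip's start and end timestamp to a category and an hour, then converts the counts to percentages. It is an illustration only: the function name and the input format are hypothetical, no noise is added here, and normalizing over all timestamps is an assumption about how the percentages are formed.

```python
from collections import Counter
from datetime import datetime


def hour_category_shares(trips):
    """Map (category, hour) to the percentage of all trip timestamps.

    trips is an iterable of (start, end) datetime pairs.
    """
    counts = Counter()
    for start, end in trips:
        for timestamp, kind in ((start, "start"), (end, "end")):
            # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
            day_type = "weekend" if timestamp.weekday() >= 5 else "weekday"
            counts[(f"{day_type} {kind}", timestamp.hour)] += 1
    total = sum(counts.values())
    return {key: 100.0 * n / total for key, n in counts.items()}


# Two made-up trips: one on a Monday, one on a Saturday.
example_trips = [
    (datetime(2012, 4, 9, 8, 15), datetime(2012, 4, 9, 8, 45)),
    (datetime(2012, 4, 14, 10, 0), datetime(2012, 4, 14, 11, 30)),
]
shares_by_hour = hour_category_shares(example_trips)
for key in sorted(shares_by_hour):
    print(key, shares_by_hour[key])
```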

privacy budget: 3.125

95% CI: +/- 0.6 %

The following map shows the spatial distribution of the dataset according to the provided tessellation. Records outside the tessellation are reported as the number of outliers below the map.

The legend below shows the number of visits per tile ranging from 0 to the maximum number of visits per tile.

The allocated privacy budget for this map is shown below and noise is applied accordingly to the counts. The confidence interval is indicated below.

Tiles below a certain threshold are grayed out: Due to the applied noise, tiles with a low visit count are likely to contain a high percentage of noise. For usability reasons, such unrealistic values are grayed out. More specifically: The threshold is set so that values for tiles with a 5% chance (or higher) of deviating more than 20 percentage points from the estimated value are not shown.
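One way to read this rule: a tile is hidden when 20% of its noisy count is no larger than the 95% confidence half-width of the noise, because the noise then has at least a 5% chance of shifting the value by more than 20%. The sketch below encodes that reading. The function names are hypothetical and the interpretation itself is an assumption, not a confirmed description of the implementation.

```python
def is_grayed_out(noisy_count, ci95_half_width, max_rel_dev=0.2):
    """Assumed rule: hide the tile if max_rel_dev * count <= CI half-width."""
    return max_rel_dev * noisy_count <= ci95_half_width


def gray_out_threshold(ci95_half_width, max_rel_dev=0.2):
    """Count at or below which a tile would be hidden under the assumed rule."""
    return ci95_half_width / max_rel_dev


# With a half-width of 1.0 visit, counts up to about 5 would be hidden.
print(gray_out_threshold(1.0))
print(is_grayed_out(4, 1.0), is_grayed_out(345, 1.0))
```

Under this reading, a smaller privacy budget widens the confidence interval and therefore raises the threshold, so more tiles are grayed out.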

privacy budget: 31.25

95% CI: +/- 1.0 visit(s)

209 (11.76%) points are outside the given tessellation (95% confidence interval ± 1).

This table shows the mean number of visits per tile as well as the five-number summary consisting of: the most extreme values in the dataset (the maximum and minimum values), the lower and upper quartiles, and the median.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used.

Mean 2
Min. 0
25% 0
Median 0
75% 0
Max. 345
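Because these figures are derived from the already-noised tile counts, they are post-processing and consume no further budget. A minimal sketch of such a summary follows. The function name and the example counts are made up, and the quartile interpolation method is an assumption.

```python
import statistics


def tile_summary(noisy_counts):
    """Mean and five-number summary of per-tile visit counts."""
    ordered = sorted(noisy_counts)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "min": ordered[0],
        "25%": q1,
        "median": median,
        "75%": q3,
        "max": ordered[-1],
    }


# Made-up, heavily skewed counts: most tiles are empty, one is very busy.
summary = tile_summary([0, 0, 0, 0, 0, 0, 0, 3, 12, 345])
print(summary)
```

A distribution like this, with a median of 0 and a single large maximum, is the same pattern the table above shows: a few tiles account for almost all visits.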

The following visualization shows the cumulative relative number of visits. The tiles are sorted by number of visits in descending order, and their relative shares of visits are summed tile by tile. You can therefore use the graph to see how many tiles are needed to cover a certain share of the visits.

If all tiles are visited equally, the cumulative sum follows a straight diagonal line (gray line). The larger the share of single tiles in the total number of visits, the steeper the curve.
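The construction of the curve can be sketched in a few lines. The function name is hypothetical and the counts are made-up example values. Negative noisy counts, which can occur after noise is added, are not handled here.

```python
from itertools import accumulate


def cumulative_visit_shares(counts):
    """Sort counts in descending order and return the running share of visits."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return []
    return [running / total for running in accumulate(ordered)]


curve = cumulative_visit_shares([2, 5, 3])
print(curve)  # [0.5, 0.8, 1.0]
```

In this example the busiest tile alone covers half of all visits, and two of the three tiles cover 80%.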

These values are computed from the counts visualized above. Thus, no extra privacy budget is used.

The following visualization shows the ranking of most frequently visited tiles.

The y-axis shows the tile name (if provided) and tile ID in order of the ranking. The x-axis shows the number of visits per tile.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used. The 95% confidence interval of the visits per tile indicated above also applies here and is visualized with error bars.

Each map shows the arrivals (destinations) for the respective time window for each tile, split by weekday and weekend, as absolute counts and as the deviation from tile average. The tile average is defined as the mean number of visits in one tile across all time windows. The deviation from tile average therefore indicates whether a tile receives more or fewer visits than usual during a given time window of the day.

Tiles below a certain threshold are grayed out: Due to the applied noise, tiles with a low visit count are likely to contain a high percentage of noise. For usability reasons, such unrealistic values are grayed out. More specifically: The threshold is set so that values for tiles with a 5% chance (or higher) of deviating more than 20 percentage points from the estimated value are not shown.

privacy budget: 3.125

95% CI: +/- 4.8 visit(s)

Weekday

Number of visits

Deviation from tile average

The average of each tile over all time windows equals 1 (100% of average traffic). A value of < 1 (> 1) means that a tile is visited less (more) frequently in this time window than it is on average.

Weekend

Number of visits

Deviation from tile average

The average of each tile over all time windows equals 1 (100% of average traffic). A value of < 1 (> 1) means that a tile is visited less (more) frequently in this time window than it is on average.

User configuration of time windows: ['2 - 6', '6 - 10', '10 - 14', '14 - 18', '18 - 22']
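The deviation from tile average described above can be sketched for a single tile as follows. The function name and the arrival counts are made up for illustration. The window labels are the ones configured for this report.

```python
def deviation_from_tile_average(window_counts):
    """Divide each window's count by the tile's mean count over all windows."""
    average = sum(window_counts.values()) / len(window_counts)
    if average == 0:
        # A tile without any arrivals has no meaningful deviation.
        return {window: None for window in window_counts}
    return {window: count / average for window, count in window_counts.items()}


# Made-up arrival counts for one tile, keyed by the configured time windows.
arrivals = {"2 - 6": 2, "6 - 10": 8, "10 - 14": 4, "14 - 18": 6, "18 - 22": 0}
deviation = deviation_from_tile_average(arrivals)
print(deviation)
```

Here the tile average is 4 arrivals per window, so the '6 - 10' window has a value of 2.0 (twice the average traffic) and the '2 - 6' window a value of 0.5. By construction, the values of one tile average to 1.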