DP Mobility Report

Berlin Dataset - privacy

Configuration

Max. trips per user 5
Privacy budget 1.00
User privacy True
Budget split Evenly distributed
Evaluation dev. mode False
Excluded analyses None

Noise has been added to provide differential privacy. The 95%-confidence interval gives an intuition about the reliability of the noisy results.

Differential Privacy

This report provides differential privacy guarantees. The idea behind differential privacy is that the output of an algorithm remains nearly unchanged if the records of one individual are added or removed. This limits the impact of any single individual on the analysis outcome and thereby prevents the reconstruction of an individual's data. Broadly speaking, this is achieved by adding calibrated noise to the output; the amount of noise is determined by the privacy_budget. Depending on the setting of user_privacy, noise is calibrated either to protect single trips (item-level privacy) or to protect all trips of a user (user-level privacy). The privacy budget is split between all analyses. The configuration table lists the privacy_budget, the budget_split and the user_privacy setting in use. For each analysis, the share of the privacy budget it consumes is stated.

The Laplace mechanism is used for counts and the Exponential mechanism for the five-number summaries. Details on the notion of differential privacy and the mechanisms used are provided in the documentation.
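
As a rough illustration of the Laplace mechanism mentioned above, the following sketch adds noise with scale sensitivity / epsilon to a count and derives the half-width of a 95% interval from the Laplace tail, P(|X| > t) = exp(-t / scale). The function names and the example values are assumptions for illustration; they are not taken from the report's implementation.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, sensitivity, epsilon, rng):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def laplace_margin(sensitivity, epsilon, confidence=0.95):
    """Half-width t such that P(|noise| <= t) equals `confidence`."""
    return (sensitivity / epsilon) * math.log(1.0 / (1.0 - confidence))


rng = random.Random(0)
# Hypothetical count with sensitivity 1 and epsilon 0.5.
release = noisy_count(1000, sensitivity=1, epsilon=0.5, rng=rng)
margin = laplace_margin(sensitivity=1, epsilon=0.5)
```

A smaller epsilon or a larger sensitivity widens the margin proportionally, which is why analyses with the same budget can still report different confidence intervals.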

The following table shows the key figures of the dataset.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate and the 95% confidence interval.

privacy budget: 0.0667

Statistic                                              Estimate     95% CI: +/-
Number of records                                      2,595,459    1,797.3
Number of distinct trips                               1,297,951    1,797.1
Number of complete trips (start and end point)         1,297,508    1,797.4
Number of incomplete trips (single point)              443          898.7
Number of distinct users                               378,760      179.7
Number of distinct locations (lat & lon combination)   204,413      1,797.4
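
Several margins in this table can be approximately reproduced with a few lines of arithmetic. The sketch below is an inference from the reported numbers, not a description of the report's code. It assumes that the total budget of 1.00 is divided evenly over the 15 analyses that list a budget of 0.0667, that this share is divided again into four equal parts within this table, and that the sensitivities under user-level privacy are 10 for records and locations (each of the at most 5 trips contributes a start and an end point), 5 for incomplete trips and 1 for users.

```python
import math

TOTAL_BUDGET = 1.00      # "Privacy budget" in the configuration table
MAX_TRIPS_PER_USER = 5   # "Max. trips per user" in the configuration table
N_ANALYSES = 15          # assumption: analyses listing a budget of 0.0667
N_SHARES = 4             # assumption: equal shares within this table

budget_per_analysis = TOTAL_BUDGET / N_ANALYSES
epsilon_per_count = budget_per_analysis / N_SHARES


def margin_95(sensitivity, epsilon):
    """95% half-width of Laplace noise with scale sensitivity / epsilon."""
    return sensitivity / epsilon * math.log(20.0)


# Assumed sensitivities; these are not stated in the report.
assumed_sensitivity = {
    "Number of records": 2 * MAX_TRIPS_PER_USER,
    "Number of incomplete trips": MAX_TRIPS_PER_USER,
    "Number of distinct users": 1,
    "Number of distinct locations": 2 * MAX_TRIPS_PER_USER,
}

margins = {
    name: round(margin_95(s, epsilon_per_count), 1)
    for name, s in assumed_sensitivity.items()
}
```

Under these assumptions the computed values land within a few tenths of the reported margins; the small remaining differences suggest that the report's own computation differs in some detail.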

The following table shows the number of missing values for each column of the dataset.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate and the 95% confidence interval.

privacy budget: 0.0667

Column                  Estimate   95% CI: +/-
User ID (uid)           0          2,246.8
Trip ID (tid)           1,188      2,246.8
Timestamp (datetime)    0          2,246.8
Latitude (lat)          467        2,246.8
Longitude (lng)         364        2,246.8

This visualization shows the relative number of trips on a timeline. Depending on the timespan of the dataset, it is aggregated by day, week or month (indicated below the graph).

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (blue line) and the 95% confidence interval. The confidence interval is visualized as the shaded error band.

The y-axis shows the percentage of trips while the x-axis shows the timeline.

privacy budget: 0.0667

95% CI: +/- 0.1 %


Timestamps have been aggregated by date.

Min. 2018-04-18
Max. 2018-04-20

This histogram shows the relative number of trips per weekday.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

The y-axis shows the percentage of trips while the x-axis shows the weekdays.

privacy budget: 0.0667

95% CI: +/- 0.0 %


This linechart shows the relative number of trips per hour over the course of a day, disaggregated by weekday and weekend.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (lines). The confidence interval is indicated below but, for visual clarity, not drawn in the graph.

The legend shows the different time categories (weekday start, weekday end, weekend start, weekend end) indicating the start and end timestamp of each trip and if the trip was during the week or on the weekend.

The y-axis shows the percentage of trips while the x-axis shows the hour of the day.

privacy budget: 0.0667

95% CI: +/- 0.0 %


The following map shows the spatial distribution of the dataset according to the provided tessellation. Records outside the tessellation are indicated as number of outliers below the map.

The legend below shows the number of visits per tile ranging from 0 to the maximum number of visits per tile.

The allocated privacy budget for this map is shown below and noise is applied to the counts accordingly. The confidence interval is indicated below.

Tiles below a certain threshold are grayed out: Due to the applied noise, tiles with a low visit count are likely to contain a high percentage of noise. For usability reasons, such unrealistic values are grayed out. More specifically: The threshold is set so that values for tiles with a 5% chance (or higher) of deviating more than 20 percentage points from the estimated value are not shown.

privacy budget: 0.0667

95% CI: +/- 449.4 visit(s)
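
One way to read the graying-out rule is that a tile is shown only if the 95% margin is at most 20% of its noisy count, so the smallest displayable count is the margin divided by 0.2. The sketch below applies that reading to the margin stated above. The function names and the exact comparison are assumptions, not the report's implementation.

```python
def display_threshold(margin_95, max_relative_error=0.2):
    """Smallest count whose 95% margin stays within the relative error."""
    return margin_95 / max_relative_error


def is_displayed(noisy_count, margin_95, max_relative_error=0.2):
    """True if the count is large enough relative to the noise margin."""
    return noisy_count >= display_threshold(margin_95, max_relative_error)


# Margin of 449.4 visits taken from the line above.
threshold = display_threshold(449.4)
```

With this reading, a margin of 449.4 visits corresponds to a threshold of roughly 2,247 visits per tile.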


0 (0.0%) points are outside the given tessellation (95% confidence interval ± 449).

This table shows the mean number of visits per tile as well as the five-number summary consisting of: the most extreme values in the dataset (the maximum and minimum values), the lower and upper quartiles, and the median.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used.

Mean 6,728
Min. 0
25% 2,645
Median 5,130
75% 9,453
Max. 43,271
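
Because these statistics are derived from counts that have already been noised, they are pure post-processing and consume no further budget. A minimal sketch of such a summary follows; the tile counts are hypothetical and the quartile interpolation method is an assumption, so the report may compute quartiles slightly differently.

```python
import statistics


def summarize(counts):
    """Mean and five-number summary of already-noised counts."""
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "Mean": statistics.fmean(counts),
        "Min.": min(counts),
        "25%": q1,
        "Median": median,
        "75%": q3,
        "Max.": max(counts),
    }


# Hypothetical noisy visit counts for five tiles.
summary = summarize([0, 10, 20, 30, 40])
```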

The following visualization shows the cumulated relative number of visits. This means that the tiles are sorted by number of visits in descending order and their relative shares of visits are added up tile by tile. Thus, you can use the graph to evaluate how many tiles are needed to cover a certain share of the visits.

If all tiles are visited equally, the cumulated sum follows a straight diagonal line (gray line). The larger the share of single tiles in the total number of visits, the steeper the curve.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used.
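
The cumulated curve described above can be sketched as follows. The helper names and the example counts are hypothetical, and the sketch assumes that the total number of visits is positive.

```python
from itertools import accumulate


def cumulative_share(counts):
    """Cumulated share of visits, tiles sorted by visits in descending order."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def tiles_needed(counts, share):
    """Number of top tiles needed to cover at least `share` of all visits."""
    for n_tiles, covered in enumerate(cumulative_share(counts), start=1):
        if covered >= share:
            return n_tiles
    return len(counts)


# Hypothetical visit counts for four tiles.
shares = cumulative_share([10, 50, 10, 30])
```

For these counts, the two most visited tiles already cover 80% of the visits, which shows up as a steep start of the curve.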


The following visualization shows the ranking of most frequently visited tiles.

The y-axis shows the tile name (if provided) and tile ID in order of the ranking. The x-axis shows the number of visits per tile.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used. The 95% confidence interval of the visits per tile indicated above also applies here and is visualized with error bars.


Each map shows the arrivals (destinations) per tile for the respective time window, split by weekday and weekend, both as absolute counts and as the deviation from the tile average. The tile average is defined as the mean number of visits in one tile across all time windows. Thus, the deviation from the tile average indicates a higher or lower number of visits for this tile during certain time windows of a day.

Tiles below a certain threshold are grayed out: Due to the applied noise, tiles with a low visit count are likely to contain a high percentage of noise. For usability reasons, such unrealistic values are grayed out. More specifically: The threshold is set so that values for tiles with a 5% chance (or higher) of deviating more than 20 percentage points from the estimated value are not shown.

privacy budget: 0.0667

95% CI: +/- 224.7 visit(s)

Weekday

Number of visits

Deviation from tile average

The average of each tile over all time windows equals 1 (100% of average traffic). A value of < 1 (> 1) means that a tile is visited less (more) frequently in this time window than it is on average.

User configuration of timewindows: ['2 - 6', '6 - 10', '10 - 14', '14 - 18', '18 - 22']
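
The deviation from the tile average can be illustrated with a short sketch: each time-window count of a tile is divided by the mean of that tile's counts over all windows. The function name and the example counts are hypothetical, and returning None for a tile without visits is an assumption.

```python
TIME_WINDOWS = ["2 - 6", "6 - 10", "10 - 14", "14 - 18", "18 - 22"]


def deviation_from_tile_average(counts_by_window):
    """Ratio of each window's count to the tile's mean over all windows."""
    tile_average = sum(counts_by_window.values()) / len(counts_by_window)
    if tile_average == 0:
        # No visits at all: the ratio is undefined.
        return {window: None for window in counts_by_window}
    return {
        window: count / tile_average
        for window, count in counts_by_window.items()
    }


# Hypothetical arrivals of one tile per time window.
arrivals = dict(zip(TIME_WINDOWS, [10, 40, 20, 20, 10]))
deviation = deviation_from_tile_average(arrivals)
```

Here the tile average is 20, so the window "6 - 10" has a value of 2.0 (twice the average traffic) while "2 - 6" has a value of 0.5.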

The following map shows the origin-destination (OD) flows between the tiles according to the provided tessellation, meaning the number of trips between respective start and end tiles.

The origin of each OD flow is indicated by a small circle. Clicking on an OD connection shows the origin and destination cell names as well as the count for this OD connection.

The legend for the intra-tile flows is below. An intra-tile flow is defined as an OD connection that starts and ends in the same tile.

The allocated privacy budget for this map is shown below and noise is applied to the counts accordingly. The confidence interval is indicated below.

Flows below a certain threshold are not displayed (grayed out for intra-tile flows): Due to the applied noise, flows with a low count are likely to contain a high percentage of noise. For usability reasons, such unrealistic values are not displayed/grayed out. More specifically: The threshold is set so that values for flows with a 5% chance (or higher) of deviating more than 20 percentage points from the estimated value are not shown.

privacy budget: 0.0667

95% CI: +/- 224.7 flow(s)

User configuration: display max. top 300 OD connections on map
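
To make the terminology concrete, the sketch below counts OD flows from (origin tile, destination tile) pairs, separates intra-tile flows and selects the top connections for display. It works on raw counts without noise, and the function names and tile labels are hypothetical.

```python
from collections import Counter


def od_flows(trips):
    """Count trips per (origin_tile, destination_tile) pair."""
    return Counter(trips)


def split_flows(flows):
    """Separate flows that stay within one tile from flows between tiles."""
    intra = {od: n for od, n in flows.items() if od[0] == od[1]}
    inter = Counter({od: n for od, n in flows.items() if od[0] != od[1]})
    return intra, inter


def top_connections(flows, n=300):
    """The n OD connections with the highest counts."""
    return flows.most_common(n)


# Hypothetical trips between tiles "A" and "B".
flows = od_flows([("A", "B"), ("A", "B"), ("B", "A"), ("A", "A")])
intra, inter = split_flows(flows)
```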


This table shows the mean number of flows per OD pair as well as the five-number summary consisting of: the most extreme values in the dataset (the maximum and minimum values), the lower and upper quartiles, and the median.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used.

Mean 82
Min. 1
25% 24
Median 56
75% 111
Max. 3,922

The following visualization shows the cumulated relative number of flows per OD pair. This means that the OD pairs are sorted by number of flows in descending order and their relative shares of flows are added up OD pair by OD pair. Thus, you can use the graph to evaluate how many OD pairs are needed to cover a certain share of the flows.

If all OD pairs have the same number of flows, the cumulated sum follows a straight diagonal line (gray line). The larger the share of a single OD pair in the total number of flows, the steeper the curve.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used.


The following visualization shows the ranking of the most frequent OD connections.

The y-axis shows the tile name of origin and destination in order of the ranking. The x-axis shows the number of flows per OD pair.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used. The 95% confidence interval of the flows per OD pair indicated above also applies here and is visualized with error bars.


The following histogram shows the distribution of travel time. The travel time is computed as the time difference between start and end timestamp of a trip in minutes.

The y-axis indicates the relative counts of trips while the x-axis shows the range of histogram bins in minutes according to the user configured bin size and maximum value.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

privacy budget: 0.0667

95% CI: +/- 0.1 %


User configuration for histogram chart:
maximum value: 90
bin size: 5
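
The binning implied by this configuration can be sketched as follows. The sketch computes travel times in minutes and sorts them into bins of width 5 up to 90. Collecting larger values in a final overflow bin and using left-closed bins are assumptions, and no noise is added here.

```python
from datetime import datetime


def travel_time_minutes(start, end):
    """Time difference between start and end timestamp in minutes."""
    return (end - start).total_seconds() / 60.0


def relative_histogram(values, bin_size, max_value):
    """Relative counts per bin; the last entry collects values >= max_value."""
    n_bins = int(max_value // bin_size)
    counts = [0] * (n_bins + 1)
    for value in values:
        index = int(value // bin_size)
        counts[min(index, n_bins)] += 1
    total = sum(counts)
    return [count / total for count in counts]


# Hypothetical trip and hypothetical travel times in minutes.
minutes = travel_time_minutes(
    datetime(2018, 4, 18, 8, 0), datetime(2018, 4, 18, 8, 28)
)
histogram = relative_histogram([5, 17, 28, 44, 225], bin_size=5, max_value=90)
```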

Five number summary: travel time

Min. 5.00
25% 17.00
Median 28.00
75% 44.00
Max. 225.00
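
For the five-number summaries the report states that the Exponential mechanism is used. A common way to obtain a differentially private quantile with this mechanism is sketched below: the clipped, sorted data split the range into intervals, each interval is weighted by its width times exp(epsilon * utility / 2), and the result is drawn uniformly from the selected interval. This is a generic textbook construction with a utility sensitivity of 1 (item-level privacy); the bounds, the function name and the example data are assumptions, and the report's own implementation may differ.

```python
import math
import random


def dp_quantile(values, q, epsilon, lower, upper, rng):
    """Quantile estimate via the exponential mechanism (requires lower < upper)."""
    data = sorted(min(max(v, lower), upper) for v in values)
    n = len(data)
    edges = [lower] + data + [upper]
    log_weights = []
    for i in range(n + 1):
        # Interval i has i data points below it.
        width = edges[i + 1] - edges[i]
        if width <= 0:
            log_weights.append(-math.inf)
        else:
            utility = -abs(i - q * n)
            log_weights.append(math.log(width) + epsilon * utility / 2.0)
    top = max(log_weights)
    weights = [math.exp(w - top) for w in log_weights]
    index = rng.choices(range(n + 1), weights=weights, k=1)[0]
    return rng.uniform(edges[index], edges[index + 1])


# Hypothetical travel times in minutes, clipped to [0, 300].
rng = random.Random(0)
median_estimate = dp_quantile(
    [5, 12, 17, 28, 31, 44, 60], q=0.5, epsilon=1.0,
    lower=0.0, upper=300.0, rng=rng,
)
```

Under user-level privacy, one user can contribute several values, so the utility sensitivity would have to be scaled accordingly (for example by the maximum number of trips per user).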

The following histogram shows the distribution of jump length. The jump length is the straight-line distance between the origin and destination.
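
A straight-line distance between two coordinates is commonly computed with the haversine formula. The sketch below assumes a great-circle distance on a sphere with a mean Earth radius of 6371 km; whether the report uses exactly this formula is not stated here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2.0) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2.0) ** 2
    )
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical origin and destination one degree of latitude apart.
jump_length = haversine_km(52.0, 13.0, 53.0, 13.0)
```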

The y-axis indicates the relative counts of trips while the x-axis shows the range of histogram bins in kilometers according to the user configured bin size and maximum value.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

privacy budget: 0.0667

95% CI: +/- 0.1 %


User configuration for histogram chart:
maximum value: 30
bin size: 3

Five number summary: jump length

Min. 0.00
25% 0.95
Median 3.34
75% 6.68
Max. 26.74

The following histogram shows the distribution of number of trips per user, i.e. how many trips a user contributed to the dataset.

The y-axis indicates the relative number of users and the x-axis shows the range of the histogram bins according to the user configured maximum of trips per user.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

privacy budget: 0.0667

95% CI: +/- 0.1 %


Trips per user are limited according to the configured maximum of trips per user: 5
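
Bounding the number of trips per user is what keeps the sensitivity under user-level privacy finite. A minimal sketch of such a bound follows. It keeps a random sample of at most 5 trips for each user; how the report selects the trips to keep is not stated, so the random sampling and the function name are assumptions.

```python
import random
from collections import defaultdict


def limit_trips_per_user(trips, max_trips, rng):
    """Keep at most `max_trips` trips per user from (user_id, trip_id) pairs."""
    trips_by_user = defaultdict(list)
    for user_id, trip_id in trips:
        trips_by_user[user_id].append(trip_id)
    kept = []
    for user_id, trip_ids in trips_by_user.items():
        if len(trip_ids) > max_trips:
            trip_ids = rng.sample(trip_ids, max_trips)
        kept.extend((user_id, trip_id) for trip_id in trip_ids)
    return kept


# Hypothetical users: "a" has 8 trips, "b" has 2 trips.
trips = [("a", i) for i in range(8)] + [("b", 0), ("b", 1)]
kept = limit_trips_per_user(trips, max_trips=5, rng=random.Random(0))
```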

Five number summary: trips per user

Min. 2
25% 2
Median 4
75% 5
Max. 5

The following histogram shows the distribution of time between consecutive trips of a user, i.e. the time that passes between the end of one trip and the beginning of the following trip of one user.

The y-axis shows the relative number of trips and the x-axis shows the range of histogram bins in hours between trips of the same user according to the user configured bin size and maximum value.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

privacy budget: 0.0667

95% CI: +/- 0.1 %


User configuration for histogram chart:
maximum value: None
bin size: None
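
The time between consecutive trips of one user can be sketched as the gap between the end of a trip and the start of the next trip, after sorting the trips by start time. The function name and the timestamps are hypothetical; overlapping trips would yield negative gaps in this simple version.

```python
from datetime import datetime


def gaps_between_trips(trips):
    """Gaps between consecutive trips given (start, end) datetimes of one user."""
    ordered = sorted(trips)
    return [
        next_start - previous_end
        for (_, previous_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Hypothetical trips of a single user on one day.
day = (2018, 4, 18)
gaps = gaps_between_trips([
    (datetime(*day, 9, 0), datetime(*day, 9, 20)),
    (datetime(*day, 8, 0), datetime(*day, 8, 30)),
    (datetime(*day, 13, 40), datetime(*day, 14, 0)),
])
```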

Five number summary: time between consecutive trips of a user

Min. 0 days 00:00:00
25% 0 days 00:25:00
Median 0 days 01:35:00
75% 0 days 04:20:00
Max. 0 days 15:45:00

The following histogram shows the distribution of the radii of gyration. The radius of gyration is the characteristic distance traveled by an individual during a period of time.
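
A common definition of the radius of gyration is the root mean square distance of a user's visited points from their center of mass. The sketch below follows that definition and uses a local planar (equirectangular) approximation, which is reasonable at city scale. Both the definition and the approximation are assumptions about how the value could be computed, not a description of the report's code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def radius_of_gyration_km(points):
    """Root mean square distance (km) of (lat, lon) points from their mean."""
    n = len(points)
    if n == 0:
        return 0.0
    lat_center = sum(lat for lat, _ in points) / n
    lon_center = sum(lon for _, lon in points) / n
    cos_lat = math.cos(math.radians(lat_center))
    squared_sum = 0.0
    for lat, lon in points:
        # Offsets from the center of mass in kilometers.
        dy = math.radians(lat - lat_center) * EARTH_RADIUS_KM
        dx = math.radians(lon - lon_center) * EARTH_RADIUS_KM * cos_lat
        squared_sum += dx * dx + dy * dy
    return math.sqrt(squared_sum / n)


# Hypothetical user with two points 0.01 degrees of latitude apart.
radius = radius_of_gyration_km([(52.50, 13.40), (52.51, 13.40)])
```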

The y-axis shows the relative number of users and the x-axis shows the range of histogram bins in kilometers according to the user configured bin size and maximum value.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

privacy budget: 0.0667

95% CI: +/- 0.1 %


User configuration for histogram chart:
maximum value: 18
bin size: 1.5

Five number summary: radius of gyration

Min. 0.02
25% 1.39
Median 2.60
75% 4.49
Max. 14.44

The following histogram shows the distribution of how many distinct tiles a user has visited. It describes the diversity of locations a user visits.

The y-axis shows the relative number of users and the x-axis shows the number of distinct tiles according to the user configured bin size and maximum value.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

privacy budget: 0.0667

95% CI: +/- 0.1 %


User configuration for histogram chart:
maximum value: None
bin size: None

Five number summary: distinct tiles per user

Min. 1
25% 2
Median 3
75% 3
Max. 8

The following histogram shows the distribution of the mobility entropy.

The mobility entropy characterizes the heterogeneity of a user's visitation patterns and can be interpreted as a measure of the predictability of a user's location. If a user only visits a single tile, the entropy is 0, i.e., their location is highly predictable. If a user visits, e.g., four different tiles 10 times each, the entropy is 1, i.e., their location is not predictable because each of the four tiles is equally likely to be visited by the user. Intuitively, the more trips per user the data contains, the more meaningful the mobility entropy.
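
The two examples in the paragraph above are consistent with a Shannon entropy that is normalized by the logarithm of the number of distinct tiles a user visited. The sketch below implements that reading; the normalization and the function name are assumptions.

```python
import math
from collections import Counter


def mobility_entropy(visited_tiles):
    """Normalized Shannon entropy of one user's tile visits (0 to 1)."""
    counts = Counter(visited_tiles)
    if len(counts) <= 1:
        # No visits or a single tile: the location is fully predictable.
        return 0.0
    total = sum(counts.values())
    entropy = -sum(
        (count / total) * math.log(count / total)
        for count in counts.values()
    )
    return entropy / math.log(len(counts))


# Hypothetical users matching the two examples in the text.
single_tile_user = mobility_entropy(["A"] * 10)
four_tile_user = mobility_entropy(["A", "B", "C", "D"] * 10)
```

With few trips per user, as in a dataset limited to 5 trips, only a handful of entropy values are possible, which helps explain why the summary below clusters near 1.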

The y-axis shows the relative counts of users and the x-axis shows the range of histogram bins of the mobility entropy.

The allocated privacy budget for this statistic is shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

privacy budget: 0.0667

95% CI: +/- 0.1 %


Five number summary: mobility entropy

Min. 0.00
25% 0.92
Median 0.96
75% 1.00
Max. 1.00