DP Mobility Report: Benchmark

Berlin Benchmark report

Configuration

                      Base                  Alternative
Max. trips per user   16                    5
Privacy budget        -                     1.00
User privacy          True                  1.00
Budget split          Evenly distributed    OD flows: 500, Visits per tile: 50, Visits per time tile: 300
Evaluation dev. mode  False
Excluded analyses     User time delta

Similarity measures

The similarity of two mobility reports is evaluated in this benchmark report using similarity measures. Specifically, a set of measures is computed for each chosen analysis; the results are shown below each analysis, indicated by a light orange background.

In the following, the similarity measures relative error (RE), Kullback-Leibler divergence (KLD), Jensen-Shannon divergence (JSD), earth mover's distance (EMD) and symmetric mean absolute percentage error (SMAPE) are explained, along with the reasoning for which measures are available for each analysis and which measure is the default and why.

The symmetric mean absolute percentage error (SMAPE) is an accuracy measure based on percentage (or relative) errors. In contrast to the mean absolute percentage error, SMAPE has both a lower bound (0, meaning identical) and an upper bound (2, meaning entirely different). $$SMAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|alternative_{i} - base_{i}|}{(|base_{i}| + |alternative_{i}|)/2}$$

SMAPE is computed for all analyses.

For single counts (e.g., dataset statistics, missing values), n=1, with \(base_{i}\) (respectively \(alternative_{i}\)) referring to the respective count value. For the five number summary, n=5, with \(base_{i}\) (respectively \(alternative_{i}\)) referring to the \(i\)-th value of the summary. For all other analyses, n equals the number of histogram bins.

SMAPE is employed as the default measure for single counts and for the evaluation of the five number summary, as KLD, JSD and EMD are not suitable.
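As a minimal illustration of the formula above (not the report's implementation; the zero-denominator convention is an assumption here), SMAPE over paired base/alternative values can be computed as:

```python
def smape(base, alternative):
    """Symmetric mean absolute percentage error, bounded in [0, 2]."""
    total = 0.0
    for b, a in zip(base, alternative):
        denom = (abs(b) + abs(a)) / 2
        # Convention (an assumption here): a pair of zeros contributes no error.
        total += 0.0 if denom == 0 else abs(a - b) / denom
    return total / len(base)

print(smape([100], [100]))  # identical counts -> 0.0
print(smape([100], [0]))    # entirely different -> 2.0
```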

The Kullback-Leibler divergence (KLD), also called relative entropy, is a widely used statistic to measure how far a probability distribution P deviates from a reference probability distribution Q on the same probability space \(\mathcal{X}\). For discrete distributions P and Q, it is formally defined as: $$D_{KL}(P||Q):= \sum\limits_{x \in \mathcal{X}: P(x)>0} P(x)\cdot \log \frac{P(x)}{Q(x)}$$

For example, considering a tessellation (\(\mathcal{X}\)), the spatial distribution can be evaluated by comparing the relative counts per tile x in the synthetic data (P) to the relative counts per tile in the reference dataset (Q). The larger the deviation of P from Q, the larger the resulting KLD, with a minimum value of 0 for identical distributions.

Note that KLD is not symmetric, i.e., \(D_{KL}(P||Q)~\neq~D_{KL}(Q||P)\), which is why KLD is best applicable in settings with a reference model Q and a fitted model P. However, the lack of symmetry implies that it is not a distance metric in the mathematical sense.

It is worth noting that KLD is only defined if \(Q(x)\neq 0\) for all x in the support of P, while this constraint is not required for JSD. In practice, both KLD and JSD are computed for discrete approximations of continuous distributions, e.g., histograms approximating the relative number of trips over time based on daily or hourly counts. However, the choice of histogram bins has an impact in two respects: Say we want to compare the number of visits per tile. Depending on the granularity of the chosen tessellation, there might be tiles with 0 visits in the real dataset but >0 visits in the synthetic dataset, thus KLD would not be defined for such cases. Additionally, the resulting values for both KLD and JSD vary according to the choice of bins, e.g., by reducing the granularity of the tessellation, the values of KLD and JSD will tend to be smaller.

The KLD is computed for the following analyses: trips over time, trips per weekday, trips per hour, visits per tile, visits per tile timewindow, OD flows, travel time, jump length and radius of gyration. The KLD might be None (i.e., not defined) because it requires \(Q(x)\neq 0\) for all x in the support of P; therefore, the JSD is set as the default measure instead.
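A small sketch of the discrete KLD, returning None when the divergence is undefined; the base-2 logarithm is a choice made here so the values relate to the JSD range:

```python
import math

def kld(p, q):
    """Discrete Kullback-Leibler divergence D_KL(P||Q), base-2 logarithm.
    Returns None if Q(x) == 0 for some x in the support of P."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return None  # mirrors the report's "not defined" case
            total += pi * math.log2(pi / qi)
    return total

print(kld([0.5, 0.5], [0.5, 0.5]))  # identical distributions -> 0.0
print(kld([0.5, 0.5], [0.5, 0.0]))  # Q(x) = 0 on P's support -> None
```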

The Jensen-Shannon divergence (JSD) solves this asymmetry by building on the KLD to calculate a symmetrical score, in the sense that the divergence of P from Q is the same as that of Q from P: \(D_{JS}(P||Q) = D_{JS}(Q||P)\). Additionally, JSD provides a smoothed and normalized version of KLD, with scores between 0 (meaning identical) and 1 (meaning entirely different) when using the base-2 logarithm, thus making it easier to relate the resulting score within a fixed finite range. Formally, the JSD is defined for two probability distributions P and Q as: $$D_{JS}(P||Q) := \frac{1}{2} D_{KL}\left(P \,\middle\Vert\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\left(Q \,\middle\Vert\, \frac{P+Q}{2}\right)$$

Given this advantage of the JSD over the KLD, the JSD is chosen as the default measure.
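A minimal sketch following the definition above (note that scipy.spatial.distance.jensenshannon returns the square root of this divergence, i.e., the JS distance):

```python
import math

def kld(p, q):
    """Discrete KLD, base-2 logarithm (terms with P(x) = 0 contribute nothing)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence; in [0, 1] with the base-2 logarithm.
    The mixture M = (P + Q) / 2 is nonzero wherever P or Q has mass,
    so the inner KLDs are always defined."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

print(jsd([0.5, 0.5], [0.5, 0.5]))  # identical -> 0.0
print(jsd([1.0, 0.0], [0.0, 1.0]))  # disjoint -> 1.0
```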

Both KLD and JSD do not account for the distance between instances in the probability space \(\mathcal{X}\). The earth mover's distance (EMD) between two empirical distributions, however, allows a notion of distance, such as the underlying geometry of the space, to be taken into account. Informally, the EMD is proportional to the minimum amount of work required to convert one distribution into the other.

Suppose we have a tessellation, each tile denoted by \(\{x_1, \ldots , x_n\}\), with a corresponding notion of distance \(dis(x_i, x_j)\) being the Haversine distance between the centroids of tiles \(x_i\) and \(x_j\). For two empirical distributions P and Q, represented by the visits in the given tiles \(\{p_1, \ldots , p_n\}\) and \(\{q_1, \ldots , q_n\}\) respectively, the EMD can be defined as $$EMD(P, Q) := \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} f_{ij} \cdot dis(x_i, x_j)}{\sum_{i=1}^{n}\sum_{j=1}^{n} f_{ij}},$$ where \(f_{ij}\) is the optimal flow that minimizes the work required to transform P into Q.

The amount of work is determined by the defined distance between instances (i.e., tiles), thus, it allows for an intuitive interpretation. In the given example, an EMD of 100 signifies that on average each record of the first distribution needs to be moved 100 meters to reproduce the second distribution. On the downside, there is no fixed range as for the JSD which provides values between 0 and 1. Thus the EMD always needs to be interpreted in the context of the dataset and the EMD of different datasets cannot be compared directly.

In the same manner, the EMD can be computed for histograms by defining a distance between histogram bins. To measure the distance between histogram bins, the difference between the midrange values of each bin pair is computed. For tiles, the centroid of each tile is used to compute the Haversine distance.
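For equally spaced 1-D histogram bins, the EMD reduces to the area between the two cumulative distributions, which allows a compact sketch (this shortcut is an assumption for the 1-D case; the tile-based EMD instead requires solving the optimal-flow problem over Haversine distances):

```python
def emd_1d(p, q, bin_width=1.0):
    """EMD between two normalized 1-D histograms with equally spaced bins.
    In one dimension, the EMD equals the area between the two CDFs."""
    cp = cq = work = 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        work += abs(cp - cq) * bin_width
    return work

# Moving all mass one 5-minute bin to the right costs 5 minutes of "work".
print(emd_1d([1.0, 0.0], [0.0, 1.0], bin_width=5.0))  # -> 5.0
```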

Thus, the EMD is available for the following analyses, with distances in the following units:

  • visits per tile: distance in meters
  • visits per time tile: average distance in meters for each timewindow
  • travel time: distance in minutes
  • jump length: distance in kilometers
  • trips per user: distance in counts of trips
  • radius of gyration: distance in kilometers

The EMD can only be computed if a notion of distance between histogram bins or tiles can be defined. For example, there is no trivial distance between weekdays (one could argue that the categorization into weekdays and weekend is more important than the number of days lying in between). Thus, we decided to omit the EMD if there is no intuitive distance measure. The EMD is the default measure for visits per tile and visits per tile timewindow, as the underlying geometry is especially important to account for here.

The Kendall's \(\tau\) coefficient, also known as the Kendall rank correlation coefficient, is a measure of the strength and direction of association between two variables measured on an ordinal scale. It is a non-parametric measure of statistical association based on the ranks of the data, i.e., the similarity of two rankings, such as the rankings of the most visited locations in two datasets. It returns a value between -1 and 1, where 1 indicates perfect agreement between the two rankings, -1 perfect disagreement (one ranking is the reverse of the other), and 0 no association. The strength of association is determined by the pattern of concordance (pairs ordered the same way) and discordance (pairs ordered differently) among all pairs, defined as follows: $$\tau= \frac{\textrm{number of concordant pairs} - \textrm{number of discordant pairs}}{\textrm{number of pairs}}$$

Let's consider a list of locations \(\langle l_1,...,l_n \rangle\) and let \(pop(D, l_i)\) denote the popularity of \(l_i\), i.e., the number of times \(l_i\) is visited by trajectories in dataset \(D\). We compute the popularity \(pop(D_{base}, l_i)\) for the base dataset and \(pop(D_{alt}, l_i)\) for the alternative dataset for all \(l_i\). Then, we say that a pair of locations \((l_i, l_j)\) is concordant if either of the following holds: \((pop(D_{base}, l_i) > pop(D_{base}, l_j)) \wedge (pop(D_{alt}, l_i) > pop(D_{alt}, l_j))\) or \((pop(D_{base}, l_i) < pop(D_{base}, l_j)) \wedge (pop(D_{alt}, l_i) < pop(D_{alt}, l_j))\), i.e., their popularity ranks (in sorted order) agree. They are said to be discordant if their ranks disagree.
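A direct sketch of this pairwise definition over hypothetical location counts (scipy.stats.kendalltau offers an optimized variant with explicit tie handling):

```python
from itertools import combinations

def kendall_tau(pop_base, pop_alt):
    """Kendall's tau over location popularities: (concordant - discordant)
    divided by the number of pairs; ties count as neither."""
    concordant = discordant = 0
    for li, lj in combinations(pop_base, 2):
        sign = (pop_base[li] - pop_base[lj]) * (pop_alt[li] - pop_alt[lj])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(pop_base) * (len(pop_base) - 1) // 2
    return (concordant - discordant) / n_pairs

base = {"A": 30, "B": 20, "C": 10}
print(kendall_tau(base, {"A": 300, "B": 200, "C": 100}))  # same ranking -> 1.0
print(kendall_tau(base, {"A": 100, "B": 200, "C": 300}))  # reversed -> -1.0
```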

The coverage of the top n locations is defined by the true positive ratio: $$\frac{|top_n(D_{base})\ \cap\ top_n(D_{alt})|}{n}$$, where n is the number of top locations and \(top_n(D_{base})\) is the n top locations of the base dataset and \(top_n(D_{alt})\) the n top locations of the alternative dataset.

This measure represents how similar the alternative dataset is to the base dataset with respect to the most visited locations.
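A minimal sketch of this true positive ratio, using hypothetical visit counts:

```python
def top_n_coverage(pop_base, pop_alt, n):
    """True positive ratio: share of the base dataset's top-n locations
    that are also among the alternative dataset's top-n locations."""
    top = lambda pop: set(sorted(pop, key=pop.get, reverse=True)[:n])
    return len(top(pop_base) & top(pop_alt)) / n

base = {"A": 40, "B": 30, "C": 20, "D": 10}
alt = {"A": 35, "C": 30, "E": 25, "B": 5}
print(top_n_coverage(base, alt, 2))  # top-2 {A, B} vs {A, C} -> 0.5
```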

Differential Privacy

This report provides differential privacy guarantees. The concept of differential privacy is that the output of an algorithm remains nearly unchanged if the records of one individual are removed or added. In this way, differential privacy limits the impact of a single individual on the analysis outcome, preventing the reconstruction of an individual's data. Broadly speaking, this is achieved by adding calibrated noise to the output, where the amount of noise is defined by the privacy_budget. Depending on the setting of user_privacy, noise is either calibrated to only protect single trips (item-level privacy) or to protect the privacy of users (user-level privacy). The privacy budget is split between all analyses. The configuration table provides information on the used privacy_budget, the budget_split and user_privacy. For each analysis, information is provided on the amount of privacy budget utilized.

The Laplace mechanism is used for counts and the Exponential mechanism for the five number summaries. Details on the notion of differential privacy and the used mechanisms are provided in the documentation.
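A sketch of the Laplace mechanism for a single count; the sensitivity value and the confidence-interval derivation are illustrative assumptions, not the report's exact calibration (which also folds in the budget split and the max. trips per user):

```python
import math
import random

def laplace_count(true_count, epsilon, sensitivity=1.0):
    """Laplace mechanism: add Laplace(0, sensitivity/epsilon) noise to a count.
    sensitivity=1 corresponds to item-level privacy (one record changes the
    count by at most 1); user-level privacy requires a larger sensitivity.
    The difference of two Exp(1) draws, scaled, is Laplace-distributed."""
    scale = sensitivity / epsilon
    return true_count + scale * (random.expovariate(1.0) - random.expovariate(1.0))

def laplace_ci95(epsilon, sensitivity=1.0):
    """Half-width of the 95% interval of the added noise:
    P(|noise| <= t) = 1 - exp(-t / scale) = 0.95  =>  t = scale * ln(20)."""
    return (sensitivity / epsilon) * math.log(20)
```

Note how the CI half-width grows as the per-analysis budget shrinks, which is why tables with a small budget share show wide intervals.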

The following table shows the key figures of the base and the alternative dataset.

The allocated privacy budgets for each dataset are shown below, and noise is applied accordingly to compute the estimate and the 95% confidence interval.

The symmetric mean absolute percentage error between base and alternative estimates is computed and indicated in the last column.

Base: privacy budget: None
Alternative: privacy budget: 0.0012
Base estimate (+/- 95% CI)    Alternative estimate (+/- 95% CI)    Error [0 to 2]
Number of records 2,834,268 (+/-0) 2,533,672 (+/-103,173.0) 0.11
Number of distinct trips 1,417,134 (+/-0) 1,266,836 (+/-103,173.0) 0.11
Number of complete trips (start and end point) 1,417,134 (+/-0) 1,266,836 (+/-103,173.0) 0.11
Number of incomplete trips (single point) 0 (+/-0) 0 (+/-51,586.5) 0.00
Number of distinct users 378,759 (+/-0) 379,428 (+/-10,317.3) 0.00
Number of distinct locations (lat & lon combination) 204,228 (+/-0) 127,648 (+/-103,173.0) 0.46

The following table shows the number of missing values for each column of the two datasets.

The allocated privacy budgets for each dataset are shown below, and noise is applied accordingly to compute the estimate and the 95% confidence interval.

The symmetric mean absolute percentage error between base and alternative estimates is computed and indicated in the last column.

Base: privacy budget: None
Alternative: privacy budget: 0.0012
Base estimate (+/- 95% CI)    Alternative estimate (+/- 95% CI)    Error [0 to 2]
User ID (uid) 0 (+/-0) 0 (+/-128,966.3) 0.00
Trip ID (tid) 0 (+/-0) 0 (+/-128,966.3) 0.00
Timestamp (datetime) 0 (+/-0) 32,759 (+/-128,966.3) 2.00
Latitude (lat) 0 (+/-0) 46,685 (+/-128,966.3) 2.00
Longitude (lng) 0 (+/-0) 0 (+/-128,966.3) 0.00

This visualization shows the relative number of trips on a timeline of the base and alternative dataset.

The allocated privacy budgets for each dataset are shown below, and noise is applied accordingly to compute the estimate (blue and orange lines) and the 95% confidence interval. The confidence interval is visualized as shaded error bands, colored respectively for the base and alternative dataset.

The y-axis shows the percentage of trips, while the x-axis shows the timeline, aggregated as indicated below the visualization. The legend indicates the colors for the base dataset and the alternative dataset.

Base: privacy budget: None 95% CI: +/- 0 %
Alternative: privacy budget: 0.0012 95% CI: +/- 3.2 %

Timestamps have been aggregated by date.

Kullback-Leibler divergence [0 to ∞]: 0.01
Jensen Shannon divergence [0 to 1]: 0.05
Symmetric mean absolute percentage error [0 to 2]: 2.00

Base Alternative
Min. 2018-04-18 2018-04-19
Max. 2018-04-20 2018-04-19

This histogram visualizes the relative number of trips per weekday for the base and alternative dataset.

The allocated privacy budgets for both datasets are shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

The y-axis shows the percentage of trips while the x-axis shows the weekdays. The legend indicates the color for the base dataset and the alternative dataset.

Base: privacy budget: None 95% CI: +/- 0 %
Alternative: privacy budget: 0.0012 95% CI: +/- 1.0 %

Kullback-Leibler divergence [0 to ∞]: not defined
Jensen Shannon divergence [0 to 1]: 0.05
Symmetric mean absolute percentage error [0 to 2]: 0.88

This linechart shows the relative number of trips per hour over the course of a day, disaggregated by weekday and weekend for the base and alternative dataset.

The allocated privacy budgets for both datasets are shown below, and noise is applied accordingly to compute the estimates (lines). The confidence interval is indicated below but not visualized in the graph, for visual clarity.

The legend shows the colors for each time category (weekday start, weekday end, weekend start, weekend end) indicating the start and end timestamp of each trip and if the trip was during the week or on the weekend. Furthermore, the lines are either continuous or dashed for the base and the alternative dataset.

The y-axis shows the percentage of trips while the x-axis shows the hour of the day.

Base: privacy budget: None 95% CI: +/- 0 %
Alternative: privacy budget: 0.0012 95% CI: +/- 1.0 %

Kullback-Leibler divergence [0 to ∞]: not defined
Jensen Shannon divergence [0 to 1]: 0.13
Symmetric mean absolute percentage error [0 to 2]: 0.70

The following map shows the divergence of the spatial distribution according to the provided tessellation between the two datasets. The deviation of the relative visits per tile of the alternative from the base dataset is shown. Additionally, the relative visits per tile of each dataset can be displayed.

One can choose from these three visualizations in the layer control at the top right. The legends are shown below.

The deviations range from -2 to 2. The deviations are computed as follows: (alternative - base) / ((|base| + |alternative|) / 2).

The relative visits per tile for both the base and the alternative dataset range from 0 to the maximum value of both relative visits.

The allocated privacy budgets for this map are shown below and noise is applied accordingly onto the relative counts. The confidence interval is indicated below.

All applicable similarity measures are displayed in the orange box below the map.

Base: privacy budget: None 95% CI: +/- 0 % of visit(s)
Alternative: privacy budget: 0.0581 95% CI: +/- 0.0 % of visit(s)

Deviations from base:


Visits per tile base:


Visits per tile alternative:


Base: 18 (0.0%) points are outside the given tessellation (95% confidence interval ± 0).

Alternative: 88 (0.0%) points are outside the given tessellation (95% confidence interval ± 516).

Kullback-Leibler divergence [0 to ∞]: 0.00
Jensen Shannon divergence [0 to 1]: 0.02
Earth mover's distance [0 to ∞]: 30.42
Symmetric mean absolute percentage error [0 to 2]: 0.10

This table shows the mean number of visits per tile for each dataset as well as the five-number summary consisting of: the most extreme values in the dataset (the maximum and minimum values), the lower and upper quartiles, and the median.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used.

The symmetric mean absolute percentage error computed over all the above values is displayed in the orange box below.

Base Alternative
Mean 7,343 6,709
Min. 14 0
25% 2,794 2,478
Median 5,597 5,010
75% 10,317 9,592
Max. 47,860 43,309

Symmetric mean absolute percentage error [0 to 2]: 0.32

The following visualization shows the cumulated relative number of visits of both datasets. This means that the tiles are sorted according to the number of visits in descending order and the relative number of visits are added tile by tile. Thus, you can use the graph to evaluate how many tiles are needed to cover a certain share of the visits.

If all tiles are visited equally, the cumulated sum follows a straight diagonal line. The larger the share of single tiles in the total number of visits, the steeper the curve.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used.

The legend indicates the color for the base dataset and the alternative dataset.
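The sorting-and-accumulating step described above can be sketched as follows (hypothetical visit counts, not the report's implementation):

```python
def cumulated_share(visits):
    """Sort tile visit counts in descending order and return the cumulative
    share of all visits covered after each additional tile."""
    counts = sorted(visits, reverse=True)
    total = sum(counts)
    shares, running = [], 0
    for c in counts:
        running += c
        shares.append(running / total)
    return shares

# One dominant tile -> steep start; equal tiles -> straight diagonal.
print(cumulated_share([70, 10, 10, 10]))  # -> [0.7, 0.8, 0.9, 1.0]
print(cumulated_share([25, 25, 25, 25]))  # -> [0.25, 0.5, 0.75, 1.0]
```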


The following visualization shows the ranking of most frequently visited tiles for the base and the alternative dataset.

The ranking includes the union of the top 10 most frequently visited tiles of both datasets, resulting in a minimum of 10 and a maximum of 20 most frequently visited tiles.

The y-axis shows the tile name (if provided) and tile ID in order of the ranking (starting with the top 10 base tiles). The x-axis shows the relative number of visits per tile.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used. The 95% confidence interval of the visits per tile indicated above also applies here and is visualized with error bars.

The legend indicates the color for the base dataset and the alternative dataset.

The Kendall rank correlation coefficient and the coverage of top n locations are displayed in the orange box below the map. Both measures are computed for the configured top n values (default: 10, 50, 100).


Kendall rank correlation coefficient of top n locations [-1 to 1]:
Top 10: 0.75
Top 50: 0.22
Top 100: 0.36
Coverage of top n locations [0 to 1]:
Top 10: 100.00%
Top 50: 98.04%
Top 100: 98.02%

Each map shows the arrivals (destinations) for the respective time window for each tile, split by weekday and weekend, as the deviation from base.

The deviations range from -2 to 2. The deviations are computed as follows: (alternative - base) / ((|base| + |alternative|) / 2).

The allocated privacy budgets for this map are shown below and noise is applied accordingly onto the relative counts, which are used to compute the deviation. The confidence interval is indicated below.

All applicable similarity measures are displayed in the orange box below the map.

Base: privacy budget: None 95% CI: +/- 0 visit(s)
Alternative: privacy budget: 0.3484 95% CI: +/- 43.0 visit(s)

Weekday

Deviation from base

User configuration of timewindows: ['2 - 6', '6 - 10', '10 - 14', '14 - 18', '18 - 22']

Kullback-Leibler divergence [0 to ∞]: not defined
Jensen Shannon divergence [0 to 1]: 0.03
Earth mover's distance [0 to ∞]: 121.29
Symmetric mean absolute percentage error [0 to 2]: 0.22

The following map shows the origin-destination (OD) flows between the tiles according to the provided tessellation for the base and the alternative dataset. Additionally, the intra-tile flow deviations from base are displayed. The intra-tile flow is defined as the number of OD connections that start and end in the same tile. These three visualizations can be chosen in the layer control.

The origin of the OD flows is indicated by a small circle and by clicking on one OD connection, information on the origin and destination cell name as well as the number of OD connections will show up.

The legend for the intra-tile flow deviations are below and range from -2 to 2. The deviations are computed as follows: (alternative - base) / ((|base| + |alternative|) / 2).

The allocated privacy budget for this map is shown below and noise is applied accordingly onto the relative counts. The confidence interval is indicated below.

All applicable similarity measures are displayed in the orange box below the map.

Base: privacy budget: None 95% CI: +/- 0 flow(s)
Alternative: privacy budget: 0.5807 95% CI: +/- 25.8 flow(s)

User configuration: display max. top 100 OD connections on map


Kullback-Leibler divergence [0 to ∞]: not defined
Jensen Shannon divergence [0 to 1]: 0.34
Symmetric mean absolute percentage error [0 to 2]: 1.27

This table shows the mean number of flows per OD pair for each dataset as well as the five-number summary consisting of: the most extreme values in the dataset (the maximum and minimum values), the lower and upper quartiles, and the median.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used.

The symmetric mean absolute percentage error computed over all the above values is displayed in the orange box below.

Base Alternative
Mean 14 19
Min. 1 1
25% 2 4
Median 4 10
75% 10 19
Max. 4,421 3,879

Symmetric mean absolute percentage error [0 to 2]: 0.34

The following visualization shows the cumulated relative number of flows per OD pair of both datasets. This means that the OD pairs are sorted according to the number of flows in descending order and the relative number of flows are added OD pair by OD pair. Thus, you can use the graph to evaluate how many OD pairs are needed to cover a certain share of the flows.

If all OD pairs are visited equally, the cumulated sum follows a straight diagonal line. The larger the share of a single OD pair in the total number of flows, the steeper the curve.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used.

The legend indicates the color for the base dataset and the alternative dataset.


The following visualization shows the ranking of most frequently visited OD connections for the base and the alternative dataset.

The ranking includes the union of the top 10 most frequent OD connections of both datasets, resulting in a minimum of 10 and a maximum of 20 connections.

The y-axis shows the tile name of origin and destination in order of the ranking (starting with the top 10 base connections). The x-axis shows the number of flows per OD pair.

These values are computed from the counts visualized above. Thus, no extra privacy budget is used. The 95% confidence interval of the flows per OD pair indicated above also applies here and is visualized with error bars.

The legend indicates the color for the base dataset and the alternative dataset.

The Kendall rank correlation coefficient and the coverage of top n locations are displayed in the orange box below the map. Both measures are computed for the configured top n values (default: 10, 50, 100).


Kendall rank correlation coefficient of top n flows [-1 to 1]:
Top 10: 1.00
Top 50: 0.25
Top 100: 0.12
Coverage of top n flows [0 to 1]:
Top 10: 100.00%
Top 50: 100.00%
Top 100: 97.00%

The following histogram shows the distribution of travel time for both datasets. The travel time is computed as the time difference between start and end timestamp of a trip in minutes.

The y-axis indicates the relative counts of trips while the x-axis shows the range of histogram bins in minutes, according to the user-configured bin size and maximum value.

The allocated privacy budgets for both datasets are shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

The legend indicates the color for the base dataset and the alternative dataset.

All applicable similarity measures are displayed in the orange box below.

Base: privacy budget: None 95% CI: +/- 0 %
Alternative: privacy budget: 0.0012 95% CI: +/- 6.0 %

User configuration for histogram chart:
maximum value: 120
bin size: 5

Kullback-Leibler divergence [0 to ∞]: 0.11
Jensen Shannon divergence [0 to 1]: 0.18
Earth mover's distance [0 to ∞]: 3.77
Symmetric mean absolute percentage error [0 to 2]: 1.05

Five number summary: travel time

Base Alternative
Min. 4.00 6.00
25% 17.00 17.00
Median 27.00 31.00
75% 44.00 44.00
Max. 525.00 73.00

Symmetric mean absolute percentage error [0 to 2]: 0.41

The following histogram shows the distribution of jump length for both datasets. The jump length is the straight-line distance between the origin and destination.

The y-axis indicates the relative counts of trips while the x-axis shows the range of histogram bins in kilometers, according to the user-configured bin size and maximum value.

The allocated privacy budgets for both datasets are shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

The legend indicates the color for the base dataset and the alternative dataset.

All applicable similarity measures are displayed in the orange box below.

Base: privacy budget: None 95% CI: +/- 0 %
Alternative: privacy budget: 0.0012 95% CI: +/- 6.2 %

User configuration for histogram chart:
maximum value: 10
bin size: 1

Kullback-Leibler divergence [0 to ∞]: 0.09
Jensen Shannon divergence [0 to 1]: 0.16
Earth mover's distance [0 to ∞]: 2.23
Symmetric mean absolute percentage error [0 to 2]: 0.38

Five number summary: jump length

Base Alternative
Min. 0.00 0.17
25% 0.90 0.88
Median 3.28 3.37
75% 6.56 8.73
Max. 38.97 15.93

Symmetric mean absolute percentage error [0 to 2]: 0.63

The following histogram shows the distribution of the radii of gyration for both datasets. The radius of gyration is the characteristic distance traveled by an individual during a period of time.
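As a sketch of the underlying quantity: the radius of gyration is the root mean squared distance of a user's visit points from their centroid. Planar coordinates are assumed here for simplicity; a geographic implementation would use Haversine distances on latitude/longitude instead.

```python
import math

def radius_of_gyration(points):
    """Root mean squared distance of (x, y) visit points from their centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

print(radius_of_gyration([(0, 0), (2, 0)]))  # centroid (1, 0) -> 1.0
```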

The y-axis indicates the relative number of users and the x-axis shows the range of the histogram bins in kilometers according to the user configured bin size and maximum value.

The allocated privacy budgets for both datasets are shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

The legend indicates the color for the base dataset and the alternative dataset.

All applicable similarity measures are displayed in the orange box below.

Base: privacy budget: None 95% CI: +/- 0 %
Alternative: privacy budget: 0.0012 95% CI: +/- 4.0 %

User configuration for histogram chart:
maximum value: 5
bin size: 0.5

Kullback-Leibler divergence [0 to ∞]: 0.02
Jensen Shannon divergence [0 to 1]: 0.07
Earth mover's distance [0 to ∞]: 1.05
Symmetric mean absolute percentage error [0 to 2]: 0.17

Five number summary: radius of gyration

Base Alternative
Min. 0.00 0.08
25% 1.40 1.32
Median 2.61 2.68
75% 4.51 5.48
Max. 18.70 9.47

Symmetric mean absolute percentage error [0 to 2]: 0.59

The following histogram shows the distribution of how many distinct tiles a user has visited for both datasets. It describes the diversity of locations a user visits.

The y-axis indicates the relative number of users and the x-axis shows the number of distinct tiles according to the user configured bin size and maximum value.

The allocated privacy budgets for both datasets are shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

The legend indicates the color for the base dataset and the alternative dataset.

All applicable similarity measures are displayed in the orange box below.

Base: privacy budget: None 95% CI: +/- 0 %
Alternative: privacy budget: 0.0012 95% CI: +/- 4.5 %

User configuration for histogram chart:
maximum value: 10
bin size: 1

Kullback-Leibler divergence [0 to ∞]: 0.10
Jensen Shannon divergence [0 to 1]: 0.18
Earth mover's distance [0 to ∞]: 0.19
Symmetric mean absolute percentage error [0 to 2]: 1.34

Five number summary: distinct tiles per user

Base Alternative
Min. 1 2
25% 2 2
Median 3 3
75% 3 3
Max. 12 5

Symmetric mean absolute percentage error [0 to 2]: 0.30

The following histogram shows the distribution of the mobility entropy for both datasets.

The mobility entropy characterizes the heterogeneity of the users' visitation patterns and can be interpreted as a measure of the predictability of a user's location. If a user only visits a single tile, the entropy is 0, i.e., their location is highly predictable. If a user visits, e.g., four different tiles 10 times each, the entropy is 1, i.e., their location is not predictable, as each of the four tiles is equally likely to be visited. Intuitively, the more trips per user are contained in the data, the more meaningful the mobility entropy.
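The example above corresponds to a normalized Shannon entropy; normalizing by the log of the number of visited tiles is an assumption of this sketch, not necessarily the report's exact normalization:

```python
import math

def mobility_entropy(visit_counts):
    """Normalized Shannon entropy of a user's visits across tiles:
    0 = all visits in one tile; 1 = uniform over the visited tiles."""
    counts = [c for c in visit_counts if c > 0]
    if len(counts) <= 1:
        return 0.0  # a single visited tile is fully predictable
    total = sum(counts)
    h = -sum(c / total * math.log2(c / total) for c in counts)
    return h / math.log2(len(counts))

print(mobility_entropy([10]))              # single tile -> 0.0
print(mobility_entropy([10, 10, 10, 10]))  # uniform over 4 tiles -> 1.0
```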

The y-axis indicates the relative number of users and the x-axis shows the range of histogram bins for the mobility entropy.

The allocated privacy budgets for both datasets are shown below and noise is applied accordingly to compute the estimate (bars) and the 95% confidence interval (error bar).

The legend indicates the color for the base dataset and the alternative dataset.

All applicable similarity measures are displayed in the orange box below.

Base: privacy budget: None 95% CI: +/- 0 %
Alternative: privacy budget: 0.0012 95% CI: +/- 3.9 %

Kullback-Leibler divergence [0 to ∞]: not defined
Jensen Shannon divergence [0 to 1]: 0.20
Earth mover's distance [0 to ∞]: 0.04
Symmetric mean absolute percentage error [0 to 2]: 1.17

Five number summary: mobility entropy

Base Alternative
Min. 0.00 0.00
25% 0.92 0.91
Median 0.96 0.96
75% 1.00 1.00
Max. 1.00 1.00

Symmetric mean absolute percentage error [0 to 2]: 0.00