 | Base | Alternative |
---|---|---|
Max. trips per user | 16 | 5 |
Privacy budget | - | 1.00 |
User privacy | True | 1.00 |
Budget split | Evenly distributed | OD flows: 500, Visits per tile: 50, Visits per time tile: 300 |
Evaluation dev. mode | False | |
Excluded analyses | User time delta |
This benchmark report evaluates the similarity of two mobility reports using similarity measures. Specifically, a set of measures is computed for each chosen analysis of the report; the results can be found below each analysis, indicated by a light orange background.
In the following, the similarity measures relative error (RE), Kullback-Leibler divergence (KLD), Jensen-Shannon divergence (JSD), earth mover's distance (EMD) and symmetric mean absolute percentage error (SMAPE) are explained, together with the reasoning why specific measures are available for each analysis and which measure is the default measure and why.
The symmetric mean absolute percentage error (SMAPE) is an accuracy measure based on percentage (or relative) errors. In contrast to the mean absolute percentage error, SMAPE has both a lower bound (0, meaning identical) and an upper bound (2, meaning entirely different). $$ SMAPE = \frac{1}{n} \sum_{i=1}^{n} \frac {|alternative_{i} - base_{i}|}{(|base_{i}| + |alternative_{i}|) \div 2}$$
SMAPE is computed for all analyses.
For single counts (e.g., dataset statistics, missing values), n=1, with \(base_{i}\) (respectively \(alternative_{i}\)) referring to the respective count value. For the five-number summary, n=5, with \(base_{i}\) (respectively \(alternative_{i}\)) referring to the \(i\)-th value of the summary. For all other analyses, n equals the number of histogram bins.
SMAPE is employed as the default measure for single counts and for the evaluation of the five-number summary, as KLD, JSD and EMD are not suitable for these.
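To make the computation concrete, here is a minimal sketch of how SMAPE could be computed with NumPy (the helper name `smape` and the example call are illustrative, not the report's implementation):

```python
import numpy as np

def smape(base, alternative) -> float:
    """Symmetric mean absolute percentage error, bounded by 0 (identical) and 2."""
    base = np.asarray(base, dtype=float)
    alternative = np.asarray(alternative, dtype=float)
    denom = (np.abs(base) + np.abs(alternative)) / 2
    # If both values of a pair are 0, that pair contributes an error of 0.
    safe_denom = np.where(denom == 0, 1.0, denom)
    errors = np.where(denom == 0, 0.0, np.abs(alternative - base) / safe_denom)
    return float(errors.mean())

# Single count (n = 1), e.g. the number of records from the dataset statistics below:
print(smape([2_834_268], [2_533_672]))  # ~0.11
```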
The Kullback-Leibler divergence (KLD), also called relative entropy, is a widely used statistic to measure how far a probability distribution P deviates from a reference probability distribution Q on the same probability space \(\mathcal{X}\). For discrete distributions P and Q, it is formally defined as: $$D_{KL}(P||Q):= \sum\limits_{x \in \mathcal{X}: P(x)>0} P(x)\cdot \log \frac{P(x)}{Q(x)}$$
For example, considering a tessellation (\(\mathcal{X}\)) the spatial distribution can be evaluated by comparing the relative counts in the synthetic data (P) per tile x to the relative counts per tile in the reference dataset (Q). The larger the deviation of P from Q, the larger the value of the resulting KLD, with a minimum value of 0 for identical distributions.
Note that KLD is not symmetric, i.e., \(D_{KL}(P||Q)~\neq~D_{KL}(Q||P)\), which is why KLD is best applicable in settings with a reference model Q and a fitted model P. However, the lack of symmetry implies that it is not a distance metric in the mathematical sense.
It is worth noting that KLD is only defined if \(Q(x)\neq 0\) for all x in the support of P, while this constraint is not required for JSD. In practice, both KLD and JSD are computed for discrete approximations of continuous distributions, e.g., histograms approximating the relative number of trips over time based on daily or hourly counts. However, the choice of histogram bins has an impact in two respects: Say we want to compare the number of visits per tile. Depending on the granularity of the chosen tessellation, there might be tiles with 0 visits in the real dataset but >0 visits in the synthetic dataset, thus KLD would not be defined for such cases. Additionally, the resulting values for both KLD and JSD vary according to the choice of bins, e.g., by reducing the granularity of the tessellation, the values of KLD and JSD will tend to be smaller.
The KLD is computed for the following analyses: trips over time, trips per weekday, trips per hour, visits per tile, visits per tile timewindow, OD flows, travel time, jump length and radius of gyration. The KLD might be None (i.e., not defined), since it requires \(Q(x)\neq 0\) for all x in the support of P; therefore, the JSD is set as the default measure instead.
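As an illustration (not the report's implementation), the KLD of an alternative histogram from a base histogram could be computed as follows; `scipy.special.rel_entr` returns infinity as soon as a bin has a zero base probability but a non-zero alternative probability, which corresponds to the "not defined" case above:

```python
import numpy as np
from scipy.special import rel_entr

def kld(base_counts, alternative_counts) -> float:
    """D_KL(P || Q) of the alternative distribution (P) from the base distribution (Q).

    Counts are normalised to relative frequencies first. The result is inf
    ("not defined") if some bin has a zero base probability but a non-zero
    alternative probability.
    """
    p = np.asarray(alternative_counts, dtype=float)
    q = np.asarray(base_counts, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    # rel_entr computes p * log(p / q) elementwise and is 0 wherever p == 0.
    return float(rel_entr(p, q).sum())

# Hypothetical daily trip counts over three days:
print(kld(base_counts=[120, 80, 100], alternative_counts=[110, 90, 95]))
```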
The Jensen-Shannon divergence (JSD) resolves the asymmetry of the KLD by building on it to calculate a symmetrical score, in the sense that the divergence of P from Q is the same as that of Q from P: \(D_{JS}(P||Q) = D_{JS}(Q||P)\). Additionally, the JSD provides a smoothed and normalized version of the KLD, with scores between 0 (meaning identical) and 1 (meaning entirely different) when using the base-2 logarithm, which makes the resulting score easier to interpret within a fixed finite range. Formally, the JSD is defined for two probability distributions P and Q as: $$D_{JS}(P||Q) := \frac{1}{2} D_{KL}\left(P \,\middle\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\left(Q \,\middle\|\, \frac{P+Q}{2}\right)$$
Given these advantages over the KLD, the JSD is chosen as the default measure over the KLD.
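Correspondingly, a sketch of the JSD with base-2 logarithm (note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon *distance*, i.e., the square root of the divergence, so the result is squared; the example counts are illustrative):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd(base_counts, alternative_counts) -> float:
    """Jensen-Shannon divergence in [0, 1] (base-2 log), symmetric in its arguments."""
    p = np.asarray(alternative_counts, dtype=float)
    q = np.asarray(base_counts, dtype=float)
    # SciPy normalises the inputs itself and returns the Jensen-Shannon distance,
    # i.e. the square root of the divergence, hence the squaring.
    return float(jensenshannon(p, q, base=2) ** 2)

# Defined even if a base count is 0, unlike the KLD:
print(jsd(base_counts=[120, 0, 100], alternative_counts=[110, 90, 95]))
```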
Neither KLD nor JSD accounts for the distance between instances in the probability space \(\mathcal{X}\). The earth mover's distance (EMD) between two empirical distributions, however, allows a notion of distance, such as the underlying geometry of the space, to be taken into account. Informally, the EMD is proportional to the minimum amount of work required to convert one distribution into the other.
Suppose we have a tessellation with tiles denoted by \(\{x_1, \ldots , x_n\}\) and a corresponding notion of distance \(dis(x_i, x_j)\), here the Haversine distance between the centroids of tiles \(x_i\) and \(x_j\). For two empirical distributions P and Q, represented by the visits in the given tiles \(\{p_1, \ldots , p_n\}\) and \(\{q_1, \ldots , q_n\}\), respectively, the EMD can be defined as $$EMD(P,Q) := \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} f_{ij} \cdot dis(x_i, x_j)}{\sum_{i=1}^{n}\sum_{j=1}^{n} f_{ij}}$$ where \(f_{ij}\) is the optimal flow that minimizes the work required to transform P into Q.
The amount of work is determined by the defined distance between instances (i.e., tiles), which allows for an intuitive interpretation. In the given example, an EMD of 100 signifies that, on average, each record of the first distribution needs to be moved 100 meters to reproduce the second distribution. On the downside, there is no fixed range as for the JSD, which provides values between 0 and 1. Thus, the EMD always needs to be interpreted in the context of the dataset, and EMD values of different datasets cannot be compared directly.
In the same manner, the EMD can be computed for histograms by defining a distance between histogram bins: the distance between two bins is measured as the difference between their midrange values. For tiles, the centroid of each tile is used to compute the Haversine distance.
Thus the EMD is available for the following analyses provided in the following units:
The EMD can only be computed if a notion of distance between histogram bins or tiles is available. For example, there is no trivial distance between weekdays (one could argue that the distinction between weekdays and the weekend matters more than the number of days lying in between). Thus, we decided to omit the EMD where there is no intuitive distance measure. The EMD is the default measure for visits per tile and visits per tile timewindow, as the underlying geometry is especially important to account for here.
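For one-dimensional histograms, the bin-midrange distance described above can be sketched with `scipy.stats.wasserstein_distance`, using the midrange values as support points and the counts as weights (the bin configuration and counts below are hypothetical, loosely following the travel time histogram configuration further below):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical histogram: travel times binned into 5-minute bins from 0 to 120.
bin_edges = np.arange(0, 125, 5)
midpoints = (bin_edges[:-1] + bin_edges[1:]) / 2

# Hypothetical counts per bin for the base and the alternative dataset.
base_counts = np.random.default_rng(0).integers(10, 100, size=len(midpoints))
alt_counts = np.random.default_rng(1).integers(10, 100, size=len(midpoints))

# Each bin's probability mass sits at its midrange value; the counts act as weights.
# The resulting EMD is expressed in the unit of the bins (here: minutes).
emd = wasserstein_distance(midpoints, midpoints,
                           u_weights=base_counts, v_weights=alt_counts)
print(f"EMD: {emd:.2f} minutes")
```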
The Kendall's \(\tau\) coefficient, also known as the Kendall rank correlation coefficient, is a measure of the strength and direction of association between two variables measured on an ordinal scale. It is a non-parametric measure of statistical association based on the ranks of the data, i.e., the similarity of two rankings, such as the rankings of the most visited locations in two datasets. It returns a value between -1 and 1, where 1 indicates perfect agreement of the two rankings, -1 indicates perfect disagreement (inverted rankings), and values around 0 indicate no association. The strength of association is determined by the pattern of concordance (pairs ordered in the same way) and discordance (pairs ordered differently) among all pairs, defined as follows: $$\tau= \frac{\textrm{number of concordant pairs} - \textrm{number of discordant pairs}}{\textrm{number of pairs}}$$
Consider a list of locations \(\langle l_1,...,l_n \rangle\) and let \(pop(D, l_i)\) denote the popularity of \(l_i\), i.e., the number of times \(l_i\) is visited by trajectories in dataset \(D\). The popularity \(pop(D_{base}, l_i)\) is computed for the base dataset and \(pop(D_{alt}, l_i)\) for the alternative dataset for all \(l_i\). A pair of locations \((l_i, l_j)\) is then called concordant if either of the following holds: \((pop(D_{base}, l_i) > pop(D_{base}, l_j)) \wedge (pop(D_{alt}, l_i) > pop(D_{alt}, l_j))\) or \((pop(D_{base}, l_i) < pop(D_{base}, l_j)) \wedge (pop(D_{alt}, l_i) < pop(D_{alt}, l_j))\), i.e., their popularity ranks (in sorted order) agree. The pair is called discordant if the ranks disagree.
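A minimal sketch of this computation using `scipy.stats.kendalltau` on per-location popularity counts (hypothetical values; note that SciPy computes the tau-b variant by default, which additionally corrects for ties):

```python
from scipy.stats import kendalltau

# Popularity (visit counts) of the same five locations in both datasets,
# listed in the same location order (hypothetical values):
pop_base = [47_000, 31_200, 28_540, 17_900, 12_400]
pop_alternative = [43_000, 29_800, 30_100, 16_200, 13_900]

# SciPy computes the tau-b variant by default, which additionally corrects for ties.
tau, p_value = kendalltau(pop_base, pop_alternative)
print(f"Kendall's tau: {tau:.2f}")
```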
The coverage of the top n locations is defined by the true positive ratio: $$\frac{|top_n(D_{base})\ \cap\ top_n(D_{alt})|}{n}$$, where n is the number of top locations and \(top_n(D_{base})\) is the n top locations of the base dataset and \(top_n(D_{alt})\) the n top locations of the alternative dataset.
This measure indicates how similar the alternative dataset is to the base dataset with respect to the most visited locations.
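A minimal sketch of the coverage computation, assuming visit counts per location are available as pandas Series indexed by location ID (hypothetical data; the helper `top_n_coverage` is illustrative):

```python
import pandas as pd

def top_n_coverage(visits_base: pd.Series, visits_alternative: pd.Series, n: int) -> float:
    """Share of the base dataset's top-n locations that are also among the
    alternative dataset's top-n locations (true positive ratio)."""
    top_base = set(visits_base.nlargest(n).index)
    top_alternative = set(visits_alternative.nlargest(n).index)
    return len(top_base & top_alternative) / n

# Hypothetical visit counts per tile ID:
base = pd.Series({"t1": 500, "t2": 300, "t3": 120, "t4": 80, "t5": 10})
alternative = pd.Series({"t1": 450, "t2": 110, "t3": 310, "t4": 95, "t5": 20})
print(top_n_coverage(base, alternative, n=3))  # 1.0: both top-3 sets contain t1, t2, t3
```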
This report provides differential privacy guarantees. The concept of differential privacy is that the output of an algorithm remains nearly unchanged if the records of one individual are removed or added. In this way, differential privacy limits the impact of a single individual on the analysis outcome, preventing the reconstruction of an individual's data. Broadly speaking, this is achieved by adding calibrated noise to the output, where the amount of noise is determined by the privacy_budget. Depending on the setting of user_privacy, the noise is calibrated either to protect only single trips (item-level privacy) or to protect the privacy of users (user-level privacy). The privacy budget is split between all analyses. The configuration table provides information on the used privacy_budget, the budget_split and user_privacy. For each analysis, information is provided on the amount of utilized privacy budget.
The Laplace mechanism is used for counts and the Exponential mechanism for the five number summaries. Details on the notion of differential privacy and the used mechanisms are provided in the documentation.
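As an illustration of how a count could be perturbed with the Laplace mechanism (a simplified sketch, not the exact implementation used for this report; the default sensitivity of 1 assumes that one record changes a count by at most 1):

```python
import numpy as np

def laplace_count(true_count: float, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    sensitivity = 1 corresponds to item-level privacy, where adding or removing a
    single trip changes the count by at most 1; for user-level privacy the
    sensitivity grows with the maximum number of trips a user may contribute
    (cf. "Max. trips per user" in the configuration table).
    """
    rng = np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: releasing a record count with a very small budget yields large noise.
print(laplace_count(2_834_268, epsilon=0.0012))
```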
Base: | privacy budget: None |
---|---|
Alternative: | privacy budget: 0.0012 |
 | Base: Estimate | (+/- 95% CI) | Alternative: Estimate | (+/- 95% CI) | Error [0 to 2] |
---|---|---|---|---|---|
Number of records | 2,834,268 | (+/-0) | 2,533,672 | (+/-103,173.0) | 0.11 |
Number of distinct trips | 1,417,134 | (+/-0) | 1,266,836 | (+/-103,173.0) | 0.11 |
Number of complete trips (start and end point) | 1,417,134 | (+/-0) | 1,266,836 | (+/-103,173.0) | 0.11
Number of incomplete trips (single point) | 0 | (+/-0) | 0 | (+/-51,586.5) | 0.00 |
Number of distinct users | 378,759 | (+/-0) | 379,428 | (+/-10,317.3) | 0.00 |
Number of distinct locations (lat & lon combination) | 204,228 | (+/-0) | 127,648 | (+/-103,173.0) | 0.46 |
Base: | privacy budget: None |
---|---|
Alternative: | privacy budget: 0.0012 |
 | Base: Estimate | (+/- 95% CI) | Alternative: Estimate | (+/- 95% CI) | Error [0 to 2] |
---|---|---|---|---|---|
User ID (uid) | 0 | (+/-0) | 0 | (+/-128,966.3) | 0.00 |
Trip ID (tid) | 0 | (+/-0) | 0 | (+/-128,966.3) | 0.00 |
Timestamp (datetime) | 0 | (+/-0) | 32,759 | (+/-128,966.3) | 2.00 |
Latitude (lat) | 0 | (+/-0) | 46,685 | (+/-128,966.3) | 2.00 |
Longitude (lng) | 0 | (+/-0) | 0 | (+/-128,966.3) | 0.00 |
Base: | privacy budget: None | 95% CI: +/- 0 % |
---|---|---|
Alternative: | privacy budget: 0.0012 | 95% CI: +/- 3.2 % |
Timestamps have been aggregated by date.
Kullback-Leibler divergence [0 to ∞]: 0.01
Jensen Shannon divergence [0 to 1]: 0.05
Symmetric mean absolute percentage error [0 to 2]: 2.00
 | Base | Alternative |
---|---|---|
Min. | 2018-04-18 | 2018-04-19 |
Max. | 2018-04-20 | 2018-04-19 |
Base: | privacy budget: None | 95% CI: +/- 0 % |
---|---|---|
Alternative: | privacy budget: 0.0012 | 95% CI: +/- 1.0 % |
Kullback-Leibler divergence [0 to ∞]: not defined
Jensen Shannon divergence [0 to 1]: 0.05
Symmetric mean absolute percentage error [0 to 2]: 0.88
Base: | privacy budget: None | 95% CI: +/- 0 % |
---|---|---|
Alternative: | privacy budget: 0.0012 | 95% CI: +/- 1.0 % |
Kullback-Leibler divergence [0 to ∞]: not defined
Jensen Shannon divergence [0 to 1]: 0.13
Symmetric mean absolute percentage error [0 to 2]: 0.70
Base: | privacy budget: None | 95% CI: +/- 0 % of visit(s) |
---|---|---|
Alternative: | privacy budget: 0.0581 | 95% CI: +/- 0.0 % of visit(s) |
Deviations from base:
Visits per tile base:
Visits per tile alternative:
Base: 18 (0.0%) points are outside the given tessellation (95% confidence interval ± 0).
Alternative: 88 (0.0%) points are outside the given tessellation (95% confidence interval ± 516).
Kullback-Leibler divergence [0 to ∞]: 0.00
Jensen Shannon divergence [0 to 1]: 0.02
Earth mover's distance [0 to ∞]: 30.42
Symmetric mean absolute percentage error [0 to 2]: 0.10
 | Base | Alternative |
---|---|---|
Mean | 7,343 | 6,709 |
Min. | 14 | 0 |
25% | 2,794 | 2,478 |
Median | 5,597 | 5,010 |
75% | 10,317 | 9,592 |
Max. | 47,860 | 43,309 |
Symmetric mean absolute percentage error [0 to 2]: 0.32
Kendall rank correlation coefficient of top n locations [-1 to 1]:
Top 10: 0.75
Top 50: 0.22
Top 100: 0.36
Coverage of top n locations [0 to 1]:
Top 10: 100.00%
Top 50: 98.04%
Top 100: 98.02%
Base: | privacy budget: None | 95% CI: +/- 0 visit(s) |
---|---|---|
Alternative: | privacy budget: 0.3484 | 95% CI: +/- 43.0 visit(s) |
User configuration of timewindows: ['2 - 6', '6 - 10', '10 - 14', '14 - 18', '18 - 22']
Kullback-Leibler divergence [0 to ∞]: not defined
Jensen Shannon divergence [0 to 1]: 0.03
Earth mover's distance [0 to ∞]: 121.29
Symmetric mean absolute percentage error [0 to 2]: 0.22
Base: | privacy budget: None | 95% CI: +/- 0 flow(s) |
---|---|---|
Alternative: | privacy budget: 0.5807 | 95% CI: +/- 25.8 flow(s) |
User configuration: display max. top 100 OD connections on map
Kullback-Leibler divergence [0 to ∞]: not defined
Jensen Shannon divergence [0 to 1]: 0.34
Symmetric mean absolute percentage error [0 to 2]: 1.27
 | Base | Alternative |
---|---|---|
Mean | 14 | 19 |
Min. | 1 | 1 |
25% | 2 | 4 |
Median | 4 | 10 |
75% | 10 | 19 |
Max. | 4,421 | 3,879 |
Symmetric mean absolute percentage error [0 to 2]: 0.34
Kendall rank correlation coefficient of top n flows [-1 to 1]:
Top 10: 1.00
Top 50: 0.25
Top 100: 0.12
Coverage of top n flows [0 to 1]:
Top 10: 100.00%
Top 50: 100.00%
Top 100: 97.00%
Base: | privacy budget: None | 95% CI: +/- 0 % |
---|---|---|
Alternative: | privacy budget: 0.0012 | 95% CI: +/- 6.0 % |
User configuration for histogram chart:
maximum value: 120
bin size: 5
Kullback-Leibler divergence [0 to ∞]: 0.11
Jensen Shannon divergence [0 to 1]: 0.18
Earth mover's distance [0 to ∞]: 3.77
Symmetric mean absolute percentage error [0 to 2]: 1.05
 | Base | Alternative |
---|---|---|
Min. | 4.00 | 6.00 |
25% | 17.00 | 17.00 |
Median | 27.00 | 31.00 |
75% | 44.00 | 44.00 |
Max. | 525.00 | 73.00 |
Symmetric mean absolute percentage error [0 to 2]: 0.41
Base: | privacy budget: None | 95% CI: +/- 0 % |
---|---|---|
Alternative: | privacy budget: 0.0012 | 95% CI: +/- 6.2 % |
User configuration for histogram chart:
maximum value: 10
bin size: 1
Kullback-Leibler divergence [0 to ∞]: 0.09
Jensen Shannon divergence [0 to 1]: 0.16
Earth mover's distance [0 to ∞]: 2.23
Symmetric mean absolute percentage error [0 to 2]: 0.38
 | Base | Alternative |
---|---|---|
Min. | 0.00 | 0.17 |
25% | 0.90 | 0.88 |
Median | 3.28 | 3.37 |
75% | 6.56 | 8.73 |
Max. | 38.97 | 15.93 |
Symmetric mean absolute percentage error [0 to 2]: 0.63
Base: | privacy budget: None | 95% CI: +/- 0 % |
---|---|---|
Alternative: | privacy budget: 0.0012 | 95% CI: +/- 4.0 % |
User configuration for histogram chart:
maximum value: 5
bin size: 0.5
Kullback-Leibler divergence [0 to ∞]: 0.02
Jensen Shannon divergence [0 to 1]: 0.07
Earth mover's distance [0 to ∞]: 1.05
Symmetric mean absolute percentage error [0 to 2]: 0.17
 | Base | Alternative |
---|---|---|
Min. | 0.00 | 0.08 |
25% | 1.40 | 1.32 |
Median | 2.61 | 2.68 |
75% | 4.51 | 5.48 |
Max. | 18.70 | 9.47 |
Symmetric mean absolute percentage error [0 to 2]: 0.59
Base: | privacy budget: None | 95% CI: +/- 0 % |
---|---|---|
Alternative: | privacy budget: 0.0012 | 95% CI: +/- 4.5 % |
User configuration for histogram chart:
maximum value: 10
bin size: 1
Kullback-Leibler divergence [0 to ∞]: 0.10
Jensen Shannon divergence [0 to 1]: 0.18
Earth mover's distance [0 to ∞]: 0.19
Symmetric mean absolute percentage error [0 to 2]: 1.34
 | Base | Alternative |
---|---|---|
Min. | 1 | 2 |
25% | 2 | 2 |
Median | 3 | 3 |
75% | 3 | 3 |
Max. | 12 | 5 |
Symmetric mean absolute percentage error [0 to 2]: 0.30
Base: | privacy budget: None | 95% CI: +/- 0 % |
---|---|---|
Alternative: | privacy budget: 0.0012 | 95% CI: +/- 3.9 % |
Kullback-Leibler divergence [0 to ∞]: not defined
Jensen Shannon divergence [0 to 1]: 0.20
Earth mover's distance [0 to ∞]: 0.04
Symmetric mean absolute percentage error [0 to 2]: 1.17
 | Base | Alternative |
---|---|---|
Min. | 0.00 | 0.00 |
25% | 0.92 | 0.91 |
Median | 0.96 | 0.96 |
75% | 1.00 | 1.00 |
Max. | 1.00 | 1.00 |
Symmetric mean absolute percentage error [0 to 2]: 0.00