Chilling Effect Resulting from Mass Surveillance
Jon Penney (2016) [1] explored whether the widespread publicity about NSA/PRISM surveillance (i.e., the Snowden revelations) in June 2013 was associated with a sharp and sudden decrease in traffic to Wikipedia articles on topics that raise privacy concerns. This post tries to reproduce some of these findings.
This post is based on an exercise from chapter 2 of Matthew J. Salganik’s book Bit by Bit: Social Research in the Digital Age.
Introduction
Penney (2016) explored whether the widespread publicity about NSA/PRISM surveillance (i.e., the Snowden revelations) in June 2013 was associated with a sharp and sudden decrease in traffic to Wikipedia articles on topics that raise privacy concerns. If so, this change in behavior would be consistent with a chilling effect resulting from mass surveillance. The approach of Penney (2016) is sometimes called an interrupted time series design.
To choose the topic keywords, Penney referred to the list used by the US Department of Homeland Security for tracking and monitoring social media. The DHS list categorizes certain search terms into a range of issues, such as “Health Concern,” “Infrastructure Security,” and “Terrorism.” For the study group, Penney used the 48 keywords related to “Terrorism” (see appendix table 8). He then aggregated Wikipedia article view counts on a monthly basis for the corresponding 48 Wikipedia articles over a 32-month period from the beginning of January 2012 to the end of August 2014. To strengthen his argument, he also created several comparison groups by tracking article views on other topics.
Now we are going to replicate and extend Penney (2016). All the raw data needed for this activity is available from Wikipedia (https://dumps.wikimedia.org/other/pagecounts-raw/). Alternatively, we can get it from the R package wikipediatrend (Meissner and Team 2016).
Testing the wikipediatrend package
row | language | article | date | views |
---|---|---|---|---|
732 | en | python_(programming_language) | 2019-11-15 | 7301 |
1 | en | r_(programming_language) | 2019-11-15 | 3052 |
733 | en | python_(programming_language) | 2019-11-16 | 4948 |
2 | en | r_(programming_language) | 2019-11-16 | 2279 |
734 | en | python_(programming_language) | 2019-11-17 | 5353 |
3 | en | r_(programming_language) | 2019-11-17 | 2259 |
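The package returns daily counts, while Penney's analysis works with monthly totals. As a minimal sketch of that roll-up, written in Python rather than the R used for this post, the snippet below sums the six daily rows shown above by article and month. The function name `monthly_totals` is hypothetical and only serves this illustration.

```python
from collections import defaultdict

# Daily rows copied from the table above: (article, ISO date, views).
daily_views = [
    ("python_(programming_language)", "2019-11-15", 7301),
    ("r_(programming_language)", "2019-11-15", 3052),
    ("python_(programming_language)", "2019-11-16", 4948),
    ("r_(programming_language)", "2019-11-16", 2279),
    ("python_(programming_language)", "2019-11-17", 5353),
    ("r_(programming_language)", "2019-11-17", 2259),
]


def monthly_totals(rows):
    """Sum daily views per (article, 'YYYY-MM') pair."""
    totals = defaultdict(int)
    for article, day, views in rows:
        month = day[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[(article, month)] += views
    return dict(totals)


totals = monthly_totals(daily_views)
for (article, month), views in sorted(totals.items()):
    print(month, article, views)
```

With only three days of data the "monthly" totals here are partial, but the same grouping applies unchanged to a full 32-month download.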
Reproduction
Part A
Read Penney (2016) and replicate his figure 2, which shows the page views for “Terrorism”-related pages before and after the Snowden revelations. Interpret the findings.
topic_keyword | wikipedia_articles | government_trouble | browser_delete | privacy_sensitive | avoidance |
---|---|---|---|---|---|
Al Qaeda | http://en.wikipedia.org/wiki/Al-Qaeda | 2.20 | 2.11 | 2.21 | 2.84 |
Terrorism | http://en.wikipedia.org/wiki/terrorism | 2.19 | 2.05 | 2.16 | 2.79 |
Terror | http://en.wikipedia.org/wiki/terror | 1.98 | 1.96 | 2.01 | 2.64 |
Attack | http://en.wikipedia.org/wiki/attack | 1.92 | 1.91 | 1.92 | 2.56 |
Iraq | http://en.wikipedia.org/wiki/iraq | 1.60 | 1.74 | 1.76 | 2.25 |
Afghanistan | http://en.wikipedia.org/wiki/afghanistan | 1.61 | 1.71 | 1.75 | 2.23 |
Part B
Next, replicate figure 4A, which compares the study group (“Terrorism”-related articles) with a comparator group using keywords categorized under “DHS & Other Agencies” from the DHS list (see appendix table 10 and footnote 139). Interpret the findings.
topic_keyword | wikipedia_articles |
---|---|
Department of Homeland Security | https://en.wikipedia.org/wiki/United_States_Department_of_Homeland_Security |
Federal Emergency Management Agency | https://en.wikipedia.org/wiki/Federal_Emergency_Management_Agency |
Coast Guard | https://en.wikipedia.org/wiki/Coast_guard |
Customs and Border Protection | https://en.wikipedia.org/wiki/Customs_and_Border_Protection |
Border patrol | https://en.wikipedia.org/wiki/Border_Patrol |
Secret Service | https://en.wikipedia.org/wiki/Secret_Service |
Extra
The Statistical Model
Jesse Lecy and Federica Fusi’s chapter on Interrupted Time Series [2] describes the setup we need. In mathematical terms, the time series equation includes four key coefficients:
$$ Y=b_{0}+b_{1}T+b_{2}D+b_{3}P+e $$
Where:
- $ Y $ is the outcome variable;
- $ T $ is a continuous variable which indicates the time (e.g., days, months, years…) passed from the start of the observational period;
- $ D $ is a dummy variable indicating whether the observation was collected before (=0) or after (=1) the policy intervention;
- $ P $ is a continuous variable indicating the time passed since the intervention occurred (before the intervention, $ P $ is equal to 0).
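The three regressors can be derived from the dates alone. The following is a minimal Python sketch (not the R code behind this post) that builds $ T $, $ D $ and $ P $ for the 32 months of the study window. It assumes, as in the dataset shown further down, that July 2013 is the first month coded as post-intervention. The helper name `its_columns` is hypothetical.

```python
from datetime import date


def its_columns(months, intervention):
    """Return a (T, D, P) tuple for each month.

    months       -- chronologically ordered first-of-month dates
    intervention -- first month counted as post-intervention (assumption)
    """
    rows = []
    for i, month in enumerate(months):
        t = i + 1  # time since the start of the window, 1-based
        d = 1 if month >= intervention else 0
        if d:
            # months since the intervention, counting its first month as 1
            p = ((month.year - intervention.year) * 12
                 + (month.month - intervention.month) + 1)
        else:
            p = 0
        rows.append((t, d, p))
    return rows


# Study window: January 2012 to August 2014 (32 months).
months = [date(2012 + m // 12, m % 12 + 1, 1) for m in range(32)]
columns = its_columns(months, intervention=date(2013, 7, 1))

# Print the months around the intervention (March to October 2013).
for month, (t, d, p) in zip(months[14:22], columns[14:22]):
    print(month, t, d, p)
```

Coding $ P $ so that it starts at 1 in the first post-intervention month is one convention; starting it at 0 would shift the meaning of the $ D $ coefficient by one period.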
To fit this model, we need a dataset with one row per time period and the columns $ Y $, $ T $, $ D $ and $ P $. So, let’s build ours:
date | Y | T | D | P |
---|---|---|---|---|
2013-03-01 | 3868814 | 15 | 0 | 0 |
2013-04-01 | 3595489 | 16 | 0 | 0 |
2013-05-01 | 3235093 | 17 | 0 | 0 |
2013-06-01 | 2725519 | 18 | 0 | 0 |
2013-07-01 | 2399060 | 19 | 1 | 1 |
2013-08-01 | 2451065 | 20 | 1 | 2 |
2013-09-01 | 2559672 | 21 | 1 | 3 |
2013-10-01 | 2550031 | 22 | 1 | 4 |
We can see a significant drop at the moment of the event (the coefficient of $ D $) and also a change in the trend after the event (the coefficient of $ P $) compared with the trend before it (the coefficient of $ T $).
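For readers who want to see how such coefficients can be estimated, here is a minimal Python sketch of ordinary least squares for $ Y=b_{0}+b_{1}T+b_{2}D+b_{3}P+e $, solved through the normal equations. It is not the R code used for this post, the function name `ols` is hypothetical, and it is fitted only to the eight-row excerpt shown in the table above, so its estimates will not match a fit on the full 32 months.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# (Y, T, D, P) rows copied from the dataset excerpt above.
excerpt = [
    (3868814, 15, 0, 0),
    (3595489, 16, 0, 0),
    (3235093, 17, 0, 0),
    (2725519, 18, 0, 0),
    (2399060, 19, 1, 1),
    (2451065, 20, 1, 2),
    (2559672, 21, 1, 3),
    (2550031, 22, 1, 4),
]
X = [[1, t, d, p] for _, t, d, p in excerpt]  # leading 1 is the intercept
y = [row[0] for row in excerpt]

b0, b1, b2, b3 = ols(X, y)
print("intercept b0:", b0)
print("pre-trend b1:", b1)
print("level change b2:", b2)
print("trend change b3:", b3)
```

The normal equations are fine for a four-column design like this one; for larger or badly scaled problems a QR-based solver from a numerical library would be the safer choice.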
Statistical Model: Control Group
Again from Jesse Lecy and Federica Fusi’s Interrupted Time Series:
Time series are also subject to threats to internal validity, such as:
- Another event occurred at the same time as the intervention and caused the immediate and sustained effect that we observe;
- Selection processes, as only some individuals are affected by the policy intervention.
To address these issues, you can:
- Use as a control a group that is not subject to the intervention (e.g., students who do not attend the well-being class).
This design helps ensure that the observed effect is the result of the policy intervention. The data will have two observations for each point in time and will include a dummy variable to differentiate the treatment (=1) and the control (=0). The model has a similar structure, but (1) we will include a dummy variable that indicates the treatment and the control group and (2) we will interact the group dummy variable with all three time series coefficients to see if there is a statistically significant difference across the two groups.
You can see this in the following equation, where $ G $ is a dummy indicating treatment and control group.
$$ Y=b_{0}+b_{1}T+b_{2}D+b_{3}P+b_{4}G+b_{5}G \cdot T+b_{6}G \cdot D+b_{7}G \cdot P $$
date | Y | T | D | P | G |
---|---|---|---|---|---|
2013-04-01 | 1032305 | 16 | 0 | 0 | 0 |
2013-04-01 | 3595489 | 16 | 0 | 0 | 1 |
2013-05-01 | 873192 | 17 | 0 | 0 | 0 |
2013-05-01 | 3235093 | 17 | 0 | 0 | 1 |
2013-06-01 | 747513 | 18 | 0 | 0 | 0 |
2013-06-01 | 2725519 | 18 | 0 | 0 | 1 |
2013-07-01 | 632677 | 19 | 1 | 1 | 0 |
2013-07-01 | 2399060 | 19 | 1 | 1 | 1 |
2013-08-01 | 641141 | 20 | 1 | 2 | 0 |
2013-08-01 | 2451065 | 20 | 1 | 2 | 1 |
2013-09-01 | 705947 | 21 | 1 | 3 | 0 |
2013-09-01 | 2559672 | 21 | 1 | 3 | 1 |
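To make the interaction terms concrete, the short Python sketch below (an illustration, not the R code behind this post) expands one observation into the eight regressors of the control-group model. The helper name `design_row` is hypothetical; the two example rows are the 2013-07-01 observations from the table above.

```python
def design_row(t, d, p, g):
    """Regressors [1, T, D, P, G, G*T, G*D, G*P] for one observation."""
    return [1, t, d, p, g, g * t, g * d, g * p]


# The two rows for 2013-07-01: control (G=0) and treatment (G=1).
control = design_row(t=19, d=1, p=1, g=0)
treatment = design_row(t=19, d=1, p=1, g=1)

print(control)    # [1, 19, 1, 1, 0, 0, 0, 0]
print(treatment)  # [1, 19, 1, 1, 1, 19, 1, 1]
```

For the control group every interaction column is zero, so only the first four coefficients apply to it; the last four are switched on only for the treatment group.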
To interpret the coefficients, remember that $ G $ is coded 1 for the treatment group and 0 for the control group. The group dummy $ b_{4} $ (coef $ G $) indicates the baseline difference between the treatment and the control group. $ b_{5} $ (coef $ T:G $) represents the slope difference between the treatment and control groups in the pre-intervention period. $ b_{6} $ (coef $ D:G $) represents the difference between the treatment and control groups in the immediate change associated with the intervention. $ b_{7} $ (coef $ P:G $) represents the difference between the treatment and control groups in the sustained effect after the intervention.
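One way to read these coefficients is to write the model out separately for each group. Setting $ G=0 $ (control) and $ G=1 $ (treatment) in the equation above gives:

$$ G=0:\quad Y=b_{0}+b_{1}T+b_{2}D+b_{3}P $$

$$ G=1:\quad Y=(b_{0}+b_{4})+(b_{1}+b_{5})T+(b_{2}+b_{6})D+(b_{3}+b_{7})P $$

Subtracting the first line from the second leaves $ b_{4}+b_{5}T+b_{6}D+b_{7}P $, so each of $ b_{4} $ to $ b_{7} $ is a treatment-minus-control difference: in the intercept, the pre-intervention slope, the immediate change, and the post-intervention trend change, respectively.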
Extra II
Finding Change Points in Time Series
After these analyses, I wondered whether it is possible to find change points in a time series automatically. Interrupted time series analysis assumes a “change point” and then checks whether the series does, in fact, change its behavior there. But how can we check that this really is a transition point? As we know, a linear regression fitted on an arbitrary choice of points (an interval, in this case) can find false patterns.
In fact, there are plenty of R packages that can detect change points in a time series, which is ideal for making this type of analysis more robust. Jonas Kristoffer Lindeløv compared some of these packages in a vignette for his own new package, mcp, which detects change points and fits regressions with multiple change points.
Let’s try this package in our scenario:
That is cool: we detect a change point (parameter $ cp_{1} $) around the 16th period, that is, June, in our dataset. The model also shows us the parameters fitted in each linear regression ($ int $ for intercepts and $ T $ for slopes).
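As a rough cross-check of the idea, a single change point can also be located with a plain least-squares grid search. The Python sketch below is not how mcp works (mcp fits a Bayesian model); it tries every admissible split, fits a separate straight line to each side, and keeps the split with the smallest total squared error. The function names and the synthetic series are made up for illustration only.

```python
def fit_line(xs, ys):
    """Simple linear regression; returns (intercept, slope, sse)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse


def best_split(xs, ys, min_size=3):
    """Index of the first point of the second segment that minimizes
    the summed squared error of two independent line fits."""
    best_k, best_sse = None, float("inf")
    for k in range(min_size, len(xs) - min_size + 1):
        sse = fit_line(xs[:k], ys[:k])[2] + fit_line(xs[k:], ys[k:])[2]
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k


# Synthetic series: a falling trend, then a drop to a slowly rising one.
xs = list(range(20))
ys = [100 - 5 * x if x < 10 else 20 + 2 * x for x in xs]
print(best_split(xs, ys))  # prints 10
```

Unlike mcp, this search returns a single point estimate with no measure of uncertainty, so it is best treated as a sanity check rather than a replacement.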
References
- [1] Penney, Jonathon. 2016. “Chilling Effects: Online Surveillance and Wikipedia Use.” Berkeley Technology Law Journal 31 (1): 117. doi:10.15779/Z38SS13. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2769645
- [2] Lecy, Jesse, and Federica Fusi. “Foundations of Program Evaluation: Regression Tools for Impact Analysis.” https://ds4ps.org/pe4ps-textbook/docs/index.html