User talk:West.andrew.g/Popular pages/Archive 3

From Wikipedia, the free encyclopedia

WMF now reporting mobile/zero pageviews

Just one month ago I took to many of these same talk pages to explain that WMF statistics were under-reporting per-article views by approximately 1/3, because mobile traffic was not being included in those totals. Further details were included in a Signpost article. I'd like to commend the WMF for quickly rectifying that situation, as files including mobile and wp-zero traffic are now available. The Wikipedia Zero project currently sees very little traffic, but I'll be including it in all my reports regardless (recall that mobile views were also minor just a few years ago).

You'll notice the WP:5000 and WP:TOPRED now break down the (increased) totals into "mobile" and "wp-zero" percentages (the complement being the "desktop" views we had previously). This will be the case from the OCT-14-2014 report onwards. In addition to the higher totals, another immediate benefit is that articles with very low mobile participation are often indicative of bot/misconfigured traffic. Though an intelligent, malicious spammer can evade this by altering user agent strings, I anticipate this will be of great utility moving forward.

I know the WMF has reached out to stats.grok.se about updating their user-facing tool. I greatly look forward to having this new data on board, and aside from the fact it's going to make year-end aggregation a bit messy, I'm excited to see what we can learn from deeper dives into the data. Thanks, West.andrew.g (talk) 23:43, 9 October 2014 (UTC)

@The ed17: -- Should you wish to add an update to the Signpost or note it in the coming week. West.andrew.g (talk) 23:49, 9 October 2014 (UTC)

Thanks for this - I'm now confused as to which figures do and don't include mobile. grok.se - no; wmf tools lab (eg baglama2) ??, popular pages by project ?? And so on. Can you cast any light? There are strange variations in the %s - lots of factors are evidently at work. Johnbod (talk) 16:11, 10 October 2014 (UTC)

@Johnbod: -- At the current time, I am inclined to believe I am the only one who is ingesting this new data. Notice that the documentation surrounding this data was authored just about one week ago. This isn't something that comes for free, all tool authors will need to point their tools to the new URL and write the code to aggregate data for a page title (the "desktop", "mobile", and "zero" view totals are all represented by separate and non-adjacent rows in the raw files). I imagine that anyone who has gone to this trouble would probably note it. I have spoken with Erik Zachte who does some of the official WMF statistics (not sure if that reaches into the reports you mentioned), and I know integration is on his agenda, but he hasn't gotten there yet. Thanks, West.andrew.g (talk) 16:38, 10 October 2014 (UTC)
Ok, thanks - certainly this deserves a Signpost story at some time. Having just yesterday cheerfully inflated some bragging page-view figures in my Wikipedian in Residence role by 400,000 for last month to allow for mobile views, I'm relieved this still seems correct, & in fact understated, since medical articles like film stars have high mobile %s! Johnbod (talk) 16:44, 10 October 2014 (UTC)

Excluding 0% and 100% mobile views

Since these are obviously fake, it would be a good idea to exclude them I think. It's interesting to note that if you excluded everything below 0.5% mobile, most of the longterm exclusions on the top 25 list would disappear, though it's hard to say if this wouldn't take out legitimate views as well. Serendipodous 14:05, 19 October 2014 (UTC)

Exclude them from the WP:5000? What advantage do you feel that provides? Certainly the mobile percentages are helpful in determining exclusions for the WP:Top25Report, but I don't know we should go so far as manipulating/filtering what we essentially present as "raw aggregation data". I suppose if someone is intentionally spamming an article, removal from the list might deny them recognition. However, there are also times when automated views are interesting and warrant further investigation. Notably, the largest "views by hour" ever for Wikipedia was on the Jyllands-Posten Muhammad cartoons controversy article as the result of a highly coordinated DDoS attack -- and while not human views, I think this is the type of phenomena/event I would appreciate seeing reflected in the WP:5000. West.andrew.g (talk) 13:41, 21 October 2014 (UTC)
Well from my perspective the advantage would be that it would stop people saying I'm a charlatan who randomly assembles the top 25 out of whatever articles he can justify including in it. Serendipodous 15:37, 21 October 2014 (UTC)
I certainly don't feel that way, and anyone who does certainly isn't speaking on behalf of consensus or with any hard evidence (which mobile percentages and our WMF contact provides us). Why not create a hard and fast rule (perhaps with footnote), that any article receiving less than 1% -- or more than 99% -- mobile traffic is automatically excluded from the report? I am happy to endorse and back you up where necessary. It just seems unnecessary to suppress raw data from the masses to appease a few problematic agitators. West.andrew.g (talk) 16:58, 21 October 2014 (UTC)
I also agree we shouldn't suppress any raw data, as I like to see it. But I also agree that any article with 1% or 99% mobile views should be removed from the Top 25. Indeed, even higher numbers could be evidence of non-human views making the bulk of all views. Presumably some humans view Online shopping daily, but when the non-human views dominate we can be comfortable that there is no way it would make the Top25.--Milowenthasspoken 20:13, 21 October 2014 (UTC)
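The "hard and fast rule" proposed above (exclude anything below 1% or above 99% mobile from the Top 25 while leaving the raw WP:5000 untouched) could be sketched roughly as follows. This is only an illustration, not the code behind either report: the row layout, the function names, and the sample rows are all hypothetical, and the thresholds are parameters so that other cut-offs can be tried.

```python
def is_excluded(mobile_pct, low=1.0, high=99.0):
    """Return True when the mobile share falls outside [low, high] percent."""
    return mobile_pct < low or mobile_pct > high


def filter_for_top25(rows, low=1.0, high=99.0):
    """Keep only rows whose mobile share looks organic under the rule.

    Each row is assumed to be a (title, views, mobile_pct) tuple; that layout
    is an assumption made for this sketch only.
    """
    return [row for row in rows if not is_excluded(row[2], low, high)]


# Hypothetical sample rows, not taken from any real report.
rows = [
    ("Hypothetical article A", 120000, 0.0),
    ("Hypothetical article B", 95000, 41.7),
    ("Hypothetical article C", 80000, 100.0),
]

kept = filter_for_top25(rows)
print([title for title, _views, _pct in kept])
```

Because the filter returns a new list, the full table stays available for anyone who wants to inspect the excluded entries.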

Incomplete statistics week of DEC-25

My data ingestion reports were giving some unusual status updates, so I decided to do some digging. Per the raw files at [1], it seems a couple of hourly files are missing on DEC-22 and DEC-23, with many hours missing on DEC-25. I have not attempted to contact those who manage this data set, but it could be possible that the aggregation service is getting crushed by holiday traffic (a significant portion of the world population off of work, new devices coming online, etc.). I expect that our weekly aggregate report will process just fine tomorrow night, but the "Top 25" authors will quickly notice the week-scale sums seem a bit off, and they should probably indicate the cause in their write-up. Thanks, West.andrew.g (talk) 08:12, 27 December 2014 (UTC)

The 10,000 most popular Wikipedia articles of 2014

Greetings! My yearly aggregation of pageview statistics is complete and available at User:West.andrew.g/2014 Popular pages.

With mobile statistics becoming available near the year's end, we are able to use that data to make some inferences about articles whose popularity might be non-organic. I'm excited for 2015 to have an entire year of mobile statistics as we try to better understand those trends. Not just demonstrating the popularity of Google Doodles or a fascination with morbidity, reports like this can serve as a way to prioritize editing efforts towards impactful outcomes. I'm pleased to report that User:Doc James and I just got a paper accepted at JMIR that examines pageview trends on health/medicine articles across all of Wikipedia's language editions.

This list (albeit slightly delayed) tends to be quite popular, and I'd appreciate your help in publicizing it internally and externally. Thanks, West.andrew.g (talk) 14:47, 21 January 2015 (UTC)

@The ed17: @Serendipodous: @Milowent: Pinging some of the usual suspects
{Indigenous australian, Amazon.com, Thanksgiving, IPv6, Alliteration, Pornography, and Subaru Justy} ... All we've excluded before, but all have higher than expected mobile percentages. Exclude? Serendipodous 15:10, 21 January 2015 (UTC)
All of these are definitely higher on the list than they should be, but the question is how far down they would fall without non-human views. Indigenous australian, ipv6, alliteration, and subaru justy would be far far lower.--Milowenthasspoken 15:26, 21 January 2015 (UTC)
I think this was precisely the point of my "footnote [g]". We are accustomed to having a complete week's mobile data to characterize a complete week's traffic. Our normal bounds need to adjust slightly with this partial-year mobile analysis. If an attack occurs on a single day(s), its weekly impact is massive, but that effect may be washed out somewhat by several months of normal organic traffic. West.andrew.g (talk) 15:33, 21 January 2015 (UTC)

Draft done. Numbers are down 18% from last year. Can't say whether that's a trend. Serendipodous 23:53, 21 January 2015 (UTC)

Also, could someone put the two years in two columns, so they're side by side? I tried and couldn't. Serendipodous 23:55, 21 January 2015 (UTC)

Something weird is going on

Saving Mr. Banks has 364,866 hits this week on the 5000 chart, but only 58728 views on the per article chart, and that's per month. Serendipodous 15:53, 30 January 2015 (UTC)

Checks out as correct in my database and the raw data. I show a JAN-25 spike of 343,962 views being the catalyst here. We have observed past bugs in Grok.se's aggregation algorithms. I'll also note that we are now using a slightly different data source than what Grok ingests, because we are using one annotated with mobile and wp.zero views (both produced by the WMF, and we would expect them to be the same). Thanks, West.andrew.g (talk) 18:01, 30 January 2015 (UTC)
Note that the article only has 2.89% mobile views on the top 5000, which means something is afoot with a bunch of those views. Though that means we'd expect to see the huge spike on Jan 25 show up on Grok's chart (as showing only desktop views), this would not be the first case where that hasn't happened. I don't know how mobile and non-mobile are divided, but there must be anomalies. E.g., Angelsberg gets 100% mobile views and regularly is excluded from the Top25, but I assume this isn't some mobile phone gone amok in reality.--Milowenthasspoken 18:17, 30 January 2015 (UTC)
The mobile vs. desktop distinction is made using user agent strings. With normal web viewing, these are hard coded into whatever browser you are using. However, one can also write code to set them to arbitrary values (i.e., a desktop-based spam script can trivially masquerade as a mobile one). If a spammer were intelligent they'd mix their use of desktop and mobile triggering strings to make the traffic appear organic by our metrics. West.andrew.g (talk) 20:04, 30 January 2015 (UTC)
Hmmm. If you were trying to affect the Top 25, you'd do that. Does anyone care? I've wondered for awhile if Stephen Hawking's 12 week run in the Top 25 is not a bit suspicious.--Milowenthasspoken 20:22, 30 January 2015 (UTC)
While Stephen Hawking is the subject of a Hollywood movie, I would expect most views of his biography to be organic. I visited that page a few times recently myself. -- WeijiBaikeBianji (talk, how I edit) 21:47, 30 January 2015 (UTC)

Something really weird is going on

What's the story with #206? Hawkeye7 (talk) 21:15, 5 March 2015 (UTC)

It has 0% mobile, which is pretty much physically impossible unless it's a spambot, so it really doesn't matter what the story is. Serendipodous 21:32, 5 March 2015 (UTC)

Pageview statistics for medical articles published in JMIR

I'm happy to announce that User:Doc James and I recently had our paper "Wikipedia and Medicine: Quantifying Readership, Editors, and the Significance of Natural Language" published in the Journal of Medical Internet Research (JMIR). The journal is open-access, so I don't need to ramble too much here about the content. It suffices to say that we look at statistics surrounding the content, editor demographics, editing behaviors, and page view statistics of Wikipedia:WikiProject_Medicine and its international equivalents. My largest contribution was mining 2013 traffic statistics to analyze articles, topics (intra-language article clusters), and language editions. The journal buries this fact somewhat, but we also published a data appendix that points to our "raw-aggregate" data and extends the tables presented in the main write-up. Thanks for tolerating this bit of shameless self promotion, and be on the lookout for deeper insights in the future (mobile data, more statistical dorky-ness). West.andrew.g (talk) 16:24, 19 March 2015 (UTC)

Zero %

There are three columns with this one on the right. What does it mean? Simply south ...... time, deparment skies for just 9 years 13:39, 28 March 2015 (UTC)

The final column is the percentage for Wikipedia Zero, a project that allows poor people to access Wikipedia on their mobile phones without incurring charges. So far, hardly anyone uses it, so the percentages are zero. Serendipodous 14:09, 28 March 2015 (UTC)

Just did an approximate count of the "deleteables"

and there are about 200 in total, going by mobile count. I know that Andrew wants to keep the list "pure" but that still is 200 human entries we're excluding. Serendipodous 15:05, 28 March 2015 (UTC)

It takes approximately zero effort for me to expand the list to arbitrary length. The "Top 5200" doesn't have a nice ring to it, though (and only represents a 4% increase in size). The quantity of 5000 was chosen because beyond 5000 entries, some client browsers start to choke under all the DOM manipulation, and sometimes WP chokes when new reports are added either due to size or parsing issues. What is your suggestion? Is anyone really reading the table to that depth? Thanks, West.andrew.g (talk) 15:18, 11 May 2015 (UTC)
I'm just wondering if it's worth it to exclude the articles below 2% mobile count. Serendipodous 15:48, 11 May 2015 (UTC)

Long entries

Would it be possible to truncate or throw some line breaks into the long entries that overly widen the table, like "-webkit-linear-gradient(top,%20transparent,%20transparent),%20data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%"? Trivialist (talk) 16:25, 10 May 2015 (UTC)

This is currently being done, with truncation occurring on "titles" beyond 64 characters. These cases are indicated by the introduction of ellipsis "..." into the wikilink. Personally, the 64 character threshold makes for a reasonable looking table on my machine/resolution. I'd welcome community input on what others think about the situation and what a consensus threshold might be. Thanks, West.andrew.g (talk) 15:29, 11 May 2015 (UTC)
I mentioned that specific link because the pipelinking doesn't seem to be working — it shows up for me as
[[-webkit-linear-gradient(top,%20transparent,%20transparent),%20data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%|-webkit-linear-gradient(top,%20transparent,%20transparent),%20d...]]
instead of
webkit-linear-gradient(top,%20transparent,%20transparent),%20d....
which would be fine. Trivialist (talk) 21:32, 11 May 2015 (UTC)
Note that there is a pipe "|" character in the above string. Some character (or combination thereof) is preventing the WP parser from reading this as a wikilink. Some quick testing suggests it is more complicated than a single character. Can someone see if the wikilink definition documents a possible issue? Needless to say this particular name is a corner case that breaks traditional linking and thus our truncation strategy. West.andrew.g (talk) 05:01, 12 May 2015 (UTC)
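For readers following the truncation discussion, here is a minimal sketch of the piped-wikilink approach described in this thread: titles longer than 64 characters keep their full target but display a shortened label ending in "...". The function name is hypothetical, and whether the real report counts the ellipsis toward the 64 characters is not stated above, so the cut point used here is an assumption.

```python
MAX_TITLE_CHARS = 64  # threshold mentioned in the thread


def truncated_wikilink(title, limit=MAX_TITLE_CHARS):
    """Build a wikilink, piping a shortened label for over-long titles.

    Assumption: the label is the first `limit` characters plus "...".
    """
    if len(title) <= limit:
        return f"[[{title}]]"
    label = title[:limit] + "..."
    return f"[[{title}|{label}]]"


print(truncated_wikilink("Bill Gates"))
print(truncated_wikilink("x" * 70))
```

This only produces the wikitext; as noted above, whether the parser then renders it as a link depends on the title itself being one MediaWiki accepts.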

Other Wikis

Does anyone watching this page know if the same top article list is available on other wikis? We're in the process of writing an API to get this kind of data (I'm on the WMF analytics team) and I'd like to check out what's available. Thanks in advance for any info. --Milimetric (talk) 20:02, 22 May 2015 (UTC)

  • Milimetric, it is not available on any other language wikipedia, as far as I am aware.--Milowenthasspoken 22:24, 26 May 2015 (UTC)
    • I am sure you have also seen these breakdowns by Wikipedia Project [2]. They too are unfortunately not available in other languages but it would be excellent if they were. Also these ones do not contain mobile which it would be nice if they did. Doc James (talk · contribs · email) 23:03, 26 May 2015 (UTC)

Delayed report this week (2015-MAR-23)

The WP:5000 is currently delayed, see Wikipedia:Village_pump_(technical)#Server_rejecting_large_edit_via_browser_and_API. West.andrew.g (talk) 02:33, 27 May 2015 (UTC)

A wikitext copy of the WP:5000 has been posted to [3] should anyone want to view the data or compile the WP:TOP25 report. West.andrew.g (talk) 02:33, 27 May 2015 (UTC)
... or maybe try an alternative method of posting the content. West.andrew.g (talk) 02:39, 27 May 2015 (UTC)

Report for 2015-MAR-30

The raw copy has been posted to [4] for this week while I continue troubleshooting efforts. West.andrew.g (talk) 18:48, 1 June 2015 (UTC)

 Done - Posting issue is believed to be resolved. Previous reports have been re-generated/posted and automatic update should happen over the weekend as expected. West.andrew.g (talk) 15:21, 3 June 2015 (UTC)

Where can I find tables for longer time periods?

Can someone point me to a page where the same table exists for longer time periods, i.e. past month or even past year? I think this would be pretty interesting as well. EvM-Susana (talk) 13:01, 10 June 2015 (UTC)

@EvM-Susana: My user page links to reports for whole year 2013 and 2014. Thanks, West.andrew.g (talk) 13:52, 10 June 2015 (UTC)
Thanks a lot! For me it was interesting that the page on Bill Gates is in 135th position in 2014. I guess several PhDs could be done on researching trends over the years, shifts in popularity of pages, trivia versus serious stuff and so forth. Thanks for making these reports available! EvM-Susana (talk) 20:58, 10 June 2015 (UTC)

Angelsberg?

@West.andrew.g: It seems odd that a tiny stub article about a town with 283 inhabitants would not only get 1,450,161 hits out of nowhere, but that 100% of them would be mobile. It seems even more odd since http://stats.grok.se/en/201505/Angelsberg only reports 170 hits for the entire month of May (which seems much more reasonable) and https://tools.wmflabs.org/wikitrends/english-most-visited-this-month.html doesn't list it in the top pages, despite you having it with over 1.4 million hits each week this month (including the unposted report you linked to above). Is there a glitch in your data somewhere? --Ahecht (TALK PAGE) 15:53, 27 May 2015 (UTC)

@Ahecht: Searching through the archives of this and related pages you'll see obscure articles shoot to the top because of malicious/misconfiguration actions. It is trivial for someone to write a script, set their user agent string to something that identifies the client as "mobile", and just keep hitting the same page. I don't believe the two sources you reference (stats.grok and WikiTrends) ingest the new mobile data, which has only been around for a couple months (if you search the archives, you'll find me throwing a bit of a fit when I learned that the data the WMF provided us for a long time completely excluded mobile traffic; the same data most legacy services are still using). Therefore, mobile views are not included in the totals as they are here. Since the traffic is 99.9999999% mobile, they are missing a huge part of the picture. I have confirmed this against the raw data, a snip of which is below:
> grep "Angelsberg" pagecounts-20150520-180000
en Angelsberg 1 27547
en.m Angelsberg 8634 138595128
... Which shows that on a single hour of last week there were ~8.6k views from mobile devices but just 1 from a "desktop" client (data). The more interesting question is not "if", but "why". West.andrew.g (talk) 16:33, 27 May 2015 (UTC)
@West.andrew.g: Ahh, I didn't realize the other sources excluded mobile data. It would seem that the hits are real, even if they're not legitimate readers. Makes you wonder if the hits used a mobile user agent specifically to avoid detection. --Ahecht (TALK PAGE) 16:54, 27 May 2015 (UTC)
  • Angelsberg has been getting very high mobile counts for a long time now. (Maybe months?) On the WP:TOP25 we automatically exclude anything with greater than 95% mobile counts because those have been shown from experience to be mostly non-human views. In any given week I would say at least 10 articles are excluded in creating the Top 25 Report.--Milowenthasspoken 19:23, 27 May 2015 (UTC)
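The two raw rows quoted in the grep output above can be combined into per-title totals along the following lines. This is a sketch under stated assumptions, not the report's actual ingestion code: the four whitespace-separated fields are read as project code, title, view count, and bytes; "en" and "en.m" are taken from the thread, while "en.zero" is an assumed label for Wikipedia Zero rows; the function names are hypothetical.

```python
from collections import defaultdict

# Rows copied from the grep output quoted in this thread.
RAW_ROWS = """\
en Angelsberg 1 27547
en.m Angelsberg 8634 138595128
"""

# "en" and "en.m" appear in the thread; "en.zero" is an assumption.
PLATFORM_BY_PROJECT = {"en": "desktop", "en.m": "mobile", "en.zero": "zero"}


def aggregate(lines):
    """Sum view counts per title, split by platform."""
    totals = defaultdict(lambda: {"desktop": 0, "mobile": 0, "zero": 0})
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed rows rather than guess at their meaning
        project, title, views, _bytes = fields
        platform = PLATFORM_BY_PROJECT.get(project)
        if platform is None:
            continue  # other projects/languages are ignored in this sketch
        totals[title][platform] += int(views)
    return dict(totals)


def mobile_share(counts):
    """Mobile views as a percentage of all views for one title."""
    total = sum(counts.values())
    return 100.0 * counts["mobile"] / total if total else 0.0


totals = aggregate(RAW_ROWS.splitlines())
share = mobile_share(totals["Angelsberg"])
print(totals["Angelsberg"], f"{share:.2f}% mobile")
```

For that single hour the mobile share works out to 8634 of 8635 views, well above the 95% cut-off mentioned for the Top 25.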


The Angelsberg mystery has been solved with some help from @Hashar:! That page is used by the load balancer in front of en.m.wikipedia.org to check the health of the backend MediaWiki servers. See hieradata/common/lvs/configuration.yaml in the WMF operations/puppet repo --BDavis (WMF) (talk) 22:09, 28 July 2015 (UTC)

Traffic pattern suggests a Reddit thread, but I can't find anything in the last week. Nothing trending in the news, nothing on the web over the same period. Serendipodous 18:09, 16 August 2015 (UTC)

Resolved at Wikipedia talk:Top 25 Report. Thanks, West.andrew.g (talk) 02:34, 18 August 2015 (UTC)

Hmmm...

The top 4 by mobile views from Wikipedia Zero are all featured articles, with # 4119 Ruby Laffoon, a US politician who died in March of 1941, coming in at #1 for Wikipedia Zero with 27,129 views, 36.35% from mobile, and 1.91% of those views from Wikipedia Zero. That seems odd. Biosthmors (talk) pls notify me (i.e. {{U}}) while signing a reply, thx 21:48, 29 August 2015 (UTC)

I know very little about how Wikipedia Zero operates. Maybe you could cross-post this to a talk page of theirs? Is this just something done internal to carrier networks, or is there an associated app that might prominently list featured articles in some way? Thanks, West.andrew.g (talk) 19:27, 31 August 2015 (UTC)

Data source

Hey @West.andrew.g:, I noticed you are still using [5]. FYI: there is a newer version of the same data, with same format, but with spiders filtered, at [6]. Cheers, Erik Zachte (talk) 23:37, 19 December 2015 (UTC)

Thanks for the pointer! I may make the switch in 2016, but not before, just to simplify some of the year end aggregations. Thanks, West.andrew.g (talk) 01:35, 21 December 2015 (UTC)

5k

Hi, could you tell me where the 5000 page is pls? 2606:6000:610A:9000:1D0F:636F:39A:867D (talk) 16:25, 20 December 2015 (UTC)

The report is now up. Uploading a new 5000 row table every week amounts to a pretty massive "edit". Sometimes the WMF servers aren't too happy with this and throw back an error. We changed our workflow to blank the page between updates, as this simplifies diff computation, which seems to be where the servers are timing-out. When a couple retries fail, we are left with a blank page, and I have to retry manually. West.andrew.g (talk) 01:53, 21 December 2015 (UTC)

Year-end reports on deck

Please see Wikipedia_talk:Top_25_Report#Year-end_reports_on_deck. Thank you and happy New Year! West.andrew.g (talk) 08:24, 31 December 2015 (UTC)

New top list

Hello, I am Max from Jerusalem, I contacted you by e-mail. Thank you for your reply and advice. It helped me to create Multiyear ranking of most viewed Wikipedia pages, top-100 from December 1, 2007 to January 31, 2016. — Preceding unsigned comment added by 31.168.177.82 (talk) 23:27, 2 March 2016 (UTC)

the numbers aren't adding up this week

Several articles on this list (mostly countries) aren't in TopViews and there aren't enough views per day to account for the stated numbers. Serendipodous 07:47, 12 April 2016 (UTC)

West.andrew.g; since they aren't showing up in any other view counters, I think it means your page was hacked. :-( Serendipodous 08:30, 12 April 2016 (UTC)
  • Odd discrepancies.
  • World War II is the first one, with 683,995 views (12.54% mobile, quite low) on the Top 5000; on TopViews it is #122 with 190,683 views. The difference is if you change the "Agent" column under Topviews. If you select "All" you get 682,892 views; if you select "User" you get 190,683. "All" (and WP:5000) seems to be capturing an unexplained spike from April 3-6 (which actually started on April 2 last week). So I intend to exclude it. Is "All" including self-identified spiders and bots?
  • The same pattern occurs for India, France, Canada, etc., and to a somewhat less noticeable extent, but still there, for United States. Even Iran (#54 on WP:5000) shows the pattern. So I think we should exclude these. These articles are popular enough regularly that the spike in views is not reducing the mobile count to the super low levels we normally see in such cases, but they definitely all have depressed mobile accounts compared to normal.--Milowenthasspoken 12:44, 12 April 2016 (UTC)

Not hacked, we are now just using different datasets. The WMF data does some type of filtering to remove spiders, crawlers, and automated traffic. This dataset is available to me and I will integrate it as real life permits. Mainly I need to understand if the format changed in some way. If anyone has any extra cycles and wants to help me make sense of this, feel free! Right now we are parsing from [7] "pagecounts-all-sites", whereas "pageviews" seems to be the new standard [8]. I don't think it is as straightforward as changing the URL. West.andrew.g (talk) 13:31, 12 April 2016 (UTC)
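To put a number on the gap between the two TopViews agent settings quoted above for World War II (682,892 under "All" versus 190,683 under "User"), a quick calculation like the one below may help. It is only arithmetic on the figures from this thread; the function name is hypothetical, and calling the difference "non-user" traffic assumes that the "User" setting is what remains after filtering.

```python
def non_user_share(all_agents, user_only):
    """Percentage of "All" views that disappear under the "User" setting."""
    if all_agents <= 0:
        return 0.0
    return 100.0 * (all_agents - user_only) / all_agents


# Figures quoted in the thread for World War II.
share = non_user_share(682_892, 190_683)
print(f"{share:.1f}% of the 'All' total is not counted under 'User'")
```

By this measure roughly 72% of the week's total would fall away under the filtered dataset, which is consistent with the decision to exclude the article.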

Please tell me you're just late this week!

I can't do this with just TopViews! I need the mobile count! I can't go back to coach! Serendipodous 16:48, 19 April 2016 (UTC)

@Serendipodous: As described at WT:STiki the machine encountered some downtime. The missed report was processed, and the one from this past weekend has arrived per the normal schedule. Sorry for the snag, but we are back on track. West.andrew.g (talk) 16:12, 25 April 2016 (UTC)

Most popular over a longer time period...

Hi @West.andrew.g: I'm just wondering if there is anywhere that shows the most popular articles over a much longer period than just the last week... for example the last year. Or even top viewed articles of all time? I'd be interested in seeing such a list. Thanks  — Amakuru (talk) 15:20, 10 May 2016 (UTC)

The latest annual traffic report can be seen there. Serendipodous 15:31, 10 May 2016 (UTC)
Thanks!  — Amakuru (talk) 15:34, 10 May 2016 (UTC)

Collaboration

@West.andrew.g: I would like to collaborate with you to suggest adding two columns: a column with the current revision's ORES class prediction (please see [9] by User:Bamyers99) and a column indicating membership of the article in any or all WP:BACKLOG categories of your choosing, and if so which. Thank you! EllenCT (talk) 21:12, 5 June 2016 (UTC)

Ignore report updates temporarily Report up to date

In testing the transition to an alternative WMF article views data source, I will attempt to ingest existing 2016 data in the new format, and (re?)-produce all 2016 reports in the process. There will be statistical differences due to how the data is collected and filtered. I will reply here when we are back at steady state. Thanks, West.andrew.g (talk) 20:20, 18 August 2016 (UTC)

Going smoothly. 3 months done so far. West.andrew.g (talk) 23:38, 21 August 2016 (UTC)
Perfect, are we going to be able to calculate total pageviews per year for WPMED from this? Doc James (talk · contribs · email) 03:20, 22 August 2016 (UTC)
For English, but efficiencies in the switch might open opportunities for other languages. Of course, our collaboration will always make WP:MED a priority for special processing. West.andrew.g (talk) 04:11, 22 August 2016 (UTC)

All 2016 reports have been re-generated w/new data source and we expect to resume a normal updating schedule (weekly on Sunday AM, NYC time). Thanks, West.andrew.g (talk) 13:51, 25 August 2016 (UTC)

Hyphen-Minus

How do you explain the phenomenal rise of "-"? It is second after the Main page as of November 5, 2016. Suddenly, the whole world is viewing this particular typographic sign. Max from Jerusalem (a year ago I created with your help Wikipedia:Multiyear ranking of most viewed pages)--Maxaxax (talk) 22:13, 5 November 2016 (UTC)

@Maxaxax: Note that there was a change in data source when the "hyphen" started appearing in the top list. I have confirmed that the statistics I report for "-" do accurately reflect underlying raw data. Surely, this is not about actual views to the "-" article, but this is an artifact that somehow gets logged, that my report then interprets as an article. Point being, you'll have to take a step up in the stats production hierarchy to learn about why this occurs. I currently store and aggregate the raw statistical files stored here. Thanks, West.andrew.g (talk) 14:30, 7 November 2016 (UTC)

Regarding the � characters

You may have noticed certain editions of the list, including this week, *sometimes* display the "�" character where one would expect accented and non-ASCII characters. This is not the result of my processing. These characters appear in our underlying data source.

When programming the code to ingest from our new data source, I noticed this, suspected it was my error, and vetted the issue thoroughly. Usually, the article titles that include "�" appear alongside correctly accented versions of the article title in the raw statistics. In my inspection, it did not appear like the "�" articles were driven by artificial traffic (i.e., someone actually typing in that character, or someone prominently linking a version with some invalid or bizarre Unicode encoding). I did observe that the versions with "�" only tend to receive traffic for a couple consecutive hours of a single day of the week. Often times these are the biggest traffic hours for the "article" (that is, these � versions tend to get traffic when it might make sense for the "correct" article to be getting large traffic). This explains why the funky � versions often outrank the "correct" ones (sometimes they both appear in the list).

Summarily, the problem is somewhere before me on the statistical production chain. It seems to manifest around traffic spikes. It seems to be temporary. If you look back through the history there are some weeks with lots of these, some with few/none. Sometimes the same articles are affected. I've got no great suggestions on how to handle these, but thought I should get ahead of the questions I anticipated. Thanks, West.andrew.g (talk) 16:04, 11 September 2016 (UTC)

I encountered what is probably the same or a similar issue in November 2015, see phab:T117945.
Regards, Tbayer (WMF) (talk) 17:27, 11 September 2016 (UTC)
  • I had not paid much attention to this before, but Murder of JonBenét Ramsey fell into this trap this week, appearing as "[[Murder of JonBen��t Ramsey]]" (with brackets showing and no link) on the Top 5000; there was a TV miniseries about this, so the article popularity seems legitimate. Before now I had noted that "[[Lali Esp��sito]]" was also getting the effect, but this article is being otherwise affected by bots (3.85% mobile views) and has been excluded. The effect there must be from the use of the "ó" character in the article title. I noted that when an article has this issue, the class rating of the real article gets omitted from the Top 5000.--Milowenthasspoken 15:00, 19 September 2016 (UTC)
Yep, we should contribute what we know to the above Phabricator ticket. If they fix things, those changes will trickle down into our reports. West.andrew.g (talk) 14:31, 7 November 2016 (UTC)
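Until the upstream issue is fixed, one possible stopgap is to spot titles containing the replacement character (U+FFFD) and try to pair each with its correctly accented counterpart from the same data. The sketch below is an assumption-laden illustration, not something the report does today: the function names are hypothetical, and it treats each run of "�" as standing in for one or more unknown characters, which may not hold for every mangled title.

```python
import re

REPLACEMENT = "\ufffd"  # the "�" character discussed in this thread


def is_mangled(title):
    """True when the title contains at least one replacement character."""
    return REPLACEMENT in title


def pattern_for(mangled):
    """Regex in which each run of U+FFFD matches one or more characters."""
    parts = re.split(REPLACEMENT + "+", mangled)
    return re.compile(".+?".join(re.escape(part) for part in parts))


def find_correct_title(mangled, known_titles):
    """Return the single clean title matching the mangled one, else None."""
    pattern = pattern_for(mangled)
    matches = [
        title for title in known_titles
        if not is_mangled(title) and pattern.fullmatch(title)
    ]
    # Only report a match when it is unambiguous.
    return matches[0] if len(matches) == 1 else None


known = ["Murder of JonBenét Ramsey", "Lali Espósito"]
print(find_correct_title("Murder of JonBen\ufffd\ufffdt Ramsey", known))
```

Whether the views of a mangled row should then be added to the matched article is a separate editorial question, since the thread notes the two versions sometimes both appear in the list.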

Redirects

Can we get a list of the most-viewed redirects? wbm1058 (talk) 23:43, 29 November 2016 (UTC)

@Wbm1058: Can redirect articles easily be queried/determined via the Wikimedia API? West.andrew.g (talk) 17:31, 30 November 2016 (UTC)
My bot code gets the contents of each page it's concerned with and reads the first line – if it begins #REDIRECT (case insensitive) then it's a redirect. I'm not aware of any bit that the MediaWiki software sets to indicate that condition, so one can know a page is a redirect without examining it, but I'm not particularly knowledgeable about the database structure and every capability of the API. Someone else might know more for sure. It would certainly be nice to be able to do a query or series of queries that returns an array or set of arrays of all pages that are redirects. wbm1058 (talk) 17:44, 30 November 2016 (UTC)
If the software can quickly pull up a random redirect when one clicks Special:RandomRedirect, surely there must be a way to query the API for that characteristic. Oh, yes here it is: Special:ListRedirects. Though, per Wikipedia talk:Special:ListRedirects, that was limited to displaying only the first 1,000 entries in 2007–2009, and still is limited to showing only the first 5,000 – I believe that is because of concerns about the server load from too many random users hitting on this. See Wikipedia:AutoWikiBrowser/User manual#Make list for instructions how to get the list of redirects with AWB. Choose Special page on the drop-down menu for the Source: then click on Make list and a "Special Pages" window pops up. On that, choose "All Redirects" on the "source" drop-down menu. That will easily get you the first 25,000 redirects. Again AWB by default cuts its lists at 25K due to server load/performance concerns, but there should be a way to override that limit if you have a need. Though I don't see any drop-down item for "Special page (NL, Admin & Bot)"... NL means "no limit". Maybe if you have the "apihighlimits" right, you can do it. Hope this helps. wbm1058 (talk) 22:21, 14 December 2016 (UTC)
@Wbm1058: I'm willing to help you out here, but my guess is the first 25k isn't going to get us very deep into the pile, and that second approach sounds a bit clunky, especially if we're looking to do this repeatedly. With your bot approach, do you have to get the entire content of a page with an API request, or is there a way to stop it after a certain number of bytes? Querying the API for all article content and checking the first couple characters is simple, but obviously comes with high network costs. Can you run this by the folks at WP:VPT to see if they have any brilliant API-focused strategies before I settle on the content-parsing approach? Thanks, West.andrew.g (talk) 04:41, 15 December 2016 (UTC)
Hey, I found it. mw:API:Allredirects. You set up a loop to retrieve so many on each call, and keep using arcontinue to keep getting more 'till they're all processed. I don't know how you make your article list, but if your program calls mw:API:Allpages, maybe you just need to substitute Allredirects for Allpages – that would be pretty simple, if that's all there is to it. See mw:API:Lists. wbm1058 (talk) 05:53, 15 December 2016 (UTC)
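The loop described above (request a batch, then keep passing arcontinue back until nothing is left) can be sketched as follows. This is a minimal illustration, not the code used for the reports. The `fetch` callable is hypothetical: it stands in for whatever function performs the HTTP request against api.php and returns the decoded JSON. It is also worth checking the API documentation for exactly which title each allredirects row carries before relying on the output.

```python
from typing import Callable, Dict, Iterator


def iter_list_query(fetch: Callable[[Dict[str, str]], dict],
                    list_name: str = "allredirects",
                    prefix: str = "ar",
                    limit: str = "max") -> Iterator[dict]:
    """Yield every row of a MediaWiki list query, following continuation.

    `fetch` is an assumed helper: it takes the request parameters and
    returns the parsed JSON response as a dict.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": list_name,
        prefix + "limit": limit,
    }
    while True:
        data = fetch(params)
        for row in data.get("query", {}).get(list_name, []):
            yield row
        cont = data.get("continue")
        if not cont:
            break  # no continuation block means the list is exhausted
        # Carry every continuation key (e.g. "arcontinue") into the next call.
        params = {**params, **cont}
```

Swapping `list_name="allpages"` and `prefix="ap"` would drive the same loop over Allpages, which matches the substitution suggested above.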

@Wbm1058: -- Is it possible to use this to see if a particular page is the source of a redirect? Presumably if we can dump them all out, we can use the "starttitle" parameter and a "limit=1" to test single articles? Are you able to craft a query that demonstrates this? Do you know how many redirects there are in existence? At what point does querying individually exceed the pagination requests of getting everything? Thanks, West.andrew.g (talk) 20:51, 16 December 2016 (UTC)
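For testing single pages, a rough sketch of two checks follows. The first is the first-line test described earlier in this thread. The second assumes the page entry comes from an `action=query&prop=info` response, which marks redirect pages with a `redirect` key. The helper names are illustrative, and `requests_needed` is only simple arithmetic for comparing the request count of individual (batched) lookups against a full paginated dump; the batch sizes passed in are assumptions to be checked against the limits that apply to the account in use.

```python
import math
from typing import Dict


def is_redirect_by_wikitext(wikitext: str) -> bool:
    """First-line check: does the page text begin with #REDIRECT (any case)?"""
    lines = wikitext.splitlines()
    return bool(lines) and lines[0].lstrip().lower().startswith("#redirect")


def is_redirect_by_info(page: Dict) -> bool:
    """Check one entry of query.pages from an action=query&prop=info response.

    Assumption: redirect pages carry a "redirect" key (an empty string in the
    older JSON format, a boolean in formatversion=2).
    """
    return "redirect" in page and page["redirect"] is not False


def requests_needed(items: int, per_request: int) -> int:
    """Number of API calls needed to cover `items` at `per_request` each."""
    return math.ceil(items / per_request)
```

As a purely illustrative comparison, checking the 5,000 titles of one report at an assumed 50 titles per request costs `requests_needed(5000, 50)` calls. A full dump costs `requests_needed(total_redirects, page_size)` calls, so the individual approach is cheaper whenever the first number is the smaller one.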

Feature request - estimates of relative popularity

For Henrik's tool there used to be a feature noting when any article was among Wikipedia's most popular. I forget what it showed - perhaps the top 1%, which for English Wikipedia would be 50,000 of the 5 million articles. Right now this tool is showing the top 0.1%. Would it be within anyone's reach and availability of time to produce a tool which reported how many views an article needed to be among the top 1%? I think it could motivate people to edit more if they knew that certain articles were very popular in Wikipedia. 1% seems very popular, and yet it is a lot of articles and probably all of the Wikipedia:Vital articles. Blue Rasberry (talk) 20:59, 23 January 2017 (UTC)

@Bluerasberry: I am extremely pressed for cycles at the moment, but I'm happy to share my database tables and existing code for anyone who might find it useful for these types of projects. West.andrew.g (talk) 17:28, 25 January 2017 (UTC)
Consider this a long-term wish for the years ahead when perhaps more people are considering development. Thanks for the reply. Blue Rasberry (talk) 18:09, 25 January 2017 (UTC)
  • Bluerasberry, I would note that Andrew has done some work on this that may give some general guidance that helps you, see File:Wikipedia view distribution by article rank.png. I think this data was from ~2010 or 11 (when mobile views were not recorded and may need to be discounted by about 25-33%), but it shows that the millionth most popular article is getting less than 100 views per day (I can't eyeball logarithmic scales very confidently - having the data behind the graph probably allows for a much more exact figure). So, I'd say any article getting over 3,000 views per month is well into the top 20% of all articles (top million). Does being told an article is the top 20% of most viewed articles motivate someone? If so, we have a million articles we can promote! The current Top 5000 shows that it takes about 122.8K views in a month (4092/day, 28.6K per week) to make the top 5000 (top 0.1%). And with 10,000 views a day (70K a week, 300K a month), you're good for #1106 (just outside the top 0.02%). Someone more mathematically-inclined will hopefully correct me if my observations are wildly wrong. Now I am interested to figure out what exactly Henrik's tool offered on this--he used to occasionally create lists of the most viewed articles (see, e.g., Wikipedia:Top 25 Report/August 2008), but it was irregular.--Milowenthasspoken 21:18, 25 January 2017 (UTC)
    • @Bluerasberry: Let me see what I can achieve with some quick queries of the year-end 2016 data. If I can pull it together, I'll probably dump something in CSV format to Pastebin. With that, I hope you can take care of the graphing/beautification, and appropriate distribution -- on the assumption this is something you feel strongly about. Thanks, West.andrew.g (talk) 22:29, 25 January 2017 (UTC)
Please do not overcommit yourself - I am interested, but this was more of a casual request for something that I could use if I had it the next 0-3 years. Wikipedia is a target-rich environment for things to do and if anything is trouble I can find other things to try.
I cannot promise beautification but yes, I would do some graphing and distribution if I had this information, and also a blog post. Thanks and please do not let me distract you from your usual business. Blue Rasberry (talk) 22:35, 25 January 2017 (UTC)
@Bluerasberry: That was easier than I expected. The PASTEBIN DATA is pretty self-explanatory. You will probably notice there are 15 million articles ranked in the data, whereas the WMF reported number is close to 5 million. To clarify, my data contains only article namespace entries. Further, my data contains only information on articles that existed when they were visited (while they might be redlinks now, they weren't when the page view was registered). So the unaccounted for 10M? Some are certainly the artifacts of renaming and article deletion. I think the bigger factor is redirections, which probably aren't reflected in the 5M count, but are broken out individually in the page view data. Does this account for everything? I'm not sure, but I also think the very long tail on this distribution and the weird bits that might reside there aren't critical to the motivating question. Most reasonable articles are going to find themselves ranked in the first few million. Let me know if I can help you out in any other way. Thanks, West.andrew.g (talk) 23:29, 25 January 2017 (UTC)
West.andrew.g This is amazing. I did not know what you would provide, but at a glance, there are interesting insights in this. Like for example, I can see that the top 50,000 Wikipedia articles all got more than 710 views a day in 2016. The most popular concepts beyond the 5 million mark still got about 1 view a day. I understand that some redirects might be in the first 5 million and some articles may rank in the 2nd or 3rd 5 million, but just as a generalization, I think this is supporting evidence for making a claim that almost all Wikipedia articles get at least a view a day.
I am putting some of this information in an internal report I am writing. I also shared this at the WP:Vital articles board, where I think some people would appreciate it. After some time I will write a blog post about this and credit you. Thanks for providing this. This is really interesting. Blue Rasberry (talk) 15:56, 27 January 2017 (UTC)
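The question that opened this thread (how many views an article needs to be in the top 1%) reduces to reading one value out of a ranked list. The sketch below assumes the view counts are already sorted from most to least viewed, as in a ranked export. The function names are illustrative and not part of any existing tool.

```python
import math
from typing import Sequence


def top_share_threshold(views_desc: Sequence[int], share: float) -> int:
    """Smallest view count that still falls inside the top `share` of articles.

    `views_desc` must be sorted in descending order; `share` is a fraction,
    e.g. 0.01 for the top 1%.
    """
    if not views_desc or not 0 < share <= 1:
        raise ValueError("need a non-empty list and a share in (0, 1]")
    cutoff_rank = max(1, math.ceil(len(views_desc) * share))
    return views_desc[cutoff_rank - 1]


def rank_to_percent(rank: int, total_articles: int) -> float:
    """Express a rank as a percentage of all articles."""
    return 100.0 * rank / total_articles
```

With the round figures used in the discussion, `rank_to_percent(5000, 5_000_000)` gives 0.1, matching the description of the Top 5000 as the top 0.1%, and `rank_to_percent(50_000, 5_000_000)` gives 1.0 for the top 50,000.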

Page view aggregate 2013-2016

@The ed17: @Bluerasberry: Report up at User:West.andrew.g/2013-2016_Popular_pages. Interesting stuff. It becomes harder to parse in this format which of the high rankings are the result of artificial views, but those familiar with our reporting should be able to pick out the usual offenders. West.andrew.g (talk) 18:54, 30 January 2017 (UTC)

Please fix non-ASCII characters

Greetings West.andrew.g! Could you possibly fix your script to correctly display non-ASCII characters in article titles? Writing the top 25 report last week, we had Beyoncé and this week Gisele Bündchen which appear botched as [[Beyonc��]] and [[Gisele B��ndchen]]. Thanks in advance, and keep up the great work. — JFG talk 07:36, 12 February 2017 (UTC)

@JFG: These broken characters are in the raw data the WMF produces. The problem occurs before I get to the data. WMF is trying to troubleshoot it and there is a Phabricator ticket. There is a thread about this in archives of this page, probably the most recent one. Thanks, West.andrew.g (talk) 15:40, 12 February 2017 (UTC)
OK, no problem, good to know. — JFG talk 16:10, 12 February 2017 (UTC)

Super Bowl week statistics

Amazing how high the mobile percentages are given the second-screen effect. West.andrew.g (talk) 16:14, 12 February 2017 (UTC)

What is the second-screen effect? People using their mobile phones while watching television? ~Mable (chat) 17:37, 12 February 2017 (UTC)
Indeed: Second screen. West.andrew.g (talk) 19:14, 12 February 2017 (UTC)

Protection columns now included in report

I've added two columns to the report, indicating an article's "edit" and "move" protection status. Supporting text was added to the header, and a second icon legend was added below the existing "quality" one. Enjoy! West.andrew.g (talk) 14:58, 7 April 2017 (UTC)

Stats on provenance of readers?

Greetings West.andrew.g! I wonder if Wikimedia stores the HTTP referer information anywhere, and whether we could aggregate this data to determine stats about where readers come from. Perhaps you have a clue? This question stems from a discussion at Talk:New York (disambiguation) where editors have varying opinions about the dominant origins of page views by readers. My hunch is that most visits come from the internal search engine, then from external search engines, then from internal links, then from external links and finally from direct URL input.[10] It would be quite illuminating if we could challenge or corroborate this gut feeling with hard stats. — JFG talk 08:44, 11 July 2017 (UTC)

@JFG: I am not sure what the WMF stores on their side, but this certainly isn't something available to 3rd party researchers due to possible privacy implications, I imagine. When the "Top 25 Report" has had questions about the authenticity of an article's view patterns in the past, I know that User:Ironholds has been able to use some internal data to comment on the variation of source hosts/ids/referrers (?). Errr... Though I just noticed on his userpage that he left the Foundation, he still may be able to comment on the data they have. Thanks, West.andrew.g (talk) 16:46, 11 July 2017 (UTC)

Top 5000 not updated

Hi Andrew. Your page is very helpful to me. I've only been following for a few weeks. It hasn't been updated this week and I'm wondering when it is usually updated Sun/Mon? Thanks! --Sloughflux (talk) 03:15, 10 October 2017 (UTC)

@Sloughflux: Issue is known and explanation is over at Wikipedia_talk:Top_25_Report#Raw_stats_delaying_this_week.27s_5k. Thanks, West.andrew.g (talk) 03:59, 10 October 2017 (UTC)
 Done -- @Sloughflux: ... and the new reports have now generated. Thanks, West.andrew.g (talk) 13:40, 10 October 2017 (UTC)

Top Pages in Country

Hello, I was wondering if it is possible to construct a tool that monitors what readers in country X read on Wikipedia as a list at a certain time? Is it a tool you use to curate these lists? If yes is it open source? - KagunduTalk To Me 08:39, 20 October 2017 (UTC)

@Kagundu: I have written code that aggregates the raw analytics files made available by the WMF. This source essentially aggregates pageviews by article, at hourly or daily granularity, with further breakdowns for mobile and Wikipedia Zero use. To do something like viewership "by country", you'd need to geo-locate the IP addresses of individual readers. The WMF is never going to expose reader IP addresses to researchers due to privacy concerns. They'd have to do some type of aggregation internally if there was any hope of having this kind of data. West.andrew.g (talk) 14:29, 20 October 2017 (UTC)
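The kind of per-article aggregation described above can be sketched roughly as follows. The record layout `(title, access_method, views)` and the label `"mobile"` are assumptions made for illustration; they are not the actual format of the WMF files, and this is not the code behind the reports.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_views(
    records: Iterable[Tuple[str, str, int]]
) -> Dict[str, Dict[str, int]]:
    """Sum hourly or daily records into per-article totals by access method."""
    totals = defaultdict(lambda: defaultdict(int))
    for title, access_method, views in records:
        totals[title][access_method] += views
    return {title: dict(by_access) for title, by_access in totals.items()}


def mobile_percent(by_access: Dict[str, int]) -> float:
    """Share of an article's views that came from the assumed 'mobile' label."""
    total = sum(by_access.values())
    return 100.0 * by_access.get("mobile", 0) / total if total else 0.0
```

A breakdown like this contains no reader location at all, which is why a "by country" view cannot be derived from it.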

Weekly WP:5000 delayed due to missing raw data

Details over at: Wikipedia_talk:Top_25_Report#Weekly_WP:5000_delayed_due_to_missing_raw_data. Thanks, West.andrew.g (talk) 22:19, 8 April 2018 (UTC)

 Done -- This issue has been resolved. West.andrew.g (talk)

Note "c"

You should probably change "undefined" to "-" in note "c" on User:West.andrew.g/Popular pages, given the way the new data source reports undefined pages. --Ahecht (TALK PAGE) 17:13, 18 April 2018 (UTC)

@Ahecht: Thank you. I've called out the hyphen as being like the "undefined" case. I've still left the legacy "undefined" example in since the header template is pulled into some historical reports. West.andrew.g (talk) 04:32, 19 April 2018 (UTC)

Improper handling of some Unicode characters

For example, entries #44, #81, #88, and #99 all have Unicode characters in their titles which show up as U+FFFD REPLACEMENT CHARACTER in the table. This also breaks the wikilink formatting so you have to find the article manually (assuming enough of the remaining title is legible). — AfroThundr (u · t · c) 15:56, 24 December 2018 (UTC)

@AfroThundr3007730: This is a known issue, but unfortunately it's not on the side of my code. That is, this is how these page titles appear in the underlying raw data the WMF produces and I aggregate. If you search this talk page's archives I think you'll find mention of it, as well as a Phabricator task where the WMF folks (at least at one point) were attempting to track down the bug. West.andrew.g (talk) 20:39, 24 December 2018 (UTC)
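Since the damaged titles arrive that way in the raw data, one thing a consumer of the reports can do is flag any title containing U+FFFD so it can be looked up by hand. The sketch below is only an illustration with made-up helper names. The final lines show one generic way a pair of replacement characters can arise (UTF-8 bytes decoded with the wrong codec); this is not a diagnosis of the upstream bug.

```python
from typing import Iterable, List

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def has_replacement_char(title: str) -> bool:
    """True if the title contains at least one U+FFFD."""
    return REPLACEMENT_CHAR in title


def flag_damaged_titles(titles: Iterable[str]) -> List[str]:
    """Return the titles that need manual lookup."""
    return [t for t in titles if has_replacement_char(t)]


# Illustration only: a two-byte UTF-8 character decoded as ASCII with
# errors="replace" becomes two replacement characters.
damaged = "Beyonc\u00e9".encode("utf-8").decode("ascii", errors="replace")
```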

Wikipedia Zero

The project ended, can you change the code to remove the column from future reports? igordebraga 15:34, 27 January 2019 (UTC)

 Done -- @Igordebraga: Can you (or anyone, really) remove the references in the header? Thanks, West.andrew.g (talk) 17:41, 31 January 2019 (UTC)