"title"=>"Processing 10TB of Wikipedia Page Views — Part 2",
"summary"=>nil,
"content"=>"
Processing 10TB of Wikipedia Page Views — Part 2
Data Visualization
This is the second of a two-part series in which we focus on interesting queries and visualizations using the data pipeline we created in part one.
In part one, we covered the problem statement and the data engineering solution, including all the code needed to build a reliable, robust big data processing pipeline. In part two, we’ll cover some of the fun things we can do once our data pipeline is running: interesting queries, data visualizations, and a Data Studio dashboard you can try yourself.
What can we do with this data?
For starters, let’s find out the most popular Wikipedia article so far this year:
SELECT title, SUM(views) total_views
FROM `bigquery-public-data.wikipedia.pageviews_2020`
WHERE wiki IN ('en', 'en.m')
AND title NOT IN ('Main_Page','-','Wikipedia')
AND title NOT LIKE 'File%'
AND title NOT LIKE 'Special:%'
AND title NOT LIKE 'Portal:%'
AND datehour>='2020-01-01'
AND views>10000
GROUP BY title
ORDER BY 2 DESC LIMIT 10
It’s only rock & roll, but I like it
That last query was interesting but it doesn’t take advantage of the entity data we worked so hard to process in part 1. Let’s construct an example leveraging our wikidata table to find the most viewed pages for rock bands in 2020. Of course, this raises the question: what is a rock band? Wikidata to the rescue!
You can search Wikidata interactively via the Wikidata site, like this:
This shows us that “rock band” matches multiple entities, including the video game of the same name. The entity we’re after is the first one: 5741069. One way to confirm this finding is to search for a known entity that should be in this category:
Here we see that the preeminent rock band, The Beatles, is indeed classified as an instance of “Rock Band”. But this doesn’t catch every page I’m interested in. For example, Radiohead is considered an instance of “Musical Group” (215380).
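One way to double-check a classification like this is to query the wikidata table directly. Here’s a sketch that lists the “instance of” entity ids attached to a given page; it uses only the en_wiki and instance_of.numeric_id fields, the same fields the bands query below relies on:
-- List the "instance of" entity ids attached to the Radiohead page
SELECT en_wiki, io.numeric_id
FROM `bigquery-public-data.wikipedia.wikidata`,
UNNEST(instance_of) io
WHERE en_wiki = 'Radiohead'
You should see 215380 among the results, confirming the “Musical Group” classification.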
Armed with those two entity ids, we can now do some queries about popular bands. But I want to do one more thing before we start querying. Since our scope is limited to just those two entities, it’s wasteful to search the full 10TB dataset on every query. Wouldn’t it be nice if there were a way to limit our search to include only the pages we care about? Well, there is: we’ll create a dedicated partitioned, clustered table containing only the view counts for pages about bands, which will limit our query scope accordingly.
Here’s the SQL code to create my bands table, which I’ve made public so you can try it too:
CREATE OR REPLACE TABLE `mco-bigquery.wikipedia.bands`
(datehour TIMESTAMP, title STRING, views INT64)
PARTITION BY DATE(datehour)
CLUSTER BY title
AS
SELECT datehour, title, SUM(views) views
FROM `bigquery-public-data.wikipedia.pageviews_*` a
JOIN (
SELECT DISTINCT en_wiki
FROM `bigquery-public-data.wikipedia.wikidata`
WHERE EXISTS (SELECT * FROM UNNEST(instance_of) WHERE numeric_id IN (5741069, 215380))
AND en_wiki IS NOT null
) b
ON a.title=b.en_wiki
AND a.wiki='en'
AND DATE(a.datehour) BETWEEN '2015-01-01' AND '2020-12-31'
GROUP BY datehour, title
This table gives us a dataset we can query much more quickly and economically, because we’re only scanning information associated with bands, which is a tiny subset of the overall dataset.
Let’s find the most wiki-popular band so far in 2020 with this query:
SELECT title, SUM(views) views
FROM `mco-bigquery.wikipedia.bands`
WHERE DATE(datehour) BETWEEN "2020-01-01" AND "2020-12-31"
GROUP BY title
ORDER BY views DESC
LIMIT 10
And the results as of this writing…
Let’s make a dashboard!
This is fun but:
- It’s a snapshot of a single moment — it doesn’t give us any sense of how this data varies over time.
- SQL queries can be a bit complicated, especially for beginners.
- This query is interactive for me but I’d like to somehow share that interactivity with others.
Wouldn’t it be nice if I could easily, without writing a single line of code, allow everyone to query this data, with mouse clicks rather than SQL queries, and to see the results in a nice color coded time series graph?
Using Google Data Studio, I made just such a dashboard, which I’ve embedded below. Give it a try — you can play with the selected bands and the time frame you’d like to analyse.
For example, I wonder which band, during the last five years, was more popular: the Beatles or the Stones? We can use this dashboard to find out in just a few seconds:
Despite having ended their career fifty years ago, the Beatles continue to gather a pretty impressive level of attention in Wikipedia page views.
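If you’d rather skip the dashboard, the same comparison can be sketched in SQL against the public bands table. This assumes the page titles are 'The_Beatles' and 'The_Rolling_Stones' (underscores included, matching the pageview title format seen earlier):
-- Yearly page view totals for the two bands over the last five years
SELECT EXTRACT(YEAR FROM datehour) AS year, title, SUM(views) AS views
FROM `mco-bigquery.wikipedia.bands`
WHERE title IN ('The_Beatles', 'The_Rolling_Stones')
AND DATE(datehour) BETWEEN '2015-01-01' AND '2019-12-31'
GROUP BY year, title
ORDER BY year, title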
Now it’s your turn
I’ve made this data available to everyone in the BigQuery Public Dataset collection. The page views are coming in roughly every hour and the entity data gets refreshed every 3–4 days. The data can be found in the bigquery-public-data collection, under Wikipedia, as shown below:
These tables are partitioned so you can save time and cost by time limiting your queries, like this:
SELECT title, SUM(views) views
FROM `bigquery-public-data.wikipedia.pageviews_2019`
WHERE DATE(datehour) BETWEEN "2019-01-01" AND "2019-12-31"
AND wiki = 'en'
GROUP BY title
ORDER BY views DESC
LIMIT 20
The “Battle of the Bands” dashboard is also available for your use at mco.fyi/bands.
Resources and Acknowledgements
- Part 1 of this series
- All of the code described in this article can be found here: github.com/marcacohen/wikidata.
- Many thanks to Felipe Hoffa. Several of the queries and data management techniques in this article were derived from Felipe’s excellent articles on partitioning and lazy loading, and clustering.
- Shane Glass provided invaluable support helping me add this data to the wonderful Google Cloud Public Datasets collection.
Originally published at https://mco.dev on March 21, 2020.
","author"=>"Marc Cohen",
"link"=>"https://marcacohen.medium.com/processing-10tb-of-wikipedia-page-views-part-2-48352059fd44?source=rss-1f225b5f22b2------2",
"published_date"=>Sat, 21 Mar 2020 00:00:00.000000000 UTC +00:00,
"image_url"=>nil,
"feed_url"=>"https://marcacohen.medium.com/processing-10tb-of-wikipedia-page-views-part-2-48352059fd44?source=rss-1f225b5f22b2------2",
"language"=>nil,
"active"=>true,
"ricc_source"=>"feedjira::v1",
"created_at"=>Sun, 31 Mar 2024 20:54:03.458218000 UTC +00:00,
"updated_at"=>Mon, 21 Oct 2024 16:56:44.996784000 UTC +00:00,
"newspaper"=>"Marc Cohen - Medium",
"macro_region"=>"Blogs"}