I’ll be using the StackLite dataset. Be sure to have git lfs installed before cloning the repository.
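For reference, fetching the data might look something like this (a sketch; the repository URL is an assumption, and git-lfs must be installed before cloning):

# assumed location of the StackLite repository; clones into ./stacklite
system("git clone https://github.com/dgrtwo/StackLite.git stacklite")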
Consider the following data frame:
library(tidyverse)
dummy <- tribble(
  ~row, ~`1`, ~`2`, ~`3`,
  "1",     1,    0,    7,
  "2",     3,    6,    0,
  "3",     0,    8,    0
)
dummy
## # A tibble: 3 x 4
##   row     `1`   `2`   `3`
##   <chr> <dbl> <dbl> <dbl>
## 1 1         1     0     7
## 2 2         3     6     0
## 3 3         0     8     0
It is the following matrix, represented as a data.frame:
dummy_matrix <- matrix(c(
  1, 0, 7,
  3, 6, 0,
  0, 8, 0
), nrow = 3, byrow = TRUE)
dummy_matrix
##      [,1] [,2] [,3]
## [1,]    1    0    7
## [2,]    3    6    0
## [3,]    0    8    0
There are several zeroes in the matrix, so to save space, I might just store the locations and values of the nonzero cells (this doesn’t make sense for such a small matrix, but is invaluable when working with very large, very sparse matrices):
dummy_sparse <- dummy %>%
  gather(column, value, -row) %>%
  filter(value != 0)
dummy_sparse
## # A tibble: 5 x 3
##   row   column value
##   <chr> <chr>  <dbl>
## 1 1     1          1
## 2 2     1          3
## 3 2     2          6
## 4 3     2          8
## 5 1     3          7
I can still perform basic matrix operations on dummy_sparse. For example, taking the transpose is as simple as renaming the rows to columns and vice versa:
dummy_sparse %>%
  rename(row = column, column = row) %>%
  # to convert back to "dense", just spread the columns and fill in 0s
  spread(column, value, fill = 0)
## # A tibble: 3 x 4
##   row     `1`   `2`   `3`
## * <chr> <dbl> <dbl> <dbl>
## 1 1         1     3     0
## 2 2         0     6     8
## 3 3         7     0     0
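As a sanity check, base R’s t() on the dense matrix gives the same values:

# transpose the dense matrix, for comparison with the spread() result
t(dummy_matrix)
##      [,1] [,2] [,3]
## [1,]    1    3    0
## [2,]    0    6    8
## [3,]    7    0    0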
Furthermore, matrix multiplication can be expressed in terms of the inner_join() operation. For example, to square the dummy matrix, I can multiply dummy_sparse by itself:
dummy_sparse %>%
  inner_join(dummy_sparse, by = c("column" = "row")) %>%
  # take the row id from the first matrix, column id from the second
  group_by(row, column = column.y) %>%
  # and calculate dot products
  summarise(value = sum(value.x * value.y)) %>%
  # use spread to convert back to a dense representation
  spread(column, value, fill = 0)
## # A tibble: 3 x 4
## # Groups:   row [3]
##   row     `1`   `2`   `3`
## * <chr> <dbl> <dbl> <dbl>
## 1 1         1    56     7
## 2 2        21    36    21
## 3 3        24    48     0
And that works just like normal matrix multiplication:
dummy_matrix %*% dummy_matrix
##      [,1] [,2] [,3]
## [1,]    1   56    7
## [2,]   21   36   21
## [3,]   24   48    0
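For a second sanity check, R’s dedicated sparse classes do the same thing natively. A minimal sketch, assuming the Matrix package is available, that builds the same matrix from the (row, column, value) triples:

library(Matrix)
# build a sparse matrix from the triples in dummy_sparse
m <- sparseMatrix(
  i = as.integer(dummy_sparse$row),
  j = as.integer(dummy_sparse$column),
  x = dummy_sparse$value
)
m %*% m  # same values as the inner_join() version above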
That’s a cool party trick, but why should I care?
What kind of parties do you go to (and how can I get invited)? The point of the above exercise is that it illustrates a quick and accessible way to work with (ahem) big data. Since dplyr works with database backends, we can use the above code to multiply matrices that would be too big to work with in memory! In a relational database, a matrix would typically already be stored in this sparse (aka long) format.
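Since dplyr is doing the translation, show_query() will even print the SQL being generated. A sketch, where sparse_a and sparse_b stand in for hypothetical (row, column, value) tables living in the same database:

# hypothetical tbls backed by the same database connection
sparse_a %>%
  inner_join(sparse_b, by = c("column" = "row")) %>%
  group_by(row, column = column.y) %>%
  summarise(value = sum(value.x * value.y)) %>%
  show_query()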
The question_tags data from StackLite includes the ID of each question and the (possibly many) tags that the questioner used to index the question.
question_tags_df <- readr::read_csv("stacklite/question_tags.csv.gz")
question_tags_df
## # A tibble: 52,224,835 x 2
##       Id Tag
##    <int> <chr>
##  1     1 data
##  2     4 c#
##  3     4 winforms
##  4     4 type-conversion
##  5     4 decimal
##  6     4 opacity
##  7     6 html
##  8     6 css
##  9     6 css3
## 10     6 internet-explorer-7
## # ... with 52,224,825 more rows
I’ll load the data into a SQLite database:
if (!file.exists("so-questions.sqlite")) {
  question_db <- DBI::dbConnect(RSQLite::SQLite(), "so-questions.sqlite")
  RSQLite::dbWriteTable(question_db, "question_tags", question_tags_df)
  # close the connection used for the one-time import
  DBI::dbDisconnect(question_db)
}
Now I can continue working with it in dplyr:
src <- src_sqlite("so-questions.sqlite")
question_tags <- tbl(src, "question_tags")
question_tags
## # Source:   table<question_tags> [?? x 2]
## # Database: sqlite 3.19.3
## #   [/home/tarak/Documents/tarakc02.github.io/stacklite/so-questions.sqlite]
##       Id Tag
##    <int> <chr>
##  1     1 data
##  2     4 c#
##  3     4 winforms
##  4     4 type-conversion
##  5     4 decimal
##  6     4 opacity
##  7     6 html
##  8     6 css
##  9     6 css3
## 10     6 internet-explorer-7
## # ... with more rows
question_tag_counts <- question_tags %>% count(Tag)

# see the most popular tags:
arrange(question_tag_counts, desc(n))
## # Source:     lazy query [?? x 2]
## # Database:   sqlite 3.19.3
## #   [/home/tarak/Documents/tarakc02.github.io/stacklite/so-questions.sqlite]
## # Ordered by: desc(n)
##    Tag               n
##    <chr>         <int>
##  1 javascript  1712324
##  2 java        1614786
##  3 php         1406127
##  4 c#          1356681
##  5 android     1327680
##  6 jquery      1035978
##  7 python       898647
##  8 html         804340
##  9 ios          652484
## 10 c++          645197
## # ... with more rows
I can think of question_tags as another sparse matrix, where the row is identified by the Id and the column is identified by the Tag:
question_tags %>% mutate(value = 1)
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.19.3
## #   [/home/tarak/Documents/tarakc02.github.io/stacklite/so-questions.sqlite]
##       Id Tag                 value
##    <int> <chr>               <dbl>
##  1     1 data                    1
##  2     4 c#                      1
##  3     4 winforms                1
##  4     4 type-conversion         1
##  5     4 decimal                 1
##  6     4 opacity                 1
##  7     6 html                    1
##  8     6 css                     1
##  9     6 css3                    1
## 10     6 internet-explorer-7     1
## # ... with more rows
As before, there is an implicit 0 for every tag that was not used on a question. What we have is an adjacency matrix relating questions to tags. Following the example of this blogpost by Kieran Healy, I can multiply the transpose of question_tags by question_tags to calculate the number of times every pair of tags co-occurred on a single question:
tag_network_full <- question_tags %>%
  inner_join(question_tags, by = "Id") %>%
  group_by(taga = Tag.x, tagb = Tag.y) %>%
  # since the "value" is 1, counting is the same as summing the values
  summarise(weight = n_distinct(Id)) %>%
  ungroup() %>%
  # the graph is undirected, so keep each pair once (this also drops self-pairs)
  filter(taga < tagb)
# pull the pairs that co-occurred at least 25 times for further analysis
tag_network <- tag_network_full %>%
  filter(weight >= 25) %>%
  collect(n = Inf)
# see the most common pairs of tags
arrange(tag_network, desc(weight))
## # A tibble: 231,027 x 3
##    taga       tagb        weight
##    <chr>      <chr>        <int>
##  1 javascript jquery      549039
##  2 css        html        331076
##  3 html       javascript  316895
##  4 mysql      php         245968
##  5 android    java        238315
##  6 html       jquery      211359
##  7 ios        objective-c 192016
##  8 .net       c#          182517
##  9 asp.net    c#          177837
## 10 css        javascript  148489
## # ... with 231,017 more rows
The tag_network may be familiar as an “edge list” representation of a network, where weight is the weight of each edge. What we’ve seen, then, is that:

- a sparse matrix can be represented as a long data frame of (row, column, value) triples,
- matrix multiplication on that representation is just an inner_join() plus a grouped summarise(), and
- because dplyr runs against database backends, all of this works on data too big for memory.
The tidygraph package makes it really convenient to analyze the resulting graph:
library(tidygraph)
tag_graph <- as_tbl_graph(tag_network, directed = FALSE)
I start by filtering out the least common tags:
tag_counts <- collect(question_tag_counts, n = Inf)
tag_graph <- tag_graph %>%
  inner_join(tag_counts, by = c(name = "Tag")) %>%
  filter(n >= 1000)
I can use community-finding algorithms to group tags together. Here I calculate the communities, then filter the graph down to just those communities that have more than 10 tags:
grouped_tags <- tag_graph %>%
  mutate(group = group_fast_greedy(weights = weight),
         group = paste("Group", LETTERS[group])) %>%
  group_by(group) %>%
  filter(n() > 10) %>%
  ungroup()
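Before plotting, it can help to peek at how many tags landed in each community. A quick sketch (as_tibble() on a tbl_graph returns the currently active node table):

# count the tags in each detected community
grouped_tags %>%
  as_tibble() %>%
  count(group, sort = TRUE)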
Plotting the 10 most common tags in each community group using the ggraph package:
library(ggraph)
grouped_tags %>%
  group_by(group) %>%
  top_n(10, wt = n) %>%
  ggraph(layout = "kk") +
  geom_node_text(aes(label = name, size = n),
                 repel = TRUE, colour = "gray30") +
  scale_size_continuous(range = c(3, 9), guide = "none") +
  geom_edge_fan(alpha = .3, width = .2, colour = "purple") +
  theme_void() +
  theme(panel.background = element_rect(fill = "gray98", colour = NA),
        strip.background = element_rect(fill = "gray90", colour = NA)) +
  facet_nodes(~group, ncol = 3)
The groupings really did identify some structure in the tags. The only group that might require some explanation is Group D; I suspect those tags are all common on data science questions.