source("functions.R")
Career hubs - Data merge
Data documentation
Here you find a merge and translation between IDs for Boardex from 2018 and WikiData from around 2019. The translation was done partly by a search-and-match algorithm and then by hand. The emphasis has been on identifying the same corporations, but also on partly resolving mergers and acquisitions, so that firms that are later acquired become part of the buying entity. This is similar to how the WikiData community handles identity resolution and mergers. The sample that we have cleaned consists of the career positions of all top managers from Forbes Global 2000 firms, which amounts to approximately 150,000 organizations. If you have access to Boardex, this should allow you to merge and match. If you don’t have access, it is still a valuable collection of mergeable organization names, which could be used to clean other datasets.
Variables
In the boardex_translation.csv you find the following columns:
- affiliation_id: The Boardex affiliation id
- affiliation: The affiliation name as originally registered in Boardex
- match_name: affiliation after two rounds of cleaning and standardization - see clean_names_boardex_step1 and clean_names_boardex_step2 in functions.R
- wiki_cleaned_name: The matched and cleaned version of match_name.
- wikidata_QID: The WikiData QID. Matched by hand, so expect some errors.
- forbes: TRUE if the WikiData QID matches the QID of a Forbes Global 2000 firm
- n_entries: Number of entries per affiliation_id in Boardex
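As a rough illustration of how the translation table might be used, the sketch below joins a set of career positions to the table by affiliation_id and then counts positions per wikidata_QID. The rows are made-up toy data that only reuse the column names listed above; the positions table, its person_id column and the placeholder QIDs are assumptions, not part of the dataset.

```r
library(dplyr)

# Toy stand-in for boardex_translation.csv (hypothetical rows, placeholder QIDs).
translation <- tibble(
  affiliation_id = c(101, 102, 103),
  affiliation    = c("Alpha Corp", "Alpha Corporation Ltd", "Beta Bank"),
  wikidata_QID   = c("Q_alpha", "Q_alpha", "Q_beta"),
  forbes         = c(TRUE, TRUE, FALSE)
)

# Hypothetical career positions keyed by the Boardex affiliation id.
positions <- tibble(
  person_id      = c(1, 1, 2),
  affiliation_id = c(101, 103, 102)
)

# Attach the WikiData identity to every position.
merged <- positions %>%
  left_join(translation, by = "affiliation_id")

# Boardex ids that share a QID collapse into one organization.
per_org <- merged %>% count(wikidata_QID, name = "n_positions")
per_org
```

With the real file, the toy translation table would be replaced by the result of reading boardex_translation.csv.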
Career network example
In the following we give an example of how you can turn a set of career positions into a career network. For more see the original article.
First we create two simple careers
s1 <- c("EY", "General Electric", "Goldman Sachs", "Deutsche Bank") %>% tibble(person = "CFO", affil = .)
s2 <- c("Bell Labs", "General Electric", "Apple Inc.") %>% tibble(person = "CTO", affil = .)
Then we plot them as a regular graph
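# Stack the two careers into one table of positions
pos <- bind_rows(s1, s2)
# Number each person's positions in career order
s <- pos %>% group_by(person) %>% mutate(order_number = 1:n())
# Shift a vector one step forward, so each position is paired with the next one
f <- function(x) c(x[-1], NA)
# One row per career move; the last position of each career has no successor (NA)
s <- s %>% transmute(from = affil, to = f(from), person, order_number)
# Drop the rows without a successor and build a directed graph of moves
g <- graph_from_data_frame(na.omit(s))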
p <- graph.plot(g, layout = layout_with_fr(g), midpoints = TRUE, text = TRUE, vertex.size = 10, text.size = 4,
vertex.fill = "white", edge.size = 0.3, edge.color = E(g)$person, text.background = "white", text.background.alpha = 0.5)
p <- p + scale_color_manual(values = c("black", "darkred")) + coord_fixed(clip = 'off')
p.simple.graph <- p
p.simple.graph
Career shifts
We think of the network as a directed network between organizations. An edge is a person shifting from one organization to another, so one career creates a chain of links between organizations. If a person has held positions in companies A, B and C, in that order, the network is A -> B -> C. There is initially no direct connection between A and C in this directed network, but there is a path. If we think of the network as consisting of multiple careers, like A -> B -> C and D -> B -> E, we are faced with a conundrum.
g1 <- graph_from_literal(A-+B-+C)
plot(g1)
g2 <- graph_from_literal(D-+B-+E)
g3 <- g1+g2
plot(g3)
In this graph there is no direct tie between A and E, but there is a path (A -> B -> E), even though no career moves between them. What is worse, the path length between A and E is the same as between A and C, which do share a career. So from the point of view of E, A and D are the same. We are assuming, along with most network theorists, that node B does not discriminate in who it redirects to C and E: no matter how you got to B, it will redirect you in the same fashion. This is very unlikely. Let's imagine a network made up of two careers, one based on financial skills (CFO) and one on technical skills (CTO).
s1 <- c("EY", "General Electric", "Goldman Sachs", "Deutsche Bank") %>% tibble(person = "CFO", affil = .)
s2 <- c("Bell Labs", "General Electric", "Apple Inc.") %>% tibble(person = "CTO", affil = .)
This network looks like this.
pos <- bind_rows(s1, s2)
s <- pos %>% group_by(person) %>% mutate(order_number = 1:n())
f <- function(x) c(x[-1], NA)
s <- s %>% transmute(from = affil, to = f(from), person, order_number)
g <- graph_from_data_frame(na.omit(s))
p <- graph.plot(g, layout = layout_with_fr(g), midpoints = TRUE, text = TRUE, vertex.size = 10, text.size = 4,
vertex.fill = "white", edge.size = 0.3, edge.color = E(g)$person, text.background = "white", text.background.alpha = 0.5)
p <- p + scale_color_manual(values = c("black", "darkred")) + coord_fixed(clip = 'off')
p.simple.graph <- p
p.simple.graph
And these are the shortest paths. We can see that Bell Labs is as close to Apple as it is to Goldman Sachs, due to the shared position in General Electric. In this graph it is not possible to distinguish between EY and Bell Labs.
sp <- distances(g, mode = "out")
sp[sp == Inf] <- NA
sp
                 EY General Electric Goldman Sachs Bell Labs Deutsche Bank Apple Inc.
EY                0                1             2        NA             3          2
General Electric NA                0             1        NA             2          1
Goldman Sachs    NA               NA             0        NA             1         NA
Bell Labs        NA                1             2         0             3          2
Deutsche Bank    NA               NA            NA        NA             0         NA
Apple Inc.       NA               NA            NA        NA            NA          0
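The observation that EY and Bell Labs cannot be told apart can be checked directly. The sketch below rebuilds the same five career moves with plain igraph calls, so it does not rely on functions.R, and compares the two firms' out-distances to every firm they can reach. The names edges, g_moves, sp_moves and targets are introduced here for illustration only.

```r
library(igraph)

# The five career moves of the CFO and CTO example, written out by hand.
edges <- data.frame(
  from = c("EY", "General Electric", "Goldman Sachs",
           "Bell Labs", "General Electric"),
  to   = c("General Electric", "Goldman Sachs", "Deutsche Bank",
           "General Electric", "Apple Inc.")
)
g_moves <- graph_from_data_frame(edges, directed = TRUE)

# Directed shortest path lengths, following the direction of the moves.
sp_moves <- distances(g_moves, mode = "out")

# Firms reachable from both EY and Bell Labs.
targets <- c("General Electric", "Goldman Sachs", "Deutsche Bank", "Apple Inc.")

# TRUE when the two firms have the same distance to every target.
same_profile <- identical(sp_moves["EY", targets], sp_moves["Bell Labs", targets])
same_profile
```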
Career edges
Let us instead imagine that firms are only connected by a shortest path if a career path connects them. To do this, we imagine that the network is initially made up of multiple separate graphs, one for each career, which are then aggregated. The aggregated graph between firms is weighted by the sum of inverted path lengths in the individual career graphs: a path length of 2 gives a weight of 0.5, a path length of 4 gives 0.25. If two firms are connected along two careers, the weights are summed, so two paths of length 1 and 2 become a weighted edge of 1 + 0.5 = 1.5. The weighting scheme is simple: a direct exchange is 1, and the weight declines with each extra step in the career graph.
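The weighting scheme can be sketched in a few lines of base R. The helper career_pairs() below is hypothetical and written only for this illustration; it is not the from.shifts.to.paths function from functions.R, whose implementation is not shown here. Two toy careers, A to B and A to C to B, reproduce the case of two paths of length 1 and 2.

```r
# Hypothetical helper: every ordered pair of firms within one career,
# with the number of steps that separate them in that career.
# Assumes the career has at least two positions.
career_pairs <- function(affil) {
  idx <- combn(seq_along(affil), 2)   # columns are index pairs (i, j) with i < j
  data.frame(
    from        = affil[idx[1, ]],
    to          = affil[idx[2, ]],
    path.length = idx[2, ] - idx[1, ]
  )
}

# Toy careers: one direct move from A to B, one move from A to B via C.
careers <- list(
  direct   = c("A", "B"),
  indirect = c("A", "C", "B")
)

# One career graph per career, stacked into a single edge table.
all_pairs <- do.call(rbind, lapply(careers, career_pairs))

# Inverse path length per career, summed over careers for each firm pair.
all_pairs$weight <- 1 / all_pairs$path.length
career_weights <- aggregate(weight ~ from + to, data = all_pairs, FUN = sum)
career_weights
```

The pair A and B appears in both careers, once at distance 1 and once at distance 2, so its aggregated weight is 1 + 0.5; the other two pairs appear once at distance 1.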
Let's have a look at our example with the CFO and CTO from before. The numbers along the edges are the inverse path lengths between the firms. The CTO path, from Bell Labs through General Electric to Apple, shows how the new graph looks: there is now a direct (weak) tie between Bell Labs and Apple. The CFO path has also created new ties (in blue), but there is no tie between Bell Labs and either Goldman Sachs or Deutsche Bank. Obviously there is still a path, but we now have a distance measure between EY and Apple that takes into account that they are not tied by a career.
career.edges <- na.omit(s) %>% group_by(person) %>% group_modify(.f = from.shifts.to.paths)
g <- career.edges %>% ungroup() %>% select(-person) %>% graph_from_data_frame()
p <- graph.plot(g, layout = layout_with_fr(g), midpoints = TRUE, text = TRUE, vertex.size = 10, text.size = 4, edge.text.size = 3,
vertex.fill = "white", edge.size = 0.3, edge.color = 1/E(g)$path.length, edge.text = round(1/E(g)$path.length, 2), text.color = "black", edge.alpha = 1, edge.text.alpha = 1, text.background = "white", text.background.alpha = 0.5)
p.career.simple <- p + scale_color_steps(low = "darkred", high = "black", guide = "none") + scale_x_continuous(expand = c(0.1, 0)) + coord_fixed(clip = "off", expand = TRUE)
p.career.simple
patchwork::wrap_plots(p.simple.graph, p.career.simple)
Benefits
First and foremost, we have a better distance measure between firms, and it is fairly easy to interpret: it reflects the number of firms that a firm can reach through the careers that flow through it. A tie with a weight of 1 amounts to a single direct exchange. The measure boosts firms that are early in careers that move along popular paths, but does little for firms that exchange strongly with one or two other firms in short careers.
If we transform the network from a weighted to an unweighted network and remove all ties with a value below 1, we have a simple network where coreness and degree are much less vulnerable to differences in the number of managers in each firm. The coreness score is readily interpretable, and the core is not as heavily captured by the largest financial firms, because size matters less. We also do not have to prune the network as we otherwise would, which again makes the approach a lot simpler to explain.
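A minimal sketch of the thresholding step described above, assuming the aggregated weights are stored in an edge attribute named weight (a naming choice made for this example). It writes out the nine career edges of the CFO and CTO example by hand, drops every tie with a weight below 1, and computes degree and coreness on what remains.

```r
library(igraph)

ge <- "General Electric"

# All firm pairs within the CFO and CTO careers, with their career path lengths.
career_edges <- data.frame(
  from = c("EY", "EY", "EY", ge, ge, "Goldman Sachs",
           "Bell Labs", "Bell Labs", ge),
  to   = c(ge, "Goldman Sachs", "Deutsche Bank", "Goldman Sachs",
           "Deutsche Bank", "Deutsche Bank", ge, "Apple Inc.", "Apple Inc."),
  path.length = c(1, 2, 3, 1, 2, 1, 1, 2, 1)
)
career_edges$weight <- 1 / career_edges$path.length

g_weighted <- graph_from_data_frame(career_edges, directed = TRUE)

# Keep only ties that amount to at least one direct exchange,
# then drop the weights to get an unweighted network.
g_direct <- delete_edges(g_weighted, which(E(g_weighted)$weight < 1))
g_direct <- delete_edge_attr(g_direct, "weight")

# Degree and coreness, ignoring edge direction.
firm_degree   <- degree(g_direct, mode = "all")
firm_coreness <- coreness(g_direct, mode = "all")
firm_degree
firm_coreness
```

In this toy example no pair of firms is linked by more than one career, so only the direct moves survive the threshold. In a larger network, several longer paths between the same two firms can also sum to a weight of 1 or more.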