This webpage is created as an online supplementary material for the manuscript Zero Inflated Poisson Model with Clustered Regression Coefficients: an Application to Heterogeneity Learning of Field Goal Attempts of Professional Basketball Players. Data generation code for simulation study is provided, and an example run will be demonstrated.

1. Data Generation

The five basis functions as visualized in Figure 2 of the manuscript, and the un-normalized values are provided in this repository as nmf_basis.rds. A basic visualization can be done with package spatstat (Baddeley, Rubak, and Turner 2015).

basis_functions <- readRDS("nmf_basis.rds")
plot(spatstat::im(matrix(basis_functions[,1], ncol = 47)),
      main = "Top of Key Threes")

Data simulation is done using the nimble package. First, as the BUGS language in nimble does not have a native implementation of the zero-inflated Poisson (ZIP) distribution, it needs to be customized, and registered. A straightforward example is found at https://r-nimble.org/nimbleExamples/zero_inflated_poisson.html, and we use the provided code to register the ZIP distribution.

library(nimble, warn.conflicts = F)
## nimble version 0.10.1 is loaded.
## For more information on NIMBLE and a User Manual,
## please visit http://R-nimble.org.
## this whole section is available at
## https://r-nimble.org/nimbleExamples/zero_inflated_poisson.html
## with documentation
## 
ZIPcode <- nimbleCode({
  p ~ dunif(0,1)
  lambda ~ dunif(0,10)
  for(i in 1:N){
    y[i] ~ dZIP(lambda,zeroProb = p)
  }
})

dZIP <- nimbleFunction(
  run = function(x = integer(), lambda = double(), zeroProb = double(), 
                 log = logical(0, default = 0)) {
    returnType(double())
    ## First handle non-zero data
    if(x != 0) {
      ## return the log probability if log = TRUE
      if(log) return(dpois(x, lambda, log = TRUE) + log(1-zeroProb))
      ## or the probability if log = FALSE
      else return((1-zeroProb) * dpois(x, lambda, log = FALSE))
    }
    ## From here down we know x is 0
    totalProbZero <- zeroProb + (1-zeroProb) * dpois(0, lambda, log = FALSE)
    if(log) return(log(totalProbZero))
    return(totalProbZero)
  })

rZIP <- nimbleFunction(
  run = function(n = integer(), lambda = double(), zeroProb = double()) {
    returnType(integer())
    isStructuralZero <- rbinom(1, prob = zeroProb, size = 1)
    if(isStructuralZero) return(0)
    return(rpois(1, lambda))
  })

registerDistributions(list(
  dZIP = list(
    BUGSdist = "dZIP(lambda, zeroProb)",
    discrete = TRUE,
    range = c(0, Inf),
    types = c('value = integer()', 'lambda = double()', 'zeroProb = double()')
  )))
## Registering the following user-provided distributions: dZIP

Next we simulate one replicate of data. Note that in our implementation there is also an intercept term in addition to the five bases. We demonstrate the data generating for the balanced design only, as the imbalanced design can be obtained with minor modification to the code.

ZIPmodel <- nimbleModel(ZIPcode, constants = list(N = 1), check = FALSE)
## defining model...
## building model...
## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... Error in if (x != 0) { : missing value where TRUE/FALSE needed
## 
## checking model sizes and dimensions... This model is not fully initialized. This is not an error. To see which variables are not initialized, use model$initializeInfo(). For more information on model initialization, see help(modelInitialization).
## model building finished.
nmf_basis <- log(basis_functions)
for(i in 1:dim(nmf_basis)[2]){
  nmf_basis[,i] <- (nmf_basis[,i] - mean(nmf_basis[,i])) / sd(nmf_basis[,i])
}

simulatedData <- matrix(NA, nrow = 75, ncol = 1175)
X <- rbind(1, t(nmf_basis))

pi.cluster <- c(0.1, 0.3, 0.4)
beta_cluster1 <- c(-1, 1.2, 0.95, 1.1, 1, 0.8)
beta_cluster2 <- c(-0.4, 0.6, 0.7, 0.5, 0.8, 0.3)
beta_cluster3 <- c(-0.9, 0.2, 0.1, 0.3, 0.2, 0.4)

## to generate one replicate of data
## code not run here, only as demonstration
for(players in 1:25){
  for(j in 1:ncol(X)){
    #Cluster 1
    
    ZIPmodel$p <- pi.cluster[1]
    ZIPmodel$lambda <- exp(t(X) %*% beta_cluster1)[j]
    ZIPmodel$simulate('y')
    
    simulatedData[players, j] <- ZIPmodel$y
    
    
    #Cluster 2
    
    ZIPmodel$p <- pi.cluster[2]
    ZIPmodel$lambda <- exp(t(X) %*% beta_cluster2)[j]
    ZIPmodel$simulate('y')
    
    simulatedData[players + 25, j] <- ZIPmodel$y
    
    
    #Cluster 3
    
    ZIPmodel$p <- pi.cluster[3]
    ZIPmodel$lambda <- exp(t(X) %*% beta_cluster3)[j]
    ZIPmodel$simulate('y')
    
    simulatedData[players + 50, j] <- ZIPmodel$y
  }
}

The simulatedData object is a matrix with 75 rows and 1175 columns, and each of its row corresponds to the number of shots made by the simulated “player” on the court blocks. The two simulated datasets each containing 100 such objects are also provided in this repository, and can be loaded by reading in the corresponding objects. As an example, we load the data for the balanced design.

simulatedData <- readRDS("simulated_balance.rds")

2. Model Fitting

First, an auxiliary function is defined to perform the stick-breaking for mixture of finite mixtures, according to Section 4 of Miller and Harrison (2018).

piFun2 <- nimbleFunction(
  run = function(r = double(1)) {
    rlength <- length(r)
    rsum <- rep(0, rlength)
    pi <- rep(0, rlength)
    rsum[1] <- pi[1] <- r[1]
    for (i in 2:rlength) {
      rsum[i] <- rsum[i - 1] + r[i]
      if (rsum[i] >= 1) {
        pi[i] <- 1 - rsum[i - 1]
      }
      else {pi[i] <- r[i]}
    }
    for (i in 1:rlength) {
      if (pi[i] < 0) {pi[i] <- 0}
    }
    returnType(double(1))
    return(pi)
  }
)

Next the MFM-ZIP model as descried in Section~3.3 of the manuscript is formulated as below using nimble syntax.

MainCode <- nimbleCode({
  for (s in 1:S) {
    # S number of  players
    for (j in 1:Q) {
      Y[s, j] ~ dZIP(lambda[s, j], zeroProb = p.ZIP[s])
      lambda[s, j] <- exp(b[s, 1] * x1[j] + b[s, 2] * x2[j] + 
                            b[s, 3] * x3[j] + b[s, 4] * x4[j] +
                            b[s, 5] * x5[j] + b[s, 6] * x6[j])
    }
    b[s, 1:6] <- bm[latent[s], 1:6]
    p.ZIP[s] <- pm[latent[s]]
    latent[s] ~ dcat(pi_beta[1:M])
  }
  
  ### clustered coefficients: MFM ###
  for (i in 1:M) {
    r_beta[i] ~ dexp(rate = lambda_beta)
  }
  pi_beta[1:M] <- piFun2(r = r_beta[1:M])
  
  lambda_beta ~ dlnorm(0, varlog = 1)
  
  for (k in 1:M) {
    bm[k, 1:6] ~ dmnorm(mu_bm[1:6], cov = var_bm[1:6, 1:6])
    pm[k] ~ dunif(0, 1)
  }
})

After the model is defined, we supply it with the data, constants, and initial values.

Y <- simulatedData[, , 1]
S <- dim(Y)[1]
Q <- dim(X)[2]

MFMdata <- list(Y = Y,
                x1 = X[1, ], x2 = X[2, ], x3 = X[3, ],
                x4 = X[4, ], x5 = X[5, ], x6 = X[6, ]
                )

MFMConsts <- list(S = S, Q = Q, M = 10,
                  mu_bm = rep(0, 6), var_bm = diag(1, 6)
                  )

MFMInits <- list(latent = rep(1, MFMConsts$S), pi_beta = rep(0.1, MFMConsts$M),
                 r_beta = rep(0.2, MFMConsts$M), lambda_beta = 1,
                 pm = rep(0.1, MFMConsts$M), bm = matrix(1, MFMConsts$M, 6)
)
MFM <- nimbleModel(code = MainCode,
                   name = "MFM",
                   constants = MFMConsts,
                   data = MFMdata,
                   inits = MFMInits
                   )
## defining model...
## Adding mu_bm, var_bm as data for building model.
## building model...
## setting data and initial values...
## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
## checking model sizes and dimensions...
## model building finished.
cMFM <- compileNimble(MFM)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compilation details.
## compilation finished.
MFMconf <- configureMCMC(MFM, print = FALSE)

MFMconf$addMonitors(c("b", "bm", "p.ZIP", "pm",
                      "latent", "pi_beta"))
## thin = 1: lambda_beta, pm, bm, b, p.ZIP, latent, pi_beta
MFMmcmc <- buildMCMC(MFMconf)
cMFMmcmc <- compileNimble(MFMmcmc, project = MFM)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compilation details.
## compilation finished.
cMFM$setInits(MFMInits)
mcmc.out <- runMCMC(cMFMmcmc, niter = 7000, setSeed = 1, thin = 1)
## running chain 1...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
pos_mcmc <- coda::as.mcmc(mcmc.out[-c(1:2000), ])

The result is a \(5000\times 681\)-dimensional matrix, containing the 5000 post-burn-in iterations. In each iteration, the latent cluster labels for the \(i\)-th player are named latent[i]. We extract these columns, and use the Dahl’s method (Dahl 2006) to obtain the best clustering configuration.

latent_iterations <- pos_mcmc[, 512:586]

## membership matrices

membership_matrices <- purrr::map(1:nrow(latent_iterations),
                                  ~(outer(latent_iterations[.x, ],
                                          latent_iterations[.x, ],
                                          "==")))
## average of membership matrices

avg_membershp <- Reduce("+", membership_matrices) / nrow(latent_iterations)
vecL2dist <- purrr::map_dbl(membership_matrices, ~sum((.x - avg_membershp)^2))
finalClusters <- as.numeric(latent_iterations[which.min(vecL2dist), ])

## number of clusters inferred and cluster sizes
table(finalClusters)
## finalClusters
##  1  2  3 
## 25 25 25
## clustering performance measured using Rand index
fossil::rand.index(finalClusters, rep(c(1, 2, 3), each = 25))
## [1] 1

3. Players Grouping

Group Players
1 Andre Drummond, Ben Simmons, Julius Randle, Steven Adams, Dwight Howard, Jusuf Nurkic, Taj Gibson, Enes Kanter, Clint Capela, Michael Beasley, Derrick Favors, Willie CauleyStein, Jonas Valanciunas, DeAndre Jordan, Brandon Ingram, Domantas Sabonis, Rondae HollisJefferson, Hassan Whiteside, Montrezl Harrell, Robin Lopez, Elfrid Payton, Marcin Gortat, John Henson, Michael KiddGilchrist, Dwight Powell, Dejounte Murray, Jerami Grant, Aron Baynes, Skal Labissiere
2 Otto Porter Jr, Gary Harris, Bojan Bogdanovic, Buddy Hield, Stephen Curry, Kyle Lowry, Taurean Prince, Eric Gordon, JJ Redick, Evan Fournier, Jeremy Lamb, Serge Ibaka, Dario Saric, Jonathon Simmons, Tim Hardaway Jr, Bobby Portis, Kevin Love, Courtney Lee, Dirk Nowitzki, Brook Lopez, Robert Covington, Allen Crabbe, Marco Belinelli, Austin Rivers, Ricky Rubio, Chris Paul, Kelly Oubre Jr, Rodney Hood, Terry Rozier, Joe Ingles, Markieff Morris, Kentavious CaldwellPope, Caris LeVert, Spencer Dinwiddie, Joe Harris, DeMarre Carroll, Darren Collison, JJ Barea, Jeff Green, Kelly Olynyk, James Johnson, Pau Gasol, Frank Kaminsky, Lance Stephenson, Wesley Matthews, Jamal Crawford, Tyler Johnson, Mario Hezonja, John Wall, Myles Turner, Patty Mills, Nikola Mirotic, Kent Bazemore, Justin Holiday, Denzel Valentine, Avery Bradley, Jae Crowder, Ersan Ilyasova, Mike Scott, Draymond Green, Wayne Ellington, DAngelo Russell, Reggie Jackson, Wilson Chandler, Dewayne Dedmon, Trey Lyles, Marcus Morris, Malcolm Brogdon, Kris Dunn, Evan Turner, Pascal Siakam, Rudy Gay, Reggie Bullock, Trevor Ariza, Doug McDermott, Marvin Williams, Nicolas Batum, JR Smith, Cory Joseph, AlFarouq Aminu, George Hill, Troy Daniels, DJ Augustin, Fred VanVleet, Danny Green, Stanley Johnson, Anthony Tolliver, CJ Miles, Rajon Rondo, Kyle Korver, Tyler Ulis, Jerian Grant, Raymond Felton, Marquese Chriss, JaMychal Green, Jarell Martin, Manu Ginobili, Ryan Anderson, Shabazz Napier, Darius Miller, Dragan Bender, Emmanuel Mudiay, Bryn Forbes, Nick Young, Ian Clark, Garrett Temple, Justise Winslow, Andrew Harrison, Marcus Smart, Mario Chalmers
3 LeBron James, Bradley Beal, DeMar DeRozan, KarlAnthony Towns, Russell Westbrook, CJ McCollum, James Harden, Andrew Wiggins, Kevin Durant, Victor Oladipo, Paul George, Jrue Holiday, Khris Middleton, Kemba Walker, Lou Williams, Damian Lillard, Tobias Harris, Devin Booker, Klay Thompson, Harrison Barnes, Kyrie Irving, Dennis Schroder, Joel Embiid, Eric Bledsoe, Jamal Murray, Carmelo Anthony, Nikola Jokic, Goran Dragic, Kristaps Porzingis, Blake Griffin, Jimmy Butler, Will Barton, Jordan Clarkson, Marc Gasol, DeMarcus Cousins, Thaddeus Young, ETwaun Moore, Jaylen Brown, Josh Richardson, Ish Smith, Aaron Gordon, Al Horford, Nikola Vucevic, Tyreke Evans, Zach Randolph, Jeff Teague, Dwyane Wade, Jarrett Jack
4 Anthony Davis, Giannis Antetokounmpo, LaMarcus Aldridge, TJ Warren

References

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman; Hall/CRC Press.

Dahl, David B. 2006. “Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model.” In Bayesian Inference for Gene Expression and Proteomics, edited by Marina Vannucci Kim-Anh Do Peter Müller, 4:201–18. Cambridge University Press.

Miller, Jeffrey W, and Matthew T Harrison. 2018. “Mixture Models with a Prior on the Number of Components.” Journal of the American Statistical Association 113 (521): 340–56.