Source code for statsmodels.sandbox.regression.treewalkerclass

'''

Formulas
--------

This follows mostly Greene notation (in slides)
partially ignoring factors tau or mu for now, ADDED
(if all tau==1, then runmnl==clogit)

leaf k probability :

Prob(k|j) = exp(b_k * X_k / mu_j)/ sum_{i in L(j)} (exp(b_i * X_i / mu_j)

branch j probabilities :

Prob(j) = exp(b_j * X_j + mu*IV_j )/ sum_{i in NB(j)} (exp(b_i * X_i + mu_i*IV_i)

inclusive value of branch j :

IV_j = log( sum_{i in L(j)} (exp(b_i * X_i / mu_j) )

this is the log of the denominator of the leaf probabilities

L(j) : leaves at branch j, where k is child of j
NB(j) : set of j and it's siblings

Design
------

* splitting calculation transmission between returns and changes to
  instance.probs
  - probability for each leaf is in instance.probs
  - inclusive values and contribution of exog on branch level need to be
    added separately. handed up the tree through returns
* question: should params array be accessed directly through
  `self.recursionparams[self.parinddict[name]]` or should the dictionary
  return the values of the params, e.g. `self.params_node_dict[name]`.
  The second would be easier for fixing tau=1 for degenerate branches.
  The easiest might be to do the latter only for the taus and default to 1 if
  the key ('tau_'+branchname) is not found. I also need to exclude tau for
  degenerate branches from params, but then I cannot change them from the
  outside for testing and experimentation. (?)
* SAS manual describes restrictions on tau (though their model is a bit
  different), e.g. equal tau across sibling branches, fixed tau. The also
  allow linear and non-linear (? not sure) restriction on params, the
  regression coefficients. Related to previous issue, callback without access
  to the underlying array, where params_node_dict returns the actual params
  value would provide more flexibility to impose different kinds of restrictions.



bugs/problems
-------------

* singleton branches return zero to `top`, not a value
  I'm not sure what they are supposed to return, given the split between returns
  and instance.probs DONE
* Why does 'Air' (singleton branch) get probability exactly 0.5 ? DONE

TODO
----
* add tau, normalization for nested logit, currently tau is 1 (clogit)
  taus also needs to become part of params MOSTLY DONE
* add effect of branch level explanatory variables DONE
* write a generic multinomial logit that takes arbitrary probabilities, this
  would be the same for MNL, clogit and runmnl,
  delegate calculation of probabilities
* test on actual data,
  - tau=1 replicate clogit numbers,
  - transport example from Greene tests 1-level tree and degenerate sub-trees
  - test example for multi-level trees ???
* starting values: Greene mentiones that the starting values for the nested
  version come from the (non-nested) MNL version. SPSS uses constant equal
  (? check transformation) to sample frequencies and zeros for slope
  coefficient as starting values for (non-nested) MNL
* associated test statistics
  - (I don't think I will fight with the gradient or hessian of the log-like.)
  - basic MLE statistics can be generic
  - tests specific to the model (?)
* nice printouts since I'm currently collecting a lot of information in the tree
  recursion and everything has names

The only parts that are really necessary to get a functional nested logit are
adding the taus (DONE) and the MLE wrapper class. The rest are enhancements.

I added fake tau, one fixed tau for all branches. (OBSOLETE)
It's not clear where the tau for leaf should be added either at
original assignment of self.probs, or as part of the one-step-down
probability correction in the bottom branches. The second would be
cleaner (would make treatment of leaves and branches more symmetric,
but requires that initial assignment in the leaf only does
initialization. e.g self.probs = 1.  ???

DONE added taus

still todo:
- tau for degenerate branches are not identified, set to 1 for MLE
- rename parinddict to paramsinddict


Author: Josef Perktold
License : BSD (3-clause)
'''
from __future__ import print_function
from statsmodels.compat.python import lzip, iteritems, itervalues, lrange, zip
import numpy as np
from pprint import pprint

[docs]def randintw(w, size=1): '''generate integer random variables given probabilties useful because it can be used as index into any array or sequence type Parameters ---------- w : 1d array_like sequence of weights, probabilites. The weights are normalized to add to one. size : int or tuple of ints shape of output array Returns ------- rvs : array of shape given by size random variables each distributed according to the same discrete distribution defined by (normalized) w. Examples -------- >>> np.random.seed(0) >>> randintw([0.4, 0.4, 0.2], size=(2,6)) array([[1, 1, 1, 1, 1, 1], [1, 2, 2, 0, 1, 1]]) >>> np.bincount(randintw([0.6, 0.4, 0.0], size=3000))/3000. array([ 0.59566667, 0.40433333]) ''' #from Charles Harris, numpy mailing list from numpy.random import random p = np.cumsum(w)/np.sum(w) rvs = p.searchsorted(random(np.prod(size))).reshape(size) return rvs
[docs]def getbranches(tree): ''' walk tree to get list of branches Parameters ---------- tree : list of tuples tree as defined for RU2NMNL Returns ------- branch : list list of all branch names ''' if isinstance(tree, tuple): name, subtree = tree a = [name] for st in subtree: a.extend(getbranches(st)) return a return []
[docs]def getnodes(tree): ''' walk tree to get list of branches and list of leaves Parameters ---------- tree : list of tuples tree as defined for RU2NMNL Returns ------- branch : list list of all branch names leaves : list list of all leaves names ''' if isinstance(tree, tuple): name, subtree = tree ab = [name] al = [] #degenerate branches if len(subtree) == 1: adeg = [name] else: adeg = [] for st in subtree: b, l, d = getnodes(st) ab.extend(b) al.extend(l) adeg.extend(d) return ab, al, adeg return [], [tree], []
testxb = 2 #global to class to return strings instead of numbers
[docs]class RU2NMNL(object): '''Nested Multinomial Logit with Random Utility 2 parameterization Parameters ---------- endog : array not used in this part exog : dict_like dictionary access to data where keys correspond to branch and leaf names. The values are the data arrays for the exog in that node. tree : nested tuples and lists each branch, tree or subtree, is defined by a tuple (branch_name, [subtree1, subtree2, ..., subtreek]) Bottom branches have as subtrees the list of leaf names. paramsind : dictionary dictionary that maps branch and leaf names to the names of parameters, the coefficients for exogs) Methods ------- get_probs Attributes ---------- branches leaves paramsnames parinddict Notes ----- endog needs to be encoded so it is consistent with self.leaves, which defines the columns for the probability array. The ordering in leaves is determined by the ordering of the tree. In the dummy encoding of endog, the columns of endog need to have the same order as self.leaves. In the integer encoding, the integer for a choice has to correspond to the index in self.leaves. (This could be made more robust, by handling the endog encoding internally by leaf names, if endog is defined as categorical variable with associated category level names.) '''
[docs] def __init__(self, endog, exog, tree, paramsind): self.endog = endog self.datadict = exog self.tree = tree self.paramsind = paramsind self.branchsum = '' self.probs = {} self.probstxt = {} self.branchleaves = {} self.branchvalues = {} #just to keep track of returns by branches self.branchsums = {} self.bprobs = {} self.branches, self.leaves, self.branches_degenerate = getnodes(tree) self.nbranches = len(self.branches) #copied over but not quite sure yet #unique, parameter array names, #sorted alphabetically, order is/should be only internal self.paramsnames = (sorted(set([i for j in itervalues(paramsind) for i in j])) + ['tau_%s' % bname for bname in self.branches]) self.nparams = len(self.paramsnames) #mapping coefficient names to indices to unique/parameter array self.paramsidx = dict((name, idx) for (idx,name) in enumerate(self.paramsnames)) #mapping branch and leaf names to index in parameter array self.parinddict = dict((k, [self.paramsidx[j] for j in v]) for k,v in iteritems(self.paramsind)) self.recursionparams = 1. + np.arange(len(self.paramsnames)) #for testing that individual parameters are used in the right place self.recursionparams = np.zeros(len(self.paramsnames)) #self.recursionparams[2] = 1 self.recursionparams[-self.nbranches:] = 1 #values for tau's
#self.recursionparams[-2] = 2
[docs] def get_probs(self, params): ''' obtain the probability array given an array of parameters This is the function that can be called by loglike or other methods that need the probabilities as function of the params. Parameters ---------- params : 1d array, (nparams,) coefficients and tau that parameterize the model. The required length can be obtained by nparams. (and will depend on the number of degenerate leaves - not yet) Returns ------- probs : array, (nobs, nchoices) probabilites for all choices for each observation. The order is available by attribute leaves. See note in docstring of class ''' self.recursionparams = params self.calc_prob(self.tree) probs_array = np.array([self.probs[leaf] for leaf in self.leaves]) return probs_array
#what's the ordering? Should be the same as sequence in tree. #TODO: need a check/assert that this sequence is the same as the # encoding in endog
[docs] def calc_prob(self, tree, parent=None): '''walking a tree bottom-up based on dictionary ''' #0.5#2 #placeholder for now #should be tau=self.taus[name] but as part of params for optimization endog = self.endog datadict = self.datadict paramsind = self.paramsind branchsum = self.branchsum if isinstance(tree, tuple): #assumes leaves are int for choice index name, subtree = tree self.branchleaves[name] = [] #register branch in dictionary tau = self.recursionparams[self.paramsidx['tau_'+name]] if DEBUG: print('----------- starting next branch-----------') print(name, datadict[name], 'tau=', tau) print('subtree', subtree) branchvalue = [] if testxb == 2: branchsum = 0 elif testxb == 1: branchsum = datadict[name] else: branchsum = name for b in subtree: if DEBUG: print(b) bv = self.calc_prob(b, name) bv = np.exp(bv/tau) #this shouldn't be here, when adding branch data branchvalue.append(bv) branchsum = branchsum + bv self.branchvalues[name] = branchvalue #keep track what was returned if DEBUG: print('----------- returning to branch-----------') print(name) print('branchsum in branch', name, branchsum) if parent: if DEBUG: print('parent', parent) self.branchleaves[parent].extend(self.branchleaves[name]) if 0: #not name == 'top': # not used anymore !!! ??? #if not name == 'top': #TODO: do I need this only on the lowest branches ? tmpsum = 0 for k in self.branchleaves[name]: #similar to this is now also in return branch values #depends on what will be returned tmpsum += self.probs[k] iv = np.log(tmpsum) for k in self.branchleaves[name]: self.probstxt[k] = self.probstxt[k] + ['*' + name + '-prob' + '(%s)' % ', '.join(self.paramsind[name])] #TODO: does this use the denominator twice now self.probs[k] = self.probs[k] / tmpsum if np.size(self.datadict[name])>0: #not used yet, might have to move one indentation level #self.probs[k] = self.probs[k] / tmpsum ## np.exp(-self.datadict[name] * ## np.sum(self.recursionparams[self.parinddict[name]])) if DEBUG: print('self.datadict[name], self.probs[k]') print(self.datadict[name], self.probs[k]) #if not name == 'top': # self.probs[k] = self.probs[k] * np.exp( iv) #walk one level down again to add branch probs to instance.probs self.bprobs[name] = [] for bidx, b in enumerate(subtree): if DEBUG: print('repr(b)', repr(b), bidx) #if len(b) == 1: #TODO: skip leaves, check this if not isinstance(b, tuple): # isinstance(b, str): #TODO: replace this with a check for branch (tuple) instead #this implies name is a bottom branch, #possible to add special things here self.bprobs[name].append(self.probs[b]) #TODO: need tau possibly here self.probs[b] = self.probs[b] / branchsum if DEBUG: print('*********** branchsum at bottom branch', branchsum) #self.bprobs[name].append(self.probs[b]) else: bname = b[0] branchsum2 = sum(self.branchvalues[name]) assert np.abs(branchsum - branchsum2).sum() < 1e-8 bprob = branchvalue[bidx]/branchsum self.bprobs[name].append(bprob) for k in self.branchleaves[bname]: if DEBUG: print('branchprob', bname, k, bprob, branchsum) #temporary hack with maximum to avoid zeros self.probs[k] = self.probs[k] * np.maximum(bprob, 1e-4) if DEBUG: print('working on branch', tree, branchsum) if testxb<2: return branchsum else: #this is the relevant part self.branchsums[name] = branchsum if np.size(self.datadict[name])>0: branchxb = np.sum(self.datadict[name] * self.recursionparams[self.parinddict[name]]) else: branchxb = 0 if not name=='top': tau = self.recursionparams[self.paramsidx['tau_'+name]] else: tau = 1 iv = branchxb + tau * branchsum #which tau: name or parent??? return branchxb + tau * np.log(branchsum) #iv #branchsum is now IV, TODO: add effect of branch variables else: tau = self.recursionparams[self.paramsidx['tau_'+parent]] if DEBUG: print('parent', parent) self.branchleaves[parent].append(tree) # register leave with parent self.probstxt[tree] = [tree + '-prob' + '(%s)' % ', '.join(self.paramsind[tree])] #this is not yet a prob, not normalized to 1, it is exp(x*b) leafprob = np.exp(np.sum(self.datadict[tree] * self.recursionparams[self.parinddict[tree]]) / tau) # fake tau for now, wrong spot ??? #it seems I get the same answer with and without tau here self.probs[tree] = leafprob #= 1 #try initialization only #TODO: where should I add tau in the leaves if testxb == 2: return np.log(leafprob) elif testxb == 1: leavessum = np.array(datadict[tree]) # sum((datadict[bi] for bi in datadict[tree])) if DEBUG: print('final branch with', tree, ''.join(tree), leavessum) #sum(tree) return leavessum #sum(xb[tree]) elif testxb == 0: return ''.join(tree) #sum(tree)
if __name__ == '__main__': DEBUG = 0 endog = 5 # dummy place holder ############## Example similar to Greene #get pickled data #endog3, xifloat3 = cPickle.load(open('xifloat2.pickle','rb')) tree0 = ('top', [('Fly',['Air']), ('Ground', ['Train', 'Car', 'Bus']) ] ) ''' this is with real data from Greene's clogit example datadict = dict(zip(['Air', 'Train', 'Bus', 'Car'], [xifloat[i]for i in range(4)])) ''' #for testing only (mock that returns it's own name datadict = dict(zip(['Air', 'Train', 'Bus', 'Car'], ['Airdata', 'Traindata', 'Busdata', 'Cardata'])) if testxb: datadict = dict(zip(['Air', 'Train', 'Bus', 'Car'], np.arange(4))) datadict.update({'top' : [], 'Fly' : [], 'Ground': []}) paramsind = {'top' : [], 'Fly' : [], 'Ground': [], 'Air' : ['GC', 'Ttme', 'ConstA', 'Hinc'], 'Train' : ['GC', 'Ttme', 'ConstT'], 'Bus' : ['GC', 'Ttme', 'ConstB'], 'Car' : ['GC', 'Ttme'] } modru = RU2NMNL(endog, datadict, tree0, paramsind) modru.recursionparams[-1] = 2 modru.recursionparams[1] = 1 print('Example 1') print('---------\n') print(modru.calc_prob(modru.tree)) print('Tree') pprint(modru.tree) print('\nmodru.probs') pprint(modru.probs) ############## example with many layers tree2 = ('top', [('B1',['a','b']), ('B2', [('B21',['c', 'd']), ('B22',['e', 'f', 'g']) ] ), ('B3',['h']) ] ) #Note: dict looses ordering paramsind2 = { 'B1': [], 'a': ['consta', 'p'], 'b': ['constb', 'p'], 'B2': ['const2', 'x2'], 'B21': [], 'c': ['constc', 'p', 'time'], 'd': ['constd', 'p', 'time'], 'B22': ['x22'], 'e': ['conste', 'p', 'hince'], 'f': ['constf', 'p', 'hincf'], 'g': [ 'p', 'hincg'], 'B3': [], 'h': ['consth', 'p', 'h'], 'top': []} datadict2 = dict([i for i in zip('abcdefgh',lrange(8))]) datadict2.update({'top':1000, 'B1':100, 'B2':200, 'B21':21,'B22':22, 'B3':300}) ''' >>> pprint(datadict2) {'B1': 100, 'B2': 200, 'B21': 21, 'B22': 22, 'B3': 300, 'a': 0.5, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'top': 1000} ''' modru2 = RU2NMNL(endog, datadict2, tree2, paramsind2) modru2.recursionparams[-3] = 2 modru2.recursionparams[3] = 1 print('\n\nExample 2') print('---------\n') print(modru2.calc_prob(modru2.tree)) print('Tree') pprint(modru2.tree) print('\nmodru.probs') pprint(modru2.probs) print('sum of probs', sum(list(itervalues(modru2.probs)))) print('branchvalues') print(modru2.branchvalues) print(modru.branchvalues) print('branch probabilities') print(modru.bprobs) print('degenerate branches') print(modru.branches_degenerate) ''' >>> modru.bprobs {'Fly': [], 'top': [0.0016714179077931082, 0.99832858209220687], 'Ground': []} >>> modru2.bprobs {'top': [0.25000000000000006, 0.62499999999999989, 0.12500000000000003], 'B22': [], 'B21': [], 'B1': [], 'B2': [0.40000000000000008, 0.59999999999999998], 'B3': []} ''' params1 = np.array([ 0., 1., 0., 0., 0., 0., 1., 1., 2.]) print(modru.get_probs(params1)) params2 = np.array([ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 2., 1., 1.]) print(modru2.get_probs(params2)) #raises IndexError