nltk.ghd()

nltk.ghd(ref, hyp, ins_cost=2.0, del_cost=2.0, shift_cost_coeff=1.0, boundary='1')

Compute the Generalized Hamming Distance for a reference and a hypothetical segmentation, corresponding to the cost related to the transformation of the hypothetical segmentation into the reference segmentation through boundary insertion, deletion and shift operations.

A segmentation is any sequence over a vocabulary of two items (e.g. “0”, “1”), where the specified boundary value is used to mark the edge of a segmentation.
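To make the definition concrete, the sketch below lists the positions at which a segmentation carries the boundary value. The helper `boundary_positions` is a hypothetical name used for illustration only and is not part of nltk; it assumes the segmentation is a string or list and that items are compared to the boundary value by equality.

```python
def boundary_positions(seg, boundary="1"):
    """Return the indices of `seg` whose item equals the boundary value.

    Hypothetical helper for illustration; not an nltk function.
    """
    return [i for i, item in enumerate(seg) if item == boundary]


# A string segmentation using the default boundary value '1'.
print(boundary_positions("1100100000"))  # [0, 1, 4]

# The same segmentation as a list of ints, with an int boundary value.
print(boundary_positions([1, 1, 0, 0, 1, 0, 0, 0, 0, 0], boundary=1))  # [0, 1, 4]
```

Note that the boundary value has to match the item type: a string segmentation needs a string boundary such as '1', while a list of ints needs an int boundary such as 1.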

Recommended parameter values are a shift_cost_coeff of 2, together with an ins_cost and a del_cost equal to the mean segment length in the reference segmentation.
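One way to derive the recommended costs is sketched below. The helper `mean_segment_length` is a hypothetical name, not an nltk function, and it assumes the mean segment length can be approximated as the number of items divided by the number of boundary marks; other definitions are possible.

```python
def mean_segment_length(ref, boundary="1"):
    """Approximate the mean segment length as items per boundary mark.

    Hypothetical helper for illustration; the approximation itself is an
    assumption, not something this page specifies.
    """
    n_boundaries = sum(1 for item in ref if item == boundary)
    if n_boundaries == 0:
        raise ValueError("reference segmentation contains no boundary")
    return len(ref) / n_boundaries


ref = "0100100000"
cost = mean_segment_length(ref)  # 10 items / 2 boundaries
print(cost)  # 5.0

# With nltk installed, the recommended call would then look like this
# (hyp being the hypothetical segmentation to evaluate):
# from nltk.metrics.segmentation import ghd
# ghd(ref, hyp, ins_cost=cost, del_cost=cost, shift_cost_coeff=2.0)
```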

>>> # Same examples as Kulyukin C++ implementation
>>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5)
0.5
>>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5)
2.0
>>> ghd('011', '110', 1.0, 1.0, 0.5)
1.0
>>> ghd('1', '0', 1.0, 1.0, 0.5)
1.0
>>> ghd('111', '000', 1.0, 1.0, 0.5)
3.0
>>> ghd('000', '111', 1.0, 2.0, 0.5)
6.0
Parameters:
  • ref (str or list) – the reference segmentation
  • hyp (str or list) – the hypothetical segmentation
  • ins_cost (float) – insertion cost
  • del_cost (float) – deletion cost
  • shift_cost_coeff (float) – constant used to compute the cost of a shift: shift cost = shift_cost_coeff * |i - j|, where i and j are the positions indicating the shift
  • boundary (str or int or bool) – boundary value

Return type:
  float
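The shift cost formula can be checked by hand against the first example above. The function `shift_cost` below is a hypothetical illustration of the formula only, not the library's implementation of the full distance.

```python
def shift_cost(i, j, shift_cost_coeff=1.0):
    """Cost of moving a boundary from position i to position j.

    Hypothetical helper that only evaluates shift_cost_coeff * |i - j|.
    """
    return shift_cost_coeff * abs(i - j)


ref = "1100100000"  # boundaries at positions 0, 1 and 4
hyp = "1100010000"  # boundaries at positions 0, 1 and 5

# Only the third boundary differs: it sits at 4 in ref and at 5 in hyp.
print(shift_cost(4, 5, shift_cost_coeff=0.5))  # 0.5
```

This agrees with the 0.5 reported for that pair of segmentations. In the second example the differing boundary sits at position 4 in the reference and 9 in the hypothesis, so a shift would cost 0.5 * 5 = 2.5, which is more than one deletion plus one insertion at 1.0 each; that is consistent with the reported distance of 2.0.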