Effective C# 原则10: 明白GetHashCode()的缺陷

这是本书中唯一一个被一整个函数占用的原则,你应该避免写这样的函数。GetHashCode()仅在一种情况下使用:那就是对象被用于基于散列的集合的关键词,如经典的HashTable或者Dictionary容器。这很不错,由于在基类上实现的GetHashCode()存在大量的问题。对于引用类型,它可以工作,但高效不高;对于值类型,基类的实现经常出错。这更糟糕。你自己完全可以写一个即高效又正确的GetHashCode()。没有那个单一的函数比GetHashCode()讨论的更多,且令人困惑。往下看,为你解释困惑。

如果你定义了一个类型,而且你决不准备把它用于某个容器的关键词,那就没什么事了。像窗体控件,网页控件,或者数据库链接这样的类型是不怎像要做为某个任何的关键词的。在这些情况下,什么都不用做了。所有的引用类型都会得到一个正确的散列值,即使这样效率很糟糕。值类型应该是恒定的(参见原则7),这种情况下,默认的实现总是工作的,尽管这样的效率也是很糟糕的。在大多数情况下,你最好完全避免在类型的实例上使用GetHashCode()。

然而,在某天你创建了一个要做为HashTable的关键词来使用的类型,那么你就须要重写你自己的GetHashCode()的实现了。继续看,基于散列(算法)的集合用散列值来优化查找。每一个对象产生一个整型的散列值,而该对象就存储在基于这个散列值的“桶”中。为了查找某个对象,你通过它的散列值来找到这个(存储了实际对象的)“桶”。在.Net里,每一对象都有一个散列值,它是由System.Object.GetHashCode()断定的。任何对GetHashCode()的重写都必须遵守下面的三个规则:

1、如果两个对象是相等的(由操作符==所定义),那么它们必须产生相同的散列值。否则,无法通过散列值在容器中找到对象。

2、对于任意对象A,A.GetHashCode()必须是实例不变的。不管在A上调用了什么方法,A.GetHashCode()必须总是返回同样的散列值。这就保证在某个“桶”中的对象始终是在这个“桶”中。

3、对于任意的输入,散列函数总是产生产生一个介于整型内的随机分布。这会让你在一个基于散列的容器取得好的效率。

为一个类型写一个正确且高效的散列函数来满足上面的三条,要对该类型有广泛的认识。System.Object和System.ValueType的默认版本并不具备什么优势。这些版本必须为你的特殊类型提供默认的行为,而同时它们对这些特殊的类型又并不了解。Object.GetHashCode()是使用System.Object内在的成员来产生散列值。每个对象在产生时指定一个唯一的值来做为对象关键词,(这个值)以整型来存储。这些关键词从1开始,在每次有任何新的对象产生时逐渐增加。对象的ID字段在System.Object的构造函数进行设置,并且今后再也不能修改。Object.GetHashCode()就是把这个值当成给定对象的散列值来返回。
(译注:注意这里说的是默认的引用类型,其它情况就不是这样的了。)

现在我们根据上面的三个原则来验证Object.GetHashCode()。如果两个对象是相等的,Object.GetHashCode()返回同样的散列值,除非你重写了操作符==。System.Object这个版本的==操作符是检测对象的ID。GetHashCode()返回对象内部的ID字段,它是好的。然而,如果你提供了自己的==版本,你就必须同时提供你自己版本的GetHashCode(),从而保证遵守了前面说的第一条规则。相等的细节参见原则9。

第二条规则已经遵守了:一个对象创建后,它的散列值是不能改变的。

第三条规则,对所有的输入,在整型内进行随机分布,这并没有被支持。这个数字序列并不是整型上的随机分布,除非你创建了大量的对象。Object.GetHashCode()所产生的散列值主要集中在尽可能小的整型范围内。

这就是说这个Object.GetHashCode()是正确的,但并不高效。如果你在你定义的引用类型上创建一个散列表,这个默认从System.Object上继承的行为是工作的,但是比较慢。当你创建准备用于散列关键词的引用类型时,你应该为你的特殊类型重写GetHashCode(),从而提供更好的在整型范围上随机分布的散列值。

在讲解如何重写你自己的GetHashCode()之前,这一节来验证ValueType.GetHashCode()是否也遵守上面的三条规则。System.ValueType重写了GetHashCode(),为所有的值类型提供默认的行为。这一版本从你所定义的类型的第一个字段上返回散列。考虑这个例子:

public struct MyStruct
{
  private string   _msg;
  private int      _id;
  private DateTime _epoch;
}

从MyStruct对象上返回的散列值是从该对象的_msg成员上生成的。下面的代码片断总是返回true:

MyStruct s = new MyStruct( );
return s.GetHashCode( ) == s._msg.GetHashCode( );

规则1表示,两个相等的对象(用操作符==定义的)必须返回相同的散列值。这一规则被大多数值类型遵守着,但你可以破坏它, just as you could with for reference types. ValueType的操作符==()与其它成员一起来比较结构的第一个字段。这是满足第一条规则的。只要在任何时候你所重写的==操作符用第一个字段,这将可以正常工作。任何不以第一个字段断定相等的结构将违反这一规则,从而破坏GetHashCode().

第二条规则表示,散列值必须是实例不变的。这一规则只有当结构的第一个成员字段是恒定类型时才被遵守。如果第一个字段的值发生了改变,那么散列值也会发生改变。这就破坏了这规则。是的,如果你的结构的实例对象在它的生存期内改变了结构的第一个字段,那么GetHashCode()就破坏了这一规则。这也就是另一个原因,你最好的原则就是把值类型设置为恒定类型(参见原则7)。

第三个规则依懒于第一个字段以及是如何使用它的。如果你的第一个字段能保证产生一个在整型范围上的随机分布,并且第一个字段的分布能复盖结构的所有其它值,那么这个结构就很好的保证了一个均衡的分布(译注:就是说结构的第一个字段可以唯一的决定一个实例)。然而,如果第一个字段经常具有相同的值,那么这一规则也会被破坏。考虑对前面的结构做一个小的修改:

public struct MyStruct
{
  private DateTime _epoch;
  private string   _msg;
  private int      _id;
}

如果_epoch字段设置的是当前日期(不包含时间),所有在同一给定日期里创建的对象具有相同的散列值。这防碍了在所有散列值中进行均衡的分布。

概括一个默认的行为,Object.GetHashCode()可以正确的在引用类型上工作,尽管它不是必须保证一个高效的分布。(如果你有一个对Object的==操作符的重载,你会破坏GetHashCode())。ValueType.GetHashCode()仅在你的结构的第一个字段是只读的时候才能正确工作。而当你的结构的第一个字段的值,复盖了它所能接受的输入的有意义的子集时,ValueType.GetHashCode()就可以保证一个高效的散列值。

如果你准备创建一个更好的散列值,你须要为你的类型建立一些约束。再次验证上面的三条规则,现在我们来实现一个可工作的GetHashCode()。

首先,如果两个对象相等,就是由操作符==所定义的,它们必须返回同样的散列值。类型的任何承担散列值的属性或者数据值也必须参与相等比较。显然,这就意味着同样用于相等比较的的属性也用于散列值的生成。然而很可能所有与相等比较的属性,并不用于散列值的计算。System.ValueType的默认行为就是这样的,也就是说它经常违反规则3。同样的数据元素应该参同时参与两个运算(比较和散列)。

第二条规则就是GetHashCode()返回的值必须是实例不变的。想象你已经了一个引用类型,Customer:

public class Customer
{
  private string _name;
  private decimal _revenue;

  public Customer( string name )
  {
    _name = name;
  }

  public string Name
  {
    get { return _name; }
    set { _name = value; }
  }

  public override int GetHashCode()
  {
    return _name.GetHashCode();
  }
}

假设你运行下面的代码片断:

Customer c1 = new Customer( "Acme Products" );
myHashMap.Add( c1, orders );
// Oops, the name is wrong:
c1.Name = "Acme Software";

c1在某些地方会丢失散列映射。当你把c1放在映射中时,散列值是由字符串“Acme Products”来保证的。当你把客户的名字修改为“Acme Software”后,散列值也发生了改变。现在是由新的名字:“Acme Software”来保证的了。c1存储在一个由“Acme Products”决定的“桶”内,而它不应该存在于由“Acme Software”决定的“桶”内。你将会在你自己的集合中丢失这个客户。丢失的原因就是散列值不是实例不变的。你在对象存储后,改变了正确的“桶”。

前面的情形只有当Customer是引用类型时才出现。而当是值类型时,会有不同的错误行为,而且同样是带来麻烦的。如果Customer是一个值类型,c1的一个拷贝会存在在散列映射中。因为装箱与拆箱很会拷贝数据。这并不是很可靠,当你在把一个值类型对象添加到集合之后,你还可以修改值类型的成员数据。

唯一安置好规则2的方法就是,定义一个散列函数,它依懒于对象的一些不变的属性来返回散列值。System.Object是通过不变的对象ID来遵守这一规则的。System.ValueType希望你的类型的第一个字段不发生改变。除了把你的类型设计为恒定类型以外,你没有更好的方法了。当你准备定义一个在散列容器中当关键词使用的值类型时,它必须是一个恒定的类型。如果不遵守劝告,你的用户就可以把你的类型做为一个关键词在散列表中使用,进而找到一个方法破坏散列表。 更正Customer类,你可以修改它,使客户名成为一个恒定的:

public class Customer
{
  private readonly string _name;
  private decimal _revenue;

  public Customer( string name ) :
    this ( name, 0 )
  {
  }

  public Customer( string name, decimal revenue )
  {
    _name = name;
    _revenue = revenue;
  }

  public string Name
  {
    get { return _name; }
  }

  // Change the name, returning a new object:
  public Customer ChangeName( string newName )
  {
    return new Customer( newName, _revenue );
  }

  public override int GetHashCode()
  {
    return _name.GetHashCode();
  }
}

使名字成为恒定的类型后,你要怎样才能修改一个客户对象的名字呢:

Customer c1 = new Customer( "Acme Products" );
myHashMap.Add( c1,orders );
// Oops, the name is wrong:
Customer c2 = c1.ChangeName( "Acme Software" );
order o = myHashMap[ c1 ] as order;
myHashMap.Remove( c1 );
myHashMap.Add( c2, o );

你已经移除了原来的客户,修改了名字,然后添加一个新的客户对象到散列表中。这看上去比原来的更麻烦,但它可以正确工作。先前的版本充许程序员写一些不正确的代码。通过强制使用恒定属性来生成散列值后,你就增加了正确的行为。你的用户就不会出错了。是的,这个版本可以更好的工作。你迫使开发人员写更多的代码,但这是因为只有这样才能写正确的代码。请确保参与散列运算的数据成员是恒定的。

第三条规则是说,GetHashCode()应该对所有的输入随机的生成一个分布在整型范围内的值。这一需求依懒于你实际创建的类型。如果有一个神奇的公式存在,那它应该在System.Object中实现了,并且这一原则(译注:这里说的是全文这一原则)也将不存在了。一个通用而且成功的算法就是XOR(异或)运算,对一个类型内的所有字段的散列再进行异或后返回。如果你的类型里包含一些持久字段,计算时应该排除它们。

GetHashCode()具有很特殊的要求:相等的对象必须产生相等的散列值,并且散列值必须是对象不变的,并且是均衡的高效分布。所有这些只有对恒定类型才能满足(译注:本文前面已经说过了,.Net框架中的System.Object.GetHashCode()其实并不满足均衡高效分布这一规则)。对于其它类型,就交给默认的行为吧,知道它的缺点就行了。

========================================================

Item 10: Understand the Pitfalls of GetHashCode()

This is the only item in this book dedicated to one function that you should avoid writing. GetHashCode() is used in one place only: to define the hash value for keys in a hash-based collection, typically the Hashtable or Dictionary containers. That's good because there are a number of problems with the base class implementation of GetHashCode(). For reference types, it works but is inefficient. For value types, the base class version is often incorrect. But it gets worse. It's entirely possible that you cannot write GetHashCode() so that it is both efficient and correct. No single function generates more discussion and more confusion than GetHashCode(). Read on to remove all that confusion.

If you're defining a type that won't ever be used as the key in a container, this won't matter. Types that represent window controls, web page controls, or database connections are unlikely to be used as keys in a collection. In those cases, do nothing. All reference types will have a hash code that is correct, even if it is very inefficient. Value types should be immutable (see Item 7), in which case, the default implementation always works, although it is also inefficient. In most types that you create, the best approach is to avoid the existence of GetHashCode() entirely.

One day, you'll create a type that is meant to be used as a hashtable key, and you'll need to write your own implementation of GetHashCode(), so read on. Hash-based containers use hash codes to optimize searches. Every object generates an integer value called a hash code. Objects are stored in buckets based on the value of that hash code. To search for an object, you request its key and search just that one bucket. In .NET, everyobject has a hash code, determined by System.Object.GetHashCode(). Any overload of GetHashCode() must follow these three rules:

If two objects are equal (as defined by operator==), they must generate the same hash value. Otherwise, hash codes can't be used to find objects in containers.

For any object A, A.GetHashCode() must be an instance invariant. No matter what methods are called on A, A.GetHashCode() must always return the same value. That ensures that an object placed in a bucket is always in the right bucket.

The hash function should generate a random distribution among all integers for all inputs. That's how you get efficiency from a hash-based container.

Writing a correct and efficient hash function requires extensive knowledge of the type to ensure that rule 3 is followed. The versions defined in System.Object and System.ValueType do not have that advantage. These versions must provide the best default behavior with almost no knowledge of your particular type. Object.GetHashCode() uses an internal field in the System.Object class to generate the hash value. Each object created is assigned a unique object key, stored as an integer, when it is created. These keys start at 1 and increment every time a new object of any type gets created. The object identity field is set in the System.Object constructor and cannot be modified later. Object.GetHashCode() returns this value as the hash code for a given object.

Now examine Object.GetHashCode() in light of those three rules. If two objects are equal, Object.GetHashCode()returns the same hash value, unless you've overridden operator==. System.Object's version of operator==() tests object identity. GetHashCode() returns the internal object identity field. It works. However, if you've supplied your own version of operator==, you must also supply your own version of GetHashCode() to ensure that the first rule is followed. See Item 9 for details on equality.

The second rule is followed: After an object is created, its hash code never changes.

The third rule, a random distribution among all integers for all inputs, does not hold. A numeric sequence is not a random distribution among all integers unless you create an enormous number of objects. The hash codes generated by Object.GetHashCode() are concentrated at the low end of the range of integers.

This means that Object.GetHashCode() is correct but not efficient. If you create a hashtable based on a reference type that you define, the default behavior from System.Object is a working, but slow, hashtable. When you create reference types that are meant to be hash keys, you shouldoverride GetHashCode()to get a better distribution of the hash values across all integers for your specific type.

Before covering how to write your own override of GetHashCode, this section examines ValueType.GetHashCode()with respect to those same three rules. System.ValueType overrides GetHashCode(), providing the default behavior for all value types. Its version returns the hash code from the first field defined in the type. Consider this example:

public struct MyStruct
{
  private string   _msg;
  private int      _id;
  private DateTime _epoch;
}

The hash code returned from a MyStruct object is the hash code generated by the _msg field. The following code snippet always returns true:

MyStruct s = new MyStruct( );
return s.GetHashCode( ) == s._msg.GetHashCode( );

The first rule says that two objects that are equal (as defined by operator==()) must have the same hash code. This rule is followed for value types under most conditions, but you can break it, just as you could with for reference types. ValueType.operator==() compares the first field in the struct, along with every other field. That satisfies rule 1. As long as any override that you define for operator== uses the first field, it will work. Any struct whose first field does not participate in the equality of the type violates this rule, breaking GetHashCode().

The second rule states that the hash code must be an instance invariant. That rule is followed only when the first field in the struct is an immutable field. If the value of the first field can change, so can the hash code. That breaks the rules. Yes, GetHashCode() is broken for any struct that you create when the first field can be modified during the lifetime of the object. It's yet another reason why immutable value types are your best bet (see Item 7).

The third rule depends on the type of the first field and how it is used. If the first field generates a random distribution across all integers, and the first field is distributed across all values of the struct, then the struct generates an even distribution as well. However, if the first field often has the same value, this rule is violated. Consider a small change to the earlier struct:

public struct MyStruct
{
  private DateTime _epoch;
  private string   _msg;
  private int      _id;
}

If the _epoch field is set to the current date (not including the time), all MyStruct objects created in a given date will have the same hash code. That prevents an even distribution among all hash code values.

Summarizing the default behavior, Object.GetHashCode() works correctly for reference types, although it does not necessarily generate an efficient distribution. (If you have overridden Object.operator==(), you can break GetHashCode()). ValueType.GetHashCode() works only if the first field in your struct is read-only. ValueType.GetHashCode() generates an efficient hash code only when the first field in your struct contains values across a meaningful subset of its inputs.

If you're going to build a better hash code, you need to place some constraints on your type. Examine the three rules again, this time in the context of building a working implementation of GetHashCode().

First, if two objects are equal, as defined by operator==(), they must return the same hash value. Any property or data value used to generate the hash code must also participate in the equality test for the type. Obviously, this means that the same properties used for equality are used for hash code generation. It's possible to have properties participate in equality that are not used in the hash code computation. The default behavior for System.ValueType does just that, but it often means that rule 3 usually gets violated. The same data elements should participate in both computations.

The second rule is that the return value of GetHashCode() must be an instance invariant. Imagine that you defined a reference type, Customer:

public class Customer
{
  private string _name;
  private decimal _revenue;

  public Customer( string name )
  {
    _name = name;
  }

  public string Name
  {
    get { return _name; }
    set { _name = value; }
  }

  public override int GetHashCode()
  {
    return _name.GetHashCode();
  }
}

Suppose that you execute the following code snippet:

Customer c1 = new Customer( "Acme Products" );
myHashMap.Add( c1, orders );
// Oops, the name is wrong:
c1.Name = "Acme Software";

c1 is lost somewhere in the hash map. When you placed c1 in the map, the hash code was generated from the string "Acme Products". After you change the name of the customer to "Acme Software", the hash code value changed. It's now being generated from the new name: "Acme Software". C1 is stored in the bucket defined by "Acme Products", but it should be in the bucket defined for "Acme Software". You've lost that customer in your own collection. It's lost because the hash code is not an object invariant. You've changed the correct bucket after storing the object.

The earlier situation can occur only if Customer is a reference type. Value types misbehave differently, but they still cause problems. If customer is a value type, a copy of c1 gets stored in the hashmap. The last line changing the value of the name has no effect on the copy stored in the hashmap. Because boxing and unboxingmake copies as well, it's very unlikely that you can change the members of a value type after that object has been added to a collection.

The only way to address rule 2 is to define the hash code function to return a value based on some invariant property or properties of the object. System.Object abides by this rule using the object identity, which does not change. System.ValueType hopes that the first field in your type does not change. You can't do better without making your type immutable. When you define a value type that is intended for use as a key type in a hash container, it must be an immutable type. Violate this recommendation, and the users of your type will find a way to break hashtables that use your type as keys. Revisiting the Customer class, you can modify it so that the customer name is immutable:

public class Customer
{
  private readonly string _name;
  private decimal _revenue;

  public Customer( string name ) :
    this ( name, 0 )
  {
  }

  public Customer( string name, decimal revenue )
  {
    _name = name;
    _revenue = revenue;
  }

  public string Name
  {
    get { return _name; }
  }

  // Change the name, returning a new object:
  public Customer ChangeName( string newName )
  {
    return new Customer( newName, _revenue );
  }

  public override int GetHashCode()
  {
    return _name.GetHashCode();
  }
}

Making the name immutable changes how you must work with customer objects to modify the name:

Customer c1 = new Customer( "Acme Products" );
myHashMap.Add( c1,orders );
// Oops, the name is wrong:
Customer c2 = c1.ChangeName( "Acme Software" );
order o = myHashMap[ c1 ] as order;
myHashMap.Remove( c1 );
myHashMap.Add( c2, o );

You have to remove the original customer, change the name, and add the new customer object to the hashtable. It looks more cumbersome than the first version, but it works. The previous version allowed programmers to write incorrect code. By enforcing the immutability of the properties used to calculate the hash code, you enforce correct behavior. Users of your type can't go wrong. Yes, this version is more work. You're forcing developers to write more code, but only because it's the only way to write the correct code. Make certain that any data members used to calculate the hash value are immutable.

The third rule says that GetHashCode() should generate a random distribution among all integers for all inputs. Satisfying this requirement depends on the specifics of the types you create. If a magic formula existed, it would be implemented in System.Object and this item would not exist. A common and successful algorithm is to XOR all the return values from GetHashCode() on all fields in a type. If your type contains some mutable fields, exclude those fields from the calculations.

GetHashCode() has very specific requirements: Equal objects must produce equal hash codes, and hash codes must be object invariants and must produce an even distribution to be efficient. All three can be satisfied only for immutable types. For other types, rely on the default behavior, but understand the pitfalls.

评论: 0 | 引用: 0 | 查看次数: 5191
发表评论
登录后再发表评论!