7.1 开放地址法

2016-01-25 23:09:57 5,270 0

在开放地址法中，如数据不能直接放在由hash函数计算出来的数组的下标所指的单元时，就要寻找数组的其他位置。下面要探索开放地址法的三种方法，它们在找下一个空白单元时使用的方法不同。这三种方法分别是线性探测、二次探测和再哈希法。

一、线性探测
在线性探测中，线性的查找空白单元，如果5421是要插入数据的位置，它已经被占用了，那么就使用5422，然后是5423，以此类推，数组的下标一直递增，直到找到空位。这就叫做线性探测，因为它沿着数组的下标一步一步顺序的查找空白单元。

例如关键字的范围是从0到999，数组的初始大小是60。哈希函数必须把关键字的范围压缩到数组的范围。用前面用过的取余操作符（%）来完成：

arrayIndex = key % arraySize;

针对数组的大小为60，即

arrayIndex = key % 60;

这个哈希函数足够简单，可以手算出来。对给定的关键字，每次都减掉60，直到结果小于60。例如，哈希化143，减60是83，再减60是23。这就是算法决定的143应该存放的位置。因此，可以容易地检查算法是否把关键字映射到正确的地址。

下图演示的了一个大小为60，包含30个数据项的哈希表。

分析：

这个哈希表包含30项，所以，有一半空间是满的。当数据项数目占哈希表长度的一半，或最多到三分之二（60个单元的表容纳40个数据项）时，哈希表的性能最好。
可以看出，已经填充的单元分布并不均匀。有时，有一串空白单元，有时又有一串已填充单元。哈希表中，一串连续的已填充单元叫做填充序列。增加越来越多的数据项时，填充序列变得越来越长，这叫做聚集，如下图所示：

注意，如果把哈希表填的太满，那么在表中每填充一个数据项都要花费很长时间。哈希表在几乎被填满的数组中添加数据项，效率的确非常低。

查找操作分析

查找操作，对数据的关键字值应用hash函数。结果是一个数组下标。这个下标所指向单元可能是要寻找的关键字；这是最好的情况，并且立即报告查找成功。

然而，这个单元被其他关键字占据。这就是冲突。根据冲突的位置，查找算法依次查找下一个单元。查找合适单元的过程叫做探测。

根据冲突的位置，查找算法沿着数组一个一个地观察每个单元。如果在要找到要寻找的关键字前遇到一个空位，说明查找失败。不需要再做查找，因为插入算法本应该插在那个空位上。下图显示了成功和不成功的探测：

插入操作分析

在我们的案例插入的就是关键字的值，插入算法使用和查找算法同样的算法找到合适的单元。如果最初的单元被占用，它会用线性探测，寻找空白单元，当找到后，就插入数据。

当我们插入数据的时候，有的时候一次就可以插入成功，但是有的时候会遇到冲突，需要沿数组步进，查找新的空白单元。走过的步数称为探索长度。大多数探索单元可能会只有几个单元长，当数组过分满的时候，探索长度可能更长。

以我们的大小为60的数组为例，那么关键字7、67、127、187、247等等一直到967都映射到7。这就需要使用到探测。

删除操作分析：

当删除一个数据项时，同样需要关键字。删除操作不是简单的把某个单元的数据项删除，把它变成空白。为什么呢？记住，在插入操作中，探测过程走过一系列单元，查找一个空白单元。如果在一个填充序列的中间有一个空位，查找算法就会在看到它时中途放弃查找，即使最终本可以达到要求的那个单元。

例如，我们只插入7、67、127、187、247，第一个关键字值7直接插入数据下标为7的位置上，当67插入时，由于下标为7的位置已经被占用，就会通过线性探测插入8号位置，127、187、247依次插入9、10、11位置。如果我们删除了关键字值为127的数据项，也就是数组下标为9的位置直接置为空白。那么当我们要查找10或者11的时候，就无法查找到。

因此，要用一个有特殊关键字值得数据项替代要被删除的数据项，以此标识此处的数据项已不存在。在我们的案例中，假设被删除的数据项用特殊的关键字(-1)标记。

插入操作将在第一个有效空位或值为-1的位置插入一个新数据项。查找过程会把值为-1的数据项当做一个已存在的项，为的是可以跨过这项，查找更多的数据项。

如果做了很多次删除操作，哈希表就会充满值为-1的数据项，这使得哈希表效率下降。因此许多哈希表不允许删除操作。如果实现了删除，应该尽量的有节制的进行删除。

允许有重复的值吗？

在哈希表中允许相同的数据项吗？如果这样做的话，那么会发现只有第一个数据项会被存取，存取第二个数据项的唯一方法是删除第一个数据项，这样做不太方便。

可以重写查找算法，使它可以找到所有具有相同关键字的项，而不是只找第一个。然而，这需要搜索它遇到的每个线性序列。即使不存在重复关键字，由于要搜索整个表，所以非常耗时。在大多数情况下，可能会禁止重复关键字。

聚集

当哈希表变得越来越满时，聚集变得越来越严重。这导致产生非常长的探测长度。意味着存取序列的最后单元会非常耗时。数组填的越满，聚集越可能发生。数组有一般数据项时，这通常不是问题。当三分之二满的时候，情况也不会太坏。然而，如果超过这个界限，随着聚集越来越严重，性能下降也很严重。因此，设计哈希表的关键是确保它不会超过整个数据容量的一半，最多到三分之二。

Java代码实现线性探测的哈希表：LineProbeHashTable.java

package com.tianshouzhi.algrithm.array.hash;

public class HashTable {
	//
	private DataItem[] hashArray; // array holds hash table
	
	private int arraySize;//数组的大小，一般情况下，这个值应该是质数，而且应该是实际记录总量的2倍左右
	
	private DataItem nonItem; // 删除项都用此数据项代替

	
	public HashTable(int size) // constructor
	{
		arraySize = size;
		hashArray = new DataItem[arraySize];
		nonItem = new DataItem(-1); // deleted item key is -1
	}

	// 打印出哈希表中的值
	public void displayTable() {
		System.out.print("Table: ");
		for (int j = 0; j < arraySize; j++) {
			if (hashArray[j] != null)
				System.out.print(hashArray[j].getKey() + " ");
			else
				System.out.print("** ");
		}
		System.out.println("");
	}

	// -------------------------------------------------------------
	public int hashFunc(int key) {
		return key % arraySize; // hash function
	}

	/**
	 * 插入指定的数据项，注意目前这里并没有考虑扩容的情况，留在后文解决
	 * @param item
	 */
	public void insert(DataItem item) // 插入一个数据项
	// (assumes table not full)
	{
		int key = item.getKey(); // 提取数据项中的关键字
		int hashVal = hashFunc(key); // 对关键字应用哈希函数，得到的散列值就是对应的数组下标
		// 循环，通过线性探测直到遇到一个空白单元或者数据项值为-1的单元(已删除)，插入数据项
		while (hashArray[hashVal] != null && hashArray[hashVal].getKey() != -1) {
			++hashVal; // go to next cell
			hashVal %= arraySize; // wraparound if necessary
		}
		hashArray[hashVal] = item; // insert item
	} // end insert()
		// -------------------------------------------------------------
	/**
	 * 删除哈希表中的指定key的数据项，并返回删除的数据项，如果没有找到指定的删除项，返回null
	 * @param key
	 * @return
	 */
	public DataItem delete(int key) // delete a DataItem
	{
		int hashVal = hashFunc(key); // hash the key
		while (hashArray[hashVal] != null) // until empty cell
		{ // found the key?
			if (hashArray[hashVal].getKey() == key) {
				DataItem temp = hashArray[hashVal]; // save item
				hashArray[hashVal] = nonItem; // delete item
				return temp; // return item
			}
			++hashVal; // go to next cell
			hashVal %= arraySize; // wraparound if necessary
		}
		return null; // can’t find item
	} // end delete()
		// -------------------------------------------------------------

	/**
	 * 查找指定的数据项，如果没有找到，返回null
	 * @param key
	 * @return
	 */
	public DataItem find(int key) // find item with key
	{
		int hashVal = hashFunc(key); // hash the key
		while (hashArray[hashVal] != null) // until empty cell,
		{ // found the key?
			if (hashArray[hashVal].getKey() == key)
				return hashArray[hashVal]; // yes, return item
			++hashVal; // go to next cell
			hashVal %= arraySize; // wraparound if necessary
		}
		return null; // can’t find item
		// -------------------------------------------------------------
	} // end class HashTable

	public static class DataItem { // (could have more data)
		private int iData; // data item (key)

		public DataItem(int ii) // constructor
		{
			iData = ii;
		}

		public int getKey() {
			return iData;
		}
	} // end class DataItem
}

测试代码：HashTableTestApp.java

package com.tianshouzhi.algrithm.array.hash;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import com.tianshouzhi.algrithm.array.hash.HashTable.DataItem;

public class HashTableTestApp {
	public static void main(String[] args) throws IOException {
		DataItem aDataItem;
		int aKey, size, n, keysPerCell;
		// get sizes
		System.out.print("Enter size of hash table: ");
		size = getInt();
		System.out.print("Enter initial number of items: ");
		n = getInt();
		keysPerCell = 10;
		// make table
		HashTable theHashTable = new HashTable(size);
		for (int j = 0; j < n; j++) // insert data
		{
			aKey = (int) (java.lang.Math.random() * keysPerCell * size);
			aDataItem = new DataItem(aKey);
			theHashTable.insert(aDataItem);
		}
		while (true) // interact with user
		{
			System.out.print("Enter first letter of ");
			System.out.print("show, insert, delete, or find: ");
			char choice = getChar();
			switch (choice) {
			case 's':
				theHashTable.displayTable();
				break;
			case 'i':
				System.out.print("Enter key value to insert: ");
				aKey = getInt();
				aDataItem = new DataItem(aKey);
				theHashTable.insert(aDataItem);
				break;
			case 'd':
				System.out.print("Enter key value to delete: ");
				aKey = getInt();
				theHashTable.delete(aKey);
				break;
			case 'f':
				System.out.print("Enter key value to find: ");
				aKey = getInt();
				aDataItem = theHashTable.find(aKey);
				if (aDataItem != null) {
					System.out.println("Found " + aKey);
				} else
					System.out.println("Could not find " + aKey);
				break;
			default:
				System.out.print("Invalid entry\n");
			} // end switch
		} // end while
	} // end main()
		// --------------------------------------------------------------

	public static String getString() throws IOException {
		InputStreamReader isr = new InputStreamReader(System.in);
		BufferedReader br = new BufferedReader(isr);
		String s = br.readLine();
		return s;
	}

	// --------------------------------------------------------------
	public static char getChar() throws IOException {
		String s = getString();
		return s.charAt(0);
	}

	// -------------------------------------------------------------
	public static int getInt() throws IOException {
		String s = getString();
		return Integer.parseInt(s);
	}
}

这个测试案例实际上是一个基于命令行进行交互的客户端，是为了让大家更好的观察哈希表的工作过程。允许用户显示哈希表的内容（键入s），插入数据项（键入i），删除数据项（键入d），或者寻找数据项（键入f）。

哈希表的大小和初始的存储的数据项个数也都是要用户输入的。这里只是为了演示使用，哈希表的大小不要太大，否则内容会占满屏幕。变量keysPerCell指定了关键字比率的范围，在代码中，它的值是10，意思是如果指定表长是20，那么关键字就从0到200。

这个我们通过创建一个大小为23的哈希表，初始数据为8个。

运行程序：

Enter size of hash table: 23   //哈希表初始大小为23
Enter initial number of items: 8  //初始情况下，插入8个数据项
Enter first letter of show, insert, delete, or find: s //显示哈希表，没有数据的地方用**代替
Table: ** ** 140 ** ** ** ** ** ** ** ** 149 58 127 ** 199 ** 17 17 ** ** ** 114 
Enter first letter of show, insert, delete, or find: i
Enter key value to insert: 163  //插入数据163，注意163%23等于140%23，因此要用到线性探测
Enter first letter of show, insert, delete, or find: s  //显示哈希表，注意163的位置是线性探测后确定的
Table: ** ** 140 163 ** ** ** ** ** ** ** 149 58 127 ** 199 ** 17 17 ** ** ** 114 
Enter first letter of show, insert, delete, or find: d 
Enter key value to delete: 58  //删除数据项58
Enter first letter of show, insert, delete, or find: s //显示哈希表，删除项会用-1代替
Table: ** ** 140 163 ** ** ** ** ** ** ** 149 -1 127 ** 199 ** 17 17 ** ** ** 114 
Enter first letter of show, insert, delete, or find: i //再次插入58，第一个-1或者空位就是58的位置
Enter key value to insert: 58
Enter first letter of show, insert, delete, or find: s
Table: ** ** 140 163 ** ** ** ** ** ** ** 149 58 127 ** 199 ** 17 17 ** ** ** 114 
Enter first letter of show, insert, delete, or find: i 
Enter key value to insert: 81  //插入81，注意81%23等于58%23，所以依然要用到线性探测找到第一个空位或者-1插入数据项
Enter first letter of show, insert, delete, or find: s
Table: ** ** 140 163 ** ** ** ** ** ** ** 149 58 127 81 199 ** 17 17 ** ** ** 114 
Enter first letter of show, insert, delete, or find:

二、二次探测：

前面已经看到，在开放地址法的线性探测中，会产生聚集。一旦聚集形成，它会变得越来越大。那些哈希化后的落在聚集范围内的数据项，都要一步一步移动，并且插在聚集的最后，因此是聚集变得更大。聚集越大，它增长的越快。

已填入哈希表的数据项和表长的比率叫做装填因子。例如有10000个单元的哈希表装入6667个数据后，它的装填因子是2/3。

loadFactor=nItems/arraySize

当装填因子不太大的时候,聚集分布得比较连贯,哈希表中可能一部分有大量的聚集,而另一部分比较稀疏.聚集降低了哈希表的性能.二次探索法的思想就是探测相隔较远的单元,而不是线性探测的相邻的单元.

步骤是步数的平方

在线性探测中，如果哈希函数计算的原始下标是x，线性探测就是 x+1,x+2,x+3….而在二次探测中，探测的过程是x+1，x+4，x+9，x+16,x+25,依次类推。到原始位置的距离是步数的平方：x+1，x+2，x+3，x+4，x+5，等等。

当二次探测的搜索变长时，好像它变得越来越绝望。第一次，它查找相邻的单元。如果这个单元被占据，它认为这里可能有一个小的聚集，所以它尝试距离为4的单元。如果这里也被占用，它机会尝试距离为9的单元...很快，它就会查找整个数组空间。

下图演示了二次探测的过程：

QQ截图20160125223605.png

二次探测的问题：

二次探测消除了在线性探测中的聚集问题，这种聚集问题叫做原始聚集。然而二次探测产生了另外一种，更细的聚集问题。之所以发生，是因为，所有映射到同一个位置的关键字在寻找空位时，探测的单元都是一样的。

比如：将184、302、420和544依次插入一个大小为59的哈希表中，那么它们都会映射到的位置都是7。184直接插入7的位置，那么302需要以一为步长的探测，420需要以4为步长的探测，544需要以9为步长的探测。

因此，每增加一个关键字映射到7的数据项，就需要更长步长的探测。这个现象叫二次聚集。

二次聚集不是一个严重的问题，但是在这里，我们并不打算实现基于二次探测的哈希表的代码，因为还有更好的方案来解决。

三、再哈希化

为了消除原始聚集和二次聚集，可以使用另外一种方法：再哈希化。二次聚集产生的原因是：二次探测的算法产生的探测序列步长总是固定的：1、4、9、16、以此类推。

现在需要的是一种方法产生一种依赖关键字的探测序列，而不是每个关键字都一样，那么不同的关键字即使映射到相同的数组下标，也可以使用不同的探测序列。

方法是把关键字用不同的哈希函数再做一次哈希化，用这个结果作为步长。对指定的关键字，步长在整个探测中是不变的，不过不同的关键字使用不同的步长。

经验说明：第二个哈希函数必须具备以下特点：

和第一个哈希函数不同
不能输出0，否则将没有步长，每次探测都是原地踏步，算法将陷入死循环。

以下是一种常用的第二次进行哈希化的时候使用的哈希函数：

stepSize = constant - (key % constant);

其中constant是质数，并且小于数组容量。例如：

stepSize = 5 - (key % 5);

使用第二个哈希函数，不同的关键字可能会映射到相同的数组下标，但是，它们很有可能会产生不同的步长，用这个函数，步长是从1到5。如下图所示：

QQ截图20160125225424.png

基于在哈希化的哈希表Java代码：一个用于寻找原始位置，另一个生成步长。

package com.tianshouzhi.algrithm.array.hash;

import com.tianshouzhi.algrithm.array.hash.HashTable.DataItem;

public class ReHashHashTable {
	private DataItem[] hashArray; // array is the hash table
	private int arraySize;
	private DataItem nonItem; // for deleted items

	// -------------------------------------------------------------
	public ReHashHashTable(int size) // constructor
	{
		arraySize = size;
		hashArray = new DataItem[arraySize];
		nonItem = new DataItem(-1);
	}

	// -------------------------------------------------------------
	public void displayTable() {
		System.out.print("Table: ");
		for (int j = 0; j < arraySize; j++) {
			if (hashArray[j] != null)
				System.out.print(hashArray[j].getKey() + " ");
			else
				System.out.print("** ");
		}
		System.out.println("");
	}

	// -------------------------------------------------------------
	public int hashFunc1(int key) {
		return key % arraySize;
	}

	// -------------------------------------------------------------
	public int hashFunc2(int key) {
		// non-zero, less than array size, different from hF1
		// array size must be relatively prime to 5, 4, 3, and 2
		return 5 - key % 5;
	}

	// -------------------------------------------------------------
	// insert a DataItem
	public void insert(int key, DataItem item)
	// (assumes table not full)
	{
		int hashVal = hashFunc1(key); // hash the key
		int stepSize = hashFunc2(key); // get step size
		// until empty cell or -1
		while (hashArray[hashVal] != null && hashArray[hashVal].getKey() != -1) {
			hashVal += stepSize; // add the step
			hashVal %= arraySize; // for wraparound
		}
		hashArray[hashVal] = item; // insert item
	} // end insert()
		// -------------------------------------------------------------

	public DataItem delete(int key) // delete a DataItem
	{
		int hashVal = hashFunc1(key); // hash the key
		int stepSize = hashFunc2(key); // get step size
		while (hashArray[hashVal] != null) // until empty cell,
		{ // is correct hashVal?
			if (hashArray[hashVal].getKey() == key) {
				DataItem temp = hashArray[hashVal]; // save item
				hashArray[hashVal] = nonItem; // delete item
				return temp; // return item
			}
			hashVal += stepSize; // add the step
			hashVal %= arraySize; // for wraparound
		}
		return null; // can’t find item
	} // end delete()
		// -------------------------------------------------------------

	public DataItem find(int key) // find item with key
	// (assumes table not full)
	{
		int hashVal = hashFunc1(key); // hash the key
		int stepSize = hashFunc2(key); // get step size
		while (hashArray[hashVal] != null) // until empty cell,
		{ // is correct hashVal?
			if (hashArray[hashVal].getKey() == key)
				return hashArray[hashVal]; // yes, return item
			hashVal += stepSize; // add the step
			hashVal %= arraySize; // for wraparound
		}
		return null; // can’t find item
	}
}

上一篇：7.0 哈希表

欢迎转载,请注明出处!!!