String interning

From Wikipedia, the free encyclopedia

  (Redirected from String intern pool)
Jump to: navigation, search

In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.

String interning is supported by some modern object-oriented programming languages, including Python, Ruby (with its symbols), Java and .NET languages.

The single copy of each string is called its 'intern' and is typically looked up by a method of the string class, for example String.intern() in Java.

Interned strings speed up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on hash tables with string keys. Without interning, checking that two different strings are equal involves examining every character of both strings. This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fills up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented a pointer equality test, normally just a single machine instruction with no memory reference at all.

String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a network or from storage). Such strings may include magic numbers or network protocol information.

[edit] The intern pool in other programming languages

Even if the standard library of a programming language does not support string interning, it is possible to implement an intern pool in user code.

An example of such an implementation follows, in the C++ programming language. Note that this implementation returns pointers to arrays of characters, not String objects.

class StringTable {

    char** array;
    int* count;
    int size;
    int numStrings;

public:

    StringTable() { 
        size = 20;
        array = new char*[size];
        count = new int[size];
        numStrings = 0;
    }


    int addString(const char* ch) {                // this method adds a String to the pool and returns its index

        if (numStrings == 0) {                     // if there is no string in the pool, add first
            array[0] = new char[strlen(ch)+1];
            strcpy(array[0], ch);
            numStrings++;
            return 0;
        }

        int i = 0;

        for (i; i < numStrings; i++) {             // compares, if the String is already in the pool
            if (strcmp(ch, array[i]) == 0) {
                count[i]++;
                return i;
            }
        }

        if (i == size) {
            return -1;
        } 
        array[i] = new char[strlen(ch)+1];         // if not, add String to the pool and return its index
        strcpy(array[i], ch);
        numStrings++;
        return numStrings-1;
    }


    char* getString(int index) const {

        return array[index];
    }

};
class String {

    int length;
    char* text;
    int index;

public:

    static StringTable* table;

    String(const char* ch) {
        length = strlen(ch);
        text = new char[length+1];
        strcpy(text,ch);
        index = table->addString(text);
    }

    char* intern() {                                     // return pointer to the String in pool
        if (index == -1)
            return 0;
        return table->getString(index);
    }

};

StringTable* String::table = new StringTable();



[edit] History

Lisp introduced the notion of interned strings for its atomic symbols or atoms. The data structure used as a string intern pool was called an 'oblist' (when it was implemented as a linked list) or an 'obarray' (when it was implemented as an array.


[edit] Other links

Personal tools
Languages