Finding anagrams, rhyming words and misspelled words in a dictionary.

Surprisingly, finding anagrams, rhyming words and corrections (for misspelled words) all have a common solution: preprocess the dictionary into a suitably sorted form, then transform the input the same way and compare.
- To find anagrams, keep a dictionary in which the letters of every word are sorted internally (the word's signature), then sort the letters of the input word and compare signatures. See Bentley (1998), Programming Pearls, Addison-Wesley, for further discussion. A small sketch follows the quote below.

- To find rhyming words, keep a dictionary in which every word is stored reversed, reverse the input as well, and compare for similarity towards the end (say, the last 3 characters).

- To find a spelling correction, we use a related hybrid approach; to quote Knuth, The Art of Computer Programming, Vol. 3 (Sorting and Searching), p. 394:

More complicated search problems can often be reduced to the simpler case
considered here. For example, suppose that the keys are words that might be
slightly misspelled; we might want to find the correct record in spite of this
error. If we make two copies of the file, one in which the keys are in normal
lexicographic order and another in which they are ordered from right to left (as
if the words were spelled backwards), a misspelled search argument will probably
agree up to half or more of its length with an entry in one of these two files.
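
A minimal Python sketch of the first two ideas (the tiny WORDS list is just an assumption for illustration; a real dictionary file would be loaded instead):

from collections import defaultdict
from bisect import bisect_left

# a tiny illustrative word list (an assumption of this sketch; a real
# dictionary would be read from a file)
WORDS = ["pots", "stop", "tops", "spot", "post", "ring", "sing", "king", "thing"]

# Anagrams: the "signature" of a word is its letters in sorted order;
# words sharing a signature are anagrams of one another.
signatures = defaultdict(list)
for w in WORDS:
    signatures["".join(sorted(w))].append(w)

def anagrams(word):
    return signatures["".join(sorted(word))]

# Rhymes: store every word reversed and keep the list sorted; reversing the
# input groups words that agree towards the end (here, the last 3 characters).
reversed_words = sorted(w[::-1] for w in WORDS)

def rhymes(word, suffix_len=3):
    target = word[::-1][:suffix_len]
    i = bisect_left(reversed_words, target)
    return [w[::-1] for w in reversed_words[i:] if w.startswith(target)]

print(anagrams("opts"))   # ['pots', 'stop', 'tops', 'spot', 'post']
print(rhymes("wing"))     # ['thing', 'king', 'ring', 'sing']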

Binary search to find the first and last occurrence of a key

Given a sorted array with duplicates, how do we find the first and last occurrence of a key?

There are several O(log n) solutions; the most efficient is Bentley's (see Programming Pearls, Column 9), though it is a bit trickier to understand.

The obvious way to solve it is to keep saving the last found location. In a normal binary search we stop once the key is found; here, instead of stopping, we set the upper bound to mid-1 (to find the first occurrence) or the lower bound to mid+1 (to find the last occurrence) and continue. When the search range is exhausted, we simply return the last saved location. Bentley's method from Programming Pearls is more efficient, since it eliminates one comparison from the inner loop (however, I found it a bit trickier to understand and implement).
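
A minimal Python sketch of the "keep saving the last match" variant described above (the array is assumed to be sorted in ascending order; -1 is returned when the key is absent):

def find_first(a, key):
    """Index of the first occurrence of key in sorted list a, or -1."""
    lo, hi, found = 0, len(a) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == key:
            found = mid        # remember the match...
            hi = mid - 1       # ...and keep searching to the left
        elif a[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return found

def find_last(a, key):
    """Index of the last occurrence of key in sorted list a, or -1."""
    lo, hi, found = 0, len(a) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == key:
            found = mid        # remember the match...
            lo = mid + 1       # ...and keep searching to the right
        elif a[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return found

print(find_first([1, 2, 2, 2, 3], 2), find_last([1, 2, 2, 2, 3], 2))  # 1 3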


An interesting observation about copy constructors

The following code makes an interesting observation about copy constructors: the copy constructor need not be called even when = is used in an object definition; the compiler performs a neat optimization (copy elision) and constructs the object directly through the converting constructor.


#include <iostream>
using namespace std;

class Foo {
public:
    Foo(const Foo&) { cout << "in cpy ctor" << endl; }
    Foo(int a)      { cout << "in ctor:" << a << endl; }
};

int main()
{
    Foo f = 1; // Does this call the copy constructor??
    Foo f2(3);
    Foo g = f;
}


When executed:
in ctor:1
in ctor:3
in cpy ctor

Sorting linked lists

Methods for sorting linked lists (a short sketch of the list-merge idea appears below):
* Merge sort (see Knuth, Vol. 3, p. 164: Algorithm L)
* Radix sort (see Knuth, Vol. 3, p. 175)

He also poses the question "What is the best list-sorting method for six-digit keys, for use on the MIX computer?" (p. 309) at the end of the chapter on sorting, and gives the following answer:
For small N: list insertion; for medium N (say 64): list merge; for large N: radix list sort. (p. 701)
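
Not Knuth's Algorithm L itself (which merges sublists through link fields stored alongside the records), but a short Python sketch of list merge sort on an ordinary singly linked list:

class Node:
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def merge_sort(head):
    """Sort a singly linked list; returns the head of the sorted list."""
    if head is None or head.next is None:
        return head
    # split the list into two halves using slow/fast pointers
    slow, fast = head, head.next
    while fast and fast.next:
        slow, fast = slow.next, fast.next.next
    mid, slow.next = slow.next, None
    left, right = merge_sort(head), merge_sort(mid)
    # merge the two sorted halves
    dummy = tail = Node(None)
    while left and right:
        if left.value <= right.value:
            tail.next, left = left, left.next
        else:
            tail.next, right = right, right.next
        tail = tail.next
    tail.next = left or right
    return dummy.next

# build 4 -> 1 -> 3 -> 2 and sort it
head = merge_sort(Node(4, Node(1, Node(3, Node(2)))))
while head:
    print(head.value, end=" ")   # 1 2 3 4
    head = head.next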

Knuth makes an interesting observation:

An ever-increasing number of "pipeline" or "number-crunching" computers have appeared in recent years. These machines have multiple arithmetic units and look-ahead circuitry so that memory references and computation can be highly overlapped; but their efficiency deteriorates noticeably in the presence of conditional branch instructions unless the branch almost always goes the same way. The inner loop of a radix sort is well adapted to such machines, because it is a straight iterative calculation of typical number-crunching form. Therefore radix sorting is usually more efficient than any other known method for internal sorting on such machines, provided that N is not too small and the keys are not too long. (p.175 Vol.3)

Sorting when the input elements are not all known in advance

While reading Knuth's TAOCP Vol. 3 (Sorting and Searching), p. 139, another characteristic of sorting algorithms stood out to me: some sorting techniques require all the elements to be present before sorting can proceed (for example, selection sort), while others do not (for example, insertion sort, the method used by card players). To quote Knuth:
A selection method requires all of the input items to be present before sorting may proceed, and it generates the final outputs one by one in a sequence.

This is essentially the opposite of insertion, where the inputs are received sequentially but we do not know any of the final outputs until sorting is complete.
(Knuth (1998), TAOCP, Vol. 3, p. 139)

Segregating the positive and negative numbers

[Problem from Knuth,TAOCP, vol.3, p.136]
Design an algorithm that rearranges all the numbers in a given table so that all negative values precede all nonnegative ones. (The items need not be sorted completely, just separated between negative and nonnegative.) Your algorithm should use the minimum possible number of exchanges.


Knuth's solution:
Knuth suggests a modified radix exchange sort that partitions on the most significant bit (the bit corresponding to the sign).
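
This is not Knuth's MIX solution, but a minimal Python sketch of the exchange idea, partitioning on the sign with two pointers (note that, unlike the counting method further below, it is not stable):

def segregate_in_place(a):
    """Move all negative values before all nonnegative ones, in place."""
    i, j = 0, len(a) - 1
    while i < j:
        while i < j and a[i] < 0:   # a[i] is already on the correct side
            i += 1
        while i < j and a[j] >= 0:  # a[j] is already on the correct side
            j -= 1
        if i < j:
            a[i], a[j] = a[j], a[i] # one exchange fixes two misplaced elements
    return a

print(segregate_in_place([3, -1, 4, -5, 2, -6]))   # [-6, -1, -5, 4, 2, 3]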

Another way to do this is to write a specialized comparison routine which is passed to a sorting routine. The routine can be something like:

def compare(a, b):
    if a < 0 and b < 0: return 0    # both negative: treat as equal
    if a >= 0 and b >= 0: return 0  # both nonnegative: treat as equal
    return 1 if a >= 0 else -1      # negatives sort before nonnegatives
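
For example, with Python's built-in stable sort, the comparator above can be plugged in via functools.cmp_to_key (a usage sketch, assuming compare is defined as above):

from functools import cmp_to_key

nums = [3, -1, 4, -5, 2, -6]
# sorted() is stable, so the relative order within each group is preserved
print(sorted(nums, key=cmp_to_key(compare)))   # [-1, -5, -6, 3, 4, 2]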

Another solution is to use a counting technique. A stable, two-pass counting implementation in Python is shown below.

import random 

N = 50
# generate an array of +ve and -ve numbers.
array = random.sample( range(-N, N), N )

# counting technique: 2-pass algorithm
def counting_separate(array):
    num_pos = 0
    num_neg = 0
    for i in array:
        if i<0: num_neg += 1
        else: num_pos += 1
    # destination array
    dest = [0] * len(array)

    #the final index of a +ve number, can simply be len of array
    num_pos += num_neg

    # Count from n-1 to 0 : stable 
    i = len(array)-1
    while i>=0:
        if array[i]< 0: 
            dest[num_neg-1]=array[i]
            num_neg -= 1 # decrease the index 
        else:
            dest[num_pos-1] = array[i]
            num_pos -= 1 # decrease the index
        i -= 1

    print( "segregated array:", dest )

if __name__ == "__main__":
    print( "Original array: ", array )
    counting_separate(array)

Finding the mth smallest of a given set of n elements.

[Problem posed by C.A.R. Hoare] Suppose that, instead of sorting an entire file, we only want to determine the mth smallest of a given set of n elements. Show that quicksort can be adapted to this purpose, avoiding many of the computations required to do a complete sort. (Knuth (1998), The Art of Computer Programming, Vol. 3: Sorting and Searching, p. 136 // Bentley (2011), Programming Pearls, Column 11)
In the normal partitioning process of quicksort, if the partition position s = m, stop;
if s < m, use the same technique to find the (m-s)th smallest in the right-hand part;
if s > m, find the mth smallest element in the left-hand part.
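
A minimal quickselect sketch along those lines in Python (a random pivot with a Lomuto-style partition is an assumption of this sketch; m is 1-based, so m = 1 asks for the smallest):

import random

def quickselect(a, m):
    """Return the m-th smallest element (m = 1 is the minimum) of sequence a."""
    a = list(a)                        # work on a copy
    lo, hi, k = 0, len(a) - 1, m - 1   # k is the 0-based rank we want
    while lo < hi:
        # partition a[lo..hi] around a randomly chosen pivot (Lomuto scheme)
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot, s = a[hi], lo
        for i in range(lo, hi):
            if a[i] < pivot:
                a[i], a[s] = a[s], a[i]
                s += 1
        a[s], a[hi] = a[hi], a[s]      # the pivot lands at its final position s
        if s == k:                     # s == m: stop
            return a[s]
        elif s < k:                    # continue in the right-hand part
            lo = s + 1
        else:                          # continue in the left-hand part
            hi = s - 1
    return a[lo]

print(quickselect([9, 1, 8, 2, 7, 3], 3))   # 3rd smallest -> 3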

Searching a sorted 2D array


A 2D array is sorted along both its rows and its columns; find a key.


"""
Program to search an element in a 2-d array
The 2D array has sorted rows and sorted columns
"""
from random import randrange

N=10 # we will generate an NxN array

# initializes a 2-d array sorted on rows and columns
ARRAY = [ [i for i in range(j,N+j)] for j in range(N) ]

KEY = randrange(0,ARRAY[N-1][N-1])

# We will skip the brute force method of O(n^2) 

# Attempt 1 : Use binary search
def binary_search(lst, l, u, key):
    if l > u: return -1
    m = (l+u)//2          # integer midpoint
    if lst[m] == key: return m
    if lst[m] < key: return binary_search(lst, m+1, u, key)
    else: return binary_search(lst, l, m-1, key)


def binsearch_method():
    "Use binary search on each row : takes loglinear time: O( n log(n) )"
    for i in range(N):
        idx = binary_search(ARRAY[i],0,len(ARRAY[i])-1,KEY)
        if idx != -1:
            print("found %d at %d,%d" % (KEY, i,idx))
            return

def linear_method():
    "Start at the top-right corner and search in linear time: O(n)"
    currow = 0
    curcol = N-1
    steps = 0
    while currow < N and curcol >= 0:
        steps += 1
        if ARRAY[currow][curcol] == KEY:
            print("found %d at %d,%d in %d steps" % (KEY, currow, curcol, steps))
            return
        # If the current element is greater than the key, everything below it
        # in this column is also greater: skip the entire column.
        if ARRAY[currow][curcol] > KEY:
            curcol -= 1
        # Otherwise it is smaller than the key, so everything to its left
        # in this row is also smaller: skip the entire row.
        else:
            currow += 1

if __name__ == "__main__":
    print("Key is %d" % KEY )
    binsearch_method()
    linear_method()


The Celebrity Puzzle

A celebrity is a person whom everybody knows but who knows nobody. You are at a party with a total of n persons. Your job is to find the celebrity. You can ask questions of the form "Does Mr. X know Mr. Y?" and you get a yes/no answer to each question. Find the celebrity by asking only O(n) questions.

Reference: Magnus Lie Hetland, Python Algorithms: Mastering Basic Algorithms in the Python Language, pp. 85-86.

Solution (in Python):
from random import randrange
# Number of persons
N = 100
# the graph of connections built using random values.
G = [ [randrange(2) for i in range(N)] for i in range(N) ]

def create_celebrity():
    C = randrange(N)
    # ensure that we have celebrity at C
    for i in range(N):
        G[i][C] = 1
        G[C][i] = 0
    print("rand celeb : %d" % C)


def bruteforce():
    """Brute force solution to the celebrity problem"""
    n = len(G)
    steps = 0
    for i in range(N):
        is_celeb = 1
        for j in range(N):
            steps += 1
            if i == j: continue
            # i knows j and is not a celebrity
            # j does not know i, break
            if G[i][j] or not G[j][i]: 
                is_celeb = 0;
                break 
        if is_celeb:
            print("found celeb at %d in %d steps" % (i, steps) )
            return i

def bruteforce_improved():
    """Improvement to brute force"""
    # At every step, one question eliminates either u or v
    u, v = 0, 1
    steps = 0
    while u < N and v < N:
        if u == v:           # u caught up with v: move v one step ahead
            v = u + 1
            if v == N: break
        steps += 1
        # if u knows v, then u is not a celeb;
        # if u does not know v, then v is not a celeb
        if G[u][v]: u = u + 1
        else:       v = v + 1
    print("found celeb at %d in %d steps" % (u, steps) )

def linear_time():
    """ finds in linear time : see :
        Python Algorithms: Mastering Basic Algorithms in the Python Language p.86
        By Magnus Lie Hetland
    """
    u, v = 0, 1 # the first two persons
    steps = 0
    # c runs over the remaining persons plus a dummy value N;
    # each question eliminates either u or v
    for c in range(2, N + 1):
        steps += 1
        if G[u][v]: u = c   # u knows v: u cannot be the celebrity, replace u
        else:       v = c   # u does not know v: v cannot be the celebrity, replace v

    c = v if u == N else u  # whichever of u, v was never knocked out is the candidate

    # verify the candidate: a celebrity knows nobody and is known by everybody
    for i in range(N):
        if c == i: continue
        if G[c][i]: break
        if not G[i][c]: break
    else:
        print("found celeb at %d in %d steps" % (c, steps) )



if __name__ == '__main__':
    create_celebrity()
    bruteforce()
    bruteforce_improved()
    linear_time()