
Friday, December 14, 2012

Apriori in Java (Part 2)


(Part 1)

Apriori Algorithm:

The algorithm works as follows: first it generates all the frequent itemsets of length 1 with respect to the given threshold (minimum support). Then it continues to generate itemsets of lengths 2, 3, ..., n for as long as frequent itemsets of the previous length exist. Many improvements and prunings are possible in the implementation.

Do we need to sort frequently? The first observation is that if the items are taken in sorted order at step k = 1, then all the frequent patterns generated later will also be in sorted order, provided the order is maintained properly. This eliminates the need for repeated sorting, and for storing itemsets in logarithmic data structures like maps or sets; we can keep them in an ArrayList or vector-like structure instead.

How do we generate itemsets of length k+1? We merge two itemsets of length k. If two itemsets I1 and I2 share a common prefix of length k-1 and I1[k] < I2[k], then I1[1 ... k-1] I1[k] I2[k] is an itemset of length k+1; for example, {1, 2, 3} and {1, 2, 5} join into {1, 2, 3, 5}. As our itemsets are sorted, and the frequent itemsets generated from them are also sorted, this can be done in O(N*K^2). A naive approach would take O(N*K^3); instead, we can pre-calculate the length of the common prefix of consecutive itemsets in O(N*K), then use it both to perform the join and to terminate early rather than scanning all N itemsets. The approach is demonstrated in the source code.

What are the prunings? The most important observation about the apriori algorithm is that if an itemset is not frequent, none of its supersets can be frequent. Equivalently, for an itemset to be frequent, all of its subsets must be frequent. The first property is enforced automatically, since each step builds candidates only from the frequent itemsets of the previous step. How is the second property checked? We do not really need to check all subsets: at step k it suffices to check the subsets of length k-1, which is easily done by manual hashing or a Java HashMap. This saves a lot of hassle.

/* Author: Zobayer Hasan */

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class Apriori extends Thread {
    public static boolean debugger = false;

    private final Database db;
    private final List< Integer > itemset;
    private final List< List< Integer > > frequent;
    private final double minsup;

    public Apriori(String thrName, Database db, double minsup) {
        super(thrName);
        this.db = db;
        itemset = db.getItemset();
        frequent = new ArrayList< List< Integer > >();
        this.minsup = minsup;
    }

    @Override
    public void run() {
        double startTime = System.currentTimeMillis();

        int k = 1, n = db.dbSize();
        List< List< Integer > > Ck = new ArrayList< List< Integer > >();
        List< List< Integer > > Lk = new ArrayList< List< Integer > >();
        HashMap< List< Integer >, Integer > seenK = new HashMap< List< Integer >, Integer >();

        // candidates of length 1: every distinct item
        for(Integer item : itemset) {
            List< Integer > temp = new ArrayList< Integer >();
            temp.add(item);
            Lk.add(temp);
        }

        while(k <= n && !Lk.isEmpty()) {
            if(debugger) {
                System.out.println("Step " + k);
                System.out.println("Lk: " + Lk);
            }

            // count support for each candidate and keep the frequent ones
            seenK.clear();
            Ck.clear();
            for(List< Integer > kth : Lk) {
                int count = db.scanDatabase(kth);
                if((double)count < Math.ceil(minsup * (double)n / 100.0)) continue;
                Ck.add(kth);
            }

            if(debugger) {
                System.out.println("Ck: " + Ck);
            }

            if(Ck.isEmpty()) break;

            for(List< Integer > freq : Ck) {
                frequent.add(freq);
                seenK.put(freq, k);
            }

            // prefixlen[i] = length of the common prefix of Ck[i-1] and Ck[i]
            int[] prefixlen = new int[Ck.size()];
            prefixlen[0] = 0;
            for(int i = 1; i < Ck.size(); i++) {
                prefixlen[i] = prefixLen(Ck.get(i-1), Ck.get(i));
            }

            // join step: merge itemsets sharing a common prefix of length k-1
            List< List< Integer > > temp = new ArrayList< List< Integer > >();
            for(int i = 0; i < Ck.size(); i++) {
                for(int j = i + 1; j < Ck.size(); j++) {
                    if(prefixlen[j] == k-1) {
                        if(debugger) {
                            System.out.println("Joining: " + i + ":" + Ck.get(i) + " + " + j + ":" + Ck.get(j) + " Prefix Length " + prefixlen[j]);
                        }
                        temp.add(prefixJoin(Ck.get(i), Ck.get(j)));
                    }
                    else break; // consecutive prefixes diverged, no more joins for i
                }
            }

            if(debugger) {
                System.out.println("Temporary: " + temp);
            }

            // prune step: every subset of length k must itself be frequent
            Lk.clear();
            for(List< Integer > list : temp) {
                boolean candid = true;
                if(k > 1) {
                    for(int i = 0; i < list.size(); i++) {
                        List< Integer > prev = new ArrayList< Integer >();
                        for(int j = 0; j < list.size(); j++) {
                            if(i != j) prev.add(list.get(j));
                        }
                        if(!seenK.containsKey(prev)) {
                            candid = false;
                            break;
                        }
                    }
                }
                if(candid) {
                    Lk.add(list);
                }
            }

            if(debugger) {
                System.out.println("Pruned: " + Lk);
            }

            k++;
        }

        double endTime = System.currentTimeMillis();
        System.out.println("Apriori completed in " + (endTime - startTime)/1000.0 + " seconds");
    }

    public void printPatterns() {
        System.out.println("Frequent Itemsets");
        for(List< Integer > pattern : frequent) {
            System.out.println(pattern);
        }
        System.out.println("Total " + frequent.size() + " itemsets");
    }

    private int prefixLen(List< Integer > left, List< Integer > right) {
        int len;
        for(len = 0; len < left.size() && len < right.size(); len++) {
            if(left.get(len).compareTo(right.get(len)) != 0) return len;
        }
        return len;
    }

    private List< Integer > prefixJoin(List< Integer > left, List< Integer > right) {
        List< Integer > ret = new ArrayList< Integer >();
        for(Integer i : left) {
            ret.add(i);
        }
        ret.add(right.get(right.size() - 1));
        return ret;
    }
}
This class is threaded, so it is possible to take advantage of multicore processors.
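For example, two instances with different support thresholds can mine the same database concurrently, since Database is only read after construction. Here is a minimal sketch of my own showing the idea (the class name ParallelFIM and the thresholds are illustrative):

/* a sketch of my own, not part of the original post */
public class ParallelFIM {
    public static void main(String[] args) throws Exception {
        Database db = new Database("mushroom.dat");
        Apriori run40 = new Apriori("run40", db, 40.0);
        Apriori run60 = new Apriori("run60", db, 60.0);
        run40.start();      // both threads scan the shared, read-only database
        run60.start();
        run40.join();       // wait for both miners to finish
        run60.join();
        run40.printPatterns();
        run60.printPatterns();
    }
}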

A sample test class is shown here.
/* Author: Zobayer Hasan */

public class FIMtest {

    public static void main(String[] args) {
        Database db = null;
        try {
            db = new Database("mushroom.dat");
        } catch(Exception e) {
            e.printStackTrace();
        }

        System.out.println("\nStarting Apriori");

        Apriori test1 = new Apriori("test1", db, 40.0);
        Apriori.debugger = true;
        test1.start();
        try {
            test1.join();
            test1.printPatterns();
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}

So, this is it: an efficient implementation of the Apriori algorithm in Java. There may be shortcomings or mistakes in the source code or the analysis; if you find any, please let me know. Here is a sample run on mushroom for 40% minimum support.

Have fun...

Thursday, December 13, 2012

Apriori in Java (Part 1)


Introduction:

Apriori is a very basic and straightforward algorithm for frequent pattern mining. I will not discuss the approach itself much, as it can be studied from the various lectures and books available on the net; instead, I will present an efficient Java implementation of the standard apriori algorithm. The target input files are taken from the Frequent Itemset Mining Dataset Repository, for example the mushroom dataset. Due to its space and runtime complexity, apriori is not suitable for larger files with around 50,000 records or more, and it may also take a huge amount of time on very dense files.

The Database:

First we need to read the files and store them in some data structure that is both easy to access and efficient. Here I present my Java source code, followed by an analysis explaining why it works well.
/* Author: Zobayer Hasan */

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class Entry {
    public Integer first;
    public Integer second;
    Entry() {}
    Entry(Integer first, Integer second) {
        this.first = first;
        this.second = second;
    }
}

public class Database {
    public static boolean debugger = false;

    private final List< List< Integer > > transactions;
    private final List< Integer > items;

    public Database(String dataFileName) throws Exception {
        if(debugger) {
            System.out.println("Processing " + dataFileName);
        }

        transactions = new ArrayList< List< Integer > >();
        items = new ArrayList< Integer >();

        FileInputStream fin = new FileInputStream(dataFileName);
        InputStreamReader istream = new InputStreamReader(fin);
        BufferedReader stdin = new BufferedReader(istream);

        String line;

        double startTime = System.currentTimeMillis();

        while((line = stdin.readLine()) != null) {
            List< Integer > transaction = new ArrayList< Integer >();
            String[] temp = line.split("\\s+");

            for(String num : temp) {
                if(num.isEmpty()) continue; // guard against blank lines and leading whitespace
                transaction.add(Integer.parseInt(num));
            }

            if(transaction.isEmpty()) continue;

            Collections.sort(transaction);
            transactions.add(transaction);
        }

        fin.close();
        istream.close();
        stdin.close();

        // k-way merge over the sorted transactions to collect the distinct items
        int n = transactions.size();
        int[] header = new int[n];
        PriorityQueue< Entry > pQ = new PriorityQueue< Entry >(n, new Comparator< Entry >() {
            public int compare(Entry item1, Entry item2) {
                if(item1.first.equals(item2.first)) {
                    return item1.second.compareTo(item2.second);
                } else {
                    return item1.first.compareTo(item2.first);
                }
            }
        });

        for(int i = 0; i < n; i++) {
            header[i] = 0;
            pQ.add(new Entry(transactions.get(i).get(header[i]), i));
        }

        while(!pQ.isEmpty()) {
            Entry peek = pQ.remove();
            int val = peek.first;
            int idx = peek.second;
            if(items.isEmpty() || items.get(items.size()-1) < val) {
                items.add(val);
            }
            while(header[idx] < transactions.get(idx).size() && transactions.get(idx).get(header[idx]) <= val) {
                header[idx]++;
            }
            if(header[idx] < transactions.get(idx).size()) {
                pQ.add(new Entry(transactions.get(idx).get(header[idx]), idx));
            }
        }

        double endTime = System.currentTimeMillis();
        System.out.println("Database created in " + (endTime - startTime)/1000.0 + " seconds");
    }

    public int scanDatabase(List< Integer > transaction) {
        int count = 0;
        for(List< Integer > row : transactions) {
            boolean found = true;
            for(Integer item : transaction) {
                // lower-bound binary search for item within the sorted row
                int idx, stp, st = 0, en = row.size(), cnt = en - st;
                while(cnt > 0) {
                    stp = cnt >> 1;
                    idx = st + stp;
                    if(row.get(idx).compareTo(item) < 0) {
                        st = ++idx;
                        cnt -= stp + 1;
                    } else {
                        cnt = stp;
                    }
                }
                if(st == row.size() || row.get(st).compareTo(item) != 0) {
                    found = false;
                    break;
                }
            }
            if(found) count++;
        }
        return count;
    }

    public List< Integer > getItemset() {
        return items;
    }

    public int dbSize() {
        return transactions.size();
    }

    public List< Integer > getRow(int row) {
        return transactions.get(row);
    }
}
Clearly it provides an interface for accessing the database, where the constructor must be called with a file name, for example 'mushroom.dat'. The Collections.sort call in the constructor is not necessary if you know the transactions in the file are already sorted in ascending order. All the data files in this repository are already sorted, so the sorting can be disabled.

Now, if we look at the constructor, what is this huge piece of code actually doing? First it reads each transaction as a string, parses it, and inserts the transaction as a sorted list. Next, we need a list of the unique items. This could be done by sorting all the records together and then eliminating duplicates, but it can be done better with a priority queue in O(NK log N) time, where we have N transactions and each transaction has K items on average. This beats the naive approach, which would take O(NK log NK) because of the extra sorting overhead.

Then comes the last important part, the scanDatabase() method, which counts how many records of the database contain a given set of items. It takes at most O(NK log K) time, where we have N records, each of average length K, by binary-searching each item within a record. This is much better than the O(N*K^2) of a naive scan. One further improvement is possible: if we knew the range of item values, we could keep a bit vector alongside each transaction, and the query could then be performed in O(NK). However, as the datasets are huge, I did not go for it due to the large memory requirement.
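To make the bit vector idea concrete, here is a minimal sketch of what such a row might look like. It is not part of the implementation above, and it assumes the item ids are bounded by a known maxItem (BitRow and containsAll are hypothetical names):

import java.util.BitSet;
import java.util.List;

/* hypothetical helper, not part of the Database class above */
public class BitRow {
    private final BitSet bits;

    // maxItem is an assumed upper bound on the item ids in the dataset
    public BitRow(List< Integer > transaction, int maxItem) {
        bits = new BitSet(maxItem + 1);
        for(int item : transaction) bits.set(item);
    }

    // true if every item of the candidate itemset occurs in this row;
    // each lookup is O(1) instead of a binary search
    public boolean containsAll(List< Integer > itemset) {
        for(int item : itemset) {
            if(!bits.get(item)) return false;
        }
        return true;
    }
}

With one BitSet kept per transaction, scanDatabase() would become a plain O(NK) loop, which is exactly the memory-for-speed trade-off mentioned above.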

Part 2 contains the source code and explanation for the apriori algorithm.

Saturday, January 22, 2011

Huffman's Code



Here is a program I wrote this afternoon for one of my old friends. Basically, it is a simple one: it reads character values and their frequencies, and generates the Huffman code for each character. I am not going to describe Huffman's algorithm here, but it is widely used for data compression and the like...

#include <iostream>
#include <vector>
#include <string>
#include <queue>
#include <algorithm>
using namespace std;

vector< pair< string, int > > V;

class Node {
public:
    Node *left, *right;
    int weight, ch;
    Node() {}
    Node(int _w, int _c) : weight(_w), ch(_c) { left = right = NULL; }
    ~Node() {}
};

class comp {
public:
    bool operator() (const Node &a, const Node &b) {
        return a.weight > b.weight;
    }
};

void generate(const Node *a, char *code, int depth) {
    if(a->left == NULL && a->right == NULL) {
        code[depth] = 0;
        V.push_back(pair< string, int > (code, a->ch));
        return;
    }
    if(a->left != NULL) {
        code[depth] = '0';
        generate(a->left, code, depth + 1);
    }
    if(a->right != NULL) {
        code[depth] = '1';
        generate(a->right, code, depth + 1);
    }
}

int main() {
    priority_queue< Node, vector< Node >, comp > Q;
    int ch, weight;
    char code[100];

    // read ascii and frequency
    while(cin >> ch >> weight) {
        Q.push(Node(weight, ch));
    }

    // build 2-tree
    while(Q.size() > 1) {
        Node *a = new Node;
        *a = Q.top(); Q.pop();
        Node *b = new Node;
        *b = Q.top(); Q.pop();
        Node c(a->weight + b->weight, 0);
        c.left = a, c.right = b;
        Q.push(c);
    }

    // traverse tree to generate code
    generate(&Q.top(), code, 0);

    // display character and code
    for(int i = 0; i < (int)V.size(); i++) {
        cout << V[i].first << " --> " << V[i].second << endl;
    }
    return 0;
}

Writing a compressor/decompressor is also easy, just make sure you use 1 bit for each '0' or '1', not 1 byte :p

Sunday, January 2, 2011

CSE 202 Pseudo Codes 2



Insert and Delete operations on a Singly Linked List


Each item in a singly linked list is formed of two fields: Info (which contains the data for the node) and Next (which is a pointer to the next item). A pointer 'head' points to the start of the linked list, and in case the list is empty, head has the value 'null'. Pseudo-codes for insertion and deletion at the nth position are shown below:

Insert Procedure



Insert ( head, n, item )
if n = 1 then
    temp := new Node;
    Info[temp] := item, Next[temp] := head;
    head := temp;
    return;
ptr := head, pos := 1;
repeat while pos + 1 < n and Next[ptr] != null
    ptr := Next[ptr], pos := pos + 1;
temp := new Node;
Info[temp] := item, Next[temp] := Next[ptr];
Next[ptr] := temp;
return;

Delete Procedure



Delete ( head, n )
if head = null then return;
if n = 1 then
    temp := head, head := Next[head];
    free temp;
    return;
ptr := head, pos := 1;
repeat while pos + 1 < n and Next[ptr] != null
    ptr := Next[ptr], pos := pos + 1;
if Next[ptr] = null then return;
temp := Next[ptr], Next[ptr] := Next[temp];
free temp;
return;

Both of them handle all the possible cases on a singly linked list.
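For comparison, here is a minimal Java sketch of the two procedures. The Node class, method names, and the guard for inserting into an empty list are my own choices, not part of the pseudo-code above, and Java's garbage collector takes the place of the explicit free:

/* a sketch of my own; the pseudo-code above is the reference */
class Node {
    int info;
    Node next;
    Node(int info, Node next) { this.info = info; this.next = next; }
}

public class SinglyLinkedList {
    private Node head;   // null when the list is empty

    // insert item at 1-based position n, appending if n exceeds the length
    public void insert(int n, int item) {
        if(n == 1 || head == null) {
            head = new Node(item, head);
            return;
        }
        Node ptr = head;
        int pos = 1;
        while(pos + 1 < n && ptr.next != null) {
            ptr = ptr.next;
            pos++;
        }
        ptr.next = new Node(item, ptr.next);
    }

    // delete the node at 1-based position n; does nothing if n is out of range
    public void delete(int n) {
        if(head == null) return;
        if(n == 1) {
            head = head.next;   // the garbage collector replaces 'free'
            return;
        }
        Node ptr = head;
        int pos = 1;
        while(pos + 1 < n && ptr.next != null) {
            ptr = ptr.next;
            pos++;
        }
        if(ptr.next == null) return;
        ptr.next = ptr.next.next;
    }
}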

Wednesday, December 29, 2010

CSE 202 Pseudo Codes 1



Iterative Tree Traversal


There are three types of tree traversals, namely inorder, preorder, and postorder. For some strange reason, students @CSE-DU are forced to write the iterative versions of these. I am just noting them down here again.

The Node is composed of three fields: Info (the data for this node), Left (a pointer to the left child), and Right (a pointer to the right child). The pseudo-codes are shown below.

Preorder



01. Preorder ( Info, Left, Right, root )
02.     top := 1, Stack[top] := NULL, ptr := root;
03.     repeat while ptr != NULL
04.         apply process() to Info[ptr];
05.         if Right[ptr] != NULL then
06.             top := top + 1, Stack[top] := Right[ptr];
07.         if Left[ptr] != NULL then
08.             ptr := Left[ptr];
09.         else
10.             ptr := Stack[top], top := top - 1;
11.     exit

Inorder



01. Inorder ( Info, Left, Right, root )
02.     top := 1, Stack[top] := NULL, ptr := root;
03.     repeat while ptr != NULL
04.         top := top + 1, Stack[top] := ptr, ptr := Left[ptr];
05.     ptr := Stack[top], top := top - 1;
06.     repeat while ptr != NULL
07.         apply process() to Info[ptr];
08.         if Right[ptr] != NULL then
09.             ptr := Right[ptr];
10.             go to step 03;
11.         ptr := Stack[top], top := top - 1;
12.     exit

Postorder



01. Postorder ( Info, Left, Right, root )
02.     top := 1, Stack[top] := NULL, ptr := root;
03.     repeat while ptr != NULL
04.         top := top + 1, Stack[top] := ptr, ptr := Left[ptr];
05.     repeat while Stack[top] != NULL and ptr = Right[Stack[top]]
06.         ptr := Stack[top], top := top - 1;
07.         apply process() to Info[ptr];
08.     if Stack[top] != NULL then
09.         ptr := Right[Stack[top]];
10.         go to step 03;
11.     exit
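
For reference, here is how the inorder routine might look in Java; an explicit deque stands in for the Stack array and top index, and the class and method names are my own. Preorder and postorder translate similarly.

import java.util.ArrayDeque;
import java.util.Deque;

/* a sketch of my own; class and method names are not from the pseudo-code */
class TreeNode {
    int info;
    TreeNode left, right;
    TreeNode(int info) { this.info = info; }
}

public class IterativeTraversal {
    // iterative inorder: push the left spine, pop and visit, then go right
    public static void inorder(TreeNode root) {
        Deque< TreeNode > stack = new ArrayDeque< TreeNode >();
        TreeNode ptr = root;
        while(ptr != null || !stack.isEmpty()) {
            while(ptr != null) {              // like steps 03-04: go left
                stack.push(ptr);
                ptr = ptr.left;
            }
            ptr = stack.pop();                // like step 05: backtrack
            System.out.println(ptr.info);     // apply process() to Info[ptr]
            ptr = ptr.right;                  // then traverse the right subtree
        }
    }
}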

Really pointless though!