I, ME AND MYSELF !!!: database

Showing posts with label database. Show all posts

Friday, December 14, 2012

Apriori in Java (Part 2)

Apriori Algorithm:

The algorithm works as follows: first it generates all the frequent itemsets of length 1 w.r.t. the given threshold (minimum support). Then it continue to generate itemsets of lengths 2,3, ... ,n if possible. There are lots of improvements and pruning possible in the implementation.

Do we need to sort frequently? First observation is, if the items are taken in sorted order on step k = 1, all the frequent patterns generated in future will also be in sorted order if maintained properly. So this eliminates the necessity of sorting or storing itemsets in logarithmic data structures like map or set. Rather, we can store them in arraylist or vector like data structure.

How do we generate itemsets of length k+1? We generate itemsets of length k+1 by merging two itemsets of length k. If two itemsets I1 and I2 have a common prefix of length k-1, and I1[k] < I2[k], we can take I1[1 ... k-1] I1[k] I2[k] which is an itemset of length k+1. As our itemsets are sorted, and frequent itemsets generated are also sorted, this can be done in O(N*K^2). Well, if we follow the naive approach, it will take O(N*K^3), however, as we can pre-calculate the length of common prefix of consecutive items in O(N*K), later, we can use this to perform joining operation stated above and also do early termination rather than looking for the entire N items. The approach is demonstrated in source code.

What are the prunings? The most important observation on apriori algorithm is, if an itemset is not frequent, none of its superset can be frequent. Which also tells us, if an itemset has to be frequent, all of its subset must be frequent. Well, the first one is automatically checked in the algorithm. As it takes the frequent itemsets of previous step to calculate itemsets of current step. Now, how the second rule is checked? Well, we don't need to check all the subsets really, all we need is to check the subsets of length k-1 for step k, which can be easily done by performing manual hashing, or using java hashmap. This saves a lots of hassles.

 /* Author: Zobayer Hasan */ import java.util.ArrayList; import java.util.HashMap; import java.util.List;  public class Apriori extends Thread {     public static boolean debugger = false;          private final Database db;     private final List< Integer > itemset;     private final List< List< Integer > > frequent;     private double minsup;      public Apriori(String thrName, Database db, double minsup) {         super(thrName);         this.db = db;         itemset = db.getItemset();         frequent = new ArrayList< List< Integer > >();         this.minsup = minsup;     }          @Override     public void run() {         double startTime = System.currentTimeMillis();                  int k = 1, n = db.dbSize();         List< List< Integer > > Ck = new ArrayList< List< Integer > >();         List< List< Integer > > Lk = new ArrayList< List< Integer > >();         HashMap< List< Integer>, Integer > seenK = new HashMap< List< Integer >, Integer >();                  for(Integer item : itemset) {             List< Integer > temp = new ArrayList< Integer >();             temp.add(item);             Lk.add(temp);         }                  while(k <= n && !Lk.isEmpty()) {             if(debugger) {                 System.out.println("Step " + k);                 System.out.println("Lk: " + Lk);             }                          seenK.clear();             Ck.clear();             for(List< Integer > kth : Lk) {                 int count = db.scanDatabase(kth);                 if((double)count < Math.ceil(minsup * (double)n / 100.0)) continue;                 Ck.add(kth);             }                          if(debugger) {                 System.out.println("Ck: " + Ck);             }                          if(Ck.isEmpty()) break;                          for(List< Integer > freq : Ck) {                 frequent.add(freq);                 seenK.put(freq, k);             }                          int[] prefixlen = new int[Ck.size()];             prefixlen[0] = 0;             for(int i = 1; i < Ck.size(); i++) {                 prefixlen[i] = prefixLen(Ck.get(i-1), Ck.get(i));             }                          List< List< Integer > > temp = new ArrayList< List< Integer > >();             for(int i = 0; i < Ck.size(); i++) {                 for(int j = i + 1; j < Ck.size(); j++) {                     if(prefixlen[j] == k-1) {                         if(debugger) {                             System.out.println("Joining: " + i + ":" + Ck.get(i) + " + " + j + ":" + Ck.get(j) + " Prefix Length " + prefixlen[j]);                         }                         temp.add(prefixJoin(Ck.get(i), Ck.get(j)));                     }                     else break;                 }             }                          if(debugger) {                 System.out.println("Temporary: " + temp);             }              Lk.clear();             for(List< Integer > list : temp) {                 boolean candid = true;                 if(k > 1) {                     for(int i = 0; i < list.size(); i++) {                         List< Integer > prev = new ArrayList< Integer >();                         for(int j = 0; j < list.size(); j++) {                             if(i != j) prev.add(list.get(j));                         }                         if(!seenK.containsKey(prev)) {                             candid = false;                             break;                         }                     }                 }                 if(candid) {                     Lk.add(list);                 }             }                          if(debugger) {                 System.out.println("Pruned: " + Lk);             }                          k++;         }                  double endTime = System.currentTimeMillis();         System.out.println("Apriori completed in " + (endTime - startTime)/1000.0 + " seconds");     }          public void printPatterns() {         System.out.println("Frequent Itemsets");         for(List< Integer > pattern : frequent) {             System.out.println(pattern);         }         System.out.println("Total " + frequent.size() + " itemsets");     }          private int prefixLen(List< Integer > left, List< Integer > right) {         int len = 0;         for(len = 0; len < left.size() && len < right.size(); len++) {             if(left.get(len).compareTo(right.get(len)) != 0) return len;         }         return len;     }          private List< Integer > prefixJoin(List< Integer > left, List< Integer > right) {         List< Integer > ret = new ArrayList< Integer >();         for(Integer i : left) {             ret.add(i);         }         ret.add(right.get(right.size() - 1));         return ret;     } }

This class is threaded, so it is possible to take advantages of multicore processors.

A sample test class is shown here.

 /* Author: Zobayer Hasan */ public class FIMtest {      public static void main(String[] args) {         Database db = null;         try {             db = new Database("mushroom.dat");                      } catch(Exception e) {             e.printStackTrace();         }                  System.out.println("\nStarting Apriori");                  Apriori test1 = new Apriori("test1", db, 40.0);         Apriori.debugger = true;         test1.start();         try {             test1.join();             test1.printPatterns();         } catch(Exception e) {             e.printStackTrace();         }     } }

So, this is it, an efficient implementation of Apriori algorithm in java. It is possible that there are some lacking or mistakes in either source code or analysis. If you find any, please let me know. Here is a sample run on mushroom for 40% minimum support.

Have fun...

Thursday, December 13, 2012

Apriori in Java (Part 1)

Introduction:

Apriori is a very basic and straight forward algorithm for frequent pattern mining, I will not be discussing much about the approach, as those can already be studied from different lectures/books available on net. I will basically present an implementation of mine which is an efficient implementation of the standard apriori algorithm in Java. The target input files are taken from Frequent Itemset Mining Dataset Repository, for example, the mushroom dataset. Due to the space and runtime complexity, apriori is not suitable for larger files having 50,000 records or so. It also may take huge time for very dense files.

The Database:

First we need to read the files and store them in some data structure which will be easy to access and efficient at the same time. Here. I am presenting my java source code, and then I will present an explaining analysis why this works better.

 /* Author: Zobayer Hasan */  import java.io.BufferedReader; import java.io.FileInputStream; import java.io.InputStreamReader; import java.util.ArrayList; import java.util.Collections; import java.util.Comparator; import java.util.List; import java.util.PriorityQueue;  class Entry {     public Integer first;     public Integer second;     Entry() {}     Entry(Integer first, Integer second) {         this.first = first;         this.second = second;     } }  public class Database {     public static boolean debugger = false;          private final List< List< Integer > > transactions;     private final List< Integer > items;      public Database(String dataFileName) throws Exception {         if(debugger) {             System.out.println("Processing " + dataFileName);         }                  transactions = new ArrayList< List< Integer > >();         items = new ArrayList< Integer >();                  FileInputStream fin = new FileInputStream(dataFileName);         InputStreamReader istream = new InputStreamReader(fin);         BufferedReader stdin = new BufferedReader(istream);                  String line;                  double startTime = System.currentTimeMillis();                  while((line = stdin.readLine()) != null) {             List< Integer > transaction = new ArrayList< Integer >();             String[] temp = line.split("\\s+");                          for(String num : temp) {                 transaction.add(Integer.parseInt(num));             }                          if(transaction.isEmpty()) continue;                          Collections.sort(transaction);             transactions.add(transaction);         }                  fin.close();         istream.close();         stdin.close();                  int n = transactions.size();         int[] header = new int[n];         PriorityQueue< Entry > pQ = new PriorityQueue< Entry >(n, new Comparator< Entry >() {             public int compare(Entry item1, Entry item2) {                 if(item1.first.equals(item2.first)) {                     return item1.second.compareTo(item2.second);                 } else {                     return item1.first.compareTo(item2.first);                 }             }         });                  for(int i = 0; i < n; i++) {             header[i] = 0;             pQ.add(new Entry(transactions.get(i).get(header[i]), i));         }                  while(!pQ.isEmpty()) {             Entry peek = pQ.remove();             int val = peek.first;             int idx = peek.second;             if(items.isEmpty() || items.get(items.size()-1) < val) {                 items.add(val);             }             while(header[idx] < transactions.get(idx).size() && transactions.get(idx).get(header[idx]) <= val) {                 header[idx]++;             }             if(header[idx] < transactions.get(idx).size()) {                 pQ.add(new Entry(transactions.get(idx).get(header[idx]), idx));             }         }                  double endTime = System.currentTimeMillis();         System.out.println("Database created in " + (endTime - startTime)/1000.0 + " seconds");     }          public int scanDatabase(List< Integer > transaction) {         int count = 0;         for(List< Integer > row : transactions) {             boolean found = true;             for(Integer item : transaction) {                 int idx, stp, st = 0, en = row.size(), cnt = en - st;                 while(cnt > 0) {                     stp = cnt >> 1; idx = st + stp;                     if(row.get(idx).compareTo(item) < 0) {                         st = ++idx;                         cnt -= stp+1;                     }                     else {                         cnt = stp;                     }                 }                 if(st == row.size() || row.get(st).compareTo(item) != 0) {                     found = false;                     break;                 }             }             if(found) count++;         }         return count;     }          public List< Integer > getItemset() {         return items;     }          public int dbSize() {         return transactions.size();     }          public List< Integer > getRow(int row) {         try {             return transactions.get(row);         } catch(Exception e) {             throw e;         }     } }

Clearly it provides some interface to access the database where the constructor must be called with a file name for example 'mushroom.dat'. The sorting on line 56 is not necessary if you know the transactions in the file will be sorted in ascending order. All the data files in this repository are already sorted, so sorting can be disabled.

Now, if we look at the constructor, what is this huge code doing actually? First it treads each transaction as a string, then parses it and inserts the transaction in a sorted list. Now, we need a list of unique elements. Well, this could be done by sorting all the records together, and then eliminating duplicates. However, this can be performed better with the help of a priority queue in O(NK log N) time, where we have n transactions, and each transaction has K records on an average. Actually this works much better than it would be in naive approach which would take O(NK log NK) because of extra sorting overhead.

Then comes the last important part, scanDatabase() method. Which basically searches each record in the database for a set of elements. It takes O(NK log K) time at most, where, we have N records, each having an average length of K. This is much better than looking in O(N*K^2) following a naive approach. Here is some improvement possible, if we know the length of transactions, we could use a bit vector in addition to each transaction. Then the query can be performed in O(NK). However, as the datasets are huge, I didn't go for it due to the necessity of using huge memory.

Part 2 contains the source code and explanation for the apriori algorithm.

Saturday, December 4, 2010

Setup connector/j for JDBC

Setting up MySQL driver connector/j for JDBC in Windows

This is basically not a hard task. Make sure you already have the following things installed and configured in your system, (this demonstration is targeted on the latest versions of the components):

Apache HTTP Server 2.2 (see here)
MySQL community server 5.1 (see here)
Java SE SDK aka JDK 1.6 (see here)

If you don't have these components installed yet, you might try taking some online help on how to install and configure those, not hard as well.

Now, to install java to mysql connector, which is know as connector/j, go to this page and download the windows ZIP package. Unzip the archive anywhere in your pc, it won't matter at all. Now, follow these steps:

Copy the file "mysql-connector-java-5.1.13-bin.jar" to your java installation directory and place it in "..\jdk1.6.0_21\jre\lib\ext".
Now, you have to add this runtime library to your systems classpath environment variable. Go to the properties of My Computer and select the Advanced tab, there you'll see Environment Variables. You will find all your system variables here. Find out the name CLASSPATH and append the location of connector/j i.e. where you just pasted. In my pc, its "C:\Program Files\Java\jdk1.6.0_21\jre\lib\ext\mysql-connector-java-5.1.13-bin.jar". Don't forget to separate this entry from the previous one with e semi-colon. Now click ok and close the dialogue boxes, you are done! But if you don't have the CLASSPATH variable, you need to create it yourself.

That's pretty much all of the work, now we are going to test if it works.

Lets assume that you already have a mysql database named "mysqltest" and you are the "root" user with password "adminadmin" using default host "localhost" with default http port 80, the following code should work then:


import java.sql.*;

public class Main {
    public static void main(String[] args) {
        Connection conn = null;
        try {
            Class.forName("com.mysql.jdbc.Driver") ;
            System.out.println("MySQL JDBC driver loaded ok.");

            conn = DriverManager.getConnection("jdbc:mysql://localhost/mysqltest?user=root&password=adminadmin");
            System.out.println("Connected with default port.");

            DatabaseMetaData meta = conn.getMetaData();
            System.out.println("Server name: " + meta.getDatabaseProductName());
            System.out.println("Server version: " + meta.getDatabaseProductVersion());
            System.out.println("Driver name: " + meta.getDriverName());
            System.out.println("Driver version: " + meta.getDriverVersion());
            System.out.println("JDBC major version: " + meta.getJDBCMajorVersion());
            System.out.println("JDBC minor version: " + meta.getJDBCMinorVersion());

            Statement query = conn.createStatement();
            int count = query.executeUpdate(
            "CREATE TABLE test (" +
            "id INT PRIMARY KEY NOT NULL AUTO_INCREMENT," +
            "username VARCHAR(20) NOT NULL," +
            "password CHAR(40) NOT NULL );"
            );
            System.out.println("Table created.");

            conn.close();
        } catch(Exception e) {
            System.err.println("Exception: " + e.getMessage());
        }
    }
}

As you see, you are using "DriverManager.getConnection()" method to connect it using connector/j which you previously specified by "Class.forName("com.mysql.jdbc.Driver") ;" So, make proper changes to test various things. Here, you will get some more methods of connecting and testing.

Well, if this doesn't work for you, then there must be something which is beyond the scope of this post, else "congratulations!".