
Sunday, June 23, 2013

Git Cheat Sheet


Most commonly used git commands

git init     # initialize an empty git repository in the current directory

git config -l     # list current configuration
git config --global user.name "<your name>"
git config --global user.email "<your email>"
git config --global color.status auto
git config --global color.diff auto
git config --global color.branch auto

git clone <git host>:/repo/<projectName>     # copy a git repository so you can add to it

git add <file1 file2>     # add files to the staging area
git add .     # recursively add all files under the current directory
git add *     # add all files in the current directory only

git checkout -- <file>     # revert the modified file to the version in the repository

git status     # view the status of your files in the working directory and staging area
git status -s     # short form; left column for staging area, right column for working dir

git diff     # show diff of unstaged changes
git diff --cached     # show diff of staged changes
git diff HEAD     # show diff of all staged or unstaged changes

git commit     # record a snapshot of the staging area
git commit -a     # stage modified files before committing; does not add new files, but does stage deletions
git commit -m 'commit message goes here'
git commit --amend     # update the last commit message
git commit --amend --reset-author     # use this if you have changed your user name or email with git config and want to fix the identity on the previous commit

git reset --hard HEAD     # throw away anything added to the git index as well as any outstanding changes in the working tree
git reset HEAD     # unstage all staged changes; reset the staging area to HEAD
git reset HEAD -- <file1 file2>     # unstage files; reset each file to its snapshot in HEAD

git revert HEAD     # create a new commit which undoes the change in HEAD (i.e. the last commit); you will be prompted for the new commit message
git revert HEAD^     # undo the next-to-last commit while leaving intact any changes made since then

git rm <file1 file2>     # remove files from the staging area and the working dir
git rm --cached <files>     # remove files from the staging area only
git rm -r subdir/     # recursively remove a subdirectory from the staging area and the working dir

git pull     # fetch changes from the remote and merge them

git branch -a     # list all local and remote branches
git branch <branchName>     # create a new local branch at your last commit, so all commits of the current branch are shared with the new branch
git push origin <branchName>     # push a new local branch to the server; if you do not explicitly push new local branches, they stay in your repo and are invisible to others
git pull origin <branchName>     # pull a remote branch
git branch -d <branchName>     # delete branch; this ensures that the changes in the branch (to be deleted) are already in the current branch
git branch -D <branchName>     # delete branch; this will NOT check whether the changes in the branch (to be deleted) are already in the current branch
git push origin :<branchName>     # delete the branch on the origin remote
git branch --track <master> <origin/master>     # create a local branch that tracks the given remote branch
git checkout <branchName>     # switch to the given branch

git remote     # list configured remotes
git remote show <origin>     # list the URL for the remote repository as well as the tracking branch information

git log <release>
git log --name-status
git log -p
git log --pretty=full
git log --no-merges     # ignore merges

git tag -l     # list tag names
git tag -a v1.4 -m 'version 1.4'     # create tag with annotation
git tag v1.4     # create tag without annotation
git push --tags     # push tags to the remote
git show <tagname>     # show tag and tagger details

# To rename a tag:
git tag NEW OLD     # first create NEW as an alias of OLD
git tag -d OLD     # delete OLD
git push origin :OLD     # delete OLD on the remote

git stash     # save uncommitted changes away
git stash apply     # re-apply the most recently stashed changes
git stash list     # list stashed changes
git log stash@{2}
git show stash@{32}
git reflog
git reflog expire --expire=30.days refs/stash

Looking for anything else? Please refer to the Git Documentation page.


Friday, December 14, 2012

Apriori in Java (Part 2)


(Part 1)

Apriori Algorithm:

The algorithm works as follows: first it generates all the frequent itemsets of length 1 with respect to the given threshold (minimum support). Then it continues to generate itemsets of lengths 2, 3, ..., n as long as possible. Many improvements and prunings are possible in the implementation.

Do we need to sort frequently? The first observation is: if the items are taken in sorted order at step k = 1, all the frequent patterns generated later will also be in sorted order if maintained properly. This eliminates the need for repeated sorting, or for storing itemsets in logarithmic data structures like maps or sets; instead, we can store them in an ArrayList- or Vector-like structure.

How do we generate itemsets of length k+1? We generate an itemset of length k+1 by merging two itemsets of length k: if two itemsets I1 and I2 share a common prefix of length k-1, and I1[k] < I2[k], we can take I1[1 ... k-1] I1[k] I2[k], which is an itemset of length k+1. As our itemsets are sorted, and the frequent itemsets generated are also sorted, this can be done in O(N*K^2). The naive approach would take O(N*K^3); however, since we can pre-compute the length of the common prefix of consecutive itemsets in O(N*K), we can later use this to perform the joining operation stated above, and also terminate early rather than scanning all N itemsets. The approach is demonstrated in the source code.
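The join step above can be sketched in isolation as follows. This is a minimal illustration with class and method names of my own choosing; the full versions appear in the source code below.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixJoinSketch {
    // Length of the common prefix of two sorted itemsets.
    static int prefixLen(List<Integer> a, List<Integer> b) {
        int len = 0;
        while (len < a.size() && len < b.size() && a.get(len).equals(b.get(len))) len++;
        return len;
    }

    // Join two k-itemsets sharing a (k-1)-prefix into one (k+1)-itemset:
    // copy the left itemset, then append the last item of the right one.
    static List<Integer> prefixJoin(List<Integer> left, List<Integer> right) {
        List<Integer> ret = new ArrayList<>(left);
        ret.add(right.get(right.size() - 1));
        return ret;
    }
}
```

For example, joining {1, 2, 3} and {1, 2, 4} (common prefix length 2) yields {1, 2, 3, 4}. Since the itemsets are kept sorted, consecutive itemsets with equal (k-1)-prefixes sit next to each other, which is what makes the early termination possible.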

What are the prunings? The most important observation about the Apriori algorithm is that if an itemset is not frequent, none of its supersets can be frequent. Equivalently, if an itemset is to be frequent, all of its subsets must be frequent. The first property is checked automatically by the algorithm, since it builds the itemsets of the current step from the frequent itemsets of the previous step. How is the second property checked? We don't actually need to check all subsets: at step k it suffices to check the subsets of length k-1, which can easily be done with manual hashing or a Java HashMap. This saves a lot of hassle.
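The subset check can be sketched like this (an illustrative stand-alone version; names are mine, and I use a HashSet where the full code uses a HashMap keyed by step number):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PruneSketch {
    // A candidate (k+1)-itemset survives only if every k-subset,
    // obtained by dropping one element, was frequent at the previous step.
    static boolean allSubsetsFrequent(List<Integer> candidate, Set<List<Integer>> frequentK) {
        for (int i = 0; i < candidate.size(); i++) {
            List<Integer> sub = new ArrayList<>(candidate);
            sub.remove(i);                      // drop the i-th element
            if (!frequentK.contains(sub)) return false;
        }
        return true;
    }
}
```

This works because java.util.List defines equals() and hashCode() element-wise, so a List of Integers can be used directly as a hash key.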

/* Author: Zobayer Hasan */

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class Apriori extends Thread {
    public static boolean debugger = false;

    private final Database db;
    private final List<Integer> itemset;
    private final List<List<Integer>> frequent;
    private double minsup;

    public Apriori(String thrName, Database db, double minsup) {
        super(thrName);
        this.db = db;
        itemset = db.getItemset();
        frequent = new ArrayList<List<Integer>>();
        this.minsup = minsup;
    }

    @Override
    public void run() {
        double startTime = System.currentTimeMillis();

        int k = 1, n = db.dbSize();
        List<List<Integer>> Ck = new ArrayList<List<Integer>>();
        List<List<Integer>> Lk = new ArrayList<List<Integer>>();
        HashMap<List<Integer>, Integer> seenK = new HashMap<List<Integer>, Integer>();

        for(Integer item : itemset) {
            List<Integer> temp = new ArrayList<Integer>();
            temp.add(item);
            Lk.add(temp);
        }

        while(k <= n && !Lk.isEmpty()) {
            if(debugger) {
                System.out.println("Step " + k);
                System.out.println("Lk: " + Lk);
            }

            seenK.clear();
            Ck.clear();
            for(List<Integer> kth : Lk) {
                int count = db.scanDatabase(kth);
                if((double)count < Math.ceil(minsup * (double)n / 100.0)) continue;
                Ck.add(kth);
            }

            if(debugger) {
                System.out.println("Ck: " + Ck);
            }

            if(Ck.isEmpty()) break;

            for(List<Integer> freq : Ck) {
                frequent.add(freq);
                seenK.put(freq, k);
            }

            int[] prefixlen = new int[Ck.size()];
            prefixlen[0] = 0;
            for(int i = 1; i < Ck.size(); i++) {
                prefixlen[i] = prefixLen(Ck.get(i-1), Ck.get(i));
            }

            List<List<Integer>> temp = new ArrayList<List<Integer>>();
            for(int i = 0; i < Ck.size(); i++) {
                for(int j = i + 1; j < Ck.size(); j++) {
                    if(prefixlen[j] == k-1) {
                        if(debugger) {
                            System.out.println("Joining: " + i + ":" + Ck.get(i) + " + " + j + ":" + Ck.get(j) + " Prefix Length " + prefixlen[j]);
                        }
                        temp.add(prefixJoin(Ck.get(i), Ck.get(j)));
                    }
                    else break;
                }
            }

            if(debugger) {
                System.out.println("Temporary: " + temp);
            }

            Lk.clear();
            for(List<Integer> list : temp) {
                boolean candid = true;
                if(k > 1) {
                    for(int i = 0; i < list.size(); i++) {
                        List<Integer> prev = new ArrayList<Integer>();
                        for(int j = 0; j < list.size(); j++) {
                            if(i != j) prev.add(list.get(j));
                        }
                        if(!seenK.containsKey(prev)) {
                            candid = false;
                            break;
                        }
                    }
                }
                if(candid) {
                    Lk.add(list);
                }
            }

            if(debugger) {
                System.out.println("Pruned: " + Lk);
            }

            k++;
        }

        double endTime = System.currentTimeMillis();
        System.out.println("Apriori completed in " + (endTime - startTime)/1000.0 + " seconds");
    }

    public void printPatterns() {
        System.out.println("Frequent Itemsets");
        for(List<Integer> pattern : frequent) {
            System.out.println(pattern);
        }
        System.out.println("Total " + frequent.size() + " itemsets");
    }

    private int prefixLen(List<Integer> left, List<Integer> right) {
        int len = 0;
        for(len = 0; len < left.size() && len < right.size(); len++) {
            if(left.get(len).compareTo(right.get(len)) != 0) return len;
        }
        return len;
    }

    private List<Integer> prefixJoin(List<Integer> left, List<Integer> right) {
        List<Integer> ret = new ArrayList<Integer>();
        for(Integer i : left) {
            ret.add(i);
        }
        ret.add(right.get(right.size() - 1));
        return ret;
    }
}
This class is threaded, so it is possible to take advantage of multicore processors.

A sample test class is shown here.
/* Author: Zobayer Hasan */

public class FIMtest {

    public static void main(String[] args) {
        Database db = null;
        try {
            db = new Database("mushroom.dat");
        } catch(Exception e) {
            e.printStackTrace();
        }

        System.out.println("\nStarting Apriori");

        Apriori test1 = new Apriori("test1", db, 40.0);
        Apriori.debugger = true;
        test1.start();
        try {
            test1.join();
            test1.printPatterns();
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}

So, this is it: an efficient implementation of the Apriori algorithm in Java. It is possible that there are some shortcomings or mistakes in either the source code or the analysis; if you find any, please let me know. Here is a sample run on mushroom with 40% minimum support.

Have fun...

Thursday, December 13, 2012

Apriori in Java (Part 1)


Introduction:

Apriori is a very basic and straightforward algorithm for frequent pattern mining. I will not discuss the approach itself much, as it can already be studied from the many lectures and books available on the net. Instead, I will present an efficient implementation of mine of the standard Apriori algorithm in Java. The target input files are taken from the Frequent Itemset Mining Dataset Repository, for example the mushroom dataset. Due to its space and runtime complexity, Apriori is not suitable for larger files with around 50,000 records or more. It can also take a huge amount of time on very dense files.

The Database:

First we need to read the files and store them in some data structure that is easy to access and efficient at the same time. Here I am presenting my Java source code, followed by an analysis of why this works well.
/* Author: Zobayer Hasan */

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class Entry {
    public Integer first;
    public Integer second;
    Entry() {}
    Entry(Integer first, Integer second) {
        this.first = first;
        this.second = second;
    }
}

public class Database {
    public static boolean debugger = false;

    private final List<List<Integer>> transactions;
    private final List<Integer> items;

    public Database(String dataFileName) throws Exception {
        if(debugger) {
            System.out.println("Processing " + dataFileName);
        }

        transactions = new ArrayList<List<Integer>>();
        items = new ArrayList<Integer>();

        FileInputStream fin = new FileInputStream(dataFileName);
        InputStreamReader istream = new InputStreamReader(fin);
        BufferedReader stdin = new BufferedReader(istream);

        String line;

        double startTime = System.currentTimeMillis();

        while((line = stdin.readLine()) != null) {
            List<Integer> transaction = new ArrayList<Integer>();
            String[] temp = line.split("\\s+");

            for(String num : temp) {
                transaction.add(Integer.parseInt(num));
            }

            if(transaction.isEmpty()) continue;

            Collections.sort(transaction);
            transactions.add(transaction);
        }

        fin.close();
        istream.close();
        stdin.close();

        int n = transactions.size();
        int[] header = new int[n];
        PriorityQueue<Entry> pQ = new PriorityQueue<Entry>(n, new Comparator<Entry>() {
            public int compare(Entry item1, Entry item2) {
                if(item1.first.equals(item2.first)) {
                    return item1.second.compareTo(item2.second);
                } else {
                    return item1.first.compareTo(item2.first);
                }
            }
        });

        for(int i = 0; i < n; i++) {
            header[i] = 0;
            pQ.add(new Entry(transactions.get(i).get(header[i]), i));
        }

        while(!pQ.isEmpty()) {
            Entry peek = pQ.remove();
            int val = peek.first;
            int idx = peek.second;
            if(items.isEmpty() || items.get(items.size()-1) < val) {
                items.add(val);
            }
            while(header[idx] < transactions.get(idx).size() && transactions.get(idx).get(header[idx]) <= val) {
                header[idx]++;
            }
            if(header[idx] < transactions.get(idx).size()) {
                pQ.add(new Entry(transactions.get(idx).get(header[idx]), idx));
            }
        }

        double endTime = System.currentTimeMillis();
        System.out.println("Database created in " + (endTime - startTime)/1000.0 + " seconds");
    }

    public int scanDatabase(List<Integer> transaction) {
        int count = 0;
        for(List<Integer> row : transactions) {
            boolean found = true;
            for(Integer item : transaction) {
                int idx, stp, st = 0, en = row.size(), cnt = en - st;
                while(cnt > 0) {
                    stp = cnt >> 1; idx = st + stp;
                    if(row.get(idx).compareTo(item) < 0) {
                        st = ++idx;
                        cnt -= stp+1;
                    }
                    else {
                        cnt = stp;
                    }
                }
                if(st == row.size() || row.get(st).compareTo(item) != 0) {
                    found = false;
                    break;
                }
            }
            if(found) count++;
        }
        return count;
    }

    public List<Integer> getItemset() {
        return items;
    }

    public int dbSize() {
        return transactions.size();
    }

    public List<Integer> getRow(int row) {
        try {
            return transactions.get(row);
        } catch(Exception e) {
            throw e;
        }
    }
}
Clearly it provides an interface to access the database, where the constructor must be called with a file name, for example 'mushroom.dat'. The Collections.sort call in the constructor is not necessary if you know that the transactions in the file are already sorted in ascending order. All the data files in this repository are already sorted, so the sorting can be disabled.

Now, if we look at the constructor, what is this huge block of code actually doing? First it reads each transaction as a string, then parses it and inserts the transaction into a sorted list. Next, we need a list of the unique items. This could be done by sorting all the records together and then eliminating duplicates; however, it can be done better with the help of a priority queue in O(NK log N) time, where we have N transactions and each transaction has K items on average. This works much better than the naive approach, which would take O(NK log NK) because of the extra sorting overhead.
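The priority-queue pass can be sketched on its own like this. It is a simplified, illustrative equivalent of what the constructor does (the class and method names here are mine; the real code uses an Entry class and a header array of per-row cursors):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class UniqueItemsSketch {
    // Merge N sorted transactions into one sorted list of unique items
    // using a priority queue of (value, row index) cursors: O(NK log N).
    static List<Integer> uniqueItems(List<List<Integer>> transactions) {
        List<Integer> items = new ArrayList<>();
        // cur[0] = current item value, cur[1] = transaction index
        PriorityQueue<int[]> pq = new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
        int[] pos = new int[transactions.size()];   // per-row cursor
        for (int i = 0; i < transactions.size(); i++) {
            if (!transactions.get(i).isEmpty())
                pq.add(new int[]{transactions.get(i).get(0), i});
        }
        while (!pq.isEmpty()) {
            int[] top = pq.remove();
            int val = top[0], idx = top[1];
            // emit val only once, relying on ascending extraction order
            if (items.isEmpty() || items.get(items.size() - 1) < val) items.add(val);
            List<Integer> row = transactions.get(idx);
            while (pos[idx] < row.size() && row.get(pos[idx]) <= val) pos[idx]++;
            if (pos[idx] < row.size()) pq.add(new int[]{row.get(pos[idx]), idx});
        }
        return items;
    }
}
```

Each transaction contributes one cursor; every queue operation costs O(log N), and each of the NK items is passed over once, giving the O(NK log N) bound stated above.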

Then comes the last important part, the scanDatabase() method, which searches each record in the database for a given set of items. It takes O(NK log K) time at most, where we have N records, each of average length K. This is much better than scanning in O(N*K^2) with a naive approach. Some improvement is still possible here: if we knew the set of possible items, we could keep a bit vector alongside each transaction, and the query could then be performed in O(NK). However, as the datasets are huge, I didn't go for it because of the large amount of memory it would require.
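The same lookup can be sketched with the standard library's binary search instead of the hand-rolled lower bound used in scanDatabase() (the class and method names here are illustrative, not part of the original code):

```java
import java.util.Collections;
import java.util.List;

public class ScanSketch {
    // Count transactions that contain every item of the query itemset,
    // binary-searching each sorted transaction: O(NK log K) overall.
    static int support(List<List<Integer>> transactions, List<Integer> itemset) {
        int count = 0;
        for (List<Integer> row : transactions) {
            boolean found = true;
            for (Integer item : itemset) {
                // binarySearch returns a negative value when item is absent
                if (Collections.binarySearch(row, item) < 0) { found = false; break; }
            }
            if (found) count++;
        }
        return count;
    }
}
```

The difference from the version in the Database class is purely stylistic; both perform one O(log K) search per query item per row.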

Part 2 contains the source code and explanation for the Apriori algorithm itself.

Monday, May 7, 2012

GIT What I Need


Mainly I use GitHub for my personal repository so that I can access some files even when I am not at home, and every time I try to push something to GitHub from my local repository, I almost always forget the commands. That's why I am writing down the few commands I will always need, just for my convenience :)

git status -s
git add <filename>

git config --global user.name "<my_name>"
git config --global user.email <my_email>

git commit -m "<my_commit_msg>"

git push origin master

That will do for now. A total reference can be found here.