January 30, 2012

(updated) substring counting for fun...

Now it’s even slower. I was about to write a Fortran version, but the index() intrinsic is not very fast either. (By the way, Unicode is slow.)

bash-3.2$ /usr/bin/time -p bin/count.occurrence.py 測試
測試    819
real        11.51
user        11.27
sys          0.16
bash-3.2$ /usr/bin/time -p bin/count.occurrence.pl 測試 < /Volumes/ramdisk/newTEXT.txt
測試    819
real         0.88
user         0.83
sys          0.04
bash-3.2$ /usr/bin/time -p gawk -F 測試 '{s=s+NF-1}END{print FS"        "s}' /Volumes/ramdisk/newTEXT.txt
測試    819
real         1.28
user         1.22
sys          0.05
bash-3.2$ /usr/bin/time -p bin/count.occurrence2.py 測試
測試    819
real         1.40
user         1.36
sys          0.04
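
The gawk one-liner works by using the query itself as the field separator: a line with n occurrences splits into n+1 fields, so NF-1 is the count for that line. Here is a minimal Python 3 sketch of the same idea. The name count_by_split and the sample lines are made up for illustration, and this is not one of the scripts timed above.

```python
def count_by_split(lines, query):
    """Count occurrences of query by splitting on it, like awk's NF-1."""
    total = 0
    for line in lines:
        # n occurrences of the separator yield n + 1 fields
        total += len(line.split(query)) - 1
    return total


# Hypothetical sample data, only for illustration
sample = ["測試一下測試", "沒有", "再測試"]
print(count_by_split(sample, "測試"))
```

For a non-empty query this should agree with summing line.count(query), since both count non-overlapping matches from left to right.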

The code for Python counter 1 (count.occurrence.py):

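#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2: decode both the query and the file to unicode, then count.
import codecs
import sys
uquery = unicode(sys.argv[1], "utf-8")  # query from the command line, decoded
myCount = 0
f = codecs.open('/Volumes/ramdisk/newTEXT.txt', encoding='utf-8')
for line in f:
    # every line is decoded to unicode before counting
    myCount += line.count(uquery)
f.close()
print myCount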

The code for Python counter 2 (count.occurrence2.py):

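#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2: count on raw byte strings; nothing is decoded.
import sys
query = sys.argv[1]  # byte string exactly as passed by the shell
myCount = 0
f = open('/Volumes/ramdisk/newTEXT.txt')
for line in f:
    myCount += line.count(query)
f.close()
print myCount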
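
Counter 2 skips decoding and counts on raw bytes. That should give the same answer as counter 1 as long as both the file and the query are valid UTF-8, because an encoded UTF-8 character never starts in the middle of another one. Below is a small Python 3 sketch of that equivalence. The function names and the in-memory sample are assumptions for illustration, and Python 3 timings may differ from the Python 2 numbers above.

```python
import io


def count_text(stream, query):
    """Count on decoded text lines, as counter 1 does."""
    return sum(line.count(query) for line in stream)


def count_bytes(stream, query):
    """Count on raw byte lines, as counter 2 does."""
    needle = query.encode("utf-8")
    return sum(line.count(needle) for line in stream)


# Hypothetical in-memory stand-in for the text file
data = "這是測試\n測試測試\nno match here\n".encode("utf-8")

text_total = count_text(io.TextIOWrapper(io.BytesIO(data), encoding="utf-8"), "測試")
byte_total = count_bytes(io.BytesIO(data), "測試")
print(text_total, byte_total)
```

Both totals come out equal on this sample; the byte version simply avoids the cost of decoding every line.
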
Posted by mjhsieh at January 30, 2012 08:10 AM