Commit 02099d6

Use modern Clojure for this task.
1 parent f85b408 commit 02099d6

10 files changed

+189
-110
lines changed

.gitignore

Whitespace-only changes.

README.md

Lines changed: 24 additions & 21 deletions
@@ -1,21 +1,23 @@
 Background
 ==========
 
-Tgrep is a commandline utility meant to search log files per Reddit's specifications [0].
+`tgrep` is a commandline utility meant to search 100 GB log files.
 
-Setup
-=====
+### Prerequisites ###
 
-Tgrep is written in Clojure and, as such, requires a JVM and the Clojure language [1] to be installed on your computer. What's more, it manages its project dependencies using Leiningen [2], so lein is required to run the utility, as well. All other libraries are included with the executable.
+`tgrep` is written in Clojure and, as such, requires a JVM to be installed on your computer. Leiningen is required to install Clojure and library dependencies.
 
-Ruby is used instead of Bash to actually parse arguments in the executable and then run the Clojure code as a script. Any version of Ruby after 1.8.6 should do.
+### Running ###
 
-To run the executable, first give it permission to execute via chmod +x bin/tgrep.
+To run:
 
-Then run the program (bin/tgrep) according to the Reddit specifications. /log/haproxy.log is the default file, if none other is specified.
+lein search 00:05
+lein search -f logfile 00:05
+lein search -f logfile 00:05-00:10
+lein search -f logfile 00:05:01-00:10
+lein search -f logfile 00:05:01-00:10:01
 
-Edge cases
-==========
+### Edge cases ###
 
 There were several different potential edge cases and bugs that could've cropped up with this utility. These are mainly the product of:
 
@@ -29,18 +31,19 @@ There were several different potential edge cases and bugs that could've cropped
 a 24 hour timepoint. This means that a specified date interval range
 might appear twice in a log file.
 
-I've tried to remedy edge cases up front by normalizing all incoming dates
-and immediately turning them into intervals. Any precise values simply turn into one boundary for the time interval we're looking at, and if we're only looking for one timestamp, I simply make an interval out of two identical date values.
-
-To combat (3), I simply decided that the first range within the valid bounds of the log file would be the target range.
-
-Performance
-===========
-
-One note about performance. This code runs as fast as vanilla Java, and it will self-optimize the more you run it. One drawback about using the JVM is that the VM must load before the script can execute. I've talked with jedberg about this problem, and he said that the JVM load time would not be an issue. To give you an indication of the actual run time of my script, I've included an "Elapsed time" tag after the script runs. To mediate the JVM load time, a nailgun Java server (which gives you a persistent JVM) may also be used.
+I've tried to remedy edge cases up front by normalizing all incoming
+dates and immediately turning them into intervals. Any precise values
+simply turn into one boundary for the time interval we're looking at,
+and if we're only looking for one timestamp, I simply make an interval
+out of two identical date values.
 
-[0] http://www.reddit.com/r/blog/comments/fjgit/reddit_is_doubling_the_size_of_its_programming/
+To combat (3), I simply decided that the first range within the valid
+bounds of the log file would be the target range.
 
-[1] http://clojure.org/getting_started
+### Performance ###
 
-[2] https://github.com/technomancy/leiningen
+This code runs nearly as fast as vanilla Java, and it will self-optimize
+the more you run it. One drawback of using the JVM is that the VM
+must load before the script can execute. A persistent JVM can solve
+this problem. To give you an indication of the actual run time of my
+script, I've included an "Elapsed time" tag after the script runs.
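
The interval normalization described in the README's edge-case section could look roughly like the sketch below. It is only an illustration under stated assumptions: the real logic lives in `tgrep.search`, which this commit view does not show; `parse-time` and `->interval` are hypothetical names; and times are reduced to seconds since midnight rather than full dates.

```clojure
(require '[clojure.string :as string])

(defn parse-time
  "Hypothetical helper: turns \"HH:MM\" or \"HH:MM:SS\" into seconds
  since midnight. A missing seconds field is treated as 0."
  [s]
  (let [[h m sec] (map #(Long/parseLong %) (string/split s #":"))]
    (+ (* 3600 h) (* 60 m) (or sec 0))))

(defn ->interval
  "Hypothetical helper: turns \"START\" or \"START-END\" into a
  [start end] pair. A single timestamp becomes an interval made of two
  identical values, as the README describes."
  [arg]
  (let [[start end] (string/split arg #"-")]
    [(parse-time start) (parse-time (or end start))]))

(->interval "00:05")             ;; => [300 300]
(->interval "00:05:01-00:10")    ;; => [301 600]
(->interval "00:05:01-00:10:01") ;; => [301 601]
```

Representing every query as a pair means the search code only has to handle one shape of input, whichever form the user typed on the command line.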

bin/generate

Lines changed: 0 additions & 2 deletions
This file was deleted.

bin/search.clj

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
+#!/Users/David/.cljr/bin/jark
+
+(ns tgrep
+  (:require [tgrep.search :as ts])
+  (:use clojure.contrib.command-line))
+
+(defn invoke
+  ([logfile start-time] (invoke logfile start-time start-time))
+  ([logfile start-time end-time]
+   (time (ts/process-file logfile start-time end-time))))
+
+(defn -main
+  "Search log files in O(log n) time"
+  [& args]
+  (println "here")
+  (with-command-line args
+    [[logfile "log file"]
+     [start "start time, to arbitrary precision"]
+     [end "end time, to arbitrary precision"]]
+    (println "start: " start)
+    (println "end: " end)
+    (println "log file: " logfile)
+    (invoke logfile start end)))
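
The O(log n) claim in the `-main` docstring suggests a binary search over timestamps that appear in sorted order. The actual search is in `tgrep.search`, which is not part of this diff view, so the following is only a sketch of the idea on an in-memory sorted vector of numbers. `lower-bound` and `slice-range` are hypothetical names, and the sketch assumes the timestamps never wrap past midnight (the README's edge case (3) would need extra handling).

```clojure
(defn lower-bound
  "Index of the first element of the sorted vector v that is >= target,
  or (count v) when no such element exists. Takes O(log n) steps."
  [v target]
  (loop [lo 0
         hi (count v)]
    (if (< lo hi)
      (let [mid (quot (+ lo hi) 2)]
        (if (< (nth v mid) target)
          (recur (inc mid) hi)   ; answer is to the right of mid
          (recur lo mid)))       ; mid itself may be the answer
      lo)))

(defn slice-range
  "All elements of the sorted integer vector v with start <= x <= end."
  [v start end]
  (subvec v (lower-bound v start) (lower-bound v (inc end))))

(slice-range [10 20 20 30 40] 20 30) ;; => [20 20 30]
```

A file-based version would presumably apply the same halving to byte offsets and read the timestamp of the line found at each probe, but that part is not shown in this commit.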

project.clj

Lines changed: 6 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,7 @@
 (defproject tgrep "0.0.1"
-  :description "FIXME: write"
-  :dependencies [[org.clojure/clojure "1.2.0"]
-                 [org.clojure/clojure-contrib "1.2.0"]
-                 [clj-time "0.3.0-SNAPSHOT"]]
-  :run-aliases {:tgrep tgrep.search
-                :generate tgrep.generate}
-  :main tgrep.search)
+  :description "A speedy log parser"
+  :dependencies [[org.clojure/clojure "1.6.0"]
+                 [org.clojure/tools.cli "0.3.1"]
+                 [clj-time "0.7.0"]]
+  :aliases {"search" ["run" "-m" "tasks.search"]
+            "generate" ["run" "-m" "tasks.generate"]})

src/tasks/generate.clj

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
+(ns tasks.generate
+  (:require [tgrep.generate :as generate]
+            [clojure.tools.cli :as cli]
+            [clojure.string :as string])
+  (:gen-class))
+
+(def cli-options
+  [["-f" "--file FILE"
+    "File to generate"
+    :default "haproxy.log"]
+   ["-s" "--start TIME"
+    "Start time for the log file"
+    :default "00:00"]
+   ["-n" "--number NUMBER"
+    "Number of lines to generate"
+    :parse-fn #(Integer/parseInt %)
+    :default 1000000]
+   ["-v" "--verbose" "Verbose"]])
+
+(defn -main [& args]
+  (let [{:keys [options]} (cli/parse-opts args cli-options)
+        {:keys [file start number]} options]
+    (generate/create-file! file start number)))

src/tasks/search.clj

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
+(ns tasks.search
+  (:require [tgrep.search :as search]
+            [clojure.tools.cli :as cli]
+            [clojure.string :as string]))
+
+(def cli-options
+  [["-f" "--file FILE" "Logfile to search"
+    :default "haproxy.log"]
+   ["-v" "--verbose" "Verbose"]])
+
+(defn time-elapsed [start-time-ns end-time-ns]
+  (/ (double (- end-time-ns start-time-ns))
+     1000000.0))
+
+(defn process-file-and-print-lines [file start-time end-time]
+  (let [lines (search/process-file file
+                                   start-time
+                                   (or end-time start-time))]
+    (doseq [line lines]
+      (println line))))
+
+(defn -main [& args]
+  (let [{:keys [options arguments]} (cli/parse-opts args cli-options)
+        [start-time end-time] (string/split (first arguments) #"[-]")
+        script-start-time (System/nanoTime)
+        _ (process-file-and-print-lines (options :file) start-time end-time)
+        script-end-time (System/nanoTime)
+        queried-in (time-elapsed script-start-time script-end-time)]
+    (println "Queried in" queried-in "ms")))
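
One thing to watch in `-main` above: when no time argument is given, `(first arguments)` is `nil` and `string/split` throws a `NullPointerException`. A possible guard is sketched below. `parse-range-arg` is a hypothetical helper that is not part of this commit, and the error message wording is an assumption.

```clojure
(require '[clojure.string :as string])

(defn parse-range-arg
  "Hypothetical helper: takes the positional arguments and returns
  [start end] as strings. \"START\" alone yields [start start].
  Throws ex-info when the time argument is missing or blank."
  [arguments]
  (let [arg (first arguments)]
    (when (string/blank? arg)
      (throw (ex-info "Missing time argument, e.g. 00:05 or 00:05-00:10"
                      {:arguments arguments})))
    (let [[start end] (string/split arg #"-")]
      [start (or end start)])))

(parse-range-arg ["00:05"])          ;; => ["00:05" "00:05"]
(parse-range-arg ["00:05:01-00:10"]) ;; => ["00:05:01" "00:10"]
```

With a helper like this, `-main` could catch the `ExceptionInfo` and print a usage message instead of a stack trace.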

src/tgrep/generate.clj

Lines changed: 28 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,20 @@
11
(ns tgrep.generate
2-
(:require [tgrep.search :as ts])
3-
(:use clojure.contrib.string
4-
clojure.contrib.duck-streams
5-
clj-time.core
6-
clj-time.coerce))
2+
(:require [tgrep.search :as ts]
3+
[clojure.string :as string]
4+
[clj-time.core :as clj-time]
5+
[clj-time.coerce :as coerce]
6+
[clojure.java.io :refer [writer]])
7+
(:import java.io.BufferedWriter))
8+
9+
;; Copied from old clojure-contrib
10+
(defn write-lines
11+
"Writes lines (a seq) to f, separated by newlines. f is opened with
12+
writer, and automatically closed at the end of the sequence."
13+
[filename lines]
14+
(with-open [w (writer filename)]
15+
(binding [*out* w]
16+
(doseq [line lines]
17+
(println line)))))
718

819
(def start-time (ts/parse-date "11/Feb/2011:23:55:00.000" ts/date-formatter-1))
920

@@ -12,11 +23,11 @@
1223
(defn entry-for
1324
"Generates an entry for the given date (in millis) based on template."
1425
[template millis]
15-
(let [date (from-long millis)
26+
(let [date (coerce/from-long millis)
1627
formatted-1 (ts/unparse-date date ts/date-formatter-1)
1728
formatted-2 (ts/unparse-date date ts/date-formatter-2)
18-
subst-1 (replace-first-re ts/date-re-1 formatted-1 template)
19-
subst-2 (replace-first-re ts/date-re-2 formatted-2 subst-1)]
29+
subst-1 (string/replace-first template ts/date-re-1 formatted-1)
30+
subst-2 (string/replace-first subst-1 ts/date-re-2 formatted-2)]
2031
subst-2))
2132

2233
(defn next-entry
@@ -26,17 +37,16 @@
2637
(let [next-date (ts/inc-date (ts/get-date curr-entry) millis)
2738
formatted-1 (ts/unparse-date next-date ts/date-formatter-1)
2839
formatted-2 (ts/unparse-date next-date ts/date-formatter-2)
29-
subst-1 (replace-first-re ts/date-re-1 formatted-1 curr-entry)
30-
subst-2 (replace-first-re ts/date-re-2 formatted-2 subst-1)]
40+
subst-1 (string/replace-first curr-entry ts/date-re-1 formatted-1)
41+
subst-2 (string/replace-first subst-1 ts/date-re-2 formatted-2)]
3142
subst-2)))
3243

33-
(defn write-entries
44+
(defn create-file!
3445
"Writes all entries to logfile via lazy map, to avoid heavy memory
35-
footprint but still harness the power of functional programming."
36-
[start-date n]
37-
(let [millis (to-long start-date)
46+
footprint but still harness the power of functional programming."
47+
[filename start-date n]
48+
(let [millis (coerce/to-long start-date)
3849
entry (entry-for example millis)]
39-
(write-lines ts/*logfile* (map #(next-entry entry (* 1000 %))
40-
(range 0 n)))))
41-
42-
(defn -main [] (write-entries start-time 1000000))
50+
(write-lines filename
51+
(map #(next-entry entry (* 1000 %))
52+
(range 0 n)))))
