Monday, January 19, 2009

Current version of RubyRx

Hi,

Thanks, John, for your notes from last week's session.

My previous post, which contained a version of the RubyRx DSL, somehow got mixed up with John's notes. (I would have said that was not possible, but it happened.)

It would probably be pretty tough to recover the old post, so I think I won't do that. Besides, RubyRx has gone through a number of significant changes over the last week, so I think it will be better to post the latest version.


Here are the things that have changed:

I added summary stats to the template, and a SummaryStats class to the program. So RubyRx can now produce and output the kinds of simple statistics that generally show up in the Clinical Study Report (the CSR, which is the report of clinical trial results that goes to the FDA after the trial is complete).

I removed the hardcoded hashes that gave me a bunch of demographics data to work with. I also removed the class that previously held those hashes. In their place, I have added a require of the ActiveRecord gem and a subclass of ActiveRecord::Base. I also created a somewhat crude version of an SDTM demographics (DM) table in MySQL on my laptop, and a program that generates a lot of DM-style records. (By "a lot" I mean thousands of records.)
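The generator program itself is not shown in this post. As a rough sketch of the idea only: the column names and category values below are taken from the template further down, while the usubjid field, the age range, and the generate_dm_records name are assumptions made up for illustration.

```ruby
# Hypothetical generator for DM-style rows. Column names and category
# values come from the template; everything else is an assumption.
TRTGRPS = ['Placebo', 'Drug']
SEXES   = ['M', 'F']
RACES   = ['WHITE', 'BLACK', 'ASIAN', 'Other']

def generate_dm_records(count, seed = 42)
  rng = Random.new(seed) # seeded so repeated runs give the same data
  (1..count).map do |i|
    {
      :usubjid => format('SUBJ-%05d', i), # made-up subject identifier
      :trtgrp  => TRTGRPS[rng.rand(TRTGRPS.size)],
      :sex     => SEXES[rng.rand(SEXES.size)],
      :race    => RACES[rng.rand(RACES.size)],
      :age     => 18 + rng.rand(63)       # ages 18..80, an arbitrary range
    }
  end
end

records = generate_dm_records(12_000)
puts records.first.inspect
```

Each hash could then be handed to DM.create, assuming the table has matching columns.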

So RubyRx is maturing. There's still a decent amount of work left to do. Here's the list of next steps:

1. Show that this will work on other kinds of data (for example, Adverse Events data).

2. Put this on Heroku (including the data) so many of us can work on the code simultaneously.

3. Do code review. I am interested in doing code review at the Boston Ruby group's February Hackfests.

4. Performance. The code works fine for fewer than 1,000 records, but it starts to slow down above 1,000, and above 10,000 it is probably too slow. For example, I created 12,000 DM records, and the program took nearly a minute to run. That's not horrible, but it's not great, either. I am interested in hearing ideas on how to make this faster (caching, etc.). We can go over this at the next session (on Tuesday the 21st) and at the February Hackfests.


This is certainly not an exhaustive list. But it's a start.
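On the performance item, one idea to try (not profiled, so treat it as a guess): get_statistics loads every record once per variable and runs a string through instance_eval for every record. A minimal sketch of the alternative is to bucket the values in a single pass with ordinary hashes. The group_values and frequencies helpers are hypothetical names, and plain hashes stand in for DM rows so the example runs without a database.

```ruby
# Sketch: bucket values per by-group in one pass, with no string eval.
# Rows are read with [], which plain hashes support (and, as far as I
# know, ActiveRecord objects do too).
def group_values(records, on_var, by_var = nil)
  groups = Hash.new { |hash, key| hash[key] = [] }
  records.each do |r|
    key = by_var ? r[by_var] : :all # :all is the bucket for "no by-variable"
    value = r[on_var]
    groups[key] << value unless value.nil?
  end
  groups
end

# Count how often each value occurs.
def frequencies(values)
  freq = Hash.new(0)
  values.each { |v| freq[v] += 1 }
  freq
end

rows = [
  { :trtgrp => 'Placebo', :sex => 'M', :age => 34 },
  { :trtgrp => 'Drug',    :sex => 'F', :age => 51 },
  { :trtgrp => 'Drug',    :sex => 'M', :age => 47 }
]

sex_by_trtgrp = group_values(rows, :sex, :trtgrp)
puts frequencies(sex_by_trtgrp['Drug']).inspect
```

Loading the records once and reusing them for every variable would be the other half of this change.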


Here's the code:


require 'rubygems'
require 'active_record'
require 'erb'

t0 = Time.new

class SummaryStatistics
  attr_accessor :n, :mean, :variance, :median, :standard_deviation, :minimum, :maximum, :freq

  def initialize
    @arr = []
  end

  def calc_n
    @arr.select { |e| e != nil }.size
  end

  # Single-pass (Welford-style) mean and sample variance.
  # Uses an explicit counter rather than reusing n as the block index,
  # which only worked under Ruby 1.8 block-variable scoping.
  def calc_mean_and_variance
    count, mean, s = 0, 0.0, 0.0

    @arr.each do |x|
      count += 1
      delta = x - mean
      mean += delta / count
      s += delta * (x - mean)
    end

    # Divide by (count - 1), as the commented-out calc_variance does.
    return mean, (count > 1 ? s / (count - 1) : 0.0)
  end

  def calc_mean
    @arr.inject { |sum, e| sum + e }.to_f / @n
  end

  # Assumes @arr is already sorted (see calculate_stats).
  def calc_median
    @arr.size % 2 == 1 ? @arr[(@n / 2.0).ceil - 1] : (@arr[(@n / 2) - 1] + @arr[@n / 2]) / 2.0
  end

  #def calc_variance
  #  @arr.size == 1 ? 0 : (@arr.inject(0) { |total, e| total += ((e - @mean) ** 2) }) / (@n - 1)
  #end

  def calc_standard_deviation
    Math.sqrt(@variance)
  end

  def calc_minimum
    min = @arr[0]
    @arr.each { |e| min = e if e != nil and (min == nil or e < min) }
    min.to_f
  end

  def calc_maximum
    max = @arr[0]
    @arr.each { |e| max = e if e != nil and (max == nil or e > max) }
    max.to_f
  end

  def <<(obj)
    @arr << obj
  end

  def calculate_stats
    t1 = Time.new
    @arr.sort!

    @n = calc_n

    # Numeric summaries only make sense for numeric variables such as age.
    if @arr[0].kind_of?(Numeric)
      @mean, @variance = calc_mean_and_variance
      @median = calc_median
      @standard_deviation = calc_standard_deviation
      @minimum = calc_minimum
      @maximum = calc_maximum
    end

    # Frequencies are computed for every variable, numeric or not.
    @freq = Hash.new(0)
    @arr.each { |e| @freq[e] += 1 if e != nil }
    puts 'Duration of calc_stats: ' + (Time.new - t1).to_s + ', size: ' + @n.to_s
  end
end

ActiveRecord::Base.establish_connection(
  :adapter => "mysql",
  :host => "localhost",
  :password => "xxx",
  :database => "rubyrx")

class DM < ActiveRecord::Base
end

class Output
  attr_accessor :sex_by_trtgrp, :race_by_trtgrp, :age_by_trtgrp, :trtgrp, :sex, :race, :age

  # Support templating of member data.
  def get_binding
    binding
  end

  # Takes a string such as 'sex_and_race_by_trtgrp' and builds one
  # instance variable of SummaryStatistics objects per "on" variable.
  def get_statistics(on_and_by_vars_str)
    #puts 'in get_statistics'
    #puts on_and_by_vars_str
    on_and_by_vars_str.to_s =~ /^(.*)_by_(.*)/i
    on_vars_str = $1
    by_vars_str = $2
    #puts 'on_vars_str'
    #puts on_vars_str
    #puts 'by_vars_str'
    #puts by_vars_str

    # Build the hash subscript, e.g. "[r.trtgrp]", used in the evals below.
    by_vars_for_hash = ""
    by_vars = by_vars_str.split(/_and_/i)

    #puts 'by_vars'
    #p by_vars
    by_vars.each { |var| by_vars_for_hash << "[r.#{var}]" } unless by_vars.empty?
    #puts 'does it get here?'
    #puts 'by_vars_for_hash'
    #puts by_vars_for_hash
    #puts by_vars.join('_')

    on_vars = on_vars_str.split(/_and_/i)
    #puts 'on_vars'
    #p on_vars

    on_vars.each do |on_var|
      #puts 'in on_vars.each'
      #p on_var

      var_name = "#{on_var}"
      var_name << "_by_#{by_vars.join('_')}" unless by_vars.empty?
      #puts 'var_name'
      #puts var_name

      instance_eval("@#{var_name} = Hash.new")
      #puts 'sex_trtgrp'
      #p @sex_trtgrp

      distinct_records = DM.find(:all, :select => "DISTINCT trtgrp, #{on_var}")
      #puts 'distinct_records'
      #p distinct_records
      records = DM.find(:all)

      # One SummaryStatistics object per by-group.
      distinct_records.each do |r|
        instance_eval("@#{var_name}#{by_vars_for_hash} = SummaryStatistics.new")
        #puts 'in @sdtm.each = 0 loop'
        #p @sex_trtgrp
      end

      # Feed every record's value into its group.
      records.each do |r|
        instance_eval("@#{var_name}#{by_vars_for_hash} << r.#{on_var}")
        #puts 'in @sdtm.each += 0 loop'
        #p @sex_trtgrp
      end

      distinct_records.each do |r|
        instance_eval("@#{var_name}#{by_vars_for_hash}.calculate_stats")
        #puts 'in @sdtm.each += 0 loop'
        #p @sex_trtgrp
      end
    end

    return self
  end

  # Turns a call like get_statistics_on_sex_by_trtgrp into
  # get_statistics('sex_by_trtgrp').
  def method_missing(sym, *args)
    # Anything that does not look like <method>_on_<vars> is a genuinely missing method.
    return super unless sym.to_s =~ /^(.*)_on_(.*)/i
    method_name = $1
    on_and_by_vars_str = $2
    #puts 'in method missing'
    #puts method_name
    #puts on_and_by_vars_str
    instance_eval("#{method_name}('#{on_and_by_vars_str}')")
  end
end


begin
  o = Output.new.
    get_statistics_on_sex_and_race_and_age_by_trtgrp.
    get_statistics_on_trtgrp_by_.
    get_statistics_on_sex_by_.
    get_statistics_on_race_by_.
    get_statistics_on_age_by_
  #o = Output.new.get_statistics_on_sex_by_trtgrp

  #File.open('saved_object.txt', 'w+') do |f|
  #  Marshal.dump(o, f)
  #end

  #puts 'Elapsed time: ' + (Time.new - t0).to_s

  #p o.sex_by_trtgrp
  #p o.race_by_trtgrp
  #p o.age_by_trtgrp
  #p o.sex
  #p o.trtgrp
  #p o.race
  #p o.age

  #x = File.open('saved_object.txt', 'r') do |f|
  #  Marshal.load(f)
  #end

  File.open('T14.2-Demo-Template.erb', 'r') do |t|
    File.open('T14.2-Demo.rhtml', 'w') do |f|
      # Read the template as one string; readlines.to_s only joins the lines on Ruby 1.8.
      f.puts ERB.new(t.read).result(o.get_binding)
    end
  end
rescue => e
  puts e
ensure
  puts 'Elapsed time: ' + (Time.new - t0).to_s
end
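The single-pass update inside calc_mean_and_variance can be checked in isolation. The sketch below is a standalone version, not the class method itself: running_mean_and_variance is a hypothetical name, and dividing by (count - 1) is an assumption taken from the commented-out calc_variance.

```ruby
# Welford-style running mean and sample variance in one pass.
def running_mean_and_variance(values)
  count = 0
  mean = 0.0
  s = 0.0 # running sum of squared deviations from the current mean
  values.each do |x|
    count += 1
    delta = x - mean
    mean += delta / count
    s += delta * (x - mean)
  end
  variance = count > 1 ? s / (count - 1) : 0.0
  [mean, variance]
end

mean, variance = running_mean_and_variance([2, 4, 4, 4, 5, 5, 7, 9])
puts format('mean=%.2f variance=%.4f', mean, variance)
# prints: mean=5.00 variance=4.5714
```

The eight values sum to 40 and their squared deviations from 5 sum to 32, so the sample variance is 32/7.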


And here's the current DM template:

ACME Pharmaceuticals
Study XXXXXXX
Demographics Table

Placebo Drug Total
N=<%=@trtgrp.freq['Placebo']%> N=<%=@trtgrp.freq['Drug']%> N=<%=@trtgrp.n%>

Sex
M <%=@sex_by_trtgrp['Placebo'].freq['M']%> <%=@sex_by_trtgrp['Drug'].freq['M']%> <%=@sex.freq['M']%>
F <%=@sex_by_trtgrp['Placebo'].freq['F']%> <%=@sex_by_trtgrp['Drug'].freq['F']%> <%=@sex.freq['F']%>

Race
White <%=@race_by_trtgrp['Placebo'].freq['WHITE']%> <%=@race_by_trtgrp['Drug'].freq['WHITE']%> <%=@race.freq['WHITE']%>
Black <%=@race_by_trtgrp['Placebo'].freq['BLACK']%> <%=@race_by_trtgrp['Drug'].freq['BLACK']%> <%=@race.freq['BLACK']%>
Asian <%=@race_by_trtgrp['Placebo'].freq['ASIAN']%> <%=@race_by_trtgrp['Drug'].freq['ASIAN']%> <%=@race.freq['ASIAN']%>
Other <%=@race_by_trtgrp['Placebo'].freq['Other']%> <%=@race_by_trtgrp['Drug'].freq['Other']%> <%=@race.freq['Other']%>

Age
n <%=@age_by_trtgrp['Placebo'].n%> <%=@age_by_trtgrp['Drug'].n%> <%=@age.n%>
mean <%=sprintf("%0.1f", @age_by_trtgrp['Placebo'].mean)%> <%=sprintf("%0.1f", @age_by_trtgrp['Drug'].mean)%> <%=sprintf("%0.1f", @age.mean)%>
std <%=sprintf("%0.2f", @age_by_trtgrp['Placebo'].standard_deviation)%> <%=sprintf("%0.2f", @age_by_trtgrp['Drug'].standard_deviation)%> <%=sprintf("%0.2f", @age.standard_deviation)%>
median <%=@age_by_trtgrp['Placebo'].median%> <%=@age_by_trtgrp['Drug'].median%> <%=@age.median%>
min <%=@age_by_trtgrp['Placebo'].minimum%> <%=@age_by_trtgrp['Drug'].minimum%> <%=@age.minimum%>
max <%=@age_by_trtgrp['Placebo'].maximum%> <%=@age_by_trtgrp['Drug'].maximum%> <%=@age.maximum%>
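The template can read instance variables such as @trtgrp directly because ERB evaluates it against the binding returned by get_binding. A minimal sketch of that mechanism follows, with a hypothetical DemoOutput class and made-up counts standing in for Output and the real data.

```ruby
require 'erb'

# Stand-in for the Output class: one instance variable that the
# template reads through the object's binding.
class DemoOutput
  def initialize
    @freq = { 'Placebo' => 3, 'Drug' => 5 } # made-up counts
  end

  def get_binding
    binding
  end
end

template = "N=<%=@freq['Placebo']%> N=<%=@freq['Drug']%>"
puts ERB.new(template).result(DemoOutput.new.get_binding)
# prints: N=3 N=5
```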



See you on Tuesday.

Thanks,

Glenn
