Exploring Reasoning And Interactive Benchmarking Of Language Models