SEARCH AND INFORMATION RETRIEVAL
Large-volume news search
Architecture, engineering management, and UI design for an innovative news search application. Indexes over 100K news items per day. Features persistent search, alerts, heat maps, classification, clustering, and other functions. Implemented using a blend of Lucene, Solr, and PyLucene, written in Python and Java.
SEC filings search
Architecture, engineering management, and UI design for an SEC filings search application. Indexes live and archived SEC filings. Features drill-down into filings within search results, hit highlighting within sub-documents, and user tagging functions.
Hedge fund portfolio search
Blended search with structured browsing of portfolio positions for a hedge fund. Search spans structured and unstructured data, and search results take the user directly to a point in the hierarchical portfolio data. Architecture, engineering management, and UI design.
Other search consulting projects
Internal oil company portal using Verity; natural language search for CRM; search of medical literature.
Context-based document retrieval
Co-designed and co-implemented a new approach to desktop search based on past context ("Find the file I emailed to Frank in June or July"), for HP's NewWave desktop environment. Implemented a natural language interface based on semantic grammars.
Free-text search engine
1991: As part of a collaboration with Karen Spärck Jones, designed and implemented an early online search system based on inverse document frequency weighting and relevance feedback. Spärck Jones invented inverse document frequency (the IDF in TF-IDF) and was a pioneer in the field of information retrieval from the 1960s onwards.
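The core weighting idea can be sketched in a few lines of Python (the function names and toy corpus are illustrative, not from the original system):

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency: log(N / df), where df is the
    number of documents containing the term."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def score(query_terms, doc, docs):
    """Score one document by summing tf * idf over the query terms."""
    tf = Counter(doc)
    return sum(tf[t] * idf(t, docs) for t in query_terms)

# Toy corpus: rarer terms ("index") weigh more than common ones ("sparse").
docs = [["sparse", "retrieval"], ["dense", "retrieval"], ["sparse", "index"]]
```

Relevance feedback then reweights the query terms using documents the user marks as relevant, boosting terms that those documents share.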
Parsing financial data from natural language
Architected, implemented, and deployed major system for information extraction from text. System in use since 2003, extracting over 5,000 data records from over 2,000 distinct documents per day. Used for portfolio management and automated trading by hedge funds. Emphasis on high precision (over 99.8%) and good recall (over 70%). Text content is unpredictable and changes on a daily basis; content includes structured, semi-structured, and unstructured language. System involves document classification, syntactic parsing, semantic filtering, heuristic slot-filler information extraction, error-checking, machine learning, and named-entity detection (including Asian, European, and US company names, and varying security identifiers). Written in Python, .NET, and Visual Basic.
Designed and developed AtomicParser, a proprietary Python library for parsing both unstructured (linguistic) and structured text. AtomicParser is a nondeterministic rule-based parser whose rules can be applied top-down or bottom-up. Machine learning can be used to infer categories and rules from training text. The system uses a specialized regular expression module, written in C, which can return thousands of named captures from a single regular expression match.
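AtomicParser itself is proprietary; as a rough illustration of named-capture slot filling, a single extraction rule written against Python's standard `re` module might look like this (the pattern, field names, and sample text are invented, and the stdlib module lacks the large-scale capture support described above):

```python
import re

# One illustrative slot-filling rule: pull an issuer name and a coupon
# rate out of semi-structured financial text.
RULE = re.compile(
    r"(?P<issuer>[A-Z][A-Za-z&. ]+?) (?:issued|priced) .*?"
    r"(?P<coupon>\d+(?:\.\d+)?)% notes"
)

def extract(text):
    """Return the named captures as a dict, or None if the rule fails."""
    m = RULE.search(text)
    return m.groupdict() if m else None

extract("Acme Corp issued USD 500m of 4.25% notes due 2030")
# → {'issuer': 'Acme Corp', 'coupon': '4.25'}
```

A real rule set would layer many such rules, apply semantic filtering to the captures, and error-check the results, as the paragraph above describes.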
Specified and deployed AtomicML, a proprietary Python library for machine learning. Used in various text analysis tasks, including document routing and creation of custom news channels.
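AtomicML is likewise proprietary; a document-routing step of the kind described can be sketched with a minimal multinomial naive Bayes classifier (the channel names and tokens are invented):

```python
import math
from collections import Counter, defaultdict

class NaiveBayesRouter:
    """Minimal multinomial naive Bayes for routing documents to channels.
    Illustrative only; not the AtomicML implementation."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # channel -> term counts
        self.doc_counts = Counter()              # channel -> training docs
        self.vocab = set()

    def train(self, channel, tokens):
        self.doc_counts[channel] += 1
        self.word_counts[channel].update(tokens)
        self.vocab.update(tokens)

    def route(self, tokens):
        total_docs = sum(self.doc_counts.values())
        def log_prob(ch):
            counts = self.word_counts[ch]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            lp = math.log(self.doc_counts[ch] / total_docs)
            return lp + sum(math.log((counts[t] + 1) / denom) for t in tokens)
        return max(self.doc_counts, key=log_prob)

router = NaiveBayesRouter()
router.train("sport", ["goal", "match", "goal"])
router.train("finance", ["bond", "yield"])
router.route(["yield"])  # routes to "finance"
```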
Search-based content analytics
Designed and developed a search-based analytics web application for lightweight text mining of any searchable content. A free, simplified version of the service is available as AtomicIQ, supporting analysis of web, news, and Wikipedia content. Used by public relations professionals to measure news coverage.
Text mining of customer opinion data
Several projects using automated techniques for analysis of customer statements and opinions in text, including product review forums, open-ended customer survey responses, and call center records. See the presentation by Bacon and Haddock (2004) for work that came out of one of these projects.
2009: Web application for monitoring latest football team news on a single page. Includes news, blog posts, video clips, and quotations. Uses techniques for content filtering and sentence detection: for an example, see Sir Alex Ferguson quotes.
Trail-based web browsing
2002: Designed and implemented a web-based service for saving and sharing web links. The service lists links according to when they were saved, making it easy to retrieve a link with no up-front organisational effort. Similar to later, more widely known services such as Del.icio.us.
Personalized news reader
1993: Designed and implemented an early personalized news filtering system for extracting relevant articles from the Nikkei Weekly News. Featured a simple user interface and an automatic mechanism for detecting areas of interest, based on modified relevance feedback.
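The relevance-feedback mechanism can be illustrated with the classic Rocchio update (the specific modification used in the original system is not documented here, and all names and weights below are illustrative):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio update: move the query vector toward relevant
    documents and away from non-relevant ones. Vectors are dicts
    mapping term -> weight."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    updated = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        updated[t] = max(w, 0.0)  # keep weights non-negative
    return updated

# Marking one article relevant pulls its terms into the interest profile.
profile = rocchio({"trade": 1.0}, [{"trade": 1.0, "yen": 1.0}], [])
```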
Search and management of voice recordings
Founded and managed an HP R&D project developing a suite of techniques for information extraction from voice recordings, with applications to voicemail, personal voice notes, and recorded meetings. Provided visual navigation and tagging of voice recordings via graphical "chunks" of speech; these chunks could be extracted to other applications and tagged with icons, such as a telephone-number icon.
Speech processing components
Graphical user interfaces to speech data depend in part on speech processing algorithms to extract higher-level information from the speech signal. Algorithms developed included:
NATURAL LANGUAGE INTERFACES
Textual NL interfaces
Designed and implemented NLP interfaces to financial monitoring and other ERP systems. Implementation used external NLP tools for syntactic and semantic interpretation.
Speech NL interfaces
Developed grammars for interactive voice interfaces, using commercial and academic speech toolkits.
Developed a computational model of word-by-word syntactic and semantic processing, based on Combinatory Categorial Grammar and incremental evaluation of referential constraints.
Constraint networks for noun phrase evaluation and generation
Further work demonstrating that fast, low-power network consistency algorithms are sufficient for NP evaluation, and an application to generation of noun phrases.
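As a rough illustration of how a network consistency algorithm can evaluate a noun phrase such as "the block on the table", here is a minimal AC-3 sketch over an invented micro-world (not the original implementation):

```python
def revise(domains, xi, xj, constraint):
    """Remove values of xi that have no supporting value in xj."""
    removed = False
    for v in list(domains[xi]):
        if not any(constraint(v, w) for w in domains[xj]):
            domains[xi].discard(v)
            removed = True
    return removed

def ac3(domains, constraints):
    """AC-3: constraints maps (xi, xj) -> binary predicate over values."""
    queue = list(constraints)
    while queue:
        xi, xj = queue.pop()
        if revise(domains, xi, xj, constraints[(xi, xj)]):
            # Re-check every arc pointing at the variable we pruned.
            queue.extend((xk, xl) for (xk, xl) in constraints if xl == xi)
    return domains

# Invented micro-world for "the block on the table".
objects = {
    "b1": {"type": "block", "on": "t1"},
    "b2": {"type": "block", "on": "floor"},
    "t1": {"type": "table", "on": "floor"},
}
domains = {
    "block": {o for o in objects if objects[o]["type"] == "block"},
    "table": {o for o in objects if objects[o]["type"] == "table"},
}
constraints = {
    ("block", "table"): lambda b, t: objects[b]["on"] == t,
    ("table", "block"): lambda t, b: objects[b]["on"] == t,
}
ac3(domains, constraints)
# domains["block"] is pruned to {"b1"}: only b1 is on a table.
```

Enforcing arc consistency alone resolves the referent here without search, which is the kind of low-power sufficiency result the paragraph above describes.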