AI-Powered Conversational Interface for ARGO Ocean Data Discovery and Visualization
FloatChat - AI-Powered ARGO Ocean Data Discovery and Visualization
FloatChat is an intelligent conversational AI system that revolutionizes how researchers interact with ARGO float data. The platform addresses a critical gap in oceanographic research: making vast, complex datasets accessible, understandable, and actionable through natural language.
By combining a RAG-based AI chatbot with interactive data visualizations and real-time dashboards, FloatChat enables users to query complex oceanographic data conversationally and receive instant, context-aware insights with supporting visualizations.
Working with oceanographic data, I discovered that researchers face significant barriers when accessing ARGO float measurements.
These challenges slow down critical research on ocean warming, salinity changes, and climate patterns. I built FloatChat to democratize access to oceanographic data by providing an intelligent, conversational interface that makes this data accessible to anyone.
FloatChat is built on a modern, scalable architecture designed for both performance and accuracy:
Uses a Groq-hosted LLM (Qwen2.5-32B) with Retrieval-Augmented Generation (RAG) to produce responses grounded in retrieved context. ChromaDB provides vector storage for embeddings, and PostgreSQL holds structured float metadata.
Automated ingestion from the ARGO global repositories (ftp.ifremer.fr) and the Indian Argo Project (INCOIS). Custom data cleaning modules handle inconsistencies in the source files, while embedding modules process metadata for semantic search.
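To make the cleaning step concrete, here is a minimal sketch of what one such module might do. It is not FloatChat's actual code: the `Level` record and `clean_profile` function are hypothetical names, and using a single QC flag per level is a simplifying assumption (Argo files carry separate flags per variable). The fill value 99999.0 and the QC flag meanings (1 = good, 2 = probably good) follow the Argo conventions.

```python
from dataclasses import dataclass
from typing import List

FILL_VALUE = 99999.0        # Argo convention for missing float values
GOOD_QC_FLAGS = {"1", "2"}  # Argo QC flags: 1 = good, 2 = probably good


@dataclass
class Level:
    """One depth level of a profile (hypothetical record layout)."""
    pressure_dbar: float
    temperature_c: float
    salinity_psu: float
    qc_flag: str  # assumption: one combined flag per level


def clean_profile(levels: List[Level]) -> List[Level]:
    """Drop levels with fill values or non-good QC, then sort by pressure."""
    kept = [
        lv for lv in levels
        if lv.qc_flag in GOOD_QC_FLAGS
        and FILL_VALUE not in (lv.pressure_dbar, lv.temperature_c, lv.salinity_psu)
    ]
    return sorted(kept, key=lambda lv: lv.pressure_dbar)


# Made-up sample values for illustration only.
raw = [
    Level(100.0, 18.2, 35.1, "1"),
    Level(5.0, 28.4, 34.6, "2"),
    Level(50.0, FILL_VALUE, 34.9, "1"),  # missing temperature
    Level(200.0, 13.0, 35.0, "4"),       # flagged bad
]
cleaned = clean_profile(raw)
print([lv.pressure_dbar for lv in cleaned])
```

Only the two levels with usable values and good flags survive, ordered from shallow to deep.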
Built with Streamlit, featuring geographic float maps powered by Folium, time-series analysis, and vertical profile visualizations. Dashboards update dynamically based on user queries and selections.

System architecture: data ingestion, RAG pipeline, and visualization layer
FastAPI powers the backend with RESTful endpoints for data retrieval, query processing, and orchestration between the LLM, vector database, and visualization components. The API handles real-time similarity search and response generation with sub-second latency.
Ask questions like "What were the average temperature values in the Central Indian Basin for 2016?" and get instant, accurate answers grounded in actual data sources.
Visualize float locations geographically with fullscreen toggle. Select floats directly from the map interface for detailed analysis and metadata.
Dynamic visualizations for temperature, salinity, and pressure profiles. Time-series analysis with customizable date ranges and depth filters.
RAG architecture ensures responses are grounded in actual ARGO measurements with proper citations and supporting evidence from the database.
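As an illustration of how a question like the Central Indian Basin example could be resolved against structured data, the sketch below computes a regional yearly mean and returns the float IDs that contributed to it, so the answer can cite its sources. Everything here is an assumption for demonstration: `sqlite3` stands in for PostgreSQL so the example runs on its own, the `measurements` table and its columns are hypothetical, the bounding box is illustrative rather than an authoritative basin boundary, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL store
conn.execute(
    """CREATE TABLE measurements (
           float_id TEXT, obs_date TEXT, lat REAL, lon REAL,
           pressure_dbar REAL, temperature_c REAL)"""
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("F001", "2016-03-10", -10.0, 80.0, 10.0, 28.0),
        ("F001", "2016-09-02", -12.0, 82.0, 10.0, 27.0),
        ("F002", "2015-06-01", -11.0, 81.0, 10.0, 29.5),  # wrong year
        ("F003", "2016-05-20", 15.0, 65.0, 10.0, 26.0),   # outside the box
    ],
)

# Illustrative bounding box only; not an authoritative basin boundary.
REGION = {"lat_min": -20.0, "lat_max": 0.0, "lon_min": 75.0, "lon_max": 90.0}


def average_temperature(conn, region, year):
    """Return (mean temperature, contributing float IDs) for a region and year."""
    where = (
        "lat BETWEEN :lat_min AND :lat_max "
        "AND lon BETWEEN :lon_min AND :lon_max "
        "AND obs_date >= :start AND obs_date < :end"
    )
    params = dict(region, start=f"{year}-01-01", end=f"{year + 1}-01-01")
    mean = conn.execute(
        f"SELECT AVG(temperature_c) FROM measurements WHERE {where}", params
    ).fetchone()[0]
    ids = [
        row[0]
        for row in conn.execute(
            f"SELECT DISTINCT float_id FROM measurements WHERE {where} "
            "ORDER BY float_id",
            params,
        )
    ]
    return mean, ids


mean_temp, float_ids = average_temperature(conn, REGION, 2016)
print(mean_temp, float_ids)
```

Returning the contributing IDs alongside the number is the design choice that matters here: the language model can then quote a value it did not have to invent and name the floats behind it.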
Challenge: ARGO repositories contain millions of measurements spanning decades, requiring efficient ingestion, indexing, and retrieval.
Solution: Implemented incremental data loading with ChromaDB vector embeddings for semantic search and PostgreSQL for structured queries, achieving sub-second response times even with large datasets.
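The incremental loading idea can be sketched as follows. This is a simplified illustration, not the project's loader: the function names, the in-memory `manifest` dictionary, and the file paths are hypothetical, and it assumes each repository index entry carries an update stamp in a sortable form such as YYYYMMDDHHMMSS. In a real deployment the manifest would presumably live in the database.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# (path, last-updated stamp) pairs, as they might be read from a repository index.
IndexEntry = Tuple[str, str]


def select_new_or_updated(
    index: Iterable[IndexEntry], manifest: Dict[str, str]
) -> List[str]:
    """Return paths that are absent from the manifest or have a newer stamp."""
    return [path for path, stamp in index if manifest.get(path, "") < stamp]


def ingest_incrementally(
    index: Iterable[IndexEntry],
    manifest: Dict[str, str],
    load_file: Callable[[str], None],
) -> int:
    """Load only changed files and record their stamps; return how many were loaded."""
    stamps = dict(index)
    todo = select_new_or_updated(stamps.items(), manifest)
    for path in todo:
        load_file(path)                # hypothetical: parse, embed, and store one file
        manifest[path] = stamps[path]  # remember what has been ingested
    return len(todo)


# Hypothetical paths and stamps for illustration.
manifest = {"profiles/a.nc": "20240101000000"}
index = [
    ("profiles/a.nc", "20240101000000"),  # unchanged, skipped
    ("profiles/b.nc", "20240105000000"),  # new, loaded
]
loaded: List[str] = []
first_run = ingest_incrementally(index, manifest, loaded.append)
second_run = ingest_incrementally(index, manifest, loaded.append)
print(first_run, second_run, loaded)
```

The second run loads nothing because the manifest already records both files, which is what keeps repeated ingestion cheap as the archive grows.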
Challenge: Ensuring LLM responses are factually accurate and grounded in actual oceanographic data rather than hallucinated.
Solution: RAG architecture retrieves relevant data before generation. Implemented similarity score thresholds and citation mechanisms that link responses directly to source measurements.
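A minimal sketch of the thresholding and citation idea, under stated assumptions: the vectors are toy two-dimensional embeddings rather than real model output, the 0.75 cutoff is an arbitrary illustrative value, and `retrieve` and `build_prompt` are hypothetical names. In FloatChat the similarity search itself is handled by ChromaDB; plain Python is used here only to keep the example self-contained.

```python
import math
from typing import List, Optional, Sequence, Tuple

Document = Tuple[str, str, Sequence[float]]  # (source_id, text, embedding)
Hit = Tuple[float, str, str]                 # (score, source_id, text)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    documents: List[Document],
    threshold: float = 0.75,  # illustrative cutoff, not a tuned value
    top_k: int = 3,
) -> List[Hit]:
    """Keep only documents scoring at or above the threshold, best first."""
    scored = [(cosine_similarity(query_vec, emb), sid, text)
              for sid, text, emb in documents]
    kept = [hit for hit in scored if hit[0] >= threshold]
    kept.sort(key=lambda hit: hit[0], reverse=True)
    return kept[:top_k]


def build_prompt(question: str, hits: List[Hit]) -> Optional[str]:
    """Build a prompt whose context lines carry source IDs the model can cite."""
    if not hits:
        return None  # caller should report that no supporting data was found
    context = "\n".join(f"[{sid}] {text}" for _, sid, text in hits)
    return (
        "Answer using only the sources below and cite them by ID.\n"
        f"{context}\n\nQuestion: {question}"
    )


# Toy documents with made-up text and embeddings.
docs: List[Document] = [
    ("float-A", "Surface temperature summary for float A.", [1.0, 0.0]),
    ("float-B", "Salinity summary for float B.", [0.8, 0.6]),
    ("float-C", "Unrelated deployment note.", [0.0, 1.0]),
]
hits = retrieve([1.0, 0.0], docs)
prompt = build_prompt("What was the surface temperature?", hits)
print(prompt)
```

Returning `None` when nothing clears the threshold lets the application decline to answer instead of passing an empty context to the model.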
Challenge: Rendering thousands of geographic points and time-series data without performance degradation or UI lag.
Solution: Optimized Folium map rendering with clustering algorithms and lazy loading. Time-series charts use intelligent data aggregation for datasets exceeding display resolution.
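One simple way to aggregate a time series that exceeds display resolution is bucket averaging, sketched below. This is an assumed technique chosen for illustration, not necessarily the aggregation FloatChat uses, and `downsample_mean` is a hypothetical name. The series is split into at most `max_points` consecutive buckets, and each bucket is replaced by the mean of its times and values.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time, value)


def downsample_mean(points: List[Point], max_points: int) -> List[Point]:
    """Average consecutive points into at most max_points buckets."""
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    n = len(points)
    if n <= max_points:
        return list(points)  # already within display resolution
    reduced: List[Point] = []
    for i in range(max_points):
        # Integer arithmetic spreads the points evenly; since n > max_points,
        # every bucket contains at least one point.
        start = i * n // max_points
        end = (i + 1) * n // max_points
        bucket = points[start:end]
        mean_time = sum(p[0] for p in bucket) / len(bucket)
        mean_value = sum(p[1] for p in bucket) / len(bucket)
        reduced.append((mean_time, mean_value))
    return reduced


# Ten synthetic points reduced to five buckets of two points each.
series = [(float(i), float(i)) for i in range(10)]
reduced = downsample_mean(series, 5)
print(reduced)
```

Averaging smooths out short spikes, so a chart that must preserve extremes might keep each bucket's minimum and maximum instead.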
FloatChat demonstrates how AI can bridge the gap between complex scientific data and intuitive user interfaces:
• RAG architecture is essential for domain-specific AI applications where accuracy and source attribution are critical
• User experience matters in scientific tools: conversational interfaces can dramatically lower barriers to data access
• Combining multiple visualization types (maps, time-series, profiles) provides comprehensive insights that single-view tools cannot
• Performance optimization at both database and frontend levels is crucial when working with large scientific datasets