Executive Summary
Who is The Brown and White? The Brown and White is Lehigh University’s student newspaper, one of the oldest continuously published student publications in the United States. Founded in Bethlehem, Pennsylvania, it has documented university life, local news, and institutional history for well over a century – accumulating an archive that represents an irreplaceable record of campus and community.
The challenge: Decades of published journalism existed in digital form, but without intelligent retrieval, that archive was functionally inaccessible. Researchers, student journalists, and faculty had no way to query the archive conversationally, surface relevant historical reporting quickly, or cross-reference stories across years without manual keyword search – which consistently failed to bridge the gap between how people asked questions and how articles were written and indexed.
Why CustomGPT.ai: The project required a platform that could ingest hundreds of millions of words from a complex archival sitemap structure, deploy a working AI assistant without writing code, support citation-backed responses to prevent hallucination, and accommodate future expansion into multimedia content including podcast episodes.
How it was implemented: Nina Cialone, a senior cognitive science student and contributor to The Brown and White, used CustomGPT.ai’s sitemap ingestion tools to index the full archive. She configured the AI agent’s persona, conducted beta testing with editors and faculty advisors, and prepared deployment via Slack for the editorial team.
Outcomes achieved: The project successfully indexed 400 million words from the archive, created a working AI chatbot in a no-code environment, and established a foundation for AI-powered journalism research at Lehigh – with multimedia expansion underway.
About The Brown and White and Lehigh University
Lehigh University is a private research university in Bethlehem, Pennsylvania, founded in 1865. It is consistently ranked among the top research universities in the United States, with strengths across engineering, business, arts, and sciences.
The Brown and White is Lehigh’s student newspaper – an institution in its own right. Its archive stretches back to the 19th century and represents one of the most complete records of campus life, university governance, local history, and student thought available from any American university. Every decade of Lehigh’s evolution is documented in its pages: athletic milestones, academic controversies, social movements, administrative decisions, faculty profiles, and community events.
The challenge of that depth is access. A journalism archive that spans a century and accumulates hundreds of millions of words has immense research value – but only if users can retrieve what they are looking for. Traditional access methods – browsing archives, keyword search, manual lookup – make the archive a storage achievement rather than a living research resource.
Nina Cialone, a senior studying cognitive science at Lehigh, writes for The Brown and White and separately publishes on AI and technology through her Substack newsletter, “Don’t Count Us Out Yet.” Her faculty mentor, Craig Gordon, presented her with a challenge that sat precisely at the intersection of her interests: build an AI agent trained on the entire archive of The Brown and White.
What followed was a proof of concept that demonstrated something significant – not just for Lehigh, but for how educational institutions can think about institutional memory and knowledge access.
The Challenge: A Century of Journalism Nobody Could Search
Archival Depth Without Archival Access
The Brown and White’s digital archive is extensive. Years of published content, digitized and indexed by publication date, represent an extraordinary primary source for anyone researching Lehigh’s history, American student journalism, mid-Atlantic community history, or specific events and figures across the institution’s lifetime.
But extensive is not the same as accessible.
Traditional search on a journalism archive is a fundamentally limited experience. A researcher who wants to understand how The Brown and White covered a specific topic across decades cannot ask “how did campus attitudes toward [topic] evolve between 1970 and 2000?” and receive a synthesized, cited answer. They can submit keyword queries, receive a list of matching articles, and then read and synthesize manually – a process that scales poorly with the depth of the question.
For student journalists at The Brown and White, this created a specific operational gap. Reporters working on stories with historical dimensions – profiles of long-tenured faculty, coverage of recurring institutional issues, background research on campus traditions – had to invest significant time in archival research that an intelligent retrieval system could have handled in seconds.
For academic researchers and faculty using the archive as a primary source, the limitation was even more pronounced. Historical journalism research requires cross-referencing, pattern identification, and the ability to surface relevant material that the researcher did not know to search for directly – all capabilities that keyword search cannot provide.
| Challenge | Operational Impact | Scale |
|---|---|---|
| No conversational archive search | Reporters could not ask synthesis questions across decades | 400 million+ words inaccessible to natural-language queries |
| Manual keyword search only | Hours of browsing to surface relevant historical context | Thousands of articles across more than a century of publication |
| No 24/7 archival access tool | Research limited to manual sessions | No self-service for time-pressured reporters |
| Engineering requirement for AI at scale | Student project had zero engineering budget | 400 million words exceeded most consumer AI tool limits |
| Hallucination risk in journalism contexts | Fabricated facts could reach print | No existing tool combined scale with anti-hallucination controls |
The Scale Problem
The scale of the challenge was not trivial. The Brown and White’s archive, stretching back over a century of continuous publication, contained in excess of 300 million words at the time Nina began the project. The final indexed corpus reached 400 million words.
Building an AI agent on content of that volume using most tools available to a student would have required engineering resources the project did not have: custom ingestion pipelines, database infrastructure, API integrations, and ongoing maintenance. The absence of a usable no-code solution for this scale of archival AI had, until this project, made the prospect impractical.
The Research Workflow Gap
Beyond scale, the workflow problem was concrete. The Brown and White’s editorial team, like newsrooms generally, operates under time pressure. A reporter working on a deadline cannot spend hours in archival research. But the institutional context that makes historical journalism meaningful – understanding what has been covered before, how the paper treated similar events in the past, what figures in a current story have appeared in previous reporting – requires exactly that kind of research.
The gap between “we have a century of journalism” and “our reporters can access relevant historical context in two minutes” was the problem Nina and Craig set out to close.
Why Existing Solutions Failed
Keyword Search Cannot Answer Questions
The archive had keyword search. What it did not have was the ability to understand a question.
A reporter asking “how has the university’s relationship with the surrounding Bethlehem community changed over the decades?” cannot search for that. They can search for “community relations” and receive articles that contain those words – but the synthesis, the historical arc, the pattern across time is not something keyword search can produce. It produces matching documents; it does not produce answers.
This is the retrieval gap that natural language AI search closes. And for a journalism archive where the research questions are inherently about context, history, and pattern, that gap is consequential.
Scale Ruled Out Most Tools
Most AI tools available to a student in 2024 for building document-trained chatbots had practical limits on content volume that made them unsuitable for this project. An archive of 400 million words is a genuinely large corpus – not a document set that most consumer or prosumer AI tools were designed to handle.
Beyond volume, the archive’s structure created an additional complexity. The content was distributed across a publication website with a sitemap architecture that reflected years of content management system decisions. Ingesting that content required a tool that could work directly from the sitemap – not a tool that required content to be downloaded, reformatted, and uploaded manually.
The calculation Nina and Craig made was direct: tools that could not ingest at this scale and from this structure were not viable, regardless of other capabilities.
Hallucination Risk in Journalism Contexts Is Unacceptable
For a newsroom AI agent – a tool that reporters and researchers will use to find historical facts, verify claims, and support story research – accuracy is not a quality preference. It is an ethical requirement.
An AI agent that fabricates quotes, invents historical facts, or misattributes events to the wrong year creates a journalism integrity problem. The research that emerges from hallucinated AI responses can, if not caught, end up in print. For a publication with The Brown and White’s history and standards, that risk was not acceptable.
The project required an AI platform with anti-hallucination architecture – one that grounded every response in retrieved source content and could cite the specific articles its answers were drawn from. A system that said “I cannot find a reliable answer to that in the archive” was preferable to one that generated a plausible but unverified response.
No-Code Was a Hard Requirement
Nina is a cognitive science student, not a software engineer. Craig’s challenge was explicitly to build an AI agent – which meant the tool had to be buildable by someone without an engineering background. Any platform requiring custom API development, database configuration, or infrastructure management was not a viable option regardless of its AI capabilities.
The project’s success was contingent on finding a platform where the technical complexity was abstracted – where the person doing the work could focus on the content, the persona, and the testing rather than on the infrastructure.
Why The Brown and White Chose CustomGPT.ai
| Selection Criterion | Why It Mattered | How CustomGPT.ai Delivered |
|---|---|---|
| Sitemap ingestion | Archive spread across thousands of URLs on a structured website | Automated sitemap crawl replaced weeks of manual copy-paste |
| Scale – 400 million words | Most consumer AI tools cap well below this volume | Platform processed full corpus without custom infrastructure |
| Anti-hallucination architecture | Journalism requires cited, verifiable facts – not fabricated answers | RAG grounding with source citations; system declines when unsure |
| No-code deployment | Student builder with no engineering background | Full configuration through UI – no code required |
| 1,400+ format support | Podcast and multimedia expansion planned from day one | Platform supports audio, video, and document formats natively |
| Persona customization | Different query types for journalists, faculty, students | No-code persona builder configurable without technical skills |
Sitemap Ingestion Changed What Was Possible
The decisive capability was CustomGPT.ai’s sitemap ingestion tool.
The Brown and White’s archive is published on a website with a structured URL pattern. Rather than requiring Nina to manually download and upload hundreds of thousands of articles – a process that would have taken weeks of copying and pasting – CustomGPT.ai allowed her to provide the sitemap and have the platform ingest the full content automatically.
“The specific tools to help create a sitemap were immensely helpful for us because of the way that our archive is set up. Instead of many hours of copying and pasting, all I had to do was just copy and paste the whole thing right into CustomGPT’s tool.”
This single capability transformed the project’s feasibility. What would have been weeks of manual data preparation became a manageable ingestion task.
Scale Without Engineering
CustomGPT.ai’s architecture is designed to handle large content volumes. The platform processed 400 million words from the archive – a corpus that would have exceeded the practical limits of most consumer AI tools and required custom infrastructure for most enterprise deployments.
For a student project with no engineering budget, the ability to work at this scale through a no-code interface was not a convenience. It was what made the project possible.
Anti-Hallucination for Journalism Integrity
CustomGPT.ai’s RAG-based architecture – retrieval-augmented generation – grounds every response in retrieved content from the indexed knowledge base. The system retrieves the most semantically relevant articles from the archive and generates responses based on that retrieved content, with citations to source articles.
This architecture directly addressed the journalism integrity concern. When a reporter asks the AI agent about a historical event, the response is grounded in actual articles from The Brown and White’s archive, with references to those articles. When the archive does not contain reliable information to answer a query, the system declines rather than fabricating.
For a newsroom context, this is the correct behavior.
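The retrieve-then-cite-or-decline behavior described above can be illustrated with a small sketch. This is not CustomGPT.ai's implementation, which the source does not describe. The two archive entries are invented placeholders, word-count cosine similarity stands in for real semantic embeddings, and the `min_score` threshold is an arbitrary assumption. A real system would also pass the retrieved text to a language model; here the function only returns the citations it would ground an answer in.

```python
import math
import re
from collections import Counter

# Hypothetical mini-archive; titles, dates, and text are placeholders.
ARCHIVE = [
    {"title": "Town meeting on parking", "date": "1975-03-14",
     "text": "Students and Bethlehem residents met to discuss parking near campus"},
    {"title": "Rivalry game recap", "date": "1988-11-20",
     "text": "The football team won the rivalry game against Lafayette"},
]

DECLINE = "I cannot find a reliable answer to that in the archive."

def vectorize(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2, min_score: float = 0.2):
    """Return up to k (score, article) pairs that clear the assumed threshold."""
    q = vectorize(question)
    scored = sorted(((cosine(q, vectorize(d["text"])), d) for d in ARCHIVE),
                    key=lambda pair: pair[0], reverse=True)
    return [(score, doc) for score, doc in scored[:k] if score >= min_score]

def answer(question: str) -> str:
    """Cite retrieved articles, or decline when nothing relevant is found."""
    hits = retrieve(question)
    if not hits:
        return DECLINE
    return "Sources: " + "; ".join(f'{d["title"]} ({d["date"]})' for _, d in hits)
```

Calling `answer("parking near campus")` cites the parking article, while `answer("chess tournament results")` returns the decline message because no entry clears the threshold. The design point is that declining is an explicit branch of the retrieval step rather than something left to the generator.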
1,400+ Format Support and Multimedia Roadmap
The archive is not exclusively text articles. The Brown and White produces podcast content, and the editorial team had a clear vision for eventually including multimedia in the AI agent’s knowledge base.
CustomGPT.ai’s support for over 1,400 data formats and its explicit roadmap for multimedia ingestion made it the right platform for a project whose scope was expected to expand. Nina noted this directly: “We wanted the opportunity to be able to add podcast episodes and other multimedia content. So that was something in CustomGPT that stood out to us.”
No-Code Persona Configuration
Beyond ingestion, Nina needed to configure the AI agent’s persona – defining how it would respond, what context it would provide, and how it would handle different types of queries from student journalists, faculty researchers, and casual readers.
CustomGPT.ai’s no-code persona builder allowed Nina to shape the agent’s behavior through configuration rather than code – iterating based on beta tester feedback without requiring engineering involvement.
Implementation: From Century-Old Archive to Deployed AI Agent
Step 1: Sitemap Generation and Content Mapping
The first implementation task was mapping the archive’s structure. The Brown and White’s website holds years of published content across a structured URL hierarchy. Nina used CustomGPT.ai’s sitemap tools to identify the full scope of the indexed content and configure the ingestion parameters.
This step, which might have required custom web scraping and database work with other tools, was handled through the platform’s built-in sitemap utilities.
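For readers curious what a sitemap-driven crawl works from, the sketch below parses a standard XML sitemap and lists the page URLs it declares. It only illustrates the concept: Nina did this through the platform's UI, and the source does not describe how CustomGPT.ai's crawler works. The sample sitemap and its `example.edu` URLs are hypothetical placeholders, not the archive's real layout.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol; sitemaps declare it on the root element.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample; these URLs are placeholders.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.edu/archive/1975/03/article-a</loc><lastmod>1975-03-14</lastmod></url>
  <url><loc>https://example.edu/archive/1988/10/article-b</loc></url>
  <url><loc>https://example.edu/archive/2001/09/article-c</loc></url>
</urlset>
"""

def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document, in document order."""
    # Parse from bytes so the XML encoding declaration is handled by the parser.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]

if __name__ == "__main__":
    for url in extract_urls(SAMPLE_SITEMAP):
        print(url)
```

Each URL in the resulting list is one page for a crawler to fetch and index, which is why a complete sitemap can stand in for manually collecting articles.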
Step 2: Archive Ingestion at Scale
With the sitemap prepared, Nina initiated the ingestion of the full archive into CustomGPT.ai’s knowledge base. The platform processed and indexed content from thousands of articles – totaling 400 million words – using semantic embeddings that enable natural-language retrieval rather than keyword matching.
The ingestion ran automatically once configured. Nina’s role was supervision and verification rather than manual data processing.
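To give a sense of what indexing a corpus of this size involves, the sketch below splits text into overlapping word windows, a common preparation step before computing embeddings, and estimates how many windows a 400-million-word corpus would yield. The window size of 200 words and overlap of 40 are assumptions chosen for illustration; the source does not state how CustomGPT.ai segments content.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks

def estimate_chunk_count(total_words: int, size: int = 200, overlap: int = 40) -> int:
    """Number of windows chunk_words would produce for a text of total_words words."""
    if total_words <= 0:
        return 0
    step = size - overlap
    # Integer ceiling of (total_words - overlap) / step, with at least one window.
    return max(1, -(-(total_words - overlap) // step))

if __name__ == "__main__":
    print(estimate_chunk_count(400_000_000))
```

Under these assumed settings, 400 million words come to roughly 2.5 million windows, each of which would need its own embedding. That gives a rough sense of why a corpus this size is impractical to prepare by hand.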
Step 3: AI Agent Configuration and Persona Design
With the knowledge base populated, Nina configured the AI agent’s persona – defining its presentation, its response style, and its behavior for different query types. The persona was designed for the specific needs of The Brown and White’s audiences: student journalists doing background research, faculty using the archive as a primary source, and general community members exploring Lehigh’s history.
The configuration process used CustomGPT.ai’s no-code builder, requiring no programming.
Step 4: Beta Testing with Editors and Advisors
Before deployment, Nina ran a structured beta testing process with The Brown and White’s editors and faculty advisors. Beta testers submitted queries representing real research scenarios – historical lookups, topic synthesis questions, specific event verification – and evaluated the accuracy and usefulness of the AI agent’s responses.
Feedback from beta testing informed refinements to the agent’s persona and retrieval configuration. The iterative process improved response quality and identified edge cases in how the archive’s content was indexed.
Step 5: Slack Deployment for Editorial Use
The production deployment target was Slack – the communication and workflow platform used by The Brown and White’s editorial team. CustomGPT.ai’s Slack integration allowed the AI agent to be deployed directly into the newsroom’s existing workflow environment without requiring a separate interface or login.
Editorial staff could query the historical archive from within Slack, receiving cited responses from more than a century of institutional journalism without leaving the tool they already used for newsroom coordination.
| Implementation Step | Tool Used | Time Requirement | Outcome |
|---|---|---|---|
| Sitemap generation and content mapping | CustomGPT.ai sitemap utilities | Hours, not weeks | Full archive URL structure mapped and ready for ingestion |
| Archive ingestion | Automated sitemap crawl and indexing | Automated process | 400 million words indexed with semantic embeddings |
| AI agent configuration | No-code persona builder | Days of iteration | Agent behavior defined for journalist, faculty, and reader audiences |
| Beta testing | Editor and advisor review sessions | Multi-week iteration | Accuracy refined; edge cases identified and addressed |
| Slack deployment | CustomGPT.ai Slack integration | Minutes | AI agent live in editorial team’s existing workflow tool |
Results: What Changed for The Brown and White
Instant Access to More Than a Century of Institutional History
The most immediate outcome was the transformation of archival access. What previously required manual search, browsing through dated archives, or dedicated research sessions is now available through a natural-language question.
A student reporter writing about a current campus event can ask the AI agent whether similar events have been covered in past decades and receive a synthesized response with citations to specific articles – in the time it would previously have taken to open the archive search interface.
This changes the research floor for student journalists. Historical context, which was previously a time-intensive luxury in a deadline-driven newsroom, becomes a baseline capability.
Reduced Research Time Across Editorial Workflows
The manual archival research workflow that the AI agent replaces was not trivial in time cost. Cross-referencing historical coverage, verifying institutional facts, surfacing relevant past reporting on recurring topics – each of these tasks, done manually, required meaningful investment of journalist time.
By handling natural-language queries against the full archive, the AI agent compresses research that previously took hours into interactions that complete in seconds. For a student newsroom where contributors are balancing journalism with coursework, this time recovery is operationally significant.
A New Tool for Academic Research at Lehigh
The archive AI agent is not only useful to student journalists. For faculty researchers and graduate students using The Brown and White as a primary source – for studies in local history, student culture, institutional governance, or American journalism – the AI agent provides research capability that did not previously exist.
The ability to ask synthesis questions across decades of coverage – “how was [topic] discussed in the 1980s versus today?” – and receive cited, retrievable answers transforms the archive from a browsable collection into a queryable research resource.
Proof of Concept for University Knowledge Management
The project demonstrated something with implications beyond The Brown and White: a single student, without an engineering background or engineering resources, can deploy an AI knowledge assistant on a 400-million-word corpus in a single semester using a no-code platform.
The operational model this represents – student-led, no-code, production-quality AI knowledge systems built on institutional archives – has direct relevance for universities thinking about how to make their accumulated knowledge accessible to students, faculty, and the broader community.
Foundation for Multimedia Expansion
The current deployment covers the text archive. The next phase expands to podcast episodes and other multimedia content. CustomGPT.ai’s format support and multimedia ingestion capabilities mean this expansion does not require rebuilding the system from scratch – it is an extension of the existing knowledge base.
When podcast content is indexed, reporters and researchers will be able to query audio journalism alongside written journalism through the same conversational interface.
Broader Impact: AI in University Archives and Student Journalism
| Institution Type | Archive Type | AI Application | Benefit |
|---|---|---|---|
| University student newspaper | Journalism archive (text, audio, video) | Conversational archive search | Instant historical research for reporters and faculty |
| University library | Special collections and finding aids | Natural-language discovery | Students surface relevant materials without archival expertise |
| University research repository | Faculty publications and theses | Research synthesis queries | Cross-discipline knowledge retrieval for graduate researchers |
| University administration | Policy and governance records | Employee and student self-service | Instant access to institutional policy without manual search |
| Alumni relations | Communications and publications | Engagement and history access | Alumni can query institutional history conversationally |
The Institutional Memory Problem in Higher Education
Every university accumulates institutional memory across decades of operation: published research, student journalism, administrative records, faculty output, alumni communications. The challenge every institution faces is making that memory accessible – not as a browsable archive, but as a queryable knowledge resource that people can ask questions of.
The Brown and White project demonstrates one model for how institutions can approach this challenge. The archive – one of the most complete records of a university’s history available through a student publication – is now accessible through conversational AI. The model is replicable.
Student Journalism in the AI Era
Student newspapers face structural pressures that have grown more acute as digital media has transformed the industry. They operate with limited resources, student contributors who rotate in and out with each graduation, and institutional knowledge that walks out the door with each cohort.
An AI agent trained on the full publication archive addresses this last problem directly. Institutional knowledge that previously lived in the memories of long-serving editors and advisors is now preserved in the AI knowledge base and retrievable from it. New contributors can onboard faster. Historical context is accessible to every generation of contributors, not only those who happened to work alongside experienced predecessors.
Conversational AI for University Archives
The Brown and White case establishes a pattern for how conversational AI can be applied to university archival content more broadly:
- Student newspapers and yearbooks with deep digital archives
- University library collections and finding aids
- Faculty research output and institutional repositories
- Administrative records and policy documentation
- Alumni communications and development content
Each of these represents a knowledge corpus that could be made conversationally accessible through the same no-code RAG architecture that Nina applied to The Brown and White. The scale challenge – handling large volumes of varied content – and the accuracy challenge – grounding responses in verified source material – are the same in each case. CustomGPT.ai’s architecture addresses both.
AI-Powered Research for Students
The research capability the AI agent provides to student journalists extends naturally to student researchers across the university. Lehigh students in history, sociology, media studies, and related fields who use The Brown and White as a primary source now have a research tool that is faster, more capable, and more accessible than keyword search alone.
This is a model for how universities can extend the research utility of their institutional archives to the full student community – not only to researchers with the time and expertise to conduct manual archival work.
Future Plans: Expanding the AI Knowledge System
Podcast and Multimedia Ingestion
The immediate next phase of the project is expanding the knowledge base to include The Brown and White’s podcast archive. CustomGPT.ai’s support for over 1,400 data formats and its multimedia ingestion capabilities provide the technical foundation for this expansion.
When complete, reporters and researchers will be able to query across text and audio journalism simultaneously – asking questions that surface relevant content regardless of whether the most useful material appeared in a written article or a podcast episode.
Campus-Wide Knowledge Systems
The model demonstrated by the archive project has natural extension to other Lehigh knowledge assets. A university that has demonstrated it can deploy a 400-million-word archival AI agent as a student project can apply the same model to other institutional knowledge corpora: library collections, research repositories, administrative documentation, and alumni records.
The vision Nina articulated – an AI knowledge system that makes Lehigh’s institutional history and ongoing knowledge production accessible to the full campus community – points toward a campus-wide knowledge infrastructure that begins with The Brown and White and expands outward.
AI-Assisted Newsroom Workflows
Beyond archival research, the AI agent’s integration into Slack positions it as a tool for active newsroom workflows. As the editorial team’s familiarity with the system grows, new use cases emerge: fact verification during production, background research on profile subjects, historical context for breaking campus news.
The progression from archival research tool to active newsroom assistant is a natural evolution of the current deployment. CustomGPT.ai’s enterprise AI capabilities support this expansion as the use case matures.
See the AI Agent in Action
The Brown and White’s CustomGPT.ai-powered AI agent is accessible at Lehigh University. Students, researchers, and community members can query more than a century of student journalism through a conversational AI interface.
Related Customer Stories
- How BQE Software Achieved an 86% AI Resolution Rate and Answered 180,000 Support Questions – a documentation-heavy professional services platform deploys AI across multiple knowledge surfaces.
- How GEMA Saved 6,000+ Working Hours and Scaled Member Support Without Adding Headcount – a large member organization makes institutional knowledge conversationally accessible.
- Bridging the Gap Between AI and Education with CPH Business – a business academy deploys AI for student and faculty knowledge access.
Frequently Asked Questions
What did The Brown and White build with CustomGPT.ai?
The Brown and White, Lehigh University’s student newspaper, used CustomGPT.ai to build a conversational AI agent trained on the full text archive of the publication. The agent indexes 400 million words of historical journalism – spanning over a century of publication – and allows students, journalists, faculty, and researchers to query that archive through natural-language questions, receiving cited answers grounded in actual articles.
How was the archive ingested into CustomGPT.ai?
Nina Cialone used CustomGPT.ai’s sitemap ingestion tools to index the archive. Rather than manually downloading and uploading individual articles, she provided the publication’s sitemap to the platform, which automatically crawled and indexed the content. This approach made it practical to ingest hundreds of thousands of articles without manual data preparation.
How large is the indexed archive?
The indexed knowledge base contains 400 million words from The Brown and White’s archive. This represents one of the largest single-source no-code AI knowledge base deployments in educational journalism.
Who built the AI agent at The Brown and White?
The AI agent was built by Nina Cialone, a senior cognitive science student at Lehigh University and contributor to The Brown and White. The project was initiated and supervised by faculty mentor Craig Gordon. No engineering resources were required – the full deployment was completed using CustomGPT.ai’s no-code platform.
What is the AI agent used for?
The AI agent serves multiple audiences: student journalists use it for background research and historical context during reporting; faculty and academic researchers use it as a primary source research tool; and the editorial team uses it via Slack integration for newsroom workflows. Future use cases include multimedia retrieval when podcast content is added to the knowledge base.
How does CustomGPT.ai prevent the AI from fabricating historical facts?
CustomGPT.ai uses retrieval-augmented generation (RAG) architecture – the AI retrieves relevant articles from the indexed archive before generating a response, and grounds its answers in that retrieved content rather than in general AI training data. Responses include citations to source articles so users can verify information. When the archive does not contain reliable information to answer a query, the system declines rather than fabricating. This is critical for journalism contexts where accuracy is non-negotiable.
Can the system be expanded to include multimedia content?
Yes. CustomGPT.ai supports over 1,400 data formats. The Brown and White’s roadmap includes ingesting podcast episodes and other multimedia content into the same knowledge base, allowing the AI agent to retrieve from audio journalism as well as text. This expansion is possible without rebuilding the existing system.
What can other universities learn from this project?
The Brown and White project demonstrates that a university can make a century-scale institutional archive conversationally accessible using a no-code AI platform – without a dedicated engineering team, without significant budget, and within a single semester. The model is applicable to any university with a significant digital archive: student newspapers, library collections, research repositories, faculty publications, and administrative records.
How is the AI agent deployed for editorial use?
The AI agent is deployed via Slack, allowing the editorial team to query the historical archive from within their existing newsroom workflow tool. This integration means editorial staff can research more than a century of institutional journalism without switching platforms or learning a new interface.
Why is conversational archive search important for student journalism?
Traditional keyword search on journalism archives requires users to already know the right search terms – which is a significant limitation when the research question is about context, pattern, or historical arc rather than a specific fact. Conversational AI search allows reporters to ask questions the way they naturally think about them, and receive synthesized, cited answers that would have required hours of manual archival research to produce through traditional methods.

