---
author: R. S. Doiel
dateCreated: "2026-04-25"
dateModified: "2026-04-25"
datePublished: "2026-04-25"
description: "This post explores the potential for libraries and archives to serve as natural homes for\nlanguage models and emphasizes the importance of shifting the paradigm of AI development\ntowards public benefit rather than corporate interests. It highlights the need for\nresponsible AI development and considers the role of open-source projects like publicai.co.\nThe author raises questions about the feasibility of developing smaller, domain-specific\nmodels using available public data and argues that libraries and archives should focus on\ncurating and providing resources for such efforts. The post also reflects on the historical   \nevolution of computing and data storage, drawing parallels to potential future developments\nin AI. It encourages collective action and scientific communication to build a more\nequitable and sustainable AI ecosystem.\n"
keywords:
    - Library
    - Archives
    - Language Models
postPath: blog/2026/04/25/a_natural_home_for_language_models.md
title: Libraries and Archives, a Natural Home for Language Models

---


# Libraries and Archives, a Natural Home for Language Models

By R. Doiel, 2026-04-25

Libraries and archives are the natural places to create and curate language models. Their tradition of service and their orientation toward protecting and preserving the public good embody an ethos that aligns with the responsible development of language models (a.k.a. "AI").

Libraries and archives have knowledge management baked in[^1]. These institutions already participate in consortia to coordinate metadata practices and services. They host institutional repositories. However, these institutions and their repositories are currently under assault. They are being strip-mined by corporate AI. Bot swarms that amount to denial-of-service attacks are a daily challenge for libraries and archives with an online presence. We must look beyond the current battle. It’s time to build a future where libraries and archives do not just survive the present but thrive as public repositories of knowledge.

[^1]: Libraries and archives predate computing; they have been managing human knowledge for millennia.

## Shifting the Paradigm

Society needs to shift how we approach the capabilities of language models and what marketing calls AI. We need to move beyond the models currently being built by a few well-financed companies. These companies ultimately make decisions based on profit and exploitation, rather than the public interest.

A recent [Jon Stewart interview, "Worker vs. AI"](https://m.youtube.com/watch?v=RB_WmoH5nQ4), explores how language models could be developed to benefit humanity rather than extract value from it. An article from Brazil, ["Something big is happening in the global scientific community, and Brazil seems to be left out again"](https://blog.scielo.org/blog/2026/04/24/algo-grande-esta-acontecendo-na-ciencia-mundial-e-o-brasil-parece-estar-de-fora-novamente/), raises similar concerns from the perspective of the Global South. These pieces resonate with me, especially in light of the current political challenges in North America, where the United States has tried to abandon its support for science, libraries, and archives as a public good[^2].

[^2]: The phrase "public good" here refers to resources and institutions that are collectively beneficial and accessible to all members of society.

## Questions and Hope

I’m left with many questions. I am left with hope. As problems are articulated, solutions can be found[^3]. The recent creation of publicai.co, <https://publicai.co>, in Switzerland and Singapore is a promising proof of concept, but it’s not the only path we need to explore.

[^3]: Scientists have been organizing to gain public support. See <https://en.wikipedia.org/wiki/March_for_Science> and, more recently, <https://bsky.app/profile/standupforscience.net> and <https://www.standupforscience.net/> as examples.

It doesn’t do us much good if we must rent compute resources from the same companies that prioritize their interests over those of global and local communities. We should explore creating and using models at the edge. It’s likely that we can develop highly effective, small models tailored for well-defined domains. These models wouldn’t require the surveillance and attention-based economic practices of the current commercial AI landscape. It may be possible to develop and run them on small, energy-efficient, affordable computers.

## Tools and Opportunities

Many of the necessary tools and resources are already available. There are large corpora of public domain texts that don’t violate copyright or intellectual property. Open Access content may also be licensed in ways that permit model training. Libraries and archives should make these resources available without risking system overload from the swarms of agentic bots deployed by corporate AI companies.

While much of the public conversation focuses on the negative consequences of corporate AI, I believe the proverbial horse has already left the barn. Money and resources used by the AI companies have already distorted the global economy, negatively impacted real people and harmed the environment. **The challenge is to change the trajectory.** We need to create an alternative approach. It needs to be human-centric, humane, and sustainable, and it must not destroy the planet in the process.

## Personal AI and a Role for Libraries

Doc Searls has written about the concept of a [personal AI](https://doc.searls.com/personal-ai/). I think he’s onto something. Public libraries, research libraries and archives have already played an important role in providing content. Often without their consent. Often at real budgetary costs. They will play an important role moving forward. How they play that role is in their hands.

Public libraries, research libraries and archives have the high-quality datasets needed to build good language models. Libraries and archives could leverage approaches similar to publicai.co. The key will be to find a way to provide the quality datasets that allow individuals and small organizations to create the specialized models they need. The public can bring computing resources to the table too[^4]. Today the corporate bot swarms are forcing libraries and archives either to use expensive corporate services (for example, Cloudflare) or to take their resources off the public Web. This challenge needs to be met to fulfill our missions as public reservoirs of human knowledge[^5].

[^4]: SETI@home, <https://en.wikipedia.org/wiki/SETI@home>, pioneered crowdsourcing for data analysis. A similar approach might be viable here.

[^5]: SaaS/CDN services like Cloudflare are not a sustainable solution over the long haul. The rents will increase and outpace the growth in library budgets.

## Historical Parallels

Looking back at computer history, I see parallels. A journey can be traced from mainframes to personal computing and on to devices like a phone or watch. What was once expensive and beyond reach is now possible for an individual to own. There are parallels in data storage too, from tape, punch cards, paper tape and spinning disks to solid state storage and optical media. In programming, there is a similar journey from patch cables to machine code, assembly and eventually efficiently compiled or interpreted open source languages. This makes me wonder. Why do we accept the current wisdom that massive compute resources are required because we need to build ever larger language models? Is there a path to something smaller and more efficient? For me, this is a thought experiment, but I suspect others are already working on it[^6].

[^6]: A search for "building small language models" returns many results, <https://scholar.google.com/scholar?q=Building+Small++%22Language+Models%22&hl=en&scisbd=1&as_sdt=0,5>

## The Path Forward

Scientific communication will be crucial in making these ideas accessible to the general public. That will be important in soliciting support as government sponsorship continues to dry up. Collectively, we need to decide on and forge a path forward, not leave it to a few wealthy individuals who lead a tiny number of companies. Libraries and archives have a role in collecting and disseminating both the ideas and the digital content.

### Suggested reading

- Doc Searls' concept of [personal AI](https://doc.searls.com/personal-ai/)
- Jon Stewart interview, [Worker vs. AI](https://m.youtube.com/watch?v=RB_WmoH5nQ4)
- Brazilian blog post, [Something big is happening in the global scientific community, and Brazil seems to be left out again](https://blog.scielo.org/blog/2026/04/24/algo-grande-esta-acontecendo-na-ciencia-mundial-e-o-brasil-parece-estar-de-fora-novamente/)
- [Public AI and Apertus](https://publicai.co)

