A New, LLM-Assisted Pipeline to Back My Research

Over the last two days, I have been designing a Python script to extract specific information from Grants.gov, namely Cuba-related funding opportunities published by four federal entities: the now-defunct USAID, the DRL and WHA bureaus at Foggy Bottom, and the U.S. Embassy in Havana. I am particularly interested in those Notices of Funding Opportunity (NOFOs) seeking partners to implement the so-called Cuba democracy program, established in 1996 and historically funded by Congress through the Economic Support Fund account. This year, the House Committee on Appropriations consolidated major foreign assistance accounts into the new National Security Investment Programs structure.
Experience with the Grants.gov web interface had already taught me that the search engine returns many results that merely mention Cuba inside boilerplate language, whose purpose is precisely to prohibit any connection between prospective grantees or contractors and Cuba. A particularly common formulation is the following:
Subject to TVPA for funds obligated during FY 2024:
AF: Chad, Equatorial Guinea, Eritrea, Guinea-Bissau, South Sudan
EAP: Burma, China (PRC), Macau, North Korea
EUR: Belarus, Russia
NEA: Iran, Syria
SCA: Afghanistan
WHA: Cuba, Curacao, Nicaragua, Sint Maarten
This type of language appears repeatedly in attached NOFO documents and creates a lot of noise in broad keyword searches. My first experiments with the Grants.gov API, searching for the keyword "Cuba", returned 592 results. After numerous iterations with different LLMs, specifically through chatbot interfaces such as Claude, Grok, and DeepSeek, I eventually arrived at the following logic:
1. Query the Grants.gov API using the keyword "Cuba";
2. Process the returned opportunities and then search not only for "Cuba" but also for semantic variants such as "Cuban", "Cubans", "Havana", and "the Island";
3. Search for these variants inside relevant fields of the opportunity metadata and, only as a last resort, inside the full documents attached to NOFOs;
4. If any of these variants appears, include the opportunity in a CSV dataset, unless the occurrence can be confidently identified as belonging to one of the recurrent TVPA boilerplate blocks;
5. When the mention appears outside the opportunity title, capture the first sentence containing the match so that I can manually assess whether the opportunity is truly relevant for my research purposes. A match in the title itself is already a strong signal of relevance.
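The boilerplate exclusion in the list above depends on recognizing the TVPA block. The script's actual check is not reproduced in this post; what follows is only a minimal sketch, assuming the boilerplate keeps the line-by-line form quoted earlier (a regional bureau code, a colon, and a country list). The names `BUREAU_LINE_RE`, `CUBA_RE`, `is_tvpa_boilerplate`, and `has_mention_outside_boilerplate` are hypothetical.

```python
import re

# Hypothetical pattern: a regional bureau code, a colon, then a country list,
# e.g. "WHA: Cuba, Curacao, Nicaragua, Sint Maarten".
BUREAU_LINE_RE = re.compile(r"^\s*(AF|EAP|EUR|NEA|SCA|WHA):\s*\S")
CUBA_RE = re.compile(r"\bcuba\b", re.IGNORECASE)


def is_tvpa_boilerplate(line: str) -> bool:
    """Return True if the line looks like a bureau-prefixed country list naming Cuba."""
    return bool(BUREAU_LINE_RE.match(line)) and bool(CUBA_RE.search(line))


def has_mention_outside_boilerplate(text: str) -> bool:
    """Return True if at least one line mentions Cuba outside a TVPA-style list."""
    for line in text.splitlines():
        if CUBA_RE.search(line) and not is_tvpa_boilerplate(line):
            return True
    return False
```

If text extracted from attachments collapses the block into a single line, this line-based assumption no longer holds and the check would need to be adapted.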
The core pattern-matching logic eventually evolved into a structure like this:
import re

# Case-insensitive, whole-word match for "Cuba" and its variants.
VARIANT_RE = re.compile(
    r"\bcuba\b"
    r"|\bcuban\b"
    r"|\bcubans\b"
    r"|\bhavana\b"
    r"|\bthe island\b",
    re.IGNORECASE,
)
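To see what this pattern does and does not catch, here is a small check. The sample strings are invented for illustration and are not real Grants.gov titles. It also exposes a trade-off: "the island" matches any island, so that variant can yield false positives that only the manual review will catch.

```python
import re

# Same pattern as above, repeated so the snippet runs on its own.
VARIANT_RE = re.compile(
    r"\bcuba\b|\bcuban\b|\bcubans\b|\bhavana\b|\bthe island\b",
    re.IGNORECASE,
)

# Invented sample strings, not real Grants.gov titles.
samples = [
    "Supporting Civil Society in CUBA",              # matches regardless of case
    "Scholarships for Cuban entrepreneurs",          # matches the "cuban" variant
    "Incubator grants for small businesses",         # no match: \b blocks "inCUBAtor"
    "Coastal resilience on the island of Dominica",  # false positive on "the island"
]

for text in samples:
    match = VARIANT_RE.search(text)
    print(f"{text!r} -> {match.group(0) if match else None}")
```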
The practical idea is not to discover opportunities through multiple independent searches, but rather to use semantic variants as a validation mechanism after the initial retrieval stage. The inclusion logic was therefore designed as follows:
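def analyze_content(title, description, category_explanation):
    # A match in the title is a strong signal of relevance: include without context.
    if VARIANT_RE.search(title or ""):
        return {"include": True, "observaciones": ""}

    # Otherwise scan the description and category explanation sentence by sentence.
    # Missing fields are treated as empty strings.
    combined = (description or "") + "\n" + (category_explanation or "")

    # split_sentences is a sentence-splitting helper not shown here.
    for s in split_sentences(combined):
        if VARIANT_RE.search(s):
            # Keep the first matching sentence for manual review.
            return {"include": True, "observaciones": s.strip()}

    # No variant found: exclude, keeping the same keys as the other results.
    return {"include": False, "observaciones": ""}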
This structure lets the pipeline keep potentially relevant opportunities while exposing the textual context that triggered inclusion. Because the overwhelming majority of matches turn out to be TVPA-related language, the extracted sentence lets me dismiss most of them at a glance, and relatively few opportunities ultimately require close human inspection.
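The `split_sentences` helper called inside `analyze_content` is not shown in this post. A minimal sketch, assuming a simple punctuation-based split is good enough for grant descriptions (the helper in the actual script may well differ):

```python
import re

# Split after ., ! or ? followed by whitespace, and on line breaks.
# This is a naive heuristic: abbreviations such as "U.S. Embassy" get split too.
_SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+|\n+")


def split_sentences(text: str) -> list[str]:
    """Return the non-empty, stripped sentences found in text."""
    return [part.strip() for part in _SENTENCE_BOUNDARY_RE.split(text) if part.strip()]
```

A dedicated sentence tokenizer would handle abbreviations better, but for surfacing a snippet to a human reviewer an occasional bad split is probably tolerable.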
Several additional features emerged during this process.
First, I designed a cache system storing the inclusion or exclusion decision for each opportunity returned by the initial search:
cache[opportunity_id] = {
    "decision": "include",
    "reason": "keyword_match",
    "timestamp": current_time,
}
This substantially reduces redundant processing and makes repeated executions much more efficient.
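How the cache is persisted between runs is not shown here. One simple option, sketched under the assumption that it lives in a JSON file, looks like this. The helper names `load_cache`, `save_cache`, and `needs_processing` are hypothetical.

```python
import json
from pathlib import Path


def load_cache(path: Path) -> dict:
    """Load cached decisions, or start empty if the file does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_cache(cache: dict, path: Path) -> None:
    """Write the cache back to disk as readable JSON."""
    path.write_text(json.dumps(cache, indent=2, ensure_ascii=False), encoding="utf-8")


def needs_processing(cache: dict, opportunity_id) -> bool:
    """Only opportunities without a stored decision need to be analyzed again."""
    # JSON object keys are always strings, so normalize the ID before the lookup.
    return str(opportunity_id) not in cache
```

With something like this in place, a repeated run only has to analyze opportunities that are new since the previous execution.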
Second, I asked for an HTML visualization layer to make the results easier to inspect. DeepSeek implemented a practical interface that lets me visually mark for exclusion the opportunities I consider irrelevant after manual review.
The exclusion workflow ultimately became:
if args.apply_exclusions:
    cache[opportunity_id]["decision"] = "exclude"
Excluded opportunities can be exported into JSON and later reapplied by running the script with:
python grants_cuba_pipeline.py --apply-exclusions
This effectively transforms human judgment into persistent machine memory.
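The round trip can be sketched as follows. This is an illustration under stated assumptions, not the script itself: the exclusions file format (a JSON list of opportunity IDs), the `--exclusions-file` option, the `manual_exclusion` reason, and the helper names are all hypothetical. Only the `--apply-exclusions` flag comes from the command above.

```python
import argparse
import json
from pathlib import Path


def load_exclusions(path: Path) -> list:
    """Read the exported exclusions; an absent file means nothing to exclude."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else []


def apply_exclusions(cache: dict, excluded_ids: list) -> int:
    """Mark each listed opportunity as excluded; return how many entries changed."""
    changed = 0
    for opportunity_id in map(str, excluded_ids):
        entry = cache.setdefault(opportunity_id, {})
        if entry.get("decision") != "exclude":
            entry["decision"] = "exclude"
            entry["reason"] = "manual_exclusion"
            changed += 1
    return changed


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Cuba-related NOFO pipeline (sketch)")
    # store_true makes the flag False unless it is passed on the command line.
    parser.add_argument("--apply-exclusions", action="store_true")
    # Hypothetical option: where the exported exclusions live.
    parser.add_argument("--exclusions-file", type=Path, default=Path("exclusions.json"))
    return parser.parse_args(argv)
```

Returning the number of changed entries makes it easy to log how many manual decisions were carried over into a given run.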
On LLMs
I first brought the design to Claude, which produced an initial working version. After multiple rounds of adjustment, DeepSeek generated the most refined and operational version of the script. In my case, rather than replacing analytical work, the models increasingly function as collaborators in constructing reproducible research instruments. I am a critical and deeply engaged co-producer more than a client.
Future work
The opportunity identifier itself also plays a broader analytical role. It functions as a linking field with a parallel tracker I maintain for USAspending.gov, which contains the same identifier. This makes it possible to progressively improve the traceability of a process that begins with the annual Congressional Budget Justification of the Department of State and eventually materializes in applications, media ecosystems, and actions taking place either in cyberspace or in physical space, all oriented toward the achievement of specific political objectives. Next, I will turn to the FAC.gov API, which I suspect will be a more complex task.
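As an illustration of the linking idea, here is a minimal sketch of joining the two trackers on the shared identifier. The column names (`opportunity_id`, `award_id`) and the `link_awards` helper are invented for the example; the real trackers may be structured differently.

```python
from collections import defaultdict


def link_awards(opportunities: list[dict], awards: list[dict],
                key: str = "opportunity_id") -> list[dict]:
    """Attach to each opportunity the awards that share its identifier."""
    awards_by_id = defaultdict(list)
    for award in awards:
        awards_by_id[award[key]].append(award)
    # Opportunities with no matching award get an empty list rather than being dropped.
    return [{**opp, "awards": awards_by_id.get(opp[key], [])} for opp in opportunities]
```

Keeping unmatched opportunities in the output is a deliberate choice in this sketch: a NOFO with no traceable award is itself a finding worth recording.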
