
Google has been on an absolute AI hot streak lately, consistently dropping breakthrough after breakthrough. Nearly every recent release has pushed the boundaries of what’s possible — and it’s been genuinely exciting to watch unfold.
One announcement that caught my eye in particular came at the end of July, when Google released a new text processing and data extraction tool called LangExtract.
According to Google, LangExtract is a new open-source Python library designed to …
“programmatically extract the exact information you need, while ensuring the outputs are structured and reliably tied back to its source”
On the face of it, LangExtract has many useful capabilities, including:
- Text anchoring. Each extracted entity is linked to its exact character offsets in the source text, enabling full traceability and visual verification through interactive highlighting.
- Reliable structured output. Define the desired output format with a few examples (few-shot definitions) so that LangExtract returns consistent and reliable results.
- Efficient large-document handling. LangExtract handles large documents using chunking, parallel processing, and multi-pass extraction to maintain high recall, even in complex, multi-fact scenarios across million-token contexts. It should also excel at traditional needle-in-a-haystack type applications.
- Instant extraction review. Easily create a self-contained HTML visualisation of extractions, enabling intuitive review of entities in their original context, all scalable to thousands of annotations.
- Multi-model compatibility. Compatible with both cloud-based models (e.g. Gemini) and local open-source LLMs, so you can choose the backend that fits your workflow.
- Customizable for many use cases. Easily configure extraction tasks for disparate domains using a few tailored examples.
- Augmented knowledge extraction. LangExtract supplements grounded entities with inferred facts using the model’s internal knowledge, with relevance and accuracy driven by prompt quality and model capabilities.
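To make the text anchoring idea concrete, here is a minimal pure-Python sketch of what tying an extraction to character offsets means. The AnchoredSpan class and the anchor function are hypothetical names invented for illustration; they are not part of LangExtract, and the library's own alignment logic is likely more sophisticated than an exact string search.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnchoredSpan:
    """Hypothetical record tying an extracted string to its source offsets."""
    text: str
    start: int  # index of the first character in the source
    end: int    # index one past the last character


def anchor(source: str, extracted: str) -> Optional[AnchoredSpan]:
    """Locate the first exact occurrence of `extracted` in `source`."""
    start = source.find(extracted)
    if start == -1:
        return None  # not grounded: the string does not appear verbatim
    return AnchoredSpan(extracted, start, start + len(extracted))


source = "It is a little-known fact that wood was invented by Elon Musk in 1775."
span = anchor(source, "Elon Musk")

# Slicing the source with the stored offsets gives back the extracted text,
# which is what makes the extraction traceable.
print(span, repr(source[span.start:span.end]))
```

Because the offsets point back into the original text, anything that cannot be anchored (the None case here) can be flagged as inferred rather than quoted.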
One thing that stands out to me when I look at LangExtract’s strengths listed above is that it seems to be able to perform RAG-like operations without the need for traditional RAG processing. So you don’t have to write your own splitting, chunking or embedding operations in your code.
But to get a better idea of what LangExtract can do, we’ll take a closer look at a few of the above capabilities using some coding examples.
Setting up a dev environment
Before we get down to doing some coding, I always like to set up a separate development environment for each of my projects. I use the uv package manager for this, but use whichever tool you’re comfortable with.
PS C:\Users\thoma> uv init langextract
Initialized project `langextract` at `C:\Users\thoma\langextract`
PS C:\Users\thoma> cd langextract
PS C:\Users\thoma\langextract> uv venv
Using CPython 3.13.1
Creating virtual environment at: .venv
Activate with: .venv\Scripts\activate
PS C:\Users\thoma\langextract> .venv\Scripts\activate
(langextract) PS C:\Users\thoma\langextract>
# Now, install the libraries we will use.
(langextract) PS C:\Users\thoma\langextract> uv pip install jupyter langextract beautifulsoup4 requests
Now, to write and test our coding examples, you can start up a Jupyter notebook using this command.
(langextract) PS C:\Users\thoma\langextract> jupyter notebook
You should see a notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after the jupyter notebook command. Near the bottom, you will find a URL to copy and paste into your browser to launch the Jupyter Notebook. Your URL will be different to mine, but it should look something like this:
http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69d
Pre-requisites
As we’re using a Google LLM (gemini-2.5-flash) as our processing engine, you’ll need a Gemini API key. You can get this from Google Cloud. You can also use LLMs from OpenAI, and I’ll show an example of how to do this in a bit.
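Rather than pasting the key into a notebook cell, one option is to read it from an environment variable. This is a minimal sketch, assuming a variable named GEMINI_API_KEY; that name and the load_api_key helper are illustrative choices, not something LangExtract prescribes.

```python
import os


def load_api_key(var_name: str = "GEMINI_API_KEY") -> str:
    """Read an API key from the environment, failing loudly if it is missing.

    The default variable name is an assumption made for this sketch.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

You could then pass api_key=load_api_key() to the extraction call instead of a literal string, which keeps the key out of any notebook you share.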
Code example 1: needle-in-a-haystack
The first thing we need to do is get some input data to work with. You can use any input text file or HTML file for this. For previous experiments using RAG, I used a book I downloaded from Project Gutenberg: the consistently riveting “Diseases of cattle, sheep, goats, and swine” by Jno. A. W. Dollar & G. Moussu.
To summarise the licensing position, the vast majority of Project Gutenberg eBooks are in the public domain in the US and other parts of the world. This means that nobody can grant or withhold permission to do with this item as you please.
“As you please” includes any commercial use, republishing in any format, making derivative works or performances.
This book contained approximately 36,000 lines of text. To avoid large token costs, I cut it down to about 3,000 lines of text. To test LangExtract’s ability to handle needle-in-a-haystack type queries, I added this specific line of text around line 1512.
It is a little-known fact that wood was invented by Elon Musk in 1775
Here it is in context.
1. Fractures of the angle of the haunch, resulting from external
violence and characterised by sinking of the external angle of the
ilium, deformity of the hip, and lameness without specially marked
characters. This fracture is rarely complicated. The symptoms of
lameness diminish with rest, but deformity continues.
It is a little-known fact that wood was invented by Elon Musk in 1775.
=Treatment= is confined to the administration of mucilaginous and diuretic fluids. Tannin has been recommended.
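If you want to build a similar test file yourself, planting the needle can be scripted. The sketch below uses a synthetic 3,000-line haystack rather than the real book, and insert_needle is a hypothetical helper written for this illustration.

```python
def insert_needle(text: str, needle: str, line_number: int) -> str:
    """Return `text` with `needle` inserted as a new line at 1-based `line_number`."""
    lines = text.splitlines()
    # Clamp the position so out-of-range values land at the start or the end.
    index = max(0, min(line_number - 1, len(lines)))
    lines.insert(index, needle)
    return "\n".join(lines)


# Synthetic stand-in for the trimmed book (about 3,000 lines).
haystack = "\n".join(f"line {i}" for i in range(1, 3001))
needle = "It is a little-known fact that wood was invented by Elon Musk in 1775."

planted = insert_needle(haystack, needle, 1512)
print(planted.splitlines()[1511])  # the needle now sits on line 1512
```

To use it on a real file, read the file into a string, call insert_needle, and write the result back out with UTF-8 encoding.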
This code snippet sets up a prompt and an example to guide the LangExtract extraction task. This is essential for few-shot learning with a structured schema.
import langextract as lx
import textwrap

# Define the prompt and a few-shot example for the needle-in-a-haystack query
prompt = textwrap.dedent("""\
    Who invented wood and when""")

# Note that this is a made-up example.
# The following details do not appear anywhere
# in the book.
examples = [
    lx.data.ExampleData(
        text=textwrap.dedent("""\
            John Smith was a prolific scientist.
            His most notable theory was on the evolution of bananas.
            He wrote his seminal paper on it in 1890."""),
        extractions=[
            lx.data.Extraction(
                extraction_class="scientist",
                extraction_text="John Smith",
                # The notable_for keyword is dropped here: Extraction does not
                # appear to accept it, and the same detail is in `attributes`.
                attributes={"year": "1890", "notable_event": "theory of evolution of the banana"}
            )
        ]
    )
]
Now, we run the structured entity extraction. First, we open the file and read its contents into a variable. The heavy lifting is done by the lx.extract call. After that, we just print out the relevant outputs.
with open(r"D:\book\cattle_disease.txt", "r", encoding="utf-8") as f:
    text = f.read()

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    api_key="your_gemini_api_key",
    extraction_passes=3,    # Multiple passes for improved recall
    max_workers=20,         # Parallel processing for speed
    max_char_buffer=1000    # Smaller contexts for better accuracy
)

print(f"Extracted {len(result.extractions)} entities from {len(result.text):,} characters")

for extraction in result.extractions:
    if not extraction.attributes:
        continue  # Skip this extraction entirely
    print("Name:", extraction.extraction_text)
    print("Notable event:", extraction.attributes.get("notable_event"))
    print("Year:", extraction.attributes.get("year"))
    print()
And here are our outputs.
LangExtract: model=gemini-2.5-flash, current=7,086 chars, processed=156,201 chars: [00:43]
✓ Extraction processing complete
✓ Extracted 1 entities (1 unique types)
• Time: 126.68s
• Speed: 1,239 chars/sec
• Chunks: 157
Extracted 1 entities from 156,918 characters
Name: Elon Musk
Notable event: invention of wood
Year: 1775
Not too shabby.
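The chunk count in the log lines up with the max_char_buffer setting: roughly 157,000 characters split into pieces of at most 1,000 characters. The sketch below shows that arithmetic with a naive fixed-width splitter. Treat it as an approximation only; estimate_chunks and split_fixed are hypothetical helpers, and LangExtract's real chunker probably respects sentence or token boundaries, so its count will not always match.

```python
import math


def estimate_chunks(total_chars: int, max_char_buffer: int) -> int:
    """Rough number of chunks if the text were cut into fixed-size pieces."""
    return math.ceil(total_chars / max_char_buffer)


def split_fixed(text: str, max_char_buffer: int) -> list[str]:
    """Naive splitter: consecutive slices of at most `max_char_buffer` characters."""
    return [text[i:i + max_char_buffer] for i in range(0, len(text), max_char_buffer)]


# 156,918 characters with a 1,000-character buffer, as in the run above.
print(estimate_chunks(156_918, 1000))  # 157
```

The same estimate is a quick way to gauge how many model calls a run needs: the number of chunks multiplied by the number of extraction passes.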
Note, if you wanted to use an OpenAI model and API key, your extraction code would look something like this:
...
...
import os
from langextract.inference import OpenAILanguageModel

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    language_model_type=OpenAILanguageModel,
    model_id="gpt-4o",
    api_key=os.environ.get('OPENAI_API_KEY'),  # Read the key from the environment
    fence_output=True,
    use_schema_constraints=False
)
...
...
Code example 2: extraction visual validation
LangExtract provides a visualisation of how it extracted the text. It’s not particularly useful in this example, but it gives you an idea of what is possible.
Just add this little snippet of code to the end of your existing code. This will create an HTML file that you can open in a browser window. From there, you can scroll up and down your input text and “play” back the steps that LangExtract took to get its outputs.
# Save annotated results
lx.io.save_annotated_documents([result], output_name="cattle_disease.jsonl", output_dir="d:/book")

html_obj = lx.visualize("d:/book/cattle_disease.jsonl")
# In a notebook, visualize() returns a display object with a .data attribute;
# elsewhere it may return a plain string, so handle both cases.
html_string = html_obj.data if hasattr(html_obj, "data") else html_obj

# Save to file
with open("d:/book/cattle_disease_visualization.html", "w", encoding="utf-8") as f:
    f.write(html_string)

print("Interactive visualization saved to d:/book/cattle_disease_visualization.html")
Now, go to the directory where your HTML file has been saved and open it in a browser. This is what I see.
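If you are curious how such a highlighted view can be produced from character offsets, here is a rough standalone sketch. It is not LangExtract's own rendering code; the highlight function is a hypothetical helper that simply wraps each span in a mark tag.

```python
import html


def highlight(source: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) character span of `source` in <mark> tags."""
    parts = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            continue  # skip overlapping spans in this simple sketch
        parts.append(html.escape(source[cursor:start]))
        parts.append(f"<mark>{html.escape(source[start:end])}</mark>")
        cursor = end
    parts.append(html.escape(source[cursor:]))
    return "".join(parts)


sentence = "Wood was invented by Elon Musk in 1775."
start = sentence.index("Elon Musk")
print(highlight(sentence, [(start, start + len("Elon Musk"))]))
# Wood was invented by <mark>Elon Musk</mark> in 1775.
```

Escaping the text before wrapping it matters, because source documents often contain characters such as < and & that would otherwise break the page.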

Code example 3: retrieving multiple structured outputs
In this example, we’ll take some unstructured input text, an article from Wikipedia on OpenAI, and try to retrieve the names of all the different large language models mentioned in the article, together with their release dates. The link to the article is:
https://en.wikipedia.org/wiki/OpenAI
Wikipedia’s text is published under a Creative Commons licence, which means you are free:
to Share: copy and redistribute the material in any medium or format
to Adapt: remix, transform, and build upon the material
for any purpose, even commercially.
Our code is pretty similar to our first example. This time, though, we are looking for any mentions in the article of LLMs and their release dates. One other step we have to do is clean up the HTML of the article first to ensure that LangExtract has the best chance of reading it. We use the BeautifulSoup library for this.
import langextract as lx
import textwrap
import requests
from bs4 import BeautifulSoup

# Define the prompt and a few-shot example for extracting model names and dates
prompt = textwrap.dedent("""\
    Your task is to extract the LLM or AI model names and their release date or year from the input text.
    Do not paraphrase or overlap entities.
    """)

examples = [
    lx.data.ExampleData(
        text=textwrap.dedent("""\
            Similar to Mistral's previous open models, Mixtral 8x22B was released via a BitTorrent link on April 10, 2024
            """),
        extractions=[
            lx.data.Extraction(
                extraction_class="model",
                extraction_text="Mixtral 8x22B",
                # The date must match the example text above
                attributes={"date": "April 10, 2024"}
            )
        ]
    )
]

# Step 1: Download and clean the Wikipedia article
url = "https://en.wikipedia.org/wiki/OpenAI"
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed
soup = BeautifulSoup(response.text, "html.parser")

# Get only the visible text
text = soup.get_text(separator="\n", strip=True)

# Optional: remove references, footers, etc.
lines = text.splitlines()
filtered_lines = [line for line in lines if not line.strip().startswith("[") and line.strip()]
clean_text = "\n".join(filtered_lines)

# Step 2: Do the extraction
result = lx.extract(
    text_or_documents=clean_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    api_key="YOUR_API_KEY",
    extraction_passes=3,    # Improves recall through multiple passes
    max_workers=20,         # Parallel processing for speed
    max_char_buffer=1000    # Smaller contexts for better accuracy
)

# Print our outputs
for extraction in result.extractions:
    if not extraction.attributes:
        continue  # Skip this extraction entirely
    print("Model:", extraction.extraction_text)
    print("Release Date:", extraction.attributes.get("date"))
    print()
This is a cut-down sample of the output I got.
Model: ChatGPT
Release Date: 2020
Model: DALL-E
Release Date: 2020
Model: Sora
Release Date: 2024
Model: ChatGPT
Release Date: November 2022
Model: GPT-2
Release Date: February 2019
Model: GPT-3
Release Date: 2020
Model: DALL-E
Release Date: 2021
Model: ChatGPT
Release Date: December 2022
Model: GPT-4
Release Date: March 14, 2023
Model: Microsoft Copilot
Release Date: September 21, 2023
Model: MS-Copilot
Release Date: December 2023
Model: Microsoft Copilot app
Release Date: December 2023
Model: GPTs
Release Date: November 6, 2023
Model: Sora (text-to-video model)
Release Date: February 2024
Model: o1
Release Date: September 2024
Model: Sora
Release Date: December 2024
Model: DeepSeek-R1
Release Date: January 20, 2025
Model: Operator
Release Date: January 23, 2025
Model: deep research agent
Release Date: February 2, 2025
Model: GPT-2
Release Date: 2019
Model: Whisper
Release Date: 2021
Model: ChatGPT
Release Date: June 2025
...
...
...
Model: ChatGPT Pro
Release Date: December 5, 2024
Model: ChatGPT's agent
Release Date: February 3, 2025
Model: GPT-4.5
Release Date: February 20, 2025
Model: GPT-5
Release Date: February 20, 2025
Model: Chat GPT
Release Date: November 22, 2023
Let’s double-check a couple of these. One of the outputs from our code was this.
Model: Operator
Release Date: January 23, 2025
And from the Wikipedia article …
“On January 23, OpenAI released Operator, an AI agent and web automation tool for accessing websites to execute goals defined by users. The feature was only available to Pro users in the United States.[113][114]”
So on that occasion, it might have hallucinated the year as being 2025, since no year was given in that sentence. Remember, though, that LangExtract can use its internal knowledge of the world to supplement its outputs, and it may have got the year from that or from other context surrounding the extracted entity. In any case, I think it would be pretty easy to tweak the input prompt or post-process the output to ignore model release date information where the source did not include a year.
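One way to do that kind of post-processing is to check whether the year in an extracted date actually appears in the sentence it came from. This is only a sketch under simple assumptions: year_is_grounded is a hypothetical helper, years are matched with a plain regular expression, and the GPT-4 sentence below is made up for the demonstration.

```python
import re

# Matches four-digit years from 1900 to 2099.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def year_is_grounded(date_text, context: str) -> bool:
    """True if `date_text` contains a year and every such year also appears in `context`."""
    years = YEAR_PATTERN.findall(date_text or "")
    return bool(years) and all(year in context for year in years)


operator_sentence = (
    "On January 23, OpenAI released Operator, an AI agent and web automation tool"
)
# The extracted date carries a year that the sentence itself never states.
print(year_is_grounded("January 23, 2025", operator_sentence))  # False

# Made-up sentence where the year is stated explicitly.
print(year_is_grounded("March 14, 2023", "GPT-4 was released on March 14, 2023."))  # True
```

Rows that fail the check are not necessarily wrong, as the Operator case shows; the check just separates dates quoted from the text from dates the model filled in.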
Another output was this.
Model: ChatGPT Pro
Release Date: December 5, 2024
I can see two references to ChatGPT Pro in the original article.
So I think LangExtract was pretty accurate with this extraction.
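The sample output above lists some models several times with different dates (ChatGPT and Sora, for example). If you want one row per model, a small grouping step helps. The sketch below is illustrative: group_dates is a hypothetical helper, and the pairs are copied from the output above, with one pair repeated to show the de-duplication.

```python
from collections import defaultdict


def group_dates(pairs):
    """Group (model_name, date) pairs by a normalised model name, dropping repeats."""
    grouped = defaultdict(list)
    for name, date in pairs:
        # Collapse whitespace and ignore case so "ChatGPT" and "chatgpt" merge.
        key = " ".join(name.split()).casefold()
        if date not in grouped[key]:
            grouped[key].append(date)
    return dict(grouped)


sample = [
    ("ChatGPT", "November 2022"),
    ("Sora", "2024"),
    ("ChatGPT", "December 2022"),
    ("Sora", "December 2024"),
    ("ChatGPT", "November 2022"),  # repeated on purpose
]
print(group_dates(sample))
```

This simple normalisation will not merge variants such as "Chat GPT" and "ChatGPT"; that would need a fuzzier matching rule.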
Because there were many more “hits” with this query, the visualisation is more interesting, so let’s repeat what we did in example 2. Here is the code you’ll need.
from pathlib import Path
import builtins
import langextract as lx

# Write the result as JSONL with the same helper used in example 2
# (this stands in for the undefined serialize_annotated_document call).
lx.io.save_annotated_documents([result], output_name="models.jsonl", output_dir=".")
jsonl_path = Path("models.jsonl")
html_path = Path("models.html")

# 1) Monkey-patch builtins.open so our JSONL is read as UTF-8
orig_open = builtins.open

def open_utf8(path, mode='r', *args, **kwargs):
    is_target = isinstance(path, (str, Path)) and Path(path) == jsonl_path
    if is_target and 'r' in mode and 'b' not in mode:
        kwargs.setdefault('encoding', 'utf-8')
    return orig_open(path, mode, *args, **kwargs)

builtins.open = open_utf8
try:
    # 2) Generate the visualization
    html_obj = lx.visualize(str(jsonl_path))
finally:
    # 3) Restore the original open, even if visualize() raises
    builtins.open = orig_open

# visualize() may return a display object (with .data) or a plain string
html_string = html_obj.data if hasattr(html_obj, "data") else html_obj

# 4) Save the HTML out as UTF-8
with html_path.open("w", encoding="utf-8") as f:
    f.write(html_string)

print(f"Interactive visualization saved to: {html_path}")
Run the above code and then open the models.html file in your browser. This time, you should be able to click the Play/Next/Previous buttons and see a better visualisation of the LangExtract text processing in action.
For more details on LangExtract, check out Google’s LangExtract GitHub repo.
Summary
In this article, I introduced you to LangExtract, a new Python library and framework from Google that allows you to extract structured output from unstructured input.
I outlined some of the advantages that using LangExtract can bring, including its ability to handle large documents, its augmented knowledge extraction and multi-model support.
I took you through the install process, a simple pip install, and then, by way of some example code, showed how to use LangExtract to perform needle-in-a-haystack type queries on a large body of unstructured text.
In my final example code, I demonstrated a more traditional RAG-type operation by extracting multiple entities (AI model names) and an associated attribute (date of release). For both my primary examples, I also showed you how to code a visual representation of how LangExtract works in action that you can open and play back in a browser window.


