Recipe: Knowledge Graph from Documents
This recipe turns a pile of documents into a queryable knowledge graph. You ingest source material into a document store, extract the people and companies into entity and organizations records, then wire those records together as nodes and edges in a graph you can traverse.
The problem it solves
Documents are good at holding prose and terrible at answering “who is connected to whom”. Once you have hundreds of reports, contracts, or notes, the relationships between the people and organizations named in them are invisible. This recipe lifts those relationships out of the text and into a graph, so you can walk from a person to their company to that company’s other contacts in one traversal.
Elements
| Element | Role |
|---|---|
| document | Stores and searches the source material as semi-structured pages. |
| entity | A record of a person, AI, or other actor named in the documents. |
| organizations | A record of a company, agency, or school your entities belong to. |
| graph | Nodes, edges, and traversals over the extracted relationships (backed by Apache AGE). |
| python (or any action) | Extraction step that reads pages and upserts records and edges. |
Flow
- Create a `document` store. Load source material with `write` (create-or-update a page) or `import`, then make it searchable with `vectorize` and query it with `search`.
- Create an `entity` store and an `organizations` store. As you extract names, `upsert` a record for each person and each company. Both elements carry rich identity (`add_email`, `add_phone`, `add_address`, `add_social`), and organizations expose `find_by_domain` and `find_by_org_no` so you can de-duplicate on a stable key instead of on a fuzzy name.
- Link people to companies with the organizations element’s `list_contacts` model (the contacts who work at an org).
- Create a `graph`. For each entity and organization, `add-node` with a label (e.g. `Person`, `Company`) and properties; for each relationship you extracted (works-at, mentioned-with, reports-to), `add-edge` between the two nodes.
- Do the extraction in an action element: a `python` step that pulls pages from the document `search`, parses out the named entities, and calls `upsert` plus `add-node` / `add-edge`. Attach the data elements to the action so it has their connections.
- Ask questions with the graph’s `traverse` (breadth- or depth-first from a starting node, bounded by depth), `shortest-path` between two nodes, or `query` for a labelled lookup.
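To make the last two steps concrete, here is a minimal in-memory sketch of what adding nodes and edges and then running a depth-bounded traversal or a shortest-path lookup amounts to. It is not the graph element's real API: `TinyGraph`, its method signatures, and the sample people and company are all hypothetical stand-ins, and the real element is backed by Apache AGE rather than Python dictionaries.

```python
from collections import deque


class TinyGraph:
    """Hypothetical in-memory stand-in for the graph element (not its real API)."""

    def __init__(self):
        self.nodes = {}  # node id -> {"label": ..., "props": {...}}
        self.adj = {}    # node id -> list of (neighbor id, edge type)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}
        self.adj.setdefault(node_id, [])

    def add_edge(self, src, dst, edge_type):
        # Stored in both directions so a walk can go person -> company -> colleague.
        self.adj[src].append((dst, edge_type))
        self.adj[dst].append((src, edge_type))

    def traverse(self, start, max_depth):
        """Breadth-first walk from start; returns {node id: hops from start}."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if seen[current] == max_depth:
                continue  # depth bound reached, do not expand further
            for neighbor, _edge_type in self.adj[current]:
                if neighbor not in seen:
                    seen[neighbor] = seen[current] + 1
                    queue.append(neighbor)
        return seen

    def shortest_path(self, start, goal):
        """Fewest-hops path found by BFS, or None if goal is unreachable."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if current == goal:
                path = []
                while current is not None:
                    path.append(current)
                    current = parent[current]
                return path[::-1]
            for neighbor, _edge_type in self.adj[current]:
                if neighbor not in parent:
                    parent[neighbor] = current
                    queue.append(neighbor)
        return None


# Sample data is invented for illustration only.
graph = TinyGraph()
graph.add_node("alice", "Person", name="Alice Example")
graph.add_node("bob", "Person", name="Bob Example")
graph.add_node("acme", "Company", domain="acme.example")
graph.add_edge("alice", "acme", "works-at")
graph.add_edge("bob", "acme", "works-at")

reachable = graph.traverse("alice", max_depth=2)
path = graph.shortest_path("alice", "bob")
```

With this sample data, `reachable` maps `alice` to 0 hops, `acme` to 1, and `bob` to 2, and `path` is `["alice", "acme", "bob"]`: the person-to-company-to-colleague walk the recipe is after. Lowering `max_depth` to 1 stops the walk at the company.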
What this shows
The same facts live in three shapes, each good at a different question: document answers “what does the source say”, entity / organizations answer “what is the canonical record for this person or company”, and graph answers “how are they connected”. Because the entity and organization records carry verified contact identity, the graph you build on top of them is anchored to real, de-duplicated nodes rather than to raw strings scraped from prose.
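The de-duplication point can be illustrated with a small sketch of upserting on a stable key. This is an assumption-laden stand-in, not the organizations element itself: `OrgStore`, `normalize_domain`, the record layout, and the sample names are hypothetical, and only the names `upsert` and `find_by_domain` are borrowed from the recipe, with guessed signatures.

```python
def normalize_domain(raw):
    """Lower-case a domain and strip a leading 'www.' so variants compare equal."""
    domain = raw.strip().lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return domain


class OrgStore:
    """Hypothetical in-memory stand-in for the organizations element."""

    def __init__(self):
        self._by_domain = {}  # normalized domain -> organization record

    def find_by_domain(self, domain):
        return self._by_domain.get(normalize_domain(domain))

    def upsert(self, name, domain):
        # The domain, not the name, decides whether this is a new organization.
        key = normalize_domain(domain)
        record = self._by_domain.get(key)
        if record is None:
            record = {"name": name, "domain": key, "aliases": []}
            self._by_domain[key] = record
        elif name != record["name"] and name not in record["aliases"]:
            # Same company under a different spelling: keep it as an alias.
            record["aliases"].append(name)
        return record


store = OrgStore()
first = store.upsert("Acme Corp", "acme.example")
second = store.upsert("ACME Corporation", "WWW.Acme.example")
```

Both calls return the same record, so two spellings scraped from different documents collapse into one organization, and therefore into one graph node, instead of two near-duplicate ones.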