Data Structures
Last updated: 2026-02-19
1. Data Sources
1.1 Apple Books Highlights (primary bookmark source)
- Origin: Apple Books app on iOS/macOS
- Export path: Highlights are exported via DataJar (iOS automation app) or directly as raw JSON
- Raw format: JSON array with fields including `author`, `book_title`, `date`, `time`, `highlights` (or `highlight`), `location`, `notes`, `tags`, `progress`
- Location data: The raw export includes a `location` field containing real street addresses (e.g., `"Home"`). This is stripped during processing for privacy.
- Boilerplate: Apple Books appends `"Excerpt From\n{title}\n{author}\nThis material may be protected by copyright."` to every highlight. This is removed during processing.
- Raw file: `/Users/weixiangzhang/Local_Dev/projects/bythewei/src/data/bookmarks.json` (unprocessed, contains location and boilerplate)
1.2 DataJar Exports (alternate bookmark source)
- Origin: DataJar app on iOS — a structured key-value store used for Shortcuts automation
- Export formats: Three supported container formats:
  - Bare JSON (`store.json`, `"store 2.json"`)
  - ZIP archive (`"Data Jar YYYY-MM-DD ....zip"` containing `store.json`)
  - `.datajar` file (`"YYYY-MM-DD HH.MM.datajar"` — ZIP containing `root.json`)
- Internal schema: Deeply nested typed-value tree: `root.children.BookmarkRecords.value.value[]` -> `entry.value.value` (dictionary of field nodes). Each field is wrapped: `{ value: { type: "string", value: "actual content" } }`
- Note: Some older entries (~8 from Feb 2025 Dune reads) use `highlight` (singular) instead of `highlights` (plural). Similarly, `note` vs `notes` appears in some entries.
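The typed-value wrapping above can be flattened with a small helper. A minimal sketch, assuming the schema as documented (the helper names are mine, not taken from `extract-datajar.mjs`):

```javascript
// Unwrap one DataJar field node: { value: { type: "string", value: "actual content" } }
function unwrapNode(node) {
  const v = node?.value;
  if (v && typeof v === 'object' && 'value' in v) return v.value;
  return v ?? null;
}

// Flatten one entry from root.children.BookmarkRecords.value.value[]
// into a plain { field: value } record.
function flattenEntry(entryNode) {
  const fields = entryNode.value.value; // dictionary of field nodes
  const out = {};
  for (const [key, node] of Object.entries(fields)) out[key] = unwrapNode(node);
  return out;
}
```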
1.3 Google Sheets CSV (reading log source)
- Origin: Manually maintained Google Sheet titled “Reading Data - Primary”
- Export path: Downloaded as CSV to `~/Downloads/Reading Data - Primary.csv`
- Coverage: 150 books read from February 2019 through February 2020 (one full reading year)
- Total pages tracked: 41,086 pages across 150 books
- Manual entry: All fields are hand-entered by the reader, including emotional responses, difficulty ratings, and discovery source
1.4 Manual / Static Data
- `src/data/sprint.json`: Hand-authored sprint board data for the homepage sticky-note wall. Contains metadata, stats, column definitions with sticky notes, and a bottom row. Updated manually per sprint.
- `src/data/kaomojis.ts`: 200 hand-curated kaomojis organized by category (happy, excited, surprised, angry, sad, cool, fighting, love, shrug, animals, magic, tired, chaos). Used for deterministic daily rotation on the site via date-seeded pseudo-random selection.
2. Data Schemas
2.1 bookmarks.clean.json — Processed Highlights
Location: public/data/bookmarks.clean.json and src/data/bookmarks.clean.json (identical copies)
Size: 17,572 lines, 1,357 entries, 156 unique books
Date range: 2022-01-02 to 2026-02-03
Sort order: Newest first (descending by date_read)
| Field | Type | Nullable | Description |
|---|---|---|---|
| id | string | No | SHA-1 hex digest of `lowercase(author) + "\|\|" + lowercase(book_title) + "\|\|" + lowercase(highlight[0:100])` (see ID generation below). |
| author | string | No | Author name, normalised from “Last, First” to “First Last” for simple single-author cases. Empty string if unknown. |
| book_title | string | No | Full book title. Empty string if unknown (1 orphan entry exists). |
| date_read | string \| null | Yes | ISO 8601 date (YYYY-MM-DD). Parsed from Apple Books “MMM DD, YYYY” format. Null if date could not be parsed. |
| date | string \| null | Yes | Alias for date_read. Kept for backward compatibility — index.astro reads entry.date for the Quote of the Day feature. Always identical to date_read. |
| highlights | string | No | The highlight/passage text. Cleaned of Apple Books boilerplate and leading/trailing curly quotes, with excessive newlines collapsed. Entries with empty highlights are dropped entirely. |
| notes | string \| null | Yes | User-added notes on the highlight. Trimmed. Null if none. |
| tags | string[] | No | Array of tag strings. Parsed from comma/semicolon-separated string or passed through if already an array. Empty array if no tags. |
| source | string | No | One of: "apple_books", "kindle", "readwise", "manual". Auto-detected from entry properties. Current dataset: 1,347 apple_books + 10 manual. |
| word_count | number | No | Word count of the cleaned highlight text. Computed via `text.trim().split(/\s+/).filter(Boolean).length`. |
Source detection logic (in order):
- Has `source_url` or `readwise_url` field -> `"readwise"`
- Has `location` matching pattern `^\d+[-\u2013]\d+$` -> `"kindle"` (e.g., “123-456”)
- Highlight text matches `Excerpt From ... copyright` pattern -> `"apple_books"`
- Has any `location` field -> `"apple_books"`
- Otherwise -> `"manual"`
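The detection order above can be sketched directly; this is illustrative and may differ in detail from the actual code in `process-bookmarks.mjs`:

```javascript
// Source detection, in the documented priority order. Falls through to "manual".
function detectSource(entry) {
  if (entry.source_url || entry.readwise_url) return 'readwise';
  // Kindle-style location: "123-456" (hyphen or en dash)
  if (typeof entry.location === 'string' && /^\d+[-\u2013]\d+$/.test(entry.location)) {
    return 'kindle';
  }
  const text = entry.highlights || entry.highlight || '';
  if (/Excerpt From[\s\S]*copyright/i.test(text)) return 'apple_books';
  if (entry.location != null) return 'apple_books';
  return 'manual';
}
```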
ID generation:
```
SHA-1( lowercase(author) + "||" + lowercase(book_title) + "||" + lowercase(highlight[0:100]) )
```
Uses only the first 100 characters of the highlight so IDs remain stable even when boilerplate trimming changes the tail.
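A minimal sketch of this scheme using Node's built-in `crypto` module (the function name is illustrative; the real implementation lives in `process-bookmarks.mjs`):

```javascript
import { createHash } from 'node:crypto';

// Stable ID: only the first 100 characters of the highlight are hashed,
// so boilerplate trimming that changes the tail does not change the ID.
function makeId(author, bookTitle, highlight) {
  const key = [
    String(author).toLowerCase(),
    String(bookTitle).toLowerCase(),
    String(highlight).slice(0, 100).toLowerCase(),
  ].join('||');
  return createHash('sha1').update(key, 'utf8').digest('hex');
}
```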
2.2 reading-log.json — Reading Year Log
Location: public/data/reading-log.json and src/data/reading-log.json (identical copies)
Size: 3,751 lines, 150 entries
Date range: 2019-02-19 to 2020-02-27 (one calendar reading year)
Sort order: Ascending by date_finished (nulls last)
| Field | Type | Nullable | Description |
|---|---|---|---|
| title | string \| null | Yes | Book title. Note: field is `title` here vs `book_title` in bookmarks. |
| author | string \| null | Yes | Author name in “Last, First” format (NOT normalised to “First Last” — differs from bookmarks schema). |
| date_started | string \| null | Yes | ISO 8601 date (YYYY-MM-DD). Parsed from M/D/YYYY or M/D/YY CSV format. |
| date_finished | string \| null | Yes | ISO 8601 date (YYYY-MM-DD). Same parsing as date_started. |
| days_to_read | number \| null | Yes | Computed: `Math.round((date_finished - date_started) / 86400000)`. Null if either date is missing. |
| rating | number \| null | Yes | 1-5 integer rating. Null if not rated. |
| gender | string \| null | Yes | Author gender: "F" (female), "M" (male), "N" (non-binary). Null if unknown. |
| poc | boolean \| null | Yes | Whether the author is a person of color. true/false/null. Parsed from "Y"/"N" in CSV. |
| emotions | string[] | No | Array of emotional responses to the book. Known values: "Happy", "Sad", "Angry", "Anger", "Bored", "Empowered", "Interesting", "Funny". See Known Issues for the Anger/Angry problem. |
| emotional_output | string \| null | Yes | Aggregate sentiment: "Positive", "Negative", "Neutral". Null if not specified. |
| difficulty | number \| null | Yes | 1-5 integer difficulty rating. |
| publisher | string \| null | Yes | Publisher name. |
| year_published | number \| null | Yes | Year of publication as integer. |
| pages | number \| null | Yes | Page count of the book. |
| running_pages | number \| null | Yes | Cumulative page count across all books read (running total). |
| fiction | string \| null | Yes | One of: "Fiction", "Non-Fiction", "Graphic Novel". Normalised from CSV variants. |
| genre | string \| null | Yes | Free-text genre label (e.g., “Race Studies”, “Fantasy”, “Productivity”, “Autobiography”). |
| country | string \| null | Yes | Country of origin/setting (e.g., “USA”, “Russia”). |
| why | string \| null | Yes | Why the book was chosen (e.g., “Word of Mouth”, “Curious”, “Utility”). |
| why_source | string \| null | Yes | Where the recommendation came from (e.g., “Online Forums”, “Friend”, “Colleague”, “Publicity”). |
| review | string \| null | Yes | Text review. Filtered: entries containing “didn’t have time”, “did not have time”, or “review in progress” are set to null. |
Emotion value distribution (150 books):
- Interesting: 70
- Happy: 20
- Sad: 16
- Bored: 12
- Funny: 12
- Anger: 10
- Empowered: 8
- Angry: 2
2.3 sprint.json — Homepage Sprint Board
Location: src/data/sprint.json
Usage: Imported directly into index.astro at build time. Drives the sticky-note wall UI.
```
{
  meta: { date, title, subtitle },
  stats: [{ value, label }],
  columns: [{
    header: string,
    stickies: [{
      color: "g"|"y"|"o"|"r"|"b"|"p"|"w"|"teal",
      size: "big"|null,
      rotation: "r1"-"r7",
      tape: boolean,
      stamp: "DONE"|"IN PROGRESS"|null,
      title: string,
      body: string,
      tag: string|null,
      blocker: boolean  // optional
    }]
  }],
  bottom_row: [{
    color, rotation, tape: "center"|"left"|"right"|null,
    title, body
  }],
  footer: string
}
```
2.4 kaomojis.ts — Kaomoji Collection
Location: src/data/kaomojis.ts
Type: export const kaomojis: string[]
Count: 200 kaomojis
Categories: Happy/Wholesome (20), Excited/Celebrating (20), Surprised/Shocked (20), Angry/Table Flip (20), Sad/Crying (20), Cool/Smug (20), Fighting/Determined (20), Love/Affectionate (20), Shrug/Whatever (10), Animals (15), Magic/Sparkle (10), Tired/Done (10), Running/Chaos (10)
Client-side rotation logic (in index.astro):
```js
// Date-seeded: same day -> same kaomojis for every visitor worldwide
const seed = year * 10000 + (month + 1) * 100 + day;
// Mulberry32 PRNG seeded with date
// Each [data-kaomoji] element gets a deterministic pick
```
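Fleshed out, the rotation could look like the following. This is a sketch built from the comment above, assuming a standard Mulberry32 implementation; the actual code in `index.astro` may differ:

```javascript
// Mulberry32: tiny 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same calendar day -> same seed -> same pick for every visitor.
function dailyPick(list, date) {
  const seed = date.getFullYear() * 10000 + (date.getMonth() + 1) * 100 + date.getDate();
  const rand = mulberry32(seed);
  return list[Math.floor(rand() * list.length)];
}
```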
3. Pipeline Scripts
All scripts live in /Users/weixiangzhang/Local_Dev/projects/bythewei/scripts/.
3.1 extract-datajar.mjs — DataJar Export Extractor
Purpose: Extracts highlight/bookmark data from DataJar JSON exports into a flat JSON array.
Input: DataJar export file in one of three formats:
- Bare JSON (`store.json`)
- ZIP archive (`Data Jar YYYY-MM-DD ....zip`)
- `.datajar` file (`YYYY-MM-DD HH.MM.datajar`)
Output: JSON array of raw bookmark records with the schema:
```json
{
  "author": "string|null",
  "book_title": "string|null",
  "date": "MMM DD, YYYY|null",
  "time": "string|null",
  "highlights": "string",
  "notes": "string|null",
  "tags": "string|null",
  "source": "datajar",
  "progress": "number|null",
  "location": "string|null"
}
```
Usage:
```sh
# JSON to stdout, summary to stderr
node scripts/extract-datajar.mjs store.json

# JSON to file, summary to stdout
node scripts/extract-datajar.mjs "Data Jar 2025-06-12 19.44.30.zip" datajar-2025.json
```
Key details:
- Custom zero-dependency ZIP parser (supports DEFLATE method 8 and stored method 0)
- Handles data-descriptor entries (bit 3 flag) by falling back to central directory metadata
- `.datajar` files use `root.json` internally; `.zip` files use `store.json`
- Entries without highlight text are skipped
- Handles both `highlights` (plural) and `highlight` (singular) field names
- Prints summary with date range, top books by highlight count, and sample previews
3.2 process-bookmarks.mjs — Full Bookmark Pipeline (primary)
Purpose: The main processing pipeline. Takes raw Apple Books JSON and produces the clean, deduplicated, privacy-safe bookmarks.clean.json.
Input: Raw bookmark JSON (default: src/data/bookmarks.json)
Output: Written to BOTH:
- `src/data/bookmarks.clean.json`
- `public/data/bookmarks.clean.json`
Processing steps (in order):
1. Load raw JSON array
2. Resolve highlight text (normalise `highlight` -> `highlights` key)
3. Auto-detect source format (apple_books / kindle / readwise / manual)
4. Strip `location` field (contains real addresses — privacy)
5. Clean highlight text (remove Apple Books boilerplate, curly quotes, collapse whitespace)
6. Drop entries with empty highlights after cleaning
7. Normalise author names (“Last, First” -> “First Last”)
8. Generate stable SHA-1 ID for deduplication
9. Deduplicate by ID (first occurrence wins)
10. Parse dates to ISO 8601
11. Compute word count
12. Sort newest-first
13. Write output + print summary report
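The text-cleaning step can be sketched as below. The regexes are illustrative approximations of the behaviour described in section 1.1, not the exact patterns used in `process-bookmarks.mjs`:

```javascript
// Clean one highlight: strip Apple Books boilerplate, trim leading/trailing
// curly quotes, and collapse runs of 3+ newlines down to a blank line.
function cleanHighlight(text) {
  return String(text)
    .replace(/Excerpt From[\s\S]*?protected by copyright\.?/gi, '') // boilerplate
    .replace(/^[\u201c\u2018]+|[\u201d\u2019]+$/g, '')              // curly quotes
    .replace(/\n{3,}/g, '\n\n')                                     // excess newlines
    .trim();
}
```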
Usage:
```sh
# Default input (src/data/bookmarks.json)
node scripts/process-bookmarks.mjs

# Custom input
node scripts/process-bookmarks.mjs ~/Downloads/apple-books-export.json
```
3.3 merge-bookmarks.mjs — Incremental Merge
Purpose: Merges a NEW raw export into the existing bookmarks.clean.json without wiping or re-processing existing data. Designed for incremental updates.
Input: Path to a new raw export JSON file
Output: Updated bookmarks.clean.json in both src/data/ and public/data/
Merge strategy:
- Load existing `bookmarks.clean.json` (the live DB)
- Index existing entries by ID in a `Map` for O(1) lookup
- Process new entries through the same pipeline as `process-bookmarks.mjs`
- Match by stable SHA-1 ID
- New entries are appended; existing entries are preserved unchanged (no overwrites)
- Merged list is sorted newest-first
- Write back to both output locations
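The core of this strategy fits in a few lines; a sketch under the schema in section 2.1 (existing entries win, newest-first sort by `date_read`):

```javascript
// Merge new entries into the live DB: existing IDs are never overwritten,
// unseen IDs are added, and the result is sorted newest-first.
function mergeBookmarks(existing, incoming) {
  const byId = new Map(existing.map((e) => [e.id, e]));
  for (const entry of incoming) {
    if (!byId.has(entry.id)) byId.set(entry.id, entry); // first occurrence wins
  }
  return [...byId.values()].sort((a, b) =>
    String(b.date_read ?? '').localeCompare(String(a.date_read ?? ''))
  );
}
```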
Usage:
```sh
node scripts/merge-bookmarks.mjs ~/Downloads/apple-books-export-march.json
```
Output report:
```
New entries added        : 42
Already existed (skipped): 315
Empty/dropped            : 3
Total in DB now          : 1399
```
3.4 strip-location.mjs — Legacy Processor (superseded)
Purpose: The original bookmark processor. Superseded by process-bookmarks.mjs but still functional.
Differences from process-bookmarks.mjs:
- No SHA-1 ID generation (deduplicates by exact `author||book_title||highlights` string match)
- No source detection
- No author name normalisation
- No word count computation
- Preserves `date`, `time`, `progress`, `notes`, `tags` as-is (does not transform to ISO dates)
- Simpler output schema (no `id`, `date_read`, `source`, `word_count` fields)
Usage:
```sh
node scripts/strip-location.mjs [input]
# Default input: src/data/bookmarks.json
```
3.5 convert-reading-log.mjs — CSV to JSON Converter
Purpose: Converts the Google Sheets CSV reading log into JSON.
Input: Hardcoded path: `~/Downloads/Reading Data - Primary.csv`
Output: Written to BOTH:
- `src/data/reading-log.json`
- `public/data/reading-log.json`
Processing details:
- Custom RFC-4180 CSV parser (handles quoted fields with embedded commas and newlines)
- Expects exactly 20 columns per row
- Date parsing: `M/D/YYYY` or `M/D/YY` (2-digit years treated as 20xx) -> `YYYY-MM-DD`
- `days_to_read` computed from start/finish dates
- Emotions parsed from comma-separated string to array
- Reviews filtered: “didn’t have time” / “review in progress” -> null
- Fiction normalised to exact enum values
- Sorted ascending by `date_finished` (nulls last)
- Prints stats: total count, date range, rating distribution, top 10 genres, review counts, parsing errors
Usage:
```sh
node scripts/convert-reading-log.mjs
```
Column mapping (0-indexed from CSV headers):
```
 0: Date Started    -> date_started
 1: Date Finished   -> date_finished
 2: Title           -> title
 3: Author          -> author
 4: Gender          -> gender
 5: POC             -> poc
 6: Rating          -> rating
 7: Emotions        -> emotions
 8: Emotional Output -> emotional_output
 9: Difficulty      -> difficulty
10: Publisher       -> publisher
11: Year Published  -> year_published
12: Pages           -> pages
13: Running Pages   -> running_pages
14: Fiction or Non  -> fiction
15: Genre           -> genre
16: Country         -> country
17: Why             -> why
18: Why Source      -> why_source
19: Review          -> review
```
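The date conversion described above (`M/D/YYYY` or `M/D/YY` to ISO, two-digit years treated as 20xx) can be sketched as follows; this is illustrative, not the exact code in `convert-reading-log.mjs`:

```javascript
// Convert "M/D/YYYY" or "M/D/YY" to "YYYY-MM-DD"; returns null on parse failure.
function toIsoDate(raw) {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{2,4})$/.exec(String(raw).trim());
  if (!m) return null;
  let [, month, day, year] = m;
  if (year.length === 2) year = '20' + year; // 2-digit years treated as 20xx
  return `${year}-${month.padStart(2, '0')}-${day.padStart(2, '0')}`;
}
```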
3.6 verify-clean.mjs — Quality Check
Purpose: Quick verification script to spot-check the cleaned bookmark data.
Input: Reads src/data/bookmarks.clean.json (hardcoded relative path)
Checks performed:
- Dumps first 5 entries showing book, author, highlight start/end (JSON-escaped for visibility)
- Counts entries still containing Apple Books boilerplate (“excerpt from”, “this material may”)
- Counts entries with leading/trailing whitespace in highlights
- Reports total count, shortest highlight, longest highlight, median highlight length
Usage:
node scripts/verify-clean.mjs
4. Data Flow
4.1 Bookmark Pipeline
```
DataJar app (iOS)
      |
      v
extract-datajar.mjs             Apple Books (direct JSON export)
      |                                       |
      v                                       v
 raw JSON array                        raw JSON array
      |                                       |
      +-------------------+-------------------+
                          |
                          v
              src/data/bookmarks.json
        (raw, with location + boilerplate)
                          |
          +---------------+---------------+
          |                               |
          v                               v
process-bookmarks.mjs             merge-bookmarks.mjs
   (full rebuild)               (incremental update)
          |                               |
          +---------------+---------------+
                          |
                          v
                bookmarks.clean.json
                          |
          +---------------+---------------+
          |                               |
          v                               v
src/data/bookmarks.clean.json   public/data/bookmarks.clean.json
          |                               |
          v                               v
(available at build time)   (served at /data/bookmarks.clean.json)
                                          |
                                          v
                        index.astro client-side fetch()
                                          |
                          +---------------+----------+
                          |               |          |
                          v               v          v
                        QOTD           Catalog    Journal
                     (quote of       (book list  (timeline
                      the day)         modal)      modal)
```
Key points:
- `bookmarks.clean.json` is written to TWO locations: `src/data/` (for build-time import) and `public/data/` (for runtime client-side fetch)
- The client fetches from `/data/bookmarks.clean.json` at page load, not at build time
- QOTD selects a highlight deterministically based on the current date
- The same fetched data powers the catalog modal (group by book, filter) and journal modal (timeline view)
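The catalog modal's grouping step can be sketched like this. A minimal sketch of one plausible implementation; the `"(untitled)"` fallback label is my assumption, not taken from `index.astro`:

```javascript
// Bucket fetched highlight entries by book_title for the catalog view.
// Entries with an empty title (the orphan entry) land in "(untitled)".
function groupByBook(entries) {
  const books = new Map();
  for (const e of entries) {
    const key = e.book_title || '(untitled)'; // hypothetical fallback label
    if (!books.has(key)) books.set(key, []);
    books.get(key).push(e);
  }
  return books;
}
```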
4.2 Reading Log Pipeline
```
Google Sheets ("Reading Data - Primary")
                  |
                  v  (manual CSV download)
~/Downloads/Reading Data - Primary.csv
                  |
                  v
       convert-reading-log.mjs
                  |
        +---------+---------+
        |                   |
        v                   v
src/data/reading-log.json   public/data/reading-log.json
        |                   |
        v                   v
(build-time reference)   (served at /data/reading-log.json)
                            |
                            v
           index.astro client-side fetch()
                            |
                            v
            "THE READING YEAR" modal
              (hidden bookshelf UI)
```
Key points:
- The reading log fetch is lazy — it only triggers when the user opens the hidden reading year modal (via the shelf trigger pin)
- The modal shows stats (genre breakdown, emotion distribution, author demographics) computed client-side from the JSON
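One of those client-side stats, the emotion distribution, could be computed like this. A sketch only; function name and shape are mine, not taken from `index.astro`:

```javascript
// Tally emotion labels across reading-log entries (each entry carries an
// emotions: string[] field per the schema in section 2.2).
function emotionCounts(log) {
  const counts = {};
  for (const book of log) {
    for (const emotion of book.emotions ?? []) {
      counts[emotion] = (counts[emotion] ?? 0) + 1;
    }
  }
  return counts;
}
```

Note that without upstream normalisation, this tally reports “Anger” and “Angry” separately (see Known Issues 6.1).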
4.3 Static Data (no pipeline)
```
src/data/sprint.json  ----->  imported at build time by index.astro
                              -> renders sticky note wall

src/data/kaomojis.ts  ----->  imported at build time by index.astro
                              -> injected as define:vars for client-side rotation script
```
5. Cross-Referencing
5.1 Dataset Overlap
The two primary datasets — bookmarks and reading log — have zero book title overlap. They cover entirely different time periods and use different title/author schemas:
| Property | bookmarks.clean.json | reading-log.json |
|---|---|---|
| Time period | Jan 2022 — Feb 2026 | Feb 2019 — Feb 2020 |
| Entry count | 1,357 highlights | 150 books |
| Unique books | 156 | 150 |
| Title field | book_title | title |
| Author format | “First Last” (normalised) | “Last, First” (CSV original) |
| Books in common | 0 | 0 |
| Granularity | Per-highlight (many per book) | Per-book (one entry per book) |
5.2 Schema Differences
The two datasets were designed independently and have several naming inconsistencies:
| Concept | Bookmarks | Reading Log | Notes |
|---|---|---|---|
| Book title | book_title | title | Different field names |
| Author | author (“First Last”) | author (“Last, First”) | Different name order |
| Date | date_read / date (ISO) | date_started / date_finished (ISO) | Different semantics |
| Rating | (none) | rating (1-5) | Bookmarks have no rating |
| Emotions | (none) | emotions (array) | Bookmarks have no emotions |
| Word count | word_count | pages | Different unit of measurement |
| Genre | (none) | genre | Bookmarks have no genre |
| Source | source (auto-detected) | (none) | Reading log has no source |
5.3 Potential Future Unification
If the datasets were to be merged or cross-referenced:
- Author normalisation would need to be applied to reading-log data (“Last, First” -> “First Last”)
- Title field would need aliasing (`title` <-> `book_title`)
- The gap between Feb 2020 and Jan 2022 means there are ~2 years of untracked reading
6. Known Issues
6.1 “Anger” vs “Angry” Emotion Normalization
The reading log CSV source uses both "Anger" (10 occurrences) and "Angry" (2 occurrences) to represent the same emotion. The convert-reading-log.mjs script does NOT normalise these — it passes emotions through from the CSV as-is via simple comma-split:
```js
function parseEmotions(str) {
  if (!str || str.trim() === '') return [];
  return str.split(',').map(e => e.trim()).filter(Boolean);
}
```
Impact: Any client-side code that groups or counts by emotion will treat “Anger” and “Angry” as separate categories. The current distribution is:
- "Anger": 10 books
- "Angry": 2 books
Fix: Add normalisation in parseEmotions() or in the CSV source itself. Recommended target: "Angry" (adjective form, consistent with "Happy", "Sad", "Funny").
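One possible shape for that fix, assuming the “Angry” target recommended above (the `EMOTION_ALIASES` table is hypothetical, not in the script):

```javascript
// Hypothetical alias table: fold known synonyms to one canonical label.
const EMOTION_ALIASES = { Anger: 'Angry' };

function parseEmotions(str) {
  if (!str || str.trim() === '') return [];
  return str
    .split(',')
    .map((e) => e.trim())
    .filter(Boolean)
    .map((e) => EMOTION_ALIASES[e] ?? e); // normalise at parse time
}
```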
6.2 Orphan Title Problem
There is 1 entry in bookmarks.clean.json with an empty book_title (empty string ""):
- Author: empty string
- Date: `2022-01-02`
- Highlight preview: `"This is an"` (truncated)
- Root cause: The raw Apple Books export contained an entry with no book metadata. The processing pipeline preserves entries as long as they have non-empty highlight text, even without book/author data.
Impact: This entry will appear in the QOTD rotation without attribution. In the catalog modal, it would appear under an empty book title.
6.3 Author Name Format Inconsistency
- Bookmarks: Authors are normalised to “First Last” format by `normaliseAuthor()` in `process-bookmarks.mjs` (e.g., “Huffer, Lynne” -> “Lynne Huffer”)
- Reading log: Authors remain in “Last, First” format from the CSV (e.g., “Rios, Victor”)
This means the same author would appear differently in each dataset, preventing naive string matching for cross-referencing.
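For cross-referencing, the bookmark-side normalisation could be applied to reading-log authors too. A minimal sketch for the simple single-author case; the real `normaliseAuthor()` may handle more edge cases (multiple authors, suffixes):

```javascript
// "Last, First" -> "First Last" for simple single-author strings;
// anything without exactly one comma is left untouched.
function normaliseAuthor(name) {
  const s = String(name).trim();
  const parts = s.split(',');
  if (parts.length !== 2) return s; // leave complex cases alone
  const [last, first] = parts.map((p) => p.trim());
  return first && last ? `${first} ${last}` : s;
}
```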
6.4 Duplicate date / date_read Fields
Every entry in bookmarks.clean.json carries both date_read and date with identical values. The date field exists solely for backward compatibility with index.astro’s QOTD code, which reads entry.date. This duplication adds ~20KB to the JSON file. The comment in process-bookmarks.mjs documents this:
```js
// 'date' alias kept for QOTD backward-compatibility (index.astro reads entry.date)
date: dateIso,
```
6.5 Hardcoded CSV Path
The convert-reading-log.mjs script has a hardcoded absolute path for its CSV input:
```js
const CSV_PATH = '/Users/weixiangzhang/Downloads/Reading Data - Primary.csv';
```
This is not configurable via command-line arguments (unlike the bookmark scripts). Running on a different machine or after moving the CSV will fail silently.
6.6 strip-location.mjs Superseded but Not Removed
The original strip-location.mjs is still in the scripts directory but has been functionally replaced by process-bookmarks.mjs, which does everything strip-location.mjs does plus adds SHA-1 IDs, source detection, author normalisation, and word counts. Running strip-location.mjs would produce output in a different schema than what the site expects.
6.7 DataJar source Field Mismatch
Entries extracted via extract-datajar.mjs are tagged with source: "datajar", but after processing through process-bookmarks.mjs, the source is re-detected based on entry properties and typically overwritten to "apple_books" or "manual". The "datajar" source value does not appear in the final bookmarks.clean.json.
6.8 No Validation of Emotion Values
The reading log pipeline does not validate emotion strings against a known set of values. Any string in the CSV emotions column is accepted. This is how "Anger" and "Angry" both ended up in the data — they were entered inconsistently in the spreadsheet and passed through without validation.
6.9 running_pages Inconsistency
The first entry in the reading log (Human Targets) has running_pages: 7700, which is far higher than the book’s 224 pages and inconsistent with the second entry (Circe) having running_pages: 393. This suggests the running pages counter was either reset partway through the reading year or was pre-seeded from prior reading. The field comes directly from the CSV without validation or recomputation.