Data Structures

JSON Data Pipeline

Last updated: 2026-02-19


1. Data Sources

1.1 Apple Books Highlights (primary bookmark source)

  • Origin: Apple Books app on iOS/macOS
  • Export path: Highlights are exported via DataJar (iOS automation app) or directly as raw JSON
  • Raw format: JSON array with fields including author, book_title, date, time, highlights (or highlight), location, notes, tags, progress
  • Location data: The raw export includes a location field containing real-world location labels, including street addresses and place names (e.g., "Home"). This is stripped during processing for privacy.
  • Boilerplate: Apple Books appends "Excerpt From\n{title}\n{author}\nThis material may be protected by copyright." to every highlight. This is removed during processing.
  • Raw file: /Users/weixiangzhang/Local_Dev/projects/bythewei/src/data/bookmarks.json (unprocessed, contains location and boilerplate)

1.2 DataJar Exports (alternate bookmark source)

  • Origin: DataJar app on iOS — a structured key-value store used for Shortcuts automation
  • Export formats: Three supported container formats:
    • Bare JSON (store.json, "store 2.json")
    • ZIP archive ("Data Jar YYYY-MM-DD ....zip" containing store.json)
    • .datajar file ("YYYY-MM-DD HH.MM.datajar" — ZIP containing root.json)
  • Internal schema: Deeply nested typed-value tree:
    root.children.BookmarkRecords.value.value[]
      -> entry.value.value  (dictionary of field nodes)
    Each field is wrapped: { value: { type: "string", value: "actual content" } }
  • Note: Some older entries (~8 from Feb 2025 Dune reads) use highlight (singular) instead of highlights (plural). Similarly, note vs notes appears in some entries.
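The nested typed-value tree can be flattened with a small helper. This is a minimal sketch assuming exactly the wrapping shape described above; the actual extract-datajar.mjs may handle more node types and edge cases:

```javascript
// Unwrap one BookmarkRecords entry: entry.value.value is a dictionary
// whose fields are each wrapped as { value: { type, value } }.
function unwrapEntry(entry) {
  const fields = entry.value.value; // dictionary of field nodes
  const out = {};
  for (const [key, node] of Object.entries(fields)) {
    out[key] = node.value.value;    // unwrap { value: { type, value } }
  }
  return out;
}

// Walk the documented path: root.children.BookmarkRecords.value.value[]
function extractRecords(root) {
  const entries = root.children.BookmarkRecords.value.value;
  return entries.map(unwrapEntry);
}
```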

1.3 Google Sheets CSV (reading log source)

  • Origin: Manually maintained Google Sheet titled “Reading Data - Primary”
  • Export path: Downloaded as CSV to ~/Downloads/Reading Data - Primary.csv
  • Coverage: 150 books read from February 2019 through February 2020 (one full reading year)
  • Total pages tracked: 41,086 pages across 150 books
  • Manual entry: All fields are hand-entered by the reader, including emotional responses, difficulty ratings, and discovery source

1.4 Manual / Static Data

  • src/data/sprint.json: Hand-authored sprint board data for the homepage sticky-note wall. Contains metadata, stats, column definitions with sticky notes, and a bottom row. Updated manually per sprint.
  • src/data/kaomojis.ts: 200 hand-curated kaomojis organized by category (happy, excited, surprised, angry, sad, cool, fighting, love, shrug, animals, magic, tired, chaos). Used for deterministic daily rotation on the site via date-seeded pseudo-random selection.

2. Data Schemas

2.1 bookmarks.clean.json — Processed Highlights

Location: public/data/bookmarks.clean.json and src/data/bookmarks.clean.json (identical copies)
Size: 17,572 lines, 1,357 entries, 156 unique books
Date range: 2022-01-02 to 2026-02-03
Sort order: Newest first (descending by date_read)

Fields:

  • id (string, not null): SHA-1 hex digest of lowercase(author) + "||" + lowercase(book_title) + "||" + lowercase(highlight[0:100]). See "ID generation" below.
  • author (string, not null): Author name, normalised from "Last, First" to "First Last" for simple single-author cases. Empty string if unknown.
  • book_title (string, not null): Full book title. Empty string if unknown (1 orphan entry exists).
  • date_read (string | null): ISO 8601 date (YYYY-MM-DD). Parsed from Apple Books "MMM DD, YYYY" format. Null if the date could not be parsed.
  • date (string | null): Alias for date_read. Kept for backward compatibility — index.astro reads entry.date for the Quote of the Day feature. Always identical to date_read.
  • highlights (string, not null): The highlight/passage text. Cleaned of Apple Books boilerplate and leading/trailing curly quotes, with excessive newlines collapsed. Entries with empty highlights are dropped entirely.
  • notes (string | null): User-added notes on the highlight. Trimmed. Null if none.
  • tags (string[], not null): Array of tag strings. Parsed from a comma/semicolon-separated string, or passed through if already an array. Empty array if no tags.
  • source (string, not null): One of "apple_books", "kindle", "readwise", "manual". Auto-detected from entry properties. Current dataset: 1,347 apple_books + 10 manual.
  • word_count (number, not null): Word count of the cleaned highlight text. Computed via text.trim().split(/\s+/).filter(Boolean).length.

Source detection logic (in order):

  1. Has source_url or readwise_url field -> "readwise"
  2. Has location matching pattern ^\d+[-\u2013]\d+$ -> "kindle" (e.g., “123-456”)
  3. Highlight text matches Excerpt From ... copyright pattern -> "apple_books"
  4. Has any location field -> "apple_books"
  5. Otherwise -> "manual"
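The five rules above can be sketched as one function. The field names (source_url, readwise_url, location) follow the raw export schema documented in section 1; the real detection code in process-bookmarks.mjs may differ in detail:

```javascript
// Sketch of the source-detection order described above: readwise URL
// fields first, then Kindle-style location ranges, then Apple Books
// boilerplate, then any location at all, else manual.
function detectSource(entry, highlightText = '') {
  if (entry.source_url || entry.readwise_url) return 'readwise';
  if (entry.location && /^\d+[-\u2013]\d+$/.test(entry.location)) return 'kindle';
  if (/Excerpt From[\s\S]*protected by copyright/i.test(highlightText)) return 'apple_books';
  if (entry.location != null) return 'apple_books';
  return 'manual';
}
```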

ID generation:

SHA-1( lowercase(author) + "||" + lowercase(book_title) + "||" + lowercase(highlight[0:100]) )

Uses only the first 100 characters of the highlight so IDs remain stable even when boilerplate trimming changes the tail.
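The recipe translates directly to Node's crypto module. The function name makeId is illustrative (the real script may name it differently), but the hash input follows the formula above:

```javascript
import { createHash } from 'node:crypto';

// SHA-1 over lowercased author, title, and the first 100 characters of
// the highlight, joined with "||" — per the ID recipe above.
function makeId(author, bookTitle, highlight) {
  const key = [
    String(author).toLowerCase(),
    String(bookTitle).toLowerCase(),
    String(highlight).slice(0, 100).toLowerCase(),
  ].join('||');
  return createHash('sha1').update(key).digest('hex');
}
```

Because only the first 100 characters feed the hash, re-cleaning that trims trailing boilerplate leaves the ID unchanged.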

2.2 reading-log.json — Reading Year Log

Location: public/data/reading-log.json and src/data/reading-log.json (identical copies)
Size: 3,751 lines, 150 entries
Date range: 2019-02-19 to 2020-02-27 (one calendar reading year)
Sort order: Ascending by date_finished (nulls last)

Fields:

  • title (string | null): Book title. Note: the field is title here vs book_title in bookmarks.
  • author (string | null): Author name in "Last, First" format (NOT normalised to "First Last" — differs from the bookmarks schema).
  • date_started (string | null): ISO 8601 date (YYYY-MM-DD). Parsed from M/D/YYYY or M/D/YY CSV format.
  • date_finished (string | null): ISO 8601 date (YYYY-MM-DD). Same parsing as date_started.
  • days_to_read (number | null): Computed: Math.round((date_finished - date_started) / 86400000). Null if either date is missing.
  • rating (number | null): 1-5 integer rating. Null if not rated.
  • gender (string | null): Author gender: "F" (female), "M" (male), "N" (non-binary). Null if unknown.
  • poc (boolean | null): Whether the author is a person of color. true/false/null. Parsed from "Y"/"N" in the CSV.
  • emotions (string[], not null): Array of emotional responses to the book. Known values: "Happy", "Sad", "Angry", "Anger", "Bored", "Empowered", "Interesting", "Funny". See Known Issues for the Anger/Angry problem.
  • emotional_output (string | null): Aggregate sentiment: "Positive", "Negative", "Neutral". Null if not specified.
  • difficulty (number | null): 1-5 integer difficulty rating.
  • publisher (string | null): Publisher name.
  • year_published (number | null): Year of publication as an integer.
  • pages (number | null): Page count of the book.
  • running_pages (number | null): Cumulative page count across all books read (running total).
  • fiction (string | null): One of "Fiction", "Non-Fiction", "Graphic Novel". Normalised from CSV variants.
  • genre (string | null): Free-text genre label (e.g., "Race Studies", "Fantasy", "Productivity", "Autobiography").
  • country (string | null): Country of origin/setting (e.g., "USA", "Russia").
  • why (string | null): Why the book was chosen (e.g., "Word of Mouth", "Curious", "Utility").
  • why_source (string | null): Where the recommendation came from (e.g., "Online Forums", "Friend", "Colleague", "Publicity").
  • review (string | null): Text review. Filtered: entries containing "didn't have time", "did not have time", or "review in progress" are set to null.

Emotion value distribution (150 books):

  • Interesting: 70
  • Happy: 20
  • Sad: 16
  • Bored: 12
  • Funny: 12
  • Anger: 10
  • Empowered: 8
  • Angry: 2

2.3 sprint.json — Homepage Sprint Board

Location: src/data/sprint.json
Usage: Imported directly into index.astro at build time. Drives the sticky-note wall UI.

{
  meta: { date, title, subtitle }
  stats: [{ value, label }]
  columns: [{
    header: string,
    stickies: [{
      color: "g"|"y"|"o"|"r"|"b"|"p"|"w"|"teal",
      size: "big"|null,
      rotation: "r1"-"r7",
      tape: boolean,
      stamp: "DONE"|"IN PROGRESS"|null,
      title: string,
      body: string,
      tag: string|null,
      blocker: boolean?  // optional
    }]
  }]
  bottom_row: [{
    color, rotation, tape: "center"|"left"|"right"|null,
    title, body
  }]
  footer: string
}

2.4 kaomojis.ts — Kaomoji Collection

Location: src/data/kaomojis.ts
Type: export const kaomojis: string[]
Count: 200 kaomojis
Categories: Happy/Wholesome (20), Excited/Celebrating (20), Surprised/Shocked (20), Angry/Table Flip (20), Sad/Crying (20), Cool/Smug (20), Fighting/Determined (20), Love/Affectionate (20), Shrug/Whatever (10), Animals (15), Magic/Sparkle (10), Tired/Done (10), Running/Chaos (10)

Client-side rotation logic (in index.astro):

// Date-seeded: same day -> same kaomojis for every visitor worldwide
const seed = year * 10000 + (month + 1) * 100 + day;
// Mulberry32 PRNG seeded with date
// Each [data-kaomoji] element gets a deterministic pick
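A minimal sketch of the rotation, assuming the seed formula in the comment above and a standard Mulberry32 implementation (the element-binding details in index.astro are omitted, and the function name kaomojiForToday is illustrative):

```javascript
// Mulberry32: a common 32-bit PRNG; returns a function yielding
// deterministic floats in [0, 1) for a given seed.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same day -> same seed -> same pick for every visitor.
function kaomojiForToday(kaomojis, date = new Date()) {
  const seed = date.getFullYear() * 10000 + (date.getMonth() + 1) * 100 + date.getDate();
  const rand = mulberry32(seed);
  return kaomojis[Math.floor(rand() * kaomojis.length)];
}
```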

3. Pipeline Scripts

All scripts live in /Users/weixiangzhang/Local_Dev/projects/bythewei/scripts/.

3.1 extract-datajar.mjs — DataJar Export Extractor

Purpose: Extracts highlight/bookmark data from DataJar JSON exports into a flat JSON array.

Input: DataJar export file in one of three formats:

  • Bare JSON (store.json)
  • ZIP archive (Data Jar YYYY-MM-DD ....zip)
  • .datajar file (YYYY-MM-DD HH.MM.datajar)

Output: JSON array of raw bookmark records with the schema:

{
  "author": "string|null",
  "book_title": "string|null",
  "date": "MMM DD, YYYY|null",
  "time": "string|null",
  "highlights": "string",
  "notes": "string|null",
  "tags": "string|null",
  "source": "datajar",
  "progress": "number|null",
  "location": "string|null"
}

Usage:

# JSON to stdout, summary to stderr
node scripts/extract-datajar.mjs store.json

# JSON to file, summary to stdout
node scripts/extract-datajar.mjs "Data Jar 2025-06-12 19.44.30.zip" datajar-2025.json

Key details:

  • Custom zero-dependency ZIP parser (supports DEFLATE method 8 and stored method 0)
  • Handles data-descriptor entries (bit 3 flag) by falling back to central directory metadata
  • .datajar files use root.json internally; .zip files use store.json
  • Entries without highlight text are skipped
  • Handles both highlights (plural) and highlight (singular) field names
  • Prints summary with date range, top books by highlight count, and sample previews

3.2 process-bookmarks.mjs — Full Bookmark Pipeline (primary)

Purpose: The main processing pipeline. Takes raw Apple Books JSON and produces the clean, deduplicated, privacy-safe bookmarks.clean.json.

Input: Raw bookmark JSON (default: src/data/bookmarks.json)

Output: Written to BOTH:

  • src/data/bookmarks.clean.json
  • public/data/bookmarks.clean.json

Processing steps (in order):

  1. Load raw JSON array
  2. Resolve highlight text (normalise highlight -> highlights key)
  3. Auto-detect source format (apple_books / kindle / readwise / manual)
  4. Strip location field (contains real addresses — privacy)
  5. Clean highlight text (remove Apple Books boilerplate, curly quotes, collapse whitespace)
  6. Drop entries with empty highlights after cleaning
  7. Normalise author names (“Last, First” -> “First Last”)
  8. Generate stable SHA-1 ID for deduplication
  9. Deduplicate by ID (first occurrence wins)
  10. Parse dates to ISO 8601
  11. Compute word count
  12. Sort newest-first
  13. Write output + print summary report
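Step 5 (highlight cleaning) can be sketched as below. This is a minimal version assuming the boilerplate format quoted in section 1.1; the real process-bookmarks.mjs may cover more variants:

```javascript
// Minimal sketch of highlight cleaning: strip the Apple Books
// "Excerpt From ... protected by copyright." block, trim stray curly
// quotes, and collapse runs of blank lines.
function cleanHighlight(text) {
  let t = String(text)
    .replace(/Excerpt From[\s\S]*?This material may be protected by copyright\.?/gi, '')
    .trim();
  t = t.replace(/^[\u201C\u2018]+/, '').replace(/[\u201D\u2019]+$/, '');
  t = t.replace(/\n{3,}/g, '\n\n');
  return t.trim();
}
```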

Usage:

# Default input (src/data/bookmarks.json)
node scripts/process-bookmarks.mjs

# Custom input
node scripts/process-bookmarks.mjs ~/Downloads/apple-books-export.json

3.3 merge-bookmarks.mjs — Incremental Merge

Purpose: Merges a NEW raw export into the existing bookmarks.clean.json without wiping or re-processing existing data. Designed for incremental updates.

Input: Path to a new raw export JSON file

Output: Updated bookmarks.clean.json in both src/data/ and public/data/

Merge strategy:

  1. Load existing bookmarks.clean.json (the live DB)
  2. Index existing entries by ID in a Map for O(1) lookup
  3. Process new entries through the same pipeline as process-bookmarks.mjs
  4. Match by stable SHA-1 ID
  5. New entries are appended; existing entries are preserved unchanged (no overwrites)
  6. Merged list is sorted newest-first
  7. Write back to both output locations
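The merge steps above reduce to a Map keyed by ID. A sketch, assuming entries have already been run through the shared pipeline (file I/O and reporting omitted):

```javascript
// Index existing entries by id, append only unseen incoming entries
// (first occurrence wins — existing entries are never overwritten),
// then sort newest-first by date_read.
function mergeBookmarks(existing, incoming) {
  const byId = new Map(existing.map(e => [e.id, e])); // O(1) lookup
  let added = 0, skipped = 0;
  for (const entry of incoming) {
    if (byId.has(entry.id)) { skipped++; continue; }  // preserve existing
    byId.set(entry.id, entry);
    added++;
  }
  const merged = [...byId.values()]
    .sort((a, b) => String(b.date_read ?? '').localeCompare(String(a.date_read ?? '')));
  return { merged, added, skipped };
}
```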

Usage:

node scripts/merge-bookmarks.mjs ~/Downloads/apple-books-export-march.json

Output report:

New entries added        : 42
Already existed (skipped): 315
Empty/dropped            : 3
Total in DB now          : 1399

3.4 strip-location.mjs — Legacy Processor (superseded)

Purpose: The original bookmark processor. Superseded by process-bookmarks.mjs but still functional.

Differences from process-bookmarks.mjs:

  • No SHA-1 ID generation (deduplicates by exact author||book_title||highlights string match)
  • No source detection
  • No author name normalisation
  • No word count computation
  • Preserves date, time, progress, notes, tags as-is (does not transform to ISO dates)
  • Simpler output schema (no id, date_read, source, word_count fields)

Usage:

node scripts/strip-location.mjs [input]
# Default input: src/data/bookmarks.json

3.5 convert-reading-log.mjs — CSV to JSON Converter

Purpose: Converts the Google Sheets CSV reading log into JSON.

Input: Hardcoded path: ~/Downloads/Reading Data - Primary.csv

Output: Written to BOTH:

  • src/data/reading-log.json
  • public/data/reading-log.json

Processing details:

  • Custom RFC-4180 CSV parser (handles quoted fields with embedded commas and newlines)
  • Expects exactly 20 columns per row
  • Date parsing: M/D/YYYY or M/D/YY (2-digit years treated as 20xx) -> YYYY-MM-DD
  • days_to_read computed from start/finish dates
  • Emotions parsed from comma-separated string to array
  • Reviews filtered: “didn’t have time” / “review in progress” -> null
  • Fiction normalised to exact enum values
  • Sorted ascending by date_finished (nulls last)
  • Prints stats: total count, date range, rating distribution, top 10 genres, review counts, parsing errors
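The date-parsing rule above can be sketched as a single function; the real converter may reject out-of-range months or days:

```javascript
// M/D/YYYY or M/D/YY (2-digit years treated as 20xx) -> YYYY-MM-DD.
// Returns null for anything unparseable.
function toIsoDate(raw) {
  const m = String(raw ?? '').trim().match(/^(\d{1,2})\/(\d{1,2})\/(\d{2,4})$/);
  if (!m) return null;
  let [, month, day, year] = m;
  if (year.length === 2) year = '20' + year; // 2-digit years -> 20xx
  return `${year}-${month.padStart(2, '0')}-${day.padStart(2, '0')}`;
}
```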

Usage:

node scripts/convert-reading-log.mjs

Column mapping (0-indexed from CSV headers):

0: Date Started    -> date_started
1: Date Finished   -> date_finished
2: Title           -> title
3: Author          -> author
4: Gender          -> gender
5: POC             -> poc
6: Rating          -> rating
7: Emotions        -> emotions
8: Emotional Output -> emotional_output
9: Difficulty      -> difficulty
10: Publisher      -> publisher
11: Year Published -> year_published
12: Pages          -> pages
13: Running Pages  -> running_pages
14: Fiction or Non -> fiction
15: Genre          -> genre
16: Country        -> country
17: Why            -> why
18: Why Source     -> why_source
19: Review         -> review

3.6 verify-clean.mjs — Quality Check

Purpose: Quick verification script to spot-check the cleaned bookmark data.

Input: Reads src/data/bookmarks.clean.json (hardcoded relative path)

Checks performed:

  1. Dumps first 5 entries showing book, author, highlight start/end (JSON-escaped for visibility)
  2. Counts entries still containing Apple Books boilerplate (“excerpt from”, “this material may”)
  3. Counts entries with leading/trailing whitespace in highlights
  4. Reports total count, shortest highlight, longest highlight, median highlight length
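Check 2 amounts to a case-insensitive substring scan. A sketch, using the cleaned-schema field name highlights; the real verify-clean.mjs may print per-entry detail:

```javascript
// Count entries whose cleaned highlight still contains Apple Books
// boilerplate fragments ("excerpt from", "this material may").
function countBoilerplateResidue(entries) {
  const needles = ['excerpt from', 'this material may'];
  return entries.filter(e => {
    const text = String(e.highlights ?? '').toLowerCase();
    return needles.some(n => text.includes(n));
  }).length;
}
```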

Usage:

node scripts/verify-clean.mjs

4. Data Flow

4.1 Bookmark Pipeline

                    DataJar app (iOS)
                         |
                         v
              extract-datajar.mjs          Apple Books (direct JSON export)
                    |                                |
                    v                                v
              raw JSON array                   raw JSON array
                    |                                |
                    +----------+---------------------+
                               |
                               v
                     src/data/bookmarks.json
                        (raw, with location + boilerplate)
                               |
               +---------------+---------------+
               |                               |
               v                               v
     process-bookmarks.mjs             merge-bookmarks.mjs
     (full rebuild)                    (incremental update)
               |                               |
               +---------------+---------------+
                               |
                               v
                   bookmarks.clean.json
                               |
               +---------------+---------------+
               |                               |
               v                               v
    src/data/bookmarks.clean.json    public/data/bookmarks.clean.json
               |                               |
               v                               v
       (available at build time)      (served at /data/bookmarks.clean.json)
                                               |
                                               v
                                    index.astro client-side fetch()
                                               |
                                    +----------+----------+
                                    |          |          |
                                    v          v          v
                                  QOTD     Catalog    Journal
                               (quote of  (book list  (timeline
                                the day)   modal)      modal)

Key points:

  • bookmarks.clean.json is written to TWO locations: src/data/ (for build-time import) and public/data/ (for runtime client-side fetch)
  • The client fetches from /data/bookmarks.clean.json at page load, not at build time
  • QOTD selects a highlight deterministically based on the current date
  • The same fetched data powers the catalog modal (group by book, filter) and journal modal (timeline view)

4.2 Reading Log Pipeline

    Google Sheets ("Reading Data - Primary")
                    |
                    v (manual CSV download)
          ~/Downloads/Reading Data - Primary.csv
                    |
                    v
          convert-reading-log.mjs
                    |
          +---------+---------+
          |                   |
          v                   v
  src/data/reading-log.json   public/data/reading-log.json
          |                                |
          v                                v
  (build-time reference)        (served at /data/reading-log.json)
                                           |
                                           v
                                index.astro client-side fetch()
                                           |
                                           v
                                   "THE READING YEAR" modal
                                   (hidden bookshelf UI)

Key points:

  • The reading log fetch is lazy — it only triggers when the user opens the hidden reading year modal (via the shelf trigger pin)
  • The modal shows stats (genre breakdown, emotion distribution, author demographics) computed client-side from the JSON
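The client-side stats reduce to simple tallies over the fetched JSON. A sketch of the emotion-distribution part (function name is illustrative; the actual index.astro code may differ):

```javascript
// Tally each emotion string across reading-log entries. Note that
// "Anger" and "Angry" count separately unless normalised upstream —
// see Known Issues 6.1.
function emotionCounts(entries) {
  const counts = {};
  for (const e of entries) {
    for (const emotion of e.emotions ?? []) {
      counts[emotion] = (counts[emotion] ?? 0) + 1;
    }
  }
  return counts;
}
```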

4.3 Static Data (no pipeline)

  src/data/sprint.json  -----> imported at build time by index.astro
                                -> renders sticky note wall

  src/data/kaomojis.ts  -----> imported at build time by index.astro
                                -> injected as define:vars for client-side rotation script

5. Cross-Referencing

5.1 Dataset Overlap

The two primary datasets — bookmarks and reading log — have zero book title overlap. They cover entirely different time periods and use different title/author schemas:

  Property          bookmarks.clean.json              reading-log.json
  Time period       Jan 2022 - Feb 2026               Feb 2019 - Feb 2020
  Entry count       1,357 highlights                  150 books
  Unique books      156                               150
  Title field       book_title                        title
  Author format     "First Last" (normalised)         "Last, First" (CSV original)
  Books in common   0                                 0
  Granularity       Per-highlight (many per book)     Per-book (one entry per book)

5.2 Schema Differences

The two datasets were designed independently and have several naming inconsistencies:

  Concept       Bookmarks                  Reading Log                          Notes
  Book title    book_title                 title                                Different field names
  Author        author ("First Last")      author ("Last, First")               Different name order
  Date          date_read / date (ISO)     date_started / date_finished (ISO)   Different semantics
  Rating        (none)                     rating (1-5)                         Bookmarks have no rating
  Emotions      (none)                     emotions (array)                     Bookmarks have no emotions
  Word count    word_count                 pages                                Different unit of measurement
  Genre         (none)                     genre                                Bookmarks have no genre
  Source        source (auto-detected)     (none)                               Reading log has no source

5.3 Potential Future Unification

If the datasets were to be merged or cross-referenced:

  • Author normalisation would need to be applied to reading-log data (“Last, First” -> “First Last”)
  • Title field would need aliasing (title <-> book_title)
  • The gap between Feb 2020 and Jan 2022 means there are ~2 years of untracked reading

6. Known Issues

6.1 “Anger” vs “Angry” Emotion Normalization

The reading log CSV source uses both "Anger" (10 occurrences) and "Angry" (2 occurrences) to represent the same emotion. The convert-reading-log.mjs script does NOT normalise these — it passes emotions through from the CSV as-is via simple comma-split:

function parseEmotions(str) {
  if (!str || str.trim() === '') return [];
  return str.split(',').map(e => e.trim()).filter(Boolean);
}

Impact: Any client-side code that groups or counts by emotion will treat “Anger” and “Angry” as separate categories. The current distribution is:

  • "Anger": 10 books
  • "Angry": 2 books

Fix: Add normalisation in parseEmotions() or in the CSV source itself. Recommended target: "Angry" (adjective form, consistent with "Happy", "Sad", "Funny").
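A hedged sketch of that fix, keeping the original comma-split and folding "Anger" into "Angry" via a small alias map (the alias table is an assumption; extend it if other variants turn up in the CSV):

```javascript
// Suggested fix: same parsing as the original parseEmotions, plus an
// alias map so "Anger" collapses into "Angry".
const EMOTION_ALIASES = { Anger: 'Angry' };

function parseEmotions(str) {
  if (!str || str.trim() === '') return [];
  return str
    .split(',')
    .map(e => e.trim())
    .filter(Boolean)
    .map(e => EMOTION_ALIASES[e] ?? e);
}
```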

6.2 Orphan Title Problem

There is 1 entry in bookmarks.clean.json with an empty book_title (empty string ""):

  • Author: empty string
  • Date: 2022-01-02
  • Highlight preview: "This is an" (truncated)
  • Root cause: The raw Apple Books export contained an entry with no book metadata. The processing pipeline preserves entries as long as they have non-empty highlight text, even without book/author data.

Impact: This entry will appear in the QOTD rotation without attribution. In the catalog modal, it would appear under an empty book title.

6.3 Author Name Format Inconsistency

  • Bookmarks: Authors are normalised to “First Last” format by normaliseAuthor() in process-bookmarks.mjs (e.g., “Huffer, Lynne” -> “Lynne Huffer”)
  • Reading log: Authors remain in “Last, First” format from the CSV (e.g., “Rios, Victor”)

This means the same author would appear differently in each dataset, preventing naive string matching for cross-referencing.
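A sketch of the normalisation that would need to be applied to reading-log authors before cross-referencing. It handles only the simple single-author, single-comma case, matching the "simple single-author cases" caveat in section 2.1; the real normaliseAuthor() in process-bookmarks.mjs may differ:

```javascript
// "Last, First" -> "First Last" for the simple two-part case;
// anything else is passed through unchanged.
function normaliseAuthor(name) {
  const parts = String(name ?? '').split(',').map(p => p.trim());
  if (parts.length !== 2 || !parts[0] || !parts[1]) return String(name ?? '').trim();
  return `${parts[1]} ${parts[0]}`;
}
```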

6.4 Duplicate date / date_read Fields

Every entry in bookmarks.clean.json carries both date_read and date with identical values. The date field exists solely for backward compatibility with index.astro’s QOTD code, which reads entry.date. This duplication adds ~20KB to the JSON file. The comment in process-bookmarks.mjs documents this:

// 'date' alias kept for QOTD backward-compatibility (index.astro reads entry.date)
date: dateIso,

6.5 Hardcoded CSV Path

The convert-reading-log.mjs script has a hardcoded absolute path for its CSV input:

const CSV_PATH = '/Users/weixiangzhang/Downloads/Reading Data - Primary.csv';

This is not configurable via command-line arguments (unlike the bookmark scripts). Running on a different machine or after moving the CSV will fail silently.

6.6 strip-location.mjs Superseded but Not Removed

The original strip-location.mjs is still in the scripts directory but has been functionally replaced by process-bookmarks.mjs, which does everything strip-location.mjs does plus adds SHA-1 IDs, source detection, author normalisation, and word counts. Running strip-location.mjs would produce output in a different schema than what the site expects.

6.7 DataJar source Field Mismatch

Entries extracted via extract-datajar.mjs are tagged with source: "datajar", but after processing through process-bookmarks.mjs, the source is re-detected based on entry properties and typically overwritten to "apple_books" or "manual". The "datajar" source value does not appear in the final bookmarks.clean.json.

6.8 No Validation of Emotion Values

The reading log pipeline does not validate emotion strings against a known set of values. Any string in the CSV emotions column is accepted. This is how "Anger" and "Angry" both ended up in the data — they were entered inconsistently in the spreadsheet and passed through without validation.

6.9 running_pages Inconsistency

The first entry in the reading log (Human Targets) has running_pages: 7700, which is far higher than the book’s 224 pages and inconsistent with the second entry (Circe) having running_pages: 393. This suggests the running pages counter was either reset partway through the reading year or was pre-seeded from prior reading. The field comes directly from the CSV without validation or recomputation.