
@carapace Some of the most interesting ideas I've seen are in the Plan 9 OS and specifically its 9P protocol:

https://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs

https://en.wikipedia.org/wiki/9P_(protocol)

That includes a /webfs concept where remote networked resources are accessible via filesystem semantics. That's a concept that's been adopted to some extent on other operating systems, notably Sun Solaris and its ability to automount NFS shares (something I've seen ... abused rather heavily in some shops), and in some Linux filesystems, largely using FUSE (Filesystem in USErspace) https://en.wikipedia.org/wiki/Filesystem_in_Userspace.

I'll note that when you're on the systems side of things it's quite helpful to have canonical and invariant names for data resources. Mixing and matching this with a documents-oriented filesystem might not lead to happy places.

#Filesystems #webfs #docfs

5/end/

@carapace One notion I'd arrived at was that in the case of catalogue access, search is identity.

That is, a search which turns up a single record or document is definitionally an identity for that document.

That identity might be a standard assigned value, such as an ISBN, DOI, or Library of Congress call number, or it could be some distinct set of parameters, say, a combination of author, title, and publication date, which return a single record.

Note that a search which is an identity in one archive or at one point in time might not be an identity for another: an identity returns a single record, whereas a search might return several, one, or no records.

One notion I have is of using a filesystem-like syntax for search, so that, say, /docfs/au:steinbeck/ti:grapes might turn up records related to John Steinbeck's Grapes of Wrath. Here, /docfs is a virtual filesystem which provides an interface into the documents filesystem. Specific assigned identifiers might be referenced as /docfs/id:isbn:0330881043 (again: Steinbeck's Grapes of Wrath).
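A minimal sketch of how such a path might be split into search terms. The function name and the rule that each path segment is a single field:value pair are assumptions for illustration; only the au:, ti:, and id: prefixes come from the examples above.

```python
from typing import Dict


def parse_docfs_path(path: str, root: str = "/docfs") -> Dict[str, str]:
    """Split a /docfs-style path into {field: value} search terms.

    Hypothetical helper: each path segment is assumed to be "field:value",
    split on the first colon only, so "id:isbn:0330881043" yields the
    field "id" with the value "isbn:0330881043".
    """
    if path != root and not path.startswith(root + "/"):
        raise ValueError(f"not a {root} path: {path!r}")
    terms: Dict[str, str] = {}
    for segment in path[len(root):].split("/"):
        if not segment:
            continue  # empty pieces come from leading or trailing slashes
        field, sep, value = segment.partition(":")
        if not sep or not field or not value:
            raise ValueError(f"expected field:value, got {segment!r}")
        terms[field] = value  # a repeated field keeps its last value
    return terms


if __name__ == "__main__":
    print(parse_docfs_path("/docfs/au:steinbeck/ti:grapes"))
    print(parse_docfs_path("/docfs/id:isbn:0330881043"))
```

Each additional segment narrows the search, which is what lets a sufficiently specific path act as an identity.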

#Filesystems #webfs #docfs

@carapace I've put a fair bit of thought into how a document-oriented filesystem (in the #PaulOtlet sense of "document") might function. To the extent I've thought this out, it's somewhat modelled on how a physical library is organised: there is the actual storage ("stacks"), and there's the interface to that storage, ("catalogue").

A document is any contained information. It might be a text, image, sound, video, multimedia, or data record, or combination of these.

The stacks contain works. The catalogue provides ways of accessing those works, and any given work might appear in or be accessed through the catalogue in any number of different ways.

A huge challenge for any such metadata-based system is that metadata itself requires design and creation, and this remains hugely cumbersome for data presently. There's some useful metadata associated with filesystems, though much of that is at a systems rather than document level, and some metadata (say, file creation / modify / access timestamps) bears little if any relation to the underlying document. Tracking document-related metadata would be a huge step forward.

Relying on extant and often external metadata would also be useful. Library of Congress, OCLC, IMDB, CDDB, DOI, ISBN, and related records would be quite useful for classifying existing works. Some set of useful standards for other common records (personal computer files, system logs, receipts, memos, correspondence, online interactions) might also be useful. The more that metadata creation can be both automated and made useful (and no, "New Document" is not a useful title) the better.

#Filesystems #webfs #docfs

@carapace The problems with replacing the classic hierarchical filesystem are much the same as swapping out any other piece of well-established standards:

You've got to have an exceptionally compelling alternative.
There's a hell of a lot of legacy that relies on extant systems.
Agreeing on a specific replacement (or set of replacements) creates its own huge coordination problem.

I'd be interested in hearing how you're addressing each of those points.

#Filesystems #webfs #docfs

@carapace One question I'd toss out is: where did the notion of hierarchical filesystems first emerge?

I'm familiar with Linux / Unix, a whole slew of PC-based systems (DOS, CMS, Classic Mac), as well as MVS (TSO/ISPF) and VMS. Linux is certainly where I feel most at home.

IBM mainframes (MVS) had a one-level hierarchy: effectively, you could create any number of folders at the root filesystem level, and place files in those, but nested folders weren't A Thing. I suspect that in this regard, IBM was trying to emulate paper-based filing systems where cabinets held folders and folders held individual records, but nesting was distinctly limited.

Nested filesystems may date to Multics if GPT is to be trusted. Wikipedia supports this: https://en.wikipedia.org/wiki/File_system

#Filesystems #webfs #docfs

@RussSharek Ayup. I'm headed that way.

One of my recent finds that's been game-changing has been "Save as ePub", a feature of the #Einkbro browser (Android). That's a fork of the FOSS Browser, which might have similar functionality.

Effectively, you can save a Web article as an ePub, or append it to an existing ePub, which means you can effectively "build your own book" of relevant content (a project, good articles over a specific time period, work-related project, stuff to share with someone else). For tablets / mobile devices this is about the best option I've found, preferable to saving PDFs, with the one exception that most metadata concerning the saved content is lost. I'm not sure the source URL is kept; the date is certainly lost.

The #webfs and #docfs tags in my first toot above refer to a project I've been kicking around for managing documents and articles, both Web and otherwise. I'm tending strongly toward a plain-text baseline format (with markup languages such as Markdown, LaTeX, djot, etc., being ways of extending basic structure and capabilities), also with extensive bibliographic metadata. It's all pretty much vapourware but it's fun to think about.

So, Pocket, the article-archival tool that keeps getting worse the more you use it, has just become immeasurably worse.

I've reverted from version 8.6.x to no, not 8.5, not 8.4, not 8.2, but 8.1.1.0 from freaking February of this year to revert these completely fucking brain-dead changes.

The TL;DR: link is https://www.apkmirror.com/apk/mozilla-corporation/pocket/pocket-8-1-0-0-release/

That's what you want to install and freeze on until Pocket catches a motherfucking clue.

I've had a long and unhappy relationship with this feature and app. Its sole claims to my continued use are that it holds nearly 5 GB of content hostage, and that it, unbelievably, seems to be the best of what is an immensely shitty application space. See my now-six-year-old rant virtually all of which remains valid: https://web.archive.org/web/20190512092903/https://old.reddit.com/r/dredmorbius/comments/5x2sfx/pocket_it_gets_worse_the_more_you_use_it/#

Most recently, Pocket has lost two features:

A "page flip" mode, which though itself hugely flawed, is better than scrolling through articles, especially on e-ink devices.
The ability to view all articles either in the (hugely preferable, very useful) #ReadabilityJS view, or in-app in a "web view". The latter now revert to your device's default Web Browser app on mobile devices.

The problem with that latter is that the task of annotating and tagging articles (my principal remaining justification for Pocket) is made vastly more tedious --- and it's already more than adequately tedious in previous Pocket versions. To the point it's not even worthwhile.

Fortunately, I was able to hunt down a prior version of the app (using the APKMirror app), and I will not be upgrading Pocket beyond the most recent version I can find which still supports both Page Flip and Web View modes: as noted above, 8.1.1.0 from 17 February 2023. (Few if any of Pocket's "improvements" over the past five years have had any value to me whatsoever, so this is little loss.)

There is of course a Relevant xkcd: "Software Updates":

https://xkcd.com/2224/

I would so like to see a useful document-management solution for tablets and e-ink devices with the ability to manage both offline and online (Web-based) content.

Boosts and re-sharing this on other platforms are strongly encouraged.

Edits: I'm updating this toot as I'm finding out more. In particular, what version(s) of Pocket are NOT affected by these changes is not yet clear.

#Pocket #GetPocket #MozillaPocket #Mozilla #ApkMirror #EInk #DocumentManagement #xkcd #xkcd2224 #kfc #webfs #docfs

xkcd comic 2224, "Software Updates"

Consisting of:

A graph with two axes:
- X-axis: Time
- Y-axis: Software Version Number

The graph consists of three lines.

Two gray lines move in upward steps from left to right, denoting the highest and lowest supported versions over time.

The third line is black and denotes the author's installed software version. It begins within the supported region, but eventually stops moving upwards (indicating no further updates over time), and eventually extends outside the supported region entirely.

Upper gray line: Newest Version

Bottom gray line: Oldest Supported Version

Light gray area: Support Zone

Black line: My current version

Start dot: First Install

Second dot: An update finally breaks a feature I'm unwilling to lose

End of black line: ???

Arrow label: The Abyss

Caption below chart: All software is Software as a Service.

Alt text: "Everything is a cloud application; the ping times just vary a lot."

Description adapted from Explain xkcd: <https://explainxkcd.com/wiki/index.php/2224:_Software_Updates>

Whitespace in filenames is a major category error IMO.

OTOH, filenames themselves (and filesystems as presently incarnated) are also grossly insufficient for many needs. It's interesting to note, for example, that on Android (and possibly iOS), databases (usually sqlite) have emerged as the de-facto default persistent data storage mechanism, even for content which would normally be held on a filesystem.

I've long been looking at questions such as what a document-oriented filesystem (#docFS) or the World Wide Web made accessible as a filesystem (#webFS) might look like.

For documents, I've generally arrived at a naming standard which uses underbars (_) to separate elements, hyphens (-) for standard whitespace, and double dashes (--) to indicate punctuated / multiple elements (e.g., multiple authors, or a subtitle following a colon or dash). Permitted characters are otherwise 7-bit ASCII alphanumeric ([A-Za-z0-9]), with dot as a file extension separator only, and possibly parentheses.

So:

Author-One--Author-Two_Title--Subtitle_YYYY.filetype

That might have a publisher or journal title added (an additional underbar-delimited element after the title(s)). Additional contributors (e.g., editors, translators) might be mentioned. And it's possible some identifier (ISBN, OCLC, DOI, LoC call number) might be added, though those are supplemental.

The idea isn't to fully and completely or precisely represent all aspects of a document or work, but to usefully do so. So yes, that means that foreign character sets aren't represented, that full author lists aren't included (for scientific papers these can number in the tens to hundreds), etc. But enough to find the work reasonably within a corpus through a directory listing.

Yeah, I'm familiar with Calibre, Zotero etc., and should really get more familiar with them. But they're clunky enough and not sufficiently universally available (e.g., on Android, where most of my documents live these days, via an e-book reader) that I'm not optimistic they're really a solution.

(Hoisted from a limited share.)

#DocumentManagement #Whitespace #OnTheNamingOfCats #OnTheNamingOfFiles #Whatever #SameThing #RockyHorror #MacavitysNotHere #Bombalurina #Effanineffable #OldPossum #TSEliot #DOS #PaulOtlet #Mundaneum

@alcinnz So, effectively a filetype:application association manager. file(1) and magic(5) on steroids.
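For illustration, a minimal sketch of the file(1)-style idea: identify content by its leading bytes, then look up an associated application. The signatures shown are the well-known PDF, PNG, and ZIP headers, plus the ePub convention of an uncompressed "mimetype" entry first in the archive. The handler names are purely hypothetical placeholders, and this is far cruder than the real magic(5) database.

```python
from typing import Dict, List, Tuple

# Leading-byte signatures mapped to MIME types (a tiny, hand-picked subset).
SIGNATURES: List[Tuple[bytes, str]] = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
]

# In an ePub the first ZIP entry is named "mimetype" and is stored
# uncompressed, so name and content sit right after the 30-byte header.
EPUB_MARKER = b"mimetypeapplication/epub+zip"

# Hypothetical filetype-to-application associations.
HANDLERS: Dict[str, str] = {
    "application/pdf": "pdf-viewer",
    "application/epub+zip": "ebook-reader",
    "image/png": "image-viewer",
}


def sniff(data: bytes) -> str:
    """Guess a MIME type from the first bytes of a file."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            if mime == "application/zip" and data[30:58] == EPUB_MARKER:
                return "application/epub+zip"
            return mime
    return "application/octet-stream"


def handler_for(data: bytes) -> str:
    """Return the associated application name, or "unknown"."""
    return HANDLERS.get(sniff(data), "unknown")
```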

I am thinking of managing metadata associated with documents, works (multiple forms / manifestations of a single document), projects and workflows (involving various records, etc), and the overall document lifecycle: creation, acquisition, cataloguing, use, adaptation, distribution, destruction.

That's what I've lumped under my #webfs and #docfs concepts, along with #kfc (Krell Functional/Fucking Context).

@billjanssen Thanks again. Some of that looks ... closer. Cone Tree and Perspective Wall most so, though still not quite there.

Are you associated with this research/development, or just an interested party?

One thing I've thought about considerably as I'm increasingly using e-book readers and being frustrated by their own document management / organisational limitations, is how physical library space maps, with multiple dimensional convolutions, to stored data:

There's a mix of physical and logical organisations:

character -> word -> line -> page -> signature -> book

character -> word -> sentence -> paragraph -> chapter -> book

Shelf -> bookcase -> aisle -> floor -> building

A book (nominally: 250 pages) is about 125k words.

About 32 books fit to a shelf, 8 shelves to a bookcase, say, 16 bookcases to an aisle, 16 aisles to a floor. (I'm biasing to powers-of-two numbers here)

That's 256 books per case, 4,096 per aisle, 65,536 per floor.

(A fairly large community library is on the order of 300k books, or about 4.6 floors as I've defined them. A large university library, 122 such floors, or roughly 8 million books. Based on my experience, I may be underspecifying density, and would be interested in actual data.)
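The arithmetic above, worked through as a short sketch. The constants are the powers-of-two figures from the toot; the helper name is only for illustration.

```python
# Powers-of-two figures from the toot above.
BOOKS_PER_SHELF = 32
SHELVES_PER_CASE = 8
CASES_PER_AISLE = 16
AISLES_PER_FLOOR = 16

books_per_case = BOOKS_PER_SHELF * SHELVES_PER_CASE    # 256
books_per_aisle = books_per_case * CASES_PER_AISLE     # 4,096
books_per_floor = books_per_aisle * AISLES_PER_FLOOR   # 65,536


def floors_needed(books: int) -> float:
    """Number of such floors needed to hold a collection of this size."""
    return books / books_per_floor


if __name__ == "__main__":
    print(f"community library, 300k books: {floors_needed(300_000):.1f} floors")
    print(f"122 floors hold {122 * books_per_floor:,} books")
```

On these figures a 300k-book library works out to a little over four and a half floors, and 122 floors to just under 8 million books.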

And so on.

The point I'm trying to make though isn't about density but of navigation of that space. The reader/researcher can go to a specific book, or to a shelf (closely related works), an aisle, a floor, etc. There's a different level of aggregation at each point in the scale, and for topically-organised (e.g., Library of Congress classification or Dewey Decimal), a specific region corresponds largely with a specific subject grouping.

On my e-book reader, I'm effectively limited to only one level of aggregation: a sequential shelf scan of books. With storage exceeding several TB, and an average book size of ~1--5 MB, that's effectively a fairly large community library worth of potential documents which can be carried in one's hand or satchel, but for which the organisational capabilities are ... exceedingly limited.
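As a rough check on that capacity claim, a sketch assuming 2 TB of storage. "Several TB" is not pinned down in the toot, so that figure is an assumption, as is the use of decimal units.

```python
TB = 10**12  # decimal terabyte
MB = 10**6   # decimal megabyte


def book_capacity(storage_bytes: int, book_bytes: int) -> int:
    """Whole books that fit in the given storage."""
    return storage_bytes // book_bytes


storage = 2 * TB  # assumed device size
fewest = book_capacity(storage, 5 * MB)  # large 5 MB books
most = book_capacity(storage, 1 * MB)    # small 1 MB books

if __name__ == "__main__":
    print(f"{fewest:,} to {most:,} books")
```

Even at the large-file end that is more books than the 300k community-library figure, all reachable only through a flat list.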

This remains a major frustration of mine.

#KFC #DocFS #WebFS #Libraries #DocumentManagement

@Researchbuzz The proximity element is limited as I am, of course, on Altair IV, some 20 of your light years away.

That said, one of my obsessions (though not necessarily a major element of my Mastodon tooting) is information, knowledge, and document management.

The tags #kfc, #webfs, and #docfs will lead to a few of my information-management / search toots / threads.

And if you've got opinions, feelings, and/or deep intel on #PaulOtlet and his #Mundaneum I'm all ears.

@woozle

I see your open-plan office AND RAISE YOU!!!

The Central Social Institution of Prague. It’s apparently still in operation.

https://www.vintag.es/2020/01/central-social-institution-prague.html

#CentralSocialInstitution #Prague #Czechia #DataStorage #InformationManagement #KFC #DocFS #WebFS

The Central Social Institution of Prague.

A vast hall in which banks of file drawers 25 high and about 20 across are arranged into 10 banks, each fronted by a desk which can move vertically and horizontally to 7 or 8 meters off the floor, strongly resembling the flying desks in Futurama (though the inspiration doubtless runs the other direction).

There are 3,000 drawers in total, covering 4,000 square feet (370 m^2).

The effective data storage is on the order of a few hundred GB. You could fit the same on your thumbnail today.

The structure was opened in 1937.

@jonny My principles here are:

The filename should be descriptive and not simply unique.
It should be human-meaningful in some manner if at all possible.
It should scope to the collection size / namespace.

Estimates I'm aware of are that there are on the order of 100--200m books ever published, growing at ~1m per year, and a generally comparable set of scientific articles. News organisations such as Reuters, AP, and AFP produce about 1k--5k items daily, and I suspect many of those are photos or videos. Major newspapers tend to produce about 100--500 stories daily (weekday vs. weekend). You can work out ballpark maths from that.
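One way to work out those ballpark maths, as a sketch. The daily figures are the estimates above; the identifier-length helper is an added illustration (not part of the naming scheme) of how namespace size bounds the shortest possible unique name, assuming a 36-symbol alphabet of case-insensitive letters plus digits.

```python
from typing import Tuple


def annual(daily_low: int, daily_high: int, days: int = 365) -> Tuple[int, int]:
    """Scale a daily range of items to a yearly range."""
    return daily_low * days, daily_high * days


def min_id_length(items: int, alphabet: int = 36) -> int:
    """Shortest fixed-length code able to label `items` things uniquely.

    Assumption: a 36-symbol alphabet (A-Z plus 0-9, ignoring case).
    Integer arithmetic avoids floating-point log rounding.
    """
    length, capacity = 1, alphabet
    while capacity < items:
        capacity *= alphabet
        length += 1
    return length


if __name__ == "__main__":
    print("news agency items per year:", annual(1_000, 5_000))
    print("newspaper stories per year:", annual(100, 500))
    print("characters to number 200m books:", min_id_length(200_000_000))
```

Six opaque characters would cover every book ever published, which shows how much of a descriptive filename is spent on being human-meaningful rather than merely unique.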

For correspondence, the originator and recipient ("From:" and "To:") are both significant. Those might be referenced. Publishing, to a general audience, is in a sense correspondence where "From:" == Author and "To:" == World.

The filename need not be precise, exact, or an accurate presentation of contents, but USEFUL. That is, within a corpus, can I find a specific work or works of interest? In this sense, the titling scheme is an example of the principle I've developed that search is identity, in the sense that a search might produce 0, 1, or n>1 results. 0 is null, 1 is identity, and > 1 is a result set.

There are other naming and cataloguing schemes. A complete system would have correspondences between these and the conventional / human-readable titles, e.g., ISBN, LOCCS, OCLC, DOI, etc.

And yes there are other cataloguing systems such as SuDoc (used by the US government) which are useful in their own contexts.

Author, date, content, audience, and publisher are generally useful search-space-reducing concepts of fairly general applicability. E.g., if I were including, say, store receipts or purchase orders, then the vendor, customer, date, location, and a summary of contents (say, largest item) would serve as a description. Computer logs tend to be time and process/service oriented, perhaps also mentioning user or network address, etc.

Related hashtags and discussion:

#docfs #webfs #KFC #PaulOtlet #Mundaneum

@vertigo I've heard that.

Also Wallabag, if you are interested in self-hosting.

I'd like to see curation, full-text search, and robust metadata.

Been kicking around a document-oriented filesystem (#docfs) for a few years, probably will always remain vapour.

@Valenoern This is the essential idea behind "docfs", which would be a document-oriented filesystem. Its networked sibling being "webfs".

"Document" here is in the sense of #PaulOtlet, of any durable record. That might be a text, image, sound, video, multimedia content, data, software, or an amalgamation or melange.

One of my key ideas is that the metadata for these documents would be part of the filesystem, extending the notion of what constitutes file-centric data. I'd like to see some form of bibliographic data presented, where available for public and published media (book, articles, audio recordings, films).

Search is another element, and one idea for the filesystem would be as a virtual filesystem in which attributes could be supplied until a single item matching those criteria was found. "Identity is search".

For projects, some concept of structured workflows, with groups, tasks, milestones, and contributing data. For a sufficiently structured organisation, security and access controls.

I'd like the whole concept to be as commercialisation-hostile as possible, with both copyrights and payments entirely out of scope.

#docfs #webfs #kfc #Mundaneum #DublinCore #metadata #bibliography #Plan9OS #Schopenhauer

@CyberpunkLibrarian I'd very much like that.

I've been half-assedly kicking around an idea to build such a thing, generally referred to as #KFC (Krell Functional Context / Krell Fucking Context, variously). See also #WebFS and #DocFS which relate: accessing the Web as a filesystem (see Plan9OS) and a documents-oriented filesystem in which "paths" are actually "search queries" through various spaces (author, title, pubdates, subjects / keywords, publishers, identifiers ISBN/OCLC/LOCCN/DOI, etc).

The results of any path specification are strictly one of:

No results (a failed search).
One result (an identity search, at least at the time performed).
Multiple results (a set), which might be variously small or large. (I'm thinking of some vaguely logarithmic scale for classifying this.)
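The three outcomes above can be sketched like so. The tiny in-memory catalogue, the case-insensitive substring matching, and the choice of order of magnitude (number of digits minus one) as the "vaguely logarithmic" scale are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]

# Illustrative sample records only.
CATALOGUE: List[Record] = [
    {"au": "Steinbeck", "ti": "The Grapes of Wrath"},
    {"au": "Steinbeck", "ti": "East of Eden"},
    {"au": "Eliot", "ti": "Old Possum's Book of Practical Cats"},
]


def search(terms: Dict[str, str], catalogue: List[Record]) -> List[Record]:
    """Return records where every term is a substring of its field."""
    return [
        rec for rec in catalogue
        if all(value.lower() in rec.get(field, "").lower()
               for field, value in terms.items())
    ]


def classify(results: List[Record]) -> Tuple[str, int]:
    """Return (kind, magnitude) for a result list.

    kind is "none", "identity", or "set". magnitude is the order of
    magnitude of the count (digits minus one), an assumed scale.
    """
    n = len(results)
    if n == 0:
        return ("none", 0)
    if n == 1:
        return ("identity", 0)
    return ("set", len(str(n)) - 1)


if __name__ == "__main__":
    for terms in ({"au": "orwell"},
                  {"au": "steinbeck", "ti": "grapes"},
                  {"au": "steinbeck"}):
        print(terms, "->", classify(search(terms, CATALOGUE)))
```

The same query can change class as the catalogue grows, which is why an identity holds only at the time the search is performed.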

I'd also like to see workflow included, some sense of a cataloguing workflow (desired, acquired, classified, converted (to some minimally-sufficient complexity best format, which is to say, LaTeX 😺 )), privacy scopes and controls, and relations between works (citations, references, translations, authors, concepts, projects, ...).

Mind, this is all but entirely vapourware.

@FiXato

@thornAvery My own approaches are:

Find LITERALLY ANY FORMAT OTHER THAN PDF. HTML, text, ePub, etc., if possible.
Try pdftotext, part of Poppler utils: https://poppler.freedesktop.org/ This is available for most Linux distros, MacOS under Homebrew, or check out via Git.

If I can get something vaguely reasonable, that's usually sufficient.

OCR is an option. I've never had good luck with that, and there's such a tremendous amount of tedious correcting that retyping is frequently preferable. That said, I operate at fairly low scale.
Retype by hand. Since I'm usually reading the work, this actually turns out to be a pretty good reading method for content-retention.

PDF itself is a container around a bunch of other formats. Asking how to convert a PDF is a bit like asking how to cook a bag full of groceries. It really depends on what's in it, and what you're hoping to get.

#PDF #PDFConversion #kfc #docfs #webfs

@thornAvery I'm trying to find what I thought I remembered as an excellent HN comment discussing how to do this at scale.

It turns out to be really complicated.

That said, maybe tell us what it is you're trying to do, specifically:

How many documents.
How large.
What languages / character sets.
What budget (if any).
What end-use.

#webfs #docfs #kfc #PDFConversion #pdf

@thornAvery There's no such creature that will cover all cases. You may get lucky in many instances with easier options.

Your best bet is to find another form of the document that's closer to text. For many published documents there are good odds of this.

If the PDF is actually rendered from a text source, pdftotext is pretty good at extracting the actual text.
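A minimal sketch of wrapping pdftotext from a script, assuming Poppler's pdftotext is on the PATH. The -layout option and the "-" output argument (write to stdout) are standard pdftotext usage; the function names and the 200-character threshold for deciding whether anything useful came out are hypothetical choices.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def pdftotext_command(pdf: Path, layout: bool = False) -> List[str]:
    """Build the pdftotext command line; "-" sends text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the page's physical layout
    cmd += [str(pdf), "-"]
    return cmd


def extract_text(pdf: Path, layout: bool = False) -> Optional[str]:
    """Return extracted text, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(pdftotext_command(pdf, layout),
                            capture_output=True, text=True, check=False)
    return result.stdout if result.returncode == 0 else None


def has_usable_text(text: str, min_chars: int = 200) -> bool:
    """Crude heuristic (an assumption): enough non-blank characters came out.

    A scanned, image-only PDF typically yields little or nothing here.
    """
    return sum(1 for ch in text if not ch.isspace()) >= min_chars
```

If has_usable_text comes back false, the PDF is probably page images, and the choice is back to OCR or retyping.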

If it's not ... you're left with a much more challenging job. I find with rather startling frequency that simply re-typing the document from scratch is often the best option.

#pdf #PDFConversion #kfc #docfs #webfs

The US Federal Government probably produces more documents than any other entity on Earth.

Adelaide Hasse (1868--1953) is the public-schooled, self-taught OG BAMF who created the indexing and classification system which still organises that to this day, the Superintendent of Documents Classification System (SuDoc).

https://en.wikipedia.org/wiki/Adelaide_Hasse

#AdelaideHasse #SuDoc #LibraryClassification #DocumentManagement #kfc #docfs #webfs #libraries

#docfs
