Frequently (and Not-So-Frequently) Asked Questions

Who are you and what is this?
I'm Alan Buxton and this is my open-ish data project.

What sort of open-ish data?
Syracuse takes raw data from various text documents (currently news articles, but it could be any text document) and uses machine learning to structure the relevant content into linkages and timelines. You can see the latest sources on the Stats page. Currently it covers only English-language sources.

Why would I use it?
Two main purposes: [a] getting alerts about company events and [b] researching company activities in a region. In this context a 'company event' is something like a new senior hire, or launching in a region, or taking on investment. Both these use cases could be useful in Know Your Customer / Know Your Supplier processes.

Why Syracuse?
I called my first natural language processing application Napoli because it includes all the consonants in NLP (the abbreviation for natural language processing). Since then I've given my other projects similarly-themed names related to Ancient Greek Mediterranean coastal towns and cities.

How does this offering differ from <any other business news aggregator>?
The key difference is that this is highly automated with machine learning together with some rules and heuristics. Historically, the tech for doing this sort of thing suffered from what we used to call the Bloomberg problem. Briefly, there is plenty of tech available that will tell you what company names a document contains, but figuring out what these companies are doing with each other is a lot harder to do automatically. A lot of articles mention Bloomberg, but only a small proportion of them are about Bloomberg the company. The providers out there who are doing this sort of thing rely heavily on human analysts to work around the Bloomberg problem. Doing this via machine, which 1145 does, makes it viable to offer it for free as open data.

What's 1145?
This was a domain I bought a long time ago for a company idea that never got off the ground. A lot of programmers collect domain names for side projects and we rarely, if ever, part with them. I like the idea of technology making your life easier so that you get your day's work done by lunchtime. So 1145.am is the domain being used to host this application and the umbrella term for all the associated components that feed into the Syracuse application which you're browsing right now.

There are other components to this?
Yep: Heraklion scrapes data, Alexandria classifies it, Massalia is used to label example texts for machine learning and Corinth takes the Massalia data and adds further synthetic data for machine learning training. Neapolis uses machine learning to extract relevant information from the documents and structures them into an RDF format for Syracuse to ingest and display.

Is there any generative AI in here?
One part of the 1145 system (Neapolis) uses an out-of-the-box Flan-T5 model to help extract some meaning from text, but the bulk of the machine learning is a fine-tuned RoBERTa model used for classification and then one fine-tuned RoBERTa model per type of activity to do the named entity extraction. Very little GenAI, but plenty of large language models and machine learning.
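
For the curious, the "classify first, then hand over to a per-activity extractor" shape described above can be sketched roughly as below. This is only an illustration of the routing idea, not the Neapolis code (which is not open source): the keyword rules stand in for the fine-tuned models, and the function names and activity labels are all hypothetical.

```python
from typing import Callable, Dict, Optional


def classify(text: str) -> str:
    """Stand-in for the classification model: simple keyword rules, not ML."""
    lowered = text.lower()
    if "appoint" in lowered or "hire" in lowered:
        return "senior_hire"  # hypothetical label
    if "invest" in lowered or "funding" in lowered:
        return "investment"  # hypothetical label
    return "other"


def extract_hire(text: str) -> Dict[str, str]:
    """Stand-in for the extractor dedicated to senior hires."""
    return {"activity": "senior_hire", "source_text": text}


def extract_investment(text: str) -> Dict[str, str]:
    """Stand-in for the extractor dedicated to investments."""
    return {"activity": "investment", "source_text": text}


# One extractor per type of activity, looked up by the classifier's label.
EXTRACTORS: Dict[str, Callable[[str], Dict[str, str]]] = {
    "senior_hire": extract_hire,
    "investment": extract_investment,
}


def process(text: str) -> Optional[Dict[str, str]]:
    """Classify the text, then route it to the matching extractor (if any)."""
    extractor = EXTRACTORS.get(classify(text))
    if extractor is None:
        return None  # nothing relevant found, so nothing is extracted
    return extractor(text)
```

In this sketch, a text that matches no activity is dropped rather than forced through an extractor, and each result carries its source text along with it.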

What about the New York Times's lawsuit against OpenAI - if OpenAI's work turns out to be illegal then doesn't that risk destroying this project?
I'm not a lawyer but my common-sense view is that 1145 is more similar to a search engine than to a generative AI service. 1145 helps you find relevant data related to a company or region that you're interested in with full provenance back to the original data source. It's not creating things that may or may not breach someone's intellectual property rights. All the scraped data in it is scraped responsibly.

What's the license?
I am licensing the data that you access via the website under the Open Database License. This is the same share-alike license that OpenCorporates uses. If you're familiar with OpenCorporates then the model is the same here. I'm also open-sourcing the Syracuse codebase with an MIT license. When I was learning these technologies there weren't many resources available to help with Django Rest Framework, Neo4j and RDF, so I'm open-sourcing the codebase in the hope that others can learn from what I've learned. And also that others out there can point out problems in my code that I can learn from and improve.

This fails condition 1.2 of the Open Definition so how can you represent it as open data?
Bored. Next. ..... In all seriousness, this is why I am calling it open-ish data. The important point is that anyone can browse this data for free and I'm more than happy to give the underlying data to people working in the public benefit either via API or via a data dump. Please send a mail if you'd like to discuss.

What about if the share-alike license isn't for me?
Drop me a line and we can discuss. I'm more than happy to charge people for API access or bulk data in order to fund the future development of this project. Very similar to the OpenCorporates approach which I admire greatly.

Is it all open source?
Nope. Just the Syracuse piece is open source. The other parts of the 1145 ecosystem that are running behind the scenes are top secret proprietary intellectual property :)

Does it suffer from hallucinations?
Not really, because any generative element is run within very tight guardrails. But there could be other errors creeping in that aren't hallucinations, which is why there's a feedback form that you are free to use if you spot anything that looks wrong.

What sort of accuracy does it have?
No ML system is going to be 100% correct all the time. Accuracy is usually measured by looking at False Positives and False Negatives. A False Positive is when the system says that something happened when it didn't happen. A False Negative is missing something that we would have liked the system to spot. 1145 leans a bit more towards minimising false positives. This does increase the risk of false negatives. But 1145 is looking at multiple data sources so even if it misses a topic from one source, we should expect to find it in another one. It seems a reasonable position to take in the accuracy balancing act, but I'm more than happy to hear feedback from anyone using this.
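
To make the balancing act a bit more concrete, here is a minimal Python sketch of how false positives and false negatives turn into the usual precision and recall figures, and why several sources help with missed events. The counts are made-up illustrative numbers, not measurements from 1145, the function names are hypothetical, and the multi-source calculation assumes each source is missed independently, which real news sources won't strictly satisfy.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Return (precision, recall) from raw counts.

    Precision drops as false positives rise; recall drops as false negatives rise.
    """
    predicted = true_pos + false_pos  # everything the system reported
    actual = true_pos + false_neg     # everything that really happened
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def chance_of_missing_everywhere(miss_rate: float, n_sources: int) -> float:
    """Chance an event is missed in every source, assuming independent misses."""
    return miss_rate ** n_sources


# Illustrative numbers only: a cautious setup versus a more permissive one.
cautious = precision_recall(true_pos=80, false_pos=5, false_neg=20)
permissive = precision_recall(true_pos=95, false_pos=30, false_neg=5)

# With a 20% miss rate per source, three independent sources miss far less often.
missed_by_all_three = chance_of_missing_everywhere(0.2, 3)
```

The cautious setup reports fewer wrong events but misses more real ones, which is the direction 1145 leans in; covering the same event in several sources is what claws back some of those misses.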

What about Cookies?
This site does some high-level tracking with Google Analytics so I can get a sense of who is accessing the site and from where. It uses its own cookies for [a] authenticating your login and [b] internal security features (so-called 'csrf tokens'). If you never log in there will never be a cookie that can identify you at all.

Do you have a Privacy Policy?
The privacy policy is really simple. This site is run by me (see above). I have no way of knowing anything about specific anonymous people browsing the site. If you log in then I will have access, without seeking your prior consent, to whatever login data you provide plus which companies you are tracking. I cannot cross-reference this data with any Google Analytics usage patterns. Nor will I provide anything to do with your login data to anyone else for any kind of personal gain without your consent. If there are legal reasons that mean I have to hand over this data then I won't have much choice. Where I am legally permitted to do so, I will do my best to let you know if this happens.

How will you seek my consent or let me know?
If you've given your email address so you can get tracked organization updates then I will use this to notify you of anything that needs your consent. If you have not given your email address then I won't be able to contact you directly, so I'll flag important notices on this website to give you plenty of time to let me know of any concerns, but if I hear nothing then I will need to interpret that as you giving me consent. I hope this is all common sense and not contentious at all, but if I'm missing something major then please don't use the site, or by all means find another channel to reach out to me to help me understand the issue.

How frequently-asked are these questions?
Not at all, to be honest. This page is me writing down an imaginary conversation between me and someone who I'm trying to explain the website to.

Website looks a bit rubbish, mate.
Yep, I'm not going to disagree on that front. Front-end web design really is not my bag - just have a look at the quality of the favicon. But, to be honest, this 1995-era aesthetic seems to go pretty well with a site that's all about the data so I'm not too worried.

And it's slow.
You're browsing a version of this site that is running on one small cloud machine for demo/proof-of-concept purposes more than anything else, so please be a bit forgiving if it's not blazing fast. Thanks.



(c) by 1145, 2023-2024. Data licensed under the Open Database License. Please send an email if you need a non-share-alike license.