What is HELIX?
HELIX is a horizontal eInfrastructure for data-intensive research, handling the data management, analysis, sharing, and reuse needs of Greek scientists, researchers and innovators in a cross-disciplinary, scalable, and low-cost manner.
HELIX provides its services as an autonomous eInfrastructure in support of data sharing, open access publishing, and data experimentation. However, it is also a building block for other scientific infrastructures, providing scalable data processing services for very large and heterogeneous scientific data.
As such, it achieves economies of scale, network effects and the fastest possible return on investment for our national research spending.
Who is HELIX for?
HELIX is for any scientist, researcher, student, engineer, or organization seeking scientific data, open publications, and low-cost services for data-intensive research.
What can I do with HELIX?
- Discover, download, and use data contributed by others in your own research
- Publish scientific data you collected or produced, and optionally link them with your academic publications or source code
- Search and download Open Access scientific publications
- Process, analyze, and visualize data without the need to download them or install software
- Teach core academic and professional Data Science skills
What is a Scientific Cloud?
Scientific research and industrial innovation demand ever larger computing and networking resources, which are extremely costly to purchase and maintain. Supercomputers, vast amounts of storage, and blazing-fast interconnectivity are essential infrastructure for competing at an international level. The financial investments required are significant, and so is the need to maximize their utilization and impact.
Think of it like purchasing a new house and only staying in it for a few days each year. Of course it serves a perfectly good purpose, but perhaps it would be best if we rented instead?
This is the idea behind cloud computing, a relatively new technology which allows us to aggregate demand, pool resources, and provide a cost-effective service that scales according to our needs. When we do not use the available resources, someone else does, so we can maximize utilization, impact, and return on investment. This is considered the norm for the industry, with multiple low-cost vendors and technologies available. In fact, it is almost certain that your favorite website, marketplace, or mobile application is served by a cloud computing infrastructure.
A Scientific Cloud follows exactly the same paradigm, pooling expensive hardware resources together, and providing a cost-effective and high-end service in support of scientific research.
In Greece, we have had our own Science Cloud for almost a decade now! Okeanos and, more recently, Knossos are cloud computing services developed and provided by GRNET to all members of the Greek academic and research community.
Scientific data: why are they important?
Data is the foundation of the scientific method: observe, hypothesize, experiment. We collect data when observing a physical phenomenon, propose a theory that interprets it, and then experiment by collecting more data that validate or disprove our hypothesis.
Understandably, managing, discovering, and sharing data has always been a critical challenge for scientists. Why? Here are a few examples:
- If my scientific data are not available, then it is almost impossible for someone else to validate my work, and most importantly, to extend it!
- If I cannot find any scientific data for a problem I am working on, then I will not be able to perform my research.
- Scientific data are all too often extremely expensive and resource-intensive to produce; sometimes even unique. Think about satellite images or soil samples from the moon.
- Scientific research is all too often publicly funded. Why would the output of this work be hidden away from view and not shared with society?
HELIX helps scientists discover, manage, share and use data, increasing their value and facilitating scientific research.
Data Science? Is this something new?
Data Science is not a new term, nor a trend, but a realization of the increasing importance data has for science, technology, innovation, and our economy as a whole. We live in the Big Data era, in which the data we produce and use in our everyday lives grow exponentially every day. Regardless of our scientific field, or business activity, we must manage, process, analyze, and interpret extremely large volumes of data in a fraction of the time.
This is what we call the Data Economy: the vision for the EU’s and our national economic advancement, built on novel, value-added, exciting ways to extract hidden value from data.
In this landscape, we need scientists and professionals proficient in all aspects of the data lifecycle, such as databases, data cleaning, software development, statistics, data mining, and machine learning. Data Science is this exciting mix of skills, experiences, and technical capacities needed to succeed in science and the industry today. It is a necessary competence, a potential competitive advantage, and an investment for our economic growth.
HELIX is the first disciplined national effort to facilitate Data Science in Greece, aiming to train, support and grow a generation of skilled professionals advancing science and economic growth.
Why is it named HELIX?
Our inspiration was the DNA double helix, in which two strands of nucleic acid are held together by nucleotides that pair with each other. This sequence and intermingling is not random, but defines the foundations of life. HELIX is founded on a triple helix of scientific practices, tightly interwoven to produce knowledge and innovation: Data, Publications, Digital Laboratories.
Data is the foundation of the scientific method: observe, hypothesize, experiment. When we observe we collect data, propose a theory that can explain and reproduce it, and subsequently collect more data to validate or disprove our hypothesis.
A Publication is the formal communication of scientific advances to peers and society as a whole. Each and every publication contributes to building our collective scientific knowledge in all scientific fields.
Laboratories are where science happens! Scientific instruments, experiments, observations, data gathering and analysis: it all happens in a scientific laboratory. Laboratories can be actual physical locations, or entirely virtual and powered by digital technologies.
So combine everything together, and you have a triple HELIX!
Who is behind HELIX?
HELIX is a collective effort of Athena RC and GRNET S.A., two research and technology bodies that lead the design and development of scientific infrastructures at the EU and national level.
Athena RC is a Research Center focusing on cross-disciplinary and transformative data-intensive research. It champions the vision of the Data Economy, leads Open Access policies for research, nurtures the Open Data landscape, and supports innovators for our economic growth. Athena RC brings its experience in data catalogues and repositories, scientific data infrastructures, Big Data, and Data Science, to develop the software that powers HELIX.
GRNET is a state-owned private company with a single goal: to empower our Universities and Research Centers with the best possible hardware and networking infrastructures. GRNET provides the essential foundations for modern science, from super-fast connectivity, to cloud services, and supercomputing.
Who is funding HELIX?
HELIX is currently funded by the EU’s European Structural and Investment Funds allocated through the ‘Competitiveness, Entrepreneurship, and Innovation’ Operational Program of the Partnership Agreement 2014-2020. In incubation since 2012, the project officially started in January 2018, with the goal of implementing Phase 1 of HELIX.
Phase 1 delivers what we would call the Minimum Viable Product (MVP) and intends to set up the basic technology and policy foundations, assemble the initial scientific user communities, educate about the necessity of Open Access, deliver a few lighthouse services to select scientific domains, and prepare the administrative and development priorities of Phases 2 and 3.
Estimated to start in January 2020, Phase 2 of HELIX will scale out its services and expand its reach across more scientific domains, directly powering more current and future eInfrastructures for all of their data management, sharing, and reuse needs.
Phase 3 will mark the full operation of HELIX as a sustainable infrastructure for data-intensive research and innovation, providing its services to scientists, researchers, and the industry at large.
HELIX will help materialize the vision of our national Data Economy, promoting scientific advancement and economic growth.
Why is HELIX in Phase 1? It already looks good!
Well thank you, we try to work fast and smart!
However, the truth is that we did not start from zero. HELIX reuses, extends, and integrates technologies we have been developing in other projects and activities for years! We bring added value by tapping into our expertise and prior work in this field, which ranges from EU-wide Open Access infrastructures to full-blown cloud infrastructures.
You have also probably noticed that several services and facilities are only available to select users and communities. Unfortunately, this is not (only) for testing purposes, but because we cannot support the entire Greek scientific community at this stage. From the available hardware infrastructures, to the underlying software tuning, to the full-time personnel needed to maintain a high level of service, we still have a long road ahead.
Me and HELIX! How can I use it?
HELIX aims to make it easier for you to discover, share, and use scientific data. This is only a summary of its services; see our guides and tutorials for more information, or join us at one of our open events.
- Discover data. You can search for any data that might interest you, learn more information about its lineage, see its licensing terms, and add it to your collection for later use. HELIX hosts data directly produced and shared by scientists, but also open data provided by public or private organizations in a single place.
- Use data. You can download any dataset you want and start using it immediately! The samples and visualizations can help you easily evaluate whether a dataset will cover your needs. HELIX also provides several advanced data services which you can use!
- Publish data. You can share any data you have produced, or you are using, or you simply find interesting with others, regardless of their size and type. You only need to upload the data along with some basic information (metadata) to help others discover and use them. You can also link your data with your publication, allowing others to easily discover both of them.
- Cite data. Any dataset you discover and use is assigned a permanent unique identifier, which you can use to cite it in your publications. This works the same way as typical publication identifiers do, such as DOIs. Using a permanent identifier for datasets means that they will always be accessible.
- Data services. Most of the data provided by HELIX are ingested, managed, and served by highly scalable cloud data engines, removing the need to download the data, address their size/complexity, integrate them in other applications, install and maintain computing infrastructures, etc. You can enjoy blazing-fast, highly scalable complex data processing and analysis through your browser. Give it a go!
- Jupyter (open beta). Anyone familiar with statistics, machine learning, and data wrangling in general is familiar with Jupyter notebooks. They allow you to easily experiment with data, train models, and collaborate with others over multiple languages, like Python or R, and even tap into GRNET’s HPC infrastructure. HELIX provides Jupyter as a hosted service, with streamlined access to published data, on highly scalable execution environments. No need to download anything, manage Big Data collections, or perform mundane admin tasks. You can work from anywhere, using even a small tablet.
- Zeppelin (closed beta). Working with Big Data processing environments, such as Apache Spark, is quite challenging. Setting up, scaling, and managing the underlying computing infrastructure requires sizeable time and resources. Apache Zeppelin brings the paradigm of Jupyter to Big Data, offering a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, and Python. HELIX allows you to tap into ready-to-use Big Data frameworks and data collections, allowing you to focus on your research challenge.
- Domain-specific. HELIX provides several other ways to discover, query, analyze, visualize, and integrate scientific data depending on their specific type or intended application. These facilities include charts, interactive maps, or even RESTful APIs for data processing. The type and number of these facilities are constantly increasing, as we will be developing services for high-impact and mature communities.
- Discover publications. You can search for Open Access publications published by Greek organizations and researchers across Europe! All publications have been harvested from OpenAIRE, EU’s Open Access initiative in which we actively participate and contribute. If your organization already operates an institutional repository indexed by OpenAIRE, then all of your publications are also available through HELIX.
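As a rough illustration of the "Cite data" item above, the following Python sketch builds a citation string from a dataset's basic metadata and its permanent identifier. The field names, the example record, and the identifier value are all hypothetical placeholders; the actual HELIX metadata schema and identifier format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    # Hypothetical fields for illustration; not the HELIX metadata schema.
    creators: List[str]
    year: int
    title: str
    publisher: str
    identifier: str  # permanent identifier, e.g. a DOI-style string


def format_citation(record: DatasetRecord) -> str:
    """Build a citation in the common 'Creators (Year). Title. Publisher. Identifier' pattern."""
    authors = "; ".join(record.creators)
    return (
        f"{authors} ({record.year}). {record.title}. "
        f"{record.publisher}. {record.identifier}"
    )


# Made-up example record; the identifier is a placeholder, not a real DOI.
example = DatasetRecord(
    creators=["Researcher, A.", "Researcher, B."],
    year=2019,
    title="Example soil moisture measurements",
    publisher="HELIX",
    identifier="doi:10.0000/example-dataset",
)

print(format_citation(example))
```

Because the identifier is permanent, a citation built this way keeps pointing to the same dataset even if the dataset's web address changes.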
My organization and HELIX! How can we join?
HELIX can serve as your institutional scientific data repository or scientific infrastructure, helping you increase the productivity and outreach of your scientific output. There are many ways you can join us, both leveraging your existing infrastructures and minimizing your future investments in data-intensive infrastructures.
- Link your data catalogue. If you already have a scientific or an open data catalogue in place, HELIX can harvest its contents and make your data available to the tens of thousands of scientists and researchers that visit HELIX daily. HELIX can help you increase the visibility of your research output, facilitate sharing with external research teams, and allow more users to discover and use your data. HELIX can harvest data from practically any data catalogue and repository, due to its open-standards policy.
- Beyond that, you can opt in to additional services that increase the value of your data. HELIX can maintain copies of your data in its own repository, lowering the burden on your computing infrastructures. Moreover, HELIX can ingest your data into its own data engines, offering them via its data services and APIs to the scientific community at large.
- At any point in time you can dictate your integration level with HELIX, change it, evaluate the potential savings and impact for your organization, and follow a streamlined transition pathway for migrating your existing services to HELIX.
- Data repository. HELIX provides you with a free, highly-scalable, FAIR and Open Access-compliant scientific data repository, covering all data publishing and curation needs of your organization.
- Your data catalogue is available under a sub-domain of your choosing, with your data also being discoverable through the main HELIX catalogue.
- Your scientists, researchers, and students are automatically identified and authorized by HELIX as members of the national scientific community, and have immediate access to its services.
- You can appoint one or more administrators responsible for managing your users and implementing your Data Management Plan (DMP). By the way, if you don’t have a DMP in place, we can help you set one up!
- Your scientists can start publishing immediately! For any questions, they can contact our Helpdesk, consult our how-to guides, or participate in our training courses.
- Labs. The power of Jupyter notebooks as a learning resource and scientific instrument is in your hands. All your members, from researchers to undergraduate students, have tiered access to our Labs section, allowing them to analyze, test, and experiment with data within seconds.
- Your scientists can tap into the large-scale computing infrastructures of GRNET, ranging from Apache Spark clusters to HPC. Their experimental data can be uploaded on demand, or be managed by HELIX directly, minimizing the effort and time spent preparing them. Collaborating with other individuals and research teams on the same data and notebooks is available out of the box, while you retain full control over who has access to your data, when, and why.
- Jupyter notebooks are powerful learning instruments for statistics, machine learning, data management, and Data Science in general. With HELIX, you can provide your postgraduate and undergraduate students with bundled datasets and assignments, even organizing full courses. You only need to provide the data (optionally also making them publicly available), notebook templates (optionally, your students can start from a built-in template), and an analysis goal. The notebooks are safely stored, submitted, and tracked, allowing you to judge your students’ progress and accomplishments.
- Project-specific data infrastructure. In case you need to comply with specific Data Management Policies for individual research projects, but are not ready to introduce all your members to HELIX, this option provides the best of both worlds. HELIX can provide you with a project-specific data catalogue, repository and infrastructure, implementing your required Data Management Policy and ensuring secure access only to members of your project. All data produced, provided, and evaluated are managed under your full control, allowing you to selectively publish them or keep them private.
- Software as a Service. The complete HELIX infrastructure is available as a Service, allowing you to provide a publications and data repository to your members, along with services for experimenting and using data, without the need to purchase and deploy a new infrastructure. It is the most cost-effective and low-maintenance pathway for supporting data-intensive research at your organization. We host the system under your own domain, handle the day-to-day administration tasks, and provide support to your members.
My eInfrastructure and HELIX! How can we join?
HELIX has been designed and developed as a horizontal foundation for other eInfrastructures, offering cost-effective, highly-scalable, and secure access to data management and processing services. In this context, HELIX is a building block of domain-specific infrastructures, addressing their data-intensive requirements, harnessing network effects, and maximizing national investments in research.
Currently, in Phase 1, we provide support to select eInfrastructures, as we are still developing HELIX and operate under constrained resources. You can expect more infrastructures to tap into HELIX in the upcoming years. Just look for the ‘Powered by HELIX’ logo!
What kind of data should I upload?
Anything you have already used, or you are currently using, or you might be using in your own research or in teaching courses. In short, everything you, or someone else, might find interesting for scientific or training purposes!
Our policy is much less stringent compared to typical academic publishing, in which only the final manuscript is considered as the sole output of your work that should be shared with others. In contrast, we are much more influenced by the open data movement and the realization that we cannot know a priori how, where, when, and if our data will be relevant for someone else’s research.
This does not mean we condone ‘data hoarding’ and inconsiderate publishing; do not forget that all published data consume computing resources.
Let’s summarize then:
- Publish any data you produce or consume before, during, or after your research.
- Data of all kinds, types, and sizes are welcome. We love data!
- Do not assume that your data will only be useful for scientists in your own field, or for scientists that are familiar with your work.
- When we say data, we mean data. Not PDFs, or HTML pages.
- Do not worry if your data have errors, or are incomplete. No dataset is ever perfect; also, sometimes the errors themselves are topics of interest for researchers.
- Give your data a title and description that anyone can understand. Try to be thorough and imagine that someone years from now, in a far-away place, might be interested in using your data: help them!
Why do you treat certain data types differently?
As you might have seen, we provide additional facilities for select data types or thematic categories. The list of these ‘special’ data types is constantly expanding in an effort to address the needs of large user communities or use cases. For example, tabular data are visualized in table format or through simple charts; raster images are presented in interactive maps; geospatial data support complex ISO metadata.
If you have a specific requirement or idea for the data you are working with, please get in touch with us!
What if someone else uploads the same dataset?
Don’t worry about it: the data will be handled as different entities, with completely different identifiers. It is always a good idea to check whether a dataset already exists before uploading it, but there are cases where duplication cannot be avoided.
For example, you might have downloaded an open dataset from a crowdsourcing platform, removed duplicates from it, used it in your publication, and published it for others to use. Another researcher might have downloaded the same dataset from its source, but performed a different cleaning process, resulting in a slightly different dataset. In this case there is no ‘right’ or ‘wrong’ dataset; both of them are publishable!
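To make this example concrete, here is a small Python sketch of two cleaning procedures applied to the same source rows. The rows and both cleaning rules are invented for illustration. The checksum is only a convenient way to show that the two results differ; it is not a description of how HELIX assigns identifiers.

```python
import hashlib

# Made-up source rows standing in for a crowdsourced dataset.
source_rows = [
    ("Athens", "37.98", "23.73"),
    ("athens", "37.98", "23.73"),
    ("Patras", "38.25", "21.73"),
    ("Patras", "38.25", "21.73"),
]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, kept = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept


def drop_case_insensitive_duplicates(rows):
    """Treat names that differ only in letter case as the same place."""
    seen, kept = set(), []
    for name, lat, lon in rows:
        key = (name.lower(), lat, lon)
        if key not in seen:
            seen.add(key)
            kept.append((name, lat, lon))
    return kept


def checksum(rows):
    """SHA-256 over a simple comma-separated serialization of the rows."""
    text = "\n".join(",".join(row) for row in rows)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


cleaned_a = drop_exact_duplicates(source_rows)
cleaned_b = drop_case_insensitive_duplicates(source_rows)

# Same source, different cleaning rules, different resulting datasets.
print(len(cleaned_a), checksum(cleaned_a)[:12])
print(len(cleaned_b), checksum(cleaned_b)[:12])
```

The first procedure keeps three rows and the second keeps two, so the two published datasets are genuinely different even though they share a source.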
Are my data safe and secure?
Yes, and yes. By default, all data provided to HELIX are licensed as open data and available to all. As we all know however, there are lots of cases where a dataset must remain private and be available only to a limited number of collaborators. Perhaps you have a strict NDA with a data provider, you are handling sensitive data, or must enforce GDPR for a specific project.
Whatever the reason, you are in absolute control over who has access to your data, when, and why. For more information, please get in touch.
Can you support my Data Management Plan?
Yes. Data Management Plans are essential these days for researchers, and in certain cases a contractual obligation with funding authorities. Regardless of why and where you need to follow a Data Management Plan, HELIX can help you implement it. We can also provide you with ready-to-use Data Management Plans; just get in touch.
What is FAIR data?
FAIR stands for Findable, Accessible, Interoperable, and Reusable, and is a set of simple principles for scientific data publishing. In summary, it ensures your data can be easily used and reused by others, increasing the value of your work, and allowing others to build on it.
As long as your data are published in HELIX, they will be FAIR to all!
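As a very simplified sketch of what the four principles can mean for a single dataset, the Python snippet below checks a metadata record for a few fields associated with each principle. The mapping from principles to fields, the field names, and the example record are assumptions made for illustration; they are neither the official FAIR criteria nor the HELIX metadata schema.

```python
# Simplified, hypothetical mapping from each FAIR principle to metadata fields.
FAIR_FIELDS = {
    "Findable": ["identifier", "title", "description"],
    "Accessible": ["access_url"],
    "Interoperable": ["format"],
    "Reusable": ["license", "provenance"],
}


def missing_fair_fields(metadata):
    """Return, per principle, the fields that are absent or empty."""
    gaps = {}
    for principle, fields in FAIR_FIELDS.items():
        missing = [field for field in fields if not metadata.get(field)]
        if missing:
            gaps[principle] = missing
    return gaps


# Made-up record: the license is empty and provenance is not given at all.
record = {
    "identifier": "doi:10.0000/example-dataset",  # placeholder, not a real DOI
    "title": "Example soil moisture measurements",
    "description": "Hourly readings from three hypothetical stations.",
    "access_url": "https://example.org/datasets/42",
    "format": "text/csv",
    "license": "",
}

print(missing_fair_fields(record))
```

For this record the check reports gaps only under "Reusable", since a clear license and provenance information are what let others reuse the data with confidence.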
Do you offer training services?
Yes, but at the moment exclusively on a case-by-case basis.
We are mostly interested in supporting undergraduate or postgraduate courses in Data Science, Statistics, Machine Learning or Data Analysis/Mining. We can organize your digital laboratory, integrate your data, organize your course, and provide hands-on training.
If you are interested, or have another idea, just contact us.
Do you offer commercial services for Data Science?
No, not at the moment, but it is on our roadmap for Phase 2.
You can expect a similar user experience, guaranteed service provision, advanced functionalities, and much higher user quota.
For more information, feel free to contact us.
Hey, I have a few ideas!
That’s great, we are always open to suggestions. Why don’t you drop us an email, or come meet us?
Can I help you in some way?
Share your data, spread the word, and let us know how we can improve HELIX.
If you want to get hands-on, you can contribute with source code, training material, or documentation.
If you want to work with us, drop us a line.